Dataset Request Accounting Tool

UCSD's dataset request accounting tool website is located at:
https://uaf-2.t2.ucsd.edu/datarequests/datareq.php

Purpose

The purpose of this tool is to keep track of information about datasets delivered via phedex, together with the account each dataset is charged to. The tool gets nearly all of this information from the phedex emails that are sent to data managers when a user requests a transfer or deletion of data, and from the update (confirmation) emails for both of these actions. The website displays disk usage, number of datasets, and number of files per account, which makes the account field useful for keeping track of disk space. The site also displays a table of all datasets processed.

Overview

The emails are processed by a perl script which parses them, extracts the pertinent information, and puts it into a mySQL database created for this purpose. The database is then read by a php script whose output is accessible at:
https://uaf-2.t2.ucsd.edu/datarequests/datareq.php

This website allows anyone to view the dataset accounting information. It also allows certain users to edit that information. (See 'Website' below.)

If you would like any of these scripts/files for use at your T2, please email Warren Andrews: wandrews -at- physics -dot- ucsd -dot- edu. Please send questions related to the scripts described below and how they work to Warren. Please direct questions related to installation of php, mySQL, or perl, and network or T2 administration to Terrence Martin: tmartin -at- physics -dot- ucsd -dot- edu.

For information on updates to the scripts, see 'Updates' below.

Perl script for phedex email processing

We have created an email account which is listed as a phedex data manager and therefore automatically receives the phedex transfer request, deletion request, and update emails. When any email arrives in this account, a perl script is run automatically which parses the email for the information that will be stored (see 'Email parsing' below).

Running script automatically

The following information is provided by Terrence Martin (tmartin -at- physics -dot- ucsd -dot- edu).

In order to have the perl script 'datareq.pl' automatically run on incoming email, use the following instructions:

  1. Install sendmail, and make sure it can send email
  2. Edit /etc/aliases (RHEL and sendmail), add the following line, and then run 'newaliases' to rebuild the alias database:
datareq: "|/etc/smrsh/datareq.pl"
  3. Place the scripts 'datareq.pl' and 'aSearchCLI.py' (see below) in /etc/smrsh
  4. Restart the server

The email address will be: datareq@hostname.domain. Note that you do not need to create an account with this name as it is an alias.
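For reference, the mechanics of a pipe alias are simple: sendmail writes the full message (headers plus body) to the script's stdin. The following is a minimal perl sketch of that mechanism, not the actual contents of 'datareq.pl':

  #!/usr/bin/perl
  use strict;
  use warnings;

  # sendmail pipes the entire email to stdin, one line at a time
  my @mail = <STDIN>;

  # the Subject: header determines the type of phedex email (see 'Email parsing' below)
  my ($subject) = grep { /^Subject:/i } @mail;

  # ... parse the message and store the results, as datareq.pl does ...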

Email parsing

The perl script 'datareq.pl' runs via the pipe described above. It parses the email, gathering the following information, which varies by type of email:

  • Transfer or Deletion request emails:
    • Name and email of person who made the request
    • Name of all datasets requested
    • The account (see below)
    • Size of dataset if there is only one (see below)
    • Comments made by requestor
    • The request number (needed to match to update)
  • Update emails:
    • Name of person who acted on the request (see below)
    • The request number
    • Whether or not the request was approved
    • Account (see below)

The script distinguishes between these different types of email via the words 'Transfer', 'Update', or 'Deletion' in the subject line of the emails. Also, note that currently update emails do not distinguish between transfer or deletion updates, so the script matches updates via request number, which should be unique.
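The exact patterns used by 'datareq.pl' are not reproduced here, but a minimal sketch of this kind of subject-line dispatch, reusing the $subject and @mail variables from the sketch above and with hypothetical handler names, might look like:

  # dispatch on the phedex subject line; the handler names are placeholders
  if    ($subject =~ /Transfer/) { handle_request('transfer', \@mail); }
  elsif ($subject =~ /Deletion/) { handle_request('deletion', \@mail); }
  elsif ($subject =~ /Update/)   { handle_update(\@mail); }  # matched by request number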

Because the script is run automatically by the mail system, it has no stdout to write to. Instead, all information is logged using the perl Sys::Syslog module. This information includes how many datasets were processed, whether names were added (see 'Users Table' below), and any errors. It is currently set to log to the 'local4' facility, but this can be changed at the top of the script.
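For example, logging to 'local4' with Sys::Syslog looks like the following (the message strings here are illustrative, not the script's actual log lines):

  use Sys::Syslog;

  # 'local4' matches the facility configured at the top of datareq.pl
  openlog('datareq', 'pid', 'local4');
  syslog('info', 'processed %d datasets', $ndatasets);
  syslog('err',  'error: %s', $message);
  closelog();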

Account information

The account is a new feature of this software that, to my knowledge, does not exist in phedex. It therefore has to be added manually by either the person requesting the data or the person approving it (the latter replacing the former if both set one). This is done via the comment field on the phedex website. In the comment field, insert the following text: "|| account = name ||". You must include the double bars (pipes) on either side and the 'account =', but spaces are not important. Please do not include double quotes (") anywhere in the comment, since I'm using the double quotes that the phedex email automatically puts around the comment as a parsing identifier.

You may include anything else outside the double bars; this will be entered in the database under the comment field. The account name can be anything you want it to be, and it will be stored in the database under 'account' for later retrieval.
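In perl, extracting the account from the quoted comment takes a single pattern match; a sketch, assuming $comment already holds the text between the quotes:

  # pull 'name' out of '|| account = name ||'; surrounding whitespace is ignored
  my $account;
  if ($comment =~ /\|\|\s*account\s*=\s*(.*?)\s*\|\|/i) {
      $account = $1;   # stored in the 'account' column
  }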

Size of dataset information

The phedex request emails list the names of all the datasets requested, but list only one size: the total. This is insufficient for our purposes since we would like to know the individual size of each dataset, and store each as an entry in the mySQL database.

Our solution to this problem is to query the CERN DBS for each dataset if there is more than one in the transfer. The query is done via the python script 'aSearchCLI.py', which uses the python API for DBS. The perl script needs the full path to this script, set in the '$asearchpath' variable at the top of the script. The default is /etc/smrsh.
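A sketch of what such a per-dataset lookup might look like from the perl side; the query syntax and command-line option shown here are assumptions for illustration, not the actual interface of aSearchCLI.py:

  # hypothetical DBS lookup for one dataset's size and file count
  my $query  = "find sum(block.size), sum(block.numfiles) where dataset = $dataset";
  my $output = `python $asearchpath/aSearchCLI.py --input="$query"`;
  # ... parse $output for the size (stored in TB) and the file count ...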

Request approval persons

The script automatically keeps track of who has permission to approve/deny requests by reading the update email. When an update arrives, the person who acted on the request is marked as an approval person in the database (see 'Users Table' below).

mySQL database

A mySQL database is necessary for storing the information gathered by the script described above. The name, location, login, and password for the database will have to be entered at the beginning of the php and perl scripts. The default login username is 'datareq', and the default location is 'localhost'.
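For example, connecting with the defaults above from perl via DBI ($password stands in for whatever was chosen when the database was set up):

  use DBI;

  # database 'datarequests' on localhost, user 'datareq' (the defaults)
  my $dbh = DBI->connect('DBI:mysql:database=datarequests;host=localhost',
                         'datareq', $password, { RaiseError => 1 });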

The database currently uses the following structure:

Name of database: datarequests. Tables: requests, users.

The requests table has all of the datasets in it, and the users table has all of the people who requested data and those who acted on those requests.

Request Table

The request table has the following columns. The mySQL datatype is given, and a description where appropriate:

  • idx : bigint(20) : a unique index for each record
  • requestor : bigint(20) : the person who approved/denied the request, as an index referring to the users table (see 'Users Table' below)
  • requesty : bigint(20) : the person who requested the dataset, also referring to the users table (see 'Users Table' below)
  • account : varchar(255) : see above
  • dataname : text
  • sizetb : float : size of dataset in TB
  • filecount : bigint(20)
  • active : tinyint(1) : See below
  • approved : tinyint(1) : See below
  • tra_req_comm : text : transfer request email comment
  • tra_app_comm : text : transfer approval email comment
  • del_req_comm : text : deletion request email comment
  • del_app_comm : text : deletion approval email comment
  • inserted : timestamp : when this entry was created, handled automatically by mysql
  • updated : timestamp : when this entry was last updated
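Reconstructed as DDL, the table might be created as follows; the primary key, AUTO_INCREMENT, and timestamp defaults are assumptions based on the descriptions above, not taken from the original schema:

  $dbh->do(<<'SQL');
  CREATE TABLE requests (
    idx          BIGINT(20) NOT NULL AUTO_INCREMENT,  -- unique record index (assumed auto-increment)
    requestor    BIGINT(20),
    requesty     BIGINT(20),
    account      VARCHAR(255),
    dataname     TEXT,
    sizetb       FLOAT,
    filecount    BIGINT(20),
    active       TINYINT(1),
    approved     TINYINT(1),
    tra_req_comm TEXT,
    tra_app_comm TEXT,
    del_req_comm TEXT,
    del_app_comm TEXT,
    inserted     TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated      TIMESTAMP,
    PRIMARY KEY (idx)
  )
  SQL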

active/approved

Both the 'active' and 'approved' fields are booleans. When a transfer request comes in, the dataset is entered into the database, but it is marked inactive and unapproved (both are zero). A dataset is marked active and approved once its approval email has been processed and the request was approved. If the request was denied, both active and approved remain zero. Note that the dataset will not have actually arrived at your T2 yet when it is marked active.

A dataset is considered deleted when the deletion request is approved. When this occurs, the active status is set to zero, but it remains approved. This lets you tell whether a dataset is approved and still on disk (active), approved but deleted, or denied transfer.
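The three meaningful combinations of the two flags can be summarized with a small helper; this is purely illustrative, not a function from the actual scripts:

  # map the two booleans to a human-readable status
  sub dataset_status {
      my ($active, $approved) = @_;
      return 'approved, on disk' if $active and $approved;
      return 'approved, deleted' if $approved;    # approved but no longer active
      return 'pending or denied';                 # neither flag set
  }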

Users Table

The users table has the following columns. The mySQL datatype is given, and a description where appropriate:

  • users_idx : bigint(20) : a unique index for each record (see below)
  • name : varchar(255) : the actual name of a user
  • email : varchar(255) : the email of the user
  • requestor : tinyint(1) : boolean for whether or not this user can approve or deny requests (see below)
  • password : varchar(255) : the user's password, hashed with the mySQL SHA1() function (see 'Website' below)
  • insert_date: timestamp
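The corresponding DDL sketch, with the same caveats as for the requests table:

  $dbh->do(<<'SQL');
  CREATE TABLE users (
    users_idx   BIGINT(20) NOT NULL AUTO_INCREMENT,  -- unique record index (assumed auto-increment)
    name        VARCHAR(255),
    email       VARCHAR(255),
    requestor   TINYINT(1),
    password    VARCHAR(255),
    insert_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (users_idx)
  )
  SQL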

users_idx

The users_idx of a person's entry in this table is used as the 'requestor' and 'requesty' fields in the request table. For example, when a transfer request comes in, the name of the person who made the request is compared to the names in the users table. If the name is found, the users_idx corresponding to this name is inserted into the 'requesty' field of the dataset in the request table. If the name is not found, it is inserted into the users table, and the new users_idx is put in the 'requesty' field of the new entry in the request table.
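A sketch of this lookup-or-insert step in perl/DBI, reusing the $dbh handle from the connection example above (not the literal code from datareq.pl):

  # find the user by name, inserting a new non-requestor row if absent
  my ($idx) = $dbh->selectrow_array(
      'SELECT users_idx FROM users WHERE name = ?', undef, $name);
  unless (defined $idx) {
      $dbh->do('INSERT INTO users (name, email, requestor) VALUES (?, ?, 0)',
               undef, $name, $email);
      $idx = $dbh->last_insert_id(undef, undef, 'users', 'users_idx');
  }
  # $idx then goes into the 'requesty' column of the request entry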

requestor

The requestor field of the users table is a boolean which keeps track of whether or not this user can approve requests. If a user is inserted into the users table when a transfer request email arrives, the user's requestor field is set to zero. If the user is inserted during an approval email, the requestor field is set to one, because the user has acted on a request and therefore must have approval permission in phedex. If a user is inserted as a non-requestor but later makes an approval, the user's requestor field is then set to one.
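The promotion itself is a one-line update; again a sketch rather than the script's literal code:

  # promote a user to requestor once they have acted on a request
  $dbh->do('UPDATE users SET requestor = 1 WHERE users_idx = ?',
           undef, $idx);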

Website

UCSD's dataset request accounting tool website is located at:
https://uaf-2.t2.ucsd.edu/datarequests/datareq.php

The website is both a tool for viewing the dataset and account information and a tool for editing it. Editing is only enabled for users who have a password in the users table of the database and have successfully logged in.

The website is basically self-explanatory. The only features which require some explanation are editing and logging in.

Currently, users cannot request login privileges for the website. This is not considered a problem, since there will only be a couple of data managers per site, and these managers can manually set a password for themselves. In order to do this, place your password, hashed using the SHA1() SQL function, in the 'password' field under your name in the users table. Then, on the website, the user enters his/her name and password in the log-in box. The php script then checks the SHA1 hash generated from this password against the password entry in the users table. If they match, the user is logged in and can click the 'edit' buttons in the right-hand column of the dataset table on the website.
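For example, a data manager might set a password with a statement like the following (shown here through DBI; running the same UPDATE in the mysql command-line client works equally well):

  # store the SHA1 hash of a chosen password for one user
  $dbh->do('UPDATE users SET password = SHA1(?) WHERE name = ?',
           undef, $chosen_password, $your_name);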

The 'edit' button brings you to a page, 'editdata.php'. On this page, you are free to change most properties of the dataset for which you clicked the 'edit' button. If you click the 'update' button on this page, you will enter the changes you made into the database. If you click 'create new', you will create a new entry in the 'requests' table for the information displayed. If you click 'delete', the active status will be set to zero; the entry will not be removed from the database, because you may want a record of deleted (inactive) datasets. The 'cancel edit' button will bring you back to the main page with no changes to the database.

Updates

So far, the only significant update was made to the perl parsing and database entry script:

Version 1 includes the following new features:

  • Support for blocks: Previously, if a user requested only a block belonging to a particular dataset and not the entire dataset, the DBS query could not handle this case. Version 1 has the ability to find the size and number of files for a block by using the same script (aSearchCLI.py) but with different input.
  • Handling of the same dataset being requested multiple times: It may happen that different people request the same dataset at different times for legitimate reasons (commissioning, for instance), not just a mistake. If this occurs, the first instance of the dataset is marked deleted (inactive), and a new entry is made for the second instance.

Version 1 also includes some minor bug fixes, including fixes to email parsing problems and improved error handling.

No further major revisions are planned. However, if there is a need or desire, there may be more versions in the future.

As a final note, I would like to describe a feature which we have decided not to implement yet because of time constraints. We currently have no way of dealing with 'open' datasets and 'open' transfers. Open datasets are datasets whose size changes as data is inserted into DBS, and open transfers similarly are not fixed in size. This does not crash anything, but it makes size accounting inaccurate: the size the DBS query sees is the current size, and that may change for open datasets.

A workaround for this problem is a script that updates the size of all the datasets in the database. This is a simple script which I can send to you if you want.

-- WarrenAndrews - 24 Jul 2008
