Modifications to the NFS Lite installation to Support LIGO VO
Introduction
The following document is meant to provide a reference set of modifications to a local site's NFS Lite deployment so that it can support the LIGO VO. Due to differences between site implementations of NFS Lite and other core CE and WN components, this document should not be used as a howto. Instead, please use it as a reference on how to adapt your local jobmanager and wrapper script to support the LIGO VO under NFS Lite. In writing this document it became clear that supporting the capability required by LIGO is as much a site policy issue as a technical one. To reflect that, I have included details of how UCSD approached supporting the LIGO VO from both a technical and a site policy perspective.
Where possible, detailed code and script examples are provided. While these could be used as drop-in replacements, I recommend that each site admin examine the code provided to determine how best to integrate it into their individual site.
NOTE: As of this writing, the effectiveness of these modifications has been tested by UCSD using condor-g and DAGMan submissions. These tests were based on scripts sent to UCSD by LIGO.
UPDATE: LIGO has now successfully tested the creation of their directories in InitialDir as specified by Remote_InitialDir and has transferred 5.1GB of Wave files successfully to UCSD.
Included below are a standalone condor-g script and a series of DAG scripts used during testing. The DAG scripts are based on scripts sent to UCSD by LIGO.
Glossary
- wrapper: Refers to the USER_JOB_WRAPPER as specified by the condor configuration
- jobmanager: Refers to the condor.pm jobmanager script used on OSG Compute Elements to generate a condor submission script. Usually found in $VDT_LOCATION/globus/lib/perl/Globus/GRAM/JobManager/condor.pm
The Problem...Supporting Remote_InitialDir
When UCSD examined the problem of supporting the LIGO VO in NFS Lite installations, it was determined that the missing feature was the honoring of the condor grid parameter Remote_InitialDir. Examination of the Condor documentation indicated that this parameter is intended to allow the submitter to specify the initial working directory of the job on the worker node.
By setting Remote_InitialDir, the condor globus submission interacts with the CE jobmanager to set condor's InitialDir parameter in the job script submitted to the cluster. Unfortunately, InitialDir is only honored by condor for certain universes when the cluster is using a shared file system. NFS Lite explicitly disables shared disk mode in condor, which results in the local condor cluster ignoring the InitialDir parameter.
So while none of the jobmanager modifications made for NFS Lite explicitly disable support for the Remote_InitialDir parameter, NFS Lite does explicitly disable shared file systems, and therefore Remote_InitialDir is ignored.
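For reference, condor decides whether a shared file system exists from the FILESYSTEM_DOMAIN setting, and a common way a cluster ends up in non-shared mode is to give every machine its own domain. This is only a sketch of the idea; the exact settings used by your NFS Lite install may differ:

# Illustrative worker node condor configuration fragment (assumption:
# your NFS Lite deployment may use different settings). Giving each
# machine its own file system domain tells condor that no shared file
# system exists, which is why InitialDir is ignored for vanilla jobs.
FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)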
The Solution
The solution to the problem is fairly straightforward: modify the jobmanager and the job wrapper on the worker node to change the job's initial directory to the one specified by Remote_InitialDir. This parameter is available in the jobmanager as the return value of the $description->directory() method. This value then just needs to be passed to the worker node wrapper script so that the correct directory change can be made.
What Would Condor Do if the Wrapper Honored InitialDir?
Once it was determined that this solution should be effective, we examined potential concerns. The primary concern was that while Remote_InitialDir could be honored, we were not certain what implicit features were assumed about this directory other than that the job would start there. For example, the copying of files specified by the transfer_input_files parameter on the globus submitter: LIGO indicated that they did not think it would affect them if UCSD did not support copying the files specified by transfer_input_files into their starting directory. As it turned out, condor appears to copy the files to the InitialDir where the job wrapper starts the job, even though the job only arrives in that directory via the wrapper. The specifics behind what may be happening have not yet been investigated, but either behaviour should work for LIGO.
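For concreteness, the combination in question looks like the following fragment of a globus universe submit file (values are illustrative; the complete test scripts appear later in this document):

# Illustrative submit file fragment. The files listed in
# transfer_input_files appeared in the remote_initialdir directory
# even though condor was expected to ignore InitialDir without a
# shared file system.
transfer_input_files = /bin/hostname,/tmp/myinput.dat
remote_initialdir = /osgfs/data/someuser/workdir/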
Site Policy: Architectural and Performance Concerns and Restricting Remote_InitialDir Support
One of the primary goals of the NFS Lite approach to OSG sites is to eliminate a scalability issue that can severely degrade or disable the Compute Element of an OSG site: specifically, the use by submitted jobs of a shared NFS directory for file IO intensive activities. Even though the OSG has some features that assist users in moving the bulk of their IO off of the shared NFS, this does not eliminate file IO over NFS completely, e.g. excessive standard IO. UCSD also found, in its own operational experience and that of other site admins, that a site cannot always depend on the user starting their job in the best location as far as file IO is concerned. To address this, UCSD explicitly places the user's job in a local disk working directory that is dynamically created for each individual job, and eliminated the NFS mounts from worker node to CE. This configuration was called NFS Lite. Unfortunately this configuration turned out to cause problems for the LIGO VO, as they required the initial working directory for their job to be set on the submitter side.
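For illustration only, the placement described above can be sketched as a wrapper fragment along the following lines; condor already provides a per-job local scratch directory in $_CONDOR_SCRATCH_DIR, and the actual UCSD wrapper is considerably more complex:

#!/bin/sh
# Minimal sketch (not the actual UCSD wrapper): start every job in the
# per-job local scratch directory condor creates, keeping its file IO
# off the shared NFS.
cd "$_CONDOR_SCRATCH_DIR"
exec "$@"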
To balance the needs of LIGO, local site policy, and the fact that condor explicitly ignores InitialDir for non-shared file system clusters, UCSD has implemented support for Remote_InitialDir on a per-VO or per-user basis. This support is accomplished with modifications to the jobmanager and to the wrapper that starts user jobs on the worker nodes. UCSD uses a pattern match based on the jobmanager $logname variable to determine whether it should honor Remote_InitialDir for the current submitter. Given that condor does not support the InitialDir parameter for non-shared file systems as of this writing, UCSD feels the appropriate default policy is to ignore Remote_InitialDir but allow exceptions.
UCSD site policy will be updated to reflect this change.
Currently only LIGO is supported for Remote_InitialDir at UCSD.
Note: It may be possible to capture whether Remote_InitialDir was set by the submitter and, from there, assume that the VO knows what they are doing and honor the parameter. UCSD did not investigate this and opted for a VO identity pattern match approach.
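As a rough sketch of that untested alternative, the jobmanager could key off the directory value itself rather than the submitter identity. Note that GRAM may supply a default working directory even when the submitter did not set Remote_InitialDir, so the simple test below is an assumption, not something UCSD deployed:

# Hypothetical alternative (untested): honor the directory whenever
# the submitter appears to have set one, regardless of VO identity.
my $iwd = $description->directory();
if (defined $iwd && $iwd ne '') {
    $wrapper_arguments .= " -wrapper_iwd " . $iwd;
}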
Jobmanager Modifications
Some basic modifications to the condor.pm jobmanager are required for it to support Remote_InitialDir. The modifications include detecting that Remote_InitialDir should be honored for the submitting VO and then passing the required information to the USER_JOB_WRAPPER so that it can relocate to the directory specified by the submitter.
Passing Arguments to the Wrapper from the Job Manager
UCSD has a fairly complex perl based wrapper that executes a variety of pre-job steps. The wrapper has the ability to accept information via specially crafted command line arguments recognized by the wrapper. These arguments are removed from the job's command line prior to execution of the job itself so that they do not interfere with the job's own arguments.
When the jobmanager detects a submitter for which it should honor Remote_InitialDir, it sets the associated command line argument for the wrapper. $logname is always set to the user name running the jobmanager, which is mapped from GUMS or the grid-mapfile depending on the local site authentication configuration. You can be as simple or as complex as you want with the pattern match; the one below is pretty greedy and could be replaced with something less broad, and quicker.
These changes should occur before the actual submit script is produced by the jobmanager.
# Extract the submitter's login name from the job environment.
map {
    if ($_->[0] eq "LOGNAME") { $logname = $_->[1]; }
} @environment;

# Honor Remote_InitialDir only for submitters matching the LIGO
# pattern; everyone else is pinned to the per-job scratch directory.
if ($logname =~ /.*ligo.*/) {
    $wrapper_arguments .= " -wrapper_iwd " . $description->directory();
}
else {
    $wrapper_arguments .= " -wrapper_iwd " . '$_CONDOR_SCRATCH_DIR';
}
The job manager then appends the wrapper specific arguments to the end of the job's argument string.
# START UCSD Modification
print SCRIPT_FILE "Arguments = $argument_string $wrapper_arguments\n";
# END UCSD Modification
Alternative: Setting the Environment of the Wrapper
*NOT TESTED* If you use this approach please let me know and I will update this wiki.
The following is an untested approach that should just work if your wrapper takes its input via environment variables. This is probably the best approach for most sites, as it is fairly easy to pick up environment variables in the job wrapper.
map {
    if ($_->[0] eq "LOGNAME") { $logname = $_->[1]; }
} @environment;

# Flatten the environment list into condor's semicolon-separated form.
$environment_string = join(';', map {$_->[0] . "=" . $_->[1]} @environment);

# Added to detect LIGO VO for initial dir
if ($logname =~ /.*ligo.*/) {
    $environment_string .= ";MY_INITIAL_DIR=" . $description->directory();
}
# end of added lines
Once you change $environment_string it is automatically used as the environment for the job, so you should not need to do anything else.
Condor's USER_JOB_WRAPPER
Condor supports the ability to run a site-defined job wrapper. The name is somewhat inaccurate, as the script specified in the worker node condor configuration is not really a wrapper but a script that can perform certain initializations before it makes an exec call to start the submitted job. As a result, the submitted job completely replaces the USER_JOB_WRAPPER in memory, and the wrapper script ceases to exist in active memory.
NFS Lite uses the USER_JOB_WRAPPER feature of condor to make the final directory relocation of the job as specified by the Remote_InitialDir parameter.
Example condor configuration for a job wrapper
This condor configuration parameter must be available to the worker nodes, and the script must be accessible and executable by the local worker node condor process.
USER_JOB_WRAPPER=/usr/bin/myjobwrapper
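For example, deployment on a worker node might look like this (the path follows the illustrative example above; adjust for your site):

# Hypothetical deployment steps; the wrapper must be readable and
# executable by the condor process on every worker node.
cp myjobwrapper /usr/bin/myjobwrapper
chmod 755 /usr/bin/myjobwrapper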
Simple Job Wrapper
This job wrapper does nothing but execute the command sent to condor by the user job, including all of its parameters.
#!/bin/sh
exec "$@"
Job Wrapper That Gets the InitialDir from the Environment
#!/bin/sh
# Relocate to the directory requested by the submitter, if one was
# passed in via the environment, then start the job.
if [ -n "$MY_INITIAL_DIR" ]
then
    cd "$MY_INITIAL_DIR"
fi
exec "$@"
Job Wrapper That Parses its Command Line
warning!! perl code
You probably do not want to do things this way, but you can. It requires that you detect your arguments and then remove them so as not to interfere with the job being run.
# Scan the wrapper's argument list (the job executable followed by the
# job's arguments) for our flag, then remove both the flag and its
# value so they do not interfere with the job being run. splice is
# used rather than delete, which would leave undef holes in @ARGV.
my ($wrapper_iwd, $ourargs);
my $i = 0;
while ($i <= $#ARGV) {
    if ($ARGV[$i] eq "-wrapper_iwd") {
        $wrapper_iwd = $ARGV[$i + 1];
        splice(@ARGV, $i, 2);
        $ourargs = 1;
    }
    else {
        $i++;
    }
}
#....
chdir $wrapper_iwd if defined $wrapper_iwd;
exec @ARGV;
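Condor invokes the USER_JOB_WRAPPER with the job executable as its first argument, followed by the job's arguments, so after the jobmanager modification the wrapper sees something along these lines (values are illustrative, taken from the sample output below):

# Roughly what ends up in the wrapper's argument list:
#   <job executable> 120 230 -wrapper_iwd /osgfs/data/tmartin/inspiral-0-20060911T161959-0700/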
Jobmanager Output Script
The following is the script generated by a modified jobmanager to support Remote_InitialDir. You can see how the Arguments parameter includes the added argument to be passed to the wrapper; the environment variable approach would instead update the job's environment to include the required information. You might also notice that the InitialDir parameter is correctly set as well, although condor should ignore it since the file systems are not shared between the worker and the CE. Does this have something to do with why condor is copying the files correctly??
#
# description file for condor submission
#
Universe = vanilla
Notification = Never
Executable = /osglocal/users/cms/uscms001/.globus/.gass_cache/local/md5/8d/6c90b8c481ea15d364ba4f3b29b8ba/md5/e1/24b786af3b11e1157c6d24e293ad07/data
Requirements = OpSys == "LINUX" && Arch == "X86_64"
X509UserProxy = /osglocal/users/cms/uscms001/.globus/job/osg-gw-3.local/32758.1158172385/x509_up
Environment = OSG_GANGLIA_HOST=t2gw01.local;OSG_DATA=/osgfs/data;OSG_SITE_LONGITUDE=-117.26;GRID3_TMP_WN_DIR=/state/data/osgtmp;OSG_LOCATION=/osglocal/osgcore;OSG_JOB_MANAGER_HOME=/condor/release;OSG_JOB_MANAGER=condor;GRID3_TRANSFER_CONTACT=;GRID3_SITE_NAME=osg-gw-3.t2.ucsd.edu;OSG_JOB_CONTACT=osg-gw-3.t2.ucsd.edu/jobmanager-condor;GRID3_DATA_DIR=/osgfs/data;OSG_GANGLIA_PORT=8649;OSG_GANGLIA_SUPPORT=y;OSG_SITE_INFO=https://tier2.ucsd.edu/t2/index.php?option=com_content&task=view&id=2&Itemid=6;OSG_DEFAULT_SE=gsiftp://osg-gw-3.t2.ucsd.edu:2811/;OSG_GRID=/wn-client;LOGNAME=uscms001;OSG_SITE_NAME=osg-gw-3.t2.ucsd.edu;GRID3_JOB_CONTACT=osg-gw-3.t2.ucsd.edu/jobmanager-condor;OSG_GROUP=OSG;GRID3_USER_VO_MAP=/osglocal/osgcore/monitoring/grid3-user-vo-map.txt;OSG_LSF_LOCATION=;OSG_USER_VO_MAP=/osglocal/osgcore/monitoring/grid3-user-vo-map.txt;OSG_WN_TMP=/state/data/osgtmp;GRID3_GRIDFTP_LOG=/osglocal/osgcore/globus/var/gridftp.log;OSG_MONALISA_SERVICE=y;OSG_UTIL_CONTACT=osg-gw-3.t2.ucsd.edu/jobmanager;OSG_SITE_READ=dcap://dcopy-1.local:22137//pnfs/sdsc.edu/;GRID3_SITE_INFO=https://tier2.ucsd.edu/t2/index.php?option=com_content&task=view&id=2&Itemid=6;OSG_FBS_LOCATION=;OSG_SITE_CITY=La Jolla;HOME=/osglocal/users/cms/uscms001;OSG_SITE_COUNTRY=USA;OSG_CONTACT_NAME=Terrence Martin;LD_LIBRARY_PATH=/osglocal/osgcore/MonaLisa/Service/VDTFarm/pgsql/lib:/osglocal/osgcore/voms/lib:/osglocal/osgcore/prima/lib:/osglocal/osgcore/mysql/lib/mysql:/osglocal/osgcore/jdk1.4/jre/lib/i386:/osglocal/osgcore/jdk1.4/jre/lib/i386/server:/osglocal/osgcore/jdk1.4/jre/lib/i386/client:/osglocal/osgcore/berkeley-db/lib:/osglocal/osgcore/expat/lib:/osglocal/osgcore/globus/lib:;GRID3_TMP_DIR=/osgfs/data;OSG_GRIDFTP_LOG=/osglocal/osgcore/globus/var/gridftp.log;OSG_PBS_LOCATION=;OSG_SGE_LOCATION=;OSG_SGE_ROOT=;OSG_STORAGE_ELEMENT=y;OSG_APP=/code/osgcode;GRID3_BASE_DIR=/osglocal/osgcore;OSG_CONDOR_LOCATION=/condor/release;GLOBUS_GRAM_JOB_CONTACT=https://osg-gw-3.local:51797/32758/1158172385/;GLOBUS_LOCATION=/wn-client/globus;OSG_CONDOR_CONFIG=/etc/condor/condor_config;GLOBUS_REMOTE_IO_URL=/osglocal/users/cms/uscms001/.globus/job/osg-gw-3.local/32758.1158172385/remote_io_url;OSG_SPONSOR=cms:50 cdf:50;OSG_SITE_LATITUDE=32.85;GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://osg-gw-3.local:51798/;OSG_SITE_WRITE=srm://t2data2.t2.ucsd.edu:8443/;GRID3_SPONSOR=cms:50 cdf:50;GRID3_APP_DIR=/code/osgcode;CHANGED_X509=/osglocal/users/cms/uscms001/.globus/job/osg-gw-3.local/32758.1158172385/x509_up;GRID3_UTIL_CONTACT=osg-gw-3.t2.ucsd.edu/jobmanager;OSG_CONTACT_EMAIL=tmartin@physics.ucsd.edu;OSG_VO_MODULES=y
Arguments = 120 230 -wrapper_iwd /osgfs/data/tmartin/inspiral-0-20060911T161959-0700/
InitialDir = /osgfs/data/tmartin/inspiral-0-20060911T161959-0700/
Input = /dev/null
Log = /osglocal/osgcore/globus/tmp/gram_job_state/gram_condor_log.32758.1158172385
log_xml = True
+AccountingGroup = "group_cms.uscms001"
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_output = true
transfer_input_files =
#Extra attributes specified by client
Output = /osglocal/users/cms/uscms001/.globus/job/osg-gw-3.local/32758.1158172385/stdout
Error = /osglocal/users/cms/uscms001/.globus/job/osg-gw-3.local/32758.1158172385/stderr
I still expect the proxy to be read from $_CONDOR_SCRATCH_DIR and for condor to set X509_USER_PROXY to reflect this. This appears to be what happens, based on this snippet from the environment dump of my DAG test of the Remote_InitialDir changes.
X509_USER_PROXY=/state/data/condor_local/execute/dir_15457/x509_up
Test Scripts
These scripts are provided as is and will need to be modified to support your own local cluster.
Condor-G test script
This condor-g test script was used to test the Remote_InitialDir functionality at UCSD. You may wish to use this script, with modifications, to test your own implementation of the above changes. This job was submitted via condor_submit.
my-script.cmd
universe=globus
GlobusScheduler=osg-gw-3.t2.ucsd.edu:/jobmanager-condor
executable=/home/users/tmartin/Cluster_Tests/ENV_test/var1.sh
stream_output = False
stream_error = False
WhenToTransferOutput = ON_EXIT
transfer_input_files = /bin/ps,/bin/hostname,/code/osgcode/tmartin/HitsTest-condor.sh,/code/osgcode/tmartin/OscarTest-condor.sh,/code/osgcode/tmartin/ValidateCMSSWSoftware-condor.sh
remote_initialdir = /osgfs/data/tmartin/
arguments=120 230
output = ./output/initial_dir-2.out
error = ./output/initial_dir-2.err
log = ./output/initial_dir-2.log
queue
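The contents of var1.sh are not reproduced here. A minimal test payload along the following lines (hypothetical, not the actual UCSD script) is enough to verify that remote_initialdir was honored:

#!/bin/sh
# Hypothetical minimal payload: print the working directory and the
# arguments so the output file shows where the job actually started.
echo "PWD: `pwd`"
echo "ARGS: $@"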
DAG Test Script
These DAG test scripts were based on an initial script sent to UCSD by LIGO, intended to be similar to a LIGO submitted job. UCSD used this test to confirm functionality when using DAGMan, which is what LIGO uses. These scripts were submitted using condor_submit_dag.
mydag.dag
Job setup startup.cmd
Job run middle.cmd
Job clean finish.cmd
PARENT setup CHILD run clean
PARENT run CHILD clean
startup.cmd
######################################################################
# GRIPHYN VDS SUBMIT FILE GENERATOR
# DAG : inspiral, Index = 0, Count = 1
# SUBMIT FILE NAME : inspiral_0_osg_gw_3.t2.ucsd.edu_cdir.sub
######################################################################
environment = app=/code/osgcode;data=/osgfs/data;grid3=/code/osgcode/wn-client;tmp=/osgfs/data;wntmp=/state/data/osgtmp;
arguments = -n dirmanager -N Pegasus::dirmanager:1.0 -R osg_gw_3.t2.ucsd.edu /code/osgcode/wn-client/vds/bin/dirmanager --create --dir /osgfs/data/tmartin/inspiral-0-20060911T161959-0700
copy_to_spool = false
error = ./output/vds.err
executable = /code/osgcode/wn-client/vds/bin/kickstart
globusrsl = (jobtype=single)
globusscheduler = osg-gw-3.t2.ucsd.edu/jobmanager-condor
log = ./output/vds.log
notification = NEVER
output = ./output/vds.out
periodic_release = (NumSystemHolds <= 3)
periodic_remove = (NumSystemHolds > 3)
remote_initialdir = /osgfs/data/tmartin/
submit_event_user_notes = pool:osg_gw_3.t2.ucsd.edu
transfer_error = true
transfer_executable = false
transfer_output = true
universe = globus
+vds_generator = "Pegasus"
+vds_version = "1.4.7cvs"
+vds_wf_name = "inspiral-0"
+vds_wf_time = "20060911T161959-0700"
+vds_wf_xformation = "dirmanager"
+vds_wf_derivation = "Pegasus::dirmanager:1.0"
+vds_job_class = 6
+vds_job_id = "inspiral_0_osg_gw_3.t2.ucsd.edu_cdir"
+vds_site = "osg_gw_3.t2.ucsd.edu"
queue
middle.cmd
universe=globus
GlobusScheduler=osg-gw-3.t2.ucsd.edu:/jobmanager-condor
executable=/home/users/tmartin/Cluster_Tests/DAGtest/middle.sh
stream_output = False
stream_error = False
WhenToTransferOutput = ON_EXIT
transfer_input_files = /code/osgcode/tmartin/HitsTest-condor.sh,/code/osgcode/tmartin/OscarTest-condor.sh,/code/osgcode/tmartin/ValidateCMSSWSoftware-condor.sh
remote_initialdir = /osgfs/data/tmartin/inspiral-0-20060911T161959-0700/
arguments=120 230
output = ./output/initial_dir-1.out
error = ./output/initial_dir-1.err
log = ./output/initial_dir-1.log
queue
middle.sh
This sh script is a UCSD regression test script we often use to test basic cluster functionality. Please replace it with your own script.
#!/bin/sh
host=`/bin/hostname`
date=`/bin/date`
who=`/usr/bin/whoami`
ps=`/bin/ps awx`
pwd=`/bin/pwd`
echo "Where is perl"
echo $GLOBUS_LOCATION
echo "$who\@$host on $date"
echo
echo "PWD $pwd"
echo $X509_USER_PROXY
echo
echo "Scratch Dir : $_CONDOR_SCRATCH_DIR"
ls -alF $_CONDOR_SCRATCH_DIR
cp -fv /etc/group $_CONDOR_SCRATCH_DIR/myoutput.txt
ls -alF $_CONDOR_SCRATCH_DIR
echo "ls -alF $OSG_DATA"
ls -alF $OSG_DATA
#echo "cp -fv /etc/group $OSG_DATA/$RANDOM.$RANDOM.file"
#cp -fv /etc/group $OSG_DATA/$RANDOM.$RANDOM.file
echo
echo "-----------------------------"
echo
echo "Checking for srmcp"
which srmcp
echo "Sourcing setup.sh from wn-client"
ls -l $OSG_GRID/setup.sh
source $OSG_GRID/setup.sh
echo "Checking for srmcp again"
which srmcp
echo "Checking path"
echo $PATH
echo "Attempting to run srmcp"
$OSG_GRID/srmclient/bin/srmcp --help
echo
echo "-----------------------------"
echo
ls -alF $_CONDOR_SCRATCH_DIR
date >> $_CONDOR_SCRATCH_DIR/myoutput.txt
hostname >> $_CONDOR_SCRATCH_DIR/myoutput.txt
whoami >> $_CONDOR_SCRATCH_DIR/myoutput.txt
ls -alF $_CONDOR_SCRATCH_DIR
echo
echo "-----------------------------"
echo
ls -l $OSG_APP/cmssoft/cms/Releases/CMSSW/CMSSW_0_7_0/bin/slc3_ia32_gcc323/cmsRun
#echo "Sleeping for $1"
#sleep $1
echo
echo "=========="
echo "Checking for /uaf/clustertmp/"
ls -l /uaf/clustertmp/
echo "========="
echo "=========="
echo "testing running srmcp"
DATE=`date +%s`
FILE="$RANDOM-$DATE.out"
time $OSG_GRID/srmclient/bin/srmcp --debug=true srm://t2data2.t2.ucsd.edu:8443//data4/cms/userdata/tmartin/9298.out file://localhost//dev/null
time rm -fv $FILE
echo "=========="
echo "Running ValidateCMSSWSoftware-condor.sh"
echo "======================================="
chmod 755 ./ValidateCMSSWSoftware-condor.sh
./ValidateCMSSWSoftware-condor.sh
echo
echo
echo
echo "Running OscarTest-condor.sh"
echo "======================================="
chmod 755 ./OscarTest-condor.sh
./OscarTest-condor.sh
echo
echo
echo
echo "Running HitsTest-condor.sh"
echo "======================================="
chmod 755 ./HitsTest-condor.sh
./HitsTest-condor.sh
echo
echo
echo
finish.cmd
universe=globus
GlobusScheduler=osg-gw-3.t2.ucsd.edu:/jobmanager-condor
executable=/home/users/tmartin/Cluster_Tests/DAGtest/finish.sh
stream_output = False
stream_error = False
WhenToTransferOutput = ON_EXIT
transfer_input_files =
remote_initialdir = /osgfs/data/tmartin/inspiral-0-20060911T161959-0700/
arguments=120 230
output = ./output/initial_dir-2.out
error = ./output/initial_dir-2.err
log = ./output/initial_dir-2.log
queue
finish.sh
#!/bin/sh
/bin/hostname
/bin/date
/usr/bin/whoami
/bin/pwd
ls -la .
Final Thoughts and Comments
Condor documentation is fairly clear that InitialDir is not supported in non-shared file system configurations. Since the OSG does not require shared file systems, and sites have clearly shown they benefit from NFS Lite, it is probably correct behaviour that Remote_InitialDir is not guaranteed to work, and VOs should take this into account. However, since the condor documentation is not explicit that Remote_InitialDir works only if the remote site implements a particular file system configuration, it is probably reasonable that Remote_InitialDir functionality be supported in some circumstances.
For now it is UCSD site policy that Remote_InitialDir will be supported for the LIGO VO; UCSD will consider requests for support of this parameter on a VO-by-VO basis.
References and External Documents
USER_JOB_WRAPPER: http://www.cs.wisc.edu/condor/manual/v6.8/7_3Running_Condor.html
Remote_InitialDir and InitialDir: http://www.cs.wisc.edu/condor/manual/v6.8/condor_submit.html
Authors
--
TerrenceMartin - 19 Sep 2006