Difference: UCSDUserDocPCF (1 vs. 19)

Revision 19 - 2017/01/25 - Main.MartinKandes

Line: 1 to 1
 
META TOPICPARENT name="WebHome"
Line: 10 to 10
  The UCSD Physics Computing Facility (PCF) provides access to multiple high-throughput computing resources that are made available to students, faculty, and staff in the Department of Physics as well as those in the broader scientific community at UCSD. This document describes how to get an account on PCF and begin submitting jobs to its computing resources.
Added:
>
>
Please note that this documentation is currently under construction and may be incomplete in places.
 This document follows the general Open Science Grid (OSG) documentation conventions:

  1. A User Command Line is illustrated by a green box that displays a prompt:
     [user@client ~]$ 
Line: 67 to 69
  request_disk = 8000000 request_memory = 1024 +ProjectName = "PCFOSGUCSD"
Changed:
<
<
+local = TRUE +site_local = FALSE
>
>
+local = FALSE +site_local = TRUE
  +sdsc = FALSE
Changed:
<
<
+uc = FALSE
>
>
+uc = TRUE
  queue 10

The first line here

 # A sample HTCondor submit description file 
is simply a comment line in the submit description file. Any comments in a submit description file should be placed on their own line.
Line: 100 to 102
 HTCondor allows users (and system administrators) to append custom attributes to any job at the time of submission. On PCF, a set of custom attributes is used to mark jobs for special routing and accounting purposes. For example,
 +ProjectName = "PCFOSGUCSD" 
is a job attribute used by the Open Science Grid (OSG) for tracking resource usage by group. All jobs submitted to PCF, including yours, should contain this +ProjectName = "PCFOSGUCSD" attribute, unless directed otherwise.

The next set of custom job attributes in the sample submit description file

Changed:
<
<
 +local = TRUE
 +site_local = FALSE

>
>
 +local = FALSE
 +site_local = TRUE

  +sdsc = FALSE
Changed:
<
<
+uc = FALSE
>
>
+uc = TRUE
 are a set of boolean job routing flags that allow you to explicitly target where your jobs may run. Each one of these boolean flags is associated with one of the different computing resources accessible from PCF. When you set the value of one of these resource flags to TRUE, you permit your jobs to run on the system associated with that flag. In contrast, when you set the value of the resource flag to FALSE, you prevent your jobs from running on that system. The relationship between each job routing flag and computing resource is provided in the following table.

Job Routing Flag Default Value Computing Resource Accessibility
Line: 112 to 114
 
+sdsc FALSE Comet Supercomputer Open only to PCF users with an XSEDE allocation on Comet
+uc FALSE Open Science Grid Open to all PCF users
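For example, a minimal sketch of these flags, assuming you wanted to permit your jobs to run only on the Open Science Grid and nowhere else, would be:

 +local = FALSE
 +site_local = FALSE
 +sdsc = FALSE
 +uc = TRUE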
Changed:
<
<
We see here that the sample submit description file is only targeted to run the job locally on PCF itself.
>
>
We see here that the sample submit description file has targeted the job to run either at the CMS Tier 2 Cluster or out on the Open Science Grid.
  Finally, the sample submit description file ends with the queue command, which as shown here simply places an integer number of copies (10) of the job in the HTCondor queue upon submission. If no integer value is given with the queue command, the default value is 1. Every submit description file must contain at least one queue command.
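As a brief illustrative sketch (not taken from the sample file above), the two forms of the queue command would be:

 # queue a single copy of the job (the default when no integer is given)
 queue

 # queue ten copies of the job
 queue 10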
Deleted:
<
<
requirements
 
Added:
>
>
requirements
 It is important to note here that the name of this shell script was not chosen randomly. While other batch systems like SLURM and PBS use standard shell scripts annotated with directives that both communicate the requirements of a batch job to their schedulers and describe how the job's executable should be run, HTCondor does not work this way. In general, an HTCondor submit description file separates the directives (or commands) to the scheduler from how the executable should be run (e.g., how it would look if run interactively from the command line). As such, HTCondor users will often need to wrap their actual (payload) executable within a shell script, as shown in this sample submit description file. Here, that executable is represented by job.x in the transfer_input_files command.
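For illustration only, a minimal wrapper script along these lines, assuming a payload executable named job.x is transferred into the job's working directory, might look like:

 #!/bin/bash
 # Minimal wrapper sketch: ensure the transferred payload is executable,
 # then run it, passing along any arguments HTCondor supplies to this script.
 chmod +x job.x
 ./job.x "$@"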

Job Status

Line: 245 to 247
 

Job Removal

Changed:
<
<
Occasionally, you you may need remove a job that has already been submitted to the PCF queue. For example, maybe the job has been misconfigured in some way or goes held for some reason. To remove a job in the queue, you can use the condor_rm command. To remove a job from the queue, provide the both the ClusterId and ProcId of the job you would like to remove.
>
>
Occasionally, you may need to remove a job that has already been submitted to the PCF queue. For example, maybe the job has been misconfigured in some way or has gone into the held state for some reason. To remove a job from the queue, use the condor_rm command and provide both the ClusterId and ProcId of the job you would like to remove.
 
 [youradusername@pcf-osg ~]$ condor_q youradusername

Revision 18 - 2017/01/13 - Main.MartinKandes

Line: 1 to 1
 
META TOPICPARENT name="WebHome"
Line: 112 to 112
 
+sdsc FALSE Comet Supercomputer Open only to PCF users with an XSEDE allocation on Comet
+uc FALSE Open Science Grid Open to all PCF users
Changed:
<
<
As such, we see here that the sample submit description file is only targeted to run the job locally on PCF itself.
>
>
We see here that the sample submit description file is only targeted to run the job locally on PCF itself.
  Finally, the sample submit description file ends with the queue command, which as shown here simply places an integer number of copies (10) of the job in the HTCondor queue upon submission. If no integer value is given with the queue command, the default value is 1. Every submit description file must contain at least one queue command.
Line: 170 to 170
  10 jobs; 0 completed, 0 removed, 0 idle, 10 running, 0 held, 0 suspended
Changed:
<
<
The status of each submitted job in the queue is provided in the column labeled ST in the standard output of the condor_q command. In general, you will only find 3 different status codes in this column, namely:
>
>
The status of each submitted job in the queue is provided in the column labeled ST in the standard output of the condor_q command. In general, you will only find 3 different job status codes in this column, namely:
 
  • R: The job is currently running.
  • I: The job is idle. It is not running right now, because it is waiting for a machine to become available.
Line: 198 to 198
  1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended
Changed:
<
<
In this case, for some reason you purposely placed the job on hold using the condor_hold command. However, if you find a more unusual HOLD_REASON given and you are unable to resolve the issue yourself, please contact the PCF system administrators to help you investigate the problem.
>
>
In this case, for some reason you placed the job on hold using the condor_hold command. However, if you find a more unusual HOLD_REASON given and are unable to resolve the issue yourself, please contact the PCF system administrators to help you investigate the problem.
 
Changed:
<
<
If you find that your job has been sitting idle (I) for an unusually long period of time, you can run condor_q with the -analyze (or -better-analyze) option to attempt to diagnose the problem.
>
>
If instead you find that your job has been sitting idle (I) for an unusually long period of time, you can run condor_q with the -analyze (or -better-analyze) option to attempt to diagnose the problem.
 
 [youradusername@pcf-osg ~]$ condor_q -analyze 16250.0

Line: 242 to 241
 
---------- local change to undefined
Changed:
<
<
mkandes@pcf-osg ~$ condor_q 16662.4 -l | less

MATCH_EXP_JOB_GLIDEIN_Entry_Name = "Unknown" MATCH_EXP_JOB_GLIDEIN_Schedd = "Unknown" MaxHosts? = 1 MATCH_EXP_JOBGLIDEIN_ResourceName = "UCSD" User = "mkandes@pcf-osg.t2.ucsd.edu" EncryptExecuteDirectory? = false MATCH_GLIDEIN_ClusterId = "Unknown" OnExitHold? = false CoreSize? = 0 JOB_GLIDEIN_SiteWMS = "$$(GLIDEIN_SiteWMS:Unknown)" MATCH_GLIDEIN_Factory = "Unknown" MachineAttrCpus0? = 1 WantRemoteSyscalls? = false MyType? = "Job" Rank = 0.0 CumulativeSuspensionTime? = 0 MinHosts? = 1 MATCH_EXP_JOB_GLIDEIN_SiteWMS_Slot = "Unknown" PeriodicHold? = false PeriodicRemove? = false Err = "pi.err.16662.4" ProcId? = 4

-analyze

>
>
Again, if you are unable to resolve the issue yourself, please contact the PCF system administrators to help you investigate the problem.
 

Job Removal

Changed:
<
<
[1514] mkandes@pcf-osg ~$ condor_rm 16662.4 Job 16662.4 marked for removal
>
>
Occasionally, you you may need remove a job that has already been submitted to the PCF queue. For example, maybe the job has been misconfigured in some way or goes held for some reason. To remove a job in the queue, you can use the condor_rm command. To remove a job from the queue, provide the both the ClusterId and ProcId of the job you would like to remove.
 
Added:
>
>
 [youradusername@pcf-osg ~]$ condor_q youradusername

 
Changed:
<
<
condor_q 16662
>
>
-- Schedd: pcf-osg.t2.ucsd.edu : <169.228.130.75:9615?... ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 16665.0 youradusername 1/13 08:55 0+01:24:38 R 0 122.1 bash_pi.sh -b 8 -r 16665.1 youradusername 1/13 08:55 0+01:24:38 R 0 26.9 bash_pi.sh -b 8 -r 16665.2 youradusername 1/13 08:55 0+01:24:38 R 0 26.9 bash_pi.sh -b 8 -r 16665.3 youradusername 1/13 08:55 0+01:24:38 R 0 26.9 bash_pi.sh -b 8 -r 16665.4 youradusername 1/13 08:55 0+01:24:38 R 0 26.9 bash_pi.sh -b 8 -r 16665.5 youradusername 1/13 08:55 0+01:24:38 R 0 26.9 bash_pi.sh -b 8 -r 16665.6 youradusername 1/13 08:55 0+01:24:37 R 0 26.9 bash_pi.sh -b 8 -r 16665.7 youradusername 1/13 08:55 0+01:24:37 R 0 26.9 bash_pi.sh -b 8 -r 16665.8 youradusername 1/13 08:55 0+01:24:37 R 0 26.9 bash_pi.sh -b 8 -r 16665.9 youradusername 1/13 08:55 0+01:24:37 R 0 26.9 bash_pi.sh -b 8 -r

10 jobs; 0 completed, 0 removed, 0 idle, 10 running, 0 held, 0 suspended

[youradusername@pcf-osg ~]$ condor_rm 16665.0 16665.2 16665.4 16665.6 16665.8

Job 16665.0 marked for removal Job 16665.2 marked for removal Job 16665.4 marked for removal Job 16665.6 marked for removal Job 16665.8 marked for removal

 
Added:
>
>
[youradusername@pcf-osg ~]$ condor_q youradusername
  -- Schedd: pcf-osg.t2.ucsd.edu : <169.228.130.75:9615?... ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
Changed:
<
<
16662.0 mkandes 1/12 14:51 0+00:23:04 R 0 26.9 pi.sh -b 8 -r 7 -s 16662.1 mkandes 1/12 14:51 0+00:23:04 R 0 26.9 pi.sh -b 8 -r 7 -s 16662.2 mkandes 1/12 14:51 0+00:23:04 R 0 26.9 pi.sh -b 8 -r 7 -s 16662.3 mkandes 1/12 14:51 0+00:23:03 R 0 26.9 pi.sh -b 8 -r 7 -s 16662.5 mkandes 1/12 14:51 0+00:23:03 R 0 26.9 pi.sh -b 8 -r 7 -s 16662.6 mkandes 1/12 14:51 0+00:23:03 R 0 26.9 pi.sh -b 8 -r 7 -s 16662.7 mkandes 1/12 14:51 0+00:23:03 R 0 26.9 pi.sh -b 8 -r 7 -s 16662.8 mkandes 1/12 14:51 0+00:23:02 R 0 26.9 pi.sh -b 8 -r 7 -s 16662.9 mkandes 1/12 14:51 0+00:23:02 R 0 26.9 pi.sh -b 8 -r 7 -s
>
>
16665.1 youradusername 1/13 08:55 0+01:26:04 R 0 26.9 bash_pi.sh -b 8 -r 16665.3 youradusername 1/13 08:55 0+01:26:04 R 0 26.9 bash_pi.sh -b 8 -r 16665.5 youradusername 1/13 08:55 0+01:26:04 R 0 26.9 bash_pi.sh -b 8 -r 16665.7 youradusername 1/13 08:55 0+01:26:03 R 0 26.9 bash_pi.sh -b 8 -r 16665.9 youradusername 1/13 08:55 0+01:26:03 R 0 26.9 bash_pi.sh -b 8 -r
 
Changed:
<
<
9 jobs; 0 completed, 0 removed, 0 idle, 9 running, 0 held, 0 suspended
>
>
5 jobs; 0 completed, 0 removed, 0 idle, 5 running, 0 held, 0 suspended
 
Added:
>
>
However, if you need to remove a whole cluster of jobs, simply provide the ClusterId of the cluster on its own, as shown in the sketch below.
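For example, a short sketch reusing the ClusterId from the session above:

 [youradusername@pcf-osg ~]$ condor_rm 16665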
 

Job History

Revision 17 - 2017/01/13 - Main.MartinKandes

Line: 1 to 1
 
META TOPICPARENT name="WebHome"
Line: 40 to 40
  Password: ENTERYOURADPASSWORDHERE
Changed:
<
<

Managing Jobs with HTCondor

>
>

Managing Jobs

 

Job Submission

Line: 50 to 50
  where job.condor is the name of a UNIX formatted plain ASCII file known as a submit description file. This file contains special commands, directives, expressions, statements, and variables used to specify information about your batch job to HTCondor, such as what executable to run, the files to use for standard input, standard output, and standard error, as well as the resources required to successfully run the job.
Changed:
<
<

Submit Description Files

>
>

Submit Description Files

  A sample HTCondor submit description file (bash_pi.condor) is shown below.
Line: 95 to 95
 
 request_cpus = 1 
 request_disk = 8000000
 request_memory = 1024 
Changed:
<
<
These commands tell HTCondor what resources --- CPUs in number of cores, disk in KiB? (default), and memory in MiB? (default) --- are required to successfully run your job. It is important to provide this information in your submit description files as accurately as possible since HTCondor will use these requirements to match your job to a machine that can provides such resources. Otherwise, your job may fail when it is matched with and attempts to run on a machine without sufficient resources. All jobs submitted to PCF should contain these request commands. In general, you may assume that any job submitted to PCF can safely use up to 8 CPU-cores, 20 GB of disk space, and 2 GB of memory per CPU-core requested. Note: You can avoid using the default units of KiB? and MiB? for the request_disk and request_memory commands by appending the characters K (or KB), M (or MB), G (or GB), or T (or TB) to their numerical value to indicate the units to be used.
>
>
These commands tell HTCondor what resources (CPUs in number of cores, disk in KiB by default, and memory in MiB by default) are required to successfully run your job. It is important to provide this information in your submit description files as accurately as possible since HTCondor will use these requirements to match your job to a machine that provides such resources. If this information is inaccurate, your job may fail when it is matched with and attempts to run on a machine without sufficient resources. All jobs submitted to PCF should contain these request commands. In general, you may assume that any job submitted to PCF can safely use up to 8 CPU-cores, 20 GB of disk space, and 2 GB of memory per CPU-core requested. Note: You can avoid using the default units of KiB and MiB for the request_disk and request_memory commands by appending the characters K (or KB), M (or MB), G (or GB), or T (or TB) to their numerical value to indicate the units to be used.
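For instance, a brief sketch using explicit unit suffixes (the values shown are arbitrary examples, not recommendations) might read:

 request_cpus   = 1
 request_disk   = 8GB
 request_memory = 2GB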
 
Changed:
<
<
HTCondor allows users (and system administrators) to append custom attributes to any job at the time of submission. On PCF, some of these custom attributes are used to mark jobs for special routing and accounting purposes. For example,
 +ProjectName = "PCFOSGUCSD" 
is a job attribute used by the Open Science Grid (OSG) for tracking resource usage by group. All jobs submitted to PCF, including yours, should contain this +ProjectName = "PCFOSGUCSD" attribute, unless directed otherwise.
>
>
HTCondor allows users (and system administrators) to append custom attributes to any job at the time of submission. On PCF, a set of custom attributes are used to mark jobs for special routing and accounting purposes. For example,
 +ProjectName = "PCFOSGUCSD" 
is a job attribute used by the Open Science Grid (OSG) for tracking resource usage by group. All jobs submitted to PCF, including yours, should contain this +ProjectName = "PCFOSGUCSD" attribute, unless directed otherwise.
  The next set of custom job attributes in the sample submit description file
 +local = TRUE

Line: 114 to 114
  As such, we see here that the sample submit description file is only targeted to run the job locally on PCF itself.
Changed:
<
<
Finally, the sample submit description file ends with the queue command, which in the form shown here simply places an integer number of copies (10) of the job in the HTCondor queue upon submission. If no integer value is given with the queue command, the default value is 1. Every submit description file must contain at least one queue command.
>
>
Finally, the sample submit description file ends with the queue command, which as shown here simply places an integer number of copies (10) of the job in the HTCondor queue upon submission. If no integer value is given with the queue command, the default value is 1. Every submit description file must contain at least one queue command.
  requirements
Line: 122 to 122
 

Job Status

Changed:
<
<
Once you submit a job to PCF, you can periodically check on its status by using the condor_q command. There will likely always be other user jobs in the queue besides your own. Therefore, in general, you will want to issue the command by providing your username as an argument.
>
>
Once you submit a job to PCF, you can periodically check on its status by using the condor_q command. There will likely always be other user jobs in PCF's queue besides your own. Therefore, in general, you will want to issue the condor_q command by providing your username as an argument.
 
 [youradusername@pcf-osg ~]$ condor_q youradusername

 -- Schedd: pcf-osg.t2.ucsd.edu : <169.228.130.75:9615?...
 ID        OWNER                  SUBMITTED    RUN_TIME   ST PRI SIZE CMD               

Changed:
<
<
16661.0 youradusername 1/12 14:51 0+00:00:04 R 0 0.0 pi.sh -b 8 -r 7 -s 16661.1 youradusername 1/12 14:51 0+00:00:04 R 0 0.0 pi.sh -b 8 -r 7 -s 16661.2 youradusername 1/12 14:51 0+00:00:04 R 0 0.0 pi.sh -b 8 -r 7 -s 16661.3 youradusername 1/12 14:51 0+00:00:04 R 0 0.0 pi.sh -b 8 -r 7 -s 16661.4 youradusername 1/12 14:51 0+00:00:03 R 0 0.0 pi.sh -b 8 -r 7 -s 16661.5 youradusername 1/12 14:51 0+00:00:03 R 0 0.0 pi.sh -b 8 -r 7 -s 16661.6 youradusername 1/12 14:51 0+00:00:03 R 0 0.0 pi.sh -b 8 -r 7 -s 16661.7 youradusername 1/12 14:51 0+00:00:03 R 0 0.0 pi.sh -b 8 -r 7 -s 16661.8 youradusername 1/12 14:51 0+00:00:03 R 0 0.0 pi.sh -b 8 -r 7 -s 16661.9 youradusername 1/12 14:51 0+00:00:03 R 0 0.0 pi.sh -b 8 -r 7 -s 16662.0 youradusername 1/12 14:51 0+00:00:03 R 0 0.0 pi.sh -b 8 -r 7 -s 16662.1 youradusername 1/12 14:51 0+00:00:03 R 0 0.0 pi.sh -b 8 -r 7 -s 16662.2 youradusername 1/12 14:51 0+00:00:03 R 0 0.0 pi.sh -b 8 -r 7 -s 16662.3 youradusername 1/12 14:51 0+00:00:02 R 0 0.0 pi.sh -b 8 -r 7 -s 16662.4 youradusername 1/12 14:51 0+00:00:02 R 0 0.0 pi.sh -b 8 -r 7 -s 16662.5 youradusername 1/12 14:51 0+00:00:02 R 0 0.0 pi.sh -b 8 -r 7 -s 16662.6 youradusername 1/12 14:51 0+00:00:02 R 0 0.0 pi.sh -b 8 -r 7 -s 16662.7 youradusername 1/12 14:51 0+00:00:02 R 0 0.0 pi.sh -b 8 -r 7 -s 16662.8 youradusername 1/12 14:51 0+00:00:01 R 0 0.0 pi.sh -b 8 -r 7 -s 16662.9 youradusername 1/12 14:51 0+00:00:01 R 0 0.0 pi.sh -b 8 -r 7 -s
>
>
16663.0 youradusername 1/12 17:09 0+00:00:08 R 0 0.0 bash_pi.sh -b 8 -r 16663.1 youradusername 1/12 17:09 0+00:00:08 R 0 0.0 bash_pi.sh -b 8 -r 16663.2 youradusername 1/12 17:09 0+00:00:08 R 0 0.0 bash_pi.sh -b 8 -r 16663.3 youradusername 1/12 17:09 0+00:00:08 R 0 0.0 bash_pi.sh -b 8 -r 16663.4 youradusername 1/12 17:09 0+00:00:08 R 0 0.0 bash_pi.sh -b 8 -r 16663.5 youradusername 1/12 17:09 0+00:00:08 R 0 0.0 bash_pi.sh -b 8 -r 16663.6 youradusername 1/12 17:09 0+00:00:07 R 0 0.0 bash_pi.sh -b 8 -r 16663.7 youradusername 1/12 17:09 0+00:00:07 R 0 0.0 bash_pi.sh -b 8 -r 16663.8 youradusername 1/12 17:09 0+00:00:07 R 0 0.0 bash_pi.sh -b 8 -r 16663.9 youradusername 1/12 17:09 0+00:00:07 R 0 0.0 bash_pi.sh -b 8 -r 16664.0 youradusername 1/12 17:09 0+00:00:00 I 0 0.0 bash_pi.sh -b 8 -r 16664.1 youradusername 1/12 17:09 0+00:00:00 I 0 0.0 bash_pi.sh -b 8 -r 16664.2 youradusername 1/12 17:09 0+00:00:00 I 0 0.0 bash_pi.sh -b 8 -r 16664.3 youradusername 1/12 17:09 0+00:00:00 I 0 0.0 bash_pi.sh -b 8 -r 16664.4 youradusername 1/12 17:09 0+00:00:00 I 0 0.0 bash_pi.sh -b 8 -r 16664.5 youradusername 1/12 17:09 0+00:00:00 I 0 0.0 bash_pi.sh -b 8 -r 16664.6 youradusername 1/12 17:09 0+00:00:00 I 0 0.0 bash_pi.sh -b 8 -r 16664.7 youradusername 1/12 17:09 0+00:00:00 I 0 0.0 bash_pi.sh -b 8 -r 16664.8 youradusername 1/12 17:09 0+00:00:00 I 0 0.0 bash_pi.sh -b 8 -r 16664.9 youradusername 1/12 17:09 0+00:00:00 I 0 0.0 bash_pi.sh -b 8 -r
 
Changed:
<
<
20 jobs; 0 completed, 0 removed, 0 idle, 20 running, 0 held, 0 suspended
>
>
20 jobs; 0 completed, 0 removed, 10 idle, 10 running, 0 held, 0 suspended
 
Changed:
<
<
This will limit the status information returned condor_q to your user jobs only. However, if there is a particular subset of your jobs you're interested in checking up on, you can also limit the status information by providing the specific job ClusterId as an argument to condor_q.
>
>
This will limit the job status information returned by condor_q to your jobs only. You may also limit the job status information to a particular subset of jobs you're interested in by providing the ClusterId of the subset as an argument to condor_q.
 
Changed:
<
<
 [youradusername@pcf-osg ~]$ condor_q 16662

>
>
 [youradusername@pcf-osg ~]$ condor_q 16663

  -- Schedd: pcf-osg.t2.ucsd.edu : <169.228.130.75:9615?... ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
Changed:
<
<
16662.0 mkandes 1/12 14:51 0+00:01:53 R 0 0.0 pi.sh -b 8 -r 7 -s 16662.1 mkandes 1/12 14:51 0+00:01:53 R 0 0.0 pi.sh -b 8 -r 7 -s 16662.2 mkandes 1/12 14:51 0+00:01:53 R 0 0.0 pi.sh -b 8 -r 7 -s 16662.3 mkandes 1/12 14:51 0+00:01:52 R 0 0.0 pi.sh -b 8 -r 7 -s 16662.4 mkandes 1/12 14:51 0+00:01:52 R 0 0.0 pi.sh -b 8 -r 7 -s 16662.5 mkandes 1/12 14:51 0+00:01:52 R 0 0.0 pi.sh -b 8 -r 7 -s 16662.6 mkandes 1/12 14:51 0+00:01:52 R 0 0.0 pi.sh -b 8 -r 7 -s 16662.7 mkandes 1/12 14:51 0+00:01:52 R 0 0.0 pi.sh -b 8 -r 7 -s 16662.8 mkandes 1/12 14:51 0+00:01:51 R 0 0.0 pi.sh -b 8 -r 7 -s 16662.9 mkandes 1/12 14:51 0+00:01:51 R 0 0.0 pi.sh -b 8 -r 7 -s
>
>
16663.0 youradusername 1/12 17:09 0+00:03:25 R 0 0.0 bash_pi.sh -b 8 -r 16663.1 youradusername 1/12 17:09 0+00:03:25 R 0 0.0 bash_pi.sh -b 8 -r 16663.2 youradusername 1/12 17:09 0+00:03:25 R 0 0.0 bash_pi.sh -b 8 -r 16663.3 youradusername 1/12 17:09 0+00:03:25 R 0 0.0 bash_pi.sh -b 8 -r 16663.4 youradusername 1/12 17:09 0+00:03:25 R 0 0.0 bash_pi.sh -b 8 -r 16663.5 youradusername 1/12 17:09 0+00:03:25 R 0 0.0 bash_pi.sh -b 8 -r 16663.6 youradusername 1/12 17:09 0+00:03:24 R 0 0.0 bash_pi.sh -b 8 -r 16663.7 youradusername 1/12 17:09 0+00:03:24 R 0 0.0 bash_pi.sh -b 8 -r 16663.8 youradusername 1/12 17:09 0+00:03:24 R 0 0.0 bash_pi.sh -b 8 -r 16663.9 youradusername 1/12 17:09 0+00:03:24 R 0 0.0 bash_pi.sh -b 8 -r
  10 jobs; 0 completed, 0 removed, 0 idle, 10 running, 0 held, 0 suspended
Added:
>
>
The status of each submitted job in the queue is provided in the column labeled ST in the standard output of the condor_q command. In general, you will only find 3 different status codes in this column, namely:

  • R: The job is currently running.
  • I: The job is idle. It is not running right now, because it is waiting for a machine to become available.
  • H: The job is the held state. In the held state, the job will not be scheduled to run until it is released.

If your job is running (R), you probably don't have anything to worry about. However, if the job has been idle (I) for an unusually long period of time or is found in the held (H) state, you may want to investigate why your job is not running before contacting the PCF system administrators for additional help.

If you find your job in the held state (H)

 [youradusername@pcf-osg ~]$ condor_q 16663.3

 -- Schedd: pcf-osg.t2.ucsd.edu : <169.228.130.75:9615?...
 ID        OWNER                  SUBMITTED    RUN_TIME   ST PRI SIZE CMD               
 16663.3   youradusername         1/12 17:09   0+00:56:56 H  0   26.9 bash_pi.sh -b 8 -r

 1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended 

you can check the hold reason by appending the -held option to the condor_q command.

 [youradusername@pcf-osg ~]$ condor_q 16663.3 -held

 -- Schedd: pcf-osg.t2.ucsd.edu : <169.228.130.75:9615?...
 ID       OWNER                  HELD_SINCE HOLD_REASON                                
 16663.3  youradusername         1/12 18:06 via condor_hold (by user youradusername)          

 1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended 

In this case, for some reason you purposely placed the job on hold using the condor_hold command. However, if you find a more unusual HOLD_REASON given and you are unable to resolve the issue yourself, please contact the PCF system administrators to help you investigate the problem.

If you find that your job has been sitting idle (I) for an unusually long period of time, you can run condor_q with the -analyze (or -better-analyze) option to attempt to diagnose the problem.

 [youradusername@pcf-osg ~]$ condor_q -analyze 16250.0

-- Schedd: pcf-osg.t2.ucsd.edu : <169.228.130.75:9615?...
User priority for youradusername@pcf-osg.t2.ucsd.edu is not available, attempting to analyze without it.
---
16250.000:  Run analysis summary.  Of 20 machines,
     19 are rejected by your job's requirements 
      1 reject your job because of their own requirements 
      0 match and are already running your jobs 
      0 match but are serving other users 
      0 are available to run your job
	No successful match recorded.
	Last failed match: Thu Jan 12 18:45:36 2017

	Reason for last match failure: no match found 

The Requirements expression for your job is:

    ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) &&
    ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) &&
    ( TARGET.Cpus >= RequestCpus ) && ( TARGET.HasFileTransfer )


Suggestions:

    Condition                         Machines Matched    Suggestion
    ---------                         ----------------    ----------
1   ( TARGET.Memory >= 16384 )        1                    
2   ( TARGET.Cpus >= 8 )              1                    
3   ( TARGET.Arch == "X86_64" )       20                   
4   ( TARGET.OpSys == "LINUX" )       20                   
5   ( TARGET.Disk >= 1 )              20                   
6   ( TARGET.HasFileTransfer )        20                   

The following attributes should be added or modified:

Attribute               Suggestion
---------               ----------
local                   change to undefined 
  mkandes@pcf-osg ~$ condor_q 16662.4 -l | less
Line: 302 to 374
 module swap foo1 foo2 : switches loaded module foo1 with module foo2
 module unload foo : reverses all changes to the environment made by previously loading module foo
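A short illustrative session, where foo1 and foo2 are placeholder module names rather than modules guaranteed to exist on PCF:

 [youradusername@pcf-osg ~]$ module load foo1
 [youradusername@pcf-osg ~]$ module swap foo1 foo2
 [youradusername@pcf-osg ~]$ module unload foo2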
Added:
>
>

Special Instructions

Running Jobs on Comet

Running Jobs on Amazon Web Services

  condor_annex is a Perl-based script that utilizes the AWS command-line interface and other AWS services to orchestrate the delivery of HTCondor execute nodes to an HTCondor pool like the one available to you on pcf-osg.t2.ucsd.edu. If you would like to try running your jobs on AWS resources, please contact Marty Kandes at mkandes@sdsc.edu. Some backend configuration of your AWS account will be necessary to get started. However, once your AWS account is configured, you will be able to order instances on-demand with one command:
Line: 322 to 397
  --config-file $AWS_USER_CONFIG"
Added:
>
>

Contact Information

 
Added:
>
>
  • Physics Help Desk
  • PCF System Administrators:
 

Additional Documentation

Added:
>
>
 
Deleted:
<
<
  • pi.condor: A sample HTCondor submit description file

  • pi.sh: A bash script that estimates the value of Pi via the Monte Carlo method.
 
Changed:
<
<
  • bash_pi.sh: A bash script uses a simple Monte Carlo method to estimate the value of Pi
>
>
  • bash_pi.sh: A bash script that uses a simple Monte Carlo method to estimate the value of Pi
 
META FILEATTACHMENT attachment="bash_pi.condor" attr="" comment="A sample HTCondor submit description file" date="1484096462" name="bash_pi.condor" path="bash_pi.condor" size="467" stream="bash_pi.condor" tmpFilename="/tmp/aq6lkzQfzs" user="MartinKandes" version="1"
META FILEATTACHMENT attachment="bash_pi.sh" attr="" comment="A bash script uses a simple Monte Carlo method to estimate the value of Pi" date="1484096507" name="bash_pi.sh" path="bash_pi.sh" size="1756" stream="bash_pi.sh" tmpFilename="/tmp/NEyD4BXYUQ" user="MartinKandes" version="1"

Revision 16 - 2017/01/12 - Main.MartinKandes

Line: 1 to 1
 
META TOPICPARENT name="WebHome"
Line: 24 to 24
 
Changed:
<
<
While users may submit and run jobs locally on PCF itself, in general, all computationally intensive jobs should be run only on the larger computing resources, reserving PCF's local resources for development and testing purposes only.
>
>
While users may submit and run jobs locally on PCF itself, all computationally intensive jobs should generally be run only on the larger computing resources, reserving PCF's local resources for development and testing purposes only.
 

System Status

Line: 40 to 40
  Password: ENTERYOURADPASSWORDHERE
Changed:
<
<

Running Jobs

>
>

Managing Jobs with HTCondor

 

Job Submission

Line: 67 to 67
  request_disk = 8000000 request_memory = 1024 +ProjectName = "PCFOSGUCSD"
Changed:
<
<
+local = true +site_local = false +sdsc = false +uc = false
>
>
+local = TRUE +site_local = FALSE +sdsc = FALSE +uc = FALSE
  queue 10
Changed:
<
<
Let's breakdown this sample submit description file line-by-line to provide you with some background and guidance on how to construct your own submit description files. The first line
 # A sample HTCondor submit description file 
is simply a comment line in the submit description file. Any comments should be placed on their own line.
>
>
The first line here
 # A sample HTCondor submit description file 
is simply a comment line in the submit description file. Any comments in a submit description file should be placed on their own line.
  Next, the universe command defines a specific type of execution environment for your job.
 universe = vanilla 
All batch jobs submitted to PCF should use the default vanilla universe.
Changed:
<
<
The executable command specifies the name of the executable you want to run.
 executable = pi.sh 
Only one executable command should be specified in any submit description file. If no path or a relative path is used, then the executable is presumed to be relative to the current working directory of the user when the condor_submit command was issued. In this example, the executable is a bash shell script named bash_pi.sh, which uses a simple Monte Carlo method to estimate the value of Pi.
>
>
The executable command specifies the name of the executable you want to run.
 executable = bash_pi.sh 
Only one executable command should be specified in any submit description file. If no path or a relative path is used, then the executable is presumed to be relative to the current working directory of the user when the condor_submit command was issued. In this example, the executable is a bash shell script named bash_pi.sh, which uses a simple Monte Carlo method to estimate the value of Pi.
  To successfully run this example script, a user is required to provide three command-line arguments: (1) the size of integers to use in bytes, (2) the number of decimal places to round the estimate of Pi, and (3) the number of Monte Carlo samples. These command-line arguments are passed to the script in the submit description file via the arguments command.
 arguments = -b 8 -r 5 -s 10000 
Here, the argument command indicates the script should use 8-byte integers, round the estimate of Pi to 5 decimal places, and take 10000 Monte Carlo samples.
Changed:
<
<
The should_transfer_files command determines if HTCondor transfers files to and from the remote machine where your job runs.
 should_transfer_files = YES 
YES will cause HTCondor to always transfer input and output files for your jobs. However, total input and output data for each job using the HTCondor file transfer mechanism should be kept to less than 5 GB to allow the data to be successfully pulled from PCF by your jobs, processed on the remote machines where they will run, and then pushed back to your home directory on PCF. If your requirements exceed this 5 GB per job limit, please consult the PCF system adminstrators to assist you with setting up an alternative file transfer mechanism.
>
>
The should_transfer_files command determines if HTCondor transfers files to and from the remote machine where your job runs.
 should_transfer_files = YES 
YES will cause HTCondor to always transfer input and output files for your jobs. However, the total amount of input and output data for each job using the HTCondor file transfer mechanism should be kept to less than 5 GB to allow the data to be successfully pulled from PCF by your jobs, processed on the remote machines where they will run, and then pushed back to your home directory on PCF. If your requirements exceed this 5 GB per job limit, please consult the PCF system administrators to assist you with setting up an alternative file transfer mechanism.
  The when_to_transfer_output command determines when HTCondor transfers your job's output files back to PCF. If when_to_transfer_output is set to ON_EXIT, HTCondor will transfer the file listed in the output command back to PCF, as well as any other files created by the job in its remote scratch directory, but only when the job exits on its own.
 when_to_transfer_output = ON_EXIT 
If when_to_transfer_output is set to ON_EXIT_OR_EVICT, then the output files are transferred back to PCF any time the job leaves a remote machine, either because it exited on its own, or was evicted by HTCondor for any reason prior to job completion. Any output files transferred back to PCF upon eviction are then automatically sent back out again as input files if the job restarts. This option is intended for fault tolerant jobs which periodically save their own state and are designed to restart where they left off.
Changed:
<
<
The output and error commands provide the paths and filenames used by HTCondor to capture any output and error messages your executable would normally write to stdout and stderr. Similarly, the log command is used to provide the path and filename for the HTCondor job event log, which is a chronological list of events that occur in as a job runs.
>
>
The output and error commands provide the paths and filenames used by HTCondor to capture any output and error messages your executable would normally write to stdout and stderr. Similarly, the log command is used to provide the path and filename for the HTCondor job event log, which is a chronological list of events that occur as a job runs.
 
 output = pi.out.$(ClusterId).$(ProcId)
 error = pi.err.$(ClusterId).$(ProcId)
 log = pi.log.$(ClusterId).$(ProcId) 
Changed:
<
<
Note that each of these commands in the sample submit description file use the $(ClusterId) and $(ProcId) variables to define the filenames. This will append the $(ClusterId? ) and $(ProcId? ) number of each HTCondor job to their respective output, error, and job event log files. This especially useful in separately tagging these output, error, and log files for each job when a submit description file is used to queue many jobs all at once.
>
>
Note that each of these commands in the sample submit description file uses the $(ClusterId) and $(ProcId) variables to define the filenames. This will append the $(ClusterId) and $(ProcId) numbers of each HTCondor job to their respective output, error, and job event log files. This is especially useful for tagging the output, error, and log files of an individual job when a submit description file is used to queue many jobs all at once.
  Next in the sample submit description file are the standard resource request commands:request_cpus, request_disk, and request_memory.
 request_cpus = 1 
 request_disk = 8000000
 request_memory = 1024 
Changed:
<
<
These commands tell HTCondor what resources in terms of CPU (number of cores), disk (by default in KiB? ), and memory (by default in MiB? ) are required to successfully run your job. It is important to provide this information in your submit description files as accurately as possible since HTCondor will use these requirements to match your job to a machine that can provides such resources. Otherwise, you job may fail when it is matched with and attempts to run on a machine without sufficient resources. All jobs submitted to PCF should contain these request commands. In general, you may assume that any job submitted to PCF can safely use up to 8 CPU-cores, 20 GB of disk space, and 2 GB of memory per CPU-core requested.
>
>
These commands tell HTCondor what resources --- CPUs in number of cores, disk in KiB? (default), and memory in MiB? (default) --- are required to successfully run your job. It is important to provide this information in your submit description files as accurately as possible since HTCondor will use these requirements to match your job to a machine that can provides such resources. Otherwise, your job may fail when it is matched with and attempts to run on a machine without sufficient resources. All jobs submitted to PCF should contain these request commands. In general, you may assume that any job submitted to PCF can safely use up to 8 CPU-cores, 20 GB of disk space, and 2 GB of memory per CPU-core requested. Note: You can avoid using the default units of KiB? and MiB? for the request_disk and request_memory commands by appending the characters K (or KB), M (or MB), G (or GB), or T (or TB) to their numerical value to indicate the units to be used.
 
Changed:
<
<
HTCondor allows users (and system administrators) to append custom attributes to any job at the time of submission. On PCF, these custom attributes are used to mark jobs for special routing and accounting purposes. For example,
 +ProjectName = "PCFOSGUCSD" 
is a job attribute used by the Open Science Grid (OSG) for tracking resource usage by group. All jobs submitted to PCF should contain this +ProjectName = "PCFOSGUCSD" attribute, unless directed otherwise.
>
>
HTCondor allows users (and system administrators) to append custom attributes to any job at the time of submission. On PCF, some of these custom attributes are used to mark jobs for special routing and accounting purposes. For example,
 +ProjectName = "PCFOSGUCSD" 
is a job attribute used by the Open Science Grid (OSG) for tracking resource usage by group. All jobs submitted to PCF, including yours, should contain this +ProjectName = "PCFOSGUCSD" attribute, unless directed otherwise.
  The next set of custom job attributes in the sample submit description file
Changed:
<
<
 +local = true
 +site_local = false
 +sdsc = false
 +uc = false 
are a set of boolean job routing flags that allow you to explicitly target where your jobs may run. Each one of these boolean flags is associated with one of the different computing resources accessible from PCF. When you set the value of one of these resource flags to true, you permit your jobs to run on the system associated with that flag. In contrast, when you set the value of the resource flag to false, you prevent your jobs from running on that system. The relationship between each job routing flag and computing resource is provided in the following table.
>
>
 +local = TRUE
 +site_local = FALSE
 +sdsc = FALSE
 +uc = FALSE 
are a set of boolean job routing flags that allow you to explicitly target where your jobs may run. Each one of these boolean flags is associated with one of the different computing resources accessible from PCF. When you set the value of one of these resource flags to TRUE, you permit your jobs to run on the system associated with that flag. In contrast, when you set the value of the resource flag to FALSE, you prevent your jobs from running on that system. The relationship between each job routing flag and computing resource is provided in the following table.
 
Job Routing Flag Default Value Computing Resource Accessibility
Changed:
<
<
+local true pcf-osg.t2.ucsd.edu Open to all PCF users
+site_local true CMS Tier 2 Cluster Open to all PCF users
+sdsc false Comet Supercomputer Open only to PCF users with an XSEDE allocation on Comet
+uc false Open Science Grid Open to all PCF users
>
>
+local TRUE pcf-osg.t2.ucsd.edu Open to all PCF users
+site_local TRUE CMS Tier 2 Cluster Open to all PCF users
+sdsc FALSE Comet Supercomputer Open only to PCF users with an XSEDE allocation on Comet
+uc FALSE Open Science Grid Open to all PCF users
 
Added:
>
>
As such, we see here that the sample submit description file is only targeted to run the job locally on PCF itself.
 
Added:
>
>
Finally, the sample submit description file ends with the queue command, which in the form shown here simply places an integer number of copies (10) of the job in the HTCondor queue upon submission. If no integer value is given with the queue command, the default value is 1. Every submit description file must contain at least one queue command.
  requirements

It is important to note here that the name of this shell script was not chosen randomly. While other batch systems like SLURM and PBS use standard shell scripts annotated with directives for both communicating the requirements of a batch job to their schedulers and how the job's executable should be run, HTCondor does not work this way. In general, an HTCondor submit description file separates the directives (or commands) to the scheduler from how the executable should be run (e.g., how it would look if run interactively from the command line). As such, it is often the case that HTCondor users will need to wrap their actual (payload) executable within a shell script as shown here in this sample submit description file. Here, that executable is represented by job.x in the transfer_input_files command.

Changed:
<
<

Querying Job Status

>
>

Job Status

Once you submit a job to PCF, you can periodically check on its status by using the condor_q command. There will likely always be other user jobs in the queue besides your own. Therefore, in general, you will want to issue the command by providing your username as an argument.

 [youradusername@pcf-osg ~]$ condor_q youradusername

 -- Schedd: pcf-osg.t2.ucsd.edu : <169.228.130.75:9615?...
 ID        OWNER                  SUBMITTED    RUN_TIME   ST PRI SIZE CMD               
 16661.0   youradusername         1/12 14:51   0+00:00:04 R  0   0.0  pi.sh -b 8 -r 7 -s
 16661.1   youradusername         1/12 14:51   0+00:00:04 R  0   0.0  pi.sh -b 8 -r 7 -s
 16661.2   youradusername         1/12 14:51   0+00:00:04 R  0   0.0  pi.sh -b 8 -r 7 -s
 16661.3   youradusername         1/12 14:51   0+00:00:04 R  0   0.0  pi.sh -b 8 -r 7 -s
 16661.4   youradusername         1/12 14:51   0+00:00:03 R  0   0.0  pi.sh -b 8 -r 7 -s
 16661.5   youradusername         1/12 14:51   0+00:00:03 R  0   0.0  pi.sh -b 8 -r 7 -s
 16661.6   youradusername         1/12 14:51   0+00:00:03 R  0   0.0  pi.sh -b 8 -r 7 -s
 16661.7   youradusername         1/12 14:51   0+00:00:03 R  0   0.0  pi.sh -b 8 -r 7 -s
 16661.8   youradusername         1/12 14:51   0+00:00:03 R  0   0.0  pi.sh -b 8 -r 7 -s
 16661.9   youradusername         1/12 14:51   0+00:00:03 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.0   youradusername         1/12 14:51   0+00:00:03 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.1   youradusername         1/12 14:51   0+00:00:03 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.2   youradusername         1/12 14:51   0+00:00:03 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.3   youradusername         1/12 14:51   0+00:00:02 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.4   youradusername         1/12 14:51   0+00:00:02 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.5   youradusername         1/12 14:51   0+00:00:02 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.6   youradusername         1/12 14:51   0+00:00:02 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.7   youradusername         1/12 14:51   0+00:00:02 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.8   youradusername         1/12 14:51   0+00:00:01 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.9   youradusername         1/12 14:51   0+00:00:01 R  0   0.0  pi.sh -b 8 -r 7 -s

 20 jobs; 0 completed, 0 removed, 0 idle, 20 running, 0 held, 0 suspended 

This will limit the status information returned condor_q to your user jobs only. However, if there is a particular subset of your jobs you're interested in checking up on, you can also limit the status information by providing the specific job ClusterId as an argument to condor_q.

 [youradusername@pcf-osg ~]$ condor_q 16662

 -- Schedd: pcf-osg.t2.ucsd.edu : <169.228.130.75:9615?...
  ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
 16662.0   mkandes         1/12 14:51   0+00:01:53 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.1   mkandes         1/12 14:51   0+00:01:53 R  0   0.0  pi.sh -b 8 -r 7 -s 
 16662.2   mkandes         1/12 14:51   0+00:01:53 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.3   mkandes         1/12 14:51   0+00:01:52 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.4   mkandes         1/12 14:51   0+00:01:52 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.5   mkandes         1/12 14:51   0+00:01:52 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.6   mkandes         1/12 14:51   0+00:01:52 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.7   mkandes         1/12 14:51   0+00:01:52 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.8   mkandes         1/12 14:51   0+00:01:51 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.9   mkandes         1/12 14:51   0+00:01:51 R  0   0.0  pi.sh -b 8 -r 7 -s

 10 jobs; 0 completed, 0 removed, 0 idle, 10 running, 0 held, 0 suspended 

mkandes@pcf-osg ~$ condor_q 16662.4 -l | less

MATCH_EXP_JOB_GLIDEIN_Entry_Name = "Unknown" MATCH_EXP_JOB_GLIDEIN_Schedd = "Unknown" MaxHosts? = 1 MATCH_EXP_JOBGLIDEIN_ResourceName = "UCSD" User = "mkandes@pcf-osg.t2.ucsd.edu" EncryptExecuteDirectory? = false MATCH_GLIDEIN_ClusterId = "Unknown" OnExitHold? = false CoreSize? = 0 JOB_GLIDEIN_SiteWMS = "$$(GLIDEIN_SiteWMS:Unknown)" MATCH_GLIDEIN_Factory = "Unknown" MachineAttrCpus0? = 1 WantRemoteSyscalls? = false MyType? = "Job" Rank = 0.0 CumulativeSuspensionTime? = 0 MinHosts? = 1 MATCH_EXP_JOB_GLIDEIN_SiteWMS_Slot = "Unknown" PeriodicHold? = false PeriodicRemove? = false Err = "pi.err.16662.4" ProcId? = 4

-analyze

Job Removal

[1514] mkandes@pcf-osg ~$ condor_rm 16662.4 Job 16662.4 marked for removal

condor_q 16662

-- Schedd: pcf-osg.t2.ucsd.edu : <169.228.130.75:9615?... ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 16662.0 mkandes 1/12 14:51 0+00:23:04 R 0 26.9 pi.sh -b 8 -r 7 -s 16662.1 mkandes 1/12 14:51 0+00:23:04 R 0 26.9 pi.sh -b 8 -r 7 -s 16662.2 mkandes 1/12 14:51 0+00:23:04 R 0 26.9 pi.sh -b 8 -r 7 -s 16662.3 mkandes 1/12 14:51 0+00:23:03 R 0 26.9 pi.sh -b 8 -r 7 -s 16662.5 mkandes 1/12 14:51 0+00:23:03 R 0 26.9 pi.sh -b 8 -r 7 -s 16662.6 mkandes 1/12 14:51 0+00:23:03 R 0 26.9 pi.sh -b 8 -r 7 -s 16662.7 mkandes 1/12 14:51 0+00:23:03 R 0 26.9 pi.sh -b 8 -r 7 -s 16662.8 mkandes 1/12 14:51 0+00:23:02 R 0 26.9 pi.sh -b 8 -r 7 -s 16662.9 mkandes 1/12 14:51 0+00:23:02 R 0 26.9 pi.sh -b 8 -r 7 -s

9 jobs; 0 completed, 0 removed, 0 idle, 9 running, 0 held, 0 suspended

 
Changed:
<
<

Removing Jobs

>
>

Job History

 
Changed:
<
<

Software Available

>
>

Available Software

  Environment modules provide users with an easy way to access different versions of software and to access various libraries, compilers, and software. All user jobs running on computing resources accessible to PCF should have access to the

Revision 15 - 2017/01/11 - Main.MartinKandes

Line: 1 to 1
 
META TOPICPARENT name="WebHome"
Line: 42 to 42
 

Running Jobs

Changed:
<
<
PCF uses HTCondor to manage batch job submission to the high-throughput computing resources its users may access. Jobs can be submitted to PCF using the condor_submit command. The command as follows:
>
>

Job Submission

PCF uses HTCondor to manage batch job submission to the high-throughput computing resources its users may access. Jobs can be submitted to PCF using the condor_submit command as follows:

 
 [youradusername@pcf-osg ~]$ condor_submit job.condor 

where job.condor is the name of a UNIX formatted plain ASCII file known as a submit description file. This file contains special commands, directives, expressions, statements, and variables used to specify information about your batch job to HTCondor, such as what executable to run, the files to use for standard input, standard output, and standard error, as well as the resources required to successfully run the job.

Changed:
<
<

HTCondor Submit Description Files

>
>

Submit Description Files

 
Changed:
<
<
A sample submit description file (pi.condor) is shown below.
>
>
A sample HTCondor submit description file (bash_pi.condor) is shown below.
 
Changed:
<
<
 # =====================================================================
 # A sample HTCondor submit description file
 # ---------------------------------------------------------------------

>
>
 # A sample HTCondor submit description file

  universe = vanilla
Changed:
<
<
executable = pi.sh
>
>
executable = bash_pi.sh
  arguments = -b 8 -r 5 -s 10000 should_transfer_files = YES when_to_transfer_output = ON_EXIT
Changed:
<
<
output = pi.out.$(ClusterId? ).$(ProcId? ) error = pi.err.$(ClusterId? ).$(ProcId? ) log = pi.log.$(ClusterId? ).$(ProcId? )
>
>
output = bash_pi.out.$(ClusterId? ).$(ProcId? ) error = bash_pi.err.$(ClusterId? ).$(ProcId? ) log = bash_pi.log.$(ClusterId? ).$(ProcId? )
  request_cpus = 1 request_disk = 8000000 request_memory = 1024
Line: 71 to 71
  +site_local = false +sdsc = false +uc = false
Changed:
<
<
queue 10 # =================================================================
>
>
queue 10
 
Changed:
<
<
In a submit description file, the universe command defines a specific type of execution environment for the job. All batch jobs submitted to PCF should use the default vanilla universe as indicated here.
>
>
Let's breakdown this sample submit description file line-by-line to provide you with some background and guidance on how to construct your own submit description files. The first line
 # A sample HTCondor submit description file 
is simply a comment line in the submit description file. Any comments should be placed on their own line.
 
Changed:
<
<
Next, the executable command specifies the name of the executable you want to run. Only one executable command should be specified in any submit description file. If no path or a relative path is used, then the executable is presumed to be relative to the current working directory of the user when the condor_submit command was issued. In this example, the executable is a bash shell script named pi.sh, which estimates the value of Pi using a Monte Carlo method.
>
>
Next, the universe command defines a specific type of execution environment for your job.
 universe = vanilla 
All batch jobs submitted to PCF should use the default vanilla universe.
 
Changed:
<
<
To successfully run this example script, a user is required to provide three command-line arguments: (1) the size of integers to use in bytes, (2) the number of decimal places to round the estimate of Pi, and (3) the number of Monte Carlo samples. These command-line arguments are passed to the script in the submit description file via the arguments command. Here, the argument command indicates the script should use 8-byte integers, round the estimate of Pi to 5 decimal places, and take 10000 Monte Carlo samples.
>
>
The executable command specifies the name of the executable you want to run.
 executable = pi.sh 
Only one executable command should be specified in any submit description file. If no path or a relative path is used, then the executable is presumed to be relative to the current working directory of the user when the condor_submit command was issued. In this example, the executable is a bash shell script named bash_pi.sh, which uses a simple Monte Carlo method to estimate the value of Pi.
 
Changed:
<
<
The should_transfer_files command determines if HTCondor transfers files to and from the remote machine where your job runs. YES will cause HTCondor to always transfer input and output files for your jobs. However, total input and output data for each job using the HTCondor file transfer mechanism should be kept to less than 5 GB to allow the data to be successfully pulled from PCF by your jobs, processed on the remote machines where they will run, and then pushed back to your home directory on PCF. If your requirements exceed this 5 GB per job limit, please consult the PCF system adminstrators to assist you with setting up an alternative file transfer mechanism.
>
>
To successfully run this example script, a user is required to provide three command-line arguments: (1) the size of integers to use in bytes, (2) the number of decimal places to round the estimate of Pi, and (3) the number of Monte Carlo samples. These command-line arguments are passed to the script in the submit description file via the arguments command.
 arguments = -b 8 -r 5 -s 10000 
Here, the argument command indicates the script should use 8-byte integers, round the estimate of Pi to 5 decimal places, and take 10000 Monte Carlo samples.
 
Changed:
<
<
The when_to_transfer_output command tells HTCondor when output files are to be transferred back to PCF. If when_to_transfer_output is set to ON_EXIT, HTCondor will transfer the file listed in the output command, as well as any other files created by the job in its remote scratch directory, back to PCF, but only when the job exits on its own. If when_to_transfer_output is set to ON_EXIT_OR_EVICT, then the output files are transferred back to PCF any time the job leaves a remote machine, either because it exited on its own, or was evicted by HTCondor for any reason prior to job completion. Any output files transferred back to PCF upon evection are then automatically sent back out again as input files if the job restarts. This option is intended for fault tolerant jobs which periodically save their own state and are designed to restart where they left off.
>
>
The should_transfer_files command determines if HTCondor transfers files to and from the remote machine where your job runs.
 should_transfer_files = YES 
YES will cause HTCondor to always transfer input and output files for your jobs. However, total input and output data for each job using the HTCondor file transfer mechanism should be kept to less than 5 GB to allow the data to be successfully pulled from PCF by your jobs, processed on the remote machines where they will run, and then pushed back to your home directory on PCF. If your requirements exceed this 5 GB per job limit, please consult the PCF system adminstrators to assist you with setting up an alternative file transfer mechanism.
 
Changed:
<
<
The output and error commands provide the paths and filenames used by HTCondor to capture any output and error messages your executable would normally write to stdout and stderr. Similarly, the log command is used to provide the path and filename for the HTCondor job event log, which is a chronological list of events that occur in as a job runs. Note that each of these commands in the sample submit description file use the $(ClusterId) and $(ProcId) automatic variables to define their respective filenames. This will append the $(ClusterId? ) and $(ProcId? ) of each HTCondor job to their output, error, and job event log files, which becomes quite useful when a submit description file is used to queue many jobs all at once.
>
>
The when_to_transfer_output command determines when HTCondor transfers your job's output files back to PCF. If when_to_transfer_output is set to ON_EXIT, HTCondor will transfer the file listed in the output command back to PCF, as well as any other files created by the job in its remote scratch directory, but only when the job exits on its own.
 when_to_transfer_output = ON_EXIT 
If when_to_transfer_output is set to ON_EXIT_OR_EVICT, then the output files are transferred back to PCF any time the job leaves a remote machine, either because it exited on its own, or was evicted by HTCondor for any reason prior to job completion. Any output files transferred back to PCF upon eviction are then automatically sent back out again as input files if the job restarts. This option is intended for fault tolerant jobs which periodically save their own state and are designed to restart where they left off.
 
Changed:
<
<
It is important to note here that the name of this shell script was not chosen randomly. While other batch systems like SLURM and PBS use standard shell scripts annotated with directives for both communicating the requirements of a batch job to their schedulers and how the job's executable should be run, HTCondor does not work this way. In general, an HTCondor submit description file separates the directives (or commands) to the scheduler from how the executable should be run (e.g., how it would look if run interactively from the command line). As such, it is often the case that HTCondor users will need to wrap their actual (payload) executable within a shell script as shown here in this sample submit description file. Here, that executable is represented by job.x in the transfer_input_files command.
>
>
The output and error commands provide the paths and filenames used by HTCondor to capture any output and error messages your executable would normally write to stdout and stderr. Similarly, the log command is used to provide the path and filename for the HTCondor job event log, which is a chronological list of events that occur as a job runs.
 output = pi.out.$(ClusterId).$(ProcId)
 error = pi.err.$(ClusterId).$(ProcId)
 log = pi.log.$(ClusterId).$(ProcId) 
Note that each of these commands in the sample submit description file uses the $(ClusterId) and $(ProcId) automatic variables to define the filenames. This appends the cluster and process number of each HTCondor job to its respective output, error, and job event log files; for example, the first job in a cluster numbered 1234 would write pi.out.1234.0, pi.err.1234.0, and pi.log.1234.0. This is especially useful for separately tagging these output, error, and log files for each job when a submit description file is used to queue many jobs all at once.
 
Added:
>
>
Next in the sample submit description file are the standard resource request commands: request_cpus, request_disk, and request_memory.
 request_cpus = 1 
 request_disk = 8000000
 request_memory = 1024 
These commands tell HTCondor what resources in terms of CPU (number of cores), disk (by default in KiB), and memory (by default in MiB) are required to successfully run your job. It is important to provide this information in your submit description files as accurately as possible since HTCondor will use these requirements to match your job to a machine that can provide such resources. Otherwise, your job may fail when it is matched with and attempts to run on a machine without sufficient resources. All jobs submitted to PCF should contain these request commands. In general, you may assume that any job submitted to PCF can safely use up to 8 CPU-cores, 20 GB of disk space, and 2 GB of memory per CPU-core requested.
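For example, a job that wants the full per-job allowance described above could request it explicitly using unit suffixes (a sketch only; request what your job actually needs):
 request_cpus = 8
 request_disk = 20GB
 request_memory = 16GB 
Here 16 GB simply reflects the guideline of 2 GB of memory per CPU-core multiplied by the 8 requested cores.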
 
Changed:
<
<
HTCondor Submit Description File Command Description
universe = vanilla A universe in HTCondor defines an execution environment. All jobs submitted to PCF should use the default 'vanilla' universe.
executable = [nameofexe] The name of the executable for this batch job. Only one executable command within a submit description file should be specified. If no path or a relative path is used, then the executable is presumed to be relative to the current working directory of the user when the condor_submit command is issued.
arguments = [argument_list] List of arguments to be supplied to the executable as part of the command line.
should_transfer_files = [ YES / NO / IF_NEEDED ] The should_transfer_files setting determines if HTCondor transfers files to and from the remote machine where a job runs. YES will cause HTCondor to always transfer files for the job. NO disables the file transfer mechanism. IF_NEEDED will not transfer files for the job if it is matched with a local resource that shares the same file system as the submit machine. If the job is matched with a remote resource, which does not have a shared file system, then HTCondor will transfer the necessary files.
transfer_input_files = [file1,file2,file...] A comma-delimited list of all the files and directories to be transferred into the working directory for the job, before the job is started. By default, the file specified in the executable command and any file specified in the input command are transferred.
when_to_transfer_output = [ ON_EXIT / ON_EXIT_OR_EVICT ] Setting when_to_transfer_output equal to ON_EXIT will cause HTCondor to transfer the job's output files back to the submitting machine only when the job completes (exits on its own). The ON_EXIT_OR_EVICT option is intended for fault tolerant jobs which periodically save their own state and can restart where they left off. In this case, files are spooled to the submit machine any time the job leaves a remote site, either because it exited on its own, or was evicted by the HTCondor system for any reason prior to job completion. The files spooled back are placed in a directory defined by the value of the SPOOL configuration variable. Any output files transferred back to the submit machine are automatically sent back out again as input files if the job restarts.
output = [outputfile] Specifies standard output (stdout) for batch job.
error = [errorfile] Specifies standard error (stderr) for batch job.
log = [logfile] Contains events the batch job had during its lifetime inside of HTCondor.
requirements = [booleanexpression] The requirements command is a boolean expression which uses C-like operators. In order for any job to run on a given machine, this requirements expression must evaluate to true on the given machine.
request_cpus = [numberofcpus] A requested number of CPUs (cores). If not specified, the number requested will be 1. If specified, the expression && (RequestCpus <= Target.Cpus) is appended to the requirements expression for the job.
request_disk = [amountofdisk] The requested amount of disk space in KiB requested for this job. If not specified, it will be set to the job ClassAd attribute DiskUsage. However, if specified, then the expression && (RequestDisk <= Target.Disk) is appended to the requirements expression for the job. Characters may be appended to a numerical value to indicate units. K or KB indicates KiB, M or MB indicates MiB, G or GB indicates GiB, and T or TB indicates TiB.
request_memory = [amountofmemory] The required amount of memory in MiB that this job needs to avoid excessive swapping. The actual amount of memory used by a job is represented by the job ClassAd attribute MemoryUsage. If specified, the expression && (RequestMemory <= Target.Memory) is appended to the requirements expression for the job. If not specified, a default of 1024 MiB defined in the PCF system configuration will be used. Characters may be appended to a numerical value to indicate units. K or KB indicates KiB, M or MB indicates MiB, G or GB indicates GiB, and T or TB indicates TiB.
periodic_remove = [booleanexpression] This expression is checked periodically at an interval of the number of seconds set by the HTCondor configuration variable PERIODIC_EXPR_INTERVAL. If it becomes True, the job is removed from the queue. If unspecified, the default value is False.
on_exit_hold = [booleanexpression] This expression is checked when the job exits, and if True, places the job into the HTCondor Hold state. If False (the default value when not defined), then nothing happens and the on_exit_remove expression is checked to determine if that needs to be applied.
+[customattributename] = [customattributevalue] HTCondor allows users to add their own HTCondor job ClassAd attributes at submission. On PCF, these custom attributes are used to mark jobs for special routing and accounting purposes, which will be explained further below.
queue [integer] Places zero or more copies of the job into the HTCondor queue.
>
>
HTCondor allows users (and system administrators) to append custom attributes to any job at the time of submission. On PCF, these custom attributes are used to mark jobs for special routing and accounting purposes. For example,
 +ProjectName = "PCFOSGUCSD" 
is a job attribute used by the Open Science Grid (OSG) for tracking resource usage by group. All jobs submitted to PCF should contain this +ProjectName = "PCFOSGUCSD" attribute, unless directed otherwise.
 
Added:
>
>
The next set of custom job attributes in the sample submit description file
 +local = true
 +site_local = false
 +sdsc = false
 +uc = false 
are a set of boolean job routing flags that allow you to explicitly target where your jobs may run. Each one of these boolean flags is associated with one of the different computing resources accessible from PCF. When you set the value of one of these resource flags to true, you permit your jobs to run on the system associated with that flag. In contrast, when you set the value of the resource flag to false, you prevent your jobs from running on that system. The relationship between each job routing flag and computing resource is provided in the following table.
 
Added:
>
>
Job Routing Flag Default Value Computing Resource Accessibility
+local true pcf-osg.t2.ucsd.edu Open to all PCF users
+site_local true CMS Tier 2 Cluster Open to all PCF users
+sdsc false Comet Supercomputer Open only to PCF users with an XSEDE allocation on Comet
+uc false Open Science Grid Open to all PCF users
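For example, to restrict a job to run only on the Comet Supercomputer at SDSC (which, per the table above, requires an XSEDE allocation on Comet), the default flag values could be overridden in the submit description file as follows:
 +local = false
 +site_local = false
 +sdsc = true
 +uc = false 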
 
Deleted:
<
<
The UCLHC setup allows you to chose a particular domain to run on. By default jobs will run on the slots locally in the brick, as well as in the local batch system of the site. You can further choose to run outside to all UCs and also to the SDSC Comet cluster. These are each controlled by adding special booleans to the submit file. The following table lists the flags, their defaults, and descriptions: flag default description +local true run on the brick +site_local true run in your own local site batch system +sdsc false run at Comet +uc false run at all other UCs
 
Added:
>
>
requirements

It is important to note here that the name of this shell script was not chosen randomly. While other batch systems like SLURM and PBS use standard shell scripts annotated with directives for both communicating the requirements of a batch job to their schedulers and how the job's executable should be run, HTCondor does not work this way. In general, an HTCondor submit description file separates the directives (or commands) to the scheduler from how the executable should be run (e.g., how it would look if run interactively from the command line). As such, it is often the case that HTCondor users will need to wrap their actual (payload) executable within a shell script as shown here in this sample submit description file. Here, that executable is represented by job.x in the transfer_input_files command.
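As a minimal sketch of what such a wrapper might look like, the script below runs the payload executable job.x (transferred in via transfer_input_files) with the two arguments supplied by the arguments command; the filenames and argument handling are illustrative assumptions, not a prescribed PCF interface.
 #!/bin/bash
 # jobwrapper.sh - hypothetical wrapper around the payload executable job.x
 # HTCondor has already transferred job.x and job.input into the job's
 # scratch directory before this script starts.
 set -e
 chmod +x job.x                  # make sure the transferred payload is executable
 ./job.x "$1" "$2" < job.input   # forward the arguments from the submit file 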

 
Changed:
<
<
All computationally intensive batch jobs should be targeted to run only on these remote computing resources and not locally on PCF itself. PCF's own local resources should be reserved for development and testing purposes only. Do not run any test workloads interactively on PCF.
>
>

Querying Job Status

 
Changed:
<
<
Note, however, users MUST have an XSEDE allocation on Comet to access it.
>
>

Removing Jobs

 
Changed:
<
<

Modules

>
>

Software Available

  Environment modules provide users with an easy way to access different versions of libraries, compilers, and other software. All user jobs running on computing resources accessible to PCF should have access to the
Line: 222 to 220
  --config-file $AWS_USER_CONFIG"
Deleted:
<
<

Additional Documentation

 
Deleted:
<
<
Important Note: Do not include use_x509userproxy in your job submit files.

Job Submission

This section shows the basics needed to start submitting jobs through HTCondor. For more detailed instructions about using HTCondor, please see the link to the user manual below in the References section.

Submit File

In order to submit jobs through condor, you must first write a submit file. The name of the file is arbitrary but we will call it job.condor in this document.

Example submit file:

universe = vanilla
executable = test.sh
arguments = 300
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
log = logs/test.log
output = logs/test.out.$(Cluster).$(Process)
error = logs/test.err.$(Cluster).$(Process)
use_x509userproxy = True
notification = Never
queue

This example assumes job.condor and the test.sh executable are in the current directory, and a logs subdirectory is also already present in the current directory. Condor will create the test.log and send the job's stdout and stderr to test.out.$(Cluster).$(Process) and test.err.$(Cluster).$(Process) respectively.

Jobs can be submitted to condor using the following command:

condor_submit job.condor

Targeting Resources

The UCLHC setup allows you to choose a particular domain to run on. By default jobs will run on the slots locally in the brick, as well as in the local batch system of the site. You can further choose to run outside to all UCs and also to the SDSC Comet cluster. These are each controlled by adding special booleans to the submit file. The following table lists the flags, their defaults, and descriptions:

flag default description
+local true run on the brick
+site_local true run in your own local site batch system
+sdsc false run at Comet
+uc false run at all other UCs

Example submit file to restrict jobs to only run at SDSC and not locally:

universe = vanilla
+local = false
+site_local = false
+sdsc = true
executable = test.sh
arguments = 300
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
log = logs/test.log
output = logs/test.out.$(Cluster).$(Process)
error = logs/test.err.$(Cluster).$(Process)
use_x509userproxy = True
notification = Never
queue 

Querying Jobs

The following will show a list of your jobs in the queue:

 condor_q <username>

Screen dump:

[1627] jdost@uclhc-1 ~$ condor_q jdost


-- Submitter: uclhc-1.ps.uci.edu : <192.5.19.13:9615?sock=76988_ce0d_4> : uclhc-1.ps.uci.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  29.0   jdost           8/21 16:25   0+00:01:46 R  0   0.0  test.sh 300       
  29.1   jdost           8/21 16:25   0+00:01:46 R  0   0.0  test.sh 300       
  29.2   jdost           8/21 16:25   0+00:01:46 R  0   0.0  test.sh 300       
  29.3   jdost           8/21 16:25   0+00:01:46 R  0   0.0  test.sh 300       
  29.4   jdost           8/21 16:25   0+00:01:46 R  0   0.0  test.sh 300       

5 jobs; 0 completed, 0 removed, 0 idle, 5 running, 0 held, 0 suspended

Detailed classads can be dumped for a particular job with the -l flag:

condor_q -l $(Cluster).$(Process)

Canceling Jobs

You can cancel all of your own jobs at any time with the following:

condor_rm <username>

Or alternatively choose a specific job with the $(Cluster).$(Process) numbers, e.g.:

condor_rm 26.0

Important Note: The following must be included in your submit files:

+ProjectName = "PCFOSGUCSD"

Transferring Output

Since xrootd is configured as a read-only system, you should use the condor file transfer mechanism to transfer job output back home to the brick.

The following example assumes the test.sh executable generates an output file called test.out. This is an example of a condor submit file to make condor transfer the output back to the user data area. The relevant attributes are in bold:

universe = vanilla
executable = test.sh
arguments = 300
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_output_files = test.out
transfer_output_remaps = "test.out = /data/uclhc/ucsds/user/jdost/test.out"
log = logs/test.log
output = logs/test.out.$(Cluster).$(Process)
error = logs/test.err.$(Cluster).$(Process)
use_x509userproxy = True
notification = Never
queue  

Note that transfer_output_remaps is used here because, without it, condor will by default return the output file to the working directory from which condor_submit was run.

 
Changed:
<
<

References

>
>

Additional Documentation

 
Added:
>
>
 

Revision 142017/01/10 - Main.MartinKandes

Line: 1 to 1
 
META TOPICPARENT name="WebHome"
Line: 24 to 24
 
Changed:
<
<
While users may submit and run jobs locally on PCF itself, in general, all computationally intensive jobs should be run only on these larger compute resources, reserving PCF's local resources for development and testing purposes only.
>
>
While users may submit and run jobs locally on PCF itself, in general, all computationally intensive jobs should be run only on the larger computing resources, reserving PCF's local resources for development and testing purposes only.
 

System Status

Line: 42 to 42
 

Running Jobs

Changed:
<
<
PCF uses HTCondor to manage batch job submission to the high-throughput computing resources its users may access. Jobs can be submitted to PCF using the condor_submit command. For example, you would run this command as follows:
>
>
PCF uses HTCondor to manage batch job submission to the high-throughput computing resources its users may access. Jobs can be submitted to PCF using the condor_submit command. The command is run as follows:
 
 [youradusername@pcf-osg ~]$ condor_submit job.condor 
Changed:
<
<
where job.condor is the name of a UNIX formatted plain ASCII file known as the submit description file. This file contains special commands, directives, expressions, statements, and variables used to specify information about your batch job to HTCondor, such as what executable to run, the files to use for standard input, standard output, and standard error, as well as the resources required to successfully run the job.
>
>
where job.condor is the name of a UNIX formatted plain ASCII file known as a submit description file. This file contains special commands, directives, expressions, statements, and variables used to specify information about your batch job to HTCondor, such as what executable to run, the files to use for standard input, standard output, and standard error, as well as the resources required to successfully run the job.
 
Changed:
<
<
A sample submit description file, pi.condor, is shown below.
>
>

HTCondor Submit Description Files

A sample submit description file (pi.condor) is shown below.

 
 # =====================================================================
 # A sample HTCondor submit description file

Line: 72 to 74
  queue 10 # =================================================================
Changed:
<
<
In an HTCondor submit description file, the universe command defines a specific type of execution environment for the job. All batch jobs submitted to PCF should use the default vanilla universe as indicated here.
>
>
In a submit description file, the universe command defines a specific type of execution environment for the job. All batch jobs submitted to PCF should use the default vanilla universe as indicated here.
 
Changed:
<
<
The executable command specifies the name of the executable you want to run. Only one executable command should be specified in any submit description file. If no path or a relative path is used, then the executable is presumed to be relative to the current working directory of the user when the condor_submit command was issued. In this case, the executable is a bash shell script named pi.sh, which estimates the value of Pi using a Monte Carlo method.
>
>
Next, the executable command specifies the name of the executable you want to run. Only one executable command should be specified in any submit description file. If no path or a relative path is used, then the executable is presumed to be relative to the current working directory of the user when the condor_submit command was issued. In this example, the executable is a bash shell script named pi.sh, which estimates the value of Pi using a Monte Carlo method.
 
Changed:
<
<
To successfully run pi.sh, a user is required to provide the script with three command-line arguments: (1) the size of integers used in bytes, (2) the number of decimal places to round the estimate of pi, and (3) the number of Monte Carlo samples. In the submit description file above, these command-line arguments are passed to the script via the HTCondor arguments command, which indicates the script should use 8-byte integers, round the estimate of Pi to 5 decimal places, and take 10000 Monte Carlo samples.
>
>
To successfully run this example script, a user is required to provide three command-line arguments: (1) the size of integers to use in bytes, (2) the number of decimal places to round the estimate of Pi, and (3) the number of Monte Carlo samples. These command-line arguments are passed to the script in the submit description file via the arguments command. Here, the arguments command indicates the script should use 8-byte integers, round the estimate of Pi to 5 decimal places, and take 10000 Monte Carlo samples.
 
Changed:
<
<
The should_transfer_files command determines if HTCondor transfers files to and from the remote machine where a job runs. YES will cause HTCondor to always transfer files for the job. Total input and output data for each job using the HTCondor file transfer mechanism should be kept to less than 5 GB to allow the data to be successfully pulled from PCF by the jobs, processed on the remote machines where the jobs run, and then pushed back to your home directory on PCF. If your requirements exceed this 5 GB per job limit, please consult the PCF system adminstrators to assist you with setting up an alternative file transfer mechanism.
>
>
The should_transfer_files command determines if HTCondor transfers files to and from the remote machine where your job runs. YES will cause HTCondor to always transfer input and output files for your jobs. However, total input and output data for each job using the HTCondor file transfer mechanism should be kept to less than 5 GB to allow the data to be successfully pulled from PCF by your jobs, processed on the remote machines where they will run, and then pushed back to your home directory on PCF. If your requirements exceed this 5 GB per job limit, please consult the PCF system administrators to assist you with setting up an alternative file transfer mechanism.
 
Changed:
<
<
The when_to_transfer_output command tells HTCondor when output files are to be transferred back to PCF. If when_to_transfer_output is set to ON_EXIT, HTCondor will transfer the file listed in the output command, as well as any other files created by the job in its remote scratch directory, back to PCF, but only when the job exits on its own. If when_to_transfer_output is set to ON_EXIT_OR_EVICT, then the output files are transferred back to PCF any time the job leaves a remote machine, either because it exited on its own, or was evicted by the HTCondor system for any reason prior to job completion. Any output files transferred back to PCF are then automatically sent back out again as input files if the job restarts. This option is intended for fault tolerant jobs which periodically save their own state and are designed to restart where they left off.
>
>
The when_to_transfer_output command tells HTCondor when output files are to be transferred back to PCF. If when_to_transfer_output is set to ON_EXIT, HTCondor will transfer the file listed in the output command, as well as any other files created by the job in its remote scratch directory, back to PCF, but only when the job exits on its own. If when_to_transfer_output is set to ON_EXIT_OR_EVICT, then the output files are transferred back to PCF any time the job leaves a remote machine, either because it exited on its own, or was evicted by HTCondor for any reason prior to job completion. Any output files transferred back to PCF upon eviction are then automatically sent back out again as input files if the job restarts. This option is intended for fault-tolerant jobs which periodically save their own state and are designed to restart where they left off.
 
Changed:
<
<
The output and error commands provide the paths and filenames used by HTCondor to capture any output and error messages your executable would normally write to stdout and stderr. Similarly, the log command is used to provide the path and filename for the job event log, which is a chronological list of events that occur in HTCondor as a job runs. Note that each of these commands in the sample submit description file use the $(ClusterId) and $(ProcId) automatic variables to define their respective filenames. This will append the $(ClusterId? ) and $(ProcId? ) of each HTCondor job to their output, error, and job event log files, which becomes quite useful when a submit description file is used to queue many jobs all at once.
>
>
The output and error commands provide the paths and filenames used by HTCondor to capture any output and error messages your executable would normally write to stdout and stderr. Similarly, the log command is used to provide the path and filename for the HTCondor job event log, which is a chronological list of events that occur as a job runs. Note that each of these commands in the sample submit description file uses the $(ClusterId) and $(ProcId) automatic variables to define their respective filenames. This will append the $(ClusterId) and $(ProcId) of each HTCondor job to their output, error, and job event log files, which becomes quite useful when a submit description file is used to queue many jobs all at once.
  It is important to note here that the name of this shell script was not chosen randomly. While other batch systems like SLURM and PBS use standard shell scripts annotated with directives for both communicating the requirements of a batch job to their schedulers and how the job's executable should be run, HTCondor does not work this way. In general, an HTCondor submit description file separates the directives (or commands) to the scheduler from how the executable should be run (e.g., how it would look if run interactively from the command line). As such, it is often the case that HTCondor users will need to wrap their actual (payload) executable within a shell script as shown here in this sample submit description file. Here, that executable is represented by job.x in the transfer_input_files command.
Line: 220 to 222
  --config-file $AWS_USER_CONFIG"
Deleted:
<
<

Storage

File Transfer

Software Packages

 

Additional Documentation

Added:
>
>
Important Note: Do not include use_x509userproxy in your job submit files.

Job Submission

Line: 240 to 241
 

References

Deleted:
<
<
http://research.cs.wisc.edu/htcondor/manual/v8.4/2_Users_Manual.html
 

Revision 132017/01/10 - Main.MartinKandes

Line: 1 to 1
 
META TOPICPARENT name="WebHome"
Line: 42 to 42
 

Running Jobs

Changed:
<
<
PCF uses HTCondor to manage batch job submission to the high-throughput computing resources its users may access. Jobs can be submitted to PCF using the condor_submit command as follows:
>
>
PCF uses HTCondor to manage batch job submission to the high-throughput computing resources its users may access. Jobs can be submitted to PCF using the condor_submit command. For example, you would run this command as follows:
 
Changed:
<
<
 [youradusername@pcf-osg ~]$ condor_submit pi.condor 
>
>
 [youradusername@pcf-osg ~]$ condor_submit job.condor 
 
Changed:
<
<
where pi.condor is the name of a UNIX formatted plain ASCII file known as an HTCondor submit description file. This file contains special commands, directives, expressions, statements, and variables used to specify information about your batch job to HTCondor, such as what executable to run, the files to use for standard input, standard output, and standard error, as well as the resources required to successfully run the job.
>
>
where job.condor is the name of a UNIX formatted plain ASCII file known as the submit description file. This file contains special commands, directives, expressions, statements, and variables used to specify information about your batch job to HTCondor, such as what executable to run, the files to use for standard input, standard output, and standard error, as well as the resources required to successfully run the job.
 
Changed:
<
<
Here, the sample HTCondor submit description file pi.condor is shown below.
>
>
A sample submit description file, pi.condor, is shown below.
 
 # =====================================================================

Deleted:
<
<
#
  # A sample HTCondor submit description file
Deleted:
<
<
#
  # --------------------------------------------------------------------- universe = vanilla executable = pi.sh
Changed:
<
<
arguments = -b 8 -r 7 -s 10000
>
>
arguments = -b 8 -r 5 -s 10000
  should_transfer_files = YES when_to_transfer_output = ON_EXIT
Changed:
<
<
output = pi.out.$(Cluster).$(Process) error = pi.err.$(Cluster).$(Process) log = pi.log.$(Cluster).$(Process)
>
>
output = pi.out.$(ClusterId).$(ProcId) error = pi.err.$(ClusterId).$(ProcId) log = pi.log.$(ClusterId).$(ProcId)
  request_cpus = 1 request_disk = 8000000 request_memory = 1024
Line: 71 to 69
  +site_local = false +sdsc = false +uc = false
Changed:
<
<
queue 1
>
>
queue 10
  # =================================================================
Changed:
<
<
In an HTCondor submit description file, the universe command in defines a specific type of execution environment. All batch jobs submitted to PCF should use the default vanilla universe.
>
>
In an HTCondor submit description file, the universe command defines a specific type of execution environment for the job. All batch jobs submitted to PCF should use the default vanilla universe as indicated here.
 
Changed:
<
<
The executable command specifies the name of the executable you want to run. Only one executable command should be specified in any submit description file. If no path or a relative path is used, then the executable is presumed to be relative to the current working directory of the user when the condor_submit command was issued. In this case, the executable is a bash shell script named pi.sh, which
>
>
The executable command specifies the name of the executable you want to run. Only one executable command should be specified in any submit description file. If no path or a relative path is used, then the executable is presumed to be relative to the current working directory of the user when the condor_submit command was issued. In this case, the executable is a bash shell script named pi.sh, which estimates the value of Pi using a Monte Carlo method.
 
Changed:
<
<
To successfully run pi.sh, a user is required to provide the script with three command-line arguments: (1) the size of integers used in bytes, (2) the number of decimal places to round the estimate of pi, and (3) the number of Monte Carlo samples. In the submit description file pi.condor, these command-line arguments are passed to the script via the HTCondor arguments command.
>
>
To successfully run pi.sh, a user is required to provide the script with three command-line arguments: (1) the size of integers used in bytes, (2) the number of decimal places to round the estimate of pi, and (3) the number of Monte Carlo samples. In the submit description file above, these command-line arguments are passed to the script via the HTCondor arguments command, which indicates the script should use 8-byte integers, round the estimate of Pi to 5 decimal places, and take 10000 Monte Carlo samples.
 
Changed:
<
<
# The should_transfer_files command determines if HTCondor transfers # files to and from the remote machine where a job runs. YES will cause # HTCondor to always transfer files for the job. PCF users should always # say YES, unless you are working with large input and/or output # datasets, which will require a different file transfer mechanism. # Please consult PCF system adminstrators for help if this is the case.
>
>
The should_transfer_files command determines if HTCondor transfers files to and from the remote machine where your job runs. YES will cause HTCondor to always transfer input and output files for your jobs. However, total input and output data for each job using the HTCondor file transfer mechanism should be kept to less than 5 GB to allow the data to be successfully pulled from PCF by your jobs, processed on the remote machines where they will run, and then pushed back to your home directory on PCF. If your requirements exceed this 5 GB per job limit, please consult the PCF system administrators to assist you with setting up an alternative file transfer mechanism.
 
Added:
>
>
The when_to_transfer_output command tells HTCondor when output files are to be transferred back to PCF. If when_to_transfer_output is set to ON_EXIT, HTCondor will transfer the file listed in the output command, as well as any other files created by the job in its remote scratch directory, back to PCF, but only when the job exits on its own. If when_to_transfer_output is set to ON_EXIT_OR_EVICT, then the output files are transferred back to PCF any time the job leaves a remote machine, either because it exited on its own, or was evicted by the HTCondor system for any reason prior to job completion. Any output files transferred back to PCF are then automatically sent back out again as input files if the job restarts. This option is intended for fault tolerant jobs which periodically save their own state and are designed to restart where they left off.

The output and error commands provide the paths and filenames used by HTCondor to capture any output and error messages your executable would normally write to stdout and stderr. Similarly, the log command is used to provide the path and filename for the job event log, which is a chronological list of events that occur in HTCondor as a job runs. Note that each of these commands in the sample submit description file uses the $(ClusterId) and $(ProcId) automatic variables to define their respective filenames. This will append the $(ClusterId) and $(ProcId) of each HTCondor job to their output, error, and job event log files, which becomes quite useful when a submit description file is used to queue many jobs all at once.

  It is important to note here that the name of this shell script was not chosen randomly. While other batch systems like SLURM and PBS use standard shell scripts annotated with directives for both communicating the requirements of a batch job to their schedulers and how the job's executable should be run, HTCondor does not work this way. In general, an HTCondor submit description file separates the directives (or commands) to the scheduler from how the executable should be run (e.g., how it would look if run interactively from the command line). As such, it is often the case that HTCondor users will need to wrap their actual (payload) executable within a shell script as shown here in this sample submit description file. Here, that executable is represented by job.x in the transfer_input_files command.
Line: 256 to 252
 
  • pi.sh: A bash script that estimates the value of Pi via the Monte Carlo method.
Changed:
<
<
META FILEATTACHMENT attachment="pi.condor" attr="" comment="A sample HTCondor submit description file" date="1483995239" name="pi.condor" path="pi.condor" size="876" stream="pi.condor" tmpFilename="/tmp/TlMicjBJ62" user="MartinKandes" version="2"
META FILEATTACHMENT attachment="pi.sh" attr="" comment="A bash script that estimates the value of Pi via the Monte Carlo method." date="1483995596" name="pi.sh" path="pi.sh" size="1768" stream="pi.sh" tmpFilename="/tmp/xRuViHCFhA" user="MartinKandes" version="1"
>
>
META FILEATTACHMENT attachment="pi.condor" attr="" comment="A sample HTCondor submit description file" date="1484007193" name="pi.condor" path="pi.condor" size="872" stream="pi.condor" tmpFilename="/tmp/MoiLsOvwh7" user="MartinKandes" version="6"
META FILEATTACHMENT attachment="pi.sh" attr="" comment="A bash script that estimates the value of Pi via the Monte Carlo method." date="1484007216" name="pi.sh" path="pi.sh" size="1746" stream="pi.sh" tmpFilename="/tmp/3IlLAQN1H3" user="MartinKandes" version="3"

Revision 122017/01/09 - Main.MartinKandes

Line: 1 to 1
 
META TOPICPARENT name="WebHome"
Line: 44 to 44
  PCF uses HTCondor to manage batch job submission to the high-throughput computing resources its users may access. Jobs can be submitted to PCF using the condor_submit command as follows:
Changed:
<
<
 [youradusername@pcf-osg ~]$ condor_submit runjob.vanilla 
>
>
 [youradusername@pcf-osg ~]$ condor_submit pi.condor 
 
Changed:
<
<
where runjob.vanilla is the name of a UNIX formatted plain ASCII file known as an HTCondor submit description file. This file contains special commands, directives, expressions, statements, and variables used to specify information about your batch job to HTCondor, such as what executable to run, the files to use for standard input, standard output, and standard error, as well as the resources required to successfully run the job.
>
>
where pi.condor is the name of a UNIX formatted plain ASCII file known as an HTCondor submit description file. This file contains special commands, directives, expressions, statements, and variables used to specify information about your batch job to HTCondor, such as what executable to run, the files to use for standard input, standard output, and standard error, as well as the resources required to successfully run the job.
 
Changed:
<
<
A sample HTCondor submit description file shown below.
>
>
Here, the sample HTCondor submit description file pi.condor is shown below.
 
 # =====================================================================
 #

Line: 56 to 56
  # # --------------------------------------------------------------------- universe = vanilla
Changed:
<
<
executable = jobwrapper.sh arguments = 4 6
>
>
executable = pi.sh arguments = -b 8 -r 7 -s 10000
  should_transfer_files = YES
Deleted:
<
<
transfer_input_files = job.x, job.input
  when_to_transfer_output = ON_EXIT
Changed:
<
<
output = job.out.$(Cluster).$(Process) error = job.err.$(Cluster).$(Process) log =job.log.$(Cluster).$(Process) requirements = OSGVO_OS_STRING = "RHEL 6" && Arch = "X86_64" && HAS_MODULES = True && NumJobStarts? = 0 request_cpus = 48
>
>
output = pi.out.$(Cluster).$(Process) error = pi.err.$(Cluster).$(Process) log = pi.log.$(Cluster).$(Process) request_cpus = 1
  request_disk = 8000000
Changed:
<
<
request_memory = 24576 periodic_remove = JobStatus? == 1 && NumJobStarts? > 0 on_exit_hold = (ExitBySignal? = True) || (ExitCode? 0)
>
>
request_memory = 1024
  +ProjectName = "PCFOSGUCSD" +local = true +site_local = false
Line: 78 to 74
  queue 1 # =================================================================
Changed:
<
<
Here, the universe command in HTCondor defines a specific type of execution environment. All batch jobs submitted to PCF should use the default vanilla universe.
>
>
In an HTCondor submit description file, the universe command defines a specific type of execution environment. All batch jobs submitted to PCF should use the default vanilla universe.
 
Changed:
<
<
The executable command specifies the name of the executable you want to run. Only one executable command should be specified in any submit description file. If no path or a relative path is used, then the executable is presumed to be relative to the current working directory of the user when the condor_submit command was issued. In this case, the executable is a shell script named jobwrapper.sh.
>
>
The executable command specifies the name of the executable you want to run. Only one executable command should be specified in any submit description file. If no path or a relative path is used, then the executable is presumed to be relative to the current working directory of the user when the condor_submit command was issued. In this case, the executable is a bash shell script named pi.sh, which
 
Changed:
<
<
It is important to note here that the name of this shell script was not chosen randomly. While other batch systems like SLURM and PBS use standard shell scripts annotated with directives for both communicating the requirements of a batch job to their schedulers and how the job's executable should be run, HTCondor does not work this way. In general, an HTCondor submit description file separates the directives (or commands) to the scheduler from how the executable should be run (e.g., how it would look if run interactively from the command line). As such, it is often the case that HTCondor users will need to wrap their actual (payload) executable within a shell script as shown here in this sample submit description file. Here, that executable is represented by job.x in the transfer_input_files command.

The arguments command is a list of user-specified arguments representing the command line arguments that should be passed to the job as-if when run from the command line. For example, if jobwrapper.sh is created to run job.x as a hybrid MPI/OpenMP job, it may look

>
>
To successfully run pi.sh, a user is required to provide the script with three command-line arguments: (1) the size of integers used in bytes, (2) the number of decimal places to round the estimate of pi, and (3) the number of Monte Carlo samples. In the submit description file pi.condor, these command-line arguments are passed to the script via the HTCondor arguments command.
  # The should_transfer_files command determines if HTCondor transfers # files to and from the remote machine where a job runs. YES will cause
Line: 94 to 88
 # Please consult PCF system adminstrators for help if this is the case.
Added:
>
>
It is important to note here that the name of this shell script was not chosen randomly. While other batch systems like SLURM and PBS use standard shell scripts annotated with directives for both communicating the requirements of a batch job to their schedulers and how the job's executable should be run, HTCondor does not work this way. In general, an HTCondor submit description file separates the directives (or commands) to the scheduler from how the executable should be run (e.g., how it would look if run interactively from the command line). As such, it is often the case that HTCondor users will need to wrap their actual (payload) executable within a shell script as shown here in this sample submit description file. Here, that executable is represented by job.x in the transfer_input_files command.
 
HTCondor Submit Description File Command Description
universe = vanilla A universe in HTCondor defines an execution environment. All jobs submitted to PCF should use the default 'vanilla' universe.
Line: 255 to 252
 
  • Set VO_LOWER = osg
  • Set UC_LOWER = ucsds
--> \ No newline at end of file
Added:
>
>
  • pi.condor: A sample HTCondor submit description file

  • pi.sh: A bash script that estimates the value of Pi via the Monte Carlo method.

META FILEATTACHMENT attachment="pi.condor" attr="" comment="A sample HTCondor submit description file" date="1483995239" name="pi.condor" path="pi.condor" size="876" stream="pi.condor" tmpFilename="/tmp/TlMicjBJ62" user="MartinKandes" version="2"
META FILEATTACHMENT attachment="pi.sh" attr="" comment="A bash script that estimates the value of Pi via the Monte Carlo method." date="1483995596" name="pi.sh" path="pi.sh" size="1768" stream="pi.sh" tmpFilename="/tmp/xRuViHCFhA" user="MartinKandes" version="1"

Revision 112016/12/20 - Main.MartinKandes

Line: 1 to 1
 
META TOPICPARENT name="WebHome"
Line: 59 to 59
  executable = jobwrapper.sh arguments = 4 6 should_transfer_files = YES
Changed:
<
<
transfer_input_files = job.exe, job.input
>
>
transfer_input_files = job.x, job.input
  when_to_transfer_output = ON_EXIT output = job.out.$(Cluster).$(Process) error = job.err.$(Cluster).$(Process) log =job.log.$(Cluster).$(Process) requirements = OSGVO_OS_STRING = "RHEL 6" && Arch = "X86_64" && HAS_MODULES = True && NumJobStarts? = 0
Changed:
<
<
request_cpus = 1
>
>
request_cpus = 48
  request_disk = 8000000
Changed:
<
<
request_memory = 1024
>
>
request_memory = 24576
  periodic_remove = JobStatus? == 1 && NumJobStarts? > 0 on_exit_hold = (ExitBySignal? = True) || (ExitCode? 0) +ProjectName = "PCFOSGUCSD"
Line: 78 to 78
  queue 1 # =================================================================
Changed:
<
<
Here, the universe command in HTCondor defines a specific type of execution environment. All batch jobs submitted to PCF should use the default vanilla universe as shown here in this sample submit description file.
>
>
Here, the universe command in HTCondor defines a specific type of execution environment. All batch jobs submitted to PCF should use the default vanilla universe.
 
Changed:
<
<
The executable command specifies the name of the executable you want to run. Only one executable command should be specified in any submit description file. If no path or a relative path is used, then the executable is presumed to be relative to the current working directory of the user when the condor_submit command was issued. In this case, the executable is a shell script named yourjobname.sh.
>
>
The executable command specifies the name of the executable you want to run. Only one executable command should be specified in any submit description file. If no path or a relative path is used, then the executable is presumed to be relative to the current working directory of the user when the condor_submit command was issued. In this case, the executable is a shell script named jobwrapper.sh.
 
Added:
>
>
It is important to note here that the name of this shell script was not chosen randomly. While other batch systems like SLURM and PBS use standard shell scripts annotated with directives for both communicating the requirements of a batch job to their schedulers and how the job's executable should be run, HTCondor does not work this way. In general, an HTCondor submit description file separates the directives (or commands) to the scheduler from how the executable should be run (e.g., how it would look if run interactively from the command line). As such, it is often the case that HTCondor users will need to wrap their actual (payload) executable within a shell script as shown here in this sample submit description file. Here, that executable is represented by job.x in the transfer_input_files command.
 
Changed:
<
<
While other batch systems like SLURM and PBS allow the use of standard shell scripts interspersed with batch directives for the schedulers in the same job submit file, this is not possible with HTCondor. As such, it is often the case that HTCondor users will have to "wrap" their actual science executable within a shell script. Here, that science executable is represented by yourjobname.x in the transfer_input_files command.

The arguments command is a list of user-specified arguments representing the command line arguments that should be passed to the job as-if when run from the command line. For example,

>
>
The arguments command is a list of user-specified arguments representing the command line arguments that should be passed to the job as-if when run from the command line. For example, if jobwrapper.sh is created to run job.x as a hybrid MPI/OpenMP job, it may look
  # The should_transfer_files command determines if HTCondor transfers # files to and from the remote machine where a job runs. YES will cause

Revision 102016/12/20 - Main.MartinKandes

Line: 1 to 1
 
META TOPICPARENT name="WebHome"
Line: 24 to 24
 
Added:
>
>
While users may submit and run jobs locally on PCF itself, in general, all computationally intensive jobs should be run only on these larger compute resources, reserving PCF's local resources for development and testing purposes only.
 

System Status

  • Access to Comet is currently unavailable from PCF, but it will again be available in early 2017.
Line: 42 to 44
  PCF uses HTCondor to manage batch job submission to the high-throughput computing resources its users may access. Jobs can be submitted to PCF using the condor_submit command as follows:
Changed:
<
<
 [youradusername@pcf-osg ~]$ condor_submit yourjobname.vanilla 
>
>
 [youradusername@pcf-osg ~]$ condor_submit runjob.vanilla 
 
Changed:
<
<
where yourjobname.vanilla is the name of a UNIX formatted plain ASCII file known as an HTCondor submit description file. This file contains special commands, directives, expressions, statements, and variables used to specify information about your batch job to HTCondor, such as what executable to run, the files to use for standard input, standard output, and standard error, as well as the resources required to successfully run the job.
>
>
where runjob.vanilla is the name of a UNIX formatted plain ASCII file known as an HTCondor submit description file. This file contains special commands, directives, expressions, statements, and variables used to specify information about your batch job to HTCondor, such as what executable to run, the files to use for standard input, standard output, and standard error, as well as the resources required to successfully run the job.
  A sample HTCondor submit description file shown below.
Line: 54 to 56
  # # --------------------------------------------------------------------- universe = vanilla
Changed:
<
<
executable = yourjobname.sh arguments = 300 600
>
>
executable = jobwrapper.sh arguments = 4 6
  should_transfer_files = YES
Changed:
<
<
transfer_input_files = yourjobname.x, yourjobname.in
>
>
transfer_input_files = job.exe, job.input
  when_to_transfer_output = ON_EXIT
Changed:
<
<
output = yourjobname.out.$(Cluster).$(Process) error = yourjobname.err.$(Cluster).$(Process) log = yourjobname.log.$(Cluster).$(Process)
>
>
output = job.out.$(Cluster).$(Process) error = job.err.$(Cluster).$(Process) log =job.log.$(Cluster).$(Process)
  requirements = OSGVO_OS_STRING = "RHEL 6" && Arch = "X86_64" && HAS_MODULES = True && NumJobStarts? = 0 request_cpus = 1 request_disk = 8000000
Line: 78 to 80
  Here, the universe command in HTCondor defines a specific type of execution environment. All batch jobs submitted to PCF should use the default vanilla universe as shown here in this sample submit description file.
Changed:
<
<
Next, the executable command specifies the name of the executable you want to run. Only one executable command should be specified in any submit description file. If no path or a relative path is used, then the executable is presumed to be relative to the current working directory of the user when the condor_submit command was issued. In this case, the executable is a shell script named yourjobname.sh. While other batch systems like SLURM and PBS allow the use of standard shell scripts interspersed with batch directives for the schedulers in the same job submit file, this is not possible with HTCondor. As such, it is often the case that HTCondor users will have to "wrap" their actual science executable within a shell script. Here, that science executable is represented by yourjobname.x in the transfer_input_files command.
>
>
The executable command specifies the name of the executable you want to run. Only one executable command should be specified in any submit description file. If no path or a relative path is used, then the executable is presumed to be relative to the current working directory of the user when the condor_submit command was issued. In this case, the executable is a shell script named yourjobname.sh.

While other batch systems like SLURM and PBS allow the use of standard shell scripts interspersed with batch directives for the schedulers in the same job submit file, this is not possible with HTCondor. As such, it is often the case that HTCondor users will have to "wrap" their actual science executable within a shell script. Here, that science executable is represented by yourjobname.x in the transfer_input_files command.

  The arguments command is a list of user-specified arguments representing the command line arguments that should be passed to the job as-if when run from the command line. For example,
Line: 125 to 131
  All computationally intensive batch jobs should be targeted to run only on these remote computing resources and not locally on PCF itself. PCF's own local resources should be reserved for development and testing purposes only. Do not run any test workloads interactively on PCF.
Changed:
<
<
>
>
Note, however, users MUST have an XSEDE allocation on Comet to access it.
 

Modules

Revision 92016/12/20 - Main.MartinKandes

Line: 1 to 1
 
META TOPICPARENT name="WebHome"
Line: 44 to 44
 
 [youradusername@pcf-osg ~]$ condor_submit yourjobname.vanilla 
Changed:
<
<
where "yourjobname.vanilla" is the name of a UNIX formatted plain ASCII text file known as a submit description file. This file contains special commands, directives, expressions, statements, and variables used to specify information about your batch job to HTCondor, such as what executable to run, the files to use for standard input (stdin), standard output (stdout) and standard error (stderr), and the resources required to successfully run the job.
>
>
where yourjobname.vanilla is the name of a UNIX formatted plain ASCII file known as an HTCondor submit description file. This file contains special commands, directives, expressions, statements, and variables used to specify information about your batch job to HTCondor, such as what executable to run, the files to use for standard input, standard output, and standard error, as well as the resources required to successfully run the job.
 
Changed:
<
<
A sample HTCondor submit description file is shown below:
>
>
A sample HTCondor submit description file is shown below.
 
 # =====================================================================
 #

Line: 58 to 57
  executable = yourjobname.sh arguments = 300 600 should_transfer_files = YES
Changed:
<
<
transfer_input_files = yourjobname.x, yourjobname.input
>
>
transfer_input_files = yourjobname.x, yourjobname.in
  when_to_transfer_output = ON_EXIT output = yourjobname.out.$(Cluster).$(Process) error = yourjobname.err.$(Cluster).$(Process) log = yourjobname.log.$(Cluster).$(Process)
Changed:
<
<
# request_cpus = [num-cpus] # A requested number of CPUs (cores). If not specified, the number requested will be 1.
>
>
requirements = OSGVO_OS_STRING = "RHEL 6" && Arch = "X86_64" && HAS_MODULES = True && NumJobStarts? = 0
  request_cpus = 1
Changed:
<
<
>
>
request_disk = 8000000
  request_memory = 1024
Changed:
<
<
>
>
periodic_remove = JobStatus? == 1 && NumJobStarts? > 0 on_exit_hold = (ExitBySignal? = True) || (ExitCode? 0)
  +ProjectName="PCFOSGUCSD"
Deleted:
<
<
  +local = true +site_local = false +sdsc = false
Line: 81 to 76
  queue 1 # =================================================================
Changed:
<
<
HTCondor Submit Description File Command Description
universe = vanilla A universe in HTCondor defines an execution environment. All jobs submitted to PCF should use the default 'vanilla' universe.
executable = [pathname] The name of the executable for this batch job. Only one executable command within a submit description file should be specified. If no path or a relative path is used, then the executable is presumed to be relative to the current working directory of the user when the condor_submit command is issued.
arguments = [argument_list] List of arguments to be supplied to the executable as part of the command line.
should_transfer_files = [ YES / NO / IF_NEEDED ] The should_transfer_files setting determines if HTCondor transfers files to and from the remote machine where a job runs. YES will cause HTCondor to always transfer files for the job. NO disables the file transfer mechanism. IF_NEEDED will not transfer files for the job if it is matched with a local resource that shares the same file system as the submit machine. If the job is matched with a remote resource, which does not have a shared file system, then HTCondor will transfer the necessary files.
transfer_input_files = [file1,file2,file...] A comma-delimited list of all the files and directories to be transferred into the working directory for the job, before the job is started. By default, the file specified in the executable command and any file specified in the input command are transferred.
when_to_transfer_output = [ ON_EXIT / ON_EXIT_OR_EVICT ] Setting when_to_transfer_output equal to ON_EXIT will cause HTCondor to transfer the job's output files back to the submitting machine only when the job completes (exits on its own). The ON_EXIT_OR_EVICT option is intended for fault tolerant jobs which periodically save their own state and can restart where they left off. In this case, files are spooled to the submit machine any time the job leaves a remote site, either because it exited on its own, or was evicted by the HTCondor system for any reason prior to job completion. The files spooled back are placed in a directory defined by the value of the SPOOL configuration variable. Any output files transferred back to the submit machine are automatically sent back out again as input files if the job restarts.
>
>
Here, the universe command in HTCondor defines a specific type of execution environment. All batch jobs submitted to PCF should use the default vanilla universe as shown here in this sample submit description file.

Next, the executable command specifies the name of the executable you want to run. Only one executable command should be specified in any submit description file. If no path or a relative path is used, then the executable is presumed to be relative to the current working directory of the user when the condor_submit command was issued. In this case, the executable is a shell script named yourjobname.sh. While other batch systems like SLURM and PBS allow the use of standard shell scripts interspersed with batch directives for the schedulers in the same job submit file, this is not possible with HTCondor. As such, it is often the case that HTCondor users will have to "wrap" their actual science executable within a shell script. Here, that science executable is represented by yourjobname.x in the transfer_input_files command.

 
Added:
>
>
The arguments command is the list of user-specified arguments that should be passed to the executable on its command line, just as if the job were run interactively from the command line. For example,
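in the sample submit description file, the line

 arguments = 300 600

combined with executable = yourjobname.sh means that, on the execute machine, HTCondor effectively runs the equivalent of

 ./yourjobname.sh 300 600

where the meaning of the two values is entirely up to your own wrapper script; 300 and 600 are just the placeholder values used in the sample.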
 
Added:
>
>
# The should_transfer_files command determines if HTCondor transfers
# files to and from the remote machine where a job runs. YES will cause
# HTCondor to always transfer files for the job. PCF users should always
# say YES, unless you are working with large input and/or output
# datasets, which will require a different file transfer mechanism.
# Please consult the PCF system administrators for help if this is the case.

HTCondor Submit Description File Command Description
universe = vanilla A universe in HTCondor defines an execution environment. All jobs submitted to PCF should use the default 'vanilla' universe.
executable = [nameofexe] The name of the executable for this batch job. Only one executable command within a submit description file should be specified. If no path or a relative path is used, then the executable is presumed to be relative to the current working directory of the user when the condor_submit command is issued.
arguments = [argument_list] List of arguments to be supplied to the executable as part of the command line.
should_transfer_files = [ YES / NO / IF_NEEDED ] The should_transfer_files setting determines if HTCondor transfers files to and from the remote machine where a job runs. YES will cause HTCondor to always transfer files for the job. NO disables the file transfer mechanism. IF_NEEDED will not transfer files for the job if it is matched with a local resource that shares the same file system as the submit machine. If the job is matched with a remote resource, which does not have a shared file system, then HTCondor will transfer the necessary files.
transfer_input_files = [file1,file2,file...] A comma-delimited list of all the files and directories to be transferred into the working directory for the job, before the job is started. By default, the file specified in the executable command and any file specified in the input command are transferred.
when_to_transfer_output = [ ON_EXIT / ON_EXIT_OR_EVICT ] Setting when_to_transfer_output equal to ON_EXIT will cause HTCondor to transfer the job's output files back to the submitting machine only when the job completes (exits on its own). The ON_EXIT_OR_EVICT option is intended for fault tolerant jobs which periodically save their own state and can restart where they left off. In this case, files are spooled to the submit machine any time the job leaves a remote site, either because it exited on its own, or was evicted by the HTCondor system for any reason prior to job completion. The files spooled back are placed in a directory defined by the value of the SPOOL configuration variable. Any output files transferred back to the submit machine are automatically sent back out again as input files if the job restarts.
output = [outputfile] Specifies standard output (stdout) for batch job.
error = [errorfile] Specifies standard error (stderr) for batch job.
log = [logfile] Contains events the batch job had during its lifetime inside of HTCondor.
requirements = [booleanexpression] The requirements command is a boolean expression which uses C-like operators. In order for any job to run on a given machine, this requirements expression must evaluate to true on that machine. (See the example following this table.)
request_cpus = [numberofcpus] The requested number of CPUs (cores). If not specified, the number requested will be 1. If specified, the expression && (RequestCpus <= Target.Cpus) is appended to the requirements expression for the job.
request_disk = [amountofdisk] The amount of disk space in KiB requested for this job. If not specified, it will be set to the job ClassAd attribute DiskUsage. However, if specified, then the expression && (RequestDisk <= Target.Disk) is appended to the requirements expression for the job. Characters may be appended to a numerical value to indicate units: K or KB indicates KiB, M or MB indicates MiB, G or GB indicates GiB, and T or TB indicates TiB.
request_memory = [amountofmemory] The required amount of memory in MiB that this job needs to avoid excessive swapping. The actual amount of memory used by a job is represented by the job ClassAd attribute MemoryUsage. If specified, the expression && (RequestMemory <= Target.Memory) is appended to the requirements expression for the job. If not specified, a default of 1024 MiB defined in the PCF system configuration will be used. Characters may be appended to a numerical value to indicate units: K or KB indicates KiB, M or MB indicates MiB, G or GB indicates GiB, and T or TB indicates TiB.
periodic_remove = [booleanexpression] This expression is checked periodically at an interval of the number of seconds set by the HTCondor configuration variable PERIODIC_EXPR_INTERVAL. If it becomes True, the job is removed from the queue. If unspecified, the default value is False.
on_exit_hold = [booleanexpression] This expression is checked when the job exits, and if True, places the job into the HTCondor Hold state. If False (the default value when not defined), then nothing happens and the on_exit_remove expression is checked to determine if that needs to be applied.
+[customattributename] = [customattributevalue] HTCondor allows users to add their own HTCondor job ClassAd attributes at submission. On PCF, these custom attributes are used to mark jobs for special routing and accounting purposes, which will be explained further below.
queue [integer] Places zero or more copies of the job into the HTCondor queue.
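As an illustration of these boolean ClassAd expressions, the two lines below show a requirements expression that restricts a job to 64-bit Linux machines and a periodic_remove expression that removes a job after it has been running for more than 24 hours. These are generic HTCondor examples rather than PCF defaults; adjust the attributes and thresholds to suit your own jobs.

 requirements    = (Arch == "X86_64") && (OpSys == "LINUX")
 periodic_remove = (JobStatus == 2) && ((CurrentTime - EnteredCurrentStatus) > (24 * 60 * 60))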
 
Line: 109 to 127
 
Changed:
<
<

Environment Modules

>
>

Modules

Environment modules provide users with an easy way to access different versions of software, including various libraries, compilers, and applications. All user jobs running on computing resources accessible to PCF should have access to the OSG computing environment and its modules.
Line: 185 to 203
module swap foo1 foo2 switches loaded module foo1 with module foo2
module unload foo reverses all changes to the environment made by previously loading module foo
Deleted:
<
<

Available Compilers

The standard set of GNU compilers are available on PCF for users who need to compile their own custom codebase. The current version of the GNU compilers installed on PCF are:

  • gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17)
  • g++ (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17)
  • gfortran (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17)

Also installed on PCF for those users who need to compile (and run) MPI code is OpenMPI. The current version of OpenMPI installed on PCF is:

  • OpenMPI 1.10.2

Running Jobs

The HTCondor batch system is used to manage jobs submitted to the high-throughput computing resources accessible to PCF, which currently has access to:

All computationally intensive jobs should be run only on these resources and not locally on PCF itself. Please try to reserve the use of the "+local=true" resources on PCF for development and testing purposes only. Do not run any test workloads interactively on PCF.

Jobs can be submitted to PCF using the "condor_submit" command as follows:

 [youradusername@pcf-osg ~]$ condor_submit yourjobdescriptionfile 
where yourjobdescriptionfile is the name of a UNIX format file containing special statements, resource specifications, and other commands used to construct an HTCondor job ClassAd for your job submission.

  condor_annex is a Perl-based script that utilizes the AWS command-line interface and other AWS services to orchestrate the delivery of HTCondor execute nodes to an HTCondor pool like the one available to you on pcf-osg.t2.ucsd.edu. If you would like to try running your jobs on AWS resources, please contact Marty Kandes at mkandes@sdsc.edu. Some backend configuration of your AWS account will be necessary to get started. However, once your AWS account is configured, you will be able to order instances on-demand with one command:
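The invocation, reproduced here from the Running on Amazon Web Services (AWS) section of this document, takes the following form. Every environment variable shown is a placeholder for a value specific to your own AWS account and HTCondor pool.

 condor_annex \
    --project-id "$AWS_PROJECT_ID" \
    --region "$AWS_DEFAULT_REGION" \
    --central-manager "$AWS_CENTRAL_MANAGER" \
    --vpc "$AWS_VPC_ID" \
    --subnet "$AWS_SUBNET_ID" \
    --keypair "$AWS_KEY_PAIR_NAME" \
    --instances $NUMBER_OF_INSTANCES_TO_ORDER \
    --expiry "$AWS_LEASE_EXPIRATION" \
    --password-file "$CONDOR_PASSWORD_FILE" \
    --image-ids "$AWS_AMI_ID" \
    --instance-types "$AWS_INSTANCE_TYPE" \
    --spot-prices $AWS_SPOT_BID \
    --config-file "$AWS_USER_CONFIG"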
Line: 256 to 243
 

References

Added:
>
>
http://research.cs.wisc.edu/htcondor/manual/v8.4/2_Users_Manual.html
 

Revision 8 - 2016/12/15 - Main.MartinKandes

Line: 1 to 1
 
META TOPICPARENT name="WebHome"
Line: 42 to 42
  PCF uses HTCondor to manage batch job submission to the high-throughput computing resources its users may access. Jobs can be submitted to PCF using the condor_submit command as follows:
Changed:
<
<
 [youradusername@pcf-osg ~]$ condor_submit yourjob.vanilla 
>
>
 [youradusername@pcf-osg ~]$ condor_submit yourjobname.vanilla 
 
Changed:
<
<
where "yourjob.vanilla" is the name of a UNIX formatted plain ASCII text file known as a submit description file. This file contains special directives, expressions, and variables used to specify information about your batch job, such as what executable to run, the files to use for standard input (stdin), standard output (stdout) and standard error (stderr), and the resources required to successfully run the job.
>
>
where "yourjobname.vanilla" is the name of a UNIX formatted plain ASCII text file known as a submit description file. This file contains special commands, directives, expressions, statements, and variables used to specify information about your batch job to HTCondor, such as what executable to run, the files to use for standard input (stdin), standard output (stdout) and standard error (stderr), and the resources required to successfully run the job.
 
Added:
>
>
A sample HTCondor submit description file is shown below:
 
Added:
>
>
 # =====================================================================
 #
 # A sample HTCondor submit description file
 # 
 # --------------------------------------------------------------------- 

 universe                = vanilla
 executable              = yourjobname.sh
 arguments               = 300 600
 should_transfer_files   = YES
 transfer_input_files    = yourjobname.x, yourjobname.input
 when_to_transfer_output = ON_EXIT
 output                  = yourjobname.out.$(Cluster).$(Process)
 error                   = yourjobname.err.$(Cluster).$(Process)
 log                     = yourjobname.log.$(Cluster).$(Process)

# request_cpus = [num-cpus]
# A requested number of CPUs (cores). If not specified, the number requested will be 1.
 request_cpus            = 1


 request_memory          = 1024

 +ProjectName="PCFOSGUCSD"

 +local = true
 +site_local = false
 +sdsc = false
 +uc = false

 queue 1
 # ===================================================================== 

HTCondor Submit Description File Command Description
universe = vanilla A universe in HTCondor defines an execution environment. All jobs submitted to PCF should use the default 'vanilla' universe.
executable = [pathname] The name of the executable for this batch job. Only one executable command within a submit description file should be specified. If no path or a relative path is used, then the executable is presumed to be relative to the current working directory of the user when the condor_submit command is issued.
arguments = [argument_list] List of arguments to be supplied to the executable as part of the command line.
should_transfer_files = [ YES / NO / IF_NEEDED ] The should_transfer_files setting determines if HTCondor transfers files to and from the remote machine where a job runs. YES will cause HTCondor to always transfer files for the job. NO disables the file transfer mechanism. IF_NEEDED will not transfer files for the job if it is matched with a local resource that shares the same file system as the submit machine. If the job is matched with a remote resource, which does not have a shared file system, then HTCondor will transfer the necessary files.
transfer_input_files = [file1,file2,file...] A comma-delimited list of all the files and directories to be transferred into the working directory for the job, before the job is started. By default, the file specified in the executable command and any file specified in the input command are transferred.
when_to_transfer_output = [ ON_EXIT / ON_EXIT_OR_EVICT ] Setting when_to_transfer_output equal to ON_EXIT will cause HTCondor to transfer the job's output files back to the submitting machine only when the job completes (exits on its own). The ON_EXIT_OR_EVICT option is intended for fault tolerant jobs which periodically save their own state and can restart where they left off. In this case, files are spooled to the submit machine any time the job leaves a remote site, either because it exited on its own, or was evicted by the HTCondor system for any reason prior to job completion. The files spooled back are placed in a directory defined by the value of the SPOOL configuration variable. Any output files transferred back to the submit machine are automatically sent back out again as input files if the job restarts.
 
Deleted:
<
<
Controlling the details of a job submission is the submit description file. The file contains information about the job such as what executable to run, the files to use in place of stdin and stdout, and the platform type required to run the program.
 
Deleted:
<
<
All computationally intensive batch jobs should be targeted to run only on these remote computing resources and not locally on PCF itself. PCF's own local resources should be reserved for development and testing purposes only. Do not run any test workloads interactively on PCF.
 
Deleted:
<
<
Jobs can be submitted to PCF using the "condor_submit" command as follows:
 
Changed:
<
<
 [youradusername@pcf-osg ~]$ condor_submit yourjobdescriptionfile 
where yourjobdescriptionfile is the name of a UNIX format file containing special statements, resource specifications, and other commands used to construct an HTCondor job ClassAd for your job submission.
>
>

The UCLHC setup allows you to choose a particular domain to run on. By default, jobs will run on the slots locally in the brick, as well as in the local batch system of the site. You can further choose to run outside to all UCs and also to the SDSC Comet cluster. These are each controlled by adding special booleans to the submit file. The following table lists the flags, their defaults, and descriptions:

flag default description
+local true run on the brick
+site_local true run in your own local site batch system
+sdsc false run at Comet
+uc false run at all other UCs
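For example, a submit description file that keeps a job off the brick and the local site batch system and instead sends it only to Comet would include flag lines like these (a sketch only; +uc already defaults to false and may be omitted):

 +local = false
 +site_local = false
 +sdsc = true
 +uc = false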

All computationally intensive batch jobs should be targeted to run only on these remote computing resources and not locally on PCF itself. PCF's own local resources should be reserved for development and testing purposes only. Do not run any test workloads interactively on PCF.

 

Environment Modules

Revision 7 - 2016/12/14 - Main.MartinKandes

Line: 1 to 1
 
META TOPICPARENT name="WebHome"
Line: 17 to 17
 

System Overview

Changed:
<
<
The PCF itself is dual-socket login node with two Intel Xeon E5-2670 v3 processors, 132 GB of RAM, and 1 TB of hard drive disk space. The system is currently running CentOS? 6.8 and uses the HTCondor batch system for user job submission and resource management.
>
>
PCF is a dual-socket login node with two Intel Xeon E5-2670 v3 processors, 132 GB of RAM, and 1 TB of hard drive disk space. The system is currently running CentOS 6.8 and uses HTCondor for batch job submission and resource management. PCF currently enables users to access the following computing resources:

 

System Status

Changed:
<
<
  • Access to Comet is currently unavailable from PCF, but it will be available again in early 2017.
>
>
  • Access to Comet is currently unavailable from PCF, but it will again be available in early 2017.

User Accounts

 
Changed:
<
<

Accounts

>
>
You may obtain a user account on PCF by contacting the Physics Help Desk (helpdesk@physics.ucsd.edu). They will need your UCSD Active Directory (AD) username to create the account. Accounts are available to any UCSD student, faculty member, or staff member running scientific computing workloads.
 
Changed:
<
<
You may obtain an account on PCF by contacting the Physics Help Desk at helpdesk@physics.ucsd.edu. They will need your UCSD Active Directory (AD) username to create the account. Once your account is created, you will be able to access PCF via SSH using your AD credentials (username/password).
>
>
Once your account is created, you will be able to access PCF via SSH using your AD credentials (username/password).
 
 [user@client ~]$ ssh youradusername@pcf-osg.t2.ucsd.edu 
 Password: ENTERYOURADPASSWORDHERE
Changed:
<
<

Compiling

>
>

Running Jobs

PCF uses HTCondor to manage batch job submission to the high-throughput computing resources its users may access. Jobs can be submitted to PCF using the condor_submit command as follows:

 [youradusername@pcf-osg ~]$ condor_submit yourjob.vanilla 

where "yourjob.vanilla" is the name of a UNIX formatted plain ASCII text file known as a submit description file. This file contains special directives, expressions, and variables used to specify information about your batch job, such as what executable to run, the files to use for standard input (stdin), standard output (stdout) and standard error (stderr), and the resources required to successfully run the job.

Controlling the details of a job submission is the submit description file. The file contains information about the job such as what executable to run, the files to use in place of stdin and stdout, and the platform type required to run the program.

All computationally intensive batch jobs should be targeted to run only on these remote computing resources and not locally on PCF itself. PCF's own local resources should be reserved for development and testing purposes only. Do not run any test workloads interactively on PCF.

Jobs can be submitted to PCF using the "condor_submit" command as follows:

 [youradusername@pcf-osg ~]$ condor_submit yourjobdescriptionfile 
where yourjobdescriptionfile is the name of a UNIX format file containing special statements, resource specifications, and other commands used to construct an HTCondor job ClassAd for your job submission.

Environment Modules

Environment modules provide users with an easy way to access different versions of software, including various libraries, compilers, and applications. All user jobs running on computing resources accessible to PCF should have access to the OSG computing environment and its modules.

OSG has implemented a version based on Lmod to provide the typical module commands on any site in the OSG. You can test workflows on the OSG Connect login node and then submit the same workflow without any changes.

The Environment Modules package provides for dynamic modification of your shell environment. Module commands set, change, or delete environment variables, typically in support of a particular application. They also let the user choose between different versions of the same software or different combinations of related codes.

The Module package provides for the dynamic modification of a user's environment via module files. Modules can be used:

  • to manage necessary changes to the environment, such as changing the default path or defining environment variables
  • to manage multiple versions of applications, tools and libraries
  • to manage software where name conflicts with other software would cause problems

Modules have been created for many of the software packages installed on PSC systems. They make your job easier by defining environment variables and adding directories to your path which are necessary when using a given package.


Environment modules have historically been used in HPC environments to provide users with an easy way to access different versions of software and to access various libraries, compilers, and software (c.f. the wikipedia reference). OSG has implemented a version based on Lmod to provide the typical module commands on any site in the OSG. You can test workflows on the OSG Connect login node and then submit the same workflow without any changes.

Loading and Unloading Modules

You must remove some modules before loading others. Some modules depend on others, so they may be loaded or unloaded as a consequence of another module command. For example, if intel and mvapich are both loaded, running the command module unload intel will automatically unload mvapich. Subsequently issuing the module load intel command does not automatically reload mvapich.

If you find yourself regularly using a set of module commands, you may want to add these to your configuration files (".bashrc" for bash users, ".cshrc" for C shell users). Complete documentation is available in the module(1) and modulefile(4) manpages.
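For example, a bash user who always wants a particular compiler and MPI stack available could append lines like the following to ~/.bashrc. The module names here are purely illustrative; run module avail first to see what is actually installed on the system you are using.

 # Load commonly used environment modules at login (illustrative module names only)
 module load gcc
 module load openmpi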

Modules

TACC continually updates application packages, compilers, communications libraries, tools, and math libraries. To facilitate this task and to provide a uniform mechanism for accessing different revisions of software, TACC uses the modules utility.

At login, modules commands set up a basic environment for the default compilers, tools, and libraries. For example: the $PATH, $MANPATH, $LIBPATH environment variables, directory locations (e.g., $WORK, $HOME), aliases (e.g., cdw, cdh) and license paths are set by the login modules. Therefore, there is no need for you to set them or update them when updates are made to system and application software.

Users that require 3rd party applications, special libraries, and tools for their projects can quickly tailor their environment with only the applications and tools they need. Using modules to define a specific application environment allows you to keep your environment free from the clutter of all the application environments you don't need.

The environment for executing each major TACC application can be set with a module command. The specifics are defined in a modulefile file, which sets, unsets, appends to, or prepends to environment variables (e.g., $PATH, $LD_LIBRARY_PATH, $INCLUDE_PATH, $MANPATH) for the specific application. Each modulefile also sets functions or aliases for use with the application. You only need to invoke a single command to configure the application/programming environment properly. The general format of this command is:

module load modulename

where modulename is the name of the module to load. If you often need a specific application, see Controlling Modules Loaded at Login below for details.

Most of the package directories are in /opt/apps/ ($APPS) and are named after the package. In each package directory there are subdirectories that contain the specific versions of the package.

As an example, the fftw3 package requires several environment variables that point to its home, libraries, include files, and documentation. These can be set in your shell environment by loading the fftw3 module:

login1$ module load fftw3

To look at a synopsis about using an application in the module's environment (in this case, fftw3), or to see a list of currently loaded modules, execute the following commands:

login1$ module help fftw3
login1$ module list

Available Modules

TACC's module system is organized hierarchically to prevent users from loading software that will not function properly with the currently loaded compiler/MPI environment (configuration). Two methods exist for viewing the availability of modules: Looking at modules available with the currently loaded compiler/MPI, and looking at all of the modules installed on the system.

To see a list of modules available to the user with the current compiler/MPI configuration, users can execute the following command:

login1$ module avail

This will allow the user to see which software packages are available with the current compiler/MPI configuration.

To see a list of modules available to the user with any compiler/MPI configuration, users can execute the following command:

login1$ module spider

This command will display all available packages on the system. To get specific information about a particular package, including the possible compiler/MPI configurations for that package, execute the following command:

login1$ module spider modulename

Some useful module commands are:

module avail lists all the available modules
module help foo displays help on module foo
module display foo indicates what changes would be made to the environment by loading module foo without actually loading it
module load foo loads module foo
module list displays your currently loaded modules
module swap foo1 foo2 switches loaded module foo1 with module foo2
module unload foo reverses all changes to the environment made by previously loading module foo

Available Compilers

  The standard set of GNU compilers are available on PCF for users who need to compile their own custom codebase. The current version of the GNU compilers installed on PCF are:
Line: 45 to 150
 

Running Jobs

Changed:
<
<
PCF uses the HTCondor batch system to submit, run, and manage user jobs on the high-throughput computing resources it has access to. PCF currently has access to the following resources:
>
>
The HTCondor batch system is used to manage jobs submitted to the high-throughput computing resources accessible to PCF, which currently has access to:
 

Revision 6 - 2016/12/14 - Main.MartinKandes

Line: 1 to 1
 
META TOPICPARENT name="WebHome"
Line: 8 to 8
 

About this Document

Changed:
<
<
The UCSD Physics Computing Facility (PCF) provides access to multiple high-throughput computing resources available to students, faculty, and staff in the Department of Physics, as well as those in the broader scientific community at UCSD. PCF currently enables its users to run their scientific workloads on the CMS Tier 2 cluster (3.5k cores) housed in Mayer Hall, the Comet Supercomputer (48k cores) at SDSC, and the Open Science Grid. There is also the capability to link PCF to commercial cloud providers like Amazon Web Services (AWS) for users who wish to purchase their own computing resources on-the-fly. This document describes how to get an account on PCF and begin submitting jobs to its computing resources.
>
>
The UCSD Physics Computing Facility (PCF) provides access to multiple high-throughput computing resources that are made available to students, faculty, and staff in the Department of Physics as well as those in the broader scientific community at UCSD. This document describes how to get an account on PCF and begin submitting jobs to its computing resources.
  This document follows the general Open Science Grid (OSG) documentation conventions:
Line: 17 to 17
 

System Overview

Changed:
<
<
PCF is dual-socket login node with two Intel Xeon E5-2670 v3 processors, 132 GB of RAM, and 1 TB of hard drive disk space. The system is currently running CentOS? 6.8 and uses the HTCondor batch system for user job submission and resource management.
>
>
The PCF itself is a dual-socket login node with two Intel Xeon E5-2670 v3 processors, 132 GB of RAM, and 1 TB of hard drive disk space. The system is currently running CentOS 6.8 and uses the HTCondor batch system for user job submission and resource management.
 

System Status

Changed:
<
<
  • Access to Comet is currently unavailable from PCF, but it will soon be made available again.
>
>
  • Access to Comet is currently unavailable from PCF, but it will be available again in early 2017.
 

Accounts

Line: 33 to 33
 

Compiling

Changed:
<
<
The standard set of GNU compilers are available on PCF for users who need to compile their own custom codebase. The current version of the GNU compilers installed on PCF is gcc/g++/gfortan (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17). Also installed on PCF for those users who need to compile (and run) MPI code is OpenMPI? . The current version of OpenMPI? installed on PCF is Open MPI 1.10.2.
>
>
The standard set of GNU compilers is available on PCF for users who need to compile their own custom codebase. The current versions of the GNU compilers installed on PCF are:

  • gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17)
  • g++ (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17)
  • gfortran (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17)

Also installed on PCF for those users who need to compile (and run) MPI code is OpenMPI. The current version of OpenMPI installed on PCF is:

  • OpenMPI 1.10.2
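As a quick illustration, a serial C code, a Fortran code, and an MPI C code could be compiled on PCF roughly as follows. The source file names are placeholders, and there is nothing PCF-specific about these commands beyond the compilers listed above.

 [youradusername@pcf-osg ~]$ gcc -O2 -o hello hello.c
 [youradusername@pcf-osg ~]$ gfortran -O2 -o hello_f hello.f90
 [youradusername@pcf-osg ~]$ mpicc -O2 -o mpi_hello mpi_hello.c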
 

Running Jobs

Changed:
<
<
PCF uses the HTCondor batch system to submit, run, and manage user jobs on high-throughput computing resources. PCF currently has access to the following resources:
>
>
PCF uses the HTCondor batch system to submit, run, and manage user jobs on the high-throughput computing resources it has access to. PCF currently has access to the following resources:
 
Changed:
<
<
In general, all computationally intensive jobs should be run only on these resources and not locally on PCF itself. Please try to reserve the use of the "+local=true" resources on PCF for development and testing purposes only. Do not run any test workloads interactively on PCF.
>
>
All computationally intensive jobs should be run only on these resources and not locally on PCF itself. Please try to reserve the use of the "+local=true" resources on PCF for development and testing purposes only. Do not run any test workloads interactively on PCF.
  Jobs can be submitted to PCF using the "condor_submit" command as follows:

Revision 5 - 2016/12/14 - Main.MartinKandes

Line: 1 to 1
 
META TOPICPARENT name="WebHome"
Changed:
<
<

UCSD Physics Computing Facility User Documentation

>
>

UCSD Physics Computing Facility (PCF) User Guide

 
Changed:
<
<
Important Note Please do not include use_x509userproxy in your job submit description files.

Job Submission

This section shows the basics needed to start submitting jobs through HTCondor. For more detailed instructions about using HTCondor, please see the link to the user manual below in the References section.

Submit File

In order to submit jobs through condor, you must first write a submit file. The name of the file is arbitrary but we will call it job.condor in this document.

Example submit file:

universe = vanilla
executable = test.sh
arguments = 300
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
log = logs/test.log
output = logs/test.out.$(Cluster).$(Process)
error = logs/test.err.$(Cluster).$(Process)
use_x509userproxy = True
notification = Never
queue

This example assumes job.condor and the test.sh executable are in the current directory, and a logs subdirectory is also already present in the current directory. Condor will create the test.log and send the job's stdout and stderr to test.out.$(Cluster).$(Process) and test.err.$(Cluster).$(Process) respectively.

Jobs can be submitted to condor using the following command:

condor_submit job.condor

Targeting Resources

The UCLHC setup allows you to choose a particular domain to run on. By default, jobs will run on the slots locally in the brick, as well as in the local batch system of the site. You can further choose to run outside to all UCs and also to the SDSC Comet cluster. These are each controlled by adding special booleans to the submit file. The following table lists the flags, their defaults, and descriptions:

flag default description
+local true run on the brick
+site_local true run in your own local site batch system
+sdsc false run at Comet
+uc false run at all other UCs

Example submit file to restrict jobs to only run at SDSC and not locally:

universe = vanilla
+local = false
+site_local = false
+sdsc = true
executable = test.sh
arguments = 300
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
log = logs/test.log
output = logs/test.out.$(Cluster).$(Process)
error = logs/test.err.$(Cluster).$(Process)
use_x509userproxy = True
notification = Never
queue

Querying Jobs

The following will show a list of your jobs in the queue:

 condor_q <username>

Screen dump:

[1627] jdost@uclhc-1 ~$ condor_q jdost


-- Submitter: uclhc-1.ps.uci.edu : <192.5.19.13:9615?sock=76988_ce0d_4> : uclhc-1.ps.uci.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  29.0   jdost           8/21 16:25   0+00:01:46 R  0   0.0  test.sh 300       
  29.1   jdost           8/21 16:25   0+00:01:46 R  0   0.0  test.sh 300       
  29.2   jdost           8/21 16:25   0+00:01:46 R  0   0.0  test.sh 300       
  29.3   jdost           8/21 16:25   0+00:01:46 R  0   0.0  test.sh 300       
  29.4   jdost           8/21 16:25   0+00:01:46 R  0   0.0  test.sh 300       

5 jobs; 0 completed, 0 removed, 0 idle, 5 running, 0 held, 0 suspended

Detailed classads can be dumped for a particular job with the -l flag:

condor_q -l $(Cluster).$(Process)

Canceling Jobs

You can cancel all of your own jobs at any time with the following:

condor_rm <username>

Or alternatively choose a specific job with the $(Cluster).$(Process) numbers, e.g.:

condor_rm 26.0

Important Note All submit description files must include the following attribute:

+ProjectName = "PCFOSGUCSD"

Transferring Output

Since xrootd is configured as a read-only system, you should use the condor file transfer mechanism to transfer job output back home to the brick.

The following example assumes the test.sh executable generates an output file called test.out. This is an example of a condor submit file to make condor transfer the output back to the user data area. The relevant attributes are in bold:

universe = vanilla
executable = test.sh
arguments = 300
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_output_files = test.out
transfer_output_remaps = "test.out = /data/uclhc/ucsds/user/jdost/test.out"
log = logs/test.log
output = logs/test.out.$(Cluster).$(Process)
error = logs/test.err.$(Cluster).$(Process)
use_x509userproxy = True
notification = Never
queue  

Note that transfer_output_remaps is used here because without it, by default condor will return the output file to the working directory condor_submit was run from.

>
>

About this Document

The UCSD Physics Computing Facility (PCF) provides access to multiple high-throughput computing resources available to students, faculty, and staff in the Department of Physics, as well as those in the broader scientific community at UCSD. PCF currently enables its users to run their scientific workloads on the CMS Tier 2 cluster (3.5k cores) housed in Mayer Hall, the Comet Supercomputer (48k cores) at SDSC, and the Open Science Grid. There is also the capability to link PCF to commercial cloud providers like Amazon Web Services (AWS) for users who wish to purchase their own computing resources on-the-fly. This document describes how to get an account on PCF and begin submitting jobs to its computing resources.

This document follows the general Open Science Grid (OSG) documentation conventions:

  1. A User Command Line is illustrated by a green box that displays a prompt:
     [user@client ~]$ 
  2. Lines in a file are illustrated by a yellow box that displays the desired lines in a file:
     priorities=1 

System Overview

PCF is a dual-socket login node with two Intel Xeon E5-2670 v3 processors, 132 GB of RAM, and 1 TB of hard drive disk space. The system is currently running CentOS 6.8 and uses the HTCondor batch system for user job submission and resource management.

System Status

  • Access to Comet is currently unavailable from PCF, but it will soon be made available again.

Accounts

You may obtain an account on PCF by contacting the Physics Help Desk at helpdesk@physics.ucsd.edu. They will need your UCSD Active Directory (AD) username to create the account. Once your account is created, you will be able to access PCF via SSH using your AD credentials (username/password).

 
Changed:
<
<

Running on Amazon Web Services (AWS)

>
>
 [user@client ~]$ ssh youradusername@pcf-osg.t2.ucsd.edu 
 Password: ENTERYOURADPASSWORDHERE

Compiling

The standard set of GNU compilers is available on PCF for users who need to compile their own custom codebase. The current version of the GNU compilers installed on PCF is gcc/g++/gfortran (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17). Also installed on PCF for those users who need to compile (and run) MPI code is OpenMPI. The current version of OpenMPI installed on PCF is Open MPI 1.10.2.

Running Jobs

PCF uses the HTCondor batch system to submit, run, and manage user jobs on high-throughput computing resources. PCF currently has access to the following resources:

In general, all computationally intensive jobs should be run only on these resources and not locally on PCF itself. Please try to reserve the use of the "+local=true" resources on PCF for development and testing purposes only. Do not run any test workloads interactively on PCF.

Jobs can be submitted to PCF using the "condor_submit" command as follows:

 [youradusername@pcf-osg ~]$ condor_submit yourjobdescriptionfile 
where yourjobdescriptionfile is the name of a UNIX format file containing special statements, resource specifications, and other commands used to construct an HTCondor job ClassAd for your job submission.
  condor_annex is a Perl-based script that utilizes the AWS command-line interface and other AWS services to orchestrate the delivery of HTCondor execute nodes to an HTCondor pool like the one available to you on pcf-osg.t2.ucsd.edu. If you would like to try running your jobs on AWS resources, please contact Marty Kandes at mkandes@sdsc.edu. Some backend configuration of your AWS account will be necessary to get started. However, once your AWS account is configured, you will be able to order instances on-demand with one command:
Line: 37 to 71
  --config-file $AWS_USER_CONFIG"
Added:
>
>

Storage

File Transfer

Software Packages

Additional Documentation

Important Note Please do not include use_x509userproxy in your job submit description files.

Job Submission

This section shows the basics needed to start submitting jobs through HTCondor. For more detailed instructions about using HTCondor, please see the link to the user manual below in the References section.

Submit File

In order to submit jobs through condor, you must first write a submit file. The name of the file is arbitrary but we will call it job.condor in this document.

Example submit file:

universe = vanilla
executable = test.sh
arguments = 300
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
log = logs/test.log
output = logs/test.out.$(Cluster).$(Process)
error = logs/test.err.$(Cluster).$(Process)
use_x509userproxy = True
notification = Never
queue

This example assumes job.condor and the test.sh executable are in the current directory, and a logs subdirectory is also already present in the current directory. Condor will create the test.log and send the job's stdout and stderr to test.out.$(Cluster).$(Process) and test.err.$(Cluster).$(Process) respectively.

Jobs can be submitted to condor using the following command:

condor_submit job.condor

Targeting Resources

The UCLHC setup allows you to choose a particular domain to run on. By default, jobs will run on the slots locally in the brick, as well as in the local batch system of the site. You can further choose to run outside to all UCs and also to the SDSC Comet cluster. These are each controlled by adding special booleans to the submit file. The following table lists the flags, their defaults, and descriptions:

flag default description
+local true run on the brick
+site_local true run in your own local site batch system
+sdsc false run at Comet
+uc false run at all other UCs

Example submit file to restrict jobs to only run at SDSC and not locally:

universe = vanilla
+local = false
+site_local = false
+sdsc = true
executable = test.sh
arguments = 300
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
log = logs/test.log
output = logs/test.out.$(Cluster).$(Process)
error = logs/test.err.$(Cluster).$(Process)
use_x509userproxy = True
notification = Never
queue

Querying Jobs

The following will show a list of your jobs in the queue:

 condor_q <username>

Screen dump:

[1627] jdost@uclhc-1 ~$ condor_q jdost


-- Submitter: uclhc-1.ps.uci.edu : <192.5.19.13:9615?sock=76988_ce0d_4> : uclhc-1.ps.uci.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  29.0   jdost           8/21 16:25   0+00:01:46 R  0   0.0  test.sh 300       
  29.1   jdost           8/21 16:25   0+00:01:46 R  0   0.0  test.sh 300       
  29.2   jdost           8/21 16:25   0+00:01:46 R  0   0.0  test.sh 300       
  29.3   jdost           8/21 16:25   0+00:01:46 R  0   0.0  test.sh 300       
  29.4   jdost           8/21 16:25   0+00:01:46 R  0   0.0  test.sh 300       

5 jobs; 0 completed, 0 removed, 0 idle, 5 running, 0 held, 0 suspended

Detailed classads can be dumped for a particular job with the -l flag:

condor_q -l $(Cluster).$(Process)

Canceling Jobs

You can cancel all of your own jobs at any time with the following:

condor_rm <username>

Or alternatively choose a specific job with the $(Cluster).$(Process) numbers, e.g.:

condor_rm 26.0

Important Note All submit description files must include the following attribute:

+ProjectName = "PCFOSGUCSD"

Transferring Output

Since xrootd is configured as a read-only system, you should use the condor file transfer mechanism to transfer job output back home to the brick.

The following example assumes the test.sh executable generates an output file called test.out. This is an example of a condor submit file to make condor transfer the output back to the user data area. The relevant attributes are in bold:

universe = vanilla
executable = test.sh
arguments = 300
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_output_files = test.out
transfer_output_remaps = "test.out = /data/uclhc/ucsds/user/jdost/test.out"
log = logs/test.log
output = logs/test.out.$(Cluster).$(Process)
error = logs/test.err.$(Cluster).$(Process)
use_x509userproxy = True
notification = Never
queue  

Note that transfer_output_remaps is used here because without it, by default condor will return the output file to the working directory condor_submit was run from.

 

References

Revision 4 - 2016/11/01 - Main.MartinKandes

Line: 1 to 1
 
META TOPICPARENT name="WebHome"

UCSD Physics Computing Facility User Documentation

Revision 3 - 2016/11/01 - Main.MartinKandes

Line: 1 to 1
 
META TOPICPARENT name="WebHome"

UCSD Physics Computing Facility User Documentation

Line: 16 to 16
 

Transferring Output

Since xrootd is configured as a read-only system, you should use the condor file transfer mechanism to transfer job output back home to the brick.

The following example assumes the test.sh executable generates an output file called test.out. This is an example of a condor submit file to make condor transfer the output back to the user data area. The relevant attributes are in bold:

universe = vanilla
executable = test.sh
arguments = 300
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_output_files = test.out
transfer_output_remaps = "test.out = /data/uclhc/ucsds/user/jdost/test.out"
log = logs/test.log
output = logs/test.out.$(Cluster).$(Process)
error = logs/test.err.$(Cluster).$(Process)
use_x509userproxy = True
notification = Never
queue  

Note that transfer_output_remaps is used here because without it, by default condor will return the output file to the working directory condor_submit was run from.

Added:
>
>

Running on Amazon Web Services (AWS)

condor_annex is a Perl-based script that utilizes the AWS command-line interface and other AWS services to orchestrate the delivery of HTCondor execute nodes to an HTCondor pool like the one available to you on pcf-osg.t2.ucsd.edu. If you would like to try running your jobs on AWS resources, please contact Marty Kandes at mkandes@sdsc.edu. Some backend configuration of your AWS account will be necessary to get started. However, once your AWS account is configured, you will be able to order instances on-demand with one command:

condor_annex \
   --project-id "$AWS_PROJECT_ID" \
   --region "$AWS_DEFAULT_REGION" \
   --central-manager "$AWS_CENTRAL_MANAGER" \
   --vpc "$AWS_VPC_ID" \
   --subnet "$AWS_SUBNET_ID" \
   --keypair "$AWS_KEY_PAIR_NAME" \
   --instances $NUMBER_OF_INSTANCES_TO_ORDER \
   --expiry "$AWS_LEASE_EXPIRATION" \
   --password-file "$CONDOR_PASSWORD_FILE" \
   --image-ids "$AWS_AMI_ID" \
   --instance-types "$AWS_INSTANCE_TYPE" \
   --spot-prices $AWS_SPOT_BID \
   --config-file "$AWS_USER_CONFIG"
 

References

Revision 2 - 2016/09/30 - Main.EdgarHernandez

Line: 1 to 1
 
META TOPICPARENT name="WebHome"

UCSD Physics Computing Facility User Documentation

Line: 8 to 8
Please do not include use_x509userproxy in your job submit description files.

Job Submission

This section shows the basics needed to start submitting jobs through HTCondor. For more detailed instructions about using HTCondor, please see the link to the user manual below in the References section.

Submit File

In order to submit jobs through condor, you must first write a submit file. The name of the file is arbitrary but we will call it job.condor in this document.

Example submit file:

universe = vanilla
executable = test.sh
arguments = 300
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
log = logs/test.log
output = logs/test.out.$(Cluster).$(Process)
error = logs/test.err.$(Cluster).$(Process)
use_x509userproxy = True
notification = Never
queue

This example assumes job.condor and the test.sh executable are in the current directory, and a logs subdirectory is also already present in the current directory. Condor will create the test.log and send the job's stdout and stderr to test.out.$(Cluster).$(Process) and test.err.$(Cluster).$(Process) respectively.

Jobs can be submitted to condor using the following command:

condor_submit job.condor

Targeting Resources

The UCLHC setup allows you to choose a particular domain to run on. By default, jobs will run on the slots locally in the brick, as well as in the local batch system of the site. You can further choose to run outside to all UCs and also to the SDSC Comet cluster. These are each controlled by adding special booleans to the submit file. The following table lists the flags, their defaults, and descriptions:

flag default description
+local true run on the brick
+site_local true run in your own local site batch system
+sdsc false run at Comet
+uc false run at all other UCs

Example submit file to restrict jobs to only run at SDSC and not locally:

universe = vanilla
+local = false
+site_local = false
+sdsc = true
executable = test.sh
arguments = 300
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
log = logs/test.log
output = logs/test.out.$(Cluster).$(Process)
error = logs/test.err.$(Cluster).$(Process)
use_x509userproxy = True
notification = Never
queue

Querying Jobs

The following will show a list of your jobs in the queue:

 condor_q <username>

Screen dump:

[1627] jdost@uclhc-1 ~$ condor_q jdost


-- Submitter: uclhc-1.ps.uci.edu : <192.5.19.13:9615?sock=76988_ce0d_4> : uclhc-1.ps.uci.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  29.0   jdost           8/21 16:25   0+00:01:46 R  0   0.0  test.sh 300       
  29.1   jdost           8/21 16:25   0+00:01:46 R  0   0.0  test.sh 300       
  29.2   jdost           8/21 16:25   0+00:01:46 R  0   0.0  test.sh 300       
  29.3   jdost           8/21 16:25   0+00:01:46 R  0   0.0  test.sh 300       
  29.4   jdost           8/21 16:25   0+00:01:46 R  0   0.0  test.sh 300       

5 jobs; 0 completed, 0 removed, 0 idle, 5 running, 0 held, 0 suspended

Detailed classads can be dumped for a particular job with the -l flag:

condor_q -l $(Cluster).$(Process)

Canceling Jobs

You can cancel all of your own jobs at any time with the following:

condor_rm <username>

Or alternatively choose a specific job with the $(Cluster).$(Process) numbers, e.g.:

condor_rm 26.0
Changed:
<
<
>
>
Important Note All submit description files must include the following attribute:
+ProjectName = "PCFOSGUCSD"
 

Transferring Output

Since xrootd is configured as a read-only system, you should use the condor file transfer mechanism to transfer job output back home to the brick.

The following example assumes the test.sh executable generates an output file called test.out. This is an example of a condor submit file to make condor transfer the output back to the user data area. The relevant attributes are in bold:

universe = vanilla
executable = test.sh
arguments = 300
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_output_files = test.out
transfer_output_remaps = "test.out = /data/uclhc/ucsds/user/jdost/test.out"
log = logs/test.log
output = logs/test.out.$(Cluster).$(Process)
error = logs/test.err.$(Cluster).$(Process)
use_x509userproxy = True
notification = Never
queue  

Note that transfer_output_remaps is used here because without it, by default condor will return the output file to the working directory condor_submit was run from.

Revision 1 - 2016/01/29 - Main.EdgarHernandez

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="WebHome"

UCSD Physics Computing Facility User Documentation

Important Note Please do not include use_x509userproxy in your job submit description files.

Job Submission

This section shows the basics needed to start submitting jobs through HTCondor. For more detailed instructions about using HTCondor, please see the link to the user manual below in the References section.

Submit File

In order to submit jobs through condor, you must first write a submit file. The name of the file is arbitrary but we will call it job.condor in this document.

Example submit file:

universe = vanilla
executable = test.sh
arguments = 300
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
log = logs/test.log
output = logs/test.out.$(Cluster).$(Process)
error = logs/test.err.$(Cluster).$(Process)
use_x509userproxy = True
notification = Never
queue

This example assumes job.condor and the test.sh executable are in the current directory, and a logs subdirectory is also already present in the current directory. Condor will create the test.log and send the job's stdout and stderr to test.out.$(Cluster).$(Process) and test.err.$(Cluster).$(Process) respectively.

Jobs can be submitted to condor using the following command:

condor_submit job.condor

Targeting Resources

The UCLHC setup allows you to choose a particular domain to run on. By default, jobs will run on the slots locally in the brick, as well as in the local batch system of the site. You can further choose to run outside to all UCs and also to the SDSC Comet cluster. These are each controlled by adding special booleans to the submit file. The following table lists the flags, their defaults, and descriptions:

flag default description
+local true run on the brick
+site_local true run in your own local site batch system
+sdsc false run at Comet
+uc false run at all other UCs

Example submit file to restrict jobs to only run at SDSC and not locally:

universe = vanilla
+local = false
+site_local = false
+sdsc = true
executable = test.sh
arguments = 300
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
log = logs/test.log
output = logs/test.out.$(Cluster).$(Process)
error = logs/test.err.$(Cluster).$(Process)
use_x509userproxy = True
notification = Never
queue

Querying Jobs

The following will show a list of your jobs in the queue:

 condor_q <username>

Screen dump:

[1627] jdost@uclhc-1 ~$ condor_q jdost


-- Submitter: uclhc-1.ps.uci.edu : <192.5.19.13:9615?sock=76988_ce0d_4> : uclhc-1.ps.uci.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  29.0   jdost           8/21 16:25   0+00:01:46 R  0   0.0  test.sh 300       
  29.1   jdost           8/21 16:25   0+00:01:46 R  0   0.0  test.sh 300       
  29.2   jdost           8/21 16:25   0+00:01:46 R  0   0.0  test.sh 300       
  29.3   jdost           8/21 16:25   0+00:01:46 R  0   0.0  test.sh 300       
  29.4   jdost           8/21 16:25   0+00:01:46 R  0   0.0  test.sh 300       

5 jobs; 0 completed, 0 removed, 0 idle, 5 running, 0 held, 0 suspended

Detailed classads can be dumped for a particular job with the -l flag:

condor_q -l $(Cluster).$(Process)

Canceling Jobs

You can cancel all of your own jobs at any time with the following:

condor_rm <username>

Or alternatively choose a specific job with the $(Cluster).$(Process) numbers, e.g.:

condor_rm 26.0

Transferring Output

Since xrootd is configured as a read-only system, you should use the condor file transfer mechanism to transfer job output back home to the brick.

The following example assumes the test.sh executable generates an output file called test.out. This is an example of a condor submit file to make condor transfer the output back to the user data area. The relevant attributes are in bold:

universe = vanilla
executable = test.sh
arguments = 300
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_output_files = test.out
transfer_output_remaps = "test.out = /data/uclhc/ucsds/user/jdost/test.out"
log = logs/test.log
output = logs/test.out.$(Cluster).$(Process)
error = logs/test.err.$(Cluster).$(Process)
use_x509userproxy = True
notification = Never
queue  

Note that transfer_output_remaps is used here because without it, by default condor will return the output file to the working directory condor_submit was run from.

References

<-- TWIKI VARIABLES 
  • Set CONDOR_VERS = v8.2
  • Set VO_UPPER = OSG
  • Set VO_LOWER = osg
  • Set UC_LOWER = ucsds
-->
 