Difference: UCSDUserDocPCF (15 vs. 16)

Revision 16 - 2017/01/12 - Main.MartinKandes

Line: 1 to 1
 
META TOPICPARENT name="WebHome"
Line: 24 to 24
 
Changed:
<
<
While users may submit and run jobs locally on PCF itself, in general, all computationally intensive jobs should be run only on the larger computing resources, reserving PCF's local resources for development and testing purposes only.
>
>
While users may submit and run jobs locally on PCF itself, all computationally intensive jobs should generally be run only on the larger computing resources, reserving PCF's local resources for development and testing purposes only.
 

System Status

Line: 40 to 40
  Password: ENTERYOURADPASSWORDHERE
Changed:
<
<

Running Jobs

>
>

Managing Jobs with HTCondor

 

Job Submission

Line: 67 to 67
  request_disk = 8000000
 request_memory = 1024
 +ProjectName = "PCFOSGUCSD"
Changed:
<
<
 +local = true
 +site_local = false
 +sdsc = false
 +uc = false
>
>
 +local = TRUE
 +site_local = FALSE
 +sdsc = FALSE
 +uc = FALSE
  queue 10
Changed:
<
<
Let's break down this sample submit description file line by line to provide you with some background and guidance on how to construct your own submit description files. The first line
 # A sample HTCondor submit description file 
is simply a comment line in the submit description file. Any comments should be placed on their own line.
>
>
The first line here
 # A sample HTCondor submit description file 
is simply a comment line in the submit description file. Any comments in a submit description file should be placed on their own line.
  Next, the universe command defines a specific type of execution environment for your job.
 universe = vanilla 
All batch jobs submitted to PCF should use the default vanilla universe.
Changed:
<
<
The executable command specifies the name of the executable you want to run.
 executable = pi.sh 
Only one executable command should be specified in any submit description file. If no path or a relative path is used, then the executable is presumed to be relative to the current working directory of the user when the condor_submit command was issued. In this example, the executable is a bash shell script named bash_pi.sh, which uses a simple Monte Carlo method to estimate the value of Pi.
>
>
The executable command specifies the name of the executable you want to run.
 executable = bash_pi.sh 
Only one executable command should be specified in any submit description file. If no path or a relative path is used, then the executable is presumed to be relative to the current working directory of the user when the condor_submit command was issued. In this example, the executable is a bash shell script named bash_pi.sh, which uses a simple Monte Carlo method to estimate the value of Pi.
  To successfully run this example script, a user is required to provide three command-line arguments: (1) the size of integers to use in bytes, (2) the number of decimal places to round the estimate of Pi, and (3) the number of Monte Carlo samples. These command-line arguments are passed to the script in the submit description file via the arguments command.
 arguments = -b 8 -r 5 -s 10000 
Here, the arguments command indicates the script should use 8-byte integers, round the estimate of Pi to 5 decimal places, and take 10000 Monte Carlo samples.
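For illustration, here is a minimal sketch of what a script like bash_pi.sh might look like, assuming it parses its options with getopts; the actual script used on PCF may differ.

 #!/usr/bin/env bash
 # Illustrative sketch only; the actual bash_pi.sh used on PCF may differ.
 # Estimate Pi by sampling random points in a square and counting the
 # fraction that land inside the inscribed quarter circle.
 while getopts "b:r:s:" opt; do
   case "${opt}" in
     b) bytes="${OPTARG}"   ;;  # size of integers in bytes (unused in this sketch)
     r) round="${OPTARG}"   ;;  # decimal places to round the estimate
     s) samples="${OPTARG}" ;;  # number of Monte Carlo samples
   esac
 done
 hits=0
 for (( i = 0; i < samples; i++ )); do
   x=$(( RANDOM ))                    # RANDOM is uniform on 0..32767
   y=$(( RANDOM ))
   if (( x * x + y * y < 32768 * 32768 )); then
     hits=$(( hits + 1 ))
   fi
 done
 # bash arithmetic is integer-only, so do the final division in awk
 awk -v h="$hits" -v n="$samples" -v r="$round" \
     'BEGIN { printf "%.*f\n", r, 4 * h / n }'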
Changed:
<
<
The should_transfer_files command determines if HTCondor transfers files to and from the remote machine where your job runs.
 should_transfer_files = YES 
YES will cause HTCondor to always transfer input and output files for your jobs. However, total input and output data for each job using the HTCondor file transfer mechanism should be kept to less than 5 GB to allow the data to be successfully pulled from PCF by your jobs, processed on the remote machines where they will run, and then pushed back to your home directory on PCF. If your requirements exceed this 5 GB per job limit, please consult the PCF system administrators to assist you with setting up an alternative file transfer mechanism.
>
>
The should_transfer_files command determines if HTCondor transfers files to and from the remote machine where your job runs.
 should_transfer_files = YES 
YES will cause HTCondor to always transfer input and output files for your jobs. However, the total amount of input and output data for each job using the HTCondor file transfer mechanism should be kept to less than 5 GB to allow the data to be successfully pulled from PCF by your jobs, processed on the remote machines where they will run, and then pushed back to your home directory on PCF. If your requirements exceed this 5 GB per job limit, please consult the PCF system administrators to assist you with setting up an alternative file transfer mechanism.
  The when_to_transfer_output command determines when HTCondor transfers your job's output files back to PCF. If when_to_transfer_output is set to ON_EXIT, HTCondor will transfer the file listed in the output command back to PCF, as well as any other files created by the job in its remote scratch directory, but only when the job exits on its own.
 when_to_transfer_output = ON_EXIT 
If when_to_transfer_output is set to ON_EXIT_OR_EVICT, then the output files are transferred back to PCF any time the job leaves a remote machine, either because it exited on its own, or was evicted by HTCondor for any reason prior to job completion. Any output files transferred back to PCF upon eviction are then automatically sent back out again as input files if the job restarts. This option is intended for fault tolerant jobs which periodically save their own state and are designed to restart where they left off.
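For example, a fault-tolerant job that periodically checkpoints its own state might use the following setting instead (shown here only to illustrate the alternative described above):

 when_to_transfer_output = ON_EXIT_OR_EVICT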
Changed:
<
<
The output and error commands provide the paths and filenames used by HTCondor to capture any output and error messages your executable would normally write to stdout and stderr. Similarly, the log command is used to provide the path and filename for the HTCondor job event log, which is a chronological list of events that occur in as a job runs.
>
>
The output and error commands provide the paths and filenames used by HTCondor to capture any output and error messages your executable would normally write to stdout and stderr. Similarly, the log command is used to provide the path and filename for the HTCondor job event log, which is a chronological list of events that occur as a job runs.
 
 output = pi.out.$(ClusterId).$(ProcId)
 error = pi.err.$(ClusterId).$(ProcId)
 log = pi.log.$(ClusterId).$(ProcId) 
Changed:
<
<
Note that each of these commands in the sample submit description file uses the $(ClusterId) and $(ProcId) variables to define the filenames. This will append the $(ClusterId) and $(ProcId) number of each HTCondor job to their respective output, error, and job event log files. This is especially useful in separately tagging these output, error, and log files for each job when a submit description file is used to queue many jobs all at once.
>
>
Note that each of these commands in the sample submit description file uses the $(ClusterId) and $(ProcId) variables to define the filenames. This will append the $(ClusterId) and $(ProcId) number of each HTCondor job to their respective output, error, and job event log files. This is especially useful for tagging the output, error, and log files for an individual job when a submit description file is used to queue many jobs all at once.
  Next in the sample submit description file are the standard resource request commands: request_cpus, request_disk, and request_memory.
 request_cpus = 1 
 request_disk = 8000000
 request_memory = 1024 
Changed:
<
<
These commands tell HTCondor what resources in terms of CPU (number of cores), disk (by default in KiB), and memory (by default in MiB) are required to successfully run your job. It is important to provide this information in your submit description files as accurately as possible since HTCondor will use these requirements to match your job to a machine that can provide such resources. Otherwise, you job may fail when it is matched with and attempts to run on a machine without sufficient resources. All jobs submitted to PCF should contain these request commands. In general, you may assume that any job submitted to PCF can safely use up to 8 CPU-cores, 20 GB of disk space, and 2 GB of memory per CPU-core requested.
>
>
These commands tell HTCondor what resources are required to successfully run your job: CPUs (in number of cores), disk (in KiB by default), and memory (in MiB by default). It is important to provide this information in your submit description files as accurately as possible since HTCondor will use these requirements to match your job to a machine that can provide such resources. Otherwise, your job may fail when it is matched with and attempts to run on a machine without sufficient resources. All jobs submitted to PCF should contain these request commands. In general, you may assume that any job submitted to PCF can safely use up to 8 CPU-cores, 20 GB of disk space, and 2 GB of memory per CPU-core requested. Note: You can avoid using the default units of KiB and MiB for the request_disk and request_memory commands by appending the characters K (or KB), M (or MB), G (or GB), or T (or TB) to their numerical value to indicate the units to be used.
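For example, a hypothetical submit description file asking for the suggested per-job maximums could state its requests with explicit units like this:

 # 8 cores, 20 GB of disk, and 2 GB of memory per core requested
 request_cpus = 8
 request_disk = 20G
 request_memory = 16G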
 
Changed:
<
<
HTCondor allows users (and system administrators) to append custom attributes to any job at the time of submission. On PCF, these custom attributes are used to mark jobs for special routing and accounting purposes. For example,
 +ProjectName = "PCFOSGUCSD" 
is a job attribute used by the Open Science Grid (OSG) for tracking resource usage by group. All jobs submitted to PCF should contain this +ProjectName = "PCFOSGUCSD" attribute, unless directed otherwise.
>
>
HTCondor allows users (and system administrators) to append custom attributes to any job at the time of submission. On PCF, some of these custom attributes are used to mark jobs for special routing and accounting purposes. For example,
 +ProjectName = "PCFOSGUCSD" 
is a job attribute used by the Open Science Grid (OSG) for tracking resource usage by group. All jobs submitted to PCF, including yours, should contain this +ProjectName = "PCFOSGUCSD" attribute, unless directed otherwise.
  The next set of custom job attributes in the sample submit description file
Changed:
<
<
 +local = true
 +site_local = false
 +sdsc = false
 +uc = false 
are a set of boolean job routing flags that allow you to explicitly target where your jobs may run. Each one of these boolean flags is associated with one of the different computing resources accessible from PCF. When you set the value of one of these resource flags to true, you permit your jobs to run on the system associated with that flag. In contrast, when you set the value of the resource flag to false, you prevent your jobs from running on that system. The relationship between each job routing flag and computing resource is provided in the following table.
>
>
 +local = TRUE
 +site_local = FALSE
 +sdsc = FALSE
 +uc = FALSE 
are a set of boolean job routing flags that allow you to explicitly target where your jobs may run. Each one of these boolean flags is associated with one of the different computing resources accessible from PCF. When you set the value of one of these resource flags to TRUE, you permit your jobs to run on the system associated with that flag. In contrast, when you set the value of the resource flag to FALSE, you prevent your jobs from running on that system. The relationship between each job routing flag and computing resource is provided in the following table.
 
| *Job Routing Flag* | *Default Value* | *Computing Resource* | *Accessibility* |
Changed:
<
<
| +local | true | pcf-osg.t2.ucsd.edu | Open to all PCF users |
| +site_local | true | CMS Tier 2 Cluster | Open to all PCF users |
| +sdsc | false | Comet Supercomputer | Open only to PCF users with an XSEDE allocation on Comet |
| +uc | false | Open Science Grid | Open to all PCF users |
>
>
| +local | TRUE | pcf-osg.t2.ucsd.edu | Open to all PCF users |
| +site_local | TRUE | CMS Tier 2 Cluster | Open to all PCF users |
| +sdsc | FALSE | Comet Supercomputer | Open only to PCF users with an XSEDE allocation on Comet |
| +uc | FALSE | Open Science Grid | Open to all PCF users |
 
Added:
>
>
As such, we see here that the sample submit description file targets the job to run only locally on PCF itself.
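For example, to also allow your jobs to run opportunistically on the Open Science Grid, you could flip the +uc flag to TRUE while leaving the other flags at their default values:

 +local = TRUE
 +site_local = TRUE
 +sdsc = FALSE
 +uc = TRUE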
 
Added:
>
>
Finally, the sample submit description file ends with the queue command, which in the form shown here simply places an integer number of copies (10) of the job in the HTCondor queue upon submission. If no integer value is given with the queue command, the default value is 1. Every submit description file must contain at least one queue command.
A related command, requirements, lets you specify a boolean ClassAd expression that a machine must satisfy before HTCondor will match your job to it; it is not used in this sample submit description file.

It is important to note here that the name of this shell script was not chosen randomly. While other batch systems like SLURM and PBS use standard shell scripts annotated with scheduler directives, which both communicate the requirements of a batch job to the scheduler and describe how the job's executable should be run, HTCondor does not work this way. In general, an HTCondor submit description file separates the directives (or commands) to the scheduler from how the executable should be run (e.g., how it would look if run interactively from the command line). As such, it is often the case that HTCondor users will need to wrap their actual (payload) executable within a shell script, as shown here in this sample submit description file. Here, that executable is represented by job.x in the transfer_input_files command.
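As a sketch, such a wrapper might look like the following, where job.x is the hypothetical payload executable mentioned above:

 #!/usr/bin/env bash
 # Hypothetical wrapper script: HTCondor runs this wrapper as the job's
 # executable, and the wrapper launches the payload program job.x that
 # was transferred in via the transfer_input_files command.
 chmod +x ./job.x   # transferred input files may not arrive executable
 ./job.x "$@"       # forward the job's command-line arguments to the payload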

Changed:
<
<

Querying Job Status

>
>

Job Status

Once you submit a job to PCF, you can periodically check on its status by using the condor_q command. There will likely always be other user jobs in the queue besides your own. Therefore, in general, you will want to issue the command by providing your username as an argument.

 [youradusername@pcf-osg ~]$ condor_q youradusername

 -- Schedd: pcf-osg.t2.ucsd.edu : <169.228.130.75:9615?...
 ID        OWNER                  SUBMITTED    RUN_TIME   ST PRI SIZE CMD               
 16661.0   youradusername         1/12 14:51   0+00:00:04 R  0   0.0  pi.sh -b 8 -r 7 -s
 16661.1   youradusername         1/12 14:51   0+00:00:04 R  0   0.0  pi.sh -b 8 -r 7 -s
 16661.2   youradusername         1/12 14:51   0+00:00:04 R  0   0.0  pi.sh -b 8 -r 7 -s
 16661.3   youradusername         1/12 14:51   0+00:00:04 R  0   0.0  pi.sh -b 8 -r 7 -s
 16661.4   youradusername         1/12 14:51   0+00:00:03 R  0   0.0  pi.sh -b 8 -r 7 -s
 16661.5   youradusername         1/12 14:51   0+00:00:03 R  0   0.0  pi.sh -b 8 -r 7 -s
 16661.6   youradusername         1/12 14:51   0+00:00:03 R  0   0.0  pi.sh -b 8 -r 7 -s
 16661.7   youradusername         1/12 14:51   0+00:00:03 R  0   0.0  pi.sh -b 8 -r 7 -s
 16661.8   youradusername         1/12 14:51   0+00:00:03 R  0   0.0  pi.sh -b 8 -r 7 -s
 16661.9   youradusername         1/12 14:51   0+00:00:03 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.0   youradusername         1/12 14:51   0+00:00:03 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.1   youradusername         1/12 14:51   0+00:00:03 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.2   youradusername         1/12 14:51   0+00:00:03 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.3   youradusername         1/12 14:51   0+00:00:02 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.4   youradusername         1/12 14:51   0+00:00:02 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.5   youradusername         1/12 14:51   0+00:00:02 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.6   youradusername         1/12 14:51   0+00:00:02 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.7   youradusername         1/12 14:51   0+00:00:02 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.8   youradusername         1/12 14:51   0+00:00:01 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.9   youradusername         1/12 14:51   0+00:00:01 R  0   0.0  pi.sh -b 8 -r 7 -s

 20 jobs; 0 completed, 0 removed, 0 idle, 20 running, 0 held, 0 suspended 

This will limit the status information returned by condor_q to your jobs only. However, if there is a particular subset of your jobs you're interested in checking up on, you can also limit the status information by providing a specific job ClusterId as an argument to condor_q.

 [youradusername@pcf-osg ~]$ condor_q 16662

 -- Schedd: pcf-osg.t2.ucsd.edu : <169.228.130.75:9615?...
  ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
 16662.0   mkandes         1/12 14:51   0+00:01:53 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.1   mkandes         1/12 14:51   0+00:01:53 R  0   0.0  pi.sh -b 8 -r 7 -s 
 16662.2   mkandes         1/12 14:51   0+00:01:53 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.3   mkandes         1/12 14:51   0+00:01:52 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.4   mkandes         1/12 14:51   0+00:01:52 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.5   mkandes         1/12 14:51   0+00:01:52 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.6   mkandes         1/12 14:51   0+00:01:52 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.7   mkandes         1/12 14:51   0+00:01:52 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.8   mkandes         1/12 14:51   0+00:01:51 R  0   0.0  pi.sh -b 8 -r 7 -s
 16662.9   mkandes         1/12 14:51   0+00:01:51 R  0   0.0  pi.sh -b 8 -r 7 -s

 10 jobs; 0 completed, 0 removed, 0 idle, 10 running, 0 held, 0 suspended 

You can also inspect the full set of ClassAd attributes for an individual job by adding the -l (long) option to condor_q.

 [youradusername@pcf-osg ~]$ condor_q 16662.4 -l | less

 MATCH_EXP_JOB_GLIDEIN_Entry_Name = "Unknown"
 MATCH_EXP_JOB_GLIDEIN_Schedd = "Unknown"
 MaxHosts = 1
 MATCH_EXP_JOBGLIDEIN_ResourceName = "UCSD"
 User = "mkandes@pcf-osg.t2.ucsd.edu"
 EncryptExecuteDirectory = false
 MATCH_GLIDEIN_ClusterId = "Unknown"
 OnExitHold = false
 CoreSize = 0
 JOB_GLIDEIN_SiteWMS = "$$(GLIDEIN_SiteWMS:Unknown)"
 MATCH_GLIDEIN_Factory = "Unknown"
 MachineAttrCpus0 = 1
 WantRemoteSyscalls = false
 MyType = "Job"
 Rank = 0.0
 CumulativeSuspensionTime = 0
 MinHosts = 1
 MATCH_EXP_JOB_GLIDEIN_SiteWMS_Slot = "Unknown"
 PeriodicHold = false
 PeriodicRemove = false
 Err = "pi.err.16662.4"
 ProcId = 4

If one of your jobs remains idle longer than expected, you can also ask condor_q to analyze it with the -analyze option, which reports how the job's requirements fared against the machines available to run it. For example:
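 [youradusername@pcf-osg ~]$ condor_q -analyze 16662.4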

Job Removal

To remove one of your jobs from the queue, use the condor_rm command with the job's ID as an argument.

 [youradusername@pcf-osg ~]$ condor_rm 16662.4
 Job 16662.4 marked for removal

Checking on the remaining jobs in the cluster with condor_q confirms the job has been removed:

 [youradusername@pcf-osg ~]$ condor_q 16662

 -- Schedd: pcf-osg.t2.ucsd.edu : <169.228.130.75:9615?...
  ID        OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
 16662.0   mkandes         1/12 14:51   0+00:23:04 R  0   26.9 pi.sh -b 8 -r 7 -s
 16662.1   mkandes         1/12 14:51   0+00:23:04 R  0   26.9 pi.sh -b 8 -r 7 -s
 16662.2   mkandes         1/12 14:51   0+00:23:04 R  0   26.9 pi.sh -b 8 -r 7 -s
 16662.3   mkandes         1/12 14:51   0+00:23:03 R  0   26.9 pi.sh -b 8 -r 7 -s
 16662.5   mkandes         1/12 14:51   0+00:23:03 R  0   26.9 pi.sh -b 8 -r 7 -s
 16662.6   mkandes         1/12 14:51   0+00:23:03 R  0   26.9 pi.sh -b 8 -r 7 -s
 16662.7   mkandes         1/12 14:51   0+00:23:03 R  0   26.9 pi.sh -b 8 -r 7 -s
 16662.8   mkandes         1/12 14:51   0+00:23:02 R  0   26.9 pi.sh -b 8 -r 7 -s
 16662.9   mkandes         1/12 14:51   0+00:23:02 R  0   26.9 pi.sh -b 8 -r 7 -s

9 jobs; 0 completed, 0 removed, 0 idle, 9 running, 0 held, 0 suspended
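Note that condor_rm also accepts a bare ClusterId to remove every job in that cluster at once, or your username to remove all of your jobs:

 [youradusername@pcf-osg ~]$ condor_rm 16662
 [youradusername@pcf-osg ~]$ condor_rm youradusername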

 
Changed:
<
<

Removing Jobs

>
>

Job History

 
Changed:
<
<

Software Available

>
>

Available Software

  Environment modules provide users with an easy way to access different versions of various libraries, compilers, and other software. All user jobs running on computing resources accessible to PCF should have access to the
 