Difference: UCSDUserDocPCF (16 vs. 17)

Revision 17 2017/01/13 - Main.MartinKandes

Line: 1 to 1
 
META TOPICPARENT name="WebHome"
Line: 40 to 40
  Password: ENTERYOURADPASSWORDHERE
Changed:
<
<

Managing Jobs with HTCondor

>
>

Managing Jobs

 

Job Submission

Line: 50 to 50
  where job.condor is the name of a UNIX formatted plain ASCII file known as a submit description file. This file contains special commands, directives, expressions, statements, and variables used to specify information about your batch job to HTCondor, such as what executable to run, the files to use for standard input, standard output, and standard error, as well as the resources required to successfully run the job.
Changed:
<
<

Submit Description Files

>
>

Submit Description Files

  A sample HTCondor submit description file (bash_pi.condor) is shown below.
Line: 95 to 95
 
 request_cpus = 1 
 request_disk = 8000000
 request_memory = 1024 
Changed:
<
<
These commands tell HTCondor what resources --- CPUs in number of cores, disk in KiB (default), and memory in MiB (default) --- are required to successfully run your job. It is important to provide this information in your submit description files as accurately as possible since HTCondor will use these requirements to match your job to a machine that can provide such resources. Otherwise, your job may fail when it is matched with and attempts to run on a machine without sufficient resources. All jobs submitted to PCF should contain these request commands. In general, you may assume that any job submitted to PCF can safely use up to 8 CPU-cores, 20 GB of disk space, and 2 GB of memory per CPU-core requested. Note: You can avoid using the default units of KiB and MiB for the request_disk and request_memory commands by appending the characters K (or KB), M (or MB), G (or GB), or T (or TB) to their numerical value to indicate the units to be used.
>
>
These commands tell HTCondor what resources --- CPUs in number of cores, disk in KiB (default), and memory in MiB (default) --- are required to successfully run your job. It is important to provide this information in your submit description files as accurately as possible since HTCondor will use these requirements to match your job to a machine that can provide such resources. If this information is inaccurate, your job may fail when it is matched with and attempts to run on a machine without sufficient resources. All jobs submitted to PCF should contain these request commands. In general, you may assume that any job submitted to PCF can safely use up to 8 CPU-cores, 20 GB of disk space, and 2 GB of memory per CPU-core requested. Note: You can avoid using the default units of KiB and MiB for the request_disk and request_memory commands by appending the characters K (or KB), M (or MB), G (or GB), or T (or TB) to their numerical value to indicate the units to be used.
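As a sketch, the request lines above could be written with explicit unit suffixes rather than relying on the defaults (the values here are illustrative, roughly matching the sample file):

```
request_cpus   = 1
request_disk   = 8GB
request_memory = 1GB
```

With explicit suffixes there is no ambiguity about whether a bare number is interpreted as KiB or MiB.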
 
Changed:
<
<
HTCondor allows users (and system administrators) to append custom attributes to any job at the time of submission. On PCF, some of these custom attributes are used to mark jobs for special routing and accounting purposes. For example,
 +ProjectName = "PCFOSGUCSD" 
is a job attribute used by the Open Science Grid (OSG) for tracking resource usage by group. All jobs submitted to PCF, including yours, should contain this +ProjectName = "PCFOSGUCSD" attribute, unless directed otherwise.
>
>
HTCondor allows users (and system administrators) to append custom attributes to any job at the time of submission. On PCF, a set of custom attributes is used to mark jobs for special routing and accounting purposes. For example,
 +ProjectName = "PCFOSGUCSD" 
is a job attribute used by the Open Science Grid (OSG) for tracking resource usage by group. All jobs submitted to PCF, including yours, should contain this +ProjectName = "PCFOSGUCSD" attribute, unless directed otherwise.
  The next set of custom job attributes in the sample submit description file
 +local = TRUE

Line: 114 to 114
  As such, we see here that the sample submit description file targets the job to run only locally on PCF itself.
Changed:
<
<
Finally, the sample submit description file ends with the queue command, which in the form shown here simply places an integer number of copies (10) of the job in the HTCondor queue upon submission. If no integer value is given with the queue command, the default value is 1. Every submit description file must contain at least one queue command.
>
>
Finally, the sample submit description file ends with the queue command, which as shown here simply places an integer number of copies (10) of the job in the HTCondor queue upon submission. If no integer value is given with the queue command, the default value is 1. Every submit description file must contain at least one queue command.
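As a sketch, the final line of such a submit description file would simply read:

```
# place 10 copies of the job in the queue; a bare "queue" submits 1 copy
queue 10
```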
  requirements
Line: 122 to 122
 

Job Status

Changed:
<
<
Once you submit a job to PCF, you can periodically check on its status by using the condor_q command. There will likely always be other user jobs in the queue besides your own. Therefore, in general, you will want to issue the command by providing your username as an argument.
>
>
Once you submit a job to PCF, you can periodically check on its status by using the condor_q command. There will likely always be other user jobs in PCF's queue besides your own. Therefore, in general, you will want to issue the condor_q command by providing your username as an argument.
 
 [youradusername@pcf-osg ~]$ condor_q youradusername

 -- Schedd: pcf-osg.t2.ucsd.edu : <169.228.130.75:9615?...
 ID        OWNER                  SUBMITTED    RUN_TIME   ST PRI SIZE CMD               

Changed:
<
<
16661.0   youradusername         1/12 14:51   0+00:00:04 R  0   0.0  pi.sh -b 8 -r 7 -s
16661.1   youradusername         1/12 14:51   0+00:00:04 R  0   0.0  pi.sh -b 8 -r 7 -s
16661.2   youradusername         1/12 14:51   0+00:00:04 R  0   0.0  pi.sh -b 8 -r 7 -s
16661.3   youradusername         1/12 14:51   0+00:00:04 R  0   0.0  pi.sh -b 8 -r 7 -s
16661.4   youradusername         1/12 14:51   0+00:00:03 R  0   0.0  pi.sh -b 8 -r 7 -s
16661.5   youradusername         1/12 14:51   0+00:00:03 R  0   0.0  pi.sh -b 8 -r 7 -s
16661.6   youradusername         1/12 14:51   0+00:00:03 R  0   0.0  pi.sh -b 8 -r 7 -s
16661.7   youradusername         1/12 14:51   0+00:00:03 R  0   0.0  pi.sh -b 8 -r 7 -s
16661.8   youradusername         1/12 14:51   0+00:00:03 R  0   0.0  pi.sh -b 8 -r 7 -s
16661.9   youradusername         1/12 14:51   0+00:00:03 R  0   0.0  pi.sh -b 8 -r 7 -s
16662.0   youradusername         1/12 14:51   0+00:00:03 R  0   0.0  pi.sh -b 8 -r 7 -s
16662.1   youradusername         1/12 14:51   0+00:00:03 R  0   0.0  pi.sh -b 8 -r 7 -s
16662.2   youradusername         1/12 14:51   0+00:00:03 R  0   0.0  pi.sh -b 8 -r 7 -s
16662.3   youradusername         1/12 14:51   0+00:00:02 R  0   0.0  pi.sh -b 8 -r 7 -s
16662.4   youradusername         1/12 14:51   0+00:00:02 R  0   0.0  pi.sh -b 8 -r 7 -s
16662.5   youradusername         1/12 14:51   0+00:00:02 R  0   0.0  pi.sh -b 8 -r 7 -s
16662.6   youradusername         1/12 14:51   0+00:00:02 R  0   0.0  pi.sh -b 8 -r 7 -s
16662.7   youradusername         1/12 14:51   0+00:00:02 R  0   0.0  pi.sh -b 8 -r 7 -s
16662.8   youradusername         1/12 14:51   0+00:00:01 R  0   0.0  pi.sh -b 8 -r 7 -s
16662.9   youradusername         1/12 14:51   0+00:00:01 R  0   0.0  pi.sh -b 8 -r 7 -s
>
>
16663.0   youradusername         1/12 17:09   0+00:00:08 R  0   0.0  bash_pi.sh -b 8 -r
16663.1   youradusername         1/12 17:09   0+00:00:08 R  0   0.0  bash_pi.sh -b 8 -r
16663.2   youradusername         1/12 17:09   0+00:00:08 R  0   0.0  bash_pi.sh -b 8 -r
16663.3   youradusername         1/12 17:09   0+00:00:08 R  0   0.0  bash_pi.sh -b 8 -r
16663.4   youradusername         1/12 17:09   0+00:00:08 R  0   0.0  bash_pi.sh -b 8 -r
16663.5   youradusername         1/12 17:09   0+00:00:08 R  0   0.0  bash_pi.sh -b 8 -r
16663.6   youradusername         1/12 17:09   0+00:00:07 R  0   0.0  bash_pi.sh -b 8 -r
16663.7   youradusername         1/12 17:09   0+00:00:07 R  0   0.0  bash_pi.sh -b 8 -r
16663.8   youradusername         1/12 17:09   0+00:00:07 R  0   0.0  bash_pi.sh -b 8 -r
16663.9   youradusername         1/12 17:09   0+00:00:07 R  0   0.0  bash_pi.sh -b 8 -r
16664.0   youradusername         1/12 17:09   0+00:00:00 I  0   0.0  bash_pi.sh -b 8 -r
16664.1   youradusername         1/12 17:09   0+00:00:00 I  0   0.0  bash_pi.sh -b 8 -r
16664.2   youradusername         1/12 17:09   0+00:00:00 I  0   0.0  bash_pi.sh -b 8 -r
16664.3   youradusername         1/12 17:09   0+00:00:00 I  0   0.0  bash_pi.sh -b 8 -r
16664.4   youradusername         1/12 17:09   0+00:00:00 I  0   0.0  bash_pi.sh -b 8 -r
16664.5   youradusername         1/12 17:09   0+00:00:00 I  0   0.0  bash_pi.sh -b 8 -r
16664.6   youradusername         1/12 17:09   0+00:00:00 I  0   0.0  bash_pi.sh -b 8 -r
16664.7   youradusername         1/12 17:09   0+00:00:00 I  0   0.0  bash_pi.sh -b 8 -r
16664.8   youradusername         1/12 17:09   0+00:00:00 I  0   0.0  bash_pi.sh -b 8 -r
16664.9   youradusername         1/12 17:09   0+00:00:00 I  0   0.0  bash_pi.sh -b 8 -r
 
Changed:
<
<
20 jobs; 0 completed, 0 removed, 0 idle, 20 running, 0 held, 0 suspended
>
>
20 jobs; 0 completed, 0 removed, 10 idle, 10 running, 0 held, 0 suspended
 
Changed:
<
<
This will limit the status information returned condor_q to your user jobs only. However, if there is a particular subset of your jobs you're interested in checking up on, you can also limit the status information by providing the specific job ClusterId as an argument to condor_q.
>
>
This will limit the job status information returned by condor_q to your jobs only. You may also limit the job status information to a particular subset of your jobs by providing that subset's ClusterId as an argument to condor_q.
 
Changed:
<
<
 [youradusername@pcf-osg ~]$ condor_q 16662

>
>
 [youradusername@pcf-osg ~]$ condor_q 16663

  -- Schedd: pcf-osg.t2.ucsd.edu : <169.228.130.75:9615?...
 ID        OWNER                  SUBMITTED    RUN_TIME   ST PRI SIZE CMD
Changed:
<
<
16662.0   mkandes                1/12 14:51   0+00:01:53 R  0   0.0  pi.sh -b 8 -r 7 -s
16662.1   mkandes                1/12 14:51   0+00:01:53 R  0   0.0  pi.sh -b 8 -r 7 -s
16662.2   mkandes                1/12 14:51   0+00:01:53 R  0   0.0  pi.sh -b 8 -r 7 -s
16662.3   mkandes                1/12 14:51   0+00:01:52 R  0   0.0  pi.sh -b 8 -r 7 -s
16662.4   mkandes                1/12 14:51   0+00:01:52 R  0   0.0  pi.sh -b 8 -r 7 -s
16662.5   mkandes                1/12 14:51   0+00:01:52 R  0   0.0  pi.sh -b 8 -r 7 -s
16662.6   mkandes                1/12 14:51   0+00:01:52 R  0   0.0  pi.sh -b 8 -r 7 -s
16662.7   mkandes                1/12 14:51   0+00:01:52 R  0   0.0  pi.sh -b 8 -r 7 -s
16662.8   mkandes                1/12 14:51   0+00:01:51 R  0   0.0  pi.sh -b 8 -r 7 -s
16662.9   mkandes                1/12 14:51   0+00:01:51 R  0   0.0  pi.sh -b 8 -r 7 -s
>
>
16663.0   youradusername         1/12 17:09   0+00:03:25 R  0   0.0  bash_pi.sh -b 8 -r
16663.1   youradusername         1/12 17:09   0+00:03:25 R  0   0.0  bash_pi.sh -b 8 -r
16663.2   youradusername         1/12 17:09   0+00:03:25 R  0   0.0  bash_pi.sh -b 8 -r
16663.3   youradusername         1/12 17:09   0+00:03:25 R  0   0.0  bash_pi.sh -b 8 -r
16663.4   youradusername         1/12 17:09   0+00:03:25 R  0   0.0  bash_pi.sh -b 8 -r
16663.5   youradusername         1/12 17:09   0+00:03:25 R  0   0.0  bash_pi.sh -b 8 -r
16663.6   youradusername         1/12 17:09   0+00:03:24 R  0   0.0  bash_pi.sh -b 8 -r
16663.7   youradusername         1/12 17:09   0+00:03:24 R  0   0.0  bash_pi.sh -b 8 -r
16663.8   youradusername         1/12 17:09   0+00:03:24 R  0   0.0  bash_pi.sh -b 8 -r
16663.9   youradusername         1/12 17:09   0+00:03:24 R  0   0.0  bash_pi.sh -b 8 -r
  10 jobs; 0 completed, 0 removed, 0 idle, 10 running, 0 held, 0 suspended
Added:
>
>
The status of each submitted job in the queue is provided in the column labeled ST in the standard output of the condor_q command. In general, you will only find 3 different status codes in this column, namely:

  • R: The job is currently running.
  • I: The job is idle. It is not running right now, because it is waiting for a machine to become available.
  • H: The job is in the held state. In the held state, the job will not be scheduled to run until it is released.

If your job is running (R), you probably don't have anything to worry about. However, if the job has been idle (I) for an unusually long period of time or is found in the held (H) state, you may want to investigate why your job is not running before contacting the PCF system administrators for additional help.

If you find your job in the held state (H),

 [youradusername@pcf-osg ~]$ condor_q 16663.3

 -- Schedd: pcf-osg.t2.ucsd.edu : <169.228.130.75:9615?...
 ID        OWNER                  SUBMITTED    RUN_TIME   ST PRI SIZE CMD               
 16663.3   youradusername         1/12 17:09   0+00:56:56 H  0   26.9 bash_pi.sh -b 8 -r

 1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended 

you can check the hold reason by appending the -held option to the condor_q command.

 [youradusername@pcf-osg ~]$ condor_q 16663.3 -held

 -- Schedd: pcf-osg.t2.ucsd.edu : <169.228.130.75:9615?...
 ID       OWNER                  HELD_SINCE HOLD_REASON                                
 16663.3  youradusername         1/12 18:06 via condor_hold (by user youradusername)          

 1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended 

In this case, you placed the job on hold yourself with the condor_hold command. However, if you find a more unusual HOLD_REASON and you are unable to resolve the issue yourself, please contact the PCF system administrators to help you investigate the problem.
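If the job was held via condor_hold, you can let it run again with the companion condor_release command (the job ID here is illustrative):

```
 [youradusername@pcf-osg ~]$ condor_release 16663.3
```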

If you find that your job has been sitting idle (I) for an unusually long period of time, you can run condor_q with the -analyze (or -better-analyze) option to attempt to diagnose the problem.

 [youradusername@pcf-osg ~]$ condor_q -analyze 16250.0

-- Schedd: pcf-osg.t2.ucsd.edu : <169.228.130.75:9615?...
User priority for youradusername@pcf-osg.t2.ucsd.edu is not available, attempting to analyze without it.
---
16250.000:  Run analysis summary.  Of 20 machines,
     19 are rejected by your job's requirements 
      1 reject your job because of their own requirements 
      0 match and are already running your jobs 
      0 match but are serving other users 
      0 are available to run your job
	No successful match recorded.
	Last failed match: Thu Jan 12 18:45:36 2017

	Reason for last match failure: no match found 

The Requirements expression for your job is:

    ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) &&
    ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) &&
    ( TARGET.Cpus >= RequestCpus ) && ( TARGET.HasFileTransfer )


Suggestions:

    Condition                         Machines Matched    Suggestion
    ---------                         ----------------    ----------
1   ( TARGET.Memory >= 16384 )        1                    
2   ( TARGET.Cpus >= 8 )              1                    
3   ( TARGET.Arch == "X86_64" )       20                   
4   ( TARGET.OpSys == "LINUX" )       20                   
5   ( TARGET.Disk >= 1 )              20                   
6   ( TARGET.HasFileTransfer )        20                   

The following attributes should be added or modified:

Attribute               Suggestion
---------               ----------
local                   change to undefined 
  [youradusername@pcf-osg ~]$ condor_q 16663.4 -l | less
Line: 302 to 374
 module swap foo1 foo2    switches loaded module foo1 with module foo2
 module unload foo        reverses all changes to the environment made by previously loading module foo
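As a sketch, a typical sequence of these module commands might look like the following (the module names foo1 and foo2 are illustrative, not actual modules on PCF):

```
 [youradusername@pcf-osg ~]$ module load foo1
 [youradusername@pcf-osg ~]$ module swap foo1 foo2
 [youradusername@pcf-osg ~]$ module unload foo2
```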
Added:
>
>

Special Instructions

Running Jobs on Comet

Running Jobs on Amazon Web Services

  condor_annex is a Perl-based script that utilizes the AWS command-line interface and other AWS services to orchestrate the delivery of HTCondor execute nodes to an HTCondor pool like the one available to you on pcf-osg.t2.ucsd.edu. If you would like to try running your jobs on AWS resources, please contact Marty Kandes at mkandes@sdsc.edu. Some backend configuration of your AWS account will be necessary to get started. However, once your AWS account is configured, you will be able to order instances on-demand with one command:
Line: 322 to 397
  --config-file $AWS_USER_CONFIG"
Added:
>
>

Contact Information

 
Added:
>
>
  • Physics Help Desk
  • PCF System Administrators:
 

Additional Documentation

Added:
>
>
 
Deleted:
<
<
  • pi.condor: A sample HTCondor submit description file

  • pi.sh: A bash script that estimates the value of Pi via the Monte Carlo method.
 
Changed:
<
<
  • bash_pi.sh: A bash script uses a simple Monte Carlo method to estimate the value of Pi
>
>
  • bash_pi.sh: A bash script that uses a simple Monte Carlo method to estimate the value of Pi
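The attached bash_pi.sh is not reproduced on this page, but the Monte Carlo idea behind it can be sketched as follows. This is an illustrative stand-in, not the attachment itself (in particular, the real script's -b and -r options are not modeled here): it samples random points in a square and counts how many fall inside the inscribed quarter circle.

```shell
#!/usr/bin/env bash
# Illustrative Monte Carlo estimate of Pi, in the spirit of bash_pi.sh.
samples=20000
inside=0
max=32767   # bash's $RANDOM builtin returns an integer in 0..32767
for ((i = 0; i < samples; i++)); do
    x=$RANDOM
    y=$RANDOM
    # count the point if it falls inside the quarter circle of radius 32767
    if (( x * x + y * y <= max * max )); then
        (( inside++ ))
    fi
done
# Pi is approximately 4 * (points inside) / (total points);
# awk handles the floating-point division
pi_estimate=$(awk -v n="$samples" -v k="$inside" 'BEGIN { printf "%.4f", 4 * k / n }')
echo "Estimated Pi = $pi_estimate"
```

With 20000 samples the estimate typically lands within a few hundredths of 3.1416; more samples tighten the estimate at the cost of runtime.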
 
META FILEATTACHMENT attachment="bash_pi.condor" attr="" comment="A sample HTCondor submit description file" date="1484096462" name="bash_pi.condor" path="bash_pi.condor" size="467" stream="bash_pi.condor" tmpFilename="/tmp/aq6lkzQfzs" user="MartinKandes" version="1"
META FILEATTACHMENT attachment="bash_pi.sh" attr="" comment="A bash script uses a simple Monte Carlo method to estimate the value of Pi" date="1484096507" name="bash_pi.sh" path="bash_pi.sh" size="1756" stream="bash_pi.sh" tmpFilename="/tmp/NEyD4BXYUQ" user="MartinKandes" version="1"
 