The objective of this page is to document where things go when you submit a job on the UAF, and how to check and/or change user priorities.

Architecture diagram


Note: This architecture diagram is now wrong. We eliminated cmssubmit-r1. The schedds on the UAF now communicate directly with the frontend. Also, the actual frontend is now glidein-frontend-2.

There's a frontend monitor here:

There are also two factories that are monitored independently of each other here:

To see our activity you select UCSDCMS_cmspilot from the pulldown menu.

The text that follows has not been updated for this new architecture .... sigh!

How to figure things out

The above diagram shows that there are multiple components involved in getting a job started. Let's start here by explaining what they are.

  • You can think of "startd" as the actual batch slot that can run your job.
  • Think of the "schedd" as the queue of the batch system that you are submitting into. As you can see from the picture, there is a hierarchy of queues. Each UAF has its own local queue, and all of those local queues "forward" whatever they have queued into the queue on cmssubmit-r1.
  • The "frontend" is the input to the provisioning system. The way we use HTCondor, we make a distinction between provisioning resources and scheduling jobs. To be able to schedule a job, you first have to provision a resource into the "pool". The pool is implemented on glidein-collector. To summarize:
    • the frontend watches the schedd on cmssubmit-r1 for queued jobs. If it finds any, it tries to provision a batch slot. To do so, a glidein gets submitted to each of the sites you indicated in your DESIRED_sites statement.
    • when the glidein starts at a site, it runs a startd. That startd then calls back to the pool to announce its availability.
    • once there are resources (startd's) in the pool, the schedd can schedule whoever is first in line onto those resources.
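As a concrete (and entirely hypothetical) illustration of the DESIRED_sites statement mentioned above, a minimal submit description might look like the following. The executable and file names are made up; only the +DESIRED_Sites attribute is specific to this setup:

```shell
# write a minimal, hypothetical submit description to illustrate DESIRED_Sites
cat > /tmp/example_job.sub <<'EOF'
universe       = vanilla
executable     = run.sh
output         = job.out
error          = job.err
log            = job.log
+DESIRED_Sites = "T2_US_UCSD,T2_US_Nebraska,T1_US_FNAL"
queue
EOF
# you would then submit this with: condor_submit /tmp/example_job.sub
cat /tmp/example_job.sub
```

The `+` prefix in the submit description is how custom ClassAd attributes such as DESIRED_Sites are attached to a job.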

To get a job started on a startd, the files that define the job thus typically need to be copied at least twice: once from the UAF to cmssubmit-r1, and a second time from the latter to the startd. While this copying is happening, the job may go into the "H" (hold) state. It typically recovers from that after a few minutes. This also means that if you delete the directory for a job after you submit it, your job is guaranteed to go into the "H" state, because it uses files in that directory to communicate its state.

From all this, it should be obvious that getting a job started takes a few minutes. It thus makes no sense to have runtimes of only a few minutes per job; i.e., you should structure your work such that execution times per job are an hour or more. You also need to make sure that the sum of all the files that define your job doesn't become too large, because each job carries them with it. This includes executables, scripts, libraries, etc., but of course not the files you read via XRootd or the like.

How to figure out why your job isn't running

  • Start with condor_q on the UAF from which you submitted the job.
  • If your job is in the "H" state then see below for how to understand why it is "held".
  • If your job is in the "I" state then first check if you have properly set the "DESIRED_Sites" attribute:
    • condor_q -l integer-job-ID | grep DESIRED
    • this should give you something like: DESIRED_Sites = "T2_US_UCSD,T2_US_Nebraska,T2_US_Wisconsin,T2_US_MIT,T1_US_FNAL,T2_US_Purdue" Note that the quotes are crucial. Also, any typo means that your job may never run.
  • once you have ruled that out, you can try this command:
    • condor_q -analyze integer-job-ID
    • Note that the information you get from this is often too cryptic to understand what is going on.
  • you can do condor_q -help or google for condor_q to learn more.
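The DESIRED_Sites sanity check above can be scripted. In the sketch below the condor_q -l output is replaced by a sample line from a hypothetical job, since the point is only the shape of the attribute:

```shell
# sample line as it would appear in `condor_q -l <job-id>` output (hypothetical job)
classad='DESIRED_Sites = "T2_US_UCSD,T2_US_Nebraska,T1_US_FNAL"'
# the value must be present and double-quoted; a typo'd site name silently never matches
if printf '%s\n' "$classad" | grep -Eq '^DESIRED_Sites = "[A-Za-z0-9_,]+"$'; then
    echo "DESIRED_Sites looks well-formed"
else
    echo "DESIRED_Sites missing or malformed -- the job may never run"
fi
```

Note that a well-formed value only means the syntax is right; a misspelled site name still matches the pattern but will never match a real site.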

How to figure out why your job was held

Each job has a description of its state. You can query that description using condor_q -long jobId. When HTCondor holds a job, it records a (more or less cryptic) reason for doing so.

E.g. a very common reason for a job being held is that your proxy is about to expire. Here's what that would look like:

condor_q -l 27600.0 | grep -i reason
ReleaseReason = undefined
HoldReasonSubCode = 0
HoldReason = "Error from Proxy about to expire"
HoldReasonCode = 4

Similarly, you can also find out details such as when your proxy expires:

condor_q -l 27600.0 | grep -i x509
x509UserProxyVOName = "cms"
x509UserProxyExpiration = 1441937795

 date -d @1441937795
Thu Sep 10 19:16:35 PDT 2015

To avoid this particular problem, you will want to extend your proxy lifetime to 72h with "voms-proxy-init -H 72".
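The epoch timestamp that x509UserProxyExpiration reports can be decoded and checked with standard shell tools (GNU date syntax; the value below is the example from above, long since past):

```shell
expiry=1441937795                 # x509UserProxyExpiration from the example above
date -u -d "@$expiry"             # GNU date: convert epoch seconds to UTC
now=$(date +%s)
if [ "$expiry" -le "$now" ]; then
    echo "proxy already expired"
else
    echo "proxy valid for roughly $(( (expiry - now) / 3600 )) more hours"
fi
```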

How to figure out why your job is running so much longer than it should

Here we need your help to help you, because chances are this requires root privileges on the cluster to figure out. So here's what you should do:

  • pick a job that has been running for way too long and do as follows:

        condor_q -l 45638.0 | grep GridJobId
        # this will get you something like:
        # GridJobId = "condor 845562.0"
        # the last number is the job ID for this job on cmssubmit-r1, so next you do:
        condor_q -n -pool -l 845562.0 | grep MATCH_EXP_JOB_GLIDEIN_SiteWMS_Slot
        # this will give you something like:
        # MATCH_EXP_JOB_GLIDEIN_SiteWMS_Slot = "..."
        # at this point you send an email to t2support saying that you think the node
        # in question (e.g. cabinet-3-3-2) has a broken hadoop fuse mount, and provide
        # all the info from the above, i.e. the UAF your jobs were submitted from, an
        # example job ID there, the job ID on cmssubmit-r1 that it corresponds to, and
        # the slot name that you figured out above.
  • if I'm awake, or somebody else is, we will then try to fix the broken fuse mount. In the meantime, just leave the long-running job hanging and resubmit another one like it.
  • Here's what we will do:
    • log into the node in question
    • use pstree -p guser2 or equivalent to see what process ID the hanging job has.
    • cd /proc/processId
    • cat cmdline
    • and this will tell us what the job is doing, and why it's hung.
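The /proc inspection in the last few steps can be sketched like this. Since the real PID would come from pstree -p on the worker node, the snippet uses the current shell's own PID as a stand-in:

```shell
pid=$$   # stand-in: on the worker node this would be the PID found via pstree -p <user>
if [ -r "/proc/$pid/cmdline" ]; then
    # /proc/<pid>/cmdline is NUL-separated; translate NULs to spaces for readability
    tr '\0' ' ' < "/proc/$pid/cmdline"; echo
else
    # fallback for systems without a /proc filesystem
    ps -p "$pid" -o args=
fi
```

A process stuck in an uninterruptible read on a broken fuse mount will typically show the hadoop path it is blocked on in its cmdline or open file descriptors.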

How to query the schedd on cmssubmit-r1

  • condor_q -n -pool

Basically, all condor commands with -n and -pool added as above will talk to the schedd on cmssubmit-r1. Among the useful commands are:

  • condor_q -help
  • condor_q -analyze
  • condor_q -long

The -analyze and -long options are kind of heavy, so you should run them only against a single job ID; i.e., first do condor_q to figure out which job IDs you want to look at, then look at just one of them.

How to get the status of the pool

  • condor_status -pool

This will show you all the startd's connected to the pool at this moment. It will tell you which ones are busy and which ones are idle. An idle resource is one that can be used when a job shows up that is willing to run on it.

How to figure out the relative priority between different users that submit jobs from the UAF

You need special privileges to do this.

  • ssh condor@glidein-collector
  • condor_userprio -all

This then dumps out the priorities for different users based on the names HTCondor knows about.

You then need to figure out who is who based on the GUMS mapping to the DN. The DN will have the name in it. E.g.: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=mderdzin/CN=760843/CN=Mark Derdzinski is uscms5606
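Extracting that human-readable name from a DN is a one-liner, since it is the last CN= component (the DN below is the example from the text):

```shell
dn='/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=mderdzin/CN=760843/CN=Mark Derdzinski'
# greedy .* makes sed strip everything up to the *last* CN=, leaving just the name
printf '%s\n' "$dn" | sed 's/.*CN=//'
```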

If I wanted to change the relative priority of different users on the UAF then I'd use the commands:

  • condor_userprio -setfactor
  • condor_userprio -setprio

This affects who gets the next free CPU among all those queued up, and willing to run on that CPU.

E.g. if Joe and Jane are both willing to run at Caltech or UCSD, then their relative priority as set here will determine who gets the first free slot at either Caltech or UCSD. If Joe insists on UCSD while Jane is OK with both, then a free slot at Caltech will go to Jane irrespective of any settings here.

How to figure out what DN corresponds to which username inside the UCSD T2 cluster

Note, with "usernames in the UCSD T2 cluster" I mean the names that GUMS maps the DN to at each of the OSG-CEs of the cluster. This username is then used to submit to HTCondor, and thus the name under which the job is known inside the cluster.

The important ones here are those mapped to /DC=ch/DC=cern/OU=computers/CN=cmspilotXY/ where XY is a 2-digit integer, e.g. 01.

e.g. as of August 28th 2015, the DN /DC=ch/DC=cern/OU=computers/CN=cmspilot01/, which is used by the glidein frontend for the UAF, is mapped on our cluster to the username cp0035. So if I want to adjust the relative priority of submissions via the UAF against submissions via CRAB3 or WMAgent, I need to change the relative priority of username cp0035.

How to modify priorities on the cluster

You need superuser privileges to do this.

  • ssh root@osg-gw-1
  • condor_userprio -all

This gives the priorities of all recently queued or running users on the cluster.

condor_userprio -setfactor and condor_userprio -setprio are the two ways of changing the priority of the user cp0035. The first sets a multiplicative factor; the second resets the absolute priority to 1, the lowest number it can be.

HTCondor will start whatever job has the lowest priority number and meets the criteria for an open slot. So setting the priority to 1 is equivalent to resetting it to the best priority it can have; likewise, a small factor is the best priority factor you can have.

The absolute priority number is prio x factor.
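So with made-up numbers, a real priority of 0.5 and a priority factor of 100 combine like this:

```shell
# absolute priority = priority x factor (example numbers, not from a real pool)
prio=0.5
factor=100
awk -v p="$prio" -v f="$factor" 'BEGIN { printf "absolute priority = %.1f\n", p * f }'
```

A user with a large factor therefore always loses ties against a user with factor 1, no matter how far their real priority has decayed.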

-- FkW - 2015/08/28

Topic revision: r7 - 2017/06/13 - 22:37:35 - FkW