Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
Line: 8 to 8 | ||||||||
![]() | ||||||||
Added: | ||||||||
> > | Note: This architecture diagram is now wrong. We eliminated cmssubmit-r1. The schedds on the UAF now communicate directly with the frontend. Also, the actual frontend is now glidein-frontend-2. There's a frontend monitor here: http://glidein-frontend-2.t2.ucsd.edu:8319/vofrontend/monitor/frontendGroupGraphStatusNow.html There are also two factories that are monitored independently of each other here: http://gfactory-1.t2.ucsd.edu/factory/monitor/factoryStatusNow.html http://glidein.grid.iu.edu/factory/monitor/factoryStatusNow.html To see our activity, select UCSDCMS_cmspilot from the pulldown menu. The text that follows has not been updated for this new architecture .... sigh! | |||||||
How to figure things out
The above diagram shows that there are multiple components involved in getting a job started. Let's start here by explaining what they are. |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
Line: 148 to 148 | ||||||||
This gives the priorities of all recently queued or running users on the cluster. | ||||||||
Changed: | ||||||||
< < |
| |||||||
> > |
| |||||||
are the two ways of changing the priority of the user cp0035. The first sets a multiplicative factor, the second resets the absolute to 1, the lowest number it can be. |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
Changed: | ||||||||
< < | The objective of this page is to document where things go when you submit a job on the UAF, and how to check and/or change the priorities of people. | |||||||
> > | The objective of this page is to document where things go when you submit a job on the UAF, and how to check and/or change the priorities of people. | |||||||
Architecture diagram | ||||||||
Changed: | ||||||||
< < | ![]() | |||||||
> > | ![]() | |||||||
How to figure things out | ||||||||
Changed: | ||||||||
< < | The above diagram shows that there are multiple components involved in getting a job started. Let's start here by explaining what they are. | |||||||
> > | The above diagram shows that there are multiple components involved in getting a job started. Let's start here by explaining what they are. | |||||||
| ||||||||
Changed: | ||||||||
< < |
| |||||||
> > |
| |||||||
| ||||||||
Changed: | ||||||||
< < | How to query the schedd on cmssubmit-r1
| |||||||
> > | To get a job started on a startd, the files that define the job thus need to get copied at least twice, typically: once from the UAF to cmssubmit-r1, and a second time from the latter to the startd. While this copying is happening, the job may go into the "H" (hold) state. It will typically recover from that after a few minutes or so. This also means that if you delete the directory for a job after you submit it, your job is guaranteed to go into the "H" state, because it uses files in that directory to communicate the state it is in. | |||||||
Changed: | ||||||||
< < | The -analyze and the -long are kind of heavy. So you should run them only against a single job ID.
I.e. first do condor_q to figure out what the job IDs are you want to look at, then look at just one of them.
How to get the status of the pool
| |||||||
> > | From all this, it should be obvious that getting a job started takes a few minutes or so. It thus makes no sense to have runtimes for a job of only a few minutes.
I.e. you should structure the work you do such that execution times per job are an hour or more. You also need to make sure that the sum of all files that define your job
doesn't become too large, because each job carries them with it. This includes executables, scripts, libraries, etc., but of course not the files you read via XRootd or the like.
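To put rough numbers on the argument above, here is a minimal Python sketch of the fraction of wall time a job spends doing useful work. The 5-minute startup overhead is an assumption standing in for "a few minutes or so", and the function name `useful_fraction` is hypothetical.

```python
def useful_fraction(runtime_min: float, overhead_min: float = 5.0) -> float:
    """Fraction of total wall time spent executing the job.

    overhead_min is an ASSUMED fixed startup cost (file copies, matchmaking);
    the page only says "a few minutes or so".
    """
    if runtime_min < 0 or overhead_min < 0:
        raise ValueError("times must be non-negative")
    total = runtime_min + overhead_min
    return runtime_min / total if total > 0 else 0.0


if __name__ == "__main__":
    # Compare a very short job with jobs of one and three hours.
    for runtime in (3, 60, 180):
        print(f"{runtime:4d} min job: {useful_fraction(runtime):.1%} useful")
```

With these assumed numbers, a job that runs only a few minutes spends most of its wall time on startup, while a job of an hour or more spends only a small share on it.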
How to figure out why your job isn't running
| |||||||
How to figure out why your job was held | ||||||||
Changed: | ||||||||
< < | Each job has a description of its state. You can query that description using condor_q -long jobId. When condor holds a job, it records a (more or less cryptic) reason for doing so. | |||||||
> > | Each job has a description of its state. You can query that description using condor_q -long jobId. When condor holds a job, it records a (more or less cryptic) reason for doing so. | |||||||
Changed: | ||||||||
< < | E.g. a very common reason for a job being held is that your proxy is about to expire. Here's what that would look like: | |||||||
> > | E.g. a very common reason for a job being held is that your proxy is about to expire. Here's what that would look like: | |||||||
condor_q -l 27600.0 | grep -i reason | ||||||||
Line: 75 to 66 | ||||||||
To avoid this particular problem, you will want to extend your proxy lifetime to 72h with "voms-proxy-init -H 72". | ||||||||
Added: | ||||||||
> > | How to query the schedd on cmssubmit-r1
How to get the status of the pool
| |||||||
How to figure out the relative priority between different users that submit jobs from the UAF | ||||||||
Added: | ||||||||
> > | You need special privileges to do this. | |||||||
| ||||||||
Changed: | ||||||||
< < | You then need to figure out who is who based on the GUMS mapping to DN. The DN will have the name in it. E.g.: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=mderdzin/CN=760843/CN=Mark Derdzinski is uscms5606 | |||||||
> > | You then need to figure out who is who based on the GUMS mapping to DN. The DN will have the name in it. E.g.: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=mderdzin/CN=760843/CN=Mark Derdzinski is uscms5606 | |||||||
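Reading the name out of a DN by eye gets tedious for long lists. The following Python sketch pulls the last CN component out of a slash-separated DN. The helper names `split_dn` and `last_cn` are hypothetical. The code also assumes that a segment without "=" belongs to the previous value, which is what the pilot DNs on this page (CN=cmspilot01/vocms080.cern.ch) suggest.

```python
from typing import List, Optional, Tuple


def split_dn(dn: str) -> List[Tuple[str, str]]:
    """Split a slash-separated DN into (attribute, value) pairs.

    ASSUMPTION: a segment without '=' continues the previous value,
    as in CN=cmspilot01/vocms080.cern.ch.
    """
    pairs: List[Tuple[str, str]] = []
    for segment in dn.strip().split("/"):
        if not segment:
            continue  # skip the empty piece before the leading slash
        if "=" in segment:
            key, value = segment.split("=", 1)
            pairs.append((key, value))
        elif pairs:
            key, value = pairs[-1]
            pairs[-1] = (key, value + "/" + segment)
    return pairs


def last_cn(dn: str) -> Optional[str]:
    """Return the value of the last CN component, or None if there is none."""
    cns = [value for key, value in split_dn(dn) if key == "CN"]
    return cns[-1] if cns else None


if __name__ == "__main__":
    dn = "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=mderdzin/CN=760843/CN=Mark Derdzinski"
    print(last_cn(dn))
```

For personal certificates the last CN is usually the human-readable name. For the pilot DNs it is the service name, so the result still has to be matched against the GUMS mapping by hand.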
If I wanted to change the relative priority of different users on the UAF then I'd use the commands:
| ||||||||
Line: 94 to 101 | ||||||||
This affects who gets the next free CPU among all those queued up, and willing to run on that CPU. | ||||||||
Changed: | ||||||||
< < | E.g. if Joe and Jane both are willing to run at Caltech or UCSD, then the relative priority of them as set here will determine who gets the first free slot at either Caltech or UCSD. If Joe insists on UCSD while Jane is ok with both, then a free slot at Caltech will go to Jane irrespective of any settings here. | |||||||
> > | E.g. if Joe and Jane both are willing to run at Caltech or UCSD, then the relative priority of them as set here will determine who gets the first free slot at either Caltech or UCSD. If Joe insists on UCSD while Jane is ok with both, then a free slot at Caltech will go to Jane irrespective of any settings here. | |||||||
How to figure out what DN corresponds to which username inside the UCSD T2 cluster | ||||||||
Changed: | ||||||||
< < | Note, with "usernames in the UCSD T2 cluster" I mean the names that GUMS maps the DN to at each of the OSG-CEs of the cluster. This username is then used to submit to HTCondor, and thus the name under which the job is known inside the cluster. | |||||||
> > | Note, with "usernames in the UCSD T2 cluster" I mean the names that GUMS maps the DN to at each of the OSG-CEs of the cluster. This username is then used to submit to HTCondor, and thus the name under which the job is known inside the cluster. | |||||||
https://sentry.t2.ucsd.edu/showmap/index.txt | ||||||||
Changed: | ||||||||
< < | The important ones here are those mapped to /DC=ch/DC=cern/OU=computers/CN=cmspilotXY/vocms080.cern.ch where XY is a two-digit integer, e.g. 01. | |||||||
> > | The important ones here are those mapped to /DC=ch/DC=cern/OU=computers/CN=cmspilotXY/vocms080.cern.ch where XY is a two-digit integer, e.g. 01. | |||||||
Changed: | ||||||||
< < | E.g. as of August 28th 2015, the DN /DC=ch/DC=cern/OU=computers/CN=cmspilot01/vocms080.cern.ch, which is used by the glidein frontend for the UAF, is mapped on our cluster to the username cp0035. So if I want to adjust the priority of submissions via the UAF relative to submissions via CRAB3 or WMAgent, I need to change the relative priority of username cp0035. | |||||||
> > | E.g. as of August 28th 2015, the DN /DC=ch/DC=cern/OU=computers/CN=cmspilot01/vocms080.cern.ch, which is used by the glidein frontend for the UAF, is mapped on our cluster to the username cp0035. So if I want to adjust the priority of submissions via the UAF relative to submissions via CRAB3 or WMAgent, I need to change the relative priority of username cp0035. | |||||||
How to modify priorities on the cluster | ||||||||
Added: | ||||||||
> > | You need superuser privileges to do this. | |||||||
| ||||||||
Line: 124 to 125 | ||||||||
| ||||||||
Changed: | ||||||||
< < | are the two ways of changing the priority of the user cp0035. The first sets a multiplicative factor, the second resets the absolute to 1, the lowest number it can be. | |||||||
> > | are the two ways of changing the priority of the user cp0035. The first sets a multiplicative factor, the second resets the absolute to 1, the lowest number it can be. | |||||||
Changed: | ||||||||
< < | HTCondor will start whatever job has the lowest priority number and meets the criteria for an open slot. So setting prio to 1 is equivalent to resetting it to the best prio it can have. Setting the factor to a small integer gives the best priority factor you can have. | |||||||
> > | HTCondor will start whatever job has the lowest priority number and meets the criteria for an open slot. So setting prio to 1 is equivalent to resetting it to the best prio it can have. Setting the factor to a small integer gives the best priority factor you can have. | |||||||
The absolute priority number is prio x factor. |
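The "prio x factor" rule can be illustrated with a small Python sketch. The numbers for the two users below are made up for illustration, the function names are hypothetical, and the sketch ignores the slot-matching criteria entirely. It only shows that the lowest product wins.

```python
from typing import Dict, Tuple


def absolute_priority(prio: float, factor: float) -> float:
    """Absolute priority number as described on this page: prio x factor."""
    return prio * factor


def next_to_run(users: Dict[str, Tuple[float, float]]) -> str:
    """Pick the user with the lowest absolute priority number.

    `users` maps username -> (prio, factor). Slot requirements are ignored.
    """
    return min(users, key=lambda name: absolute_priority(*users[name]))


if __name__ == "__main__":
    # HYPOTHETICAL values, not taken from the real cluster.
    users = {"cp0035": (1.0, 10.0), "uscms5606": (4.0, 1.0)}
    for name, (prio, factor) in users.items():
        print(name, absolute_priority(prio, factor))
    print("next:", next_to_run(users))
```

In this made-up example cp0035 has the best possible prio but still loses, because its factor pushes the product above that of the other user.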
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
Line: 11 to 11 | ||||||||
How to figure things out | ||||||||
Added: | ||||||||
> > | The above diagram shows that there are multiple components involved in getting a job started.
Let's start here by explaining what they are.
How to query the schedd on cmssubmit-r1
How to get the status of the pool
How to figure out why your job was held
Each job has a description of its state. You can query that description using condor_q -long jobId. When condor holds a job, it records a (more or less cryptic) reason for doing so. E.g. a very common reason for a job being held is that your proxy is about to expire. Here's what that would look like:
condor_q -l 27600.0 | grep -i reason
ReleaseReason = undefined
HoldReasonSubCode = 0
HoldReason = "Error from glidein_230462_317727040@cabinet-5-5-21.t2.ucsd.edu: Proxy about to expire"
HoldReasonCode = 4
Similarly, you can also find out details like when your proxy expires:
condor_q -l 27600.0 | grep -i x509
x509UserProxyVOName = "cms"
x509UserProxyExpiration = 1441937795
date -d @1441937795
Thu Sep 10 19:16:35 PDT 2015
To avoid this particular problem, you will want to extend your proxy lifetime to 72h with "voms-proxy-init -H 72". | |||||||
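If you check this often, the grep and date steps can be combined. The following Python sketch parses text in the "Attribute = value" form shown above and converts x509UserProxyExpiration to a UTC timestamp. It works on a saved copy of the output and does not call condor_q itself. The function names are hypothetical, and the parser handles only simple one-line attributes like the ones in this example.

```python
from datetime import datetime, timezone
from typing import Dict

# Sample copied from the condor_q -l output shown on this page.
SAMPLE = """ReleaseReason = undefined
HoldReasonSubCode = 0
HoldReason = "Error from glidein_230462_317727040@cabinet-5-5-21.t2.ucsd.edu: Proxy about to expire"
HoldReasonCode = 4
x509UserProxyVOName = "cms"
x509UserProxyExpiration = 1441937795
"""


def parse_long_output(text: str) -> Dict[str, str]:
    """Turn 'Attribute = value' lines into a dict of strings (quotes stripped)."""
    attrs: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        if not sep:
            continue  # not an attribute line
        attrs[key.strip()] = value.strip().strip('"')
    return attrs


def proxy_expiration_utc(attrs: Dict[str, str]) -> datetime:
    """Interpret x509UserProxyExpiration as seconds since the Unix epoch."""
    return datetime.fromtimestamp(int(attrs["x509UserProxyExpiration"]), tz=timezone.utc)


def hours_left(attrs: Dict[str, str], now: datetime) -> float:
    """Hours until the proxy expires (negative if already expired)."""
    return (proxy_expiration_utc(attrs) - now).total_seconds() / 3600.0


if __name__ == "__main__":
    attrs = parse_long_output(SAMPLE)
    print(attrs["HoldReason"])
    print(proxy_expiration_utc(attrs).isoformat())
    print(hours_left(attrs, datetime.now(timezone.utc)))
```

The timestamp is printed in UTC, so it reads seven hours later than the PDT time that date -d shows above.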
How to figure out the relative priority between different users that submit jobs from the UAF
|
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
Line: 23 to 23 | ||||||||
/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=mderdzin/CN=760843/CN=Mark Derdzinski is uscms5606 | ||||||||
Added: | ||||||||
> > | If I wanted to change the relative priority of different users on the UAF then I'd use the commands:
| |||||||
How to figure out what DN corresponds to which username inside the UCSD T2 cluster
Note, with "usernames in the UCSD T2 cluster" I mean the names that GUMS maps the DN to at each of the OSG-CEs of the cluster. | ||||||||
Line: 36 to 46 | ||||||||
E.g. as of August 28th 2015, the DN /DC=ch/DC=cern/OU=computers/CN=cmspilot01/vocms080.cern.ch, which is used by the glidein frontend for the UAF, is mapped on our cluster to the username cp0035. | ||||||||
Added: | ||||||||
> > | So if I want to adjust the relative priority of submissions via the UAF with submissions via CRAB3, or WMAgent, I need to
change the relative priority of username cp0035.
How to modify priorities on the cluster
| |||||||
Changed: | ||||||||
< < | How to figure out | |||||||
> > | The absolute priority number is prio x factor. | |||||||
-- FkW - 2015/08/28 |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Added: | ||||||||
> > |
Architecture diagram
![]()
How to figure things out
How to figure out the relative priority between different users that submit jobs from the UAF
How to figure out what DN corresponds to which username inside the UCSD T2 cluster
Note, with "usernames in the UCSD T2 cluster" I mean the names that GUMS maps the DN to at each of the OSG-CEs of the cluster. This username is then used to submit to HTCondor, and thus the name under which the job is known inside the cluster.
https://sentry.t2.ucsd.edu/showmap/index.txt
The important ones here are those mapped to /DC=ch/DC=cern/OU=computers/CN=cmspilotXY/vocms080.cern.ch where XY is a two-digit integer, e.g. 01.
E.g. as of August 28th 2015, the DN /DC=ch/DC=cern/OU=computers/CN=cmspilot01/vocms080.cern.ch, which is used by the glidein frontend for the UAF, is mapped on our cluster to the username cp0035.
How to figure out
-- FkW - 2015/08/28
|