Difference: FkwSTEP09GlideinWMS (3 vs. 4)

Revision 4, 2009/05/26 - Main.FkW

Line: 1 to 1
 
META TOPICPARENT name="FkwGlideinWMS"
Line: 34 to 34
 
8733.0   uscms2294       5/25 13:56   0+00:00:00 I  0   0.0  CMSSW.sh 119      
Added:
>
>
Let's discuss each of these one at a time.
The first is the Condor job ID. You can use it to get all the gory details about this job by doing:
condor_q -long 8733.0 >& junk.log
I redirected the output into a file here because you will most likely want to go through it carefully, at your leisure.
Added:
>
>
We discuss this output in more detail in the next subsection.

The next is the username on the UCSD cluster. You can find a map here from username to DN. Then come the submission time, the run time so far, and the job state: "I" for idle, i.e. this job is waiting in the queue. The two numbers that follow are the job priority and the image size. The last part is the beginning of the command string to execute.
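As a rough illustration of the column layout described above, the sketch below picks the idle jobs out of a saved condor_q listing. It assumes the default column order seen in the sample line (job ID, owner, submission date, submission time, run time, state). The function name idle_jobs and the file name q.log are made up for this example.

```shell
# List job ID and owner of all idle jobs in a saved condor_q listing.
# Assumed columns: ID OWNER SUBMIT_DATE SUBMIT_TIME RUN_TIME ST PRI SIZE CMD
idle_jobs() {
    # Only look at lines that start with a job ID such as 8733.0;
    # field 6 is the state column, where "I" means idle.
    awk '$1 ~ /^[0-9]+\.[0-9]+$/ && $6 == "I" { print $1, $2 }' "$1"
}
```

For example, after "condor_q > q.log" you would call "idle_jobs q.log".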

Info on what's in condor_q -long

 There are a few particularly useful pieces of information in this long listing:
NumJobStarts = 0

Line: 46 to 54
  If this isn't 0 or 1 then the glidein executing the job probably failed at that site at least once, and the job got rescheduled. This is a sign that either the site or the glideinWMS is having trouble.
Added:
>
>
Cmd = "/data/gftp_cache/spadhi_crab_0_090525_225421_8qw7x3/CMSSW.sh"
is the command that condor will execute at the remote site.
EnteredCurrentStatus = 1243284992
is the time, in unix time, when the job entered its current state. E.g., if the job is idle and has never started, then this is the submission time.
x509userproxysubject = "/DC=org/DC=doegrids/OU=People/CN=Sanjay Padhi 496075"
x509userproxy = "/data/gftp_cache/spadhi_crab_0_090525_225421_8qw7x3/userProxy"
These two tell you who submitted the job, and where to find that user's proxy. And finally, the following tells you where this job wants to be executed:
DESIRED_Gatekeepers = "hephygr.oeaw.ac.at:2119/jobmanager-lcgpbs"
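Since the long listing contains many more attributes than the ones discussed here, it can help to filter the saved file. The following is only a sketch: it assumes the output was saved to a file such as junk.log and that each attribute sits on its own line in the "Name = value" form shown above. The function name key_attrs is made up for this example.

```shell
# Print only the attributes discussed above from a saved "condor_q -long" listing.
key_attrs() {
    # Match whole attribute names at the start of a line, followed by " = ".
    grep -E '^(NumJobStarts|Cmd|EnteredCurrentStatus|x509userproxysubject|x509userproxy|DESIRED_Gatekeepers) = ' "$1"
}
```

Usage would be "key_attrs junk.log".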

getting unix time

The following is often handy to get the current time in unix time:
date +%s
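Going the other way, i.e. turning a value like EnteredCurrentStatus into a readable date, can be sketched as follows. This assumes GNU date (the "-d @seconds" syntax is not available in every date implementation). The function name epoch_to_utc is made up for this example.

```shell
# Convert a unix timestamp to a readable UTC date (GNU date syntax assumed).
epoch_to_utc() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S'
}
```

Calling "epoch_to_utc 1243284992" prints 2009-05-25 20:56:32. That is 13:56 in US Pacific daylight time, which matches the 5/25 13:56 submission time in the condor_q line above.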

glidein-1

The web-based monitoring of the gfactory is here.

We run the glidein factory under the username "gfactory". This is the daemon that submits the glideins to all the sites. If you are ever uncertain why a job that is meant to run at a certain site hasn't run yet, log in here and do:

condor_q -global -globus > junk.log
I believe the states progress in this order:
UNSUBMITTED - PENDING - ACTIVE - STAGE_OUT
The first is the state when the schedd knows it is supposed to submit to the site, but has not gotten around to it yet. The last is the state after the job finishes, while the glidein tries to stage the job output files back out to glidein-2. I believe we only stage out stderr and stdout via condor, but I am not sure. Need to ask the experts.
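To get a quick overview of how many glideins are in each of these states, the saved listing can be tallied. This is a sketch under the assumption that each job line in junk.log contains the state name as a separate whitespace-delimited word. I have not checked it against real factory output. The function name globus_state_counts is made up for this example.

```shell
# Tally jobs per Globus state in a saved "condor_q -global -globus" listing.
# Assumption: the state name appears as its own word on each job line.
globus_state_counts() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i == "UNSUBMITTED" || $i == "PENDING" || $i == "ACTIVE" || $i == "STAGE_OUT")
                count[$i]++
    }
    END { for (s in count) print s, count[s] }' "$1" | sort
}
```

Usage would be "globus_state_counts junk.log". A large number of jobs stuck in UNSUBMITTED or PENDING for one site would be the first thing to look at.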

glidein-frontend

The web-based monitoring for the frontend is here.

That's all I know so far about it!

glidein-collector

If you want to mess with the relative priority between different users that submit to the glideinWMS, then this is the place to do so. The commands are all the same as for the UCSD condor cluster. I've documented those here. No point in repeating myself.

To know who is who, you will find the map from username to DN useful.

Note: the priorities assigned here determine which user among those who have submitted to our CRAB server will get access to the glideins that we have running at the sites. E.g., it allows us to make sure that jobRobot jobs only get the leftovers. By default, all users have the same priority. I'll mess with that later at some point.

 

Other Details for STEP09

Line: 53 to 117
 
Added:
>
>

Details from CCRC08 that might still be useful.

 -- FkW - 2009/05/22
 