Hardware Deployment

glidein-collector    Collector/Negotiator
submit-1             Schedd/CRAB server (development version)
glidein-2            Schedd/CRAB server (production version)
glidein-frontend     glideinWMS Frontend
glidein-1            glideinWMS Factory and WMS Collector
glidein-mon          Subir's monitoring

In general, Condor runs as user "condor" with home directory "/home/condor", and the glideinWMS factory runs as user "gfactory" with home directory "/home/gfactory".

Both the development and production versions of the CRAB server submit to the same glideinWMS, i.e., we do not operate a development version of glideinWMS at this point for STEP09.

In the following, we go through one piece of hardware after the other and note some useful commands to figure out what's going on.

glidein-2

which condor_q
/data/glidecondor/bin/condor_q
condor_q

This will list all jobs presently known to the schedd on glidein-2. As glidein-2 is where the CRAB server lives, this means it lists all the jobs that the CRAB server has pushed into glideinWMS and that are not yet completed. A typical line looks like this:

8733.0   uscms2294       5/25 13:56   0+00:00:00 I  0   0.0  CMSSW.sh 119      
Let's discuss each of these fields one at a time. The first is the Condor job ID. You can use that to get all the gory details about this job by doing:
condor_q -long 8733.0 >& junk.log
I redirected the output into a file here because you will most likely want to look at it carefully, at your leisure. We discuss this output in more detail in the next subsection.

The next field is the username on the UCSD cluster; you can find a map from username to DN here. Then come the submission time, the run time so far, and the job status ("I" for idle, i.e., this job is waiting in the queue), followed by the priority and the memory image size. The last part is the beginning of the command string to execute.
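
If you only care about part of the queue, condor_q can filter for you. A minimal sketch (the username is just the one from the example line above; JobStatus 1 means idle, 2 means running):

condor_q uscms2294
condor_q -constraint 'JobStatus == 1'

The first lists only that user's jobs, the second only the idle ones.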

Info on what's in condor_q -long

There are a few particularly useful pieces of information in this long listing:

NumJobStarts = 0
NumRestarts = 0
If this isn't 0 or 1, then the glidein executing the job probably failed at that site at least once, and the job got rescheduled. This is a sign that either the site or the glideinWMS is having trouble.
Cmd = "/data/gftp_cache/spadhi_crab_0_090525_225421_8qw7x3/CMSSW.sh"
is the command that condor will execute at the remote site.
EnteredCurrentStatus = 1243284992
is the time, in unix time, when the job entered its current state. E.g., if the job is idle and has never started, then this is the submission time.
x509userproxysubject = "/DC=org/DC=doegrids/OU=People/CN=Sanjay Padhi 496075"
x509userproxy = "/data/gftp_cache/spadhi_crab_0_090525_225421_8qw7x3/userProxy"
These two tell you who submitted the job, and where to find that user's proxy. And finally, the following tells you where this job wants to be executed:
DESIRED_Gatekeepers = "hephygr.oeaw.ac.at:2119/jobmanager-lcgpbs"
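
If you only need one or two of these attributes, you don't have to wade through the full -long dump; condor_q -format prints just the ClassAd attributes you ask for. A sketch, using the attributes discussed above:

condor_q -format "%d." ClusterId -format "%d " ProcId -format "%s\n" DESIRED_Gatekeepers

And to check whether a user's proxy is still valid, something like the following should work on the file named in x509userproxy, assuming the VOMS client tools are installed on this machine (path taken from the example above):

voms-proxy-info -file /data/gftp_cache/spadhi_crab_0_090525_225421_8qw7x3/userProxy -all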

getting unix time

The following is often handy:
date +%s
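
To go the other way, i.e. to turn a unix time like the EnteredCurrentStatus value above back into a human-readable date, GNU date (as found on our Linux boxes) can do:

date -d @1243284992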

glidein-1

The web-based monitoring of the gfactory is here.

Email from Igor on how to manage the factory:

Hi Frank.

The factory is installed at
gfactory@glidein-1.t2.ucsd.edu:/home/gfactory/glideinsubmit/glidein_STEP09_v1

All operations need to be done as user gfactory, in that directory.

To start the factory:
./factory_startup start
To stop the factory:
./factory_startup stop

To reconfig the factory, edit
../glidein_STEP09_v1.cfg/glideinWMS.xml
then
./factory_startup reconfig ../glidein_STEP09_v1.cfg/glideinWMS.xml

To see if the factory is in downtime:
./factory_startup statusdown factory
Put the whole factory in downtime
./factory_startup down factory
Re-enable the factory
./factory_startup up factory

To see which entries are in downtime:
./factory_startup statusdown entries
Put an entry in downtime
./factory_startup down <entry name>
Re-enable an entry
./factory_startup up <entry name>

See what is going on with the factory
~/glideinWMS/tools/wmsTxtView.py
or on Web at
http://glidein-1.t2.ucsd.edu:8319/glidefactory/monitor/glidein_STEP09_v1/factoryStatus.html

Igor

where to find site-specific glidein logs

There are logs for each CE separately. If you ever want to know why a certain set of glideins failed, this is where you look.

 gfactory@glidein-1 ~/glideinsubmit/glidein_STEP09_v1/entry_T2_IT_Legnaro/log 
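
The exact file names in that directory depend on the glideinWMS version, but as a sketch, something along these lines usually gets you to the interesting part quickly (run as gfactory on glidein-1; the *.err pattern is an assumption, adjust to whatever ls shows):

cd ~/glideinsubmit/glidein_STEP09_v1/entry_T2_IT_Legnaro/log
ls -ltr | tail
grep -il "error" *.err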

obsolete things

Under the username "gfactory" we run the glidein factory, i.e., the daemon that submits the glideins to all the sites. If you are ever uncertain why a certain job that is meant to run at a certain site hasn't run yet, log in here and do:
condor_q -global -globus > junk.log
I believe the states go from:
UNSUBMITTED - PENDING - ACTIVE - STAGE_OUT
The first is the state when the schedd knows it is supposed to submit to the site, but has not gotten around to it yet. The last is after the job finishes, while the glidein tries to stage the job output files back out to glidein-2. I believe we only stage back the stderr and stdout via condor, but am not sure. Need to ask the experts.
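
To get a quick per-state count, a simple grep over that same output is usually enough; a sketch reusing the junk.log file from above:

grep -c UNSUBMITTED junk.log
grep -c PENDING junk.log
grep -c ACTIVE junk.log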

Igor tells me that there are better ways to analyze what's going on, by looking at the web monitoring or by using the python script described in the glidein-frontend section.

glidein-frontend

The web-based monitoring for the frontend is here.

The most powerful command for debugging the sites is the following:

frontend@glidein-frontend ~/glideinWMS/tools$ ./entry_compare.py

This makes a list of all jobs and glideins running and compares the total numbers, i.e., it tells you which sites have glideins running that seem to have failed before registering with the collector back home.

You then go to glidein-1 and look at the glidein logs for just those sites to figure out what went wrong. The most typical error is that the C++ libraries are missing on the worker nodes, and the glideins thus don't find the shared libraries they need. However, there are other reasons for failure as well.
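
If you suspect the missing-shared-library problem, the standard Linux loader message is easy to grep for in the entry logs on glidein-1; a sketch, with the entry name taken from the example above and the *.err pattern an assumption:

grep -l "error while loading shared libraries" ~gfactory/glideinsubmit/glidein_STEP09_v1/entry_T2_IT_Legnaro/log/*.err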

That's all I know so far about it!

glidein-collector

If you want to mess with the relative priorities of the different users that submit to the glideinWMS, then this is the place to do so. The commands are all the same as for the UCSD condor cluster. I've documented those here. No point in repeating myself.

To know who is who, you will find the map from username to DN useful.

Note: the priorities assigned here determine which users among those who have submitted to our CRAB server will get access to the glideins that we have running at the sites. E.g., this allows us to make sure that jobRobot jobs only get the leftovers. By default, all users have the same priority. I'll mess with that later at some point.
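
As a quick reminder, these are the standard Condor priority commands, run on glidein-collector against the glideinWMS pool; a minimal sketch (the username and factor here are made up):

condor_userprio -allusers
condor_userprio -setfactor uscms2294@glidein-collector.t2.ucsd.edu 100

The first shows the current priorities and usage of all users; the second gives the named user a larger (i.e. worse) priority factor.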

Other Details for STEP09

Details from CCRC08 that might still be useful.

-- FkW - 2009/05/22
