Difference: FkwSTEP09GlideinWMS (5 vs. 6)

Revision 6 - 2009/06/10 - Main.FkW

Line: 1 to 1
 
META TOPICPARENT name="FkwGlideinWMS"
Line: 81 to 81
 

glidein-1

The web-based monitoring of the gfactory is here.
Changed:
<
<
The glidein factory runs under the username "gfactory". This is the daemon that submits the glideins to all the sites. If you are ever unsure why a job that is meant to run at a particular site hasn't run yet, log in here and do:
condor_q -global -globus > junk.log
I believe the states go from:
UNSUBMITTED - PENDING - ACTIVE - STAGE_OUT
The first is the state in which the schedd knows it is supposed to submit to the site but has not gotten around to it yet. The last comes after the job finishes, while the glidein tries to stage the job output files back out to glidein-2. I believe we only stage out stderr and stdout via condor, but I am not sure. Need to ask the experts.

glidein-frontend

The web-based monitoring for the frontend is here.

Email from Igor:

>
>

Email from Igor on how to manage the factory:

 
Hi Frank.

Line: 139 to 123
 
Added:
>
>

where to find site specific glidein logs

There are logs for each CE separately. If you ever want to know why a certain set of glideins failed, this is where you look.

 gfactory@glidein-1 ~/glideinsubmit/glidein_STEP09_v1/entry_T2_IT_Legnaro/log 
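To see quickly which CE has written logs most recently, a small script along the following lines might help. It is only a sketch: it assumes every entry has a directory named entry_<site> with a log subdirectory, as in the Legnaro example above, and the function name newest_logs is my own invention, not part of glideinWMS.

```python
from pathlib import Path


def newest_logs(factory_dir, count=5):
    """Map each entry directory name to its `count` newest log files.

    Assumption: logs live in <factory_dir>/entry_<site>/log, following the
    single Legnaro example; other layouts are not handled.
    """
    result = {}
    for log_dir in sorted(Path(factory_dir).glob("entry_*/log")):
        files = [p for p in log_dir.iterdir() if p.is_file()]
        # Newest first, judged by modification time.
        files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
        result[log_dir.parent.name] = files[:count]
    return result
```

On glidein-1 it would be called as newest_logs(Path.home() / "glideinsubmit" / "glidein_STEP09_v1"), and the file names for each entry can then be printed or inspected by hand.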

obsolete things

The glidein factory runs under the username "gfactory". This is the daemon that submits the glideins to all the sites. If you are ever unsure why a job that is meant to run at a particular site hasn't run yet, log in here and do:
condor_q -global -globus > junk.log
I believe the states go from:
UNSUBMITTED - PENDING - ACTIVE - STAGE_OUT
The first is the state in which the schedd knows it is supposed to submit to the site but has not gotten around to it yet. The last comes after the job finishes, while the glidein tries to stage the job output files back out to glidein-2. I believe we only stage out stderr and stdout via condor, but I am not sure. Need to ask the experts.

Igor tells me that there are better ways to analyze what's going on: looking at the web monitoring, or using the Python script described in the glidein-frontend section.
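If you still end up with a junk.log from the condor_q command above, a rough tally of the states can be made as sketched here. The sketch does not assume any particular column layout of the condor_q output. It simply counts lines that contain one of the four state names listed above as a separate word, so any other states are ignored. The function name tally_states is made up for this note.

```python
from collections import Counter

# The four states named in this section; other states may exist.
STATES = ("UNSUBMITTED", "PENDING", "ACTIVE", "STAGE_OUT")


def tally_states(lines):
    """Count the lines that contain a known state name as a separate word."""
    counts = Counter()
    for line in lines:
        words = line.split()
        for state in STATES:
            if state in words:
                counts[state] += 1
                break  # count each line at most once
    return counts
```

Typical use would be to open junk.log, pass the file object to tally_states, and print the resulting counts.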

glidein-frontend

The web-based monitoring for the frontend is here.

The most powerful command for debugging the sites is the following:

frontend@glidein-frontend ~/glideinWMS/tools$ ./entry_compare.py

This makes a list of all jobs and glideins running, and compares the total numbers. I.e. it tells you which sites have glideins running that seem to have failed before registering with the collector back home.
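The idea behind that comparison can be illustrated with a few lines of Python. This is not the code of entry_compare.py, which I have not read. It is only a sketch of the bookkeeping, and the function name, the dictionaries and the numbers in it are all made up.

```python
def entries_with_lost_glideins(running, registered):
    """Return {entry: shortfall} where more glideins run than have registered.

    running:    entry name -> glideins running at the site
    registered: entry name -> glideins that registered with the collector
    Both inputs are hypothetical; the real tool gathers them itself.
    """
    lost = {}
    for entry, n_running in running.items():
        shortfall = n_running - registered.get(entry, 0)
        if shortfall > 0:
            lost[entry] = shortfall
    return lost


# Made-up numbers, for illustration only.
running = {"T2_IT_Legnaro": 40, "T2_EXAMPLE_Site": 25}
registered = {"T2_IT_Legnaro": 40, "T2_EXAMPLE_Site": 10}
print(entries_with_lost_glideins(running, registered))  # {'T2_EXAMPLE_Site': 15}
```

An entry that shows up in the result is one where glideins seem to start but never report back, so that is where to look at the logs.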

You then go into glidein-1 and look at the glidein logs for just those sites to figure out what went wrong. The most typical error is that the C++ libraries are missing on the worker nodes, so the glideins can't find the shared libraries they need. However, there are other reasons for failure as well.
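For the missing-library case, the search through a site's log directory could be automated roughly as follows. The sketch looks for the standard message that the Linux dynamic loader prints ("error while loading shared libraries: <name>: ..."). Whether the glidein logs actually capture that message is an assumption on my part, and the function name missing_libraries is invented.

```python
import re
from pathlib import Path

# Message printed by the Linux dynamic loader when a shared library is missing.
PATTERN = re.compile(r"error while loading shared libraries: (\S+):")


def missing_libraries(log_dir):
    """Return {library name: set of log file names that mention it}."""
    found = {}
    for path in Path(log_dir).iterdir():
        if not path.is_file():
            continue
        # Logs may contain odd bytes; replace them instead of failing.
        text = path.read_text(errors="replace")
        for lib in PATTERN.findall(text):
            found.setdefault(lib, set()).add(path.name)
    return found
```

Pointing it at one entry's log directory on glidein-1 would show which libraries are reported missing and in which files. An empty result means only that this particular message was not found, not that the site is healthy.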

 That's all I know so far about it!

glidein-collector

 