TWiki > UCSDTier2 Web > FkwCCRC08GlideinWMS (revision 14)

This page describes the glideinWMS system as deployed for CCRC08.

Hardware

  • glidein-2 = submitter, place where CRAB runs
  • glidein-1 = gfactory
  • srm-2 = collector
  • t2data0 = frontend, also runs a gcb under gcbuser account
  • gftp-3 = runs gcb under gcbuser account

submitter details

The submitter machine is glidein-2. The directory out of which I run stuff is ~fkw/CCRC08. There's a top-level README that describes where to find the source_me to get started.

On there we have 10 schedds running. The nine other than the default one are called schedd_jobs1@ up to schedd_jobs9@. Submitting a job to any of them works as follows:

condor_submit -name schedd_jobs3@  myClassAd
condor_q -global
condor_q -name schedd_jobs3@

CRAB submission details

fkw is working out of:
cd ~fkw/CCRC08/sanjay/
./ccrcV1.sh T2_US_Florida /WW_incl/CMSSW_1_6_7-CSA07-1196178448/RECO 4 6

The "4 6" at the end here would make sure that this datasets is submitted to 3 times, with the int going 4,5,6.

The main submission script is ccrcV1.sh. It exists in three versions:

  • V1 to be used for all sites except ...
  • V2 to be used for IFCA
  • V3 to be used for RHUL

To make sure that long-distance latencies don't interfere with schedd operations, we use multiple schedds. Doing so unfortunately requires modifying CRAB each time. The file that needs modification is:

cd ~fkw/CCRC08/sanjay/CRAB-HEAD/ProdCommon/ProdCommon/BossLite/Scheduler
rm -f SchedulerCondorCommon.pyc
emacs -nw SchedulerCondorCommon.py

In there, modify the "condor_submit -name schedd_jobs8@" to whatever schedd you want used.
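A sed one-liner can do the swap. This sketch runs on a scratch copy, since the exact line in SchedulerCondorCommon.py may differ from the sample shown here:

```shell
# Swap the hard-coded schedd name in place.
# The file content below is a fabricated stand-in for the real source line.
f=$(mktemp)
echo 'cmd = "condor_submit -name schedd_jobs8@ " + jdl' > "$f"
sed -i 's/schedd_jobs8@/schedd_jobs5@/' "$f"   # GNU sed syntax
patched=$(cat "$f")
echo "$patched"
rm -f "$f"
```

Remember to also remove the stale SchedulerCondorCommon.pyc, as above, so the edited source actually gets used.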

At present, I am using the following schedds:

  • schedd_jobs1@ UERJ
  • schedd_jobs2@ Warsaw
  • schedd_jobs3@ IFCA
  • schedd_jobs4@ Nebraska, Florida
  • schedd_jobs5@ JINR, RHUL, SINP
  • schedd_jobs8@ UCSD

CRAB logfiles and alike

/home/spadhi/CRAB/CCRC08/T2_HU_Budapest/Njet-blabla-1/glidein-blabla/share/.condor_temp

Note that the last directory in this path has a dot in front, and is thus a hidden directory.

Parsing details for error 50115

To find which jobs are the ones with the error:
 grep "crab_fjr.xml: 50115" *.stdout
 grep "PoolSource:source" *.stderr
 grep "rfio_open failed: filename =" *.stderr

The first gets you the jobs with that error code. The second finds the dates at which the file-open failure occurred. The third is specific to SEs that use rfio; it gives you a list of files that failed to open.
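The greps can be combined into a small pipeline that lists just the unique failing filenames. The sample stderr content below is fabricated for illustration:

```shell
# Build a fake job stderr to demonstrate the pipeline.
tmp=$(mktemp -d)
cat > "$tmp/job_1.stderr" <<'EOF'
rfio_open failed: filename = /store/data/fileA.root
unrelated line
rfio_open failed: filename = /store/data/fileB.root
EOF
# Extract the unique list of files that failed to open.
failed=$(grep -h "rfio_open failed: filename =" "$tmp"/*.stderr \
  | sed 's/.*filename = //' | sort -u)
echo "$failed"
rm -rf "$tmp"
```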

Note: Not all jobs that fail to open a file are recorded as failed jobs!!! I found one job in Budapest that successfully read a first file then failed on a second, and was recorded as a successful job in the dashboard.

Getting epoch time

 date +%s
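And to go the other way, from epoch seconds back to a human-readable date (GNU date syntax; BSD date would use -r instead of -d @):

```shell
# Current epoch time.
now=$(date +%s)
echo "$now"
# Convert a known epoch value back to UTC (1210541820 = 2008-05-11 21:37 UTC).
human=$(date -u -d @1210541820 +"%Y-%m-%d %H:%M:%S")
echo "$human"
```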

Problems with condor tests

I had a hell of a time with my jdl file for simple tests until Igor explained:

  • To get a glidein started to a site for which there is none running right now you need to have
+DESIRED_Gatekeepers = "t2-ce-01.lnl.infn.it:2119/jobmanager-lcglsf" 
+DESIRED_Archs = "INTEL,X86_64" 
Requirements = stringListMember(GLIDEIN_Gatekeepers,DESIRED_Gatekeepers) && stringListMember(Arch, DESIRED_Archs) 
  • However, if you already have a glidein running at a site, then you can also get matched to the site if you have:
+DESIRED_Site = "INFN-LNL"
+DESIRED_Archs = "INTEL,X86_64"
Requirements = stringListMember(GLIDEIN_Site,DESIRED_Site) && stringListMember(Arch, DESIRED_Archs)
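Putting the first variant together, a complete minimal submit file might look like this (the executable, output, error, and log names are placeholders, not from the original setup):

```
universe   = vanilla
executable = test.sh
output     = test.out
error      = test.err
log        = test.log
+DESIRED_Gatekeepers = "t2-ce-01.lnl.infn.it:2119/jobmanager-lcglsf"
+DESIRED_Archs = "INTEL,X86_64"
Requirements = stringListMember(GLIDEIN_Gatekeepers,DESIRED_Gatekeepers) && stringListMember(Arch, DESIRED_Archs)
queue
```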

This confused me because I once succeeded with the second, and then never again.

To see that you are succeeding, login to glidein-1 and do:

condor_q -global -globus

This will show you the status of the glideins at the remote host.

In general, gfactory is pretty quick. It picks up the submissions from glidein-2 within a minute, and submits glideins accordingly. If you see any significant delay then something's wrong.

gfactory details

The gfactory is deployed on glidein-1 in the account "gfactory". According to ~gfactory/start_factory.sh, the present version is in:
/home/gfactory/glideinsubmit/glidein_CCRC08_2

The master xml file that describes the system seems to be in:
~gfactory/glideinWMS/creation/glideinWMS.xml

logfiles

Glidein logs are in ~/glideinsubmit/glidein_<version>/entry_<entryname>/log

In this case, it is ~/glideinsubmit/glidein_CCRC08_2/entry_CIEMAT-LCG2-LCG02-CMS/log

They are in job.*.err

manage gfactory

In /home/gfactory/start_factory.sh look up which version of the gfactory configuration is presently running. Then copy the xml for that version into the creation directory, modify it, and reconfigure.

ps -auwx |grep python
killall python
killall -9 python
cd glideinWMS/creation
cp ~/glideinsubmit/glidein_CCRC08_2/glideinWMS.xml .

Now edit this file, then reconfigure. Then maybe check what the reconfig did by looking through the proper directory in glideinsubmit. Then start the gfactory back up:

./reconfig_glidein glideinWMS.xml
cd
./start_factory.sh

Useful commands on glidein-1

condor_status -any -constraint 'GlideinMyType =?= "glidefactory"' -format 'Entry: %s ' EntryName -format "Site: %s " GLIDEIN_Site -format "Gatekeeper: %s\n" GLIDEIN_Gatekeeper

Next, some useful BDII queries:

ldapsearch -LLL -x -h lcg-bdii.cern.ch -p 2170 -b 'mds-vo-name=local,o=grid'

ldapsearch -LLL -x -h is.grid.iu.edu -p 2170 -b 'mds-vo-name=local,o=grid'

Finally, a query to show the status of the remote glideins at the remote sites:

condor_q -global -globus

For this last query, the glidein at the remote site can be in one of the following states:

  • ACTIVE = well it's running
  • PENDING = it's queued up in the remote batch system
  • UNKNOWN = this is what happens when a glidein gets lost for whatever reason. It's basically the same as the "hold" state.
  • UNSUBMITTED = in the schedd queue on gfactory but not yet sent to the remote site
  • STAGE_OUT = no idea what this means
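To tally glideins per state (like the UNKNOWN and UNSUBMITTED counts quoted below), one can pull the state column out of the condor_q -global -globus output. The sample lines here are fabricated, and the column position in real output may differ:

```shell
# Fake condor_q -globus output: id, owner, state, resource.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
333.0 fkw ACTIVE t2-ce-01.lnl.infn.it
333.1 fkw PENDING t2-ce-01.lnl.infn.it
333.2 fkw UNKNOWN hera.example.org
333.3 fkw UNKNOWN hera.example.org
EOF
# Count occurrences of each state, most frequent first.
counts=$(awk '{print $3}' "$tmp" | sort | uniq -c | sort -rn)
echo "$counts"
rm -f "$tmp"
```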

Whenever you want to know more about a specific job, do:

condor_q -global -globus -l 333.2

Here 333.2 is the condor id for that job. There's a variety of classAd attributes that give you important information on the state of the job.

Igor says that frontend does a condor_release on the held jobs in gfactory. He says furthermore that the held jobs should then be taken care of by condor_g if they don't exist any more at the remote site. He says that doing a condor_rm is really not useful, nor advisable.

Given that Igor expects condor_g to take care of the UNKNOWN once released by frontend, there is no counter in frontend that would check how often a job was released. Given that we presently (May 11, 21:37) have 1487 UNKNOWN, we've got plenty of opportunity to watch whether this is really happening. An initial wild guess indicates that it is indeed working, because this afternoon (May 11, 15:32) we had 2099 UNKNOWN. However, the UNSUBMITTED seem stuck forever once they are stuck: we have had 874 of these at hera since earlier in the day, and this count hasn't changed.

From Igor:

Another useful command on glidein-1
cd glideinWMS/tools/
python wmsXMLView.py

This gives you the same information as:
http://glidein-1.t2.ucsd.edu:8319/glidefactory/monitor/glidein_CCRC08_2/schedd_status.xml

Igor

Here's the monitoring

http://glidein-1.t2.ucsd.edu:8319/glidefactory/monitor/glidein_CCRC08_2/

things not to do

don't change x509 stuff while condor has anything in the queue.

frontend

To restart the frontend:

su - frontend
ps -auwx |grep python
killall python
killall -9 python
./start_frontend.sh
ps -auwx |grep python

The shell script "start_frontend.sh" points to the location of the config file. The configs are pretty obvious; they include such things as the maximum number of jobs expected to run, the maximum number of idle glideins per entry point, etc.

GCB

The GCB logs are in ~/gcbcondor/condor_local/log/

We have two of them: gcbuser@t2data0 and gcbuser@gftp-3.

Most of the interesting stuff is in GCB_Broker_Log. But occasionally you want to look at the others, too.

Commands worth knowing on lxplus

source /afs/cern.ch/cms/LCG/LCG-2/UI/cms_ui_env.sh
lcg-info --list-se --vo cms --query 'SE=*osg-se.sprace.org.br*' --attrs Path,Root
lcg-ls -l -b -D srmv2 srm://osg-se.sprace.org.br:8443/srm/managerv2?SFN=/pnfs/sprace.org.br/data/

lcg-cp file://`pwd`/site.txt srm://t2data2.t2.ucsd.edu:8443/srm/managerv2?SFN=/pnfs/sdsc.edu/data3/cms/phedex/store/user/spadhi/abc.txt

UCSD cluster condor system

While I'm at it, I might as well document how one changes the priority of a user on the UCSD tier-2 cluster. I had to do that on Sunday evening to get some traction on getting jobs through at UCSD.

First of all, find out where the negotiator is actually running:

condor_status -negotiator

At present, the negotiator runs on osg-gw-1.

Then login to that node as root, and do:

condor_userprio -all |more
condor_userprio -all |grep uscms1586
condor_userprio -setfactor group_cms.uscms1586@osg-gw-2.t2.ucsd.edu 1
condor_userprio -setfactor group_cms.uscms1586@osg-gw-4.t2.ucsd.edu 1
condor_userprio -all |grep uscms1586
condor_userprio -setprio  group_cms.uscms1586@osg-gw-2.t2.ucsd.edu 0.5
condor_userprio -setprio  group_cms.uscms1586@osg-gw-4.t2.ucsd.edu 0.5

The setprio command basically resets your effective priority to 0.5, i.e. you start over near zero, as if you hadn't run here in ages. The setfactor command sets the multiplier applied to your usage to get the effective priority. It is an integer, so setting it to 1 is the best you can do.
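To see why a factor of 1 is optimal, here is the arithmetic in a simplified model (real Condor also decays accumulated usage over time): effective priority = real priority × priority factor, and lower wins.

```shell
# Simplified model: two users with identical real priority (usage),
# differing only in their priority factor. Factor 1 always wins.
real_prio=5.0
summary=""
for factor in 1 10; do
  eff=$(awk -v p="$real_prio" -v f="$factor" 'BEGIN { printf "%.1f", p*f }')
  echo "factor $factor -> effective priority $eff"
  summary="$summary $factor:$eff"
done
```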

-- FkW - 25 Apr 2008

Topic revision: r14 - 2008/05/19 - 16:36:46 - FkW
 