FkwCCRC08GlideinWMS (2008/08/06, SanjayPadhi)
%TOC%

This page describes the glideinWMS system as deployed for CCRC08.

---++ Hardware

   * glidein-2 = submitter, place where CRAB runs
   * glidein-1 = gfactory
   * glidein-collector = collector
   * glidein-frontend = frontend, also runs a gcb under the gcbuser account
   * glidein-gcb-1 = runs gcb under the gcbuser account

---++ submitter details

The submitter machine is glidein-2. The directory out of which I run stuff is ~fkw/CCRC08. There's a top-level README that describes where to find the source_me to get started. On there we have 10 schedd's running. The 9 other than the default one are called schedd_jobs1@ up to schedd_jobs9@. Submitting a job to any of them works as follows:

<pre>
condor_submit -name schedd_jobs3@ myClassAd
condor_q -global
condor_q -name schedd_jobs3@
</pre>

---+++ CRAB submission details

fkw is working out of:

<pre>
cd ~fkw/CCRC08/sanjay/
./ccrcV1.sh T2_US_Florida /WW_incl/CMSSW_1_6_7-CSA07-1196178448/RECO 4 6
</pre>

The "4 6" at the end makes sure that this dataset is submitted 3 times, with the integer going 4, 5, 6. The main submission script is ccrcV1.sh. It exists in three versions:
   * V1 to be used for all sites except ...
   * V2 to be used for IFCA
   * V3 to be used for RHUL

To make sure that long-distance latencies don't interfere with schedd operations, we use multiple schedd's. Doing so unfortunately requires modifying CRAB each time. The file that needs modification is:

<pre>
cd ~fkw/CCRC08/sanjay/CRAB-HEAD/ProdCommon/ProdCommon/BossLite/Scheduler
rm -f SchedulerCondorCommon.pyc
emacs -nw SchedulerCondorCommon.py
</pre>

In there, modify the "condor_submit -name schedd_jobs8@" to whatever schedd you want used.
At present, I am using the following schedd's:
   * schedd_jobs1@ UERJ
   * schedd_jobs2@ Warsaw
   * schedd_jobs3@ IFCA
   * schedd_jobs4@ Nebraska, Florida
   * schedd_jobs5@ JINR, RHUL, SINP
   * schedd_jobs8@ UCSD

---+++ CRAB logfiles and alike

<pre>
/home/spadhi/CRAB/CCRC08/T2_HU_Budapest/Njet-blabla-1/glidein-blabla/share/.condor_temp
</pre>

Note that the last directory in this path has a dot in front, and is thus a hidden directory.

---++++ Parsing details for error 50115

To find which jobs are the ones with the error:

<pre>
grep "crab_fjr.xml: 50115" *.stdout
grep "PoolSource:source" *.stderr
grep "rfio_open failed: filename =" *.stderr
</pre>

The first gets you the jobs with that error code. The second finds the dates at which the file open failure occurred. The third is specific to SEs that use rfio; it gives you a list of files that failed to open.

*Note: Not all jobs that fail to open a file are recorded as failed jobs! I found one job in Budapest that successfully read a first file, then failed on a second, and was recorded as a successful job in the dashboard.*

---++++ Getting epoch time

<pre>
date +%s
</pre>

---+++ Problems with condor tests

I had a hell of a time with my jdl file for simple tests until Igor explained:
   * To get a glidein started at a site for which there is none running right now, you need to have: <pre>
+DESIRED_Gatekeepers = "t2-ce-01.lnl.infn.it:2119/jobmanager-lcglsf"
+DESIRED_Archs = "INTEL,X86_64"
Requirements = stringListMember(GLIDEIN_Gatekeepers,DESIRED_Gatekeepers) && stringListMember(Arch, DESIRED_Archs)
</pre>
   * However, if you already have a glidein running at a site, then you can also get matched to the site if you have: <pre>
+DESIRED_Site = "INFN-LNL"
+DESIRED_Archs = "INTEL,X86_64"
Requirements = stringListMember(GLIDEIN_Gatekeepers,DESIRED_Site) && stringListMember(Arch, DESIRED_Archs)
</pre>

This confused me because I once succeeded with the second, and then never again.
To see that you are succeeding, login to glidein-1 and do:

<pre>
condor_q -global -globus
</pre>

This will show you the status of the glideins at the remote host. In general, gfactory is pretty quick. It picks up the submissions from glidein-2 within a minute, and submits glideins accordingly. If you see any significant delay, then something's wrong.

---++ gfactory details

The gfactory is deployed on glidein-1 in the account "gfactory". According to ~gfactory/start_factory.sh, the present version is in:<br> /home/gfactory/glideinsubmit/glidein_CCRC08_2<br> The master xml file that describes the system seems to be in:<br> ~gfactory/glideinWMS/creation/glideinWMS.xml

---+++ logfiles

Glidein logs are in ~/glideinsubmit/glidein<factory name>/entry_<entry_name>/log. In this case, it is ~/glideinsubmit/glidein_CCRC08_2/entry_CIEMAT-LCG2-LCG02-CMS/log. They are in job.*.err.

---+++ manage gfactory

In /home/gfactory/start_factory.sh, look up which version of the gfactory configuration is presently running. Then copy the xml for that version into the creation directory, modify it, and reconfigure.

<pre>
ps -auwx |grep python
killall python
killall -9 python
cd glideinWMS/creation
cp ~/glideinsubmit/glidein_CCRC08_2/glideinWMS.xml .
</pre>

Now edit this file. Then reconfigure. Then maybe check what the reconfig did by looking through the proper directory in glideinsubmit. Then start the gfactory back up.
<pre>
./reconfig_glidein glideinWMS.xml
cd
./start_factory.sh
</pre>

---+++ Location of the script that is run by glidein on worker nodes

<pre>
[1012] gfactory@glidein-1 ~/glideinsubmit/glidein_CCRC08_2$ pwd
/home/gfactory/glideinsubmit/glidein_CCRC08_2
[1016] gfactory@glidein-1 ~/glideinsubmit/glidein_CCRC08_2$ less glidein_startup.sh
</pre>

---+++ Useful commands on glidein-1

<pre>
condor_status -any -constraint 'GlideinMyType =?= "glidefactory"' -format 'Entry: %s ' EntryName -format "Site: %s " GLIDEIN_Site -format "Gatekeeper: %s\n" GLIDEIN_Gatekeeper
</pre>

Next, some useful bdii queries:

<pre>
ldapsearch -LLL -x -h lcg-bdii.cern.ch -p 2170 -b 'mds-vo-name=local,o=grid'
ldapsearch -LLL -x -h is.grid.iu.edu -p 2170 -b 'mds-vo-name=local,o=grid'
</pre>

Finally, a query to show the status of the remote glideins at the remote sites:

<pre>
condor_q -global -globus
</pre>

For this last query, the glidein at the remote site can be in one of the following states:
   * ACTIVE = well, it's running
   * PENDING = it's queued up in the remote batch system
   * UNKNOWN = this is what happens when a glidein gets lost for whatever reason. It's basically the same as the "hold" state.
   * UNSUBMITTED = in the schedd queue on gfactory but not yet sent to the remote site
   * STAGE_OUT = no idea what this means

Whenever you want to know more about a specific job, do:

<pre>
condor_q -global -globus -l 333.2
</pre>

Here 333.2 is the condor id for that job. There's a variety of classAd attributes that give you important information on the state of the job. Igor says that frontend does a condor_release on the held jobs in gfactory. He says furthermore that the held jobs should then be taken care of by condor_g if they don't exist any more at the remote site. He says that doing a condor_rm is really not useful, nor advisable. Given that Igor expects condor_g to take care of the UNKNOWN once released by frontend, there is no counter in frontend that would check how often a job was released.
Given that we presently (May 11 21:37) have 1487 UNKNOWN, we have plenty of opportunity to watch whether this is really happening. An initial wild guess indicates that this is indeed working, because this afternoon (May 11 15:32) we had 2099 UNKNOWN. However, the UNSUBMITTED seem stuck forever if they are stuck. We have had 874 of these at hera since earlier in the day, and this count hasn't changed.

From Igor:

<pre>
Another useful command on glidein-1

cd glideinWMS/tools/
python wmsXMLView.py

This gives you the same information as:
http://glidein-1.t2.ucsd.edu:8319/glidefactory//monitor/glidein_CCRC08_2/schedd_status.xml

Igor
</pre>

---+++ Here's the monitoring

http://glidein-1.t2.ucsd.edu:8319/glidefactory/monitor/glidein_CCRC08_2/

---+++ things not to do

Don't change x509 stuff while condor has anything in the queue.

---++ frontend

To restart the frontend:

<pre>
su - frontend
ps -auwx |grep python
killall python
killall -9 python
./start_frontend.sh
ps -auwx |grep python
</pre>

The shell script "start_frontend.sh" points to the location of the config file. The configs are pretty obvious. They include such things as the maximum number of jobs expected to run, the maximum number of idle glideins per entry point, etc. The configuration script is at:

<pre>
/home/frontend/glideinfrontend/v2/etc/vofrontend.cfg
</pre>

This is very CMS CRAB specific (it assumes you are running CRAB).
The config lines that do this are:

<pre>
# specify the appropriate constraint
job_constraint='(JobUniverse==5) && ((DESIRED_Sites=!=UNDEFINED) || (DESIRED_Gatekeepers=!=UNDEFINED))'

# String for matching jobs to glideins
match_string='(job.has_key("DESIRED_Sites") and (glidein["attrs"]["GLIDEIN_Site"] in job["DESIRED_Sites"].split(","))) or (job.has_key("DESIRED_Gatekeepers") and (glidein["attrs"]["GLIDEIN_Gatekeeper"] in job["DESIRED_Gatekeepers"].split(",")))'
</pre>

---++ GCB

The GCB logs are in ~/gcbcondor/condor_local/log/. We have two of them: gcbuser@t2data0 and gcbuser@gftp-3. Most of the interesting stuff is in GCB_Broker_Log, but occasionally you want to look at the others, too.

---++ Commands worth knowing on lxplus

<pre>
source /afs/cern.ch/cms/LCG/LCG-2/UI/cms_ui_env.sh
lcg-info --list-se --vo cms --query 'SE=*osg-se.sprace.org.br*' --attrs Path,Root
lcg-ls -l -b -D srmv2 srm://osg-se.sprace.org.br:8443/srm/managerv2?SFN=/pnfs/sprace.org.br/data/
</pre>

<pre>
lcg-cp file://`pwd`/site.txt srm://t2data2.t2.ucsd.edu:8443/srm/managerv2?SFN=/pnfs/sdsc.edu/data3/cms/phedex/store/user/spadhi/abc.txt
</pre>

---++ UCSD cluster condor system

While I'm at it, I might as well document how one changes the priority of a user on the UCSD tier-2 cluster. I had to do that on Sunday evening to get some traction on getting jobs through at UCSD. First of all, find out where the negotiator is actually running:

<pre>
condor_status -negotiator
</pre>

At present, the negotiator is run on osg-gw-1.
Then login to that node as root, and do:

<pre>
condor_userprio -all |more
condor_userprio -all |grep uscms1586
condor_userprio -setfactor group_cms.uscms1586@osg-gw-2.t2.ucsd.edu 1
condor_userprio -setfactor group_cms.uscms1586@osg-gw-4.t2.ucsd.edu 1
condor_userprio -all |grep uscms1586
condor_userprio -setprio group_cms.uscms1586@osg-gw-2.t2.ucsd.edu 0.5
condor_userprio -setprio group_cms.uscms1586@osg-gw-4.t2.ucsd.edu 0.5
</pre>

The setprio command basically resets your effective priority to 0.5, i.e. you start over near zero, as if you hadn't run here in ages. The setfactor command sets the multiplier for your wallclock time to get the effective priority. This is an integer. Setting it to 1 is thus the best you can do.

---++ Managing the certificate

To manage a long-lived certificate from a crontab without breaking a sweat, the following is useful:
   * First, create a very long-lived proxy: <pre>
[1600] fkw@uaf-1 ~/CMS/condor$ voms-proxy-init -hours 9876543:0 -out=/tmp/testForDave.file
Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses
Enter GRID pass phrase:
Your identity: /DC=org/DC=doegrids/OU=People/CN=Frank Wuerthwein 699373
Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses
Creating proxy ............................................... Done

Warning: your certificate and proxy will expire Sat Feb 7 19:51:06 2009
which is within the requested lifetime of the proxy
</pre>
   * Then use that very long-lived proxy to get a short-lived proxy with a CMS attribute attached to it: <pre>
[1602] fkw@uaf-1 ~/CMS/condor$ voms-proxy-init -voms cms -hours 120 -valid=96:0 -cert=/tmp/testForDave.file
Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses
Your identity: /DC=org/DC=doegrids/OU=People/CN=Frank Wuerthwein 699373/CN=proxy
Cannot find file or dir: /home/condor/execute/dir_11128/userdir/glite/etc/vomses
Creating temporary proxy ....................................................
Done
Contacting voms.cern.ch:15002 [/DC=ch/DC=cern/OU=computers/CN=voms.cern.ch] "cms" Done
Creating proxy ........................................ Done

Your proxy is valid until Sat Jun 7 16:02:59 2008
</pre>
   * Finally, verify that this got me a proxy with the attribute, that it is short-lived, and that it is by default in the right place: <pre>
[1603] fkw@uaf-1 ~/CMS/condor$ voms-proxy-info -all
WARNING: Unable to verify signature! Server certificate possibly not installed.
Error: Cannot find certificate of AC issuer for vo cms
subject   : /DC=org/DC=doegrids/OU=People/CN=Frank Wuerthwein 699373/CN=proxy/CN=proxy
issuer    : /DC=org/DC=doegrids/OU=People/CN=Frank Wuerthwein 699373/CN=proxy
identity  : /DC=org/DC=doegrids/OU=People/CN=Frank Wuerthwein 699373/CN=proxy
type      : unknown
strength  : 512 bits
path      : /tmp/x509up_u10179
timeleft  : 95:59:39
=== VO cms extension information ===
VO        : cms
subject   : /DC=org/DC=doegrids/OU=People/CN=Frank Wuerthwein 699373
issuer    : /DC=ch/DC=cern/OU=computers/CN=voms.cern.ch
attribute : /cms/Role=NULL/Capability=NULL
attribute : /cms/uscms/Role=NULL/Capability=NULL
timeleft  : 95:59:39
</pre>

-- Main.FkW - 25 Apr 2008

-- Main.SanjayPadhi - 06 Aug 2008
Topic revision: r17 - 2008/08/06 - 15:14:58 - SanjayPadhi