GlideinWMSCrab (revision 11)

List of Issues (Logbook)

2009/03/31

1.

2009-03-31 07:42:51,385:FatWorker worker_0 preparing submission
2009-03-31 07:42:51,386:FatWorker worker_0 performing list-match operation
2009-03-31 07:42:59,743:Sending TTXmlLogging.
2009-03-31 07:42:59,743:Registering information:
{'submittedJobs': None, 'SE-White': None, 'exc': 'Traceback (most recent call last):\n File "/home/hpi/CRABSERVER_Deployment/MYTESTAREA/slc4_ia32_gcc345/cms/crab-server/CRABSERVER_1_0_7-cmp/lib/CrabServerWork
er/FatWorker.py", line 147, in run\n sub_jobs, reqs_jobs, matched, unmatched = self.submissionListCreation(taskObj, newRange)\n File "/home/hpi/CRABSERVER_Deployment/MYTESTAREA/slc4_ia32_gcc345/cms/crab-se
rver/CRABSERVER_1_0_7-cmp/lib/CrabServerWorker/FatWorker.py", line 570, in submissionListCreation\n schedParam, sites = self.sched_parameter_Glidein(id_job, taskObj)\n File "/home/hpi/CRABSERVER_Deployment
/MYTESTAREA/slc4_ia32_gcc345/cms/crab-server/CRABSERVER_1_0_7-cmp/lib/CrabServerWorker/FatWorker.py", line 780, in sched_parameter_Glidein\n availCEs = listAllCEs(version, arch, onlyOSG=onlyOSG)\n File "/h
ome/hpi/CRABSERVER_Deployment/MYTESTAREA/slc4_ia32_gcc345/cms/prodcommon/PRODCOMMON_0_12_12_CRAB_1-cmp/lib/ProdCommon/BDII/BdiiLdap.py", line 247, in listAllCEs\n ceList = filterCE(ceList, software, arch, b
dii, onlyOSG)\n File "/home/hpi/CRABSERVER_Deployment/MYTESTAREA/slc4_ia32_gcc345/cms/prodcommon/PRODCOMMON_0_12_12_CRAB_1-cmp/lib/ProdCommon/BDII/BdiiLdap.py", line 219, in filterCE\n ceList = getSoftware
AndArch(ceList, software, arch, bdii)\n File "/home/hpi/CRABSERVER_Deployment/MYTESTAREA/slc4_ia32_gcc345/cms/prodcommon/PRODCOMMON_0_12_12_CRAB_1-cmp/lib/ProdCommon/BDII/BdiiLdap.py", line 184, in getSoftwar
eAndArch\n query += buildOrQuery(\'GlueChunkKey=GlueClusterUniqueID\', [ce_to_cluster_map[h] for h in host_list])\nKeyError: \'gridce.sns.it:2119/jobmanager-lcgpbs-cms\'\n', 'skippedJobs': None, 'error': 'W
orkerError worker_0. Task spiga_crab_0_090331_164202_45iyu1. listMatch.', 'reason': 'Failure in pre-submission init', 'SE-Black': "['gridce.pg.infn.it']", 'unmatchedJobs': None, 'range': '[1, 2, 3, 4, 5, 6, 7,
8, 9, 10]', 'CE-White': None, 'time': None, 'notSubmittedJobs': None, 'ev': 'Submission', 'CE-Black': "['fnal.gov', 'gridka.de', 'w-ce01.grid.sinica.edu.tw', 'w-ce02.grid.sinica.edu.tw', 'lcg00125.grid.sinica
.edu.tw', 'gridpp.rl.ac.uk', 'cclcgceli03.in2p3.fr', 'cclcgceli04.in2p3.fr', 'pic.es', 'cnaf']"}
2009-03-31 07:42:59,744:WorkerError worker_0. Task spiga_crab_0_090331_164202_45iyu1. listMatch.
2009-03-31 07:42:59,744:'gridce.sns.it:2119/jobmanager-lcgpbs-cms'
2009-03-31 07:42:59,744:FatWorker worker_0 performing submission
2009-03-31 07:42:59,748:Sending TTXmlLogging.
2009-03-31 07:42:59,748:Registering information:
{'submittedJobs': None, 'SE-White': None, 'exc': 'Traceback (most recent call last):\n File "/home/hpi/CRABSERVER_Deployment/MYTESTAREA/slc4_ia32_gcc345/cms/crab-server/CRABSERVER_1_0_7-cmp/lib/CrabServerWork
er/FatWorker.py", line 159, in run\n submittedJobs, nonSubmittedJobs, errorTrace = self.submitTaskBlocks(taskObj, sub_jobs, reqs_jobs, matched)\n File "/home/hpi/CRABSERVER_Deployment/MYTESTAREA/slc4_ia32_
gcc345/cms/crab-server/CRABSERVER_1_0_7-cmp/lib/CrabServerWorker/FatWorker.py", line 372, in submitTaskBlocks\n for sub in sub_jobs: fullSubJob.extend(sub)\nTypeError: iteration over non-sequence\n', 'skipp
edJobs': None, 'error': 'WorkerError worker_0. Task spiga_crab_0_090331_164202_45iyu1.', 'reason': 'Failure during jobs submission', 'SE-Black': "['gridce.pg.infn.it']", 'unmatchedJobs': None, 'range': '[1, 2,
3, 4, 5, 6, 7, 8, 9, 10]', 'CE-White': None, 'time': None, 'notSubmittedJobs': None, 'ev': 'Submission', 'CE-Black': "['fnal.gov', 'gridka.de', 'w-ce01.grid.sinica.edu.tw', 'w-ce02.grid.sinica.edu.tw', 'lcg00
125.grid.sinica.edu.tw', 'gridpp.rl.ac.uk', 'cclcgceli03.in2p3.fr', 'cclcgceli04.in2p3.fr', 'pic.es', 'cnaf']"}
2009-03-31 07:42:59,748:WorkerError worker_0. Task spiga_crab_0_090331_164202_45iyu1.
2009-03-31 07:42:59,748:iteration over non-sequence
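
The first traceback above ends in a KeyError because a CE in host_list has no entry in ce_to_cluster_map. The second error (iteration over non-sequence) is presumably a consequence: sub_jobs was never filled once listMatch failed. Below is a minimal sketch of a more tolerant lookup, assuming it is acceptable to skip CEs that have no cluster mapping. Only ce_to_cluster_map and host_list come from the traceback; the helper name clusters_for_hosts and the sample data are hypothetical, and the sketch is written for current Python rather than the 2.4 interpreter shown in the logs.

```python
def clusters_for_hosts(host_list, ce_to_cluster_map):
    """Return (clusters, missing): cluster IDs for mapped CEs, and CEs with no mapping."""
    clusters = []
    missing = []
    for host in host_list:
        cluster = ce_to_cluster_map.get(host)
        if cluster is None:
            # No mapping for this CE: remember it instead of raising KeyError.
            missing.append(host)
        else:
            clusters.append(cluster)
    return clusters, missing


# Illustrative data only; the mapped CE and its cluster ID are made up.
ce_map = {"ce.example.org:2119/jobmanager-lcgpbs-cms": "cluster.example.org"}
hosts = [
    "ce.example.org:2119/jobmanager-lcgpbs-cms",
    "gridce.sns.it:2119/jobmanager-lcgpbs-cms",
]
clusters, missing = clusters_for_hosts(hosts, ce_map)
print(clusters)
print(missing)
```

Returning the unmapped CEs separately lets the caller log them rather than abort the whole list-match step.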

2.

2009-03-31 11:32:00,441:Registering information:
{'submittedJobs': None, 'SE-White': "['grid-srm.physik.rwth-aachen.de']", 'exc': 'Traceback (most recent call last):\n File "/home/hpi/CRABSERVER_Deployment/MYTESTAREA/slc4_ia32_gcc345/cms/crab-server/CRABSERVER_1_0_7-cmp/lib/CrabServerWorker/FatWorker.py", line 159, in run\n submittedJobs, nonSubmittedJobs, errorTrace = self.submitTaskBlocks(taskObj, sub_jobs, reqs_jobs, matched)\n File "/home/hpi/CRABSERVER_Deployment/MYTESTAREA/slc4_ia32_gcc345/cms/crab-server/CRABSERVER_1_0_7-cmp/lib/CrabServerWorker/FatWorker.py", line 397, in submitTaskBlocks\n task = self.blSchedSession.submit(task[\'id\'], sub_jobs[ii], reqs_jobs[ii])\n File "/home/hpi/CRABSERVER_Deployment/MYTESTAREA/slc4_ia32_gcc345/cms/prodcommon/PRODCOMMON_0_12_12_CRAB_1-cmp/lib/ProdCommon/BossLite/API/BossLiteAPISched.py", line 129, in submit\n self.scheduler.submit( task, requirements )\n File "/home/hpi/CRABSERVER_Deployment/MYTESTAREA/slc4_ia32_gcc345/cms/prodcommon/PRODCOMMON_0_12_12_CRAB_1-cmp/lib/ProdCommon/BossLite/Scheduler/Scheduler.py", line 95, in submit\n job.runningJob[\'schedulerId\'] = jobAttributes[ job[\'name\'] ]\nKeyError: \'spadhi_crab_0_090331_202909_4p39kj_job114\'\n', 'skippedJobs': None, 'error': 'WorkerError worker_1. 
Task spadhi_crab_0_090331_202909_4p39kj.', 'reason': 'Failure during jobs submission', 'SE-Black': None, 'unmatchedJobs': None, 'range': '[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125]', 'CE-White': None, 'time': None, 'notSubmittedJobs': None, 'ev': 'Submission', 'CE-Black': "['fnal.gov', 'gridka.de', 'w-ce01.grid.sinica.edu.tw', 'w-ce02.grid.sinica.edu.tw', 'lcg00125.grid.sinica.edu.tw', 'gridpp.rl.ac.uk', 'cclcgceli03.in2p3.fr', 'cclcgceli04.in2p3.fr', 'pic.es', 'cnaf']"}
2009-03-31 11:32:00,441:WorkerError worker_1. Task spadhi_crab_0_090331_202909_4p39kj.

2009/03/21

1. Desired_SE names to be part of the jdl

2. SchedulerGrid.py:

Issues for glideins and condor-g:

txt += ' echo "SyncCE=`glite-brokerinfo getCE`" >> $RUNTIME_AREA/$repo \n'

txt += 'if [ $middleware = LCG ]; then\n'
txt += ' CloseCEs=`glite-brokerinfo getCE`\n'

These glite-specific dependencies need to be addressed.
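
One possible way to isolate the glite dependency is to emit the glite-brokerinfo line only for schedulers that are known to provide that command. This is a sketch under assumptions, not the actual SchedulerGrid.py code: the function name sync_ce_lines and the scheduler names in GLITE_SCHEDULERS are hypothetical.

```python
# Assumption: scheduler names for which glite-brokerinfo exists on the worker node.
GLITE_SCHEDULERS = ("glite", "glitecoll")


def sync_ce_lines(scheduler):
    """Build the wrapper-script line that records the CE; empty for other schedulers."""
    if scheduler.lower() not in GLITE_SCHEDULERS:
        # glideins and condor-g have no BrokerInfo file, so emit nothing.
        return ""
    return ' echo "SyncCE=`glite-brokerinfo getCE`" >> $RUNTIME_AREA/$repo \n'


print(repr(sync_ce_lines("glidein")))
print(repr(sync_ce_lines("glite")))
```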

3. FatWorker.py

2009-03-20 20:38:16,516:Registering information:
{'submittedJobs': None, 'SE-White': None, 'exc': 'Traceback (most recent call last):\n File "/home/hpi/CRABSERVER_Deployment/MYTESTAREA/slc4_ia32_gcc345/cms/crab-server/CRABSERVER_1_0_7-cmp/lib/CrabServerWorker/FatWorker.py", line 159, in run\n submittedJobs, nonSubmittedJobs, errorTrace = self.submitTaskBlocks(taskObj, sub_jobs, reqs_jobs, matched)\n File "/home/hpi/CRABSERVER_Deployment/MYTESTAREA/slc4_ia32_gcc345/cms/crab-server/CRABSERVER_1_0_7-cmp/lib/CrabServerWorker/FatWorker.py", line 378, in submitTaskBlocks\n self.SendMLpre(task)\n File "/home/hpi/CRABSERVER_Deployment/MYTESTAREA/slc4_ia32_gcc345/cms/crab-server/CRABSERVER_1_0_7-cmp/lib/CrabServerWorker/FatWorker.py", line 615, in SendMLpre\n params = self.collect_MLInfo(task)\n File "/home/hpi/CRABSERVER_Deployment/MYTESTAREA/slc4_ia32_gcc345/cms/crab-server/CRABSERVER_1_0_7-cmp/lib/CrabServerWorker/FatWorker.py", line 647, in collect_MLInfo\n params = {\'tool\': \'crab\',\\\n File "/build/fvlingen/CMS_BUILD/comp-nightly-prodagent/w/slc4_ia32_gcc345/external/python/2.4.2-cmp4/lib/python2.4/UserDict.py", line 17, in __getitem__\n def __getitem__(self, key): return self.data[key]\nKeyError: \'HOSTNAME\'\n', 'skippedJobs': None, 'error': 'WorkerError worker_0. Task spadhi_crab_0_090321_043717_80rcv3.', 'reason': 'Failure during jobs submission', 'SE-Black': None, 'unmatchedJobs': None, 'range': '[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]', 'CE-White': None, 'time': None, 'notSubmittedJobs': None, 'ev': 'Submission', 'CE-Black': "['fnal.gov', 'gridka.de', 'w-ce01.grid.sinica.edu.tw', 'w-ce02.grid.sinica.edu.tw', 'lcg00125.grid.sinica.edu.tw', 'gridpp.rl.ac.uk', 'cclcgceli03.in2p3.fr', 'cclcgceli04.in2p3.fr', 'pic.es', 'cnaf']"}
2009-03-20 20:38:16,517:WorkerError worker_0. Task spadhi_crab_0_090321_043717_80rcv3.
2009-03-20 20:38:16,517:'HOSTNAME'
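
The KeyError above is raised from UserDict.__getitem__, which is consistent with a direct environment lookup of HOSTNAME in collect_MLInfo when the variable is not set for the server process. A minimal sketch of a fallback follows, assuming the local fully qualified name is an acceptable substitute. The helper name submit_hostname is hypothetical, and the sketch targets current Python rather than 2.4.

```python
import os
import socket


def submit_hostname(environ=os.environ):
    """Return HOSTNAME from the environment, or the local FQDN if it is unset or empty."""
    return environ.get("HOSTNAME") or socket.getfqdn()


# Illustrative value only.
print(submit_hostname({"HOSTNAME": "crabserver.example.org"}))
```

Exporting HOSTNAME in the environment that starts the server would avoid the error without a code change.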

2.

2009-03-20 18:04:23,639:Registering information:
{'submittedJobs': None, 'SE-White': None, 'exc': 'Traceback (most recent call last):\n File "/home/hpi/CRABSERVER_Deployment/MYTESTAREA/slc4_ia32_gcc345/
cms/crab-server/CRABSERVER_1_0_7-cmp/lib/CrabServerWorker/FatWorker.py", line 159, in run\n submittedJobs, nonSubmittedJobs, errorTrace = self.submitTa
skBlocks(taskObj, sub_jobs, reqs_jobs, matched)\n File "/home/hpi/CRABSERVER_Deployment/MYTESTAREA/slc4_ia32_gcc345/cms/crab-server/CRABSERVER_1_0_7-cmp/
lib/CrabServerWorker/FatWorker.py", line 378, in submitTaskBlocks\n self.SendMLpre(task)\n File "/home/hpi/CRABSERVER_Deployment/MYTESTAREA/slc4_ia32_
gcc345/cms/crab-server/CRABSERVER_1_0_7-cmp/lib/CrabServerWorker/FatWorker.py", line 615, in SendMLpre\n params = self.collect_MLInfo(task)\n File "/h
ome/hpi/CRABSERVER_Deployment/MYTESTAREA/slc4_ia32_gcc345/cms/crab-server/CRABSERVER_1_0_7-cmp/lib/CrabServerWorker/FatWorker.py", line 647, in collect_ML
Info\n params = {\'tool\': \'crab\',\\\n File "/build/fvlingen/CMS_BUILD/comp-nightly-prodagent/w/slc4_ia32_gcc345/external/python/2.4.2-cmp4/lib/pyth
on2.4/UserDict.py", line 17, in __getitem__\n def __getitem__(self, key): return self.data[key]\nKeyError: \'HOSTNAME\'\n', 'skippedJobs': None, 'error
': 'WorkerError worker_0. Task spadhi_crab_0_090321_020302_3vfw41.', 'reason': 'Failure during jobs submission', 'SE-Black': None, 'unmatchedJobs': None,
'range': '[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77
, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112]
', 'CE-White': None, 'time': None, 'notSubmittedJobs': None, 'ev': 'Submission', 'CE-Black': "['fnal.gov', 'gridka.de', 'w-ce01.grid.sinica.edu.tw', 'w-ce
02.grid.sinica.edu.tw', 'lcg00125.grid.sinica.edu.tw', 'gridpp.rl.ac.uk', 'cclcgceli03.in2p3.fr', 'cclcgceli04.in2p3.fr', 'pic.es', 'cnaf']"}

2009/03/07

1. GCB issues from the WN (some examples)

HOSTNAME=n74.lcgwn.kiae, n18.lcgwn.kiae

CE=gate.grid.kiae.ru:2119/jobmanager-lcgpbs-cms

HOSTNAME=wn011.polgrid.pl

CE=ce.polgrid.pl:2119/jobmanager-lcgpbs-cms

HOSTNAME=gaew0213.ciemat.es, gaew0225.ciemat.es

CE=lcg02.ciemat.es:2119/jobmanager-lcgpbs-cms

HOSTNAME=wn002.jinr.ru

CE=lcgce02.jinr.ru:2119/jobmanager-lcgpbs-cms

3/7 22:32:15 (pid:6761) GCB: ERROR "handleActiveBlkedConn(8): setting status=CONN_FAILED, error 111 Connection refused"
3/7 22:32:15 (pid:6761) attempt to connect to <169.228.130.23:9629> failed: Connection refused (connect errno = 111).
3/7 22:32:15 (pid:6761) ERROR: SECMAN:2004:Failed to create security session to <169.228.130.23:9629> with TCP|SECMAN:2003:TCP connection to <169.228.130.23:9629> failed

3/7 22:32:15 (pid:6761) Failed to start non-blocking update to <169.228.130.23:9629>.
3/7 22:37:46 (pid:6761) GCB: ERROR "handleActiveBlkedConn(8): setting status=CONN_FAILED, error 111 Connection refused"
3/7 22:37:46 (pid:6761) attempt to connect to <169.228.130.23:9629> failed: Connection refused (connect errno = 111).
3/7 22:37:46 (pid:6761) ERROR: SECMAN:2004:Failed to create security session to <169.228.130.23:9629> with TCP|SECMAN:2003:TCP connection to <169.228.130.23:9629> failed

3/7 22:37:46 (pid:6761) Failed to start non-blocking update to <169.228.130.23:9629>.
3/7 22:43:15 (pid:6761) GCB: ERROR "handleActiveBlkedConn(8): setting status=CONN_FAILED, error 111 Connection refused"
3/7 22:43:15 (pid:6761) attempt to connect to <169.228.130.23:9629> failed: Connection refused (connect errno = 111).
3/7 22:43:15 (pid:6761) ERROR: SECMAN:2004:Failed to create security session to <169.228.130.23:9629> with TCP|SECMAN:2003:TCP connection to <169.228.130.23:9629> failed

3/7 22:43:15 (pid:6761) Failed to start non-blocking update to <169.228.130.23:9629>.
3/7 22:48:45 (pid:6761) GCB: ERROR "handleActiveBlkedConn(8): setting status=CONN_FAILED, error 111 Connection refused"
3/7 22:48:45 (pid:6761) attempt to connect to <169.228.130.23:9629> failed: Connection refused (connect errno = 111).
3/7 22:48:45 (pid:6761) ERROR: SECMAN:2004:Failed to create security session to <169.228.130.23:9629> with TCP|SECMAN:2003:TCP connection to <169.228.130.23:9629> failed

3/7 22:48:45 (pid:6761) Failed to start non-blocking update to <169.228.130.23:9629>.
3/7 22:54:11 (pid:6761) No resources have been claimed for 1200 seconds
3/7 22:54:11 (pid:6761) Shutting down Condor on this machine.
3/7 22:54:11 (pid:6761) Got SIGTERM. Performing graceful shutdown.
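
To separate plain network blocking from GCB-specific trouble, a worker node can be probed with a simple TCP connect to the address that appears in the log. This is only a sketch: it does not exercise GCB itself, and the helper name can_connect is hypothetical.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) can be opened within timeout."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers connection refused, unreachable hosts and timeouts.
        return False
    sock.close()
    return True
```

Running can_connect("169.228.130.23", 9629) on one of the worker nodes listed above would show whether the "Connection refused" is reproducible outside Condor.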

2. Users generating MC (with Datasetpath=None)

DESIRED_Gatekeepers is empty.

2009/03/05

1. GCB Fails

/home/cms001/globus-tmp.cmsfarm-08-16.20721.0/glide_S20865/condor/sbin/gcb_broker_query: error while loading shared libraries: libstdc++.so.5: cannot open shared object file: No such file or directory

Sites are:

1. CE=ce.indiacms.res.in:2119/jobmanager-lcgpbs-cms

HOST=wn104.indiacms.res.in

2. CE=oberon.hep.kbfi.ee:2119/jobmanager-lcgpbs-long

HOSTNAME=wn-b-36

3. The following are all at the same site:

CE=t2-ce-01.lnl.infn.it:2119/jobmanager-lcglsf-cms

HOSTNAME=cmsfarm-12-04

CE=t2-ce-02.lnl.infn.it:2119/jobmanager-lcglsf-cms

HOSTNAME=cmsfarm-08-10

CE=t2-ce-03.lnl.infn.it:2119/jobmanager-lcglsf-cms

HOSTNAME=cmsfarm-12-02

Igor's problem to fix (FIXED)

2. Condor exit code fails

Factory submits more jobs

Igor's problem: talk to Condor to get it fixed (FIXED)

3. DESIRED_Gatekeeper string empty

This happens if the osg_bdii cannot find the CE name.

For Eric to fix. Will mock up the behaviour of glite submission, in that job submission to crabserver will fail if the CE name is not found via the BDII query.
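
The intended behaviour (fail the submission instead of passing on an empty string) could look roughly like this. It is a sketch only: the function name desired_gatekeepers, the exception class and the comma-separated format are assumptions, not the crabserver implementation.

```python
class NoGatekeeperError(Exception):
    """Raised when no CE name could be resolved for a task."""


def desired_gatekeepers(ce_names):
    """Join CE names into one comma-separated string; refuse to return an empty one."""
    cleaned = [ce.strip() for ce in ce_names if ce and ce.strip()]
    if not cleaned:
        # Mirror the glite behaviour: no CE found means the submission fails.
        raise NoGatekeeperError("no CE name found via BDII query; refusing to submit")
    return ",".join(cleaned)


# Illustrative CE names only.
print(desired_gatekeepers(["ce1.example.org", " ce2.example.org "]))
```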

4. environment GLITE_WMS_RB_BROKERINFO or EDG_WL_RB_BROKERINFO for glideins

Error:
- environment GLITE_WMS_RB_BROKERINFO or EDG_WL_RB_BROKERINFO not defined
- ./BrokerInfo file is not found
Error:
- environment GLITE_WMS_RB_BROKERINFO or EDG_WL_RB_BROKERINFO not defined
- ./BrokerInfo file is not found

Eric's problem to disable it for condor_g and glidein.

5.

Traceback (most recent call last):

File "/home/cms040/globus-tmp.cmsfarm-04-10.29601.0/glide_R29835/execute/dir_30920/writeCfg.py", line 234, in ?
exit_status = main(sys.argv[1:])
File "/home/cms040/globus-tmp.cmsfarm-04-10.29601.0/glide_R29835/execute/dir_30920/writeCfg.py", line 90, in main
maxEvents = int(os.environ.get('MaxEvents', '0'))
ValueError: invalid literal for int(): /store/mc/JobRobot/QCD_pt_0_15/GEN-SIM-RAW-RECO/IDEAL_V9_JobRobot/0000/A48D5963-E5A1-DD11-83B5-001560AC7E98.root
%MSG-s CMSException: PoolSource:source{*ctor*} 04-Mar-2009 23:10:46 CET pre-events
cms::Exception caught in cmsRun
---- Configuration BEGIN
Error occured while creating source PoolSource
---- Configuration BEGIN
MissingParameter: The required parameter 'fileNames' was not specified.
---- Configuration END
---- Configuration END

SOLUTION: Problem identified and fixed (positional parameters were wrong)
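
A shifted positional parameter shows up here only as a bare ValueError from int(). A sketch of a more explicit check around the same MaxEvents lookup follows; the helper name read_max_events and the error wording are hypothetical, and the sketch targets current Python.

```python
import os


def read_max_events(environ=os.environ):
    """Parse MaxEvents from the environment, with a clearer error for bad values."""
    raw = environ.get("MaxEvents", "0")
    try:
        return int(raw)
    except ValueError:
        raise ValueError(
            "MaxEvents must be an integer, got %r; "
            "check the order of the wrapper's positional parameters" % raw
        )


print(read_max_events({"MaxEvents": "100"}))
```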

6. Stop Condor from sending the output files back to the server.

This needs some thought! The problem is that anybody using crabserver at a reasonable scale ends up having too many files to get back, each and every one of which is a separate gridftp connection to the crabserver host. This is a royal pain in the neck for the user. One way to fix it is to gzip the tgz's at the server into one, and grab that one larger gzipped archive from the client. We should have a discussion about the pros and cons of this!
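
As input for that discussion, the server-side bundling step could be as small as the following sketch. The function name bundle_outputs and the flat layout inside the archive are assumptions; nothing here is existing crabserver code.

```python
import os
import tarfile


def bundle_outputs(paths, bundle_path):
    """Pack the given files into one gzip-compressed tar archive; return its path."""
    with tarfile.open(bundle_path, "w:gz") as bundle:
        for path in paths:
            # Store each file under its base name so the bundle unpacks flat.
            bundle.add(path, arcname=os.path.basename(path))
    return bundle_path
```

Since the per-job tgz files are already compressed, opening the bundle with mode "w" instead of "w:gz" would save CPU at little cost in size; either way the client needs a single transfer.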

7. gfactory that can work with multiple proxies

Igor's problem. We are ready to deploy a gfactory that works with multiple proxies any time we get one. An ideal configuration would be to have Stefano's jobRobot jobs all run with his proxy only, while the rest of the user jobs can use the "service proxies".

8. CRAB status does not communicate the associated CE names back.

When you do a
crab -status
via the client, you do not get the CE name where your job is running. This is not a big deal. Just listed here for completeness.

9. All sorts of dashboard related issues

Sanjay's problems

10. Crab client looks for condor daemons/commands by default

Need to disable this check in order to use the client at lxplus. We do not have Condor installed at CERN.

Eric's problem

-- SanjayPadhi - 2009/03/04

Topic revision: r11 - 2009/03/31 - 18:33:58 - SanjayPadhi
 