GlideinWMSCrab (revision 8)

List of Issues (Logbook)

2009/03/07

1. GCB issues from the WN (some examples)

HOSTNAME=n74.lcgwn.kiae, n18.lcgwn.kiae

CE=gate.grid.kiae.ru:2119/jobmanager-lcgpbs-cms

HOSTNAME=wn011.polgrid.pl

CE=ce.polgrid.pl:2119/jobmanager-lcgpbs-cms

HOSTNAME=gaew0213.ciemat.es, gaew0225.ciemat.es

CE=lcg02.ciemat.es:2119/jobmanager-lcgpbs-cms

HOSTNAME=wn002.jinr.ru

CE=lcgce02.jinr.ru:2119/jobmanager-lcgpbs-cms

3/7 22:32:15 (pid:6761) GCB: ERROR "handleActiveBlkedConn(8): setting status=CONN_FAILED, error 111 Connection refused"
3/7 22:32:15 (pid:6761) attempt to connect to <169.228.130.23:9629> failed: Connection refused (connect errno = 111).
3/7 22:32:15 (pid:6761) ERROR: SECMAN:2004:Failed to create security session to <169.228.130.23:9629> with TCP|SECMAN:2003:TCP connection to <169.228.130.23:9629> failed

3/7 22:32:15 (pid:6761) Failed to start non-blocking update to <169.228.130.23:9629>.
3/7 22:37:46 (pid:6761) GCB: ERROR "handleActiveBlkedConn(8): setting status=CONN_FAILED, error 111 Connection refused"
3/7 22:37:46 (pid:6761) attempt to connect to <169.228.130.23:9629> failed: Connection refused (connect errno = 111).
3/7 22:37:46 (pid:6761) ERROR: SECMAN:2004:Failed to create security session to <169.228.130.23:9629> with TCP|SECMAN:2003:TCP connection to <169.228.130.23:9629> failed

3/7 22:37:46 (pid:6761) Failed to start non-blocking update to <169.228.130.23:9629>.
3/7 22:43:15 (pid:6761) GCB: ERROR "handleActiveBlkedConn(8): setting status=CONN_FAILED, error 111 Connection refused"
3/7 22:43:15 (pid:6761) attempt to connect to <169.228.130.23:9629> failed: Connection refused (connect errno = 111).
3/7 22:43:15 (pid:6761) ERROR: SECMAN:2004:Failed to create security session to <169.228.130.23:9629> with TCP|SECMAN:2003:TCP connection to <169.228.130.23:9629> failed

3/7 22:43:15 (pid:6761) Failed to start non-blocking update to <169.228.130.23:9629>.
3/7 22:48:45 (pid:6761) GCB: ERROR "handleActiveBlkedConn(8): setting status=CONN_FAILED, error 111 Connection refused"
3/7 22:48:45 (pid:6761) attempt to connect to <169.228.130.23:9629> failed: Connection refused (connect errno = 111).
3/7 22:48:45 (pid:6761) ERROR: SECMAN:2004:Failed to create security session to <169.228.130.23:9629> with TCP|SECMAN:2003:TCP connection to <169.228.130.23:9629> failed

3/7 22:48:45 (pid:6761) Failed to start non-blocking update to <169.228.130.23:9629>.
3/7 22:54:11 (pid:6761) No resources have been claimed for 1200 seconds
3/7 22:54:11 (pid:6761) Shutting down Condor on this machine.
3/7 22:54:11 (pid:6761) Got SIGTERM. Performing graceful shutdown.
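
To see at a glance which endpoint a glidein keeps failing to reach, the log can be summarised with a few lines of Python. This is only a sketch: the regular expression is written against the lines quoted above, and the function name count_refused is made up for this example.

```python
import re
from collections import Counter

# Matches lines such as:
#   3/7 22:32:15 (pid:6761) attempt to connect to <169.228.130.23:9629> failed: Connection refused (connect errno = 111).
REFUSED = re.compile(r"attempt to connect to <([^>]+)> failed: Connection refused")


def count_refused(lines):
    """Return a Counter mapping 'host:port' to the number of refused connects."""
    counts = Counter()
    for line in lines:
        match = REFUSED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Three lines copied from the log above; only two of them are refused connects.
sample = [
    "3/7 22:32:15 (pid:6761) attempt to connect to <169.228.130.23:9629> failed: Connection refused (connect errno = 111).",
    "3/7 22:32:15 (pid:6761) Failed to start non-blocking update to <169.228.130.23:9629>.",
    "3/7 22:37:46 (pid:6761) attempt to connect to <169.228.130.23:9629> failed: Connection refused (connect errno = 111).",
]

if __name__ == "__main__":
    for endpoint, hits in count_refused(sample).items():
        print(endpoint, hits)
```

Feeding it a whole log file (for example `count_refused(open(path))`, with path pointing at the daemon log) gives one count per endpoint.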

2. Users generating MC (with Datasetpath=None)

DESIRED_Gatekeepers is empty.

2009/03/05

1. GCB Fails

/home/cms001/globus-tmp.cmsfarm-08-16.20721.0/glide_S20865/condor/sbin/gcb_broker_query: error while loading shared libraries: libstdc++.so.5: cannot open shared object file: No such file or directory

Sites are:

1. CE=ce.indiacms.res.in:2119/jobmanager-lcgpbs-cms

HOST=wn104.indiacms.res.in

2. CE=oberon.hep.kbfi.ee:2119/jobmanager-lcgpbs-long

HOSTNAME=wn-b-36

3. The following CEs all belong to the same site: CE=t2-ce-01.lnl.infn.it:2119/jobmanager-lcglsf-cms

HOSTNAME=cmsfarm-12-04

CE=t2-ce-02.lnl.infn.it:2119/jobmanager-lcglsf-cms

HOSTNAME=cmsfarm-08-10

CE=t2-ce-03.lnl.infn.it:2119/jobmanager-lcglsf-cms

HOSTNAME=cmsfarm-12-02

Igor's problem to fix (FIXED)
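
A quick way to test a worker node for this kind of failure is to ask the dynamic loader for the library directly. The sketch below assumes a Python with the ctypes module is available on the node; the helper name can_load is made up here, and this is not the fix that was actually applied.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        # The loader could not find or open the library.
        return False
    return True


if __name__ == "__main__":
    # libstdc++.so.5 is the library named in the gcb_broker_query error above.
    print("libstdc++.so.5 loadable:", can_load("libstdc++.so.5"))
```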

2. Condor exit code fails

Factory submits more jobs

Igor's problem to talk to Condor to get it fixed (FIXED)

3. DESIRED_Gatekeeper string empty

This happens if the osg_bdii cannot find the CE name.

For Eric to fix. Will mimic the behaviour of glite submission: job submission to crabserver will fail if the CE name is not found via the bdii query.
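
The intended behaviour (fail the submission rather than pass on an empty string) could look roughly like the sketch below. Everything in it is an assumption made for illustration: the function and exception names are hypothetical, the set of known CEs stands in for the result of the bdii query, and joining the names with commas is a guess at the string format.

```python
class UnknownCEError(Exception):
    """Raised when none of the requested CE names is known."""


def desired_gatekeepers(requested, known):
    """Return the gatekeeper string, or raise if it would be empty.

    requested: CE names asked for by the job.
    known: CE names reported by the information system (stand-in for the
           bdii query result; hypothetical).
    """
    matched = [ce for ce in requested if ce in known]
    if not matched:
        raise UnknownCEError(
            "no requested CE found via bdii: %s" % ", ".join(requested)
        )
    # Comma-separated output is an assumption about the expected format.
    return ",".join(matched)


if __name__ == "__main__":
    known = {"ce.polgrid.pl:2119/jobmanager-lcgpbs-cms"}
    print(desired_gatekeepers(["ce.polgrid.pl:2119/jobmanager-lcgpbs-cms"], known))
```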

4. environment GLITE_WMS_RB_BROKERINFO or EDG_WL_RB_BROKERINFO for glideins

Error:
- environment GLITE_WMS_RB_BROKERINFO or EDG_WL_RB_BROKERINFO not defined
- ./BrokerInfo file is not found
Error:
- environment GLITE_WMS_RB_BROKERINFO or EDG_WL_RB_BROKERINFO not defined
- ./BrokerInfo file is not found

Eric's problem to disable it for condor_g and glidein.

5. Traceback (most recent call last):

File "/home/cms040/globus-tmp.cmsfarm-04-10.29601.0/glide_R29835/execute/dir_30920/writeCfg.py", line 234, in ?
exit_status = main(sys.argv[1:])
File "/home/cms040/globus-tmp.cmsfarm-04-10.29601.0/glide_R29835/execute/dir_30920/writeCfg.py", line 90, in main
maxEvents = int(os.environ.get('MaxEvents', '0'))
ValueError: invalid literal for int(): /store/mc/JobRobot/QCD_pt_0_15/GEN-SIM-RAW-RECO/IDEAL_V9_JobRobot/0000/A48D5963-E5A1-DD11-83B5-001560AC7E98.root
%MSG-s CMSException: PoolSource :source{*ctor*} 04-Mar-2009 23:10:46 CET pre-events
cms::Exception caught in cmsRun
---- Configuration BEGIN
Error occured while creating source PoolSource
---- Configuration BEGIN
MissingParameter: The required parameter 'fileNames' was not specified.
---- Configuration END
---- Configuration END

SOLUTION: Problem identified and fixed (positional parameters were wrong)
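
The traceback shows a file name ending up in MaxEvents, which is what shifted positional parameters produce. A more defensive read of the variable would make that easier to spot. The sketch below is not the fix that was applied; the function name read_max_events and the wording of the message are made up for illustration.

```python
import os


def read_max_events(environ=os.environ):
    """Read MaxEvents from the environment, with a clearer error on bad input."""
    raw = environ.get("MaxEvents", "0")
    try:
        return int(raw)
    except ValueError:
        raise ValueError(
            "MaxEvents=%r is not an integer; "
            "check the order of the positional parameters" % raw
        )


if __name__ == "__main__":
    print(read_max_events({"MaxEvents": "100"}))
```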

6. Stop Condor from sending the output files back to the server.

This needs some thought! The problem is that anybody using crabserver at a reasonable scale ends up having too many files to get back, each and every one of which is a separate gridftp connection to the crabserver host. This is a royal pain in the neck for the user. One way to fix it is to gzip the tgz's at the server into one, and grab that one larger gzipped archive from the client. We should have a discussion about the pros and cons of this!
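
As input for that discussion, bundling on the server side could be as small as the sketch below. The function name bundle_outputs and the file names are hypothetical, and nothing here is existing crabserver code.

```python
import os
import tarfile


def bundle_outputs(paths, bundle_path):
    """Pack the given files into one gzip-compressed tar archive."""
    with tarfile.open(bundle_path, "w:gz") as bundle:
        for path in paths:
            # Store each file under its base name so the bundle unpacks flat.
            bundle.add(path, arcname=os.path.basename(path))
    return bundle_path
```

The client would then fetch the single bundle in one transfer and unpack it locally. Since the members are already compressed, opening the bundle with mode "w" (plain tar) would give nearly the same size with less CPU on the server.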

7. gfactory that can work with multiple proxies

Igor's problem. We are ready to deploy a gfactory that works with multiple proxies any time we get one. An ideal configuration would be to have Stefano's jobRobot jobs all run with his proxy only, while the rest of the user jobs can use the "service proxies".

8. CRAB status does not communicate the associated CE names back.

When you do a
crab -status
via the client, you do not get the CE name where your job is running. This is not a big deal. Just listed here for completeness.

9. All sorts of dashboard related issues

Sanjay's problems

10. Crab client looks for condor daemons/commands by default

Need to disable this check in order to use the client at lxplus. We do not have condor installed at CERN.

Eric's problem
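
One possible shape for an optional check is sketched below. This is not the CRAB client code: the function name, the skip switch, and the choice of commands to look for are assumptions made for illustration.

```python
import shutil


def check_scheduler_commands(commands, skip=False, which=shutil.which):
    """Return the commands not found on PATH; return [] when the check is skipped."""
    if skip:
        # For hosts such as lxplus, where condor is not installed.
        return []
    return [name for name in commands if which(name) is None]


if __name__ == "__main__":
    # condor_q and condor_submit are standard Condor commands; which commands
    # the client actually looks for is an assumption here.
    print(check_scheduler_commands(["condor_q", "condor_submit"]))
```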

-- SanjayPadhi - 2009/03/04

Topic revision: r8 - 2009/03/08 - 22:04:51 - SanjayPadhi
 