CMS glidein requirements for the CREAM CE Services
Description
The Computing Resource Execution And Management (CREAM) service handles job management operations at the Computing Element (CE) level. For glideinWMS, we plan to submit the pilot jobs (glideins) via Condor-G. The advantages of this approach include inheriting Condor-G's already proven scalability, proxy delegation, and checks for the availability of remote CE services, while extending its use to the new states in the CE. The procedure, from definition and testing through to final operations, is outlined below.
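As a rough illustration of what a Condor-G submission to a CREAM CE could look like, the sketch below builds a submit description as text. It assumes the `grid_resource = cream <endpoint> <batch-system> <queue>` form documented for later Condor releases; the port 8443, the `ce-cream/services/CREAM2` path, the `pbs` batch system, the `cms` queue name, and the helper name `cream_submit_description` are all assumptions for illustration, not values taken from the tests described here.

```python
def cream_submit_description(ce_host, batch_system, queue, executable):
    """Return the text of a Condor-G submit description for a CREAM CE.

    The endpoint layout (port 8443, ce-cream/services/CREAM2) is an
    assumption and may differ between CREAM installations.
    """
    endpoint = f"https://{ce_host}:8443/ce-cream/services/CREAM2"
    lines = [
        "universe = grid",
        f"grid_resource = cream {endpoint} {batch_system} {queue}",
        f"executable = {executable}",
        "output = glidein.$(Cluster).$(Process).out",
        "error = glidein.$(Cluster).$(Process).err",
        "log = glidein.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical values: batch system and queue name are placeholders.
print(cream_submit_description("cream-12.pd.infn.it", "pbs", "cms", "test.sh"))
```

The resulting text could be written to a file and passed to `condor_submit`; the actual glidein factory generates its submit files differently.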
People involved:
- Jaime Frey (Condor/WISC)
- Sanjay Padhi (UCSD)
- Frank Wuerthwein (UCSD)
- Igor Sfiligoi (FNAL)
- Burt Holzman (FNAL)
Overall Planning
Phase 1 (01 Nov - 30 Nov 2008)
The initial goal of this phase is to test direct job submission to the CREAM CE via Condor-G. This includes testing the following operations:
- CREAM Job Submission - DONE
- Job Start (CREAM_JOB_START) - DONE
- Job Cancel - DONE
- CREAM Job Register (CREAM_JOB_REGISTER) - DONE
- Job Suspend - NOT WORKING (for externally managed state: creamState CANCELLED)
- Job Resume - NOT WORKING (for externally managed state: creamState CANCELLED)
- Job Purge (CREAM_JOB_PURGE) - WORKING (fails approximately 4-7% of the time)
- Job List - DONE
- Job Info states - currently available only as part of the Gridmanager logs (to be done in Phase 2)
- (ABORTED, CANCELLED, DONE-FAILED, DONE-OK, HELD, IDLE, PENDING, REALLY-RUNNING, REGISTERED, RUNNING, UNKNOWN)
- Job Status (CREAM_JOB_STATUS) - DONE
- Auto Proxy Delegation method (ASYNC_MODE, CACHE_PROXY) - DONE
- Delegated Proxy Renewal after expiry - DONE
- Service Status (CREAM_PING) - DONE
- ServiceInfo for enabled, disabled and status of the job submission service - DONE
- Synchronisation of the CREAMID with GAHP - DONE
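The job info states listed above can be grouped by whether a job is still in flight. The sketch below is one possible grouping: the state names come from the list, but which states count as terminal is an assumption made here for illustration, and the helper name `is_terminal` is hypothetical.

```python
# State names as they appear in the Phase 1 list.
CREAM_STATES = {
    "ABORTED", "CANCELLED", "DONE-FAILED", "DONE-OK", "HELD", "IDLE",
    "PENDING", "REALLY-RUNNING", "REGISTERED", "RUNNING", "UNKNOWN",
}

# Assumption: these states mean the job will make no further progress.
TERMINAL_STATES = {"ABORTED", "CANCELLED", "DONE-FAILED", "DONE-OK"}


def is_terminal(state):
    """Return True if the given CREAM state is treated as final."""
    if state not in CREAM_STATES:
        raise ValueError(f"unknown CREAM state: {state!r}")
    return state in TERMINAL_STATES


for state in sorted(CREAM_STATES):
    print(f"{state:15s} terminal={is_terminal(state)}")
```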
An example of the output environment (ENV) at the worker node (WN) when using the CREAM CE can be found here. A failure rate of about 4-7% was observed over 10K short jobs using the "modified" Condor-G.
Disclaimer: The current work will be part of the next Condor software version (>= 7.3.0). Not all aspects of submission to the CREAM CE using Condor-G are covered or studied here, although this phase successfully evaluates the basic submission procedure and mechanism for the next-generation CE.
Test setup at UCSD (glidein-c.t2.ucsd.edu)
debug: starting to put gsiftp://cream-12.pd.infn.it:2811/tmp/test.sh
debug: connecting to gsiftp://cream-12.pd.infn.it:2811/tmp/test.sh
debug: response from gsiftp://cream-12.pd.infn.it:2811/tmp/test.sh:
220 cream-12.pd.infn.it GridFTP Server 2.3 (gcc32dbg, 1144436882-63) ready.
debug: authenticating with gsiftp://cream-12.pd.infn.it:2811/tmp/test.sh
debug: response from gsiftp://cream-12.pd.infn.it:2811/tmp/test.sh:
230 User cms032 logged in ....
Phase 2 (01 Feb - 28 Feb 2009)
Phase 2 also aims to verify that the functionality tested in Phase 1 still works after any change in the CE software version.
- Test the basic job submission procedure
- Job leases
- Test submission of multiple jobs
- Test failure recovery
- Input/Output Sandbox
- Prototype - Integration with the glideinWMS
- Small scale user jobs with glideinWMS + CRAB/Crabserver
- Results during this phase
- Provide first-hand experience of using the CE through Condor-G with glideins for the CHEP09 workshop
Phase 3 (15 April - Summer 2009)
Depending on how EGEE moves the software from Pre-Production sites to Production/Certification, and on the status of the ICE-based WMS, this phase can have a wide range of goals.
- Test the production functionalities of the CE and the information system, BDII
- Install a prototype CREAM CE at UCSD and study its scalability on the "ghost" cluster, up to 10K nodes.
- Similar to the study done at UCSD
- Study the Condor-G and glideinWMS interface up to the production level
- Add production modules to the glidein interface so that it can submit to both Globus-based and non-Globus-based CEs
- Production level tests with glideinWMS and Crabserver
- Provide a "frozen" release version for glideins involving both kinds of CEs.
Technical documentation
CREAM Documentation
CREAM CEs
- CNAF: cert-ce-03.cnaf.infn.it + 4 virtual WNs using pbs, 7 queues (alice, atlas, cms, lhcb, ops, dteam) pps
- FZK: pps-cream-fzk.gridka.d
- FZK: pps-rb-fzk.gridka.de
- SCAI: glite-wms2.scai.fraunhofer.de
LSF Cream INFN-PADOVA
- cream-10 SL4, batch master LSF
- cream-21 SL4, CE with LSF
- cream-22 SL4, CE with LSF
- cream-23 SL4, CE with LSF
- cream-24 SL4, CE with LSF
- cream-25 SL4, CE with LSF
- cream-26 SL4, CE with LSF
- cream-27 SL4, CE with LSF - site BDII
PBS Cream INFN-PADOVA
- cream-28 SL4, CE with pbs - batch master
- cream-29 SL4, CE with pbs
- cream-30 SL4, CE with pbs
- cream-31 SL4, CE with pbs
- cream-32 SL4, CE with pbs
- cream-33 SL4, CE with pbs
- cream-34 SL4, CE with pbs - site BDII
CREAM UI & WMS (if needed)
- cert-ui-01.cnaf.infn.it (UI) cert-rb-01.cnaf.infn.it (WMS + ICE)
Site BDII
- ldap://cream-27.pd.infn.it:2170/mds-vo-name=INFN-PADOVA-CREAMTEST,o=grid
- ldap://cream-34.pd.infn.it:2170/mds-vo-name=INFN-PADOVA-CREAMTEST-PBS,o=grid
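To query one of the site BDII endpoints above, the LDAP URL has to be split into a server address and a base DN. The sketch below does this with the standard library and assembles an OpenLDAP `ldapsearch` command line; the helper name `ldapsearch_command` is hypothetical, and the choice of an anonymous simple bind (`-x`) is an assumption about how the BDII is accessed.

```python
from urllib.parse import urlparse


def ldapsearch_command(bdii_url):
    """Build an ldapsearch argument list from a BDII LDAP URL."""
    parts = urlparse(bdii_url)
    # Everything after the first "/" of the path is the search base DN.
    base_dn = parts.path.lstrip("/")
    server = f"{parts.scheme}://{parts.hostname}:{parts.port}"
    # -x: simple (anonymous) bind, -LLL: plain LDIF output without comments
    return ["ldapsearch", "-x", "-LLL", "-H", server, "-b", base_dn]


url = "ldap://cream-27.pd.infn.it:2170/mds-vo-name=INFN-PADOVA-CREAMTEST,o=grid"
print(" ".join(ldapsearch_command(url)))
```

The returned list can be passed to `subprocess.run` on a host that has the OpenLDAP client tools installed and network access to the BDII.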
--
SanjayPadhi - 2009/03/21