CMS glidein requirements for the CREAM CE Services

Description

The Computing Resource Execution And Management (CREAM) service handles job management operations at the Computing Element (CE) level. For glideinWMS, we plan to submit the pilot jobs (glideins) via Condor-G. Doing so inherits what Condor-G already provides (proven scalability, proxy delegation, checks for the availability of remote CE services, etc.) while extending its use to cover the new states in the CE. The procedure, from definition and testing through to final operations, is outlined below.
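
As a rough illustration of the Condor-G route described above, the sketch below renders a submit description for a grid-universe job aimed at a CREAM CE. The `grid_resource = cream <service URL> <batch system> <queue>` form follows the HTCondor manual for releases that include CREAM support. The function name, the port, the service path, the batch system and the queue name are assumptions for illustration and are not taken from this page.

```python
def cream_submit_description(ce_host, batch_system, queue, executable,
                             port=8443,
                             service_path="/ce-cream/services/CREAM2"):
    """Return the text of a Condor-G submit description for a CREAM CE.

    port and service_path are assumed defaults, not values from this page.
    """
    url = f"https://{ce_host}:{port}{service_path}"
    lines = [
        "universe      = grid",
        f"grid_resource = cream {url} {batch_system} {queue}",
        f"executable    = {executable}",
        "output        = job.$(Cluster).$(Process).out",
        "error         = job.$(Cluster).$(Process).err",
        "log           = job.$(Cluster).log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # The host and script name appear in the test setup log on this page;
    # "pbs" and "cms" are hypothetical batch system and queue values.
    print(cream_submit_description("cream-12.pd.infn.it", "pbs", "cms", "test.sh"))
```

The resulting text would be written to a file and handed to `condor_submit`; the real endpoint and queue depend on the CE being tested.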

People involved:

  • Jaime Frey (Condor/WISC)
  • Sanjay Padhi (UCSD)
  • Frank Wuerthwein (UCSD)
  • Igor Sfiligoi (FNAL)
  • Burt Holzman (FNAL)

Overall Planning

Phase 1 (01 Nov - 30 Nov 2008)

The initial goal for this phase is to test direct job submission to the CREAM CE via Condor-G. This includes testing the following operations:

  • CREAM Job Submission - DONE
  • Job Start (CREAM_JOB_START) - DONE
  • Job Cancel - DONE
  • CREAM Job Register (CREAM_JOB_REGISTER) - DONE
  • Job Suspend - NOT WORKING (for externally managed state: creamState CANCELLED)
  • Job Resume - NOT WORKING (for externally managed state: creamState CANCELLED)
  • Job Purge (CREAM_JOB_PURGE) - WORKING (breaks approx. 4-7% of the time)
  • Job List - DONE
  • Job Info states - Currently available as part of the Gridmanager logs (to be done in Phase 2)
    • (ABORTED, CANCELLED, DONE-FAILED, DONE-OK, HELD, IDLE, PENDING, REALLY-RUNNING, REGISTERED, RUNNING, UNKNOWN)
  • Job Status (CREAM_JOB_STATUS) - DONE
  • Auto Proxy Delegation method (ASYNC_MODE, CACHE_PROXY) - DONE
  • Delegated Proxy Renewal after the expiry - DONE
  • Service Status (CREAM_PING) - DONE
  • ServiceInfo for enabled, disabled and status of the job submission service - DONE
  • Synchronisation of the CREAMID with GAHP - DONE
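
The job states listed above can be tallied from whatever source reports them (for example, states extracted from the Gridmanager logs). The sketch below groups a list of observed states into finished and still-active jobs. Which states count as final is an assumption based on their names, and the function name and sample input are made up for illustration.

```python
from collections import Counter

# Job states as listed on this page.
CREAM_STATES = {
    "ABORTED", "CANCELLED", "DONE-FAILED", "DONE-OK", "HELD", "IDLE",
    "PENDING", "REALLY-RUNNING", "REGISTERED", "RUNNING", "UNKNOWN",
}

# Assumption: these four states are final; everything else is still active.
TERMINAL_STATES = {"ABORTED", "CANCELLED", "DONE-FAILED", "DONE-OK"}


def summarize_states(states):
    """Count observed job states and split them into finished and active."""
    states = list(states)
    unexpected = set(states) - CREAM_STATES
    if unexpected:
        raise ValueError(f"unexpected job states: {sorted(unexpected)}")
    counts = Counter(states)
    finished = sum(n for state, n in counts.items() if state in TERMINAL_STATES)
    return {
        "counts": dict(counts),
        "finished": finished,
        "active": len(states) - finished,
    }


if __name__ == "__main__":
    sample = ["DONE-OK", "RUNNING", "DONE-FAILED", "IDLE"]  # illustrative only
    print(summarize_states(sample))
```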

An example of the output environment (ENV) at the worker node (WN) using the CREAM CE is available as the attachment cream.26.633.out. A failure rate of about 4-7% was observed over 10K short jobs using the "modified" Condor-G.
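
To put the quoted failure rate in absolute terms, the arithmetic can be sketched as follows. The function names are hypothetical, and "10K" is taken to mean 10,000 jobs.

```python
def failure_rate_percent(failed, total):
    """Return the failure rate as a percentage of submitted jobs."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * failed / total


def failed_job_range(total, low_percent, high_percent):
    """Return the (low, high) number of failed jobs for a percentage range."""
    low = round(total * low_percent / 100)
    high = round(total * high_percent / 100)
    return low, high


if __name__ == "__main__":
    # 4-7% of 10,000 short jobs corresponds to roughly 400 to 700 failures.
    print(failed_job_range(10_000, 4, 7))
```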

Disclaimer: The current activity will be part of the next Condor software version (>= 7.3.0). Not all aspects of submission to the CREAM CE using Condor-G are covered or studied here, although this phase successfully evaluates the basic submission mechanism for the next-generation CE.

Test setup at UCSD (glidein-c.t2.ucsd.edu)

debug: starting to put gsiftp://cream-12.pd.infn.it:2811/tmp/test.sh
debug: connecting to gsiftp://cream-12.pd.infn.it:2811/tmp/test.sh
debug: response from gsiftp://cream-12.pd.infn.it:2811/tmp/test.sh:
220 cream-12.pd.infn.it GridFTP Server 2.3 (gcc32dbg, 1144436882-63) ready.

debug: authenticating with gsiftp://cream-12.pd.infn.it:2811/tmp/test.sh
debug: response from gsiftp://cream-12.pd.infn.it:2811/tmp/test.sh:
230 User cms032 logged in .... 
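
The numeric replies in the log above are standard FTP reply codes (220: service ready, 230: user logged in). A minimal sketch for pulling them out of such a debug log is shown below; the function names are hypothetical, and the parser assumes each reply starts on its own line with a three-digit code.

```python
import re

# A reply line starts with a three-digit code followed by a space or a hyphen.
REPLY_LINE = re.compile(r"^(\d{3})[ -](.*)$")


def ftp_replies(log_text):
    """Return (code, message) pairs for every FTP reply line in the log."""
    replies = []
    for line in log_text.splitlines():
        match = REPLY_LINE.match(line.strip())
        if match:
            replies.append((int(match.group(1)), match.group(2).strip()))
    return replies


def login_succeeded(log_text):
    """True if the server reported a successful login (reply code 230)."""
    return any(code == 230 for code, _ in ftp_replies(log_text))
```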

Phase 2 (01 Feb - 28 Feb 2009)

The objective of Phase 2 is also to re-verify the Phase 1 functionality against any change in the CE software version.

  • Test the basic job Submission procedure
  • Job leases
  • Test submission of multiple jobs
  • Test failure recovery
  • Input/Output Sandbox
  • Prototype - Integration with the glideinWMS
  • Small scale user jobs with glideinWMS + CRAB/Crabserver
  • Results during this phase
  • Provide first-hand experience of the CE using Condor-G with glideins for the CHEP09 Workshop

Phase 3 (15 April - Summer 2009)

Based on how EGEE moves from Pre-Production sites to Production/Certification of the software and the status of ICE-based WMS, this phase can have a wide range of goals.

  • Test the production functionalities of the CE and the information system, BDII
  • Install a prototype CREAM CE at UCSD and study its scalability on the "ghost" cluster, up to 10K nodes.
    • Similar to the study done at UCSD
  • Study the Condor-G and glideinWMS interface up to the production level
  • Add production modules to the glidein interface in order to be able to submit to both globus and non-globus based CEs
  • Production level tests with glideinWMS and Crabserver
  • Provide a "frozen" release version for glideins involving both kinds of CEs.
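
One way to picture the goal of submitting to both globus and non-globus based CEs is a small dispatch on the CE type that picks the matching Condor-G `grid_resource` line. This is only a sketch and not the glideinWMS implementation: the function name, the CE type labels, the CREAM port and service path, and any hosts or queues passed in are assumptions for illustration.

```python
def grid_resource_line(ce_type, host, batch_system, queue=None):
    """Return a Condor-G grid_resource line for the given kind of CE.

    ce_type is a hypothetical label: "globus" (GT2 gatekeeper) or "cream".
    """
    if ce_type == "globus":
        # Pre-web-services Globus gatekeeper with a per-batch-system jobmanager.
        return f"grid_resource = gt2 {host}/jobmanager-{batch_system}"
    if ce_type == "cream":
        if queue is None:
            raise ValueError("a CREAM CE needs a queue name")
        # Port and service path are assumed defaults, not values from this page.
        url = f"https://{host}:8443/ce-cream/services/CREAM2"
        return f"grid_resource = cream {url} {batch_system} {queue}"
    raise ValueError(f"unsupported CE type: {ce_type!r}")


if __name__ == "__main__":
    print(grid_resource_line("globus", "ce.example.org", "pbs"))
    print(grid_resource_line("cream", "cream-28.pd.infn.it", "pbs", "cms"))
```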

Technical documentation

CREAM Documentation

CREAM CEs

  • Cream CEs:
    • CNAF: cert-ce-03.cnaf.infn.it + 4 virtual WNs using pbs, 7 queues (alice,atlas,cms,lhcb,ops,dteam) pps
    • FZK: pps-cream-fzk.gridka.d
  • ICE WMS:
    • FZK: pps-rb-fzk.gridka.de
    • SCAI: glite-wms2.scai.fraunhofer.de

LSF Cream INFN-PADOVA

  • cream-10 SL4, batch master LSF
  • cream-21 SL4, CE with LSF
  • cream-22 SL4, CE with LSF
  • cream-23 SL4, CE with LSF
  • cream-24 SL4, CE with LSF
  • cream-25 SL4, CE with LSF
  • cream-26 SL4, CE with LSF
  • cream-27 SL4, CE with LSF - site BDII

PBS Cream INFN-PADOVA

  • cream-28 SL4, CE with pbs - batch master
  • cream-29 SL4, CE with pbs
  • cream-30 SL4, CE with pbs
  • cream-31 SL4, CE with pbs
  • cream-32 SL4, CE with pbs
  • cream-33 SL4, CE with pbs
  • cream-34 SL4, CE with pbs - site BDII

CREAM UI & WMS (if needed)

  • cert-ui-01.cnaf.infn.it (UI)
  • cert-rb-01.cnaf.infn.it (WMS + ICE)

Site BDII

  • ldap://cream-27.pd.infn.it:2170/mds-vo-name=INFN-PADOVA-CREAMTEST,o=grid
  • ldap://cream-34.pd.infn.it:2170/mds-vo-name=INFN-PADOVA-CREAMTEST-PBS,o=grid
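
For querying one of the site BDII endpoints above, the LDAP URL can be split into a server URI and a search base and passed to the OpenLDAP `ldapsearch` client. The sketch below only builds the command line; the function name is hypothetical, and anonymous simple bind (`-x`) is assumed to be accepted by the BDII.

```python
import shlex
from urllib.parse import urlsplit


def bdii_query_command(ldap_url):
    """Build an ldapsearch argument list from a BDII LDAP URL."""
    parts = urlsplit(ldap_url)
    if parts.scheme != "ldap" or parts.hostname is None or parts.port is None:
        raise ValueError(f"not a complete ldap:// URL: {ldap_url!r}")
    base_dn = parts.path.lstrip("/")  # e.g. "mds-vo-name=...,o=grid"
    return [
        "ldapsearch",
        "-x",        # simple (anonymous) bind; assumed to be allowed
        "-LLL",      # plain LDIF output without comments
        "-H", f"ldap://{parts.hostname}:{parts.port}",
        "-b", base_dn,
    ]


if __name__ == "__main__":
    url = "ldap://cream-27.pd.infn.it:2170/mds-vo-name=INFN-PADOVA-CREAMTEST,o=grid"
    print(shlex.join(bdii_query_command(url)))
```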

-- SanjayPadhi - 2009/03/21

Topic attachments

  • cream.26.633.out (9.6 K, 2008/12/03 - 14:22, SanjayPadhi): Example of output ENV at the WN using CREAM CE
  • cream_cms.pdf (506.6 K, 2009/03/21 - 06:02, SanjayPadhi): cream status

Topic revision: r10 - 2009/03/21 - 06:04:09 - SanjayPadhi
 