Current CMS Proposal

Approach

The goal is a single Dashboard reporting API that can be used for all monitoring purposes in CMS. We plan to use the Monalisa Server both as a transport agent and to provide the details of a given job report to the dashboard.

Initial Contact

Upon submission, a unique CMSJobID will be created, and the submission-related information (Status = Submitted / FailedToSubmit / DataSetNotAvailable) should be sent to the Monalisa Server/Dashboard. The CMSJobID is also expected to be propagated along with the job to the worker node (WN). This should give us a well-defined ID for all monitoring purposes.
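The initial contact could be sketched as follows. This is only an illustration under stated assumptions: the ID scheme (task ID, job index, random suffix), the function names, and the `CMS_JOB_ID` environment variable are hypothetical and are not defined by this proposal. Only the three status values and the Node/ClusterName naming come from the text.

```python
import uuid

# Submission statuses named in the proposal.
SUBMISSION_STATUSES = {"Submitted", "FailedToSubmit", "DataSetNotAvailable"}


def new_cms_job_id(task_id, job_index):
    """Build a unique job ID (hypothetical scheme: task, index, random suffix)."""
    return f"{task_id}_{job_index}_{uuid.uuid4().hex[:8]}"


def submission_report(task_id, job_index, status):
    """Return the key-value pairs to send to the server at submission time."""
    if status not in SUBMISSION_STATUSES:
        raise ValueError(f"unknown submission status: {status}")
    return {
        "ClusterName": task_id,                      # ClusterName = TaskID
        "Node": new_cms_job_id(task_id, job_index),  # Node = CMSJobID
        "Status": status,
    }


def wrapper_environment(report):
    """Environment entries that would carry the ID to the worker node.

    The variable name CMS_JOB_ID is an assumption made for this sketch.
    """
    return {"CMS_JOB_ID": report["Node"]}
```

A submission tool would build the report once, send it, and pass the same ID along with the job so that every later message refers to the same Node.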

Runtime parameters at the Worker Node

At startup, a "forked" process will be created by the parent job or the job wrapper. The forked process will invoke the Monitoring API and contact the server with job-, system- and application-related parameters (see below). This child process is expected to live throughout the lifetime of the parent and to periodically update some of the system-related parameters. Once the parent process has finished, the forked process will collect the status, application and job-related summary information and send it to the server. The forked process will also have a hard timeout of a few seconds for aggregating this summary information before it terminates itself. After this stage we expect all monitoring-related parameters to be available at the Monalisa Server/Dashboard, unless the job is terminated by the user at the User Interface.
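The life cycle of the monitoring process described above can be sketched as a single loop. This is a minimal sketch, not the proposed Monitoring API: `run_monitor`, its callback arguments, and the `SummaryTimedOut` status are hypothetical names introduced for illustration, and a worker thread stands in for the summary aggregation step so that the hard timeout can be shown.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_monitor(job_alive, startup_params, sample_system, collect_summary,
                send, interval=30.0, summary_timeout=5.0):
    """Report at startup, update periodically, then send a bounded summary."""
    # 1. Initial contact with job, system and application parameters.
    send(startup_params)

    # 2. Periodic updates for as long as the parent job is running.
    while job_alive():
        send(sample_system())
        time.sleep(interval)

    # 3. Collect the summary, giving up after a hard timeout.
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(collect_summary)
    try:
        summary = future.result(timeout=summary_timeout)
    except FutureTimeout:
        # Hypothetical fallback status; the proposal does not name one.
        summary = {"Status": "SummaryTimedOut"}
    finally:
        # Do not block here on a stuck collector. A real forked process
        # would simply exit at this point.
        pool.shutdown(wait=False)

    send(summary)
    return summary
```

Passing the liveness check, the sampler and the sender in as callables keeps the loop independent of how the parent is tracked and of the transport used to reach the server.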

Job Termination at the User Interface

If a job is terminated at any stage, the CMSJobID along with the status ("Terminated by the User") will be sent to the server.

Data Structure for a job

The following data structure will be used for job identification, system-related performance and monitoring of the application layer.

[Figure: struct.png]

Node = CMSJobID (Unique job ID created at the UI)

ClusterName = TaskID

Job related key-value pairs

  • Activity = Analysis/Production/T1Reprocessing
  • Computing Element = lxb7636.cern.ch
  • DataSet = /HerwigQCDPt1400/Summer08_IDEAL_V9_v1/GEN-SIM-RECO
  • Grid = EGEE/OSG/ARG
  • GridJobID =
  • JobExitCode =
  • Job-Summary Info =
  • SiteName =
  • LFN = /store/user/spadhi/outDIRNAME
  • Status =
  • StorageElement = srm-3.t2.ucsd.edu
  • SubmissionServer = lb008.cnaf.infn.it/crabserver1.cern.ch/CondorG/GlideinWMS/GliteWMS
  • SubmissionTool = Crab/ProdAgent/Crabserver
  • TaskID =
  • User = SanjayPadhi/spadhi [Derived from ]
  • VOMSRole = USER/Production/TopGroup/HiggsGroup

System related key-value pairs

  • CPU =
  • Disk Space =
  • MACID =
  • Memory =
  • NetworkIN =
  • NetworkOUT =
  • Load = CPUTime/WallTime
  • WorkerNode =

Application related key-value pairs

  • Number of Events =
  • CMSSW ExitCodes =
  • Application-Summary Info =
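One possible in-memory representation of this structure is sketched below. The `JobReport` class and its `flatten` method are hypothetical names chosen for this illustration; only the Node/ClusterName identification and the three groups of key-value pairs come from the lists above. The numeric values and the IDs in the usage lines are made up.

```python
from dataclasses import dataclass, field


@dataclass
class JobReport:
    """Key-value pairs for one job, grouped as in the lists above."""
    node: str          # CMSJobID, created at the UI
    cluster_name: str  # TaskID
    job: dict = field(default_factory=dict)
    system: dict = field(default_factory=dict)
    application: dict = field(default_factory=dict)

    def flatten(self):
        """Merge the three groups into one parameter set for sending."""
        params = {}
        for group in (self.job, self.system, self.application):
            params.update(group)
        return self.cluster_name, self.node, params


# Usage with illustrative values.
report = JobReport(
    node="job-0001",
    cluster_name="task-01",
    job={"Activity": "Analysis", "Grid": "OSG",
         "StorageElement": "srm-3.t2.ucsd.edu"},
    system={"Load": 0.85},
    application={"Number of Events": 1000},
)
cluster, node, params = report.flatten()
```

Keeping the three groups separate until the report is sent makes it straightforward to update only the system-related pairs during periodic reporting.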

We should be able to use any of the above-mentioned key-value pairs for statistical distributions or plots within the dashboard.
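As an illustration of such a distribution, the sketch below counts jobs per value of a chosen key. The `distribution` helper and the sample reports are hypothetical; the dashboard's actual aggregation mechanism is not specified here.

```python
from collections import Counter


def distribution(reports, key):
    """Count how many job reports carry each value of the given key."""
    return Counter(r[key] for r in reports if key in r)


# Illustrative reports; a report without the key is skipped.
reports = [
    {"Grid": "EGEE", "Activity": "Analysis"},
    {"Grid": "OSG", "Activity": "Production"},
    {"Grid": "EGEE", "Activity": "Analysis"},
    {"Activity": "Analysis"},
]
by_grid = distribution(reports, "Grid")
by_activity = distribution(reports, "Activity")
```

The resulting counts can be fed directly into a histogram or pie chart for any of the keys listed above.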

-- SanjayPadhi - 2008/10/20

Topic revision: r5 - 2008/10/23 - 18:40:00 - SanjayPadhi
 