Current CMS Proposal
Approach
The goal is a single Dashboard reporting API that can be used for all monitoring purposes in CMS. We plan to use the MonALISA server both as a transport agent and to provide the details of a given job report to the Dashboard.
Initial Contact
Upon submission, a unique CMSJobID will be created, and the submission-related information (Status = Submitted / FailedToSubmit / DataSetNotAvailable) should be sent to the MonALISA server/Dashboard. The CMSJobID is also expected to be propagated along with the job to the worker node (WN). This should give us a well-defined ID for all monitoring purposes.
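The initial contact can be sketched as follows. This is a minimal illustration, not the actual CMS implementation: the ID format, the function names `new_cms_job_id` and `submission_report`, and the returned triple are assumptions. Only the three status values and the mapping ClusterName = TaskID, Node = CMSJobID come from this proposal.

```python
import time
import uuid

# Status values named in this proposal for the submission step.
SUBMISSION_STATUSES = {"Submitted", "FailedToSubmit", "DataSetNotAvailable"}


def new_cms_job_id(user, now=None):
    """Create a unique job ID at the UI (the format is an assumption)."""
    stamp = time.strftime("%Y%m%d_%H%M%S", time.gmtime(now))
    # A random suffix keeps IDs unique even within the same second.
    return f"{user}_{stamp}_{uuid.uuid4().hex[:12]}"


def submission_report(cms_job_id, task_id, status):
    """Return the (cluster, node, params) triple to send at submission."""
    if status not in SUBMISSION_STATUSES:
        raise ValueError(f"unexpected submission status: {status!r}")
    # ClusterName = TaskID and Node = CMSJobID, as in the data structure.
    return task_id, cms_job_id, {"Status": status}
```

The same ID would then be passed to the job environment on the WN, so that every later report uses the identifier created here.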
Runtime parameters at the Worker Node
At startup, the parent job or the job wrapper will create a "forked" process. The forked process will invoke the Monitoring API and contact the server with job, system and application related parameters (see below). This child process is expected to live for the lifetime of the parent and to periodically update some of the system related parameters. Once the parent process has finished, the forked process will collect the status, application and job related summary information and send it to the server. The forked process will have a hard timeout of a few seconds for aggregating this summary information, after which it terminates itself. After this stage we expect all monitoring related parameters to be at the MonALISA server/Dashboard already, unless the job was terminated by the user at the User Interface.
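The life cycle of the forked process might look like the sketch below. It is an illustration under stated assumptions: `monitor`, `collect_with_timeout` and `start_forked_monitor` are hypothetical names, `send` stands in for whatever call the Monitoring API provides, and the "timed out" marker is invented for the example. `os.fork` is available on Unix only.

```python
import os
import threading
import time


def collect_with_timeout(collect, timeout):
    """Run collect() but give up after `timeout` seconds (hard timeout)."""
    result = {}
    worker = threading.Thread(target=lambda: result.update(collect()),
                              daemon=True)
    worker.start()
    worker.join(timeout)
    if worker.is_alive():
        # Assumption: report a marker instead of blocking any longer.
        return {"Job-Summary Info": "timed out"}
    return dict(result)


def monitor(parent_alive, collect_system, collect_summary, send,
            interval=60.0, summary_timeout=5.0):
    """Startup report, periodic updates, then one final summary."""
    send(collect_system())
    while True:
        time.sleep(interval)
        if not parent_alive():
            break
        send(collect_system())
    send(collect_with_timeout(collect_summary, summary_timeout))


def start_forked_monitor(collect_system, collect_summary, send, **kwargs):
    """Fork a child that monitors the calling (parent) process."""
    parent = os.getpid()
    if os.fork() == 0:
        # In the child: the parent is gone once we have been re-parented.
        monitor(lambda: os.getppid() == parent,
                collect_system, collect_summary, send, **kwargs)
        os._exit(0)
```

Passing `parent_alive` as a callable keeps the loop independent of how the parent's state is detected, which also makes it easy to exercise without forking.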
Job Termination at the User Interface
If the job is terminated at any stage, the CMSJobID, along with the status ("Terminated by the User"), will be sent to the server.
Data Structure for a job
The following data structure will be used for job identification, system related performance and monitoring of the application layer.
Node = CMSJobID (Unique job ID created at the UI)
ClusterName = TaskID
Job related key-value pairs
- Activity = Analysis/Production/T1Reprocessing
- Computing Element = lxb7636.cern.ch
- DataSet = /HerwigQCDPt1400/Summer08_IDEAL_V9_v1/GEN-SIM-RECO
- Grid = EGEE/OSG/ARG
- GridJobID =
- JobExitCode =
- Job-Summary Info =
- SiteName =
- LFN = /store/user/spadhi/outDIRNAME
- Status =
- StorageElement = srm-3.t2.ucsd.edu
- SubmissionServer = lb008.cnaf.infn.it/crabserver1.cern.ch/CondorG/GlideinWMS/GliteWMS
- SubmissionTool = Crab/ProdAgent/Crabserver
- TaskID =
- User = SanjayPadhi/spadhi [Derived from ]
- VOMSRole = USER/Production/TopGroup/HiggsGroup
System related key-value pairs
- CPU =
- Disk Space =
- MACID =
- Memory =
- NetworkIN =
- NetworkOUT =
- Load = CPUTime/WallTime
- WorkerNode =
Application related key-value pairs
- Number of Events =
- CMSSW ExitCodes =
- Application-Summary Info =
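One way to hold the structure above in code is sketched here. The class name `JobReport`, its `flatten` method and the sample IDs and numbers are assumptions made for illustration. The key names, the dataset and storage element values, and the definition Load = CPUTime/WallTime are taken from the lists above.

```python
from dataclasses import dataclass, field


@dataclass
class JobReport:
    cms_job_id: str                                   # Node
    task_id: str                                      # ClusterName
    job: dict = field(default_factory=dict)           # job related pairs
    system: dict = field(default_factory=dict)        # system related pairs
    application: dict = field(default_factory=dict)   # application pairs

    def flatten(self):
        """Merge the three groups into one flat key-value mapping."""
        merged = {}
        for group in (self.job, self.system, self.application):
            clash = merged.keys() & group.keys()
            if clash:
                raise ValueError(f"duplicate keys: {sorted(clash)}")
            merged.update(group)
        return self.task_id, self.cms_job_id, merged


# Illustrative values only; the IDs and numbers are made up.
report = JobReport(
    cms_job_id="job-0001",
    task_id="task-0001",
    job={
        "Activity": "Analysis",
        "DataSet": "/HerwigQCDPt1400/Summer08_IDEAL_V9_v1/GEN-SIM-RECO",
        "StorageElement": "srm-3.t2.ucsd.edu",
    },
    system={"Load": 90.0 / 120.0},   # CPUTime / WallTime
    application={"Number of Events": 1000},
)
```

Keeping the three groups separate until the report is sent makes it harder for a system key to silently overwrite a job or application key.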
We should be able to use any of the above-mentioned key-value pairs to produce statistical distributions or plots within the Dashboard.
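As a rough illustration of such a distribution, the sketch below counts jobs per value of one key. The function name `distribution` and the sample reports are hypothetical; the key names and values are drawn from the lists above.

```python
from collections import Counter


def distribution(reports, key):
    """Count how many job reports carry each value of `key`."""
    # Reports that never set the key are skipped rather than counted.
    return Counter(r[key] for r in reports if key in r)


# Made-up sample reports using keys from the data structure.
reports = [
    {"Grid": "OSG", "Activity": "Analysis"},
    {"Grid": "EGEE", "Activity": "Production"},
    {"Grid": "OSG", "Activity": "Analysis"},
    {"Activity": "Analysis"},
]

by_grid = distribution(reports, "Grid")
by_activity = distribution(reports, "Activity")
```

The resulting counts can be fed directly into a histogram or pie chart per key.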
-- SanjayPadhi - 2008/10/20
Topic revision: r5 - 2008/10/23 - 18:40:00 - SanjayPadhi