Big Picture

Understand the performance characteristics of jobs running on the global grid infrastructure by correlating information from HTCondor with information from XRootd.

Metric info to describe the dataset

Report each metric twice: once for the total sample, and once for just the subset of jobs that can be correlated with XRootd. For some metrics this split makes no sense; it will be obvious which ones those are.

Simple numbers

  • start and end date
  • total number of jobs in HTCondor
  • total number of DAGs
  • total number of users
  • total number of sites
  • total amount of walltime consumed
  • total amount of CPU time consumed
  • total amount of data accessed
    • total amount of data as well as fraction of that data that was read
  • total number of datasets
  • total number of releases
  • total number of jobs that accessed
    • AOD or AODSIM
    • MiniAOD
    • RECO
    • other
  • same as previous but
    • walltime per data type
    • cputime per data type
    • number of users that read the data type

Averages and their variances

For each of the following calculate the average, the stddev, and the median.

  • number of jobs per DAG
  • number of jobs per user
  • number of jobs per site
  • number of users per site
  • amount of data read per job
  • fraction of data read per job
  • walltime per job
  • cputime per job
  • cpu efficiency per job
  • the previous five split up by data type
  • walltime per user
  • cputime per user
  • cpu efficiency per user
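As a minimal sketch of these summary statistics, assuming the job records have already been loaded into Python dictionaries, the standard `statistics` module is enough. The field names `site` and `walltime` are hypothetical placeholders, not classAd attribute names, and the toy records are made up for illustration.

```python
import statistics
from collections import defaultdict


def summarize(values):
    """Return (mean, stddev, median) for a non-empty sequence of numbers."""
    values = list(values)
    mean = statistics.mean(values)
    # Population standard deviation; use statistics.stdev for the sample estimate.
    stddev = statistics.pstdev(values)
    median = statistics.median(values)
    return mean, stddev, median


def jobs_per_group(jobs, key):
    """Count jobs per distinct value of `key` (e.g. a DAG, user, or site field)."""
    counts = defaultdict(int)
    for job in jobs:
        counts[job[key]] += 1
    return dict(counts)


# Hypothetical toy records, for illustration only.
jobs = [
    {"site": "T2_A", "walltime": 100.0},
    {"site": "T2_A", "walltime": 300.0},
    {"site": "T2_B", "walltime": 200.0},
]
per_site = jobs_per_group(jobs, "site")
print(summarize(per_site.values()))                 # number of jobs per site
print(summarize(job["walltime"] for job in jobs))   # walltime per job
```

The same two helpers cover most of the list: group by the relevant field to get per-DAG, per-user, or per-site counts, then pass the resulting values to `summarize`.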

Distributions

For each of the previous metrics, also make a histogram of the distribution, rather than just reporting the average, stddev, and median.
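One simple way to bin a metric is with fixed-width bins, sketched here without any plotting library. The bin count and range are choices left to the analyst; the values in the usage line are made up.

```python
def histogram(values, nbins, lo, hi):
    """Count values into `nbins` equal-width bins over [lo, hi].

    Values outside the range are dropped; a value equal to `hi`
    is counted in the last bin.
    """
    if nbins <= 0 or hi <= lo:
        raise ValueError("need nbins > 0 and hi > lo")
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for v in values:
        if v < lo or v > hi:
            continue
        index = min(int((v - lo) / width), nbins - 1)
        counts[index] += 1
    return counts


# Made-up CPU efficiencies binned into four bins over [0, 1].
print(histogram([0.0, 0.1, 0.5, 0.99, 1.0, 1.5], 4, 0.0, 1.0))
```

For a ratio such as CPU efficiency the range [0, 1] is a natural starting point, although jobs using more than one core could exceed 1 and would be dropped by this sketch unless the range is widened.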

Initial Questions to answer

  • Is the distribution of CPU/walltime different for jobs that read remotely via XRootd than for jobs that read locally from the storage of the site they run at?
  • If there is a significant difference, how does this difference compare with the difference among local reads for different sites?
  • How does it compare with the difference in local reads for different tasks?
  • How does it compare with the difference within a task?
  • How does it compare with the difference for different tasks by different people?
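These questions all come down to comparing two distributions of CPU/walltime and asking how large the difference is. One possible measure, chosen here purely for illustration and not prescribed by this page, is the two-sample Kolmogorov-Smirnov statistic: the largest gap between the two empirical cumulative distributions. A minimal sketch with made-up numbers:

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Maximum absolute difference between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The empirical CDFs only change at data points, so it is enough
    # to evaluate the difference at every observed value.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Made-up CPU efficiencies for remote and local reads.
remote = [0.1, 0.2, 0.3, 0.4]
local = [0.3, 0.4, 0.5, 0.6]
print(ks_statistic(remote, local))
```

Computing the same statistic for remote versus local, for local reads at one site versus another, and for one task versus another puts all of the comparisons on a common scale from 0 (identical) to 1 (fully separated).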

Tools you need to answer the initial questions

Analyzing the HTCondor ClassAd

Each job in HTCondor has an end-of-job classAd. We've put a file with a few such records here. Each classAd is roughly 250 lines long, though the length varies from job to job.
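A minimal sketch of reading such records into Python, assuming the file uses HTCondor's long text form: one `Attribute = value` pair per line, strings in double quotes, and a blank line between records. That layout is an assumption about the sample file and should be checked against it first.

```python
def convert(raw):
    """Turn a raw classAd value into str, int, or float where that is obvious."""
    if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
        return raw[1:-1]
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw  # expressions, booleans, etc. are left as text


def parse_classads(text):
    """Split long-form classAd text into a list of {attribute: value} dicts."""
    ads, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line is assumed to end the current record.
            if current:
                ads.append(current)
                current = {}
            continue
        name, sep, raw = line.partition("=")
        if not sep:
            continue  # not an "Attribute = value" line
        current[name.strip()] = convert(raw.strip())
    if current:
        ads.append(current)
    return ads


# Made-up two-record example.
sample = 'CRAB_Id = 70\nCRAB_ReqName = "task_a"\n\nCRAB_Id = 71\n'
print(parse_classads(sample))
```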

For a given classAd, there are a few fields of particular relevance for this purpose:

  • RemoteUserCPU
  • RemoteSysCPU
  • RemoteWallClockTime
    • CPU efficiency is defined as (RemoteUserCPU + RemoteSysCPU) / RemoteWallClockTime
  • DAGNodeName
    • this identifies the job number within a task. This is a unique number within the task.
  • CRAB_ReqName
    • this uniquely identifies the task, i.e. all jobs from the same task have this set to the same value.
  • MATCH_GLIDEIN_CMSSite
    • this uniquely identifies the site. All jobs that ran at the same site will have the same string for this parameter.
  • CRAB_UserDN
    • this uniquely identifies the user. I.e. different people will have different strings. And all jobs from the same person will have the same string.
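The CPU efficiency definition above translates directly into code. This sketch assumes each classAd is available as a Python dictionary keyed by the attribute names as spelled in this list. ClassAd attribute names are case-insensitive, so the capitalization in an actual file may differ (for example `RemoteUserCpu`) and the keys may need adjusting.

```python
def cpu_efficiency(ad):
    """(RemoteUserCPU + RemoteSysCPU) / RemoteWallClockTime for one classAd.

    Returns None when the walltime is zero or negative, since the
    ratio is undefined in that case.
    """
    wall = float(ad["RemoteWallClockTime"])
    if wall <= 0:
        return None
    cpu = float(ad["RemoteUserCPU"]) + float(ad["RemoteSysCPU"])
    return cpu / wall


# Made-up values: 80 s user + 10 s system CPU over 100 s of walltime.
ad = {"RemoteUserCPU": 80, "RemoteSysCPU": 10, "RemoteWallClockTime": 100}
print(cpu_efficiency(ad))
```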

I think this is all you need to know about classAds to answer the initial questions. Feel free to read a few classAds carefully, and see if there are other things in them that look interesting to track.

Analyzing the detailed monitoring info from XRootd

First, not all XRootd records carry information about which job they refer to. Ignore all those that don't.

Second, not all jobs read via XRootd. You therefore need to go through the XRootd info, find the jobs that are interesting, and then find those same jobs in the HTCondor classAd records.

The XRootd field to match on is the app info, which looks something like this:

70_https://glidein.cern.ch/70/150823:160611:aidan:crab:20150823:RunIISpring15DR74:WZ:25ns:v1_0

This is made up from the following pieces:

  • Task 150823_160611_aidan_crab_20150823_RunIISpring15DR74_WZ_25ns_v1
  • Job ID 70
  • Retry 0
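A minimal sketch of splitting the example string above into these three pieces with a regular expression. The function name is hypothetical. Note that turning colons back into underscores is only unambiguous if the original task name contained no colons of its own; rebuilding the string from the classAd and comparing strings avoids that problem.

```python
import re

# <job id>_https://glidein.cern.ch/<job id>/<task with ':' for '_'>_<retry>
APP_INFO = re.compile(r"^(\d+)_https://glidein\.cern\.ch/(\d+)/(.+)_(\d+)$")


def parse_app_info(app_info):
    """Return {'task', 'job_id', 'retry'} or None if the string does not match."""
    m = APP_INFO.match(app_info)
    if m is None:
        return None
    job_id, job_id_again, task, retry = m.groups()
    if job_id != job_id_again:
        return None  # the two job-id fields should agree
    return {
        "task": task.replace(":", "_"),  # assumes no ':' in the original name
        "job_id": int(job_id),
        "retry": int(retry),
    }


example = ("70_https://glidein.cern.ch/70/"
           "150823:160611:aidan:crab:20150823:RunIISpring15DR74:WZ:25ns:v1_0")
print(parse_app_info(example))
```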

The place on GitHub where this seems to be defined is: https://github.com/dmwm/CRABServer/blob/master/scripts/CMSRunAnalysis.py#L97

params['MonitorJobID'] = '%d_https://glidein.cern.ch/%d/%s_%d' % (myad['CRAB_Id'], myad['CRAB_Id'], myad['CRAB_ReqName'].replace("_", ":"), myad['CRAB_Retry'])
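Applying the same format string to each classAd gives a key that can be compared directly with the XRootd app info. The sketch below assumes classAds and XRootd records are already available as Python dictionaries; the `app_info` key and both function names are hypothetical, and the records are made up.

```python
def monitor_job_id(ad):
    """Rebuild the app-info string from a job's classAd attributes."""
    return "%d_https://glidein.cern.ch/%d/%s_%d" % (
        int(ad["CRAB_Id"]),
        int(ad["CRAB_Id"]),
        ad["CRAB_ReqName"].replace("_", ":"),
        int(ad["CRAB_Retry"]),
    )


def match_xrootd_to_condor(xrootd_records, classads):
    """Pair each XRootd record that carries app info with its classAd."""
    by_id = {monitor_job_id(ad): ad for ad in classads}
    matched = []
    for rec in xrootd_records:
        app_info = rec.get("app_info")  # hypothetical field name
        if not app_info:
            continue  # no job information in this record, so ignore it
        ad = by_id.get(app_info)
        if ad is not None:
            matched.append((rec, ad))
    return matched


ad = {
    "CRAB_Id": 70,
    "CRAB_ReqName": "150823_160611_aidan_crab_20150823_RunIISpring15DR74_WZ_25ns_v1",
    "CRAB_Retry": 0,
}
records = [{"app_info": monitor_job_id(ad)}, {"app_info": ""}, {}]
print(len(match_xrootd_to_condor(records, [ad])))
```

Jobs whose classAd never shows up in the matched list did not read via XRootd (or their XRootd records lacked job information), which gives the split between the correlated subset and the total sample.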

-- FkW - 2015/08/27

Topic revision: r3 - 2016/01/23 - 02:28:30 - FkW
 