Here we attempt to specify some goals for the UCSD CRAB server on glideinWMS installation, along with a crude timeline for accomplishing them. We start with the overarching goals for the STEP09 Analysis activity in general.

This twiki was written to facilitate communication among us. Our preference would be not to couple the goals for the CRAB server with the goals for STEP09. However, we are happy to change our mind based on comments.

Big picture official goals of STEP09 Analysis activity

  • show that we can reach the level of analysis pledged at each commissioned T2.
    • we will actually submit as many jobs as we can until we exhaust the available CPU at all sites. There are close to 16,000 pledged slots, about 50% of which are for analysis. We will most likely have at least one week, maybe two, during which we overlap significantly with large-scale MC production.
    • we will look for evidence that fair share works, both between analysis and production and between different analysis users.
  • monitor analysis activity via the dashboard with a set of metrics that we then stick to beyond STEP09 for some time.
    • validate the dashboard information against the logfiles on our CRAB server. I.e., for all the load we put into the system, we will verify that the dashboard reports numbers consistent with the logfiles we get on the CRAB server.
  • explore data placement as a tool to decrease inefficiencies in the global system.
    • create a table of all the commissioned T2s, the data volume they host officially and unofficially, and their pledges.
    • identify sites that see little use from people other than us and make little use of their disk space. Move popular datasets to the "central" portion of the disk space at those sites. Observe whether this generates measurably more use for those sites.
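The logfile-versus-dashboard validation above can be organized as a small script. Below is a minimal sketch: the logfile line format ("JobID=... State=... ExitCode=...") is a hypothetical placeholder, not the real CRAB server format, and the dashboard counts are assumed to be obtained separately from its reporting interface.

```python
import re
from collections import Counter

def tally_job_states(log_lines):
    """Count terminal job states from CRAB server logfile lines.

    NOTE: the line format assumed here ('JobID=... State=... ExitCode=...')
    is hypothetical; the real logfile format on the CRAB server must be
    checked before using this in anger.
    """
    pattern = re.compile(r"State=(\w+)")
    counts = Counter()
    for line in log_lines:
        match = pattern.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

def compare_with_dashboard(log_counts, dashboard_counts):
    """Return the per-state discrepancy: logfile count minus dashboard count.

    A non-zero entry flags a state where the dashboard disagrees with
    what we see in our own logfiles.
    """
    states = set(log_counts) | set(dashboard_counts)
    return {s: log_counts.get(s, 0) - dashboard_counts.get(s, 0)
            for s in states}
```

Running this daily against the same time window on both sides would give a simple per-state consistency check.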

Unofficial goals

  • Exercise the new procedures for /store/user to /store/results
  • Get a sense of the scalability, reliability, and robustness, basically the operational aspects of running the CRAB server on glideinWMS for O(10) users at large scale and across as many sites as possible/reasonable.
    • verify that we can submit jobs to all T2s and all T3s registered in siteDB. This is a much more ambitious goal than what's spelled out for STEP09, and I'm happy to do this after we are done with STEP09.
  • Get a sense of the present functionality available in CRAB server 1_0_8pre2 when used with the corresponding client XXX (?what's the latest?)
    • Dave had a bunch of complaints the last time he used (and then stopped using) the CRAB server. I'd like to see him go through this, and see if we can convince him that the CRAB server as of today adds value over the CRAB client alone.
  • We want to have Subir's monitoring, including the pseudo-interactive part, in place for STEP09.
  • We would like to stage out with a vengeance to our BestMan/gridFTP/Hadoop installation. However, we know this isn't working at huge scale right now. We will thus need to wait and see where we are by the end of May.
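For the functionality exercise, it may help to agree on a common configuration skeleton up front. The following is a minimal sketch of a crab.cfg, assuming CRAB 2-style parameter names (scheduler, use_server, etc.); all values shown are placeholders, not a tested configuration.

```ini
# Sketch of a crab.cfg for routing jobs through the UCSD CRAB server
# on glideinWMS. Parameter names follow CRAB 2 client conventions;
# every value below is a placeholder.
[CRAB]
jobtype = cmssw
# glideinWMS-based scheduling; route the task through the CRAB server
scheduler = glidein
use_server = 1

[CMSSW]
# placeholder dataset and CMSSW parameter-set file
datasetpath = /SomeDataset/SomeEra/RECO
pset = analysis_cfg.py
total_number_of_events = -1
events_per_job = 50000

[USER]
# stage out to the site's storage element rather than returning data
return_data = 0
copy_data = 1
storage_element = T2_US_UCSD
```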

Note: We keep track of problems encountered with the CRAB server here.

Attempt at a schedule

  • last week of May:
    • verify that we can successfully run crab jobs at all sites on the list.
    • have Dave go through his functionality learning exercise, including our own ntuple production.
    • after the other two items are done: reach a scale of 3,000 running jobs, 4h each, assuming we can get that many resources globally.
    • stop, do a complete error analysis, and contemplate what next to do.
      • validate the dashboard information
      • understand how to organize logfile parsing on the CRAB server.
  • first week of June:
    • find the scalability limit of our CRAB server
      • Note: our CRAB server uses gridftp very differently than the gLite one. We thus ought to expect very different scalability issues.
      • Personally, I'd like to see 15,000 running jobs; more is better. I am happy to keep the jobs running long so that we do not hit scheduling scalability issues inside the CRAB server. glideinWMS ought to be able to schedule 200k jobs a day without too much sweat. However, I do not want to push this at this point.
    • finalize the logfile analyzing scripts for STEP09.
  • second and third week of June:
    • don't change anything. Simply run the infrastructure as reliably as possible, and gain operational experience while keeping all slots occupied worldwide.
    • do daily dashboard analysis
    • do daily crab server logfile analysis
    • assemble a daily status report for the T2 listserv.
    • assemble weekly reports
  • after STEP09:
    • understand the scheduling rate limits of the CRAB server at UCSD. I.e., at this point we push hard on the number of jobs per day until we break the system. We then watch how it breaks, figure out how to recover from the failure, and work out how to monitor the system so that we detect the failure quickly.
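As a first probe of the scheduling rate, one can pull job-submission timestamps out of the server logfiles and compute the sustained jobs/day throughput. A minimal sketch follows; the timestamp format is an assumption and must be adapted to the real logs.

```python
from datetime import datetime

def jobs_per_day(timestamps, fmt="%Y/%m/%d %H:%M:%S"):
    """Estimate the sustained scheduling rate in jobs/day.

    `timestamps` is a list of submission-time strings extracted from the
    CRAB server logfiles. The default format string is an assumption,
    not the real logfile convention.
    """
    times = sorted(datetime.strptime(t, fmt) for t in timestamps)
    if len(times) < 2:
        raise ValueError("need at least two submissions to estimate a rate")
    span_seconds = (times[-1] - times[0]).total_seconds()
    if span_seconds == 0:
        raise ValueError("all submissions share one timestamp")
    # (n - 1) submission intervals over the observed span, scaled to a day
    return (len(times) - 1) / span_seconds * 86400.0
```

Tracking this number while ramping up the load should show where the rate stops scaling, i.e. where the CRAB server starts to break.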

-- FkW - 2009/05/23

Topic revision: r4 - 2009/06/17 - 08:08:14 - FkW