This page tracks the development effort needed to follow up on lessons learned from the CCRC08 Analysis exercise. The work described here is ideally geared towards having a first version public for use by a limited set of beta users by mid-July.

We'll have to see how realistic this is as we make progress.

bdii related

We see a three step development path:

Step 1: Fix the use of bdii within CRAB's condor interface

There are some sites, especially on EGEE, that the existing osg_bdii.py script doesn't find. Burt knows how to fix this.
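For illustration, here is a minimal sketch of the kind of BDII query involved, pointed at a top-level BDII rather than an OSG-only one (which is presumably why EGEE sites get missed). The BDII host name is an assumption and would need to be confirmed with Burt.

    #!/usr/bin/env python
    # Sketch: list every CE unique ID known to a top-level BDII.
    # lcg-bdii.cern.ch is an assumed top-level BDII host; an OSG-only
    # BDII would miss EGEE sites, which may be the bug in osg_bdii.py.
    import subprocess

    BDII = "ldap://lcg-bdii.cern.ch:2170"

    def list_ces(bdii=BDII):
        """Return the GlueCEUniqueID of every CE the BDII knows about."""
        out = subprocess.Popen(
            ["ldapsearch", "-x", "-LLL", "-H", bdii, "-b", "o=grid",
             "(objectClass=GlueCE)", "GlueCEUniqueID"],
            stdout=subprocess.PIPE).communicate()[0]
        return [line.split(":", 1)[1].strip()
                for line in out.splitlines()
                if line.startswith("GlueCEUniqueID:")]

    if __name__ == "__main__":
        for ce in list_ces():
            print ce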

Step 2: Make the gfactory use bdii more dynamically

The single largest source of job failures during the CCRC08 Analysis exercise was storage failures. Basically, an SE has trouble, and jobs fail because either input files can't be accessed or output files can't be staged out. We have had cases where the site notices this quickly and goes into maintenance, but we continue sending jobs there because the CE is still up at the site while the SE is down.

Ideally, we would have a two-prong approach:

  • If a site announces its SE to be down via the information system, then we shouldn't be sending any jobs there that require the SE.
  • If we notice that a lot of jobs fail because the SE at a site is not working properly, we should blacklist that site ourselves automatically. During normal business hours, the crabserver ops team then files a trouble ticket against the site, and once the problem is resolved, the crabserver ops team takes the site off the blacklist by hand. A sketch of this one-way blacklist follows this list.
    • A site gets onto the blacklist automatically.
    • A site gets off the blacklist only by operator intervention.
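A minimal sketch of the one-way blacklist semantics described above; the file location, threshold, and sample-size cut are hypothetical placeholders, not agreed-upon values.

    # Sketch of the one-way blacklist: sites get on automatically, and off
    # only by operator intervention (editing the file by hand).
    BLACKLIST = "/var/lib/crabserver/se_blacklist"  # hypothetical path
    FAILURE_THRESHOLD = 0.5   # hypothetical: >50% SE-related failures
    MIN_JOBS = 20             # hypothetical: ignore tiny samples

    def load_blacklist():
        try:
            return set(open(BLACKLIST).read().split())
        except IOError:
            return set()

    def maybe_blacklist(site, n_se_failures, n_jobs):
        """Automatically add a site; removal is operator-only."""
        if n_jobs >= MIN_JOBS and float(n_se_failures) / n_jobs > FAILURE_THRESHOLD:
            if site not in load_blacklist():
                open(BLACKLIST, "a").write(site + "\n")
                # ops team files a trouble ticket during business hours;
                # the site only comes off this list by hand.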

I think this will require work in multiple places.

  • We need to work with the CRAB team to make the error messages more reliable.
  • glideinWMS needs to intercept the return code, and set an appropriate attribute for the site.
  • Burt and Igor need to work on glideinWMS not submitting to sites whose SE is in maintenance mode; a sketch of the BDII query involved follows this list.
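As a sketch of that last point, glideinWMS could ask the BDII for the SE's status before submitting. The GlueSEStatus attribute and the value "Production" are assumptions about what sites will publish, which is exactly what Step 3 below is meant to settle.

    # Sketch: skip a site unless its SE reports itself in production.
    # GlueSEStatus / "Production" are assumed publishing conventions.
    import subprocess

    def se_status(se_host, bdii="ldap://lcg-bdii.cern.ch:2170"):
        out = subprocess.Popen(
            ["ldapsearch", "-x", "-LLL", "-H", bdii, "-b", "o=grid",
             "(&(objectClass=GlueSE)(GlueSEUniqueID=%s))" % se_host,
             "GlueSEStatus"],
            stdout=subprocess.PIPE).communicate()[0]
        for line in out.splitlines():
            if line.startswith("GlueSEStatus:"):
                return line.split(":", 1)[1].strip()
        return "Unknown"

    def should_submit(se_host):
        # treat anything other than an explicit "Production" as SE-down
        return se_status(se_host) == "Production"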

Step 3: Work with the grids to enable site teams to add appropriate "SE in maintenance" notifications into the information system

Burt and fkw need to work with OSG and EGEE on getting this pushed through. There's an ongoing process already between EGEE and OSG that this fits into naturally. A first meeting happened in Madison in May, a second (internal OSG) discussion will take place at the blueprint meeting in Madison at the end of July, and a third meeting with EGEE and OSG is planned for the end of August.

If we get this onto the year 3 roadmap of OSG, and into the EGEE-OSG process, then we ought to be able to get this into deployment within a year from now.

CRAB related

  • Enable CRAB to submit to multiple schedds in round-robin fashion (a sketch of the selection logic follows this list).
    • Eric coded up this capability as a patch to CRAB_2_2_1. It's now in Sanjay's court to try it out. This is nontrivial because it requires Sanjay to move forward to a newer CRAB version, which he was going to do anyway; see below.
  • Add an additional ClassAd attribute that points to the x509 proxy file.
    • Eric coded this up as a patch to CRAB_2_2_1. It's now in Sanjay's court to test it.
  • cmscp as implemented in our version of crab_template.sh was error-prone.
    • We believe this is fixed in CRAB_2_2_1. Sanjay needs to try it out.
  • Sanjay made several small changes, some of which are likely already incorporated in CRAB_2_2_1. We need to check out CRAB_2_2_1, make a list of which of these are not yet taken care of, and then discuss them with Eric.
  • Once we have switched to CRAB_2_2_1, we need to redo an at-scale test to check whether the error codes that CRAB assigns now make sense. A number of improvements supposedly went into CRAB_2_2_1.
  • Need to discuss the IO monitoring and the CPU/wall-clock accounting, and see how to fit this reporting to the dashboard into the production version of CRAB.
  • crab -status and crab -kill weren't working in our version.
    • Need to check this and follow up once we've moved forward to CRAB_2_2_1.
  • We need to find a solution to the problem that jobs hang indefinitely when trying to access a file via dcap that doesn't exist. It's not clear to me how to deal with this; one possible approach, a watchdog timeout, is sketched after this list.
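On the round-robin schedd item above: this is not Eric's actual patch, just a minimal illustration of the selection logic, with hypothetical schedd names.

    # Minimal illustration of round-robin schedd selection (not the real
    # CRAB_2_2_1 patch). The schedd names are hypothetical.
    import itertools

    SCHEDDS = ["schedd_glideins1@glidein-1.t2.ucsd.edu",
               "schedd_glideins2@glidein-1.t2.ucsd.edu"]

    _next_schedd = itertools.cycle(SCHEDDS).next

    def pick_schedd():
        """Return the schedd to target for the next task submission."""
        return _next_schedd()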
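And on the dcap hang: one possible approach is a watchdog alarm around whatever file-access probe the job wrapper performs, so that a missing file turns into a clean, reportable error instead of an indefinite hang. The timeout value is arbitrary, and the open() below is only a stand-in for the real dcap access.

    # Sketch: convert an indefinite dcap hang into a clean error.
    import signal

    class AccessTimeout(Exception):
        pass

    def _on_alarm(signum, frame):
        raise AccessTimeout("file access did not return in time")

    def open_with_timeout(path, timeout=300):   # 300s is arbitrary
        signal.signal(signal.SIGALRM, _on_alarm)
        signal.alarm(timeout)
        try:
            return open(path)   # stand-in for the real dcap access
        finally:
            signal.alarm(0)     # cancel the watchdog either way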

CRAB server related

At this point, Eric has not yet tried out the CRAB server interface to condor. fkw offers to work this into the testing schedule at UCSD, i.e., we turn glidein-2 into a CRAB server after the cluster move at UCSD (after June 16th).

We should discuss the implications of doing something like that!

glideinWMS related

  • To the best of my knowledge, a site cannot be added to a running gfactory. This is operationally very annoying.
  • The throttling mechanism at the frontend did not work properly. We ended up with many more pending glideins than we were supposed to have at some sites during the CCRC08 Analysis exercise.
  • Step 2 of the bdii-related issues above requires glideinWMS work.
  • the glidein_startup.sh script needs to be changed
    • for EGEE sites, such that it does the same thing as gliteWMS. This concerns finding the correct directory to cd into after arrival on the worker node; an illustration of the selection logic follows this list.
    • to improve the vetting of the worker node. This requires some thought: we could simply use the return codes of CRAB and blacklist worker nodes and/or sites based on those.
  • changes made to make NorduGrid submission work need to be fed back into the glideinWMS repository.
    • Sanjay needs to communicate this to Igor.
  • Igor and fkw need to discuss, and decide on how to operate the multiuser environment. This requires input from Ian and Jose as well. It's mostly a policy problem. How do we make sure that glideinWMS gets appropriate access to resources in globalCMS once it is used by many different users?
  • Job-level monitoring is needed.
    • Igor, Michael Thomas, and fkw have a design meeting on this on Friday, June 13th at Caltech.
    • This should integrate 'top'- and 'tail'-like capabilities as well as the IO monitoring.
  • We changed the condor configuration on Dan's advice. These config changes need to be fed back into the glideinWMS release as defaults.
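On the glidein_startup.sh directory item above, the selection logic might look like the following (shown in Python for brevity; the real change would be in shell). The environment variable names are what gLite and OSG worker nodes commonly set, and should be double-checked.

    # Illustration of picking the directory to cd into on the worker node.
    # EDG_WL_SCRATCH / OSG_WN_TMP / TMPDIR are assumed conventions.
    import os, tempfile

    def pick_work_dir():
        for var in ("EDG_WL_SCRATCH", "OSG_WN_TMP", "TMPDIR"):
            d = os.environ.get(var)
            if d and os.path.isdir(d) and os.access(d, os.W_OK):
                return tempfile.mkdtemp(dir=d)
        return tempfile.mkdtemp()  # last resort: system default tmp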

condor related

  • We never succeeded in running at CIEMAT. This needs to be debugged.
  • Dan Bradley is working through a list of condor issues uncovered during this exercise. Once he has a new version, we need to test it thoroughly.

Deployment and Operations issues

  • In addition to the hardware we had, we want one more machine to function as the central monitoring node. We discussed this with Terrence, and he prefers to use t2sentry for the ML database backend and another independent machine for the monitoring web interface. It's not clear whether we need that second machine or whether we can use glidein-1, which already has a web server for the gfactory monitoring. Can the user-level monitoring be added to that?
  • One or the other schedd on glidein-1 core dumped close to a dozen times during the last two weeks of CCRC08, while the same never happened on glidein-2. I suspect the glidein-2 part is not actually true: we probably simply weren't running condor_preen there, so Igor and fkw got core dump messages only from glidein-1 and not from glidein-2. The config fragment below shows the relevant knobs.
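If that hypothesis is right, the fix is simply to make sure the master runs condor_preen on both machines. In a stock condor_config the relevant knobs look roughly like the fragment below; the values shown are believed to be the usual defaults and should be checked against our actual configuration.

    # condor_config fragment: have the master run condor_preen periodically.
    # -m mails the results, -r removes the bad files it finds.
    PREEN          = $(SBIN)/condor_preen
    PREEN_ARGS     = -m -r
    # how often the master runs preen, in seconds (once a day)
    PREEN_INTERVAL = 86400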

-- FkW - 04 Jun 2008
