This page tracks the development effort needed to follow up on lessons learned from the CCRC08 Analysis exercise. The work described here is ideally geared towards having a first version public for use by a limited set of beta users by mid-July.

We'll have to see how realistic this is as we make progress.

bdii related

We see a three step development path:

Step 1: Fix the use of bdii within CRAB's condor interface

There are some sites, especially on EGEE, that the existing osg_bdii.py script doesn't find. Burt knows how to fix this.

Step 2: Make the gfactory use bdii more dynamically

The single largest source of job failures during the CCRC08 Analysis exercise was storage failure: an SE has trouble, and jobs fail because either input files can't be accessed or output files can't be staged out. We have had cases where the site notices this quickly and goes into maintenance, but we continue sending jobs there because the CE is still up while the SE is down.

Ideally, we would have a two-prong approach:

  • If a site announces their SE to be down via the information system, then we shouldn't be sending any jobs there that require the SE.
  • If we notice that a lot of jobs fail because the SE at a site is not working properly, we should ourselves blacklist that site automatically. During normal business hours, the crabserver ops team then files a trouble ticket against the site, and once the problem is resolved, the crabserver ops team takes the site off the blacklist by hand.
    • A site gets onto the blacklist automatically.
    • A site gets off the blacklist only by operator intervention.
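The one-way blacklist described above could be sketched as follows; the class, threshold, and site names are illustrative, not existing glideinWMS code:

```python
# Sketch of the proposed one-way blacklist: a site is added automatically
# once its SE-failure count crosses a threshold, and can only be removed
# again by an explicit operator call.  Threshold value is assumed.

FAILURE_THRESHOLD = 10  # illustrative; the real cut would need tuning

class SiteBlacklist:
    def __init__(self, threshold=FAILURE_THRESHOLD):
        self.threshold = threshold
        self.failures = {}        # site -> consecutive SE-failure count
        self.blacklisted = set()

    def record_failure(self, site):
        self.failures[site] = self.failures.get(site, 0) + 1
        if self.failures[site] >= self.threshold:
            self.blacklisted.add(site)      # automatic

    def record_success(self, site):
        self.failures[site] = 0             # reset on success

    def operator_remove(self, site):
        """Only path off the blacklist: manual operator intervention."""
        self.blacklisted.discard(site)
        self.failures[site] = 0

    def is_blacklisted(self, site):
        return site in self.blacklisted
```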

I think this will require work in multiple places.

  • We need to work with CRAB team to make the error messages more reliable.
  • glideinWMS needs to intercept the return code, and set an appropriate attribute for the site.
    • Igor says that CRAB should do this by setting the appropriate job classAd attribute to make sure that this failed job gets restarted elsewhere.
    • In addition, we want the site to be automatically blacklisted. How exactly to do this needs some thought.
  • Burt and Igor need to work on glideinWMS not submitting to sites that have an SE in maintenance mode.
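In condor terms, Igor's classAd suggestion and the SE-maintenance matching could look roughly like the following submit-file fragment. Every attribute name below is hypothetical; none of them are existing glideinWMS or CRAB knobs:

```
# Hypothetical sketch -- all attribute names are made up.

# The factory entry would publish, from its bdii lookup of the site's SE:
#   GLIDEIN_SE_Status = "maintenance"

# CRAB submit file, for jobs that need the SE:
+NeedsSE = True
Requirements = (TARGET.GLIDEIN_SE_Status =!= "maintenance")

# On a stage-out failure, the wrapper could mark the job so that the
# restart avoids the same site:
# +LastFailedSite = "T2_XY_Site"
# Requirements = $(Requirements) && (TARGET.GLIDEIN_Site =!= MY.LastFailedSite)
```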

Step 3: Work with the grids to enable site teams to add appropriate "SE in maintenance" notifications into information system

Burt and fkw need to work with OSG and EGEE on getting this pushed through. There's an ongoing process between EGEE and OSG that this fits into naturally. A first meeting happened in Madison in May; a second (internal OSG) discussion will take place in Madison at the blueprint meeting at the end of July; a third meeting with EGEE and OSG is planned for the end of August.

If we get this onto the year 3 roadmap of OSG, and into the EGEE-OSG process, then we ought to be able to get this into deployment within a year from now.

CRAB related

Comparison of our version with CRAB_2_2_1

Logic in cmscp in crab_template.sh for stage-out

There are a number of things in the way it is implemented in CRAB_2_2_1 that are suboptimal.

  • It hardcodes the port number to :8443 for lcg-cp. This is unnecessary, and will fail at sites that use DPM. Our version doesn't do this, and thus successfully stages out at DPM sites.
  • If lcg-cp fails and lcg-ls says the file exists, it should remove the file before falling back to srmcp. Otherwise even trying srmcp is useless.
  • For srmcp it again relies on the srmcp exit code. We know very well from our exercise that this is not reliable. We need to rely on srmls and the file size, and declare the copy failed if those two checks fail, not based on the srmcp exit status.
  • The logic shouldn't bother trying lcg-cp if lcg-cp doesn't exist on the worker node.

If this is ok with Eric, we could provide code for cmscp that fixes all of these problems.
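The corrected fallback logic could be sketched like this (in Python rather than the cmscp shell code, with the command runner injected so the decision logic is visible; command names and options are illustrative, not the exact CRAB_2_2_1 invocations):

```python
# Sketch of the corrected cmscp stage-out fallback.  run(cmd) -> (rc, output)
# executes a grid command; it is injected so the logic can be followed and
# tested without real lcg-cp/srmcp.
import shutil

def stage_out(local_file, dest_surl, size, run, have_lcg=None):
    """Return True on a verified copy, False otherwise."""
    if have_lcg is None:
        have_lcg = shutil.which("lcg-cp") is not None

    if have_lcg:                              # don't bother if lcg-cp absent
        rc, _ = run(["lcg-cp", "file:%s" % local_file, dest_surl])
        rc_ls, _ = run(["lcg-ls", "-l", dest_surl])
        if rc == 0 and rc_ls == 0:
            return True
        if rc_ls == 0:                        # partial file left behind:
            run(["lcg-del", dest_surl])       # remove it before trying srmcp

    run(["srmcp", "file:///%s" % local_file, dest_surl])
    # Don't trust the srmcp exit code: verify with srmls and the file size.
    rc_ls, out = run(["srmls", dest_surl])
    return rc_ls == 0 and str(size) in out
```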

MonitorJobID changes we need

To make sure that our CRAB jobs via glideinWMS are properly reporting to the dashboard, we needed to make some changes. Those aren't part of 2_2_1 yet.

  • The SyncCE still uses `glite-brokerinfo getCE`, which will fail for glideins on EGEE and NorduGrid for sure.
  • MonitorJobID & SyncGridJobId do not guarantee that a job is unique in the dashboard, mainly for jobs that run using the same environment_unique_identifier of the parent glidein. This will screw up the monitoring.
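One way to guarantee uniqueness regardless of the shared glidein environment identifier is to mix a random UUID into the id; a sketch, with a made-up field layout that is not the actual dashboard schema:

```python
# Sketch: build a dashboard MonitorJobID that stays unique even when two
# jobs inherit the same environment identifier from their parent glidein.
# The field layout is illustrative only.
import uuid

def make_monitor_job_id(task_name, job_number):
    """Unique per job even under a shared glidein environment id."""
    return "%s_%s_%s" % (task_name, job_number, uuid.uuid4().hex)
```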

If this is ok with Eric, we could provide code, but fkw doesn't actually know where the mods need to go. Sanjay, please fill in the details here.

Mods from Eric to 2_2_1

  • Enable CRAB to submit to multiple schedd's in round-robin fashion.
    • Eric coded up this capability as a patch to CRAB_2_2_1. It's now in Sanjay's court to try this out.
  • Add additional classAd attribute that points to x509 proxy file.
    • Eric coded this up as patch to CRAB_2_2_1. It's now in Sanjay's court to test it.
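The round-robin scheme itself is simple; a sketch (not Eric's actual patch, and with made-up schedd names in the test below):

```python
# Sketch of round-robin schedd selection: successive task submissions
# cycle through the configured schedd list.
import itertools

def schedd_cycle(schedds):
    """Yield schedd names in round-robin order for successive submissions."""
    return itertools.cycle(schedds)
```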

Other Issues

  • Need to discuss the IO monitoring and CPU/wall clock part, and see how to fit this into the production version of CRAB to do this reporting to the dashboard.
  • crab -status and crab -kill weren't working in our version.
    • fkw doesn't understand why. Do we know details?
  • We need to find a solution to the problem that jobs hang indefinitely when trying to access a file in dcap that doesn't exist. Not clear to fkw how to deal with this.
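One candidate solution is to run the dcap access under a hard timeout and treat expiry as "file not accessible"; a sketch, with an illustrative timeout value:

```python
# Sketch: guard against the dcap hang by running the file access (e.g. a
# dccp command) under a hard timeout.  The timeout value is illustrative.
import subprocess

def copy_with_timeout(cmd, timeout_s=300):
    """Run cmd; return True on success, False on failure or hang."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode == 0
    except subprocess.TimeoutExpired:
        return False   # the child is killed; treat the hang as a failure
```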

CRAB server related

At this point, Eric has not yet tried out the CRAB server interface to condor. fkw offers to work this into the testing schedule at UCSD, i.e. we turn glidein-2 into a CRAB server after the cluster move at UCSD (after June 16th).

We should discuss the implications of doing something like that!

glideinWMS related

  • To the best of my knowledge, a site cannot be added to a running gfactory. This is operationally very annoying.
    • Wrong. All you do is stop the gfactory, reconfigure it, and start it again. The operation of running and idle jobs and glideins is completely independent of the gfactory.
  • The throttling mechanism at the frontend did not work properly. We ended up with many more pending glideins than we were supposed to at some sites during CCRC08 Analysis exercise.
    • We'll debug this next time it happens.
  • step 2 of the bdii related issues above requires glideinWMS work.
  • the glidein_startup.sh script needs to be changed
    • for EGEE sites, such that it does the same thing as gliteWMS. This concerns finding the correct directory to cd into after arrival on the worker node.
    • to improve the vetting of the worker node. This requires some thought because we could simply use the return codes of CRAB, and then blacklist worker nodes and/or sites based on that.
  • changes made to get NorduGrid submission working need to be fed back into the glideinWMS repository.
    • Sanjay needs to communicate this to Igor.
  • Igor and fkw need to discuss, and decide on how to operate the multiuser environment. This requires input from Ian and Jose as well. It's mostly a policy problem. How do we make sure that glideinWMS gets appropriate access to resources in globalCMS once it is used by many different users?
  • CIEMAT and CSCS problem.
    • Igor says that we should bounce these sites to him to debug them as part of his development system.
  • job level monitoring is needed.
    • Igor, Michael Thomas, and fkw have a design meeting on this on Tuesday June 10th at UCSD.
      • We sketched out a development path where Michael puts together a shell of a system, and Haifeng & Michael, and maybe others (?) finish off the project as a whole.
    • This should integrate 'top','tail' etc. capability as well as the IO monitoring etc.
  • We changed the condor configuration on advice by Dan. These config changes need to be fed back into glideinWMS release as defaults.
    • Igor is of the impression that none of these config changes should be kept for the production system.

condor related

  • We never succeeded in running at CIEMAT. This needs to be debugged.
  • Dan Bradley is working down a list of condor issues uncovered during this exercise. Once he has a new version, we need to thoroughly test it.

Deployment and Operations issues

  • In addition to the hardware we had, we want one more machine that functions as the central monitoring node. We discussed this with Terrence, and he prefers to use t2sentry for the ML database backend, and another independent machine for the monitoring web interface. It's not clear that we need that second machine or if we can use glidein-1. glidein-1 has already a web server for the gfactory monitoring. Can the user level monitoring be added to that?
  • One or the other schedd on glidein-1 core dumped close to a dozen times during the last two weeks of CCRC08. The same never happened on glidein-2. I suspect this is not actually true: we probably simply weren't running condor_preen, and Igor and fkw thus got core dump messages only from glidein-1 but not glidein-2.

-- FkW - 04 Jun 2008

Topic revision: r5 - 2008/06/11 - 00:59:31 - FkW