This page was created to keep track of development efforts needed to follow up on lessons learned from the CCRC08 Analysis exercise.
The development effort described here is ideally geared towards having a first version public for use by a limited set of beta users
by mid-July.
We'll have to see how realistic this is as we make progress.
bdii related
We see a three step development path:
Step 1: Fix the use of bdii within CRAB's condor interface
There are some sites, especially on EGEE, that the existing osg_bdii.py script doesn't find.
Burt knows how to fix this.
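Step 1 amounts to querying the BDII correctly and extracting all CE endpoints, including the EGEE ones the current script misses. As a rough illustration of the parsing half of that task, here is a minimal sketch that extracts `GlueCEUniqueID` values from LDIF output (the kind of output one gets from an `ldapsearch` against a top-level BDII). The hostnames in the sample are invented, and this is not the actual osg_bdii.py logic:

```python
# Minimal sketch: collect CE endpoints from LDIF text, as returned by e.g.
#   ldapsearch -x -LLL -H ldap://<top-level-bdii>:2170 -b o=grid '(objectClass=GlueCE)'
# (command shown for orientation only; osg_bdii.py's real query may differ).

def parse_ce_ids(ldif_text):
    """Collect all GlueCEUniqueID attribute values from LDIF text."""
    ce_ids = []
    for line in ldif_text.splitlines():
        line = line.strip()
        # Attribute lines look like "GlueCEUniqueID: host:port/jobmanager-..."
        if line.startswith("GlueCEUniqueID:"):
            ce_ids.append(line.split(":", 1)[1].strip())
    return ce_ids

# Invented sample records, one EGEE-style and one OSG-style CE:
sample = """\
dn: GlueCEUniqueID=ce01.example.org:2119/jobmanager-lcgpbs-cms,mds-vo-name=local,o=grid
GlueCEUniqueID: ce01.example.org:2119/jobmanager-lcgpbs-cms

dn: GlueCEUniqueID=osg-ce.example.edu:2119/jobmanager-condor-cms,mds-vo-name=local,o=grid
GlueCEUniqueID: osg-ce.example.edu:2119/jobmanager-condor-cms
"""
```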
Step 2: Make the gfactory use bdii more dynamically
The single largest source of job failures during the CCRC08 Analysis exercise was storage failures.
Basically, an SE has trouble, and jobs fail because either input files can't be accessed, or output files can't be staged out.
We have had cases where the site notices this quickly and goes into maintenance, but we continue sending jobs there
because the CE at the site is still up while the SE is down.
Ideally, we would have a two-prong approach:
- If a site announces their SE to be down via the information system, then we shouldn't be sending any jobs there that require the SE.
- If we notice that many jobs fail because the SE at a site is not working properly, we should blacklist that site automatically. During normal business hours, the crabserver ops team then files a trouble ticket against the site; once the problem is resolved, the ops team takes the site off the blacklist by hand.
- A site gets onto the blacklist automatically.
- A site gets off the blacklist only by operator intervention.
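The asymmetric policy above (automatic on, manual off) could be sketched as follows. The class name, failure threshold, and method names are all invented for illustration; the real implementation would live somewhere in the crabserver/glideinWMS machinery:

```python
# Sketch of the asymmetric blacklist policy: a site is blacklisted
# automatically once SE-related failures cross a threshold, but only an
# operator can take it off the list. All names/thresholds are illustrative.

class SiteBlacklist:
    def __init__(self, failure_threshold=10):
        self.failure_threshold = failure_threshold
        self.failures = {}        # site -> count of SE-related job failures
        self.blacklisted = set()

    def record_se_failure(self, site):
        """Called when a job at `site` fails with a stage-out/read error."""
        self.failures[site] = self.failures.get(site, 0) + 1
        if self.failures[site] >= self.failure_threshold:
            self.blacklisted.add(site)   # automatic: no operator needed

    def operator_clear(self, site):
        """The only path off the blacklist: manual intervention after the
        trouble ticket against the site is resolved."""
        self.blacklisted.discard(site)
        self.failures[site] = 0

    def can_submit(self, site):
        return site not in self.blacklisted
```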
I think this will require work in multiple places.
- We need to work with CRAB team to make the error messages more reliable.
- glideinWMS needs to intercept the return code, and set an appropriate attribute for the site.
- Burt and Igor need to work on glideinWMS not submitting to sites that have an SE in maintenance mode.
Step 3: Work with the grids to enable site teams to add appropriate "SE in maintenance" notifications into information system
Burt and fkw need to work with OSG and EGEE on getting this pushed through. There's an ongoing process already between
EGEE and OSG that this fits into naturally. A first meeting happened in Madison in May; a second (internal OSG) discussion will take place in
Madison at the blueprint meeting at the end of July. A third meeting with EGEE and OSG will take place at the end of August.
If we get this onto the year 3 roadmap of OSG, and into the EGEE-OSG process, then we ought to be able to get this into deployment within a year
from now.
CRAB related
Comparison of our version with CRAB_2_2_1
Logic in cmscp in crab_template.sh for stage-out
Several aspects of the way this is implemented in CRAB_2_2_1 are suboptimal.
- It hardcodes the port number to :8443 for lcg-cp. This is unnecessary, and will fail at sites that use DPM. Our version doesn't do this, and thus successfully stages out at DPM sites.
- If lcg-cp fails and lcg-ls says the file exists, it should remove the file before falling back to srmcp. Otherwise even trying srmcp is useless.
- For srmcp it again relies on the srmcp exit code. We know very well from our exercise that this is not reliable. We should instead verify the copy with srmls and a file-size comparison, and declare the copy failed based on those checks, not on the srmcp exit status.
- The logic should be such that it doesn't bother trying lcg-cp if lcg-cp doesn't exist.
If this is ok with Eric, we could provide code for cmscp that fixes all of these problems.
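To make the proposed fixes concrete, here is a sketch of the decision logic only. The command runners are injected stubs; the real cmscp would shell out to lcg-cp, lcg-ls, lcg-del, srmcp, and srmls. All helper names here are illustrative, not the actual cmscp interface:

```python
# Sketch of the stage-out logic proposed above, with the four fixes:
# (1) only try lcg-cp if the client exists, (2) no hardcoded :8443 port,
# (3) clean up a partial file before the srmcp fallback, (4) verify the
# srmcp copy with srmls + size check instead of trusting its exit code.

def stage_out(src, dest, tools):
    """`tools` maps names to callables standing in for the grid clients;
    a value of None means the client is absent on the worker node."""
    if tools.get("lcg-cp"):
        # dest carries no hardcoded port, so DPM sites work too.
        if tools["lcg-cp"](src, dest):
            return True
        # lcg-cp failed: remove any partial file before trying srmcp,
        # otherwise even trying srmcp is useless.
        if tools["lcg-ls"](dest):
            tools["lcg-del"](dest)
    # srmcp fallback: ignore its exit status, verify independently.
    tools["srmcp"](src, dest)
    remote_size = tools["srmls"](dest)   # None if the listing fails
    return remote_size is not None and remote_size == tools["local-size"](src)
```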
MonitorJobID changes we need
To make sure that our CRAB jobs via glideinWMS are properly reporting to the dashboard, we needed to make
some changes. Those aren't part of 2_2_1 yet.
- The SyncCE still uses `glite-brokerinfo getCE`, which will fail for glideins on EGEE and NorduGrid for sure.
- MonitorJobID & SyncGridJobId do not guarantee uniqueness of the job in the dashboard, mainly for jobs that run using the same environment_unique_identifier of the parent glideins. This will break the monitoring.
If this is ok with Eric, we could provide code for this as well.
fkw doesn't actually know where the mods need to go. Sanjay should fill in the details here.
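The uniqueness problem described above could be addressed by adding a per-job random component to the identifier, so that jobs reusing the same glidein environment still get distinct dashboard IDs. The naming scheme below is invented purely for illustration; the real MonitorJobID format expected by the CMS dashboard may differ:

```python
# Sketch: make the dashboard job identifier unique even when several jobs
# run inside glideins sharing the same parent environment identifier.
# The "<task>_<job>_<random>" format here is an assumption, not the real one.
import uuid

def make_monitor_job_id(task_name, job_number):
    # uuid4 gives a fresh random component per job, independent of the
    # glidein environment the job happens to land in.
    unique = uuid.uuid4().hex[:12]
    return "%s_%s_%s" % (task_name, job_number, unique)
```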
Mods from Eric to 2_2_1
- Enable CRAB to submit to multiple schedd's in round-robin fashion.
- Eric coded up this capability as a patch to CRAB_2_2_1. It's now in Sanjay's court to try this out.
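Conceptually, round-robin schedd selection is straightforward; a minimal sketch (not Eric's actual patch, and with placeholder schedd names) might look like:

```python
# Sketch of round-robin selection over a fixed list of schedds.
# The real CRAB patch presumably persists its position across invocations;
# this in-memory version only illustrates the rotation itself.
import itertools

class ScheddRoundRobin:
    def __init__(self, schedds):
        self._cycle = itertools.cycle(schedds)

    def next_schedd(self):
        """Return the next schedd in rotation."""
        return next(self._cycle)
```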
- Add additional classAd attribute that points to x509 proxy file.
- Eric coded this up as patch to CRAB_2_2_1. It's now in Sanjay's court to test it.
Other Issues
- Need to discuss the IO monitoring and CPU/wall clock part, and see how to fit this into the production version of CRAB to do this reporting to the dashboard.
- crab -status and crab -kill weren't working in our version.
- fkw doesn't understand why. Do we know details?
- We need to find a solution to the problem that jobs hang indefinitely when trying to access a file in dcap that doesn't exist. Not clear to fkw how to deal with this.
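One possible mitigation for the hang, assuming we can wrap the file-open call on the worker node, is an alarm-based timeout. This is only a sketch of the idea (Unix-only, main thread only, timeout value arbitrary), not a statement of how dcap access is actually structured in CRAB jobs:

```python
# Sketch: bound a potentially hanging file open with SIGALRM so the job
# fails fast instead of hanging indefinitely on a nonexistent dcap file.
import signal

class OpenTimeout(Exception):
    pass

def open_with_timeout(opener, timeout_seconds=300):
    """Run `opener` (a zero-argument callable) with a hard time limit."""
    def _handler(signum, frame):
        raise OpenTimeout("file open timed out")
    old = signal.signal(signal.SIGALRM, _handler)
    signal.alarm(timeout_seconds)
    try:
        return opener()
    finally:
        signal.alarm(0)                    # cancel any pending alarm
        signal.signal(signal.SIGALRM, old) # restore previous handler
```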
CRAB server related
At this point, Eric has not yet tried out the CRAB server interface to condor.
fkw offers to work this into the testing schedule at UCSD: we turn glidein-2 into a CRAB server after
the cluster move at UCSD, i.e. after June 16th.
We should discuss the implications of doing something like that!
glideinWMS related
- To the best of my knowledge, a site cannot be added to a running gfactory. This is operationally very annoying.
- The throttling mechanism at the frontend did not work properly. We ended up with many more pending glideins than we were supposed to at some sites during CCRC08 Analysis exercise.
- step 2 of the bdii related issues above requires glideinWMS work.
- the glidein_startup.sh script needs to be changed
- for EGEE sites such that it does the same thing as gliteWMS. This concerns finding the correct directory to cd into after arrival on the worker node.
- to improve the vetting of the worker node. This requires some thought because we could simply use the return codes of CRAB, and then blacklist worker nodes and/or sites based on that.
- changes made to make NorduGrid submission work need to be fed back into the glideinWMS repository.
- Sanjay needs to communicate this to Igor.
- Igor and fkw need to discuss, and decide on how to operate the multiuser environment. This requires input from Ian and Jose as well. It's mostly a policy problem. How do we make sure that glideinWMS gets appropriate access to resources in globalCMS once it is used by many different users?
- job level monitoring is needed.
- Igor, Michael Thomas, and fkw have a design meeting on this on Friday June 13th at Caltech.
- This should integrate 'top','tail' etc. capability as well as the IO monitoring etc.
- We changed the condor configuration on advice by Dan. These config changes need to be fed back into glideinWMS release as defaults.
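The frontend throttling mentioned in the list above boils down to never requesting more glideins for a site than a pending cap allows. As a minimal sketch of that invariant (parameter names invented, and obviously much simpler than the real frontend logic):

```python
# Sketch of per-site pending-glidein throttling: the frontend should
# request at most (max_pending - already_pending) new glideins, and never
# more than there are idle jobs to serve.

def glideins_to_request(idle_jobs, pending_glideins, max_pending):
    """Number of new glideins to request for one site."""
    if pending_glideins >= max_pending:
        return 0
    return min(idle_jobs, max_pending - pending_glideins)
```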
condor related
- We never succeeded in running at CIEMAT. This needs to be debugged.
- Dan Bradley is working down a list of condor issues uncovered during this exercise. Once he has a new version, we need to thoroughly test it.
Deployment and Operations issues
- In addition to the hardware we had, we want one more machine that functions as the central monitoring node. We discussed this with Terrence, and he prefers to use t2sentry for the ML database backend, and another independent machine for the monitoring web interface. It's not clear whether we need that second machine or whether we can use glidein-1. glidein-1 already has a web server for the gfactory monitoring. Can the user-level monitoring be added to that?
- One or the other schedd on glidein-1 core dumped close to a dozen times during the last 2 weeks of CCRC08, while the same never happened on glidein-2. I suspect this difference is not real: we simply weren't running condor_preen, and Igor and fkw thus got core dump messages only from glidein-1 but not from glidein-2.
--
FkW - 04 Jun 2008