Difference: FkwGlideinWMSDevelopment (3 vs. 4)

Revision 4 2008/06/05 - Main.FkW

Line: 1 to 1
 

This page was created to keep track of the development efforts needed to follow up on lessons learned from the CCRC08 Analysis exercise.

Line: 41 to 41
 from now.

CRAB related

Added:
>
>

Comparison of our version with CRAB_2_2_1

Logic in cmscp in crab_template.sh for stage-out

Several aspects of the way this is implemented in CRAB_2_2_1 are suboptimal.

  • It hardcodes the port number :8443 for lcg-cp. This is unnecessary, and fails at sites that use DPM. Our version doesn't do this, and thus stages out successfully at DPM sites.
  • If lcg-cp fails and lcg-ls says the file exists, it should remove the file before trying srmcp. Otherwise the srmcp attempt is useless.
  • For srmcp it again relies on the exit code. We know very well from our exercise that this is not reliable. We need to rely on srmls and the file size instead: the copy counts as failed only if those two checks fail, regardless of the srmcp exit status.
  • It should not bother trying lcg-cp if lcg-cp doesn't exist.
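
The four points above can be sketched as follows. This is only an illustration in Python, not the cmscp code in crab_template.sh and not a proposed patch. The helper names (run_cmd, remote_size, stage_out) are hypothetical, the use of srmrm to remove a leftover file is an assumption (the text above does not name a removal tool), and the parsing assumes srmls prints a leading "size path" line. Tool options such as the VO and the per-tool form of the source URL are omitted.

```python
import shutil
import subprocess


def run_cmd(argv):
    """Run a command; return (exit code, stdout)."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return proc.returncode, proc.stdout


def remote_size(dest_url, run=run_cmd):
    """Size reported by srmls, or None if it cannot be determined.

    Assumption: srmls prints a line starting with '<size> <path>'.
    """
    code, out = run(["srmls", dest_url])
    if code != 0:
        return None
    for line in out.splitlines():
        fields = line.split()
        if fields and fields[0].isdigit():
            return int(fields[0])
    return None


def stage_out(src_url, dest_url, expected_size, run=run_cmd, have=shutil.which):
    """Return True only if the remote file exists with the expected size.

    dest_url is passed through unchanged, so no port is hardcoded here.
    """
    # Skip lcg-cp entirely when the tool is not installed.
    if have("lcg-cp"):
        code, _ = run(["lcg-cp", src_url, dest_url])
        if code == 0 and remote_size(dest_url, run) == expected_size:
            return True
        # A failed lcg-cp may leave a file behind that would block srmcp.
        ls_code, _ = run(["lcg-ls", dest_url])
        if ls_code == 0:
            run(["srmrm", dest_url])  # assumption: srmrm as removal tool
    # The srmcp exit code is deliberately ignored; srmls and size decide.
    run(["srmcp", src_url, dest_url])
    return remote_size(dest_url, run) == expected_size
```

Passing run and have as parameters keeps the decision logic testable on a machine without any grid clients installed.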

If this is OK with Eric, we could provide code for cmscp that fixes all of these problems.

MonitorJobID changes we need

To make sure that our CRAB jobs via glideinWMS are properly reporting to the dashboard, we needed to make some changes. Those aren't part of 2_2_1 yet.

  • SyncCE still uses `glite-brokerinfo getCE`, which will certainly fail for glideins in EGEE and NorduGrid.
  • MonitorJobID and SyncGridJobId do not guarantee that a job is unique in the dashboard, mainly for jobs that run with the same environment_unique_identifier as their parent glidein. This breaks the monitoring.

If this is OK with Eric, we could provide code for this. fkw doesn't actually know where the mods need to go. Sanjay, please fill in the details here.

Mods from Eric to 2_2_1

 
  • Enable CRAB to submit to multiple schedd's in round-robin fashion.
    • Eric coded up this capability as a patch to CRAB_2_2_1. It's now in Sanjay's court to try this out.
Deleted:
<
<
This is nontrivial because it implies Sanjay to move forwards his CRAB version, which he was going to do anyway, see below.
 
  • Add additional classAd attribute that points to x509 proxy file.
    • Eric coded this up as patch to CRAB_2_2_1. It's now in Sanjay's court to test it.
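
As an illustration of the round-robin idea in the first bullet above, here is a minimal Python sketch. It is not Eric's patch to CRAB_2_2_1. The function name assign_schedds and the schedd names in the usage line are hypothetical.

```python
from itertools import cycle


def assign_schedds(job_ids, schedds):
    """Pair each job with a schedd, cycling through the schedd list in order."""
    if not schedds:
        raise ValueError("need at least one schedd")
    # zip stops when job_ids is exhausted; cycle repeats schedds forever.
    return list(zip(job_ids, cycle(schedds)))


# Hypothetical schedd names, for illustration only.
plan = assign_schedds([1, 2, 3], ["schedd_a", "schedd_b"])
```

Each job goes to the next schedd in the list, so the load spreads evenly as long as jobs are of similar size.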
Changed:
<
<
  • cmscp as implemented in our version of crab_template.sh was error prone.
    • We believe this is fixed in CRAB_2_2_1. Sanjay needs to try it out.
  • Sanjay made several small changes, some of which are likely to be fixed in CRAB_2_2_1. Need to check out CRAB_2_2_1, and make a list of which of these are not yet taken care of in CRAB_2_2_1, then discuss them with Eric.
  • Once we switched to CRAB_2_2_1, we need to redo an at scale test to check if the error codes that CRAB assigns now make sense. There's been a number of improvements that supposedly went into CRAB_2_2_1.
>
>

Other Issues

 
  • Need to discuss the IO monitoring and CPU/wall-clock reporting, and see how to fit it into the production version of CRAB so that it reports to the dashboard.
  • crab -status and crab -kill weren't working in our version.
Changed:
<
<
    • Need to check this and follow up once we've moved forward to CRAB_2_2_1.
>
>
    • fkw doesn't understand why. Do we know details?
 
  • We need to find a solution to the problem that jobs hang indefinitely when trying to access a file in dcap that doesn't exist.
Changed:
<
<
Not clear to me how to deal with this.
>
>
Not clear to fkw how to deal with this.
 

CRAB server related

At this point, Eric has not yet tried out the CRAB server interface to condor.
 