Difference: FkwGlideinWMSDevelopment (4 vs. 5)

Revision 5 2008/06/11 - Main.FkW

Line: 1 to 1
 

This page was created to keep track of the development efforts needed to follow up on lessons learned from the CCRC08 Analysis exercise.

Line: 30 to 30
 I think this will require work in multiple places.
  • We need to work with the CRAB team to make the error messages more reliable.
  • glideinWMS needs to intercept the return code, and set an appropriate attribute for the site.
Added:
>
>
    • Igor says that CRAB should do this by setting the appropriate job classAd attribute to make sure that this failed job gets restarted elsewhere.
    • In addition, we want the site to be automatically blacklisted. How exactly to do this needs some thought.
 
  • Burt and Igor need to work on glideinWMS not submitting to sites that have an SE in maintenance mode.
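
One possible shape for the automatic blacklisting mentioned above, as a minimal sketch only. The page leaves the actual mechanism open, so everything here is an assumption for illustration: the class name SiteBlacklist, the set SITE_ERROR_CODES and its placeholder values, and the threshold of consecutive failures are hypothetical and are not part of glideinWMS or CRAB.

```python
from collections import defaultdict

# Hypothetical return codes that would indicate a site-side problem
# (as opposed to a user error). Placeholder values, not real CRAB codes.
SITE_ERROR_CODES = {1001, 1002}


class SiteBlacklist:
    """Track site-related job failures and blacklist repeat offenders."""

    def __init__(self, max_failures=3):
        # Number of consecutive site-related failures before blacklisting.
        self.max_failures = max_failures
        self._failures = defaultdict(int)
        self._blacklisted = set()

    def record(self, site, return_code):
        """Record the return code of a job that ran at `site`."""
        if return_code == 0:
            # A success resets the counter; it does not lift an existing
            # blacklist entry, which is left to an operator in this sketch.
            self._failures[site] = 0
        elif return_code in SITE_ERROR_CODES:
            self._failures[site] += 1
            if self._failures[site] >= self.max_failures:
                self._blacklisted.add(site)
        # Any other non-zero code is treated as a user error and ignored.

    def is_blacklisted(self, site):
        return site in self._blacklisted

    def eligible_sites(self, sites):
        """Return the sites that may still receive jobs, in input order."""
        return [s for s in sites if s not in self._blacklisted]
```

A sketch like this only works if the return codes reliably separate site problems from user errors, which is why the error-message work with the CRAB team comes first.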

Step 3: Work with the grids to enable site teams to add appropriate "SE in maintenance" notifications to the information system

Line: 91 to 94
 

glideinWMS related

  • To the best of my knowledge, a site cannot be added to the running gfactory. This is operationally very annoying.
Added:
>
>
    • Wrong. All you need to do is stop the gfactory, reconfigure it, and start it again. The operation of running and idle jobs and glideins is completely independent of the gfactory.
 
  • The throttling mechanism at the frontend did not work properly. We ended up with many more pending glideins than we were supposed to have at some sites during the CCRC08 Analysis exercise.
Added:
>
>
    • We'll debug this next time it happens.
 
  • step 2 of the bdii related issues above requires glideinWMS work.
  • the glidein_startup.sh script needs to be changed
    • for EGEE sites such that it does the same thing as gliteWMS. This concerns
Line: 104 to 110
 
  • Igor and fkw need to discuss, and decide on how to operate the multiuser environment. This requires input from Ian and Jose as well. It's mostly a policy problem. How do we make sure that glideinWMS gets appropriate access to resources in globalCMS once it is used by many different users?
Added:
>
>
  • CIEMAT and CSCS problem.
    • Igor says that we should bounce these sites to him to debug them as part of his development system.
 
  • job level monitoring is needed.
Changed:
<
<
    • Igor, Michael Thomas, and fkw have a design meeting on this on Friday June 13th at Caltech.
>
>
    • Igor, Michael Thomas, and fkw have a design meeting on this on Tuesday June 10th at UCSD.
      • We sketched out a development path where Michael puts together a shell of a system, and Haifeng & Michael, and maybe others (?) finish off the project as a whole.
 
    • This should integrate 'top', 'tail', etc. capability as well as the IO monitoring etc.
  • We changed the condor configuration on Dan's advice. These config changes need to be fed back into the glideinWMS release as defaults.
Added:
>
>
    • Igor is under the impression that none of these config changes should be kept for the production system.
 

condor related

  • We never succeeded in running at CIEMAT. This needs to be debugged.
 