Difference: FkwCCRC08OpenIssues (12 vs. 13)

Revision 13 (2008/06/03) - Main.FkW


Stage-out issues at sites


Issues with Condor


CIEMAT Problem

We never figured out why we can't run jobs at CIEMAT. The working hypothesis is that CIEMAT has an outgoing firewall we can't get around.
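The firewall hypothesis is something we could test directly from a worker node. A minimal sketch, assuming a hypothetical collector host name; the only non-hypothetical detail is that 9618 is the standard Condor collector port:

```python
# Quick outbound-connectivity probe (a sketch, not part of glideinWMS):
# if outbound TCP to the collector/schedd ports fails from a CIEMAT worker
# node while succeeding elsewhere, an outgoing firewall is the likely culprit.
import socket

def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if an outbound TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (hypothetical collector host; 9618 is the standard Condor port):
# print(can_connect("collector.example.org", 9618))
```

Running the same probe from a worker node at a site that works would give a direct comparison.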


1st emergency meeting

Attending: Dan, Igor, Sanjay, fkw
 In that case, the startd ends up failing to connect to the schedd because writes to the startd log are delayed so massively that every connection attempt times out.
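That failure mode can be checked with a crude probe: time a single synchronous write where the startd log lives. This is a sketch under assumptions, not Condor code, and the connect-timeout value is hypothetical:

```python
# Crude disk-latency probe (illustrative only): time one small fsync'd write
# in the directory holding the startd log.  If a single write takes longer
# than the schedd connect timeout, connection attempts will time out exactly
# as described above.
import os
import tempfile
import time

def write_latency(dirpath: str, payload: bytes = b"x" * 4096) -> float:
    """Seconds taken by one small fsync'd write in dirpath."""
    fd, path = tempfile.mkstemp(dir=dirpath)
    try:
        start = time.monotonic()
        os.write(fd, payload)
        os.fsync(fd)
        return time.monotonic() - start
    finally:
        os.close(fd)
        os.unlink(path)

CONNECT_TIMEOUT = 20.0  # hypothetical connect timeout, in seconds
latency = write_latency(tempfile.gettempdir())
print(f"log-write latency: {latency:.4f}s (connect timeout {CONNECT_TIMEOUT}s)")
```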


More stuff

From email "lessons learned":

-> we failed to run jobs at CIEMAT. The suspicion is that it's a problem with outgoing firewall rules.
     This is thus a problem that could bite us anywhere, anytime a site changes its firewall rules.
-> latency problems with collector and schedd.
    A single site with long latencies can cause havoc on the WMS.
    We got a fix to Condor to have better control of this, and have adjusted the architecture of the WMS to make ourselves less
    vulnerable. However, both improve the situation without really fixing it. Dan is starting to work on a long-term fix as part of his DISUN
    contribution to CMS.
-> There are a few issues we worked around that are specific to the interface of CRAB and glideinWMS.
    -> some dashboard-reporting things that we still need to check whether they are fixed in CRAB 2_2 anyway.
    -> we've done work on I/O monitoring, which I think is not yet conclusive.
    -> the BDII connection made assumptions that were invalid, so we coded around them.
    -> we messed with the CRAB<->glideinWMS interface to deal with the latency issues mentioned above.
-> There are some nuisance issues from an operational perspective.
     -> adding a new site to glideinWMS requires taking the system down. That's operationally less than ideal.
     -> there are a number of additional checks the glideins could do to make sure we don't fail because of worker-node (WN) issues.
     -> the way glideinWMS works right now, it does not know about sites being in drain-off, etc. As a result, we submitted jobs
          to CEs that were off.
     -> there may be changes needed to submit to NorduGrid. We don't know this for sure yet because we haven't done the submissions.
-> I am convinced that there's a bug in the frontend, because at various times we ended up with many more glideins pending than we were supposed to
     have according to the frontend config.
-> I am not convinced that we understand the scalability. As far as I'm concerned, the total number of running jobs we achieved was disappointing.
     The peak I have ever seen is about 3000, maybe slightly more.
-> We need to settle the question of how we deal with multi-user operations.
-> I never tried using the 'top', 'cat' etc. commands. I'd like to try that sometime!
     -> as an aside, it's not clear how this works once we use glexec. Igor should educate us.
-> CRAB server 1_0_0 is now out. We should deploy it and understand what the operational implications are.
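The suspected frontend bug above amounts to a broken invariant: pending glideins at a site should never exceed the cap in the frontend config. A minimal sketch of that invariant, with illustrative names and numbers (not actual frontend code):

```python
# Sketch of the invariant the frontend is supposed to enforce: never let
# pending glideins at a site exceed max_pending from its config.  SiteState,
# max_pending, and the site name below are hypothetical, for illustration.
from dataclasses import dataclass

@dataclass
class SiteState:
    name: str
    pending: int      # glideins currently idle in the site's queue
    max_pending: int  # cap from the frontend config

def glideins_to_submit(site: SiteState, demand: int) -> int:
    """How many new glideins to request without violating the pending cap."""
    headroom = max(site.max_pending - site.pending, 0)
    return min(demand, headroom)

site = SiteState("T2_ES_CIEMAT", pending=95, max_pending=100)
print(glideins_to_submit(site, demand=50))  # headroom is only 5
```

Checking the frontend logs for moments where the observed pending count exceeds the configured cap would confirm or refute the bug.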

I've probably forgotten something. So please add to the list.

From email "more lessons learned":

-> crab -status isn't working
-> crab -kill isn't working.

The second issue is fairly serious: as it stands now, we cannot kill a job and have the fact that we killed it be recorded on the dashboard.

-> We see a lot of "application unknown" on the dashboard. Initially, we thought this was because of cockpit error on our part. However, we haven't made any mistakes of this sort lately and still see a lot of application unknown. There is thus something fundamentally wrong here that we do not fully understand. As of now, we have indications of three different types of issues:
    -> jobs get hung in dcap because the file doesn't exist, and the batch system eventually kills the job. This is a very small proportion.
    -> there are "keep alive" messages between startd and schedd. When a number of them are missed, the schedd considers the startd gone and moves on. The startd commits suicide, I think; fkw is a bit hazy on how this works in detail. Sanjay says he used a mechanism in Cronus that substitutes TCP for UDP if and only if a number of keep-alive messages in a row fail to be received, i.e. a glidein is not abandoned without trying TCP instead of UDP. It would be good if we could figure out from the logs whether this is contributing to the "application unknown" on the dashboard. We know that if this happens, the job is guaranteed to be recorded as application unknown. In addition, Sanjay used the "time to live" from the job classAd to prevent schedd and startd from giving up on each other within the expected runtime of the job. A feature we may want to add if we can convince ourselves that this is contributing to the application unknown.
    -> At present, the glidein knows about the amount of time it has left to live, but the CRAB job doesn't tell it the amount of time it needs. As a result, it's conceivable that some of the application unknown are caused by timed-out batch slot leases. It would be good if we could quantify this from the logs, and then in the future communicate a "time to live" in the job classAd. We believe that a job that times out in the batch slot will contribute to application unknown.
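The keep-alive fallback Sanjay describes can be sketched as a small decision rule. This is illustrative only, not the Cronus or Condor implementation, and the miss threshold is a made-up number:

```python
# Sketch of the keep-alive policy described above: count consecutive missed
# UDP keep-alives and retry over TCP before declaring the glidein dead.
# MISSED_LIMIT and the classification labels are hypothetical.
MISSED_LIMIT = 3  # consecutive UDP misses before escalating

def classify_glidein(keepalives: list[bool], tcp_reachable: bool) -> str:
    """keepalives: True = UDP keep-alive received, False = missed."""
    consecutive_missed = 0
    for received in keepalives:
        consecutive_missed = 0 if received else consecutive_missed + 1
        if consecutive_missed >= MISSED_LIMIT:
            # Do not abandon the startd on UDP evidence alone: try TCP first.
            return "alive-via-tcp" if tcp_reachable else "abandoned"
    return "alive"

print(classify_glidein([True, False, False, False], tcp_reachable=True))
print(classify_glidein([True, False, False, False], tcp_reachable=False))
```

The "abandoned" branch is the case we believe ends up as application unknown on the dashboard; counting how often the logs show it would quantify the effect.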

To give you a sense of scale, up to about 10% of all CRAB jobs on the dashboard during CCRC08 are recorded as application unknown. This is about an order of magnitude larger than the number of jobs recorded as failing!!! Our failure stats in CCRC08 are thus completely meaningless unless we get some understanding of application unknown!

Ok. I think that's it.

From the email "still more lessons learned":

I forgot one more operational nuisance thing.

On OSG, there is no mechanism to even know when a site is "scheduling" a downtime.
As a result, we merrily submit to a site that's known to be non-functional, and merrily fail all the submitted jobs.
See the exchange below with Bockjoo.

Not a good state of affairs, unfortunately.

I think the way out of this is to:

1.) work with OSG to start using the BDII for announcing downtimes.
      We already talked about doing this. Now Burt and I just need to remember to put it into the WBS of OSG for year 3.
2.) Follow up on using this automatically in the gfactory. However, that's already on our follow-up list anyway.
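Once downtimes are published in the BDII, the gfactory-side filter could be as simple as dropping any CE not in production state. A sketch under assumptions: the status strings mimic the GLUE schema's GlueCEStateStatus values, but the site list is made up, and a real implementation would query the BDII over LDAP rather than use a literal dict:

```python
# Sketch of the gfactory-side filter proposed above.  Status strings follow
# the GLUE schema's GlueCEStateStatus convention ("Production", "Draining",
# "Closed"); the CE hostnames here are hypothetical examples.
ce_status = {
    "ce01.example.edu": "Production",
    "ce02.example.edu": "Draining",   # scheduled downtime: stop submitting
    "ce03.example.edu": "Closed",
}

def submittable(status_by_ce: dict[str, str]) -> list[str]:
    """CEs the gfactory should still send glideins to."""
    return [ce for ce, status in status_by_ce.items() if status == "Production"]

print(submittable(ce_status))  # only the Production CE survives the filter
```

With this in place, the failure mode Bockjoo reported (merrily submitting to a CE that is known to be down) goes away as soon as the site announces its downtime.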
  -- FkW - 30 Apr 2008