Stage-out issues at sites

IFCA and storm

IFCA is using storm. I'm told that the fnal srm client is not compatible with storm, and storm supports only srmv2.
  • I verified that lcg-cp and lcg-ls works.
  • need to follow up as this would imply that people assigned to IFCA as their tier-2 would not be able to stage-out from OSG.

T2_FR_CCIN2P3 has no TFC entry

This T2 shares storage with the T1. Started using T1_FR_CCIN2P3_Buffer to get TFC stuff. So far failed after that, send them error log.

Attempts at error accounting

Note: I learned today (May 18th) that the error accounting on the dashboard is incomplete. E.g., I have 18 jobs in Florida that are running for 4 days now. They are actually 9 jobs, each of which have two copies, as I am running twice over the same dataset. The 9 jobs end up being 3 groups of 3 consecutive jobs. I looked at the jdl for these jobs, and it looks like these jobs might be stuck on one file per group. Send Bockjoo email to verify.

These jobs would not show up in my accounting as they end up in the application unknown whenever a job fails to complete, and gets killed by the local batch system.

Moral of the story, I need to go through the logfile directories and sort out which jobs actually completed, and which didn't !!! A way to do this might be the following:

ls -altr WW_incl-1/glideinWMS-WW_incl-1-CMSSW_1_6_7-CSA07-1196178448/share/.condor_temp/*.stdout |grep " 0 May" |wc
This basically searches for 0 length stdout files, and counts them. By walking myself through all directories in this fashion, I can compute a complete tally, in principle. Will need to try this in practice !!!

All failed jobs in detail as of May 14th 7pm

This was based on Sanjay's submissions between May 10th 22:00 and all of mine. Need to redo this more carefully, starting for Sanjay from May 9th 22:00 to May 14th 17:00, and fkw May 9th 22:00 to May 19th 00:00. At this point, only the 18 jobs in Florida are still running for fkw. Those are the ones that get restarted over and over again as mentioned above.

Sanjay from May 14th 18:00 onwards should not be counted as part of this because he's running without the 1s sleep per event from then onwards.

Also, this stats here has the caveat above about making sure application unknown isn't hiding errors the end user sees as well as discussed above.

site success fail 50115 failed files comment
Beijing 1750 0 0    
BEgrid 443 7 7    
Budapest 1354 26 26   4 missing files
BelGrid? -UCL 1855 0
Caltech 3693 9 8   Some pools drop out of dcache. 133-4
CSCS 438 1 0   134
DESY 2117 0    
Wisconsin 2792 12 4   7x135 + 10034
Bari 1257 5 5   2 missing files.
LNL 1833 0 0    
Pisa 1194 5 3   2x-1: ML server can't be reached. This will not cause failed job error in CRAB 2_2 any more. Files did exist. Read problem temporary.
KNU 901 0 0    
MIT 3663 0 0    
Purdue 978 2   2x10016 = one bad node
Aachen 1215 0    
Estonia 785 1 0   bad file in tW dataset
Taiwan 1678 3 3   Files do exist. Seems to be a temporary DPM issue
Brunel 1631 7 7   all 7 files unavailable also to Stuart
London-IC 900 0    
SouthGrid? 1446 0    
IFCA 376 0      
JINR 1804 0    
Nebraska 1811 0      
Florida 1785 0      
Warsaw 1800 0      
UCSD 3421 4 0   4xbad file in tW dataset
UERJ 1748 64 64 80
RHUL 1730 66 65 78 one bad node
Total 42920 212 192  

Other Issues

Batch system config

  • Estonia has no fair share between analysis users. Whoever comes first gets served first. I.e. jobs are dealt with as they appear.
  • At CSCS I got 1/10 the resources that Ale got. Suspecting the same issue as for Estonia.
  • At BEGrid_ULB_VLB I got 1/7 what Ale got. Suspect the same issue as for Estonia.
  • glideins never worked at CIEMAT. We suspect a firewall issue.
  • There's a site at Helsinki that uses ARC
  • A total of at least 13 files across 3 sites were unavailable. Don't have info from UERJ yet. Seems likely to increase this count.
  • 1 bad file in tW dataset accounts for 5 failures. This file was invalidated.

Time out of reads

We presently have no timeouts in cmssw nor crab for files that hang once opened.

Issues in CRAB stage-out logic

The logic we found in cmscp makes no sense.
  • It starts with srmcp v1.
  • then srmcp v2
  • then lcg-cp

The checking after srmcp is insufficient. It relies on the return code, instead of doing srmls always after a failure to make sure the failure is really a failure. Sanjay is going to rewrite this part of crab for our last run of exercise 3. We'll start this on June 1st out of my account.

Issue with gfactory monitoring

The monitoring plots should show in the legend only those sites that are actually used during the time that the plot covers.

Issues with Condor

CIEMAT Problem

We never figured out why we can't run jobs at CIEMAT. The working hypothesis is that CIEMAT has a firewall we can't get around.

1st emergency meeting

Attending: Dan, Igor, Sanjay, fkw

We started out initially with just one schedd and one collector. This was immediately shown to not scale because of latencies due to GSI as well as other reasons at some sites, maybe network connectivity related?

We did three things in response:

  • Moved to a hirarchy of 9 collectors that receive connections from glideins, and one collector that accumulates the info from the 9.
  • Dan provided us with a new collector binary that Igor tested, and put in place (I think). This new collector has a configurable timeout to avoid the collector waiting too long, blocking on a single glidein connection. This was 5min, and can now be dialed down to as little as 1s. Don't know how we actually run the system at this point.
  • started to use all the schedd's on glidein-2.

2nd emergency meeting

Attending: Dan, Sanjay, fkw

We found several irregularities in the way condor works in our system. Let me paste the email summary from this meeting below.

Hi Igor, (cc to Sanjay and Dan)

Below is my minutes of a phone conference Dan, Sanjay and fkw just had.
Figure, it's probably useful for you to know about.

Dan and Sanjay, please feel free to correct me where I misunderstood.

I think we made good progress in this meeting. I for sure learned a lot.

Have a great vacation next week.

Thanks, frank

We did 4 things to verify that we understand the configuration:
Sanity check:

Dan confirmed with netstat that we are indeed running the glidein system such that the startd connects to the shadow.
I.e., GCB is in the way only initially so that schedd can tell startd to contact it.

ok. This is settled.

keep alive:

Dan confirmed that the "keep alive" is sent via tcp in our system, not udp.
We don't know how many "keep alive" are sent before giving up.

ok. This is settled too.

Match history:

Dan confirmed that we do not keep match history. => Need to change this in configuration of user job, i.e. it's a crab modification.
Dan has a way to do this by default via the condor config so that we do NOT need to add it to crab. He's sending info about this in email.

Sanjay is trying to change this now so we have this info in the future.

Event log:

The current version of condor we are using does not do that. We need to switch to a new version when Dan tells it's ready.
See email from Dan on this topic.

We then agreed that there are three fundamental problems, and that we do not know at this point if the three are related:

1.) NumJobStarts > 1
This leads to "application unknown" in the dashboard because actual user jobs are started more than once.

We know that there are legitimate reasons for this (e.g. when a file open request in dcap blocks and never returns
because the file to be opened exists only in pnfs but not on disk.)

We fear that there are also illegitimate reasons, i.e. condor related failures.
Dan saw eviction notices in shadow logs.

This can have one of two origins:

a.) the worker node batch slot lease times out before the job is done, and both the glidein and the user job are evicted.
b.) the glidein evicts a job, and takes a new one.

Not sure how one would distinguish between the two.
Not sure if the eviction is the only cause for NumJobStarts>1, or if there are other reasons as well.

2.)  Condor_q is stuck until it times out:

We quite routinely find that condor_q -global times out one 1 or 2 of the schedd's, thus returning info only from a subset.
Dan looked at the schedd logs, and thinks that this is because of latencies due to gsi authentication.
Dan has thus convinced himself that this is an issue that is serious, and needs to be worked on. He's going to add this to his list of

We could use quill on glidein-2 if we wanted to improve our ability to diagnose things, possibly with the cost of more problems submitting?

3.) We find an exorbitant NumShadowStarts for some jobs into the hundreds.

We agreed that there are in principle two such cases to be distinguished.

a) Those when the glidein never manages to successful run a job because the connection keeps on failing until the batch slot
lease times out, and the glidein gets evicted on the worker node. The logfile Sanjay sent around was such a case, it seems.
This is a giant waste of cpu resources, and we need to learn how to avoid this.
b) Those where the job starts, and then the connection gets lost, and reestablished many times over.

We can distinguish these two by looking at NumJobStarts=0 and NumJobStarts>0
We have not done this, and it's probably worth doing!

Dan noted that the particular logfile Sanjay had found was very sick. It seemed to indicate that there is something very very wrong,
possibly a bug in threading. Would be worthwhile to find another logfile like this.

The evening after the meeting, fkw observed a job at UCSD that clearly was run twice, and succeeded in its entirety both times. Leading to a "file exists" problem for the second incarnation of the job.

Here's the email on that one:


So I am watching a job that's been running for more than 6 hours on the UCSD cluster.

And I think I can prove that condor restarted a job that actually completed correctly !!!
I.e. it looks to me like condor screwed up royally !!!

details on the second running of the job:

jobid on schedd_jobs7@ = 14661.0
MATCH_Name = "vm2@14979@cabinet-6-6-12.local"
LastRemoteHost = "vm2@2001@cabinet-6-6-17.local"
RemoteHost = "vm2@14979@cabinet-6-6-12.local"

NumShadowStarts = 3
NumJobStarts = 2
NumShadowExceptions = 1

LastMatchTime = 1212085038
JobStartDate = 1212085378
LastVacateTime = 1212096246
EnteredCurrentStatus = 1212096247
ShadowBday = 1212096251
JobLastStartDate = 1212096245
JobCurrentStartDate = 1212096251
LastJobLeaseRenewal = 1212108161
ServerTime = 1212108268

And just as I was trying to find it on the cluster, it finished.

Interestingly, the job succeeded to run, and failed in stage-out with a supposed "file already exists" error.
So I looked in dcache for the file and its creation date and find:
[1834] fkw@uaf-1 ~$ ls -altr /pnfs/
-rw-r--r--  1 19060 cms 10239300 May 29 14:01 /pnfs/

What I conclude from this is the following history:

May 29 14:01  Job succeeded writing its file into storage element
May 29 14:24  the same job is restarted on cabinet-6-6-12
May 29 18:24  this second running of this job ends, and fails during stage-out because the file already exists in dcache.

I'm asking Abhishek to show me how I can find the hostname from where the file was written successfully the first time.
If this is 6-6-17 then the case is airtight, and condor screwed up royally !!!

Or am I missing something?

Thanks, Frank

Dan made a number of suggestions for configuration changes to deal with this. Sanjay put those in place, and rewrote the logic in cmscp in before we are rerunning exercise 3 for the last time on June 1st.

Here's the email on the changes in condor configuration:

Hi Dan and Frank.

I have changed the configuration for glidein-2 in order to incorporate the following based on
your suggestions.

MAX_SCHEDD_LOG          = 100000000
MAX_SHADOW_LOG          = 100000000

# Added params for sched issues
JobLeaseDuration = 10800
SUBMIT_EXPRS = $(SUBMIT_EXPRS) LastMatchListLength JobLeaseDuration

Hope this will help to understand better. Please let me know if I missed somethings.



The Aachen Problem

Aachen is a very large T2, growing to be even larger soon. They run all jobs out of a single shared filesystem. In that case, the startd ends up failing to connect to the schedd because the time to write into the startd log is delayed so massively that every connection attempt times out.

More stuff

From email "lessons learned":

-> we failed to run jobs at CIEMAT. Suspicion is that its a problem with outgoing firewall rules.
     This is thus a problem that could bite us anywhere anytime a site changes their firewall rules.
-> latency problems with collector and schedd.
    A single site with long latencies can cause havoc on the WMS.
    We got a fix to condor to have better control of this, and have adjusted the architecture of the WMS to make ourselves less
    vulnerable. However, both improve the situation without really fixing it. Dan is starting to work on a long term fix as part of his DISUN
    contribution to CMS.
-> There are a few issues we worked around that are specific to the interface of CRAB and glideinWMS.
    -> some dashboard reporting related things that we still need to check if they are fixed in CRAB 2_2 anyway.
    -> we've done work on IO monitoring which I think is not yet conclusive.
    -> the bdii connection made assumptions that were invalid, so we coded around it.
    -> we messed with the CRAB<->glideinWMS interface to deal with the latency issues mentioned above.
-> There are some nuisance issues from an operational perspective.
     -> to add a new site to glideinWMS requires
     taking the system down. That's operationally less than ideal.
     -> there are a number of additional checks the glideins could do to make sure we don't fail because of wn issues
     -> the way glideinwms works right now it does not know about sites being in drain-off etc. As a result, we submitted jobs
          to CEs that were off.
     -> there may be changes needed to submit to nordugrid. We don't know this yet for sure because we haven't done the submissions.
-> I am convinced that there's a bug in frontend because we ended up with many more glideins pending at various times than we were supposed to
     have according to the config of frontend.
-> I am not convinced that we understand the scalability. As far as I'm concerned, the total number of running jobs we achieved was disappointing.
     The peak I have ever seen is about 3000, maybe slightly more.
-> We need to settle the question of how we deal with multi-user operations.
-> I never tried using the 'top', 'cat' etc. commands. I'd like to try that sometime!
     -> as an aside, it's not clear how this works once we use glexec. Igor should educate us.
-> CRAB server 1_0_0 is now out. We should deploy that, and understand what the operations implications are.

I've probably forgotten something. So please add to the list.

From email "more lessons learned":

-> crab -status isn't working
-> crab -kill isn't working.

The issue with the second is kind of serious because as it stands now, we can not kill a job, and have the fact that we killed the job be recorded on the dashboard.

-> We see a lot of "application unknown" on the dashboard. Initially, we thought this was because of cockpit error on our part.
     However, we haven't made any mistakes of this sort lately, and still see a lot of application unknown. There is thus something fundamentally
     wrong here that we do not fully understand.
     As of now, we have indications of three different types of issues:
     -> jobs get hung in dcap because the file doesn't exist, and the batch system eventually kills the job. This is a very small proportion.
     -> there are "keep alive" messages between startd and schedd. When a number of them are missed then the schedd considers the startd gone,
          and moves on. The startd commits suicide, I think. fkw is a bit hazy how this works in detail.
          Sanjay says he's used a mechanism in Cronus that replaces tcp for udp IFF a number of keep alive messages in a row fail to be received.
          I.e. a glidein is not abandoned without trying tcp instead of udp. It would be good if we could figure out from the logs if this is contributing
          to the "application unknown" in the dashboard. We know that if this happens we are guaranteed that the job is recorded as application unknown.
          In addition, Sanjay used the "time to live" from the job classAd to prevent schedd and startd giving up on each other within the expected
          runtime of the job. A feature we may want to add if we can convince ourselves that this is contributing to the application unknown.
     -> At present, the glidein knows about the amount of time it has left to live, but the crab job doesn't tell it the amount of time it needs.
          As a result, it's conceivable that some of the application unknown are caused by timed out batch slot leases. It would be good if we could 
          quantify this from logs, and then in the future communicate a "time to live" in the job classAd. We believe that a job that times out in
          the batch slot will contribute to application unknown.

To give you a sense of scale, we see up to about 10% of all crab jobs on the dashboard during CCRC08 recorded as application unknown.
This is about an order of magnitude larger than the number of jobs that are recorded as failing !!! Our failure stats in CCRC08 is thus completely
meaningless unless we get some understanding of application unknown!

Ok. I think that's it.

From the email "still more lessons learned":

I forgot one more operational nuisance thing.

On OSG, there is no mechanism to even know when a site is "scheduling" a downtime.
As a result, we merrily submit to a site that's known to be non-functional, and merrily fail all the submitted jobs.
See the exchange below with Bockjoo.

Not a good state of affairs, unfortunately.

I think the way out of this is to:

1.) work with OSG to start using the bdii for announcing downtimes.
      We already talked about doing this. Now Burt and I just need to remember to put it into the WBS of OSG for year 3.
2.) Follow up on using this automatically in the gfactory. However, that's already on our follow-up list anyway.

EGEE procedures

Here's an email from Claudio Grandi about how EGEE works the arrival of a job at a worker node: We need to adopt this in glideinWMS appropriately !!!

*** Discussion title: Tier-2 Activities

Brew, CAJ (Chris wrote:
$DG_WL_SCRATCH is set by the site (if it wants to) and is honoured by
jobs submitted through the lcg-RB. gLite-WMS jobs don't honour this but
you can create a file /opt/glite/ppd/ that gets run before
handing control to the job payload.

Exactly. As mentioned in the twiki page I pointed you at 
the site manager has to cd into the top execution directory (e.g. /tmp) 
in the script ${GLITE_LOCAL_CUSTOMIZATION_DIR}/ Then the wrapper 
creates a directory (e.g. /tmp/) and uses it to start the job.

To be precise the the job wrapper executes:
- ${OSG_GRID}/ if it exists (relevant on OSG grid)
- the script that may be present in the VO 
directory ($VO_CMS_SW_DIR)
- the site customization script in ${GLITE_LOCAL_CUSTOMIZATION_DIR}/
- creates  under the current location and cd's to it.

Note that there are other customization hooks that the site manager may 
use before and after the job execution.

                     Cheers, Claudio

and a second one refering to this first one:

Please see
for what concerns the expected behavior on EGEE. Instead of work_dir="$EDG_WL_SCRATCH" you have to execute

  if [ -f "${GLITE_LOCAL_CUSTOMIZATION_DIR}/" ]; then

                       Cheers, Claudio

-- FkW - 30 Apr 2008
Topic revision: r14 - 2008/06/04 - 16:28:39 - FkW
This site is powered by the TWiki collaboration platformCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback