Difference: FkwCCRC08OpenIssues (10 vs. 11)

Revision 11, 2008/05/31 - Main.FkW

Line: 1 to 1
Changed:
<
<

Stage-out issues

>
>

Stage-out issues at sites

 

IFCA and StoRM

IFCA is using StoRM. I'm told that the FNAL srm client is not compatible with StoRM, and StoRM supports only SRMv2.
Line: 77 to 79
 
  • glideins never worked at CIEMAT. We suspect a firewall issue.
  • There's a site at Helsinki that uses ARC.
  • A total of at least 13 files across 3 sites were unavailable. We don't have info from UERJ yet, so this count seems likely to increase.
Changed:
<
<
  • 1 bad file in tW dataset accounts for 5 failures.
>
>
  • 1 bad file in tW dataset accounts for 5 failures. This file was invalidated.
 

Timeouts on reads

We presently have no timeouts in either cmssw or crab for files that hang once opened.
Line: 90 to 91
 
  • then lcg-cp

The checking after srmcp is insufficient: it relies only on the return code, instead of always doing an srmls after a failure to verify that the failure is real.

Added:
>
>
Sanjay is going to rewrite this part of crab for our last run of exercise 3. We'll start this on June 1st out of my account.
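To make concrete what a better check could look like, here is a minimal sketch of stage-out verification; the variable names ($LOCAL_FILE, $DEST_SURL) and the exact client flags are illustrative assumptions, not the actual cmscp code Sanjay is writing.

#!/bin/sh
# Sketch only: verify the stage-out instead of trusting the srmcp return code.
# $LOCAL_FILE (absolute path on the worker node) and $DEST_SURL (srm:// URL)
# are assumed to be set by the job wrapper.

srmcp -2 "file://$LOCAL_FILE" "$DEST_SURL"
rc=$?

if [ $rc -ne 0 ]; then
  # Do not trust the return code alone: check whether the file actually
  # arrived at the destination before declaring the transfer a failure.
  if srmls -2 "$DEST_SURL" > /dev/null 2>&1; then
    echo "srmcp returned $rc but the file is at the destination; treating as success"
    rc=0
  else
    # Second attempt with lcg-cp, as in the existing fallback chain.
    lcg-cp -b -D srmv2 "file://$LOCAL_FILE" "$DEST_SURL"
    rc=$?
  fi
fi

exit $rc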
 

Issue with gfactory monitoring

The monitoring plots should show in the legend only those sites that are actually used during the time that the plot covers.

Added:
>
>

Issues with Condor

1st emergency meeting

Attending: Dan, Igor, Sanjay, fkw

We started out initially with just one schedd and one collector. This was immediately shown not to scale, because of latencies due to GSI as well as other reasons at some sites, possibly network connectivity related.

We did three things in response:

  • Moved to a hierarchy of 9 collectors that receive connections from glideins, plus one top-level collector that accumulates the info from the 9 (see the config sketch after this list).
  • Dan provided us with a new collector binary that Igor tested, and put in place (I think). This new collector has a configurable timeout to avoid the collector blocking too long on a single glidein connection. This used to be 5 min and can now be dialed down to as little as 1 s. We don't know how the system is actually configured at this point.
  • Started to use all the schedds on glidein-2.
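For reference, here is a rough sketch of one standard way to build such a collector tree in the condor configuration; the host name, port, and daemon name are made up for illustration, and this is not a copy of the actual glidein-2 setup.

# Main collector (accumulates the ads) runs on the default port 9618.
COLLECTOR_HOST = glidein-collector.example.edu

# One of the 9 sub-collectors that the glideins report to,
# run as an extra collector daemon on its own port.
COLLECTOR_SUB1      = $(COLLECTOR)
COLLECTOR_SUB1_ARGS = -f -p 9620
DAEMON_LIST         = $(DAEMON_LIST), COLLECTOR_SUB1

# Make the sub-collector forward everything it receives to the main collector.
CONDOR_VIEW_HOST = glidein-collector.example.edu:9618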

2nd emergency meeting

Attending: Dan, Sanjay, fkw

We found several irregularities in the way condor works in our system. Let me paste the email summary from this meeting below.

Hi Igor, (cc to Sanjay and Dan)

Below is my minutes of a phone conference Dan, Sanjay and fkw just had.
Figure it's probably useful for you to know about.

Dan and Sanjay, please feel free to correct me where I misunderstood.

I think we made good progress in this meeting. I for sure learned a lot.

Have a great vacation next week.

Thanks, frank

We did 4 things to verify that we understand the configuration:
---------------------------------------------------------------------------------------
Sanity check:

Dan confirmed with netstat that we are indeed running the glidein system such that the startd connects to the shadow.
I.e., GCB is in the way only initially so that schedd can tell startd to contact it.

ok. This is settled.

keep alive:

Dan confirmed that the "keep alive" is sent via tcp in our system, not udp.
We don't know how many "keep alive" are sent before giving up.

ok. This is settled too.
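If we ever want to pin down how many keep alives can be missed before giving up, the relevant condor knobs should be ALIVE_INTERVAL and MAX_CLAIM_ALIVES_MISSED; the values below are my understanding of the documented defaults, not what glidein-2 actually runs with.

# How often the schedd sends a keep alive for a claim, in seconds (default).
ALIVE_INTERVAL = 300
# How many keep alives may be missed before the claim is considered broken (default).
MAX_CLAIM_ALIVES_MISSED = 6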

Match history:

Dan confirmed that we do not keep match history. => Need to change this in configuration of user job, i.e. it's a crab modification.
Dan has a way to do this by default via the condor config so that we do NOT need to add it to crab. He's sending info about this in email.

Sanjay is trying to change this now so we have this info in the future.
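Once that is in place, the match history should show up as extra attributes in the job ClassAd; assuming the LastMatchListLength mechanism works the way I understand it, it can be inspected with something like:

# Inspect the recorded match history of a given job (the jobid is a placeholder).
condor_q -long <cluster>.<proc> | grep LastMatchName
# expected output is a list of the form
#   LastMatchName0 = "vm2@...@some-node.local"
#   LastMatchName1 = "vm2@...@another-node.local"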

Event log:

The current version of condor we are using does not support this. We need to switch to a new version when Dan tells us it's ready.
See email from Dan on this topic.
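For when we do switch: as far as I understand it, the schedd-wide event log in newer condor versions is turned on with the EVENT_LOG config parameter, roughly like this (the path is a placeholder, and it needs the version Dan is preparing):

# Write a single event log covering all jobs handled by this schedd.
EVENT_LOG = $(LOG)/EventLog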

We then agreed that there are three fundamental problems, and that we do not know at this point if the three are related:
-------------------------------------------------------------------------------------------------------------------------------------------------------------------

1.) NumJobStarts > 1
This leads to "application unknown" in the dashboard because actual user jobs are started more than once.

We know that there are legitimate reasons for this (e.g. when a file open request in dcap blocks and never returns
because the file to be opened exists only in pnfs but not on disk.)

We fear that there are also illegitimate reasons, i.e. condor related failures.
Dan saw eviction notices in shadow logs.

This can have one of two origins:

a.) the worker node batch slot lease times out before the job is done, and both the glidein and the user job are evicted.
b.) the glidein evicts a job, and takes a new one.

Not sure how one would distinguish between the two.
Not sure if the eviction is the only cause for NumJobStarts>1, or if there are other reasons as well.

2.)  Condor_q is stuck until it times out:

We quite routinely find that condor_q -global times out on 1 or 2 of the schedds, thus returning info only from a subset.
Dan looked at the schedd logs, and thinks that this is because of latencies due to gsi authentication.
Dan has thus convinced himself that this is an issue that is serious, and needs to be worked on. He's going to add this to his list of
priorities.

We could use quill on glidein-2 if we wanted to improve our ability to diagnose things, possibly at the cost of more problems with submission?
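As a stopgap until Dan's fix, we can at least query the schedds one by one so that a single stuck schedd does not hide the others; a trivial sketch, assuming it is run on glidein-2 where condor_status can see all the schedds:

#!/bin/sh
# Query each schedd separately instead of using condor_q -global,
# so that one schedd timing out does not suppress the output of the rest.
for s in $(condor_status -schedd -format "%s\n" Name); do
    echo "=== $s ==="
    condor_q -name "$s" || echo "condor_q against $s failed or timed out"
done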

 
Added:
>
>
3.) We find an exorbitant NumShadowStarts for some jobs, into the hundreds.

We agreed that there are in principle two such cases to be distinguished.

a) Those where the glidein never manages to successfully run a job because the connection keeps on failing until the batch slot lease times out, and the glidein gets evicted on the worker node. The logfile Sanjay sent around seems to have been such a case. This is a giant waste of cpu resources, and we need to learn how to avoid it.

b) Those where the job starts, and then the connection gets lost and reestablished many times over.

We can distinguish these two by looking at NumJobStarts = 0 versus NumJobStarts > 0. We have not done this, and it's probably worth doing (see the sketch below)!
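For example, a condor_q constraint would do the split; the threshold of 10 shadow starts is an arbitrary choice for illustration.

# Case a): many shadow starts, but the job itself never started.
condor_q -global -constraint 'NumShadowStarts > 10 && NumJobStarts == 0' \
    -format "%d." ClusterId -format "%d\n" ProcId

# Case b): the job did start, but the connection was lost and re-established.
condor_q -global -constraint 'NumShadowStarts > 10 && NumJobStarts > 0' \
    -format "%d." ClusterId -format "%d\n" ProcId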

Dan noted that the particular logfile Sanjay had found was very sick. It seemed to indicate that something is very, very wrong, possibly a bug in threading. It would be worthwhile to find another logfile like this.

The evening after the meeting, fkw observed a job at UCSD that clearly was run twice, and succeeded in its entirety both times, leading to a "file exists" problem for the second incarnation of the job.

Here's the email on that one:

Hi,

So I am watching a job that's been running for more than 6 hours on the UCSD cluster.

And I think I can prove that condor restarted a job that actually completed correctly !!!
I.e. it looks to me like condor screwed up royally !!!

details on the second running of the job:

jobid on schedd_jobs7@ = 14661.0
MATCH_Name = "vm2@14979@cabinet-6-6-12.local"
LastRemoteHost = "vm2@2001@cabinet-6-6-17.local"
RemoteHost = "vm2@14979@cabinet-6-6-12.local"

NumShadowStarts = 3
NumJobStarts = 2
NumShadowExceptions = 1

LastMatchTime = 1212085038
JobStartDate = 1212085378
LastVacateTime = 1212096246
EnteredCurrentStatus = 1212096247
ShadowBday = 1212096251
JobLastStartDate = 1212096245
JobCurrentStartDate = 1212096251
LastJobLeaseRenewal = 1212108161
ServerTime = 1212108268

And just as I was trying to find it on the cluster, it finished.

Interestingly, the job ran successfully, and then failed in stage-out with a supposed "file already exists" error.
So I looked in dcache for the file and its creation date and find:
[1834] fkw@uaf-1 ~$ ls -altr /pnfs/sdsc.edu/data3/cms/phedex/store/user/spadhi/CCRC08/t2data2.t2.ucsd.edu/WW_incl_P3-48471/TestFile_31.txt
-rw-r--r--  1 19060 cms 10239300 May 29 14:01 /pnfs/sdsc.edu/data3/cms/phedex/store/user/spadhi/CCRC08/t2data2.t2.ucsd.edu/WW_incl_P3-48471/TestFile_31.txt

What I conclude from this is the following history:

May 29 14:01  Job succeeded writing its file into storage element
May 29 14:24  the same job is restarted on cabinet-6-6-12
May 29 18:24  this second running of this job ends, and fails during stage-out because the file already exists in dcache.

I'm asking Abhishek to show me how I can find the hostname from where the file was written successfully the first time.
If this is 6-6-17 then the case is airtight, and condor screwed up royally !!!

Or am I missing something?

Thanks, Frank
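For the record, the epoch times in the ClassAd above can be converted to wall-clock times to cross-check the timeline; assuming the cluster clocks are on US Pacific time, GNU date gives:

# JobStartDate: first start of the job.
TZ=America/Los_Angeles date -d @1212085378    # Thu May 29 11:22:58 PDT 2008
# JobCurrentStartDate: the restart that later failed stage-out with "file exists".
TZ=America/Los_Angeles date -d @1212096251    # Thu May 29 14:24:11 PDT 2008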

Dan made a number of suggestions for configuration changes to deal with this. Sanjay put those in place, and rewrote the logic in cmscp in crab_template.sh before we rerun exercise 3 for the last time on June 1st.

Here's the email on the changes in condor configuration:

Hi Dan and Frank.

I have changed the configuration for glidein-2 in order to incorporate the following based on
your suggestions.

MAX_SCHEDD_LOG          = 100000000
SCHEDD_DEBUG            = D_FULLDEBUG
MAX_SHADOW_LOG          = 100000000
SHADOW_DEBUG            = D_FULLDEBUG

# Added params for sched issues
LastMatchListLength=10
JobLeaseDuration = 10800
SUBMIT_EXPRS = $(SUBMIT_EXPRS) LastMatchListLength JobLeaseDuration
SHADOW_MAX_JOB_CLEANUP_RETRIES = 20

Hope this will help to understand things better. Please let me know if I missed something.

 Thanks.

   Sanjay
 

-- FkW - 30 Apr 2008

 