Line: 1 to 1 | ||||||||
Stage-out issues at sites | ||||||||
Line: 385 to 385 | ||||||||
2.) Follow up on using this automatically in the gfactory. However, that's already on our follow-up list anyway. | ||||||||
Added: | ||||||||
> > EGEE procedures
Here's an email from Claudio Grandi about how EGEE handles the arrival of a job at a worker node. We need to adopt this in glideinWMS appropriately!

*** Discussion title: Tier-2 Activities

Brew, CAJ (Chris) wrote:
[...] $EDG_WL_SCRATCH is set by the site (if it wants to) and is honoured by jobs submitted through the lcg-RB. gLite-WMS jobs don't honour this, but you can create a file /opt/glite/ppd/cp_1.sh that gets run before handing control to the job payload.

Exactly. As mentioned in the twiki page I pointed you at (https://twiki.cern.ch/twiki/bin/view/LCG/GenericInstallGuide310#Known_issues_divers), the site manager has to cd into the top execution directory (e.g. /tmp) in the script ${GLITE_LOCAL_CUSTOMIZATION_DIR}/cp_1.sh. The wrapper then creates a directory (e.g. under /tmp) and a second one referring to this first one. Please see https://hypernews.cern.ch/HyperNews/CMS/get/t2/202/1/2/1/1/1/1/1/1/2/1/1/1/2/1/1.html for what concerns the expected behavior on EGEE.

Instead of work_dir="$EDG_WL_SCRATCH" you have to execute cp_1.sh:

  if [ -n "${GLITE_LOCAL_CUSTOMIZATION_DIR}" ]; then
    if [ -f "${GLITE_LOCAL_CUSTOMIZATION_DIR}/cp_1.sh" ]; then
      . "${GLITE_LOCAL_CUSTOMIZATION_DIR}/cp_1.sh"
    fi
  fi

Cheers, Claudio
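For glideinWMS, the practical consequence would be that the glidein startup script picks its working directory by running the site's cp_1.sh hook when it exists, and only then falls back to the scratch variables. A minimal sketch of that selection logic; OSG_WN_TMP and the TMPDIR/PWD fallbacks are my assumptions for non-EGEE sites, not something Claudio specified.

#!/bin/sh
# Sketch only: how a glidein startup script might pick its working directory.
# GLITE_LOCAL_CUSTOMIZATION_DIR and EDG_WL_SCRATCH come from the email above;
# OSG_WN_TMP and the TMPDIR/PWD fallbacks are assumptions for non-EGEE sites.

select_work_dir() {
    # On EGEE, run the site's cp_1.sh hook first; it may cd into the
    # scratch area the site manager chose.
    if [ -n "${GLITE_LOCAL_CUSTOMIZATION_DIR}" ] && \
       [ -f "${GLITE_LOCAL_CUSTOMIZATION_DIR}/cp_1.sh" ]; then
        . "${GLITE_LOCAL_CUSTOMIZATION_DIR}/cp_1.sh"
        echo "$(pwd)"
    elif [ -n "${EDG_WL_SCRATCH}" ]; then
        echo "${EDG_WL_SCRATCH}"        # lcg-RB style sites
    elif [ -n "${OSG_WN_TMP}" ]; then
        echo "${OSG_WN_TMP}"            # OSG worker nodes (assumed)
    else
        echo "${TMPDIR:-$(pwd)}"        # last resort
    fi
}

work_dir="$(select_work_dir)/glidein_$$"
mkdir -p "${work_dir}" && cd "${work_dir}" || exit 1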
-- FkW - 30 Apr 2008 |
Line: 1 to 1 | ||||||||
Stage-out issues at sites | ||||||||
Line: 100 to 100 | ||||||||
Issues with Condor | ||||||||
Added: | ||||||||
> > CIEMAT Problem
We never figured out why we can't run jobs at CIEMAT. The working hypothesis is that CIEMAT has a firewall we can't get around.
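If we ever want to pin down the firewall hypothesis, a tiny probe job at CIEMAT could test outbound TCP back to our infrastructure. A sketch, with the host name and port list as placeholders for whatever the collector/GCB actually use:

#!/bin/sh
# Minimal outbound-connectivity probe to run inside a test job at CIEMAT.
# Host name and port list are placeholders for the actual collector/GCB setup.
COLLECTOR_HOST="glidein-collector.example.edu"
for port in 9618 9614 9615; do
    if nc -z -w 10 "${COLLECTOR_HOST}" "${port}"; then
        echo "outbound TCP to ${COLLECTOR_HOST}:${port} OK"
    else
        echo "outbound TCP to ${COLLECTOR_HOST}:${port} blocked or timed out"
    fi
done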
1st emergency meeting
Attending: Dan, Igor, Sanjay, fkw
Line: 295 to 299 | ||||||||
In that case, the startd ends up failing to connect to the schedd because the time to write into the startd log is delayed so massively that every connection attempt times out. | ||||||||
Added: | ||||||||
> > More stuff
From email "lessons learned":

-> we failed to run jobs at CIEMAT. Suspicion is that it's a problem with outgoing firewall rules. This is thus a problem that could bite us anywhere, anytime a site changes their firewall rules.
-> latency problems with collector and schedd. A single site with long latencies can cause havoc on the WMS. We got a fix to condor to have better control of this, and have adjusted the architecture of the WMS to make ourselves less vulnerable. However, both improve the situation without really fixing it. Dan is starting to work on a long-term fix as part of his DISUN contribution to CMS.
-> There are a few issues we worked around that are specific to the interface of CRAB and glideinWMS.
-> some dashboard-reporting related things that we still need to check whether they are fixed in CRAB 2_2 anyway.
-> we've done work on IO monitoring which I think is not yet conclusive.
-> the bdii connection made assumptions that were invalid, so we coded around it.
-> we messed with the CRAB<->glideinWMS interface to deal with the latency issues mentioned above.
-> There are some nuisance issues from an operational perspective.
-> to add a new site to glideinWMS requires taking the system down. That's operationally less than ideal.
-> there are a number of additional checks the glideins could do to make sure we don't fail because of wn issues.
-> the way glideinWMS works right now, it does not know about sites being in drain-off etc. As a result, we submitted jobs to CEs that were off.
-> there may be changes needed to submit to NorduGrid. We don't know this yet for sure because we haven't done the submissions.
-> I am convinced that there's a bug in frontend, because we ended up with many more glideins pending at various times than we were supposed to have according to the config of frontend.
-> I am not convinced that we understand the scalability. As far as I'm concerned, the total number of running jobs we achieved was disappointing. The peak I have ever seen is about 3000, maybe slightly more.
-> We need to settle the question of how we deal with multi-user operations.
-> I never tried using the 'top', 'cat' etc. commands. I'd like to try that sometime!
-> as an aside, it's not clear how this works once we use glexec. Igor should educate us.
-> CRAB server 1_0_0 is now out. We should deploy that, and understand what the operational implications are.

I've probably forgotten something. So please add to the list.

From email "more lessons learned":

-> crab -status isn't working
-> crab -kill isn't working.
The issue with the second is kind of serious, because as it stands now we cannot kill a job and have the fact that we killed the job be recorded on the dashboard.
Added: | ||||||||
> > | -> We see a lot of "application unknown" on the dashboard. Initially, we thought this was because of cockpit error on our part.
However, we haven't made any mistakes of this sort lately, and still see a lot of application unknown. There is thus something fundamentally
wrong here that we do not fully understand.
As of now, we have indications of three different types of issues:
-> jobs get hung in dcap because the file doesn't exist, and the batch system eventually kills the job. This is a very small proportion.
-> there are "keep alive" messages between startd and schedd. When a number of them are missed then the schedd considers the startd gone,
and moves on. The startd commits suicide, I think. fkw is a bit hazy how this works in detail.
Sanjay says he's used a mechanism in Cronus that replaces tcp for udp IFF a number of keep alive messages in a row fail to be received.
I.e. a glidein is not abandoned without trying tcp instead of udp. It would be good if we could figure out from the logs if this is contributing
to the "application unknown" in the dashboard. We know that if this happens we are guaranteed that the job is recorded as application unknown.
In addition, Sanjay used the "time to live" from the job classAd to prevent schedd and startd giving up on each other within the expected
runtime of the job. A feature we may want to add if we can convince ourselves that this is contributing to the application unknown.
-> At present, the glidein knows about the amount of time it has left to live, but the crab job doesn't tell it the amount of time it needs.
As a result, it's conceivable that some of the application unknown are caused by timed out batch slot leases. It would be good if we could
quantify this from logs, and then in the future communicate a "time to live" in the job classAd. We believe that a job that times out in
the batch slot will contribute to application unknown.
To give you a sense of scale, we see up to about 10% of all crab jobs on the dashboard during CCRC08 recorded as application unknown.
This is about an order of magnitude larger than the number of jobs that are recorded as failing!!! Our failure stats in CCRC08 are thus completely
meaningless unless we get some understanding of application unknown!
Ok. I think that's it.
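On the "time to live" idea above: a possible sketch of what the submission side could add, shown here as lines appended to a condor submit file by the job wrapper. The attribute name EstimatedRunTime is invented for illustration, and GLIDEIN_ToDie is the glidein-lifetime attribute I believe the glideinWMS startd advertises; both would need to be checked against our factory and frontend configuration before relying on this.

# Hypothetical sketch: the wrapper appends an expected-runtime attribute to the
# condor submit file and a matching constraint. EstimatedRunTime is an invented
# name; GLIDEIN_ToDie (glidein end-of-life, unix time) must be verified against
# our factory/frontend configuration.
RUNTIME_SECONDS=28800   # 8 hours, purely illustrative

cat >> glidein_job.submit <<EOF
+EstimatedRunTime = ${RUNTIME_SECONDS}
# this clause would have to be ANDed into the job's existing requirements:
requirements = (GLIDEIN_ToDie =?= UNDEFINED) || (GLIDEIN_ToDie - CurrentTime > ${RUNTIME_SECONDS})
EOF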
From the email "still more lessons learned":
I forgot one more operational nuisance thing. On OSG, there is no mechanism to even know when a site is "scheduling" a downtime. As a result, we merrily submit to a site that's known to be non-functional, and merrily fail all the submitted jobs. See the exchange below with Bockjoo. Not a good state of affairs, unfortunately. I think the way out of this is to:
1.) work with OSG to start using the bdii for announcing downtimes. We already talked about doing this. Now Burt and I just need to remember to put it into the WBS of OSG for year 3. (A sketch of such a BDII check follows below.)
2.) Follow up on using this automatically in the gfactory. However, that's already on our follow-up list anyway.
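For what 1.) could look like in the gfactory, here is a sketch of a BDII check before submitting glideins to a CE. The BDII host and CE name are just examples; GlueCEStateStatus is the standard Glue attribute (values like Production, Draining, Closed).

#!/bin/sh
# Ask the BDII for the advertised state of a CE before submitting glideins to it.
# BDII host and CE name are examples; GlueCEStateStatus is the standard Glue attribute.
BDII_HOST="lcg-bdii.cern.ch"
CE="ce01.example-site.org:2119/jobmanager-pbs-cms"

status=$(ldapsearch -x -LLL -H "ldap://${BDII_HOST}:2170" -b o=grid \
           "(GlueCEUniqueID=${CE})" GlueCEStateStatus \
         | awk '/^GlueCEStateStatus:/ {print $2}')

case "${status}" in
    Production) echo "${CE} is in Production, ok to submit" ;;
    "")         echo "${CE} not found in the BDII, skip it" ;;
    *)          echo "${CE} is ${status}, skip it" ;;
esac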
-- FkW - 30 Apr 2008 |
Line: 1 to 1 | ||||||||
Stage-out issues at sites | ||||||||
Line: 290 to 290 | ||||||||
Sanjay | ||||||||
Added: | ||||||||
> > The Aachen Problem
Aachen is a very large T2, growing to be even larger soon. They run all jobs out of a single shared filesystem. In that case, the startd ends up failing to connect to the schedd, because the time to write into the startd log is delayed so massively that every connection attempt times out.
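One thing we could try at Aachen-like sites is to make sure the glidein never places its condor log and execute directories on the shared filesystem. A sketch of the idea; the candidate local-scratch paths are assumptions, and the glidein startup script would have to write the two knobs into the condor_config it generates.

#!/bin/sh
# Sketch: keep the glidein's log/execute area on node-local disk so startd log
# writes don't stall on the shared filesystem. Candidate paths are assumptions.
for d in "$TMPDIR" /scratch /tmp; do
    if [ -n "$d" ] && [ -d "$d" ] && [ -w "$d" ]; then
        LOCAL_DIR="$d"
        break
    fi
done
GLIDEIN_DIR="${LOCAL_DIR:?no writable local scratch found}/glidein_$$"
mkdir -p "${GLIDEIN_DIR}/log" "${GLIDEIN_DIR}/execute"

# LOG and EXECUTE are standard condor_config knobs; the glidein startup script
# would append them to the condor_config it generates before starting the startd.
cat >> condor_config.local <<EOF
LOG     = ${GLIDEIN_DIR}/log
EXECUTE = ${GLIDEIN_DIR}/execute
EOF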
-- FkW - 30 Apr 2008 |
Line: 1 to 1 | ||||||||
Changed: | ||||||||
< < | Stage-out issues | |||||||
> > |
Stage-out issues at sites | |||||||
IFCA and StoRM
IFCA is using StoRM. I'm told that the FNAL srm client is not compatible with StoRM, and StoRM supports only SRMv2.
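If StoRM really only speaks SRMv2, the stage-out command has to be forced onto the v2 interface. A hedged example using lcg-cp; the endpoint, port and paths are placeholders, and which client and flags are actually available on the worker node needs to be checked.

#!/bin/sh
# Illustrative stage-out forced onto the SRMv2 interface, e.g. for a StoRM endpoint.
# Endpoint, port and paths are placeholders.
LOCAL_FILE="file://$(pwd)/output.root"
DEST="srm://srm01.ifca.es:8444/srm/managerv2?SFN=/cms/store/user/someuser/output.root"

# -b: skip BDII endpoint discovery, -D srmv2: talk SRMv2 to the endpoint
lcg-cp -b -D srmv2 --vo cms "${LOCAL_FILE}" "${DEST}" \
  || { echo "stage-out to ${DEST} failed"; exit 60307; }   # 60307: CMS stage-out error code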
Line: 77 to 79 | ||||||||
| ||||||||
Changed: | ||||||||
< < |
| |||||||
> > |
| |||||||
Time out of reads
We presently have no timeouts in either cmssw or crab for files that hang once opened.
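Until proper timeouts exist in cmssw/crab, the job wrapper could enforce a coarse one itself by watching whether cmsRun still produces output. A sketch; the one-hour stall limit, log name and cmsRun arguments are illustrative.

#!/bin/sh
# Coarse watchdog: kill cmsRun if its log has not grown for STALL_LIMIT seconds.
# Log name, pset name and the one-hour limit are illustrative.
STALL_LIMIT=3600
LOG=cmsRun.log

cmsRun -j FrameworkJobReport.xml pset.py > "$LOG" 2>&1 &
pid=$!

last_size=-1
stalled=0
while kill -0 "$pid" 2>/dev/null; do
    sleep 60
    size=$(stat -c %s "$LOG" 2>/dev/null || echo 0)
    if [ "$size" -eq "$last_size" ]; then
        stalled=$((stalled + 60))
        if [ "$stalled" -ge "$STALL_LIMIT" ]; then
            echo "cmsRun produced no output for ${STALL_LIMIT}s, assuming a hung read and killing it"
            kill -9 "$pid"
            break
        fi
    else
        stalled=0
        last_size="$size"
    fi
done
wait "$pid"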
Line: 90 to 91 | ||||||||
| ||||||||
Added: | ||||||||
> > | Sanjay is going to rewrite this part of crab for our last run of exercise 3. We'll start this on June 1st out of my account. | |||||||
Issue with gfactory monitoring
The monitoring plots should show in the legend only those sites that are actually used during the time that the plot covers.
Added: | ||||||||
> > Issues with Condor
1st emergency meeting
Attending: Dan, Igor, Sanjay, fkw
We started out initially with just one schedd and one collector. This was immediately shown not to scale because of latencies due to GSI, as well as other reasons at some sites, maybe network-connectivity related. We did three things in response:
2nd emergency meeting
Attending: Dan, Sanjay, fkw
We found several irregularities in the way condor works in our system. Let me paste the email summary from this meeting below.

Hi Igor, (cc to Sanjay and Dan)
Below are my minutes of a phone conference Dan, Sanjay and fkw just had. Figure it's probably useful for you to know about. Dan and Sanjay, please feel free to correct me where I misunderstood. I think we made good progress in this meeting. I for sure learned a lot. Have a great vacation next week. Thanks, frank

We did 4 things to verify that we understand the configuration:
---------------------------------------------------------------------------------------
Sanity check: Dan confirmed with netstat that we are indeed running the glidein system such that the startd connects to the shadow. I.e., GCB is in the way only initially, so that the schedd can tell the startd to contact it. Ok, this is settled.

Keep alive: Dan confirmed that the "keep alive" is sent via tcp in our system, not udp. We don't know how many "keep alive" messages are sent before giving up. Ok, this is settled too.

Match history: Dan confirmed that we do not keep match history. => Need to change this in the configuration of the user job, i.e. it's a crab modification. Dan has a way to do this by default via the condor config so that we do NOT need to add it to crab. He's sending info about this in email. Sanjay is trying to change this now so we have this info in the future.

Event log: The current version of condor we are using does not do that. We need to switch to a new version when Dan tells us it's ready. See email from Dan on this topic.

We then agreed that there are three fundamental problems, and that we do not know at this point if the three are related:
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
1.) NumJobStarts > 1
This leads to "application unknown" in the dashboard because actual user jobs are started more than once. We know that there are legitimate reasons for this (e.g. when a file open request in dcap blocks and never returns because the file to be opened exists only in pnfs but not on disk). We fear that there are also illegitimate reasons, i.e. condor-related failures. Dan saw eviction notices in shadow logs. This can have one of two origins:
a.) the worker node batch slot lease times out before the job is done, and both the glidein and the user job are evicted.
b.) the glidein evicts a job, and takes a new one.
Not sure how one would distinguish between the two. Not sure if eviction is the only cause for NumJobStarts > 1, or if there are other reasons as well.

2.) Condor_q is stuck until it times out
We quite routinely find that condor_q -global times out on 1 or 2 of the schedds, thus returning info only from a subset. Dan looked at the schedd logs and thinks that this is because of latencies due to gsi authentication. Dan has thus convinced himself that this is a serious issue that needs to be worked on. He's going to add it to his list of priorities. We could use quill on glidein-2 if we wanted to improve our ability to diagnose things, possibly at the cost of more problems submitting?
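As a stop-gap for problem 2, we could stop relying on a single condor_q -global and instead query each schedd separately with its own time limit, so one stuck schedd doesn't hide the rest. A sketch (assumes a timeout utility is available on the node; the 60 s limit is arbitrary):

#!/bin/sh
# Query each schedd individually so one stuck schedd doesn't hide the others.
# Assumes a 'timeout' utility exists on the node; the 60 s limit is arbitrary.
condor_status -schedd -format '%s\n' Name | while read schedd; do
    echo "=== ${schedd} ==="
    if ! timeout 60 condor_q -name "${schedd}"; then
        echo "WARNING: condor_q against ${schedd} failed or timed out"
    fi
done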
Added: | ||||||||
> > 3.) We find an exorbitant NumShadowStarts for some jobs, into the hundreds.
We agreed that there are in principle two such cases to be distinguished:
a) Those where the glidein never manages to successfully run a job because the connection keeps on failing until the batch slot
lease times out, and the glidein gets evicted on the worker node. The logfile Sanjay sent around was such a case, it seems.
This is a giant waste of cpu resources, and we need to learn how to avoid this.
b) Those where the job starts, and then the connection gets lost and reestablished many times over.
We can distinguish these two by looking at NumJobStarts = 0 versus NumJobStarts > 0.
We have not done this, and it's probably worth doing!
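That split can be read straight off the schedds with constraint queries; the thresholds below are arbitrary, and the same constraints work with condor_history for jobs that have already left the queue.

#!/bin/sh
# Case (a): many shadow starts but the user job never actually started on the glidein.
condor_q -global -constraint 'NumShadowStarts > 10 && NumJobStarts == 0' \
    -format '%d.' ClusterId -format '%d ' ProcId \
    -format 'ShadowStarts=%d\n' NumShadowStarts

# Case (b): the job did start (possibly more than once) and the connection was
# lost and re-established along the way.
condor_q -global -constraint 'NumShadowStarts > 10 && NumJobStarts > 0' \
    -format '%d.' ClusterId -format '%d ' ProcId \
    -format 'ShadowStarts=%d ' NumShadowStarts -format 'JobStarts=%d\n' NumJobStarts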
Dan noted that the particular logfile Sanjay had found was very sick. It seemed to indicate that there is something very very wrong,
possibly a bug in threading. Would be worthwhile to find another logfile like this.
The evening after the meeting, fkw observed a job at UCSD that clearly was run twice, and succeeded in its entirety both times.
Leading to a "file exists" problem for the second incarnation of the job.
Here's the email on that one:
Hi,
So I am watching a job that's been running for more than 6 hours on the UCSD cluster. And I think I can prove that condor restarted a job that actually completed correctly!!! I.e. it looks to me like condor screwed up royally!!!

Details on the second running of the job: jobid on schedd_jobs7@ = 14661.0

MATCH_Name = "vm2@14979@cabinet-6-6-12.local"
LastRemoteHost = "vm2@2001@cabinet-6-6-17.local"
RemoteHost = "vm2@14979@cabinet-6-6-12.local"
NumShadowStarts = 3
NumJobStarts = 2
NumShadowExceptions = 1
LastMatchTime = 1212085038
JobStartDate = 1212085378
LastVacateTime = 1212096246
EnteredCurrentStatus = 1212096247
ShadowBday = 1212096251
JobLastStartDate = 1212096245
JobCurrentStartDate = 1212096251
LastJobLeaseRenewal = 1212108161
ServerTime = 1212108268

And just as I was trying to find it on the cluster, it finished. Interestingly, the job succeeded to run, and failed in stage-out with a supposed "file already exists" error. So I looked in dcache for the file and its creation date and find:

[1834] fkw@uaf-1 ~$ ls -altr /pnfs/sdsc.edu/data3/cms/phedex/store/user/spadhi/CCRC08/t2data2.t2.ucsd.edu/WW_incl_P3-48471/TestFile_31.txt
-rw-r--r-- 1 19060 cms 10239300 May 29 14:01 /pnfs/sdsc.edu/data3/cms/phedex/store/user/spadhi/CCRC08/t2data2.t2.ucsd.edu/WW_incl_P3-48471/TestFile_31.txt

What I conclude from this is the following history:
May 29 14:01 the job succeeded writing its file into the storage element
May 29 14:24 the same job is restarted on cabinet-6-6-12
May 29 18:24 this second running of the job ends, and fails during stage-out because the file already exists in dcache.

I'm asking Abhishek to show me how I can find the hostname from where the file was written successfully the first time. If this is 6-6-17 then the case is airtight, and condor screwed up royally!!! Or am I missing something?
Thanks, Frank

Dan made a number of suggestions for configuration changes to deal with this. Sanjay put those in place, and rewrote the logic in cmscp in crab_template.sh before we rerun exercise 3 for the last time on June 1st. Here's the email on the changes in condor configuration:

Hi Dan and Frank. I have changed the configuration for glidein-2 in order to incorporate the following based on your suggestions.

MAX_SCHEDD_LOG = 100000000
SCHEDD_DEBUG = D_FULLDEBUG
MAX_SHADOW_LOG = 100000000
SHADOW_DEBUG = D_FULLDEBUG
# Added params for sched issues
LastMatchListLength = 10
JobLeaseDuration = 10800
SUBMIT_EXPRS = $(SUBMIT_EXPRS) LastMatchListLength JobLeaseDuration
SHADOW_MAX_JOB_CLEANUP_RETRIES = 20

Hope this will help to understand better. Please let me know if I missed something. Thanks. Sanjay
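Whatever the final cmscp logic looks like, it has to survive exactly the double-run case above. A sketch of the idea in shell; the SURL is a placeholder, srmls/srmrm/srmcp are the dCache SRM client commands (lcg-ls/lcg-del/lcg-cp would be the lcg_util equivalents), and treating an existing destination file as "remove and retry" rather than "declare success" is a policy choice we would have to make.

#!/bin/sh
# Sketch of a stage-out that tolerates condor running the same job twice.
# The SURL is a placeholder; srmls/srmrm/srmcp are dCache SRM client commands,
# and -2 is meant to select the SRMv2 interface (to be verified on the WN).
LOCAL="file://$(pwd)/output.root"
DEST="srm://t2data2.t2.ucsd.edu:8443/srm/managerv2?SFN=/pnfs/example.edu/cms/store/user/someuser/output.root"

if srmls "${DEST}" >/dev/null 2>&1; then
    # A previous incarnation of this job already staged the file out.
    # Policy choice: here we remove the stale copy and retry; one could instead
    # verify the size/checksum and declare success.
    echo "destination already exists, removing it before re-copying"
    srmrm "${DEST}" || exit 60307
fi

srmcp -2 "${LOCAL}" "${DEST}" || exit 60307   # 60307: CMS stage-out error code
echo "stage-out successful"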
-- FkW - 30 Apr 2008 |
Line: 1 to 1 | ||||||||
Stage-out issues
IFCA and StoRM
Line: 83 to 83 | ||||||||
Time out of reads
We presently have no timeouts in either cmssw or crab for files that hang once opened.
Added: | ||||||||
> > Issues in CRAB stage-out logic
The logic we found in cmscp makes no sense.
Issue with gfactory monitoring
The monitoring plots should show in the legend only those sites that are actually used during the time that the plot covers.
-- FkW - 30 Apr 2008 |
Line: 1 to 1 | ||||||||
Stage-out issues
IFCA and StoRM
Line: 55 to 55 | ||||||||
| ||||||||
Added: | ||||||||
> > Time out of reads
We presently have no timeouts in either cmssw or crab for files that hang once opened.
-- FkW - 30 Apr 2008 |
Line: 1 to 1 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Stage-out issues | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Deleted: | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
< < Warsaw issue with store/store
Not sure I understand Krzystof's email. Need to get back to this.
IFCA and StoRM
IFCA is using StoRM. I'm told that the FNAL srm client is not compatible with StoRM, and StoRM supports only SRMv2.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Deleted: | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
< < Imperial College and CA
Doesn't seem to allow DOEGrids. Contact Stuart Wakefield, David Collins. Today (May 5th) is a holiday in the UK.
Taiwan and role
Taiwan wants me to use the cmssgm role.
T2_FR_CCIN2P3 has no TFC entry
This T2 shares storage with the T1. Started using T1_FR_CCIN2P3_Buffer to get TFC stuff. So far failed after that; send them the error log.
Added: | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
> > | All failed jobs in detail as of May 14th 7pm
Other Issues
Batch system config
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
-- FkW - 30 Apr 2008 |
Line: 1 to 1 | ||||||||
Stage-out issues | ||||||||
Added: | ||||||||
> > Warsaw issue with store/store
Not sure I understand Krzystof's email. Need to get back to this.
IFCA and StoRM
IFCA is using StoRM. I'm told that the FNAL srm client is not compatible with StoRM, and StoRM supports only SRMv2.
|
Line: 1 to 1 | ||||||||
Stage-out issues
IFCA and StoRM
IFCA is using StoRM. I'm told that the FNAL srm client is not compatible with StoRM, and StoRM supports only SRMv2.
Changed: | ||||||||
< < |
| |||||||
> > |
| |||||||
Imperial college and CA |
Line: 1 to 1 | ||||||||
Stage-out issues
IFCA and StoRM
Line: 14 to 14 | ||||||||
Taiwan and role
Taiwan wants me to use cmssgm role.
Deleted: | ||||||||
< < T2_BE_IIHE and no role
Stijn de Werdt wants me to use no role whatsoever.
T2_FR_CCIN2P3 has no TFC entry
This T2 shares storage with the T1. Started using T1_FR_CCIN2P3_Buffer to get TFC stuff.
Line: 1 to 1 | ||||||||
Deleted: | ||||||||
< < | ||||||||
Stage-out issues
IFCA and StoRM
Line: 7 to 6 | ||||||||
| ||||||||
Added: | ||||||||
> > Imperial College and CA
Doesn't seem to allow DOEGrids. Contact Stuart Wakefield, David Collins. Today (May 5th) is a holiday in the UK.
Taiwan and role
Taiwan wants me to use the cmssgm role.
T2_BE_IIHE and no role
Stijn de Werdt wants me to use no role whatsoever.
T2_FR_CCIN2P3 has no TFC entry
This T2 shares storage with the T1. Started using T1_FR_CCIN2P3_Buffer to get TFC stuff. So far failed after that; send them the error log.
-- FkW - 30 Apr 2008 |
Line: 1 to 1 | ||||||||
Deleted: | ||||||||
< < | ||||||||
Added: | ||||||||
> > | Stage-out issues | |||||||
Added: | ||||||||
> > IFCA and StoRM
IFCA is using StoRM. I'm told that the FNAL srm client is not compatible with StoRM, and StoRM supports only SRMv2.
| |||||||
-- FkW - 30 Apr 2008 |
Line: 1 to 1 | ||||||||
Added: | ||||||||
> > | -- FkW - 30 Apr 2008 |