Difference: FkwSTEP09CRABserverIssues (1 vs. 4)

Revision 4 - 2009/06/21 - Main.FkW

Line: 1 to 1
 
META TOPICPARENT name="FkwSTEP09CRABserver"
Line: 157 to 157
  This leads to whole tasks disappearing in the crab server, without the user being informed about it.
Changed:
<
<

Problem with the way proxies are updated

>
>

Problems due to unsecured multi-threading

 
Changed:
<
<
It appears that "something" in the crabserver keeps the user proxy updated from the myproxy server. Whatever mechanism does this does it in such a way as to cause problems with glexec and condor. Basically, the proxy on disk is touched and maybe rewritten (?). This is not done as an atomic (i.e. fast) process. As a result, both condor and glexec sometimes find a corrupted/incomplete/inconsistent proxy on disk while they try to access it. This leads to both of them failing.
>
>
The following is fkw's (probably incomplete) understanding of what we know as of today, 6/21/09.
 
Changed:
<
<
The way to do this better would be to write the new proxy into a separate file, and then mv the file to its proper place.

As of right now, we do not know what piece of software inside crabserver does this.

Our knowledge of this going on comes from:

  • condor core dump analysis
  • glexec error "no file is found"
>
>
The way the crabserver does the condor job submission poses severe problems because (at least) two files are overwritten by multiple crabserver worker threads. This happens because more than one worker simultaneously submits jobs from the same task via glexec to condor. The two files are:
  • the actual proxy file
  • the script glexecWrapper.sh
These two files are moved from the uid context of hpi (the username the crabserver runs as) to the uid space of the user whose job it is. If this is done by multiple threads at once, it leads to problems: one thread is attempting to use the file it has just written while a second thread is overwriting it.
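A minimal sketch of one way to avoid this clash (hypothetical paths and names, not the actual crabserver code): stage each submission's copy under a unique name in the destination directory, and rename it into place, so that no two worker threads ever rewrite the same file.

 DEST=/path/in/user/uid/space                         # hypothetical destination directory
 TASK=spadhi_crab_0_090617_095513_lv5k24              # example task name from this page
 JOB=147                                              # example job number
 TMP=$(mktemp ${DEST}/.glexecWrapper.XXXXXX)          # unique temporary file on the same filesystem
 cp /path/to/glexecWrapper.sh "$TMP"                  # stage this submission's copy of the wrapper
 mv "$TMP" "${DEST}/glexecWrapper_${TASK}_${JOB}.sh"  # rename within one filesystem is atomic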

Proxy renewal

There is a general problem that the uid that runs the crabserver is not the uid of the user. The submission to condor is done via glexec so that condor_submit is done from within the user uid space. This then poses a problem with proxy renewal via the myproxy mechanism. We need to ask Sanjay what exactly he has done about this. He was going to talk with the gLite folks to better understand how they deal with the same issue.

 

Where to find what logs on crabserver

Here we document the directories where you find stuff on the glidein-2 crab server.

Revision 3 - 2009/06/18 - Main.FkW

Line: 1 to 1
 
META TOPICPARENT name="FkwSTEP09CRABserver"
Line: 123 to 123
 
    1. 9.jdl
Changed:
<
<

SchedulerJobId not properly specified by crab at submit time

>
>

SchedulerJobId not properly specified by crab at submit time (fixed as of 6/18/09)

We see that the dashboard does not show the SchedulerJobId for jobs that were submitted but are still pending. Julia tells us that this is because the crabserver does not form the SchedulerJobId correctly at submit time.

Revision 2 - 2009/06/17 - Main.FkW

Line: 1 to 1
 
META TOPICPARENT name="FkwSTEP09CRABserver"
Line: 7 to 7
 

Problems as of 6/17/09

Incompletely submitted tasks

Added:
>
>
Whenever a task has many jobs, you are virtually guaranteed that the task does not get submitted in one piece.

Identify the problem

In this log:
/home/hpi/CRABSERVER_Deployment/prodagent_work/CrabServerWorker/ComponentLog
look for this error:
 
2009-06-17 01:28:46,578:['It appears that the value of pthread_mutex_init is 6813744\n', '\n', 'ERROR: Failed to open command file (No such file or directory)\n']

In the log it also tells you how many jobs of this task it failed to create and submit. However, that info is already wrong, as it generally submits more than it thinks it does.
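A quick way to check how often this error occurs is to grep the ComponentLog for the error text quoted above. This is just a sketch; whether the task name shows up within a couple of lines of the error depends on how the log is written.

 grep -c "Failed to open command file" /home/hpi/CRABSERVER_Deployment/prodagent_work/CrabServerWorker/ComponentLog
 grep -B 2 -A 2 "Failed to open command file" /home/hpi/CRABSERVER_Deployment/prodagent_work/CrabServerWorker/ComponentLog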

Proof of this problem

Take as an example the task:

spadhi_crab_0_090617_095513_lv5k24

It is supposed to have 1609 jobs in the task. The ComponentLog file says all of them failed to submit. The actual logs tell you that 399 out of 1609 were created, and we don't know how many were submitted in the first round. Here's how to find the proof:

[0158] spadhi@glidein-2 ~$ ls /var/dropBox/spadhi_crab_0_090617_095513_lv5k24_spec/*lv5k24_job*.jdl |wc -l
399

Next it retries (at most 3 times), and during the retry it creates additional jobs, as well as duplicates. We discuss duplicates in the next section.

 

Multiple submission of same jdl for incompletely submitted tasks

Added:
>
>
For tasks that failed initially, it tries 3 times, and each time it potentially creates the same job again. This leads to cases where the same job runs more than once.
 
Changed:
<
<

"No site matched"

>
>

Identify the problem

 ls /var/dropBox/spadhi_crab_0_090617_095513_lv5k24_spec/*.jdl | sed -e 's/.*_job*//' | sort | uniq -c 
 
Changed:
<
<

site Id and monitor Id are inconsistent

>
>
This counts for every job how many different jdl's were created in the dropBox for that job.
 
Changed:
<
<
We see that
>
>

Proof of this problem

From the example below, you can see that out of 1609 jobs to submit, 403 were actually created, and out of those, 51 were created more than once, and three were created 3 times.

We can then look at the submission log for the 3 that were created 3 times, and find for example for job 147:

[0231] spadhi@glidein-2 ~$ less /var/gftp_cache/spadhi_crab_0_090617_095513_lv5k24/CMSSW_147.log
000 (399472.000.000) 06/17 01:04:11 Job submitted from host: <169.228.130.11:48908>
...
000 (401888.000.000) 06/17 01:16:14 Job submitted from host: <169.228.130.11:48908>
...
000 (405516.000.000) 06/17 01:38:22 Job submitted from host: <169.228.130.11:48908>
...

So this job was indeed submitted 3 times.
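Instead of reading each log by eye, one can count the condor "Job submitted" records in every per-job user log of the task. This is just a sketch; it assumes all per-job logs of the task live in the same /var/gftp_cache/<task>/ directory, as in the example above.

 for f in /var/gftp_cache/spadhi_crab_0_090617_095513_lv5k24/CMSSW_*.log; do
   echo "$f: $(grep -c 'Job submitted from host' "$f") submissions"
 done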

Below are more details on the duplicated jobs:

[0226] spadhi@glidein-2 ~$ ls /var/dropBox/spadhi_crab_0_090617_095513_lv5k24_spec/*.jdl | sed -e 's/.*_job*//' | sort | uniq -c |grep -v " 1 "
      2 101.jdl
      2 103.jdl
      2 114.jdl
      2 132.jdl
      2 136.jdl
      2 137.jdl
      3 147.jdl
      2 149.jdl
      2 150.jdl
      2 164.jdl
      2 166.jdl
      2 172.jdl
      2 17.jdl
      2 191.jdl
      2 194.jdl
      2 197.jdl
      4 199.jdl
      2 19.jdl
      2 219.jdl
      2 229.jdl
      2 230.jdl
      2 232.jdl
      2 233.jdl
      2 239.jdl
      2 256.jdl
      2 263.jdl
      2 26.jdl
      2 281.jdl
      2 306.jdl
      2 307.jdl
      2 313.jdl
      2 315.jdl
      2 31.jdl
      2 32.jdl
      2 361.jdl
      2 382.jdl
      2 389.jdl
      2 38.jdl
      2 395.jdl
      2 396.jdl
      2 40.jdl
      2 45.jdl
      2 4.jdl
      3 63.jdl
      2 68.jdl
      2 7.jdl
      2 80.jdl
      2 90.jdl
      2 91.jdl
      3 97.jdl
      2 9.jdl

SchedulerJobId not properly specified by crab at submit time

We see that the dashboard does not show the SchedulerJobId for jobs that were submitted but are still pending. Julia tells us that this is because the crabserver does not form the SchedulerJobId correctly at submit time. As a result, the dashboard does not parse it correctly, and thus we see nothing.

This makes the debugging of the multiple jobs per jobId more difficult because they actually overwrite each other in the dashboard.

No sites matched problem

We see in the ComponentLog the following:

2009-06-17 00:57:59,221:Registering information:
{'submittedJobs': None, 'SE-White': "['T2_PT_LIP_Coimbra']", 'exc': 'Traceback (most recent call last):\n  File "/home/hpi/CRABSERVER_Deployment/MYTESTAREA/slc4_ia32_gcc345/cms/crab-server/CRABSERVER_1_0_8_pre3-cmp/lib/CrabServerWorker/FatWorker.py", line 150, in run\n    raise Exception("Unable to submit jobs %s: no sites matched!"%(str(sub_jobs)))\nException: Unable to submit jobs [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33]]: no sites matched!\n', 'skippedJobs': None, 'error': 'WorkerError worker_2. Task spadhi_crab_0_090617_095636_8l07tc. listMatch.', 'reason': 'Failure in pre-submission init', 'SE-Black': None, 'unmatchedJobs': None, 'range': '[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33]', 'CE-White': None, 'time': None, 'notSubmittedJobs': None, 'ev': 'Submission', 'CE-Black': "['fnal.gov', 'gridka.de', 'w-ce01.grid.sinica.edu.tw', 'w-ce02.grid.sinica.edu.tw', 'lcg00125.grid.sinica.edu.tw', 'gridpp.rl.ac.uk', 'cclcgceli03.in2p3.fr', 'cclcgceli04.in2p3.fr', 'pic.es', 'cnaf']"}
2009-06-17 00:57:59,221:WorkerError worker_2. Task spadhi_crab_0_090617_095636_8l07tc. listMatch.
2009-06-17 00:57:59,221:Unable to submit jobs [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33]]: no sites matched!
2009-06-17 00:57:59,221:FatWorker worker_2 performing submission
2009-06-17 00:57:59,222:Worker worker_2 unable to submit jobs [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33]. No sites matched
However, the site does have a valid CE according to bdii:
ce01-cms.lip.pt:2119/jobmanager-lcgsge-cmsgrid
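A possible way to cross-check what the information system publishes for the site (this assumes the gLite UI tools, e.g. lcg-infosites, are available on the machine doing the check):

 lcg-infosites --vo cms ce | grep lip.pt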

This leads to whole tasks disappearing in the crab server, without the user being informed about it.

Problem with the way proxies are updated

It appears that "something" in the crabserver keeps the user proxy updated from the myproxy server. Whatever mechanism does this does it in such a way as to cause problems with glexec and condor. Basically, the proxy on disk is touched and maybe rewritten (?). This is not done as an atomic (i.e. fast) process. As a result, both condor and glexec sometimes find a corrupted/incomplete/inconsistent proxy on disk while they try to access it. This leads to both of them failing.

The way to do this better would be to write the new proxy into a separate file, and then mv the file to its proper place.
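A minimal sketch of that suggested fix, with hypothetical paths; a rename within one filesystem is atomic, so condor and glexec would never see a half-written proxy:

 PROXY=/path/to/user/proxy           # hypothetical location of the proxy file on disk
 TMP=$(mktemp ${PROXY}.XXXXXX)       # temporary file next to the real one, same filesystem
 cp /path/to/renewed/proxy "$TMP"    # whatever renews the proxy should write here first
 chmod 600 "$TMP"                    # proxies must not be group/world readable
 mv "$TMP" "$PROXY"                  # atomic replacement; readers never see a partial file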

As of right now, we do not know what piece of software inside crabserver does this.

Our knowledge of this going on comes from:

  • condor core dump analysis
  • glexec error "no file is found"
 

Where to find what logs on crabserver

Here we document the directories where you find stuff on the glidein-2 crab server.

Revision 1 - 2009/06/17 - Main.FkW

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="FkwSTEP09CRABserver"

This page lists problems with the CRAB server.

Problems as of 6/17/09

Incompletely submitted tasks

Multiple submission of same jdl for incompletely submitted tasks

"No site matched"

site Id and monitor Id are inconsistent

We see that

Where to find what logs on crabserver

Here we document the directories where you find stuff on the glidein-2 crab server. All directories are on glidein-2, of course.

The jdl and such created for submission to condor

 /var/dropBox/ 

This directory has one subdirectory per task. The subdirectory names are of the following structure:

Example:
spadhi_crab_0_090615_220141_450pmi_spec

Structure:
<username>_crab_0_<date>_<time>_<random string>_spec
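For example, to go from a task name to its dropBox directory and count the jdl files created for it (task name taken from the example above):

 TASK=spadhi_crab_0_090615_220141_450pmi
 ls -d /var/dropBox/${TASK}_spec
 ls /var/dropBox/${TASK}_spec/*.jdl | wc -l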

-- FkW - 2009/06/17

 