FkwCCRC08OpenIssues

Stage-out issues

IFCA and StoRM

IFCA uses StoRM. I'm told that the FNAL SRM client is not compatible with StoRM, and that StoRM supports only SRMv2.
  • I verified that lcg-cp and lcg-ls work (see the example commands below).
  • Need to follow up, as this would imply that people assigned to IFCA as their Tier-2 would not be able to stage out from OSG.
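
For reference, a manual SRMv2 check of this kind can be done with the lcg utilities alone. The sketch below shows the idea; the endpoint, path, and file name are placeholders, not the actual IFCA values, and the -b/-D options are the usual "no BDII, force srmv2" incantation:

# Hypothetical SRMv2 endpoint and path -- substitute the real IFCA values.
SRM_ENDPOINT="srm://srm.example.es:8443/srm/managerv2?SFN=/store/user/fkw/ccrc08"
# Copy a small local test file to the SE via SRMv2, then list it back.
lcg-cp -b -D srmv2 file://$PWD/testfile.root "$SRM_ENDPOINT/testfile.root"
lcg-ls -b -D srmv2 "$SRM_ENDPOINT/testfile.root"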

T2_FR_CCIN2P3 has no TFC entry

This T2 shares storage with the T1. Started using T1_FR_CCIN2P3_Buffer to pick up the TFC entries. So far it has still failed after that; need to send them the error log.

Attempts at error accounting

Note: I learned today (May 18th) that the error accounting on the dashboard is incomplete. E.g., I have 18 jobs in Florida that have been running for 4 days now. They are actually 9 jobs, each of which has two copies, since I am running twice over the same dataset. The 9 jobs end up as 3 groups of 3 consecutive jobs. I looked at the JDL for these jobs, and it looks like they might be stuck on one file per group. Need to send Bockjoo an email to verify.

These jobs would not show up in my accounting, as they end up in "application unknown" whenever a job fails to complete and gets killed by the local batch system.

Moral of the story: I need to go through the logfile directories and sort out which jobs actually completed and which didn't!!! A way to do this might be the following:

ls -altr WW_incl-1/glideinWMS-WW_incl-1-CMSSW_1_6_7-CSA07-1196178448/share/.condor_temp/*.stdout | grep " 0 May" | wc -l
This searches for zero-length stdout files and counts them. By walking through all directories in this fashion, I can compute a complete tally, in principle. Will need to try this in practice!!!
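
A rough way to automate this tally across all task directories might look like the following sketch. It assumes the same directory layout as the example above; the WW_incl-* pattern is taken from that example and may need adjusting for other tasks:

#!/bin/bash
# Tally zero-length stdout files (i.e. jobs that never completed) per task directory.
total=0
for d in WW_incl-*/glideinWMS-*/share/.condor_temp; do
  n=$(find "$d" -maxdepth 1 -name '*.stdout' -size 0 | wc -l)
  echo "$d : $n empty stdout files"
  total=$((total + n))
done
echo "Total empty stdout files: $total"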

All failed jobs in detail as of May 14th 7pm

This was based on Sanjay's submissions from May 10th 22:00 onwards, plus all of mine. Need to redo this more carefully, covering Sanjay from May 9th 22:00 to May 14th 17:00, and fkw from May 9th 22:00 to May 19th 00:00. At this point, only the 18 jobs in Florida are still running for fkw. Those are the ones that get restarted over and over again, as mentioned above.

Sanjay's jobs from May 14th 18:00 onwards should not be counted as part of this, because from then on he was running without the 1 s sleep per event.

Also, these statistics carry the caveat discussed above: "application unknown" on the dashboard may be hiding errors that the end user actually sees.

| *Site* | *Success* | *Fail* | *50115 errors* | *Failed files* | *Comment* |
| Beijing | 1750 | 0 | 0 | | |
| BEgrid | 443 | 7 | 7 | | |
| Budapest | 1354 | 26 | 26 | | 4 missing files |
| BelGrid-UCL | 1855 | 0 | | | |
| Caltech | 3693 | 9 | 8 | | Some pools drop out of dCache. 133-4 |
| CSCS | 438 | 1 | 0 | | 134 |
| DESY | 2117 | 0 | | | |
| Wisconsin | 2792 | 12 | 4 | | 7x135 + 10034 |
| Bari | 1257 | 5 | 5 | | 2 missing files |
| LNL | 1833 | 0 | 0 | | |
| Pisa | 1194 | 5 | 3 | | 2x-1: ML server can't be reached; this will no longer cause a failed-job error in CRAB 2_2. Files did exist; read problem was temporary. |
| KNU | 901 | 0 | 0 | | |
| MIT | 3663 | 0 | 0 | | |
| Purdue | 978 | 2 | | | 2x10016 = one bad node |
| Aachen | 1215 | 0 | | | |
| Estonia | 785 | 1 | 0 | | bad file in tW dataset |
| Taiwan | 1678 | 3 | 3 | | Files do exist; seems to be a temporary DPM issue |
| Brunel | 1631 | 7 | 7 | | all 7 files unavailable, also to Stuart |
| London-IC | 900 | 0 | | | |
| SouthGrid | 1446 | 0 | | | |
| IFCA | 376 | 0 | | | |
| JINR | 1804 | 0 | | | |
| Nebraska | 1811 | 0 | | | |
| Florida | 1785 | 0 | | | |
| Warsaw | 1800 | 0 | | | |
| UCSD | 3421 | 4 | 0 | | 4x bad file in tW dataset |
| UERJ | 1748 | 64 | 64 | 80 | |
| RHUL | 1730 | 66 | 65 | 78 | one bad node |
| Total | 42920 | 212 | 192 | | |

Other Issues

Batch system config

  • Estonia has no fair share between analysis users: whoever comes first gets served first, i.e. jobs are handled in the order they arrive.
  • At CSCS I got 1/10 of the resources that Ale got. I suspect the same issue as for Estonia.
  • At BEGrid_ULB_VLB I got 1/7 of what Ale got. I suspect the same issue as for Estonia.
  • glideins never worked at CIEMAT. We suspect a firewall issue.
  • There's a site at Helsinki that uses ARC.
  • A total of at least 13 files across 3 sites were unavailable. I don't have info from UERJ yet; that seems likely to increase the count.
  • 1 bad file in tW dataset accounts for 5 failures.

Timeouts on reads

We presently have no timeouts in either CMSSW or CRAB for files that hang once opened.
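
Just to illustrate the kind of watchdog that is missing, a job wrapper could in principle enforce a crude overall time limit itself. This is only a sketch: the limit and the cmsRun invocation below are placeholders, and it is a per-job timeout rather than the per-read timeout that would really be needed inside CMSSW:

#!/bin/bash
# Crude watchdog sketch: kill the payload if it exceeds MAX_SECONDS.
MAX_SECONDS=$((4 * 3600))    # placeholder limit
cmsRun pset.py &             # placeholder payload
pid=$!
( sleep "$MAX_SECONDS" && kill -9 "$pid" ) &
watchdog=$!
wait "$pid"
status=$?
kill "$watchdog" 2>/dev/null
echo "payload exited with status $status"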

Issues in CRAB stage-out logic

The logic we found in cmscp makes no sense:
  • It starts with srmcp via SRMv1,
  • then srmcp via SRMv2,
  • then lcg-cp.

The checking after srmcp is insufficient: it relies on the return code instead of always doing an srmls after a failure to make sure the failure really is a failure. A sketch of a more defensive check follows below.
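
The following is only a sketch of that idea, under assumptions: the SURL, file names, and command options are placeholders, the srmcp/lcg-cp flags are my guesses rather than what cmscp actually uses, and the attempt order is simplified relative to the list above. The point is that success is decided by an srmls on the destination, not by the copy command's return code:

#!/bin/bash
# Sketch: try stage-out commands in turn, trusting only an srmls check on the destination.
LOCAL_FILE="file://$PWD/output.root"   # placeholder; exact file: URL form may differ per tool
DEST_SURL="srm://se.example.edu:8443/srm/managerv2?SFN=/store/user/fkw/output.root"  # placeholder

verify() {
  # The transfer counts as successful only if the file is visible on the SE.
  srmls "$DEST_SURL" > /dev/null 2>&1
}

attempt() {
  echo "trying: $*"
  "$@"
  verify
}

if attempt srmcp -2 "$LOCAL_FILE" "$DEST_SURL"; then
  echo "stage-out succeeded via srmcp (SRMv2)"
elif attempt lcg-cp -b -D srmv2 "$LOCAL_FILE" "$DEST_SURL"; then
  echo "stage-out succeeded via lcg-cp"
else
  echo "stage-out failed: output not visible on the SE after all attempts"
  exit 1
fi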

Issue with gfactory monitoring

The monitoring plots should show in the legend only those sites that are actually used during the time that the plot covers.

-- FkW - 30 Apr 2008
