The purpose of this page is to document the scalability testing we are doing right after CSA07. It's not pretty! It's mostly written for myself so that I don't get confused about what I'm doing.

Overview

Between the end of CSA07 and Thanksgiving 2007, we deployed glideinWMS at UCSD across the following hardware:

node    | hardware         | services
gftp-6b | 8 core, 8 GB RAM | CRAB, Frontend, 10 schedds
pnfs-2  | 8 core, 8 GB RAM | gfactory
???     | ???              | GCB
???     | ???              | GCB
???     | ???              | negotiator/collector

We then used this system to reach 100,000 CRAB jobs successfully run in 24 hours.

This goal was accomplished between 6pm Wednesday Nov. 21st 2007 and 6pm Thursday Nov. 22nd. During the first 13 hours more than 94,000 jobs completed. In the remaining 11 hours close to 30,000 jobs completed.

Conceptual Detail and lessons learned

We ran the ntuplemaker that Haifeng Pi uses in his vector boson fusion Higgs analysis. We selected CMSSW_1_5_2 datasets at a few US T2 sites and FNAL. We then prepared roughly 200,000 CRAB jobs using crab -create. Each of these jobs ran on 500 events, writing out an ntuple of up to 20 MBytes. The ntuple for each job was staged out into the local srm at the site.

All the jobs were created overnight via a simple set of scripts. They were created in batches of at most 1000 jobs per dataset. This was done to make sure that no more than 1000 files were staged out into the same directory.

Lesson 1: CRAB create takes too long by about an order of magnitude.

Lesson 2: CRAB is missing the functionality of creating directories for stage-out to guarantee that no more than ~1000 files end up in the same directory.

In the late afternoon on 11/21, we then did crab -submit on these batches of jobs. Submission of these jobs took what seemed like forever, i.e. it was not done by 7am the next morning.

Lesson 3: CRAB submission via condor takes too long by more than an order of magnitude.

The submission system used by CRAB works as follows:

  • There is a set of schedds on the CRAB client node. All jobs are submitted to these schedds as "vanilla universe" jobs. This was accomplished by hacking the condor_g_submit.pl script that is part of crab; we changed the script to add a classAd attribute based on the globus gatekeeper name (see the sketch after this list). In addition, we distributed the jobs across schedds according to which site they were headed to.
  • We loaded up one of the schedds with more than 100,000 jobs. This schedd was nevertheless still performing reasonably well, both in terms of response to condor_q and in terms of submissions. We attribute this to the condor_q queries going to a different quill server for each schedd.
  • A glideinWMS gfactory with 10 sites was used to establish the condor pool for vanilla universe. Only 6 of the 10 sites were used in the test. Two couldn't be submitted to for reasons we did not understand. One site was missing the data we had prepared to be used. I.e. the site deleted the datasets after we had created the crab jobs. As job creation takes too long, we decided not to bother using that site with a different dataset.
  • Each glidein pulls multiple jobs in sequence to the worker node. This way we could run glideins for at least an hour while the jobs we ran were designed to run only a couple of minutes.
  • We found that a significant number of jobs ran for much longer than expected because they got stuck failing in stage-out. CRAB's default behaviour is to back off and retry many times over. Accordingly, failure of an SRM at a site leads to a massive accumulation of jobs that do nothing. As a result, the wallclock time consumed at a site does not correlate well in all cases with the number of jobs run at a site.
  • Even when the srm did not fail outright, we could clearly see stress on it in the form of a 5-10 fold increase in the time it took a job to complete when 100-200 jobs were running at the same site. The exact behaviour is not uniform across sites, i.e. some sites have srm's that are more scalable than others.
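
To make the routing concrete, here is a minimal sketch of the kind of vanilla universe submit description the hacked script could produce. This is not the actual output of condor_g_submit.pl; the attribute name DESIRED_Gatekeeper, the gatekeeper string, and the file names are illustrative assumptions.

# sketch of a routed vanilla-universe submit file; attribute, gatekeeper, and file names are made up
universe     = vanilla
executable   = CMSSW.sh
output       = out.$(Cluster).$(Process)
error        = err.$(Cluster).$(Process)
log          = allForUCSD.log
# tag the job with the gatekeeper of the site it is meant for
+DESIRED_Gatekeeper = "osg-gw.some-t2.edu/jobmanager-condor"
# only match glideins that advertise the corresponding gatekeeper
requirements = (TARGET.GLIDEIN_Gatekeeper =?= MY.DESIRED_Gatekeeper)
queue

Such a file would then be handed to the schedd responsible for that site, e.g. condor_submit -name schedd_jobs8@ <file>.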

Lesson 4: crab -status is useless for even a task of 1000 jobs. The only way to get a sense of what's going on is by doing condor_q. This was annoying because all CRAB jobs have the same name. It would be desirable to change this such that at least the site the job is meant for is listed in the name.
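
For the record, the kind of condor_q I fall back on looks roughly like this; DESIRED_Gatekeeper is the hypothetical routing attribute from the sketch above, so substitute whatever attribute the submit script actually sets:

# count running jobs per destination gatekeeper on one schedd (sketch only)
condor_q -name schedd_jobs8@ -constraint 'JobStatus == 2' -format "%s\n" DESIRED_Gatekeeper | sort | uniq -c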

Lesson 5: The way crab submits to condor really makes no sense. In CDF we use DAGs to handle tasks. This should be adopted in crab, instead of having one condor job per crab job. The DAG could then be given the task name to more easily monitor it in condor.
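
As an illustration of what is meant, a per-task DAG could be as simple as the following sketch (file names are made up); all the nodes of a task then hang off one dagman job that carries the task name:

# task_UCSD.dag -- one node per crab job of the task (sketch only)
JOB crabjob_1 crabjob_1.jdl
JOB crabjob_2 crabjob_2.jdl
JOB crabjob_3 crabjob_3.jdl
# no PARENT/CHILD lines needed, the jobs are independent

and would be submitted once with condor_submit_dag task_UCSD.dag.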

Lesson 6: Stage-out is a limiting factor at the scale at which we were running. This needs some thought! It's not clear to us what scale is reasonable.

Lesson 7: The CMS dashboard did not report the jobs properly. This is mostly due to the fact that it keys on a string that is unique for every worker node batch slot lease. As we ran several jobs per slot lease, we overwrote that key multiple times. There were other issues as well that we need to follow up on.

Lesson 8: The gfactory ran with fkw's credentials. This was very annoying because I had to copy my proxy onto that machine once every 200 hours; there's a 200-hour limit on the cms VOMS attribute. To turn this into a production infrastructure, we need a mechanism that doesn't require human intervention every 200 hours.
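
For reference, the manual step that needs automating amounts to something like the following; the exact VOMS options and the destination path on the gfactory node are assumptions:

# renew a proxy with the cms attribute (just under the ~200 hour limit)
voms-proxy-init -voms cms -valid 192:00
# copy it over to the gfactory machine; the target path is a placeholder
scp /tmp/x509up_u$(id -u) pnfs-2:/path/used/by/gfactory/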

Summary and next steps

We would like to rerun this exercise after the following changes have been made:

  • Switch to CRAB server infrastructure.
  • Switch to CRAB interface to condor that makes use of DAGs as in CDF.
  • Fix the CMS dashboard accounting issues.

In addition, it would be nice, but not essential:

  • To have a better proxy mechanism for the gfactory.
  • For crab -create to do what I did with my scripts with regard to assigning a new stage-out directory for every 1000 jobs.

Detailed job accounting

site        | all jobs run | all output files found | jobs run in 24h | success rate (output files found / jobs run)
UCSD        | 26252        | 26223                  | 23224           | 99.9%
FNAL        | 29663        | 29493                  | 25777           | 99.4%
MIT         | 27027        | 26915                  | 25512           | 99.6%
Nebraska    | 63161        | 47289                  | 10663           | 74.9%
Purdue-RCAC | 72011        | 71652                  | 27019           | 99.5%
UF-PG       | 18027        |                        | 11142           |
Total       | 236141       |                        | 123337          |

Core dump of where we are at.

3rd large scale submission to multiple sites

Started around 6pm 11/21/07. Am having trouble with UF-HG, GLOW, and CIT. GLOW deleted the datasets I was running on; need to figure out what to pick from what they have! The other two somehow did nothing during "crab -create". The directories and crab directories got created but there's nothing in .boss_cache. No idea why.

There's a new set of scripts there for submission only:

  • submit
  • numbersSubmit
  • filesSubmit

Submission history:

  • By 11pm 11/21/07:
    • MIT all jobs that were prepared are being submitted
    • Nebraska all jobs that were prepared are being submitted
    • FNAL all jobs that were prepared are being submitted
    • Purdue-Lear series 31 is completely submitted
    • Purdue-RCAC all jobs that were prepared are being submitted
    • UCSD all jobs that were prepared are being submitted
    • UF-PG all jobs that were prepared are submitted
  • By 3:45pm 11/22/07:
    • It's been a painful recovery from the filled-up root disk. Since the recovery this morning, only 19650 jobs completed, and the web monitoring is basically not working.
    • Will just let it run undisturbed. At this point we have the following tally of jobs in the schedds:
      • schedd_jobs1@ 73714 jobs; 73708 idle, 5 running, 1 held
      • schedd_jobs2@ 8500 jobs; 8447 idle, 53 running, 0 held
      • schedd_jobs4@ 49632 jobs; 49500 idle, 132 running, 0 held
      • schedd_jobs7@ 4318 jobs; 4265 idle, 52 running, 1 held
      • schedd_jobs8@ 13625 jobs; 13543 idle, 82 running, 0 held
    • Another interesting factoid is how long the recovery took for each schedd after the crash. I'm basing this on when the first job completed after the crash.
      • schedd_jobs1@ 15:26
      • schedd_jobs2@ 10:42
      • schedd_jobs4@ 10:39
      • schedd_jobs7@ 10:33
      • schedd_jobs8@ 10:33
    • They all recovered pretty quickly, except for the first. Unclear if this had anything to do with the fact that it was the heaviest loaded when it crashed!
  • by 9:30pm 11/23/07
    • there are jobs that have been stuck for many hours to days. I'm going to kill them now:
      • 38 jobs at schedd_jobs2@ all killed
      • 25 jobs with runtime more than 10 hours at schedd_jobs1@ killed. There's a couple hundred jobs total that have been there for hours. Have only deleted the 25 longest running. Will deal with the rest later.
      • there's 120 running jobs at schedd_jobs4@ with run times ranging from a few minutes to more than a day. Not sure what to make of that.
        • deleted the 17 jobs that were running for more than 10 hours.
    • schedd_jobs4@ and schedd_jobs8@ are still working through their queues. At this point both have a few thousand jobs left in the queue. Will leave them alone and see what happens.
    • the total tally at this point since 11/21/07 16:00 is (note, submissions didn't start until 18:00, I think):
      • schedd_jobs1@ 32184
      • schedd_jobs2@ 17994
      • schedd_jobs4@ 71046
      • schedd_jobs7@ 27031
      • schedd_jobs8@ 55919

Problems:

  • I know that lots of jobs have finished at FNAL already, and they are supposed to have been submitted to schedd_jobs3@, but quill shows no history for them.
  • Purdue-Lear has all jobs held according to the web display, but I can't see the same in condor_q ???
  • UF-HG no crab submissions possible. I suspect this is because of the bdii issue Bockjoo mentioned on cms-t2 list.
  • GLOW has deleted my datasets. There's no CMSSW_1_5_2 data left at Wisconsin, and I'm not sure what other releases my job works on. Am thus giving up on GLOW.
  • Filled up the root partition on gftp-6b at around 7:22am on 11/22/07. Everything failed. Deleted /oldhome/fkw to gain back 5GB. Should have done that last night!!! Restarted condor via /etc/init.d/condor start and am waiting to see how condor recovers with ~170k jobs still in the queues.
    • between 6pm and 7:22am 94218 jobs completed!

schedd_jobs1@ = Nebraska
schedd_jobs2@ = UF_PG  and UF_HG
schedd_jobs3@ = FNAL
schedd_jobs4@ = Purdue_RCAC and Purdue_Lear
schedd_jobs5@ = GLOW
schedd_jobs6@ = CIT_CMS_T2
schedd_jobs7@ = MITCMS
schedd_jobs8@ = UCSDT2B
schedd_jobs9@ = anybody else

How to clean up after a run

All sites except Caltech and Florida support srmls. srmls is set up with my standard source_me. For Florida I am using:
source /code/osgcode/glite/setup_glite_ui.sh 
edg-gridftp-ls gsiftp://wn100.ihepa.ufl.edu/cmsdata/test/scalability-testing/
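
At the other sites the equivalent check is an srmls on the stage-out area, along the lines of the sketch below; the SRM endpoint and path are placeholders for the per-site values:

# count staged-out root files at a site (sketch; endpoint and path are placeholders)
srmls "srm://srm.some-t2.edu:8443/path/to/scalability-testing/" | grep "\.root" | wc -l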

2nd large scale submission to multiple sites

Started another round at roughly 5pm pacific on November 11th. Running with the same arrangement as suggested by the first round experience:

schedd_jobs1@ = Nebraska
schedd_jobs2@ = UF_PG
schedd_jobs3@ = UF_HG
schedd_jobs4@ = Purdue_RCAC
schedd_jobs5@ = GLOW
schedd_jobs6@ = CIT_CMS_T2
schedd_jobs7@ = MITCMS
schedd_jobs8@ = UCSDT2B

Tailored submissions such that 20-30k or so are heading to each of the 8 sites. This time running stuff in the background: ./script >& script.log & in each directory.

11/11/07 ~9pm Unfortunately, screwed up a bit:

  • forgot to change the numbers file for MIT. Thus it was trying to write files that already exist in the stage-out area. Decided to kill that script and start over. There were 1836 jobs in the queue, 9 of which were running, when I pulled the plug with condor_rm -name schedd_jobs7@ -all.
  • while trying to kill the MIT submission script, I killed the UF-HG one instead at first. There are 344 jobs in the queue with 34 running. As I have killed only the script here, without removing jobs from the queue, these will eventually finish. Will deal with resubmission with a new numbers file later.
  • Brian AIMed me. All of the jobs at Nebraska were still failing during stage-out. He figured out that crab was writing and then deleting the written file, and that this was because of the way crab parses srm-get-metadata. Nebraska has debug turned on in srm by default. This confuses the stage-out script in crab, and crab deletes the file assuming a failed stage-out. Brian changed their srm configuration, and now all is well, and files get staged out successfully at Nebraska.

Results from 2nd round

site       | # of files staged out | # of non-empty BossOut
Nebraska   | 13952                 | 16157
Florida PG |                       | 9634
Florida HG |                       | 12792
Purdue     | 19183                 | 18324
Wisconsin  | 2394                  | 2398
MIT        | 11135                 | 11213
UCSD       | 16394                 | 14861
FNAL       |                       | 4710

Schedd based accounting:

name          | sites       | 1st id   | last completed id | last - 1st id | completion time of 1st id | completion time of last id
schedd_jobs1@ | Nebraska    | 13343.0  | 30641.0           | 17298         | 11/11 17:08               | 11/13 01:53
schedd_jobs2@ | UF-PG       | 13287.0  | 22923.0           | 9636          | 11/11 17:24               | 11/12 16:37
schedd_jobs3@ | UF-HG       | 246.0    | 13038             | 11792         | 11/11 18:11               | 11/12 20:39
schedd_jobs4@ | Purdue-RCAC | 290.0    | 24022             | 24268.0       | 11/11 17:33               | 11/13 01:54
schedd_jobs5@ | Glow        | 287.0    | 2684.0            | 2397          | 11/11 18:44               | 11/13 11:40
schedd_jobs7@ | MIT         | 276.0    | 14165.0           | 13889         | 11/11 17:44               | 11/12 17:33
schedd_jobs8@ | FNAL, UCSD  | 145382.0 | 167236.0          | 21854         | 11/11 20:10               | 11/13 01:47

It's clear that we didn't run 100,000 jobs a day. Between 11/11 5pm and 11/13 2am:

  • we ran at peak ~1800 jobs simultaneously
  • 94237 jobs completed, and were recorded as such by condor on my submit host: grep "/1[123]" allFor*.log |wc

There are two reasons why the total wall clock time available to me during this time was not sufficient to pass the
100,000-jobs-a-day mark, which we missed by a few hours:

(1) 2418 jobs took more than an hour to complete. A nominal job was supposed to finish in a couple of minutes, many under a minute;
      e.g. more than 20,000 jobs ended within 2 minutes. I suspect all 2418 long-running jobs had stage-out problems.
      I suspect this because Brian told me about stage-out problems at Nebraska, and Nebraska accounts for 596 of these.
      I thus lost capacity to run O(100,000) jobs due to stage-out problems. Am assuming that these jobs failed in their stage-out and hung
      for a long time afterwards.   grep "/1[123]" allFor*.log |grep "0+0[1234567]" |wc

(2) 5219 jobs ran more than 10min but less than an hour. Of these, 3397 jobs ran between 10-20 minutes.
      I suspect that most of these succeeded with stage-out but took most of their wall clock time staging out because of load on the srm.
       grep "/1[123]" allFor*.log |grep "0+00:[12345]" |wc

69941 jobs completed within less than 5 minutes. That's the kind of maximum wall clock time I expected.
grep "/1[123]" allFor*.log |grep "0+00:0[01234]" |wc

In all cases except FNAL, I staged out locally, at the site where the jobs were running. For FNAL, I staged out to Florida.
One might argue that FNAL jobs are excused for taking longer to stage out. They account for at most 1443 of the 5219
jobs that took between 10min and an hour, and at most 1522 of the 2418 that took longer than an hour.

My conclusion is that the scalability exercise in all likelihood didn't even get close to pushing the glideinWMS, but rather pushed
our ability to stage out data.

To be fair, to really nail this I need to compare these stats from quill with the stats from the CMSSW logfiles that came back from crab.

1st large scale submission to multiple sites

11/10/07 around 0:00 or so. Start of 1st large scale submission to multiple sites.
Am including those sites at which the 1001-job run succeeded, i.e. I managed to see several hundred output root files in the output directory using srmls. At present this includes UCSD, Wisconsin, Purdue-RCAC, MITCMS.

11/10/07 10:30am. Had the submission script running overnight.
It finished for MIT and Purdue, i.e., all the jobs intended to be submitted were submitted. For Wisconsin and UCSD I was more greedy and had 2 submit scripts running in parallel, so not all jobs were submitted yet. I ctrl-c'd out of it at the dataset boundary, i.e. while it was trying to reach the DBS, and then cleaned up all the directories in which CRAB was so rudely interrupted. At this point, I have:

105487 jobs; 96871 idle, 368 running, 8248 held
in schedd_jobs8@ at gftp-6b.t2.ucsd.edu. That's probably unreasonable. Will let it run its course for a few hours and then regroup.

11/10/07 2:10pm. Status is now:

92437 jobs; 83717 idle, 466 running, 8254 held
Looks like we are doing ok. I will now start a second batch on a separate schedd. This time I will give each site a separate schedd so that I can keep track more easily.
schedd_jobs1@ = Nebraska
schedd_jobs2@ = UF_PG
schedd_jobs3@ = UF_HG
schedd_jobs4@ = Purdue_RCAC
schedd_jobs5@ = GLOW
schedd_jobs6@ = CIT_CMS_T2
schedd_jobs7@ = MITCMS
schedd_jobs8@ = UCSDT2B

As MIT, UCSD, Purdue, and GLOW were in the round of last night, and are probably still running jobs, I'll start with the other three to exercise this scheme.

11/10/07 5pm Well, Caltech had an NFS server failure. So I've started with only Nebraska and UF-PG in this scheme.
However, condor_g_submit.pl in BOSS was changed to fully implement this scheme, despite the fact that I'm not sure whether all my schedd's actually work. Will test each one before I use them. This is done from the condorTest directory using something like: condor_submit -name schedd_jobs8@ classAd. Appropriate classAds are in the directory.
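
The test classAds are essentially trivial submit files along these lines (a sketch; the actual files in condorTest may differ):

# minimal test job to check that a schedd accepts submissions (sketch only)
universe   = vanilla
executable = /bin/hostname
output     = test.out
error      = test.err
log        = test.log
queue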

11/10/07 around 6pm I filled up the disk of gftp-6b. All of my jobs crashed because all my schedd's crashed. I didn't know what to do other than to delete all CRAB directories on gftp-6b. Then I went to hodads for dinner and beer.

11/10/07 around 9pm I got back and tried to resurrect things. Restarted condor via /etc/init.d/condor restart. This has the interesting side effect that it restarts all queues. I have again ~100,000 jobs queued up. However, given that all corresponding crab directories got deleted, I have no idea what will happen with these jobs. I'm inclined to just wait until tomorrow and see. To get a sense of what's going on, I did an inventory of output files as follows:

site      | # of files | # of size 0 files
UCSDT2B   | 29605      | 0
Purdue    | 2723       | 41
MIT       | 6664       | 5
Nebraska  | 34         | 0
UFL       | 958        | unknown
Wisconsin | 5724       | 0

Will need to check tomorrow morning if these numbers have increased.

11/11/07 around 11:30am When I looked at the system this morning I found about 100,000 held jobs. Apparently, this is normal for jobs that lose their directory. Issued a condor_rm -name schedd_jobs1@ fkw command and am now waiting to see if this cleans up my mess.

Repeated this on all schedd's. Waited for many hours. Most of them cleared up. schedd_jobs8@, the one that had close to 90,000 held jobs, still had close to 40,000 in the late afternoon. At that point I lost patience and used -forcex.
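
For the record, the force removal amounted to something like the following sketch; -forcex throws out jobs that are stuck in the removal ("X") state without waiting for the normal cleanup:

condor_rm -name schedd_jobs8@ -forcex fkw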

Learned a lot in this first round. Time to move on.

Submission script I'm running for this round

#!/bin/bash

# Loop over batch numbers ("numbers" contains 1 2 3) and over the datasets
# listed in "files"; create one CRAB working directory per (dataset, batch)
# pair so that no more than ~1000 files end up in one stage-out directory.
for Z in `cat numbers`
do
   for X in `cat files`
   do
        # directory name: first component of the dataset path plus "-<batch number>"
        Y=`echo $X | awk -F/ '{print $2}'`
        Y=`echo $Y"-"$Z`
        mkdir $Y
        cd $Y
        # assemble crab.cfg from the two template halves, inserting the dataset
        # path and substituting the per-batch name $Y for the "replace" placeholder
        cp ../crab1.cfg tmp.cfg
        echo "datasetpath = $X" >> tmp.cfg
        cat ../crab2.cfg >> tmp.cfg
        sed 's/replace/'$Y'/' tmp.cfg > crab.cfg
        crab -create
        crab -submit
        cd -
   done
done

"numbers" is simply 1 2 3. I.e. for every dataset at the site, I am creating 3 directories, each of which will get at most 1001 files. The total number of jobs submitted to a site is thus 3 x number of datasets in files list x 1001. Though, some datasets are too small to make 1001 jobs out of, leading to less jobs overall. For UCSD and Wisconsin I ran a couple of these scripts in parallel with different "number" file.

This was the easiest way of making sure that no more than about 1000 files end up in one directory.
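
For concreteness, the two input files look like this; the dataset path shown is a placeholder, not one of the datasets actually used:

"numbers" (one batch index per whitespace-separated token):
1 2 3

"files" (one dataset path per line; placeholder shown):
/SomePrimaryDataset/SomeProcessedDataset/RECO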

Prep prior to 1st large round

  • successfully ran a CRAB job and retrieved the output back
  • successfully wrote a file to, and retrieved it from, all 7 USCMS T2s
  • submitted 1001 jobs to Nebraska, Wisconsin, and UCSD
    • don't have a properly configured gfactory for MIT, Purdue, and UFL. Didn't notice this until it was too late for this round. Need to kill the gfactory to add these sites. Don't want to do that until after this round of jobs is done.
    • all jobs at Nebraska failed when trying to stage out the output root file. Have an email in to Brian and Carl asking what's going on.
    • all jobs at UCSD succeeded. Verified that 1001 output root files with reasonable sizes are in dCache at UCSD.
    • 999 jobs at Wisconsin succeeded. Verified that 999 output root files with reasonable sizes are at UCSD.
  • added MITCMS, Purdue-Lear, Purdue-RCAC, UF-PG, and UF-HG to the gfactory. Now running v3.4
    • am a bit confused about what jobmanagers to use for Purdue. Emailed Preston about it. Am using pbs on Lear and condor on RCAC for now.
      • got info back from him that I'm doing the right thing at Purdue.
    • had an AIM chat with Bockjoo on why HPC wasn't visible to crab. Turns out he has a GOC ticket that has been unresolved for a week. The OSG bdii dropped the Florida HPC site, and nobody understands why. It appears as if crab uses the bdii to match SEs and CEs. Accordingly, HPC will not be available until the GOC has straightened this out.
    • sent 1001 jobs to MITCMS, Purdue-Lear, Purdue-RCAC, UF-PG, UF-HG
    • tried to submit to Caltech, but crab won't let me. It claims the dataset does not exist at the site. I know it does; DBS & PhEDEx say so, and I have run on it there before.
      • have email in to Michael, Ilya, and Vladimir for help.
    • the behaviour of srmls is different at UFL and Caltech than at any of the other places. Everywhere else, I get a directory listing. At Florida, I get nothing unless I explicitly ask for the filename, i.e. it gives me only stats on the exact path I'm asking for.
      • sent email to osg-storage with cc to Bockjoo and Ilya.

-- FkW - 08 Nov 2007
