
WS Gram Testing

Introduction

The following are results from UCSD WS GRAM testing, with Pre-WS GRAM results included for comparison.

Tests

WS GRAM Test 1000 Jobs Submitted from 1 Submitter (1K total Jobs)

This test used WS GRAM. All jobs finished, but some jobs went into a held state, probably around 5%.

  • OSG GW 5 CE:
    gw5-condor.png

  • OSG 5 Load 1K:
    gw5-load.png

  • OSG GW 5 Network 1K:
    gw5-network.png

  • OSG GW 5 Memory 1K:
    gw5-mem.png

  • UAF Load 1K:
    uaf-1-load.png

  • UAF Mem 1K:
    uaf-1-mem.png

WS GRAM Test 2000 Jobs Submitted from 1 Submitter (2K total jobs)

This test showed a higher count of held jobs. Another notable discovery was that submission of even trivial numbers of additional WS GRAM jobs, including jobs targeting other CEs, was delayed by up to 1 hour.

  • OSG 5 Condor Load 2K:
    osg-5-condor-load-2K.png

  • OSG 5 Load 2K:
    osg-5-load-2K.png

  • OSG 5 Mem 2K:
    osg-5-mem-2k.png

  • UAF 1 Load 2K:
    uaf-1-load-2k.png

  • UAF 1 Memory 2K:
    uaf-1-mem-2k.png

  • GUMS Load 2K:
    Gums-Load-2K.png

WS GRAM 2K x 2 Submitters

A significant fraction of these jobs went into a hold state.

  • GUMS Load WS 2K x 2:
    gums-load-2kx2.png

  • OSG 5 Condor Load WS 2K x 2:
    osg-5-condor-load-2kx2.png

  • OSG 5 Load WS 2K x 2:
    osg-5-load-2kx2.png

  • OSG 5 Mem WS 2K x 2:
    osg-5-mem-2kx2.png

  • UAF 1 Load WS 2K x 2:
    uaf-1-load-2kx2.png

  • UAF 2 Load WS 2K x 2:
    uaf-2-load-2kx2.png

Follow-up WS GRAM 2000 x 2 Submitters, >5% Hold Result

This test resulted in a greater-than-5% job hold rate and excessive gatekeeper load. Errors included authentication failures.

  • Condor Load 2kx2 WS GRAM Test 2:
    osg-gw-5-CondorLoad-2kx2-2.png

  • OSG GW5 Load 2kx2 Test 2:
    osg-gw-5-load-2kx2-2.png

  • OSG GW5 Mem 2kx2 Test 2:
    osg-gw-5-mem-2kx2-2.png

Pre-WS GRAM Comparison Test 2000 Jobs Submitted from 2 Submitters (4K Jobs)

  • Of particular note is that the submission rate for Pre-WS GRAM jobs is approximately 1 Hz (1380/1356 jobs ~= 1.0 Hz).
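The quoted rate can be sanity-checked with a trivial calculation. This assumes the 1380/1356 figure is jobs over seconds, which the note above does not state explicitly:

```python
# Sanity check on the Pre-WS GRAM submission rate quoted above.
# Assumption: 1380 jobs were submitted over 1356 seconds.
jobs = 1380
elapsed_seconds = 1356

rate_hz = jobs / elapsed_seconds          # jobs per second
seconds_per_job = elapsed_seconds / jobs  # inverse view of the same number

print(f"rate: {rate_hz:.2f} Hz")          # ~1.02 Hz, i.e. roughly one job per second
print(f"per job: {seconds_per_job:.2f} s")
```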

Notes

  • With increasing numbers of jobs queued (within levels acceptable for Pre-WS GRAM), about 5% of jobs become held for one reason or another. Configuration changes to timeouts have reduced the variety of timeout errors.

  • One of the more recent tests (2K x 1 submitter) resulted in 0 held jobs; however, two jobs never started at all. A manual hold-and-release cycle got them running again.

  • In a follow-up 2K x 2 submitter test, about 2 hours in, 5% of the jobs (204/4000) had gone into various hold states:

HoldReason = "Globus error: GT4_GRAM_JOB_SUBMIT timed out"
...
HoldReason = "Globus error: GT4_GRAM_JOB_DESTROY timed out"
HoldReason = "Globus error: GT4_GRAM_JOB_DESTROY timed out"
HoldReason = "Globus error: GT4_GRAM_JOB_DESTROY timed out"
HoldReason = "Globus error: GT4_GRAM_JOB_DESTROY timed out"
HoldReason = "Globus error: GT4_GRAM_JOB_DESTROY timed out"
HoldReason = "Globus error: GT4_GRAM_JOB_DESTROY timed out"
...
HoldReason = "Globus error: GT4_GRAM_JOB_SUBMIT timed out"
HoldReason = "Globus error: GT4_GRAM_JOB_SUBMIT timed out"
HoldReason = "Globus error: GT4_GRAM_JOB_SUBMIT timed out"
LastHoldReason = "Spooling input data files"
HoldReason = "Globus error 155: the job manager could not stage out a file"
HoldReason = "Globus error 155: the job manager could not stage out a file"
HoldReason = "Globus error 155: the job manager could not stage out a file"
HoldReason = "Globus error 155: the job manager could not stage out a file"
...
HoldReason = "Globus error: org.globus.wsrf.impl.security.authorization.exceptions.AuthorizationException: \"/DC=org/DC=doegrids/OU=People/CN=Terrence Martin 525658\" is not authorized to use operation: {http://www.globus.org/namespaces/2004/10/gram/job}createManagedJob on this service"
HoldReason = "Globus error: GT4_GRAM_JOB_DESTROY timed out"
HoldReason = "Globus error: org.globus.wsrf.impl.security.authorization.exceptions.AuthorizationException: \"/DC=org/DC=doegrids/OU=People/CN=Terrence Martin 525658\" is not authorized to use operation: {http://www.globus.org/namespaces/2004/10/gram/job}createManagedJob on this service"
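Hold states like the ones above can be tallied from long-form condor_q output with a short script. This is an illustrative sketch, not the method used in the original tests; only the HoldReason attribute name is taken from the excerpt above:

```python
import re
from collections import Counter

def tally_hold_reasons(classad_text: str) -> Counter:
    """Count each distinct HoldReason value in `condor_q -l` style output."""
    reasons = re.findall(r'^HoldReason = "(.*)"\s*$', classad_text,
                         flags=re.MULTILINE)
    return Counter(reasons)

# A few lines in the shape of the excerpt above:
sample = '''
HoldReason = "Globus error: GT4_GRAM_JOB_SUBMIT timed out"
HoldReason = "Globus error: GT4_GRAM_JOB_DESTROY timed out"
HoldReason = "Globus error: GT4_GRAM_JOB_DESTROY timed out"
HoldReason = "Globus error 155: the job manager could not stage out a file"
'''

for reason, count in tally_hold_reasons(sample).most_common():
    print(f"{count:4d}  {reason}")
```

In practice one would pipe the whole queue through this (e.g. `condor_q -l | python tally.py`) to get the per-reason breakdown behind the 204/4000 figure.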

  • Configuration changes to the submitter have resulted in the Gridmanager-related Java process growing to more than 1.1 GB of used memory. Each user submitting from a given host gets their own Java process; 1.1 GB is just the largest currently observed. It may be necessary to allow Java to consume even more memory if more than 2K jobs are submitted.

  • CPU load on the gatekeeper receiving WS GRAM jobs is similar to the load on the gatekeeper receiving Pre-WS GRAM jobs.

  • CPU Load on the submitter is much higher with WS Gram than Pre-WS Gram.

  • GUMS server load is slightly higher with WS GRAM than with Pre-WS GRAM.

  • Removal of larger numbers of submitted jobs via condor_rm (as few as 1K) is not possible without manual intervention on both the submitter and the gatekeeper. WS GRAM fails to properly remove the jobs from the gatekeeper, which requires that the persisted job directory on the gatekeeper be manually purged. This is a known WS GRAM limitation that is apparently fixed in a more recent version.

  • Ramp-up time is quite slow: 20 minutes to start 76 of 2K jobs in one test, i.e. roughly 15 s per job.
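The manual cleanup described in the condor_rm note above might look something like the following sketch. The held-job constraint and the persisted-state path are assumptions, not taken from the original tests; verify the path for your GT4 installation before deleting anything.

```shell
# On the submitter: force-remove the stuck jobs from the local queue.
# -forcex removes jobs even when the remote (WS GRAM) side cannot be
# cleaned up; the constraint selects all held jobs (JobStatus == 5).
condor_rm -forcex -constraint 'JobStatus == 5'

# On the gatekeeper: purge the WS GRAM persisted job state so the
# container forgets the orphaned ManagedJob resources. The path below is
# an assumption (GT4 commonly persists container state under the
# container user's ~/.globus/persisted); confirm it, and stop the
# container, before deleting.
# rm -rf ~globus/.globus/persisted/*
```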

-- TerrenceMartin - 17 Jan 2008

Topic attachments
Attachment                      Size    Date                Who             Comment
Gums-Load-2K.png                34.8 K  2008/01/18 - 00:10  TerrenceMartin  GUMS Load 2K
gums-load-2kx2.png              34.1 K  2008/01/18 - 02:17  TerrenceMartin  GUMS Load WS 2K x 2
gw5-condor.png                  26.5 K  2008/01/17 - 23:55  TerrenceMartin  OSG GW 5 CE 1K
gw5-load.png                    27.3 K  2008/01/17 - 23:57  TerrenceMartin  OSG 5 Load 1K
gw5-mem.png                     23.9 K  2008/01/18 - 00:02  TerrenceMartin  OSG GW 5 Memory 1K
gw5-network.png                 24.6 K  2008/01/17 - 23:59  TerrenceMartin  OSG GW 5 Network 1K
osg-5-condor-load-2K.png        29.9 K  2008/01/18 - 00:06  TerrenceMartin  OSG 5 Condor Load 2K
osg-5-condor-load-2kx2.png      30.5 K  2008/01/18 - 02:18  TerrenceMartin  OSG 5 Condor Load WS 2K x 2
osg-5-load-2K.png               31.4 K  2008/01/18 - 00:06  TerrenceMartin  OSG 5 Load 2K
osg-5-load-2kx2.png             32.7 K  2008/01/18 - 02:18  TerrenceMartin  OSG 5 Load WS 2K x 2
osg-5-mem-2k.png                29.1 K  2008/01/18 - 00:07  TerrenceMartin  OSG 5 Mem 2K
osg-5-mem-2kx2.png              29.2 K  2008/01/18 - 02:18  TerrenceMartin  OSG 5 Mem WS 2K x 2
osg-gw-5-CondorLoad-2kx2-2.png  30.7 K  2008/01/18 - 03:44  TerrenceMartin  Condor Load 2kx2 WS GRAM Test 2
osg-gw-5-load-2kx2-2.png        36.2 K  2008/01/18 - 03:45  TerrenceMartin  OSG GW5 Load 2kx2 Test 2
osg-gw-5-mem-2kx2-2.png         29.4 K  2008/01/18 - 03:45  TerrenceMartin  OSG GW5 Mem 2kx2 Test 2
uaf-1-load-2k.png               30.6 K  2008/01/18 - 00:09  TerrenceMartin  UAF 1 Load 2K
uaf-1-load-2kx2.png             32.3 K  2008/01/18 - 02:19  TerrenceMartin  UAF 1 Load WS 2K x 2
uaf-1-load.png                  25.3 K  2008/01/18 - 00:00  TerrenceMartin  UAF Load 1K
uaf-1-mem-2k.png                27.3 K  2008/01/18 - 00:09  TerrenceMartin  UAF 1 Memory 2K
uaf-1-mem.png                   23.5 K  2008/01/18 - 00:00  TerrenceMartin  UAF Mem 1K
uaf-2-load-2kx2.png             31.3 K  2008/01/18 - 02:20  TerrenceMartin  UAF 2 Load WS 2K x 2
Topic revision: r3 - 2008/01/18 - 04:25:46 - TerrenceMartin