WS Gram Testing
Table of Contents
Introduction
The following are results from UCSD WS GRAM testing, with Pre-WS GRAM results included for comparison.
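Jobs in these tests were submitted through Condor-G (condor_rm and the Gridmanager are referenced in the notes below). As a rough sketch only, a grid-universe submission targeting a GT4 (WS GRAM) CE looks something like the following; the hostname, port, jobmanager type, and job count are placeholders, not the actual test configuration.

# Hypothetical submit description for a WS GRAM (GT4) CE; the hostname, port,
# jobmanager type, and job count are placeholders, not the test configuration.
cat > ws_gram_test.sub <<'EOF'
universe       = grid
grid_resource  = gt4 https://ce.example.edu:9443 Condor
executable     = /bin/sleep
arguments      = 300
output         = sleep.$(Cluster).$(Process).out
error          = sleep.$(Cluster).$(Process).err
log            = ws_gram_test.log
notification   = Never
queue 1000
EOF

# Submit the batch from the submit host.
condor_submit ws_gram_test.sub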
Tests
WS GRAM Test 1000 Jobs Submitted from 1 Submitter (1K total Jobs)
This test used WS GRAM. All jobs finished, but some jobs went into a hold state, probably in the 5% range.
- OSG GW 5 CE:
- OSG 5 Load 1K:
- OSG GW 5 Network 1K:
- OSG GW 5 Memory 1K:
- UAF Load 1K:
- UAF Mem 1K:
WS GRAM Test 2000 Jobs Submitted from 1 Submitter (2K total Jobs)
This test showed a higher count of held jobs. Another notable discovery was that submitting even trivial numbers of additional WS GRAM jobs, including jobs targeting other CEs, was delayed by up to 1 hour.
- OSG 5 Condor Load 2K:
- OSG 5 Load 2K:
- OSG 5 Mem 2K:
- UAF 1 Load 2K:
- UAF 1 Memory 2K:
- GUMS Load 2K:
WS GRAM 2K x 2 Submitters
A significant fraction of these jobs went into a hold state.
- GUMS Load WS 2K x 2:
- OSG 5 Condor Load WS 2K x 2:
- OSG 5 Load WS 2K x 2:
- OSG 5 Mem WS 2K x 2:
- UAF 1 Load WS 2K x 2:
- UAF 2 Load WS 2K x 2:
Follow-up WS GRAM 2000 x 2 Submitters, 5%+ Hold Result
This test resulted in a job hold rate greater than 5% and excessive gatekeeper load. Errors included problems with authentication.
- Condor Load 2kx2 WS GRAM Test 2:
- OSG GW5 Load 2kx2 Test 2:
- OSG GW5 Mem 2kx2 Test 2:
Pre-WS GRAM Comparison Test 2000 Jobs Submitted from 2 Submitters (4K Jobs)
- Of particular note is that the submission rate of Pre-WS GRAM jobs is approximately 1 Hz (1380 jobs in 1356 s ~= 1.0 Hz).
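For comparison with the WS GRAM sketch in the introduction, the Pre-WS GRAM jobs differ mainly in the grid_resource line of the submit description, which points at a GT2 gatekeeper contact string instead of a GT4 service. The hostname and jobmanager type below are placeholders, not the test setup.

# Pre-WS (GT2) variant of the earlier submit sketch: only the grid_resource
# line changes; hostname and jobmanager type are placeholders.
cat > pre_ws_gram_test.sub <<'EOF'
universe       = grid
grid_resource  = gt2 ce.example.edu/jobmanager-condor
executable     = /bin/sleep
arguments      = 300
log            = pre_ws_gram_test.log
queue 2000
EOF
condor_submit pre_ws_gram_test.sub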
Notes
- As the number of queued jobs increases (still within acceptable levels for Pre-WS GRAM), about 5% of jobs become held for one reason or another. Configuration changes to the timeouts have reduced the variety of timeout errors.
- One of the more recent tests (2K x 1 submitter) resulted in 0 held jobs; however, two jobs never started at all. A hold and release cycle got them running again (see the condor_q/condor_release sketch after these notes).
- In a follow-up 2K x 2 submitter test, about 2 hours in, 5% of the jobs (204/4000) had gone into various hold states:
HoldReason = "Globus error: GT4_GRAM_JOB_SUBMIT timed out"
...
HoldReason = "Globus error: GT4_GRAM_JOB_DESTROY timed out"
HoldReason = "Globus error: GT4_GRAM_JOB_DESTROY timed out"
HoldReason = "Globus error: GT4_GRAM_JOB_DESTROY timed out"
HoldReason = "Globus error: GT4_GRAM_JOB_DESTROY timed out"
HoldReason = "Globus error: GT4_GRAM_JOB_DESTROY timed out"
HoldReason = "Globus error: GT4_GRAM_JOB_DESTROY timed out"
...
HoldReason = "Globus error: GT4_GRAM_JOB_SUBMIT timed out"
HoldReason = "Globus error: GT4_GRAM_JOB_SUBMIT timed out"
HoldReason = "Globus error: GT4_GRAM_JOB_SUBMIT timed out"
LastHoldReason = "Spooling input data files"
HoldReason = "Globus error 155: the job manager could not stage out a file"
HoldReason = "Globus error 155: the job manager could not stage out a file"
HoldReason = "Globus error 155: the job manager could not stage out a file"
HoldReason = "Globus error 155: the job manager could not stage out a file"
...
HoldReason = "Globus error: org.globus.wsrf.impl.security.authorization.exceptions.AuthorizationException: \"/DC=org/DC=doegrids/OU=People/CN=Terr
ence Martin 525658\" is not authorized to use operation: {http://www.globus.org/namespaces/2004/10/gram/job}createManagedJob on this service"
HoldReason = "Globus error: GT4_GRAM_JOB_DESTROY timed out"
HoldReason = "Globus error: org.globus.wsrf.impl.security.authorization.exceptions.AuthorizationException: \"/DC=org/DC=doegrids/OU=People/CN=Terr
ence Martin 525658\" is not authorized to use operation: {http://www.globus.org/namespaces/2004/10/gram/job}createManagedJob on this service"
- Configuration changes on the submitter have resulted in the Gridmanager-related Java process growing to more than 1.1 GB of used memory. Each user that submits from a given host gets their own Java process; the figure above is just the current largest Java process size. It may be necessary to allow Java to consume even more memory if more than 2K jobs are submitted (see the process-size check after these notes).
- CPU load on the gatekeeper receiving WS GRAM jobs is similar to the load on the gatekeeper receiving Pre-WS GRAM jobs.
- CPU load on the submitter is much higher with WS GRAM than with Pre-WS GRAM.
- GUMS server load is slightly higher with WS GRAM than with Pre-WS GRAM.
- Removal of larger numbers of submitted jobs via condor_rm (as few as 1K) is not possible without manual intervention on both the submitter and the gatekeeper. WS GRAM fails to properly remove the jobs from the gatekeeper, which requires that the persisted directory on the gatekeeper be manually purged (see the cleanup sketch after these notes). This is a known WS GRAM limitation that is apparently fixed in a more recent version.
- Ramp-up time is quite slow: in one test it took 20 minutes to start 76 of the 2K jobs, roughly 15 s per job.
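For the held-job notes above, something like the following can be run on the submit host to list hold reasons and perform the hold/release cycle; the constraint expressions and the cluster.proc value are illustrative assumptions, not the exact commands used in these tests.

# List held jobs and their hold reasons (JobStatus 5 == Held).
condor_q -constraint 'JobStatus == 5' \
         -format '%d.' ClusterId -format '%d  ' ProcId -format '%s\n' HoldReason

# Summarize the distinct hold reasons and how often each occurs.
condor_q -constraint 'JobStatus == 5' -format '%s\n' HoldReason | sort | uniq -c | sort -rn

# Hold/release cycle for a stuck job (cluster.proc is a placeholder).
condor_hold 1234.0
condor_release 1234.0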
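For the Gridmanager memory note, the per-user Java process sizes on the submit host can be watched with a plain ps; matching on the process name is approximate.

# Show the largest Java processes by resident size; each submitting user
# should have one Gridmanager-related Java process.
ps -eo user,pid,rss,vsz,comm --sort=-rss | grep -i java | head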
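For the condor_rm note, a rough sketch of the cleanup that was needed follows. The persisted-state path is an assumption based on a default GT4 container layout for the user running the container; verify the actual location before removing anything.

# On the submit host: remove the WS GRAM jobs from the local queue
# (JobUniverse 9 == grid universe).
condor_rm -constraint 'JobUniverse == 9'

# On the gatekeeper: WS GRAM can leave per-job state behind, so the persisted
# directory has to be purged by hand (assumed default GT4 location shown).
ls ~globus/.globus/persisted/
# rm -rf ~globus/.globus/persisted/*    # only after stopping the container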
--
TerrenceMartin - 17 Jan 2008