WS GRAM Testing
Introduction
The following are results from UCSD WS GRAM testing, with Pre-WS GRAM results included for comparison.
Notable Findings
The following are some of the behaviors of GT4 observed during testing.
Most Important Issues
- With GT4, on loads of 1000 jobs or greater, or in tests that involved two submitters with at least 1000 jobs each, it was common for up to 5% of submitted jobs to go into a Hold state on the submitter(s). During larger tests the rate of held jobs increased as a percentage of total jobs. GRAM gave a variety of reasons for holding these jobs. When the jobs were released with the condor_release command it was found that in most if not all cases they did eventually finish successfully (see the sketch after this list).
- One of the configuration changes made to reduce the occurrence of certain GRAM errors was increasing the maximum Java heap memory on the submitter to 1024MB (see JAVA_EXTRA_ARGUMENTS under Gridmanager Settings below). This greatly reduced the frequency of, or eliminated, some of the errors that resulted in held jobs. While the change was successful, the increased resource consumption on the submitter is a concern, particularly for multi-user submission nodes or nodes running multiple schedds.
- CPU load on the submitters was much higher with GT4 than it was with GT2
- Removal of larger numbers of submitted jobs via condor_rm (as few as 1000 jobs) is not possible without administrator intervention on the submitter and gatekeeper. This is a known WS GRAM limitation that is apparently fixed in a more recent version.
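As an illustration of the manual handling involved, the following is a minimal sketch of the Condor commands that might be used on a submitter in these situations; the cluster id shown is hypothetical and exact options may vary with Condor version.

# List held jobs along with the hold reason reported by GRAM
condor_q -hold
# Release the test user's held jobs; in most cases they then ran to completion
condor_release tmartin
# Removing an entire large cluster is where administrator cleanup on the
# submitter and gatekeeper was required with this version of WS GRAM
condor_rm 1234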
Remaining Findings
- GT4 was able to handle up to 1000 jobs submitted from 1 submit host. In the last test of 1000 submissions all jobs completed without any errors. This level of operation was achieved after a series of configuration modifications were made to the standard OSG 0.8.0 WS GRAM installation and client configuration (the final settings are listed under Test Setup below). A jar file was also provided that allowed for some additional configuration flexibility on the submitter.
- GT4 was considerably slower to submit jobs to the gatekeeper than GT2 at higher job counts. With 2000 jobs submitted from each of 2 submitters, GT2 was able to schedule all 4000 jobs in about 1 hour. GT4 took several hours and, because jobs completed in 1 hour, was never able to get more than slightly over 2K jobs into the queue at once. At lower job counts GT4 was more responsive.
- During testing it was found that once the submitter was already working through a large submission, any additional job submissions by the test user, even if they were targeted at another gatekeeper, experienced a significant delay of approximately an hour before the submitter sent them to the gatekeeper.
- CPU load on the gatekeeper was slightly lower with GT4 than it was with GT2
- GUMS server load was not notably higher with GT4 than it was with GT2
- The GT4 gatekeeper service was stable during the testing. The only times the service was restarted were to perform a configuration change or to cleanly remove the persistent areas after problems removing large numbers of jobs from the submission queue.
- The GT4 gatekeeper never experienced excessive loads during the tests. The gatekeeper remained responsive.
Tests
The following graphs show the load on both the submitter(s) and the gatekeeper under different levels of total job submission. All jobs for a given test were submitted via a single condor_submit command per submitter (see the sketch below).
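For reference, each test run amounted to one condor_submit invocation per submitter of the appropriate submit file (listed under Test Setup below), after which the queue was watched from the submitter; the submit file name used here is only illustrative.

# Submit all jobs for a test in one command (submit file as shown under Test Scripts)
condor_submit wsgram_test.sub
# The summary line of condor_q shows how many jobs are idle, running, or held
condor_q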
WS GRAM Test 1000 Jobs Submitted from 1 Submitter (1K total Jobs)
During this test some of the jobs became held and had to be released via condor_release. In a subsequent 1000 job test, not shown here, all jobs completed without any held jobs. This may indicate that this job count is a threshold for the current configuration and code.
- OSG GW 5 CE:
- Ramp up time was fairly quick
- OSG 5 Load 1K:
- Load spikes during submission and then trails off
- OSG GW 5 Network 1K:
- As expected the network traffic was minimal with small spikes at job submission and then again at job completion
- OSG GW 5 Memory 1K:
- Memory consumption increased during the test and then returned to about pre-test levels.
- UAF Load 1K:
- Submitter load spikes during initial submission
- UAF Mem 1K:
- The submission consumes some of the free ram
WS GRAM Test 2000 Jobs Submitted from 1 Submitter (2K total jobs)
This test showed a higher count of held jobs and an increase in resource consumption.
- OSG 5 Condor Load 2K:
- OSG 5 Load 2K:
- OSG 5 Mem 2K:
- UAF 1 Load 2K:
- UAF 1 Memory 2K:
- GUMS Load 2K:
WS GRAM Test 2000 Jobs Submitted from 2 Submitters (4K total jobs)
A significant number of these jobs went into a Hold state, reaching around 7% held jobs before the test was stopped.
- GUMS Load WS 2K x 2:
- OSG 5 Condor Load WS 2K x 2:
- The rate of jobs being queued on the gatekeeper was fairly low; it took 2 hours before the queue hit its peak, and the peak was well under 4K jobs, indicating that a significant number of jobs had already completed by that point.
- OSG 5 Load WS 2K x 2:
- OSG 5 Mem WS 2K x 2:
- Unused memory dropped very low in this test.
- UAF 1 Load WS 2K x 2:
- UAF 2 Load WS 2K x 2:
Followup WS GRAM Test 2000 Jobs x 2 Submitters (5%+ Hold Result)
This test was run with the same job counts as the previous test. Again there was a greater than 5% job hold rate and excessive gatekeeper load. Hold errors included problems with authentication.
- Condor Load 2kx2 WS GRAM Test 2:
- As before, the submission rate was fairly slow: about 2.5 hours to reach a peak of a little over 2000 jobs out of the 4000 submitted in total.
- OSG GW5 Load 2kx2 Test 2:
- OSG GW5 Mem 2kx2 Test 2:
Pre-WS GRAM Comparison Test 2000 Jobs Submitted from 2 Submitters (4K Jobs)
- Rate of submission of Pre-WS GRAM jobs is approximately 1 Hz (1380/1356 ≈ 1.0 Hz).
- Overall load was not problematic on the gatekeeper, and not significant on the submitter
- Comparison Condor Load WS vs Pre WS:
- This graph shows the difference in submission rate between GT2 and GT4 at this level
- Comparison Condor Load WS vs Pre WS (long view):
- This is the same graph as above but at a lower level of detail.
- Final Pre WS Condor Load:
- This is the complete GT2 cycle
- Final Pre WS System Load:
- The initial load spike in this graph was at the tail end of the last GT4 test and likely caused by the removal of a few dozen lingering jobs. During the GT2 test the gatekeeper load was around 20-25 (approximately a load of 6/CPU).
- UAF 1 System Load Pre WS:
- GT2 has a much lower CPU load on submitters than GT4
- UAF 1 System Mem Pre WS:
- UAF 2 System Load Pre WS:
- This graph is a bit spiky, possibly due to user activity on the submitter.
- UAF 2 System Mem Pre WS:
Test Setup
The following details some of the hardware and software configurations used during testing. Included are the Condor submit scripts and the scripts used to test the submission infrastructure. The focus of these tests was to exercise the ability of the submitter and gatekeeper to handle increasing numbers of job requests. To isolate the activity to submission, the job itself was very light, with minimal data transfer between submitter and gatekeeper.
The job submission counts were chosen based on experience running production OSG clusters. Job counts of 1000-2000 jobs per submitter, as well as the combination of multiple submitters, are fairly typical in the Open Science Grid.
All jobs were submitted through the Condor-G infrastructure.
The test job run on the compute nodes consisted of a single shell script that went into a 3600 second sleep after running a small sub shell script. The sub script was downloaded via the GT infrastructure as an input file included with the job.
The condor pool was configured to offer the test jobs ~1700 free job slots.
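As a sanity check of slot availability before a run, the pool can be summarized on the gatekeeper as sketched below; this is only an illustration and was not part of the test scripts.

# Summary of slots in the gatekeeper's Condor pool (~1700 were left free for these tests)
condor_status -total
# Per-user totals of running/idle/held jobs while the tests progress
condor_status -submitters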
Gatekeeper
- Dual Core Dual CPU Opteron 275 (64bit)
- 8GB RAM
CentOS 5 (RHEL)
- OSG 0.8.0 CE (Configured with NFS Lite condor.pm)
- Condor 6.9.5
Submitters
32-bit submitter:
- Dual Processor Single Core 3.06GHz Intel (32bit)
- 4GB RAM
- CentOS 4 (RHEL)
- OSG 0.8.0 Client
- Condor 6.9.5
64-bit submitter:
- Dual Processor Dual Core 2.0GHz Intel (64bit)
- 8GB RAM
- CentOS 4 (RHEL)
- OSG 0.8.0 Client
- Condor 6.9.5
Test Scripts
Condor Submit Script WS GRAM
Universe=grid
Grid_Type = gt4
Jobmanager_Type = Condor
# WS GRAM (GT4) container on the gatekeeper
GlobusScheduler=https://osg-gw-5.t2.ucsd.edu:9443
executable=/home/users/tmartin/Cluster_Tests/ENV_test/re.sh
# Adds +SleepSlot = TRUE to the job submitted on the gatekeeper's Condor pool
globus_rsl=(condor_submit=('+SleepSlot' 'TRUE'))
transfer_executable=True
stream_output = False
stream_error = False
WhenToTransferOutput = ON_EXIT
# Small sub script transferred with the job (see Test Sub Script below)
transfer_input_files = /data/tmp/myscript.sh
log = /tmp/wsgram.log
# Sleep duration in seconds passed to re.sh
arguments = 3600
output = output/wsgram.out.$(Process)
error = output/wsgram.err.$(Process)
notification=Never
# 2000 copies of the job per test run
queue 2000
Condor Submit Script Pre-WS (GT2) GRAM
universe=globus
GlobusScheduler=osg-gw-5.t2.ucsd.edu:/jobmanager-condor
executable=/home/users/tmartin/Cluster_Tests/ENV_test/re.sh
globus_rsl=(condor_submit=('+SleepSlot' 'TRUE'))
transfer_executable=True
stream_output = False
stream_error = False
WhenToTransferOutput = ON_EXIT
transfer_input_files = /data/tmp/myscript.sh
log = /tmp/wsgram.log
arguments = 3600
output = output/wsgram.out.$(Process)
error = output/wsgram.err.$(Process)
notification=Never
queue 2000
Test Script
#!/bin/sh
echo "starting WS Test...."
# Report the mapped user and the execute host for debugging
/usr/bin/whoami
/bin/hostname
# Run the small sub script that was transferred with the job
chmod 755 myscript.sh
./myscript.sh
# Sleep for the duration passed as the first argument (3600 seconds in these tests)
COUNT=$1
sleep $COUNT
echo "done WS Test."
Test Sub Script
Downloaded to the node via the GRAM infrastructure (as a transferred input file) and then called from the test script above.
#!/bin/sh
echo "I am myscript!!!"
Gridmanager Settings
These settings were the ones used for most of the results in this report, including the Pre-WS GT2 comparisons.
MAX_GRIDMANAGER_LOG = 10000000
GRIDMANAGER_DEBUG = D_COMMAND D_FULLDEBUG
# GT4 requires a gftp server to be running locally
#GRIDFTP_URL_BASE = gsiftp://$(FULL_HOSTNAME)
GRIDFTP_SERVER = $(LIBEXEC)/globus-gridftp-server
GRIDFTP_SERVER_WRAPPER = $(LIBEXEC)/gridftp_wrapper.sh
# Bump up the max jobs that can be submitted via condor-g
GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE=2000
GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE=50
# Increase the maximum Java heap on the submitter to 1024MB (see Notable Findings)
JAVA_EXTRA_ARGUMENTS = -Xmx1024M
GRIDMANAGER_GAHP_CALL_TIMEOUT = 600
Submitter GFTP server configuration
control_preauth_timeout 300
Gatekeeper GFTP server configuration
control_preauth_timeout 300
GT4 Server Side Settings
inetd 1
log_level ERROR,WARN,INFO
log_single /osglocal/osgce/globus/var/log/gridftp-auth.log
log_transfer /osglocal/osgce/globus/var/log/gridftp.log
control_preauth_timeout 300
GT4 Server Configuration
-- TerrenceMartin - 17 Jan 2008