WS GRAM Testing

Introduction

The following are results from UCSD WS GRAM testing, with Pre-WS GRAM results included for comparison.

Notable Findings

The following are some of the behaviors of GT4 observed during testing.

Most Important Issues

  • With GT4, on loads of 1000 jobs or greater, or in tests that involved two submitters with at least 1000 jobs each, it was common to have up to 5% of submitted jobs go into a Hold state on the submitter(s). During larger tests the rate of held jobs increased as a percentage of total jobs. A variety of reasons were given by the GRAM for these jobs being held. When the jobs were released with the condor_release command, it was found that in most if not all cases the jobs did eventually finish successfully (a sketch of inspecting and releasing held jobs follows this list).

  • One of the configuration changes made to reduce the occurrence of certain GRAM errors was increasing the maximum Java heap memory on the submitter to 1024MB. This greatly reduced the frequency of, or eliminated, some of the errors that resulted in held jobs. While the change was successful in that respect, the increased resource consumption on the submitter is a concern, particularly for multi-user submission nodes or nodes running multiple schedds.

  • CPU load on the submitters was much higher with GT4 than it was with GT2

  • Removal of larger numbers of submitted jobs via condor_rm (as few as 1000 jobs) is not possible without administrator intervention on the submitter and gatekeeper. This is a known WS GRAM limitation that is reportedly fixed in a more recent version.
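
A minimal sketch of the kind of commands used on a submitter to inspect and release held grid-universe jobs is shown below. The constraint expressions are illustrative; in practice the hold reasons reported by condor_q were reviewed before releasing.

# Sketch only: list held grid-universe (Condor-G) jobs together with their
# hold reasons, then release them so they are retried.
# JobUniverse == 9 selects grid-universe jobs; JobStatus == 5 means Held.
condor_q -hold -constraint 'JobUniverse == 9'
condor_release -constraint 'JobUniverse == 9 && JobStatus == 5'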

Remaining Findings

  • GT4 was able to handle up to 1000 jobs submitted from one submit host. In the last test of 1000 submissions all jobs completed without any errors. This level of operation was achieved after a series of configuration modifications to the standard OSG 0.8.0 WS GRAM installation and client configuration. A jar file was also provided that allowed for some additional configuration flexibility on the submitter.

  • GT4 was considerably slower than GT2 to submit jobs to the gatekeeper at higher job counts. With 2000 jobs submitted from each of 2 submitters, GT2 was able to schedule all 4000 jobs in about 1 hour. GT4 took several hours and, because the jobs completed in 1 hour, never had more than slightly over 2K jobs in the queue at once. At lower job counts GT4 was more responsive.

  • During testing it was found that once the submitter was busy with an active test, any additional job submissions by the test user, even those targeted at another gatekeeper, experienced a significant delay (approximately an hour) before the submitter sent them to the gatekeeper.

  • CPU load on the gatekeeper was slightly lower with GT4 than it was with GT2

  • GUMS server load was not notably higher with GT4 than it was with GT2

  • The GT4 gatekeeper service was stable during the testing. The only times the service was restarted were to perform a configuration change or to cleanly remove the persistent job areas after problems removing large numbers of jobs from the submission queue.

  • The GT4 gatekeeper never experienced excessive loads during the tests. The gatekeeper remained responsive.

Tests

The following graphs show the loads on both the submitter(s) and the gatekeeper under different levels of total job submission. All jobs in each test were submitted with a single condor_submit command.
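
For reference, each batch was started with a single command of the following form. The submit file name is illustrative; the actual submit descriptions are listed in the Test Scripts section below.

# Submit the entire batch (the submit description ends with "queue 2000")
# and then follow its progress from the submitter.
condor_submit wsgram.sub
condor_q -globus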

WS GRAM Test 1000 Jobs Submitted from 1 Submitter (1K total Jobs)

During this test some of the jobs became held and had to be released with condor_release. In a subsequent 1000-job test, not shown here, all jobs completed without any held jobs. This may indicate that 1000 jobs is near a threshold for the current configuration and code.

  • OSG GW 5 CE:
    gw5-condor.png
  • Ramp up time was fairly quick

  • OSG 5 Load 1K:
    gw5-load.png
  • Load spikes during submission and then trails off

  • OSG GW 5 Network 1K:
    gw5-network.png
  • As expected, the network traffic was minimal, with small spikes at job submission and again at job completion

  • OSG GW 5 Memory 1K:
    gw5-mem.png
  • Memory consumption increased during the test and then returned to about pre-test levels.

  • UAF Load 1K:
    uaf-1-load.png
  • Submitter load spikes during initial submission

  • UAF Mem 1K:
    uaf-1-mem.png
  • The submission consumes some of the free RAM

WS GRAM Test 2000 Jobs Submitted from 1 Submitter (2K total Jobs)

This test showed a higher count of held jobs and an increase in resource consumption.

  • OSG 5 Condor Load 2K:
    osg-5-condor-load-2K.png

  • OSG 5 Load 2K:
    osg-5-load-2K.png

  • OSG 5 Mem 2K:
    osg-5-mem-2k.png

  • UAF 1 Load 2K:
    uaf-1-load-2k.png

  • UAF 1 Memory 2K:
    uaf-1-mem-2k.png

  • GUMS Load 2K:
    Gums-Load-2K.png

WS GRAM Test 2000 Jobs Submitted from 2 Submitters (4K total Jobs)

A significant number of these jobs went into a Hold state, reaching around 7% held jobs before the test was stopped.

  • GUMS Load WS 2K x 2:
    gums-load-2kx2.png

  • OSG 5 Condor Load WS 2K x 2:
    osg-5-condor-load-2kx2.png
  • The rate of jobs being queued on the gatekeeper was fairly low, taking 2 hours for the queue to reach its peak. The peak was well under 4K jobs, indicating that a significant number of jobs had already completed by that point.

  • OSG 5 Load WS 2K x 2:
    osg-5-load-2kx2.png

  • OSG 5 Mem WS 2K x 2:
    osg-5-mem-2kx2.png
  • Unused memory dropped very low in this test.

  • UAF 1 Load WS 2K x 2:
    uaf-1-load-2kx2.png

  • UAF 2 Load WS 2K x 2:
    uaf-2-load-2kx2.png

Follow-up WS GRAM Test 2000 Jobs x 2 Submitters (5%+ Hold Result)

This test was run with the same job counts as the previous test. Again there was a greater than 5% job hold rate and excessive gatekeeper load. Hold errors included problems with authentication.

  • Condor Load 2kx2 WS GRAM Test 2:
    osg-gw-5-CondorLoad-2kx2-2.png
  • As before, the submission rate was fairly slow: about 2.5 hours to reach a peak of a little over 2000 jobs out of the 4000 submitted in total

  • OSG GW5 Load 2kx2 Test 2:
    osg-gw-5-load-2kx2-2.png

  • OSG GW5 Mem 2kx2 Test 2:
    osg-gw-5-mem-2kx2-2.png

Pre-WS GRAM Comparison Test 2000 Jobs Submitted from 2 Submitters (4K Jobs)

  • The rate of submission of Pre-WS GRAM jobs was approximately 1 Hz (1380/1356 jobs ~= 1.0 Hz)

  • No jobs were held

  • Overall load was not problematic on the gatekeeper, and not significant on the submitter

  • Comparison Condor Load WS vs Pre WS:
    comparisonload-2.png
  • This graph shows the difference in submission rate between GT2 and GT4 at this level

  • Comparison Condor Load WS vs Pre WS (long view):
    comparisonload.png
  • This is the same graph as above but at a lower level of detail.

  • Final Pre WS Condor Load:
    FinalLoad.png
  • This is the complete GT2 cycle

  • Final Pre WS System Load:
    finalsystemload.png
  • The initial load spike in this graph was at the tail end of the last GT4 test and likely caused by the removal of a few dozen lingering jobs. During the GT2 test the gatekeeper load was around 20-25 (approximately a load of 6/CPU).

  • UAF 1 System Load Pre WS:
    uaf-1loadprews.png
  • GT2 has a much lower CPU load on submitters than GT4

  • UAF 1 System Mem Pre WS:
    uaf-1memprews.png

  • UAF 2 System Load Pre WS:
    uaf-2loadprews.png
  • This graph is a bit spiky, possibly due to user activity on the submitter

  • UAF 2 System Mem Pre WS:
    uaf-2memprews.png

Test Setup

The following details some of the hardware and software configurations used during testing. Included are the Condor submit scripts and the scripts used to exercise the submission infrastructure. The focus of these tests was to exercise the submitter's and gatekeeper's ability to handle increasing numbers of job requests. To isolate the activity to job submission, the job itself was very light, with minimal data transfer between submitter and gatekeeper.

The job submission counts were chosen based on experience running production OSG clusters. Job counts of 1000-2000 jobs per submitter, as well as combinations of multiple submitters, are fairly typical on the Open Science Grid.

All jobs were submitted through the Condor-G infrastructure.

The test job run on the compute nodes consisted of a single shell script that went into a 3600-second sleep after running a small sub-script. This sub-script was downloaded via the GT infrastructure as an input file included with the job.

The Condor pool was configured to offer the test jobs ~1700 free job slots.
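
The submit descriptions below tag each job with a custom '+SleepSlot' attribute via globus_rsl. The pool-side policy that matched these jobs to slots was not recorded here; the fragment below is a hypothetical example of the kind of Condor start expression that could dedicate slots to such tagged jobs.

# Hypothetical worker-node condor_config fragment (an assumption, not taken
# from the actual test setup): only start jobs advertising SleepSlot = TRUE.
START = ($(START)) && (TARGET.SleepSlot =?= True)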

Gatekeeper

  • Dual Core Dual CPU Opteron 275 (64bit)
  • 8GB RAM
  • Centos 5 (RHEL)
  • OSG 0.8.0 CE (Configured with NFS Lite condor.pm)
  • Condor 6.9.5

Submitters

  • Dual Processor Single Core 3.06 GHz Intel (32bit)
  • 4GB RAM
  • Centos 4 (RHEL)
  • OSG 0.8.0 Client
  • Condor 6.9.5

  • Dual Processor Dual Core 2.0 GHz Intel (64bit)
  • 8GB RAM
  • Centos 4 (RHEL)
  • OSG 0.8.0 Client
  • Condor 6.9.5

Test Scripts

Condor Submit Script WS GRAM

Universe=grid
Grid_Type = gt4
Jobmanager_Type = Condor
GlobusScheduler=https://osg-gw-5.t2.ucsd.edu:9443
executable=/home/users/tmartin/Cluster_Tests/ENV_test/re.sh
globus_rsl=(condor_submit=('+SleepSlot' 'TRUE'))
transfer_executable=True
stream_output = False
stream_error  = False
WhenToTransferOutput = ON_EXIT
transfer_input_files = /data/tmp/myscript.sh
log    = /tmp/wsgram.log

arguments = 3600
output = output/wsgram.out.$(Process)
error  = output/wsgram.err.$(Process)
notification=Never
queue 2000

Condor Submit Script Pre-WS (GT2) GRAM

universe=globus
GlobusScheduler=osg-gw-5.t2.ucsd.edu:/jobmanager-condor
executable=/home/users/tmartin/Cluster_Tests/ENV_test/re.sh
globus_rsl=(condor_submit=('+SleepSlot' 'TRUE'))
transfer_executable=True
stream_output = False
stream_error  = False
WhenToTransferOutput = ON_EXIT
transfer_input_files = /data/tmp/myscript.sh
log    = /tmp/wsgram.log

arguments = 3600
output = output/wsgram.out.$(Process)
error  = output/wsgram.err.$(Process)
notification=Never
queue 2000

Test Script

#!/bin/sh

echo "starting WS Test...."
/usr/bin/whoami
/bin/hostname
chmod 755 myscript.sh
./myscript.sh
COUNT=$1
sleep $COUNT
echo "done WS Test.

Test Sub Script

Downloaded to the node via GRAM and then called from the test program.

#!/bin/sh

echo "I am myscript!!!"

Gridmanager Settings

These settings were the ones used for most of the results in this report, including the Pre-WS GT2 comparisons.

MAX_GRIDMANAGER_LOG     = 10000000
GRIDMANAGER_DEBUG       = D_COMMAND D_FULLDEBUG

# GT4 requires a gftp server to be running locally
#GRIDFTP_URL_BASE = gsiftp://$(FULL_HOSTNAME)
GRIDFTP_SERVER = $(LIBEXEC)/globus-gridftp-server
GRIDFTP_SERVER_WRAPPER = $(LIBEXEC)/gridftp_wrapper.sh

# Bump up the max jobs that can be submitted via condor-g
GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE=2000
GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE=50

JAVA_EXTRA_ARGUMENTS = -Xmx1024M
GRIDMANAGER_GAHP_CALL_TIMEOUT = 600
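
A minimal sketch of applying these settings, assuming a typical Condor layout on the submitter; the configuration file path here is an assumption and not taken from the test machines.

# Hypothetical example: the local config file location is an assumption.
cat >> /opt/condor/local/condor_config.local <<'EOF'
JAVA_EXTRA_ARGUMENTS = -Xmx1024M
GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 2000
GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE = 50
EOF
# Ask the running Condor daemons to re-read their configuration.
condor_reconfig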

Submitter GFTP server configuration

control_preauth_timeout 300

Gatekeeper GFTP server configuration

control_preauth_timeout 300

GT4 Server Side Settings

inetd 1
log_level ERROR,WARN,INFO
log_single /osglocal/osgce/globus/var/log/gridftp-auth.log
log_transfer /osglocal/osgce/globus/var/log/gridftp.log
control_preauth_timeout 300

GT4 Server Configuration

The full GT4 container configuration is in the attached server-config.wsdd.

-- TerrenceMartin - 17 Jan 2008

Topic attachments
Attachment Size Date Who Comment
FinalLoad.png 32.3 K 2008/01/18 - 07:31 TerrenceMartin Final Pre WS Condor Load
Gums-Load-2K.png 34.8 K 2008/01/18 - 00:10 TerrenceMartin GUMS Load 2K
comparisonload-2.png 30.9 K 2008/01/18 - 07:30 TerrenceMartin Comparison Condor Load WS vs Pre WS
comparisonload.png 34.0 K 2008/01/18 - 07:31 TerrenceMartin Comparison Condor Load WS vs Pre WS (long view)
finalsystemload.png 32.6 K 2008/01/18 - 07:31 TerrenceMartin Final Pre WS System Load
gums-load-2kx2.png 34.1 K 2008/01/18 - 02:17 TerrenceMartin GUMS Load WS 2K x 2
gw5-condor.png 26.5 K 2008/01/17 - 23:55 TerrenceMartin OSG GW 5 CE 1K
gw5-load.png 27.3 K 2008/01/17 - 23:57 TerrenceMartin OSG 5 Load 1K
gw5-mem.png 23.9 K 2008/01/18 - 00:02 TerrenceMartin OSG GW 5 Memory 1K
gw5-network.png 24.6 K 2008/01/17 - 23:59 TerrenceMartin OSG GW 5 Network 1K
osg-5-condor-load-2K.png 29.9 K 2008/01/18 - 00:06 TerrenceMartin OSG 5 Condor Load 2K
osg-5-condor-load-2kx2.png 30.5 K 2008/01/18 - 02:18 TerrenceMartin OSG 5 Condor Load WS 2K x 2
osg-5-load-2K.png 31.4 K 2008/01/18 - 00:06 TerrenceMartin OSG 5 Load 2K
osg-5-load-2kx2.png 32.7 K 2008/01/18 - 02:18 TerrenceMartin OSG 5 Load WS 2K x 2
osg-5-mem-2k.png 29.1 K 2008/01/18 - 00:07 TerrenceMartin OSG 5 Mem 2K
osg-5-mem-2kx2.png 29.2 K 2008/01/18 - 02:18 TerrenceMartin OSG 5 Mem WS 2K x 2
osg-gw-5-CondorLoad-2kx2-2.png 30.7 K 2008/01/18 - 03:44 TerrenceMartin Condor Load 2kx2 WS GRAM Test 2
osg-gw-5-load-2kx2-2.png 36.2 K 2008/01/18 - 03:45 TerrenceMartin OSG GW5 Load 2kx2 Test 2
osg-gw-5-mem-2kx2-2.png 29.4 K 2008/01/18 - 03:45 TerrenceMartin OSG GW5 Mem 2kx2 Test 2
server-config.wsdd 14.3 K 2008/01/18 - 22:40 RamiVanguri GT4 Server Configuration
uaf-1-load-2k.png 30.6 K 2008/01/18 - 00:09 TerrenceMartin UAF 1 Load 2K
uaf-1-load-2kx2.png 32.3 K 2008/01/18 - 02:19 TerrenceMartin UAF 1 Load WS 2K x 2
uaf-1-load.png 25.3 K 2008/01/18 - 00:00 TerrenceMartin UAF Load 1K
uaf-1-mem-2k.png 27.3 K 2008/01/18 - 00:09 TerrenceMartin UAF 1 Memory 2K
uaf-1-mem.png 23.5 K 2008/01/18 - 00:00 TerrenceMartin UAF Mem 1K
uaf-1loadprews.png 30.9 K 2008/01/18 - 07:32 TerrenceMartin UAF 1 System Load Pre WS
uaf-1memprews.png 26.9 K 2008/01/18 - 07:32 TerrenceMartin UAF 1 System Mem Pre WS
uaf-2-load-2kx2.png 31.3 K 2008/01/18 - 02:20 TerrenceMartin UAF 2 Load WS 2K x 2
uaf-2loadprews.png 35.3 K 2008/01/18 - 07:33 TerrenceMartin UAF 2 System Load Pre WS
uaf-2memprews.png 26.5 K 2008/01/18 - 07:33 TerrenceMartin UAF 2 System Mem Pre WS