Benchmarking the OSG Compute Element
Overview
As part of the OSG extensions work, we want to gain a quantitative understanding of the
scalability and reliability of the OSG Compute Element. This includes understanding failure
rates and the load on the host system for different configurations of the Compute Element,
as a function of the submission rate and the I/O required per submission.
We will want to be able to repeat these tests easily for different configurations of the CE, including
WS-GRAM versus pre-WS-GRAM, as well as minor modifications of the CE configuration.
We will therefore first develop a testing framework.
We will first want to guarantee that the node that hosts the OSG client is not overloaded, in order to
cleanly separate any client load issues from server load issues. In a second step, we will then want to
understand the impact of an overloaded client node on both the server and the overall perceived reliability.
It is thus crucial to have trustworthy measurements of load on client and server throughout these tests.
Similarly, it is crucial to measure job success independently of the condor-g client information in order
to validate the latter.
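As a minimal sketch of the kind of independent load measurement we have in mind (file names and sampling interval are arbitrary choices, and nothing here is an existing OSG tool), one copy of a script along these lines could run on the client host and one on the CE head node:

#!/usr/bin/env python
"""Sample host load independently of the Condor/Globus tooling.

Run one copy on the submission (client) host and one on the CE head node;
the resulting logs let us correlate observed failures with load on either
side.  LOG_FILE and INTERVAL are arbitrary choices for illustration.
"""
import time

LOG_FILE = "load_samples.log"   # hypothetical output location
INTERVAL = 30                   # seconds between samples

def sample_load():
    # /proc/loadavg: 1-, 5-, 15-minute load averages plus process counts
    with open("/proc/loadavg") as f:
        return f.read().strip()

def main():
    with open(LOG_FILE, "a") as out:
        while True:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            out.write("%s %s\n" % (stamp, sample_load()))
            out.flush()
            time.sleep(INTERVAL)

if __name__ == "__main__":
    main()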
Scope of testing
The purpose of the testing is to find the operational point where the CE breaks down,
and to understand the behaviour of the CE leading up to this point.
We hypothesize that the CE will break down for one of the following reasons (a sketch of a submission driver for exercising these scenarios follows the list):
- The steady rate of job submissions exceeds the maximum the CE can handle.
- A burst of submissions exceeds the maximum the CE can handle.
- The steady rate of job terminations exceeds the maximum the CE can handle, especially in combination with increasing stdout/stderr sizes.
- A burst of job terminations exceeds the maximum the CE can handle, especially in combination with increasing stdout/stderr sizes.
- A combination of all of the above exceeds the maximum the CE can handle.
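A driver for the steady-rate and burst scenarios could look roughly like the sketch below. It only controls the pacing of condor_submit calls; 'sleep_job.sub' is a placeholder for a Condor-G submit description aimed at the CE under test, and none of this is an existing OSG or Condor tool:

#!/usr/bin/env python
"""Pace condor_submit calls to produce steady-rate or bursty submission load.

Illustrative sketch only.  'sleep_job.sub' is a placeholder for a Condor-G
submit description (e.g. a sleep job aimed at the CE under test).
"""
import subprocess
import time

SUBMIT_FILE = "sleep_job.sub"   # hypothetical submit description

def submit_once():
    """Run condor_submit and report whether the schedd accepted the job(s)."""
    rc = subprocess.call(["condor_submit", SUBMIT_FILE])
    return rc == 0

def steady(rate_per_min, duration_min):
    """Submit at a fixed rate (submissions per minute) for duration_min.

    The pacing ignores the time condor_submit itself takes, which is
    acceptable for a sketch but should be accounted for in the real framework.
    """
    interval = 60.0 / rate_per_min
    deadline = time.time() + duration_min * 60
    failures = 0
    while time.time() < deadline:
        if not submit_once():
            failures += 1
        time.sleep(interval)
    return failures

def burst(burst_size, pause_min, n_bursts):
    """Submit burst_size jobs back to back, pause, and repeat."""
    failures = 0
    for _ in range(n_bursts):
        for _ in range(burst_size):
            if not submit_once():
                failures += 1
        time.sleep(pause_min * 60)
    return failures

if __name__ == "__main__":
    # Example: 10 submissions/minute for 2 hours, then 5 bursts of 100 jobs.
    print("steady-rate submission failures: %d" % steady(rate_per_min=10, duration_min=120))
    print("burst submission failures: %d" % burst(burst_size=100, pause_min=10, n_bursts=5))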
We are particularly interested in documenting failure modes, as well as failure rates before the
CE completely falls apart. In addition, we are interested in understanding whether there are scenarios in which
the CE can get damaged by load in ways that are not immediately apparent, e.g., reliability decreasing,
or functionality partially disappearing, after an "incomplete" recovery from an overload.
In addition to straightforward use for job submission, we are interested in testing for overload conditions
related to simultaneous load from:
- gridftp transfers and job submission
- use of managed fork and job submission, including the LIGO use case of gridftp jobs submitted via managed fork.
- overloading of the client host, and its impact on the CE.
The last item may deserve some explanation. We understand that the gridmonitor depends on proper functioning
of the client. We are thus curious to see if it is possible to negatively impact a CE by driving the node that hosts the submission
client into overload.
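As a crude illustration of how the client host could be driven into overload in a controlled and repeatable way, something like the following sketch (one busy-loop worker per CPU; memory or I/O pressure could be added in the same fashion) would do. It is an illustration only, not part of any existing framework:

#!/usr/bin/env python
"""Drive the submission (client) host into CPU overload for a fixed period.

Crude illustration only: one busy-loop worker per CPU.  Run alongside a
normal condor-g workload to see whether the gridmonitor, and hence the CE's
view of the jobs, degrades when the client host is starved.
"""
import multiprocessing
import time

def burn(stop_at):
    # Busy loop until the deadline; enough to saturate one core.
    while time.time() < stop_at:
        pass

def overload(minutes):
    stop_at = time.time() + minutes * 60
    workers = [multiprocessing.Process(target=burn, args=(stop_at,))
               for _ in range(multiprocessing.cpu_count())]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

if __name__ == "__main__":
    overload(minutes=30)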
In all of this we should distinguish failure modes according to the following (a sketch of how job outcomes could be classified from the condor-g user log follows the list):
- submissions fail, i.e. the CE is unable to accept additional submissions.
- jobs are successfully submitted according to condor-g info at the client but are never successfully started.
- jobs start up successfully but then end in hold state, or fail according to condor-g info. For these cases we will want to distinguish between jobs that successfully complete but fail to move stdout/stderr back to the client, and jobs that somehow get killed due to CE overloads. This distinction is practically important because some VOs might choose not to depend at all on the grid middleware for the job exit handling. Such a VO would not be affected by exit handling failures on the CE.
- jobs that disappear without a trace from the perspective of the condor-g info. Here we want to understand in detail why this happens.
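To make the validation of the condor-g view concrete, the classification above could be driven from the Condor user log plus an independent ground truth. The sketch below assumes the standard text format of the user log and a hypothetical convention in which every job writes a marker file named after its cluster id into a shared directory before exiting; for real use one would likely prefer Condor's own log-reading tools:

#!/usr/bin/env python
"""Classify job outcomes from the Condor user log, cross-checked against an
independent ground truth.

Assumptions (for illustration only): the user log is in Condor's standard
text format, and every job writes a marker file named after its cluster id
into MARKER_DIR before exiting.  Neither convention is dictated by OSG.
"""
import os
import re
import sys

# Condor user-log event codes (standard): 000 submit, 001 execute,
# 005 terminated, 009 aborted, 012 held.
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.")

def read_events(log_path):
    """Return {cluster_id: set(event_codes)} from a Condor user log."""
    events = {}
    for line in open(log_path):
        m = EVENT_RE.match(line)
        if m:
            code, cluster = m.group(1), m.group(2)
            events.setdefault(cluster, set()).add(code)
    return events

def classify(events, marker_dir):
    """Bucket each cluster id into one of the failure-mode categories above."""
    buckets = {"never_started": [], "held_or_aborted": [],
               "done_verified": [], "done_unverified": [],
               "still_running_or_lost": []}
    for cluster, codes in events.items():
        ran_for_real = os.path.exists(os.path.join(marker_dir, cluster))
        if "001" not in codes:
            buckets["never_started"].append(cluster)
        elif "012" in codes or "009" in codes:
            buckets["held_or_aborted"].append(cluster)
        elif "005" in codes:
            if ran_for_real:
                buckets["done_verified"].append(cluster)
            else:
                # condor-g reports termination but there is no independent
                # evidence the job actually ran to completion
                buckets["done_unverified"].append(cluster)
        else:
            # executing but no terminal event: still running, or the
            # "disappeared without a trace" case once the run is over
            buckets["still_running_or_lost"].append(cluster)
    return buckets

if __name__ == "__main__":
    log_path, marker_dir = sys.argv[1], sys.argv[2]
    for bucket, ids in classify(read_events(log_path), marker_dir).items():
        print("%s %d" % (bucket, len(ids)))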
CE Configurations to test
The base configuration we start with is an NFS-lite compute element with Condor as the batch system,
for which both the condor collector and negotiator are hosted on a different system. We use Condor 6.9.1
to benefit from the latest condor scalability improvements. We mimic a large cluster by adding
3200 additional VMs to the UCSD production cluster; these additional VMs are accessible via two dedicated
CEs. The tests will include sleep jobs of varying types and rates as discussed above (a sketch of a
parameterized sleep-job submit description follows the configuration list). The configurations to be tested are:
- OSG CE as in OSG 0.4.1 (however with managed fork properly configured!)
- OSG CE as in OSG 0.4.1 but with changes to which processes run at what priority at the OS level
- OSG CE as in the previous test but replacing ML with GRATIA for accounting and BDII with CEMon for advertising.
- OSG CE as in the previous test but with the new exponential back-off feature in the GRAM<->condor-g communications implemented.
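For reference, the sleep-job submit description these tests rely on could be rendered along the following lines. The gatekeeper contact, the 'sleep_and_pad.sh' wrapper (a hypothetical script that writes a given number of bytes to stdout and then sleeps), and all file names are placeholders; the gt2 form shown corresponds to pre-WS GRAM, and the WS-GRAM tests would use the gt4 form of the grid_resource line instead:

#!/usr/bin/env python
"""Render a Condor-G (grid universe) submit description for sleep jobs.

Everything host- or file-specific below is a placeholder: the gatekeeper
contact, the 'sleep_and_pad.sh' wrapper (a hypothetical script that writes
N bytes to stdout and then sleeps), and the output file naming.
"""

TEMPLATE = """\
universe      = grid
grid_resource = gt2 %(gatekeeper)s/jobmanager-condor
executable    = sleep_and_pad.sh
arguments     = %(sleep_seconds)d %(stdout_bytes)d
output        = out/job_$(Cluster)_$(Process).out
error         = out/job_$(Cluster)_$(Process).err
log           = sleep_jobs.log
notification  = NEVER
queue %(count)d
"""

def render(gatekeeper, sleep_seconds, stdout_bytes, count):
    return TEMPLATE % {"gatekeeper": gatekeeper,
                       "sleep_seconds": sleep_seconds,
                       "stdout_bytes": stdout_bytes,
                       "count": count}

if __name__ == "__main__":
    # Example: 50 jobs that sleep 10 minutes and return 1 MB of stdout each.
    with open("sleep_job.sub", "w") as f:
        f.write(render("osg-ce.example.edu", 600, 1048576, 50))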
--
FkW - 13 Dec 2006