Benchmarking the OSG Compute Element

Overview

As part of the OSG extensions work, we want to gain a quantitative understanding of the scalability and reliability of the OSG Compute Element. This includes understanding failure rates and the load on the host system for different configurations of the Compute Element, as a function of the rate of submission and the I/O required by the submissions.

We will want to be able to easily repeat these tests for different configurations of the CE, including WS-GRAM versus pre-WS-GRAM, as well as minor variations in how the CE is configured. We are thus going to first develop a testing framework.

We will first want to guarantee that the node that hosts the OSG client is not overloaded, in order to cleanly separate any client load issues from server load issues. In a second step, we will then want to understand the impact of an overloaded client node on both the server and the overall perceived reliability.

It is thus crucial to have trustworthy measurements of load on client and server throughout these tests. Similarly, it is crucial to measure job success independently of the condor-g client information in order to validate the latter.
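
As a starting point for such measurements, a minimal load sampler along the following lines could be run on both the client host and the CE host. This is only a sketch: the sampling interval and log path are illustrative choices, not agreed-upon settings.

  #!/usr/bin/env python
  # Minimal load sampler, to be run on both the submit host and the CE host.
  # Appends timestamped 1/5/15-minute load averages to a log file so that
  # client and server load can be correlated with the submission timeline.
  # INTERVAL and LOGFILE are illustrative defaults, not project settings.
  import time

  INTERVAL = 60                                    # seconds between samples
  LOGFILE = "/tmp/ce_load_" + time.strftime("%Y%m%d") + ".log"

  while True:
      load1, load5, load15 = open("/proc/loadavg").read().split()[:3]
      record = "%s %s %s %s\n" % (time.strftime("%Y-%m-%dT%H:%M:%S"),
                                  load1, load5, load15)
      open(LOGFILE, "a").write(record)
      time.sleep(INTERVAL)

For the independent measurement of job success, each test job could additionally touch a uniquely named marker file on the worker node side, so that the count of markers can be compared against the condor-g bookkeeping at the client.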

Scope of testing

The purpose of the testing is to find the operational point where the CE breaks down, and to understand the behaviour of the CE leading up to this point. We hypothesize that the CE will break down for one of the following reasons (a sketch of a submission driver that exercises these scenarios follows the list):

  • The steady rate of job submissions exceeds the maximum the CE can handle.
  • A burst of submissions exceeds the maximum the CE can handle.
  • The steady rate of job terminations exceeds the maximum the CE can handle, especially in combination with increasingly large stdout/stderr.
  • A burst of job terminations exceeds the maximum the CE can handle, especially in combination with increasingly large stdout/stderr.
  • A combination of all of the above exceeds the maximum the CE can handle.
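
To drive these scenarios reproducibly, the testing framework could include a simple submission driver along the following lines. This is a sketch only: the CE contact string, rates, burst size, and sleep duration are placeholders, and the submit description assumes pre-WS GRAM (gt2) in front of a Condor jobmanager.

  #!/usr/bin/env python
  # Sketch of a submission driver: submits sleep jobs through Condor-G at a
  # steady rate, with a larger burst superimposed every BURST_EVERY minutes.
  # CE_CONTACT, the rates, and SLEEP_SECS are placeholders for this sketch.
  import os, subprocess, tempfile, time

  CE_CONTACT  = "osg-ce.example.edu/jobmanager-condor"   # placeholder CE
  STEADY_RATE = 10        # jobs submitted per minute
  BURST_SIZE  = 100       # extra jobs submitted at once during a burst
  BURST_EVERY = 30        # minutes between bursts
  SLEEP_SECS  = 600       # payload: /bin/sleep duration in seconds

  SUBMIT_TEMPLATE = """\
  universe      = grid
  grid_resource = gt2 %(ce)s
  executable    = /bin/sleep
  arguments     = %(secs)d
  output        = sleep_$(Cluster).$(Process).out
  error         = sleep_$(Cluster).$(Process).err
  log           = sleep.log
  queue %(count)d
  """

  def submit(count):
      # Write a one-off submit description and hand it to condor_submit.
      fd, path = tempfile.mkstemp(suffix=".sub")
      os.write(fd, (SUBMIT_TEMPLATE % {"ce": CE_CONTACT, "secs": SLEEP_SECS,
                                       "count": count}).encode())
      os.close(fd)
      subprocess.call(["condor_submit", path])
      os.remove(path)

  minute = 0
  while True:
      submit(STEADY_RATE)                    # steady background rate
      if minute and minute % BURST_EVERY == 0:
          submit(BURST_SIZE)                 # superimposed burst
      time.sleep(60)
      minute += 1

The termination-rate scenarios with varying stdout/stderr could be covered by replacing /bin/sleep with a small wrapper that sleeps and then writes a configurable amount of output to stdout and stderr.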

We are particularly interested in documenting failure modes, as well as failure rates, before the CE completely falls apart. In addition, we are interested in understanding whether there are scenarios in which the CE can get damaged by load in ways that are not immediately apparent. For example, does reliability decrease, or does functionality partially disappear, after an "incomplete" recovery from an overload?

In addition to straightforward use for job submission, we are interested in testing for overload conditions related to simultaneous load from:

  • gridftp transfers and job submission
  • use of managed fork and job submission, including the LIGO use case of gridftp jobs submitted via managed fork.
  • impact of overloading the client host on the CE.

The last item may deserve some explanation. We understand that the gridmonitor depends on the proper functioning of the client. We are thus curious whether it is possible to negatively impact a CE by driving the node that hosts the submission client into overload.

In all of this we should distinguish failure modes according to the following (a sketch of a classifier based on the condor-g view follows the list):

  • submissions fail, i.e. the CE is unable to accept additional submissions.
  • jobs are successfully submitted according to the condor-g info at the client but never successfully start up.
  • jobs start up successfully but then end up in the hold state, or fail according to the condor-g info. For these cases we will want to distinguish between jobs that successfully complete but fail to move stdout/stderr back to the client, and jobs that somehow get killed due to CE overloads. This distinction is practically important because some VOs might choose not to depend at all on the grid middleware for job exit handling. Such a VO would not be affected by exit-handling failures on the CE.
  • jobs that disappear without a trace from the perspective of the condor-g info. Here we want to understand in detail why this happens.
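
To tally these categories mechanically, the framework could compare the condor-g view of the queue against a record of what was actually submitted, roughly as sketched below. Here condor_q and condor_history are queried for the JobStatus attribute; the file submitted_clusters.txt is a hypothetical bookkeeping file written by the submission driver, and per-cluster granularity is used only for brevity.

  #!/usr/bin/env python
  # Sketch of an outcome classifier based on Condor-G state at the client.
  # Counts jobs per JobStatus (idle/running/removed/completed/held) and flags
  # submitted clusters that appear in neither condor_q nor condor_history,
  # i.e. jobs that "disappeared without a trace".
  # submitted_clusters.txt is a hypothetical file kept by the submission driver.
  import subprocess

  STATUS_NAMES = {1: "idle", 2: "running", 3: "removed",
                  4: "completed", 5: "held"}

  def query(cmd):
      # Return {ClusterId: JobStatus} as reported by condor_q or condor_history.
      out = subprocess.check_output(cmd + ["-format", "%d ", "ClusterId",
                                           "-format", "%d\n", "JobStatus"])
      jobs = {}
      for line in out.decode().splitlines():
          cluster, status = line.split()
          jobs[int(cluster)] = int(status)
      return jobs

  queued    = query(["condor_q"])
  finished  = query(["condor_history"])
  submitted = set(int(line) for line in open("submitted_clusters.txt"))

  counts = {}
  for status in list(queued.values()) + list(finished.values()):
      name = STATUS_NAMES.get(status, "other")
      counts[name] = counts.get(name, 0) + 1
  disappeared = submitted - set(queued) - set(finished)

  print(counts)
  print("disappeared without a trace:", sorted(disappeared))

The marker files touched by the jobs themselves (mentioned above) would then provide the independent cross-check of these condor-g based counts.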

CE Configurations to test

The base configuration we start with is an NFS-lite compute element with Condor as the batch system, for which both the condor collector and negotiator are hosted on a different system. We use Condor 6.9.1 to benefit from the latest condor scalability improvements. We mimic a large cluster by adding an additional 3200 VMs to the UCSD production cluster. These additional VMs are accessible via two dedicated CEs. The tests will include sleep jobs of varying types and rates as discussed above.

  • OSG CE as in OSG 0.4.1 (however with managed fork properly configured!)
  • OSG CE as in OSG 0.4.1, but with changes to the OS-level priorities at which the various CE processes run
  • OSG CE as in the previous test, but replacing ML with GRATIA for accounting and BDII with CEMon for advertising.
  • OSG CE as in the previous test, but with the new exponential back-off feature in the GRAM<->condor-g communications implemented.

-- FkW - 13 Dec 2006
