Condor Scalability tests
This page documents the tests performed against Condor to push its scalability limits.
Oct 2010
Condor 7.5.4 pre-release, glideinWMS v2, loadtest_condor 1.1
Using a 64GB schedd node at FNAL, a 16GB collector node (1+400) at FNAL, and getting glideins from shadow pools at FNAL, UCSD and Madison, we were able to achieve ~40k long-running jobs on a single schedd. Beyond that, the system becomes unstable.
Using several schedds, we were able to run 90k jobs on a single collector. We did not observe any hard limit, and stopped at that threshold only due to a lack of additional compute resources.
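For illustration, load of this kind can be generated with a trivial cluster of long-sleeping jobs; the submit file below is an assumed sketch of such a load-test job, not the actual loadtest_condor submission:
# illustrative long-running sleep job (assumed, not taken from loadtest_condor)
universe = vanilla
executable = /bin/sleep
arguments = 1000000
log = sleep.log
queue 10000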
Jan/Feb 2011
Condor 7.5.5 pre-release, glideinWMS v2, loadtest_condor 1.1
Using a 64GB schedd node at FNAL, a 16GB collector node (1+200) at FNAL, and getting glideins from shadow pools at FNAL, UCSD and Madison, we were able to achieve 60k long-running jobs for an extended period of time with no user-level problems. The limit was purely the memory available on the schedd node.
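A rough sanity check on that memory limit, assuming each running job costs on the order of 1 MB of schedd-side memory (mostly its condor_shadow; the 1 MB figure is an assumption for illustration, not a measurement):
60,000 running jobs x ~1 MB/job ~= 60 GB
which is consistent with a 64GB node being the ceiling.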
Using 10-minute jobs submitted by a single dagman, the same system stabilizes around 6k running jobs.
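A back-of-the-envelope check of that number: with ~6k jobs running and each job lasting 10 minutes, the system has to sustain roughly
6000 jobs / 600 s = 10 job completions (and starts) per second
which is in line with the 8-15 matches per second measured for the negotiator below.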
During the scalability tests, we also measured the matching speed of the negotiator; the test was the best-case scenario with a single autocluster and very basic requirements. On the test node (dual Intel Xeon E5430 @ 2.66GHz) it managed to match between 8 and 15 jobs per second.
During the test, we noticed that the Negotiator was wasting a lot of time gathering statistics when O(3k) jobs were matched in a single cycle. This seems to be due to heap management; dynamically linking the negotiator with TCMalloc seems to solve the problem.
We also observed the collector entering a very low-response state, especially when a large number of glideins terminated at the same time. Again, the problem seemed to be related to heap management, and using TCMalloc solved it.
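For anyone wanting to reproduce the comparison, an easy alternative to relinking is to preload TCMalloc when starting condor_master, so that the collector and negotiator inherit it; the library path below is an assumption and depends on the installation:
LD_PRELOAD=/usr/lib64/libtcmalloc.so condor_master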
Apr 2011
Condor 7.6.0, glideinWMS v2_5_1 (+minor patches), loadtest_condor 1.1 (+minor patches)
This time the test was about Negotiator scalability.
The test consisted of running jobs against 10k-15k glideins, once with a simple Requirements expression and once with a complex one that created one autocluster per job (the expression references ClusterId, which differs between submissions):
- simple:
Requirements=True
- complex:
Requirements = ( ( ( stringListMember(GLIDEIN_Site, string(ClusterId)) || stringListMember(GLIDEIN_Gatekeeper, string(ClusterId)) || (GLIDEIN_Fake =?= UNDEFINED) ) && (Arch =!= Dummy) ) && ( ( Memory > 1 ) && ( Disk >= 1 ) ) )
Simple glidein start expression
Using a single sleeper pool with an accept-all Start condition, and with the negotiator limited to 20s per pie spin (40s per submitter):
NEGOTIATOR_MAX_TIME_PER_SUBMITTER=40
NEGOTIATOR_MAX_TIME_PER_PIESPIN=20
the system behaved pretty much the same way with either the simple or the complex job requirements.
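(For clarity, the accept-all Start condition above simply means that the sleeper-pool glidein startds were willing to run any job, i.e. the equivalent of the startd configuration
START = True
though the exact glideinWMS knob used to achieve this is not recorded here.)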
However, looking more closely at the Negotiator behavior, it is clear that most jobs do not get considered for matching; the NegotiatorLog has a
Reached max time per spin: 20 ... stopping
line at the end of each cycle, and the Negotiator ClassAd reports that only ~250 jobs, out of a total of ~1.6k idle, were considered for matchmaking:
LastNegotiationCycleNumJobsConsidered0 = 254
LastNegotiationCycleRejections0 = 234
LastNegotiationCycleNumIdleJobs0 = 1620
LastNegotiationCycleTotalSlots0 = 15081
LastNegotiationCycleCandidateSlots0 = 2002
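These per-cycle statistics come from the Negotiator ClassAd; one way to retrieve them (an illustrative command, assuming the default negotiator setup) is:
condor_status -negotiator -long | grep LastNegotiationCycle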
Selective glidein start expression
To test how the behaviour described above affects the system when there are many autoclusters, new glideins were configured to accept only a subset of the jobs:
GLIDEIN_Entry_Start="(round(ClusterId/10)*10==ClusterId)"
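This expression accepts only jobs whose ClusterId is (up to rounding) a multiple of 10, i.e. roughly 10% of the submitted jobs; for example:
round(120/10)*10 == 120 evaluates to True, so ClusterId 120 is accepted
round(123/10)*10 yields 120, which differs from 123, so ClusterId 123 is rejected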
As expected, only the first ~20 idle jobs that matched (out of ~200) started running, even though there were plenty (>200) of unclaimed glideins in the system. The net result was both delayed job execution and wasted CPU cycles, so this configuration is not really functional.
To correct for the above, the negotiator limits were commented out:
#NEGOTIATOR_MAX_TIME_PER_SUBMITTER=40
#NEGOTIATOR_MAX_TIME_PER_PIESPIN=20
All the deserving jobs thus started to run, but at the expense of the negotiator cycle time, which increased to ~3 minutes (with 1.2k idle jobs in the queue), compared to the ~50s it took before.
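For scale, spending ~3 minutes on ~1.2k idle jobs, each in its own autocluster, corresponds to roughly
180 s / 1200 jobs ~= 0.15 s per job, i.e. ~7 jobs/s
in line with the 8-15 matches per second measured in the earlier negotiator tests.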
--
IgorSfiligoi - 2011/02/08