
Condor Scalability tests

This page describes the tests performed against Condor to probe its scalability limits.

Jan/Feb 2011

Condor 7.5.5 pre-release, glideinWMS v2

Using a 64GB schedd node at FNAL, a 16GB collector node (1+200) at FNAL, and glideins from shadow pools at FNAL, UCSD and Madison, we were able to sustain 60k long-running jobs for an extended period of time with no user-level problems.

[Figure: cq_60k.png]

Using 10-minute jobs submitted by a single DAGMan, the same system stabilized at around 6k running jobs.


During the scalability tests, we also measured the matching speed of the negotiator. The test was a best-case scenario, with a single autocluster and very basic requirements. On the test node (dual Intel Xeon E5430 @ 2.66GHz), the negotiator matched between 8 and 15 jobs per second.
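As a rough consistency check on the numbers above, the steady-state count of running jobs equals the job start rate multiplied by the job duration (Little's law). The sketch below assumes that the plateau seen with 10-minute jobs is set by how fast jobs can be started, which the tests do not state explicitly; the function names are illustrative only.

```python
# Back-of-the-envelope check using Little's law:
#   running_jobs = start_rate * job_duration
# Assumption (not stated in the tests): with short jobs, the plateau is set
# by the rate at which jobs can be started.


def steady_state_running(start_rate_per_s, job_duration_s):
    """Number of jobs running at steady state for a given start rate."""
    return start_rate_per_s * job_duration_s


def implied_start_rate(running_jobs, job_duration_s):
    """Start rate (jobs/s) needed to sustain a given number of running jobs."""
    return running_jobs / job_duration_s


if __name__ == "__main__":
    job_duration_s = 10 * 60  # 10-minute jobs

    # Start rate implied by the observed ~6k running jobs.
    print(implied_start_rate(6000, job_duration_s))

    # Running jobs sustainable at the measured matching rates.
    for rate in (8, 15):
        print(rate, steady_state_running(rate, job_duration_s))
```

With 10-minute jobs, 6k running jobs correspond to about 10 job starts per second, which falls inside the measured range of 8 to 15 matches per second (equivalent to 4.8k to 9k running jobs).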

During the test, we noticed that the negotiator spent a large amount of time gathering statistics when O(3k) jobs were matched in a single cycle. This appears to be due to heap management; dynamically linking the negotiator with TCMalloc seems to solve the problem.

We also observed the collector becoming very slow to respond, especially when a large number of glideins terminated at the same time. Again, the problem seemed to be related to heap management, and using TCMalloc solved it.

-- IgorSfiligoi - 2011/02/08

Topic revision: r1 - 2011/02/09 - 00:09:29 - IgorSfiligoi