Difference: CondorScal (1 vs. 2)

Revision 2 - 2011/02/09 - Main.IgorSfiligoi

Line: 1 to 1
 
META TOPICPARENT name="OSGScal"

Condor Scalability tests

This page contains the tests performed against Condor to push the scalability limits.

Added:
>
>

Oct 2010

Condor 7.5.4 pre-release, glideinWMS v2, loadtest_condor 1.1

Using a 64GB schedd node at FNAL, a 16GB collector node (1+400) at FNAL, and getting glideins from shadow pools at FNAL, UCSD and Madison, we were able to achieve ~40k long-running jobs on a single schedd. After that, the system becomes unstable.
50k_one_q_2.png

Using several schedds, we were able to run 90k jobs on a single collector. We did not observe any limits, and stopped at that threshold only for lack of additional compute resources.
90k_s.png

 

Jan/Feb 2011

Changed:
<
<
Condor 7.5.5 pre-release, glideinWMS v2
>
>
Condor 7.5.5 pre-release, glideinWMS v2, loadtest_condor 1.1

Using a 64GB schedd node at FNAL, a 16GB collector node (1+200) at FNAL, and getting glideins from shadow pools at FNAL, UCSD and Madison, we were able to achieve 60k long-running jobs for an extended period of time with no user-level problems. The only limit was the memory available on the schedd node.

cq_60k.png
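
Since the 60k limit is attributed to schedd memory, a rough per-job memory budget can be derived from the figures above. The sketch below is only a back-of-the-envelope estimate: it assumes the 64GB node is 64 GiB and that all of it is attributed to running jobs, so the result is an upper bound rather than a measured value. The function name is hypothetical.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def memory_per_job_mib(total_gib, running_jobs):
    """Upper bound on schedd-node memory per running job, in MiB.

    Assumes all node memory is attributed to running jobs (an
    illustrative simplification, not a measurement).
    """
    return total_gib * GIB / running_jobs / MIB


# 64GB schedd node, 60k running jobs (figures from the test above)
print(f"{memory_per_job_mib(64, 60_000):.2f} MiB per running job")
```

Under these assumptions the budget comes out at roughly 1 MiB per running job.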

 
Deleted:
<
<
Using a 64GB schedd node at FNAL, a 16GB collector node (1+200) at FNAL, and getting glideins from shadow pools at FNAL, UCSD and Madison, was able to achieve 60k long running jobs for extended period of time with no user-level problems.
cq_60k.png
 Using 10 minute jobs submitted by a single dagman, the same system stabilizes around 6k running jobs.

cq_10min.png

Line: 22 to 32
 
META FILEATTACHMENT attachment="cq_60k.png" attr="h" comment="60k jobs running on a single schedd" date="1297207822" name="cq_60k.png" path="cq_60k.png" size="27970" stream="cq_60k.png" tmpFilename="/tmp/6yQdyom60F" user="IgorSfiligoi" version="1"
META FILEATTACHMENT attachment="cq_10min.png" attr="h" comment="Condor with 10 min jobs" date="1297208089" name="cq_10min.png" path="cq_10min.png" size="28186" stream="cq_10min.png" tmpFilename="/tmp/qPE1HiB88V" user="IgorSfiligoi" version="1"
Added:
>
>
META FILEATTACHMENT attachment="50k_one_q_2.png" attr="h" comment="40k jobs running on a single schedd" date="1297210546" name="50k_one_q_2.png" path="50k_one_q_2.png" size="53751" stream="50k_one_q_2.png" tmpFilename="/tmp/NK5GvLGoXD" user="IgorSfiligoi" version="1"
META FILEATTACHMENT attachment="90k_s.png" attr="h" comment="90k jobs on a single collector" date="1297210585" name="90k_s.png" path="90k_s.png" size="53170" stream="90k_s.png" tmpFilename="/tmp/BKCI89yUNO" user="IgorSfiligoi" version="1"

Revision 1 - 2011/02/09 - Main.IgorSfiligoi

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="OSGScal"

Condor Scalability tests

This page contains the tests performed against Condor to push the scalability limits.

Jan/Feb 2011

Condor 7.5.5 pre-release, glideinWMS v2

Using a 64GB schedd node at FNAL, a 16GB collector node (1+200) at FNAL, and getting glideins from shadow pools at FNAL, UCSD and Madison, was able to achieve 60k long running jobs for extended period of time with no user-level problems.
cq_60k.png
Using 10 minute jobs submitted by a single dagman, the same system stabilizes around 6k running jobs.

cq_10min.png

During the scalability tests, we also measured the matching speed of the negotiator; the test was the best-case scenario with a single autocluster and very basic requirements. On the test node (dual Intel Xeon E5430 @ 2.66GHz) it was managing to match between 8 and 15 jobs per second.
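
To put the measured matching rate in perspective, the sketch below converts it into the time needed to match a given number of jobs. It assumes the rate stays constant over the whole interval, which is an idealization; the function name is hypothetical and the job counts are the ones quoted on this page.

```python
def matching_seconds(jobs, rate_per_second):
    """Seconds needed to match `jobs` jobs at a constant matching rate."""
    return jobs / rate_per_second


SLOW_RATE = 8   # jobs per second, lower end of the measured range
FAST_RATE = 15  # jobs per second, upper end of the measured range

for jobs in (3_000, 60_000):
    fast = matching_seconds(jobs, FAST_RATE)
    slow = matching_seconds(jobs, SLOW_RATE)
    print(f"{jobs} jobs: {fast:.0f} s to {slow:.0f} s")
```

At these rates, matching 3k jobs takes between 200 and 375 seconds, and filling a 60k-job pool from scratch takes between roughly 1.1 and 2.1 hours.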

During the test, we noticed that the Negotiator was wasting a lot of time gathering statistics when O(3k) jobs were matched in a single cycle. This seems to be due to heap management; dynamically linking the negotiator with TCMalloc seems to solve the problem.

We also observed the collector entering into a very low-response state, especially when a large number of glideins terminated at the same time. Again, the problem seemed to be related to heap management, and using TCMalloc solved the problem.

-- IgorSfiligoi - 2011/02/08

META FILEATTACHMENT attachment="cq_60k.png" attr="h" comment="60k jobs running on a single schedd" date="1297207822" name="cq_60k.png" path="cq_60k.png" size="27970" stream="cq_60k.png" tmpFilename="/tmp/6yQdyom60F" user="IgorSfiligoi" version="1"
META FILEATTACHMENT attachment="cq_10min.png" attr="h" comment="Condor with 10 min jobs" date="1297208089" name="cq_10min.png" path="cq_10min.png" size="28186" stream="cq_10min.png" tmpFilename="/tmp/qPE1HiB88V" user="IgorSfiligoi" version="1"
 