Glidein WMS Scale Tests 2014

Available nodes are described in this page.

First level (Idling Scaling)

Installation

Nodes

Use          Node name
Collector    test-001.t2.ucsd.edu
Schedd       test-010.t2.ucsd.edu
GFactory     glidein-1.t2.ucsd.edu
GFrontEnd    test-008.t2.ucsd.edu

Tests

Test Conditions

  • Keep job lengths around 2h (randomized by ±1h), so that jobs are short enough to give significant turnover but long enough to span multiple Condor leases.
  • Use job attributes comparable to those used by CMS AnaOps, giving about 100 autoclusters per schedd.
  • Randomize the submissions, both in how many jobs are submitted at once and in the time between submissions.
  • Submit every 5 minutes (unless the idle limit is reached):
    • An average of 1000 jobs per submission
    • Time between submissions randomized
    • For each submission, choose a user from a pool of 100 random users and submit as that user via sudo
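
The conditions above can be sketched as a small submission planner. This is only an illustration under assumptions: the uniform distributions, the account names (testuser000 ...) and the function names are hypothetical, and the sketch only builds a plan; it does not call condor_submit or sudo.

```python
import random

HOUR_S = 3600


def job_runtime_s(rng):
    """Job length of 2h randomized by +-1h (assumed uniform between 1h and 3h)."""
    return rng.uniform(1 * HOUR_S, 3 * HOUR_S)


def plan_submissions(n_rounds, n_users=100, mean_batch=1000,
                     mean_interval_s=300, seed=None):
    """Build a list of planned submissions (start time, user, number of jobs)."""
    rng = random.Random(seed)
    # Hypothetical account names standing in for the pool of test users.
    users = [f"testuser{i:03d}" for i in range(n_users)]
    plan = []
    now_s = 0.0
    for _ in range(n_rounds):
        # Assumption: the gap between submissions is uniform within +-50%
        # of the nominal 5 minutes.
        now_s += rng.uniform(0.5 * mean_interval_s, 1.5 * mean_interval_s)
        plan.append({
            "start_s": now_s,
            "user": rng.choice(users),
            # Uniform on [1, 2*mean - 1], so the average equals mean_batch.
            "n_jobs": rng.randint(1, 2 * mean_batch - 1),
        })
    return plan


if __name__ == "__main__":
    for entry in plan_submissions(3, seed=1):
        print(entry)
```

Each planned entry would then be turned into one submit file with per-job runtimes drawn from job_runtime_s and submitted as the chosen user.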

Results

# of idle jobs   # of users   Passed   Notes
10k              1            Yes
10k              10           Yes
50k              10           Yes
100k             50           Yes
200k             100          Yes
250k             100          Yes
500k             100          No       *

*With 500k idle jobs on the schedd, the glideinFrontend becomes essentially unresponsive, with a local condor_q taking ~3/4 min.

It takes more than 20 min for the frontend to go over the schedds (about 27 min between the two log lines below), so a single schedd can block the entire pool:


[2014-09-04 13:11:21,151] INFO: Querying schedd, entry, and glidein status using child processes.
[2014-09-04 13:38:18,136] INFO: All children terminated
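
The length of this query cycle can be read off the two timestamps. A minimal sketch, assuming the log format shown above (bracketed timestamp with comma-separated milliseconds); the helper name parse_timestamp is hypothetical:

```python
from datetime import datetime

LOG_LINES = [
    "[2014-09-04 13:11:21,151] INFO: Querying schedd, entry, and glidein status using child processes.",
    "[2014-09-04 13:38:18,136] INFO: All children terminated",
]


def parse_timestamp(line):
    """Extract the bracketed timestamp at the start of a frontend log line."""
    stamp = line[1:line.index("]")]
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")


elapsed = parse_timestamp(LOG_LINES[1]) - parse_timestamp(LOG_LINES[0])
print(f"{elapsed.total_seconds() / 60:.1f} min")  # prints 26.9 min
```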
 
 

Figure (500kIdleJobs.gif): 500k idle jobs, average time for local condor_q

The suggested course of action is to communicate this to the GlideinWMS team.

Second level (Running Scaling)

Tests

Test Conditions

  • Keep job lengths around 2h (randomized by ±1h), so that jobs are short enough to give significant turnover but long enough to span multiple Condor leases.
  • Use job attributes comparable to those used by CMS AnaOps, giving about 100 autoclusters per schedd.
  • Randomize the submissions, both in how many jobs are submitted at once and in the time between submissions.
  • Submit every 5 minutes (unless the running limit is reached):
    • An average of 1000 jobs per submission
    • Time between submissions randomized
    • For each submission, choose a user from a pool of 100 random users and submit as that user via sudo
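
The "unless the running limit is reached" condition can be sketched as a simple gate evaluated before each submission. This is an illustration only: the function name is hypothetical, and capping a batch at the remaining headroom (rather than submitting it whole) is an assumption not stated in the conditions.

```python
def jobs_to_submit(current_running, running_limit, batch_size):
    """Return how many jobs of the next batch should be submitted.

    Nothing is submitted once the running limit is reached. Assumption:
    a batch that would overshoot the limit is truncated to the headroom.
    """
    if current_running >= running_limit:
        return 0
    return min(batch_size, running_limit - current_running)


if __name__ == "__main__":
    # Example: targeting 30k running jobs with a batch of 1000.
    print(jobs_to_submit(29500, 30000, 1000))
    print(jobs_to_submit(30000, 30000, 1000))
```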

Results

# of running jobs   # of users   Passed   Notes
10k                 100          Yes
20k                 100          Yes
30k                 100          No       *See below
35k                 100
50k                 100

Results at 30k running jobs.

At 30k parallel running jobs there seems to be a strong correlation between condor_submit_success and the scheduler daemon core duty cycle:

  • Parallel idle jobs: condor_30k_running_15kidle.gif
  • Parallel running jobs: condor_30k_running_jobs.gif
  • condor_submit_success: condor_submit_succes_30k_running_jobs.gif
  • Scheduler daemon core duty cycle: core_duty_cycle_30k_running_jobs.gif
  • System CPU: cpu_system_30k_running_jobs.gif
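
The correlation mentioned above was read off the plots. One way to quantify it, if the two time series were exported on a common time grid, is a Pearson coefficient. The sketch below is an assumption about how that could be done, not the method used for these tests, and the sample values are made-up placeholders, not measurements.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant series has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Placeholder series only; replace with the exported monitoring data.
    series_a = [1.0, 2.0, 3.0, 4.0]
    series_b = [2.0, 4.0, 6.0, 8.0]
    print(round(pearson(series_a, series_b), 3))
```

A coefficient close to +1 or -1 would support the visual impression; a value near 0 would not.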

Third level (Scaling of the Pool)

Tests

Test Conditions

  • Keep job lengths around 3h (randomized by ±1h), so that jobs are short enough to give significant turnover but long enough to span multiple Condor leases.
  • Use job attributes comparable to those used by CMS AnaOps, giving about 50 autoclusters per schedd.
  • Randomize the submissions, both in how many jobs are submitted at once and in the time between submissions.
  • Submit every 5 minutes (unless the idle limit is reached):
    • An average of 1000 jobs per submission
    • Time between submissions randomized
    • For each submission, choose a user from a pool of 20 random users and submit as that user via sudo
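
One way to reach a target of about 50 autoclusters is to vary a few job attributes so that exactly 50 distinct value combinations exist. The sketch below assumes that autoclusters are determined by the distinct combinations of such attributes; the attribute values, the site group names and the function names are hypothetical and are not the attributes used in these tests.

```python
import itertools
import random


def build_attribute_classes(memory_values, site_groups):
    """All distinct (memory, site group) combinations, one per intended autocluster."""
    return list(itertools.product(memory_values, site_groups))


def assign_classes(n_jobs, classes, seed=None):
    """Pick one attribute combination at random for each job."""
    rng = random.Random(seed)
    return [rng.choice(classes) for _ in range(n_jobs)]


# Hypothetical values: 5 memory requests x 10 site groups = 50 combinations.
memory_mb = [1000, 1500, 2000, 2500, 3000]
site_groups = [f"sitegroup{i}" for i in range(10)]

classes = build_attribute_classes(memory_mb, site_groups)
jobs = assign_classes(1000, classes, seed=7)
print(len(classes), "combinations,", len(set(jobs)), "used by this batch")
```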

Results

# of running jobs   # of users   Passed        Notes
50k                 100          Yes
100k                100          Yes
150k                100          Yes           *
200k                100          In progress

Results at 150k

*The pool ran consistently at 150k jobs:

150kJobs.gif

-- EdgarHernandez - 2014/08/11

Topic revision: r17 - 2014/12/23 - 19:22:32 - EdgarHernandez
 