OSG CE And Submitter Testing And Monitoring

Mission statement

This page is meant to gather information about the OSG CE scalability activity.

Tools

Two software packages have been developed:

  1. a load generating engine
    • The first version was developed by Toni Coarasa.
      Follow the links for installation instructions and usage instructions.
    • The code was refactored by Igor Sfiligoi to make it more user-friendly, more flexible, and able to stress test non-Grid resources.
      The package can be downloaded here: loadtest_condor.v1_0_0.tgz
      The programs are in the bin directory; they rely only on command-line options, which are documented in each program's help screen.
  2. a load monitoring package
    • The first version was developed by Toni Coarasa.
      Follow the links for a description.

Test hardware

UCSD Test Cluster

  • CE hardware
    • 2 x AMD Opteron 275 (4 cores total)
    • 8GB of memory
    • 2TB disk space, mounted as RAID
  • CE software
    • CentOS 5.2 (x86_64)
    • Condor 7.2
  • worker nodes
    • uses test slots on the production worker nodes
      (shadow pool)
    • there are about 4 test slots per production slot (about 3k total)
    • only very low resource jobs are allowed on these test nodes (policy)

FNAL Test Cluster (operated by FNAL)

  • CE hardware
    • 2 x Intel Xeon E5335 (8 cores total)
    • 8GB of memory
    • 500GB disk space
  • CE software
    • Scientific Linux 4.4 (x86_64)
    • Condor 6.9
  • worker nodes
    • each production worker node runs a second copy of the Condor daemons as a non-privileged user
      (shadow pool)
    • there is 1 test slot per production slot (about 5k total), but this can increase to 4 to 1 when needed
    • only very low resource jobs are allowed on these test nodes (policy)

UCSD test client

  • 1x Intel Xeon 3.0GHz (2 hyperthreaded cores, 4 threads total)
  • 4GB of memory
  • CentOS 4.7 (i386)

FNAL test client (owned by FNAL)

  • 2x Intel Xeon 3.2GHz (4 cores total)
  • 4GB of memory
  • Scientific Linux 4.2 (i386)

Italy test client (owned by INFN Pisa)

  • 1x AMD Opteron 148 (1 core total)
  • 1GB of memory
  • Scientific Linux 5.2 (x86_64)

Results

The first round of tests was performed in Fall 2008 (by Toni Coarasa):

  • GT2 at UCSD:
    Submitted more than 30k jobs.
    The observed limit was 0.44Hz (i.e. 26.4 jobs/min).
    See the following presentation.
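
The two units quoted above are related by simple arithmetic, which the short Python sketch below spells out. The function names are illustrative and are not part of the load-test package. The run-time estimate assumes the limiting rate is sustained for the whole run, which the tests themselves do not claim.

```python
def hz_to_jobs_per_min(rate_hz: float) -> float:
    """Convert a job start rate from jobs/second (Hz) to jobs/minute."""
    return rate_hz * 60.0


def minutes_to_start(n_jobs: int, jobs_per_min: float) -> float:
    """Minutes needed to start n_jobs at a constant rate (idealized)."""
    return n_jobs / jobs_per_min


# Observed GT2 limit at UCSD in the Fall 2008 tests.
rate = hz_to_jobs_per_min(0.44)
print(f"{rate:.1f} jobs/min")

# Idealized estimate only: 30k jobs started at exactly the limiting rate.
hours = minutes_to_start(30_000, rate) / 60.0
print(f"about {hours:.1f} hours to start 30k jobs at that rate")
```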

The second round of tests was performed in Summer 2009 (by Igor Sfiligoi):

  • GT2 at UCSD:
    Submitted 5k jobs.
    Job rate limited to 7.5 jobs/min under heavy load.
    Can sustain 33 jobs/min if monitoring is not important.
    More details at: rates_gt2_ucsd.pdf
  • GT2 between FNAL and UCSD:
    Submitted 5k jobs - from UCSD to FNAL, and from FNAL to UCSD.
    Job rate limited to 6.9 jobs/min under heavy load.
    Can sustain 25-28 jobs/min if monitoring is not important (CPU speed seems to be a factor here).
    Network latency seems to be a factor.
    More details for UCSD to FNAL at: rates_gt2_fnal_r.pdf
    and for FNAL to UCSD at: rates_gt2_ucsd_r.pdf
  • GT2 between Italy and UCSD:
    Submitted 5k jobs - from Italy to UCSD.
    Job rate limited to 3.5 jobs/min under heavy load.
    Can sustain 7 jobs/min if monitoring is not important.
    Network latency is definitely a factor.
    More details: rates_gt2_ucsd_r2.pdf
  • Network latency is definitely a factor with GT2; the plots below show it graphically:
    [Image: GT2 scalability vs RTT (gt2_scaling_rtt.png)]
    The following shows only the limiting rate:
    [Image: GT2 scaling limit vs RTT (gt2_scaling_limit_rtt.png)]
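
One rough way to read the second-round numbers is to express each remote rate as a fraction of the local UCSD rate. The Python sketch below does this using only the figures listed above. The dictionary layout and function names are assumptions made for illustration, and the lower end of the 25-28 jobs/min range is used for FNAL. No RTT values are included because the page text does not state them.

```python
# Second-round rates in jobs/min:
# (limit under heavy load, sustained rate if monitoring is not important)
ROUND2 = {
    "UCSD": (7.5, 33.0),
    "FNAL<->UCSD": (6.9, 25.0),  # lower end of the quoted 25-28 range
    "Italy->UCSD": (3.5, 7.0),
}


def relative_to(baseline: float, rate: float) -> float:
    """Return rate as a fraction of the baseline rate."""
    return rate / baseline


def summarize(rates: dict, baseline_key: str = "UCSD") -> dict:
    """Express every entry as a fraction of the baseline site's rates."""
    base_heavy, base_relaxed = rates[baseline_key]
    return {
        name: (relative_to(base_heavy, heavy), relative_to(base_relaxed, relaxed))
        for name, (heavy, relaxed) in rates.items()
    }


for name, (heavy, relaxed) in summarize(ROUND2).items():
    print(f"{name:12s} heavy load: {heavy:.0%} of local, relaxed: {relaxed:.0%} of local")
```

Read this way, the Italy client reaches under half of the local heavy-load rate, which is consistent with the latency remarks in the list above.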

A third round of tests, using multiple users, was performed in Summer 2009 (by Igor Sfiligoi):

  • GT2 at UCSD:
    Submitted 5k jobs per user.
    With 4 parallel users the job rate limit is 26 jobs/min (compare this to 7.5 jobs/min using a single user).
    Can sustain 47 jobs/min if monitoring is not important (compare this to 33 jobs/min using a single user).
    More details at: rates_gt2_single_vs_multi_30min.ods or rates_gt2_single_vs_multi_30min.pdf
  • GT2 between Italy and UCSD:
    Submitted 5k jobs per user. Due to limited client resources (single CPU, 1GB of memory), -maxidle 1k was used when using multiple DNs.
    With 4 parallel users the job rate limit is 21 jobs/min (compare this to 5.2 jobs/min using a single user).
    Can sustain 43 jobs/min if monitoring is not important (compare this to 14 jobs/min using a single user).
    More details at: rates_gt2_r2_single_vs_multi_30min.ods or rates_gt2_r2_single_vs_multi_30min.pdf
  • Network latency is much more noticeable when a single DN is used. Using multiple users, the networking latencies don't seem to be a major issue.
  • GT2 submissions from a resource constrained client
    Submitting 5k jobs for 4 users from a 1 CPU, 1GB machine.
    When trying to submit all jobs as fast as possible (instead of using -maxidle 1k), the average job rate drops from 20 jobs/min to 11 jobs/min.
    More details at: rates_gt2_r2_resource_constraint.ods or rates_gt2_r2_resource_constraint.pdf
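
The single-user versus multi-user comparisons above can be reduced to speedup factors. The Python sketch below computes them from the quoted rates only. The names `ROUND3`, `speedup` and `relative_drop` are illustrative, and comparing against "linear scaling" with 4 users is an interpretation added here, not a claim made by the tests.

```python
N_USERS = 4  # number of parallel users in the third-round tests

# Third-round rates in jobs/min: (single user, 4 parallel users)
ROUND3 = {
    "UCSD, heavy load": (7.5, 26.0),
    "UCSD, monitoring not important": (33.0, 47.0),
    "Italy->UCSD, heavy load": (5.2, 21.0),
    "Italy->UCSD, monitoring not important": (14.0, 43.0),
}


def speedup(single: float, multi: float) -> float:
    """Ratio of the multi-user rate to the single-user rate."""
    return multi / single


def relative_drop(before: float, after: float) -> float:
    """Fractional decrease when going from `before` to `after`."""
    return (before - after) / before


for name, (single, multi) in ROUND3.items():
    s = speedup(single, multi)
    print(f"{name}: {s:.2f}x with {N_USERS} users "
          f"({s / N_USERS:.0%} of linear scaling)")

# Resource-constrained client: unthrottled submission vs -maxidle 1k.
drop = relative_drop(20.0, 11.0)
print(f"Unthrottled submission lowers the average rate by {drop:.0%}")
```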

-- IgorSfiligoi - 2009/08/28

Deprecated information... left only for historical reference

Monitoring

Monitoring was done using the package described in "Description of the osgmonitoring.rpm package" and the Cacti instance installed on t2sentry0.t2.ucsd.edu.

Installation Instructions for dummies

Running the tests for dummies

Things left to do

  1. create the client side tarball, and document it on this twiki, and attach the tarball to the twiki page.
  2. communicate with Terrence to make sure that we have the client monitoring online on uaf-2, and uaf-1.
  3. document the process of putting monitoring via t2sentry0 and cacti into place

Goal to be finished: Monday April 6th

-- ToniCoarasa - 19 Sep 2008

Topic attachments
| Attachment | Size | Date | Who | Comment |
| gt2_scaling_limit_rtt.png | 5.5 K | 2009/08/24 - 21:50 | IgorSfiligoi | Image: GT2 scaling vs RTT - Limiting factor |
| gt2_scaling_rtt.png | 10.5 K | 2009/08/24 - 21:50 | IgorSfiligoi | Image: GT2 scaling vs RTT |
| loadtest_condor.v1_0_0.tgz | 6.8 K | 2009/08/17 - 22:07 | IgorSfiligoi | |
| rates_gt2_fnal_r.odt | 28.6 K | 2009/08/20 - 17:16 | IgorSfiligoi | Job startup rates for GT2 for FNAL - from UCSD |
| rates_gt2_fnal_r.pdf | 391.2 K | 2009/08/20 - 17:17 | IgorSfiligoi | Job startup rates for GT2 for FNAL - from UCSD |
| rates_gt2_r2_resource_constraint.ods | 103.3 K | 2009/09/03 - 21:38 | IgorSfiligoi | Job startup rates for GT2 on a resource constrained client - from Italy |
| rates_gt2_r2_resource_constraint.pdf | 299.3 K | 2009/09/03 - 21:39 | IgorSfiligoi | Job startup rates for GT2 on a resource constrained client - from Italy |
| rates_gt2_r2_single_vs_multi_30min.ods | 99.7 K | 2009/09/03 - 17:40 | IgorSfiligoi | Job startup rates for GT2 - Single vs multi user - from Italy |
| rates_gt2_r2_single_vs_multi_30min.pdf | 294.9 K | 2009/09/03 - 17:41 | IgorSfiligoi | Job startup rates for GT2 - Single vs multi user - from Italy |
| rates_gt2_single_vs_multi_30min.ods | 96.5 K | 2009/08/27 - 23:28 | IgorSfiligoi | Job startup rates for GT2 - Single vs multi user |
| rates_gt2_single_vs_multi_30min.pdf | 293.9 K | 2009/08/27 - 23:28 | IgorSfiligoi | Job startup rates for GT2 - Single vs multi user |
| rates_gt2_ucsd.odt | 83.0 K | 2009/08/18 - 20:16 | IgorSfiligoi | Job startup rates for GT2 at UCSD |
| rates_gt2_ucsd.pdf | 531.2 K | 2009/08/18 - 20:16 | IgorSfiligoi | Job startup rates for GT2 at UCSD |
| rates_gt2_ucsd_08.odp | 275.6 K | 2009/08/17 - 22:46 | IgorSfiligoi | Job startup rates for GT2 - Toni's tests |
| rates_gt2_ucsd_08.pdf | 317.5 K | 2009/08/17 - 22:46 | IgorSfiligoi | Job startup rates for GT2 - Toni's tests |
| rates_gt2_ucsd_r.odt | 68.3 K | 2009/08/18 - 22:15 | IgorSfiligoi | Job startup rates for GT2 for UCSD - from FNAL |
| rates_gt2_ucsd_r.pdf | 515.6 K | 2009/08/18 - 22:15 | IgorSfiligoi | Job startup rates for GT2 for UCSD - from FNAL |
| rates_gt2_ucsd_r2.odt | 78.7 K | 2009/08/24 - 21:32 | IgorSfiligoi | Job startup rates for GT2 for UCSD - from Italy |
| rates_gt2_ucsd_r2.pdf | 525.0 K | 2009/08/24 - 21:32 | IgorSfiligoi | Job startup rates for GT2 for UCSD - from Italy |
| rates_gt2_ucsd_rtt.ods | 23.5 K | 2009/08/24 - 21:49 | IgorSfiligoi | Spreadsheet with rates vs RTT - UCSD |
Topic revision: r18 - 2009/09/10 - 21:33:31 - IgorSfiligoi