Difference: BDIItests (4 vs. 5)

Revision 5 2008/09/20 - Main.FkW

Line: 1 to 1
 

This page documents bdii load tests done by Sanjay Padhi for OSG.

Line: 26 to 26
 For the final results, we ran this as follows:
  • Once at CERN for 1h with N=15.
  • 8 instances of the test program run in parallel on our 8-core desktop at CERN
Changed:
<
<
  • Submitted 50 instances of the test program to the UCSD cluster.
>
>
  • Submitted 50 instances of the test program to several clusters.
 
    • As those 50 instances don't start all at the same time, we then add the histograms up from all the jobs in such a way as to get an appropriate time history.
Added:
>
>
    • We decided to do this by counting minutes since the beginning of the first of these jobs, until the end of the last of these jobs. This is epoch time 1221827546 until 1221885855.
    • We make 4 plots:
      • the total number of successful queries
      • the total number of failed queries
      • the average time it takes for queries
      • the total number of "entries", where entries means the number of jobs that ran simultaneously at a given point of time.
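The combination step described above can be sketched as follows. This is only an illustration, not the code actually used for the plots: the record format `(epoch_s, ok, duration_s)`, the per-job `(start, end)` intervals, and the function names `minute_bin` and `combine` are all assumptions. It also assumes that the average query time is taken over every query recorded in a minute, successful or not. The two epoch times are the ones quoted above.

```python
from collections import defaultdict

T_START = 1221827546  # epoch time at which the first job began (from the text)
T_END = 1221885855    # epoch time at which the last job ended (from the text)


def minute_bin(epoch_s, t_start=T_START):
    """Whole minutes elapsed since the first job started."""
    return int((epoch_s - t_start) // 60)


def combine(records, job_intervals):
    """Add up per-job data on a common time axis, one bin per minute.

    records:       iterable of (epoch_s, ok, duration_s), one per query
                   (hypothetical format)
    job_intervals: iterable of (start_epoch_s, end_epoch_s), one per job
                   (hypothetical format)
    """
    success = defaultdict(int)     # successful queries per minute
    failure = defaultdict(int)     # failed queries per minute
    time_sum = defaultdict(float)  # summed query time per minute
    count = defaultdict(int)       # all queries per minute
    entries = defaultdict(int)     # jobs running during each minute

    for epoch_s, ok, duration_s in records:
        b = minute_bin(epoch_s)
        if ok:
            success[b] += 1
        else:
            failure[b] += 1
        time_sum[b] += duration_s
        count[b] += 1

    # A job contributes one entry to every minute bin it overlaps.
    for start, end in job_intervals:
        for b in range(minute_bin(start), minute_bin(end) + 1):
            entries[b] += 1

    avg_time = {b: time_sum[b] / count[b] for b in count}
    return dict(success), dict(failure), avg_time, dict(entries)
```

With these two epoch times the time axis runs from minute 0 to `minute_bin(T_END)`.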
 
Changed:
<
<

One instance run from CERN

>
>

50 instances run as jobs submitted to various clusters

 
Changed:
<
<

8 instances run in parallel from CERN

>
>
  • success.gif:
    success.gif
 
Changed:
<
<

50 instances run as jobs submitted to various clusters

>
>
  • failure.gif:
    failure.gif

  • avgqtime.gif:
    avgqtime.gif
 
Added:
>
>
  • entries.gif:
    entries.gif

Consistency Checks on these data

It's probably a good idea to do some consistency checks of these data by comparing the entries of the histograms at the same time, and checking if it all makes sense.

  • Take the time at the very beginning. At this time only one job ran, thus N=15.
    • According to the avgqtime, dt = 3.4s, and there were thus 15x60s/3.4s = 265 or so queries per minute. This is consistent with the # of successes at that time (no failures).
  • Take the time at the very beginning of the long stretch of tests (t=300). At this time two jobs ran, thus N=30.
    • According to the avgqtime, dt = 3.95, and there were thus 30x60/3.95 = 456 queries per minute. This is consistent with the # of successes at t=300 (no failures)
  • Take t=400. At this time roughly 14 jobs ran simultaneously, thus N=14x15=210.
    • According to the avgqtime, dt is about 8.5s to 9s, and there were thus 210x60/8.5 = 1500 or so queries per minute (1400 for dt = 9s). This is not very consistent with the 2000 successes and no failures!
  • Take t=550. At this time the avgqtime dips down to about 1.5s, and there was a peak of 85 jobs running simultaneously, leading to N=85x15=1275.
    • We thus have up to about 1275x60/1.5=50k queries per minute at the peak. We see about 45k for success and failure combined, 90% of it being failures.
    • So this sort of hangs together as well.
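The arithmetic behind these checks can be reproduced with a few lines of Python. This is a sketch under the same assumption the checks make implicitly: each of the 15 threads in a job issues its queries back to back, so the expected rate is N x 60 / dt. The function name `queries_per_minute` and the layout of the `checks` table are illustrative only; the job counts and dt values are the ones read off the plots above (with dt = 8.5s used at t=400).

```python
THREADS_PER_JOB = 15  # N=15 parallel query threads per job, as in the text


def queries_per_minute(n_jobs, avg_query_time_s, threads_per_job=THREADS_PER_JOB):
    """Expected query rate if every thread issues queries back to back."""
    n_parallel = n_jobs * threads_per_job
    return n_parallel * 60.0 / avg_query_time_s


# (time in minutes, jobs running, average query time in seconds)
checks = [
    (0, 1, 3.4),
    (300, 2, 3.95),
    (400, 14, 8.5),
    (550, 85, 1.5),
]

for t_min, n_jobs, dt in checks:
    rate = queries_per_minute(n_jobs, dt)
    print(f"t={t_min}: N={n_jobs * THREADS_PER_JOB}, dt={dt}s, "
          f"expected {rate:.0f} queries per minute")
```

The expected rates can then be compared by eye with the sum of the success and failure histograms at the same minute.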

There is a spike near time=0 in the entries plot which makes very little sense. Maybe there was still a bug in the filling of that histogram at this time?

(Preliminary) Conclusion on these data

  • We did succeed in reaching a scale at which the bdii fails to function properly.
    • Between the epoch times 1221857546 and 1221866546 the bdii was basically unusable!
  • The turn-on curve where failures start to happen is very sharp, at around 200-300 queries in parallel, i.e. 15-20 jobs with 15 threads of queries each.
  • Surprisingly enough, the bdii recovers from this after the load subsides.
    • In fact, at around time=700min it has completely recovered, and is operating with about 15x15=225 queries in parallel, with each query taking about 4 seconds, thus reaching a peak of more than 3000 successful queries per minute (225x60/4 = 3375).
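To relate the epoch times quoted in these conclusions to the minute axis of the plots, the conversion can be sketched as below. It assumes the plots count minutes from epoch time 1221827546, the start of the first job; the helper name `minutes_since_start` is illustrative. The last lines redo the recovered-throughput estimate for 15 jobs of 15 threads at about 4 seconds per query.

```python
T_START = 1221827546  # epoch time at which the first job began (from the text)


def minutes_since_start(epoch_s, t_start=T_START):
    """Minutes elapsed since the first job started."""
    return (epoch_s - t_start) / 60.0


# Window during which the bdii was basically unusable (epoch times from the text)
unusable_from = minutes_since_start(1221857546)
unusable_to = minutes_since_start(1221866546)
print(f"unusable from t={unusable_from:.0f}min to t={unusable_to:.0f}min")

# Recovered throughput: 15 jobs x 15 threads, about 4 seconds per query
parallel_queries = 15 * 15
recovered_rate = parallel_queries * 60.0 / 4.0
print(f"recovered rate: about {recovered_rate:.0f} queries per minute")
```

Under that assumption the unusable window covers the peak at t=550 and ends before the recovery seen at around t=700.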

One instance run from CERN

8 instances run in parallel from CERN

 

Results from miscellaneous initial testruns

Line: 84 to 126
 
META FILEATTACHMENT attachment="riley-processes-day.png" attr="" comment="bdii host system monitoring: processes" date="1221476109" name="riley-processes-day.png" path="riley-processes-day.png" size="23149" stream="riley-processes-day.png" tmpFilename="/usr/tmp/CGItemp37463" user="FkW" version="1"
META FILEATTACHMENT attachment="riley-load-day.png" attr="" comment="bdii host system monitoring: loadavg" date="1221476130" name="riley-load-day.png" path="riley-load-day.png" size="29687" stream="riley-load-day.png" tmpFilename="/usr/tmp/CGItemp37522" user="FkW" version="1"
META FILEATTACHMENT attachment="bdii-from-ucsd.gif" attr="" comment="bdii response fro UCSD" date="1221476413" name="bdii-from-ucsd.gif" path="bdii-from-ucsd.gif" size="30115" stream="bdii-from-ucsd.gif" tmpFilename="/usr/tmp/CGItemp37458" user="FkW" version="1"
Added:
>
>
META FILEATTACHMENT attachment="success.gif" attr="" comment="" date="1221922405" name="success.gif" path="success.gif" size="10494" stream="success.gif" tmpFilename="/usr/tmp/CGItemp35592" user="FkW" version="1"
META FILEATTACHMENT attachment="failure.gif" attr="" comment="" date="1221922421" name="failure.gif" path="failure.gif" size="8248" stream="failure.gif" tmpFilename="/usr/tmp/CGItemp35519" user="FkW" version="1"
META FILEATTACHMENT attachment="avgqtime.gif" attr="" comment="" date="1221922437" name="avgqtime.gif" path="avgqtime.gif" size="10463" stream="avgqtime.gif" tmpFilename="/usr/tmp/CGItemp35543" user="FkW" version="1"
META FILEATTACHMENT attachment="entries.gif" attr="" comment="" date="1221922452" name="entries.gif" path="entries.gif" size="7201" stream="entries.gif" tmpFilename="/usr/tmp/CGItemp35560" user="FkW" version="1"
 