This page documents bdii load tests done by Sanjay Padhi for OSG.
Description of tools used
We use the python interface to ROOT in order to plot results in real time while the test is running, and then store the result histograms in a ROOT file. The logic of the test is as follows (a minimal sketch appears after the lists below):
- Run N python threads
- Each thread queries the bdii with:
- os.popen("ldapsearch -xLLL -p2170 -h is-dev.grid.iu.edu -b o=grid","r")
- Each thread collects the returned output and records the time it took for the query to complete.
- The average return time for the N threads is logged in the hprof histogram after all N threads have returned.
- Run the next N threads. A new batch of N threads is launched at most once per second; however, since the return time of the ldapsearch is several seconds, in practice the batches are launched much less often than once per second.
- Continue doing this for a fixed amount of time dt
We then record a few different things:
- The average time it takes for a query to return is plotted versus epoch time (hprof1 histogram)
- The number of queries executed is plotted versus epoch time (htime histogram)
- This is done for all queries, as well as separately for only those that fail. A query is defined as a failure if it does not come back with content.
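A minimal sketch of this test loop is given below. It is not the actual test program: PyROOT availability, the binning of the histograms, the total duration, and the output file name "bdii_loadtest.root" are all illustrative assumptions; only the query command and the histogram names hprof1/htime come from the description above.

    # Minimal sketch of the load-test client described above (assumptions:
    # PyROOT available; binning, duration and file name are illustrative).
    import os
    import time
    import threading

    import ROOT

    N = 15          # number of parallel query threads per batch
    DT = 3600       # total test duration in seconds
    QUERY = 'ldapsearch -xLLL -p2170 -h is-dev.grid.iu.edu -b o=grid'

    t_start = time.time()
    # average query time vs epoch time (profile), and number of queries vs epoch time
    hprof1 = ROOT.TProfile("hprof1", "avg query time;epoch time [s];seconds",
                           600, t_start, t_start + DT)
    htime = ROOT.TH1F("htime", "queries;epoch time [s];queries",
                      600, t_start, t_start + DT)

    def one_query(results, i):
        """Run one ldapsearch and record (elapsed time, got-content flag)."""
        t0 = time.time()
        output = os.popen(QUERY, "r").read()
        results[i] = (time.time() - t0, len(output) > 0)

    while time.time() - t_start < DT:
        results = [None] * N
        threads = [threading.Thread(target=one_query, args=(results, i))
                   for i in range(N)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()                      # wait until all N queries have returned
        now = time.time()
        avg = sum(r[0] for r in results) / N
        n_ok = sum(1 for r in results if r[1])   # queries that came back with content
        hprof1.Fill(now, avg)             # average return time of this batch
        htime.Fill(now, N)                # N queries were executed in this batch
        # separate failure histograms would be filled here with N - n_ok
        time.sleep(1.0)                   # launch batches at most once per second

    out = ROOT.TFile("bdii_loadtest.root", "RECREATE")
    hprof1.Write()
    htime.Write()
    out.Close()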
Final Results
Minimum sustainability:
- Rate is about 20.29 Hz with 1 failed thread
For the final results, we ran this as follows:
- Once at CERN for 1h with N=15.
- 8 instances of the test program run in parallel on our 8-core desktop at CERN.
- Submitted 50 instances of the test program as jobs to several clusters.
- As those 50 instances do not all start at the same time, we add up the histograms from all the jobs in such a way as to get a consistent time history.
- We do this by counting minutes from the start of the first of these jobs to the end of the last of these jobs, i.e. from epoch time 1221827546 to 1221885855 (a sketch of this merging appears after the next list).
- We make 4 plots:
- the total number of successful queries
- the total number of failed queries
- the average time it takes for queries
- the total number of "entries", where entries means the number of jobs that ran simultaneously at a given point in time.
50 instances run as jobs submitted to various clusters
- success.gif: total number of successful queries
- failure.gif: total number of failed queries
- avgqtime.gif: average query time
- entries.gif: number of jobs running simultaneously
Consistency Checks on these data
It is worth doing some consistency checks on these data by comparing the histogram entries at the same times and checking that it all makes sense (the expected rates follow the simple formula sketched after this list).
- Take the time at the very beginning. At this time only one job ran, thus N=15.
- According to the avgqtime, dt = 3.4s, and there were thus 15x60s/3.4s = 265 or so queries per minute. This is consistent with the # of successes at that time (no failures).
- Take the time at the very beginning of the long stretch of tests (t=300). At this time two jobs ran, thus N=30.
- According to the avgqtime, dt = 3.95s, and there were thus 30x60/3.95 = 456 queries per minute. This is consistent with the # of successes at t=300 (no failures)
- Take t=400. At this time roughly 14 jobs ran simultaneously, thus N=14x15=210.
- According to the avgqtime, dt = 9s, and there were thus 210x60/9 = 1400 or so queries per minute. This is not very consistent with the 2000 successes and no failures!
- Take t=550. At this time the avgqtime dips down to about 1.5s, and there was a peak of 85 jobs running simultaneously, leading to N=85x15=1275.
- We thus have up to about 1275x60/1.5=50k queries per minute at the peak. We see about 45k for success and failure combined, 90% of it being failures.
- So this sort of hangs together as well.
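All of these checks use the same expected-rate formula; the short sketch below simply reproduces the numbers quoted in the list.

    # Expected number of queries per minute if n_jobs jobs each run n_threads
    # parallel queries and each query takes dt seconds on average:
    #   rate = n_jobs * n_threads * 60 / dt
    def queries_per_minute(n_jobs, n_threads, dt):
        return n_jobs * n_threads * 60.0 / dt

    print(queries_per_minute(1, 15, 3.4))    # ~265   (start of the test)
    print(queries_per_minute(2, 15, 3.95))   # ~456   (t = 300)
    print(queries_per_minute(14, 15, 9.0))   # ~1400  (t = 400)
    print(queries_per_minute(85, 15, 1.5))   # ~51000, i.e. ~50k (t = 550)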
There is a spike near time=0 in the entries plot which makes very little sense. Maybe there was still a bug in the filling of that histogram at this time?
(Preliminary) Conclusion on these data
- We did succeed in reaching a scale at which the bdii fails to function properly.
- Between the epoch times 1221857546 and 1221866546 the bdii was basically unusable!
- The turn-on curve where failures start to happen is very sharp, at around 200-300 simultaneous queries, i.e. 15-20 jobs each running 15 query threads in parallel.
- Surprisingly enough, the bdii recovers from this after the load subsides.
- In fact, at around time=700min it has completely recovered, running about 15x15=225 queries in parallel with each query taking about 4 seconds, i.e. more than 3000 successful queries per minute.
One instance run from CERN
8 instances run in parallel from CERN
Results from miscellaneous initial test runs
Sunday September 14th
Ran a few different short tests, then two longer tests of a few hours. For the longer runs we picked N = 15 and dt = 12000 seconds (200 minutes = 3h 20min) and dt = 18000 seconds (300 minutes = 5h), respectively.
We then ran this test simultaneously from CERN (12000 seconds) and UCSD (18000 seconds). The CERN test ended at 2:35 Monday September 15th CERN time, while the UCSD one ended at 19:39 Pacific time on the 14th, i.e. 2h and 4min later.
- Response time for the bdii queries from CERN:
- Response time for the bdii queries from UCSD:
- bdii host system monitoring: network traffic:
- bdii host system monitoring: netstat:
- bdii host system monitoring: processes:
- bdii host system monitoring: loadavg:
Understanding the client profile better (Monday September 15th)
To understand the client profile better, we did a series of tests in which we first varied N on just one machine, and then kept N the same but ran the test program 4 times in parallel on 4 different (but identical) hosts.
We find that the time per query depends significantly on the number of parallel python threads, but not significantly on whether we run one instance or 4 simultaneously.
--
FkW - 10 Sep 2008