OSG BDII Testing
Introduction
A series of tests were performed to better understand how OSG's central US server responds to a high volume of BDII queries. The tests use the script bdii_par.py (from bdiisrc.tar.gz). One of the main goals is to measure the failure rate of BDII queries as a function of the query rate. We also measure the rates of both successful and failed queries as a function of the number of script instances running.
The tarball can be found at: /pnfs/t2.ucsd.edu/data4/cms/phedex/store/user/spadhi/osg/bdii/bdiisrc.tar.gz
Implementation
The script bdii_par.py is configured to send 15 queries at a time to the server is-dev.grid.iu.edu:2170. It then waits for the queries to return, counts the successful and failed queries, and repeats this cycle for a designated total amount of time.
To test the success and failure rates, we vary the number of script instances running simultaneously. For all of the tests, the instances are distributed over five 8-core machines at the UCSD T2. The numbers of instances tested are 5, 10, 15, 20, 25, and 30, so at most 6 instances run per 8-core machine. Each of these points was run for 12 hours; the 5-, 15-, 20-, and 30-instance points were also tested for shorter periods (2 or 3 hours).
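The batching described above can be sketched as follows. This is a minimal illustration of the parallel-query bookkeeping, not the actual bdii_par.py code; run_query is a hypothetical stand-in for the real LDAP query (which might, for example, shell out to ldapsearch against is-dev.grid.iu.edu:2170).

```python
import concurrent.futures

BATCH_SIZE = 15  # queries issued in parallel per batch, as in bdii_par.py

def run_query(_):
    # Hypothetical stand-in for the actual LDAP query; a real query
    # might run something like:
    #   ldapsearch -x -LLL -h is-dev.grid.iu.edu -p 2170 -b o=grid
    # and return False on a timeout or non-zero exit status.
    try:
        return True  # placeholder: always "succeeds" in this sketch
    except Exception:
        return False

def run_batch(n=BATCH_SIZE):
    """Issue n queries concurrently and return (passed, failed) counts."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(run_query, range(n)))
    passed = sum(results)
    return passed, n - passed
```

In the real script, batches like this are repeated until the designated run time elapses, accumulating the pass and fail counts reported in the table below.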
Results
Summary of Results
Our main findings are:
- The number of failures (and the failure rate) increases as the number of script instances increases.
- There are no failures below a certain number of instances: 10.
- There are no failures below a certain query rate: 10 Hz.
- The total query rate never exceeds 20 Hz. As we run more simultaneous queries, the queries take longer, and a larger fraction of them fails.
Detailed Results
Run Summary Table
The following table summarizes the results of the different runs. Table entries are totals over all processes (script instances) in a given run, except for columns labeled 'average', which give the average over the processes.
'Input time' is the parameter that tells the script how long to run. 'Average run time' is the average of the actual time each script ran, as output by the script. 'Average pass (fail) rate' is 'Total queries passed (failed)' divided by 'Average run time'.
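As a worked example of the rate definition, the average pass rate for the 5-process, 12-hour run follows directly from the other two columns:

```python
# Values taken from the 5-process, 12 h row of the table
total_passed = 425055   # 'Total queries passed'
avg_run_time = 43202.0  # 'Average run time' in seconds

pass_rate = total_passed / avg_run_time
print(f"{pass_rate:.2f} Hz")  # matches the tabulated 9.84 Hz
```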
| Num processes | Input time (h) | Average run time (s) | Total queries passed | Average pass rate (Hz) | Total queries failed | Average fail rate (Hz) |
| 5 | 2 | 7204 | 70095 | 9.73 | 0 | 0 |
| 5 | 12 | 43202 | 425055 | 9.84 | 0 | 0 |
| 10 | 12 | 43218 | 635995 | 14.7 | 5 | 0.00012 |
| 15 | 3 | 10872 | 160183 | 14.7 | 47 | 0.0043 |
| 15 | 12 | 43221 | 820458 | 19.0 | 102 | 0.0023 |
| 20 | 3 | 10847 | 161719 | 14.90 | 176 | 0.016 |
| 20 | 12 | 43312 | 834140 | 19.3 | 775 | 0.017 |
| 25 | 12 | 43380 | 666060 | 15.4 | 1665 | 0.038 |
| 30 | 2 | 7337 | 109202 | 14.9 | 388 | 0.052 |
| 30 | 12 | 43464 | 659036 | 15.16 | 2419 | 0.055 |
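The 'Average fail rate' column can be cross-checked against the 'Total queries failed' and 'Average run time' columns; a short sketch using the 12-hour runs (values copied from the table, rates agree with the tabulated ones to rounding):

```python
# (processes, average run time [s], total queries failed) for the 12 h runs
runs = [(10, 43218, 5), (15, 43221, 102), (20, 43312, 775),
        (25, 43380, 1665), (30, 43464, 2419)]

fail_rates = {procs: failed / t for procs, t, failed in runs}
for procs, rate in sorted(fail_rates.items()):
    print(f"{procs} processes: {rate:.5f} Hz")
```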
Failure rate versus Number of processes
Plotted here is the failure rate as a function of the total number of script instances running concurrently; this data appears in the first and last columns of the table above. Recall that each process is an instance of our script, and each script runs 15 queries in parallel.
Script Instances Histogram
Plotted below is a histogram of the individual script instances. The top histogram shows each instance's success rate: the number of queries passed by that instance divided by its total run time. The bottom histogram shows each instance's failure rate: the number of failed queries divided by that instance's total run time. In both histograms, the colors indicate the total number of script instances running concurrently with that instance, as shown in the legend. The plots are overlaid, not stacked.
--
WarrenAndrews - 2008/12/18