dCache Scalability Tests
This page contains the results of tests run against dCache as part of the OSG scalability and reliability area activity.
FNAL test instance
The FNAL team, composed mainly of Tanya Levshina and Neha Sharma, operates a test instance of dCache.
The test instance is composed of 6 nodes:
- dCache admin node
- chimera/pnfs node
- srm node
- a dCache door, and
- 2 pool nodes.
The SRM machine is a single 3.6GHz Xeon with 4GB of RAM and GigE Ethernet, running a 32-bit version of SL4.
FNAL lcg-ls tests
The tests reported in this section were performed against that instance, using a glideTester instance running on FNAL sleeper pool resources.
All glideTester jobs were running lcg-ls in a tight loop for a specified amount of time.
lcg-ls -b -D srmv2 $mytestdir
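Each glideTester job can be sketched as a timed tight loop around that command. The script below is a minimal illustration of the assumed structure; the function name, counters, and output format are illustrative, not the actual glideTester code.

```shell
#!/bin/bash
# Sketch (assumed structure) of one load-generator job: run a command
# in a tight loop until a deadline, counting successes and failures.
# In the actual tests the command was "lcg-ls -b -D srmv2 $mytestdir"
# and the duration 1200s (20 minutes).
run_loop() {
    local cmd=$1 duration=$2
    local ok=0 bad=0
    local end=$(( $(date +%s) + duration ))
    while [ "$(date +%s)" -lt "$end" ]; do
        if eval "$cmd" > /dev/null 2>&1; then
            ok=$((ok + 1))
        else
            bad=$((bad + 1))
        fi
    done
    echo "succeeded=$ok failed=$bad"
}

# Hypothetical invocation (directory and duration from the test setup):
# run_loop 'lcg-ls -b -D srmv2 "$mytestdir"' 1200
```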
Run 1
This run ran against dCache v1.9.5-17 with Chimera. Default parameters were used.
The glideTester jobs were configured to run for 20 minutes (1200s).
Complete results can be seen below:
| Concurrency | Succeeded (rate) | Failed |
| 5 | 6264 (5Hz) | 0 |
| 20 | 10510 (9Hz) | 0 |
| 50 | 10324 (9Hz) | 0 |
| 100 | 716 (0.5Hz) | 0 |
| 150 | 300 (<0.5Hz) | 0 |
| 200 | 400 (<0.5Hz) | 0 |
| 300 | 600 (<0.5Hz) | 0 |
The tested system peaked at about 9Hz with 50 clients. Beyond that level it became painfully slow and essentially unusable, although no user-visible errors were observed.
It should be noted that once the unstable situation was reached, the system remained very slow even with a small number of clients, until a human operator fixed the problem on the server side.
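The rates quoted in the tables are simply successes divided by the run duration; for instance, for the 50-client row above (a quick sanity check, not part of the test tooling):

```shell
# Rate = successful operations / wall-clock duration.
# Numbers from the 50-client row: 10324 lcg-ls calls in a 1200s run.
succeeded=10324
duration=1200
rate=$(awk -v s="$succeeded" -v d="$duration" 'BEGIN { printf "%.1f", s / d }')
echo "$rate Hz"   # 8.6 Hz, reported as 9Hz in the table
```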
Run 2
This run ran against dCache v1.9.5-12 with PNFS. Default parameters were used.
The glideTester jobs were configured to run for 20 minutes (1200s) up to 200 jobs, and for 40 minutes (2400s) for higher concurrencies.
Complete results can be seen below:
| Concurrency | Succeeded (rate) | Failed |
| 5 | 2191 (2Hz) | 0 |
| 20 | 2230 (2Hz) | 0 |
| 50 | 2243 (2Hz) | 0 |
| 100 | 2270 (2Hz) | 0 |
| 150 | 2307 (2Hz) | 0 |
| 200 | 2312 (2Hz) | 0 |
| 300 | 4607 (2Hz) | 0 |
| 400 | 4618 (2Hz) | 0 |
| 500 | 4676 (2Hz) | 0 |
| 600 | 4790 (2Hz) | 0 |
| 700 | 4709 (2Hz) | 1280 |
| 800 | 4899 (2Hz) | 4722 |
| 900 | 2645 (1Hz) | 8011 |
| 1000 | 2625 (1Hz) | 11847 |
The tested system was pretty consistent at 2Hz up to 800 clients. However, starting at 700 clients users started to see errors.
Run 3
This run ran against dCache v1.9.5-19 with Chimera. The following parameters were tuned:
- max_connections increased from the default 100 to 250 in postgresql.conf of the SRM database
- in dCacheSetup: srmAsynchronousLs=true
- in /opt/d-cache/libexec/apache-tomcat-5.5.20/conf/server.xml, in the element <Connector className="org.globus.tomcat.coyote.net.HTTPSConnector" ...>, set these parameters:
  - maxThreads="1000"
  - minSpareThreads="25"
  - maxSpareThreads="200"
  - maxProcessors="1000"
  - minProcessors="25"
  - maxSpareProcessors="200"
  - enableLookups="false"
  - disableUploadTimeout="true"
  - acceptCount="1024"
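Put together, the Connector tuning corresponds to an element roughly like the following sketch; any attributes not listed in the test notes (e.g. the port) belong to the stock server.xml and are intentionally not shown here.

```xml
<!-- /opt/d-cache/libexec/apache-tomcat-5.5.20/conf/server.xml (sketch) -->
<!-- Only the attributes below come from the tuning list; all other -->
<!-- attributes of the stock Connector element are left unchanged. -->
<Connector className="org.globus.tomcat.coyote.net.HTTPSConnector"
           maxThreads="1000" minSpareThreads="25" maxSpareThreads="200"
           maxProcessors="1000" minProcessors="25" maxSpareProcessors="200"
           enableLookups="false" disableUploadTimeout="true"
           acceptCount="1024" />
```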
The glideTester jobs were configured to run for 40 minutes (2400s).
Complete results can be seen below:
| Concurrency | Succeeded (rate) | Failed |
| 10 | 4408 (1.8Hz) | 0 |
| 50 | 10450 (4.3Hz) | 0 |
| 100 | 8087 (3.4Hz) | 0 |
| 150 | 6922 (2.9Hz) | 0 |
| 200 | 6375 (2.7Hz) | 0 |
| 300 | 5442 (2.3Hz) | 0 |
| 400 | 4931 (2.1Hz) | 0 |
| 600 | 4469 (1.9Hz) | 231 |
| 800 | 4193 (1.7Hz) | 784 |
| 1000 | 3791 (1.6Hz) | 2682 |
The tested system peaked at ~4.3Hz with 50 clients, slowly degrading to just below 2Hz at 1000 clients. Errors started to show up at 600 clients, but became really problematic after 800 clients.
Run 4
--
IgorSfiligoi - 2010/06/17
This test used postgres and kwp, with the following parameters:
- max_connections: 250
- shared_buffers: 512MB
- work_mem: 16MB
- max_fsm_pages: 1000000
The glideTester jobs were configured to run for 40 minutes (2400s).
Complete results can be seen below:
| Concurrency | Succeeded (rate) | Failed |
| 10 | 4520 (1.8Hz) | 0 |
| 50 | 6527 (2.7Hz) | 0 |
| 100 | 5457 (2.3Hz) | 0 |
| 150 | 4933 (2.1Hz) | 0 |
| 200 | 4697 (2.0Hz) | 0 |
| 300 | 4216 (1.8Hz) | 0 |
| 400 | 4038 (1.7Hz) | 62 |
| 600 | 3707 (1.5Hz) | 102 |
| 800 | 3524 (1.5Hz) | 850 |
| 1000 | 2338 (1.0Hz) | 9727 |
The tested system peaked at 2.7Hz with 50 clients, stayed in the 2Hz range until about 400 clients, and then started to degrade. Errors started to appear at 400 clients, but became really problematic around 800.
Run 5
--
IgorSfiligoi - 2010/06/17
This test was similar to Run 4, but using GUMS.
The glideTester jobs were configured to run for 40 minutes (2400s).
Complete results can be seen below:
| Concurrency | Succeeded (rate) | Failed |
| 10 | 4499 (1.8Hz) | 0 |
| 50 | 6482 (2.7Hz) | 0 |
| 100 | 5537 (2.3Hz) | 0 |
| 150 | 4947 (2.1Hz) | 0 |
| 200 | 4719 (2.0Hz) | 0 |
| 300 | 4123 (1.8Hz) | 0 |
| 400 | 4025 (1.7Hz) | 0 |
| 600 | 3721 (1.5Hz) | 114 |
| 800 | 2806 (1.2Hz) | 6491 |
| 1000 | 931 (0.4Hz) | 21980 + 38 hung clients |
At low concurrencies, the tested system gave results similar to Run 4. It peaked at 2.7Hz with 50 clients, stayed in the 2Hz range until about 400 clients, and then started to degrade.
However, once 800 clients were reached, the system misbehaved badly.
Run 6
--
IgorSfiligoi - 2010/12/22
This test was similar to Run 5, but using XACML GUMS and dcache version 1.9.5-23.
The glideTester jobs were configured to run for 40 minutes (2400s).
Complete results can be seen below:
| Concurrency | Succeeded (rate) | Failed |
| 50 | 11.3k (4.7Hz) | 0 |
| 100 | 9.0k (3.7Hz) | 0 |
| 150 | 7.5k (3.1Hz) | 0 |
| 200 | 6.4k (2.7Hz) | 0 |
| 300 | 5.9k (2.5Hz) | 0 |
| 400 | 5.0k (2.1Hz) | 0 |
| 600 | 4.9k (2.0Hz) | 135 |
| 800 | 0 | all |
| 1000 | 0 | all |
The tested system performs significantly better at low concurrencies (4.7Hz vs 2.7Hz), but it steadily declines and is only marginally better at higher concurrencies.
The concurrency limit is still around the 600 mark.
Run 7
--
IgorSfiligoi - 2011/04/19
This test was similar to Run 6, except that the OS was upgraded to SL5 and the dCache version is dcache-server-1.9.5-25. The clients were running on the UCSD sleeper pool.
The glideTester jobs were configured to run for 20 minutes (1200s) for concurrencies up to 150, 40 minutes (2400s) up to 400, and 80 minutes (4800s) beyond that.
Complete results can be seen below:
| Concurrency | Succeeded (Rate) | Failed |
| 25 | 4.5k (3.7Hz) | 0 |
| 50 | 5.5k (4.6Hz) | 0 |
| 75 | 5.1k (4.3Hz) | 0 |
| 100 | 4.6k (3.9Hz) | 0 |
| 150 | 4.3k (3.6Hz) | 0 |
| 200 | 7.4k (3.1Hz) | 0 |
| 300 | 6.3k (2.6Hz) | 0 |
| 400 | 6.8k (2.8Hz) | 0 |
| 600 | 12k (2.5Hz) | 1.1k |
| 800 | 7.6k (1.6Hz) | 8.5k |
| 1000 | 3.4k (0.7Hz) | 2.1M |
| 1200 | 0 | 6.9M |
The tested system performs similarly to the previous test, although it is marginally better. The concurrency limit seems to have improved to about the 800 mark.
The server hung during the 1.2k run, and had to be manually restarted.
Run 8
--
IgorSfiligoi - 2011/05/23
This test was similar to Run 7, except that the dCache version is now dcache-server-1.9.5-26. The clients were running on the UCSD sleeper pool.
The glideTester jobs were configured to run for 20 minutes (1200s) for concurrencies up to 150, 40 minutes (2400s) up to 400, and 80 minutes (4800s) beyond that.
Complete results can be seen below:
| Concurrency | Succeeded (Rate) | Failed |
| 25 | 4.5k (3.7Hz) | 0 |
| 50 | 5.5k (4.6Hz) | 1 |
| 75 | 5.0k (4.3Hz) | 0 |
| 100 | 4.5k (3.9Hz) | 0 |
| 150 | 4.0k (3.3Hz) | 0 |
| 200 | 6.7k (2.8Hz) | 2 |
| 300 | 6.3k (2.6Hz) | 0.4k |
| 400 | 5.1k (2.1Hz) | 1.0k |
| 600 | 6.3k (1.3Hz) | 1.7k |
| 800 | 2.5k (0.5Hz) | 76k |
| 1000 | 0 | ALL |
The system performed slightly worse than before, hitting the wall at 1k. But it did recover by itself once the jobs stopped.
FNAL lcg-cp tests
The tests reported in this section were performed against that instance, using a glideTester instance using FNAL sleeper pool resources.
All glideTester jobs were copying a 10MByte file from the SE to the local disk with lcg-cp in a tight loop for a specified amount of time.
lcg-cp -b -D srmv2 $mytestdir/igors_file.dat file:$PWD/igors_file.dat
Note: lcg-cp returns 0 even when it fails!
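Because of that, success cannot be judged from the exit code alone; one workaround is to verify the transferred file itself. The sketch below is an assumed check, not the actual test harness: file names and paths are placeholders, and the 10MByte size is taken as 10*1024*1024 bytes.

```shell
#!/bin/bash
# lcg-cp may return 0 even when the transfer failed, so verify the
# copy by checking that the local file exists and has the expected
# size (10 MBytes, here assumed to be 10*1024*1024 bytes).
expected_size=$((10 * 1024 * 1024))

copy_ok() {
    # Return 0 only if the local file exists and has the expected size.
    local local_file=$1 size
    [ -f "$local_file" ] || return 1
    size=$(wc -c < "$local_file") || return 1
    [ "$size" -eq "$expected_size" ]
}

# Hypothetical use inside the test loop:
# lcg-cp -b -D srmv2 "$mytestdir/igors_file.dat" "file:$PWD/igors_file.dat"
# copy_ok "$PWD/igors_file.dat" || bad=$((bad + 1))
```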
Run 1
This run ran against dCache v1.9.5-19 with Chimera. The following parameters were tuned:
- max_connections increased from the default 100 to 250 in postgresql.conf of the SRM database
- in dCacheSetup: srmAsynchronousLs=true
- in /opt/d-cache/libexec/apache-tomcat-5.5.20/conf/server.xml, in the element <Connector className="org.globus.tomcat.coyote.net.HTTPSConnector" ...>, set these parameters:
  - maxThreads="1000"
  - minSpareThreads="25"
  - maxSpareThreads="200"
  - maxProcessors="1000"
  - minProcessors="25"
  - maxSpareProcessors="200"
  - enableLookups="false"
  - disableUploadTimeout="true"
  - acceptCount="1024"
The glideTester jobs were configured to run for 40 minutes (2400s).
Complete results can be seen below:
| Concurrency | Succeeded (rate) | Failed |
| 10 | 1933 (0.8Hz) | 0 |
| 50 | 5372 (2.2Hz) | 0 |
| 100 | 4810 (2Hz) | 0 |
| 150 | 4090 (1.7Hz) | 39 |
| 200 | 3403 (1.4Hz) | 398* |
| 300 | 3442 (1.4Hz) | 642 |
| 400 | 3404 (1.4Hz) | 421 |
| 600 | 3946 (1.6Hz) | 767* |
| 800 | 2287 (1Hz) | 3072^ |
| 1000 | 0 | all failed |
* - Two jobs got stuck and did not finish for over 1 hour and had to be hard killed.
^ - Twenty jobs got stuck and did not finish for over 1 hour and had to be hard killed.
The tested system peaked at 50 clients, delivering files at 2.2Hz, or 200Mbit/s, then declined to ~1.4Hz.
The first errors appeared at 150 clients, but remained bearable up to about 600 clients.
With 800 clients, more than half of all attempts failed, while with 1000 clients all attempts failed.
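The bandwidth figure is just the transfer rate times the file size; a quick check (taking 1 MByte = 10^6 bytes, an assumption, with the text rounding the result up):

```shell
# Peak throughput from the lcg-cp rate: 2.2 files/s at 10 MBytes/file.
rate=2.2      # files per second (peak, 50 clients)
mbytes=10     # file size in MBytes
mbits=$(awk -v r="$rate" -v m="$mbytes" 'BEGIN { printf "%.0f", r * m * 8 }')
echo "$mbits Mbit/s"   # 176 Mbit/s, quoted as ~200Mbit/s in the text
```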
Run 2
--
IgorSfiligoi - 2010/06/17
This test used postgres and kwp, with the following parameters:
- max_connections: 250
- shared_buffers: 512MB
- work_mem: 16MB
- max_fsm_pages: 1000000
The glideTester jobs were configured to run for 40 minutes (2400s).
Complete results can be seen below:
| Concurrency | Succeeded (rate) | Failed |
| 50 | 5353 (2.2Hz) | 0 |
| 100 | 4678 (1.9Hz) | 0 |
| 200 | 3472 (1.4Hz) | 320 + 1 hung client |
| 400 | 391 (0.2Hz) | 29198 |
| 600 | 0 | 24000 |
The tested system peaked at 2.2Hz with 50 clients, or 200Mbit/s, and then rapidly deteriorated.
By 400 clients the system was practically unusable.
Run 3
--
IgorSfiligoi - 2010/06/17
This test was similar to Run 2, but using GUMS.
The glideTester jobs were configured to run for 40 minutes (2400s).
Complete results can be seen below:
| Concurrency | Succeeded (rate) | Failed |
| 10 | 1921 (0.8Hz) | 1 |
| 50 | 5399 (2.2Hz) | 1 |
| 100 | 4694 (2.0Hz) | 1 |
| 200 | 113 (0.0Hz) | 4790 + 12 hung |
Like with Run 2, the tested system peaked at 2.2Hz with 50 clients and 200Mbit/s, and then rapidly deteriorated.
The deterioration rate was much faster, though. At 200 clients the system was already unusable.
Run 4
--
IgorSfiligoi - 2010/12/22
This test was similar to Run 3, but using XACML GUMS and dcache version 1.9.5-23.
The glideTester jobs were configured to run for 40 minutes (2400s).
Complete results can be seen below:
| Concurrency | Succeeded (rate) | Failed |
| 25 | 4.3k (1.8Hz) | 0 |
| 50 | 5.2k (2.2Hz) | 0 |
| 75 | 5.0k (2.1Hz) | 0 |
| 100 | 4.9k (2.0Hz) | 0 |
| 125 | 4.7k (2.0Hz) | 0 |
| 150 | 4.5k (1.9Hz) | 15 |
| 175 | 4.1k (1.7Hz) | 250 |
| 200 | 3.8k (1.6Hz) | 520 |
| 250 | 3.4k (1.4Hz) | 720 |
| 300 | 3.3k (1.4Hz) | 860 |
| 350 | 3.4k (1.4Hz) | 840 |
| 400 | 3.0k (1.3Hz) | 1200 |
| 450 | 3.3k (1.4Hz) | 530 |
| 500 | 3.2k (1.3Hz) | 820 |
| 550 | 1.6k (0.7Hz) | 3.2k + 26 hung |
| 600 | 0 | all |
| 650 | 0 | all |
Like with Run 3, the tested system peaked at 2.2Hz with 50 clients and 200Mbit/s. But the deterioration was much slower; while errors started to appear around the 150 mark, the system was still usable (with retries) up to about 500 concurrent clients.
Run 5
--
IgorSfiligoi - 2011/04/20
This test was similar to Run 4, except that the OS was upgraded to SL5 and the dCache version is dcache-server-1.9.5-25. The clients were running on the UCSD sleeper pool.
The glideTester jobs were configured to run for 20 minutes (1200s) for concurrencies up to 150, 40 minutes (2400s) up to 400, and 80 minutes (4800s) beyond that.
Complete results can be seen below:
| Concurrency | Succeeded (Rate) | Failed |
| 25 | 1.7k (1.4Hz) | 0 |
| 50 | 2.4k (2.0Hz) | 1 |
| 75 | 2.4k (2.0Hz) | 0 |
| 100 | 2.4k (2.0Hz) | 0 |
| 150 | 2.3k (1.9Hz) | 33 |
| 200 | 3.6k (1.5Hz) | 712 |
| 300 | 3.4k (1.4Hz) | 907 |
| 400 | 3.3k (1.4Hz) | 1.2k |
| 600 | 5.6k (1.2Hz) | 2.8k |
| 800 | 6.1k (1.3Hz) | 4.6k |
| 1000 | 0.9k | 5.6M |
| 1200 | 0 | 6.9M |
The tested system performs similarly to the previous test, although it is marginally better. The concurrency limit seems to have improved to about the 800 mark.
The server hung during the 1.2k run, and had to be manually restarted.
Run 6
--
IgorSfiligoi - 2011/05/23
This test was similar to Run 5, except that the dCache version is now dcache-server-1.9.5-26. The clients were running on the UCSD sleeper pool.
The glideTester jobs were configured to run for 20 minutes (1200s) for concurrencies up to 150, and 80 minutes (4800s) after that.
Complete results can be seen below:
| Concurrency | Succeeded (Rate) | Failed |
| 25 | 2.0k (1.7Hz) | 0 |
| 50 | 2.5k (2.1Hz) | 1 |
| 75 | 2.4k (2.0Hz) | 0 |
| 100 | 2.3k (1.9Hz) | 0 |
| 150 | 2.1k (1.8Hz) | 43 |
| 200 | 6.7k (1.4Hz) | 1.0k |
| 300 | 5.9k (1.2Hz) | 2.3k |
| 400 | 4.9k (1.0Hz) | 6.7k |
| 600 | 2.6k (0.5Hz) | 28k |
The system performed slightly worse than before, already degrading severely at 600 clients. But it never hung.