dCache Scalability Tests

This page contains the results of tests run against dCache as part of the OSG scalability and reliability area activity.

FNAL test instance

The FNAL team, composed mainly of Tanya Levshina and Neha Sharma, operates a test instance of dCache.

The test instance is composed of 6 nodes:

  • a dCache admin node
  • a Chimera/PNFS node
  • an SRM node
  • a dCache door
  • 2 pool nodes

The SRM machine is a single 3.6GHz Xeon with 4GB of RAM and Gigabit Ethernet, running a 32-bit version of SL4.

FNAL lcg-ls tests

The tests reported in this section were performed against that instance, with a glideTester instance running on FNAL sleeper pool resources.

All glideTester jobs were running lcg-ls in a tight loop for a specified amount of time.

 lcg-ls -b -D srmv2 $mytestdir
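
For reference, each job's inner loop presumably looked something like the sketch below. The actual wrapper script is not recorded on this page, so this is only a minimal sketch; the mytestdir and duration variables are assumptions standing in for the configured SRM test directory and run length.

 #!/bin/bash
 # Minimal sketch of an lcg-ls glideTester job (assumed wrapper, not the original script).
 # mytestdir and duration are hypothetical names for the configured SRM test directory
 # and run length in seconds; glideTester would supply the real values.
 mytestdir=${mytestdir:?need SRM URL of the test directory}
 duration=${duration:-1200}
 end=$(( $(date +%s) + duration ))
 ok=0; bad=0
 while [ "$(date +%s)" -lt "$end" ]; do
     if lcg-ls -b -D srmv2 "$mytestdir" > /dev/null 2>&1; then
         ok=$(( ok + 1 ))
     else
         bad=$(( bad + 1 ))
     fi
 done
 echo "succeeded=$ok failed=$bad"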

Run 1

This run was performed against dCache v1.9.5-17 with Chimera. Default parameters were used.

The glideTester jobs were configured to run for 20 minutes (1200s).

Complete results can be seen below:

| Concurrency | Succeeded (rate) | Failed |
| 5 | 6264 (5Hz) | 0 |
| 20 | 10510 (9Hz) | 0 |
| 50 | 10324 (9Hz) | 0 |
| 100 | 716 (0.5Hz) | 0 |
| 150 | 300 (<0.5Hz) | 0 |
| 200 | 400 (<0.5Hz) | 0 |
| 300 | 600 (<0.5Hz) | 0 |

The tested system peaked at about 9Hz with 50 clients. Beyond that level it became painfully slow, essentially unusable, although no user-visible errors were observed.
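
Note that the quoted rates appear to be aggregate rates, i.e. the total number of successful operations divided by the run duration rather than a per-client figure; for example, the 20-client row works out to 10510 / 1200s ≈ 8.8Hz, matching the quoted 9Hz.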

It should be noted that once the unstable situation was reached, the system remained very slow even with a small number of clients, until a human operator fixed the problem on the server side.

Run 2

This run was performed against dCache v1.9.5-12 with PNFS. Default parameters were used.

The glideTester jobs were configured to run for 20 minutes (1200s) up to 200 jobs, and for 40 minutes (2400s) for higher concurrencies.

Complete results can be seen below:

| Concurrency | Succeeded (rate) | Failed |
| 5 | 2191 (2Hz) | 0 |
| 20 | 2230 (2Hz) | 0 |
| 50 | 2243 (2Hz) | 0 |
| 100 | 2270 (2Hz) | 0 |
| 150 | 2307 (2Hz) | 0 |
| 200 | 2312 (2Hz) | 0 |
| 300 | 4607 (2Hz) | 0 |
| 400 | 4618 (2Hz) | 0 |
| 500 | 4676 (2Hz) | 0 |
| 600 | 4790 (2Hz) | 0 |
| 700 | 4709 (2Hz) | 1280 |
| 800 | 4899 (2Hz) | 4722 |
| 900 | 2645 (1Hz) | 8011 |
| 1000 | 2625 (1Hz) | 11847 |

The tested system was fairly consistent at 2Hz up to 800 clients. However, starting at 700 clients, users began to see errors.

Run 3

This run was performed against dCache v1.9.5-19 with Chimera. The following parameters were tuned:

  1. modify max_connections from default 100 to 250 in postgresql.conf of srm database
  2. in dCacheSetup, make sure:
    srmAsynchronousLs=true
  3. in /opt/d-cache/libexec/apache-tomcat-5.5.20/conf/server.xml,
    find element: <Connector className="org.globus.tomcat.coyote.net.HTTPSConnector"
    set these parameters (see the sketch after this list):
  • maxThreads="1000"
  • minSpareThreads="25"
  • maxSpareThreads="200"
  • maxProcessors="1000"
  • minProcessors="25"
  • maxSpareProcessors="200"
  • enableLookups="false"
  • disableUploadTimeout="true"
  • acceptCount="1024"
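
Putting item 3 together, the tuned Connector element in server.xml would end up looking roughly like the sketch below. This is only an illustration assembled from the attributes listed above; any attributes already present on the element but not listed here are omitted from the sketch and assumed to stay as they were.

 <!-- sketch only: other attributes of the original Connector element are omitted here -->
 <Connector className="org.globus.tomcat.coyote.net.HTTPSConnector"
            maxThreads="1000"
            minSpareThreads="25"
            maxSpareThreads="200"
            maxProcessors="1000"
            minProcessors="25"
            maxSpareProcessors="200"
            enableLookups="false"
            disableUploadTimeout="true"
            acceptCount="1024" />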

The glideTester jobs were configured to run for 40 minutes (2400s).

Complete results can be seen below:

| Concurrency | Succeeded (rate) | Failed |
| 10 | 4408 (1.8Hz) | 0 |
| 50 | 10450 (4.3Hz) | 0 |
| 100 | 8087 (3.4Hz) | 0 |
| 150 | 6922 (2.9Hz) | 0 |
| 200 | 6375 (2.7Hz) | 0 |
| 300 | 5442 (2.3Hz) | 0 |
| 400 | 4931 (2.1Hz) | 0 |
| 600 | 4469 (1.9Hz) | 231 |
| 800 | 4193 (1.7Hz) | 784 |
| 1000 | 3791 (1.6Hz) | 2682 |

The tested system peaked at ~4.3Hz with 50 clients, slowly degrading to just below 2Hz at 1000 clients. Errors started to show up at 600 clients, but became really problematic after 800 clients.

Run 4

-- IgorSfiligoi - 2010/06/17

This test used postgres and kpwd, with the following parameters:

  • max_connections - 250
  • shared_buffers - 512MB
  • work_mem - 16MB
  • max_fsm_pages - 1000000

The glideTester jobs were configured to run for 40 minutes (2400s).

Complete results can be seen below:

| Concurrency | Succeeded (rate) | Failed |
| 10 | 4520 (1.8Hz) | 0 |
| 50 | 6527 (2.7Hz) | 0 |
| 100 | 5457 (2.3Hz) | 0 |
| 150 | 4933 (2.1Hz) | 0 |
| 200 | 4697 (2.0Hz) | 0 |
| 300 | 4216 (1.8Hz) | 0 |
| 400 | 4038 (1.7Hz) | 62 |
| 600 | 3707 (1.5Hz) | 102 |
| 800 | 3524 (1.5Hz) | 850 |
| 1000 | 2338 (1.0Hz) | 9727 |

The tested system peaked at 2.7Hz with 50 clients, stayed in the 2Hz range until about 400 clients, and then started to degrade. Errors started to appear at 400 clients, but became really problematic at around 800.

Run 5

-- IgorSfiligoi - 2010/06/17

This test was similar to Run 4, but using GUMS.

The glideTester jobs were configured to run for 40 minutes (2400s).

Complete results can be seen below:

| Concurrency | Succeeded (rate) | Failed |
| 10 | 4499 (1.8Hz) | 0 |
| 50 | 6482 (2.7Hz) | 0 |
| 100 | 5537 (2.3Hz) | 0 |
| 150 | 4947 (2.1Hz) | 0 |
| 200 | 4719 (2.0Hz) | 0 |
| 300 | 4123 (1.8Hz) | 0 |
| 400 | 4025 (1.7Hz) | 0 |
| 600 | 3721 (1.5Hz) | 114 |
| 800 | 2806 (1.2Hz) | 6491 |
| 1000 | 931 (0.4Hz) | 21980 + 38 hung clients |

At low concurrencies, the tested system gave results similar to Run 4. It peaked at 2.7Hz with 50 clients, stayed in the 2Hz range until about 400 clients, and then started to degrade.

However, once 800 clients were reached, the system misbehaved badly.

Run 6

-- IgorSfiligoi - 2010/12/22

This test was similar to Run 5, but using XACML GUMS and dcache version 1.9.5-23.

The glideTester jobs were configured to run for 40 minutes (2400s).

Complete results can be seen below:

| Concurrency | Succeeded (rate) | Failed |
| 50 | 11.3k (4.7Hz) | 0 |
| 100 | 9.0k (3.7Hz) | 0 |
| 150 | 7.5k (3.1Hz) | 0 |
| 200 | 6.4k (2.7Hz) | 0 |
| 300 | 5.9k (2.5Hz) | 0 |
| 400 | 5.0k (2.1Hz) | 0 |
| 600 | 4.9k (2.0Hz) | 135 |
| 800 | 0 | all |
| 1000 | 0 | all |

The tested system performs significantly better at low concurrencies (4.7Hz vs 2.7Hz), but it steadily declines and is only marginally better at higher concurrencies.

The concurrency limit is still around the 600 mark.

Run 7

-- IgorSfiligoi - 2011/04/19

This test was similar to Run 6, except that the OS was upgraded to SL5 and the dCache version was dcache-server-1.9.5-25. The clients were running on the UCSD sleeper pool.

The glideTester jobs were configured to run for 20 minutes (1200s) for concurrencies up to 150, for 40 minutes (2400s) up to 400, and for 80 minutes (4800s) above that.

Complete results can be seen below:

| Concurrency | Succeeded (Rate) | Failed |
| 25 | 4.5k (3.7Hz) | 0 |
| 50 | 5.5k (4.6Hz) | 0 |
| 75 | 5.1k (4.3Hz) | 0 |
| 100 | 4.6k (3.9Hz) | 0 |
| 150 | 4.3k (3.6Hz) | 0 |
| 200 | 7.4k (3.1Hz) | 0 |
| 300 | 6.3k (2.6Hz) | 0 |
| 400 | 6.8k (2.8Hz) | 0 |
| 600 | 12k (2.5Hz) | 1.1k |
| 800 | 7.6k (1.6Hz) | 8.5k |
| 1000 | 3.4k (0.7Hz) | 2.1M |
| 1200 | 0 | 6.9M |

The tested system performs similarly to the previous test, although it is marginally better. The concurrency limit seems to have improved to about the 800 mark.

The server hung up during the 1.2k run, and had to be manually restarted.

Run 8

-- IgorSfiligoi - 2011/05/23

This test was similar to Run 7, except that the dCache version was now dcache-server-1.9.5-26. The clients were running on the UCSD sleeper pool.

The glideTester jobs were configured to run for 20 minutes (1200s) for concurrencies up to 150, for 40 minutes (2400s) up to 400, and for 80 minutes (4800s) above that.

Complete results can be seen below:

| Concurrency | Succeeded (Rate) | Failed |
| 25 | 4.5k (3.7Hz) | 0 |
| 50 | 5.5k (4.6Hz) | 1 |
| 75 | 5.0k (4.3Hz) | 0 |
| 100 | 4.5k (3.9Hz) | 0 |
| 150 | 4.0k (3.3Hz) | 0 |
| 200 | 6.7k (2.8Hz) | 2 |
| 300 | 6.3k (2.6Hz) | 0.4k |
| 400 | 5.1k (2.1Hz) | 1.0k |
| 600 | 6.3k (1.3Hz) | 1.7k |
| 800 | 2.5k (0.5Hz) | 76k |
| 1000 | 0 | ALL |

The system performed slightly worse than before, hitting the wall at 1k. But it did recover by itself once the jobs stopped.

FNAL lcg-cp tests

The tests reported in this section were performed against that instance, with a glideTester instance running on FNAL sleeper pool resources.

All glideTester jobs were running lcg-cp in a tight loop for a specified amount of time, copying a 10Mbyte file from the SE to the local disk.

 lcg-cp -b -D srmv2 $mytestdir/igors_file.dat file:$PWD/igors_file.dat 
Note: lcg-cp returns 0 even when it fails!
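
Since the exit code cannot be trusted, the per-job failure counts presumably had to come from an explicit check on the copied file. The actual wrapper script is not recorded on this page; the sketch below is only an illustration of one way this could have been done, and mytestdir, duration and the expected file size are assumptions.

 #!/bin/bash
 # Minimal sketch of an lcg-cp glideTester job (assumed wrapper, not the original script).
 # Because lcg-cp may return 0 even on failure, success is judged by inspecting the copied file.
 # mytestdir, duration and expected_size are hypothetical/assumed values.
 mytestdir=${mytestdir:?need SRM URL of the test directory}
 duration=${duration:-2400}
 expected_size=10000000           # assumed byte size of the ~10Mbyte test file
 end=$(( $(date +%s) + duration ))
 ok=0; bad=0
 while [ "$(date +%s)" -lt "$end" ]; do
     rm -f "$PWD/igors_file.dat"
     lcg-cp -b -D srmv2 "$mytestdir/igors_file.dat" "file:$PWD/igors_file.dat"
     size=$(stat -c %s "$PWD/igors_file.dat" 2>/dev/null || echo 0)
     if [ "$size" -ge "$expected_size" ]; then
         ok=$(( ok + 1 ))
     else
         bad=$(( bad + 1 ))
     fi
 done
 echo "succeeded=$ok failed=$bad"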

Run 1

This run was performed against dCache v1.9.5-19 with Chimera. The following parameters were tuned (the same tuning as in lcg-ls Run 3 above):

  1. modify max_connections from default 100 to 250 in postgresql.conf of srm database
  2. in dCacheSetup, make sure:
    srmAsynchronousLs=true
  3. in /opt/d-cache/libexec/apache-tomcat-5.5.20/conf/server.xml,
    find element: <Connector className="org.globus.tomcat.coyote.net.HTTPSConnector"
    set these parameters:
  • maxThreads="1000"
  • minSpareThreads="25"
  • maxSpareThreads="200"
  • maxProcessors="1000"
  • minProcessors="25"
  • maxSpareProcessors="200"
  • enableLookups="false"
  • disableUploadTimeout="true"
  • acceptCount="1024"

The glideTester jobs were configured to run for 40 minutes (2400s).

Complete results can be seen below:

| Concurrency | Succeeded (rate) | Failed |
| 10 | 1933 (0.8Hz) | 0 |
| 50 | 5372 (2.2Hz) | 0 |
| 100 | 4810 (2Hz) | 0 |
| 150 | 4090 (1.7Hz) | 39 |
| 200 | 3403 (1.4Hz) | 398* |
| 300 | 3442 (1.4Hz) | 642 |
| 400 | 3404 (1.4Hz) | 421 |
| 600 | 3946 (1.6Hz) | 767* |
| 800 | 2287 (1Hz) | 3072^ |
| 1000 | 0 | all failed |

* - Two jobs got stuck and did not finish for over 1 hour and had to be hard killed.
^ - Twenty jobs got stuck and did not finish for over 1 hour and had to be hard killed.

The tested system peaked at 50 clients, delivering files at 2.2Hz, or about 200Mbit/s, and then declined to ~1.4Hz.
The first errors appeared with 150 clients, but remained bearable up to about 600 clients.
With 800 clients, more than half of all attempts failed, while with 1000 clients all attempts failed.
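
For reference, the quoted throughput appears to follow directly from the file size and transfer rate: 2.2 files/s × 10 Mbyte/file × 8 bits/byte ≈ 176 Mbit/s, which the text rounds to roughly 200Mbit/s.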

Run 2

-- IgorSfiligoi - 2010/06/17

This test used postgres and kpwd, with the following parameters:

  • max_connections - 250
  • shared_buffers - 512MB
  • work_mem - 16MB
  • max_fsm_pages - 1000000

The glideTester jobs were configured to run for 40 minutes (2400s).

Complete results can be seen below:

| Concurrency | Succeeded (rate) | Failed |
| 50 | 5353 (2.2Hz) | 0 |
| 100 | 4678 (1.9Hz) | 0 |
| 200 | 3472 (1.4Hz) | 320 + 1 hung client |
| 400 | 391 (0.2Hz) | 29198 |
| 600 | 0 | 24000 |

The tested system peaked at 2.2Hz (about 200Mbit/s) with 50 clients, and then rapidly deteriorated.

By 400 clients the system was practically unusable.

Run 3

-- IgorSfiligoi - 2010/06/17

This test was similar to Run 2, but using GUMS.

The glideTester jobs were configured to run for 40 minutes (2400s).

Complete results can be seen below:

| Concurrency | Succeeded (rate) | Failed |
| 10 | 1921 (0.8Hz) | 1 |
| 50 | 5399 (2.2Hz) | 1 |
| 100 | 4694 (2.0Hz) | 1 |
| 200 | 113 (0.0Hz) | 4790 + 12 hung |

As in Run 2, the tested system peaked at 2.2Hz (about 200Mbit/s) with 50 clients, and then rapidly deteriorated.

The deterioration rate was much faster, though. At 200 clients the system was already unusable.

Run 4

-- IgorSfiligoi - 2010/12/22

This test was similar to Run 3, but using XACML GUMS and dcache version 1.9.5-23.

The glideTester jobs were configured to run for 40 minutes (2400s).

Complete results can be seen below:

| Concurrency | Succeeded (rate) | Failed |
| 25 | 4.3k (1.8Hz) | 0 |
| 50 | 5.2k (2.2Hz) | 0 |
| 75 | 5.0k (2.1Hz) | 0 |
| 100 | 4.9k (2.0Hz) | 0 |
| 125 | 4.7k (2.0Hz) | 0 |
| 150 | 4.5k (1.9Hz) | 15 |
| 175 | 4.1k (1.7Hz) | 250 |
| 200 | 3.8k (1.6Hz) | 520 |
| 250 | 3.4k (1.4Hz) | 720 |
| 300 | 3.3k (1.4Hz) | 860 |
| 350 | 3.4k (1.4Hz) | 840 |
| 400 | 3.0k (1.3Hz) | 1200 |
| 450 | 3.3k (1.4Hz) | 530 |
| 500 | 3.2k (1.3Hz) | 820 |
| 550 | 1.6k (0.7Hz) | 3.2k + 26 hung |
| 600 | 0 | all |
| 650 | 0 | all |

As in Run 3, the tested system peaked at 2.2Hz (about 200Mbit/s) with 50 clients. But the deterioration was much slower; while errors started to appear around the 150 mark, the system was still usable (with retries) up to about 500 concurrent clients.

Run 5

-- IgorSfiligoi - 2011/04/20

This test was similar to Run 4, except that the OS was upgraded to SL5 and the dCache version was dcache-server-1.9.5-25. The clients were running on the UCSD sleeper pool.

The glideTester jobs were configured to run for 20 minutes (1200s) for concurrencies up to 150, for 40 minutes (2400s) up to 400, and for 80 minutes (4800s) above that.

Complete results can be seen below:

| Concurrency | Succeeded (Rate) | Failed |
| 25 | 1.7k (1.4Hz) | 0 |
| 50 | 2.4k (2.0Hz) | 1 |
| 75 | 2.4k (2.0Hz) | 0 |
| 100 | 2.4k (2.0Hz) | 0 |
| 150 | 2.3k (1.9Hz) | 33 |
| 200 | 3.6k (1.5Hz) | 712 |
| 300 | 3.4k (1.4Hz) | 907 |
| 400 | 3.3k (1.4Hz) | 1.2k |
| 600 | 5.6k (1.2Hz) | 2.8k |
| 800 | 6.1k (1.3Hz) | 4.6k |
| 1000 | 0.9k | 5.6M |
| 1200 | 0 | 6.9M |

The tested system performs similarly to the previous test, although it is marginally better. The concurrency limit seems to have improved to about the 800 mark.

The server hung up during the 1.2k run, and had to be manually restarted.

Run 6

-- IgorSfiligoi - 2011/05/23

This test was similar to Run 5, except that the dCache version was now dcache-server-1.9.5-26. The clients were running on the UCSD sleeper pool.

The glideTester jobs were configured to run for 20 minutes (1200s) for concurrencies up to 150, and for 80 minutes (4800s) above that.

Complete results can be seen below:

| Concurrency | Succeeded (Rate) | Failed |
| 25 | 2.0k (1.7Hz) | 0 |
| 50 | 2.5k (2.1Hz) | 1 |
| 75 | 2.4k (2.0Hz) | 0 |
| 100 | 2.3k (1.9Hz) | 0 |
| 150 | 2.1k (1.8Hz) | 43 |
| 200 | 6.7k (1.4Hz) | 1.0k |
| 300 | 5.9k (1.2Hz) | 2.3k |
| 400 | 4.9k (1.0Hz) | 6.7k |
| 600 | 2.6k (0.5Hz) | 28k |

The system performed slightly worse than before, already degrading severely at 600 clients. But it never hung.
