GridFTP-LVS Tests (gfal-copy writing to /dev/null to test total throughput)

This page details scalability tests on the GridFTP server/LVS run over the course of November 2016 - January 2017.

Overview

Using a sleeper pool at Caltech T2, a variable number of scripts was submitted, each running gfal-copy to /dev/null with a 1-second pause between copies, to test the throughput and client-load limits and overall performance of the LVS and the GridFTP servers.

  • 1000, 2000, 3000, 4000 and 5000 instances of gfcTest.sh were submitted to the Condor queue
  • gfcTest.sh selects a file at random from fileList.txt and runs gfal-copy, writing to /dev/null
  • Upon completion of gfal-copy the script sleeps for 1 second and then executes gfal-copy again (total time: 60 minutes)
  • The total load on each individual gftp-x.t2.ucsd.edu server and the total number of active jobs in the condor_q were recorded every 30 seconds (see the monitoring sketch below)
  • The total throughput (the sum of all individual gftp-x loads) was calculated
  • The total throughput per job was also calculated
  • A fast approximation method split each batch's throughput into subintervals, identifying 4 distinct phases of the overall Condor queueing process
  • 4th-order approximation functions were fitted for both throughput and throughput/job, verifying the optimum number of jobs (~2100) as well as an average expected throughput (on the order of 29 Gbit/s)
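A minimal sketch of the 30-second sampling loop described above (an illustration only: the name monitor.sh, the gfcTest match pattern, and the active_jobs.csv output file are hypothetical; the per-server gftp-x loads were read from the Grafana dashboards linked below rather than from this script):

#!/bin/bash
# monitor.sh (hypothetical): record the active-job count every 30 seconds
while true
do
    # Count gfcTest jobs currently executing in the Condor queue
    active=$(condor_q -run | grep -c gfcTest)
    echo "$(date +%s),$active" >> active_jobs.csv
    sleep 30
done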

Data

1000 Jobs:
Throughput:
1000jTEST.csv http://condorflux.t2.ucsd.edu:3000/dashboard/db/caltech-network-tests?from=1484711637329&to=1484717359888
Number Jobs:
1000j_active.csv http://condorflux.t2.ucsd.edu:3000/dashboard/db/condor-metrics-test-001?from=1484711637329&to=1484717359888

2000 Jobs:
Throughput:
2000jTEST.csv http://condorflux.t2.ucsd.edu:3000/dashboard/db/caltech-network-tests?from=1484719120593&to=1484724632263
Number Jobs:
2000j_active.csv http://condorflux.t2.ucsd.edu:3000/dashboard/db/condor-metrics-test-001?from=1484719120593&to=1484724632263

3000 Jobs:
Throughput:
3000jTEST.csv http://condorflux.t2.ucsd.edu:3000/dashboard/db/caltech-network-tests?from=1484725398613&to=1484730470759
Number Jobs:
3000j_active.csv http://condorflux.t2.ucsd.edu:3000/dashboard/db/condor-metrics-test-001?from=1484725398613&to=1484730470759

4000 Jobs:
Throughput:
4000jTEST.csv http://condorflux.t2.ucsd.edu:3000/dashboard/db/caltech-network-tests?from=1484735972258&to=1484741688843
Number Jobs:
4000j_active.csv http://condorflux.t2.ucsd.edu:3000/dashboard/db/condor-metrics-test-001?from=1484735972258&to=1484741688843

5000 Jobs:
Throughput:
5000jTEST.csv http://condorflux.t2.ucsd.edu:3000/dashboard/db/caltech-network-tests?from=1484742602465&to=1484747749937
Number Jobs:
5000j_active.csv http://condorflux.t2.ucsd.edu:3000/dashboard/db/condor-metrics-test-001?from=1484742602465&to=1484747749937

Overall:
FIGURE 1:

Screen_Shot_2017-02-06_at_7.01.31_AM.png

http://condorflux.t2.ucsd.edu:3000/dashboard/db/caltech-network-tests?from=1484711447764&to=1484749117351


FIGURE 2:

Screen_Shot_2017-02-06_at_7.04.19_AM.png

http://condorflux.t2.ucsd.edu:3000/dashboard/db/condor-metrics-test-001?from=1484711447764&to=1484749117351

Analysis:

The total duration (minutes), number of data points, number of data points per subinterval, average total throughput, and maximum throughput for each batch submission are tabulated below (a reproduction sketch follows the table).


| # Jobs Submitted | Duration (Minutes) | # Data Points | # Data Points per Subinterval | Average Throughput | Maximum Throughput |
| 1000 | 94 | 191 | 47 | 22.945 Gbit/s | 32.934 Gbit/s |
| 2000 | 92 | 185 | 46 | 26.853 Gbit/s | 34.211 Gbit/s |
| 3000 | 84 | 169 | 42 | 28.384 Gbit/s | 37.190 Gbit/s |
| 4000 | 95 | 191 | 48 | 26.263 Gbit/s | 36.328 Gbit/s |
| 5000 | 85 | 172 | 43 | 23.334 Gbit/s | 37.076 Gbit/s |
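A minimal sketch (assuming a header line followed by two-column timestamp,Gbit/s rows; check the linked CSVs for the actual formatting) of how the average and maximum columns can be reproduced with awk:

# Average and maximum throughput from one batch's CSV
awk -F, 'NR > 1 { sum += $2; n++; if ($2 > max) max = $2 }
         END    { printf "avg %.3f Gbit/s, max %.3f Gbit/s (%d points)\n", sum/n, max, n }' 1000jTEST.csv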


Looking at the overall behavior of each batch's individual job submission, execution, and completion (Figure 1), an immediate problem arises: the individual jobs do not execute simultaneously, which would skew the overall averages. To compensate, each batch was split into four equal subintervals (a reproduction sketch follows the list):


First Subinterval: Jobs Submitting (Ramp Up)
Second Subinterval: All Jobs Active (Steady State)
Third Subinterval: All Jobs Active (Steady State)
Fourth Subinterval: Jobs Completing (Cool Down)
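The per-subinterval averages below can be reproduced along these lines (a sketch under the same two-column CSV assumption as above, splitting the data points into four equal quarters):

# Print the average throughput of each of the four equal subintervals
awk -F, 'NR > 1 { v[++n] = $2 }
         END    { q = int(n / 4)
                  for (s = 0; s < 4; s++) {
                      sum = 0
                      for (i = s*q + 1; i <= (s+1)*q; i++) sum += v[i]
                      printf "S%d: %.3f Gbit/s\n", s + 1, sum / q
                  }
                }' 1000jTEST.csv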

1000 Jobs:
| Subinterval | Avg Throughput (Gbit/s) | Avg # Jobs | Avg Throughput per Job (Gbit/s) |
| S1 | 24.48367459 | 564.05 | 0.04340692241 |
| S2 | 24.09186677 | 1000 | 0.02409186677 |
| S3 | 24.93353163 | 969.9 | 0.02570732202 |
| S4 | 18.4091187 | 269.6666667 | 0.06826620037 |

2000 Jobs:
| Subinterval | Avg Throughput (Gbit/s) | Avg # Jobs | Avg Throughput per Job (Gbit/s) |
| S1 | 28.22730083 | 1249.95 | 0.02258274398 |
| S2 | 28.13615692 | 2000 | 0.01406807846 |
| S3 | 29.75599496 | 1992.357143 | 0.01493507079 |
| S4 | 19.22643441 | 808.64 | 0.02377625941 |

3000 Jobs:
| Subinterval | Avg Throughput (Gbit/s) | Avg # Jobs | Avg Throughput per Job (Gbit/s) |
| S1 | 30.09010767 | 2244.444444 | 0.01340648362 |
| S2 | 27.03319889 | 2995.684211 | 0.009024048259 |
| S3 | 28.91324895 | 2884.052632 | 0.01002521543 |
| S4 | 27.38546522 | 918.1111111 | 0.02982805119 |

4000 Jobs:
| Subinterval | Avg Throughput (Gbit/s) | Avg # Jobs | Avg Throughput per Job (Gbit/s) |
| S1 | 29.99714949 | 2271.9 | 0.01320355187 |
| S2 | 23.13003247 | 3999.6 | 0.005783086426 |
| S3 | 23.98578401 | 3777.952381 | 0.006348884684 |
| S4 | 28.03923519 | 1256.952381 | 0.02230731698 |

5000 Jobs:
| Subinterval | Avg Throughput (Gbit/s) | Avg # Jobs | Avg Throughput per Job (Gbit/s) |
| S1 | 27.75642329 | 1726 | 0.01608135764 |
| S2 | 22.47816741 | 4707.315789 | 0.004775156038 |
| S3 | 17.62058298 | 5000 | 0.003524116595 |
| S4 | 26.09912804 | 4617.166667 | 0.005652628532 |


FIGURE 3: Each batch's subinterval average total throughput (Column 2) was graphed with respect to the average number of jobs (Column 3).
The plotted subinterval averages show a clear pattern with a maximum slightly left of center.
A function was fitted using Mathematica; the maximum average total throughput was found to be approximately 29 Gbit/s, occurring at approximately 2161 active jobs.

FIGURE 4: Each batch's subinterval average total throughput per job (Column 4) was graphed with respect to the average number of jobs (Column 3).
The plotted subinterval average throughput per job shows, unexpectedly, an exponentially decaying trend; even with outliers rejected, the trend still decreases.
The average throughput per job decreasing as the number of jobs increases is unexpected; it was expected to remain relatively constant. This will require further investigation.

Code

setup.sh

#!/bin/bash
# setup.sh: build a submit file from the blank.submit template and submit it
# $1 = total number of clients
# $2 = number of seconds to sleep after each gfal-copy
# $3 = number of minutes to run the script for before exit
echo "$1 $2 $3"
mkdir -p "out/output_${1}j_${2}s_${3}m"
# Substitute the %, * and ^ placeholders in the template
sed "s@%@$1@g; s@*@$2@g; s@\^@$3@g" blank.submit > "${1}j_${2}s_${3}m.submit"
condor_submit "${1}j_${2}s_${3}m.submit"

blank.submit

# blank.submit template: setup.sh replaces % with the number of clients,
# * with the sleep seconds, and ^ with the runtime in minutes
# Argument 1 is the number of seconds to sleep after each gfal-copy
# Argument 2 is the number of minutes this script will run for before quitting
executable = gfcTest.sh
error = out/output_%j_*s_^m/test-$(Cluster).$(Process).error
log = out/output_%j_*s_^m/test-$(Cluster).$(Process).log
output = out/output_%j_*s_^m/test-$(Cluster).$(Process).out
transfer_input_files = fileList.txt
RequestMemory = 1000
arguments = * ^
queue %
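After setup.sh runs for the 1000-job batch above, the placeholders are filled in and the generated 1000j_1s_60m.submit reads:

executable = gfcTest.sh
error = out/output_1000j_1s_60m/test-$(Cluster).$(Process).error
log = out/output_1000j_1s_60m/test-$(Cluster).$(Process).log
output = out/output_1000j_1s_60m/test-$(Cluster).$(Process).out
transfer_input_files = fileList.txt
RequestMemory = 1000
arguments = 1 60
queue 1000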

gfcTest.sh

#!/bin/bash
sleepTime=$1               # seconds to sleep after each gfal-copy
totalTime=$(( $2 * 60 ))   # minutes to execute the script for, converted to seconds
while [ "$SECONDS" -lt "$totalTime" ]
do
    # Pick a random file from the transferred file list
    file=$(sort -R fileList.txt | head -1)
    path="gsiftp://gftp.t2.ucsd.edu/hadoop${file}"
    gfal-copy -f -v "$path" file:/dev/null
    sleep "$sleepTime"
done
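The script can also be run standalone for a quick sanity check, e.g. with the 1-second sleep and 60-minute duration used in these tests:

./gfcTest.sh 1 60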

fileList.txt

/Path/To/File/test_1.file
/Path/To/File/test_2.file
...
/Path/To/File/test_n.file



Relevant Mathematica Commands

MathematicaNotebook.nb

Average Throughput Per Subinterval:

Importing Data: (Check linked CSV for import file formatting style)
data = Import["AverageThroughput.csv"]

{{269.667, 18.4091}, {564.05, 24.4837}, {808.64, 19.2264}, {918.111, 27.3855}, {969.9, 24.9335}, {1000, 24.0919}, {1249.95, 28.2273}, {1256.95, 28.0392}, {1726, 27.7564}, {1992.36, 29.756},
{2000, 28.1362}, {2244.44, 30.0901}, {2271.9, 29.9971}, {2884.05, 28.9132}, {2995.68, 27.0332}, {3777.95, 23.9858}, {3999.6, 23.13}, {4617.17, 26.0991}, {4707.32, 22.4782}, {5000, 17.6206}}

Fitting a 4th Order Approximation Function:
fitFunction = Fit[data, {1, x , x^2, x^3, x^4}, x]
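
Finding the Maximum of the Fit (this step is not reproduced from the notebook; NMaximize is one way, under that assumption, to recover the reported optimum of roughly 2161 active jobs at ~29 Gbit/s):
NMaximize[{fitFunction, 0 <= x <= 5000}, x]   (* maximize the fitted polynomial over the tested job range *)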

Creating the Plot:
AvgThroughput = Show[
  ListPlot[data, ImageSize -> Full, Frame -> True,
    FrameLabel -> {{"Average Throughput (Gbps)", " "}, {"Number of Active Jobs", "Average Throughput at t=[0.25,0.5,0.75,1.0]"}},
    PlotTheme -> "Detailed", PlotRangeClipping -> True,
    PlotLabel -> None, LabelStyle -> {GrayLevel[0], Bold}],
  Plot[fitFunction, {x, 0, 5000}, PlotLabels -> "Expressions"]]

Show the Plot:
Show[AvgThroughput]

Export the Plot to JPEG:

Export["~/AverageThroughput.jpg", AvgThroughput? , "JPEG"]

Average Throughput per Job per Subinterval:


Importing Data: (Check linked CSV for import file formatting style)

data = Import["AverageThroughputPerJob.csv"]

{{269.667, 0.0682662}, {564.05, 0.0434069}, {808.64, 0.0237763}, {918.111, 0.0298281}, {969.9, 0.0257073}, {1000, 0.0240919}, {1249.95, 0.0225827}, {1256.95, 0.0223073},
{1726, 0.0160814}, {1992.36, 0.0149351}, {2000, 0.0140681}, {2244.44, 0.0134065}, {2271.9, 0.0132036}, {2884.05, 0.0100252}, {2995.68, 0.00902405}, {3777.95, 0.00634888},
{3999.6, 0.00578309}, {4617.17, 0.00565263}, {4707.32, 0.00477516}, {5000, 0.00352412}}

Fitting a 4th Order Approximation Function:

fitFunction = Fit[data, {1, x , x^2, x^3, x^4}, x]

Creating the Plot:

AvgThroughputPerJob = Show[
  ListPlot[data, ImageSize -> Full, Frame -> True,
    FrameLabel -> {{"Average Throughput Per Job (Gbps)", " "}, {"Number of Active Jobs", "Average Throughput at t=[0.25,0.5,0.75,1.0]"}},
    PlotTheme -> "Detailed", PlotRangeClipping -> True,
    PlotLabel -> None, LabelStyle -> {GrayLevel[0], Bold}],
  Plot[fitFunction, {x, 0, 5000}, PlotLabels -> "Expressions"]]

Show the Plot:

Show[AvgThroughputPerJob]

Export the Plot to JPEG:

Export["~/AverageThroughputPerJob.jpg", AvgThroughputPerJob? , "JPEG"]

Initial Attempts (Archived)

This section archives the initial scalability tests on the GridFTP server/LVS, run over the course of November 2016.

Overview

To test the throughput and client load limits of the LVS and the GridFTP servers:

  • A variable number (range: 100-8000) of 30-45 minute jobs were submitted via condor_submit and placed into the condor_q(ueue)
  • Each job proceeds by selecting a file from a list at random, gfal-copying that file (writing it to /dev/null), and then sleeping for a variable amount of time (2-10 sec) before repeating the process, continuing for 30-45 minutes
  • The throughput (Gbps) of each individual gftp-X.t2.ucsd.edu server, as well as the number of active jobs, was recorded at 30-second intervals throughout each client's execution
  • The objective was to test the 6 GridFTP servers in the LVS setup and to see whether the system performs as expected
  • Open question: number of clients active when maximum throughput occurs
  • Open question: total throughput per client (Gbps/client)

Example Code

gfcTest.sh

#!/bin/bash
sleepTime=$1               # seconds to sleep after each gfal-copy
totalTime=$(( $2 * 60 ))   # minutes to execute the script for, converted to seconds
while [ "$SECONDS" -lt "$totalTime" ]
do
    # Pick a random file from the transferred file list
    file=$(sort -R fileList.txt | head -1)
    path="gsiftp://gftp.t2.ucsd.edu/hadoop${file}"
    gfal-copy -f -v "$path" file:/dev/null
    sleep "$sleepTime"
done

fileList.txt

/Path/To/File/test_1.file
/Path/To/File/test_2.file
...
/Path/To/File/test_n.file

100_30.submit

executable = gfcTest.sh
error = out/output_100_30/test-$(Cluster).$(Process).error
log = out/output_100_30/test-$(Cluster).$(Process).log
output = out/output_100_30/test-$(Cluster).$(Process).out
transfer_input_files = fileList.txt
RequestMemory = 1000
arguments = 10 30
queue 100

Data

10 second sleep time

Initially, the submitted jobs slept for 10 seconds between each execution of gfal-copy. The data was inconsistent and no clear correlations were found. The average total bandwidth at maximum throughput was 17.5 Gbps, with an average of 740 clients active, each having an individual bandwidth of ~0.032 Gbps.

2 second sleep time

The jobs that slept for 2 seconds between each instance of gfal-copy were much more consistent. The average total bandwidth at maximum throughput was ~17.5 Gbps. The number of clients active at maximum throughput stabilized considerably, remaining in the range of 900-1200 active jobs, with an average of 1011 active clients at maximum throughput, each having an individual bandwidth of ~0.017 Gbps.

Topic attachments:
| Attachment | Size | Date | Who |
| 1000jTEST.csv | 18.9 K | 2017/02/06 - 15:12 | CliftonPotter |
| 1000j_active.csv | 16.8 K | 2017/02/06 - 15:14 | CliftonPotter |
| 2000jTEST.csv | 17.9 K | 2017/02/06 - 15:14 | CliftonPotter |
| 2000j_active.csv | 8.1 K | 2017/02/06 - 15:14 | CliftonPotter |
| 3000jTEST.csv | 16.2 K | 2017/02/06 - 15:14 | CliftonPotter |
| 3000j_active.csv | 7.5 K | 2017/02/06 - 15:15 | CliftonPotter |
| 4000jTEST.csv | 18.5 K | 2017/02/06 - 15:14 | CliftonPotter |
| 4000j_active.csv | 8.4 K | 2017/02/06 - 15:15 | CliftonPotter |
| 5000jTEST.csv | 16.8 K | 2017/02/06 - 15:14 | CliftonPotter |
| 5000j_active.csv | 7.6 K | 2017/02/06 - 15:17 | CliftonPotter |
| AverageThroughput.csv | 0.4 K | 2017/02/06 - 17:10 | CliftonPotter |
| AverageThroughput.jpg | 52.1 K | 2017/02/06 - 16:22 | CliftonPotter |
| AverageThroughputPerJob.csv | 0.5 K | 2017/02/06 - 17:11 | CliftonPotter |
| ClientsActiveOutOfQueued_10secSleep.jpg | 117.9 K | 2016/12/14 - 20:06 | CliftonPotter |
| ClientsActiveOutOfQueued_10secSleep.pdf | 41.0 K | 2016/12/14 - 20:01 | CliftonPotter |
| ClientsActiveOutOfQueued_2secSleep.jpg | 118.8 K | 2016/12/14 - 20:06 | CliftonPotter |
| MathematicaNotebook.nb | 44.5 K | 2017/02/06 - 17:17 | CliftonPotter |
| MaxThroughputPerClient.pdf | 41.5 K | 2016/12/09 - 05:59 | CliftonPotter |
| MaxThruPerClientVsActiveClients_10secSleep.jpg | 102.7 K | 2016/12/14 - 20:06 | CliftonPotter |
| MaxThruPerClientVsActiveClients_10secSleep.pdf | 42.5 K | 2016/12/14 - 20:00 | CliftonPotter |
| MaxThruPerClientVsActiveClients_2secSleep.jpg | 99.7 K | 2016/12/14 - 20:06 | CliftonPotter |
| MaxThruPerClientVsActiveClients_2secSleep.pdf | 42.1 K | 2016/12/14 - 19:59 | CliftonPotter |
| MaxThruVsNumActiveClients_10secSleep.jpg | 96.0 K | 2016/12/14 - 20:07 | CliftonPotter |
| MaxThruVsNumActiveClients_10secSleep.pdf | 41.2 K | 2016/12/14 - 20:00 | CliftonPotter |
| MaxThruVsNumActiveClients_2secSleep.jpg | 92.5 K | 2016/12/14 - 20:07 | CliftonPotter |
| MaxThruVsNumActiveClients_2secSleep.pdf | 40.8 K | 2016/12/14 - 20:00 | CliftonPotter |
| TotalMaxThroughputPerGFTPNumClientsActive.pdf | 40.2 K | 2016/12/09 - 05:58 | CliftonPotter |
| averagethroughput.jpg | 52.1 K | 2017/02/06 - 16:25 | CliftonPotter |
| averagethroughputperjob.jpg | 54.1 K | 2017/02/06 - 16:26 | CliftonPotter |
