Measuring Data Transfer Latency
We are measuring data transfer latencies by requesting one dataset per day via the Prod instance of PhEDEx and tracking, until the end of CSA07, how many of them arrive within a reasonable time. We pick datasets of around 1 TB and aim to delete each one about a day later, assuming its transfer completes within a day.
The table below shows, for each dataset, the request time (PDT), the sites where it is available, its size, and its latency (time to completion of the transfer to our SE).
PhEDEx hourly plots for each dataset are linked at the bottom of the page.
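For a sense of scale, the target described above (a dataset of roughly 1 TB arriving within a day) corresponds to a fairly modest sustained rate. The following is a minimal sketch of that arithmetic, assuming decimal units (1 TB = 10^6 MB); the function name is ours and is not part of PhEDEx.

```python
def required_rate_mb_per_s(size_tb, hours):
    """Average rate in MB/s needed to move size_tb terabytes within the given hours.

    Assumes decimal units: 1 TB = 1,000,000 MB.
    """
    size_mb = size_tb * 1_000_000
    return size_mb / (hours * 3600)


# About 11.6 MB/s sustained is enough to move 1 TB in 24 hours.
rate_1tb_per_day = required_rate_mb_per_s(1.0, 24)
print(round(rate_1tb_per_day, 1))
```

A latency well beyond a day for a dataset of this size therefore points to stalls or missing files rather than to raw bandwidth limits.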
| Request time (PDT) | Available at |
| 03 October 21:35 | FNAL, IN2P3* |
| 04 October 07:35 | CERN*, FNAL |
| 04 October 12:40 | CERN*, FNAL |
| 04 October 19:09 | |
| 05 October 04:42 | |
| 05 October 13:30 | FNAL, FZK*, Caltech, Purdue |
| 05 October 22:20 | ASGC, CERN (near end) |
| 07 October 20:15 | |
| 07 October 21:12 | CERN, FNAL |
| 10 October 08:49 | CERN, FNAL |
| 10 October 17:43 | CERN, FNAL |
| 11 October 08:56 | CERN, FNAL |
| 11 October 22:06 | CERN, FZK* |
| 12 October 21:23 | CERN, FNAL |
| 17 October 00:02 | |
| 17 October 01:31 | |
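Latency here is the time from the request (recorded in PDT) to the completion of the transfer, while several completion times in the notes are given in UTC, so the two have to be put on a common clock before subtracting. The sketch below shows one way to do that with the Python standard library. The helper names are ours, and pairing the 17 October 01:31 PDT request with the first file arriving at 2007-10-17 09:43:25 PDT is our assumption about which table row that note refers to.

```python
from datetime import datetime, timedelta, timezone

# Pacific Daylight Time is UTC-7 (in effect during October 2007).
PDT = timezone(timedelta(hours=-7), "PDT")
ZONES = {"PDT": PDT, "UTC": timezone.utc}


def parse(timestamp, zone):
    """Parse 'YYYY-MM-DD HH:MM:SS' and attach the named zone ('PDT' or 'UTC')."""
    naive = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return naive.replace(tzinfo=ZONES[zone])


def latency_hours(requested, completed):
    """Elapsed hours between two timezone-aware datetimes."""
    return (completed - requested).total_seconds() / 3600


# Assumed pairing: request at 01:31 PDT, first file at 09:43:25 PDT on 17 October.
requested = parse("2007-10-17 01:31:00", "PDT")
first_file = parse("2007-10-17 09:43:25", "PDT")
hours_to_first_file = latency_hours(requested, first_file)

# The same arrival time expressed in UTC gives the same latency.
first_file_utc = parse("2007-10-17 16:43:25", "UTC")
hours_via_utc = latency_hours(requested, first_file_utc)

print(round(hours_to_first_file, 2))
```

Under that assumption the result is about 8.2 hours, consistent with the "over 8h" quoted in the notes.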
- 04 Oct 2007
- It took several hours for the last few files to arrive.
- /PhotonJets_Pt_50_80/CMSSW_1_5_2-CSA07-1186411740 and /PhotonJets_Pt_15_20/CMSSW_1_5_2-CSA07-2164
- /PhotonJets_Pt_20_30/CMSSW_1_5_2-CSA07-2177 is the left-most spike. The small one in the middle is a relval sample, and the right-most is the beginning of BB2MuMu below.
- 49 data files were lost from a pool that went permanently offline.
- Triggered a dataset DELETION with the PhEDEx web page at 13:30 PDT
- Files started retransferring from FNAL at 14:49 PDT; however, FNAL does not have a complete dataset. Only Purdue and Caltech do, and both also have commissioned links to UCSD.
- We received 83 files in the first 30 minutes of transfers and then nothing since 15:16 PDT.
- Made "High Priority" at 21:27 PDT.
- All transfer failures are from sample /QCD_Pt_800_1000/CMSSW_1_5_2-CSA07-2028/GEN-SIM-DIGI-RECO, so suspending it. It sits in the subscription list ahead of this sample. Is that the problem? Is it just blocking what's below it?
- ASGC shut down because of a typhoon.
- Typhoon over, restarted ASGC download agent 19:10 PDT 07 October 2007
- ASGC has CASTOR problems
- ASGC is now affected by an FTS bug. Four files left to transfer. Files are also at CERN now. 12 October 2007.
- We eventually got the last files from CERN on 13 October 2007.
- ASGC is still having FTS issues. Our link is in danger of DECOMMISSIONING because of extremely poor performance due to problems at ASGC (CASTOR, FTS).
- The "yellow" 6-hour spike starting at 03:00 GMT belongs to this dataset.
- The transfers from CERN soon thereafter are datasets that we requested a long time ago, before starting this exercise. They started moving only now because our link to CERN was commissioned and enabled only now.
- The data from ASGC visible here is the /Z1jet_0ptw100-alpgen/CMSSW_1_4_6-CSA07-2139 dataset, which is still not complete at this point.
- It should be noted that something weird happened with this dataset. It was fully transferred in the evening, and then part of it magically disappeared. Since then, we have had trouble getting it completely moved again. As of 10-09-07 8:39am only 1.4TB out of 1.8TB is present at UCSD.
- The yellow spike is the transfer of this dataset. The "grass" in different colors is various problem datasets that are still coming in, as well as the dataset from ASGC listed above.
- The last high spike is /PhotonJets_Pt_Pt_170_300/CMSSW_1_5_2-CSA07-2053, while the dribble over ~12 hours is /W3jet_0ptw100-alpgen/CMSSW_1_5_2-CSA07-2223.
- The transfer quality from FNAL seems significantly worse during the period with the dribble than during the previous two peaks.
- Most of this was transferred very fast from CERN. However, 5 files were still missing on 10/13/07 19:48. It is not clear that these remaining files even exist, given that the CERN buffer has had two more files than CERN MSS for a long time.
- Five files are still missing as of 15 October 2007 05:47 PDT. The total number of files on site at UCSD is 3311.
- Four of the missing files were routed to UCSD from CERN on 15 October after I asked about them, and were transferred by 2007-10-15 15:15:02 UTC. The fifth file does exist in T1_CERN_Buffer but not in T1_CERN_MSS. This is the explanation from Douglas:
Those 5 files are routed to UCSD. They should move to there soon. I guess,
it is probably due to the massive (and not well designed) T0 usage of the
t0export for processing and copying files from pools to pools done by the
T0 team, causing the exporting of those been delayed (it completely block
CERN exports this weekend for instance).
We don't request files to be staged on disk if the LSF queue is bigger than
3K pending requests. CASTOR team suggestion.
- Asked again about the last file, and Douglas told me it was routed and was transferred by 2007-10-16 12:55:38 UTC.
- This is the little red blip above.
- 131 files downloaded to UCSD by 2007-10-13 05:11:41 UTC
- Sample requested by Purdue from CERN, but Purdue does not have a commissioned link from CERN. This is another peer-to-peer experiment in relayed data transfer. We will record how long it takes both to download from CERN and to pass the data on to Purdue.
- First file downloaded to UCSD at 2007-10-17 09:43:25 PDT, over 8h after the dataset was requested. File is also copied to Purdue already (before 11:13am).
- All files downloaded and transferred to Purdue after about 3 days (91h).
- The transfer of this dataset was very varied. We believe this is in part due to the CSA07 processing and skimming activities which put a lot of load on both CERN and FNAL.
- In addition, we got stuck on October 21st with 5 files left to transfer. All 5 supposedly existed at CERN and FNAL.
- The detailed plot shows clearly that nothing moved despite the 5 missing files.
- By October 24th, only one of the 5 files had moved to UCSD, i.e. 4 files were still missing. At this point we are asking PhEDEx ops for help.
- The 4 files are lost. The whole dataset existed only on FNAL_buffer. Yujun says that files on FNAL_buffer are automatically saved to tape. Not clear why these 4 files got missed.
- Yujun is going to request that these 4 files be deleted from TMDB as well as DBS.
History of Commissioned Downlinks to UCSD (beginning 04 October 2007)
- 04 Oct 2007 - Commissioned downlinks are available from T1_ASGC and T1_FNAL (as well as from T2 sites at T2_Caltech, T2_Purdue and T2_Estonia)
- 08 Oct 2007 - Link from CERN to UCSD enabled in Prod.
- 16 Oct 2007 - Link from ASGC to UCSD disabled in Prod. ASGC has been affected by debilitating FTS problems since upgrading their FTS server to FTS 2.0 several days ago, and is in danger of decommissioning almost all of their downlinks from ASGC to T2 sites and links from several T1 sites as well.
- New entries will be added here as links are activated or deactivated.
Plot of Latency vs Data Set Size for Quickly Transferred Data Sets
History of some problem datasets
- available only at RAL as of 10/09/07. No wonder that we have not seen any of it at UCSD for weeks.
- 10/11/07 22:29 requested subscription for both FNAL and CERN. FNAL accepted. CERN hasn't yet. 41 files have by now made it to both FNAL and UCSD.
- 10/12/07 21:27 one day later, and still only 121 files made it to UCSD (141 to FNAL). It seems that data movement from RAL to FNAL is very slow!
- As of 10/09/07, this dataset has had 2 files missing at UCSD for a long time. The same is true for FNAL and a lot of other sites. CERN_Buffer has the complete dataset. How can it be that the 2 missing files aren't moving from CERN to FNAL? Or from CERN to UCSD, for that matter?
- Was told that the two files are on a tape at CERN that is inaccessible. Am told that this has been filed as a CASTOR bug. Am told that these are not the only "lost files" at CERN. Am told that "Castor has a habit of losing files".
- As of 10/09/07 We got 170 files. Nobody else in the world has more. Why isn't this considered complete?
- Turns out that 59 files were stuck in FZK. CERN had no subscription for them, and neither RAL nor us had a link to FZK in production. Accordingly, neither of us could get the 59 files to complete our subscription.
- fkw requested the data for both CERN and FNAL. The data stuck at FZK moved to FNAL, and we completed our subscription, apparently by getting the remaining files from FNAL.
- Interestingly, FNAL got the FZK data quickly, but hasn't gotten the CERN data yet. As of now, CERN and RAL have 170 files, FZK and FNAL have 59, and we have all 170+59.
- As of 10/09/07 we got 173 files. CERN has 185. CNAF has the full set of 192 files. How come this isn't moving from CNAF to CERN?
- CNAF's uplink to CERN wasn't commissioned. Once CNAF's link was back in production the data moved to CERN, and from there to UCSD.
- 09 Oct 2007
- 05 Oct 2007
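Several of the stuck subscriptions above share one cause: some files were held only at a site that had no commissioned link to the destination, so the subscription could not complete until another linked site obtained them. A minimal sketch of that check follows, using plain sets. The file names are hypothetical and the set of commissioned sources is an assumption for illustration; only the counts (170 and 59) come from the notes.

```python
def reachable_files(holdings, commissioned_sources):
    """Union of the files held at sites with a commissioned link to the destination."""
    reachable = set()
    for site, files in holdings.items():
        if site in commissioned_sources:
            reachable |= files
    return reachable


def missing_files(dataset, holdings, commissioned_sources):
    """Files of the dataset that no commissioned source can currently supply."""
    return dataset - reachable_files(holdings, commissioned_sources)


# Hypothetical file names; only the counts 170 and 59 are taken from the notes.
part_a = {f"a_{i:03d}" for i in range(170)}
part_b = {f"b_{i:03d}" for i in range(59)}
dataset = part_a | part_b

links_to_ucsd = {"CERN", "FNAL"}  # assumed commissioned sources, for illustration

# Before: the 59 files sit only at FZK, which has no link to the destination.
before = {"CERN": part_a, "FZK": part_b, "FNAL": set()}
# After: FNAL has subscribed and pulled the FZK files, so they become reachable.
after = {"CERN": part_a, "FZK": part_b, "FNAL": part_b}

missing_before = missing_files(dataset, before, links_to_ucsd)
missing_after = missing_files(dataset, after, links_to_ucsd)
print(len(missing_before), len(missing_after))
```

Running such a check against the replica locations for a dataset would flag this kind of stall up front, instead of waiting days for files that cannot be routed.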
Topic revision: r39 - 2007/10/25 - 18:56:11 - JamesLetts