General Hadoop Documentation and Links
Useful links:
Hadoop IO Tests
Hadoop IO Test, Read Single File, Write Many
Test on May 19, 2009, in which glidein jobs running at various OSG sites read a single file from the UCSD Hadoop SE and write a unique file back to the UCSD Hadoop SE. The 125 MB/s background comes from the LoadTest transfers from Caltech's dCache to UCSD's Hadoop SE.
Technical Summary of Test on May 19
3000 jobs were sent to glideinWMS. The test went well for the first 3 hours, with active transfers from 5 Tier-2 sites and a peak of 700 running jobs. Later, two datanodes were lost and ~700 active transfer jobs sat in the queue doing nothing, most likely stuck in gridftp transfers. MIT enquired again about what was happening. It looks like once a datanode is lost (possibly a memory problem due to too many transfers), gridftp continues to accept transfer requests from Bestman.
To-do-list:
- Figure out the cause of the dead datanodes and resolve the memory or stream limit issue in gridftp. The constant LoadTest shows that the datanodes and gridftp are quite stable under normal conditions, so the problem must come from too many SRM transfer streams or something related.
- Give Bestman better control in assigning srmcp transfers to functioning gridftp servers. The problem might get worse if more gridftp servers are added to Hadoop under the current settings.
The test will resume after both items above are solved.
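The second to-do item, handing transfers only to functioning gridftp servers, could be sketched roughly as below. This is a minimal illustration and not the scheme actually deployed at UCSD or anything built into Bestman: the function names `door_is_alive` and `pick_doors` are hypothetical, the hostnames in the usage note are placeholders, and the health check is assumed to be a plain TCP connection to the standard GridFTP control port (2811), which says nothing about memory pressure on the server.

```python
import socket
from itertools import cycle


def door_is_alive(host, port=2811, timeout=2.0):
    """Assumed health check: can we open a TCP connection to the
    GridFTP control port? (Hypothetical helper, not a Bestman API.)"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_doors(doors, n_transfers, is_alive=door_is_alive):
    """Assign n_transfers to gridftp doors round-robin, skipping any
    door that fails the health check at selection time."""
    healthy = [d for d in doors if is_alive(d)]
    if not healthy:
        raise RuntimeError("no functioning gridftp door available")
    rr = cycle(healthy)
    return [next(rr) for _ in range(n_transfers)]
```

For example, `pick_doors(["gftp-1.example.org", "gftp-2.example.org"], 10)` would spread ten transfers over whichever of the two placeholder hosts currently accept connections. Passing a custom `is_alive` callable makes it possible to plug in a stricter check (for instance one that also looks at the number of active streams) without changing the selection logic.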
- Network traffic into the UCSD T2 on May 19, 2009:
- Network traffic out of the UCSD T2 on May 19, 2009:
Hadoop LoadTest Transfers in PhEDEx
- LoadTest transfers were established between Caltech and UCSD in both directions, from dCache to Hadoop storage systems.
- Tests with srmcp had problems: they were limited to approximately 25 MB/s and
- The LoadTest from Caltech to UCSD was converted to FTS transfers; no problems were found, and steady transfers at about 125 MB/s are observed. Transfer failures (about 50% of total traffic) need to be investigated. At first glance, most of the errors seem to come from Bestman trying to write to the dead pools. Once stable operation returns, we will check what the failure rate is with a fully functional Hadoop (currently two pools are off).
- Successful PhEDEx transfer writes to UCSD T2. Only the writes from Caltech go to the Hadoop SE:
Note that times in PhEDEx plots are in UTC, while other networking plots are in local time (PDT).
- Reads from the UCSD Hadoop SE to Caltech dCache also have an error rate of about 50%.
Improvement of Transfer Quality After gridftp Changes on May 27, 2009
The improvement in transfer quality in the LoadTest from Caltech to UCSD coincided with the new scheme for selecting gridftp doors in Hadoop at UCSD.
- LoadTest transfer quality from Caltech to UCSD:
-- JamesLetts - 2009/05/20
Topic revision: r5 - 2009/05/27 - 19:11:53 - JamesLetts