General Hadoop Documentation and Links

Useful links:

Hadoop IO Tests

Hadoop IO Test, Read Single File, Write Many

Test on May 19, 2009, in which glidein jobs running at various OSG sites read a single file from the UCSD Hadoop SE and each write a unique file back to the UCSD Hadoop SE. The 125 MB/s background is from the LoadTest transfers from Caltech's dCache to UCSD's Hadoop SE.

Technical Summary of Test on May 19

3000 jobs were submitted to glideinWMS. The first 3 hours went well, with active transfers from 5 Tier-2 sites and a peak of 700 running jobs. Later, two datanodes were lost, leaving ~700 active transfer jobs in the queue doing nothing, most likely stuck in gridftp transfers. MIT again enquired about what was happening. It looks like once a datanode is lost (possibly a memory problem due to too many transfers), gridftp continues to accept transfer requests from Bestman.

To-do-list:

  1. Figure out the cause of the dead datanodes and resolve the memory or stream limit issue in gridftp. The constant LoadTest shows that the datanodes and gridftp are quite stable under normal conditions, so the problem must come from too many SRM transfer streams or something related.
  2. Give Bestman better control over assigning srmcp transfers to functioning gridftp servers. The problem might get worse if more gridftp servers are added to Hadoop under the current settings.

The test will resume after (1) and (2) are solved.
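To illustrate to-do item (2), here is a minimal Python sketch of assigning transfers only to doors that pass a health check. It is not Bestman's actual selection logic. The function name `assign_transfers`, the host names, and the `is_functional` check are all hypothetical. Since a failing gridftp server was seen to keep accepting requests, a plain port check would probably not be enough, so the check is left to the caller.

```python
from itertools import cycle, islice


def assign_transfers(doors, n_transfers, is_functional):
    """Spread n_transfers round-robin over the doors that pass is_functional.

    doors         -- list of gridftp door names (hypothetical hosts)
    is_functional -- caller-supplied check; what "functional" means
                     (e.g. local datanode alive) is an assumption
    """
    healthy = [door for door in doors if is_functional(door)]
    if not healthy:
        raise RuntimeError("no functioning gridftp door available")
    # Cycle over the healthy doors so no dead door receives a transfer.
    return list(islice(cycle(healthy), n_transfers))


# Illustrative host names only; not the real UCSD doors.
doors = ["gftp-1.example.org", "gftp-2.example.org", "gftp-3.example.org"]
dead = {"gftp-2.example.org"}
plan = assign_transfers(doors, 5, lambda door: door not in dead)
```

With one of three doors marked dead, the five transfers alternate between the two remaining doors.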

  • Network traffic into the UCSD T2 on May 19, 2009:
    io-in.jpg

  • Network traffic out of the UCSD T2 on May 19, 2009:
    io-out.jpg

Hadoop LoadTest Transfers in PhEDEx

  • LoadTest transfers were established between Caltech and UCSD in both directions, between the Caltech dCache and the UCSD Hadoop storage systems.
  • Tests with srmcp had problems: they were limited to approximately 25 MB/s and
  • The LoadTest from Caltech to UCSD was converted to FTS transfers. No problems were found, and steady transfers at about 125 MB/s are observed. Transfer failures (about 50% of total traffic) need to be investigated. At first glance, most of the errors seem to come from Bestman trying to write to the dead pools. Once stable operation returns, we will check what the failure rate is with a fully functional Hadoop (currently two pools are off).

  • Successful PhEDEx transfer writes to UCSD T2. Only writes from Caltech are to the Hadoop SE:
    phedex-in.jpg

Note that times in PhEDEx plots are in UTC, while other networking plots are in local time (PDT).
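To line up a PhEDEx plot with the networking plots, the UTC timestamps can be shifted to Pacific time. A minimal Python sketch follows. The helper name `utc_to_pacific` and the sample timestamp are illustrative, and the code assumes Python 3.9+ with time zone data installed.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Pacific time zone; in May this is PDT (UTC-7).
PACIFIC = ZoneInfo("America/Los_Angeles")


def utc_to_pacific(utc_naive):
    """Interpret a naive datetime as UTC and convert it to Pacific time."""
    return utc_naive.replace(tzinfo=timezone.utc).astimezone(PACIFIC)


# Illustrative timestamp on the day of the test, not read from a plot.
local = utc_to_pacific(datetime(2009, 5, 19, 12, 0))
```

Here 12:00 UTC on May 19, 2009 corresponds to 05:00 PDT on the same day.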

* Reads from the UCSD Hadoop SE to the Caltech dCache also show an error rate of about 50%.
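As a rough guide to reading these error rates, the sketch below computes a transfer quality figure, assuming quality means successful attempts divided by all attempts. The function name and the counts are illustrative, not measured values.

```python
def transfer_quality(successes, failures):
    """Return the fraction of successful attempts, or None if there were none.

    Assumes quality = successes / (successes + failures).
    """
    attempts = successes + failures
    if attempts == 0:
        return None
    return successes / attempts


# Illustrative counts only: equal successes and failures give 0.5,
# i.e. the roughly 50% error rate described above.
quality = transfer_quality(50, 50)
```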

Improvement of Transfer Quality After gridftp Changes on May 27, 2009

The improvement in transfer quality in the LoadTest from Caltech to UCSD coincided with the new scheme for selecting gridftp doors in Hadoop at UCSD.

  • LoadTest transfer quality from Caltech to UCSD:
    quality_caltech_ucsd.jpg

-- JamesLetts - 2009/05/20

Topic revision: r5 - 2009/05/27 - 19:11:53 - JamesLetts