This page is to keep track of Haifeng's accomplishments in computing on a weekly basis.

May 2007

May 18th 2007

  • Reached sustained glide-in operations of ~700 across 6 US T2s plus TTU, UCR, Vanderbilt (?)
    • Using 4 nodes to do this.
      • one for frontend (submit-1)
      • one for condor collector and negotiator (submit-2)
      • two old desktops for one GCB each

May 7th 2007

  • Validated that the new 32-node cluster of 8-way machines is fully functional for CMS, and has srmcp installed and working.
  • Hardware planning for large scale system:
I plan to install GCB locally, but I need to locate a dedicated machine
first.

I expect to need to run ~1000-1500 jobs simultaneously over
the grid. To my knowledge, the minimum hardware needs are:

1. two frontend machines (assuming one CPU per machine, each handling
500-1000 jobs)
2. two service machines, each running a collector + factory
3. two GCB servers

So the minimum need is 6 CPUs to reach this scale. Please
comment on whether this architecture is reasonable. I put the collector and
factory on one machine; does this seriously hurt scalability? Do
we need to use different machines to run the collector and the factory?
  • Presently have available for use:
    • submit1 and submit2, both dual-Xeon machines located at SDSC.
    • 3 desktop machines of varying ages
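
The frontend count in the plan above can be sanity-checked with a little arithmetic. The following is a minimal sketch in Python, not part of the original plan: the function name `frontends_needed` is hypothetical, and the only inputs are the ranges quoted above (1000-1500 simultaneous jobs, 500-1000 jobs per one-CPU frontend).

```python
import math


def frontends_needed(target_jobs: int, jobs_per_frontend: int) -> int:
    """Smallest number of frontend machines that covers target_jobs."""
    return math.ceil(target_jobs / jobs_per_frontend)


# Ranges quoted in the hardware plan; both are estimates, not measurements.
TARGET_JOBS = (1000, 1500)        # simultaneous jobs expected over the grid
JOBS_PER_FRONTEND = (500, 1000)   # assumed capacity of a one-CPU frontend

for target in TARGET_JOBS:
    for capacity in JOBS_PER_FRONTEND:
        n = frontends_needed(target, capacity)
        print(f"{target} jobs at {capacity} jobs/frontend -> {n} frontend(s)")
```

Under these assumptions two frontends are enough in every case except the upper target (1500 jobs) combined with the lower capacity estimate (500 jobs per frontend), which would call for three.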

April 2007

April 30th 2007

(2.1) scalability of glideinWMS

  • You managed to run 400 jobs simultaneously using glidein.
  • This is still a factor of two away from what Igor says the system should easily be able to do.

On Friday we decided that you would document your experience and send it to Igor for comment. The goal here is clearly to get an installation up and running that performs as well as advertised by Igor. I got several emails from Igor today, indicating he's back from vacation.

It's my understanding that he's in Wisconsin this week at Condor Week, and can help you with any issues you have.

(2.2) Understand the availability of opportunistic resources across all sites that have cmssw installed on OSG

  • You were going to submit a small test job today to all sites on OSG that have cmssw installed. This job will mimic the workflow of production, and allow an assessment of both how many resources Haifeng can realistically expect and the quality of the sites.
  • This submission was going to be done using the glideinWMS. The assumption was that we'd be happy if we can get 400 jobs to run at all times for a period of several days, and wouldn't try to shoot for more at this point.
  • You were going to look into dashboard reporting for these short jobs. I just checked the dashboard, and don't see any jobs of yours.
http://lxarda09.cern.ch/dashboard/request.py/jobsummary?user=HaifengPi&site=&ce=&submissiontool=&dataset=&application=&rb=&activity=&grid=&status=unknown&status=pending&status=running&status=terminated&status=done&status=cancelled&status=aborted&status=gridunknown&status=success&status=failed&status=appunknown&status=donesuccess&date1=2007-04-30+06%3A46%3A42&date2=2007-05-01+06%3A46%3A42&sortby=activity&nbars=

April 23rd 2007

Haifeng and fkw sat down this morning to discuss Haifeng's program of work in computing. Here's the summary.

There are three major components:

(1) ProdAgent development towards full support of Alpgen as part of the standard MC workflow. Haifeng points out that he proposed a strategy for how to deal with this, and was told by Elmer and Evans that this is not on the critical path right now, and will be picked up again after the DBS2 migration.

Haifeng and fkw agreed that it is in fkw's court to find out what's going on here, and in what form contributions from Haifeng are still desired here by Evans & Elmer.

There are thus no agreed-upon deliverables here until fkw has done his part to negotiate them.

(2) MC production on opportunistic resources in WLCG, i.e. OSG as well as LCG. Haifeng is concerned that there isn't enough spare capacity, and that he'd be competing with Wisconsin. He is concerned that Wisconsin has preferred CPU access, and that he himself will not be able to compete.

fkw explained that the issue is to show whether throughput on OSG (and LCG for that matter) is limited by human effort or by available resources. The present claim is that it's limited by human effort. If that is true, then better computing infrastructure, e.g. streamlined processes for doing the production, more fault tolerance, etc., would increase productivity. The goal of Haifeng's involvement would not be to "compete" with Wisconsin but to show a path towards more streamlined processes.

The goal here is thus to work with the glideinWMS software, and see if this, as well as other improvements, can improve productivity.

*Deliverables and milestones in this part.*
These will be redefined for now on a weekly basis as we move forwards, and get a better understanding of what's involved.

(2.1) Understand scalability of the glideinWMS installation as is.

  • fkw organized access to test clusters at FNAL and Caltech. This was done this morning. Haifeng thus has access to large clusters that he can use for submitting sleep jobs.
  • Haifeng will report on this at our Friday meeting, and follow up with fkw whenever Haifeng gets stuck on some technical issue.

(2.2) Understand the amount of available resources for opportunistic production

  • Haifeng will submit a small test job to all sites on OSG that have cmssw installed. This job will mimic the workflow of production, and allow an assessment of both how many resources Haifeng can realistically expect and the quality of the sites.
  • Ideally these test jobs will be submitted using the glideinWMS. However, if that turns out to be problematic, then Haifeng will simply submit them using condor-g.
  • Haifeng will report on the status of this at this Friday's computing meeting.
  • As a corollary: Haifeng will look into dashboard reporting for these test jobs. It is desirable to have an official CMS record of the scale achieved in this way.
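
For the condor-g fallback mentioned above, one submit description per site is needed. The sketch below, in Python, only illustrates the shape of such a file and is not the procedure actually used: the helper `condor_g_submit`, the gatekeeper contact strings and the script name `test_job.sh` are all hypothetical placeholders, and the real list of sites with cmssw installed would have to come from elsewhere.

```python
def condor_g_submit(gatekeeper: str, executable: str, tag: str) -> str:
    """Build a minimal Condor-G submit description for one site.

    gatekeeper is a GT2 contact string such as 'host/jobmanager-condor'.
    """
    lines = [
        "universe = grid",
        f"grid_resource = gt2 {gatekeeper}",
        f"executable = {executable}",
        f"output = {tag}.out",
        f"error = {tag}.err",
        f"log = {tag}.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical gatekeepers; replace with the real sites under test.
SITES = {
    "siteA": "ce.site-a.example.edu/jobmanager-condor",
    "siteB": "ce.site-b.example.edu/jobmanager-pbs",
}

for name, contact in SITES.items():
    # test_job.sh is a placeholder for the small production-like test job.
    print(f"# --- {name}.sub ---")
    print(condor_g_submit(contact, "test_job.sh", name))
```

Each generated description could be written to a file and handed to `condor_submit`; keeping one file per site makes it easy to tell from the per-site log which sites failed.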

(2.3) Longer term, we agreed that we are aiming to accept actual MC assignments by May 22nd.

(3) Scalability of OSG CE

This is a project Reza is working on. He needs help! Haifeng and fkw will meet with him on Wednesday afternoon to talk through what's happened there so far, and where things stand at this point.

-- FkW - 02 May 2007


This topic: UCSDTier2 > HaifengComputing
Topic revision: r2 - 2007/05/18 - 22:13:31 - FkW
 