
Revision 1 - 2007/09/17 - Main.TerrenceMartin


Documentation of CE Overloads


September 17th 2007

The overload of the UCSD CE osg-gw-2.t2.ucsd.edu began at approximately 09:30 on the morning of September 17th, 2007. Load quickly rose from around 50 to 500+, and this extreme level of load caused batch system problems. During this time the second CE, osg-gw-4.t2.ucsd.edu, was not affected. At the time of the overload there were 3000+ jobs in the queue on osg-gw-2.

A possible reason for the overload is that certain remote sites run obsolete versions of Condor. These versions allow multiple instances of the grid monitor to run per submitter.

Condor Version: 6.9.2pre2


 10:13:02 up 9 days, 13:34,  1 user,  load average: 533.90, 498.40, 414.38

PS output

In the following output the first column should be 1, as each submitter should have only one grid monitor. Some older versions of Condor have a bug where they continue to submit grid monitors if the first does not respond within a specified time frame.

ps auwx|grep grid-monitor-job-status |grep perl|sed -e 's|.*https://| |' -e 's|/tmp/[a-z_]*[.]0x[0-9a-f]*[.]| |' -e 's|/.*||'|sort|uniq -c

      2  egee-rb-03.cnaf.infn.it:20020 3680
      1  rb113.cern.ch:20328 17164
      2  rb119.cern.ch:20279 7333
      3  rb119.cern.ch:20310 7333
      3  rb119.cern.ch:20342 7333
      1  rb122.cern.ch:20176 7247
      2  rb122.cern.ch:20225 7247
      4  rb122.cern.ch:20230 7247
      1  rb122.cern.ch:20238 7247
      3  rb124.cern.ch:20063 8628
      1  rb124.cern.ch:20176 8628
      2  rb124.cern.ch:20230 8628
      1  rb124.cern.ch:20427 8628
      4  rb127.cern.ch:20044 32166
      1  rb128.cern.ch:20014 21180
      2  rb128.cern.ch:20169 21180
      3  rb128.cern.ch:20197 21180
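The same pipeline can be wrapped into a small check that flags submitters running duplicate grid monitors. This is a hedged sketch, not part of the original monitoring: the function name, WARNING format, and the read-from-stdin wrapper are assumptions added for illustration; the `sed` expressions are taken verbatim from the command above.

```shell
#!/bin/sh
# count_monitors reads "ps auwx" output on stdin, reduces each
# grid-monitor-job-status process line to "submitter pid-suffix",
# and warns when any submitter is running more than one grid monitor
# (each submitter should run exactly one).
count_monitors() {
  grep grid-monitor-job-status | grep perl \
    | sed -e 's|.*https://| |' \
          -e 's|/tmp/[a-z_]*[.]0x[0-9a-f]*[.]| |' \
          -e 's|/.*||' \
    | sort | uniq -c \
    | awk '$1 > 1 { print "WARNING:", $1, "grid monitors for", $2 }'
}

# Live use on the CE would be:  ps auwx | count_monitors
```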


The following graphs cover the 12 hours preceding and during the overload of the CE.

-- TerrenceMartin - 17 Sep 2007

  • CPU Load:

  • Condor Job Queue:

  • Managed Fork Queue:

  • Memory Usage:


Once the load hit 550+ the Condor queue system started to have serious problems. At this point the quickest resolution was:

  • Stop the Condor queue
  • Restart the machine to purge the process queue
  • Upgrade Condor to 6.9.4 (to take advantage of the short downtime)
  • Start Condor again
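The recovery steps above can be sketched as a command sequence. The init-script path, reboot command, and the placement of the upgrade step are assumptions about this 2007-era CE, not documented specifics; the sketch only prints the plan rather than executing it.

```shell
#!/bin/sh
# recovery_plan prints the resolution steps as commands so they can be
# reviewed before running anything for real. Paths and commands are
# assumptions; adapt to the local Condor installation.
recovery_plan() {
  cat <<'EOF'
/etc/init.d/condor stop    # stop the Condor queue
shutdown -r now            # reboot to purge the process queue
# (upgrade Condor to 6.9.4 here -- site-specific install step)
/etc/init.d/condor start   # start Condor again
EOF
}
recovery_plan
```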

META FILEATTACHMENT attr="" autoattached="1" comment="Condor Job Queue" date="1190054484" name="jobqueue.png" path="jobqueue.png" size="28967" user="Main.TerrenceMartin" version="1"
META FILEATTACHMENT attr="" autoattached="1" comment="Managed Fork Queue" date="1190054498" name="localqueue.png" path="localqueue.png" size="22898" user="Main.TerrenceMartin" version="1"
META FILEATTACHMENT attr="" autoattached="1" comment="CPU Load" date="1190054465" name="cpuload.png" path="cpuload.png" size="32997" user="Main.TerrenceMartin" version="1"
META FILEATTACHMENT attr="" autoattached="1" comment="Memory Usage" date="1190054509" name="memusage.png" path="memusage.png" size="32035" user="Main.TerrenceMartin" version="1"