Documentation of CE Overloads

September 17th 2007

The overload of the UCSD CE osg-gw-2.t2.ucsd.edu began at approximately 09:30 on the morning of September 17th, 2007. Load quickly rose from around 50 to 500+. This extreme level of load caused batch system problems. During this time the second CE, osg-gw-4.t2.ucsd.edu, was not affected. At the time of the overload there were 3000+ jobs in the queue on osg-gw-2.

A possible reason for the overload is that certain remote sites use obsolete versions of Condor. These versions allow multiple instances of the grid-monitor to run per submitter.

Condor Version 6.9.2pre2

Load

uptime
 10:13:02 up 9 days, 13:34,  1 user,  load average: 533.90, 498.40, 414.38
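
During an event like this it is convenient to watch the load and the number of grid-monitor processes together; the following one-liner is a sketch of how that could be done, where the 30-second interval and the grep pattern are illustrative assumptions rather than part of the original diagnostics.

watch -n 30 'uptime; ps auwx | grep -c [g]rid-monitor-job-status'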

PS output

In the following output the first column should be 1, since each submitter should have only one grid-monitor. Some older versions of Condor have a bug where they will continue to submit grid-monitors if the first does not respond within a specified time frame.


ps auwx|grep grid-monitor-job-status |grep perl|sed -e 's|.*https://| |' -e 's|/tmp/[a-z_]*[.]0x[0-9a-f]*[.]| |' -e 's|/.*||'|sort|uniq -c

      2  egee-rb-03.cnaf.infn.it:20020 3680
      1  rb113.cern.ch:20328 17164
      2  rb119.cern.ch:20279 7333
      3  rb119.cern.ch:20310 7333
      3  rb119.cern.ch:20342 7333
      1  rb122.cern.ch:20176 7247
      2  rb122.cern.ch:20225 7247
      4  rb122.cern.ch:20230 7247
      1  rb122.cern.ch:20238 7247
      3  rb124.cern.ch:20063 8628
      1  rb124.cern.ch:20176 8628
      2  rb124.cern.ch:20230 8628
      1  rb124.cern.ch:20427 8628
      4  rb127.cern.ch:20044 32166
      1  rb128.cern.ch:20014 21180
      2  rb128.cern.ch:20169 21180
      3  rb128.cern.ch:20197 21180
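
To list only the offending submitters, the same pipeline can be fed through a small awk filter that keeps lines whose count exceeds 1. This is a sketch added for convenience, not part of the original diagnostics.

ps auwx | grep grid-monitor-job-status | grep perl | \
    sed -e 's|.*https://| |' -e 's|/tmp/[a-z_]*[.]0x[0-9a-f]*[.]| |' -e 's|/.*||' | \
    sort | uniq -c | awk '$1 > 1'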

Graphs

The following graphs cover the 12 hours preceding and during the overload of the CE.

-- TerrenceMartin - 17 Sep 2007

  • CPU Load:
    cpuload.png

  • Condor Job Queue:
    jobqueue.png

  • Managed Fork Queue:
    localqueue.png

  • Memory Usage:
    memusage.png

Resolution

Once the load hit 550+ the Condor queue system started to have serious problems. At that point the quickest resolution was the sequence below (a rough command sketch follows the list):

  • Stopping the Condor queue
  • Restarting the machine to purge the process queue
  • Upgrading Condor to 6.9.4 (to take advantage of the short downtime)
  • Starting Condor again
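
The exact commands were not recorded in this report; on a typical Linux CE the sequence above would look roughly like the following, where the init script path and the upgrade step are assumptions rather than what was actually typed.

# Rough sketch of the recovery sequence; the init script path and the
# upgrade step are assumptions, not taken from the incident record.
/etc/init.d/condor stop      # stop the Condor queue
shutdown -r now              # reboot to purge the process queue
# ... after the reboot, upgrade Condor to 6.9.4 ...
/etc/init.d/condor start     # start Condor again
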
Topic attachments
Attachment       Size    Date                 Who             Comment
cpuload.png      32.2 K  2007/09/17 - 18:41   TerrenceMartin  CPU Load
jobqueue.png     28.3 K  2007/09/17 - 18:41   TerrenceMartin  Condor Job Queue
localqueue.png   22.4 K  2007/09/17 - 18:41   TerrenceMartin  Managed Fork Queue
memusage.png     31.3 K  2007/09/17 - 18:41   TerrenceMartin  Memory Usage