Documentation of CE Overloads
September 17th 2007
Overload of the UCSD CE osg-gw-2.t2.ucsd.edu began at approximately 09:30 on the morning of Sept 17th 2007. Load quickly rose from around 50 to 500+. This extreme level of load resulted in batch system problems. During this time the second CE, osg-gw-4.t2.ucsd.edu, was not affected. At the time of the overload there were 3000+ jobs in the queue on osg-gw-2.
A possible reason for the overload is that certain remote sites use obsolete versions of Condor. These sites allow multiple instances of grid-monitor to run per submitter.
Condor Versions 6.9.2pre2
Load
uptime
10:13:02 up 9 days, 13:34, 1 user, load average: 533.90, 498.40, 414.38
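A load reading like the one above could be checked automatically. The following is a minimal sketch, not part of the original incident handling: the helper names one_minute_load and load_exceeds and the threshold of 100 are assumptions chosen for illustration, and it assumes uptime prints load values with a dot as the decimal separator.

```shell
#!/bin/sh
# Hypothetical helper: extract the 1-minute load average from one line of
# `uptime` output read on stdin.
one_minute_load() {
    # Drop everything up to "load average: ", then everything after the first comma.
    sed -e 's/.*load average[s]*: *//' -e 's/,.*//'
}

# Hypothetical helper: exit status 0 if the 1-minute load exceeds the threshold.
# $1 = uptime line, $2 = threshold (illustrative value, not from the incident report)
load_exceeds() {
    load=$(printf '%s\n' "$1" | one_minute_load)
    # awk does the floating-point comparison; exit 0 means "exceeds".
    awk -v l="$load" -v t="$2" 'BEGIN { exit !(l + 0 > t + 0) }'
}

# The uptime line recorded during the overload.
line='10:13:02 up 9 days, 13:34, 1 user, load average: 533.90, 498.40, 414.38'

if load_exceeds "$line" 100; then
    echo "load above threshold: $(printf '%s\n' "$line" | one_minute_load)"
fi
```

On a live host the recorded line could be replaced with the output of uptime itself, for example load_exceeds "$(uptime)" 100.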
PS output
In the following output the first column (the count) should be 1, since each submitter should have only one grid monitor. Some older versions of Condor have a bug where they continue to submit grid monitors if the first does not respond within a specified time frame.
ps auwx|grep grid-monitor-job-status |grep perl|sed -e 's|.*https://| |' -e 's|/tmp/[a-z_]*[.]0x[0-9a-f]*[.]| |' -e 's|/.*||'|sort|uniq -c
2 egee-rb-03.cnaf.infn.it:20020 3680
1 rb113.cern.ch:20328 17164
2 rb119.cern.ch:20279 7333
3 rb119.cern.ch:20310 7333
3 rb119.cern.ch:20342 7333
1 rb122.cern.ch:20176 7247
2 rb122.cern.ch:20225 7247
4 rb122.cern.ch:20230 7247
1 rb122.cern.ch:20238 7247
3 rb124.cern.ch:20063 8628
1 rb124.cern.ch:20176 8628
2 rb124.cern.ch:20230 8628
1 rb124.cern.ch:20427 8628
4 rb127.cern.ch:20044 32166
1 rb128.cern.ch:20014 21180
2 rb128.cern.ch:20169 21180
3 rb128.cern.ch:20197 21180
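To pick out only the offending submitters from output like the above, the counts can be filtered. This is a minimal sketch under stated assumptions: the helper name flag_duplicates is hypothetical, and the sample rows are copied from the listing above rather than gathered live.

```shell
#!/bin/sh
# Hypothetical helper: keep only the `uniq -c` rows whose count (first column)
# is greater than 1, i.e. submitters running more than one grid monitor.
flag_duplicates() {
    awk '$1 > 1 { print }'
}

# Sample rows copied from the ps output above.
sample='      2 egee-rb-03.cnaf.infn.it:20020 3680
      1 rb113.cern.ch:20328 17164
      4 rb122.cern.ch:20230 7247'

printf '%s\n' "$sample" | flag_duplicates
```

With the sample rows this prints the egee-rb-03.cnaf.infn.it and rb122.cern.ch lines and drops rb113.cern.ch. On a live CE the same filter could be appended to the end of the ps pipeline shown above.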
Graphs
The following graphs cover the 12 hours preceding and during the overload of the CE.
--
TerrenceMartin - 17 Sep 2007
- CPU Load:
- Condor Job Queue:
- Managed Fork Queue:
- Memory Usage:
Resolution
Once the load hit 550+ the Condor queue system started to have serious problems. At this point the quickest resolution was:
- Stopping of the condor queue
- Restart of the computer system to purge the process queue
- Upgrade of Condor to 6.9.4 (to take advantage of the short downtime)
- Starting of condor.