Email from Joe:
-------- Forwarded Message --------
From: Joe Boyd <xxxx@fnal.gov>
To: Parag Mhashilkar <xxxx@fnal.gov>
Cc: Federica Moscato <xxxx@fnal.gov>, Dennis D Box <xxxx@fnal.gov>
Subject: a different factory died
Date: Wed, 30 Sep 2009 20:38:00 -0500
Hi Parag,
This was a completely different installation than the last one where the factory
died on me a couple of times. Again though, the factory died when a configured
condor limit was reached. This was glideinwms 1.5.1 so maybe something is fixed
in a later release. I can't even remember what I was testing before. This was
a different limit than before. I had one entry point open and 8000 jobs
submitted. I hadn't realized that condor was set up with this:
[gfactory@fcdfhead42dev ~/glideinsubmit/glidein_v1_5_1] condor_config_val -dump | grep 5000
GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE = 5000
GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 5000
SEC_DEFAULT_SESSION_DURATION = 50000
[gfactory@fcdfhead42dev ~/glideinsubmit/glidein_v1_5_1]
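A quick pre-flight check along these lines could flag limits like the two above before a large submission. This is only a sketch: `gridmanager_limits` and `limits_below` are hypothetical helper names, the sample text is pasted from the dump above rather than read from a live `condor_config_val -dump`, and it assumes the number of glideins the factory will try to submit is known up front (8000 is used here only because that was the job count in this test).

```python
import re

# Sample text copied from the condor_config_val output above.
DUMP = """\
GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE = 5000
GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 5000
SEC_DEFAULT_SESSION_DURATION = 50000
"""


def gridmanager_limits(dump_text):
    """Return {name: int} for the GRIDMANAGER_MAX_* settings in a config dump."""
    limits = {}
    for line in dump_text.splitlines():
        m = re.match(r"\s*(GRIDMANAGER_MAX_\w+)\s*=\s*(\d+)\s*$", line)
        if m:
            limits[m.group(1)] = int(m.group(2))
    return limits


def limits_below(dump_text, planned):
    """Hypothetical helper: names of limits smaller than the planned count."""
    found = gridmanager_limits(dump_text)
    return sorted(name for name, value in found.items() if value < planned)


for name in limits_below(DUMP, 8000):
    print("limit below planned submission:", name)
```

With the sample text this reports both GRIDMANAGER_MAX_* settings and ignores SEC_DEFAULT_SESSION_DURATION, which only matched the original grep because 50000 contains 5000.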
This is the glideinwms factory condor pool, and once it submitted 5000 glideins
it wouldn't send any more, I guess. At this point, the factory died. Here is
the error file. The factory_info file doesn't have any error in it. The last
entry is just a regular loop entry with the same timestamp as this file.
[gfactory@fcdfhead42dev ~/glideinsubmit/glidein_v1_5_1/log] cat factory_err.20090930.log
[2009-09-30T15:53:51-05:00 29724] Exception at Wed Sep 30 15:53:51 2009:
['Traceback (most recent call last):\n', ' File
"/cdf/local/home/gfactory/glideinWMS.v1_5_1/factory/glideFactory.py", line 176,
in main\n glideinDescript,entries)\n', ' File
"/cdf/local/home/gfactory/glideinWMS.v1_5_1/factory/glideFactory.py", line 121,
in spawn\n time.sleep(sleep_time)\n', ' File
"/cdf/local/home/gfactory/glideinWMS.v1_5_1/factory/glideFactory.py", line 192,
in termsignal\n raise KeyboardInterrupt, "Received signal %s"%signr\n',
'KeyboardInterrupt: Received signal 15\n']
[2009-09-30T16:03:08-05:00 32504] Exception at Wed Sep 30 16:03:08 2009:
['Traceback (most recent call last):\n', ' File
"/cdf/local/home/gfactory/glideinWMS.v1_5_1/factory/glideFactory.py", line 176,
in main\n glideinDescript,entries)\n', ' File
"/cdf/local/home/gfactory/glideinWMS.v1_5_1/factory/glideFactory.py", line 115,
in spawn\n raise RuntimeError,"Entry \'%s\' exited, quit the whole
factory:\\n%s\\n%s"%(entry_name,tempOut,tempErr)\n', "RuntimeError: Entry
'osgt2' exited, quit the whole factory:\n[]\n[]\n"]
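The entries above are hard to read because each traceback is logged as the repr of a Python list of strings. A minimal sketch for unpacking one entry follows. It assumes the entry sat on a single line in the original log (the wrapping here looks like it came from the mail client), so the wrapped lines have been rejoined with spaces first. `unpack_traceback` is a hypothetical helper, not part of glideinWMS.

```python
import ast

# First entry from the error log above, rejoined onto a single line.
ENTRY = (
    r"""[2009-09-30T15:53:51-05:00 29724] Exception at Wed Sep 30 15:53:51 2009: """
    r"""['Traceback (most recent call last):\n', """
    r"""' File "/cdf/local/home/gfactory/glideinWMS.v1_5_1/factory/glideFactory.py", """
    r"""line 176, in main\n glideinDescript,entries)\n', """
    r"""' File "/cdf/local/home/gfactory/glideinWMS.v1_5_1/factory/glideFactory.py", """
    r"""line 121, in spawn\n time.sleep(sleep_time)\n', """
    r"""' File "/cdf/local/home/gfactory/glideinWMS.v1_5_1/factory/glideFactory.py", """
    r"""line 192, in termsignal\n raise KeyboardInterrupt, "Received signal %s"%signr\n', """
    r"""'KeyboardInterrupt: Received signal 15\n']"""
)


def unpack_traceback(entry):
    """Split a log entry into (header, traceback text)."""
    # Assumption: the list repr starts at the first ": [" in the entry.
    header, sep, payload = entry.partition(": [")
    if not sep:
        raise ValueError("no traceback list found in entry")
    # literal_eval only parses Python literals; it does not run code.
    lines = ast.literal_eval("[" + payload)
    return header, "".join(lines)


header, text = unpack_traceback(ENTRY)
print(header)
print(text)
```

Read this way, the first entry is just the factory being stopped with signal 15, so the second entry (the RuntimeError about entry 'osgt2' exiting) is the one that records the actual failure.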
joe