Task List

FEATURES

  • Aggregate gatekeepers/sites displayed in the monitoring plots.
    • Introduce the concept of groups
    • Each entry is a part of its own group
    • Each entry can be part of one or more groups
    • Group membership is configurable
    • Read the group configuration and plot the information the user requests
  • Allow an easy means to configure unquoted strings in the factory configuration.
    • Useful for configuring condor_config variables like the logging level. The values for such variables are unquoted strings.
    • Currently, the only way to achieve this is by changing the web_base/condor_vars.lst file in the installation area and reconfiguring the factory.
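
A minimal sketch of how the grouping feature above could work, assuming groups are configured as a mapping from group name to entry names. The function names (build_groups, aggregate), the group names and the entries other than 'osgt2' are hypothetical and are not part of the existing factory code or configuration format.

```python
def build_groups(entries, configured_groups):
    """Return {group_name: set_of_entry_names}.

    Every entry implicitly forms a group of its own; configured groups
    (hypothetical format: {group_name: [entry, ...]}) are added on top.
    """
    known = set(entries)
    groups = {name: {name} for name in entries}
    for group, members in configured_groups.items():
        if group in known:
            raise ValueError("group name %r clashes with an entry name" % group)
        unknown = set(members) - known
        if unknown:
            raise ValueError("group %r lists unknown entries: %s"
                             % (group, sorted(unknown)))
        groups[group] = set(members)
    return groups


def aggregate(per_entry_counts, groups, requested):
    """Sum a per-entry quantity (e.g. running glideins) for each requested group."""
    return {
        group: sum(per_entry_counts.get(entry, 0) for entry in groups[group])
        for group in requested
    }


# Example with made-up numbers: an entry may belong to several groups.
entries = ["osgt2", "site_a", "site_b"]
configured = {"tier2": ["osgt2", "site_a"],
              "all": ["osgt2", "site_a", "site_b"]}
running = {"osgt2": 10, "site_a": 5, "site_b": 1}

groups = build_groups(entries, configured)
totals = aggregate(running, groups, ["osgt2", "tier2", "all"])
```

The plotting code would then draw one series per requested group from totals instead of one series per entry.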

BUGS

  • Factory does not submit glideins
    • Observed in v1.6.x and v2.x
    • Could not be reproduced
    • Igor thinks this could be caused by an exception that occurs but is not handled correctly and is instead silently ignored. His student may be looking into this issue.
    • Factory does not correctly determine the number of idle glideins in the system. Sometimes it reports that zero glideins are in the system and submits a bunch of new glideins, thus overloading the system.
    • If errors/exceptions occur while running condor_q, we should skip the entire cycle and try again during the next cycle. This way, if the condor_schedd has gone down, the entry will not advertise to the collector and the classad will eventually expire. This seems to be the safer behavior.
    • Email from Burt:
      I noticed this in the CMS production installation:
      [2009-09-18T15:30:35-05:00 32407] Client 'cmssrv86', schedd status {1: 104, 2: 1659, 1100: 1, 1002: 104}, collector running ?
      [2009-09-18T15:31:58-05:00 32407] Client 'cmssrv86', schedd status {1: 0}, collector running ?
      [2009-09-18T15:33:07-05:00 32407] Client 'cmssrv86', schedd status {1: 100, 2: 1658, 1002: 100}, collector
      Note the 15:31:58 status of 1:0 -- that's not real. Is there some error condition with the condor_q output that defaults to marking
      as "zero jobs idle" {1: 0}
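
The proposed fix for this bug (skip the whole cycle when condor_q fails, rather than treating the failure as zero idle glideins) could look roughly like the sketch below. This is not the factory's actual code: run_cycle and the query_schedd, submit and advertise callables are hypothetical stand-ins for the real query, submission and advertising steps. The only fact taken from Condor itself is that JobStatus 1 means idle.

```python
import logging

log = logging.getLogger("factory_sketch")

IDLE = 1  # Condor JobStatus value for idle jobs


def run_cycle(query_schedd, submit, advertise, target_idle):
    """Run one entry cycle; return True if it completed, False if skipped.

    query_schedd() is assumed to return a dict like {1: 104, 2: 1659}
    and to raise an exception when condor_q fails.
    """
    try:
        status_counts = query_schedd()
    except Exception as err:
        # A failed query is NOT the same as "zero idle glideins":
        # submit nothing, advertise nothing, retry on the next cycle.
        log.warning("condor_q failed, skipping this cycle: %s", err)
        return False

    idle = status_counts.get(IDLE, 0)
    if idle < target_idle:
        submit(target_idle - idle)
    advertise(status_counts)
    return True
```

Because a skipped cycle does not advertise, the classad in the collector would simply expire if the condor_schedd stays down, as described above.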

  • Factory sometimes crashes when it reaches the maximum number of glideins that can be submitted.
    • Email from Joe:
      -------- Forwarded Message --------
      From: Joe Boyd <xxxx@fnal.gov>
      To: Parag Mhashilkar <xxxx@fnal.gov>
      Cc: Federica Moscato <xxxx@fnal.gov>, Dennis D Box <xxxx@fnal.gov>
      Subject: a different factory died
      Date: Wed, 30 Sep 2009 20:38:00 -0500

      Hi Parag,

      This was a completely different installation than the last one where the factory
      died on me a couple of times. Again though, the factory died when a configured
      condor limit was reached. This was glideinwms 1.5.1 so maybe something is fixed
      in a later release. I can't even remember what I was testing before. This was
      a different limit than before. I had one entry point open and 8000 jobs
      submitted. I hadn't realized that condor was setup with this:

      [gfactory@fcdfhead42dev ~/glideinsubmit/glidein_v1_5_1] condor_config_val -dump
      | grep 5000
      GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE = 5000
      GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 5000
      SEC_DEFAULT_SESSION_DURATION = 50000
      [gfactory@fcdfhead42dev ~/glideinsubmit/glidein_v1_5_1]

      This is the glideinwms factory condor pool and once it submitted 5000 glideins
      it wouldn't send any more I guess. At this point, the factory died. There is
      the error file. The factory_info file doesn't have any error in it. The last
      entry is just a regular loop entry with the same timestamp as this file.

      [gfactory@fcdfhead42dev ~/glideinsubmit/glidein_v1_5_1/log] cat
      factory_err.20090930.log
      [2009-09-30T15:53:51-05:00 29724] Exception at Wed Sep 30 15:53:51 2009:
      ['Traceback (most recent call last):\n', ' File
      "/cdf/local/home/gfactory/glideinWMS.v1_5_1/factory/glideFactory.py", line 176,
      in main\n glideinDescript,entries)\n', ' File
      "/cdf/local/home/gfactory/glideinWMS.v1_5_1/factory/glideFactory.py", line 121,
      in spawn\n time.sleep(sleep_time)\n', ' File
      "/cdf/local/home/gfactory/glideinWMS.v1_5_1/factory/glideFactory.py", line 192,
      in termsignal\n raise KeyboardInterrupt, "Received signal %s"%signr\n',
      'KeyboardInterrupt: Received signal 15\n']
      [2009-09-30T16:03:08-05:00 32504] Exception at Wed Sep 30 16:03:08 2009:
      ['Traceback (most recent call last):\n', ' File
      "/cdf/local/home/gfactory/glideinWMS.v1_5_1/factory/glideFactory.py", line 176,
      in main\n glideinDescript,entries)\n', ' File
      "/cdf/local/home/gfactory/glideinWMS.v1_5_1/factory/glideFactory.py", line 115,
      in spawn\n raise RuntimeError,"Entry \'%s\' exited, quit the whole
      factory:\\n%s\\n%s"%(entry_name,tempOut,tempErr)\n', "RuntimeError: Entry
      'osgt2' exited, quit the whole factory:\n[]\n[]\n"]

      joe

  • Factory shuts down if one of the entries crashes.
    • This is annoying; a problematic entry should not cause the entire factory to shut down.
  • v2_1 doesn't work with glexec enabled.
    • Tested with condor 7.3.1 and condor 7.2.4
    • condor_procd hangs with permissions error.
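
For the bug above where a single crashed entry shuts down the whole factory, one possible direction is to have the factory's monitoring loop restart or drop the exited entry instead of raising. The sketch below is only an illustration under assumptions: check_entries and the restart callable are hypothetical, and the children are assumed to be process-like objects with a poll() method that returns None while running (as subprocess.Popen does).

```python
def check_entries(children, restart, log=print):
    """Inspect entry processes and isolate failures.

    children: dict mapping entry name -> process-like object with poll().
    restart:  hypothetical callable(entry_name) returning a new process.
    Exited entries are restarted; if the restart itself fails, the entry
    is dropped and the remaining entries keep running.
    """
    for name, child in list(children.items()):
        exit_code = child.poll()
        if exit_code is None:
            continue  # still running
        log("Entry '%s' exited with code %s, restarting" % (name, exit_code))
        try:
            children[name] = restart(name)
        except Exception as err:
            log("Could not restart entry '%s', disabling it: %s" % (name, err))
            del children[name]
    return children
```

A real implementation would probably also limit how often an entry may be restarted, so that an entry that crashes immediately does not loop forever.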
Topic revision: r4 - 2009/10/12 - 16:39:29 - ParagMhashilkar
 