Release v2.4 series

Release v2.4 - For Production (Released: May 04, 2010)

  • Fix installer bug (JW)

    • Install collector+schedd as root, specify non-root user to run Condor under; schedd does not start up:
    Hi guys.
    
    I tried to install the collector+schedd on a test machine, installing as root and using a user account for the condor daemons,
    and the installer did not like it; it fails in the last step, when trying to start the daemons:
    ...
    How many secondary schedds do you want?: [9] 4
    Error determining who should own the Condor-related directories.
    Either create a "condor" account, or set the CONDOR_IDS environment
    variable to the valid uid.gid pair that should be used by Condor.
    Traceback (most recent call last):
      File "./glideinWMS_install", line 4238, in ?
        main()
      File "./glideinWMS_install", line 117, in main
        return installer(install_options)
      File "./glideinWMS_install", line 94, in installer
        install_options[k]["proc"]()
      File "./glideinWMS_install", line 301, in schedd_node_install
        configure_secondary_schedd(schedd_name)
      File "./glideinWMS_install", line 3431, in configure_secondary_schedd
        raise RuntimeError, "Failed to initialize schedd '%s'!"%schedd_name
    RuntimeError: Failed to initialize schedd 'jobs1'!
    ... 
    

  • Privilege separation in factory

  • Factory: Refactor: separate log directories from configuration directories

    • The current structure makes it difficult to back up, maintain or delete logs, etc.
  • Aggregate gatekeeper/sites displayed in the monitoring plots. (PM)

    • Introduce the concept of Groups

Release v2.4.1 (Released on July 27, 2010)

  • Bug about Condor zero-length file reported by Dennis

    • This is not an issue in the v2 series
  • Test glideinWMS compatibility with condor v7.5.2

    • Change occurrences of 1 to True and 0 to False in glideinWMS; Condor changed the way it interprets 1/0 vs. True/False (an illustrative sketch follows).
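      As an illustration only (the attribute name below is made up, not an actual glideinWMS attribute), the change amounts to emitting boolean literals instead of integers wherever glideinWMS writes such values:

        # Hypothetical sketch: emit True/False instead of 1/0, since Condor 7.5.2
        # no longer interprets the integer and boolean forms the same way.
        def format_classad_bool(value):
            # previously: return "1" if value else "0"
            return "True" if value else "False"

        print("GLIDEIN_SOME_FLAG = %s" % format_classad_bool(True))  # GLIDEIN_SOME_FLAG = True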
  • Improved Documentation

    • Completed and committed to branch_v2plus by Doug Strain
    • Merged to branch_v2_4plus
  • Monitoring Groups improvements

  • APPEND_REQ_VANILLA = (Memory>=1)&&(Disk>=1)

    • Most users do not set Disk usage in their jobs, so set the above in condor_config.local.
  • Factory & Frontend: use of gLexec to be determined by each frontend

    • Currently, the factory configures this on a per-entry-point basis.
    • Proposal: the factory should advertise whether gLExec is available; the frontend should choose whether gLExec use is mandatory, desired, or off.
    • Possibly create new attributes that look like:
      gfactory                      frontend
      GLEXEC_AVAILABLE  ->
                        <-  GLEXEC_USE
    • Actual implementation: if an entry is configured with GLEXEC_BIN set to a value other than NONE, it is assumed that the site has gLExec available. The frontend can mandate or override the use of gLExec by setting the attribute GLIDEIN_Glexec_Use: if set to NONE, the glidein will not use gLExec; if set to OPTIONAL, the glidein will use gLExec if the site has it; if set to REQUIRED, the glidein will enforce the use of gLExec. Setting it to REQUIRED also makes the factory send glideins only to sites that have gLExec configured.
    • Tested 6/30/10 John Weigand:
       frontend  factory    result
       --------  ---------  ------
       required  glexec     ran / used glexec
       never     glexec     ran / did not use glexec
       optional  glexec     ran / used glexec
       optional  no glexec  ran / did not use glexec
       never     no glexec  ran / did not use glexec
       required  no glexec  should have no glideins start; job367 - authorize but not user job - clueless; nothing in any log saying it will never be satisfied
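      A minimal sketch of the decision logic described above, assuming the factory advertises GLEXEC_BIN per entry and the frontend supplies GLIDEIN_Glexec_Use (the helper function itself is illustrative, not the actual glideinWMS code):

        # Hypothetical sketch of the gLExec decision described above.
        def decide_glexec(glexec_bin, glidein_glexec_use):
            """glexec_bin: the entry's GLEXEC_BIN (a path, or "NONE").
            glidein_glexec_use: the frontend's GLIDEIN_Glexec_Use (NONE/OPTIONAL/REQUIRED)."""
            site_has_glexec = (glexec_bin is not None) and (glexec_bin.upper() != "NONE")
            if glidein_glexec_use == "REQUIRED":
                if not site_has_glexec:
                    # the factory should not even send glideins to such an entry
                    raise RuntimeError("frontend requires gLExec, but the entry has none configured")
                return True
            if glidein_glexec_use == "OPTIONAL":
                return site_has_glexec
            # NONE: never use gLExec, even if the site provides it
            return False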

  • Move configuration of GLIDEIN_Job_Max_Time from the factory to the frontend.

    • The problem is that right now GLIDEIN_Job_Max_Time+GLIDEIN_Retire_Time determine how long the glidein can run, and GLIDEIN_Retire_Time must stay in the factory (site specific).
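    As a minimal illustration of the relation above (the function name is made up):

      # Hypothetical sketch: the glidein's maximum lifetime is the sum of the two
      # knobs, which is why only the site-specific GLIDEIN_Retire_Time has to stay
      # in the factory configuration.
      def max_glidein_lifetime(glidein_job_max_time, glidein_retire_time):
          return glidein_job_max_time + glidein_retire_time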
  • Check that Factory configuration is consistent w.r.t CONDOR_ARCH, CONDOR_VERSION, CONDOR_OS

    For entries where we configure CONDOR_ARCH and CONDOR_OS (example below):

    <attr name="CONDOR_ARCH" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="False" type="string" value="x86"/>
    <attr name="CONDOR_OS" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="False" type="string" value="rhel3"/>

    We should make sure that x86 and rhel3 can actually be resolved and are configured through the config parameter condor_tarball (something like the following):

    <condor_tarball arch="x86" base_dir="/home/gfactoryparag/v2.2.3/glidecondor-x86-rhel3" os="rhel3" tar_file="/var/www/html/glidefactory-v2.2.3/stage/glidein_v1_0/condor_bin_rhel3-x86.a21hTW.tgz"/>

    This way, the OS/arch-specific condor tarball to be used by every entry is guaranteed to exist (a sketch of such a check follows).
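    A minimal sketch of such a consistency check, assuming the factory XML is parsed with the standard library (the element and attribute names are taken from the examples above; the script itself is illustrative, not part of glideinWMS):

      import xml.etree.ElementTree as ET

      # Hypothetical sketch: verify that every entry's CONDOR_OS/CONDOR_ARCH pair
      # is covered by a <condor_tarball> element in the factory configuration.
      def check_condor_tarballs(config_file):
          root = ET.parse(config_file).getroot()
          tarballs = set((t.get("os"), t.get("arch")) for t in root.iter("condor_tarball"))
          missing = []
          for entry in root.iter("entry"):
              attrs = dict((a.get("name"), a.get("value")) for a in entry.iter("attr"))
              pair = (attrs.get("CONDOR_OS"), attrs.get("CONDOR_ARCH"))
              if None not in pair and pair not in tarballs:
                  missing.append((entry.get("name"), pair))
          return missing  # an empty list means every entry's tarball exists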

  • Add a new knob PREEMPT_GRACE_TIME (IS)

    The HCC group requested another feature; since they use preemption (for which I added an explicit knob in v2_4 a while ago), they also need to specify MaxJobRetirementTime, or the job will never go away when preempted.
    
    I defined PREEMPT_GRACE_TIME, which is then used to define MaxJobRetirementTime, in my UCSD gfactory, and they are testing it now.
    
    The reason for a different name is so it is not confused with GLIDEIN_Job_Max_Time.
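    A minimal sketch of the intent (the helper below is illustrative, not the actual glideinWMS code):

      # Hypothetical sketch: tie MAXJOBRETIREMENTTIME to the new knob so that a
      # preempted job gets a bounded retirement window instead of lingering forever.
      def append_retirement_policy(condor_config_path, preempt_grace_time_secs):
          with open(condor_config_path, "a") as cfg:
              cfg.write("PREEMPT_GRACE_TIME = %d\n" % preempt_grace_time_secs)
              cfg.write("MAXJOBRETIREMENTTIME = $(PREEMPT_GRACE_TIME)\n")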
    

Release v2.4.2 (Released on August 03, 2010)

  • Fix v2.4.1, which breaks backward compatibility in monitoring

    • This is not an issue in v2 series

Release v2.4.3 (Released on September 30, 2010)

  • Install VOMS Certs during the installation process.

    • This has been merged from branch_v2plus
    -- JohnWeigand - 2010/09/21 - test results:
    The current method of putting .pem files in the ./vomsdir directory has not been working, as most distributed .pem files have already expired. The new method, using .lsc files, has just been tested and should be available in the next release (an illustrative example follows).
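    A minimal sketch of the .lsc approach (the host name and DNs below are made up): instead of shipping the VOMS server's expiring .pem certificate, a <host>.lsc file lists the server's subject DN followed by its CA's DN.

      # Hypothetical illustration of writing a vomsdir .lsc file.
      lsc_lines = [
          "/DC=org/DC=example/OU=Services/CN=voms.example.org",   # VOMS server subject DN
          "/DC=org/DC=example/CN=Example CA",                     # issuing CA DN
      ]
      with open("voms.example.org.lsc", "w") as f:
          f.write("\n".join(lsc_lines) + "\n")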
  • Fix the misleading condor config file and default settings (Fixed: Aug 10, 2010)

    -- JohnWeigand - 2010/09/21 - test results:
    Verified with Parag on 9/20. This is a function of the create/reconfig process for the frontend; the fix applies to the ./frontend.condor_config generated by those processes.
  • Fix generation of the CN from a voms proxy of a service certificate. It adds an extra CN bit at the end which should be stripped out (Fixed: Aug 10, 2010)

    -- JohnWeigand - 2010/09/21 - test results:
    I did not test the q/a installer. Just the ini installer.
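    A minimal sketch of the kind of cleanup implied (the DN and regular expression are illustrative assumptions, not the installer's actual code):

      import re

      # Hypothetical sketch: drop the trailing proxy "CN=..." component(s) that a
      # voms proxy appends to the service certificate's subject DN.
      def strip_proxy_cn(dn):
          return re.sub(r"(/CN=(proxy|limited proxy|[0-9]+))+$", "", dn)

      print(strip_proxy_cn("/DC=org/DC=example/OU=Services/CN=host/fe.example.org/CN=1234567890"))
      # -> /DC=org/DC=example/OU=Services/CN=host/fe.example.org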
  • Make sure the entry does not throw an exception when we fail to submit glideins

    • FOLLOWING IS NOT A BUG. It works fine as expected and the factory does not crash. Verified in branch_v2_4plus.
    • Happens when condor_submit fails for whatever reason and we try to determine the number of glideins submitted from a None/empty list object
    [2010-08-10T13:45:21-05:00 32385] condor_submit failed: Error running 'export X509_USER_PROXY=/home/gfactory/.grid/pilot.dbox.proxy;./job_submit.sh "gpgeneral" "gpgeneral@v1_0@if01@fnal_if01" 20  -- GLIDEIN_Collector if01.dot,fnal.dot,gov' code 256:['\n', 'ERROR: proxy has expired\n']
    [2010-08-10T13:45:21-05:00 32385] Exception at Tue Aug 10 13:45:21 2010: Traceback (most recent call last):
      File "/home/gfactory/glideinWMS/factory/glideFactoryEntry.py", line 311, in iterate
        glideinDescript,jobDescript,jobAttributes,jobParams)
      File "/home/gfactory/glideinWMS/factory/glideFactoryEntry.py", line 281, in iterate_one
        done_something = find_and_perform_work(in_downtime,glideinDescript,jobDescript,jobParams)
      File "/home/gfactory/glideinWMS/factory/glideFactoryEntry.py", line 181, in find_and_perform_work
        jobDescript,x509_proxy_fname,params)
      File "/home/gfactory/glideinWMS/factory/glideFactoryEntry.py", line 91, in perform_work
        nr_submitted=glideFactoryLib.keepIdleGlideins(condorQ,idle_glideins,max_running,max_held,submit_attrs,x509_proxy_fname,params)
      File "/home/gfactory/glideinWMS/factory/glideFactoryLib.py", line 286, in keepIdleGlideins
        submitGlideins(condorq.entry_name,condorq.schedd_name,condorq.client_name,min_nr_idle-idle_glideins,submit_attrs,x509_proxy_fname,params)
      File "/home/gfactory/glideinWMS/factory/glideFactoryLib.py", line 664, in submitGlideins
        cluster,count=extractJobId(submit_out)
      File "/home/gfactory/glideinWMS/factory/glideFactoryLib.py", line 587, in extractJobId
        raise condorExe.ExeError, "Could not find cluster info!"
    ExeError: Could not find cluster info!

    -- JohnWeigand - 2010/09/21 - test results:
    Did not know how to create the condition for this specific type of exception.
    Tests I did perform:
    1. Stopped the globus-gatekeeper on the entry CE.
    Results - Only 1 glidein was started on the WMS Collector. This message in the client log file indicated the resource was down:
    In submit_20100921_cms_jgw-v2_4_3.main.log:
      026 (004.000.000) 09/21 08:59:38 Detected Down Grid Resource
      GridResource: gt2 gr9x0.fnal.gov/jobmanager-condor
    However, at 09:38, this 'tuple index out of range' stacktrace appeared in the log files and never appeared again over a 1:30:00 time period. I do not believe it is related to this issue. The log files it appeared in were the factory logs (not the client logs):

    • glidein_v2_4_3/log/entry_ress_ITB_GRATIA_TEST_2/factory.20100921.info.log
    • glidein_v2_4_3/log/entry_ress_ITB_GRATIA_TEST_2/factory.20100921.err.log
    [2010-09-21T09:38:37-05:00 14896] WARNING: Exception occurred: ['Traceback (most recent call last):\n', '   File "/home/weigand/glidein/glideinWMS.v2_4_3_alpha_1/factory/glideFactoryEntry.py", line 453, in iterate\n    write_stats()\n', '   File "/home/weigand/glidein/glideinWMS.v2_4_3_alpha_1/factory/glideFactoryEntry.py", line 357, in write_stats\n    glideFactoryLib.factoryConfig.log_stats.write_file()\n', '   File "/home/weigand/glidein/glideinWMS.v2_4_3_alpha_1/factory/glideFactoryMonitoring.py", line 870, in write_file\n    diff_summary=self.get_diff_summary()\n', '  File "/home/weigand/glidein/glideinWMS.v2_4_3_alpha_1/factory/glideFactoryMonitoring.py", line 795, in get_diff_summary\n    sdel[4][\'username\']=username\n', 'IndexError: tuple index out of range\n'] 
    2. Started the entry CE globus-gatekeeper at 10:40.
    Results: It processed the user jobs successfully. I then shut down the entry CE gatekeeper. It recognized the down resource correctly. The glidein pilots on the WMS collector continued running. I did not get the warning message again.

  • Check the default scripts run on the worker node and make sure errors are logged to stderr and not stdout.

    -- JohnWeigand - 2010/09/21 - test results:
    Tough one to test. It would help if the description told you just which stdout we are looking for.

  • Allow factory_startup to be started from any directory

    • 7/2/10 from Ian MacNeill (UCSD) - We have a request for a change to be made to 'factory_startup'. When we call factory_startup, it works for all of the arguments (stop, reconfig, etc.) from any directory. When we call 'factory_startup start' from any directory but the one it is stored in, it fails to work. Can this be modified so that the argument 'start' will work from any directory?
    • Parag: The current fix still has a couple of flaws. Needs a fix.
    • Just cd into factory_dir in factory_startup.
    -- JohnWeigand - 2010/09/21 - test results:
    Verified that it can be started from any directory.

-- ParagMhashilkar - 2011/02/01
