Glidein Factory FAQ

Variables Used in this Document

Here is a list of variables used in this document as shorthand for common paths used in factory operations:

Variable             Value                                      Description
GLIDEIN_FACTORY_DIR  /home/gfactory/glideinsubmit/glidein_v2_0  current factory instance directory
GLIDEIN_SRC_DIR      /home/gfactory/glideinwms                  glideinWMS source code directory
GLIDEIN_FACTOOLS     /home/gfactory/factools                    factools repo location

Assumed gfactory Path Setup

This document assumes the gfactory user has the following set in the $PATH:

  • $GLIDEIN_SRC_DIR/factory/tools/
  • $GLIDEIN_FACTOOLS/generic/bin/
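
The variables and $PATH entries above could be set in the gfactory user's .bash_profile roughly as follows. This is only a sketch that reuses the example values from the table; the real paths depend on the install.

```shell
# Sketch of ~/.bash_profile additions for the gfactory user.
# The values are the example paths from the table above; adjust as needed.
export GLIDEIN_FACTORY_DIR=/home/gfactory/glideinsubmit/glidein_v2_0
export GLIDEIN_SRC_DIR=/home/gfactory/glideinwms
export GLIDEIN_FACTOOLS=/home/gfactory/factools

# Put the factory tools and the factools scripts on the PATH.
export PATH="$GLIDEIN_SRC_DIR/factory/tools:$GLIDEIN_FACTOOLS/generic/bin:$PATH"
```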

Basic Procedures

Reconfiguring Factory

  1. Edit /etc/gwms-factory/glideinWMS.xml
  2. Stop the factory, reconfigure, and then restart:
    service gwms-factory stop
    service gwms-factory reconfig
    service gwms-factory start
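
The three commands above can be wrapped so that a failed step stops the sequence. The sketch below is one way to do it; the function names and the DRY_RUN switch are illustrative and not part of glideinWMS.

```shell
# Sketch: run the reconfigure sequence, stopping at the first failure.
# Set DRY_RUN=1 to print the commands instead of running them.
run_step() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

reconfig_factory() {
    run_step service gwms-factory stop &&
    run_step service gwms-factory reconfig &&
    run_step service gwms-factory start
}
```

Running DRY_RUN=1 reconfig_factory prints the three commands without touching the factory. If the reconfig step fails, the factory is left stopped so the config error can be fixed first.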
    

Restarting Factory after Reboot

  1. As root start httpd:
    /etc/init.d/httpd start
  2. As root start condor:
    /etc/init.d/condor start
  3. Run top and watch the load. Only proceed once the load average has dropped considerably and the idle CPU percentage (%id) is comfortably above 0%
  4. As gfactory start the factory:
    $GLIDEIN_FACTORY_DIR/factory_startup start
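
Watching top by hand can be replaced by a small wait loop. The sketch below checks only the 1-minute load average (not %id), reads it from /proc/loadavg (Linux only), and uses a threshold that is purely an assumption to be tuned per host.

```shell
# Sketch: succeed only when a load average is below a threshold.
load_below() {
    # $1 = current 1-minute load average, $2 = threshold
    awk -v load="$1" -v max="$2" 'BEGIN { exit !(load + 0 < max + 0) }'
}

# Block until the 1-minute load average drops below the threshold.
wait_for_low_load() {
    # $1 = threshold, $2 = seconds between checks (default 30)
    while ! load_below "$(cut -d' ' -f1 /proc/loadavg)" "$1"; do
        sleep "${2:-30}"
    done
}
```

For example, wait_for_low_load 4 && $GLIDEIN_FACTORY_DIR/factory_startup start would start the factory once the load falls below 4.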

Site Debugging Procedures

Putting Entries in Temporary Downtime

./factory_startup down -entry entry_name -comment 'comment on why it is down'

You can optionally put an end time with the option

-end [[[YYYY-]MM-]DD-]HH:MM[:SS]
Examples:
-end 07:00
-end 05-19-07:00
-end 2014-05-19-07:00
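
A quick way to catch a mistyped end time before passing it to factory_startup is to check its shape against the pattern above. This is a sketch with a hypothetical helper name; it validates the format only, not whether the date exists.

```shell
# Sketch: check that a string matches [[[YYYY-]MM-]DD-]HH:MM[:SS].
# Only the shape is checked, not whether the date itself is valid.
valid_end_time() {
    printf '%s\n' "$1" | grep -Eq \
        '^((([0-9]{4}-)?[0-9]{2}-)?[0-9]{2}-)?[0-9]{2}:[0-9]{2}(:[0-9]{2})?$'
}
```

For example, valid_end_time 05-19-07:00 succeeds, while valid_end_time 2014-07:00 fails because the month and day are missing.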

Maintenance

Installing Factory from RPMs

Click here for instructions on how to install a Factory from scratch using the OSG RPMs.

Installing factools Repo

The factools repo can be found at:

https://github.com/jdost321/factools

In the gfactory user home directory, run:

git clone git://github.com/jdost321/factools.git

Refer to factools/README for how to set up the environment to enable factools usage.

Adding a New Site to Glidein Factory

Click here for instructions on how to add a site for VOs to use.

Shared Factory Config

Click here for information about the shared factory config

Entry Templates

NOTE this section is likely obsolete

CMS cream:

      <entry name="" comment="" enabled="True" gatekeeper="https://%HOSTNAME%:8443/ce-cream/services/CREAM2 %BATCH% %QUEUE%" gridtype="cream" verbosity="std" work_dir="TMPDIR">
         <config>
            <max_jobs held="25" idle="400" running="10000">
               <max_job_frontends>
               </max_job_frontends>
            </max_jobs>
            <release max_per_cycle="20" sleep="0.2"/>
            <remove max_per_cycle="5" sleep="0.2"/>
            <restrictions require_voms_proxy="False"/>
            <submit cluster_size="10" max_per_cycle="100" sleep="0.2"/>
         </config>
         <downtimes/>
         <allow_frontends>
         </allow_frontends>
         <attrs>
            <attr name="CONDOR_OS" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="False" type="string" value="default"/>
            <attr name="GLEXEC_BIN" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="True" type="string" value="NONE"/>
            <attr name="GLIDEIN_CMSSite" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value=""/>
            <attr name="GLIDEIN_Max_Walltime" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="True" type="int" value="114840"/>
            <attr name="GLIDEIN_ResourceName" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value=""/>
            <attr name="GLIDEIN_SEs" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value=""/>
            <attr name="GLIDEIN_Site" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value=""/>
            <attr name="GLIDEIN_Supported_VOs" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="True" type="string" value="CMS"/>
            <attr name="USE_CCB" const="True" glidein_publish="True" job_publish="False" parameter="True" publish="True" type="string" value="True"/>
         </attrs>
         <files>
         </files>
         <infosys_refs>
            <infosys_ref ref="GlueCEUniqueID=" server="exp-bdii.cern.ch" type="BDII"/>
         </infosys_refs>
         <monitorgroups>
            <monitorgroup group_name="CMST2"/>
            <monitorgroup group_name="CMS"/>
         </monitorgroups>
      </entry>

Cloning Factories

Below are examples of doing a global clone from UCSD to GOC and CERN factories. You can record your clones in the Factory Cloning Log.

DISCLAIMER The examples are subject to change due to the constantly evolving nature of our config files. They are current as of 2015-01-13.

Description of clone_glidein Arguments

  • -merge yes/no/only
    • yes - modify existing entries in addition to adding new ones
    • no - only add new entries (default)
    • only - only merge existing; don't add new entries
  • -preserve_enable - when merging don't disable sites that are still enabled in original config
  • -disable_old - if a site is in the original config but no longer in the "other" config, disable it

Temporary Fix for v3_2_5 -> v3_2_3

NOTE until all factories are v3_2_5, you will see errors like:

Unexpected error occurred loading the configuration file.

Unknown parameter glidein.entries.ATLAS_US_Michigan_gate01.config.submit.submit_attrs

To avoid this, before running the clone tool, remove the new v3_2_5 attributes:

grep -v submit_attrs glideinWMS.xml.ucsd > glideinWMS.xml.ucsd2

Then proceed to clone normally using glideinWMS.xml.ucsd2 instead.

Cloning UCSD -> GOC

These instructions assume you have copied the UCSD config to the respective factory and named it glideinWMS.xml.ucsd.

  1. clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test -merge yes glideinWMS.xml
    
  2. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test and reconfig

Cloning GOC -> UCSD is done in the exact same way so it is not shown here.
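
The "replace current glideinWMS.xml" step can keep a dated copy of the old config in case the clone needs to be rolled back. The helper below is a sketch with a hypothetical name; review the changes with diff first, and reconfig afterwards as usual.

```shell
# Sketch: keep a dated backup of the current config, then swap in the new one.
# Review first, e.g.: diff -u glideinWMS.xml glideinWMS.xml.test
promote_config() {
    # $1 = current config, $2 = tested replacement
    current=$1
    candidate=$2
    [ -f "$current" ] && [ -f "$candidate" ] || return 1
    cp -p "$current" "$current.$(date +%Y%m%d%H%M%S).bak" &&
    mv "$candidate" "$current"
}
```

For example: promote_config glideinWMS.xml glideinWMS.xml.test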

Cloning UCSD -> CERN

These instructions assume you have copied the UCSD config to the respective factory and named it glideinWMS.xml.ucsd.

  1. Use include and exclude constraints to only add regular CMS sites
    clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test -merge yes -include GLIDEIN_Supported_VOs CMS -exclude GLIDEIN_Supported_VOs CMSOverflow glideinWMS.xml
    
  2. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test and reconfig

Cloning GOC -> CERN is done in the exact same way so it is not shown here.

Cloning CERN -> UCSD

These instructions assume you have copied the CERN config to the respective factory and named it glideinWMS.xml.cern.

  1. Exclude the cloud resources:
    clone_glidein -other glideinWMS.xml.cern -out glideinWMS.xml.test -exclude name CMS_T1_TW_ASGC_AI -exclude name CMS_T2_CH_CERN_AI -exclude name CMS_T2_CH_CERN_HLT -merge yes glideinWMS.xml
    
  2. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test and reconfig

Cloning CERN -> GOC is done in the exact same way so it is not shown here.

Cloning UCSD -> GOC-ITB

NOTE while we test glexec sites, take care not to accidentally disable glexec on GOC-ITB entries. The following instructions don't account for this. Please diff the resulting xml and make any additional corrections necessary by hand.

These instructions assume you have copied the UCSD config to the respective factory and named it glideinWMS.xml.ucsd.

  1. Append _ITB to OSGVO and associated names:
    sed -e 's/OSGVO\([,"]\)/OSGVO_ITB\1/g' -e 's/OSGVOHTPC/OSGVOHTPC_ITB/g' -e 's/OSGVOBigMem/OSGVOBigMem_ITB/g' -e 's/OSGVO_MULTICORE/OSGVO_MULTICORE_ITB/g' glideinWMS.xml.ucsd > glideinWMS.xml.ucsd2
    
  2. clone_glidein -other glideinWMS.xml.ucsd2 -out glideinWMS.xml.test -merge yes glideinWMS.xml
    
  3. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test and reconfig
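
To see what the renaming in step 1 does before running it on a real config, the same sed expressions can be wrapped as a filter and tried on a sample line. The function name is illustrative only.

```shell
# Sketch: the step 1 renaming as a stdin-to-stdout filter.
to_itb_names() {
    sed -e 's/OSGVO\([,"]\)/OSGVO_ITB\1/g' \
        -e 's/OSGVOHTPC/OSGVOHTPC_ITB/g' \
        -e 's/OSGVOBigMem/OSGVOBigMem_ITB/g' \
        -e 's/OSGVO_MULTICORE/OSGVO_MULTICORE_ITB/g'
}
```

For example, echo 'value="OSGVO,OSGVOHTPC"' | to_itb_names prints value="OSGVO_ITB,OSGVOHTPC_ITB". The filter is not idempotent: a second pass would turn OSGVOHTPC_ITB into OSGVOHTPC_ITB_ITB, so apply it only to a fresh copy of the UCSD config.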

Cloning GOC-ITB -> UCSD

NOTE while we test glexec sites, take care not to accidentally enable glexec on UCSD entries that haven't been tested. The following instructions don't account for this. Please diff the resulting xml and make any additional corrections necessary by hand.

These instructions assume you have copied the GOC-ITB config to the respective factory and named it glideinWMS.xml.itb.

  1. Remove _ITB from OSGVO_ITB and associated names:
    sed -e 's/OSGVO_ITB/OSGVO/g' -e 's/OSGVOHTPC_ITB/OSGVOHTPC/g' -e 's/OSGVOBigMem_ITB/OSGVOBigMem/g' -e 's/OSGVO_MULTICORE_ITB/OSGVO_MULTICORE/g' glideinWMS.xml.itb > glideinWMS.xml.itb2
    
  2. clone_glidein -other glideinWMS.xml.itb2 -out glideinWMS.xml.test -merge yes glideinWMS.xml
    
  3. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test and reconfig

Installing Factory Condor from scratch

Please see instructions under Installing Factory Condor from Scratch

Upgrading Factory Condor

Factories with an RPM install:

The following commands need to be run as the root user:

  1. Stop glideinWMS:
    service gwms-factory stop
  2. Stop Condor:
    service condor stop
  3. Update condor using yum. Note: for the ITB factory, you'll likely want to use the osg-development repo instead of osg.
    yum update --enablerepo epel --enablerepo osg condor condor-classads condor-cream-gahp condor-procd
  4. Start Condor:
    service condor start
  5. Wait for the appropriate amount of time, then start glideinWMS:
    service gwms-factory start

Factories using a non-RPM install:

Go to the condor website and download the tarball as root user:

http://research.cs.wisc.edu/htcondor/downloads/

As of 2014-01-09:

For the UCSD factory we currently use condor-rel-x86_64_RedHat6-stripped.tar.gz

For CERN we use condor-rel-x86_64_RedHat5-stripped.tar.gz

For the other factories we use condor-rel-x86_RedHat5-stripped.tar.gz

cd /root/Downloads
wget http://parrot.cs.wisc.edu//symlink/tmp_path_to_tarball/condor-rel-x86_RedHat5-stripped.tar.gz

As gfactory stop Factory:

cd $GLIDEIN_FACTORY_DIR
./factory_startup stop

The next commands need to be run as the root user.

  1. Stop Condor:
    /etc/init.d/condor stop
  2. Run upgrade script:
    /root/glideinwms/install/glidecondor_upgrade condor-rel-x86_RedHat5-stripped.tar.gz 
  3. Start Condor with init.d script:
    /etc/init.d/condor start
  4. Run top and watch the load. Only proceed once the load average has dropped considerably and the idle CPU percentage (%id) is comfortably above 0%

As gfactory start the factory:

./factory_startup start

Upgrading Glidein Condor

Click here for instructions on how to upgrade the Condor tarballs for glideins to use.

Upgrading GlideinWMS in Place

Factories with RPM installs:

Note: The following needs to be done as root

  1. Shut down the factory
    service gwms-factory stop
    
  2. Update the factory packages using yum. Note: for the ITB factory, if you're testing a pre-release you may need to add --enablerepo=osg-development
    yum update --exclude=condor*
  3. If the create_condor_tarball script has been updated, follow the instructions to rebuild all of the glidein condor tarballs and update the factory config with the new ones as outlined in the Upgrading Glidein Condor section.
  4. Upgrade the factory
    service gwms-factory upgrade
  5. Start the factory back up
    service gwms-factory start

Factories with non-RPM installs:

This should be the standard method to upgrade GlideinWMS unless significant changes have been made to the code base. Otherwise, it is preferable to create a new factory instance with the following instructions:
Upgrading GlideinWMS with New Instance

Back Up Old GlideinWMS (Optional)

Check if there are any manually applied patches:

cd $GLIDEIN_SRC_DIR
git status

If there are and they are worth saving, it is easiest to just back up the whole git repo:

cd ..
rsync -av glideinWMS/ glideinWMS-old

old should signify the glideinWMS version number you are backing up.

Upgrade Procedure

  1. Shut down Factory:
    cd $GLIDEIN_FACTORY_DIR
    ./factory_startup stop
  2. Check to make sure there are no running factory python processes:
    ps -u gfactory
  3. Fetch latest code and checkout new, where new is the desired tag or branch name:
    cd $GLIDEIN_SRC_DIR
    git fetch
    git checkout new
  4. Now you must build all glidein condor tarballs using create_condor_tarball in the new Prestage directory, and update the new config file accordingly at this point, as outlined in the Upgrading Glidein Condor section.
  5. Run upgrade and supply full absolute path of config file:
    cd $GLIDEIN_FACTORY_DIR
    ./factory_startup upgrade ${GLIDEIN_FACTORY_DIR}.cfg/glideinWMS.xml
    
  6. Restart Factory:
    ./factory_startup start
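
The check for leftover factory python processes can be scripted by filtering the ps output. The sketch below assumes the default ps -u column layout (PID TTY TIME CMD, with the command name last); the function name is hypothetical.

```shell
# Sketch: count python processes in `ps` output read from stdin.
# Skips the header line and looks at the last column (the command name).
count_python_procs() {
    awk 'NR > 1 && $NF ~ /python/ { n++ } END { print n + 0 }'
}
# Example: ps -u gfactory | count_python_procs
```

A result of 0 suggests the factory has fully stopped; anything else is worth inspecting before continuing with the upgrade.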

Upgrading GlideinWMS with New Instance

These instructions should only be followed if significant changes have been made to the code base. Otherwise, it is preferable to upgrade in place with the following instructions:
Upgrading GlideinWMS in Place

Back Up Old GlideinWMS (Optional)

Check if there are any manually applied patches:

cd $GLIDEIN_SRC_DIR
git status

If there are and they are worth saving, it is easiest to just back up the whole git repo:

cd ..
rsync -av glideinWMS/ glideinWMS-old

old should signify the glideinWMS version number you are backing up.

Upgrade Procedure

  1. Create new instance directory and copy over config file:
    cd $GLIDEIN_FACTORY_DIR/..
    mkdir glidein_new.cfg
    cp glidein_old.cfg/glideinWMS.xml glidein_new.cfg/
  2. Copy over any validation scripts or wrappers used in previous instance:
    cp glidein_old.cfg/*.sh glidein_new.cfg/
    cp glidein_old.cfg/*.source glidein_new.cfg/
  3. Create new Prestage dir for tarballs:
    mkdir glidein_new.cfg/Prestage 
  4. Edit glidein_new.cfg/glideinWMS.xml to use new name:
    glidein_name="new" 
  5. Also replace any references of glidein_old.cfg with glidein_new.cfg in the new config file. You might also like to remove all disabled ( enabled="False") entries. You can do this simply by doing:
    sed '/enabled="False"/,/<\/entry>/d' -i glidein_new.cfg/glideinWMS.xml
  6. Shut down Factory:
    cd $GLIDEIN_FACTORY_DIR
    ./factory_startup stop
    
  7. Check to make sure there are no running factory python processes:
    ps -u gfactory
  8. Fetch latest code and checkout new, where new is the desired tag or branch name:
    cd $GLIDEIN_SRC_DIR
    git fetch
    git checkout new
  9. Now you must build all glidein condor tarballs using create_condor_tarball in the new Prestage directory, and update the new config file accordingly at this point, as outlined in the Upgrading Glidein Condor section.
  10. Create new factory instance:
    cd $GLIDEIN_FACTORY_DIR/..
    $GLIDEIN_SRC_DIR/creation/create_glidein glidein_new.cfg/glideinWMS.xml
  11. Copy the old downtimes file to newly created instance dir:
    cp glidein_old/glideinWMS.downtimes glidein_new/
  12. Change into the newly created glidein_new directory and start up the new instance.
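
The sed command in step 5 deletes every line range that starts at a line containing enabled="False" and ends at the next </entry>. The sketch below wraps the same expression as a filter so its effect can be checked on a sample, or on the real config without -i, before editing in place. It assumes each entry's opening tag sits on a single line and that no other line contains enabled="False".

```shell
# Sketch: the step 5 sed expression as a stdin-to-stdout filter.
drop_disabled_entries() {
    sed '/enabled="False"/,/<\/entry>/d'
}
# Example: preview the change before using -i on the real file
#   drop_disabled_entries < glidein_new.cfg/glideinWMS.xml | diff glidein_new.cfg/glideinWMS.xml -
```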

Post Upgrade Actions

Edit gfactory user .bash_profile:

export GLIDEIN_FACTORY_DIR=/home/gfactory/glideinsubmit/glidein_new 

At UCSD and GOC as root, update the osg_gfactory monitoring symlink:

cd /var/www/html
rm osg_gfactory
ln -s glidefactory/monitor/glidein_new osg_gfactory
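
As an alternative to removing and recreating the monitoring symlink, GNU ln can repoint it in one step. This is a sketch assuming GNU coreutils; the helper name is hypothetical.

```shell
# Sketch: repoint a symlink in one step with GNU ln.
# -s symbolic, -f replace an existing link, -n do not follow the old link.
repoint_symlink() {
    # $1 = new target, $2 = link name
    ln -sfn "$1" "$2"
}
# Example, run from /var/www/html as root:
#   repoint_symlink glidefactory/monitor/glidein_new osg_gfactory
```

Without -n, ln would follow the old link and create the new one inside the old monitoring directory instead of replacing it.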

Upgrading GlideinWMS v2_7 Gotchas

GlideinWMS v2_7 has significant changes so it is best to follow the above:
Upgrading GlideinWMS with New Instance

The source code directory must be renamed to all lowercase glideinwms. A good time to do this is in step 8 of the above instructions before the git fetch.

The gfactory user's .bash_profile will have to be modded:

export GLIDEIN_SRC_DIR=/path/to/src/glideinwms 

An additional mod needs to be added to the .bash_profile to get around the analyze_entries bug:

export GLIDEIN_MON_URL=$GLIDEIN_FACTORY_DIR

Finally, factools will have to be switched to the special compat branch:

cd $GLIDEIN_FACTOOLS
git checkout dev_2_7_compat

Areas needing backup

The glidein factory is mostly stateless... if we were to lose the disk used by it, we should be able to reconstruct the gfactory within hours by using a few config files.

The main configuration file is glideinWMS.xml. It defines almost everything else in a factory configuration.
To be on the safe side, however, one should back up the whole factory directory tree... currently this is:

/var/gfactory/glideinsubmit/glidein_v2_0/

Since there may be several factories installed on the same node, backing up the base directory is the easiest solution to not forget any of them:

/var/gfactory/glideinsubmit/

Please note that the directories above contain symlinks to other areas in the file system;
none of those need to be backed up, as they can be recreated if needed.
Moreover, while the base factory directory is relatively static and small (currently ~50M), the linked directories are very dynamic and can grow quite a bit.

Nothing else in the factory should need to be backed up;
all the code should be in Git or downloadable from an official repository.

If there are any experimental or in-development code pieces, those should use a separate backup policy.

The factory also relies heavily on Condor, so basic Condor config files should be backed up as well.
Unfortunately, the config files are split between three directories, so all three must be backed up:

/opt/glidecondor/etc/
/opt/glidecondor/certs/
/etc/condor/

Condor also needs the host certificate to function;

/etc/grid-security 

should thus be backed up, too.

Nothing else in Condor needs to be backed up, as it can be easily recreated using the glideinWMS installation script.

The same should apply to all other software components the factory is relying on.
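
The areas listed in this section can be collected into a single archive. The sketch below is one possible approach, not an established backup procedure: the helper name and the archive location in the example are assumptions, and paths are assumed to contain no spaces.

```shell
# Sketch: archive whichever of the given paths exist on this host.
backup_paths() {
    # $1 = output archive; remaining arguments = paths to include
    archive=$1
    shift
    existing=""
    for p in "$@"; do
        # Skip paths this host does not have (layouts differ between factories).
        if [ -e "$p" ]; then
            existing="$existing $p"
        fi
    done
    [ -n "$existing" ] || return 1
    # tar stores symlinks as links by default, so the linked areas are not pulled in.
    # $existing is deliberately unquoted so it splits into separate paths.
    tar -czf "$archive" $existing
}
# Example (archive path is hypothetical):
#   backup_paths /root/gfactory-backup.tar.gz /var/gfactory/glideinsubmit \
#       /opt/glidecondor/etc /opt/glidecondor/certs /etc/condor /etc/grid-security
```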

Dealing with Scalability limits

Number of Process Limits

In RHEL 6 the default number of processes per user is conservatively low, set to 1024. This will likely affect factory performance at full scale. Add /etc/security/limits.d/91-userlimits.conf:

# we need many processes, for Condor
*   soft   nproc   128297

FD Limits

NOTE This may no longer be an issue, as glideinWMS has significantly reduced the number of needed FDs in the factory code

Factory scales with the number of entries in the config. Eventually gfactory user max open file limits will be hit. This can be seen in ~/glideinsubmit/glidein_Production_v2_0/log/factory/factory.*.info.log:

[2012-05-11T12:44:02-07:00 6730] WARNING: Exception occurred: ['Traceback (most recent call last):\n', '  File "/home/gfactory/glideinWMS/factory/glideFactory.py", line 432, in main\n    glideinDescript,entries,restart_attempts,restart_interval)\n', '  File "/home/gfactory/glideinWMS/factory/glideFactory.py", line 213, in spawn\n    childs[entry_name]=popen2.Popen3("%s %s %s %s %s %s %s"%(sys.executable,os.path.join(STARTUP_DIR,"glideFactoryEntry.py"),os.getpid(),sleep_time,advertize_rate,startup_dir,entry_name),True)\n', '  File "/usr/lib64/python2.4/popen2.py", line 43, in __init__\n    c2pread, c2pwrite = os.pipe()\n', 'OSError: [Errno 24] Too many open files\n']

To deal with this, increase ulimits. Right now we have it at 50k for gfactory user. In ~/.bash_profile:

ulimit -n 50240

in /etc/security/limits.conf:

gfactory        hard    nofile          50240

After changing these, log out and back in as gfactory, then restart the factory.
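
After logging back in, it is worth confirming that the new limit actually took effect. The helper below is a small sketch with a hypothetical name; it compares a value reported by ulimit against a required minimum.

```shell
# Sketch: compare a `ulimit` value against a required minimum.
limit_ok() {
    # $1 = current limit ("unlimited" or a number), $2 = required minimum
    [ "$1" = "unlimited" ] || [ "$1" -ge "$2" ]
}
# Example, run as gfactory:
#   limit_ok "$(ulimit -n)" 50240 || echo "open-file limit too low"
```

In bash the same check can be applied to the process limit from the previous section by passing the output of ulimit -u instead.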

Factory Specific Notes

Factory Software and Patches

UCSD

Date Software Type Description
2014-07-30 gwms v3_2_6  
2014-08-20 condor 8.2.2  

GOC

Date Software Type Description
2014-02-11 condor 8.0.5  
2014-06-10 gwms v3_2_5  
2014-06-10 gwms git cherry-pick 427f074b3dbccd2a4c997211e3a9a62e2e377d58 Add HTCondorCE RSL support
2014-06-10 gwms git cherry-pick e02560caa1478a64464f881aa062d4cd75a9885c Add condor_chirp to tarballs

CERN 0305

Date Software Type Description
2014-07-30 gwms v3_2_6  
2014-08-20 condor 8.2.2  

CERN 32

Date Software Type Description
2014-06-10 gwms v3_2_5  
2014-06-10 gwms git cherry-pick 664c1daf5d651369991de1d5e33b5c6538c0c5f4 Add autoupdate to monitoring page
2014-06-10 gwms git cherry-pick 427f074b3dbccd2a4c997211e3a9a62e2e377d58 Add HTCondorCE RSL support
2014-06-10 gwms git cherry-pick e02560caa1478a64464f881aa062d4cd75a9885c Add condor_chirp to tarballs
2014-02-11 condor 8.0.5  

GOC-ITB

Date Software Type Description
2014-07-23 gwms v3_2_6_rc3  
2014-05-30 condor 8.2.2  

GOC Factory Things to Remember

Production Factory: glidein.grid.iu.edu
ITB Factory: glidein-itb.grid.iu.edu

Turn off timeout for sudo

run:

/usr/sbin/visudo

add the following:

Defaults    timestamp_timeout = 0

Firewall settings

For condor we give a port range of 20k-50k. See the /etc/iptables.d files for details. Also the condor config must know about it:

###################
# Firewall limits
###################
HIGHPORT=50000
LOWPORT=20000

Frontend Support

Adding a New Frontend

Click here for instructions on how to register a new Frontend to the Factory.

How To Open A Ticket To Contact Glidein Factory Support

NOTE This procedure is likely obsolete and needs to be verified with GOC

Although we encourage users to contact us directly at osg-gfactory-support@physics.ucsd.edu, a ticket may be opened should the user deem it appropriate.

Until GOC institutes a custom form for the Glidein Factory, begin by visiting https://ticket.grid.iu.edu/goc/other and load your certificate. Please select the VO on whose behalf you are submitting the ticket. Then under "Add CC" add osg-gfactory-support@physics.ucsd.edu. Finally, type a message to describe the problem and hit submit.

Monitoring Reference

Glidein Factory Status

http://glidein-1.t2.ucsd.edu:8319/osg_gfactory/factoryStatus.html

Load this page; the default "Entry" is 'total' (it can also be set per site). Then hit "update."

At the simplest level, there are four graphs to look at which are all displayed on top of one another on this page:

  • Running glidein jobs (green solid, on by default)
  • Glideins at collector (black line, not on by default)
  • Glideins claimed by user jobs (purple line, on by default)
  • Glideins not matched (yellow line, on by default)

What to look for with these:

Glideins claimed (purple) should not be much lower than the green envelope.

Glideins at collector (black) should also not be much lower than the green envelope.

Glideins not matched (yellow) should not be very large (relative to glideins claimed or running).

Each of these conditions can be temporary, i.e., not matched can spike and then go down when many jobs are submitted at once. This is not a problem. When the above conditions persist, a problem is more likely.

Glidein Factory Status Now

http://glidein-1.t2.ucsd.edu:8319/osg_gfactory/factoryStatusNow.html

This page displays a table of live data corresponding to the data shown in the plots under Glidein Factory Status. The information is further divided by VO. See GlideinFactoryStatusNow (NOTE needs updating) for a longer and more detailed discussion.

Log Reference

Glidein output logs:

$GLIDEIN_FACTORY_DIR/client_log/user_*/entry_*/job.*.out
$GLIDEIN_FACTORY_DIR/client_log/user_*/entry_*/job.*.err

Glidein user logs:

$GLIDEIN_FACTORY_DIR/client_log/user_*/entry_*/condor_activity_*.log

Condor daemon logs:

/opt/glidecondor/condor_local/log/*Log

NOTE On GOC machines:

/usr/local/glidecondor/condor_local/log/*Log

Condor gridmanager logs:

/dev/shm/GridmanagerLog.schedd_glideins*

NOTE On GOC machines:

/tmp/GridmanagerLog.schedd_glideins*

Factory daemon logs:

$GLIDEIN_FACTORY_DIR/log/factory/factory.*.log
$GLIDEIN_FACTORY_DIR/log/entry_*/factory.*.log

Completed glidein logs:

$GLIDEIN_FACTORY_DIR/log/entry_*/completed_jobs_*.log
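
When debugging a single entry, the most recent glidein stderr files are usually the ones of interest. The sketch below lists them using the glidein output log path pattern from this section; the function name is hypothetical.

```shell
# Sketch: list the N most recently modified glidein stderr files for an entry.
latest_glidein_errs() {
    # $1 = factory dir, $2 = entry name, $3 = how many (default 5)
    ls -t "$1"/client_log/user_*/entry_"$2"/job.*.err 2>/dev/null | head -n "${3:-5}"
}
# Example: latest_glidein_errs "$GLIDEIN_FACTORY_DIR" CMS_T2_US_UCSD_gw2 3
```

On entries with a very large number of log files the glob may exceed the shell's argument limit; in that case find with -newer or -mtime is a safer choice.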

Tool Reference

Running analyze_entries status report

cd $GLIDEIN_FACTORY_DIR
analyze_entries -x 24 -s waste

Run command with -h to print explanation of possible options.

This report is sent to osg-gfactory-reports@physics.ucsd.edu daily.

Using proxy_info to Verify Pilot Proxies

An example of how to verify the pilot proxies used by the frontend.

  1. Get a list of the proxies for a VO and CE:
    proxy_info fecms ls -l /var/gfactory/clientproxies/user_fecms/glidein_v2_0/entry_CMS_T2_US_UCSD_gw2/

  2. Display a particular proxy's information:
    proxy_info fecms info -all /var/gfactory/clientproxies/user_fecms/glidein_v2_0/entry_CMS_T2_US_UCSD_gw2/x509_CMS_T2_US_UCSD_gw2@v2_0@UCSD@UCSD.minus,v5_0.dot,main_umrw_5.proxy

  3. For additional tool help run:
    proxy_info -h
    

NOTE at CERN you must first source:

source /afs/cern.ch/cms/LCG/LCG-2/UI/cms_ui_env.sh

Site Debugging Reference

How to contact Grid sites

Non-CMS issues at OSG sites

Use GOC:

https://ticket.grid.iu.edu/goc/submit

  • For Email Address use osg-gfactory-support@physics.ucsd.edu
  • Check the Resource box and find the name corresponding to the GLIDEIN_ResourceName attribute in the $GLIDEIN_FACTORY_DIR/glideinWMS.xml
  • Include the Resource name in the Title so it is easy to find

CMS issues for All Sites

Use Savannah:

https://savannah.cern.ch/

  1. Search for CMS
  2. Select CMS Computing Infrastructure Support
  3. Click Submit a new item

NOTE the following assumes you have administrative rights

Fill out the following fields:

  • For Category select Facilities
  • For Assigned to select cmscompinfrasup-sitename
  • Set Use GGUS to No (this can be changed to Yes later if admins never respond)
  • For Site find the name corresponding to the GLIDEIN_CMSSite attribute in the $GLIDEIN_FACTORY_DIR/glideinWMS.xml
  • Include the CMS Site name in the Title so it is easy to find
  • In Add Email Addresses add osg_gfactory

NOTE As an exception UK admins complain that we should always set Use GGUS to Yes or they will not see the ticket: https://savannah.cern.ch/support/index.php?134388

NOTE It seems admins at T2_FR_GRIF_IRFU require GGUS as well.

If using GGUS, be sure to add the gfactory-support email list in the "Involve others" field. Otherwise, GGUS won't send out an email to us whenever the ticket is updated.

NOTE if the site squad in question cannot be found in Assigned to, follow the instructions under Non-CMS issues at European sites below.

Non-CMS issues at European sites

Use GOC:

https://ticket.grid.iu.edu/goc/submit

IMPORTANT leave Resource unchecked.

Explain in the Description it is an EGI resource along with the GLIDEIN_ResourceName and mention you are forwarding it to GGUS on behalf of the affected VO. After submitting the ticket, click the GGUS (Prod) box in the Ticket Exchange options and click Update.

Globus Hold Reasons

Globus Error Code Held Reason Job is Recoverable
10 globus_xio_gsi: Token size exceeds limit. Usually happens when someone tries to establish a insecure connection with a secure endpoint, e.g. when someone sends plain HTTP to a HTTPS endpoint without first establishing a SSL session. No
121 the job state file doesn't exist No
126 it is unknown if the job was submitted Yes
12 the connection to the server failed (check host and port) Yes
131 the user proxy expired (job is still running) Maybe
17 the job failed when the job manager attempted to run it No
22 the job manager failed to create an internal script argument file No
31 the job manager failed to cancel the job as requested No
3 an I/O operation failed Yes
47 the gatekeeper failed to run the job manager No
48 the provided RSL could not be properly parsed No
4 jobmanager unable to set default to the directory requested No
76 cannot access cache files in ~/.globus/.gass_cache, check permissions, quota, and disk space Maybe (Short term: No)
79 connecting to the job manager failed. Possible reasons: job terminated, invalid job contact, network problems, ... No
7 an authorization operation failed Yes
7 authentication with the remote server failed Yes
8 the user cancelled the job No
94 the jobmanager does not accept any new requests (shutting down) Yes
9 the system cancelled the job No
? Job failed, no reason given by GRAM server No
122 could not read the job state file Maybe (short term: no)
132 the job was not submitted by original jobmanager No (likely to be fatal)
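
The table above can be turned into a small lookup for use in scripts that scan hold reasons. This is a sketch only: the function names are hypothetical, the Yes/Maybe/No values are copied from the table, and codes not listed there (or with no code at all) are reported as Unknown.

```shell
# Sketch: look up the "Job is Recoverable" column for a Globus error code.
globus_recoverable() {
    case "$1" in
        3|7|12|94|126)                      echo "Yes" ;;
        76|122|131)                         echo "Maybe" ;;
        4|8|9|10|17|22|31|47|48|79|121|132) echo "No" ;;
        *)                                  echo "Unknown" ;;
    esac
}

# Pull the numeric code out of a hold reason such as "Globus error 79: ...".
globus_error_code() {
    printf '%s\n' "$1" | sed -n 's/.*Globus error \([0-9][0-9]*\).*/\1/p'
}
```

For example, globus_recoverable "$(globus_error_code 'Globus error 126: it is unknown if the job was submitted')" prints Yes.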

Some background on Globus

In Globus' Hold Reasons, "job manager" refers to the process running on the CE responsible for submitting to the local batch system. The process is called globus-job-manager.

Additional notes

  • Globus error 79: connecting to the job manager failed.  Possible reasons: job terminated, invalid job contact, network problems, ...
    • This can happen if it is a condor site and the admin removes held glideins from their side.
    • Also happens to every CERN Production Glidein every Monday on every gt5 site; we do not yet know why.

  • Globus error 9: the system cancelled the job
    • Happens when sites preempt glideins for exceeding memory limits or preempt opportunistic glideins (seen at Michigan and BNL).

  • Globus error 17, 31, 79, 121, 155
    • These glideins may not be recoverable, and the factory attempts to remove them.
    • The factory does not always succeed, so you may have to do it manually with -forcex
      • In particular, even if you remove these glideins, when they turn into unknown state ("X"), they might turn back into held state ("H"). So -forcex is the way to remove them definitively

  • Globus error 155
    • The globus-job-manager is likely unable to send a file back to the factory

  • Globus error 31: the job manager failed to cancel the job as requested
    • This happens for various reasons but one case we have observed is hitting memory limits on the WN (batch was PBS)

  • Globus error 10: globus_xio_gsi: Token size exceeds limit. Usually happens when someone tries to establish a insecure connection with a secure endpoint, e.g. when someone sends plain HTTP to a HTTPS endpoint without first establishing a SSL session.

  • Globus error 47: the gatekeeper failed to run the job manager
    • This has been seen when the disk on the gatekeeper gets full

Nordugrid Hold Reasons

CREAM Hold Reasons (Work in progress)

Link to a summary page on CREAM troubleshooting

Reasons we mostly understand

  • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 500 500-Command failed. : globus_l_gfs_file_open failed.  500-globus_xio: Unable to open file /cream_localsandbox/data/prdcms/_DC_ch_DC_cern_OU_computers_CN_cmspilotjob_vocms157_cern_ch_cms_Role_production_Capability_NULL_prdcms35/99/CREAM999015216/OSB/job.479371.9.out  500-globus_xio: System error in open: No such file or directory  500-globus_xio: A system call failed: No such file or directory  500 End.
    • Happens when CREAM site has no memory of job (possibly removed on remote side) but gridmanager refuses to give up

  • CREAM error: CREAM_Job_Register Error: MethodName=[jobRegister] ErrorCode=[0] Description=[The CREAM service cannot accept jobs at the moment] FaultCause=[Submissions are disabled!] Timestamp=[Fri 09 Nov 2012 14:44:25]
    • Site is probably down for maintenance

  • CREAM_Delegate Error: Received NULL fault; the error is due to another cause: FaultString=[connection error] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] - FaultDetail=[Connection timed out]
    • Site likely down. But sometimes it is because we are being blocked by a firewall: https://savannah.cern.ch/support/?120361. To check if it could be firewall try telnet: telnet cream01.iihe.ac.be 8443

  • CREAM_Delegate Error: Received NULL fault; the error is due to another cause: FaultString=[connection error] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] - FaultDetail=[Unknown host]
    • Likely a site that has been decommissioned and hostname is no longer valid, or possibly a typo in the hostname in the config.

  • CREAM_Delegate Error: Authorization error: System error reading local user information
    • We see this on old CREAM installs that don't like / symbols in DNs

  • CREAM_Delegate Error: Authorization error: Failed to get the local user id via glexec

  • CREAM error: CREAM_Job_Register Error: MethodName=[jobRegister] Timestamp=[Tue 18 Oct 2011 06:39:04] ErrorCode=[0] Description=[delegation error: delegation id "1318117200.654933" not found!] FaultCause=[delegProxyInfo "1318117200.654933" not found!]
    • This happens when there is a really old running glidein in the queue (likely lost in rundiff) with an expired lease. It prevents all later glideins with the same user from obtaining a new lease. Just remove it; the held jobs should recover.

  • CREAM error: BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:mkdir: cannot create directory `/var/jwgen//crm02_844243283.debug': File exists-[ERROR] Globus::GRAM::Error::JOB_UNSUBMITTED-Invalid job description-) N/A (jobId = CREAM844243283)

  • CREAM error: Transfer failed: globus_ftp_control: gss_init_sec_context failed OpenSSL Error: s3_clnt.c:1063: in library: SSL routines, function SSL3_GET_SERVER_CERTIFICATE: certificate verify failed globus_gsi_callback_module: Could not verify credential globus_gsi_callback_module: Can't get the local trusted CA certificate: Untrusted self-signed certificate in chain with hash e7734335
    • Can occur if the CA on the factory is out of date, and the gatekeeper identifies itself with a certificate that's newer than our CA. Try using "yum update" to get a more current CA.
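
The list above can be turned into a quick lookup when triaging held glideins. The following is a minimal sketch, not a factory tool: the function name cream_hold_hint is hypothetical, and the patterns only cover the reasons that have an explanation in the list above.

```shell
#!/bin/bash
# Map a CREAM hold reason to the likely cause noted in the list above.
# cream_hold_hint is a hypothetical helper name, not part of glideinWMS or factools.
cream_hold_hint() {
    case "$1" in
        *"Submissions are disabled"*)
            echo "site probably down for maintenance" ;;
        *"Connection timed out"*)
            echo "site likely down, or a firewall is blocking us (try telnet to the CREAM port)" ;;
        *"Unknown host"*)
            echo "site decommissioned or hostname typo in config" ;;
        *"System error reading local user information"*)
            echo "old CREAM install that does not like / symbols in DNs" ;;
        *"delegation id"*"not found"*)
            echo "old glidein with expired lease; remove it" ;;
        *"globus_l_gfs_file_open failed"*)
            echo "site has no memory of job; gridmanager refuses to give up" ;;
        *"certificate verify failed"*)
            echo "CA on factory may be out of date; try yum update" ;;
        *)
            echo "unknown" ;;
    esac
}
```

For example, cream_hold_hint 'CREAM_Delegate Error: ... FaultDetail=[Unknown host]' prints the decommissioned-site hint. The hold reason text itself can be copied from the output of condor_q -hold.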

Reasons we don't understand

  • CREAM error: Transfer failed: GRIDFTP_TRANSFER timed out
  • CREAM error: BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:qsub: Queue is not enabled MSG=queue is disabled: user cmprd003@ce08.pic.es, queue glong_sl5-) N/A (jobId = CREAM102242343)
    • This can be seen when glideins are submitted to a site in downtime.

  • The following are likely because the job manager on the other end no longer has a record of the glideins; they can probably be removed safely (if the site isn't in downtime).
    • CREAM error: reason=999
      • (not really sure what this means)
    • CREAM error: CREAM_Job_Purge Error: job does not exist
    • CREAM error: job aborted because the execution of the JOB_START command has been interrupted by the CREAM shutdown
      
    • CREAM error: BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:pbs_iff: cannot read reply from pbs_server-No Permission.-qsub: cannot connect to server pbs03.pic.es (errno=15007) Unauthorized Request -) N/A (jobId = CREAM258408629)
      
    • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 500 500-Command failed. : globus_l_gfs_file_open failed.  500-globus_xio: Unable to open file /opt/glite/var/cream_sandbox/lt2-cmsprd/_DC_ch_DC_cern_OU_computers_CN_cmspilotjob_vocms157_cern_ch_cms_Role_production_Capability_NULL_lt2-cmsprd713/95/CREAM956614543/OSB/job.714256.8.out  500-globus_xio: System error in open: No such file or directory  500-globus_xio: A system call failed: No such file or directory  500 End.
  • If the following are seen over many entries served by the same gridmanager it may be a local issue (but not always). Killing the gridmanager without -9 seems to clear them up:
    • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 500 Command failed. : globus_xio: An end of file occurred
    • CREAM error: Transfer failed: globus_ftp_control_local_port(): Handle not in the proper state CLOSING.
    • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 530 530-Login incorrect. : globus_gss_assist: Error invoking callout  530-globus_callout_module: The callout returned an error  530-an unknown error occurred  530 End.
    • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 500 500-Command failed. : callback failed.  500-an end-of-file was reached  500-globus_xio: The GSI XIO driver failed to establish a secure connection. The failure occured during a handshake read.  500-globus_xio: An end of file occurred  500 End.
    • CREAM error: Transfer failed: globus_ftp_control: gss_init_sec_context failed globus_gsi_callback_module: Could not verify credential globus_gsi_callback_module: Could not verify credential globus_gsi_callback_module: Invalid CRL: The available CRL has expired
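
To judge whether a hold reason is spread "over many entries", it can help to count the distinct entries showing each reason. The following is a rough sketch under stated assumptions: the function name entries_per_hold_reason is hypothetical, and the input is assumed to be one "entry<TAB>hold reason" line per held glidein. It groups by entry only and does not know which gridmanager serves which entry.

```shell
#!/bin/bash
# Count how many distinct entries show each hold reason.
# Input (stdin): one "entry<TAB>hold reason" line per held glidein (assumed format).
# Output: "<number of entries><TAB><hold reason>", most widespread reason first.
entries_per_hold_reason() {
    # sort -u collapses repeated (entry, reason) pairs so each entry counts once
    sort -u |
        awk -F'\t' '{ count[$2]++ } END { for (r in count) printf "%d\t%s\n", count[r], r }' |
        sort -t "$(printf '\t')" -k1,1nr -k2,2
}
```

One possible way to produce the input, assuming your condor_q supports tab-separated autoformat output and that the glideins carry a GlideinEntryName attribute, is: condor_q -constraint 'JobStatus == 5' -af:t GlideinEntryName HoldReason | entries_per_hold_reason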

New Improved Docs (based on Alison's notes)

FactoryOpsGlideinWMS

FactoryInfo

Ops support

  • Internal tickets
  • mailing lists
  • access and cross-factory support

Factory Ops

  • Creating a new instance (and preserve monitoring history?)
  • Initial setup (daily emails, processes/monitoring, ??)
  • Adding new entries
  • Adding a frontend
  • Upgrading
    • factory
    • condor
  • Cloning
    • sites from one factory to another
    • Global cloning, such as t1 site group
  • Removing schedds (I think docs for this may be wrong?)
  • Attributes (link to gwms docs)
  • Finding missing sites
  • Removing glideins (includes scripts)
  • Submitting test jobs
  • Putting sites in downtime
  • Submitting Tickets
  • Decommissioning sites
  • Factory Disk warnings
  • Entry issues
    • CREAM
    • globus
    • Misc
  • Removing old entries

Daily Ops Monitoring

  • Mailing list
  • Internal tickets (Jira)
  • Daily emails
  • Analyze Entries
  • Web pages
  • Held jobs
  • Infosys
  • Misc
  • .err log problems
  • HOLD problems
  • Condor Activity Log problems

Daily Ops Other issues

  • Restarting the grid manager
  • Handling stuck waiting glideins
  • Rundiffs
  • Unmatched jobs

Additional References

  • logs
  • monitoring tools
  • proxies
  • ssh logins
  • git commands
  • frontend security info
  • BDII
  • Log Retention rules
  • Security
  • Condor G
  • Useful scripts
  • Misc

Future work

  • rpms

Authors

-- TerrenceMartin

-- IgorSfiligoi

Topic revision: r170 - 2017/06/26 - 17:55:22 - JeffreyDost
 