Difference: GlideinFactoryFAQ (1 vs. 170)

Revision 170 2017/06/26 - Main.JeffreyDost

Line: 1 to 1
 

Glidein Factory FAQ

Contents

Line: 273 to 273
 Note: The following needs to be done as root
  1. Shut down the factory
    service gwms-factory stop
    
Changed:
<  2. Update the factory packages using yum. Note: The ITB factory may need the osg-development repo instead of osg
<    yum update --enablerepo epel --enablerepo osg glideinwms-factory glideinwms-factory-condor
<  3. Follow the instructions to rebuild all of the glidein condor tarballs and update the factory config with the new ones as outlined in the Upgrading Glidein Condor section.
<  4. Upgrade the factory
<    service gwms-factory upgrade
<  5. Start the factory back up
<    service gwms-factory start
>  2. Update the factory packages using yum. Note: for the ITB factory, if you're testing a pre-release you may need to add --enablerepo=osg-development
>    yum update --exclude=condor*
>  3. If the create_condor_tarball script has been updated, follow the instructions to rebuild all of the glidein condor tarballs and update the factory config with the new ones as outlined in the Upgrading Glidein Condor section.
>  4. Upgrade the factory
>    service gwms-factory upgrade
>  5. Start the factory back up
>    service gwms-factory start
  Factories with non-RPM installs:

Revision 169 2017/05/11 - Main.JeffreyDost

Changed:
<
<
Revision 168 is unreadable
>
>

Glidein Factory FAQ

Contents

Variables Used in this Document

Here is a list of variables used in this document as shorthand for common paths used in factory operations:

Variable Value Description
GLIDEIN_FACTORY_DIR /home/gfactory/glideinsubmit/glidein_v2_0 current factory instance directory
GLIDEIN_SRC_DIR /home/gfactory/glideinwms glideinWMS source code directory
GLIDEIN_FACTOOLS /home/gfactory/factools factools repo location

Assumed gfactory Path Setup

This document assumes the gfactory user has the following set in the $PATH:

  • $GLIDEIN_SRC_DIR/factory/tools/
  • $GLIDEIN_FACTOOLS/generic/bin/

Basic Procedures

Reconfiguring Factory

  1. Edit /etc/gwms-factory/glideinWMS.xml
  2. Stop the factory, reconfigure, and then restart:
    service gwms-factory stop
    service gwms-factory reconfig
    service gwms-factory start
    

Restarting Factory after Reboot

  1. As root start httpd:
    /etc/init.d/httpd start
  2. As root start condor:
    /etc/init.d/condor start
  3. Run top and watch the load. Only proceed after the load average has dropped considerably and the idle CPU percentage (%id) is comfortably above 0%
  4. As gfactory start the factory:
    $GLIDEIN_FACTORY_DIR/factory_startup start
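
The load check in step 3 can be scripted instead of watched by hand. The following is a minimal sketch, not part of the factory tooling: the function names, the threshold, and the polling interval are assumptions chosen for illustration. It only looks at the 1-minute load average in /proc/loadavg (Linux) and does not look at %id.

```shell
# Hypothetical helpers; names and defaults are illustrative only.

# Succeed when the load value ($1) is strictly below the threshold ($2).
# awk is used because the shell cannot compare decimal numbers itself.
load_is_below() {
    awk -v load="$1" -v max="$2" 'BEGIN { exit !(load + 0 < max + 0) }'
}

# Poll the 1-minute load average until it falls below the threshold ($1).
# $2 is the number of seconds between checks (assumed default: 30).
wait_for_load_below() {
    max_load="$1"
    interval="${2:-30}"
    while read -r load_1min _rest < /proc/loadavg; do
        if load_is_below "$load_1min" "$max_load"; then
            return 0
        fi
        sleep "$interval"
    done
    return 1   # /proc/loadavg could not be read
}
```

For example, as gfactory, `wait_for_load_below 4 && $GLIDEIN_FACTORY_DIR/factory_startup start` would start the factory once the load has fallen below 4. The value 4 is only a placeholder; pick a threshold that suits the machine.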

Site Debugging Procedures

Putting Entries in Temporary Downtime

As gfactory, from $GLIDEIN_FACTORY_DIR:

./factory_startup down -entry entry_name -comment 'comment on why it is down'

You can optionally put an end time with the option

-end [[[YYYY-]MM-]DD-]HH:MM[:SS]
Examples:
-end 07:00
-end 05-19-07:00
-end 2014-05-19-07:00
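
A malformed -end value is easy to type. As a sketch, the shape of the string can be checked before it is passed to factory_startup. The is_valid_end_time helper below is hypothetical (it is not part of glideinWMS) and only checks the [[[YYYY-]MM-]DD-]HH:MM[:SS] pattern, not whether the date actually exists.

```shell
# Hypothetical helper: succeed when $1 has the shape
# [[[YYYY-]MM-]DD-]HH:MM[:SS]. Only the digit layout is checked,
# so a value such as 99-99-99:99 would still pass.
is_valid_end_time() {
    printf '%s\n' "$1" |
        grep -Eq '^((([0-9]{4}-)?[0-9]{2}-)?[0-9]{2}-)?[0-9]{2}:[0-9]{2}(:[0-9]{2})?$'
}

# Example use (entry_name and the comment are placeholders):
#   end="05-19-07:00"
#   is_valid_end_time "$end" &&
#       ./factory_startup down -entry entry_name -comment 'site maintenance' -end "$end"
```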

Maintenance

Installing Factory from RPMs

Click here for instructions on how to install a Factory from scratch using the OSG RPMs.

Installing factools Repo

The factools repo can be found at:

https://github.com/jdost321/factools

In the gfactory user home directory, run:

git clone https://github.com/jdost321/factools.git

Refer to factools/README for how to set up the environment to enable factools usage.

Adding a New Site to Glidein Factory

Click here for instructions on how to add a site for VOs to use.

Shared Factory Config

Click here for information about the shared factory config

Entry Templates

NOTE this section is likely obsolete

CMS cream:

      <entry name="" comment="" enabled="True" gatekeeper="https://%HOSTNAME%:8443/ce-cream/services/CREAM2 %BATCH% %QUEUE%" gridtype="cream" verbosity="std" work_dir="TMPDIR">
         <config>
            <max_jobs held="25" idle="400" running="10000">
               <max_job_frontends>
               </max_job_frontends>
            </max_jobs>
            <release max_per_cycle="20" sleep="0.2"/>
            <remove max_per_cycle="5" sleep="0.2"/>
            <restrictions require_voms_proxy="False"/>
            <submit cluster_size="10" max_per_cycle="100" sleep="0.2"/>
         </config>
         <downtimes/>
         <allow_frontends>
         </allow_frontends>
         <attrs>
            <attr name="CONDOR_OS" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="False" type="string" value="default"/>
            <attr name="GLEXEC_BIN" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="True" type="string" value="NONE"/>
            <attr name="GLIDEIN_CMSSite" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value=""/>
            <attr name="GLIDEIN_Max_Walltime" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="True" type="int" value="114840"/>
            <attr name="GLIDEIN_ResourceName" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value=""/>
            <attr name="GLIDEIN_SEs" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value=""/>
            <attr name="GLIDEIN_Site" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value=""/>
            <attr name="GLIDEIN_Supported_VOs" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="True" type="string" value="CMS"/>
            <attr name="USE_CCB" const="True" glidein_publish="True" job_publish="False" parameter="True" publish="True" type="string" value="True"/>
         </attrs>
         <files>
         </files>
         <infosys_refs>
            <infosys_ref ref="GlueCEUniqueID=" server="exp-bdii.cern.ch" type="BDII"/>
         </infosys_refs>
         <monitorgroups>
            <monitorgroup group_name="CMST2"/>
            <monitorgroup group_name="CMS"/>
         </monitorgroups>
      </entry>

Cloning Factories

Below are examples of doing a global clone from UCSD to GOC and CERN factories. You can record your clones in the Factory Cloning Log.

DISCLAIMER The examples are subject to change due to the constantly evolving nature of our config files. They are current as of 2015-01-13.

Description of clone_glidein Arguments

  • -merge yes/no/only
    • yes - modify existing entries in addition to adding new ones
    • no - only add new entries (default)
    • only - only merge existing; don't add new entries
  • -preserve_enable - when merging don't disable sites that are still enabled in original config
  • -disable_old - if site is in original config but no longer in "other" config, disable it

Temporary Fix for v3_2_5 -> v3_2_3

NOTE until all factories are v3_2_5, you will see errors like:

Unexpected error occurred loading the configuration file.

Unknown parameter glidein.entries.ATLAS_US_Michigan_gate01.config.submit.submit_attrs

To avoid this, before running the clone tool, remove the new v3_2_5 attributes:

grep -v submit_attrs glideinWMS.xml.ucsd > glideinWMS.xml.ucsd2

Then proceed to clone normally using glideinWMS.xml.ucsd2 instead.

Cloning UCSD -> GOC

These instructions assume you have copied the UCSD config to the respective factory and named it glideinWMS.xml.ucsd.

  1. clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test -merge yes glideinWMS.xml
    
  2. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test and reconfig

Cloning GOC -> UCSD is done in the exact same way so it is not shown here.

Cloning UCSD -> CERN

These instructions assume you have copied the UCSD config to the respective factory and named it glideinWMS.xml.ucsd.

  1. Use include and exclude constraints to only add regular CMS sites
    clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test -merge yes -include GLIDEIN_Supported_VOs CMS -exclude GLIDEIN_Supported_VOs CMSOverflow glideinWMS.xml
    
  2. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test and reconfig

Cloning GOC -> CERN is done in the exact same way so it is not shown here.

Cloning CERN -> UCSD

These instructions assume you have copied the CERN config to the respective factory and named it glideinWMS.xml.cern.

  1. Exclude the cloud resources:
    clone_glidein -other glideinWMS.xml.cern -out glideinWMS.xml.test -exclude name CMS_T1_TW_ASGC_AI -exclude name CMS_T2_CH_CERN_AI -exclude name CMS_T2_CH_CERN_HLT -merge yes glideinWMS.xml
    
  2. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test and reconfig

Cloning CERN -> GOC is done in the exact same way so it is not shown here.

Cloning UCSD -> GOC-ITB

NOTE while we test glexec sites, take care not to accidentally disable glexec on GOC-ITB entries. The following instructions don't account for this. Please diff the resulting xml and make any additional corrections necessary by hand.

These instructions assume you have copied the UCSD config to the respective factory and named it glideinWMS.xml.ucsd.

  1. Append _ITB to OSGVO and associated names:
    sed -e 's/OSGVO\([,"]\)/OSGVO_ITB\1/g' -e 's/OSGVOHTPC/OSGVOHTPC_ITB/g' -e 's/OSGVOBigMem/OSGVOBigMem_ITB/g' -e 's/OSGVO_MULTICORE/OSGVO_MULTICORE_ITB/g' glideinWMS.xml.ucsd > glideinWMS.xml.ucsd2
    
  2. clone_glidein -other glideinWMS.xml.ucsd2 -out glideinWMS.xml.test -merge yes glideinWMS.xml
    
  3. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test and reconfig

Cloning GOC-ITB -> UCSD

NOTE while we test glexec sites, take care not to accidentally enable glexec on UCSD entries that haven't been tested. The following instructions don't account for this. Please diff the resulting xml and make any additional corrections necessary by hand.

These instructions assume you have copied the GOC-ITB config to the respective factory and named it glideinWMS.xml.itb.

  1. Remove _ITB from OSGVO_ITB and associated names:
    sed -e 's/OSGVO_ITB/OSGVO/g' -e 's/OSGVOHTPC_ITB/OSGVOHTPC/g' -e 's/OSGVOBigMem_ITB/OSGVOBigMem/g' -e 's/OSGVO_MULTICORE_ITB/OSGVO_MULTICORE/g' glideinWMS.xml.itb > glideinWMS.xml.itb2
    
  2. clone_glidein -other glideinWMS.xml.itb2 -out glideinWMS.xml.test -merge yes glideinWMS.xml
    
  3. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test and reconfig
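
Before running either rename on a real config, it can help to see what the two sed commands (this section and Cloning UCSD -> GOC-ITB) do to a single line. The sketch below wraps the exact sed expressions from the instructions in two functions that read standard input; the function names and the sample line are made up for illustration.

```shell
# The sed expressions are copied from the cloning instructions;
# to_itb and from_itb are illustrative wrapper names.

# UCSD -> GOC-ITB: append _ITB to OSGVO and associated names.
to_itb() {
    sed -e 's/OSGVO\([,"]\)/OSGVO_ITB\1/g' \
        -e 's/OSGVOHTPC/OSGVOHTPC_ITB/g' \
        -e 's/OSGVOBigMem/OSGVOBigMem_ITB/g' \
        -e 's/OSGVO_MULTICORE/OSGVO_MULTICORE_ITB/g'
}

# GOC-ITB -> UCSD: strip _ITB again.
from_itb() {
    sed -e 's/OSGVO_ITB/OSGVO/g' \
        -e 's/OSGVOHTPC_ITB/OSGVOHTPC/g' \
        -e 's/OSGVOBigMem_ITB/OSGVOBigMem/g' \
        -e 's/OSGVO_MULTICORE_ITB/OSGVO_MULTICORE/g'
}

# Made-up sample attribute value, renamed and then restored:
#   printf '%s\n' 'value="OSGVO,OSGVOHTPC,CMS"' | to_itb | from_itb
```

Note that the UCSD -> GOC-ITB direction is not safe to run twice on the same file: a second pass would turn OSGVOHTPC_ITB into OSGVOHTPC_ITB_ITB. Always start from a fresh copy of the source config.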

Installing Factory Condor from scratch

Please see instructions under Installing Factory Condor from Scratch

Upgrading Factory Condor

Factories with an RPM install:

The following commands need to be run as the root user:

  1. Stop glideinWMS:
    service gwms-factory stop
  2. Stop Condor:
    service condor stop
  3. Update condor using yum. Note: for the ITB factory, you'll likely want to use the osg-development repo instead of osg.
    yum update --enablerepo epel --enablerepo osg condor condor-classads condor-cream-gahp condor-procd
  4. Start Condor:
    service condor start
  5. Wait for Condor to finish starting and for the load to settle (watch top, as described under Restarting Factory after Reboot), then start glideinWMS:
    service gwms-factory start

Factories using a non-RPM install:

Go to the condor website and download the tarball as root user:

http://research.cs.wisc.edu/htcondor/downloads/

As of 2014-01-09:

For the UCSD factory we currently use condor-rel-x86_64_RedHat6-stripped.tar.gz

For CERN we use condor-rel-x86_64_RedHat5-stripped.tar.gz

For the other factories we use condor-rel-x86_RedHat5-stripped.tar.gz

cd /root/Downloads
wget http://parrot.cs.wisc.edu//symlink/tmp_path_to_tarball/condor-rel-x86_RedHat5-stripped.tar.gz

As gfactory stop Factory:

cd $GLIDEIN_FACTORY_DIR
./factory_startup stop

The next commands need to be run as the root user.

  1. Stop Condor:
    /etc/init.d/condor stop
  2. Run upgrade script:
    /root/glideinwms/install/glidecondor_upgrade condor-rel-x86_RedHat5-stripped.tar.gz 
  3. Start Condor with init.d script:
    /etc/init.d/condor start
  4. Run top and watch the load. Only proceed after the load average has dropped considerably and the idle CPU percentage (%id) is comfortably above 0%

As gfactory start the factory:

./factory_startup start

Upgrading Glidein Condor

Click here for instructions on how to upgrade the Condor tarballs for glideins to use.

Upgrading GlideinWMS in Place

Factories with RPM installs:

Note: The following needs to be done as root

  1. Shut down the factory
    service gwms-factory stop
    
  2. Update the factory packages using yum. Note: The ITB factory may need the osg-development repo instead of osg
    yum update --enablerepo epel --enablerepo osg glideinwms-factory glideinwms-factory-condor
  3. Follow the instructions to rebuild all of the glidein condor tarballs and update the factory config with the new ones as outlined in the Upgrading Glidein Condor section.
  4. Upgrade the factory
    service gwms-factory upgrade
  5. Start the factory back up
    service gwms-factory start

Factories with non-RPM installs:

This should be the standard method to upgrade GlideinWMS unless significant changes have been made to the code base. Otherwise, it is preferable to create a new factory instance with the following instructions:
Upgrading GlideinWMS with New Instance

Back Up Old GlideinWMS (Optional)

Check if there are any manually applied patches:

cd $GLIDEIN_SRC_DIR
git status

If there are and they are worth saving, it is easiest to just backup the whole git repo:

cd ..
rsync -av glideinWMS/ glideinWMS-old

old should signify the glideinWMS version number you are backing up.

Upgrade Procedure

  1. Shut down Factory:
    cd $GLIDEIN_FACTORY_DIR
    ./factory_startup stop
  2. Check to make sure there are no running factory python processes:
    ps -u gfactory
  3. Fetch latest code and checkout new where new is the desired tag or branch name:
    cd $GLIDEIN_SRC_DIR
    git fetch
    git checkout new
  4. Now you must build all glidein condor tarballs using create_condor_tarball in the new Prestage directory, and update the new config file accordingly at this point, as outlined in the Upgrading Glidein Condor section.
  5. Run upgrade and supply full absolute path of config file:
    cd $GLIDEIN_FACTORY_DIR
    ./factory_startup upgrade ${GLIDEIN_FACTORY_DIR}.cfg/glideinWMS.xml
    
  6. Restart Factory:
    ./factory_startup start

Upgrading GlideinWMS with New Instance

These instructions should only be followed if significant changes have been made to the code base. Otherwise, it is preferable to upgrade in place with the following instructions:
Upgrading GlideinWMS in Place

Back Up Old GlideinWMS (Optional)

Check if there are any manually applied patches:

cd $GLIDEIN_SRC_DIR
git status

If there are and they are worth saving, it is easiest to just backup the whole git repo:

cd ..
rsync -av glideinWMS/ glideinWMS-old

old should signify the glideinWMS version number you are backing up.

Upgrade Procedure

  1. Create new instance directory and copy over config file:
    cd $GLIDEIN_FACTORY_DIR/..
    mkdir glidein_new.cfg
    cp glidein_old.cfg/glideinWMS.xml glidein_new.cfg/
  2. Copy over any validation scripts or wrappers used in previous instance:
    cp glidein_old.cfg/*.sh glidein_new.cfg/
    cp glidein_old.cfg/*.source glidein_new.cfg/
  3. Create new Prestage dir for tarballs:
    mkdir glidein_new.cfg/Prestage
  4. Edit glidein_new.cfg/glideinWMS.xml to use new name:
    glidein_name="new"
  5. Also replace any references of glidein_old.cfg with glidein_new.cfg in the new config file. You might also like to remove all disabled (enabled="False") entries. You can do this simply by doing:
    sed '/enabled="False"/,/<\/entry>/d' -i glidein_new.cfg/glideinWMS.xml
  6. Shut down Factory:
    cd $GLIDEIN_FACTORY_DIR
    ./factory_startup stop
  7. Check to make sure there are no running factory python processes:
    ps -u gfactory
  8. Fetch latest code and checkout new where new is the desired tag or branch name:
    cd $GLIDEIN_SRC_DIR
    git fetch
    git checkout new
  9. Now you must build all glidein condor tarballs using create_condor_tarball in the new Prestage directory, and update the new config file accordingly at this point, as outlined in the Upgrading Glidein Condor section.
  10. Create new factory instance:
    cd $GLIDEIN_FACTORY_DIR/..
    $GLIDEIN_SRC_DIR/creation/create_glidein glidein_new.cfg/glideinWMS.xml
  11. Copy the old downtimes file to newly created instance dir:
    cp glidein_old/glideinWMS.downtimes glidein_new/
  12. Change into the newly created glidein_new directory and start up the new instance.
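
The sed range deletion in step 5 edits the config in place, so it is worth previewing first. The sketch below uses the same sed range without -i, plus a companion that lists the entries that would be dropped. Both function names are hypothetical. The listing helper assumes name is the first attribute of the entry tag, as in the Entry Templates section. The deletion assumes enabled="False" appears only on entry opening lines; if any other element carried it, the range would start there instead.

```shell
# Hypothetical preview helpers; both read a config on standard input.

# Print the config with every disabled <entry> block removed.
# Same range as step 5, but nothing is modified in place.
drop_disabled_entries() {
    sed '/enabled="False"/,/<\/entry>/d'
}

# Print the names of the entries whose opening tag has enabled="False".
list_disabled_entries() {
    sed -n '/<entry .*enabled="False"/s/.*<entry name="\([^"]*\)".*/\1/p'
}

# Example preview before running the in-place sed:
#   list_disabled_entries < glidein_new.cfg/glideinWMS.xml
#   drop_disabled_entries < glidein_new.cfg/glideinWMS.xml | diff glidein_new.cfg/glideinWMS.xml -
```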

Post Upgrade Actions

Edit gfactory user .bash_profile:

export GLIDEIN_FACTORY_DIR=/home/gfactory/glideinsubmit/glidein_new 

At UCSD and GOC as root, update the osg_gfactory monitoring symlink:

cd /var/www/html
rm osg_gfactory
ln -s glidefactory/monitor/glidein_new osg_gfactory

Upgrading GlideinWMS v2_7 Gotchas

GlideinWMS v2_7 has significant changes so it is best to follow the above:
Upgrading GlideinWMS with New Instance

The source code directory must be renamed to all lowercase glideinwms. A good time to do this is in step 8 of the above instructions before the git fetch.

The gfactory user's .bash_profile will have to be modded:

export GLIDEIN_SRC_DIR=/path/to/src/glideinwms 

An additional mod needs to be added to the .bash_profile to get around the analyze_entries bug:

export GLIDEIN_MON_URL=$GLIDEIN_FACTORY_DIR

Finally, factools will have to be switched to the special compat branch:

cd $GLIDEIN_FACTOOLS
git checkout dev_2_7_compat

Areas needing backup

The glidein factory is mostly stateless... if we were to lose the disk used by it, we should be able to reconstruct the gfactory within hours by using a few config files.

The main configuration file is glideinWMS.xml. It defines almost everything else in a factory configuration.
To be on the safe side, one should however backup the whole factory directory tree... currently this is:

/var/gfactory/glideinsubmit/glidein_v2_0/

Since there may be several factories installed on the same node, backing up the base directory is the easiest solution to not forget any of them:

/var/gfactory/glideinsubmit/

Please note that the directories above contain symlinks to other areas in the file system;
none of those need to be backed up, as they can be recreated if needed.
Moreover, while the base factory directory is relatively static and small (currently ~50M), the linked directories are very dynamic and can grow quite a bit.

Nothing else in the factory should need to be backed up;
all the code should be in Git or downloadable from an official repository.

If there are any experimental or in-development code pieces, those should use a separate backup policy.

The factory also relies heavily on Condor, so the basic Condor config files should be backed up as well.
Unfortunately, the config files are split between three directories, so all three must be backed up:

/opt/glidecondor/etc/
/opt/glidecondor/certs/
/etc/condor/

Condor also needs the host certificate to function, so

/etc/grid-security

should be backed up, too.

Nothing else in Condor needs to be backed up, as it can easily be recreated using the glideinWMS installation script.

The same should apply to all other software components the factory is relying on.

Dealing with Scalability limits

Number of Process Limits

In RHEL 6 the default number of processes per user is conservatively low, set to 1024. This will likely affect factory performance at full scale. Add /etc/security/limits.d/91-userlimits.conf:

# we need many processes, for Condor
*   soft   nproc   128297

FD Limits

NOTE This may no longer be an issue, as glideinWMS has significantly reduced the number of needed FDs in the factory code

Factory scales with the number of entries in the config. Eventually gfactory user max open file limits will be hit. This can be seen in ~/glideinsubmit/glidein_Production_v2_0/log/factory/factory.*.info.log:

[2012-05-11T12:44:02-07:00 6730] WARNING: Exception occurred: ['Traceback (most recent call last):\n', '  File "/home/gfactory/glideinWMS/factory/glideFactory.py", line 432, in main\n    glideinDescript,entries,restart_attempts,restart_interval)\n', '  File "/home/gfactory/glideinWMS/factory/glideFactory.py", line 213, in spawn\n    childs[entry_name]=popen2.Popen3("%s %s %s %s %s %s %s"%(sys.executable,os.path.join(STARTUP_DIR,"glideFactoryEntry.py"),os.getpid(),sleep_time,advertize_rate,startup_dir,entry_name),True)\n', '  File "/usr/lib64/python2.4/popen2.py", line 43, in __init__\n    c2pread, c2pwrite = os.pipe()\n', 'OSError: [Errno 24] Too many open files\n']

To deal with this, increase ulimits. Right now we have it at 50k for gfactory user. In ~/.bash_profile:

ulimit -n 50240

in /etc/security/limits.conf:

gfactory        hard    nofile          50240

After changing these, log out and back in as gfactory, then stop and restart the factory.
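
After logging back in, the new limit can be verified from the gfactory shell. This is a minimal sketch with hypothetical helper names; the 50240 value is the one used above, and count_open_fds relies on /proc, so it is Linux-only.

```shell
# Hypothetical helpers for checking the open-file limit.

# Succeed when the limit ($1, as printed by `ulimit -n`) is at least
# the wanted value ($2). A limit of "unlimited" always passes.
nofile_limit_ok() {
    if [ "$1" = "unlimited" ]; then
        return 0
    fi
    [ "$1" -ge "$2" ]
}

# Print how many file descriptors a process ($1 = PID) has open (Linux only).
count_open_fds() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# Example check in the current shell:
#   nofile_limit_ok "$(ulimit -n)" 50240 || echo "open-file limit too low" >&2
```

Comparing count_open_fds for the main factory process against the limit gives a rough idea of how much headroom is left as entries are added.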

Factory Specific Notes

Factory Software and Patches

UCSD

Date Software Type Description
2014-07-30 gwms v3_2_6  
2014-08-20 condor 8.2.2  

GOC

Date Software Type Description
2014-06-10 gwms v3_2_5  
2014-06-10 gwms git cherry-pick 427f074b3dbccd2a4c997211e3a9a62e2e377d58 Add HTCondorCE RSL support
2014-06-10 gwms git cherry-pick e02560caa1478a64464f881aa062d4cd75a9885c Add condor_chirp to tarballs
2014-02-11 condor 8.0.5  

CERN 0305

Date Software Type Description
2014-07-30 gwms v3_2_6  
2014-08-20 condor 8.2.2  

CERN 32

Date Software Type Description
2014-06-10 gwms v3_2_5  
2014-06-10 gwms git cherry-pick 664c1daf5d651369991de1d5e33b5c6538c0c5f4 Add autoupdate to monitoring page
2014-06-10 gwms git cherry-pick 427f074b3dbccd2a4c997211e3a9a62e2e377d58 Add HTCondorCE RSL support
2014-06-10 gwms git cherry-pick e02560caa1478a64464f881aa062d4cd75a9885c Add condor_chirp to tarballs
2014-02-11 condor 8.0.5  

GOC-ITB

Date Software Type Description
2014-07-23 gwms v3_2_6_rc3  
2014-05-30 condor 8.2.2  

GOC Factory Things to Remember

Production Factory: glidein.grid.iu.edu
ITB Factory: glidein-itb.grid.iu.edu

Turn off timeout for sudo

run:

/usr/sbin/visudo

add the following:

Defaults    timestamp_timeout = 0

Firewall settings

For condor we give a port range of 20k-50k. See the /etc/iptables.d files for details. Also the condor config must know about it:

###################
# Firewall limits
###################
HIGHPORT=50000
LOWPORT=20000

Frontend Support

Adding a New Frontend

Click here for instructions on how to register a new Frontend to the Factory.

How To Open A Ticket To Contact Glidein Factory Support

NOTE This procedure is likely obsolete and needs to be verified with GOC

Although we encourage users to contact us directly at osg-gfactory-support@physics.ucsd.edu, a ticket may be opened should the user deem it appropriate.

Until GOC institutes a custom form for the Glidein Factory, begin by visiting https://ticket.grid.iu.edu/goc/other and load your certificate. Please select the VO on whose behalf you are submitting the ticket. Then under "Add CC" add osg-gfactory-support@physics.ucsd.edu. Finally, type a message to describe the problem and hit submit.

Monitoring Reference

Glidein Factory Status

http://glidein-1.t2.ucsd.edu:8319/osg_gfactory/factoryStatus.html

Load this page; the default "Entry" is 'total' (it can also be per-site). Then hit "update."

At the simplest level, there are four graphs to look at which are all displayed on top of one another on this page:

  • Running glidein jobs (green solid, on by default)
  • Glideins at collector (black line, not on by default)
  • Glideins claimed by user jobs (purple line, on by default)
  • Glideins not matched (yellow line, on by default)

What to look for with these:

Glideins claimed (purple) should not be much lower than the green envelope.

Glideins at collector (black) should also not be much lower than the green envelope.

Glideins not matched (yellow) should not be very large (relative to glideins claimed or running).

Each of these conditions can be temporary; for example, glideins not matched can spike and then go down when many jobs are submitted at once. This is not a problem. When the above conditions persist, a problem is more likely.

Glidein Factory Status Now

http://glidein-1.t2.ucsd.edu:8319/osg_gfactory/factoryStatusNow.html

This page displays a table of live data corresponding to the plots under Glidein Factory Status. The information is further divided by VO. See GlideinFactoryStatusNow (NOTE needs updating) for a longer and more detailed discussion.

Log Reference

Glidein output logs:

$GLIDEIN_FACTORY_DIR/client_log/user_*/entry_*/job.*.out
$GLIDEIN_FACTORY_DIR/client_log/user_*/entry_*/job.*.err

Glidein user logs:

$GLIDEIN_FACTORY_DIR/client_log/user_*/entry_*/condor_activity_*.log

Condor daemon logs:

/opt/glidecondor/condor_local/log/*Log

NOTE On GOC machines:

/usr/local/glidecondor/condor_local/log/*Log

Condor gridmanager logs:

/dev/shm/GridmanagerLog.schedd_glideins*

NOTE On GOC machines:

/tmp/GridmanagerLog.schedd_glideins*

Factory daemon logs:

$GLIDEIN_FACTORY_DIR/log/factory/factory.*.log
$GLIDEIN_FACTORY_DIR/log/entry_*/factory.*.log

Completed glidein logs:

$GLIDEIN_FACTORY_DIR/log/entry_*/completed_jobs_*.log
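
When debugging a single entry, the newest glidein stderr files are usually the ones of interest. The sketch below lists them using the client_log layout shown above; recent_glidein_errs is a hypothetical helper name and the default of five files is an arbitrary choice.

```shell
# Hypothetical helper: print the newest glidein stderr logs for one entry.
#   $1 = factory instance directory (e.g. $GLIDEIN_FACTORY_DIR)
#   $2 = entry name, without the entry_ prefix
#   $3 = how many files to show (assumed default: 5)
# Relies on `ls -t` ordering by modification time, newest first.
recent_glidein_errs() {
    ls -t "$1"/client_log/user_*/entry_"$2"/job.*.err 2>/dev/null |
        head -n "${3:-5}"
}

# Example (the entry name is taken from the proxy_info example in this document):
#   recent_glidein_errs "$GLIDEIN_FACTORY_DIR" CMS_T2_US_UCSD_gw2 3
```

On entries with a very large number of log files the glob may exceed the shell's argument limit; in that case find with -newer is a safer starting point.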

Tool Reference

Running analyze_entries status report

cd $GLIDEIN_FACTORY_DIR
analyze_entries -x 24 -s waste

Run command with -h to print explanation of possible options.

This report is sent to osg-gfactory-reports@physics.ucsd.edu daily.

Using proxy_info to Verify Pilot Proxies

An example of how to verify the pilot proxies used by the frontend.

  1. Get a list of the proxies for a VO and CE:
    proxy_info fecms ls -l /var/gfactory/clientproxies/user_fecms/glidein_v2_0/entry_CMS_T2_US_UCSD_gw2/
  2. Display a particular proxy's information:
    proxy_info fecms info -all /var/gfactory/clientproxies/user_fecms/glidein_v2_0/entry_CMS_T2_US_UCSD_gw2/x509_CMS_T2_US_UCSD_gw2@v2_0@UCSD@UCSD.minus,v5_0.dot,main_umrw_5.proxy
  3. For additional tool help run:
    proxy_info -h
    

NOTE at CERN you must first source:

source /afs/cern.ch/cms/LCG/LCG-2/UI/cms_ui_env.sh

Site Debugging Reference

How to contact Grid sites

Non-CMS issues at OSG sites

Use GOC:

https://ticket.grid.iu.edu/goc/submit

  • For Email Address use osg-gfactory-support@physics.ucsd.edu
  • Check the Resource box and find the name corresponding to the GLIDEIN_ResourceName attribute in the $GLIDEIN_FACTORY_DIR/glideinWMS.xml
  • Include the Resource name in the Title so it is easy to find

CMS issues for All Sites

Use Savannah:

https://savannah.cern.ch/

  1. Search for CMS
  2. Select CMS Computing Infrastructure Support
  3. Click Submit a new item

NOTE the following assumes you have administrative rights

Fill out the following fields:

  • For Category select Facilities
  • For Assigned to select cmscompinfrasup-sitename
  • Set Use GGUS to No (this can be changed to Yes later if admins never respond)
  • For Site find the name corresponding to the GLIDEIN_CMSSite attribute in the $GLIDEIN_FACTORY_DIR/glideinWMS.xml
  • Include the CMS Site name in the Title so it is easy to find
  • In Add Email Addresses add osg_gfactory

NOTE As an exception UK admins complain that we should always set Use GGUS to Yes or they will not see the ticket: https://savannah.cern.ch/support/index.php?134388

NOTE It seems admins at T2_FR_GRIF_IRFU require GGUS as well.

If using GGUS, be sure to add the gfactory-support email list in the "Involve others" field. Otherwise, GGUS won't send out an email to us whenever the ticket is updated.

NOTE if the site squad in question cannot be found in Assigned to, follow the instructions in the next section, Non-CMS issues at European sites.

Non-CMS issues at European sites

Use GOC:

https://ticket.grid.iu.edu/goc/submit

IMPORTANT leave Resource unchecked.

Explain in the Description it is an EGI resource along with the GLIDEIN_ResourceName and mention you are forwarding it to GGUS on behalf of the affected VO. After submitting the ticket, click the GGUS (Prod) box in the Ticket Exchange options and click Update.

Globus Hold Reasons

| Globus Error Code | Held Reason | Job is Recoverable |
| 10 | globus_xio_gsi: Token size exceeds limit. Usually happens when someone tries to establish an insecure connection with a secure endpoint, e.g. when someone sends plain HTTP to an HTTPS endpoint without first establishing an SSL session. | No |
| 121 | the job state file doesn't exist | No |
| 126 | it is unknown if the job was submitted | Yes |
| 12 | the connection to the server failed (check host and port) | Yes |
| 131 | the user proxy expired (job is still running) | Maybe |
| 17 | the job failed when the job manager attempted to run it | No |
| 22 | the job manager failed to create an internal script argument file | No |
| 31 | the job manager failed to cancel the job as requested | No |
| 3 | an I/O operation failed | Yes |
| 47 | the gatekeeper failed to run the job manager | No |
| 48 | the provided RSL could not be properly parsed | No |
| 4 | jobmanager unable to set default to the directory requested | No |
| 76 | cannot access cache files in ~/.globus/.gass_cache, check permissions, quota, and disk space | Maybe (Short term: No) |
| 79 | connecting to the job manager failed. Possible reasons: job terminated, invalid job contact, network problems, ... | No |
| 7 | an authorization operation failed | Yes |
| 7 | authentication with the remote server failed | Yes |
| 8 | the user cancelled the job | No |
| 94 | the jobmanager does not accept any new requests (shutting down) | Yes |
| 9 | the system cancelled the job | No |
| ? | Job failed, no reason given by GRAM server | No |
| 122 | could not read the job state file | Maybe (short term: no) |
| 132 | the job was not submitted by original jobmanager | No (likely to be fatal) |
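
When triaging many held glideins, the table above can be turned into a small lookup. The sketch below is illustrative only: the function names are hypothetical, the Yes/Maybe/No values are copied from the table (without its short-term qualifiers), and codes the table does not list, such as 155, are reported as "not listed".

```shell
# Hypothetical helpers based on the hold reason table above.

# Extract the numeric code from a hold reason such as
# "Globus error 79: connecting to the job manager failed."
globus_error_code() {
    printf '%s\n' "$1" | sed -n 's/.*Globus error \([0-9][0-9]*\).*/\1/p'
}

# Print whether a job held with Globus error code $1 is recoverable.
globus_recoverable() {
    case "$1" in
        3|7|12|94|126)                      echo "Yes" ;;
        76|122|131)                         echo "Maybe" ;;
        4|8|9|10|17|22|31|47|48|79|121|132) echo "No" ;;
        *)                                  echo "not listed" ;;
    esac
}

# Example:
#   globus_recoverable "$(globus_error_code 'Globus error 12: the connection to the server failed')"
```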

Some background on Globus

In Globus' Hold Reasons, "job manager" refers to the process running on the CE responsible for submitting to the local batch system. The process is called globus-job-manager.

Additional notes

  • Globus error 79: connecting to the job manager failed.  Possible reasons: job terminated, invalid job contact, network problems, ...
    • This can happen if it is a condor site and the admin removes held glideins from their side.
    • Also happens to every CERN Production Glidein every Monday on every gt5 site, but as of yet we still don't know why.

  • Globus error 9: the system cancelled the job
    • Happens when sites preempt glideins for exceeding memory limits or preempt opportunistic glideins (seen at Michigan and BNL).

  • Globus error 17: the job failed when the job manager attempted to run it

  • Globus error 17, 31, 79, 121, 155
    • These glideins may not be recoverable, and the factory attempts to remove them.
    • The factory does not always succeed, so you may have to do it manually with condor_rm -forcex
      • In particular, even if you remove these glideins, when they turn into unknown state ("X"), they might turn back into held state ("H"). So -forcex is the way to remove them definitively

  • Globus error 155
    • The globus-job-manager is likely unable to send a file back to the factory

  • Globus error 31: the job manager failed to cancel the job as requested
    • This happens for various reasons but one case we have observed is hitting memory limits on the WN (batch was PBS)

  • Globus error 10: globus_xio_gsi: Token size exceeds limit. Usually happens when someone tries to establish an insecure connection with a secure endpoint, e.g. when someone sends plain HTTP to an HTTPS endpoint without first establishing an SSL session.

  • Globus error 47: the gatekeeper failed to run the job manager
    • This has been seen when the disk on the gatekeeper gets full

Nordugrid Hold Reasons

CREAM Hold Reasons (Work in progress)

Link to a summary page on CREAM troubleshooting

Reasons we mostly understand

  • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 500 500-Command failed. : globus_l_gfs_file_open failed.  500-globus_xio: Unable to open file /cream_localsandbox/data/prdcms/_DC_ch_DC_cern_OU_computers_CN_cmspilotjob_vocms157_cern_ch_cms_Role_production_Capability_NULL_prdcms35/99/CREAM999015216/OSB/job.479371.9.out  500-globus_xio: System error in open: No such file or directory  500-globus_xio: A system call failed: No such file or directory  500 End.
    • Happens when CREAM site has no memory of job (possibly removed on remote side) but gridmanager refuses to give up

  • CREAM error: CREAM_Job_Register Error: MethodName=[jobRegister] ErrorCode=[0] Description=[The CREAM service cannot accept jobs at the moment] FaultCause=[Submissions are disabled!] Timestamp=[Fri 09 Nov 2012 14:44:25]
    • Site is probably down for maintenance

  • CREAM_Delegate Error: Received NULL fault; the error is due to another cause: FaultString=[connection error] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] - FaultDetail=[Connection timed out]
    • The site is likely down, but sometimes it is because we are being blocked by a firewall: https://savannah.cern.ch/support/?120361. To check whether a firewall could be the cause, try telnet: telnet cream01.iihe.ac.be 8443

  • CREAM_Delegate Error: Received NULL fault; the error is due to another cause: FaultString=[connection error] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] - FaultDetail=[Unknown host]
    • Likely a site that has been decommissioned, so its hostname is no longer valid, or possibly a typo in the hostname in the config.

  • CREAM_Delegate Error: Authorization error: System error reading local user information
    • We see this on old CREAM installs that don't like / symbols in DNs

  • CREAM_Delegate Error: Authorization error: Failed to get the local user id via glexec

  • CREAM error: CREAM_Job_Register Error: MethodName=[jobRegister] Timestamp=[Tue 18 Oct 2011 06:39:04] ErrorCode=[0] Description=[delegation error: delegation id "1318117200.654933" not found!] FaultCause=[delegProxyInfo "1318117200.654933" not found!]
    • This happens when there is a really old running glidein on the queue (likely lost in rundiff) with an expired lease. It prevents all later glideins with the same user from obtaining a new lease. Just remove it; the held jobs should then recover.

  • CREAM error: BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:mkdir: cannot create directory `/var/jwgen//crm02_844243283.debug': File exists-[ERROR] Globus::GRAM::Error::JOB_UNSUBMITTED-Invalid job description-) N/A (jobId = CREAM844243283)

  • CREAM error: Transfer failed: globus_ftp_control: gss_init_sec_context failed OpenSSL Error: s3_clnt.c:1063: in library: SSL routines, function SSL3_GET_SERVER_CERTIFICATE: certificate verify failed globus_gsi_callback_module: Could not verify credential globus_gsi_callback_module: Can't get the local trusted CA certificate: Untrusted self-signed certificate in chain with hash e7734335
    • Can occur if the CA on the factory is out of date, and the gatekeeper identifies itself with a certificate that's newer than our CA. Try using "yum update" to get a more current CA.
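
For the "Connection timed out" case above, the telnet check can also be scripted. This is a minimal sketch using only the Python standard library; the function name can_connect and the 10 second default timeout are assumptions for illustration.

```python
import socket


def can_connect(host, port, timeout=10.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unknown hosts.
        return False
```

For example, can_connect("cream01.iihe.ac.be", 8443) mirrors the telnet command. If the port is reachable from another network but not from the factory, a firewall is a plausible cause, though this test alone does not prove it.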

Reasons we don't understand

  • CREAM error: Transfer failed: GRIDFTP_TRANSFER timed out
  • CREAM error: BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:qsub: Queue is not enabled MSG=queue is disabled: user cmprd003@ce08.pic.es, queue glong_sl5-) N/A (jobId = CREAM102242343)
    • This can be seen when glideins are submitted to a site in downtime.

  • The following are likely because the job manager on the other end no longer has any record of the glideins, so they can probably be removed safely (if the site isn't in downtime).
    • CREAM error: reason=999
      • (not really sure what this means)
    • CREAM error: CREAM_Job_Purge Error: job does not exist
    • CREAM error: job aborted because the execution of the JOB_START command has been interrupted by the CREAM shutdown
      
    • CREAM error: BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:pbs_iff: cannot read reply from pbs_server-No Permission.-qsub: cannot connect to server pbs03.pic.es (errno=15007) Unauthorized Request -) N/A (jobId = CREAM258408629)
      
    • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 500 500-Command failed. : globus_l_gfs_file_open failed.  500-globus_xio: Unable to open file /opt/glite/var/cream_sandbox/lt2-cmsprd/_DC_ch_DC_cern_OU_computers_CN_cmspilotjob_vocms157_cern_ch_cms_Role_production_Capability_NULL_lt2-cmsprd713/95/CREAM956614543/OSB/job.714256.8.out  500-globus_xio: System error in open: No such file or directory  500-globus_xio: A system call failed: No such file or directory  500 End.
  • If the following are seen over many entries served by the same gridmanager it may be a local issue (but not always). Killing the gridmanager without -9 seems to clear them up:
    • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 500 Command failed. : globus_xio: An end of file occurred
    • CREAM error: Transfer failed: globus_ftp_control_local_port(): Handle not in the proper state CLOSING.
    • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 530 530-Login incorrect. : globus_gss_assist: Error invoking callout  530-globus_callout_module: The callout returned an error  530-an unknown error occurred  530 End.
    • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 500 500-Command failed. : callback failed.  500-an end-of-file was reached  500-globus_xio: The GSI XIO driver failed to establish a secure connection. The failure occured during a handshake read.  500-globus_xio: An end of file occurred  500 End.
    • CREAM error: Transfer failed: globus_ftp_control: gss_init_sec_context failed globus_gsi_callback_module: Could not verify credential globus_gsi_callback_module: Could not verify credential globus_gsi_callback_module: Invalid CRL: The available CRL has expired

New Improved Docs (based on Alison's notes)

FactoryOpsGlideinWMS

FactoryInfo

Ops support

  • Internal tickets
  • mailing lists
  • access and cross-factory support

Factory Ops

  • Creating a new instance (and preserve monitoring history?)
  • Initial setup (daily emails, processes/monitoring, ??)
  • Adding new entries
  • Adding a frontend
  • Upgrading
    • factory
    • condor
  • Cloning
    • sites from one factory to another
    • Global cloning, such as t1 site group
  • Removing schedds (I think docs for this may be wrong?)
  • Attributes (link to gwms docs)
  • Finding missing sites
  • Removing glideins (includes scripts)
  • Submitting test jobs
  • Putting sites in downtime
  • Submitting Tickets
  • Decommissioning sites
  • Factory Disk warnings
  • Entry issues
    • CREAM
    • globus
    • Misc
  • Removing old entries

Daily Ops Monitoring

  • Mailing list
  • Internal tickets (Jira)
  • Daily emails
  • Analyze Entries
  • Web pages
  • Held jobs
  • Infosys
  • Misc
  • .err log problems
  • HOLD problems
  • Condor Activity Log problems

Daily Ops Other issues

  • Restarting the grid manager
  • Handling stuck waiting glideins
  • Rundiffs
  • Unmatched jobs

Additional References

  • logs
  • monitoring tools
  • proxies
  • ssh logins
  • git commands
  • frontend security info
  • BDII
  • Log Retention rules
  • Security
  • Condor G
  • Useful scripts
  • Misc

Future work

  • rpms

Authors

-- TerrenceMartin

-- IgorSfiligoi

<-- TWIKI VARIABLES 
  • Set UCSD_VERS = v2_0
-->

Revision 168 2016/05/02 - Main.JeffreyDost (unreadable)
Revision 167 2016/02/05 - Main.JeffreyDost (unreadable)
Revision 166 2015/05/11 - Main.BrendanDennis (unreadable)
Revision 165 2015/01/13 - Main.JeffreyDost (unreadable)
Revision 164 2014/11/14 - Main.JeffreyDost (unreadable)
Revision 163 2014/09/10 - Main.JeffreyDost (unreadable)
Revision 162 2014/08/06 - Main.JeffreyDost (unreadable)
Revision 161 2014/06/11 - Main.JeffreyDost (unreadable)
Revision 160 2014/06/10 - Main.JeffreyDost (unreadable)
Revision 159 2014/06/09 - Main.JeffreyDost (unreadable)
Revision 158 2014/06/06 - Main.JeffreyDost (unreadable)
Revision 157 2014/05/23 - Main.JeffreyDost (unreadable)
Revision 156 2014/05/19 - Main.JeffreyDost (unreadable)
Revision 155 2014/05/14 - Main.LuisLinares (unreadable)
Revision 154 2014/05/14 - Main.JeffreyDost (unreadable)
Revision 153 2014/05/12 - Main.DanielKlein (unreadable)
Revision 152 2014/04/23 - Main.JeffreyDost (unreadable)
Revision 151 2014/04/23 - Main.JeffreyDost (unreadable)
Revision 150 2014/04/16 - Main.JeffreyDost (unreadable)
Revision 149 2014/03/17 - Main.LuisLinares (unreadable)
Revision 148 2014/03/10 - Main.JeffreyDost (unreadable)
Revision 147 2014/03/10 - Main.JeffreyDost (unreadable)
Revision 146 2014/03/05 - Main.JeffreyDost (unreadable)
Revision 145 2014/02/24 - Main.JeffreyDost (unreadable)
Revision 144 2014/02/13 - Main.JeffreyDost (unreadable)
Revision 143 2014/02/12 - Main.JeffreyDost (unreadable)
Revision 142 2014/01/09 - Main.JeffreyDost (unreadable)
Revision 141 2014/01/09 - Main.JeffreyDost (unreadable)
Revision 140 2014/01/07 - Main.JeffreyDost (unreadable)
Revision 139 2014/01/07 - Main.JeffreyDost (unreadable)
Revision 138 2014/01/06 - Main.JeffreyDost (unreadable)
Revision 137 2014/01/02 - Main.JeffreyDost (unreadable)
Revision 136 2013/11/15 - Main.DanielKlein (unreadable)
Revision 135 2013/11/14 - Main.JeffreyDost (unreadable)
Revision 134 2013/11/14 - Main.JeffreyDost (unreadable)
Revision 133 2013/11/10 - Main.LuisLinares (unreadable)
Revision 132 2013/11/05 - Main.LuisLinares (unreadable)
Revision 131 2013/11/05 - Main.LuisLinares (unreadable)
Revision 130 2013/10/03 - Main.JeffreyDost (unreadable)
Revision 129 2013/09/23 - Main.JeffreyDost (unreadable)
Revision 128 2013/09/21 - Main.JeffreyDost (unreadable)
Revision 127 2013/09/11 - Main.JeffreyDost (unreadable)
Revision 126 2013/09/11 - Main.LuisLinares (unreadable)
Revision 125 2013/09/10 - Main.JeffreyDost (unreadable)
Revision 124 2013/09/05 - Main.JeffreyDost (unreadable)
Revision 123 2013/08/22 - Main.JeffreyDost (unreadable)
Revision 122 2013/08/19 - Main.LuisLinares (unreadable)
Revision 121 2013/07/30 - Main.JeffreyDost (unreadable)
Revision 120 2013/07/16 - Main.JeffreyDost (unreadable)
Revision 119 2013/07/13 - Main.JeffreyDost (unreadable)
Revision 118 2013/06/19 - Main.JeffreyDost (unreadable)
Revision 117 2013/06/19 - Main.LuisLinares (unreadable)
Revision 116 2013/06/13 - Main.JeffreyDost (unreadable)
Revision 115 2013/06/13 - Main.JeffreyDost (unreadable)
Revision 114 2013/06/07 - Main.LuisLinares (unreadable)
Revision 113 2013/06/07 - Main.JeffreyDost (unreadable)
Revision 112 2013/05/30 - Main.JeffreyDost (unreadable)
Revision 111 2013/05/29 - Main.JeffreyDost (unreadable)
Revision 110 2013/05/22 - Main.JeffreyDost (unreadable)
Revision 109 2013/05/15 - Main.LuisLinares (unreadable)
Revision 108 2013/05/10 - Main.LuisLinares (unreadable)
Revision 107 2013/05/09 - Main.JeffreyDost (unreadable)
Revision 106 2013/04/23 - Main.JeffreyDost (unreadable)
Revision 105 2013/04/23 - Main.JeffreyDost (unreadable)
Revision 104 2013/04/23 - Main.JeffreyDost (unreadable)
Revision 103 2013/04/19 - Main.LuisLinares (unreadable)
Revision 102 2013/04/19 - Main.JeffreyDost (unreadable)
Revision 101 2013/04/19 - Main.LuisLinares (unreadable)
Revision 100 2013/04/19 - Main.LuisLinares (unreadable)
Revision 99 2013/04/18 - Main.AlexGeorges (unreadable)
Revision 98 2013/04/18 - Main.LuisLinares (unreadable)
Revision 97 2013/04/18 - Main.JeffreyDost (unreadable)
Revision 96 2013/04/02 - Main.JeffreyDost (unreadable)
Revision 95 2013/03/21 - Main.JeffreyDost (unreadable)
Revision 94 2013/03/21 - Main.JeffreyDost (unreadable)
Revision 93 2013/03/21 - Main.JeffreyDost (unreadable)
Revision 92 2013/03/21 - Main.JeffreyDost (unreadable)
Revision 91 2013/03/20 - Main.JeffreyDost (unreadable)
Revision 90 2013/03/06 - Main.AlexGeorges (unreadable)
Revision 89 2013/02/05 - Main.JeffreyDost (unreadable)
Revision 88 2013/02/01 - Main.AlexGeorges (unreadable)
Revision 87 2013/01/28 - Main.AlexGeorges (unreadable)
Revision 86 2013/01/23 - Main.AlexGeorges (unreadable)
Revision 85 2013/01/22 - Main.JeffreyDost (unreadable)
Revision 84 2013/01/22 - Main.AlexGeorges (unreadable)
Revision 83 2013/01/16 - Main.JeffreyDost (unreadable)
Revision 82 2013/01/12 - Main.JeffreyDost (unreadable)
Revision 81 2013/01/10 - Main.JeffreyDost (unreadable)
Revision 80 2012/12/20 - Main.JeffreyDost (unreadable)
Revision 79 2012/12/03 - Main.AlexGeorges (unreadable)
Revision 78 2012/11/28 - Main.JeffreyDost (unreadable)
Revision 77 2012/11/20 - Main.JeffreyDost (unreadable)
Revision 76 2012/11/10 - Main.JeffreyDost (unreadable)
Revision 75 2012/11/09 - Main.JeffreyDost (unreadable)

Revision 74 2012/10/22 - Main.JeffreyDost

Added:
>
>
Revision 74 is unreadable
Deleted:
<
<
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->

Glidein Factory FAQ

Contents

Variables Used in this Document

Here is a list of variables used in this document as shorthand for common paths used in factory operations:

Variable | Value | Description
GLIDEIN_FACTORY_DIR | /home/gfactory/glideinsubmit/glidein_v2_0 | current factory instance directory
GLIDEIN_SRC_DIR | /home/gfactory/glideinWMS | glideinWMS source code directory
GLIDEIN_FACTOOLS | /home/gfactory/factools | factools repo location

Assumed gfactory Path Setup

This document assumes the gfactory user has the following set in the $PATH:

  • $GLIDEIN_SRC_DIR/factory/tools/
  • $GLIDEIN_FACTOOLS/UCSD/bin/
  • $GLIDEIN_FACTOOLS/generic/bin/

Basic Procedures

Reconfiguring Factory

  1. Change to the instance directory:
    cd $GLIDEIN_FACTORY_DIR
  2. Edit the glideinWMS.xml in the .cfg dir:
    vi ${GLIDEIN_FACTORY_DIR}.cfg/glideinWMS.xml
  3. Diff against the current config:
    diff glideinWMS.xml ${GLIDEIN_FACTORY_DIR}.cfg/glideinWMS.xml
  4. When you are satisfied, stop the factory, reconfigure, and then restart:
    ./factory_startup stop
    ./factory_startup reconfig
    ./factory_startup start
    

NOTE You may have to try stopping the factory multiple times if the load is high.

Restarting Factory after Reboot

  1. As root start httpd:
    /etc/init.d/httpd start
  2. As root start condor:
    /etc/init.d/condor start
  3. Run top and watch the load. Only proceed after the load average drops considerably and the idle CPU percentage (%id) is comfortably above 0%
  4. As gfactory start the factory:
    $GLIDEIN_FACTORY_DIR/factory_startup start
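
The "watch the load" step can be automated roughly as follows. This is a sketch only: it looks at the 1-minute load average and ignores %id, the function names are hypothetical, and the right max_load threshold depends on the machine, so it is left as a required argument.

```python
import os
import time


def load_is_low(max_load, loadavg=None):
    """Return True when the 1-minute load average is at or below max_load."""
    if loadavg is None:
        loadavg = os.getloadavg()  # (1 min, 5 min, 15 min)
    return loadavg[0] <= max_load


def wait_for_low_load(max_load, poll_seconds=30,
                      read_load=os.getloadavg, sleep=time.sleep):
    """Block until the load drops to max_load or below; return the number of polls."""
    polls = 1
    while not load_is_low(max_load, read_load()):
        sleep(poll_seconds)
        polls += 1
    return polls
```

Calling wait_for_low_load with a threshold suited to the host before starting the factory replaces the manual wait; read_load and sleep are parameters only so the loop can be exercised without a loaded machine.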

Site Debugging Procedures

NOTE to be written

Maintenance

Adding a New Site to Glidein Factory

Click here for instructions on how to add a site for VOs to use.

Entry Templates

NOTE this section is likely obsolete

CMS cream:

<--/twistyPlugin twikiMakeVisibleInline-->
      <entry name="" comment="" enabled="True" gatekeeper="https://%HOSTNAME%:8443/ce-cream/services/CREAM2 %BATCH% %QUEUE%" gridtype="cream" verbosity="std" work_dir="TMPDIR">
         <config>
            <max_jobs held="25" idle="400" running="10000">
               <max_job_frontends>
               </max_job_frontends>
            </max_jobs>
            <release max_per_cycle="20" sleep="0.2"/>
            <remove max_per_cycle="5" sleep="0.2"/>
            <restrictions require_voms_proxy="False"/>
            <submit cluster_size="10" max_per_cycle="100" sleep="0.2"/>
         </config>
         <downtimes/>
         <allow_frontends>
         </allow_frontends>
         <attrs>
            <attr name="CONDOR_OS" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="False" type="string" value="default"/>
            <attr name="GLEXEC_BIN" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="True" type="string" value="NONE"/>
            <attr name="GLIDEIN_CMSSite" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value=""/>
            <attr name="GLIDEIN_Max_Walltime" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="True" type="int" value="114840"/>
            <attr name="GLIDEIN_ResourceName" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value=""/>
            <attr name="GLIDEIN_SEs" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value=""/>
            <attr name="GLIDEIN_Site" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value=""/>
            <attr name="GLIDEIN_Supported_VOs" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="True" type="string" value="CMS"/>
            <attr name="USE_CCB" const="True" glidein_publish="True" job_publish="False" parameter="True" publish="True" type="string" value="True"/>
         </attrs>
         <files>
         </files>
         <infosys_refs>
            <infosys_ref ref="GlueCEUniqueID=" server="exp-bdii.cern.ch" type="BDII"/>
         </infosys_refs>
         <monitorgroups>
            <monitorgroup group_name="CMST2"/>
            <monitorgroup group_name="CMS"/>
         </monitorgroups>
      </entry>
<--/twistyPlugin-->
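
The template above leaves several values empty (the entry name, GLIDEIN_Site, and so on). A minimal sketch for spotting fields that still need filling in, using Python's standard XML parser; the function name is hypothetical, and the check covers only the name attribute of the entry and empty attr values, not the %HOSTNAME%, %BATCH% and %QUEUE% placeholders in the gatekeeper string.

```python
import xml.etree.ElementTree as ET


def unfilled_fields(entry_xml):
    """List the template fields that are still empty in an <entry> element."""
    entry = ET.fromstring(entry_xml)
    missing = []
    if not entry.get("name"):
        missing.append("entry name")
    for attr in entry.iter("attr"):
        if attr.get("value") == "":
            missing.append(attr.get("name"))
    return missing
```

Run against the CMS cream template as written, this would report the entry name plus the attrs whose value is "", such as GLIDEIN_CMSSite and GLIDEIN_Site.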

Cloning Factories

Below are examples of doing a global clone from UCSD to GOC and CERN factories. They assume you have copied the UCSD config to the respective factory and named it glideinWMS.xml.ucsd.

DISCLAIMER The examples are subject to change due to the constantly evolving nature of our config files. They are current as of 2012-10-02

Description of clone_glidein Arguments

  • -merge yes/no/only
    • yes - modify existing entries in addition to adding new ones
    • no - only add new entries
    • only - only merge existing; don't add new entries
  • -preserve_enable - when merging don't disable sites that are still enabled in original config
  • -disable_old - if site is in original config but no longer in "other" config, disable it
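
The three -merge modes can be pictured with a toy model. This is not how clone_glidein is implemented; it is a sketch that treats each config as a dict keyed by entry name and assumes "modify" means taking the version from the other config. It ignores -preserve_enable and -disable_old.

```python
def merge_entries(current, other, mode):
    """Toy model of the -merge modes; configs are dicts keyed by entry name."""
    if mode not in ("yes", "no", "only"):
        raise ValueError("mode must be 'yes', 'no' or 'only'")
    result = dict(current)
    for name, entry in other.items():
        exists = name in current
        if exists and mode in ("yes", "only"):
            result[name] = entry  # modify an existing entry
        elif not exists and mode in ("yes", "no"):
            result[name] = entry  # add a new entry
    return result
```

With current = {"A": 1, "B": 2} and other = {"B": 20, "C": 30}, mode "only" changes B but does not bring in C, which is why the procedures below start with a "merge only" pass.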

Cloning GOC Factory

  1. Do a "merge only" first to properly disable newly decommissioned sites without bringing in old ones with enabled="False".
    $GLIDEIN_SRC_DIR/creation/clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test -merge only glideinWMS.xml
    
  2. Run a second time with merge disabled and exclude disabled entries.
    $GLIDEIN_SRC_DIR/creation/clone_glidein -other glideinWMS.xml.ucsd -merge no -out glideinWMS.xml.test2 -exclude enabled False -merge no glideinWMS.xml.test
    
  3. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test2 and reconfig

Cloning GOC-ITB Factory

  1. Change OSGVO and OSGVOHTPC to OSGVO_ITB and OSGVOHTPC_ITB:
    sed -e 's/OSGVO\([^H]\)/OSGVO_ITB\1/g' -e 's/OSGVOHTPC/OSGVOHTPC_ITB/g' glideinWMS.xml.ucsd > glideinWMS.xml.ucsd2
    
  2. Manually add CMSOverflow to GLIDEIN_Supported_VOs for Engage_US_MWT2_osg and HCC_US_BNL_gk02 in glideinWMS.xml.ucsd2.
  3. Manually change GLIDEIN_Supported_VOs from glowVO to CMS for OSG_CrossOSG_ce in glideinWMS.xml.ucsd2.
  4. Do a "merge only" first to properly disable newly decommissioned sites without bringing in old ones with enabled="False". Also don't touch a few entries still to be tested on itb.
    $GLIDEIN_SRC_DIR/creation/clone_glidein -other glideinWMS.xml.ucsd2 -out glideinWMS.xml.test -merge only -exclude name HCC_US_BU_atlas_dque -exclude name HCC_US_BU_atlas_opteron -exclude name OSG_US_HAMPTONU_hugrid02 -exclude name OSG_US_LEHIGH_piranha -exclude name OSG_US_NotreDame_earth glideinWMS.xml
    
  5. Run a second time with merge disabled but exclude disabled entries.
    $GLIDEIN_SRC_DIR/creation/clone_glidein -other glideinWMS.xml.ucsd2 -merge no -out glideinWMS.xml.test2 -exclude enabled False -merge no glideinWMS.xml.test
    
  6. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test2 and reconfig
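
The sed command in step 1 of the GOC-ITB procedure can be read as the following Python sketch, shown here only to make the two substitutions explicit. The function name is hypothetical.

```python
import re


def rename_vos_for_itb(text):
    """Mirror the two sed expressions: OSGVO -> OSGVO_ITB, OSGVOHTPC -> OSGVOHTPC_ITB."""
    # First expression: OSGVO followed by any character other than 'H'.
    # sed works line by line, so the newline is excluded here as well.
    text = re.sub(r"OSGVO([^H\n])", r"OSGVO_ITB\1", text)
    # Second expression: the HTPC variant, which the first pattern skipped.
    return text.replace("OSGVOHTPC", "OSGVOHTPC_ITB")
```

Note that the substitution is not idempotent: applying it to text that already contains OSGVO_ITB gives OSGVO_ITB_ITB, so it should be run once against a fresh copy of glideinWMS.xml.ucsd.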

Cloning CERN Factory

  1. Do a "merge only" first to properly disable newly decommissioned sites without bringing in old ones with enabled="False". Also preserve comments.
    $GLIDEIN_SRC_DIR/creation/clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test -merge only -preserve_comments glideinWMS.xml
    
  2. Run a second time with merge disabled but only include what we want
    $GLIDEIN_SRC_DIR/creation/clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test2 -merge no -include name CMS_T1 -include name CMS_T2 -include name CMS_T3 -exclude enabled False -exclude name CMS_T1_US_FNAL_ce3 -exclude name CMS_T3_US_Omaha_tusker_bigmem -exclude name CMS_T3_US_Omaha_tusker_long_3d glideinWMS.xml.test
    
  3. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test2 and reconfig

Upgrading Factory Condor

Go to the condor website and download the tarball as root user:

http://www.cs.wisc.edu/condor/downloads-v2/download.pl

For the UCSD factory we currently use condor-rel-x86_rhap_5.x-stripped.tar.gz

cd /root/Downloads
wget http://parrot.cs.wisc.edu//symlink/tmp_path_to_tarball/condor-rel-x86_rhap_5.x-stripped.tar.gz

As gfactory stop Factory:

cd $GLIDEIN_FACTORY_DIR
./factory_startup stop

The next commands need to be run as the root user.

  1. Stop Condor:
    /etc/init.d/condor stop
  2. Run upgrade script:
    $GLIDEIN_SRC_DIR/install/glidecondor_upgrade condor-rel-x86_rhap_5.x-stripped.tar.gz
    
  3. Start Condor with init.d script:
    /etc/init.d/condor start
  4. Run top and watch the load. Only proceed after the load average drops considerably and the idle CPU percentage (%id) is comfortably above 0%

As gfactory start the factory:

./factory_startup start

Upgrading Glidein Condor

Click here for instructions on how to upgrade the Condor tarballs for glideins to use.

Upgrading GlideinWMS

Back Up Old GlideinWMS (Optional)

Check if there are any manually applied patches:

cd $GLIDEIN_SRC_DIR
git status

If there are, and they are worth saving, it is easiest to just back up the whole git repo:

cd ..
rsync -av glideinWMS/ glideinWMS-old

Here old stands for the glideinWMS version number you are backing up.

Upgrade Procedure

Shut down Factory:

cd $GLIDEIN_FACTORY_DIR
./factory_startup stop

Check to make sure there are no running factory python processes:

ps -u gfactory

Fetch the latest code and check out new, where new is the desired tag or branch name:

cd $GLIDEIN_SRC_DIR
git fetch
git checkout new

At this point it is highly recommended to rebuild all glidein condor tarballs using create_condor_tarball and update the config file accordingly, as outlined in the Upgrading Glidein Condor section. Do not reconfigure or restart the factory yet; after upgrading the tarballs and updating the config, proceed with the steps below.

Run upgrade and supply full absolute path of config file:

cd $GLIDEIN_FACTORY_DIR
./factory_startup upgrade ${GLIDEIN_FACTORY_DIR}.cfg/glideinWMS.xml

Restart Factory:

./factory_startup start

Areas needing backup

The glidein factory is mostly stateless... if we were to lose the disk used by it, we should be able to reconstruct the gfactory within hours by using a few config files.

The main configuration file is glideinWMS.xml. It defines almost everything else in a factory configuration.
To be on the safe side, however, one should back up the whole factory directory tree. Currently this is:

/var/gfactory/glideinsubmit/glidein_v2_0/

Since there may be several factories installed on the same node, backing up the base directory is the easiest solution to not forget any of them:

/var/gfactory/glideinsubmit/

Please notice that the directories above contain symlinks to other areas in the file system;
none of those need to be backed-up, as they can be recreated if needed.
Moreover, while the base factory directory is relatively static and small (currently ~50M), the linked directories are very dynamic and can grow quite a bit.

Nothing else in the factory should need to be backed up;
all the code should be in Git or downloadable from an official repository.

If there are any experimental or in-development code pieces, those should use a separate backup policy.

The factory also relies heavily on Condor, so the basic Condor config files should be backed up as well.
Unfortunately, the config files are split between three directories, so all three must be backed up:

/opt/glidecondor/etc/
/opt/glidecondor/certs/
/etc/condor/

Condor also needs the host certificate to function;

/etc/grid-security 

should thus be backed-up, too.

Nothing else in Condor needs to be backed up, as it can easily be recreated using the glideinWMS installation script.

The same should apply to all other software components the factory is relying on.
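
A backup of the areas listed above could be sketched as follows. This is an illustration, not an existing factory tool: the function name and archive layout are assumptions, and in practice it would need to run as root to read /etc/grid-security.

```python
import os
import tarfile

# Paths named in the section above.
BACKUP_PATHS = [
    "/var/gfactory/glideinsubmit/",
    "/opt/glidecondor/etc/",
    "/opt/glidecondor/certs/",
    "/etc/condor/",
    "/etc/grid-security",
]


def make_backup(archive_path, paths=BACKUP_PATHS):
    """Write a gzip-compressed tar of the paths that exist; return the paths included."""
    included = []
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in paths:
            if os.path.exists(path):
                # tarfile stores symlinks as links by default (dereference=False),
                # so the large linked areas are not pulled into the archive.
                tar.add(path)
                included.append(path)
    return included
```

Because symlinks are stored as links rather than followed, the archive stays close to the size of the base directories, in line with the note that the linked areas need no backup.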

Dealing with Scalability limits

The factory's open-file usage grows with the number of entries in the config, and eventually the gfactory user's max open file limit will be hit. This can be seen in ~/glideinsubmit/glidein_Production_v2_0/log/factory/factory.*.info.log:

[2012-05-11T12:44:02-07:00 6730] WARNING: Exception occurred: ['Traceback (most recent call last):\n', '  File "/home/gfactory/glideinWMS/factory/glideFactory.py", line 432, in main\n    glideinDescript,entries,restart_attempts,restart_interval)\n', '  File "/home/gfactory/glideinWMS/factory/glideFactory.py", line 213, in spawn\n    childs[entry_name]=popen2.Popen3("%s %s %s %s %s %s %s"%(sys.executable,os.path.join(STARTUP_DIR,"glideFactoryEntry.py"),os.getpid(),sleep_time,advertize_rate,startup_dir,entry_name),True)\n', '  File "/usr/lib64/python2.4/popen2.py", line 43, in __init__\n    c2pread, c2pwrite = os.pipe()\n', 'OSError: [Errno 24] Too many open files\n']

To deal with this, increase ulimits. Right now we have it at 50k for gfactory user. In ~/.bash_profile:

ulimit -n 50240

in /etc/security/limits.conf:

gfactory        hard    nofile          50240

After changing these, log out and back in as gfactory, then stop and restart the factory.
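
To confirm that the new limit is in effect for the gfactory login, the limits can be read from Python with the standard resource module. A minimal sketch; the function names are hypothetical, and 50240 is the value used above.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def limit_is_sufficient(required, limits=None):
    """True when the soft limit is at least `required` (or unlimited)."""
    soft, _hard = limits if limits is not None else open_file_limits()
    return soft == resource.RLIM_INFINITY or soft >= required
```

Running limit_is_sufficient(50240) as gfactory after logging back in should return True if both the ulimit line and the limits.conf entry were picked up.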

Factory Specific Notes

Factory Software and Patches

UCSD

Date Software Type Description
2012-09-27 gwms branch_v2_6_1_gf1 Adds glidein xml reports, glexec test and xml fix
2012-08-28 condor 7.8.3

GOC-ITB

Date Software Type Description
2012-09-14 gwms branch_v2_6_1_gf1 Adds glidein xml reports, glexec test and xml fix
2012-09-07 condor 7.8.3

GOC

Date Software Type Description
2012-09-18 gwms v2_6_1
2012-09-11 condor 7.8.2 with 7.8.3 pre-release /usr/local/glidecondor/sbin/condor_gridmanager, /usr/local/glidecondor/sbin/nordugrid_gahp, /usr/local/glidecondor/lib/libcondor_utils_7_8_3.so, /usr/local/glidecondor/bin/condor_history patch to fix ARC 1.1.x sites

CERN

Date Software Type Description
2012-08-14 gwms branch_v2_6_gf1 patch to fix analyze_entries reports and make Firefox compatible

GOC Factory Things to Remember

Production Factory: glidein.grid.iu.edu
ITB Factory: glidein-itb.grid.iu.edu

Turn off timeout for sudo

run:

/usr/sbin/visudo

add the following:

Defaults    timestamp_timeout = 0

Firewall settings

For condor we give a port range of 20k-50k. See the /etc/iptables.d files for details. Also the condor config must know about it:

###################
# Firewall limits
###################
HIGHPORT=50000
LOWPORT=20000
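A quick way to cross-check the condor config against the firewall rules is to parse the two settings and test individual ports against them. The sketch below is an illustration only: the function names are hypothetical, it only understands plain KEY=VALUE lines like the fragment above, and it assumes both ends of the range are inclusive.

```python
import re


def parse_port_range(config_text):
    """Extract (LOWPORT, HIGHPORT) from KEY=VALUE lines, skipping comments."""
    values = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = re.match(r"^(LOWPORT|HIGHPORT)\s*=\s*(\d+)$", line)
        if match:
            values[match.group(1)] = int(match.group(2))
    return values["LOWPORT"], values["HIGHPORT"]


def port_allowed(port, low, high):
    """True if `port` lies inside the range (assumed inclusive)."""
    return low <= port <= high
```

With the fragment above, parse_port_range returns (20000, 50000), and a port outside that range is reported as not allowed.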

Frontend Support

Adding a New Frontend

Click here for instructions on how to register a new Frontend to the Factory.

How To Open A Ticket To Contact Glidein Factory Support

NOTE This procedure is likely obsolete and needs to be verified with GOC

Although we encourage users to contact us directly at osg-gfactory-support@physics.ucsd.edu, a ticket may be opened should the user deem it appropriate.

Until GOC institutes a custom form for the Glidein Factory, begin by visiting https://ticket.grid.iu.edu/goc/other and load your certificate. Please select the VO on whose behalf you are submitting the ticket. Then under "Add CC" add osg-gfactory-support@physics.ucsd.edu. Finally, type a message describing the problem and hit submit.

Monitoring Reference

Glidein Factory Status

http://glidein-1.t2.ucsd.edu:8319/osg_gfactory/factoryStatus.html

Load this page, leave "Entry" at its default of "total" or select an individual site, and hit "update".

At the simplest level, there are four graphs to look at which are all displayed on top of one another on this page:

  • Running glidein jobs (green solid, on by default)
  • Glideins at collector (black line, not on by default)
  • Glideins claimed by user jobs (purple line, on by default)
  • Glideins not matched (yellow line, on by default)

What to look for with these:

Glideins claimed (purple) should not be much lower than the green envelope.

Glideins at collector (black) should also not be much lower than the green envelope.

Glideins not matched (yellow) should not be very large (relative to glideins claimed or running).

Each of these conditions can be temporary; e.g., not matched can spike and then drop when many jobs are submitted at once. This is not a problem. A problem is more likely when the conditions persist.
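The rules of thumb above can be written down as a small check. This is a sketch under stated assumptions, not something the factory provides: the function names, the 0.8 and 0.2 thresholds, and the choice of three consecutive samples are all made up for illustration, since the text only says "not much lower", "not very large" and "persist".

```python
def status_warnings(running, at_collector, claimed, unmatched,
                    low_ratio=0.8, unmatched_ratio=0.2):
    """Return warning strings for one sample of the four curves.

    The thresholds are illustrative guesses, not factory defaults.
    """
    warnings = []
    if running <= 0:
        return warnings  # nothing running, nothing to compare against
    if claimed < low_ratio * running:
        warnings.append("claimed well below running")
    if at_collector < low_ratio * running:
        warnings.append("at collector well below running")
    if unmatched > unmatched_ratio * running:
        warnings.append("unmatched large relative to running")
    return warnings


def persistent_warnings(samples, min_consecutive=3):
    """Report only warnings present in each of the last few samples.

    `samples` is a list of (running, at_collector, claimed, unmatched)
    tuples, oldest first. A single spike is therefore ignored.
    """
    if len(samples) < min_consecutive:
        return []
    recent = [set(status_warnings(*sample))
              for sample in samples[-min_consecutive:]]
    return sorted(set.intersection(*recent))
```

Intersecting the warnings of consecutive samples is one simple way to separate a temporary spike from a condition that persists.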

Glidein Factory Status Now

http://glidein-1.t2.ucsd.edu:8319/osg_gfactory/factoryStatusNow.html

This page displays a table of live data corresponding to the plots under Glidein Factory Status. The information is further divided by VO. See GlideinFactoryStatusNow (NOTE needs updating) for a longer and more detailed discussion.

Log Reference

Glidein output logs:

$GLIDEIN_FACTORY_DIR/client_log/user_*/entry_*/job.*.out
$GLIDEIN_FACTORY_DIR/client_log/user_*/entry_*/job.*.err

Glidein user logs:

$GLIDEIN_FACTORY_DIR/client_log/user_*/entry_*/condor_activity_*.log

Condor daemon logs:

/opt/glidecondor/condor_local/log/*Log

NOTE On GOC machines:

/usr/local/glidecondor/condor_local/log/*Log

Condor gridmanager logs:

/dev/shm/GridmanagerLog.schedd_glideins*

NOTE On GOC machines:

/tmp/GridmanagerLog.schedd_glideins*

Factory daemon logs:

$GLIDEIN_FACTORY_DIR/log/factory/factory.*.log
$GLIDEIN_FACTORY_DIR/log/entry_*/factory.*.log

Completed glidein logs:

$GLIDEIN_FACTORY_DIR/log/entry_*/completed_jobs_*.log

Tool Reference

Running analyze_entries status report

cd $GLIDEIN_FACTORY_DIR
analyze_entries -x 24 -s waste

Run the command with -h to print an explanation of the available options.

This report is sent to osg-gfactory-reports@physics.ucsd.edu daily.

Using proxy_info to Verify Pilot Proxies

The following example shows how to verify the pilot proxies used by the frontend.

  1. Get a list of the proxies for a VO and CE:
    proxy_info fecms ls -l /var/gfactory/clientproxies/user_fecms/glidein_v2_0/entry_CMS_T2_US_UCSD_gw2/

  2. Display a particular proxy's information:
    proxy_info fecms info -all /var/gfactory/clientproxies/user_fecms/glidein_v2_0/entry_CMS_T2_US_UCSD_gw2/x509_CMS_T2_US_UCSD_gw2@v2_0@UCSD@UCSD.minus,v5_0.dot,main_umrw_5.proxy

  3. For additional tool help run:
    proxy_info -h

Site Debugging Reference

How to contact Grid sites

Non-CMS issues at OSG sites

Use GOC:

https://ticket.grid.iu.edu/goc/submit

  • For Email Address use osg-gfactory-support@physics.ucsd.edu
  • Check the Resource box and find the name corresponding to the GLIDEIN_ResourceName attribute in the $GLIDEIN_FACTORY_DIR/glideinWMS.xml
  • Include the Resource name in the Title so it is easy to find

CMS issues for All Sites

Use Savannah:

https://savannah.cern.ch/

  1. Search for CMS
  2. Select CMS Computing Infrastructure Support
  3. Click Submit a new item

NOTE the following assumes you have administrative rights

Fill out the following fields:

  • For Category select Facilities Operations
  • For Assigned to select cmscompinfrasup-sitename
  • Set Use GGUS to No (this can be changed to Yes later if admins never respond)
  • For Site find the name corresponding to the GLIDEIN_CMSSite attribute in the $GLIDEIN_FACTORY_DIR/glideinWMS.xml
  • Include the CMS Site name in the Title so it is easy to find
  • in Add Email Addresses add osg_gfactory

NOTE if the site squad in question cannot be found in Assigned to, follow the instructions below for non-CMS issues at European sites.

Non-CMS issues at European sites

Use GOC:

https://ticket.grid.iu.edu/goc/submit

IMPORTANT leave Resource unchecked.

In the Description, explain that it is an EGI resource, give the GLIDEIN_ResourceName, and request that the ticket be forwarded to GGUS on behalf of the affected VO.

Globus Hold Reasons

| Globus Error Code | Held Reason | Job is Recoverable |
| 10 | globus_xio_gsi: Token size exceeds limit. Usually happens when someone tries to establish a insecure connection with a secure endpoint, e.g. when someone sends plain HTTP to a HTTPS endpoint without | No |
| 121 | the job state file doesn't exist | No |
| 126 | it is unknown if the job was submitted | Yes |
| 12 | the connection to the server failed (check host and port) | Yes |
| 131 | the user proxy expired (job is still running) | Maybe |
| 17 | the job failed when the job manager attempted to run it | No |
| 22 | the job manager failed to create an internal script argument file | No |
| 31 | the job manager failed to cancel the job as requested | No |
| 3 | an I/O operation failed | Yes |
| 47 | the gatekeeper failed to run the job manager | No |
| 48 | the provided RSL could not be properly parsed | No |
| 4 | jobmanager unable to set default to the directory requested | No |
| 76 | cannot access cache files in ~/.globus/.gass_cache, check permissions, quota, and disk space | Maybe (Short term: No) |
| 79 | connecting to the job manager failed. Possible reasons: job terminated, invalid job contact, network problems, ... | No |
| 7 | an authorization operation failed | Yes |
| 7 | authentication with the remote server failed | Yes |
| 8 | the user cancelled the job | No |
| 94 | the jobmanager does not accept any new requests (shutting down) | Yes |
| 9 | the system cancelled the job | No |
| ? | Job failed, no reason given by GRAM server | No |
| 122 | could not read the job state file | Maybe (short term: no) |
| 132 | the job was not submitted by original jobmanager | No (likely to be fatal) |
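When sorting through many held glideins, the table above can be turned into a lookup. The sketch below is an illustration only: the function names are hypothetical, the qualified answers ("Maybe (Short term: No)", "No (likely to be fatal)") are collapsed to plain "maybe" and "no", the row with no error code is not covered, and it assumes the hold reason text contains "Globus error N" as in the notes that follow.

```python
import re

# Recoverability per Globus error code, copied from the table above.
RECOVERABLE = {
    3: "yes", 7: "yes", 12: "yes", 94: "yes", 126: "yes",
    76: "maybe", 122: "maybe", 131: "maybe",
    4: "no", 8: "no", 9: "no", 10: "no", 17: "no", 22: "no",
    31: "no", 47: "no", 48: "no", 79: "no", 121: "no", 132: "no",
}


def hold_code(hold_reason):
    """Return the Globus error code found in a hold reason, or None."""
    match = re.search(r"Globus error (\d+)", hold_reason)
    return int(match.group(1)) if match else None


def recoverability(hold_reason):
    """Return "yes", "maybe", "no", or "unknown" for a hold reason string."""
    return RECOVERABLE.get(hold_code(hold_reason), "unknown")
```

For example, recoverability("Globus error 79: connecting to the job manager failed.") returns "no", while a CREAM hold reason falls through to "unknown".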

Additional notes

  • Globus error 79: connecting to the job manager failed.  Possible reasons: job terminated, invalid job contact, network problems, ..
    • This can happen if it is a condor site and the admin removes held glideins from their side.
    • It also happens to every CERN Production glidein every Monday on every gt5 site; we do not yet know why.

  • Globus error 9: the system cancelled the job
    • Happens when sites preempt glideins for exceeding memory limits or preempt opportunistic glideins (seen at Michigan and BNL).

CREAM Hold Reasons (Work in progress)

Reasons we mostly understand

  • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 500 500-Command failed. : globus_l_gfs_file_open failed.  500-globus_xio: Unable to open file /cream_localsandbox/data/prdcms/_DC_ch_DC_cern_OU_computers_CN_cmspilotjob_vocms157_cern_ch_cms_Role_production_Capability_NULL_prdcms35/99/CREAM999015216/OSB/job.479371.9.out  500-globus_xio: System error in open: No such file or directory  500-globus_xio: A system call failed: No such file or directory  500 End.
    • Happens when CREAM site has no memory of job (possibly removed on remote side) but gridmanager refuses to give up

  • CREAM_Delegate Error: Received NULL fault; the error is due to another cause: FaultString=[connection error] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] - FaultDetail=[Connection timed out]
    • Site likely down. But sometimes it is because we are being blocked by a firewall: https://savannah.cern.ch/support/?120361. To check if it could be firewall try telnet: telnet cream01.iihe.ac.be 8443

  • CREAM_Delegate Error: Received NULL fault; the error is due to another cause: FaultString=[connection error] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] - FaultDetail=[Unknown host]
    • Likely a site that has been decommissioned and hostname is no longer valid, or possibly a typo in the hostname in the config.

  • CREAM_Delegate Error: Authorization error: System error reading local user information
    • We see this on old CREAM installs that don't like / symbols in DNs

  • CREAM_Delegate Error: Authorization error: Failed to get the local user id via glexec

  • CREAM error: CREAM_Job_Register Error: MethodName=[jobRegister] Timestamp=[Tue 18 Oct 2011 06:39:04] ErrorCode=[0] Description=[delegation error: delegation id "1318117200.654933" not found!] FaultCause=[delegProxyInfo "1318117200.654933" not found!]
    • This happens when there is a really old running glidein on the queue (likely lost in rundiff) with an expired lease. It prevents all later glideins with same user from obtaining a new lease. Just remove it, the held jobs should recover.
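The telnet check suggested above for the "Connection timed out" case can also be scripted, which helps when several CREAM hosts need testing at once. A minimal sketch, with a made-up function name; it only tells whether a TCP connection can be opened, not why it failed.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """True if a TCP connection to host:port can be opened within `timeout`."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers refused connections, timeouts and name lookup failures.
        return False
    sock.close()
    return True
```

Calling can_connect("cream01.iihe.ac.be", 8443) mirrors the telnet example. A False result from the factory combined with a True result from another network would be consistent with a firewall block, but that is an inference rather than proof.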

Reasons we don't understand

  • CREAM error: Transfer failed: GRIDFTP_TRANSFER timed out
  • CREAM error: BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:qsub: Queue is not enabled MSG=queue is disabled: user cmprd003@ce08.pic.es, queue glong_sl5-) N/A (jobId = CREAM102242343)
    • This can be seen when glideins are submitted to a site in downtime.

  • The following are likely because the job manager on the other end has no record of the glideins anymore and can probably just safely be removed (if the site isn't in downtime).
    • CREAM error: reason=999
      • (not really sure what this means)
    • CREAM error: CREAM_Job_Purge Error: job does not exist
    • CREAM error: job aborted because the execution of the JOB_START command has been interrupted by the CREAM shutdown
      
    • CREAM error: BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:pbs_iff: cannot read reply from pbs_server-No Permission.-qsub: cannot connect to server pbs03.pic.es (errno=15007) Unauthorized Request -) N/A (jobId = CREAM258408629)
      
    • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 500 500-Command failed. : globus_l_gfs_file_open failed.  500-globus_xio: Unable to open file /opt/glite/var/cream_sandbox/lt2-cmsprd/_DC_ch_DC_cern_OU_computers_CN_cmspilotjob_vocms157_cern_ch_cms_Role_production_Capability_NULL_lt2-cmsprd713/95/CREAM956614543/OSB/job.714256.8.out  500-globus_xio: System error in open: No such file or directory  500-globus_xio: A system call failed: No such file or directory  500 End.
  • If the following are seen over many entries served by the same gridmanager it may be a local issue (but not always). Killing the gridmanager without -9 seems to clear them up:
    • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 500 Command failed. : globus_xio: An end of file occurred
    • CREAM error: Transfer failed: globus_ftp_control_local_port(): Handle not in the proper state CLOSING.
    • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 530 530-Login incorrect. : globus_gss_assist: Error invoking callout  530-globus_callout_module: The callout returned an error  530-an unknown error occurred  530 End.
    • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 500 500-Command failed. : callback failed.  500-an end-of-file was reached  500-globus_xio: The GSI XIO driver failed to establish a secure connection. The failure occured during a handshake read.  500-globus_xio: An end of file occurred  500 End.
    • CREAM error: Transfer failed: globus_ftp_control: gss_init_sec_context failed globus_gsi_callback_module: Could not verify credential globus_gsi_callback_module: Could not verify credential globus_gsi_callback_module: Invalid CRL: The available CRL has expired

New Improved Docs (based on Alison's notes)

FactoryOpsGlideinWMS

FactoryInfo

Ops support

  • Internal tickets
  • mailing lists
  • access and cross-factory support

Factory Ops

  • Creating a new instance (and preserve monitoring history?)
  • Initial setup (daily emails, processes/monitoring, ??)
  • Adding new entries
  • Adding a frontend
  • Upgrading
    • factory
    • condor
  • Cloning
    • sites from one factory to another
    • Global cloning, such as t1 site group
  • Removing schedds (I think docs for this may be wrong?)
  • Attributes (link to gwms docs)
  • Finding missing sites
  • Removing glideins (includes scripts)
  • Submitting test jobs
  • Putting sites in downtime
  • Submitting Tickets
  • Decommissioning sites
  • Factory Disk warnings
  • Entry issues
    • CREAM
    • globus
    • Misc
  • Removing old entries

Daily Ops Monitoring

  • Mailing list
  • Internal tickets (Jira)
  • Daily emails
  • Analyze Entries
  • Web pages
  • Held jobs
  • Infosys
  • Misc
  • .err log problems
  • HOLD problems
  • Condor Activity Log problems

Daily Ops Other issues

  • Restarting the grid manager
  • Handling stuck waiting glideins
  • Rundiffs
  • Unmatched jobs

Additional References

  • logs
  • monitoring tools
  • proxies
  • ssh logins
  • git commands
  • frontend security info
  • BDII
  • Log Retention rules
  • Security
  • Condor G
  • Useful scripts
  • Misc

Future work

  • rpms

Authors

-- TerrenceMartin

-- IgorSfiligoi

Revision 73 - 2012/10/18 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 462 to 462
 
  • Set Use GGUS to No (this can be changed to Yes later if admins never respond)
  • For Site find the name corresponding to the GLIDEIN_CMSSite attribute in the $GLIDEIN_FACTORY_DIR/glideinWMS.xml
  • Include the CMS Site name in the Title so it is easy to find
Changed:
<
<
  • in Add Email Addresses add osg-gfactory-support@physics.ucsd.edu
>
>
  • in Add Email Addresses add osg_gfactory
  NOTE if the site squad in question cannot be found in Assigned to then just follow the same instructions as below:

Revision 72 - 2012/10/17 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 57 to 57
 

Adding a New Site to Glidein Factory

Changed:
<
<
These instructions assume you know the gatekeeper hostname and the VO who requested the entry.

Find Site in BDII

Check OSG BDII (is.grid.iu.edu) first:

lds osg hostname* | less

If it isn't there try CERN BDII (exp-bdii.cern.ch):

lds egi hostname* | less

Find a queue that supports the VO and has reasonable wallclock limits. Look for the lower of GlueCEPolicyMaxCPUTime and GlueCEPolicyMaxWallClockTime and ensure it has enough time.

GlueCEAccessControlBaseRule: VO:VO
GlueCEPolicyMaxWallClockTime: MINUTES
GlueCEPolicyMaxCPUTime: MINUTES

Make note of the GlueCEUniqueID:

GlueCEUniqueID=hostname:port/jobmanager-jm_type-queue

Run add_new_entry

add_new_entry takes different arguments depending on the type of site and whether or not it supports CMS. If it supports CMS, run:

add_new_entry -p cms_plugin.py glideinWMS.xml glideinWMS.xml.test bdii_server GlueCEUniqueID entry_name GLIDEIN_Site vo_name "Added date --your_name" GLIDEIN_CMSSite GLIDEIN_SEs

Otherwise if it is a site that does not support CMS, run:

add_new_entry glideinWMS.xml glideinWMS.xml.test bdii_server GlueCEUniqueID entry_name GLIDEIN_Site vo_name "Added date --your_name"
>
>
Click here for instructions on how to add a site for VOs to use.
 

Entry Templates

Revision 71 - 2012/10/17 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 78 to 78
 
GlueCEUniqueID=hostname:port/jobmanager-jm_type-queue
Added:
>
>

Run add_new_entry

add_new_entry takes different arguments depending on the type of site and whether or not it supports CMS. If it supports CMS, run:

add_new_entry -p cms_plugin.py glideinWMS.xml glideinWMS.xml.test bdii_server GlueCEUniqueID entry_name GLIDEIN_Site vo_name "Added date --your_name" GLIDEIN_CMSSite GLIDEIN_SEs

Otherwise if it is a site that does not support CMS, run:

add_new_entry glideinWMS.xml glideinWMS.xml.test bdii_server GlueCEUniqueID entry_name GLIDEIN_Site vo_name "Added date --your_name"
 

Entry Templates

NOTE this section is likely obsolete

Revision 70 - 2012/10/17 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 57 to 57
 

Adding a New Site to Glidein Factory

Added:
>
>
These instructions assume you know the gatekeeper hostname and the VO who requested the entry.

Find Site in BDII

Check OSG BDII (is.grid.iu.edu) first:

lds osg hostname* | less

If it isn't there try CERN BDII (exp-bdii.cern.ch):

lds egi hostname* | less

Find a queue that supports the VO and has reasonable wallclock limits. Look for the lower of GlueCEPolicyMaxCPUTime and GlueCEPolicyMaxWallClockTime and ensure it has enough time.

GlueCEAccessControlBaseRule: VO:VO
GlueCEPolicyMaxWallClockTime: MINUTES
GlueCEPolicyMaxCPUTime: MINUTES

Make note of the GlueCEUniqueID:

GlueCEUniqueID=hostname:port/jobmanager-jm_type-queue
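The queue check described above (take the lower of the CPU and wallclock limits, then note the GlueCEUniqueID) can be sketched in a few lines. Everything here is illustrative: the function names and the example host ce.example.org are made up, the pattern only covers the hostname:port/jobmanager-jm_type-queue form shown above, and how many minutes count as "enough time" is left to the caller.

```python
import re

# Matches only the hostname:port/jobmanager-jm_type-queue form.
CE_ID_PATTERN = re.compile(
    r"^(?P<host>[^:/]+):(?P<port>\d+)/jobmanager-(?P<jm_type>[^-]+)-(?P<queue>.+)$"
)


def parse_ce_unique_id(ce_id):
    """Split a GlueCEUniqueID into host, port, jm_type and queue."""
    match = CE_ID_PATTERN.match(ce_id)
    if match is None:
        raise ValueError("unexpected GlueCEUniqueID format: %r" % ce_id)
    parts = match.groupdict()
    parts["port"] = int(parts["port"])
    return parts


def effective_limit_minutes(max_cpu_minutes, max_wallclock_minutes):
    """The usable time limit is the lower of the two published limits."""
    return min(max_cpu_minutes, max_wallclock_minutes)


def queue_is_long_enough(max_cpu_minutes, max_wallclock_minutes, needed_minutes):
    """True if the lower of the two limits covers `needed_minutes`."""
    return effective_limit_minutes(max_cpu_minutes, max_wallclock_minutes) >= needed_minutes
```

For example, a queue publishing 600 CPU minutes and 4320 wallclock minutes is effectively a 600 minute queue.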
 

Entry Templates

NOTE this section is likely obsolete

Revision 69 - 2012/10/04 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 508 to 508
 
  • CREAM_Delegate Error: Received NULL fault; the error is due to another cause: FaultString=[connection error] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] - FaultDetail=[Connection timed out]
    • Site likely down. But sometimes it is because we are being blocked by a firewall: https://savannah.cern.ch/support/?120361. To check if it could be firewall try telnet: telnet cream01.iihe.ac.be 8443
Added:
>
>
  • CREAM_Delegate Error: Received NULL fault; the error is due to another cause: FaultString=[connection error] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] - FaultDetail=[Unknown host]
    • Likely a site that has been decommissioned and hostname is no longer valid, or possibly a typo in the hostname in the config.
 
  • CREAM_Delegate Error: Authorization error: System error reading local user information
    • We see this on old CREAM installs that don't like / symbols in DNs

Revision 68 - 2012/10/02 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 133 to 133
 
  1. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test2 and reconfig

Cloning GOC-ITB Factory

Changed:
<
<
  1. Do a "merge only" first to properly disable newly decommissioned sites without bringing in old ones with enabled="False".
    $GLIDEIN_SRC_DIR/creation/clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test -merge only glideinWMS.xml
>
>
  1. Change OSGVO and OSGVOHTPC to OSGVO_ITB and OSGVOHTPC_ITB:
    sed -e 's/OSGVO\([^H]\)/OSGVO_ITB\1/g' -e 's/OSGVOHTPC/OSGVOHTPC_ITB/g' glideinWMS.xml.ucsd > glideinWMS.xml.ucsd2
 
Changed:
<
<
  1. Run a second time with merge disabled but exclude disabled entries.
    $GLIDEIN_SRC_DIR/creation/clone_glidein -other glideinWMS.xml.ucsd -merge no -out glideinWMS.xml.test2 -exclude enabled False -merge no glideinWMS.xml.test
>
>
  1. Manually add CMSOverflow to GLIDEIN_Supported_VOs for Engage_US_MWT2_osg and HCC_US_BNL_gk02 in glideinWMS.xml.ucsd2.
  2. Manually change GLIDEIN_Supported_VOs from glowVO to CMS for OSG_CrossOSG_ce in glideinWMS.xml.ucsd2.
  3. Do a "merge only" first to properly disable newly decommissioned sites without bringing in old ones with enabled="False". Also don't touch a few entries still to be tested on itb.
    $GLIDEIN_SRC_DIR/creation/clone_glidein -other glideinWMS.xml.ucsd2 -out glideinWMS.xml.test -merge only -exclude name HCC_US_BU_atlas_dque -exclude name HCC_US_BU_atlas_opteron -exclude name OSG_US_HAMPTONU_hugrid02 -exclude name OSG_US_LEHIGH_piranha -exclude name OSG_US_NotreDame_earth glideinWMS.xml
 
Changed:
<
<
  1. Change OSGVO and OSGVOHTPC to OSGVO_ITB and OSGVOHTPC_ITB
    sed 's/OSGVOHTPC/OSGVOHTPC_ITB/g' glideinWMS.xml.test2 | sed 's/OSGVO,/OSGVO_ITB,/g' | sed 's/OSGVO\"/OSGVO_ITB\"/g' > glideinWMS.xml.test3
>
>
  1. Run a second time with merge disabled but exclude disabled entries.
    $GLIDEIN_SRC_DIR/creation/clone_glidein -other glideinWMS.xml.ucsd2 -merge no -out glideinWMS.xml.test2 -exclude enabled False -merge no glideinWMS.xml.test
 
Deleted:
<
<
  1. Engage_US_MWT2_osg and HCC_US_BNL_gk02: Add CMSOverflow to GLIDEIN_Supported_VO
  2. OSG_CrossOSG_ce: Change GLIDEIN_Supported_VO to CMS
 
  1. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test2 and reconfig
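The two-expression sed command in the new version of this procedure relies on a small trick: the first expression skips any OSGVO that is followed by H, so OSGVOHTPC is left for the second expression. The Python sketch below reproduces that behaviour so it can be tried on sample strings before touching a real config. The function name is made up, and the newline exclusion is there only to mimic sed's line-by-line matching.

```python
import re


def rename_osg_vos(text):
    """Mimic: sed -e 's/OSGVO\\([^H]\\)/OSGVO_ITB\\1/g' -e 's/OSGVOHTPC/OSGVOHTPC_ITB/g'"""
    # First expression: OSGVO followed by any character other than H.
    # The newline is excluded because sed never matches across lines.
    text = re.sub(r"OSGVO([^H\n])", r"OSGVO_ITB\1", text)
    # Second expression: plain substring replacement.
    return text.replace("OSGVOHTPC", "OSGVOHTPC_ITB")
```

Two consequences are worth knowing. An OSGVO at the very end of a line is not renamed, because the first expression needs a following character. Running the substitution on text that was already converted appends _ITB a second time.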

Cloning CERN Factory

Revision 67 - 2012/10/02 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 113 to 113
  Below are examples of doing a global clone from UCSD to GOC and CERN factories. They assume you have copied the UCSD config to the respective factory and named it glideinWMS.xml.ucsd.
Changed:
<
<
DISCLAIMER The examples are subject to change due to the constantly evolving nature of our config files. They are current as of 2012-08-23
>
>
DISCLAIMER The examples are subject to change due to the constantly evolving nature of our config files. They are current as of 2012-10-02
 

Description of clone_glidein Arguments

  • -merge yes/no/only
Line: 133 to 133
 
  1. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test2 and reconfig

Cloning GOC-ITB Factory

Changed:
<
<
  1. Do a "merge only" first to properly disable newly decommissioned sites without bringing in old ones with enabled="False". Also ignore special EGI glexec for now until glideinWMS 2.6.
    $GLIDEIN_SRC_DIR/creation/clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test1 -merge only glideinWMS.xml
>
>
  1. Do a "merge only" first to properly disable newly decommissioned sites without bringing in old ones with enabled="False".
    $GLIDEIN_SRC_DIR/creation/clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test -merge only glideinWMS.xml
 
Changed:
<
<
  1. Run a second time with merge disabled but exclude a few experimental entries
    $GLIDEIN_SRC_DIR/creation/clone_glidein -other glideinWMS.xml.ucsd -merge no -out glideinWMS.xml.test2 -exclude enabled False -merge no glideinWMS.xml.test1
>
>
  1. Run a second time with merge disabled but exclude disabled entries.
    $GLIDEIN_SRC_DIR/creation/clone_glidein -other glideinWMS.xml.ucsd -merge no -out glideinWMS.xml.test2 -exclude enabled False -merge no glideinWMS.xml.test
 
  1. Change OSGVO and OSGVOHTPC to OSGVO_ITB and OSGVOHTPC_ITB
    sed 's/OSGVOHTPC/OSGVOHTPC_ITB/g' glideinWMS.xml.test2 | sed 's/OSGVO,/OSGVO_ITB,/g' | sed 's/OSGVO\"/OSGVO_ITB\"/g' > glideinWMS.xml.test3

Revision 66 - 2012/09/28 - Main.TimMortensen

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 292 to 292
 

UCSD

Date Software Type Description
Changed:
<
<
2012-09-27 gwms branch_v2_6_1_gf1 Adds glidein xml reports and glexec test fix
>
>
2012-09-27 gwms branch_v2_6_1_gf1 Adds glidein xml reports, glexec test and xml fix
 
2012-08-28 condor 7.8.3

GOC-ITB

Date Software Type Description
Changed:
<
<
2012-09-14 gwms branch_v2_6_1_gf1 Adds glidein xml reports and glexec test fix
>
>
2012-09-14 gwms branch_v2_6_1_gf1 Adds glidein xml reports, glexec test and xml fix
 
2012-09-07 condor 7.8.3

GOC

Revision 65 - 2012/09/28 - Main.TimMortensen

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 292 to 292
 

UCSD

Date Software Type Description
Changed:
<
<
2012-08-31 gwms branch_v2_6_1_gf1 Adds glidein xml reports and glexec test fix
>
>
2012-09-27 gwms branch_v2_6_1_gf1 Adds glidein xml reports and glexec test fix
 
2012-08-28 condor 7.8.3

GOC-ITB

Revision 64 - 2012/09/25 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 338 to 338
 

Adding a New Frontend

Changed:
<
<

Required Preliminary Info

Required from Frontend admin:

  • security_name - Agree on a name with the Frontend admin before proceeding. The security_name should contain the VO name and optionally a geographic location or abbreviated institution name if there is any chance in the future more than one frontend will serve the same VO.
  • Frontend host cert DN - provided by Frontend admin

Decided by Factory admin:

  • username - The UNIX username the frontend will be mapped to in the factory. By convention, start username with “fe”
  • Frontend identity - The identity the frontend will be mapped to in the WMS Collector. This does not need to be the same as the UNIX username but it can be.
  • vo_name - Name to be specified in the GLIDEIN_Supported_VOs list in each entry authorized for the Frontend to use. This is usually simply the VO name but is arbitrary. It must be given to the Frontend admin to complete the process.

Like security_name, if multiple frontends serve the VO it may be useful to have geographic or institutional info in the username and identity name.

Registration Procedure

Perform the following steps as root:

  1. Create new user:
    useradd username
  2. Add user to /etc/condor/privsep_config:
    valid-target-uids = feuser1 : feuser2 : … : username
    valid-target-gids = feuser1 : feuser2 : … : username
    
  3. Authenticate with Condor:
    $GLIDEIN_SRC_DIR/install/glidecondor_addDN -daemon 'add comment here' frontend_DN identity
    
    Include in the comment the Frontend name, admin name, and admin's email address. This shows up in the condor config file.
  4. Reconfigure Condor:
    killall -HUP condor_collector

Perform the following steps as gfactory:

  1. add new Frontend to glideinWMS.xml
    <frontends>
       ...
       <frontend name="security_name" comment="Contact: add list of admins and contact email addresses here" identity="identity@glidein-1.t2.ucsd.edu">
          <security_classes>
             <security_class name="frontend" username="username"/>
          </security_classes>
       </frontend>
       ...
    </frontends>
    
  2. Reconfigure and restart the Factory

Notify Frontend Admin

Email the frontend admin when it is finished:

Hi admin_name,

We have finished registering your frontend to our factory.  Here is the relevant info you need to complete your frontend configuration:

In your frontend security section please set:
security_name="security_name"

In factory collector section use the following:
DN="/DC=org/DC=doegrids/OU=Services/CN=glidein-1.t2.ucsd.edu"
factory_identity="gfactory@glidein-1.t2.ucsd.edu"
my_identity="identity@glidein-1.t2.ucsd.edu"
node="glidein-1.t2.ucsd.edu"

In the pilot proxy section please use:
security_class="frontend"

Please also add stringListMember("vo_name",GLIDEIN_Supported_VOs) to your factory query_expr.

For the next step, please let us know a single site you would like to submit to, so we can test the configuration.  Ideally it is a site you also have admin rights to.

Once we confirm everything is working you can either supply us a full list of desired sites or we can provide a list of sites for you to choose from that claim to support your VO, whichever you prefer.

Thanks,
your_name
OSG Glidein Factory Operations

Whitelisting entries for Frontend

Add the vo_name to the GLIDEIN_Supported_VOs list of each entry the frontend wants to use.

NOTE We have a tool that can generate a list of sites claiming to support a given VO. Details on how to use this will be added here later.

>
>
Click here for instructions on how to register a new Frontend to the Factory.
 

How To Open A Ticket To Contact Glidein Factory Support

Revision 63 - 2012/09/25 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 347 to 347
 Decided by Factory admin:
  • username - The UNIX username the frontend will be mapped to in the factory. By convention, start username with “fe”
  • Frontend identity - The identity the frontend will be mapped to in the WMS Collector. This does not need to be the same as the UNIX username but it can be.
Added:
>
>
  • vo_name - Name to be specified in the GLIDEIN_Supported_VOs list in each entry authorized for the Frontend to use. This is usually simply the VO name but is arbitrary. It must be given to the Frontend admin to complete the process.
  Like security_name, if multiple frontends serve the VO it may be useful to have geographic or institutional info in the username and identity name.
Line: 417 to 418
  </>
<--/twistyPlugin-->
Added:
>
>

Whitelisting entries for Frontend

Add the vo_name to the GLIDEIN_Supported_VOs list of each entry the frontend wants to use.

NOTE We have a tool that can generate a list of sites claiming to support a given VO. Details on how to use this will be added here later.

 

How To Open A Ticket To Contact Glidein Factory Support

NOTE This procedure is likely obsolete and needs to be verified with GOC

Revision 62 - 2012/09/25 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 377 to 377
 
  1. Reconfigure and restart the Factory
Added:
>
>

Notify Frontend Admin

Email the frontend admin when it is finished:

Hi admin_name,

We have finished registering your frontend to our factory.  Here is the relevant info you need to complete your frontend configuration:

In your frontend security section please set:
security_name="security_name"

In factory collector section use the following:
DN="/DC=org/DC=doegrids/OU=Services/CN=glidein-1.t2.ucsd.edu"
factory_identity="gfactory@glidein-1.t2.ucsd.edu"
my_identity="identity@glidein-1.t2.ucsd.edu"
node="glidein-1.t2.ucsd.edu"

In the pilot proxy section please use:
security_class="frontend"

Please also add stringListMember("vo_name",GLIDEIN_Supported_VOs) to your factory query_expr.

For the next step, please let us know a single site you would like to submit to, so we can test the configuration.  Ideally it is a site you also have admin rights to.

Once we confirm everything is working you can either supply us a full list of desired sites or we can provide a list of sites for you to choose from that claim to support your VO, whichever you prefer.

Thanks,
your_name
OSG Glidein Factory Operations
 

How To Open A Ticket To Contact Glidein Factory Support

NOTE This procedure is likely obsolete and needs to be verified with GOC

Revision 61 - 2012/09/24 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 346 to 346
  Decided by Factory admin:
  • username - The UNIX username the frontend will be mapped to in the factory. By convention, start username with “fe”
Changed:
<
<
  • Factory identity - The identity the frontend will be mapped to in the WMS Collector. This does not need to be the same as the UNIX username but it can be.
>
>
  • Frontend identity - The identity the frontend will be mapped to in the WMS Collector. This does not need to be the same as the UNIX username but it can be.
  Like security_name, if multiple frontends serve the VO it may be useful to have geographic or institutional info in the username and identity name.
Line: 363 to 363
  Include in the comment the Frontend name, admin name, and admin's email address. This shows up in the condor config file.
  1. Reconfigure Condor:
    killall -HUP condor_collector
Changed:
<
<

Adding a new VO

NOTE this section needs to be updated

For security purposes:

  1. (as root) add the vo user (e.g. fevo1)
    useradd fevo1
  2. (as root) add the new user and group into
    /etc/condor/privsep_config
    sections
    valid-target-uids and valid-target-gids
  3. (as root) Add the VO pilot DNs to
    /etc/grid-security/grid-mapfile
    Note: Only needed if using CREAM.
  4. (as root) Add VO to the Condor config (note: the UNIX user name and the Condor name may be different, but can be the same, as in this example)
    ~/glideinWMS/install# ./glidecondor_addDN -daemon "VO1 Frontend DN" "<VO1 DN>" fevo1
    Reconfig condor
    /opt/glidecondor/sbin/condor_reconfig -collector
  5. (as gfactory) Add the VO to the gfactory config
    ~/glideinsubmit/glidein_v2_0.cfg/glideinWMS.xml
    Example change
    <frontend name="vo1-glidein" identity="fevo1@glidein-1.t2.ucsd.edu">
    <security_classes>
    <security_class name="frontend" username="fevo1"/>
    </security_classes>
    </frontend>

    Reconfig factory
    ~/glideinsubmit/glidein_v2_0$ ./factory_startup reconfig ../glidein_v2_0.cfg/glideinWMS.xml

For resource selection purposes:

  1. Identify the entries they can use (no obvious way just yet)
  2. If they need sites we don't support yet, add an entry for them
    Use the
    VO_blah
    naming convention, so we know who first requested the entry.
  3. For each entry, add the VO in the
    GLIDEIN_Supported_VOs
    attribute.
>
>
Perform the following steps as gfactory:
  1. add new Frontend to glideinWMS.xml
    <frontends>
       ...
       <frontend name="security_name" comment="Contact: add list of admins and contact email addresses here" identity="identity@glidein-1.t2.ucsd.edu">
          <security_classes>
             <security_class name="frontend" username="username"/>
          </security_classes>
       </frontend>
       ...
    </frontends>
    
  2. Reconfigure and restart the Factory
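
Before reconfiguring, it can help to confirm that the new frontend entry actually made it into the config file. The following is a minimal sketch, not part of the factory tooling: `has_frontend` is a hypothetical helper name, it does a plain text match rather than parsing the XML, and it assumes the opening `<frontend ...>` tag sits on a single line as in the example above.

```shell
# Hypothetical helper: succeed if FILE contains a <frontend> element
# whose name attribute equals NAME.
# Plain text match; assumes the opening tag is on one line.
has_frontend() {
    file="$1"
    name="$2"
    grep -Eq "<frontend[[:space:]]([^>]*[[:space:]])?name=\"${name}\"" "$file"
}

# Example (as gfactory):
#   has_frontend "${GLIDEIN_FACTORY_DIR}.cfg/glideinWMS.xml" security_name \
#       || echo "frontend entry not found"
```

The pattern requires whitespace after `<frontend`, so neither the enclosing `<frontends>` element nor a `<security_class name="frontend">` line is matched by mistake.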
 

How To Open A Ticket To Contact Glidein Factory Support

Revision 602012/09/24 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 336 to 336
 

Frontend Support

Added:
>
>

Adding a New Frontend

Required Preliminary Info

Required from Frontend admin:

  • security_name - Agree on a name with the Frontend admin before proceeding. The security_name should contain the VO name and optionally a geographic location or abbreviated institution name if there is any chance in the future more than one frontend will serve the same VO.
  • Frontend host cert DN - provided by Frontend admin

Decided by Factory admin:

  • username - The UNIX username the frontend will be mapped to in the factory. By convention, start username with “fe”
  • Factory identity - The identity the frontend will be mapped to in the WMS Collector. This does not need to be the same as the UNIX username but it can be.

Like security_name, if multiple frontends serve the VO it may be useful to have geographic or institutional info in the username and identity name.

Registration Procedure

Perform the following steps as root:

  1. Create new user:
    useradd username
  2. Add user to /etc/condor/privsep_config:
    valid-target-uids = feuser1 : feuser2 : … : username
    valid-target-gids = feuser1 : feuser2 :  … : username
    
  3. Authenticate with Condor:
    $GLIDEIN_SRC_DIR/install/glidecondor_addDN -daemon 'add comment here' frontend_DN identity
    
    Include in the comment the Frontend name, admin name, and admin's email address. This shows up in the condor config file.
  4. Reconfigure Condor:
    killall -HUP condor_collector
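
Step 2 of the root procedure above can be scripted. The following is a minimal sketch, not part of the factory tooling: `add_privsep_user` is a hypothetical helper name, and it assumes each of the `valid-target-uids` and `valid-target-gids` lists sits on a single line in the colon-separated form shown above.

```shell
# Hypothetical helper: append USER to the valid-target-uids and
# valid-target-gids lines of a privsep_config-style FILE.
# Assumes each list is on one line, e.g. "valid-target-uids = fe1 : fe2".
add_privsep_user() {
    file="$1"
    user="$2"
    # Keep a FILE.bak copy, then append " : USER" to both list lines,
    # replacing any trailing whitespace.
    sed -E -i.bak \
        "/^valid-target-(uids|gids)[[:space:]]*=/ s/[[:space:]]*\$/ : ${user}/" \
        "$file"
}

# Example (as root):
#   add_privsep_user /etc/condor/privsep_config username
```

The sketch does not check whether the user is already listed, so inspect the file (or the `.bak` copy) before reconfiguring Condor.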
 

Adding a new VO

NOTE this section needs to be updated

Revision 592012/09/23 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 191 to 191
  Click here for instructions on how to upgrade the Condor tarballs for glideins to use.
Added:
>
>

Upgrading GlideinWMS

Back Up Old GlideinWMS (Optional)

Check if there are any manually applied patches:

cd $GLIDEIN_SRC_DIR
git status

If there are, and they are worth saving, it is easiest to back up the whole git repo:

cd ..
rsync -av glideinWMS/ glideinWMS-old

Replace old with the glideinWMS version number you are backing up.

Upgrade Procedure

Shut down Factory:

cd $GLIDEIN_FACTORY_DIR
./factory_startup stop

Check to make sure there are no running factory python processes:

ps -u gfactory

Fetch the latest code and check out new, where new is the desired tag or branch name:

cd $GLIDEIN_SRC_DIR
git fetch
git checkout new

It is highly recommended to rebuild all glidein condor tarballs using create_condor_tarball at this point and to update the config file accordingly, as outlined in the Upgrading Glidein Condor section. However, do not reconfigure and restart the factory yet. Instead, after upgrading the tarballs and updating the config, proceed with the steps below.

Run upgrade and supply full absolute path of config file:

cd $GLIDEIN_FACTORY_DIR
./factory_startup upgrade ${GLIDEIN_FACTORY_DIR}.cfg/glideinWMS.xml

Restart Factory:

./factory_startup start
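
The check for leftover factory python processes in the procedure above can be reduced to a single number. This is a minimal sketch under stated assumptions: `count_python_procs` is a hypothetical helper name, and it assumes the default `ps -u` output, with one header line and the command name in the last column.

```shell
# Hypothetical helper: read `ps -u USER` output on stdin and print how many
# processes have "python" in the command name (last column).
# The first line is assumed to be the ps header and is skipped.
count_python_procs() {
    awk 'NR > 1 && $NF ~ /python/ { n++ } END { print n + 0 }'
}

# Example (as gfactory, after ./factory_startup stop):
#   ps -u gfactory | count_python_procs
```

A result of 0 suggests it is safe to continue. Anything else means some factory processes are still shutting down or need attention.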
 

Areas needing backup

The glidein factory is mostly stateless... if we were to lose the disk used by it, we should be able to reconstruct the gfactory within hours by using a few config files.

Revision 582012/09/22 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 29 to 29
 

Basic Procedures

Reconfiguring Factory

Added:
>
>
 
  1. Change to the instance directory:
    cd $GLIDEIN_FACTORY_DIR
  2. Edit the glideinWMS.xml in the .cfg dir:
    vi ${GLIDEIN_FACTORY_DIR}.cfg/glideinWMS.xml
  3. Diff against the current config:
    diff glideinWMS.xml ${GLIDEIN_FACTORY_DIR}.cfg/glideinWMS.xml
Line: 39 to 40
 

NOTE You may have to try stopping the factory multiple times if the load is high.

Added:
>
>
 

Restarting Factory after Reboot

Line: 159 to 161
  http://www.cs.wisc.edu/condor/downloads-v2/download.pl
Changed:
<
<
For the UCSD factory we currently use condor-<rel>-x86_rhap_5.x-stripped.tar.gz
>
>
For the UCSD factory we currently use condor-rel-x86_rhap_5.x-stripped.tar.gz
 
cd /root/Downloads

Changed:
<
<
wget http://parrot.cs.wisc.edu//symlink/tmp_path_to_tarball/condor-<rel>-x86_rhap_5.x-stripped.tar.gz
>
>
wget http://parrot.cs.wisc.edu//symlink/tmp_path_to_tarball/condor-rel-x86_rhap_5.x-stripped.tar.gz
 

As gfactory stop Factory:

Line: 177 to 179
 
  1. Stop Condor:
    /etc/init.d/condor stop
  2. Run upgrade script:
    
    
Changed:
<
<
$GLIDEIN_SRC_DIR/install/glidecondor_upgrade condor-<rel>-x86_rhap_5.x-stripped.tar.gz
>
>
$GLIDEIN_SRC_DIR/install/glidecondor_upgrade condor-rel-x86_rhap_5.x-stripped.tar.gz
 
  1. Start Condor with init.d script:
    /etc/init.d/condor start
  2. Run top and watch the load. Only proceed after the load average drops considerably and %id is reasonably > 0%
Line: 185 to 187
 As gfactory start the factory:
./factory_startup start
Added:
>
>

Upgrading Glidein Condor

Click here for instructions on how to upgrade the Condor tarballs for glideins to use.

 

Areas needing backup

The glidein factory is mostly stateless... if we were to lose the disk used by it, we should be able to reconstruct the gfactory within hours by using a few config files.

Line: 415 to 421
  Fill out the following fields:
  • For Category select Facilities Operations
Changed:
<
<
  • For Assigned to select cmscompinfrasup-<site name>
>
>
  • For Assigned to select cmscompinfrasup-sitename
 
  • Set Use GGUS to No (this can be changed to Yes later if admins never respond)
  • For Site find the name corresponding to the GLIDEIN_CMSSite attribute in the $GLIDEIN_FACTORY_DIR/glideinWMS.xml
  • Include the CMS Site name in the Title so it is easy to find

Revision 572012/09/21 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 155 to 155
 

Upgrading Factory Condor

Changed:
<
<
Go to the condor website and download tarballs as root user:
>
>
Go to the condor website and download the tarball as root user:
  http://www.cs.wisc.edu/condor/downloads-v2/download.pl

Revision 562012/09/21 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 18 to 18
 
GLIDEIN_SRC_DIR /home/gfactory/glideinWMS glideinWMS source code directory
GLIDEIN_FACTOOLS /home/gfactory/factools factools repo location
Changed:
<
<

Assumed gfactory Path Setup

>
>

Assumed gfactory Path Setup

 
Changed:
<
<
This document assumes the gfactory user has the following set in the $PATH:
>
>
This document assumes the gfactory user has the following set in the $PATH:
 
  • $GLIDEIN_SRC_DIR/factory/tools/
  • $GLIDEIN_FACTOOLS/UCSD/bin/
Line: 42 to 42
 

Restarting Factory after Reboot

Changed:
<
<
  1. As root start httpd:
    /etc/init.d/httpd start
  2. As root start condor:
    /etc/init.d/condor start
  3. Run top and watch the load. Only proceed after the load average drops considerably and %id is reasonably > 0%
  4. As gfactory start the factory:
    $GLIDEIN_FACTORY_DIR/factory_startup start
>
>
  1. As root start httpd:
    /etc/init.d/httpd start
  2. As root start condor:
    /etc/init.d/condor start
  3. Run top and watch the load. Only proceed after the load average drops considerably and %id is reasonably > 0%
  4. As gfactory start the factory:
    $GLIDEIN_FACTORY_DIR/factory_startup start
 

Site Debugging Procedures

Line: 153 to 153
 
  1. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test2 and reconfig
Added:
>
>

Upgrading Factory Condor

Go to the condor website and download tarballs as root user:

http://www.cs.wisc.edu/condor/downloads-v2/download.pl

For the UCSD factory we currently use condor-<rel>-x86_rhap_5.x-stripped.tar.gz

cd /root/Downloads
wget http://parrot.cs.wisc.edu//symlink/tmp_path_to_tarball/condor-<rel>-x86_rhap_5.x-stripped.tar.gz

As gfactory stop Factory:

cd $GLIDEIN_FACTORY_DIR
./factory_startup stop

The next commands need to be run as the root user.

  1. Stop Condor:
    /etc/init.d/condor stop
  2. Run upgrade script:
    $GLIDEIN_SRC_DIR/install/glidecondor_upgrade condor-<rel>-x86_rhap_5.x-stripped.tar.gz
    
  3. Start Condor with init.d script:
    /etc/init.d/condor start
  4. Run top and watch the load. Only proceed after the load average drops considerably and %id is reasonably > 0%

As gfactory start the factory:

./factory_startup start
 

Areas needing backup

The glidein factory is mostly stateless... if we were to lose the disk used by it, we should be able to reconstruct the gfactory within hours by using a few config files.

Revision 552012/09/20 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 11 to 11
 

Variables Used in this Document

Changed:
<
<
A few variables are defined in the gfactory user's ~/.bash_profile for various tools to use. Here is a list of a few that are commonly referenced throughout this document:
>
>
Here is a list of variables used in this document as shorthand for common paths used in factory operations:
 
Variable Value Description
GLIDEIN_FACTORY_DIR /home/gfactory/glideinsubmit/glidein_v2_0 current factory instance directory
GLIDEIN_SRC_DIR /home/gfactory/glideinWMS glideinWMS source code directory
Added:
>
>
GLIDEIN_FACTOOLS /home/gfactory/factools factools repo location

Assumed gfactory Path Setup

This document assumes the gfactory user has the following set in the $PATH:

  • $GLIDEIN_SRC_DIR/factory/tools/
  • $GLIDEIN_FACTOOLS/UCSD/bin/
  • $GLIDEIN_FACTOOLS/generic/bin/
 

Basic Procedures

Line: 325 to 334
 

Running analyze_entries status report

cd $GLIDEIN_FACTORY_DIR
Changed:
<
<
$GLIDEIN_SRC_DIR/factory/tools/analyze_entries -x 24 -s waste
>
>
analyze_entries -x 24 -s waste
 

Run command with -h to print explanation of possible options.

Line: 337 to 346
 An example of how to verify the pilot proxies used by the frontend.

  1. Get a list of the proxies for a VO and CE:
    
    
Changed:
<
<
$GLIDEIN_SRC_DIR/factory/tools/proxy_info fecms ls -l /var/gfactory/clientproxies/user_fecms/glidein_v2_0/entry_CMS_T2_US_UCSD_gw2/
>
>
proxy_info fecms ls -l /var/gfactory/clientproxies/user_fecms/glidein_v2_0/entry_CMS_T2_US_UCSD_gw2/
 
  1. Display a particular proxy's information:
    
    
Changed:
<
<
$GLIDEIN_SRC_DIR/factory/tools/proxy_info fecms info -all /var/gfactory/clientproxies/user_fecms/glidein_v2_0/entry_CMS_T2_US_UCSD_gw2/x509_CMS_T2_US_UCSD_gw2@v2_0@UCSD@UCSD.minus,v5_0.dot,main_umrw_5.proxy
>
>
proxy_info fecms info -all /var/gfactory/clientproxies/user_fecms/glidein_v2_0/entry_CMS_T2_US_UCSD_gw2/x509_CMS_T2_US_UCSD_gw2@v2_0@UCSD@UCSD.minus,v5_0.dot,main_umrw_5.proxy
 

  1. For additional tool help run:
Changed:
<
<
$GLIDEIN_SRC_DIR/factory/tools/proxy_info -h
>
>
proxy_info -h
 

Site Debugging Reference

Revision 542012/09/20 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 19 to 19
 

Basic Procedures

Changed:
<
<

Restarting the Glidein Factory after Reboot

>
>

Reconfiguring Factory

  1. Change to the instance directory:
    cd $GLIDEIN_FACTORY_DIR
  2. Edit the glideinWMS.xml in the .cfg dir:
    vi ${GLIDEIN_FACTORY_DIR}.cfg/glideinWMS.xml
  3. Diff against the current config:
    diff glideinWMS.xml ${GLIDEIN_FACTORY_DIR}.cfg/glideinWMS.xml
  4. When you are satisfied, stop the factory, reconfigure, and then restart:
    ./factory_startup stop
    ./factory_startup reconfig
    ./factory_startup start
    

NOTE You may have to try stopping the factory multiple times if the load is high.
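
Retrying the stop by hand can be replaced with a small loop. The sketch below is illustrative only: `retry` is a hypothetical helper, the attempt count and delay in the example are arbitrary, and it assumes factory_startup exits with a non-zero status when the stop does not succeed, which should be verified before relying on it.

```shell
# Hypothetical helper: run a command up to ATTEMPTS times, sleeping DELAY
# seconds between tries. Returns 0 as soon as the command succeeds,
# or 1 if every attempt fails.
retry() {
    attempts="$1"
    delay="$2"
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Example (from $GLIDEIN_FACTORY_DIR):
#   retry 5 30 ./factory_startup stop && ./factory_startup reconfig
```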

Restarting Factory after Reboot

 
  1. As root start httpd:
    /etc/init.d/httpd start
  2. As root start condor:
    /etc/init.d/condor start
Line: 332 to 344
 

  1. For additional tool help run:
Changed:
<
<
$GLIDEIN_SRC_DIR/proxy_info -h
>
>
$GLIDEIN_SRC_DIR/factory/tools/proxy_info -h
 

Site Debugging Reference

Revision 532012/09/19 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 7 to 7
 

Contents

Changed:
<
<
>
>
 

Variables Used in this Document

Line: 17 to 17
 
GLIDEIN_FACTORY_DIR /home/gfactory/glideinsubmit/glidein_v2_0 current factory instance directory
GLIDEIN_SRC_DIR /home/gfactory/glideinWMS glideinWMS source code directory
Added:
>
>

Basic Procedures

Restarting the Glidein Factory after Reboot

  1. As root start httpd:
    /etc/init.d/httpd start
  2. As root start condor:
    /etc/init.d/condor start
  3. Run top and watch the load. Only proceed after the load average drops considerably and %id is reasonably > 0%
  4. As gfactory start the factory:
    $GLIDEIN_FACTORY_DIR/factory_startup start
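
The "watch the load" step can be approximated with a script instead of watching top. This is a rough sketch, assuming a Linux host where /proc/loadavg is available. `load_below` is a hypothetical helper and the threshold in the example is arbitrary. It only looks at the 1-minute load average, so the %id figure from top still needs a manual check.

```shell
# Hypothetical helper: succeed if the 1-minute load average (first field of
# FILE, default /proc/loadavg) is below THRESHOLD.
load_below() {
    threshold="$1"
    file="${2:-/proc/loadavg}"
    awk -v t="$threshold" '
        NR == 1 { ok = ($1 + 0 < t + 0) }
        END { exit !ok }
    ' "$file"
}

# Example: poll every 30 seconds until the load drops below 5,
# then start the factory as gfactory.
#   until load_below 5; do sleep 30; done
```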

Site Debugging Procedures

NOTE to be written

Maintenance

 

Adding a New Site to Glidein Factory

Entry Templates

Line: 117 to 132
 
  1. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test2 and reconfig
Changed:
<
<

Running the site status report

>
>

Areas needing backup

The glidein factory is mostly stateless... if we were to lose the disk used by it, we should be able to reconstruct the gfactory within hours by using a few config files.

The main configuration file is glideinWMS.xml. It defines almost everything else in a factory configuration.
To be on the safe side, one should however back up the whole factory directory tree... currently this is:

/var/gfactory/glideinsubmit/glidein_v2_0/

Since there may be several factories installed on the same node, backing up the base directory is the easiest solution to not forget any of them:

/var/gfactory/glideinsubmit/

Please notice that the directories above contain symlinks to other areas in the file system;
none of those need to be backed-up, as they can be recreated if needed.
Moreover, while the base factory directory is relatively static and small (currently ~50M), the linked directories are very dynamic and can grow quite a bit.

Nothing else in the factory should need to be backed up;
all the code should be in Git or downloadable from an official repository.

If there are any experimental or in-development code pieces, those should use a separate backup policy.

The factory also heavily relies on Condor, so basic Condor config files should be backed-up as well.
Unfortunately, the config files are split between three directories, so all three must be backed up:

/opt/glidecondor/etc/
/opt/glidecondor/certs/
/etc/condor/

Condor also needs the host certificate to function;

/etc/grid-security 

should thus be backed-up, too.

Nothing else in Condor needs to be backed up, as it can be easily recreated using the glideinWMS installation script.

The same should apply to all other software components the factory is relying on.
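
The directories listed in this section can be collected into one archive. The following is a minimal sketch rather than an established backup procedure: `make_backup` is a hypothetical helper and the archive location in the example is an assumption.

```shell
# Hypothetical helper: write a gzip-compressed tar ARCHIVE of the given paths.
# tar stores symlinks as links by default, so the linked (large, dynamic)
# areas are not pulled into the archive.
make_backup() {
    archive="$1"
    shift
    tar -czf "$archive" "$@"
}

# Example (as root), using the paths listed above:
#   make_backup /root/gfactory-backup.tar.gz \
#       /var/gfactory/glideinsubmit \
#       /opt/glidecondor/etc /opt/glidecondor/certs \
#       /etc/condor /etc/grid-security
```

Because symlinks are stored rather than followed, the archive stays close to the size of the base factory directory plus the Condor config and certificate files.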

Dealing with Scalability limits

The number of files the factory holds open grows with the number of entries in the config, so eventually the gfactory user's maximum open file limit will be hit. When that happens, an error like the following appears in ~/glideinsubmit/glidein_Production_v2_0/log/factory/factory.*.info.log:

 
Changed:
<
<
cd $GLIDEIN_FACTORY_DIR $GLIDEIN_SRC_DIR/factory/tools/analyze_entries -x 24 -s waste
>
>
[2012-05-11T12:44:02-07:00 6730] WARNING: Exception occurred: ['Traceback (most recent call last):\n', ' File "/home/gfactory/glideinWMS/factory/glideFactory.py", line 432, in main\n glideinDescript,entries,restart_attempts,restart_interval)\n', ' File "/home/gfactory/glideinWMS/factory/glideFactory.py", line 213, in spawn\n childs[entry_name]=popen2.Popen3("%s %s %s %s %s %s %s"%(sys.executable,os.path.join(STARTUP_DIR,"glideFactoryEntry.py"),os.getpid(),sleep_time,advertize_rate,startup_dir,entry_name),True)\n', ' File "/usr/lib64/python2.4/popen2.py", line 43, in __init__\n c2pread, c2pwrite = os.pipe()\n', 'OSError: [Errno 24] Too many open files\n']

To deal with this, raise the open file limit. It is currently set to 50240 (about 50k) for the gfactory user. In ~/.bash_profile:

ulimit -n 50240

in /etc/security/limits.conf:

gfactory        hard    nofile          50240

After changing these, log out and back in as gfactory, then stop and restart the factory.
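
To confirm the new limit took effect after logging back in, the current value can be compared with the required one. This is a small sketch: `check_nofile` is a hypothetical helper, and the optional second argument exists only so the comparison can be exercised without changing real limits.

```shell
# Hypothetical helper: succeed if the open file limit is at least REQUIRED.
# CURRENT defaults to the soft limit of the running shell (ulimit -n).
check_nofile() {
    required="$1"
    current="${2:-$(ulimit -n)}"
    if [ "$current" = "unlimited" ]; then
        return 0
    fi
    [ "$current" -ge "$required" ]
}

# Example (as gfactory, in a fresh login shell):
#   check_nofile 50240 || echo "open file limit too low: $(ulimit -n)"
```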

Factory Specific Notes

Factory Software and Patches

UCSD

Date Software Type Description
2012-08-31 gwms branch_v2_6_1_gf1 Adds glidein xml reports and glexec test fix
2012-08-28 condor 7.8.3

GOC-ITB

Date Software Type Description
2012-09-14 gwms branch_v2_6_1_gf1 Adds glidein xml reports and glexec test fix
2012-09-07 condor 7.8.3

GOC

Date Software Type Description
2012-09-18 gwms v2_6_1
2012-09-11 condor 7.8.2 with 7.8.3 pre-release /usr/local/glidecondor/sbin/condor_gridmanager, /usr/local/glidecondor/sbin/nordugrid_gahp, /usr/local/glidecondor/lib/libcondor_utils_7_8_3.so, /usr/local/glidecondor/bin/condor_history patch to fix ARC 1.1.x sites

CERN

Date Software Type Description
2012-08-14 gwms branch_v2_6_gf1 patch to fix analyze_entries reports and make Firefox compatible

GOC Factory Things to Remember

Production Factory: glidein.grid.iu.edu
ITB Factory: glidein-itb.grid.iu.edu

Turn off timeout for sudo

run:

/usr/sbin/visudo

add the following:

Defaults    timestamp_timeout = 0
 
Changed:
<
<
Run command with -h to print explanation of possible options.
>
>

Firewall settings

 
Changed:
<
<
This report is sent to osg-gfactory-reports@physics.ucsd.edu daily.
>
>
For Condor we open the port range 20000-50000. See the /etc/iptables.d files for details. The Condor config must be set to match:
###################
# Firewall limits
###################
HIGHPORT=50000
LOWPORT=20000
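
A quick consistency check on these two settings can catch a swapped or missing value before Condor is restarted. The sketch below is an illustration only: `port_range_from_config` is a hypothetical helper that reads a config fragment given as an argument and does not evaluate Condor macros or include files.

```shell
# Hypothetical helper: print "LOWPORT-HIGHPORT" from a Condor config FILE.
# Fails if either setting is missing or LOWPORT is not below HIGHPORT.
# Only handles plain NAME=value lines; no macro expansion.
port_range_from_config() {
    awk -F'=' '
        /^[ \t]*LOWPORT[ \t]*=/  { gsub(/[ \t]/, "", $2); low = $2 }
        /^[ \t]*HIGHPORT[ \t]*=/ { gsub(/[ \t]/, "", $2); high = $2 }
        END {
            if (low == "" || high == "" || low + 0 >= high + 0) exit 1
            print low "-" high
        }
    ' "$1"
}

# Example, with the path to the local Condor config file as the argument:
#   port_range_from_config /path/to/condor_config.local
```

On a live host, `condor_config_val LOWPORT` and `condor_config_val HIGHPORT` report the values Condor actually uses, which is the more reliable check once Condor is running.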

Frontend Support

Adding a new VO

NOTE this section needs to be updated

For security purposes:

  1. (as root) add the vo user (e.g. fevo1)
    useradd fevo1
  2. (as root) add the new user and group into
    /etc/condor/privsep_config
    sections
    valid-target-uids and valid-target-gids
  3. (as root) Add the VO pilot DNs to
    /etc/grid-security/grid-mapfile
    Note: Only needed if using CREAM.
  4. (as root) Add VO to the Condor config (note: the UNIX user name and the Condor name may be different, but can be the same, as in this example)
    ~/glideinWMS/install# ./glidecondor_addDN -daemon "VO1 Frontend DN" "<VO1 DN>" fevo1
    Reconfig condor
    /opt/glidecondor/sbin/condor_reconfig -collector
  5. (as gfactory) Add the VO to the gfactory config
    ~/glideinsubmit/glidein_v2_0.cfg/glideinWMS.xml
    Example change
    <frontend name="vo1-glidein" identity="fevo1@glidein-1.t2.ucsd.edu">
    <security_classes>
    <security_class name="frontend" username="fevo1"/>
    </security_classes>
    </frontend>

    Reconfig factory
    ~/glideinsubmit/glidein_v2_0$ ./factory_startup reconfig ../glidein_v2_0.cfg/glideinWMS.xml

For resource selection purposes:

  1. Identify the entries they can use (no obvious way just yet)
  2. If they need sites we don't support yet, add an entry for them
    Use the
    VO_blah
    naming convention, so we know who first requested the entry.
  3. For each entry, add the VO in the
    GLIDEIN_Supported_VOs
    attribute.

How To Open A Ticket To Contact Glidein Factory Support

 
Changed:
<
<

Monitoring webpages

Glidein Factory Status

>
>
NOTE This procedure is likely obsolete and needs to be verified with GOC

Although we encourage users to contact us directly at osg-gfactory-support@physics.ucsd.edu, a ticket may be opened should the user deem it appropriate.

Until GOC institutes a custom form for the Glidein Factory, begin by visiting https://ticket.grid.iu.edu/goc/other and load your certificate. Please select the VO on whose behalf you are submitting the ticket. Then under "Add CC" add osg-gfactory-support@physics.ucsd.edu. Finally, type a message to describe the problem and hit submit.

Monitoring Reference

Glidein Factory Status

  http://glidein-1.t2.ucsd.edu:8319/osg_gfactory/factoryStatus.html

Load this page; the default "Entry" is 'total' (it can also be per-site). Then hit "update."

Line: 149 to 277
  Each of these can be temporary, ie, not matched can spike then go down when many jobs are submitted at once. This is not a problem. When the above conditions persist, a problem is more likely.
Changed:
<
<

Glidein Factory Status Now

>
>

Glidein Factory Status Now

  http://glidein-1.t2.ucsd.edu:8319/osg_gfactory/factoryStatusNow.html This page displays a table of live data which corresponds to the same data as shown in the plots under Glidein Factory Status. The information is further divided by VO. See GlideinFactoryStatusNow (NOTE needs updating) for a longer and more detailed discussion.
Changed:
<
<

GFactory log directories

>
>

Log Reference

  Glidein output logs:
$GLIDEIN_FACTORY_DIR/client_log/user_*/entry_*/job.*.out

Line: 180 to 308
 Completed glidein logs:
$GLIDEIN_FACTORY_DIR/log/entry_*/completed_jobs_*.log
Changed:
<
<

Globus Hold Reasons

<--/twistyPlugin twikiMakeVisibleInline-->

| Globus Error Code | Held Reason | Job is Recoverable |
| 10 | globus_xio_gsi: Token size exceeds limit. Usually happens when someone tries to establish an insecure connection with a secure endpoint, e.g. when someone sends plain HTTP to an HTTPS endpoint without | No |
| 121 | the job state file doesn't exist | No |
| 126 | it is unknown if the job was submitted | Yes |
| 12 | the connection to the server failed (check host and port) | Yes |
| 131 | the user proxy expired (job is still running) | Maybe |
| 17 | the job failed when the job manager attempted to run it | No |
| 22 | the job manager failed to create an internal script argument file | No |
| 31 | the job manager failed to cancel the job as requested | No |
| 3 | an I/O operation failed | Yes |
| 47 | the gatekeeper failed to run the job manager | No |
| 48 | the provided RSL could not be properly parsed | No |
| 4 | jobmanager unable to set default to the directory requested | No |
| 76 | cannot access cache files in ~/.globus/.gass_cache, check permissions, quota, and disk space | Maybe (Short term: No) |
| 79 | connecting to the job manager failed. Possible reasons: job terminated, invalid job contact, network problems, ... | No |
| 7 | an authorization operation failed | Yes |
| 7 | authentication with the remote server failed | Yes |
| 8 | the user cancelled the job | No |
| 94 | the jobmanager does not accept any new requests (shutting down) | Yes |
| 9 | the system cancelled the job | No |
| ? | Job failed, no reason given by GRAM server | No |
| 122 | could not read the job state file | Maybe (short term: no) |
| 132 | the job was not submitted by original jobmanager | No (likely to be fatal) |

<--/twistyPlugin-->

Additional notes

  • Globus error 79: connecting to the job manager failed.  Possible reasons: job terminated, invalid job contact, network problems, ..
    • This can happen if it is a condor site and the admin removes held glideins from their side.
    • Also happens to every CERN Production Glidein every Monday on every gt5 site, but as of yet we still don't know why:

  • Globus error 9: the system cancelled the job
    • Happens when sites preempt glideins for exceeding memory limits or preempt opportunistic glideins. (seen at Michigan and BNL)

CREAM Hold Reasons (Work in progress)

Reasons we mostly understand

  • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 500 500-Command failed. : globus_l_gfs_file_open failed.  500-globus_xio: Unable to open file /cream_localsandbox/data/prdcms/_DC_ch_DC_cern_OU_computers_CN_cmspilotjob_vocms157_cern_ch_cms_Role_production_Capability_NULL_prdcms35/99/CREAM999015216/OSB/job.479371.9.out  500-globus_xio: System error in open: No such file or directory  500-globus_xio: A system call failed: No such file or directory  500 End.
    • Happens when CREAM site has no memory of job (possibly removed on remote side) but gridmanager refuses to give up

  • CREAM_Delegate Error: Received NULL fault; the error is due to another cause: FaultString=[connection error] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] - FaultDetail=[Connection timed out]
    • Site likely down. But sometimes it is because we are being blocked by a firewall: https://savannah.cern.ch/support/?120361. To check if it could be firewall try telnet: telnet cream01.iihe.ac.be 8443

  • CREAM_Delegate Error: Authorization error: System error reading local user information
    • We see this on old CREAM installs that don't like / symbols in DNs

  • CREAM_Delegate Error: Authorization error: Failed to get the local user id via glexec

  • CREAM error: CREAM_Job_Register Error: MethodName=[jobRegister] Timestamp=[Tue 18 Oct 2011 06:39:04] ErrorCode=[0] Description=[delegation error: delegation id "1318117200.654933" not found!] FaultCause=[delegProxyInfo "1318117200.654933" not found!]
    • This happens when there is a really old running glidein on the queue (likely lost in rundiff) with an expired lease. It prevents all later glideins with same user from obtaining a new lease. Just remove it, the held jobs should recover.
>
>

Tool Reference

 
Changed:
<
<

Reasons we don't understand

  • CREAM error: Transfer failed: GRIDFTP_TRANSFER timed out
  • CREAM error: BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:qsub: Queue is not enabled MSG=queue is disabled: user cmprd003@ce08.pic.es, queue glong_sl5-) N/A (jobId = CREAM102242343)
    • This can be seen when glideins are submitted to a site in downtime.

  • The following are likely because the job manager on the other end has no record of the glideins anymore and can probably just safely be removed (if the site isn't in downtime).
    • CREAM error: reason=999
      • (not really sure what this means)
    • CREAM error: CREAM_Job_Purge Error: job does not exist
    • CREAM error: job aborted because the execution of the JOB_START command has been interrupted by the CREAM shutdown
      
    • CREAM error: BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:pbs_iff: cannot read reply from pbs_server-No Permission.-qsub: cannot connect to server pbs03.pic.es (errno=15007) Unauthorized Request -) N/A (jobId = CREAM258408629)
>
>

Running analyze_entries status report

cd $GLIDEIN_FACTORY_DIR
$GLIDEIN_SRC_DIR/factory/tools/analyze_entries -x 24 -s waste
 
Deleted:
<
<
    • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 500 500-Command failed. : globus_l_gfs_file_open failed.  500-globus_xio: Unable to open file /opt/glite/var/cream_sandbox/lt2-cmsprd/_DC_ch_DC_cern_OU_computers_CN_cmspilotjob_vocms157_cern_ch_cms_Role_production_Capability_NULL_lt2-cmsprd713/95/CREAM956614543/OSB/job.714256.8.out  500-globus_xio: System error in open: No such file or directory  500-globus_xio: A system call failed: No such file or directory  500 End.
  • If the following are seen over many entries served by the same gridmanager it may be a local issue (but not always). Killing the gridmanager without -9 seems to clear them up:
    • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 500 Command failed. : globus_xio: An end of file occurred
    • CREAM error: Transfer failed: globus_ftp_control_local_port(): Handle not in the proper state CLOSING.
    • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 530 530-Login incorrect. : globus_gss_assist: Error invoking callout  530-globus_callout_module: The callout returned an error  530-an unknown error occurred  530 End.
    • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 500 500-Command failed. : callback failed.  500-an end-of-file was reached  500-globus_xio: The GSI XIO driver failed to establish a secure connection. The failure occured during a handshake read.  500-globus_xio: An end of file occurred  500 End.
    • CREAM error: Transfer failed: globus_ftp_control: gss_init_sec_context failed globus_gsi_callback_module: Could not verify credential globus_gsi_callback_module: Could not verify credential globus_gsi_callback_module: Invalid CRL: The available CRL has expired

Restarting the Glidein Factory after Reboot

  1. As root start httpd:
    /etc/init.d/httpd start
  2. As root start condor:
    /etc/init.d/condor start
  3. Run top and watch the load. Only proceed after the load average drops considerably and %id is reasonably > 0%
  4. As gfactory start the factory:
    $GLIDEIN_FACTORY_DIR/factory_startup start

Adding a new VO

NOTE this section needs to be updated

For security purposes:

  1. (as root) add the vo user (e.g. fevo1)
    useradd fevo1
  2. (as root) add the new user and group into
    /etc/condor/privsep_config
    sections
    valid-target-uids and valid-target-gids
  3. (as root) Add the VO pilot DNs to
    /etc/grid-security/grid-mapfile
    Note: Only needed if using CREAM.
  4. (as root) Add VO to the Condor config (note: the UNIX user name and the Condor name may be different, but can be the same, as in this example)
    ~/glideinWMS/install# ./glidecondor_addDN -daemon "VO1 Frontend DN" "<VO1 DN>" fevo1
    Reconfig condor
    /opt/glidecondor/sbin/condor_reconfig -collector
  5. (as gfactory) Add the VO to the gfactory config
    ~/glideinsubmit/glidein_v2_0.cfg/glideinWMS.xml
    Example change
    <frontend name="vo1-glidein" identity="fevo1@glidein-1.t2.ucsd.edu">
    <security_classes>
    <security_class name="frontend" username="fevo1"/>
    </security_classes>
    </frontend>

    Reconfig factory
    ~/glideinsubmit/glidein_v2_0$ ./factory_startup reconfig ../glidein_v2_0.cfg/glideinWMS.xml

For resource selection purposes:

  1. Identify the entries they can use (no obvious way just yet)
  2. If they need sites we don't support yet, add an entry for them
    Use the
    VO_blah
    naming convention, so we know who first requested the entry.
  3. For each entry, add the VO in the
    GLIDEIN_Supported_VOs
    attribute.

Areas needing backup

 
Changed:
<
<
The glidein factory is mostly stateless... if we were to lose the disk used by it, we should be able to reconstruct the gfactory within hours by using a few config files.

The main configuration file is glideinWMS.xml. It defines almost everything else in a factory configuration.
To be on the safe side, one should however backup the whole factory directory tree... currently this is:

/var/gfactory/glideinsubmit/glidein_v2_0/
>
>
Run command with -h to print explanation of possible options.
 
Changed:
<
<
Since there may be several factories installed on the same node, backing up the base directory is the easiest solution to not forget any of them:
/var/gfactory/glideinsubmit/
>
>
This report is sent to osg-gfactory-reports@physics.ucsd.edu daily.
 
Changed:
<
<
Please notice that the directories above contain symlinks to other areas in the file system;
none of those need to be backed-up, as they can be recreated if needed.
Moreover, while the base factory directory is relatively static and small (currently ~50M), the linked directories are very dynamic and can grow quite a bit.
>
>

Using proxy_info to Verify Pilot Proxies

 
Changed:
<
<
Nothing else in the factory should need to be backed up;
all the code should be in Git or downloadable from an official repository.
>
>
An example of how to verify the pilot proxies used by the frontend.
 
Changed:
<
<
If there are any experimental or in-development code pieces, those should use a separate backup policy.
>
>
  1. Get a list of the proxies for a VO and CE:
    $GLIDEIN_SRC_DIR/factory/tools/proxy_info fecms ls -l /var/gfactory/clientproxies/user_fecms/glidein_v2_0/entry_CMS_T2_US_UCSD_gw2/
 
Changed:
<
<
The factory also heavily relies on Condor, so basic Condor config files should be backed-up as well.
Unfortunately, the config files are split between three directories, so all three must be backed up:
/opt/glidecondor/etc/
/opt/glidecondor/certs/
/etc/condor/
>
>
  1. Display a particular proxy's information:
    $GLIDEIN_SRC_DIR/factory/tools/proxy_info fecms info -all /var/gfactory/clientproxies/user_fecms/glidein_v2_0/entry_CMS_T2_US_UCSD_gw2/x509_CMS_T2_US_UCSD_gw2@v2_0@UCSD@UCSD.minus,v5_0.dot,main_umrw_5.proxy
    
 
Changed:
<
<
Condor also needs the host certificate to function;
/etc/grid-security 
>
>
  1. For additional tool help run:
    $GLIDEIN_SRC_DIR/factory/tools/proxy_info -h
 
Changed:
<
<
should thus be backed-up, too.

Nothing else in Condor needs to be backed up, as it can be easily recreated using the glideinWMS installation script.

>
>

Site Debugging Reference

 
Deleted:
<
<
The same should apply to all other software components the factory is relying on.
 

How to contact Grid sites

Non-CMS issues at OSG sites

Line: 325 to 367
 
  • Include the CMS Site name in the Title so it is easy to find
  • in Add Email Addresses add osg-gfactory-support@physics.ucsd.edu
Added:
>
>
NOTE if the site squad in question cannot be found in Assigned to, follow the instructions below:

Non-CMS issues at European sites

 

Non-CMS issues at European sites

Use GOC:

Line: 335 to 381
  Explain in the Description it is an EGI resource along with the GLIDEIN_ResourceName and request to have it forwarded to GGUS on behalf of the affected VO.
Changed:
<
<

Verification of Pilot Proxies

An example of how to verify the pilot proxies used by the frontend.

  1. Get a list of the proxies for a VO and CE:
    $GLIDEIN_SRC_DIR/factory/tools/proxy_info fecms ls -l /var/gfactory/clientproxies/user_fecms/glidein_v2_0/entry_CMS_T2_US_UCSD_gw2/

  1. Display a particular proxy's information:
    $GLIDEIN_SRC_DIR/factory/tools/proxy_info fecms info -all /var/gfactory/clientproxies/user_fecms/glidein_v2_0/entry_CMS_T2_US_UCSD_gw2/x509_CMS_T2_US_UCSD_gw2@v2_0@UCSD@UCSD.minus,v5_0.dot,main_umrw_5.proxy
    

  1. For additional tool help run:
    $GLIDEIN_SRC_DIR/factory/tools/proxy_info -h
    

How To Open A Ticket To Contact Glidein Factory Support

NOTE This procedure is likely obsolete and needs to be verified with GOC

Although we encourage users to contact us directly at osg-gfactory-support@physics.ucsd.edu, a ticket may be opened should the user deem it appropriate.

Until GOC institutes a custom form for the Glidein Factory, begin by visiting https://ticket.grid.iu.edu/goc/other and load your certificate. Select the VO on whose behalf you are submitting the ticket. Then under "Add CC" add osg-gfactory-support@physics.ucsd.edu. Finally, type a message describing the problem and hit submit.

Dealing with Scalability limits

The factory scales with the number of entries in the config, and eventually the gfactory user's max open file limit will be hit. This can be seen in ~/glideinsubmit/glidein_Production_v2_0/log/factory/factory.*.info.log:

[2012-05-11T12:44:02-07:00 6730] WARNING: Exception occurred: ['Traceback (most recent call last):\n', '  File "/home/gfactory/glideinWMS/factory/glideFactory.py", line 432, in main\n    glideinDescript,entries,restart_attempts,restart_interval)\n', '  File "/home/gfactory/glideinWMS/factory/glideFactory.py", line 213, in spawn\n    childs[entry_name]=popen2.Popen3("%s %s %s %s %s %s %s"%(sys.executable,os.path.join(STARTUP_DIR,"glideFactoryEntry.py"),os.getpid(),sleep_time,advertize_rate,startup_dir,entry_name),True)\n', '  File "/usr/lib64/python2.4/popen2.py", line 43, in __init__\n    c2pread, c2pwrite = os.pipe()\n', 'OSError: [Errno 24] Too many open files\n']

To deal with this, raise the open file limit. We currently set it to 50240 (about 50k) for the gfactory user. In ~/.bash_profile:

ulimit -n 50240

in /etc/security/limits.conf:

gfactory        hard    nofile          50240
>
>

Globus Hold Reasons

 
Changed:
<
<
After changing, log out, log back in as gfactory, and stop/restart the factory.
>
>

Globus Error Code Held Reason Job is Recoverable
10 globus_xio_gsi: Token size exceeds limit. Usually happens when someone tries to establish a insecure connection with a secure endpoint, e.g. when someone sends plain HTTP to a HTTPS endpoint without No
121 the job state file doesn't exist No
126 it is unknown if the job was submitted Yes
12 the connection to the server failed (check host and port) Yes
131 the user proxy expired (job is still running) Maybe
17 the job failed when the job manager attempted to run it No
22 the job manager failed to create an internal script argument file No
31 the job manager failed to cancel the job as requested No
3 an I/O operation failed Yes
47 the gatekeeper failed to run the job manager No
48 the provided RSL could not be properly parsed No
4 jobmanager unable to set default to the directory requested No
76 cannot access cache files in ~/.globus/.gass_cache, check permissions, quota, and disk space Maybe (Short term: No)
79 connecting to the job manager failed. Possible reasons: job terminated, invalid job contact, network problems, ... No
7 an authorization operation failed Yes
7 authentication with the remote server failed Yes
8 the user cancelled the job No
94 the jobmanager does not accept any new requests (shutting down) Yes
9 the system cancelled the job No
? Job failed, no reason given by GRAM server No
122 could not read the job state file Maybe (short term: no)
132 the job was not submitted by original jobmanager No (likely to be fatal)

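
For scripting against held glideins, the table above can be sketched as a lookup. The function name is hypothetical and the mapping simply restates the "Job is Recoverable" column; any code not listed in the table prints Unknown.

```shell
# Map a Globus error code to the "Job is Recoverable" column of the table above.
globus_hold_recoverable() {
    case "$1" in
        3|7|12|94|126)                      echo Yes ;;
        76|122|131)                         echo Maybe ;;
        4|8|9|10|17|22|31|47|48|79|121|132) echo No ;;
        *)                                  echo Unknown ;;
    esac
}
```

A "Maybe" still needs a human decision; for 76 and 122 the table says the short-term answer is no.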
 
Changed:
<
<

Factory Software and Patches

>
>

Additional notes

 
Changed:
<
<

UCSD

>
>
  • Globus error 79: connecting to the job manager failed.  Possible reasons: job terminated, invalid job contact, network problems, ..
    • This can happen if it is a condor site and the admin removes held glideins from their side.
    • Also happens to every CERN Production Glidein every Monday on every gt5 site; the cause is still unknown.
 
Changed:
<
<
Date Software Type Description
2012-08-31 gwms branch_v2_6_1_gf1 Adds glidein xml reports and glexec test fix
2012-08-28 condor 7.8.3
>
>
  • Globus error 9: the system cancelled the job
    • Happens when sites preempt glideins for exceeding memory limits or preempt opportunistic glideins. (seen at Michigan and BNL)
 
Changed:
<
<

GOC-ITB

Date Software Type Description
2012-09-14 gwms branch_v2_6_1_gf1 Adds glidein xml reports and glexec test fix
2012-09-07 condor 7.8.3
>
>

CREAM Hold Reasons (Work in progress)

 
Changed:
<
<

GOC

Date Software Type Description
2012-09-18 gwms v2_6_1
2012-09-11 condor 7.8.2 with 7.8.3 pre-release /usr/local/glidecondor/sbin/condor_gridmanager, /usr/local/glidecondor/sbin/nordugrid_gahp, /usr/local/glidecondor/lib/libcondor_utils_7_8_3.so, /usr/local/glidecondor/bin/condor_history patch to fix ARC 1.1.x sites
>
>

Reasons we mostly understand

 
Changed:
<
<

CERN

Date Software Type Description
2012-08-14 gwms branch_v2_6_gf1 patch to fix analyze_entries reports and make Firefox compatible
>
>
  • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 500 500-Command failed. : globus_l_gfs_file_open failed.  500-globus_xio: Unable to open file /cream_localsandbox/data/prdcms/_DC_ch_DC_cern_OU_computers_CN_cmspilotjob_vocms157_cern_ch_cms_Role_production_Capability_NULL_prdcms35/99/CREAM999015216/OSB/job.479371.9.out  500-globus_xio: System error in open: No such file or directory  500-globus_xio: A system call failed: No such file or directory  500 End.
    • Happens when CREAM site has no memory of job (possibly removed on remote side) but gridmanager refuses to give up
 
Changed:
<
<

GOC Factory Things to Remember

>
>
  • CREAM_Delegate Error: Received NULL fault; the error is due to another cause: FaultString=[connection error] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] - FaultDetail=[Connection timed out]
    • The site is likely down, but sometimes we are being blocked by a firewall: https://savannah.cern.ch/support/?120361. To check whether a firewall could be the cause, try telnet: telnet cream01.iihe.ac.be 8443
 
Changed:
<
<
Production Factory: glidein.grid.iu.edu
ITB Factory: glidein-itb.grid.iu.edu
>
>
  • CREAM_Delegate Error: Authorization error: System error reading local user information
    • We see this on old CREAM installs that don't like / symbols in DNs
 
Changed:
<
<

Turn off timeout for sudo

>
>
  • CREAM_Delegate Error: Authorization error: Failed to get the local user id via glexec
 
Changed:
<
<
run:
/usr/sbin/visudo
>
>
  • CREAM error: CREAM_Job_Register Error: MethodName=[jobRegister] Timestamp=[Tue 18 Oct 2011 06:39:04] ErrorCode=[0] Description=[delegation error: delegation id "1318117200.654933" not found!] FaultCause=[delegProxyInfo "1318117200.654933" not found!]
    • This happens when there is a really old running glidein in the queue (likely lost in rundiff) with an expired lease. It prevents all later glideins from the same user from obtaining a new lease. Just remove it; the held jobs should recover.
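
The telnet check mentioned for the CREAM_Delegate "Connection timed out" error can be sketched non-interactively. This assumes bash (for its /dev/tcp redirection) and the coreutils timeout command are available; the function name and the 5 second default are illustrative only.

```shell
# Succeeds if a TCP connection to host:port can be opened within the timeout.
# Uses bash's /dev/tcp pseudo-device, so bash is invoked explicitly.
port_open() {
    host=$1
    port=$2
    timeout "${3:-5}" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}
```

For example, `port_open cream01.iihe.ac.be 8443 && echo reachable`. A failure does not distinguish a firewall from a site that is down, so it only narrows the possibilities.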
 
Changed:
<
<
add the following:
>
>

Reasons we don't understand

  • CREAM error: Transfer failed: GRIDFTP_TRANSFER timed out
  • CREAM error: BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:qsub: Queue is not enabled MSG=queue is disabled: user cmprd003@ce08.pic.es, queue glong_sl5-) N/A (jobId = CREAM102242343)
    • This can be seen when glideins are submitted to a site in downtime.
 
Changed:
<
<
Defaults    timestamp_timeout = 0
>
>
  • The following are likely because the job manager on the other end has no record of the glideins anymore and can probably just safely be removed (if the site isn't in downtime).
    • CREAM error: reason=999
      • (not really sure what this means)
    • CREAM error: CREAM_Job_Purge Error: job does not exist
    • CREAM error: job aborted because the execution of the JOB_START command has been interrupted by the CREAM shutdown
 
Changed:
<
<

Firewall settings

For condor we give a port range of 20k-50k. See the /etc/iptables.d files for details. Also the condor config must know about it:

###################
# Firewall limits
###################
HIGHPORT=50000
LOWPORT=20000
>
>
    • CREAM error: BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:pbs_iff: cannot read reply from pbs_server-No Permission.-qsub: cannot connect to server pbs03.pic.es (errno=15007) Unauthorized Request -) N/A (jobId = CREAM258408629)
 
Added:
>
>
    • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 500 500-Command failed. : globus_l_gfs_file_open failed.  500-globus_xio: Unable to open file /opt/glite/var/cream_sandbox/lt2-cmsprd/_DC_ch_DC_cern_OU_computers_CN_cmspilotjob_vocms157_cern_ch_cms_Role_production_Capability_NULL_lt2-cmsprd713/95/CREAM956614543/OSB/job.714256.8.out  500-globus_xio: System error in open: No such file or directory  500-globus_xio: A system call failed: No such file or directory  500 End.
  • If the following are seen over many entries served by the same gridmanager it may be a local issue (but not always). Killing the gridmanager without -9 seems to clear them up:
    • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 500 Command failed. : globus_xio: An end of file occurred
    • CREAM error: Transfer failed: globus_ftp_control_local_port(): Handle not in the proper state CLOSING.
    • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 530 530-Login incorrect. : globus_gss_assist: Error invoking callout  530-globus_callout_module: The callout returned an error  530-an unknown error occurred  530 End.
    • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 500 500-Command failed. : callback failed.  500-an end-of-file was reached  500-globus_xio: The GSI XIO driver failed to establish a secure connection. The failure occured during a handshake read.  500-globus_xio: An end of file occurred  500 End.
    • CREAM error: Transfer failed: globus_ftp_control: gss_init_sec_context failed globus_gsi_callback_module: Could not verify credential globus_gsi_callback_module: Could not verify credential globus_gsi_callback_module: Invalid CRL: The available CRL has expired
 
Changed:
<
<

Things to Remember from Glidein Factory Training week 11/11

CREAM issues with lease renewal

condor-g sets expired leases for some of our held glideins. This can be checked in the job classad JobLeaseExpiration.

New Improved Docs (based on Alison's notes)

>
>

New Improved Docs (based on Alison's notes)

  FactoryOpsGlideinWMS
Line: 501 to 518
 Future work
  • rpms
Changed:
<
<

Authors

>
>

Authors

  -- TerrenceMartin

Revision 522012/09/19 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 11 to 11
 

Variables Used in this Document

Changed:
<
<
The current factory instance directory is commonly referenced throughout this document and is set in the gfactory user's ~/.bash_profile for various tools to use:
>
>
A few variables are defined in the gfactory user's ~/.bash_profile for various tools to use. Here is a list of a few that are commonly referenced throughout this document:
 
Changed:
<
<
export GLIDEIN_FACTORY_DIR=/home/gfactory/glideinsubmit/glidein_v2_0
>
>
Variable Value Description
GLIDEIN_FACTORY_DIR /home/gfactory/glideinsubmit/glidein_v2_0 current factory instance directory
GLIDEIN_SRC_DIR /home/gfactory/glideinWMS glideinWMS source code directory
 

Adding a New Site to Glidein Factory

Entry Templates

Added:
>
>
NOTE this section is likely obsolete
 CMS cream:

%TWISTY{

Line: 82 to 86
 
  • -disable_old - if site is in original config but no longer in "other" config, disable it

Cloning GOC Factory

Changed:
<
<
  1. Do a "merge only" first to properly disable newly decommissioned sites without bringing in old ones with enabled="False".
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test -merge only glideinWMS.xml
  2. Run a second time with merge disabled and exclude disabled entries.
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -merge no -out glideinWMS.xml.test2 -exclude enabled False -merge no glideinWMS.xml.test
>
>
  1. Do a "merge only" first to properly disable newly decommissioned sites without bringing in old ones with enabled="False".
    $GLIDEIN_SRC_DIR/creation/clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test -merge only glideinWMS.xml
    
  2. Run a second time with merge disabled and exclude disabled entries.
    $GLIDEIN_SRC_DIR/creation/clone_glidein -other glideinWMS.xml.ucsd -merge no -out glideinWMS.xml.test2 -exclude enabled False -merge no glideinWMS.xml.test
    
 
  1. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test2 and reconfig

Cloning GOC-ITB Factory

Changed:
<
<
  1. Do a "merge only" first to properly disable newly decommissioned sites without bringing in old ones with enabled="False". Also ignore special EGI glexec for now until glideinWMS 2.6.
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test1 -merge only glideinWMS.xml
  2. Run a second time with merge disabled but exclude a few experimental entries
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -merge no -out glideinWMS.xml.test2 -exclude enabled False -merge no glideinWMS.xml.test1
  3. Change OSGVO and OSGVOHTPC to OSGVO_ITB and OSGVOHTPC_ITB
    sed 's/OSGVOHTPC/OSGVOHTPC_ITB/g' glideinWMS.xml.test2 | sed 's/OSGVO,/OSGVO_ITB,/g' | sed 's/OSGVO\"/OSGVO_ITB\"/g' > glideinWMS.xml.test3
>
>
  1. Do a "merge only" first to properly disable newly decommissioned sites without bringing in old ones with enabled="False". Also ignore special EGI glexec for now until glideinWMS 2.6.
    $GLIDEIN_SRC_DIR/creation/clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test1 -merge only glideinWMS.xml
    
  2. Run a second time with merge disabled but exclude a few experimental entries
    $GLIDEIN_SRC_DIR/creation/clone_glidein -other glideinWMS.xml.ucsd -merge no -out glideinWMS.xml.test2 -exclude enabled False -merge no glideinWMS.xml.test1
    
  3. Change OSGVO and OSGVOHTPC to OSGVO_ITB and OSGVOHTPC_ITB
    sed 's/OSGVOHTPC/OSGVOHTPC_ITB/g' glideinWMS.xml.test2 | sed 's/OSGVO,/OSGVO_ITB,/g' | sed 's/OSGVO\"/OSGVO_ITB\"/g' > glideinWMS.xml.test3
    
 
  1. Engage_US_MWT2_osg and HCC_US_BNL_gk02: Add CMSOverflow to GLIDEIN_Supported_VO
  2. OSG_CrossOSG_ce: Change GLIDEIN_Supported_VO to CMS
  3. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test2 and reconfig
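
The three chained sed substitutions in the renaming step can be sketched as a single function, which makes them easy to try on a sample line before touching the real config. The function name is made up for illustration; the expressions are the ones from the step above, applied in the same order.

```shell
# Rename OSGVO and OSGVOHTPC to their _ITB variants, reading stdin and writing stdout.
# Order matters: OSGVOHTPC is handled first so the OSGVO rules never see it.
# Not idempotent: a second run would turn OSGVOHTPC_ITB into OSGVOHTPC_ITB_ITB.
rename_itb_vos() {
    sed -e 's/OSGVOHTPC/OSGVOHTPC_ITB/g' \
        -e 's/OSGVO,/OSGVO_ITB,/g' \
        -e 's/OSGVO"/OSGVO_ITB"/g'
}
```

Usage: `rename_itb_vos < glideinWMS.xml.test2 > glideinWMS.xml.test3`, then diff the two files to review the changes.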

Cloning CERN Factory

Changed:
<
<
  1. Do a "merge only" first to properly disable newly decommissioned sites without bringing in old ones with enabled="False". Also preserve comments.
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test -merge only -preserve_comments glideinWMS.xml
  2. Run a second time with merge disabled but only include what we want
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test2 -merge no -include name CMS_T1 -include name CMS_T2 -include name CMS_T3 -exclude enabled False -exclude name CMS_T1_US_FNAL_ce3 -exclude name CMS_T3_US_Omaha_tusker_bigmem -exclude name CMS_T3_US_Omaha_tusker_long_3d glideinWMS.xml.test
>
>
  1. Do a "merge only" first to properly disable newly decommissioned sites without bringing in old ones with enabled="False". Also preserve comments.
    $GLIDEIN_SRC_DIR/creation/clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test -merge only -preserve_comments glideinWMS.xml
    
  2. Run a second time with merge disabled but only include what we want
    $GLIDEIN_SRC_DIR/creation/clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test2 -merge no -include name CMS_T1 -include name CMS_T2 -include name CMS_T3 -exclude enabled False -exclude name CMS_T1_US_FNAL_ce3 -exclude name CMS_T3_US_Omaha_tusker_bigmem -exclude name CMS_T3_US_Omaha_tusker_long_3d glideinWMS.xml.test
    
 
  1. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test2 and reconfig

Running the site status report

Changed:
<
<
cd $GLIDEIN_FACTORY_DIR
~/glideinWMS/factory/tools/analyze_entries -x 24 -s waste
>
>
cd $GLIDEIN_FACTORY_DIR
$GLIDEIN_SRC_DIR/factory/tools/analyze_entries -x 24 -s waste
  Run command with -h to print explanation of possible options.
Line: 131 to 150
 Each of these can be temporary, i.e., not matched can spike then go down when many jobs are submitted at once. This is not a problem. When the above conditions persist, a problem is more likely.

Glidein Factory Status Now

Changed:
<
<
http://glidein-1.t2.ucsd.edu:8319/osg_gfactory/factoryStatusNow.html This page displays a table of live data which corresponds to the same data as shown in the plots under Glidein Factory Status. The information is further divided by VO. See GlideinFactoryStatusNow for a longer and more detailed discussion.
>
>
http://glidein-1.t2.ucsd.edu:8319/osg_gfactory/factoryStatusNow.html This page displays a table of live data which corresponds to the same data as shown in the plots under Glidein Factory Status. The information is further divided by VO. See GlideinFactoryStatusNow (NOTE needs updating) for a longer and more detailed discussion.
 

GFactory log directories

Line: 233 to 252
 

Adding a new VO

Added:
>
>
NOTE this section needs to be updated
 For security purposes:
  1. (as root) add the vo user (e.g. fevo1)
    useradd fevo1
  2. (as root) add the new user and group into
    /etc/condor/privsep_config
    sections
    valid-target-uids and valid-target-gids
Line: 246 to 267
 
  1. For each entry, add the VO in the
    GLIDEIN_Supported_VOs
    attribute.

Areas needing backup

Changed:
<
<
The glidein factory is mostly stateless... if we were to loose the disk used by it, we should be able to reconstruct the gfactory within hours by using a few config files.
>
>
The glidein factory is mostly stateless... if we were to lose the disk used by it, we should be able to reconstruct the gfactory within hours by using a few config files.
  The main configuration file is glideinWMS.xml. It defines almost everything else in a factory configuration.
To be on the safe side, one should however backup the whole factory directory tree... currently this is:
/var/gfactory/glideinsubmit/glidein_v2_0/
Line: 256 to 277
  Please notice that the directories above contain symlinks to other areas in the file system;
none of those need to be backed-up, as they can be recreated if needed.
Moreover, while the base factory directory is relatively static and small (currently ~50M), the linked directories are very dynamic and can grow quite a bit.
Changed:
<
<
Nothing else in the factory should need to be backed up;
all the code should be in CVS or downloadable from an official repository.
>
>
Nothing else in the factory should need to be backed up;
all the code should be in Git or downloadable from an official repository.
  If there are any experimental or in-development code pieces, those should use a separate backup policy.
Line: 276 to 297
 The same should apply to all other software components the factory is relying on.

How to contact Grid sites

Changed:
<
<
For OSG sites, use the GOC
https://ticket.grid.iu.edu/goc/open

For CMS sites, use Savannah:

https://savannah.cern.ch/
>
>

Non-CMS issues at OSG sites

Use GOC:
 
Changed:
<
<
Search for: CMS
>
>
https://ticket.grid.iu.edu/goc/submit
 
Changed:
<
<
Select: CMS Computing Infrastructure Support
>
>
  • For Email Address use osg-gfactory-support@physics.ucsd.edu
  • Check the Resource box and find the name corresponding to the GLIDEIN_ResourceName attribute in the $GLIDEIN_FACTORY_DIR/glideinWMS.xml
  • Include the Resource name in the Title so it is easy to find
 
Changed:
<
<
Use: Submit a new item https://savannah.cern.ch/support/?group=cmscompinfrasup&func=additem
>
>

CMS issues for All Sites

Use Savannah:
 
Changed:
<
<
For Non-CMS European sites us GOC:
>
>
https://savannah.cern.ch/
 
Changed:
<
<
https://ticket.grid.iu.edu/goc/other
>
>
  1. Search for CMS
  2. Select CMS Computing Infrastructure  Support
  3. Click Submit a new item
 
Changed:
<
<
Select the proper VO and make note that it is for an EGI or WLCG resource and GOC will forward to GGUS.
>
>
NOTE the following assumes you have administrative rights
 
Changed:
<
<

Verification of Proxies

>
>
Fill out the following fields:
  • For Category select Facilities Operations
  • For Assigned to select cmscompinfrasup-<site name>
  • Set Use GGUS to No (this can be changed to Yes later if admins never respond)
  • For Site find the name corresponding to the GLIDEIN_CMSSite attribute in the $GLIDEIN_FACTORY_DIR/glideinWMS.xml
  • Include the CMS Site name in the Title so it is easy to find
  • in Add Email Addresses add osg-gfactory-support@physics.ucsd.edu
 
Changed:
<
<
An example of how to verify the proxies used by the frontend. Log into the gfactory first.
>
>

Non-CMS issues at European sites

 
Changed:
<
<
Setup the environment for voms-proxy tools.

source /opt/vdt/setup.sh
>
>
Use GOC:
 
Changed:
<
<
Access the factory tools
>
>
https://ticket.grid.iu.edu/goc/submit
 
Changed:
<
<
cd ~/glideinWMS/factory/tools
>
>
IMPORTANT leave Resource unchecked.
 
Changed:
<
<
Get a list of the proxies for a VO and CE
>
>
Explain in the Description it is an EGI resource along with the GLIDEIN_ResourceName and request to have it forwarded to GGUS on behalf of the affected VO.
 
Changed:
<
<
./proxy_info fecms ls -l '/var/gfactory/clientproxies/user_fecms/glidein_v2_0/entry_CMS_T2_US_UCSD_gw2/' 
>
>

Verification of Pilot Proxies

 
Changed:
<
<
Display a particular proxies information
>
>
An example of how to verify the pilot proxies used by the frontend.
 
Changed:
<
<
./proxy_info fecms info '/var/gfactory/clientproxies/user_fecms/glidein_v2_0/entry_CMS_T2_US_UCSD_gw2/x509_CMS_T2_US_UCSD_gw2@v2_0@UCSD@UCSD.minus,v5_0.dot,main_umrw_5.proxy' 
>
>
  1. Get a list of the proxies for a VO and CE:
    $GLIDEIN_SRC_DIR/factory/tools/proxy_info fecms ls -l /var/gfactory/clientproxies/user_fecms/glidein_v2_0/entry_CMS_T2_US_UCSD_gw2/
 
Changed:
<
<
For additional tool help run
>
>
  1. Display a particular proxy's information:
    $GLIDEIN_SRC_DIR/factory/tools/proxy_info fecms info -all /var/gfactory/clientproxies/user_fecms/glidein_v2_0/entry_CMS_T2_US_UCSD_gw2/x509_CMS_T2_US_UCSD_gw2@v2_0@UCSD@UCSD.minus,v5_0.dot,main_umrw_5.proxy
    
 
Changed:
<
<
 ./proxy_info -h
>
>
  1. For additional tool help run:
    $GLIDEIN_SRC_DIR/factory/tools/proxy_info -h
 

How To Open A Ticket To Contact Glidein Factory Support

Added:
>
>
NOTE This procedure is likely obsolete and needs to be verified with GOC
 Although we encourage users to contact us directly at osg-gfactory-support@physics.ucsd.edu, a ticket may be opened should the user deem it appropriate.

Until GOC institutes a custom form for the Glidein Factory, begin by visiting https://ticket.grid.iu.edu/goc/other and load your certificate. Select the VO on whose behalf you are submitting the ticket. Then under "Add CC" add osg-gfactory-support@physics.ucsd.edu. Finally, type a message describing the problem and hit submit.

Line: 351 to 374
  After changing, log out, log back in as gfactory, and stop/restart the factory.
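
After logging back in, the new limit can be checked before restarting the factory. The helper below is a sketch; its name is hypothetical, and 50240 is the value this document uses for the gfactory user.

```shell
# Succeeds if the open file limit is at least the required value.
# The second argument defaults to the current shell's soft limit (ulimit -n).
nofile_ok() {
    required=$1
    current=${2:-$(ulimit -n)}
    [ "$current" = "unlimited" ] || [ "$current" -ge "$required" ]
}
```

For example, `nofile_ok 50240 || echo "open file limit too low: $(ulimit -n)"` run as gfactory.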
Changed:
<
<

Factory Patches

>
>

Factory Software and Patches

 

UCSD

Date Software Type Description
Changed:
<
<
2012-08-31 gwms running branch_v2_6_1_gf1 Adds glidein xml reports and glexec test fix
2012-08-28 condor 7.6.3 pre-release /opt/glidecondor/sbin/condor_gridmanager, /opt/glidecondor/sbin/nordugrid_gahp, /opt/glidecondor/lib/libcondor_utils_7_8_3.so, /opt/glidecondor/bin/condor_history patch to fix ARC 1.1.x sites
>
>
2012-08-31 gwms branch_v2_6_1_gf1 Adds glidein xml reports and glexec test fix
2012-08-28 condor 7.8.3
 

GOC-ITB

Date Software Type Description
Changed:
<
<
2012-09-14 gwms running branch_v2_6_1_gf1 Adds glidein xml reports and glexec test fix
2012-09-07 condor running 7.8.3
>
>
2012-09-14 gwms branch_v2_6_1_gf1 Adds glidein xml reports and glexec test fix
2012-09-07 condor 7.8.3
 

GOC

Date Software Type Description
Changed:
<
<
2012-09-18 gwms running v2_6_1  
2012-09-11 condor 7.6.3 pre-release /usr/local/glidecondor/sbin/condor_gridmanager, /usr/local/glidecondor/sbin/nordugrid_gahp, /usr/local/glidecondor/lib/libcondor_utils_7_8_3.so, /usr/local/glidecondor/bin/condor_history patch to fix ARC 1.1.x sites
>
>
2012-09-18 gwms v2_6_1
2012-09-11 condor 7.8.2 with 7.8.3 pre-release /usr/local/glidecondor/sbin/condor_gridmanager, /usr/local/glidecondor/sbin/nordugrid_gahp, /usr/local/glidecondor/lib/libcondor_utils_7_8_3.so, /usr/local/glidecondor/bin/condor_history patch to fix ARC 1.1.x sites
 

CERN

Date Software Type Description
Changed:
<
<
2012-08-14 gwms running glideinWMS branch_v2_6_gf1 patch to fix analyze_entries reports and make Firefox compatible
>
>
2012-08-14 gwms branch_v2_6_gf1 patch to fix analyze_entries reports and make Firefox compatible
 

GOC Factory Things to Remember

Revision 512012/09/19 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 187 to 187
 

Reasons we mostly understand

Deleted:
<
<
  • CREAM error: Transfer failed: GRIDFTP_TRANSFER timed out
    • Can happen with udp packet loss at a site (firewall filtering). UPDATE_COLLECTOR_WITH_TCP should fix it.
 
  • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 500 500-Command failed. : globus_l_gfs_file_open failed.  500-globus_xio: Unable to open file /cream_localsandbox/data/prdcms/_DC_ch_DC_cern_OU_computers_CN_cmspilotjob_vocms157_cern_ch_cms_Role_production_Capability_NULL_prdcms35/99/CREAM999015216/OSB/job.479371.9.out  500-globus_xio: System error in open: No such file or directory  500-globus_xio: A system call failed: No such file or directory  500 End.
    • Happens when CREAM site has no memory of job (possibly removed on remote side) but gridmanager refuses to give up
Line: 206 to 203
 
    • This happens when there is a really old running glidein on the queue (likely lost in rundiff) with an expired lease. It prevents all later glideins with same user from obtaining a new lease. Just remove it, the held jobs should recover.

Reasons we don't understand

Added:
>
>
 
  • CREAM error: BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:qsub: Queue is not enabled MSG=queue is disabled: user cmprd003@ce08.pic.es, queue glong_sl5-) N/A (jobId = CREAM102242343)
    • This can be seen when glideins are submitted to a site in downtime.
Line: 226 to 225
 
    • CREAM error: Transfer failed: globus_ftp_control: gss_init_sec_context failed globus_gsi_callback_module: Could not verify credential globus_gsi_callback_module: Could not verify credential globus_gsi_callback_module: Invalid CRL: The available CRL has expired

Restarting the Glidein Factory after Reboot

Changed:
<
<
  1. Start httpd (/etc/init.d as root)
  2. Start Condor (/etc/init.d as root)
  3. Start gfactory (~/glideinsubmit/glidein_v2_0/factory_startup start as gfactory)
>
>
  1. As root start httpd:
    /etc/init.d/httpd start
  2. As root start condor:
    /etc/init.d/condor start
  3. Run top and watch the load. Only proceed after the load average drops considerably and %id is reasonably > 0%
  4. As gfactory start the factory:
    $GLIDEIN_FACTORY_DIR/factory_startup start
 

Adding a new VO

Revision 502012/09/18 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 9 to 9
 
Added:
>
>

Variables Used in this Document

The current factory instance directory is commonly referenced throughout this document and is set in the gfactory user's ~/.bash_profile for various tools to use:

export GLIDEIN_FACTORY_DIR=/home/gfactory/glideinsubmit/glidein_v2_0
 

Adding a New Site to Glidein Factory

Entry Templates

Line: 129 to 135
 

GFactory log directories

Changed:
<
<
The factory daemons log files can be found in
~/glideinsubmit/glidein_v2_0/log/entry_/factory..*.log 
>
>
Glidein output logs:
$GLIDEIN_FACTORY_DIR/client_log/user_*/entry_*/job.*.out
$GLIDEIN_FACTORY_DIR/client_log/user_*/entry_*/job.*.err

Glidein user logs:

$GLIDEIN_FACTORY_DIR/client_log/user_*/entry_*/condor_activity_*.log

Condor daemon logs:

/opt/glidecondor/condor_local/log/*Log
 
Changed:
<
<
The list of all completed jobs can be found in
 ~/glideinsubmit/glidein_v2_0/log/entry_CMS_T2_US_UCSD_gw2/completed_jobs_.log
>
>
NOTE On GOC machines:
/usr/local/glidecondor/condor_local/log/*Log
 
Changed:
<
<
The glidein exit logs can be found in
~/glideinsubmit/glidein_v2_0/client_log/user_/entry_/job.*.out|err 
>
>
Condor gridmanager logs:
/dev/shm/GridmanagerLog.schedd_glideins*
 
Changed:
<
<
Condor schedd job log can be found in
~/glideinsubmit/glidein_v2_0/client_log/user_/entry_/condor_activity__*.log 
>
>
NOTE On GOC machines:
/tmp/GridmanagerLog.schedd_glideins*
 
Changed:
<
<
Finally, the factory as a whole has its logs in
 ~/glideinsubmit/glidein_v2_0/log/entry_/factory..*.log 
>
>
Factory daemon logs:
$GLIDEIN_FACTORY_DIR/log/factory/factory.*.log
$GLIDEIN_FACTORY_DIR/log/entry_*/factory.*.log
 
Changed:
<
<
However, it is unlikely you need to look at that.
>
>
Completed glidein logs:
$GLIDEIN_FACTORY_DIR/log/entry_*/completed_jobs_*.log
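
When debugging a single entry it is often enough to look at the most recent glidein stderr files. The sketch below builds on the path patterns listed above. The function name and the default of five files are assumptions made for illustration.

```shell
# List the newest glidein stderr logs for one entry, across all users.
# $1 = factory instance directory, $2 = entry name, $3 = how many (default 5)
latest_err_logs() {
    factory_dir="$1"
    entry="$2"
    count="${3:-5}"
    ls -t "$factory_dir"/client_log/user_*/entry_"$entry"/job.*.err 2>/dev/null |
        head -n "$count"
}
```

For example, `latest_err_logs "$GLIDEIN_FACTORY_DIR" CMS_T2_US_UCSD_gw2 3` prints up to three paths, newest first, and prints nothing if the entry has no logs.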
 

Globus Hold Reasons

Line: 161 to 176
 

Additional notes

  • Globus error 79: connecting to the job manager failed.  Possible reasons: job terminated, invalid job contact, network problems, ..
Changed:
<
<
>
>
    • This can happen if it is a condor site and the admin removes held glideins from their side.
    • Also happens to every CERN Production Glidein every Monday on every gt5 site, but as of yet we still don't know why:
 
  • Globus error 9: the system cancelled the job
    • Happens when sites preempt glideins for exceeding memory limits or preempt opportunistic glideins (seen at Michigan and BNL).

Revision 492012/09/18 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 95 to 95
 

Running the site status report

Changed:
<
<
 gfactory@glidein-1 ~$ glideinWMS/factory/tools/analyze_entries -o ~/logae/
>
>
cd $GLIDEIN_FACTORY_DIR
~/glideinWMS/factory/tools/analyze_entries -x 24 -s waste
 
Changed:
<
<
The "-o" is optional and specifies where the output should go, default is ~.
>
>
Run command with -h to print explanation of possible options.
 
Changed:
<
<
This will be sent to osg-gfactory-support@physics.ucsd.edu daily.
>
>
This report is sent to osg-gfactory-reports@physics.ucsd.edu daily.
 

Monitoring webpages

Glidein Factory Status

Changed:
<
<
http://glidein-1.t2.ucsd.edu:8319/glidefactory/monitor/glidein_v2_0/factoryStatus.html
>
>
http://glidein-1.t2.ucsd.edu:8319/osg_gfactory/factoryStatus.html
  Load this page; the default "Entry" is 'total' (it can also be per-site). Then hit "update".
Line: 123 to 124
  Each of these can be temporary; e.g., "not matched" can spike and then go down when many jobs are submitted at once. This is not a problem. When the above conditions persist, a problem is more likely.
Deleted:
<
<
A script has been written to analyze the Glidein Factory Status data. To learn more about it, click here.
 

Glidein Factory Status Now

Changed:
<
<
http://glidein-1.t2.ucsd.edu:8319/glidefactory/monitor/glidein_v2_0/factoryStatusNow.html This page displays a table of live data which corresponds to the same data as shown in the plots under Glidein Factory Status. The information is further divided by VO. See GlideinFactoryStatusNow for a longer and more detailed discussion.
>
>
http://glidein-1.t2.ucsd.edu:8319/osg_gfactory/factoryStatusNow.html This page displays a table of live data which corresponds to the same data as shown in the plots under Glidein Factory Status. The information is further divided by VO. See GlideinFactoryStatusNow for a longer and more detailed discussion.
 

GFactory log directories

Revision 482012/09/18 - Main.TimMortensen

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 349 to 349
 

GOC

Date Software Type Description
Changed:
<
<
2012-09-18 gwms running glideinWMS v2_6_1  
>
>
2012-09-18 gwms running v2_6_1  
 
2012-09-11 condor 7.6.3 pre-release /usr/local/glidecondor/sbin/condor_gridmanager, /usr/local/glidecondor/sbin/nordugrid_gahp, /usr/local/glidecondor/lib/libcondor_utils_7_8_3.so, /usr/local/glidecondor/bin/condor_history patch to fix ARC 1.1.x sites

CERN

Revision 472012/09/18 - Main.TimMortensen

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 344 to 344
 

GOC-ITB

Date Software Type Description
Changed:
<
<
2012-08-31 gwms running branch_v2_6_1_gf1 Adds glidein xml reports and glexec test fix
>
>
2012-09-14 gwms running branch_v2_6_1_gf1 Adds glidein xml reports and glexec test fix
 
2012-09-07 condor running 7.8.3

GOC

Date Software Type Description
Changed:
<
<
2012-08-14 gwms running glideinWMS branch_v2_6_gf1 patch to fix analyze_entries reports and make Firefox compatible
>
>
2012-09-18 gwms running glideinWMS v2_6_1  
 
2012-09-11 condor 7.6.3 pre-release /usr/local/glidecondor/sbin/condor_gridmanager, /usr/local/glidecondor/sbin/nordugrid_gahp, /usr/local/glidecondor/lib/libcondor_utils_7_8_3.so, /usr/local/glidecondor/bin/condor_history patch to fix ARC 1.1.x sites

CERN

Revision 462012/09/18 - Main.TimMortensen

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 83 to 83
 

Cloning GOC-ITB Factory

  1. Do a "merge only" first to properly disable newly decommissioned sites without bringing in old ones with enabled="False". Also ignore special EGI glexec for now until glideinWMS 2.6.
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test1 -merge only glideinWMS.xml
  2. Run a second time with merge disabled but exclude a few experimental entries
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -merge no -out glideinWMS.xml.test2 -exclude enabled False -merge no glideinWMS.xml.test1
Deleted:
<
<
  1. Change GLIDEIN_Supported_VO on OSG_CrossOSG_ce to CMS
 
  1. Change OSGVO and OSGVOHTPC to OSGVO_ITB and OSGVOHTPC_ITB
    sed 's/OSGVOHTPC/OSGVOHTPC_ITB/g' glideinWMS.xml.test2 | sed 's/OSGVO,/OSGVO_ITB,/g' | sed 's/OSGVO\"/OSGVO_ITB\"/g' > glideinWMS.xml.test3
Changed:
<
<
  1. Add CMSOverflow to GLIDEIN_Supported_VO in Engage_US_MWT2_osg and HCC_US_BNL_gk02
    sed 's/ITBaddCMSOverflow.*value=\"/ITBaddCMSOverflow" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="True" type="string" value="CMSOverflow,/g' glideinWMS.xml.test3 > glideinWMS.xml.test4
>
>
  1. Engage_US_MWT2_osg and HCC_US_BNL_gk02: Add CMSOverflow to GLIDEIN_Supported_VO
  2. OSG_CrossOSG_ce: Change GLIDEIN_Supported_VO to CMS
 
  1. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test2 and reconfig
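
The order of the three substitutions in the OSGVO rename matters: OSGVOHTPC is handled first, and the plain OSGVO patterns are anchored on a trailing comma or quote so they cannot match inside OSGVOHTPC. The sketch below wraps the same substitutions in one function so they can be tried on a sample attribute value before touching a config file. The function name and the sample input are made up for illustration.

```shell
# Apply the same three substitutions as the sed pipeline above, in order.
rename_itb_vos() {
    sed -e 's/OSGVOHTPC/OSGVOHTPC_ITB/g' \
        -e 's/OSGVO,/OSGVO_ITB,/g' \
        -e 's/OSGVO"/OSGVO_ITB"/g'
}

# Try it on a sample value before running it over a real config:
printf '%s\n' 'value="CMS,OSGVO,OSGVOHTPC"' | rename_itb_vos
# prints: value="CMS,OSGVO_ITB,OSGVOHTPC_ITB"
```

Note that the rename is not idempotent: running it a second time on its own output would turn OSGVOHTPC_ITB into OSGVOHTPC_ITB_ITB, so always start from the freshly cloned file.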

Cloning CERN Factory

Revision 452012/09/18 - Main.TimMortensen

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 344 to 344
 

GOC-ITB

Date Software Type Description
Changed:
<
<
2012-08-31 gwms running glideinWMS branch_v2_6_1_gf1 Adds glidein xml reports and glexec test fix; pulled to get xml fix --Tim 20120907
>
>
2012-08-31 gwms running branch_v2_6_1_gf1 Adds glidein xml reports and glexec test fix
 
2012-09-07 condor running 7.8.3

GOC

Revision 442012/09/14 - Main.KristaLarson

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 391 to 391
  FactoryOpsGlideinWMS
Changed:
<
<
FactoryInfo (names, locations, users served, sites supported, anything unique about that install, upgrade req, ?)
  • All factories
  • UCSD
  • GOC
  • CERN
  • FNAL
>
>
FactoryInfo
  Ops support
  • Internal tickets

Revision 432012/09/11 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 340 to 340
 
Date Software Type Description
2012-08-31 gwms running branch_v2_6_1_gf1 Adds glidein xml reports and glexec test fix
Changed:
<
<
2012-08-28 condor 7.6.3 pre-release /opt/glidecondor/sbin/condor_gridmanager, /opt/glidecondor/sbin/nordugrid_gahp, /opt/glidecondor/lib/libcondor_utils_7_8_3.so patch to fix ARC 1.1.x sites
>
>
2012-08-28 condor 7.6.3 pre-release /opt/glidecondor/sbin/condor_gridmanager, /opt/glidecondor/sbin/nordugrid_gahp, /opt/glidecondor/lib/libcondor_utils_7_8_3.so, /opt/glidecondor/bin/condor_history patch to fix ARC 1.1.x sites
 

GOC-ITB

Date Software Type Description
2012-08-31 gwms running glideinWMS branch_v2_6_1_gf1 Adds glidein xml reports and glexec test fix; pulled to get xml fix --Tim 20120907
Changed:
<
<
2012-08-28 condor 7.6.3 pre-release /opt/glidecondor/sbin/condor_gridmanager, /opt/glidecondor/sbin/nordugrid_gahp, /opt/glidecondor/lib/libcondor_utils_7_8_3.so patch to fix ARC 1.1.x sites
>
>
2012-09-07 condor running 7.8.3
 

GOC

Date Software Type Description
2012-08-14 gwms running glideinWMS branch_v2_6_gf1 patch to fix analyze_entries reports and make Firefox compatible
Added:
>
>
2012-09-11 condor 7.6.3 pre-release /usr/local/glidecondor/sbin/condor_gridmanager, /usr/local/glidecondor/sbin/nordugrid_gahp, /usr/local/glidecondor/lib/libcondor_utils_7_8_3.so, /usr/local/glidecondor/bin/condor_history patch to fix ARC 1.1.x sites
 

CERN

Date Software Type Description

Revision 422012/09/07 - Main.TimMortensen

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 344 to 344
 

GOC-ITB

Date Software Type Description
Changed:
<
<
2012-08-31 gwms running glideinWMS branch_v2_6_1_gf1 Adds glidein xml reports and glexec test fix
>
>
2012-08-31 gwms running glideinWMS branch_v2_6_1_gf1 Adds glidein xml reports and glexec test fix; pulled to get xml fix --Tim 20120907
 
2012-08-28 condor 7.6.3 pre-release /opt/glidecondor/sbin/condor_gridmanager, /opt/glidecondor/sbin/nordugrid_gahp, /opt/glidecondor/lib/libcondor_utils_7_8_3.so patch to fix ARC 1.1.x sites

GOC

Revision 412012/08/31 - Main.TimMortensen

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 339 to 339
 

UCSD

Date Software Type Description
Changed:
<
<
2012-08-27 gwms running branch_v2_6_1_gf1 Adds glidein xml reports
>
>
2012-08-31 gwms running branch_v2_6_1_gf1 Adds glidein xml reports and glexec test fix
 
2012-08-28 condor 7.6.3 pre-release /opt/glidecondor/sbin/condor_gridmanager, /opt/glidecondor/sbin/nordugrid_gahp, /opt/glidecondor/lib/libcondor_utils_7_8_3.so patch to fix ARC 1.1.x sites

GOC-ITB

Date Software Type Description
Changed:
<
<
2012-08-14 gwms running glideinWMS branch_v2_6_gf1 patch to fix analyze_entries reports and make Firefox compatible
2012-08-14 gwms cherry-pick 3cf7e1f8151d4b6c216b7f3c81578fc867ce8371 smarter hold release code to reduce load
2012-08-15 gwms cherry-pick 683b24d5b60b4b7dc83717e3d4663df1aa676286 bug fix for prev hold release pick
>
>
2012-08-31 gwms running glideinWMS branch_v2_6_1_gf1 Adds glidein xml reports and glexec test fix
 
2012-08-28 condor 7.6.3 pre-release /opt/glidecondor/sbin/condor_gridmanager, /opt/glidecondor/sbin/nordugrid_gahp, /opt/glidecondor/lib/libcondor_utils_7_8_3.so patch to fix ARC 1.1.x sites

GOC

Revision 402012/08/28 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 339 to 339
 

UCSD

Date Software Type Description
Changed:
<
<
2012-07-18 condor modded /opt/glidecondor/sbin/condor_gridmanager, /opt/glidecondor/sbin/nordugrid_gahp patch to fix ARC 1.1.x sites
2012-08-14 gwms running glideinWMS branch_v2_6_gf1 patch to fix analyze_entries reports and make Firefox compatible
>
>
2012-08-27 gwms running branch_v2_6_1_gf1 Adds glidein xml reports
2012-08-28 condor 7.6.3 pre-release /opt/glidecondor/sbin/condor_gridmanager, /opt/glidecondor/sbin/nordugrid_gahp, /opt/glidecondor/lib/libcondor_utils_7_8_3.so patch to fix ARC 1.1.x sites
 

GOC-ITB

Date Software Type Description
Deleted:
<
<
2012-07-18 condor modded /opt/glidecondor/sbin/condor_gridmanager, /opt/glidecondor/sbin/nordugrid_gahp patch to fix ARC 1.1.x sites
 
2012-08-14 gwms running glideinWMS branch_v2_6_gf1 patch to fix analyze_entries reports and make Firefox compatible
2012-08-14 gwms cherry-pick 3cf7e1f8151d4b6c216b7f3c81578fc867ce8371 smarter hold release code to reduce load
Added:
>
>
2012-08-15 gwms cherry-pick 683b24d5b60b4b7dc83717e3d4663df1aa676286 bug fix for prev hold release pick
2012-08-28 condor 7.6.3 pre-release /opt/glidecondor/sbin/condor_gridmanager, /opt/glidecondor/sbin/nordugrid_gahp, /opt/glidecondor/lib/libcondor_utils_7_8_3.so patch to fix ARC 1.1.x sites
 

GOC

Date Software Type Description

Revision 392012/08/23 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 65 to 65
  Below are examples of doing a global clone from UCSD to GOC and CERN factories. They assume you have copied the UCSD config to the respective factory and named it glideinWMS.xml.ucsd.
Changed:
<
<
DISCLAIMER The examples are subject to change due to the constantly evolving nature of our config files. They are current as of 2012-08-14
>
>
DISCLAIMER The examples are subject to change due to the constantly evolving nature of our config files. They are current as of 2012-08-23
 

Description of clone_glidein Arguments

  • -merge yes/no/only
Line: 77 to 77
 

Cloning GOC Factory

  1. Do a "merge only" first to properly disable newly decommissioned sites without bringing in old ones with enabled="False".
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test -merge only glideinWMS.xml
Changed:
<
<
  1. Run a second time with merge disabled but exclude a few experimental entries
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -merge no -out glideinWMS.xml.test2 -exclude enabled False -exclude name SBGRID_US_HMS_Orchestra_dev -merge no glideinWMS.xml.test
>
>
  1. Run a second time with merge disabled and exclude disabled entries.
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -merge no -out glideinWMS.xml.test2 -exclude enabled False -merge no glideinWMS.xml.test
 
  1. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test2 and reconfig
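
The two passes can be chained so that the second only runs if the first succeeds. This is a minimal sketch that reuses the flags from the commands above. The CLONE_GLIDEIN variable and the function name are hypothetical conveniences, and the function assumes it is run from the directory that holds glideinWMS.xml and glideinWMS.xml.ucsd.

```shell
# Path to clone_glidein; override if your checkout lives elsewhere.
CLONE_GLIDEIN="${CLONE_GLIDEIN:-$HOME/glideinWMS/creation/clone_glidein}"

clone_goc_from_ucsd() {
    # Pass 1: merge only, so newly decommissioned sites get disabled.
    "$CLONE_GLIDEIN" -other glideinWMS.xml.ucsd -out glideinWMS.xml.test \
        -merge only glideinWMS.xml || return 1
    # Pass 2: merge disabled, excluding entries with enabled="False".
    "$CLONE_GLIDEIN" -other glideinWMS.xml.ucsd -merge no \
        -out glideinWMS.xml.test2 -exclude enabled False \
        glideinWMS.xml.test || return 1
}
```

The result still needs to be reviewed by hand before glideinWMS.xml.test2 replaces the current config.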

Cloning GOC-ITB Factory

Revision 382012/08/21 - Main.TimMortensen

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 83 to 83
 

Cloning GOC-ITB Factory

  1. Do a "merge only" first to properly disable newly decommissioned sites without bringing in old ones with enabled="False". Also ignore special EGI glexec for now until glideinWMS 2.6.
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test1 -merge only glideinWMS.xml
  2. Run a second time with merge disabled but exclude a few experimental entries
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -merge no -out glideinWMS.xml.test2 -exclude enabled False -merge no glideinWMS.xml.test1
Deleted:
<
<
  1. Add CMSOverflow to GLIDEIN_Supported_VO in Engage_US_MWT2_osg and HCC_US_BNL_gk02
 
  1. Change GLIDEIN_Supported_VO on OSG_CrossOSG_ce to CMS
Changed:
<
<
  1. Change OSGVO and OSGVOHTPC to OSGVO_ITB and OSGVOHTPC_ITB
    sed 's/OSGVOHTPC/OSGVOHTPC_ITB/g' glideinWMS.xml.test2 | sed 's/OSGVO,/OSGVO_ITB,/g' | sed 's/OSGVO\"/OSGVO_ITB\"/g' > glideinWMS.xml.test1
>
>
  1. Change OSGVO and OSGVOHTPC to OSGVO_ITB and OSGVOHTPC_ITB
    sed 's/OSGVOHTPC/OSGVOHTPC_ITB/g' glideinWMS.xml.test2 | sed 's/OSGVO,/OSGVO_ITB,/g' | sed 's/OSGVO\"/OSGVO_ITB\"/g' > glideinWMS.xml.test3
  2. Add CMSOverflow to GLIDEIN_Supported_VO in Engage_US_MWT2_osg and HCC_US_BNL_gk02
    sed 's/ITBaddCMSOverflow.*value=\"/ITBaddCMSOverflow" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="True" type="string" value="CMSOverflow,/g' glideinWMS.xml.test3 > glideinWMS.xml.test4
 
  1. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test2 and reconfig

Cloning CERN Factory

Revision 372012/08/21 - Main.TimMortensen

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 82 to 82
 

Cloning GOC-ITB Factory

  1. Do a "merge only" first to properly disable newly decommissioned sites without bringing in old ones with enabled="False". Also ignore special EGI glexec for now until glideinWMS 2.6.
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test1 -merge only glideinWMS.xml
Changed:
<
<
  1. Run a second time with merge disabled but exclude a few experimental entries
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -merge no -out glideinWMS.xml.test2 -exclude enabled False -merge no glideinWMS.xml.test
>
>
  1. Run a second time with merge disabled but exclude a few experimental entries
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -merge no -out glideinWMS.xml.test2 -exclude enabled False -merge no glideinWMS.xml.test1
 
  1. Add CMSOverflow to GLIDEIN_Supported_VO in Engage_US_MWT2_osg and HCC_US_BNL_gk02
  2. Change GLIDEIN_Supported_VO on OSG_CrossOSG_ce to CMS
  3. Change OSGVO and OSGVOHTPC to OSGVO_ITB and OSGVOHTPC_ITB
    sed 's/OSGVOHTPC/OSGVOHTPC_ITB/g' glideinWMS.xml.test2 | sed 's/OSGVO,/OSGVO_ITB,/g' | sed 's/OSGVO\"/OSGVO_ITB\"/g' > glideinWMS.xml.test1

Revision 362012/08/20 - Main.TimMortensen

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 81 to 81
 
  1. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test2 and reconfig

Cloning GOC-ITB Factory

Changed:
<
<
  1. Do a "merge only" first to properly disable newly decommissioned sites without bringing in old ones with enabled="False". Also ignore special EGI glexec for now until glideinWMS 2.6.
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test -merge only glideinWMS.xml
>
>
  1. Do a "merge only" first to properly disable newly decommissioned sites without bringing in old ones with enabled="False". Also ignore special EGI glexec for now until glideinWMS 2.6.
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test1 -merge only glideinWMS.xml
 
  1. Run a second time with merge disabled but exclude a few experimental entries
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -merge no -out glideinWMS.xml.test2 -exclude enabled False -merge no glideinWMS.xml.test
  2. Add CMSOverflow to GLIDEIN_Supported_VO in Engage_US_MWT2_osg and HCC_US_BNL_gk02
  3. Change GLIDEIN_Supported_VO on OSG_CrossOSG_ce to CMS
Changed:
<
<
  1. Change OSGVO and OSGVOHTPC to OSGVO_ITB and OSGVOHTPC_ITB
>
>
  1. Change OSGVO and OSGVOHTPC to OSGVO_ITB and OSGVOHTPC_ITB
    sed 's/OSGVOHTPC/OSGVOHTPC_ITB/g' glideinWMS.xml.test2 | sed 's/OSGVO,/OSGVO_ITB,/g' | sed 's/OSGVO\"/OSGVO_ITB\"/g' > glideinWMS.xml.test1
 
  1. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test2 and reconfig

Cloning CERN Factory

Revision 352012/08/17 - Main.KristaLarson

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 76 to 76
 
  • -disable_old - if site is in original config but no longer in the "other" config, disable it

Cloning GOC Factory

Changed:
<
<
  1. Do a "merge only" first to properly disable newly decommissioned sites without bringing in old ones with enabled="False".
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test -merge only glideinWMS.xml
  2. Run a second time with merge disabled but exclude a few experimental entries
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -merge no -out glideinWMS.xml.test2 -exclude enabled False -exclude name SBGRID_US_HMS_Orchestra_dev -merge no glideinWMS.xml.test
  3. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test2 and reconfig
>
>
  1. Do a "merge only" first to properly disable newly decommissioned sites without bringing in old ones with enabled="False".
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test -merge only glideinWMS.xml
  2. Run a second time with merge disabled but exclude a few experimental entries
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -merge no -out glideinWMS.xml.test2 -exclude enabled False -exclude name SBGRID_US_HMS_Orchestra_dev -merge no glideinWMS.xml.test
  3. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test2 and reconfig
 

Cloning GOC-ITB Factory

Changed:
<
<
  1. Do a "merge only" first to properly disable newly decommissioned sites without bringing in old ones with enabled="False". Also ignore special EGI glexec for now until glideinWMS 2.6.
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test -merge only glideinWMS.xml
  2. Run a second time with merge disabled but exclude a few experimental entries
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -merge no -out glideinWMS.xml.test2 -exclude enabled False -merge no glideinWMS.xml.test
  3. Add CMSOverflow to GLIDEIN_Supported_VO in Engage_US_MWT2_osg and HCC_US_BNL_gk02
  4. Change GLIDEIN_Supported_VO on OSG_CrossOSG_ce to CMS
  5. Change OSGVO and OSGVOHTPC to OSGVO_ITB and OSGVOHTPC_ITB
  6. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test2 and reconfig
>
>
  1. Do a "merge only" first to properly disable newly decommissioned sites without bringing in old ones with enabled="False". Also ignore special EGI glexec for now until glideinWMS 2.6.
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test -merge only glideinWMS.xml
  2. Run a second time with merge disabled but exclude a few experimental entries
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -merge no -out glideinWMS.xml.test2 -exclude enabled False -merge no glideinWMS.xml.test
  3. Add CMSOverflow to GLIDEIN_Supported_VO in Engage_US_MWT2_osg and HCC_US_BNL_gk02
  4. Change GLIDEIN_Supported_VO on OSG_CrossOSG_ce to CMS
  5. Change OSGVO and OSGVOHTPC to OSGVO_ITB and OSGVOHTPC_ITB
  6. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test2 and reconfig
 

Cloning CERN Factory

Changed:
<
<
  1. Do a "merge only" first to properly disable newly decommissioned sites without bringing in old ones with enabled="False". Also preserve comments.
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test -merge only -preserve_comments glideinWMS.xml
  2. Run a second time with merge disabled but only include what we want
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test2 -merge no -include name CMS_T1 -include name CMS_T2 -include name CMS_T3 -exclude enabled False -exclude name CMS_T1_US_FNAL_ce3 -exclude name CMS_T3_US_Omaha_tusker_bigmem -exclude name CMS_T3_US_Omaha_tusker_long_3d glideinWMS.xml.test
  3. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test2 and reconfig
>
>
  1. Do a "merge only" first to properly disable newly decommissioned sites without bringing in old ones with enabled="False". Also preserve comments.
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test -merge only -preserve_comments glideinWMS.xml
  2. Run a second time with merge disabled but only include what we want
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test2 -merge no -include name CMS_T1 -include name CMS_T2 -include name CMS_T3 -exclude enabled False -exclude name CMS_T1_US_FNAL_ce3 -exclude name CMS_T3_US_Omaha_tusker_bigmem -exclude name CMS_T3_US_Omaha_tusker_long_3d glideinWMS.xml.test
  3. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test2 and reconfig
 

Running the site status report

Line: 131 to 131
 

GFactory log directories

The factory daemons log files can be found in

Changed:
<
<
~/glideinsubmit/glidein_v2_0/log/entry_/factory..*.log 
>
>
~/glideinsubmit/glidein_v2_0/log/entry_/factory..*.log 
  The list of all completed jobs can be found in
Changed:
<
<
 ~/glideinsubmit/glidein_v2_0/log/entry_CMS_T2_US_UCSD_gw2/completed_jobs_.log
>
>
 ~/glideinsubmit/glidein_v2_0/log/entry_CMS_T2_US_UCSD_gw2/completed_jobs_.log
  The glidein exit logs can be found in
Changed:
<
<
~/glideinsubmit/glidein_v2_0/client_log/user_/entry_/job.*.out|err 
>
>
~/glideinsubmit/glidein_v2_0/client_log/user_/entry_/job.*.out|err 
  Condor schedd job log can be found in
Changed:
<
<
~/glideinsubmit/glidein_v2_0/client_log/user_/entry_/condor_activity__*.log 
>
>
~/glideinsubmit/glidein_v2_0/client_log/user_/entry_/condor_activity__*.log 
  Finally, the factory as a whole has its logs in
Changed:
<
<
 ~/glideinsubmit/glidein_v2_0/log/entry_/factory..*.log 
>
>
 ~/glideinsubmit/glidein_v2_0/log/entry_/factory..*.log 
  However, it is unlikely you need to look at that.
Line: 300 to 300
  Get a list of the proxies for a VO and CE
Changed:
<
<
./proxy_info fecms ls -l '/var/gfactory/clientproxies/user_fecms/glidein_v2_0/entry_CMS_T2_US_UCSD_gw2/'
>
>
./proxy_info fecms ls -l '/var/gfactory/clientproxies/user_fecms/glidein_v2_0/entry_CMS_T2_US_UCSD_gw2/' 
  Display a particular proxy's information
Changed:
<
<
./proxy_info fecms info '/var/gfactory/clientproxies/user_fecms/glidein_v2_0/entry_CMS_T2_US_UCSD_gw2/x509_CMS_T2_US_UCSD_gw2@v2_0@UCSD@UCSD.minus,v5_0.dot,main_umrw_5.proxy'
>
>
./proxy_info fecms info '/var/gfactory/clientproxies/user_fecms/glidein_v2_0/entry_CMS_T2_US_UCSD_gw2/x509_CMS_T2_US_UCSD_gw2@v2_0@UCSD@UCSD.minus,v5_0.dot,main_umrw_5.proxy' 
  For additional tool help run
Line: 362 to 358
 

GOC Factory Things to Remember

Changed:
<
<
Production Factory: glidein.grid.iu.edu
ITB Factory: glidein-itb.grid.iu.edu
>
>
Production Factory: glidein.grid.iu.edu
ITB Factory: glidein-itb.grid.iu.edu
 

Turn off timeout for sudo

Line: 389 to 384
 

Things to Remember from Glidein Factory Training week 11/11

CREAM issues with lease renewal

Added:
>
>
 condor-g sets expired leases for some of our held glideins. This can be checked in the job classad JobLeaseExpiration.
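
One way to spot such glideins is to list held jobs together with their JobLeaseExpiration and compare it against the current time. The following is a minimal sketch, assuming condor_q is run where it can see the glidein schedd (on a multi-schedd factory you may need to add -name or -global). The helper name expired_leases is hypothetical.

```shell
# Read "<cluster>.<proc> <JobLeaseExpiration>" lines on stdin and print the
# job ids whose lease expiration (Unix time) is earlier than $1.
expired_leases() {
    awk -v now="$1" 'NF >= 2 && $2 + 0 < now + 0 { print $1 }'
}

# Example: held jobs (JobStatus == 5) that define JobLeaseExpiration.
# condor_q -constraint 'JobStatus == 5 && JobLeaseExpiration =!= UNDEFINED' \
#     -format '%d.' ClusterId -format '%d ' ProcId \
#     -format '%d\n' JobLeaseExpiration | expired_leases "$(date +%s)"
```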
Added:
>
>
New Improved Docs (based on Alison's notes)

FactoryOpsGlideinWMS

FactoryInfo (names, locations, users served, sites supported, anything unique about that install, upgrade req, ?)

  • All factories
  • UCSD
  • GOC
  • CERN
  • FNAL

Ops support

  • Internal tickets
  • mailing lists
  • access and cross-factory support

Factory Ops

  • Creating a new instance (and preserve monitoring history?)
  • Initial setup (daily emails, processes/monitoring, ??)
  • Adding new entries
  • Adding a frontend
  • Upgrading
    • factory
    • condor
  • Cloning
    • sites from one factory to another
    • Global cloning, such as t1 site group
  • Removing schedds (I think docs for this may be wrong?)
  • Attributes (link to gwms docs)
  • Finding missing sites
  • Removing glideins (includes scripts)
  • Submitting test jobs
  • Putting sites in downtime
  • Submitting Tickets
  • Decommissioning sites
  • Factory Disk warnings
  • Entry issues
    • CREAM
    • globus
    • Misc
  • Removing old entries

Daily Ops Monitoring

  • Mailing list
  • Internal tickets (Jira)
  • Daily emails
  • Analyze Entries
  • Web pages
  • Held jobs
  • Infosys
  • Misc
  • .err log problems
  • HOLD problems
  • Condor Activity Log problems

Daily Ops Other issues

  • Restarting the grid manager
  • Handling stuck waiting glideins
  • Rundiffs
  • Unmatched jobs

Additional References

  • logs
  • monitoring tools
  • proxies
  • ssh logins
  • git commands
  • frontend security info
  • BDII
  • Log Retention rules
  • Security
  • Condor G
  • Useful scripts
  • Misc

Future work

  • rpms
 

Authors

-- TerrenceMartin

Revision 342012/08/14 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 65 to 65
  Below are examples of doing a global clone from UCSD to GOC and CERN factories. They assume you have copied the UCSD config to the respective factory and named it glideinWMS.xml.ucsd.
Changed:
<
<
DISCLAIMER The examples are subject to change due to the constantly evolving nature of our config files. They are current as of 2012-07-26
>
>
DISCLAIMER The examples are subject to change due to the constantly evolving nature of our config files. They are current as of 2012-08-14
 

Description of clone_glidein Arguments

  • -merge yes/no/only
Line: 76 to 76
 
  • -disable_old - if site is in original config but no longer in the "other" config, disable it

Cloning GOC Factory

Changed:
<
<
  1. Manually remove country codes
    grep -v Country glideinWMS.xml.ucsd > glideinWMS.xml.ucsd2
  2. Do a "merge only" first to properly disable newly decommissioned sites without bringing in old ones with enabled="False". Also ignore special EGI glexec for now until glideinWMS 2.6.
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd2 -out glideinWMS.xml.test -exclude name CMS_T2_UK_London_Brunel_dc2_68 -exclude name CMS_T2_UK_London_Brunel_dc2_70 -exclude name CMS_T2_UK_London_Brunel_dgc_43 -merge only glideinWMS.xml
  3. Run a second time with merge disabled but exclude a few experimental entries
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd2 -merge no -out glideinWMS.xml.test2 -exclude enabled False -exclude name ATLAS_ -exclude name SBGRID_US_HMS_Orchestra_dev -merge no glideinWMS.xml.test
>
>
  1. Do a "merge only" first to properly disable newly decommissioned sites without bringing in old ones with enabled="False".
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test -merge only glideinWMS.xml
  2. Run a second time with merge disabled but exclude a few experimental entries
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -merge no -out glideinWMS.xml.test2 -exclude enabled False -exclude name SBGRID_US_HMS_Orchestra_dev -merge no glideinWMS.xml.test
 
  1. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test2 and reconfig

Cloning GOC-ITB Factory

Line: 90 to 89
 
  1. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test2 and reconfig

Cloning CERN Factory

Changed:
<
<
  1. Manually remove country codes
    grep -v Country glideinWMS.xml.ucsd > glideinWMS.xml.ucsd2
  2. Do a "merge only" first to properly disable newly decommissioned sites without bringing in old ones with enabled="False"
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd2 -out glideinWMS.xml.test -merge only -preserve_comments glideinWMS.xml
  3. Run a second time with merge disabled but only include what we want
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd2 -out glideinWMS.xml.test2 -merge no -include name CMS_T1 -include name CMS_T2 -include name CMS_T3 -exclude enabled False -exclude name CMS_T1_US_FNAL_ce3 -exclude name CMS_T3_US_Omaha_tusker_bigmem glideinWMS.xml.test
>
>
  1. Do a "merge only" first to properly disable newly decommissioned sites without bringing in old ones with enabled="False". Also preserve comments.
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test -merge only -preserve_comments glideinWMS.xml
  2. Run a second time with merge disabled but only include what we want
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test2 -merge no -include name CMS_T1 -include name CMS_T2 -include name CMS_T3 -exclude enabled False -exclude name CMS_T1_US_FNAL_ce3 -exclude name CMS_T3_US_Omaha_tusker_bigmem -exclude name CMS_T3_US_Omaha_tusker_long_3d glideinWMS.xml.test
 
  1. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test2 and reconfig

Running the site status report

Line: 346 to 344
 
Date Software Type Description
2012-07-18 condor modded /opt/glidecondor/sbin/condor_gridmanager, /opt/glidecondor/sbin/nordugrid_gahp patch to fix ARC 1.1.x sites
Added:
>
>
2012-08-14 gwms running glideinWMS branch_v2_6_gf1 patch to fix analyze_entries reports and make Firefox compatible
 

GOC-ITB

Date Software Type Description
Deleted:
<
<
2012-07-31 gwms cherry-pick df01a1bd4a61e0cd640ce182d2db184b91757a5e, aaa12524b3e9efa239836c7e4ab4971957cf35cb add the per-frontend aggregation of log RRDs
 
2012-07-18 condor modded /opt/glidecondor/sbin/condor_gridmanager, /opt/glidecondor/sbin/nordugrid_gahp patch to fix ARC 1.1.x sites
Added:
>
>
2012-08-14 gwms running glideinWMS branch_v2_6_gf1 patch to fix analyze_entries reports and make Firefox compatible
2012-08-14 gwms cherry-pick 3cf7e1f8151d4b6c216b7f3c81578fc867ce8371 smarter hold release code to reduce load

GOC

Date Software Type Description
2012-08-14 gwms running glideinWMS branch_v2_6_gf1 patch to fix analyze_entries reports and make Firefox compatible

CERN

Date Software Type Description
2012-08-14 gwms running glideinWMS branch_v2_6_gf1 patch to fix analyze_entries reports and make Firefox compatible
 

GOC Factory Things to Remember

Revision 332012/08/14 - Main.TimMortensen

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 81 to 81
 
  1. Run a second time with merge disabled but exclude a few experimental entries
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd2 -merge no -out glideinWMS.xml.test2 -exclude enabled False -exclude name ATLAS_ -exclude name SBGRID_US_HMS_Orchestra_dev glideinWMS.xml.test
  2. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test2 and reconfig
Added:
>
>

Cloning GOC-ITB Factory

  1. Do a "merge only" first to properly disable newly decommissioned sites without bringing in old ones with enabled="False". Also ignore special EGI glexec for now until glideinWMS 2.6.
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test -merge only glideinWMS.xml
  2. Run a second time with merge disabled, excluding entries with enabled="False"
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd -merge no -out glideinWMS.xml.test2 -exclude enabled False glideinWMS.xml.test
  3. Add CMSOverflow to GLIDEIN_Supported_VO in Engage_US_MWT2_osg and HCC_US_BNL_gk02
  4. Change GLIDEIN_Supported_VO on OSG_CrossOSG_ce to CMS
  5. Change OSGVO and OSGVOHTPC to OSGVO_ITB and OSGVOHTPC_ITB
  6. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test2 and reconfig
 

Cloning CERN Factory

  1. Manually remove country codes
    grep -v Country glideinWMS.xml.ucsd > glideinWMS.xml.ucsd2
  2. Do a "merge only" first to properly disable newly decommissioned sites without bringing in old ones with enabled="False"
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd2 -out glideinWMS.xml.test -merge only -preserve_comments glideinWMS.xml

Revision 322012/07/31 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 337 to 337
 

UCSD

Date Software Type Description
Deleted:
<
<
2012-07-02 gwms cherry-pick df01a1bd4a61e0cd640ce182d2db184b91757a5e, aaa12524b3e9efa239836c7e4ab4971957cf35cb add the per-frontend aggregation of log RRDs
 
2012-07-18 condor modded /opt/glidecondor/sbin/condor_gridmanager, /opt/glidecondor/sbin/nordugrid_gahp patch to fix ARC 1.1.x sites

GOC-ITB

Date Software Type Description
Changed:
<
<
2012-06-28 gwms cherry-pick df01a1bd4a61e0cd640ce182d2db184b91757a5e, aaa12524b3e9efa239836c7e4ab4971957cf35cb add the per-frontend aggregation of log RRDs
>
>
2012-07-31 gwms cherry-pick df01a1bd4a61e0cd640ce182d2db184b91757a5e, aaa12524b3e9efa239836c7e4ab4971957cf35cb add the per-frontend aggregation of log RRDs
 
2012-07-18 condor modded /opt/glidecondor/sbin/condor_gridmanager, /opt/glidecondor/sbin/nordugrid_gahp patch to fix ARC 1.1.x sites

GOC Factory Things to Remember

Revision 312012/07/30 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 337 to 337
 

UCSD

Date Software Type Description
Deleted:
<
<
2012-06-05 gwms modded ~/glideinWMS/creation/web_base/glexec_setup.sh fix glexec detection at EGI sites
2012-06-06 gwms modded ~/glideinWMS/factory/tools/entry_q add GLIDEIN_FACTORY_DIR env var to entry_q
 
2012-07-02 gwms cherry-pick df01a1bd4a61e0cd640ce182d2db184b91757a5e, aaa12524b3e9efa239836c7e4ab4971957cf35cb add the per-frontend aggregation of log RRDs
Deleted:
<
<
2012-07-18 gwms cherry-pick 79f7f99da43fd3277a064ccb71c5541238573253 patch to fix ARC 1.1.x sites
 
2012-07-18 condor modded /opt/glidecondor/sbin/condor_gridmanager, /opt/glidecondor/sbin/nordugrid_gahp patch to fix ARC 1.1.x sites

GOC-ITB

Date Software Type Description
2012-06-28 gwms cherry-pick df01a1bd4a61e0cd640ce182d2db184b91757a5e, aaa12524b3e9efa239836c7e4ab4971957cf35cb add the per-frontend aggregation of log RRDs
Deleted:
<
<
2012-07-18 gwms cherry-pick 79f7f99da43fd3277a064ccb71c5541238573253 patch to fix ARC 1.1.x sites
 
2012-07-18 condor modded /opt/glidecondor/sbin/condor_gridmanager, /opt/glidecondor/sbin/nordugrid_gahp patch to fix ARC 1.1.x sites

GOC Factory Things to Remember

Revision 302012/07/26 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 65 to 65
  Below are examples of doing a global clone from UCSD to GOC and CERN factories. They assume you have copied the UCSD config to the respective factory and named it glideinWMS.xml.ucsd.
Changed:
<
<
DISCLAIMER The examples are subject to change due to the constantly evolving nature of our config files. They are current as of 2012-05-18
>
>
DISCLAIMER The examples are subject to change due to the constantly evolving nature of our config files. They are current as of 2012-07-26
 

Description of clone_glidein Arguments

  • -merge yes/no/only
Line: 77 to 77
 

Cloning GOC Factory

  1. Manually remove country codes
    grep -v Country glideinWMS.xml.ucsd > glideinWMS.xml.ucsd2
Changed:
<
<
  1. Do a "merge only" first to properly disable newly decommissioned sites without bringing in old ones with enabled="False"
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd2 -out glideinWMS.xml.test -merge only glideinWMS.xml
  2. Run a second time with merge disabled but exclude a few experimental entries
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd2 -merge no -out glideinWMS.xml.test2 -exclude enabled False -exclude name ATLAS_ -exclude name GLOW_US_Vanderbilt_ce -exclude name Engage_US_Florida_iogw1 -exclude name OSG_CrossOSG_ce -exclude name SBGRID_US_HMS_Orchestra_dev -merge no glideinWMS.xml.test
>
>
  1. Do a "merge only" first to properly disable newly decommissioned sites without bringing in old ones with enabled="False". Also ignore special EGI glexec for now until glideinWMS 2.6.
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd2 -out glideinWMS.xml.test -exclude name CMS_T2_UK_London_Brunel_dc2_68 -exclude name CMS_T2_UK_London_Brunel_dc2_70 -exclude name CMS_T2_UK_London_Brunel_dgc_43 -merge only glideinWMS.xml
  2. Run a second time with merge disabled but exclude a few experimental entries
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd2 -merge no -out glideinWMS.xml.test2 -exclude enabled False -exclude name ATLAS_ -exclude name SBGRID_US_HMS_Orchestra_dev -merge no glideinWMS.xml.test
 
  1. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test2 and reconfig

Cloning CERN Factory

Revision 292012/07/18 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 332 to 332
  After changing, log out, log back in as gfactory, and restart the factory.
Added:
>
>

Factory Patches

UCSD

Date Software Type Description
2012-06-05 gwms modded ~/glideinWMS/creation/web_base/glexec_setup.sh fix glexec detection at EGI sites
2012-06-06 gwms modded ~/glideinWMS/factory/tools/entry_q add GLIDEIN_FACTORY_DIR env var to entry_q
2012-07-02 gwms cherry-pick df01a1bd4a61e0cd640ce182d2db184b91757a5e, aaa12524b3e9efa239836c7e4ab4971957cf35cb add the per-frontend aggregation of log RRDs
2012-07-18 gwms cherry-pick 79f7f99da43fd3277a064ccb71c5541238573253 patch to fix ARC 1.1.x sites
2012-07-18 condor modded /opt/glidecondor/sbin/condor_gridmanager, /opt/glidecondor/sbin/nordugrid_gahp patch to fix ARC 1.1.x sites

GOC-ITB

Date Software Type Description
2012-06-28 gwms cherry-pick df01a1bd4a61e0cd640ce182d2db184b91757a5e, aaa12524b3e9efa239836c7e4ab4971957cf35cb add the per-frontend aggregation of log RRDs
2012-07-18 gwms cherry-pick 79f7f99da43fd3277a064ccb71c5541238573253 patch to fix ARC 1.1.x sites
2012-07-18 condor modded /opt/glidecondor/sbin/condor_gridmanager, /opt/glidecondor/sbin/nordugrid_gahp patch to fix ARC 1.1.x sites
 

GOC Factory Things to Remember

Production Factory: glidein.grid.iu.edu

Revision 282012/07/18 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 348 to 348
 Defaults timestamp_timeout = 0
Added:
>
>

Firewall settings

For condor we give a port range of 20k-50k. See the /etc/iptables.d files for details. Also the condor config must know about it:

###################
# Firewall limits
###################
HIGHPORT=50000
LOWPORT=20000
 

Things to Remember from Glidein Factory Training week 11/11

CREAM issues with lease renewal

Revision 272012/07/12 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 156 to 156
 

Additional notes

  • Globus error 79: connecting to the job manager failed.  Possible reasons: job terminated, invalid job contact, network problems, ..
Changed:
<
<
This can happen if it is a condor site and the admin removes held glideins from their side. Also related to a globus limits issue Brian discovered: https://savannah.cern.ch/support/?128469
>
>
 
  • Globus error 9: the system cancelled the job
Changed:
<
<
Happens when sites preempt glideins for exceeding memory limits or preempt opportunistic glideins. (seen at Michigan and BNL)
>
>
    • Happens when sites preempt glideins for exceeding memory limits or preempt opportunistic glideins. (seen at Michigan and BNL)
 

CREAM Hold Reasons (Work in progress)

Reasons we mostly understand

  • CREAM error: Transfer failed: GRIDFTP_TRANSFER timed out
Changed:
<
<
Can happen with udp packet loss at a site (firewall filtering). UPDATE_COLLECTOR_WITH_TCP should fix it.
>
>
    • Can happen with UDP packet loss at a site (firewall filtering). Setting UPDATE_COLLECTOR_WITH_TCP should fix it.
 
  • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 500 500-Command failed. : globus_l_gfs_file_open failed.  500-globus_xio: Unable to open file /cream_localsandbox/data/prdcms/_DC_ch_DC_cern_OU_computers_CN_cmspilotjob_vocms157_cern_ch_cms_Role_production_Capability_NULL_prdcms35/99/CREAM999015216/OSB/job.479371.9.out  500-globus_xio: System error in open: No such file or directory  500-globus_xio: A system call failed: No such file or directory  500 End.
Changed:
<
<
Happens when CREAM site has no memory of job (possibly removed on remote side) but gridmanager refuses to give up
>
>
    • Happens when the CREAM site has no record of the job (possibly removed on the remote side) but the gridmanager refuses to give up.
 
  • CREAM_Delegate Error: Received NULL fault; the error is due to another cause: FaultString=[connection error] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] - FaultDetail=[Connection timed out]
Changed:
<
<
Site likely down. But sometimes it is because we are being blocked by a firewall: https://savannah.cern.ch/support/?120361. To check if it could be firewall try telnet: telnet cream01.iihe.ac.be 8443
>
>
    • The site is likely down, but sometimes it is because we are being blocked by a firewall: https://savannah.cern.ch/support/?120361. To check whether a firewall could be the cause, try telnet: telnet cream01.iihe.ac.be 8443
 
  • CREAM_Delegate Error: Authorization error: System error reading local user information
Changed:
<
<
We see this on old CREAM installs that don't like / symbols in DNs
>
>
    • We see this on old CREAM installs that don't like / symbols in DNs
 
  • CREAM_Delegate Error: Authorization error: Failed to get the local user id via glexec
Changed:
<
<
This can happen if a site isn't configured to accept /Role=pilot from our proxy: https://savannah.cern.ch/support/?122104
>
>
 
  • CREAM error: CREAM_Job_Register Error: MethodName=[jobRegister] Timestamp=[Tue 18 Oct 2011 06:39:04] ErrorCode=[0] Description=[delegation error: delegation id "1318117200.654933" not found!] FaultCause=[delegProxyInfo "1318117200.654933" not found!]
Changed:
<
<
This happens when there is a really old running glidein on the queue (likely lost in rundiff) with an expired lease. It prevents all later glideins with same user from obtaining a new lease. Just remove it, the held jobs should recover.
>
>
    • This happens when there is a really old running glidein on the queue (likely lost in rundiff) with an expired lease. It prevents all later glideins with the same user from obtaining a new lease. Just remove it; the held jobs should recover.
 

Reasons we don't understand

Deleted:
<
<
  • The following are likely because the job manager on the other end has no record of the glideins anymore and can probably just safely be removed.
    • CREAM error: reason=999
      (not really sure what this means)
 
    • CREAM error: BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:qsub: Queue is not enabled MSG=queue is disabled: user cmprd003@ce08.pic.es, queue glong_sl5-) N/A (jobId = CREAM102242343)
Added:
>
>
    • This can be seen when glideins are submitted to a site in downtime.

  • The following are likely because the job manager on the other end has no record of the glideins anymore and can probably just safely be removed (if the site isn't in downtime).
    • CREAM error: reason=999
      • (not really sure what this means)
 
    • CREAM error: CREAM_Job_Purge Error: job does not exist
    • CREAM error: job aborted because the execution of the JOB_START command has been interrupted by the CREAM shutdown
      

Revision 262012/07/10 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 175 to 175
  Happens when CREAM site has no memory of job (possibly removed on remote side) but gridmanager refuses to give up
Deleted:
<
<
  • CREAM error: BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:qsub: Queue is not enabled MSG=queue is disabled: user prdcms35@cream01.lcg.cscs.ch, queue cms-) N/A (jobId = CREAM960940922)

Seen when site is in downtime.

 
  • CREAM_Delegate Error: Received NULL fault; the error is due to another cause: FaultString=[connection error] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] - FaultDetail=[Connection timed out]

Site likely down. But sometimes it is because we are being blocked by a firewall: https://savannah.cern.ch/support/?120361. To check if it could be firewall try telnet: telnet cream01.iihe.ac.be 8443
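
The telnet check above can also be scripted. The following is a minimal Python sketch, not an existing factory tool: can_connect is a hypothetical helper that attempts a plain TCP connection and reports whether it succeeded within a timeout. A failure by itself cannot distinguish a firewall from a site that is down; it only shows that the port is unreachable from the factory host.

```python
import socket


def can_connect(host, port, timeout=10.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and tries each address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For the example in the text this would be called as can_connect("cream01.iihe.ac.be", 8443).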

Line: 196 to 192
 This happens when there is a really old running glidein on the queue (likely lost in rundiff) with an expired lease. It prevents all later glideins with same user from obtaining a new lease. Just remove it, the held jobs should recover.

Reasons we don't understand

Added:
>
>
  • The following are likely because the job manager on the other end has no record of the glideins anymore and can probably just safely be removed.
    • CREAM error: reason=999
      (not really sure what this means)
    • CREAM error: BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:qsub: Queue is not enabled MSG=queue is disabled: user cmprd003@ce08.pic.es, queue glong_sl5-) N/A (jobId = CREAM102242343)
    • CREAM error: CREAM_Job_Purge Error: job does not exist
    • CREAM error: job aborted because the execution of the JOB_START command has been interrupted by the CREAM shutdown
      
    • CREAM error: BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:pbs_iff: cannot read reply from pbs_server-No Permission.-qsub: cannot connect to server pbs03.pic.es (errno=15007) Unauthorized Request -) N/A (jobId = CREAM258408629)
      
    • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 500 500-Command failed. : globus_l_gfs_file_open failed.  500-globus_xio: Unable to open file /opt/glite/var/cream_sandbox/lt2-cmsprd/_DC_ch_DC_cern_OU_computers_CN_cmspilotjob_vocms157_cern_ch_cms_Role_production_Capability_NULL_lt2-cmsprd713/95/CREAM956614543/OSB/job.714256.8.out  500-globus_xio: System error in open: No such file or directory  500-globus_xio: A system call failed: No such file or directory  500 End.
 
  • If the following are seen over many entries served by the same gridmanager it may be a local issue (but not always). Killing the gridmanager without -9 seems to clear them up:
    • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 500 Command failed. : globus_xio: An end of file occurred
    • CREAM error: Transfer failed: globus_ftp_control_local_port(): Handle not in the proper state CLOSING.
    • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 530 530-Login incorrect. : globus_gss_assist: Error invoking callout  530-globus_callout_module: The callout returned an error  530-an unknown error occurred  530 End.
    • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 500 500-Command failed. : callback failed.  500-an end-of-file was reached  500-globus_xio: The GSI XIO driver failed to establish a secure connection. The failure occured during a handshake read.  500-globus_xio: An end of file occurred  500 End.
Added:
>
>
    • CREAM error: Transfer failed: globus_ftp_control: gss_init_sec_context failed globus_gsi_callback_module: Could not verify credential globus_gsi_callback_module: Could not verify credential globus_gsi_callback_module: Invalid CRL: The available CRL has expired
 

Restarting the Glidein Factory after Reboot

  1. Start httpd (/etc/init.d as root)

Revision 252012/07/09 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_3
-->
Line: 181 to 181
 
  • CREAM_Delegate Error: Received NULL fault; the error is due to another cause: FaultString=[connection error] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] - FaultDetail=[Connection timed out]
Changed:
<
<
Site likely down. But sometimes it is because we are being blocked by a firewall: https://savannah.cern.ch/support/?120361
>
>
Site likely down. But sometimes it is because we are being blocked by a firewall: https://savannah.cern.ch/support/?120361. To check if it could be firewall try telnet: telnet cream01.iihe.ac.be 8443
 
  • CREAM_Delegate Error: Authorization error: System error reading local user information
Line: 191 to 191
  This can happen if a site isn't configured to accept /Role=pilot from our proxy: https://savannah.cern.ch/support/?122104
Deleted:
<
<

Reasons we don't understand

 
  • CREAM error: CREAM_Job_Register Error: MethodName=[jobRegister] Timestamp=[Tue 18 Oct 2011 06:39:04] ErrorCode=[0] Description=[delegation error: delegation id "1318117200.654933" not found!] FaultCause=[delegProxyInfo "1318117200.654933" not found!]
Changed:
<
<
Seen at CMS_T2_DE_DESY_cr, the following error is in gridmanager logs:
>
>
This happens when there is a really old running glidein on the queue (likely lost in rundiff) with an expired lease. It prevents all later glideins with same user from obtaining a new lease. Just remove it, the held jobs should recover.
 
Changed:
<
<
10/28/11 13:23:35 [24918] (355563.1) gmState GM_SUBMIT, remoteState : cream_job_register() failed
>
>

Reasons we don't understand

  • If the following are seen over many entries served by the same gridmanager it may be a local issue (but not always). Killing the gridmanager without -9 seems to clear them up:
    • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 500 Command failed. : globus_xio: An end of file occurred
    • CREAM error: Transfer failed: globus_ftp_control_local_port(): Handle not in the proper state CLOSING.
    • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 530 530-Login incorrect. : globus_gss_assist: Error invoking callout  530-globus_callout_module: The callout returned an error  530-an unknown error occurred  530 End.
    • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 500 500-Command failed. : callback failed.  500-an end-of-file was reached  500-globus_xio: The GSI XIO driver failed to establish a secure connection. The failure occured during a handshake read.  500-globus_xio: An end of file occurred  500 End.
 

Restarting the Glidein Factory after Reboot

  1. Start httpd (/etc/init.d as root)

Revision 242012/05/18 - Main.JeffreyDost

Line: 1 to 1
 

Glidein Factory FAQ

Line: 61 to 61
  </>
<--/twistyPlugin-->
Added:
>
>

Cloning Factories

Below are examples of doing a global clone from UCSD to GOC and CERN factories. They assume you have copied the UCSD config to the respective factory and named it glideinWMS.xml.ucsd.

DISCLAIMER The examples are subject to change due to the constantly evolving nature of our config files. They are current as of 2012-05-18

Description of clone_glidein Arguments

  • -merge yes/no/only
    • yes - modify existing entries in addition to adding new ones
    • no - only add new entries
    • only - only merge existing; don't add new entries
  • -preserve_enable - when merging don't disable sites that are still enabled in original config
  • -disable_old - if a site is in the original config but no longer in the "other" config, disable it
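
To make the three -merge modes concrete, here is a toy Python model of the behavior described in the list above. It is only an illustration of that description, not how clone_glidein is actually implemented: entries are modeled as a plain dict keyed by entry name, clone_entries is a hypothetical function, and the other flags (-preserve_enable, -disable_old) are not modeled.

```python
def clone_entries(original, other, merge="yes"):
    """Toy model of -merge yes/no/only over {entry_name: settings} dicts.

    yes  - modify existing entries and add new ones
    no   - only add new entries
    only - only modify existing entries, add nothing
    """
    if merge not in ("yes", "no", "only"):
        raise ValueError("merge must be 'yes', 'no' or 'only'")
    result = dict(original)
    for name, settings in other.items():
        if name in original:
            # Entry already exists: overwrite it only when merging is on.
            if merge in ("yes", "only"):
                result[name] = settings
        else:
            # Entry is new: bring it in unless we are in merge-only mode.
            if merge in ("yes", "no"):
                result[name] = settings
    return result


current = {"SiteA": {"enabled": "True"}, "SiteB": {"enabled": "True"}}
upstream = {"SiteA": {"enabled": "False"}, "SiteC": {"enabled": "True"}}

for mode in ("yes", "no", "only"):
    print(mode, sorted(clone_entries(current, upstream, mode).items()))
```

In this model, "only" picks up the upstream change to SiteA without importing SiteC, which matches the reason the cloning recipes run a "merge only" pass first.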

Cloning GOC Factory

  1. Manually remove country codes
    grep -v Country glideinWMS.xml.ucsd > glideinWMS.xml.ucsd2
  2. Do a "merge only" first to properly disable newly decommissioned sites without bringing in old ones with enabled="False"
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd2 -out glideinWMS.xml.test -merge only glideinWMS.xml
  3. Run a second time with merge disabled but exclude a few experimental entries
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd2 -merge no -out glideinWMS.xml.test2 -exclude enabled False -exclude name ATLAS_ -exclude name GLOW_US_Vanderbilt_ce -exclude name Engage_US_Florida_iogw1 -exclude name OSG_CrossOSG_ce -exclude name SBGRID_US_HMS_Orchestra_dev -merge no glideinWMS.xml.test
  4. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test2 and reconfig

Cloning CERN Factory

  1. Manually remove country codes
    grep -v Country glideinWMS.xml.ucsd > glideinWMS.xml.ucsd2
  2. Do a "merge only" first to properly disable newly decommissioned sites without bringing in old ones with enabled="False"
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd2 -out glideinWMS.xml.test -merge only -preserve_comments glideinWMS.xml
  3. Run a second time with merge disabled but only include what we want
    ~/glideinWMS/creation/clone_glidein -other glideinWMS.xml.ucsd2 -out glideinWMS.xml.test2 -merge no -include name CMS_T1 -include name CMS_T2 -include name CMS_T3 -exclude enabled False -exclude name CMS_T1_US_FNAL_ce3 -exclude name CMS_T3_US_Omaha_tusker_bigmem glideinWMS.xml.test
  4. When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test2 and reconfig
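
The two-pass CERN procedure above can be sketched as a small Python helper that assembles the same command lines. This is only a convenience sketch: merge_only_cmd and filtered_cmd are hypothetical names, the flags are copied from the commands shown above, and nothing is executed unless the resulting list is passed to subprocess.

```python
import os
import shlex

# Path to the tool as used in the examples above.
CLONE = os.path.expanduser("~/glideinWMS/creation/clone_glidein")


def merge_only_cmd(other, out, current):
    """First pass: merge existing entries only, keeping comments."""
    return [CLONE, "-other", other, "-out", out,
            "-merge", "only", "-preserve_comments", current]


def filtered_cmd(other, out, current, includes, excludes):
    """Second pass: no merging, add only the wanted entries."""
    cmd = [CLONE, "-other", other, "-out", out, "-merge", "no"]
    for name in includes:
        cmd += ["-include", "name", name]
    # Never bring in entries that are disabled upstream.
    cmd += ["-exclude", "enabled", "False"]
    for name in excludes:
        cmd += ["-exclude", "name", name]
    cmd.append(current)
    return cmd


first = merge_only_cmd("glideinWMS.xml.ucsd2", "glideinWMS.xml.test", "glideinWMS.xml")
second = filtered_cmd(
    "glideinWMS.xml.ucsd2", "glideinWMS.xml.test2", "glideinWMS.xml.test",
    includes=["CMS_T1", "CMS_T2", "CMS_T3"],
    excludes=["CMS_T1_US_FNAL_ce3", "CMS_T3_US_Omaha_tusker_bigmem"],
)
for cmd in (first, second):
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

To actually run a pass, hand the list to subprocess.run(cmd, check=True) and inspect the output file before replacing glideinWMS.xml.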
 

Running the site status report

 gfactory@glidein-1 ~$ glideinWMS/factory/tools/analyze_entries -o ~/logae/
Line: 127 to 153
 
Globus Error Code Held Reason Job is Recoverable
10 globus_xio_gsi: Token size exceeds limit. Usually happens when someone tries to establish a insecure connection with a secure endpoint, e.g. when someone sends plain HTTP to a HTTPS endpoint without No
121 the job state file doesn't exist No
126 it is unknown if the job was submitted Yes
12 the connection to the server failed (check host and port) Yes
131 the user proxy expired (job is still running) Maybe
17 the job failed when the job manager attempted to run it No
22 the job manager failed to create an internal script argument file No
31 the job manager failed to cancel the job as requested No
3 an I/O operation failed Yes
47 the gatekeeper failed to run the job manager No
48 the provided RSL could not be properly parsed No
4 jobmanager unable to set default to the directory requested No
76 cannot access cache files in ~/.globus/.gass_cache, check permissions, quota, and disk space Maybe (Short term: No)
79 connecting to the job manager failed. Possible reasons: job terminated, invalid job contact, network problems, ... No
7 an authorization operation failed Yes
7 authentication with the remote server failed Yes
8 the user cancelled the job No
94 the jobmanager does not accept any new requests (shutting down) Yes
9 the system cancelled the job No
? Job failed, no reason given by GRAM server No
122 could not read the job state file Maybe (short term: no)
132 the job was not submitted by original jobmanager No (likely to be fatal)

</>

<--/twistyPlugin-->
Changed:
<
<

Additional notes:

>
>

Additional notes

 
  • Globus error 79: connecting to the job manager failed.  Possible reasons: job terminated, invalid job contact, network problems, ..
Changed:
<
<
This can happen if it is a condor site and the admin removes held glideins from their side.
>
>
This can happen if it is a condor site and the admin removes held glideins from their side. Also related to a globus limits issue Brian discovered: https://savannah.cern.ch/support/?128469
 
  • Globus error 9: the system cancelled the job
Line: 286 to 312
  Until GOC institutes a custom form for the Glidein Factory, begin by visiting https://ticket.grid.iu.edu/goc/other and load your certificate. Please select the VO on whose behalf you are submitting the ticket. Then under "Add CC" add osg-gfactory-support@physics.ucsd.edu. Finally, type a message describing the problem and hit submit.
Added:
>
>

Dealing with Scalability limits

The factory's open file usage grows with the number of entries in the config, so eventually the gfactory user's max open file limit will be hit. This can be seen in ~/glideinsubmit/glidein_Production_v2_0/log/factory/factory.*.info.log:

[2012-05-11T12:44:02-07:00 6730] WARNING: Exception occurred: ['Traceback (most recent call last):\n', '  File "/home/gfactory/glideinWMS/factory/glideFactory.py", line 432, in main\n    glideinDescript,entries,restart_attempts,restart_interval)\n', '  File "/home/gfactory/glideinWMS/factory/glideFactory.py", line 213, in spawn\n    childs[entry_name]=popen2.Popen3("%s %s %s %s %s %s %s"%(sys.executable,os.path.join(STARTUP_DIR,"glideFactoryEntry.py"),os.getpid(),sleep_time,advertize_rate,startup_dir,entry_name),True)\n', '  File "/usr/lib64/python2.4/popen2.py", line 43, in __init__\n    c2pread, c2pwrite = os.pipe()\n', 'OSError: [Errno 24] Too many open files\n']

To deal with this, increase ulimits. Right now we have it at 50k for gfactory user. In ~/.bash_profile:

ulimit -n 50240

in /etc/security/limits.conf:

gfactory        hard    nofile          50240

After changing, log out, log back in as gfactory, and restart the factory.
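
After logging back in, the new limit can be confirmed from Python with the standard resource module (Unix only). This is a minimal sketch: nofile_limits and meets_target are hypothetical helper names, and 50240 is simply the value used above.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def meets_target(limit, target=50240):
    """True if the limit is unlimited or at least the target value."""
    return limit == resource.RLIM_INFINITY or limit >= target


soft, hard = nofile_limits()
print("soft:", soft, "ok" if meets_target(soft) else "too low")
print("hard:", hard, "ok" if meets_target(hard) else "too low")
```

Run it as the gfactory user in a fresh login shell; the factory inherits the soft limit of the shell that starts it, so that is the value that matters.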

 

GOC Factory Things to Remember

Production Factory: glidein.grid.iu.edu

Revision 232012/04/24 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_1
-->
Line: 133 to 133
  This can happen if it is a condor site and the admin removes held glideins from their side.
Added:
>
>
  • Globus error 9: the system cancelled the job

Happens when sites preempt glideins for exceeding memory limits or preempt opportunistic glideins. (seen at Michigan and BNL)

 

CREAM Hold Reasons (Work in progress)

Reasons we mostly understand

Revision 222012/04/23 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_1
-->
Line: 127 to 127
 
Globus Error Code Held Reason Job is Recoverable
10 globus_xio_gsi: Token size exceeds limit. Usually happens when someone tries to establish a insecure connection with a secure endpoint, e.g. when someone sends plain HTTP to a HTTPS endpoint without No
121 the job state file doesn't exist No
126 it is unknown if the job was submitted Yes
12 the connection to the server failed (check host and port) Yes
131 the user proxy expired (job is still running) Maybe
17 the job failed when the job manager attempted to run it No
22 the job manager failed to create an internal script argument file No
31 the job manager failed to cancel the job as requested No
3 an I/O operation failed Yes
47 the gatekeeper failed to run the job manager No
48 the provided RSL could not be properly parsed No
4 jobmanager unable to set default to the directory requested No
76 cannot access cache files in ~/.globus/.gass_cache, check permissions, quota, and disk space Maybe (Short term: No)
79 connecting to the job manager failed. Possible reasons: job terminated, invalid job contact, network problems, ... No
7 an authorization operation failed Yes
7 authentication with the remote server failed Yes
8 the user cancelled the job No
94 the jobmanager does not accept any new requests (shutting down) Yes
9 the system cancelled the job No
? Job failed, no reason given by GRAM server No
122 could not read the job state file Maybe (short term: no)
132 the job was not submitted by original jobmanager No (likely to be fatal)

</>

<--/twistyPlugin-->
Added:
>
>

Additional notes:

  • Globus error 79: connecting to the job manager failed.  Possible reasons: job terminated, invalid job contact, network problems, ..

This can happen if it is a condor site and the admin removes held glideins from their side.

 

CREAM Hold Reasons (Work in progress)

Reasons we mostly understand

Revision 212012/04/12 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_1
-->
Line: 131 to 131
 

Reasons we mostly understand

Added:
>
>
  • CREAM error: Transfer failed: GRIDFTP_TRANSFER timed out

Can happen with udp packet loss at a site (firewall filtering). UPDATE_COLLECTOR_WITH_TCP should fix it.

  • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 500 500-Command failed. : globus_l_gfs_file_open failed.  500-globus_xio: Unable to open file /cream_localsandbox/data/prdcms/_DC_ch_DC_cern_OU_computers_CN_cmspilotjob_vocms157_cern_ch_cms_Role_production_Capability_NULL_prdcms35/99/CREAM999015216/OSB/job.479371.9.out  500-globus_xio: System error in open: No such file or directory  500-globus_xio: A system call failed: No such file or directory  500 End.

Happens when CREAM site has no memory of job (possibly removed on remote side) but gridmanager refuses to give up

  • CREAM error: BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:qsub: Queue is not enabled MSG=queue is disabled: user prdcms35@cream01.lcg.cscs.ch, queue cms-) N/A (jobId = CREAM960940922)

Seen when site is in downtime.

 
  • CREAM_Delegate Error: Received NULL fault; the error is due to another cause: FaultString=[connection error] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] - FaultDetail=[Connection timed out]

Site likely down. But sometimes it is because we are being blocked by a firewall: https://savannah.cern.ch/support/?120361

Line: 145 to 157
 

Reasons we don't understand

Changed:
<
<
  • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 500 500-Command failed. : globus_l_gfs_file_open failed.  500-globus_xio: Unable to open file /cream_localsandbox/data/cms/_DC_org_DC_doegrids_OU_Services_CN_uscmspilot09_glidein_1_t2_ucsd_edu_cms_Role_pilot_Capability_NULL_cms242/67/CREAM672719344/OSB/job.351295.0.out  500-globus_xio: System error in open: No such file or directory  500-globus_xio: A system call failed: No such file or directory  500 End.
>
>
  • CREAM error: CREAM_Job_Register Error: MethodName=[jobRegister] Timestamp=[Tue 18 Oct 2011 06:39:04] ErrorCode=[0] Description=[delegation error: delegation id "1318117200.654933" not found!] FaultCause=[delegProxyInfo "1318117200.654933" not found!]
 
Changed:
<
<
Seen at CMS_T2_CH_CSCS_cream02. Also the following error is in gridmanager logs:
>
>
Seen at CMS_T2_DE_DESY_cr, the following error is in gridmanager logs:
 
10/28/11 13:23:35 [24918] (355563.1) gmState GM_SUBMIT, remoteState : cream_job_register() failed
Deleted:
<
<
  • CREAM error: CREAM_Job_Register Error: MethodName=[jobRegister] Timestamp=[Tue 18 Oct 2011 06:39:04] ErrorCode=[0] Description=[delegation error: delegation id "1318117200.654933" not found!] FaultCause=[delegProxyInfo "1318117200.654933" not found!]

Seen at CMS_T2_DE_DESY_cr, same gridmanager log error can be seen as pasted above.

 

Restarting the Glidein Factory after Reboot

  1. Start httpd (/etc/init.d as root)
  2. Start Condor (/etc/init.d as root)

Revision 202012/02/09 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_1
-->
Line: 222 to 222
 https://savannah.cern.ch/support/?group=cmscompinfrasup&func=additem
Added:
>
>
For non-CMS European sites, use GOC:

https://ticket.grid.iu.edu/goc/other

Select the proper VO and note that the ticket is for an EGI or WLCG resource; GOC will forward it to GGUS.

 

Verification of Proxies

An example of how to verify the proxies used by the frontend. Log into the gfactory first.

Revision 192011/11/10 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_1
-->
Line: 278 to 278
 Defaults timestamp_timeout = 0
Added:
>
>

Things to Remember from Glidein Factory Training week 11/11

CREAM issues with lease renewal

Condor-G sets expired leases for some of our held glideins. This can be checked in the job ClassAd attribute JobLeaseExpiration.
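As a rough illustration of the lease check, the sketch below compares a job's JobLeaseExpiration value (seconds since the Unix epoch) with the current time. The function name lease_expired and the sample numbers are hypothetical; in practice the value would be read from the held glidein's ClassAd, for example from condor_q -l output.

```python
import time


def lease_expired(job_lease_expiration, now=None):
    """Return True if the lease time (Unix epoch seconds) is already in the past."""
    if now is None:
        now = time.time()
    return job_lease_expiration < now


# Hypothetical values, for illustration only: the lease ended 10 minutes before "now".
print(lease_expired(1320000000, now=1320000600))
```

Passing `now` explicitly makes it easy to check a saved ClassAd dump against the time at which the dump was taken.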
 

Authors

-- TerrenceMartin

Revision 182011/10/31 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_1
-->
Line: 131 to 131
 

Reasons we mostly understand

Changed:
<
<
  • CREAM_Delegate Error: Received NULL fault; the error is due to another cause: FaultString=[connection error] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] - FaultDetail=[Connection timed out]
    
>
>
  • CREAM_Delegate Error: Received NULL fault; the error is due to another cause: FaultString=[connection error] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] - FaultDetail=[Connection timed out]
 
Changed:
<
<
Site likely down.
>
>
Site likely down. But sometimes it is because we are being blocked by a firewall: https://savannah.cern.ch/support/?120361
 
Changed:
<
<
  • CREAM_Delegate Error: Authorization error: System error reading local user information
    
>
>
  • CREAM_Delegate Error: Authorization error: System error reading local user information
  We see this on old CREAM installs that don't like / symbols in DNs
Added:
>
>
  • CREAM_Delegate Error: Authorization error: Failed to get the local user id via glexec

This can happen if a site isn't configured to accept /Role=pilot from our proxy: https://savannah.cern.ch/support/?122104

 

Reasons we don't understand

Changed:
<
<
  • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 500 500-Command failed. : globus_l_gfs_file_open failed.  500-globus_xio: Unable to open file /cream_localsandbox/data/cms/_DC_org_DC_doegrids_OU_Services_CN_uscmspilot09_glidein_1_t2_ucsd_edu_cms_Role_pilot_Capability_NULL_cms242/67/CREAM672719344/OSB/job.351295.0.out  500-globus_xio: System error in open: No such file or directory  500-globus_xio: A system call failed: No such file or directory  500 End.
    
>
>
  • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 500 500-Command failed. : globus_l_gfs_file_open failed.  500-globus_xio: Unable to open file /cream_localsandbox/data/cms/_DC_org_DC_doegrids_OU_Services_CN_uscmspilot09_glidein_1_t2_ucsd_edu_cms_Role_pilot_Capability_NULL_cms242/67/CREAM672719344/OSB/job.351295.0.out  500-globus_xio: System error in open: No such file or directory  500-globus_xio: A system call failed: No such file or directory  500 End.
  Seen at CMS_T2_CH_CSCS_cream02. Also the following error is in gridmanager logs:
Changed:
<
<
10/28/11 13:23:35 [24918] (355563.1) gmState GM_SUBMIT, remoteState : cream_job_register() failed
>
>
10/28/11 13:23:35 [24918] (355563.1) gmState GM_SUBMIT, remoteState : cream_job_register() failed
 
Changed:
<
<
  • CREAM error: CREAM_Job_Register Error: MethodName=[jobRegister] Timestamp=[Tue 18 Oct 2011 06:39:04] ErrorCode=[0] Description=[delegation error: delegation id "1318117200.654933" not found!] FaultCause=[delegProxyInfo "1318117200.654933" not found!] 
    
>
>
  • CREAM error: CREAM_Job_Register Error: MethodName=[jobRegister] Timestamp=[Tue 18 Oct 2011 06:39:04] ErrorCode=[0] Description=[delegation error: delegation id "1318117200.654933" not found!] FaultCause=[delegProxyInfo "1318117200.654933" not found!]
  Seen at CMS_T2_DE_DESY_cr, same gridmanager log error can be seen as pasted above.

Revision 172011/10/28 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_1
-->
Line: 127 to 127
 
| Globus Error Code | Held Reason | Job is Recoverable |
| 10 | globus_xio_gsi: Token size exceeds limit. Usually happens when someone tries to establish an insecure connection with a secure endpoint, e.g. when someone sends plain HTTP to an HTTPS endpoint without | No |
| 121 | the job state file doesn't exist | No |
| 126 | it is unknown if the job was submitted | Yes |
| 12 | the connection to the server failed (check host and port) | Yes |
| 131 | the user proxy expired (job is still running) | Maybe |
| 17 | the job failed when the job manager attempted to run it | No |
| 22 | the job manager failed to create an internal script argument file | No |
| 31 | the job manager failed to cancel the job as requested | No |
| 3 | an I/O operation failed | Yes |
| 47 | the gatekeeper failed to run the job manager | No |
| 48 | the provided RSL could not be properly parsed | No |
| 4 | jobmanager unable to set default to the directory requested | No |
| 76 | cannot access cache files in ~/.globus/.gass_cache, check permissions, quota, and disk space | Maybe (Short term: No) |
| 79 | connecting to the job manager failed. Possible reasons: job terminated, invalid job contact, network problems, ... | No |
| 7 | an authorization operation failed | Yes |
| 7 | authentication with the remote server failed | Yes |
| 8 | the user cancelled the job | No |
| 94 | the jobmanager does not accept any new requests (shutting down) | Yes |
| 9 | the system cancelled the job | No |
| ? | Job failed, no reason given by GRAM server | No |
| 122 | could not read the job state file | Maybe (short term: no) |
| 132 | the job was not submitted by original jobmanager | No (likely to be fatal) |


<--/twistyPlugin-->
Added:
>
>

CREAM Hold Reasons (Work in progress)

Reasons we mostly understand

  • CREAM_Delegate Error: Received NULL fault; the error is due to another cause: FaultString=[connection error] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] - FaultDetail=[Connection timed out]
    

Site likely down.

  • CREAM_Delegate Error: Authorization error: System error reading local user information
    

We see this on old CREAM installs that don't like / symbols in DNs

Reasons we don't understand

  • CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 500 500-Command failed. : globus_l_gfs_file_open failed.  500-globus_xio: Unable to open file /cream_localsandbox/data/cms/_DC_org_DC_doegrids_OU_Services_CN_uscmspilot09_glidein_1_t2_ucsd_edu_cms_Role_pilot_Capability_NULL_cms242/67/CREAM672719344/OSB/job.351295.0.out  500-globus_xio: System error in open: No such file or directory  500-globus_xio: A system call failed: No such file or directory  500 End.
    

Seen at CMS_T2_CH_CSCS_cream02. Also the following error is in gridmanager logs:

10/28/11 13:23:35 [24918] (355563.1) gmState GM_SUBMIT, remoteState : cream_job_register() failed

  • CREAM error: CREAM_Job_Register Error: MethodName=[jobRegister] Timestamp=[Tue 18 Oct 2011 06:39:04] ErrorCode=[0] Description=[delegation error: delegation id "1318117200.654933" not found!] FaultCause=[delegProxyInfo "1318117200.654933" not found!] 
    

Seen at CMS_T2_DE_DESY_cr, same gridmanager log error can be seen as pasted above.
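The "mostly understood" reasons above lend themselves to a simple substring lookup when scanning held glideins. The following is only a sketch: the function name explain_hold_reason and the choice of substrings are assumptions, and the notes are the ones recorded in the list above.

```python
# Substrings taken from the hold reasons listed above, paired with the notes recorded for them.
KNOWN_CREAM_REASONS = [
    ("FaultDetail=[Connection timed out]", "Site likely down."),
    ("System error reading local user information",
     "Old CREAM install that does not like / symbols in DNs."),
]


def explain_hold_reason(hold_reason):
    """Return the recorded note for a known CREAM hold reason, or None if it is not listed."""
    for needle, note in KNOWN_CREAM_REASONS:
        if needle in hold_reason:
            return note
    return None


print(explain_hold_reason(
    "CREAM_Delegate Error: Authorization error: "
    "System error reading local user information"))
```

Reasons that return None are the ones worth adding to the list once they are understood.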

 

Restarting the Glidein Factory after Reboot

  1. Start httpd (/etc/init.d as root)
  2. Start Condor (/etc/init.d as root)

Revision 162011/10/24 - Main.JeffreyDost

Line: 1 to 1
 
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_1
-->
Line: 23 to 23
 hideimgleft="/twiki2/pub/TWiki/TWikiDocGraphics/toggleclose-small.gif" }%
Changed:
<
<
>
>
 

Revision 152011/10/19 - Main.JeffreyDost

Line: 1 to 1
Added:
>
>
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_1
-->
 

Glidein Factory FAQ

Contents

Line: 7 to 11
 

Adding a New Site to Glidein Factory

Added:
>
>

Entry Templates

CMS cream:

<--/twistyPlugin twikiMakeVisibleInline-->
      <entry name="" comment="" enabled="True" gatekeeper="https://%HOSTNAME%:8443/ce-cream/services/CREAM2 %BATCH% %QUEUE%" gridtype="cream" verbosity="std" work_dir="TMPDIR">
         <config>
            <max_jobs held="25" idle="400" running="10000">
               <max_job_frontends>
               </max_job_frontends>
            </max_jobs>
            <release max_per_cycle="20" sleep="0.2"/>
            <remove max_per_cycle="5" sleep="0.2"/>
            <restrictions require_voms_proxy="False"/>
            <submit cluster_size="10" max_per_cycle="100" sleep="0.2"/>
         </config>
         <downtimes/>
         <allow_frontends>
         </allow_frontends>
         <attrs>
            <attr name="CONDOR_OS" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="False" type="string" value="default"/>
            <attr name="GLEXEC_BIN" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="True" type="string" value="NONE"/>
            <attr name="GLIDEIN_CMSSite" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value=""/>
            <attr name="GLIDEIN_Max_Walltime" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="True" type="int" value="114840"/>
            <attr name="GLIDEIN_ResourceName" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value=""/>
            <attr name="GLIDEIN_SEs" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value=""/>
            <attr name="GLIDEIN_Site" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value=""/>
            <attr name="GLIDEIN_Supported_VOs" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="True" type="string" value="CMS"/>
            <attr name="USE_CCB" const="True" glidein_publish="True" job_publish="False" parameter="True" publish="True" type="string" value="True"/>
         </attrs>
         <files>
         </files>
         <infosys_refs>
            <infosys_ref ref="GlueCEUniqueID=" server="exp-bdii.cern.ch" type="BDII"/>
         </infosys_refs>
         <monitorgroups>
            <monitorgroup group_name="CMST2"/>
            <monitorgroup group_name="CMS"/>
         </monitorgroups>
      </entry>
<--/twistyPlugin-->
 

Running the site status report

 gfactory@glidein-1 ~$ glideinWMS/factory/tools/analyze_entries -o ~/logae/
Line: 63 to 117
 

Globus Hold Reasons

Added:
>
>
<--/twistyPlugin twikiMakeVisibleInline-->
 
| Globus Error Code | Held Reason | Job is Recoverable |
| 10 | globus_xio_gsi: Token size exceeds limit. Usually happens when someone tries to establish an insecure connection with a secure endpoint, e.g. when someone sends plain HTTP to an HTTPS endpoint without | No |
| 121 | the job state file doesn't exist | No |
| 126 | it is unknown if the job was submitted | Yes |
| 12 | the connection to the server failed (check host and port) | Yes |
| 131 | the user proxy expired (job is still running) | Maybe |
| 17 | the job failed when the job manager attempted to run it | No |
| 22 | the job manager failed to create an internal script argument file | No |
| 31 | the job manager failed to cancel the job as requested | No |
| 3 | an I/O operation failed | Yes |
| 47 | the gatekeeper failed to run the job manager | No |
| 48 | the provided RSL could not be properly parsed | No |
| 4 | jobmanager unable to set default to the directory requested | No |
| 76 | cannot access cache files in ~/.globus/.gass_cache, check permissions, quota, and disk space | Maybe (Short term: No) |
| 79 | connecting to the job manager failed. Possible reasons: job terminated, invalid job contact, network problems, ... | No |
| 7 | an authorization operation failed | Yes |
| 7 | authentication with the remote server failed | Yes |
| 8 | the user cancelled the job | No |
| 94 | the jobmanager does not accept any new requests (shutting down) | Yes |
| 9 | the system cancelled the job | No |
| ? | Job failed, no reason given by GRAM server | No |
| 122 | could not read the job state file | Maybe (short term: no) |
| 132 | the job was not submitted by original jobmanager | No (likely to be fatal) |
Added:
>
>
<--/twistyPlugin-->
 

Restarting the Glidein Factory after Reboot

  1. Start httpd (/etc/init.d as root)
Line: 194 to 256
  -- IgorSfiligoi
Changed:
<
<
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_1
-->
>
>

Revision 142011/09/29 - Main.JeffreyDost

Line: 1 to 1
 

Glidein Factory FAQ

Contents

Line: 61 to 61
  However, it is unlikely you need to look at that.
Added:
>
>

Globus Hold Reasons

| Globus Error Code | Held Reason | Job is Recoverable |
| 10 | globus_xio_gsi: Token size exceeds limit. Usually happens when someone tries to establish an insecure connection with a secure endpoint, e.g. when someone sends plain HTTP to an HTTPS endpoint without | No |
| 121 | the job state file doesn't exist | No |
| 126 | it is unknown if the job was submitted | Yes |
| 12 | the connection to the server failed (check host and port) | Yes |
| 131 | the user proxy expired (job is still running) | Maybe |
| 17 | the job failed when the job manager attempted to run it | No |
| 22 | the job manager failed to create an internal script argument file | No |
| 31 | the job manager failed to cancel the job as requested | No |
| 3 | an I/O operation failed | Yes |
| 47 | the gatekeeper failed to run the job manager | No |
| 48 | the provided RSL could not be properly parsed | No |
| 4 | jobmanager unable to set default to the directory requested | No |
| 76 | cannot access cache files in ~/.globus/.gass_cache, check permissions, quota, and disk space | Maybe (Short term: No) |
| 79 | connecting to the job manager failed. Possible reasons: job terminated, invalid job contact, network problems, ... | No |
| 7 | an authorization operation failed | Yes |
| 7 | authentication with the remote server failed | Yes |
| 8 | the user cancelled the job | No |
| 94 | the jobmanager does not accept any new requests (shutting down) | Yes |
| 9 | the system cancelled the job | No |
| ? | Job failed, no reason given by GRAM server | No |
| 122 | could not read the job state file | Maybe (short term: no) |
| 132 | the job was not submitted by original jobmanager | No (likely to be fatal) |
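When triaging many held glideins, the table above can be turned into a small lookup. This is a sketch only: the names GLOBUS_RECOVERABLE and globus_recoverable are hypothetical, the qualifiers such as "(Short term: No)" are collapsed to a plain "Maybe" or "No", and the "?" row is left out because it has no numeric code.

```python
# Recoverability per Globus error code, copied from the table above.
# Code 7 appears twice in the table, both times marked "Yes".
GLOBUS_RECOVERABLE = {
    3: "Yes", 4: "No", 7: "Yes", 8: "No", 9: "No", 10: "No",
    12: "Yes", 17: "No", 22: "No", 31: "No", 47: "No", 48: "No",
    76: "Maybe", 79: "No", 94: "Yes", 121: "No", 122: "Maybe",
    126: "Yes", 131: "Maybe", 132: "No",
}


def globus_recoverable(code):
    """Return "Yes", "No" or "Maybe" for a listed code, or None if the code is not in the table."""
    return GLOBUS_RECOVERABLE.get(code)


for code in (12, 131, 8):
    print(code, globus_recoverable(code))
```

Returning None for unlisted codes, rather than guessing, keeps unknown hold reasons visible for manual inspection.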
 

Restarting the Glidein Factory after Reboot

  1. Start httpd (/etc/init.d as root)
  2. Start Condor (/etc/init.d as root)

Revision 132011/09/21 - Main.JeffreyDost

Line: 1 to 1
 

Glidein Factory FAQ

Contents

Line: 17 to 17
 

Monitoring webpages

Glidein Factory Status

Changed:
<
<
http://glidein-1.t2.ucsd.edu:8319/glidefactory/monitor/glidein_Production_v3_1/factoryStatus.html
>
>
http://glidein-1.t2.ucsd.edu:8319/glidefactory/monitor/glidein_v2_0/factoryStatus.html
  Load this, the default "Entry" is 'total' (it can also be per-site), and hit "update."
Line: 40 to 40
 A script has been written to analyze the Glidein Factory Status data. To learn more about it, click here.

Glidein Factory Status Now

Changed:
<
<
http://glidein-1.t2.ucsd.edu:8319/glidefactory/monitor/glidein_Production_v3_1/factoryStatusNow.html This page displays a table of live data which corresponds to the same data as shown in the plots under Glidein Factory Status. The information is further divided by VO. See GlideinFactoryStatusNow for a longer and more detailed discussion.
>
>
http://glidein-1.t2.ucsd.edu:8319/glidefactory/monitor/glidein_v2_0/factoryStatusNow.html This page displays a table of live data which corresponds to the same data as shown in the plots under Glidein Factory Status. The information is further divided by VO. See GlideinFactoryStatusNow for a longer and more detailed discussion.
 

GFactory log directories

The factory daemons log files can be found in

Changed:
<
<
~/glideinsubmit/glidein_Production_v3_1/log/entry_<entry>/factory.<date>.*.log 
>
>
~/glideinsubmit/glidein_v2_0/log/entry_<entry>/factory.<date>.*.log 
  The list of all completed jobs can be found in
Changed:
<
<
 ~/glideinsubmit/glidein_Production_v3_1/log/entry_CMS_T2_US_UCSD_gw2/completed_jobs_<date>.log
>
>
 ~/glideinsubmit/glidein_v2_0/log/entry_CMS_T2_US_UCSD_gw2/completed_jobs_<date>.log
  The glidein exit logs can be found in
Changed:
<
<
~/glideinsubmit/glidein_Production_v3_1/client_log/user_<frontend>/entry_<entry>/job.*.out|err 
>
>
~/glideinsubmit/glidein_v2_0/client_log/user_<frontend>/entry_<entry>/job.*.out|err 
  The Condor schedd job log can be found in
Changed:
<
<
~/glideinsubmit/glidein_Production_v3_1/client_log/user_<frontend>/entry_<entry>/condor_activity_<date>_*.log 
>
>
~/glideinsubmit/glidein_v2_0/client_log/user_<frontend>/entry_<entry>/condor_activity_<date>_*.log 
  Finally, the factory as a whole has its logs in
Changed:
<
<
 ~/glideinsubmit/glidein_Production_v3_1/log/entry_<entry>/factory.<date>.*.log 
>
>
 ~/glideinsubmit/glidein_v2_0/log/entry_<entry>/factory.<date>.*.log 
  However, it is unlikely you need to look at that.

Restarting the Glidein Factory after Reboot

  1. Start httpd (/etc/init.d as root)
  2. Start Condor (/etc/init.d as root)
Changed:
<
<
  1. Start gfactory (~/glideinsubmit/glidein_Production_v3_1/factory_startup start as gfactory)
>
>
  1. Start gfactory (~/glideinsubmit/glidein_v2_0/factory_startup start as gfactory)
 

Adding a new VO

Line: 73 to 73
 
  1. (as root) add the new user and group into
    /etc/condor/privsep_config
    sections
    valid-target-uids and valid-target-gids
  2. (as root) Add the VO pilot DNs to
    /etc/grid-security/grid-mapfile
    Note: Only needed if using CREAM.
  3. (as root) Add VO to the Condor config (note: the UNIX user name and the Condor name may be different, but can be the same, as in this example)
    ~/glideinWMS/install# ./glidecondor_addDN -daemon "VO1 Frontend DN" "<VO1 DN>" fevo1
    Reconfig condor
    /opt/glidecondor/sbin/condor_reconfig -collector
Changed:
<
<
  1. (as gfactory) Add the VO to the gfactory config
    ~/glideinsubmit/glidein_Production_v3_1.cfg/glideinWMS.xml
    Example change
    <frontend name="vo1-glidein" identity="fevo1@glidein-1.t2.ucsd.edu">
    <security_classes>
    <security_class name="frontend" username="fevo1"/>
    </security_classes>
    </frontend>

    Reconfig factory
    ~/glideinsubmit/glidein_Production_v3_1$ ./factory_startup reconfig ../glidein_Production_v3_1.cfg/glideinWMS.xml
>
>
  1. (as gfactory) Add the VO to the gfactory config
    ~/glideinsubmit/glidein_v2_0.cfg/glideinWMS.xml
    Example change
    <frontend name="vo1-glidein" identity="fevo1@glidein-1.t2.ucsd.edu">
    <security_classes>
    <security_class name="frontend" username="fevo1"/>
    </security_classes>
    </frontend>

    Reconfig factory
    ~/glideinsubmit/glidein_v2_0$ ./factory_startup reconfig ../glidein_v2_0.cfg/glideinWMS.xml
  For resource selection purposes:
  1. Identify the entries they can use (no obvious way just yet)
Line: 84 to 84
 The glidein factory is mostly stateless... if we were to lose the disk used by it, we should be able to reconstruct the gfactory within hours by using a few config files.

The main configuration file is glideinWMS.xml. It defines almost everything else in a factory configuration.
To be on the safe side, one should however back up the whole factory directory tree... currently this is:

Changed:
<
<
/var/gfactory/glideinsubmit/glidein_Production_v3_1/
>
>
/var/gfactory/glideinsubmit/glidein_v2_0/
  Since there may be several factories installed on the same node, backing up the base directory is the easiest solution to not forget any of them:
/var/gfactory/glideinsubmit/
Line: 146 to 146
  Get a list of the proxies for a VO and CE
Changed:
<
<
./proxy_info fecms ls -l '/var/gfactory/clientproxies/user_fecms/glidein_Production_v3_1/entry_CMS_T2_US_UCSD_gw2/'
>
>
./proxy_info fecms ls -l '/var/gfactory/clientproxies/user_fecms/glidein_v2_0/entry_CMS_T2_US_UCSD_gw2/'
  Display a particular proxy's information
Changed:
<
<
./proxy_info fecms info '/var/gfactory/clientproxies/user_fecms/glidein_Production_v3_1/entry_CMS_T2_US_UCSD_gw2/x509_CMS_T2_US_UCSD_gw2@Production_v3_1@UCSD@UCSD.minus,v5_0.dot,main_umrw_5.proxy'
>
>
./proxy_info fecms info '/var/gfactory/clientproxies/user_fecms/glidein_v2_0/entry_CMS_T2_US_UCSD_gw2/x509_CMS_T2_US_UCSD_gw2@v2_0@UCSD@UCSD.minus,v5_0.dot,main_umrw_5.proxy'
  For additional tool help run
Line: 189 to 189
 -- TerrenceMartin

-- IgorSfiligoi

Added:
>
>
<-- TWIKI VARIABLES 
  • Set UCSD_VERS = Production_v4_1
-->

Revision 122011/09/07 - Main.JeffreyDost

Line: 1 to 1
 

Glidein Factory FAQ

Contents

Line: 168 to 168
  Until GOC institutes a custom form for the Glidein Factory, begin by visiting https://ticket.grid.iu.edu/goc/other and load your certificate. Please select the VO on whose behalf you are submitting the ticket. Then under "Add CC" add osg-gfactory-support@physics.ucsd.edu. Finally, type a message to describe the problem and hit submit.
Added:
>
>

GOC Factory Things to Remember

Production Factory: glidein.grid.iu.edu
ITB Factory: glidein-itb.grid.iu.edu

Turn off timeout for sudo

run:

/usr/sbin/visudo

add the following:

Defaults    timestamp_timeout = 0
 

Authors

-- TerrenceMartin

Revision 112010/12/13 - Main.IanMacNeill

Line: 1 to 1
 

Glidein Factory FAQ

Contents

Line: 162 to 162
  ./proxy_info -h
Added:
>
>

How To Open A Ticket To Contact Glidein Factory Support

Although we encourage users to contact us directly at osg-gfactory-support@physics.ucsd.edu, a ticket may be opened should the user deem it appropriate.

Until GOC institutes a custom form for the Glidein Factory, begin by visiting https://ticket.grid.iu.edu/goc/other and load your certificate. Please select the VO on whose behalf you are submitting the ticket. Then under "Add CC" add osg-gfactory-support@physics.ucsd.edu. Finally, type a message to describe the problem and hit submit.

 

Authors

-- TerrenceMartin

Revision 102010/08/17 - Main.TerrenceMartin

Line: 1 to 1
 

Glidein Factory FAQ

Contents

Line: 128 to 128
 https://savannah.cern.ch/support/?group=cmscompinfrasup&func=additem
Added:
>
>

Verification of Proxies

An example of how to verify the proxies used by the frontend. Log into the gfactory first.

Set up the environment for voms-proxy tools.

source /opt/vdt/setup.sh

Access the factory tools

cd ~/glideinWMS/factory/tools

Get a list of the proxies for a VO and CE

./proxy_info fecms ls -l '/var/gfactory/clientproxies/user_fecms/glidein_Production_v3_1/entry_CMS_T2_US_UCSD_gw2/'

Display a particular proxy's information

./proxy_info fecms info '/var/gfactory/clientproxies/user_fecms/glidein_Production_v3_1/entry_CMS_T2_US_UCSD_gw2/x509_CMS_T2_US_UCSD_gw2@Production_v3_1@UCSD@UCSD.minus,v5_0.dot,main_umrw_5.proxy'

For additional tool help run

 ./proxy_info -h
 

Authors

-- TerrenceMartin

Revision 92010/08/12 - Main.IgorSfiligoi

Line: 1 to 1
 

Glidein Factory FAQ

Contents

Line: 109 to 109
 Nothing else in Condor needs to be backed up, as it can easily be recreated using the glideinWMS installation script.

The same should apply to all other software components the factory is relying on.

Added:
>
>

How to contact Grid sites

For OSG sites, use the GOC

https://ticket.grid.iu.edu/goc/open

For CMS sites, use Savannah:

https://savannah.cern.ch/

Search for:
CMS

Select: 
CMS Computing Infrastructure  Support

Use: 
Submit a new item
https://savannah.cern.ch/support/?group=cmscompinfrasup&func=additem
 

Authors

-- TerrenceMartin

Revision 82010/08/02 - Main.ChrisMurphy

Line: 1 to 1
 

Glidein Factory FAQ

Contents

Line: 37 to 37
  Each of these can be temporary; e.g., not matched can spike and then go down when many jobs are submitted at once. This is not a problem. When the above conditions persist, a problem is more likely.
Added:
>
>
A script has been written to analyze the Glidein Factory Status data. To learn more about it, click here.
 

Glidein Factory Status Now

Changed:
<
<
http://glidein-1.t2.ucsd.edu:8319/glidefactory/monitor/glidein_Production_v3_1/factoryStatusNow.html This page displays a table of live data which corresponds to the same data as shown in the plots under Glidein Factory Status. The information is further divided by VO. See GlideinFactoryStatusNow for a longer and more detailed discussion.
>
>
http://glidein-1.t2.ucsd.edu:8319/glidefactory/monitor/glidein_Production_v3_1/factoryStatusNow.html This page displays a table of live data which corresponds to the same data as shown in the plots under Glidein Factory Status. The information is further divided by VO. See GlideinFactoryStatusNow for a longer and more detailed discussion.
 

GFactory log directories

Revision 72010/07/16 - Main.IanMacNeill

Line: 1 to 1
 

Glidein Factory FAQ

Contents

Line: 16 to 16
 This will be sent to osg-gfactory-support@physics.ucsd.edu daily.

Monitoring webpages

Changed:
<
<
>
>

Glidein Factory Status

 http://glidein-1.t2.ucsd.edu:8319/glidefactory/monitor/glidein_Production_v3_1/factoryStatus.html

Load this, the default "Entry" is 'total' (it can also be per-site), and hit "update."

Line: 38 to 38
 Each of these can be temporary; e.g., not matched can spike and then go down when many jobs are submitted at once. This is not a problem. When the above conditions persist, a problem is more likely.

Glidein Factory Status Now

Changed:
<
<
GlideinFactoryStatusNow
>
>
http://glidein-1.t2.ucsd.edu:8319/glidefactory/monitor/glidein_Production_v3_1/factoryStatusNow.html This page displays a table of live data which corresponds to the same data as shown in the plots under Glidein Factory Status. The information is further divided by VO. See GlideinFactoryStatusNow for a longer and more detailed discussion.
 

GFactory log directories

Revision 62010/07/16 - Main.TerrenceMartin

Line: 1 to 1
 

Glidein Factory FAQ

Contents

Line: 36 to 36
 Glideins not matched (yellow) should not be very large (relative to glideins claimed or running).

Each of these can be temporary; e.g., not matched can spike and then go down when many jobs are submitted at once. This is not a problem. When the above conditions persist, a problem is more likely.

Added:
>
>

Glidein Factory Status Now

GlideinFactoryStatusNow

 

GFactory log directories

The factory daemons log files can be found in

Line: 88 to 93
 Nothing else in the factory should need to be backed up;
all the code should be in CVS or downloadable from an official repository.

If there are any experimental or in-development code pieces, those should use a separate backup policy.

Deleted:
<
<

  The factory also heavily relies on Condor, so basic Condor config files should be backed-up as well.
Unfortunately, the config files are split between three directories, so all three must be backed up
/opt/glidecondor/etc/

Revision 52010/07/14 - Main.IgorSfiligoi

Line: 1 to 1
 

Glidein Factory FAQ

Contents

Line: 73 to 73
 
  1. Identify the entries they can use (no obvious way just yet)
  2. If they need sites we don't support yet, add an entry for them
    Use the
    VO_blah
    naming convention, so we know who first requested the entry.
  3. For each entry, add the VO in the
    GLIDEIN_Supported_VOs
    attribute.
Added:
>
>

Areas needing backup

The glidein factory is mostly stateless... if we were to lose the disk used by it, we should be able to reconstruct the gfactory within hours by using a few config files.

The main configuration file is glideinWMS.xml. It defines almost everything else in a factory configuration.
To be on the safe side, one should however back up the whole factory directory tree... currently this is:

/var/gfactory/glideinsubmit/glidein_Production_v3_1/

Since there may be several factories installed on the same node, backing up the base directory is the easiest solution to not forget any of them:

/var/gfactory/glideinsubmit/

Please notice that the directories above contain symlinks to other areas in the file system;
none of those need to be backed-up, as they can be recreated if needed.
Moreover, while the base factory directory is relatively static and small (currently ~50M), the linked directories are very dynamic and can grow quite a bit.

Nothing else in the factory should need to be backed up;
all the code should be in CVS or downloadable from an official repository.

If there are any experimental or in-development code pieces, those should use a separate backup policy.

The factory also heavily relies on Condor, so basic Condor config files should be backed-up as well.
Unfortunately, the config files are split between three directories, so all three must be backed up

/opt/glidecondor/etc/
/opt/glidecondor/certs/
/etc/condor/

Condor also needs the host certificate to function;

/etc/grid-security 

should thus be backed-up, too.

Nothing else in Condor needs to be backed up, as it can easily be recreated using the glideinWMS installation script.

The same should apply to all other software components the factory is relying on.
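The backup areas listed above could be collected into one archive along the following lines. This is a minimal sketch, not an established factory procedure: the function name make_backup and the archive location are assumptions. It relies on Python's tarfile module storing symlinks as links rather than following them (its default behaviour), which matches the note that the linked areas do not need to be backed up.

```python
import os
import tarfile

# Directories named in this section as needing backup.
BACKUP_PATHS = [
    "/var/gfactory/glideinsubmit/",
    "/opt/glidecondor/etc/",
    "/opt/glidecondor/certs/",
    "/etc/condor/",
    "/etc/grid-security",
]


def make_backup(archive_path, paths=None):
    """Write a gzip-compressed tar archive of the given paths, skipping any that do not exist."""
    if paths is None:
        paths = BACKUP_PATHS
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in paths:
            if os.path.exists(path):
                # Symlinks are stored as links; their targets are not copied.
                tar.add(path)
    return archive_path
```

A call such as `make_backup("/tmp/gfactory-backup.tar.gz")` (hypothetical destination) would need to run as a user able to read all of the listed directories, most likely root.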

 

Authors

Changed:
<
<
-- TerrenceMartin - 2010/06/09
>
>
-- TerrenceMartin

-- IgorSfiligoi

Revision 42010/06/17 - Main.IgorSfiligoi

Line: 1 to 1
 

Glidein Factory FAQ

Contents

Line: 36 to 36
 Glideins not matched (yellow) should not be very large (relative to glideins claimed or running).

Each of these can be temporary; e.g., not matched can spike and then go down when many jobs are submitted at once. This is not a problem. When the above conditions persist, a problem is more likely.

Added:
>
>

GFactory log directories

The factory daemons log files can be found in

~/glideinsubmit/glidein_Production_v3_1/log/entry_<entry>/factory.<date>.*.log 

The list of all completed jobs can be found in

 ~/glideinsubmit/glidein_Production_v3_1/log/entry_CMS_T2_US_UCSD_gw2/completed_jobs_<date>.log

The glidein exit logs can be found in

~/glideinsubmit/glidein_Production_v3_1/client_log/user_<frontend>/entry_<entry>/job.*.out|err 

The Condor schedd job log can be found in

~/glideinsubmit/glidein_Production_v3_1/client_log/user_<frontend>/entry_<entry>/condor_activity_<date>_*.log 

Finally, the factory as a whole has its logs in

 ~/glideinsubmit/glidein_Production_v3_1/log/entry_<entry>/factory.<date>.*.log 

However, it is unlikely you need to look at that.
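For scripted log inspection, the path patterns above can be built and expanded with glob. The sketch below is illustrative only: the helper names are hypothetical, the date string format is not specified in this document and is passed through unchanged, and the frontend and entry names in the example are sample values.

```python
import glob
import os

# Factory instance directory used in the paths above.
FACTORY_DIR = os.path.expanduser("~/glideinsubmit/glidein_Production_v3_1")


def factory_log_pattern(entry, date, factory_dir=FACTORY_DIR):
    """Glob pattern for the factory daemon logs of one entry and date."""
    return "{0}/log/entry_{1}/factory.{2}.*.log".format(factory_dir, entry, date)


def glidein_output_pattern(frontend, entry, factory_dir=FACTORY_DIR):
    """Glob pattern for the glidein stdout files of one frontend and entry."""
    return "{0}/client_log/user_{1}/entry_{2}/job.*.out".format(
        factory_dir, frontend, entry)


# Example: list glidein stdout files for one frontend and entry (sample names).
for path in sorted(glob.glob(glidein_output_pattern("fecms", "CMS_T2_US_UCSD_gw2"))):
    print(path)
```

Replacing `.out` with `.err` in the second pattern gives the matching stderr files.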

 

Restarting the Glidein Factory after Reboot

  1. Start httpd (/etc/init.d as root)

Revision 32010/06/15 - Main.IgorSfiligoi

Line: 1 to 1
 

Glidein Factory FAQ

Contents

Line: 42 to 42
 
  1. Start Condor (/etc/init.d as root)
  2. Start gfactory (~/glideinsubmit/glidein_Production_v3_1/factory_startup start as gfactory)
Added:
>
>

Adding a new VO

For security purposes:

  1. (as root) add the vo user (e.g. fevo1)
    useradd fevo1
  2. (as root) add the new user and group into
    /etc/condor/privsep_config
    sections
    valid-target-uids and valid-target-gids
  3. (as root) Add the VO pilot DNs to
    /etc/grid-security/grid-mapfile
    Note: Only needed if using CREAM.
  4. (as root) Add VO to the Condor config (note: the UNIX user name and the Condor name may be different, but can be the same, as in this example)
    ~/glideinWMS/install# ./glidecondor_addDN -daemon "VO1 Frontend DN" "<VO1 DN>" fevo1
    Reconfig condor
    /opt/glidecondor/sbin/condor_reconfig -collector
  5. (as gfactory) Add the VO to the gfactory config
    ~/glideinsubmit/glidein_Production_v3_1.cfg/glideinWMS.xml
    Example change
    <frontend name="vo1-glidein" identity="fevo1@glidein-1.t2.ucsd.edu">
    <security_classes>
    <security_class name="frontend" username="fevo1"/>
    </security_classes>
    </frontend>

    Reconfig factory
    ~/glideinsubmit/glidein_Production_v3_1$ ./factory_startup reconfig ../glidein_Production_v3_1.cfg/glideinWMS.xml
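
The frontend stanza from step 5 follows a fixed pattern, so it can be generated rather than typed by hand. The sketch below assumes the structure shown in the example change; frontend_stanza is a hypothetical helper name, and the output still has to be pasted into glideinWMS.xml manually before running the reconfig.

```shell
#!/bin/sh
# frontend_stanza FRONTEND_NAME VO_USER FACTORY_HOST
# Hypothetical helper: prints a <frontend> block shaped like the example
# change above. It does not edit glideinWMS.xml.
frontend_stanza() {
    name="$1"
    user="$2"
    host="$3"
    cat <<EOF
<frontend name="$name" identity="$user@$host">
<security_classes>
<security_class name="frontend" username="$user"/>
</security_classes>
</frontend>
EOF
}
```

Calling `frontend_stanza vo1-glidein fevo1 glidein-1.t2.ucsd.edu` reproduces the example change shown in step 5.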

For resource selection purposes:

  1. Identify the entries they can use (no obvious way just yet)
  2. If they need sites we don't support yet, add an entry for them
    Use the
    VO_blah
    naming convention, so we know who first requested the entry.
  3. For each entry, add the VO to the
    GLIDEIN_Supported_VOs
    attribute.
 

Authors

-- TerrenceMartin - 2010/06/09

Revision 22010/06/09 - Main.WarrenAndrews

Line: 1 to 1
 

Glidein Factory FAQ

Contents

Line: 9 to 9
 

Running the site status report

Added:
>
>
 gfactory@glidein-1 ~$ glideinWMS/factory/tools/analyze_entries -o ~/logae/

The "-o" option is optional and specifies where the output should go; the default is ~.

This will be sent to osg-gfactory-support@physics.ucsd.edu daily.

Monitoring webpages

http://glidein-1.t2.ucsd.edu:8319/glidefactory/monitor/glidein_Production_v3_1/factoryStatus.html

Load this page; the default "Entry" is 'total' (it can also be per-site). Then hit "update".

At the simplest level, there are four graphs to look at which are all displayed on top of one another on this page:

  • Running glidein jobs (green solid, on by default)
  • Glideins at collector (black line, not on by default)
  • Glideins claimed by user jobs (purple line, on by default)
  • Glideins not matched (yellow line, on by default)

What to look for with these:

Glideins claimed (purple) should not be much lower than the green envelope.

Glideins at collector (black) should also not be much lower than the green envelope.

Glideins not matched (yellow) should not be very large (relative to glideins claimed or running).

Each of these can be temporary, i.e., the not-matched count can spike and then drop when many jobs are submitted at once. This is not a problem. A problem is more likely when the above conditions persist.
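
The rules of thumb above can be written down as a quick check. The page gives no numeric thresholds, so the 0.8 and 0.2 values below are assumptions chosen only for illustration, and check_glideins is a hypothetical helper that takes the four values read off the graphs by hand.

```shell
#!/bin/sh
# check_glideins RUNNING AT_COLLECTOR CLAIMED NOT_MATCHED
# Hypothetical helper: applies the three rules of thumb to one snapshot.
# The thresholds are assumed values, not taken from the factory tools.
check_glideins() {
    awk -v running="$1" -v collector="$2" -v claimed="$3" -v unmatched="$4" '
    BEGIN {
        min_ratio = 0.8      # assumed reading of "not much lower than"
        max_unmatched = 0.2  # assumed reading of "not very large"
        if (running <= 0) { print "no running glideins"; exit }
        status = "ok"
        if (claimed < min_ratio * running) {
            print "WARN: claimed glideins well below running"; status = "warn"
        }
        if (collector < min_ratio * running) {
            print "WARN: glideins at collector well below running"; status = "warn"
        }
        if (unmatched > max_unmatched * running) {
            print "WARN: many glideins not matched"; status = "warn"
        }
        if (status == "ok") print "ok"
    }'
}
```

A single warning is not conclusive, since spikes are expected; the check is only meaningful when the same warning shows up across repeated snapshots.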

 

Restarting the Glidein Factory after Reboot

  1. Start httpd (/etc/init.d as root)
  2. Start Condor (/etc/init.d as root)
Changed:
<
<
  1. Start gfactory (~/glideinsubmit/glidein_Productio3_1 as gfactory)
>
>
  1. Start gfactory (~/glideinsubmit/glidein_Production_v3_1/factory_startup start as gfactory)
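
The restart order can be kept as a small dry-run helper that prints the commands instead of running them. The init script names httpd and condor are assumptions (the steps above only say /etc/init.d), and restart_plan is a hypothetical function name.

```shell
#!/bin/sh
# restart_plan [FACTORY_DIR]
# Hypothetical dry-run helper: prints the restart commands in order.
# The init script names "httpd" and "condor" are assumed, not confirmed.
restart_plan() {
    factory_dir="${1:-$HOME/glideinsubmit/glidein_Production_v3_1}"
    echo "/etc/init.d/httpd start    # as root"
    echo "/etc/init.d/condor start   # as root"
    echo "$factory_dir/factory_startup start   # as gfactory"
}
```

Printing rather than executing keeps the root and gfactory steps separate, so each line can be run by the right user.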
 

Authors

Revision 12010/06/09 - Main.TerrenceMartin

Line: 1 to 1
Added:
>
>

Glidein Factory FAQ

Contents

Adding a New Site to Glidein Factory

Running the site status report

Restarting the Glidein Factory after Reboot

  1. Start httpd (/etc/init.d as root)
  2. Start Condor (/etc/init.d as root)
  3. Start gfactory (~/glideinsubmit/glidein_Productio3_1 as gfactory)

Authors

-- TerrenceMartin - 2010/06/09

 