Glidein Factory FAQ
Variables Used in this Document
Here is a list of variables used in this document as shorthand for common paths used in factory operations:
Variable | Value | Description |
GLIDEIN_FACTORY_DIR | /home/gfactory/glideinsubmit/glidein_v2_0 | current factory instance directory |
GLIDEIN_SRC_DIR | /home/gfactory/glideinwms | glideinWMS source code directory |
GLIDEIN_FACTOOLS | /home/gfactory/factools | factools repo location |
Assumed gfactory Path Setup
This document assumes the
gfactory user has the following set in the $PATH:
- $GLIDEIN_SRC_DIR/factory/tools/
- $GLIDEIN_FACTOOLS/generic/bin/
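A minimal sketch of how these settings might look in the gfactory user's ~/.bash_profile, assuming the default values from the table above. Adjust the paths if your installation differs.

```shell
# Shorthand variables from the table above (values are this document's defaults).
export GLIDEIN_FACTORY_DIR=/home/gfactory/glideinsubmit/glidein_v2_0
export GLIDEIN_SRC_DIR=/home/gfactory/glideinwms
export GLIDEIN_FACTOOLS=/home/gfactory/factools

# Put the factory tools and the factools scripts ahead of the existing PATH entries.
export PATH="$GLIDEIN_SRC_DIR/factory/tools:$GLIDEIN_FACTOOLS/generic/bin:$PATH"
```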
Basic Procedures
Reconfiguring Factory
- Edit
/etc/gwms-factory/glideinWMS.xml
- Stop the factory, reconfigure, and then restart:
service gwms-factory stop
service gwms-factory reconfig
service gwms-factory start
Restarting Factory after Reboot
- As root start httpd:
/etc/init.d/httpd start
- As root start condor:
/etc/init.d/condor start
- Run top and watch the load. Only proceed once the load average has dropped considerably and the idle CPU percentage (%id) is comfortably above 0%
- As gfactory start the factory:
$GLIDEIN_FACTORY_DIR/factory_startup start
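The "watch the load" step can also be scripted. The sketch below is only an illustration: the function names are made up for this example, it reads the Linux-specific /proc/loadavg, and the threshold you pass in is a site-specific judgment call rather than a recommended value.

```shell
# load_is_below THRESHOLD [LOADAVG_FILE]
# Succeeds when the 1-minute load average (first field) is below THRESHOLD.
load_is_below() {
    awk -v max="$1" '{ exit !($1 < max) }' "${2:-/proc/loadavg}"
}

# wait_for_load THRESHOLD [INTERVAL_SECONDS]
# Polls until the load drops below THRESHOLD (default interval: 30 seconds).
wait_for_load() {
    until load_is_below "$1"; do
        sleep "${2:-30}"
    done
}
```

For example, `wait_for_load 5 && $GLIDEIN_FACTORY_DIR/factory_startup start` would start the factory once the load is below 5, where 5 is an arbitrary illustrative threshold.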
Site Debugging Procedures
Putting Entries in Temporary Downtime
./factory_startup down -entry entry_name -comment 'comment on why it is down'
You can optionally put an end time with the option
-end [[[YYYY-]MM-]DD-]HH:MM[:SS]
Examples:
-end 07:00
-end 05-19-07:00
-end 2014-05-19-07:00
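If you script downtimes, it can help to check the end time before passing it to factory_startup. The helper below is a hypothetical sketch (not part of glideinWMS) that only checks the shape [[[YYYY-]MM-]DD-]HH:MM[:SS], not whether the digits form a real date.

```shell
# valid_end_time STRING
# Succeeds when STRING has the form [[[YYYY-]MM-]DD-]HH:MM[:SS].
valid_end_time() {
    printf '%s\n' "$1" |
        grep -Eq '^((([0-9]{4}-)?[0-9]{2}-)?[0-9]{2}-)?[0-9]{2}:[0-9]{2}(:[0-9]{2})?$'
}
```

All three examples above pass this check, while a date without a time, such as 2014-05-19, does not.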
Maintenance
Installing Factory from RPMs
Click here for instructions on how to install a Factory from scratch using the OSG RPMs.
Installing factools
Repo
The
factools
repo can be found at:
https://github.com/jdost321/factools
In the
gfactory
user home directory, run:
git clone git://github.com/jdost321/factools.git
Refer to
factools/README
for how to set up the environment to enable
factools
usage.
Adding a New Site to Glidein Factory
Click here for instructions on how to add a site for VOs to use.
Shared Factory Config
Click here for information about the shared factory config
Entry Templates
NOTE this section is likely obsolete
CMS cream:
<entry name="" comment="" enabled="True" gatekeeper="https://%HOSTNAME%:8443/ce-cream/services/CREAM2 %BATCH% %QUEUE%" gridtype="cream" verbosity="std" work_dir="TMPDIR">
<config>
<max_jobs held="25" idle="400" running="10000">
<max_job_frontends>
</max_job_frontends>
</max_jobs>
<release max_per_cycle="20" sleep="0.2"/>
<remove max_per_cycle="5" sleep="0.2"/>
<restrictions require_voms_proxy="False"/>
<submit cluster_size="10" max_per_cycle="100" sleep="0.2"/>
</config>
<downtimes/>
<allow_frontends>
</allow_frontends>
<attrs>
<attr name="CONDOR_OS" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="False" type="string" value="default"/>
<attr name="GLEXEC_BIN" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="True" type="string" value="NONE"/>
<attr name="GLIDEIN_CMSSite" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value=""/>
<attr name="GLIDEIN_Max_Walltime" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="True" type="int" value="114840"/>
<attr name="GLIDEIN_ResourceName" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value=""/>
<attr name="GLIDEIN_SEs" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value=""/>
<attr name="GLIDEIN_Site" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value=""/>
<attr name="GLIDEIN_Supported_VOs" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="True" type="string" value="CMS"/>
<attr name="USE_CCB" const="True" glidein_publish="True" job_publish="False" parameter="True" publish="True" type="string" value="True"/>
</attrs>
<files>
</files>
<infosys_refs>
<infosys_ref ref="GlueCEUniqueID=" server="exp-bdii.cern.ch" type="BDII"/>
</infosys_refs>
<monitorgroups>
<monitorgroup group_name="CMST2"/>
<monitorgroup group_name="CMS"/>
</monitorgroups>
</entry>
Cloning Factories
Below are examples of doing a global clone from UCSD to GOC and CERN factories. You can record your clones in the
Factory Cloning Log.
DISCLAIMER The examples are subject to change due to the constantly evolving nature of our config files. They are current as of 2015-01-13.
Description of clone_glidein Arguments
- -merge yes/no/only
- yes - modify existing entries in addition to adding new ones
- no - only add new entries (default)
- only - only merge existing; don't add new entries
- -preserve_enable - when merging don't disable sites that are still enabled in original config
- -disable_old - if a site is in the original config but no longer in the "other" config, disable it
Temporary Fix for v3_2_5 -> v3_2_3
NOTE until all factories are v3_2_5, you will see errors like:
Unexpected error occurred loading the configuration file.
Unknown parameter glidein.entries.ATLAS_US_Michigan_gate01.config.submit.submit_attrs
To avoid this, before running the clone tool, remove the new v3_2_5 attributes:
grep -v submit_attrs glideinWMS.xml.ucsd > glideinWMS.xml.ucsd2
Then proceed to clone normally using
glideinWMS.xml.ucsd2
instead.
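Because grep -v drops whole lines, it is worth checking how much was removed. The wrapper below is a hypothetical sketch that applies the same filter and reports the number of dropped lines. It assumes each submit_attrs tag sits on a line of its own; if one shares a line with other elements, that whole line is lost and the result should be inspected by hand.

```shell
# strip_submit_attrs IN OUT
# Same filter as above, plus a report of how many lines were dropped.
strip_submit_attrs() {
    grep -v submit_attrs "$1" > "$2"
    dropped=$(( $(wc -l < "$1") - $(wc -l < "$2") ))
    echo "dropped $dropped line(s) containing submit_attrs"
}
```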
Cloning UCSD -> GOC
These instructions assume you have copied the UCSD config to the respective factory and named it glideinWMS.xml.ucsd.
-
clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test -merge yes glideinWMS.xml
- When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test and reconfig
Cloning GOC -> UCSD is done in the exact same way so it is not shown here.
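The "when satisfied, replace" step applies to every clone in this section. One way to do it while keeping a fallback copy is sketched below; the function name and the .bak suffix are choices made for this example only.

```shell
# promote_config NEW CURRENT
# Keeps a copy of CURRENT as CURRENT.bak, then moves NEW into place.
# Refuses to act if NEW is missing or empty.
promote_config() {
    if [ ! -s "$1" ]; then
        echo "refusing to promote: $1 is missing or empty" >&2
        return 1
    fi
    cp -p "$2" "$2.bak" && mv "$1" "$2"
}
```

A typical sequence would be `diff -u glideinWMS.xml glideinWMS.xml.test` to review the changes, then `promote_config glideinWMS.xml.test glideinWMS.xml`, followed by the reconfig described under Reconfiguring Factory.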
Cloning UCSD -> CERN
These instructions assume you have copied the UCSD config to the respective factory and named it glideinWMS.xml.ucsd.
- Use include and exclude constraints to only add regular CMS sites
clone_glidein -other glideinWMS.xml.ucsd -out glideinWMS.xml.test -merge yes -include GLIDEIN_Supported_VOs CMS -exclude GLIDEIN_Supported_VOs CMSOverflow glideinWMS.xml
- When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test and reconfig
Cloning GOC -> CERN is done in the exact same way so it is not shown here.
Cloning CERN -> UCSD
These instructions assume you have copied the CERN config to the respective factory and named it glideinWMS.xml.cern.
- Exclude the cloud resources:
clone_glidein -other glideinWMS.xml.cern -out glideinWMS.xml.test -exclude name CMS_T1_TW_ASGC_AI -exclude name CMS_T2_CH_CERN_AI -exclude name CMS_T2_CH_CERN_HLT -merge yes glideinWMS.xml
- When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test and reconfig
Cloning CERN -> GOC is done in the exact same way so it is not shown here.
Cloning UCSD -> GOC-ITB
NOTE while we test glexec sites take care to not accidentally disable glexec on GOC-ITB entries. The following instructions don't account for this. Please diff the resulting XML and make any additional corrections necessary by hand.
These instructions assume you have copied the UCSD config to the respective factory and named it glideinWMS.xml.ucsd.
- Append _ITB to OSGVO and associated names:
sed -e 's/OSGVO\([,"]\)/OSGVO_ITB\1/g' -e 's/OSGVOHTPC/OSGVOHTPC_ITB/g' -e 's/OSGVOBigMem/OSGVOBigMem_ITB/g' -e 's/OSGVO_MULTICORE/OSGVO_MULTICORE_ITB/g' glideinWMS.xml.ucsd > glideinWMS.xml.ucsd2
-
clone_glidein -other glideinWMS.xml.ucsd2 -out glideinWMS.xml.test -merge yes glideinWMS.xml
- When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test and reconfig
Cloning GOC-ITB -> UCSD
NOTE while we test glexec sites take care to not accidentally enable glexec on UCSD entries that haven't been tested. The following instructions don't account for this. Please diff the resulting XML and make any additional corrections necessary by hand.
These instructions assume you have copied the GOC-ITB config to the respective factory and named it glideinWMS.xml.itb.
- Remove _ITB from OSGVO_ITB and associated names:
sed -e 's/OSGVO_ITB/OSGVO/g' -e 's/OSGVOHTPC_ITB/OSGVOHTPC/g' -e 's/OSGVOBigMem_ITB/OSGVOBigMem/g' -e 's/OSGVO_MULTICORE_ITB/OSGVO_MULTICORE/g' glideinWMS.xml.itb > glideinWMS.xml.itb2
-
clone_glidein -other glideinWMS.xml.itb2 -out glideinWMS.xml.test -merge yes glideinWMS.xml
- When satisfied, replace current glideinWMS.xml with glideinWMS.xml.test and reconfig
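To see what the two renaming commands do before running them on a real config, they can be wrapped as stdin filters and tried on a sample string. The function names and the sample value below are made up for illustration; the sed expressions are the ones used in the two ITB sections above.

```shell
# to_itb: UCSD -> GOC-ITB renaming, reading config text on stdin.
to_itb() {
    sed -e 's/OSGVO\([,"]\)/OSGVO_ITB\1/g' \
        -e 's/OSGVOHTPC/OSGVOHTPC_ITB/g' \
        -e 's/OSGVOBigMem/OSGVOBigMem_ITB/g' \
        -e 's/OSGVO_MULTICORE/OSGVO_MULTICORE_ITB/g'
}

# from_itb: GOC-ITB -> UCSD renaming, the reverse direction.
from_itb() {
    sed -e 's/OSGVO_ITB/OSGVO/g' \
        -e 's/OSGVOHTPC_ITB/OSGVOHTPC/g' \
        -e 's/OSGVOBigMem_ITB/OSGVOBigMem/g' \
        -e 's/OSGVO_MULTICORE_ITB/OSGVO_MULTICORE/g'
}

# Hypothetical attribute value listing all four names.
sample='value="OSGVO,OSGVOHTPC,OSGVOBigMem,OSGVO_MULTICORE"'
echo "$sample" | to_itb            # every name gains the _ITB suffix
echo "$sample" | to_itb | from_itb # round trip returns the original text
```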
Installing Factory Condor from scratch
Please see instructions under
Installing Factory Condor from Scratch
Upgrading Factory Condor
Factories with an RPM install:
The following commands need to be run as the
root user:
- Stop glideinWMS:
service gwms-factory stop
- Stop Condor:
service condor stop
- Update condor using yum. Note: for the ITB factory, you'll likely want to use the osg-development repo instead of osg.
yum update --enablerepo epel --enablerepo osg condor condor-classads condor-cream-gahp condor-procd
- Start Condor:
service condor start
- Wait for the appropriate amount of time, then start glideinWMS:
service gwms-factory start
Factories using a non-RPM install:
Go to the condor website and download the tarball as
root user:
http://research.cs.wisc.edu/htcondor/downloads/
As of 2014-01-09:
For the UCSD factory we currently use
condor-rel-x86_64_RedHat6-stripped.tar.gz
For CERN we use
condor-rel-x86_64_RedHat5-stripped.tar.gz
For the other factories we use
condor-rel-x86_RedHat5-stripped.tar.gz
cd /root/Downloads
wget http://parrot.cs.wisc.edu//symlink/tmp_path_to_tarball/condor-rel-x86_RedHat5-stripped.tar.gz
As
gfactory stop Factory:
cd $GLIDEIN_FACTORY_DIR
./factory_startup stop
The next commands need to be run as the
root user.
- Stop Condor:
/etc/init.d/condor stop
- Run upgrade script:
/root/glideinwms/install/glidecondor_upgrade condor-rel-x86_RedHat5-stripped.tar.gz
- Start Condor with init.d script:
/etc/init.d/condor start
- Run top and watch the load. Only proceed once the load average has dropped considerably and the idle CPU percentage (%id) is comfortably above 0%
As
gfactory start the factory:
./factory_startup start
Upgrading Glidein Condor
Click here for instructions on how to upgrade the Condor tarballs for glideins to use.
Upgrading GlideinWMS in Place
Factories with RPM installs:
Note: The following needs to be done as
root
- Shut down the factory
service gwms-factory stop
- Update the factory packages using yum. Note: The ITB factory may need the osg-development repo instead of osg
yum update --enablerepo epel --enablerepo osg glideinwms-factory glideinwms-factory-condor
- Follow the instructions to rebuild all of the glidein condor tarballs and update the factory config with the new ones as outlined in the Upgrading Glidein Condor section.
- Upgrade the factory
service gwms-factory upgrade
- Start the factory back up
service gwms-factory start
Factories with non-RPM installs:
This should be the standard method to upgrade GlideinWMS. If significant changes have been made to the code base, it is preferable to create a new factory instance instead, following these instructions:
Upgrading GlideinWMS with New Instance
Back Up Old GlideinWMS (Optional)
Check if there are any manually applied patches:
cd $GLIDEIN_SRC_DIR
git status
If there are and they are worth saving, it is easiest to just backup the whole git repo:
cd ..
rsync -av glideinWMS/ glideinWMS-old
old should signify the glideinWMS version number you are backing up.
Upgrade Procedure
- Shut down Factory:
cd $GLIDEIN_FACTORY_DIR
./factory_startup stop
- Check to make sure there are no running factory python processes:
ps -u gfactory
- Fetch latest code and checkout new where new is the desired tag or branch name:
cd $GLIDEIN_SRC_DIR
git fetch
git checkout new
- Now you must build all glidein condor tarballs using create_condor_tarball in the new Prestage directory, and update the new config file accordingly at this point, as outlined in the Upgrading Glidein Condor section.
- Run upgrade and supply full absolute path of config file:
cd $GLIDEIN_FACTORY_DIR
./factory_startup upgrade ${GLIDEIN_FACTORY_DIR}.cfg/glideinWMS.xml
- Restart Factory:
./factory_startup start
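The process check in the steps above can be reduced to a single number. This is a rough sketch: it assumes the factory processes show glideFactory in their command line (the script names glideFactory.py and glideFactoryEntry.py appear in the traceback under FD Limits), which may not hold for every glideinWMS version. The function name is hypothetical.

```shell
# count_factory_procs
# Reads "PID ARGS" lines on stdin and prints how many mention glideFactory
# (this also matches glideFactoryEntry.py).
count_factory_procs() {
    # grep -c exits non-zero when the count is 0, so mask the exit status.
    grep -c 'glideFactory' || true
}
```

Usage: `ps -u gfactory -o pid= -o args= | count_factory_procs` should print 0 before you continue with the upgrade.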
Upgrading GlideinWMS with New Instance
These instructions should only be followed if significant changes have been made to the code base. Otherwise, it is preferable to upgrade in place with the following instructions:
Upgrading GlideinWMS in Place
Back Up Old GlideinWMS (Optional)
Check if there are any manually applied patches:
cd $GLIDEIN_SRC_DIR
git status
If there are and they are worth saving, it is easiest to just backup the whole git repo:
cd ..
rsync -av glideinWMS/ glideinWMS-old
old should signify the glideinWMS version number you are backing up.
Upgrade Procedure
- Create new instance directory and copy over config file:
cd $GLIDEIN_FACTORY_DIR/..
mkdir glidein_new.cfg
cp glidein_old.cfg/glideinWMS.xml glidein_new.cfg/
- Copy over any validation scripts or wrappers used in previous instance:
cp glidein_old.cfg/*.sh glidein_new.cfg/
cp glidein_old.cfg/*.source glidein_new.cfg/
- Create new Prestage dir for tarballs:
mkdir glidein_new.cfg/Prestage
- Edit
glidein_new.cfg/glideinWMS.xml
to use new name: glidein_name="new"
- Also replace any references of
glidein_old.cfg
with glidein_new.cfg
in the new config file. You may also want to remove all disabled (enabled="False") entries. You can do this with:
sed '/enabled="False"/,/<\/entry>/d' -i glidein_new.cfg/glideinWMS.xml
- Shut down Factory:
cd $GLIDEIN_FACTORY_DIR
./factory_startup stop
- Check to make sure there are no running factory python processes:
ps -u gfactory
- Fetch latest code and checkout new where new is the desired tag or branch name:
cd $GLIDEIN_SRC_DIR
git fetch
git checkout new
- Now you must build all glidein condor tarballs using create_condor_tarball in the new Prestage directory, and update the new config file accordingly at this point, as outlined in the Upgrading Glidein Condor section.
- Create new factory instance:
cd $GLIDEIN_FACTORY_DIR/..
$GLIDEIN_SRC_DIR/creation/create_glidein glidein_new.cfg/glideinWMS.xml
- Copy the old downtimes file to newly created instance dir:
cp glidein_old/glideinWMS.downtimes glidein_new/
- Change into the newly created
glidein_new
directory and start up the new instance.
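The sed command that removes disabled entries in the steps above edits the config in place. To preview its effect first, the same range delete can be run as a stdin filter, as in this sketch (the function name is hypothetical). It assumes, as in the entry template earlier in this document, that each opening entry tag and its closing tag are on separate lines.

```shell
# drop_disabled_entries
# Deletes every block from a line containing enabled="False" through the
# next line containing </entry>, reading stdin and writing stdout.
drop_disabled_entries() {
    sed '/enabled="False"/,/<\/entry>/d'
}
```

For example, `drop_disabled_entries < glidein_new.cfg/glideinWMS.xml | diff glidein_new.cfg/glideinWMS.xml -` shows exactly which lines would be removed.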
Post Upgrade Actions
Edit
gfactory
user
.bash_profile
:
export GLIDEIN_FACTORY_DIR=/home/gfactory/glideinsubmit/glidein_new
At UCSD and GOC as
root
, update the osg_gfactory monitoring symlink:
cd /var/www/html
rm osg_gfactory
ln -s glidefactory/monitor/glidein_new osg_gfactory
Upgrading GlideinWMS v2_7 Gotchas
GlideinWMS v2_7 has significant changes so it is best to follow the above:
Upgrading GlideinWMS with New Instance
The source code directory must be renamed to all lowercase
glideinwms
. A good time to do this is in step 8 of the above instructions before the git fetch.
The
gfactory
user's
.bash_profile
will have to be modified:
export GLIDEIN_SRC_DIR=/path/to/src/glideinwms
An additional mod needs to be added to the
.bash_profile
to get around the
analyze_entries
bug:
export GLIDEIN_MON_URL=$GLIDEIN_FACTORY_DIR
Finally, factools will have to be switched to the special compat branch:
cd $GLIDEIN_FACTOOLS
git checkout dev_2_7_compat
Areas needing backup
The glidein factory is mostly stateless: if we were to lose the disk it uses, we should be able to reconstruct the factory within hours from a few config files.
The main configuration file is
glideinWMS.xml
. It defines almost everything else in a factory configuration.
To be on the safe side, however, one should back up the whole factory directory tree. Currently this is:
/var/gfactory/glideinsubmit/glidein_v2_0/
Since there may be several factories installed on the same node, backing up the base directory is the easiest solution to not forget any of them:
/var/gfactory/glideinsubmit/
Please notice that the directories above contain symlinks to other areas in the file system;
none of those need to be backed-up, as they can be recreated if needed.
Moreover, while the base factory directory is relatively static and small (currently ~50M), the linked directories are very dynamic and can grow quite a bit.
Nothing else in the factory should need to be backed up;
all the code should be in Git or downloadable from an official repository.
If there are any experimental or in-development code pieces, those should use a separate backup policy.
The factory also heavily relies on Condor, so basic Condor config files should be backed-up as well.
Unfortunately, the config files are split between three directories, so all three must be backed up:
/opt/glidecondor/etc/
/opt/glidecondor/certs/
/etc/condor/
Condor also needs the host certificate to function;
/etc/grid-security
should thus be backed-up, too.
Nothing else in Condor needs to be backed up, as it can easily be recreated using the glideinWMS installation script.
The same should apply to all other software components the factory is relying on.
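A minimal sketch of a backup covering the areas listed above. The function name and the archive location in the usage example are assumptions. Since tar stores symlinks as links by default (no -h option), the large linked directories are not pulled into the archive.

```shell
# backup_factory ARCHIVE DIR...
# Archives the given directories into a gzip-compressed tarball.
# Symlinks are stored as links, not followed.
backup_factory() {
    archive="$1"
    shift
    tar -czf "$archive" "$@"
}
```

For example, as root: `backup_factory /backup/gfactory-$(date +%F).tar.gz /var/gfactory/glideinsubmit /opt/glidecondor/etc /opt/glidecondor/certs /etc/condor /etc/grid-security`, where /backup is a hypothetical destination.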
Dealing with Scalability limits
Number of Process Limits
In RHEL 6 the default number of processes per user is conservatively low, set to 1024. This will likely affect factory performance at full scale. Add
/etc/security/limits.d/91-userlimits.conf
:
# we need many processes, for Condor
* soft nproc 128297
FD Limits
NOTE This may no longer be an issue, as glideinWMS has significantly reduced the number of needed FDs in the factory code
The factory's open file usage scales with the number of entries in the config. Eventually the gfactory user's max open file limit will be hit. This can be seen in ~/glideinsubmit/glidein_Production_v2_0/log/factory/factory.*.info.log:
[2012-05-11T12:44:02-07:00 6730] WARNING: Exception occurred: ['Traceback (most recent call last):\n', ' File "/home/gfactory/glideinWMS/factory/glideFactory.py", line 432, in main\n glideinDescript,entries,restart_attempts,restart_interval)\n', ' File "/home/gfactory/glideinWMS/factory/glideFactory.py", line 213, in spawn\n childs[entry_name]=popen2.Popen3("%s %s %s %s %s %s %s"%(sys.executable,os.path.join(STARTUP_DIR,"glideFactoryEntry.py"),os.getpid(),sleep_time,advertize_rate,startup_dir,entry_name),True)\n', ' File "/usr/lib64/python2.4/popen2.py", line 43, in __init__\n c2pread, c2pwrite = os.pipe()\n', 'OSError: [Errno 24] Too many open files\n']
To deal with this, increase ulimits. Right now we have it at 50k for gfactory user. In ~/.bash_profile:
ulimit -n 50240
in /etc/security/limits.conf:
gfactory hard nofile 50240
After changing these settings, log out and back in as gfactory, then stop and restart the factory.
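After logging back in, you can confirm that the new limit is in effect. The helper below is a hypothetical sketch that compares the value printed by ulimit against a required minimum and treats "unlimited" as sufficient.

```shell
# limit_at_least CURRENT REQUIRED
# Succeeds when CURRENT (as printed by ulimit) is "unlimited" or >= REQUIRED.
limit_at_least() {
    [ "$1" = "unlimited" ] || [ "$1" -ge "$2" ]
}
```

Usage as gfactory: `limit_at_least "$(ulimit -n)" 50240 || echo "open file limit too low" >&2`.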
Factory Specific Notes
Factory Software and Patches
UCSD
GOC
Date | Software | Type | Description |
2014-06-10 | gwms | v3_2_5 | |
2014-06-10 | gwms | git cherry-pick 427f074b3dbccd2a4c997211e3a9a62e2e377d58 | Add HTCondorCE RSL support |
2014-06-10 | gwms | git cherry-pick e02560caa1478a64464f881aa062d4cd75a9885c | Add condor_chirp to tarballs |
2014-02-11 | condor | 8.0.5 | |
CERN 0305
CERN 32
2014-06-10 | gwms | v3_2_5 | |
2014-06-10 | gwms | git cherry-pick 664c1daf5d651369991de1d5e33b5c6538c0c5f4 | Add autoupdate to monitoring page |
2014-06-10 | gwms | git cherry-pick 427f074b3dbccd2a4c997211e3a9a62e2e377d58 | Add HTCondorCE RSL support |
2014-06-10 | gwms | git cherry-pick e02560caa1478a64464f881aa062d4cd75a9885c | Add condor_chirp to tarballs |
2014-02-11 | condor | 8.0.5 | |
GOC-ITB
GOC Factory Things to Remember
Production Factory: glidein.grid.iu.edu
ITB Factory: glidein-itb.grid.iu.edu
Turn off timeout for sudo
run:
/usr/sbin/visudo
add the following:
Defaults timestamp_timeout = 0
Firewall settings
For condor we give a port range of 20k-50k. See the
/etc/iptables.d
files for details. Also the condor config must know about it:
###################
# Firewall limits
###################
HIGHPORT=50000
LOWPORT=20000
Frontend Support
Adding a New Frontend
Click here for instructions on how to register a new Frontend to the Factory.
How To Open A Ticket To Contact Glidein Factory Support
NOTE This procedure is likely obsolete and needs to be verified with GOC
Although we encourage users to contact us directly at
osg-gfactory-support@physics.ucsd.edu, a ticket may be opened should the user deem it appropriate.
Until GOC institutes a custom form for the Glidein Factory, begin by visiting
https://ticket.grid.iu.edu/goc/other and load your certificate. Please select the VO on whose behalf you are submitting the ticket. Then under "Add CC" add
osg-gfactory-support@physics.ucsd.edu. Finally, type a message to describe the problem and hit submit.
Monitoring Reference
Glidein Factory Status
http://glidein-1.t2.ucsd.edu:8319/osg_gfactory/factoryStatus.html
Load this page, leave "Entry" at its default of 'total' (or select a single site), and hit "update."
At the simplest level, there are four graphs to look at which are all displayed on top of one another on this page:
- Running glidein jobs (green solid, on by default)
- Glideins at collector (black line, not on by default)
- Glideins claimed by user jobs (purple line, on by default)
- Glideins not matched (yellow line, on by default)
What to look for with these:
Glideins claimed (purple) should not be much lower than the green envelope.
Glideins at collector (black) should also not be much lower than the green envelope.
Glideins not matched (yellow) should not be very large (relative to glideins claimed or running).
Each of these can be temporary, i.e., not matched can spike and then go down when many jobs are submitted at once. This is not a problem. When the above conditions persist, a problem is more likely.
Glidein Factory Status Now
http://glidein-1.t2.ucsd.edu:8319/osg_gfactory/factoryStatusNow.html
This page displays a table of live data which corresponds to the same data as shown in the plots under Glidein Factory Status. The information is further divided by VO. See
GlideinFactoryStatusNow (
NOTE needs updating) for a longer and more detailed discussion.
Log Reference
Glidein output logs:
$GLIDEIN_FACTORY_DIR/client_log/user_*/entry_*/job.*.out
$GLIDEIN_FACTORY_DIR/client_log/user_*/entry_*/job.*.err
Glidein user logs:
$GLIDEIN_FACTORY_DIR/client_log/user_*/entry_*/condor_activity_*.log
Condor daemon logs:
/opt/glidecondor/condor_local/log/*Log
NOTE On GOC machines:
/usr/local/glidecondor/condor_local/log/*Log
Condor gridmanager logs:
/dev/shm/GridmanagerLog.schedd_glideins*
NOTE On GOC machines:
/tmp/GridmanagerLog.schedd_glideins*
Factory daemon logs:
$GLIDEIN_FACTORY_DIR/log/factory/factory.*.log
$GLIDEIN_FACTORY_DIR/log/entry_*/factory.*.log
Completed glidein logs:
$GLIDEIN_FACTORY_DIR/log/entry_*/completed_jobs_*.log
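When debugging a single entry, it is often quicker to list only the recent glidein error logs than to browse the whole tree. The sketch below assumes the client_log layout shown above. The function name is hypothetical, and -mmin is a GNU/BSD find extension rather than POSIX.

```shell
# recent_err_logs FACTORY_DIR ENTRY [MINUTES]
# Lists glidein .err logs for one entry modified within the last MINUTES
# (default 60), across all frontend users.
recent_err_logs() {
    find "$1/client_log" -type f -path "*/entry_$2/job.*.err" -mmin "-${3:-60}"
}
```

For example, `recent_err_logs "$GLIDEIN_FACTORY_DIR" CMS_T2_US_UCSD_gw2 120` lists the .err files written for that entry in the last two hours.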
Tool Reference
Running analyze_entries
status report
cd $GLIDEIN_FACTORY_DIR
analyze_entries -x 24 -s waste
Run command with -h to print explanation of possible options.
This report is sent to
osg-gfactory-reports@physics.ucsd.edu daily.
Using proxy_info
to Verify Pilot Proxies
An example of how to verify the pilot proxies used by the frontend.
- Get a list of the proxies for a VO and CE:
proxy_info fecms ls -l /var/gfactory/clientproxies/user_fecms/glidein_v2_0/entry_CMS_T2_US_UCSD_gw2/
- Display a particular proxy's information:
proxy_info fecms info -all /var/gfactory/clientproxies/user_fecms/glidein_v2_0/entry_CMS_T2_US_UCSD_gw2/x509_CMS_T2_US_UCSD_gw2@v2_0@UCSD@UCSD.minus,v5_0.dot,main_umrw_5.proxy
- For additional tool help run:
proxy_info -h
NOTE at CERN you must first source:
source /afs/cern.ch/cms/LCG/LCG-2/UI/cms_ui_env.sh
Site Debugging Reference
How to contact Grid sites
Non-CMS issues at OSG sites
Use GOC:
https://ticket.grid.iu.edu/goc/submit
- For
Email Address
use osg-gfactory-support@physics.ucsd.edu
- Check the
Resource
box and find the name corresponding to the GLIDEIN_ResourceName
attribute in the $GLIDEIN_FACTORY_DIR/glideinWMS.xml
- Include the Resource name in the Title so it is easy to find
CMS issues for All Sites
Use Savannah:
https://savannah.cern.ch/
- Search for
CMS
- Select
CMS Computing Infrastructure Support
- Click
Submit a new item
NOTE the following assumes you have administrative rights
Fill out the following fields:
- For
Category
select Facilities
- For
Assigned to
select cmscompinfrasup-sitename
- Set
Use GGUS
to No
(this can be changed to Yes
later if admins never respond)
- For
Site
find the name corresponding to the GLIDEIN_CMSSite
attribute in the $GLIDEIN_FACTORY_DIR/glideinWMS.xml
- Include the CMS Site name in the Title so it is easy to find
- In
Add Email Addresses
add osg_gfactory
NOTE As an exception UK admins complain that we should always set
Use GGUS
to
Yes
or they will not see the ticket:
https://savannah.cern.ch/support/index.php?134388
NOTE It seems admins at T2_FR_GRIF_IRFU require ggus as well.
If using ggus, be sure to add the gfactory-support email list in the "Involve others" field. Otherwise, GGUS won't send out an email to us whenever the ticket is updated.
NOTE if the site squad in question cannot be found in
Assigned to
then just follow the same instructions as below:
Non-CMS issues at European sites
Non-CMS issues at European sites
Use GOC:
https://ticket.grid.iu.edu/goc/submit
IMPORTANT leave
Resource
unchecked.
Explain in the Description it is an EGI resource along with the
GLIDEIN_ResourceName
and mention you are forwarding it to GGUS on behalf of the affected VO. After submitting the ticket, click the
GGUS (Prod)
box in the
Ticket Exchange
options and click
Update
.
Globus Hold Reasons
Globus Error Code | Held Reason | Job is Recoverable |
10 | globus_xio_gsi: Token size exceeds limit. Usually happens when someone tries to establish a insecure connection with a secure endpoint, e.g. when someone sends plain HTTP to a HTTPS endpoint without | No |
121 | the job state file doesn't exist | No |
126 | it is unknown if the job was submitted | Yes |
12 | the connection to the server failed (check host and port) | Yes |
131 | the user proxy expired (job is still running) | Maybe |
17 | the job failed when the job manager attempted to run it | No |
22 | the job manager failed to create an internal script argument file | No |
31 | the job manager failed to cancel the job as requested | No |
3 | an I/O operation failed | Yes |
47 | the gatekeeper failed to run the job manager | No |
48 | the provided RSL could not be properly parsed | No |
4 | jobmanager unable to set default to the directory requested | No |
76 | cannot access cache files in ~/.globus/.gass_cache, check permissions, quota, and disk space | Maybe (Short term: No) |
79 | connecting to the job manager failed. Possible reasons: job terminated, invalid job contact, network problems, ... | No |
7 | an authorization operation failed | Yes |
7 | authentication with the remote server failed | Yes |
8 | the user cancelled the job | No |
94 | the jobmanager does not accept any new requests (shutting down) | Yes |
9 | the system cancelled the job | No |
? | Job failed, no reason given by GRAM server | No |
122 | could not read the job state file | Maybe (short term: no) |
132 | the job was not submitted by original jobmanager | No (likely to be fatal) |
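For scripting, the recoverability column of this table can be encoded as a simple lookup. This is only a sketch: the function name is hypothetical, the qualifiers in parentheses are condensed to a plain Yes, Maybe or No, and any code not listed in the table is reported as Unknown.

```shell
# globus_recoverable CODE
# Prints the "Job is Recoverable" value from the table for a Globus error
# code; codes not in the table print Unknown.
globus_recoverable() {
    case "$1" in
        3|7|12|94|126) echo Yes ;;
        76|122|131) echo Maybe ;;
        4|8|9|10|17|22|31|47|48|79|121|132) echo No ;;
        *) echo Unknown ;;
    esac
}
```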
Some background on Globus
In Globus' Hold Reasons, "job manager" refers to the process running on the CE responsible for submitting to the local batch system. The process is called
globus-job-manager
.
Additional notes
-
Globus error 79: connecting to the job manager failed. Possible reasons: job terminated, invalid job contact, network problems, ..
- This can happen if it is a condor site and the admin removes held glideins from their side.
- Also happens to every CERN Production Glidein every Monday on every gt5 site, but as of yet we still don't know why:
-
Globus error 17
- Usually transient but if site is pbs and 100% of a vo is going held with 17 it could be they are not authorized for that pbs queue:
-
Globus error 17, 31, 79, 121, 155
- These glideins may not be recoverable, and the factory attempts to remove them.
- The factory does not always succeed, so you may have to do it manually using condor_rm with
-forcex
- In particular, even if you remove these glideins, when they turn into unknown state ("X"), they might turn back into held state ("H"). So
-forcex
is the way to remove them definitively
-
Globus error 155
- The
globus-job-manager
is likely unable to send a file back to the factory
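To see which hold reasons dominate at a given moment, the held glideins can be grouped by reason. The sketch below is a generic counting filter with a hypothetical name; it reads one hold reason per line on stdin.

```shell
# summarize_hold_reasons
# Reads one hold reason per line on stdin and prints each distinct reason
# with its count, most frequent first.
summarize_hold_reasons() {
    sort | uniq -c | sort -rn
}
```

Usage on the factory node: `condor_q -held -format '%s\n' HoldReason | summarize_hold_reasons`.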
Nordugrid Hold Reasons
CREAM Hold Reasons (Work in progress)
Link to a summary page on
CREAM troubleshooting
Reasons we mostly understand
-
CREAM error: Transfer failed: globus_ftp_control: gss_init_sec_context failed OpenSSL Error: s3_clnt.c:1063: in library: SSL routines, function SSL3_GET_SERVER_CERTIFICATE: certificate verify failed globus_gsi_callback_module: Could not verify credential globus_gsi_callback_module: Can't get the local trusted CA certificate: Untrusted self-signed certificate in chain with hash e7734335
- Can occur if the CA on the factory is out of date, and the gatekeeper identifies itself with a certificate that's newer than our CA. Try using "yum update" to get a more current CA.
Reasons we don't understand
-
CREAM error: Transfer failed: GRIDFTP_TRANSFER timed out
-
CREAM error: BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:qsub: Queue is not enabled MSG=queue is disabled: user cmprd003@ce08.pic.es, queue glong_sl5-) N/A (jobId = CREAM102242343)
- This can be seen when glideins are submitted to a site in downtime.
- The following are likely because the job manager on the other end has no record of the glideins anymore and can probably just safely be removed (if the site isn't in downtime).
-
CREAM error: reason=999
- (not really sure what this means)
-
CREAM error: CREAM_Job_Purge Error: job does not exist
-
CREAM error: job aborted because the execution of the JOB_START command has been interrupted by the CREAM shutdown
-
CREAM error: BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:pbs_iff: cannot read reply from pbs_server-No Permission.-qsub: cannot connect to server pbs03.pic.es (errno=15007) Unauthorized Request -) N/A (jobId = CREAM258408629)
-
CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 500 500-Command failed. : globus_l_gfs_file_open failed. 500-globus_xio: Unable to open file /opt/glite/var/cream_sandbox/lt2-cmsprd/_DC_ch_DC_cern_OU_computers_CN_cmspilotjob_vocms157_cern_ch_cms_Role_production_Capability_NULL_lt2-cmsprd713/95/CREAM956614543/OSB/job.714256.8.out 500-globus_xio: System error in open: No such file or directory 500-globus_xio: A system call failed: No such file or directory 500 End.
- If the following are seen over many entries served by the same gridmanager it may be a local issue (but not always). Killing the gridmanager without -9 seems to clear them up:
-
CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 500 Command failed. : globus_xio: An end of file occurred
-
CREAM error: Transfer failed: globus_ftp_control_local_port(): Handle not in the proper state CLOSING.
-
CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 530 530-Login incorrect. : globus_gss_assist: Error invoking callout 530-globus_callout_module: The callout returned an error 530-an unknown error occurred 530 End.
-
CREAM error: Transfer failed: globus_ftp_client: the server responded with an error 500 500-Command failed. : callback failed. 500-an end-of-file was reached 500-globus_xio: The GSI XIO driver failed to establish a secure connection. The failure occured during a handshake read. 500-globus_xio: An end of file occurred 500 End.
-
CREAM error: Transfer failed: globus_ftp_control: gss_init_sec_context failed globus_gsi_callback_module: Could not verify credential globus_gsi_callback_module: Could not verify credential globus_gsi_callback_module: Invalid CRL: The available CRL has expired
New Improved Docs (based on Alison's notes)
FactoryOpsGlideinWMS
FactoryInfo
Ops support
- Internal tickets
- mailing lists
- access and cross-factory support
Factory Ops
- Creating a new instance (and preserve monitoring history?)
- Initial setup (daily emails, processes/monitoring, ??)
- Adding new entries
- Adding a frontend
- Upgrading
- Cloning
- sites from one factory to another
- Global cloning, such as t1 site group
- Removing schedds (I think docs for this may be wrong?)
- Attributes (link to gwms docs)
- Finding missing sites
- Removing glideins (includes scripts)
- Submitting test jobs
- Putting sites in downtime
- Submitting Tickets
- Decommissioning sites
- Factory Disk warnings
- Entry issues
- Removing old entries
-
Daily Ops Monitoring
- Mailing list
- Internal tickets (Jira)
- Daily emails
- Analyze Entries
- Web pages
- Held jobs
- Infosys
- Misc
- .err log problems
- HOLD problems
- Condor Activity Log problems
Daily Ops Other issues
- Restarting the grid manager
- Handling stuck waiting glideins
- Rundiffs
- Unmatched jobs
Additional References
- logs
- monitoring tools
- proxies
- ssh logins
- git commands
- frontend security info
- BDII
- Log Retention rules
- Security
- Condor G
- Useful scripts
- Misc
Future work
Authors
--
TerrenceMartin
--
IgorSfiligoi