Install and Configure BOSCO for Glidein-Based Submission

About this Document

This document describes how to install and configure BOSCO so that a glideinWMS factory can submit glideins to the BOSCO resource's local batch queue on behalf of a VO frontend. The process outlined below is specific to the case where you have ONLY ssh-key login access to the user account on the BOSCO resource, i.e., you do not have the ssh password. This document is also preliminary: it may not represent the best or the easiest way to install and configure BOSCO. The process below simply attempts to minimize changes to the standard BOSCO installation and configuration process.

This document follows the general OSG documentation conventions:

  1. A User Command Line is illustrated by a green box that displays a prompt:
     [user@client ~]$ 
  2. A Root Command Line is illustrated by a red box that displays the root prompt:
     [root@client ~]$ 
  3. Lines in a file are illustrated by a yellow box that displays the desired lines in a file:
     priorities=1 

Definitions

Hostnames:

  • BOSCO_HOST is the hostname of the host from which glideins will be submitted to the BOSCO resource's local batch queue.
  • FACTORY_HOST is the hostname of the host where you've installed and configured your glideinWMS factory.
  • FRONTEND_HOST is the hostname of the host where you've installed and configured your VO's glideinWMS frontend.

Usernames:

  • BOSCO_USER is the username of the user on the BOSCO_HOST that has access to the BOSCO resource's local batch queue; e.g., cmsbosco
  • FACTORY_ADMIN_USER is the username of the user on the FACTORY_HOST used for all non-root administrative tasks; e.g., gfactory
  • FACTORY_VO_USER is the username of the user on the FACTORY_HOST from which glideins are submitted to the BOSCO_HOST; e.g., fecmsglobal
  • FRONTEND_USER is the username of the user on the FRONTEND_HOST that submits requests for glideins to the FACTORY_HOST; e.g., frontend
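
The placeholders above can be kept as shell variables so that later commands are composed consistently. This is only a sketch: the hostnames are hypothetical, the usernames reuse the examples from the list above, and BOSCO_TARGET is a name introduced here for illustration.

```shell
#!/bin/sh
# Placeholder values only; replace each with the names used at your site.
# Hostnames are hypothetical; usernames reuse the examples from the list above.
BOSCO_HOST=bosco.example.org
FACTORY_HOST=factory.example.org
FRONTEND_HOST=frontend.example.org
BOSCO_USER=cmsbosco
FACTORY_ADMIN_USER=gfactory
FACTORY_VO_USER=fecmsglobal
FRONTEND_USER=frontend

# The user@host pair that is passed to bosco_cluster and used as the
# gatekeeper value of the factory entry (BOSCO_TARGET is a hypothetical name).
BOSCO_TARGET="$BOSCO_USER@$BOSCO_HOST"
echo "$BOSCO_TARGET"
```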

Requirements

  • glideinWMS Factory 3.2.6 or later (3.2.8 recommended)
  • HTCondor 8.2.4 or later
  • condor-bosco
  • An existing ssh-key login to the target BOSCO_HOST
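
One way to check the HTCondor version requirement is sketched below. It is not part of BOSCO or glideinWMS: version_ge is a hypothetical helper, it relies on the -V (version sort) option of GNU sort, and it assumes condor_version prints a line of the form "$CondorVersion: 8.2.4 ... $".

```shell
#!/bin/sh
# version_ge A B: succeed when dotted version A is greater than or equal to B.
# Relies on GNU sort's -V (version sort) option.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Compare the installed HTCondor version against the 8.2.4 minimum.
# Assumes condor_version prints a line like "$CondorVersion: 8.2.4 ... $".
if command -v condor_version >/dev/null 2>&1; then
    installed=$(condor_version | awk '/CondorVersion/ {print $2}')
    if version_ge "$installed" 8.2.4; then
        echo "HTCondor $installed meets the 8.2.4 minimum"
    else
        echo "HTCondor $installed is older than 8.2.4" >&2
    fi
else
    echo "condor_version not found in PATH" >&2
fi
```

The same helper can be reused for the glideinWMS factory version once you know how your installation reports it.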

Installation and Configuration

  1. Log in to the FRONTEND_HOST via ssh as the FRONTEND_USER. NOTE: It is important to log in with -A (ssh agent forwarding). This assumes your personal ssh key already gives you access to the BOSCO_HOST. The bosco_cluster --add command will use this forwarded login to copy the BOSCO credentials over to the BOSCO_HOST.
     [user@client ~]$ ssh -A FRONTEND_USER@FRONTEND_HOST
  2. Download the BOSCO installer tarball in the FRONTEND_USER home directory.
     [FRONTEND_USER@FRONTEND_HOST ~]$ wget ftp://ftp.cs.wisc.edu/condor/bosco/1.2/boscoinstaller.tar.gz 
  3. Unzip and untar the BOSCO installer in the FRONTEND_USER home directory.
     [FRONTEND_USER@FRONTEND_HOST ~]$ tar -xzf boscoinstaller.tar.gz 
  4. Run the boscoinstaller script to install BOSCO on the FRONTEND_HOST.
    [FRONTEND_USER@FRONTEND_HOST ~]$ python boscoinstaller 
  5. Generate a passwordless RSA key: when ssh-keygen prompts for a passphrase, press Enter twice to leave it empty. Note that it is important to name the key bosco_key.rsa:
     [FRONTEND_USER@FRONTEND_HOST ~]$ ssh-keygen -t rsa -f ~/.ssh/bosco_key.rsa
  6. Because BOSCO is not in your PATH on the FRONTEND_HOST, you must (at least temporarily) source its environment configuration file, bosco_setenv:
     [FRONTEND_USER@FRONTEND_HOST ~]$ source ~/bosco/bosco_setenv 
  7. The installer does not create the ~/.bosco directory, so create it manually:
     [FRONTEND_USER@FRONTEND_HOST ~]$ mkdir ~/.bosco 
  8. Start up BOSCO:
     [FRONTEND_USER@FRONTEND_HOST ~]$ bosco_start
  9. Add the BOSCO_HOST by running the bosco_cluster script with the following parameters. This will forward the passwordless BOSCO ssh key and install BOSCO on the remote side:
     [FRONTEND_USER@FRONTEND_HOST ~]$ bosco_cluster --add BOSCO_USER@BOSCO_HOST BATCH_TYPE 
    where BATCH_TYPE is the type of the local batch system on the BOSCO resource, e.g., pbs or condor.
  10. Run a BOSCO test job to check the connection between the FRONTEND_HOST and the BOSCO_HOST and its worker nodes.
     [FRONTEND_USER@FRONTEND_HOST ~]$ bosco_cluster --test BOSCO_USER@BOSCO_HOST 
  11. If successful, run bosco_stop on the FRONTEND_HOST.
     [FRONTEND_USER@FRONTEND_HOST ~]$ bosco_stop 
  12. Finally, add the following elements to your frontend configuration file, frontend.xml. You may add them to either the group or the global credential definition. All paths must be absolute, not relative.
     <credentials>
       <credential absfname="/path/to/grid_proxy" security_class="frontend" trust_domain="grid" type="grid_proxy"/>
       <credential absfname="/home/frontend/.ssh/bosco_key.rsa.pub" keyabsfname="/home/frontend/.ssh/bosco_key.rsa" pilotabsfname="/path/to/grid_proxy" security_class="frontend" trust_domain="bosco" type="key_pair"/>
    </credentials>
    
  13. Stop, reconfigure, and restart your frontend. If successful, the FRONTEND_HOST is now properly configured.
    [root@FRONTEND_HOST ~]$ service gwms-frontend stop
    [root@FRONTEND_HOST ~]$ service gwms-frontend reconfig
    [root@FRONTEND_HOST ~]$ service gwms-frontend start
    
  14. Next, log in to the FACTORY_HOST via ssh as root.
     [user@client ~]$ ssh root@FACTORY_HOST
  15. Install condor-bosco on the FACTORY_HOST as root.
     [root@FACTORY_HOST ~]$ yum install condor-bosco 
  16. Remove the 60-campus_factory.config file and recreate it as an empty file.
    [root@FACTORY_HOST ~]$ rm /etc/condor/config.d/60-campus_factory.config
    [root@FACTORY_HOST ~]$ touch /etc/condor/config.d/60-campus_factory.config
  17. Now, add the entry for the BOSCO_HOST to the factory configuration file, glideinWMS.xml.
    <entry name="CMS_TX_US_XXXXX_BOSCO" auth_method="key_pair" enabled="True" gatekeeper="BOSCO_USER@BOSCO_HOST" gridtype="batch BATCH_TYPE" rsl="" trust_domain="bosco" verbosity="std" work_dir="~/">
       <config>
          <max_jobs>
             <default_per_frontend glideins="256" held="50" idle="50"/>
             <per_entry glideins="256" held="50" idle="50"/>
             <per_frontends>
             </per_frontends>
          </max_jobs>
          <release max_per_cycle="20" sleep="0.2"/>
          <remove max_per_cycle="5" sleep="0.2"/>
          <restrictions require_glidein_glexec_use="False" require_voms_proxy="False"/>
          <submit cluster_size="10" max_per_cycle="100" sleep="0.2" slots_layout="fixed">
             <submit_attrs>
             </submit_attrs>
          </submit>
       </config>
       <allow_frontends>
       </allow_frontends>
       <attrs>
          <attr name="CONDOR_VERSION" const="False" glidein_publish="False" job_publish="False" parameter="True" publish="True" type="string" value="default"/>
          <attr name="GLEXEC_JOB" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="False" type="string" value="False"/>
          <attr name="GLIDEIN_CMSSite" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value="TX_US_XXXXX"/>
          <attr name="GLIDEIN_CPUS" const="True" glidein_publish="False" job_publish="True" parameter="True" publish="True" type="string" value="8"/>
          <attr name="GLIDEIN_Country" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value="US"/>
          <attr name="GLIDEIN_Glexec_Use" comment="This has been REQUIRED for historical reasons, OPTIONAL/NONE alt values" const="False" glidein_publish="True" job_publish="False" parameter="True" publish="True" type="string" value="NONE"/>
          <attr name="GLIDEIN_MaxMemMBs" const="True" glidein_publish="True" job_publish="False" parameter="True" publish="True" type="int" value="49152"/>
          <attr name="GLIDEIN_Max_Walltime" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="True" type="int" value="171000"/>
          <attr name="GLIDEIN_ResourceName" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value="TX_US_XXXXX"/>
          <attr name="GLIDEIN_Site" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value="TX_US_XXXXX"/>
          <attr name="GLIDEIN_Supported_VOs" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="True" type="string" value="CMS,MIS"/>
          <attr name="USE_CCB" const="True" glidein_publish="True" job_publish="False" parameter="True" publish="True" type="string" value="True"/>
          <attr name="X509_CERT_DIR" const="True" glidein_publish="False" job_publish="True" parameter="True" publish="True" type="string" value="/cvmfs/oasis.opensciencegrid.org/mis/certificates"/>
       </attrs>
       <files>
       </files>
       <infosys_refs>
       </infosys_refs>
       <monitorgroups>
       </monitorgroups>
    </entry>
    
  18. Finally, build up a global ssh known-hosts list so that the FACTORY_HOST trusts the host keys of both the BOSCO_HOST and the FRONTEND_HOST.
    [root@FACTORY_HOST ~]$ ssh-keyscan -t rsa,dsa BOSCO_HOST >> /etc/ssh/ssh_known_hosts
    [root@FACTORY_HOST ~]$ ssh-keyscan -t rsa,dsa FRONTEND_HOST >> /etc/ssh/ssh_known_hosts
    
  19. Stop, reconfigure and restart your factory. If successful, the FACTORY_HOST is now properly configured. You may now submit user jobs to the BOSCO_HOST via the FRONTEND_HOST.
    [root@FACTORY_HOST ~]$ service gwms-factory stop
    [root@FACTORY_HOST ~]$ service gwms-factory reconfig
    [root@FACTORY_HOST ~]$ service gwms-factory start
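
If glideins are not submitted after completing the steps above, one thing worth checking on the FRONTEND_HOST is that the key pair referenced in frontend.xml is in place. The following is a minimal sketch and not part of BOSCO or glideinWMS: check_key_pair is a hypothetical helper, and stat -c assumes GNU coreutils.

```shell
#!/bin/sh
# check_key_pair PRIVATE_KEY
# Succeeds only if PRIVATE_KEY and PRIVATE_KEY.pub both exist and the
# private key grants no permissions to group or other (e.g. mode 600 or 400).
check_key_pair() {
    key=$1
    [ -f "$key" ] || { echo "missing private key: $key" >&2; return 1; }
    [ -f "$key.pub" ] || { echo "missing public key: $key.pub" >&2; return 1; }
    # stat -c is GNU coreutils syntax (an assumption about the platform).
    mode=$(stat -c '%a' "$key") || return 1
    case $mode in
        *00) return 0 ;;
        *)   echo "private key $key has mode $mode; expected 600 or 400" >&2
             return 1 ;;
    esac
}
```

For the key generated in step 5, the call would be: check_key_pair "$HOME/.ssh/bosco_key.rsa"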
    

Troubleshooting

If glideins and/or direct BOSCO user jobs fail to submit to a local pbs/slurm batch system, it may be useful to modify the ~/bosco/glite/bin/pbs_submit.sh submission script on the BOSCO_HOST so that the qsub/sbatch error messages are visible directly.

Before:

jobID=`${pbs_binpath}/qsub $bls_tmp_file` # actual submission
retcode=$?
if [ "$retcode" != "0" ] ; then
       rm -f $bls_tmp_file
       exit 1
fi

After:

jobID=`${pbs_binpath}/qsub $bls_tmp_file` # actual submission
retcode=$?
echo "Full qsub output: $jobID" 1>&2
if [ "$retcode" != "0" ] ; then
       rm -f $bls_tmp_file
       exit 1
fi
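
The same capture-and-report pattern can be tried outside of pbs_submit.sh before editing the real script. The sketch below is an illustration only: run_logged is a hypothetical helper, not a BOSCO function, and it can wrap any command in place of qsub.

```shell
#!/bin/sh
# run_logged CMD [ARGS...]
# Runs CMD, copies whatever it printed on stdout (plus its exit code) to
# stderr for debugging, and prints that output on stdout only on success.
run_logged() {
    out=$("$@")
    rc=$?
    echo "Full output (exit code $rc): $out" >&2
    [ "$rc" -eq 0 ] || return 1
    printf '%s\n' "$out"
}
```

For example, jobID=$(run_logged echo 12345.host) leaves 12345.host in jobID while the debug line goes to stderr.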

Additional Documentation

-- JeffreyDost - 2015/05/12

Topic revision: r23 - 2016/11/01 - 22:42:10 - MartinKandes
 