The glideinWMS AMI (Amazon Machine Image)
This document describes how the AMI startup process currently works. Changes that are coming are described in the relevant places, marked in red font.
Assumptions
It is assumed that the Factory Condor instance that launched the AMI also passes along a tarball in the user data containing the glidein_startup.sh script, the proxy used for authentication to the user collector, and an ini file containing all the arguments necessary to launch the glidein_startup.sh script. (Note: Soon, the glidein_startup.sh script will be moved to the stage directory on the Factory web server, and the ini file will contain the information needed to download it from there.)
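For illustration only, such an ini file might look like the following. The section name, option names, and argument string are hypothetical placeholders, not the actual format produced by the Factory:

[glidein_startup]
; arguments passed straight through to glidein_startup.sh (placeholder values)
args = -v std -name example_factory -entry ec2_entry -web http://factory.example.com/stage
; name of the proxy file included in the user-data tarball (placeholder)
proxy_file_name = pilot_proxy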
Basic Environment
- The AMI has a user called glidein_user that the pilot runs as. Its home directory is /home/glidein_pilot, and the glidein pilot is launched in this home directory.
- The OSG Worker Node client is installed.
- The CernVM-FS (CVMFS) software is installed. (Note: This software is specifically used to mount WLCG VO software. It is not required, but it does make life easier for CMS.)
- A service called GlideinPilot is registered and chkconfig'd on.
- The GlideinPilot service launches /home/glidein_user/PilotLauncher.py. (Note: PilotLauncher.py will be moved to /usr/sbin in the near future.)
Startup Process
The GlideinPilot service is configured to start at boot time. It creates the necessary symbolic links to make the $OSG_APP environment mimic the OSG Grid $OSG_APP area (Note: This functionality will be moved to the VO Frontend scripts in the future). Then, it launches PilotLauncher.py.
PilotLauncher.py performs the following tasks (a sketch of this flow follows the list):
- daemonizes itself
- drops privileges from root to glidein_user (Note: This will change once the glidein_startup.sh script and associated scripts are updated to handle this; glidein_startup.sh will then be responsible for dropping privileges.)
- opens logs for debugging
- gets the user data, which should be a tar file
- unpacks the user data
- ensures that the proxy is owned by glidein_user, changing ownership if necessary
- generates the pilot launch command:
  - sets the proper environment
  - gets the arguments from the ini file in the user data
- launches glidein_startup.sh with the proper environment and arguments
- waits for glidein_startup.sh to complete
- closes logs and performs clean-up tasks
- shuts down the AMI
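The following is a minimal sketch of that flow, not the actual PilotLauncher.py. The user-data URL is the standard EC2 metadata endpoint; the file names userdata.tar.gz, pilot_proxy, and pilot.ini, as well as the ini section and option names, are hypothetical placeholders:

#!/usr/bin/env python
# Sketch only: illustrates the steps listed above; it is not the real PilotLauncher.py.
import configparser
import os
import pwd
import subprocess
import tarfile
import urllib.request

USER_DATA_URL = "http://169.254.169.254/latest/user-data"  # EC2 metadata service
PILOT_USER = "glidein_user"            # pilot account described in this document
PILOT_HOME = "/home/glidein_pilot"     # pilot home directory described in this document


def drop_privileges(username):
    """Switch from root to the unprivileged pilot account."""
    pw = pwd.getpwnam(username)
    os.setgid(pw.pw_gid)
    os.setuid(pw.pw_uid)


def main():
    # Get the user data; it should be a tar file.
    tarball = os.path.join(PILOT_HOME, "userdata.tar.gz")        # hypothetical name
    with urllib.request.urlopen(USER_DATA_URL) as resp, open(tarball, "wb") as out:
        out.write(resp.read())

    # Unpack glidein_startup.sh, the proxy and the ini file into the pilot home.
    with tarfile.open(tarball) as tf:
        tf.extractall(PILOT_HOME)

    # Ensure the proxy is owned by the pilot user.
    pw = pwd.getpwnam(PILOT_USER)
    proxy = os.path.join(PILOT_HOME, "pilot_proxy")              # hypothetical name
    os.chown(proxy, pw.pw_uid, pw.pw_gid)

    # Generate the pilot launch command: environment plus arguments from the ini file.
    env = dict(os.environ, HOME=PILOT_HOME, X509_USER_PROXY=proxy)
    cfg = configparser.ConfigParser()
    cfg.read(os.path.join(PILOT_HOME, "pilot.ini"))              # hypothetical name
    args = cfg.get("glidein_startup", "args").split()            # hypothetical section/option
    cmd = [os.path.join(PILOT_HOME, "glidein_startup.sh")] + args

    # Launch glidein_startup.sh as the pilot user and wait for it to complete.
    # Privileges are dropped in the child here so this sketch can still power
    # off the node as root afterwards; the real launcher daemonizes, opens its
    # own logs, and drops privileges itself.
    subprocess.call(cmd, cwd=PILOT_HOME, env=env,
                    preexec_fn=lambda: drop_privileges(PILOT_USER))

    # Whenever glidein_startup.sh exits, for any reason, shut the AMI down.
    subprocess.call(["/sbin/shutdown", "-h", "now"])


if __name__ == "__main__":
    main()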
Also, if glidein_startup.sh exits (for any reason), PilotLauncher.py will shut down the AMI.
Cloud Related Problems
In the grid world, glidein_startup.sh is the job. In the Cloud world, the AMI is the job. This currently poses a problem. In the grid world, batch systems enforce a maximum run time of sorts: the batch system will terminate a job that exceeds that limit, which ensures that rogue pilots don't run forever. This mechanism doesn't exist in the Cloud world. Additionally, the Factory has no way of knowing whether or not the glidein_startup.sh script successfully launched or is even active; currently there is no communication from the pilot back to the Factory. The AMI is only terminated if a) the Factory admin performs a condor_rm on the job, b) an admin with access to the EC2 account keys terminates the AMI, or c) PilotLauncher.py issues a shutdown command. We have a potential hole where something hangs on the AMI and prevents PilotLauncher.py from terminating it, while no useful work can be accomplished with that particular instance. We need some mechanism that terminates AMIs based on sanity checks of some sort. This needs to be done at the Factory.
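As a purely hypothetical illustration of what such a Factory-side check might look like (glideinWMS does not currently do this), a cron job could remove any glidein job that has been running longer than an assumed maximum walltime, letting Condor terminate the corresponding AMI:

#!/usr/bin/env python
# Hypothetical Factory-side sanity check; not part of glideinWMS today.
import subprocess

MAX_WALLTIME = 48 * 3600  # assumed maximum pilot lifetime, in seconds

# JobStatus == 2 means "running"; EnteredCurrentStatus records when the job
# entered that state. A real check would also restrict the constraint to the
# jobs submitted to the cloud entry.
constraint = "(JobStatus == 2) && ((time() - EnteredCurrentStatus) > %d)" % MAX_WALLTIME

# condor_rm removes the matching jobs, which in turn terminates their AMIs.
subprocess.call(["condor_rm", "-constraint", constraint])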
--
AnthonyTiradani - 2010/09/28