Getting CRAB logs the brute force way
Without using the condor history, it is possible to find the CRAB logs for a given period by looking at the date stamps of the individual log files.
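For example, something along these lines would list the log files touched in the last 24 hours (the log directory and the filename pattern are assumptions and will depend on where the local setup keeps the CRAB output):
#!/bin/bash
### list CRAB job log files modified in the last 24 hours
### (log directory and filename pattern are assumptions, adjust to the local setup)
LOGDIR=/path/to/the/crab/log/area
find $LOGDIR -name '*.stdout' -mmin -1440 -exec ls -l {} \;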
Getting error information
Need to describe the problem and extract the relevant information. Looking in the framework job report (XML), here is an example of an error. This looks like it might be a good way of getting at the necessary text (a rough extraction sketch follows the list below).
<FrameworkError ExitStatus="8001" Type="CMSException">
cms::Exception caught in cmsRun
---- FileOpenError BEGIN
---- FatalRootError BEGIN
Fatal Root Error: @SUB=TKey::ReadObj
Unknown class
---- FatalRootError END
RootInputFileSequence::initFile(): Input file dcap://dev03.ihepa.ufl.edu:22125/pnfs/ihepa.ufl.edu/data/raid/store/mc/Summer08/QCDpt470/GEN-SIM-RECO/IDEAL_V11_redigi_v1/0004/B2185720-1ACE-DD11-81DC-0030487E54B5.root was not found or could not be opened.
cms::Exception caught in EventProcessor and rethrown
---- FileOpenError END
</FrameworkError>
<FrameworkError ExitStatus="8001" Type="WrapperExitCode"/>
<FrameworkError ExitStatus="8001" Type="ExeExitCode"/>
Stuff to get
- The last successfully accessed run/event number
- The timestamp of that event / the report of an error
- The file being accessed
- The node the job was run on.
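As a rough sketch, the exit status, the error type and the offending input file can be pulled out of the job report with standard text tools; the report file name and the tag layout are assumed from the fragment above, and the last run/event number and the worker node would presumably have to come from the job stdout instead:
#!/bin/bash
### crude extraction of error information from a framework job report
### (the file name and the tag/message formats are assumed from the example above)
FJR=FrameworkJobReport.xml
### exit status and type of each FrameworkError block
grep '<FrameworkError' $FJR | sed 's/.*ExitStatus="\([0-9]*\)" Type="\([^"]*\)".*/\1 \2/'
### the input file that could not be opened, if any
grep 'Input file' $FJR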
Monitoring CRAB job exit codes
I have written some scripts that monitor the exit status of CRAB jobs that were submitted through the UCSD CRAB server. They work by asking condor_history for information about jobs that finished in the last 24 hours and then using this to track down the log files.
I do not know exactly where condor fits into any of this, how the CRAB server works, or any other details. The condor_history seems to know about CRAB jobs and when they finished, though, so that is what I am using.
The procedure works like this:
Finding jobs that finished in the last 24 hours
This is done by two cron jobs on the machine 'glidein-2'.
The first script runs every two hours and queries condor_history for jobs that finished in the last two hours. It writes a time stamped file to /home/dlevans/ containing the xml formatted information about the jobs that finished in that period. The snapshot is taken every two hours because the condor history has a finite length.
#!/bin/bash
#
# A script to make an xml history of condor jobs
# from the last 2 hours. The xml history is later
# edited to make it readable by the python sax parser.
#
export PATH=/data/glidecondor/sbin:/data/glidecondor/bin:$PATH
export CONDOR_CONFIG=/data/glidecondor/etc/condor_config
### get the history of jobs completed
### in the last 2 hours (7200 seconds)
DATE=$(date +%s)
DATE2HOURSAGO=$(($DATE - 7200))
DATESTAMP=$(date +%F)
TIMESTAMP=$(date +%k)
# remove leading whitespace (date +%k pads the hour with a space)
DATESTAMP=$(echo $DATESTAMP | sed 's/^[ ]*//')
TIMESTAMP=$(echo $TIMESTAMP | sed 's/^[ ]*//')
condor_history -xml -constraint 'EnteredCurrentStatus > '$DATE2HOURSAGO > /home/dlevans/condorhistory_glidein-2.t2.ucsd.edu.xml.tmp
if [ $? -eq 0 ]; then
mv /home/dlevans/condorhistory_glidein-2.t2.ucsd.edu.xml.tmp /home/dlevans/condorhistory_${DATESTAMP}-${TIMESTAMP}_glidein-2.t2.ucsd.edu.xml
fi
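The crontab itself is not recorded here; an entry along these lines (the script name and path are assumptions) would give the two hourly schedule described above:
### assumed crontab entry: run the snapshot script at the top of every second hour
0 */2 * * * /home/dlevans/condor_snapshot_2hours.sh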
At 1AM PST every day, a second script runs. Its purpose is to join the two-hour files for the previous day into a single, more manageable 24 hour file. This file is also time stamped in its name and goes to /home/dlevans.
#!/bin/bash
### script to join 2 hourly condor_history logs
### into a single (timestamped) log and then clean
### up the xml so the sax xml parser can read it
### do this for the logs from yesterday
YESTERDAY=$(date --date=@$(( $(date +%s) - 86400 )) +'%F')
FILES=$(ls /home/dlevans/condorhistory_*_glidein-2.t2.ucsd.edu.xml | grep $YESTERDAY)
OUTFILE="/home/dlevans/condorhistory_24hours_$YESTERDAY.xml"
### check there were some logs files yesterday
if [[ $FILES != "" ]];
then
### concatenate files
### into 24 hour history
echo "" > $OUTFILE
for FILE in $FILES
do
echo $FILE
cat $FILE >> $OUTFILE
done
### clean up 24 hour history
### remove the bad lines
./removeline.sh "DOCTYPE" $OUTFILE
./removeline.sh "xml version" $OUTFILE
### make sure there is only one opening and closing classads tag
./removeline.sh "classads" $OUTFILE
sed -i 1i"<classads>" $OUTFILE
echo "</classads>" >> $OUTFILE
else
echo "No log files yesterday"
fi
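The removeline.sh helper used above is not listed here. A minimal version, assuming it simply deletes lines matching a pattern from a file in place, would look like this:
#!/bin/bash
### removeline.sh: delete lines matching $1 from the file $2, in place
### (a guess at the implementation, based only on how it is called above)
PATTERN=$1
FILE=$2
sed -i "/$PATTERN/d" "$FILE"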
Processing the logs for a 24 hour period
I have written various scripts that process the 24 hour log files and produce a summary, organised by site, of the exit codes that were found. To get this collection of scripts, do
cvs co -d STEP09/loghandler UserCode/DLEvans/STEP09/loghandler
Before processing, if there are still some spurious lines in the 24 hour xml log, you can clean them up with cleanupxml.sh. The argument is the xml file to clean; the script will replace it with a cleaned one.
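For example (the file name is just an illustration, following the naming used by the 24 hour script above):
./cleanupxml.sh /home/dlevans/condorhistory_24hours_2009-06-10.xml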
The first step is to run the logHandler python script. This makes a text file from the xml file and uses the content of the xml file to work out the CMSSW/CRAB exit status of each job. Each line of the output is a colon delimited list which describes the properties of each job relevant to this analysis, including the exit status. The output file goes to /home/dlevans/.
The second step is to run the reportGenerator python script. This takes the output of the logHandler and organises it by site. Using the information in the siteIndex.txt file, it produces reports/mail.txt, which contains a ready-made e-mail message for those sites that had failed jobs. It also copies the logs of the failed jobs to a web viewable space on UAF2.
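The exact command lines are not recorded here; assuming each script takes the file to process as its argument, a daily run would look something like this (script and file names are illustrative):
### assumed invocations; check the scripts themselves for the real arguments
python logHandler.py /home/dlevans/condorhistory_24hours_2009-06-10.xml
python reportGenerator.py /home/dlevans/condorhistory_24hours_2009-06-10.txt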
--
BristolDave - 2009/06/11