Difference: DaveEvansSTEP09LogHandler (1 vs. 4)

Revision 4 2009/07/24 - Main.BristolDave

Line: 1 to 1
 
META TOPICPARENT name="WebHome"
Deleted:
<
<
 

Getting CRAB logs the brute force way

Without using condor history it is possible to find the crab logs in a given period by looking at the date stamps of the individual log files.
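
A minimal sketch of the date-stamp approach, assuming the logs live somewhere under a single directory tree and end in ".log". The function name, the suffix and the example path are hypothetical and are not taken from the actual scripts.

```python
import os

def find_logs_modified_between(root_dir, start_epoch, end_epoch, suffix=".log"):
    """Return the log files under root_dir last modified in [start_epoch, end_epoch)."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for filename in filenames:
            if not filename.endswith(suffix):
                continue
            path = os.path.join(dirpath, filename)
            # The modification time is the "date stamp" of the log file
            mtime = os.path.getmtime(path)
            if start_epoch <= mtime < end_epoch:
                matches.append(path)
    return sorted(matches)

# Example use (hypothetical path): logs touched in the last 24 hours
#   import time
#   now = time.time()
#   find_logs_modified_between("/path/to/crab/logs", now - 86400, now)
```

This walks every file in the tree, so it is brute force in the literal sense and may be slow on a large log area.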

Line: 5 to 4
  Without using condor history it is possible to find the crab logs in a given period by looking at the date stamps of the individual log files.
Added:
>
>

Getting error information

We need to describe the problem and extract the relevant information. The framework job report (xml) records errors in blocks like the example below, which looks like a good way of getting at the necessary text.

        <FrameworkError ExitStatus="8001" Type="CMSException">
                cms::Exception caught in cmsRun
                                                ---- FileOpenError BEGIN
                                                ---- FatalRootError BEGIN
                                                Fatal Root Error: @SUB=TKey::ReadObj
                                                Unknown class 
                                                ---- FatalRootError END
                                                
                                                RootInputFileSequence::initFile(): Input file dcap://dev03.ihepa.ufl.edu:22125/pnfs/ihepa.ufl.edu/data/raid/store/mc/Summer08/QCDpt470/GEN-SIM-RECO/IDEAL_V11_redigi_v1/0004/B2185720-1ACE-DD11-81DC-0030487E54B5.root was not found or could not be opened.
                                                cms::Exception caught in EventProcessor and rethrown
                                                ---- FileOpenError END
        </FrameworkError>
        <FrameworkError ExitStatus="8001" Type="WrapperExitCode"/>
        <FrameworkError ExitStatus="8001" Type="ExeExitCode"/>

Stuff to get

  • The last successfully accessed run/event number
  • The timestamp of that event / the report of an error
  • The file being accessed
  • The node the job was run on.
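
A rough sketch of pulling the exit status, error type and input file out of FrameworkError blocks like the one above. It assumes the job report is well-formed XML and that the "Input file ... was not found" message keeps the file URL on one line. The function name and regular expression are hypothetical. The run/event number, timestamp and node are not covered, since the example above does not show where they are recorded.

```python
import re
import xml.etree.ElementTree as ET

# Matches the "Input file <url> was not found" message seen in the example above
INPUT_FILE_RE = re.compile(r"Input file\s+(\S+)\s+was not found")

def extract_framework_errors(report_xml):
    """Return one dict per <FrameworkError> element in a job report string."""
    root = ET.fromstring(report_xml)
    errors = []
    for elem in root.iter("FrameworkError"):
        text = elem.text or ""
        match = INPUT_FILE_RE.search(text)
        errors.append({
            "exit_status": elem.get("ExitStatus"),
            "type": elem.get("Type"),
            # None when the error text does not name an input file
            "input_file": match.group(1) if match else None,
            "message": text.strip(),
        })
    return errors
```

Empty elements such as the WrapperExitCode and ExeExitCode entries come back with an empty message and no input file.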

 

Monitoring CRAB job exit codes

I have written some scripts that monitor the exit status of crab jobs that were submitted through the UCSD crab server. They work by asking condor_history for information about jobs that finished in the last 24 hours and then using this to track down the log files.

Revision 3 2009/06/22 - Main.BristolDave

Line: 1 to 1
 
META TOPICPARENT name="WebHome"
Added:
>
>

Getting CRAB logs the brute force way

Without using condor history it is possible to find the crab logs in a given period by looking at the date stamps of the individual log files.

 

Monitoring CRAB job exit codes

I have written some scripts that monitor the exit status of crab jobs that were submitted through the UCSD crab server. They work by asking condor_history for information about jobs that finished in the last 24 hours and then using this to track down the log files.

Revision 2 2009/06/15 - Main.BristolDave

Line: 1 to 1
 
META TOPICPARENT name="WebHome"
Changed:
<
<
test
>
>

Monitoring CRAB job exit codes

I have written some scripts that monitor the exit status of crab jobs that were submitted through the UCSD crab server. They work by asking condor_history for information about jobs that finished in the last 24 hours and then using this to track down the log files.

I do not know exactly where condor fits into any of this, how the crab server works, or any other details. However, condor_history seems to know about crab jobs and when they finished, so that is what I am using.

The procedure works like this

Finding jobs that finished in the last 24 hours

This is done by two cron jobs on the machine 'glidein-2'.

The first script runs every two hours and queries condor_history for jobs that finished in the last two hours. It writes a time-stamped file to /home/dlevans/ containing the xml-formatted information about the jobs that finished in that period. The snapshot is taken every two hours because the condor history has a finite length, so older entries could otherwise drop out before they are recorded.

#!/bin/bash

#
# A script to make an xml history of the condor jobs on
# glidein-2.t2.ucsd.edu from the last 2 hours. The xml history
# is edited later to make it readable by the python sax parser.
#

export PATH=/data/glidecondor/sbin:/data/glidecondor/bin:$PATH
export CONDOR_CONFIG=/data/glidecondor/etc/condor_config

### get the history of jobs completed
### in the last 2 hours (7200 seconds)
DATE=$(date +%s)
CUTOFF=$((DATE - 7200))
DATESTAMP=$(date +%F)
# %k pads single-digit hours with a leading space
TIMESTAMP=$(date +%k)
# remove leading whitespace
DATESTAMP=$(echo "$DATESTAMP" | sed 's/^[ ]*//')
TIMESTAMP=$(echo "$TIMESTAMP" | sed 's/^[ ]*//')

TMPFILE=/home/dlevans/condorhistory_glidein-2.t2.ucsd.edu.xml.tmp

condor_history -xml -constraint "EnteredCurrentStatus > $CUTOFF" > "$TMPFILE"

# only keep the snapshot if condor_history succeeded
if [ $? -eq 0 ]; then
        mv "$TMPFILE" "/home/dlevans/condorhistory_${DATESTAMP}-${TIMESTAMP}_glidein-2.t2.ucsd.edu.xml"
fi

At 1AM PST every day, a second script runs. Its purpose is to join the two-hourly files from the previous day into a single, more manageable 24 hour file. This file is also time stamped in its name and goes to /home/dlevans.

#!/bin/bash

### script to join 2 hourly condor_history logs
### into a single (timestamped) log and then clean
### up the xml so the sax xml parser can read it
### do this for the logs from yesterday

YESTERDAY=$(date --date=@$(( $(date +%s) - 86400 )) +'%F')
FILES=$(ls /home/dlevans/condorhistory_*_glidein-2.t2.ucsd.edu.xml | grep "$YESTERDAY")
OUTFILE="/home/dlevans/condorhistory_24hours_$YESTERDAY.xml"

### check there were some log files yesterday
if [[ -n "$FILES" ]]; then

        ### concatenate files
        ### into 24 hour history
        echo "" > "$OUTFILE"
        for FILE in $FILES
        do
                echo "$FILE"
                cat "$FILE" >> "$OUTFILE"
        done

        ### clean up 24 hour history
        ### remove the bad lines
        ./removeline.sh "DOCTYPE" "$OUTFILE"
        ./removeline.sh "xml version" "$OUTFILE"

        ### make sure there is only one opening and closing classads tag
        ./removeline.sh "classads" "$OUTFILE"
        sed -i 1i"<classads>" "$OUTFILE"
        echo "</classads>" >> "$OUTFILE"

else
        echo "No log files yesterday"
fi
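
The same join-and-clean step can be sketched in Python. This is only an illustration, assuming that removeline.sh (not shown on this page) deletes every line containing the given string. The function name is hypothetical, and unlike the shell script it also drops blank lines.

```python
def merge_classad_fragments(fragments):
    """Join several condor_history xml dumps into one parseable document."""
    # Lines containing any of these markers are dropped, mirroring the
    # three removeline.sh calls in the shell script above
    bad_markers = ("DOCTYPE", "xml version", "classads")
    kept = []
    for fragment in fragments:
        for line in fragment.splitlines():
            if any(marker in line for marker in bad_markers):
                continue
            if line.strip():
                kept.append(line)
    # Wrap everything in exactly one opening and one closing classads tag
    return "\n".join(["<classads>"] + kept + ["</classads>"]) + "\n"
```

Each element of fragments is the text of one two-hourly file. The result has a single root element, which is what an xml parser needs in order to read the whole day at once.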

Processing the logs for a 24 hour period

I have written various scripts that process the 24 hour log files and get some information organised by site about what exit codes were found. To get this collection of scripts, do

cvs co -d STEP09/loghandler UserCode/DLEvans/STEP09/loghandler

First, if there are still some spurious lines in the 24 hour xml log, you can clean them up with cleanupxml.sh. The argument is the xml file to clean. The script will replace it with a cleaned one.

The first step is to run the logHandler python script. This will make a text file from the xml file and use the content of the xml file to work out the CMSSW/CRAB exit status of each job. Each line is a colon-delimited list which describes the properties of each job relevant to this analysis, including the exit status. The output file will go to /home/dlevans/.
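
As an illustration of reading the cleaned xml with a sax parser, here is a minimal sketch. It is not the actual logHandler code. It assumes the usual condor xml layout, in which each job is a c element holding a elements whose n attribute is the ClassAd attribute name. The class and function names are hypothetical.

```python
import xml.sax

class ClassAdHandler(xml.sax.ContentHandler):
    """Collect each <c> element as a dict of attribute name -> text value."""

    def __init__(self):
        super().__init__()
        self.jobs = []
        self._job = None    # dict for the <c> element being read
        self._attr = None   # name of the <a> element being read
        self._chars = []

    def startElement(self, name, attrs):
        if name == "c":
            self._job = {}
        elif name == "a" and self._job is not None:
            self._attr = attrs.get("n")
            self._chars = []
        elif name == "b" and self._attr is not None:
            # Booleans carry their value in a "v" attribute rather than as text
            self._chars.append(attrs.get("v", ""))

    def characters(self, content):
        if self._attr is not None:
            self._chars.append(content)

    def endElement(self, name):
        if name == "a" and self._attr is not None:
            self._job[self._attr] = "".join(self._chars).strip()
            self._attr = None
        elif name == "c" and self._job is not None:
            self.jobs.append(self._job)
            self._job = None

def parse_classads(xml_text):
    """Parse a classads xml string and return a list of job dicts."""
    handler = ClassAdHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.jobs
```

All values are kept as strings, so the caller decides which ones to convert to numbers.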

The second step is to run the reportGenerator python script. This will take the output of the logHandler and organise it according to site. Using the information in the siteIndex.txt file it produces reports/mail.txt which will contain a ready made e-mail message for those sites that had failed jobs. It will copy the logs of the failed jobs to a web viewable space on UAF2.
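
The per-site grouping can be sketched as below. The real layout of the logHandler output is not documented here, so the three-field line format site:job_id:exit_code is purely hypothetical, as are the function name and the sample lines.

```python
from collections import defaultdict

def summarise_by_site(lines):
    """Count exit codes per site from colon-delimited job lines."""
    summary = defaultdict(lambda: defaultdict(int))
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Hypothetical field order: site, job id, exit code
        site, _job_id, exit_code = line.split(":")
        summary[site][int(exit_code)] += 1
    return {site: dict(codes) for site, codes in summary.items()}

# Made-up sample lines, for illustration only
sample_lines = [
    "T2_US_UCSD:101:0",
    "T2_US_Florida:102:8001",
    "T2_US_Florida:103:8001",
]
site_summary = summarise_by_site(sample_lines)
```

A report for a given site could then list only its non-zero exit codes.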

  -- BristolDave - 2009/06/11

Revision 1 2009/06/11 - Main.BristolDave

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="WebHome"
test

-- BristolDave - 2009/06/11
