--
IanMacNeill - 2010/07/16
This article is still in progress.
Glidein Factory Status Now
Monitoring Webpage
http://glidein-1.t2.ucsd.edu:8319/osg_gfactory/factoryStatusNow.html Load the webpage and be sure to click update. You will see a table that looks like this.
insert image of table with numbers
Drop Down Menu
The page contains a dropdown menu (1) which allows one to view all front ends at once or each front end individually.
insert image of dropdown
Update Table Button
Click the Update Table Button (2) to retrieve recent data. If you don't, you may be looking at old data. The time stamp below is set to Pacific time (here at UCSD).
insert image of update and time stamp
The Table
The table is divided into three parts: "Status", "Requested", and "Client Monitor". The last of these covers the users' jobs.
insert image of table
Status
The Status column (3) reflects the state machine of jobs as Condor processes them. Each subcolumn is a potential stop along the way. A normal job follows the state machine below:
insert state machine
-> Idle -> Waiting -> Staging In -> Pending -> Running -> Staging Out ->
At any point along this loop, a job may become Held or Unknown.
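The state machine described above can be written down as a rough Python sketch. The names JobState, NORMAL_PATH and next_states are made up for illustration and are not part of Condor or the factory code. The sketch assumes that Held and Unknown are reachable from every normal state, and it does not model where a job goes after leaving Held or Unknown, since that is not specified here.

```python
from enum import Enum


class JobState(Enum):
    IDLE = "Idle"
    WAITING = "Waiting"
    STAGING_IN = "Staging In"
    PENDING = "Pending"
    RUNNING = "Running"
    STAGING_OUT = "Staging Out"
    HELD = "Held"
    UNKNOWN = "Unknown"


# Normal path, in the order listed in this article.
NORMAL_PATH = [
    JobState.IDLE,
    JobState.WAITING,
    JobState.STAGING_IN,
    JobState.PENDING,
    JobState.RUNNING,
    JobState.STAGING_OUT,
]

# States a job can fall into from anywhere on the normal path (assumption).
EXCEPTIONAL = {JobState.HELD, JobState.UNKNOWN}


def next_states(state):
    """Return the states a job may move to from `state` in this simplified model."""
    if state in EXCEPTIONAL:
        # Where a job resumes after Held/Unknown is not modeled here.
        return set()
    result = set(EXCEPTIONAL)
    idx = NORMAL_PATH.index(state)
    if idx + 1 < len(NORMAL_PATH):
        result.add(NORMAL_PATH[idx + 1])
    return result
```

For example, next_states(JobState.IDLE) contains Waiting plus the two exceptional states.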
Running
This status means that the job is running.
Idle (Status)
This means that the job has been remotely queued up but has not yet staged in to the local queue.
Waiting
This is a local queue and does not reflect jobs outside the node.
Pending
This is a remote site queue. It contains jobs not yet executed.
Staging In
A job is in this state while it is being transferred from the remote queue to the local queue. The time this takes is small compared to the rest of the process. Given the transient nature of this operation, we expect the number in this column to be small, since it is unlikely that many jobs will be caught staging in at the same time.
Staging Out
This is the same as Staging In except that the job is transferring from the local queue back to the remote queue.
Unknown
This is a similar state to Held. It can occur at any point in the above state machine after condor_G has sent the job to the remote site and lost track of it. We look for a problem when Unknown > 0.
Held
A job may enter the Held state for a number of reasons at any time in the process. This is not bad in and of itself, since Condor will sometimes resolve the issue and pull the job out of the Held state. What we look for is Held >> 0. Information about the hold reason can be obtained with condor_q -held (you will also need to specify a schedd and a site name).
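The two rules of thumb for Unknown and Held can be summarized in a small sketch. The function name status_warnings and the held_threshold default of 10 are assumptions made for illustration; this article does not say how large Held has to be before it counts as ">> 0", so pick a value that suits the site.

```python
def status_warnings(unknown, held, held_threshold=10):
    """Return warnings for one row of the Status columns.

    held_threshold is a hypothetical cut-off standing in for "Held >> 0".
    """
    warnings = []
    if unknown > 0:
        # Any job in Unknown is worth a look.
        warnings.append("Unknown > 0")
    if held > held_threshold:
        # A handful of held jobs is normal; a large number is not.
        warnings.append("Held >> 0")
    return warnings
```

A row with a few held jobs and no unknown jobs produces no warnings in this sketch.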
Requested
The Requested (4) column lists details about limits on requested glideins at the site.
Max Run
A built-in safeguard that tells the host the maximum number that can be running at once.
Idle (Requested)
Idle (Requested) and Idle (Status) should be close. This tells the front end how many jobs to leave queued up as Idle.
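As a minimal sketch, "close" could be checked like this. The function name idle_counts_close and the tolerance of 5 are hypothetical, since no specific margin is given here.

```python
def idle_counts_close(idle_requested, idle_status, tolerance=5):
    """True if Idle (Requested) and Idle (Status) differ by at most `tolerance`.

    The tolerance is an assumed value, not one taken from the factory.
    """
    return abs(idle_requested - idle_status) <= tolerance
```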
Client Monitor
The Client Monitor (5) lists details from the CRAB server side about the user jobs. The previous two columns (Status and Requested) speak to Condor function and efficiency, while this column lists information used to ensure that the user jobs, however inefficiently, are still receiving computing time and finishing.
Claimed
User Running
Unmatched
User Idle
It is a problem if this is a big number and User Running is small or 0. This means that jobs are queued up but are not being given any computing power. Pay particular attention to this since this means user jobs are not going through at all.
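The condition described above (many idle user jobs, few or no running ones) could be flagged with a sketch like the following. The name users_starved and both thresholds are assumptions for illustration; what counts as "big" and "small" is not defined here.

```python
def users_starved(user_idle, user_running, big_idle=50, small_running=5):
    """True if many user jobs are idle while few or none are running.

    big_idle and small_running are hypothetical thresholds.
    """
    return user_idle >= big_idle and user_running <= small_running
```

With these defaults, users_starved(200, 0) is True, while users_starved(200, 150) is False because jobs are clearly going through.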
Registered
This number should be the same as Running. Also, Registered = Unmatched + Claimed.
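Both relations can be checked for a single table row with a short sketch. The function name registered_problems is made up for illustration, and the exact-equality comparison against Running is an assumption; in practice the numbers may lag each other slightly.

```python
def registered_problems(registered, unmatched, claimed, running):
    """Return the consistency checks that fail for one table row."""
    problems = []
    if registered != unmatched + claimed:
        problems.append("Registered != Unmatched + Claimed")
    if registered != running:
        problems.append("Registered != Running")
    return problems
```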
Info Age
This column is not useless, but it is not covered here and may be ignored for now.