-- IanMacNeill - 2010/07/16

This article is still in progress.

Glidein Factory Status Now


Monitoring Webpage

http://glidein-1.t2.ucsd.edu:8319/osg_gfactory/factoryStatusNow.html Load the webpage and be sure to click update. You will see a table that looks like this.

insert image of table with numbers

Drop Down Menu

The page contains a dropdown menu (1) which allows one to view all front ends at once or each front end individually.

insert image of dropdown

Update Table Button

Click the Update Table Button (2) to retrieve recent data. If you don't, you may be looking at old data. The time stamp below is set to Pacific time (here at UCSD).

insert image of update and time stamp

The Table

The table is divided into three parts: "Status", "Requested", and "Client Monitor", the last of which covers the users submitting jobs from their own machines.

insert image of table


The Status column (3) reflects the state machine of jobs as Condor processes them. Each subcolumn is a potential stop along the way. A normal job follows the state machine below:
insert state machine
->Idle->Waiting->Staging In->Pending->Running->Staging Out->
At any point along this loop, a job may become Held or Unknown.
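The loop above can be sketched as a small transition table. This is only an illustration of the description on this page, not code taken from the factory: the names MAIN_LOOP, SIDE_STATES and next_states are hypothetical, and wrapping from Staging Out back to Idle is an assumption based on the word "loop".

```python
# Main loop of the Status state machine, in the order listed on this page.
MAIN_LOOP = ["Idle", "Waiting", "Staging In", "Pending", "Running", "Staging Out"]

# States a job may fall into from any point along the loop.
SIDE_STATES = {"Held", "Unknown"}


def next_states(state):
    """Return the states a job may move to from `state` (illustrative only)."""
    if state not in MAIN_LOOP:
        # This page does not say where a Held or Unknown job resumes,
        # so the sketch refuses to guess.
        raise ValueError("no transitions described for state: %r" % (state,))
    index = MAIN_LOOP.index(state)
    # Assumption: the trailing arrow means Staging Out leads back to Idle.
    successor = MAIN_LOOP[(index + 1) % len(MAIN_LOOP)]
    return {successor} | SIDE_STATES
```

For example, next_states("Pending") gives Running plus the two side states Held and Unknown.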
Running
This status means that the job is running.
Idle (Status)
This means that the job has been remotely queued up but has not yet staged in to the local queue.
Waiting
This is a local queue and does not reflect jobs outside the node.
Pending
This is a remote site queue. It contains jobs not yet executed.
Staging In
A job is in this state while it is being transferred from the remote queue to the local queue. The time this takes is small compared to the rest of the process. Given the transient nature of this operation, we expect the number in this column to be small, since it is unlikely that many jobs will be caught staging in at the same time.
Staging Out
This is the same as Staging In, except that the job is transferring from the local queue back to the remote queue.
Unknown
This is a similar state to Held. It can occur at any point in the above state machine after Condor-G has sent the job to the remote site and lost track of it. We look for a problem when Unknown > 0.
Held
A job may enter the Held state for a number of reasons at any time in the process. It is not bad in and of itself. Condor will sometimes resolve the issue and pull the job out of the Held state. We look for a problem when Held >> 0. The hold reason may be viewed with condor_q -held (you will need to specify a schedd and a site name, too).
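The two rules of thumb for Unknown and Held can be written as a quick check on one table row. This is a minimal sketch: the function name status_warnings, the dictionary layout, and the value of held_threshold are all assumptions, since this page only says that Held should not be much greater than 0.

```python
def status_warnings(row, held_threshold=10):
    """Return warnings for one row of the Status columns.

    `row` maps column names to counts, e.g. {"Unknown": 0, "Held": 3}.
    `held_threshold` is a hypothetical stand-in for "Held >> 0".
    """
    warnings = []
    if row.get("Unknown", 0) > 0:
        # Any Unknown job means Condor-G has lost track of it.
        warnings.append("Unknown > 0: look for a problem")
    if row.get("Held", 0) >= held_threshold:
        # A few Held jobs are normal; many are worth investigating.
        warnings.append("Held is large: check hold reasons with condor_q -held")
    return warnings
```

A row with a handful of Held jobs and no Unknown jobs produces no warnings, which matches the point that Held is not bad in and of itself.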


The Requested (4) column lists details about limits on requested glideins at the site.
Max Run
A built-in safeguard that tells the host the maximum number of glideins that can be running at once.
Idle (Requested)
Idle (Requested) and Idle (Status) should be close. This tells the front end how many jobs to leave queued up as Idle.
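As a rough sketch, the two Requested values can be compared against the Status columns like this. The function name requested_warnings and the idle_tolerance value are assumptions; this page only says the two Idle numbers "should be close" and does not define how close.

```python
def requested_warnings(status_running, status_idle, max_run, requested_idle,
                       idle_tolerance=5):
    """Compare the Status columns against the Requested limits.

    `idle_tolerance` is a hypothetical definition of "close".
    """
    warnings = []
    if status_running > max_run:
        # Max Run is the cap on how many may be running at once.
        warnings.append("Running exceeds Max Run")
    if abs(status_idle - requested_idle) > idle_tolerance:
        warnings.append("Idle (Status) and Idle (Requested) differ noticeably")
    return warnings
```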

Client Monitor

The Client Monitor (5) lists details from the CRAB server side about the user jobs. The previous 2 columns (Status and Requested) speak to Condor function and efficiency, while this column lists information used to ensure that the user jobs, however inefficiently, are still receiving computing time and finishing.

User Running

User Idle
It is a problem if this is a big number and User Running is small or 0. This means that jobs are queued up but are not being given any computing power. Pay particular attention to this since this means user jobs are not going through at all.
This number should be the same as running. Also Registered=Unmatched+Claimed.
Info Age
This column is not useless, but it is not covered here and may be ignored for now.
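The Client Monitor checks described above can be sketched in the same way. The function name client_warnings and both thresholds are assumptions: this page only says that User Idle being "big" while User Running is "small or 0" is a problem, and that Registered = Unmatched + Claimed.

```python
def client_warnings(user_idle, user_running, registered, unmatched, claimed,
                    idle_threshold=100, running_threshold=5):
    """Return warnings for one row of the Client Monitor columns.

    `idle_threshold` and `running_threshold` are hypothetical values for
    "big" and "small".
    """
    warnings = []
    if user_idle >= idle_threshold and user_running <= running_threshold:
        # Jobs are queued up but are getting little or no computing power.
        warnings.append("user jobs are queued but not running")
    if registered != unmatched + claimed:
        warnings.append("Registered does not equal Unmatched + Claimed")
    return warnings
```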
Topic revision: r3 - 2012/09/18 - 22:06:47 - JeffreyDost