New Monitoring Requirements for CMS T2

These monitoring components are run out of RSV probes.

* Availability

The unfortunate fact of the matter is that the WLCG has different measures of availability than CMS does; the latter are more focused on whether a CMS user can get jobs through. We've been more focused on the CMS end, in part because it is only recently that the OSG RSV system for reporting to the WLCG has been available. But we have to watch both, I'm afraid. (Good news -- as discussed at Purdue, there are hooks to get the RSV info into a Nagios console.)

So what should we be looking for in the RSV probes? A Web page to look at is

(You may need to go to the "Tests Displayed" box and choose all of them to see everything.) It shows the results of all the RSV probes for all OSG sites that are currently reporting. Note that there are only four probes that are currently considered "critical" -- they are indicated in the table in the upper right. Rob says that some of the tests are still being shaken down -- in particular, the test that Purdue is being warned on right now might be too "tight." However, errors should be taken seriously, especially for the critical tests. (We are not penalized for warnings, only errors.)
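One way to triage the probe results described above is sketched here: errors on critical probes first, other errors next, warnings last. This is only an illustration. The status labels (`OK`, `WARNING`, `ERROR`), the probe names, and the `summarize` helper are all made up for this example and are not taken from RSV itself.

```python
from collections import Counter

# Hypothetical status labels; the real RSV output may use different names.
OK, WARNING, ERROR = "OK", "WARNING", "ERROR"


def summarize(results, critical_probes):
    """Split probe results into what needs action now and what to watch.

    results: dict mapping probe name -> status string.
    critical_probes: set of probe names treated as critical.
    """
    counts = Counter(results.values())
    # Errors on critical probes are the ones to fix first.
    critical_errors = sorted(
        name for name, status in results.items()
        if status == ERROR and name in critical_probes
    )
    # Errors on the remaining probes still deserve attention.
    other_errors = sorted(
        name for name, status in results.items()
        if status == ERROR and name not in critical_probes
    )
    # Warnings carry no penalty, but they may hint at a coming error.
    warnings = sorted(
        name for name, status in results.items() if status == WARNING
    )
    return {
        "counts": dict(counts),
        "critical_errors": critical_errors,
        "other_errors": other_errors,
        "warnings": warnings,
    }


if __name__ == "__main__":
    # Made-up probe names and results, for illustration only.
    results = {
        "probe-a": OK,
        "probe-b": WARNING,
        "probe-c": ERROR,
        "probe-d": ERROR,
    }
    summary = summarize(results, critical_probes={"probe-a", "probe-c"})
    print("Fix first:", summary["critical_errors"])
    print("Fix next:", summary["other_errors"])
    print("Watch:", summary["warnings"])
```

The set of critical probes is passed in rather than hard-coded, since the list shown in the table on the Web page may change as the tests are shaken down.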

Want to learn more about the tests? See <>. These probes can actually be run from the command line, so you can easily see what your output is and debug in real time (unlike CMS SAM tests, it seems).

Want to see the history of how your site has been doing? See <>. The interface for this is unfortunately difficult. In the left bar, you want to choose "SAM Test Results," and then choose your site out of the menu of Tier-2 Sites. The menu is not ordered in any clear way. Click "Display Graphs" to then see what is up; some of the resulting graphs are clickable for further information.

Finally, in all this, there is one more action item: Rob has released some new probes. Here's what he said:

> The newest tarball of probes is located at Can you please take a look and have the Tier 2 admins do some testing. I think at first running them from the command line and sending any issue to would be a good start. The two specific probes we are concerned about for SEs are srm-ping-probe and srmcp-srm-probe. Usage can be found using the --help switch.

(Pick up the most recent tarball from that page, of course.) These new SRM tests are not in production yet, but Rob wants to get them there. Feedback would be helpful.
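For a first pass at the command-line testing Rob asks for, a small wrapper like the following could collect the `--help` output of the two SRM probes. It is a rough sketch: the probe names and the `--help` switch come from Rob's note, but the `./` location (the directory where the tarball was unpacked) and the `run_probe` helper are assumptions.

```python
import subprocess
import sys


def run_probe(command, timeout=120):
    """Run one probe command; return (exit_code, combined stdout/stderr text).

    exit_code is None if the command could not be started or timed out.
    """
    try:
        proc = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # keep errors inline with normal output
            universal_newlines=True,   # decode bytes to str
            timeout=timeout,
        )
    except OSError as exc:
        # Covers a missing file, a non-executable file, and similar problems.
        return None, "could not run {}: {}".format(command[0], exc)
    except subprocess.TimeoutExpired:
        return None, "timed out after {} s".format(timeout)
    return proc.returncode, proc.stdout


if __name__ == "__main__":
    # Probe names are from Rob's note; the ./ prefix assumes this script is
    # run from the directory where the tarball was unpacked.
    for probe in ("./srm-ping-probe", "./srmcp-srm-probe"):
        code, output = run_probe([probe, "--help"])
        if code is None:
            print("{}: {}".format(probe, output), file=sys.stderr)
            continue
        print("== {} (exit code {})".format(probe, code))
        print(output)
```

Once the usage text is in hand, the same `run_probe` call can be given whatever arguments the help output documents, and the captured text can go straight into the feedback to Rob.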

I will attempt to highlight our usage and availability stats at future T2 meetings (modulo the ease of the display tools). Please let me know if you have questions, and thanks for your help in this important matter. Best wishes.


-- TerrenceMartin - 16 Apr 2008

Topic revision: r1 - 2008/04/16 - 16:53:37 - TerrenceMartin