UCSD T2 Nagios Administration and Configuration

Introduction

This document covers various aspects of the UCSD CMS T2 Nagios installation and configuration.

Important Configuration Files

The following files are used to configure Nagios at UCSD. The UCSD T2 Nagios configuration is split into a series of files grouped by Nagios object type and then further grouped according to the structure of the UCSD CMS T2 center.

cgi.cfg
checkcommands.cfg
commands.cfg
hostgroup-gftpservers.cfg
hostgroup-nodes.cfg
hostgroup-servers.cfg
hosts-gftpservers.cfg
hosts-nodes.cfg
hosts-servers.cfg
htpasswd.users
minimal.cfg
nagios.cfg
nsca.cfg
send_nsca.cfg
services-gftpservers.cfg
services-nodes.cfg
services-servers.cfg

Adding host services

Relevant cfg files for adding services: services-*.cfg, checkcommands.cfg, commands.cfg, and possibly minimal.cfg.

Before nagios can monitor a service, it first needs to know how to check it. Service check commands are declared to nagios in commands.cfg and checkcommands.cfg. In our current setup not every plugin command is declared in commands.cfg, so commands will need to be added as you go.

Commands.cfg example entry

define command{
        command_name    check_slash_free
        command_line $USER1$/check_by_ssh -i /var/ssh/nagios-key -l root -H $HOSTADDRESS$ -C '$USER1$/check_disk  -w $ARG1$ -c $ARG2$ -p /'
        }
This command uses one plugin (check_by_ssh) to log in to the host and then run a second plugin (check_disk) there.

Checkcommands.cfg example entry

define command{
        command_name    check_slash_free
        command_line $USER1$/check_by_ssh -i /var/ssh/nagios-key -l root -H $HOSTADDRESS$ -C '$USER1$/check_disk -w $ARG1$ -c $ARG2$ -p /'
        }
The corresponding entry in checkcommands.cfg should simply be copied over from commands.cfg.
**If your command in commands.cfg combines several checks, as in the example, you will need to add it to checkcommands.cfg yourself, as it is not there by default.
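
Since a command has to be present in both files, a quick check before touching the service files can save a round trip. The following is only a minimal sketch and not part of the UCSD setup: command_defined is a hypothetical helper name, and it assumes each command_name sits on its own line as in the examples above.

```shell
# Hypothetical helper: succeed only if every listed cfg file defines the command.
# Usage: command_defined NAME FILE...
command_defined() (
    name=$1
    shift
    for cfg in "$@"; do
        # Look for a "command_name <name>" line, allowing leading whitespace
        # and an optional trailing ";" comment.
        grep -Eq "^[[:space:]]*command_name[[:space:]]+${name}([[:space:];]|\$)" "$cfg" || {
            echo "missing in $cfg: $name" >&2
            exit 1
        }
    done
)

# Example call (paths as used elsewhere on this page):
#   command_defined check_slash_free /etc/nagios/commands.cfg /etc/nagios/checkcommands.cfg
```

The helper prints nothing and returns 0 when the command is found everywhere; otherwise it names the first file that lacks it.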

The automated check instructions come from services-*.cfg, so choose the cfg file appropriate to the hosts you'd like to check. Active check templates are already in place and are the recommended starting point; the specifics of a service check can be altered through the list of options shown below.

Example service entry using template

 
define service{
        use                    generic-service         ; use the generic template
        host_name              t2sentry0.t2.ucsd.edu   ; host the check will be run on
        service_description    SSH
        is_volatile            0                       ; this option is best left turned off
        check_period           24x7                    ; see "time periods" in minimal.cfg for other check periods, or to create your own
        max_check_attempts     4                       ; number of times a non-ok state is checked before notification
        normal_check_interval  5                       ; minutes between regular checks (retry_check_interval applies while a non-ok state is being rechecked)
        retry_check_interval   1                       ; minutes between check attempts for a non-ok state
        contact_groups         admins                  ; groups to notify; see "contact groups" in minimal.cfg for members
        notification_options   w,u,c,r                 ; notify on w=warning, u=unknown, c=critical, r=recovery from a non-ok state (f=start and stop of flapping is also available)
        notification_interval  960                     ; notifications are re-sent every 16 hours while the service is still in a non-ok state
        notification_period    24x7                    ; notifications are sent only during this period; see "time periods" in minimal.cfg
        check_command          check_ssh!30            ; command name from checkcommands.cfg, with its options separated by ! IN ORDER OF APPEARANCE ON THE COMMAND LINE
        }

Example check_command in a service definition with the corresponding entry in commands.cfg


define service{ 
...
check_command                   check_slash_free!30%!20%
...
}


from commands.cfg:

define command{
        command_name    check_slash_free
        command_line $USER1$/check_by_ssh -i /var/ssh/nagios-key -l root -H $HOSTADDRESS$ -C '$USER1$/check_disk -w $ARG1$ -c $ARG2$ -p /'
        }
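
To make the argument order concrete, the sketch below mimics how the !-separated values of check_command end up in $ARG1$, $ARG2$, and so on. It is only an illustration written for this page (expand_check_command is a made-up name). Nagios does this substitution internally and also fills in macros such as $USER1$ and $HOSTADDRESS$, which the sketch leaves untouched.

```shell
# Illustration only: substitute the !-separated arguments of a check_command
# value into the $ARGn$ macros of a command_line template.
# Usage: expand_check_command CHECK_COMMAND COMMAND_LINE
# Limitation: arguments containing "|", "&" or "\" would confuse the sed call.
expand_check_command() (
    check=$1
    line=$2
    # Drop the command name, i.e. everything up to the first "!".
    case $check in
        *!*) args=${check#*!} ;;
        *)   args= ;;
    esac
    n=1
    while [ -n "$args" ]; do
        # Peel off the next argument.
        case $args in
            *!*) arg=${args%%!*}; args=${args#*!} ;;
            *)   arg=$args; args= ;;
        esac
        # Replace every literal $ARGn$ with the argument value.
        line=$(printf '%s\n' "$line" | sed "s|\\\$ARG$n\\\$|$arg|g")
        n=$((n + 1))
    done
    printf '%s\n' "$line"
)

expand_check_command 'check_slash_free!30%!20%' 'check_disk -w $ARG1$ -c $ARG2$ -p /'
# prints: check_disk -w 30% -c 20% -p /
```

So in the service entry above, 30% becomes the warning threshold and 20% the critical threshold, purely because of their position.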

If the generic active check template does not offer enough flexibility, remove the line "use generic-service" and add the options that the template would otherwise supply (listed below) to the service definition above.

Example full service definition

 
define service{
        active_checks_enabled           1       ; Active service checks are enabled
        passive_checks_enabled          1       ; Passive service checks are enabled/accepted
        parallelize_check               1       ; Active service checks should be parallelized (disabling this can lead to major performance problems)
        obsess_over_service             1       ; We should obsess over this service (if necessary)
        check_freshness                 0       ; Default is to NOT check service 'freshness'
        notifications_enabled           1       ; Service notifications are enabled
        event_handler_enabled           1       ; Service event handler is enabled
        flap_detection_enabled          1       ; Flap detection is enabled
        failure_prediction_enabled      1       ; Failure prediction is enabled
        process_perf_data               1       ; Process performance data
        retain_status_information       1       ; Retain status information across program restarts
        retain_nonstatus_information    1       ; Retain non-status information across program restarts
        register                        1       ;register the service with nagios
        }

Once the service is added, make sure that the check_command is properly formatted and is itself present in both commands.cfg and checkcommands.cfg. Then go to the command line (while still in /etc/nagios) and run:

 
# nagios -v nagios.cfg
This will run through the nagios cfg files and simulate a start, weeding out any blatant errors and specifying which cfg files have mistakes. If there are no errors the last thing to do is run:
# /etc/rc.d/init.d/nagios restart 
The final check is to take a look at the web interface http://t2sentry0.t2.ucsd.edu/nagios and see if the service has shown up under the services tab with the appropriate host. It generally takes nagios a minute or two to update the web interface, and the checks will proceed at the specified intervals.
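
The verify and restart steps can be chained so that a broken configuration never reaches the running daemon. This is a sketch under assumptions, not an existing script at UCSD: verify_and_restart and the NAGIOS_ETC, NAGIOS_VERIFY, and NAGIOS_RESTART override variables are hypothetical names introduced here, and it assumes nagios -v exits non-zero when it finds errors.

```shell
# Sketch: restart nagios only if the configuration check passes.
# The defaults are the two commands given above; the NAGIOS_* variables are
# hypothetical overrides added for this example (handy for a dry run).
verify_and_restart() (
    verify=${NAGIOS_VERIFY:-"nagios -v nagios.cfg"}
    restart=${NAGIOS_RESTART:-"/etc/rc.d/init.d/nagios restart"}
    cd "${NAGIOS_ETC:-/etc/nagios}" || exit 1
    # $verify and $restart are left unquoted on purpose so they split into
    # a command plus its arguments.
    if $verify; then
        $restart
    else
        echo "configuration check failed; not restarting" >&2
        exit 1
    fi
)
```

Run verify_and_restart as root. Setting NAGIOS_RESTART="echo would restart" gives a dry run that only performs the check.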

Contacts for a service

The role of minimal.cfg in all of this is to specify the contacts for notifications. If it doesn't make sense for the nagios admins to get a notification, the proper person needs to be added as a contact, and that contact then needs to be added to a contact group.

  • Minimal.cfg example:
*Under the contacts heading is where definitions should be placed.*

define contact{
        contact_name                    nagios-admin
        alias                           Nagios Admin
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    w,u,c,r
        host_notification_options       d,r                    ;host notification options are: d=down, r=recovery from u/d state, u = unreachable, f=flapping goes on/off 
        service_notification_commands   notify-by-email        ;other options can be found in checkcommands (they will need to be added to commands.cfg)
        host_notification_commands      host-notify-by-email   ;other options can be found in checkcommands (they will need to be added to commands.cfg)
        email                           tmartin@physics.ucsd.edu
        }

*Under the contact groups heading is where group definitions should be placed.*

define contactgroup{
        contactgroup_name       admins
        alias                   Nagios Administrators
        members                 tester, nagios-admin
        }

If necessary, groups can be created to notify certain people about different components of the cluster (e.g. a dcache or PhEDEx group).
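
When several groups exist, it is easy to lose track of who actually receives mail for a given contact_groups value. The sketch below lists the members of one group; contactgroup_members is a hypothetical helper, and it assumes group definitions are laid out one directive per line as in the example above.

```shell
# Hypothetical helper: print the members of one contact group.
# Usage: contactgroup_members GROUP CFGFILE
contactgroup_members() (
    group=$1
    cfg=$2
    awk -v group="$group" '
        # Start of a contactgroup definition: reset what we know so far.
        /^[ \t]*define[ \t]+contactgroup[ \t]*[{]/ { in_def = 1; name = ""; members = ""; next }
        in_def && $1 == "contactgroup_name" { name = $2 }
        in_def && $1 == "members" { $1 = ""; members = $0 }
        # End of the definition: print the member list if this is the group asked for.
        in_def && /[}]/ {
            if (name == group) { gsub(/[ \t]/, "", members); print members }
            in_def = 0
        }
    ' "$cfg"
)

# Example call:
#   contactgroup_members admins /etc/nagios/minimal.cfg
```

For the admins group defined above this prints tester,nagios-admin.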

Recipe for Adding Host Services

For this example we'll add the service "Root Partition," which will check disk usage for the root directory.

1. First take a look at /etc/nagios/checkcommands.cfg:

# 'check_local_disk' command definition
define command{
        command_name    check_local_disk
        command_line    $USER1$/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$
        }

This is the command I want to use, taking advantage of the different options available to check_disk. I want to be able to specify the pathname to check, so the command line takes the warning threshold, the critical threshold, and the pathname as arguments. (Values for $ARG1$ through $ARG3$ are defined in the service definition.)
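
Before wiring a command into nagios it can help to run the plugin by hand with the values planned for $ARG1$ through $ARG3$ and see which state it reports. The sketch below assumes the standard plugin exit-code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). plugin_state is a hypothetical helper, and PLUGIN_DIR in the usage comment stands in for whatever $USER1$ is set to on your installation.

```shell
# Hypothetical helper: run a plugin command line, show its output, then print
# the state that its exit code maps to.
# Usage: plugin_state COMMAND [ARGS...]
plugin_state() {
    "$@"
    case $? in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        3) echo UNKNOWN ;;
        *) echo "unexpected exit code" ;;
    esac
}

# Example call (PLUGIN_DIR is an assumption, see above):
#   plugin_state "$PLUGIN_DIR"/check_disk -w 20% -c 10% -p /
```

If the hand-run result does not match what you expect for the chosen thresholds, fix that first; nagios will only ever see the same exit code.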

2. Now that the check command is in place, open up /etc/nagios/commands.cfg:


# Command used to check disk space usage on local partitions

define command{
   command_name   check_local_disk
   command_line   $USER1$/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$
   }

The commands.cfg entry can be copied directly from checkcommands.cfg.

3. Now that there is a proper command that nagios can process, open the appropriate services-*.cfg file. In this case we want the service on t2sentry0.t2.ucsd.edu, so /etc/nagios/services-servers.cfg is needed.

 
# Define a service to check the disk space of the root partition
# on the local machine.  Warning if < 20% free, critical if
# < 10% free space on partition.

define service{
        use                             generic-service         ; Name of service template to use
        host_name                       t2sentry0.t2.ucsd.edu
        service_description             Root Partition
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              4
        normal_check_interval           5
        retry_check_interval            1
        contact_groups                  admins
        notification_options            w,u,c,r
        notification_interval           960
        notification_period             24x7
        check_command                   check_local_disk!20%!10%!/
        }

This is the entry added to /etc/nagios/services-servers.cfg. The generic-service template is used (see the section above), and nagios will refer to the service with these arguments as "Root Partition." The check period is the standard 24x7. If the initial check returns a non-ok state, nagios will make up to 4 check attempts before notifying; the normal check interval is 5 minutes and the time between rechecks is 1 minute. Notifications are sent on warning, unknown, critical, and recovery states. The notification interval is 16 hours, and notifications go out 24x7. The most important part of the definition is the check_command, which is the service check that gets executed. This one checks the disk usage of the / directory, raising a warning if less than 20 percent is free and a critical state if less than 10 percent is free.
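
As a worked example of the timing, the delay between the first non-ok result and the first notification follows from max_check_attempts and retry_check_interval. The arithmetic below is a sketch that assumes the first non-ok result counts as attempt 1 and ignores the time the checks themselves take. minutes_until_notification is a name made up for this page.

```shell
# Minutes from the first non-ok check result to the first notification.
# Usage: minutes_until_notification MAX_CHECK_ATTEMPTS RETRY_CHECK_INTERVAL
minutes_until_notification() {
    # The first failing check is attempt 1, so only (max - 1) retries remain.
    echo $(( ($1 - 1) * $2 ))
}

minutes_until_notification 4 1   # prints 3
```

With the values used here a problem is reported about 3 minutes after nagios first sees it. Since regular checks run every 5 minutes, the problem itself may be up to 5 minutes older than that.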

4. To find any obvious errors, run (while still in the directory /etc/nagios):

# nagios -v nagios.cfg

If there are no errors, the proper number of services/commands/hosts...etc should all be processed, otherwise a helpful error message will pop up stating which cfg file didn't cut the mustard.

5. If everything checks out, all that's left to do is run:

# /etc/rc.d/init.d/nagios restart

After a web page refresh, the service will show up in the web interface, and nagios will start checking it and sending notifications as configured.

Setting Host Thresholds

Setting Service Thresholds

Adding Passive Service Checks to Nagios

If you are looking for how to write a passive check script for Nagios, see the separate documentation on passive check scripts; otherwise read on for adding passive checks to Nagios.

Installing Nagios client software on a target system

Creating an RPM Package of Common Nagios Scripts

Authors

-- BruceThayre - 03 Oct 2006

-- TerrenceMartin - 02 Oct 2006

Topic revision: r5 - 2006/10/27 - 08:25:10 - BruceThayre
 