UCSD T2 Nagios Administration and Configuration
Introduction
This document covers various aspects of the UCSD CMS T2 Nagios installation and configuration.
Important Configuration Files
The following files are used to configure Nagios at UCSD. The UCSD T2 Nagios configuration uses a series of files grouped by Nagios object type and then further grouped based on the structure of the UCSD CMS T2 center.
cgi.cfg
checkcommands.cfg
commands.cfg
hostgroup-gftpservers.cfg
hostgroup-nodes.cfg
hostgroup-servers.cfg
hosts-gftpservers.cfg
hosts-nodes.cfg
hosts-servers.cfg
htpasswd.users
minimal.cfg
nagios.cfg
nsca.cfg
send_nsca.cfg
services-gftpservers.cfg
services-nodes.cfg
services-servers.cfg
Adding host services
Relevant cfg files for adding services: services-*.cfg, checkcommands.cfg, commands.cfg, and possibly minimal.cfg.
Before Nagios can monitor a service, it first needs to know about the command that checks it.
commands.cfg and checkcommands.cfg are where service check commands are declared to Nagios.
With our current setup of Nagios, not every plugin command is declared in commands.cfg, so they will need to be added as you go.
Commands.cfg example entry
define command{
command_name check_slash_free
command_line $USER1$/check_by_ssh -i /var/ssh/nagios-key -l root -H $HOSTADDRESS$ -C '$USER1$/check_disk -w $ARG1$ -c $ARG2$ -p /'
}
This command uses one command (check_by_ssh) to log in to the host and then run a second command (check_disk) there.
Checkcommands.cfg example entry
define command{
command_name check_slash_free
command_line $USER1$/check_by_ssh -i /var/ssh/nagios-key -l root -H $HOSTADDRESS$ -C '$USER1$/check_disk -w $ARG1$ -c $ARG2$ -p /'
}
The corresponding entry in checkcommands.cfg should simply be copied over from commands.cfg.
Note: if your command in commands.cfg combines several checks, as in the example, you will need to add it to checkcommands.cfg because it is not there by default.
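Because the same definition has to exist in both files, a quick comparison can catch a command that was added to one file but not the other. The following is a minimal Python sketch, not part of the UCSD setup: the function names are hypothetical, and it assumes each definition closes with a `}` on its own line, as in the examples above.

```python
import re


def command_names(cfg_text):
    """Return the set of command_name values in 'define command{...}' blocks."""
    names = set()
    # Each block runs from 'define command{' to a closing brace at the start of a line.
    for block in re.findall(r"define\s+command\s*\{(.*?)^\s*\}", cfg_text, re.S | re.M):
        match = re.search(r"^\s*command_name\s+(\S+)", block, re.M)
        if match:
            names.add(match.group(1))
    return names


def missing_from(commands_text, checkcommands_text):
    """Return command names declared in the first text but absent from the second."""
    return sorted(command_names(commands_text) - command_names(checkcommands_text))
```

To use it on the real files, read /etc/nagios/commands.cfg and /etc/nagios/checkcommands.cfg into strings and pass them to `missing_from`; an empty list means every command in the first file is also in the second.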
The automated check instructions come from services-*.cfg, so choose the cfg file appropriate to the hosts you would like to check.
Active check templates are already in place and using them is recommended; the specifics of a service check can be altered through the options listed in the example below.
Example service entry using template
define service{
use generic-service <--- use the generic template
host_name t2sentry0.t2.ucsd.edu <--- host the check will be run on
service_description SSH
is_volatile 0 <--- this option is best left turned off
check_period 24x7 <--- see minimal.cfg under "time periods" for other check_periods, or to create your own
max_check_attempts 4 <--- number of times a non-ok state is checked before notification
normal_check_interval 5 <--- a check happens every 5 minutes while the service is ok, and again at this interval once max_check_attempts has been reached in a non-ok state
retry_check_interval 1 <--- minutes between check attempts for a non-ok state
contact_groups admins <--- groups to notify; see minimal.cfg under "contact groups" for members
notification_options w,u,c,r <--- notifications are sent on: w=warning state, u=unknown state, c=critical state, r=recovery from non-ok state, f=start and stop of flapping state
notification_interval 960 <--- notifications are sent every 16 hours (960 minutes) while the service is still in a non-ok state
notification_period 24x7 <--- notifications are sent only during the specified period; see minimal.cfg under "time periods"
check_command check_ssh!30 <--- see checkcommands.cfg for the command alias; options are separated by ! and given IN ORDER OF APPEARANCE ON THE COMMAND LINE
}
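The interval options in this definition interact, so a little arithmetic helps when choosing values. The Python sketch below assumes the usual Nagios soft/hard state handling (the first non-ok result counts as attempt 1, and the remaining attempts follow at retry_check_interval) and ignores scheduling latency. The function names are made up for illustration.

```python
def minutes_until_first_notification(max_check_attempts, retry_check_interval):
    """Minutes from the first non-ok result until the last retry confirms the problem."""
    # The first failing check is attempt 1; only the remaining attempts are retries.
    return (max_check_attempts - 1) * retry_check_interval


def notification_offsets(notification_interval, count):
    """Minutes after the first notification at which repeat notifications are sent."""
    return [i * notification_interval for i in range(1, count + 1)]


# Values from the service definition above.
delay = minutes_until_first_notification(max_check_attempts=4, retry_check_interval=1)
repeats_in_hours = [m / 60 for m in notification_offsets(960, 2)]
print(delay, repeats_in_hours)
```

With the values above, the problem is confirmed 3 minutes after the first non-ok result, and repeat notifications follow 16 and 32 hours after the first one if the service has not recovered.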
Example check_command in a service definition with the corresponding entry in commands.cfg
define service{
...
check_command check_slash_free!30%!20%
...
}
From commands.cfg:
define command{
command_name check_slash_free
command_line $USER1$/check_by_ssh -i /var/ssh/nagios-key -l root -H $HOSTADDRESS$ -C '$USER1$/check_disk -w $ARG1$ -c $ARG2$ -p /'
}
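The link between the check_command value and the command_line can be sketched in a few lines of Python. This only illustrates the substitution rule (the text before the first ! is the command name, and the remaining fields fill $ARG1$, $ARG2$, and so on in order); it is not how Nagios itself is implemented. The plugin directory and host address used for $USER1$ and $HOSTADDRESS$ are placeholder values.

```python
def expand_check_command(check_command, command_line, macros):
    """Return (command_name, command_line with $ARGn$ and other macros filled in)."""
    name, *args = check_command.split("!")
    expanded = command_line
    # Arguments are numbered from 1 in the order they appear after the name.
    for number, value in enumerate(args, start=1):
        expanded = expanded.replace(f"$ARG{number}$", value)
    # Remaining macros such as $USER1$ and $HOSTADDRESS$ come from the caller.
    for macro, value in macros.items():
        expanded = expanded.replace(f"${macro}$", value)
    return name, expanded


COMMAND_LINE = (
    "$USER1$/check_by_ssh -i /var/ssh/nagios-key -l root -H $HOSTADDRESS$ "
    "-C '$USER1$/check_disk -w $ARG1$ -c $ARG2$ -p /'"
)
# Placeholder macro values, chosen only for this example.
MACROS = {"USER1": "/usr/lib/nagios/plugins", "HOSTADDRESS": "192.0.2.10"}

name, line = expand_check_command("check_slash_free!30%!20%", COMMAND_LINE, MACROS)
print(name)
print(line)
```

Here 30% lands in the -w (warning) slot and 20% in the -c (critical) slot, which is why the order of the !-separated fields matters.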
If the generic active check template does not offer enough flexibility, remove the line "use generic-service" and add the options that the template would otherwise supply to the definition above.
Example options to add for a full service definition
define service{
active_checks_enabled 1 ; Active service checks are enabled
passive_checks_enabled 1 ; Passive service checks are enabled/accepted
parallelize_check 1 ; Active service checks should be parallelized (disabling this can lead to major performance problems)
obsess_over_service 1 ; We should obsess over this service (if necessary)
check_freshness 0 ; Default is to NOT check service 'freshness'
notifications_enabled 1 ; Service notifications are enabled
event_handler_enabled 1 ; Service event handler is enabled
flap_detection_enabled 1 ; Flap detection is enabled
failure_prediction_enabled 1 ; Failure prediction is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
retain_nonstatus_information 1 ; Retain non-status information across program restarts
register 1 ; Register the service with Nagios
}
Once the service is added, ensure that the check_command is properly formatted and is itself present in both commands.cfg and checkcommands.cfg.
Then go to the command line (while still in /etc/nagios) and run:
# nagios -v nagios.cfg
This runs through the Nagios cfg files and simulates a start, weeding out any blatant errors and specifying which cfg files have mistakes. If there are no errors, the last thing to do is run:
# /etc/rc.d/init.d/nagios restart
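The two commands can be chained so that a restart is never attempted with a broken configuration. The sketch below is one way to do that in Python, under the assumption that `nagios -v` exits with a non-zero status when it finds errors. The function name and the `run` parameter are hypothetical; `run` exists only so the logic can be exercised without touching a live Nagios.

```python
import subprocess


def verify_then_restart(run=subprocess.call,
                        config="nagios.cfg",
                        init_script="/etc/rc.d/init.d/nagios"):
    """Restart Nagios only if the configuration check succeeds.

    Returns True if both the check and the restart exited with status 0.
    """
    # Pre-flight check, the same command as run by hand above.
    if run(["nagios", "-v", config]) != 0:
        return False
    return run([init_script, "restart"]) == 0
```

Called with no arguments from /etc/nagios as root, it runs the same two commands as above and returns False without restarting if the check fails.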
The final check is to look at the web interface at http://t2sentry0.t2.ucsd.edu/nagios and see whether the service has shown up under the services tab with the appropriate host. It generally takes Nagios a minute or two to update the web interface, and the checks will proceed at the specified intervals.
Contacts for a service
The minimal.cfg file specifies the contacts for notifications. If it doesn't make sense for the Nagios admins to receive a notification, the proper person needs to be added as a contact, and that contact then needs to be added to a contact group.
Contact definitions should be placed under the contacts heading:
define contact{
contact_name nagios-admin
alias Nagios Admin
service_notification_period 24x7
host_notification_period 24x7
service_notification_options w,u,c,r
host_notification_options d,r ; host notification options are: d=down, u=unreachable, r=recovery from down/unreachable state, f=start and stop of flapping
service_notification_commands notify-by-email ; other options can be found in checkcommands.cfg (they will need to be added to commands.cfg)
host_notification_commands host-notify-by-email ; other options can be found in checkcommands.cfg (they will need to be added to commands.cfg)
email tmartin@physics.ucsd.edu
}
Contact group definitions should be placed under the contact groups heading:
define contactgroup{
contactgroup_name admins
alias Nagios Administrators
members tester, nagios-admin
}
If necessary, groups can be created to notify certain people about different components of the cluster (e.g. a dCache or PhEDEx group).
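The chain from a service's contact_groups value to the people who actually receive mail can be traced with a small lookup. The Python sketch below mirrors the definitions above as plain dictionaries. The "dcache" group, the "dcache-admin" contact, and all email addresses are hypothetical placeholders, and "tester" is deliberately left without a contact entry to show how a group member with no contact definition is reported.

```python
# Hypothetical data mirroring 'define contact' and 'define contactgroup' entries.
CONTACT_EMAILS = {
    "nagios-admin": "admin@example.org",
    "dcache-admin": "dcache@example.org",
}
CONTACT_GROUPS = {
    "admins": ["tester", "nagios-admin"],
    "dcache": ["dcache-admin"],
}


def recipients(contact_groups, groups=CONTACT_GROUPS, emails=CONTACT_EMAILS):
    """Return (emails to notify, group members that have no contact definition)."""
    found, undefined = [], []
    for group in (g.strip() for g in contact_groups.split(",")):
        for member in groups.get(group, []):
            if member not in emails:
                undefined.append(member)
            elif emails[member] not in found:
                found.append(emails[member])
    return found, undefined


print(recipients("admins"))
print(recipients("admins,dcache"))
```

A non-empty second list points at a member that still needs its own contact definition before the group will work as intended.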
Recipe for Adding Host Services
For this example we'll add the service "Root Partition," which will check disk usage for the root directory.
1. First, take a look at /etc/nagios/checkcommands.cfg:
# 'check_local_disk' command definition
define command{
command_name check_local_disk
command_line $USER1$/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$
}
This is the command I want to use, since it takes advantage of the different options available to check_disk. I want to be able to specify the pathname to check, so the command line takes the warning threshold, the critical threshold, and the pathname as arguments. (The service definition supplies the values for $ARG1$ through $ARG3$.)
2. Now that the check command is in place, open up /etc/nagios/commands.cfg:
# Command used to check disk space usage on local partitions
define command{
command_name check_local_disk
command_line $USER1$/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$
}
The commands.cfg entry can be copied directly from checkcommands.cfg.
3. Now that there is a proper command that Nagios can process, open the appropriate services-*.cfg file. In this case we want the check on t2sentry0.t2.ucsd.edu, so /etc/nagios/services-servers.cfg is needed.
# Define a service to check the disk space of the root partition
# on the local machine. Warning if < 20% free, critical if
# < 10% free space on partition.
define service{
use generic-service ; Name of service template to use
host_name t2sentry0.t2.ucsd.edu
service_description Root Partition
is_volatile 0
check_period 24x7
max_check_attempts 4
normal_check_interval 5
retry_check_interval 1
contact_groups admins
notification_options w,u,c,r
notification_interval 960
notification_period 24x7
check_command check_local_disk!20%!10%!/
}
This is the entry added to /etc/nagios/services-servers.cfg. The generic-service template is used (see the section above). Nagios will refer to the service with the specified arguments as "Root Partition." The check period is set to the standard 24x7; Nagios will check 4 times before notifying if the initial check returns a non-ok state; the check interval is 5 minutes, and the time between attempted rechecks is 1 minute. We want notifications sent out on a warning state, unknown state, critical state, and recovered state. The notification interval is set to 16 hours, and notifications will be sent out 24x7.
The most important thing in the definition is the check_command, which is the service check that is executed. This one checks the disk usage of the / directory, sending out a warning notification if less than 20 percent is free and a critical notification if less than 10 percent is free.
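The threshold logic behind `check_local_disk!20%!10%!/` can be illustrated as follows. This Python sketch is not the check_disk plugin itself, which may count reserved blocks differently; it only shows how a free-space percentage maps onto the standard plugin return codes (0 = OK, 1 = WARNING, 2 = CRITICAL). The function names are made up for illustration.

```python
import shutil

OK, WARNING, CRITICAL = 0, 1, 2


def free_percent(path):
    """Percentage of the filesystem holding 'path' that is currently free."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def disk_state(free_pct, warn_pct, crit_pct):
    """Map a free-space percentage onto a plugin return code."""
    # Test the critical threshold first, since it is the stricter of the two.
    if free_pct < crit_pct:
        return CRITICAL
    if free_pct < warn_pct:
        return WARNING
    return OK
```

For example, `disk_state(free_percent("/"), 20, 10)` applies the same thresholds as the service definition above to the local root partition.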
4. To find any obvious errors I need to run (while still in the directory /etc/nagios):
# nagios -v nagios.cfg
If there are no errors, the proper number of services, commands, hosts, etc. should all be processed; otherwise an error message will state which cfg file didn't cut the mustard.
5. If everything checks out, all that's left to do is run:
# /etc/rc.d/init.d/nagios restart
After a web page refresh, the service will show up in the web interface, and the checks and notifications will begin.
Setting Host Thresholds
Setting Service Thresholds
Adding Passive Service Checks to Nagios
If you are looking for how to write a passive check script for Nagios, go here; otherwise, read on for adding passive checks to Nagios.
Installing Nagios client software on a target system
Creating an RPM Package of Common Nagios Scripts
Authors
-- BruceThayre - 03 Oct 2006
-- TerrenceMartin - 02 Oct 2006