Why and how to build a distributed monitoring solution with Nagios – Part 1

Aug 30 2011

This is an article of mine first published on Openlogic/Wazi

With Nagios, the leading open source infrastructure monitoring application, you can monitor your whole enterprise by using a distributed monitoring scheme in which local slave instances of Nagios perform monitoring tasks and report the results back to a single master. You manage all configuration, notification, and reporting from the master, while the slaves do all the work.

This design takes advantage of Nagios’s ability to utilize passive checks – that is, external applications or processes that send results back to Nagios. In a distributed configuration, these external applications are other instances of Nagios.
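To make the idea of a passive check concrete: Nagios accepts passive service results as PROCESS_SERVICE_CHECK_RESULT lines written to its external command file. The sketch below only builds such a line. The helper name passive_result_line and the sample output text are assumptions for illustration, and in the distributed setup described here the NSCA daemon writes these lines on the master for you.

```shell
#!/bin/bash
# Build a passive service check result in the external command format:
#   [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;output
# passive_result_line is a hypothetical helper name, not part of Nagios.
passive_result_line() {
    local timestamp="$1" host="$2" service="$3" code="$4" output="$5"
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$timestamp" "$host" "$service" "$code" "$output"
}

# Example: report the HTTP service on www.mysite.com as OK (return code 0).
passive_result_line "$(date +%s)" www.mysite.com HTTP 0 "HTTP OK"
```

The return code follows the usual plugin convention: 0 for OK, 1 for WARNING, 2 for CRITICAL, and 3 for UNKNOWN.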

Why use a distributed configuration? You might need a distributed solution if you have hosts and services on a separate LAN that are not reachable by your main Nagios instance. A distributed implementation also gives you enhanced privacy, in that users can be defined to access only certain services and hosts. You get centralized configuration and notification, giving you a single consolidated view of the status of all your devices and subnets.

A distributed solution is often called a master/slave solution. On the master you have a copy of every service that you want to check on the slaves, but the copy on the master has the active check disabled and notification enabled, while on the slaves both active and passive checks are enabled and notification is disabled.

A look at configuration files for a slave and a master, which in this case check a web page, helps illustrate the difference. You can usually find them in the Nagios configuration directory, often in /etc/nagios3/conf.d/services.cfg:

Slave Configuration:

# Generic service definition template
define service{
        name                            generic-service ;
        active_checks_enabled           1       ;
        passive_checks_enabled          1       ;
        parallelize_check               1       ;
        obsess_over_service             1       ;
        check_freshness                 0       ;
        notifications_enabled           0       ;
        event_handler_enabled           1       ;
        flap_detection_enabled          1       ;
        process_perf_data               1       ;
        register                0       ;
        } 

# Service definition for http
define service{
	  use				generic-service      ;
	  host_name			www.mysite.com	;
	  service_description 		HTTP	;
	  is_volatile 			0	;
	  check_period 			24x7	;
	  max_check_attempts 		3	;
	  normal_check_interval 	1	;
	  retry_check_interval 		5	;
	  contact_groups 		admins,webmaster	;
	  notification_options 		w,u,c,r	;
	  notification_interval 	960	;
	  notification_period 		never	;
	  check_command 		check_http	;
        }

Master Configuration:

# Generic service definition template
define service{
        name                           generic-service ;
        active_checks_enabled           0       ;
        passive_checks_enabled          1       ;
        parallelize_check               1       ;
        obsess_over_service             1       ;
        check_freshness                 0       ;
        notifications_enabled           1       ;
        event_handler_enabled           1       ;
        flap_detection_enabled          1       ;
        process_perf_data               1       ;
        register                        0       ;
        } 

# Service definition for http
define service{
          use				generic-service      ;
	  host_name 			www.mysite.com	;
	  service_description 		HTTP	;
	  is_volatile 			0	;
	  check_period 			24x7	;
	  max_check_attempts 		3	;
	  normal_check_interval 	1	;
	  retry_check_interval 		5	;
	  contact_groups 		admins,webmaster	;
	  notification_options 		w,u,c,r	;
	  notification_interval 	960	;
	  notification_period 		24x7	;
	  check_command 		check_http	;
        }

A good candidate to become a Nagios slave is one close (network-wise) to the services you want to monitor. A good master must be reachable by all of its slaves. The hardware resources required both for slaves and master depend on the number and kind of checks you do. Usually Nagios is not especially resource-hungry, so you should be able to monitor up to 5,000 services from any slave and collect 20,000 to 30,000 services on a master.

Once you have chosen the appropriate machines, follow the standard Nagios installation procedure on both types of Nagios hosts, editing the configuration files as above.

Connecting the Pieces

The basic concept behind connecting all the pieces is the passive service check, implemented with the Nagios Service Check Acceptor (NSCA). This tool runs a daemon on the master that waits for information about services. Slaves use the NSCA client, send_nsca, to send information about their services to the master. To do that, you specify in the main configuration file what Nagios calls an obsessive compulsive service processor (OCSP) command, which is executed after every service check. In its most basic form, the relevant part of the configuration file for a Nagios slave might look like this:

# /etc/nagios/nagios.cfg
. . .
obsess_over_services=1
ocsp_command=submit_service_check
ocsp_timeout=5
obsess_over_hosts=1
ochp_command=submit_host_check
ochp_timeout=5

You then have to define the ocsp_command submit_service_check in the /etc/nagios/conf.d/commands.cfg file like this:

define command{
        command_name    submit_service_check
        command_line    /usr/lib/nagios/plugins/submit_service_check.sh $HOSTNAME$ '$SERVICEDESC$' $SERVICESTATEID$ '$SERVICEOUTPUT$'
        }

Finally, you need the shell script you defined for the command_line. The following could be your /usr/lib/nagios/plugins/submit_service_check.sh; note that the -H option of send_nsca expects the address of your master, which you must supply:

#!/bin/bash
/usr/bin/printf "%s\t%s\t%s\t%s\n" "$1" "$2" "$3" "$4" | /usr/lib/nagios/plugins/send_nsca -H  -c /usr/lib/nagios/send_nsca.cfg
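The nagios.cfg excerpt above also names an ochp_command, submit_host_check, which needs its own definition and script. A minimal sketch follows, assuming a script path that mirrors the service one and a MASTER_HOST variable that you set yourself; both are illustrative choices, not something the original setup prescribes. For host results, send_nsca expects three tab-separated fields (host name, return code, plugin output) instead of four.

```shell
#!/bin/bash
# Hypothetical /usr/lib/nagios/plugins/submit_host_check.sh (name and path
# are assumptions mirroring the service script).
# Assumed matching command definition in commands.cfg:
#   define command{
#           command_name    submit_host_check
#           command_line    /usr/lib/nagios/plugins/submit_host_check.sh $HOSTNAME$ $HOSTSTATEID$ '$HOSTOUTPUT$'
#           }

# Format a host result the way send_nsca expects it:
# host name, return code, and plugin output separated by tabs.
format_host_result() {
    printf '%s\t%s\t%s\n' "$1" "$2" "$3"
}

# Send only when called with arguments, so the file can also be sourced.
# MASTER_HOST is a placeholder you must set to your master's address.
if [ "$#" -ge 3 ]; then
    format_host_result "$1" "$2" "$3" |
        /usr/lib/nagios/plugins/send_nsca -H "$MASTER_HOST" -c /usr/lib/nagios/send_nsca.cfg
fi
```

Like the service script, this starts a new process for every host check, so the same caveat about resource usage applies.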

While simple, this solution is inefficient and uses a lot of resources, because for every check made on every service the slave must start a new shell process and run a send_nsca command. Unless you have fewer than 100 services on the slave, don’t use it. Instead, try one of these more efficient alternatives:

OCSP Sweeper is a utility that runs on the slave and creates a FIFO queue to which OCSP events are sent. It reads the contents of the queue every N seconds and sends the data to the NSCA on the master.
With OCP_Daemon, Nagios writes host and service check data into a named pipe instead of running a command every time to send the information to the master. A daemon that polls the pipe takes care of sending the data to the master Nagios server.
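Both tools rest on the same idea: collect results locally and hand many of them to a single sender process, instead of starting one sender per check. The sketch below illustrates only that batching step. It is not the code of either tool, it uses a regular file rather than a FIFO for simplicity, and flush_queue, the queue path, and MASTER_HOST are hypothetical names.

```shell
#!/bin/bash
# Sketch of the batching idea: results are appended to a queue file and
# flushed through ONE sender process per interval.
# flush_queue and its arguments are hypothetical names for illustration.

# flush_queue QUEUE_FILE SENDER [ARGS...]
# Moves the queued lines aside and pipes them all into a single invocation
# of the sender command; does nothing when the queue is empty or missing.
flush_queue() {
    local queue="$1"; shift
    [ -s "$queue" ] || return 0
    local batch="${queue}.batch.$$"
    mv "$queue" "$batch" || return 1
    "$@" < "$batch"
    local status=$?
    rm -f "$batch"
    return "$status"
}

# Possible usage on a slave (not executed here; paths are assumptions):
#   while sleep 5; do
#       flush_queue /var/spool/nagios/ocsp.queue \
#           /usr/lib/nagios/plugins/send_nsca -H "$MASTER_HOST" -c /usr/lib/nagios/send_nsca.cfg
#   done
```

Because send_nsca reads one result per line from standard input, a single invocation can carry the whole batch, which is where the saving over the one-command-per-check approach comes from.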

If you use either alternative, edit the file /etc/nsca/nsca.conf on the master and set the aggregate_writes option for the NSCA daemon to 1. With this set, NSCA processes multiple results at a time, which gives you a small performance boost on the master.
