As a system administrator, one often have to do repetitive tasks such as checking for free disk space, check mail queues and monitor critical services. If there are only a handful of servers, this task may not be very intimidating, but there are many times when there are many servers to monitor, or just for the sake of automation. This is where Nagios comes in.
Nagios is a host and service monitor designed to inform you of network problems before your clients, end-users or managers do.
This is exactly what we need to make an automated system for monitoring! I will not go into details on how to set this up, since there is an excellent quick start guide available on the website. Instead I will focus on how Nagios has eased the burden of managing a large number of servers.
I have ready made templates for servers and when a new server is added, I just create a copy of the template and add or remove the services needed to monitor the server.
Public services are easy to monitor directly from Nagios, but private data such as disk space and CPU load demands a local service running on each of the servers. This is where NRPE comes into play. NRPE is a daemon which listens on the network and will respond to Nagios queries, using standard Nagios plugins. In Debian and Ubuntu, just install the nagios-nrpe-server package, and in Windows NSClient is very usable and easy to configure.
The last thing is alerts management. All servers that someone else manages, or is in charge of, should receive the Nagios alerts for that server. It will dramatically lighten the administration burden if it is possible to delegate as much as the server / service responsibility to other people. For extremely critical services, there should be an SMS gateway, which sends a message to the administrator or someone in charge of the server. This ensures that attention is immediately brought to the problem.