Any internet-based service company, whether it offers web hosting, DNS hosting, email hosting, cloud infrastructure, or a CDN, runs anywhere from several hundred to thousands of servers. Different servers, often geographically separated from each other, play different roles, and together they provide a combined service to the end customer. A problem on any one server should not affect the customer, and it must be found and fixed before it causes an outage.
Let's take two examples that explain the need for 24x7 monitoring of these servers. Suppose you get a call from your technical support team saying that several customers are complaining that their websites are inaccessible. Without further details, such complaints are very difficult to troubleshoot if you do not have 24x7 server monitoring in place. During a crisis, you can't afford to spend time manually checking the basic things mentioned below.
It is quite normal to miss one or another of them when looking for basic issues on the server by hand. What if the problem was simply a RAID drive failure that made one of the disks (the one containing the document root for some of the hosted websites) inaccessible?
Such problems can be monitored, and a warning raised, before a complete failure occurs. Another, almost funny, example would be finding that a customer-facing service had not been working as desired for hours simply because the server's clock had drifted away from the network time server.
It is not feasible for a system administrator to look at every log, service setting, and configuration around the clock. You need an automated tool that continuously monitors the required services and settings on the servers and informs the people concerned when there is an issue. A good server and infrastructure monitoring tool must have the following characteristics.
Although there are many proprietary monitoring tools to choose from, depending on your requirements, a proprietary tool cannot offer the peer review, source code modification, and version iterations that an open source tool provides.
Nagios is an open source network monitoring tool that provides all the capabilities discussed above in one package. Nagios monitors servers and network devices (in fact, any device that is reachable at an IP address can be monitored with Nagios), alerts you when a monitored service goes wrong, and alerts you again when the service returns to its normal state. Nagios is capable of doing the following things.
In this tutorial, we will look at the major components of Nagios that help it maintain a good monitoring infrastructure.
Let's begin by understanding how a Nagios server checks the status of a service on a remote server and accurately reports the result to you. In the world of Nagios you will often hear the term plugins: readily available binaries or small scripts that check the status of a required service or program.
Nagios can check the status of a remote service or program in several ways. Let's go through them one by one.
In the first method, the Nagios server executes a plugin on the Nagios server itself, and the plugin tries to connect to a network service on the target server. Let's understand this through the following diagram.
The diagram above depicts how the Nagios process executes an example check (the plugin) on the Nagios server itself. The check connects to HTTP port 80 on the target server and records the response time.
The Nagios server executes the check at a regular, configurable interval to verify the availability of the service. In this example, the plugin lives on the Nagios server, and nothing changes on the client side. You can't monitor every property of a client that matters with this method; it can only monitor services that are reachable over the network. The main reason is that you need to log in to the client server in order to monitor things like memory usage, process status, and CPU load.
Hence this kind of plugin is limited in capability, but you can still achieve a considerable amount of good 24x7 monitoring with it for network-reachable services such as SMTP, HTTP, DNS, FTP, port availability checks, and remote MySQL and MSSQL.
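To make the idea of a check that runs on the Nagios server concrete, here is a minimal sketch of such a plugin in Python. It is not one of the plugins shipped with Nagios; the function name `check_tcp` and the `warn_seconds` threshold are assumptions made for illustration. It only relies on the usual plugin convention of exit code 0 for OK, 1 for WARNING, and 2 for CRITICAL, plus a single status line.

```python
import socket
import time

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_tcp(host, port, warn_seconds=1.0, timeout=5.0):
    """Connect to host:port and return (exit_code, status_line).

    Hypothetical helper: warn_seconds and timeout are illustrative defaults.
    """
    start = time.monotonic()
    try:
        # Open and immediately close a TCP connection to the service.
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.monotonic() - start
    except OSError as exc:
        # Refused connections, timeouts, and DNS failures all land here.
        return CRITICAL, f"CRITICAL - cannot connect to {host}:{port} ({exc})"
    if elapsed > warn_seconds:
        return WARNING, f"WARNING - {host}:{port} answered in {elapsed:.3f}s"
    return OK, f"OK - {host}:{port} answered in {elapsed:.3f}s"
```

A real plugin would print the status line and exit with the returned code, for example by calling `check_tcp("www.example.com", 80)` and passing the code to `sys.exit`. Nothing is installed on the target server, which is exactly why this style of check cannot see memory usage or CPU load.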
As mentioned in the previous method, without a login on the remote machine the level of monitoring you can achieve is very limited, and you cannot monitor every service that way.
You can achieve 24x7 monitoring of the things that cannot be checked directly over the network with the help of two different methods: SSH and NRPE.
Let's first understand monitoring a remote host using the SSH method. In this method, a user is created on every client machine. That user allows SSH login from the Nagios server with a predefined SSH key and executes the required plugin to monitor the required service.
Executing plugins on the remote client over SSH is a secure way to monitor. Because an ordinary user logs in to the remote client, the Nagios server can run any command that this user is allowed to execute.
The plugins that reside on the remote client are sometimes called local plugins, as they are local to the remote host. To run local plugins on a remote host, Nagios uses a ready-made plugin called check_by_ssh (we will discuss its complete usage in a dedicated post of its own).
Of course, you will not be sitting there entering a password every time the Nagios daemon executes a check. Login and execution of the remote plugin over SSH must be seamless and passwordless. For this, you need to set up public key authentication for the user that logs in to the remote server to execute the plugins.
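The following Python sketch illustrates the idea behind the SSH method: log in with a key, never prompt for a password, run one plugin, and hand back its exit code and output. It is not how check_by_ssh is implemented. The helper names, the `nagios` user, the key path, and the plugin path in the usage note are all assumptions for illustration; only the `ssh` options `-i`, `-l`, and `-o BatchMode=yes` are standard OpenSSH.

```python
import subprocess


def build_ssh_argv(host, user, key_path, remote_command):
    """Build the ssh command line for a passwordless, key-based login."""
    return [
        "ssh",
        "-i", key_path,          # private key used for public key authentication
        "-o", "BatchMode=yes",   # fail instead of prompting for a password
        "-l", user,              # unprivileged monitoring user on the client
        host,
        remote_command,          # plugin that lives on the remote host
    ]


def run_remote_plugin(host, user, key_path, remote_command, timeout=30):
    """Run a plugin on the remote host and return (exit_code, output).

    ssh passes the remote command's exit status through, so the plugin's
    OK/WARNING/CRITICAL code reaches the caller unchanged.
    """
    result = subprocess.run(
        build_ssh_argv(host, user, key_path, remote_command),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stdout.strip()
```

A call might look like `run_remote_plugin("web01.example.com", "nagios", "/var/lib/nagios/.ssh/id_rsa", "/usr/lib/nagios/plugins/check_load -w 5 -c 10")`, where every value is a placeholder. Because the remote command is an arbitrary string, whatever the monitoring user may execute can be run this way.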
Now let's look at the other method of executing remote plugins.
Another commonly used method of executing a remote plugin is NRPE, which stands for Nagios Remote Plugin Executor. NRPE is a package that is installed on every remote host that needs to be monitored. NRPE is often run as an xinetd service on the remote host, and by default it listens on TCP port 5666.
When the NRPE daemon receives a query from the Nagios server to execute a command on the local server, it looks in the NRPE configuration files for a command with the same name that Nagios asked for. Unlike the SSH method, NRPE cannot run just any command the Nagios server asks for. Commands must first be defined in the NRPE configuration file, and only those commands can be run from the Nagios server. Deploying SSH-based Nagios checks is much easier than the NRPE method, because with NRPE you first need to install the NRPE package on every client server that is to be monitored.
The diagram above depicts the NRPE method of executing remote checks on a remote client with Nagios. The Nagios server has a check_nrpe plugin (very similar to the check_by_ssh plugin used in the SSH method), which connects to the remote client on port 5666 and asks it to execute the command given as an argument to check_nrpe. That command must also be defined in the NRPE configuration files on the client, where it will be executed.
The NRPE method of monitoring a remote host is therefore limited to the commands defined in the NRPE configuration files on the client: any command you need to run on the remote machine must be predefined there.
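The lookup described above can be sketched in a few lines of Python. This is only an illustration of the whitelist idea, not the NRPE daemon's code. The `command[name]=command line` form is the one used in NRPE configuration files, while the sample excerpt, the plugin paths, and the function names are assumptions.

```python
import re

# Matches lines of the form: command[check_load]=/path/to/plugin args
COMMAND_LINE = re.compile(r"^command\[(?P<name>[^\]]+)\]\s*=\s*(?P<cmdline>.+)$")

# Hypothetical excerpt of an NRPE configuration file on the client.
SAMPLE_CONFIG = """
# listen on the default NRPE port
server_port=5666
command[check_load]=/usr/lib/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
command[check_disk]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
"""


def parse_nrpe_commands(config_text):
    """Return a dict mapping each defined command name to its command line."""
    commands = {}
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = COMMAND_LINE.match(line)
        if match:
            commands[match.group("name")] = match.group("cmdline").strip()
    return commands


def resolve_command(commands, requested_name):
    """Return the command line for a requested name, or None if undefined."""
    return commands.get(requested_name)
```

With the sample above, a request for `check_load` resolves to the configured command line, while a request for any name that is not defined resolves to `None` and would be refused. That lookup is the whole difference from the SSH method, where the server sends the command line itself.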
check_by_ssh, by contrast, can run any command that the login user has permission to execute on the remote machine.
Let's go ahead and understand the remaining two methods that can be used to monitor a remote host with Nagios.
SNMP can be used to fetch the current value of various properties of a network device or any other SNMP-aware device. If an SNMP daemon is installed on the remote host that needs to be monitored, you can monitor hard drives, load, and so on through it.
The advantage of using SNMP for monitoring is that it is supported by a wide variety of devices, such as network switches, routers, and UPS devices.
We will be doing a couple of posts on SNMP to give a better overview of the protocol and its usage. We will also be doing a dedicated post on monitoring devices with Nagios and SNMP.
In the SNMP case above, the plugin lives on the Nagios server itself: it is a generic SNMP plugin that is used to monitor all SNMP-related services, with different arguments given to it.
So far we have seen four different methods of monitoring a remote server with Nagios. All of them work through a plugin placed on the Nagios server, a plugin placed on the client, or simple monitoring of a network-reachable service. In all of these methods, plugin (or command) execution is initiated by the Nagios server.
Let's now look at a method in which the client executes the required plugin at a regular interval and reports the output to the Nagios server. This is achieved with the help of a daemon called NSCA.
NSCA stands for Nagios Service Check Acceptor. It is installed as a daemon on the Nagios server itself, where it waits for command results from the clients.
This kind of Nagios monitoring is called passive monitoring, because the Nagios server does not initiate the checks on the client. Instead, the client executes the specified plugins at a regular interval with the help of cron and reports the output to the NSCA daemon on the Nagios server.
When reporting, the client also sends details such as the hostname, the service name, and the output of the executed command to the NSCA daemon, so that the Nagios server can report the result in exactly the same way as it does for active checks. (Active checks are those in which command execution is initiated by the Nagios server, for example check_by_ssh and NRPE.)
There are a couple of things to understand from the diagram above. NSCA is a daemon on the Nagios server that waits for command results from the client.
The send_nsca program can be used to send a command result to the Nagios server. The hostname, the service name, and other related details are included in the command result sent with send_nsca.
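As a rough illustration of what the client reports, the sketch below builds one passive service check result as a single line of text. It assumes the tab-separated layout that send_nsca reads on standard input by default (host name, service description, return code, plugin output). The function name and the sample values are hypothetical.

```python
# Plugin return codes: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
VALID_RETURN_CODES = (0, 1, 2, 3)


def format_passive_result(host, service, return_code, output):
    """Build one tab-separated service check result line for send_nsca."""
    if return_code not in VALID_RETURN_CODES:
        raise ValueError(f"invalid return code: {return_code}")
    fields = [host, service, str(return_code), output]
    for field in fields:
        # Tabs or newlines inside a field would corrupt the record layout.
        if "\t" in field or "\n" in field:
            raise ValueError("fields must not contain tabs or newlines")
    return "\t".join(fields) + "\n"
```

A cron job on the client could run a local plugin, pass its exit code and output through a helper like this, and pipe the resulting line into `send_nsca -H <nagios-server>`. The host and service names need to match definitions that already exist on the Nagios server so that the result can be attached to the right service.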