
Zabbix Review
Nagios vs Zabbix



Everyone is familiar with Nagios, which is often considered the de facto standard for monitoring. Other tools in that general category include OpenNMS, Zenoss, GroundWork and HyperIQ. I am only talking here about tools that qualify as an NMS: something that tracks different systems and devices across the entire infrastructure.

A couple of years ago, I was so tired of Nagios that I was ready to try something new. A couple of tools didn’t make the list, simply because of the “freemium” model. The basics are there, but anything more typically carries a hefty price tag.

I decided to try Zabbix and I have pretty much been a fan ever since. One caveat here is that I am talking about version 1.8.x. Version 2.0 just came out and offers a few notable improvements, which I haven’t tried out yet. A couple of things that look very promising are direct JMX support, multi-homed hosts, and mounted filesystem discovery. Full list of changes is here

As an overview, Zabbix offers the following:

Relatively quick & simple install on a variety of platforms
Agent-based, but with agentless options available
A fairly vibrant community
A large number of templates covering most popular software
Integrated graphs
Escalation management

More specifically:

Graphs

There are a lot of graphic front ends for Nagios. In general, they are bolt-ons of varying quality. On the other hand, graphs are probably one of the stronger features of Zabbix. Typically, templates will have a few graphs predefined, but more can be added fairly easily. Any item that’s being collected can also be graphed on-demand. The one small drawback is the inability to save pics on the fly, which is sometimes useful for distribution. A workaround for that is described in this thread.

Graphing performance is decent if not spectacular. That will largely depend on data volume, your hardware and the time range. What I found especially valuable is something Zabbix refers to as “screens”. Generally, the entire point of graphing or visualizing something is to be able to easily identify trends and correlations. “Screens” allow you to group disparate items together. For example, if you wanted to see the correlation between your requests per second, queries per second, response time, network traffic and read/write percentage, it’s fairly trivial to put it together. Besides that, I’ve tended to use screens almost as targeted dashboards. Something like putting all the MySQL relevant information on the same screen (disk IO, queries per second, replication lag, cpu/mem, cache hits, etc) can let you know the health of your MySQL infrastructure almost immediately. The same can be done on the web side and in other areas.

Performance

Performance will vary quite a bit. I’ve run Zabbix on a large instance at EC2, backed by a 4-volume EBS RAID set, and was able to receive 600-800 values/second without much of a problem. However, with that setup, the screens (particularly the ones with a lot of metrics) would load in 2-5 seconds and the lag was noticeable.

One key tweak that is absolutely necessary is the polling frequency. Most of the default (and 3rd party) templates poll far too often. You generally don’t need to poll for free space every 5 seconds and there are plenty of examples like this. The data retention period also needs to be adjusted in a lot of cases. Dialing both back to something more reasonable (longer polling intervals, shorter retention of detailed history) is going to give a significant performance boost. It will behave better because you’ll reduce the volume of incoming values, but it will also reduce the amount of data you store and query against in the database. You likely don’t need precise-to-the-second numbers for every metric you collect going back a year. Historical data is still available, though in a somewhat less detailed form, which is generally sufficient for trend information.

If the data volume gets too large, the cleanup process might start failing. I’ve noticed that at around 150GB of data it would start having trouble. At that point there aren’t very many good options and they tend to be quite hairy. It’s best to avoid getting into that situation in the first place.
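To put rough numbers on the polling frequency point, here is a back-of-the-envelope sketch in Python. The fleet size, item counts and intervals are made-up assumptions for illustration, not measurements from my setup; the only thing it shows is how quickly the incoming values/second and stored rows per day drop once intervals are lengthened.

```python
# Rough estimate of incoming load for a set of polled items.
# All numbers below are hypothetical and only illustrate the arithmetic.

def values_per_second(items):
    """items: iterable of (item_count, polling_interval_seconds) pairs."""
    return sum(count / interval for count, interval in items)


def rows_per_day(items):
    """Number of history rows written per day for the same item set."""
    return values_per_second(items) * 86400


HOSTS = 200  # hypothetical fleet size, 40 items per host

# Everything polled every 5 seconds.
aggressive = [(HOSTS * 40, 5)]

# Same 40 items per host, with intervals matched to how fast they change.
tuned = [
    (HOSTS * 10, 30),    # fast-moving metrics (CPU, requests/second)
    (HOSTS * 20, 300),   # slower metrics (memory, connection counts)
    (HOSTS * 10, 3600),  # near-static metrics (free disk space)
]

for name, items in (("aggressive", aggressive), ("tuned", tuned)):
    print(f"{name}: {values_per_second(items):.1f} values/s, "
          f"{rows_per_day(items):,.0f} rows/day")
```

With these assumed numbers the aggressive profile lands well above the 600-800 values/second I was comfortable with on that EC2 setup, while the tuned profile stays far below it.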

There are also a couple of options for distributed monitoring, if the performance requirements exceed the capability of a single node. There is a lot of documentation about it on their site, but it generally boils down to a choice between a proxy and a node. I tend to prefer a proxy because of easier setup and maintenance. As a more specific example, I’d use proxies in an AWS environment that is spread across different regions. Another good use case in AWS is a mix of a VPC and regular EC2, where you’d place your proxy in the VPC. This method allows for significant scaling, though you would still need a very capable central master. The one significant benefit of the node approach is that nodes can be queried independently and support a hierarchy. For that reason, in an environment with thousands of devices that support different applications, nodes are likely a better approach.

Monitoring

It’s a fairly standard feature set that is generally similar across other NMS systems. A couple of things worth noting:

Web Monitoring – it has built-in web transaction monitoring. It’s decent if not spectacular and doesn’t really compare against the sophisticated transaction monitoring systems that are out there. It does support multiple steps and it’s based on curl, though it doesn’t expose all of curl’s functionality. That will present a problem if you need to do extensive cookie manipulation and/or variables. It’s also useless for heavily AJAXed pages and the ones that use Flash. Still, it’s decent for basic monitoring and more than most other systems offer.

IPMI support is worth noting, but I’ve personally never used it.

Log Monitoring – this isn’t going to work well for high traffic web logs, but it does a pretty solid job at picking up exceptions and errors in various files. It does support a full regex engine for pattern matching. I’ve had it monitoring files that received ~500 lines per second and it had no issues with that.

Templates – this is the core approach to monitoring in Zabbix. All your monitoring definitions are ideally grouped in templates. When a new server/instance shows up, you simply apply the template to it or add it to a group to which this template is assigned. A few templates of varying quality come out of the box, and there are a lot of user-generated templates for a variety of applications. A lot of them will have a script (PHP/Perl/Python) that polls the application and sends the data back. Typically you’ll have to make a few tweaks that are specific to your environment. Some of the ones that I found useful and better than others are:

This is the “default” MySQL template for Zabbix and it’s based on a PHP script. The description says it wasn’t tested on 5.1, but I didn’t notice any issues. There is a range of values that have to be tuned in order to avoid false alerts.

If you’re used to the Cacti templates for MySQL and the data those provide, this is a port to Zabbix. If I remember correctly, this template required a few tweaks to the PHP script in order to get it working.

This is another decent template for MySQL, but you don’t get InnoDB information out of the box. It is good for monitoring multiple MySQL instances on the same box though. The other templates would require modifications in their polling scripts.

For HAProxy, I’ve used this template. It’s better than others, since it allows you to look at and compare statistics of individual servers behind HAProxy. The downside is that it won’t automatically discover changes. That can be scripted, but it might get a little hairy.

For Nginx, this is more than sufficient for most needs.

Another one that is useful for Nginx, though the site is in Russian. Google Translate does a pretty good job there. There are a few other templates on that site, but I’ve never tried them.
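To give a feel for what the polling scripts behind such templates do, here is a minimal sketch in Python. It is not taken from any of the templates above. It parses the text that Nginx’s stub_status page returns and formats each value as a "host key value" line, which is the input-file format that zabbix_sender accepts with its -i option. The host name web01 and the nginx.* item keys are hypothetical; a real template defines its own keys, and the sample text stands in for an HTTP fetch of the status page.

```python
import re

# Sample of what an Nginx stub_status page returns (stand-in for a real fetch).
SAMPLE = """Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106
"""


def parse_stub_status(text):
    """Turn stub_status text into a dict of integer metrics."""
    active = re.search(r"Active connections:\s+(\d+)", text)
    totals = re.search(r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]*$", text, re.M)
    states = re.search(
        r"Reading:\s+(\d+)\s+Writing:\s+(\d+)\s+Waiting:\s+(\d+)", text
    )
    if not (active and totals and states):
        raise ValueError("unexpected stub_status format")
    accepts, handled, requests = (int(v) for v in totals.groups())
    reading, writing, waiting = (int(v) for v in states.groups())
    return {
        "active": int(active.group(1)),
        "accepts": accepts,
        "handled": handled,
        "requests": requests,
        "reading": reading,
        "writing": writing,
        "waiting": waiting,
    }


def sender_lines(host, metrics, prefix="nginx."):
    """Format metrics as '<host> <key> <value>' lines (keys are hypothetical)."""
    return [f"{host} {prefix}{key} {value}" for key, value in sorted(metrics.items())]


if __name__ == "__main__":
    for line in sender_lines("web01", parse_stub_status(SAMPLE)):
        print(line)
```

The printed lines could be written to a file and handed to zabbix_sender, assuming matching trapper items exist on the host. Counters such as requests only ever grow, so on the Zabbix side they would typically be stored as a per-second delta rather than as raw values.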

Misc

It does have an API for automation. I think it was improved in 2.0, but in 1.8 it was already solid.

There is a decent CLI tool written in Ruby, called zabcon, that will interface with the API.

There isn’t a great way to control alert floods. You can control trigger dependencies, but if something really goes haywire you might be manually clearing SQL tables after that.

Alert escalations are a little wonky, but they work reasonably well.

It is pretty trivial to port existing Nagios plugins or other scripts into Zabbix.

JMX monitoring was done via zapcat. It wasn’t great, but for lack of better options it was the only thing to work with. Version 2.0 does it natively and, if they did it right, that’s probably one of the biggest improvements.
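As an illustration of why porting Nagios plugins is straightforward, here is a minimal wrapper sketch in Python. It relies only on the standard Nagios plugin conventions: the exit code carries the state (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and anything after the "|" on the first output line is performance data in label=value form. How the result reaches Zabbix is left open; one option, assumed here, is to call a script like this from a UserParameter entry in the agent config so that the printed number becomes the item value. The wrapper itself and any item key you attach to it are hypothetical.

```python
import re
import subprocess
import sys

# Standard Nagios plugin exit codes.
STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_plugin(argv):
    """Run a Nagios-style plugin; return (exit_code, first_line_of_output)."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    lines = proc.stdout.strip().splitlines()
    return proc.returncode, (lines[0] if lines else "")


def parse_perfdata(line):
    """Extract {label: value} from the perfdata after '|'.

    Simplified: quoted labels containing spaces are not handled, and the
    unit suffix and warn/crit/min/max fields are ignored.
    """
    _, _, perf = line.partition("|")
    values = {}
    for token in perf.split():
        match = re.match(r"([^=]+)=(-?\d+(?:\.\d+)?)", token)
        if match:
            values[match.group(1)] = float(match.group(2))
    return values


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: wrapper.py /path/to/plugin [plugin args...]
    code, output = run_plugin(sys.argv[1:])
    # Print the numeric state so it can be stored as a numeric item
    # and used directly in trigger expressions.
    print(code)
```

Printing the numeric state keeps the item simple to trigger on. The same wrapper could print a single perfdata value instead if you want the metric itself graphed.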

In summary, from what I’ve seen, Zabbix is easily one of the top NMS systems out there, though it’s probably somewhat less popular than others. If you’re fed up with Nagios or doing a brand new deployment, taking a serious look at Zabbix will be worth your while.

Disclosure: I am a real user, and this review is based on my own experience and opinions.


3 Comments

untergeek (Real User, TOP 10)

Nagios is for masochists who are content to live within the ecosystem they first learned, or for people who want a solution cobbled together like Legos in one ugly lump.
While I prefer Zabbix, any number of systems are preferable to Nagios with regards to having a unified system that does much out-of-the-box, rather than a bunch of disparate bolt-ons added after the fact.

05 February 13
Pedro Sousa (Real User, TOP 10)

Over the years I've been using Nagios, Zenoss, GroundWork, Cacti and other SNMP+MRTG solutions; since I found Zabbix, it's been my number one choice. However, it still has some issues with its reporting capabilities, which need major improvement. From a "techs" point of view, Zabbix provides reliable metrics and a fairly simple implementation (with some tweaks discussed in this article), but from a management's point of view, it lacks the "beautiful graphs and reports" that Management "likes"!!! I'm not very happy with the time it takes when I'm asked to produce reports on my systems and have to set up the same reporting parameters time and time again...

13 February 13
George Wenzel (Real User, TOP REVIEWER, TOP 10)

The old-school systems produced graphs every time data was gathered. This resulted in a fast user experience displaying graphs, but it caused the number of values per second to be limited by the number of graphs per second you can produce.

Zabbix dynamically creates the graphs on demand. This reduces the number of times it must produce a graph, pushing up the number of values per second you can capture. But as the reviewer noted above, screens and individual graphs can display slowly if they contain too many data points.

I agree with the reviewer that many or most of the default templates poll far too frequently. In fact, the rates are so high that they have an impact on the machine you are polling if you are pulling very many values. Sometimes I think that the people who create the templates only have one machine they are monitoring, and they set the poll frequency high just to have graphs appear more quickly when setting up a new Zabbix server. Nothing is more boring than spending a couple of hours setting up a monitoring system, only to have a bunch of graphs with single dots on them because your polling cycle for disk space is every 15 minutes. But regardless of the reason for it, I think it is irresponsible to release templates with inappropriate polling cycles.

But back to the graphs, if you have too much data, an otherwise simple graph will take a long time to display. On a screen this gets worse because you are displaying multiple graphs. So to get the best screen display performance, reduce the polling frequency to the lowest value that still produces good graphs.

I have been known to produce two objects for the same item, with different polling cycles: a long polling cycle for graphs that appear on screens and publicly viewable pages, and a faster polling cycle for detailed data collection to be used in debugging.

I've used nearly all of the network monitoring systems in the 30+ years I have been monitoring networks. Zabbix is my favorite for most applications. I do use more advanced commercial systems such as NetMRI, as the commercial systems can do things like discover all of your systems, and self configure. Commercial systems like NetMRI also do deep inspection, such as VOIP quality analysis, that Zabbix simply isn't designed to do.

I can do anything with Zabbix, anything that I have time to configure. But to be fair, systems like NetMRI can be configured for very large environments in 5 or 10 minutes, out of the box. But when I want to do something special, that I create code for myself, I don't use systems like NetMRI, I use Zabbix. Zabbix is my favorite general purpose network monitoring system. And to be fair, Zabbix is a commercial system too, when you need it to be.

Tools like NetMRI have a lot more power to self-configure, but that power is not free... The NetMRI quote for the hospital I worked for was $300,000!! The commercial version of Zabbix was much lower. And with some careful work with discovery templates, you could still get some self-configuration out of Zabbix.

SolarWinds is another commercial tool in the same space as NetMRI. SolarWinds is nice, but its performance is impacted by the fact that it runs on Windows, so it takes more hardware to monitor large enterprises. It is comfortable for the Windows geeks, though. I'm not a Windows geek...;)

George

14 February 13