Nagios Core Review
Everyone ends up using Nagios or a derivative just because everyone else does


Everyone ends up using Nagios or a derivative just because... well, everyone else does. The size of your org really matters a lot here: Zabbix might fit you perfectly or not at all.

Lately I've been setting up Nagios with a Graphite back end for people, then taking advantage of Nagios's custom plugins to send data to both systems. You can throw a lot of data at Graphite and make some super pretty graphs if that is what you are after. For example, imagine having the full contents of a vmstat/iostat every X seconds, for ALL your servers, queryable with less than a minute of latency. You can do that with Nagios + Graphite + your own fixins... and then you show Dev how easy it is to log data into Carbon/Graphite and become a superhero.
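To make that concrete, here is a rough sketch of what one of those dual-purpose plugins can look like: it does a normal Nagios check and also drops the number into Carbon over the plaintext protocol (port 2003 by default). The hostname, metric path and thresholds are invented for illustration.

    #!/usr/bin/env python
    # Sketch of a Nagios check that also feeds Graphite/Carbon.
    # Assumes a Carbon daemon on its default plaintext port (2003);
    # host names, metric paths and thresholds are placeholders.
    import socket, sys, time

    CARBON_HOST, CARBON_PORT = "graphite.example.com", 2003
    WARN, CRIT = 8.0, 16.0

    def read_metric():
        # Stand-in for the real collection (parsing vmstat/iostat, an API call, ...)
        with open("/proc/loadavg") as f:
            return float(f.read().split()[0])

    def send_to_carbon(path, value):
        line = "%s %f %d\n" % (path, value, int(time.time()))
        sock = socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5)
        sock.sendall(line.encode())
        sock.close()

    value = read_metric()
    try:
        send_to_carbon("servers.web01.load.shortterm", value)
    except OSError:
        pass  # graphing is best-effort; don't fail the check because Carbon is down

    # Standard Nagios exit codes: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN
    if value >= CRIT:
        print("CRITICAL - load %.2f" % value); sys.exit(2)
    elif value >= WARN:
        print("WARNING - load %.2f" % value); sys.exit(1)
    print("OK - load %.2f" % value); sys.exit(0)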

When you start hoarding this much data you can start asking some really detailed questions about disk performance, network latencies, system resources, etc. that before were just guesstimates. Now you have the data and the graphs to back them up.

I'm also a big fan of Pandora FMS, but I've never implemented it anywhere professionally, and the scope it takes on is pretty large.

(I should note, Nagios is pretty terrible; it's no better than the things we had a decade ago.)

The real truth here is that all the current monitoring systems are pretty terrible given that they are no better than what we had a decade ago. Every good sysadmin group makes them work well enough, but there is a lot of making them work. Great sysadmins go on to combine a couple of them with their own bits to make the system a bit more proactive than reactive, which is what most people expect out of monitoring.


Reactive monitoring is fine for certain companies and certain situations, and it is easily obtainable with Nagios, Zabbix, home-brew, spend-stupid-money solutions, etc. However, reactive monitoring is just the starting point for most; it certainly doesn't handle big problems well, nor does it have the capacity to predict events slightly before they happen. This level of monitoring also doesn't give you much data after an event to figure out what went wrong.


Great admins go on to add proactive systems monitoring and, in some cases, basic logic monitoring. This is what a lot of us do all the time, to avoid getting paged in the middle of the night, or to know what to pick up at Fry's on the way into the office. Proactive monitoring covers a lot more than the basics, and it is essentially the level everyone works at now with Nagios, etc. That's certainly fine for today and tomorrow, but it doesn't tell you anything about next quarter, and the questions you can ask about past events are often very basic in scope.


The other amazingly huge drawback with current monitoring is that if you want to monitor business or application logic, it is going to be something you custom-fit into whatever monitoring system you have. That makes it unwieldy, and while it's effective for answering basic questions like "What's the impact on sales if we lose the east coast data center and everything routes through the west?", that's a fine question but not one that will get you to the next level, ahead of your competitors.


So what's next? I'll tell you where I think we should be going and how I am sort of implementing it at some places.


Predictive monitoring on systems AND business logic, with lots of data and very complex questions being answered. This can be done right now with Nagios, Graphite and Carbon. Nagios fills the monitoring and alerting needs. Carbon stores lots of numerical data, very fast, from a lot of sources. Finally, with Graphite you can start asking really serious questions like "How did the code push affect overall page performance while one colo site was down? What was the cost to the business? Where were the bottlenecks in our environment: server, disk, memory, network, code, traffic?" Once you've constructed one of these sets of questions in Graphite you can save it for the future and not only monitor it, but, because of the historical data kept on so many key points, use it for predictions.
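For a feel of what "asking Graphite a question" looks like in practice, here is a minimal sketch that compares average page render time over the last two hours against the same series shifted back a day, using Graphite's render API. The metric paths and the Graphite host are assumptions; averageSeries() and timeShift() are standard Graphite functions.

    # Minimal sketch: query Graphite's render API and compare a series
    # against itself shifted back one day. Host and metric paths are made up.
    import json, urllib.parse, urllib.request

    GRAPHITE = "http://graphite.example.com"
    params = urllib.parse.urlencode({
        "target": [
            "averageSeries(apps.web.*.page_render_ms)",
            'timeShift(averageSeries(apps.web.*.page_render_ms), "1d")',
        ],
        "from": "-2h",
        "format": "json",
    }, doseq=True)

    with urllib.request.urlopen("%s/render?%s" % (GRAPHITE, params)) as resp:
        for series in json.load(resp):
            points = [v for v, t in series["datapoints"] if v is not None]
            if points:
                print(series["target"], "avg:", sum(points) / len(points))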


That said, how do you do all that now? Well, you throw Nagios, Graphite and Carbon out there and then you CREATE a whole lot of stuff that is specific to your org. This is a lot of work and effort, and it takes time and a real understanding of the full application and of what your end SLA goals are.


So how do we do all this?


You as an admin do this by creating custom Nagios plugins and data handlers on your systems and throwing the numbers into Carbon. As an admin you measure everything, and I mean everything. Think all of the output from a vmstat and an iostat, logged in aggregate one-minute chunks on every single server you have and kept for years.
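A per-minute collector for that kind of data doesn't need to be fancy. Something along these lines, run from cron, logs every vmstat column for a host; the Carbon address is an assumption, and the metric names just mirror vmstat's own header row.

    # Sketch: log every vmstat column into Carbon once a minute (run from cron).
    # Carbon address is a placeholder; metric names mirror vmstat's header row.
    import socket, subprocess, time

    CARBON = ("graphite.example.com", 2003)
    HOST = socket.gethostname().replace(".", "_")

    out = subprocess.check_output(["vmstat", "1", "2"]).decode().splitlines()
    fields = out[1].split()     # column names: r b swpd free ... us sy id wa
    values = out[-1].split()    # second sample = the most recent interval

    now = int(time.time())
    payload = "".join(
        "servers.%s.vmstat.%s %s %d\n" % (HOST, name, val, now)
        for name, val in zip(fields, values)
    )
    sock = socket.create_connection(CARBON, timeout=5)
    sock.sendall(payload.encode())
    sock.close()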


From the dev side, you get the lead dev to agree on some key points where the app stack should emit data to Carbon. This can be things like time to login, some balance value, whatever metric you want to measure. The key here is to have business logic metrics AND system metrics in the same datastore within Carbon. Now you get to ask questions across both data sets, and you get to ask them frequently and fast. You can easily make predictions about how more load will impact the hardware, i.e. whether we need more spindles, more memory, etc.
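On the app side it can be as small as wrapping the interesting code path with a timer and shipping the number to the same Carbon instance. The function and metric names below are invented; the point is only that the business metric lands right next to the system metrics.

    # Sketch of a dev-side metric: time the login path and push it to Carbon.
    # record() and the metric path are illustrative, not from any real app.
    import socket, time

    def record(path, value, carbon=("graphite.example.com", 2003)):
        sock = socket.create_connection(carbon, timeout=2)
        sock.sendall(("%s %f %d\n" % (path, value, int(time.time()))).encode())
        sock.close()

    def do_authentication(user, password):
        time.sleep(0.05)          # stand-in for the app's real auth work
        return True

    def login(user, password):
        start = time.time()
        ok = do_authentication(user, password)
        record("apps.web.login.time_ms", (time.time() - start) * 1000.0)
        return ok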


This is what I have been doing with some companies in SV right now. It's not pretty or fully built out yet, because it is a big, huge problem and our current monitoring sucks. :D But it IS doable with current stuff, and it is quite amazing to know the answers to questions that were previously only dreamed about.


What's after that? The pie-in-the-sky next level would be having an app box in every app group running in debug mode, receiving less traffic through the load balancers of course, and loading all that debug data into Carbon. Then you get to ask questions about specific bits of a code release and their performance in your real production environment.


... so those are my initial thoughts. Any comments? :)


Further, once you have all this, you can write Nagios plugins that poll Graphite for the values behind questions you have created, and then alert not only on system logic and basic app metrics but on real, complex queries. Stuff like "How come no one has bought anything off page X in the last two hours? Is it related to these other conditions? Oh, it is. Create me an alert in Nagios so we can be warned when it looks like this is about to happen again." With much more data across more areas you can ask and alert on pretty much anything you can imagine. This is how you make it to the next level.
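As a sketch of that last idea (with a made-up target expression, host and thresholds), a Nagios plugin that asks Graphite a saved question and alerts on the answer can be as simple as:

    #!/usr/bin/env python
    # Sketch: a Nagios plugin that polls Graphite for a saved query and alerts.
    # Target expression, host and thresholds are placeholders.
    import json, sys, urllib.parse, urllib.request

    TARGET = "sumSeries(apps.web.*.checkout.orders)"   # orders across the farm
    URL = "http://graphite.example.com/render?%s" % urllib.parse.urlencode(
        {"target": TARGET, "from": "-2h", "format": "json"})

    try:
        with urllib.request.urlopen(URL, timeout=10) as resp:
            series = json.load(resp)
    except Exception as exc:
        print("UNKNOWN - could not query Graphite: %s" % exc)
        sys.exit(3)

    points = [v for v, _ in series[0]["datapoints"] if v is not None] if series else []
    total = sum(points)

    if total == 0:
        print("CRITICAL - no orders recorded in the last two hours")
        sys.exit(2)
    elif total < 10:
        print("WARNING - only %d orders in the last two hours" % total)
        sys.exit(1)
    print("OK - %d orders in the last two hours" % total)
    sys.exit(0)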

Disclosure: I am a real user, and this review is based on my own experience and opinions.

4 Comments

it_user4329 (Vendor, Top Reviewer, Top 5)

I'm not sure what you are trying to say in your first paragraph; you seem to be leading with the suggestion that Zabbix is a monolith that either fits or doesn't fit at all, with Nagios being extensible and flexible as the follow-on through the rest of your piece.

My experience is really a lot different. I found Nagios to be inefficient and incomplete, and while it is extensible, that was never an advantage. And the kinds of complex queries and actions you have been able to do with Nagios, I've been able to do with Zabbix. Zabbix is extensible as well, though I expect that fewer people have as much need to extend it.

I'm sure you found something Zabbix can't do well, that Nagios can do better, I'm just having a hard time imagining what that is. I suspect that it is more a case that you are a Nagios expert, and you haven't spent as much time with Zabbix.

My first efforts with Zabbix didn't go that well; as with most systems like this, there is a learning curve. On my second effort we were using a specific automation feature, and it worked so well that I immediately learned how to augment it with custom scripts to automate other tasks. While that was just one set of features missing from Nagios, it was the reason I started using Zabbix. Once I was using it, I started to realize it was a superior solution all the way around.

You may never find that one killer function in Zabbix that rocks your world, so learning to replace one tool with another may not be as big a gain for you. But I have been replacing Nagios all over the place, and so far no one has given me any pushback, despite how many years they have been using Nagios. This is the first time I've seen a post that seems to imply Nagios is better, and I'm trying to understand where you are coming from on that. I get the sense you are where I was after my first failed attempt at using Zabbix.

24 October 13
it_user12222 (Consultant, Popular)

My apologies, I did not mean to imply that Nagios is better than anything else; rather, it is so pervasive (especially in SV) that it is simply the default monitoring solution. You are very correct in characterizing Nagios as essentially a "whole lotta work". For me, the only parts of Nagios that are worthwhile are the basics that have already been taken care of: paging, on-call rotations, the web GUI, all the stuff that would slow me down while creating specific loggers/analyzers.

As far as queries on monitored data go, Zabbix could probably be used in place of Nagios in a Graphite/Carbon implementation. However, I am not familiar with reporting error conditions back into Zabbix.

I mentioned it above, but if you have a fairly large set of servers, Pandora FMS is quite wonderful when building your own pieces where needed isn't something you're comfortable with, or isn't acceptable.

28 October 13
it_user4401 (Vendor, Popular)

That’s an excellent review, I would like to ask you a question. How can I have Nagios process all object configuration files in a certain directory? It must be possible.

10 November 14
Orlee Gillis (Consultant)

Chris, do you still find this to be true? Is Nagios still a default tool when people are searching for IT Infrastructure Monitoring solutions?

20 October 16