Everyone ends up using nagios or a derivative just because... well, everyone else does. The size of your org matters a lot for what you are doing here; zabbix might fit you perfectly or not at all.
Lately I've been setting up nagios with a graphite backend for people, writing custom nagios plugins that send data to both systems. You can throw a lot of data at graphite and make some super pretty graphs, if that is what you are after. For example, imagine having the full contents of a vmstat/iostat every X seconds, for ALL your servers, queryable with less than a minute of latency. You can do that with nagios + graphite + your own fixins... and then you show Dev how easy it is to log data into carbon/graphite and become a superhero.
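To make the "send data to both systems" part concrete, here is a minimal sketch of carbon's plaintext protocol in Python. The hostname and metric path in the commented call are made-up examples; carbon just expects `path value timestamp` lines, one per metric, typically on port 2003.

```python
import socket
import time

def carbon_line(path, value, timestamp=None):
    """Format one metric in carbon's plaintext protocol: 'path value timestamp\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}\n"

def send_metric(host, path, value, port=2003, timestamp=None):
    """Open a connection to a carbon relay and ship a single metric line."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(carbon_line(path, value, timestamp).encode("ascii"))

# Hypothetical call; the hostname and metric path are made up:
# send_metric("carbon.example.com", "servers.web01.load.one_min", 0.42)
```

A nagios plugin can emit its normal OK/WARN/CRIT status and fire the same value off to carbon on the side, which is all "send data to both systems" really means.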
When you start hoarding this much data you can start asking some really detailed questions about disk performance, network latencies, system resources, etc., questions that before were answered with guesstimates. Now you have the data and the graphs to back them up.
I'm also a big fan of Pandora FMS, but I've never implemented it anywhere professionally, and its scope is pretty large.
(I should note, nagios is pretty terrible; it's no better than what we had a decade ago.)
The real truth here is that all the current monitoring systems are pretty terrible given that they are no better than what we had a decade ago. Every good sysadmin group makes them work well enough, but there is a lot of making them work. Great sysadmins go on to combine a couple of them with their own bits to make the system a bit more proactive than reactive, which is what most people expect out of monitoring.
Reactive monitoring is fine for certain companies and certain situations, and it is easily obtainable with nagios, zabbix, home-brew, a stupidspendmoney solution, etc... However, reactive monitoring is just the baseline for most; it certainly doesn't handle big problems well, nor does it have the capacity to predict events shortly before they happen. This level of monitoring also doesn't give you much data after an event to figure out what went wrong.
Great admins go on to add proactive systems monitoring and, in some cases, basic logic monitoring. This is what a lot of us do all the time, to avoid getting paged in the middle of the night, or to know what to pick up at Fry's on the way into the office. Proactive monitoring covers a lot more than the basics, and it is essentially the level everyone works at now, with nagios, etc... That's certainly fine for today and tomorrow, but it doesn't tell you anything about next quarter, and the questions you can ask about past events are often very basic in scope.
The other amazingly huge drawback with current monitoring is that if you want to monitor business or application logic, it is going to be something you custom-fit into whatever monitoring system you have. That gets unwieldy, and while it is effective for answering basic questions like "What's the impact on sales if we lose the east coast data center and everything routes through the west?", those are fine questions but not the kind that will get you to the next level, ahead of your competitors.
So what's next? I'll tell you where I think we should be going and how I am sort of implementing it at some places.
Predictive monitoring on systems AND business logic, with lots of data, and very complex questions being answered. This can be done right now with nagios, graphite and carbon. Nagios fills the monitoring and alerting needs. Carbon stores lots of numerical data, very fast, from a lot of sources. Finally, with graphite you can start asking really serious questions like "How did the code push affect overall page performance time while one colo site was down? What was the business cost? Where were the bottlenecks in our environment? Server? Disk? Memory? Network? Code? Traffic?" Once you've constructed one of these lists of questions in graphite, you can save it for the future and not only monitor it but, because of the historical data kept on so many key points, use it for future predictions.
That said, how do you do all that now? Well, you throw nagios, graphite and carbon out there, and then you CREATE a whole lot of stuff that is specific to your org. This is a lot of work and effort, and it takes time and a real understanding of the full application and of your end SLA goals.
So how do we do all this?
You as an admin do this by creating custom nagios plugins and data handlers on your systems and throwing the results into carbon. As an admin you measure everything, and I mean everything. Think all of the output from a vmstat and an iostat, logged in aggregated one-minute chunks on every single server you have and kept for years.
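To give a feel for "measure everything," here is a rough sketch that turns one vmstat data line into carbon plaintext lines. The column list matches a typical Linux vmstat and the `servers.<host>.vmstat.*` naming scheme is my own invention; adjust both to your environment.

```python
# Column names for a typical Linux `vmstat` data line; older/newer versions
# may add or drop columns (e.g. "st"), so check your own output first.
VMSTAT_FIELDS = ["r", "b", "swpd", "free", "buff", "cache", "si", "so",
                 "bi", "bo", "in", "cs", "us", "sy", "id", "wa"]

def vmstat_to_carbon(line, host, timestamp):
    """Turn one vmstat data line into carbon plaintext metric lines."""
    return [f"servers.{host}.vmstat.{name} {value} {timestamp}"
            for name, value in zip(VMSTAT_FIELDS, line.split())]

# Hypothetical usage, e.g. fed the last line of `vmstat 60 2` from a cron job:
# lines = vmstat_to_carbon("1 0 0 501 102 203 0 0 5 10 300 400 10 5 84 1",
#                          "web01", 1300000000)
```

Do the same for iostat, netstat, whatever you care about, and you have the "hoard everything" firehose with almost no moving parts.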
On the dev side, you get the lead dev to agree on some key points where the AppStack should put out data to carbon. This can be things like time to login, some balance value, whatever metric you want to measure. The key here is to have business logic metrics AND system metrics in the same datastore within carbon. Now you get to ask questions across both data sets, and you get to ask them frequently and fast. You can easily predict how more load will impact the hardware, i.e. do we need more spindles, more memory, etc...
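For the dev side, a tiny sketch of what "put out some data to carbon" can look like in application code. The metric name `appstack.login.time_ms` is a made-up example; the helper just collects carbon plaintext lines that you would then ship to carbon over a socket.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed_metric(metrics, path):
    """Time the wrapped block and append a carbon plaintext line (value in ms)."""
    start = time.time()
    try:
        yield
    finally:
        elapsed_ms = (time.time() - start) * 1000.0
        metrics.append(f"{path} {elapsed_ms:.1f} {int(start)}")

# Hypothetical usage inside the app stack:
# with timed_metric(pending_metrics, "appstack.login.time_ms"):
#     do_login(user)
```

The point is how little the devs have to do: wrap a block, pick a metric name, and the business-logic numbers land in the same store as your system metrics.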
This is what I have been doing with some companies in SV right now. It's not pretty or fully blown out yet, because it is a big huge problem and our current monitoring sucks. :D
But it IS doable with current stuff, and it is quite amazing to know the answers to questions that were previously only dreamed about.
What's after that? The pie in the sky next level, would be having an app box in every app group running in debug mode, receiving less traffic of course through the load balancers, and loading all that debug data into carbon. Then you get to ask questions about specific bits of a code release and performance on your real production environment.
... so those are my initial thoughts. Any comments? :)
Further, once you have all this, you can write nagios plugins that poll graphite for the values of questions you have created, and then alert not only on system checks and basic app metrics, but on real, complex queries. Stuff like: "How come no one has bought anything off page X in the last two hours? Is it related to these other conditions? Oh, it is. Create me an alert in nagios so we can be warned when it looks like this is about to happen again." With much more data across more areas you can ask about, and alert on, pretty much anything you can imagine. This is how you make it to the next level.
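A sketch of that kind of "nagios plugin that polls graphite" check, assuming graphite's render API JSON (a list of objects with "target" and "datapoints" as [value, timestamp] pairs) and the standard nagios exit codes. The URL, target and thresholds in the comment are made-up examples; note the thresholds here treat LOWER as worse, matching the "no one has bought anything" case.

```python
# Standard nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def latest_value(render_json):
    """Return the most recent non-null datapoint from a graphite render API response."""
    for value, _ts in reversed(render_json[0]["datapoints"]):
        if value is not None:
            return value
    return None

def check_floor(value, warn, crit):
    """Map a metric onto a nagios state where LOWER is worse (e.g. orders per hour)."""
    if value is None:
        return UNKNOWN
    if value <= crit:
        return CRITICAL
    if value <= warn:
        return WARNING
    return OK

# A real plugin would fetch and exit (hostname/target are made up):
# import json, sys
# from urllib.request import urlopen
# url = ("http://graphite.example.com/render"
#        "?target=sales.east.page_x.orders&from=-2h&format=json")
# sys.exit(check_floor(latest_value(json.load(urlopen(url))), warn=5, crit=0))
```

Every saved graphite question becomes one more thing nagios can page you about, which is how the complex queries stop being dashboards and start being alerts.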