What is our primary use case?
We use it to monitor all our production servers, our UAT servers, as well as all the trading applications that are running on them. Additionally, we use some of the other features to monitor some of the network equipment via SNMP. Our NOC is watching the screens I created for them using the Geneos dashboards and the Event Ticker. They open cases for specific teams, depending on where the alert is coming from. It gets routed to the correct group and they resolve the issue.
For the most part, our applications are FIX protocol trading applications. We have a host of products and they're all unique. The monitoring ranges from process monitoring to log file monitoring to API feeds. For example, the trading application has an API which sends data into the ITRS Gateway, which displays the information. We have rules written around that.
We use the Web Dashboard and it's great. You have to take the time to create the dashboards. I'm the only person administering ITRS at this point, and with over 500 servers and a lot of applications I don't have a lot of time to create dashboards.
How has it helped my organization?
When I first started, they were having outages daily. A lot of times it was the customer that was calling saying, "Hey, we're having a problem. Can you please look into this?" The alerts weren't even there. Our company didn't have an ITRS administrator. Someone from the New York ITRS office would come out once a week and do whatever he could, but he wasn't full-time. Once I became full-time, I was able to clean everything up, make sure only the required alerts were being generated, and that allowed the production team to actually see, "Oh, okay, this is broken right now," and enabled them to fix it.
It has allowed us to be less reactive and more proactive when it comes to issues. The consolidation of all the other monitoring applications into one allows the NOC to see everything more clearly as well. It definitely improved the overall functionality of the trading applications because we are able to see stuff alerting much more easily.
In terms of the number of issues we have detected and outages avoided, I wouldn't even know where to begin with that. It's a large environment, so we have issues every day. I don't know that I can quantify them.
What is most valuable?
The log file monitoring is probably what we use most extensively, especially the FKM sampler. That would be the one we utilize the most for scraping log files and looking for our messages.
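The idea behind keyword-based log monitoring can be sketched in a few lines of Python. This is not FKM configuration and it does not use any Geneos API. The pattern table, the severity labels, and the function name `scan_lines` are all made up for illustration.

```python
import re
from typing import Iterable, List, Tuple

# Hypothetical keyword table: (compiled pattern, severity label).
# A real deployment would define these in the monitoring tool itself.
PATTERNS = [
    (re.compile(r"\bFATAL\b|\bsession disconnected\b", re.IGNORECASE), "critical"),
    (re.compile(r"\bWARN(ING)?\b", re.IGNORECASE), "warning"),
]


def scan_lines(lines: Iterable[str]) -> List[Tuple[int, str, str]]:
    """Return (line_number, severity, text) for each line matching a pattern.

    Patterns are checked in order, so the first match decides the severity.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        for pattern, severity in PATTERNS:
            if pattern.search(line):
                hits.append((number, severity, line.rstrip("\n")))
                break
    return hits


if __name__ == "__main__":
    sample = [
        "09:30:01 INFO order gateway started\n",
        "09:30:05 WARN heartbeat delayed\n",
        "09:30:09 ERROR session disconnected from venue\n",
    ]
    for hit in scan_lines(sample):
        print(hit)
```

Ordering the table from most to least severe means a line that matches several patterns is reported once, at its highest severity.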
In terms of the solution's real-time data, it's great. I can't say enough about it. ITRS is very expensive and everyone wants to use stuff that's cheap, so I've evaluated many other products, including Nagios, Check_MK, and some stuff from HPE, and nothing provides a solution like ITRS does. It's definitely the best solution that I've used, as far as real-time monitoring goes. You immediately get a pop-up if something is broken. It's easy to see what's broken and what's not.
The filtering in the Active Console is exceptional. Depending on the user base, some people don't want to see server-level errors, so we have filters set up in the Managed Entities view, which allow us to filter out things that certain groups don't want to see, while allowing them to see other things. It's a great real-time monitoring solution. And you can draw graphs immediately, right from the Active Console, whether they're current graphs or historical graphs.
It also provides lightweight data collection. We have numerous metrics being logged to the database. As an example, in Europe we have the MiFID requirements, where time-tracking on all the servers has to be logged to a database. At any point, regulators can come in and say that they want to see that data. We use ITRS for tracking that. We do the same in the US for a FINRA requirement where we're tracking NTP. We log all the NTP data, including the offset drift, so we can say, "Okay, you're off from this stratum." That gets dumped into the database. A weekly report then runs and is put into long-term storage for seven years. It definitely is good for doing that.
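As a rough illustration of that kind of offset tracking, here is a minimal Python sketch. It assumes samples arrive as (hostname, offset in milliseconds) pairs. The limit `MAX_OFFSET_MS`, the host names, and both function names are hypothetical, and nothing here reflects how ITRS stores or reports the data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical limit in milliseconds; the real value depends on the
# regulation and on the firm's own policy.
MAX_OFFSET_MS = 1.0

Sample = Tuple[str, float]  # (hostname, NTP offset in milliseconds)


def breaches(samples: Iterable[Sample], limit_ms: float = MAX_OFFSET_MS) -> List[Sample]:
    """Return the samples whose absolute offset exceeds the limit."""
    return [(host, off) for host, off in samples if abs(off) > limit_ms]


def weekly_summary(samples: Iterable[Sample]) -> Dict[str, Dict[str, float]]:
    """Per host: number of samples and the largest absolute offset seen."""
    grouped = defaultdict(list)
    for host, off in samples:
        grouped[host].append(off)
    return {
        host: {"count": len(offs), "worst_abs_ms": max(abs(o) for o in offs)}
        for host, offs in grouped.items()
    }


if __name__ == "__main__":
    week = [("ny-01", 0.2), ("ny-01", -1.5), ("ldn-01", 0.4)]
    print(breaches(week))
    print(weekly_summary(week))
```

Keeping the raw samples and deriving the summary from them means the weekly report can be regenerated later if a regulator asks for a different view of the same data.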
What needs improvement?
They have the Webslinger solution where you can see when something is alerting. It's a little bit cumbersome.
For how long have I used the solution?
I've been using Geneos for over 12 years.
What do I think about the stability of the solution?
It's absolutely stable. I've never had a problem with the actual ITRS gateway software.
I definitely found bugs early on and ITRS would correct them pretty quickly. But in the last five years I haven't found any bugs, and if we ever have an outage it's because of either the network or the server that the Gateway is running on. The software itself is absolutely stable. It's probably the most stable software that I'm responsible for at this point.
What do I think about the scalability of the solution?
It's absolutely scalable.
The only place that we have a problem with scalability is with what is called the UL Bridge dashboard. That is an API stream that goes to the Netprobe. We're just sending so much data that sometimes the Netprobe suspends, so we're not seeing the data. That's the only place where we really have an issue. But I don't think it's the ITRS functionality that is responsible. I think it's our software just sending too much data.
In terms of the possibility of increasing usage, everything is pretty stable. The servers that we have them on are all Linux servers with more than enough CPU and memory. I've never really run into a utilization problem on any of the servers where ITRS is running.
How are customer service and technical support?
I don't only administer this application; I'm also responsible for other applications. As an example, I have another application that has been broken for over a month now. I've had to open three separate cases with the company and the issue still isn't resolved. They keep telling me, "Well, it's a different issue. You have to open up another case." But with ITRS support, I can go off on a tangent: "No, I found this and this and this and this," and it all goes under one case and it all gets solved within a day or two.
I've been working with ITRS support for 12 years and I have a very good relationship with the New York office. Plenty of the things that I've recommended have come out within one or two releases of the software.
Which solution did I use previously and why did I switch?
When I started, they had one of the original versions of ITRS Gateway. Now, everything is Gateway 2, but this was the original Gateway. As time went on, we were bought out and another company came in and they were using Nagios. I converted all of their monitoring from Nagios to ITRS. Our current company was using Check_MK, and I took all the servers that they had in Check_MK and brought them into ITRS as well. We wanted our NOC to have a single pane of glass to look at the entire environment. Having them look at an ITRS console, a Nagios console, and a Check_MK console was just too much. So I consolidated everything into one.
Through the migrations, I've learned how to use those other solutions. I even did a proof of concept with Nagios, because when one of the companies saw how expensive ITRS was, they asked me if we could do everything in Nagios that we're doing in ITRS. I attempted to do it, but one of the big problems was our extensive log file monitoring. Right now we have six ITRS Gateway servers, although it's really only three because the other three are just the backups. To create that same solution with Nagios, I would have needed over 20 servers. It wasn't feasible.
I also eventually looked at Check_MK, but the problem was that it's really just for system-level monitoring. It doesn't really get too extensive with application monitoring, and with the amount of application monitoring that we have deployed, I don't think it would have been possible to do with Check_MK.
ITRS is expensive but their service is second to none. And if you have any problems, they usually resolve them within a day or two.
There is no comparison when it comes to the visual presentation, between ITRS and Nagios. The Nagios front-end is horrible. It's very difficult to figure out what's alerting and what's broken. With the ITRS console, it's immediate. If you have your filters set correctly you can see exactly which servers and which managed entities are having an issue.
The time it takes to get an alert is about the same in ITRS and Nagios. It really depends on how things are configured. We have checks in ITRS that are configured for every 20 seconds. Some of them are every five seconds. You can do the same in Nagios. But the actual viewing of the events is much easier in the ITRS console than it is in the Nagios console.
The ITRS gateway is also easier to deploy than Nagios.
Nagios and Check_MK are both cheaper solutions but you get what you pay for. The amount of money that you can save with those solutions would be needed for someone in the background, doing a lot of development work to replicate what we're doing in ITRS. You could get cost savings upfront, but you're going to pay for it in the end with the development work.
What was our ROI?
We have absolutely seen return on our investment with ITRS. The stability of the trading applications with Geneos means it pays for itself.
What's my experience with pricing, setup cost, and licensing?
Pricing is a touchy subject, even here. Upper management always wants us to find a cheaper solution. But we have so much integrated with ITRS. For example, in one of our environments we have extensive client notifications, so if a client session goes down, the client immediately gets an email. It's automated. We don't have to do anything. That's a feature that our clients really like. It's expensive, but it does its job very well. And you set it and go.
What other advice do I have?
Follow the standards that ITRS provides. Their support is second to none and they will always guide you in the right direction.
At any given time we have at least 20 users connected to all the gateways, and that's not everybody because we're a global company. People connect when their offices are open. They are the NOC users. They're all in Manila, and they watch the screens and make sure that, if anything is alarming, a ticket is opened for the correct group. We also have some of our system engineers looking at it for server-level errors, and production engineers who are responsible for the trading applications. They are also looking at the consoles to see if anything's alerting.