What is our primary use case?
We are monitoring uptime, availability, and performance of the trading systems through the ITRS dashboards. We have five segments and we have created all the applications and dashboards for these segments.
We are monitoring online and have created a separate, integrated monitoring room where we have installed all the Geneos dashboard screens and we have a separate group of people monitoring these tools. In case of any alerts, there are mechanisms for escalation.
Currently, we have not only provided dashboards to IT operations, but we have provided them to other departments as well. For example, the hardware team has its own dashboard, and the network team has its own dashboards. The business team also recently started using dashboards.
The types of applications we monitor start with trading applications. We have a cash market and futures and options, and through the ITRS dashboards we monitor all the back-end servers. The only thing that is not covered by the dashboards is the front end, which is used by the members. But all the in-house application servers are covered, as are all the network switches, all the hardware details, and the software applications for the trading servers. Apart from that, there is a separate interface team that is also monitoring its own dashboards.
Essentially, we are trying to include all the business parameters in the monitoring, those which are used most by the business users. Those parameters are being tracked in the ITRS dashboards. We have divided our dashboards into application processes, logs, hardware, and network. All these dashboards are combined and are visible in an Exchange one-view dashboard which is visible at the executive level.
How has it helped my organization?
On a yearly basis, we identify around 50 to 60 incidents through ITRS. Typically, we don't have outages to our critical systems; we haven't had any in the last three to four years. So while ITRS has not had to prevent an outage as such, it does detect one or two critical issues each year. Those issues did not result in outages, but they would have had a major business impact had ITRS not detected them.
What is most valuable?
It enables us to monitor application processes, to do log-monitoring on a 24/7 basis, to do server-level monitoring - all the hardware parameters - as well as monitor connectivity across applications to the interfaces.
The ITRS dashboards monitor real-time data. There are two processes. One is that it reads from files via netprobes that are installed on all the servers. They read the respective online files which are updated every two seconds and then display the online data. The second process is that the dashboards are updated through scripts. ITRS servers run scripts and collect all the data. That is how real-time monitoring is done.
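The file-reading process described above can be illustrated with a minimal sketch. This is not how a Geneos netprobe is implemented; it only shows the general idea of re-reading a status file on a two-second cycle. The `name=value` file format and the function names `read_metrics` and `poll` are assumptions made for illustration.

```python
import time
from pathlib import Path
from typing import Callable, Dict, List


def read_metrics(path: Path) -> Dict[str, str]:
    """Parse a status file of `name=value` lines (an assumed format)."""
    metrics: Dict[str, str] = {}
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        name, value = line.split("=", 1)
        metrics[name.strip()] = value.strip()
    return metrics


def poll(path: Path, interval_s: float = 2.0, cycles: int = 3,
         sleep: Callable[[float], None] = time.sleep) -> List[Dict[str, str]]:
    """Re-read the file every `interval_s` seconds and collect each snapshot."""
    snapshots: List[Dict[str, str]] = []
    for i in range(cycles):
        snapshots.append(read_metrics(path))
        if i < cycles - 1:
            sleep(interval_s)  # wait for the file's next update
    return snapshots
```

A real collector would run continuously and push each snapshot to a dashboard; the fixed `cycles` count and the injectable `sleep` are only there to keep the sketch easy to try out.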
It also provides integration with ticketing tools. Whenever there is an alert, ITRS can directly open a ticket in a particular ticketing tool.
We can also view the logs leading up to an alert: from the time of the alert back to ten minutes, two hours, or one day before it.
We are also able to shift all the rules from one server to another.
And recently, we have started using automated actions when there are critical alerts.
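The look-back on logs mentioned above (ten minutes, two hours, or one day before an alert) amounts to filtering log entries by a time window that ends at the alert. The following is a rough sketch of that idea only; the `LogEntry` shape, the `LOOKBACKS` labels, and the `logs_before_alert` function are hypothetical and are not part of ITRS.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# A log entry is modelled as (timestamp, message); this shape is an assumption.
LogEntry = Tuple[datetime, str]

# The three look-back windows mentioned in the review.
LOOKBACKS = {
    "10m": timedelta(minutes=10),
    "2h": timedelta(hours=2),
    "1d": timedelta(days=1),
}


def logs_before_alert(entries: List[LogEntry], alert_time: datetime,
                      lookback: str) -> List[LogEntry]:
    """Return entries from `lookback` before the alert up to the alert itself."""
    start = alert_time - LOOKBACKS[lookback]
    return [entry for entry in entries if start <= entry[0] <= alert_time]
```

For example, `logs_before_alert(entries, alert_time, "2h")` keeps only the entries logged in the two hours leading up to the alert.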
What needs improvement?
We have introduced many of the monitoring processes in the past five to six months, for the trading dashboards and the business team. We have segmented gateway servers doing the monitoring. Sometimes, if there is a lot of data coming onto the servers, we have observed a little bit of slowness on the gateway servers which are doing the ITRS dashboard monitoring.
I believe the plan is that the tooling team will divide the gateway servers into two, with half of the application trading servers monitored by one gateway server and the other half monitored by another gateway server.
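The planned split of trading servers across two gateway servers can be pictured with a small sketch. How the tooling team will actually divide the servers is not stated; the round-robin policy over sorted names, the server and gateway names, and the `split_across_gateways` function are all assumptions for illustration.

```python
from typing import Dict, List


def split_across_gateways(servers: List[str],
                          gateways: List[str]) -> Dict[str, List[str]]:
    """Assign servers to gateways round-robin so each gets about an equal share."""
    if not gateways:
        raise ValueError("at least one gateway is required")
    assignment: Dict[str, List[str]] = {gateway: [] for gateway in gateways}
    # Sorting first makes the assignment repeatable from run to run.
    for index, server in enumerate(sorted(servers)):
        assignment[gateways[index % len(gateways)]].append(server)
    return assignment
```

With four hypothetical servers `trade01` to `trade04` and gateways `gw-a` and `gw-b`, each gateway ends up monitoring two servers. In practice the split would more likely follow data volume per server than a simple count.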
In our organization, every department is very much dependent on ITRS, so for me the basic concern is contingency planning for ITRS. For example, if a dashboard server stops working tomorrow, something needs to be planned for that. We have not observed any failures in the ITRS dashboards, but because every department depends on them, this is a major concern. The trading server availability is dependent upon the server availability of the dashboards.
For how long have I used the solution?
One to three years.
What do I think about the stability of the solution?
Up until now, there has been no failure in ITRS, and it is currently stable. Because the number of servers, monitoring alerts, rules, and categories is increasing, we have to increase the number of data servers, but there is no problem with the stability of ITRS.
What do I think about the scalability of the solution?
Because we have an enterprise license there is no issue with scalability.
If the number of servers increases, we have enough licenses to cover that. As far as the dashboards are concerned, the number being used by the various departments is fixed.
How are customer service and technical support?
We have received very good support from technical support. All our tickets, all the changes, have been done in the specified time. We received a good amount of support from them during the initial deployment. Although it was a complex architecture, the deployment went very smoothly.
Currently, the dashboard changes, which happen very frequently here, are done very smoothly. We have not had any issues with support.
Which solution did I use previously and why did I switch?
Previously we had an HPE service for monitoring and before that we had Nagios. The flaw in them was that we only received emails. One dedicated person had to continuously monitor the mail to get action taken when there were alerts. What helped us with ITRS was the real-time monitoring, where the alerts are coming in on the GUI itself. This has resulted in faster action when there are alerts. Events are immediately captured in the ITRS dashboard.
We checked various other tools and the monitoring techniques on the market, as well as the techniques used by ITRS. We found that the ITRS monitoring techniques, whether polling or reading the files, were capturing the data more effectively and showing it on a more intuitive dashboard. Here, everything is done based on the trading system that is on the one gateway server. The monitoring techniques that the internal ITRS dashboard is using are more effective than the other monitoring techniques. That's why we opted for ITRS.
How was the initial setup?
The setup was not straightforward because our system is quite complex. There are multiple servers and segments and departments and, at that time, we had various OS versions. We had some challenges.
We deployed in segments. Our first deployment took around eight months. The next segment of deployment took around three to four months. The third segment took another three to four months. Everything together, across all the dashboard deployments and all the segments, took between one-and-a-half and two years.
We also had some migrations planned for the trading department at that time, so we integrated the deployment of the dashboards with those migrations. We deployed ITRS first on the servers that had already been migrated, where the major architectural changes had already happened. If we had deployed on the old servers, we would have had to redo the ITRS deployment effort.
The second point in our strategy was that the critical servers were the trading servers. We did the ITRS dashboards on them first, and then, finally, on the hardware and network. And we have targeted the interface servers for later.
We also integrated this with a latency tool, Corvil.
We had a number of people involved in the deployment. There was a manager as well as someone who looked into the basic ingredients of the ITRS dashboards, the coding, etc. Another person was responsible for the user look and feel, how the GUIs would look, as well as the use-cases. There were three people at that time. Now, managed services has started to use the ITRS dashboards, and that is being handled by our separate tooling team.
When there are any releases or changes made to the trading systems, we inform our tooling team. We create a request for them to make all the changes to the dashboards and they make the changes.
We're now into more of a maintenance process.
Overall, we have about 150 to 200 people using the solution in our organization.
What's my experience with pricing, setup cost, and licensing?
Things like the capacity planning have a separate cost.
Which other solutions did I evaluate?
We looked at three other monitoring tools, but that was four or five years ago. BMC was one of them and there was an HPE solution as well. We looked at them based on top industry reviews.
We considered HPE open-source, but it was lagging in the GUI features, in how fast it displays alerts on the GUI, and in polling and integration with other third-party tools. We found ITRS more useful.
What other advice do I have?
It's a very good tool to use and everybody is very happy with it. We are looking forward to more features.
Not all the data that is being captured is currently being stored completely in the right ITRS dashboards. There is a project in progress for collecting the data and storing it for capacity and numbers purposes. We have seen a demo related to data collection for capacity planning and it looks very useful, as do the capacity reports. But that project is still in the roll-out phase and will take a couple of months.
The next feature we are looking at rolling out is the integration with the ticketing tool. That is planned for the next four to five months.
We are now looking at integrating small things into ITRS. If any incident or issue comes up, the first thing we ask is, "Why isn't it part of ITRS? How can this be integrated into ITRS?" Any small activities, challenges, or issues which we foresee in our day-to-day operations, we look at how we can implement them in ITRS. This is a more proactive kind of approach. So it's not only for current alerts but we can also implement things for the future in ITRS.
I would rate ITRS at nine out of ten. Everything is being monitored by ITRS. The reason it's not a ten is that, because it's an integral part of all our operations, if anything fails in ITRS, we're not sure where we would go. We are almost over-dependent on ITRS.