What is our primary use case?
The reason we started looking for something like this was that we were providing managed services for our customers. My company was a hosting company. We had various stacks that we were maintaining for customers, including PHP, a lot of Java, Node.js, Python; all of the popular languages. It was hard for people to get into APM because, at least in India, New Relic, AppDynamics, and other APM tools were quite costly. So cost was one issue.
A second issue was that we were using different tools for different stacks. Some Application Performance Monitoring (APM) tools for a particular stack would work for, say, Java only, or something would only be for Python, and a lot of configuration would have to be done for each of the different frameworks. So we were trying to find a solution that could encompass everything and that was more user-friendly. I came across this solution, we reached out to them, we did a PoC with them, and that's how it came about.
How has it helped my organization?
The main thing it solved for us, once it was properly configured, was that it was able to not only do the basic monitoring that traditional tools like Zabbix do, but also to integrate that information with the APM data that it collects. If a key monitored value crossed a threshold, it would send out an alert. And the solution is able to intelligently find out if something is beyond the range that it normally resides in.
For example, if the load average for a server is normally around three to four, and then suddenly it's around seven, that's obviously something that we need to be alerted about. It provides this kind of more intelligent alerting.
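To make the idea concrete, here is a minimal sketch of range-based alerting, assuming a simple mean-and-standard-deviation baseline. This is not Instana's actual algorithm, which I don't know; the function name, the `k` multiplier, the `min_spread` floor, and the sample load averages are all hypothetical.

```python
from statistics import mean, stdev

def is_anomalous(history, value, k=3.0, min_spread=0.25):
    """Return True if `value` is more than k standard deviations from the mean of `history`.

    `min_spread` is an assumed floor so that a very flat history
    does not turn tiny fluctuations into alerts.
    """
    baseline = mean(history)
    spread = max(stdev(history), min_spread)
    return abs(value - baseline) > k * spread

# Hypothetical load averages for a server that normally sits between three and four
load_history = [3.1, 3.4, 3.8, 3.2, 3.9, 3.5, 3.6, 3.3]

print(is_anomalous(load_history, 3.7))  # False: inside the usual range
print(is_anomalous(load_history, 7.0))  # True: well outside the usual range
```

A fixed threshold would need to be tuned per server, whereas a baseline like this adapts to whatever is normal for that host.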
It also has auto-discovery. We didn't need to consider much. We just installed the agent on the host and it was able to detect everything from the host level up to the service level, for whatever stack was installed, and that includes containers such as Docker. So a number of use cases were fulfilled with this tool for us.
In addition, as a sysadmin, when an alert comes in that says the site is down or some error has been thrown, normally the first response is to check the logs on the application servers. We find some clue and then we hunt somewhere else. This solution provides the big picture, a look at the whole cluster. So for some obvious things we can just say, "Okay, this is not the issue." We don't need to check multiple screens.
Finally, sometimes it's able to provide a proper root cause analysis (RCA) on its own. It's able to correlate different events that occurred, and that correlation became like an RCA in itself that we were able to provide to the developer or the customer in question. It helped us, as a DevOps consulting team, to quickly pin down issues that happened in production.
What is most valuable?
One of the most valuable features is auto-discovery and that it needs little configuration. This was valuable because we had a variety of stacks to support.
Also, it collects data at a higher granularity than other tools. Other tools will usually collect data at longer intervals, and that's not usually configurable. But Instana collects data much more frequently, around every second and, as the data gets older, it stores samples of it. For something that is an issue, if the data is sampled, you may not be able to figure out the behavior of that application. You may be looking for when a very small spike occurred. If you're only collecting one data point per minute, you may not be able to catch that behavior. If you're collecting more frequently, you can. The granularity is a useful feature.
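The effect of granularity can be illustrated with a small sketch. The readings below are made up, and both the averaging and the one-point-per-minute sampling are assumptions for illustration, not a description of how Instana or any other tool stores its data.

```python
def downsample_mean(samples, window):
    """Average consecutive, non-overlapping windows of `window` samples."""
    return [
        sum(samples[i:i + window]) / window
        for i in range(0, len(samples) - window + 1, window)
    ]

# Two minutes of hypothetical per-second CPU readings (percent):
# a flat 20% with a three-second spike to 95% in the second minute.
per_second = [20.0] * 120
per_second[70:73] = [95.0, 95.0, 95.0]

one_point_per_minute = per_second[::60]            # a tool that samples once a minute
per_minute_average = downsample_mean(per_second, 60)

print(max(per_second))        # 95.0 -> the spike is visible at one-second resolution
print(one_point_per_minute)   # [20.0, 20.0] -> the spike is missed entirely
print(per_minute_average)     # [20.0, 23.75] -> the spike is flattened into a small bump
```

Either way, the short spike that you were hunting for is gone once only one value per minute is kept.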
What needs improvement?
New Relic has a better UI in terms of how it presents the data.
Many managers, as well as our customers, used to ask for reports, such as "top X number of queries that are slow," or "top pages that have the highest number of issues." This is something that can be improved by Instana. Currently, they don't have that kind of reporting available out-of-the-box.
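As a rough sketch of the kind of report people asked for, the following ranks queries by average duration. It assumes the query timings have already been exported somehow as (query, duration) pairs; the function name and the sample records are hypothetical and nothing here uses an Instana API.

```python
from collections import defaultdict

def top_slow_queries(records, n=3):
    """Return the n queries with the highest average duration, slowest first."""
    totals = defaultdict(lambda: [0.0, 0])  # query -> [total duration in ms, call count]
    for query, duration_ms in records:
        totals[query][0] += duration_ms
        totals[query][1] += 1
    averages = {query: total / count for query, (total, count) in totals.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)[:n]

# Hypothetical exported records: (query text, duration in milliseconds)
records = [
    ("SELECT * FROM orders", 120.0),
    ("SELECT * FROM users", 15.0),
    ("SELECT * FROM orders", 180.0),
    ("UPDATE carts", 40.0),
    ("SELECT * FROM users", 25.0),
]

print(top_slow_queries(records, n=2))
# [('SELECT * FROM orders', 150.0), ('UPDATE carts', 40.0)]
```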
For how long have I used the solution?
One to three years.
What do I think about the stability of the solution?
We did have some issues. Java was the best-supported APM platform among the languages they support. We didn't have any performance issues or performance hits with it. On PHP, since we had a lot of different OSs and different PHP versions, I would recommend caution: run it on one particular host and test it properly before running it on all the machines. There were a number of times that they had problems with it. They had to provide new PHP modules because sometimes there were load issues or other issues that cropped up.
We were using a particular version of Python for our own project and that required a significant amount of debugging alongside them. That's one of their more nascent technologies, APM-wise. I don't have a lot of experience with how their other APM sensors go.
PHP was less stable. Java was the most stable. The coverage of Python versions was maybe less than it could be, but we didn't have any stability issues with Python, as such.
What do I think about the scalability of the solution?
Since it gathers a lot of data - it gathers data every second - you should definitely have a lot of disk space around to store all that data. If you are an on-prem customer it shouldn't be an issue. You could just delete the Elasticsearch indexes and you should be good to go. If you want to store APM data long-term, it's not easily configurable. I didn't play around a lot with that aspect of it. We didn't hit disk usage limits, except once for a high-traffic site.
The more traffic that you have, obviously, it keeps gathering a greater amount of data. That can, possibly, cause issues with disk space. But if you have a lot of storage then that shouldn't be an issue.
How are customer service and technical support?
As far as operational complexity goes, we didn't have a lot of issues. Whenever we faced issues with the data not flowing or the like, we could reach out to them via support or Slack. Generally, they would be able to respond and fix the issue, if there was an issue, within a few days or a week.
They didn't have a lot of low-level tech support. I think there were one or two people who would send a query to the appropriate team. But since we were all technical users, we were able to gather whatever data was required beforehand and then create the ticket. We were also a different kind of customer than the normal SaaS customer, so we had more access to their developers on Slack, etc.
Which solution did I use previously and why did I switch?
We didn't have a previous solution, before Instana, and we didn't provide any APM solution for our customers. Instana came about because we had this managed services division and they had a bundle of things that came along with the managed services fee that they were paying every month. Instana was actually bundled for a lot of people as part of that.
How was the initial setup?
The setup that we were running was hosted on our machines. Instana has a SaaS version as well. We were not using that. We were using installations for our customers.
The on-prem version obviously requires more expertise as far as provisioning the appropriate hardware. We ran it on VMs but the on-prem version requires a lot of resources. For production systems, they have recommended 64 GB. We were able to make do with 32 GB just fine, even on production. So those numbers are more indicative of what they planned in the lab. We were able to make do with a lot less.
Suppose something basic like Memcached is running. All the stats that are required are gathered by Instana automatically; you don't have to do much for that. But there are some other services that may require more configuration. One needs to go through everything it has detected in the logs, and that takes a bit of time. It's more of a one-time effort. Once that's done and configured, then you can bake that into the image of that particular stack, or make it part of your configuration management tool, and then it should be fine. The more data points it gathers, the more useful they become together, because they give you heuristics on the whole system. Otherwise, you just have basic, low-level stats like CPU load average and that's something any tool will give you.
If I speak about it from the point of view of where the agent is running, if you are using the SaaS version it's pretty simple. There is just a single one-line command that you have to run and that installs everything. You just have to configure the credentials for different services, like databases, or some particular app that requires auth. Everything else is automated. That is quite a big advantage if somebody has a huge setup. Configuring all of that is a big issue if you don't have the manpower to set up everything beforehand.
To install an agent you just keep it as part of the image, or you can install it on the fly and it will just appear in the required place. It takes around five minutes for each application, even if you're doing it by hand.
Although we were running the on-prem version, we didn't have any dedicated staff, as such. One team did all the deployments and the technical account manager used to look after it, as well as the primary users. I was the one who would be contacted for any complicated issue that cropped up. But other than that, it's not very high maintenance. It pretty much works.
I should say that since we were running the 32 GB setup, we did have certain resource limits being hit sometimes. But that's because we were not following their prescribed requirements. But if it's properly configured, and enough resources are present, at least for the on-prem, it shouldn't require much maintenance at all. One person who has a couple of years of system management experience can easily manage it.
What was our ROI?
I do see a lot of potential for ROI if it's properly implemented and if the alerts that Instana provides are being acted upon. Then it is definitely a good tool to have.
What's my experience with pricing, setup cost, and licensing?
Pricing is quite competitive. Dynatrace, AppDynamics, and New Relic were all several times more expensive than Instana, both the on-prem and the SaaS versions. Price-wise they are a lot more competitive than anyone else out there. This much is sure.
Which other solutions did I evaluate?
We didn't look at a lot of competitors because, honestly speaking, generally what we found after implementing this as part of managed DevOps, was that APM was not at the top of the priority list for a lot of people. That was at least true for the customers that we were serving in India. For many of them, APM as a concept was introduced to them by us. A lot of them did not have know-how regarding APM.
The people who were interested in APM would generally contact us and check out Instana with us. They came from larger organizations or multi-national corps, so they had experience with AppDynamics or SolarWinds, etc. They were mostly IT managers or somebody who was evaluating different products. But for our customers, it wasn't really a priority. Instana was probably the first APM for them, in many cases.
What other advice do I have?
If you are somebody who is already well-versed with the APM world and who has tried out different tools, you will find Instana quite easy to use. There really wouldn't be a lot of things to learn about it.
Go with the SaaS version of Instana and don't bother with the on-premise. Other than that, to configure whatever is being auto-detected by Instana, be sure to check the logs of the agent that is running. If you find anything that the agent is missing, that's stopping it from gathering all the data it can, fix that. If you're using SaaS you don't have to worry about disk space.
There isn't much else to say because most things are provided out-of-the-box. With New Relic, you have to configure each of the different stacks, the agent, etc. Here, that is not the issue.
This is a good tool to implement across the board if you can bake it into your images or make it part of your configuration management tool. As new machines come up and are added into the cloud, your monitoring and APM can also be integrated automatically. Instana has a YouTube channel with small demo videos that show a large number of hosts appearing once they are launched. It's a good tool. It's worth trying out, for enterprise as well.
Apart from our team, I would say about ten people were using it in our organization. Among the customers, it's hard to say. We were the primary users on their behalf. We had just one customer who was a paying Instana customer, and their on-prem was being hosted by us.
In our organization, the users were L1, L2, and L3 system administrators. We had senior solutions architect people who could design solutions for customers. We had technical account managers who came through the ranks of the sysadmins but they were able to do managerial as well as technical tasks. The traditional DevOps role wasn't really present in our organization because those people were directly employed by our customers. We were more ops-centric, rather than doing the build-release automation kind of thing.
It needs to be properly configured. That part is very important. All the sensors that it's able to use to detect, should be properly configured. Otherwise, it's not very useful.
The overhead that comes with Application Performance Monitoring (APM) tools is one concern that some people have. Instana says that they don't do any deep code profiling, and that they don't have logic that thinks about when and what to profile. Some competitors like Dynatrace do have that feature. Instana says that since they don't do deep profiling, they don't have the extra overhead that these other tools have. I haven't compared them side by side so I can't say if this is true.
Also, restarts of application components are not really required for Instana, except for the very first time when you install the Instana agent; then, you may have to restart the application server. A number of other tools do require that type of restart, but Instana does not.
I would rate it a seven out of ten, primarily because some things, like PHP and Python, are not quite as mature as Java. That's the main issue, stability-wise. But if you're on a Java stack, you can go ahead without any issues.