Threat Stack Review

Enables us to monitor every production command that developers run and alert on suspicious commands


What is our primary use case?

We're using Threat Stack for multiple purposes. We use it for file integrity management and we also use it as an intrusion detector, using it to monitor the interactive sessions on our Linux machines. We also do CloudTrail analysis and alerting.

How has it helped my organization?

We have about 210 microservices that make up our product. There are over 140 developers who have access to production, and they can troubleshoot but they're not allowed to make changes. We have to give them enough access to do their troubleshooting while ensuring that they aren't making any changes to the production system. The only way to do that is to monitor every command that they run in production and alert on those commands that are suspicious. It's working to ensure our developers are doing the right thing.

It can also provide a warning that someone from the outside may have compromised a machine. If somebody runs a suspicious command, like whoami or netcat or curl, or any of those kinds of commands that you don't expect, we're immediately alerted. It's a really great tool for that, and we can specify really granular rules.

The things that our developers are normally allowed to do and that we expect to happen, those aren't going to alert somebody. But the things that we don't expect to see in a production environment, those will alert somebody, and we'll move very quickly on them.
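
To make the idea of alerting only on unexpected commands concrete, here is a minimal Python sketch. It is not Threat Stack's rule syntax. The watch list contains only the commands named above, and the function name `is_suspicious` is a hypothetical chosen for illustration.

```python
import shlex
from pathlib import PurePosixPath

# Hypothetical watch list, limited to the commands named in the review.
SUSPICIOUS_COMMANDS = {"whoami", "netcat", "curl"}


def is_suspicious(command_line: str) -> bool:
    """Return True when the executable in command_line is on the watch list."""
    try:
        tokens = shlex.split(command_line)
    except ValueError:
        # Unbalanced quotes cannot be parsed; flag the command for a human to review.
        return True
    if not tokens:
        return False
    # Compare on the executable name so "/usr/bin/curl" and "curl" are treated alike.
    executable = PurePosixPath(tokens[0]).name
    return executable in SUSPICIOUS_COMMANDS
```

A real rule set would be far more granular than a flat list, for example by also taking the user, host, and arguments into account.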

The way Threat Stack has improved our organization is directly related to how our production environment works and how we monitor it. The improvement is that we are able to get PCI certification. We use Threat Stack as a compensating control for PCI. We do have developers who have access to our production environment, so we don't have the traditional separation of duties that PCI would like, where the developers who write the code don't have access to production. But we're able to show the compensating control, that we monitor everything that happens in production and that there are no changes made in production by these developers. Threat Stack gives us that ability to implement a compensating control and show it. We were able to get PCI with this control.

The rules definitely give us more visibility and control over what's being triggered. We are able to monitor our environment and see what is normal. When we first installed Threat Stack, we obviously had a lot of alerts. Over time we have been able to see which of those are normal, for example, alerts triggered by automation running in the environment. We don't need to see these as alerts. They are expected, authorized actions that aren't caused by users and wouldn't be caused by a bad actor. They're simply automation. We are able to write very granular rules that recognize that automation and no longer alert us on it, so we've been able to cut the alerts down to a manageable level.
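
The tuning described here, filtering out alerts that match known automation, can be sketched roughly as follows. This is an illustration only: the `Alert` and `SuppressionRule` types, their fields, and the example account and command are all assumptions, not Threat Stack's data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    user: str      # account that ran the command (hypothetical field)
    command: str   # full command line (hypothetical field)
    severity: int  # 1 is the most severe, 3 the least


@dataclass(frozen=True)
class SuppressionRule:
    user: str            # automation account the rule applies to
    command_prefix: str  # expected command, matched as a prefix

    def matches(self, alert: Alert) -> bool:
        # Both conditions must hold, which keeps the rule narrow.
        return alert.user == self.user and alert.command.startswith(self.command_prefix)


def filter_alerts(alerts: list[Alert], rules: list[SuppressionRule]) -> list[Alert]:
    """Drop alerts matched by any suppression rule and keep the rest."""
    return [a for a in alerts if not any(r.matches(a) for r in rules)]


rules = [SuppressionRule(user="deploy-bot", command_prefix="systemctl restart")]
alerts = [
    Alert("deploy-bot", "systemctl restart api", 3),
    Alert("alice", "systemctl restart api", 2),
]
remaining = filter_alerts(alerts, rules)
```

Matching on both the account and the command means the same command run by a person still raises an alert.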

In terms of our cloud infrastructure, one of the things that we get from it is that we now have a baseline of normal. What do we expect to see? What are normal operations? From a security standpoint, what's going on that is the average, that we expect, and what is an outlier? This is one of the tools that allows us to say, "Okay, this is our normal baseline, these things are outliers." And even if they don't reach the alert level of a Sev 1, they're still outliers that we're logging as Sev 2 and Sev 3, and we're still looking at those every day just to see what patterns are changing.

In addition, we use Threat Stack for SOC 2 auditing and it saves us time for the same reason I noted about the separation of duties. It's a tool that we use in the SOC products to show how we're monitoring what happens in our production environment. We use it as a compensating control for the lack of a separation of duties.

Finally, Threat Stack has cut down on the time needed to investigate potential attacks by about 75 percent. It's much faster now.

What is most valuable?

The number-one feature is the monitoring of interactive sessions on our Linux machines. We run an immutable environment, so nothing is allowed to be changed in production. All changes have to happen in development, and then new systems are built in production. The only thing that is allowed in production is troubleshooting: finding out what the issue is. The fix itself has to be made in development. We're constantly monitoring to make sure that no one is violating that. Threat Stack is what allows us to do that.

The solution's ability to consume alerts and data in third-party tools, via APIs or via export into S3 buckets, is working very well. We use the API to send monitoring to PagerDuty, and we've started using the API to feed other systems. We have it going out to a Slack channel, and we've got some going into our automation. We're doing more and more with the alerting now. We're working directly with Threat Stack to use their APIs as they've recently been expanded.
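
As a rough illustration of forwarding an alert to PagerDuty, the sketch below builds a trigger event in the shape used by the PagerDuty Events API v2. The alert dictionary, its field names, the severity mapping, and the routing key are assumptions made for this example. They do not reflect Threat Stack's API schema, and the code only builds the request body without sending it.

```python
import json

# Assumed mapping from Sev 1-3 to PagerDuty's severity values.
PAGERDUTY_SEVERITY = {1: "critical", 2: "error", 3: "warning"}


def build_pagerduty_event(alert: dict, routing_key: str) -> dict:
    """Build a PagerDuty Events API v2 trigger event from a hypothetical alert dict."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": alert["title"],
            "source": alert["hostname"],
            "severity": PAGERDUTY_SEVERITY[alert["severity"]],
        },
    }


# Hypothetical alert; a real one would come from the monitoring tool's API or export.
example_alert = {
    "title": "Suspicious command: netcat",
    "hostname": "prod-web-01",
    "severity": 1,
}
event = build_pagerduty_event(example_alert, routing_key="EXAMPLE_ROUTING_KEY")
body = json.dumps(event)  # JSON body that would be POSTed to PagerDuty
```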

We're logging into S3 to do a little more in-depth research on what our alerts are, and we're also consuming CloudTrail events, which is a fairly recent update to Threat Stack, enabling us to alert on suspicious activity in CloudTrail.

What needs improvement?

The API was an area that had needed improvement. It has grown quite a bit and they have just recently come out with new improvements, so we're still learning it and I can't say whether it still needs work.

I'm looking forward to their code analysis, which is coming out as a result of an acquisition they made.

For how long have I used the solution?

Three to five years.

What do I think about the stability of the solution?

It's been rock-solid. We've never had an issue with Threat Stack.

What do I think about the scalability of the solution?

No issues with the scalability. We run over 5,000 production instances on it.

We have very few users. There are only seven people who have access to the Threat Stack console and they're all security engineers.

How are customer service and technical support?

We have used Threat Stack's technical support and they've been great. We contact Threat Stack and we hear back pretty much immediately.

Which solution did I use previously and why did I switch?

We used Trend Micro Deep Security. The issue was a problem in the agent that goes on the servers that was causing our servers to crash. It happened a couple of times and the support wasn't what we wanted, so we decided to change products. We couldn't handle that kind of outage.

I can't say there has been a decrease in the mean time to remediation because it's not really an apples-to-apples comparison. Trend Micro had different capabilities.

How was the initial setup?

The setup was very straightforward. The rules are easy to write. It's common language, it's not anything arcane where you have to learn how to write in their language.

The initial deployment only took us a couple of weeks. I'd say that we were comfortable with it within a couple of months, as far as our base-level tuning. We're tuning it forever, constantly reanalyzing our environment and making tweaks to it.

Our implementation strategy was that we deployed it in our production environment and simply monitored with the stock rule set that they gave us. Then we started trimming it from there based on what we saw as normal in our environment. We started writing granular rule sets based on the alerting that we were getting. We were also patching it with another tool that we have called Sumo Logic, where we do logging and alerting. We were using that to get some of the information on what we wanted to see and creating queries based on that.

It was built with two people, because that's all that our security team was at the time that we deployed this. We currently manage it with five people, all of them security engineers.

Tuning is really simple. It's a matter of monitoring the alerts that come in, whether they're Sev 1 through Sev 3, and determining whether they are normal, expected, and part of the baseline, and then filtering them out. Or, if they're something that is not expected, or something we want to know about, we increase the severity to a higher level so that they're treated differently. We have a different action for each of Severity 1, 2, and 3: page the engineer, email an on-call engineer immediately, or just send a daily wrap-up email. We're constantly looking at that to see if we want to change the actions.
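
The per-severity actions can be pictured as a simple routing table. The sketch below is a minimal Python illustration that assumes Sev 1 pages the engineer, Sev 2 emails the on-call engineer, and Sev 3 goes into the daily wrap-up email. The action names and the `route_alerts` function are hypothetical.

```python
from collections import defaultdict

# Assumed mapping of severity to action; Sev 1 is the most urgent.
ACTIONS = {1: "page", 2: "email", 3: "daily_digest"}


def route_alerts(alerts: list[tuple[int, str]]) -> dict[str, list[str]]:
    """Group (severity, message) pairs by the action that should handle them."""
    routed: dict[str, list[str]] = defaultdict(list)
    for severity, message in alerts:
        if severity not in ACTIONS:
            # Fail loudly rather than silently dropping an alert.
            raise ValueError(f"unknown severity: {severity}")
        routed[ACTIONS[severity]].append(message)
    return dict(routed)


routed = route_alerts([
    (1, "netcat run on prod-web-01"),
    (3, "config file read by support user"),
    (2, "curl to unknown host"),
])
```

Keeping the mapping in one table means a change of action for a severity level is a one-line edit.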

What about the implementation team?

We did the deployment ourselves.

What was our ROI?

We have seen return on investment but I can't come up with a number because of how much we've changed. When we had Trend Micro, we had only some 500 instances, and now we're at 5,000.

What's my experience with pricing, setup cost, and licensing?

I honestly don't know what pricing would compare to, because there wasn't a whole lot on the market at the time. It came in cheaper than Trend Micro when we purchased it a few years ago. It seemed to me to be priced well.

Which other solutions did I evaluate?

We looked at what was going on with open-source, with OSSEC, and doing it ourselves. That did not prove to be scalable.

What other advice do I have?

The best way really to demo and implement is to deploy it with the standard rules that come with it and simply monitor the environment for about a month, just to get a baseline before going and adjusting rules and customizing.

We are growing. Our product grows 100 percent, year-over-year. That doesn't increase our instance size 100 percent, but we do grow. We are expecting to continually grow for quite some time.

Disclosure: IT Central Station contacted the reviewer to collect the review and to validate authenticity. The reviewer was referred by the vendor, but the review is not subject to editing or approval by the vendor.