What is our primary use case?
LogDNA is our root-cause-analysis tool for on-call channels, and we're now also using it for load testing. We have a silverized service in production. When an issue hits the on-call channel, meaning a customer has run into some type of service issue, we receive an alert and use LogDNA to RCA why that particular alert fired. We use it to trace customer requests and customer interactions, as well as to study our own service to make sure we don't trip up in our transactions or on the services we depend on.
The product is our only SIEM. We've become more skilled at using it; it's a different way of finding events and collecting the evidence to explain a certain behavior. Over the past six months we've certainly learned to use it as a team. Some types of flows can be automated, and we use that particularly for on-call tasks. I've seen a lot of progress, but it's mostly an increase in team skills. All our team members use LogDNA, not just the support team or SRE team. Everyone has something to contribute.
We deal with large companies that have the resources, and also the complexities, that come with their size. The product we service is a key management product. If something doesn't work, you have very big customers who become anxious. We collaborate closely with LogDNA.
Our team is about 100 people. We have site reliability engineers who use the tool for deep RCAs. We have four levels of support, and the SREs do capacity, latency, and performance testing, plus any type of really confusing or complicated RCA that may pertain to account compatibility or networking hiccups. We also have an on-call group of about a dozen people who deal with customer issues, checking that nothing is seriously wrong with the service. Finally, we have the developers, who use secondary transaction-debugging tooling. If a new feature is rolled out, they can use the tool to track a transaction going through the system.
How has it helped my organization?
When an alert is received, our SLA target is five minutes. Prior to having LogDNA, we missed our SLA fifty percent of the time because of the time required to consolidate syslogs, service logs, and API logs. LogDNA brings all the logs together in an interleaved stream, which allows us to take a transaction and relate it to other contextual events, making the gathering of evidence for auditors and our internal RCAs much more productive. We are now hitting that SLA pretty consistently, only exceeding the five-minute target on complex SRE failures, typically those involving upstream and downstream external services.
What is most valuable?
The most valuable feature is the fact that the solution aggregates all event streams, so that if there's a file access issue or an HTTP server or gRPC server issue, it's all in the same interface. This allows you to very quickly isolate the context. There is a mechanism that lets you say you're interested in a particular event, find that event easily, and then ask for its context: the events that led up to it, and those that followed. Those are two great features that make it very productive to do RCAs.
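Conceptually, the context mechanism described above behaves like taking a window over the interleaved event stream around the event of interest. Here is a minimal Python sketch of that idea (an illustration of the concept only, not LogDNA's actual API; the function name and sample events are hypothetical):

```python
def context_window(events, index, before=2, after=2):
    """Return the events leading up to and following the event at `index`.

    `events` is an interleaved, time-ordered stream; the slice is clamped
    so asking for context near the start of the stream doesn't fail.
    """
    start = max(0, index - before)
    return events[start : index + after + 1]


# Hypothetical interleaved stream: file, HTTP, and gRPC events side by side.
stream = [
    "open /etc/keys.conf",
    "GET /v1/keys 200",
    "grpc Unwrap() deadline exceeded",
    "GET /v1/keys 500",
    "retry Unwrap()",
]

# Pick the suspicious event (index 2), then ask for its context.
ctx = context_window(stream, 2, before=1, after=1)
```

The payoff is that once you have found the interesting event, the surrounding evidence comes with it, which is exactly what speeds up an RCA.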
LogDNA is a consolidated event stream, which is great. All of the instrumentation is in a single pane, so the aggregation of all event streams is superb. Querying that state is a very interactive and intuitive process, which is a valuable feature and much more intuitive than other tools we've used in the past.
What needs improvement?
Scalability could be improved. We are using it through the IBM Cloud deployment, and on some of the data centers that are very heavily used there is a significant lag in the event stream, sometimes 10 or 15 minutes behind, which makes an RCA impossible. If an event hits but you don't have the information to look at it, then it's tricky. This is probably not an issue of the product itself, but more a deployment issue. There is something on the IBM side that needs some readjustment to make certain these lags don't happen too often. We now use other tools as a backup in that area. But if you really want to do SIEM-type work, then that is an aspect that needs some improvement. It's hard to tell if it's the product or the IBM deployment of it.
The user interface is really very productive interactively but for an additional feature, it would be nice if we somehow could encapsulate a query or a filter, and communicate or share that among the team so that specific types of actions can be carried out quickly. In particular, when we deal with a customer issue, it may pertain to a particular transaction through the system and each transaction has a unique ID. It would be great if we could query that ID and request all transactions that pertain to a specific ID. For now, we need to find the events, then extract the ID. Once we have that, we can go through the UI to set up the query and filter it to give us a transaction. But it would be really nice if we could simply say, "Here's the ID. Give me all the transactions."
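The workaround described above — find the event, extract its transaction ID by hand, then set up a new filter — can be sketched in a few lines of Python (the log format and the `txn=` field are hypothetical, invented for illustration; they are not LogDNA's schema):

```python
import re

# Hypothetical log format: each line carries a "txn=<hex id>" token.
TXN_ID_RE = re.compile(r"txn=(?P<id>[0-9a-f]{8})")


def extract_txn_id(line):
    """Pull the transaction ID out of a single log line, if present."""
    m = TXN_ID_RE.search(line)
    return m.group("id") if m else None


def filter_by_txn(lines, txn_id):
    """Return every event belonging to one transaction, in stream order."""
    return [line for line in lines if extract_txn_id(line) == txn_id]


stream = [
    "00:00:01 api txn=deadbeef request received",
    "00:00:02 api txn=cafebabe request received",
    "00:00:03 kms txn=deadbeef key unwrapped",
    "00:00:04 api txn=deadbeef response sent",
]

# Step 1: find the event of interest and extract its ID.
txn = extract_txn_id(stream[0])
# Step 2: re-query the stream with that ID to see the whole transaction.
events = filter_by_txn(stream, txn)
```

The feature request above is essentially to collapse these two steps into one: hand the UI an ID and get back the whole transaction.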
For how long have I used the solution?
I've been using the solution for six months.
What do I think about the stability of the solution?
It's a very solid solution.
What do I think about the scalability of the solution?
There is a lag at scale. We have a global deployment, 17 regions throughout the world, with one LogDNA instance per region. In the regions with very high traffic we are seeing a lag, but in the regions with fewer customers it is fine. We're talking about sites that have tens of thousands of customers, so this is not a small-scale deployment; this is cloud scale.
How are customer service and technical support?
I don't have a whole lot of direct interaction with LogDNA support as my work is mainly with the continuous integration and site reliability teams. I always do it through our service groups.
Which solution did I use previously and why did I switch?
We previously used the standard ELK stack: Elasticsearch, Logstash, and Kibana. We transitioned six months ago for two reasons. First, the old database of events was getting too full after seven days, and second, that meant we only had seven days of history; LogDNA increases that to 30 days. We now use LogDNA solely as our MO for events and logging.
How was the initial setup?
Initial setup was probably a three-month effort to make it production-worthy. It wasn't rocket science, but it was complex enough that you had to pay attention. LogDNA requires a container sidecar deployment, so we needed to change our infrastructure and redesign our event-stream solution to be able to fire up these sidecars, and to monitor and secure them. We prepared a prototype, and that took probably three months. We now have half a dozen people integrating the continuous deployment pipelines with LogDNA tooling.
What about the implementation team?
We are skilled enough that we were able to do the implementation ourselves. The plan was to create a new implementation alongside our service in a controlled environment. With that controlled environment we could scale to our production size. Once we were satisfied that all was working, we wrapped it up in our production continuous integration tooling and deployed it as part of production. It was a straightforward implementation plan.
What was our ROI?
This is not an ROI question: it is a necessary technology to support the scale-out of our service to the tens of thousands of customers across the globe who depend on it to secure their data assets.
What other advice do I have?
There are enough moving pieces that can go wrong that it requires a very broad skill set in the team. I would advise others to make sure you have the breadth of skills on your team to integrate the product properly into your whole service, to customize it, and finally, to actually use it properly. These are nontrivial things: you need to understand the interface with your own software in order to use the tool properly, and you need enough skill that if something goes wrong, you can figure it out and fix it.
If you look at the progression, a year ago we were missing the SLA target. Now, we can pretty much satisfy that SLA consistently. For complex RCAs, the 30-day window we now have has given us a lot more time to do a deep RCA. Particularly with networking issues which are very difficult because they tend to be transient, we now have a fighting chance to find problems, explain them and then to take action.
I would rate this product a 10 out of 10 for the simple reason that its productivity is so much higher than we previously had. The query UI is splendid, very productive. I haven't seen anything better.