What is our primary use case?
The primary need is to really understand where our traffic is going, not just the transit ASNs — we know that — but where else is it going? How much traffic are we sending to those other ASNs?
Of course, DDoS is another use case for us. We have identified DDoS.
And we're also using alerting now to help us understand when service owners are perhaps utilizing more than they should.
How has it helped my organization?
We had an event with one of our service centers, internally, and we were able to get them to understand that they were causing adverse effects for our customers on our circuits because they were over-utilizing circuits when they should not have been doing so. Kentik allowed us to peel back the entire network aspect of what they were doing and it allowed us to get an agreement from them that they would police themselves regarding their traffic, so that we did not have to do so for them.
And it allowed us to continue to have shared resources rather than duplicating everything. We were able to continue to allow them to utilize our transit, or our shared network connections, rather than saying, "Okay, you can't use this anymore. You have to duplicate everything." As a result, we're saving, in this case, about $40,000 a year, because we're not duplicating the network. If you understand what's happening, you can say, "Okay, this is what you can do, this is what you can't do." You can't get to that point unless you understand what's happening first, and Kentik allowed us to do that.
The solution has proactively detected network performance degradation or anomalies. For instance, right now I'm tracking another service center that is trying to provide a backup solution going to one of the cloud providers. What's happening is that their traffic is not hashing, it's not load-balancing over multiple circuits. I can easily prove that because I can pull up the circuits and see all of the flows from this particular service owner going over one circuit. That's an anomaly Kentik detected and I can go back to the service center and tell them. And it alerts me when it's happening, when it's getting too high, when it's about to saturate the circuit. It then tells me, "Oh, by the way, they're doing it again." That is very helpful.
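The two checks described above (one circuit carrying nearly all of a service owner's flows, and a circuit getting close to saturation) can be sketched in a few lines of Python. This is only an illustration of the idea, not how Kentik implements its alerting. The function name, the circuit names, and both 0.8 thresholds are assumptions made for the example.

```python
def check_circuits(bps_by_circuit, capacity_bps, share_limit=0.8, util_limit=0.8):
    """Flag uneven load-balancing and near-saturation for one service owner.

    bps_by_circuit: dict of circuit name -> bits per second from this owner.
    capacity_bps:   dict of circuit name -> circuit capacity in bits per second.
    Both thresholds are hypothetical defaults, not vendor values.
    """
    findings = []
    total = sum(bps_by_circuit.values())
    for circuit, bps in bps_by_circuit.items():
        # Share of the owner's traffic carried by this one circuit.
        if total > 0 and len(bps_by_circuit) > 1 and bps / total >= share_limit:
            findings.append((circuit, "not load-balanced"))
        # Utilization of the circuit by this owner's traffic alone.
        if bps / capacity_bps[circuit] >= util_limit:
            findings.append((circuit, "near saturation"))
    return findings


# Hypothetical example: two 10 Gbps circuits, almost all traffic on one.
traffic = {"circuit-a": 9.0e9, "circuit-b": 0.2e9}
capacity = {"circuit-a": 10e9, "circuit-b": 10e9}
print(check_circuits(traffic, capacity))
```

With the made-up numbers above, only circuit-a is flagged, and it is flagged for both conditions, which matches the situation described: everything from one service owner riding a single circuit.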
The drill-down into detailed views of network activity helps to quickly pinpoint locations and causes, especially if you set it up properly so you have all your routers and your interfaces. It's super-easy. In this case, it sends me an alert. I pull up the dashboard and it's all right there. It tells me everything. For example, when I pull up the alert that I got this morning it gives me a traffic overview and tells me, before I've done anything in the source or destination ASNs, which service center it is, if I have a separate ASN for them. It shows where it's going and how much traffic is spiking. It gives me the total traffic in bits per second and packets per second, as well as source country, destination country, subnet, everything. It's telling me exactly who, what ports, and everything that is causing the anomalous traffic. If you have it pre-set-up, it just takes you through to the dashboard with everything already there. That's super-helpful because I can go back to the service center and tell them that they're saturating the link and this is how they're saturating it. I have proof.
I have also used Kentik's months of historical data for forensic work, especially at my old job. I was at a service provider previously and we got DDoS'd all the time, constantly. It was much easier for me to go back in time and look at some of these DDoS events and look at the signatures so I could figure out which buckets most of them fit into. I could say, "Okay, I had this many incidents, these are the different types of issues I saw, and maybe if we take these actions we might be able to stop this kind and that kind of DDoS." It was much easier for me to go back and look at it as a holistic view.
In addition, it has decreased our mean time to remediation for anomalous traffic events. I'm not on the operations team, but it has certainly allowed the operations team to detect and figure out what's happening much more quickly than they previously could.
At my previous company, it probably went from about a 30-minute detection to about a ten-minute detection, and that included making sure we understood which IP address was being attacked. As a service provider you can see what the interface is, but the question is which IP address on the interface is being attacked. That's the thing that you get much faster and you're able to surgically black hole that IP address, as opposed to shutting down the entire port for the customer. That kind of thing is huge.
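Finding which IP address on an interface is being attacked amounts to aggregating flow records by destination and taking the largest. A minimal sketch of that aggregation follows. The record keys (`interface`, `dst_ip`, `bytes`), the interface names, and the sample data are all hypothetical, and this does not represent Kentik's data model or API.

```python
from collections import Counter


def top_destination(flows, interface):
    """Return (dst_ip, total_bytes) for the busiest destination on an interface.

    flows: iterable of dicts with hypothetical keys
           "interface", "dst_ip", and "bytes".
    Returns None if the interface has no flows.
    """
    bytes_by_ip = Counter()
    for flow in flows:
        if flow["interface"] == interface:
            bytes_by_ip[flow["dst_ip"]] += flow["bytes"]
    if not bytes_by_ip:
        return None
    return bytes_by_ip.most_common(1)[0]


# Hypothetical records using documentation address ranges (RFC 5737).
sample = [
    {"interface": "xe-0/0/1", "dst_ip": "203.0.113.10", "bytes": 9_000_000},
    {"interface": "xe-0/0/1", "dst_ip": "203.0.113.20", "bytes": 40_000},
    {"interface": "xe-0/0/1", "dst_ip": "203.0.113.10", "bytes": 7_500_000},
    {"interface": "xe-0/0/2", "dst_ip": "198.51.100.5", "bytes": 12_000},
]
print(top_destination(sample, "xe-0/0/1"))
```

Once the single target is known, only that address needs to be black-holed, and the rest of the customer's port stays up.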
Kentik has also improved our total network uptime. We're able to check the customer-affecting incidents much faster than we previously were. And at my previous company I can say wholeheartedly that it improved uptime, because when you can detect what is being attacked, you're not shutting down ports, you can get to the router faster, and the router is not falling over anymore because it's being attacked.
In terms of improving on the number of attacks we have to defend, at the previous company I would say it did because I did all the analytical work, and we were able to determine a couple of different types of attack that we might be able to defend a little bit better. Here, it has reduced the number of internal incidents we've had. Service owners are not really thinking properly about how they're using the network and have service-affecting incidents that they didn't know about. If you point it out, and you have data for it, they stop doing it. Before, we weren't really able to point it out in a way that they understood. Now, it's much easier for us to detect it, clearly determine that it was them, and then say, "Could you stop this? Don't do that."
What is most valuable?
The analytics part is really important for me. I have seen some things pop up periodically that I did not expect, so it is important for me to dig into them. The ability for me to look at the traffic and see where it's going to is extremely important.
I really love the Data Explorer. I use it all the time to go in and craft exactly what I need to see. I'm able to then take that story and explain it to the executives. I've done that a couple of times and it is helpful.
And I'm really liking the alerting. It's super-helpful.
In terms of the solution’s real-time visibility across our network infrastructure, I have not been able to find any other monitoring or netflow visualization tool that gives me the kind of information I get from Kentik. If I need to take a deep-dive into something that I see, it's really easy for me to do that. Whereas with most other things, I have to use five or six other tools to get that kind of data, with Kentik, I have it all in one place. Data visualization is extremely important.
What needs improvement?
I've checked out the V4 version of the interface and it's still a little bit clunky for me to use. I still go back to the old interface. That's definitely one that they still need to work on. It doesn't seem like everything that you get in the V3, the older interface, is there. For instance, I was trying to add a user or do the administrative tasks in V4, and I couldn't figure out where I was supposed to do that. The interface just wasn't working for me so I went back to V3 to do that stuff.
Also, with the alerting page, that traffic overview page, sometimes I really want to share it with someone. Usually, you can get a quick URL on most of the other pages to share that particular view, but I can't do that on the traffic overview page that is given to me from an alert. That would be really helpful.
For how long have I used the solution?
I started talking to Kentik in 2014. We got it all installed in 2017, so I've been using the product as an engineer for two to three years. At my previous company, we talked to them originally when they were CloudHelix, before they changed their name to Kentik. Eventually, we managed to finally get them into the network, which was awesome.
What do I think about the stability of the solution?
Generally speaking, I have found it to be fairly stable. Do they have periodic outages? Yes. But almost never is the whole thing down, it's just one aspect that is down. I haven't really had an issue with it.
What do I think about the scalability of the solution?
At my previous company we probably had one of the largest installations ever. I would say Kentik is fairly scalable. That company is one of the biggest ISPs in the world. They had 200,000 netflow flows per second. So it's pretty scalable. The scale I'm dealing with now is so minimal in comparison. It's a different world.
How are customer service and technical support?
I have used technical support a few times and they have been knowledgeable and easy to work with. I really like them. I haven't had any issues at all. I've dealt with a lot of vendors, so it's like a breath of fresh air for me.
If you've ever dealt with Cisco before, or a telecom vendor, you know what I mean. But I send an email to Kentik and within a few hours I've got something back asking me a couple of questions and helping me fix the problem. It's a vastly different experience because if you try to do that with Cisco, for instance, or one of the network equipment vendors, you're going to be in for a very long process. And if you call a telecom company, same deal. You're probably not going to get a human the first time. If you get an email, it's going to be automated. It's just going to take forever. But Kentik is very quick.
Which solution did I use previously and why did I switch?
We have DDoS mitigation providers but they don't really provide the analytics. They detect and mitigate, but they don't really provide you any information on what's really happening.
At my previous company we had tried, several times, to build our own solution, and I can tell you that it was not terribly successful. We could only ever get analytics on one very small use case, as opposed to all of the use cases that Kentik has. I was intimately involved with each one of those attempts, so I can tell you it was not easy.
How was the initial setup?
At my previous location the solution was on-prem and I helped with the entire process of getting it into the network. I helped them do the proof of concept, I helped do the executive briefing, I helped do the modeling of the entire implementation, and I also helped and worked on the implementation itself.
Because it was an on-prem setup I found it pretty straightforward. We had to do a whole bunch of work on the network to get it working properly because you have to change all of your configurations to make sure it's sending to the right locations, but otherwise, it went very smoothly. They told us we had one of the fastest implementations ever. From the time that we actually started the implementation, it was only about a month, and we actually got all the routers in there too. And that was with a huge, massive, on-prem installation. Probably one of their biggest ever.
For the servers, Kentik worked with our IT department, but for the network stuff, for anything that was on the routers, we deployed it ourselves.
At my current company, they have the cloud solution, and I was not a part of the installation. I'm not sure why they decided to go with cloud versus on-prem. I don't understand it. I know why my other company went on-prem but I don't know why they did cloud versus on-prem here.
What about the implementation team?
I worked with Kentik directly and I had a very good experience with them. They were knowledgeable, helpful, and easy to work with. They used Slack and it was very easy for us to communicate with them, even across teams. They were working with our IT team and the backbone engineering team. It was very easy.
What was our ROI?
We are working in one country where transit is very expensive and Kentik has allowed us to identify those peers we're sending traffic to so that we can then get onto the exchanges in that country and significantly reduce the cost of our transit in that country.
If you're talking about Japan, it tends to have higher transit costs. We brought up our exchanges and then we targeted a lot of peers so that we're not spending five bucks a meg or so to send traffic to those peers.
I don't have a final cost analysis yet, but I can tell you that the IX is much cheaper than the transit is. And our customers are getting much better latency. Our latency numbers have decreased by about ten percent because we're peering directly with the customer at an exchange. That's one place I can say the ROI is great.
We did the same thing with any of our transit providers as targets. If we can privately peer with someone somewhere, rather than have them go over transit, we target those peers and pull them off of the transit. Anytime we can do that, it's much cheaper.
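The arithmetic behind pulling peers off transit is simple enough to sketch. In the example below, the price of 5 per Mbps is the rough transit figure mentioned above, treated here as a monthly price, which is an assumption. The traffic volume and the exchange port fee are made-up placeholders, since no final cost analysis is available.

```python
def monthly_savings(shifted_mbps, transit_price_per_mbps, ix_port_cost):
    """Estimated monthly savings from moving traffic off transit onto an IX.

    shifted_mbps:           billable Mbps moved from transit to IX peers.
    transit_price_per_mbps: transit price per Mbps per month.
    ix_port_cost:           flat monthly cost of the exchange port.
    """
    transit_cost_avoided = shifted_mbps * transit_price_per_mbps
    return transit_cost_avoided - ix_port_cost


# 5.0 per Mbps is the rough figure from the review; the 2,000 Mbps volume
# and the 3,000 port fee are hypothetical placeholders for illustration.
print(monthly_savings(2000, 5.0, 3000.0))
```

A negative result would mean the exchange port costs more than the transit it replaces, so the same function also shows how much traffic has to move before peering pays for itself.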
What's my experience with pricing, setup cost, and licensing?
I believe pricing is by device, the number of devices with BGP sessions, and then by the amount of flow you expect from that device, if I remember correctly. We did ours on a yearly basis. That was easier for us. I think they will happily do multi-year if you want.
What other advice do I have?
Carefully analyze your routers and how much flow they're sending to a collector. I would also suggest minimizing the number of routers that have to send BGP, if you can. You need a good enough view of the BGP, but you don't need every router holding BGP sessions, and that might help. Those are suggestions for implementation.
The biggest lesson I have learned from using Kentik is "don't do it yourself." At my previous company they were being very stubborn and they didn't want to use an off-the-shelf product, so I went through three iterations of a netflow interface trying to get it correct, and I kept telling them, "Okay, but there's a product out there that does this. So please let's stop spending all this money." And they went so far as to spend a couple of million dollars on hardware to deploy it out to the network and everything, and we still ended up going to Kentik. That is one of the biggest things I learned, that sometimes you cannot do it all. You have to go to someone who's an expert in a particular kind of big data, and that's what they are.
We don't currently make use of the solution's ability to overlay multiple data sets such as orchestration, public cloud infrastructure, network path, or threat data onto existing data. But with the public cloud providers we are working with, we are looking at pulling in VPC logs so that we can see if we're getting the performance that's necessary out of our public cloud providers. That's the next step with this product for us.
We're not pulling in other data sources like logs or ThousandEyes data, for instance, at this point. We did talk to Kentik about trying to pull ThousandEyes data in and marrying it with their product. But not quite yet. I hope to add that into the product as well at some point. We do use BGP as another metric to figure out what's happening with the different paths.
We probably have about 30 users. Everything from our monitoring team is in there so they're working with me on pulling together an interface that uses the API to pull the data out of Kentik to put it on one of our internal interfaces. That way, some people won't have to log in to get some data. It's more of an executive view for them. But some of our executives actually have access to Kentik too. We have a couple of network backbone engineering executives who have access and who do look at it. Then we have a lot of our operations team, the network architecture and backbone engineering. They all have access. It's a wide range.
In terms of deployment and maintenance, there are two of us who put stuff in. I've created users. One thing we are going to do is automate getting the routers in there. We would generally suggest, and this is what I did previously, that you write scripts to do your updating of everything, so that the scripts just do it automatically for you. That's super-helpful.
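The core of that kind of script is a comparison between the routers you own and the routers already registered. A minimal sketch under stated assumptions: the function name and router names are hypothetical, and the step that actually registers a router is deliberately left out, since it depends on the platform's API and is not described here.

```python
def plan_router_sync(inventory, monitored):
    """Compare the router inventory with what the flow platform already has.

    inventory: iterable of router names from a source of truth.
    monitored: iterable of router names already registered for flow export.
    Returns (to_add, to_remove) as sorted lists.
    """
    inventory, monitored = set(inventory), set(monitored)
    to_add = sorted(inventory - monitored)      # owned but not yet registered
    to_remove = sorted(monitored - inventory)   # registered but decommissioned
    return to_add, to_remove


# Hypothetical router names for illustration.
owned = ["edge-01", "edge-02", "edge-03"]
registered = ["edge-01", "edge-99"]
print(plan_router_sync(owned, registered))
```

Run on a schedule, a comparison like this keeps the monitored set in step with the network without anyone adding routers by hand.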
In this environment we don't have that many routers in it. It's about 40 to 50 routers at the moment. We mainly use it on their engines. We're starting to work with our security team to get it from data center to data center as well. That's really limited by our need for security rather than how we would use it entirely. At my previous company, when I left, we had 667 routers in it. It was used everywhere for everything. We absolutely have plans to increase usage of Kentik at my current company. I'm working with our security team to get approval to do that. I have to meet their security needs in order to expand the usage.
Honestly, it is one of those products that I would suggest to almost any network operator. I would go with a ten out of ten as my rating. I have not felt like this about any other company out there. It has just been so useful for me on so many different levels from operations, to ROI. It's just helpful.