What is our primary use case?
We use it for DDoS mitigation. Availability is extremely critical to our business. We provide online services for two large video game titles, so there's a requirement to keep our games online and for our players to be able to play them.
We use it on-prem.
How has it helped my organization?
Due to the availability and power of these devices, we've been able to ramp down our use of our upstream cloud-scrubbing provider and handle all attacks on our own. Right now, 100 percent of our attacks are handled on-prem with these devices. Nine months into 2019, we've had 1,500-plus attacks and fewer than five of them have had an impact on us.
Our current setup is two 6435 TPS devices at each location. Each of those boxes is rated for 155 Gigs of traffic. We currently send 100 Gigs of traffic to each box, which is the bandwidth of the line that we have coming in. Within the last month, those devices successfully mitigated a 163-Gig attack.
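To put those numbers side by side, here is a rough sketch of the headroom arithmetic. It assumes the attack landed at a single location and was split evenly across the two boxes there, which the review does not confirm. The function names are only illustrative.

```python
DEVICE_RATING_GBPS = 155  # rated capacity per device, as quoted above
LINE_RATE_GBPS = 100      # inbound line per device, as quoted above
DEVICES_PER_SITE = 2      # devices at each location, as quoted above


def per_device_load(attack_gbps: float, devices: int = DEVICES_PER_SITE) -> float:
    """Load per device, ASSUMING an even split across the devices at one site."""
    return attack_gbps / devices


def fits(attack_gbps: float, devices: int = DEVICES_PER_SITE) -> bool:
    """True if the per-device share stays under both the line rate and the rating."""
    limit = min(DEVICE_RATING_GBPS, LINE_RATE_GBPS)
    return per_device_load(attack_gbps, devices) <= limit


if __name__ == "__main__":
    print(per_device_load(163), fits(163))
```

Under that even-split assumption, each box would see roughly half of the attack, which is below both the line rate and the rated capacity.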
In terms of increased availability, I would say it's at about 99.999 percent, overall. We haven't had any major impact, anything more than five minutes, in about two or three years now, due to DDoS.
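As a quick check on what that availability figure implies, the sketch below converts a percentage into a yearly downtime budget. It assumes a 365-day year, and the function name is only illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    print(f"{allowed_downtime_minutes(99.999):.2f} minutes per year")
```

Five nines works out to a little over five minutes of downtime per year, which lines up with the five-minute threshold mentioned above.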
The automation in TPS makes my team more productive. There is less manual work for my team in dealing with attacks, and with other functions as well, because that automation is built-in: An incident is created, the attack is mitigated, and a report is created. There's really zero touch at this point in time for attacks. It has very much reduced the amount of manual intervention required during an attack.
That is especially true with the newest upgrades that just came out, where it does automated pcaps. Everything that we need at this point in time is automated. The device automatically goes into mitigation. It gets the pcaps for us and, a large percentage of the time, it just blocks the traffic that we're looking for. Having those pcaps also helps in the future: when we're looking at attacks for which we don't have either a signature or a proper remediation, we can build one based on the data that we're receiving from those pcaps.
Using TPS we have detected a lot more small attacks and attacks that we had been missing previously, but that's not only because of TPS. We do gather flow information from the TPS devices as well as from our border routers that we recently upgraded. We're using FlowTraq. With that combination, we are seeing a large increase in the number of DDoS attacks that we're detecting compared to what we were using previously, which was a third-party cloud provider. On average, we're detecting anywhere from 25 to 50 more attacks per week than we did previously.
What is most valuable?
The most valuable feature is the DDoS mitigation itself: the ability to block different types of attacks, including large ones. That's the main function we use the device for today. The devices block the attacks that we see against us and prevent further outages in the environment.
Also, based on previous equipment that we had, it's amazing that this device can do what it can do in a 1U form factor. The devices that we have right now have never gone over capacity and we've actually mitigated some pretty large attacks with these devices.
The solution's response time to an attack is pretty good. Normally, it's a matter of milliseconds and most of the time, within one minute, we have a response and we have mitigation in place. That limits the impact to our environment when we do have an attack.
We use the solution's programmable automated defense using RESTful API quite a lot. We don't use it for configuring the device, but we do use it for things such as grabbing stats and data as well as doing automated blacklisting of IPs and using class lists within devices. It works fairly well. For the most part, we only use the RESTful API for certain tasks. A10's aGalaxy uses API to configure the devices as well and we rely on that for all basic configuration such as IP addresses and port configuration.
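The automated blacklisting step could look something like the sketch below. It only prepares a request body: it validates and de-duplicates the IPs and serializes them. The payload shape, the `list_name` value, and the function name are hypothetical and are not taken from A10's API; the real endpoint and schema would come from the vendor's documentation.

```python
import ipaddress
import json


def build_blocklist_payload(raw_entries, list_name="auto-blocklist"):
    """Validate and de-duplicate IPs, then build a JSON request body.

    The JSON structure here is a placeholder, NOT the vendor's schema.
    """
    seen = set()
    entries = []
    for raw in raw_entries:
        try:
            addr = ipaddress.ip_address(raw.strip())
        except ValueError:
            continue  # skip anything that is not a valid IP address
        if addr not in seen:
            seen.add(addr)
            entries.append(str(addr))
    return json.dumps({"name": list_name, "entries": entries})


if __name__ == "__main__":
    # Documentation-range addresses, used purely as sample input.
    print(build_blocklist_payload(["192.0.2.1", "192.0.2.1", "not-an-ip"]))
```

Validating before sending keeps a malformed entry from failing a whole batch, which matters when the list is fed by automation rather than by hand.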
What needs improvement?
We currently do not use the solution's machine-learning-powered Zero-day Automated Protection because of an issue with it. We have a ticket open with the team to try to resolve that problem. It is one of the things that is not working for us today.
We also use the aGalaxy platform, which is a management platform for the TPS devices. The issue is that some TPS features were added at the TPS level but weren't carried over to aGalaxy, and we manage all of our devices through aGalaxy. So we can't actually use some of the new features that are available on the TPS because that functionality doesn't exist in aGalaxy. That is one of my biggest complaints.
We're somewhat the guinea pig for using both aGalaxy and TPS. There are some features that we would love to use, but unfortunately, because we're on aGalaxy, we don't have the ability to use them due to limitations of the devices. The A10 team just needs to establish a release cadence that keeps TPS and aGalaxy in line with each other.
A lot of the issues that we had previously with the devices were because a given functionality didn't exist and it took a while for the team to actually create that for us. But since they were implemented and the kinks were worked out, everything has been working pretty well.
For how long have I used the solution?
We have been using A10 TPS for about a year-and-a-half.
What do I think about the stability of the solution?
Overall, today, the stability is great. We have very few issues with the devices in terms of their performance or their availability.
We did have some issues previously which were resolved. We had some crashing issues on the devices but, with the last upgrade that we received about a month or two ago, those issues were resolved. We haven't had any major issues with our devices since we moved to the new code version.
What do I think about the scalability of the solution?
It enables us to scale defenses. We can go up to eight boxes at each location, with the current configuration that we have.
As time goes on, we are looking into possibly going with the newer devices which just came out and which have increased capacity. We're also potentially looking to move out to more PoP locations in the future: we would have an internet connection and an A10 TPS at a remote location and then back-haul traffic to us. We are looking to potentially expand our footprint in the future.
Overall, scalability is limited only by our own network. With ECMP and BGP available to us, we can scale out as horizontally as we need to, depending on the size of pipe we have coming in. It's really our own limitation at this point in time. Each of our data centers has 100-Gig pipes, which the devices have plenty of support for. But if we did need to roll out to either four devices or six devices, we would have the ability to do that.
How are customer service and technical support?
We used A10's support team, not for the initial setup but when we ran into issues. We would definitely use their support team for bringing issues to their attention and getting information to them saying, "Hey, this is what we're trying to do, this is what we need for the device to do," and they would actually build it out for us. That included their dev team as well. They were very open with their dev team for what we needed.
Overall, they're really great. For the most part, they understand what our needs are and they understand what problems we run into. Their support team is really good when we speak with them, as far as resolving our issues goes.
It does take some time, at times, to get issues resolved, but that is the nature of having them build out products or fixes for us. They have worked on a number of issues with us where something is not scheduled to be released but they either give us a hot-fix or they fix it in code and give us that version ahead of time, before it's actually released to the public. That is a great asset for us.
Which solution did I use previously and why did I switch?
We were previously using the Arbor APS platform. One reason we switched was that the TMS platform we were looking at didn't have the functionality to do ECMP and BGP the way we wanted to be able to do them. The second major factor was pricing, which was much higher with Arbor for the same type of solution.
How was the initial setup?
I'd break the initial setup into two parts. I did not do the network part of the setup, although I know our network team had some issues with the initial setup because of the limitations of the devices at that point in time.
For the network part of the setup, one of the limitations the device used to have, and no longer has, was with BGP. The way we run our environment is that the TPS device is actually a BGP device within our network and it peers with other devices. That's not a common setup for A10. It's normally used in an environment where there are routers to the north and south of the device, so that there's usually another device that you reroute traffic to when there's an attack. But because we wanted to be in an always-on, asymmetric situation, we didn't have that ability. So they had to build it for us.
They also had to build what's called ECMP, which is equal-cost multi-path. It's basically load balancing on the network side. They had to build that in for us as well because that was a requirement for how we were going to build the environment. So there were some growing pains when we first brought it online just to make sure that everything was working. They built it into the product for us and it is now working perfectly fine. It is a standard feature now in the newer versions. They've added the ability to have BGP route-on and route-off as an option. Some teams do use that functionality where they have the two routers and they route on only when there's an attack. In our case, we are always on so we have the ability to turn that functionality off because we don't need it.
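For readers unfamiliar with ECMP, the idea can be sketched as hashing each flow onto one of several equal-cost next hops, so packets from the same flow always take the same path. This is a generic illustration, not how A10 or any particular router implements it. Real devices do this in hardware with their own hash functions, and the device names below are made up.

```python
import hashlib


def pick_next_hop(flow, next_hops):
    """Map a flow 5-tuple onto one of the equal-cost next hops.

    flow = (src_ip, dst_ip, protocol, src_port, dst_port)
    SHA-256 is used only for a stable, well-spread hash in this sketch.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]


if __name__ == "__main__":
    hops = ["tps-a", "tps-b"]  # hypothetical device names
    flow = ("198.51.100.7", "203.0.113.10", "udp", 40000, 27015)
    print(pick_next_hop(flow, hops))
```

Because the choice depends only on the flow's fields, adding more devices to `next_hops` spreads traffic across more boxes without any per-flow state.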
From the perspective of defenses through aGalaxy, that's gotten better over time. They've made a lot of enhancements to the product that we've requested to make our lives easier. We are currently running approximately 163 zones in our aGalaxy. Managing that number of zones and IPs can be kind of a daunting task, but they've added a bunch of features in the new versions of aGalaxy to be able to easily do that and onboard new IP addresses in an easier manner.
It took us about six-plus months to deploy. We had our existing solution in place and the new solution was hanging off of that for testing purposes. It was a good six to eight months before we were fully migrated over and we had our devices inline.
Previously we were using a different vendor for our mitigation, which was basically two 10-Gig connections that were shared across a switch stack, with all the devices being inline. That was very susceptible to failure because the traffic was always inline. Part of the new implementation requirements from the network team was that we have the ability to set up BGP, which is how it's set up today. So if for some reason there is an issue with a device, like a TPS, we can always pull the BGP route to that device and route traffic around it. Previously we didn't have that ability, so if there ever was an issue on our hardware stack, it would affect all services.
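The route-around behavior can be pictured with a toy model of the next hops advertised for a prefix. This is only a conceptual sketch: it is not a BGP implementation, and the class, prefix, and device names are invented for illustration.

```python
class RouteTable:
    """Toy model of the next hops currently advertised for each prefix."""

    def __init__(self):
        self._paths = {}  # prefix -> list of next-hop names

    def announce(self, prefix, next_hop):
        """Record that next_hop is advertising a path to prefix."""
        hops = self._paths.setdefault(prefix, [])
        if next_hop not in hops:
            hops.append(next_hop)

    def withdraw(self, prefix, next_hop):
        """Remove next_hop's path, so traffic is no longer sent through it."""
        hops = self._paths.get(prefix, [])
        if next_hop in hops:
            hops.remove(next_hop)

    def next_hops(self, prefix):
        """Return a copy of the next hops still available for prefix."""
        return list(self._paths.get(prefix, []))


if __name__ == "__main__":
    table = RouteTable()
    table.announce("203.0.113.0/24", "tps-a")  # hypothetical names
    table.announce("203.0.113.0/24", "tps-b")
    table.withdraw("203.0.113.0/24", "tps-a")  # tps-a has a problem
    print(table.next_hops("203.0.113.0/24"))
```

Once the faulty device's route is pulled, only the healthy device remains as a path, which is the property the old always-inline stack lacked.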
What was our ROI?
We have definitely seen return on our investment by going with A10. We have been able to scale back our cloud services due to the deployment of the A10s. Our future goal is to actually discontinue use of cloud services for DDoS mitigation.
What's my experience with pricing, setup cost, and licensing?
We are doing multiyear licensing. We signed up for three years to get a discount. We're doing that for most of our vendors at this point in time.
As far as I'm aware, there aren't any additional costs to the standard licensing fees. For aGalaxy there is a limitation on how many zones and how many devices we can deploy, but that's the only limitation there is. Currently, we're at 500 zones and about 50 devices. Nothing else is on a per-gigabyte or capacity-based license model where you can only push so much traffic through these devices. There are no such limitations. It's a software license for being able to upgrade to newer versions.
We are always looking to do multi-year deals, especially with devices that we plan on keeping. Being able to do multi-year is a pretty standard thing now. It just works for us and it gives us the ability to grow as we need.
Which other solutions did I evaluate?
We looked at products such as Radware and another product but I don't remember the name of it. Ultimately, based on some information and data from other gaming companies that we spoke with, it was suggested that we look at A10.
It's been a long time since we actually looked at Radware, but Radware was being used by our cloud provider for their DDoS mitigation. One of the things that we noticed was that their capacity per box was much lower. At the time, those boxes were only able to handle about 10 Gigs of traffic each, which was way below our needs, especially since we were moving to a multi-hundred-Gig solution.
What other advice do I have?
The type of configuration and the type of network you're planning on running really matters. A10 does a good job of letting you know what's available and what works for the company, depending on those needs. For our use, we needed to be full, 100 percent on. Some companies don't require that and they can afford some type of downtime for BGP cut-over and such. My advice would be to really work with the A10 engineering team on what your needs are and what you're looking for in a product, to make sure that is a viable option. We spoke a lot with other gaming companies that were using the solution and asked, "What is your setup? What kind of issues have you had?" We're using it in a different fashion than some of the other gaming companies are using it today, but it works for us and we think it does a great job.
The biggest lesson I have learned from using this solution is that it does take time to implement. There always are going to be some software issues that need to be worked through. Having a more versatile environment and versatile network makes it a lot easier, so that if you do have issues you can certainly work around them. That's especially true in a production environment. We really don't have a test environment that we are able to set up to test these in and this was basically done by hanging off our production environment with minimal downtime.
In our organization, there are two major teams that use the tools. There are three folks on the networking team and they handle all networking aspects, including BGP, routing, and configuration of the device from a networking perspective. And my team is the SOC team. I currently have nine folks. We work about 95 percent off of the aGalaxy system. We're responsible for responding to alerts, responding to attacks, gathering pcap data, gathering data about zone alerts, etc. Those 12 people are the ones responsible for the A10 devices.
That same group of people is responsible for deployment and maintenance of the solution. I'm mainly responsible, on the security side, for any types of updates that get pushed to the devices. That would be any type of software updates or any type of work being done. On the networking side, it usually just requires one person if we're doing any type of work. It doesn't require the whole team, for the most part. All three people in the network team have knowledge of the system, but it's usually one person required for that work if we do any types of updates.
I would rate it at nine out of ten. It does have its issues that are being worked through, but overall it's great.