Apache Flink Review

Scalable framework for stateful streaming aggregations


What is our primary use case?

Initially, we created our own servers and then eBay created their infrastructure. Now it's deployed on the eBay cloud.

Our primary use case is trying to do real time aggregations/near-real time aggregations. Let's say for example that we are trying to do some count, sum,min,max distinct counts for different metrics that we care about, but we do this in real time. So let's say, you have an e-commerce company and you want to measure different metrics. If I take the example of risk, let's say you want to check if one particular seller on your site is doing something fishy or not. What is the behavior? How many listings do they have? In the past five minutes, one hour or one day or one year? You want to measure this over time.

This data is very important to you from the business metric point of view. Often this data  data is delayed by 1 day via offline analytics. You do ETL for these aggregations ,it's okay for offline business metrics. But when you want to do risk detection for online businesses, it needs to be right away in real time, and that's where those systems fail and where Apache Flink helps. And if combined with Lambda architecture, you can get them real time with the help of a parallel system that captures very latest data.

How has it helped my organization?

The mighty work on risk, right? So we have to provide risk insights. When we have machine learning models or deep learning models or route based systems and when we want to evaluate some user or somebody's behavior on the site, we want to evaluate it right away. If we don't evaluate it right away, it's of no use to us. Let's say that a fraudulent buyer comes in and is trying to buy an item. In the process of buying, when the person clicks on the button and purchases the item, if you're not able to detect the fraud right then, it's of no use to us.

At the same time, we also have to be able to make sure that we are not pissing off the real people if they are good. So we have to create the very hard balance between whether it's a good person or a bad person without seeing what they're doing. Then you need to do it in real time. That's why with offline systems the aggregation of your data comes after one day, which is a typical case of ETL and is of no use to us. We want to do it right away. But at the same time, we don't want to provide too much friction either. This aggregation includes: how many transactions have you done in the past, in the last year? How many transactions did you do in the last minute? How many coupons did you use in last five years or how many coupons did you use in five minutes? All these kinds of metrics can be useful and can feed into a machine learning or deep learning model or a rule based system to do something with it.

Let's say if we feel that the seller has crossed a limit, or our seller is doing something fishy or buying something fishy, we can throw a capture to the user or we can provide some friction -"can you do any multifactor authentication?" These are just examples. So our use case is the idea that we want to have real time aggregations. One thing to note is that Flink will not help you to do real time navigation very well. It helps you in near-real time navigation. But with the lack that you get from near-real time vs. real time is that you can have it from another system. So let's say that Flink is able to give you all the aggregations accurately either daily or in five minutes. I'm just taking an example for our use here. That's how it works. Five minutes. And it depends on the complexity of the problem that you are trying to solve or how much infrastructure you have. So if it is delayed by five minutes, what you can do is have a panel system that only takes care of the latest data. And then you can take this to the system and combine it with the Lambda architecture. Where you combine a historical system with the real time system and then give the data aggregation. What we do is we apply the same concept in a near-real time integrated fashion and then we are able to give real time analytics on anything that happens on us.

What is most valuable?

You need to understand that a bunch of links work on streaming data. Among the streaming data, out of the nuances of streaming, we have called exactly once are the ones we call semantics. Exactly once being the hardest, which is being very well maintained by flex. What I mean by that is, when an event or when data is coming in, you're only processing that event. Only one stop. This is very important when you're trying to do some aggregates.

Let's say you are a seller on the site and we know that depending on when a seller makes a setting limit, you're allowed to sell only three items a day. If we miscount you in the last five minutes, from three to four, but you did three listings, because of the current listing the data does not reflect that, we do not aggregate now. We allowed you to do this. What is wrong? We can't allow a wrong decision on your side based on the setting limits and the reason for that is because the data is out-dated because the data got delayed. That is a reason the data got delayed. The other reasons could be that the data got lost or they will not aggregate exactly the way it was supposed to be aggregated. So exit aggregations are what makes it much harder to do in real time. That's why you use offline systems like Spark and Hadoop. But you can do real time with Symantec, which is very well supported in Flink. So that's one of the best features. 

Another feature is how Flink handles its radiuses. It has something called the checkpointing concept. You're dealing with billions and billions of requests, so your system is going to fail in large storage systems. Flink handles this by using the concept of checkpointing and savepointing, where they write the aggregated state into some separate storage. So in case of failure, you can basically recall from that state and come back.

I'll take an example of Call of Duty. Let's say when you play a game of Call of Duty there are five levels. And in each level, there are different obstacles that you need to clear to advance. There are five levels and in each level, there are different obstacles. So let's say in each level there are 10 obstacles. If you've cleared three obstacles and you die in that process you don't want to start from scratch again in the same level, right? You want to start from the third level. So that is what the concept of checkpointing and savepointing allow you to do. I have done work until this point at the third level, and now I want to restart from the third level only. I don't want to redo that part again. That's what checkpointing and savepointing do. So how does it help in our case? What is the current date? It's Wednesday, October, 15, right? So October 15 until 12:00 PM, we open all the applications. We take that at a regular interval. We take that aggregation snapshot and store it in a different storage. If the system goes down after 12:00PM because the load is high or some other hardware failure, you can recover that data from 12:00 PM and reprocess. You're not recovering the entire thing.

Let's say the seller's count listing was two but you don't want to contrast for that particular seller from zero. You want to count from two after 12:00 PM. Right? That's what Flink helps you do.

Additionally, it helps you scale very well, but there are a lot of nuances. Because Flink allows you to do good aggregations upon segregations, maintaining the system is not that easy because you need a machine that has high RAM requirements. You need to have good memory requirements. It depends on what kind of problems you're solving. The problem that I'm describing is actually the the hardest kind of problem, when you're putting state in Flink's memory, stateful problems. There are other problems called stateless problems. Let's say you are a trucking company and you want to track the new data of your trucks. Let's say you have 15 trucks and you want to look at each of your trucks and each and every truck is spitting out latitude and longitude info when they're moving. In this case, maybe your intention is only to track them real time but you're okay with it being delayed by five, 10 minutes. But in this case, you're not aggregating something. There's nothing stateful about it. You take the data and you dump it to storage. That storage could be anything, plastic surgery or anything. And then you can create a graph and plot a trendline. Where is it going? In a map or something like that. The data comes in and it gets written to a place and then uses a straight graph. So it stays put. In this kind of application, Flink works very well and you can really do a lot of real time analytics on it. But in the case of the problem that I described where you do real time applications, that's a little bit tricky because in this case, you need to recover from the right state so that you don't mess up the relations.

What needs improvement?

In Flink, maintaining the infrastructure is not easy. You have to design the architecture well. If you want to scale for a larger number of streaming data you need good machines. You need good resilience architecture so that if it fails, you can recover from those with minimum downtime. You should have good storage systems to store and retrieve intermediate flink states(in case of stateful applications). Basically all the problems that come with a distribution system. So you have to have all that infrastructure for it to perform well. Best way is to look at the use cases you wish to support in 5-10 years ahead and design the architecture around flink accordingly.

For how long have I used the solution?

I started using Apache Flink in October 2017. My team has been using it since May 2017.

What do I think about the stability of the solution?

In terms of stability with Flink, it is something that you have to deal with every time. Stability is the number one problem that we have seen with Flink, and it really depends on the kind of problem that you're trying to solve. If you're trying to solve the problems that we are trying to solve, which is stateful aggregations, you will find a lot of stability problems. That's why you have to invest money and time into understanding what problem you are trying to solve. How much infrastructure do you need? Stable infrastructure would take time to mature. Once you do that, you also need to spend time understanding and figuring out an optimized way of making it cluster-ready. You don't want to throw money just like that. If you want to throw money, you want to throw money in the right way.

When you create clusters of machines or something like that you're going to need a lot of analysis upfront. Let's say you're selling Flink as a product to different people, how can you do this? One way is you take a bunch of use cases, common use cases, and do experiments with it and form clusters based on that. You can then call them flavors. Let's say, for example, flavor A can do this kind of thing very well, flavor B can deal with these kind of things. And flavor A has its own infrastructure in the sense that flavor A has five job managers, 16 task managers, five zookeepers and all those configurations. Then different clients can use this kind of model. I'm just giving some ideas for how you can make the things work if you are selling this as a product.

What do I think about the scalability of the solution?

In terms of scalability, there's a lot of room for improvement in the case when you're in stateful conditions. There is no system as of now which does scaling very well for stateful aggregations. There are other frameworks like Apache Beam which actually is an app but for other kinds of things. Then there are other things, like Apache Pulsar. But among these, Flink performs best and it's actually very good for streaming architecture.

When we started, we were only three people working on a bunch of things. We were the first people. I was the software engineer. Basically I was the most junior of all. They're mostly principal engineers and I was a software engineer.

When we started we were creating code and generating the chart file. It was worth creating our own clusters. That's why I have that insight. We were doing all the settings on those clusters so that it would work. Then, you wouldn't apply the job on those clusters. We give all these insights to the other team, the platform team, so that they can evangelize the product for the entire company who would want to use Flink. For us it was a side project setting up the clusters and all that, because for us we are solving other business problems, but we had to do it because there was nothing available for us. Then the platform team did all of that. Now we just write code. We understand the business problem. We write code, we generate a plan for them. Everything else is taken care of by them on the machine itself.

It's getting popular really fast now. This idea of when it's easy for people to use it, people will use it if it solves the problem. In my experience, it is going to be big just like Spark. It's just that you need a little bit more infrastructure, because it's trying to solve a complex problem, which is what people need to understand. If it is a complex problem, you have to spend time and energy to make it make it stable. So you need to go to the infrastructure team who can write and who can create a stable Flink protection, then it will help you solve a lot of the problems.

How are customer service and technical support?

I have not had any experience with customer support. We did everything ourselves, but we read a lot of Apache's documentation on that.

How was the initial setup?

The initial setup is not straightforward. It would take time. You have to know a lot of things. But one thing is that when we started, Flink was very new. The product is maturing and people are using it more. They will understand what people need and all that stuff. Maybe it would not be as difficult as it was, but it does require you to understand a lot of things.

So how do you set up a cluster? Let's say you want to do aggregation on 15 million emits for one particular Flink job. In Flink, when you deploy an application, it's called a flincher. When you do that, how do you design the cluster? What boxes do you have? How much RAM do you need? Once you have one particular box, how do you design the topology of the cluster? Let's say the way it works is it has something called job manager, which are the coordinating notes. Then it has the task editor which are the machines which basically do the real work, the aggregation. Then there are the zookeepers. The zookeepers are something which helps you maintain the health of the cluster. You have to make a balance.

How many zookeepers do you want? How many job managers do you want? How many task editors do you want? How much RAM do you want to have in each cluster? How much network do you want to open? How much traffic can the Flink cluster take in the stable manner so that it doesn't go down frequently? You have to do a lot of experiments with all of this setup, depending on what problem you're solving, it depends upon the load that you're getting from your business.

What about the implementation team?

Deployment takes around two minutes if everything is good. It's very fast.

How do we do that? Just like any application, we write code, we build a bit of the manifest on the job, the code that is able to be deployed. You take that bell JAR. It's called a JAR. Generally we have used the JAR version because Flink is a JVM theme, so we are using the JVM version, which supports Java and SQL. We write code for whatever you want to do with it. We build the code and it converts into a JAR file. We take the JAR file and upload it into a Flink server and then you just click a button and it's deployed.

You don't have to do anything. If you are starting up, the moment you install Flink in your local machine, when you're trying it out and you start the server, you will see your Flink server UI coming up. There's an option so that you can deploy a job. What you have to do is build your code, generate a JAR file and then simply go to the UI, upload the JAR and start the server. That's it.

Which other solutions did I evaluate?

Before Apache Flink my team tried Apache Storm and it did not work for them. I think Apache Storm is not being used by anyone else in the company.

What other advice do I have?

This is general advice if you're trying to do anything: Any problem that you're trying to evaluate, you have to really understand the problem that you're trying to solve, what is the nature of the problem? And by nature of the problem, the business side is one thing, but you have to understand how you're solving things. For example, do you want something to be fast enough, scalable and for any new product? Every time they advertise it is fast, scalable, highly distributed, etc... But in what context? What kind of use cases is this product built for? You have to understand the principle and only then you choose a product. If you want Apache Flink, it's about if you want something for near-real time metrics that may be useful for your business.

In that case, Apache Flink is your friend, because it's built on streaming architecture. If the nature of your application or your business is streaming, the data is coming at a very high rate and you want to do something with it, then Apache Flink is a good option. Another example I can give you: let's say you run a company, you are the CEO of Twitter, right? So in Twitter, a lot of people are writing a lot of stuff. A lot of streaming data is coming in. Because a lot of people are tweeting at the same time all around the world there's a lot of streaming of data coming in.

Let's say you're a celebrity and 5,000 people follow you. When you write a tweet, all 5,000 people have to see that tweet as quickly as possible. So when your tweet comes in, a very complex system from Twitter's backend has to take that tweet, has to know which of those people and display it on their feed timeline. Now this might sound easy when you only have five people, but if you have 315 million people tweeting, it's a very complex system and you have to make it available, etc... So when you're dealing with streaming data Apache Flink is a good option.

On a scale of one to ten, I would rate Apache Flink around seven to eight. It's pretty good if you're solving a streaming type of problem. My experience is limited. I only worked with Apache Storm a little bit and Apache Flink. Among all of this, if I would talk about streaming, Apache Flink wins hands down, but there are other products like Apache Pulsar which I have no idea. So my perspective is very limited.

Which deployment model are you using for this solution?

Public Cloud
**Disclosure: I am a real user, and this review is based on my own experience and opinions.
Add a Comment
Guest