What is our primary use case?
We have our own infrastructure on AWS. We deploy Flink on a Kubernetes cluster in AWS. The Kubernetes cluster is managed by our internal DevOps team.
We also use Apache Kafka; that is where we get our event streams. We receive millions of events through Kafka, between 300K and 500K events per second through that channel.
We aggregate the events and generate reporting metrics based on the actual events that are recorded. Certain real-time, high-volume events come through Kafka like any other stream. We use Flink for aggregation in this case: we read these high-volume events from Kafka and then aggregate them. There is a lot of business logic running behind the scenes. We use Flink to aggregate those messages and send the result to a database so that our API layer or BI users can read directly from the database.
How has it helped my organization?
Flink has improved my organization by enabling us to become independent of Redis, which we used as an intermediate caching layer with Apache Storm for aggregation. Redis was a bottleneck. With an increasing number of messages, Redis was becoming full, and there was also a higher chance of errors because we were doing checkpoints and state management manually.
Flink provides out-of-the-box checkpointing and state management, which helps us in that way. When Storm restarted, we would sometimes lose messages or intermediate state. Flink provides guaranteed message processing, which helped us. It also helped us with application maintenance, deployments, and restarts.
What is most valuable?
When we use the Flink streaming pipeline, the first thing we use is the windowing mechanism with the event-time feature, which makes aggregation in Flink very easy. Previously we were using Apache Storm. Apache Storm is stateless, while Apache Flink is stateful. With Apache Storm, we had to use an intermediate distributed cache. Because Flink is stateful, it can manage the state and the failure mechanism for us. The result is that we aggregate every 10 minutes and do not need to worry about our application stopping in between those 10 minutes and then restarting.
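The windowing idea can be sketched outside of Flink. Below is a simplified, plain-Python illustration of event-time tumbling windows (it is not Flink API code; the 10-minute window size and the event shape are just examples matching the use case above):

```python
from collections import defaultdict

WINDOW_MS = 10 * 60 * 1000  # 10-minute tumbling windows

def window_start(event_time_ms):
    """Assign an event to the tumbling window that contains it."""
    return event_time_ms - (event_time_ms % WINDOW_MS)

def aggregate(events):
    """Sum event values per window based on the event's own timestamp
    (event time), not on when we happen to process it."""
    totals = defaultdict(int)
    for event_time_ms, value in events:
        totals[window_start(event_time_ms)] += value
    return dict(totals)

# Two events fall in the first 10-minute window, one in the second.
events = [(0, 1), (300_000, 2), (600_000, 5)]
print(aggregate(events))  # {0: 3, 600000: 5}
```

Because the assignment depends only on the event's timestamp, a restart and replay of the same events produces the same windows, which is what makes stateful, checkpointed aggregation safe.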
When we were using Storm, we used to manage all of it ourselves, creating manual checkpoints in Redis. Flink has built-in support for checkpointing and statefulness, and you can use event time or processing time for your messages.
Another important thing is out-of-order message processing. With any streaming mechanism, there is a chance that your source produces messages out of order. When you build a state machine, it is very important to have the messages in order so that your computations and results are correct. With Storm or any other framework, to get messages in order you have to use an intermediate Redis cache and then sort the messages. Flink has a built-in way to process the messages in order, which saves a lot of time and a lot of code.
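The idea behind in-order processing can be sketched in plain Python: buffer events until a watermark (a point in event time up to which we assume no more late events will arrive) passes them, then release them in timestamp order. This is a conceptual sketch of the technique, not Flink's actual watermark API, and the 5-second lateness bound is an arbitrary example:

```python
import heapq

class OrderingBuffer:
    """Buffer out-of-order events; release them sorted once the watermark
    (max event time seen minus an allowed lateness) has passed them."""

    def __init__(self, max_lateness_ms=5_000):
        self.max_lateness_ms = max_lateness_ms
        self.heap = []      # min-heap keyed on event time
        self.max_seen = 0   # highest event time observed so far

    def add(self, event_time_ms, payload):
        heapq.heappush(self.heap, (event_time_ms, payload))
        self.max_seen = max(self.max_seen, event_time_ms)
        watermark = self.max_seen - self.max_lateness_ms
        released = []
        # Anything at or below the watermark is assumed complete.
        while self.heap and self.heap[0][0] <= watermark:
            released.append(heapq.heappop(self.heap))
        return released  # in event-time order, safe to aggregate

buf = OrderingBuffer()
buf.add(1_000, "a")        # held back: watermark not advanced enough
buf.add(500, "b")          # arrived out of order, also held
print(buf.add(7_000, "c")) # watermark = 2000 → releases [(500, 'b'), (1000, 'a')]
```

The trade-off is latency versus correctness: a larger lateness bound tolerates more disorder but delays results, which is the same knob Flink exposes through its watermark configuration.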
I have written both Storm and Flink code. With Storm, I used to write a lot of code, hundreds of lines, but with Flink it's less, around 50 to 60 lines. I don't need Redis as an intermediate cache, so a lot of code is saved. I have to aggregate over a 10-minute window, and there is a built-in mechanism for that. With Storm, I needed to write our own logic, a bunch of connected bolts, and the intermediate Redis cache. The same code that would take me one week to write in Storm, I could do in a couple of days with Flink.
I started with Flink five to six years ago for one use case, and the community support and documentation were not good at that time. When we started again in 2019, we saw that the documentation and community support had become good.
What needs improvement?
For state, Flink maintains checkpoints using RocksDB or S3. These are good, but sometimes performance suffers when you use RocksDB for checkpointing.
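For reference, the state backend and checkpoint storage are set in Flink's flink-conf.yaml; a setup like ours might look roughly like this (the S3 path and interval are placeholders, not our actual values):

```yaml
# Keep operator state in RocksDB (on local disk) instead of the JVM heap;
# incremental checkpoints reduce how much is written on each checkpoint.
state.backend: rocksdb
state.backend.incremental: true
# Durable checkpoint storage, e.g. an S3 bucket (placeholder path)
state.checkpoints.dir: s3://my-bucket/flink-checkpoints
# How often to take a checkpoint
execution.checkpointing.interval: 10min
```

RocksDB trades some throughput for state sizes larger than memory, which is one reason the performance impact mentioned above shows up.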
In Apache Storm, we could write Python bolts/applications, since Storm supports Python as a programming language, but with Flink, the Python support is not that great. When we do data science or machine learning work, we want to integrate the data science or machine learning pipeline with our real-time pipeline, and most of that work is in Python.
It was very easy with Storm, which supports Python natively, so integration was easy. But Flink is mostly Java, and the integration of Python with Java is difficult; it's not a direct integration. We needed to find an alternative way: we created an API layer in between, so the Java and Python layers communicate through an API. We call the data science or ML models through the API, which runs in Python, while Flink runs in Java. We would like to see improvement here so that there is a better way to run it. Currently it's possible, but it's not that great. This is an area where we would like to see improvement.
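The workaround described above can be sketched roughly as follows: the Python side exposes the model behind a small HTTP endpoint, and the Flink (Java) job POSTs features to it and reads back a score. This is a minimal stand-in using only the Python standard library; the port, the `/score` path, and the "model" itself are hypothetical placeholders:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class ScoreHandler(BaseHTTPRequestHandler):
    """Hypothetical scoring endpoint: accepts JSON features, returns a score."""

    def do_POST(self):
        length = int(self.headers["Content-Length"])
        body = json.loads(self.rfile.read(length))
        score = 0.5 * body["amount"]  # stand-in for a real model prediction
        payload = json.dumps({"score": score}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # silence per-request logging

def serve_model(port=8099):
    """Run the scoring server on a background thread; returns the server."""
    srv = HTTPServer(("127.0.0.1", port), ScoreHandler)
    threading.Thread(target=srv.serve_forever, daemon=True).start()
    return srv
```

On the Flink side, an async I/O operator (or a plain HTTP client in a map function) would call this endpoint per record or per small batch; keeping the model out of the JVM is the whole point of the split.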
For how long have I used the solution?
I have been using Apache Flink for one-and-a-half years now.
What do I think about the stability of the solution?
Stability-wise, it's good and stable. We do aggregations on data streams received from Kafka. The Flink application connects to multiple Kafka topics and reads the data, and the number of messages generated in Kafka is very high. Sometimes in production we see glitches where data is mismatched. Our Flink runs on a Kubernetes cluster, so sometimes when a worker node crashes or the application restarts, we see a mismatch in the aggregation results.
We have yet to verify whether it's a problem with the Flink framework or with our code that does the aggregation and checkpointing. We have yet to figure out whether the data is lost when a worker node crashes or when we restart the Flink application, or whether there is a problem with the way we have done the implementation. This problem is intermittent and not always reproducible.
What do I think about the scalability of the solution?
It's easy to scale because it supports Docker. Once you have Docker containers, you can deploy it on Kubernetes or any other container orchestrator. So scalability-wise it's good; you can just launch the cluster. When you have an automated cluster-launching mechanism, you can easily scale up and down.
So far, close to 10 users use Flink, most of them software engineers, senior software engineers, DevOps engineers, a DevOps architect, and a cloud architect.
Most of our work was on Storm, but we saw improvement with Flink, so we have moved one business application. We have a couple of other main business applications and data pipelines that we would like to move as well.
How are customer service and technical support?
We have not used technical support. There are good forums and community support.
Which solution did I use previously and why did I switch?
We switched from Storm to Flink. We looked at Apache Spark Streaming as well, but some of the use cases were better in Apache Flink. We chose Flink over Spark Streaming and Kafka Streams. We thought Flink was better and so we went with it.
Spark Streaming is micro-batch, but Flink offers true streaming. Memory management with Apache Spark is not that great, whereas Flink has automatic memory management. For our use case, we found Flink faster than Spark, and the windowing mechanism that Flink provides is better than Spark's.
How was the initial setup?
In terms of the implementation, we initially set up our development instances on Mac, which was easy; the documentation is available. When we wanted to move to production, the setup was on Kubernetes, and that Kubernetes setup is a little complicated. You need a person who understands Kubernetes well; a developer alone cannot do it. When you want to take it to production, the setup on Kubernetes using Docker is a little complicated. We need something like a one-click deployment script that can launch the cluster.
In another case, we used AWS. There is Flink support in AWS EMR that we could use readily. It's a managed service, so it was easier for us; we didn't need to bother with launching a cluster before running our workload. When we have to manage our own cluster using Kubernetes and Flink, it's a little complicated, and there are a bunch of manual steps that need to be done.
Moving to production on EMR took us a couple of days, but the Kubernetes cluster setup took two to three weeks. The setup required a couple of team members from the DevOps team and the engineering side.
In terms of our deployment strategy, we were already using the Kubernetes cluster for most of our use cases, and we wanted to use the same cluster. The first thing we wanted to do was Dockerize the application we were running and then use the same Kubernetes cluster, or create a separate workspace in it and use that.
What about the implementation team?
We did the deployment ourselves. We have a team of three or four DevOps engineers who manage our Kubernetes cluster.
For the deployment, we needed one or two people, and for development, we are three to four people. We have a lot of other business applications that are in Flink.
Which other solutions did I evaluate?
Apache Storm, Spark Streaming, and Kafka Streams.
What other advice do I have?
My advice would be to validate your use case. If you are already using a streaming mechanism, I suggest you validate what your actual use cases are and what the advantages of Flink are. Make sure that what you are trying to do can be done by Flink. If you're doing simple aggregation and you don't need to worry about message order, then it's fine; you can use Storm or whatever you are using. If you see Flink features that are useful for you, then you should go for Flink.
Validate your use case, validate your data and pipeline, do a small POC, and see if it is useful. If you think it's useful and worth migrating from your existing solution, then go for it. But if you don't already have a solution and Flink will be your first one, then it's always better to use Flink.
The biggest lesson I have learned is that deployment using Kubernetes was a little difficult. We did not evaluate that when we started the work; we migrated the code, but we did not take on the deployment part. If we had looked at the deployment part initially, we might have chosen Kafka Streams instead, because we were getting similar results and Kafka Streams is easy on the deployment side; you don't need to worry about the cluster.
I would rate Apache Flink an eight out of ten. I would have given it a nine or so if the deployment on Kubernetes weren't a little complicated.
Which deployment model are you using for this solution?
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Amazon Web Services (AWS)
Which version of this solution are you currently using?