What is our primary use case?
We use Flink as part of a pipeline for data cleaning. We are not using all of Flink's features; rather, we run our Apache Beam jobs on the Flink runner.
We are a CRM product company with many customers using our CRM. We like to give them as much insight as we can based on their activities, including how many transitions they perform over a given period. We also have other services, including machine learning, and so far the data feeding them has not been very clean, which means it has to be cleaned up manually. That does not work well when you are processing Big Data in real time.
We use Apache Flink with Apache Beam as part of our data-cleaning pipeline. It performs data normalization and other cleansing steps, which ultimately gives the customer the feedback that they want. We also have a separate machine learning feature, which customers can optionally purchase.
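As an illustration of the kind of normalization step such a pipeline might run, here is a hypothetical per-record cleanup function in Python; the field names and rules are assumptions for the sketch, not our actual schema. In a Beam job, a function like this would typically be applied with `beam.Map`:

```python
import re

def normalize_record(record):
    """Normalize one raw CRM activity record (hypothetical schema).

    Trims whitespace, lower-cases the email, collapses internal
    whitespace in names, and drops empty fields -- the kind of
    per-record cleanup a data-cleaning pipeline step would apply.
    """
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()
        if value in ("", None):
            continue  # drop empty fields
        cleaned[key] = value
    if "email" in cleaned:
        cleaned["email"] = cleaned["email"].lower()
    if "name" in cleaned:
        cleaned["name"] = re.sub(r"\s+", " ", cleaned["name"])
    return cleaned

raw = {"name": "Ada   Lovelace ", "email": " ADA@Example.COM", "phone": ""}
print(normalize_record(raw))
# In a Beam pipeline this would run as: records | beam.Map(normalize_record)
```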
How has it helped my organization?
We have a set of pipeline services that we run. For example, we might use Apache Beam to define a four-hour job and use Flink to run it. Because Beam jobs are portable, the same job could also run on another engine, such as Apache Spark.
We have many systems, including Elasticsearch, MongoDB, and other services. Based on what we have running, we want to clean and transform some of our data.
Currently, we have two implementations of Flink: one reads from Kafka and the other from Cassandra. If the data is streaming, we use Kafka; if it is batch, we use Cassandra, and we process everything we need accordingly. The result of all of these services is a much better user experience.
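A minimal sketch of that routing decision, with hypothetical connector names and topic/table identifiers (the review does not show our real configuration):

```python
from enum import Enum

class Mode(Enum):
    STREAMING = "streaming"
    BATCH = "batch"

def pick_source(mode):
    """Return a hypothetical connector config for the pipeline.

    Mirrors the setup described above: streaming jobs read from
    Kafka, batch jobs read from Cassandra.
    """
    if mode is Mode.STREAMING:
        return {"connector": "kafka", "topic": "crm-activity"}
    if mode is Mode.BATCH:
        return {"connector": "cassandra", "table": "crm.activity"}
    raise ValueError(f"unknown mode: {mode}")

print(pick_source(Mode.STREAMING)["connector"])  # kafka
print(pick_source(Mode.BATCH)["connector"])      # cassandra
```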
What is most valuable?
The most valuable feature is that there is no distinction between batch and streaming data. When we wanted batch mode, we used Apache Spark, but the problem with Spark is that it does not handle time-series data well. With Flink, we get the streaming capability that we want.
The documentation is very good.
A lot of metrics are supported and there is also logging capability.
There is API support.
What needs improvement?
We have a machine learning team that works with Python, but Apache Flink does not have full support for the language. We had to use Java to implement some of our pipeline jobs.
For how long have I used the solution?
We have been using Apache Flink for between one and one-and-a-half years.
What do I think about the stability of the solution?
Stability is pretty good and we haven't had any problem with it.
We are using this product extensively and we have new products being onboarded.
What do I think about the scalability of the solution?
Apache Flink scales well. As long as we are on Kubernetes, we can scale as much as we want.
We have a data team with between twenty and twenty-five people. It is split into two groups where the first group works on reporting, machine learning, and background operations. The second group works with Big Data.
How are customer service and technical support?
We have not used technical support from Apache.
Community support is available.
Which solution did I use previously and why did I switch?
Prior to Flink, we used Apache Spark.
We had to move to Flink because of the streaming capabilities that it has. In our architecture, we have one layer for batch processing and the other for streaming. This is quite a pain for us because we don't want to have two separate jobs to handle both streaming and batch processing. Using Flink, we are able to utilize the API and handle both of these jobs.
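The appeal of a single API for both modes can be illustrated with a plain-Python sketch: one hypothetical `enrich` transform, written once, is fed either a finite collection (batch) or a generator standing in for an unbounded stream. This is only an analogy for Flink's unified API, not Flink code, and the `score` metric is invented for the example:

```python
def enrich(events):
    """One transform for both modes: works on any iterable of events.

    `events` can be a finite list (batch) or an unbounded generator
    (streaming); the logic is written once, analogous to writing one
    Flink job instead of separate batch and streaming jobs.
    """
    for event in events:
        yield {**event, "score": event["clicks"] * 2}  # hypothetical metric

# Batch: a finite collection.
batch = [{"clicks": 1}, {"clicks": 3}]
print([e["score"] for e in enrich(batch)])  # [2, 6]

# Streaming: a generator standing in for an unbounded Kafka stream.
def stream():
    for n in (2, 5):
        yield {"clicks": n}

print(next(enrich(stream()))["score"])  # 4
```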
How was the initial setup?
The complexity of the initial setup depends on your use cases and what you are trying to achieve. In our case, we didn't have any problems with it.
This product can be deployed on-premises or as a SaaS on the cloud. It depends on the requirements of the customer.
The deployment using Kubernetes takes approximately 30 minutes to complete.
What about the implementation team?
Our in-house team is responsible for scaling and other maintenance. There is very good documentation available for this.
What's my experience with pricing, setup cost, and licensing?
This is an open-source platform that can be used free of charge.
Which other solutions did I evaluate?
We learned about Apache Flink through using Apache Beam. Originally, I did not know very much about Flink. The thing about Apache Beam is that you cannot run it alone; once you create the jobs, you need an engine to run them. That left us with two options: Apache Spark and Apache Flink. We chose Flink because it was more compatible with what we wanted to do.
What other advice do I have?
We are very happy with the product, and we have been able to achieve all of the use cases that we are expected to deliver for our customers.
Over time, I have seen many improvements including in the documentation. An example is that when we first started using this product, almost two years ago, there was no support available.
At this point, we do not have much test coverage, but we have some test cases to ensure that our system is not breaking. Our QA team validates these things based on what is expected versus what we have produced.
My advice for anybody who is considering Flink is that its documentation is very mature and you can do what you want with it. It is a very good way to implement streaming pipelines, and you won't have any problems with it.
The biggest lesson that I have learned from using Flink is how we can customize the experience for the customer and how important it is to keep up with the industry. We don't want to be left behind.
I would rate this solution a seven out of ten.