Apache Flink Valuable Features

Ilya Afanasyev - PeerSpot reviewer
Senior Software Development Engineer at Yahoo!

We value this solution's sophisticated design because it keeps state inside the processing mechanism itself. The system allows us to process batch data, stream data in real time, and build pipelines. Additionally, we do not need to reprocess data from the beginning when we pause; we can continue from the same point where we stopped. This saves us time: 95% of our pipelines will now be on Amazon, and we'll save money by saving time.

PrashantVaghela - PeerSpot reviewer
Principal Engineer at InnovAccer Inc.

If you have data that is streaming from different kinds of sources, Flink has many advantages. 

One is that Flink basically gives you stateful transformations. So, if you want to transform your streaming data that is coming in, and you want to apply stateful transformations, one transformation after another, depending on the result of the first transformation, Flink allows you to do that.

There are many other positive points. The reason why the Apache organization actually came up with Flink in the first place is that it offers stateful transformations and complete offset management between transformations.
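
The chained stateful transformations described above can be sketched in plain Python (this is a conceptual illustration, not the Flink API; the class and key names are made up). Each stage keeps its own state, and the second stage's behavior depends on the first stage's output:

```python
# Conceptual sketch (plain Python, not the Flink API): two chained
# stateful transformations, where the second depends on the first's output.

class RunningTotal:
    """First transformation: keeps a per-key running total as state."""
    def __init__(self):
        self.totals = {}          # state: key -> running total

    def process(self, key, amount):
        self.totals[key] = self.totals.get(key, 0) + amount
        return key, self.totals[key]

class ThresholdAlert:
    """Second transformation: fires once per key when the total crosses a limit."""
    def __init__(self, limit):
        self.limit = limit
        self.alerted = set()      # state: keys already alerted

    def process(self, key, total):
        if total >= self.limit and key not in self.alerted:
            self.alerted.add(key)
            return f"ALERT: {key} reached {total}"
        return None

totals, alerts = RunningTotal(), ThresholdAlert(limit=100)
for key, amount in [("user-a", 60), ("user-b", 30), ("user-a", 50)]:
    out = alerts.process(*totals.process(key, amount))
    if out:
        print(out)   # ALERT: user-a reached 110
```

In real Flink, this state would live in managed keyed state rather than plain instance fields, but the shape of the pipeline is the same.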

Armando Becerril - PeerSpot reviewer
Partner / Head of Data & Analytics at Kueski

The top feature of Apache Flink is its low latency for fast, real-time data. Another great feature is the real-time indicators and alerts which make a big difference when it comes to data processing and analysis.

Buyer's Guide
Apache Flink
April 2024
Learn what your peers think about Apache Flink. Get advice and tips from experienced pros sharing their opinions. Updated: April 2024.
768,578 professionals have used our research since 2012.
RA
Sr. Software Engineer at a tech services company with 10,001+ employees

You need to understand that Flink works on streaming data. Among the nuances of streaming are what we call delivery semantics, and exactly-once is the hardest of them; it is very well supported by Flink. What I mean by that is, when an event or piece of data comes in, you process that event only once. This is very important when you're trying to do aggregates.

Let's say you are a seller on the site, and depending on the selling limit that is set, you're allowed to list only three items a day. Suppose we miscount in the last five minutes: you made three listings, but because the current listing is not yet reflected in the aggregate, we allow a fourth one. That is wrong. We can't allow a wrong decision about your selling limit, and the reason it happens is that the data is outdated because it got delayed. The data could also get lost, or not get aggregated exactly the way it was supposed to be. Exact aggregations are what make this much harder to do in real time; that's why you use offline systems like Spark and Hadoop. But you can do real time with exactly-once semantics, which is very well supported in Flink. That's one of the best features.
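
To make the seller example concrete, here is a plain-Python sketch (not Flink code; the event IDs are hypothetical) of why delivery semantics matter for the count. With at-least-once delivery, a retried event inflates the count; with exactly-once behavior, each listing contributes once:

```python
# Conceptual sketch (plain Python, not Flink): why exactly-once matters
# when counting a seller's daily listings.

def count_at_least_once(events):
    # At-least-once: a redelivered event gets counted twice.
    count = 0
    for _event_id in events:
        count += 1
    return count

def count_exactly_once(events):
    # Exactly-once behavior: each event contributes once, even if redelivered.
    seen, count = set(), 0
    for event_id in events:
        if event_id not in seen:
            seen.add(event_id)
            count += 1
    return count

# The seller made three listings, but listing "L2" was redelivered after a retry.
deliveries = ["L1", "L2", "L2", "L3"]
print(count_at_least_once(deliveries))   # 4 -> seller wrongly appears over the 3-a-day limit
print(count_exactly_once(deliveries))    # 3 -> correct
```

Flink achieves this at the framework level (via checkpointed offsets rather than an explicit `seen` set), so your aggregates stay correct without hand-written deduplication.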

Another feature is how Flink handles failures. It has something called the checkpointing concept. You're dealing with billions and billions of requests, so your system is going to fail at some point at that scale. Flink handles this with the concepts of checkpointing and savepointing, where it writes the aggregated state into separate storage. In case of failure, you can restore from that state and come back.

I'll take the example of Call of Duty. Let's say when you play a game of Call of Duty there are five levels, and in each level there are different obstacles you need to clear to advance; say ten obstacles per level. If you've cleared three obstacles and you die in the process, you don't want to start the level from scratch again, right? You want to start from the third obstacle. That is what the concepts of checkpointing and savepointing allow you to do: I have done work up to this point, and now I want to restart from exactly there; I don't want to redo that part again. So how does it help in our case? Say today is Wednesday, October 15. For all the applications, we take an aggregation snapshot at a regular interval and store it in separate storage; suppose the last one is at 12:00 PM. If the system goes down after 12:00 PM because the load is high, or due to some other hardware failure, you can recover the state as of 12:00 PM and reprocess from there. You're not recovering the entire thing.

Let's say the seller's listing count was two. You don't want to start counting for that particular seller from zero; you want to continue from two after 12:00 PM, right? That's what Flink helps you do.
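The checkpoint-and-recover idea above can be sketched in plain Python (this is a simulation of the concept, not Flink's actual checkpoint mechanism; names like `checkpoint_store` are made up):

```python
# Conceptual sketch (plain Python, not Flink): periodically snapshot the
# aggregated state so recovery can resume from the last checkpoint instead
# of reprocessing everything from the beginning.

import copy

checkpoint_store = {}                      # stands in for durable storage

def process(events, state, checkpoint_every=2):
    for i, (seller, _listing) in enumerate(events, start=1):
        state[seller] = state.get(seller, 0) + 1
        if i % checkpoint_every == 0:      # snapshot at a regular interval
            checkpoint_store["latest"] = copy.deepcopy(state)
    return state

state = process([("s1", "a"), ("s1", "b"), ("s2", "c")], {})
# Crash: the in-memory state is lost, but the checkpoint survives.
recovered = copy.deepcopy(checkpoint_store["latest"])
print(recovered)   # {'s1': 2} -- resume counting from two, not from zero
```

Real Flink checkpoints also record source offsets alongside operator state, which is what makes exactly-once recovery possible end to end.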

Additionally, it helps you scale very well, but there are a lot of nuances. Because Flink allows you to do aggregations upon aggregations, maintaining the system is not that easy: you need machines with high RAM and good memory capacity. It depends on what kind of problems you're solving. The problem I'm describing is actually the hardest kind, where you're putting state in Flink's memory; these are stateful problems. There are other problems called stateless problems. Let's say you are a trucking company and you want to track your trucks' location data. You have 15 trucks, and each truck is emitting latitude and longitude information as it moves. In this case, maybe your intention is only to track them in real time, and you're okay with a five or ten minute delay. You're not aggregating anything; there's nothing stateful about it. You take the data and dump it to storage (that storage could be anything, Elasticsearch for example), and then you can create a graph and plot a trendline on a map: where is it going? The data comes in, it gets written to a place, and then you draw a graph from it. In this kind of application, Flink works very well, and you can really do a lot of real-time analytics on it. But the problem I described, where you do real-time aggregations, is a little bit trickier, because you need to recover from the right state so that you don't mess up the calculations.
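The stateless case the reviewer describes is essentially record-at-a-time pass-through. A minimal plain-Python sketch (not Flink code; the record shape and sink are made up) shows why failure handling is easy here, since no aggregate needs to be recovered:

```python
# Conceptual sketch (plain Python, not Flink): a stateless pipeline.
# Each truck position is transformed and written out independently;
# no aggregate is kept, so a failure loses nothing that cannot simply
# be reprocessed from the source.

sink = []                                   # stands in for any storage/sink

def handle(record):
    truck_id, lat, lon = record
    sink.append({"truck": truck_id, "lat": lat, "lon": lon})

for record in [("t1", 19.43, -99.13), ("t2", 19.50, -99.20)]:
    handle(record)                          # each record is independent

print(len(sink))   # 2
```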

Sunil Morya - PeerSpot reviewer
Consultant at a tech vendor with 10,001+ employees

Apache Flink was easy to deploy and manage.

ZHIZHENG - PeerSpot reviewer
Product Operations Manager at OKX

Apache Flink's best feature is its data streaming tool.

AC
CTO at ReNew

The product helps us to create both simple and complex data processing tasks. Over time, it has facilitated integration and navigation across multiple data sources tailored to each client's needs. We use Apache Flink to control our clients' installations. 

MP
Lead Data Scientist at a transportation company with 51-200 employees

The significant advantage is the learning curve, as it is easy. It's not tied to any specific cloud vendor, which gives us the flexibility to deploy it on any cluster without being constrained by cloud-based limitations. It also gives us the freedom to make optimizations and specific customizations as needed.

JR
Sr Software Engineer at a tech vendor with 10,001+ employees

The most valuable feature is that there is no distinction between batch and streaming data. When we want to use batch mode, we use Apache Spark. The problem with Spark is that when it comes to time-series data, it does not train well. With Flink, however, we can have the streaming capability that we want.

The documentation is very good.

A lot of metrics are supported and there is also logging capability.

There is API support.

BH
Lead Software Engineer at a tech services company with 5,001-10,000 employees

MapFunction and FilterFunction are the most useful, or at least the most used, functions in Flink. Data transformation becomes easy. For example, in applications that store information about people and want to retrieve those people's details by some relation, we can use these functions to select all persons in a particular age group. That's why the mapping function is very useful. This can be helpful in analytics, for targeting specific news to a specific age group.
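The age-group example can be sketched in plain Python, mirroring the style of Flink's MapFunction and FilterFunction (this is conceptual, not the Flink API; the records and age bounds are made up for illustration):

```python
# Conceptual sketch (plain Python): a filter step followed by a map step,
# in the spirit of Flink's FilterFunction and MapFunction.

people = [
    {"name": "Asha", "age": 17},
    {"name": "Ben", "age": 34},
    {"name": "Carla", "age": 29},
]

def age_filter(person):          # FilterFunction-style: keep 18-35 only
    return 18 <= person["age"] <= 35

def to_target(person):           # MapFunction-style: project to what analytics needs
    return (person["name"], "18-35")

targeted = [to_target(p) for p in people if age_filter(p)]
print(targeted)   # [('Ben', '18-35'), ('Carla', '18-35')]
```

In Flink itself, the same shape would be `stream.filter(...).map(...)` over a DataStream instead of a Python list.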

SD
Software Architect at a tech vendor with 501-1,000 employees

When we use the Flink streaming pipeline, the first thing we use is the windowing mechanism with the event-time feature, and with that, aggregation in Flink is very easy. Before this, we were using Apache Storm. Apache Storm is stateless, while Apache Flink is stateful. With Apache Storm, we had to use an intermediate distributed cache, but because Flink is stateful, it can manage the state and the failure mechanism for us. The result is that we do an aggregation every 10 minutes, and we do not need to worry about our application stopping in between those 10 minutes and then restarting.
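The 10-minute aggregation described here is a tumbling event-time window. A plain-Python sketch of the idea (not Flink's windowing API; the timestamps and values are made up):

```python
# Conceptual sketch (plain Python, not Flink): a tumbling event-time window
# that aggregates values into fixed 10-minute buckets keyed by window start.

from collections import defaultdict

WINDOW = 10 * 60                      # 10 minutes, in seconds

def window_sums(events):
    """events: iterable of (event_time_seconds, value) -> {window_start: sum}"""
    sums = defaultdict(int)
    for ts, value in events:
        window_start = ts - (ts % WINDOW)   # bucket the event by its own time
        sums[window_start] += value
    return dict(sums)

events = [(30, 1), (599, 2), (600, 5), (1150, 3)]
print(window_sums(events))   # {0: 3, 600: 8}
```

Flink adds what this sketch omits: firing each window when the event-time watermark passes its end, and checkpointing the partial sums so a restart does not lose the in-flight window.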

When we were using Storm, we used to manage all of it ourselves; we created manual checkpoints in Redis. Flink supports these as built-in features, like checkpointing and statefulness, and you can use event time or processing time for your messages.

Another important thing is the out-of-order message processing. When you use any streaming mechanism, there is a chance that your source is producing messages that are out-of-order. When you build a state machine, it's very important that you can have the messages in order, so that your computations/results are correct. What happens with Storm or any other framework that you're using is that to get messages in order, you have to use an intermediate Redis cache, and then sort the messages. When we use Flink, it has an inbuilt way to have the messages in order, and we can process them. It saves a lot of time and a lot of code.
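The out-of-order handling described above can be sketched in plain Python (this is a simplified stand-in for Flink's watermark machinery, not its API; `max_lateness` is an assumed bound on how late a message can be):

```python
# Conceptual sketch (plain Python, not Flink): buffer out-of-order messages
# and release them in event-time order once no earlier message can still
# arrive, per the max_lateness assumption.

import heapq

def reorder(stream, max_lateness=2):
    """stream: iterable of (event_time, payload), possibly out of order.
    Returns the events sorted by event time, emitting each one only once
    it is older than the newest arrival by more than max_lateness."""
    buffer, out = [], []
    for ts, payload in stream:
        heapq.heappush(buffer, (ts, payload))
        # Everything at or before (ts - max_lateness) is now safe to emit.
        while buffer and buffer[0][0] <= ts - max_lateness:
            out.append(heapq.heappop(buffer))
    out.extend(heapq.heappop(buffer) for _ in range(len(buffer)))  # drain
    return out

arrived = [(1, "a"), (3, "c"), (2, "b"), (5, "e"), (4, "d")]
print(reorder(arrived))   # [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e')]
```

This is the code you no longer have to write yourself (the intermediate Redis plus sort the reviewer mentions) because Flink's event-time support does the equivalent internally.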

I have written both Storm and Flink code. With Storm, I used to write a lot of code, hundreds of lines, but with Flink it's less, around 50 to 60 lines. I don't need to use Redis as an intermediate cache, so a lot of code is saved. I have to aggregate over roughly 10-minute intervals, and there is an inbuilt mechanism for that. With Storm, I needed to write our own logic, a bunch of connected bolts, and the intermediate Redis. The same code that would take me one week to write in Storm, I could do in a couple of days with Flink.

I started with Flink five to six years ago for one use case and the community support and documentation were not good at that time. When we started back again in 2019, we saw that documentation and community support were good.

VI
Principal Software Engineer at a tech services company with 1,001-5,000 employees

Apache Flink is meant for low-latency applications. You can hold on to one event if you want to maintain a certain state, and when another event comes and you want to associate those events together, in-memory state management is a key feature for this.

Checkpointing was important since our consumption is done through Kafka. There was a continuous flow of data coming in from cars that was put into Kafka; this particular Apache Flink component came in and started processing it. When there's a failure or something like that, checkpointing is very important.

It also helped us with the exactly-once standard. Another valuable feature is the API abstraction, which is nicely done. Anyone can understand it; it's not very complex. Anyone can go through all the transformations and everything they have, and it's easy to use. It's a well-balanced abstraction.

RP
Software Development Engineer III at a tech services company with 5,001-10,000 employees

The most valuable feature of Apache Flink is that it is truly real-time. Unlike Spark and other technologies, it's not recurring batch processing, and it also gives me better control over resources. For example, in Spark it's very difficult to create multiple parallel streams, and it consumes the memory of your entire cluster very greedily. With Flink, I have very good control: I can choose the number of task managers, give each a fixed amount of memory, and configure the parallelism. This flexibility is very useful when scaling Flink.
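As a rough illustration of that resource control, these kinds of limits are typically set in Flink's configuration file. The keys below are real Flink options, but the values are illustrative only, not a recommendation:

```yaml
# Illustrative flink-conf.yaml fragment (values are examples, not defaults)
taskmanager.memory.process.size: 4096m   # fixed total memory per task manager
taskmanager.numberOfTaskSlots: 4         # parallel slots per task manager
parallelism.default: 8                   # default operator parallelism
```

This is what lets you cap what each task manager can consume, instead of a job greedily taking the whole cluster's memory.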

View full review »
Ertugrul Akbas - PeerSpot reviewer
Manager at ANET

It's usable and affordable.

It is user-friendly and the reporting is good.
