Spring Cloud Data Flow Review

Good logging mechanisms, a strong infrastructure and pretty scalable


What is our primary use case?

Most of our use cases involve building data pipelines. Multiple microservices run inside the Spring Cloud Data Flow infrastructure, and we build pipelines that process data step by step using Kafka. Most of the processors, sinks, and sources are developed based on the customer's business requirements or use cases.
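As a rough illustration of the source, processor, and sink structure described above, here is a minimal sketch in plain Java. It does not use Spring or Kafka: the class and method names are hypothetical, and direct method calls stand in for the message broker that would sit between the steps in a real stream.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

// Plain-Java sketch of a source -> processor -> sink pipeline.
// All names are hypothetical; no Spring or Kafka classes are used.
public class PipelineSketch {

    // Source: emits one record per call, or null when nothing is left.
    static Supplier<String> source(List<String> records) {
        Iterator<String> it = records.iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    // Processor: one step of business logic (here: trim and upper-case).
    static Function<String, String> processor() {
        return record -> record.trim().toUpperCase(Locale.ROOT);
    }

    // Sink: delivers each processed record to a destination (here: a list).
    static Consumer<String> sink(List<String> destination) {
        return destination::add;
    }

    // Wires the three steps together with direct calls. In a deployed
    // stream, each step would be its own application and a broker
    // would carry the records between them.
    static List<String> run(List<String> records) {
        List<String> destination = new ArrayList<>();
        Supplier<String> src = source(records);
        Function<String, String> proc = processor();
        Consumer<String> snk = sink(destination);
        for (String r = src.get(); r != null; r = src.get()) {
            snk.accept(proc.apply(r));
        }
        return destination;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(" invoice ", "statement")));
    }
}
```

Keeping each step behind a small functional interface is what makes it easy to swap a processor or add another sink when the customer's requirements change.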

In the example of the bank we work with, we are building a document analysis pipeline. There are some defined sources where we get the documents. Later on, we extract some information united from the summary and we export the data to multiple destinations. We may export it to the POGI Database and/or to a Kafka topic.

For CoreLogic, we were importing data into Elasticsearch. We had a BigQuery data source, and from there we did some transformation of the data and then imported it into the Elasticsearch clusters. That was the ETL solution.

How has it helped my organization?

For example, PCF, like all the cloud services, has its own microservice management infrastructure. However, if you have CDF running, the developer has more control over the messaging platform. Being able to control how data flows from one microservice to another is great. As a developer, I feel more in control. Some hosted services (like the cloud) or hosted infrastructure let us run smaller microservices, but they are infrastructure dependent. If anything happens (like a bug or any other issue), it can be difficult to trace the problem. That's not true here. CDF is really good at logging. Therefore, as a developer, I can use my Spring Boot logging mechanism to check what the problem is, and it helps a lot.

I've been working with the solution for eight or nine years at this point. I feel more comfortable with the infrastructure. CDF is actually infrastructure for the Spring Boot applications running inside it, either as a task or as a long-lived microservice in the data pipeline. If you have a Spring Cloud Data Flow server implemented in your project, that means you have your own data pipeline architecture, and you can design the flow of your data processing as you wish.

There is also logging for these short-lived tasks. It records when a task is started and when it is stopped, and this kind of logging gives us more transparency.

In terms of direct benefit to the company, they spend less money. If you have some kind of hosted BPM or other hosted service to orchestrate your microservices, you need to pay fees to a company to manage it. However, if your developers can manage CDF themselves, this management cost is reduced. I'm not sure of the actual hard costs; however, I am aware of the savings.

What is most valuable?

Mostly we enjoy the orchestration of microservices, as you can have a Spring Boot application and build your own steps. You can deal with as many processors as you need. There is also Spring Cloud Task inside CDF, which is helpful for temporary work. You can trigger a task and it will do something in a few microseconds and then finish. No memory stays occupied by a running microservice. You just start the microservice, it does some work, and then it dies and the memory is released. These kinds of temporary activities are also helpful.

It's a low-resource type of product. You have a scheduler running, and you have a lot of smaller tasks to be done by the scheduler. Therefore, you don't need to keep the microservice running. You can trigger the task, the task will be executed, then the JAR execution will end and the memory will be released. So you never need to keep any long-lived microservices running.
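To make the short-lived task idea concrete, here is a minimal sketch in plain Java of a task that starts, logs its start and stop, does its work, and exits so the memory is freed. It is only an illustration under my own assumptions: the `ShortLivedTaskSketch` and `TaskRun` names are hypothetical, and the real task support in CDF keeps this kind of start/stop record for you rather than printing it.

```java
import java.time.Instant;
import java.util.concurrent.Callable;

// Plain-Java sketch of a short-lived task: run once, log start and stop,
// report an exit code, and let the JVM terminate.
// All names are hypothetical; no Spring classes are used.
public class ShortLivedTaskSketch {

    // Hypothetical record of a single task run.
    static final class TaskRun {
        final String name;
        final Instant start;
        final Instant end;
        final int exitCode;

        TaskRun(String name, Instant start, Instant end, int exitCode) {
            this.name = name;
            this.start = start;
            this.end = end;
            this.exitCode = exitCode;
        }
    }

    // Runs the work once and records when it started and stopped.
    // Any exception is turned into a non-zero exit code.
    static TaskRun execute(String name, Callable<Integer> work) {
        Instant start = Instant.now();
        System.out.println("task " + name + " started at " + start);
        int exitCode;
        try {
            exitCode = work.call();
        } catch (Exception e) {
            System.out.println("task " + name + " failed: " + e.getMessage());
            exitCode = 1;
        }
        Instant end = Instant.now();
        System.out.println("task " + name + " stopped at " + end
                + " with exit code " + exitCode);
        return new TaskRun(name, start, end, exitCode);
    }

    public static void main(String[] args) {
        // The work itself is a placeholder; a real task would do its job here.
        TaskRun run = execute("cleanup", () -> 0);
        // Exiting ends the process, so nothing stays resident in memory.
        System.exit(run.exitCode);
    }
}
```

A scheduler can launch a process like this whenever the work is due, which is the alternative to keeping a microservice alive around the clock.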

There are a lot of options in Spring Cloud. It's flexible in terms of how we can use it. It's a full infrastructure.

What needs improvement?

The configuration could be better. Some configurations are time-consuming to understand from the Spring Cloud documentation.

The documentation on offer is not that good. The Spring Cloud Data Flow documentation for the configurations is not exactly clear. Sometimes they provide examples that are not complete. Some parts are presented in the documentation but not shown in example code. When we tried to implement multiple configurations, for example, when we integrated PCF (Pivotal Cloud Foundry) with CDF, there were issues. PCF has a workspace concept; however, when we tried to implement the workspace in CDF, some kind of boundary configuration was not integrating properly. We then went to the documentation and tried to customize it a little bit on the configuration level - not on the code level - to get the solution working.

It is open source. Therefore, you need to work a little bit. You need to do some brainstorming on your own. There's no one to ask. We cannot call someone and ask what the problem is. It is an open-source project without technical support. It's up to us to figure out what the problem is.

For how long have I used the solution?

I've been working with the solution for more than 11 months on two separate projects in California and Illinois. However, I have been familiar with the solution since 2017 and have used it on and off since then on a variety of projects.

What do I think about the stability of the solution?

Spring Cloud Data Flow is an open-source project and a lot of developers are working on it. It is really stable right now. The configuration part may need some improvement, or rather, some simplification. For a simpler implementation or a smaller project, there is no problem. Whether you deploy in PCF or in Kubernetes, it is the same CDF server, but then there are some integrations to deal with.

What do I think about the scalability of the solution?

The solution scales well. 

The main reason to use the Spring Cloud Data Flow server is to scale your project. You can split it into multiple microservices, then deploy them onto multiple servers. We relied on the PCF (Pivotal Cloud Foundry) platform, which has the Spring Cloud Data Flow server integrated right in. Our microservice was running in their cluster, and it was running in multiple instances. We can increase the number of instances of these microservices as we need.

How are customer service and technical support?

The solution is open-source, so there really isn't technical support to speak of. If there are issues, we need to troubleshoot them ourselves. We need to go through the code and work through the issues independently.

Which solution did I use previously and why did I switch?

We've had experience with Apache NiFi and also Spark; however, Spark is mostly just an execution engine. They have a similar architecture. Apache NiFi, like this solution, is also open-source. We've looked at AWS Step Functions; however, their concept is closer to a serverless architecture. You don't even need to write the boilerplate code to run the application as a microservice. You just copy the part of the code you need to execute as a function. In CDF, we write each microservice as a full application and then run that code inside the microservice, in some method. In AWS, with Step Functions and Lambda, you only provide the part of the code that you need to execute, then use their platform to connect all the steps. Amazon can be expensive as you do need to pay for their services. The others you can just install on your own servers.

How was the initial setup?

During the initial setup, when I ran the CDF server (just one JAR, then the Skipper server as another JAR), I created some tasks and created a source string with an ITA service string. These tasks are all simple. However, if we try to integrate with some kind of platform, for example, another platform where I'm going to deploy CDF, then the complexity comes into play. Otherwise, if you can run it in a single ECS, any kind of Linux box, or a server instance, then there is no issue. You can do everything.

I used Docker Compose and we Dockerized a lot of things. It was a quick deployment.

That said, each deployment is unique to each client. It's not always the same steps across the board.

What other advice do I have?

While the deployment is on-premises, the data center is not on our premises. It's in a different geographical location; however, it was the client's own data center. We deployed there, and we installed the CDF server, then the Skipper server, and everything else, including all the microservices. We used the Pivotal Cloud Foundry platform, and for the bank, we deployed in Kubernetes.

Spring Cloud Data Flow server is pretty standard to implement. The year before, it was a new project; however, now it is already implemented in many, many projects. I think developers should start using it if they are not using it yet. In the future, there could be some more improvements in the area of the data pipeline ETL process. That said, I'm happy with the Spring Cloud Data Flow server right now.

Our biggest takeaway has been to design the pipeline around the customer's needs. We cannot think about everything only as developers. Sometimes we need to think about what the customer needs instead. Everything needs to be based on the customer's flow. That helps us design a proper data pipeline. The task mechanism is also helpful, since we can run some tasks instead of keeping the application live 24 hours a day.

Overall, I'd rate the solution nine out of ten. It's a really good solution and a lot cheaper than a lot of infrastructure provided by big companies like Google or Amazon.

Which deployment model are you using for this solution?

On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.