Apache Airflow Review

Helps us maintain a clear separation of our functional logic from our operational logic

What is our primary use case?

We are a technology, media, and entertainment-technology company. We are using Apache Airflow for architecting our media workflows. We are using it for two major workflows.

We have had it set up for some time on our own cloud. Recently, we migrated the setup to AWS.

How has it helped my organization?

Airflow is our first choice because we wanted a clear separation of our functional logic from our operational logic. We don't want our microservices to carry the cross-cutting responsibilities of our operational logic. Right now, our microservices hold only the core business functional logic. The majority of our distribution, our decision-making, and our workflow operational responsibilities have been moved to Airflow.

What is most valuable?

The reason we went with Airflow is its DAG presentation, which shows the relationships among all the steps. It's more of a configuration-driven workflow.

It's all Python, as well. The majority of the configuration is Python-friendly.
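To illustrate what a DAG of workflow steps declared in Python looks like, here is a minimal sketch. It uses only the standard library's graphlib module, not Airflow's own API, and the step names (ingest, transcode, and so on) are hypothetical stand-ins for a media workflow.

```python
from graphlib import TopologicalSorter

# Hypothetical media workflow: each key is a step and each value is the
# set of steps it depends on. This mapping is the "configuration".
WORKFLOW = {
    "ingest": set(),
    "transcode": {"ingest"},
    "extract_metadata": {"ingest"},
    "publish": {"transcode", "extract_metadata"},
}


def execution_order(workflow):
    """Return one valid run order that respects every dependency."""
    return list(TopologicalSorter(workflow).static_order())


order = execution_order(WORKFLOW)
print(order)
```

The relationships live in the data structure rather than in the steps themselves, which is the configuration-driven quality described above.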

What needs improvement?

One specific feature that is missing from Airflow is pipelining between the steps of a workflow: every step is stateless. Because of that, not every workflow can be implemented within Airflow. For example, Step 1 of my workflow will have output that I definitely want to be provided automatically as input to Step 2. At the workflow level, we want common state management so that any step can reach the state information. Right now, we're using an external state repository to maintain the state.

If Airflow could come up with some kind of implementation where not every step of the pipeline is an independent step, that would be helpful. I would like it if part of the output of a previous step could be the input for the next step. That kind of pipeline is missing. Other products such as jBPM, Camunda, and Cadence have the concept of pipelining.
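The workaround described above, where independent steps share data through an external state repository, can be sketched roughly as follows. This is an assumption-laden illustration, not our production code: InMemoryStateStore, the step functions, the run ID, and the file path are all hypothetical, and a real repository would be a database or cache rather than a dictionary.

```python
class InMemoryStateStore:
    """Hypothetical stand-in for an external state repository."""

    def __init__(self):
        self._data = {}

    def put(self, run_id, key, value):
        # State is keyed by workflow run so that runs do not collide.
        self._data[(run_id, key)] = value

    def get(self, run_id, key):
        return self._data[(run_id, key)]


def step_1(store, run_id):
    # Step 1 writes its output to the shared store instead of
    # handing it directly to Step 2.
    store.put(run_id, "asset_path", "/tmp/example/asset.mp4")


def step_2(store, run_id):
    # Step 2 looks up what Step 1 produced, then records its own result.
    asset_path = store.get(run_id, "asset_path")
    store.put(run_id, "status", f"processed {asset_path}")


store = InMemoryStateStore()
step_1(store, "run-001")
step_2(store, "run-001")
result = store.get("run-001", "status")
print(result)
```

Each step stays independent, but every step has to know about the store and the keys, which is the extra burden that built-in pipelining would remove.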

I would also like to see support for more platforms, in terms of programming the BPM. Cadence supports Golang and Java. Legacy components can come from any platform, so it would be helpful if Airflow provided more client support, such as Java and Golang client libraries. I want to be able to program it in Java.

For how long have I used the solution?

I have been using Apache Airflow for more than a year.

What do I think about the scalability of the solution?

It's definitely scalable.

We have been using Airflow for some time, but we are not heavily dependent on it. We only have a couple of use cases being executed by Airflow.

Because we have some data engineering problems, we have a good number of analytics systems. We have a high volume of data that comes into our system, along with a lot of email, and we have to have an automated data pipeline. Given that, we have all these computing capabilities that are built on microservices. The beauty of Airflow is its scalability. It knows every step of your workflow, and it has scheduler capabilities. Every step of your workflow is delegated to one of your nodes, and the nodes are scaled according to your computing needs.

We are still evolving. Our business processes are not completely automatic. We're still in the process of identifying what all the automation cases are that we can bring under Airflow. We would like to leverage one common orchestrator or workflow BPM for our complete ecosystem. So we have some architects in our system who are happy with Airflow and others who would like to migrate to some other BPM like Cadence or Apache NiFi. There are a lot of orchestrators and we're just out of the gate. Airflow is still not being heavily used in our enterprise.

Which solution did I use previously and why did I switch?

This is the first workflow BPM tool that we are using in our platforms.

How was the initial setup?

There is comprehensive documentation for setting up a simple workflow and you just follow the documentation for setting things up. We're all engineers so we don't mind if the steps are lengthy, in terms of setting up the system. I'm quite okay with the documentation provided for getting your system up and running.

But I would appreciate it if they published a portal with case studies showing how other businesses or technology companies are solving their problems using Airflow. It would help us to review those case studies. My biggest problem when I was deciding whether Airflow fit our needs was that I was looking for case studies of technology companies that were already using the solution. With Camunda and jBPM, there is a good quantity of case studies available online.

Which other solutions did I evaluate?

There is no scarcity of BPMs. There are many products online: either open-source or community products or licensed products. There are many good BPMs. The reason that Airflow is in my system is that some of our workflows which we have onboarded are also on Python. Airflow complements that. But the first and foremost ability of any orchestrator should be to integrate with any underlying platform, be it a Java platform or a Python platform. That's the beauty of an orchestrator.

What other advice do I have?

We have a team of four to five people who initially evaluated Airflow and wanted to implement it.

We have customers onboarded on our legacy systems. I cannot disrupt the service and bring everything into Airflow. I have to onboard Airflow seamlessly while protecting my current, ongoing business systems, so I'm trying to balance things here. We have only been able to onboard a couple of workflows. Eventually, we want to do it more fully, but there were a few challenges, as I mentioned: there is no pipeline to carry information between steps, which forces me to retain my state in a separate state repository. That would be the next big area where I would like to see improvement.

Which deployment model are you using for this solution?

Public Cloud

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Amazon Web Services (AWS)

Disclosure: I am a real user, and this review is based on my own experience and opinions.