How has it helped my organization?
I used Kafka with a client to decouple applications with different availability profiles. Before moving to a messaging-based architecture with Kafka as the messaging system, the client used a coordinator application to fire off various POST requests to as many as eight other applications. The application serves at least one customer per second in airports, where customers demand that the system always works, so ensuring high availability was a problem.
A typical way to calculate system availability is: Availability = Uptime/(Uptime + Downtime). When a request depends on two applications that are each 99% available, both must be up at the same time, so the total system availability degrades quickly: 99% * 99% = 98.01%.
With eight applications, total availability caused issues. However, only two systems needed to provide real-time responses, while other systems were for payment processing, CRM, promotions, etc. It was OK if those systems were not up to date in real time.
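To make the arithmetic concrete, here is a minimal sketch that extends the two-application calculation to eight. It assumes, purely for illustration, that every application is independently available 99% of the time; the function name is hypothetical.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a request that needs every listed component to be up."""
    return prod(availabilities)


# Assumption for illustration: each application is independently 99% available.
two_apps = serial_availability([0.99] * 2)
eight_apps = serial_availability([0.99] * 8)

print(f"2 apps in series: {two_apps:.2%}")    # 98.01%
print(f"8 apps in series: {eight_apps:.2%}")  # 92.27%
```

Under that assumption, a request that must touch all eight applications succeeds only about 92% of the time, which is why limiting the synchronous path to the two real-time systems matters.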
Kafka allowed the client to have temporal decoupling for writes, i.e., the flaky third-party CRM system did not need to be available at that moment for us to respond to a user with a successful response. The availability concerns shifted to Kafka, which is a better trade-off because it's built for this.
Another benefit, though not required, was the addition of logical decoupling between applications. Additional consumers could be built to overlay concerns of analytics, but the systems responsible for creating the entities on a given topic did not need to be aware of the analytics applications. This simplifies the interaction between applications and concerns of an organization.
Another benefit of this architecture is that testing is simplified. A given application only needs to be tested against a contract: it reads one message and produces another. A Kafka topic acts as the boundary for an integration test.
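As a rough sketch of what such a contract test can look like, the example below uses plain Python lists as stand-ins for the input and output topics. The handler, the message fields, and the helper names are all hypothetical; in a real test the lists would be replaced by topics on a test broker.

```python
import json


def handle_booking(message: bytes) -> bytes:
    """Hypothetical application logic: consume one event, produce another."""
    booking = json.loads(message)
    receipt = {"booking_id": booking["id"], "status": "CONFIRMED"}
    return json.dumps(receipt).encode("utf-8")


def run_once(input_topic: list, output_topic: list, handler) -> None:
    """Drain the input 'topic' through the handler into the output 'topic'."""
    while input_topic:
        output_topic.append(handler(input_topic.pop(0)))


# In-memory lists stand in for Kafka topics in this sketch.
bookings = [json.dumps({"id": 42}).encode("utf-8")]
receipts = []
run_once(bookings, receipts, handle_booking)

# The contract: one well-formed message in, one well-formed message out.
assert json.loads(receipts[0]) == {"booking_id": 42, "status": "CONFIRMED"}
```

The test never needs to know which applications sit on the other side of either topic.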
What is most valuable?
Kafka, as compared with other messaging system options, is great for large scale message processing applications. It offers high throughput with built-in fault-tolerance and replication.
Messaging systems in general allow for logical and temporal decoupling between applications. Given Kafka's high availability, it's a great option to use if applications require availability, but not real-time processing.
If a downstream system is offline, messages can queue up and process when possible, but the user may not necessarily need to be aware of any issues.
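The following toy model illustrates that behavior. It is not Kafka client code: `Topic` and `Consumer` are hypothetical in-memory stand-ins for a single-partition log and a consumer that remembers its own offset.

```python
class Topic:
    """In-memory stand-in for a single-partition topic (an append-only log)."""

    def __init__(self):
        self.log = []

    def append(self, message):
        self.log.append(message)
        return len(self.log) - 1  # offset of the message just written


class Consumer:
    """Tracks its own offset so it can resume where it left off."""

    def __init__(self, topic):
        self.topic = topic
        self.offset = 0
        self.processed = []

    def poll(self):
        # Read everything between the stored offset and the end of the log.
        while self.offset < len(self.topic.log):
            self.processed.append(self.topic.log[self.offset])
            self.offset += 1


topic = Topic()
crm = Consumer(topic)

# The CRM consumer is offline, yet every write still succeeds for the user.
for customer in ["alice", "bob", "carol"]:
    topic.append(f"checked-in:{customer}")

backlog = len(topic.log) - crm.offset  # messages waiting for the CRM
crm.poll()                             # the CRM comes back and catches up
```

The producer's success depends only on the log accepting the write, not on the consumer being up.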
A messaging-based architecture becomes important as a set of micro-services needs to scale with high availability. Kafka is a great choice for messaging with such an architecture.
What needs improvement?
Kafka requires non-trivial expertise with DevOps to deploy in production at scale. The organization needs to understand ZooKeeper and Kafka and should consider using additional tools, such as MirrorMaker, so that the organization can survive an availability zone or a region going down.
Shifting availability concerns to Kafka means that it cannot go down. It's important to understand the partitioning model and replication needs before relying on it for critical business functions. I'd suggest using it with a feature toggle for a non-critical path in production and learning from failure before relying on it.
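One part of the partitioning model worth internalizing early is that messages with the same key land in the same partition, and ordering is only preserved within a partition. The sketch below illustrates the idea with a stand-in hash (`zlib.crc32`), which is an assumption for illustration and not the hash Kafka's default partitioner uses; the keys and event names are hypothetical.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in hash for illustration only; Kafka's default partitioner
    # uses a different hash function.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(events, num_partitions):
    """Append each (key, value) event to the partition chosen by its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in events:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


events = [
    ("cust-1", "created"),
    ("cust-2", "created"),
    ("cust-1", "paid"),
    ("cust-1", "boarded"),
]
partitions = route(events, num_partitions=3)

# Every event for one key sits in a single partition, in production order.
holding_cust1 = [p for p in partitions if any(k == "cust-1" for k, _ in p)]
cust1_events = [v for k, v in holding_cust1[0] if k == "cust-1"]
```

Because the mapping depends on the partition count, changing the number of partitions later can move a key to a different partition, which is one reason to settle partitioning before the topic carries critical traffic.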
While Kafka is built to scale, that does not mean that applications can start as many consumers or producers as they like without considering how the Kafka brokers will perform. Plan for scaling out brokers before publishing millions of messages.
What do I think about the stability of the solution?
Generally, there were no stability issues. However, there was one scare in production when a consumer rebalance took 30 minutes and messages were not being processed during that time.
What do I think about the scalability of the solution?
We have not yet had scalability issues!
How are customer service and technical support?
There are specialized consulting companies in this space and there are online resources to read. That may help companies get past hurdles.
Which solution did I use previously and why did I switch?
No, we did not use a previous messaging system.
How was the initial setup?
The setup was complex. One must consider setting up ZooKeeper, Kafka, and multi-zone/region availability, as well as the typical associated functions for running it all in production. These include monitoring, handling message schema changes (consider Avro), encrypting messages if that is a concern, and potentially authorization for different topics, depending on the sensitivity of the data.
If an organization uses Kafka as the first messaging system, then the approach for application design must also shift significantly.
What's my experience with pricing, setup cost, and licensing?
It is open source software.
Which other solutions did I evaluate?
The client evaluated alternatives before I arrived, but I was not there during the evaluation so I cannot comment.
What other advice do I have?
Consider using a managed Kafka service, such as the one from Heroku.
If messaging is not a central component of the business and vendor lock-in is less of a concern, consider using something like Amazon's Kinesis. This can more rapidly provide the benefits of a messaging service without the pain of understanding it deeply, setting it up, and managing it.
It's important to use a lean approach to understand how it will break in production:
Implement a non-critical transaction with it.
Perhaps use a feature toggle within a facade and implement the behavior with the old approach and with Kafka to reduce risk.
Add it to one or two applications and monitor how it goes.
Figure out security, monitoring, scaling, schema migration, etc., before using it as a critical component in an application.
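The feature-toggle facade mentioned above could look roughly like the sketch below. All class names, the topic name, and the `publish` callable are hypothetical; `publish` stands for whatever thin wrapper the team puts around its producer client. Falling back to the old path on a publish failure is one possible design choice, not something the approach requires.

```python
class DirectCrmNotifier:
    """Old approach: call the downstream system directly (hypothetical)."""

    def __init__(self):
        self.calls = []

    def notify(self, event):
        self.calls.append(event)


class KafkaCrmNotifier:
    """New approach: publish to a topic through an injected callable."""

    def __init__(self, publish, topic="crm-events"):
        self.publish = publish  # assumed wrapper around a producer client
        self.topic = topic

    def notify(self, event):
        self.publish(self.topic, event)


class CrmFacade:
    """Chooses a path per call based on a toggle; callers see one interface."""

    def __init__(self, legacy, kafka, use_kafka):
        self.legacy = legacy
        self.kafka = kafka
        self.use_kafka = use_kafka  # callable returning True or False

    def notify(self, event):
        if self.use_kafka():
            try:
                self.kafka.notify(event)
                return "kafka"
            except Exception:
                pass  # design choice in this sketch: fall back to the old path
        self.legacy.notify(event)
        return "legacy"


published = []
toggle = {"on": False}
legacy = DirectCrmNotifier()
kafka = KafkaCrmNotifier(lambda topic, event: published.append((topic, event)))
facade = CrmFacade(legacy, kafka, use_kafka=lambda: toggle["on"])

first = facade.notify({"customer": 1})   # toggle off: old path
toggle["on"] = True
second = facade.notify({"customer": 2})  # toggle on: Kafka path
```

Flipping the toggle off returns all traffic to the old path without a deploy, which keeps the cost of learning from failures low.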