Apache Kafka is a distributed commit log, which sets it apart from most messaging and queuing systems that came before it. The best feature, in my view, is the ability to write data at one velocity and have subscribing consumers read it at their own, different velocities.
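To illustrate the idea, here is a minimal in-memory sketch of a commit log whose readers each keep their own offset. It is not the Kafka client API; the `CommitLog` and `Consumer` classes and their batch sizes are hypothetical names made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class CommitLog:
    """Append-only log; reading a record never removes it."""
    records: list = field(default_factory=list)

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset of the record just written

    def read(self, offset, max_records):
        return self.records[offset:offset + max_records]


@dataclass
class Consumer:
    """Each consumer tracks its own position, independent of the others."""
    name: str
    batch_size: int
    offset: int = 0

    def poll(self, log):
        batch = log.read(self.offset, self.batch_size)
        self.offset += len(batch)
        return batch


log = CommitLog()
for i in range(10):
    log.append(f"event-{i}")  # the producer writes at its own pace

fast = Consumer("fast", batch_size=5)
slow = Consumer("slow", batch_size=2)

fast_batch = fast.poll(log)  # reads five records
slow_batch = slow.poll(log)  # reads two records from the same log
```

Because the log keeps every record and each reader only advances its own offset, a slow consumer never holds back a fast one, and neither of them affects the writer.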
Improvements to My Organization:
Kafka has a guaranteed delivery mechanism that is very easy to set up. Even on minimal starting hardware, it can handle very large data volumes. When prototyping and building a proof of concept, Kafka helped shorten the timeline from prototype all the way to production volumes.
Room for Improvement:
The GUI tools for monitoring and support are still very basic. There is also no help in determining a partition (shard) key that will perform well.
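In the absence of tooling, one rough way to compare candidate keys is to hash a sample of them and see how evenly they spread across partitions. The sketch below does that under stated assumptions: it uses CRC32 as a stand-in hash (Kafka's Java client hashes keys with murmur2 by default), and the sample keys, the partition count of 6, and the helper names are all hypothetical.

```python
import zlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in hash for illustration only; not Kafka's actual partitioner.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_counts(keys, num_partitions):
    """Number of messages landing on each partition."""
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def skew(counts):
    """Busiest partition divided by the mean; 1.0 means perfectly even."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean if mean else 0.0


# Hypothetical sample: 10,000 messages keyed two different ways.
by_user = [f"user-{i % 5000}" for i in range(10_000)]
by_region = [("eu", "us", "apac")[i % 3] for i in range(10_000)]

user_counts = partition_counts(by_user, 6)
region_counts = partition_counts(by_region, 6)

print("user key skew:  ", round(skew(user_counts), 2))
print("region key skew:", round(skew(region_counts), 2))
```

A key with only three distinct values can occupy at most three of the six partitions, so the region key leaves at least half of them idle. A high-cardinality key such as a user ID spreads much more evenly.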
Use of Solution:
I have been using Kafka for three years.
We did not have any issues with stability.
We did not have any issues with scalability.
- Kafka is open source from LinkedIn and support comes from the community of users.
- You can go with Confluent, the company that was founded by the original engineers from LinkedIn.
- You can go with a cloud hosting service, like AWS EMR or Azure HDInsight.
Previously, we used traditional message queues and file semaphores. Putting asynchronous messages into order and making sure nothing got dropped carried a lot of overhead, and it required a lot of code and maintenance.
Since it is open source, you are on your own for setup. However, the tutorials from the Apache foundation and online sources have been an immense help.
Getting started is very easy. Handling very large volumes of data and choosing an appropriate partitioning (sharding) scheme, however, is difficult, and there are fewer resources for tuning and best practices.
Cost and Licensing Advice:
When starting to look at a distributed message system, look for a cloud solution first. It is an easier entry point than an on-premises hardware solution. A lot of the complexity has already been taken care of. Both AWS and Azure have supported Kafka clusters that can be provisioned very easily.
Other Solutions Considered:
We looked at RabbitMQ and Spark Streaming.
Be sure to define the use cases as well as possible up front.
Kafka is very good, but it is complex to support. It can be configured to handle large messages (the broker's default limit is about 1 MB, but it can be raised), whereas native cloud options have fixed size limitations.
Be sure to understand what messages will be sent and how many discrete topics will be needed.
Be aware that you must code both producers and consumers.
The bulk of the work is with the consumer.
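To give a sense of why the consumer side takes the most work, here is a minimal in-memory sketch of at-least-once processing: the offset is committed only after a batch has been handled, so a crash causes a re-read, and the consumer has to tolerate duplicates. This is not the Kafka client API; `Partition`, `consume`, `flaky_handler`, and the offset-based deduplication are hypothetical stand-ins for illustration.

```python
class Partition:
    """In-memory stand-in for one topic partition with a committed offset."""

    def __init__(self, records):
        self.records = list(records)
        self.committed = 0  # next offset to read after a restart

    def fetch(self, max_records):
        start = self.committed
        return list(enumerate(self.records[start:start + max_records], start))

    def commit(self, next_offset):
        self.committed = next_offset


def consume(partition, handler, seen, batch_size=3):
    """Process one batch; commit only if every record was handled."""
    batch = partition.fetch(batch_size)
    for offset, record in batch:
        if offset in seen:   # handled before an earlier crash: skip it
            continue
        handler(record)      # may raise, in which case nothing is committed
        seen.add(offset)
    if batch:
        partition.commit(batch[-1][0] + 1)
    return len(batch)


processed = []
failures = {"remaining": 1}


def flaky_handler(record):
    # Hypothetical handler that fails once on "b" to simulate a crash.
    if record == "b" and failures["remaining"]:
        failures["remaining"] -= 1
        raise RuntimeError("simulated crash")
    processed.append(record)


partition = Partition(["a", "b", "c", "d"])
seen = set()
try:
    consume(partition, flaky_handler, seen)
except RuntimeError:
    pass  # the offset was not committed, so the batch is fetched again

while consume(partition, flaky_handler, seen):
    pass  # keep polling until the partition is drained
```

After the simulated crash the first batch is delivered a second time, and the `seen` set keeps record "a" from being processed twice. In a real system that deduplication would more likely rely on a business key or an idempotent write than on raw offsets.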
The Apache distribution of Kafka is purely open source. There are essentially no tools other than command-line options for monitoring broker and topic health. Third-party tools, some free and some paid, will help with that, but they require you to install agents on the servers hosting Kafka and to open up JMX ports (for MBeans) in the scripts that start the Kafka services. Additionally, you have to monitor ZooKeeper, which is very memory intensive. Cloud offerings that provide the whole modern data architecture stack, like AWS EMR and Azure HDInsight, as well as Hortonworks and Cloudera, provide a console GUI as part of each of their offerings. Confluent, a company founded by the LinkedIn engineers who designed Kafka, also has a paid enterprise offering with much better tools for maintaining the Kafka cluster. But with Apache Kafka and the community alone, you are on your own.
Disclosure: I am a real user, and this review is based on my own experience and opinions.