What is our primary use case?
I work as a senior software engineer at an eCommerce analytics company, where we have to process a huge amount of data.
Only a few people within our organization use Kinesis. My team, which includes three backend developers, simply wanted to test out different approaches.
We are now in the middle of migrating our existing MySQL and Postgres databases to Snowflake. We use Kinesis Firehose to ingest data into Snowflake at the same time that we ingest it into MySQL, without affecting performance.
If you write to two databases synchronously, performance is very slow. We wanted to avoid that, so we came up with this solution of ingesting the data through the stream.
We use Kinesis Firehose to send the data to the stream, which buffers it for roughly two minutes. It then places the files in an S3 bucket, from which they are loaded automatically via a Snowflake integration called Snowpipe. Snowpipe reads and ingests every message in every file that lands in the S3 bucket. This stage doesn't bother us because we don't need to wait for it. We just stream the data: fire and forget. Sometimes, if a record is not ingested successfully, we have to retry. Apart from that, it's great because we don't need to wait and the performance is great.
There are some caveats, but overall the performance and the reliability have been great. This year, 100% of the time when there was an issue in production, it was due to a bug in our code rather than a bug in Kinesis.
How has it helped my organization?
We save a lot of time with Kinesis, but it's difficult to measure just how much. We actually have something similar for some other processes. Elsewhere, we developed a tool that reads the contents of a stream, places them into a file, manually uploads the file to S3, and copies it into Snowflake. That could have been done with Kinesis, and it would have taken two weeks to a month less to get it production-ready.
What is most valuable?
The first is a feature of the AWS SDK's asynchronous client: the PutRecordBatch function, which lets you put a list of records in a single request. That saves time and is more efficient. Also, with the asynchronous client, the records are sent in the background using an internal thread pool that can be configured for your needs. In our performance testing, we found this setup to be the fastest solution. It had no impact on the performance of the rest of the system.
The second is the ability to link the stream to destinations other than S3 through the stream's configuration, without changing a line of code.
Lastly, you can also link a Lambda function to the stream to transform the data as it arrives, before it is written to S3. This is great for performing aggregations or enriching the data from other data sources.
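To make the batching idea concrete, here is a minimal sketch in Python. It is not our production code: it assumes a boto3-style Firehose client whose `put_record_batch` method is called from a standard-library thread pool, as a stand-in for the SDK's asynchronous client. The helper names `chunked` and `send_in_background` are hypothetical. The 500-record cap is the documented per-call limit for PutRecordBatch.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Iterator, List

MAX_BATCH = 500  # PutRecordBatch accepts at most 500 records per call


def chunked(records: List[bytes], size: int = MAX_BATCH) -> Iterator[List[bytes]]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def send_in_background(client, stream_name: str, records: List[bytes],
                       pool: ThreadPoolExecutor) -> List[Future]:
    """Submit one put_record_batch call per chunk and return the futures.

    `client` is assumed to expose put_record_batch(DeliveryStreamName=...,
    Records=[{"Data": bytes}, ...]) the way a boto3 Firehose client does.
    The caller does not block; it can inspect the futures later if needed.
    """
    return [
        pool.submit(
            client.put_record_batch,
            DeliveryStreamName=stream_name,
            Records=[{"Data": data} for data in chunk],
        )
        for chunk in chunked(records)
    ]
```

With boto3 installed, `client` would typically be `boto3.client("firehose")`, and the size of the pool (`ThreadPoolExecutor(max_workers=...)`) is the knob to tune for your load.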
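As an illustration of such a transformation, here is a minimal sketch of a Lambda handler in Python. It assumes each incoming record is a JSON object; the added `source` field is a made-up enrichment, not something we actually do. The event and response shapes (`records`, `recordId`, base64-encoded `data`, and a `result` of `Ok` or `ProcessingFailed`) follow the Firehose data-transformation contract.

```python
import base64
import json


def lambda_handler(event, context):
    """Enrich each JSON record that Firehose hands to the function."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            payload["source"] = "firehose"  # hypothetical enrichment field
            # Newline-delimit the records so the S3 file has one JSON object per line.
            data = (json.dumps(payload) + "\n").encode("utf-8")
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(data).decode("ascii"),
            })
        except (ValueError, TypeError):
            # Not valid base64/JSON, or not a JSON object: hand it back
            # unchanged and flag it as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Every input record has to appear in the response with the same `recordId`, which is why the failure branch returns the original data rather than skipping the record.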
What needs improvement?
The default limit, which at the moment is 5,000 records per second, seems too low (I'm talking about Kinesis Firehose, which is a specialized form of the Amazon Kinesis service). In fact, in the first week after we deployed it into production, we had to roll it back and ask Amazon to increase the default limits.
The limit is mentioned in the documentation, but I think the default settings are far too low. That first week, it was extremely slow because records were not being ingested into the stream properly, so we had to retry them. After we talked with Amazon, they increased our throttling limit to 10,000 records per second. Now it works fine.
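For readers who hit the same throttling, here is a minimal sketch of retrying only the rejected records with exponential backoff. It assumes a boto3-style Firehose client; the function name `put_with_retry` and the delay values are hypothetical, not what we run in production. It relies on the PutRecordBatch response reporting `FailedPutCount` and listing per-record results in request order, with an `ErrorCode` on the entries that failed.

```python
import time
from typing import Callable, List


def put_with_retry(client, stream_name: str, records: List[bytes],
                   max_attempts: int = 5, base_delay: float = 0.1,
                   sleep: Callable[[float], None] = time.sleep) -> List[bytes]:
    """Send one batch (up to 500 records) and resend only what was rejected.

    Returns the records still unsent after max_attempts (empty on success).
    """
    pending = list(records)
    for attempt in range(max_attempts):
        response = client.put_record_batch(
            DeliveryStreamName=stream_name,
            Records=[{"Data": data} for data in pending],
        )
        if response["FailedPutCount"] == 0:
            return []
        # Results come back in request order; failed entries carry an ErrorCode.
        pending = [
            data
            for data, result in zip(pending, response["RequestResponses"])
            if "ErrorCode" in result
        ]
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # back off: 0.1s, 0.2s, 0.4s, ...
    return pending
```

The `sleep` parameter is injectable only so the backoff can be exercised in tests without waiting.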
For how long have I used the solution?
We've been using this solution since September 2019.
What do I think about the stability of the solution?
The stability is great. I'd say that maybe we have it running 99% of the time, and nothing stops it.
What do I think about the scalability of the solution?
Amazon Kinesis is definitely scalable. We have huge spikes of data that get processed around midnight and Kinesis handles it fine.
It automatically scales up and down. We don't need to provision compute for that. It's great.
How are customer service and technical support?
The only time that we needed to contact Amazon was to ask them to increase the throttling limit. They replied to us very quickly and did what we asked.
Which solution did I use previously and why did I switch?
Initially, we were evaluating Kafka. I think Kafka is faster, but it's less reliable in terms of maintenance; however, when Kafka works, and you have it properly configured, it's much better than Kinesis, to be honest.
On the other hand, Kinesis is easier for us to maintain. Our DevOps team is already oversaturated, so we didn't want to increase the maintenance cost of the production environment. That's why we decided to go with Kinesis: it performs well, and it's easy to configure and maintain.
How was the initial setup?
I found this solution to be really easy to configure. The essential parts of the configuration are naming the stream and setting the buffering time, that is, how long a record stays in the stream before it's written to S3. You also need to link the Amazon S3 bucket to the Amazon Kinesis stream. After you've completed these steps, it's pretty much production-ready. It's very, very easy. That's a huge advantage of using this service.
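For anyone who prefers to script the setup, the same settings can be sketched in Python as the arguments to the Firehose `create_delivery_stream` call. This is an illustration under assumptions: the helper name `delivery_stream_config` is hypothetical, the 120-second interval mirrors our roughly two-minute buffer, the 5 MB size is just a placeholder, and the IAM role that lets Firehose write to the bucket is something you have to supply.

```python
def delivery_stream_config(name: str, bucket_arn: str, role_arn: str,
                           buffer_seconds: int = 120, buffer_mb: int = 5) -> dict:
    """Build keyword arguments for a Firehose create_delivery_stream call."""
    return {
        "DeliveryStreamName": name,
        "DeliveryStreamType": "DirectPut",  # producers write straight to the stream
        "ExtendedS3DestinationConfiguration": {
            "RoleARN": role_arn,      # role Firehose assumes to write to S3
            "BucketARN": bucket_arn,  # destination bucket
            "BufferingHints": {
                # Firehose flushes to S3 when either threshold is reached.
                "IntervalInSeconds": buffer_seconds,
                "SizeInMBs": buffer_mb,
            },
        },
    }
```

With boto3 installed, the result would be passed as `boto3.client("firehose").create_delivery_stream(**config)`. The stream name and ARNs you pass in are your own.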
What about the implementation team?
Deployment took a few minutes.
You don't need a deployment plan or an implementation strategy because once you configure it, you can just use a stream. It's not an obligatory version that needs a library, etc. This stream is completely abstract in that way. You only need to configure it once, that's it.
What was our ROI?
We have seen a return on our investment with Amazon Kinesis. We are able to process data without any issue. It's our solution for ingesting data in other databases, such as Snowflake.
Which other solutions did I evaluate?
We considered developing the stream process manually or using Kafka.
What other advice do I have?
If you want to use a streaming solution, you need to evaluate your needs. If raw performance is your main requirement, maybe you should go with Kafka, but for near-real-time processing, I would recommend Amazon Kinesis.
If you need more than one destination for the data that you are ingesting in the stream, you will need to use Amazon Kinesis Data Streams rather than Firehose. If you only want to integrate from one point to another, then Kinesis Firehose is a considerably cheaper option and is much easier to configure.
From using Kinesis, I have learned a lot about the asynchronous way of processing data. We always had a more sequential way of doing things.
On a scale from one to ten, I would give this solution a rating of eight.
Which deployment model are you using for this solution?
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Amazon Web Services (AWS)