What is our primary use case?
As part of my interest in obtaining Amazon certification and learning more about Kinesis, I am currently using it to capture streaming Twitter data.
I get an avalanche of tweets and I need some technology to harness and capture them. I have used the streaming Twitter API to deal with it. Twitter is updated every half a second, so I'm tapping into the streaming API and capturing a lot of stuff.
It has also been used for the Internet of Things (IoT), where there is a lot of streaming stuff that comes out and you need a mechanism to capture all of it from your devices. This includes things such as logs. My company was recently working on a project with Kinesis where we were capturing data from racecars.
These racecars were emitting tons of data and it needed to be captured by some kind of tool for analytics. Kinesis was used to capture all of that information. In the streams, you can do some sort of interim transformations, but for the most part the basic use case is just capturing data and persisting it in a permanent data store such as Amazon S3 or the storage behind Elastic MapReduce. Once it lands in some kind of permanent store, further transformations or aggregations can be done at that point.
How has it helped my organization?
In the racecar project that we worked on, the client wanted to be able to capture metrics in real-time to allow for the adjustment of racing strategy.
What is most valuable?
The most valuable feature is that it has a pretty robust way of capturing things. You can capture data from the beginning of the stream, or start capturing tweets from a certain point in time.
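In the Kinesis Data Streams API, these two starting points correspond to the shard iterator types TRIM_HORIZON (the oldest record still retained) and AT_TIMESTAMP. Here is a minimal sketch of choosing between them, assuming the boto3 client; the helper name iterator_args and the stream and shard names are made up for illustration:

```python
from datetime import datetime
from typing import Optional


def iterator_args(stream: str, shard_id: str,
                  start_at: Optional[datetime] = None) -> dict:
    """Build keyword arguments for boto3's Kinesis get_shard_iterator call.

    With no start_at, read from the oldest retained record (TRIM_HORIZON);
    otherwise start from the position denoted by the given time.
    """
    args = {"StreamName": stream, "ShardId": shard_id}
    if start_at is None:
        args["ShardIteratorType"] = "TRIM_HORIZON"
    else:
        args["ShardIteratorType"] = "AT_TIMESTAMP"
        args["Timestamp"] = start_at
    return args


# Hypothetical usage (needs boto3 and AWS credentials, so it is left commented out):
#   import boto3
#   kinesis = boto3.client("kinesis")
#   reply = kinesis.get_shard_iterator(**iterator_args("tweets", "shardId-000000000000"))
#   iterator = reply["ShardIterator"]
```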
It has some good fault tolerance in case something breaks.
It's really easy to implement, get started, and use.
With AWS, you don't have to invest in any kind of infrastructure. All you have to do is go to the portal, create an account, turn it on, and use a few lines of Python code in order to capture what you're looking for.
The Kinesis API makes it really easy to put information on the shards. You just need to write a few lines of code.
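To give an idea of what those few lines can look like, here is a rough sketch rather than my actual project code. It assumes a client that behaves like boto3.client("kinesis"); the function name put_tweet, the stream name, and the user_id field are placeholders:

```python
import json


def put_tweet(client, stream: str, tweet: dict) -> str:
    """Send one tweet to a Kinesis stream and return its sequence number.

    `client` is expected to behave like boto3.client("kinesis").
    The partition key decides which shard receives the record; using the
    author's id here is an illustrative choice, not a recommendation.
    """
    response = client.put_record(
        StreamName=stream,
        Data=json.dumps(tweet).encode("utf-8"),
        PartitionKey=str(tweet["user_id"]),
    )
    return response["SequenceNumber"]


# Hypothetical usage (needs boto3 and AWS credentials):
#   import boto3
#   put_tweet(boto3.client("kinesis"), "tweets", {"user_id": 7, "text": "hello"})
```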
What needs improvement?
I'm currently trying to figure out production rates and consumption rates for data. If there were better documentation on optimal sharding strategies then it would be helpful.
What do I think about the stability of the solution?
I think that this product is very stable and very fault-tolerant.
As part of consuming data off of the stream, you get a unique sequence number with each record, and these numbers are roughly sequential. This means that if you have a problem with the data and something breaks, you can simply go back to that location in the stream.
Imagine that it gives you an integer, 100, to indicate your point in the stream. Then, if something fails, at a later point in time you can go back to spot 101 and continue retrieving data inside the stream. It's very fault-tolerant.
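The go-back-and-continue idea can be sketched as follows. This is an illustration under assumptions, not a production consumer: it expects a client that behaves like boto3.client("kinesis"), the function name read_batch is made up, and real sequence numbers are long numeric strings rather than small integers like 100:

```python
from typing import Callable, Optional


def read_batch(client, stream: str, shard_id: str,
               checkpoint: Optional[str],
               handle: Callable[[bytes], None]) -> Optional[str]:
    """Read one batch from a shard, resuming after `checkpoint` if given.

    Returns the sequence number of the last record handled, which the
    caller should store somewhere durable as the new checkpoint.
    """
    args = {"StreamName": stream, "ShardId": shard_id}
    if checkpoint is None:
        # No checkpoint yet: start from the oldest retained record.
        args["ShardIteratorType"] = "TRIM_HORIZON"
    else:
        # Resume with the record right after the last one processed.
        args["ShardIteratorType"] = "AFTER_SEQUENCE_NUMBER"
        args["StartingSequenceNumber"] = checkpoint
    iterator = client.get_shard_iterator(**args)["ShardIterator"]

    batch = client.get_records(ShardIterator=iterator, Limit=100)
    for record in batch["Records"]:
        handle(record["Data"])
        # Advance the checkpoint only after the record was handled.
        checkpoint = record["SequenceNumber"]
    return checkpoint
```

If handle raises partway through a batch, the checkpoint you last stored still points at the last record that succeeded, so the next run picks up from there.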
What do I think about the scalability of the solution?
The product is very scalable. Being on the cloud is a large advantage here.
How are customer service and technical support?
I haven't needed to contact technical or customer support.
Which solution did I use previously and why did I switch?
I am familiar with Kafka, although I have never used it.
Compared to Kafka, which requires you to set up and run your own servers, Kinesis, being on the cloud, is very easy to implement. It is a little easier to use, as well. Anybody who is interested in using it does not have to invest any money in a server or invest time in setting things up and configuring them in an actual environment, as they would with Kafka. All they have to do is go to AWS and turn it on.
I don't have any experience with other streaming analytics solutions.
How was the initial setup?
If someone knows what they're doing, they can have something up and running in half an hour. You can certainly use a deployment strategy, although I haven't to this point. I've just done it on my desktop, locally, in an IDE called PyCharm.
One can go ahead and deploy to an Amazon EC2 instance or AWS Elastic Beanstalk. I chose not to do this because running it locally is easier for my project.
What about the implementation team?
I think as far as maintenance is concerned, you just kind of have to watch the production and the consumption of your data. You just have to make sure that everything's in order. They have metrics on the AWS console to help keep an eye on that kind of stuff but once it's up and running, you really don't have to do a whole lot of maintenance.
What other advice do I have?
My advice for anybody who is implementing this product is to start by reading through the Amazon documentation, as well as going through some videos on YouTube or Pluralsight just to get a high-level idea of what's going on. Then, start experimenting and trying to figure out how it works. From there, try to figure out your optimal sharding strategy, such as how many shards you need within the stream and how you want to partition the data within it.
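To get a feel for how partitioning spreads data, it helps to know that Kinesis hashes each partition key with MD5 into a 128-bit number and that each shard owns a range of those numbers. The sketch below assumes the ranges are split evenly across the shards, which is a simplification of how a real stream may be laid out; the function name shard_for_key and the car-N keys are made up:

```python
import hashlib


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard a partition key would land on.

    Assumes the 128-bit hash key space is divided evenly across
    shard_count shards (a simplification for illustration).
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    range_size = 2 ** 128 // shard_count
    # The last shard absorbs any remainder of the key space.
    return min(hash_key // range_size, shard_count - 1)


# Count how many of 1,000 hypothetical car ids fall on each of 4 shards.
counts = [0, 0, 0, 0]
for i in range(1000):
    counts[shard_for_key(f"car-{i}", 4)] += 1
```

If one partition key carries most of the traffic, it all lands on a single shard no matter how many shards you add, so the choice of key matters as much as the shard count.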
I think from there, you need to look at your production and consumption rates on the stream. This is how much data you are putting onto the stream and at what kind of rate. You need to make sure that you're consuming data off of the stream, also, and look at that rate too.
The ideal situation is to be able to consume data faster than you produce it because then you're able to control things. If you're not able to do that, then you could get overwhelmed.
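As a back-of-the-envelope way to turn those rates into a shard count, here is a rough sketch. It assumes the per-shard limits of 1 MB or 1,000 records per second for writes and 2 MB per second for reads, shared by all consumers; check the current AWS quotas before relying on these numbers. The function name shards_needed is made up:

```python
import math

# Per-shard limits assumed for this estimate; verify against current AWS quotas.
WRITE_MB_PER_SEC = 1.0
WRITE_RECORDS_PER_SEC = 1000
READ_MB_PER_SEC = 2.0


def shards_needed(in_mb_per_sec: float, in_records_per_sec: float,
                  consumers: int = 1) -> int:
    """Estimate the minimum shard count for a given production rate.

    Each consumer application reads everything that is written, so the
    read side needs in_mb_per_sec * consumers of shared read throughput.
    """
    by_write_bytes = in_mb_per_sec / WRITE_MB_PER_SEC
    by_write_count = in_records_per_sec / WRITE_RECORDS_PER_SEC
    by_read_bytes = in_mb_per_sec * consumers / READ_MB_PER_SEC
    return max(1, math.ceil(max(by_write_bytes, by_write_count, by_read_bytes)))
```

For example, producing 2 MB per second with three separate consumer applications is limited by the read side, so the estimate comes out higher than the write rate alone would suggest.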
The biggest lesson that I learned from using this product is that it's a whole new world of processing big data. I come from a traditional data warehousing background where everything is batch-oriented, so this is a whole new ball game in terms of how to process data. It's a new mechanism for harnessing the power of data. A traditional data warehouse could not analyze, for example, what is going on in real-time on a racing car. It's not scalable and it's not going to work. However, something like this is dynamic and big enough to handle this kind of application.
This is a pretty good product, although I don't have much to compare it with. That said, I don't have any problems with it. It's done what I've asked of it and it's easy to use.
I would rate this solution a nine out of ten.