What is most valuable?
Scalability: The ability to load a huge number of datasets (I have worked with petabytes of data) and process them. Storage is not limited; we can increase it as much as we need.
Performance: Redshift's distributed architecture processes the workload in parallel across the compute nodes and coordinates the results in the leader node, which makes processing much faster.
Flexibility: Users can increase the node size and change the configuration depending on their needs. There is no need to wait for hardware to be put in place whenever the dataset grows. Redshift provides the option to increase the node or cluster size whenever required.
Multi-format accessibility: The Redshift engine can read the following data formats: CSV, DELIMITER, FIXEDWIDTH, AVRO, and JSON. It can also read files compressed with BZIP2, GZIP, or LZOP. Users can choose whichever best fits their requirements.
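To make the format options concrete, here is a minimal sketch of how a load command might be assembled. The helper name build_copy, the table name, the bucket path, and the IAM role are all hypothetical, and plain string formatting like this is for illustration only (it does no escaping).

```python
# Hypothetical helper: assembles a Redshift COPY statement as a string.
# Table, bucket, and role values below are placeholders, not real resources.
DATA_FORMATS = {"CSV", "DELIMITER", "FIXEDWIDTH", "AVRO", "JSON"}
COMPRESSIONS = {"BZIP2", "GZIP", "LZOP"}


def build_copy(table, s3_path, iam_role, data_format="CSV",
               format_arg=None, compression=None):
    """Return a COPY statement for the given data format and compression."""
    if data_format not in DATA_FORMATS:
        raise ValueError("unsupported data format: %s" % data_format)
    if compression is not None and compression not in COMPRESSIONS:
        raise ValueError("unsupported compression: %s" % compression)

    parts = [
        "COPY %s" % table,
        "FROM '%s'" % s3_path,
        "IAM_ROLE '%s'" % iam_role,
    ]
    if data_format in ("DELIMITER", "FIXEDWIDTH"):
        # These two take a required argument: a delimiter character
        # or a fixed-width column specification.
        if format_arg is None:
            raise ValueError("%s requires format_arg" % data_format)
        parts.append("%s '%s'" % (data_format, format_arg))
    else:
        clause = "FORMAT AS %s" % data_format
        if format_arg is not None:
            clause += " '%s'" % format_arg  # e.g. 'auto' for JSON
        parts.append(clause)
    if compression is not None:
        parts.append(compression)
    return " ".join(parts) + ";"


if __name__ == "__main__":
    print(build_copy(
        "events",
        "s3://example-bucket/events/",
        "arn:aws:iam::123456789012:role/ExampleRole",
        data_format="CSV",
        compression="GZIP",
    ))
```

The resulting string would be run through whatever SQL client is already in use; which format and compression combinations are accepted should be checked against the COPY documentation.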
VPC configuration: VPC configuration secures the datasets that we keep inside the Redshift cluster. With it, the firewall does not allow any third-party inbound or outbound traffic.
Python UDF calls: This lets users write their own user-defined functions in Python, register them in Redshift, and use them to process the dataset.
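As an illustration of the Python UDF feature, the sketch below shows a small scoring function that can be tested locally, followed by the kind of CREATE FUNCTION statement that would register the same logic in Redshift. The function name f_normalize_score and its arguments are made up for this example; they are not from our actual project.

```python
def normalize_score(x, lo, hi):
    """Scale x into the range [0, 1], clipping values outside [lo, hi].

    Returns None for missing inputs or an empty range, mirroring how a
    SQL function would typically propagate NULL.
    """
    if x is None or lo is None or hi is None or hi <= lo:
        return None
    clipped = max(lo, min(hi, x))
    return (clipped - lo) / float(hi - lo)


# Hypothetical registration statement: the same body wrapped in Redshift's
# scalar Python UDF syntax (LANGUAGE plpythonu). Run it with any SQL client.
CREATE_UDF_SQL = """
CREATE OR REPLACE FUNCTION f_normalize_score (x float, lo float, hi float)
RETURNS float
IMMUTABLE
AS $$
    if x is None or lo is None or hi is None or hi <= lo:
        return None
    clipped = max(lo, min(hi, x))
    return (clipped - lo) / float(hi - lo)
$$ LANGUAGE plpythonu;
"""

if __name__ == "__main__":
    print(normalize_score(5.0, 0.0, 10.0))
    print(CREATE_UDF_SQL)
```

Once registered, such a function could be called like any built-in, for example SELECT f_normalize_score(raw_score, 0, 100) FROM some_table, where the column and table names are again placeholders.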
How has it helped my organization?
We were using MySQL & MongoDB for our regular operations, but when we grew, we were forced to handle a huge number of datasets. It could be petabytes of data in and out on a regular basis. We struggled a lot to complete the operations in a timely manner. With Amazon Redshift, we gained a lot in terms of timing, as well as project completion.
Some of our scoring mechanisms work really well on the distributed architecture of Amazon Redshift.
What needs improvement?
Of course, every product has pluses and minuses. From that perspective, Amazon Redshift has some issues with snapshot restoring when we handle huge datasets. When our snapshot size is really huge, like 20 TB+, we are forced to wait a long time to get it restored. This is reasonable, as they need to transfer the entire dataset to the cluster.
My thought on this issue is that Amazon has their own data centers and they are connecting each region of storage through Direct Connect, so the input and output network data transfer should not be a complex thing. For example, over a 10 Gbps link, they could in theory transfer 1 TB in roughly 13 minutes, but that’s not happening now. Restoring 1 TB of data takes 30-40 minutes or more.
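A back-of-the-envelope check of the ideal transfer time: 1 TB is 8,000 gigabits, so a fully used 10 Gbps link needs about 800 seconds, or roughly 13 minutes, which is still well below the 30-40 minutes observed. The sketch assumes decimal units (1 TB = 10^12 bytes) and ignores protocol overhead and any processing during the restore.

```python
def transfer_minutes(size_tb, link_gbps):
    """Ideal time in minutes to move size_tb terabytes over a link_gbps link.

    Assumes decimal units and a link used at full line rate with no overhead.
    """
    bits = size_tb * 1e12 * 8            # terabytes -> bits
    seconds = bits / (link_gbps * 1e9)   # gigabits per second -> bits per second
    return seconds / 60.0


if __name__ == "__main__":
    for size in (1, 20):
        print("%d TB at 10 Gbps: %.1f minutes" % (size, transfer_minutes(size, 10)))
```

By the same calculation, a 20 TB snapshot would need about 4.4 hours of pure transfer time at 10 Gbps.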
For how long have I used the solution?
I have used it for the last 3.5 years.
I am using Amazon Redshift for big data mapping and data aggregation.
We use most of Amazon's products. Specifically, we use their dedicated data-centre connection service (Direct Connect). We have been using Amazon products such as Amazon EC2, S3, SQS, EMR, ML, CloudWatch, Redshift, DynamoDB, etc., for more than 10-12 years.
What do I think about the stability of the solution?
I have encountered stability issues. A few weeks ago, I ran into a hardware failure and a database health status failure. When we face these kinds of issues, we can't do anything from our side until the Amazon technical team finds the problem and rectifies it, and that takes time. If we are in a rush to deliver something for a client and hit one of these issues, we are in serious trouble.
What do I think about the scalability of the solution?
Of course. As the amount of data that we handle in the cluster grows, we need to increase the cluster or node size. Apparently, a larger node or cluster size increases the hold time for synchronizing the data (metadata) with the node manager, so it takes longer to start the cluster.
How are customer service and technical support?
Customer service: Customer service is good, but many times I could not reach them with a direct call. I could reach them through their web UI rather than by calling.
Technical support: Technical support is really great, but it’s paid support. The Basic Support plan doesn't include technical support; it only provides billing support.
Which solution did I use previously and why did I switch?
I have experience working in Hadoop as well. When I compare the two (Redshift & Hadoop), Redshift is more user friendly in terms of configuration and maintenance.
How was the initial setup?
The initial setup of Amazon Redshift is so simple and straightforward. We do not need to read or understand any of the technical documentation. Simply said, it’s a plug-and-play service or platform.
What about the implementation team?
We implemented it in-house.
What was our ROI?
In terms of ROI, I can't give a direct figure, because we are not using only Redshift. We use multiple products to increase our revenue and reduce processing time, so it's difficult to calculate the ROI of Redshift alone.
What's my experience with pricing, setup cost, and licensing?
Pricing and licensing are very important. In terms of pricing, it's a bit high, given that they are using standard hardware. My advice to users is: start the cluster only when you require it. At the end of the workday, snapshot the clusters and shut them down, then restore those snapshots when you need them back. That way, you are charged only for usage rather than for idle time.
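The snapshot-and-shut-down routine described above could be scripted roughly as follows. This is only a sketch: the function names and identifiers are made up, and it assumes a boto3 Redshift client is passed in (created elsewhere with boto3.client("redshift")). Note that delete_cluster really removes the cluster, so the final snapshot is the only copy until it is restored.

```python
def pause_cluster_with_snapshot(client, cluster_id, snapshot_id):
    """Shut the cluster down at the end of the day, keeping a final snapshot.

    `client` is assumed to be a boto3 Redshift client.
    """
    return client.delete_cluster(
        ClusterIdentifier=cluster_id,
        SkipFinalClusterSnapshot=False,              # take a snapshot first
        FinalClusterSnapshotIdentifier=snapshot_id,
    )


def resume_cluster_from_snapshot(client, cluster_id, snapshot_id):
    """Bring the cluster back from the snapshot when it is needed again."""
    return client.restore_from_cluster_snapshot(
        ClusterIdentifier=cluster_id,
        SnapshotIdentifier=snapshot_id,
    )


# Example usage (placeholder identifiers, requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("redshift")
#   pause_cluster_with_snapshot(client, "example-cluster", "example-cluster-eod")
#   ...next morning...
#   resume_cluster_from_snapshot(client, "example-cluster", "example-cluster-eod")
```

Keep in mind the restore time mentioned earlier: with very large snapshots, the wait before the cluster is usable again has to be weighed against the savings.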
Which other solutions did I evaluate?
I evaluated Hadoop and Spark, along with Redshift. I have no negative comments about those other products. Redshift is flexible in terms of configuration, maintenance and security, especially VPC configuration, which secures our data a lot.
What other advice do I have?
Use this product for large-scale data mapping or aggregation. Run Redshift inside a VPC to keep your data secure over the long term.