What is our primary use case?
We just finished a central front project called MFY for our in-house fraud team. In this project, we are using Spark along with Cloudera. In front of Spark, we are using Couchbase.
Spark is mainly used for aggregations and AI (for future usage). It gathers stuff from Couchbase and does the calculations. We are not actively using Spark AI libraries at this time, but we are going to use them.
This project is for classifying the transactions and finding suspicious activities, especially those suspicious activities that come from internet channels such as internet banking and mobile banking. It tries to find out suspicious activities and executes rules that are being developed or written by our business team. An example of a rule is that if the transaction count or transaction amount is greater than 10 million Turkish Liras and the user device is new, then raise an exception. The system sends an SMS to the user, and the user can choose to continue or not continue with the transaction.
How has it helped my organization?
Aggregations are very fast in our project since we started to use Spark. We can tell results in around 300 milliseconds. Before using Spark, the time was around 700 milliseconds.
Before using Spark, we only used Couchbase. We needed fast results for this project because transactions come from various channels, and we need to decide and resolve them at the earliest because users are performing the transaction. If our result or process takes longer, users might stop or cancel their transactions, which means losing money. Therefore, fast results time is very important for us.
What is most valuable?
AI libraries are the most valuable. They provide extensibility and usability. Spark has a lot of connectors, which is a very important and useful feature for AI. You need to connect a lot of points for AI, and you have to get data from those systems. Connectors are very wide in Spark. With a Spark cluster, you can get fast results, especially for AI.
What needs improvement?
Stream processing needs to be developed more in Spark. I have used Flink previously. Flink is better than Spark at stream processing.
For how long have I used the solution?
I am a Java developer. I have been interested in Spark for around five years. We have been actively using it in our organization for almost a year.
What do I think about the stability of the solution?
It is the most stable platform. As compare to Flink, Spark is good, especially in terms of clusters and architecture. My colleagues who set up these clusters say that Spark is the easiest.
What do I think about the scalability of the solution?
It is scalable, but we don't have the need to scale it.
It is mainly used for reporting big data in our organization. All teams, especially the VR team, are using Spark for job execution and remote execution. I can say that 70% of users use Spark for reporting, calculations, and real-time operations. We are a very big company, and we have around a thousand people in IT.
We will continue its usage and develop more. We have kind of just started using it. We finished this project just three months ago. We are now trying to find out bottlenecks in our systems, and then we are ready to go.
How are customer service and technical support?
We have not used Apache support. We have only used Cloudera support for this project, and they helped us a lot during the development cycle of this project.
How was the initial setup?
I don't have any idea about it. We are a big company, and we have another group for setting up Spark.
What other advice do I have?
I would advise planning well before implementing this solution. In enterprise corporations like ours, there are a lot of policies. You should first find out your needs, and after that, you or your team should set it up based on your needs. If your needs change during development because of the business requirements, it will be very difficult.
If you are clear about your needs, it is easier to set it up. If you know how Spark is used in your project, you have to define firewall rules and cluster needs. When you set up Spark, it should be ready for people's usage, especially for remote job execution.
I would rate Apache Spark a nine out of ten.
Which deployment model are you using for this solution?