If you were talking to someone whose organization is considering RapidMiner, what would you say?
How would you rate it and why? Any other tips or advice?
I have worked with RapidMiner, but I have not yet explored all of the functionality of the software, for example its integration with big data and with the cloud. I have used the utilities quite a bit. In the latest version, they added automated cleaning for data preparation, which is very interesting. I am a computer scientist and I received my Ph.D. 23 years ago. I am a researcher, and when I have a problem, I use RapidMiner to research it and to find solutions to much more difficult problems. I would rate this solution an eight out of ten.
On a scale from one to ten where one is the worst and ten is the best, I would rate RapidMiner as around a seven. I chose seven because of the UI issues and other parts of the product that could be improved. RapidMiner is more of an enterprise product. Here, in this region, most people like a packaged solution like Alteryx, which covers more. Alteryx is also more attractive to many users because it is cheaper and its user interface is easier to use. With Alteryx or Tableau, for example, you can just pick up data sources and then start EDL (enterprise data lake). It takes more effort to bring the data onto the data mart for RapidMiner and other enterprise products in the data mining category. These enterprise solutions have an additional level of complexity and flexibility, but not everyone needs it.
Using RapidMiner is a two-stage process. At first, it's something simple whereby you can get quick results. For example, with a few clicks you can get the means and standard deviations of the numerical variables, and a bar chart and frequency count of all of the categorical variables. I would suggest that you get somebody to do that, just to get used to it, but then stop them and make sure that they do a course on machine learning. Otherwise, they may miss something important like cleaning up the data. For example, I did one project many years ago whereby I was asked by the Department of Education in the UK to look at survey data on primary school children. The survey had been done by a market research company and the department was a bit uneasy about the results. They didn't know what was wrong with the results, but they felt they weren't right, so they asked me to look at them. The first thing I did was to take a simple look at the values of all the variables, and it became clear straight away that on the bar chart of one variable, the right-hand end shot up. This was the value 99, which was clearly a code for a missing value. In SPSS, which is what they used to finalize the analysis, it had not been declared as a missing value, so the software found children whose age was apparently 99 and treated them as being age 99. I found that out very easily by working out the arithmetic mean age of these primary school children, who should be under the age of 10: their average age was 34.4. That came up merely because they hadn't specified a missing value. Now that's a very simple example, but it's the sort of thing that can go wrong when people just use a package and don't know what the underlying assumptions are. Or people produce a linear regression when the relationship is nowhere near linear. I recently refused to referee an article from China, for example, because the authors did linear regression on data that clearly were not linear; they were exponential in form. All this to say that this is a two-stage process. You can get started very quickly, but you must then make sure that your staff is properly trained not to make these kinds of mistakes. The initial learning curve is very shallow, but when you want to go on and do really advanced things, it takes more time. Companies know this, so they try to find cheap solutions such as employing sociology graduates to use the software. They don't understand the issues the same way a computer science or mathematics graduate would. With respect to functionality, at the moment it has more features than I need and can handle. I would rate this solution as nine out of ten.
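The missing-value problem described in this review can be illustrated with a minimal sketch. The ages below are hypothetical, not the survey's real data, and the names `MISSING_CODE` and `mean_age` are made up for illustration. It only shows how an undeclared sentinel such as 99 inflates an arithmetic mean.

```python
from statistics import mean

# Assumption: 99 is the sentinel the survey used for "age not recorded".
MISSING_CODE = 99

# Hypothetical ages of primary school children (illustrative only).
ages = [5, 6, 7, 8, 9, 10, 99, 99, 7, 6]


def mean_age(values, missing_code=None):
    """Return the arithmetic mean, skipping values equal to missing_code."""
    kept = [v for v in values if v != missing_code]
    return mean(kept)


# Sentinel treated as a real age: the mean is inflated.
naive = mean_age(ages)

# Sentinel declared as missing: the mean is plausible again.
cleaned = mean_age(ages, MISSING_CODE)

print(f"mean with 99 counted as an age: {naive}")
print(f"mean with 99 declared missing:  {cleaned}")
```

A quick look at the range of each variable before modeling would flag the same issue, since no primary school child can be 99 years old.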
We're in the banking and finance space, so mostly our clients use the on-premises deployment model. As part of compliance, it's required that data should not go out of the bank's boundaries or firewall. This solution is a great tool for users that are experimenting and is an alternative to doing the coding and everything themselves. It's perfect for those who want to focus more on data analysis rather than spending days coding everything. Users can go pretty far because of the solution's Auto ML capability which cuts down on coding. It allows for great productivity. I'd rate the solution eight out of ten.
The tool has a complete set of functions for working with data. I'm not quite sure about the speed of RapidMiner, but I think it's the fastest solution that I use. I don't think the product consumes a lot of RAM, which is good. There is something confusing in the product, but it's possible that the error is mine and that I'm not yet familiar enough with it. I would therefore rate this product a nine out of ten.
I would rate this solution a nine out of ten.
I have not worked with all of the features in RapidMiner. For example, I have not worked with all of the features for Big Data, and I have not used it with the cloud. I would rate this solution an eight out of ten.
This is a solution that I recommend. I would rate this solution an eight out of ten.