What is our primary use case?
I still use this tool on a daily basis. Compared to my experience with other ETL tools, the system I created with it is quite simple: extract the data from MySQL, export it to CSV, put it on S3, then push it into Redshift.
The PDI Kettle jobs and Kettle transformations are bundled in shell scripts, then scheduled and orchestrated by Jenkins.
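Concretely, a wrapper like the one described might look something like this. This is only a sketch; every path, file name, and variable here is a hypothetical stand-in, not the actual configuration:

```shell
#!/bin/sh
# Hypothetical wrapper around a PDI job for Jenkins to run on a schedule.
# All paths and names below are assumptions for illustration.
PDI_HOME="${PDI_HOME:-/opt/pentaho/data-integration}"
JOB_FILE="${JOB_FILE:-/opt/etl/jobs/mysql_to_redshift.kjb}"

# kitchen.sh is PDI's command-line runner for .kjb job files.
if [ -x "$PDI_HOME/kitchen.sh" ]; then
    "$PDI_HOME/kitchen.sh" -file="$JOB_FILE" -level=Basic
    status=$?
else
    # Keeps the sketch runnable on a machine without PDI installed;
    # a real wrapper would exit non-zero here instead.
    echo "kitchen.sh not found under $PDI_HOME (sketch only)"
    status=0
fi
```

Jenkins would invoke this script on its schedule and use the exit status to mark each run as passed or failed.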
We still use this tool because a lot of old systems depend on it. Our new solution is mostly on Airflow; we are still in the transition phase. To be clear, Airflow is a data orchestration tool that mainly uses Python. Everything from the ETL to the scheduling and the monitoring of issues sits in one system, entirely on Airflow.
How has it helped my organization?
In my current company, it does not have any major impact. We use it for old and simple ETLs only.
For the simple ETL jobs we run on it, it's quite useful. However, the functionality we currently rely on can easily be replaced by other tools on the market. It's time to change entirely to Airflow; we'll likely make the change in the next six months.
What is most valuable?
This solution offers drag-and-drop tools, so scripting is minimal. Even if you do not come from IT, or your background is not in software engineering, you can use it. It is quite intuitive; you can drag and drop many functions.
The abstraction is quite good.
Also, if you're familiar with the product itself, it has transformation abstractions and job abstractions. We can create smaller transformations as Kettle transformations, and then compose the bigger ones as Kettle jobs. For someone who is familiar with Python, or someone with no scripting background at all, the product is useful.
For larger data, we are using Spark.
The solution enables us to create pipelines with minimal manual or custom coding effort. Even with no advanced scripting experience, it is possible to build ETL pipelines. I had a recent graduate from a management major with no SQL experience and no prior exposure to ETL tools; I trained him for three months, and within that time he became quite fluent.
Whether it's important to create pipelines with minimal coding depends on the team. If I switched to Airflow, I would need more time to get people fluent. By using the product's abstractions, I can compress training time to just three months; with Airflow, it would take longer than six months to get new users to the same point.
We use the solution's ability to develop and deploy data pipeline templates and reuse them.
The old system was created long ago by someone who preceded me in the organization, and we still use it. We also use the solution for some ad hoc reporting.
The ability to develop and deploy data pipeline templates once and reuse them is really important to us. When there is a request for a pipeline, I create it and then deploy it to our server. It then has to be robust enough that the scheduled runs do not fail.
We like the automation. I cannot imagine how data teams would work if everything were done on an ad hoc basis; everything should be automated. Using my organization as an example, I can say with confidence that 95% of our data distributions are automated and only 5% are ad hoc. For that ad hoc portion, we query the data manually, process it in spreadsheets, and then distribute it to the organization. It's important for the solution to be robust and able to automate.
So far, we can deploy the solution easily on the cloud, which for us is AWS; I haven't really tried it on another server. We deploy it on AWS EC2, but we develop on our local computers. Most of the team uses Windows, and some use MacBooks.
I have personally developed on both. Windows is easier to navigate; on the MacBook, the display becomes quite messed up if you enable dark mode.
The solution did reduce our ETL development time compared to pure scripting, although that really depends on your experience.
What needs improvement?
Five years ago, when I had less experience with scripting, I would definitely have chosen this product over Airflow or pure scripting: the abstraction is quite intuitive and would have cut most of my ETL development time. That isn't the case anymore, as I am now more familiar with scripting.
When I first joined my organization, I was still using Windows, and developing the ETL system on Windows is quite straightforward. When I changed my laptop to a MacBook, however, it was quite a hassle: to open the application, I had to open the terminal first, go to the solution's directory, and then run the executable file. The display also becomes quite messed up when dark mode is enabled on the MacBook.
So developing on a MacBook is quite a hassle, whereas on Windows it's not really different from other ETL tools on the market, like SQL Server Integration Services, Informatica, et cetera.
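For reference, the manual launch on macOS amounts to something like the following; the install path is an assumption, and `spoon.sh` is PDI's GUI launcher on macOS/Linux:

```shell
# Hypothetical install location; PDI ships as an unpacked directory.
PDI_DIR="${PDI_DIR:-$HOME/pentaho/data-integration}"

if [ -x "$PDI_DIR/spoon.sh" ]; then
    # Change into the directory first, then run the launcher.
    cd "$PDI_DIR" && ./spoon.sh
else
    # Keeps the sketch runnable on machines without PDI installed.
    echo "spoon.sh not found under $PDI_DIR (sketch only)"
fi
```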
For how long have I used the solution?
I have been using this tool since I moved to my current company, which is about one year ago.
What do I think about the stability of the solution?
The performance is good, though I have not pushed the product to its limits. We only run simple jobs: we extract data from MySQL and export it to CSV. We deal with millions of data points, not billions. So far, it has met our expectations; it's quite good for smaller data volumes.
What do I think about the scalability of the solution?
I'm not sure the product could keep up with our data growth. It is useful for millions of data points, but I haven't explored billions; I think there are better solutions on the market for that. The same applies to the other drag-and-drop ETL tools, like SQL Server Integration Services, Informatica, et cetera.
How are customer service and support?
We don't really use technical support. The version we are using is no longer supported by their representatives, and we have not yet updated to a newer version.
How would you rate customer service and support?
Which solution did I use previously and why did I switch?
We're moving to Airflow. The reason for the switch was mostly debugging. If you're familiar with SQL Server Integration Services, Microsoft's ETL tool, its debugging function is quite intuitive: you can spot exactly which transformation failed or has an error. With this solution, from what my colleagues tell me, that is hard to do. When there is an error, we cannot directly spot where it is coming from.
Airflow is quite customizable and not as rigid as this product. We can deploy everything from simple ETL pipelines to machine learning systems on Airflow, and Airflow mainly uses Python, which our team knows well. This solution is still handled by only two of the 27 people on our team; not enough people know it.
How was the initial setup?
There is no separation between deployment and other teams; each of us acts as an individual contributor. We handle the implementation process end to end: face-to-face business meetings, defining requirements, setting timelines, developing the tools, and deploying to production.
The initial setup is straightforward. Currently, our use of version control is quite loose; we are not using any version control software. Deployment is as simple as copying the Kettle transformation file onto our EC2 server, overwriting the old file. That's it.
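A deployment like that can be a single copy command. The file name, user, and host below are placeholders, not our actual values:

```shell
# Hypothetical names throughout -- the point is that deployment is a
# straight file copy that overwrites the previous version on the server.
KTR_FILE="${KTR_FILE:-daily_export.ktr}"        # assumed transformation file
REMOTE="${REMOTE:-ec2-user@etl.example.com}"    # assumed EC2 login
REMOTE_DIR="${REMOTE_DIR:-/opt/etl/transformations}"

if [ -f "$KTR_FILE" ]; then
    # scp silently replaces the old file -- no version control in between.
    scp "$KTR_FILE" "$REMOTE:$REMOTE_DIR/$KTR_FILE"
    deployed=yes
else
    echo "no local $KTR_FILE to deploy (sketch only)"
    deployed=no
fi
```

The convenience here is also the risk the review hints at: with no version control in the path, the previous transformation is gone the moment the copy completes.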
What's my experience with pricing, setup cost, and licensing?
I'm not really sure what the price for the product is. I don't handle the purchasing or the commissioning.
What other advice do I have?
We run it on our AWS EC2 server, but we develop locally. We bundle the jobs in shell scripts, and the shell scripts are run by Jenkins.
I'd rate the solution a seven out of ten.
Disclosure: I am a real user, and this review is based on my own experience and opinions.
The integration tool should be made available in Professional, Community/Standard, and Enterprise editions, with pricing set on an industry-by-industry or case-by-case basis. There should also be transparency in pricing, and a community edition should be available, as was the case when Pentaho management first released it to the market.