Please share with the community what you think needs improvement with Databricks.
What are its weaknesses? What would you like to see changed in a future version?
I have seen better user interfaces, so that is something that can be improved. It was quite hard to deploy.
I think the automatic categorization of variables needs to be improved. The current functionality is not always efficiently identifying the features of the data that is collected. Probably that is the only thing I can think of. Apart from that, I have not explored the product enough yet to go into more depth because there is only one asset project that I have taken on right now. Because I own this company, I have been doing more to run it than to explore this product very deeply. But when you get any form of data inside there, if it could understand what type of variables there are and what features the data has, it would help massively in taking processing to the next step. If it does not exactly identify the variables you may have to modify them a little. Apart from working with Databricks to understand its capabilities, I am also trying to learn Apache Spark right now. Some members of my team want to work with Apache Spark as a solution and at this point, we are evaluating both and we are planning to use Spark or Databricks. As far as what might be added, some custom algorithm samples would be useful. All of the other products of this type — Azure, AWS, SageMaker — they all have customizable algorithms. You have the capability to implement a sort of workflow from that by modifying things in the sample and changing it to fit your purposes. Probably that is something that might help in doing some small NDP (Near-Data Processing) development. It might not help in the project directly, but it will help while we work on some NDP development of our own so that we can quickly evaluate how something is going to work. Templates or other samples could make working on things easier. That would also help massively in getting people to understand the potential of what the product can actually do. But I also think not many people would strongly agree with this. 
Many people go to the first solution they can think of that they know very well already in the IT field even if they could imagine that something could be better. To get the value out of this technology, people will need to come to accept it. Technical people will accept Databricks more if they understand that this is something that they can use and start working on without a lot of experience. Adopting it will take time for new users who have no experience. But to feel like they can have success with a product, they have to execute something in a very short time and see how it can work. When you talk about AI — or really when you talk about anything new — people do not initially want to invest the time in discovery. These processes do take time to learn, but with templates or samples, you get to see immediately what the possibilities are and what you might get out of it. Then when they try something of their own and are able to get it working in less than a week's time, they will be encouraged to look into the product and the technology some more.
Pricing is one of the things that could be improved. There could also be improvement in the visual analytics space and in the machine learning functions. I haven't explored these areas, so I don't know which functions and features are already there. If they are not there, then I think that's something they should consider including.
The management of the solution needs to be modernized. Managing the radius data is hard. The solution requires model scoring. There's not a good way of knowing how the models are performing from a data science perspective. The solution needs more model scoring abilities. It doesn't necessarily need more model monitoring, but more model scoring and performance from a data science perspective. Databricks is an analytics platform. It should offer more data science. It should have more features for data scientists to work with.
It would be very helpful if Databricks could integrate with platforms in addition to Azure. Having an open-source version or having the option to get a trial version of Databricks would be very helpful. It would be very useful for beginners if there were tutorials and examples on how to write code for PySpark, R, or Scala. Having examples would give people something to refer to and play with.
The solution could be improved by integrating it with data packets. Right now, the load tables provide a function like team collaboration. Still, it's unclear whether there's a function to create different branches or more branches. Our team had used data packets before; however, I feel it's difficult to integrate the current ones with the previous data packets. The support could be improved a bit around the database. When we stream it to Data Lake, some data cannot be loaded. It should be a priority to fix this.
Databricks should have more libraries for predictive analysis and machine learning. It should have more compatible and more advanced visualization and machine learning libraries. As it is now, I have to try a custom algorithm in order for things to be compatible. I would like to see more deep learning analytics.
Some of the error messages that we receive are too vague, saying things like "unknown exception", and these should be improved to make it easier for developers to debug problems. As it is now, we have to go into the driver logs to identify the error messages properly. There is not much information about Databricks available online, such as cost. Whenever we want to find the actual costing, we have to send an email to Databricks, so having the information available on the internet would be helpful. I would like to see integration with Power BI or Tableau for the business users. They may use Databricks to check on things, but it will be a little bit complicated for them. The GUI interfaces for Tableau and Power BI are ones that they are used to, so the integration would help.
Pricing could be improved: the product is a little expensive, although I think it is comparable to other similar options. The integration features could be more interesting, more involved. For example, we use the Databricks notebook, which is not as great as Jupyter Notebook for providing a great user experience. The look and feel are not the same and we've had complaints from some of our users. They say that it's easier and more productive for them to use Jupyter Notebook. And then there is the integration feature for connecting to data sources, for example, Jupyter Notebook through Databricks Connect. The problem is that when you do that, you don't get all the Jupyter features, which is a shame for us. For additional features, having some PyTorch or TensorFlow type features inside would definitely be great. For now, my users are developing for themselves by importing their libraries into their notebook and then creating models based on TensorFlow or PyTorch. That requires a lot of imports, particularly library imports, something that is now available in the new version of Machine Learning services. These things are very important because the data science community has shifted from the traditional way of preparing models to a deep learning approach. It's now more common to have those features.
The product could be improved by expanding its visualization capabilities, which currently assist in development in the notebook environment. Perhaps a few connectors that auto-deploy to a reporting server? More parallelized machine learning libraries would be excellent for predictive analytics algorithms.