An interesting article about big data and column-oriented engines, written by a student…
… Students majoring in software engineering are required to take another course, “Data Storage & Information Retrieval Systems”, as a prerequisite for Database. DS&IRS mainly focuses on optimized storage and retrieval of data on peripheral storage such as an HDD or even tape! (I did one of my most sophisticated and enjoyable projects during this course. We had to implement a search engine which, compared to the boolean model of Google, is supposed to be more accurate. More information about this project can be found in my older posts.) During these courses, students are required to take part in specific projects, designed to give them a better perspective on and intuition for the problem at hand.
I don’t know about other universities, but in ours, students’ performance on such projects is a real disappointment. Doing fun projects like these as part of a course is quite an opportunity to learn more, but students beg to differ. The prevailing belief is that our professors are torturing us and that we should resist doing any homework or projects! You have no idea how hard it is to escape that dogma when you have to live among such students. It is unfortunate how reluctant most students are to do any studying at all. For such students, learning only occurs when they’re faced with a real problem or task.
So here’s the problem. You are supposed to do your internship at a data analysis company. You will be given 100 GB of data, consisting of 500 million records or observations. How would you manage to use that amount of data? If you recall from the DS&IRS course, you’d know that a single pass through all the records would take at least 30 minutes, assuming all of your devices are average consumer-level hardware. Now imagine you have a typical machine learning optimization problem (a simple unimodal function) that may require at least 100 iterations to converge. At 30 minutes per iteration, that is roughly 50 hours to optimize your function! So what would you do?
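The estimate above is easy to check with a few lines of Python. This is only a back-of-the-envelope sketch: the function names are mine, and I treat 1 GB as 1000 MB.

```python
def implied_throughput_mb_s(data_gb, minutes_per_pass):
    """Sequential read speed implied by scanning data_gb in minutes_per_pass."""
    return data_gb * 1000 / (minutes_per_pass * 60)


def total_hours(minutes_per_pass, passes):
    """Wall-clock hours needed when every iteration rescans the whole data set."""
    return minutes_per_pass * passes / 60


throughput = implied_throughput_mb_s(100, 30)  # 100 GB in 30 minutes
hours = total_hours(30, 100)                   # 100 iterations, one pass each
print(f"implied read speed: {throughput:.1f} MB/s, total: {hours:.0f} hours")
```

A 30-minute pass over 100 GB corresponds to a sustained read speed of a bit over 55 MB/s, which is plausible for a consumer hard disk.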
That kind of problem has little to do with your method of storage: a simple contiguous block of data, which minimizes seek time on the hard disk, read sequentially, is as good as it gets. Such problems are tackled with an optimization method that minimizes disk access and still finds a decent (near-optimal) solution.
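To make "reading the data in a sequential manner" concrete, here is a minimal sketch of a front-to-back scan that pulls many records per read call. The record layout (one float64 per record) and all the names are my own assumptions for illustration, and it assumes a buffered binary stream such as a regular file opened with `open(path, "rb")`.

```python
import io
import struct

VALUE_RECORD = struct.Struct("<d")  # hypothetical layout: one float64 per record


def iter_chunks(stream, records_per_chunk=4096):
    """Yield lists of values, reading many records per call, front to back."""
    chunk_bytes = VALUE_RECORD.size * records_per_chunk
    while True:
        data = stream.read(chunk_bytes)
        if not data:
            break
        # Ignore a trailing partial record at the very end of the stream.
        usable = len(data) - len(data) % VALUE_RECORD.size
        yield [v for (v,) in VALUE_RECORD.iter_unpack(data[:usable])]


def streaming_mean(stream):
    """Compute a statistic in a single sequential pass, without seeking."""
    total, count = 0.0, 0
    for values in iter_chunks(stream):
        total += sum(values)
        count += len(values)
    return total / count if count else 0.0


# Tiny in-memory stand-in for the 100 GB file.
demo = io.BytesIO(b"".join(VALUE_RECORD.pack(float(i)) for i in range(10)))
print(streaming_mean(demo))
```

The point is only that each pass touches the disk once, in order; anything that reduces the number of passes reduces the total time proportionally.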
Now imagine you could reduce the amount of data you need on each iteration by selecting records with a specific feature. What would you do? The former problem doesn’t even need a database to do its job. But now that you need to select records with a specific attribute (not necessarily a specific value), you shouldn’t just iterate through the data and test every record against your criteria. You need to organize the data on disk and build a sensible index over it, which helps you reduce disk access and answer your query exactly (or at least closely enough). That’s when databases come in handy.
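Here is a minimal sketch of what such an index buys you. It is not how any particular database does it; the fixed-size record layout and the function names are assumptions of mine. One full pass builds a sorted list of (feature, offset) pairs, and after that a range query only touches the matching records.

```python
import bisect
import io
import struct

RECORD = struct.Struct("<if")  # hypothetical layout: int32 feature, float32 value


def write_records(buf, rows):
    """Append fixed-size records to a binary stream."""
    for feature, value in rows:
        buf.write(RECORD.pack(feature, value))


def build_index(buf):
    """One sequential pass: collect (feature, byte offset), sorted by feature."""
    buf.seek(0)
    index, offset = [], 0
    while True:
        chunk = buf.read(RECORD.size)
        if len(chunk) < RECORD.size:
            break
        feature, _ = RECORD.unpack(chunk)
        index.append((feature, offset))
        offset += RECORD.size
    index.sort()
    return index


def select_range(buf, index, low, high):
    """Read only the records whose feature lies in [low, high]."""
    start = bisect.bisect_left(index, (low, -1))
    stop = bisect.bisect_right(index, (high, float("inf")))
    rows = []
    # Visit matches in ascending offset order so the disk head only moves forward.
    for _, offset in sorted(index[start:stop], key=lambda entry: entry[1]):
        buf.seek(offset)
        rows.append(RECORD.unpack(buf.read(RECORD.size)))
    return rows


data = io.BytesIO()
write_records(data, [(5, 0.5), (1, 1.5), (9, 2.5), (5, 3.5), (3, 4.5)])
index = build_index(data)
print(select_range(data, index, 3, 5))
```

The index itself costs one full scan and some memory, so it only pays off when you query the same data many times, which is exactly the situation in an iterative optimization.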
Now the question is, what kind of database should I use? I’m a Macintosh user with limited RAM, a small and slow hard disk, and a modest processor! Is Oracle the right choice? The answer is no: you have a specific need, and these general-purpose databases may not be the logical choice, not to mention the price of such applications. So what kind of service do we require? In general, users may need to update records, alter a table’s schema, and so on. To provide such services, databases have to sacrifice speed, memory, and even processor time. Long story short, I found an alternative open-source database which was perfect for my needs.
Infobright is an open-source database which claims to “provide both speed and efficiency. Coupled with 10:1 average compression”. According to their website, the main features (for my use) are:
- Ideal for data volumes up to 50TB
- Market-leading data compression (from 10:1 to over 40:1), which drastically reduces I/O (improving query performance) and results in significantly less storage than alternative solutions.
- Query and load performance remains constant as the size of the database grows.
- Runs on low cost, off-the-shelf hardware.
Even though they don’t offer a native Mac solution, they provide a virtual machine running Ubuntu with Infobright already set up. And here’s the best part: even though the virtual machine’s allocation was pretty low (650 MB of RAM, 1 CPU core), it was able to answer my queries in about a second! The same query on a server (quad-core, 16 GB of RAM, running MS SQL Server) took the same amount of time. My query was a simple select, but according to the documentation, Infobright is highly optimized for business intelligence and data warehousing queries. I only imported 9 million records, and they consumed only 70 MB of my hard disk! Amazing, isn’t it? Scaling that linearly, importing all 500 million records would take a little under 4 GB of my disk!!
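A quick sanity check on that extrapolation, assuming the on-disk size grows linearly with the number of records (which real compression does not guarantee) and treating 1 GB as 1000 MB. The function name is mine.

```python
def projected_size_gb(sample_mb, sample_records, total_records):
    """Scale the measured on-disk size of a sample linearly to the full data set."""
    bytes_per_record = sample_mb * 1_000_000 / sample_records
    return bytes_per_record * total_records / 1_000_000_000


projected = projected_size_gb(70, 9_000_000, 500_000_000)
ratio = 100 / projected  # the raw data set is 100 GB
print(f"{projected:.1f} GB on disk, roughly {ratio:.0f}:1 compression")
```

That works out to fewer than 8 bytes per record, and the implied compression ratio against the raw 100 GB lands inside the 10:1 to 40:1 range quoted on their website.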
Infobright is essentially an optimized version of MySQL Server with a storage engine called Brighthouse. Since its interface is SQL, you can easily use Weka or Matlab to fetch the necessary data from your database and integrate it into your learning process with a minimal amount of code.
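To give an idea of how little code the "fetch through SQL, then learn" step takes, here is a minimal Python sketch. It is not Infobright-specific: I use the `sqlite3` module from the standard library as a stand-in so the example runs anywhere, and the table and column names (`observations`, `feature`, `target`) are made up. Against Infobright you would connect with a MySQL client library instead, but the query itself would look much the same.

```python
import sqlite3


def fetch_training_rows(conn, min_feature):
    """Let the database do the selection; only matching rows reach the learner."""
    cur = conn.execute(
        "SELECT feature, target FROM observations "
        "WHERE feature >= ? ORDER BY feature",
        (min_feature,),
    )
    return cur.fetchall()


def fit_line(rows):
    """Ordinary least squares for target = slope * feature + intercept."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# In-memory stand-in database with synthetic data: target = 2 * feature + 1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observations (feature REAL, target REAL)")
conn.executemany(
    "INSERT INTO observations VALUES (?, ?)",
    [(x, 2.0 * x + 1.0) for x in range(10)],
)

rows = fetch_training_rows(conn, 5)
slope, intercept = fit_line(rows)
print(len(rows), slope, intercept)
```

The learning code never sees the records that were filtered out, so the database's index and compression do the heavy lifting on every iteration.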