If you have a short attention span see the shorter blog post.
If you have a large attention span, see the complete 12-page paper.
There are two undeniable trends in analytical data management. First, the amount of data that needs to be stored and processed is exploding. This is partly due to the increased automation with which data can be produced (more business processes are becoming digitized), the proliferation of sensors and data-producing devices, Web-scale interactions with customers, and government compliance demands along with strategic corporate initiatives requiring more historical data to be kept online for analysis. It is no longer uncommon to hear of companies claiming to load more than a terabyte of structured data per day into their analytical database system and claiming data warehouses of size more than a petabyte (see the end of this write-up for some links to large data warehouses).
The second trend is what I talked about in my last blog post: the increased desire to perform more and more complex analytics and data mining inside of the DBMS.
I predict that the combination of these two trends will lead to a scalability crisis for the parallel database system industry. This prediction flies in the face of conventional wisdom. If you talk to prominent DBMS researchers, they'll tell you that shared-nothing parallel database systems horizontally scale indefinitely, with near linear scalability. If you talk to a vendor of a shared-nothing MPP DBMS, such as Teradata, Aster Data, Greenplum, ParAccel, and Vertica, they'll tell you the same thing. Unfortunately, they're all wrong. (Well, sort of.)
Parallel database systems scale really well into the tens and even low hundreds of machines. Until recently, this was sufficient for the vast majority of analytical database applications. Even the enormous eBay 6.5 petabyte database (the biggest data warehouse I've seen written about) was implemented on a (only) 96-node Greenplum DBMS. But as I wrote about previously, this implementation allows for only a handful of CPU cycles to be spent processing tuples as they are read off disk. As the second trend kicks in, resulting in an increased amount and complexity of data analysis that is performed inside the DBMS, this architecture will be entirely unsuitable, and will be replaced with many more compute nodes with a much larger horizontal scale. Once you add the fact that many argue that it is far more efficient from a hardware cost and power utilization perspective to run an application on many low-cost, low-power machines instead of fewer high-cost, high-power machines (see e.g., the work by James Hamilton), it will not be at all uncommon to see data warehouse deployments on many thousands of machines (real or virtual) in the future.
Unfortunately, parallel database systems, as they are implemented today, do not scale well into the realm of many thousands of nodes. There are a variety of reasons for this. First, they all compete with each other on performance. The marketing literature of MPP database systems are littered with wild claims of jaw-dropping performance relative to their competitors. These systems will also implement some amount of fault tolerance, but as soon as performance becomes a tradeoff with fault tolerance (e.g. by implementing frequent mid-query checkpointing) performance will be chosen every time. At the scale of tens to hundreds of nodes, a mid-query failure of one of the nodes is a rare event. At the scale of many thousands of nodes, such events are far more common. Some parallel database systems lose all work that has been done thus far in processing a query when a DBMS node fails; others just lose a lot of work (Aster Data might be the best amongst its competitors along this metric). However, no parallel database system (that I'm aware of) is willing to pay the performance overhead to lose a minimal amount of work upon a node failure.
Second, while it is possible to get reasonably homogeneous performance across tens to hundreds of nodes, this is nearly impossible across thousands of nodes, even if each node runs on identical hardware or on an identical virtual machine. Part failures that do not cause complete node failure, but result in degraded hardware performance become more common at scale. Individual node disk fragmentation and software configuration errors can also cause degraded performance on some nodes. Concurrent queries (or, in some cases, concurrent processes) further reduce the homogeneity of cluster performance. Furthermore, we have seen wild fluctuations in node performance when running on virtual machines in the cloud. Parallel database systems tend to do query planning in advance and will assign each node an amount of work to do based on the expected performance of that node. When running on small numbers of nodes, extreme outliers from expected performance are a rare event, and it is not worth paying the extra performance overhead for runtime task scheduling. At the scale of many thousands of nodes, extreme outliers are far more common, and query latency ends up being approximately equal to the time it takes these slow outliers to finish processing.
Third, many parallel databases have not been tested at the scale of many thousands of nodes, and in my experience, unexpected bugs in these systems start to appear at this scale.
In my opinion the "scalability problem" is one of two reasons why we're starting to see Hadoop encroach on the structured analytical database market traditionally dominated by parallel DBMS vendors (see the Facebook Hadoop deployment as an example). Hadoop simply scales better than any currently available parallel DBMS product. Hadoop gladly pays the performance penalty for runtime task scheduling and excellent fault tolerance in order to yield superior scalability. (The other reason Hadoop is gaining market share in the structured analytical DBMS market is that it is free and open source, and there exists no good free and open source parallel DBMS implementation.)
The problem with Hadoop is that it also gives up some performance in other areas where there are no tradeoffs for scalability. Hadoop was not originally designed for structured data analysis, and thus is significantly outperformed by parallel database systems on structured data analysis tasks. Furthermore, it is a relatively young piece of software and has not implemented many of the performance enhancing techniques developed by the research community over the past few decades, including direct operation on compressed data, materialized views, result caching, and I/O scan sharing.
Ideally there would exist an analytical database system that achieves the scalability of Hadoop along with with the performance of parallel database systems (at least the performance that is not the result of a tradeoff with scalability). And ideally this system would be free and open source.
That's why my students Azza Abouzeid and Kamil Bajda-Pawlikowski developed HadoopDB. It's an open source stack that includes PostgreSQL, Hadoop, and Hive, along with some glue between PostgreSQL and Hadoop, a catalog, a data loader, and an interface that accepts queries in MapReduce or SQL and generates query plans that are processed partly in Hadoop and partly in different PostgreSQL instances spread across many nodes in a shared-nothing cluster of machines. In essence it is a hybrid of MapReduce and parallel DBMS technologies. But unlike Aster Data, Greenplum, Pig, and Hive, it is not a hybrid simply at the language/interface level. It is a hybrid at a deeper, systems implementation level. Also unlike Aster Data and Greenplum, it is free and open source.
Our paper (that will be presented at the upcoming VLDB conference in the last week of August) shows that HadoopDB gets similar fault tolerance and ability to tolerate wild fluctuations in runtime node performance as Hadoop, while still approaching the performance of commercial parallel database systems (of course, it still gives up some performance due to the above mentioned tradeoffs).
Although HadoopDB currently is built on top of PostgreSQL, other database systems can theoretically be substituted for PostgreSQL. We have successfully been able to run HadoopDB using MySQL instead, and are currently working on optimizing connectors to open source column-store database systems such as MonetDB and Infobright. We believe that swtiching from PostgreSQL to a column-store will result in even better performance on analytical workloads.
The initial release of the source code for HadoopDB can be found at http://db.cs.yale.edu/hadoopdb/hadoopdb.html. Although at this point this code is just an academic prototype and some ease-of-use features are yet to be implemented, I hope that this code will nonetheless be useful for your structured data analysis tasks!