If you have a short attention span, see the shorter blog post.
If you have a long attention span, see the complete 12-page paper.
There are two undeniable trends in analytical data management. First, the amount of data that needs to be stored and processed is exploding. This is partly due to the increased automation with which data can be produced (more business processes are becoming digitized), the proliferation of sensors and data-producing devices, Web-scale interactions with customers, and government compliance demands along with strategic corporate initiatives requiring more historical data to be kept online for analysis. It is no longer uncommon to hear of companies claiming to load more than a terabyte of structured data per day into their analytical database system and claiming data warehouses of size more than a petabyte (see the end of this write-up for some links to large data warehouses).
The second trend is what I talked about in my last blog post: the increased desire to perform more and more complex analytics and data mining inside of the DBMS.
I predict that the combination of these two trends will lead to a scalability crisis for the parallel database system industry. This prediction flies in the face of conventional wisdom. If you talk to prominent DBMS researchers, they'll tell you that shared-nothing parallel database systems horizontally scale indefinitely, with near linear scalability. If you talk to a vendor of a shared-nothing MPP DBMS, such as Teradata, Aster Data, Greenplum, ParAccel, and Vertica, they'll tell you the same thing. Unfortunately, they're all wrong. (Well, sort of.)
Parallel database systems scale really well into the tens and even low hundreds of machines. Until recently, this was sufficient for the vast majority of analytical database applications. Even the enormous 6.5-petabyte eBay database (the biggest data warehouse I've seen written about) was implemented on a Greenplum DBMS of only 96 nodes. But as I wrote about previously, this implementation allows only a handful of CPU cycles to be spent processing each tuple as it is read off disk. As the second trend kicks in, resulting in an increased amount and complexity of data analysis performed inside the DBMS, this architecture will become entirely unsuitable and will be replaced by deployments with many more compute nodes at a much larger horizontal scale. Add the fact that many argue it is far more efficient, from a hardware-cost and power-utilization perspective, to run an application on many low-cost, low-power machines instead of fewer high-cost, high-power machines (see, e.g., the work by James Hamilton), and it will not be at all uncommon to see data warehouse deployments on many thousands of machines (real or virtual) in the future.
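To make the "handful of CPU cycles" point concrete, here is a back-of-envelope sketch in Python. All figures here are assumptions chosen for illustration, not numbers from the eBay deployment:

```python
# Rough calculation: if a node must keep up with its disks during a full
# table scan, how many CPU cycles can it afford to spend on each tuple?
# All inputs below are illustrative assumptions.

def cycles_per_tuple(scan_mb_per_sec, tuple_bytes, cpu_ghz, cores):
    tuples_per_sec = scan_mb_per_sec * 1e6 / tuple_bytes
    cycles_per_sec = cpu_ghz * 1e9 * cores
    return cycles_per_sec / tuples_per_sec

# A node scanning 1,000 MB/s of 100-byte tuples with 8 cores at 2.5 GHz
# has only about 2,000 cycles per tuple for ALL query processing:
print(cycles_per_tuple(1000, 100, 2.5, 8))  # → 2000.0
```

Simple selection and aggregation fit in that budget; complex analytics and data mining generally do not, which is why the compute-starved architecture breaks down as the second trend kicks in.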
Unfortunately, parallel database systems, as they are implemented today, do not scale well into the realm of many thousands of nodes. There are a variety of reasons for this. First, they all compete with each other on performance. The marketing literature of MPP database systems is littered with wild claims of jaw-dropping performance relative to their competitors. These systems also implement some amount of fault tolerance, but as soon as performance becomes a tradeoff with fault tolerance (e.g., by implementing frequent mid-query checkpointing), performance will be chosen every time. At the scale of tens to hundreds of nodes, a mid-query failure of one of the nodes is a rare event. At the scale of many thousands of nodes, such events are far more common. Some parallel database systems lose all work that has been done thus far in processing a query when a DBMS node fails; others just lose a lot of work (Aster Data might be the best amongst its competitors along this metric). However, no parallel database system (that I'm aware of) is willing to pay the performance overhead to lose a minimal amount of work upon a node failure.
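The failure-probability side of this argument can be sketched numerically. The per-node failure rate below is an assumption purely for illustration:

```python
# Toy model: each node fails independently with probability f per hour.
# The chance that a long-running query sees at least one node failure
# grows quickly with cluster size. f is an illustrative assumption.

def p_any_failure(nodes, hours, f_per_node_hour=0.0001):
    # P(at least one failure) = 1 - P(no node fails during the query)
    return 1 - (1 - f_per_node_hour) ** (nodes * hours)

print(round(p_any_failure(100, 1), 3))     # ~0.01 on 100 nodes: rare
print(round(p_any_failure(10_000, 1), 3))  # ~0.63 on 10,000 nodes: routine
```

At 100 nodes, restarting the whole query on failure is an acceptable strategy; at 10,000 nodes, a system that loses all in-flight work on a failure may rarely finish a long query at all.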
Second, while it is possible to get reasonably homogeneous performance across tens to hundreds of nodes, this is nearly impossible across thousands of nodes, even if each node runs on identical hardware or on an identical virtual machine. Partial failures that do not cause a complete node failure but result in degraded hardware performance become more common at scale. Individual node disk fragmentation and software configuration errors can also cause degraded performance on some nodes. Concurrent queries (or, in some cases, concurrent processes) further reduce the homogeneity of cluster performance. Furthermore, we have seen wild fluctuations in node performance when running on virtual machines in the cloud. Parallel database systems tend to do query planning in advance and will assign each node an amount of work to do based on the expected performance of that node. When running on small numbers of nodes, extreme outliers from expected performance are a rare event, and it is not worth paying the extra performance overhead for runtime task scheduling. At the scale of many thousands of nodes, extreme outliers are far more common, and query latency ends up being approximately equal to the time it takes these slow outliers to finish processing.
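A toy simulation makes the straggler effect visible. The task-time distribution here is assumed, not measured:

```python
import random

# Model of static work assignment: every node needs 1.0s for its partition,
# but with small probability p_outlier a node is degraded and takes 10s.
# Since the query waits for every node, latency is the max over all nodes.
# The 1s/10s/1% figures are illustrative assumptions.

def query_latency(nodes, p_outlier=0.01, seed=None):
    rng = random.Random(seed)
    times = [10.0 if rng.random() < p_outlier else 1.0 for _ in range(nodes)]
    return max(times)

# Fraction of 100 trial queries slowed down by at least one outlier node:
for n in (100, 10_000):
    slowed = sum(query_latency(n, seed=s) > 1.0 for s in range(100))
    print(f"{n:>6} nodes: {slowed}/100 queries waited on an outlier")
```

At 100 nodes some queries escape unscathed; at 10,000 nodes essentially every query pays the outlier's latency, which is exactly why runtime task scheduling (reassigning work away from slow nodes mid-query) starts to pay for itself.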
Third, many parallel databases have not been tested at the scale of many thousands of nodes, and in my experience, unexpected bugs in these systems start to appear at this scale.
In my opinion the "scalability problem" is one of two reasons why we're starting to see Hadoop encroach on the structured analytical database market traditionally dominated by parallel DBMS vendors (see the Facebook Hadoop deployment as an example). Hadoop simply scales better than any currently available parallel DBMS product. Hadoop gladly pays the performance penalty for runtime task scheduling and excellent fault tolerance in order to yield superior scalability. (The other reason Hadoop is gaining market share in the structured analytical DBMS market is that it is free and open source, and there exists no good free and open source parallel DBMS implementation.)
The problem with Hadoop is that it also gives up some performance in other areas where there are no tradeoffs for scalability. Hadoop was not originally designed for structured data analysis, and thus is significantly outperformed by parallel database systems on structured data analysis tasks. Furthermore, it is a relatively young piece of software and has not implemented many of the performance enhancing techniques developed by the research community over the past few decades, including direct operation on compressed data, materialized views, result caching, and I/O scan sharing.
Ideally there would exist an analytical database system that achieves the scalability of Hadoop along with the performance of parallel database systems (at least the performance that is not the result of a tradeoff with scalability). And ideally this system would be free and open source.
That's why my students Azza Abouzeid and Kamil Bajda-Pawlikowski developed HadoopDB. It's an open source stack that includes PostgreSQL, Hadoop, and Hive, along with some glue between PostgreSQL and Hadoop, a catalog, a data loader, and an interface that accepts queries in MapReduce or SQL and generates query plans that are processed partly in Hadoop and partly in different PostgreSQL instances spread across many nodes in a shared-nothing cluster of machines. In essence it is a hybrid of MapReduce and parallel DBMS technologies. But unlike Aster Data, Greenplum, Pig, and Hive, it is not a hybrid simply at the language/interface level. It is a hybrid at a deeper, systems implementation level. Also unlike Aster Data and Greenplum, it is free and open source.
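A minimal sketch of the core idea, with fake in-memory partitions standing in for the PostgreSQL instances (this is an illustration of the hybrid concept, not the actual SMS planner code): push as much of a SQL aggregation as possible into each node's local database, then merge the partial results in the Hadoop layer.

```python
# Per-node SQL: each PostgreSQL instance pre-aggregates its own partition,
# so only small partial results flow through the MapReduce layer.
node_sql = "SELECT url, COUNT(*) AS c FROM visits GROUP BY url"

def map_phase(run_local_query, nodes):
    # Each mapper pulls pre-aggregated (url, count) rows from its node's
    # local database partition.
    for node in nodes:
        for url, c in run_local_query(node, node_sql):
            yield url, c

def reduce_phase(pairs):
    # The reducer merges per-node partial counts into global counts.
    totals = {}
    for url, c in pairs:
        totals[url] = totals.get(url, 0) + c
    return totals

# Fake partitions standing in for two PostgreSQL nodes (illustrative data):
partitions = {"n1": [("/a", 2), ("/b", 1)], "n2": [("/a", 3)]}
result = reduce_phase(map_phase(lambda n, q: partitions[n], partitions))
print(result)  # → {'/a': 5, '/b': 1}
```

The real system does the analogous work with Hadoop jobs and JDBC connections to the per-node databases, and the SMS planner decides how much of each SQL query can be pushed below the MapReduce boundary.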
Our paper (to be presented at the upcoming VLDB conference in the last week of August) shows that HadoopDB achieves fault tolerance and tolerance of wild fluctuations in runtime node performance similar to Hadoop's, while still approaching the performance of commercial parallel database systems (of course, it still gives up some performance due to the above-mentioned tradeoffs).
Although HadoopDB is currently built on top of PostgreSQL, other database systems can theoretically be substituted for PostgreSQL. We have successfully run HadoopDB using MySQL instead, and are currently working on optimizing connectors to open source column-store database systems such as MonetDB and Infobright. We believe that switching from PostgreSQL to a column-store will result in even better performance on analytical workloads.
The initial release of the source code for HadoopDB can be found at http://db.cs.yale.edu/hadoopdb/hadoopdb.html. Although at this point this code is just an academic prototype and some ease-of-use features are yet to be implemented, I hope that this code will nonetheless be useful for your structured data analysis tasks!
Daniel, congratulations on the release and I look forward to hearing the session at VLDB.
Congratulations! It would be great to have a blog post in the future about your deployment configuration (or have you so far just tested it on the 10-node cluster mentioned in the paper?), and about what fine-tuning you had to do to make it work better, faster, or both.
We actually tested it on a 100 node EC2 cluster in the paper. As far as configuration, there's a lot of detail on this in the quick start guide (http://hadoopdb.sourceforge.net/guide). Is this what you are looking for?
I'm curious what you mean when you say you are targeting analytical workloads, and how you think a merged system like this might apply to other kinds of workloads?
This sounds extremely interesting! I'm very interested in checking this out and seeing how it might apply to some of Mozilla's current large scale data warehousing needs in the future.
My first interest is the details on what level of performance trade-offs HadoopDB has to give up. Does it lean toward SQL as being capable as an interactive data store where you can run an aggregation query and see the results in seconds to minutes, or does it lean toward Hadoop as requiring a batch processing model where the answer will come but it might take a day?
Noticing your mention of Semantic Web in the blog sidebar, have you any thoughts on where/whether RDF query and storage (SPARQL etc) might fit into this design?
Quite an interesting post! I was wondering when someone was going to 'glue' the various pieces of the Hadoop stack together.
I just wrote an article on why Hadoop and HBase are making large inroads into data warehousing. I encourage everyone to read it at this link :)
I am also interested in your answer to the RDF/SPARQL question.
I hope the project lives on. Are you considering making it a part of Hadoop itself, thus increasing its survival chances once original authors move on?
Wow, lots of questions!
@danbri, @little3lue: Applying HadoopDB to our work on Semantic Web data management is definitely something we’re looking into. In fact, this forms the basis of my recent NSF CAREER grant award that I received six months ago (see http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0845643). Obviously HadoopDB didn’t exist when I applied for the grant, but in it I proposed to explore the use of shared-nothing data management techniques to deal with Web-scale RDF data and inference. HadoopDB is very useful as a starting point for this research project. Though HadoopDB does not currently accept SPARQL queries, my student Kamil Bajda-Pawlikowski has extended a SPARQL-to-SQL parser which could be placed on top of HadoopDB’s SMS planner. I have a student arriving in September who has expressed an interest in working on research related to this grant; hopefully I’ll have further updates for you as the research progresses.
@neha: By “analytical” workloads, I mean workloads that scan through large quantities of data and process the data in some way, e.g., through data mining algorithms, correlation analysis, scoring, or summarizing. These are the workloads that data warehouse systems like Teradata are well known for, and are often seen in applications like customer relationship management, inventory optimization, risk management, and click-stream analysis. In general, HadoopDB is focused on structured data (rather than unstructured data). HadoopDB can also be applied to other workloads, but the less structured the data gets, the less useful it will be relative to just using Hadoop.
@Daniel E: HadoopDB is definitely designed to be an interactive data store (rather than just being used for batch data processing). It is a little slower than commercial parallel databases, but it performs in the same order of magnitude. For detailed performance benchmarks, I’d recommend looking at the VLDB paper (http://db.cs.yale.edu/hadoopdb/hadoopdb.pdf).
@otis: HadoopDB uses the same Apache license as Hadoop and Hive with the hope that some sort of integration might be possible down the road.
Please let me know whether these responses answer your questions.
I'd like to set this up as a full OSS project; there's already quite a bit of interest in it from the PostgreSQL community. Would it be OK if I copied the code over to pgFoundry, our projects hosting site? Or would you rather put it somewhere else?
Among other things, I'd like to play with using it for government open data.
It is an honor to have a comment from you in my blog.
Perhaps we can talk about this offline. I'll send you an e-mail this afternoon.
Josh - thanks for jumping in. That's exactly what I was hoping to see (see my earlier comment).
Hi Daniel, is it possible to use these types of systems for real-time data? For example, web logs? There's always a loading phase when it would seem easier to stream data in real-time. One thing I don't see is dynamic repartitioning, so I don't think this would work.
Dynamic repartitioning is on our radar, but you're right in that it's not something we've implemented yet. For now we only support bulk load, which obviously isn't well suited for real-time data. Loading is definitely something we want to think about more in the future.
Another question if you don't mind, Daniel. When you say "more than ever, a requirement to perform data analysis inside of the DBMS" and that architectures like eBay's are essentially CPU bound and more nodes are required for processing, isn't that somewhat contradictory? It would seem like decoupling the compute grid and the data grid would make more sense. With replicated data, bandwidth would seem abundant, without the extra complication of managing so many database servers.
To me, it is far more efficient from a performance and a "green" perspective to push the computation to the data. Hence, I am not a fan of decoupling the compute grid and the data grid. I totally agree that pushing more analysis into a CPU-bound system is a really bad idea. In my opinion it is possible to get the best of both worlds: you scale horizontally (by a large amount), which lets each node hold a smaller amount of data and makes the processing power (of the aggregate system) per terabyte of data much higher. But you need a horizontally scalable system. Hence: HadoopDB :)
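The toy arithmetic behind that argument, with node and core counts assumed purely for illustration:

```python
# Same warehouse, two cluster shapes: scaling out raises the aggregate
# compute available per terabyte of data. All figures are illustrative.

def cores_per_tb(total_tb, nodes, cores_per_node):
    return nodes * cores_per_node / total_tb

# A 1 PB (1,000 TB) warehouse on a 96-node cluster vs. 3,000 nodes,
# each node with 8 cores:
print(cores_per_tb(1000, 96, 8))    # → 0.768 cores per TB: CPU-starved
print(cores_per_tb(1000, 3000, 8))  # → 24.0 cores per TB
```

The data is pushed out with the compute, so each node scans a small local partition rather than pulling terabytes over the network from a decoupled data grid.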
In a dedicated cluster I can see your point, but in a cluster running a heterogeneous load won't you run out of IO on each node as the computations stack up?
Getting good balance in any cluster configuration is always tricky when you have heterogeneous load. Being able to move virtual machines around in a public or private cloud might help, but I don't yet have coherent thoughts on this subject.
I posted a comment over at https://sourceforge.net/forum/forum.php?thread_id=3348931&forum_id=979674 before I read that the HadoopDB team is already planning to work on a MonetDB connector.
Is any more information on this work available? I'd be interested in contributing too.
You can get the contact information of Kamil from the HadoopDB Website. He is the one to contact regarding the connector to MonetDB.
Could we have a Google group or mailing list? Sourceforge does a reasonable job of hosting lists.
I will do some tests with the Firebird database (http://firebirdsql.org).
I have asked on Firebird-Architect about your ideas and presentations, and they are not very impressed; it seems to be impossible to do transactions, and there appear to be some MapReduce weaknesses when combined with databases.
Aster Data is a parallel database with a MapReduce interface. The core product is completely disjoint from the Hadoop codeline, and doesn't implement many of the systems-level features from the Google MapReduce paper, most notably the fault tolerance and runtime scheduling aspects (underneath it all, Aster is a traditional parallel database system).
HadoopDB is part of the Hadoop ecosystem, inheriting the job scheduling, fault tolerance, scalability, and the flexible analytics infrastructure from Hadoop.
If you think that MapReduce is just a language, then Aster Data has all you need (though beware: it is a proprietary system with a lock-in interface language). If you think MapReduce is a complete system as proposed by Google and implemented in the Hadoop open source base with the whole ecosystem around it, then Aster Data is probably not the direction you want to go (they do have a connector to get data in and out of Hadoop, but that's as far as they go).
What is the status of this project now? In your paper you mentioned that you plan to integrate MonetDB. Have you made any progress on that approach?
Hey Daniel, I want to implement your project myself. Although I have studied MapReduce and Hadoop, I can't find much help on how MapReduce works with databases. Could you help me implement your project at home from a fresh start? Please point me to any tutorials and guidelines that would be helpful for understanding the concepts and implementing this project at a very small scale.