DBMS Musings: 2009

Wednesday, December 30, 2009

2009's top blog posts

Below are my top six blog postings of 2009, in order of the number of page views as calculated by Google Analytics:

Announcing release of HadoopDB (longer version), and (shorter version). Combined visits: 31,228 (26,650 and 4,638 for the longer and shorter versions respectively).

The post gave an overview of the HadoopDB project that was released from my research group at Yale over the summer. The basic idea is to combine the scalability and ease-of-use of Hadoop with the performance on structured data of relational database systems.
A tour through hybrid column/row-oriented DBMS schemes. 2,922 visits.

This might be the post I'm most proud of, from the ones on this list. It went through different ways one can combine row-store and column-store technology in a single DBMS. I believe that hybrid database systems along the lines described in the post are going to take off in 2010, and the predictions made in the post will soon come to fruition.
ParAccel and their puzzling TPC-H results. 2,499 visits.

Of the posts on this list, this is the one I like the least, and the one that needs to be rewritten the most. I don't recommend reading it, but if you do, make sure you read the corrections in the comment thread in addition to the main article. However, the post is very stale at this point, since it discusses some TPC-H results from ParAccel that were challenged by a competitor and ParAccel later withdrew in September. ParAccel has not yet rerun these 30TB TPC-H benchmark results.
Watch out for VectorWise. 1,595 visits.

This post discussed the technology behind Ingres' new column-store storage engine designed for analytical workloads, based on some research out of CWI. I am very high on this technology and my research group has had a chance to play around with it a little this winter.
Analysis of the "MapReduce Online" paper. 1,108 visits.

I'm surprised this post got so many visits, since it was really geared only for readers in the research community. This post reviewed some research performed at the University of California, Berkeley, which explored how to pipeline results from different phases in MapReduce jobs to improve performance and enable early estimations of results. Rumor has it that this paper was accepted to NSDI 2010. I think the model of releasing papers as technical reports and independently reviewing and recommending them in public on venues like blogs is an interesting model to consider for the next decade, rather than the private 'accept' or 'reject' reviewing process currently used today.
Kickfire’s approach to parallelism. 1,042 visits.

This post takes a deeper dive into Kickfire's technology than you'll find in most other places. I find the way that they use FPGA technology to maximize the parallelism that can be achieved on analytical queries in a single-box machine to be quite impressive.

The post from 2009 that I feel is the most underrated (meaning that the number of visits did not match up with what I felt was the quality of the post) was:

What is the right way to measure scale?

I really feel that people thought about scale in the wrong way in 2009. People assume that if a database can fit a lot of data inside of it (i.e. many petabytes), it must be really scalable. But if the data is not very accessible (i.e. it is stored on slow media, or it takes forever to scan through it all because there are not enough disk spindles or CPUs to process it), then the system is not nearly as scalable as it would seem.

Bottom line, if you only have time to read three of my postings from 2009, I would like them to be:

The longer HadoopDB post (this is the only post about my own research)
The hybrid column/row-store post
The post on measuring scale

Overall, I'm pleased with the impact my blog seems to have had, and I intend to continue write posts for it in 2010.

Tuesday, November 24, 2009

Deadlines approaching for two upcoming summits

There are two upcoming events that I suspect will be of interest to readers of this blog.

First, the first annual ACM Symposium on Cloud Computing 2010 (ACM SOCC 2010) will be held June 10th and 11th in Indianapolis, IN (co-located with SIGMOD 2010). This symposium will focus on systems and data management issues within the context of cloud computing. If the quality of the papers comes close to matching the quality of the people listed as organizers and program committee members of the event, then this will be a can't-miss highlight of 2010. If you are doing research in software as a service, virtualization, or scalable cloud data services, this venue will likely be a nice, high-profile place to publish your findings (the paper submission deadline is January 15th, 2010).

Second, on January 28th, 2010, the third annual New England Database Summit will be held at MIT in Cambridge, Massachusetts. This will be an all day conference-style event where participants from the research community and industry in the New England area can come together to present ideas and discuss their research and experiences. Registration for the event is free (thank you Netezza), and anyone is welcome to attend. The event will feature a keynote talk from Raghu Ramakrishnan (Chief Scientist for Audience & Cloud Computing at Yahoo!) and an invited talk from Curt Monash (President, Monash Research).
I'm serving as PC Chair for the event this year, but I don't expect to make any radical changes to how the summit has been run in previous years, with the day beginning with a keynote talk, followed by a series of 15-25 minute talks from different summit participants (submit a talk abstract here if you want to give a talk --- the program committee will select the set of talks from this pool of abstracts based on what we expect will appeal most to the summit audience), followed by another invited talk and then a poster session over appetizers and drinks at the end of the day.
We had over 300 registered participants at last year's event, reflecting the vibrancy of database systems activity in the New England area. We expect similar numbers at this year's event. Lunch, drinks, and appetizers are all included for free.
Poster and talk proposal submissions must be made by January 11, 2010. All reasonable posters will be accepted; talk invitations will be made by January 20, 2010.

Thursday, October 22, 2009

Analysis of the "MapReduce Online" paper

I recently came across a paper entitled “MapReduce Online” written by Tyson Condie, Neil Conway, Peter Alvaro, Joe Hellerstein, Khaled Elmeleegy, and Russell Sears at Berkeley (University of California). Since I’m very interested in Hadoop-related research (see my group’s work on HadoopDB) and this Berkeley group have historically produced reliably good research papers, I immediately downloaded the paper and read it carefully. The paper demonstrates the impact of several improvements the group made to Hadoop for interactive queries, and since Hadoop is becoming increasingly popular, I expect this paper to have wide interest. Therefore, I think it might be useful to post a summary of the paper and some analysis on this blog. If you have also read this paper, I encourage discussion in the comment thread of this post.

Overview

The authors argue that since MapReduce’s (and therefore Hadoop’s) roots are in batch processing, design decisions were made that are cause problems as Hadoop gets used more and more for interactive query processing. The main problem the paper addresses is how data is transferred between operators --- both between ‘Map’ and ‘Reduce’ operators within a job, and also across multiple MapReduce jobs. This main problem has two subproblems:

Data is pulled by downstream operators from upstream operators (e.g., a Reduce task requests all the data it needs from predecessor Map tasks) instead of pushed from upstream operators to downstream operators.
Every MapReduce operator is a blocking operator. Reduce tasks cannot begin until Map tasks are finished, and Map tasks for the next job cannot get started until the Reduce tasks from the previous job have completed.

The paper does not really distinguish between these two subproblems, but for the purposes of this discussion, I think it is best to separate them.

Subproblem 2 causes issues for interactive queries for four reasons:

It eliminates opportunities for pipelined parallelism (where, e.g., Map and Reduce tasks from the same job can be running simultaneously). There are some cases (such as the example query in the paper --- more on that later) where pipelined parallelism can significantly reduce query latency.
It causes spikes in network utilization (there is no network traffic until a Map task finishes, and then the entire output must be sent at once to the Reduce nodes). This is particularly problematic if Map tasks across nodes in a cluster are synchronized
When every operator is a blocking operator, it is impossible to get early estimations of a query result until the very end of query execution. If data instead can be pipelined through query operators as it is processed, then we can start receiving query results almost immediately, and these early query results might be sufficient for the end user, who can then stop the query before it runs to completion.
If the Map part of a MapReduce job reads from a continuous (and infinite) data source instead of from a file, no operator ever “completes”, so a blocking operator that doesn’t produce results until it completes is entirely useless.

Subproblem 1 (pull vs push) is problematic mostly due to performance reasons. Data that is pushed can be shipped as soon as it is produced, while it is still in cache. If it is pulled at a later time, accessing the data often incurs an index lookup and a disk read, which can reduce performance. There is also slightly more coordination (and therefore network traffic) across nodes in the pull model relative to the push model.

The goal of the paper is to fix these subproblems that arise on interactive query workloads, while maintaining Hadoop’s famous query-level fault tolerance (very little query progress is lost upon a node failure). Essentially, this means that we want to continually push data from producer operators to consumer operators on the fly as the producer operator produces output. However, in order to maintain Hadoop’s fault tolerance, data that is pushed is also checkpointed locally to disk before being sent, so that if a downstream node fails, the task assigned to this node can be rescheduled on a different node, which will begin by pulling data from this upstream checkpoint just as standard Hadoop does.

The previous two sentences is the basic idea suggested in the paper. There are some additional details that are also described (1. to avoid having too many TCP connections between Map and Reduce nodes, not all data can be immediately pushed; 2. data that is pushed should be sent in batches and “combined” before being shipped; 3. in order to handle potential Map node failures, Reduce tasks that receive data before Map tasks have completed have to treat this data as “tentative” until the Map task “commits”), but these details do not change the basic idea: push data downstream as it is produced, but still write all data locally to disk before it is shipped.

The authors implement this idea in Hadoop, and then run some experiments to demonstrate how these changes (1) improve query latency (2) enable early approximations of query results (3) enable the use of Hadoop for continuous data stream of data. In other words, all of the problems that Hadoop has for interactive query workloads described above are fixed.

Analysis

The paper evaluates mainly subproblem (2): every operator in MapReduce is a blocking operator. The main focus of the paper is, in particular, on turning the Map operator from a blocking operator to an operator that can ship query results as they are produced. This technique gets more and more useful the fewer map tasks there are. If there are more nodes than Map tasks (as is the case for every experiment in this paper) then there are entire nodes that have only Reduce tasks assigned to them, and without pipelining, these nodes have to sit around and do nothing until the Map tasks start to finish.

Even if there are as many nodes as Map tasks (so every node can work on a Map task during the Map phase), each node can run a Map task and a Reduce task in parallel and the multicore resources of the node can be more fully utilized if the Reduce task has something to do before the Map task finishes. However, as there are more Map tasks assigned per node (Google reports 200,000 Map tasks for 2,000 nodes in the original MapReduce paper), I would predict that the performance improvement from being able to start the Reduce nodes early gets steadily smaller. I also wonder if the need to send data from a Map task before it completes for the purposes of online aggregation also gets steadily smaller, since the first Map tasks finish at a relatively earlier time in query processing when nodes have to process many Map tasks.

In general, the performance studies that are presented show the best case scenario for the ideas presented in the paper: (1) fewer than 1 Map task assigned per node (2) the Map task does not filter any data so there is a lot of data that must be shipped between the Map and Reduce phases (3) there is no combiner function (4) there are no concurrent MapReduce jobs. Although it is useful to see performance improvement in this best-case scenario, it would also be interesting to see how performance is affected in less ideal settings. I’m particularly interested in seeing what happens when there are many more Map tasks per node.

Even in a less ideal setting I would still expect to see performance improvement from the ideas presented in this paper due to the authors' solution to subproblem 1 (switching from a pull-model to a push-model). It would be great if the experiments could be extended so that the reader can understand how much of the performance improvement is from the authors' solution to subproblem 2 (the Reduce nodes can start early) and how much of the performance improvement is from the disk seeks/reads saved by switching from a pull-model to a push-model.

Another performance related conclusion one can draw from this paper (not mentioned by the authors, but still interesting) is that Hadoop still has plenty of room for improvement from a raw performance perspective. The main task in this paper involved tokenizing and sorting a 5.5 GB file. Hadoop had to use 60 (seriously, 60!) nodes to do this in 900 seconds. The techniques introduced in the paper allowed Hadoop to do this in 600 seconds. But does anyone else think they can write a C program that runs on a single node that runs in half the time on 1/60th of the machines?

Conclusion

The ideas presented in this paper provide clear advantages for interactive query workloads on Hadoop. For this reason, I look forward to the modifications made by the authors making it into the standard Hadoop distribution. Since HadoopDB is able to leverage improvements made to Hadoop, these techniques should be great for HadoopDB as well. While I think the performance advantages in the paper may be a little overstated, I do believe that switching from pull to push will improve performance by at least a small amount for many queries, and the ability to support online aggregation and continuous queries is a nice contribution. My suggestion to the authors would be to spend more time focusing on online aggregation and continuous queries, and less time extolling the performance advantages of their techniques. For example, one question I had about online aggregation is the following: the running aggregation is done using the Map tasks that finish first. But the problem is that Map tasks run on congruous chunks of an HDFS file --- data in the same partition are often more similar to each other than data on other partitions. Hence, might data skew become a problem?

Wednesday, October 14, 2009

Greenplum announces column-oriented storage option

I checked Curt Monash’s blog today (as I do on a somewhat daily basis) and saw a new post announcing Greenplum’s new column-oriented storage option. In my opinion, this is pretty big news. I was amused to see that Curt, in his post, correctly predicted pretty much everything I was going to say, but nonetheless, I feel obligated to post my reactions to this news:

Quick hit reactions:

(1) Congrats to Greenplum and their strong technical team for adding
column-oriented storage. They have essentially created a hybrid storage layer --- now you can store data in rows (write or read optimized) or in columns. I have previously written a long (and surprisingly popular --- 1,638 unique visits) blog post on hybrid column/row-oriented storage schemes. I would put the Greenplum scheme under the “Approach 3: fine-grained hybrids” classification. This puts them in the same category as the Vertica Flexstore scheme (more on this in a second).

(2) I strongly agree with the Greenplum philosophy that religious wars of columns vs rows will not get you anywhere. The fact of the matter is that for some workloads, row-oriented storage can get you an order of magnitude performance improvement relative to column-stores, and for other workloads, column-oriented storage can get you an order of magnitude performance improvement relative to row-stores. Hybrid schemes can theoretically get you the best of both worlds, and that can be a big win.

(3) For a few years now, I've lumped Greenplum and Aster Data together in my data warehouse market world view, even though Greenplum is slightly older and has more customers. Both are software-only solutions, and are big proponents of commodity hardware. Both partner with hardware vendors and will sell you bundled hardware with their software in a pre-optimized appliance if you prefer that to the software-only solution. Both started with vanilla PostgreSQL and turned it into a shared-nothing, MPP analytical DBMS. Both are players in the "big data" game, and heavily market their scalability. Both are west coast companies. Both support in-database MapReduce capabilities (they even announced this on the same day!). Both had research papers in the major DBMS research conferences this year. They both have embraced "the cloud". They even share some of the same customers (e.g. Fox Interactive Media).

The word on the street is that the main difference between the two companies is that Greenplum has made significant modifications to the PostgreSQL code-base and the integration of the DBMS with the distribution (parallelization) layer is more "monolithic", whereas Aster Data has kept more of the original PostgreSQL code, and treats the database more like a black box. This announcement by Greenplum is further evidence that they are more than willing to edit the PostgreSQL code, as significant modifications to the storage layer was required.

(4) Kudos to Greenplum for being open and honest about the limitations of their column-store option. It is a column-store at the storage layer only. This certainly is still a big win for queries that access few attributes from wide tables. However, before I had my own blog, I wrote a guest article for Vertica’s blog explaining that column-oriented storage only gets you part of the way towards column-store performance --- writing a query executor optimized for column-oriented storage can get you an additional order of magnitude performance improvement. I explain this at length in two academic papers (see here and here). I think papers coined the terms "early materialization" and "late materialization" for explaining whether columns are being put back together into tuples (rows) at the beginning or end of execution of a query plan. In retrospect I have regretted using this term ("tuple construction" is a little more descriptive and easier to understand than "tuple materialization"). The fact that Greenplum uses the term “early materialization” to describe their column-oriented scheme is evidence that they’ve read these papers (more kudos!) and will start to implement the low-hanging fruit in the query executor to get increased performance improvement from their column-oriented storage. Hence, I expect that their column-oriented feature (while probably already useful now) will continue to get better in future releases. In the short-term, one should not expect performance approaching regular column-stores.

(5) In my previous blog post on hybrid column/row-oriented storage schemes, I mentioned that it is far easier for a column-store to implement the “Approach 3: fine-grained hybrids” scheme than a row-store, since the row-store would have to make modifications at the storage, query executor, and query optimizer layers, while the column-store (that implements both early and late materialization) would only have to make modifications at the storage layer (since the query executor and optimizer are already capable of working with rows). This would lead me to believe that the Vertica FlexStore scheme is more immediately useful than the Greenplum hybrid storage scheme. But, as I have written before, my history with Vertica gives me some bias, so be careful what you do with that statement.

(6) This has been a pretty big month for column-oriented storage layers. First Oracle announced their new column-oriented compression scheme (classified in the “Approach 1: PAX” category in my previous blog post) and now Greenplum adds column-oriented storage. This should put more pressure on the other vendors in the data warehousing space (especially Microsoft and IBM) to come up with some sort of column-oriented storage option in the near future.

Sunday, September 13, 2009

Kickfire’s approach to parallelism

I was chatting with Raj Cherabuddi, founder of Kickfire recently about Kickfire’s approach to parallelism, and I think that some of the problems they have to deal with regard to parallelizing queries are quite different from standard parallel database systems, and warrant talking about in a blog post.

Parallel databases typically achieve parallelism via “data-partitioned parallelism”. The basic idea is that data is horizontally partitioned across processors, and each processor executes a query on its own local partition of the data. For example, let’s say a user wants to find out how many items were sold over $50 on August 8th 2009:

SELECT count(*)
FROM lineitem
WHERE price > $50 and date = ‘08/08/09’.

If elements of the line item table are partitioned across different processors, then each processor will execute the complete query locally, computing a count of all tuples that pass the predicates on its partition of the data. These counts can then be combined in a “merge” final step into a global count. The vast majority of the query (everything except the very short final merge step) can be processed completely in parallel across processors.

Let’s assume that there are no helpful indexes for this query. A naïve implementation would use the iterator model to implement this query on each processor. The query plan would consist of three operators: a table scan operator, a selection operator (for simplicity, let's assume it is separate from the scan operator), and an aggregation (count) operator. The aggregation operator would call “getNext” on its child operator (the selection operator), and the selection operator would in turn call getNext on its child operator (the scan operator). The scan would read the next tuple from the table and return control along with the tuple to the selection operator. The selection operator would then apply the predicate on the tuple. If the predicate passes, then the selection operator would return control, along with the tuple to the count operator which increments the running count. If the predicate fails, then instead of returning control to the count operator, the selection operator would instead call getNext again on its child (the scan operator) and apply the predicate to the next tuple (and keep on doing so until a tuple passes the predicate).

It turns out that the iterator model is quite inefficient from an instruction cache and data cache perspective, since each operator runs for a very short time before yielding control to a parent or child operator (it just processes one tuple). Furthermore, there is significant function call overhead, as “getNext” is often called multiple times per tuple. Consequently modern systems will run each operator over batches of tuples (instead of on single tuples) to amortize initial cache miss and function call overheads over multiple tuples. Operator output is buffered, and a pointer to this buffer is sent to the parent operator when it is the parent operator’s turn to run.

Whether the iterator model is used or the batched/staged model is used, there is only one operator running at once (per processor). Thus, the only form of parallelism is the aforementioned data-partitioned parallelism across processors. Even if a processor has multiple hardware contexts/cores, each processing resource will be devoted to processing the same one operator at a time (for cache performance reasons --- see e.g. this paper).

Kickfire, along with other companies that perform database operations directly in the hardware using FPGA technology, like Netezza and XtremeData, need to take a different approach to parallelism.

Before discussing the effect on parallelism, let’s begin with a quick overview of FPGA (field programmable gate array) technology, At the lowest level, a FPGA contains a array of combinational logic, state registers, and memory, that can be configured via a mesh of wires to implement desired logical functions. Nearly any complex algorithm, including database operations such as selection, projection, and join, can be implemented in a FPGA and in doing so, can be run at the speed of the hardware. Not only is performance improved by running these algorithms in the hardware, but the chip can also be run at orders of magnitude lower clock frequencies, which result in commensurate gains in power efficiency. In many cases, operations that take hundreds to thousands of CPU instructions can be performed in a single clock cycle in FPGA logic.

Kickfire therefore employs direct transistor-based processing engines in the FPGA to natively execute complete pipelines of relational database operations. The scale and density of the VLSI processes used in today FPGA’s enable a large number (order of hundreds) of these custom operations to occur in parallel, enabling the use of parallel processing algorithms that can improve performance even further.

The ability to have a large number of operations occurring in parallel means that the query processing engines do not need to switch back and forth between different operators (as described in the iterator and blocked/staged schemes above). However, if you want to get the most out of the parallelism, data partitioned parallelism can only get you so far.

For example, if every processing unit is devoted to performing a selection operation on a different partition of the data, then the result of the selection operator will build up, eventually exceeding the size of on-chip and off-chip buffers, thereby starving the execution engines. Consequently, Kickfire implemented efficient pipelined parallelism in addition to data partitioned parallelism, so that all operators in a query plan are running in the hardware at the same time, and data is being consumed at the approximate rate that it is being produced. Kickfire implemented advanced networking techniques in the areas of queuing and flow control to manage the data flow between the multiple producers and consumers, ensuring that the data, in most cases, stays on the chip, and only occasionally spills to the memory (off-chip buffers). Keeping the intermediate datasets live on the chip prevents memory latency and bandwidth from becoming a bottleneck.

However, data-partitioned parallelism is still necessary since operators consume data at different rates. For example, if a selection predicate has 50% selectivity (1 in 2 tuples pass the predicate) followed by an aggregation operator (as in the example above), then one wants to spend approximately twice as much time doing selection as aggregation (since the aggregation operator will only have half as many tuples to process as the selection operator), so Kickfire will use data-partitioned parallelism to have twice as many selection operators as the parent operator.

For example, a hardcoded Kickfire query might look like the figure below (ignoring the column-store specific operators which is a story for another day):

Note that the selection operations on T1 and T2 along with the join between these two tables occurs four times in the query plan (this is data-partitioned parallelism). Since the join has a reasonably low cardinality, there is no need to have the parent selection operator also appear four times in the query plan; rather it can appear twice, with each operator processing the results from two child join operators. Similarly, since the selection operators produce fewer outputs than inputs, the parent operator only needs to appear once. Data from one operator in the query plan is immediately shipped to the next operator for processing.

Kickfire also claims to be able to devote hardware to running multiple queries at the same time (inter-query parallelism). Getting the right mix of data-partitioned parallelism, pipelined parallelism, and inter-query parallelism is a tricky endeavor, and is part of Kickfire’s “secret sauce”. Clearly, this requires some amount of knowledge about the expected cardinality of each operator, and the Kickfire software uses information from the catalog to help figure all of this out. One would expect this process to get more difficult for complex queries --- it will be interesting to see how Kickfire performs on complex workloads as they continue to gain customer traction (Raj makes a compelling case that, in fact, it is the most complex queries where FPGA technology can shine the brightest).

In a nutshell, Kickfire uses column-oriented storage and execution to address I/O bottlenecks (column-oriented storage has been covered extensively elsewhere in my blog, but you can read about the specifics of Kickfire’s column-store on their blog), and FPGA-based data-flow architecture to address processing and memory bottlenecks. Their “SQL chip” acts as a coprocessor and works in conjunction with the x86 processors (which runs a SQL execution engine in the software when needed, though this is usually the exception path) in their base server. By alleviating these three important bottlenecks, Kickfire is able to deliver high performance; yet still achieves tremendous power efficiency thanks to the low clock frequencies.

Overall, although I have openly questioned Kickfire’s go-to-market strategy in past posts (see here and here), their non-technical departments seem a little disorganized at times (see Jerome Pineau’s experience), and some highly visible employees are no longer with the company (notably Ravi Krishnamurthy who presented their SIGMOD paper and Justin Swanhart who did a nice job explaining the Kickfire column-store features in the aforementioned write-up), I remain a fan of their technology. If they make it through the current difficult economic climate, it will be at the virtue of their technology and the tireless work of people like Raj. As the rate of clock speed increases of commodity processors continues to slow down, being able to perform database operations in the hardware becomes an increasingly attractive proposition.

Thursday, September 3, 2009

A tour through hybrid column/row-oriented DBMS schemes

There has been a lot of talk recently about hybrid column-store/row-store database systems. This is likely due to many announcements along these lines in the past month, such as Vertica’s recent 3.5 release which contained FlexStore, Oracle’s recent revelation that Oracle Database 11g Release 2 uses column-oriented storage for the purposes of superior compression, and VectoreWise’s recent decloaking that also announced an optional hybrid storage layer. Furthermore, analysts like Curt Monash and Philip Howard are predicting further development in this space (see here and here).

It’s surprising to me that it has taken this long before we started seeing commercially available implementations of hybrid systems. The research community has been publishing papers on hybrid systems for decades, with straightforward proposals that could easily be implemented in commercial systems starting to be published 8 years ago.

Different approaches to building hybrid systems yield very different performance properties and solve different sets of problems. Thus, as more hybrid systems become commercially available, and as more companies consider developing their own hybrid system, it is important that people understand the different tradeoffs involved between the various hybrid schemes. My goal in this post is to educate people who are not part of the SIGMOD/VLDB research community about three well-known approaches to building hybrid systems taken by different research efforts, and give pointers to research papers where the reader can find out more detail. Each approach has its own set of advantages and disadvantages and I will try to list both sides of the tradeoff in each case. The goal of this post is not to say that one type of hybrid scheme is better than another --- each scheme can be a good fit in the right situation.

Approach 1: PAX

The PAX scheme was published in 2001 in the VLDB paper “Weaving Relations for Cache Performance” by Natassa Ailamaki, David DeWitt, Mark Hill, and Marios Skounakis. The basic idea is the following: instead of storing data row-by-row within a disk block, store it column-by column. This is different than a “pure” column store, which stores each column in entirely separate disk blocks. The key difference is that if you had a table with 10 attributes, then in a “pure” column store, data from each original tuple is spread across 10 different disk blocks, whereas in PAX, all data for each tuple can be found in a single disk block. Since a disk block is the minimum granularity with which data can be read off of disk, in PAX, even if a query only accesses only 1 out of the 10 columns, it is impossible to read only this single column off of disk, since each disk block contains data for all 10 attributes of the table.

To understand why this is a good idea, some context is necessary. At the time the paper was written, column-stores (called the “DSM model” in the paper) had made very limited impact on the commercial landscape (there was Sybase IQ, but it was very much a niche product at the time). It was widely believed that the reason why the DSM model had failed to take off was due to the high tuple reconstruction costs.

Let’s say a query accessed three out of ten columns from a table and required some operator (like an aggregation) that required each of these three columns to be scanned fully. A row-store would have to do a complete table scan, wasting some I/O reading the 7 irrelevant columns in addition to the 3 relevant ones. But at least it would read the whole table sequentially. The column-store would only have to read the 3 relevant columns, but would have to seek back and forth between the 3 columns, doing tuple reconstruction. In 2001, servers had nowhere near the memory capacities they have today, so extensive prefetching was not an option (i.e., instead of reading one block from column 1 and then one block from column 2 and then one block from column 3 and then the next block from column 1, etc., prefetching allows you to read n blocks from column 1 and then n blocks from column 2, etc, allowing the seek cost to be amortized over a large amount of sequential reads, but you need enough memory to keep n blocks from each column in memory at once). Given how expensive seek costs are relative to sequential access, it is no accident column-stores didn’t take off until system memories increased to recent levels to allow for significant prefetching. (Research on late materialization to delay tuple reconstruction until the end of the query when fewer tuples need to be materialized also helped).

Anyway, PAX was able to achieve the CPU efficiency of column-stores while maintaining the disk I/O properties of row-stores. For those without detailed knowledge of column-stores, this might seem strange: the way most column-stores pitch their products is by accentuating the disk I/O efficiency advantage (you only need to read in from disk exactly what attributes are accessed by a particular query). Why would a column-store want equivalent disk access patterns as a row-store? Well, it turns out column-stores have an oft-overlooked significant CPU efficiency as well. The aspect of CPU efficiency that the PAX paper examined was cache hit ratio and memory bandwidth requirements. It turns out that having column data stored sequentially within a block allows cache lines to contain data from just one column. Since most DBMS operators only operate on one or two columns at a time, the cache is filled with relevant data for that operation, thereby reducing CPU inefficiency due to cache misses. Furthermore, only relevant columns for any particular operation need to shipped from memory.

Bottom line:

Advantages of PAX:

Yields the majority of CPU efficiency of column-stores
Allows for column-oriented compression schemes, which can improve compression ratio due do increased data locality (data from the same attribute domain is stored contiguously). This can improve performance since the more data can be compressed, the less time needs to be spent reading it in from storage.
Is theoretically fairly easy to implement in a row-store system to get some of the column-store advantages. I say “theoretically” because no commercial row-store system actually did this (to the best of my knowledge) until Oracle 11gR2.

Disadvantages of PAX

Equivalent I/O properties as row-stores (not counting compression) in the sense that irrelevant columns still have to read from storage in the same blocks as the needed columns for any particular query. In 2001 this was an advantage. Today, for typical analytical workloads, this is a significant disadvantage. (For less scan-oriented workloads, such as tuple lookups and needle-in-the-haystack queries, this remains an advantage).
The cache prefetching features on modern processors renders some of the cache efficiency of PAX obsolete (PAX no longer makes a large difference on cache hit ratio). However, it still reduces the demands on memory bandwidth, and other CPU advantages of column-stores, such as vectorized processing, remain possible to achieve in PAX.

Approach 2: Fractured Mirrors

This scheme was published in 2002 in the VLDB paper “A Case for Fractured Mirrors” by Ravi Ramamurthy, David DeWitt, and Qi Su. (Yes, you read that right. The University of Wisconsin DBMS group lead by DeWitt authored both seminal papers on hybrid column-store/row-store systems.) The approach is essentially the following: you’re going to replicate all of your data anyway for high availability and/or disaster recovery. Why not have different storage layouts in each replica? That way, if you have a tuple-lookup or a needle-in-a-haystack query, you send it to the row-store replica. If the query is more scan-oriented (e.g. an aggregation or summarization query), you send it to the column-store replica. The implementation in the paper is a little more complicated (to avoid skew, each node contains part of the column-store and part of the row-store), but the above description is the basic idea.

Most people agree (I think) that row-stores are more than an order of magnitude faster than column-stores for tuple-lookup queries, and column-stores are (at least) more than order of magnitude faster than row-stores for scan-oriented queries. Here is how one comes to this conclusion: To lookup a tuple in a row-store, one needs to read in just the one block that contains the tuple (let’s assume all relevant indexes for both the row-store and the column-store are in memory). In a column-store, one block needs to be read for each attribute. Assuming there are more than 10 attributes in the table, this is already more than an order of magnitude. On the other hand, for scan queries, if a query accesses less than 10% of the attributes of a table (the common case), column-stores get one order of magnitude improvement relative to row-stores immediately (for disk efficiency). Additionally, many have argued (see here and here) that column-stores get an additional order of magnitude improvement for CPU efficiency.

If you buy the above argument, then it is critical to use a scheme like fractured mirrors for mixed workloads. Given how often people talk about mixed workloads as being a key problem in enterprise data warehouses, it is surprising how long it has taken to see a commercial implementation along the lines written about in the research paper.

Advantages of Fractured Mirrors:

Every individual query gets sent to the optimal storage for that query. Performance is thus an order of magnitude better than either a pure row-store or a pure column-store on queries that are problematic for that type of store.

Disadvantages of Fractured Mirrors:

All data needs to be replicated. Obviously, in most cases you’re going to replicate the data anyway. But if you are already using the replication for something else (e.g. storing the data in a different sort order), then you either need to increase the replication factor or remove some of the additional sort orders in order to implement fractured mirrors.
If you really want to get the most out of this approach, you need to extend what is proposed in the research paper and have complete implementations of row-store and column-store systems (since column-stores have very different query execution engines and query optimizers than row-stores). This is obviously a lot of code, and precludes most companies from using the fractured mirrored approach. I am flummoxed as to why the only company with legitimate row-store and column-store DBMS products (Sybase with ASE and IQ) hasn’t implemented the fractured mirrors approach yet.

Approach 3: Fine-grained hybrids

The VLDB 2003 paper by Hankins and Patel and the CIDR 2009 paper by Cudre-Mauroux, Wu, and Madden are examples of this approach. Here, individual tables can be divided into both row and column-oriented storage. For example, if two columns are often accessed together (e.g. order-date and ship-date), they can be stored together (in rows). The remaining columns can be stored separately. This can be done within a disk block, within a table, or even at a slightly larger grain across tables. For example, if one table is often accessed via tuple lookups (e.g. a customer table), then it can be stored in rows; while other tables that are usually scanned (e.g. lineitem) can be stored in columns.

Advantages of fine-grained hybrids

If correct decisions are made about what attributes (and/or tables) should be stored in rows, and what attributes should be stored in columns, then you can get all of the performance advantages of fractured mirrors without the additional replication disadvantage.

Disadvantages of fine-grained hybrids

Depending on the DBMS system you start with, it can be complex to implement. You essentially have to have both row- and column-oriented execution engines, the optimizer has to know about the differences between row-storage and column-storage, and indexing code has to be updated appropriately. It turns out that it is much easier to implement in a column-store that already supports early materialization (early materialization refers to the ability to reconstruct rows from column-storage early in a query plan) than in other types of systems. This is because early-materialization requires query operators to be able to handle input in both column and row-oriented format (it will be in column format if tuples haven’t been materialized yet, otherwise it will be in row-oriented format). Hence, the execution engine and optimizer already has knowledge about the difference between rows and columns and can act appropriately.
It requires some knowledge about a query workload. Otherwise incorrect decisions about tuple layout will be made, leading to suboptimal performance.

Commercial availability

Oracle, Vertica, and VectorWise have announced hybrid systems using one of these schemes (I have no inside knowledge about any of these implementations, and only know what’s been published publicly in all cases). It appears that Oracle (see Kevin Closson’s helpful responses to my questions in a comment thread to one of his blog posts) and VectorWise use the first approach, PAX. Vertica uses fine-grained hybrids (approach 3), though they probably could use their row-oriented storage scheme to implement fractured mirrors (approach 2) as well, if they so desire. Given that two out of the three authors of the fractured mirrors paper have been reunited at Microsoft, I would not be surprised if Microsoft were to eventually implement a fractured mirrors hybrid scheme.

Conclusion

There was nearly a decade of delay, but at long last we’re starting to see hybrid row/column-stores hit the marketplace. Row-stores and column-stores have very different performance characteristics, and they tend to struggle where the alternative excels. As workloads get more complex, hybrid systems will increase in importance.

Wednesday, August 19, 2009

Netezza TwinFin: A step towards a potential acquisition?

Ever since Netezza became a public company, every once and a while someone tries to start a rumor that Netezza is on the verge of being acquired (likely started by people who want a quick return on their Netezza stock buy). These rumors usually involve a company like Oracle buying Netezza, which never made a lot of sense to me, since Oracle has their own DBMS product and has very little reason to buy a much smaller competitor like Netezza and maintain two lines of code that target the same market. This is why it wasn’t surprising that Microsoft chose to acquire DATAllegro instead of Netezza, even though Netezza was much farther along than DATAllegro and had a larger customer base. DATAllegro essentially left the DBMS engine in tact, with its key technological assets sitting on top of the DBMS, turning many single-node Ingres instances into a large, shared-nothing, MPP DBMS. Since DATAllegro used a nice, modular architecture, Microsoft was able to replace Ingres with SQL Server, and use DATAllegro’s technology to turn SQL Server into a MPP DBMS without significant modifications to the core SQL Server DBMS engine (see Microsoft’s Project Madison).

But two events now make me wonder if Netezza might actually end up being acquired by a vendor that currently sells a competing DBMS product (likely either IBM with DB2 or HP with NeoView).

First, there was the release Oracle Database Machine. Oracle openly admits that the Oracle Database Machine frequently gets a factor of between 10 and 70 performance improvement relative to previous Oracle offerings (i.e. Oracle RAC) on scan-heavy analytical workloads. But the center of the Oracle Database Machine is …. Oracle RAC! So how does it get the order of magnitude performance improvement relative to RAC? By connecting RAC (using Infiniband) to a shared-nothing storage layer (Exadata) that can perform database scans at extremely high speeds and do some basic database operations like tuple selection and projection. Since scan-oriented queries are limited by the speed with which the scan can occur, simply connecting RAC to a storage layer that can do scans really well yields significant improvement.

Perhaps Netezza’s greatest asset is its ability to achieve high performance on table scans. By using FPGAs to perform decompression, selection, and projection as data is read off of disk, Netezza is able to perform scans faster than what competitors (at least row-store competitors) can do on commodity hardware. If the Oracle Database Machine is successful (Larry Ellison said at a recent earnings call that it "is shaping up to be our most exciting and successful new product introduction in Oracle’s 30 year history"), I would expect its competitors to follow suit --- and connect their DBMS engines to a high performance storage layer the way Oracle did with Exadata.

Second, Netezza’s recent move to re-architect their appliance via TwinFin (announced a few weeks ago) is a clear embrace of commodity hardware components. Before this redesign, Netezza was a monolithic appliance. As detailed by ComputerWeekly, if you wanted to upgrade storage or processing capacity, you had to wait for the next Netezza release and replace the whole appliance with the Netezza’s next generation. Now, the core part of the Netezza technology can be placed in the “sidecar” expansion slot in the standard IBM BladeServer family of servers. This allows customers to upgrade the IBM blades independently of the Netezza technology.

Looking at it a different way: the technology behind Netezza’s stellar scan performance can now be found in a nice modular component, the “DB Accelerator” card, that can be placed in standard expansion slots in blade servers. The move towards a more modular architecture is reminiscent of the DATAllegro architecture that allowed Microsoft to replace Linux with Windows and Ingres with SQL Server and keep the majority of the rest of the DATAllegro technology. DATAllegro was sold for $275 million to Microsoft when it only had 3-4 customers.

Netezza’s current market cap is currently $550 million and it has orders of magnitude more customers than DATAllegro did (and is currently profitable). Hence it seems like a prime candidate for an acquisition. Its recent architectural redesign allow it to be acquired even by a company with a competing data warehouse product, since its core technology can be used in the storage layer as a drop in accelerator for table scans and used in a similar way that Oracle uses Exadata. IBM seems like a natural fit given their close partnership on TwinFin. Otherwise HP seems like an option since NeoView seems like it is having trouble getting off of the ground. Time will tell, but I will no longer ignore Netezza acquisition rumors the way I once did.

Tuesday, August 4, 2009

Netezza's competitors open fire

It's fairly unusual to see a company openly attack a competitor in a public forum. I don't know the exact reason, but I presume it has something to do with the old dictum "there's no such thing as bad publicity", so giving a competitor free publicity of any kind (even the negative sort) is deemed a bad idea. Or maybe it's because the attack might backfire --- if the attack is not made using solid reasoning, the company might come off looking foolish. Or maybe it's because attacking a competitor is, in a way, a tacit validation of their position as an equal --- small, insignificant companies would be ignored; an attack is acknowledging that the two companies see each other often in competitive situations, and might encourage a potential customer to consider the competitor in the same POC when they might have not done so otherwise.

This is why the back and forth between Netezza and its competitors has been so jarring. By my estimation, it started when Larry Ellison positioned the new Oracle Exadata release back in September 2008 against Netezza, questioning Netezza's fault tolerance, DW functionality, and DBMS know-how (http://www.rittmanmead.com/2008/09/24/live-blogging-from-the-larry-ellison-keynote-oow-2008/). This was then responded to by Netezza, who became absolutely obsessed with Oracle Exadata, attacking them in multiple postings on their Data Liberators blog (see here, here, here, and here) even resorting to name calling ("Oracle Exaggerdata").

In the last few days, the attacks on Netezza have increased in intensity. First, I came across an Oracle blog post that basically claimed that Oracle Exadata is better than Netezza along every possible dimension: storage, CPU power, memory, interconnect, load performance, query performance, and architecture.

Then, I came across an Aster Data blog post which made wild claims such as Netezza's new release is an indication that Netezza regrets building their DBMS around FPGAs and is now desperately trying to abandon ship and switch to mainstream CPUs.

Be careful reading both of these attacks. I find them both full of FUD and disagree with much of the premise of both of them. Oracle's blog post omits comparing Netezza and Oracle along perhaps the two most important dimensions: price/performance and total cost of ownership. Oracle brags that their database machine uses a 20Gb infiniband interconnect while Netezza only uses 1Gb ethernet. But presumably the price of the expensive interconnect gets passed on to the customer --- it could easily be argued that Netezza's use of 1GB ethernet is an indication that their architecture might be superior --- Oracle needs infiniband to connect the storage and computation layers of their system; Netezza's ability to push computation to the data allows them to avoid having to include the high cost interconnect. I would guess that Netezza's price/performance is significantly superior to Oracle's, but trying to calculate the price of Oracle's database machine is far too complicated to put some meat behind this statement. Furthermore, Netezza's superior total cost of ownership relative to Oracle is common knowledge, I would be surprised to see someone argue otherwise.

I also find the Aster Data post full of FUD. Claiming that Netezza is trying to abandon their FPGA approach is ridiculous (in my opinion). They have invested a huge amount into doing decompression, projections, selections, and other DBMS operations inside the FPGA, and there are performance advantages in doing so. The redesign of their architecture was necessary to be able to improve caching of data in memory (to improve repeated scans of the same table) and to add more commodity components to their system, allowing them to take better advantage of upgrades to the disk and CPU technology they incorporate.

Though the Oracle and Aster Data reactions are misleading, I'm still a little worried about Netezza. Like everyone else, I was looking forward to their "big" announcement at TDWI, and was disappointed when I found out what it was. Sure, the internal architectural redesign is big news to Netezza internally, but ultimately, all it means to the customers is that Netezza can now do some things that its competitors already can do. Sure lower prices and a better ability to handle mixed workloads are nice, but I was expecting something a little more radical. I guess the lesson to be learned is that it is never a good idea to prepare people for a big announcement --- it just leaves lots of potential for disappointment.

Wednesday, July 29, 2009

Watch out for VectorWise

In the last few years, there have been so many new analytical DBMS startups de-cloaking that it’s difficult to keep track of them all. Just off the top of my head, I would put Aster Data, DATAllegro, Dataupia, Exasol, Greenplum, Infobright, Kickfire, ParAccel, Vertica, and XtremeData in that category. Once you add the new analytical DBMS products from established vendors (Oracle Exadata and HP Neoview) we’re at a dozen new analytical DBMS options to go along with older analytical DBMS products like Teradata (industry leader), Netezza, and Sybase IQ. Finally, with the free and open source HadoopDB, we now have (at least) sixteen analytical DBMS solutions to choose from.

Given the current overwhelming number of analytical DBMS solutions, I suspect that VectorWise’s sneak preview happening this week is not going to get the attention that it deserves. VectorWise isn’t making a splash with a flashy customer win (like Aster Data did with MySpace), or a TPC-H benchmark win (like Kickfire and ParAccel did), or an endorsement from a DBMS legend (like Vertica did). They’re not even going to market with their own solution; rather, they’re teaming up with Ingres for a combined solution (though the entire Ingres storage-layer and execution engine has been ripped out and replaced with VectorWise). But I’m telling you: VectorWise is a company to watch.

Here are the reasons why I like them:

They are a column-store. I strongly believe that column-stores are the right solution for the analytical DBMS market space. They can get great compression ratios with lightweight compression algorithms, and are highly I/O efficient. In my opinion, the only reason why there are companies on the above list that are not column-stores is that they wanted to accelerate time to market by extending previously existing DBMS code, and the most readily available DBMS code at the time was a row-store. Any DBMS built from scratch for the (relational, structured data) analytical DBMS market should be a column-store.
Column-stores are so I/O efficient that CPU and/or memory usually become bottlenecks very quickly. Most column-stores do very careful optimizations to eliminate these bottlenecks. But to me, VectorWise has gone the extra mile. The query operators are run via a set of query execution primitives written in low-level code that allow compilers to produce extremely efficient processing instructions. Vectors of 100-1000 values within a column get pipelined through a set of query operations on that column, with many values typically being processed in parallel by SIMD (single instruction, multiple data) capabilities of modern CPU chips. Most database systems are unable to take advantage of SIMD CPU capabilities --- the tuple-at-a-time (iterator) processing model of most database systems is just too hard for compilers to translate to SIMD instructions. VectorWise has gone to great lengths to make sure their code results in vectorized CPU processing. Their execution primitives are also written to allow CPUs to do efficient out-of-order instruction execution via loop-pipelining (although compilers are supposed to discover opportunities for loop-pipelining on their own, without carefully written code, this doesn’t happen in practice as often as it should). So with highly optimized CPU-efficient code, along with (1) operator pipelining to keep the active dataset in the cache and (2) column-oriented execution reducing the amount of data that must be shipped from memory to the CPU, VectorWise reduces the CPU and memory bottlenecks in a major way. The bottom line is that VectorWise is disk efficient AND memory efficient AND CPU efficient. This gets you the total performance package.
Their founders include Peter Boncz and Marcin Zukowski from CWI. I generally think highly of the DBMS research group at CWI, except for one of their papers which ... actually ... maybe it’s better if I don’t finish this sentence. I have spoken highly about them in previous posts on my blog (see here and here).
It looks likely that their solution will be released open source. I was unable to get a definite commitment from Boncz or Zukowski one way or another, but the general sense I got was that an open source release was likely. But please don’t quote me on that.
If the VectorWise/Ingres solution does get released open source, I believe they will be an excellent column-store storage engine for HadoopDB. I have already requested an academic preview edition of their software to play with.

In the interest of full disclosure, here are a few limitations of VectorWise

It is not a shared-nothing, MPP DBMS. It runs on a single machine. This limits its scalability to low numbers of terabytes. However, VectorWise is targeting the same “mass market” that Kickfire is, where the vast majority of data warehouses are less than 10TB. Furthermore, as mentioned above, it is a great candidate to be turned into a shared-nothing, parallel DBMS via the HadoopDB technology, and I look forward to investigating this opportunity further.
In my experience, having large amounts of low-level, CPU optimized code is hard to maintain over time, and might limit how nimble VectorWise can be to take advantage of new opportunities. Portability might also become a concern (in the sense that not all optimizations will work equally well on all CPUs). However, I would not put anything past such a high quality technical team.

Two final notes:

I like their go-to-market strategy. Like Infobright and Kickfire they are after the low-priced, high volume analytical DBMS mass market. But the problem with the mass market is that you need a large global sales and support team to handle the many opportunities and customers. Startups that target the high-end have it much easier in that they can get through the early stages of the company with a few high-priced customers and don’t need to invest as much in sales and support. By getting into bed with Ingres, VectorWise gets to immediately take advantage of the global reach of Ingres, a key asset if they want to target the lower end of the market.
CWI is also the creator of the open source MonetDB column-store. VectoreWise is a completely separate codeline, and makes several philosophical departures from MonetDB. According to VectorWise, MonetDB’s materialization of large amounts of intermediate data (e.g., from running operators to completion) makes it less scalable (more suited for in-memory data sets) than VectorWise. VectorWise has superior pipelined parallelism, and vectorized execution. I have not checked with the MonetDB group to see if they dispute these claims, but my knowledge from reading the MonetDB research papers is generally in line with these statements, and my understanding is that the MonetDB and VectorWise teams remain friendly.

Monday, July 20, 2009

Announcing release of HadoopDB (longer version)

If you have a short attention span see the shorter blog post.
If you have a large attention span, see the complete 12-page paper.

There are two undeniable trends in analytical data management. First, the amount of data that needs to be stored and processed is exploding. This is partly due to the increased automation with which data can be produced (more business processes are becoming digitized), the proliferation of sensors and data-producing devices, Web-scale interactions with customers, and government compliance demands along with strategic corporate initiatives requiring more historical data to be kept online for analysis. It is no longer uncommon to hear of companies claiming to load more than a terabyte of structured data per day into their analytical database system and claiming data warehouses of size more than a petabyte (see the end of this write-up for some links to large data warehouses).

The second trend is what I talked about in my last blog post: the increased desire to perform more and more complex analytics and data mining inside of the DBMS.

I predict that the combination of these two trends will lead to a scalability crisis for the parallel database system industry. This prediction flies in the face of conventional wisdom. If you talk to prominent DBMS researchers, they'll tell you that shared-nothing parallel database systems horizontally scale indefinitely, with near linear scalability. If you talk to a vendor of a shared-nothing MPP DBMS, such as Teradata, Aster Data, Greenplum, ParAccel, and Vertica, they'll tell you the same thing. Unfortunately, they're all wrong. (Well, sort of.)

Parallel database systems scale really well into the tens and even low hundreds of machines. Until recently, this was sufficient for the vast majority of analytical database applications. Even the enormous eBay 6.5 petabyte database (the biggest data warehouse I've seen written about) was implemented on a (only) 96-node Greenplum DBMS. But as I wrote about previously, this implementation allows for only a handful of CPU cycles to be spent processing tuples as they are read off disk. As the second trend kicks in, resulting in an increased amount and complexity of data analysis that is performed inside the DBMS, this architecture will be entirely unsuitable, and will be replaced with many more compute nodes with a much larger horizontal scale. Once you add the fact that many argue that it is far more efficient from a hardware cost and power utilization perspective to run an application on many low-cost, low-power machines instead of fewer high-cost, high-power machines (see e.g., the work by James Hamilton), it will not be at all uncommon to see data warehouse deployments on many thousands of machines (real or virtual) in the future.

Unfortunately, parallel database systems, as they are implemented today, do not scale well into the realm of many thousands of nodes. There are a variety of reasons for this. First, they all compete with each other on performance. The marketing literature of MPP database systems are littered with wild claims of jaw-dropping performance relative to their competitors. These systems will also implement some amount of fault tolerance, but as soon as performance becomes a tradeoff with fault tolerance (e.g. by implementing frequent mid-query checkpointing) performance will be chosen every time. At the scale of tens to hundreds of nodes, a mid-query failure of one of the nodes is a rare event. At the scale of many thousands of nodes, such events are far more common. Some parallel database systems lose all work that has been done thus far in processing a query when a DBMS node fails; others just lose a lot of work (Aster Data might be the best amongst its competitors along this metric). However, no parallel database system (that I'm aware of) is willing to pay the performance overhead to lose a minimal amount of work upon a node failure.

Second, while it is possible to get reasonably homogeneous performance across tens to hundreds of nodes, this is nearly impossible across thousands of nodes, even if each node runs on identical hardware or on an identical virtual machine. Part failures that do not cause complete node failure, but result in degraded hardware performance become more common at scale. Individual node disk fragmentation and software configuration errors can also cause degraded performance on some nodes. Concurrent queries (or, in some cases, concurrent processes) further reduce the homogeneity of cluster performance. Furthermore, we have seen wild fluctuations in node performance when running on virtual machines in the cloud. Parallel database systems tend to do query planning in advance and will assign each node an amount of work to do based on the expected performance of that node. When running on small numbers of nodes, extreme outliers from expected performance are a rare event, and it is not worth paying the extra performance overhead for runtime task scheduling. At the scale of many thousands of nodes, extreme outliers are far more common, and query latency ends up being approximately equal to the time it takes these slow outliers to finish processing.

Third, many parallel databases have not been tested at the scale of many thousands of nodes, and in my experience, unexpected bugs in these systems start to appear at this scale.

In my opinion the "scalability problem" is one of two reasons why we're starting to see Hadoop encroach on the structured analytical database market traditionally dominated by parallel DBMS vendors (see the Facebook Hadoop deployment as an example). Hadoop simply scales better than any currently available parallel DBMS product. Hadoop gladly pays the performance penalty for runtime task scheduling and excellent fault tolerance in order to yield superior scalability. (The other reason Hadoop is gaining market share in the structured analytical DBMS market is that it is free and open source, and there exists no good free and open source parallel DBMS implementation.)

The problem with Hadoop is that it also gives up some performance in other areas where there are no tradeoffs for scalability. Hadoop was not originally designed for structured data analysis, and thus is significantly outperformed by parallel database systems on structured data analysis tasks. Furthermore, it is a relatively young piece of software and has not implemented many of the performance enhancing techniques developed by the research community over the past few decades, including direct operation on compressed data, materialized views, result caching, and I/O scan sharing.

Ideally there would exist an analytical database system that achieves the scalability of Hadoop along with with the performance of parallel database systems (at least the performance that is not the result of a tradeoff with scalability). And ideally this system would be free and open source.

That's why my students Azza Abouzeid and Kamil Bajda-Pawlikowski developed HadoopDB. It's an open source stack that includes PostgreSQL, Hadoop, and Hive, along with some glue between PostgreSQL and Hadoop, a catalog, a data loader, and an interface that accepts queries in MapReduce or SQL and generates query plans that are processed partly in Hadoop and partly in different PostgreSQL instances spread across many nodes in a shared-nothing cluster of machines. In essence it is a hybrid of MapReduce and parallel DBMS technologies. But unlike Aster Data, Greenplum, Pig, and Hive, it is not a hybrid simply at the language/interface level. It is a hybrid at a deeper, systems implementation level. Also unlike Aster Data and Greenplum, it is free and open source.

Our paper (that will be presented at the upcoming VLDB conference in the last week of August) shows that HadoopDB gets similar fault tolerance and ability to tolerate wild fluctuations in runtime node performance as Hadoop, while still approaching the performance of commercial parallel database systems (of course, it still gives up some performance due to the above mentioned tradeoffs).

Although HadoopDB currently is built on top of PostgreSQL, other database systems can theoretically be substituted for PostgreSQL. We have successfully been able to run HadoopDB using MySQL instead, and are currently working on optimizing connectors to open source column-store database systems such as MonetDB and Infobright. We believe that swtiching from PostgreSQL to a column-store will result in even better performance on analytical workloads.

The initial release of the source code for HadoopDB can be found at http://db.cs.yale.edu/hadoopdb/hadoopdb.html. Although at this point this code is just an academic prototype and some ease-of-use features are yet to be implemented, I hope that this code will nonetheless be useful for your structured data analysis tasks!

Announcing release of HadoopDB (shorter version)

I'm pleased to announce the release of HadoopDB. HadoopDB is:

A hybrid of DBMS and MapReduce technologies targeting analytical query workloads
Designed to run on a shared-nothing cluster of commodity machines, or in the cloud
An attempt to fill the gap in the market for a free and open source parallel DBMS
Much more scalable than currently available parallel database systems and DBMS/MapReduce hybrid systems (see longer blog post).
As scalable as Hadoop, while achieving superior performance on structured data analysis workloads

For more, see:

Wednesday, July 15, 2009

Sybase IQ throws its hat into the in-DBMS analytics ring

Sybase IQ announced this week the availability of Sybase IQ 15.1. The press release made it clear that this version is all about in-DBMS analytics. Perhaps the most notable addition is the integration of the DB Lytix (a product from Fuzzy Logix) analytics library into Sybase IQ, so DB Lytix functions can be run inside of the DBMS.

It's possible that I'm looking in the wrong places, but I have seen very little attention paid to this announcement. I take this as a symptom of a very good thing: so many DBMS vendors are adding in-DBMS analytics features to their products, that announcements such as this are not really news any more. In the last 18 months we've had the announcement of the Teradata-SAS partnership (where a number of SAS functions are being implemented inside Teradata and run in parallel on Teradata's shared-nothing architecture), Netezza opening up their development platform so that Netezza partners and customers can implement complex analytical functions inside the Netezza system, and the announcement of in-database MapReduce functionality by Greenplum, Aster Data, and Vertica (though, as explained by Colin White, the vision of when MapReduce should be used --- e.g., for ETL or user queries --- varies across these three companies). Though not announced yet, I'm told that Microsoft Project Madison (the shared-nothing version of SQL Server to be released in 2010) will natively run windowed analytics functions in parallel inside the DBMS.

As a DBMS researcher, this is great news, as the DBMS is starting to become the center of the universe, the location where everything happens. Though some would argue that in-DBMS analytics has been available for decades via user-defined functions (UDFs), many agree that UDF performance has been disappointing at best, and shipping data out of the DBMS to perform complex analytics has been common practice. The reasons for this are many-fold: query optimizers have trouble estimating the cost of UDFs, arbitrary user written code is hard to automatically run in parallel, and various security and implementation bugs manifest themselves (see, e.g., Section 4.3.5 of the "A Comparison of Approaches to Large Scale Data Analysis" paper).

One interesting trend to note is that there seem to be two schools of thought emerging with different opinions on how to allow complex in-DBMSs analytics without resorting to regular UDFs. Teradata, Microsoft, Sybase, and, to an extent, Netezza, all seem to believe that providing a library of preoptimized functions distributed with the software is the way to go. This allows the vendor to build into the system the ability to run these functions in parallel across all computing resources (a shared-nothing MPP cluster in all cases except Sybase) and to make sure these functions can be planned appropriately along with other data processing operations. The other school of thought is adopted by vendors that allow customers more freedom to implement their own functions, but constrain the language in which this code is written (such as MapReduce or LINQ) to facilitate the automatic parallelization of this code inside the DBMS.

Obviously, these camps are not mutually exclusive. As in-DBMS analytics continues to grow in popularity, it is likely we'll start to see vendors adopt both options.

Whatever school of thought you prefer, it is clear that the last 18 months has seen tremendous progress for in-database analytics. Shipping data out of the DBMS for analysis never made a lot of sense, and finally, viable alternative options are emerging. Database systems are becoming increasingly powerful platforms for data processing, and this is good for everyone.

Monday, July 6, 2009

ParAccel and their puzzling TPC-H results

[Warning: the ParAccel TPC-H results referred to in this post have since been challenged by a competitor and found to be in violation of certain TPC rules (I cannot find any public disclosure of which specific rules were violated). These results have since been removed from the TPC-H Website, as of Sep 24th 2009.]

At SIGMOD last week, I was chatting with Mike Stonebraker (chatting might be the wrong verb; it was more like downloading knowledge and advice). ParAccel had a paper at SIGMOD, and the week before they had announced that they topped the TPC-H benchmark at the 30TB scale. Naturally, this resulted in ParAccel coming up in our conversation. Mike asked me a question about the ParAccel TPC-H results that floored me --- I was really disappointed I didn't think of this question myself.

Before telling you the question, let me first give you some background. My knowledge about ParAccel's TPC-H results came from reading two blog posts about it. First, there was Merv Adrian's positive post on the subject, and then there was Curt Monash's negative post. Monash's negativity stemmed largely from the seemingly unrealistic configuration of having nearly a petabyte of disk to store 30TB of data (a 32:1 disk/data ratio). This negativity was responded to in a comment by Richard Gostanian who has firsthand knowledge since he helped ParAccel get these benchmark results. The relevant excerpt of his comment is below:

"Why did ParAccel require 961 TB of disk space? The answer is they didn’t. They really only needed about 20TB (10TB compressed times 2 for mirroring). But the servers they used, which allowed for the superb price-performance they achieved, came standard with 961 TB of storage; there simply was no way to configure less storage. These Sun servers, SunFire X4540s, are like no other servers on the market. They combine (1) reasonable processing power (two Quad Core 2.3 GHz AMD Opteron processors) (2) large memory (64 GB), (3) very high storage capacity (24 or 48 TB based on 48 x 500GB or 1TB SATA disks) and 4) exceptional I/O bandwidth (2 - 3 GB/sec depending upon whether you’re working near the outer or inner cylinders) all in a small (4 RU), low cost (~$40K), power efficient (less than 1000 watts) package.

What any DBMS needs to run a disk-based TPC-H benchmark is high I/O throughput. The only economical way to achieve high disk throughput today is with a large number of spindles. But the spindles that ship with today’s storage are much larger than what they were, say, 5 years ago. So any benchmark, or application, requiring high disk throughput is going to waste a lot of capacity. This will change over the next few years as solid-state disks become larger, cheaper and more widely used. But for now, wasted storage is the reality, and should not be viewed in a negative light."

I remember thinking to myself "sounds reasonable" and agreeing with both sides. Yes, there is wasted space, but you need the large number of spindles to get the high I/O throughput and most TPC-H configurations resemble this one.

Then along came Mike Stonebraker, who asked me: "Why does ParAccel need such high I/O throughput? They're a column-store!"

This question immediately triggered my memory of all the experiments I ran on C-Store. I generally ran C-Store on a system with 2 CPUs and 4 disks getting an aggregate I/O bandwidth of 180MB/s (i.e. 90MB/s per CPU) and I rarely saw a disk-bottlenecked query. Why? (1) Because column-stores are already super I/O efficient by only reading the relevant columns for each query and (2) Because they compress data really well (I usually saw at least 5:1) so the effective I/O bandwidth was 90MB/s * 5 = 450MB/s per CPU which, for any reasonably interesting query, is faster than the CPU can generally keep up with.

Meanwhile, the ParAccel TPC-H configuration consisted of nodes with orders of magnitude more disk I/O (3GB/s from the 48 disks) yet just 4 times the CPU processing power. Doing the math: 3GB/s divided by 8 CPU cores = 375MB/s per CPU core. Gostanian said that there was a 3:1 compression ratio, so we're talking about an astounding effective 1GB/s per CPU core.

There's simply no way TPC-H queries are simple enough (except for maybe query 1) for the CPUs to process them at 1GB/s (see my related post on this subject). So it must be the case that part of that 1GB/s is being wasted reading data that will ultimately not be processed (e.g. like unused columns in a row-store). But ParAccel is a column-store! Hence the Mike Stonebraker conundrum.

At this point I was already ready to bet the house that ParAccel does not have the I/O efficiency of a standard column-store. But one who is unconvinced could still argue that disks are cheap and ParAccel was willing to pay the small amount extra for the high I/O bandwidth (which would help the simplest queries that are not CPU intensive) even though most queries would be CPU limited and the extra bandwidth would not help them.

To test this theory I wasted some time slogging through the 82-page ParAccel full disclosure report on the 30TB TPC-H results (seriously!). Query 1 is arguably the easiest query in TPC-H (it scans the single lineitem table, accesses 7 of its attributes, applies a single predicate and several very basic aggregations). There are no joins in this query and the aggregations are really easy since there are only 4 unique groups, so the aggregations can be done during the scan via a hash aggregation. Since it is so basic, this query should be disk limited on most systems. A standard column-store should be able to run this query in the time it takes to read the 7 columns off disk. There are 180 billion rows in the lineitem table, and each attribute should take up about 4 bytes (this is conservative --- with compression this number should be much less). So a total of 180 billion x 7 x 4 = approximately 5TB needs to be read off disk for this query (in reality this number will likely be smaller due to compression and memory caching). So given that there were 43 servers, each with 3GB/s I/O bandwidth, a conservative estimate of how fast this query should be run using a standard column-store is (5TB / (43*3GB/s)) or approximately 40 seconds. But this query takes ParAccel on average 275 seconds according to the report. This is further evidence that ParAccel does not have the I/O efficiency of a standard column-store.

Hence, ever since I looked into this, I've been just totally flummoxed by these puzzling numbers.

Finally, today I finally got around to reading their SIGMOD paper describing the internals of their system. This paper was sadly sparse on details (it was only 4 pages when the SIGMOD limit was 14 and these 4 pages were not that meaty). The main thing I learned from the paper is that ParAccel has heavy PostgreSQL roots. It seems like they started with PostgreSQL and did their best to transform it into a MPP column-store. The one area where there was some amount of meat was their new optimizer's ability to handle plans with "thousands of joins". This of course raised more questions for me. Why would a start-up (with so many things that need to be implemented) be worried about queries with that many joins? Maybe there are a handful of customers that need this many, but the vast majority need far less. But then I got to Figure 2, when they ran queries from TPC-H and TPC-DS and the x-axis had queries with up to 42 joins. But unless I'm sorely mistaken, no TPC-H or TPC-DS query has anywhere near that number of joins.

So then I was really confused. I had four questions about ParAccel, each of which are troubling.

(1) Why did they configure their TPC-H application with such a high amount of disk I/O throughput capabilty when they are a column-store? (Stonebraker's question)
(2) Why did queries spend seemingly 6X more time doing I/O than a column-store should have to do?
(3) Why are they worried about queries with thousands of joins?
(4) Why do they think TPC-H/TPC-DS queries have 42 joins?

And then a theory that answers all four questions at the same time came to me. Perhaps ParAccel directly followed my advice (see option 1) on "How to create a new column-store DBMS product in a week". They're not a column-store. They're a vertically partitioned row-store (this is how column-stores were built back in the 70s before we knew any better). Each column is stored in its own separate table inside the row-store (PostgreSQL in ParAccel's case). Queries over the original schema are then automatically rewritten into queries over the vertically partitioned schema and the row-store's regular query execution engine can be used unmodified. But now, every attribute accessed by the query now adds an additional join to the query plan (since the vertical partitions for each column in a table have to be joined together).

This immediately explains why they are worried about queries with hundreds to thousands of joins (questions 3 and 4). But it also explains why they seem to be doing much more I/O than a native column-store. Since each vertical partition is its own table, then each tuple in a vertical partition (which contains just one value) is preceded by the row-store's tuple header. In PostgreSQL this tuple header is on the order of 27 bytes. So if the column width is 4 bytes, then there is a factor of 7 extra space used up for the tuple header relative to actual user data. And if the implementation is super naive, they also will need an additional 4 bytes to store a tuple identifier for joining vertical partitions from the same original table with each other. This answers questions 1 and 2, as the factor of 6 worse I/O efficiency is now obvious.

If my theory is correct (and remember, I have no inside knowledge), then ParAccel has ignored everything I have done in my research on column-stores the past 6 years. My whole PhD dissertation is about the order of magnitude performance improvement you get by building a query executer specifically for column-stores (instead of using a row-store query executer), which I talked about at SIGMOD last week. I can't help but to take it a little personally that they have not read my papers on this subject, like my column-stores vs row-stores paper from last year. ParAccel would be antithetical to everything I stand for.

Given how neatly my theory explains my four questions, I am pessimistic that someone will be able to recover my perception of ParAccel's product. But I would be happy if someone from ParAccel could confirm or deny what I have said, and if there are alternative explanations to my four questions, I'd love to hear them in an open forum such as comments on this blog. Please though, let's try to keep the conversation polite and civil.

DBMS Musings