Thursday, September 3, 2009

A tour through hybrid column/row-oriented DBMS schemes

There has been a lot of talk recently about hybrid column-store/row-store database systems. This is likely due to the many announcements along these lines in the past month, such as Vertica’s recent 3.5 release, which introduced FlexStore; Oracle’s revelation that Oracle Database 11g Release 2 uses column-oriented storage to achieve superior compression; and VectorWise’s recent decloaking, which also announced an optional hybrid storage layer. Furthermore, analysts like Curt Monash and Philip Howard are predicting further development in this space (see here and here).

It’s surprising to me that it has taken this long for commercially available implementations of hybrid systems to appear. The research community has been publishing papers on hybrid systems for decades, and straightforward proposals that could easily be implemented in commercial systems began appearing eight years ago.

Different approaches to building hybrid systems yield very different performance properties and solve different sets of problems. Thus, as more hybrid systems become commercially available, and as more companies consider developing their own hybrid system, it is important that people understand the tradeoffs among the various hybrid schemes. My goal in this post is to educate people who are not part of the SIGMOD/VLDB research community about three well-known approaches to building hybrid systems taken by different research efforts, and to give pointers to research papers where the reader can find more detail. Each approach has its own set of advantages and disadvantages, and I will try to list both sides of the tradeoff in each case. The goal of this post is not to say that one type of hybrid scheme is better than another: each scheme can be a good fit in the right situation.

Approach 1: PAX

The PAX scheme was published in 2001 in the VLDB paper “Weaving Relations for Cache Performance” by Natassa Ailamaki, David DeWitt, Mark Hill, and Marios Skounakis. The basic idea is the following: instead of storing data row-by-row within a disk block, store it column-by-column. This is different from a “pure” column store, which stores each column in entirely separate disk blocks. The key difference is that if you had a table with 10 attributes, then in a “pure” column store, data from each original tuple is spread across 10 different disk blocks, whereas in PAX, all data for each tuple can be found in a single disk block. Since a disk block is the minimum granularity with which data can be read off of disk, in PAX, even if a query accesses only 1 out of the 10 columns, it is impossible to read only this single column off of disk, since each disk block contains data for all 10 attributes of the table.
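
To make the layout difference concrete, here is a minimal sketch in Python (the table contents, attribute names, and the list-based representation of a block are illustrative, not how any real system represents pages):

    tuples = [
        (1, "Alice", 34),   # (id, name, age)
        (2, "Bob",   29),
        (3, "Carol", 41),
    ]

    # NSM (row-oriented): each tuple's values are stored contiguously,
    # tuple after tuple, within the block.
    nsm_block = [value for tup in tuples for value in tup]
    # -> [1, 'Alice', 34, 2, 'Bob', 29, 3, 'Carol', 41]

    # PAX: the same block is divided into one "minipage" per attribute;
    # each minipage stores that attribute's values contiguously.
    pax_block = [[tup[i] for tup in tuples] for i in range(3)]
    # -> [[1, 2, 3], ['Alice', 'Bob', 'Carol'], [34, 29, 41]]

    # Either way, reading this one block retrieves ALL attributes of
    # these tuples: PAX changes the layout within the block, not which
    # blocks a query must read.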

To understand why this is a good idea, some context is necessary. At the time the paper was written, column-stores (called the “DSM model” in the paper) had made very limited impact on the commercial landscape (there was Sybase IQ, but it was very much a niche product at the time). It was widely believed that the DSM model had failed to take off because of its high tuple reconstruction costs.

Let’s say a query accessed three out of ten columns from a table and required some operator (like an aggregation) that required each of these three columns to be scanned fully. A row-store would have to do a complete table scan, wasting some I/O reading the 7 irrelevant columns in addition to the 3 relevant ones; but at least it would read the whole table sequentially. The column-store would only have to read the 3 relevant columns, but would have to seek back and forth between the 3 columns, doing tuple reconstruction. In 2001, servers had nowhere near the memory capacities they have today, so extensive prefetching was not an option. (Instead of reading one block from column 1, then one block from column 2, then one block from column 3, then the next block from column 1, and so on, prefetching lets you read n blocks from column 1, then n blocks from column 2, etc., so that the seek cost is amortized over a large sequential read; but you need enough memory to keep n blocks from each column in memory at once.) Given how expensive seeks are relative to sequential access, it is no accident that column-stores didn’t take off until system memories increased enough to allow for significant prefetching. (Research on late materialization, which delays tuple reconstruction until the end of the query when fewer tuples need to be materialized, also helped.)
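
To see how prefetch depth amortizes seeks, here is a back-of-the-envelope calculation in Python; the seek time, transfer rate, block size, and table size below are illustrative assumptions, not measurements from the paper:

    # Scanning 3 columns that live in separate regions of disk, switching
    # columns every n blocks (n = prefetch depth). Hardware numbers are
    # illustrative assumptions.
    SEEK_MS = 5.0           # cost of one disk seek
    BLOCK_MB = 1.0          # block size
    XFER_MB_PER_S = 100.0   # sequential transfer rate

    def scan_time_ms(blocks_per_column, n, num_columns=3):
        """Total time to scan num_columns columns at prefetch depth n."""
        total_blocks = blocks_per_column * num_columns
        transfer_ms = total_blocks * BLOCK_MB / XFER_MB_PER_S * 1000
        seeks = total_blocks / n   # one seek per run of n blocks
        return seeks * SEEK_MS + transfer_ms

    print(scan_time_ms(1000, n=1))    # 45000 ms: a third of it seeking
    print(scan_time_ms(1000, n=100))  # 30150 ms: seeks mostly amortized

The catch, as noted above, is memory: prefetch depth n requires buffering n blocks per column at once.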

Anyway, PAX was able to achieve the CPU efficiency of column-stores while maintaining the disk I/O properties of row-stores. For those without detailed knowledge of column-stores, this might seem strange: the way most column-stores pitch their products is by accentuating the disk I/O efficiency advantage (you only need to read from disk exactly those attributes accessed by a particular query). Why would a column-store want the same disk access patterns as a row-store? Well, it turns out column-stores have a significant, oft-overlooked CPU efficiency advantage as well. The aspect of CPU efficiency that the PAX paper examined was cache hit ratio and memory bandwidth requirements. It turns out that storing column data sequentially within a block allows cache lines to contain data from just one column. Since most DBMS operators only operate on one or two columns at a time, the cache is filled with relevant data for that operation, thereby reducing CPU inefficiency due to cache misses. Furthermore, only the columns relevant to any particular operation need to be shipped from memory.
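
As a rough illustration of the access-pattern difference (Python hides true memory layout, so treat this purely as a sketch of strided versus sequential access, not a cache benchmark):

    from array import array

    # Sum one attribute of a 10-attribute table under the two layouts.
    # The array module gives genuinely contiguous storage.
    NUM_TUPLES, NUM_ATTRS = 100_000, 10

    # Row layout: one flat array, tuple after tuple, so attribute 0 sits
    # at positions 0, 10, 20, ... Each cache line fetched while striding
    # through it is mostly filled with the other 9 attributes.
    rows = array("q", range(NUM_TUPLES * NUM_ATTRS))
    total_row = sum(rows[i] for i in range(0, len(rows), NUM_ATTRS))

    # Column layout: attribute 0 in its own contiguous array, so every
    # cache line fetched is full of useful values.
    col0 = rows[::NUM_ATTRS]
    total_col = sum(col0)

    assert total_row == total_col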

Bottom line:

Advantages of PAX:
  • Yields most of the CPU efficiency of column-stores
  • Allows for column-oriented compression schemes, which can improve the compression ratio due to increased data locality (data from the same attribute domain is stored contiguously). This can improve performance, since the better the data compresses, the less time is spent reading it from storage.
  • Is theoretically fairly easy to implement in a row-store system to get some of the column-store advantages. I say “theoretically” because no commercial row-store system actually did this (to the best of my knowledge) until Oracle 11gR2.

Disadvantages of PAX:
  • Equivalent I/O properties to row-stores (not counting compression), in the sense that irrelevant columns still have to be read from storage in the same blocks as the needed columns for any particular query. In 2001 this was an advantage. Today, for typical analytical workloads, this is a significant disadvantage. (For less scan-oriented workloads, such as tuple lookups and needle-in-the-haystack queries, this remains an advantage.)
  • The cache prefetching features of modern processors render some of the cache efficiency of PAX obsolete (PAX no longer makes a large difference to cache hit ratio). However, PAX still reduces demands on memory bandwidth, and other CPU advantages of column-stores, such as vectorized processing, remain achievable in PAX.

Approach 2: Fractured Mirrors

This scheme was published in 2002 in the VLDB paper “A Case for Fractured Mirrors” by Ravi Ramamurthy, David DeWitt, and Qi Su. (Yes, you read that right. The University of Wisconsin DBMS group led by DeWitt authored both seminal papers on hybrid column-store/row-store systems.) The approach is essentially the following: you’re going to replicate all of your data anyway for high availability and/or disaster recovery, so why not use a different storage layout in each replica? That way, if you have a tuple lookup or a needle-in-a-haystack query, you send it to the row-store replica. If the query is more scan-oriented (e.g., an aggregation or summarization query), you send it to the column-store replica. The implementation in the paper is a little more complicated (to avoid skew, each node contains part of the column-store and part of the row-store), but the above description is the basic idea.
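
Here is a minimal sketch of the routing idea; the Query fields, the replica objects, and their execute() method are hypothetical simplifications, not the paper’s actual design:

    from dataclasses import dataclass

    # Toy router for a fractured-mirrors deployment: every table exists
    # in both a row-store replica and a column-store replica, and each
    # query is sent to whichever replica suits its access pattern.
    @dataclass
    class Query:
        sql: str
        is_point_lookup: bool  # e.g., a primary-key or needle-in-a-haystack query

    def route(query, row_replica, column_replica):
        """Send lookups to the row replica, scans to the column replica."""
        if query.is_point_lookup:
            return row_replica.execute(query.sql)
        return column_replica.execute(query.sql)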

Most people agree (I think) that row-stores are more than an order of magnitude faster than column-stores for tuple-lookup queries, and column-stores are more than an order of magnitude faster than row-stores for scan-oriented queries. Here is how one comes to this conclusion. To look up a tuple in a row-store, one needs to read in just the one block that contains the tuple (let’s assume all relevant indexes for both the row-store and the column-store are in memory). In a column-store, one block needs to be read for each attribute. Assuming there are more than 10 attributes in the table, this is already more than an order of magnitude. On the other hand, for scan queries, if a query accesses less than 10% of the attributes of a table (the common case), column-stores get one order of magnitude improvement relative to row-stores immediately (from disk efficiency). Additionally, many have argued (see here and here) that column-stores get an additional order of magnitude improvement from CPU efficiency.
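
The block-counting arithmetic can be made explicit; the attribute counts and table size below are illustrative:

    # Blocks read for each query type, assuming all relevant indexes are
    # in memory. The attribute counts and table size are illustrative.
    num_attrs = 20
    attrs_accessed = 2        # a typical scan touches ~10% of the attributes
    table_blocks = 100_000    # blocks occupied by the full table in row layout

    # Tuple lookup: a row-store reads the single block holding the tuple;
    # a column-store reads one block per attribute.
    row_lookup_blocks = 1
    col_lookup_blocks = num_attrs

    # Scan of the accessed attributes: a row-store must read every block;
    # a column-store reads only the relevant columns' blocks.
    row_scan_blocks = table_blocks
    col_scan_blocks = table_blocks * attrs_accessed // num_attrs

    print(col_lookup_blocks / row_lookup_blocks)  # 20.0 (row-store wins lookups)
    print(row_scan_blocks / col_scan_blocks)      # 10.0 (column-store wins scans)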

If you buy the above argument, then it is critical to use a scheme like fractured mirrors for mixed workloads. Given how often people talk about mixed workloads as being a key problem in enterprise data warehouses, it is surprising how long it has taken to see a commercial implementation along the lines written about in the research paper.

Advantages of Fractured Mirrors:
  • Every individual query gets sent to the storage layout that is optimal for that query. Performance is thus an order of magnitude better than a pure row-store or a pure column-store on the queries that are problematic for that type of store.

Disadvantages of Fractured Mirrors:
  • All data needs to be replicated. Obviously, in most cases you’re going to replicate the data anyway. But if you are already using the replication for something else (e.g. storing the data in a different sort order), then you either need to increase the replication factor or remove some of the additional sort orders in order to implement fractured mirrors.
  • If you really want to get the most out of this approach, you need to extend what is proposed in the research paper and have complete implementations of both row-store and column-store systems (since column-stores have very different query execution engines and query optimizers than row-stores). This is obviously a lot of code, and it precludes most companies from using the fractured mirrors approach. I am flummoxed as to why the only company with legitimate row-store and column-store DBMS products (Sybase, with ASE and IQ) hasn’t implemented the fractured mirrors approach yet.

Approach 3: Fine-grained hybrids

The VLDB 2003 paper by Hankins and Patel and the CIDR 2009 paper by Cudre-Mauroux, Wu, and Madden are examples of this approach. Here, individual tables can be divided between row- and column-oriented storage. For example, if two columns are often accessed together (e.g., order-date and ship-date), they can be stored together (in rows). The remaining columns can be stored separately. This can be done within a disk block, within a table, or even at a slightly larger grain across tables. For example, if one table is usually accessed via tuple lookups (e.g., a customer table), it can be stored in rows, while tables that are usually scanned (e.g., lineitem) can be stored in columns.
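
Here is a sketch of how such a layout might be declared; the schema and the grouping structure are hypothetical, since each real system has its own syntax for this:

    # Hypothetical declaration of a fine-grained hybrid layout: columns
    # that are usually accessed together are grouped into row-oriented
    # storage; everything else is stored column by column.
    layout = {
        "orders": {
            # order_date and ship_date are co-accessed, so store them
            # together as mini-rows in their own storage region.
            "row_groups": [("order_date", "ship_date")],
            # Each remaining column gets its own column-oriented storage.
            "column_wise": ["order_id", "custkey", "status", "priority"],
        },
        # A table dominated by point lookups can be stored entirely in rows.
        "customer": {"row_groups": [("custkey", "name", "address")],
                     "column_wise": []},
    }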

Advantages of fine-grained hybrids:
  • If correct decisions are made about what attributes (and/or tables) should be stored in rows, and what attributes should be stored in columns, then you can get all of the performance advantages of fractured mirrors without the additional replication disadvantage.

Disadvantages of fine-grained hybrids:
  • Depending on the DBMS you start with, it can be complex to implement. You essentially have to have both row- and column-oriented execution engines, the optimizer has to know about the differences between row-storage and column-storage, and indexing code has to be updated appropriately. It turns out that this is much easier to implement in a column-store that already supports early materialization (early materialization refers to the ability to reconstruct rows from column-storage early in a query plan) than in other types of systems. This is because early materialization requires query operators to be able to handle input in both column- and row-oriented format (input will be in column format if tuples haven’t been materialized yet, and in row-oriented format otherwise); hence, the execution engine and optimizer already have knowledge about the difference between rows and columns and can act appropriately. (A sketch of this dual-format handling appears after this list.)
  • It requires some knowledge of the query workload; otherwise, incorrect decisions about tuple layout will be made, leading to suboptimal performance.
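
Here is a minimal sketch of the dual-format handling mentioned above; the operator, its signature, and the batch representation are hypothetical simplifications:

    # Sketch of a selection operator that, as in an early-materializing
    # column-store, must accept input in either format: a columnar batch
    # (one list per attribute) or already-materialized row tuples.
    def select_gt(batch, attr, threshold, is_columnar):
        """Keep entries whose value for attribute `attr` exceeds threshold."""
        if is_columnar:
            # batch is a list of columns: find qualifying positions in one
            # column, then apply that position list to every column.
            keep = [i for i, v in enumerate(batch[attr]) if v > threshold]
            return [[col[i] for i in keep] for col in batch]
        # batch is a list of row tuples.
        return [row for row in batch if row[attr] > threshold]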


Commercial availability

Oracle, Vertica, and VectorWise have announced hybrid systems using one of these schemes (I have no inside knowledge about any of these implementations, and only know what has been published publicly in each case). It appears that Oracle (see Kevin Closson’s helpful responses to my questions in a comment thread on one of his blog posts) and VectorWise use the first approach, PAX. Vertica uses fine-grained hybrids (approach 3), though they could probably use their row-oriented storage scheme to implement fractured mirrors (approach 2) as well, if they so desired. Given that two of the three authors of the fractured mirrors paper have been reunited at Microsoft, I would not be surprised if Microsoft were to eventually implement a fractured mirrors hybrid scheme.

Conclusion

There was nearly a decade of delay, but at long last we’re starting to see hybrid row/column-stores hit the marketplace. Row-stores and column-stores have very different performance characteristics, and each tends to struggle where the other excels. As workloads get more complex, hybrid systems will increase in importance.

11 comments:

  1. Interesting taxonomy! Good work. Thanks for sharing.

    My gut feeling is that Curt Monash is probably right to think that Oracle went with something like PAX. It is a fairly natural evolution for people dealing with row stores. (Reference: http://www.dbms2.com/2009/09/03/oracle-11g-exadata-hybrid-columnar-compression/)

  2. A few more remarks:

    1) Adding a column-oriented index to an existing row store creates a hybrid of sorts. Think about indexing a few columns out of a row store with a bitmap or projection index.

    2) PAX can be viewed as a row/column hybrid, but it can also be *implemented* as something close to a column store with horizontal partitioning.

  3. Oh yes, I definitely agree with Curt. He's had at least three good posts on hybrid row/column-store systems.

    I agree with your first remark (bitmap indexes are a sort of lightweight hybrid). I don't really agree with the second one, since column-stores do not store values from different columns in the same page (even if you horizontally partition them), while PAX does. But I understand what you are saying: from 10,000 feet, PAX sort of looks like a horizontally partitioned column-store.

  4. Hi Daniel,
    Good write up. One thing to note is that the row/column choice is not only for storage, but also for processing. In particular, as the referenced "Data Morphing" paper (VLDB03) discusses, row/column hybrids are useful not only on disk, but also for in-memory processing. This idea has been taken even further in our DAMON'08 paper, where we discuss how the data model can be dynamically changed _during_ the query execution to exploit various performance characteristics of DSM and NSM.

    P.S. Delete that post above if you can :)

  5. Daniel,
    Excellent post. One more question (and a possible follow-on post for you): would you consider HBase or BigTable a row/column hybrid?
    Thanks
    WB

  6. Hi Marcin,

    Totally agreed. I actually spent a while on the point that column-stores are more than just a storage layer innovation in the tutorial I did with your PhD adviser and Stavros at VLDB in France last week. Stavros also covered your DAMON08 paper very nicely. I chose not to list your paper in the hybrid schemes above since it's not really a hybrid scheme in the traditional sense and it might be confusing for people.

    Before I get similar comments from other people, allow me to list two other papers that I almost wrote about, before deciding that one has to warp the mind a little too much to see why they're hybrid schemes:

    "Clotho: Decoupling Memory Page Layout from Storage Organization." Shao, Schindler, Schlosser, Ailamaki, and Ganger. VLDB 2004.

    "A Multi-Resolution Block Storage Model for Database Design." Jingren Zhou, Kenneth A. Ross. IDEAS 2003.

    PS: I deleted the SPAM comment as you suggested
    PPS: Good luck on your PhD defense next week!

  7. Hi Walter,

    I'm OK with people calling HBase/BigTable a row/column hybrid. If you were to do so, I would put them in category 3 (fine-grained hybrids) since different columns in the same table can be stored as rows or columns. Personally I do not consider their use of locality groups to be enough grounds to call them column-stores (since column-stores are more than just how data is laid out in storage --- it's also about making changes to the execution engine), but that's just because I'm a little snooty in what I'm willing to call a "true" column-store :)

  8. Oops --- I should have mentioned: the Stavros I referred to above is Stavros Harizopoulos of HP Labs. The slides from our VLDB tutorial on column-stores can be found at: http://cs-www.cs.yale.edu/homes/dna/talks/Column_Store_Tutorial_VLDB09.pdf

  9. impressive and great posts

    Thanks

  10. Very interesting post,
    If we take into account your interesting PhD thesis about the three approaches to building a column store, it seems to me that the three hybrid categories mentioned here take the second approach (changing only the storage layer without changing the execution engine) in building their column store, since I cannot find (in these articles) any special implementations/modifications of the execution engine.
    What do you think about that?
    Thanks again
