Thursday, January 7, 2010

Exadata's columnar compression

I recently came across a nice whitepaper from Oracle that describes the Exadata columnar compression scheme. I wrote up a brief overview of Oracle's columnar compression in the past (in my hybrid storage layout post), but I was pleased to see the whitepaper give some additional details. Before discussing the main content of the whitepaper, I want to correct a few technical inaccuracies in the article:

"Storing column data together, with the same data type and similar characteristics, drastically increases the storage savings achieved from compression. However, storing data in this manner can negatively impact database performance when application queries access more than one or two columns, perform even a modest number of updates, or insert small numbers of rows per transaction."

The first part of the above statement is usually true; columns tend to be self-similar since they contain data from the same attribute domain, and therefore laying out columns contiguously on storage typically yields lower entropy and higher compression ratios than storing rows contiguously. However, the second part of the statement is quite misleading and only true for the most na├»ve of column-oriented layout schemes. True, the benefit of storing data in columns decreases as a query accesses a higher percentage of columns from a table. However, the number of columns that need to be accessed such that storing data in columns impacts performance negatively relative to a storing data in rows is quite a bit larger than one or two columns. There’s a VLDB 2006 paper by Stavros Harizopoulos et. al. that runs a bunch of benchmarks to understand this tradeoff in more detail. There are a bunch of factors that affect whether column-storage or row-storage is best beyond simply the number of columns accessed (e.g. prefetch size, query selectivity, compression types), most notably the fact that column-stores struggle on needle-in-a-haystack queries since the multiple I/Os needed to get data from different columns are not amortized across the retrieval of multiple rows, but the bottom line is that the query space where pure column-oriented layout outperforms pure row-oriented layout is quite a bit larger than what the Oracle whitepaper claims. Regarding updates and inserts, the VLDB 2005 C-Store paper and OSDI 2006 Google Bigtable paper discuss how to alleviate this column-store disadvantage by temporarily storing the updates in a write-optimized store.

"Oracle’s Exadata Hybrid Columnar Compression technology is a new method for organizing data within a database block."

I feel that this is not giving enough credit to the VLDB 2001 paper by Ailamaki et. al. on PAX. Even though the PAX paper didn’t apply compression to this layout, and PAX didn’t include multiple blocks inside a single compression unit, the basic idea is the same.

One of the key benefits of the Exadata Hybrid Columnar approach is that it provides both the compression and performance benefits of columnar storage without sacrificing the robust feature set of the Oracle Database. For example, while optimized for scan-level access, because row data is self-contained within compression units, Oracle is still able to provide efficient row-level access, with entire rows typically being retrieved with a single I/O.

To people who have even a rudimentary knowledge of column-stores, the juxtaposed statements “entire rows being retrieved with a single I/O” and “provides the performance benefits of columnar storage” is clearly inaccurate. The most often cited advantage of column-stores (indeed cited even more often than the compression advantages) is that if you have a query that has to scan two out of twenty columns in a table (i.e. “tell me the sum of revenue per store”), a column-store only has to read in those two columns while a row-store has to scan the entire table (because multiple entire rows are picked up with each I/O). Hence, a column-store is very I/O efficient, an increasingly important bottleneck in database performance (albeit somewhat reduced with good compression algorithms). Again, Oracle’s statement is true for needle-in-a-haystack queries, but false for the more frequent scan-based queries one finds in data warehouse workloads (which is where Exadata is primarily marketed).

Despite the above overstatements, it is clear that Exadata’s Hybrid Columnar scheme is a big win (relative to vanilla Exadata) for most Exadata datasets and workloads, for the following reasons:

  1. Storage costs are lowered by having a smaller data footprint.
  2. I/O performance is improved by having to read less data from storage.
  3. Data is kept compressed as it is scanned in the storage layer. Selections and projections performed in the storage layer can be performed directly on the compressed data. (Those of you who are familiar with my academic work will not be surprised I like this, given that I wrote a paper on integrating compression and execution for column-oriented storage layouts in a SIGMOD 2006 paper). This yields all the storage costs savings and runtime I/O savings of compression without the performance disadvantage of runtime decompression.

    The whitepaper is vague when it comes to the question of whether
    data remains in compressed form after it is scanned in the Exadata Storage Servers as it is shipped over the Infiniband network to Oracle RAC for further processing. It appears from the way I read the paper (I could be wrong about this) that data is decompressed before it is shipped over the network, which means that Oracle is not gaining all of the potential performance from operating directly on compressed data, as data does have to be decompressed eventually, though the amount of data that needs to be decompressed is smaller, since selections and projections have occurred. If Oracle RAC could operate directly on the columnar compressed format, then decompression would never have to occur, which would further increase performance (query results would obviously have to be decompressed, but usually the magnitude of query results is much smaller than the magnitude of data processed to produce them). Not only is Oracle missing out on potential performance by decompressing data before sending it over the network, but also network bottlenecks could be introduced.
  4. If storage costs are the most important consideration (e.g. for historical data that would otherwise be archived to tape or other offline device), there’s a knob that allows you to further increase compression at the cost of having to decompress data at runtime (and this decompression is heavyweight), thereby reducing runtime performance. Furthermore, the same table can be compressed using the lightweight (high-performance) compression and heavyweight (low-performance) compression using Oracle’s partitioning feature (e.g. you could partition by date and have the historical partitions use the heavyweight compression scheme and the active partitions use the lightweight compression scheme). Note that lightweight still yields good compression --- Oracle claims up to 10X, but this is obviously going to be very dependent on your data.

The most interesting part of the whitepaper was the figure of a compression unit on page four. It is clear that one complication that arises from compressing each column separately is that for a fixed number of rows (e.g. tuples numbered 1-1000), the column sizes are going to vary wildly (depending on how compressible the data from each column is). Figuring out how to fit all columns from a set of rows into a fixed-size block (or set of blocks inside a compression unit) without leaving large chunks of the block empty is a nontrivial optimization problem, and something that Oracle probably had to think about in detail in their implementation.

Anyway, if you’re interested in Oracle’s approach to hybrid columnar storage, the whitepaper is worth a read. With rumors of additional developments in Oracle’s hybrid columnar storage starting to surface, it is likely that this will not be my last blog post on this subject .


  1. Can you explain why the Vertica white paper states "a good rule of thumb is that fewer than 1% of the total SQL statements should be DELETEs or UPDATEs".

    Does the C-Store family solve the IO bottleneck for inserts and updates? Or only for inserts?

    After re-reading the C-Store overview and Vertica white paper my vague understanding is:

    1)since updates are done as delete/insert they may require twice the work when merging into the ROS (once to remove, once to add)

    2) update and delete require random reads from ROS to determine that the rows exist, insert does not

    What is done for primary key and unique constraints? In that case random reads must also be done during insert.

    Can we expect a paper on optimizing for IO during the merge process? The LSM paper has many details on that.

  2. Hi Mark,

    I haven't read the Vertica whitepaper (and wouldn't want to speak for them anyway), but for the C-Store work it was definitely the case that if you had too many updates (or even inserts) then the write-optimized buffer would get too large and slow-down read queries (because read-queries have to go to both the read-optimized buffer and the write-optimized buffer unless the user is willing to receive results from a historical snapshot). So if you want to have optimal performance on read-only queries, you don't want to intermingle too many write queries. The best way to figure out the write load before things go bad is not so much the percentage of queries that are update queries (as in Vertica's "rule of thumb"), but rather the rate with which data is being added to the write buffer.

    The size of the write buffer is really the only thing that can cause major problems in performance. The stuff you mention --- while accurate --- will cause a factor of 2 slower write performance in the worst case, which one gladly pays for the factor of 10+ in read performance. Hence, usually inserts and updates are not distinguished when talking about write queries since they cause the write buffer to fill up equally fast (even though, as you say, updates are more costly than inserts for other reasons).

    The main point here is that there is an enormous difference of write performance between column-stores with write-buffers and column-stores without write-buffers, since without write-buffers column-stores have to turn individual tuple inserts into n random writes where n is the number of columns in the tuple. This causes the order of magnitude slowdown that negates much of the benefit of column-stores. Since most column-stores have write-buffers, I believe that people should assume the existence of the write-buffer when discussing column-store write performance.

  3. very interesting article!

    few thoughts ...

    1. I think as well that Oracle decompresses the data on OS level before it sends it to oracle database. First the paper somewhat says it, but mostly the fact that this feature is available only on exadata tells me that this is probably on the storage/os level.

    2. I believe that Teradata adds interesting twist to how they compress data. In general their compression is not as broad and good as oracle, but I like their dictionary approach, it can be very handy in some specific scenarios.

    3. Oracle whitepaper is (and I'm sorry to say that) a joke. It's more a sales pitch than serious whitepaper about compression. There are not that many different exadata hardware appliances configurations, it should be very easy to run and publish solid accurate stats (not only I/O and space, but also CPU).

    4. I would like to see (in the oracle whitepaper) where this compression does not work or better where it cannot be implemented. I'm sure there are a lot of exceptions.

  4. Daniel, where would one find direct comparisons of Exadata vs Vertica vs Greenplum?
    or do vendors block the publications of such benchmarks by their clients?


  5. now that you are a CTO of Hadapt, maybe you will show how Hadapt outperforms say Greenplum during Hadoop meetup?

  6. I had the doubtful pleasure of working with Greenplum product. This is one of the worst enterprise products i have ever worked with. The product is immature and doesn't come near the capabilities of Oracle Exadata . Greenplum is where oracle was 15 years ago. EMC will try to sell you an illusion that is it the same and better but they are selling a bad product.
    The product have a lot of limitations (for example - no ability to connect to external database, no replication capabilities, dependency on the physical infrastructure when performing backup and restore, the development tools are not good and not comfortable and if you read the documentation you will find about 5 pages of limitations two of them are that you can use column-oriented storage option and compression only on append only table – that is right you read well – you can use the strongest features of GP only on tables that you don't perform update or delete. )
    Since the product is not very popular (at least yet) there will be times that you will experience issues like bugs and behavior not known to anyone else including GreenPlum support and development teams.
    The knowledgebase doesn't exist on the web and you will be fully dependent on EMC professional team.

    If you don't believe me – read Greenplum documentation and at least don't pay before you check the product from the inside out.

    Some of EMC products are good but this is not one of them.

  7. If I read the Oracle White Paper correctly, it states clearly on page 5

    ".. Note that data remains compress not only on disk, but also remains compressed in teh Exatat Smart Flash Cache, on Infiniband, in the database server buffer cache, as well as doing backups or log shipping to Data Guard."