At SIGMOD last week, I was chatting with Mike Stonebraker (chatting might be the wrong verb; it was more like downloading knowledge and advice). ParAccel had a paper at SIGMOD, and the week before they had announced that they topped the TPC-H benchmark at the 30TB scale. Naturally, this resulted in ParAccel coming up in our conversation. Mike asked me a question about the ParAccel TPC-H results that floored me --- I was really disappointed I didn't think of this question myself.
Before telling you the question, let me first give you some background. My knowledge about ParAccel's TPC-H results came from reading two blog posts about it. First, there was Merv Adrian's positive post on the subject, and then there was Curt Monash's negative post. Monash's negativity stemmed largely from the seemingly unrealistic configuration of having nearly a petabyte of disk to store 30TB of data (a 32:1 disk/data ratio). This negativity was responded to in a comment by Richard Gostanian who has firsthand knowledge since he helped ParAccel get these benchmark results. The relevant excerpt of his comment is below:
"Why did ParAccel require 961 TB of disk space? The answer is they didn’t. They really only needed about 20TB (10TB compressed times 2 for mirroring). But the servers they used, which allowed for the superb price-performance they achieved, came standard with 961 TB of storage; there simply was no way to configure less storage. These Sun servers, SunFire X4540s, are like no other servers on the market. They combine (1) reasonable processing power (two Quad Core 2.3 GHz AMD Opteron processors) (2) large memory (64 GB), (3) very high storage capacity (24 or 48 TB based on 48 x 500GB or 1TB SATA disks) and 4) exceptional I/O bandwidth (2 - 3 GB/sec depending upon whether you’re working near the outer or inner cylinders) all in a small (4 RU), low cost (~$40K), power efficient (less than 1000 watts) package.
What any DBMS needs to run a disk-based TPC-H benchmark is high I/O throughput. The only economical way to achieve high disk throughput today is with a large number of spindles. But the spindles that ship with today’s storage are much larger than what they were, say, 5 years ago. So any benchmark, or application, requiring high disk throughput is going to waste a lot of capacity. This will change over the next few years as solid-state disks become larger, cheaper and more widely used. But for now, wasted storage is the reality, and should not be viewed in a negative light."
I remember thinking to myself "sounds reasonable" and agreeing with both sides. Yes, there is wasted space, but you need the large number of spindles to get the high I/O throughput and most TPC-H configurations resemble this one.
Then along came Mike Stonebraker, who asked me: "Why does ParAccel need such high I/O throughput? They're a column-store!"
This question immediately triggered my memory of all the experiments I ran on C-Store. I generally ran C-Store on a system with 2 CPUs and 4 disks getting an aggregate I/O bandwidth of 180MB/s (i.e. 90MB/s per CPU) and I rarely saw a disk-bottlenecked query. Why? (1) Because column-stores are already super I/O efficient by only reading the relevant columns for each query and (2) Because they compress data really well (I usually saw at least 5:1) so the effective I/O bandwidth was 90MB/s * 5 = 450MB/s per CPU which, for any reasonably interesting query, is faster than the CPU can generally keep up with.
Meanwhile, the ParAccel TPC-H configuration consisted of nodes with orders of magnitude more disk I/O (3GB/s from the 48 disks) yet just 4 times the CPU processing power. Doing the math: 3GB/s divided by 8 CPU cores = 375MB/s per CPU core. Gostanian said that there was a 3:1 compression ratio, so we're talking about an astounding effective 1GB/s per CPU core.
There's simply no way TPC-H queries are simple enough (except for maybe query 1) for the CPUs to process them at 1GB/s (see my related post on this subject). So it must be the case that part of that 1GB/s is being wasted reading data that will ultimately not be processed (e.g. like unused columns in a row-store). But ParAccel is a column-store! Hence the Mike Stonebraker conundrum.
At this point I was already ready to bet the house that ParAccel does not have the I/O efficiency of a standard column-store. But one who is unconvinced could still argue that disks are cheap and ParAccel was willing to pay the small amount extra for the high I/O bandwidth (which would help the simplest queries that are not CPU intensive) even though most queries would be CPU limited and the extra bandwidth would not help them.
To test this theory I wasted some time slogging through the 82-page ParAccel full disclosure report on the 30TB TPC-H results (seriously!). Query 1 is arguably the easiest query in TPC-H (it scans the single lineitem table, accesses 7 of its attributes, applies a single predicate and several very basic aggregations). There are no joins in this query and the aggregations are really easy since there are only 4 unique groups, so the aggregations can be done during the scan via a hash aggregation. Since it is so basic, this query should be disk limited on most systems. A standard column-store should be able to run this query in the time it takes to read the 7 columns off disk. There are 180 billion rows in the lineitem table, and each attribute should take up about 4 bytes (this is conservative --- with compression this number should be much less). So a total of 180 billion x 7 x 4 = approximately 5TB needs to be read off disk for this query (in reality this number will likely be smaller due to compression and memory caching). So given that there were 43 servers, each with 3GB/s I/O bandwidth, a conservative estimate of how fast this query should be run using a standard column-store is (5TB / (43*3GB/s)) or approximately 40 seconds. But this query takes ParAccel on average 275 seconds according to the report. This is further evidence that ParAccel does not have the I/O efficiency of a standard column-store.
Hence, ever since I looked into this, I've been just totally flummoxed by these puzzling numbers.
Finally, today I finally got around to reading their SIGMOD paper describing the internals of their system. This paper was sadly sparse on details (it was only 4 pages when the SIGMOD limit was 14 and these 4 pages were not that meaty). The main thing I learned from the paper is that ParAccel has heavy PostgreSQL roots. It seems like they started with PostgreSQL and did their best to transform it into a MPP column-store. The one area where there was some amount of meat was their new optimizer's ability to handle plans with "thousands of joins". This of course raised more questions for me. Why would a start-up (with so many things that need to be implemented) be worried about queries with that many joins? Maybe there are a handful of customers that need this many, but the vast majority need far less. But then I got to Figure 2, when they ran queries from TPC-H and TPC-DS and the x-axis had queries with up to 42 joins. But unless I'm sorely mistaken, no TPC-H or TPC-DS query has anywhere near that number of joins.
So then I was really confused. I had four questions about ParAccel, each of which are troubling.
(1) Why did they configure their TPC-H application with such a high amount of disk I/O throughput capabilty when they are a column-store? (Stonebraker's question)
(2) Why did queries spend seemingly 6X more time doing I/O than a column-store should have to do?
(3) Why are they worried about queries with thousands of joins?
(4) Why do they think TPC-H/TPC-DS queries have 42 joins?
And then a theory that answers all four questions at the same time came to me. Perhaps ParAccel directly followed my advice (see option 1) on "How to create a new column-store DBMS product in a week". They're not a column-store. They're a vertically partitioned row-store (this is how column-stores were built back in the 70s before we knew any better). Each column is stored in its own separate table inside the row-store (PostgreSQL in ParAccel's case). Queries over the original schema are then automatically rewritten into queries over the vertically partitioned schema and the row-store's regular query execution engine can be used unmodified. But now, every attribute accessed by the query now adds an additional join to the query plan (since the vertical partitions for each column in a table have to be joined together).
This immediately explains why they are worried about queries with hundreds to thousands of joins (questions 3 and 4). But it also explains why they seem to be doing much more I/O than a native column-store. Since each vertical partition is its own table, then each tuple in a vertical partition (which contains just one value) is preceded by the row-store's tuple header. In PostgreSQL this tuple header is on the order of 27 bytes. So if the column width is 4 bytes, then there is a factor of 7 extra space used up for the tuple header relative to actual user data. And if the implementation is super naive, they also will need an additional 4 bytes to store a tuple identifier for joining vertical partitions from the same original table with each other. This answers questions 1 and 2, as the factor of 6 worse I/O efficiency is now obvious.
If my theory is correct (and remember, I have no inside knowledge), then ParAccel has ignored everything I have done in my research on column-stores the past 6 years. My whole PhD dissertation is about the order of magnitude performance improvement you get by building a query executer specifically for column-stores (instead of using a row-store query executer), which I talked about at SIGMOD last week. I can't help but to take it a little personally that they have not read my papers on this subject, like my column-stores vs row-stores paper from last year. ParAccel would be antithetical to everything I stand for.
Given how neatly my theory explains my four questions, I am pessimistic that someone will be able to recover my perception of ParAccel's product. But I would be happy if someone from ParAccel could confirm or deny what I have said, and if there are alternative explanations to my four questions, I'd love to hear them in an open forum such as comments on this blog. Please though, let's try to keep the conversation polite and civil.