At SIGMOD last week, I was chatting with Mike Stonebraker (chatting might be the wrong verb; it was more like downloading knowledge and advice). ParAccel had a paper at SIGMOD, and the week before they had announced that they topped the TPC-H benchmark at the 30TB scale. Naturally, this resulted in ParAccel coming up in our conversation. Mike asked me a question about the ParAccel TPC-H results that floored me --- I was really disappointed I didn't think of this question myself.
Before telling you the question, let me first give you some background. My knowledge about ParAccel's TPC-H results came from reading two blog posts about it. First, there was Merv Adrian's positive post on the subject, and then there was Curt Monash's negative post. Monash's negativity stemmed largely from the seemingly unrealistic configuration of having nearly a petabyte of disk to store 30TB of data (a 32:1 disk/data ratio). This negativity was responded to in a comment by Richard Gostanian who has firsthand knowledge since he helped ParAccel get these benchmark results. The relevant excerpt of his comment is below:
"Why did ParAccel require 961 TB of disk space? The answer is they didn’t. They really only needed about 20TB (10TB compressed times 2 for mirroring). But the servers they used, which allowed for the superb price-performance they achieved, came standard with 961 TB of storage; there simply was no way to configure less storage. These Sun servers, SunFire X4540s, are like no other servers on the market. They combine (1) reasonable processing power (two Quad Core 2.3 GHz AMD Opteron processors) (2) large memory (64 GB), (3) very high storage capacity (24 or 48 TB based on 48 x 500GB or 1TB SATA disks) and 4) exceptional I/O bandwidth (2 - 3 GB/sec depending upon whether you’re working near the outer or inner cylinders) all in a small (4 RU), low cost (~$40K), power efficient (less than 1000 watts) package.
What any DBMS needs to run a disk-based TPC-H benchmark is high I/O throughput. The only economical way to achieve high disk throughput today is with a large number of spindles. But the spindles that ship with today’s storage are much larger than what they were, say, 5 years ago. So any benchmark, or application, requiring high disk throughput is going to waste a lot of capacity. This will change over the next few years as solid-state disks become larger, cheaper and more widely used. But for now, wasted storage is the reality, and should not be viewed in a negative light."
I remember thinking to myself "sounds reasonable" and agreeing with both sides. Yes, there is wasted space, but you need the large number of spindles to get the high I/O throughput and most TPC-H configurations resemble this one.
Then along came Mike Stonebraker, who asked me: "Why does ParAccel need such high I/O throughput? They're a column-store!"
This question immediately triggered my memory of all the experiments I ran on C-Store. I generally ran C-Store on a system with 2 CPUs and 4 disks getting an aggregate I/O bandwidth of 180MB/s (i.e. 90MB/s per CPU) and I rarely saw a disk-bottlenecked query. Why? (1) Because column-stores are already super I/O efficient by only reading the relevant columns for each query and (2) Because they compress data really well (I usually saw at least 5:1) so the effective I/O bandwidth was 90MB/s * 5 = 450MB/s per CPU which, for any reasonably interesting query, is faster than the CPU can generally keep up with.
Meanwhile, the ParAccel TPC-H configuration consisted of nodes with orders of magnitude more disk I/O (3GB/s from the 48 disks) yet just 4 times the CPU processing power. Doing the math: 3GB/s divided by 8 CPU cores = 375MB/s per CPU core. Gostanian said that there was a 3:1 compression ratio, so we're talking about an astounding effective 1GB/s per CPU core.
There's simply no way TPC-H queries are simple enough (except for maybe query 1) for the CPUs to process them at 1GB/s (see my related post on this subject). So it must be the case that part of that 1GB/s is being wasted reading data that will ultimately not be processed (e.g. like unused columns in a row-store). But ParAccel is a column-store! Hence the Mike Stonebraker conundrum.
At this point I was already ready to bet the house that ParAccel does not have the I/O efficiency of a standard column-store. But one who is unconvinced could still argue that disks are cheap and ParAccel was willing to pay the small amount extra for the high I/O bandwidth (which would help the simplest queries that are not CPU intensive) even though most queries would be CPU limited and the extra bandwidth would not help them.
To test this theory I wasted some time slogging through the 82-page ParAccel full disclosure report on the 30TB TPC-H results (seriously!). Query 1 is arguably the easiest query in TPC-H (it scans the single lineitem table, accesses 7 of its attributes, applies a single predicate and several very basic aggregations). There are no joins in this query and the aggregations are really easy since there are only 4 unique groups, so the aggregations can be done during the scan via a hash aggregation. Since it is so basic, this query should be disk limited on most systems. A standard column-store should be able to run this query in the time it takes to read the 7 columns off disk. There are 180 billion rows in the lineitem table, and each attribute should take up about 4 bytes (this is conservative --- with compression this number should be much less). So a total of 180 billion x 7 x 4 = approximately 5TB needs to be read off disk for this query (in reality this number will likely be smaller due to compression and memory caching). So given that there were 43 servers, each with 3GB/s I/O bandwidth, a conservative estimate of how fast this query should be run using a standard column-store is (5TB / (43*3GB/s)) or approximately 40 seconds. But this query takes ParAccel on average 275 seconds according to the report. This is further evidence that ParAccel does not have the I/O efficiency of a standard column-store.
Hence, ever since I looked into this, I've been just totally flummoxed by these puzzling numbers.
Finally, today I finally got around to reading their SIGMOD paper describing the internals of their system. This paper was sadly sparse on details (it was only 4 pages when the SIGMOD limit was 14 and these 4 pages were not that meaty). The main thing I learned from the paper is that ParAccel has heavy PostgreSQL roots. It seems like they started with PostgreSQL and did their best to transform it into a MPP column-store. The one area where there was some amount of meat was their new optimizer's ability to handle plans with "thousands of joins". This of course raised more questions for me. Why would a start-up (with so many things that need to be implemented) be worried about queries with that many joins? Maybe there are a handful of customers that need this many, but the vast majority need far less. But then I got to Figure 2, when they ran queries from TPC-H and TPC-DS and the x-axis had queries with up to 42 joins. But unless I'm sorely mistaken, no TPC-H or TPC-DS query has anywhere near that number of joins.
So then I was really confused. I had four questions about ParAccel, each of which are troubling.
(1) Why did they configure their TPC-H application with such a high amount of disk I/O throughput capabilty when they are a column-store? (Stonebraker's question)
(2) Why did queries spend seemingly 6X more time doing I/O than a column-store should have to do?
(3) Why are they worried about queries with thousands of joins?
(4) Why do they think TPC-H/TPC-DS queries have 42 joins?
And then a theory that answers all four questions at the same time came to me. Perhaps ParAccel directly followed my advice (see option 1) on "How to create a new column-store DBMS product in a week". They're not a column-store. They're a vertically partitioned row-store (this is how column-stores were built back in the 70s before we knew any better). Each column is stored in its own separate table inside the row-store (PostgreSQL in ParAccel's case). Queries over the original schema are then automatically rewritten into queries over the vertically partitioned schema and the row-store's regular query execution engine can be used unmodified. But now, every attribute accessed by the query now adds an additional join to the query plan (since the vertical partitions for each column in a table have to be joined together).
This immediately explains why they are worried about queries with hundreds to thousands of joins (questions 3 and 4). But it also explains why they seem to be doing much more I/O than a native column-store. Since each vertical partition is its own table, then each tuple in a vertical partition (which contains just one value) is preceded by the row-store's tuple header. In PostgreSQL this tuple header is on the order of 27 bytes. So if the column width is 4 bytes, then there is a factor of 7 extra space used up for the tuple header relative to actual user data. And if the implementation is super naive, they also will need an additional 4 bytes to store a tuple identifier for joining vertical partitions from the same original table with each other. This answers questions 1 and 2, as the factor of 6 worse I/O efficiency is now obvious.
If my theory is correct (and remember, I have no inside knowledge), then ParAccel has ignored everything I have done in my research on column-stores the past 6 years. My whole PhD dissertation is about the order of magnitude performance improvement you get by building a query executer specifically for column-stores (instead of using a row-store query executer), which I talked about at SIGMOD last week. I can't help but to take it a little personally that they have not read my papers on this subject, like my column-stores vs row-stores paper from last year. ParAccel would be antithetical to everything I stand for.
Given how neatly my theory explains my four questions, I am pessimistic that someone will be able to recover my perception of ParAccel's product. But I would be happy if someone from ParAccel could confirm or deny what I have said, and if there are alternative explanations to my four questions, I'd love to hear them in an open forum such as comments on this blog. Please though, let's try to keep the conversation polite and civil.
Nice work. And just when I thought the dust had settled (http://tr.im/rcI7)ReplyDelete
I wonder if their VCs are kicking themselves because they didn't figure this out?
Brutal. Let's at least give them an opportunity to address this!ReplyDelete
Incidentally, the probability of a VC getting past the 1st sentence in Gostanian's comment is about zero point zero. So I wouldn't really worry about that one ;)ReplyDelete
As you point out, you should see pretty much an automatic 10x if using a column store with a purpose-built optimizer.ReplyDelete
Oracle ran the 30TB 2 years ago, with much less compute power. ParAccel achieved just 7x over that result. This hints at a problem in the technology right off the bat, as you should expect 10x+ with a decent column store and optimizer, before you factor in gains in CPU, memory and interconnect performance over two years.
At least it looks that way to me.
I enjoyed your post but I am afraid you misread ParAccel's TPC-H results. Your quoted number of 275 seconds for TPC-H Query 1 is from the "throughput" test that has 10 concurrent streams of queries. What you should be looking at is the "power" test, which is their "Stream ID 00" numbers. You can see that Query 1 takes 51.4 secs which is very close to your calculations. Therefore, it seems to me, they are indeed a column-store (and hopefully this should answer your question #2).
About Stonebraker's question (your question #1): When it comes to joining two (or more) tables where the projected columns do not fit in your RAM workspace, chances are that you will be using a 2-pass algorithm, and therefore you will be needing high disk I/O throughput. As you know, for these types of ad-hoc joins, column-stores can only achieve modest improvements over row-stores (after you factor out any compression-related benefits -- we showed this in our Sigmod 2009 paper on SSDs and column page layouts http://bit.ly/SSD_sigmod09). Whether column-stores really need high disk I/O in real-life workloads, is, of course, a different question.
Your question #4: From the paper, it is clear that they know that TPC-DS does not have 42 tables (it has 24). They explicitly state that they added tables to show the benefits of their algorithm. If you look at they results, they only perform better for a large number of tables, hence the modification of the original specs. I think this is fair in a paper. Is 42 tables realistic? Again, a different question. (And I think they only have 4 pages because it is an industrial paper, which required that they only submit an abstract for reviewing -- it is hard to come up with 14 pages between notification and camera-ready deadline, which is less than one month apart).
Finally, about your question #3, why they worry about 1000s of joins? Beats me.
Thanks for the comment. It's really late on the east coast and I'm about to get into bed, so I'll respond tomorrow; but from my recollection, they only added tables to the numbers in figure 1 in their paper (not figure 2). So I'm not sure I'm willing to accept your answer to question 4. Your answers to questions 1 and 2 might well be correct. I need to look into more detail at this when I wake up.
Wouldn't it be more realistic to look at the power run number for query 1 versus the average times of both the power run and the throughput run executions? Given there are other queries running in a throughput run, it is seems harder to isolate the exact resources used, correct? The elapsed time for query 1 in the power run was 51.4s, much more in line with your rough estimate of 40s.
If we run the calculations applying the 3:1 compression ratio (from Richard's comment) to the 5TB we get 1.666TB and given the 51.4s elapsed time with a the physical rate around 33GB/s. This results in just under 100MB/s (physical) per CPU core which seems reasonable close to the rate you observed of 90MB/s (physical) per CPU.
My suspension is that query 1 is more CPU sensitive than I/O sensitive due to the aggregations and the group (meaning I think it gets CPU bound quite fast compared to the I/O rates). I would guess that the physical I/O scan capacity of this cluster is more than enough to keep the CPUs busy, especially for query 1.
After further consideration, I think Stavros and Greg’s points are good ones (that the power runs in TPC-H might be better to look at for the context of this discussion). This brings down the ParAccel number on query 1 from 275 seconds to 51.4 seconds. However, let’s dig a little deeper into the details of this query. It only queries one of the TPC-H tables, the lineitem table. Here’s the ParAccel definition of this table according to the full disclosure report:ReplyDelete
create table lineitem (
l_orderkey int8 not null encode delta32k distkey
l_partkey int8 not null,
l_suppkey int4 not null,
l_linenumber int4 not null encode always8,
l_quantity numeric(12,2) not null encode bytedict,
l_extendedprice numeric(12,2) not null encode
l_discount numeric(12,2) not null encode always8,
l_tax numeric(12,2) not null encode always8,
l_returnflag char(1) not null,
l_linestatus char(1) not null,
l_shipdate date not null encode delta,
l_commitdate date not null encode delta,
l_receiptdate date not null encode delta,
l_shipinstruct char(25) not null encode bytedict,
l_shipmode char(10) not null encode bytedict,
l_comment varchar(44) not null encode text255,
FOREIGN KEY (l_orderkey) REFERENCES orders
FOREIGN KEY (l_partkey,l_suppkey) REFERENCES
And here are the seven columns that are accessed by query 1:
As you can see, these are seven of the more narrow columns of the schema. l_returnflag and l_linestatus are super narrow, having less than 4 unique values (you can see this in the output of the query, since the query groups by l_returnflag and l_linestatus and there are only 4 rows output). So if the entire TPC-H data is 10TB according to ParAccel (which includes ALL TPC-H tables), then my estimate that these seven columns take up 5TB is very, very, very conservative. 1TB is probably a more reasonable number, so we’re still seeing about a factor of 4 wasted I/O. And this is assuming ParAccel is not using some sort of Zone Map (like Netezza) or Knowledge Grid (like Infobright) to further reduce blocks that have to be read in thanks to the predicate on shipdate. Finally, I respectfully disagree with Greg that this query should be CPU limited given that the aggregation can be done in a single pass and there are only 4 unique groups. Please see the work by Boncz et all (http://old-www.cwi.nl/themes/ins1/publications/docs/BoZuNe:CIDR:05.pdf) which looks at TPC-H query 1 in particular and show that you only need about 55 cycles per tuple when processing this query using a column-store.
The bottom line is that questions 3 and 4 remain unanswered, there still seems to be some wasted I/O for question 2 (though maybe not as much as I previously thought) and Stavros’ answer to Stonebraker’s question doesn’t take into account that in TPC-H most joins are FK-PK joins and the partitioned dimension tables usually fit in memory (never mind the fact that as Gostanian pointed out in his first comment at http://www.dbms2.com/2009/06/22/the-tpc-h-benchmark-is-a-blight-upon-the-industry/ that 50 GB per node (!) was devoted to storing hash tables and other related data structures to keep joins in memory). So I’m going to stand by my theory, although you have succeeded in raising some doubt in my mind.
Thanks so much for your interest in our recently posted TPC-H results. Regardless of one’s position on the TPC-H benchmark, our results ( http://www.tpc.org/tpch/results/tpch_result_detail.asp?id=109062201 ), which far exceeded the performance and price/performance of traditional database technologies, at the very least seem to have stimulated some interesting discussion.ReplyDelete
Unfortunately, your blog site limits responses to 4096 characters, which did not allow me to post a full and adequate response to your questions here. I thus posted my full response on our blog and invite you and your readers to visit www.paraccel.com/data_warehouse_blog for a continuation of this discussion.
Thanks again for your interest.
Thanks very much for responding to my questions. ParAccel still seems to be doing more I/O than a standard column-store as per my previous comment, but at this point since you obviously know the ParAccel code and I'm just speculating, and I've never met you and must give you the benefit of the doubt wrt the truthfulness of your responses, I will give you the last word and not speculate any further.
Next time I’m in New England, we should get together for beers. You might find some of the things we are doing very interesting and may want to become more involved with us.ReplyDelete
I’d like to thank Stavros and Greg for helping to clarify. I don’t know where you live, but if we can get together, I’ll buy the beer.
There were a couple of questions left open:
1. One of the rules of TPC-H is you cannot use duplicated or aggregated versions of the data because it would conflict with the Ad Hoc goal of the benchmark. So, no, we didn’t use any projections, zone-maps, aggregate-grids or even indexes. We were pure “Load and Go”. That is part of why we posted a load rate of almost 9TB/hour, which is, of course record breaking on its own.
2. Thanks for the pointer to Boncz’s paper. Very interesting analysis! There is an important difference, though, in their research and our results. Since they didn’t do any compression, they couldn’t account for the number of CPU cycles to decompress, so this is a little apples-to-oranges. If we don’t compress, we are much more sensitive to the IO and memory-bus limitations. The net is that across the 24 queries (240 for each Throughput run), we averaged better than 80% CPU utilization on the 2.3GHz Opterons. The best proof-point will be when another columnar database publishes a TPC-H.
3. Sections 4.5.1 and 5.2 in the FDR point out that the hardware was configured RAID1. Naturally, ParAccel PADB includes its own full mirroring, but RAID1 is still a common datacenter requirement to make it completely invisible to the application when a disk fails. The result of RAID1 is a 50% reduction in per-spindle speed as well as space. So, the server’s available IO rate, instead of being 2500-3000MB/sec, becomes more like 1300-1500MB/sec.
I hope this helps clarify.
Barry Zane, ParAccel CTO
Thanks for the additional clarifications, these are extremely helpful. I totally agree that until another column-store publishes TPC-H 30TB numbers, my statements about what a column-store should be able to achieve remain entirely theoretical. My gut feeling is that we won't see other column-stores publish TPC-H numbers any time soon; there are a bunch of really weird queries in TPC-H that are not found all that often in the real world and as far as I understand it, other companies prefer to optimize their product based on what their customers are asking of them instead of optimizing for TPC-H. But I'd really like to see some non-ParAccel column-store TPC-H 30TB numbers. If anyone would be willing to share them with me, I'd love to see them (even if I have to sign a NDA).ReplyDelete
My only small quibble with your most recent comment is that in many cases it is possible to operate directly on compressed data. For example, your load script indicates that the shipdate column is delta-compressed. It should be possible to apply the shipdate range predicate on delta-compressed data without decompression. My research shows that this can be a pretty big performance win. I don't know if ParAccel can already do this or not, but if not, you may want to look into it.
If you're ever in New Haven or driving through it (it is half-way between New York and Boston) please do let me know.
Even though Barry has answered your 4 questions, there is still at least one additional area that I feel is worth addressing, namely the CPU cost and overall performance of Q1.
Based on the paper of Boncz et.al., you believe that Query 1 should not be CPU bound in a column store. I have not yet read that paper, so I cannot comment, at this time, on the analysis that was used.
What I can do however is tell you of my own experiences with Q1, over many years of running TPC-H benchmarks in a variety of environments, and point out what some of the database vendors have been able to achieve with Q1.
From all of my observations I can categorically say that Q1 is one of the most, if not the most, CPU bound query in the whole TPC-H ensemble. This is not just my opinion, but the generally accepted view of others who have a long history with running TPC-H. A study of the available data also leads to the same conclusion.
For example, a look at some of the results from the only other shared nothing column store, with published TPC-H benchmarks, i.e. Exasol, reveals some very interesting data.
Exasol has a 1TB result using 96 Intel 3.16 GHz Intel X5460 quad core processors (384 cores) and a 3TB result using 160 of the same processors (644 cores). Q1 took 9.2 secs at 1TB and 15 secs at 3TB in those benchmarks.
Not much is known about Exasol. However we do know that Exasol is a highly compressed, shared nothing DBMS designed for in-memory processing. http://www.exasol.com/exasolution.html?&L=1
We don't know for sure, but it's a good bet that Exasol runs Q1 at close to 100% CPU utilization, based on the assumption that Exasol parallelizes well and that it is running in memory. Thus we have direct evidence that Q1 is extremely CPU-bound on at least one commercially available column store.
You might want to argue that maybe Exasol has not optimally implemented Q1. But when Q1 is able to consume virtually all the available CPU in every TPC-H benchmark that I've ever run, it's hard to argue that everybody has got Q1 wrong.
Now let's compare ParAccel's 30 TB Q1 response time with numbers projected from Exasol's published numbers.
Note that Q1 requires scanning almost the entire lineitem table and thus, would not benefit from the use of any kind of index, regardless of what DBMS is being used. Hence a reasonable assumption is that if we take the 3 TB number and scale it up by a factor of 10, Exasol would take roughly 150 seconds to run a 30TB Q1 on 644 cores in a memory-based environment.
By similar reasoning if we start with the 1TB number, we can project that Exasol would take 276 seconds (i.e. 30 * 9.2) to run a 30TB Q1 on 384 cores in a memory-based environment.
By contrast ParAccel actually achieved a 30TB Q1 on 43 x 2.3 GHz Quad core Opteron processors in 51 secs, in a disk-based environment.
Note these Opteron cores have roughly 55% of the CPU power of an Intel X5460 core, using published SPEC CPU results for the comparison. See http://www.exasol.com/exasolution.html?&L=1
Normalizing the results to a single X5460 core gives the following:
- ParAccel: 9,649 secs to do Q1 from disk (i.e. 51*344*.55)
- Exasol: 96,600 secs (projected from their 3TB result) to do Q1 in memory (i.e. 150*640)
- Exasol: 105,984 secs (projected from their 1TB result) to do Q1 in memory (i.e. 276*384)
Thus whichever estimate you use for Exasol, ParAccel's disk-based normalized Q1 response time is about 10 times better than Exasol's.
You may not agree, but to me this is quite impressive. I'd love to see what some of the other column stores can do.
Senior Staff Engineer, Sun Microsystems Inc.
PS. If you want a truly disk based query to analyze, Q6 is the one. It's actually a close cousin of Q1, but without most of the arithmetic computations. It differs from Q1 in that it only retrieves 1 year's data - about 1/7 of lineitem, whereas Q1 must retrieve almost all of lineitem.
Given that Barry Zane seemed to indicate in his previous comment that ParAccel does not operate directly on compressed data, and given that it is my understanding that you are the one who ran the ParAccel benchmark numbers, I'm certainly willing to believe that ParAccel was CPU-limited for query 1. Given this is the case, we can't learn much about ParAccel's I/O performance from examining this query. The problem with using query 6 instead is that since only 1/7th of lineitem is needed for the query, it's unclear how much of the table is actually read off disk. I know there were no indexes, so probably the entire shipdate column needs to be read. But if ParAccel uses "late materialization" (see http://cs-www.cs.yale.edu/homes/dna/papers/abadiicde2007.pdf) only 1/7th of the other columns need to be read (assuming I remember correctly that the generated data is automatically loosely ordered by shipdate so the 1/7th of lineitem that passes the predicate is pretty much contiguous). Since we don't know if ParAccel uses late materialization or early materialization, I am afraid to conclude anything about about ParAccel's I/O characteristics from this query. (BTW: query 6 takes ParAccel 20 seconds in the "power run" --- I could do some more back of the envelope calculations and claim that in theory a column-store that uses late materialization and operates directly on compressed data could go much faster, but I made a promise in a previous comment that I would stop speculating about what column-stores should theoretically be able to achieve until one actually runs the 30TB TPC-H benchmark and publishes results.)
The bottom line is that it sounds like we don't have enough information to conclude anything about ParAccel's I/O characteristics from just looking at the TPC-H results (though if someone has a clever way of doing this, please let me know). I guess the best way to ultimately answer this question is to run some simple (non-CPU intensive) test queries on uncompressed ParAccel data and see if the expected performance based on the I/O speed of the available disk(s) meets the actual performance of ParAccel. If ParAccel is willing to give me a copy of their software and allow me to publish the results, I'd certainly be willing to run this experiment myself and let you know what I find.
"If ParAccel is willing to give me a copy of their software and allow me to publish the results"ReplyDelete
Daniel, I'm curious if _any_ vendor has ever let you do that in the past. If so can you point me to the published results?
Well, Vertica let me benchmark their software and publish the result in my group's upcoming HadoopDB paper (which will be unveiled in a week or two --- we used Vertica as a comparison point for HadoopDB). But obviously that's a different case given my history with them and they know me and trust me to treat them fairly. I have no history with anyone at ParAccel (as far as I know), so I certainly would understand if they preferred not have me benchmark their software.ReplyDelete