Thursday, June 18, 2009

What is the right way to measure scale?

Berkeley Professor Joe Hellerstein wrote a really interesting blog post a month ago, comparing two wildly different hardware architectures for performing data analysis. He looked at the architecture Yahoo used to sort a petabyte of data with Hadoop, and the architecture eBay uses to do web and event log analysis on 6.5 petabytes of data with Greenplum.

Given just the information I gave you above, that Hadoop and Greenplum can be used to process over one petabyte of data, what can we learn about the scalability of these systems? If you ask your average DBMS academic, you’ll get an answer along the lines of “nothing at all”. They will likely tell you that the gold standard is linear scalability: if you add 10X the amount of data and 10X the hardware, your performance will remain constant. Since you need to measure how performance changes as you add data/hardware, you can’t conclude anything from a single data point.

But come on, we must be able to conclude something, right? One petabyte is a LOT of data. For example, if each letter in this blog post is about 3 millimeters wide on your computer screen (or mobile device or whatever) and each letter is stored in 1-byte ASCII, and assuming that you’ve disabled text-wrapping so that the entire blog post is on a single line, how long (in distance) would I have to make this blog post to reach 1 petabyte of data? A mile? 100 miles? Maybe it could cross the Atlantic Ocean? No! This blog post would make it all the way to the moon. And back. And then go back and forth three more times. [Edit: Oops! It's actually more than 3400 more times; you could even make it to the sun and back ten times] And eBay’s database has 6.5 times that amount of data! A petabyte is simply a phenomenal amount of data. A data analysis system must surely be able to scale reasonably well if people are using it to manage a petabyte of data. If it scaled significantly sublinearly, then it would be prohibitively expensive to add enough hardware to the system in order to get reasonable performance.

At the end of his post a month ago on the Facebook 2.5 petabyte data warehouse managed by Hadoop, Curt Monash gives the list of the largest wareshouses he’s come across as an analyst (at least those that are not NDA restricted). The DBMS software used in these warehouses were: Teradata, Greenplum, Aster Data, DATAllegro, and Vertica. What’s the commonalty? There’re all shared-nothing MPP database systems. Hadoop is also shared-nothing and MPP, though most would not call it a database system. So without detailed scalability experiments, can we use this data to verify the claim of senior database system researchers that shared-nothing MPP architectures scale better than alternative approaches (like shared-disk or shared-memory)? Not directly, but it’s pretty good evidence.

Let’s get back to the two systems Hellerstein highlighted in his blog post: the Greenplum 6.5 petabyte database and the Hadoop 1 petabyte database. One might use similar reasoning as used above to say that Greenplum scales better Hadoop. Or at least it doesn’t seem to scale worse. But let’s dig a little deeper. The architecture Hadoop used was 3800 “nodes” where each node consisted of 2 quad-core Xeons at 2.5ghz, 16GB RAM, and 4 SATA disks. The architecture Greenplum used contained only 96 nodes. Assuming each node is as Hellerstein insinuates (SunFire X4540), then each node contained 2 quad-core AMD Opterons at 2.3 GHz, 32-64GB RAM, and 48 SATA disks. So the Hadoop cluster has about 40X the number of nodes and 40X the amount of processing power (while just 3X the number of SATA disks).

Now, let’s dig even deeper. Let’s assume that each SATA disk can read data at 60MB/s. Then each SunFire node in the Greenplum cluster can scan (sequentially) a table at a rate of 48 disks X 60 MB/s = just under 3GB/s. This is inline with Sun’s claims that data can be read from a SunFire node from disk into memory at a rate of 3GB/s, so this seems reasonable. But don’t forget that Greenplum compressed eBay’s data at a rate of 70% (6.5 petabytes user data compressed to 1.95 petabytes). So that 3GB/s of bandwidth is actually an astonishing 10GB/s of effective read bandwidth.

So what can the two quad-core Opteron processors do to this 10GB/s fire hose? Well, first they have to decompress the data, and then they need to do whatever analysis is required via the SQL query (or in Greenplum’s case, alternatively a MapReduce task). The minimum case is maybe a selection or an aggregation, but with MapReduce the analysis could be arbitrarily complex. The CPUs need to do all this analysis at a rate of 10GB/s in order to keep up with the disks.

There’s an ICDE paper I highly recommend by Marcin Zukowski et. al., Super-Scalar RAM-CPU Cache Compression, that looks at this point. They ran some experiments on a single Opteron 2GHz core and found that state-of-the-art fast decompression algorithms such as LZRW1 or LZOP usually obtain 200-500MB/s decompression throughput on the 2GHz Opteron core. They introduced some super-fast light-weight decompression schemes that can do an order of magnitude better (around 3GB/s decompression on the same CPU). They calculated (see page 5) that given an effective disk bandwidth of 6GB/s, a decompression rate of 3GB/s gives them 5 CPU cycles per tuple that can be spent on additional analysis after decompression. 5 cycles! Is that even enough to do an aggregation?

The SunFire node has approximately eight times the processing power (8 Opteron cores rather than 1), but given their near-entropy compression claims, they are likely using heavier-weight compression schemes rather than the light-weight schemes of Zukowski et. al., which removes that factor of 8 more processing power with a factor of 10 slower decompression performance. So we’re still talking around 5 cycles per tuple for analysis.

The bottom line is that if you want to do advanced analysis (e.g. using MapReduce), the eBay architecture is hopelessly unbalanced. There’s simply not enough CPU power to keep up with the disks. You need more “nodes”, like the Yahoo architecture.

So which scales better? Is using the number of nodes a better proxy than size of data? Hadoop can “scale” to 3800 nodes. So far, all we know is that Greenplum can “scale” to 96 nodes. Can it handle more nodes? I have an opinion on this, but I’m going to save it for my HadoopDB post.

(I know, I know, I’ve been talking about by upcoming HadoopDB post for a while. It’s coming. I promise!)


  1. Hey Daniel: thanks for the shout-out and the analysis. Good reading.

    The Greenplum or eBay guys would have to comment (or not) on what compression schemes they're using and what kind of tuple bandwidth they're getting in their query executor. As your research shows, you can get crazy compression rates in some cases with simple schemes like RLE, so this is very data- (and method-) dependent. Entropy measures would need to take that info into account, and I don't know that info myself.

    Meanwhile, experienced folks like Oliver Ratzesberger at eBay are solving real-world problems in a production environment. So maybe we should try unwinding things the other way round: Given that eBay is deploying and paying for this cluster, what kind of workload are they running and why don't they need more CPU? That's a question that doesn't apply to the Yahoo Petasort number, which is a straight up performance/scaling benchmark (and a super impressive one at that). Clearly my post was an apples and oranges thing, but answers to questions like these will help with some rough comparison.

    FWIW, in discussing these numbers with Hadoop folks in the know I heard two consistent things. One is an assertion that eBay probably isn't actually sorting a full Petabyte, which makes a lot of sense (sounds like something to avoid unless you're stress-testing). I also heard some sheepish comments about the current costs of Java serialization in the Hadoop codebase, leavened with optimism that efficiency will improve significantly in time. Finally, Google recently described deploying 12-disk boxes, and the folks at Yahoo have told me that makes sense for them going forward. So some combo of these issues starts to tell the story of the discrepancies that cropped up that week...

  2. Hi Joe,

    I feel honored to receive a comment from you on my blog. Thanks for the scoop on Google's new 12-disk/box deployment --- I had not heard this before. I can independently confirm Hadoop's Java serialization issues.

  3. There is a 0% chance you are reading data at 60MB/s off of one SATA disk. Thus your other assumptions are way off.

    I'll believe 25MB/s peak, more like 20MB/s sustained.

  4. My estimates are in line with the Sun hardware specifications. See: