Monday, June 29, 2009

More on node scalability

I'm at SIGMOD this week, so I'll make this blog post quick:

A week and a half ago, I blogged on analytical database scalability. The basic thesis was:
(1) People are overly focused on raw data managed by the DBMS as an indicator of scalability.
(2) If you store a lot of data but don't have the CPU processing power to process it at the speed it can be read off of disk, this is a potential indication of a scalability problem not a scalability success.

I insinuated that:
(1) Scalability in terms of number of nodes might be a better measure than scalability in size of data (assuming of course, you can't get true scalability experiments as specified by the academic definition of scalability)
(2) Scaling the number of nodes is harder than people realize.

I am not so presumptuous as to assume that my blog carries enough weight to affect the actions of Aster Data's marketing department, but I was nonetheless pleased to see Aster's new marketing material (which I believe was just put online today) which specifically shows how their new appliance has more CPU processing power per usable storage space than a subset of their competitors (I'm assuming they either only showed the competitors they can beat in this metric, or their other competitors don't publish this number).

I do want to point out one interesting thing, however, about Aster Data's scalability (at least what limited knowledge we can deduce from their marketing material). For 6.25TB, 12.5TB, 25TB, 50TB, and 100TB, Aster Data's appliance has 8, 16, 32, 63, and 125 worker nodes, and 64, 128, 256, 504, and 1000 processing cores respectively. So basically, it's completely linear: double the amount of data, then double the amount of nodes and processing cores respectively. But then going from 100TB to 1PB of data (10X more data) they increase the number of worker nodes by less than 2X (to 330) and processing cores by a little less than 3X (2640). So after 100TB/100 nodes, their processing power per storage space drops off a cliff. (Aster has a note for the 1PB saying they are assuming 3.75X compression of user data at 1PB, but compression at 1PB shouldn't be any different than compression at 100TB; and if they are assuming uncompressed data at 100TB, then the comparison in processing power per storage space relative to other vendors is misleading since everyone else compresses data.)

Bottom line: scaling to 100 nodes is one thing (I know that Teradata, Vertica, and Greenplum can do this in addition to Aster Data, and I'm probably forgetting some other vendors). Scaling to 1000 nodes (and keeping performance constant) is much harder. This might explain why Aster Data only doubles the number of nodes/cores for 10X more data after they hit 100 nodes.

I will have more on why it is difficult to scale to 1000 nodes in future posts.

Thursday, June 18, 2009

What is the right way to measure scale?

Berkeley Professor Joe Hellerstein wrote a really interesting blog post a month ago, comparing two wildly different hardware architectures for performing data analysis. He looked at the architecture Yahoo used to sort a petabyte of data with Hadoop, and the architecture eBay uses to do web and event log analysis on 6.5 petabytes of data with Greenplum.

Given just the information I gave you above, that Hadoop and Greenplum can be used to process over one petabyte of data, what can we learn about the scalability of these systems? If you ask your average DBMS academic, you’ll get an answer along the lines of “nothing at all”. They will likely tell you that the gold standard is linear scalability: if you add 10X the amount of data and 10X the hardware, your performance will remain constant. Since you need to measure how performance changes as you add data/hardware, you can’t conclude anything from a single data point.

But come on, we must be able to conclude something, right? One petabyte is a LOT of data. For example, if each letter in this blog post is about 3 millimeters wide on your computer screen (or mobile device or whatever) and each letter is stored in 1-byte ASCII, and assuming that you’ve disabled text-wrapping so that the entire blog post is on a single line, how long (in distance) would I have to make this blog post to reach 1 petabyte of data? A mile? 100 miles? Maybe it could cross the Atlantic Ocean? No! This blog post would make it all the way to the moon. And back. And then go back and forth three more times. [Edit: Oops! It's actually more than 3400 more times; you could even make it to the sun and back ten times] And eBay’s database has 6.5 times that amount of data! A petabyte is simply a phenomenal amount of data. A data analysis system must surely be able to scale reasonably well if people are using it to manage a petabyte of data. If it scaled significantly sublinearly, then it would be prohibitively expensive to add enough hardware to the system in order to get reasonable performance.

At the end of his post a month ago on the Facebook 2.5 petabyte data warehouse managed by Hadoop, Curt Monash gives the list of the largest wareshouses he’s come across as an analyst (at least those that are not NDA restricted). The DBMS software used in these warehouses were: Teradata, Greenplum, Aster Data, DATAllegro, and Vertica. What’s the commonalty? There’re all shared-nothing MPP database systems. Hadoop is also shared-nothing and MPP, though most would not call it a database system. So without detailed scalability experiments, can we use this data to verify the claim of senior database system researchers that shared-nothing MPP architectures scale better than alternative approaches (like shared-disk or shared-memory)? Not directly, but it’s pretty good evidence.

Let’s get back to the two systems Hellerstein highlighted in his blog post: the Greenplum 6.5 petabyte database and the Hadoop 1 petabyte database. One might use similar reasoning as used above to say that Greenplum scales better Hadoop. Or at least it doesn’t seem to scale worse. But let’s dig a little deeper. The architecture Hadoop used was 3800 “nodes” where each node consisted of 2 quad-core Xeons at 2.5ghz, 16GB RAM, and 4 SATA disks. The architecture Greenplum used contained only 96 nodes. Assuming each node is as Hellerstein insinuates (SunFire X4540), then each node contained 2 quad-core AMD Opterons at 2.3 GHz, 32-64GB RAM, and 48 SATA disks. So the Hadoop cluster has about 40X the number of nodes and 40X the amount of processing power (while just 3X the number of SATA disks).

Now, let’s dig even deeper. Let’s assume that each SATA disk can read data at 60MB/s. Then each SunFire node in the Greenplum cluster can scan (sequentially) a table at a rate of 48 disks X 60 MB/s = just under 3GB/s. This is inline with Sun’s claims that data can be read from a SunFire node from disk into memory at a rate of 3GB/s, so this seems reasonable. But don’t forget that Greenplum compressed eBay’s data at a rate of 70% (6.5 petabytes user data compressed to 1.95 petabytes). So that 3GB/s of bandwidth is actually an astonishing 10GB/s of effective read bandwidth.

So what can the two quad-core Opteron processors do to this 10GB/s fire hose? Well, first they have to decompress the data, and then they need to do whatever analysis is required via the SQL query (or in Greenplum’s case, alternatively a MapReduce task). The minimum case is maybe a selection or an aggregation, but with MapReduce the analysis could be arbitrarily complex. The CPUs need to do all this analysis at a rate of 10GB/s in order to keep up with the disks.

There’s an ICDE paper I highly recommend by Marcin Zukowski et. al., Super-Scalar RAM-CPU Cache Compression, that looks at this point. They ran some experiments on a single Opteron 2GHz core and found that state-of-the-art fast decompression algorithms such as LZRW1 or LZOP usually obtain 200-500MB/s decompression throughput on the 2GHz Opteron core. They introduced some super-fast light-weight decompression schemes that can do an order of magnitude better (around 3GB/s decompression on the same CPU). They calculated (see page 5) that given an effective disk bandwidth of 6GB/s, a decompression rate of 3GB/s gives them 5 CPU cycles per tuple that can be spent on additional analysis after decompression. 5 cycles! Is that even enough to do an aggregation?

The SunFire node has approximately eight times the processing power (8 Opteron cores rather than 1), but given their near-entropy compression claims, they are likely using heavier-weight compression schemes rather than the light-weight schemes of Zukowski et. al., which removes that factor of 8 more processing power with a factor of 10 slower decompression performance. So we’re still talking around 5 cycles per tuple for analysis.

The bottom line is that if you want to do advanced analysis (e.g. using MapReduce), the eBay architecture is hopelessly unbalanced. There’s simply not enough CPU power to keep up with the disks. You need more “nodes”, like the Yahoo architecture.

So which scales better? Is using the number of nodes a better proxy than size of data? Hadoop can “scale” to 3800 nodes. So far, all we know is that Greenplum can “scale” to 96 nodes. Can it handle more nodes? I have an opinion on this, but I’m going to save it for my HadoopDB post.

(I know, I know, I’ve been talking about by upcoming HadoopDB post for a while. It’s coming. I promise!)

Friday, June 12, 2009

SenSage crosses a line

OK, I need to vent.

Sensage is another column-store product that I've known about for a long time. Mike Stonebraker even cited them in the original C-Store paper. But their marketing department really crossed a line in my opinion. Yesterday, a blog post under the CEO's name (I doubt the CEO actually wrote this post for reasons that will be obvious in a second) appeared that attacked MapReduce in a similar, but less inflammatory way than DeWitt and Stonebraker's famous blog post written a year and a half ago in January 2008. The basic premise of both posts:
  1. MapReduce is not new
  2. Everything you can do in MapReduce you can do with group-by and aggregate UDFs
The overlap of the argument already bothered me a little. But then I saw the text circled above:

"[...] map is like the group-by clause of an aggregate query. Reduce is analogous to the aggregate function (e.g., average or sum) that is computed over all the rows with the same group-by attribute. "

It seems like an unusual writing style to say that Map is "like" group-by and Reduce is "analogous" to aggregate. Either they're both "like" or both "analogous". But then, here's a sentence in the original blog post by DeWitt and Stonebraker:

"[...] map is like the group-by clause of an aggregate query. Reduce is analogous to the aggregate function (e.g., average) that is computed over all the rows with the same group-by attribute."

This is WAY too similar in my opinion.

Come on SenSage. Have an original opinion. And don't copy year-and-a-half old text that most of us have already read.

Stonebraker cited you in the C-Store paper. You could at least cite him and DeWitt in return when you are basically rehashing their ideas.

Wednesday, June 10, 2009

SIGMOD 2009: Get There Early and Leave Late

SIGMOD 2009 (a top tier DBMS research conference) is being held in Providence, RI, at the end of this month (the week of June 29th). Column-stores seem to be taking off as a mainstream research topic, as I've counted at least 5 talks at SIGMOD on column-stores (including an entire session!). Make sure you get to SIGMOD by (at least) Tuesday morning though, as "Query Processing Techniques for Solid State Drives", by Dimitris Tsirogiannis, Stavros Harizopoulos, Mehul Shah, Janet Wiener, and Goetz Graefe, is being presented in the first research session slot on Tuesday (research session 2). It doesn't look like it from the title, but trust me, it's a column-store paper. The controversial Vertica / parallel DBMS vs Hadoop benchmark paper is being presented in research session 5 (Sam Madden, an excellent speaker, is presenting the paper), and the aforementioned column-store session (with 3 column-store papers) is in research session 8 at the end of the day Tuesday.

But if you get there early don't leave early! I will be giving a 30-minute talk on column-stores in the awards session at the very end of the conference (I don't believe the talks in this session have been officially announced yet, but I'm indeed giving one of the talks) which goes until 5:30PM on Thursday, July 2nd.

Tuesday, June 9, 2009

CEO responds to my post on the Kickfire market

I had the unexpected and pleasant surprise of having the Kickfire CEO himself responding to my previous post on the Kickfire market (this is no doubt thanks to Curt Monash being kind enough to point his large readership to it). Even though my Kickfire post was positive on the whole, I did raise some questions about their go-to-market strategy, and I want to give more prominence to the response (beyond just a comment thread), especially since it corrected an inaccuracy in my original post. At the end of this blog posting, I will provide my own comments on the Kickfire response.

Here is the response of Kickfire CEO Bruce Armstrong, in his own words:

"Thanks for the post on Kickfire, Daniel.
Some comments:

1) We actually came out of stealth mode as a company at the April 2008 MySQL User Conference, where we announced our world records for TPC-H at 100GB and 300GB;

2) Our product went GA at the end of 2008, which we formally announced at the April 2009 MySQL User Conference along with one of our production reference customers, Mamasource, a Web 2.0 online community doing clickstream analysis that hit performance and scalability limitations with MySQL at 50GB;

3) Our focus is on the data warehouse "mass market" with databases ranging from gigabytes to low terabytes, where over 75% of deployments are today according to IDC/Computerworld survey 2008;

4) We chose MySQL as a key component because it has emerged as a standard (12 million deployments) and 3rd-most deployed database for data warehousing according to IDC;

5) While we do take over much of the processing with our column-store pluggable storage engine and parallel-processing SQL chip, we feel it's important to minimize any changes to a customer's database schema and/or application and to allow transparent interoperability with third-party tools;

6) Having come from 15 years at Teradata (and after that Sybase and Broadbase), I know that the high-end of data warehousing is very, very different from the mass market - both are techincally challenging in their own right and require very different product and go-to-market approaches;

7) Finally, regarding Oracle and whether they would "embrace" Kickfire (the question I was asked by Jason on the Frugal Friday show), we believe the data warehouse mass market could create several winners - and having recently raised $20M from top-tier silicon valley investors, we believe we have the resources to be one of them.

Thanks again for the post - we look forward to more from you and the

My comments on this response (note: please read the tone as positive and collaborative --- I might need a job one day):

(1 and 2) I went back to the TPC-H Website, and believe I was indeed incorrect about Kickfire topping TPC-H in 2007 (I might have been thinking about ParAccel instead of Kickfire). According to the Website, Kickfire topped TPC-H in April of 2008 (though assumedly the product being tested was finished sometime earlier than that in order to leave time for auditing the results, etc). That said, there still does seem to be a double launch. The second sentence of the press release from April 14th 2008 said the company "officially launches this week" while the 1st sentence of the press release from April 15th 2009 announces the launching again. But I think what Bruce is saying is that in one case it was the company and in the other case it was the product.

(3 and 4) The point of my post was that I think the market is smaller than these numbers indicate. Sure, there are a lot of MySQL deployments, but that's because it's free. The number of people actually paying for the MySQL Enterprise Edition is far less, but those are probably the people who'd be willing to pay for a solution like Kickfire's. Furthermore, as pointed out in the comment thread of the previous post, a lot of people who use MySQL for warehousing are using sharded MySQL, which is nontrivial (or at least not cheap) to port to non-shared-nothing solutions like Kickfire and Infobright. Finally, the amount of data that corporations are keeping around is increasing rapidly, and the size of data warehouses are doubling faster than Moore's law. So even if most warehouses today are pretty small, this might not be the case in the future. I'm a strong believer that MPP shared-nothing parallel solutions are the right answer for the mass market of tomorrow. Anyway, the bottom line is that I'm openly wondering if the market is actually much smaller than the IDC numbers would seem to suggest. But obviously, if Kickfire, Infobright, or Calpont achieves a large amount of success without changing their market strategy, I'll be proven incorrect.

(5) Agreed.

(6) I'd argue that Bruce's experience at Teradata gave him a lot of knowledge about the high-end market. I'm not sure it automatically gives him a lot of knowledge about the mass market. That said, he probably has more knowledge about the mass market than an academic at Yale :)

Monday, June 8, 2009

Quick Thoughts on the Greenplum EDC Announcement

The big news from today seems to be Greenplum's Enterprise Data Cloud announcement. Here are some quick thoughts about it:

  1. I totally agree that the future of high-end data warehousing is going to move away from data warehouse appliances and into software-only solutions on top of commodity hardware served from private or public clouds (appliances still might be an option in the low-end).
  2. I also totally agree that data silos and data marts are an unfortunate reality. So you may as well get them into a centralized infrastructure first, and then worry about data modelling later.
  3. I also agree with the self-service nature of the vision.
  4. I wonder what Teradata has to say. This seems very much counter to their centralized "model-first" enterprise data warehouse pitch they've been espousing for years.
  5. I wonder how Oliver Ratzesberger feels about Greenplum essentially rebranding eBay's virtual data mart idea and claiming it for themselves (though admittedly there is slightly more to the EDC vision than virtual data marts). I agree with Curt Monash when he says that you're probably not going to want to copy the data for each new self-service data mart, in which case good workload management is a must. Teradata is probably the only data warehouse system that already has the workload management needed for the EDC vision. NeoView might also have good enough workload management, but it hasn't been around very long.
  6. I wonder if Greenplum felt a little burned from their experience with their MapReduce announcement. In that case, they implemented it, tested it, and then announced it; but unfortunately they then had to share the spotlight with Aster Data which announced a nearly identical in-database MapReduce feature the same day. This time around, they've apparently decided to make the announcement first, and then do the implementation afterwards.
  7. It appears that the only part of the EDC initiative that Greenplum's new version (3.3) has implemented is online data warehouse expansion (you can add a new node and the data warehouse/data mart can incorporate it into the parallel storage/processing without having to go down). All this means is that Greenplum has finally caught up to Aster Data along this dimension. I'd argue that since Aster Data also has a public cloud version and has customers using it there, they're actually farther along the EDC initiative than Greenplum is (Greenplum says that the public cloud availability is on its road map). If I wasn't trying to avoid talking too much about Vertica in this blog (due to a potential bias) I'd go in detail about their virtualized and cloud versions at this point, but I'll stop here.

(Note: I am not associated with Aster Data or Greenplum in any way)

Sunday, June 7, 2009

Is betting on the "MySQL mass market for data warehousing" a good idea?

I came across a podcast the other day where host Ken Hess interviewed the CEO of Kickfire, Bruce Armstrong ( --- note: Armstrong doesn't come on until the 30 minute mark and I suggest skipping to that since the discussion at the 17 minute mark made me a little uncomfortable). Kickfire intrigues me since they are currently at the top of TPC-H for price performance ( at the 100 and 300GB data warehouse sizes (admittedly these are pretty small warehouses these days, but Kickfire feels that the market for small data warehouses is nothing to sneeze at). Although TPC-H has many faults, it is the best benchmark we have (as far as I know), and I've used it as the benchmark in several of my research papers.

In order for me to get excited about Kickfire, I have to ignore Mike Stonebraker's voice in my head telling me that DBMS hardware companies have been launched many times in the past are ALWAYS fail (the main reasoning is that Moore's law allows for commodity hardware to catch up in performance, eventually making the proprietary hardware overpriced and irrelevant). But given that Moore's law is transforming into increased parallelism rather than increased raw speed, maybe hardware DBMS companies can succeed now where they have failed in the past (Netezza is a good example of a business succeeding in selling proprietary DBMS hardware, though of course they will tell you that they use all commodity components in their hardware).

Anyway, the main sales pitch for Kickfire is that they want to do for data warehousing what Nvidia did for graphics processing: sell a specialized chip for data analysis applications currently running MySQL. The basic idea is that you would switch out your Dell box running MySQL with the Kickfire box, and everything else would stay the same, since Kickfire looks to the application like a simple storage engine for MySQL. You would get 100-1000X the performance of MySQL (assuming a standard storage engine like MyISAM or InnoDB) at only about twice the price of the Dell box. And, by the way, they are a column-store, which I'm a huge fan of.

But at the 50 minute mark of the above mentioned podcast, Armstrong started talking about potentially being acquired by Oracle. Although he did use the term "down the road", it struck me as a little weird to start talking about acquisition as such a young startup (it seems to me like if you want to maximize the purchase price, you need to establish yourself in the market before being acquired). It made me start wondering, maybe things aren't going as well as the Kickfire CEO makes it seem. If I remember correctly, they burst onto the scene in 2007 in topping the TPC-H rankings, launched at a MySQL conference somewhere around the middle of 2008, didn't make any customer win announcements for the whole year (as far as I recall), and then relaunched at another MySQL conference in the middle of 2009 (last month) along with (finally) a customer win announcement (Mamasource).

Maybe Kickfire is doing just fine and I'm reading way too much into the words of the Kickfire CEO. But if not, why would a company with what seems to be a high quality product be struggling? The conclusion might be: the go-to-market strategy. Kickfire has decided to target the "MySQL data warehousing mass market" and their whole strategy depends on there really being such a market. But do people really use MySQL for their data warehousing needs? My research group's experience with using MySQL to run a data warehousing benchmark for our HadoopDB project (I'll post about that later) was very negative. It didn't seem capable of high performance for the complex joins we needed in our benchmark.

Meanwhile, Infobright and Calpont have chosen similar go-to-market strategies. I don't have much more knowledge about Calpont than can be found in Curt Monash's blog (e.g.,, but I've been hearing about them for years (since they are also a column-store), and I haven't heard about any customer wins from them either. Meanwhile Infobright (another column-store that I like, and their technical team --- lead by VP of Engineering Victoria Eastwood --- are high quality and were very helpful when my research group played around with Infobright for one of our projects) recently open sourced their software which is either an act of desperation or their plan all along, depending on who you ask.

The bottom line is that I've having doubts about whether there really is a MySQL data warehousing mass market. I know this blog is still very young and does not have many readers, so there are unlikely to be any comments, but if you do have thoughts on this subject, I'd be interested to hear them.

Monday, June 1, 2009

About: DBMS Musings Blog

I've been inspired by Joe Hellerstein's blog, James Hamilton's blog, and Curt Monash's blog, and have decided to start one of my own. The goal of this blog is to:

(1) Describe some of my research in small, easy to digest entries (as an alternative to some of the more in-depth research papers that I write).

(2) Present my views on various developments in the DBMS field, both in research and in industry. My particular areas of expertise are in data warehousing/analytics, parallel databases, and cloud computing. I wrote my PhD dissertation on query execution in column-store databases and I'm reasonably knowledgeable about the commercial products in this space, including Calpont, Exasol, Infobright, Kickfire, ParAccel, Sensage, Sybase IQ, and Vertica. Due to my research history with column-stores, my bias is probably a little bit in their favor, but there are also some non-column-oriented products in the same data analytics space, such as Aster Data, Dataupia, DB2, Exadata, Greenplum, Kognitio, Microsoft's Project Madison, NeoView, Netezza, Teradata that have interesting approaches to database management and will receive some attention in this blog.

(3) Comment on and recommend some papers that I read over the course of my everyday activities as a DBMS researcher at Yale University.