Wednesday, July 29, 2009

Watch out for VectorWise

In the last few years, there have been so many new analytical DBMS startups de-cloaking that it’s difficult to keep track of them all. Just off the top of my head, I would put Aster Data, DATAllegro, Dataupia, Exasol, Greenplum, Infobright, Kickfire, ParAccel, Vertica, and XtremeData in that category. Once you add the new analytical DBMS products from established vendors (Oracle Exadata and HP Neoview) we’re at a dozen new analytical DBMS options to go along with older analytical DBMS products like Teradata (industry leader), Netezza, and Sybase IQ. Finally, with the free and open source HadoopDB, we now have (at least) sixteen analytical DBMS solutions to choose from.

Given the current overwhelming number of analytical DBMS solutions, I suspect that VectorWise’s sneak preview happening this week is not going to get the attention that it deserves. VectorWise isn’t making a splash with a flashy customer win (like Aster Data did with MySpace), or a TPC-H benchmark win (like Kickfire and ParAccel did), or an endorsement from a DBMS legend (like Vertica did). They’re not even going to market with their own solution; rather, they’re teaming up with Ingres for a combined solution (though the entire Ingres storage-layer and execution engine has been ripped out and replaced with VectorWise). But I’m telling you: VectorWise is a company to watch.

Here are the reasons why I like them:


  1. They are a column-store. I strongly believe that column-stores are the right solution for the analytical DBMS market space. They can get great compression ratios with lightweight compression algorithms, and are highly I/O efficient. In my opinion, the only reason why there are companies on the above list that are not column-stores is that they wanted to accelerate time to market by extending previously existing DBMS code, and the most readily available DBMS code at the time was a row-store. Any DBMS built from scratch for the (relational, structured data) analytical DBMS market should be a column-store.

  2. Column-stores are so I/O efficient that CPU and/or memory usually become bottlenecks very quickly. Most column-stores do very careful optimizations to eliminate these bottlenecks. But to me, VectorWise has gone the extra mile. The query operators are run via a set of query execution primitives written in low-level code that allow compilers to produce extremely efficient processing instructions. Vectors of 100-1000 values within a column get pipelined through a set of query operations on that column, with many values typically being processed in parallel by SIMD (single instruction, multiple data) capabilities of modern CPU chips. Most database systems are unable to take advantage of SIMD CPU capabilities --- the tuple-at-a-time (iterator) processing model of most database systems is just too hard for compilers to translate to SIMD instructions. VectorWise has gone to great lengths to make sure their code results in vectorized CPU processing. Their execution primitives are also written to allow CPUs to do efficient out-of-order instruction execution via loop-pipelining (although compilers are supposed to discover opportunities for loop-pipelining on their own, without carefully written code, this doesn’t happen in practice as often as it should). So with highly optimized CPU-efficient code, along with (1) operator pipelining to keep the active dataset in the cache and (2) column-oriented execution reducing the amount of data that must be shipped from memory to the CPU, VectorWise reduces the CPU and memory bottlenecks in a major way. The bottom line is that VectorWise is disk efficient AND memory efficient AND CPU efficient. This gets you the total performance package.
  3. Their founders include Peter Boncz and Marcin Zukowski from CWI. I generally think highly of the DBMS research group at CWI, except for one of their papers which ... actually ... maybe it’s better if I don’t finish this sentence. I have spoken highly about them in previous posts on my blog (see here and here).
  4. It looks likely that their solution will be released open source. I was unable to get a definite commitment from Boncz or Zukowski one way or another, but the general sense I got was that an open source release was likely. But please don’t quote me on that.
  5. If the VectorWise/Ingres solution does get released open source, I believe they will be an excellent column-store storage engine for HadoopDB. I have already requested an academic preview edition of their software to play with.
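To make the vectorized-execution idea in point 2 concrete, here is a toy sketch in Python. This is illustrative only: VectorWise's primitives are compiled low-level code operating on columns, and all function names here are invented. The point is that the vectorized operator pays interpretation overhead once per vector rather than once per tuple, and its tight inner loop over a plain array is the kind of code a C compiler can turn into SIMD instructions.

```python
VECTOR_SIZE = 1024   # VectorWise pipelines vectors of roughly 100-1000 values

# Tuple-at-a-time (iterator) model: interpretation overhead (branching,
# function-call dispatch) is paid for every single value.
def select_gt_tuple_at_a_time(column, threshold):
    out = []
    for v in column:
        if v > threshold:          # per-tuple dispatch and branching
            out.append(v)
    return out

# Vectorized model: the operator is invoked once per vector, and the
# tight loop over the vector is what a compiler can map to SIMD
# (the list comprehension stands in for a compiled execution primitive).
def select_gt_vectorized(column, threshold):
    out = []
    for start in range(0, len(column), VECTOR_SIZE):
        vec = column[start:start + VECTOR_SIZE]
        out.extend([v for v in vec if v > threshold])   # the primitive
    return out

col = [i % 100 for i in range(10_000)]   # a toy integer column
assert select_gt_vectorized(col, 90) == select_gt_tuple_at_a_time(col, 90)
```

In a compiled engine, the vectorized version also keeps the active vector in the CPU cache as it flows through a pipeline of such primitives, which is the cache-friendliness point made above.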

In the interest of full disclosure, here are a few limitations of VectorWise:


  • It is not a shared-nothing, MPP DBMS. It runs on a single machine. This limits its scalability to low numbers of terabytes. However, VectorWise is targeting the same “mass market” that Kickfire is, where the vast majority of data warehouses are less than 10TB. Furthermore, as mentioned above, it is a great candidate to be turned into a shared-nothing, parallel DBMS via the HadoopDB technology, and I look forward to investigating this opportunity further.
  • In my experience, having large amounts of low-level, CPU-optimized code is hard to maintain over time, and might limit how nimble VectorWise can be in taking advantage of new opportunities. Portability might also become a concern (in the sense that not all optimizations will work equally well on all CPUs). However, I would not put anything past such a high-quality technical team.

Two final notes:

  • I like their go-to-market strategy. Like Infobright and Kickfire, they are after the low-priced, high-volume analytical DBMS mass market. But the problem with the mass market is that you need a large global sales and support team to handle the many opportunities and customers. Startups that target the high-end have it much easier in that they can get through the early stages of the company with a few high-priced customers and don’t need to invest as much in sales and support. By getting into bed with Ingres, VectorWise gets to immediately take advantage of the global reach of Ingres, a key asset if they want to target the lower end of the market.
  • CWI is also the creator of the open source MonetDB column-store. VectorWise is a completely separate code base, and makes several philosophical departures from MonetDB. According to VectorWise, MonetDB’s materialization of large amounts of intermediate data (e.g., from running operators to completion) makes it less scalable (more suited for in-memory data sets) than VectorWise. VectorWise has superior pipelined parallelism, and vectorized execution. I have not checked with the MonetDB group to see if they dispute these claims, but my knowledge from reading the MonetDB research papers is generally in line with these statements, and my understanding is that the MonetDB and VectorWise teams remain friendly.

Monday, July 20, 2009

Announcing release of HadoopDB (longer version)

If you have a short attention span, see the shorter blog post.
If you have a longer attention span, see the complete 12-page paper.

There are two undeniable trends in analytical data management. First, the amount of data that needs to be stored and processed is exploding. This is partly due to the increased automation with which data can be produced (more business processes are becoming digitized), the proliferation of sensors and data-producing devices, Web-scale interactions with customers, and government compliance demands along with strategic corporate initiatives requiring more historical data to be kept online for analysis. It is no longer uncommon to hear of companies claiming to load more than a terabyte of structured data per day into their analytical database system and claiming data warehouses of size more than a petabyte (see the end of this write-up for some links to large data warehouses).

The second trend is what I talked about in my last blog post: the increased desire to perform more and more complex analytics and data mining inside of the DBMS.

I predict that the combination of these two trends will lead to a scalability crisis for the parallel database system industry. This prediction flies in the face of conventional wisdom. If you talk to prominent DBMS researchers, they'll tell you that shared-nothing parallel database systems horizontally scale indefinitely, with near linear scalability. If you talk to a vendor of a shared-nothing MPP DBMS, such as Teradata, Aster Data, Greenplum, ParAccel, and Vertica, they'll tell you the same thing. Unfortunately, they're all wrong. (Well, sort of.)

Parallel database systems scale really well into the tens and even low hundreds of machines. Until recently, this was sufficient for the vast majority of analytical database applications. Even the enormous eBay 6.5 petabyte database (the biggest data warehouse I've seen written about) was implemented on only a 96-node Greenplum DBMS. But as I wrote about previously, this implementation allows for only a handful of CPU cycles to be spent processing tuples as they are read off disk. As the second trend kicks in, resulting in an increased amount and complexity of data analysis that is performed inside the DBMS, this architecture will be entirely unsuitable, and will be replaced with many more compute nodes at a much larger horizontal scale. Once you add the fact that many argue it is far more efficient from a hardware cost and power utilization perspective to run an application on many low-cost, low-power machines instead of fewer high-cost, high-power machines (see, e.g., the work by James Hamilton), it will not be at all uncommon to see data warehouse deployments on many thousands of machines (real or virtual) in the future.

Unfortunately, parallel database systems, as they are implemented today, do not scale well into the realm of many thousands of nodes. There are a variety of reasons for this. First, they all compete with each other on performance. The marketing literature of MPP database systems is littered with wild claims of jaw-dropping performance relative to their competitors. These systems will also implement some amount of fault tolerance, but as soon as performance becomes a tradeoff with fault tolerance (e.g., by implementing frequent mid-query checkpointing), performance will be chosen every time. At the scale of tens to hundreds of nodes, a mid-query failure of one of the nodes is a rare event. At the scale of many thousands of nodes, such events are far more common. Some parallel database systems lose all work that has been done thus far in processing a query when a DBMS node fails; others just lose a lot of work (Aster Data might be the best amongst its competitors along this metric). However, no parallel database system (that I'm aware of) is willing to pay the performance overhead to lose a minimal amount of work upon a node failure.

Second, while it is possible to get reasonably homogeneous performance across tens to hundreds of nodes, this is nearly impossible across thousands of nodes, even if each node runs on identical hardware or on an identical virtual machine. Part failures that do not cause complete node failure but result in degraded hardware performance become more common at scale. Individual node disk fragmentation and software configuration errors can also cause degraded performance on some nodes. Concurrent queries (or, in some cases, concurrent processes) further reduce the homogeneity of cluster performance. Furthermore, we have seen wild fluctuations in node performance when running on virtual machines in the cloud. Parallel database systems tend to do query planning in advance and will assign each node an amount of work to do based on the expected performance of that node. When running on small numbers of nodes, extreme outliers from expected performance are a rare event, and it is not worth paying the extra performance overhead for runtime task scheduling. At the scale of many thousands of nodes, extreme outliers are far more common, and query latency ends up being approximately equal to the time it takes these slow outliers to finish processing.
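The straggler effect just described can be sketched with entirely made-up numbers (a hypothetical cluster with a handful of degraded nodes; none of these figures come from a real system):

```python
# Hypothetical cluster: 990 healthy nodes plus 10 degraded outliers
# running at 20% of normal speed.
speeds = [1.0] * 990 + [0.2] * 10
TASKS = 10_000            # fine-grained units of work in one query

# Static planning: each node is assigned an equal share up front,
# so query latency is gated by the slowest node.
work_per_node = TASKS / len(speeds)
static_latency = max(work_per_node / s for s in speeds)

# Runtime task scheduling (Hadoop-style): nodes pull small tasks as
# they finish, so fast nodes absorb the outliers' work. Idealized
# lower bound, ignoring scheduling overhead:
dynamic_latency = TASKS / sum(speeds)

print(static_latency, dynamic_latency)
```

With these toy numbers, the statically planned query takes about five times as long as the dynamically scheduled one, even though only 1% of the nodes are degraded; the scheduling overhead Hadoop pays buys back exactly this gap at scale.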

Third, many parallel databases have not been tested at the scale of many thousands of nodes, and in my experience, unexpected bugs in these systems start to appear at this scale.

In my opinion the "scalability problem" is one of two reasons why we're starting to see Hadoop encroach on the structured analytical database market traditionally dominated by parallel DBMS vendors (see the Facebook Hadoop deployment as an example). Hadoop simply scales better than any currently available parallel DBMS product. Hadoop gladly pays the performance penalty for runtime task scheduling and excellent fault tolerance in order to yield superior scalability. (The other reason Hadoop is gaining market share in the structured analytical DBMS market is that it is free and open source, and there exists no good free and open source parallel DBMS implementation.)

The problem with Hadoop is that it also gives up some performance in other areas where there are no tradeoffs for scalability. Hadoop was not originally designed for structured data analysis, and thus is significantly outperformed by parallel database systems on structured data analysis tasks. Furthermore, it is a relatively young piece of software and has not implemented many of the performance enhancing techniques developed by the research community over the past few decades, including direct operation on compressed data, materialized views, result caching, and I/O scan sharing.

Ideally, there would exist an analytical database system that achieves the scalability of Hadoop along with the performance of parallel database systems (at least the performance that is not the result of a tradeoff with scalability). And ideally this system would be free and open source.

That's why my students Azza Abouzeid and Kamil Bajda-Pawlikowski developed HadoopDB. It's an open source stack that includes PostgreSQL, Hadoop, and Hive, along with some glue between PostgreSQL and Hadoop, a catalog, a data loader, and an interface that accepts queries in MapReduce or SQL and generates query plans that are processed partly in Hadoop and partly in different PostgreSQL instances spread across many nodes in a shared-nothing cluster of machines. In essence it is a hybrid of MapReduce and parallel DBMS technologies. But unlike Aster Data, Greenplum, Pig, and Hive, it is not a hybrid simply at the language/interface level. It is a hybrid at a deeper, systems implementation level. Also unlike Aster Data and Greenplum, it is free and open source.
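As a rough illustration of the hybrid idea (not HadoopDB's actual code), the sketch below pushes SQL into several single-node databases and merges the partial results in a reduce step, with in-memory sqlite standing in for the per-node PostgreSQL instances:

```python
import sqlite3

# Three "nodes", each a local single-node DBMS holding one horizontal
# partition of a sales table.
partitions = [[("east", 10), ("west", 5)],
              [("east", 7)],
              [("west", 3), ("east", 1)]]

nodes = []
for rows in partitions:
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (region TEXT, amount INT)")
    db.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    nodes.append(db)

# "Map" phase: push as much of the SQL as possible into each node's
# local DBMS, where it runs with full query-engine efficiency.
def map_phase(db):
    return db.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall()

# "Reduce" phase: the MapReduce layer merges the partial aggregates.
def reduce_phase(partials):
    totals = {}
    for part in partials:
        for region, subtotal in part:
            totals[region] = totals.get(region, 0) + subtotal
    return totals

result = reduce_phase(map_phase(db) for db in nodes)
print(result)   # {'east': 18, 'west': 8}
```

The division of labor is the point: per-partition query processing happens inside a real database engine, while the MapReduce framework contributes partitioning, scheduling, and fault tolerance across nodes.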

Our paper (that will be presented at the upcoming VLDB conference in the last week of August) shows that HadoopDB gets similar fault tolerance and ability to tolerate wild fluctuations in runtime node performance as Hadoop, while still approaching the performance of commercial parallel database systems (of course, it still gives up some performance due to the above mentioned tradeoffs).

Although HadoopDB currently is built on top of PostgreSQL, other database systems can theoretically be substituted for PostgreSQL. We have successfully been able to run HadoopDB using MySQL instead, and are currently working on optimizing connectors to open source column-store database systems such as MonetDB and Infobright. We believe that switching from PostgreSQL to a column-store will result in even better performance on analytical workloads.

The initial release of the source code for HadoopDB can be found at http://db.cs.yale.edu/hadoopdb/hadoopdb.html. Although at this point this code is just an academic prototype and some ease-of-use features are yet to be implemented, I hope that this code will nonetheless be useful for your structured data analysis tasks!

Announcing release of HadoopDB (shorter version)

I'm pleased to announce the release of HadoopDB. HadoopDB is:
  1. A hybrid of DBMS and MapReduce technologies targeting analytical query workloads
  2. Designed to run on a shared-nothing cluster of commodity machines, or in the cloud
  3. An attempt to fill the gap in the market for a free and open source parallel DBMS
  4. Much more scalable than currently available parallel database systems and DBMS/MapReduce hybrid systems (see longer blog post)
  5. As scalable as Hadoop, while achieving superior performance on structured data analysis workloads
For more, see:
  1. Project Webpage
  2. More detailed blog post
  3. Complete 12-page VLDB paper

Wednesday, July 15, 2009

Sybase IQ throws its hat into the in-DBMS analytics ring

Sybase announced this week the availability of Sybase IQ 15.1. The press release made it clear that this version is all about in-DBMS analytics. Perhaps the most notable addition is the integration of the DB Lytix (a product from Fuzzy Logix) analytics library into Sybase IQ, so DB Lytix functions can be run inside of the DBMS.

It's possible that I'm looking in the wrong places, but I have seen very little attention paid to this announcement. I take this as a symptom of a very good thing: so many DBMS vendors are adding in-DBMS analytics features to their products that announcements such as this are not really news any more. In the last 18 months we've seen the announcement of the Teradata-SAS partnership (where a number of SAS functions are being implemented inside Teradata and run in parallel on Teradata's shared-nothing architecture); Netezza opening up its development platform so that Netezza partners and customers can implement complex analytical functions inside the Netezza system; and the announcement of in-database MapReduce functionality by Greenplum, Aster Data, and Vertica (though, as explained by Colin White, the vision of when MapReduce should be used --- e.g., for ETL or user queries --- varies across these three companies). Though not announced yet, I'm told that Microsoft Project Madison (the shared-nothing version of SQL Server to be released in 2010) will natively run windowed analytics functions in parallel inside the DBMS.

As a DBMS researcher, this is great news, as the DBMS is starting to become the center of the universe, the location where everything happens. Though some would argue that in-DBMS analytics has been available for decades via user-defined functions (UDFs), many agree that UDF performance has been disappointing at best, and shipping data out of the DBMS to perform complex analytics has been common practice. The reasons for this are manifold: query optimizers have trouble estimating the cost of UDFs, arbitrary user-written code is hard to automatically run in parallel, and various security and implementation bugs manifest themselves (see, e.g., Section 4.3.5 of the "A Comparison of Approaches to Large Scale Data Analysis" paper).

One interesting trend to note is that there seem to be two schools of thought emerging with different opinions on how to allow complex in-DBMSs analytics without resorting to regular UDFs. Teradata, Microsoft, Sybase, and, to an extent, Netezza, all seem to believe that providing a library of preoptimized functions distributed with the software is the way to go. This allows the vendor to build into the system the ability to run these functions in parallel across all computing resources (a shared-nothing MPP cluster in all cases except Sybase) and to make sure these functions can be planned appropriately along with other data processing operations. The other school of thought is adopted by vendors that allow customers more freedom to implement their own functions, but constrain the language in which this code is written (such as MapReduce or LINQ) to facilitate the automatic parallelization of this code inside the DBMS.
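A minimal sketch of the second school of thought: because the user's analytic is expressed in a constrained form (here, a per-row transform plus an associative, commutative combine, in the spirit of MapReduce), the engine can partition the input arbitrarily and merge partial states, with no need to analyze arbitrary UDF code. All function names below are invented for illustration:

```python
# The user supplies only these two constrained pieces:
def transform(row):          # per-row transform ("map"): extract the measure
    return row["amount"]

def combine(state, value):   # associative and commutative combine
    return state + value

# The "engine": because combine is associative and commutative, it may
# split rows across any number of partitions (stand-ins for parallel
# workers), aggregate each independently, and merge the partial states.
def parallel_apply(rows, n_partitions=4):
    parts = [rows[i::n_partitions] for i in range(n_partitions)]
    partials = []
    for part in parts:
        state = 0
        for row in part:
            state = combine(state, transform(row))
        partials.append(state)
    total = 0
    for p in partials:
        total = combine(total, p)
    return total

rows = [{"amount": a} for a in range(1, 101)]
print(parallel_apply(rows))   # 5050, regardless of how rows are partitioned
```

The same computation written as an arbitrary UDF over the whole table would give the engine no such guarantee, which is exactly why unconstrained UDFs are so hard to parallelize automatically.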

Obviously, these camps are not mutually exclusive. As in-DBMS analytics continues to grow in popularity, it is likely we'll start to see vendors adopt both options.

Whatever school of thought you prefer, it is clear that the last 18 months has seen tremendous progress for in-database analytics. Shipping data out of the DBMS for analysis never made a lot of sense, and finally, viable alternative options are emerging. Database systems are becoming increasingly powerful platforms for data processing, and this is good for everyone.

Monday, July 6, 2009

ParAccel and their puzzling TPC-H results

[Warning: the ParAccel TPC-H results referred to in this post have since been challenged by a competitor and found to be in violation of certain TPC rules (I cannot find any public disclosure of which specific rules were violated). These results have since been removed from the TPC-H Website, as of Sep 24th 2009.]

At SIGMOD last week, I was chatting with Mike Stonebraker (chatting might be the wrong verb; it was more like downloading knowledge and advice). ParAccel had a paper at SIGMOD, and the week before they had announced that they topped the TPC-H benchmark at the 30TB scale. Naturally, this resulted in ParAccel coming up in our conversation. Mike asked me a question about the ParAccel TPC-H results that floored me --- I was really disappointed I didn't think of this question myself.

Before telling you the question, let me first give you some background. My knowledge about ParAccel's TPC-H results came from reading two blog posts about it. First, there was Merv Adrian's positive post on the subject, and then there was Curt Monash's negative post. Monash's negativity stemmed largely from the seemingly unrealistic configuration of having nearly a petabyte of disk to store 30TB of data (a 32:1 disk/data ratio). Richard Gostanian, who has firsthand knowledge since he helped ParAccel get these benchmark results, responded to this criticism in a comment. The relevant excerpt of his comment is below:

"Why did ParAccel require 961 TB of disk space? The answer is they didn’t. They really only needed about 20TB (10TB compressed times 2 for mirroring). But the servers they used, which allowed for the superb price-performance they achieved, came standard with 961 TB of storage; there simply was no way to configure less storage. These Sun servers, SunFire X4540s, are like no other servers on the market. They combine (1) reasonable processing power (two Quad Core 2.3 GHz AMD Opteron processors) (2) large memory (64 GB), (3) very high storage capacity (24 or 48 TB based on 48 x 500GB or 1TB SATA disks) and 4) exceptional I/O bandwidth (2 - 3 GB/sec depending upon whether you’re working near the outer or inner cylinders) all in a small (4 RU), low cost (~$40K), power efficient (less than 1000 watts) package.

What any DBMS needs to run a disk-based TPC-H benchmark is high I/O throughput. The only economical way to achieve high disk throughput today is with a large number of spindles. But the spindles that ship with today’s storage are much larger than what they were, say, 5 years ago. So any benchmark, or application, requiring high disk throughput is going to waste a lot of capacity. This will change over the next few years as solid-state disks become larger, cheaper and more widely used. But for now, wasted storage is the reality, and should not be viewed in a negative light."



I remember thinking to myself "sounds reasonable" and agreeing with both sides. Yes, there is wasted space, but you need the large number of spindles to get the high I/O throughput and most TPC-H configurations resemble this one.

Then along came Mike Stonebraker, who asked me: "Why does ParAccel need such high I/O throughput? They're a column-store!"

This question immediately triggered my memory of all the experiments I ran on C-Store. I generally ran C-Store on a system with 2 CPUs and 4 disks getting an aggregate I/O bandwidth of 180MB/s (i.e. 90MB/s per CPU) and I rarely saw a disk-bottlenecked query. Why? (1) Because column-stores are already super I/O efficient by only reading the relevant columns for each query and (2) Because they compress data really well (I usually saw at least 5:1) so the effective I/O bandwidth was 90MB/s * 5 = 450MB/s per CPU which, for any reasonably interesting query, is faster than the CPU can generally keep up with.

Meanwhile, the ParAccel TPC-H configuration consisted of nodes with orders of magnitude more disk I/O (3GB/s from the 48 disks) yet just 4 times the CPU processing power. Doing the math: 3GB/s divided by 8 CPU cores = 375MB/s per CPU core. Gostanian said that there was a 3:1 compression ratio, so we're talking about an astounding effective 1GB/s per CPU core.

There's simply no way TPC-H queries are simple enough (except for maybe query 1) for the CPUs to process them at 1GB/s (see my related post on this subject). So it must be the case that part of that 1GB/s is being wasted reading data that will ultimately not be processed (e.g. like unused columns in a row-store). But ParAccel is a column-store! Hence the Mike Stonebraker conundrum.

At this point I was already ready to bet the house that ParAccel does not have the I/O efficiency of a standard column-store. But one who is unconvinced could still argue that disks are cheap and ParAccel was willing to pay the small amount extra for the high I/O bandwidth (which would help the simplest queries that are not CPU intensive) even though most queries would be CPU limited and the extra bandwidth would not help them.

To test this theory I wasted some time slogging through the 82-page ParAccel full disclosure report on the 30TB TPC-H results (seriously!). Query 1 is arguably the easiest query in TPC-H (it scans the single lineitem table, accesses 7 of its attributes, applies a single predicate and several very basic aggregations). There are no joins in this query and the aggregations are really easy since there are only 4 unique groups, so the aggregations can be done during the scan via a hash aggregation. Since it is so basic, this query should be disk limited on most systems. A standard column-store should be able to run this query in the time it takes to read the 7 columns off disk. There are 180 billion rows in the lineitem table, and each attribute should take up about 4 bytes (this is conservative --- with compression this number should be much less). So a total of 180 billion x 7 x 4 = approximately 5TB needs to be read off disk for this query (in reality this number will likely be smaller due to compression and memory caching). So given that there were 43 servers, each with 3GB/s I/O bandwidth, a conservative estimate of how fast this query should be run using a standard column-store is (5TB / (43*3GB/s)) or approximately 40 seconds. But this query takes ParAccel on average 275 seconds according to the report. This is further evidence that ParAccel does not have the I/O efficiency of a standard column-store.
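The back-of-envelope estimate above works out as follows (same numbers as in the text; compression and memory caching would only shrink the I/O, making the estimate conservative):

```python
# Query 1 scan estimate for a standard column-store at TPC-H 30TB scale.
rows = 180e9          # lineitem rows
cols = 7              # attributes Query 1 touches
bytes_per_value = 4   # conservative, uncompressed

bytes_to_read = rows * cols * bytes_per_value     # ~5 TB off disk
servers = 43
per_server_bw = 3e9                               # 3 GB/s per server

scan_seconds = bytes_to_read / (servers * per_server_bw)
print(round(bytes_to_read / 1e12, 2), "TB,", round(scan_seconds), "seconds")
```

Roughly 40 seconds of pure scan time, against the 275 seconds ParAccel reported: a gap of nearly 7x.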

Hence, ever since I looked into this, I've been just totally flummoxed by these puzzling numbers.

Today I finally got around to reading their SIGMOD paper describing the internals of their system. This paper was sadly sparse on details (it was only 4 pages when the SIGMOD limit was 14, and these 4 pages were not that meaty). The main thing I learned from the paper is that ParAccel has heavy PostgreSQL roots. It seems like they started with PostgreSQL and did their best to transform it into an MPP column-store. The one area where there was some amount of meat was their new optimizer's ability to handle plans with "thousands of joins". This of course raised more questions for me. Why would a start-up (with so many things that need to be implemented) be worried about queries with that many joins? Maybe there are a handful of customers that need this many, but the vast majority need far fewer. But then I got to Figure 2, where they ran queries from TPC-H and TPC-DS and the x-axis had queries with up to 42 joins. But unless I'm sorely mistaken, no TPC-H or TPC-DS query has anywhere near that number of joins.

So then I was really confused. I had four questions about ParAccel, each of which are troubling.

(1) Why did they configure their TPC-H application with such a high amount of disk I/O throughput capability when they are a column-store? (Stonebraker's question)
(2) Why did queries spend seemingly 6X more time doing I/O than a column-store should have to do?
(3) Why are they worried about queries with thousands of joins?
(4) Why do they think TPC-H/TPC-DS queries have 42 joins?

And then a theory that answers all four questions at the same time came to me. Perhaps ParAccel directly followed my advice (see option 1) on "How to create a new column-store DBMS product in a week". They're not a column-store. They're a vertically partitioned row-store (this is how column-stores were built back in the 70s before we knew any better). Each column is stored in its own separate table inside the row-store (PostgreSQL in ParAccel's case). Queries over the original schema are then automatically rewritten into queries over the vertically partitioned schema, and the row-store's regular query execution engine can be used unmodified. But now every attribute accessed by the query adds an additional join to the query plan (since the vertical partitions for each column in a table have to be joined together).
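The vertical-partitioning rewrite can be shown in a few lines of SQL (a toy example using sqlite as the row-store; the table and column names are invented, and ParAccel's actual rewriting, if my theory is even correct, is of course not public):

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Each column of a logical table t(a, b) lives in its own narrow table,
# keyed by a tuple identifier -- the 1970s-style "column-store".
db.execute("CREATE TABLE t_a (tid INT, val INT)")
db.execute("CREATE TABLE t_b (tid INT, val INT)")
db.executemany("INSERT INTO t_a VALUES (?, ?)", [(1, 10), (2, 20), (3, 30)])
db.executemany("INSERT INTO t_b VALUES (?, ?)", [(1, 5), (2, 50), (3, 7)])

# The logical query  SELECT a FROM t WHERE b > 6  is rewritten so that
# every attribute it touches contributes one vertical partition, and
# the partitions are stitched back together with joins on tid:
rewritten = """
    SELECT t_a.val
    FROM t_a JOIN t_b ON t_a.tid = t_b.tid
    WHERE t_b.val > 6
    ORDER BY t_a.tid
"""
print([r[0] for r in db.execute(rewritten)])   # [20, 30]
```

A query touching just 2 attributes already needs a join; a query touching dozens of attributes balloons into dozens of joins, which would explain both the "thousands of joins" optimizer work and the 42-join x-axis in their Figure 2.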

This immediately explains why they are worried about queries with hundreds to thousands of joins (questions 3 and 4). But it also explains why they seem to be doing much more I/O than a native column-store. Since each vertical partition is its own table, then each tuple in a vertical partition (which contains just one value) is preceded by the row-store's tuple header. In PostgreSQL this tuple header is on the order of 27 bytes. So if the column width is 4 bytes, then there is a factor of 7 extra space used up for the tuple header relative to actual user data. And if the implementation is super naive, they also will need an additional 4 bytes to store a tuple identifier for joining vertical partitions from the same original table with each other. This answers questions 1 and 2, as the factor of 6 worse I/O efficiency is now obvious.
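Putting rough numbers on that overhead (using the figures in the paragraph above; the real ratio depends on column widths, caching, and how well the stored bytes compress, which is presumably why the observed factor is closer to 6 than this raw arithmetic):

```python
# Per-value storage cost in a naive vertically partitioned row-store.
header = 27     # bytes of PostgreSQL-style tuple header per value
tid = 4         # tuple identifier used to rejoin the partitions
data = 4        # the actual 4-byte column value

header_overhead = header / data      # header vs user data: ~7x
stored = header + tid + data         # total bytes on disk per value
blowup = stored / data               # total blow-up vs a raw column

print(header_overhead, stored, blowup)
```

Either way, the stored bytes per value dwarf the useful bytes, which is exactly the kind of I/O inefficiency a native column-store is supposed to eliminate.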

If my theory is correct (and remember, I have no inside knowledge), then ParAccel has ignored everything I have done in my research on column-stores over the past 6 years. My whole PhD dissertation is about the order-of-magnitude performance improvement you get by building a query executor specifically for column-stores (instead of using a row-store query executor), which I talked about at SIGMOD last week. I can't help but take it a little personally that they have not read my papers on this subject, like my column-stores vs. row-stores paper from last year. ParAccel would be antithetical to everything I stand for.

Given how neatly my theory explains my four questions, I am pessimistic that someone will be able to recover my perception of ParAccel's product. But I would be happy if someone from ParAccel could confirm or deny what I have said, and if there are alternative explanations to my four questions, I'd love to hear them in an open forum such as comments on this blog. Please though, let's try to keep the conversation polite and civil.