<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-8899645800948009496</id><updated>2012-02-02T11:02:00.853-08:00</updated><category term='HadoopDB'/><category term='database replication'/><category term='Bigtable'/><category term='Sam Madden'/><category term='research funding'/><category term='BerkeleyDB'/><category term='SPARQL'/><category term='clickstream data'/><category term='Machine-generated data'/><category term='SHARD'/><category term='RDF'/><category term='PNUTS'/><category term='Riak'/><category term='Sherpa'/><category term='SimpleDB'/><category term='Semantic Web'/><category term='entrepreneurship'/><category term='Oracle'/><category term='Cloudera'/><category term='NoSQL'/><category term='Big Data'/><category term='CouchDB'/><category term='data growth rates'/><category term='MongoDB'/><category term='Dynamo'/><category term='CAP'/><category term='eventual consistency'/><category term='database community'/><category term='peer review'/><category term='Hadoop'/><category term='Graph Databases'/><category term='Oracle NoSQL database'/><category term='HBase'/><category term='Oracle Open World'/><category term='ACID'/><category term='PACELC'/><category term='Pregel'/><category term='research publication'/><category term='Hadapt'/><category term='Cassandra'/><title type='text'>DBMS Musings</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default?max-results=100'/><link rel='alternate' 
type='text/html' href='http://dbmsmusings.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Daniel Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>36</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-190191398605973736</id><published>2011-12-07T20:38:00.000-08:00</published><updated>2011-12-08T07:16:11.595-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Cassandra'/><category scheme='http://www.blogger.com/atom/ns#' term='PACELC'/><category scheme='http://www.blogger.com/atom/ns#' term='database replication'/><category scheme='http://www.blogger.com/atom/ns#' term='CAP'/><category scheme='http://www.blogger.com/atom/ns#' term='Riak'/><category scheme='http://www.blogger.com/atom/ns#' term='PNUTS'/><category scheme='http://www.blogger.com/atom/ns#' term='HBase'/><category scheme='http://www.blogger.com/atom/ns#' term='Dynamo'/><title type='text'>Replication and the latency-consistency tradeoff</title><content type='html'>&lt;div&gt;As 24/7 availability becomes increasingly important for modern applications, database systems are frequently replicated in order to stay up and running in the face of database server failure. It is no longer acceptable for an application to wait for a database to recover from a log on disk --- most mission-critical applications need immediate failover to a replica.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;There are several important tradeoffs to consider when it comes to system design for replicated database systems. 
The most famous one is CAP --- you have to trade off consistency vs. availability in the event of  a network partition. In this post, I will go into detail about a lesser-known but equally important tradeoff --- between latency and consistency. Unlike CAP, where consistency and availability are only traded off in the event of a network partition, the latency vs. consistency tradeoff is present even during normal operations of the system. (Note: the latency-consistency tradeoff discussed in this post is the same as the "ELC" case in my &lt;a href="http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html"&gt;PACELC post&lt;/a&gt;).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The intuition behind the tradeoff is the following: there's no way to perform consistent replication across database replicas without some level of synchronous network communication. This communication takes time and introduces latency. For replicas that are physically close to each other (e.g., on the same switch), this latency is not necessarily onerous. But replication over a WAN will introduce significant latency.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The rest of this post adds more meat to the above intuition. I will discuss several general techniques for performing replication, and show how each technique trades off latency or consistency. I will then discuss several modern implementations of distributed database systems and show how they fit into the general replication techniques that are outlined in this post.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;There are only three alternatives for implementing replication (each with several variations): (1) data updates are sent to all replicas at the same time, (2) data updates are sent to an agreed upon master node first, or (3) data updates are sent to a single (arbitrary) node first.  
Each of these three cases can be implemented in various ways; however, each implementation comes with a consistency-latency tradeoff. This is described in detail below.&lt;div&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;&lt;b&gt;Data updates are sent to all replicas at the same time&lt;/b&gt;. If updates are not first passed through a preprocessing layer or some other agreement protocol, replica divergence (a clear lack of consistency) could ensue (assuming there are multiple updates to the system that are submitted concurrently, e.g., from different clients), since each replica might choose a different order in which to apply the updates. On the other hand, if updates are first passed through a preprocessing layer, or all nodes involved in the write use an agreement protocol to decide on the order of operations, then it is possible to ensure that all replicas will agree on the order in which to process the updates, but this leads to several sources of increased latency. For the case of the agreement protocol, the protocol itself is the additional source of latency. For the case of the preprocessor, the additional sources of latency are:&lt;br /&gt;&lt;ol type="a"&gt;&lt;br /&gt;&lt;li&gt;Routing updates through an additional system component (the preprocessor) increases latency.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;The preprocessor consists of either multiple machines or a single machine. If it consists of multiple machines, an agreement protocol to decide on operation ordering is still needed across machines. 
Alternatively, if it runs on a single machine, all updates, no matter where they are initiated (potentially anywhere in the world), are forced to route all the way to the single preprocessor first, even if there is a data replica that is nearer to the update initiation location.&lt;/li&gt;&lt;br /&gt;&lt;/ol&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;b&gt;Data updates are sent to an agreed-upon location first&lt;/b&gt; (this location can be dependent on the actual data being updated) --- we will call this the “master node” for a particular data item. This master node resolves all requests to update the same data item, and the order that it picks for these updates determines the order in which all replicas perform them. After it resolves updates, it replicates them to all replica locations. There are three options for this replication:&lt;br /&gt;&lt;ol type="a"&gt;&lt;br /&gt;&lt;li&gt;The replication is done synchronously, meaning that the master node waits until all updates have made it to the replica(s) before "committing" the update. This ensures that the replicas remain consistent, but synchronous actions across independent entities (especially if this occurs over a WAN) increase latency due to the requirement to pass messages between these entities, and the fact that latency is limited by the speed of the slowest entity.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;The replication is done asynchronously, meaning that the update is treated as if it were completed before it has been replicated. Typically the update has at least made it to stable storage somewhere before the initiator of the update is told that it has completed (in case the master node fails), but there are no guarantees that the update has been propagated to replicas. 
The consistency-latency tradeoff in this case is dependent on how reads are dealt with:&lt;br /&gt;&lt;ol type="i"&gt;&lt;li&gt;If all reads are routed to the master node and served from there, then there is no reduction in consistency. However, there are several latency problems with this approach:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Even if there is a replica close to the initiator of the read request, the request must still be routed to the master node, which could potentially be physically much farther away.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;If the master node is overloaded with other requests or has failed, there is no option to serve the read from a different node. Rather, the request must wait for the master node to become free or recover. In other words, there is a potential for increased latency due to lack of load balancing options.&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;If reads can be served from any node, read latency is much better, but this can result in inconsistent reads of the same data item, since different locations have different versions of a data item while its updates are still being propagated, and a read can potentially be sent to any of these locations. Although the level of reduced consistency can be bounded by keeping track of update sequence numbers and using them to implement “sequential/timeline consistency” or “read-your-writes consistency”, these options are nonetheless reduced consistency options. Furthermore, write latency can be high if the master for a write operation is geographically far away from the requester of the write.&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;A combination of (a) and (b) is possible. Updates are sent to some subset of replicas synchronously, and the rest asynchronously. The consistency-latency tradeoff in this case again is determined by how reads are dealt with. If reads are routed to at least one node that had been synchronously updated (e.g. 
when R + W &amp;gt; N in a quorum protocol, where R is the number of nodes involved in a synchronous read, W is the number of nodes involved in a synchronous write, and N is the number of replicas), then consistency can be preserved, but the latency problems of (a), (b)(i)(1), and (b)(i)(2) are all present (though to somewhat lower degrees, since the number of nodes involved in the synchronization is smaller, and there is potentially more than one node that can serve read requests). If it is possible for reads to be served from nodes that have not been synchronously updated (e.g. when R + W &amp;lt;= N), then inconsistent reads are possible, as in (b)(ii) above.&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;Data updates are sent to an arbitrary location first&lt;/b&gt;, the updates are performed there, and are then propagated to the other replicas. The difference between this case and case (2) above is that the location that updates are sent to for a particular data item is not always the same. For example, two different updates for a particular data item can be initiated at two different locations simultaneously. The consistency-latency tradeoff again depends on two options:&lt;br /&gt;&lt;ol type="a"&gt;&lt;li&gt;If replication is done synchronously, then the latency problems of case (2)(a) above are present. Additionally, extra latency can be incurred in order to detect and resolve cases of simultaneous updates to the same data item initiated at two different locations.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;If replication is done asynchronously, then consistency problems similar to those described in cases (1) and (2)(b) above present themselves.&lt;/li&gt;&lt;/ol&gt;&lt;/ol&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;Therefore, no matter how the replication is performed, there is a tradeoff between consistency and latency. For carefully controlled replication across short distances, there exist reasonable options (e.g. 
choice (2)(a) above, since network communication latency is small in local data centers); however, for replication over a WAN, there exists no way around the significant consistency-latency tradeoff.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;To more fully understand the tradeoff, it is helpful to consider how several well-known distributed systems are placed into the categories outlined above. Dynamo, Riak, and Cassandra choose a combination of (2)(c) and (3) from the replication alternatives described above. In particular, updates generally go to the same node, and are then propagated synchronously to W other nodes (case (2)(c)). Reads are synchronously sent to R nodes with R + W typically being set to a number less than or equal to N, leading to a possibility of inconsistent reads. However, the system does not always send updates to the same node for a particular data item (e.g., this can happen in various failure cases, or due to rerouting by a load balancer), which leads to the situation described in alternative (3) above, and the potentially more substantial types of consistency shortfalls. PNUTS chooses option (2)(b)(ii) above, for excellent latency at reduced consistency. HBase chooses (2)(a) within a cluster, but gives up consistency for lower latency for replication across different clusters (using option (2)(b)).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;In conclusion, there are two major reasons to reduce consistency in modern distributed database systems, and only one of them is CAP. Ignoring the consistency-latency tradeoff of replicated systems is a great oversight, since it is present at all times during system operation, whereas CAP is only relevant in the (arguably) rare case of a network partition. 
In fact, the consistency-latency tradeoff is potentially more significant than CAP, since it has a more direct effect on the baseline operations of modern distributed database systems.&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8899645800948009496-190191398605973736?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/190191398605973736/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2011/12/replication-and-latency-consistency.html#comment-form' title='14 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/190191398605973736'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/190191398605973736'/><link rel='alternate' type='text/html' href='http://dbmsmusings.blogspot.com/2011/12/replication-and-latency-consistency.html' title='Replication and the latency-consistency tradeoff'/><author><name>Daniel Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>14</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-6181938996520194725</id><published>2011-10-04T21:57:00.000-07:00</published><updated>2011-10-07T09:28:21.076-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='SimpleDB'/><category scheme='http://www.blogger.com/atom/ns#' term='Cassandra'/><category scheme='http://www.blogger.com/atom/ns#' term='MongoDB'/><category scheme='http://www.blogger.com/atom/ns#' term='Bigtable'/><category 
scheme='http://www.blogger.com/atom/ns#' term='ACID'/><category scheme='http://www.blogger.com/atom/ns#' term='Oracle NoSQL database'/><category scheme='http://www.blogger.com/atom/ns#' term='Sherpa'/><category scheme='http://www.blogger.com/atom/ns#' term='NoSQL'/><category scheme='http://www.blogger.com/atom/ns#' term='Oracle'/><category scheme='http://www.blogger.com/atom/ns#' term='Oracle Open World'/><category scheme='http://www.blogger.com/atom/ns#' term='HBase'/><category scheme='http://www.blogger.com/atom/ns#' term='CouchDB'/><category scheme='http://www.blogger.com/atom/ns#' term='eventual consistency'/><category scheme='http://www.blogger.com/atom/ns#' term='Hadoop'/><category scheme='http://www.blogger.com/atom/ns#' term='BerkeleyDB'/><category scheme='http://www.blogger.com/atom/ns#' term='Riak'/><title type='text'>Overview of the Oracle NoSQL Database</title><content type='html'>Oracle is the clear market leader in the commercial database community, and therefore it is critical for any member of the database community to pay close attention to the new product announcements coming out of Oracle’s annual Open World conference. The sheer size of Oracle’s sales force, entrenched customer base, and third-party ecosystem instantly gives any new Oracle product the potential for very high impact. Oracle’s new products require significant attention simply because they’re made by Oracle.&lt;br /&gt;&lt;br /&gt;I was particularly eager for this year’s Oracle Open World conference, because there were rumors of two separate new Oracle products involving Hadoop and NoSQL --- two of the central research focuses of my database group at Yale --- one of them (Hadoop) also being the focus of my recent startup (&lt;a href="http://www.hadapt.com"&gt;Hadapt&lt;/a&gt;). 
Oracle’s Hadoop announcements, while very interesting from a business perspective (everyone is talking about how this “validates” Hadoop), are not so interesting from a technical perspective (the announcements seem to revolve around (1) creating a “connector” between Hadoop and Oracle, where Hadoop is used for ETL tasks, and the output of these tasks is then loaded over this connector to the Oracle DBMS and (2) packaging the whole thing into an appliance, which again is very important from a business perspective since there is certainly a market for anything that makes Hadoop easier to use, but does not seem to be introducing any technically interesting new contributions).&lt;br /&gt;&lt;br /&gt;In contrast, the Oracle NoSQL database is actually a brand new system built by the Oracle BerkeleyDB team, and is therefore very interesting from a technical perspective. I therefore spent way too much time trying to find out as much as I could about this new system from a variety of sources. There is not yet a lot of publicly available information about the system; however, there is &lt;a href="http://www.oracle.com/technetwork/database/nosqldb/learnmore/nosql-database-498041.pdf"&gt;a useful whitepaper&lt;/a&gt; written by the illustrious Harvard professor Margo Seltzer, who has been working with Oracle since they acquired her start-up in 2006 (the aforementioned BerkeleyDB).&lt;br /&gt;&lt;br /&gt;Due to the dearth of available information on the system, I thought that it would be helpful to the readers of my blog if I provided an overview of what I’ve learned about it so far. Some of the facts I state below were stated directly by Oracle; other facts are inferences that I’ve made, based on my understanding of the system architecture and implementation. 
As always, if I have made any mistakes in my inferences, please let me know, and I will fix them as soon as possible.&lt;br /&gt;&lt;br /&gt;The coolest thing about the Oracle NoSQL database is that it is not a simple copy of a currently existing NoSQL system. It is not Dynamo or SimpleDB. It is not Bigtable or HBase. It is not Cassandra or Riak. It is not MongoDB or CouchDB. It is a new system that has chosen a different point (actually --- several different points) in the system-design tradeoff space than any of the above-mentioned systems. Since it makes a different set of tradeoffs, it is entirely inappropriate to call it “better” or “worse” than any of these systems. There will be situations where the Oracle solution will be more appropriate, and there will be situations where other systems will be more appropriate.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;Overview of the system:&lt;/span&gt;&lt;br /&gt;Oracle NoSQL database is a distributed, replicated key-value store. Given a cluster of machines (in a shared-nothing architecture, with each machine having its own storage, CPU, and memory), each key-value pair is placed on several of these machines depending on the result of a hash function on the key. In particular, the key-value pair will be placed on a single master node, and a configurable number of replica nodes. All write and update operations for a key-value pair go to the master node for that pair first, and then all replica nodes afterwards. This replication is typically done asynchronously, but it is possible to request that it be done synchronously if one is willing to tolerate the higher latency costs. Read operations can go to any node if the user doesn’t mind incomplete consistency guarantees (i.e. reads might not see the most recent data), but they must be served from the master node if the user requires the most recent value for a data item (unless replication is done synchronously). 
There is no SQL interface (it is a NoSQL system after all!) --- rather it supports simple insert, update, and delete operations of key-value pairs.&lt;br /&gt;&lt;br /&gt;The following is where the Oracle NoSQL Database falls in various key dimensions:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;CAP&lt;/span&gt;&lt;br /&gt;Like many NoSQL databases, the Oracle NoSQL Database is configurable to be either C/P or A/P in CAP. In particular, if writes are configured to be performed synchronously to all replicas, it is C/P in CAP --- a partition or node failure causes the system to be unavailable for writes. If replication is performed asynchronously, and reads are configured to be served from any replica, it is A/P in CAP --- the system is always available, but there is no guarantee of consistency. [&lt;span style="font-style:italic;"&gt;Edit: Actually this configuration is really just P of CAP --- minority partitions become unavailable for writes (see comments about eventual consistency below). This violates the technical definition of "availability" in CAP. However, it is obviously the case that the system still has more availability in this case than the synchronous write configuration.&lt;/span&gt;]&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;Eventual consistency&lt;/span&gt;&lt;br /&gt;Unlike Dynamo, SimpleDB, Cassandra, or Riak, the Oracle NoSQL Database does not support eventual consistency. I found this to be extremely amusing, since Oracle’s marketing material associates  NoSQL with the BASE acronym. But the E in BASE stands for eventual consistency! So by Oracle’s own definition, their lack of support of eventual consistency means that their NoSQL Database is not actually a NoSQL Database! (In my opinion, their database is really NoSQL --- they just need to fix their marketing literature that associates NoSQL with BASE). 
My proof for why the Oracle NoSQL Database does not support eventual consistency is the following: Let’s say the master node for a particular key-value pair fails, or a network partition separates the master node from its replica nodes. The key-value pair becomes unavailable for writes for a short time until the system elects a new master node from the replicas. Writes can then continue at the new master node. However, any writes that had been submitted to the old master node, but had not yet been sent to the replicas before the master node failure (or partition), are lost. In an eventually consistent system, these old writes can be reconciled with the current state of the key-value pair after the failed node recovers its log from stable storage, or when the network partition is repaired. Of course, if replication had been configured to be done synchronously (at a cost of latency), there will not be data loss during network partitions or node failures. Therefore, &lt;span style="font-weight:bold;"&gt;there is a fundamental difference between the Oracle NoSQL database system and eventually consistent NoSQL systems: while eventually consistent NoSQL systems choose to trade off &lt;span style="font-style:italic;"&gt;consistency &lt;/span&gt;for latency and availability during failure and network partition events, the Oracle NoSQL system instead trades off &lt;span style="font-style:italic;"&gt;durability &lt;/span&gt;for latency and availability.&lt;/span&gt;  To be clear, this difference is only for inserts and updates --- the Oracle NoSQL database is able to trade off consistency for latency on read requests --- it supports similar types of timeline consistency tradeoffs as the &lt;a href="http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html"&gt;Yahoo PNUTS/Sherpa system&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style:italic;"&gt;[Two of the members of the Oracle NoSQL Database team have commented below. 
There is a little bit of a debate about my statement that the Oracle NoSQL Database lacks eventual consistency, but I stand by the text I wrote above. For more, see the comments.]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;Joins&lt;/span&gt;&lt;br /&gt;Like most NoSQL systems, the Oracle NoSQL database does not support joins. It only supports simple read, write, update, and delete operations on key-value pairs.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;Data Model&lt;/span&gt;&lt;br /&gt;The Oracle NoSQL database actually has a more subtle data model than simple key-value pairs. In particular, the key is broken down into a “major key path” and “minor key path” where all keys with the same “major key path” are guaranteed to be stored on the same physical node. I expect that the way minor keys will be used in the Oracle NoSQL database will map directly to the way column families are used in Bigtable, HBase and Cassandra. Rather than trying to gather together every possible attribute about a key in a giant “value” for the single key-value pair, you can separate them into separate key-value pairs where the “major key path” is the same for all the keys in the set of key-value pairs, but the “minor key path” will be different. This is similar to how column families for the same key in Bigtable, HBase, and Cassandra can also be stored separately. Personally, I find the major and  minor key path model to be more elegant than the column family model (I have &lt;a href="http://dbmsmusings.blogspot.com/2010/03/distinguishing-two-major-types-of_29.html"&gt;ranted against column-families&lt;/a&gt; in the past).&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;ACID compliance&lt;/span&gt;&lt;br /&gt;Like most NoSQL systems, the Oracle NoSQL database is not ACID compliant. 
Besides the durability and consistency tradeoffs mentioned above, the Oracle NoSQL database also does not support arbitrary atomic transactions (the A in ACID). However, it does support atomic operations on the same key, and even allows atomic transactions on sets of keys that share the same major key path (since keys that share the same major key path are guaranteed to be stored on the same node, atomic operations can be performed without having to worry about distributed commit protocols across multiple machines).&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;Summary&lt;/span&gt;&lt;br /&gt;The sweet spot for the Oracle NoSQL database seems to be in single-rack deployments (e.g. the Oracle Big Data appliance) with a low-latency network, so that the system can be set up to use synchronous replication while keeping latency costs of this type of replication small (and the probability of network partitions is small). Another sweet spot is wider-area deployments where the application is able to work around reduced durability guarantees. It therefore seems to compete most directly with NoSQL databases like MongoDB, which have similar sweet spots. However, the Oracle NoSQL database will need to add additional “developer-friendly” features if it wants to compete head-to-head with MongoDB. Either way, there are clearly situations where the Oracle NoSQL database will be a great fit, and I love that Oracle (in particular, the Oracle BerkeleyDB team) built this system from scratch as an interesting and technically distinct alternative to currently available NoSQL systems. 
I hope Oracle continues to invest in the system and the team behind it.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8899645800948009496-6181938996520194725?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/6181938996520194725/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2011/10/overview-of-oracle-nosql-database.html#comment-form' title='27 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/6181938996520194725'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/6181938996520194725'/><link rel='alternate' type='text/html' href='http://dbmsmusings.blogspot.com/2011/10/overview-of-oracle-nosql-database.html' title='Overview of the Oracle NoSQL Database'/><author><name>Daniel Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>27</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-7879014102783549155</id><published>2011-07-19T06:31:00.001-07:00</published><updated>2011-07-28T14:16:02.794-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Graph Databases'/><category scheme='http://www.blogger.com/atom/ns#' term='SHARD'/><category scheme='http://www.blogger.com/atom/ns#' term='Semantic Web'/><category scheme='http://www.blogger.com/atom/ns#' term='Big Data'/><category scheme='http://www.blogger.com/atom/ns#' term='SPARQL'/><category scheme='http://www.blogger.com/atom/ns#' term='Cloudera'/><category 
scheme='http://www.blogger.com/atom/ns#' term='RDF'/><category scheme='http://www.blogger.com/atom/ns#' term='HadoopDB'/><category scheme='http://www.blogger.com/atom/ns#' term='Pregel'/><category scheme='http://www.blogger.com/atom/ns#' term='Hadoop'/><title type='text'>Hadoop's tremendous inefficiency on graph data management  (and how to avoid it)</title><content type='html'>Hadoop is great.  It seems clear that it will serve as the basis of the vast majority of analytical data management within five years. Already today it is extremely popular for unstructured and polystructured data analysis and processing, since it is hard to find other options that are superior from a price/performance perspective. The reader should not take the following as me blasting Hadoop. I believe that Hadoop (with its ecosystem) is going to take over the world.&lt;br /&gt;&lt;br /&gt;The problem with Hadoop is that its strength is also its weakness. Hadoop gives the user tremendous flexibility and power to scale all kinds of different data management problems. This is obviously great. But it is this same flexibility that allows the user to perform incredibly inefficient things and not care because (a) they can simply add more machines and use Hadoop's scalability to hide inefficiency in user code, and (b) they can convince themselves that since everyone talks about Hadoop as being designed for "batch data processing" anyway, they can let their process run in the background and not care about how long it will take for it to return.&lt;br /&gt;&lt;br /&gt;Although not the subject of this post, an example of this inefficiency can be found in a &lt;a href="http://cs-www.cs.yale.edu/homes/dna/papers/split-execution-hadoopdb.pdf"&gt;SIGMOD paper&lt;/a&gt; that a bunch of us from Yale and the University of Wisconsin published 5 weeks ago. 
The paper shows that using Hadoop on structured (relational) data is at least a factor of 50 less efficient than it needs to be (an incredibly large number given how hard data center administrators work to yield less than a factor of two improvement in efficiency). As many readers of this blog already know, this factor of 50 improvement is the reason why &lt;a href="http://www.hadapt.com/"&gt;Hadapt&lt;/a&gt; was founded. But this post is not about Hadapt or relational data. In this post, the focus is on graph data, and how, if one is not careful, using Hadoop can be well over a factor of 1000 less efficient than it needs to be.&lt;br /&gt;&lt;br /&gt;Before we get into how to improve Hadoop's efficiency on graph data by a factor of 1000, let's pause for a second to comprehend how dangerous it is to let inefficiencies in Hadoop become widespread.  Imagine a world where the vast majority of data processing runs on Hadoop (a not entirely implausible scenario). If people allow these factors of 50 or 1000 to exist in their Hadoop utilization, these inefficiency factors translate directly to factors of 50 or 1000 &lt;span style="font-style:italic;"&gt;more power utilization, more carbon emissions, more data center space, and more silicon waste&lt;/span&gt;. The disastrous environmental consequences in a world where everyone standardizes on incredibly inefficient technology are downright terrifying. And this is ignoring the impact on businesses in terms of server and energy costs, and lower performance. It seems clear that developing a series of "best practices" around using Hadoop efficiently is going to be extremely important moving forward.&lt;br /&gt;&lt;br /&gt;Let's delve into the subject of graph data in more detail. Recently there was a &lt;a href="http://www.dist-systems.bbn.com/people/krohloff/papers/2011/Rohloff_Schantz_DIDC_2011.pdf"&gt;paper by Rohloff et 
al.&lt;/a&gt; that showed how to store graph data (represented in vertex-edge-vertex "triple" format) in Hadoop, and perform sub-graph pattern matching in a scalable fashion over this graph of data. The particular focus of the paper is on Semantic Web graphs (where the data is stored in RDF and the queries are performed in SPARQL), but the techniques presented in the paper are generalizable to other types of graphs. This paper and resulting system (called SHARD) have received significant publicity, including a presentation at HadoopWorld 2010, a presentation at DIDC 2011, and a &lt;a href="http://www.cloudera.com/blog/2010/03/how-raytheon-researchers-are-using-hadoop-to-build-a-scalable-distributed-triple-store/"&gt;feature on Cloudera's Website&lt;/a&gt;. In fact, it is a very nice technique. It leverages Hadoop to scale sub-graph pattern matching (something that has historically been difficult to do); and by aggregating all outgoing edges for a given vertex into the same key-value pair in Hadoop, it even scales queries in a way that is 2-3 times more efficient than the naive way to use Hadoop for the same task.&lt;br /&gt;&lt;br /&gt;The only problem is that, as shown by an &lt;a href="http://cs-www.cs.yale.edu/homes/dna/papers/sw-graph-scale.pdf"&gt;upcoming VLDB paper that we're releasing today&lt;/a&gt;, this technique is an astonishing factor of 1340 less efficient than an alternative technique for processing sub-graph pattern matching queries within a Hadoop-based system that we introduce in our paper. Our paper, led by my student, Jiewen Huang, achieves these enormous speedups in the following ways:&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Hadoop, by default, hash partitions data across nodes. In practice (e.g., in the SHARD paper) this results in data for each vertex in the graph being randomly distributed across the cluster (dependent on the result of a hash function applied to the vertex identifier). 
Therefore, data that is close to each other in the graph can end up very far away from each other in the cluster, spread out across many different physical machines. For graph operations such as sub-graph pattern matching, this is wildly suboptimal. For these types of operations, the graph is traversed by passing through neighbors of vertexes; it is hugely beneficial if these neighbors are stored physically near each other (ideally on the same physical machine).  When using hash partitioning, since there is no connection between graph locality and physical locality, a large amount of network traffic is required for each hop in the query pattern being matched (on the order of one MapReduce job per graph hop), which results in severe inefficiency. Using a clustering algorithm to graph partition data across nodes in the Hadoop cluster (instead of using hash partitioning) is a big win.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Hadoop, by default, has a very simple replication algorithm, where all data is generally replicated a fixed number of  times (e.g. 3 times) across the cluster. Treating all data equally when it comes to replication is quite inefficient.  If data is graph partitioned across a cluster, the data that is on the border of any particular partition is far more important to replicate than the data that is internal to a partition and already has all of its neighbors stored locally. This is because vertexes that are on the border of a partition might have several of their neighbors stored on different physical machines. For the same reasons why it is a good idea to graph partition data to keep graph neighbors local, it is a good idea to replicate data on the edges of partitions so that vertexes are stored on the same physical machine as their neighbors.  
Hence, allowing different data to be replicated at different factors can further improve system efficiency.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Hadoop, by default, stores data on a distributed file system (HDFS) or a sparse NoSQL store (HBase). Neither of these data stores is optimized for graph data. HDFS is optimized for unstructured data, and HBase for semi-structured data. But there has been significant research in the database community on creating optimized data stores for graph-structured data. Using a suboptimal store for the graph data is another source of tremendous inefficiency. By replacing the physical storage system with graph-optimized storage, but keeping the rest of the system intact (similar to the theme of the HadoopDB project), it is possible to greatly increase the efficiency of the system.&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;To a first degree of approximation, each of the above three improvements yields an entire order of magnitude speedup (a factor of 10). By combining them, we therefore saw the factor of 1340 improvement in performance on the identical benchmark that was run in the SHARD paper.  (For more details on the system architecture, partitioning and data placement algorithms, query processing, and experimental results, &lt;a href="http://cs-www.cs.yale.edu/homes/dna/papers/sw-graph-scale.pdf"&gt;please see our paper&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;&lt;div&gt;It is important to note that since we wanted to run the same benchmark as the SHARD paper, we used the famous Lehigh University Benchmark (LUBM) for Semantic Web graph data and queries. Semantic Web sub-graph pattern matching queries tend to contain quite a lot of constants (especially on edge labels) relative to other types of graph queries. 
The next step for this project is to extend and benchmark the system on other graph applications (the types of graphs that people tend to use systems based on Google's Pregel project today).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;In conclusion, it is perfectly acceptable to give up a little bit of efficiency for improved scalability when using Hadoop. However, once this decrease in efficiency starts to reach a factor of two, it is likely a good idea to think about what is causing this inefficiency, and attempt to find ways to avoid it (while keeping the same scalability properties).  Certainly once the factor extends beyond the factor of two (such as the enormous 1340 factor we discovered in our VLDB paper), the sheer waste in power and hardware cannot be ignored. This does not mean that Hadoop should be thrown away; however it will become necessary to package Hadoop with "best practice" solutions to avoid such unnecessarily high levels of waste.&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8899645800948009496-7879014102783549155?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/7879014102783549155/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2011/07/hadoops-tremendous-inefficiency-on.html#comment-form' title='12 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/7879014102783549155'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/7879014102783549155'/><link rel='alternate' type='text/html' href='http://dbmsmusings.blogspot.com/2011/07/hadoops-tremendous-inefficiency-on.html' title='Hadoop&apos;s tremendous inefficiency on graph data management  (and how to avoid 
it)'/><author><name>Daniel Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>12</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-6580429962009664480</id><published>2011-05-19T06:16:00.000-07:00</published><updated>2011-05-19T06:37:26.571-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='database community'/><category scheme='http://www.blogger.com/atom/ns#' term='peer review'/><category scheme='http://www.blogger.com/atom/ns#' term='Sam Madden'/><category scheme='http://www.blogger.com/atom/ns#' term='research publication'/><title type='text'>Why Sam Madden is wrong about peer review</title><content type='html'>&lt;p class="MsoNormal"&gt;Yesterday my former PhD advisor, Sam Madden, &lt;a href="http://joinedandserved.blogspot.com/2011/05/thoughts-on-value-of-peer-review-in-cs.html"&gt;wrote a blog post&lt;/a&gt; consisting of a passionate defense for the status quo in the peer review process (though he does say that the review quality needs to be improved). In an effort to draw attention to his blog (Sam is a super-smart guy, and you will get a lot out of reading his blog) I intend to start a flame war with him in this space.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;At issue: The quality of reviews of research paper submissions in the database community is deteriorating rapidly. It is clear that something needs to be fixed. Jeff Naughton offered several suggestions for how to fix the problem in his &lt;a href="http://cs-www.cs.yale.edu/homes/dna/talks/naughtonicde.pdf"&gt;ICDE keynote&lt;/a&gt;. A few days ago, I publicly supported his fifth suggestion (eliminating the review process altogether) on Twitter. Sam argued against this suggestion using five main points. 
Below I list each of Sam's points, and explain why everything he says is wrong:&lt;/p&gt; &lt;hr&gt; &lt;p class="MsoNormal"&gt;Sam's point #1: Most of the submissions aren't very good. The review process does the community a favor in making sure that these bad papers do not get published.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;My response: I think only a few papers are truly embarrassing, but who cares? Most of the videos uploaded to YouTube aren't very good. They don't in any way detract from the good videos that are uploaded. The cost of publishing a bad paper is basically zero if everybody knows that all papers will be accepted. The cost of rejecting a good paper, which then gets sent to a non-database venue and receives all kinds of publicity there, is a tremendous opportunity cost to the database community. Sam Madden should know this very well since (perhaps) his most famous paper fits in that category. The model of "accept everything and let the good submissions carry you" has always proven to be a better model than "let's have a committee of busy people who basically have zero incentive to do a good job (beyond their own ethical standards) decide what to accept" when the marginal cost of accepting an additional submission &lt;span style="mso-spacerun:yes"&gt; &lt;/span&gt;is near zero. In the Internet age, the good submissions (even from unknown authors) get their appropriate publicity with surprising speed (see YouTube, Hacker News, Quora, etc.).&lt;/p&gt;  &lt;p class="MsoNormal"&gt;Sam's point #2: If every paper is accepted, then how do we decide which papers get the opportunity to be presented at the conference? 
It seems we need a review committee at least for that.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;My response: First of all, there might be fewer submissions under the "accept everything" model, since there will not be any resubmissions, and there is incentive for people to make sure that their paper is actually ready for publication before submitting it (because the onus of making sure their paper is not an embarrassment now falls on the authors and not on the PC --- assuming once something is published, you can't take it back). So it might be possible to just let everyone give a talk (if you increase the number of tracks). However, if that is not feasible, there are plenty of other options. For example, every paper is accepted immediately; over the course of one calendar year, it sits out there in public and can be cited by other papers. The top sixty papers in terms of citations after one year get to present at the conference. This only extends the delay between submission and the actual conference by 4 months --- today there is usually an 8 month delay while papers are being reviewed, and camera-ready papers are being prepared. &lt;span style="mso-spacerun:yes"&gt; &lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal" style="tab-stops:100.9pt"&gt;Sam's point #3: Eliminating the review system will discourage people from working hard on their papers.&lt;/p&gt;  &lt;p class="MsoNormal" style="tab-stops:100.9pt"&gt;My response: I could not disagree more. Instead of having your paper reviewed by three people in private, every problem, every flaw in logic, every typo is immediately out there in the public for people to look at and comment on. 
As long as submissions cannot be withdrawn, the fear of long term embarrassment yields enough incentive for the authors to ensure that the paper is in good shape at the time of submission.&lt;/p&gt;  &lt;p class="MsoNormal" style="tab-stops:100.9pt"&gt;Sam's point #4: Having papers in top conferences is an important metric for evaluating researchers.&lt;/p&gt;  &lt;p class="MsoNormal" style="tab-stops:100.9pt"&gt;My response: This is a horrible, horrible metric, and being able to finally eliminate it might be the best outcome of switching to an "accept everything" model. Everybody knows that it is much easier to get a paper accepted that goes into tremendous depth on an extremely narrow (and ultimately inconsequential) problem than to write a broad paper that solves a higher level (and important) problem, but has less depth. The "paper counting" metric incentivizes people to write inconsequential papers. Good riddance.&lt;/p&gt;  &lt;p class="MsoNormal" style="tab-stops:100.9pt"&gt;Sam's point #5: Having papers accepted provides a form of validation, a way to measure progress and success. There is also some kind of psychological benefit.&lt;/p&gt;  &lt;p class="MsoNormal" style="tab-stops:100.9pt"&gt;My response: People who measure themselves in this way are doomed to failure. If you have&lt;span style="mso-spacerun:yes"&gt; a &lt;/span&gt;paper accepted that nobody ever reads or cites over the long term, you have made zero impact. Just because you managed to get a paper through three poor reviewers is no cause for celebration. We should be celebrating impact, not publication. Furthermore, I strongly disagree with the psychological benefit argument. Getting a paper rejected does &lt;b&gt;FAR&lt;/b&gt; more psychological damage than getting a paper accepted does good.&lt;/p&gt; &lt;hr&gt; &lt;p class="MsoNormal" style="tab-stops:100.9pt"&gt;In conclusion, it's time to eliminate the private peer review process and open it up to the public. 
All papers should be accepted for publication, and people should be encouraged to review papers in public (on their blogs, on twitter, in class assignments that are published on the Web, etc). Let the free market bring the good papers to the top and let the bad papers languish in obscurity. This is the way the rest of the Internet works. It's time to bring the database community to the Internet age. Imagine how much more research could be done if we didn't have to waste so much time of the top researchers in the world with PC duties, and revising good papers because they were improperly rejected. Imagine how many good researchers we have lost because of the psychological trauma of working really hard on a good paper, only to see it rejected. The current system is antiquated and broken, and the solution is obvious and easy to implement. It's time for a change.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8899645800948009496-6580429962009664480?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/6580429962009664480/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2011/05/why-sam-madden-is-wrong-about-peer.html#comment-form' title='23 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/6580429962009664480'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/6580429962009664480'/><link rel='alternate' type='text/html' href='http://dbmsmusings.blogspot.com/2011/05/why-sam-madden-is-wrong-about-peer.html' title='Why Sam Madden is wrong about peer review'/><author><name>Daniel 
Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>23</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-7297328660079103636</id><published>2011-03-28T19:48:00.000-07:00</published><updated>2011-03-29T06:12:13.295-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='entrepreneurship'/><category scheme='http://www.blogger.com/atom/ns#' term='Big Data'/><category scheme='http://www.blogger.com/atom/ns#' term='Hadapt'/><category scheme='http://www.blogger.com/atom/ns#' term='research funding'/><category scheme='http://www.blogger.com/atom/ns#' term='Hadoop'/><title type='text'>Why I'm doing a start-up pre-tenure</title><content type='html'>&lt;p class="MsoNormal"&gt;Thanks to the tireless work of the entire &lt;a href="http://www.hadapt.com/"&gt;Hadapt &lt;/a&gt;team, we had a very successful launch at GigaOM's Structure Big Data conference last week. In coming out of stealth, we told the world what we're doing (&lt;a href="http://www.hadapt.com/please-register-to-download-the-white-paper"&gt;in short&lt;/a&gt;, we're building the only Big Data analytical platform architected from scratch to be (1) optimized for cloud deployments and (2) closely integrated with Hadoop so you don't need those annoying connectors to non-Hadoop-based data management systems anymore; i.e. we're bringing high performance SQL to Hadoop). 
Although a lot of people knew I was involved in a start-up, several people were surprised to find out at the launch how centrally involved I am in Hadapt, and I have received a lot of questions along the lines of what Maryland professor Jimmy Lin (@lintool) tweeted last week:&lt;span class="Apple-style-span" style="color: rgb(68, 68, 68); font-family: Arial, sans-serif; font-size: 10px; line-height: 10px; "&gt; &lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal" style="margin-bottom:0in;margin-bottom:.0001pt;line-height: 9.5pt"&gt;&lt;span style="font-size:7.5pt;font-family:&amp;quot;Arial&amp;quot;,&amp;quot;sans-serif&amp;quot;; mso-fareast-font-family:&amp;quot;Times New Roman&amp;quot;;color:#444444"&gt;.@&lt;a href="http://twitter.com/daniel_abadi"&gt;&lt;span style="mso-bidi-font-size:11.0pt; color:#0084B4"&gt;daniel_abadi&lt;/span&gt;&lt;/a&gt;&lt;/span&gt;&lt;span style="font-size:7.5pt; mso-bidi-font-size:11.0pt;font-family:&amp;quot;Arial&amp;quot;,&amp;quot;sans-serif&amp;quot;;mso-fareast-font-family: &amp;quot;Times New Roman&amp;quot;;color:#444444"&gt; &lt;/span&gt;&lt;span style="font-size:7.5pt; font-family:&amp;quot;Arial&amp;quot;,&amp;quot;sans-serif&amp;quot;;mso-fareast-font-family:&amp;quot;Times New Roman&amp;quot;; color:#444444"&gt;wondering how the tenure track thing fits in with @&lt;a href="http://twitter.com/Hadapt"&gt;&lt;span style="mso-bidi-font-size:11.0pt; color:#0084B4"&gt;Hadapt&lt;/span&gt;&lt;/a&gt;&lt;/span&gt;&lt;span style="font-size:7.5pt;mso-bidi-font-size: 11.0pt;font-family:&amp;quot;Arial&amp;quot;,&amp;quot;sans-serif&amp;quot;;mso-fareast-font-family:&amp;quot;Times New Roman&amp;quot;; color:#444444"&gt; &lt;/span&gt;&lt;span style="font-size:7.5pt;font-family:&amp;quot;Arial&amp;quot;,&amp;quot;sans-serif&amp;quot;; mso-fareast-font-family:&amp;quot;Times New Roman&amp;quot;;color:#444444"&gt;(r u on leave?) - but congrats on coming out of the Ivory tower! 
:)&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal"&gt;&lt;o:p&gt; &lt;/o:p&gt;Although Jimmy did not question my sanity in his tweet, others have, so I think it is time for me to explain my (hopefully rational) decision-making process that led me to start a company while still on the tenure-track at Yale.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;A few facts to get out of the way: although I am currently on teaching leave from Yale, I am not taking a complete leave of absence, which means my tenure clock is still ticking while I'm putting all this effort into Hadapt.&lt;span style="mso-spacerun:yes"&gt;  &lt;/span&gt;The time I'm spending on Hadapt necessarily subtracts from the time I have available to spend on more traditional research activities of junior faculty (publishing papers, serving on program committees and editorial boards of publication venues, and attending conferences), which means that there is a huge risk that when I come up for tenure, if I am evaluated using traditional evaluation metrics, I will not have optimized my performance in these areas, and thereby will reduce the probability of receiving tenure. When I was considering starting Hadapt, I sent e-mails to several senior faculty members in my field and asked them if they could think of an example of a database systems professor doing a start-up while still a junior faculty member, and going on to eventually receive tenure (I desperately wanted a precedent that I could use to justify my decision). Not a single one of the people I e-mailed was able to think of such a case (in fact, one of them called the chair of my department to yell at him for even thinking of letting me start a company while still pre-tenure). Starting Hadapt is a gamble --- there's no doubt about it.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;So why am I doing it? I want my research to make impact, which to me means that my research ideas should make it into real systems that are used by real people. 
Unfortunately for me, the research I enjoy the most is research that envisions complete system designs (rather than research on individual techniques that can be applied directly to today's systems). It's hard enough to publish these system design papers; but it's almost impossible to get other people to actually adopt your design in real-world deployments unless an extensive and complete prototype is available, or your design is already proven in real-world applications. For example, there have been many papers published by academics that fall in the same general space as the Google Bigtable paper. Yet the Bigtable paper has had a tremendous amount of impact, while the other papers languish in obscurity. Why? Because when Powerset and Zvents needed to implement a scalable real-time database, they felt safer using the design suggested in the Google paper (in their respective HBase and Hypertable projects) than the design from some other academic paper that has not been proven in the real world (even if the other design is more elegant and a better fit for the problem at hand). &lt;/p&gt;  &lt;p class="MsoNormal"&gt;Therefore, if you want to publish system design papers that make impact on the real world, you seemingly only have three choices:&lt;/p&gt;  &lt;p class="MsoNormal"&gt;(1) You can use the resources in your lab to build a complete prototype of your idea. That way, when people are considering using your idea, their risk is significantly reduced by trying out your system on their application without significant upfront development cost.&lt;span style="mso-spacerun:yes"&gt;  &lt;/span&gt;Unfortunately, building a complete prototype is a much harder task than building enough of a prototype to get a paper published. It involves a ton of work to deal with all of the corner cases, and to make it work well out of the box --- this amount of work is far too much for a small handful of students to do (especially if they want to graduate before they retire). 
Therefore additional engineers must be hired to complete the prototype. In the DARPA glory days, this was possible --- I've heard stories of database projects burning over a million dollars per year to complete the engineering of an academic prototype. Unfortunately, those days are long gone. My attempts to get just one tiny programmer to build out the HadoopDB prototype were rebuffed by the National Science Foundation.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;(2) You leave academia and work for Google, Yahoo, Facebook, IBM, etc. Matt Welsh has &lt;a href="http://matt-welsh.blogspot.com/2010/11/why-im-leaving-harvard.html"&gt;discussed in significant detail&lt;/a&gt; his decision to leave Harvard and do exactly that. This is a great solution in many ways --- it increases the probability of your research making impact by orders of magnitude, and has the added bonus of eliminating a lot of the administrative time sinks inherent in academic jobs. If I didn't love other aspects of being part of an academic community so much, &lt;span style="mso-spacerun:yes"&gt; &lt;/span&gt;this is certainly what I would do.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;(3) You do a start-up. This is basically the same as choice (1), except you raise the money to build out the prototype from angel investors and venture capitalists instead of from the government (which typically funds the academic lab). The main downside is that starting a company is highly non-trivial, and you end up having to spend a lot of time in all kinds of non-technical tasks --- meeting with investors, meeting with potential customers, interviewing potential employees, investing the time to understand the market, coming up with a go-to-market strategy, attending board meetings, dealing with patents, participating in boring trade-shows, etc., etc., etc. It adds up to an extraordinary amount of time. 
It's also more competitive than academia --- there are far more people who want to see you fail in the start-up world than in academia, and some of these people go to great lengths to increase the probability of your failure. There are all kinds of hurdles that come up, and you need to have a strong will to overcome them. If it wasn't for the most determined person I have ever met, Justin Borgman, the CEO of Hadapt, we would never have made it to where we are today. It's hard to start a company, but in my mind, it was the only viable option if I wanted my three years of research on HadoopDB to make impact (Hadapt is a commercialization of the HadoopDB research project). &lt;/p&gt;  &lt;p class="MsoNormal"&gt;If it wasn't for the fact that I spent the majority of the last decade soaking up the wisdom of Mike Stonebraker, I might not have chosen option (3). But I watched as my PhD thesis on C-Store was commercialized by Vertica (which was sold last month to HP), and another one of my research projects (H-Store) was commercialized by VoltDB. Thanks to Stonebraker and the first-class engineers at Vertica, I can claim that my PhD research is in use today by Groupon, Verizon, Twitter, Zynga, and hundreds of other businesses. When I come up for tenure, I want to be able to make similar claims about my research at Yale on HadoopDB. So I'm taking the biggest gamble of my career to see that happen. I just hope that the people writing letters for me at tenure time take my contributions to Hadapt into consideration when they are evaluating the impact I have made on the database systems field. I know that this will require a departure from the traditional way junior faculty are evaluated, but it's time to increase the value we place on building real, usable systems. 
Otherwise, there'll be no place left in academia for systems researchers.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/p&gt;&lt;p class="MsoNormal"&gt;&lt;o:p&gt;&lt;br /&gt;&lt;/o:p&gt;&lt;/p&gt;  &lt;p class="MsoNormal"&gt;[Note: Hadapt has &lt;a href="http://www.hadapt.com/news"&gt;successfully raised&lt;/a&gt; a round of financing and is hiring. If you have experience building real systems, especially database systems --- or even if you have built reasonably complex academic prototypes --- &lt;span style="mso-spacerun:yes"&gt; &lt;/span&gt;please send an e-mail to hackers@hadapt.com.&lt;span style="mso-spacerun:yes"&gt;  &lt;/span&gt;I personally&lt;span style="mso-spacerun:yes"&gt;  &lt;/span&gt;read every e-mail that goes to that address.]&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8899645800948009496-7297328660079103636?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/7297328660079103636/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2011/03/why-im-doing-start-up-pre-tenure.html#comment-form' title='18 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/7297328660079103636'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/7297328660079103636'/><link rel='alternate' type='text/html' href='http://dbmsmusings.blogspot.com/2011/03/why-im-doing-start-up-pre-tenure.html' title='Why I&apos;m doing a start-up pre-tenure'/><author><name>Daniel Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>18</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-4846552301281050904</id><published>2011-02-15T06:14:00.000-08:00</published><updated>2011-02-15T10:38:52.785-08:00</updated><title type='text'>Why I no longer trust EMC [Update: maybe they are not so bad]</title><content type='html'>&lt;p class="MsoNormal"&gt;[Update: After publishing this blog post I received a very pleasant phone call from two representatives from Mozy informing me they had managed to recover my data. See the end of this blog post for more details.]&lt;br /&gt;&lt;/p&gt;&lt;p class="MsoNormal"&gt;It's possible to argue that my entire research agenda over the past few years has focused on cloud computing. &lt;a href="http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-longer.html"&gt;HadoopDB &lt;/a&gt;can be thought of as a large scale analytical database system for the cloud. My work on &lt;a href="http://dbmsmusings.blogspot.com/2010/08/problems-with-acid-and-how-to-fix-them.html"&gt;database determinism&lt;/a&gt; that focuses on building horizontally scalable database systems &lt;span style=""&gt; &lt;/span&gt;is entirely motivated by the elastic scalability of the cloud. In order for this research to make impact, "the cloud" needs to be more than a temporary phenomenon. Therefore, I feel quite invested in the success or failure of the cloud.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;One common argument people make against the cloud (amongst others) is that if you put your data in the cloud, you are losing control over your data. If the cloud provider does not have appropriate processes in place to safeguard data, it's quite possible that your data could get corrupted or lost. This is problematic since most users do not get to see the internal processes, so they need to (to some extent) blindly trust the cloud provider --- a tricky proposition for many people. 
The way I usually answer this criticism is that a competitive business climate will solve this problem --- the companies that have bad processes will lose data and go out of business, and the ones that have more safeguards in place will win.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;However, the above argument only works if cases of data loss get publicized so that the companies that lose data will lose business. Since I recently went through the horrible experience of losing data I put in the cloud, I feel obligated to share this experience on this blog.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;About a year ago I felt that if I was going to go around talking about how great the cloud was, I should at least be using a cloud data backup service for my PC. I ended up deciding between Mozy and Dropbox, and went with Mozy because it was owned by EMC. I figured that EMC was a trustworthy company, and that they understood storage and the cloud better than most. I figured I would start out with the free version, and then would upgrade to the paid version when I ran out of space.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;Around 2 months ago, the hard drive on my Sony Vaio laptop failed. Since the laptop was owned by Yale, I had to go through the Yale processes to get it replaced. It turned out to be a nightmare because Yale did not buy the laptop directly from Sony, but went through an intermediary organization. Although the laptop was under warranty, neither Sony nor the intermediary organization was willing to take responsibility for following through on the warranty. This caused significant delays in getting the hard drive replaced, especially during the holiday season at the end of the semester. &lt;/p&gt;  &lt;p class="MsoNormal"&gt;After around two months, my laptop was finally returned with a new hard drive. 
I was excited that I had an opportunity to take advantage of my EMC Mozy backup for the first time --- theoretically they should have been able to recover all my files and put them in the same places where they existed on my laptop before it failed.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;When I went to log in, Mozy claimed that I had the wrong password. I tried again. And again. Mozy would not let me in. Finally, I gave up and clicked on "Forgot my password" and Mozy claimed to reset it and send it to me. But I never received an e-mail. So I tried again. And again. Still no e-mails. I e-mailed support. Four days later, I still had not received a response. I e-mailed again --- another few days and still no response. At this point I was getting desperate --- it had been a week and I had no way of logging in to retrieve my files. Since I wasn't a "MozyPro" customer, all my attempts to call up support were rebuffed. I tried calling up the Mozy sales number to see if they could help me, but they were unable to. I tried the online chat, and they were unable to as well, but suggested that I email "forgot@mozy.com" to try to get my password reset manually that way.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;This last suggestion worked, and I was finally able to log in. But to my horror, all my files were gone! It's hard to describe the despair as one starts to realize that one put the trust in the wrong place and all files created in the last year might actually be gone. 
I e-mailed support and this time received a much faster response:&lt;/p&gt;  &lt;p class="MsoNormal"&gt;"Mozy may terminate your account and these Terms immediately and without notice if your computer fails to access the Services to perform a backup for more than thirty (30) days or you fail to comply with these Terms."&lt;/p&gt;  &lt;p class="MsoNormal"&gt;So, because it took so long to get my computer replaced (the whole reason why I was using Mozy in the first place), Mozy decided to delete my account (without telling me). I e-mailed support back and asked if there was any way to recover the files even though the account was deleted. They wrote back:&lt;/p&gt;  &lt;p class="MsoNormal"&gt;"I wish there was something I could do for you. I have even checked with an L2 tech to try and get the files back and he said that he was not able to recover them."&lt;/p&gt;  &lt;p class="MsoNormal"&gt;So I trusted EMC Mozy to back up my files, and they decided to delete them. And they do not have the processes in place to recover them. This is not how the cloud is supposed to work. Clearly EMC does not understand the cloud. I hope that anybody reading this blog does not make the same mistake: EMC's cloud services are not trustworthy. If you have similar stories, please share them with me --- cloud providers need to feel pressure not to arbitrarily delete data without first warning their customers. Otherwise the cloud cannot work.&lt;br /&gt;&lt;/p&gt;&lt;p class="MsoNormal"&gt;[Update: It turns out that EMC Mozy does have important safeguards in place. After coming across this blog post, several members of the Mozy technical team met with each other to try to understand what happened, managed to recover my data, and called me afterward. Here's the scoop: the Mozy software is designed to notify you before your account is deleted. The problem was that my computer with the Mozy software installed had failed, so the software couldn't notify me. 
Mozy does indeed wait six months before deleting an account, but for me, due to a weird corner case involving a second computer that I had stopped using but that had previously been backing up to Mozy, the six-month clock had started ticking in July. Thus, the timing of my second computer failing was really unlucky. However, since they did have safeguards in place, they did manage to recover this deleted data. I am obviously really grateful that they went to this great effort. They told me on the phone that they learned from this experience and are making improvements as a result --- most notably to do more than rely on the software to notify a user before the account is deleted. Given how helpful and straightforward the Mozy employees were over the course of this phone call, I wholeheartedly believe that they really are going to fix this issue. Hence, I have no qualms recommending Mozy to other people moving forward. Again, the most important thing was that there were safeguards in place --- obviously it took some additional motivation for these safeguards to be used, but as long as they exist, I feel comfortable using cloud storage moving forward.]&lt;br /&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8899645800948009496-4846552301281050904?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/4846552301281050904/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2011/02/why-i-no-longer-trust-emc.html#comment-form' title='28 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/4846552301281050904'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/4846552301281050904'/><link 
rel='alternate' type='text/html' href='http://dbmsmusings.blogspot.com/2011/02/why-i-no-longer-trust-emc.html' title='Why I no longer trust EMC [Update: maybe they are not so bad]'/><author><name>Daniel Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>28</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-3632241812827225146</id><published>2010-12-30T13:53:00.000-08:00</published><updated>2010-12-30T14:13:12.045-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Machine-generated data'/><category scheme='http://www.blogger.com/atom/ns#' term='data growth rates'/><category scheme='http://www.blogger.com/atom/ns#' term='Big Data'/><category scheme='http://www.blogger.com/atom/ns#' term='clickstream data'/><title type='text'>Machine vs. human generated data</title><content type='html'>Curt Monash has &lt;a href="http://www.dbms2.com/2010/12/30/examples-and-definition-of-machine-generated-data/"&gt;recently been discussing&lt;/a&gt; the differences between machine-generated data and human-generated data, and trying to define these terms on his blog. I think this is a good subject to dive into, since I frequently use the existence of machine-generated data to justify to myself why 90% of my research cycles are spent on scalability problems in database systems. Rather than try to fit a response as a comment on his post, I thought I would devote a post to this subject here.&lt;br /&gt;&lt;br /&gt;In short, the following are the main reasons why machine-generated data is important:&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Machines are capable of producing data at very high rates. 
In the time it took you to read this sentence, my three-year-old laptop could have produced the entire works of Shakespeare.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;The human population is not growing anywhere near as fast as Moore’s law. In the last decade, the world’s population has increased by about 20%. Meanwhile transistor counts (and also hard-disk capacity, since it increases at roughly the same rate) have increased by over 2000%. If all data was closely tied to human actions, then the “Big Data” research area would be a dying field, as technological advancements would eventually render today’s “Big Data” minuscule, and there would be no new “Big Data” to take its place. (All this assumes that women don’t start to routinely give birth to 15 children, and nobody figures out how to perform human cloning in a scalable fashion). No researcher dreams of writing papers that make only a temporary impact. &lt;span style="font-weight: bold;"&gt;With machine-generated data, we have the potential for data generation to increase at the same rate as machines are getting faster&lt;/span&gt;, which means that “Big Data” today will still be “Big Data” tomorrow (even though the definition of “Big” will be adjusted).&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;The predicted demise of the magnetic hard disk in favor of solid-state alternatives will not come as fast as some people think. As long as hard disk capacity maintains pace with the rate of machine-generated data generation, it will remain the most cost-efficient option for machine-generated “Big Data” (at least until &lt;a href="http://en.wikipedia.org/wiki/Racetrack_memory"&gt;race-track memory&lt;/a&gt; becomes a viable candidate). Yes, I/O bandwidth does not increase at the same rate as capacity, but if the machine-generated data is to be kept around, the biggest of “Big Data” databases will need the high capacity of hard disks, at least at a low tier of storage. 
Which means that we must remain conscious of disk-speed limitations when it comes to complete data scans. &lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;Curt attempts to define “machine-generated data” in his post as the following:&lt;br /&gt;&lt;blockquote style="color: rgb(0, 0, 153); font-style: italic;"&gt;&lt;br /&gt;Machine-generated is data that was produced entirely by machines OR data that is more about observing humans than recording their choices.&lt;/blockquote&gt;&lt;br /&gt;He then goes on to include Web log data (including user clickstream logs), and social media and gaming records data as examples of machine-generated data.&lt;br /&gt;&lt;br /&gt;If you agree with the three reasons listed above on why machine-generated data is important, then there is a problem with both the above definition of machine-generated data and the examples. Clickstream data and social media/gaming data are fundamentally different from environmental sensor data that has no human involvement whatsoever. Certainly the scale of clickstream and gaming datasets is much larger than the scale of other human-generated datasets such as point of sale data (humans can make clicks on the Internet or in a computer game at a much faster rate than they can buy things, or write things down). And certainly, for every human click, there might be 5X more network log data (as Monash writes about in his post). But ultimately, without humans making clicks, there would be no data, and as long as the additional machine-generated data is linearly related to each human action (e.g. 
this 5X number remains relatively constant over time) then these datasets are not always going to be “Big Data”, for the reasons described in point (2) above.&lt;br /&gt;&lt;br /&gt;The basic source of confusion here is that clickstream datasets and social gaming datasets are some of the biggest datasets known to exist (eBay, Facebook, and Yahoo’s multi-petabyte clickstream data warehouses are known to be amongst the largest data warehouses in the world). Since machines are well-known to have the ability to produce data at a faster rate than humans, it is easy to fall into the trap of thinking that these huge datasets are machine generated.&lt;br /&gt;&lt;br /&gt;However, these datasets are not increasing at the same rate that machines are getting faster. It might seem that way since the companies that broadcast the size of their datasets are getting larger and gaining users at a rapid pace, and these companies are deciding to throw away less data, but over the long term the rate of increase of these datasets must slow down due to the human limitation. 
This makes them less interesting for the future of “Big Data” research.&lt;br /&gt;&lt;br /&gt;I don’t necessarily have a better way to define machine-generated data, but I’ll end this blog post with my best attempt:&lt;br /&gt;&lt;br /&gt;&lt;blockquote style="font-style: italic;"&gt;&lt;span style="font-weight: bold;"&gt;Machine-generated data&lt;/span&gt; is data that is generated as a result of a decision of an independent computational agent or a measurement of an event that is not caused by a human action.&lt;br /&gt;&lt;br /&gt;&lt;/blockquote&gt;&lt;blockquote style="font-style: italic;"&gt;&lt;span style="font-weight: bold;"&gt;Machine-generated “Big Data”&lt;/span&gt; is machine-generated data whose rate of generation increases with the speed of the underlying hardware of the machines that generate it.&lt;/blockquote&gt;&lt;br /&gt;Under this definition, stock trade data (independent computational agents), environmental sensor data, RFID data, and satellite data all fall under the category of machine-generated data. An interesting debate could form over whether genomic sequencing data is machine-generated or not. 
To the extent that DNA and mRNA are being produced outside of humans, I think it is fair to put genomic sequencing data under the machine-generated category as well.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8899645800948009496-3632241812827225146?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/3632241812827225146/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2010/12/machine-vs-human-generated-data.html#comment-form' title='11 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/3632241812827225146'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/3632241812827225146'/><link rel='alternate' type='text/html' href='http://dbmsmusings.blogspot.com/2010/12/machine-vs-human-generated-data.html' title='Machine vs. 
human generated data'/><author><name>Daniel Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>11</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-2262636218329754865</id><published>2010-08-31T11:20:00.000-07:00</published><updated>2010-09-01T10:49:45.322-07:00</updated><title type='text'>The problems with ACID, and how to fix them without going NoSQL</title><content type='html'>&lt;p class="MsoNormal" style="margin-bottom:0in;margin-bottom:.0001pt;line-height: normal;mso-layout-grid-align:none;text-autospace:none"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;(This post is coauthored by &lt;/span&gt;&lt;/span&gt;&lt;a href="http://cs.yale.edu/homes/thomson/"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;Alexander Thomson&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt; and &lt;/span&gt;&lt;/span&gt;&lt;a href="http://cs-www.cs.yale.edu/homes/dna/"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;Daniel Abadi&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;) &lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p class="MsoNormal" style="margin-bottom:0in;margin-bottom:.0001pt;line-height: normal;mso-layout-grid-align:none;text-autospace:none"&gt;&lt;span class="Apple-style-span"  
style="font-size:large;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;It is a poorly kept secret that NoSQL is not really about eliminating SQL from database systems (e.g., &lt;/span&gt;&lt;/span&gt;&lt;a href="http://cacm.acm.org/blogs/blog-cacm/50678-the-nosql-discussion-has-nothing-to-do-with-sql/fulltext"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;see&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;a href="http://www.kellblog.com/2010/02/24/the-database-tea-party-the-nosql-movement/"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;these&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;a href="http://www.dbms2.com/2010/03/13/the-naming-of-the-foo/"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt; links&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;). 
Rather, systems such as &lt;/span&gt;&lt;/span&gt;&lt;a href="http://labs.google.com/papers/bigtable.html"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;Bigtable&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;, &lt;/span&gt;&lt;/span&gt;&lt;a href="http://hbase.apache.org/"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;HBase&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;, &lt;/span&gt;&lt;/span&gt;&lt;a href="http://hypertable.org/"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;Hypertable&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;, &lt;/span&gt;&lt;/span&gt;&lt;a href="http://cassandra.apache.org/"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;Cassandra&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;, &lt;/span&gt;&lt;/span&gt;&lt;a href="http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;Dynamo&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:'times new 
roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;, &lt;/span&gt;&lt;/span&gt;&lt;a href="http://aws.amazon.com/simpledb/"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;SimpleDB&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt; (and a host of other key-value stores), &lt;/span&gt;&lt;/span&gt;&lt;a href="http://research.yahoo.com/files/pnuts.pdf"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;PNUTS/Sherpa&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;, etc. are mostly concerned with system scalability. It turns out to be quite difficult to scale traditional, ACID-compliant relational database systems on cheap, shared-nothing scale-out architectures, and thus these systems drop some of the ACID guarantees in order to achieve shared-nothing scalability (letting the application developer handle the increased complexity that programming over a non-ACID compliant system entails). In other words, NoSQL really means NoACID.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p class="MsoNormal" style="margin-bottom:0in;margin-bottom:.0001pt;line-height: normal;mso-layout-grid-align:none;text-autospace:none"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;Our objective in this post is to explain why ACID is hard to scale. 
At the same time, we argue that NoSQL/NoACID is the lazy way around these difficulties---it would be better if the particular problems that make ACID hard to scale could be overcome. This is obviously a hard problem, but we have a few new ideas about where to begin.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p class="MsoNormal" style="margin-bottom:0in;margin-bottom:.0001pt;line-height: normal;mso-layout-grid-align:none;text-autospace:none"&gt;&lt;span style="Times New Roman&amp;quot;,&amp;quot;serif&amp;quot;font-family:&amp;quot;;"&gt;&lt;/span&gt;&lt;b style="mso-bidi-font-weight: normal"&gt;&lt;span style="Times New Roman&amp;quot;,&amp;quot;serif&amp;quot;font-family:&amp;quot;;"&gt;ACID, scalability and replication&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;&lt;p class="MsoNormal" style="margin-bottom:0in;margin-bottom:.0001pt;line-height: normal;mso-layout-grid-align:none;text-autospace:none"&gt;&lt;b&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;For large transactional applications, it is well known that scaling out on commodity hardware is far cheaper than scaling up on high-end servers. 
Most of the largest transactional applications therefore use a shared-nothing architecture where data is divided across many machines and each transaction is executed at the appropriate one(s).&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p class="MsoNormal" style="margin-bottom:0in;margin-bottom:.0001pt;line-height: normal;mso-layout-grid-align:none;text-autospace:none"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;The problem is that if a transaction accesses data that is split across multiple physical machines, guaranteeing the traditional ACID properties becomes increasingly complex: ACID's atomicity guarantee requires a distributed commit protocol (such as two-phase commit) across the multiple machines involved in the transaction, and its isolation guarantee insists that the transaction hold all of its locks for the full duration of that protocol. Since many of today's OLTP workloads are composed of fairly lightweight transactions (each involving less than 10 microseconds of actual work), tacking a couple of network round trips onto every distributed transaction can easily mean that locks are held for orders of magnitude longer than the time each transaction really spends updating its locked data items. 
This can result in skyrocketing lock contention between transactions, which can severely limit transactional throughput.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p class="MsoNormal" style="margin-bottom:0in;margin-bottom:.0001pt;line-height: normal;mso-layout-grid-align:none;text-autospace:none"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;In addition, high availability is becoming ever more crucial in scalable transactional database systems, and is typically accomplished via replication and automatic fail-over in the case of a crash. The developer community has therefore come to expect ACID's consistency guarantee (originally promising local adherence to user-specified invariants) to also imply strong consistency between replicas (i.e. 
replicas are identical copies of one another, as in the CAP/&lt;/span&gt;&lt;/span&gt;&lt;a href="http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;PACELC&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt; sense of the word consistency).&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p class="MsoNormal" style="margin-bottom:0in;margin-bottom:.0001pt;line-height: normal;mso-layout-grid-align:none;text-autospace:none"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;Unfortunately, strongly consistent replication schemes either come with high overhead or incur undesirable tradeoffs. Early approaches to strongly consistent replication attempted to synchronize replicas during transaction execution. Replicas executed transactions in parallel, but implemented some protocol to ensure agreement about any change in database state before committing any transaction. 
Because of the latency involved in such protocols (and due to the same contention issue discussed above in relation to scalability), synchronized active replication is seldom used in practice today.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p class="MsoNormal" style="margin-bottom:0in;margin-bottom:.0001pt;line-height: normal;mso-layout-grid-align:none;text-autospace:none"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;Today's solution is usually post-write replication, where each transaction is executed first at some primary replica, and updates are propagated to other replicas after the fact. Basic master-slave/log-shipping replication is the simplest example of post-write replication, although other schemes which first execute each transaction at one of multiple possible masters fall under this category. In addition to the possibility of stale reads at slave replicas, these systems suffer a fundamental latency-durability-consistency tradeoff: either a primary replica waits to commit each transaction until receiving acknowledgement of sufficient replication, or it commits upon completing the transaction. 
In the latter case, either in-flight transactions are lost upon failure of the primary replica, threatening durability, or they are retrieved only after the failed node has recovered, while transactions executed on other replicas in the meantime threaten consistency in the event of a failure.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p class="MsoNormal" style="margin-bottom:0in;margin-bottom:.0001pt;line-height: normal;mso-layout-grid-align:none;text-autospace:none"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;In summary, it is really hard to guarantee ACID across scalable, highly available, shared-nothing systems due to complex and high overhead commit protocols, and difficult tradeoffs in available replication schemes.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p class="MsoNormal" style="margin-bottom:0in;margin-bottom:.0001pt;line-height: normal;mso-layout-grid-align:none;text-autospace:none"&gt;&lt;span style="Times New Roman&amp;quot;,&amp;quot;serif&amp;quot;font-family:&amp;quot;;"&gt;&lt;/span&gt;&lt;b style="mso-bidi-font-weight: normal"&gt;&lt;span style="Times New Roman&amp;quot;,&amp;quot;serif&amp;quot;font-family:&amp;quot;;"&gt;The NoACID solution&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;&lt;p class="MsoNormal" style="margin-bottom:0in;margin-bottom:.0001pt;line-height: normal;mso-layout-grid-align:none;text-autospace:none"&gt;&lt;b&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;Designers of NoSQL systems, aware of these issues, carefully relax some ACID guarantees in order to achieve scalability and high availability. There are two ways that ACID is typically weakened. 
First, systems like Bigtable, SQL Azure, sharded MySQL, and key-value stores support atomicity and isolation only when each transaction only accesses data within some convenient subset of the database (a single tuple in Bigtable and KV stores, or a single database partition in SQL Azure and sharded MySQL). This eliminates the need for expensive distributed commit protocols, but at a cost: Any logical transaction which spans more than one of these subsets must be broken up at the application level into separate transactions; the system therefore guarantees neither atomicity nor isolation with respect to arbitrary logical transactions. In the end, the programmer must therefore implement any additional ACID functionality at the application level.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p class="MsoNormal" style="margin-bottom:0in;margin-bottom:.0001pt;line-height: normal;mso-layout-grid-align:none;text-autospace:none"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;Second, lazy replication schemes such as eventual consistency sacrifice strong consistency to get around the tradeoffs of post-write replication (while also allowing for high availability in the presence of network partitions, as specified in the CAP theorem). 
Except with regard to some well-known and much-publicized Web 2.0 applications, losing consistency at all times (regardless of whether a network partition is actually occurring) is too steep a price to pay in terms of complexity for the application developer or experience for the end-user&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style=" ;font-family:'Times New Roman', serif;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;p class="MsoNormal" style="margin-bottom:0in;margin-bottom:.0001pt;line-height: normal;mso-layout-grid-align:none;text-autospace:none"&gt;&lt;span style="Times New Roman&amp;quot;,&amp;quot;serif&amp;quot;font-family:&amp;quot;;"&gt;&lt;/span&gt;&lt;b style="mso-bidi-font-weight: normal"&gt;&lt;span style="Times New Roman&amp;quot;,&amp;quot;serif&amp;quot;font-family:&amp;quot;;"&gt;Fixing ACID without going NoSQL&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;&lt;p class="MsoNormal" style="margin-bottom:0in;margin-bottom:.0001pt;line-height: normal;mso-layout-grid-align:none;text-autospace:none"&gt;&lt;b&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;In our opinion, the NoSQL decision to give up on ACID is the lazy solution to these scalability and replication issues. Responsibility for atomicity, consistency and isolation is simply being pushed onto the developer. 
What is really needed is a way for ACID systems to scale on shared-nothing architectures, and that is what we address in the &lt;/span&gt;&lt;/span&gt;&lt;a href="http://db.cs.yale.edu/determinism-vldb10.pdf"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;research paper&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt; that we will present at VLDB this month.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p class="MsoNormal" style="margin-bottom:0in;margin-bottom:.0001pt;line-height: normal;mso-layout-grid-align:none;text-autospace:none"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;Our view (and yes, this may seem counterintuitive at first), is that the problem with ACID is not that its guarantees are too strong (and that therefore scaling these guarantees in a shared-nothing cluster of machines is too hard), but rather that its guarantees are too weak, and that this weakness is hindering scalability.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p class="MsoNormal" style="margin-bottom:0in;margin-bottom:.0001pt;line-height: normal;mso-layout-grid-align:none;text-autospace:none"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;The root of these problems lies in the isolation property within ACID. In particular, the serializability property (which is the standard isolation level for fully ACID systems) guarantees that execution of a set of transactions occurs in a manner equivalent to some sequential, non-concurrent execution of those transactions, even if what actually happens under the hood is highly threaded and parallelized. 
So if three transactions (let's call them A, B and C) are active at the same time on an ACID system, the system guarantees that the resulting database state will be the same as if it had run them one by one. No promises are made, however, about &lt;i&gt;which &lt;/i&gt;particular serial order the execution will be equivalent to: A-B-C, B-A-C, A-C-B, etc.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p class="MsoNormal" style="margin-bottom:0in;margin-bottom:.0001pt;line-height: normal;mso-layout-grid-align:none;text-autospace:none"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;This obviously causes problems for replication. If a set of (potentially non-commutative) transactions is sent to two replicas of the same system, the two replicas might each execute the transactions in a manner equivalent to a different serial order, allowing the replicas' states to diverge.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p class="MsoNormal" style="margin-bottom:0in;margin-bottom:.0001pt;line-height: normal;mso-layout-grid-align:none;text-autospace:none"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;More generally, most of the intra- and inter-replica information exchange that forms the basis of the scalability and replication woes of ACID systems described above occurs when disparate nodes in the system have to forge agreement about (a) which transactions should be executed, (b) which will end up being committed, and (c) with equivalence to which serial order.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p class="MsoNormal" style="margin-bottom:0in;margin-bottom:.0001pt;line-height: normal;mso-layout-grid-align:none;text-autospace:none"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;If the isolation property were to be 
strengthened to guarantee equivalence to a predetermined serial order (while still allowing high levels of concurrency), and if a layer were added to the system which accepts transaction requests, decides on a universal order, and sends the ordered requests to all replicas, then problems (a) and (c) are eliminated. If the system is also stripped of the right to arbitrarily abort transactions (system aborts typically occur for reasons such as node failure and deadlock), then problem (b) is also eliminated.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p class="MsoNormal" style="margin-bottom:0in;margin-bottom:.0001pt;line-height: normal;mso-layout-grid-align:none;text-autospace:none"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;This kind of strengthening of isolation introduces new challenges (such as deadlock avoidance, dealing with failures without aborting transactions, and allowing highly concurrent execution without any on-the-fly transaction reordering), but also results in a very interesting property: given an initial database state and a sequence of transaction requests, there exists only one valid final state. In other words, determinism.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p class="MsoNormal" style="margin-bottom:0in;margin-bottom:.0001pt;line-height: normal;mso-layout-grid-align:none;text-autospace:none"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;The repercussions of a deterministic system are broad, but one advantage is immediately clear: active replication is trivial, strongly consistent, and suffers none of the drawbacks described above. There are some less obvious advantages too. For example, the need for distributed commit protocols in multi-node transactions is eliminated, which is a critical step towards scalability. 
(Why distributed commit protocols can be omitted in distributed systems is non-obvious, and will be discussed in a future blog post; the topic is also&lt;/span&gt;&lt;/span&gt;&lt;a href="http://db.cs.yale.edu/determinism-vldb10.pdf"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt; addressed at length in our paper&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;.)&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p class="MsoNormal" style="margin-bottom:0in;margin-bottom:.0001pt;line-height: normal;mso-layout-grid-align:none;text-autospace:none"&gt;&lt;span style="Times New Roman&amp;quot;,&amp;quot;serif&amp;quot;font-family:&amp;quot;;"&gt;&lt;/span&gt;&lt;b style="mso-bidi-font-weight: normal"&gt;&lt;span style="Times New Roman&amp;quot;,&amp;quot;serif&amp;quot;font-family:&amp;quot;;"&gt;A deterministic DBMS prototype&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;&lt;p class="MsoNormal" style="margin-bottom:0in;margin-bottom:.0001pt;line-height: normal;mso-layout-grid-align:none;text-autospace:none"&gt;&lt;b&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;In our paper, entitled “&lt;/span&gt;&lt;/span&gt;&lt;a href="http://db.cs.yale.edu/determinism-vldb10.pdf"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;The Case for Determinism in Database Systems&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;”, we propose an architecture and execution model that avoids deadlock, copes with failures without aborting transactions, and achieves high 
concurrency. The paper contains full details, but the basic idea is to use ordered locking coupled with optimistic lock location prediction, while exploiting deterministic systems' nice replication properties in the case of failures.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p class="MsoNormal" style="margin-bottom:0in;margin-bottom:.0001pt;line-height: normal;mso-layout-grid-align:none;text-autospace:none"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;We go on in the paper to present measurements and analyses of the performance characteristics of a fully ACID deterministic database prototype based on our execution model, which we implemented alongside a standard (nondeterministic) two-phase locking system for comparison. It turns out that the deterministic scheme performs horribly in disk-based environments, but that as transactions get shorter and less variable in length (thanks to the introduction of flash and the ever-plummeting cost of memory) our scheme becomes more viable. 
Running the prototype on modern hardware, deterministic execution keeps up with the traditional system implementation on the TPC-C benchmark, and actually shows drastically more throughput and scalability than the nondeterministic system when the frequency of multi-partition transactions increases.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p class="MsoNormal" style="margin-bottom:0in;margin-bottom:.0001pt;line-height: normal;mso-layout-grid-align:none;text-autospace:none"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;Our prototype system is currently being reworked and extended to include several optimizations which appear to be unique to explicitly deterministic systems (see the Future Work section in our paper's appendix for details), and we look forward to releasing a stable codebase to the community in the coming months, in hopes that it will spur further dialogue and research on deterministic systems and on the scalability of ACID systems in general.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8899645800948009496-2262636218329754865?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/2262636218329754865/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2010/08/problems-with-acid-and-how-to-fix-them.html#comment-form' title='65 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/2262636218329754865'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/8899645800948009496/posts/default/2262636218329754865'/><link rel='alternate' type='text/html' href='http://dbmsmusings.blogspot.com/2010/08/problems-with-acid-and-how-to-fix-them.html' title='The problems with ACID, and how to fix them without going NoSQL'/><author><name>Daniel Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>65</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-5658603866534434051</id><published>2010-08-11T08:56:00.000-07:00</published><updated>2010-08-11T09:22:16.898-07:00</updated><title type='text'>Defending Oracle Exadata</title><content type='html'>&lt;p class="MsoNormal"&gt;I recently came across a &lt;a href="http://www.teradata.com/t/assets/0/206/276/5bfc4694-ce82-4a07-867d-3f1040d3df8b.pdf"&gt;whitepaper from Teradata&lt;/a&gt;, written by a senior consultant for Teradata, Richard Burns. This is a very well written piece, and has one of the best overviews of Exadata I’ve seen. I did not notice any obvious inaccuracies in the description of Exadata itself, and even the anti-Exadata arguments (presented after the overview), though at times biased and misleading, do not have many clear factual errors. Hence, it is quite a professionally done whitepaper, even though it is devoted to attacking a competitor. Reading it will probably make you smarter.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;That said, even though the facts are more or less correct, the inferences that are made from these facts are certainly up for debate, and I feel an urge to defend Exadata against some of these allegations, even though I have no personal stake in either side of the Exadata-Teradata feud. &lt;/p&gt;  &lt;p class="MsoNormal"&gt;I will not overview Exadata again in this blog post. 
If you are not familiar with Exadata, I encourage you to read the overview in the Teradata whitepaper, or from &lt;a href="http://www.oracle.com/us/products/database/exadata/index.htm"&gt;Oracle’s own marketing material&lt;/a&gt;. I have covered the columnar compression feature &lt;a href="http://dbmsmusings.blogspot.com/2010/01/exadatas-columnar-compression.html"&gt;separately in this blog&lt;/a&gt;. Hence, I will jump to the arguments that Teradata makes against Exadata, and respond to each one in turn:&lt;/p&gt;  &lt;p class="MsoNormal"&gt;&lt;b style="mso-bidi-font-weight:normal"&gt;“Exadata is NOT Intelligent Storage; Exadata is NOT Shared-Nothing”&lt;o:p&gt;&lt;/o:p&gt;&lt;/b&gt;&lt;/p&gt;  &lt;p class="MsoNormal"&gt;Teradata argues that since the Exadata storage only performs selections, projections, and some form of basic joins, and the rest of the query must be performed in the Oracle database server sitting above Exadata storage (which is typically Oracle RAC), the architecture is a whole lot closer to shared-disk than shared-nothing. Factually, this is correct. Exadata storage is indeed shared-nothing, but since only very basic query operations are performed there, it is fair to view the system as Oracle RAC treating Exadata storage as a shared-disk.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;However, it is one thing to point out that Exadata is closer to shared-disk than shared-nothing, but quite another thing to claim that as a result of this, “Exadata does nothing to reduce or eliminate the structural contention for shared resources that fundamentally limits the scalability –of data, users, and workload – of Oracle data warehouses.” This statement is incorrect and unfair. 
Yes, it is true that contention for shared resources is the source of scalability problems in analytical database systems, and that this is why shared-nothing is widely believed to scale the best (because each compute node contains its own CPU, memory and disk, so neither disk nor memory is shared across the cluster). But shared-disk is very similar to shared-nothing. The only difference is the shared access to the disk storage system. If you are going to make an argument that shared-disk causes scalability problems, you have to make the argument that contention for the one shared resource in a shared-disk system is high enough to cause a performance bottleneck in the system --- namely, you have to argue that the network connection between the servers and the shared-disk is a bottleneck. &lt;/p&gt;  &lt;p class="MsoNormal"&gt;At no point in the entire 18-page whitepaper did Teradata make the argument that the InfiniBand connection between Exadata storage and the database servers is a bottleneck. Furthermore, even if you believe that there is a bottleneck in this connection, you still must admit that by doing some of the filtering in the Exadata storage layer, some of this bottleneck is alleviated. Hence, it is entirely inaccurate to say “Exadata does &lt;b style="mso-bidi-font-weight: normal"&gt;nothing&lt;/b&gt; to reduce or eliminate the structural contention for shared resources ….” --- at the very least it does something by doing this filtering.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;In fact, I think that the scalability differences between shared-nothing and shared-disk are overblown (I’m not arguing that they scale equally well, just that the gap between them is not as large as people think, even when filtering is not pushed down to the shared-disk as it is in Exadata). 
This was &lt;a href="http://www.cs.washington.edu/mssi/2010/LukeLonergan.pdf"&gt;eloquently explained&lt;/a&gt; by Luke Lonergan from Greenplum at the UW/MSR retreat on cloud computing in the first week of August. In essence, he argued that thanks to 10 Gigabit Ethernet, Fibre Channel over Ethernet, and a general flattening of the network into two-tier switching designs, it takes an enormous number of disks to cause the network to become a bottleneck. Furthermore, with 40 Gigabit Ethernet around the corner, and 100 Gigabit Ethernet on its way, the network is becoming even less of a bottleneck. And by the way, shared-disk has a variety of advantages that shared-nothing does not, including the ability to move around virtual machines executing database operators for improved load balancing and fault tolerance. (This would be a good time to point out that Greenplum was &lt;a href="http://dbmsmusings.blogspot.com/2010/07/quick-thoughts-on-emc-acquiring.html"&gt;recently acquired by EMC&lt;/a&gt;, so one obviously has to be aware of pro-shared-disk bias, but I found the argument quite compelling.)&lt;/p&gt;&lt;p class="MsoNormal"&gt;Teradata also attempts to argue that the striping of data across all available disks on every Exadata cell (using Oracle’s Automatic Storage Manager, ASM) causes “potential contention on each disk for disk head location and I/O bandwidth” when many DBMS workers running in parallel for the same query request data from the same set of disks. However, it is not pointed out until later in the paper that Exadata defaults to 4MB chunks. 
4MB blocks should easily amortize the disk seek costs across multiple worker requests.&lt;/p&gt;&lt;p class="MsoNormal"&gt;&lt;b style="mso-bidi-font-weight: normal"&gt;Exadata does NOT Enable High Concurrency&lt;/b&gt;&lt;/p&gt;&lt;p class="MsoNormal"&gt;&lt;b style="mso-bidi-font-weight: normal"&gt;&lt;/b&gt;If I were to summarize this section, I would say that Teradata is basically arguing that the default block size in Exadata is too large, and this reduces concurrency, since 30 concurrent I/Os/sec * 4MB block size saturates the disk's maximum 120MB/sec bandwidth. This argument is clouded by the fact that for scan-based workloads, any database system would have the same concurrency limits. Hence, the only place where Exadata’s large block size would reduce concurrency relative to alternative systems (such as Teradata) would be for non-scan based workloads when there are a lot of tuple lookups and random I/O. Oracle would argue that its memory and flash cache would take care of the tuple lookups and needle-in-a-haystack type queries. Teradata, in turn, argues that the size of cache is far too small to completely take care of this problem. This argument is reasonable, but I do believe that the bulk of the size of a data warehouse is the historical data, and that tuple lookups and non-scan-based queries only touch a much smaller portion of the more recent data, so that caching should do a decent job. 
But this is definitely a “your mileage will vary” type argument, with the effectiveness of the cache highly dependent on particular data sets and workloads.&lt;/p&gt;&lt;p class="MsoNormal"&gt;&lt;b style="mso-bidi-font-weight: normal"&gt;Exadata does NOT Support Active Data Warehousing&lt;/b&gt;&lt;/p&gt;&lt;p class="MsoNormal"&gt;&lt;b style="mso-bidi-font-weight: normal"&gt;&lt;/b&gt;Teradata points out that the process of checking which version of the data is the correct version to return for a particular query (used in its MVCC concurrency control scheme) is performed inside the database servers, and so the Exadata storage cannot do its typical filtering of the data in the storage layer for actively updated tables (since there might be multiple potentially correct versions of the data to return). Teradata therefore points out: “While Exadata still performs parallel I/O for the query, the largest benefit provided by Exadata, early filtering of columns and rows meeting the query specification, which may drastically reduce query data volume, is not useful for tables or partitions being actively updated.” While this might be true, the impact of this issue is overstated. The percentage of data that is actively updated is typically a tiny percentage of the total data set size. The historical data, and the data that has been recently appended (but not updated) will not suffer from this problem. 
Hence, having less than optimal I/O performance for this tiny fraction of the data is not a big deal.&lt;/p&gt;&lt;p class="MsoNormal"&gt;&lt;b style="mso-bidi-font-weight: normal"&gt;Exadata does NOT Provide Superior Query Performance&lt;/b&gt;&lt;/p&gt;&lt;p class="MsoNormal"&gt;&lt;b style="mso-bidi-font-weight: normal"&gt;&lt;/b&gt;Teradata points out that since Exadata can only perform basic operations in the Exadata storage layer (selection, projection, some simple joins), then as a query gets more complex, more and more of it is performed in the database server, instead of in Exadata storage. Teradata gives an example of a simple query, where Oracle can perform 28% of the query steps inside Exadata, and a more complex one, where Oracle can perform only 22% of the query steps inside Exadata. Again, this is factually correct, but it is misleading to assume the speedup you get from Exadata is linearly correlated with the percentage of steps that are performed within Exadata. For database performance, it’s all about bottlenecks, and eliminating a bottleneck can have a disproportionate effect on query performance. In scan-based workloads, disk I/O is often a bottleneck, and Exadata alleviates this bottleneck equally well for both simple and complex queries. Hence, while the benefit of Exadata does decrease for complex queries, it is misleading to assume that this benefit decreases linearly with complexity.&lt;/p&gt;&lt;p class="MsoNormal"&gt;&lt;b style="mso-bidi-font-weight: normal"&gt;Exadata is Complex; Exadata is Expensive&lt;/b&gt;&lt;/p&gt;&lt;p class="MsoNormal"&gt;&lt;b style="mso-bidi-font-weight: normal"&gt;&lt;/b&gt;It is hard to argue with these points. However, it is amusing to note that Teradata is willing to point out in this section that they are only (approximately) 11% cheaper than Oracle, and they show numbers such as Teradata costing 194K per terabyte. 
Both Oracle and Teradata are too expensive for large parts of the analytical database market.&lt;/p&gt;&lt;p class="MsoNormal"&gt;&lt;b style="mso-bidi-font-weight: normal"&gt;Conclusion&lt;/b&gt;&lt;/p&gt;&lt;p class="MsoNormal"&gt;&lt;b style="mso-bidi-font-weight: normal"&gt;&lt;/b&gt;The truth, as is usually the case, is somewhere in the middle, between the claims of Oracle and Teradata. Teradata is probably right when it asserts “Exadata is far from the groundbreaking innovation that Oracle claims”, and that Oracle “throws a lot of hardware” at problems that are solvable in software, but many of the claims and inferences made in the paper about Exadata are overstated, and the reader needs to be careful not to be misled into believing in the existence of problems that don’t actually present themselves on realistic datasets and workloads.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8899645800948009496-5658603866534434051?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/5658603866534434051/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2010/08/defending-oracle-exadata.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/5658603866534434051'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/5658603866534434051'/><link rel='alternate' type='text/html' href='http://dbmsmusings.blogspot.com/2010/08/defending-oracle-exadata.html' title='Defending Oracle Exadata'/><author><name>Daniel Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-6133071819259029877</id><published>2010-08-02T01:03:00.000-07:00</published><updated>2010-08-02T01:20:32.772-07:00</updated><title type='text'>Thoughts on Kickfire’s apparent demise</title><content type='html'>There have been some recent conflicting reports on the future prospects of Kickfire’s analytical database technology. &lt;a href="http://www.forbes.com/forbes/2010/0426/technology-data-analysis-moore-law-mysql-kickfire-cheap-data-mining.html"&gt;Forbes reported a couple of months ago&lt;/a&gt; that Kickfire sold $5 million worth of boxes in their first year of existence (they launched their product in April 2009), and was extremely positive about Kickfire’s outlook. Then, a couple of months later, &lt;a href="http://www.dbms2.com/2010/07/27/kickfire-unlikely-to-survive/"&gt;Curt Monash reported &lt;/a&gt;that Kickfire was discontinuing their product, and selling their IP and engineers. Obviously, Monash is the more reliable source here (and I independently heard a rumor from a reputable source that Teradata was acquiring Kickfire at a “firesale” price).&lt;br /&gt;&lt;br /&gt;In my interactions with the company, I have been impressed with Raj Cherabuddi and members of the Kickfire technical team, and it is always sad when good technology fails to gain any traction in the marketplace. I’m also sad (though obviously to a lesser extent) to see the many thousands of words I have written about Kickfire in this blog (including an &lt;a href="http://dbmsmusings.blogspot.com/2009/09/kickfires-approach-to-parallelism.html"&gt;in-depth post on their technology&lt;/a&gt;) become largely obsolete. 
In fact, one of the &lt;a href="http://dbmsmusings.blogspot.com/2009/06/is-betting-on-mysql-mass-market-for.html"&gt;first posts I wrote for this blog &lt;/a&gt;was on the subject of Kickfire --- a mostly positive post --- but questioning their go-to-market strategy. In that post, I took issue with the assumption that there is a “mass market” for data warehousing in the MySQL ecosystem, especially for a proprietary approach like Kickfire's. The CEO of Kickfire kindly &lt;a href="http://dbmsmusings.blogspot.com/2009/06/is-betting-on-mysql-mass-market-for.html?showComment=1244524864444#c3569501795724243095"&gt;took the time to respond&lt;/a&gt; to this original post, quoting a bunch of IDC numbers about the size of the MySQL data warehouse market. I chose to respond to this comment in &lt;a href="http://dbmsmusings.blogspot.com/2009/06/ceo-responds-to-my-post-on-kickfire.html"&gt;a separate post&lt;/a&gt;, in which I said (amongst other things):&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;em&gt;"The point of my post was that I think the [MySQL data warehouse] market is smaller than these [IDC] numbers indicate. Sure, there are a lot of MySQL deployments, but that's because it's free. The number of people actually paying for the MySQL Enterprise Edition is far less, but those are probably the people who'd be willing to pay for a solution like Kickfire's. Furthermore, […] a lot of people who use MySQL for warehousing are using sharded MySQL, which is nontrivial (or at least not cheap) to port to non-shared-nothing solutions like Kickfire and Infobright. Finally, the amount of data that corporations are keeping around is increasing rapidly, and the size of data warehouses are doubling faster than Moore's law. So even if most warehouses today are pretty small, this might not be the case in the future. I'm a strong believer that MPP shared-nothing parallel solutions are the right answer for the mass market of tomorrow. 
Anyway, the bottom line is that I'm openly wondering if the market is actually much smaller than the IDC numbers would seem to suggest. But obviously, if Kickfire, Infobright, or Calpont achieves a large amount of success without changing their market strategy, I'll be proven incorrect."&lt;/em&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;I think the above paragraph lists two of the three most probable reasons why Kickfire seems to have failed:&lt;br /&gt;&lt;br /&gt;(1) Building a proprietary database stack of hardware and software around a MySQL codebase that attributes much of its success to being open and free is a poor cultural match.&lt;br /&gt;&lt;br /&gt;(2) Trying to make it in the "Big Data Era" without a scalable MPP product is a recipe for disaster. It is well known that over 95% of data warehouses are smaller than 5TB, and that MPP is not strictly necessary for less than 5TB, so it is easy to get into the trap of Kickfire’s thinking that the mass market is addressable without building an MPP product. However, businesses are looking forward, and seeing much more data in their future (whether this is wishful or realistic thinking is entirely irrelevant), and can often be reluctant to select a product with known scalability limits.&lt;br /&gt;&lt;br /&gt;(The third alternative reason why Kickfire might have failed is the TPC-H benchmark orientation. It is really easy to spend a lot of time working on an analytical database to get it to run TPC-H --- even optimizing it for TPC-H --- before realizing that the marketing benefit that the product gets from stellar TPC-H numbers does not justify the time investment of getting it to run --- and in fact find out that many of the features that were added for TPC-H are not actually used by real-life customers.)&lt;br /&gt;&lt;br /&gt;It is tempting to add a fourth reason for Kickfire’s demise --- the long list of failed hardware-accelerated DBMS companies and Kickfire’s obvious inclusion in this group. 
However, I believe that Netezza’s success is a demonstration of the potential of the benefits of hardware acceleration and the appliance approach in the modern era where the rate of performance improvements with each successive processor generation is slowing significantly.&lt;br /&gt;&lt;br /&gt;Anyway, RIP Kickfire (assuming the rumors are correct). Good technology. Bad go-to-market strategy. Tough fit for the “Big Data” era.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8899645800948009496-6133071819259029877?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/6133071819259029877/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2010/08/thoughts-on-kickfires-apparent-demise.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/6133071819259029877'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/6133071819259029877'/><link rel='alternate' type='text/html' href='http://dbmsmusings.blogspot.com/2010/08/thoughts-on-kickfires-apparent-demise.html' title='Thoughts on Kickfire’s apparent demise'/><author><name>Daniel Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-6414941129509532967</id><published>2010-07-06T15:20:00.000-07:00</published><updated>2010-07-10T18:51:58.216-07:00</updated><title type='text'>Quick thoughts on EMC acquiring Greenplum</title><content type='html'>EMC 
&lt;a href="http://www.emc.com/about/news/press/2010/20100706-01.htm"&gt;announced today&lt;/a&gt; that they are acquiring Greenplum. Below are the first thoughts that crossed my mind when I heard about this deal.&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Congratulations to the whole team at Greenplum. Every interaction I’ve had with a Greenplum employee has been very positive (especially Florian Waas, Luke Lonergan, and I guess Joe Hellerstein even though he’s just an advisor), and I’m really happy for all of them.&lt;/li&gt;&lt;li&gt;Aster Data’s launch a couple of years ago seems to have hurt Greenplum more than any other company. Aster Data and Greenplum have extremely similar products (a parallelization layer over PostgreSQL with MapReduce integration), though Greenplum has made more changes and innovations at the DBMS level than Aster Data has (most notably the column-storage option which I have &lt;a href="http://dbmsmusings.blogspot.com/2009/10/greenplum-announces-column-oriented.html"&gt;written about in the past&lt;/a&gt;). However, Aster Data has refocused on deep, embedded analytics with extensive analytic libraries, which has met with more market success than Greenplum’s &lt;a href="http://dbmsmusings.blogspot.com/2009/06/quick-thoughts-on-greenplum-edc.html"&gt;focus on the enterprise data cloud (EDC) vision&lt;/a&gt;, since Greenplum’s product was not yet ready to compete with the likes of Teradata in the EDC space. Greenplum’s recent Chorus offering also seems to have been a failure. Hence, I do not think the acquisition price was an extremely large number. &lt;/li&gt;&lt;li&gt;This deal seems like bad news for Greenplum’s direct competitor, ParAccel, which is &lt;a href="https://gallery.emc.com/docs/DOC-1851"&gt;extremely close to EMC&lt;/a&gt; and relies on EMC for their “enterprise-class” solution that includes high availability and disaster recovery. 
I believe EMC routinely helps ParAccel win some business.&lt;/li&gt;&lt;li&gt;People predicted that the DatAllegro acquisition by Microsoft would spur additional industry consolidation. That clearly did not happen, as two years passed and there were no non-trivial acquisitions in the data warehouse space. But then SAP acquired Sybase (and Sybase IQ) and EMC acquired Greenplum, so I am sure people will be predicting that 2010 is the year for all the predicted consolidation. I have my doubts, since HP is the only major player that clearly needs to upgrade its “Big Data” offerings. But I wouldn’t be surprised if there were one more acquisition this year.&lt;/li&gt;&lt;/ul&gt;Other "Quick Thoughts" posts worth reading:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Curt Monash says the acquisition &lt;a href="http://www.dbms2.com/2010/07/06/emc-is-buying-greenplum/"&gt;should not affect people currently evaluating Greenplum&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Dave Kellogg thinks the &lt;a href="http://www.kellblog.com/2010/07/06/emc-acquires-data-warehouse-vendor-greenplum-as-cornerstone-of-new-data-computing-product-division/"&gt;acquisition price is in the $150-$250 [Edit: he changed it to $300-$400] range&lt;/a&gt; (though he is a little more optimistic on how Greenplum was doing than I am)&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8899645800948009496-6414941129509532967?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/6414941129509532967/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2010/07/quick-thoughts-on-emc-acquiring.html#comment-form' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/6414941129509532967'/><link rel='self' 
type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/6414941129509532967'/><link rel='alternate' type='text/html' href='http://dbmsmusings.blogspot.com/2010/07/quick-thoughts-on-emc-acquiring.html' title='Quick thoughts on EMC acquiring Greenplum'/><author><name>Daniel Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-5926754179729515588</id><published>2010-04-23T08:31:00.000-07:00</published><updated>2010-04-26T06:38:53.762-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='SimpleDB'/><category scheme='http://www.blogger.com/atom/ns#' term='Cassandra'/><category scheme='http://www.blogger.com/atom/ns#' term='CAP'/><category scheme='http://www.blogger.com/atom/ns#' term='Sherpa'/><category scheme='http://www.blogger.com/atom/ns#' term='NoSQL'/><category scheme='http://www.blogger.com/atom/ns#' term='PNUTS'/><category scheme='http://www.blogger.com/atom/ns#' term='Dynamo'/><title type='text'>Problems with CAP, and Yahoo’s little known NoSQL system</title><content type='html'>&lt;p class="MsoNormal"&gt;Over the past few weeks, in my advanced database system implementation class I teach at Yale, I’ve been covering the CAP theorem, its implications, and various scalable NoSQL systems that would appear to be influenced in their design by the constraints of CAP. 
Over the course of my coverage of this topic, I have become convinced that CAP falls far short of giving a complete picture of the engineering tradeoffs behind building scalable, distributed systems.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;&lt;b style=""&gt;My problems with CAP&lt;/b&gt;&lt;/p&gt;  &lt;p class="MsoNormal"&gt;CAP is generally described as the following: when you build a distributed system, of the three desirable properties you want in your system (consistency, availability, and tolerance of network partitions), you can choose only two.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;Already there is a problem, since this implies that there are three types of distributed systems one can build: CA (consistent and available, but not tolerant of partitions), CP (consistent and tolerant of network partitions, but not available), and AP (available and tolerant of network partitions, but not consistent).&lt;span style=""&gt;  &lt;/span&gt;The definition of CP looks a little strange --- “consistent and tolerant of network partitions, but not available” --- the way that this is written makes it look like such a system is never available --- a clearly useless system. Of course, this is not really the case; rather, availability is only sacrificed when there is a network partition. In practice, this means that the roles of the A and C in CAP are asymmetric. Systems that sacrifice consistency (AP systems) tend to do so &lt;b style=""&gt;all the time&lt;/b&gt;, not just when there is a network partition (the reason for this will become clear by the end of this post). The potential confusion caused by the asymmetry of A and C is my first problem.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;My second problem is that, as far as I can tell, there is no practical difference between CA systems and CP systems. As noted above, CP systems give up availability only when there is a network partition. 
CA systems are “&lt;a href="http://cacm.acm.org/blogs/blog-cacm/83396-errors-in-database-systems-eventual-consistency-and-the-cap-theorem/fulltext"&gt;not tolerant of network partitions&lt;/a&gt;”. But what if there is a network partition? What does “not tolerant” mean? In practice, it means that they lose availability if there is a partition. Hence CP and CA are essentially &lt;span style=""&gt; &lt;/span&gt;identical. So in reality, there are only two types of systems: CP/CA and AP. I.e., if there is a partition, does the system give up availability or consistency? Having three letters in CAP and saying you can pick any two does nothing but confuse this point.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;But my main problem with CAP is that it focuses everyone on a consistency/availability tradeoff, resulting in a perception that the reason why NoSQL systems give up consistency is to get availability. But this is far from the case. A good example of this is Yahoo’s little known NoSQL system called PNUTS (in the academic community) or Sherpa (to everyone else).&lt;/p&gt;  &lt;p class="MsoNormal"&gt;&lt;span style=""&gt; &lt;/span&gt;(Note, readers from the academic community might wonder why I’m calling PNUTS “little known”. It turns out, however, that outside the academic community, PNUTS/Sherpa is almost never mentioned in the NoSQL discussion --- in fact, as of April 2010, it’s not even categorized in the list of 35+ NoSQL systems at the &lt;a href="http://nosql-database.org/"&gt;nosql-database.org&lt;/a&gt; Website).&lt;/p&gt;  &lt;p class="MsoNormal"&gt;&lt;b style=""&gt;PNUTS and CAP&lt;o:p&gt;&lt;/o:p&gt;&lt;/b&gt;&lt;/p&gt;  &lt;p class="MsoNormal"&gt;If you examine PNUTS through the lens of CAP, it would seem that the designers have no idea what they are doing (I assure you this is not the case). Rather than giving up just one of consistency or availability, the system gives up both! 
It relaxes consistency by only guaranteeing “timeline consistency” where replicas may not be consistent with each other but updates are guaranteed to be applied in the same order at all replicas. However, they also give up availability --- if the master replica for a particular data item is unreachable, that item becomes unavailable for updates (note, there are other configurations of the system with availability guarantees similar to Dynamo/Cassandra, I’m focusing in this post on the default system described in the &lt;span style=""&gt; &lt;/span&gt;original PNUTS paper). Why would anyone want to give up both consistency and availability? CAP says you only have to give up just one!&lt;/p&gt;  &lt;p class="MsoNormal"&gt;The reason is that CAP is missing a very important letter: L. PNUTS gives up consistency not for the goal of improving availability. Instead, it is to lower latency. Keeping replicas consistent over a wide area network requires at least one message to be sent over the WAN in the critical path to perform the write (some think that 2PC is necessary, but my student Alex Thomson has some research showing that this is not the case --- more on this in a future post). Unfortunately, a message over a WAN significantly increases the latency of a transaction (on the order of hundreds of milliseconds), a cost too large for many Web applications that businesses like Amazon and Yahoo need to implement. Consequently, in order to reduce latency, replication must be performed asynchronously. This reduces consistency (&lt;a href="http://portal.acm.org/citation.cfm?id=564585.564601"&gt;by definition&lt;/a&gt;). In Yahoo’s case, their method of reducing consistency (timeline consistency) enables an application developer to rely on some guarantees when reasoning about how this consistency is reduced. 
But consistency is nonetheless reduced.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;&lt;b style=""&gt;Conclusion: Replace CAP with PACELC&lt;o:p&gt;&lt;/o:p&gt;&lt;/b&gt;&lt;/p&gt;  &lt;p class="MsoNormal"&gt;In thinking about CAP over the past few weeks, I feel that it has become overrated as a tool for explaining the design of modern scalable, distributed systems. Not only is the asymmetry of the contributions of C, A, and P confusing, but the lack of latency considerations in CAP significantly reduces its utility.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;To me, CAP should really be PACELC --- if there is a partition (P) how does the system trade off between availability and consistency (A and C); else (E) when the system is running as normal in the absence of partitions, how does the system trade off between latency (L) and consistency (C)?&lt;/p&gt;  &lt;p class="MsoNormal"&gt;Systems that tend to give up consistency for availability when there is a partition also tend to give up consistency for latency when there is no partition. This is the source of the asymmetry of the C and A in CAP. &lt;span style=""&gt; &lt;/span&gt;However, this confusion is not present in PACELC. &lt;/p&gt;&lt;p class="MsoNormal"&gt;For example, Amazon’s Dynamo (and related systems like Cassandra and SimpleDB) are PA/EL in PACELC --- upon a partition, they give up consistency for availability; and under normal operation they give up consistency for lower latency. Giving up C in both parts of PACELC makes the design simpler --- once the application is configured to be able to handle inconsistencies, it makes sense to give up consistency for both availability and lower latency.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;Fully ACID systems are PC/EC in PACELC. They refuse to give up consistency, and will pay the availability and latency costs to achieve it.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;However, there are some interesting counterexamples where the C’s of PACELC are not correlated. 
One such example is PNUTS, which is PC/EL in PACELC. In normal operation they give up consistency for latency; however, upon a partition they don’t give up any additional consistency (rather they give up availability).&lt;/p&gt;  &lt;p class="MsoNormal"&gt;In conclusion, rewriting CAP as PACELC removes some confusing asymmetry in CAP, and, in my opinion, comes closer to explaining the design of NoSQL systems.&lt;br /&gt;&lt;/p&gt;&lt;p class="MsoNormal"&gt;&lt;br /&gt;&lt;/p&gt;&lt;p class="MsoNormal"&gt;(A quick plug to conclude this post: the PNUTS guys are presenting a new &lt;a href="http://research.yahoo.com/files/ycsb.pdf"&gt;benchmark for cloud data serving&lt;/a&gt; which compares PNUTS vs. other NoSQL systems at the first annual &lt;a href="http://research.microsoft.com/en-us/um/redmond/events/socc2010/"&gt;ACM Symposium on Cloud Computing 2010&lt;/a&gt; (ACM SOCC 2010) in Indianapolis on June 10th and 11th. SOCC 2010 is held in conjunction with SIGMOD 2010 and the &lt;a href="http://research.microsoft.com/en-us/um/redmond/events/socc2010/program.htm"&gt;recently released program&lt;/a&gt; looks amazing.)&lt;br /&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8899645800948009496-5926754179729515588?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/5926754179729515588/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html#comment-form' title='33 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/5926754179729515588'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/5926754179729515588'/><link rel='alternate' type='text/html' 
href='http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html' title='Problems with CAP, and Yahoo’s little known NoSQL system'/><author><name>Daniel Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>33</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-470842105003385247</id><published>2010-03-29T13:44:00.000-07:00</published><updated>2010-03-31T20:29:20.682-07:00</updated><title type='text'>Distinguishing Two Major Types of Column-Stores</title><content type='html'>&lt;div&gt;I have noticed that &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;Bigtable&lt;/span&gt;, &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;HBase&lt;/span&gt;, &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;Hypertable&lt;/span&gt;, and Cassandra are being called column-stores with increasing frequency (e.g. &lt;a href="http://hadoop.apache.org/hbase/"&gt;here&lt;/a&gt;, &lt;a href="http://nosql.mypopescu.com/post/356249273/a-great-week-for-the-column-stores-cassandra-and-hbase"&gt;here&lt;/a&gt;, and &lt;a href="http://blogs.the451group.com/information_management/2010/03/15/categorizing-the-foo-fighters-making-sense-of-nosql/"&gt;here&lt;/a&gt;), due to their ability to store and access column families separately. 
This makes them appear to be in the same category as column-stores such as &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_3"&gt;Sybase&lt;/span&gt; IQ, C-Store, &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_4"&gt;Vertica&lt;/span&gt;, &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_5"&gt;VectorWise&lt;/span&gt;, &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_6"&gt;MonetDB&lt;/span&gt;, &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_7"&gt;ParAccel&lt;/span&gt;, and &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_8"&gt;Infobright&lt;/span&gt;, which also are able to access columns separately. I believe that calling both groups of systems column-stores has led to a great deal of confusion and misplaced expectations. This blog post attempts to clear up some of this confusion, highlighting the high-level differences between these groups of systems. At the end, I will propose some potential ways to rename these groups to avoid confusion in the future.&lt;br /&gt;&lt;br /&gt;&lt;div&gt;For this blog post, I will refer to the following two groups as Group A and Group B:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;Group A: &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_9"&gt;Bigtable&lt;/span&gt;, &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_10"&gt;HBase&lt;/span&gt;, &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_11"&gt;Hypertable&lt;/span&gt;, and Cassandra. 
These four systems are not intended to be a complete list of systems in Group A --- these are simply the four systems I understand the best in this category and feel the most confident discussing.&lt;br /&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Group B: &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_12"&gt;Sybase&lt;/span&gt; IQ, C-Store, &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_13"&gt;Vertica&lt;/span&gt;, &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_14"&gt;VectorWise&lt;/span&gt;, &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_15"&gt;MonetDB&lt;/span&gt;, &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_16"&gt;ParAccel&lt;/span&gt;, and &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_17"&gt;Infobright&lt;/span&gt;. Again, this is not a complete list, but these are the systems from this group I know best. (Row/column hybrid systems such as Oracle or &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_18"&gt;Greenplum&lt;/span&gt; are ignored from this discussion to avoid confusion, but the column-store aspects of these systems are closer to Group B than Group A.)&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;&lt;div style="text-align: left;"&gt;&lt;b&gt;Differences between Group A and Group B&lt;/b&gt;&lt;/div&gt;&lt;ol&gt;&lt;li&gt;&lt;b&gt;Data Model. &lt;/b&gt;Group A uses a multi-dimensional map (something along the lines of a sparse, distributed, persistent multi-dimensional sorted map). Typically a row-name, column-name, and timestamp are sufficient to uniquely map to a value in the database. Group B uses a traditional relational data model. This distinction has caused great confusion. People more familiar with Group A are very much aware that Group A does not use a relational data model and assume that since Group B are also called column-stores, then Group B also does not use a relational data model. 
This has resulted in many intelligent people saying “column-stores are not relational”, which is completely incorrect.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;Independence of Columns&lt;/b&gt;. Group A stores parts of a data entity or “row” in separate column-families, and has the ability to access these column-families separately. This means that not all parts of a row are picked up in a single I/O operation from storage, which is considered a good thing if only a subset of a row is relevant for a particular query. However, column-families may consist of many columns, and these columns within column-families are not independently accessible.&lt;br /&gt;&lt;br /&gt;Group B stores columns from a traditional relational database table separately so that they can be accessed independently. Like Group A, this is useful for queries that only access a subset of table attributes in any particular query. However, the main difference is that &lt;span style="font-weight: bold;"&gt;every column&lt;/span&gt; is stored separately, instead of families of columns as in Group A (this statement ignores &lt;a href="http://dbmsmusings.blogspot.com/2009/09/tour-through-hybrid-columnrow-oriented.html"&gt;fine-grained hybrid options&lt;/a&gt; within Group B).&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;Interface&lt;/b&gt;. Group A is distinguished by being part of the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_19"&gt;NoSQL&lt;/span&gt; movement and does not typically have a traditional &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_20"&gt;SQL&lt;/span&gt; interface. Group B supports standard &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_21"&gt;SQL&lt;/span&gt; interfaces.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;Optimized workload.&lt;/b&gt; Group B is optimized for read-mostly analytical workloads. These systems support reasonably fast load times, but high update rates tend to be problematic. 
Hence, data warehouses are an ideal market for Group B, since they are typically bulk-loaded, require many complex read queries, and are updated rarely. In contrast, Group A can handle a more diverse set of application requirements (Cassandra, in particular, can handle a much higher rate of updates). Group B systems tend to struggle on workloads that “get” or “put” individual rows in the data set, but thrive on big aggregations and &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_22"&gt;summarizations&lt;/span&gt; that require scanning many rows as part of a single query. In contrast, Group A generally does better for individual row queries, and does not perform well on aggregation-heavy workloads. Much of the reason for this difference can be explained in the “pure column” vs “column-family” difference between the systems. Group A systems can put attributes that tend to be co-accessed in the same column-family; this saves the seek cost that results from column-stores needing to find different attributes from the same row in many different places. 
Another reason for the difference is the storage layer implementation, explained below.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;Storage layer.&lt;/b&gt; Assume the following customer table&lt;br /&gt;&lt;br /&gt;&lt;img src="http://cs-www.cs.yale.edu/homes/dna/images/SampleTable.png" height="100" /&gt;&lt;br /&gt;&lt;br /&gt;Although there is some variation within the systems in Group B, to the first order of approximation, this group will store the table in the following way:&lt;br /&gt;&lt;br /&gt;(ID) 1, 2, 3, 4, 5, 6&lt;br /&gt;&lt;br /&gt;(First Name) Joe, Jack, Jill, James, Jamie, Justin&lt;br /&gt;&lt;br /&gt;(Last Name) Smith, Williams, Davis, Miller, Wilson, Taylor&lt;br /&gt;&lt;br /&gt;(Phone) 555-1234, 555-5668, 555-5432, NULL, 555-6527, 555-8247&lt;br /&gt;&lt;br /&gt;(Email) jsmith@gmail.com, jwilliams@gmail.com, NULL, jmiller@yahoo.com, NULL, jtaylor@aol.com&lt;br /&gt;&lt;br /&gt;Note that each value is stored by itself, without information about what row or column it came from. We can figure out what column it came from since all values from the same column are stored consecutively. We can figure out what row it came from by counting how many values came before it in the same column. The fourth value in the ID column matches up to the same row as the fourth value in the last name column and the fourth value in the phone column, etc. Note that this means that columns that are undefined for a particular row must be explicitly stored as NULL in the column list; otherwise we can no longer match up values based on their position in their corresponding lists.&lt;br /&gt;&lt;br /&gt;Meanwhile, systems in Group A will either explicitly store the row-name, the column-name or both with each value. E.g.: row2, &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_23"&gt;lastname&lt;/span&gt;: Williams; row5, phone: 555-6527, etc. 
The reason is that Group A uses a sparse data model (different rows can have a very different set of columns defined). Storing &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_24"&gt;NULLs&lt;/span&gt; for every undefined column could soon lead to the majority of the database being filled with &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_25"&gt;NULLs&lt;/span&gt;. Hence these systems will explicitly have column-name/value pairs for each element in a row within a column-family, or row-name/value pairs for each element within a single column column-family. (Group A will also typically store a timestamp per value, but explaining this will only complicate this discussion).&lt;br /&gt;&lt;br /&gt;This results in Group B typically taking much less space on storage than Group A (at least for structured data that easily fits into a relational model). Furthermore, by storing just column-values without column-names or row-names, Group B optimizes performance for column-operations where each element within a column is read and an operation (like a predicate evaluation or an aggregation) is applied. Hence, the data model combined with the storage layer implementation results in wildly different target applications for Group A and Group B.&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;&lt;b&gt;Renaming the Groups&lt;/b&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;Clearly along each of these five dimensions, Group A and Group B are extremely different. Consequently, even though calling them both column-stores has some advantages (it makes it seem like the “column-store movement” is large and a really hot area), I believe that lumping Group A and Group B together does more damage than good, and that we need to make a greater effort to avoid confusing these two groups in the future. 
Here are some suggestions for names for Group A and Group B towards this goal:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Group A: “column-family store”&lt;br /&gt;Group B: “column-store”&lt;br /&gt;&lt;br /&gt;(The problem here is that Group B &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_26"&gt;doesn&lt;/span&gt;’t get a new name, which means that “column-store” could either refer to Group B or both Group A/B)&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Group A: “non-relational column-store”&lt;/li&gt;&lt;li&gt;Group B: “relational column-store”&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Group A: “sparse column-store”&lt;/li&gt;&lt;li&gt;Group B: “dense column-store”&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;Of these, the relational/non-relational distinction is probably the most important, and would be my vote for the new names. If you have a different idea for names, or want to vote on one of the above schemes, please leave a comment below. &lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;hr /&gt;&lt;span style="font-weight: bold;"&gt;Addendum:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://en.wikipedia.org/wiki/Michael_Stonebraker"&gt;Mike &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_27"&gt;Stonebraker&lt;/span&gt;&lt;/a&gt; e-mailed me his vote which I reproduce here with his permission:&lt;blockquote&gt;&lt;div&gt;&lt;span class="Apple-style-span"   style="color: rgb(31, 73, 125);font-family:Calibri,sans-serif;font-size:100%;"&gt;&lt;i&gt;&lt;span class="Apple-style-span"&gt;Group A are really row stores.  I.e. they store a column family in a row-by-row fashion.  Effectively, they are using&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="color: rgb(0, 0, 0);font-family:Georgia,serif;"&gt;&lt;i&gt;&lt;span class="Apple-style-span"&gt; &lt;/span&gt;&lt;/i&gt;&lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span"&gt;materialized views for column families and storing materialized views row-by-row. 
Systems in Group B have a sophisticated column-oriented optimizer -- no such thing exists for Group A.&lt;/span&gt;&lt;/i&gt;&lt;/span&gt;&lt;/div&gt;&lt;/blockquote&gt;He then went on to call Group A "&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_28"&gt;MV&lt;/span&gt; row-stores" and sub-categorize Group B depending on how materialized views are stored, but his overarching point was that referring to anything in Group A as a "column-store" is really misleading.&lt;br /&gt;&lt;br /&gt;&lt;hr /&gt;&lt;span style="font-weight: bold;"&gt;Addendum 2:&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;This post seems to have attracted some high quality comments. I recommend reading the comment thread.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8899645800948009496-470842105003385247?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/470842105003385247/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2010/03/distinguishing-two-major-types-of_29.html#comment-form' title='16 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/470842105003385247'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/470842105003385247'/><link rel='alternate' type='text/html' href='http://dbmsmusings.blogspot.com/2010/03/distinguishing-two-major-types-of_29.html' title='Distinguishing Two Major Types of Column-Stores'/><author><name>Daniel Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>16</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-1856065087991873525</id><published>2010-01-20T15:43:00.001-08:00</published><updated>2010-01-20T18:35:42.707-08:00</updated><title type='text'>New England Database Summit 2010 Program</title><content type='html'>&lt;div&gt;I just finished putting together the program for New England Database Summit, 2010 (thanks to the PC: &lt;a href="http://www.cs.umass.edu/~yanlei/"&gt;Yanlei Diao&lt;/a&gt;, &lt;a href="http://www.cs.brown.edu/~olga/"&gt;Olga Papaemmanouil&lt;/a&gt;, and &lt;a href="http://davis.wpi.edu/dsrg/MEMBERS/rundenst/"&gt;Elke Rundensteiner&lt;/a&gt;). The schedule is really packed this year, with three keynotes/invited talks, eight technical session talks, and approximately thirty posters from students and researchers. The featured talks are from &lt;a href="http://research.yahoo.com/Raghu_Ramakrishnan"&gt;Raghu Ramakrishnan&lt;/a&gt; (Chief Scientist for Audience &amp;amp; Cloud Computing at Yahoo!), &lt;a href="http://www.monash.com/curtbio.html"&gt;Curt Monash&lt;/a&gt;  (President, Monash Research), and &lt;a href="http://www.almaden.ibm.com/u/mohan/"&gt;C. 
Mohan&lt;/a&gt; (IBM Fellow and IBM India Chief Scientist).&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://db.csail.mit.edu/nedbday10/htdocs/index.php?page=index"&gt;Registration &lt;/a&gt;for New England Database Summit is free (thanks Netezza!), but we need to know how much food, coffee, appetizers, and drinks to order (breakfast and lunch are included in the free admission, and there will be free beer and wine during the poster session), so &lt;b&gt;registration will be closed after 5PM on Friday, January 22nd&lt;/b&gt;.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In the past we've had 200+ attendees, so it's a great opportunity to come network with researchers, academics, and database professionals from all over New England.&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A summary of the current program is listed below. However, the &lt;a href="http://db.csail.mit.edu/nedbday10/program.html"&gt;NEDBSummit Website&lt;/a&gt; has more details (including talk abstracts). 
&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;table frame="border"&gt; &lt;tbody&gt;&lt;tr&gt;&lt;td width="100"&gt; &lt;u&gt;Time&lt;/u&gt; &lt;/td&gt; &lt;td&gt; &lt;u&gt;Event&lt;/u&gt; &lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td&gt; 9:00 AM&lt;/td&gt; &lt;td&gt; Welcoming remarks &lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top"&gt; 9:10-10:10&lt;/td&gt; &lt;td&gt; Raghu Ramakrishnan (Chief&lt;br /&gt;Scientist for Audience &amp;amp; Cloud Computing at Yahoo!)&lt;i&gt; Cloud Data Serving&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td&gt; 10:10-10:35&lt;/td&gt; &lt;td&gt; Coffee Break&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td&gt; &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;th colspan="3" bgcolor="#8888FF"&gt;  Technical Session 1 &lt;/th&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top"&gt;10:35-10:55&lt;/td&gt; &lt;td&gt;Carlo Curino, Evan  Jones, Yang&lt;br /&gt;Zhang, Eugene Wu, Sam Madden &lt;i&gt;RelationalCloud: The case for a database&lt;br /&gt;service&lt;/i&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top"&gt;10:55-11:15&lt;/td&gt; &lt;td&gt;Mike Dirolf &lt;i&gt;An Introduction to MongoDB&lt;/i&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top"&gt;11:15-11:35&lt;/td&gt; &lt;td&gt;Elke Rundensteiner, R. Nehme,&lt;br /&gt;and E. 
Bertino &lt;i&gt;The Query Mesh Project: A&lt;br /&gt;Powerful Multi-Route Query Processing Paradigm&lt;/i&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top"&gt;11:35-11:55&lt;/td&gt; &lt;td&gt;Andy Pavlo &lt;i&gt;MapReduce and Parallel&lt;br /&gt;DBMSs: Together At Last&lt;/i&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top"&gt;11:55-12:15&lt;/td&gt; &lt;td&gt;Gregory Malecha, Greg&lt;br /&gt;Morrisett, Avraham Shinnar, and Ryan Wisnesky &lt;i&gt;Toward a Verified&lt;br /&gt;Relational Database Management System&lt;/i&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt; &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt; 12:15 PM&lt;/td&gt; &lt;td&gt; Lunch &lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td&gt; &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td valign="top"&gt; 1:10-2:10&lt;/td&gt; &lt;td&gt; Curt Monash (President, Monash Research). &lt;i&gt;Database and analytic technology: The state of the union&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt; &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;th colspan="3" bgcolor="#8888FF"&gt;  Technical Session 2&lt;/th&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top"&gt;2:10-2:30&lt;/td&gt; &lt;td&gt;Paul Brown &lt;i&gt;SciDB: Massively&lt;br /&gt;Parallel Array Data Storage, Processing and Analysis &lt;/i&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt; 2:30-2:50&lt;/td&gt; &lt;td&gt; Coffee Break&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top"&gt;2:50-3:10&lt;/td&gt; &lt;td&gt;Julia Stoyanovich, William Mee,&lt;br /&gt;Kenneth A. 
Ross &lt;i&gt;Semantic Ranking and Result Visualization for Life&lt;br /&gt;Sciences Publications&lt;/i&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top"&gt;3:10-3:30&lt;/td&gt; &lt;td&gt;Mirek Riedewald, Alper Okcan,&lt;br /&gt;Daniel Fink &lt;i&gt;Scalable Search and Ranking for Scientific Data&lt;/i&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt; &lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td valign="top"&gt; 3:30-4:30&lt;/td&gt; &lt;td&gt; C. Mohan (IBM Fellow and&lt;br /&gt;IBM India Chief Scientist). &lt;i&gt;Implications of Storage Class&lt;br /&gt;Memories (SCMs) on Software Architectures&lt;/i&gt;&lt;br /&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td&gt; 4:30 PM&lt;/td&gt; &lt;td&gt; Poster Session and Appetizers / Drinks (Building 32, R&amp;amp;D Area, 4th Floor)&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td&gt; 6:00 PM&lt;/td&gt; &lt;td&gt; Adjourn &lt;/td&gt; &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8899645800948009496-1856065087991873525?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/1856065087991873525/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2010/01/new-england-database-summit-2010.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/1856065087991873525'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/1856065087991873525'/><link rel='alternate' type='text/html' href='http://dbmsmusings.blogspot.com/2010/01/new-england-database-summit-2010.html' title='New England Database Summit 2010 Program'/><author><name>Daniel 
Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-2835843758882009805</id><published>2010-01-07T10:49:00.000-08:00</published><updated>2010-01-19T07:57:37.959-08:00</updated><title type='text'>Exadata's columnar compression</title><content type='html'>&lt;p class="MsoNormal"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;I recently came across a &lt;a href="http://www.oracle.com/technology/products/bi/db/exadata/pdf/ehcc_twp.pdf"&gt;nice whitepaper from Oracle&lt;/a&gt; that describes the Exadata columnar compression scheme. I wrote up a brief overview of Oracle's columnar compression in the past (in my &lt;a href="http://dbmsmusings.blogspot.com/2009/09/tour-through-hybrid-columnrow-oriented.html"&gt;hybrid storage layout post&lt;/a&gt;), but I was pleased to see the whitepaper give some additional details. Before discussing the main content of the whitepaper, I want to correct a few technical inaccuracies in the article:&lt;/span&gt;&lt;/p&gt;&lt;p style="color: rgb(0, 0, 153);" class="MsoNormal"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;blockquote&gt;"Storing column data together, with the same data type and similar characteristics, drastically increases the storage savings achieved from compression. 
However, storing data in this manner can negatively impact database performance when application queries access more than one or two columns, perform even a modest number of updates, or insert small numbers of rows per transaction."&lt;/blockquote&gt;&lt;/span&gt;&lt;/p&gt;&lt;p class="MsoNormal"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;The first part of the above statement is usually true; columns tend to be self-similar since they contain data from the same attribute domain, and therefore laying out columns contiguously on storage typically yields lower entropy and higher compression ratios than storing rows contiguously. However, the second part of the statement is quite misleading and only true for the most naïve of column-oriented layout schemes. True, the benefit of storing data in columns decreases as a query accesses a higher percentage of columns from a table. However, the number of columns that need to be accessed such that storing data in columns impacts performance negatively relative to storing data in rows is quite a bit larger than one or two columns. There’s a &lt;a href="http://cs-www.cs.yale.edu/homes/dna/papers/VLDB06.pdf"&gt;VLDB 2006 paper&lt;/a&gt; by Stavros Harizopoulos et al. that runs a bunch of benchmarks to understand this tradeoff in more detail. There are a bunch of factors that affect whether column-storage or row-storage is best beyond simply the number of columns accessed (e.g. prefetch size, query selectivity, compression types), most notably the fact that column-stores struggle on needle-in-a-haystack queries since the multiple I/Os needed to get data from different columns are not amortized across the retrieval of multiple rows, but the bottom line is that the query space where pure column-oriented layout outperforms pure row-oriented layout is quite a bit larger than what the Oracle whitepaper claims. 
Regarding updates and inserts, the VLDB 2005 &lt;a href="http://cs-www.cs.yale.edu/homes/dna/vldb.pdf"&gt;C-Store paper&lt;/a&gt; and OSDI 2006 &lt;a href="http://labs.google.com/papers/bigtable-osdi06.pdf"&gt;Google Bigtable paper&lt;/a&gt; discuss how to alleviate this column-store disadvantage by temporarily storing the updates in a write-optimized store.&lt;/span&gt;&lt;/p&gt;&lt;p style="color: rgb(0, 0, 153);" class="MsoNormal"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;blockquote&gt;"Oracle’s Exadata Hybrid Columnar Compression technology is a new method for organizing data within a database block."&lt;/blockquote&gt;&lt;/span&gt;&lt;/p&gt;&lt;p class="MsoNormal"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;I feel that this is not giving enough credit to the &lt;a href="http://www.dia.uniroma3.it/%7Evldbproc/021_169.pdf"&gt;VLDB 2001 paper&lt;/a&gt; by Ailamaki et al. on PAX. Even though the PAX paper didn’t apply compression to this layout, and PAX didn’t include multiple blocks inside a single compression unit, the basic idea is the same.  &lt;/span&gt;&lt;/p&gt;&lt;p style="color: rgb(0, 0, 153);" class="MsoNormal"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;blockquote&gt;One of the key benefits of the Exadata Hybrid Columnar approach is that it provides both the compression and performance benefits of columnar storage without sacrificing the robust feature set of the Oracle Database. 
For example, while optimized for scan-level access, because row data is self-contained within compression units, Oracle is still able to provide efficient row-level access, with entire rows typically being retrieved with a single I/O.&lt;/blockquote&gt;&lt;/span&gt;&lt;/p&gt;&lt;p class="MsoNormal"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;To people who have even a rudimentary knowledge of column-stores, the juxtaposed statements “entire rows being retrieved with a single I/O” and “provides the performance benefits of columnar storage” are clearly inaccurate. The most often cited advantage of column-stores (indeed cited even more often than the compression advantages) is that if you have a query that has to scan two out of twenty columns in a table (e.g. “tell me the sum of revenue per store”), a column-store only has to read in those two columns while a row-store has to scan the entire table (because multiple entire rows are picked up with each I/O). Hence, a column-store is very I/O efficient, an increasingly important bottleneck in database performance (albeit somewhat reduced with good compression algorithms). 
Again, Oracle’s statement is true for needle-in-a-haystack queries, but false for the more frequent scan-based queries one finds in data warehouse workloads (which is where Exadata is primarily marketed).&lt;/span&gt;&lt;/p&gt;&lt;p class="MsoNormal"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;&lt;p class="MsoNormal"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;Despite the above overstatements, it is clear that Exadata’s Hybrid Columnar scheme is a big win (relative to vanilla Exadata) for most Exadata datasets and workloads, for the following reasons:&lt;/span&gt;&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;Storage costs are lowered by having a smaller data footprint.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;I/O performance is improved by having to read less data from storage.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;Data is kept compressed as it is scanned in the storage layer. Selections and projections performed in the storage layer can be performed directly on the compressed data. (Those of you who are familiar with my academic work will not be surprised I like this, given that I wrote a &lt;a href="http://cs-www.cs.yale.edu/homes/dna/papers/abadisigmod06.pdf"&gt;paper on integrating compression and execution for column-oriented storage layouts&lt;/a&gt; in a SIGMOD 2006 paper). 
This yields all the storage costs savings and runtime I/O savings of compression without the performance disadvantage of runtime decompression.&lt;br /&gt;&lt;br /&gt;The whitepaper is vague when it comes to the question of whether&lt;/span&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt; data remains in compressed form after it is scanned in the Exadata Storage Servers as it is shipped over the Infiniband network to Oracle RAC for further processing. It appears from the way I read the paper (I could be wrong about this) that data is decompressed before it is shipped over the network, which means that Oracle is not gaining all of the potential performance from operating directly on compressed data, as data does have to be decompressed eventually, though the amount of data that needs to be decompressed is smaller, since selections and projections have occurred. If Oracle RAC could operate directly on the columnar compressed format, then decompression would never have to occur, which would further increase performance (query results would obviously have to be decompressed, but usually the magnitude of query results is much smaller than the magnitude of data processed to produce them). Not only is Oracle missing out on potential performance by decompressing data before sending it over the network, but also network bottlenecks could be introduced.&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;If storage costs are the most important consideration (e.g. for historical data that would otherwise be archived to tape or other offline device), there’s a knob that allows you to further increase compression at the cost of having to decompress data at runtime (and this decompression is heavyweight), thereby reducing runtime performance. 
Furthermore, the same table can be compressed using the lightweight (high-performance) compression and heavyweight (low-performance) compression using Oracle’s partitioning feature (e.g. you could partition by date and have the historical partitions use the heavyweight compression scheme and the active partitions use the lightweight compression scheme). Note that lightweight still yields good compression --- Oracle claims up to 10X, but this is obviously going to be very dependent on your data.&lt;/span&gt;&lt;/li&gt;&lt;/ol&gt;&lt;p class="MsoNormal"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;The most interesting part of the whitepaper was the figure of a compression unit on page four. It is clear that one complication that arises from compressing each column separately is that for a fixed number of rows (e.g. tuples numbered 1-1000), the column sizes are going to vary wildly (depending on how compressible the data from each column is). Figuring out how to fit all columns from a set of rows into a fixed-size block (or set of blocks inside a compression unit) without leaving large chunks of the block empty is a nontrivial optimization problem, and something that Oracle probably had to think about in detail in their implementation.&lt;/span&gt;&lt;/p&gt;&lt;p class="MsoNormal"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;Anyway, if you’re interested in Oracle’s approach to hybrid columnar storage, the whitepaper is worth a read. 
With &lt;a href="http://www.dbms2.com/2009/12/29/this-and-that/"&gt;rumors of additional developments&lt;/a&gt; in Oracle’s hybrid columnar storage starting to surface, it is likely that this will not be my last blog post on this subject.&lt;/span&gt;&lt;/p&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8899645800948009496-2835843758882009805?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/2835843758882009805/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2010/01/exadatas-columnar-compression.html#comment-form' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/2835843758882009805'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/2835843758882009805'/><link rel='alternate' type='text/html' href='http://dbmsmusings.blogspot.com/2010/01/exadatas-columnar-compression.html' title='Exadata&apos;s columnar compression'/><author><name>Daniel Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-8838690019705589210</id><published>2009-12-30T07:45:00.000-08:00</published><updated>2009-12-30T08:58:27.284-08:00</updated><title type='text'>2009's top blog posts</title><content type='html'>Below are my top six blog postings of 2009, in order of the number of page views as calculated by Google Analytics:&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;&lt;a 
href="http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-longer.html"&gt;Announcing release of HadoopDB (longer version)&lt;/a&gt;, and (&lt;a href="http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-shorter.html"&gt;shorter version&lt;/a&gt;). Combined visits: 31,228 (26,650 and 4,638 for the longer and shorter versions respectively).&lt;br /&gt;&lt;br /&gt;The post gave an overview of the HadoopDB project that was released from my research group at Yale over the summer. The basic idea is to combine the scalability and ease-of-use of Hadoop with the performance on structured data of relational database systems.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://dbmsmusings.blogspot.com/2009/09/tour-through-hybrid-columnrow-oriented.html"&gt;A tour through hybrid column/row-oriented DBMS schemes&lt;/a&gt;. 2,922 visits.&lt;br /&gt;&lt;br /&gt;This might be the post I'm most proud of, from the ones on this list. It went through different ways one can combine row-store and column-store technology in a single DBMS. I believe that hybrid database systems along the lines described in the post are going to take off in 2010, and the predictions made in the post will soon come to fruition.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://dbmsmusings.blogspot.com/2009/07/paraccel-and-their-puzzling-tpc-h.html"&gt;ParAccel and their puzzling TPC-H results&lt;/a&gt;. 2,499 visits.&lt;br /&gt;&lt;br /&gt;Of the posts on this list, this is the one I like the least, and the one that needs to be rewritten the most. I don't recommend reading it, but if you do, make sure you read the corrections in the comment thread in addition to the main article. However, the post is very stale at this point, since it discusses some TPC-H results from ParAccel that were challenged by a competitor and ParAccel later withdrew in September. 
ParAccel has not yet rerun these 30TB TPC-H benchmark results.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://dbmsmusings.blogspot.com/2009/07/watch-out-for-vectorwise.html"&gt;Watch out for VectorWise&lt;/a&gt;. 1,595 visits.&lt;br /&gt;&lt;br /&gt;This post discussed the technology behind Ingres' new column-store storage engine designed for analytical workloads, based on some research out of CWI. I am very high on this technology and my research group has had a chance to play around with it a little this winter.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://dbmsmusings.blogspot.com/2009/10/analysis-of-mapreduce-online-paper.html"&gt;Analysis of the "MapReduce Online" paper&lt;/a&gt;. 1,108 visits.&lt;br /&gt;&lt;br /&gt;I'm surprised this post got so many visits, since it was really geared only for readers in the research community. This post reviewed some research performed at the University of California, Berkeley, which explored how to pipeline results from different phases in MapReduce jobs to improve performance and enable early estimations of results. Rumor has it that this paper was accepted to NSDI 2010. I think the model of releasing papers as technical reports and independently reviewing and recommending them in public on venues like blogs is an interesting model to consider for the next decade, rather than the private 'accept' or 'reject' reviewing process currently used today.&lt;br /&gt;&lt;br /&gt; &lt;/li&gt;&lt;li&gt;&lt;a href="http://dbmsmusings.blogspot.com/2009/09/kickfires-approach-to-parallelism.html"&gt;Kickfire’s approach to parallelism&lt;/a&gt;. 1,042 visits.&lt;br /&gt;&lt;br /&gt;This post takes a deeper dive into Kickfire's technology than you'll find in most other places. 
I find the way that they use FPGA technology to maximize the parallelism that can be achieved on analytical queries in a single-box machine to be quite impressive.&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;&lt;br /&gt;The post from 2009 that I feel is the most underrated (meaning that the number of visits did not match up with what I felt was the quality of the post) was:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://dbmsmusings.blogspot.com/2009/06/what-is-right-way-to-measure-scale.html"&gt;What is the right way to measure scale?&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I really feel that people thought about scale in the wrong way in 2009. People assume that if a database can fit a lot of data inside of it (i.e. many petabytes), it must be really scalable. But if the data is not very accessible (i.e. it is stored on slow media, or it takes forever to scan through it all because there are not enough disk spindles or CPUs to process it), then the system is not nearly as scalable as it would seem.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;Bottom line, if you only have time to read three of my postings from 2009, I would like them to be:&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;&lt;a href="http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-longer.html"&gt;The longer HadoopDB post&lt;/a&gt; (this is the only post about my own research)&lt;/li&gt;&lt;li&gt;&lt;a href="http://dbmsmusings.blogspot.com/2009/09/tour-through-hybrid-columnrow-oriented.html"&gt;The hybrid column/row-store post&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://dbmsmusings.blogspot.com/2009/06/what-is-right-way-to-measure-scale.html"&gt;The post on measuring scale&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;Overall, I'm pleased with the impact my blog seems to have had, and I intend to continue writing posts for it in 2010.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' 
src='https://blogger.googleusercontent.com/tracker/8899645800948009496-8838690019705589210?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/8838690019705589210/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2009/12/2009s-top-blog-posts.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/8838690019705589210'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/8838690019705589210'/><link rel='alternate' type='text/html' href='http://dbmsmusings.blogspot.com/2009/12/2009s-top-blog-posts.html' title='2009&apos;s top blog posts'/><author><name>Daniel Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-72080392200139313</id><published>2009-11-24T19:43:00.001-08:00</published><updated>2009-11-24T20:52:17.869-08:00</updated><title type='text'>Deadlines approaching for two upcoming summits</title><content type='html'>There are two upcoming events that I suspect will be of interest to readers of this blog.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;First, the &lt;a href="http://research.microsoft.com/en-us/um/redmond/events/socc2010/index.htm"&gt;first annual &lt;span id="SPELLING_ERROR_0" class="blsp-spelling-error"&gt;ACM&lt;/span&gt; Symposium on Cloud Computing 2010&lt;/a&gt; (&lt;span id="SPELLING_ERROR_1" class="blsp-spelling-error"&gt;ACM&lt;/span&gt; &lt;span id="SPELLING_ERROR_2" class="blsp-spelling-error"&gt;SOCC&lt;/span&gt; 2010) 
will be held June 10&lt;span id="SPELLING_ERROR_3" class="blsp-spelling-error"&gt;th&lt;/span&gt; and 11&lt;span id="SPELLING_ERROR_4" class="blsp-spelling-error"&gt;th&lt;/span&gt; in Indianapolis, IN (&lt;span id="SPELLING_ERROR_5" class="blsp-spelling-error"&gt;co-located&lt;/span&gt; with &lt;span id="SPELLING_ERROR_6" class="blsp-spelling-error"&gt;SIGMOD&lt;/span&gt; 2010). This symposium will focus on systems and data management issues within the context of cloud computing. If the quality of the papers comes close to matching the quality of the people listed as &lt;a href="http://research.microsoft.com/en-us/um/redmond/events/socc2010/organizers.htm"&gt;organizers and program committee members&lt;/a&gt; of the event, then this will be a can't-miss highlight of 2010. If you are doing research in software as a service, &lt;span id="SPELLING_ERROR_7" class="blsp-spelling-error"&gt;virtualization&lt;/span&gt;, or scalable cloud data services, this venue will likely be a nice, high-profile place to publish your findings (the paper submission deadline is &lt;a href="http://research.microsoft.com/en-us/um/redmond/events/socc2010/submit.htm"&gt;January 15&lt;span id="SPELLING_ERROR_8" class="blsp-spelling-error"&gt;th&lt;/span&gt;, 2010&lt;/a&gt;).&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Second, on January 28&lt;span id="SPELLING_ERROR_9" class="blsp-spelling-error"&gt;th&lt;/span&gt;, 2010, the &lt;a href="http://db.csail.mit.edu/nedbday10/"&gt;third annual New England Database Summit&lt;/a&gt; will be held at MIT in Cambridge, Massachusetts. This will be an all day conference-style event where participants from the research community and industry in the New England area can come together to present ideas and discuss their research and experiences. Registration for the event is free (thank you &lt;span id="SPELLING_ERROR_10" class="blsp-spelling-error"&gt;Netezza&lt;/span&gt;), and anyone is welcome to attend. 
The event will feature a keynote talk from &lt;span id="SPELLING_ERROR_11" class="blsp-spelling-error"&gt;Raghu&lt;/span&gt; &lt;span id="SPELLING_ERROR_12" class="blsp-spelling-error"&gt;Ramakrishnan&lt;/span&gt; (Chief Scientist for Audience &amp;amp; Cloud Computing at Yahoo!) and an invited talk from Curt &lt;span id="SPELLING_ERROR_13" class="blsp-spelling-error"&gt;Monash&lt;/span&gt; (President, &lt;span id="SPELLING_ERROR_14" class="blsp-spelling-error"&gt;Monash&lt;/span&gt; Research).&lt;br /&gt;&lt;p&gt;I'm serving as PC Chair for the event this year, but I don't expect to make any radical changes to how the summit has been run in previous years, with the day beginning with a keynote talk, followed by a series of 15-25 minute talks from different summit participants (submit a talk abstract &lt;a href="http://db.csail.mit.edu/nedbday10/htdocs/index.php"&gt;here&lt;/a&gt; if you want to give a talk --- the program committee will select the set of talks from this pool of abstracts based on what we expect will appeal most to the summit audience), followed by another invited talk and then a poster session over appetizers and drinks at the end of the day.&lt;/p&gt;&lt;p&gt;We had over 300 registered participants at last year's event, reflecting the vibrancy of database systems activity in the New England area. We expect similar numbers at this year's event. Lunch, drinks, and appetizers are all included for free.&lt;/p&gt;&lt;p&gt;Poster and talk proposal submissions must be made by January 11, 2010. 
All reasonable posters will be accepted; talk invitations will be made by January 20, 2010.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8899645800948009496-72080392200139313?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/72080392200139313/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2009/11/deadlines-approaching-for-two-upcoming.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/72080392200139313'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/72080392200139313'/><link rel='alternate' type='text/html' href='http://dbmsmusings.blogspot.com/2009/11/deadlines-approaching-for-two-upcoming.html' title='Deadlines approaching for two upcoming summits'/><author><name>Daniel Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-7786256984169570519</id><published>2009-10-22T07:14:00.000-07:00</published><updated>2009-10-22T07:54:55.007-07:00</updated><title type='text'>Analysis of the "MapReduce Online" paper</title><content type='html'>I recently came across a paper entitled &lt;a href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-136.html"&gt;“&lt;span id="SPELLING_ERROR_0" class="blsp-spelling-error"&gt;MapReduce&lt;/span&gt; Online”&lt;/a&gt; written by Tyson &lt;span id="SPELLING_ERROR_1" 
class="blsp-spelling-error"&gt;Condie&lt;/span&gt;, Neil Conway, Peter Alvaro, Joe &lt;span id="SPELLING_ERROR_2" class="blsp-spelling-error"&gt;Hellerstein&lt;/span&gt;, &lt;span id="SPELLING_ERROR_3" class="blsp-spelling-error"&gt;Khaled&lt;/span&gt; &lt;span id="SPELLING_ERROR_4" class="blsp-spelling-error"&gt;Elmeleegy&lt;/span&gt;, and Russell Sears at Berkeley (University of California). Since I’m very interested in &lt;span id="SPELLING_ERROR_5" class="blsp-spelling-error"&gt;Hadoop&lt;/span&gt;-related research (see my group’s &lt;a href="http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-longer.html"&gt;work on &lt;span id="SPELLING_ERROR_6" class="blsp-spelling-error"&gt;HadoopDB&lt;/span&gt;&lt;/a&gt;) and this Berkeley group have historically produced reliably good research papers, I immediately downloaded the paper and read it carefully. The paper demonstrates the impact of several improvements the group made to &lt;span id="SPELLING_ERROR_7" class="blsp-spelling-error"&gt;Hadoop&lt;/span&gt; for interactive queries, and since &lt;span id="SPELLING_ERROR_8" class="blsp-spelling-error"&gt;Hadoop&lt;/span&gt; is becoming increasingly popular, I expect this paper to have wide interest. Therefore, I think it might be useful to post a summary of the paper and some analysis on this blog. 
If you have also read this paper, I encourage discussion in the comment thread of this post.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;&lt;span style="font-size:130%;"&gt;Overview&lt;/span&gt;&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;The authors argue that since &lt;span id="SPELLING_ERROR_9" class="blsp-spelling-error"&gt;MapReduce&lt;/span&gt;’s (and therefore &lt;span id="SPELLING_ERROR_10" class="blsp-spelling-error"&gt;Hadoop&lt;/span&gt;’s) roots are in batch processing, design decisions were made that cause problems as &lt;span id="SPELLING_ERROR_11" class="blsp-spelling-error"&gt;Hadoop&lt;/span&gt; gets used more and more for interactive query processing. The main problem the paper addresses is how data is transferred between operators --- both between ‘Map’ and ‘Reduce’ operators within a job, and also across multiple &lt;span id="SPELLING_ERROR_12" class="blsp-spelling-error"&gt;MapReduce&lt;/span&gt; jobs. This main problem has two &lt;span id="SPELLING_ERROR_13" class="blsp-spelling-error"&gt;subproblems&lt;/span&gt;:&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Data is &lt;strong&gt;pulled&lt;/strong&gt; by downstream operators from upstream operators (e.g., a Reduce task requests all the data it needs from predecessor Map tasks) instead of &lt;strong&gt;pushed&lt;/strong&gt; from upstream operators to downstream operators. &lt;/li&gt;&lt;li&gt;Every &lt;span id="SPELLING_ERROR_14" class="blsp-spelling-error"&gt;MapReduce&lt;/span&gt; operator is a &lt;strong&gt;blocking&lt;/strong&gt; operator. Reduce tasks cannot begin until Map tasks are finished, and Map tasks for the next job cannot get started until the Reduce tasks from the previous job have completed. 
&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;The paper does not really distinguish between these two &lt;span id="SPELLING_ERROR_15" class="blsp-spelling-error"&gt;subproblems&lt;/span&gt;, but for the purposes of this discussion, I think it is best to separate them.&lt;br /&gt;&lt;br /&gt;&lt;span id="SPELLING_ERROR_16" class="blsp-spelling-error"&gt;Subproblem&lt;/span&gt; 2 causes issues for interactive queries for four reasons:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;It eliminates opportunities for &lt;span id="SPELLING_ERROR_17" class="blsp-spelling-error"&gt;pipelined&lt;/span&gt; parallelism (where, e.g., Map and Reduce tasks from the same job can be running simultaneously). There are some cases (such as the example query in the paper --- more on that later) where &lt;span id="SPELLING_ERROR_18" class="blsp-spelling-error"&gt;pipelined&lt;/span&gt; parallelism can significantly reduce query latency. &lt;/li&gt;&lt;li&gt;It causes spikes in network utilization (there is no network traffic until a Map task finishes, and then the entire output must be sent at once to the Reduce nodes). This is particularly problematic if Map tasks across nodes in a cluster are &lt;span id="SPELLING_ERROR_19" class="blsp-spelling-corrected"&gt;synchronized&lt;/span&gt;.&lt;/li&gt;&lt;li&gt;When every operator is a blocking operator, it is impossible to get any estimate of a query result until the very end of query execution. If data instead can be &lt;span id="SPELLING_ERROR_20" class="blsp-spelling-error"&gt;pipelined&lt;/span&gt; through query operators as it is processed, then we can start receiving query results almost immediately, and these early query results might be sufficient for the end user, who can then stop the query before it runs to completion. 
&lt;/li&gt;&lt;li&gt;If the Map part of a &lt;span id="SPELLING_ERROR_21" class="blsp-spelling-error"&gt;MapReduce&lt;/span&gt; job reads from a continuous (and infinite) data source instead of from a file, no operator ever “completes”, so a blocking operator that &lt;span id="SPELLING_ERROR_22" class="blsp-spelling-error"&gt;doesn&lt;/span&gt;’t produce results until it completes is entirely useless.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;span id="SPELLING_ERROR_23" class="blsp-spelling-error"&gt;Subproblem&lt;/span&gt; 1 (pull vs push) is problematic mostly due to performance reasons. Data that is pushed can be shipped as soon as it is produced, while it is still in cache. If it is pulled at a later time, accessing the data often incurs an index &lt;span id="SPELLING_ERROR_24" class="blsp-spelling-error"&gt;lookup&lt;/span&gt; and a disk read, which can reduce performance. There is also slightly more coordination (and therefore network traffic) across nodes in the pull model relative to the push model.&lt;br /&gt;&lt;br /&gt;The goal of the paper is to fix these &lt;span id="SPELLING_ERROR_25" class="blsp-spelling-error"&gt;subproblems&lt;/span&gt; that arise on interactive query workloads, while maintaining &lt;span id="SPELLING_ERROR_26" class="blsp-spelling-error"&gt;Hadoop&lt;/span&gt;’s famous query-level fault tolerance (very little query progress is lost upon a node failure). Essentially, this means that we want to continually push data from producer operators to consumer operators on the fly as the producer operator produces output. 
However, in order to maintain &lt;span id="SPELLING_ERROR_27" class="blsp-spelling-error"&gt;Hadoop&lt;/span&gt;’s fault tolerance, data that is pushed is also &lt;span id="SPELLING_ERROR_28" class="blsp-spelling-error"&gt;checkpointed&lt;/span&gt; locally to disk before being sent, so that if a downstream node fails, the task assigned to this node can be rescheduled on a different node, which will begin by pulling data from this upstream checkpoint just as standard &lt;span id="SPELLING_ERROR_29" class="blsp-spelling-error"&gt;Hadoop&lt;/span&gt; does.&lt;br /&gt;&lt;br /&gt;The previous two sentences capture the basic idea suggested in the paper. There are some additional details that are also described (1. to avoid having too many &lt;span id="SPELLING_ERROR_30" class="blsp-spelling-error"&gt;TCP&lt;/span&gt; connections between Map and Reduce nodes, not all data can be immediately pushed; 2. data that is pushed should be sent in batches and “combined” before being shipped; 3. in order to handle potential Map node failures, Reduce tasks that receive data before Map tasks have completed have to treat this data as “tentative” until the Map task “commits”), but these details do not change the basic idea: push data downstream as it is produced, but still write all data locally to disk before it is shipped.&lt;br /&gt;&lt;br /&gt;The authors implement this idea in &lt;span id="SPELLING_ERROR_31" class="blsp-spelling-error"&gt;Hadoop&lt;/span&gt;, and then run some experiments to demonstrate how these changes (1) improve query latency, (2) enable early approximations of query results, and (3) enable the use of &lt;span id="SPELLING_ERROR_32" class="blsp-spelling-error"&gt;Hadoop&lt;/span&gt; for continuous streams of data. 
In other words, all of the problems that &lt;span id="SPELLING_ERROR_33" class="blsp-spelling-error"&gt;Hadoop&lt;/span&gt; has for interactive query workloads described above are fixed.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;strong&gt;Analysis&lt;/strong&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;The paper evaluates mainly &lt;span id="SPELLING_ERROR_34" class="blsp-spelling-error"&gt;subproblem&lt;/span&gt; (2): every operator in &lt;span id="SPELLING_ERROR_35" class="blsp-spelling-error"&gt;MapReduce&lt;/span&gt; is a blocking operator. The main focus of the paper is, in particular, on turning the Map operator from a blocking operator to an operator that can ship query results as they are produced. This technique gets more and more useful the fewer map tasks there are. If there are more nodes than Map tasks (as is the case for every experiment in this paper) then there are entire nodes that have only Reduce tasks assigned to them, and without &lt;span id="SPELLING_ERROR_36" class="blsp-spelling-error"&gt;pipelining&lt;/span&gt;, these nodes have to sit around and do nothing until the Map tasks start to finish.&lt;br /&gt;&lt;br /&gt;Even if there are as many nodes as Map tasks (so every node can work on a Map task during the Map phase), each node can run a Map task and a Reduce task in parallel and the &lt;span id="SPELLING_ERROR_37" class="blsp-spelling-error"&gt;multicore&lt;/span&gt; resources of the node can be more fully utilized if the Reduce task has something to do before the Map task finishes. However, as there are more Map tasks assigned per node (Google reports 200,000 Map tasks for 2,000 nodes in the original &lt;span id="SPELLING_ERROR_38" class="blsp-spelling-error"&gt;MapReduce&lt;/span&gt; paper), I would predict that the performance improvement from being able to start the Reduce nodes early gets steadily smaller. 
I also wonder if the need to send data from a Map task before it completes for the purposes of online aggregation also gets steadily smaller, since the first Map tasks finish at a relatively earlier time in query processing when nodes have to process many Map tasks.&lt;br /&gt;&lt;br /&gt;In general, the performance studies that are presented show the best case scenario for the ideas presented in the paper: (1) fewer than 1 Map task assigned per node (2) the Map task does not filter any data so there is a lot of data that must be shipped between the Map and Reduce phases (3) there is no combiner function (4) there are no concurrent &lt;span id="SPELLING_ERROR_39" class="blsp-spelling-error"&gt;MapReduce&lt;/span&gt; jobs. Although it is useful to see performance improvement in this best-case scenario, it would also be interesting to see how performance is affected in less ideal settings. I’m particularly interested in seeing what happens when there are many more Map tasks per node.&lt;br /&gt;&lt;br /&gt;Even in a less ideal setting I would still expect to see performance improvement from the ideas presented in this paper due to the authors' solution to &lt;span id="SPELLING_ERROR_40" class="blsp-spelling-error"&gt;subproblem&lt;/span&gt; 1 (switching from a pull-model to a push-model). 
It would be great if the experiments could be extended so that the reader can understand how much of the performance improvement is from the authors' solution to &lt;span id="SPELLING_ERROR_41" class="blsp-spelling-error"&gt;subproblem&lt;/span&gt; 2 (the Reduce nodes can start early) and how much of the performance improvement is from the disk seeks/reads saved by switching from a pull-model to a push-model.&lt;br /&gt;&lt;br /&gt;Another performance-related conclusion one can draw from this paper (not mentioned by the authors, but still interesting) is that &lt;span id="SPELLING_ERROR_42" class="blsp-spelling-error"&gt;Hadoop&lt;/span&gt; still has plenty of room for improvement from a raw performance perspective. The main task in this paper involved &lt;span id="SPELLING_ERROR_43" class="blsp-spelling-error"&gt;tokenizing&lt;/span&gt; and sorting a 5.5 GB file. &lt;span id="SPELLING_ERROR_44" class="blsp-spelling-error"&gt;Hadoop&lt;/span&gt; had to use 60 (seriously, 60!) nodes to do this in 900 seconds. The techniques introduced in the paper allowed &lt;span id="SPELLING_ERROR_45" class="blsp-spelling-error"&gt;Hadoop&lt;/span&gt; to do this in 600 seconds. But does anyone else think they can write a C program that does this on a single node (1/60&lt;span id="SPELLING_ERROR_46" class="blsp-spelling-error"&gt;th&lt;/span&gt; of the machines) in half the time?&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;&lt;span style="font-size:130%;"&gt;Conclusion&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;The ideas presented in this paper provide clear advantages for interactive query workloads on &lt;span id="SPELLING_ERROR_47" class="blsp-spelling-error"&gt;Hadoop&lt;/span&gt;. For this reason, I look forward to the modifications made by the authors making it into the standard &lt;span id="SPELLING_ERROR_48" class="blsp-spelling-error"&gt;Hadoop&lt;/span&gt; distribution. 
Since &lt;span id="SPELLING_ERROR_49" class="blsp-spelling-error"&gt;HadoopDB&lt;/span&gt; is able to leverage improvements made to &lt;span id="SPELLING_ERROR_50" class="blsp-spelling-error"&gt;Hadoop&lt;/span&gt;, these techniques should be great for &lt;span id="SPELLING_ERROR_51" class="blsp-spelling-error"&gt;HadoopDB&lt;/span&gt; as well. While I think the performance advantages in the paper may be a little overstated, I do believe that switching from pull to push will improve performance by at least a small amount for many queries, and the ability to support online aggregation and continuous queries is a nice contribution. My suggestion to the authors would be to spend more time focusing on online aggregation and continuous queries, and less time extolling the performance advantages of their techniques. For example, one question I had about online aggregation is the following: the running aggregation is done using the Map tasks that finish first. But the problem is that Map tasks run on contiguous chunks of an &lt;span id="SPELLING_ERROR_52" class="blsp-spelling-error"&gt;HDFS&lt;/span&gt; file --- data in the same partition are often more similar to each other than data on other partitions. 
Hence, might data skew become a problem?&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8899645800948009496-7786256984169570519?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/7786256984169570519/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2009/10/analysis-of-mapreduce-online-paper.html#comment-form' title='7 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/7786256984169570519'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/7786256984169570519'/><link rel='alternate' type='text/html' href='http://dbmsmusings.blogspot.com/2009/10/analysis-of-mapreduce-online-paper.html' title='Analysis of the &quot;MapReduce Online&quot; paper'/><author><name>Daniel Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-8735716178552605933</id><published>2009-10-14T11:04:00.000-07:00</published><updated>2009-10-14T11:47:39.874-07:00</updated><title type='text'>Greenplum announces column-oriented storage option</title><content type='html'>I checked Curt Monash’s blog today (as I do on a somewhat daily basis) and saw a &lt;a href="http://www.dbms2.com/2009/10/14/greenplum-hybrid-columnar/"&gt;new post announcing Greenplum’s new column-oriented storage option&lt;/a&gt;. In my opinion, this is pretty big news. 
I was amused to see that Curt, in his post, correctly predicted pretty much everything I was going to say, but nonetheless, I feel obligated to post my reactions to this news:&lt;br /&gt;&lt;br /&gt;Quick hit reactions:&lt;br /&gt;&lt;br /&gt;(1) Congrats to Greenplum and their strong technical team for adding column-oriented storage. They have essentially created a hybrid storage layer --- now you can store data in rows (write or read optimized) or in columns. I have previously written a long (and surprisingly popular --- 1,638 unique visits) &lt;a href="http://dbmsmusings.blogspot.com/2009/09/tour-through-hybrid-columnrow-oriented.html"&gt;blog post on hybrid column/row-oriented storage schemes&lt;/a&gt;. I would put the Greenplum scheme under the “Approach 3: fine-grained hybrids” classification. This puts them in the same category as the Vertica Flexstore scheme (more on this in a second).&lt;br /&gt;&lt;br /&gt;(2) I strongly agree with the &lt;a href="http://www.greenplum.com/news/248/231/Beyond-Rows-and-Columns-Greenplum-s-Polymorphic-Data-Storage----Part-1/d,blog/"&gt;Greenplum philosophy&lt;/a&gt; that religious wars of columns vs rows will not get you anywhere. The fact of the matter is that for some workloads, row-oriented storage can get you an order of magnitude performance improvement relative to column-stores, and for other workloads, column-oriented storage can get you an order of magnitude performance improvement relative to row-stores. Hybrid schemes can theoretically get you the best of both worlds, and that can be a big win.&lt;br /&gt;&lt;br /&gt;(3) For a few years now, I've lumped Greenplum and Aster Data together in my data warehouse market world view, even though Greenplum is slightly older and has more customers. Both are software-only solutions, and are big proponents of commodity hardware. 
Both partner with hardware vendors and will sell you bundled hardware with their software in a pre-optimized appliance if you prefer that to the software-only solution. Both started with vanilla PostgreSQL and turned it into a shared-nothing, MPP analytical DBMS. Both are players in the "big data" game, and heavily market their scalability. Both are west coast companies. Both support in-database MapReduce capabilities (they even announced this on the same day!). Both had research papers in the major DBMS research conferences this year. They both have embraced "the cloud". They even share some of the same customers (e.g. Fox Interactive Media).&lt;br /&gt;&lt;br /&gt;The word on the street is that the main difference between the two companies is that Greenplum has made significant modifications to the PostgreSQL code-base and the integration of the DBMS with the distribution (parallelization) layer is more "monolithic", whereas Aster Data has kept more of the original PostgreSQL code, and treats the database more like a black box. This announcement by Greenplum is further evidence that they are more than willing to edit the PostgreSQL code, as significant modifications to the storage layer were required.&lt;br /&gt;&lt;br /&gt;(4) Kudos to Greenplum for being open and honest about the limitations of their column-store option. It is a column-store at the storage layer only. This certainly is still a big win for queries that access few attributes from wide tables. However, before I had my own blog, I wrote a &lt;a href="http://databasecolumn.vertica.com/2008/12/debunking_yet_another_myth_col.html"&gt;guest article for Vertica’s blog&lt;/a&gt; explaining that column-oriented storage only gets you part of the way towards column-store performance --- writing a query executor optimized for column-oriented storage can get you an additional order of magnitude performance improvement. 
I explain this at length in two academic papers (see &lt;a href="http://cs-www.cs.yale.edu/homes/dna/papers/abadiicde2007.pdf"&gt;here&lt;/a&gt; and &lt;a href="http://cs-www.cs.yale.edu/homes/dna/papers/abadi-sigmod08.pdf"&gt;here&lt;/a&gt;). I think these papers coined the terms "early materialization" and "late materialization" for explaining whether columns are being put back together into tuples (rows) at the beginning or end of execution of a query plan. In retrospect, I regret using this term ("tuple construction" is a little more descriptive and easier to understand than "tuple materialization"). The fact that Greenplum uses the term “early materialization” to describe their column-oriented scheme is evidence that they’ve read these papers (more kudos!) and will start to implement the low-hanging fruit in the query executor to get increased performance improvement from their column-oriented storage. Hence, I expect that their column-oriented feature (while probably already useful now) will continue to get better in future releases. In the short-term, one should not expect performance approaching regular column-stores.&lt;br /&gt;&lt;br /&gt;(5) In my previous blog post on hybrid column/row-oriented storage schemes, I mentioned that it is far easier for a column-store to implement the “Approach 3: fine-grained hybrids” scheme than a row-store, since the row-store would have to make modifications at the storage, query executor, and query optimizer layers, while the column-store (that implements both early and late materialization) would only have to make modifications at the storage layer (since the query executor and optimizer are already capable of working with rows). This would lead me to believe that the Vertica FlexStore scheme is more immediately useful than the Greenplum hybrid storage scheme. 
But, as I have written before, my history with Vertica gives me some bias, so be careful what you do with that statement.&lt;br /&gt;&lt;br /&gt;(6) This has been a pretty big month for column-oriented storage layers. First Oracle announced their new column-oriented compression scheme (classified in the “Approach 1: PAX” category in &lt;a href="http://dbmsmusings.blogspot.com/2009/09/tour-through-hybrid-columnrow-oriented.html"&gt;my previous blog post&lt;/a&gt;) and now Greenplum adds column-oriented storage. This should put more pressure on the other vendors in the data warehousing space (especially Microsoft and IBM) to come up with some sort of column-oriented storage option in the near future.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8899645800948009496-8735716178552605933?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/8735716178552605933/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2009/10/greenplum-announces-column-oriented.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/8735716178552605933'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/8735716178552605933'/><link rel='alternate' type='text/html' href='http://dbmsmusings.blogspot.com/2009/10/greenplum-announces-column-oriented.html' title='Greenplum announces column-oriented storage option'/><author><name>Daniel Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-6896116857872555861</id><published>2009-09-13T20:36:00.000-07:00</published><updated>2009-09-15T06:17:33.208-07:00</updated><title type='text'>Kickfire’s approach to parallelism</title><content type='html'>&lt;div&gt; I was recently chatting with Raj Cherabuddi, founder of Kickfire, about Kickfire’s approach to parallelism, and I think that some of the problems they have to deal with in parallelizing queries are quite different from those of standard parallel database systems, and warrant talking about in a blog post.&lt;br /&gt;&lt;br /&gt;Parallel databases typically achieve parallelism via “data-partitioned parallelism”. The basic idea is that data is horizontally partitioned across processors, and each processor executes a query on its own local partition of the data. For example, let’s say a user wants to find out how many items were sold for over $50 on August 8th, 2009:&lt;br /&gt;&lt;br /&gt;SELECT count(*)&lt;br /&gt;FROM lineitem&lt;br /&gt;WHERE price &gt; 50 AND date = '08/08/09';&lt;br /&gt;&lt;br /&gt;If elements of the line item table are partitioned across different processors, then each processor will execute the complete query locally, computing a count of all tuples that pass the predicates on its partition of the data. These counts can then be combined in a “merge” final step into a global count. The vast majority of the query (everything except the very short final merge step) can be processed completely in parallel across processors.&lt;br /&gt;&lt;br /&gt;Let’s assume that there are no helpful indexes for this query. A naïve implementation would use the iterator model to implement this query on each processor. 
The query plan would consist of three operators: a table scan operator, a selection operator (for simplicity, let's assume it is separate from the scan operator), and an aggregation (count) operator. The aggregation operator would call “getNext” on its child operator (the selection operator), and the selection operator would in turn call getNext on its child operator (the scan operator). The scan would read the next tuple from the table and return control along with the tuple to the selection operator. The selection operator would then apply the predicate on the tuple. If the predicate passes, then the selection operator would return control, along with the tuple to the count operator which increments the running count. If the predicate fails, then instead of returning control to the count operator, the selection operator would instead call getNext again on its child (the scan operator) and apply the predicate to the next tuple (and keep on doing so until a tuple passes the predicate).&lt;br /&gt;&lt;br /&gt;It turns out that the iterator model is quite inefficient from an instruction cache and data cache perspective, since each operator runs for a very short time before yielding control to a parent or child operator (it just processes one tuple). Furthermore, there is significant function call overhead, as “getNext” is often called multiple times per tuple. Consequently modern systems will run each operator over batches of tuples (instead of on single tuples) to amortize initial cache miss and function call overheads over multiple tuples. Operator output is buffered, and a pointer to this buffer is sent to the parent operator when it is the parent operator’s turn to run.&lt;br /&gt;&lt;br /&gt;Whether the iterator model is used or the batched/staged model is used, there is only one operator running at once (per processor). Thus, the only form of parallelism is the aforementioned data-partitioned parallelism across processors. 
Even if a processor has multiple hardware contexts/cores, each processing resource will be devoted to processing the same one operator at a time (for cache performance reasons --- see e.g. &lt;a href="http://www.vldb2005.org/program/paper/tue/p49-zhou.pdf"&gt;this paper&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;Kickfire, along with other companies that perform database operations directly in the hardware using FPGA technology, like Netezza and XtremeData, needs to take a different approach to parallelism.&lt;br /&gt;&lt;br /&gt;Before discussing the effect on parallelism, let’s begin with a quick overview of FPGA (field programmable gate array) technology. At the lowest level, an FPGA contains an array of combinational logic, state registers, and memory that can be configured via a mesh of wires to implement desired logical functions. Nearly any complex algorithm, including database operations such as selection, projection, and join, can be implemented in an FPGA and, in doing so, can be run at the speed of the hardware. Not only is performance improved by running these algorithms in the hardware, but the chip can also be run at orders of magnitude lower clock frequencies, which results in commensurate gains in power efficiency. In many cases, operations that take hundreds to thousands of CPU instructions can be performed in a single clock cycle in FPGA logic.&lt;br /&gt;&lt;br /&gt;Kickfire therefore employs direct transistor-based processing engines in the FPGA to natively execute complete pipelines of relational database operations. 
The scale and density of the VLSI processes used in today’s FPGAs enable a large number (order of hundreds) of these custom operations to occur in parallel, enabling the use of parallel processing algorithms that can improve performance even further.&lt;br /&gt;&lt;br /&gt;The ability to have a large number of operations occurring in parallel means that the query processing engines do not need to switch back and forth between different operators (as described in the iterator and batched/staged schemes above). However, if you want to get the most out of the parallelism, data-partitioned parallelism can only get you so far.&lt;br /&gt;&lt;br /&gt;For example, if every processing unit is devoted to performing a selection operation on a different partition of the data, then the result of the selection operator will build up, eventually exceeding the size of on-chip and off-chip buffers, thereby starving the execution engines. Consequently, Kickfire implemented efficient pipelined parallelism in addition to data-partitioned parallelism, so that all operators in a query plan are running in the hardware at the same time, and data is being consumed at the approximate rate that it is being produced. Kickfire implemented advanced networking techniques in the areas of queuing and flow control to manage the data flow between the multiple producers and consumers, ensuring that the data, in most cases, stays on the chip, and only occasionally spills to the memory (off-chip buffers). Keeping the intermediate datasets live on the chip prevents memory latency and bandwidth from becoming a bottleneck.&lt;br /&gt;&lt;br /&gt;However, data-partitioned parallelism is still necessary since operators consume data at different rates. 
For example, if a selection with 50% selectivity (1 in 2 tuples pass the predicate) is followed by an aggregation operator (as in the example above), then one wants to spend approximately twice as much time doing selection as aggregation (since the aggregation operator will only have half as many tuples to process as the selection operator), so Kickfire will use data-partitioned parallelism to have twice as many selection operators as parent operators.&lt;br /&gt;&lt;br /&gt;For example, a hardcoded Kickfire query might look like the figure below (ignoring the column-store specific operators, which are a story for another day):&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;img id="BLOGGER_PHOTO_ID_5381162560068249954" style="margin: 0px auto 10px; display: block; width: 406px; height: 266px; text-align: center;" alt="" src="http://1.bp.blogspot.com/_muZF7G-aiz4/Sq261nd_6WI/AAAAAAAAAAk/aJjDOuxqbjE/s320/kickfirepic.jpg" border="0" /&gt;&lt;br /&gt;&lt;br /&gt;Note that the selection operations on T1 and T2, along with the join between these two tables, occur four times in the query plan (this is data-partitioned parallelism). Since the join has a reasonably low cardinality, there is no need to have the parent selection operator also appear four times in the query plan; rather it can appear twice, with each operator processing the results from two child join operators. Similarly, since the selection operators produce fewer outputs than inputs, the parent operator only needs to appear once. Data from one operator in the query plan is immediately shipped to the next operator for processing.&lt;br /&gt;&lt;br /&gt;Kickfire also claims to be able to devote hardware to running multiple queries at the same time (inter-query parallelism). Getting the right mix of data-partitioned parallelism, pipelined parallelism, and inter-query parallelism is a tricky endeavor, and is part of Kickfire’s “secret sauce”. 
Clearly, this requires some amount of knowledge about the expected cardinality of each operator, and the Kickfire software uses information from the catalog to help figure all of this out. One would expect this process to get more difficult for complex queries --- it will be interesting to see how Kickfire performs on complex workloads as they continue to gain customer traction (Raj makes a compelling case that, in fact, it is the most complex queries where FPGA technology can shine the brightest).&lt;br /&gt;&lt;br /&gt;In a nutshell, Kickfire uses column-oriented storage and execution to address I/O bottlenecks (column-oriented storage has been covered extensively elsewhere in my blog, but you can read about the specifics of Kickfire’s column-store &lt;a href="http://www.kickfire.com/blog/?p=392"&gt;on their blog&lt;/a&gt;), and an FPGA-based data-flow architecture to address processing and memory bottlenecks. Their “SQL chip” acts as a coprocessor and works in conjunction with the x86 processors (which run a SQL execution engine in software when needed, though this is usually the exception path) in their base server. 
By alleviating these three important bottlenecks, Kickfire is able to deliver high performance, yet still achieve tremendous power efficiency thanks to the low clock frequencies.&lt;br /&gt;&lt;br /&gt;Overall, although I have openly questioned Kickfire’s go-to-market strategy in past posts (see &lt;a href="http://dbmsmusings.blogspot.com/2009/06/is-betting-on-mysql-mass-market-for.html"&gt;here&lt;/a&gt; and &lt;a href="http://dbmsmusings.blogspot.com/2009/06/ceo-responds-to-my-post-on-kickfire.html"&gt;here&lt;/a&gt;), their non-technical departments seem a little disorganized at times (see &lt;a href="http://jeromepineau.blogspot.com/2009/07/adr-and-how-i-got-kicked-out-by.html"&gt;Jerome Pineau’s experience&lt;/a&gt;), and some highly visible employees are no longer with the company (notably Ravi Krishnamurthy, who presented their SIGMOD paper, and Justin Swanhart, who did a nice job explaining the Kickfire column-store features in the aforementioned write-up), I remain a fan of their technology. If they make it through the current difficult economic climate, it will be by virtue of their technology and the tireless work of people like Raj. 
As the rate of clock speed increases of commodity processors continues to slow down, being able to perform database operations in the hardware becomes an increasingly attractive proposition.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8899645800948009496-6896116857872555861?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/6896116857872555861/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2009/09/kickfires-approach-to-parallelism.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/6896116857872555861'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/6896116857872555861'/><link rel='alternate' type='text/html' href='http://dbmsmusings.blogspot.com/2009/09/kickfires-approach-to-parallelism.html' title='Kickfire’s approach to parallelism'/><author><name>Daniel Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_muZF7G-aiz4/Sq261nd_6WI/AAAAAAAAAAk/aJjDOuxqbjE/s72-c/kickfirepic.jpg' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-6068952524060677104</id><published>2009-09-03T14:54:00.000-07:00</published><updated>2009-09-05T17:39:43.548-07:00</updated><title type='text'>A tour through hybrid column/row-oriented DBMS schemes</title><content type='html'>There has been a lot of talk recently about 
hybrid column-store/row-store database systems. This is likely due to many announcements along these lines in the past month, such as Vertica’s recent 3.5 release which contained &lt;a href="http://www.vertica.com/company/news/vertica-analytic-database-broadens-reach-with-flexstore"&gt;FlexStore&lt;/a&gt;, Oracle’s &lt;a href="http://blog.tanelpoder.com/2009/09/01/oracle-11gr2-has-been-released-and-with-column-oriented-storage-option/"&gt;recent revelation&lt;/a&gt; that Oracle Database 11g Release 2 uses column-oriented storage for the purposes of superior compression, and VectorWise’s &lt;a href="http://dbmsmusings.blogspot.com/2009/07/watch-out-for-vectorwise.html"&gt;recent decloaking&lt;/a&gt; that also announced an optional hybrid storage layer. Furthermore, analysts like Curt Monash and Philip Howard are predicting further development in this space (see &lt;a href="http://www.dbms2.com/2009/08/04/pax-analytica-row-and-column-stores-begin-to-come-together/"&gt;here&lt;/a&gt; and &lt;a href="http://www.it-director.com/technology/data_mgmt/content.php?cid=11453"&gt;here&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;It’s surprising to me that it has taken this long before we started seeing commercially available implementations of hybrid systems. The research community has been publishing papers on hybrid systems for decades, with straightforward proposals that could easily be implemented in commercial systems starting to be published 8 years ago.&lt;br /&gt;&lt;br /&gt;Different approaches to building hybrid systems yield very different performance properties and solve different sets of problems. Thus, as more hybrid systems become commercially available, and as more companies consider developing their own hybrid system, it is important that people understand the different tradeoffs involved between the various hybrid schemes. 
My goal in this post is to educate people who are not part of the SIGMOD/VLDB research community about three well-known approaches to building hybrid systems taken by different research efforts, and give pointers to research papers where the reader can find out more detail. Each approach has its own set of advantages and disadvantages and I will try to list both sides of the tradeoff in each case. The goal of this post is not to say that one type of hybrid scheme is better than another --- each scheme can be a good fit in the right situation.&lt;br /&gt;&lt;br /&gt;&lt;span style="FONT-WEIGHT: bold"&gt;Approach 1: PAX&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The PAX scheme was published in 2001 in the VLDB paper “Weaving Relations for Cache Performance” by Natassa Ailamaki, David DeWitt, Mark Hill, and Marios Skounakis. The basic idea is the following: instead of storing data row-by-row &lt;span style="FONT-WEIGHT: bold"&gt;within a disk block&lt;/span&gt;, store it column-by-column. This is different from a “pure” column store, which stores each column in entirely separate disk blocks. The key difference is that if you had a table with 10 attributes, then in a “pure” column store, data from each original tuple is spread across 10 different disk blocks, whereas in PAX, all data for each tuple can be found in a single disk block. Since a disk block is the minimum granularity with which data can be read off of disk, in PAX, even if a query accesses only 1 out of the 10 columns, it is impossible to read only this single column off of disk, since each disk block contains data for all 10 attributes of the table.&lt;br /&gt;&lt;br /&gt;To understand why this is a good idea, some context is necessary. At the time the paper was written, column-stores (called the “DSM model” in the paper) had made very limited impact on the commercial landscape (there was Sybase IQ, but it was very much a niche product at the time). 
It was widely believed that the reason why the DSM model had failed to take off was due to the high tuple reconstruction costs.&lt;br /&gt;&lt;br /&gt;Let’s say a query accessed three out of ten columns from a table and required some operator (like an aggregation) that required each of these three columns to be scanned fully. A row-store would have to do a complete table scan, wasting some I/O reading the 7 irrelevant columns in addition to the 3 relevant ones. But at least it would read the whole table sequentially. The column-store would only have to read the 3 relevant columns, but would have to seek back and forth between the 3 columns, doing tuple reconstruction. In 2001, servers had nowhere near the memory capacities they have today, so extensive prefetching was not an option (i.e., instead of reading one block from column 1 and then one block from column 2 and then one block from column 3 and then the next block from column 1, etc., prefetching allows you to read n blocks from column 1 and then n blocks from column 2, etc, allowing the seek cost to be amortized over a large amount of sequential reads, but you need enough memory to keep n blocks from each column in memory at once). Given how expensive seek costs are relative to sequential access, it is no accident column-stores didn’t take off until system memories increased to recent levels to allow for significant prefetching. (Research on late materialization to delay tuple reconstruction until the end of the query when fewer tuples need to be materialized also helped).&lt;br /&gt;&lt;br /&gt;Anyway, PAX was able to achieve the CPU efficiency of column-stores while maintaining the disk I/O properties of row-stores. For those without detailed knowledge of column-stores, this might seem strange: the way most column-stores pitch their products is by accentuating the disk I/O efficiency advantage (you only need to read in from disk exactly what attributes are accessed by a particular query). 
Why would a column-store want disk access patterns equivalent to those of a row-store? Well, it turns out column-stores have an oft-overlooked significant CPU efficiency advantage as well. The aspect of CPU efficiency that the PAX paper examined was cache hit ratio and memory bandwidth requirements. It turns out that having column data stored sequentially within a block allows cache lines to contain data from just one column. Since most DBMS operators only operate on one or two columns at a time, the cache is filled with relevant data for that operation, thereby reducing CPU inefficiency due to cache misses. Furthermore, only relevant columns for any particular operation need to be shipped from memory.&lt;br /&gt;&lt;br /&gt;Bottom line:&lt;br /&gt;&lt;br /&gt;&lt;span style="FONT-STYLE: italic"&gt;Advantages of PAX:&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Yields the majority of CPU efficiency of column-stores&lt;/li&gt;&lt;li&gt;Allows for column-oriented compression schemes, which can improve compression ratio due to increased data locality (data from the same attribute domain is stored contiguously). This can improve performance since the more data can be compressed, the less time needs to be spent reading it in from storage.&lt;/li&gt;&lt;li&gt;Is theoretically fairly easy to implement in a row-store system to get some of the column-store advantages. I say “theoretically” because no commercial row-store system actually did this (to the best of my knowledge) until Oracle 11gR2.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;span style="FONT-STYLE: italic"&gt;Disadvantages of PAX&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Equivalent I/O properties to row-stores (not counting compression) in the sense that irrelevant columns still have to be read from storage in the same blocks as the needed columns for any particular query. In 2001 this was an advantage. Today, for typical analytical workloads, this is a significant disadvantage. 
(For less scan-oriented workloads, such as tuple lookups and needle-in-the-haystack queries, this remains an advantage).&lt;/li&gt;&lt;li&gt;The cache prefetching features on modern processors render some of the cache efficiency of PAX obsolete (PAX no longer makes a large difference on cache hit ratio). However, it still reduces the demands on memory bandwidth, and other CPU advantages of column-stores, such as &lt;a href="http://dbmsmusings.blogspot.com/2009/07/watch-out-for-vectorwise.html"&gt;vectorized processing&lt;/a&gt;, remain possible to achieve in PAX.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;span style="FONT-WEIGHT: bold"&gt;Approach 2: Fractured Mirrors&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This scheme was published in 2002 in the VLDB paper “A Case for Fractured Mirrors” by Ravi Ramamurthy, David DeWitt, and Qi Su. (Yes, you read that right. The University of Wisconsin DBMS group led by DeWitt authored both seminal papers on hybrid column-store/row-store systems.) The approach is essentially the following: you’re going to replicate all of your data anyway for high availability and/or disaster recovery. Why not have different storage layouts in each replica? That way, if you have a tuple-lookup or a needle-in-a-haystack query, you send it to the row-store replica. If the query is more scan-oriented (e.g. an aggregation or summarization query), you send it to the column-store replica. The implementation in the paper is a little more complicated (to avoid skew, each node contains part of the column-store and part of the row-store), but the above description is the basic idea.&lt;br /&gt;&lt;br /&gt;Most people agree (I think) that row-stores are more than an order of magnitude faster than column-stores for tuple-lookup queries, and column-stores are (at least) more than an order of magnitude faster than row-stores for scan-oriented queries. 
Here is how one comes to this conclusion: To look up a tuple in a row-store, one needs to read in just the one block that contains the tuple (let’s assume all relevant indexes for both the row-store and the column-store are in memory). In a column-store, one block needs to be read for each attribute. Assuming there are more than 10 attributes in the table, this is already more than an order of magnitude. On the other hand, for scan queries, if a query accesses less than 10% of the attributes of a table (the common case), column-stores get one order of magnitude improvement relative to row-stores immediately (for disk efficiency). Additionally, many have argued (see &lt;a href="http://cs-www.cs.yale.edu/homes/dna/papers/abadi-sigmod08.pdf"&gt;here&lt;/a&gt; and &lt;a href="http://www-db.cs.wisc.edu/cidr/cidr2005/papers/P19.pdf"&gt;here&lt;/a&gt;) that column-stores get an additional order of magnitude improvement for CPU efficiency.&lt;br /&gt;&lt;br /&gt;If you buy the above argument, then it is critical to use a scheme like fractured mirrors for mixed workloads. Given how often people talk about mixed workloads as being a key problem in enterprise data warehouses, it is surprising how long it has taken to see a commercial implementation along the lines written about in the research paper.&lt;br /&gt;&lt;br /&gt;&lt;span style="FONT-STYLE: italic"&gt;Advantages of Fractured Mirrors:&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Every individual query gets sent to the optimal storage for that query. Performance is thus an order of magnitude better than either a pure row-store or a pure column-store on queries that are problematic for that type of store.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;span style="FONT-STYLE: italic"&gt;Disadvantages of Fractured Mirrors:&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;All data needs to be replicated. Obviously, in most cases you’re going to replicate the data anyway. 
But if you are already using the replication for something else (e.g. storing the data in a different sort order), then you either need to increase the replication factor or remove some of the additional sort orders in order to implement fractured mirrors.&lt;/li&gt;&lt;li&gt;If you really want to get the most out of this approach, you need to extend what is proposed in the research paper and have complete implementations of row-store and column-store systems (since column-stores have very different query execution engines and query optimizers than row-stores). This is obviously a lot of code, and precludes most companies from using the fractured mirrors approach. I am flummoxed as to why the only company with legitimate row-store and column-store DBMS products (Sybase with ASE and IQ) hasn’t implemented the fractured mirrors approach yet.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;span style="FONT-WEIGHT: bold"&gt;Approach 3: Fine-grained hybrids&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The &lt;a href="http://pages.cs.wisc.edu/~jignesh/publ/dmg.pdf"&gt;VLDB 2003 paper&lt;/a&gt; by Hankins and Patel and the &lt;a href="http://www-db.cs.wisc.edu/cidr/cidr2009/Paper_97.pdf"&gt;CIDR 2009 paper&lt;/a&gt; by Cudre-Mauroux, Wu, and Madden are examples of this approach. Here, individual tables can be divided into both row and column-oriented storage. For example, if two columns are often accessed together (e.g. order-date and ship-date), they can be stored together (in rows). The remaining columns can be stored separately. This can be done within a disk block, within a table, or even at a slightly larger grain across tables. For example, if one table is often accessed via tuple lookups (e.g. a customer table), then it can be stored in rows; while other tables that are usually scanned (e.g. 
lineitem) can be stored in columns.&lt;br /&gt;&lt;br /&gt;&lt;span style="FONT-STYLE: italic"&gt;Advantages of fine-grained hybrids&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;If correct decisions are made about what attributes (and/or tables) should be stored in rows, and what attributes should be stored in columns, then you can get all of the performance advantages of fractured mirrors without the additional replication disadvantage. &lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;span style="FONT-STYLE: italic"&gt;Disadvantages of fine-grained hybrids&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Depending on the DBMS you start with, it can be complex to implement. You essentially have to have both row- and column-oriented execution engines, the optimizer has to know about the differences between row-storage and column-storage, and indexing code has to be updated appropriately. It turns out that it is much easier to implement in a column-store that already supports early materialization (early materialization refers to the ability to reconstruct rows from column-storage early in a query plan) than in other types of systems. This is because early materialization requires query operators to be able to handle input in both column- and row-oriented format (it will be in column format if tuples haven’t been materialized yet, otherwise it will be in row-oriented format). Hence, the execution engine and optimizer already have knowledge about the difference between rows and columns and can act appropriately.&lt;/li&gt;&lt;li&gt;It requires some knowledge about a query workload. 
Otherwise incorrect decisions about tuple layout will be made, leading to suboptimal performance.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="FONT-WEIGHT: bold"&gt;Commercial availability&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Oracle, Vertica, and VectorWise have announced hybrid systems using one of these schemes (I have no inside knowledge about any of these implementations, and only know what’s been published publicly in all cases). It appears that Oracle (see &lt;a href="http://kevinclosson.wordpress.com/2009/09/01/oracle-switches-to-columnar-store-technology-with-oracle-database-11g-release-2/"&gt;Kevin Closson’s helpful responses&lt;/a&gt; to my questions in a comment thread to one of his blog posts) and VectorWise use the first approach, PAX. Vertica uses fine-grained hybrids (approach 3), though they probably could use their row-oriented storage scheme to implement fractured mirrors (approach 2) as well, if they so desire. Given that two out of the three authors of the fractured mirrors paper have been reunited at Microsoft, I would not be surprised if Microsoft were to eventually implement a fractured mirrors hybrid scheme.&lt;br /&gt;&lt;br /&gt;&lt;span style="FONT-WEIGHT: bold"&gt;Conclusion&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;There was nearly a decade of delay, but at long last we’re starting to see hybrid row/column-stores hit the marketplace. Row-stores and column-stores have very different performance characteristics, and they tend to struggle where the alternative excels. 
As workloads get more complex, hybrid systems will increase in importance.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8899645800948009496-6068952524060677104?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/6068952524060677104/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2009/09/tour-through-hybrid-columnrow-oriented.html#comment-form' title='11 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/6068952524060677104'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/6068952524060677104'/><link rel='alternate' type='text/html' href='http://dbmsmusings.blogspot.com/2009/09/tour-through-hybrid-columnrow-oriented.html' title='A tour through hybrid column/row-oriented DBMS schemes'/><author><name>Daniel Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>11</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-6569571668704888349</id><published>2009-08-19T09:45:00.000-07:00</published><updated>2009-08-19T09:45:00.241-07:00</updated><title type='text'>Netezza TwinFin: A step towards a potential acquisition?</title><content type='html'>Ever since Netezza became a public company, every once in a while someone tries to start a rumor that Netezza is on the verge of being acquired (likely started by people who want a quick return on their Netezza stock buy). 
These rumors usually involve a company like Oracle buying Netezza, which never made a lot of sense to me, since Oracle has their own DBMS product and has very little reason to buy a much smaller competitor like Netezza and maintain two lines of code that target the same market. This is why it wasn’t surprising that Microsoft chose to acquire DATAllegro instead of Netezza, even though Netezza was much farther along than DATAllegro and had a larger customer base. DATAllegro essentially left the DBMS engine intact, with its key technological assets sitting on top of the DBMS, turning many single-node Ingres instances into a large, shared-nothing, MPP DBMS. Since DATAllegro used a nice, modular architecture, Microsoft was able to replace Ingres with SQL Server, and use DATAllegro’s technology to turn SQL Server into an MPP DBMS without significant modifications to the core SQL Server DBMS engine (see Microsoft’s Project Madison).&lt;br /&gt;&lt;br /&gt;But two events now make me wonder if Netezza might actually end up being acquired by a vendor that currently sells a competing DBMS product (likely either IBM with DB2 or HP with NeoView).&lt;br /&gt;&lt;br /&gt;First, there was the release of the Oracle Database Machine. Oracle openly admits that the Oracle Database Machine frequently gets a factor of between 10 and 70 performance improvement relative to previous Oracle offerings (i.e. Oracle RAC) on scan-heavy analytical workloads. But the center of the Oracle Database Machine is …. Oracle RAC! So how does it get the order of magnitude performance improvement relative to RAC? By connecting RAC (using Infiniband) to a shared-nothing storage layer (Exadata) that can perform database scans at extremely high speeds and do some basic database operations like tuple selection and projection. 
Since scan-oriented queries are limited by the speed with which the scan can occur, simply connecting RAC to a storage layer that can do scans really well yields significant improvement.&lt;br /&gt;&lt;br /&gt;Perhaps Netezza’s greatest asset is its ability to achieve high performance on table scans. By using FPGAs to perform decompression, selection, and projection as data is read off of disk, Netezza is able to perform scans faster than what competitors (at least row-store competitors) can do on commodity hardware. If the Oracle Database Machine is successful (Larry Ellison said at a &lt;a href="http://seekingalpha.com/article/144939-oracle-corporation-f4q09-qtr-end-05-31-09-earnings-call-transcript?page=-1"&gt;recent earnings call&lt;/a&gt; that it "is shaping up to be our most exciting and successful new product introduction in Oracle’s 30 year history"), I would expect its competitors to follow suit --- and connect their DBMS engines to a high performance storage layer the way Oracle did with Exadata.&lt;br /&gt;&lt;br /&gt;Second, Netezza’s recent move to re-architect their appliance via TwinFin (announced a few weeks ago) is a clear embrace of commodity hardware components. Before this redesign, Netezza was a monolithic appliance. As &lt;a href="http://www.computerweekly.com/Articles/2009/08/07/237248/nyse-euronext-to-support-data-growth-with-netezza-upgrade.htm"&gt;detailed by ComputerWeekly&lt;/a&gt;, if you wanted to upgrade storage or processing capacity, you had to wait for the next Netezza release and replace the whole appliance with Netezza’s next generation. Now, the core part of the Netezza technology can be placed in the “sidecar” expansion slot in the standard IBM BladeServer family of servers. 
This allows customers to upgrade the IBM blades independently of the Netezza technology.&lt;br /&gt;&lt;br /&gt;Looking at it a different way: the technology behind Netezza’s stellar scan performance can now be found in a nice modular component, the “DB Accelerator” card, that can be placed in standard expansion slots in blade servers. The move towards a more modular architecture is reminiscent of the DATAllegro architecture that allowed Microsoft to replace Linux with Windows and Ingres with SQL Server and keep the majority of the rest of the DATAllegro technology. DATAllegro was sold for $275 million to Microsoft when it only had 3-4 customers.&lt;br /&gt;&lt;br /&gt;Netezza’s market cap is currently $550 million and it has orders of magnitude more customers than DATAllegro did (and is currently profitable). Hence it seems like a prime candidate for an acquisition. Its recent architectural redesign allows it to be acquired even by a company with a competing data warehouse product, since its core technology can be used in the storage layer as a drop-in accelerator for table scans, much the way Oracle uses Exadata. IBM seems like a natural fit given their close partnership on TwinFin. Otherwise HP seems like an option since NeoView seems like it is having trouble getting off of the ground. 
Time will tell, but I will no longer ignore Netezza acquisition rumors the way I once did.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8899645800948009496-6569571668704888349?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/6569571668704888349/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2009/08/netezza-twinfin-step-towards-potential.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/6569571668704888349'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/6569571668704888349'/><link rel='alternate' type='text/html' href='http://dbmsmusings.blogspot.com/2009/08/netezza-twinfin-step-towards-potential.html' title='Netezza TwinFin: A step towards a potential acquisition?'/><author><name>Daniel Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-4109368587338973423</id><published>2009-08-04T21:18:00.000-07:00</published><updated>2009-08-04T21:40:40.995-07:00</updated><title type='text'>Netezza's competitors open fire</title><content type='html'>It's fairly unusual to see a company openly attack a competitor in a public forum. I don't know the exact reason, but I presume it has something to do with the old dictum "there's no such thing as bad publicity", so giving a competitor free publicity of any kind (even the negative sort) is deemed a bad idea. 
Or maybe it's because the attack might backfire --- if the attack is not made using solid reasoning, the company might come off looking foolish. Or maybe it's because attacking a competitor is, in a way, a tacit validation of their position as an equal --- small, insignificant companies would be ignored; an attack is acknowledging that the two companies see each other often in competitive situations, and might encourage a potential customer to consider the competitor in the same POC when they might have not done so otherwise.&lt;br /&gt;&lt;br /&gt;This is why the back and forth between Netezza and its competitors has been so jarring. By my estimation, it started when Larry Ellison positioned the new Oracle Exadata release back in September 2008 against Netezza, questioning Netezza's fault tolerance, DW functionality, and DBMS know-how (&lt;a href="http://www.rittmanmead.com/2008/09/24/live-blogging-from-the-larry-ellison-keynote-oow-2008/"&gt;http://www.rittmanmead.com/2008/09/24/live-blogging-from-the-larry-ellison-keynote-oow-2008/&lt;/a&gt;). This was then responded to by Netezza, who became absolutely obsessed with Oracle Exadata, attacking them in multiple postings on their Data Liberators blog (see &lt;a href="http://www.dataliberators.com/do-i-sense-fear-of-innovation"&gt;here&lt;/a&gt;, &lt;a href="http://www.dataliberators.com/be-all-that-you-can-be-but-can-you-be-everything-to-everybody"&gt;here&lt;/a&gt;, &lt;a href="http://www.dataliberators.com/giving-some-ink-to-teradata"&gt;here&lt;/a&gt;, and &lt;a href="http://www.dataliberators.com/oracle-sun-what-about-hp"&gt;here&lt;/a&gt;) even resorting to name calling ("Oracle Exaggerdata").&lt;br /&gt;&lt;br /&gt;In the last few days, the attacks on Netezza have increased in intensity. 
First, I came across an &lt;a href="http://blogs.oracle.com/datawarehousing/2009/08/a_not_so_fabulous_new_release_1.html"&gt;Oracle blog post&lt;/a&gt; that basically claimed that Oracle Exadata is better than Netezza along every possible dimension: storage, CPU power, memory, interconnect, load performance, query performance, and architecture.&lt;br /&gt;&lt;br /&gt;Then, I came across an &lt;a href="http://www.asterdata.com/blog/index.php/2009/08/03/netezzas-change-in-architecture-move-towards-commodity/"&gt;Aster Data blog post&lt;/a&gt; which made wild claims, such as that Netezza's new release is an indication that Netezza regrets building their DBMS around FPGAs and is now desperately trying to abandon ship and switch to mainstream CPUs.&lt;br /&gt;&lt;br /&gt;Be careful reading both of these attacks. I find them both full of FUD and disagree with much of the premise of both of them. Oracle's blog post omits comparing Netezza and Oracle along perhaps the two most important dimensions: price/performance and total cost of ownership. Oracle brags that their database machine uses a 20Gb infiniband interconnect while Netezza only uses 1Gb ethernet. But presumably the price of the expensive interconnect gets passed on to the customer --- it could easily be argued that Netezza's use of 1Gb ethernet is an indication that their architecture might be superior --- Oracle needs infiniband to connect the storage and computation layers of their system; Netezza's ability to push computation to the data allows them to avoid having to include the high cost interconnect. I would guess that Netezza's price/performance is significantly superior to Oracle's, but trying to calculate the price of Oracle's database machine is far too complicated to put some meat behind this statement. 
Furthermore, Netezza's superior total cost of ownership relative to Oracle is common knowledge; I would be surprised to see someone argue otherwise.&lt;br /&gt;&lt;br /&gt;I also find the Aster Data post full of FUD. Claiming that Netezza is trying to abandon their FPGA approach is ridiculous (in my opinion). They have invested a huge amount into doing decompression, projections, selections, and other DBMS operations inside the FPGA, and there are performance advantages in doing so. The redesign of their architecture was necessary to be able to improve caching of data in memory (to improve repeated scans of the same table) and to add more commodity components to their system, allowing them to take better advantage of upgrades to the disk and CPU technology they incorporate.&lt;br /&gt;&lt;br /&gt;Though the Oracle and Aster Data reactions are misleading, I'm still a little worried about Netezza. Like everyone else, I was looking forward to their "big" announcement at TDWI, and was disappointed when I found out what it was. Sure, the internal architectural redesign is big news to Netezza internally, but ultimately, all it means to the customers is that Netezza can now do some things that its competitors already can do. Sure, lower prices and a better ability to handle mixed workloads are nice, but I was expecting something a little more radical. 
I guess the lesson to be learned is that it is never a good idea to prepare people for a big announcement --- it just leaves lots of potential for disappointment.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8899645800948009496-4109368587338973423?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/4109368587338973423/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2009/08/netezzas-competitors-open-fire.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/4109368587338973423'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/4109368587338973423'/><link rel='alternate' type='text/html' href='http://dbmsmusings.blogspot.com/2009/08/netezzas-competitors-open-fire.html' title='Netezza&apos;s competitors open fire'/><author><name>Daniel Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-7429587609529266907</id><published>2009-07-29T05:00:00.000-07:00</published><updated>2009-07-29T05:00:03.721-07:00</updated><title type='text'>Watch out for VectorWise</title><content type='html'>&lt;p&gt;In the last few years, there have been so many new analytical DBMS startups de-cloaking that it’s difficult to keep track of them all. 
Just off the top of my head, I would put Aster Data, DATAllegro, Dataupia, Exasol, Greenplum, Infobright, Kickfire, ParAccel, Vertica, and XtremeData in that category. Once you add the new analytical DBMS products from established vendors (Oracle Exadata and HP Neoview) we’re at a dozen new analytical DBMS options to go along with older analytical DBMS products like Teradata (industry leader), Netezza, and Sybase IQ. Finally, with the free and open source &lt;a href="http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-longer.html"&gt;HadoopDB&lt;/a&gt;, we now have (at least) sixteen analytical DBMS solutions to choose from.&lt;br /&gt;&lt;br /&gt;Given the current overwhelming number of analytical DBMS solutions, I suspect that VectorWise’s sneak preview happening this week is not going to get the attention that it deserves. VectorWise isn’t making a splash with a flashy customer win (like Aster Data did with MySpace), or a TPC-H benchmark win (like Kickfire and ParAccel did), or an endorsement from a DBMS legend (like Vertica did). They’re not even going to market with their own solution; rather, they’re teaming up with Ingres for a combined solution (though the entire Ingres storage-layer and execution engine has been ripped out and replaced with VectorWise). But I’m telling you: VectorWise is a company to watch.&lt;br /&gt;&lt;br /&gt;Here are the reasons why I like them:&lt;br /&gt;&lt;ol&gt;&lt;br /&gt;&lt;li&gt;They are a column-store. I strongly believe that column-stores are the right solution for the analytical DBMS market space. They can get great compression ratios with lightweight compression algorithms, and are highly I/O efficient. In my opinion, the only reason why there are companies on the above list that are not column-stores is that they wanted to accelerate time to market by extending previously existing DBMS code, and the most readily available DBMS code at the time was a row-store. 
Any DBMS built from scratch for the (relational, structured data) analytical DBMS market should be a column-store.&lt;br /&gt;&lt;br /&gt;&lt;li&gt;Column-stores are so I/O efficient that CPU and/or memory usually become bottlenecks very quickly. Most column-stores do very careful optimizations to eliminate these bottlenecks. But to me, VectorWise has gone the extra mile. The query operators are run via a set of query execution primitives written in low-level code that allow compilers to produce extremely efficient processing instructions. Vectors of 100-1000 values within a column get pipelined through a set of query operations on that column, with many values typically being processed in parallel by SIMD (single instruction, multiple data) capabilities of modern CPU chips. Most database systems are unable to take advantage of SIMD CPU capabilities --- the tuple-at-a-time (iterator) processing model of most database systems is just too hard for compilers to translate to SIMD instructions. VectorWise has gone to great lengths to make sure their code results in vectorized CPU processing. Their execution primitives are also written to allow CPUs to do efficient out-of-order instruction execution via loop-pipelining (although compilers are supposed to discover opportunities for loop-pipelining on their own, without carefully written code, this doesn’t happen in practice as often as it should). So with highly optimized CPU-efficient code, along with (1) operator pipelining to keep the active dataset in the cache and (2) column-oriented execution reducing the amount of data that must be shipped from memory to the CPU, VectorWise reduces the CPU and memory bottlenecks in a major way. The bottom line is that VectorWise is disk efficient AND memory efficient AND CPU efficient. This gets you the total performance package.&lt;br /&gt;&lt;li&gt;Their founders include Peter Boncz and Marcin Zukowski from CWI. 
I generally think highly of the DBMS research group at CWI, except for one of their papers which ... actually ... maybe it’s better if I don’t finish this sentence. I have spoken highly about them in previous posts on my blog (see &lt;a href="http://dbmsmusings.blogspot.com/2009/06/what-is-right-way-to-measure-scale.html"&gt;here &lt;/a&gt;and &lt;a href="http://dbmsmusings.blogspot.com/2009/07/paraccel-and-their-puzzling-tpc-h.html"&gt;here&lt;/a&gt;).&lt;br /&gt;&lt;li&gt;It looks likely that their solution will be released open source. I was unable to get a definite commitment from Boncz or Zukowski one way or another, but the general sense I got was that an open source release was likely. But please don’t quote me on that.&lt;br /&gt;&lt;li&gt;If the VectorWise/Ingres solution does get released open source, I believe they will be an excellent column-store &lt;a href="http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-longer.html"&gt;storage engine for HadoopDB&lt;/a&gt;. I have already requested an academic preview edition of their software to play with. &lt;/li&gt;&lt;/ol&gt;&lt;p&gt;In the interest of full disclosure, here are a few limitations of VectorWise &lt;/p&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;It is not a shared-nothing, MPP DBMS. It runs on a single machine. This limits its scalability to low numbers of terabytes. However, VectorWise is targeting the same “mass market” that Kickfire is, where the vast majority of data warehouses are less than 10TB. Furthermore, as mentioned above, it is a great candidate to be turned into a shared-nothing, parallel DBMS via the HadoopDB technology, and I look forward to investigating this opportunity further.&lt;br /&gt;&lt;li&gt;In my experience, having large amounts of low-level, CPU optimized code is hard to maintain over time, and might limit how nimble VectorWise can be to take advantage of new opportunities. 
Portability might also become a concern (in the sense that not all optimizations will work equally well on all CPUs). However, I would not put anything past such a high quality technical team.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Two final notes:&lt;br /&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;I like their go-to-market strategy. Like Infobright and Kickfire, they are after the low-priced, high volume analytical DBMS mass market. But the problem with the mass market is that you need a large global sales and support team to handle the many opportunities and customers. Startups that target the high-end have it much easier in that they can get through the early stages of the company with a few high-priced customers and don’t need to invest as much in sales and support. By getting into bed with Ingres, VectorWise gets to immediately take advantage of the global reach of Ingres, a key asset if they want to target the lower end of the market.&lt;br /&gt;&lt;li&gt;CWI is also the creator of the open source MonetDB column-store. VectorWise is a completely separate codeline, and makes several philosophical departures from MonetDB. According to VectorWise, MonetDB’s materialization of large amounts of intermediate data (e.g., from running operators to completion) makes it less scalable (more suited for in-memory data sets) than VectorWise. VectorWise has superior pipelined parallelism and vectorized execution. 
I have not checked with the MonetDB group to see if they dispute these claims, but my knowledge from reading the MonetDB research papers is generally in line with these statements, and my understanding is that the MonetDB and VectorWise teams remain friendly.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;p&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8899645800948009496-7429587609529266907?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/7429587609529266907/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2009/07/watch-out-for-vectorwise.html#comment-form' title='12 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/7429587609529266907'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/7429587609529266907'/><link rel='alternate' type='text/html' href='http://dbmsmusings.blogspot.com/2009/07/watch-out-for-vectorwise.html' title='Watch out for VectorWise'/><author><name>Daniel Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>12</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-1094148618368198949</id><published>2009-07-20T08:04:00.000-07:00</published><updated>2009-07-20T10:28:26.570-07:00</updated><title type='text'>Announcing release of HadoopDB (longer version)</title><content type='html'>If you have a short attention span see the &lt;a 
href="http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-shorter.html"&gt;shorter blog post&lt;/a&gt;.&lt;br /&gt;If you have a large attention span, see the complete &lt;a href="http://db.cs.yale.edu/hadoopdb/hadoopdb.pdf"&gt;12-page paper&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;There are two undeniable trends in analytical data management. First, the amount of data that needs to be stored and processed is exploding. This is partly due to the increased automation with which data can be produced (more business processes are becoming digitized), the proliferation of sensors and data-producing devices, Web-scale interactions with customers, and government compliance demands along with strategic corporate initiatives requiring more historical data to be kept online for analysis. It is no longer uncommon to hear of companies claiming to load more than a terabyte of structured data per day into their analytical database system and claiming data warehouses of size more than a petabyte (see &lt;a href="http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/"&gt;the end of this write-up&lt;/a&gt; for some links to large data warehouses).&lt;br /&gt;&lt;br /&gt;The second trend is what I talked about in &lt;a href="http://dbmsmusings.blogspot.com/2009/07/sybase-iq-throws-its-hat-into-in-dbms.html"&gt;my last blog post&lt;/a&gt;: the increased desire to perform more and more complex analytics and data mining inside of the DBMS.&lt;br /&gt;&lt;br /&gt;I predict that the combination of these two trends will lead to a scalability crisis for the parallel database system industry. This prediction flies in the face of conventional wisdom. If you talk to prominent DBMS researchers, they'll tell you that shared-nothing parallel database systems horizontally scale indefinitely, with near linear scalability. If you talk to a vendor of a shared-nothing MPP DBMS, such as Teradata, Aster Data, Greenplum, ParAccel, and Vertica, they'll tell you the same thing. 
Unfortunately, they're all wrong. (Well, sort of.)&lt;br /&gt;&lt;br /&gt;Parallel database systems scale really well into the tens and even low hundreds of machines. Until recently, this was sufficient for the vast majority of analytical database applications. Even the &lt;a href="http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/"&gt;enormous eBay 6.5 petabyte database&lt;/a&gt; (the biggest data warehouse I've seen written about) was implemented on (only) a 96-node Greenplum DBMS. But as &lt;a href="http://dbmsmusings.blogspot.com/2009/06/what-is-right-way-to-measure-scale.html"&gt;I wrote about previously&lt;/a&gt;, this implementation allows for only a handful of CPU cycles to be spent processing tuples as they are read off disk. As the second trend kicks in, resulting in an increased amount and complexity of data analysis that is performed inside the DBMS, this architecture will be entirely unsuitable, and will be replaced with many more compute nodes with a much larger horizontal scale. Once you add the fact that many argue that it is far more efficient from a hardware cost and power utilization perspective to run an application on many low-cost, low-power machines instead of fewer high-cost, high-power machines (see e.g., &lt;a href="http://perspectives.mvdirona.com/2009/01/15/TheCaseForLowCostLowPowerServers.aspx"&gt;the work by James Hamilton&lt;/a&gt;), it will not be at all uncommon to see data warehouse deployments on many thousands of machines (real or virtual) in the future.&lt;br /&gt;&lt;br /&gt;Unfortunately, parallel database systems, as they are implemented today, do not scale well into the realm of many thousands of nodes. There are a variety of reasons for this. First, they all compete with each other on performance. The marketing literature of MPP database systems is littered with wild claims of jaw-dropping performance relative to their competitors. 
These systems will also implement some amount of fault tolerance, but as soon as performance becomes a tradeoff with fault tolerance (e.g. by implementing frequent mid-query checkpointing) performance will be chosen every time. At the scale of tens to hundreds of nodes, a mid-query failure of one of the nodes is a rare event. At the scale of many thousands of nodes, such events are far more common. Some parallel database systems lose all work that has been done thus far in processing a query when a DBMS node fails; others just lose a lot of work (Aster Data might be the best amongst its competitors along this metric). However, no parallel database system (that I'm aware of) is willing to pay the performance overhead to lose a minimal amount of work upon a node failure.&lt;br /&gt;&lt;br /&gt;Second, while it is possible to get reasonably homogeneous performance across tens to hundreds of nodes, this is nearly impossible across thousands of nodes, even if each node runs on identical hardware or on an identical virtual machine. Part failures that do not cause complete node failure, but result in degraded hardware performance become more common at scale. Individual node disk fragmentation and software configuration errors can also cause degraded performance on some nodes. Concurrent queries (or, in some cases, concurrent processes) further reduce the homogeneity of cluster performance. Furthermore, we have seen wild fluctuations in node performance when running on virtual machines in the cloud. Parallel database systems tend to do query planning in advance and will assign each node an amount of work to do based on the expected performance of that node. When running on small numbers of nodes, extreme outliers from expected performance are a rare event, and it is not worth paying the extra performance overhead for runtime task scheduling. 
At the scale of many thousands of nodes, extreme outliers are far more common, and query latency ends up being approximately equal to the time it takes these slow outliers to finish processing.&lt;br /&gt;&lt;br /&gt;Third, many parallel databases have not been tested at the scale of many thousands of nodes, and in my experience, unexpected bugs in these systems start to appear at this scale.&lt;br /&gt;&lt;br /&gt;In my opinion the "scalability problem" is one of two reasons why we're starting to see Hadoop encroach on the structured analytical database market traditionally dominated by parallel DBMS vendors (see the &lt;a href="http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/"&gt;Facebook Hadoop deployment&lt;/a&gt; as an example). Hadoop simply scales better than any currently available parallel DBMS product.  Hadoop gladly pays the performance penalty for runtime task scheduling and excellent fault tolerance in order to yield superior scalability. (The other reason Hadoop is gaining market share in the structured analytical DBMS market is that it is free and open source, and there exists no good free and open source parallel DBMS implementation.)&lt;br /&gt;&lt;br /&gt;The problem with Hadoop is that it also gives up some performance in other areas where there are no tradeoffs for scalability. Hadoop was not originally designed for structured data analysis, and thus is &lt;a href="http://cs-www.cs.yale.edu/homes/dna/papers/benchmarks-sigmod09.pdf"&gt;significantly outperformed by parallel database systems on structured data analysis tasks&lt;/a&gt;. 
Furthermore, it is a relatively young piece of software and has not implemented many of the performance enhancing techniques developed by the research community over the past few decades, including direct operation on compressed data, materialized views, result caching, and I/O scan sharing.&lt;br /&gt;&lt;br /&gt;Ideally there would exist an analytical database system that achieves the scalability of Hadoop along with the performance of parallel database systems (at least the performance that is not the result of a tradeoff with scalability). And ideally this system would be free and open source.&lt;br /&gt;&lt;br /&gt;That's why my students Azza Abouzeid and Kamil Bajda-Pawlikowski developed &lt;a href="http://db.cs.yale.edu/hadoopdb/hadoopdb.html"&gt;HadoopDB&lt;/a&gt;. It's an open source stack that includes PostgreSQL, Hadoop, and Hive, along with some glue between PostgreSQL and Hadoop, a catalog, a data loader, and an interface that accepts queries in MapReduce or SQL and generates query plans that are processed partly in Hadoop and partly in different PostgreSQL instances spread across many nodes in a shared-nothing cluster of machines. In essence it is a hybrid of MapReduce and parallel DBMS technologies. But unlike Aster Data, Greenplum, Pig, and Hive, it is not a hybrid simply at the language/interface level. It is a hybrid at a deeper, systems implementation level. 
Also unlike Aster Data and Greenplum, it is free and open source.&lt;br /&gt;&lt;br /&gt;Our paper (that will be presented at the upcoming &lt;a href="http://vldb2009.org/"&gt;VLDB conference&lt;/a&gt; in the last week of August) shows that HadoopDB gets similar fault tolerance and ability to tolerate wild fluctuations in runtime node performance as Hadoop, while still approaching the performance of commercial parallel database systems (of course, it still gives up some performance due to the above mentioned tradeoffs).&lt;br /&gt;&lt;br /&gt;Although HadoopDB currently is built on top of PostgreSQL, other database systems can theoretically be substituted for PostgreSQL. We have successfully been able to run HadoopDB using MySQL instead, and are currently working on optimizing connectors to open source column-store database systems such as MonetDB and Infobright. We believe that switching from PostgreSQL to a column-store will result in even better performance on analytical workloads.&lt;br /&gt;&lt;br /&gt;The initial release of the source code for HadoopDB can be found at &lt;a href="http://db.cs.yale.edu/hadoopdb/hadoopdb.html"&gt;http://db.cs.yale.edu/hadoopdb/hadoopdb.html&lt;/a&gt;. 
Although at this point this code is just an academic prototype and some ease-of-use features are yet to be implemented, I hope that this code will nonetheless be useful for your structured data analysis tasks!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8899645800948009496-1094148618368198949?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/1094148618368198949/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-longer.html#comment-form' title='28 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/1094148618368198949'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/1094148618368198949'/><link rel='alternate' type='text/html' href='http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-longer.html' title='Announcing release of HadoopDB (longer version)'/><author><name>Daniel Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>28</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-8618588290805600555</id><published>2009-07-20T07:15:00.000-07:00</published><updated>2009-07-20T16:12:09.857-07:00</updated><title type='text'>Announcing release of HadoopDB (shorter version)</title><content type='html'>I'm pleased to announce the release of HadoopDB. 
HadoopDB is:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;A hybrid of DBMS and MapReduce technologies targeting analytical query workloads&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Designed to run on a shared-nothing cluster of commodity machines, or in the cloud&lt;/li&gt;&lt;li&gt;An attempt to fill the gap in the market for a free and open source parallel DBMS&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Much more scalable than currently available parallel database systems and  DBMS/MapReduce hybrid systems (see &lt;a href="http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-longer.html"&gt;longer blog post&lt;/a&gt;).&lt;/li&gt;&lt;li&gt;As scalable as Hadoop, while achieving superior performance on structured data analysis workloads&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;For more, see:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;&lt;a href="http://db.cs.yale.edu/hadoopdb/hadoopdb.html"&gt;Project Webpage&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-longer.html"&gt;More detailed blog post&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://db.cs.yale.edu/hadoopdb/hadoopdb.pdf"&gt;Complete 12-page VLDB paper&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8899645800948009496-8618588290805600555?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/8618588290805600555/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-shorter.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/8618588290805600555'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/8899645800948009496/posts/default/8618588290805600555'/><link rel='alternate' type='text/html' href='http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-shorter.html' title='Announcing release of HadoopDB (shorter version)'/><author><name>Daniel Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-2096679221335255452</id><published>2009-07-15T19:45:00.000-07:00</published><updated>2009-07-15T21:30:35.421-07:00</updated><title type='text'>Sybase IQ throws its hat into the in-DBMS analytics ring</title><content type='html'>Sybase IQ announced this week the availability of Sybase IQ 15.1. The &lt;a href="http://www.businesswire.com/portal/site/google/?ndmViewId=news_view&amp;amp;newsId=20090714005651&amp;amp;newsLang=en"&gt;press release &lt;/a&gt;made it clear that this version is all about in-DBMS analytics. Perhaps the most notable addition is the integration of the DB Lytix (a product from Fuzzy Logix) analytics library into Sybase IQ, so DB Lytix functions can be run inside of the DBMS.&lt;br /&gt;&lt;br /&gt;It's possible that I'm looking in the wrong places, but I have seen very little attention paid to this announcement. I take this as a symptom of a very good thing: so many DBMS vendors are adding in-DBMS analytics features to their products, that announcements such as this are not really news any more. 
In the last 18 months we've had the announcement of the Teradata-SAS partnership (where a number of SAS functions are being implemented inside Teradata and run in parallel on Teradata's shared-nothing architecture), Netezza opening up their development platform so that Netezza partners and customers can implement complex analytical functions inside the Netezza system, and the announcement of in-database MapReduce functionality by Greenplum, Aster Data, and Vertica (though, &lt;a href="http://www.b-eye-network.com/view/10786"&gt;as explained by Colin White&lt;/a&gt;, the vision of when MapReduce should be used --- e.g., for ETL or user queries --- varies across these three companies). Though not announced yet, I'm told that Microsoft Project Madison (the shared-nothing version of SQL Server to be released in 2010) will natively run windowed analytics functions in parallel inside the DBMS.&lt;br /&gt;&lt;br /&gt;As a DBMS researcher, this is great news, as the DBMS is starting to become the center of the universe, the location where everything happens. Though some would argue that in-DBMS analytics has been available for decades via user-defined functions (UDFs), many agree that UDF performance has been disappointing at best, and shipping data out of the DBMS to perform complex analytics has been common practice. The reasons for this are many-fold: query optimizers have trouble estimating the cost of UDFs, arbitrary user written code is hard to automatically run in parallel, and various security and implementation bugs manifest themselves (see, e.g., Section 4.3.5 of the "&lt;a href="http://cs-www.cs.yale.edu/homes/dna/papers/benchmarks-sigmod09.pdf"&gt;A Comparison of Approaches to Large Scale Data Analysis&lt;/a&gt;" paper).&lt;br /&gt;&lt;br /&gt;One interesting trend to note is that there seem to be two schools of thought emerging with different opinions on how to allow complex in-DBMS analytics without resorting to regular UDFs. 
Teradata, Microsoft, Sybase, and, to an extent, Netezza, all seem to believe that providing a library of preoptimized functions distributed with the software is the way to go. This allows the vendor to build into the system the ability to run these functions in parallel across all computing resources (a shared-nothing MPP cluster in all cases except Sybase) and to make sure these functions can be planned appropriately along with other data processing operations. The other school of thought is adopted by vendors that allow customers more freedom to implement their own functions, but constrain the language in which this code is written (such as MapReduce or LINQ) to facilitate the automatic parallelization of this code inside the DBMS.&lt;br /&gt;&lt;br /&gt;Obviously, these camps are not mutually exclusive. As in-DBMS analytics continues to grow in popularity, it is likely we'll start to see vendors adopt both options.&lt;br /&gt;&lt;br /&gt;Whatever school of thought you prefer, it is clear that the last 18 months have seen tremendous progress for in-database analytics. Shipping data out of the DBMS for analysis never made a lot of sense, and finally, viable alternative options are emerging. 
Database systems are becoming increasingly powerful platforms for data processing, and this is good for everyone.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8899645800948009496-2096679221335255452?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/2096679221335255452/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2009/07/sybase-iq-throws-its-hat-into-in-dbms.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/2096679221335255452'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/2096679221335255452'/><link rel='alternate' type='text/html' href='http://dbmsmusings.blogspot.com/2009/07/sybase-iq-throws-its-hat-into-in-dbms.html' title='Sybase IQ throws its hat into the in-DBMS analytics ring'/><author><name>Daniel Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-2080165155106811553</id><published>2009-07-06T20:43:00.000-07:00</published><updated>2009-09-24T10:01:55.029-07:00</updated><title type='text'>ParAccel and their puzzling TPC-H results</title><content type='html'>[&lt;span style="font-style: italic;"&gt;Warning: the ParAccel TPC-H results referred to in this post have since been challenged by a competitor and found to be in violation of certain TPC rules (I cannot find any public disclosure of which specific rules were violated). 
These results have since been removed from the TPC-H Website, as of Sep 24th 2009.]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;At SIGMOD last week, I was chatting with Mike Stonebraker (chatting might be the wrong verb; it was more like downloading knowledge and advice). ParAccel had a paper at SIGMOD, and the week before they had announced that they topped the TPC-H benchmark at the 30TB scale. Naturally, this resulted in ParAccel coming up in our conversation. Mike asked me a question about the ParAccel TPC-H results that floored me --- I was really disappointed I didn't think of this question myself.&lt;br /&gt;&lt;br /&gt;Before telling you the question, let me first give you some background. My knowledge about ParAccel's TPC-H results came from reading two blog posts about it. First, there was Merv Adrian's &lt;a href="http://mervadrian.wordpress.com/2009/06/22/paraccel-rocks-the-tpc-h-will-see-added-momentum/"&gt;positive post&lt;/a&gt; on the subject, and then there was Curt Monash's &lt;a href="http://www.dbms2.com/2009/06/22/the-tpc-h-benchmark-is-a-blight-upon-the-industry/"&gt;negative post&lt;/a&gt;. Monash's negativity stemmed largely from the seemingly unrealistic configuration of having nearly a petabyte of disk to store 30TB of data (a 32:1 disk/data ratio). This negativity was responded to in a comment by Richard Gostanian who has firsthand knowledge since he helped ParAccel get these benchmark results. The relevant excerpt of his comment is below:&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;span style="font-size:85%;"&gt;"Why did ParAccel require 961 TB of disk space? The answer is they didn’t. They really only needed about 20TB (10TB compressed times 2 for mirroring). But the servers they used, which allowed for the superb price-performance they achieved, came standard with 961 TB of storage; there simply was no way to configure less storage. These Sun servers, SunFire X4540s, are like no other servers on the market. 
They combine (1) reasonable processing power (two Quad Core 2.3 GHz AMD Opteron processors) (2) large memory (64 GB), (3) very high storage capacity (24 or 48 TB based on 48 x 500GB or 1TB SATA disks) and 4) exceptional I/O bandwidth (2 - 3 GB/sec depending upon whether you’re working near the outer or inner cylinders) all in a small (4 RU), low cost (~$40K), power efficient (less than 1000 watts) package.&lt;br /&gt;&lt;br /&gt;What any DBMS needs to run a disk-based TPC-H benchmark is high I/O throughput. The only economical way to achieve high disk throughput today is with a large number of spindles. But the spindles that ship with today’s storage are much larger than what they were, say, 5 years ago. So any benchmark, or application, requiring high disk throughput is going to waste a lot of capacity. This will change over the next few years as solid-state disks become larger, cheaper and more widely used. But for now, wasted storage is the reality, and should not be viewed in a negative light."&lt;/span&gt;&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;I remember thinking to myself "sounds reasonable" and agreeing with both sides. Yes, there is wasted space, but you need the large number of spindles to get the high I/O throughput and most TPC-H configurations resemble this one.&lt;br /&gt;&lt;br /&gt;Then along came Mike Stonebraker, who asked me: "Why does ParAccel need such high I/O throughput? They're a column-store!"&lt;br /&gt;&lt;br /&gt;This question immediately triggered my memory of all the experiments I ran on C-Store. I generally ran C-Store on a system with 2 CPUs and 4 disks getting an aggregate I/O bandwidth of 180MB/s (i.e. 90MB/s per CPU) and I rarely saw a disk-bottlenecked query. Why? 
(1) Because column-stores are already super I/O efficient by only reading the relevant columns for each query and (2) Because they compress data really well (I usually saw at least 5:1) so the effective I/O bandwidth was 90MB/s * 5 = 450MB/s per CPU which, for any reasonably interesting query, is faster than the CPU can generally keep up with.&lt;br /&gt;&lt;br /&gt;Meanwhile, the ParAccel TPC-H configuration consisted of nodes with orders of magnitude more disk I/O (3GB/s from the 48 disks) yet just 4 times the CPU processing power. Doing the math: 3GB/s divided by 8 CPU cores = 375MB/s per CPU core. Gostanian said that there was a 3:1 compression ratio, so we're talking about an astounding effective 1GB/s per CPU core.&lt;br /&gt;&lt;br /&gt;There's simply no way TPC-H queries are simple enough (except for maybe query 1) for the CPUs to process them at 1GB/s (see my &lt;a href="http://dbmsmusings.blogspot.com/2009/06/what-is-right-way-to-measure-scale.html"&gt;related post on this subject&lt;/a&gt;). So it must be the case that part of that 1GB/s is being wasted reading data that will ultimately not be processed (e.g. like unused columns in a row-store). But ParAccel is a column-store! Hence the Mike Stonebraker conundrum.&lt;br /&gt;&lt;br /&gt;At this point I was already ready to bet the house that ParAccel does not have the I/O efficiency of a standard column-store. But one who is unconvinced could still argue that disks are cheap and ParAccel was willing to pay the small amount extra for the high I/O bandwidth (which would help the simplest queries that are not CPU intensive) even though most queries would be CPU limited and the extra bandwidth would not help them.&lt;br /&gt;&lt;br /&gt;To test this theory I wasted some time slogging through the 82-page ParAccel &lt;a href="http://www.tpc.org/results/FDR/tpch/sunfire_x4540_paraccel_tpc-h_30tb.fdr.pdf"&gt;full disclosure report&lt;/a&gt; on the 30TB TPC-H results (seriously!). 
Query 1 is arguably the easiest query in TPC-H (it scans the single lineitem table, accesses 7 of its attributes, applies a single predicate and several very basic aggregations). There are no joins in this query and the aggregations are really easy since there are only 4 unique groups, so the aggregations can be done during the scan via a hash aggregation. Since it is so basic, this query should be disk limited on most systems. A standard column-store should be able to run this query in the time it takes to read the 7 columns off disk. There are 180 billion rows in the lineitem table, and each attribute should take up about 4 bytes (this is conservative --- with compression this number should be much less). So a total of 180 billion x 7 x 4 bytes = approximately 5TB needs to be read off disk for this query (in reality this number will likely be smaller due to compression and memory caching). So given that there were 43 servers, each with 3GB/s I/O bandwidth, a conservative estimate of how fast this query should run using a standard column-store is (5TB / (43*3GB/s)) or approximately 40 seconds. But this query takes ParAccel on average 275 seconds according to the report. This is further evidence that ParAccel does not have the I/O efficiency of a standard column-store.&lt;br /&gt;&lt;br /&gt;Hence, ever since I looked into this, I've been just totally flummoxed by these puzzling numbers.&lt;br /&gt;&lt;br /&gt;Today I finally got around to reading their SIGMOD paper describing the internals of their system. This paper was sadly sparse on details (it was only 4 pages when the SIGMOD limit was 14 and these 4 pages were not that meaty). The main thing I learned from the paper is that ParAccel has heavy PostgreSQL roots. It seems like they started with PostgreSQL and did their best to transform it into an MPP column-store. The one area where there was some amount of meat was their new optimizer's ability to handle plans with "thousands of joins". 
This of course raised more questions for me. Why would a start-up (with so many things that need to be implemented) be worried about queries with that many joins? Maybe there are a handful of customers that need this many, but the vast majority need far fewer. But then I got to Figure 2, where they ran queries from TPC-H and TPC-DS and the x-axis had queries with up to 42 joins. But unless I'm sorely mistaken, no TPC-H or TPC-DS query has anywhere near that number of joins.&lt;br /&gt;&lt;br /&gt;So then I was really confused. I had four questions about ParAccel, each of which is troubling.&lt;br /&gt;&lt;br /&gt;(1) Why did they configure their TPC-H application with such a high amount of disk I/O throughput capability when they are a column-store? (Stonebraker's question)&lt;br /&gt;(2) Why did queries spend seemingly 6X more time doing I/O than a column-store should have to do?&lt;br /&gt;(3) Why are they worried about queries with thousands of joins?&lt;br /&gt;(4) Why do they think TPC-H/TPC-DS queries have 42 joins?&lt;br /&gt;&lt;br /&gt;And then a theory that answers all four questions at the same time came to me. Perhaps ParAccel directly followed my advice (see option 1) on "&lt;a href="http://cs-www.cs.yale.edu/homes/dna/talks/abadi-nedbday.pdf"&gt;How to create a new column-store DBMS product in a week&lt;/a&gt;". They're not a column-store. They're a vertically partitioned row-store (this is how column-stores were built back in the 70s before we knew any better). Each column is stored in its own separate table inside the row-store (PostgreSQL in ParAccel's case). Queries over the original schema are then automatically rewritten into queries over the vertically partitioned schema and the row-store's regular query execution engine can be used unmodified. 
But now, every attribute accessed by the query adds an additional join to the query plan (since the vertical partitions for each column in a table have to be joined together).&lt;br /&gt;&lt;br /&gt;This immediately explains why they are worried about queries with hundreds to thousands of joins (questions 3 and 4). But it also explains why they seem to be doing much more I/O than a native column-store. Since each vertical partition is its own table, each tuple in a vertical partition (which contains just one value) is preceded by the row-store's tuple header. In PostgreSQL this tuple header is on the order of 27 bytes. So if the column width is 4 bytes, then there is a factor of 7 extra space used up for the tuple header relative to actual user data. And if the implementation is super naive, they will also need an additional 4 bytes to store a tuple identifier for joining vertical partitions from the same original table with each other. This answers questions 1 and 2, as the factor of 6 worse I/O efficiency is now obvious.&lt;br /&gt;&lt;br /&gt;If my theory is correct (and remember, I have no inside knowledge), then ParAccel has ignored everything I have done in my research on column-stores over the past 6 years. My whole &lt;a href="http://cs-www.cs.yale.edu/homes/dna/papers/abadiphd.pdf"&gt;PhD dissertation&lt;/a&gt; is about the order of magnitude performance improvement you get by building a query executer specifically for column-stores (instead of using a row-store query executer), which &lt;a href="http://cs-www.cs.yale.edu/homes/dna/talks/abadi-sigmod-award.pdf"&gt;I talked about at SIGMOD last week&lt;/a&gt;. I can't help but take it a little personally that they have not read my papers on this subject, like my &lt;a href="http://cs-www.cs.yale.edu/homes/dna/papers/abadi-sigmod08.pdf"&gt;column-stores vs row-stores paper&lt;/a&gt; from last year. 
ParAccel would be antithetical to everything I stand for.&lt;br /&gt;&lt;br /&gt;Given how neatly my theory explains my four questions, I am pessimistic that someone will be able to recover my perception of ParAccel's product. But I would be happy if someone from ParAccel could confirm or deny what I have said, and if there are alternative explanations to my four questions, I'd love to hear them in an open forum such as comments on this blog. Please though, let's try to keep the conversation polite and civil.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8899645800948009496-2080165155106811553?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/2080165155106811553/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2009/07/paraccel-and-their-puzzling-tpc-h.html#comment-form' title='16 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/2080165155106811553'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/2080165155106811553'/><link rel='alternate' type='text/html' href='http://dbmsmusings.blogspot.com/2009/07/paraccel-and-their-puzzling-tpc-h.html' title='ParAccel and their puzzling TPC-H results'/><author><name>Daniel Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>16</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-3929105633019354344</id><published>2009-06-29T08:51:00.000-07:00</published><updated>2009-06-29T09:43:25.330-07:00</updated><title 
type='text'>More on node scalability</title><content type='html'>I'm at SIGMOD this week, so I'll make this blog post quick:&lt;br /&gt;&lt;br /&gt;A week and a half ago, I blogged on &lt;a href="http://dbmsmusings.blogspot.com/2009/06/what-is-right-way-to-measure-scale.html"&gt;analytical database scalability&lt;/a&gt;. The basic thesis was:&lt;br /&gt;(1) People are overly focused on raw data managed by the DBMS as an indicator of scalability.&lt;br /&gt;(2) If you store a lot of data but don't have the CPU processing power to process it at the speed it can be read off of disk, this is a potential indication of a scalability &lt;strong&gt;problem&lt;/strong&gt; not a scalability &lt;strong&gt;success&lt;/strong&gt;.&lt;br /&gt;&lt;br /&gt;I insinuated that:&lt;br /&gt;(1) Scalability in terms of number of nodes might be a better measure than scalability in size of data (assuming of course, you can't get &lt;em&gt;true&lt;/em&gt; scalability experiments as specified by the academic definition of scalability)&lt;br /&gt;(2) Scaling the number of nodes is harder than people realize.&lt;br /&gt;&lt;br /&gt;I am not so presumptuous as to assume that my blog carries enough weight to affect the actions of Aster Data's marketing department, but I was nonetheless pleased to see Aster's &lt;a href="http://www.asterdata.com/product/appliance.php"&gt;new marketing material&lt;/a&gt; (which I believe was just put online today) which specifically shows how their new appliance has more CPU processing power per usable storage space than a subset of their competitors (I'm assuming they either only showed the competitors they can beat in this metric, or their other competitors don't publish this number).&lt;br /&gt;&lt;br /&gt;I do want to point out one interesting thing, however, about Aster Data's scalability (at least what limited knowledge we can deduce from &lt;a href="http://www.asterdata.com/resources/downloads/datasheets/appliance_ds.pdf"&gt;their marketing 
material&lt;/a&gt;). For 6.25TB, 12.5TB, 25TB, 50TB, and 100TB, Aster Data's appliance has 8, 16, 32, 63, and 125 worker nodes, and 64, 128, 256, 504, and 1000 processing cores respectively. So basically, it's completely linear: double the amount of data, and the number of nodes and processing cores doubles as well. But then going from 100TB to 1PB of data (10X more data) they increase the number of worker nodes by less than 3X (to 330) and processing cores by the same factor (to 2640). So after 100TB/100 nodes, their processing power per storage space drops off a cliff. (Aster has a note for the 1PB saying they are assuming 3.75X compression of user data at 1PB, but compression at 1PB shouldn't be any different than compression at 100TB; and if they are assuming uncompressed data at 100TB, then the comparison in processing power per storage space relative to other vendors is misleading since everyone else compresses data.)&lt;br /&gt;&lt;br /&gt;Bottom line: scaling to 100 nodes is one thing (I know that Teradata, Vertica, and Greenplum can do this in addition to Aster Data, and I'm probably forgetting some other vendors). Scaling to 1000 nodes (and keeping performance constant) is much harder. 
This might explain why Aster Data only doubles the number of nodes/cores for 10X more data after they hit 100 nodes.&lt;br /&gt;&lt;br /&gt;I will have more on why it is difficult to scale to 1000 nodes in future posts.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8899645800948009496-3929105633019354344?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/3929105633019354344/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2009/06/more-on-node-scalability.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/3929105633019354344'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/3929105633019354344'/><link rel='alternate' type='text/html' href='http://dbmsmusings.blogspot.com/2009/06/more-on-node-scalability.html' title='More on node scalability'/><author><name>Daniel Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-6777611285132028938</id><published>2009-06-18T14:08:00.000-07:00</published><updated>2009-06-24T18:26:34.006-07:00</updated><title type='text'>What is the right way to measure scale?</title><content type='html'>Berkeley Professor Joe Hellerstein wrote a &lt;a href="http://databeta.wordpress.com/2009/05/14/bigdata-node-density/"&gt;really interesting blog post&lt;/a&gt; a month ago, comparing two wildly different hardware architectures for performing data analysis. 
He looked at the &lt;a href="http://developer.yahoo.net/blogs/hadoop/2009/05/hadoop_sorts_a_petabyte_in_162.html"&gt;architecture Yahoo used&lt;/a&gt; to sort a petabyte of data with Hadoop, and the &lt;a href="http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/"&gt;architecture eBay uses&lt;/a&gt; to do web and event log analysis on 6.5 petabytes of data with Greenplum.&lt;br /&gt;&lt;br /&gt;Given just the information I gave you above, that Hadoop and Greenplum can be used to process over one petabyte of data, what can we learn about the scalability of these systems? If you ask your average DBMS academic, you’ll get an answer along the lines of “nothing at all”. They will likely tell you that the gold standard is linear scalability: if you add 10X the amount of data and 10X the hardware, your performance will remain constant. Since you need to measure how performance changes as you add data/hardware, you can’t conclude anything from a single data point.&lt;br /&gt;&lt;br /&gt;But come on, we must be able to conclude something, right? One petabyte is a LOT of data. For example, if each letter in this blog post is about 3 millimeters wide on your computer screen (or mobile device or whatever) and each letter is stored in 1-byte ASCII, and assuming that you’ve disabled text-wrapping so that the entire blog post is on a single line, how long (in distance) would I have to make this blog post to reach 1 petabyte of data? A mile? 100 miles? Maybe it could cross the Atlantic Ocean? No! This blog post would make it all the way to the moon. And back. And then go back and forth three more times. &lt;span style="color:#ff0000;"&gt;[Edit: Oops! It's actually more than 3400 more times; you could even make it to the sun and back ten times]&lt;/span&gt; And eBay’s database has 6.5 times that amount of data! A petabyte is simply a phenomenal amount of data. 
A data analysis system must surely be able to scale reasonably well if people are using it to manage a petabyte of data. If it scaled significantly sublinearly, then it would be prohibitively expensive to add enough hardware to the system in order to get reasonable performance.&lt;br /&gt;&lt;br /&gt;At the end of his &lt;a href="http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/"&gt;post a month ago on the Facebook 2.5 petabyte data warehouse managed by Hadoop&lt;/a&gt;, Curt Monash gives the list of the largest warehouses he’s come across as an analyst (at least those that are not NDA restricted). The DBMS products used in these warehouses were: Teradata, Greenplum, Aster Data, DATAllegro, and Vertica. What’s the commonality? They’re all shared-nothing MPP database systems. Hadoop is also shared-nothing and MPP, though most would not call it a database system. So without detailed scalability experiments, can we use this data to verify the &lt;a href="http://www.databasecolumn.com/2007/10/database-parallelism-choices.html"&gt;claim of senior database system researchers&lt;/a&gt; that shared-nothing MPP architectures scale better than alternative approaches (like shared-disk or shared-memory)? Not directly, but it’s pretty good evidence.&lt;br /&gt;&lt;br /&gt;Let’s get back to the two systems Hellerstein highlighted in his blog post: the Greenplum 6.5 petabyte database and the Hadoop 1 petabyte database. One might use reasoning similar to that above to say that Greenplum scales better than Hadoop. Or at least it doesn’t seem to scale worse. But let’s dig a little deeper. The architecture Hadoop used was 3800 “nodes” where each node consisted of 2 quad-core Xeons at 2.5GHz, 16GB RAM, and 4 SATA disks. The architecture Greenplum used contained only 96 nodes. 
Assuming each node is as Hellerstein insinuates (&lt;a href="http://www.sun.com/servers/x64/x4540/"&gt;SunFire X4540&lt;/a&gt;), then each node contained 2 quad-core AMD Opterons at 2.3 GHz, 32-64GB RAM, and 48 SATA disks. So the Hadoop cluster has about 40X the number of nodes and 40X the amount of processing power (while just 3X the number of SATA disks).&lt;br /&gt;&lt;br /&gt;Now, let’s dig even deeper. Let’s assume that each SATA disk can read data at 60MB/s. Then each SunFire node in the Greenplum cluster can scan (sequentially) a table at a rate of 48 disks X 60 MB/s = just under 3GB/s. This is in line with Sun’s claims that data can be read from a SunFire node from disk into memory at a rate of 3GB/s, so this seems reasonable. But don’t forget that Greenplum compressed eBay’s data at a rate of 70% (6.5 petabytes user data compressed to 1.95 petabytes). So that 3GB/s of bandwidth is actually an astonishing 10GB/s of effective read bandwidth.&lt;br /&gt;&lt;br /&gt;So what can the two quad-core Opteron processors do to this 10GB/s fire hose? Well, first they have to decompress the data, and then they need to do whatever analysis is required via the SQL query (or in Greenplum’s case, alternatively a MapReduce task). The minimum case is maybe a selection or an aggregation, but with MapReduce the analysis could be arbitrarily complex. The CPUs need to do all this analysis at a rate of 10GB/s in order to keep up with the disks.&lt;br /&gt;&lt;br /&gt;There’s an ICDE paper I highly recommend by Marcin Zukowski et al., &lt;a href="http://www2.computer.org/portal/web/csdl/doi/10.1109/ICDE.2006.150"&gt;Super-Scalar RAM-CPU Cache Compression&lt;/a&gt;, that looks at this point. They ran some experiments on a single Opteron 2GHz core and found that state-of-the-art fast decompression algorithms such as LZRW1 or LZOP usually obtain 200-500MB/s decompression throughput on the 2GHz Opteron core. 
They introduced some super-fast light-weight decompression schemes that can do an order of magnitude better (around 3GB/s decompression on the same CPU). They calculated (see page 5) that given an effective disk bandwidth of 6GB/s, a decompression rate of 3GB/s gives them 5 CPU cycles per tuple that can be spent on additional analysis after decompression. 5 cycles! Is that even enough to do an aggregation?&lt;br /&gt;&lt;br /&gt;The SunFire node has approximately eight times the processing power (8 Opteron cores rather than 1), but given their near-entropy compression claims, they are likely using heavier-weight compression schemes rather than the light-weight schemes of Zukowski et al., which cancels that factor of 8 in processing power with a factor of 10 slower decompression performance. So we’re still talking around 5 cycles per tuple for analysis.&lt;br /&gt;&lt;br /&gt;The bottom line is that if you want to do advanced analysis (e.g. using MapReduce), the eBay architecture is hopelessly unbalanced. There’s simply not enough CPU power to keep up with the disks. You need more “nodes”, like the Yahoo architecture.&lt;br /&gt;&lt;br /&gt;So which scales better? Is using the number of nodes a better proxy than size of data? Hadoop can “scale” to 3800 nodes. So far, all we know is that Greenplum can “scale” to 96 nodes. Can it handle more nodes? I have an opinion on this, but I’m going to save it for my HadoopDB post.&lt;br /&gt;&lt;br /&gt;(I know, I know, I’ve been talking about my upcoming HadoopDB post for a while. It’s coming. 
I promise!)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8899645800948009496-6777611285132028938?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/6777611285132028938/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2009/06/what-is-right-way-to-measure-scale.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/6777611285132028938'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/6777611285132028938'/><link rel='alternate' type='text/html' href='http://dbmsmusings.blogspot.com/2009/06/what-is-right-way-to-measure-scale.html' title='What is the right way to measure scale?'/><author><name>Daniel Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-6050980294639274098</id><published>2009-06-12T14:47:00.000-07:00</published><updated>2009-06-12T15:53:18.582-07:00</updated><title type='text'>SenSage crosses a line</title><content type='html'>&lt;a href="http://2.bp.blogspot.com/_muZF7G-aiz4/SjLNSzIKCcI/AAAAAAAAAAU/eRkbe0UWruA/s1600-h/sensage.jpg"&gt;&lt;img id="BLOGGER_PHOTO_ID_5346561430487960002" style="FLOAT: left; MARGIN: 0px 10px 10px 0px; WIDTH: 400px; CURSOR: hand; HEIGHT: 300px" alt="" src="http://2.bp.blogspot.com/_muZF7G-aiz4/SjLNSzIKCcI/AAAAAAAAAAU/eRkbe0UWruA/s400/sensage.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;div&gt;OK, I need to vent.&lt;br /&gt;&lt;br 
/&gt;SenSage is another column-store product that I've known about for a long time. Mike Stonebraker even cited them in the original C-Store paper. But their marketing department really crossed a line in my opinion. Yesterday, a &lt;a href="http://www.sensage.com/blog/2009/06/11/mapreduce-made-easy-the-future-of-database-analytics/"&gt;blog post&lt;/a&gt; under the CEO's name (I doubt the CEO actually wrote this post for reasons that will be obvious in a second) appeared that attacked MapReduce in a similar, but less inflammatory, way to &lt;a href="http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html"&gt;DeWitt and Stonebraker's famous blog post&lt;/a&gt; written a year and a half ago in January 2008. The basic premise of both posts:&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;ol&gt;&lt;li&gt;MapReduce is not new&lt;/li&gt;&lt;li&gt;Everything you can do in MapReduce you can do with group-by and aggregate UDFs&lt;/li&gt;&lt;/ol&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;The overlap of the argument already bothered me a little. But then I saw the text circled above:&lt;br /&gt;&lt;br /&gt;"[...] map is like the group-by clause of an aggregate query. Reduce is analogous to the aggregate function (e.g., average or sum) that is computed over all the rows with the same group-by attribute. "&lt;br /&gt;&lt;br /&gt;It seems like an unusual writing style to say that Map is "like" group-by and Reduce is "analogous" to aggregate. Either they're both "like" or both "analogous". But then, here's a sentence in the original blog post by DeWitt and Stonebraker:&lt;br /&gt;&lt;br /&gt;"[...] map is like the group-by clause of an aggregate query. Reduce is analogous to the aggregate function (e.g., average) that is computed over all the rows with the same group-by attribute."&lt;br /&gt;&lt;br /&gt;This is WAY too similar in my opinion.&lt;br /&gt;&lt;br /&gt;Come on SenSage. Have an original opinion. 
And don't copy year-and-a-half old text that most of us have already read.&lt;br /&gt;&lt;br /&gt;Stonebraker cited you in the C-Store paper. You could at least cite him and DeWitt in return when you are basically rehashing their ideas.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8899645800948009496-6050980294639274098?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/6050980294639274098/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2009/06/sensage-crosses-line.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/6050980294639274098'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/6050980294639274098'/><link rel='alternate' type='text/html' href='http://dbmsmusings.blogspot.com/2009/06/sensage-crosses-line.html' title='SenSage crosses a line'/><author><name>Daniel Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_muZF7G-aiz4/SjLNSzIKCcI/AAAAAAAAAAU/eRkbe0UWruA/s72-c/sensage.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-8183623406303633556</id><published>2009-06-10T10:22:00.001-07:00</published><updated>2009-06-10T13:20:08.790-07:00</updated><title type='text'>SIGMOD 2009: Get There Early and Leave Late</title><content type='html'>&lt;a 
href="http://www.sigmod09.org/index.shtml"&gt;SIGMOD 2009&lt;/a&gt; (a top tier DBMS research conference) is being held in Providence, RI, at the end of this month (the week of June 29th). Column-stores seem to be taking off as a mainstream research topic, as I've counted at least 5 talks at SIGMOD on column-stores (including an entire session!). Make sure you get to SIGMOD by (at least) Tuesday morning though, as "&lt;a href="http://nms.csail.mit.edu/%7Estavros/pubs/SSD_sigmod09.pdf"&gt;Query Processing Techniques for Solid State Drives&lt;/a&gt;", by Dimitris Tsirogiannis, Stavros Harizopoulos, Mehul Shah, Janet Wiener, and Goetz Graefe, is being presented in the first research session slot on Tuesday (research session 2). It doesn't look like it from the title, but trust me, it's a column-store paper. The &lt;a href="http://www.computerworld.com/action/article.do?command=viewArticleBasic&amp;amp;taxonomyName=Databases&amp;amp;articleId=9131526&amp;amp;taxonomyId=173&amp;amp;pageNumber=1"&gt;controversial Vertica / parallel DBMS vs  Hadoop benchmark paper&lt;/a&gt; is being presented in research session 5 (Sam Madden, an excellent speaker, is presenting the paper), and the aforementioned &lt;a href="http://www.sigmod09.org/program_sigmod.shtml#res8"&gt;column-store session&lt;/a&gt; (with 3 column-store papers) is in research session 8 at the end of the day Tuesday.&lt;br /&gt;&lt;br /&gt;But if you get there early don't leave early! 
I will be giving a 30-minute talk on column-stores in the awards session at the very end of the conference (I don't believe the talks in this session have been officially announced yet, but I'm indeed giving one of the talks) which goes until 5:30PM on Thursday, July 2nd.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8899645800948009496-8183623406303633556?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/8183623406303633556/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2009/06/sigmod-2009-get-there-early-and-leave.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/8183623406303633556'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/8183623406303633556'/><link rel='alternate' type='text/html' href='http://dbmsmusings.blogspot.com/2009/06/sigmod-2009-get-there-early-and-leave.html' title='SIGMOD 2009: Get There Early and Leave Late'/><author><name>Daniel Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-6070635696086459042</id><published>2009-06-09T08:18:00.001-07:00</published><updated>2009-06-09T09:25:46.157-07:00</updated><title type='text'>CEO responds to my post on the Kickfire market</title><content type='html'>I had the unexpected and pleasant surprise of having the Kickfire CEO himself responding to my previous post on the Kickfire market (this is no doubt 
thanks to Curt Monash being kind enough to &lt;a href="http://www.dbms2.com/2009/06/07/daniel-abadi-kickfire/"&gt;point his large readership to it&lt;/a&gt;). Even though my Kickfire post was positive on the whole, I did raise some questions about their go-to-market strategy, and I want to give more prominence to the response (beyond just a comment thread), especially since it corrected an inaccuracy in my original post. At the end of this blog posting, I will provide my own comments on the Kickfire response.&lt;br /&gt;&lt;br /&gt;Here is the response of Kickfire CEO Bruce Armstrong, in his own words:&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;span style="color:#999999;"&gt;"Thanks for the post on Kickfire, Daniel.&lt;br /&gt;Some comments:&lt;br /&gt;&lt;br /&gt;1) We actually came out of stealth mode as a company at the April 2008 MySQL User Conference, where we announced our world records for TPC-H at 100GB and 300GB;&lt;br /&gt;&lt;br /&gt;2) Our product went GA at the end of 2008, which we formally announced at the April 2009 MySQL User Conference along with one of our production reference customers, Mamasource, a Web 2.0 online community doing clickstream analysis that hit performance and scalability limitations with MySQL at 50GB;&lt;br /&gt;&lt;br /&gt;3) Our focus is on the data warehouse "mass market" with databases ranging from gigabytes to low terabytes, where over 75% of deployments are today according to IDC/Computerworld survey 2008;&lt;br /&gt;&lt;br /&gt;4) We chose MySQL as a key component because it has emerged as a standard (12 million deployments) and 3rd-most deployed database for data warehousing according to IDC;&lt;br /&gt;&lt;br /&gt;5) While we do take over much of the processing with our column-store pluggable storage engine and parallel-processing SQL chip, we feel it's important to minimize any changes to a customer's database schema and/or application and to allow transparent interoperability with third-party tools;&lt;br 
/&gt;&lt;br /&gt;6) Having come from 15 years at Teradata (and after that Sybase and Broadbase), I know that the high-end of data warehousing is very, very different from the mass market - both are technically challenging in their own right and require very different product and go-to-market approaches;&lt;br /&gt;&lt;br /&gt;7) Finally, regarding Oracle and whether they would "embrace" Kickfire (the question I was asked by Jason on the Frugal Friday show), we believe the data warehouse mass market could create several winners - and having recently raised $20M from top-tier silicon valley investors, we believe we have the resources to be one of them.&lt;br /&gt;&lt;br /&gt;Thanks again for the post - we look forward to more from you and the&lt;br /&gt;community&lt;br /&gt;Bruce"&lt;br /&gt;&lt;/span&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;My comments on this response (note: please read the tone as positive and collaborative --- I might need a job one day):&lt;br /&gt;&lt;br /&gt;(1 and 2) I went back to the TPC-H Website, and believe I was indeed incorrect about Kickfire topping TPC-H in 2007 (I might have been thinking about ParAccel instead of Kickfire). According to the Website, Kickfire topped TPC-H in April of 2008 (though presumably the product being tested was finished sometime earlier than that in order to leave time for auditing the results, etc.). That said, there still does seem to be a double launch. The second sentence of &lt;a href="http://www.businesswire.com/portal/site/google/?ndmViewId=news_view&amp;amp;newsId=20080414006295&amp;amp;newsLang=en"&gt;the press release from April 14th 2008&lt;/a&gt; said the company "officially launches this week" while the 1st sentence of &lt;a href="http://www.reuters.com/article/pressRelease/idUS122285+15-Apr-2009+BW20090415"&gt;the press release from April 15th 2009&lt;/a&gt; announces the launch again. 
But I think what Bruce is saying is that in one case it was the company and in the other case it was the product.&lt;br /&gt;&lt;br /&gt;(3 and 4) The point of my post was that I think the market is smaller than these numbers indicate. Sure, there are a lot of MySQL deployments, but that's because it's free. The number of people actually paying for the MySQL Enterprise Edition is far smaller, but those are probably the people who'd be willing to pay for a solution like Kickfire's. Furthermore, as pointed out in the comment thread of the previous post, a lot of people who use MySQL for warehousing are using sharded MySQL, which is nontrivial (or at least not cheap) to port to non-shared-nothing solutions like Kickfire and Infobright. Finally, the amount of data that corporations are keeping around is increasing rapidly, and the size of data warehouses is doubling faster than Moore's law. So even if most warehouses today are pretty small, this might not be the case in the future. I'm a strong believer that MPP shared-nothing parallel solutions are the right answer for the mass market of tomorrow. Anyway, the bottom line is that I'm openly wondering if the market is actually much smaller than the IDC numbers would seem to suggest. But obviously, if Kickfire, Infobright, or Calpont achieves a large amount of success without changing their market strategy, I'll be proven incorrect.&lt;br /&gt;&lt;br /&gt;(5) Agreed.&lt;br /&gt;&lt;br /&gt;(6) I'd argue that Bruce's experience at Teradata gave him a lot of knowledge about the high-end market. I'm not sure it automatically gives him a lot of knowledge about the mass market. 
That said, he probably has more knowledge about the mass market than an academic at Yale :)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8899645800948009496-6070635696086459042?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/6070635696086459042/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2009/06/ceo-responds-to-my-post-on-kickfire.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/6070635696086459042'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/6070635696086459042'/><link rel='alternate' type='text/html' href='http://dbmsmusings.blogspot.com/2009/06/ceo-responds-to-my-post-on-kickfire.html' title='CEO responds to my post on the Kickfire market'/><author><name>Daniel Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-9080199379566840799</id><published>2009-06-08T15:23:00.000-07:00</published><updated>2009-06-08T16:38:49.573-07:00</updated><title type='text'>Quick Thoughts on the Greenplum EDC Announcement</title><content type='html'>The big news from today seems to be Greenplum's &lt;a href="http://www.prweb.com/releases/2009/06/prweb2505854.htm"&gt;Enterprise Data Cloud announcement&lt;/a&gt;. 
Here are some quick thoughts about it:&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;I totally agree that the future of high-end data warehousing is going to move away from data warehouse appliances and into software-only solutions on top of commodity hardware served from private or public clouds (&lt;a href="http://dbmsmusings.blogspot.com/2009/06/is-betting-on-mysql-mass-market-for.html"&gt;appliances still might be an option in the low-end&lt;/a&gt;).&lt;/li&gt;&lt;li&gt;I also totally agree that data silos and data marts are an unfortunate reality. So you may as well get them into a centralized infrastructure first, and then worry about data modelling later.&lt;/li&gt;&lt;li&gt;I also agree with the self-service nature of the vision.&lt;/li&gt;&lt;li&gt;I wonder what Teradata has to say. This seems very much counter to their centralized "model-first" enterprise data warehouse pitch they've been espousing for years.&lt;/li&gt;&lt;li&gt;I wonder how Oliver Ratzesberger feels about Greenplum essentially rebranding eBay's &lt;a href="http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/"&gt;virtual data mart idea&lt;/a&gt; and claiming it for themselves (though admittedly there is slightly more to the EDC vision than virtual data marts). I agree with Curt Monash when he says that &lt;a href="http://www.dbms2.com/2009/06/08/the-future-of-data-marts/"&gt;you're probably not going to want to copy the data for each new self-service data mart, in which case good workload management is a must&lt;/a&gt;. Teradata is probably the only data warehouse system that already has the workload management needed for the EDC vision. NeoView might also have good enough workload management, but it hasn't been around very long.&lt;/li&gt;&lt;li&gt;I wonder if Greenplum felt a little burned from their experience with their MapReduce announcement. 
In that case, they implemented it, tested it, and then announced it; but unfortunately they then had to share the spotlight with Aster Data which announced a nearly identical in-database MapReduce feature the same day. This time around, they've apparently decided to make the announcement first, and then do the implementation afterwards.&lt;/li&gt;&lt;li&gt;It appears that the only part of the EDC initiative that Greenplum's new version (3.3) has implemented is online data warehouse expansion (you can add a new node and the data warehouse/data mart can incorporate it into the parallel storage/processing without having to go down). All this means is that Greenplum has finally caught up to Aster Data along this dimension. I'd argue that since Aster Data also has a public cloud version and has customers using it there, they're actually farther along the EDC initiative than Greenplum is (Greenplum says that the public cloud availability is on its road map). If I wasn't trying to avoid talking too much about Vertica in this blog (due to a potential bias) I'd go in detail about their virtualized and cloud versions at this point, but I'll stop here.&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;(Note: I am not associated with Aster Data or Greenplum in any way)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8899645800948009496-9080199379566840799?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/9080199379566840799/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2009/06/quick-thoughts-on-greenplum-edc.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/9080199379566840799'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/8899645800948009496/posts/default/9080199379566840799'/><link rel='alternate' type='text/html' href='http://dbmsmusings.blogspot.com/2009/06/quick-thoughts-on-greenplum-edc.html' title='Quick Thoughts on the Greenplum EDC Announcement'/><author><name>Daniel Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-4396766975201422693</id><published>2009-06-07T07:50:00.000-07:00</published><updated>2009-06-07T09:50:27.401-07:00</updated><title type='text'>Is betting on the "MySQL mass market for data warehousing" a good idea?</title><content type='html'>I came across a podcast the other day where host Ken Hess interviewed the CEO of Kickfire, Bruce Armstrong (&lt;a href="http://www.blogtalkradio.com/FrugalFriday/2009/05/22/Frugal-Friday-with-guest-Bruce-Armstrong-CEO-KickFire"&gt;http://www.blogtalkradio.com/FrugalFriday/2009/05/22/Frugal-Friday-with-guest-Bruce-Armstrong-CEO-KickFire&lt;/a&gt; --- note: Armstrong doesn't come on until the 30 minute mark and I suggest skipping to that since the discussion at the 17 minute mark made me a little uncomfortable). Kickfire intrigues me since they are currently at the top of TPC-H for price performance (&lt;a href="http://www.tpc.org/tpch/results/tpch_price_perf_results.asp"&gt;http://www.tpc.org/tpch/results/tpch_price_perf_results.asp&lt;/a&gt;) at the 100 and 300GB data warehouse sizes (admittedly these are pretty small warehouses these days, but Kickfire feels that the market for small data warehouses is nothing to sneeze at). 
Although TPC-H has many faults, it is the best benchmark we have (as far as I know), and I've used it as the benchmark in several of my research papers.&lt;br /&gt;&lt;br /&gt;In order for me to get excited about Kickfire, I have to ignore Mike Stonebraker's voice in my head telling me that DBMS hardware companies have been launched many times in the past and ALWAYS fail (the main reasoning is that Moore's law allows for commodity hardware to catch up in performance, eventually making the proprietary hardware overpriced and irrelevant). But given that Moore's law is transforming into increased parallelism rather than increased raw speed, maybe hardware DBMS companies can succeed now where they have failed in the past (Netezza is a good example of a business succeeding in selling proprietary DBMS hardware, though of course they will tell you that they use all commodity components in their hardware).&lt;br /&gt;&lt;br /&gt;Anyway, the main sales pitch for Kickfire is that they want to do for data warehousing what Nvidia did for graphics processing: sell a specialized chip for data analysis applications currently running MySQL. The basic idea is that you would switch out your Dell box running MySQL with the Kickfire box, and everything else would stay the same, since Kickfire looks to the application like a simple storage engine for MySQL. You would get 100-1000X the performance of MySQL (assuming a standard storage engine like MyISAM or InnoDB) at only about twice the price of the Dell box. And, by the way, they are a column-store, which I'm a huge fan of.&lt;br /&gt;&lt;br /&gt;But at the 50 minute mark of the above mentioned podcast, Armstrong started talking about potentially being acquired by Oracle. 
Although he did use the term "down the road", it struck me as a little weird to start talking about acquisition as such a young startup (it seems to me that if you want to maximize the purchase price, you need to establish yourself in the market before being acquired). It made me start wondering whether things are really going as well as the Kickfire CEO makes it seem. If I remember correctly, they burst onto the scene in 2007 by topping the TPC-H rankings, launched at a MySQL conference somewhere around the middle of 2008, didn't make any customer win announcements for the whole year (as far as I recall), and then relaunched at another MySQL conference in the middle of 2009 (last month) along with (finally) a customer win announcement (Mamasource).&lt;br /&gt;&lt;br /&gt;Maybe Kickfire is doing just fine and I'm reading way too much into the words of the Kickfire CEO. But if not, why would a company with what seems to be a high-quality product be struggling? The conclusion might be: the go-to-market strategy. Kickfire has decided to target the "MySQL data warehousing mass market" and their whole strategy depends on there really being such a market. But do people really use MySQL for their data warehousing needs? My research group's experience with using MySQL to run a data warehousing benchmark for our HadoopDB project (I'll post about that later) was very negative. It didn't seem capable of high performance for the complex joins we needed in our benchmark.&lt;br /&gt;&lt;br /&gt;Meanwhile, Infobright and Calpont have chosen similar go-to-market strategies. I don't have much more knowledge about Calpont than can be found in Curt Monash's blog (e.g., &lt;a href="http://www.dbms2.com/2009/04/20/calpont-update-you-read-it-here-first/"&gt;http://www.dbms2.com/2009/04/20/calpont-update-you-read-it-here-first/&lt;/a&gt;), but I've been hearing about them for years (since they are also a column-store), and I haven't heard about any customer wins from them either. 
Meanwhile Infobright (another column-store that I like, and their technical team --- led by VP of Engineering Victoria Eastwood --- are high quality and were very helpful when my research group played around with Infobright for one of our projects) recently open sourced their software, which is either an act of desperation or their plan all along, depending on who you ask.&lt;br /&gt;&lt;br /&gt;The bottom line is that I'm having doubts about whether there really is a MySQL data warehousing mass market. I know this blog is still very young and does not have many readers, so there are unlikely to be any comments, but if you do have thoughts on this subject, I'd be interested to hear them.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8899645800948009496-4396766975201422693?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/4396766975201422693/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2009/06/is-betting-on-mysql-mass-market-for.html#comment-form' title='13 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/4396766975201422693'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/4396766975201422693'/><link rel='alternate' type='text/html' href='http://dbmsmusings.blogspot.com/2009/06/is-betting-on-mysql-mass-market-for.html' title='Is betting on the &quot;MySQL mass market for data warehousing&quot; a good idea?'/><author><name>Daniel Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>13</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8899645800948009496.post-5114313128754569688</id><published>2009-06-01T10:43:00.000-07:00</published><updated>2009-06-01T11:33:11.721-07:00</updated><title type='text'>About: DBMS Musings Blog</title><content type='html'>I've been inspired by &lt;a href="http://databeta.wordpress.com/"&gt;Joe Hellerstein's blog&lt;/a&gt;, &lt;a href="http://perspectives.mvdirona.com/"&gt;James Hamilton's blog&lt;/a&gt;, and &lt;a href="http://www.dbms2.com/"&gt;Curt Monash's &lt;/a&gt;blog, and have decided to start one of my own. The goal of this blog is to:&lt;br /&gt;&lt;br /&gt;(1) Describe some of my research in small, easy to digest entries (as an alternative to some of the more in-depth research papers that I write).&lt;br /&gt;&lt;br /&gt;(2) Present my views on various developments in the DBMS field, both in research and in industry. My particular areas of expertise are in data warehousing/analytics, parallel databases, and cloud computing. I wrote my &lt;a href="http://cs-www.cs.yale.edu/homes/dna/papers/abadiphd.pdf"&gt;PhD dissertation &lt;/a&gt;on query execution in column-store databases and I'm reasonably knowledgeable about the commercial products in this space, including Calpont, Exasol, Infobright, Kickfire, ParAccel, Sensage, Sybase IQ, and Vertica. 
Due to my research history with column-stores, my bias is probably a little bit in their favor, but there are also some non-column-oriented products in the same data analytics space, such as Aster Data, Dataupia, DB2, Exadata, Greenplum, Kognitio, Microsoft's Project Madison, NeoView, Netezza, and Teradata, that have interesting approaches to database management and will receive some attention in this blog.&lt;br /&gt;&lt;br /&gt;(3) Comment on and recommend some papers that I read over the course of my everyday activities as a DBMS researcher at Yale University.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8899645800948009496-5114313128754569688?l=dbmsmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbmsmusings.blogspot.com/feeds/5114313128754569688/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbmsmusings.blogspot.com/2009/06/about-dbms-musings-blog.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/5114313128754569688'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8899645800948009496/posts/default/5114313128754569688'/><link rel='alternate' type='text/html' href='http://dbmsmusings.blogspot.com/2009/06/about-dbms-musings-blog.html' title='About: DBMS Musings Blog'/><author><name>Daniel Abadi</name><uri>http://www.blogger.com/profile/16753133043157018521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry></feed>
