Tuesday, August 31, 2010

The problems with ACID, and how to fix them without going NoSQL

(This post is coauthored by Alexander Thomson and Daniel Abadi)

It is a poorly kept secret that NoSQL is not really about eliminating SQL from database systems (e.g., see these links). Rather, systems such as Bigtable, HBase, Hypertable, Cassandra, Dynamo, SimpleDB (and a host of other key-value stores), PNUTS/Sherpa, etc. are mostly concerned with system scalability. It turns out to be quite difficult to scale traditional, ACID-compliant relational database systems on cheap, shared-nothing scale-out architectures, and thus these systems drop some of the ACID guarantees in order to achieve shared-nothing scalability (letting the application developer handle the increased complexity that programming over a non-ACID compliant system entails). In other words, NoSQL really means NoACID.

Our objective in this post is to explain why ACID is hard to scale. At the same time, we argue that NoSQL/NoACID is the lazy way around these difficulties---it would be better if the particular problems that make ACID hard to scale could be overcome. This is obviously a hard problem, but we have a few new ideas about where to begin.

ACID, scalability and replication

For large transactional applications, it is well known that scaling out on commodity hardware is far cheaper than scaling up on high-end servers. Most of the largest transactional applications therefore use a shared-nothing architecture where data is divided across many machines and each transaction is executed at the appropriate one(s).

The problem is that if a transaction accesses data that is split across multiple physical machines, guaranteeing the traditional ACID properties becomes increasingly complex: ACID's atomicity guarantee requires a distributed commit protocol (such as two-phase commit) across the multiple machines involved in the transaction, and its isolation guarantee insists that the transaction hold all of its locks for the full duration of that protocol. Since many of today's OLTP workloads are composed of fairly lightweight transactions (each involving less than 10 microseconds of actual work), tacking a couple of network round trips onto every distributed transaction can easily mean that locks are held for orders of magnitude longer than the time each transaction really spends updating its locked data items. This can result in skyrocketing lock contention between transactions, which can severely limit transactional throughput.
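
To make the arithmetic concrete, here is a minimal back-of-envelope sketch in Python; the round-trip latency and the number of round trips are assumptions chosen for illustration, not figures from any particular system:

```python
# Back-of-envelope sketch; the 0.5 ms round-trip time and two commit-protocol
# round trips are assumed figures for illustration.

work_time_us = 10           # actual work per transaction (~10 microseconds)
network_rtt_us = 500        # one network round trip (assumed ~0.5 ms)
two_pc_round_trips = 2      # prepare phase + commit phase

local_hold_us = work_time_us
distributed_hold_us = work_time_us + two_pc_round_trips * network_rtt_us

print(f"local transaction holds locks for       ~{local_hold_us} us")
print(f"distributed transaction holds locks for ~{distributed_hold_us} us")
print(f"locks held ~{distributed_hold_us / local_hold_us:.0f}x longer")
# Lock hold times roughly two orders of magnitude longer than the useful
# work are what drive the contention described above.
```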

In addition, high availability is becoming ever more crucial in scalable transactional database systems, and is typically accomplished via replication and automatic fail-over in the case of a crash. The developer community has therefore come to expect ACID's consistency guarantee (originally promising local adherence to user-specified invariants) to also imply strong consistency between replicas (i.e. replicas are identical copies of one another, as in the CAP/PACELC sense of the word consistency).

Unfortunately, strongly consistent replication schemes either come with high overhead or incur undesirable tradeoffs. Early approaches to strongly consistent replication attempted to synchronize replicas during transaction execution. Replicas executed transactions in parallel, but implemented some protocol to ensure agreement about any change in database state before committing any transaction. Because of the latency involved in such protocols (and due to the same contention issue discussed above in relation to scalability), synchronized active replication is seldom used in practice today.

Today's solution is usually post-write replication, where each transaction is executed first at some primary replica, and updates are propagated to other replicas after the fact. Basic master-slave/log-shipping replication is the simplest example of post-write replication, although schemes which first execute each transaction at one of multiple possible masters also fall under this category. In addition to the possibility of stale reads at slave replicas, these systems suffer a fundamental latency-durability-consistency tradeoff: either a primary replica waits to commit each transaction until receiving acknowledgement of sufficient replication, or it commits upon completing the transaction. In the latter case, if the primary replica fails, its in-flight transactions are either lost outright (threatening durability) or recovered only once the failed node comes back up, while transactions executed on other replicas in the meantime threaten consistency.
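
To illustrate the two sides of that tradeoff, here is a minimal sketch; the Replica class and its log-shipping calls are invented for illustration, not any real system's API:

```python
# Minimal sketch of the post-write replication tradeoff; the Replica class
# and the log format are invented for illustration.

class Replica:
    def __init__(self):
        self.log = []

    def append_log(self, record):        # synchronous: returns once applied
        self.log.append(record)
        return True

    def append_log_async(self, record):  # fire-and-forget; may be lost on crash
        self.log.append(record)

def commit(record, replicas, wait_for_acks):
    if wait_for_acks:
        # Primary blocks until sufficient replicas acknowledge: durable and
        # consistent, but every commit pays the replication round trip.
        acks = [r.append_log(record) for r in replicas]
        assert all(acks)
        return "committed after replication"
    # Primary commits as soon as its local work is done: low latency, but if
    # it fails before the log ships, the transaction is lost (durability) or
    # resurfaces later alongside work done on other replicas (consistency).
    for r in replicas:
        r.append_log_async(record)
    return "committed, replication in flight"

print(commit({"txn": 1}, [Replica(), Replica()], wait_for_acks=True))
print(commit({"txn": 2}, [Replica(), Replica()], wait_for_acks=False))
```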

In summary, it is really hard to guarantee ACID across scalable, highly available, shared-nothing systems due to complex, high-overhead commit protocols and difficult tradeoffs in the available replication schemes.

The NoACID solution

Designers of NoSQL systems, aware of these issues, carefully relax some ACID guarantees in order to achieve scalability and high availability. There are two ways that ACID is typically weakened. First, systems like Bigtable, SQL Azure, sharded MySQL, and key-value stores support atomicity and isolation only when each transaction accesses data within some convenient subset of the database (a single tuple in Bigtable and key-value stores, or a single database partition in SQL Azure and sharded MySQL). This eliminates the need for expensive distributed commit protocols, but at a cost: any logical transaction which spans more than one of these subsets must be broken up at the application level into separate transactions; the system therefore guarantees neither atomicity nor isolation with respect to arbitrary logical transactions. In the end, the programmer must implement any additional ACID functionality at the application level.
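
As a concrete illustration of what this pushes onto the developer, consider the following hypothetical sketch; the partition and transaction API here is invented for illustration, not that of any particular system:

```python
# Hypothetical sketch (invented client API) of what pushing atomicity onto
# the application looks like when a store is only transactional within a
# single partition: one logical "transfer" becomes two independent commits.

class Partition:
    def __init__(self):
        self.data = {}

    def txn(self):                 # single-partition transaction, atomic on its own
        return self

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False               # commit (or propagate the error) on exit

    def read(self, key):
        return self.data.get(key, 0)

    def write(self, key, value):
        self.data[key] = value

def transfer(src_part, dst_part, src, dst, amount):
    # Transaction 1: debit on the source partition.
    with src_part.txn() as t:
        balance = t.read(src)
        if balance < amount:
            raise ValueError("insufficient funds")
        t.write(src, balance - amount)

    # If the application crashes HERE, the debit has committed but the credit
    # has not. The system guarantees nothing across partitions, so the
    # developer must write their own compensation / retry logic.

    # Transaction 2: credit on the destination partition.
    with dst_part.txn() as t:
        t.write(dst, t.read(dst) + amount)

a, b = Partition(), Partition()
a.write("alice", 500)
transfer(a, b, "alice", "bob", 200)
print(a.read("alice"), b.read("bob"))   # 300 200
```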

Second, lazy replication schemes such as eventual consistency sacrifice strong consistency to get around the tradeoffs of post-write replication (while also allowing for high availability in the presence of network partitions, as specified in the CAP theorem). Except with regard to some well-known and much-publicized Web 2.0 applications, losing consistency at all times (regardless of whether a network partition is actually occurring) is too steep a price to pay in terms of complexity for the application developer or experience for the end-user.

Fixing ACID without going NoSQL

In our opinion, the NoSQL decision to give up on ACID is the lazy solution to these scalability and replication issues. Responsibility for atomicity, consistency and isolation is simply being pushed onto the developer. What is really needed is a way for ACID systems to scale on shared-nothing architectures, and that is what we address in the research paper that we will present at VLDB this month.

Our view (and yes, this may seem counterintuitive at first) is that the problem with ACID is not that its guarantees are too strong (and that therefore scaling these guarantees in a shared-nothing cluster of machines is too hard), but rather that its guarantees are too weak, and that this weakness is hindering scalability.

The root of these problems lies in the isolation property within ACID. In particular, the serializability property (which is the standard isolation level for fully ACID systems) guarantees that execution of a set of transactions occurs in a manner equivalent to some sequential, non-concurrent execution of those transactions, even if what actually happens under the hood is highly threaded and parallelized. So if three transactions (let's call them A, B and C) are active at the same time on an ACID system, it will guarantee that the resulting database state will be the same as if it had run them one-by-one. No promises are made, however, about which particular serial order the execution will be equivalent to: A-B-C, B-A-C, A-C-B, etc.

This obviously causes problems for replication. If a set of (potentially non-commutative) transactions is sent to two replicas of the same system, the two replicas might each execute the transactions in a manner equivalent to a different serial order, allowing the replicas' states to diverge.
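
A toy example makes the divergence concrete; the two transactions below are made up for illustration:

```python
# Two non-commutative transactions applied to two replicas in different
# (but individually serializable) orders cause the replicas to diverge.

def add_interest(balance):    # transaction A: balance *= 1.05
    return balance * 1.05

def deposit_100(balance):     # transaction B: balance += 100
    return balance + 100

initial_balance = 1000

replica_1 = deposit_100(add_interest(initial_balance))   # order A then B -> 1150.0
replica_2 = add_interest(deposit_100(initial_balance))   # order B then A -> 1155.0

print(replica_1, replica_2)
assert replica_1 != replica_2   # the replicas are no longer identical copies
```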

More generally, most of the intra- and inter-replica information exchange that forms the basis of the scalability and replication woes of ACID systems described above occurs when disparate nodes in the system have to forge agreement about (a) which transactions should be executed, (b) which will end up being committed, and (c) with equivalence to which serial order.

If the isolation property were strengthened to guarantee equivalence to a predetermined serial order (while still allowing high levels of concurrency), and if a layer were added to the system which accepts transaction requests, decides on a universal order, and sends the ordered requests to all replicas, then problems (a) and (c) would be eliminated. If the system were also stripped of the right to arbitrarily abort transactions (system aborts typically occur for reasons such as node failure and deadlock), then problem (b) would be eliminated as well.
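
A minimal sketch of that added layer is shown below; this is an illustration of the general idea, not the implementation from our paper, and the class and method names are invented:

```python
# Minimal sketch: a sequencer assigns every transaction request a position
# in a universal order and forwards the identical ordered log to all
# replicas. Each replica may execute concurrently under the hood, but its
# result must be equivalent to running the log in exactly this order, and
# it may not unilaterally abort transactions.

import itertools

class Replica:
    def __init__(self):
        self.ordered_log = []

    def enqueue(self, seq_no, txn_request):
        self.ordered_log.append((seq_no, txn_request))

class Sequencer:
    def __init__(self, replicas):
        self.replicas = replicas
        self.counter = itertools.count()

    def submit(self, txn_request):
        seq_no = next(self.counter)          # the predetermined serial order
        for replica in self.replicas:
            replica.enqueue(seq_no, txn_request)

replicas = [Replica(), Replica()]
sequencer = Sequencer(replicas)
sequencer.submit("T1: transfer(A, B, 50)")
sequencer.submit("T2: add_interest(B)")
# Every replica now holds the same ordered log, eliminating disagreement
# about which transactions run and in what (equivalent) serial order.
print(replicas[0].ordered_log == replicas[1].ordered_log)   # True
```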

This kind of strengthening of isolation introduces new challenges (such as deadlock avoidance, dealing with failures without aborting transactions, and allowing highly concurrent execution without any on-the-fly transaction reordering), but also results in a very interesting property: given an initial database state and a sequence of transaction requests, there exists only one valid final state. In other words, determinism.

The repercussions of a deterministic system are broad, but one advantage is immediately clear: active replication is trivial, strongly consistent, and suffers none of the drawbacks described above. There are some less obvious advantages too. For example, the need for distributed commit protocols in multi-node transactions is eliminated, which is a critical step towards scalability. (Why distributed commit protocols can be omitted in distributed systems is non-obvious, and will be discussed in a future blog post; the topic is also addressed at length in our paper.)

A deterministic DBMS prototype

In our paper, entitled “The Case for Determinism in Database Systems”, we propose an architecture and execution model that avoids deadlock, copes with failures without aborting transactions, and achieves high concurrency. The paper contains full details, but the basic idea is to use ordered locking coupled with optimistic lock location prediction, while exploiting deterministic systems' nice replication properties in the case of failures.
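
To give a flavor of the ordered-locking component (the lock location prediction and failure handling are not shown), here is a minimal sketch; it illustrates the general idea only, and is not the prototype's code:

```python
# Minimal sketch of ordered locking: lock requests are enqueued in the
# global (predetermined) transaction order, and a transaction runs only
# once it is at the head of the queue for every record it needs. Because
# no transaction ever waits on one ordered after it, deadlock is impossible.

from collections import defaultdict, deque

class OrderedLockManager:
    def __init__(self):
        self.queues = defaultdict(deque)      # record key -> queue of txn ids

    def request(self, txn_id, records):
        # Must be called in global order (txn_id strictly increasing).
        for r in records:
            self.queues[r].append(txn_id)

    def ready(self, txn_id, records):
        # The transaction can execute once it heads every relevant queue.
        return all(self.queues[r][0] == txn_id for r in records)

    def release(self, txn_id, records):
        for r in records:
            assert self.queues[r][0] == txn_id
            self.queues[r].popleft()

lm = OrderedLockManager()
lm.request(1, ["x", "y"])
lm.request(2, ["y", "z"])
print(lm.ready(1, ["x", "y"]), lm.ready(2, ["y", "z"]))   # True False
lm.release(1, ["x", "y"])
print(lm.ready(2, ["y", "z"]))                            # True
```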

We go on in the paper to present measurements and analyses of the performance characteristics of a fully ACID deterministic database prototype based on our execution model, which we implemented alongside a standard (nondeterministic) two-phase locking system for comparison. It turns out that the deterministic scheme performs horribly in disk-based environments, but that as transactions get shorter and less variable in length (thanks to the introduction of flash and the ever-plummeting cost of memory) our scheme becomes more viable. Running the prototype on modern hardware, deterministic execution keeps up with the traditional system implementation on the TPC-C benchmark, and actually shows drastically more throughput and scalability than the nondeterministic system when the frequency of multi-partition transactions increases.

Our prototype system is currently being reworked and extended to include several optimizations which appear to be unique to explicitly deterministic systems (see the Future Work section in our paper's appendix for details), and we look forward to releasing a stable codebase to the community in the coming months, in hopes that it will spur further dialogue and research on deterministic systems and on the scalability of ACID systems in general.

Wednesday, August 11, 2010

Defending Oracle Exadata

I recently came across a whitepaper from Teradata, written by a senior consultant for Teradata, Richard Burns. It is a very well-written piece, and contains one of the best overviews of Exadata I’ve seen. I did not notice any obvious inaccuracies in the description of Exadata itself, and even the anti-Exadata arguments (presented after the overview), though at times biased and misleading, do not have many clear factual errors. Hence, it is quite a professionally done whitepaper, even though it is devoted to attacking a competitor. Reading it will probably make you smarter.

That said, even though the facts are more or less correct, the inferences that are made from these facts are certainly up for debate, and I feel an urge to defend Exadata against some of these allegations, even though I have no personal stake in either side of the Exadata-Teradata feud.

I will not overview Exadata again in this blog post. If you are not familiar with Exadata, I encourage you to read the overview in the Teradata whitepaper, or from Oracle’s own marketing material. I have covered the columnar compression feature separately in this blog. Hence, I will jump to the arguments that Teradata makes against Exadata, and respond to each one in turn:

“Exadata is NOT Intelligent Storage; Exadata is NOT Shared-Nothing”

Teradata argues that since the Exadata storage only performs selections, projections, and some form of basic joins, and the rest of the query must be performed in the Oracle database server sitting above Exadata storage (which is typically Oracle RAC), the architecture is a whole lot closer to shared-disk than shared-nothing. Factually, this is correct. Exadata storage is indeed shared-nothing, but since only very basic query operations are performed there, it is fair to view the system as Oracle RAC treating Exadata storage as a shared disk.

However, it is one thing to point out that Exadata is closer to shared-disk than shared-nothing, but quite another thing to claim that as a result of this, “Exadata does nothing to reduce or eliminate the structural contention for shared resources that fundamentally limits the scalability –of data, users, and workload – of Oracle data warehouses.” This statement is incorrect and unfair. Yes, it is true that contention for shared resources is the source of scalability problems in analytical database systems, and that this is why shared-nothing is widely believed to scale the best (because each compute node contains its own CPU, memory and disk, so neither disk nor memory is shared across the cluster). But shared-disk is very similar to shared-nothing. The only difference is the shared access to the disk storage system. If you are going to make an argument that shared-disk causes scalability problems, you have to make the argument that contention for the one shared resource in a shared-disk system is high enough to cause a performance bottleneck in the system --- namely, you have to argue that the network connection between the servers and the shared-disk is a bottleneck.

At no point in the entire 18-page whitepaper did Teradata make the argument that the Infiniband connection between Exadata storage and the database servers is a bottleneck. Furthermore, even if you believe that there is a bottleneck in this connection, you still must admit that by doing some of the filtering in the Exadata storage layer, some of this bottleneck is alleviated. Hence, it is entirely inaccurate to say “Exadata does nothing to reduce or eliminate the structural contention for shared resources ….” --- at the very least it does something by doing this filtering.

In fact, I think that the scalability differences between shared-nothing and shared-disk are overblown (I’m not arguing that they scale equally well, just that the gap between them is not as large as people think; even if filtering is not pushed down to the shared-disk like in Exadata). This was eloquently explained by Luke Lonergan from Greenplum at the UW/MSR retreat on cloud computing in the first week of August. In essence, he argued that thanks to 10 Gigabit Ethernet, Fibre Channel over Ethernet, and a general flattening of the network into two-tier switching designs, it takes an enormous number of disks to cause the network to become a bottleneck. Furthermore, with 40 Gigabit Ethernet around the corner, and 100 Gigabit Ethernet on its way, the network is becoming even less of a bottleneck. And by the way, shared-disk has a variety of advantages that shared-nothing does not, including the ability to move around virtual machines executing database operators for improved load balancing and fault tolerance. (This would be a good time to point out that Greenplum was recently acquired by EMC, so one obviously has to be aware of pro-shared-disk bias, but I found the argument quite compelling.)

Teradata also attempts to argue that the striping of data across all available disks on every Exadata cell (using Oracle’s Automatic Storage Manager, ASM) causes “potential contention on each disk for disk head location and I/O bandwidth” when many DBMS workers running in parallel for the same query request data from the same set of disks. However, the paper does not point out until later that Exadata defaults to 4MB chunks; blocks of that size should easily amortize the disk seek costs across multiple worker requests.
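
Some rough arithmetic supports this point; the seek time and per-disk bandwidth figures below are assumptions for illustration, not numbers from the whitepaper:

```python
# Rough arithmetic with assumed disk figures (for illustration only).
seek_ms = 8.0                 # one random seek + rotational delay
bandwidth_mb_per_s = 120.0    # sequential transfer rate of one disk
chunk_mb = 4.0                # Exadata's default ASM allocation unit

transfer_ms = chunk_mb / bandwidth_mb_per_s * 1000
seek_overhead = seek_ms / (seek_ms + transfer_ms)

print(f"transfer time per 4MB chunk: {transfer_ms:.0f} ms")    # ~33 ms
print(f"seek overhead per request:   {seek_overhead:.0%}")     # ~19%
# Even with many parallel workers interleaving requests, the disk spends
# roughly 80% of its time transferring data rather than seeking.
```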

Exadata does NOT Enable High Concurrency

If I were to summarize this section, I would say that Teradata is basically arguing that the default block size in Exadata is too large, and that this reduces concurrency, since 30 concurrent I/Os per second * 4MB block size saturates a disk’s maximum 120MB/sec bandwidth. This argument is clouded by the fact that for scan-based workloads, any database system would have the same concurrency limits. Hence, the only place where Exadata’s large block size would reduce concurrency relative to alternative systems (such as Teradata) is non-scan-based workloads with a lot of tuple lookups and random I/O. Oracle would argue that its memory and flash cache take care of the tuple lookups and needle-in-a-haystack type queries. Teradata, in turn, argues that the cache is far too small to completely take care of this problem. This argument is reasonable, but I do believe that the bulk of a data warehouse consists of historical data, and that tuple lookups and non-scan-based queries touch only a much smaller portion of the more recent data, so caching should do a decent job. But this is definitely a “your mileage will vary” type argument, with the effectiveness of the cache highly dependent on the particular data set and workload.

Exadata does NOT Support Active Data Warehousing

Teradata points out that the process of checking which version of the data is the correct version to return for a particular query (part of Oracle’s MVCC concurrency control scheme) is performed inside the database servers, and so the Exadata storage cannot do its typical filtering of the data in the storage layer for actively updated tables (since there might be multiple potentially correct versions of the data to return). Teradata therefore argues: “While Exadata still performs parallel I/O for the query, the largest benefit provided by Exadata, early filtering of columns and rows meeting the query specification, which may drastically reduce query data volume, is not useful for tables or partitions being actively updated.” While this might be true, the impact of this issue is overstated. The percentage of data that is actively updated is typically a tiny fraction of the total data set, and the historical data, along with data that has been recently appended (but not updated), will not suffer from this problem. Hence, having less than optimal I/O performance for this tiny fraction of the data is not a big deal.

Exadata does NOT Provide Superior Query Performance

Teradata points out that because Exadata can only perform basic operations in the storage layer (selection, projection, some simple joins), more and more of a query is performed in the database server, rather than in Exadata storage, as the query gets more complex. Teradata gives an example of a simple query, where Oracle can perform 28% of the query steps inside Exadata, and a more complex one, where Oracle can perform only 22% of the query steps inside Exadata. Again, this is factually correct, but it is misleading to assume the speedup you get from Exadata is linearly correlated with the percentage of steps that are performed within Exadata. For database performance, it’s all about bottlenecks, and eliminating a bottleneck can have a disproportionate effect on query performance. In scan-based workloads, disk I/O is often the bottleneck, and Exadata alleviates this bottleneck equally well for both simple and complex queries. Hence, while the benefit of Exadata does decrease for complex queries, it is misleading to assume that this benefit decreases linearly with complexity.
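
An Amdahl's-law-style sketch illustrates the point; the fractions below are invented for illustration, not taken from either vendor's numbers:

```python
# Illustrative only: speedup depends on the fraction of query *time* spent
# on the bottleneck (disk I/O), not the fraction of plan *steps* offloaded.

def speedup(io_fraction, io_reduction):
    # Amdahl-style: non-I/O work is unchanged, I/O work shrinks by io_reduction.
    return 1.0 / ((1.0 - io_fraction) + io_fraction / io_reduction)

# Suppose scan I/O dominates both the simple and the complex query (say 80%
# of elapsed time) and Exadata's early filtering cuts the data shipped to
# the database servers by 10x.
print(f"{speedup(io_fraction=0.80, io_reduction=10):.1f}x")   # ~3.6x
# The complex query runs a smaller share of its plan steps inside Exadata,
# but if its scans are filtered just as aggressively, it sees a similar
# speedup -- the benefit does not fall off linearly with plan complexity.
```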

Exadata is Complex; Exadata is Expensive

It is hard to argue with these points. However, it is amusing to note that Teradata is willing to point out in this section that they are only (approximately) 11% cheaper than Oracle, showing numbers such as Teradata costing $194K per terabyte. Both Oracle and Teradata are too expensive for large parts of the analytical database market.

Conclusion

The truth, as is usually the case, is somewhere in the middle, between the claims of Oracle and Teradata. Teradata is probably right when it asserts that “Exadata is far from the groundbreaking innovation that Oracle claims”, and that Oracle “throws a lot of hardware” at problems that are solvable in software, but many of the claims and inferences made in the paper about Exadata are overstated, and the reader needs to be careful not to be misled into believing in the existence of problems that don’t actually present themselves on realistic datasets and workloads.

Monday, August 2, 2010

Thoughts on Kickfire’s apparent demise

There have been some recent conflicting reports on the future prospects of Kickfire’s analytical database technology. Forbes reported a couple of months ago that Kickfire sold $5 million worth of boxes in their first year of existence (they launched their product in April 2009), and was extremely positive about Kickfire’s outlook. Then, a couple of months later, Curt Monash reported that Kickfire was discontinuing their product, and selling their IP and engineers. Obviously, Monash is the more reliable source here (and I independently heard a rumor from a reputable source that Teradata was acquiring Kickfire at a “firesale” price).

In my interactions with the company, I have been impressed with Raj Cherabuddi and members of the Kickfire technical team, and it is always sad when good technology fails to gain any traction in the marketplace. I’m also sad (though obviously to a lesser extent) to see the many thousands of words I have written about Kickfire in this blog (including an in-depth post on their technology) become largely obsolete. In fact, one of the first posts I wrote for this blog was on the subject of Kickfire --- a mostly positive post --- but questioning their go-to-market strategy. In that post, I took issue with the assumption that there is a “mass market” for data warehousing in the MySQL ecosystem, especially for a proprietary approach like Kickfire's. The CEO of Kickfire kindly took the time to respond to this original post, quoting a bunch of IDC numbers about the size of the MySQL data warehouse market. I chose to respond to this comment in a separate post, in which I said (amongst other things):

"The point of my post was that I think the [MySQL data warehouse] market is smaller than these [IDC] numbers indicate. Sure, there are a lot of MySQL deployments, but that's because it's free. The number of people actually paying for the MySQL Enterprise Edition is far less, but those are probably the people who'd be willing to pay for a solution like Kickfire's. Furthermore, […] a lot of people who use MySQL for warehousing are using sharded MySQL, which is nontrivial (or at least not cheap) to port to non-shared-nothing solutions like Kickfire and Infobright. Finally, the amount of data that corporations are keeping around is increasing rapidly, and the size of data warehouses are doubling faster than Moore's law. So even if most warehouses today are pretty small, this might not be the case in the future. I'm a strong believer that MPP shared-nothing parallel solutions are the right answer for the mass market of tomorrow. Anyway, the bottom line is that I'm openly wondering if the market is actually much smaller than the IDC numbers would seem to suggest. But obviously, if Kickfire, Infobright, or Calpont achieves a large amount of success without changing their market strategy, I'll be proven incorrect."


I think the above paragraph lists two of the three most probable reasons why Kickfire seems to have failed:

(1) Building a proprietary database stack of hardware and software around a MySQL codebase that attributes much of its success to being open and free is a poor cultural match.

(2) Trying to make it in the "Big Data Era" without a scalable MPP product is a recipe for disaster. It is well known that over 95% of data warehouses are smaller than 5TB, and that MPP is not strictly necessary below 5TB, so it is easy to fall into the trap, as Kickfire did, of thinking that the mass market is addressable without building an MPP product. However, businesses are looking forward, seeing much more data in their future (whether this is wishful or realistic thinking is entirely irrelevant), and can often be reluctant to select a product with known scalability limits.

(The third probable reason why Kickfire might have failed is its TPC-H benchmark orientation. It is easy to spend a lot of time getting an analytical database to run TPC-H --- even optimizing it for TPC-H --- before realizing that the marketing benefit of stellar TPC-H numbers does not justify the time investment, and that many of the features added for TPC-H are not actually used by real-life customers.)

It is tempting to add a fourth reason for Kickfire’s demise --- the long list of failed hardware-accelerated DBMS companies and Kickfire’s obvious inclusion in this group. However, I believe that Netezza’s success demonstrates the potential benefits of hardware acceleration and the appliance approach in the modern era, where the rate of performance improvement with each successive processor generation is slowing significantly.

Anyway, RIP Kickfire (assuming the rumors are correct). Good technology. Bad go-to-market strategy. Tough fit for the “Big Data” era.