Tuesday, August 31, 2010

The problems with ACID, and how to fix them without going NoSQL

(This post is coauthored by Alexander Thomson and Daniel Abadi)

It is a poorly kept secret that NoSQL is not really about eliminating SQL from database systems (e.g., see these links). Rather, systems such as Bigtable, HBase, Hypertable, Cassandra, Dynamo, SimpleDB (and a host of other key-value stores), PNUTS/Sherpa, etc. are mostly concerned with system scalability. It turns out to be quite difficult to scale traditional, ACID-compliant relational database systems on cheap, shared-nothing scale-out architectures, and thus these systems drop some of the ACID guarantees in order to achieve shared-nothing scalability (letting the application developer handle the increased complexity that programming over a non-ACID compliant system entails). In other words, NoSQL really means NoACID.

Our objective in this post is to explain why ACID is hard to scale. At the same time, we argue that NoSQL/NoACID is the lazy way around these difficulties---it would be better if the particular problems that make ACID hard to scale could be overcome. This is obviously a hard problem, but we have a few new ideas about where to begin.

ACID, scalability and replication

For large transactional applications, it is well known that scaling out on commodity hardware is far cheaper than scaling up on high-end servers. Most of the largest transactional applications therefore use a shared-nothing architecture where data is divided across many machines and each transaction is executed at the appropriate one(s).

The problem is that if a transaction accesses data that is split across multiple physical machines, guaranteeing the traditional ACID properties becomes increasingly complex: ACID's atomicity guarantee requires a distributed commit protocol (such as two-phase commit) across the multiple machines involved in the transaction, and its isolation guarantee insists that the transaction hold all of its locks for the full duration of that protocol. Since many of today's OLTP workloads are composed of fairly lightweight transactions (each involving less than 10 microseconds of actual work), tacking a couple of network round trips onto every distributed transaction can easily mean that locks are held for orders of magnitude longer than the time each transaction really spends updating its locked data items. This can result in skyrocketing lock contention between transactions, which can severely limit transactional throughput.

In addition, high availability is becoming ever more crucial in scalable transactional database systems, and is typically accomplished via replication and automatic fail-over in the case of a crash. The developer community has therefore come to expect ACID's consistency guarantee (originally promising local adherence to user-specified invariants) to also imply strong consistency between replicas (i.e. replicas are identical copies of one other, as in the CAP/PACELC sense of the word consistency).

Unfortunately, strongly consistent replication schemes either come with high overhead or incur undesirable tradeoffs. Early approaches to strongly consistent replication attempted to synchronize replicas during transaction execution. Replicas executed transactions in parallel, but implemented some protocol to ensure agreement about any change in database state before committing any transaction. Because of the latency involved in such protocols (and due to the same contention issue discussed above in relation to scalability), synchronized active replication is seldom used in practice today.

Today's solution is usually post-write replication, where each transaction is executed first at some primary replica, and updates are propagated to other replicas after the fact. Basic master-slave/log-shipping replication is the simplest example of post-write replication, although other schemes which first execute each transaction at one of multiple possible masters fall under this category. In addition to the possibility of stale reads at slave replicas, these systems suffer a fundamental latency-durability-consistency tradeoff: either a primary replica waits to commit each transaction until receiving acknowledgement of sufficient replication, or it commits upon completing the transaction. In the latter case, either in-flight transactions are lost upon failure of the primary replica, threatening durability, or they are retrieved only after the failed node has recovered, while transactions executed on other replicas in the meantime threaten consistency in the event of a failure.

In summary, it is really hard to guarantee ACID across scalable, highly available, shared-nothing systems due to complex and high overhead commit protocols, and difficult tradeoffs in available replication schemes.

The NoACID solution

Designers of NoSQL systems, aware of these issues, carefully relax some ACID guarantees in order to achieve scalability and high availability. There are two ways that ACID is typically weakened. First, systems like Bigtable, SQL Azure, sharded MySQL, and key-value stores support atomicity and isolation only when each transaction only accesses data within some convenient subset of the database (a single tuple in Bigtable and KV stores, or a single database partition in SQL Azure and sharded MySQL). This eliminates the need for expensive distributed commit protocols, but at a cost: Any logical transaction which spans more than one of these subsets must be broken up at the application level into separate transactions; the system therefore guarantees neither atomicity nor isolation with respect to arbitrary logical transactions. In the end, the programmer must therefore implement any additional ACID functionality at the application level.

Second, lazy replication schemes such as eventual consistency sacrifice strong consistency to get around the tradeoffs of post-write replication (while also allowing for high availability in the presence of network partitions, as specified in the CAP theorem). Except with regard to some well-known and much-publicized Web 2.0 applications, losing consistency at all times (regardless of whether a network partition is actually occurring) is too steep a price to pay in terms of complexity for the application developer or experience for the end-user.

Fixing ACID without going NoSQL

In our opinion, the NoSQL decision to give up on ACID is the lazy solution to these scalability and replication issues. Responsibility for atomicity, consistency and isolation is simply being pushed onto the developer. What is really needed is a way for ACID systems to scale on shared-nothing architectures, and that is what we address in the research paper that we will present at VLDB this month.

Our view (and yes, this may seem counterintuitive at first), is that the problem with ACID is not that its guarantees are too strong (and that therefore scaling these guarantees in a shared-nothing cluster of machines is too hard), but rather that its guarantees are too weak, and that this weakness is hindering scalability.

The root of these problems lies in the isolation property within ACID. In particular, the serializability property (which is the standard isolation level for fully ACID systems) guarantees that execution of a set of transactions occurs in a manner equivalent to some sequential, non-concurrent execution of those transactions, even if what actually happens under the hood is highly threaded and parallelized. So if three transactions (let's call them A, B and C) are active at the same time on an ACID system, it will guarantee that the resulting database state will be the same as if it had run them one-by-one. No promises are made, however, about which particular order execution it will be equivalent to: A-B-C, B-A-C, A-C-B, etc.

This obviously causes problems for replication. If a set of (potentially non-commutative) transactions is sent to two replicas of the same system, the two replicas might each execute the transactions in a manner equivalent to a different serial order, allowing the replicas' states to diverge.

More generally, most of the intra- and inter-replica information exchange that forms the basis of the scalability and replication woes of ACID systems described above occurs when disparate nodes in the system have to forge agreement about (a) which transactions should be executed, (b) which will end up being committed, and (c) with equivalence to which serial order.

If the isolation property were to be strengthened to guarantee equivalence to a predetermined serial order (while still allowing high levels of concurrency), and if a layer were added to the system which accepts transaction requests, decides on a universal order, and sends the ordered requests to all replicas, then problems (a) and (c) are eliminated. If the system is also stripped of the right to arbitrarily abort transactions (system aborts typically occur for reasons such as node failure and deadlock), then problem (b) is also eliminated.

This kind of strengthening of isolation introduces new challenges (such as deadlock avoidance, dealing with failures without aborting transactions, and allowing highly concurrent execution without any on-the-fly transaction reordering), but also results in a very interesting property: given an initial database state and a sequence of transaction requests, there exists only one valid final state. In other words, determinism.

The repercussions of a deterministic system are broad, but one advantage is immediately clear: active replication is trivial, strongly consistent, and suffers none of the drawbacks described above. There are some less obvious advantages too. For example, the need for distributed commit protocols in multi-node transactions is eliminated, which is a critical step towards scalability. (Why distributed commit protocols can be omitted in distributed systems is non-obvious, and will be discussed in a future blog post; the topic is also addressed at length in our paper.)

A deterministic DBMS prototype

In our paper, entitled “The Case for Determinism in Database Systems”, we propose an architecture and execution model that avoids deadlock, copes with failures without aborting transactions, and achieves high concurrency. The paper contains full details, but the basic idea is to use ordered locking coupled with optimistic lock location prediction, while exploiting deterministic systems' nice replication properties in the case of failures.

We go on in the paper to present measurements and analyses of the performance characteristics of a fully ACID deterministic database prototype based on our execution model, which we implemented alongside a standard (nondeterministic) two-phase locking system for comparison. It turns out that the deterministic scheme performs horribly in disk-based environments, but that as transactions get shorter and less variable in length (thanks to the introduction of flash and the ever-plummeting cost of memory) our scheme becomes more viable. Running the prototype on modern hardware, deterministic execution keeps up with the traditional system implementation on the TPC-C benchmark, and actually shows drastically more throughput and scalability than the nondeterministic system when the frequency of multi-partition transactions increases.

Our prototype system is currently being reworked and extended to include several optimizations which appear to be unique to explicitly deterministic systems (see the Future Work section in our paper's appendix for details), and we look forward to releasing a stable codebase to the community in the coming months, in hopes that it will spur further dialogue and research on deterministic systems and on the scalability of ACID systems in general.

Monday, August 2, 2010

Thoughts on Kickfire’s apparent demise

There have been some recent conflicting reports on the future prospects of Kickfire’s analytical database technology. Forbes reported a couple of months ago that Kickfire sold $5 million worth of boxes in their first year of existence (they launched their product in April 2009), and was extremely positive about Kickfire’s outlook. Then, a couple of months later, Curt Monash reported that Kickfire was discontinuing their product, and selling their IP and engineers. Obviously, Monash is the more reliable source here (and I independently heard a rumor from a reputable source that Teradata was acquiring Kickfire at a “firesale” price).

In my interactions with the company, I have been impressed with Raj Cherabuddi and members of the Kickfire technical team, and it is always sad when good technology fails to gain any traction in the marketplace. I’m also sad (though obviously to a lesser extent) to see the many thousands of words I have written about Kickfire in this blog (including an in-depth post on their technology) become largely obsolete. In fact, one of the first posts I wrote for this blog was on the subject of Kickfire --- a mostly positive post --- but questioning their go-to-market strategy. In that post, I took issue with the assumption that there is a “mass market” for data warehousing in the MySQL ecosystem, especially for a proprietary approach like Kickfire's. The CEO of Kickfire kindly took the time to respond to this original post, quoting a bunch of IDC numbers about the size of the MySQL data warehouse market. I chose to respond to this comment in a separate post, in which I said (amongst other things):

"The point of my post was that I think the [MySQL data warehouse] market is smaller than these [IDC] numbers indicate. Sure, there are a lot of MySQL deployments, but that's because it's free. The number of people actually paying for the MySQL Enterprise Edition is far less, but those are probably the people who'd be willing to pay for a solution like Kickfire's. Furthermore, […] a lot of people who use MySQL for warehousing are using sharded MySQL, which is nontrivial (or at least not cheap) to port to non-shared-nothing solutions like Kickfire and Infobright. Finally, the amount of data that corporations are keeping around is increasing rapidly, and the size of data warehouses are doubling faster than Moore's law. So even if most warehouses today are pretty small, this might not be the case in the future. I'm a strong believer that MPP shared-nothing parallel solutions are the right answer for the mass market of tomorrow. Anyway, the bottom line is that I'm openly wondering if the market is actually much smaller than the IDC numbers would seem to suggest. But obviously, if Kickfire, Infobright, or Calpont achieves a large amount of success without changing their market strategy, I'll be proven incorrect."

I think the above paragraph lists two of the three most probable reasons why Kickfire seems to have failed:

(1) Building a propriety database stack of hardware and software around a MySQL codebase that attributes much of its success to being open and free is a poor cultural match.

(2) Trying to make it in the "Big Data Era" without a scalable MPP product is a recipe for disaster. It is well known that over 95% of data warehouses are smaller than 5TB, and that MPP is not strictly necessary for less than 5TB, so it is easy to get into the trap of Kickfire’s thinking that the mass market is addressable without building a MPP product. However, businesses are looking forward, and seeing much more data in their future (whether this is wishful or realistic thinking is entirely irrelevant), and can often be reluctant to select a product with known scalability limits.

(The third alternative reason why Kickfire might have failed is the TPC-H benchmark orientation. It is really easy to spend a lot of time working on an analytical database to get it to run TPC-H --- even optimizing it for TPC-H --- before realizing that the marketing benefit that the product gets from stellar TPC-H numbers does not justify the time investment of getting it to run --- and in fact find out that many of the features that were added for TPC-H are not actually used by real-life customers.)

It is tempting to add a fourth reason for Kickfire’s demise --- the long list of failed hardware-accelerated DBMS companies and Kickfire’s obvious inclusion in this group. However, I believe that Netezza’s success is a demonstration of the potential of the benefits of hardware acceleration and the appliance approach in the modern era where the rate of performance improvements with each successive processor generation is slowing significantly.

Anyway, RIP Kickfire (assuming the rumors are correct). Good technology. Bad go-to-market strategy. Tough fit for the “Big Data” era.