It is a poorly kept secret that NoSQL is not really about eliminating SQL from database systems (e.g., see these links). Rather, systems such as Bigtable, HBase, Hypertable, Cassandra, Dynamo, SimpleDB (and a host of other key-value stores), PNUTS/Sherpa, etc. are mostly concerned with system scalability. It turns out to be quite difficult to scale traditional, ACID-compliant relational database systems on cheap, shared-nothing scale-out architectures, and thus these systems drop some of the ACID guarantees in order to achieve shared-nothing scalability (letting the application developer handle the increased complexity that programming over a non-ACID compliant system entails). In other words, NoSQL really means NoACID.
Our objective in this post is to explain why ACID is hard to scale. At the same time, we argue that NoSQL/NoACID is the lazy way around these difficulties---it would be better if the particular problems that make ACID hard to scale could be overcome. This is obviously a hard problem, but we have a few new ideas about where to begin.
ACID, scalability and replication
For large transactional applications, it is well known that scaling out on commodity hardware is far cheaper than scaling up on high-end servers. Most of the largest transactional applications therefore use a shared-nothing architecture where data is divided across many machines and each transaction is executed at the appropriate one(s).
The problem is that if a transaction accesses data that is split across multiple physical machines, guaranteeing the traditional ACID properties becomes increasingly complex: ACID's atomicity guarantee requires a distributed commit protocol (such as two-phase commit) across the multiple machines involved in the transaction, and its isolation guarantee insists that the transaction hold all of its locks for the full duration of that protocol. Since many of today's OLTP workloads are composed of fairly lightweight transactions (each involving less than 10 microseconds of actual work), tacking a couple of network round trips onto every distributed transaction can easily mean that locks are held for orders of magnitude longer than the time each transaction really spends updating its locked data items. This can result in skyrocketing lock contention between transactions, which can severely limit transactional throughput.
In addition, high availability is becoming ever more crucial in scalable transactional database systems, and is typically accomplished via replication and automatic fail-over in the case of a crash. The developer community has therefore come to expect ACID's consistency guarantee (originally promising local adherence to user-specified invariants) to also imply strong consistency between replicas (i.e. replicas are identical copies of one other, as in the CAP/PACELC sense of the word consistency).
Unfortunately, strongly consistent replication schemes either come with high overhead or incur undesirable tradeoffs. Early approaches to strongly consistent replication attempted to synchronize replicas during transaction execution. Replicas executed transactions in parallel, but implemented some protocol to ensure agreement about any change in database state before committing any transaction. Because of the latency involved in such protocols (and due to the same contention issue discussed above in relation to scalability), synchronized active replication is seldom used in practice today.
Today's solution is usually post-write replication, where each transaction is executed first at some primary replica, and updates are propagated to other replicas after the fact. Basic master-slave/log-shipping replication is the simplest example of post-write replication, although other schemes which first execute each transaction at one of multiple possible masters fall under this category. In addition to the possibility of stale reads at slave replicas, these systems suffer a fundamental latency-durability-consistency tradeoff: either a primary replica waits to commit each transaction until receiving acknowledgement of sufficient replication, or it commits upon completing the transaction. In the latter case, either in-flight transactions are lost upon failure of the primary replica, threatening durability, or they are retrieved only after the failed node has recovered, while transactions executed on other replicas in the meantime threaten consistency in the event of a failure.
In summary, it is really hard to guarantee ACID across scalable, highly available, shared-nothing systems due to complex and high overhead commit protocols, and difficult tradeoffs in available replication schemes.
The NoACID solution
Designers of NoSQL systems, aware of these issues, carefully relax some ACID guarantees in order to achieve scalability and high availability. There are two ways that ACID is typically weakened. First, systems like Bigtable, SQL Azure, sharded MySQL, and key-value stores support atomicity and isolation only when each transaction only accesses data within some convenient subset of the database (a single tuple in Bigtable and KV stores, or a single database partition in SQL Azure and sharded MySQL). This eliminates the need for expensive distributed commit protocols, but at a cost: Any logical transaction which spans more than one of these subsets must be broken up at the application level into separate transactions; the system therefore guarantees neither atomicity nor isolation with respect to arbitrary logical transactions. In the end, the programmer must therefore implement any additional ACID functionality at the application level.
Second, lazy replication schemes such as eventual consistency sacrifice strong consistency to get around the tradeoffs of post-write replication (while also allowing for high availability in the presence of network partitions, as specified in the CAP theorem). Except with regard to some well-known and much-publicized Web 2.0 applications, losing consistency at all times (regardless of whether a network partition is actually occurring) is too steep a price to pay in terms of complexity for the application developer or experience for the end-user.
Fixing ACID without going NoSQL
In our opinion, the NoSQL decision to give up on ACID is the lazy solution to these scalability and replication issues. Responsibility for atomicity, consistency and isolation is simply being pushed onto the developer. What is really needed is a way for ACID systems to scale on shared-nothing architectures, and that is what we address in the research paper that we will present at VLDB this month.
Our view (and yes, this may seem counterintuitive at first), is that the problem with ACID is not that its guarantees are too strong (and that therefore scaling these guarantees in a shared-nothing cluster of machines is too hard), but rather that its guarantees are too weak, and that this weakness is hindering scalability.
The root of these problems lies in the isolation property within ACID. In particular, the serializability property (which is the standard isolation level for fully ACID systems) guarantees that execution of a set of transactions occurs in a manner equivalent to some sequential, non-concurrent execution of those transactions, even if what actually happens under the hood is highly threaded and parallelized. So if three transactions (let's call them A, B and C) are active at the same time on an ACID system, it will guarantee that the resulting database state will be the same as if it had run them one-by-one. No promises are made, however, about which particular order execution it will be equivalent to: A-B-C, B-A-C, A-C-B, etc.
This obviously causes problems for replication. If a set of (potentially non-commutative) transactions is sent to two replicas of the same system, the two replicas might each execute the transactions in a manner equivalent to a different serial order, allowing the replicas' states to diverge.
More generally, most of the intra- and inter-replica information exchange that forms the basis of the scalability and replication woes of ACID systems described above occurs when disparate nodes in the system have to forge agreement about (a) which transactions should be executed, (b) which will end up being committed, and (c) with equivalence to which serial order.
If the isolation property were to be strengthened to guarantee equivalence to a predetermined serial order (while still allowing high levels of concurrency), and if a layer were added to the system which accepts transaction requests, decides on a universal order, and sends the ordered requests to all replicas, then problems (a) and (c) are eliminated. If the system is also stripped of the right to arbitrarily abort transactions (system aborts typically occur for reasons such as node failure and deadlock), then problem (b) is also eliminated.
This kind of strengthening of isolation introduces new challenges (such as deadlock avoidance, dealing with failures without aborting transactions, and allowing highly concurrent execution without any on-the-fly transaction reordering), but also results in a very interesting property: given an initial database state and a sequence of transaction requests, there exists only one valid final state. In other words, determinism.
The repercussions of a deterministic system are broad, but one advantage is immediately clear: active replication is trivial, strongly consistent, and suffers none of the drawbacks described above. There are some less obvious advantages too. For example, the need for distributed commit protocols in multi-node transactions is eliminated, which is a critical step towards scalability. (Why distributed commit protocols can be omitted in distributed systems is non-obvious, and will be discussed in a future blog post; the topic is also addressed at length in our paper.)
A deterministic DBMS prototype
In our paper, entitled “The Case for Determinism in Database Systems”, we propose an architecture and execution model that avoids deadlock, copes with failures without aborting transactions, and achieves high concurrency. The paper contains full details, but the basic idea is to use ordered locking coupled with optimistic lock location prediction, while exploiting deterministic systems' nice replication properties in the case of failures.
We go on in the paper to present measurements and analyses of the performance characteristics of a fully ACID deterministic database prototype based on our execution model, which we implemented alongside a standard (nondeterministic) two-phase locking system for comparison. It turns out that the deterministic scheme performs horribly in disk-based environments, but that as transactions get shorter and less variable in length (thanks to the introduction of flash and the ever-plummeting cost of memory) our scheme becomes more viable. Running the prototype on modern hardware, deterministic execution keeps up with the traditional system implementation on the TPC-C benchmark, and actually shows drastically more throughput and scalability than the nondeterministic system when the frequency of multi-partition transactions increases.
Our prototype system is currently being reworked and extended to include several optimizations which appear to be unique to explicitly deterministic systems (see the Future Work section in our paper's appendix for details), and we look forward to releasing a stable codebase to the community in the coming months, in hopes that it will spur further dialogue and research on deterministic systems and on the scalability of ACID systems in general.