Wednesday, October 28, 2015

Why MongoDB, Cassandra, HBase, DynamoDB, and Riak will only let you perform transactions on a single data item

(This post is co-authored by Daniel Abadi and Jose Faleiro and cross-posted on Jose's blog)

NoSQL systems such as MongoDB, Cassandra, HBase, DynamoDB, and Riak have made many things easier for application developers. They generally have extremely flexible data models, that reduce the burden of advance prediction of how an application will change over time. They support a wide variety of data types, allow nesting of data, and dynamic addition of new attributes. Furthermore, on the whole, they are relatively easy to install, with far fewer configuration parameters and dependencies than many traditional database systems.

On the other hand, their lack of support for traditional atomic transactions is a major step backwards in terms of ease-of-use for application developers. An atomic transaction enables a group of writes (to different items in the database) to occur in an all-or-nothing fashion --- either they will all succeed and be reflected in the database state, or none of them will. Moreover, in combination with appropriate concurrency control mechanisms, atomicity guarantees that concurrent and subsequent transactions either observe all of the completed writes of an atomic transaction or none of them. Without atomic transactions, application developers have to write corner-case code to account for cases in which a group of writes (that are supposed to occur together) have only partially succeeded or only partially observed by concurrent processes. This code is error-prone, and requires complex understanding of the semantics of an application.

At first it may seem odd that these NoSQL systems, that are so well-known for their developer-friendly features, should lack such a basic ease-of-use tool as an atomic transaction. One might have thought that this missing feature is a simple matter of maturity --- these systems are relatively new and perhaps they simply haven't yet gotten around to implementing support for atomic transactions. Indeed, Cassandra's "batch update" feature could be viewed as a mini-step in this direction (despite the severe constraints on what types of updates can be placed in a "batch update"). However, as we start to approach a decade since these systems were introduced, it is clear that there is a more fundamental reason for the lack of transactional support in these systems.

Indeed, there is a deeper reason for their lack of transactional support, and it stems from their focus on scalability. Most NoSQL systems are designed to scale horizontally across many different machines, where the data in a database is partitioned across these machines. The writes in a (general) transaction may access data in several different partitions (on several different machines). Such transactions are called "distributed transactions". Guaranteeing atomicity in distributed transactions requires that the machines that participate in the transaction coordinate with each other. Each machine must establish that the transaction can successfully commit on every other machine involved in the transaction. Furthermore, a protocol is used to ensure that no machine involved in the transaction will fail before the writes that it was involved in for that transactions are present in stable storage. This avoids scenarios where one set of nodes commit a transaction's writes, while another set of nodes abort or fail before the transaction is complete (which violates the all-or-nothing guarantee of atomicity).

This coordination process is expensive, both, in terms of resources, and in terms of adding latency to database requests. However, the bigger issue is that other operations are not allowed to read the writes of a transaction until this coordination is complete, since the all-or-nothing nature of transaction execution implies that these writes may need to be rolled-back if the coordination process determines that some of the writes cannot complete and the transaction must be aborted. The delay of concurrent transactions can cause further delay of other transactions that have overlapping read- and write-sets with the delayed transactions, resulting in overall "cloggage" of the system. The distributed coordination that is required for distributed transactions thus has significant drawbacks for overall database system performance --- both in terms of the throughput of transactions per unit time that the system can process, and in terms of the latency of transactions as they get caught up in the cloggage (this cloggage latency often dominates the latency of the transaction coordination protocol itself). Therefore, most NoSQL systems have chosen to disallow general transactions altogether rather than become susceptible to the performance pitfalls that distributed transactions can entail.

MongoDB, Riak, HBase, and Cassandra all provide support for transactions on a single key. This is because all information associated with a single key is stored on a single machine (aside from replicas stored elsewhere). Therefore, transactions on a single key are guaranteed not to involve the types of complicated distributed coordination described above.

Given that distributed transactions necessitate distributed coordination, it would seem that there is a fundamental tradeoff between scalable performance and support for distributed transactions. Indeed, many practitioners assume that this is the case. When they set out to build a scalable system, they immediately assume that they will not be able to support distributed atomic transactions without severe performance degradation.

This is in fact completely false. It is very much possible for a scalable system to support performant distributed atomic transactions.

In a recent paper, we published a new representation of the tradeoffs involved in supporting atomic transactions in scalable systems.  In particular, there exists a three-way tradeoff between fairness, isolation, and throughput (FIT). A scalable database system which supports atomic distributed transactions can achieve at most two out of these three properties. Fairness corresponds to the intuitive notion that the execution of any given transaction is not deliberately delayed in order to benefit other transactions.  Isolation provides each transaction with the illusion that it has the entire database system to itself. In doing so, isolation guarantees that if any pair of transactions conflict, then one transaction in the pair will always observe the writes of the other. As a consequence, it alleviates application developers from the burden of reasoning about complex interleavings of conflicting transactions' reads and writes. Throughput refers to the ability of the database to process many concurrent transactions per unit time (without hiccups in performance due to clogging).

The FIT tradeoff dictates that there exist three classes of systems that support atomic distributed transactions:
  1. Those that guarantee fairness and isolation, but sacrifice throughput, 
  2. Those that guarantee fairness and throughput, but sacrifice isolation, and 
  3. Those that guarantee isolation and throughput, but sacrifice fairness.

In other words, not only is it possible to build scalable systems with high throughput distributed transactions, but there actually exist two classes of systems that can do so: those that sacrifice isolation, and those that sacrifice fairness. We discuss each of these two alternatives below.

(Latency is not explicitly mentioned in the tradeoff, but systems that give up throughput also give up latency due to cloggage, and systems that give up fairness yield increased latency for those transactions treated unfairly.)

Give up on isolation
As described above, the root source of the database system cloggage isn't the distributed coordination itself. Rather, it is the fact that other transactions that want to access the data that a particular transaction wrote have to wait until after the distributed coordination is complete before reading or writing the shared data. This waiting occurs due to strong isolation, which guarantees that one transaction in a pair of conflicting must observe the writes of the other. Since a transaction's writes are not guaranteed to commit until after the distributed coordination process is complete, concurrent conflicting transactions cannot make progress for the duration of this coordination.

However, all of this assumes that it is unacceptable for transactions writes to not be immediately observable by concurrent conflicting transactions If this "isolation" requirement is dropped, there is no need for other transactions to wait until the distributed coordination is complete before executing and committing.

While giving up on strong isolation seemingly implies that distributed databases cannot guarantee correctness (because transactions execute against potentially stale database state), it turns out that there exists a class of database constraints that can be guaranteed to hold despite the use of weak isolation among transactions. For more details on the kinds of guarantees that can hold on constraints despite weak isolation, Peter Bailis's work on Read Atomic Multi-Partition (RAMP) transactions provides some great intuition.

Give up on fairness
The underlying motivation for giving up isolation in systems is that distributed coordination extends the duration for which transactions with overlapping data accesses are unable to make progress. Intuitively, distributed coordination and isolation mechanisms overlap in time.  This suggests that another way to circumvent the interaction between isolation techniques and distributed coordination is to re-order distributed coordination such that its overlap with any isolation mechanism is minimized. This intuition forms the basis of Isolation-Throughput systems (which give up fairness).  In giving up fairness, database systems gain the flexibility to pick the most opportune time to pay the cost of distributed coordination.  For instance, it is possible to perform coordination outside of transaction boundaries so that the additional time required to do the coordination does not increase the time that conflicting transactions cannot run. In general, when the system does not need to guarantee fairness, it can deliberately prioritize or delay specific transactions in order to benefit overall throughput.

G-Store is a good example of an Isolation-Throughput system (which gives up fairness).  G-Store extends a (non-transactional) distributed key-value store with support for multi-key transactions.  G-Store restricts the scope of transactions to an application defined set of keys called a KeyGroup. An application defines KeyGroups dynamically based on the set of keys it anticipates will be accessed together over the course of some period of time. Note that the only restriction on transactions is that the keys involved in the transaction be part of a single KeyGroup. G-Store allows KeyGroups to be created and disbanded when needed, and therefore effectively provides arbitrary transactions over any set of keys.

When an application defines a KeyGroup, G-Store moves the constituent keys from their nodes to a single leader node. The leader node copies the corresponding key-value pairs, and all transactions on the KeyGroup are executed on the leader. Since all the key-value pairs involved in a transaction are stored on a single node (the leader node), G-Store transactions do not need to execute a distributed commit protocol during transaction execution.

G-Store pays the cost of distributed coordination prior to executing transactions. In order to create a KeyGroup, G-Store executes an expensive distributed protocol to allow a leader node to take ownership of a KeyGroup, and then move the KeyGroup's constituent keys to the leader node. The KeyGroup creation protocol involves expensive distributed coordination, the cost of which is amortized across the transactions which execute on the KeyGroup.

The key point is that while G-Store still must perform distributed coordination, this coordination is done prior to transaction execution --- before the need to be concerned with isolation from other transactions. Once the distributed coordination is complete (all the relevant data has been moved to a single master node), the transaction completes quickly on a single node without forcing concurrent transactions with overlapping data accesses to wait for distributed coordination. Hence, G-Store achieves both high throughput and strong isolation.

However, the requirement that transactions restrict their scope to a single KeyGroup favors transactions that execute on keys which have already been grouped. This is "unfair" to transactions that need to execute on a set of as yet ungrouped keys. Before such transactions can begin executing, G-Store must first disband existing KeyGroups to which some keys may belong, and then create the appropriate KeyGroup --- a process with much higher latency than if the desired KeyGroup already existed.

The fundamental reason for the poor performance of conventional distributed transactions is the fact that the mechanisms for guaranteeing atomicity (distributed coordination), and isolation overlap in time. The key to enabling high throughput distributed transactions is to separate these two concerns. This insight leads to two ways of separating atomicity and isolation mechanisms. The first option is to weaken isolation such that conflicting transactions can execute and commit in parallel. The second option is to re-order atomicity and isolation mechanisms so that they do not overlap in time, and in doing so, give up fairness during transaction execution.

(Edit: MongoDB and HBase both have (or will soon have) limited support for multi-key transactions as long as those keys are within the same partition. However, hopefully it is clear to the reader that this post is discussing the difficulties of implementing distributed --- cross-partition --- transactions). 


  1. Great article Daniel.

    Interesting you mentioned the idea of weakening isolation.

    Relational databases have a way to do exactly that. In MySQL and Oracle for example you can set your isolation level to:


    And its available for *exactly* the reasons you outline. What's more it has concomitant affects on performance.

    Thx for the post!

    1. I half agree. Indeed many traditional database systems have the reduced isolation options you listed. And the main reason for them is to improve throughout at the cost of isolation. However these options existed before distributed database systems became a reality. They are focused on the single-node isolation-throughput tradeoff. In contast, this post talks about the three-way fairness, isolation, throughput tradeoff for distributed transactions.

  2. "Distributed Transactions for Google App Engine" also addresses this problem:
    Per the recommendation of one of our co-authors, Ryan Barrett, the lead on the GAE datastore, we used optimistic concurrency which helps the throughput of concurrent transactions accessing the same data; our algorithm is also tolerant of timeouts. -- Daniel S. Wilkerson

    1. Unfortunately, OCC does not get around the FIT tradeoff. While locking protocols cause cloggage due to transactions waiting for locks, OCC causes cloggage due to transaction aborts. The key thing is that when you extend transactions through coordination across the different partitions, then isolation protocols such as locking and OCC will slow down throughput.

    2. Still, with proper schema/denormalization OCC is perfect for many use-cases. No transactions at all is the worst you can get (in terms of transactions)...

  3. I am using a hosted NoSQL database over the Internet. The only reason I am using it is because it is a JSON document store, and I really dont care about transaction isolation levels or distributed transactions. I care that I dont have to pay for a database server license fee and dont have to deal with installing clients because it is all online accessible through the browser. - L Mohan Arun

  4. RavenDB seem to support transactions, not sure what the tradeoffs are, just heard about it recently (probably because of the price ;)

  5. ZODB does transactions across multiple objects correctly. If they are in the same file storage no problem. If they are in different shards, then it uses two-phase commits.

  6. You have spent thousand words just to figure out issue of CAP theorem and have no even reference to it.

    1. FIT and CAP are totally different sets of tradeoffs. I have talked about CAP extensively on this blog. Most notably, see my post from a few years ago:

  7. I have been following this line of work since reading the "The problems with ACID, and how to fix them without going NoSQL" post in 2010. It has been good to see the published progress on Calvin, and I hope to see the technology gain adoption (whether through Calvin when it is ready or integration into existing solutions) in the coming years.

  8. First PACELC and now this. Thank you for making it easier to understand and discuss distributed systems.

    1. Thanks Mark. That means a lot coming from you.

  9. Not sure how can we sacrifice isolation. If one transaction updates (increments/decrements) two counters, other transaction can come in and set its own values to counters breaking the logic in the first transaction?

    1. It depends on the application. Some applications require stronger isolation guarantees than others. However, please note that when we talk about "sacrificing" isolation, we don't necessarily mean running at zero isolation. We're talking about running at weaker isolation guarantees than full serializability.

  10. Great work to help understand distributed transactions! Thank you.

    I have one question about the definition of "fairness". I am confused by "the execution of any given transaction is not deliberately delayed in order to benefit other transactions". I am not sure whether your definition is really about the fairness. I think there are two concepts. (1) Harm: A is harmed by B, so we say it is unfair for A. (2)Fairness of Harm: A is harmed by B, and B is equally harmed by A. We can say it is fair for both A and B. For example, time sharing in OS scheduling can give fairness, although "the execution of a process is deliberately delayed to benefit other processes".

  11. Thanks for the blog post, very interesting. I'd be keen to hear what you think of other NoSQL solutions such as Datomic.