Friday, May 3, 2019

Introduction to Transaction Isolation Levels

For decades, database systems have given their users multiple isolation levels to choose from, ranging from some flavor of “serializability” at the high end down to “read committed” or “read uncommitted” at the low end. These different isolation levels expose an application to markedly different types of concurrency bugs. Nonetheless, many database users stick with the default isolation level of whatever database system that they are using, and do not bother to consider which isolation level is optimal for their application. This practice is dangerous—the vast majority of widely-used database systems—including Oracle, IBM DB2, Microsoft SQL Server, SAP HANA, MySQL, and PostgreSQL—do not guarantee any flavor of serializability by default. As we will detail below, isolation levels weaker than serializability can lead to concurrency bugs in an application and negative user experiences. It is very important that a database user is aware of the isolation level guaranteed by the database system, and what concurrency bugs may emerge as a result.
In this post we give a tutorial on database isolation levels, the advantages that come with lower isolation levels, and the types of concurrency bugs that these lower levels may allow. Our main focus in this post is the difference between serializable isolation levels and lower levels that expose an application to specific types of concurrency anomalies. There are also important differences within the category of serializable isolation (e.g., “strict serializability” makes a different set of guarantees than “one-copy serializability”). However, in keeping with the title of this post as an “introduction” to database isolation levels, we will ignore the subtle differences within the class of serializable isolation, and focus on the commonality of all elements within this class—referring to this commonality as “serializable”. In a future, less introductory, post, we will investigate the category of serializable isolation in more detail.

What is an “Isolation Level”?

Database isolation refers to the ability of a database to allow a transaction to execute as if there are no other concurrently running transactions (even though in reality there can be a large number of concurrently running transactions). The overarching goal is to prevent reads and writes of temporary, aborted, or otherwise incorrect data written by concurrent transactions.
There is such a thing as perfect isolation (we will define this below). Unfortunately, perfection usually comes at a performance cost—in terms of transaction latency (how long before a transaction completes) or throughput (how many transactions per second can the system complete). Depending on how a particular system is architected, perfect isolation becomes easier or harder to achieve. In poorly designed systems, achieving perfection comes with a prohibitive performance cost, and users of such systems will be pushed to accept guarantees significantly short of perfection. However, even in well designed systems, there is often a non-trivial performance benefit achieved by accepting guarantees short of perfection. Therefore, isolation levels came into existence: they provide the user of a system the ability to trade off isolation guarantees for improved performance.
As we will see as we proceed in this discussion, there are many subtleties that lead to confusion when discussing isolation levels. This confusion is exacerbated by the existence of a SQL standard that fails to accurately define database isolation levels and database vendors that attach liberal and non-standard semantics to particular named isolation levels. Nonetheless, a database user is obligated to do due diligence on the different options provided by a particular system, and choose the right level for their application.

Perfect Isolation

Let’s begin our discussion of database isolation levels by providing a notion of what “perfect” isolation is. We defined isolation above as the ability of a database system to allow a transaction to execute as if there are no other concurrently running transactions (even though in reality that can be a large number of concurrently running transactions). What does it mean to be perfect in this regard? At first blush, it may appear that perfection is impossible. If two transactions both read and write the same data item, it is critical that they impact each other. If they ignore each other, then whichever transaction completes the write last could clobber the first transaction, resulting in the same final state as if it never ran.
The database system was one of the first scalable concurrent systems, and has served as an archetype for many other concurrent systems developed subsequently. The database system community—many decades ago—developed an incredibly powerful (and yet perhaps underappreciated) mechanism for dealing with the complexity of implementing concurrent programs.
The idea is as follows: human beings are fundamentally bad at reasoning about concurrency. It’s hard enough to write a bug-free non-concurrent program. But once you add concurrency, there are a near-infinite array of race conditions that could occur—if one thread reaches line 17 of a program before another thread reaches line 5, but after it reaches line 3, a problem could occur that would not exist under any other concurrent execution of the program. It is nearly impossible to consider all the different ways that program execution in different threads could overlap with each other, and how different types of overlap can lead to different final states.
Instead, database systems provide a beautiful abstraction to the application developer—a “transaction”. A transaction may contain arbitrary code, but it is fundamentally single-threaded.
An application developer only needs to focus on the code within a transaction—to ensure it is correct when there are no other concurrent processes running in the system. Given any starting state of the database, the code must not violate the semantics of the application. Ensuring correctness of code is non-trivial, but it is much easier to ensure the correctness of code when it is running by itself, than ensuring the correctness of code when it is running alongside other code that may attempt to read or write shared data. 
If the application developer is able to ensure the correctness of their code when no other concurrent processes are running, a system that guarantees perfect isolation will ensure that the code remains correct even when there is other code running concurrently in the system that may read or write the same data. In other words, a database system enables a user to write code without concern for the complexity of potential concurrency, and yet still process that code concurrently without introducing new bugs or violations of application semantics.
Implementing this level of perfection sounds difficult, but it’s actually fairly straightforward to achieve. We have already assumed that the code is correct when run without concurrency over any starting state. Therefore, if transaction code is run serially — one after the other — then the final state will also be correct. Thus, in order to achieve perfect isolation, all the system has to do is to ensure that when transactions are running concurrently, the final state is equivalent to a state of the system that would exist if they were run serially. There are several ways to achieve this—such as via locking, validation, or multi-versioning—that are out of scope for this article. The key point for our purposes is that we are defining “perfect isolation” as the ability of a system to run transactions in parallel, but in a way that is equivalent to as if they were running one after the other. In the SQL standard, this perfect isolation level is called serializability.
Isolation levels in distributed systems get more complicated. Many distributed systems implement variations of the serializable isolation level, such as one copy-serializability (1SR), strict serializability (strict 1SR) or update serializability (US). However, as we mentioned above, in order to focus the discussion in this post on core concepts behind database isolation, we will defer discussion of these more advanced concepts to a future post.

Anomalies in Concurrent Systems

The SQL standard defines several isolation levels below serializability. Furthermore, there are other isolation levels commonly found in commercial database systems—most notably snapshot isolation—which are not included in the SQL standard. Before we discuss these different levels of isolation, let’s discuss some well-known application bugs/anomalies that can occur at isolation levels below serializability. We will describe these bugs using a retail example.
Let’s assume that whenever a customer purchases a widget, the following “Purchase” transaction is run:
  1. Read old inventory
  2. Write new inventory which is one less than what was read in step (1)
  3. Insert new order corresponding to this purchase into the orders table
If such Purchase transactions run serially, then all initial inventory will always be accounted for. If we started with 42 widgets, then at all times, the sum of all inventory remaining plus orders in the order table will be 42.
But what if such transactions run concurrently at isolation levels below serializability?
For example, let’s say that two transactions running concurrently read the same initial inventory (42), and then both attempt to write out the new inventory of one less than the value that they read (41) in addition to the new order. In such a case, the final state is an inventory of 41, yet there are two new orders in the orders table (for a total of 43 widgets accounted for). We created a widget out of nothing! Clearly, this is a bug. It is known as the lost-update anomaly
As another example, let’s say these same two transactions are running concurrently but this time the second transaction starts in between steps (2) and (3) of the first one. In this case, the second transaction reads the value of inventory after it has been decremented - i.e. it reads the value of 41 and decrements it to 40, and writes out the order. In the meantime, the first transaction aborted when writing out the order (e.g. because of a credit card decline). In such a case, during the abort process, the first transaction reverts back to the state of the database before it began (when inventory was 42). Therefore the final state is an inventory of 42, and one order written out (from the second transaction). Again, we created a widget out of nothing! This is known as the dirty-write anomaly (because the second transaction overwrote the value of the first transaction’s write before it decided whether it would commit or abort).
As a third example, let’s say a separate transaction performs a read of both the inventory and the orders table, in order to make an accounting of all widgets that ever existed. If it runs between steps (2) and (3) of a Purchase transaction, it will see a temporary state of the database in which the widget has disappeared from the inventory, but has not yet appeared as an order. It will appear that a widget has been lost—another bug in our application. This is known as the dirty-read anomaly, since the accounting transaction was allowed to read the temporary (incomplete) state of the purchase transaction.
As a fourth example, let’s say that a separate transaction checks the inventory and acquires some more widgets if there are fewer than 10 widgets left:
  1. IF (READ(Inventory) = (10 OR 11 OR 12))
  2.    Ship some new widgets to restock inventory via standard shipping
  3. IF (READ(Inventory) < 10)
  4.    Ship some new widgets to restock inventory via express shipping
Note that this transaction reads the inventory twice. If the Purchase transaction runs in between step (1) and (3) of this transaction, then a different value of inventory will be read each time. If the initial inventory before the Purchase transaction ran was 10, this would lead to the same restock request to be made twice—once with standard shipping and once with express shipping. This is called the non-repeatable read anomaly.
As a fifth example, imagine a transaction that scans the orders table in order to calculate the maximum price of an order and then scans it again to find the average order price. In between these two scans, an extremely expensive order gets inserted that skews the average so much that it becomes higher than the maximum price found in the previous scan. This transaction returns an average price that is larger than the maximum price—a clear impossibility and a bug that would never happen in a serializable system. This bug is slightly different than the non-repeatable read anomaly since every value that the transaction read stayed the same between the two scans—the source of the bug is that additional records were inserted in between these two scans. This is called the phantom read anomaly.
As a final example, assume that the application allows the price of the widget to change based on inventory. For example, many airlines increase ticket price as the inventory for a flight decreases. Assume that the application uses a formula to place a constraint on how these two variables interrelate—e.g. 10I + P >= $500 (where I is inventory and P is price). Before allowing a purchase to succeed, the purchase transaction has to check both the inventory and price to make sure the above constraint is not violated. If the constraint will not be violated, the update of inventory by that Purchase transaction may proceed. Similarly, a separate transaction that implements special promotional discounts may check both the inventory and price to make sure that the constraint is not violated when updating the price as part of a promotion. If it will not be violated, the price can be updated. Now: imagine these two transactions are running at the same time—they both read the old value of I and P and independently decide that their updates of inventory and price respectively will not violate the constraint. They therefore proceed with their updates. Unfortunately, this may result in a new value of I and P that violates the constraint! If one had run before the other, the first one would have succeeded and the other would would have read the value of I and P after the first one finished and detected that their update would violate the constraint and therefore not proceed. But since they were running concurrently, they both see the old value and incorrectly decide that they can proceed with the update. This bug is called the write skew anomaly because it happens when two transactions read the same data but update disjoint subsets of the data that was read.

Definitions in The ISO SQL Standard

The SQL standard defines reduced isolation levels in terms of which of these anomalies are possible. In particular, it contains the following table:
Isolation LevelDirty ReadNon-repeatable readPhantom Read
READ UNCOMMITTEDPossiblePossiblePossible
READ COMMITTEDNot PossiblePossiblePossible
REPEATABLE READNot PossibleNot PossiblePossible
SERIALIZABLENot PossibleNot PossibleNot Possible

There are many, many problems which how the SQL standard defines these isolation levels. Most of these problems were already pointed out in 1995, but inexplicably, revision after revision of the SQL standard have been released since that point without fixing these problems.
The first problem is that the standard only uses three types of anomalies to define each isolation level—dirty read, non-repeatable read, and phantom read. However, there are many types of concurrency bugs that can appear in practice—many more than just these three anomalies. In this post alone we have already described six unique types of anomalies. The SQL standard makes no mention about whether the READ UNCOMMITTED, READ COMMITTED, and REPEATABLE READ isolation levels are susceptible to the lost update anomaly, the dirty-write anomaly, or the write-skew anomaly. As a result, each commercial system is free to decide which of these other anomalies these reduced isolation levels are susceptible to—and in many cases these vulnerabilities are poorly documented, leading to confusion and unpredictable bugs for the application developer.
A second (related) problem is that using anomalies to define isolation levels only gives the end user a guarantee of what specific types of concurrency bugs are impossible. It does not give a precise definition of the potential database states that are viewable by any particular transaction. There are several improved and more precise definitions of isolation levels given in the academic literature. Atul Adya’s PhD thesis gives a precise definition of the SQL standard isolation levels based on how reads and writes from different transactions may be interleaved. However these definitions are given from the point of view of the system. The recent work by Natacha Crooks et. al gives elegant and precise definitions from the point of view of the user.
A third problem is that the standard does not define, nor provide correctness constraints on one of the most popular reduced isolation levels used in practice: snapshot isolation (nor any of its many variants—PSI, NMSI, Read Atomic, etc). By failing to provide a definition of snapshot isolation, differences in concurrency vulnerabilities allowed by snapshot isolation have emerged across systems. In general, snapshot isolation performs all reads of data as of a particular snapshot of the database state which contains only committed data. This snapshot remains constant throughout the lifetime of the transaction, so all reads are guaranteed to be repeatable (in addition to being only of committed data). Furthermore, concurrent transactions that write the same data detect conflicts with each other and typically resolve this conflict via aborting one of the conflicting transactions. This prevents the lost-update anomaly. However, conflicts are only detected if the conflicting transactions write an overlapping set of data. If the write sets are disjoint, these conflicts will not be detected. Therefore snapshot isolation is vulnerable to the write skew anomaly. Some implementations are also vulnerable to the phantom read anomaly.
A fourth problem is that the SQL standard seemingly gives two different definitions of the SERIALIZABLE isolation level. First, it defines SERIALIZABLE correctly: that the final result must be equivalent to a result that could have occured if there were no concurrency. But then, it presents the above table, which seems to imply that as long as an isolation level does not allow dirty reads, non-repeatable reads, or phantom reads, it may be called SERIALIZABLE. Oracle, has historically leveraged this ambiguity to justify calling its implementation of snapshot isolation “SERIALIZABLE”. To be honest, I think most people who read the ISO SQL standard would come away believing the more precise definition of SERIALIZABLE given earlier in the document (which is the correct one) is the intention of the authors of the document. Nonetheless, I guess Oracle’s lawyers have looked at it and determined that there is enough ambiguity in the document to legally justify their reliance on the other definition. If any of my readers are aware of any real lawsuits that came from application developers who believed they were getting a SERIALIZABLE isolation level, but experienced write skew anomalies in practice, I would be curious to hear about them. Or if you are an application developer and this happened to you, I would also be curious to hear about it.
The bottom line is this: it is nearly impossible to give clear definitions of the different isolation levels available to application developers, because vagueness and ambiguities in the SQL standard has led to semantic differences across implementations/systems. 

What Isolation Level Should You Choose?

My advice to application programmers is the following: reduced isolation levels are dangerous. It is very hard to figure out which concurrency bugs may present themselves. If every system defined their isolation levels using the methodology of Crooks et. al. that I cited above, at least you would have a precise and formal definition of their associated guarantees. Unfortunately, the formalism of the Crooks paper is too advanced for most database users, so it is unlikely that database vendors will adopt these formalisms in their documentation any time soon. In the meantime, the definition of reduced isolation levels remain vague in practice and risky to use.
Furthermore, even if you could know exactly which concurrency bugs are possible for a particular isolation level, writing an application in a way that these bugs will not happen in practice (or if they do, that they will not cause negative experiences for the users of the application) is also very challenging. If your database system gives you a choice, the right choice is usually to avoid lower isolation levels than serializable isolation (for the vast majority of database systems, you actually have to go and change the defaults to accomplish this!).
However, there are three caveats:

  1. As I mentioned above, some systems use the word “SERIALIZABLE” isolation to mean something weaker than true serializable isolation.  Unfortunately, this means that simply choosing the “SERIALIZABLE” isolation level in your database system may not be sufficient to actually ensure serializability. You need to check the documentation to ensure that it defines “SERIALIZABLE” in the following way: that the visible state of the database is always equivalent to a state that could have occurred if there was no concurrency. Otherwise, your application will likely be vulnerable to the write-skew anomaly.
  2. As mentioned above, serializable isolation level comes with a performance cost. Depending on the quality of the system architecture, the performance cost of serializability may be large or small. In a recent research paper that I wrote with Jose Faleiro and Joe Hellerstein, we showed that in a well-designed system, the performance difference between SERIALIZABLE  and READ COMMITTED can be negligible … and in some cases it is possible for the SERIALIZABLE isolation level to (surprisingly) outperform the READ COMMITTED isolation level. If you find that the cost of serializable isolation in your system is prohibitive, you should probably consider using a different database system earlier than you consider settling for a reduced isolation level.
  3. In distributed systems, there are important anomalies that can (and do) emerge even within the class of serializable isolation levels. For such systems, it is important to understand the subtle differences between the elements of the class of serializable isolation (strict serializability is known to be the most safe). We will shed more light into this matter in a future post.

Friday, January 25, 2019

It’s Time to Move on from Two Phase Commit

The two-phase commit protocol (2PC) has been used in enterprise software systems for over three decades. It has been an an incredibly impactful protocol for ensuring atomicity and durability of transactions that access data in multiple partitions or shards. It is used everywhere --- both in older “venerable” distributed systems, database systems, and file systems such as Oracle, IBM DB2, PostgreSQL, and Microsoft TxF (transactional NTFS), and in younger “millennial” systems such as MariaDB, TokuDB, VoltDB, Cloud Spanner, Apache Flink, Apache Kafka, and Azure SQL Database. If your system supports ACID transactions across shards/partitions/databases, there’s a high probability that it is running 2PC (or some variant thereof) under the covers. [Sometimes it’s even “over the covers” --- older versions of MongoDB required users to implement 2PC for multi-document transactions in application code.]

In this post, we will first describe 2PC: how it works and what problems it solves. Then, we will show some major issues with 2PC and how modern systems attempt to get around these issues. Unfortunately, these attempted solutions cause other problems to emerge. In the end, I will make the case that the next generation of distributed systems should avoid 2PC, and how this is possible.

Overview of the 2PC protocol

There are many variants of 2PC, but the basic protocol works as follows:

Background assumption:The work entailed by a transaction has already been divided across all of the shards/partitions that store data accessed by that transaction. We will refer to the effort performed at each shard as being performed by the “worker” for that shard. Each worker is able to start working on its responsibilities for a given transaction independently of each other. The 2PC protocol begins at the end of transaction processing, when the transaction is ready to “commit”. It is initiated by a single, coordinator machine (which may be one of the workers involved in that transaction).

The basic flow of the 2PC protocol is shown in the figure below. [The protocol begins at the top of the figure and then proceeds in a downward direction.]

Phase 1: A coordinator asks each worker whether they have successfully completed their responsibilities for that transaction and are ready to commit. Each worker responds ‘yes’ or ‘no’.

Phase 2: The coordinator counts all the responses. If every worker responded ‘yes’, then the transaction will commit. Otherwise, it will abort. The coordinator sends a message to each worker with the final commit decision and receives an acknowledgement back.

This mechanism ensures the atomicity property of transactions: either the entire transaction will be reflected in the final state of the system, or none of it. If even just a single worker cannot commit, then the entire transaction will be aborted. In other words: each worker has “veto-power” for a transaction.

It also ensures transaction durability. Each worker ensures that all of the writes of a transaction have been durably written to storage prior to responding ‘yes’ in phase 1. This gives the coordinator freedom to make a final decision about a transaction without concern for the fact that a worker may fail after voting ‘yes’. [In this post, we are being purposefully vague when using the term “durable writes” --- this term can either refer to writing to local non-volatile storage or, alternatively, replicating the writes to enough locations for it to be considered “durable”.]

In addition to durably writing the writes that are directly required by the transaction, the protocol itself requires additional writes that must be made durable before it can proceed. For example, a worker has veto power until the point it votes ‘yes’ in phase 1. After that point, it cannot change its vote. But what if it crashes right after voting ‘yes’? When it recovers it might not know that it voted ‘yes’, and still think it has veto power and go ahead and abort the transaction. To prevent this, it must write its vote durably before sending the ‘yes’ vote back to the coordinator. [In addition to this example, in standard 2PC, there are two other writes that are made durable prior to sending messages that are part of the protocol.]

The problems with 2PC

There are two major problems with 2PC. The first is well known, and discussed in every reputable textbook that presents 2PC. The second is much less well known, but a major problem nonetheless.

The well-known problem is referred to as the “blocking problem”. This happens when every worker has voted ‘yes’, but the coordinator fails before sending a message with the final decision to at least one worker. The reason why this is a problem is that by voting ‘yes’, each worker has removed its power to veto the transaction. However, the coordinator still has absolute power to decide the final state of a transaction. If the coordinator fails before sending a message with the final decision to at least one worker, the workers cannot get together to make a decision amongst themselves --- they can’t abort because maybe the coordinator decided to commit before it failed, and they can’t commit because maybe the coordinator decided to abort before it failed. Thus, they have to block --- wait until the coordinator recovers --- in order to find out the final decision. In the meantime, they cannot process transactions that conflict with the stalled transaction since the final outcome of the writes of that transaction are yet to be determined.

There are two categories of work-arounds to the blocking problem. The first category of work-around modifies the core protocol in order to eliminate the blocking problem. Unfortunately, these modifications reduce the performance --- typically by adding an extra round of communication --- and thus are rarely used in practice. The second category keeps the protocol in tact but reduces the probability of the types of coordinator failure than can lead to the blocking program --- for example, by running 2PC over replica consensus protocols and ensuring that important state for the protocol is replicated at all times. Unfortunately, once again, these work-arounds reduce performance, since the protocol requires that these replica consensus rounds occur sequentially, and thus they may add significant latency to the protocol.

The lesser-known problem is what I call the “cloggage problem”. 2PC occurs after transaction is processed, and thus necessarily increases the latency of the transaction by an amount equal to the time it takes to run the protocol. This latency increase alone can already be an issue for many applications, but a potentially larger issue is that worker nodes do not know the final outcome of a transaction until mid-way through the second phase. Until they know the final outcome, they have to be prepared for the possibility that it might abort, and thus they typically prevent conflicting transactions from making progress until they are certain that the transaction will commit. These blocked transactions in turn block other transactions from running, and so on, until 2PC completes and all of the blocked transactions can resume.  This cloggage further increases the average transaction latency and also decreases transactional throughput.

To summarize the problems we discussed above: 2PC poisons a system along four dimensions: latency (the time of the protocol plus the stall time of conflicting transactions), throughput (because it prevents conflicting transactions from running during the protocol), scalability (the larger the system, the more likely transactions become multi-partition and have to pay the throughput and latency costs of 2PC), and availability (the blocking problem we discussed above).  Nobody likes 2PC, but for decades, people have assumed that it is a necessary evil.

It’s time to move on

For over three decades, we’ve been stuck with two-phase commit in sharded systems. People are aware of the performance, scalability, and availability problems it introduces, but nonetheless continue on, with no obvious better alternative.

The truth is, if we would just architect our systems differently, the need for 2PC would vanish. There have been some attempts to accomplish this --- both in academia (such as this SIGMOD 2016 paper) and industry. However, these attempts typically work by avoiding multi-sharded transactions in the first place, such as by repartitioning data in advance of a transaction so that it is no longer multi-sharded. Unfortunately, this repartitioning reduces performance of the system in other ways.

What I am calling for is a deeper type of change in the way we architect distributed systems. I insist that systems should still be able to process multi-sharded transactions --- with all the ACID guarantees and what that entails --- such as atomicity and durability --- but with much simpler and faster commit protocols.

It all comes down to a fundamental assumption that has been present in our systems for decades: a transaction may abort at any time and for any reason. Even if I run the same transaction on the same initial system state … if I run it at 2:00PM it may commit, but at 3:00 it may abort.

There are several reasons why most architects believe we need this assumption. First, a machine may fail at anytime --- including in the middle of a transaction. Upon recovery, it is generally impossible to recreate all of the state of that transaction that was in volatile memory prior to the failure. As a result, it is seemingly impossible to pick up where the transaction left off prior to the failure. Therefore, the system aborts all transactions that were in progress at the time of the failure. Since a failure can occur at any time, this means that a transaction may abort at any time.

Second, most concurrency control protocols require the ability to abort a transaction at any time. Optimistic protocols perform a “validation” phase after processing a transaction. If validation fails, the transaction aborts.  Pessimistic protocols typically use locks to prevent concurrency anomalies. This use of locks may lead to deadlock, which is resolved by aborting (at least) one of the deadlocked transactions. Since deadlock may be discovered at any time, the transaction needs to retain the ability to abort at any time.

If you look carefully at the two-phase commit protocol, you will see that this arbitrary potential to abort a transaction is the primary source of complexity and latency in the protocol. Workers cannot easily tell each other whether they will commit or not, because they might still fail after this point (before the transaction is committed) and want to abort this transaction during recovery. Therefore, they have to wait until the end of transaction processing (when all important state is made durable) and proceed in the necessary two phases: in the first phase, each worker publically relinquishes its control to abort a transaction, and only then can the second phase occur in which a final decision is made and disseminated.

In my opinion we need to remove veto power from workers and architect systems in which the system does not have freedom to abort a transaction whenever it wants during its execution. Only logic within a transaction should be allowed to cause a transaction to abort. If it is theoretically possible to commit a transaction given an current state of the database, that transaction must commit, no matter what types of failures occur. Furthermore, there must not be race conditions relative to other concurrently running transactions that can affect the final commit/abort state of a transaction.

Removing abort flexibility sounds hard. We’ll discuss soon how to accomplish this. But first let’s observe how the commit protocol changes if transactions don’t have abort flexibility.

What a commit protocol looks like when transactions can’t abort arbitrarily

Let’s look at two examples:

In the first example, assume that the worker for the shard that stores the value for variable X is assigned a single task for a transaction: change the value of X to 42. Assume (for now) that there are no integrity constraints or triggers defined on X (which may prevent the system from setting X to 42). In such a case, that worker is never given the power to be able to abort the transaction. No matter what happens, that worker must change X to 42. If that worker fails, it must change X to 42 after it recovers. Since it never has any power to abort, there is no need to check with that worker during the commit protocol to see if it will commit.

In the second example, assume that the worker for the shard that stores the value for variables Y and Z is assigned two tasks for a transaction: subtract 1 from the previous value of Y and set Z to the new value of Y. Furthermore, assume that there is an integrity constraint on Y that states that Y can never go below 0 (e.g., if it represents the inventory of an item in a retail application). Therefore, this worker has to run the equivalent of the following code:

          IF (Y > 0)
             Subtract 1 from Y
              ABORT the transaction
          Z = Y

This worker must be given the power to abort the transaction since this required by the logic of the application. However, this power is limited. Only if the initial value of Y was 0 can this worker abort the transaction. Otherwise, it has no choice but to commit. Therefore, it doesn’t have to wait until it has completed the transaction code before knowing whether it will commit or not. On the contrary: as soon as it has finished executing the first line of code in the transaction, it already knows its final commit/abort decision. This implies that the commit protocol will be able to start much earlier relative to 2PC.

Let’s now combine these two examples into a single example in which a transaction is being performed by two workers --- one of them is doing the work described in the first example, and the other one doing the work described in the second example. Since we are guaranteeing atomicity, the first worker cannot simply blindly set X to 42. Rather, it’s own work must also be dependent on the value of Y. In effect, it’s transaction code becomes:

     temp = Do_Remote_Read(Y)
     if (temp > 0)
        X = 42

Note that if the first worker’s code is written in this way, the code for the other worker can be simplified to just:

     IF (Y > 0)
        Subtract 1 from Y
        Z = Y

By writing the transaction code in this way, we have removed explicit abort logic from both workers. Instead, both workers have if statements that check for the constraint that would have caused the original transaction to abort. If the original transaction would have aborted, both workers end up doing nothing. Otherwise, both workers change the values of their local state as required by the transaction logic.

The important thing to note at this point is that the need for a commit protocol has been totally eliminated in the above code. The system is not allowed to abort a transaction for any reason other than conditional logic defined by application code on a given state of the data. And all workers condition their writes on this same conditional logic so that they can all independently decide to “do nothing” in those situations where a transaction cannot complete as a result of current system state. Thus, all possibility of a transaction abort has been removed, and there is no need for any kind of distributed protocol at the end of transaction processing to make a combined final decision about the transaction. All of the problems of 2PC have been eliminated. There is no blocking problem because there is no coordinator. And there is no cloggage problem, because all necessary checks are overlapped with transaction processing instead of after it completes.

Moreover, as long as the system is not allowed to abort a transaction for any reason other than the conditional application logic based on input data state, it is always possible to rewrite any transaction as we did above in order to replace abort logic in the code with if statements that conditionally check the abort conditions. Furthermore, it is possible to accomplish this without actually rewriting application code. [The details of how to do this are out of scope for this post, but to summarize at a high level: shards can set special system-owned boolean flags when they have completed any conditional logic that could cause an abort, and it is these boolean flags that are remotely read from other shards.]

In essence: there are two types of aborts that are possible in transaction processing systems: (1) Those that are caused by the state of the data and (2) Those that are caused by the system itself (e.g. failures or deadlocks). Category (1) can always be written in terms of conditional logic on the data as we did above. So if you can eliminate category (2) aborts, the commit protocol can be eliminated.

So now, all we have to do is explain how to eliminate category (2) aborts.

Removing system-induced aborts

I have spent almost an entire decade designing systems that do not allow system-induced aborts. Examples of such systems are Calvin, CalvinFS, Orthrus, PVW, and a system that processes transactions lazily. The impetus for this feature came from the first of these projects --- Calvin --- because of its status of being a deterministic database system. A deterministic database guarantees that there is only one possible final state of the data in the database given a defined set of input requests. It is therefore possible to send the same input to two distinct replicas of the system and be certain that the replicas will process this input independently and end up in the same final state, without any possibility of divergence.

System-induced aborts such as system failure or concurrency control race conditions are fundamentally nondeterministic events. It is very possible that one replica will fail or enter a race condition while the other replica will not. If these nondeterministic events were allowed to result in an a transaction to abort, then one replica may abort a transaction while the other one would commit --- a fundamental violation of the deterministic guarantee. Therefore, we had to design Calvin in a way that failures and race conditions cannot result in a transaction to abort. For concurrency control, Calvin used pessimistic locking with a deadlock avoidance technique that ensured that the system would never get into a situation where it had to abort a transaction due to deadlock. In the face of a system failure, Calvin did not pick up a transaction exactly where it left off (because of the loss of volatile memory during the failure). Nonetheless, it was able to bring the processing of that transaction to completion without having to abort it. It accomplished this via restarting the transaction from the same original input.

Neither of these solutions --- neither deadlock avoidance nor transaction restart upon a failure --- are limited to being used in deterministic database systems. [Transaction restart gets a little tricky in nondeterministic systems if some of the volatile state associated with a transaction that was lost during a failure was observed by other machines that did not fail. But there are simple ways to solve this problem that are out of scope for this post.] Indeed, some of the other systems I linked to above are nondeterministic systems. Once we realized the power that comes with removing system-level aborts, we built this feature into every system we built after the Calvin project --- even the nondeterministic systems.


I see very little benefit in system architects making continued use of 2PC in sharded systems moving forward. I believe that removing system-induced aborts and rewriting state-induced aborts is the better way forward. Deterministic database systems such as Calvin or FaunaDB  always remove system-induced aborts anyway, and thus usually can avoid 2PC as we described above. But it is a huge waste to limit this benefit to only deterministic databases. It is not hard to remove system-induced aborts from nondeterministic systems. Recent projects have shown that it is even possible to remove system-induced aborts in systems that use concurrency control techniques other than pessimistic concurrency control. For example, both the PVW and the lazy transaction processing systems we linked to above use a variant of multi-versioned concurrency control. And FaunaDB uses a variant of optimistic concurrency control.

In my opinion there is very little excuse to continue with antiquated assumptions regarding the need for system-induced aborts in the system. In the old days when systems ran on single machines, such assumptions were justifiable. However, in modern times, where many systems need to scale to multiple machines that can fail independently of each other, these assumptions require expensive coordination and commit protocols such as 2PC. The performance problems of 2PC has been a major force behind the rise of non-ACID compliant systems that give up important guarantees in order to achieve better scalability, availability, and performance. 2PC is just too slow --- it increases the latency of all transactions --- not just by the length of the protocol itself, but also by preventing transactions that access the same set of data from running concurrently. 2PC also limits scalability (by reducing concurrency) and availability (the blocking problem we discussed above). The way forward is clear: we need to reconsider antiquated assumptions when designing our systems and say “good-bye” to two phase commit!

Friday, December 14, 2018

Partitioned consensus and its impact on Spanner’s latency

In a post that I published in September, I described two primary approaches for performing consensus in distributed systems, and how the choice of approach affects the consistency guarantees of the system. In particular, consensus can either be unified, such that all writes in the system participate in the same distributed consensus protocol, or it can be partitioned, such that different subsets of the data participate in distinct consensus protocols.

The primary downside of partitioned consensus was that achieving global consistency is much more complicated. Consistency guarantees require that requests submitted after previous requests complete will never “go back in time” and view a state of the system that existed prior to the completed request. Such guarantees are hard to enforce in partitioned consensus systems since different partitions operate independently from each other: Consistency requires a notion of “before” and “after” --- even for events on separate machines or separate partitions. For partitions that operate completely independently, the most natural way to define “before” and “after” is to use real time --- the time on the clocks of the different partitions. However, clocks tend to skew at the millisecond granularity, and keeping clocks in sync is nontrivial. We discussed how Google has a hardware solution that aids in clock synchronization, while other solutions attempt to use software-only clock synchronization algorithms.

In contrast, unified consensus results in a global order of all requests. This global order can be used to implement the notion of “before” and “after” without having to rely on local clock time, which entirely avoids the need for clock synchronization. This results in stronger consistency guarantees: unified consensus systems can guarantee consistency at all times, while partitioned consensus systems can only guarantee consistency if the clock skew stays within an expected bound. For software-only implementations, it is hard to avoid occasionally violating the maximum clock skew bound assumption, and the violations themselves may not be discoverable. Therefore, unified consensus is the safer option.

The post led to several interesting debates, most of which are beyond the scope of this post. However, there was one interesting debate I’d like to explore more deeply in this post. In particular, the question arose: Are there any fundamental latency differences between unified-consensus systems and partitioned-consensus systems? When I read the comments to my post (both on the post itself and also external discussion boards), I noticed that there appears to be a general assumption amongst my readers that unified consensus systems must have higher latency than partitioned consensus systems. One reader even accused me of purposely avoiding discussing latency in that post in order to cover up a disadvantage of unified consensus systems. In this post, I want to clear up some of the misconceptions and inaccurate assumptions around these latency tradeoffs, and present a deeper (and technical) analysis on how these different approaches to consensus have surprisingly broad consequences on transaction latency. We will analyze the latency tradeoff from three perspectives: (1) Latency for write transactions, (2) Latency for linearizable read-only transactions and (3) Latency for serializable snapshot transactions.

Latency for write transactions

The debate around write transactions is quite interesting since valid arguments can be made for both sides.

The partitioned-consensus side points out two latency downsides of unified consensus: (1) As mentioned in my previous post, in order to avoid scalability bottlenecks, unified consensus algorithms perform consensus batch-at-a-time instead of on individual requests. They, therefore, have to pay the additional latency of collecting transactions into batches prior to running consensus. In the original Calvin paper, batch windows were 10ms (so the average latency would be 5ms); however, we have subsequently halved the batch window latency in my labs at Yale/UMD. FaunaDB (which uses Calvin’s unified consensus approach) also limits the batch window to 10ms. (2) For unified consensus, there will usually be one extra network hop to get the request to the participant of the consensus protocol for its local region. This extra hop is local, and therefore can be assumed to take single-digit milliseconds.  If you combine latency sources (1) and (2), the extra latency incurred by the preparation stage for unified consensus is approximately 10-15ms.

On the other hand, the unified-consensus side points out that multi-partition atomic write transactions require two-phase commit for partitioned-consensus systems. For example, let’s say that a transaction writes data in two partitions: A and B. In a partitioned-consensus system, the write that occurred in each partition achieves consensus separately. It is very possible that the consensus in partition A succeeds while in B it fails. If the system guarantees atomicity for transactions, then the whole transaction must fail, which requires coordination across the partitions --- usually via two-phase commit. Two-phase commit can result in significant availability outages (if the coordinator fails at the wrong time) unless it runs on top of consensus protocols. Thus Spanner and Spanner-derivative systems all run two-phase commit over partitioned consensus for multi-partition transactions. The latency cost of the Raft/Paxos protocol itself (once it gets started) is the same for unified and partitioned consensus, but two-phase commit  requires two rounds of consensus to commit such transactions. A single round of consensus may take between 5ms and 200ms, depending on how geographically disperse the deployment is. Since Spanner requires two sequential rounds, the minimum transaction latency is double that --- between 10ms for single-region deployments to 400ms for geographically disperse deployments.

In practice, this two-phase commit also has an additional issue: a transaction cannot commit until all partitions vote to commit. A simple majority is not sufficient --- rather every partition must vote to commit. A single slow partition (for example, a partition undergoing leader election) stalls the entire transaction. This increases the observed long tail latency in proportion to transaction complexity.

In contrast, unified consensus systems such as Calvin and its derivatives such as FaunaDB do not require two-phase commit. [Side note: a discussion of how to avoid two-phase commit in a unified consensus system can be found in this VLDB paper. FaunaDB’s approach is slightly different, but the end result is the same: no two-phase commit.] As a result, unified consensus systems such as Calvin and FaunaDB only require one round of consensus to commit all transactions --- even transactions that access data on many different machines.

The bottom line is that the better latency option between unified or partitioned consensus for write transactions is somewhat workload dependent. Unified consensus increases latency for all transactions by a little, but partitioned consensus can increase latency by a lot more for multi-partition transitions. For most applications, it is impossible to avoid multi-partition transactions. For example, many applications allow transactional interactions between arbitrary users (payments between users, exchanges between users, “friendship” status updates between users, gaming interactions, etc.). Although it is possible to group users into partitions such that many of their interactions will be with other users within that partition (e.g. partition by a user’s location), as long as arbitrary interactions are allowed, there will always be interactions between users in different partitions. These multi-partition interactions are much slower in partitioned-consensus systems. Thus, for most workloads, unified-consensus is the better latency option for write transactions.

Latency for linearizable read-only transactions

Linearizable read-only transactions are generally sent to the consensus leader’s region and performed (or at least receive a timestamp) there [other options exist, but this is what Spanner and other systems mentioned in my previous post do]. In unified-consensus, there is only one leader region for the whole system. Linearizable read transactions that initiate from near this region will be processed with low latency, while transactions that initiate from farther away will observe correspondingly higher latency.

Meanwhile, in partitioned-consensus, there is one leader per partition, and these leaders can be located in different regions. The partitioned-consensus supporters argue that this can lead to lower latency in an array of common scenarios. An application developer can specify a location-based partitioning algorithm. All data that is usually accessed from region X should be placed in the same partition, with a consensus leader located in region X. All data that is usually accessed from region Y should be placed in the same partition, with a consensus leader located in region Y. In doing so, a larger number of read-only transactions will observe lower latency.

The downside of this approach is that it breaks the abstraction of the consensus protocol as a separate component of the system --- now the consensus protocol and data storage layer become much more intertwined, increasing the monolithicity of the system. Furthermore, consensus protocols run leader elections after a lease expires, and would have to reduce the democratic nature of this protocol in order to ensure the leader remains in the closest possible region. Finally, it increases complexity and reduces the flexibility of the partitioning protocol. As far as I know, the most well-known example of a partitioned-consensus system --- Spanner --- does not take advantage of this potential optimization for these reasons.

Consequently, although in theory, there is a potential latency advantage for partitioned-consensus systems for linearizable read-only transactions, in practice this advantage is not realized.

On the other hand, there is a fundamental latency advantage for unified-consensus systems in the presence of multi-partitioned transactions. A multi-partition transaction in a partitioned-consensus system must involve more than one leader. The leaders of each partition accessed by the read transaction must communicate with each other in order to figure out a timestamp at which this linearizable read can be safely performed (see sections 4.1.4 and 4.2.2 of the Spanner paper). Alternatively, a timestamp sufficiently into the future (at the end of the clock skew uncertainty bound) can be chosen at which to perform the read. Either way ---- whether it is communication across leaders (that may be located in different geographic regions) or whether it is waiting until the end of the clock skew uncertainty bound --- multi-partition reads pay an additional latency cost in order to ensure linearizability. In contrast, unified consensus systems have only a single leader region, and can perform linearizable reads across the servers in this region, without any communication with other regions or waiting for clock skew uncertainty windows to close.

Latency for serializable snapshot read-only transactions

Many applications --- even if they require linearizable write transactions --- do not require linearizable read-only transactions, and instead are content to read from a recent snapshot of the database state. However, such snapshot reads must be serializable --- they should reflect the database state as of a particular transaction in the serial order, and none of the writes from transactions after that transaction.

Recall that transactions may write data on different machines / partitions. For example, a transaction, T, may write data on partition A and partition B. A serializable snapshot that includes data from both partition A and partition B must therefore include both of T’s writes to those partitions, or neither. In particular, it should include both of T’s writes if the snapshot is as of a point in time “after” T, and neither of T’s writes if the snapshot is as of a point in time “before” T. Note that this notion of “point in time” must exist --- even across partitions.Therefore, once again, there needs to be a global notion of “before” and “after” --- even for writes across different machines. As long as this notion of “before” and “after” exists, such snapshot reads can be sent to any replica to be processed there, and do not require any consensus or interaction with consensus leader regions. This is critically important to support low latency reads in a geographically disperse deployment.

As mentioned in the introductory paragraph of this post, both unified- and partitioned-consensus systems are capable of generating global notions of “before” and “after”, and thus both types of systems are able to achieve the performance advantage of being able to perform these reads from any replica. However, as we mentioned above, unified-consensus systems can achieve this global notion of “before” and “after” without any clock synchronization, while partitioned-consensus systems use clock synchronization. Thus, unified-consensus can always achieve correct serializable snapshot reads, while partitioned-consensus can only achieve the same result if the maximum clock skew bound assumption is not violated.


The latency debate between unified vs. partitioned consensus is an intricate one. However, it is clear that multi-partition transactions exacerbate the disadvantages of partitioned-consensus transactions in (at least) three dimensions:

  1. Multi-partition transactions require two-phase commit on top of the consensus protocol in partitioned-consensus systems.  In many deployments, consensus across replicas is the latency bottleneck. By requiring two-phase commit on top of consensus, partitioned-consensus systems result in (approximately) double the latency (relative to unified-consensus) in such deployments, and higher long tail latency.
  2. Multi-partition transactions require communication across leaders or waiting out clock skew uncertainty bounds for linearizable transactions --- even for linearizable read-only transactions.
  3. Partitioned-consensus systems require clock synchronization in order to achieve low latency snapshot reads (in addition to all linearizable operations). Any incorrect assumptions of the maximum clock skew across machines may result in serializability violations (and thus incorrect results being returned).

As we discussed above, it is usually impossible to avoid multi-partition transactions in most real-world applications. Furthermore, as an application scales, the number of partitions must increase, and thus the probability of multi-partition transactions is also likely to increase. Therefore, the disadvantages of partitioned-consensus systems relative to unified-consensus systems accentuate as the application scales.

(Daniel Abadi is an advisor at FaunaDB)