Monday, July 15, 2019

The dangers of conditional consistency guarantees

The ease of writing an application on top of infrastructure that guarantees consistency cannot be overstated. It is simply much easier to write an application when you can be sure that every read will return the most recent state of the data --- no matter where in the world writes to that data occur, and no matter what types of failure may occur.

Unfortunately, consistency in distributed systems comes at a cost. Other desirable system properties may be traded off in order to uphold a consistency guarantee. For example, it is well known (see the CAP theorem) that in the event of a network partition, it is impossible to guarantee both consistency and availability. One must be traded off for the other. Furthermore, in a system deployed across a wide geography, the laws of physics prevent the communication required to maintain consistency from happening instantaneously. This leads to the PACELC tradeoff: when there is a network partition (the P in PACELC), the system must decide how it will trade off availability (A) for consistency (C). Else (E) --- during normal operation of the system --- the system must decide how it will trade off latency (L) for consistency (C).

There are four points in the PACELC tradeoff space: (1) PA/EC (when there is a partition, choose availability; else, choose consistency), (2) PC/EC (choose consistency at all times), (3) PA/EL (prioritize availability and latency over consistency), and (4) PC/EL (when there is a partition, choose consistency; else, choose latency).

If you were to ask a student who just learned about the CAP theorem in a distributed systems class what guarantees they would want from a distributed system deployed for an application that needs 99.999% availability, the student would likely answer: “That’s easy! I would choose a system that prioritizes availability over consistency in the event of a network partition. But in the absence of a network partition, it is possible for a system to guarantee both availability and consistency, so I would want both those guarantees in the system I would deploy.”

This answer is natural and obvious. But at the same time, it is naive and incorrect.

The system that the student described would be a PA/EC system from PACELC --- a system that does not guarantee consistency during a network partition, but otherwise guarantees consistency. There are many applications that need several “nines” of availability. If the right PACELC system for those applications were PA/EC, there would be many vendors competing with each other, attempting to sell the best PA/EC system. Yet in practice, the opposite is true --- it is extremely rare to find PA/EC systems in the wild. In fact, it took me several years, until I came across the IMDG (in-memory data grid) market, to find examples of real PA/EC systems.

The reason for this is that PA/EC systems are much less useful than one might initially expect. To understand why this is the case, one must understand the fundamental difference between a real guarantee and a conditional guarantee.


Real guarantees vs. conditional guarantees


Of the three properties that we’ve mentioned thus far as being desirable in a distributed system --- availability, consistency, and latency --- only one of them can ever be a real guarantee: consistency. No system will ever guarantee 100% availability, nor will any system guarantee that 100% of requests will complete within a particular latency bound. Such guarantees would be impossible to uphold. A “PA” system in PACELC prioritizes availability, but cannot guarantee 100% availability. Similarly, an “EL” system in PACELC prioritizes latency, but does not make any particular 100% guarantees.

In contrast, consistency is a real guarantee. Systems that guarantee consistency are actually capable of upholding the promise of 100% consistency, and many systems do in fact provide it.

Real guarantees form the foundation of the success of modern computing. Computers are extremely complex systems. Even a simple program must involve a large number of complex components --- from registers and logic gates to cache and memory to operating systems and security all the way up to the runtime environment of the program. A simple “hello world” program ought to take weeks to develop. The reason why it doesn’t is the power of “abstraction” in computer science. Each level of a computer system builds on top of an underlying level that abstracts away the complex details of its implementation. Instead, the underlying level presents an interface to the higher level, along with inviolable guarantees about what is returned through this interface.

For example, when a program reads from a particular place in memory, it can assume that it will see the data that was last written there. In reality, the processor inside a computer may choose to execute instructions within a program out of order. Furthermore, electric or magnetic interference may cause bits to spontaneously flip in memory. In practice, complicated solutions are used to detect and correct such bit corruption, including redundant memory bits, additional circuitry, and error correcting codes. All of this is hidden from the higher-level program that wants to read this location in memory. The higher-level program assumes that it will receive the data that was most recently written to that location 100% of the time. [Side note: In theory, all of the error correcting logic in storage systems still cannot yield an actual 100% guarantee. However, the chances of a corrupted data value being returned to the end user are so infinitesimally small that this guarantee is still treated like 100% for all practical purposes.]

Imagine the extra complexity that would be involved in writing a program if you could not assume that instructions from a program are processed in the specified order, or that the data in memory has not been corrupted. Almost every line of code that reads data from a variable would have to be surrounded by ‘if-statements’ that check for possible corruption in that particular context and that specify what should be done if the read appears to be incorrect. This extra code would likely increase the development cycle for a program by one to two orders of magnitude!
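To make this concrete, here is a small hypothetical sketch (the memory array and the “plausibility” check are purely illustrative, not something real programs need today) contrasting a read that can rely on a guarantee with one that cannot:

  // Hypothetical sketch: the cost of reading without a 100% guarantee.
  class DefensiveReadSketch {
      static final long MAX_PLAUSIBLE_BALANCE = 1_000_000L;

      // With a full guarantee: a read is just a read.
      static long readWithGuarantee(long[] memory, int addr) {
          return memory[addr];
      }

      // Without a guarantee: every read carries context-specific defensive logic.
      static long readWithoutGuarantee(long[] memory, long[] redundantCopy, int addr) {
          long value = memory[addr];
          if (value < 0 || value > MAX_PLAUSIBLE_BALANCE) {
              // Fall back to a redundant copy (an illustrative recovery strategy).
              value = redundantCopy[addr];
          }
          return value;
      }
  }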

For a program to be “bug-free”, this guarantee must be 100%. If a read returns the correct result 99% of the time, or even 99.99% or 99.9999% of the time, the above-described ‘if-statements’ and conditional logic will still be necessary. Only if the program can assume that the correct result will always be returned can the extra conditional logic be avoided. Thus, from the perspective of the developer, there is an enormous difference between a complete guarantee and an incomplete guarantee. Within the space of incomplete guarantees, it is certainly the case that 99.9999% accuracy is better than 99% accuracy. But the big jump in complexity savings for the developer comes between anything less than 100% and a 100% guarantee.

This is why a PA/EC system has limited utility to an application developer. A PA/EC system is a system that guarantees consistency in all cases … except when there is a network partition. In other words, a PA/EC system makes a conditional guarantee of consistency: it guarantees consistency as long as a particular condition is met --- that there is no network partition.

However, we just said that the big jump in complexity savings for the developer is the jump between anything less than 100% and a 100% guarantee. Any guarantee that comes with conditions that are out of the control of the developer (such as ensuring that there will never be a network partition) is not that helpful. The developer still has to write code to handle the cases where the guarantee is not upheld.

I am not arguing that all systems must guarantee consistency. I am only pointing out that there is a big difference in developer complexity between developing on top of a system that guarantees consistency fully, and one that guarantees consistency only under certain conditions. Without a full guarantee of consistency, there is a significant amount of extra effort required to ensure the absence of corner-case bugs that may emerge days, weeks, or even years after the application is first deployed.


Conditional guarantees and PACELC


We just explained that a PA system in PACELC results in a large amount of extra effort for the application developer, in order to ensure that the application remains bug-free even when inconsistent reads occur during a network partition. If so, during normal operation, when there is a tradeoff between consistency and latency, why choose consistency? All of the effort required to build an application that does not break in the face of inconsistent reads has already been expended. Why not benefit from that effort during normal operation and get better latency as well?

All of this suggests that application developers should decide whether they want to develop on top of a system that guarantees consistency or not. If they want consistency, they should deploy on top of a PC/EC system in PACELC --- i.e., a system that guarantees consistency 100% of the time. If they are willing to go to all of the extra effort to build on top of a system that does not guarantee consistency, they should be looking to get as many benefits as possible from this effort, and should deploy on top of a PA/EL system, thereby getting better availability and latency properties (even if those properties do not come with guarantees).


New consistency guarantees in Hazelcast


As I pointed out in a previous post, when deploying within a single geographic region, the latency vs. consistency tradeoff is not significant. In such environments, there is not much difference between PA/EC and PA/EL systems. Since in-memory data grids are typically deployed in a single region, PA/EC became the standard PACELC option for the IMDG market.

Nonetheless, it is interesting to note how Hazelcast --- one of the most widely used IMDG implementations --- is moving away from the PA/EC default configuration and towards the more natural PC/EC and PA/EL pairings.

In 2017, I wrote an in-depth analysis of Hazelcast’s consistency and availability guarantees. At around the same time, Kyle Kingsbury published a linearizability analysis of Hazelcast using his Jepsen testing library. These analyses led to what Hazelcast CEO Greg Luck tells me was three person-years of effort to release the first PC/EC configuration option in Hazelcast.

A quick reminder about what Hazelcast does: Many Java programs store program state inside Java collections and data structures such as Queue, Map, or Multimap. As a program scales --- both in terms of the number of clients accessing data and in terms of the amount of data stored --- these data structures may get too large to fit in memory on a single server. Hazelcast provides a distributed implementation of these popular data structures, thereby enabling programs that leverage them to scale. Java programs interact with Hazelcast identically to how they interact with the local Java data structures. However, under the covers, Hazelcast distributes and replicates them across a cluster of machines.
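For instance, a minimal sketch of the switch from a local map to a Hazelcast distributed map might look like the following (assuming the Hazelcast 3.x API and a reachable cluster; the map name “orders” is illustrative):

  import com.hazelcast.core.Hazelcast;
  import com.hazelcast.core.HazelcastInstance;

  import java.util.Map;

  public class DistributedMapSketch {
      public static void main(String[] args) {
          // Start (or join) a Hazelcast member; in production this would be a multi-node cluster.
          HazelcastInstance hz = Hazelcast.newHazelcastInstance();

          // The returned map is partitioned and replicated across the cluster,
          // but is used just like an ordinary java.util.Map.
          Map<String, Integer> orders = hz.getMap("orders");
          orders.put("order-1", 42);
          System.out.println(orders.get("order-1"));

          hz.shutdown();
      }
  }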

Hazelcast also provides distributed implementations of Java concurrency structures such as AtomicLong, AtomicReference, and Lock. Using such structures without a consistency guarantee is downright dangerous. For example, in situations where consistency is not upheld, it is possible for multiple different processes to acquire the same lock (concurrently), or for non-atomic actions to occur on supposedly atomic data structures. In other words, a lack of consistency violates the fundamental premise of these backbone Java concurrency structures.

With the PC/EC configuration option, you can now deploy Hazelcast’s distributed versions of these concurrency tools --- IAtomicLong, IAtomicReference, and ILock --- and get correctness guarantees identical to what their single-server counterparts provide in a regular Java program.
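As a rough sketch (assuming the CP Subsystem API through which recent Hazelcast releases expose these PC/EC structures; the name “request-counter” is illustrative), using one of them looks like this:

  import com.hazelcast.core.Hazelcast;
  import com.hazelcast.core.HazelcastInstance;

  public class CPAtomicLongSketch {
      public static void main(String[] args) {
          HazelcastInstance hz = Hazelcast.newHazelcastInstance();

          // The CP Subsystem keeps this structure in a Raft group, so the increment
          // below is linearizable, just like AtomicLong.incrementAndGet() in a
          // single-server Java program.
          long value = hz.getCPSubsystem()
                         .getAtomicLong("request-counter")
                         .incrementAndGet();
          System.out.println("counter = " + value);

          hz.shutdown();
      }
  }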

Hazelcast’s PC/EC configuration performs distributed agreement on data structure state via the Raft consensus protocol. Raft is known for achieving good availability: even when a network partition occurs, the majority side of the partition can remain available (the consistency guarantee only requires that the minority side lose availability).

Separately from the consistency issue, another well-known problem with distributed implementations of lock data structures is liveness. A server can acquire the lock and then fail for an indefinite period of time while holding it. Meanwhile, the rest of the distributed system remains alive, but is unable to acquire the lock and make progress. Therefore, in practice, distributed locks typically include timeouts, in order to prevent servers that fail while holding the lock from causing the rest of the system to stall indefinitely. Unfortunately, as soon as it is possible for a lock to time out, it is possible for a process to be timed out while holding the lock, and yet be temporarily unaware that the lock timed out. In such a scenario, it is possible for two different processes to concurrently believe they hold the lock, and to perform the concurrent side effects that the lock was designed to prevent. Hazelcast’s distributed lock implementation includes a fencing mechanism that enables downstream code --- code that is designed to be run while locks are being held --- to participate in the locking protocol and to distinguish which process actually holds the lock when multiple processes concurrently believe that they do.
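A minimal sketch of how such fencing can be used (assuming the FencedLock API from Hazelcast’s CP Subsystem; the lock name and the downstream checkAndApply helper are illustrative):

  import com.hazelcast.core.Hazelcast;
  import com.hazelcast.core.HazelcastInstance;
  import com.hazelcast.cp.lock.FencedLock;

  public class FencingSketch {
      // Illustrative downstream resource that remembers the largest fence it has seen.
      static long highestFenceSeen = Long.MIN_VALUE;

      static synchronized void checkAndApply(long fence, Runnable sideEffect) {
          // Reject work from a process whose lock has already been superseded.
          if (fence < highestFenceSeen) {
              throw new IllegalStateException("stale lock holder (fence " + fence + ")");
          }
          highestFenceSeen = fence;
          sideEffect.run();
      }

      public static void main(String[] args) {
          HazelcastInstance hz = Hazelcast.newHazelcastInstance();
          FencedLock lock = hz.getCPSubsystem().getLock("inventory-lock");

          long fence = lock.lockAndGetFence();  // a monotonically increasing fencing token
          try {
              checkAndApply(fence, () -> System.out.println("protected work"));
          } finally {
              lock.unlock();
          }
          hz.shutdown();
      }
  }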

As for the PA/EL option in Hazelcast, it has recently added PN Counters --- a CRDT counter implementation that enables some correctness guarantees despite the potential for inconsistent reads. In particular, when there is no consistency guarantee, it is possible for a server to read a stale value of the counter. For example, process A may update the value of a counter from 4 to 5. Afterwards, process B sees the stale value of the counter (4) and also adds one to it (to get 5). In versions of Hazelcast prior to 3.10, the final value of the counter --- even after whatever event caused the inconsistency is resolved --- would be 5. In other words, one of the two increments of the counter would be lost. With PN Counters (introduced in Hazelcast 3.10), the counter state will eventually converge to the correct value (in this case, 6 --- the value after two increments to an initial value of 4). PN Counters also support decrements of the counter (in addition to increments), and guarantee (1) read-your-own-writes and (2) monotonic reads, in addition to (3) what I described above --- the eventual convergence of the counter value to the correct value.
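A quick sketch of the counter example above in code (assuming Hazelcast 3.10+; the counter name “widgets” is illustrative):

  import com.hazelcast.core.Hazelcast;
  import com.hazelcast.core.HazelcastInstance;
  import com.hazelcast.crdt.pncounter.PNCounter;

  public class PNCounterSketch {
      public static void main(String[] args) {
          HazelcastInstance hz = Hazelcast.newHazelcastInstance();

          // A PN counter accepts increments and decrements on any replica;
          // replicas converge to the correct total once they exchange state.
          PNCounter counter = hz.getPNCounter("widgets");
          counter.addAndGet(4);       // the initial value from the example above
          counter.incrementAndGet();  // process A's increment
          counter.incrementAndGet();  // process B's increment
          System.out.println(counter.get());  // eventually 6 on every replica

          hz.shutdown();
      }
  }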

Furthermore, inconsistent reads in Hazelcast prior to version 3.10 --- under PA/EC or PA/EL configurations --- could potentially result in duplicate IDs being generated. Hazelcast 3.10 introduced Flake ID generators, which guarantee globally unique IDs even in the absence of a global consistency guarantee.
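For example (a sketch assuming Hazelcast 3.10+; “order-ids” is an illustrative generator name):

  import com.hazelcast.core.Hazelcast;
  import com.hazelcast.core.HazelcastInstance;
  import com.hazelcast.flakeidgen.FlakeIdGenerator;

  public class FlakeIdSketch {
      public static void main(String[] args) {
          HazelcastInstance hz = Hazelcast.newHazelcastInstance();

          // IDs combine a timestamp, a per-member node id, and a sequence number,
          // so uniqueness does not require cluster-wide coordination at call time.
          FlakeIdGenerator ids = hz.getFlakeIdGenerator("order-ids");
          System.out.println(ids.newId());

          hz.shutdown();
      }
  }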


Conclusion


PA/EC systems sound good in theory, but are not particularly useful in practice. Our one example of a real PA/EC system --- Hazelcast --- has spent the past year and a half introducing features that are designed for alternative PACELC configurations --- specifically, PC/EC and PA/EL configurations. PC/EC and PA/EL configurations are a more natural cognitive fit for an application developer. Either the developer can be certain that the underlying system guarantees consistency in all cases (the PC/EC configuration), in which case the application code can be significantly simplified, or the system makes no guarantees about consistency at all (the PA/EL configuration) but promises high availability and low latency. CRDTs and globally unique IDs can still provide limited correctness guarantees despite the lack of consistency guarantees in PA/EL configurations.

Friday, June 28, 2019

Correctness Anomalies Under Serializable Isolation



Most database systems support multiple isolation levels that enable their users to trade off exposure to various types of application anomalies and bugs for (potentially small) increases in potential transaction concurrency. For decades, the highest level of “bug-free correctness” offered by commercial database systems was “SERIALIZABLE” isolation, in which the database system runs transactions in parallel, but in a way that is equivalent to running them one after the other. This isolation level was considered “perfect” because it enabled users that write code on top of a database system to avoid having to reason about bugs that could arise due to concurrency. As long as the transaction code is correct in isolation --- in the sense that, if nothing else is running at the same time, the transaction takes the database from one correct state to another correct state (where “correct” is defined as not violating any semantics of the application) --- serializable isolation guarantees that concurrently running transactions cannot cause race conditions that would allow the database to reach an incorrect state. In other words, serializable isolation generally allows an application developer to avoid having to reason about concurrency, and to focus only on making single-threaded code correct.

In the good old days of a “database server” running on a single physical machine, serializable isolation was indeed sufficient, and database vendors never attempted to sell database software with stronger correctness guarantees than SERIALIZABLE. However, as distributed and replicated database systems have proliferated over the last few decades, anomalies and bugs have started to appear in applications even when running over a database system that guarantees serializable isolation. As a consequence, database system vendors started to release systems with stronger correctness guarantees than serializable isolation, which promise a lack of vulnerability to these newer anomalies. In this post, we will discuss several well-known bugs and anomalies in serializable distributed database systems, and modern correctness guarantees that ensure avoidance of these anomalies.

What does “serializable” mean in a distributed/replicated system?


We defined “serializable isolation” above as a guarantee that even though a database system is allowed to run transactions in parallel, the final result is equivalent to running them one after the other. In a replicated system, this guarantee must be strengthened in order to avoid anomalies that would only occur at lower levels of isolation in non-replicated systems. For example, let’s say that the balance of Alice’s checking account ($50) is replicated so that the same value is stored in data centers in Europe and the United States. Many systems do not replicate data synchronously over such large distances. Rather, a transaction will complete at one region first, and its update to the database will be replicated afterwards. If a withdrawal of $20 is made concurrently in the United States and Europe, the old balance ($50) is read in both places, $20 is removed from it, and the new balance ($30) is written back in both places and replicated to the other data center. This final balance is clearly incorrect --- it should be $10 --- and was caused by concurrently executing transactions. But the truth is, the same outcome could happen if the transactions were serial (one after the other), as long as the replication is not included as part of the transaction (but rather happens afterwards). Therefore, a concurrency bug results despite equivalence to a serial order.

Rony Attar, Phil Bernstein, and Nathan Goodman expanded the concept of serializability in 1984 to define correctness in the context of replicated systems. The basic idea is that all the replicas of a data item behave like a single logical data item. When we say that a concurrent execution of transactions is “equivalent to processing them in a particular serial order”, this implies that whenever a data item is read, the value returned will be the most recent write to that data item by a previous transaction in the (equivalent) serial order --- no matter which copy was written by that write. In this context, “most recent write” means the write by the closest (previous) transaction in that serial order. In our example above, either the withdrawal in Europe or the withdrawal in the US will be ordered first in the equivalent serial order. Whichever transaction is second must --- when it reads the balance --- read the value written by the first transaction. Attar et al. named this guarantee “one copy serializability”, or “1SR”, because the isolation guarantee is equivalent to serializability in an unreplicated system with “one copy” of every data item.

Anomalies are possible under serializability; anomalies are possible under one-copy serializability


We just stated that one-copy serializability in replicated systems provides the same isolation guarantee as serializability in unreplicated systems. There are many, many database systems that offer an isolation level called “serializability”, but very few (if any) replicated database systems that offer an isolation level called “one-copy serializability”. To understand why this is the case, we need to explain some challenges in writing bug-free programs on top of systems that “only” guarantee serializability.

A serializable system only guarantees that transactions will be processed in an equivalent way to some serial order. The serializability guarantee by itself doesn’t place any constraints on what this serial order is. In theory, one transaction can run and commit. Another transaction can come along --- a long period of time after the first one commits --- and be processed in such a way that the resulting equivalent serial order places the later transaction before the earlier one. In a sense, the later transaction “time travels” back in time, and is processed such that the final state of the database is equivalent to that transaction running prior to transactions that completed prior to when it began. A serializable system doesn’t prevent this. Nor does a one-copy serializable system. Nonetheless, in single-server systems, it is easy and convenient to prevent time-travel. Therefore, the vast majority of single-server systems that guarantee “serializability” also prevent time-travel. Indeed, it was so trivial to prevent time travel that most commercial serializable systems did not consider it notable enough to document their prevention of this behavior. 

In contrast, in distributed and replicated systems, it is far less trivial to guarantee a lack of time travel, and many systems allow some forms of time travel in their transaction processing behavior. 

The next few sections describe some forms of time-travel anomalies that occur in distributed and/or replicated systems, and the types of application bugs that they may cause. All of these anomalies are possible under a system that only guarantees one-copy serializability. Therefore, vendors typically document which of these anomalies they do and do not allow, thereby potentially guaranteeing a level of correctness higher than one-copy serializability. At the end of this post, we will classify different correctness guarantees and which time-travel anomalies they allow.

The immortal write


Let’s say the user of an application currently has a display name of “Daniel”, and decides to change it to “Danny”. He goes to the application interface, and changes his display name accordingly. He then reads his profile to ensure the change took effect, and confirms that it has. Two weeks later, he changes his mind again, and decides he wants to change his display name to “Danger”. He goes to the interface, changes his display name accordingly, and is told that the change was successful. But when he performs a read on his profile, it still lists his name as “Danny”. He can go back and change his name a million times. Every time, he is told the change was successful, but the value of his display name in the system remains “Danny”.
What happened? All of the future writes of his name travelled back in time to a point in the serial order directly before the transaction that changed his name to “Danny”. The “Danny” transaction therefore overwrote the value written by all of these other transactions, even though it happened much earlier than these other transactions in real time. The system decided that the serial order that it was guaranteeing equivalence to has the “Danny” transaction after all of the other name-change transactions --- it has full power to decide this without violating its serializability guarantee. [Side note: when the “Danny” transaction and/or the other name-change transactions also perform a read to the database as part of the same transaction as the write to the name, the ability to time-travel without violating serializability becomes much more difficult. But for “blind write” transactions such as these examples, time-travel is easy to accomplish.]

In multi-master asynchronously replicated database systems, where writes are allowed to occur at either replica, it is possible for conflicting writes to occur across the replicas. In such a scenario, it is tempting to leverage time travel to create an immortal blind write, which enables straightforward conflict resolution without violating the serializability guarantee. 

The stale read


The most common type of anomaly that appears in replicated systems but not in serializable single-server systems is the “stale read anomaly”. For example, Charlie has a bank account with $50 left in it. He goes to an ATM and withdraws $50. Afterwards, he asks for a receipt with his current bank balance. The receipt (incorrectly) states that he has $50 left in his account (when, in actuality, he now has no money left). As a result, Charlie is left with an incorrect impression of how much money he has, and may make real-life behavioral mistakes (for example, splurging on a new gadget) that he wouldn’t have made if he had the correct impression of the balance of his account. This anomaly happened as a result of a stale read: his account certainly used to have $50 in it. But when the ATM issued a read request on the bank database to get his current balance, that read did not reflect the write to his balance that happened a few seconds earlier when he withdrew money from his account.

The stale read anomaly is extremely common in asynchronously replicated database systems (such as read replicas in MySQL or Amazon Aurora). The write (the update to Charlie’s balance) gets directed to one copy, which is not immediately replicated to the other copy. If the read gets directed to the other copy before the new write has been replicated to it, it will see a stale value.
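To make the scenario concrete, here is a rough JDBC sketch (the connection URLs, table, and column names are illustrative assumptions): the withdrawal goes to the primary, while the balance printed on the receipt is read from an asynchronous replica that may not have applied the write yet.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.PreparedStatement;
  import java.sql.ResultSet;

  public class StaleReadSketch {
      public static void main(String[] args) throws Exception {
          // Illustrative endpoints: one primary and one asynchronous read replica.
          Connection primary = DriverManager.getConnection("jdbc:postgresql://primary/bank");
          Connection replica = DriverManager.getConnection("jdbc:postgresql://replica/bank");

          // The withdrawal is applied at the primary.
          try (PreparedStatement withdraw = primary.prepareStatement(
                  "UPDATE accounts SET balance = balance - 50 WHERE owner = 'Charlie'")) {
              withdraw.executeUpdate();
          }

          // The receipt's balance read is routed to the replica; if the update above
          // has not yet been replicated, this prints the stale value ($50).
          try (PreparedStatement read = replica.prepareStatement(
                  "SELECT balance FROM accounts WHERE owner = 'Charlie'");
               ResultSet rs = read.executeQuery()) {
              if (rs.next()) {
                  System.out.println("Receipt balance: " + rs.getLong("balance"));
              }
          }
      }
  }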




Stale reads do not violate serializability. The system is simply time-travelling the read transaction to a point in the equivalent serial order of transactions before the new writes to this data item occur. Therefore, asynchronously replicated database systems can allow stale reads without giving up their serializability (or even one-copy serializability) guarantees.

In a single-server system, there’s little motivation to read anything aside from the most recent value of a data item. In contrast, in a replicated system, the network round trips required for synchronous replication are time-consuming and expensive. It is therefore tempting to replicate asynchronously, since reads can be served from asynchronous read-only replicas without violating serializability (as long as the replicated data becomes visible in the same order in which it was originally written).

The causal reverse


In contrast to the stale read anomaly, the causal reverse anomaly can happen in any distributed database system and is independent of how replication is performed (synchronous or asynchronous). In the causal reverse anomaly, a later write which was caused by an earlier write, time-travels to a point in the serial order prior to the earlier write. In general, these two writes can be to totally different data items. Reads that occur in the serial order between these two writes may observe the “effect” without the “cause”, which can lead to application bugs.

For example, most banks do not exchange money between accounts in a single database transaction. Instead, money is removed from one account into a bank-owned account in one transaction. A second transaction moves the money from the bank-owned account to the intended destination account. The second transaction is caused by the first: if the first transaction didn’t succeed for any reason, the second transaction would never be issued.

Let’s say that $1,000,000 is being transferred from account A (which currently has $1,000,000 and will have $0 left after the transfer) to account B (which currently has $0, and will have $1,000,000 after the transfer). Let’s say that account A and account B are owned by the same entity, and this entity wishes to get a sizable loan that requires $2,000,000 as a down payment. In order to see if this customer is eligible for the loan, the lender issues a read transaction that reads the balances of accounts A and B and takes their sum. If this read transaction occurs in the serial order before the transfer of $1,000,000 from A to B, a total of $1,000,000 will be observed across the accounts. If this read transaction occurs after the transfer, a total of $1,000,000 will still be observed across the accounts. If this read transaction occurs between the two transactions involved in the transfer that we described above, a total of $0 will be observed across the accounts. In all three possible cases, the entity will be (correctly) denied the loan due to lack of funds for the down payment.
However, if the second transaction involved in the transfer (the one that adds $1,000,000 to account B) time-travels to a point before the transaction that caused it to exist in the first place (the one that subtracts $1,000,000 from account A), it is possible for a read transaction that occurs between these two writes to (incorrectly) observe a balance across the accounts of $2,000,000, and thereby allow the entity to secure the loan. Since the transfer was performed in two separate transactions, this example does not violate serializability. The equivalent serial order is: (1) the transaction that does the second part of the transfer, (2) the read transaction, and (3) the transaction that does the first part of the transfer. However, this example shows the potential for devastating bugs in application code if causative transactions are allowed to time-travel to a point in time before their cause.
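As a sketch of this scenario (illustrative table, column names, and connection handling; each conn.commit() ends one of the transactions described above):

  import java.sql.Connection;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class CausalReverseSketch {
      // Run by the bank: the transfer, split across two transactions as described above.
      static void transfer(Connection conn) throws Exception {
          conn.setAutoCommit(false);
          try (Statement stmt = conn.createStatement()) {
              // Transaction 1: move the money out of account A into a bank-owned account.
              stmt.executeUpdate("UPDATE accounts SET balance = balance - 1000000 WHERE id = 'A'");
              stmt.executeUpdate("UPDATE accounts SET balance = balance + 1000000 WHERE id = 'BANK'");
              conn.commit();

              // Transaction 2 (caused by transaction 1): move the money into account B.
              stmt.executeUpdate("UPDATE accounts SET balance = balance - 1000000 WHERE id = 'BANK'");
              stmt.executeUpdate("UPDATE accounts SET balance = balance + 1000000 WHERE id = 'B'");
              conn.commit();
          }
      }

      // Run concurrently by the lender. Under a causal reverse, this read may observe
      // transaction 2's effect on B without transaction 1's effect on A, and report
      // $2,000,000 even though the customer never had more than $1,000,000.
      static long loanCheck(Connection conn) throws Exception {
          try (Statement stmt = conn.createStatement();
               ResultSet rs = stmt.executeQuery(
                   "SELECT SUM(balance) AS total FROM accounts WHERE id IN ('A', 'B')")) {
              rs.next();
              return rs.getLong("total");
          }
      }
  }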

One example of a distributed database system that allows the causal reverse is CockroachDB (aka CRDB). CockroachDB partitions a database such that each partition commits writes and synchronously replicates data separately from other partitions. Each write receives a timestamp based on the local clock of one of the servers within that partition. In general, it is impossible to perfectly synchronize clocks across a large number of machines, so CockroachDB allows a maximum clock skew within which clocks across a deployment can differ. However, (unlike Google Spanner) CockroachDB does not wait out the maximum clock skew bound before committing a transaction. Therefore, it is possible in CockroachDB for a transaction to commit, and for a later transaction --- one that writes data to a different partition and was caused by the earlier one (it started after the earlier one finished) --- to receive an earlier timestamp than the earlier transaction. This enables a read (in CockroachDB’s case, this read has to be sent to the system before the two write transactions) to potentially see the write of the later transaction, but not the earlier one.

In other words, if the bank example we’ve been discussing was implemented over CockroachDB, the entity wishing to secure the loan could simply repeatedly make the loan request and then transfer money between accounts A and B until the causal reverse anomaly shows up, and the loan is approved. Obviously, a well-written application should be able to detect the repeated loan requests and prevent this hack from occurring. But in general, it is hard to predict all possible hacks and write defensive application code to prevent them. Furthermore, banks are not usually able to recruit elite application programmers, which leads to some mind-boggling vulnerabilities in real-world applications. 

Avoiding time travel anomalies


All of the anomalies that we’ve discussed so far --- the immortal write, the stale read, and the causal reverse --- exploit the permissibility of time travel in the serializability guarantee, and thereby introduce bugs in application code. To avoid these bugs, the system needs to guarantee that transactions are not allowed to travel in time, in addition to guaranteeing serializability. As we mentioned above, single-server systems generally make this time-travel guarantee without advertising it, since the implementation of this guarantee is trivial on a single server. In distributed and replicated database systems, this additional guarantee of “no time travel” on top of the other serializability guarantees is non-trivial, but it has nonetheless been accomplished by several systems such as FaunaDB/Calvin, FoundationDB, and Spanner. This high level of correctness is called strict serializability.

Strict serializability makes all of the guarantees of one-copy serializability that we discussed above. In addition, it guarantees that if a transaction X completed before transaction Y started (in real time) then X will be placed before Y in the serial order that the system guarantees equivalence to. 

Classification of serializable systems


Systems that guarantee strict serializability eliminate all types of time travel anomalies. At the other end of the spectrum, systems that guarantee “only” one-copy serializability are vulnerable to all of the anomalies that we’ve discussed in this post (even though they are immune to the isolation anomalies we discussed in a previous post). There also exist systems that guarantee a version of serializability between these two extremes. One example is “strong session serializable” systems, which guarantee strict serializability of transactions within the same session, but otherwise only one-copy serializability. Another large, well-known class of such systems is read-only replica systems, where all update transactions go to the master replica, which processes them with strict serializability. These updates are asynchronously replicated to read-only replicas in the order in which they were processed at the master. Reads from the replicas may be stale, but they are still serializable. We call this “asynchronous serializability”. Another class of systems is partitioned systems, in which writes within a partition are synchronously replicated, but no coordination is performed across partitions for disjoint writes (we explained above that CockroachDB is a member of this class). We call this “partitioned serializability”. Now that we have given names to these different levels of serializability, we can summarize the anomalies to which they are vulnerable with a simple chart:


System Guarantee            | Immortal write                         | Stale read                             | Causal reverse
ONE COPY SERIALIZABLE       | Possible                               | Possible                               | Possible
STRONG SESSION SERIALIZABLE | Possible (but not within same session) | Possible (but not within same session) | Possible (but not within same session)
ASYNCHRONOUS SERIALIZABLE   | Not Possible                           | Possible                               | Not Possible
PARTITIONED SERIALIZABLE    | Not Possible                           | Not Possible                           | Possible
STRICT SERIALIZABLE         | Not Possible                           | Not Possible                           | Not Possible


For readers who read my previous post on isolation levels, we can combine the isolation anomalies from that post with the time travel anomalies from this post to get a single table with all the anomalies we’ve discussed across the two posts:



System Guarantee | Dirty read | Non-repeatable read | Phantom Read | Write Skew | Immortal write | Stale read | Causal reverse
READ UNCOMMITTED | Possible | Possible | Possible | Possible | Possible | Possible | Possible
READ COMMITTED | Not Possible | Possible | Possible | Possible | Possible | Possible | Possible
REPEATABLE READ | Not Possible | Not Possible | Possible | Possible | Possible | Possible | Possible
SNAPSHOT ISOLATION | Not Possible | Not Possible | Not Possible | Possible | Possible | Possible | Possible
SERIALIZABLE / ONE COPY SERIALIZABLE / STRONG SESSION SERIALIZABLE | Not Possible | Not Possible | Not Possible | Not Possible | Possible | Possible | Possible
ASYNCHRONOUS SERIALIZABLE | Not Possible | Not Possible | Not Possible | Not Possible | Not Possible | Possible | Not Possible
PARTITIONED SERIALIZABLE | Not Possible | Not Possible | Not Possible | Not Possible | Not Possible | Not Possible | Possible
STRICT SERIALIZABLE | Not Possible | Not Possible | Not Possible | Not Possible | Not Possible | Not Possible | Not Possible


Friday, May 3, 2019

Introduction to Transaction Isolation Levels

For decades, database systems have given their users multiple isolation levels to choose from, ranging from some flavor of “serializability” at the high end down to “read committed” or “read uncommitted” at the low end. These different isolation levels expose an application to markedly different types of concurrency bugs. Nonetheless, many database users stick with the default isolation level of whatever database system they are using, and do not bother to consider which isolation level is optimal for their application. This practice is dangerous—the vast majority of widely-used database systems—including Oracle, IBM DB2, Microsoft SQL Server, SAP HANA, MySQL, and PostgreSQL—do not guarantee any flavor of serializability by default. As we will detail below, isolation levels weaker than serializability can lead to concurrency bugs in an application and negative user experiences. It is very important that a database user be aware of the isolation level guaranteed by the database system, and of the concurrency bugs that may emerge as a result.
In this post we give a tutorial on database isolation levels, the advantages that come with lower isolation levels, and the types of concurrency bugs that these lower levels may allow. Our main focus in this post is the difference between serializable isolation levels and lower levels that expose an application to specific types of concurrency anomalies. There are also important differences within the category of serializable isolation (e.g., “strict serializability” makes a different set of guarantees than “one-copy serializability”). However, in keeping with the title of this post as an “introduction” to database isolation levels, we will ignore the subtle differences within the class of serializable isolation, and focus on the commonality of all elements within this class—referring to this commonality as “serializable”. In a future, less introductory, post, we will investigate the category of serializable isolation in more detail.

What is an “Isolation Level”?

Database isolation refers to the ability of a database to allow a transaction to execute as if there are no other concurrently running transactions (even though in reality there can be a large number of concurrently running transactions). The overarching goal is to prevent reads and writes of temporary, aborted, or otherwise incorrect data written by concurrent transactions.
There is such a thing as perfect isolation (we will define this below). Unfortunately, perfection usually comes at a performance cost—in terms of transaction latency (how long before a transaction completes) or throughput (how many transactions per second can the system complete). Depending on how a particular system is architected, perfect isolation becomes easier or harder to achieve. In poorly designed systems, achieving perfection comes with a prohibitive performance cost, and users of such systems will be pushed to accept guarantees significantly short of perfection. However, even in well designed systems, there is often a non-trivial performance benefit achieved by accepting guarantees short of perfection. Therefore, isolation levels came into existence: they provide the user of a system the ability to trade off isolation guarantees for improved performance.
As we will see as we proceed in this discussion, there are many subtleties that lead to confusion when discussing isolation levels. This confusion is exacerbated by the existence of a SQL standard that fails to accurately define database isolation levels and database vendors that attach liberal and non-standard semantics to particular named isolation levels. Nonetheless, a database user is obligated to do due diligence on the different options provided by a particular system, and choose the right level for their application.

Perfect Isolation

Let’s begin our discussion of database isolation levels by providing a notion of what “perfect” isolation is. We defined isolation above as the ability of a database system to allow a transaction to execute as if there are no other concurrently running transactions (even though in reality there can be a large number of concurrently running transactions). What does it mean to be perfect in this regard? At first blush, it may appear that perfection is impossible. If two transactions both read and write the same data item, it is critical that they impact each other. If they ignore each other, then whichever transaction writes last could clobber the other transaction’s write, resulting in the same final state as if that transaction never ran.
The database system was one of the first scalable concurrent systems, and has served as an archetype for many other concurrent systems developed subsequently. The database system community—many decades ago—developed an incredibly powerful (and yet perhaps underappreciated) mechanism for dealing with the complexity of implementing concurrent programs.
The idea is as follows: human beings are fundamentally bad at reasoning about concurrency. It’s hard enough to write a bug-free non-concurrent program. But once you add concurrency, there is a near-infinite array of race conditions that could occur—if one thread reaches line 17 of a program before another thread reaches line 5, but after it reaches line 3, a problem could occur that would not exist under any other concurrent execution of the program. It is nearly impossible to consider all the different ways that program execution in different threads could overlap with each other, and how different types of overlap can lead to different final states.
Instead, database systems provide a beautiful abstraction to the application developer—a “transaction”. A transaction may contain arbitrary code, but it is fundamentally single-threaded.
An application developer only needs to focus on the code within a transaction—to ensure it is correct when there are no other concurrent processes running in the system. Given any starting state of the database, the code must not violate the semantics of the application. Ensuring correctness of code is non-trivial, but it is much easier to ensure the correctness of code when it is running by itself, than ensuring the correctness of code when it is running alongside other code that may attempt to read or write shared data. 
If the application developer is able to ensure the correctness of their code when no other concurrent processes are running, a system that guarantees perfect isolation will ensure that the code remains correct even when there is other code running concurrently in the system that may read or write the same data. In other words, a database system enables a user to write code without concern for the complexity of potential concurrency, and yet still process that code concurrently without introducing new bugs or violations of application semantics.
Implementing this level of perfection sounds difficult, but it’s actually fairly straightforward to achieve. We have already assumed that the code is correct when run without concurrency over any starting state. Therefore, if transaction code is run serially — one after the other — then the final state will also be correct. Thus, in order to achieve perfect isolation, all the system has to do is to ensure that when transactions are running concurrently, the final state is equivalent to a state of the system that would exist if they were run serially. There are several ways to achieve this—such as via locking, validation, or multi-versioning—that are out of scope for this article. The key point for our purposes is that we are defining “perfect isolation” as the ability of a system to run transactions in parallel, but in a way that is equivalent to as if they were running one after the other. In the SQL standard, this perfect isolation level is called serializability.
Isolation levels in distributed systems get more complicated. Many distributed systems implement variations of the serializable isolation level, such as one-copy serializability (1SR), strict serializability (strict 1SR), or update serializability (US). However, as we mentioned above, in order to focus the discussion in this post on core concepts behind database isolation, we will defer discussion of these more advanced concepts to a future post.

Anomalies in Concurrent Systems

The SQL standard defines several isolation levels below serializability. Furthermore, there are other isolation levels commonly found in commercial database systems—most notably snapshot isolation—which are not included in the SQL standard. Before we discuss these different levels of isolation, let’s discuss some well-known application bugs/anomalies that can occur at isolation levels below serializability. We will describe these bugs using a retail example.
Let’s assume that whenever a customer purchases a widget, the following “Purchase” transaction is run:
  1. Read old inventory
  2. Write new inventory which is one less than what was read in step (1)
  3. Insert new order corresponding to this purchase into the orders table
If such Purchase transactions run serially, then all initial inventory will always be accounted for. If we started with 42 widgets, then at all times, the sum of all inventory remaining plus orders in the order table will be 42.
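As a sketch of this Purchase transaction (using JDBC; table and column names are illustrative, and the isolation level actually enforced depends on the system and its configuration):

  import java.sql.Connection;
  import java.sql.PreparedStatement;
  import java.sql.ResultSet;

  public class PurchaseSketch {
      static void purchase(Connection conn, String widgetId, String orderId) throws Exception {
          conn.setAutoCommit(false);
          try {
              // Step 1: read the old inventory.
              int inventory;
              try (PreparedStatement read = conn.prepareStatement(
                      "SELECT inventory FROM widgets WHERE id = ?")) {
                  read.setString(1, widgetId);
                  try (ResultSet rs = read.executeQuery()) {
                      rs.next();
                      inventory = rs.getInt("inventory");
                  }
              }
              // Step 2: write the new inventory, one less than what was read.
              try (PreparedStatement write = conn.prepareStatement(
                      "UPDATE widgets SET inventory = ? WHERE id = ?")) {
                  write.setInt(1, inventory - 1);
                  write.setString(2, widgetId);
                  write.executeUpdate();
              }
              // Step 3: insert the new order corresponding to this purchase.
              try (PreparedStatement insert = conn.prepareStatement(
                      "INSERT INTO orders (id, widget_id) VALUES (?, ?)")) {
                  insert.setString(1, orderId);
                  insert.setString(2, widgetId);
                  insert.executeUpdate();
              }
              conn.commit();
          } catch (Exception e) {
              conn.rollback();
              throw e;
          }
      }
  }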
But what if such transactions run concurrently at isolation levels below serializability?
For example, let’s say that two transactions running concurrently read the same initial inventory (42), and then both attempt to write out the new inventory of one less than the value that they read (41), in addition to writing out their new orders. In such a case, the final state is an inventory of 41, yet there are two new orders in the orders table (for a total of 43 widgets accounted for). We created a widget out of nothing! Clearly, this is a bug. It is known as the lost-update anomaly.
As another example, let’s say these same two transactions are running concurrently, but this time the second transaction starts in between steps (2) and (3) of the first one. In this case, the second transaction reads the value of inventory after it has been decremented, i.e. it reads the value of 41, decrements it to 40, and writes out the order. In the meantime, the first transaction aborts when writing out the order (e.g. because of a credit card decline). In such a case, during the abort process, the first transaction reverts the database back to the state it was in before that transaction began (when inventory was 42). Therefore, the final state is an inventory of 42, with one order written out (from the second transaction). Again, we created a widget out of nothing! This is known as the dirty-write anomaly (because the second transaction overwrote the value of the first transaction’s write before the first transaction decided whether it would commit or abort).
As a third example, let’s say a separate transaction performs a read of both the inventory and the orders table, in order to make an accounting of all widgets that ever existed. If it runs between steps (2) and (3) of a Purchase transaction, it will see a temporary state of the database in which the widget has disappeared from the inventory, but has not yet appeared as an order. It will appear that a widget has been lost—another bug in our application. This is known as the dirty-read anomaly, since the accounting transaction was allowed to read the temporary (incomplete) state of the purchase transaction.
As a fourth example, let’s say that a separate transaction checks the inventory and acquires some more widgets if there are fewer than 10 widgets left:
  1. IF (READ(Inventory) = (10 OR 11 OR 12))
  2.    Ship some new widgets to restock inventory via standard shipping
  3. IF (READ(Inventory) < 10)
  4.    Ship some new widgets to restock inventory via express shipping
Note that this transaction reads the inventory twice. If the Purchase transaction runs in between steps (1) and (3) of this transaction, then a different value of inventory will be read each time. If the initial inventory before the Purchase transaction ran was 10, this would lead to the same restock request being made twice—once with standard shipping and once with express shipping. This is called the non-repeatable read anomaly.
As a fifth example, imagine a transaction that scans the orders table in order to calculate the maximum price of an order and then scans it again to find the average order price. In between these two scans, an extremely expensive order gets inserted that skews the average so much that it becomes higher than the maximum price found in the previous scan. This transaction returns an average price that is larger than the maximum price—a clear impossibility and a bug that would never happen in a serializable system. This bug is slightly different than the non-repeatable read anomaly since every value that the transaction read stayed the same between the two scans—the source of the bug is that additional records were inserted in between these two scans. This is called the phantom read anomaly.
As a final example, assume that the application allows the price of the widget to change based on inventory. For example, many airlines increase ticket prices as the inventory for a flight decreases. Assume that the application uses a formula to place a constraint on how these two variables interrelate—e.g. 10I + P >= $500 (where I is inventory and P is price). Before allowing a purchase to succeed, the purchase transaction has to check both the inventory and the price to make sure the above constraint is not violated. If the constraint will not be violated, the update of inventory by that Purchase transaction may proceed. Similarly, a separate transaction that implements special promotional discounts may check both the inventory and the price to make sure that the constraint is not violated when updating the price as part of a promotion. If it will not be violated, the price can be updated. Now: imagine these two transactions are running at the same time—they both read the old values of I and P and independently decide that their updates of inventory and price, respectively, will not violate the constraint. They therefore proceed with their updates. Unfortunately, this may result in new values of I and P that violate the constraint! If one had run before the other, the first one would have succeeded, and the other would have read the values of I and P after the first one finished, detected that its update would violate the constraint, and therefore not proceeded. But since they were running concurrently, they both see the old values and incorrectly decide that they can proceed with the update. This bug is called the write skew anomaly because it happens when two transactions read the same data but update disjoint subsets of the data that was read.

Definitions in The ISO SQL Standard

The SQL standard defines reduced isolation levels in terms of which of these anomalies are possible. In particular, it contains the following table:
Isolation Level  | Dirty Read   | Non-repeatable Read | Phantom Read
READ UNCOMMITTED | Possible     | Possible            | Possible
READ COMMITTED   | Not Possible | Possible            | Possible
REPEATABLE READ  | Not Possible | Not Possible        | Possible
SERIALIZABLE     | Not Possible | Not Possible        | Not Possible

There are many, many problems with how the SQL standard defines these isolation levels. Most of these problems were already pointed out in 1995, but inexplicably, revision after revision of the SQL standard has been released since then without fixing them.
The first problem is that the standard only uses three types of anomalies to define each isolation level—dirty read, non-repeatable read, and phantom read. However, there are many types of concurrency bugs that can appear in practice—many more than just these three anomalies. In this post alone we have already described six unique types of anomalies. The SQL standard makes no mention of whether the READ UNCOMMITTED, READ COMMITTED, and REPEATABLE READ isolation levels are susceptible to the lost-update anomaly, the dirty-write anomaly, or the write-skew anomaly. As a result, each commercial system is free to decide which of these other anomalies its reduced isolation levels are susceptible to—and in many cases these vulnerabilities are poorly documented, leading to confusion and unpredictable bugs for the application developer.
A second (related) problem is that using anomalies to define isolation levels only gives the end user a guarantee about which specific types of concurrency bugs are impossible. It does not give a precise definition of the potential database states that are viewable by any particular transaction. There are several improved and more precise definitions of isolation levels in the academic literature. Atul Adya’s PhD thesis gives a precise definition of the SQL standard isolation levels based on how reads and writes from different transactions may be interleaved. However, these definitions are given from the point of view of the system. The recent work by Natacha Crooks et al. gives elegant and precise definitions from the point of view of the user.
A third problem is that the standard does not define, nor provide correctness constraints on, one of the most popular reduced isolation levels used in practice: snapshot isolation (nor any of its many variants—PSI, NMSI, Read Atomic, etc.). By failing to provide a definition of snapshot isolation, the standard has allowed differences in the concurrency vulnerabilities of snapshot isolation to emerge across systems. In general, snapshot isolation performs all reads of data as of a particular snapshot of the database state which contains only committed data. This snapshot remains constant throughout the lifetime of the transaction, so all reads are guaranteed to be repeatable (in addition to being only of committed data). Furthermore, concurrent transactions that write the same data detect conflicts with each other, and typically resolve the conflict by aborting one of the conflicting transactions. This prevents the lost-update anomaly. However, conflicts are only detected if the conflicting transactions write an overlapping set of data. If the write sets are disjoint, these conflicts will not be detected. Therefore, snapshot isolation is vulnerable to the write skew anomaly. Some implementations are also vulnerable to the phantom read anomaly.
A fourth problem is that the SQL standard seemingly gives two different definitions of the SERIALIZABLE isolation level. First, it defines SERIALIZABLE correctly: the final result must be equivalent to a result that could have occurred if there were no concurrency. But then, it presents the above table, which seems to imply that as long as an isolation level does not allow dirty reads, non-repeatable reads, or phantom reads, it may be called SERIALIZABLE. Oracle has historically leveraged this ambiguity to justify calling its implementation of snapshot isolation “SERIALIZABLE”. To be honest, I think most people who read the ISO SQL standard would come away believing that the more precise definition of SERIALIZABLE given earlier in the document (which is the correct one) is the intention of the authors of the document. Nonetheless, I guess Oracle’s lawyers have looked at it and determined that there is enough ambiguity in the document to legally justify their reliance on the other definition. If any of my readers are aware of any real lawsuits brought by application developers who believed they were getting a SERIALIZABLE isolation level, but experienced write skew anomalies in practice, I would be curious to hear about them. Or if you are an application developer and this happened to you, I would also be curious to hear about it.
The bottom line is this: it is nearly impossible to give clear definitions of the different isolation levels available to application developers, because vagueness and ambiguities in the SQL standard have led to semantic differences across implementations and systems.

What Isolation Level Should You Choose?

My advice to application programmers is the following: reduced isolation levels are dangerous. It is very hard to figure out which concurrency bugs may present themselves. If every system defined its isolation levels using the methodology of Crooks et al. that I cited above, at least you would have a precise and formal definition of the associated guarantees. Unfortunately, the formalism of the Crooks paper is too advanced for most database users, so it is unlikely that database vendors will adopt these formalisms in their documentation any time soon. In the meantime, reduced isolation levels remain vaguely defined in practice and risky to use.
Furthermore, even if you could know exactly which concurrency bugs are possible for a particular isolation level, writing an application in a way that these bugs will not happen in practice (or if they do, that they will not cause negative experiences for the users of the application) is also very challenging. If your database system gives you a choice, the right choice is usually to avoid lower isolation levels than serializable isolation (for the vast majority of database systems, you actually have to go and change the defaults to accomplish this!).
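With JDBC, for example, requesting serializable isolation is typically a one-line change per connection (a sketch; the URL is illustrative, and whether “SERIALIZABLE” means true serializability is exactly caveat 1 below):

  import java.sql.Connection;
  import java.sql.DriverManager;

  public class SerializableDefaultSketch {
      public static void main(String[] args) throws Exception {
          Connection conn = DriverManager.getConnection("jdbc:postgresql://dbhost/shop");

          // Most systems default to a weaker level (often READ COMMITTED);
          // explicitly request serializable isolation for this connection.
          conn.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
          conn.setAutoCommit(false);

          // ... run transactions on conn, calling conn.commit() after each one ...

          conn.close();
      }
  }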
However, there are three caveats:

  1. As I mentioned above, some systems use the term “SERIALIZABLE” to mean something weaker than true serializable isolation. Unfortunately, this means that simply choosing the “SERIALIZABLE” isolation level in your database system may not be sufficient to actually ensure serializability. You need to check the documentation to ensure that it defines “SERIALIZABLE” in the following way: that the visible state of the database is always equivalent to a state that could have occurred if there were no concurrency. Otherwise, your application will likely be vulnerable to the write-skew anomaly.
  2. As mentioned above, the serializable isolation level comes with a performance cost. Depending on the quality of the system architecture, the performance cost of serializability may be large or small. In a recent research paper that I wrote with Jose Faleiro and Joe Hellerstein, we showed that in a well-designed system, the performance difference between SERIALIZABLE and READ COMMITTED can be negligible … and in some cases it is possible for the SERIALIZABLE isolation level to (surprisingly) outperform the READ COMMITTED isolation level. If you find that the cost of serializable isolation in your system is prohibitive, you should probably consider using a different database system before you consider settling for a reduced isolation level.
  3. In distributed systems, there are important anomalies that can (and do) emerge even within the class of serializable isolation levels. For such systems, it is important to understand the subtle differences between the elements of the class of serializable isolation (strict serializability is known to be the safest). We will shed more light on this matter in a future post.