DBMS Musings: July 2019

Thursday, July 25, 2019

Overview of Consistency Levels in Database Systems

Database systems typically give users the ability to trade off correctness for performance. We have spent the previous two posts in this series discussing one particular way to trade off correctness for performance: database isolation levels. In distributed systems, there is a whole other category for trading off correctness for performance: consistency levels. There are an increasing number of distributed database systems that are giving their users multiple different consistency levels to choose from, thereby allowing the user to specify which consistency guarantees are needed from the system for a particular application. Similar to isolation levels --- weaker consistency levels typically come with better performance, and thus come with the same types of temptations as reduced isolation levels.

In this post, we give a short tutorial on consistency levels --- explaining what they do, and how they work. Much of the existing literature and online discussion of consistency levels are done in the context of multi-processor or distributed systems that operate on a single data item at a time, without the concept of a “transaction”. In this post we give some direction for how to think about consistency levels in the context of ACID-compliant database systems.

What is a consistency level?

The definition of consistency is fundamentally dependent on context. In general, consistency refers to the ability of a system to ensure that it complies (without fail) to a predefined set of rules. However, this set of rules changes based on context. For example, the C of ACID and the C of CAP both refer to consistency. However, the set of rules implied by these two contexts are totally different. In ACID, the rules refer to application-defined semantics. A system that guarantees the C of ACID ensures that processing a transaction does not violate referential integrity constraints, foreign-key constraints, and any other application-specific constraints (such as: “every user must have a name”). In contrast, the consistency C of CAP refers to the rules related to making a concurrent, distributed system appear like a single-threaded, centralized system. Reads at a particular point in time have only one possible result: they must reflect the most recent completed write (in real time) of that data item, no matter which server processed that write.

One point of confusion that we can eliminate at this point is that the phrase “consistency level” is not typically used in the context of ACID consistency. This is because the C of ACID is almost entirely the responsibility of the application developer --- only the developer can ensure that the code they place inside a transaction does not violate application semantics when it is run in isolation. ACID is really a misnomer --- really it should be AID, since only those three (atomicity, isolation, and durability) are in the realm of system guarantees. [Joe Hellerstein claims that he was taught that the C in ACID was only included to make an LSD pun, but acknowledges that this may be apocryphal.]

When we talk about consistency levels, we’re really referring the the C of CAP. In this context, perfect consistency --- usually referred to as “strict consistency” --- would imply that the system ensures that all reads reflect all previous writes --- no matter where those writes were performed. Any consistency level below “perfect” consistency enables situations to occur where a read does not return the most recent write of a data item. [Side note: the C of CAP in the original CAP paper refers to something called “atomic consistency” which is slightly weaker than strict consistency but still considered “perfect” for practical purposes. We’ll discuss the difference later in this post.]

Depending on how a particular system is architected, perfect consistency becomes easier or harder to achieve. In poorly designed systems, achieving perfection comes with a prohibitive performance and availability cost, and users of such systems are pushed to accept guarantees significantly short of perfection. However, even in well designed systems, there is often a non-trivial performance benefit achieved by accepting guarantees short of perfection.

An overview of well-known consistency levels

The notion of a consistency level originated by researchers of shared-memory multi-processor systems. The goal of the early work was to reason about how and when different threads of execution, which may be running concurrently and accessing overlapping sets of data in shared memory, should see the writes performed by each other. As such, most of the initial work focused on reads and writes of individual data items, rather than at the level of groups of reads and writes within a transaction. We will begin our discussion using the same terminology as the original research and models, and then proceed to discuss how to apply these ideas to transactions.

In sequential consistency, all writes --- no matter which thread made the write, and no matter which data item was written to --- are globally ordered. Every single thread of execution must see the writes occurring in this order. For example, if one thread saw data X being updated to 5, and then later saw Y being updated to 10, every single thread must see the update of X happening before the update of Y. If any thread sees the new value of Y but the old value of X, sequential consistency would be violated. This example is shown in Figure 1. In this figure, time gets later as you move to the right in the figure, and there are 4 threads of execution: P1, P2, P3, and P4. Every thread (that reads X and Y) sees the update of X from 0 to 5 happening before the update of Y from 0 to 10. Threads: P1 and P2 write X and Y respectively, but do not read either one. Thread P3 sees the new value of X and subsequently sees the old value of Y. This is only possible if the update to X happened before the update to Y. Thread P4 only sees the new values of X and Y, so it does not see which one happened first. Thus, all threads agree that it is possible that the update of X happened before the update to Y. Contrast this to Figure 2, below, in which P3 and P4 see clearly different orders of the updates to X and Y --- P3 sees the new value of X (5) and subsequently the old value of Y (0), while P4 sees the new value of Y (10) and subsequently the old value of X (0). Figure 2 is thus not a sequentially consistent schedule.

In general, sequential consistency does not place any requirements on how to order the writes. In our example, the write to X happened in real time before the write to Y. Nonetheless, as long as every thread agrees to see the write to Y as happening before the write to X, sequential consistency allows the official history to be different that what occurred according to real time (the only restriction is that writes and reads originating from the same thread of execution cannot be reordered). See Figure 3 for an example of this.

In contrast to sequential consistency, strict consistency does place real time requirements on how to order the writes. It assumes that it is always possible to know what time it currently is with zero error --- i.e. that every thread of execution agree on the precise current time. The order of the writes in the sequential order must be equal to the real time that these writes were issued. Furthermore, every read operation must read the value of the most recent write in real time --- no matter which thread of execution initiated that write. In a distributed system (and even multi-processor single-server systems), it is impossible in practice to have global agreement on precise current time, which renders strict consistency to be mostly of theoretical interest.

None of Figures 1, 2, or 3 above satisfy strict consistency because they all contain either a read of x=0 or a read of y=0 after the value of x or y has been written to a new value. However, Figure 4 below satisfies strict consistency since all reads reflect the most recent write in real time:

In a distributed/replicated system, where writes and reads can originate anywhere, the highest level of consistency obtained in practice is linearizability (also known as “atomic consistency” which is what it is called in the CAP theorem). Linearizability is very similar to strict consistency: both are extensions of sequential consistency that impose real time constraints on writes. The difference is that the linearizability model acknowledges that there is a period of time that occurs between when an operation is submitted to the system, and when the system responds with an acknowledgement that it was completed. In a distributed system, the sending of the write request to the correct location(s) --- which may include replication --- can occur during this time period. A linearizability guarantee does not place any ordering constraints on operations that occur with overlapping start and end times. The only ordering constraint is for operations that do not overlap in time --- only in those cases does the earlier write have to be seen before the later write.

Figure 5 above shows an example of a schedule that is linearizable, but not strictly consistent. It is not strictly consistency since the read of X by P3 is initiated (and returns) slightly after the write of X by P1, but still sees the old value. Nonetheless, it is linearizable because this read of X by P3 and write of X by P1 overlap in time, and therefore linearizability does not require the read of X by P3 to see the result of the write of X by P1.

While linearizability and strict consistency are stronger than sequential consistency, sequential consistency is by itself a very high level of consistency, and there exist many consistency levels below it.

Causal consistency is a popular and useful consistency level that is slightly below sequential consistency. In sequential consistency, all writes must be globally ordered --- even if they are totally unrelated to each other. Causal consistency does not enforce orderings of unrelated writes. However, if a thread of execution performs a read of some data item (call it X) and then writes that data item or a different one (call it Y), it assumes that the subsequent write may have been caused by the read. Therefore, it enforces the order of X and Y --- specifically all threads of execution must observe the write of Y after the write of X.

As an example, compare Figure 6 (above) with Figure 2. In Figure 2, P3 saw the write to X happening before the write to Y, but P4 saw the write to Y happening before the write to X. This violates sequential consistency, but not causal consistency. However, in Figure 6, P2 read the write to X before performing the write to Y. That places a causal constraint between the write to X and Y --- Y must happen after X. Therefore, when P4 sees the write to Y without the write to X, causal consistency is violated.

Eventual consistency is even weaker --- even causally dependent writes may become visible out of order. For example, despite violating every other consistency guarantee that we have discussed so far, Figure 6 does not necessarily violate eventual consistency. The only guarantee in eventual consistency is that if there are no writes for a “long” period of time (where the definition of “long” is system dependent), every thread of execution will agree on the value of the last write. So as long as P4 eventually sees the new value of X (5) at some later point in time (not shown in Figure 6), then eventual consistency is maintained.

Strong consistency vs. Weak consistency

Strict consistency and linearizability/atomic consistency are typically thought of as “strong” consistency levels. In many cases, sequential consistency is also referred to as a strong consistency level. The key feature that each of these consistency levels share is that the state of the database goes through a universally agreed-upon sequence of state changes. This allows the realities of replication to be hidden to the end user. In essence, the view of the database to the user is that there is only one copy of the database that continuously makes state transitions in a forward direction. In contrast, weaker consistency levels (such as causal consistency and eventual consistency) allow different views of the database state to see different progression of steps in database state ---- a clear impossibility unless there is more than one copy of the database. Thus, for weaker levels of consistency, the application developer must be explicitly aware of the replicated nature of the data in the database, thereby increasing the complexity of development relative to strong consistency.

Transactions and consistency levels

As we discussed above, consistency levels have historically been defined in terms of individual reads or writes of a data item. This makes the discussion of consistency levels hard to apply to the context of database systems in which groups of read and write operations occur together in atomic transactions. Indeed, both the research literature and vendor documentation (for those vendors that offer multiple consistency levels) are extremely confusing and do not take a uniform approach when it comes to applying consistency levels in the context of database systems.

I think that the simplest way to reason about consistency levels in the presence of database transactions is to make only minor adjustments to the consistency models we discussed earlier. We still view consistency as threads of execution sending read and write requests to a shared data store. The only difference is that we annotate each read and write request with the transaction identifier of the transaction that initiated each request. If each thread of execution can only process one transaction at a time, and transactions can not be processed by more than one thread of execution, then the traditional timeline consistency diagrams need only be supplemented with rectangular boundaries indicating the start and end point of each transaction within a thread of execution, as shown in the figure below.

The presence of a transaction in a consistency diagram adds additional constraints corresponding to AID of ACID: all of the reads and writes within the transaction succeed or fail together (atomicity), they are isolated from other concurrently running transactions (the degree of isolation will depend on the isolation level), and writes of committed transactions will outlive all kinds of system failure (durability).

The atomicity and durability guarantees of transactions are pretty easy to cognitively separate from consistency guarantees, because they deal with fundamentally different concepts. The problem is isolation. Consistency guarantees specify how and when writes are visible to threads of execution that read database state. Isolation guarantees also place limitations on when writes become visible. This conceptual overlap of isolation and consistency guarantees is the source of much confusion. In the next post of this series I plan to give a tutorial for understanding the difference between isolation levels and consistency levels.

Monday, July 15, 2019

The dangers of conditional consistency guarantees

The ease of writing an application on top of infrastructure that guarantees consistency can not be overstated. It’s just so much easier to write an application when you can be sure that every read will return the most recent state of the data --- no matter where in the world writes to data state occur, and no matter what types of failure may occur.

Unfortunately, consistency in distributed systems comes at a cost. Other desirable system properties may be traded off in order to uphold a consistency guarantee. For example, it is well known (see the CAP theorem) that in the event of a network partition, it is impossible to guarantee both consistency and availability. One must be traded off for the other. Furthermore, in a system deployed across a wide geography, the laws of physics prevent the communication required to maintain consistency from happening instantaneously. This leads to the PACELC tradeoff: when there is a network partition (the P in PACELC), the system must decide how it will tradeoff availability (A) for consistency (C). Else (E) --- during normal operation of the system --- the system must decide how it will tradeoff latency (L) for consistency (C).

There are four points in the PACELC tradeoff space: (1) PA/EC (when there is a partition, choose availability; else, choose consistency), (2) PC/EC (choose consistency at all times), (3) PA/EL (prioritize availability and latency over consistency), and (4) PC/EL (when there is a partition, choose consistency; else, choose latency).

If you were to ask a student who just learned about the CAP theorem in a distributed systems class what guarantees they would want from a distributed system deployed for an application that needs 99.999% availability, the student would likely answer: “That’s easy! I would choose a system that prioritizes availability over consistency in the event of a network partition. But in the absence of a network partition, it is possible for a system to guarantee both availability and consistency, so I would want both those guarantees in the system I would deploy.”

This answer is natural and obvious. But at the same time, it is naive and incorrect.

The system that the student described would be a PA/EC system from PACELC --- a system that does not guarantee consistency during a network partition, but otherwise guarantees consistency. There are many applications that need several “nines” of availability. If the right PACELC system for those applications was PA/EC, then there would be many vendors competing with each other, attempting to sell the best PA/EC system. Yet in practice, the opposite is true --- it is extremely rare to find PA/EC systems in the wild. In fact, it took me several years until I came across the IMDG (in-memory data grid) market before I could find examples of real PA/EC systems.

The reason for this is that PA/EC systems are much less useful than one might initially expect. To understand why this is the case, one must understand the fundamental difference between a real guarantee and a conditional guarantee.

Real guarantees vs. conditional guarantees

Of the three desirable properties that we’ve mentioned thus far as being desirable in a distributed system --- availability, consistency, and latency --- only one of them can ever be a real guarantee: consistency. No system will ever guarantee 100% availability, nor will any system guarantee that 100% of requests will complete within a particular latency bound. Such guarantees would be impossible to uphold. A “PA” system in PACELC prioritizes availability, but cannot guarantee 100% availability. Similarly, an “EL” system in PACELC prioritizes latency, but does not make any particular 100% guarantees.

In contrast, consistency is a real guarantee. Systems that guarantee consistency are actually capable of upholding the promise of 100% consistency, and many systems do in fact provide it.

Real guarantees form the foundation of the success of modern computing. Computers are extremely complex systems. Even a simple program must involve a large number of complex components --- from registers and logic gates to cache and memory to operating systems and security all the way up to the run time environment of the program. A simple “hello world” program ought to take weeks to develop. The reason why it doesn’t is because of the power of “abstraction” in computer science. Each level of a computer system builds on top of an underlying level that abstracts away the complex details of its implementation. Instead, the underlying level presents an interface to the higher level, and inviolable guarantees about what is returned through this interface.

For example, when a program reads from a particular place in memory, it can assume that it will see the data that was last written there. In reality, the processor inside a computer may choose to execute instructions within a program out of order. Furthermore, electric or magnetic interference may cause bits to spontaneously flip in memory. In practice, complicated solutions are used to detect and correct such bit corruption, including using redundant memory bits, additional circuitry, and error correcting codes. All of this is hidden to the higher level program that wants to read this location in memory. The higher level program assumes that it will receive the data that was most recently written to that location 100% of the time. [Side note: In theory, all of the error correcting logic in storage systems still cannot yield an actual 100% guarantee. However, the chances of a corrupted data value being returned to the end user is so infinitesimally small that this guarantee is still treated like 100% for all practical purposes.]

Imagine the extra complexity that would be involved in writing a program if you could not assume that instructions from a program are processed in the specified order, or the data in memory has not been corrupted. Almost every line of code that reads data from a variable would have to be surrounded by ‘if-statements’ that check for what might be corruption in that particular context and that specifies what should be done if the read appears to not be correct. This extra code would likely increase the development cycle for a program by one to two orders of magnitude!

For a program to be “bug-free”, this guarantee must be 100%. If a read returns the correct result 99% of the time, or even 99.99% or 99.9999% of the time, the above described ‘if-statements’ and conditional logic will still be necessary. Only if the program can assume that the correct result will always be returned can the extra conditional logic be avoided. Thus, from the perspective of the developer, there is an enormous difference between a complete guarantee and an incomplete guarantee. Within the space of incomplete guarantees, it is certainly the case that 99.9999% accuracy is better than 99% accuracy. But the big jump in complexity savings for the developer is the jump between anything less than 100% and a 100% guarantee.

This is why a PA/EC system has limited utility to an application developer. A PA/EC system is a system that guarantees consistency in all cases …. except when there is a network partition. In other words, a PA/EC makes a conditional guarantee of consistency: it guarantees consistency in all cases as long as a particular condition is met: that there is no network partition.

However, we just said that the big jump in complexity savings for the developer is a jump between anything less than 100% and a 100% guarantee. Any guarantee that comes with conditions that are out of the control of the developer (such as ensuring that there will never be a network partition), is not that helpful. The developer still has to write code to handle the cases where the guarantee is not upheld.

I am not arguing that all systems must guarantee consistency. I am only pointing out that there is a big difference in developer complexity between developing on top of a system that guarantees consistency fully, and one that guarantees consistency only under certain conditions. Without a full guarantee of consistency, there is a significant amount of extra effort required to ensure the absence of corner-case bugs that may emerge days, weeks, or even years after the application is first deployed.

Conditional guarantees and PACELC

We just explained that a PA system in PACELC results in a large amount of extra effort for the application developer, in order to ensure that the application remains bug-free even when inconsistent reads occur during a network partition. If so, during normal operation, where there is a tradeoff between consistency and latency, why choose consistency? All of the effort in building an application that does not break in the face of inconsistent reads has already been paid. Why not benefit from all that effort during normal operation and get better latency as well?

All of this suggests that application developers should decide whether they want to develop on top of a system that guarantees consistency or not. If they want consistency, they should deploy on top of a PC/EC system in PACELC --- i.e. a system that guarantees consistency 100% of the time. If they are willing to go to all of the extra effort to build on top of a system that does not guarantee consistency, they should be looking to get as many benefits as possible from this effort, and should deploy on top of PA/EL system, and thereby get better availability and latency properties (even if those properties do not come with guarantees).

New consistency guarantees in Hazelcast

As I pointed out in a previous post, when deploying within a single geographic region, the latency vs. consistency tradeoff is not significant. In such environments, there is not much difference between PA/EC and PA/EL systems. Since in-memory data grids are typically deployed in a single region, PA/EC became the standard PACELC option for the IMDG market.

Nonetheless, it is interesting to note how Hazelcast --- one of the most widely used IMDG implementations --- is moving away from the PA/EC default configuration and towards the more natural PC/EC and PA/EL pairings.

In 2017, I wrote an in-depth analysis of Hazelcast’s consistency and availability guarantees. At around the same time, Kyle Kingsbury published a linearizability analysis of Hazelcast using his Jepsen testing library. These analyses led to what CEO Greg Luck tells me was 3 person-years of effort to release the first PC/EC configuration option in Hazelcast.

A quick reminder about what Hazelcast does: Many Java programs store program state inside Java collections and data structures such as Queue, Map, or Multimap. As a program scales --- both in terms of the number of clients accessing data and in terms of the amount of data stored, these data structures may get too large to fit in memory on a single server. Hazelcast provides a distributed implementation of these popular data structures, thereby enabling programs that leverage them to scale. Java programs interact with Hazelcast identically to how they interacted with the local data Java structures. However, under the covers, Hazelcast is distributing and replicating them across a cluster of machines.

Hazelcast also provides distributed implementations of Java concurrency structures such as AtomicLong, AtomicReference, Lock. Using such structures without a consistency guarantee is downright dangerous. For example, in situations for which consistency is not upheld, it is possible for multiple different processes to acquire the same lock (concurrently) or for non-atomic actions to occur on supposedly atomic data structures. In other words, lack of consistency violates the fundamental premise of these backbone Java concurrency structures.

With a PC/EC configuration option, you can now deploy Hazelcast’s distributed version of these concurrency tools --- IAtomicLong, IAtomicReference, and ILock --- and get identical correctness guarantees relative to what they provide when being used by a regular single-server Java program.

Hazelcast’s PC/EC configuration performs distributed agreement on data structure state via the Raft consensus protocol. Raft is known for achieving strong availability. Even when a network partition occurs, the majority of servers can remain available (the consistency guarantee only requires that the minority partition must lose availability).

Separately from the consistency issue, another well-known problem with distributed implementations of lock data structures is liveness. A server can acquire the lock and then fail for an indefinite period of time while holding the lock. Meanwhile the rest of the distributed system remains alive, but unable to acquire the lock and make progress. Therefore, in practice, distributed locks typically include timeouts, in order to prevent servers that fail while holding the lock from causing the rest of the system to stall indefinitely. Unfortunately, as soon as it is possible for a lock to timeout, it is possible for a process to get timed out while holding the lock, and yet be temporarily unaware that the lock timed out. In such a scenario, it is possible for two different processes to concurrently believe they have the lock, and perform concurrent side-effects that the lock was designed to prevent. Hazelcast’s distributed lock implementation includes a fencing mechanism which enables downstream code --- code that is designed to be run while locks are being held --- to participate in the locking protocol and be able to distinguish which process is correct when multiple processes concurrently believe that they hold the lock.

As far as the PA/EL option in Hazelcast, they have recently added PN Counters --- a CRDT counter implementation that enables some correctness guarantees despite the potential for inconsistent reads. In particular, when there is no consistency guarantee, it is possible for a server to read a stale value of the counter. For example, process A may update the value of a counter from 4 to 5. Afterwards, process B sees the stale value of the counter (4) and also adds one to it (to get 5). In versions of Hazelcast prior to 3.10, the final value of the counter --- even after whatever event caused the inconsistency is resolved --- would be 5. In other words, one of the two increments of the counter would be lost. With PN Counters (introduced in Hazelcast 3.10), the counter state will eventually converge to the correct value (in this case --- 6 --- that reflects the value after two increments to an initial value of 4). PN Counters also support decrements of the counter (in addition to increments), and guarantees (1) read-your-own-writes and (2) monotonic reads in addition to (3) what I described above --- the eventual convergence of the counter value to the correct value.

Furthermore, inconsistent reads in Hazelcast prior to version 3.10 --- under PA/EC or PA/EL configurations --- could potentially result in duplicate IDs being generated. Hazelcast 3.10 introduced Flake ID generators which guarantee global unique IDs even in the absence of a global consistency guarantee.

Conclusion

PA/EC systems sound good in theory, but are not particularly useful in practice. Our one example of a real PA/EC system --- Hazelcast --- has spent the past 1.5 years introducing features that are designed for alternative PACELC configurations --- specifically PC/EC and PA/EL configurations. PC/EC and PA/EL configurations are a more natural cognitive fit for an application developer. Either the developer can be certain that the underlying system guarantees consistency in all cases (the PC/EC configuration) in which case the application code can be significantly simplified, or the system makes no guarantees about consistency at all (the PA/EL configuration) but promises high availability and low latency. CRDTs and globally unique IDs can still provide limited correctness guarantees despite the lack of consistency guarantees in PA/EL configurations.

DBMS Musings