DBMS Musings: NewSQL database systems are failing to guarantee consistency, and I blame Spanner

Friday, September 21, 2018

(Spanner vs. Calvin, Part 2)

[TL;DR I wrote a post in 2017 comparing Spanner and Calvin that focused on performance differences. This post discusses another very important distinction between the two systems: the subtle differences in consistency guarantees between Spanner (and Spanner-derivative systems) vs. Calvin.]

The CAP theorem famously states that it is impossible to guarantee both consistency and availability in the event of a network partition. Since network partitions are always theoretically possible in a scalable, distributed system, the architects of modern scalable database systems fractured into two camps: those that prioritized availability (the NoSQL camp) and those that prioritized consistency (the NewSQL camp). For a while, the NoSQL camp was clearly the more dominant of the two --- in an “always-on” world, downtime is unacceptable, and developers were forced into handling the reduced consistency levels of scalable NoSQL systems. [Side note: NoSQL is a broad umbrella that contains many different systems with different features and innovations. When this post uses the term “NoSQL”, we are referring to the subset of the umbrella that is known for building scalable systems that prioritize availability over consistency, such as Cassandra, DynamoDB (default settings), Voldemort, CouchDB, Riak, and multi-region deployments of Azure CosmosDB.]

Over the past decade, application developers have discovered that it is extremely difficult to build bug-free applications over database systems that do not guarantee consistency. This has led to a surprising shift in momentum, with many of the more recently released systems claiming to guarantee consistency (and be CP from CAP). Included in this list of newer systems are: Spanner (and its Cloud Spanner counterpart), FaunaDB, CockroachDB, and YugaByte. In this post, we will look more deeply into the consistency claims of these four systems (along with similar systems) and note that while some do indeed guarantee consistency, way too many of them fail to completely guarantee consistency. We will trace the failure to guarantee consistency to a controversial design decision made by Spanner that has been tragically and imperfectly emulated in other systems.

What is consistency anyway?

Consistency, also known as “atomic consistency” or “linearizability”, guarantees that once a write completes, all future reads will reflect the value of that write. For example, let’s say that we have a variable called X, whose value is currently 4. If we run the following code:

X = 10;
Y = X + 8;

In a consistent system, there is only one possible value for Y after running this code (assuming the second statement is run after the first statement completes): 18. Everybody who has completed an “Introduction to Programming” course understands how this works, and relies on this guarantee when writing code.

In a system that does not guarantee consistency, the value of Y after running this code is also probably 18. But there’s a chance it might be 12 (since the original value of X was 4). Even if the system returns an explicit message: “I have completed the X = 10 statement”, it is nonetheless still a possibility that the subsequent read of X will reflect the old value (4) and Y will end up as 12. Consequently, the application developer has to be aware of the non-zero possibility that Y is not 18, and must deal with all possible values of Y in subsequent code. This is MUCH more complicated, and beyond the intellectual capabilities of a non-trivial subset of application developers.
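To make the difference concrete, here is a minimal Python sketch. It assumes a hypothetical store with two replicas in which a write is acknowledged after reaching only one replica, and a read may be served by either one. The class and function names are illustrative only and do not correspond to any real system.

```python
import random

class Replica:
    """A single copy of the data; holds the value of X it has seen so far."""
    def __init__(self, x):
        self.x = x

def run_program(replicas, consistent, rng):
    """Run `X = 10; Y = X + 8` against a hypothetical two-replica store."""
    # The write is acknowledged as soon as the first replica applies it.
    replicas[0].x = 10
    if consistent:
        # A consistent system makes the write visible to every future read
        # before acknowledging it (modeled here by updating all replicas).
        for r in replicas:
            r.x = 10
    # The read may be served by any replica.
    x = rng.choice(replicas).x
    return x + 8

def possible_outcomes(consistent, trials=1000, seed=0):
    """Collect every value of Y observed over many independent runs."""
    rng = random.Random(seed)
    outcomes = set()
    for _ in range(trials):
        replicas = [Replica(4), Replica(4)]
        outcomes.add(run_program(replicas, consistent, rng))
    return outcomes
```

In the consistent configuration the only observed value of Y is 18. In the other configuration both 18 and 12 show up, which is exactly the extra case that application code would have to handle.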

[Side note: Another name for "consistency" is "strong consistency". This alternate name was coined in order to distinguish the full consistency guarantee from weaker consistency levels that also use the word "consistency" in their name (despite not providing the complete consistency guarantee). Indeed, some of these weaker consistency levels, such as "causal consistency", "session consistency", and "bounded staleness consistency" provide useful guarantees that somewhat reduce complexity for application developers. Nonetheless, the best way to avoid the existence of corner case bugs in an application is to build it on top of a system that guarantees complete, strong consistency.]

Why give up on consistency?

Consistency is a basic staple, a guarantee that is extremely hard to live without. So why do most NoSQL systems fail to guarantee consistency? They blame the CAP theorem. (For example, the Amazon Dynamo paper, which inspired many widely used NoSQL systems, such as Cassandra, DynamoDB, and Riak, mentions the availability vs. consistency tradeoff in the first paragraph of the section that discusses their “Design Considerations”, which led to their famous “eventually consistent” architecture.) It is very hard, but not impossible, to build applications over systems that do not guarantee consistency. But the CAP theorem says that it is impossible for a system that guarantees consistency to guarantee 100% availability in the presence of a network partition. So if you can only choose one, it makes sense to choose availability. As we said above, once the system fails to guarantee consistency, developing applications on top of it without ugly corner case bugs is extremely challenging, and generally requires highly skilled application developers who are able to handle the intellectual rigors of such development environments. Nonetheless, such skilled developers do exist, and giving up consistency is the only way to escape the CAP theorem’s proof that 100% availability is impossible.

The reasoning of the previous paragraph, although perhaps well-thought out and convincing, is fundamentally flawed. The CAP theorem lives in a theoretical world where there is such a thing as 100% availability. In the real world, there is no such thing as 100% availability. Highly available systems are defined in terms of ‘9s’. Are you 99.9% available? Or 99.99% available? The more 9s, the better. Availability is fundamentally a pursuit in imperfection. No system can guarantee availability.
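As a quick illustration of what the ‘9s’ mean in practice, the following Python sketch converts a number of nines into expected downtime per year. The function name is my own, and the calculation ignores leap years.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

def downtime_minutes_per_year(nines):
    """Expected yearly downtime for an availability of `nines` nines."""
    unavailability = 10 ** (-nines)  # e.g. 3 nines -> 0.001 (99.9% available)
    return MINUTES_PER_YEAR * unavailability

for n in (2, 3, 4, 5):
    print(f"{n} nines: {downtime_minutes_per_year(n):.1f} minutes/year")
```

Three nines works out to roughly 8.8 hours of downtime per year, and each additional nine cuts that by a factor of ten. The number never reaches zero.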

This fact has significant ramifications when considering the availability vs. consistency tradeoff that was purported by the CAP theorem. It is not the case that if we guarantee consistency, we have to give up the guarantee of availability. We never had a guarantee of availability in the first place! Rather, guaranteeing consistency causes a reduction to our already imperfect availability.

Therefore, the question becomes: how much availability is lost when we guarantee consistency? In practice, the answer is very little. Systems that guarantee consistency only experience a necessary reduction in availability in the event of a network partition. As networks become more redundant, partitions become an increasingly rare event. And even if there is a partition, it is still possible for the majority partition to be available. Only the minority partition must become unavailable. Therefore, for the reduction in availability to be perceived, there must be both a network partition, and also clients that are able to communicate with the nodes in the minority partition (and not the majority partition). This combination of events is typically rarer than other causes of system unavailability. Consequently, the real world impact of guaranteeing consistency on availability is often negligible. It is very possible to have a system that guarantees consistency and achieves high availability at the same time.
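The argument can be written as a back-of-the-envelope calculation. The Python sketch below is only an illustration: the input fractions are hypothetical placeholders rather than measurements, and it assumes partitions are independent of the other causes of downtime.

```python
def availability_with_consistency(base_availability,
                                  partition_fraction,
                                  minority_client_fraction):
    """Availability seen by clients once consistency-induced loss is added.

    base_availability:        availability from all other causes (e.g. 0.9999)
    partition_fraction:       fraction of time a partition is in progress
    minority_client_fraction: fraction of clients that can reach only the
                              minority side while a partition is in progress
    All three inputs are hypothetical; independence is assumed.
    """
    # A request is lost to the consistency guarantee only when a partition
    # is in progress AND the client is stuck on the minority side.
    cp_loss = partition_fraction * minority_client_fraction
    return base_availability * (1.0 - cp_loss)

# Hypothetical inputs: four nines otherwise, partitions 0.01% of the time,
# 10% of clients stranded on the minority side during a partition.
result = availability_with_consistency(0.9999, 0.0001, 0.1)
```

With these made-up inputs the additional loss is on the order of one part in 100,000, an order of magnitude smaller than the unavailability the system already had.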

[Side note: I have written extensively about these issues with the CAP theorem. I believe the PACELC theorem is better able to summarize consistency tradeoffs in distributed systems.]

The glorious return of consistent NewSQL systems

The argument above actually results in 3 distinct reasons for modern systems to be CP from CAP, instead of AP (i.e. choose consistency over availability):

(1) Systems that fail to guarantee consistency result in complex, expensive, and often buggy application code.

(2) The reduction of availability that is caused by the guarantee of consistency is minute, and hardly noticeable for many deployments.

(3) The CAP theorem is fundamentally asymmetrical. CP systems can guarantee consistency. AP systems do not guarantee availability (no system can guarantee 100% availability). Thus only one side of the CAP theorem opens the door for any useful guarantees.

I believe that the above three points are what have caused the amazing renaissance of distributed, transactional database systems (many of which have become commercially available in the past few years) that choose to be CP from CAP instead of AP. There is still certainly a place for AP systems, and their associated NoSQL implementations. But for most developers, building on top of a CP system is a safer bet.

However, when I say that CP systems are the safer bet, I intend to refer to CP systems that actually guarantee consistency. Unfortunately, way too many of these modern NewSQL systems fail to guarantee consistency, despite their claims to the contrary. And once the guarantee is removed, the corner case bugs, complexity, and costs return.

Spanner is the source of the problem

I have discussed in previous posts that there are many ways to guarantee consistency in distributed systems. The most popular mechanism, which guarantees consistency at minimal cost to availability, is to use the Paxos or Raft consensus protocols to enforce consistency across multiple replicas of the data. At a simplified level, these protocols work via a majority voting mechanism. Any change to the data requires a majority of replicas to agree to the change. This allows the minority of replicas to be down or unavailable and the system can nonetheless continue to read or write data.
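The majority rule itself is simple arithmetic. The Python sketch below shows only the quorum counting described above, not a full Paxos or Raft implementation, and the function names are my own.

```python
def majority(n_replicas):
    """Smallest number of replicas that forms a majority."""
    return n_replicas // 2 + 1

def can_commit(acks, n_replicas):
    """A change is accepted once a majority of replicas have agreed to it."""
    return acks >= majority(n_replicas)

def tolerated_failures(n_replicas):
    """How many replicas may be unavailable while a majority is still reachable."""
    return n_replicas - majority(n_replicas)
```

With 5 replicas, 3 must agree and 2 may be down. Because any two majorities of the same replica set share at least one replica, two conflicting changes cannot both be accepted.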

Most NewSQL systems use consensus protocols to enforce consistency. However, they differ in a significant way in how they use these protocols. I divide NewSQL systems into two categories along this dimension: The first category, as embodied in systems such as Calvin (which came out of my research group) and FaunaDB, uses a single, global consensus protocol per database. Every transaction participates in the same global protocol. The second category, as embodied in systems such as Spanner, CockroachDB, and YugaByte, partitions the data into ‘shards’, and applies a separate consensus protocol per shard.

The main downside of the first category is scalability. A server can process a fixed number of messages per second. If every transaction in the system participates in the same consensus protocol, the same set of servers vote on every transaction. Since voting requires communication, the number of votes per second is limited by the number of messages each server can handle. This limits the total number of transactions per second that the system can handle.

Calvin and FaunaDB get around this downside via batching. Rather than voting on each transaction individually, they vote on batches of transactions. Each server batches all transactions that it receives over a fixed time period (e.g., 10 ms), and then initiates a vote on that entire batch at once. With 10ms batches, Calvin was able to achieve a throughput of over 500,000 transactions per second. For comparison, Amazon.com and NASDAQ likely process no more than 10,000 orders/trades per second even during peak workloads [Update: there has been some discussion about these numbers from my readers. The number for NASDAQ might be closer to 100,000 orders per second. I have not seen anybody dispute the 10,000 orders per second number from Amazon.com, but readers have pointed out that they issue more than 10,000 writes to the database per second. However, this blog post is focused on strictly serializable transactions rather than individual write operations. For Calvin's 500,000 transactions per second number, each transaction included many write operations.]
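The batching idea can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions (a single server, time passed in explicitly as milliseconds, and a hypothetical `Batcher` class), not the actual Calvin or FaunaDB code.

```python
class Batcher:
    """Collects transactions and closes one batch per fixed time window."""

    def __init__(self, window_ms=10):
        self.window_ms = window_ms
        self.window_start_ms = 0
        self.current = []
        self.closed_batches = []  # each entry would trigger ONE consensus vote

    def _roll(self, now_ms):
        # Close every window that has fully elapsed by `now_ms`.
        while now_ms >= self.window_start_ms + self.window_ms:
            if self.current:
                self.closed_batches.append(self.current)
                self.current = []
            self.window_start_ms += self.window_ms

    def submit(self, txn, now_ms):
        """Add a transaction to the batch for the window containing `now_ms`."""
        self._roll(now_ms)
        self.current.append(txn)

    def flush(self, now_ms):
        """Close any window that has elapsed, without adding a transaction."""
        self._roll(now_ms)
```

With a 10 ms window, a server initiates at most 100 votes per second no matter how many transactions arrive, so the message cost of consensus stops growing with the transaction rate.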

The main downside of the second category is that by localizing consensus on a per-shard basis, it becomes nontrivial to guarantee consistency in the presence of transactions that touch data in multiple shards. The quintessential example is the case of someone performing a sequence of two actions on a photo-sharing application: (1) removing her parents’ permission to see her photos, and (2) posting her photos from spring break. Even though there was a clear sequence of these actions from the vantage point of the user, if the permissions data and the photo data are located in separate shards, and the shards perform consensus separately, there is a risk that the parents will nonetheless be able to see the user’s recently uploaded photos.
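The photo-sharing anomaly can be reproduced with a toy model. The Python sketch below assumes that each shard orders only its own writes and that a reader may observe any prefix of each shard's log independently. It models the absence of cross-shard ordering in general, not the behavior of any particular system, and all names in it are hypothetical.

```python
from itertools import product

# Writes issued by the user, in real-time order: (shard, key, new value).
WRITES = [
    ("permissions", "parents_allowed", False),  # (1) remove parents' access
    ("photos", "spring_break_posted", True),    # (2) post the photos
]
INITIAL = {"parents_allowed": True, "spring_break_posted": False}

def apply_writes(writes):
    """Return the state a reader sees after the given writes are visible."""
    state = dict(INITIAL)
    for _shard, key, value in writes:
        state[key] = value
    return state

def parents_see_photos(state):
    return state["parents_allowed"] and state["spring_break_posted"]

def snapshots_per_shard_order():
    """Each shard has its own log; a reader sees an independent prefix of each."""
    shards = sorted({w[0] for w in WRITES})
    logs = {s: [w for w in WRITES if w[0] == s] for s in shards}
    for cuts in product(*(range(len(logs[s]) + 1) for s in shards)):
        visible = [w for s, c in zip(shards, cuts) for w in logs[s][:c]]
        yield apply_writes(visible)

def snapshots_global_order():
    """One log for the whole database; a reader sees a prefix of that log."""
    for cut in range(len(WRITES) + 1):
        yield apply_writes(WRITES[:cut])
```

Under per-shard ordering, one reachable snapshot contains the new photos together with the old permissions. Under a single global order, no prefix of the log produces that state.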

Spanner famously got around this downside with their TrueTime API. All transactions receive a timestamp which is based on the actual (wall-clock) current time. This enables there to be a concept of “before” and “after” for two different transactions, even those that are processed by completely disjoint sets of servers. The transaction with a lower timestamp is “before” the transaction with a higher timestamp. Obviously, there may be a small amount of skew across the clocks of the different servers. Therefore, Spanner utilizes the concept of an “uncertainty” window which is based on the maximum possible time skew across the clocks on the servers in the system. After completing their writes, transactions wait until after this uncertainty window has passed before they allow any client to see the data that they wrote.
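The waiting step can be sketched as follows. This Python example uses a simulated clock rather than real time, and `SimulatedClock` and `commit` are hypothetical names of my own. It is a simplified illustration of the idea described above, not Spanner's actual TrueTime API or commit protocol.

```python
class SimulatedClock:
    """Hypothetical stand-in for a clock with a known error bound."""

    def __init__(self, true_time_ms, epsilon_ms):
        self.true_time_ms = true_time_ms
        self.epsilon_ms = epsilon_ms  # assumed maximum clock uncertainty

    def now(self):
        """Return (earliest, latest): an interval containing the true time."""
        return (self.true_time_ms - self.epsilon_ms,
                self.true_time_ms + self.epsilon_ms)

    def advance(self, ms):
        self.true_time_ms += ms

def commit(clock, tick_ms=1):
    """Assign a timestamp, then wait out the uncertainty window.

    Returns (commit_timestamp, time_waited_ms).
    """
    _, latest = clock.now()
    ts = latest  # a timestamp that is not earlier than the true time
    waited = 0
    # Do not expose the writes until even the earliest possible current
    # time is later than the timestamp that was assigned.
    while clock.now()[0] <= ts:
        clock.advance(tick_ms)
        waited += tick_ms
    return ts, waited
```

In this model the wait is roughly twice the uncertainty bound, which is why a smaller bound translates directly into lower transaction latency.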

Spanner thus faces a potentially uncomfortable tradeoff. It is desirable that the uncertainty window should be as small as possible, since as it gets larger, the latency of transactions increases, and the overall concurrency of the system decreases. On the other hand, it needs to be 100% sure that clock skew never gets larger than the uncertainty window (since otherwise the guarantee of consistency would no longer exist), and thus larger windows are safer than smaller ones.

Spanner handles this tradeoff with a specialized hardware solution that uses both GPS and atomic clocks to ensure a minimal clock skew across servers. This solution allows the system to keep the uncertainty window relatively narrow while at the same time keeping the probability of incorrect uncertainty window estimates (and corresponding consistency violations) to be extremely small. Indeed, the probability is so small that Spanner’s architects feel comfortable claiming that Spanner “guarantees” consistency.

[It is worth noting at this point that systems that use global consensus avoid this problem entirely. If every transaction goes through the same protocol, then a natural order of all transactions emerges --- the order is simply the order in which transactions were voted on during the protocol. When batches are used instead of transactions, it is the batches that are ordered during the protocol, and transactions are globally ordered by combining their batch identifier with their sequence number within the batch. There is no need for clock time to be used in order to create a notion of before or after. Instead, the consensus protocol itself can be used to elegantly create a global order.]
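The ordering rule for batches is easy to state in code. The Python sketch below assumes the consensus protocol has already produced the batches in their agreed order. The function names are illustrative.

```python
def global_order(ordered_batches):
    """Yield (batch_id, seq_in_batch, txn) in the agreed global order.

    `ordered_batches` is the list of batches in the order in which the
    consensus protocol decided them.
    """
    for batch_id, batch in enumerate(ordered_batches):
        for seq, txn in enumerate(batch):
            yield (batch_id, seq, txn)

def happened_before(a, b):
    """Compare two positions: batch identifier first, then sequence number."""
    return a[:2] < b[:2]
```

No clock appears anywhere in this comparison. The position of a transaction is fully determined by the batch it landed in and its sequence number within that batch.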

Spanner Derivatives

Spanner is a beautiful and innovative system. It was also invented by Google and widely used there. Either because of the former or latter (or both), it has been extremely influential, and many systems (e.g., CockroachDB and YugaByte) have been inspired by the architectural decisions made by Spanner. Unfortunately, these derivative systems are software-only, which implies that they have inherited only the software innovations without the hardware and infrastructure upon which Spanner relies at Google. In light of Spanner’s decision to have separate consensus protocols per shard, software-only derivatives are extremely dangerous. Like Spanner, these systems rely on real-world time in order to enforce consistency: CockroachDB on HLC (hybrid logical clocks) and YugaByte on Hybrid Time. Like Spanner, these systems rely on knowing the maximum clock skew across servers in order to avoid consistency violations. But unlike Spanner, these systems lack hardware and infrastructure support for minimizing and measuring clock skew uncertainty.

CockroachDB, to its credit, has acknowledged that by only incorporating Spanner’s software innovations, the system cannot guarantee CAP consistency (which, as mentioned above, is linearizability).

YugaByte, however, continues to claim a guarantee of consistency [Edit for clarification: YugaByte only makes this claim for single key operations; however, YugaByte also relies on time synchronization for reading correct snapshots for transactions running under snapshot isolation.]. I would advise people to be wary of these claims which are based on assumptions of maximum clock skew. YugaByte, by virtue of its Spanner roots, will run into consistency violations when the local clock on a server suddenly jumps beyond the skew uncertainty window. This can happen under a variety of scenarios such as when a VM that is running YugaByte freezes or migrates to a different machine. Even without sudden jumps, YugaByte’s free edition relies on the user to set the assumptions about maximum clock skew. Any mistaken assumptions on behalf of the user can result in consistency violations.
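The failure mode can be illustrated with a deliberately simplified toy model. The Python sketch below is not YugaByte's (or any other system's) actual implementation. It assumes two servers, A and B, where A's clock runs ahead of B's by some actual skew, and where A waits out an assumed maximum skew before acknowledging a write. All names and numbers are hypothetical.

```python
def later_txn_gets_lower_timestamp(assumed_max_skew_ms, actual_skew_ms):
    """Return True if a transaction that starts after another has been
    acknowledged can still be assigned a smaller timestamp.

    Toy model: server A's clock runs `actual_skew_ms` ahead of server B's,
    and B's clock is taken to show the true time.
    """
    true_time = 1000
    # T1 commits on server A and is stamped with A's (fast) local clock.
    t1_timestamp = true_time + actual_skew_ms
    # A waits out the window it BELIEVES bounds the skew, then acknowledges T1.
    true_time += assumed_max_skew_ms
    # T2 starts on server B only after T1 was acknowledged.
    t2_timestamp = true_time
    return t2_timestamp < t1_timestamp
```

As long as the actual skew stays within the assumed bound, the later transaction gets the later timestamp. Once a clock jumps past the bound, the timestamps invert and the "before" and "after" relationship seen by readers no longer matches real time.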

In contrast to CockroachDB and YugaByte, FaunaDB was inspired by Calvin instead of Spanner. [Historical note: the Calvin and Spanner papers were both published in 2012]. FaunaDB therefore has a single, elegant, global consensus protocol, and needs no small print regarding clock skew assumptions. Consequently, FaunaDB is able to guarantee consistency of transactions that modify any data in the database without concern for the corner case violations that can plague software-only derivatives of Spanner-style systems.

There are other differences between Calvin-style systems and Spanner-style systems that I’ve talked about in the past. In this post we focused on perhaps the most consequential difference: global consensus vs. partitioned consensus. As with any architectural decision, there are tradeoffs between these two options. For the vast majority of applications, exceeding 500,000 transactions a second is beyond their wildest dreams. If so, then the decision is clear. Global consensus is probably the better choice.

[Editor's note: Daniel Abadi is an advisor at FaunaDB.]

[This article includes a description of work that was funded by the NSF under grant IIS-1763797. Any opinions, findings, and conclusions or recommendations expressed in this article are those of Daniel Abadi and do not necessarily reflect the views of the National Science Foundation.]

50 comments:

Unknown, September 21, 2018 at 7:27 AM
Very interesting read and well written! Shameless plug: I'm a master's student in compsci and currently looking for an interesting 6-month research topic for my thesis in exactly this area (distributed database systems). While I have an interest in theory, I have spent most of my time on practical stuff and have a hard time finding something worthy to research. Any suggestions or tips?
Tomer Ben David, September 21, 2018 at 12:12 PM
The itch I had has been confirmed :)
Unknown, September 21, 2018 at 1:06 PM
CTO of YugaByte here. We firmly stand by our claims, and I wanted to explain more.

From the post by Daniel:
<< CockroachDB, to its credit, has acknowledged that by only incorporating Spanner’s software innovations, the system cannot guarantee CAP consistency (which, as mentioned above, is linearizability).

YugaByte, however, continues to claim a guarantee of consistency. I would advise people not to trust this claim. YugaByte, by virtue of its Spanner roots, will run into consistency violations when the local clock on a server suddenly jumps beyond the skew uncertainty window. >>

The statement about YugaByte DB is incorrect.

1. With respect to CAP, both Cockroach DB (https://www.cockroachlabs.com/blog/limits-of-the-cap-theorem/) and YugaByte DB (https://docs.yugabyte.com/latest/faq/architecture/#how-can-yugabyte-db-be-both-cp-and-ha-at-the-same-time) are CP databases with HA and there is really no difference in the claims.

2. With respect to Isolation level in ACID, YugaByte DB does not make the linearizability (called external consistency by Google Spanner) claim. YugaByte DB offers Snapshot Isolation (detects write-write conflicts) today and Serializable isolation (detect read-write and write-write conflicts) is in the roadmap (https://docs.yugabyte.com/latest/architecture/transactions/isolation-levels/).

3. We have publicly claimed that we do rely on NTP and max clock skew bounds to guarantee consistency. For example, slide 43 of our NorCal DB Day talk (https://www.slideshare.net/YugaByte/yugabyte-db-architecture-storage-engine-and-transactions) we mention we are “relying on bounded clock sync (NTP, AWS Time Sync, etc).”
Anonymous, September 21, 2018 at 3:11 PM
Great post, very informative.
Curious to get your take on TiDB (https://github.com/pingcap/tidb) and its transaction model as it relates to consistency guarantee.
One of its engineers wrote about how it differs from Spanner and Cockroach here: https://dzone.com/articles/tick-or-tock-keeping-time-and-order-in-distributed-1
SergioClemente, September 21, 2018 at 4:11 PM
Spanner TrueTime is used for deciding the timestamp of the transaction. If the transaction touches more than one shard, it uses two-phase commit, as mentioned in the Spanner paper [1]: "If a transaction involves more than one Paxos group, those groups’ leaders coordinate to perform two-phase commit."

[1] https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf
Unknown, September 21, 2018 at 5:06 PM
Though mild, there is certainly a contradiction in conceding that the CAP theorem is flawed because its claims are too absolute/theoretical (agreed btw!) but then giving these systems a hard time for using a more practical level of consistency than that which CAP relies upon.

Linearizability is not the be all and end all. We should applaud these practical, productionized contributions to the space, as long as their tradeoffs are clear!
AndyAndRachel, September 21, 2018 at 6:37 PM
Daniel, you did not mention one of the biggest downsides of a central oracle design: latency (especially read latency). For example, say your database is set up with nodes in both US and European data centers that each store data for users located nearby. You have to decide where to locate your central oracle - which customers will get higher latencies?

By contrast, bounded-timestamp systems can read/write locally in both US and Europe with no cross-ocean coordination necessary. That solves a real business problem for global companies, which means that the engineering will eventually catch up. Cloud providers will eventually install atomic clocks/GPS systems in all their DCs and make TrueTime-like APIs for developers to use. At that point, Cockroach/YugaByte can become fully linearizable with minimal effort.

Your arguments have merit, but they come with an expiration date.
Unknown, September 22, 2018 at 7:11 AM
We recently open sourced our project https://github.com/rubrikinc/kronos which tackles some of the clock synchronization problems in the absence of atomic clocks. We have integrated it with CockroachDB and have seen promising results. In practice we have seen it steadily maintains offsets within the cluster within a few hundred microseconds depending on RTT between nodes (which is acceptable for normal operations) and system clock jumps on nodes don't impact this service. Offsets can get a bit higher depending on inter node clock drifts, but they are periodically corrected.
Jamie, September 22, 2018 at 10:02 PM
Have you seen the Huygens clock sync paper (https://www.usenix.org/system/files/conference/nsdi18/nsdi18-geng.pdf)?

re Calvin: a lot of real world applications cannot be structured as deterministic server-side transactions.
Unknown, September 23, 2018 at 10:15 PM
Hi Daniel, great post.

However, the claim "Calvin was able to achieve a throughput of over 500,000 transactions per second" seems to be a limitation for large-scale internet companies. For example, I used to work for Ola (a ride-hailing company similar to Uber), and I can say that a single system used to generate a load of over 1 million transactions per second. Of course we could use application-level sharding, but that might again introduce more bugs and force developers to redo the data model.
Lu Pan, September 24, 2018 at 11:16 PM
Great post! I also have a question on the half-million txn per second claim. 1) Are the numbers for Amazon and NASDAQ client purchase txns rather than database txns? Usually the write amplification is significant, since lots of metadata and other data are written at the same time for indexing, offline ML, etc. 2) Do you think we can also optionally do reads outside of Paxos, for certain queries that do not require strong consistency, to improve throughput? In practice, I would assume Calvin-type distributed databases are more read heavy (maybe Kafka + materialized views for write-heavy workloads?). Anyway, for read-heavy workloads, most reads usually do not require strong consistency. Doing those outside of Paxos should improve throughput a lot.
Anonymous, September 25, 2018 at 9:16 AM
Don't settle for eventual consistency https://yokota.blog/2017/02/17/dont-settle-for-eventual-consistency/
Anonymous, September 25, 2018 at 9:17 AM
Database Comparison: An In-Depth Look at How MapR-DB Does What Cassandra, HBase, and Others Can't https://mapr.com/blog/database-comparison-an-in-depth-look-at-mapr-db/
freemind, October 21, 2018 at 8:35 AM
Cassandra claims a _strong_ consistency if read consistency level + write consistency level > number of replicas.

https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlAboutDataConsistency.html

however, this guarantee, too, relies on no clock skew as Cassandra uses the most recent write as the winning write.
Unknown, October 29, 2018 at 1:04 AM
You do not mention strong eventual consistency, aka CRDTs.
With CRDTs, your nodes reach a consistent state without "traditional" synchronization for certain operations, with low (local) latency.
Therefore there are quite a few operations which would be much slower (and potentially less available) with a "NewDB". Just assume you want cross data center replication.
You are somewhat correct that quite a few applications might get away with a NewDB. Spanner tries to make the cross-DC case acceptably fast by making some compromises. I don't think they can be blamed for that.