Wednesday, December 7, 2011

Replication and the latency-consistency tradeoff

As 24/7 availability becomes increasingly important for modern applications, database systems are frequently replicated in order to stay up and running in the face of database server failure. It is no longer acceptable for an application to wait for a database to recover from a log on disk --- most mission-critical applications need immediate failover to a replica.

There are several important tradeoffs to consider when it comes to system design for replicated database systems. The most famous one is CAP --- you have to trade off consistency vs. availability in the event of a network partition. In this post, I will go into detail about a lesser-known but equally important tradeoff --- between latency and consistency. Unlike CAP, where consistency and availability are only traded off in the event of a network partition, the latency vs. consistency tradeoff is present even during normal operations of the system. (Note: the latency-consistency tradeoff discussed in this post is the same as the "ELC" case in my PACELC post).

The intuition behind the tradeoff is the following: there's no way to perform consistent replication across database replicas without some level of synchronous network communication. This communication takes time and introduces latency. For replicas that are physically close to each other (e.g., on the same switch), this latency is not necessarily onerous. But replication over a WAN will introduce significant latency.
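As a back-of-the-envelope illustration of this intuition, consider a toy model (all numbers are illustrative, not measurements) in which synchronous replication must wait for an acknowledgment from every replica, so commit latency is gated by the slowest round trip:

```python
# Synchronous replication commits only after the slowest replica acks,
# so commit latency is bounded below by the worst round-trip time (RTT).
local_rtts_ms = [0.2, 0.3, 0.25]   # replicas on the same switch (illustrative)
wan_rtts_ms   = [0.2, 40.0, 85.0]  # replicas spread across continents (illustrative)

def sync_commit_latency(rtts):
    """Must wait for every ack before committing."""
    return max(rtts)

def async_commit_latency(rtts):
    """Ack as soon as one durable copy exists -- faster, but weaker consistency."""
    return min(rtts)

assert sync_commit_latency(local_rtts_ms) < 1      # sub-millisecond locally
assert sync_commit_latency(wan_rtts_ms) == 85.0    # WAN pays for the slowest link
```

The asymmetry is the whole tradeoff in miniature: the synchronous variant preserves consistency at the price of the slowest link, while the asynchronous variant hides that latency by giving up the guarantee.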

The rest of this post adds more meat to the above intuition. I will discuss several general techniques for performing replication, and show how each technique trades off latency or consistency. I will then discuss several modern implementations of distributed database systems and show how they fit into the general replication techniques that are outlined in this post.

There are only three alternatives for implementing replication (each with several variations): (1) data updates are sent to all replicas at the same time, (2) data updates are sent to an agreed upon master node first, or (3) data updates are sent to a single (arbitrary) node first. Each of these three cases can be implemented in various ways; however, each implementation comes with a consistency-latency tradeoff. This is described in detail below.

  1. Data updates are sent to all replicas at the same time. If updates are not first passed through a preprocessing layer or some other agreement protocol, replica divergence (a clear lack of consistency) could ensue (assuming multiple updates are submitted to the system concurrently, e.g., from different clients), since each replica might choose a different order in which to apply the updates. On the other hand, if updates are first passed through a preprocessing layer, or if all nodes involved in the write use an agreement protocol to decide on the order of operations, then it is possible to ensure that all replicas agree on the order in which to process the updates, but this leads to several sources of increased latency. For the case of the agreement protocol, the protocol itself is the additional source of latency. For the case of the preprocessor, the additional sources of latency are:

    1. Routing updates through an additional system component (the preprocessor) increases latency

    2. The preprocessor consists of either multiple machines or a single machine. If it consists of multiple machines, an agreement protocol to decide on operation ordering is still needed across those machines. Alternatively, if it runs on a single machine, all updates, no matter where they are initiated (potentially anywhere in the world), are forced to route all the way to that single preprocessor first, even if there is a data replica nearer to the location where the update was initiated.

  2. Data updates are sent to an agreed upon location first (this location can be dependent on the actual data being updated) --- we will call this the “master node” for a particular data item. This master node resolves all requests to update the same data item, and the order that it picks to perform these updates will determine the order that all replicas perform the updates. After it resolves updates, it replicates them to all replica locations. There are three options for this replication:

    1. The replication is done synchronously, meaning that the master node waits until all updates have made it to the replica(s) before "committing" the update. This ensures that the replicas remain consistent, but synchronous actions across independent entities (especially over a WAN) increase latency, due both to the requirement to pass messages between these entities and to the fact that latency is limited by the speed of the slowest entity.

    2. The replication is done asynchronously, meaning that the update is treated as if it were completed before it has been replicated. Typically the update has at least made it to stable storage somewhere before the initiator of the update is told that it has completed (in case the master node fails), but there are no guarantees that the update has been propagated to replicas. The consistency-latency tradeoff in this case is dependent on how reads are dealt with:
      1. If all reads are routed to the master node and served from there, then there is no reduction in consistency. However, there are several latency problems with this approach:
        1. Even if there is a replica close to the initiator of the read request, the request must still be routed to the master node which could potentially be physically much farther away.

        2. If the master node is overloaded with other requests or has failed, there is no option to serve the read from a different node. Rather, the request must wait for the master node to become free or recover. In other words, there is a potential for increased latency due to lack of load balancing options.

      2. If reads can be served from any node, read latency is much better, but this can result in inconsistent reads of the same data item, since different locations have different versions of a data item while its updates are still being propagated, and a read can potentially be sent to any of these locations. Although the level of reduced consistency can be bounded by keeping track of update sequence numbers and using them to implement “sequential/timeline consistency” or “read-your-writes consistency”, these options are nonetheless reduced consistency options. Furthermore, write latency can be high if the master for a write operation is geographically far away from the requester of the write.

    3. A combination of (a) and (b) is possible. Updates are sent to some subset of replicas synchronously, and the rest asynchronously. The consistency-latency tradeoff in this case again is determined by how reads are dealt with. If reads are routed to at least one node that had been synchronously updated (e.g. when R + W > N in a quorum protocol, where R is the number of nodes involved in a synchronous read, W is the number of nodes involved in a synchronous write, and N is the number of replicas), then consistency can be preserved, but the latency problems of (a), (b)(i)(1), and (b)(i)(2) are all present (though to somewhat lower degrees, since the number of nodes involved in the synchronization is smaller, and there is potentially more than one node that can serve read requests). If it is possible for reads to be served from nodes that have not been synchronously updated (e.g. when R + W <= N), then inconsistent reads are possible, as in (b)(ii) above.

  3. Data updates are sent to an arbitrary location first, the updates are performed there, and are then propagated to the other replicas. The difference between this case and case (2) above is that the location that updates are sent to for a particular data item is not always the same. For example, two different updates for a particular data item can be initiated at two different locations simultaneously. The consistency-latency tradeoff again depends on two options:
    1. If replication is done synchronously, then the latency problems of case (2)(a) above are present. Additionally, extra latency can be incurred in order to detect and resolve cases of simultaneous updates to the same data item initiated at two different locations.

    2. If replication is done asynchronously, then similar consistency problems as described in cases (1) and (2)(b) above present themselves.
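To make the quorum arithmetic in option (2)(c) concrete, here is a minimal sketch in Python (a toy model; `QuorumStore` and its methods are illustrative names, not any real system's API). With R + W > N, any read quorum must overlap the write quorum, so the newest version is always observed; with R + W <= N, a read quorum can miss the write quorum entirely:

```python
import itertools

class QuorumStore:
    """Toy model of quorum replication: N replicas, each holding a
    (version, value) pair for a single data item."""

    def __init__(self, n):
        self.n = n
        self.replicas = [(0, None)] * n  # (version, value) per replica

    def write(self, w, value, targets):
        """Synchronously install the write on W chosen replicas; the
        remaining N - W replicas are updated asynchronously (not modeled)."""
        version = max(v for v, _ in self.replicas) + 1
        for i in targets[:w]:
            self.replicas[i] = (version, value)
        return version

    def read(self, r, targets):
        """Read R replicas and return the value with the highest version."""
        return max((self.replicas[i] for i in targets[:r]),
                   key=lambda pair: pair[0])[1]

# R + W > N (here 2 + 2 > 3): every possible read quorum overlaps the
# write quorum, so the newest value is observed no matter which nodes are read.
store = QuorumStore(n=3)
store.write(w=2, value="new", targets=[0, 1])
assert all(store.read(r=2, targets=list(t)) == "new"
           for t in itertools.combinations(range(3), 2))

# R + W <= N (here 1 + 1 <= 3): a read quorum can miss the write quorum
# entirely, so a stale read (here: the old empty value) is possible.
store2 = QuorumStore(n=3)
store2.write(w=1, value="new", targets=[0])
assert store2.read(r=1, targets=[2]) is None  # stale read
```

The same overlap argument is why lowering W or R buys latency (fewer synchronous round trips, more nodes eligible to serve a request) at the cost of the consistency guarantee.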

Therefore, no matter how the replication is performed, there is a tradeoff between consistency and latency. For carefully controlled replication across short distances, there exist reasonable options (e.g. choice (2)(a) above, since network communication latency is small within a local data center); however, for replication over a WAN, there is no way around the significant consistency-latency tradeoff.
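The sequence-number bookkeeping mentioned in option (2)(b)(ii) can be sketched as follows (again a toy model with illustrative names): the client remembers the sequence number of its own last write, and refuses to accept a read from a replica that has not yet caught up to it, which yields read-your-writes consistency without forcing every read to the master.

```python
class Replica:
    """A replica that applies asynchronously propagated updates in order."""
    def __init__(self):
        self.seq = 0        # sequence number of the last update applied
        self.value = None

    def apply(self, seq, value):
        self.seq, self.value = seq, value

class Client:
    """Tracks the highest sequence number it has written, and rejects
    reads from any replica that has not yet caught up to it."""
    def __init__(self, master, replicas):
        self.master = master
        self.replicas = replicas
        self.last_written = 0

    def write(self, value):
        self.master.seq += 1            # master orders all updates
        self.master.value = value
        self.last_written = self.master.seq  # remember our own write

    def read(self, replica):
        if replica.seq < self.last_written:
            raise RuntimeError("replica stale; retry elsewhere or wait")
        return replica.value

master, lagging = Replica(), Replica()
client = Client(master, [lagging])
client.write("v1")          # goes to the master; propagation is async
# Reading the lagging replica before propagation would violate
# read-your-writes, so the client detects the stale sequence number:
try:
    client.read(lagging)
except RuntimeError:
    pass                    # fall back to the master (or another replica)
assert client.read(master) == "v1"
```

Note that this is still a reduced-consistency option: other clients, which did not perform the write and hold no sequence number for it, can read stale data from the lagging replica without any error being raised.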

To more fully understand the tradeoff, it is helpful to consider how several well-known distributed systems are placed into the categories outlined above. Dynamo, Riak, and Cassandra choose a combination of (2)(c) and (3) from the replication alternatives described above. In particular, updates generally go to the same node, and are then propagated synchronously to W other nodes (case (2)(c)). Reads are synchronously sent to R nodes, with R + W typically being set to a number less than or equal to N, leading to a possibility of inconsistent reads. However, the system does not always send updates to the same node for a particular data item (e.g., this can happen in various failure cases, or due to rerouting by a load balancer), which leads to the situation described in alternative (3) above, and the potentially more substantial types of consistency shortfalls. PNUTS chooses option (2)(b)(ii) above, for excellent latency at reduced consistency. HBase chooses (2)(a) within a cluster, but gives up consistency for lower latency for replication across different clusters (using option (2)(b)).

In conclusion, there are two major reasons to reduce consistency in modern distributed database systems, and only one of them is CAP. Ignoring the consistency-latency tradeoff of replicated systems is a great oversight, since it is present at all times during system operation, whereas CAP is only relevant in the (arguably) rare case of a network partition. In fact, the consistency-latency tradeoff is potentially more significant than CAP, since it has a more direct effect on the baseline operations of modern distributed database systems.


  1. Daniel, I am not convinced that the (obvious?) relationship between latency and consistency has been ignored.

    After all, the very same issues had already been extensively researched and addressed in context of multicore and the various shared memory consistency approaches by the chip manufacturers.

    CAP was 'significant' as it addressed a shortcoming entirely absent from h/w based shared memory systems: unreliable connectivity. Had h/w systems shared the same characteristics, there is no question that CAP would have emerged from those efforts.

    That said, thank you for the excellent survey of the approaches in context of distributed databases.

  2. Great article. I especially liked the explanations of the decisions that the different distributed DB systems made.

    I'm not sure that I agree that it is useful to consider latency outside of the CAP theorem. CAP states that you can't have consistency, availability and partition tolerance and have an acceptable latency. The concept of latency is baked into the Availability term -- if the latency is higher than your requirements dictate, you've lost Availability. You can configure Cassandra to give you strong consistency. The tradeoff isn't that you'll lose "availability" in some pure sense, as if your system will suddenly become unavailable. The tradeoff is increased latency. If that latency is so high that your users close the browser, or client timeouts occur, or your servers get overloaded with backlogged requests, that is the reduced availability predicted by the CAP theorem.

    Your article also discusses another (related, but separate) flavor of latency: the latency between different internal parts of the DB system, such as latency introduced by agreement protocols. Again, this latency is part of the CAP theorem itself. In fact, that latency is why the CAP theorem exists. If there were zero latency between internal components and protocols in a DB system, you could abstract the partition tolerance of the system and provide strong consistency without giving up availability. The fact that there is a delay between writes among servers is the reason we have to worry about CAP in the first place.

  3. @alphazero, @bhudgeons, thanks for your kind words.

    @bhudgeons, a lot of people like to put latency and availability together in the same category, and I don't have any particular problem with that (though I prefer to keep them separate since they generally describe different orders of magnitude levels of latency). However, even if you want to mix them, I maintain that there is still a fundamental difference between the A/L of CAP which is only traded off with C upon a network partition, and the A/L of this post, which is traded off with C at all times, even when there is no network partition.

  4. Dan,

    Great to see this post and also great to see that we're thinking about the issues in similar or parallel tracks.

    I do believe however that the "latency" as perceived by a database client application is a result of either "availability latency" or "consistency latency". In that sense, I don't believe that latency is really a new factor or dimension, rather just a consolidation of two factors already part of CAP; namely C and A.

    An application developer is really concerned about the numerical bounds (time limits) on "latency", and these are, I believe best visualized as the variables that I refer to as Tc and Ta in my posts on the subject. (Please see:

    You write, "Ignoring the consistency-latency tradeoff of replicated systems is a great oversight, since it is present at all times during system operation, whereas CAP is only relevant in the (arguably) rare case of a network partition."

    I agree entirely with you that IF one were to consider CAP to be relevant only in the rare case of a network partition, that would be foolish.

    And that is also why I characterize CAP as something to be considered at all time where:

    Ta + Tc >= Tp.

    More details, proof, applications are available at (part 1 of 6)

  5. Another dimension in the consistency discussion is the type of transaction being performed and how it can influence consistency/latency trade-offs. Obviously many of the Dynamo derived systems allow for a per connection (xact?) consistency setting (not all that conceptually dissimilar from traditional single db xact isolation levels). But I think consistency semantics of Google's Megastore is very relevant to this discussion and looks at consistency from a slightly different perspective. In particular I'm referring to the differing degrees of consistency guarantees the system provides depending on whether your transaction is within a particular "entity group" or spans entity groups.

  6. Daniel -- Thank you for the response -- I hope I'm not seeming argumentative. I have a feeling that I may have some fundamental misunderstanding of CAP, and I suspect others may share it.

    In the systems you described, each is using replication as a way to give away some C and A for some P. The architectural decisions they made (which you described) mean that some of them get more C, and some get more A.

    Are you just saying that it's useful to look at that decision (the C/A tradeoff) in isolation of the P (where you've already decided to use replication)? I still can't get my head around what the "fundamental difference" is between the A/L of CAP and the A/L of the CA tradeoff. If there is no network partition tolerance, why are you doing replication?

  7. Hiding synchronous replication stalls is, I suspect, very similar to OoO consistency and cache-miss stall-hiding tricks. Let's assume every write takes 1ms to receive all acks from replicating agents.

    [a] This latency only matters for *dependent* reads from the same source. So if I have a transaction with 20 writes and 1 read, this 1 read will take the stall, not the 20 other writes. Of course you can't allow the transaction to commit until all writes have been ack'ed, but I suspect you can use ack aggregation and batching.

    [b] Let's assume that the transaction contains multiple reads for data that is written in the same transaction. You can speculatively allow these reads to continue with uncommitted writes (again, batch your write-acks). These reads will of course have to be rolled back if any of the above writes' acks fail.

    Bottom line, it appears to me write-stalls really only matter for transaction commit time. As you can tell, I'm not a database internals person, but I enjoy reading your blogs. I do, however, intimately understand OoO processors, which solve a similar problem.

  8. @bhudgeons Per the 'proof' of CAP, "network" partition is not merely limited to (literally) network issues. Even without any network specific issues, node(s) may crash, and unless the system can transparently failover to a replica, then you have a 'network partition' in your distributed system, per the "proof" of CAP: [§ 2.3 //+emph//]:

    "In order to model partition tolerance, the network will be allowed to lose arbitrarily many messages sent from one node to another. When a network is partitioned, all messages sent from nodes in one component of the partition to nodes in another component are lost. (And //any pattern of message loss// can be modeled as a temporary partition separating the communicating nodes at the exact instant the message is lost.)"

  9. @Jamie Thanks for your comment.

    @bhudgeons I will try to answer your question in an upcoming longer publication on this topic.

    @Unknown Thanks for your comment. The latency I am talking about for this post includes the full latency all the way through transaction commit.

  10. Hi Daniel,

    Excellent post! I like the way you clarify these things.

    Did you have the chance to read the CAP theorem analysis mentioned by @amrith ?

    I would like to read a review about it from you.

  11. Hi Pablo,

    I read it a few months ago and sent my comments to Amrith. I do not feel comfortable making these comments public.

  12. Hello Daniel,
    Thanks for another insightful post.

    I have a question. It is not really about your main point, but still, I have wondered about it before. It is about your option 1. I agree some pre-processing is needed to prevent replica divergence. But I have always wondered: isn't this introducing a new SPoF
    (and thereby introducing a new CAP problem)?

  13. There is potentially another source of read latency when using a consistency level higher than One. In the Cassandra world, when a read detects inconsistent data from nodes, it is performed a second time and the inconsistencies are resolved. This happens synchronously with the read operation; the writes required to repair the inconsistency are then handled async to the read.

    The general question would be: how are inconsistencies handled during read operations?

    Aaron Morton

  14. Daniel,

    I recently released a tech report with some other folks at Berkeley that helps quantify the latency-consistency trade-off you describe here for quorum-replicated eventually consistent data stores. Eventual consistency makes no guarantees regarding data recency, but we can predict the expected consistency a data store provides, which we call Probabilistically Bounded Staleness, or PBS.

    Your post provided great motivation for our tech report (under submission):

    We also wrote a simple browser-based demo showing latency-consistency wins at:

    We'd welcome any feedback you have!

  15. I'll admit that we're lazy (and cheap). We just maintain parallel app stacks and failover completely to the other stack if one stack dies (and do load-balancing otherwise for read-only actions). Our data is such that we don't require absolute data-level consistency between stacks (eventual is ok), so we don't bother with db-level replication, which is expensive, hard, and typically the buggiest part of db engines.

    Our approach is actually far cheaper than trying to replicate at the db level, since we can "get away" with nearly-free open-source DBMSs, where replication is even buggier than it is with old-line (and hugely expensive) commercial DBMSs, and the infrastructure needed is more vertical and less horizontal than it would need to be if we tried to use db-level replication.