The Strict Serializability vs. Low Latency TradeoffModern applications span geographic borders with users located on any continent worldwide. Application state thus must be accessible anywhere. If application state never changes, it is possible for all users to interact with the global application with low latency. Application state is simply replicated to many locations around the world, and users can interact with their nearest replica. But for most applications, state changes over time: products get purchased, money is exchanged, users become linked, resources are consumed, etc. The problem is: sending messages across the world can take hundreds of milliseconds. There is thus a necessary delay before information about the state change travels to all parts of the world. In the meantime, applications have two choices: they can present stale, out-of-date state to the user, or they make the user wait until the correct, up-to-date state arrives.
Presenting stale state allows for low-latency interactions with the application around the world, but may lead to negative experiences by the end user. For example, observing stale state may result in users purchasing products that no longer exist, overdrawing balances from bank accounts, confusing discussion threading on social media, or reacting to incorrect stimuli in online multiplayer games.
Applications typically store state in some kind of “data store”. Some data stores guarantee “strong consistency” (such as strict consistency, atomic consistency, or linearizability, which I discussed in a previous post) that ensures that stale state will never be accessed. More specifically: any requests to access state on behalf of a client will reflect all state changes that occurred prior to that request. However, guaranteeing strong consistency may result in slower (high-latency) response times. In contrast, other data stores make reduced consistency guarantees and as a result, can achieve improved latency. In short, there is a tradeoff between consistency and latency: this is the ELC part of the PACELC theorem.
Separately, some systems provide support for transactions: groups of read and write requests that occur together as atomic units. Such systems typically provide “isolation” guarantees that prevent transactions from reading incomplete/temporary/incorrect data that were written by other concurrently running transactions. There are different levels of isolation guarantees, that I’ve discussed at length in previous posts (post 1, post 2, post 3), but the highest level is “serializability”, which guarantees that concurrent transactions are run in an equivalent fashion to how they would have been run if there were no other concurrently running transactions. In geo-distributed systems, where concurrently running transactions are running on physically separate machines that are potentially thousands of miles apart, serializability becomes harder to achieve. If fact, there exist proofs in the literature that show that there is a fundamental tradeoff between serializability and latency in geographically distributed systems.
Bottom line: there is a fundamental tradeoff between consistency and latency. And there is another fundamental tradeoff between serializability and latency.
Strict serializability includes perfect isolation (serializability) and consistency (linearizability) guarantees. If either one of those guarantees by itself is enough to incur a tradeoff with latency, then certainly the combination will necessarily require a latency tradeoff.
Indeed, a close examination of the commercial (and academic) data stores that guarantee strict serializability highlights this latency tradeoff. For example, Spanner (along with all the Spanner-derivative systems: Cloud Spanner, CockroachDB, and YugaByte) runs an agreement protocol (e.g. Paxos/Raft) as part every write to the data store in order to ensure that a majority of replicas agree to the write. This global coordination helps to ensure strict serializability, but increases the latency of each write. [Side note: neither CockroachDB nor YugaByte currently guarantee full strict serializability, but the reason for that is unrelated the discussion in this post, and discussed at length in several of my previous posts (post 1, post 2). The main point that is relevant for this post is that they inherit Spanner’s requirement to do an agreement protocol upon each write, which facilitates achieving a higher level of consistency than what would have been achieved if they did not inherit this functionality from Spanner.]
For example, if there are three copies of the data --- in North America, Europe, and Asia --- then every single write in Spanner and Spanner-derivative systems require agreement between at least two continents before it can complete. It is impossible to complete a write in less than 100 milliseconds in such deployments. Such high write latencies are unacceptable for most applications. Cloud Spanner, aware of this this latency problem that arises from the Spanner architecture, refuses to allow such “global” deployments. Rather, all writable copies of data must be within a 1000 mile radius (as of the time of publications of this post … see: https://cloud.google.com/spanner/docs/instances which currently says: “In general, the voting regions in a multi-region configuration are placed geographically close—less than a thousand miles apart—to form a low-latency quorum that enables fast writes (learn more).”). Spanner replicas outside of this 1000 mile radius must be read-only replicas that operate at a lag behind the quorum eligible copies. In summary, write latency is so poor in Spanner that they disable truly global deployments. [Side note: CockroachDB improves over Spanner by allowing truly global deployments and running Paxos only within geo-partitions, but as mentioned in the side note above, not with strict serializability.]
Amazon Aurora --- like Cloud Spanner --- explicitly disables strictly serializable global deployments. As of the time of this post, Aurora only guarantees serializable isolation for single-master deployments, and even then only on the primary instance. The read-only replicas operate at a maximum of “repeatable read” isolation.
Calvin and Calvin-derivative systems (such as Fauna, Streamy-db, and T-Part) typically have better write latency than Spanner-derivative systems (see my previous post on this subject for more details), but they still have to run an agreement protocol across the diameter of the deployment for every write transaction. The more geographically disperse the deployment, the longer the write latency.
In the research literature, papers like MDCC and TAPIR require at least one round trip across the geographic diameter of the deployment in order to guarantee strict serializability. Again: for truly global deployments, this leads to hundreds of milliseconds of latency.
In short: whether you focus on the theory, or whether you focus on the practice, it is seemingly impossible to achieve both strict serializability (for arbitrary transactions) and low latency reads and writes.
How SLOG cheats Strict Serializability vs. Low Latency TradeoffLet’s say that a particular data item was written by a client in China. Where is the most likely location from which it will be accessed next? In theory, it could be accessed from anywhere. But in practice for most applications, (at least) 8 times out of 10, it will be accessed by a client in China. There is a natural locality in data access that existing strictly serializable systems have failed to exploit.
If the system can be certain that the next read after a write will occur from the same region of the world that originated the write, there is no need (from a strict serializability correctness perspective) to synchronously replicate the write to any other region. The next read will be served from the same near-by replica from which the write originated, and is guaranteed to see the most up to date value. The only reason to synchronously replicate to another region is for availability --- in case an entire region fails. Entire region failure is rare, but still important to handle in highly available systems. Some systems, such as Amazon Aurora, only synchronously replicate to a nearby region. This limits the latency effect from synchronous replication. Other systems, such as Spanner, run Paxos or Raft, which causes synchronous replication to a majority of regions in a deployment --- a much slower process. The main availability advantage of Paxos/Raft over replication to only nearby regions is to be able to remain (partially) available in the event of network partitions. However, Google themselves claim that network partitions are very rare, and have almost no effect on availability in practice. And Amazon clearly came to the same conclusion in their decision to avoid Paxos/Raft from the beginning.
To summarize, in practice: replication within a region and to near-by regions can achieve near identical levels of availability as Paxos/Raft, yet avoid the latency costs. And if there is locality in data access, there is no need to run synchronous replication of every write for any reason unrelated to availability. So in effect, for workloads with access locality, there is no strict serializability vs. low latency tradeoff. Simply serve reads from the same location as the most recent write to that same data item, and synchronously replicate writes only within a region and to near-by regions for availability.
Now, let’s drop the assumption of perfect locality. Those reads that initiate far from where that data item was last written must travel farther and thus necessarily take longer (i.e. high latency). If such reads are common, it is indeed impossible to achieve low latency and strict serializability at the same time. However, as long as the vast majority of reads initiate from locations near to where that item was last written, the vast majority of reads will achieve low latency. And writes also achieve low latency (since they are not synchronously replicated across the geographic diameter of the deployment). In short, low latency is achieved for the vast majority of reads and writes, without giving up strict serializability.
Where things get complicated are (1) How does the system which is processing a read know which region to send it to? In other words, how does any arbitrary region know which was the last region to write a data item? (2) What happens if a transaction needs to access multiple data items that were last written by different regions?
The solution to (1) is simpler than the solution to (2). For (1), each data item is assigned only one region at a time as its “home region”. Write to that data item can only occur at that home region. As the region locality of a data item shifts over time, an explicit hand-off process automatically occurs in which that data item is “rehoused” at a new region. Each region has a cache of the current house for each data item. If the cache is out of date for a particular read, the read will be initially sent to the wrong region, but that region will immediately forward the read request to the new home of that data item.
The solution to (2) is far more complicated. Getting such transactions to run correctly without unreasonably high latencies was a huge challenge. Our initial approach (not discussed in the paper we published) was to rehouse all data items accessed by a transaction to the same region prior to running it. Unfortunately, this approach interfered with the system’s ability to take advantage of the natural region locality in a workload (by forcing data items to be rehoused to non-natural locations) and made it nearly impossible to run transactions with overlapping access sets in parallel, which destroyed the throughput scalability of the system.
Instead, the paper we published uses a novel asynchronous replication algorithm that is designed to work with a deterministic execution framework based on our previous research on Calvin. This solution enables processing of “multi-home” transactions with no forced-rehousing, with half-round-trip cross-region latency. For example, let’s say that a transaction wants to read, modify, and write two data items: X and Y, where X is housed in the USA and Y is housed in Europe. All writes to X that occurred prior to this transaction are asynchronously sent from the USA to Europe, and (simultaneously) all writes to Y that occurred prior to this transaction are asynchronously sent from Europe to the USA. After this simple one-way communication across the contents is complete, each region then independently runs the read/modify/writes to both X and Y locally, and rely on determinism to ensure that they both are guaranteed to write the same exact values and come to the same commit/abort decision for the transaction without further communication.
Obviously there are many details that I’m leaving out of this short blog post. The reader is encouraged to look at the full paper for more details. But the final performance numbers are astonishingly good. SLOG is able to get the same throughput as state-of-the-art geographically replicated systems such as Calvin (which is an order of magnitude better than Spanner) while at the same time getting an order of magnitude better latency for most transactions, and equivalent latency to Paxos-based systems such as Calvin and Spanner for the remaining “multi-home” transactions.
An example of the latency improvement of SLOG over Paxos/Raft-based systems for a workload in which 90% of all reads initiate from the same region as where they were last written is shown in the figure below: [Experimental setup is described in the paper, and is run over a multi-continental geo-distributed deployment that includes locations on the west and east coasts of the US, and Europe.]
This figure shows two versions of SLOG --- one version that only replicates data within the same region, and one version that replicates also to a near-by region (to be robust to entire region failure). For the version of SLOG with only intra-region replication, the figure shows that 80% of transactions complete in less than 10 milliseconds, 90% less than 20 milliseconds, and the remaining transactions take no longer than round-trip Paxos latency across the diameter of the geographically dispersed deployment. Adding synchronous replication to near-by regions adds some latency, but only a small factor. In contrast, Paxos/Raft-based systems such as Spanner and Calvin take orders of magnitude longer latency to complete transactions for this benchmark (which is a simplified version of Figure 12 from the SLOG paper).
Very much like Yahoo's user database implemented in the 90s. I believe the PNUTs paper also talks about record-level mastering that gives a similar result. https://pdfs.semanticscholar.org/876b/e80390bcaffb9b910ed05680b2e81a37d64d.pdfReplyDelete
Yes, except that PNUTS didn't support transactions. SLOG not only supports transactions, but even "multi-home" transactions at high throughput and low latency. As mentioned above --- this is the main break-through of SLOG: Take advantage of locality while still supporting strictly serializable transactions.Delete
Interesting paper. I have a question regarding strict serializability. The global logs you show in figure 3 (or figure 4) exhibit causal reverses (and aren't exact replicas of one another). It would be a violation of linearizability *if* a regional global log could be read at an arbitrary point. But they can't, so the system is considered linearizable. In other words, because the anomalies can't be observed through the regular transactional interface to the system, they don't "count". Is my understanding correct?ReplyDelete
This leads to the next question: how would one build a "hot" BACKUP facility in such a database (i.e. where BACKUP happens while system is running transactions)? Presumably, you'd want a consistent snapshot of the global state. But it would seem that no matter which point in time you pick, there's the possibility that your RESTORED database would allow observation of a causal reverse. For example, say two writers are racing, one continually writing values to region A and one continually writing values to region B. Without locking (of one sort or another), the BACKUP process cannot be guaranteed to make a consistent snapshot. But locking the entire database to perform BACKUP would seem impractical in a production system. Is there some reasonable solution that guarantees consistency that I'm overlooking?
Thanks for reading the paper (and for spending enough time reading it such that you understood it correctly).Delete
Yes, all reads have to go through the front door. If you start trying to read state without using SLOG's interface, you lose out on SLOG's guarantees. (This is true of many systems).
This implies that that BACKUP has to go through the front door. If you implement it as a regular read-only query in SLOG, this will result in a giant multi-home (read-only) transaction. The paper states that SLOG acquires read locks for all reads, and read locks will block writes, so transactions that write data that are appended to the local log at each region behind the BACKUP transaction would see a latency blip until the read locks are released. I believe this is premise of your question.
However, you are indeed overlooking a better solution. Though it is really my fault for not stating this explicitly in the paper. If you use multi-versioned storage in SLOG, there is *no need* for read-only transactions to acquire read-locks. They just get inserted into the local log at each region that houses data relevant to that read-only transaction using the normal process described in the paper. If multiple regions house relevant data, the transaction also has to go through the multi-home transaction ordering process described in the paper. These two steps give the reads within the read-only transaction a lineariazble ordering relative to all writes at each region, and the read-only transaction can occur in the background, reading the correct version at each region based on its order in the local log at that region.
In Figure 4, what prevents the following ordering:Delete
Region 0 <-- T2.0 T1
Region 1 <-- T3 T2.1
Here's the sequence:
1. Multi-home transaction #2 calls InsertIntoLocalLog on Region 0. Then it pauses.
2. Single-home transaction #1 calls InsertIntoLocalLog on Region 0.
3. Single-home transaction #3 calls InsertIntoLocalLog on Region 1.
4. Multi-home transaction #2 unpauses, and calls InsertIntoLocalLog on Region 1.
Assume T2 reads the same locations that T1 and T3 write, and that T3 starts after T1 commits. But then the above ordering would mean that T2 reads the value written by T3, but not the value written by T1. But that would violate strict serializability, correct? I couldn't find anything in Figure 2 or the prose that would appear to prevent this.
After re-reading various parts of the paper, I believe the scenario I gave above is prevented by the T2 LockOnlyTxn applied to Region 0. It will lock the value it's reading, such that T1 cannot be executed until T2.1 from Region 1 has been processed (which would trigger commit of T2 and release of its locks). Meanwhile, the T1 client cannot be ack'd until T1 is *executed* (not enough to just be inserted into a global log). Therefore, T3 cannot start after T1 commits. T2 will "see" T3 but not T1, but that's a legal ordering when T3 and T1 are concurrent.Delete
What threw me off track was an assumption that the T1 client could be ack'd once it had been safely committed to the global log, as in Calvin. Also, there's some important locking-related pseudo-code in Figure 6 that I didn't look into carefully before because I thought Figure 6 was just about dynamic remastering.
It's funny --- you are the third person who initially believed there can be a causal reverse in SLOG until thinking about it more deeply. When three smart people have this initial reaction, that makes me regret not putting that example explicitly in the paper. Sorry about that.Delete
Everything you said after rereading the paper is correct: T1 cannot complete if it is after T2.0 in the local log of region 1 and they have an overlapping access set. Therefore T3 does not start until after T2 completes (if it starts after T1 completes). Therefore it cannot be ahead of T2.1 in region 1's log.
Your point that this is an important difference from Calvin is 100% correct, and worth reiterating. There are no system aborts in Calvin. Therefore, if there is no possibility of a logical abort, Calvin can commit a transaction before starting to process it (as soon as it reaches Calvin's input log). In contrast, SLOG must wait until all locks are acquired before it can commit (if there is no possibility of a logical abort).
First, thank you for engaging with us on your work. Second, thanks for describing performance in terms of physical operations rather than just naming the algorithm. That makes it easier for non experts like me to follow along. Finally, and repeating a question I had on Twitter, the paper states ""The only reason to synchronously replicate to another region is for availability --- in case an entire region fails."ReplyDelete
Writes not replicated from a failed region are lost if you don't wait for it to return. I don't think this is just about availability. But maybe I misunderstand what you write because elsewhere the paper mentions the benefit of nearby regions.
Tobias Grieger's response is correct. Durability is handled with replication within a region (though admittedly if a meteor destroys the whole region, then data not replicated yet to other regions will be lost). But if a meteor destroys a whole region, a few milliseconds of data loss is probably not going to make the news ...Delete
I am asking about transient loss of a region. Assume the region is down (power, network, maintenance oops), you know it will return but you don't know how long that will take. If you aren't willing to wait then unreplicated writes are lost unless you can reconcile them after recovery. I expect reconciling them to either not be possible or to take too long meaning writes will be lost.Delete
In this example replication within region gives durability but also has lost writes. I think this is a reasonable tradeoff for some workloads.
I assume my scenario is more common than a meteor strike but web-scale datacenter operators are not sharing much information about this.
I would classify that example as an availability problem. While the region is down, its data is unavailable. If the sysadmin decides that they can't wait the length of the availability outage and make data available anyway, that will indeed cause consistency problems. But the root cause is availability.Delete
I'm not trying to argue that data shouldn't be replicated to nearby regions. I'm just saying that availability is the only reason to do it if all reads are served from the location of the recent write.
I have difficulties reconciling between these two: "and without giving up availability" and "availability is the only reason to do". It seems to me that availability is indeed given up when the region in temporarily down.Delete
SLOG synchronously replicates data to nearby regions. So no availability is lost when a single region goes down. But since it doesn't run Paxos across all regions, it suffers a little more availability loss than Paxos-based systems in the event of a network partition. Hence the full quote to grab from the post is: "without giving up availability (aside from the negligible availability difference relative to Paxos-based systems from not being as tolerant to network partitions)"Delete
@Mark the way I understand it, each write is replicated within the region, which handles durability (a region failure wouldn't erase >= 3 hard drives), but while that region is unavailable, committed writes won't be accessible from outside the region. Committing in "HA" mode ensures that another region has the log up to and including the write, so even in the immediate event of the home region cutting out, availability (for that write) should not be compromised.ReplyDelete
Just curious. How does this work for Blockchain/DLT nodes?ReplyDelete
Not something we've looked into ...Delete
I like to dissect this problem differently: If you separate out durability as a concern that's independent of consistency and serializability, the distributed consensus algorithms only achieve durability and not the other two. Here's an example:ReplyDelete
Step 1: Leader sends a proposal to participants.
Step 2: Participants accept the tentative proposal.
Step 3: Leader receives the necessary number of acks and decides that proposal is final.
Step 4: Leader tells the participants that the proposal is final.
If a network partition happens between step 3 and step 4, then anyone reading from a participant is going to get stale data. This is no better than best effort asynchronous replication.
In other words consistency is only achieved from a system's ability to replicate to other parts, and the readers being able to know the recency of the data they're reading from. This makes it orthogonal to whether a system is using a consensus protocol or not.
Many systems conflate durability with consistency, which makes all these things confusing.
PS: Looking forward to meeting you in person at HPTS :)
What about the cluster at every node must receive all the local log from other regions, this will increase the intra cluster trafficReplyDelete
The local log only contains input transactions (as opposed to a traditional full recovery log). Therefore SLOG actually *reduces* intra-cluster traffic.Delete
I am reading the slog paper, I hope it's okay to ask a question about it here. One thing I did not understand is this section:"Data replication across regions is not strictly necessary in SLOGReplyDelete
since the master region for a data item must oversee all writes and
linearizable reads to it. However, by continuously replaying local
logs from other regions, the system is able to support local snapshot
reads of data mastered at other regions at any desired point in the
What is the reason that replication is not strictly necessary for SLOG ?
considering figure 4's setup in the paper for example, Is the answer just that that if granule A in region 0 doesn't use the granule B replica in region 0 for its transaction, then it will have to use granule B from region 1(its home) instead? Which then increases latency of that transaction since the communication is suddenly cross region, but other than that it will work.
Sorry for the delay in response, but yes, what you said is correct.Delete