Comments on DBMS Musings: "Introducing SLOG: Cheating the low-latency vs. strict serializability tradeoff" (Daniel Abadi; newest first)

Daniel Abadi (2022-10-19):
Sorry for the delay in response, but yes, what you said is correct.

Matti Nielsen (2022-10-08):
I am reading the SLOG paper; I hope it's okay to ask a question about it here. One thing I did not understand is this section: "Data replication across regions is not strictly necessary in SLOG since the master region for a data item must oversee all writes and linearizable reads to it. However, by continuously replaying local logs from other regions, the system is able to support local snapshot reads of data mastered at other regions at any desired point in the versioned history." What is the reason that replication is not strictly necessary for SLOG? Considering Figure 4's setup in the paper, for example: is the answer just that if granule A in region 0 doesn't use the granule B replica in region 0 for its transaction, then it will have to use granule B from region 1 (its home) instead? That increases the latency of the transaction, since the communication is suddenly cross-region, but otherwise it will work.

Daniel Abadi (2020-04-18):
The local log only contains input transactions (as opposed to a traditional full recovery log). Therefore SLOG actually *reduces* intra-cluster traffic.

lambdacat (2020-04-16):
What about the fact that the cluster in every region must receive all the local logs from the other regions? This will increase intra-cluster traffic.

Daniel Abadi (2020-01-29):
SLOG synchronously replicates data to nearby regions, so no availability is lost when a single region goes down. But since it doesn't run Paxos across all regions, it suffers slightly more availability loss than Paxos-based systems in the event of a network partition. Hence the full quote to grab from the post is: "without giving up availability (aside from the negligible availability difference relative to Paxos-based systems from not being as tolerant to network partitions)".

Unknown (2020-01-29):
I have difficulty reconciling these two statements: "and without giving up availability" and "availability is the only reason to do it". It seems to me that availability is indeed given up when the region is temporarily down.

Sugu Sougoumarne (2019-11-01):
I like to dissect this problem differently: if you separate out durability as a concern that's independent of consistency and serializability, the distributed consensus algorithms achieve only durability, not the other two. Here's an example:

Step 1: The leader sends a proposal to the participants.
Step 2: The participants accept the tentative proposal.
Step 3: The leader receives the necessary number of acks and decides that the proposal is final.
Step 4: The leader tells the participants that the proposal is final.

If a network partition happens between step 3 and step 4, then anyone reading from a participant is going to get stale data. This is no better than best-effort asynchronous replication.

In other words, consistency is achieved only through a system's ability to replicate to its other parts, and through readers being able to know the recency of the data they're reading. This makes it orthogonal to whether a system uses a consensus protocol or not.

Many systems conflate durability with consistency, which makes all of these things confusing.

PS: Looking forward to meeting you in person at HPTS :)

Daniel Abadi (2019-10-13):
Not something we've looked into ...

Big Data Guru (2019-10-10):
Just curious: how does this work for Blockchain/DLT nodes?

Daniel Abadi (2019-10-10):
I would classify that example as an availability problem. While the region is down, its data is unavailable. If the sysadmin decides that they can't wait out the length of the availability outage and makes the data available anyway, that will indeed cause consistency problems. But the root cause is availability.

I'm not trying to argue that data shouldn't be replicated to nearby regions. I'm just saying that availability is the only reason to do it if all reads are served from the location of the most recent write.

Mark Callaghan (2019-10-10):
I am asking about transient loss of a region. Assume the region is down (power, network, maintenance oops); you know it will return, but you don't know how long that will take. If you aren't willing to wait, then unreplicated writes are lost unless you can reconcile them after recovery. I expect reconciling them either to not be possible or to take too long, meaning writes will be lost.

In this example, replication within a region gives durability but still loses writes. I think this is a reasonable tradeoff for some workloads.

I assume my scenario is more common than a meteor strike, but web-scale datacenter operators are not sharing much information about this.

Daniel Abadi (2019-10-09):
It's funny --- you are the third person who initially believed there can be a causal reverse in SLOG until thinking about it more deeply. When three smart people have this initial reaction, it makes me regret not putting that example explicitly in the paper. Sorry about that.

Everything you said after rereading the paper is correct: T1 cannot complete if it is after T2.0 in the local log of region 0 and they have an overlapping access set. Therefore T3 does not start until after T2 completes (if it starts after T1 completes). Therefore it cannot be ahead of T2.1 in region 1's log.

Your point that this is an important difference from Calvin is 100% correct, and worth reiterating. There are no system aborts in Calvin. Therefore, if there is no possibility of a logical abort, Calvin can commit a transaction before starting to process it (as soon as it reaches Calvin's input log). In contrast, SLOG must wait until all locks are acquired before it can commit (if there is no possibility of a logical abort).

Daniel Abadi (2019-10-09):
Tobias Grieger's response is correct. Durability is handled with replication within a region (though admittedly, if a meteor destroys the whole region, then data not yet replicated to other regions will be lost). But if a meteor destroys a whole region, a few milliseconds of data loss is probably not going to make the news ...

AndyAndRachel (2019-10-09):
After re-reading various parts of the paper, I believe the scenario I gave above is prevented by the T2 LockOnlyTxn applied to Region 0. It will lock the value it's reading, such that T1 cannot be executed until T2.1 from Region 1 has been processed (which would trigger the commit of T2 and the release of its locks). Meanwhile, the T1 client cannot be ack'd until T1 is *executed* (it's not enough for it just to be inserted into a global log). Therefore, T3 cannot start after T1 commits. T2 will "see" T3 but not T1, but that's a legal ordering when T3 and T1 are concurrent.

What threw me off track was the assumption that the T1 client could be ack'd once it had been safely committed to the global log, as in Calvin. Also, there's some important locking-related pseudo-code in Figure 6 that I didn't look at carefully before, because I thought Figure 6 was just about dynamic remastering.

Tobias Grieger (2019-10-09):
@Mark: the way I understand it, each write is replicated within the region, which handles durability (a region failure wouldn't erase >= 3 hard drives), but while that region is unavailable, committed writes won't be accessible from outside the region. Committing in "HA" mode ensures that another region has the log up to and including the write, so even in the immediate event of the home region cutting out, availability (for that write) should not be compromised.

Mark Callaghan (2019-10-08):
First, thank you for engaging with us on your work. Second, thanks for describing performance in terms of physical operations rather than just naming the algorithm; that makes it easier for non-experts like me to follow along. Finally, repeating a question I had on Twitter: the paper states, "The only reason to synchronously replicate to another region is for availability --- in case an entire region fails."

Writes not replicated from a failed region are lost if you don't wait for it to return. I don't think this is just about availability. But maybe I misunderstand what you write, because elsewhere the paper mentions the benefit of nearby regions.

AndyAndRachel (2019-10-07):
In Figure 4, what prevents the following ordering:

Region 0 <-- T2.0 T1
Region 1 <-- T3 T2.1

Here's the sequence:
1. Multi-home transaction #2 calls InsertIntoLocalLog on Region 0. Then it pauses.
2. Single-home transaction #1 calls InsertIntoLocalLog on Region 0.
3. Single-home transaction #3 calls InsertIntoLocalLog on Region 1.
4. Multi-home transaction #2 unpauses, and calls InsertIntoLocalLog on Region 1.

Assume T2 reads the same locations that T1 and T3 write, and that T3 starts after T1 commits. The above ordering would then mean that T2 reads the value written by T3, but not the value written by T1. But that would violate strict serializability, correct? I couldn't find anything in Figure 2 or the prose that would appear to prevent this.

Daniel Abadi (2019-10-07):
Thanks for reading the paper (and for spending enough time reading it that you understood it correctly).

Yes, all reads have to go through the front door. If you start trying to read state without using SLOG's interface, you lose out on SLOG's guarantees. (This is true of many systems.)

This implies that the BACKUP has to go through the front door. If you implement it as a regular read-only query in SLOG, it will result in a giant multi-home (read-only) transaction. The paper states that SLOG acquires read locks for all reads, and read locks block writes, so transactions that write data and are appended to the local log at each region behind the BACKUP transaction would see a latency blip until the read locks are released. I believe this is the premise of your question.

However, you are indeed overlooking a better solution, though it is really my fault for not stating it explicitly in the paper. If you use multi-versioned storage in SLOG, there is *no need* for read-only transactions to acquire read locks. They just get inserted into the local log at each region that houses data relevant to that read-only transaction, using the normal process described in the paper. If multiple regions house relevant data, the transaction also has to go through the multi-home transaction ordering process described in the paper. These two steps give the reads within the read-only transaction a linearizable ordering relative to all writes at each region, and the read-only transaction can occur in the background, reading the correct version at each region based on its order in the local log at that region.

AndyAndRachel (2019-10-07):
Interesting paper. I have a question regarding strict serializability. The global logs you show in Figure 3 (or Figure 4) exhibit causal reverses (and aren't exact replicas of one another). It would be a violation of linearizability *if* a regional global log could be read at an arbitrary point. But they can't, so the system is considered linearizable. In other words, because the anomalies can't be observed through the regular transactional interface to the system, they don't "count". Is my understanding correct?

This leads to the next question: how would one build a "hot" BACKUP facility in such a database (i.e., one where BACKUP happens while the system is running transactions)? Presumably, you'd want a consistent snapshot of the global state. But it would seem that no matter which point in time you pick, there's the possibility that your RESTORED database would allow observation of a causal reverse. For example, say two writers are racing, one continually writing values to region A and one continually writing values to region B. Without locking (of one sort or another), the BACKUP process cannot be guaranteed to make a consistent snapshot. But locking the entire database to perform a BACKUP would seem impractical in a production system. Is there some reasonable solution that guarantees consistency that I'm overlooking?

Daniel Abadi (2019-10-07):
Yes, except that PNUTS didn't support transactions. SLOG not only supports transactions, but even "multi-home" transactions at high throughput and low latency. As mentioned above, this is the main breakthrough of SLOG: take advantage of locality while still supporting strictly serializable transactions.

Sam (2019-10-07):
Very much like Yahoo's user database implemented in the 90s. I believe the PNUTS paper also talks about record-level mastering that gives a similar result. https://pdfs.semanticscholar.org/876b/e80390bcaffb9b910ed05680b2e81a37d64d.pdf
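
Sugu Sougoumarne's four consensus steps above can be sketched in a few lines. This is a minimal, hypothetical simulation (all class and method names are invented for illustration, not taken from any real system): a proposal becomes durable at step 3 once a majority acks it, but a participant that never receives the step-4 "finalize" message can still serve stale data to local readers.

```python
# Hypothetical sketch of the consensus-steps example from the thread:
# durability is decided at step 3, but a partition before step 4 leaves
# some participants serving stale reads.

class Participant:
    def __init__(self):
        self.committed = "v0"   # last value known to be final
        self.tentative = None   # accepted-but-not-yet-final proposal

    def accept(self, value):    # step 2: accept the tentative proposal
        self.tentative = value
        return True             # ack sent back to the leader

    def finalize(self):         # step 4: leader announces the decision
        self.committed = self.tentative

    def read(self):             # a local read sees only finalized data
        return self.committed


participants = [Participant() for _ in range(3)]

# Steps 1-2: the leader proposes "v1"; all participants accept it.
acks = sum(p.accept("v1") for p in participants)

# Step 3: with a majority of acks, the leader decides "v1" is final (durable).
assert acks >= 2

# Network partition before step 4: only one participant learns the decision.
participants[0].finalize()
# participants[1] and participants[2] never receive the finalize message.

print(participants[0].read())  # "v1"
print(participants[1].read())  # "v0" -- stale, despite consensus succeeding
```

This mirrors his conclusion: the round made "v1" durable, yet a reader of participant 1 observes stale state, so consistency has to come from something beyond the consensus protocol itself.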
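
The lock-free read-only transactions Daniel describes above rest on multi-versioned storage indexed by local-log position. The following is a rough sketch under assumed names (the `MultiVersionedStore` class and its methods are illustrative, not from the SLOG codebase): each write is tagged with its log position, so a read-only transaction inserted at position P simply reads the newest version at or before P, with no read locks.

```python
# Hypothetical sketch: multi-versioned storage keyed by local-log position,
# letting a read-only transaction at log position P read consistently
# without blocking concurrent writers.
import bisect

class MultiVersionedStore:
    def __init__(self):
        self.versions = {}  # key -> list of (log_position, value), in log order

    def write(self, key, value, log_pos):
        # Writes arrive in local-log order, so appending keeps the list sorted.
        self.versions.setdefault(key, []).append((log_pos, value))

    def read(self, key, log_pos):
        """Return the newest value written at or before log_pos."""
        history = self.versions.get(key, [])
        i = bisect.bisect_right([p for p, _ in history], log_pos)
        return history[i - 1][1] if i > 0 else None


store = MultiVersionedStore()
store.write("A", "a1", log_pos=1)
store.write("A", "a2", log_pos=5)

# A read-only transaction inserted into the local log at position 3 reads
# the version of A as of position 3, even though position 5 already exists.
print(store.read("A", log_pos=3))  # a1
print(store.read("A", log_pos=7))  # a2
```

The point of the design is visible here: the reader's position in the local log pins its snapshot, so it can run in the background while later writes (position 5 and beyond) proceed unblocked.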