Friday, January 25, 2019

It’s Time to Move on from Two Phase Commit

The two-phase commit protocol (2PC) has been used in enterprise software systems for over three decades. It has been an incredibly impactful protocol for ensuring atomicity and durability of transactions that access data in multiple partitions or shards. It is used everywhere --- both in older “venerable” distributed systems, database systems, and file systems such as Oracle, IBM DB2, PostgreSQL, and Microsoft TxF (transactional NTFS), and in younger “millennial” systems such as MariaDB, TokuDB, VoltDB, Cloud Spanner, Apache Flink, Apache Kafka, and Azure SQL Database. If your system supports ACID transactions across shards/partitions/databases, there’s a high probability that it is running 2PC (or some variant thereof) under the covers. [Sometimes it’s even “over the covers” --- older versions of MongoDB required users to implement 2PC for multi-document transactions in application code.]

In this post, we will first describe 2PC: how it works and what problems it solves. Then, we will show some major issues with 2PC and how modern systems attempt to get around these issues. Unfortunately, these attempted solutions cause other problems to emerge. In the end, I will make the case that the next generation of distributed systems should avoid 2PC, and show how this is possible.

Overview of the 2PC protocol


There are many variants of 2PC, but the basic protocol works as follows:

Background assumption: The work entailed by a transaction has already been divided across all of the shards/partitions that store data accessed by that transaction. We will refer to the effort performed at each shard as being performed by the “worker” for that shard. Each worker is able to start working on its responsibilities for a given transaction independently of the others. The 2PC protocol begins at the end of transaction processing, when the transaction is ready to “commit”. It is initiated by a single coordinator machine (which may be one of the workers involved in that transaction).

The basic flow of the 2PC protocol is shown in the figure below. [The protocol begins at the top of the figure and then proceeds in a downward direction.]

[Figure: basic flow of the 2PC protocol.]

Phase 1: A coordinator asks each worker whether it has successfully completed its responsibilities for that transaction and is ready to commit. Each worker responds ‘yes’ or ‘no’.

Phase 2: The coordinator counts all the responses. If every worker responded ‘yes’, then the transaction will commit. Otherwise, it will abort. The coordinator sends a message to each worker with the final commit decision and receives an acknowledgement back.
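
To make the two phases concrete, here is a minimal sketch of the coordinator’s side of the protocol in Python. The worker handles and their messaging methods (send_prepare, recv_vote, and so on) are illustrative assumptions rather than any particular system’s API:

    # Minimal sketch of a 2PC coordinator (helper names are hypothetical).
    def run_two_phase_commit(coordinator_log, workers, txn_id):
        # Phase 1: ask every worker whether it is ready to commit.
        for w in workers:
            w.send_prepare(txn_id)
        votes = [w.recv_vote(txn_id) for w in workers]  # each vote is 'yes' or 'no'

        # A single 'no' vote forces an abort; otherwise the transaction commits.
        decision = 'commit' if all(v == 'yes' for v in votes) else 'abort'
        coordinator_log.durably_write((txn_id, decision))  # decision must survive a crash

        # Phase 2: disseminate the decision and collect acknowledgements.
        for w in workers:
            w.send_decision(txn_id, decision)
        for w in workers:
            w.recv_ack(txn_id)
        return decision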

This mechanism ensures the atomicity property of transactions: either the entire transaction will be reflected in the final state of the system, or none of it. If even just a single worker cannot commit, then the entire transaction will be aborted. In other words: each worker has “veto-power” for a transaction.

It also ensures transaction durability. Each worker ensures that all of the writes of a transaction have been durably written to storage prior to responding ‘yes’ in phase 1. This gives the coordinator freedom to make a final decision about a transaction without concern for the fact that a worker may fail after voting ‘yes’. [In this post, we are being purposefully vague when using the term “durable writes” --- this term can either refer to writing to local non-volatile storage or, alternatively, replicating the writes to enough locations for it to be considered “durable”.]

In addition to making the transaction’s own writes durable, the protocol itself requires additional writes that must be made durable before it can proceed. For example, a worker has veto power until the point it votes ‘yes’ in phase 1. After that point, it cannot change its vote. But what if it crashes right after voting ‘yes’? When it recovers it might not know that it voted ‘yes’, and still think it has veto power and go ahead and abort the transaction. To prevent this, it must write its vote durably before sending the ‘yes’ vote back to the coordinator. [In addition to this example, in standard 2PC, there are two other writes that are made durable prior to sending messages that are part of the protocol.]
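
On the worker side, the force-write of the vote described above might look roughly like the following sketch; the local_log and coordinator objects are hypothetical placeholders for the worker’s durable log and its channel back to the coordinator:

    # Minimal sketch of a 2PC worker handling the phase-1 prepare request.
    def handle_prepare(local_log, coordinator, txn_id, work_succeeded):
        if not work_succeeded:
            # Veto power is still available: vote 'no' and the transaction aborts.
            coordinator.send_vote(txn_id, 'no')
            return
        # Force-write the 'prepared' record (and the transaction's writes) to
        # durable storage *before* replying, so that a crash after this point
        # cannot cause the worker to forget that it gave up its veto power.
        local_log.durably_write((txn_id, 'prepared'))
        coordinator.send_vote(txn_id, 'yes')
        # From here on, the worker must wait for the coordinator's decision;
        # it can no longer unilaterally abort the transaction.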


The problems with 2PC


There are two major problems with 2PC. The first is well known, and discussed in every reputable textbook that presents 2PC. The second is much less well known, but a major problem nonetheless.

The well-known problem is referred to as the “blocking problem”. It arises when every worker has voted ‘yes’, but the coordinator fails before sending a message with the final decision to at least one worker. The reason this is a problem is that by voting ‘yes’, each worker has given up its power to veto the transaction, while the coordinator retains absolute power to decide the transaction’s final state. If the coordinator fails before communicating that decision to at least one worker, the workers cannot get together to make a decision amongst themselves --- they can’t abort because maybe the coordinator decided to commit before it failed, and they can’t commit because maybe the coordinator decided to abort before it failed. Thus, they have to block --- wait until the coordinator recovers --- in order to find out the final decision. In the meantime, they cannot process transactions that conflict with the stalled transaction, since the final outcome of that transaction’s writes is yet to be determined.
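
The bind a worker is in shows up most clearly in its recovery logic. Here is a minimal sketch, with hypothetical log and coordinator helpers, of what a recovering worker can safely do for each in-flight transaction:

    # Sketch of 2PC worker recovery (helper names are illustrative).
    def recover_transaction(local_log, coordinator, txn_id):
        state = local_log.lookup(txn_id)
        if state is None:
            return 'abort'              # never voted 'yes', so aborting is safe
        if state in ('committed', 'aborted'):
            return state                # the final decision is already known locally
        # state == 'prepared': the 'yes' vote was cast, so neither commit nor
        # abort is safe until the coordinator reveals its decision. If the
        # coordinator is down, this call blocks, and conflicting transactions
        # stall behind whatever locks this transaction still holds.
        return coordinator.wait_for_decision(txn_id)    # may block indefinitely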

There are two categories of work-arounds to the blocking problem. The first category modifies the core protocol in order to eliminate the blocking problem. Unfortunately, these modifications reduce performance --- typically by adding an extra round of communication --- and thus are rarely used in practice. The second category keeps the protocol intact but reduces the probability of the types of coordinator failure that can lead to the blocking problem --- for example, by running 2PC over replica consensus protocols and ensuring that important state for the protocol is replicated at all times. Unfortunately, once again, these work-arounds reduce performance, since the protocol requires that these replica consensus rounds occur sequentially, and thus they may add significant latency to the protocol.

The lesser-known problem is what I call the “cloggage problem”. 2PC occurs after a transaction is processed, and thus necessarily increases the latency of the transaction by an amount equal to the time it takes to run the protocol. This latency increase alone can already be an issue for many applications, but a potentially larger issue is that worker nodes do not know the final outcome of a transaction until mid-way through the second phase. Until they know the final outcome, they have to be prepared for the possibility that it might abort, and thus they typically prevent conflicting transactions from making progress until they are certain that the transaction will commit. These blocked transactions in turn block other transactions from running, and so on, until 2PC completes and all of the blocked transactions can resume. This cloggage further increases the average transaction latency and also decreases transactional throughput.

To summarize the problems we discussed above: 2PC poisons a system along four dimensions: latency (the time of the protocol plus the stall time of conflicting transactions), throughput (because it prevents conflicting transactions from running during the protocol), scalability (the larger the system, the more likely transactions become multi-partition and have to pay the throughput and latency costs of 2PC), and availability (the blocking problem we discussed above).  Nobody likes 2PC, but for decades, people have assumed that it is a necessary evil.

It’s time to move on


For over three decades, we’ve been stuck with two-phase commit in sharded systems. People are aware of the performance, scalability, and availability problems it introduces, but nonetheless continue on, with no obvious better alternative.

The truth is, if we would just architect our systems differently, the need for 2PC would vanish. There have been some attempts to accomplish this --- both in academia (such as this SIGMOD 2016 paper) and industry. However, these attempts typically work by avoiding multi-sharded transactions in the first place, such as by repartitioning data in advance of a transaction so that it is no longer multi-sharded. Unfortunately, this repartitioning reduces performance of the system in other ways.

What I am calling for is a deeper type of change in the way we architect distributed systems. I insist that systems should still be able to process multi-sharded transactions --- with all of the ACID guarantees that entails, such as atomicity and durability --- but with much simpler and faster commit protocols.

It all comes down to a fundamental assumption that has been present in our systems for decades: a transaction may abort at any time and for any reason. Even if I run the same transaction on the same initial system state … if I run it at 2:00PM it may commit, but at 3:00PM it may abort.

There are several reasons why most architects believe we need this assumption. First, a machine may fail at any time --- including in the middle of a transaction. Upon recovery, it is generally impossible to recreate all of the state of that transaction that was in volatile memory prior to the failure. As a result, it is seemingly impossible to pick up where the transaction left off prior to the failure. Therefore, the system aborts all transactions that were in progress at the time of the failure. Since a failure can occur at any time, this means that a transaction may abort at any time.

Second, most concurrency control protocols require the ability to abort a transaction at any time. Optimistic protocols perform a “validation” phase after processing a transaction. If validation fails, the transaction aborts.  Pessimistic protocols typically use locks to prevent concurrency anomalies. This use of locks may lead to deadlock, which is resolved by aborting (at least) one of the deadlocked transactions. Since deadlock may be discovered at any time, the transaction needs to retain the ability to abort at any time.
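
As a toy illustration of the pessimistic case, a conventional lock manager that finds a cycle in its wait-for graph has no choice but to pick a transaction on that cycle and abort it. The sketch below is a generic illustration, not any specific system’s deadlock detector:

    # Toy deadlock check over a wait-for graph: an edge a -> b means that
    # transaction a is waiting for a lock held by transaction b.
    def find_deadlock_victim(wait_for):
        for start in wait_for:
            seen, node = set(), start
            while node in wait_for:
                if node in seen:
                    return node         # node lies on a cycle: a deadlock victim
                seen.add(node)
                node = wait_for[node]
        return None                     # no deadlock detected

    # T1 waits for T2 and T2 waits for T1, so one of them must be aborted.
    print(find_deadlock_victim({'T1': 'T2', 'T2': 'T1'}))   # prints T1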

If you look carefully at the two-phase commit protocol, you will see that this arbitrary potential to abort a transaction is the primary source of complexity and latency in the protocol. Workers cannot easily tell each other whether they will commit or not, because they might still fail after this point (before the transaction is committed) and want to abort this transaction during recovery. Therefore, they have to wait until the end of transaction processing (when all important state is made durable) and proceed in the necessary two phases: in the first phase, each worker publicly relinquishes its control to abort a transaction, and only then can the second phase occur in which a final decision is made and disseminated.

In my opinion, we need to remove this arbitrary abort power from workers and architect systems in which the system does not have the freedom to abort a transaction whenever it wants during its execution. Only logic within a transaction should be allowed to cause a transaction to abort. If it is theoretically possible to commit a transaction given the current state of the database, that transaction must commit, no matter what types of failures occur. Furthermore, there must not be race conditions relative to other concurrently running transactions that can affect the final commit/abort state of a transaction.

Removing abort flexibility sounds hard. We’ll discuss soon how to accomplish this. But first let’s observe how the commit protocol changes if transactions don’t have abort flexibility.

What a commit protocol looks like when transactions can’t abort arbitrarily


Let’s look at two examples:

In the first example, assume that the worker for the shard that stores the value for variable X is assigned a single task for a transaction: change the value of X to 42. Assume (for now) that there are no integrity constraints or triggers defined on X (which might otherwise prevent the system from setting X to 42). In such a case, that worker is never given the power to abort the transaction. No matter what happens, that worker must change X to 42. If that worker fails, it must change X to 42 after it recovers. Since it never has any power to abort, there is no need to check with that worker during the commit protocol to see if it will commit.
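
In code, the first worker’s obligation is unconditional: it simply keeps applying the write, across transient failures and restarts, until it is durable. A minimal sketch, where storage.durable_put is an assumed durable-write primitive:

    import time

    # Sketch of a worker with no power to abort: it must eventually apply the
    # write no matter what system failures occur along the way.
    def apply_unconditional_write(storage, key, value, retry_delay=0.1):
        while True:
            try:
                storage.durable_put(key, value)   # e.g. key='X', value=42
                return                            # the write is durable; we are done
            except IOError:
                # Failures here are transient system events, not data-dependent
                # aborts, so the only correct response is to retry (after a
                # crash, recovery restarts this loop from the logged input).
                time.sleep(retry_delay)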

In the second example, assume that the worker for the shard that stores the value for variables Y and Z is assigned two tasks for a transaction: subtract 1 from the previous value of Y and set Z to the new value of Y. Furthermore, assume that there is an integrity constraint on Y that states that Y can never go below 0 (e.g., if it represents the inventory of an item in a retail application). Therefore, this worker has to run the equivalent of the following code:

          IF (Y > 0)
             Subtract 1 from Y
          ELSE
             ABORT the transaction
          Z = Y

This worker must be given the power to abort the transaction, since this is required by the logic of the application. However, this power is limited: only if the initial value of Y was 0 can this worker abort the transaction. Otherwise, it has no choice but to commit. Therefore, it doesn’t have to wait until it has completed the transaction code before knowing whether it will commit or not. On the contrary: as soon as it has finished executing the first line of code in the transaction, it already knows its final commit/abort decision. This implies that the commit protocol will be able to start much earlier relative to 2PC.

Let’s now combine these two examples into a single example in which a transaction is being performed by two workers --- one of them doing the work described in the first example, and the other doing the work described in the second example. Since we are guaranteeing atomicity, the first worker cannot simply blindly set X to 42. Rather, its own work must also be dependent on the value of Y. In effect, its transaction code becomes:

     temp = Do_Remote_Read(Y)
     if (temp > 0)
        X = 42

Note that if the first worker’s code is written in this way, the code for the other worker can be simplified to just:

     IF (Y > 0)
        Subtract 1 from Y
        Z = Y

By writing the transaction code in this way, we have removed explicit abort logic from both workers. Instead, both workers have if statements that check for the constraint that would have caused the original transaction to abort. If the original transaction would have aborted, both workers end up doing nothing. Otherwise, both workers change the values of their local state as required by the transaction logic.

The important thing to note at this point is that the need for a commit protocol has been totally eliminated in the above code. The system is not allowed to abort a transaction for any reason other than conditional logic defined by application code on a given state of the data. And all workers condition their writes on this same conditional logic so that they can all independently decide to “do nothing” in those situations where a transaction cannot complete as a result of current system state. Thus, all possibility of a transaction abort has been removed, and there is no need for any kind of distributed protocol at the end of transaction processing to make a combined final decision about the transaction. All of the problems of 2PC have been eliminated. There is no blocking problem because there is no coordinator. And there is no cloggage problem, because all necessary checks are overlapped with transaction processing instead of after it completes.
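
To make this concrete, here is a minimal single-process simulation of the two rewritten workers, with in-memory dictionaries standing in for the two shards and an ordinary function call standing in for Do_Remote_Read. In a real system the remote read would return the value of Y as of this transaction’s place in the serial order (via deterministic ordering or multi-versioning), so a concurrent transaction could not change the answer mid-flight:

    # Two shards, held in memory purely for illustration.
    shard_x = {'X': 7}
    shard_y = {'Y': 3, 'Z': 9}

    def do_remote_read_y():
        # Stand-in for Do_Remote_Read(Y); a real system would return the
        # value of Y as of this transaction's position in the serial order.
        return shard_y['Y']

    def worker_for_x():
        # Conditions its write on the same logic that could have aborted the
        # original transaction; it never votes and never waits for a decision.
        if do_remote_read_y() > 0:
            shard_x['X'] = 42

    def worker_for_y():
        # Evaluates the identical condition locally and independently.
        if shard_y['Y'] > 0:
            shard_y['Y'] -= 1
            shard_y['Z'] = shard_y['Y']

    # No end-of-transaction coordination, yet both workers reach the same
    # commit / "do nothing" outcome.
    worker_for_x()
    worker_for_y()
    print(shard_x, shard_y)   # {'X': 42} {'Y': 2, 'Z': 2}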

Moreover, as long as the system is not allowed to abort a transaction for any reason other than the conditional application logic based on input data state, it is always possible to rewrite any transaction as we did above in order to replace abort logic in the code with if statements that conditionally check the abort conditions. Furthermore, it is possible to accomplish this without actually rewriting application code. [The details of how to do this are out of scope for this post, but to summarize at a high level: shards can set special system-owned boolean flags when they have completed any conditional logic that could cause an abort, and it is these boolean flags that are remotely read from other shards.]
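
Since the bracketed note above only summarizes the flag idea, the following is a loose, hypothetical illustration of it rather than the actual mechanism: the shard that owns the abort-relevant conditional logic publishes a system-owned boolean as soon as that logic has run, and other shards condition their writes on a remote read of that flag.

    import threading

    # Hypothetical per-transaction flag published by the shard that owns the
    # abort-relevant conditional logic (here, the shard holding Y).
    class TxnFlag:
        def __init__(self):
            self._event = threading.Event()
            self._ok = False
        def publish(self, ok):
            self._ok = ok
            self._event.set()
        def remote_read(self):
            self._event.wait()          # other shards block until the flag is set
            return self._ok

    def shard_with_condition(state, flag):
        ok = state['Y'] > 0             # the only data-dependent abort condition
        flag.publish(ok)                # announced as soon as the condition runs,
        if ok:                          # well before the transaction finishes
            state['Y'] -= 1
            state['Z'] = state['Y']

    def shard_without_condition(state, flag):
        if flag.remote_read():          # condition the write on the remote flag
            state['X'] = 42

    flag = TxnFlag()
    y_state, x_state = {'Y': 3, 'Z': 9}, {'X': 7}
    t = threading.Thread(target=shard_without_condition, args=(x_state, flag))
    t.start()
    shard_with_condition(y_state, flag)
    t.join()
    print(x_state, y_state)             # {'X': 42} {'Y': 2, 'Z': 2}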

In essence: there are two types of aborts that are possible in transaction processing systems: (1) Those that are caused by the state of the data and (2) Those that are caused by the system itself (e.g. failures or deadlocks). Category (1) can always be written in terms of conditional logic on the data as we did above. So if you can eliminate category (2) aborts, the commit protocol can be eliminated.

So now, all we have to do is explain how to eliminate category (2) aborts.

Removing system-induced aborts


I have spent almost an entire decade designing systems that do not allow system-induced aborts. Examples of such systems are Calvin, CalvinFS, Orthrus, PVW, and a system that processes transactions lazily. The impetus for this feature came from the first of these projects --- Calvin --- because of its status as a deterministic database system. A deterministic database guarantees that there is only one possible final state of the data in the database given a defined set of input requests. It is therefore possible to send the same input to two distinct replicas of the system and be certain that the replicas will process this input independently and end up in the same final state, without any possibility of divergence.

System-induced aborts such as system failure or concurrency control race conditions are fundamentally nondeterministic events. It is very possible that one replica will fail or enter a race condition while the other replica will not. If these nondeterministic events were allowed to cause a transaction to abort, then one replica may abort a transaction while the other one would commit --- a fundamental violation of the deterministic guarantee. Therefore, we had to design Calvin in a way that failures and race conditions cannot cause a transaction to abort. For concurrency control, Calvin used pessimistic locking with a deadlock avoidance technique that ensured that the system would never get into a situation where it had to abort a transaction due to deadlock. In the face of a system failure, Calvin did not pick up a transaction exactly where it left off (because of the loss of volatile memory during the failure). Nonetheless, it was able to bring the processing of that transaction to completion without having to abort it. It accomplished this by restarting the transaction from the same original input.
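
A classic way to get the deadlock-avoidance half of this is to have every transaction acquire the locks it needs in one globally agreed-upon order, so that a cycle of waiters can never form (Calvin itself grants locks deterministically in the order transactions appear in its input log, which has the same effect). The sketch below shows the ordered-acquisition idea in general terms, not Calvin’s internals:

    import threading

    # Deadlock avoidance by ordered lock acquisition: every transaction grabs
    # the locks it needs in one global order (here, sorted by key), so no
    # cycle of waiting transactions can form and no transaction ever has to
    # be aborted just to break a deadlock.
    lock_table = {}
    table_guard = threading.Lock()

    def lock_for(key):
        with table_guard:
            return lock_table.setdefault(key, threading.Lock())

    def run_transaction(keys, logic):
        ordered = sorted(set(keys))          # the globally agreed-upon order
        locks = [lock_for(k) for k in ordered]
        for lk in locks:
            lk.acquire()
        try:
            return logic()                   # transaction body runs under all of its locks
        finally:
            for lk in reversed(locks):
                lk.release()

Crash recovery then amounts to re-running run_transaction with the same original input, which is safe precisely because the only remaining aborts are deterministic functions of the data.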

Neither of these solutions --- deadlock avoidance and transaction restart upon a failure --- is limited to deterministic database systems. [Transaction restart gets a little tricky in nondeterministic systems if some of the volatile state associated with a transaction that was lost during a failure was observed by other machines that did not fail. But there are simple ways to solve this problem that are out of scope for this post.] Indeed, some of the other systems I linked to above are nondeterministic systems. Once we realized the power that comes with removing system-level aborts, we built this feature into every system we built after the Calvin project --- even the nondeterministic ones.

Conclusion


I see very little benefit in system architects making continued use of 2PC in sharded systems moving forward. I believe that removing system-induced aborts and rewriting state-induced aborts is the better way forward. Deterministic database systems such as Calvin or FaunaDB always remove system-induced aborts anyway, and thus can usually avoid 2PC as we described above. But it is a huge waste to limit this benefit to only deterministic databases. It is not hard to remove system-induced aborts from nondeterministic systems. Recent projects have shown that it is even possible to remove system-induced aborts in systems that use concurrency control techniques other than pessimistic concurrency control. For example, both the PVW and the lazy transaction processing systems we linked to above use a variant of multi-versioned concurrency control. And FaunaDB uses a variant of optimistic concurrency control.

In my opinion there is very little excuse to continue with antiquated assumptions regarding the need for system-induced aborts. In the old days, when systems ran on single machines, such assumptions were justifiable. However, in modern times, when many systems need to scale to multiple machines that can fail independently of each other, these assumptions require expensive coordination and commit protocols such as 2PC. The performance problems of 2PC have been a major force behind the rise of non-ACID-compliant systems that give up important guarantees in order to achieve better scalability, availability, and performance. 2PC is just too slow --- it increases the latency of all transactions --- not just by the length of the protocol itself, but also by preventing transactions that access the same set of data from running concurrently. 2PC also limits scalability (by reducing concurrency) and availability (the blocking problem we discussed above). The way forward is clear: we need to reconsider antiquated assumptions when designing our systems and say “good-bye” to two-phase commit!


[This article includes a description of work that was funded by the NSF under grant IIS-1763797. Any opinions, findings, and conclusions or recommendations expressed in this article are those of Daniel Abadi and do not necessarily reflect the views of the National Science Foundation.]

34 comments:

  1. Please build something to address this

    1. FaunaDB is that something: https://fauna.com/

  2. 2PC is a theory formally, but your contract "cannot abort arbitrarily" is just pragmatism which cannot be analyzed and modeled formally.

  3. can't exactly see how removing the abortability from individual workers with the proposed example preserves ACID

    in the example above, the workers just retrying doesn't cut it as the whole thing then gets down to guaranteeing ordering of transactions (otherwise 'set 42' may come at the wrong time and mess up concurrent transactions the worker has executed already)

    smth is missing here

    1. I'm not sure I understand your question, but maybe your question is related to what PVSS asked below?

  4. What if the value of Y changed (let's say from 3 to 0 by another process within the second worker) between Do_Remote_Read(Y) and the execution of code in the second worker?

    1. A similar question was also asked on the hackernews thread --- https://news.ycombinator.com/item?id=19000842

      Let me know if that thread answers your question. If not, let's continue the discussion in this thread.

    2. This comment has been removed by the author.

    3. Hi,

      I went over ycombinator thread & still did not understand how the response solved the issue

      I am sure I am missing something, appreciate if you can explain

    4. Also, another question. What if Y value was reduced but X change failed in the first worker? Does not that leave the system in inconsistent state for anyone else who depend on a combination of Y & X value combination?

      Even here, I am certain that I am missing something. Appreciate your help

    5. In the example in my post, there is no state of the database that could exist that would force the update of X to fail. So only system-issues can potentially cause X to fail. The point I'm making in my post is that all such system issues are nondeterministic. Therefore, retrying the update of X until it succeeds will work, since eventually it will succeed. Therefore, I'm suggesting that we should be retrying transactions instead of giving up on them. This simple change enables getting rid of 2PC.

  5. To my readers: a link to this post was put up on hackernews, and there are 136 comments there (as of the time of me writing this). I responded to a bunch of questions there. However, I prefer if the discussion of this post proceeds here because I do not get notified of new comments on the hackernews thread.

    1. BTW: the link to the hackernews comment thread is: https://news.ycombinator.com/item?id=18999520

  6. Love your work!

    I think this is very similar to RAMP transaction. Both provide many nice attributes like non-blocking, concurrent transactions, etc. at the cost of having pretty significant write amplification for storing transaction metadata and multiple versions of each key.

    I think this proposal should be better compared to RAMP instead of 2pc. And in RAMP paper, it states that this doesn't solve all the problems. E.g. how would you do the traditional read-modify-write with this algorithm? Maybe I am missing something. Thanks for the writeup! Always a good read :)

    1. To answer my own question, I guess read-modify-write can be modeled as a condition on the write. So it should work. I guess this means the only advantage of 2pc is smaller write amplification.

  7. Hi Daniel,

    Thanks for this interesting post. I have a few questions:

    1) Does this model only work for one-shot transactions? How can a worker know ahead of time which other workers' "boolean flags" it will need to check if the transaction is conversational?
    2) Since no worker can abort voluntarily, a network partition between two workers can have severe impact - a participant will have to keep its resources locked for as long as the partition lasts, right?

    1. To answer your questions:

      (1) The assumption is that there is some process that is dividing the work across the workers. This process, when it sends the work to the workers, includes annotations on which other workers may have state-based aborts. If it doesn't know for sure about a particular worker, it must be conservative. [Obviously, this is just an optimization. Without this optimization, all workers would have to be checked.]

      (2) Yes. But two things to mention:

      (i) In most Calvin deployments, there was a full replica of the DB within each region, so it never had to worry about having different shards for the same xact located in different regions. [Network partitions are more likely across regions.] But in general, if you have different shards in different regions, a network partition would cause unavailability.

      (ii) Most systems that run 2PC are CP systems from CAP, in which case network partitions cause unavailability anyway.

  8. Hi Daniel,

    To intro myself, I'm the co-creator of the vitess.io project that helps scale MySQL databases through sharding. I've been trying to improve our story on distributed transactions. My blog is https://ssougou.blogspot.com (I should blog more often).

    I discovered your work recently, and have been a big fan. I've been slowly catching up :).

    The way I would interpret your post is as follows: All transactions are always in the prepared state. The prepared state gives you the following guarantees:
    1. A node will not commit or rollback unless specifically requested.
    2. Will not lose data. If a node crashes, it will resurrect itself back and all the transactions will be restored to their prepared state.

    If all transactions are always in this state, then the implicit decision is to commit (because all transactions are prepared already).

    I'm trying to see how this will map to a transactional system where DMLs are separate from the final commit request. I'm afraid that this is not applicable to such systems.

    The easy part, of course, is the commit.

    The part I'm struggling with is the flow where the app may want to rollback. Let's say the app issues DMLs to node A, then encounters an error half way through changes in node B. It would like to rollback everything so far. I don't see how this can be guaranteed. And if the app crashes at this time, there is no one to even know the decision to rollback.

    By the prepare rule, the node cannot make the decision. Also, no one else can because the app has crashed and lost the context.

    If this requires a long response, feel free to email me at sougou@planetscale.com.

    1. Transactions are presumed to definitely commit (eventually) under what's described here unless the state of the data prevents the transaction from committing. Application aborts that are based on data state work as described in this post. Application aborts that are not based on data state require a separate agreement protocol in order to override the presumed commit nature of what is described here.

  9. This is a great read, thank you Daniel.

    Just wanted to point out that Kafka's transactional messaging (https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging) did not employ the 2PC protocol, although admittedly its implementation function names may be confusing. The key idea is actually similar to your proposal here:

    1) We have a single "coordinator" module that maintains a transaction log which is the source-of-truth of the txn state, stored as an internal Kafka topic (think of it as the sequencer that orders all the txns to workers).

    2) Once the client / this coordinator module determines to commit or abort a txn, a prepare_abort / commit entry is appended to the txn log -- again, the `prepare` prefix here may be misleading, but it is actually the "final decision" of this txn, such that even if the coordinator fails after that, upon failing over the new coordinator will still enforce the txn to be completed as commit / abort. The entry itself is only to remind the (newly elected) coordinator that this txn decision may not yet have propagated to all the brokers.

    3) There are some additional requirements for Kafka such as per-partition ordering, such that consumers should read messages in offset ordering. And hence if the decision of a previous txn that appends a message at, e.g., offset 10 has not been propagated from the coordinator, then even if offset 11's messages are from another txn that is determined to be committed, the consumer still cannot fetch offset 11 before offset 10. But this is less related to your point in this blog post.

    1. Thanks for the clarification on Kafka's protocol.

      (Note that one difference between Kafka's protocol and the one described here is the lack of a need for a coordinator in the one described here.)

  10. Hi Daniel,
    Great post and discussion.
    I'm breaking my head on the solving of the cloggage problem though.

    In your example you say that "there is no cloggage problem, because all necessary checks are overlapped with transaction processing instead of after it completes."

    But as far as I understand, this is the case because of the introduction of `Do_Remote_Read(Y)` where we read the specific version of Y as of the current "transaction" using some ID. So `Do_Remote_Read(Y, transactionId)`.
    So in the case of 2 parallel transactions, both accessing the same object/shard which contains Y.

    A, owner of X:
    temp = Do_Remote_Read(Y)
    if (temp > 0)
    X = 42

    B, owner if Y:
    temp = Do_Remote_Read(X)
    IF (Y > 0 && temp > 0)
    Subtract 1 from Y

    C, owner of W:
    temp = Do_Remote_Read(Y)
    IF (temp > 0)
    Subtract 1 from W

    In this case A and B will block on `Do_Remote_Read(Y/X,1)` until they have a response.
    If in the meantime C does `Do_Remote_Read(Y, 2)`, B cannot give the read response to C until it has itself received its response from A. It has to wait on A to do the local `IF` check and update the value of Y. Only then can it answer C.
    So to me it seems that there is still a cloggage problem on B, since responses have to be delayed.

    Am I making a mistake somewhere?

    Best regards,
    Tim

    1. Hi Tim,

      The following papers have some more discussion on the cloggage problem of 2PC:
      http://vldb.org/pvldb/vldb2010/pvldb_vol3/R06.pdf

      and

      http://www.vldb.org/pvldb/vol7/p821-ren.pdf.

      Please note that the primary focus of both of those papers is different from the focus of this blog post (those papers focus on deterministic database systems, while this post proposes a more general framework for removing 2PC). However, the 2PC discussion in both of those papers is relevant. The bottom line is that most cloggage comes from the failure to overlap the commit protocol with xact execution. By overlapping the commit protocol with execution, as described in this blog post, the time in which conflicting transactions are not allowed to run is significantly reduced. But there is still a potential for transactions to stall under conflicting workloads. The above-cited papers show the big difference in available concurrency and transaction throughput when you compare traditional 2PC systems to systems that do not allow arbitrary transaction aborts.

  11. Hi Daniel,
    Can this way of transaction modelling be done entirely on the application level?

    I'm trying to understand it with a scenario.
    For example if I have to insert data into two tables on separate machines:
    Table1 is User on machine1 (with email as primary key)
    Table2 is UserMetadata on machine2.
    My txn is "create an entry for a user in Table1 and then create some metadata in Table2". It needs ACID guarantees, especially all-or-nothing.

    With your new idea, we would frame the txn like this:
    First ask machine holding Table2 to write following instruction to stable storage:
    "poll table1 till it has en entry for email=abc@xyz.com (issue select where statement) and upon getting a result, insert a metadata entry in my table2."

    Once machine2 has written this to stable storage, we can proceed with entering an entry into Table1 in the usual no-distributed simple ACID txn way.

    So if the first instruction write to machine2 fails, nothing wrong has happened yet.
    When it succeeds, that's okie since it'll be triggered sometime in future ONLY when table1 actually has an entry with the provided email.


    Am I understanding it right?

    So it's like defining rules and their triggering actions?

    Thanks.

    1. I'm not sure I understand your example since the write to table 1 seems to happen twice in your example --- both before and after the write to table 2.

      But to answer your question at a high level: most things that can be done inside the system can alternatively be done in the application. But it's usually much easier (and more likely to be correct) if the system takes care of ACID guarantees instead of the application.

  12. Isn't what you've described the same as having a strong leader that controls transaction state and replicates state to workers using a distributed consensus algorithm like Raft? This also removes the need for the 2nd phase and removes the ability to cancel a transaction on the worker side.

    1. Consensus algorithms and commit protocols are applied in different scenarios. You see many systems (e.g. Spanner) that use both (they run 2PC over Paxos). 2PC is designed for atomically committing a transaction in which many different nodes were involved in changing their local state from running transaction code. Having only one node controlling all state changes defeats the purpose of distributed transactional processing that 2PC was designed for.

  13. Hi,

    It is not clear to me how strict serializability is guaranteed for "dependent transactions". I appreciate it if you help me understand. Suppose we have this scenario:

    We have two transactions T1 and T2.
    T1 is a dependent transaction, as follows:
    T1{
    read (x)
    if (x == 1)
    update y
    else
    update z
    }

    T2 reads the value of y:
    T2 {
    read (y)
    }

    Now, suppose the client first submits T1 and then T2 in the next round. To guarantee strict serializability, T2 MUST see the effect of T1, as from the client’s point of view once the client successfully appends its transaction it is committed. Any future transaction that is appended to the log must see the effect of T1. However, that is not guaranteed, because T1 may be restarted and re-tried in the next round causing T2 not to see the effect of T1.

    For example:
    The value of x in the initial query is 2, thus, T1 decides to update z. But when executing T1 it sees its value has changed to 1, thus it will be retried in the next round with T2. T2 reads the old value of y, and then T1 updates it.

    I was thinking T1 and T2 should be considered concurrent as they are in the same round. However, that is not true, as from the client’s perspective T1 was submitted in the previous round; T1 has already returned to the client as committed. Thus, to guarantee strict serializability T2 must see the effect of T1.

    1. Strict serializability only guarantees ordering if the second transaction is submitted after the first one *commits*. A commit is defined as a DB commit. Therefore, I don't understand your statement: "as from the client’s point of view once the client successfully appends its transaction it is committed".

      If the DB hasn't told any client that the txn committed, then it is not committed as far as strict serializability ordering is concerned.

    2. Thanks for your response.

      You are right, strict serializability, more specifically the linearizability part of it, concerns about non-concurrent transactions, i.e., T1 and T2 when T2 starts after T1 is committed. In other systems, like Spanner, it is clear to me when a transaction should be considered committed by the client application-- when the transaction request returns to the client. In Calvin, on the other hand, it is not clear to me when a client can consider its transaction committed. I thought once a client successfully appends its transaction to the replicated log, it can consider it committed, as in Calvin there is no abort. All future transactions will be after earlier transactions in the log, so they will see their effect, except dependent transactions that may change their position in the log.

      So, I would appreciate it if you could answer this question: when can a client consider its transaction committed? Maybe I am missing some part, but does Calvin explicitly inform the client when its transaction is committed? If yes, does it wait for all storage nodes involved in the transaction to acknowledge it?

    3. A few things:

      (1) In our Calvin implementation, Calvin does not inform the client when the transaction has been successfully appended to its input log. The client interface is like any other DB --- the client submits transactions and eventually gets back notification that a transaction has been committed. There are no intermediate notifications between txn submission and txn response.

      (2) In some, but not all cases, Calvin responds back to the client with a 'commit' message as soon as it is appended to the input log.

      (3) Dependent transactions are an example of a type of transaction that *cannot* be committed at this early stage. Rather, as soon as the first replica checks the reconnaissance values from OLLP and determines they are correct, only then can the client be notified of the commit. This is still earlier than most systems --- this check happens at the beginning of transaction processing, whereas most systems have to wait until the commit protocol to commit. But still, this is not as early as non-dependent transactions which can usually commit as soon as they reach the log.

    4. I see. Just to rephrase what you said (please correct me if I am wrong):


      For non-dependent transactions, once we append the transaction to the log, we can return to the client and the client can consider it committed.

      For dependent transactions, we won't return to the client until we make sure the reconnaissance values are valid, before actually executing the transaction.

    5. There are some other classes of transactions besides dependent transactions that also cannot commit immediately when they hit the log (that are beyond the scope of this discussion), but aside from that, your summary is accurate.

      For dependent transactions, the commit is returned after reconnaissance values are confirmed accurate (in the middle of transaction execution).
