Monday, October 29, 2012

IEEE Computer issue on the CAP Theorem

Due to Hurricane Sandy, Yale gave me a day off from teaching today and I have finally been able to get to a few things on my "to-do" list. One of them is to write a blog post about the IEEE Computer CAP Retrospective edition and make my paper that appeared inside of it publicly available.

Earlier this year, the IEEE Computer magazine came out with an issue largely devoted to a 12-year retrospective of the CAP theorem and contains several articles from distributed systems researchers that contribute various opinions and thoughts about CAP. The first article is from Eric Brewer, who coined the CAP theorem 12 years ago (though he points out in his article that it was actually 14 years ago). A PDF of Brewer’s article is available for free from: http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed. The second article is from Seth Gilbert and Nancy Lynch (the same Gilbert and Lynch that proved the CAP theorem 10 years ago). 

The third article is from me, and contains my criticisms of CAP that long-time readers of my blog will be familiar with. In particular, I point out that many people assume that modern NoSQL systems relax consistency guarantees in order to gain availability due to the constraints of the CAP theorem, when the reality is that these systems give up on consistency even in the absence of network partitions, which is not required according to the CAP theorem. The  reason why they give up on consistency is because of a desire to improve system latency, an increasingly important requirement in the modern impatient world. I then describe the latency-consistency tradeoff in more detail, and end the article with the PACELC reformulation of CAP that debuted on my blog over two years ago. With the permission of the IEEE, I am making a free version of this article available today. This article is the first time that the PACELC formulation and my thoughts on CAP appear in a scholarly article, which gives people a venue to refer to (bibtex code available here) when citing this work (you can stop citing a blog post!)

The fourth article is from Raghu Ramakrishnan, entitled “CAP and Cloud Data Management” and describes the PNUTS system that I have mentioned in the past as a good example of a system for which the consistency-latency tradeoff has had a more direct impact on the system design than the consistency-availability tradeoff of CAP. The fifth article is from Ken Birman, Daniel Freedman, Qi Huang, and Patrick Dowell of Cornell University on overcoming CAP with soft-state replication. Unfortunately, I cannot find a free link to Raghu’s article, but if you have an IEEE account, you can access it at at: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6122007&tag=1. The Birman et. al. article can be found for free at: http://www.cs.cornell.edu/Projects/mrc/CAP.pdf.

If you have enjoyed my thoughts on CAP on this blog, I highly recommend you read each of these five articles. The Brewer article in particular acknowledges my past criticism of CAP not actually being about picking two of three out of C (consistency), A (availability), and P (partition tolerance) due to the fact that it does not make sense to reason about a system that is ‘CA’. (If there is no partition, any system can be both consistent and available --- the only question is what happens when there is a partition --- does consistency or availability get sacrificed?) Brewer uses this observation to lead into a nice generalization of consistency-availability tradeoff. In particular, when a partition occurs, the system does three things: (1) detect that the partition occurred, (2) enter a partition mode that may or may not limit some operations, and (3) initiate some sort of reconciliation algorithm when the partition is fixed. Depending on how these three things are implemented, it is  possible to obtain much of the spectrum between CP systems and AP systems. The article also contains a nice reference to the CRDT work by Shapiro et. al. at INRIA. Overall, I strongly support Brewer’s approach to navigating this tradeoff. It also fits nicely with Mehul Shah’s talk at HPTS in the way that the spectrum between consistency and availability is explicitly considered at system design time, rather than trying to bolt consistency on top of an AP (eventually consistent) system after the fact (a wildly suboptimal endeavor).

While most of Brewer’s article focused on the consistency-availability tradeoff, Brewer also briefly acknowledges that “in its classic interpretation, the CAP theorem ignores latency”, and that some systems reduce consistency for latency (he even refers to the PNUTS example I used in my original blog post). I remain convinced that PACELC is the best way to reason about both of these tradeoffs in a single formulation: if there is a partition (P) how does the system tradeoff between availability and consistency (A and C); else (E) when the system is running as normal in the absence of partitions, how does the system tradeoff between latency (L) and consistency (C)?