Friday, April 23, 2010

Problems with CAP, and Yahoo’s little known NoSQL system

Over the past few weeks, in my advanced database system implementation class I teach at Yale, I’ve been covering the CAP theorem, its implications, and various scalable NoSQL systems that would appear to be influenced in their design by the constraints of CAP. Over the course of my coverage of this topic, I am convinced that CAP falls far short of giving a complete picture of the engineering tradeoffs behind building scalable, distributed systems.

My problems with CAP

CAP is generally described as the following: when you build a distributed system, of three desirable properties you want in your system: consistency, availability, and tolerance of network partitions, you can only choose two.

Already there is a problem, since this implies that there are three types of distributed systems one can build: CA (consistent and available, but not tolerant of partitions), CP (consistent and tolerant of network partitions, but not available), and AP (available and tolerant of network partitions, but not consistent). The definition of CP looks a little strange --- “consistent and tolerant of network partitions, but not available” --- the way that this is written makes it look like such as system is never available --- a clearly useless system. Of course, this is not really the case; rather, availability is only sacrificed when there is a network partition. In practice, this means that the roles of the A and C in CAP are asymmetric. Systems that sacrifice consistency (AP systems) tend to do so all the time, not just when there is a network partition (the reason for this will become clear by the end of this post). The potential confusion caused by the asymmetry of A and C is my first problem.

My second problem is that, as far as I can tell, there is no practical difference between CA systems and CP systems. As noted above, CP systems give up availability only when there is a network partition. CA systems are “not tolerant of network partitions”. But what if there is a network partition? What does “not tolerant” mean? In practice, it means that they lose availability if there is a partition. Hence CP and CA are essentially identical. So in reality, there are only two types of systems: CP/CA and AP. I.e., if there is a partition, does the system give up availability or consistency? Having three letters in CAP and saying you can pick any two does nothing but confuse this point.

But my main problem with CAP is that it focuses everyone on a consistency/availability tradeoff, resulting in a perception that the reason why NoSQL systems give up consistency is to get availability. But this is far from the case. A good example of this is Yahoo’s little known NoSQL system called PNUTS (in the academic community) or Sherpa (to everyone else).

(Note, readers from the academic community might wonder why I’m calling PNUTS “little known”. It turns out, however, that outside the academic community, PNUTS/Sherpa is almost never mentioned in the NoSQL discussion --- in fact, as of April 2010, it’s not even categorized in the list of 35+ NoSQL systems at the nosql-database.org Website).

PNUTS and CAP

If you examine PNUTS through the lens of CAP, it would seem that the designers have no idea what they are doing (I assure you this is not the case). Rather than giving up just one of consistency or availability, the system gives up both! It relaxes consistency by only guaranteeing “timeline consistency” where replicas may not be consistent with each other but updates are guaranteed to be applied in the same order at all replicas. However, they also give up availability --- if the master replica for a particular data item is unreachable, that item becomes unavailable for updates (note, there are other configurations of the system with availability guarantees similar to Dynamo/Cassandra, I’m focusing in this post on the default system described in the original PNUTS paper). Why would anyone want to give up both consistency and availability? CAP says you only have to give up just one!

The reason is that CAP is missing a very important letter: L. PNUTS gives up consistency not for the goal of improving availability. Instead, it is to lower latency. Keeping replicas consistent over a wide area network requires at least one message to be sent over the WAN in the critical path to perform the write (some think that 2PC is necessary, but my student Alex Thomson has some research showing that this is not the case --- more on this in a future post). Unfortunately, a message over a WAN significantly increases the latency of a transaction (on the order of hundreds of milliseconds), a cost too large for many Web applications that businesses like Amazon and Yahoo need to implement. Consequently, in order to reduce latency, replication must be performed asynchronously. This reduces consistency (by definition). In Yahoo’s case, their method of reducing consistency (timeline consistency) enables an application developer to rely on some guarantees when reasoning about how this consistency is reduced. But consistency is nonetheless reduced.

Conclusion: Replace CAP with PACELC

In thinking about CAP the past few weeks, I feel that it has become overrated as a tool for explaining the design of modern scalable, distributed systems. Not only is the asymmetry of the contributions of C, A, and P confusing, but the lack of latency considerations in CAP significantly reduces its utility.

To me, CAP should really be PACELC --- if there is a partition (P) how does the system tradeoff between availability and consistency (A and C); else (E) when the system is running as normal in the absence of partitions, how does the system tradeoff between latency (L) and consistency (C)?

Systems that tend to give up consistency for availability when there is a partition also tend to give up consistency for latency when there is no partition. This is the source of the asymmetry of the C and A in CAP. However, this confusion is not present in PACELC.

For example, Amazon’s Dynamo (and related systems like Cassandra and SimpleDB) are PA/EL in PACELC --- upon a partition, they give up consistency for availability; and under normal operation they give up consistency for lower latency. Giving up C in both parts of PACELC makes the design simpler --- once the application is configured to be able to handle inconsistencies, it makes sense to give up consistency for both availability and lower latency.

Fully ACID systems are PC/EC in PACELC. They refuse to give up consistency, and will pay the availability and latency costs to achieve it.

However, there are some interesting counterexamples where the C’s of PACELC are not correlated. One such example is PNUTS, which is PC/EL in PACELC. In normal operation they give up consistency for latency; however, upon a partition they don’t give up any additional consistency (rather they give up availability).

In conclusion, rewriting CAP as PACELC removes some confusing asymmetry in CAP, and, in my opinion, comes closer to explaining the design of NoSQL systems.


(A quick plug to conclude this post: the PNUTS guys are presenting a new benchmark for cloud data serving which compares PNUTS vs. other NoSQL systems at the first annual ACM Symposium on Cloud Computing 2010 (ACM SOCC 2010) in Indianapolis on June 10th and 11th. SOCC 2010 is held in conjunction with SIGMOD 2010 and the recently released program looks amazing.)