Sunday, October 8, 2017

Hazelcast and the Mythical PA/EC System

(Editor’s note: I was unaware that Kyle Kingsbury was doing a linearizability analysis of Hazelcast when I was writing this post. Kyle’s analysis resulted in Greg Luck, Hazelcast’s CEO, to write a blog post where he cited the PACELC theorem, and came to some of the same conclusions that I came to in writing this post. This post, however, was 98% written before both Kyle’s and Greg’s posts, but their posts got me to accelerate the completion of my analysis and publish it now.)

Seven years ago, I introduced the PACELC theorem as a mechanism to more clearly explain the consistency tradeoffs in building distributed systems. At that time, many people were familiar with the consistency vs. availability trade-off that was made well-known by the CAP theorem. However, it was common amongst people unfamiliar with the details of CAP theorem to believe that this tradeoff is always present in a distributed system. However, the truth is that the CAP consistency-availability tradeoff actually describes a very rare corner case. Only when there is an actual network partition --- an increasingly unusual event in modern day infrastructures ---  does the consistency-availability tradeoff present itself. At all other times, it is possible to be both available and consistent. Nonetheless, many systems choose not to be fully consistent at all times. The reason for this has nothing to do with the CAP tradeoff. Instead, there is a separate latency vs. consistency tradeoff. Enforcing consistency requires either (1) synchronization messages between machines that must remain consistent with each other or (2) all requests involving a particular data item to be served by a single master for that data item instead of the closest replica to the location where the request originates. Both of these options come with a latency cost. By relaxing consistency and serving reads and writes directly from a closest replica (without synchronization with other replicas), latency can be improved --- sometimes by an order of magnitude.
Therefore, I felt that it was important to clearly tease apart these separate consistency tradeoffs. This led to the PACELC theorem: if there is a partition (P), how does the system trade off availability and consistency (A and C); else (E), when the system is running normally in the absence of partitions, how does the system trade off latency (L) and consistency (C)?
In general, the PACELC theorem leads to four categories of systems: PC/EC, PA/EL, PC/EL, and PA/EC. However, in practice, an application will either go to the effort of building on top of a reduced consistency system or it will not. If it goes to this effort, it stands to benefit in two areas: availability upon a partition, and latency in everyday operation. It is unusual for a system to go to this effort and choose only to attain benefit in one area. Hence, two of these four categories are more common than the other two: PC/EC systems designed for applications that can never sacrifice consistency, and PA/EL systems that are designed for applications that are capable of being built over a reduced consistency system. Despite being less common, PACELC nonetheless theorizes about the existence of PC/EL and PA/EC systems. At the time when I originally introduced PACELC, I gave the example of PNUTS as a PC/EL system. However, I could not think of any good examples of PA/EC systems. Even in my extended article on PACELC in the CAP-anniversary edition of IEEE Computer, I only gave a somewhat hand-wavey example of a PA/EC system.
The basic problem with PA/EC systems is the following: although partitions are a rare event, they are not impossible. Any application built on top of a PA system must have mechanisms in place to deal with inconsistencies that arise during these partition events. But once they have these mechanisms in place, why not benefit during normal operation and get better latency as well?
Over the past few weeks, I have been looking more deeply at the In-Memory Data Grid (“IMDG”) market, and took an especially deep dive into Hazelcast, a ubiquitous open source implementation of a IMDG, with hundreds of thousands of in production deployments. It turns out that Hazelcast (and, indeed, most of the in-memory data grid industry) is a real implementation of the mythical PA/EC system.
In order to understand why PA/EC makes sense for Hazelcast and other IMDGs, we need to first discuss some background material on (1) Hazelcast use cases, (2) data replication and (3) PACELC.

Hazelcast use cases
The most common use case for Hazelcast is the following. Let’s say that you write a Java program that stores and manipulates data inside popular Java collections and data structures, e.g., Queue, Map, AtomicLong, or Multimap. You may want to run this program in multiple different clients, all accessing the same Java collections and data structures. Furthermore, these data structures may get so large that they cannot fit in memory on a single server. Hazelcast comes to the rescue --- it provides a distributed implementation of these Java data structures, thereby enabling scalable utilization of them. Users interact with Hazelcast the same way that they interacted with their local data structures, but behind the scenes, Hazelcast is distributing and replicating them across a cluster of machines.
The vast majority of Hazelcast use cases are within a single computing cluster. Both the client programs and the Hazelcast data structures are located in the same physical region.

Data replication
In general, any arbitrary system may choose to replicate data for one of two primary reasons: Either they want to improve fault tolerance (if a server containing some of the data fails, a replica server can be accessed instead), or they want to improve request latency (messages that have to travel farther distances take longer to transmit; therefore, having a replica of the data “near” locations from which they are typically accessed can improve request latency).
As mentioned above, in-memory data grids are typically running in the same region as the clients which access them. Therefore, only the first reason to replicate data (fault tolerance) applies. (This reason alone is enough to justify the costs of replication in any scalable system. The more physical machines that exist in the system, the more likely it is that at least one machine will fail at any given point in time. Therefore, the bigger the system, the more you need to replicate for fault tolerance).
If the replicas only exist for fault tolerance and not for performance, there is no fundamental requirement to ever access them except in the case of a failure. All reads and writes can be directed to the primary copy of any data item, with the replicas only ever accessed if the primary is not available. (In such a scenario, it is a good idea to mix primary and replica partitions on servers across the cluster, in order to prevent underutilization of server resources.) If all reads and writes go to the same location, this leads to 100% consistency and linearizability (in the absence of failures) since it is easy for a single server to ensure that reads reflect the most recent writes.

What this means for PACELC
Recall what I wrote above about the latency vs. consistency tradeoff: “Enforcing consistency requires either (1) synchronization messages between machines that must remain consistent with each other or (2) all requests involving a particular data item to be served by a single master for that data item instead of the closest replica to the location where the request originates. Both of these options come with a latency cost.” In truth, option (2) does not come with a latency cost when all requests originate from a location closest to the master replica. It’s only when messages travel for longer than the distance to the nearest replica where a cost materializes. In order words, there is no consistency vs. latency tradeoff in the typical Hazelcast use case.
Thus, we should clarify at this point that the PACELC theorem assumes that requests may originate from any arbitrary location. The ELC part of PACELC disappears if all requests come from the same location. I would argue that the CAP theorem makes the same assumption, but such an argument is not as straightforward, and requires a refined discussion about the CAP theorem which is outside scope of this particular blog post.

Failures and partitions
Up until now, we have said that as long as the master node does not fail, if it serves all reads and writes, then full consistency is achieved. The obvious next question is: what happens if the master node fails and a new master takes over? In such a scenario, the ability of the system to maintain consistency depends on how replication is performed. If replication was asynchronous, then consistency cannot be guaranteed, since some updates may have been performed on the old master, but had not yet been replicated to the new master before the old master failed. If all data had been synchronously replicated to the new master, then full consistency can still be guaranteed.
A failed node is logically equivalent to a partition where the failed node is located in one partition and every other node is in the other partition, and all client requests can reach the second partition but not the first. If the failed node is the master node, and replication was asynchronous, then both the CAP theorem and the PAC part of PACELC state that there are only two choices: quiesce the entire system since the only consistent copy is not accessible (i.e. choosing consistency over availability), or serve reads and writes from nodes in the available partition (i.e. choosing availability over consistency).  
Hazelcast by default uses “synchronous” replication, which is actually an interesting hybrid between asynchronous and synchronous replication. The master asynchronously sends the writes to the replicas, and each replica acknowledges these writes to the client. The client synchronously waits for these acknowledgments before returning from the call. However, if the requisite number of acknowledgments do not arrive before the end of a time out period, the call either returns with the write succeeding or throws an exception, depending on configuration. If Hazelcast had been configured to throw an exception, the client can retry the operation.  Hazelcast also has an anti-entropy algorithm that works offline to re-synchronize replicas with the master to repair missed replications. However, either way --- until the point where the missed replication has been repaired either through the anti-entropy algorithm or through a client retry, the system is temporarily in a state where the write has succeeded on the master but not on at least one replica.
In addition to the hybrid synchronous algorithm described above, Hazelcast also can be configured to use standard asynchronous replication. When configured in this way, the client does not wait for acknowledgments from the replicas before returning from the call. Thus, updates that failed to get replicated will go undetected until the anti-entropy algorithm identifies and repairs the missed replication.  
Thus, either way --- whether Hazelcast is configured to use standard asynchronous replication or to use the default hybrid “synchronous” model --- it is possible for the write call to return with the write only succeeding on the master.
If the master node fails, Hazelcast selects a new master to serve reads and writes, even though (as we just mentioned) it is possible that the new master does not have all the writes from the original master. If there is a network partition, the original master will remain the master for its partition, but the other partition will select its own master. Again, this second master may not have all the writes from the original master. Furthermore, a full split brain situation may occur, where the masters for the two different partitions independently accept writes to their partition, thereby causing the partitions to diverge further. However, Hazelcast does have a “split brain protection” feature that prevents significant divergence. The way this feature works is that the system can be configured to define a minimum size for read and write operations. If this minimum size is set to be larger than half of the size of the cluster, then the smaller partition will not accept reads and writes, which prevents further divergence from the larger partition. However, it can take 10s of seconds for the smaller partition to realize how small it is (although Hazelcast claims it will be much faster than this in 3.9.1 and 3.10). Thus there is a delay before the split brain protection kicks in, and the partitions can diverge during this delay period.
The bottom line here is that both if the master fails and also in the (rare) case of a network partition, a new master is selected that may not have all the updates from the original master. The system always remains available, but the second master is allowed to temporarily diverge from the original master. Thus, Hazelcast is PA/EC in PACELC. If the master has failed or partitioned, Hazelcast choses availability over consistency. However, in the absence of failures or partitions, Hazelcast is fully consistent. (As mentioned above, Hazelcast also achieves low latency in the absence of failures or partitions in their primary use case. However, it is appropriate to label Hazelcast EC rather than EL since if a request were to theoretically originate in a location that is far from the master, it would still choose consistency over latency and serve the request from the master.)
Indeed, any system that that serves reads and writes from the master, but elects a new master upon a failure, where this new master is not 100% guaranteed to have seen all of the writes from the original master, will be PA/EC in PACELC. So the PA/EC category is larger than I originally had expected.
I would still argue, however, that PA/EC systems are fundamentally confusing to the end user. If the system cannot guarantee consistency in all cases, then the end user is forced to handle cases of inconsistency in application logic. And once they have the code written to handle these cases (e.g., by including merge functions that resolve situations where replicas may diverge), then the value of the system being consistent in the absence of failures or partitions is significantly reduced. PA/EC systems thus only make sense for applications for which availability takes priority over consistency, but where the code that handles inconsistencies needs to be run as infrequently as possible --- e.g. when the code involves a real world charge (such as refunding a customer’s account) or significant performance costs.
Since not all applications fit into the above category, I suspect that many PA/EC systems will have settings to either increase consistency in order to become fully consistent (i.e. become PC instead of PA) or reduce consistency guarantees in the “else case” (i.e., become EL instead of EC).
Indeed, Hazelcast is such a system and can be configured to be EL rather than EC. There are several ways to accomplish this, but the primary mechanism is through their Near Cache feature. Near Cache is a client side cache of recently accessed data items. If the data items stored in the Near Cache are updated by a different client, these changes are not synchronously replicated to the first client’s Near Cache. Hence, the Near Cache is not kept consistent with the master version of the data (instead it is “eventually consistent”). However, reads by the client are served by its Near Cache if a copy of the data item to be read is stored there. Therefore, excellent latency (less than one microsecond) can be achieved at the cost of consistency --- EL in PACELC.
Furthermore, Hazelcast also supports replication of clusters over a WAN. For example, in a disaster recovery use case, all writes go to the primary cluster, and they are asynchronously replicated to a backup cluster. Alternatively, both clusters can accept writes, and they are asynchronously replicated to the other cluster (the application is responsible for resolving conflicts of conflicting writes to the different clusters using a conflict resolution strategy registered with Hazelcast). Unlike what we discussed earlier, in this case read requests may originate from arbitrary locations rather than always from a location near the master. Hazelcast serves these reads from the closest location, even though it may not have the most up to date copy of the data. Thus, Hazelcast is EL by default for WAN replication.  
In summary, through my investigation of Hazelcast (and in-memory data grids in general), I have discovered a new category of PA/EC systems. However, due to the confusing nature of PA/EC systems, it is no surprise that Hazelcast can be configured to be PA/EL in addition to its PA/EC default.