Author: E Brewer
note: this is an amazingly well written paper that has many great ideas and is super easy to understand, a bit hard to summarize because it’s already quite packed and phrased very succinctly.
The CAP theorem states that any networked shared-data system can have at most two of three desirable properties:
- consistency (C) equivalent to having a single up-to-date copy of the data;
- high availability (A) of that data (for updates); and
- tolerance to network partitions (P).
The easiest way to understand CAP is to think of two nodes on opposite sides of a partition. Allowing at least one node to update state will cause the nodes to become inconsistent, thus forfeiting C. Likewise, if the choice is to preserve consistency, one side of the partition must act as if it is unavailable, thus forfeiting A. Only when nodes communicate is it possible to preserve both consistency and availability, thereby forfeiting P.
The C in CAP and C in ACID are different and they have complicated interactions, so for context the definitions of ACID again (with useful comments about CAP):
- Atomicity (A). All systems benefit from atomic operations. When the focus is availability, both sides of a partition should still use atomic operations. Moreover, higher-level atomic operations (the kind that ACID implies) actually simplify recovery.
- Consistency (C). In ACID, the C means that a transaction pre- serves all the database rules, such as unique keys. In contrast, the C in CAP refers only to single‐copy consistency, a strict subset of ACID consistency. ACID consistency also cannot be maintained across partitions—partition recovery will need to restore ACID consistency. More generally, maintaining invari- ants during partitions might be impossible, thus the need for careful thought about which operations to disallow and how to restore invariants during recovery.
- Isolation (I). Isolation is at the core of the CAP theorem: if the system requires ACID isolation, it can operate on at most one side during a partition. Serializability requires communication in general and thus fails across partitions. Weaker definitions of correctness are viable across partitions via compensation during partition recovery.
- Durability (D). As with atomicity, there is no reason to forfeit durability, although the developer might choose to avoid needing it via soft state (in the style of BASE) due to its expense. A subtle point is that, during partition recovery, it is possible to reverse durable operations that unknowingly violated an invariant during the operation. However, at the time of recovery, given a durable history from both sides, such operations can be detected and corrected. In general, running ACID transactions on each side of a partition makes recovery easier and enables a framework for compensating transactions that can be used for recovery from a partition.
This inspired a range of systems on different parts of the trade-off frontier, especially inspiring debates about relative merits of consistency and availability.
Brewer argues that there are many nuances between the extremes (essentially bringing in application semantics) and 2 out 3 is a bit misleading.
The general belief is that for wide-area systems, designers cannot forfeit P and therefore have a difficult choice between C and A.
There a few reasons:
- Partitions are rare: there is little reason to forfeit C or A when the system is not partitioned.
- Not just black and white:
- Choices at multiple levels : he choice between C and A can occur many times within the same system at very fine granularity; not only can subsystems make different choices, but the choice can change according to the operation or even the specific data or user involved.
- Each choice itself is not binary. E.g. availability is continuous from 0 to 100%, multiple levels of consistency etc. Exactly what it means to forfeit P or C is unclear.
- Additionally, new features such as HTML5’s on-client persistent storage make disconnected operation easier going forward.
The natural way forward should be allow perfect C and A most of the time, and only change the strategy when P is detected.
Latency and Partition
The essence of CAP takes place during a timeout, a period when the program must make a fundamental decision—the partition decision:
- cancel the operation and thus decrease availability,
- proceed with the operation and thus risk inconsistency.
Retrying communication to achieve consistency, for example, via Paxos or a two-phase commit, just delays the decision. Thus, pragmatically, a partition is a time bound on communication.
Based on this there are 3 consequences/strategies:
- “No global notion of a partition”
- “Nodes can detect a partition and enter a partition mode a central part of optimizing C and A”
- “Designers can set time bounds intentionally according to target response times; systems with tighter bounds will likely enter partition mode more often and at times when the network is merely slow and not actually partitioned.”
Systems should manage partitions very explicitly. Brewer prescribed the following steps:
- detect the start of a partition,
- enter an explicit partition mode that may limit some operations
- limit some opeartions
- record extra information to help with partition recovery
- initiate partition recovery when communication is restored.
What Operations Should Go?
Fundamentally this ties back to application semantics:
Deciding which operations to limit depends primarily on the invariants that the system must maintain.
Bring users into the loop! Tell the user that it’s in progress and not complete (e.g. Bayou).
Do transactions, not just reads and writes!
Vastly easier to analyze the impact of higher-level operations on invariants.
Vector clocks are your friend. I always find it impressive when people prove negative results in systems…
Recent work proved that this kind of causal consistency is the best possible outcome in general if the designer chooses to focus on availability.
Again, bringing the user into the process expands the design space!
Designers can choose to constrain the use of certain operations during partitioning so that the system can automatically merge state during recovery.
Using commutative operations is the closest approach to a general framework for automatic state convergence. Unfortunately, using only commutative operations is harder than it appears; for example, addition is com- mutative, but addition with a bounds check is not (a zero balance, for example).
CRDTs is a way to generalize the above way of thinking:
- ensure that all operations during a partition are commutative, or
- represent values on a lattice and ensure that all operations during a partition are monotonically increasing with respect to that lattice.
The benefit: choose A and still ensure that state converges automatically after a partition.
Issues: CRDTs allow only locally verifiable invariants. The fix is to compensate for mistakes afterwards.
Example of Compensating for Externalized Mistakes
- ATM: In general, because of communication delays, the banking system depends not on consistency for correctness, but rather on auditing and compensation.
- Airplane: If there are too many passengers, some will lose their seats, and ideally customer service will compensate those passengers in some way.
The system could log additional information to be able to automatically detect mistakes, and not shift the burden on the customers/users to identify a problem.
Compensations should leave the database in a consistent state and not cause cascading aborts. It seems like the compensation line of research is rather aged.
System designers should not blindly sacrifice consistency or availability when partitions exist. Using the proposed approach, they can optimize both properties through careful management of invariants during partitions. As newer techniques, such as version vectors and CRDTs, move into frameworks that simplify their use, this kind of optimization should become more widespread. However, unlike ACID transactions, this approach requires more thoughtful deployment relative to past strategies, and the best solutions will depend heavily on details about the service’s invariants and operations.