Spanner promises a lot (and the 14 page paper is also jam packed with cool things), with this paper’s focus on the hardest problem: replicas, but the buzz-wordy list is the follows:
- fine grained by application:
- distance: write latency
- how many replicas: durability, availability, read perf
- MVCC, makes possible atomic schema updates, consistent backups etc:
- externally-consistent distributed (general purpose) transactions
- globally-consistent reads at a time-stamp (globally meaningful, due to TrueTime)
- automatically reshards data across machines
- SQL-based query language
Looks something like this:
- Tablet: each spanserver is responsible for between 100 and 1000 instances of a data structure called a tablet. A tablet is similar to Bigtable’s tablet abstraction, in that it implements a bag of the following mappings:
(key:string, timestamp:int64) -> string.
- The state of a Tablet is stored on Colossus (successor to GFS) with B-tree like files and WAL.
- “To support replication, each spanserver implements a single Paxos state machine on top of each tablet.”
- A Paxos group: “The Paxos state machines are used to implement a consistently replicated bag of mappings. The key-value mapping state of each replica is stored in its corresponding tablet. Writes must initiate the Paxos protocol at the leader; reads access state directly from the underlying tablet at any replica that is sufficiently up-to-date. The set of replicas is collectively a Paxos group.” The Paxos group look like the follow:
Referenced these slides for media.
The concept of time is fundamental to our way of thinking. It is derived from the more basic concept of the order in which events occur. L Lamport
Ruminating over Lamport’s classic Time, Clocks, and the Ordering of Events in a Distributed System, which defines partial orders. The deep insight is that time is based on ordering. Time is fundamentally challenging in distributed systems because processes need to exchange messages over a network (doesn’t even need spacial separation, for example with multiprocessors).
In a distributed system, it is sometimes impossible to say that one of two events occurred first. The relation “happened before” is therefore only a partial ordering of the events in the system.
Definition of causality (and concurrency):
a->bdenotes that event
b, and it must be the case that
clock(a) < clock(b)
boccur on different processes that do not exchange messages, then neither a -> b nor b -> a are true, so these events are concurrent, otherwise, they are causal.
One way is to have the message carry time across the system to ensure that the receiver always gets a later time. A strawman of this idea is illustrated below:
Then Vector clocks were introduced, where each process carries its own time, illustrated below (all the timestamps that do not have an arrow are concurrent events):
Background: External Consistency (aka Linearizability)
Calvin and Spanner (wish it were Hobbes) got me really confused about the time business again. So I hunted down the internet for what this external serializability (from Spanner) beast is, and what kind of guarantee Calvin provides.
This blog post by Irene Zhang (grad student at UW) states the following: > Linearizable consistency ensures that the same operations are applied in the same order to every copy of the data item and that the order reflects the order in which the operations appear to execute to an external observer (like the application).
It’s definitely stronger than serializability because serializability just means that it happens in some serial order, without needing to be consistent with an external observer.
Bailis’ post makes the following distinction which is another way to look at the definitions:
- Linearizability: single-operation, single-object, real-time order
- Serializability: multi-operation, multi-object, arbitrary total order
- Strict Serializability: combining both
- “linearizability can be viewed as a special case of strict serializability where transactions are restricted to consist of a single operation applied to a single object.” (from paper *Linearizability: A Correctness Condition for Concurrent Objects * by Herilihy, Wing)
Bailis’ thesis states that Spanner achieves “one-copy serializability and linearizability”, where the former is defined as: “a history is one-copy serializable if it is view equivalent to a serial execution of the transactions over a single logical copy of the database”.
Zhang mentioned in the blog that one reason why this is so confusing is because the communities in consistency and isolation are separate? (It was a bit ambiguous from the blog). I definitely found that consistency and isolation started mixing together, especially given that consistency is such a contextual word. Apparently their research TAPIR shows that one could provide strong isolation without any consistency guarantees… Great.
To get a sense of how confusing the terms are, Jim Gray has a famous paper that introduced the four lock protocols (read uncomitted, read committed, repeatabel reads, and serializable), and the title contained, Degrees of Consistency, where the 4 levels have since been referred to as isolation levels!!! (Rant: I find it maddening how complicated this whole consistency business is. I have finished my first year of PhD and barely feels comfortable with the material. And the fact that it is somewhat intrinsically confusing either because I asked the wrong question or the person who’s answering doesn’t know, or I was processing the wrong mapping of the terms they used makes learning about it very hard as well. I think this is probably the hardest computer science subject to teach.).
TrueTime and Timestamp
What is TrueTime? “Global wall-clock time” with bounded uncertainty. How its implemented is more physics than CS, and the paper focuses on talking about how its used, i.e. the API:
Which basically means the following:
Spanner supports the following invariants (which is central to their consistency semantics and guarantees):
- Timestamp order == commit order
- Timestamp order respects glocal wall-time order
The big question is, are the clocks trustworthy? The empirical data suggest that they are.
Here are some details for each transaction type:
- Read-Write: use two-phase locking, and assigned timestamp when all locks have been acquired and before any lock is released (notice that this makes it have the benefit of strict 2PL without paying additional cost). Here is an illustration from the OSDI slides of how 2PC works
- Read-Only: executes in two phases: assign a timestamp
s, and then execute the transaction’s reads as snapshot reads at
s. The snapshot reads can execute at any replicas that are sufficiently up-to-date.
- Snapshot Read: a read in the past that executes with-out locking. A client can either specify a timestamp for a snapshot read, or provide an upper bound on the desired timestamp’s staleness and let Spanner choose a time-stamp.”
In summary, this makes possible “externally consistent transactions, lock-free read-only transactions, and non-blocking reads in the past”.
Random: I think the Spanner folks are making a parallel universe joke when they name a Spanner deployment a universe.