Amazon needs to serve its customers across the world and always be on, with good SLA. It doesn’t need most of the relational structure of a traditional DBMS and is willing trade off ACID properties. It also needs to be able to scale incrementally. To this end they combined many existing tools and tricks to design the eventually consistent system Dynamo. (Its a really dense paper… but in a good way!)
Why is it needed? So that divergent versions can be reconciled later.
Dynamo provides eventual consistency, which allows for updates to be propagated to all replicas asynchronously.
Why is this still useful to return wrong results?
There is a category of applications in Amazon’s platform that can tolerate such inconsistencies and can be constructed to operate under these conditions. For example, the shopping cart application requires that an “Add to Cart” operation can never be forgotten or rejected. If the most recent state of the cart is unavailable, and a user makes changes to an older version of the cart, that change is still meaningful and should be preserved.
Background: Vector Clocks
Vector clocks (proposed by Lamport) help reason about causality. It basically works by the last writer appending a tuple of its ID and a local monotonically increasing timestamp. Basho has a neat example on their blog. One challenge is to ensure that the vector clock doesn’t grown too large and this could be achieved with having the servers managed the vector clock (as opposed to the clients), so the theoretical total length of vector clock is bounded by the number of servers participating, which unfortunately could also be a bit too big, and it could introduce bugs where some updates get silenced. The Dynamo folks ran the thing and the theoretically long chain of timestamp never really happens in practice (given their preferred servers). Problem solved.
Get and Put
put() functions are handled by a preferred node (called the coordinator), and for the first N healthy nodes its replicated over with a quorum system (where the parameter R and W, respectively for the number of readers and writers). The coordinator modifies the vector clock of course!
Note that the parameters R and W could help the system developers slide along on the multiple objective curve and choose at a finer granularity what they care about.
get(), the coordinator gets all the versions of the data for the key from the N preferred neighbor nodes and reconciles the versions.
Sloppy Quorum: When a coordinator notices that a neighbor node A is down, the data will be sent to another node B instead and this new node B will try to send the data back to the originally intended node A, and after success delete its local data.
Permanent Failures: some crazy Merkle tree business: “A Merkle tree is a hash tree where leaves are hashes of the values of individual keys. Parent nodes higher in the tree are hashes of their respective children. The principal advantage of Merkle tree is that each branch of the tree can be checked independently without requiring nodes to download the entire tree or the entire data set. “
And some ways to discovery failure
Adding/Removing Storage Nodes
Nodes transfer the appropriate set of keys based on the updates to the new range they are responsible for in the hash ring.
Implementation Lessons/Tricks: Partition
Background: Consistent hashing
In consistent hashing, the output range of a hash function is treated as a fixed circular space or “ring” (i.e. the largest hash value wraps around to the smallest hash value). Each node in the system is assigned a random value within this space which represents its “position” on the ring. Each data item identified by a key is assigned to a node by hashing the data item’s key to yield its position on the ring, and then walking the ring clockwise to find the first node with a position larger than the item’s position.
- Strategy 1: T random tokens per node and partition by token value. Issue: schemes for data partitioning and data placement are intertwined
- Strategy 2: T random tokens per node and equal sized partitions. Good because
- (i) decoupling of partitioning and partition placement, and
- (ii) enabling the possibility of changing the placement scheme at runtime.
- Strategy 3: Q/S tokens per node, equal-sized partitions