Year: 2014

Authors: Mu Li, David G. Andersen, Jun Woo Park, et al.

Note: I referenced first author Mu Li's slides, which are very good.

Summary

Motivated by the ever-larger training sets that machine learning requires (the ad click prediction dataset used in the paper is almost 700TB).

System description at a high level:

  • Both data and workloads are distributed over worker nodes, while the server nodes maintain globally shared parameters, represented as dense or sparse vectors and matrices.
  • The framework manages asynchronous data communication between nodes and supports flexible consistency models, elastic scalability, and continuous fault tolerance, while remaining easy to use.

None of this sounds very novel on its own; the authors argue that the novelty lies in combining ML semantics with systems techniques to deal with the core issue of reading and updating parameters shared between different worker nodes.

Machine Learning Analysis

  • The goal of many machine learning algorithms can be expressed via an objective function. The most intuitive variant is risk minimization, i.e., minimizing training loss plus a regularizer (see the objective below).
  • Generative models for unsupervised learning capture the underlying structure of the data, e.g., LDA for topic modeling.
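
For reference, the paper expresses risk minimization as minimizing a regularized objective over n training examples (x_i, y_i): the training loss ℓ plus a regularizer Ω that penalizes model complexity:

    F(w) = \sum_{i=1}^{n} \ell(x_i, y_i, w) + \Omega(w)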

Architecture

It looks like the following:

(Figure: parameter server architecture — a server group whose nodes are managed by a server manager, plus multiple worker groups, each with its own task scheduler.)

  • Data Representation: The model shared among nodes can be represented as a set of (key, value) pairs. This is possible because machine learning algorithms typically treat the model as a linear algebra object.
    • They have vector and matrix semantics, where non-existing keys are associated with zeros
  • Range Push and Pull: workers push updates to and pull parameters from server machines over key ranges, and tasks execute asynchronously (see the first sketch after this list).
  • User-defined functions on the server are supported because server nodes often have more complete or up-to-date information about the shared parameters.
  • Flexible Consistency (trading off system efficiency against convergence): users can declaratively specify one of three models: sequential, eventual, or k-bounded delay (a task may run at most k steps ahead of its dependencies). The authors experimented and found that a bounded delay of 8 gave the best trade-off between computing and waiting time (the second sketch after this list simulates this).
  • User-defined filters: selectively synchronize individual (key,value) pairs, allowing fine-grained control of data consistency within a task
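
A minimal single-process sketch of the range push/pull interface. The names here (RangeKVServer, push, pull) are illustrative, not the paper's actual API; missing keys read as zero, matching the vector semantics above, and push aggregates (sums) the way servers typically do for gradients:

```python
# Toy in-memory stand-in for a server node; names are illustrative.
class RangeKVServer:
    def __init__(self):
        self.store = {}

    def pull(self, key_range):
        """Read all (key, value) pairs in [lo, hi); missing keys read as zero."""
        lo, hi = key_range
        return {k: self.store.get(k, 0.0) for k in range(lo, hi)}

    def push(self, updates):
        """Apply an update; servers aggregate (sum) rather than overwrite."""
        for k, v in updates.items():
            self.store[k] = self.store.get(k, 0.0) + v

server = RangeKVServer()
R = (0, 4)                                 # key range this worker works on
w = server.pull(R)                         # pull weights (zeros on first read)
grad = {k: -0.1 * (k + 1) for k in w}      # stand-in for a computed gradient
server.push(grad)                          # in the real system this is async
print(server.pull(R))                      # {0: -0.1, 1: -0.2, 2: -0.3, 3: -0.4}
```

In the real system both calls are issued asynchronously and return immediately; the caller can keep computing and later wait on the task's completion.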
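And a toy simulation of how k-bounded delay gates a worker (the finished set and can_start helper are my own framing of the paper's dependency rule): iteration t may start only once iteration t - k - 1 has finished, so no task runs more than k steps ahead:

```python
K = 8  # the bounded delay the authors found best

def can_start(t, finished):
    """Iteration t may start once iteration t - K - 1 has finished.
    K = 0 recovers sequential consistency; K = infinity recovers eventual."""
    dep = t - K - 1
    return dep < 1 or dep in finished

finished = {1, 2, 3}            # iterations completed so far, system-wide
print(can_start(9, finished))   # True: 9 - 9 = 0, no dependency yet
print(can_start(12, finished))  # True: iteration 3 has finished
print(can_start(13, finished))  # False: must wait for iteration 4
```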

Implementation

  • Each (key,value) pair is associated with a vector clock, which is needed to track the potentially complex task-dependency graph and to enable fast recovery.
  • Messages: Nodes may send messages to individual nodes or node groups. A message consists of a list of (key,value) pairs in a key range R and the associated range vector clock.
  • Consistent Hashing: keys and server node IDs are placed on a hash ring, which determines which server owns each key range (see the sketch after this list).
  • Replication and Consistency: a server node stores a replica of the k counterclockwise neighbor key ranges relative to the one it owns. Worker nodes communicate with the master of a key range, and any modification on the master is copied with its timestamp to the slaves.
  • Server management: supports addition and removal of nodes. It's a bit complex, with corner cases around vector clock timestamps and consistency.
  • Worker Management: similar to server but simpler. When a worker departs, the task scheduler may start a replacement.
  • Smart Fault Tolerance: only replicate after aggregation: “Server nodes often aggregate data from the worker nodes, such as summing local gradients. Servers may therefore postpone replication until aggregation is complete.”
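
A toy sketch of the hash-ring layout and replica placement described above (all names are illustrative; the paper specifies only consistent hashing and k counterclockwise replicas). A key belongs to the first server clockwise from its ring position; since each server replicates its counterclockwise neighbors' ranges, a range owned at index i is also held by the next K servers clockwise:

```python
import hashlib

K = 2  # replicas per key range (illustrative)

def ring_pos(name, ring_size=2**32):
    # md5 is an arbitrary choice of hash for this sketch
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % ring_size

servers = sorted(("server-%d" % i for i in range(5)), key=ring_pos)

def owner_index(key):
    """First server clockwise from the key's position owns its range."""
    pos = ring_pos(key)
    for i, s in enumerate(servers):
        if ring_pos(s) >= pos:
            return i
    return 0  # wrapped past the last server on the ring

def replica_holders(key):
    """Owner plus the next K servers clockwise: each of those stores replicas
    of its counterclockwise neighbors' ranges, which include this one."""
    i = owner_index(key)
    return [servers[(i + j) % len(servers)] for j in range(K + 1)]

print(replica_holders("weight_42"))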
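And a tiny illustration of that replicate-after-aggregation optimization (the replicate helper and message-log replicas are hypothetical): summing the gradients from all workers first means one replication message per slave instead of one per worker push:

```python
def replicate(update, replicas):
    """Hypothetical stand-in for copying an update to slave replicas."""
    for log in replicas:
        log.append(update)

worker_grads = [{0: 0.1, 1: -0.2}, {0: 0.3, 1: 0.1}, {0: -0.1, 1: 0.0}]
replicas = [[], []]  # two slaves, modeled as message logs

aggregated = {}
for g in worker_grads:              # aggregate all worker pushes first...
    for k, v in g.items():
        aggregated[k] = aggregated.get(k, 0.0) + v

replicate(aggregated, replicas)     # ...then replicate once, not 3 times
print(len(replicas[0]))             # 1 message per slave
```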