Year: 2012

Authors: M Zaharia et al

Summary

The authors realized that MapReduce is not great for “an important class of emerging applications: those that reuse intermediate results across multiple computations” “We present Resilient Distributed Datasets (RDDs), a dis- tributed memory abstraction that lets programmers per- form in-memory computations on large clusters in a fault-tolerant manner.”

Resilient Distributed Datasets

What is an RDD: a read-only, partitioned collection of records.

How to create? Through deterministic operations on either (1) data in stable storage or (2) other RDDs— these operations are called transformations and include map, filter, and join.

Lazy evaluation: RDDs do not need to be materialized at all times. In- stead, an RDD has enough information about how it was derived from other datasets (its lineage) to compute its partitions from data in stable storage.

User control: persistence and partitioning (useful for placement op- timizations).

Programming Interface: chained transformations.

Advantages:

  • more efficient fault tolerance: “RDDs can only be created through coarse-grained transformations”
  • mitigate stragglers: running backup copies of slow tasks (as in MapReduce), easy due to immutability

Spark Programming Interface

They enumerated through a lot of examples to show how friendly the language is, and how expressive.

Notions of dependency to improve system performance (for pipelined execution and recovery):

  • narrow dependencies: each partition of the parent RDD is used by at most one partition of the child RDD e.g. Map
  • wide dependencies: multiple child partitions may depend on it. e.g. join

They also mentioned implementation details that make Spark efficient, including good design on every level:

  • Job Scheduling: scheduler is similar to Dryad’s, but takes into account user specification of persistence and memory residence.
  • Interpreter integration: run Spark interactively
  • Memory management: lRU
  • Support for checkpointing

Major Success

Spark is a major industry success! I think mostly due to the elegant design, fit for current ML/analysis workflow, amazing usability, and strong engineering.