Year: 1996

Author: Patrick O’Neil et al


The wide adoption of long-lived transactions generates the following

  • history table to provide an activity trace
  • log records for purposes of system recovery

Motivates LSM-Tree, which is provides low-cost indexing updates.

How: batch index changes > The LSM-tree uses an algorithm that defers and batches index changes, cascading the changes from a memory-based component through one or more disk components in an efficient manner reminiscent of merge sort.

Why: five minute rule! > we can reduce system costs by purchasing memory buffer space to keep pages in memory, thus avoiding disk I/O, when page reference frequency exceeds about once every 60 seconds

Use cases: high update rates compared to retrieval rates

The algorithm has greatly reduced disk arm movements compared to a traditional access methods such as B-trees, and will improve cost- performance in domains where disk arm costs for inserts with traditional access methods overwhelm storage media costs.

Algorithm Sketch

Basic: One tree in memory (C0) and one on disk (C1). Update to the smaller tree in memory, then asynchronously merge to the larger tree on disk (rolling merge that takes multiple steps).

Advanced: a chain of indexes, C0 through Ck, where C1 to Ck are on disk in increasing size. Below is an illustration

LSM tree


  • C0 doesn’t have to be a B-tree
  • movement by units of multi-page blocks for C1
  • Keep track of the start-time for each transaction and limit the search range among the C0 through Ck.
  • Find note entry: Piggy back long runing searches that are not time sensitive with updates to the C1+’s


Cursors that circulates the components: > We picture the rolling merge process in a two component LSM-tree as having a conceptual cursor which slowly circulates in quantized steps through equal key values of the C0 tree and C1 tree components, drawing indexing data out from the C0 tree to the C1 tree on disk. The rolling merge cursor has a position at the leaf level of the C1 tree and within each higher directory level as well. At each level, all currently merging multi-page blocks of the C1 tree will in general be split into two blocks: the “empty- ing” block whose entries have been depleted but which retains information not yet reached by the merge cursor, and the “filling” block which reflects the result of the merge up to this moment. There will be an analogous “filling node” and “emptying node” defining the cursor which will certainly be buffer resident.

Crash Recovery

tldr: just use existing logs to reconstruct, everything else is an optimization: > we don’t need to create special logs to recover index entries on newly created records: transactional insert logs for these new records are written out to a sequential log file in the normal course of events, and it is a simple matter to treat these insert logs (which normally contain all field values together with the RID where the inserted record has been placed) as a logical base for reconstructing the index entries


  • Checkpoint: it blocks, but it’s fine because C0 is quite to flush, and when the rest is flushing the system could use C0.
  • no overwrite: > Newly merged blocks are written to new disk positions, so that the old blocks will not be over- written and will be available for recovery in case of a crash.


Invariants that need to be maintained:

  • A find operation should not access a node of a disk-based component at the same time that a different process performing a rolling merge is modifying the contents of the node.
  • A find or insert into the C0 component should not access the same part of the tree that a different process is simultaneously altering to perform a rolling merge out to C1.
  • The cursor for the rolling merge from Ci-1 out to Ci will sometimes need to move past the cursor for the rolling merge from Ci out to Ci+1


  • Connection to Log-Structured File System
  • The paper spends the first 8 pages on describing the index and the rest 20 some pages on evaluation, and defining parameters that could be tuned to make the algorithm better.
  • How does it only have less than 400 citations?!