Authors: M. Stonebraker, D. Abadi, S. Zdonik, P. O’Neil, et al.
C-Store’s goal is to be read-optimized (while still supporting acceptable OLTP). To that end:
- storage of data by column rather than by row
- The authors observed that a typical ad-hoc query reads only 2 columns out of 20, so loading the other 18 columns into memory is a big waste of I/O time
- query processing in main memory
- overlapping collection of column-oriented projections
- high availability and snapshot isolation for read-only transactions
- extensive use of bitmap indexes to complement B-tree structures
- avoid bringing into memory irrelevant attributes
- could use more CPU to save I/O (since most systems are I/O bound)
- code data elements into a more compact form
- densepack values: store values contiguously with no padding or wasted space
- multiple sorts: redundant storage so that query evaluation could use best projection
- k-safety for high availability (the overlapping projections also double as replicas for recovery)
- keep a separate store for writes, and in order to avoid excessive conflicts, run reads in historical mode (timestamp based).
- snapshot isolation to avoid 2PC and locking
Groups of columns sorted on the same attribute are referred to as “projections”; the same column may exist in multiple projections, possibly sorted on a different attribute in each.
Overall Data Model
- Segments: every projection is horizontally partitioned by key range into segments (identified by sid).
- Covering set: there need to be a covering set of the projections for every table and be able to reconstruct the rows with join indices.
There is a decent amount of overhead here, and one can see why they need a separate Write Store (WS).
They have different types of encodings for columns with different characteristics:
- Type 1: self-order, few distinct values; the column is encoded as a sequence of (v, f, n) triples such that
- v is a value stored in the column,
- f is the position in the column where v first appears, and
- n is the number of times v appears in the column
- Type 2: Foreign-order, few distinct values, then (v, b):
- b is a bitmap indicating the positions in which the value is stored
- for retrieval they use a cool “offset index”: a B-tree that maps positions to values
- Type 3: self-order, many distinct values: every value in the column is stored as a delta from the previous value. E.g. 1,4,7,7,8,12 is represented by 1,3,3,0,1,4
- Could use (densepack) B-trees again for indexing these
- Type 4: Foreign-order, many distinct values: leaving the value as is.
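The four encodings above can be sketched as follows. This is a minimal illustration with names of my own choosing, not the paper’s implementation; Type 4 is just the identity, so only Types 1–3 are shown.

```python
def encode_type1(column):
    """Type 1 (self-order, few distinct values): runs of (v, f, n) —
    value, first position it appears, and how many times it appears."""
    runs = []
    for pos, v in enumerate(column):
        if runs and runs[-1][0] == v:
            val, first, n = runs[-1]
            runs[-1] = (val, first, n + 1)
        else:
            runs.append((v, pos, 1))
    return runs

def encode_type2(column):
    """Type 2 (foreign-order, few distinct values): (v, b) pairs, where
    b is a bitmap marking the positions at which v occurs."""
    bitmaps = {}
    for pos, v in enumerate(column):
        bitmaps.setdefault(v, [0] * len(column))[pos] = 1
    return sorted(bitmaps.items())

def encode_type3(column):
    """Type 3 (self-order, many distinct values): the first value,
    then each value as a delta from its predecessor."""
    return [column[0]] + [b - a for a, b in zip(column, column[1:])]

# The delta example from the notes:
assert encode_type3([1, 4, 7, 7, 8, 12]) == [1, 3, 3, 0, 1, 4]
```

Type 1 is why a sorted column with few distinct values compresses so well: a million-row column with ten values collapses to ten triples.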
Join Index: a join index is a collection of (sid, storage_key) pairs. Each of these two fields can be stored as a normal column.
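A minimal sketch of how a join index stitches rows back together (the data structures and names here are assumptions for illustration, not the paper’s code): the i-th join-index entry locates, via (sid, storage_key), the row in the other projection that matches the sender’s i-th row.

```python
# Receiver projection: segments keyed by sid; each segment maps a
# storage key to its row values.
receiver = {
    1: {10: ("alice", 30), 11: ("bob", 25)},
    2: {20: ("carol", 41)},
}

# Join index: entry i locates the receiver row matching sender row i.
join_index = [(2, 20), (1, 10), (1, 11)]

def reconstruct(join_index, receiver):
    """Follow each (sid, storage_key) pair into the receiver's segments."""
    return [receiver[sid][sk] for sid, sk in join_index]

assert reconstruct(join_index, receiver) == [("carol", 41), ("alice", 30), ("bob", 25)]
```

This is also why a covering set of projections is needed: without it, some column of the original table would be unreachable by any chain of join indices.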
Surprisingly, the write store is also a column store (to avoid writing two optimizers, apparently), but its storage representation is very different, so that transactions can update it quickly.
- There is a 1:1 mapping between RS segments and WS segments.
- It’s not compressed (because it’s much smaller than RS) and is represented as (v, sk) pairs, where sk is the value’s corresponding storage key.
So every projection is represented as a pair of segments, one from WS and one from RS. Note that a join index is directed, so there is a notion of a “sender” segment and a “receiver” segment. K-safety replicates:
- all columns in a single segment of a projection
- join indexes & their “sender” segments.
Storage keys are globally unique.
They have a very large main memory buffer pool so they expect “hot” WS data structures to be memory resident.
Snapshot isolation: allowing read-only transactions to access the database as of some time in the recent past, at which point there are no uncommitted transactions.
The most recent effective time at which a snapshot-isolation query may run is the high water mark (HWM); the earliest is the low water mark (LWM). So there is no time-traveling back beyond some point, which would incur a large overhead.
Key challenge: deciding which of the records in WS and RS should be visible to a read-only transaction running at effective time ET. Since it’s SI, there are no in-place updates, and a record is visible iff it was inserted before ET and deleted after ET (or not at all). To support this, C-Store uses the following designs:
- coarse-granularity epochs (many seconds long), controlled by a timestamp authority (TA) that allocates timestamps. To end epoch e, the TA sends end-of-epoch messages to the other sites and increments the epoch to e+1, causing new transactions to take on the new timestamp; the TA does not set the HWM to e until all sites send epoch-complete messages back.
- insert vector (IV) & deleted record vector (DRV): these store, for each record, the epoch in which the record was inserted and (if applicable) deleted. The runtime system consults the IV and DRV to make the visibility calculation for each query on a record-by-record basis.
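The per-record visibility test that the IV and DRV enable can be written in one line; this is a hedged sketch (epoch values and parameter names are illustrative, not from the paper):

```python
def visible(insert_epoch, delete_epoch, et):
    """A record is visible at effective time ET iff it was inserted at or
    before ET and either never deleted or deleted after ET."""
    return insert_epoch <= et and (delete_epoch is None or delete_epoch > et)

assert visible(insert_epoch=3, delete_epoch=None, et=5)   # live record
assert visible(insert_epoch=3, delete_epoch=7, et=5)      # deleted after ET
assert not visible(insert_epoch=6, delete_epoch=None, et=5)  # inserted later
assert not visible(insert_epoch=3, delete_epoch=4, et=5)     # already deleted
```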
They use strict 2PL for concurrency control and WAL for recovery and use No Force/Steal, but only log UNDO (logical) and have special REDO.
They use a trimmed-down version of 2PC without the PREPARE phase; the failures this would normally guard against are instead recovered from using other projections.
A background task (the tuple mover) finds worthy segment pairs and performs a merge-out; this interacts with maintaining the LWM. Records that are not deleted, or that were deleted after the LWM, are moved to RS. The LWM is also periodically advanced (as is the HWM).
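The tuple mover’s per-record decision during merge-out can be sketched as below; the record layout and names are my assumptions, not the paper’s:

```python
def merge_out(ws_records, lwm):
    """Partition WS records relative to the low water mark (LWM):
    records not deleted, or deleted after the LWM, move to RS;
    records deleted at or before the LWM can be discarded, since no
    snapshot query can ever see them again."""
    to_rs, discard = [], []
    for rec in ws_records:
        if rec["delete_epoch"] is not None and rec["delete_epoch"] <= lwm:
            discard.append(rec)
        else:
            to_rs.append(rec)
    return to_rs, discard

ws = [
    {"sk": 1, "delete_epoch": None},  # still live -> RS
    {"sk": 2, "delete_epoch": 9},     # deleted after LWM -> RS (keeps epoch)
    {"sk": 3, "delete_epoch": 2},     # deleted before LWM -> drop
]
to_rs, discard = merge_out(ws, lwm=5)
assert [r["sk"] for r in to_rs] == [1, 2]
assert [r["sk"] for r in discard] == [3]
```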
They add a new set of operators, notably decompress, mask (apply a bitstring to a projection), and permute (reorder a projection according to a join index).
For query optimization, the big decision is which set of projections to use. They didn’t go into much detail here, though I imagine it is a pretty rich problem.
C-Store performs significantly better than both a row store and a column store on the TPC-H (warehouse) workload. They also ran the other systems with materialized views, which closed the gap but imposes a lot of extra storage overhead. However, their analysis seems to cover only the RS (because the WS was under development at the time, which is perhaps also why the tuple mover and query optimizer weren’t extensively discussed).
- Surprising that no one thought of this before! All of the individual ideas are pretty commonly used.
- Prof. Stonebraker seems to like to design systems that “contrasts sharply with most current systems”.