Summary

Problem: try to use indices answer as a broad range of queries as possible. Traditional indices can only answer a certain type of queries well.

Solution: the authors propose value-list index, projection and bit sliced indexes, and then show how to answer aggregates using these indices, and lastly how to work with “non-dimensional” selection.

Indices

The authors proved a cool theorem (2.1) that any two of the three indices could be used to create the other one.

Value-List Index

  • Background: the leaf level of B+-tree index consists of a sequence of entries for index keyvalues, which reflects the value of the indexed column or columns in one or more rows in the table. In traditional B-trees the values only appear once so the keyvalue partitions the rows, but in object-oriented DBs they don’t. To address each individual rows, there is a Row IDentifier (RID), which is a list stored with each distinct keyvalue entry. The problem with RIDs: large number of RIDs. Bit map is an alternative method to represent RID.
  • Definition of Bitmap: “A “Bitmap B” is defined on table T as a sequence of M bits, where for each row with ordinal number j, we set the jth bit in B if that row satisfies the property of the index”
  • Advantages:
    • Space efficient: easy to compress, good for caching
    • CPU efficient: boolean operations used for AND, OR, and NOT are fast
      • Note that to work with NOT, we need an additional “Existence Bitmap” since SQL has True, False and NULL.

Projection Index

  • Definition: “A projection index for column duplicates all column values for lookup by ordinal number” (yifan’s note: this feels like a precursor to column store ideas)
  • Advantage: don’t need to retrieve other columns in the row and could fit in smaller number of pages.

Bit Sliced Index

  • Definition: “A Bit-Sliced index for a column creates Bitmaps for significant bits in the column value”.
    • So for example, a list of [1,2,3] would have a bit sliced index of [[0,0,1],[0,1,1]]
  • This is a very neat idea as it makes RANGE selection, MAX calculation, and SUM extremely fast (all bit operations). Highlight of the contribution.
  • Example: Algorithm 4.2 shows how to do Bit-Sliced Index do range predicate. As a concrete example, for greater than comparison, one just need to compare for the most significant bit for which the two numbers differ (and smaller than is vice versa).
  • Intuitively it’s a vertical partitioning of the data — instead of looking at the numbers one by one, we are looking at the i-th significant bit all together, so they are literally “orthogonal” to the data in the Projection index.

Using the Indexes

  • Aggregation: sum could be via projection index, or bitmap index (add over the bit-shifted counts) and bit sliced (the two algorithms shown below, respectively):
if  (COUNT (Bf AND Bnn) == 0)
  Return null;
SUM = 0.0;
for each non-null value v in the index for C {
  Designate the set of rows with value v as Bv
  SUM += v * COUNT(Bf AND BV);
}
Return SUM;
if  (COUNT (Bf AND Bnn) == 0)
  Return null;
SUM = 0.0;
for i = 0 to N
  SUM += 2i * COUNT(Bi AND Bf);
Return SUM;

For how to implement other types of queries the paper goes into details.

  • Queries that group by different combinations of columns are called dimensions, and if there isn’t an existing summary table (e.g. CUBE) other indices are completely useless. This is a problem. The authors created Join Indexes and Bitmap-Join-Indexes to fix this (note that this paper did not contribute the index, a previous 1995 paper did).
    • Naive join index: “an index on one table that involves a column value from different table through a commonly encountered join.”
    • Issue: combinatorial explosion of join indexes in terms of the number of useful columns
    • How bitmap-join-index solve the problem? just do pairwise joins, “an index on a table T based on a single column of a table S, where S commonly joins with T in a specified way”

Evaluation

provenance provenance

These results really motivate the need of different indices.

Bit-Sliced Index is not the best for MAX or MIN makes sense because Value-List Index apparently works better for narrow ranges, but I’m not sure why this would still hold if the selectivity of the predicate is high…