Year: 1996

Authors: P Deshpande, S Agarwal, J Naughton, R Ramakrishnan

Summary

The authors focus on reducing the time needed to compute all the cuboids of a data cube, given limited memory.

Note: the technical report seemed somewhat different from the VLDB publication; the former does not make explicit references to PipeSort or PipeHash, but instead builds on SortedRun and Partition and focuses on the sorting case (it is unclear how the overlap method would apply to hashing). I personally didn’t enjoy this paper much because everything felt unnecessarily confusing despite the ideas being relatively simple.

Setup

Overlap Method: naively, the cuboids could be computed independently of one another, i.e. each group-by is computed directly from the base table. A step up is to exploit the hierarchy of group-bys and compute each cuboid from a “parent,” e.g. using {A,B,C} to compute {A,B}. Taking this a step further, several child cuboids could be computed together (if memory allows), saving the extra passes over the data. To see this a bit better, imagine unlimited main memory: we could in theory compute all the cuboids in a single scan by keeping a large set of hash tables. Combining the overlap method with sorting, we could in theory sort only once, over the base table.
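
As a toy illustration of the “one scan, many hash tables” thought experiment (my own sketch, not code from the paper; the relation and measure names are made up):

```python
from itertools import combinations
from collections import defaultdict

def all_cuboids(rows, dims, measure):
    """One scan over the base table, keeping one hash table per group-by.
    Only feasible when every hash table fits in memory simultaneously."""
    tables = {gb: defaultdict(int)
              for k in range(len(dims) + 1)
              for gb in combinations(dims, k)}
    for row in rows:  # single pass over the base table
        for gb, table in tables.items():
            key = tuple(row[d] for d in gb)
            table[key] += row[measure]  # SUM; any mergeable aggregate works
    return tables

# Tiny base table over dimensions A, B, C with measure M.
rows = [{"A": 1, "B": 1, "C": 2, "M": 5},
        {"A": 1, "B": 2, "C": 2, "M": 3},
        {"A": 2, "B": 1, "C": 1, "M": 7}]
cube = all_cuboids(rows, dims=("A", "B", "C"), measure="M")
print(dict(cube[("A",)]))  # {(1,): 8, (2,): 7}
```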

Sorted Run: I thought the paper’s definition was a bit confusing notationally (partially because the word “run” was never defined, and a lot of the definitions are hard to describe in words despite being very simple once understood), but basically it’s the best ordering of the parent relation that is consistent with the child’s. For example, computing AB from ABC is better than computing AC from ABC. When computing AB from sorted ABC, we essentially need only two buffers: one to read a tuple from ABC, and one to hold the current entry of AB being written (assuming the aggregate is mergeable, like SUM or COUNT). For AC, however, we have to keep around all the entries of aC, where a is a specific value in column A (or reload them from disk), because we cannot be sure that a given value of AC will not show up again later in the scan. The authors define the size of a partition to be the size of the cuboid being computed divided by the size of the cuboid on the matching prefix (which is essentially the number of unique values in the columns outside the prefix). In the example of computing AC from ABC, the value would be Size of {AC} / Size of {A}.
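
A back-of-the-envelope version of that estimate (my own sketch; the cuboid sizes below are made-up numbers):

```python
def partition_size(child_size, prefix_size):
    """Estimated size of one partition when computing a child cuboid
    whose sorted run matches the parent on a given prefix:
    |child cuboid| / |prefix cuboid|."""
    return child_size / prefix_size

# Computing AC from ABC: the matching prefix is A.
# Say |AC| = 10_000 entries and A has 100 distinct values (|A| = 100);
# then one partition (all aC entries for a single a-value) holds ~100
# entries, and that is all we need to keep in memory at once.
print(partition_size(10_000, 100))  # 100.0
```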

Algorithms

Basically, scan the parent and fold each tuple into the current entry of the child. The nuances arise when things don’t fit into memory.
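
A sketch of that scan for the harder case above, computing AC from ABC sorted in (A, B, C) order (my own reconstruction, not the paper’s pseudocode): one partition, i.e. all aC entries for the current value a of A, is buffered and flushed as soon as A changes, so memory usage is one partition rather than the whole child cuboid.

```python
from collections import defaultdict

def compute_ac(abc_sorted):
    """Compute the AC cuboid from ABC tuples sorted in (A, B, C) order.
    abc_sorted yields (a, b, c, measure) tuples; yields ((a, c), sum).
    Only one partition (one a-value's aC entries) is held in memory."""
    partition = defaultdict(int)
    current_a = object()  # sentinel: matches no real A value
    for a, b, c, m in abc_sorted:
        if a != current_a and partition:
            # A changed: this partition is complete, flush it (e.g. to disk).
            # Flushing each partition sorted keeps the output in (A, C) order.
            yield from sorted(partition.items())
            partition.clear()
        current_a = a
        partition[(a, c)] += m  # SUM; any mergeable aggregate works
    if partition:
        yield from sorted(partition.items())

abc = [(1, 1, 1, 5), (1, 1, 2, 3), (1, 2, 1, 4), (2, 1, 1, 7)]
print(list(compute_ac(abc)))
# [((1, 1), 9), ((1, 2), 3), ((2, 1), 7)]
```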

Constraints:

  • A cuboid can be considered for computation only if its parent is the root of the subtree (i.e., it has already been materialized) or the parent is in the Partition state (basically, when a partition of the parent fits into memory).
  • The total memory allocated to the cuboids being computed together must not exceed the available memory (see the feasibility check sketched after this list).
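
The second constraint amounts to a simple feasibility check over whichever cuboids are computed together. A minimal sketch (my own; the memory_needed mapping and its numbers are hypothetical, standing in for the allocation each cuboid’s state requires, e.g. a partition’s worth for the Partition state):

```python
def feasible(chosen, memory_needed, available):
    """Check that the cuboids chosen for simultaneous computation fit
    within the available memory budget (units here: pages, made up)."""
    return sum(memory_needed[c] for c in chosen) <= available

memory_needed = {"AB": 50, "AC": 100, "BC": 400}  # hypothetical page counts
print(feasible(["AB", "AC"], memory_needed, available=200))        # True
print(feasible(["AB", "AC", "BC"], memory_needed, available=200))  # False
```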

The authors show that the problem is NP-hard and move on to provide a slew of heuristics. They traverse the tree (formed by the hierarchy) in breadth-first order (see the sketch after this list):

  • Consider cuboids to the left first (since they align better with the parent, by the construction of the tree).
  • Consider higher-level cuboids first, since they are larger.
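
A rough sketch of that traversal order (my own reconstruction; the tree layout is an assumption, with each node’s children ordered left to right so that the leftmost child shares the longest prefix with its parent):

```python
from collections import deque

def plan_order(tree, root):
    """Visit the cuboid tree breadth-first, so higher (larger) levels are
    considered before lower ones and left children before right ones."""
    order, queue = [], deque([root])
    while queue:
        cuboid = queue.popleft()
        order.append(cuboid)  # memory allocation / state choice happens here
        queue.extend(tree.get(cuboid, []))  # left-to-right child order
    return order

# A small cuboid tree rooted at ABC (children ordered by shared prefix).
tree = {"ABC": ["AB", "AC", "BC"], "AB": ["A", "B"], "AC": ["C"]}
print(plan_order(tree, "ABC"))
# ['ABC', 'AB', 'AC', 'BC', 'A', 'B', 'C']
```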

Example