Summary

Written in 1999, following the authors’ work on online aggregation as a new mode of human-data interaction, the paper proposes a new family of join algorithms that optimize for the time until an acceptable estimate of the query result is available. It also gives the user more control over the trade-off between update frequency and how quickly the confidence interval shrinks; it is the first join algorithm that exposes confidence-interval information during query evaluation. The authors found that one can often save a lot of time by giving up some precision. “Ripple joins are designed to avoid complete relation scans and maximize the flow of statistical information during join processing.”

The core solution idea is simple: stream random samples from both relations and join them as they arrive. The sampling rates can be controlled. The key insight that empowers this random sampling is adapting the rates dynamically to the data’s statistical properties.

Algorithm

Basic algorithm

  • Two-table simple ripple join: “one previously-unseen random tuple is retrieved from each of R and S at each sampling step; these new tuples are joined with the previously-seen tuples and with each other”. Below is pseudocode for this simple case (note that it could be implemented with the iterator model, which most DBMSs use):

      for (max = 1 to infinity) {
        // join the new S tuple with the previously-seen R tuples
        for (i = 1 to max-1)
          if (predicate(R[i], S[max])) output(R[i], S[max]);
        // join the new R tuple with all S tuples seen so far, including S[max]
        for (i = 1 to max)
          if (predicate(R[max], S[i])) output(R[max], S[i]);
      }

  • Sometimes it makes sense to sample one relation faster than the other to compensate for statistical differences (in general, the more variable relation needs more samples), so the sampling ratio (the “aspect ratio”) of the two relations need not be 1:1; see the sketch after this list. When the sampling rate for one relation equals its total size, this reduces to a streamed nested-loops join.
  • Ripple join can use existing indices as well; in that case there is no sampling of the indexed relation, and the step behaves like the traditional index join algorithm.
  • Ripple join can also build its own hash index over the tuples streamed in so far (the hash ripple join), avoiding rescans of previously-seen tuples.
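
Below is a minimal sketch of a block-style ripple join with a configurable aspect ratio, written in Python over in-memory lists. The function and parameter names (ripple_join, beta_r, beta_s, predicate) are illustrative assumptions, not the paper’s; the real algorithm works against a DBMS iterator interface and retrieves random tuples from disk.

    # Sketch of a block ripple join: each sampling step draws beta_r new tuples
    # from R and beta_s new tuples from S, joins the new tuples against everything
    # seen so far, and emits each qualifying pair exactly once.
    import random

    def ripple_join(R, S, predicate, beta_r=1, beta_s=1, seed=0):
        rng = random.Random(seed)
        R, S = R[:], S[:]
        rng.shuffle(R)   # simulate "previously-unseen random tuple" retrieval
        rng.shuffle(S)
        seen_r, seen_s = 0, 0
        while seen_r < len(R) or seen_s < len(S):
            new_r = R[seen_r:seen_r + beta_r]    # this step's new R tuples
            new_s = S[seen_s:seen_s + beta_s]    # this step's new S tuples
            # new R tuples join with every S tuple seen so far, including the new ones
            for r in new_r:
                for s in S[:seen_s + len(new_s)]:
                    if predicate(r, s):
                        yield (r, s)
            # new S tuples join only with previously-seen R tuples (avoids duplicates)
            for s in new_s:
                for r in R[:seen_r]:
                    if predicate(r, s):
                        yield (r, s)
            seen_r += len(new_r)
            seen_s += len(new_s)

    # toy usage: equi-join, sampling S twice as fast as R (aspect ratio 1:2)
    R = [(i, i % 5) for i in range(20)]
    S = [(j % 5, j) for j in range(30)]
    for r, s in ripple_join(R, S, lambda r, s: r[1] == s[0], beta_r=1, beta_s=2):
        pass  # each pair arrives incrementally; a running aggregate would be updated here

Setting the per-step count on one side to that relation’s full size recovers the streamed nested-loops behavior mentioned above.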

Stats!

So the whole point of doing a ripple join is to provide statistical guarantees in an online aggregation setting. Now we talk about how to compute some of these properties.

  • Estimators for SUM, COUNT, and AVG (note that these three are essentially the same as far as statistical properties are concerned): basically scale the running SUM or COUNT up by the inverse of the fraction of the full cross product seen so far (see the sketch after this list).
  • Confidence interval: basically use the central limit theorem, modeling the running estimate as approximately normally distributed, with the parameters needed for the CLT estimated from the sampled values. It’s still tricky, however, because quantities like AVG that are derived from two estimates (SUM and COUNT) involve estimators that are not independent. Details omitted here.
  • Given the previous setup, how do we actually find the aspect ratio? It is posed as an optimization problem in the paper. Intuitively, we want to minimize the confidence-interval length subject to (a fixed confidence level and) an upper bound determined by the animation-speed setting, the user input that specifies the trade-off between the time between successive updates of the running estimate and the amount by which the confidence-interval length decreases at each update.
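
For the SUM case, here is a minimal sketch of the running estimator and a crude CLT-style confidence interval, assuming simple random sampling. It treats the sampled pair values as i.i.d., which ignores the correlation between pairs that share a tuple, so it is only an approximation of the paper’s large-sample variance formulas; all names are illustrative.

    import math

    def running_sum_estimate(sampled_values, n_r, n_s, N_r, N_s, z=1.96):
        # sampled_values: expr(r, s) for sampled pairs that satisfy the predicate
        # n_r, n_s: tuples sampled so far from R and S; N_r, N_s: full relation sizes
        pairs_seen = n_r * n_s                        # sampled portion of the cross product
        total = sum(sampled_values)
        estimate = (N_r * N_s / pairs_seen) * total   # scale up by inverse sampling fraction
        # per-pair mean/variance over the sampled cross product (non-qualifying pairs count as 0)
        mean = total / pairs_seen
        var = sum(v * v for v in sampled_values) / pairs_seen - mean * mean
        half_width = z * N_r * N_s * math.sqrt(max(var, 0.0) / pairs_seen)
        return estimate, half_width

    # toy usage: 10 of 1,000 R tuples and 20 of 5,000 S tuples sampled so far,
    # with 7 sampled pairs satisfying the predicate at these expr values
    est, hw = running_sum_estimate([3.0, 1.5, 2.2, 4.1, 0.9, 2.8, 3.3], 10, 20, 1000, 5000)
    print(f"running SUM estimate: {est:.1f} +/- {hw:.1f}")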

Cool Ideas

  • “A key observation is that the time required to achieve a confidence interval of a specified “acceptable” length is a sub-linear (and sometimes constant!) function of the cardinality of the input relations”

Open questions

  • “Although the ripple join is symmetric, it is still not clear how a query optimizer should choose among ripple join variants, nor how it should order a sequence of ripple joins.” (I think this might still be open?)
  • “If base relations are horizontally partitioned across processing nodes, random retrieval of tuples from different nodes can be viewed as a stratified sampling scheme, and the confidence-interval formulas presented here can be adjusted accordingly.” – Very keen insight! I think BlinkDB speaks a bit to this.

Comments

  • Uses algorithmic (statistical) semantics to improve systems performance.
  • The paper was structured in a way that iteratively added complexity, preserving the clarity of the original insight.
  • The statistics seem to rest on pretty strong assumptions, and I could imagine the confidence intervals shrinking fairly slowly for k-way joins.
  • Question: I’m not sure why the slider for animation speed is so important. It seems that a fairly low update rate should suffice if there is a human observer?