Year: 2012 (PODS)

Authors: HQ Ngo, E Porat, C Re, A Rudra

Summary

Historically, databases have mostly used binary joins, where a multiway join is broken down into a tree of two-way joins (the plan structure analyzed in the System R optimizer). Then Atserias, Grohe, and Marx (AGM) proved a tight worst-case bound on the size of a join result (FOCS 2008), a bound that the binary-join plans in real databases obviously don't achieve. However, they didn't provide an algorithm that actually runs within the bound. That's where this paper comes in: the authors construct an algorithm whose running time is worst-case optimal for all natural join queries. There is a whole world of join algorithms out there!! There is AGM's join-project algorithm, then Leapfrog Triejoin, Minesweeper, and more recently Wander Join, etc. Interestingly, the authors also mention that adaptive join processing like Eddies is related, but the approach is fundamentally different: Eddies is a heuristic, while this paper gives rigorous worst-case guarantees.

It's a very dense paper, so I omit the more mathematical aspects (which I don't quite understand).

Previous Work by AGM

The Theory

We have a schema with three attributes, A, B, and C, three relations, R(A,B), S(B,C), and T(A,C), and the natural join query q = R ⋈ S ⋈ T. The question is how to bound the cardinality of the result (denoted q(I)) when the query is run over a database instance I containing these relations. When |R| = |S| = |T| = N, the trivial upper bound is N^3. One can tighten this by “noticing that the output of any pair-wise join will be a superset of q(I)”, so a tighter bound is N^2 (note that the relations share attributes, so it's not just a cross product). Using a “fractional cover” tightens the bound to N^1.5. The graph looks like the following:

join_hyper

A query is a hypergraph to cover. An edge cover of a hypergraph H is a set C of edges of H such that each vertex is contained in at least one edge in C. And a fractional edge cover of H is a feasible solution for the linear programming relaxation of the natural integer linear program describing edge covers.

To get a hypergraph H(Q) from the query Q, we take the vertex set to be the set of all attributes of Q and the edges to be the attribute sets of the relations R_i. So the previous three-way join becomes a triangle.

It's obvious (apparently…) that the size of Q(I) can be bounded by |I|^{\rho(Q)}, where \rho(Q) is the edge cover number, but AGM showed that it can actually be bounded by |I|^{\rho^*(Q)}, where \rho^*(Q) is the fractional edge cover number! This is exciting because it gives “previously unknown, nontrivial methods to estimate the cardinality of a query result” and could help with query processing.
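To make the fractional cover concrete, here is the edge cover LP for the triangle query written out (my own worked example, following the definitions above; the objective encodes the AGM bound ∏_e |R_e|^{x_e} in log form):

```latex
% Fractional edge cover LP for q = R(A,B) ⋈ S(B,C) ⋈ T(A,C):
% one variable per relation (hyperedge), one constraint per attribute (vertex).
\begin{align*}
\text{minimize}   \quad & x_R \log|R| + x_S \log|S| + x_T \log|T| \\
\text{subject to} \quad & x_R + x_T \ge 1 && \text{(attribute } A\text{)} \\
                        & x_R + x_S \ge 1 && \text{(attribute } B\text{)} \\
                        & x_S + x_T \ge 1 && \text{(attribute } C\text{)} \\
                        & x_R,\, x_S,\, x_T \ge 0
\end{align*}
% Any integral cover must pick two whole edges, which is exactly the N^2 bound.
% The relaxation admits x_R = x_S = x_T = 1/2, so with |R| = |S| = |T| = N the
% AGM bound gives |q(I)| <= |R|^{1/2} |S|^{1/2} |T|^{1/2} = N^{3/2}.
```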

The Algorithm

The original research also presented an algorithm that produces a join-project plan, which is based on binary join operators and unary projection operators. It would look something like the following:

project join

Personally, I don't have any intuition for why projection is not done as early as possible, since that seems to involve less data movement.

The complexity is polynomial in the product of (1) the cardinality of the largest input relation, (2) the bound on the size of the join result, and (3) the square of the query size. However, the algorithm is not worst-case optimal (one example was described in the previous section, where pair-wise joining yields an N^2-sized intermediate result). So this begs the question: can we do better in the worst case?
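To see how far pair-wise plans can be from the bound, here is a small self-contained illustration (mine, not from the paper): the classic hard instance for the triangle query, where every pair-wise join materializes about N^2 tuples even though the final output has only about 3N tuples, well under the N^1.5 bound.

```python
# My own illustration of why pair-wise join plans are suboptimal on the
# triangle query q = R(A,B) ⋈ S(B,C) ⋈ T(A,C): a "star"-shaped instance.
m = 1000

R = {("a0", f"b{j}") for j in range(m)} | {(f"a{i}", "b0") for i in range(m)}
S = {("b0", f"c{j}") for j in range(m)} | {(f"b{i}", "c0") for i in range(m)}
T = {("a0", f"c{j}") for j in range(m)} | {(f"a{i}", "c0") for i in range(m)}

# Any pair-wise join blows up: every (a_i, b0) in R matches every (b0, c_j) in S.
RS = {(a, b, c) for (a, b) in R for (b2, c) in S if b == b2}

# The full triangle output is tiny by comparison: filter RS against T.
Q = {(a, b, c) for (a, b, c) in RS if (a, c) in T}

print(len(R), len(S), len(T))  # ~2m tuples each
print(len(RS))                 # ~m^2 intermediate tuples
print(len(Q))                  # ~3m output tuples
```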

Another Algorithm

For the actual algorithm, I think the nodes are data values (not attributes, so this is a different graph!) and the join query becomes finding sets of nodes that match a pattern. For example, the three-way join becomes finding triples of nodes that form R-S-T triangles.

Partition the values of the join key on each side of the join into values that are heavy (high fanout, where joining on all of them could violate the size bound) and values that are light (the opposite). This kind of skew handling is actually used in production at companies like Teradata. Below is the algorithm (a small code sketch follows the two cases); the math that makes it work is pretty cool (and simple):

  • If v ∈ H (the heavy group), check whether each edge e ∈ E forms a triangle with v

worst_formal

  • If v ∉ H (the light group), for each pair of edges incident to v, check whether the edge that would complete the triangle is present

worst_formal
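Here is a small sketch of that heavy/light split, written as plain triangle listing over a graph (my own code; the paper works with relations and indexes, so treat this only as the idea, and the function name triangles is mine):

```python
from collections import defaultdict
from math import sqrt

def triangles(edges):
    """List triangles using the heavy/light split: a vertex is heavy if its
    degree exceeds sqrt(m). Heavy vertices scan all edges; light vertices try
    all pairs of their (few) neighbours. Total work is O(m^{3/2})."""
    edge_set = {frozenset(e) for e in edges}
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    m = len(edge_set)
    heavy = {v for v in adj if len(adj[v]) > sqrt(m)}   # at most 2*sqrt(m) of these

    found = set()
    for v in adj:
        if v in heavy:
            # Heavy case: check whether each edge {u, w} forms a triangle with v.
            for e in edge_set:
                u, w = tuple(e)
                if u in adj[v] and w in adj[v]:
                    found.add(frozenset((v, u, w)))
        else:
            # Light case: v has few neighbours, so trying every pair is cheap.
            nbrs = list(adj[v])
            for i in range(len(nbrs)):
                for j in range(i + 1, len(nbrs)):
                    if frozenset((nbrs[i], nbrs[j])) in edge_set:
                        found.add(frozenset((v, nbrs[i], nbrs[j])))
    return found

print(triangles({(1, 2), (2, 3), (1, 3), (3, 4)}))  # -> {frozenset({1, 2, 3})}
```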

Problem Formalization

There is a lot of notation, so here's a screenshot:

worst_formal

Solution

The authors emphasize that this problem has been implicitly studied by the database community, but their contribution is a connection with a geometric inequality (which I believe is the Bollobás-Thomason inequality, a generalization of Loomis-Whitney). Intuitively, for this paper they find a way to figure out which values are heavy and which are light (as in the previous solution). I found this on a slide, and I don't think it's stated this way in the original paper. The steps taken are (a rough code sketch follows the list):

  • Build a query plan tree
  • Build a family of indexes
  • Compute the join recursively in a nested-loop-like fashion.
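To get a feel for what "compute the join recursively in a nested-loop-like fashion" means, here is a drastically simplified sketch (mine; it ignores the plan tree and the index family, follows the later "generic join" style rather than the paper's pseudocode, and the name generic_join is my own):

```python
def generic_join(attrs, relations):
    """Attribute-at-a-time recursive join sketch.
    attrs: global attribute order, e.g. ["A", "B", "C"].
    relations: list of (schema, rows); schema is a set of attribute names and
    rows is a list of dicts mapping attribute -> value."""
    if not attrs:
        return [dict()]            # one empty binding: every attribute is assigned

    a, rest = attrs[0], attrs[1:]

    # Candidate values for a: intersect its value sets over every relation that
    # mentions a (a real engine backs this intersection with sorted indexes).
    candidates = None
    for schema, rows in relations:
        if a in schema:
            vals = {row[a] for row in rows}
            candidates = vals if candidates is None else candidates & vals

    results = []
    for v in sorted(candidates):
        # Restrict each relevant relation to rows with a == v, then recurse.
        narrowed = [(schema, [row for row in rows if a not in schema or row[a] == v])
                    for schema, rows in relations]
        for binding in generic_join(rest, narrowed):
            results.append({a: v, **binding})
    return results

# The triangle query on a tiny instance:
R = [{"A": a, "B": b} for a, b in [(1, 1), (1, 2), (2, 1)]]
S = [{"B": b, "C": c} for b, c in [(1, 3), (2, 3)]]
T = [{"A": a, "C": c} for a, c in [(1, 3), (2, 3)]]
print(generic_join(["A", "B", "C"],
                   [({"A", "B"}, R), ({"B", "C"}, S), ({"A", "C"}, T)]))
```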

Aside: Leapfrog Triejoin

I was trying to gain some intuition for what a non-binary join would look like (because the algorithm in this paper was too complex…), and the LogicBlox researchers have this neat algorithm, illustrated below (for arity = 1):

leap
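For arity 1, the whole algorithm is just an intersection of sorted lists where whichever iterator is behind "leaps" to the largest key seen so far. A minimal sketch (my own code, using binary search where the real implementation uses trie iterators with a seek operation):

```python
from bisect import bisect_left

def leapfrog_intersect(lists):
    """Intersect k sorted, duplicate-free lists by leapfrogging (arity-1 case)."""
    k = len(lists)
    if k == 0 or any(not lst for lst in lists):
        return []
    pos = [0] * k                          # one cursor per list
    out = []
    target = max(lst[0] for lst in lists)  # largest key any cursor points at
    i = 0                                  # whose turn it is to catch up
    agree = 0                              # how many cursors currently sit on target
    while True:
        lst = lists[i]
        pos[i] = bisect_left(lst, target, pos[i])   # leap to the first key >= target
        if pos[i] == len(lst):
            return out                     # one list is exhausted: no more matches
        key = lst[pos[i]]
        if key == target:
            agree += 1
            if agree == k:                 # all k lists contain target: emit it
                out.append(target)
                pos[i] += 1
                if pos[i] == len(lst):
                    return out
                target = lst[pos[i]]
                agree = 1
        else:
            target = key                   # overshot: this key is the new target
            agree = 1
        i = (i + 1) % k                    # pass the baton to the next iterator

# The three lists below share only the key 8:
print(leapfrog_intersect([[0, 1, 3, 4, 5, 6, 7, 8, 9, 11],
                          [0, 2, 6, 7, 8, 9],
                          [2, 4, 5, 8, 10]]))  # -> [8]
```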

For relations with arity > 1, the tuples can be represented as a trie. This is useful when the joins are on different columns. Below is an example where the result is (7, 4, 5).

leap

Misc

Prof Re’s group has done more work in going “beyond” the worst case and “joining” theory with practice (ugh puns) because the worst case often doesn’t happen!