Note: these questions were extracted from, or inspired by, Berkeley Prelim Questions.
Q: What are the goals of the traditional System R relational query optimizer?
To minimize the total I/O and CPU cost (weighted against each other by a parameter) for a complete query.
Q: What is the cost model used to evaluate alternative query plans, and what are the implications of that cost model?
The overall cost is the sum of the costs of the individual scans of the base and intermediate relations. Scan cost is estimated using selectivity, computed from the predicate's range relative to the total column range (using index and catalog statistics). Each join method has its own cost function.
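To make the idea concrete, here is a minimal sketch of this style of estimation. The function names and the constant 0.01 are illustrative, not the paper's exact formulas; the key assumptions are uniformly distributed column values and a single parameter weighing CPU against I/O.

```python
def range_selectivity(low, high, col_min, col_max):
    """Estimate the selectivity of `low <= col <= high`, assuming values
    are uniformly distributed over [col_min, col_max] (per the catalog)."""
    if col_max == col_min:
        return 1.0
    frac = (min(high, col_max) - max(low, col_min)) / (col_max - col_min)
    return max(0.0, min(1.0, frac))  # clamp to a valid probability

def scan_cost(pages, tuples, cpu_weight=0.01):
    """Cost = page I/Os + CPU work, weighed against I/O by `cpu_weight`."""
    return pages + cpu_weight * tuples
```

An implication of this model: a bad selectivity estimate on one scan propagates into every join cost built on top of it, since the output cardinality of each operator feeds the cost of the next.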
Q: What if our goal is to reduce the latency to getting the first tuple of an answer?
The first technique to revisit is predicate pushdown, since we want the join to start producing output as early as possible. One extreme is to send every tuple into the join as soon as it passes a predicate, but the risk is that the earliest-arriving tuples produce no join matches. A reasonable compromise is to apply the most selective predicate first (consistent with predicate pushdown) and stream its survivors into the join. A naive algorithm for the join itself is a pipelined (streaming) nested loop join.
Note: this relates to the techniques used in ripple join; ripple join generalizes the online nested loop join so that both the "outer" and "inner" inputs can be streamed.
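A toy sketch of the ripple idea, under the simplifying assumption that both inputs fit in memory: alternately draw one tuple from each input and probe it against everything seen so far from the other side, so matches are emitted before either input is fully consumed. (Real ripple joins also manage memory and the rate at which each input is drawn; that is omitted here.)

```python
def ripple_join(a_stream, b_stream, pred):
    """Yield matching pairs incrementally from two streamed inputs."""
    done = object()  # sentinel, so the inputs may legally contain None
    seen_a, seen_b = [], []
    a_it, b_it = iter(a_stream), iter(b_stream)
    while True:
        a = next(a_it, done)
        if a is not done:
            seen_a.append(a)
            for b in seen_b:          # probe new a-tuple against all b seen
                if pred(a, b):
                    yield (a, b)
        b = next(b_it, done)
        if b is not done:
            seen_b.append(b)
            for a2 in seen_a:         # probe new b-tuple against all a seen
                if pred(a2, b):
                    yield (a2, b)
        if a is done and b is done:
            break
```

Because it is a generator, the first answer is available as soon as the first match is found, long before the full cross product has been explored.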
Q: Finally we discussed how to improve time to first answer in a parallel system, where data is partitioned across multiple nodes.
I think for a parallel system where a table is partitioned, the problem is almost like adding dimensions to the join (beyond the two-dimensional inner and outer loops). If the join columns are already sharded on the join key, time to first match improves (should a match exist), since each node only probes its local partition, which effectively reduces the search space. Alternatively, we could perform the joins in parallel across nodes; the more joins run in parallel, the faster the time to first answer.
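A sketch of the co-partitioned case, with threads standing in for nodes (all names are illustrative): because both tables are hash-partitioned on the join key, each "node" can join its local partitions independently, and whichever node finds a match first can surface it.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(rows, key, n_parts):
    """Hash-partition rows on the join key across n_parts 'nodes'."""
    parts = [[] for _ in range(n_parts)]
    for r in rows:
        parts[hash(key(r)) % n_parts].append(r)
    return parts

def local_hash_join(r_part, s_part, r_key, s_key):
    """Join two co-partitioned fragments; no cross-node data needed."""
    table = {}
    for r in r_part:
        table.setdefault(r_key(r), []).append(r)
    return [(r, s) for s in s_part for r in table.get(s_key(s), [])]

def parallel_join(r_rows, s_rows, r_key, s_key, n_nodes=2):
    r_parts = partition(r_rows, r_key, n_nodes)
    s_parts = partition(s_rows, s_key, n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        results = pool.map(lambda p: local_hash_join(p[0], p[1], r_key, s_key),
                           zip(r_parts, s_parts))
    return [pair for part in results for pair in part]
```

This collects all results for simplicity; a time-to-first-answer variant would stream each node's matches out as they are found rather than waiting for `map` to finish.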
Q: Describe the notion of “interesting orders” in the context of cost-based query optimization.
In the initial simple formulation, an order is "interesting" if the ordering column appears in the query's ORDER BY or GROUP BY clause.
Q: In the absence of sort-based join algorithms, are there other uses for this concept? E.g. in the context of hash-based algorithms?
I think GROUP BY could also reuse a hash-partitioned intermediate result, so a partitioning on the grouping columns should be considered interesting. Depending on the hash function, it might even inform sorting; e.g. if the hash partitions the value range into contiguous buckets, each bucket can be sorted independently, speeding up a later sort.
Q: Is there a more general notion at work in this setting?
I think it’s basically accounting for potential gains if memoization is used.
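Memoization is in fact the backbone of the optimizer's search: a simplified Selinger-style dynamic program keeps the cheapest plan for every subset of relations, and tracking interesting orders just means widening the memo key to (subset, physical property). A sketch under a deliberately toy cost model (the cost functions below are placeholders, not System R's):

```python
from itertools import combinations

def best_plans(relations, scan_cost, join_cost):
    """Memoized bottom-up search over left-deep join orders."""
    memo = {}  # frozenset of relations -> (cost, plan)
    for r in relations:
        memo[frozenset([r])] = (scan_cost(r), r)
    for size in range(2, len(relations) + 1):
        for subset in combinations(relations, size):
            s = frozenset(subset)
            best = None
            for r in subset:                    # r joins last
                sub_cost, sub_plan = memo[s - {r}]
                cost = sub_cost + join_cost(sub_plan, r)
                if best is None or cost < best[0]:
                    best = (cost, (sub_plan, r))
            memo[s] = best
    return memo[frozenset(relations)]
```

The memo table is what keeps the search polynomial in the number of subsets rather than factorial in the number of join orders.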
Before this paper, most query optimization was based on very simple heuristics. This paper defines a cost function and a search space. Despite its many strong assumptions, it provides a much better approximate solution to an inexact/intractable problem.
It's amazing how well the simple model generalizes. Even for complex queries (i.e. with subqueries), the cost can be broken down: uncorrelated subqueries can be computed once, so their cost is additive, while correlated subqueries may make the cost multiplicative, though the paper also mentions neat tricks for reusing past results.
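The additive-versus-multiplicative distinction reduces to two small formulas (the function names here are mine, for illustration):

```python
def uncorrelated_subquery_cost(outer_cost, sub_cost):
    """The subquery is independent of the outer row: evaluate it once,
    cache the result, and its cost is simply additive."""
    return outer_cost + sub_cost

def correlated_subquery_cost(outer_cost, outer_tuples, sub_cost):
    """The subquery references the outer row: naively it is re-evaluated
    once per outer tuple, making its cost multiplicative."""
    return outer_cost + outer_tuples * sub_cost
```

The "neat tricks" in the paper amount to moving cases from the second formula toward the first, e.g. by noticing that the correlated value repeats and reusing the previous result.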
It turns out that this is a very common pattern in computer science research: the way to build a system that solves a hard problem is first to define a cost function, then a solution space, and then a way to search that solution space.