Authors: S Agarwal, B Mozafari, A Panda, H Milner, S Madden, I Stoica
BlinkDB is a massively parallel, approximate query engine for interactive SQL queries on large data that trades accuracy for query time. The authors report answering queries up to 200x faster than Hive with errors under 10%. Hive is not the fastest baseline, but it is the fair comparison since BlinkDB is implemented on top of it.
The BlinkDB system exposes `ERROR WITHIN` and `WITHIN <time>` specifications as part of the query.
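For example, an error-bounded and a time-bounded query might look like the following (adapted from the paper's examples; the `Sessions` table and its columns are illustrative):

```sql
-- error-bounded: return an estimate within 10% at 95% confidence
SELECT COUNT(*)
FROM Sessions
WHERE Genre = 'western'
GROUP BY OS
ERROR WITHIN 10% AT CONFIDENCE 95%;

-- time-bounded: return the best estimate computable in 5 seconds
SELECT COUNT(*)
FROM Sessions
WHERE Genre = 'western'
GROUP BY OS
WITHIN 5 SECONDS;
```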
Of course, the idea of sampling for approximate queries has been around forever. BlinkDB is different in that it exploits the pattern that most queries' predicate clauses use recurring combinations of columns that are stable over time (they call these "query column sets", or QCSes). Prior research has mostly focused on workloads where the exact queries or predicates are predictable. Note that this is both an insight (verified over Facebook and Conviva data) and an assumption (that it generalizes to most other OLAP workloads).
The image below shows the standard 80/20 behavior: a small number of QCSes covers the vast majority of queries.
Leveraging the QCS, BlinkDB creates samples in a stratified manner (to account for rare subgroups) so that queries about any subgroup can be answered. OLAP queries without a predicate usually work fine with uniform sampling, but when there is a predicate, the less common subgroups might not be represented at all. Sample creation is formulated as an optimization problem: given the QCSes (from past queries), their historical frequencies, and a storage limit, choose the samples that maximize the expected match with future queries (details in the following section).
Below is an illustration of a stratified sample where the most popular groups are truncated at a cap K on the frequency of every group x (a distinct tuple of values of the QCS columns):
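The cap-based stratification can be sketched as follows (a minimal, hedged illustration, not BlinkDB's actual implementation; function and column names are mine):

```python
import random
from collections import defaultdict

def stratified_sample(rows, qcs, cap_k, seed=0):
    """Cap-based stratified sample: for each distinct value x of the QCS
    columns, keep at most cap_k rows (all rows if the group is smaller).
    Rare groups survive intact; popular groups are truncated at the cap."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in qcs)
        groups[key].append(row)
    sample = []
    for members in groups.values():
        if len(members) <= cap_k:
            sample.extend(members)            # rare group: keep everything
        else:
            sample.extend(rng.sample(members, cap_k))  # popular: truncate
    return sample
```

A rare group with 3 rows keeps all 3, while a group with 100 rows is cut down to K, which is exactly why rare-subgroup queries remain answerable.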
They create an error-latency profile (ELP) to help choose the right samples when answering a query. I don't think BlinkDB deals with joins.
Below is their system architecture. Note that the ELP is part of the runtime in sample selection:
The Sample Creation Formalization
The paper is a great example of formalizing different heuristics into one optimization function (in this case a mixed integer linear program). The formulation looks like the following:
Here the $p$ terms are the historical frequency distribution over QCSes, and $y$ accounts for the partial coverage a stored sample provides for a specific query's QCS.
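As a rough sketch of the same idea (not the paper's exact MILP, and ignoring partial coverage), sample selection can be viewed as picking QCSes under a storage budget to maximize the total frequency of covered query column sets; all names here are illustrative:

```python
from itertools import combinations

def choose_samples(qcs_freq, sample_cost, budget):
    """Pick a set of QCSes to build stratified samples on, maximizing the
    total frequency of fully covered query column sets under a storage
    budget. A query's QCS is covered if it is a subset of some chosen QCS.
    Brute force over all subsets -- workable only for a handful of QCSes."""
    candidates = list(qcs_freq)
    best, best_score = (), 0.0
    for r in range(len(candidates) + 1):
        for chosen in combinations(candidates, r):
            if sum(sample_cost[q] for q in chosen) > budget:
                continue  # over the storage limit
            score = sum(p for q, p in qcs_freq.items()
                        if any(set(q) <= set(c) for c in chosen))
            if score > best_score:
                best, best_score = chosen, score
    return best, best_score
```

For instance, with a budget of 2 a sample on `('os', 'country')` can dominate two cheaper samples because it also answers plain `('os',)` queries. The real system solves this at scale with a MILP solver rather than enumeration.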
The right sample size to read at query time is determined by the tracked ELP and is fairly simple to pick (not an optimization problem).
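A hedged sketch of that lookup (assuming the ELP is a table mapping sample size to profiled error and latency; the data layout is my assumption, not the paper's):

```python
def pick_sample_size(elp, max_error=None, max_latency=None):
    """elp: list of (n_rows, expected_error, expected_latency) tuples,
    one per available sample size, sorted by n_rows ascending.
    Exactly one of max_error / max_latency is expected to be set.
    Returns the chosen n_rows, or None if no size satisfies the bound."""
    if max_error is not None:
        # ERROR WITHIN: smallest sample whose profiled error meets the bound
        for n, err, _ in elp:
            if err <= max_error:
                return n
        return None
    # WITHIN <time>: largest sample whose profiled latency fits the bound
    best = None
    for n, _, lat in elp:
        if lat <= max_latency:
            best = n
    return best
```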
There may need to be query rewrites to accommodate partial coverage of a QCS.
Because the subgroups are sampled at different rates, BlinkDB adjusts for this bias. For example, when calculating a SUM, one needs to account for the fraction of data points sampled in each group.
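Concretely, the correction scales each sampled value by the inverse of its group's sampling rate (a standard Horvitz-Thompson-style estimator; this minimal sketch and its field names are illustrative):

```python
def weighted_sum(sampled_rows, sampling_rate):
    """Estimate SUM over a stratified sample: each sampled row stands in
    for 1/rate rows of its stratum, so its value is scaled up by the
    inverse of that stratum's sampling rate."""
    return sum(row['value'] / sampling_rate[row['group']]
               for row in sampled_rows)
```

If group 'a' was sampled at 50% and group 'b' fully kept, each row from 'a' counts double while rows from 'b' count once, recovering an unbiased estimate of the true total.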
Some Issues With Sampling in General
- If there is some bias in the samples, then that bias will show up over and over again
- Need to do a single pass to create samples
- People don’t do well with error bounds