Author: G. Graefe
This lengthy (~100 p) survey contains basic algorithms for data access methods (hashing and sorting, and their duality), disk access, aggregation algorithm, binary join algorithms, parallel query execution, nonstandard query processing (like nested relations, timeseries, and object oriented) and some miscellaneous item. Note that maintaining update ACID properties is not in the scope of this survey — it focuses on access. Most of the discussion is based on the Volcano system.
Here I will attempt to point out some interesting ideas that’s not already included in textbook materials, and skim some of the engineering best practices details.
Synchronization and data transfer between operators (note that exchange was written three years earlier): its difficult to schedule on a global level, and the most practical alternative is to have them schedule each other within a single OS process : iterate over all records (or other granularity). This is a cheap procedure call and no interprocess communication.
Below are some examples of iterator functions:
Highlights on a few nice things we get:
- items never wait in a temporary file or buffer between operators
- iterators can schedule any tree
- generalizes nicely for parallel processing (volcano paper)
Merging: > Instead of always merging runs of only one level together, the optimal strategy is to merge as many runs as possible using the smallest run files available.
A big thing people could tune is cluster size, which is the amount of buffer memory dedicated to input and output of the merge.
- larger cluster size reduce the fan-in (may increase number of merge levels) but reduce latency for each level due to reduced disk seek.
Illustrations from these slides:
Partitioned Hash Join
Hybrid Hash Join, taken from the cow book: > The idea behind hybrid hash join is to build an in-memory hash table for the first partition of R during the partitioning phase, which means that we don’t write this partition to disk. Similarly, while partitioning S, rather than write out the tuples in the first partition of S, we can directly probe the in-memory table for the first R partition and write out the results. At the end of the partitioning phase, we have completed the join of the first partitions of R and S, in addition to partitioning the two relations; in the probing phase, we join the remaining partitions as in hash join.
- File scans: issues with mixing range scans and random access, solution is to choose middle-of-the-road page size.
- Associative access: build indices! Should separate index scan from record lookup because:
- sometimes just need the index info
- can combine results from multiple indices: values directly or using RID list (or bit maps)
- Buffer: cache! Research include algorithms such as marginal gain, saving reference patterns to predict future. The interface exposed is not read and write but fixing and unfixing (or pin).
Broadly two types:
- Scalar aggregates: one result e.g. sum of all
- Aggregate functions: multiple results e.g. group by. Duplicate removal treated the same as aggregation (grouping)
Algorithms to be used:
- nested loop: simple and most flexible (can do more exotic aggregation), but inefficient
- sorting: easy for duplicate removal
- hashing: could hash on the grouping attributes
Cool graph to compare:
Example query: “find students who have taken all database courses”. The relational algebra would need a universal quantifier with a division operator (good wiki example). It’s pretty difficult to write in SQL.
Its been ignored in the past:
- not usually needed
- could write a different query (cartesian product)
- hard to implement
Duality of Sort & Hash
- Both fit into divide-and-conquer appraoch
- Disk access: sequential scan to later random I/O
- Interesting order is also useful for both (e.g. aggregation followed by a join)
Two cases when one is better than the other:
- when two tables have different size: hybrid hash is better than merge-join
- if hash function is bad (skew) merge join is better (in general better for non-uniform access).
Everything else? It depends.
Parallel Query Execution
Bushy plans make things more complicated:
- stop point: a point at which all data are in temp files on disk and no intermediate results are in memory. Used to switch between subplans. An example is given where there are two merge-sorts (on two branches), if the first merge-sort is at final round, the other merge sort should go ahead first (it’s a good point to switch).
- binary operator (join) need to know which subplan is initiated first
- allocation of resources (memory)
Parallel vs Distributed? Distributed means “locally autonomous” (e.g. R*). And if the cooperation between the systems is limited its called federated. Parallel systems only has one DBMS that executes the fragments of queries in parallel (e.g. Gamma, Volcano). Distributed systems may employ parallelism and parallel systems can be based on some kind of shared-memory or shared-nothing, so there are mixing of ideas and designs.
Forms of parallelism
(also discussed in Volcano paper):
- Interquery: concurrent execution of transactions; resource contention
- Interoperator: puplining of different operators in a single query
- vertical: producer and consumer pipeline
- horizontal: bushsy e.g. let merge join receive input from two sort process. Hard to time.
- Intraoperator: data partitioning
Load Balancing and Skew
For optimal speedup and scaleup, the processing load need to be assigned to processors and disks to ensure equal completion time on all pieces. Skew is a big problem, and here are some techniques:
- Skew avoidance: sample the data first or keep histograms (but sampling would halt the pipeline)
- Skew resolution: repartition.
They could happen if all the following conditions are met:
- multiple consumers feed multiple consumers
- each producers produce sorted stream
- hash partitioning is used
- flow control and bad data distribution
This is because the producer needs to ship data in a certain order and the consumer needs to receive data in a certain order and there is conflict on the ordering of data passing the processing boundary. Illustrated here:
Which could be resolved by moving a sort operation into the consumer process.
Here is an example that should shed light on the solution idea: > Consider 10 partitions with key values between O to 99 sorted on site O, between 100 and 199 sorted on site 1, etc. First, each partition is sorted locally at its original site, without data exchange, on the last two digits only, ignoring the first digit. Thus, each site has a sequence such as 200, 301, 401, 902, 2, 603, 804, 605, 105, 705,…,999, 399. Now each site sends data to its correct final destination. Notice that each site sends data simultaneously to all other sites, creating a balanced data flow among all producers. While this method seems elegant, its problem is that it requires fairly detailed distribution information to ensure the desired balanced data flow.
In general if one doesn’t use distributed sort-merge it’s fine.