Authors: D DeWitt, J Gray
Parallelism Goals and Metrics
There are two types of parallelism: speedup (faster execution of the same job) and scaleup (solves bigger problems in same amount of time). The barriers to linear scaling and speedup are three-fold:
- startup time
- interference: shared resource
- skew: high variance of service jobs (when variance dominates the mean, almost no improvement)
The line of reasoning here is very remarkable (visionary engineers!): the dream is to have an infinitely fast processor with an infinite memory with infinite bandwidth. While this is not possible, we could try to build an infinitely fast processor with infinitely many processors (with finite speed) and same with memory and bandwidth. The issue is that scaling is not linear.
There are a few basic designs for organizing resources to simulate this “as one machine” idea:
- Shared-memory: all processors share direct access to a common global memory and disk
- Shared-disk: each processor has private memory but direct access to all disks. How does it actually work? The processor wanting to update some data must first obtain the current copy of data, declare intention to update the data, then proceed after being acknowledged by other processors, and then write the shared data to disk and creates a lot of network traffic.
- Shared-nothing: each memory and disk owned by the processor. Need to minimize resource sharing to better scale up.
The image illustrates:
Consensus on parallel and distributed database systems architecture: shared-nothing, and machine communicate over the network by sending messages. Tuples in relations are partitioned across storage devices, which allows multiple processors to scan large relation in parallel. Its a consensus because systems like Teradata, Gamma has this architecture. Though these systems have very different interconnection networks.
The reason is that this design minimizes interference and scales better. Industry has been late on the uptake mostly because high quality commodity hardware was recently introduced to market.
Parallel Dataflow Approach
Here is an image that summarizes all discussed above
Parallelism is an unanticipated benefit of the relational model.
Amazing how things just work out I guess! Or perhaps fundamentally knowledge and facts are distributable.
Dataflow has two types of parallelism:
- Pipeline: The authors sounded a bit pessimistic on this approach and thinks that its limited due to (1) pipelines are short for relational queries (2) some operators do not emit the first output tuple until they have consumeed all their inputs. Now we have innovations like Eddies!
- Partition: easier to use divide and conquer. Below illustrates different partitioning methods
Hash partitioning is ideal for associative access, so that it avoids the overhead of starting queries on multiple disks.
Harnessing locality and storing them together is also a common technique and could be captured with range partitions, which is also good for sequential and associative access.But it suffers more from potential data skew and execution skew.
Data partitioning naturally motivates the question of whether we could achieve paralleims within relational operators with merge functions.
Going beyond merge function, there could be specialized parallel relational operators, like sort-merge joins, hash-joins etc.
Here is an example dataflow program
(Back to the) Future Directions
Concurrency introduces significant locking overhead.
Scheduling is challenging: batch jobs tend to monopolize the processor and flood the memory cache.
Parallel query optimization: (1) no one back then have considered the distributed case (2) data skew that makes query plan estimate poor (Eddies again!)
How do perform large scale changes (e.g. reorganizing, or dumping large datasets) while making sure that queries could still be served in reasonable times.
Applications that are not well supported by the relational data model?
Comment: I think this paper was spot on about some fundamental trade-offs and trends, and pretty accurately predicted the major research trends in DBMS in the following decade.