Authors: R Miller, LM Haas, MA Hernandez
Schema mapping helps turn old schme (legacy database) to a new schema. The system proposed, CLio, assists schema mapping with an algorithm for query derivation. Traditional data integration task just integrating schemas into a unified representation, but CLio’s goal is to discover queries.
This appear to be related to Potter’s Wheels line of work.
What it is: mapping between source and target schemas. It’s a pair of (1) function that takes a set of values in source database and maps to target database (2) filter: which data values should be used.
This is optimized for the human interacting with the system, since it cannot be fully automated. For example, one cannot know for sure that Emp in one relation corresponds to Employee. Even if the user cannot specify this a prior, the system could use reasonings about query containment to help user derive.
Here are some examples of the mappings (based on the examples set up above in the pictures):
We could see the potential solution space get complex, and this does not yet include discussions of aggregation or outer-joins yet, which requires the construction of the search space (and a formal description is out of scope). They first consider equivalence, then dominance preserving mappings. They also mention that they split via vertical composition and horizontal composition.
The main tool here is query containment.
One principle is to lose as little information as possible, and the author emphasize that sometimes their method may be wrong and the Clio interface allows for user intervention.
The overall process looks like the following:
- Phase 1: value correspondences are partitioned into sets that contain at most one correspondence per attribute of the table, each set is a potential candidate set. It’s complete if it includes all the attributes. There techniques to use to make this process cheaper.
- Phase 2: if the target is derived from several source relations (need vertical composition) and join the tuples. First Clio tries foreign keys, and if there are more than one use the one whose difference in outer and inner join are smallest (less null values I think). If there are more ambiguities, ask the user!
- Phase 3: find a subset of candidate sets that covers all value correspondence, ranked (if more than one) by reverse, the number of candidate sets (i.e. less SQL queries to execute!)
- Phase 4: build query from the selected cover.
This is akin to interactivity where the user wants to see the partial result before adding more value correspondences. The incremental algorithm also contains phases, as shown below:
- Phase 1: do the process of the steps mentioned in prev section
- Phase 2: apply change to new tentative covers
- Phase 3: same as last phase of prev alg.
Rant: the fact that the paper is not OCRed was really cramping my reading style! Can’t do keyword search : ( SIGMOD should invest in prettifying their old papers…