Year: 2007

Authors: J Cheney, L Chiticariu, and WC Tan

Problem Formulation

“Provenance information describes the origins and the history of data in its life cycle.”

Motivation and Applications

Provenance matters for many data management tasks today, because data now comes from sources that are more complex and constantly changing.

  • What motivates the different kinds of provenance, from these notes
    • Auditing and compliance
    • Data debugging
    • Access control
    • View maintenance
    • Data quality and trust

Solution Formulations

The survey broadly states that there are three kinds of provenance: why, how, and where.

Here is an illustration of the running example:

[image: provenance]

Why:

  • A subset of provenance is lineage, and it forms the foundation of why-provenance.
  • Lineage intuitively describes the sequence of subsets of the input relations, defined per operator (see Definition 2.2 and Theorem 2.1), that “witnesses” the output (Definition 2.3).
  • Cui et al. defined a version of why-provenance that basically shows the source tuples that contributed to each result tuple. It’s very simple and doesn’t capture the relationships between the source tuples, like when two source tuples contribute to the same result tuple (i.e. not all tuples in the lineage are necessary for the output tuple).
  • Buneman et al. capture the differences between the witnesses. This is shown via a set of sets, where each inner set independently witnesses the output tuple (Fig 1.3).
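To make the contrast between the flat lineage and the set-of-sets view concrete, here is a minimal sketch. The relations `R` and `S`, the tuple identifiers, and the query are all made up for illustration; they are not the survey's running example, and the functions are my own names rather than anything defined in the paper.

```python
from itertools import product

# Hypothetical toy data: each source tuple carries an identifier.
R = {"r1": ("a", 1), "r2": ("a", 2)}      # R(a, b)
S = {"s1": (1, "x"), "s2": (2, "x")}      # S(b, c)

def witnesses():
    """Evaluate Q(a, c) = project_{a,c}(R join S on b), tracking witnesses.

    Each output tuple maps to a set of witnesses; each witness is the set
    of source tuple ids used by one derivation of that output tuple.
    """
    out = {}
    for (rid, (a, b)), (sid, (b2, c)) in product(R.items(), S.items()):
        if b == b2:
            out.setdefault((a, c), set()).add(frozenset({rid, sid}))
    return out

def lineage(witness_sets):
    """Flatten the witnesses into one set of contributing source tuples."""
    return set().union(*witness_sets)

why = witnesses()
flat = lineage(why[("a", "x")])
```

Here `why[("a", "x")]` holds two witnesses, `{r1, s1}` and `{r2, s2}`, and either one alone is enough to produce the output tuple. The flattened `flat` set contains all four ids and loses that information, which is the limitation of the simpler model noted above.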

[image: provenance]

How

The previously mentioned model doesn’t tell us how the witnesses contribute to the output tuple (another way to look at it: the structure of the proof). In simple cases one could reverse engineer this from the queries, but more complex cases need more annotation. Compare Fig 1.5 to Fig 1.3 to get a sense of the difference.

[image: provenance]

  • To compose lineage better, it’s defined as a partial function that maps a query, a relation, and a tuple to another relation, or is undefined. I think this turns things into annotations. It may be seen as below (images were found in the slides for chapter 14, “Data Provenance”, of Principles of Data Integration by Doan et al.).

[image: provenance]

  • Originally people used relational operators plus sequences (to deal with duplication). Later, following Green et al.’s work on semirings, the algebra was generalized to semirings. I think the following images demonstrate well how semirings abstract the basic lineage provenance, which is based on SQL operators, into other types of provenance. Doan et al. also point out that the annotation model is a different way of expressing the graph model (I think).
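As I understand the semiring idea, the query is evaluated once in a generic way: joins multiply annotations and projections or unions add them. Swapping the semiring then changes what the annotations mean. The sketch below is my own illustration on made-up data. `Semiring`, `join_project`, and `annotate` are hypothetical names, not an API from Green et al. or from the slides.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    zero: Any
    one: Any
    plus: Callable[[Any, Any], Any]    # alternative derivations (union, projection)
    times: Callable[[Any, Any], Any]   # joint use of tuples (join)

# Counting semiring: annotations are multiplicities (bag semantics).
COUNTING = Semiring(0, 1, lambda x, y: x + y, lambda x, y: x * y)
# Boolean semiring: annotations say whether a tuple is present (set semantics).
BOOLEAN = Semiring(False, True, lambda x, y: x or y, lambda x, y: x and y)
# Why-style semiring: annotations are sets of witnesses (sets of tuple ids).
WHY = Semiring(
    frozenset(),
    frozenset({frozenset()}),
    lambda x, y: x | y,
    lambda x, y: frozenset(a | b for a in x for b in y),
)

def join_project(R, S, K):
    """Q(a, c) = project_{a,c}(R(a,b) join S(b,c)) over K-annotated relations."""
    out = {}
    for (a, b), ann_r in R.items():
        for (b2, c), ann_s in S.items():
            if b == b2:
                term = K.times(ann_r, ann_s)
                out[(a, c)] = K.plus(out.get((a, c), K.zero), term)
    return out

# Hypothetical source tuples, each mapped to its identifier.
R_TUPLES = {("a", 1): "r1", ("a", 2): "r2"}
S_TUPLES = {(1, "x"): "s1", (2, "x"): "s2"}

def annotate(tuples, f):
    """Build a K-annotated relation by applying f to each tuple id."""
    return {t: f(name) for t, name in tuples.items()}

def singleton(name):
    return frozenset({frozenset({name})})

count = join_project(annotate(R_TUPLES, lambda n: 1),
                     annotate(S_TUPLES, lambda n: 1), COUNTING)
exists = join_project(annotate(R_TUPLES, lambda n: True),
                      annotate(S_TUPLES, lambda n: True), BOOLEAN)
why = join_project(annotate(R_TUPLES, singleton),
                   annotate(S_TUPLES, singleton), WHY)
```

The same `join_project` code yields a multiplicity of 2 under `COUNTING`, `True` under `BOOLEAN`, and the two witnesses `{r1, s1}` and `{r2, s2}` under `WHY`. Only the semiring changes, which I take to be the point of the generalization.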

[image: provenance] [image: provenance] [image: provenance]

Where

I think where-provenance is a bit exotic and not as well known. The intuitive idea seems to be that one tracks the “location”. It seems useful for “forwarding provenance information during query execution”. It’s implemented in DBNotes, which implements where-provenance with propagation rules.

[image: provenance]

More to add later!
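In the meantime, here is a minimal sketch of how I picture the “location” idea. Every field carries the set of source locations it was copied from, and an operator that copies a field also copies its locations. This is my own toy model on made-up data, not DBNotes' actual rules or syntax. `Loc`, `annotate`, and `project` are hypothetical names.

```python
from typing import NamedTuple

class Loc(NamedTuple):
    """A source location: which relation, which tuple, which attribute."""
    relation: str
    tuple_id: str
    attribute: str

def annotate(relation, rows, attrs):
    """Attach to each field the singleton set holding its own location."""
    return [
        {attr: (val, {Loc(relation, tid, attr)}) for attr, val in zip(attrs, vals)}
        for tid, vals in rows.items()
    ]

def project(rows, attrs):
    """Project onto attrs under set semantics.

    Fields are copied, so each output field keeps its source locations.
    When duplicates are merged, the location sets are unioned.
    """
    merged = {}
    for row in rows:
        key = tuple(row[a][0] for a in attrs)
        slot = merged.setdefault(key, {a: (row[a][0], set()) for a in attrs})
        for a in attrs:
            slot[a][1].update(row[a][1])
    return list(merged.values())

# Hypothetical relation Emp(name, city) with tuple ids t1 and t2.
emp = annotate("Emp", {"t1": ("ann", "nyc"), "t2": ("bob", "nyc")}, ("name", "city"))
cities = project(emp, ("city",))
```

After the projection, `cities` has a single row whose `city` field points back to both `Emp.t1.city` and `Emp.t2.city`, since the value could have been copied from either place.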