Year: 1997

Authors: Roy Goldman, Jennifer Widom


Structure in data gives us a lot of power for querying, processing and users understanding the data, as seen in the relational model. However imposing structure on real world data could be challenging and thus limiting. How to work with semi-structured data was under active research. DataGuides serve as dynamic schemas generated from the database that are helful for browsing, querying and storing statistics.

Background: Object Exchange Model

It is built in the Lore system, which follows the Object Exchange Model (OEM). Each object in OEM contains an object identifier and a value, which could be atomic values (such as a number, string), or complex values, such as a set of subobjects OEM.

Below is an example OEM database.

example oem

We need some OEM definitions to define DataGuides (I have found it easier to just look at the examples):

  • label path: sequence of labels to traverse a path of edges where the edge has corresponding label (In Figure 1, Restaurant.Name and Bar are both valid label paths of object 1).
  • data path: sequence of alternating labels and oids that traverse a path of edges through n objects such that the edge has the label and the object has the oid. Restaurant.2.Name.5 and Bar.4 are data paths of object 1.
  • A data path d is an instance of a label path l if the sequence of labels in d is equal to l: Restaurant.2.Name.5 is an instance of Restaurant.Name and Bar.4 is an instance of Bar
  • Target set: In an OEM object s, a target set is a set t of oids such that there exists some label path l of s where t = {o l1.o1.l2.o2…ln.o is a data path instance of l}. The target set of Restaurant.Entree in Figure 1 is {6, 10, 11}
  • a given source database is identified by its root object and a DataGuide is intended to reflect the structure of a database (or metadata).

Note that one could use these definitions to query the underlying data as well. A query may look like the following:Select Restaurant.Name Where Restaurant.Entree = "Burger"

What is a DataGuide

Definition: A DataGuide for an OEM source object s is an OEM object d such that every label path of s has exactly one data path instance in d, and every label path of d is a label path of s.

Note that DataGuides themselves are also part of the database (hence in the example the nodes all have greater/different ID values). Below is an example of the dataguide for the previous example.

example dataguide

Dataguides could also have annotations to aid query formulation. Its defined as: “In a source database s, given a label path l, a property of the set of objects that comprise the target set of l in s is said to be an annotation of l. That is, an annotation of a label path is a statement about the set of objects in the database reachable by that path.” These annotations come in handy for assisting queries.

Building DataGuides

“Creating a DataGuide over a source database is equivalent to conversion of a non-deterministic finite automaton (NFA) to a deterministic finite automaton (DFA)” It’s really cool that this works out so neatly!

The worst case bound is exponential in the number of objects but in practice its not that bad, and the authors have devised some more relaxed summary structure thats cheaper to create.

There might be multiple dataguides (could derive from automata theory of NFA to DFA), and the algorithm does not find the minimal one due to high maintenance cost (see example below). The annotations (defined in previous section) requires that the dataguide to be unambiguous (e.g. node 20 is ambiguous), which motivates strong dataguides.

example dataguide

Strong dataguides is intuitively defined to enforce that dataguides do not have ambiguous target set of any label path.

There is a lot of emphasis on how to maintain correct dataguides efficiently and the authors present many algorithms.

Using DataGuides

Dataguides is also easy for implementing query by example.

The annotations for dataguides could contain some sample values, so that users could fetch all the tuples despite loose scheme. For example, the country might be stored as USA or United States.

Dataguides could also help with query optimization. Due to the semi-structured nature most of the existing techniques need to be adjusted.