Year: 2012

Authors: TJ Green, SS Huang, BT Loo, and W Zhou

Title: Datalog and Recursive Query Processing

Basics

Syntax: A Datalog program consists of a collection of rules. Each rule has a head and a body; the body is a conjunction of literals, each of which is either an atom or a constraint: A :- B1, …, Bk. For example, transitive closure of an edge relation is just two rules: path(X,Y) :- edge(X,Y) and path(X,Y) :- edge(X,Z), path(Z,Y).

Semantics: there are three ways to look at it:

  • model-theoretic: a model of a Datalog program is a database instance that extends the source database and satisfies all the rules of the program. A theorem (2.1) states that there exists a minimal model for every program and that this model is polynomial in the size of the base database (note: this can be shown from the fact that evaluating unions of conjunctive queries is polynomial). Note that this is a bound on size, not on running time, and once arithmetic is allowed, recursion such as x :- x + 1 does not even terminate.
  • fixed-point: intuitively, a fixed point is reached when no new derivations can be made. The more rigorous definition uses the notion of the “immediate consequence operator” (see the sketch after this list).
  • proof-theoretic: a proof of an output atom can be represented graphically as a proof tree, where leaf nodes correspond to source atoms in the EDB (base data) and intermediate nodes correspond to IDB atoms (derived facts).
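To make the fixed-point view concrete, here is a minimal Python sketch (my own illustration, not from the survey) of the immediate consequence operator for the transitive-closure program mentioned above; the edge facts are made-up example data.

```python
# Fixed-point semantics, illustrated for the program
#   path(X,Y) :- edge(X,Y).
#   path(X,Y) :- edge(X,Z), path(Z,Y).
# (hypothetical example data)
edge = {("a", "b"), ("b", "c"), ("c", "d")}

def immediate_consequence(path):
    """Apply every rule once to the current database; return the new path relation."""
    new_path = set(edge)                      # path(X,Y) :- edge(X,Y).
    for (x, z) in edge:
        for (z2, y) in path:
            if z == z2:
                new_path.add((x, y))          # path(X,Y) :- edge(X,Z), path(Z,Y).
    return new_path

path = set()
while True:
    next_path = immediate_consequence(path)
    if next_path == path:                     # fixed point: no new derivations possible
        break
    path = next_path

print(sorted(path))                           # all six reachable pairs
```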

Evaluation

Note: the survey has a very good structure, but I was lost in the notation for the section that described query evaluation, so here I attempt to provide a less rigorous but simpler description to convey the intuition. Panda kindly helped answer a few of my questions and I include some of his answers here.

  • Bottom Up (forward chaining): start with the facts and use the rules to infer new facts. In each step the naive algorithm determines what facts can be derived from the original input facts by applying the rules $k$ times. Algorithmically one could represent this as the following. Observe that in this scheme each step recomputes all the facts from the previous step. Seminaive evaluation tries to reduce the number of facts recomputed in each step: it modifies the rules so that in the $k^{th}$ step the algorithm does not reproduce tuples that were already known in the $(k-1)^{th}$ step (see the sketch after this list).
    • Step 0: Set f_0 = known facts
    • Step 1: Derive f_1 = f_0 ∪ {facts obtained by applying the rules once to all facts in f_0}
    • Step 2: If f_1 != f_0 then { set f_0 = f_1; goto Step 1 } else { return f_1 as the set of all derivable facts }.
  • Top Down (backward chaining): take the query and recursively generate subqueries based on the given rules. The details that go into generating subqueries get notation heavy, with annotations etc. Basically, unification, which “introduces constant bindings for variables in the candidate rules”, is top down information passing. The binding information is then pushed into the bodies of candidate rules (propagating the binding), and this is sideways information passing. The subqueries generated in this manner are answered similarly, “by unifying them with candidate rules, possibly resulting in more subqueries”. “The production and answering of queries/subqueries repeat until no more tuples are derived into answer sets, and no more subqueries are produced.”
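As a rough illustration of the seminaive idea referenced above, here is a Python sketch (my own, not the survey's formulation) for the same hypothetical transitive-closure program: each round joins edge only against the path facts that were new in the previous round (the delta), so earlier facts are not recomputed over and over.

```python
# Semi-naive bottom-up evaluation (sketch) of
#   path(X,Y) :- edge(X,Y).
#   path(X,Y) :- edge(X,Z), path(Z,Y).
edge = {("a", "b"), ("b", "c"), ("c", "d")}   # hypothetical base facts (EDB)

path = set(edge)        # round 0: the non-recursive rule
delta = set(edge)       # facts that are new as of the last round
while delta:
    new = set()
    for (x, z) in edge:
        for (z2, y) in delta:                 # join edge with *new* path facts only
            if z == z2 and (x, y) not in path:
                new.add((x, y))
    path |= new
    delta = new                               # the next round extends only fresh facts

print(sorted(path))
```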

There are tradeoffs between top down and bottom up. Bottom up always generates correct facts, but many of them may not be needed for the query. Top down may try bindings that lead nowhere, but whatever it derives is relevant to the query. To get the best of both worlds, the Magic Sets concept was invented: allow the query itself to influence which sets of tuples are generated during bottom up exploration; this prunes a large part of the space that bottom up exploration would otherwise cover. To achieve this, there are two ingredients:

  • sideways information passing strategy (SIP): the SIP dictates the order in which the query’s predicates are evaluated. The way to think about this is that, given a set of facts, each predicate filters out some of them, and the survivors act as input to the next predicate. The SIP therefore determines whether a fact gets past a particular predicate or not.
  • A transformed set of rules, modified so that they fire (i.e., can generate tuples) only when the generated fact would be passed from one predicate to the next (which you know from the SIP). These rewritten rules are what allow bottom up evaluation to work efficiently.

Given these two, the exploration for a query with SIP S proceeds as below:

  1. You are given some fact “F”
  2. For each rule R, you are given R_{S_1}, the rule modified to fire only if the generated fact will get past the first predicate chosen by S. Given this, you generate some set of derived facts F’, F’’, … that are guaranteed to get past the first predicate.
  3. Now for the second predicate you consider the set of facts {F, F’, F’’, …} and again apply R_{S_2} to generate the set of facts that would survive this predicate and be evaluated by the third predicate. And so on and so forth.

The good thing now is that while evaluating each predicate P, you do not have to consider any generated facts that would not make it past P, and you can show that this effectively reduces the search space. In fact if you generate a SIP smartly, you can order it so that restrictive predicates appear early on and thus vast sets of facts are removed from the exploration path.

Lastly, the magic set itself is the set of bindings relevant to the given query, derived by auxiliary “magic” rules from the query’s constants; restricting the rewritten rules to this set is sufficient to answer the query while generating far fewer irrelevant facts. A rough sketch of the idea follows.
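To give a feel for the effect (this is my own rough sketch, not the formal rewriting algorithm in the survey), consider the query path("a", Y)? over the same hypothetical program: the magic set of relevant start nodes is computed from the query constant first, and the bottom-up rules then fire only for tuples whose first argument is in that set.

```python
# Magic-set-style pruning (sketch) for the query path("a", Y)? over
#   path(X,Y) :- edge(X,Y).
#   path(X,Y) :- edge(X,Z), path(Z,Y).
# The rewritten program looks roughly like:
#   magic_path("a").
#   magic_path(Z) :- magic_path(X), edge(X,Z).
#   path(X,Y)     :- magic_path(X), edge(X,Y).
#   path(X,Y)     :- magic_path(X), edge(X,Z), path(Z,Y).
edge = {("a", "b"), ("b", "c"), ("x", "y"), ("y", "z")}   # the x/y/z part is irrelevant to the query

# Step 1: compute the magic set (bindings reachable from the query constant "a").
magic = {"a"}
changed = True
while changed:
    changed = False
    for (u, v) in edge:
        if u in magic and v not in magic:
            magic.add(v)
            changed = True

# Step 2: bottom-up evaluation, but rules fire only when the first argument is "magic".
path = set()
changed = True
while changed:
    changed = False
    for (u, v) in edge:
        if u in magic and (u, v) not in path:              # path(X,Y) :- magic_path(X), edge(X,Y).
            path.add((u, v))
            changed = True
    for (u, z) in edge:
        if u in magic:
            for (z2, y) in list(path):
                if z == z2 and (u, y) not in path:         # path(X,Y) :- magic_path(X), edge(X,Z), path(Z,Y).
                    path.add((u, y))
                    changed = True

print(sorted(t for t in path if t[0] == "a"))              # answers to path("a", Y)?
```

Notice that nothing in the irrelevant x/y/z component is ever derived, which is exactly the pruning the query-driven rewriting is after.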

Applications

  • Program Analysis: e.g., analyses used to optimize program performance. Datalog is a good fit because such analyses are naturally recursive.
  • Declarative Networking: compiles to a dataflow framework for execution, and gives room for optimization and safety checks. Apparently datalog rules can also express distributed computations easily.
  • Data integration and exchange: the setting is a source database, a target database, and a set of schema mappings. Data integration lets users query the target schema and rewrites the query into queries over the source database, while data exchange materializes the target database. Propagating data through these mappings can reuse Datalog evaluation techniques (especially with the more complex mapping schemes).
  • And there are other applications to security, concurrent programming, and “cloud programming” (bloom-land).

Yifan’s random and perhaps wrong thought: there is a new wave of software that does workflow programming and each workflow edge could be a rule in a datalog program.