Year: 2011

Authors: S Kandel, A Paepke, J Hellerstein, J Heer

Summary

The authors cite estimates that data cleaning accounts for up to 80% of the development time in data warehousing projects. They identify the main issues as transforms being difficult to specify and difficult to evaluate. In addition, analysts want to inspect the actual steps that transformed the data, for provenance.

To these ends, Wrangler simplifies specification and minimizes manual repetition by combining a mixed-initiative user interface with an underlying declarative transformation language (based on Potter's Wheel).

The user expresses the intent of an edit through direct interaction in the GUI. In response, Wrangler shows a ranked list of potential operations, each paired with a natural language description that the user can refine. A visual preview of the transform results is also shown to help analysts assess the space of possible transforms.

Interface Design

  • Automated Transformation Suggestions
  • Natural language descriptions of the transform type and parameters
  • Visual transformation previews to enable users to quickly evaluate the effect of a transform
  • Transformation histories and export: users can edit individual transform descriptions and selectively enable and disable prior transforms.
  • Data quality meter (per column): Wrangler tries to infer the data type and semantic role for each column
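As a rough illustration of the per-column type inference behind the data quality meter, here is a minimal Python sketch. The type list, the majority threshold, and the three meter buckets (valid, mismatched, missing) are assumptions made for this example, not Wrangler's actual implementation; `infer_type` and `quality_meter` are hypothetical names.

```python
from datetime import datetime

def _parses(cast, value):
    """Return True if `cast` accepts the string without raising ValueError."""
    try:
        cast(value)
        return True
    except ValueError:
        return False

def _iso_date(value):
    # Assumption: only ISO dates (YYYY-MM-DD) count as dates in this sketch.
    return datetime.strptime(value, "%Y-%m-%d")

# Candidate types, most specific first (ties keep the earlier entry).
TYPE_CHECKS = [("integer", int), ("number", float), ("date", _iso_date)]

def infer_type(values):
    """Pick the type that parses the most non-missing values, if a majority parse."""
    present = [v for v in values if v is not None and v.strip()]
    best, best_hits = "string", 0
    for name, cast in TYPE_CHECKS:
        hits = sum(_parses(cast, v) for v in present)
        if hits > best_hits:
            best, best_hits = name, hits
    if present and best_hits / len(present) > 0.5:
        return best
    return "string"

def quality_meter(values):
    """Count valid, mismatched, and missing values against the inferred type."""
    dtype = infer_type(values)
    cast = dict(TYPE_CHECKS).get(dtype, str)
    counts = {"type": dtype, "valid": 0, "mismatched": 0, "missing": 0}
    for v in values:
        if v is None or not v.strip():
            counts["missing"] += 1
        elif _parses(cast, v):
            counts["valid"] += 1
        else:
            counts["mismatched"] += 1
    return counts

column = ["12", "7", "", "n/a", "30", None]
print(quality_meter(column))
```

A mostly numeric column with a stray "n/a" is still inferred as integer here, and the stray value shows up in the mismatched bucket, which is the kind of signal a per-column meter is meant to surface.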

Inference Engine

Setup

  • Input: Inputs to the engine consist of user interactions; the current working transform; data descriptions such as column data types, semantic roles, and summary statistics; and a corpus of historical usage statistics.
  • Output: ranked transforms

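  • Parameters: each transform takes row, column, or text selections, plus enumerables.
  • Equivalence: defined per parameter type; two transforms count as equivalent when they have the same transform type and equivalent parameters.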
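One way to picture parameter-level equivalence is to reduce each concrete transform to a coarse signature and compare signatures. The sketch below is an assumption-laden illustration: the `Transform` fields, the `signature` function, and the specific rules (row and text selections compared by kind, columns by data type, enumerables exactly) are chosen for this example and may differ from the paper's exact definitions.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Tuple, Union

@dataclass(frozen=True)
class Transform:
    type: str
    # Row selection: a tuple of row indices, or a predicate written as a string.
    rows: Optional[Union[Tuple[int, ...], str]] = None
    # Column selection, abstracted here to the column's data type.
    column_dtype: Optional[str] = None
    # Text selection: a (start, end) index range, or a regex written as a string.
    text: Optional[Union[Tuple[int, int], str]] = None
    enumerables: Tuple[str, ...] = ()

def signature(t: Transform):
    """Collapse concrete parameter values into the kinds used for equivalence."""
    row_kind = None if t.rows is None else (
        "predicate" if isinstance(t.rows, str) else "index")
    text_kind = None if t.text is None else (
        "regex" if isinstance(t.text, str) else "index")
    return (t.type, row_kind, t.column_dtype, text_kind, t.enumerables)

def equivalent(a: Transform, b: Transform) -> bool:
    return signature(a) == signature(b)

a = Transform("extract", column_dtype="string", text=r"\d+")
b = Transform("extract", column_dtype="string", text=r"[A-Z]+")
c = Transform("extract", column_dtype="string", text=(0, 4))

# Usage statistics can then be kept per signature rather than per exact transform.
corpus = Counter(signature(t) for t in [a, b, c])
print(equivalent(a, b), equivalent(a, c), corpus[signature(a)])
```

Grouping by signature lets historical usage generalize: two regex-based extracts on string columns count toward the same frequency even though their patterns differ.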

Components

  • Inferring parameter sets from user interaction: infer three types of transform parameters:
    • row: row indices and predicate matching
    • column: return the columns that users have interacted with
    • text selections: simple index ranges or inferred regular expressions
  • Generating suggested transforms: loop over each transform type in the language, emitting the types that can accept all parameters in the set.
  • Ranking Suggested Transforms according to five criteria:
    • by type
      • explicit interactions
      • specification difficulty
      • corpus frequency
    • within type
      • frequency of equivalent transforms
      • transform complexity (simpler transforms first)
    • finally, surface diverse transform types in the final suggestion list
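The generate-then-rank steps above can be sketched in a few lines of Python. Everything concrete here is an assumption for illustration: the `ACCEPTS` table is a made-up subset of a transform language, treating row and text selections as the hard-to-specify parameters is this sketch's reading of "specification difficulty", and the per-type cap is just one simple way to keep the suggestion list diverse.

```python
from collections import Counter
from itertools import product

# Hypothetical mini-language: transform type -> parameter kinds it can accept.
ACCEPTS = {
    "delete": {"row"},
    "drop": {"column"},
    "fill": {"row", "column"},
    "extract": {"row", "column", "text"},
    "cut": {"column", "text"},
    "split": {"column", "text"},
}
# Assumption: these parameter kinds are the hard ones to specify by hand.
HARD = {"row", "text"}

def generate(param_sets):
    """Emit candidates for every type that accepts all inferred parameter kinds."""
    kinds = sorted(param_sets)
    candidates = []
    for ttype, accepted in ACCEPTS.items():
        if not set(kinds) <= accepted:
            continue
        for combo in product(*(param_sets[k] for k in kinds)):
            candidates.append((ttype, tuple(zip(kinds, combo))))
    return candidates

def rank(candidates, chosen_type, type_freq, equiv_freq, limit=6):
    """Sort by type-level criteria, then within type, then cap each type's share."""
    def key(cand):
        ttype = cand[0]
        by_type = (
            ttype != chosen_type,           # explicit interaction first
            -len(ACCEPTS[ttype] & HARD),    # more hard parameters first
            -type_freq.get(ttype, 0),       # frequent types first
        )
        return (by_type, -equiv_freq.get(cand, 0))
    ordered = sorted(candidates, key=key)
    cap = max(1, limit // 3)                # no type takes more than a third
    shown, per_type = [], Counter()
    for cand in ordered:
        if len(shown) < limit and per_type[cand[0]] < cap:
            shown.append(cand)
            per_type[cand[0]] += 1
    return shown

# A column was clicked and some text was highlighted; the text selection has
# two inferred readings (an index range and a regular expression).
params = {"column": ["Year"], "text": ["chars 0-4", r"\d{4}"]}
freq = Counter(split=10, extract=5, cut=3)
for suggestion in rank(generate(params), None, freq, Counter(), limit=3):
    print(suggestion)
```

Keeping generation and ranking separate means the candidate set depends only on the inferred parameters, while the ordering can change as usage statistics accumulate.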