Authors: MJ Cafarella, A Halevy, ZD Wang et al.
The web has usually been modeled as a corpus of unstructured documents, but it also contains a large number of tables (consider only those that are surfaced and crawlable). Could these tables be combined to improve search results? This work differs from previous attempts in that the data set is larger and the results are used in more diverse ways.
The authors set out to explore:
- what are effective techniques for searching structured data (at search-engine scale)?
- what are the gains?
And their solution:
- how to do keyword search over the tables
- “We describe a ranking method that combines table-structure-aware features (made possible by the index) with a novel query-independent table cohere”
- a new object, the ACSDb (attribute correlation statistics database), that records corpus-wide statistics on co-occurrences of schema elements
- Extracting Relations: combines hand-written detectors with statistically trained classifiers to separate good relations from bad ones (this was done in a separate research project).
- Attribute Correlation Stats: statistics on how attribute names are used in schemas. Schema frequency followed a power-law distribution, with a small fraction of schemas showing up very often. These statistics allow the authors to do the following:
- computing probability of seeing various attributes in a schema
- detect relationships between attribute names by conditioning an attribute’s probability on the presence of a second attribute
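
The two statistics above can be illustrated with a minimal Python sketch. It assumes a toy corpus of (schema, count) pairs and counts each schema by its number of occurrences, which may differ from how the paper builds the ACSDb. The names `SCHEMAS`, `build_acsdb`, `p`, and `p_given` are hypothetical and chosen only for this example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus: (schema, number of times that schema was seen).
SCHEMAS = [
    (("name", "addr", "city", "state", "zip"), 1),
    (("name", "size", "last-modified"), 2),
    (("make", "model", "year"), 3),
    (("make", "model", "price"), 1),
]


def build_acsdb(schemas):
    """Count occurrences of single attributes and of attribute pairs."""
    single, pair, total = Counter(), Counter(), 0
    for attrs, count in schemas:
        total += count
        unique = sorted(set(attrs))
        for a in unique:
            single[a] += count
        # Pairs are stored with their attributes in sorted order.
        for a, b in combinations(unique, 2):
            pair[(a, b)] += count
    return single, pair, total


def p(attr, single, total):
    """Probability of seeing `attr` in a schema."""
    return single[attr] / total


def p_given(attr, given, single, pair):
    """Probability of seeing `attr`, conditioned on `given` being present."""
    if single[given] == 0:
        return 0.0
    key = tuple(sorted((attr, given)))
    return pair[key] / single[given]


if __name__ == "__main__":
    single, pair, total = build_acsdb(SCHEMAS)
    print(p("make", single, total))
    print(p_given("year", "make", single, pair))
```

In this toy data, `year` appears in 3 of the 4 schema occurrences that contain `make`, so conditioning on `make` raises its probability well above its unconditional value. That shift is the kind of relationship between attribute names the notes refer to.
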
What it does: rank relations by relevance, with a search-engine-style keyword query as input.
Note that unlike most search engines, WebTables results pages are actually useful on their own, even if the user does not navigate away (like Google’s search card).
- Ranking: keyword ranking of individual databases is a novel problem. To this end, they implement the following algorithms:
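- naiveRank: sends the user’s query to a search engine, fetches the top-k pages, and returns the relations extracted from them. filterRank is similar, but keeps going down the result list until it has found k relations.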
- featureRank: scores each relation using relation-specific features, combined by a linear regression estimator trained on human relevance judgments.
- schemaRank: similar to featureRank, but adds an ACSDb-based schema coherency score (high when a schema’s attributes are tightly related)
- Indexing: the standard approach is a simple inverted index to speed up lookups, but that is too inefficient here. They use a 2-dimensional index that is both performant and useful.
- Schema Auto Complete: assist novice database users when designing a relational schema.
- Attribute Synonym Finding: traditionally done with a hand-made thesaurus, but synonyms could instead be generated from the ACSDb.
- Join-graph traversal: helps the user navigate between extracted schemas (and clusters them around the more important ones)
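
Two of the ACSDb uses listed above, the schema coherency score behind schemaRank and Schema Auto Complete, can be sketched together in Python. This is an illustration under stated assumptions, not the paper's exact procedure. It assumes coherency is the average pointwise mutual information (PMI) over attribute pairs, and it assumes a greedy auto-complete that keeps adding the attribute most likely to co-occur with the user's seed attributes until that probability drops below a threshold. The toy corpus, the `threshold` default, the `-inf` score for pairs that never co-occur, and all function names are hypothetical.

```python
from itertools import combinations
from math import log

# Hypothetical toy corpus of (schema, count) pairs; not data from the paper.
SCHEMAS = [
    (frozenset({"make", "model", "year"}), 3),
    (frozenset({"make", "model", "price"}), 2),
    (frozenset({"name", "size", "last-modified"}), 2),
    (frozenset({"name", "addr", "city"}), 1),
]
TOTAL = sum(count for _, count in SCHEMAS)


def freq(attrs):
    """Fraction of schema occurrences that contain every attribute in attrs."""
    attrs = frozenset(attrs)
    return sum(c for schema, c in SCHEMAS if attrs <= schema) / TOTAL


def pmi(a, b):
    """Pointwise mutual information of two attribute names."""
    joint = freq({a, b})
    if joint == 0:
        return float("-inf")  # assumption: never co-occurring pairs score -inf
    return log(joint / (freq({a}) * freq({b})))


def coherency(schema):
    """Average PMI over all attribute pairs in the schema."""
    pairs = list(combinations(sorted(schema), 2))
    if not pairs:
        return 0.0
    return sum(pmi(a, b) for a, b in pairs) / len(pairs)


def autocomplete(seed, threshold=0.5):
    """Greedily extend `seed` while the extended schema stays likely given the seed."""
    chosen = list(seed)
    base = freq(seed)
    if base == 0:
        return chosen  # seed attributes never co-occur: nothing to suggest
    vocabulary = set().union(*(schema for schema, _ in SCHEMAS))
    while True:
        candidates = sorted(vocabulary - set(chosen))
        if not candidates:
            break
        # Probability of the whole extended schema, conditioned on the seed.
        best = max(candidates, key=lambda a: freq(chosen + [a]) / base)
        if freq(chosen + [best]) / base < threshold:
            break
        chosen.append(best)
    return chosen


if __name__ == "__main__":
    print(coherency({"make", "model", "year"}))
    print(autocomplete(["make"]))
```

With this toy corpus, a schema made of attributes that often appear together scores higher on coherency than one mixing unrelated attributes, and seeding auto-complete with `make` suggests `model` and then `year` before stopping.
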