Year: 1998

Authors: S Brin, L Page

Search Quality

  • PageRank: a model of user behavior (random surfer’s probability of ending up on a page).
  • Anchor text: accurate description of the web page and sometimes links to sites that are not crawled (which is helpful).
  • Other features such as font size (might seem silly but is a pretty extensible model to add other things of interest).

Architecture

  • Distributed crawlers
  • URLServer: reads anchors file and sends URLs to be fetched to crawlers (note: this needs to be boostraped)
  • Repo: saves the pages from crawlers and compress the web pages, which is assigned docID
  • Indexer: reads the web pages, uncompress, parse page and convert into a set of word occurrences (hits): tuple of word, position in doc, and other attributes, then is stored in “barrels” by docID. Also processes data to anchors file.
  • Anchors file: stored the linking graph of the web, along with text of the link.
  • Sorter: takes the barrels data and sort by wordID to generate inverted index.
  • Lexicon: a program takes the wordID list from sorter and generates new lexicon.
  • Searcher: uses lexicon and the inverted index and PageRank to answer questions.

Query Evaluation

  • Parse query to wordIDs
  • Scan through the document list until there is a document that matches search terms and compute rank
  • sort the top K.

Notes:

  • It’s only been not even 20 years and Google’s architecture now is, well, not going to fit into a 20 page paper.
  • I had to sit myself down and try to think from scratch how I would design a search engine to truly appreciate the contribution of the paper.