Authors: S Brin, L Page
- PageRank: a model of user behavior (the probability that a "random surfer" ends up on a given page).
- Anchor text: often a more accurate description of a page than the page itself, and it can describe pages that haven't been crawled (which is helpful).
- Other features such as font size (might seem silly, but it's a fairly extensible model for adding other signals of interest).
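The random-surfer idea above can be sketched as power iteration on a toy link graph. This is a hypothetical example, not the paper's code; it uses a normalized variant of the PageRank formula (so the ranks sum to 1) with the paper's damping factor d = 0.85, and the four-page graph is made up:

```python
# Toy PageRank via power iteration (hypothetical 4-page link graph).
# links[p] = list of pages that p links to.

def pagerank(links, d=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            if outs:
                # Each page shares its damped rank equally among its out-links.
                share = d * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # Dangling page: spread its rank over all pages.
                for q in pages:
                    new[q] += d * rank[p] / n
        rank = new
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
```

Here page C ends up with the highest rank, since A, B, and D all link to it.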
- Distributed crawlers
- URLServer: reads the anchors file and sends lists of URLs to be fetched to the crawlers (note: this needs to be bootstrapped with an initial URL list).
- Repo: stores the pages fetched by the crawlers in compressed form; each page is assigned a docID.
- Indexer: reads the web pages, uncompresses and parses each one, and converts it into a set of word occurrences (hits): tuples of word, position in the document, and other attributes. These are stored in "barrels" partitioned by docID. It also writes link data to the anchors file.
- Anchors file: stores the link graph of the web, along with the text of each link.
- Sorter: takes the barrel data and sorts it by wordID to generate the inverted index.
- Lexicon: a program takes the wordID list from the sorter and generates a new lexicon for the searcher.
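The indexer/sorter step above amounts to re-sorting forward-barrel hits by wordID. A minimal sketch with made-up data (the hit tuples, wordIDs, and docIDs are all hypothetical, and real hits carry more attributes than shown here):

```python
# Forward "barrel" entries: (docID, wordID, position) hits, grouped by docID.
# Sorting them by wordID yields the inverted index:
# wordID -> postings list of (docID, position).
from collections import defaultdict

forward_hits = [
    (1, 7, 0), (1, 3, 1),   # doc 1 contains words 7 and 3
    (2, 3, 0), (2, 9, 1),   # doc 2 contains words 3 and 9
]

def invert(hits):
    # Sort primarily by wordID, as the sorter does.
    hits_sorted = sorted(hits, key=lambda h: (h[1], h[0], h[2]))
    index = defaultdict(list)
    for doc_id, word_id, pos in hits_sorted:
        index[word_id].append((doc_id, pos))
    return dict(index)

inverted = invert(forward_hits)
```

After inversion, `inverted[3]` lists every (docID, position) where word 3 occurs.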
- Searcher: uses the lexicon, the inverted index, and PageRank to answer queries.
- Parse the query into wordIDs.
- Scan through the document lists until a document matches all search terms, and compute its rank.
- Sort and return the top K.
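The three searcher steps above can be sketched end to end. All of the data here is hypothetical, and ranking by PageRank alone is a simplification (the paper combines PageRank with hit-based IR scores):

```python
# Minimal searcher sketch: lexicon lookup -> postings intersection -> rank.
lexicon = {"fast": 3, "search": 7, "engine": 9}        # word -> wordID
inverted_index = {3: {1, 2}, 7: {1}, 9: {1, 2}}        # wordID -> docIDs
pagerank_score = {1: 0.6, 2: 0.4}                      # docID -> PageRank

def search(query, k=10):
    # 1. Parse query to wordIDs via the lexicon.
    word_ids = [lexicon[w] for w in query.split() if w in lexicon]
    if not word_ids:
        return []
    # 2. Keep only documents matching all search terms.
    docs = set.intersection(*(inverted_index[w] for w in word_ids))
    # 3. Rank (here: by PageRank alone) and return the top K.
    return sorted(docs, key=lambda d: pagerank_score[d], reverse=True)[:k]
```

For example, `search("fast engine")` intersects the postings for wordIDs 3 and 9 and orders the surviving docs by score.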
- It’s been less than 20 years, and Google’s architecture today is, well, not going to fit into a 20-page paper.
- I had to sit down and think through from scratch how I would design a search engine to truly appreciate the paper’s contribution.