Author: E Brewer
The web contains a lot of ever changing unstructured data (ranging from plain text to multimedia), and query engines take in natural language inputs to provide a ranked list of the search results. In general have similar patterns as search in a database but very different. Brewer points out some of the similarities and walks through a design for such a system. This is written 6 years after the Google paper and he has worked in a few search engine companies.
- Top down design (similar to DBMS, but systems is more bottom up)
- Data Independence (data exists in sets, not pointers!)
- Declarative Query Language
The main structure seems fairly similar to the Google paper, with the following schemas:
Logical Query Plan
The query language is lot simpler than SQL and consist of proposition and expression where expression are nodes with scores based on the original search words. Its interesting to note that
NOT e is a proposition since negation doesn’t quite work with sets.
With the following operators that connects different expressions (e) and predicates (p)
Another proof that (almost) everything is a join!
bay area lang:english is mapped to
((bay AND area) FILTER lang:english)
Access Methods > There is really only one kind of access method: sequential scan of a sorted inverted index
It appears that the processing in a search engine is significantly easier due to the following:
- Cache all of the intermediate values for use by other queries, and therefore do not need to pipeline the plan.
- Because the lists are sorted, binary operators become merging operations: every join is a simple (presorted) merge join.
In fact, there is no reason to do binary operators: every join is a multiway merge join. The use of multiway joins is a win because it reduces the depth of the plan and thus the number of caching and scan steps (remember that intermediate results are not pipelined).
Brewer describes 4 physical operators:
- OR(e1,e2,…,ek) -> expr
- OR(p1,p2,..pk) -> prop
- AND(p1,p2,…,pk) -> prop
- FILTER(e1,…)(p1,…) -> expr
Most queries map onto a one-deep plan using FILTER. It is essentially an AND of all of its inputs, with only the expressions used to compute the score.
To implement these the authors use multiway joins.
- Different from the traditional System R bottom up approach
- More top down: easier to find highest cached subexpression. Bottom up to build partial solution is too expensive. They use a A* search algorithm to prune the space.
- Uses query rewrite (e.g. get rid of typos)
Range partitioning on DocID is essentially random partitioning.
The WordID is presorted and there is a hash index on WordID as well.
Though a search engine is different from a DBMS in many ways:
Search engines are read mostly, but occasionally need to update.
They follow a few principles:
- nodes (along with their replicas) are independent
- only update whole tables
- updates should be atomic w.r.t. queries
Now to describe what they do, first terms: chunk (a range of DocID) is the unit for atomic updates, and also the unit of partition. They split the chunks into different partitions and each partition could act independently (such as frequency of refresh)
- Crawl to get web data
- Index to convert a collection of documents into a chunk, along with parsing and scoring, and the management of metadata, such as tracking incoming and outgoing links.
- Install the chunk atomically: done by updating a version vector (one element for each chunk). Cachiang makes this a bit complicated: cache entries of old version need to be invalidated, and they make sure that a chunk is mapped to a specific cache so they could invalidate the whole thing. The updates are done incrementally (but atomically).
They need to deal with deletes of individual records for special cases, like illegal websites. They achieve it by adding rows that says item deleted and perform an anti-join. For real time ones, every chunk’s property table will have an additional entry.
Most documents are not worth replicating for high availability because most of them never appear in search results and could be crawled. However it’s hard to predicate what these are, so still have to do pretty extensive fault tolerance. Below are three types of faults:
- Disk swap out the whole node with a staging node, and then sort out the failed disks offline.
- Follower master detects via timeout
- Master: the masters are interchangeable, so just reissue the query to a different master, which has to be done by the end user who don’t really mind (so long as it’s not too often).
There are many details, but the paper highlights two:
- make the database smaller dynamically, which we can do by leaving out some chunks.
- Decline to execute some queries based on their cost, which is a form of admission control