The Many Faces of Google


How Google Works

Google consists of three distinct parts, each of which runs on a distributed network of thousands of low-cost computers and can therefore carry out fast parallel processing. Parallel processing is a method of computation in which many calculations are performed simultaneously, significantly speeding up data processing.

Let's take a closer look at each part.

Googlebot, Google's Web Crawler

Googlebot is Google's web crawling robot, which finds and retrieves pages on the web and hands them off to the Google indexer. It's easy to imagine Googlebot as a little spider scurrying across the strands of cyberspace, but in reality Googlebot doesn't traverse the web at all. It functions much like your web browser, by sending a request to a web server for a web page, downloading the entire page, then handing it off to Google's indexer.

Googlebot consists of many computers requesting and fetching pages much more quickly than you can with your web browser. In fact, Googlebot can request thousands of different pages simultaneously. To avoid overwhelming web servers, or crowding out requests from human users, Googlebot deliberately makes requests of each individual web server more slowly than it's capable of doing.
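To make the politeness idea concrete, here is a minimal Python sketch, not Google's actual crawler code, of fetching pages in parallel while throttling requests to each host; the example.com URLs, the two-second delay, and the worker count are made-up illustration values.

    import time
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor
    from urllib.parse import urlparse

    # Hypothetical politeness delay: wait at least this many seconds between
    # requests to the same host, so no server is hit faster than a patient
    # human visitor would hit it. (Locking is omitted to keep the sketch short.)
    POLITENESS_DELAY = 2.0
    last_request = {}  # host -> timestamp of the most recent fetch

    def fetch(url):
        """Fetch one page the way a browser would, throttled per host."""
        host = urlparse(url).netloc
        wait = POLITENESS_DELAY - (time.time() - last_request.get(host, 0))
        if wait > 0:
            time.sleep(wait)
        last_request[host] = time.time()
        with urllib.request.urlopen(url, timeout=10) as response:
            return url, response.read()

    # Many workers fetch pages from different hosts at once, but each host is
    # still visited politely, one unhurried request at a time.
    urls = ["https://example.com/", "https://example.org/"]
    with ThreadPoolExecutor(max_workers=8) as pool:
        for url, html in pool.map(fetch, urls):
            print(url, len(html), "bytes")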

Googlebot finds pages in two ways: by following links as it crawls the web, and through an add URL form, www.google.com/addurl.html, which lets people who want their web sites listed in Google's output submit them directly.

Unfortunately, spammers figured out how to create automated bots that bombarded the add URL form with millions of URLs pointing to commercial propaganda. Google rejects URLs submitted through its add URL form when it suspects they are trying to deceive users by employing tactics such as hiding text or links on a page, stuffing a page with irrelevant words, cloaking (aka bait and switch), using sneaky redirects, creating doorways, domains, or sub-domains with substantially similar content, sending automated queries to Google, and linking to bad neighbors.

When Googlebot fetches a page, it culls all the links appearing on the page and adds them to a queue for subsequent crawling. Googlebot tends to encounter little spam because most web authors link only to what they believe are high-quality pages. By harvesting links from every page it encounters, Googlebot can quickly build a list of links that covers broad reaches of the web. This technique, known as deep crawling, also allows Googlebot to probe deep within individual sites. Because of their massive scale, deep crawls can reach almost every page on the web. Because the web is vast, this can take some time, so some pages may be crawled only once a month.
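The link-harvesting step can be pictured with a short Python sketch; the LinkCollector class and the sample HTML snippet below are invented for illustration and are not how Googlebot is actually written.

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkCollector(HTMLParser):
        """Collect the href of every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def harvest_links(base_url, html):
        parser = LinkCollector()
        parser.feed(html)
        return [urljoin(base_url, href) for href in parser.links]

    # Every harvested link goes onto a "visit soon" queue for later crawling.
    frontier = deque()
    frontier.extend(harvest_links("https://example.com/",
                                  '<a href="/about">About</a> <a href="news.html">News</a>'))
    print(list(frontier))
    # ['https://example.com/about', 'https://example.com/news.html']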

Although its function is simple, Googlebot must be programmed to handle several challenges. First, since Googlebot sends out simultaneous requests for thousands of pages, the queue of "visit soon" URLs must be constantly examined and compared with URLs already in Google's index; duplicates in the queue must be eliminated to prevent Googlebot from fetching the same page again. Second, Googlebot must determine how often to revisit a page. On the one hand, it's a waste of resources to re-index an unchanged page. On the other hand, Google wants to re-index changed pages to deliver up-to-date results.
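The duplicate-elimination part of that bookkeeping amounts to checking each candidate URL against the set of URLs already queued or indexed, roughly like this Python sketch (the frontier and seen names are just illustrative):

    from collections import deque

    def enqueue_new(frontier, seen, candidate_urls):
        """Queue only URLs that have not been queued or indexed before."""
        for url in candidate_urls:
            if url not in seen:
                seen.add(url)
                frontier.append(url)

    frontier, seen = deque(), set()
    enqueue_new(frontier, seen, ["https://example.com/a", "https://example.com/b"])
    enqueue_new(frontier, seen, ["https://example.com/a"])  # duplicate, silently dropped
    print(list(frontier))  # each URL appears exactly once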

To keep the index current, Google continuously recrawls popular, frequently changing web pages at a rate roughly proportional to how often the pages change. Such crawls keep the index current and are known as fresh crawls. Newspaper pages are downloaded daily; pages with stock quotes are downloaded much more frequently. Of course, fresh crawls return fewer pages than the deep crawl. The combination of the two types of crawls allows Google both to make efficient use of its resources and to keep its index reasonably current.
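One way to picture recrawling at a rate roughly proportional to how often a page changes is a priority queue keyed by each page's next due visit; the URLs and change intervals below are invented examples, not Google's actual schedule.

    import heapq
    import time

    # Hypothetical per-page estimates of how often each page changes, in days.
    # A stock-quote page changes constantly; an archive page rarely does.
    change_interval_days = {
        "https://example.com/quotes": 0.01,   # minutes-scale freshness
        "https://example.com/news": 1.0,      # roughly daily
        "https://example.com/archive": 30.0,  # monthly deep-crawl territory
    }

    # Pages that change more often come due more often, which is the essence
    # of a fresh crawl layered on top of the slower deep crawl.
    now = time.time()
    schedule = [(now + days * 86400, url) for url, days in change_interval_days.items()]
    heapq.heapify(schedule)

    due_time, url = heapq.heappop(schedule)
    print("crawl next:", url)  # the stock-quote page comes up first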

Google's Indexer

Googlebot gives the indexer the full text of the pages it finds. These pages are stored in Google's index database. This index is sorted alphabetically by search term, with each index entry storing a list of documents in which the term appears and the location within the text where it occurs. This data structure allows rapid access to documents that contain user query terms.
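In data-structure terms this is an inverted index: a map from each term to the documents, and positions within them, where the term occurs. A toy Python version, far simpler than Google's real index, might look like this:

    from collections import defaultdict

    # index: term -> list of (document id, position of the term in that document)
    index = defaultdict(list)

    def add_document(doc_id, text):
        for position, term in enumerate(text.lower().split()):
            index[term].append((doc_id, position))

    add_document(1, "parallel processing speeds up indexing")
    add_document(2, "googlebot hands pages to the indexer")

    print(index["indexing"])  # [(1, 4)]
    print(index["pages"])     # [(2, 2)]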

To improve search performance, Google eliminates common words called stop words (such as the, is, on, or, of, how, why, to, be, as well as certain single digits and single letters). Stop words are so common that they do little to narrow a search, and therefore they can safely be discarded. The indexer also eliminates some punctuation and multiple spaces, as well as converting all letters to lowercase, to improve Google's performance.
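A small sketch of that normalization step, using only the stop words listed above (Google's full list and exact rules are not public):

    import re

    STOP_WORDS = {"the", "is", "on", "or", "of", "how", "why", "to", "be"}

    def normalize(text):
        """Lowercase, strip punctuation, collapse spaces, drop stop words."""
        text = text.lower()
        text = re.sub(r"[^\w\s]", " ", text)  # strip punctuation
        terms = text.split()                  # split() also collapses extra spaces
        # Drop stop words plus single letters and single digits.
        return [t for t in terms if t not in STOP_WORDS and len(t) > 1]

    print(normalize("How does the Googlebot crawl  the Web?"))
    # ['does', 'googlebot', 'crawl', 'web']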

Google's Query Processor

The query processor has several parts, including the user interface (search box), the "engine" that evaluates queries and matches them to relevant documents, and the results formatter.

Google considers over a hundred factors in determining which documents are most relevant to a query, including the popularity of the page, the position and size of the search terms within the page, and the proximity of the search terms to one another on the page. Google also applies machine-learning techniques to improve its performance automatically by learning relationships and associations within the stored data. For example, the spelling-correcting system uses such techniques to figure out likely alternative spellings. Google closely guards the formulas it uses to calculate relevance and tweaks them constantly to improve quality and performance and to outwit the latest devious techniques used by spammers. Basically, though, the process works like this:

  1. Search for Word hits: Whether the search terms appear on the page
  2. Search for Adjacency hits: How close together the search terms appear on the page
  3. Search for Frequency hits: How often the search terms appear on the page
  4. Search for about 100 other secret variables, then ...
  5. Apply PageRank: There is a premise in higher education that the importance of a research paper can be judged by the number of times it is cited or referenced in other papers. Google applies this premise to the web: the importance of a web page can be judged by the number of hyperlinks pointing to it from other pages. A page with a lot of (incoming) links to it is judged to be more important than a page with only a few links to it. A page with few (outgoing) links to other pages is judged to be more important than a page with links to a lot of other pages. (A small sketch of the PageRank idea follows this list.)
  6. The pages with the highest ranks appear first on the output list.
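To make step 5 concrete, here is a toy power-iteration sketch of the PageRank idea on a made-up three-page link graph; the 0.85 damping factor is the value given in the original PageRank paper, but everything else here is illustrative rather than Google's production algorithm.

    # A page's score is spread evenly over its outgoing links, and each page
    # accumulates score from the pages that link to it.
    links = {
        "A": ["B", "C"],
        "B": ["C"],
        "C": ["A"],
    }

    damping = 0.85
    rank = {page: 1.0 / len(links) for page in links}

    for _ in range(50):  # iterate until the scores settle
        new_rank = {page: (1 - damping) / len(links) for page in links}
        for page, outgoing in links.items():
            share = damping * rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += share
        rank = new_rank

    for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
        print(page, round(score, 3))
    # C collects links from both A and B, so it ends up with the highest rank.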

There are numerous sources explaining Google's PageRank, including Pagerank Explained Correctly with Examples, which you can find at www.iprcom.com/papers/pagerank/ and Google's PageRank Explained and How to Make the Most of It by Phil Craven, which you can find at www.webworkshop.net/pagerank.html.

