DOS TAE 1
Application of Distributed Operating System:
Case Study on Google Search Engine
Google is a US-based corporation with its headquarters in Mountain View, California (the
Googleplex), offering Internet search and broader web applications and earning revenue largely
from advertising associated with such services. The name is a play on the word googol, the
number 10^100 (that is, 1 followed by a hundred zeros), emphasizing the sheer scale of information
available on the Internet today.
Google’s mission is to tame this huge body of information: ‘to organize the world’s information
and make it universally accessible and useful’ [www.google.com]. Google was born out of a
research project at Stanford University, with the company launched in 1998. Since then, it has
grown to have a dominant share of the Internet search market, largely due to the effectiveness of
the underlying ranking algorithm used in its search engine (discussed further below).
Significantly, Google has diversified, and as well as providing a search engine is now a major
player in cloud computing. From a distributed systems perspective, Google provides a
fascinating case study with extremely demanding requirements, particularly in terms of
scalability, reliability, performance and openness.
Google’s mission:
‘To organize the world’s information and make it universally accessible and useful’.
The role of the Google search engine is, as for any web search engine, to take a given query and
return an ordered list of the most relevant results that match that query by searching the content
of the Web. The challenges stem from the size of the Web and its rate of change, as well as the
requirement to provide the most relevant results from the perspective of its users.
The underlying search engine consists of a set of services for crawling the Web and indexing and
ranking the discovered pages.
Crawling: The task of the crawler is to locate and retrieve the
contents of the Web and pass the contents onto the indexing subsystem. This is performed by a
software service called Googlebot, which recursively reads a given web page, harvesting all the
links from that web page and then scheduling further crawling operations for the harvested links
(a technique known as deep searching that is highly effective in reaching practically all pages in
the Web). It is important for search engines to be able to report accurately on breaking news or
changing share prices. Googlebot therefore takes note of the change history of web pages and
revisits frequently changing pages with a period roughly proportional to how often the pages
change. With the introduction of Caffeine in 2010 [googleblog.blogspot.com II], Google has
moved from a batch approach to a more continuous process of crawling intended to offer more
freshness in terms of search results.
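The recursive crawling strategy described above can be sketched as a breadth-first traversal of the link graph. The following is a minimal illustration, not Googlebot’s actual implementation: it uses a small in-memory dictionary as a hypothetical stand-in for the Web, whereas a real crawler would fetch pages over HTTP, respect robots.txt, and schedule revisits by change frequency.

```python
from collections import deque

# A toy in-memory "web": page -> list of outgoing links.
# All page names here are hypothetical stand-ins for real URLs.
WEB = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com"],
    "c.com": ["a.com", "d.com"],
    "d.com": [],
}

def crawl(seed):
    """Breadth-first crawl: read a page, harvest all its links,
    and schedule every not-yet-visited link for further crawling."""
    visited = set()
    frontier = deque([seed])
    order = []
    while frontier:
        page = frontier.popleft()
        if page in visited:
            continue
        visited.add(page)
        order.append(page)                # "retrieve the contents"
        for link in WEB.get(page, []):    # harvest the links
            if link not in visited:
                frontier.append(link)     # schedule further crawling
    return order

print(crawl("a.com"))  # reaches every page transitively linked from the seed
```

Starting from a single seed, the traversal reaches practically all pages reachable through links, which is why the technique is effective at covering the Web.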
Indexing: Indexing produces what is known as an inverted index, mapping words appearing in
web pages and other textual web resources (including documents in .pdf, .doc and other formats)
onto the positions where they occur in documents, including the precise position in the document
and other relevant information such as the font size and capitalization (which is used to
determine importance, as will be seen below). The index is also sorted to support efficient
queries for words against locations. As well as maintaining an index of words, the Google search
engine also maintains an index of links, keeping track of which pages link to a given site. This is
used by the PageRank algorithm. The inverted index allows us to discover web pages that include
the search terms ‘distributed’, ‘systems’ and ‘book’ and, by careful analysis, to discover pages
that include all of these terms. For example, the search engine will be able to identify that the
three terms can all be found on amazon.com, www.cdk5.net and indeed many other web sites.
Using the index, it is therefore possible to
narrow down the set of candidate web pages from billions to perhaps tens of thousands,
depending on the level of discrimination in the keywords chosen.
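The structure of an inverted index, and the narrowing-down step it enables, can be sketched as follows. This is a simplified illustration with a hypothetical toy corpus (the document texts are invented); Google’s real index additionally records font size, capitalization and other per-occurrence metadata, and is sorted and distributed across many machines.

```python
# Toy corpus: document id -> text (hypothetical content).
DOCS = {
    "amazon.com": "distributed systems book store",
    "www.cdk5.net": "distributed systems concepts and design book",
    "news.example": "breaking news and share prices",
}

def build_index(docs):
    """Inverted index: word -> {document: [positions where it occurs]}."""
    index = {}
    for doc, text in docs.items():
        for pos, word in enumerate(text.split()):
            index.setdefault(word, {}).setdefault(doc, []).append(pos)
    return index

def query_all(index, words):
    """Candidate pages containing *every* query word: intersect the
    per-word document sets, narrowing billions down to a short list."""
    sets = [set(index.get(w, {})) for w in words]
    return set.intersection(*sets) if sets else set()

index = build_index(DOCS)
print(query_all(index, ["distributed", "systems", "book"]))
```

Note that the index also retains each word’s positions, which is what later lets the ranking stage reason about keyword proximity within a page.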
Ranking: The problem with indexing on its own is that it provides no information about the
relative importance of the web pages containing a particular set of keywords. All modern search
engines therefore place significant emphasis on a system of ranking whereby a higher rank is an
indication of the importance of a page and it is used to ensure that important pages are returned
nearer to the top of the list of results than lower-ranked pages. As mentioned above, much of the
success of Google can be traced back to the effectiveness of its ranking algorithm, PageRank
[Langville and Meyer 2006]. PageRank is inspired by the system of ranking academic papers
based on citation analysis. In the academic world, a paper is viewed as important if it has a lot of
citations by other academics in the field. Similarly, in PageRank, a page will be viewed as
important if it is linked to by a large number of other pages (using the link data mentioned
above). PageRank also goes beyond simple ‘citation’ analysis by looking at the importance of
the sites that contain links to a given page. Ranking in Google also takes a number of other
factors into account, including the proximity of keywords on a page and whether they are in a
large font or are capitalized (based on the information stored in the inverted index).
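The core idea of PageRank, that a page is important if important pages link to it, can be sketched as a power iteration over the link graph. This is a textbook-style simplification with a hypothetical four-page graph and the commonly cited damping factor of 0.85; Google’s production ranking combines PageRank with many other signals, as noted above.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iterative PageRank: each page shares its current rank among the
    pages it links to, so rank flows toward heavily cited pages, and
    links from high-ranked pages carry more weight than links from
    low-ranked ones."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start with equal ranks
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = rank[p] / len(outs)      # split rank over outlinks
                for q in outs:
                    new[q] += damping * share
            else:                                # dangling page: spread evenly
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

# Toy link graph: page -> pages it links to (hypothetical).
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(links)
# "c" is linked to by the most pages, so it comes out most important
```

This mirrors the citation analogy in the text: "c" gathers links from three pages and ends with the highest rank, while "d", which nothing links to, ends with the lowest.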
Submitted By:
Nidhi Laddha (8)