
G H Raisoni College of Engineering, Nagpur

Department of Computer Science & Engineering


Session 2015-16, Summer 2016

DOS TAE 1
Application of Distributed Operating System:
Case Study on Google Search Engine

Google is a US-based corporation with its headquarters in Mountain View, California (the
Googleplex), offering Internet search and broader web applications and earning revenue largely
from advertising associated with such services. The name is a play on the word googol, the
number 10^100 (or 1 followed by a hundred zeros), emphasizing the sheer scale of information
available on the Internet today.
Google’s mission is to tame this huge body of information: ‘to organize the world’s information
and make it universally accessible and useful’ [www.google.com]. Google was born out of a
research project at Stanford University, with the company launched in 1998. Since then, it has
grown to have a dominant share of the Internet search market, largely due to the effectiveness of
the underlying ranking algorithm used in its search engine (discussed further below).
Significantly, Google has diversified, and as well as providing a search engine it is now a major
player in cloud computing. From a distributed systems perspective, Google provides a
fascinating case study with extremely demanding requirements, particularly in terms of
scalability, reliability, performance and openness.
Google’s mission:
‘To organize the world’s information and make it universally accessible and useful.’

The role of the Google search engine is, as for any web search engine, to take a given query and
return an ordered list of the most relevant results that match that query by searching the content
of the Web. The challenges stem from the size of the Web and its rate of change, as well as the
requirement to provide the most relevant results from the perspective of its users.

A brief overview of the operation of Google search is given below:


Fig: Overall system architecture of Google

The underlying search engine consists of a set of services for crawling the Web and indexing and
ranking the discovered pages.

Crawling: The task of the crawler is to locate and retrieve the contents of the Web and pass
them on to the indexing subsystem. This is performed by a software service called Googlebot,
which recursively reads a given web page, harvesting all the links from that page and then
scheduling further crawling operations for the harvested links (a technique known as deep
searching that is highly effective in reaching practically all pages on the Web). It is important
for search engines to be able to report accurately on breaking news or changing share prices.
Googlebot therefore takes note of the change history of web pages and revisits frequently
changing pages with a period roughly proportional to how often the pages change. With the
introduction of Caffeine in 2010 [googleblog.blogspot.com II], Google moved from a batch
approach to a more continuous process of crawling, intended to offer more freshness in terms
of search results.
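The harvest-and-schedule loop described above can be sketched as a simple breadth-first crawl. This is an illustrative sketch only, not Googlebot itself; `fetch_links` is a hypothetical callback standing in for the actual page retrieval and link extraction:

```python
from collections import deque

def crawl(seed_urls, fetch_links, max_pages=1000):
    """Breadth-first crawl: retrieve each page once, harvest its links,
    and schedule the unseen ones for later retrieval."""
    frontier = deque(seed_urls)   # pages scheduled for crawling
    visited = set()               # pages already retrieved
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        for link in fetch_links(url):   # harvest links from the page
            if link not in visited:
                frontier.append(link)
    return visited
```

A real crawler would add politeness delays, revisit scheduling based on change history, and persistent queues, but the recursive reach of the technique is visible even in this toy form.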

Indexing: Indexing produces what is known as an inverted index, mapping words appearing in
web pages and other textual web resources (including documents in .pdf, .doc and other formats)
onto the positions where they occur in documents, including the precise position in the document
and other relevant information such as the font size and capitalization (which is used to
determine importance, as will be seen below). The index is also sorted to support efficient
queries for words against locations. As well as maintaining an index of words, the Google search
engine also maintains an index of links, keeping track of which pages link to a given site. This is
used by the PageRank algorithm. The inverted index allows us to discover web pages that include
the search terms ‘distributed’, ‘systems’ and ‘book’ and, by careful analysis, to discover pages
that include all of these terms. For example, the search engine will be able to identify that the
three terms can all be found in amazon.com, www.cdk5.net and indeed many other web sites.
Using the index, it is therefore possible to narrow down the set of candidate web pages from
billions to perhaps tens of thousands, depending on the level of discrimination in the keywords
chosen.
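A toy version of such an inverted index, and of intersecting posting lists to answer a multi-word query, might look as follows. This is an illustrative sketch under simplifying assumptions: it records only word positions and ignores the font-size and capitalization metadata mentioned above:

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: text}. Returns an inverted index
    {word: {doc_id: [positions]}} mapping words to where they occur."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def search(index, terms):
    """Return the doc_ids whose posting lists contain every query term."""
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings) if postings else set()
```

Intersecting the posting lists is what narrows billions of candidate pages down to the small set containing all the keywords; the stored positions additionally support the proximity analysis used later in ranking.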

Ranking: The problem with indexing on its own is that it provides no information about the
relative importance of the web pages containing a particular set of keywords. All modern search
engines therefore place significant emphasis on a system of ranking, whereby a higher rank
indicates a more important page, used to ensure that important pages are returned nearer to the
top of the list of results than lower-ranked pages. As mentioned above, much of the success of
Google can be traced back to the effectiveness of its ranking algorithm, PageRank [Langville
and Meyer 2006]. PageRank is inspired by the system of ranking academic papers based on
citation analysis: in the academic world, a paper is viewed as important if it is cited by many
other academics in the field. Similarly, in PageRank, a page is viewed as important if it is linked
to by a large number of other pages (using the link data mentioned above). PageRank also goes
beyond simple ‘citation’ analysis by looking at the importance of the sites that contain links to a
given page. Ranking in Google also takes a number of other factors into account, including the
proximity of keywords on a page and whether they are in a large font or are capitalized (based
on the information stored in the inverted index).
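The link-based idea behind PageRank can be illustrated with a small power-iteration sketch. This is a simplified rendering of the published algorithm, not Google's production implementation; the damping factor d = 0.85 follows the original paper:

```python
def pagerank(links, d=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: rank},
    where a page's rank grows with the rank of the pages linking to it."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - d) / n for p in pages}
        for page, outs in links.items():
            if outs:
                # a page shares its rank equally among its outgoing links
                share = d * rank[page] / len(outs)
                for target in outs:
                    new[target] += share
            else:
                # dangling page: spread its rank evenly over all pages
                for target in pages:
                    new[target] += d * rank[page] / n
        rank = new
    return rank
```

Note how rank flows along links: a page linked to by an already important page receives a larger share than one linked to only by obscure pages, which is exactly the refinement over raw citation counting described above.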

Submitted By:
Nidhi Laddha (8)
