We offer an overview of current Web search engine design. After introducing a generic search
engine architecture, we examine each engine component in turn. We cover crawling, local Web
page storage, indexing, and the use of link analysis for boosting search performance. The most
common design and implementation techniques for each of these components are presented.
For this presentation we draw from the literature and from our own experimental search
engine testbed. Emphasis is on introducing the fundamental concepts and the results of
several performance analyses we conducted to compare different designs.
Categories and Subject Descriptors: H.3.5 [Information Storage and Retrieval]: On-line
Information Services—Web-based services; H.3.4 [Information Storage and Retrieval]:
Systems and Software
General Terms: Algorithms, Design, Performance
Additional Key Words and Phrases: Authorities, crawling, HITS, indexing, information
retrieval, link analysis, PageRank, search engine
1. INTRODUCTION
The plentiful content of the World-Wide Web is useful to millions. Some
simply browse the Web through entry points such as Yahoo!. But many
information seekers use a search engine to begin their Web activity. In this
case, users submit a query, typically a list of keywords, and receive a list of
Web pages that may be relevant, typically pages that contain the keywords.
In this paper we discuss the challenges in building good search engines,
and describe some useful techniques.
Many of the search engines use well-known information retrieval (IR)
algorithms and techniques [Salton 1989; Faloutsos 1985]. However, IR
algorithms were developed for relatively small and coherent collections
such as newspaper articles or book catalogs in a (physical) library. The
Web, on the other hand, is massive, much less coherent, and changes much more rapidly.
The text index covers only the pages that were visited during the crawling process. As mentioned earlier, text
indexing of the Web poses special difficulties, due to its size and its rapid
rate of change. In addition to these quantitative challenges, the Web calls
for some special, less common, kinds of indexes. For example, the indexing
module may also create a structure index, which reflects the links between
pages. Such indexes would not be appropriate for traditional text collections that do not contain links. The collection analysis module is responsible for creating a variety of other indexes.
The utility index in Figure 1 is created by the collection analysis module.
For example, utility indexes may provide access to pages of a given length,
pages of a certain “importance,” or pages with some number of images in
them. The collection analysis module may use the text and structure
indexes when creating utility indexes. Section 4 examines indexing in more
detail.
During a crawling and indexing run, search engines must store the pages
they retrieve from the Web. The page repository in Figure 1 represents
this (possibly temporary) collection. Search engines sometimes maintain
a cache of the pages they have visited beyond the time required to build the
index. This cache allows them to serve out result pages very quickly, in
addition to providing basic search facilities. Some systems, such as the
Internet Archive, have aimed to maintain a very large number of pages for
permanent archival purposes. Storage at such a scale again requires
special consideration. Section 3 examines these storage-related issues.
The query engine module is responsible for receiving and filling search
requests from users. The engine relies heavily on the indexes, and some-
times on the page repository. Due to the Web’s size and the fact that users
typically only enter one or two keywords, result sets are usually very large.
Hence the ranking module has the task of sorting the results such that
results near the top are the most likely to be what the user is looking for.
The query module is of special interest because traditional information
retrieval (IR) techniques have run into selectivity problems when applied
without modification to Web searching: Most traditional techniques rely on
measuring the similarity of query texts with texts in a collection's documents. The tiny queries over vast collections that are typical for Web
search engines prevent such similarity-based approaches from filtering
sufficient numbers of irrelevant pages out of search results. Section 5
introduces search algorithms that take advantage of the Web’s interlinked
nature. When deployed in conjunction with the traditional IR techniques,
these algorithms significantly improve retrieval precision in Web search
scenarios.
In the rest of this article we describe in more detail the search engine
components we have presented. We also illustrate some of the specific
challenges that arise in each case and some of the techniques that have
been developed. Our paper is not intended to provide a complete survey of
techniques. As a matter of fact, the examples we use for illustration are
drawn mainly from our own work since it is what we know best.
Note that if we do not use idf terms in our similarity computation, the importance of a page, IS(P), can be computed with "local" information, i.e., P and Q. However, if we use idf terms, then we need global information. During the crawling process we have not seen the entire collection, so we have to estimate the idf factors from the pages that have been crawled, or from some reference idf terms computed at some other time. We use IS'(P) to refer to the estimated importance of page P, which is different from the actual importance IS(P), which can be computed only after the entire Web has been crawled.
Chakrabarti et al. [1999] present another interest-driven approach
based on a hierarchy of topics. Interest is defined by a topic, and the
crawler tries to guess the topic of pages that will be crawled (by
analyzing the link structure that leads to the candidate pages).
● Crawl & Stop: Under this model, the crawler C starts at its initial page P_0 and stops after visiting K pages. (K is a fixed number determined by the number of pages that the crawler can download in one crawl.) At this point a perfect crawler would have visited pages R_1, ..., R_K, where R_1 is the page with the highest importance value, R_2 is the next highest, and so on. We call pages R_1 through R_K the hot pages. The K pages visited by our real crawler will contain only M (≤ K) pages with rank higher than or equal to that of R_K. (Note that we need to know the exact rank of all pages in order to obtain the value M. Clearly, this estimation may not be possible until we download all pages and obtain the global image of the Web. Later, in Section 2.1.4, we restrict the entire Web to the pages in the Stanford domain and estimate the ranks of pages based on this assumption.) Then we define the performance of the crawler C to be P_CS(C) = (M · 100) / K. The performance of the ideal crawler is of course 100%. A crawler that somehow manages to visit pages entirely at random, and may revisit pages, would have a performance of (K · 100) / T, where T is the total number of pages in the Web. (Each page visited is a hot page with probability K/T. Thus, the expected number of desired pages when the crawler stops is K²/T.)
● Crawl & Stop with Threshold: We again assume that the crawler visits K pages. However, we are now given an importance target G, and any page with importance higher than G is considered hot. Let us assume that the total number of hot pages is H. Again, we assume that we know the ranks of all pages, and thus can obtain the value H. The performance of the crawler, P_ST(C), is the percentage of the H hot pages that have been visited when the crawler stops. If K < H, then an ideal crawler will have performance (K · 100) / H. If K ≥ H, then the ideal crawler has 100% performance. A purely random crawler that revisits pages is expected to visit (H/T) · K hot pages when it stops. Thus, its performance is (K · 100) / T. Only if the random crawler visits all T pages is its performance expected to be 100%. (Both performance measures are illustrated in the code sketch after this list.)
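To make the two measures concrete, the following Python sketch (our own illustration; the function and variable names are not from the paper) computes both performance values for a simulated crawl, assuming the true importance of every page is known:

def crawl_and_stop(visit_order, importance, k):
    """P_CS: percentage of the first K visited pages that are hot, i.e.,
    rank at least as high as the K-th most important page overall."""
    # The hot pages are the K pages with the highest importance values.
    hot = set(sorted(importance, key=importance.get, reverse=True)[:k])
    m = sum(1 for page in visit_order[:k] if page in hot)
    return 100.0 * m / k

def crawl_and_stop_with_threshold(visit_order, importance, k, g):
    """P_ST: percentage of the pages with importance above the target G
    that appear among the first K downloads."""
    hot = {page for page, value in importance.items() if value > g}
    visited_hot = sum(1 for page in visit_order[:k] if page in hot)
    return 100.0 * visited_hot / len(hot) if hot else 100.0

# A toy Web of six pages with hypothetical importance values.
importance = {"a": 9, "b": 7, "c": 5, "d": 3, "e": 2, "f": 1}
crawl_order = ["c", "a", "f", "b", "d", "e"]
print(crawl_and_stop(crawl_order, importance, k=3))                      # ~66.7 (M = 2 of K = 3 visits are hot)
print(crawl_and_stop_with_threshold(crawl_order, importance, k=3, g=4))  # ~66.7 (2 of the H = 3 hot pages visited)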
Fig. 2. The performance of various ordering metrics for IB(P); G = 100. (The plot shows the percentage of hot pages acquired as a function of the percentage of pages crawled, with one curve for each of the PageRank, backlink, breadth-first, and random ordering metrics.)
The straight line in the graph (Figure 2) shows the expected performance of a random crawler.
From the graph, we can clearly see that an appropriate ordering metric can significantly improve the performance of the crawler. For example, when the crawler used IB'(P) (backlink) as its ordering metric, it downloaded more than 50% of the hot pages after visiting less than 20% of the entire Web. This is a significant improvement compared to a random crawler or a breadth-first crawler, which had downloaded less than 30% of the hot pages at the same point. One interesting result of this experiment is that the PageRank ordering metric, IR'(P), shows better performance than the backlink ordering metric IB'(P), even when the importance metric is IB(P). This is due to the inheritance property of the PageRank metric, which can help avoid downloading "locally popular" pages before "globally popular but locally unpopular" pages. In additional experiments [Cho et al. 1998] (not described here), we studied other metrics and also observed that the right ordering metric can significantly improve crawler performance.
(By up-to-date we mean that the content of a local page equals that of its
real-world counterpart.) Then, the freshness of the local collection S at time
t is
F(S;t) = \frac{1}{N} \sum_{i=1}^{N} F(e_i; t).

The freshness is the fraction of the local collection that is up-to-date. For instance, F(S;t) will be one if all local pages are up-to-date, and F(S;t) will be zero if all local pages are out-of-date.
Age: To capture "how old" the collection is, we define the metric age as follows:

Definition 2. The age of the local page e_i at time t is A(e_i;t) = 0 if e_i is up-to-date at time t, and t minus the modification time of the corresponding real-world page otherwise. The age of the local collection S at time t is then

A(S;t) = \frac{1}{N} \sum_{i=1}^{N} A(e_i; t).

The age of S tells us the average "age" of the local collection. For instance, if all real-world pages changed one day ago and we have not refreshed them since, A(S;t) is one day.
Obviously, the freshness (and age) of the local collection may change over
time. For instance, the freshness might be 0.3 at one point in time and 0.6 at another. Because of this possible fluctuation,
we now compute the average freshness over a long period of time and use
this value as the “representative” freshness of a collection.
Definition 3. We define the time average of freshness of page e_i, \bar{F}(e_i), and the time average of freshness of collection S, \bar{F}(S), as

\bar{F}(e_i) = \lim_{t \to \infty} \frac{1}{t} \int_0^t F(e_i; t)\, dt, \qquad \bar{F}(S) = \lim_{t \to \infty} \frac{1}{t} \int_0^t F(S; t)\, dt.
Fig. 3. An example change history of pages e_1 (nine changes per day) and e_2 (one change per day) over one day.
Depending on the page refresh strategy, this limited page download resource will be allocated to different pages in different ways. For example, the proportional refresh policy will allocate the download resource proportionally to the page change rate.
To illustrate the issues, consider a very simple example. Suppose that the crawler maintains a collection of two pages: e_1 and e_2. Page e_1 changes nine times per day and e_2 changes once a day. Our goal is to maximize the freshness of the database averaged over time. In Figure 3, we illustrate our simple model. For page e_1, one day is split into nine intervals, and e_1 changes once and only once in each interval. However, we do not know exactly when the page changes within an interval. Page e_2 changes once and only once per day, but we do not know precisely when.
Because our crawler is a tiny one, assume that we can refresh one page per day. Then what page should it refresh? Should the crawler refresh e_1 or should it refresh e_2? To answer this question, we need to compare how the freshness changes if we pick one page over the other. If page e_2 changes in the middle of the day and we refresh e_2 right after the change, it will remain up-to-date for the remaining half of the day. Therefore, by refreshing page e_2 we get a 1/2-day "benefit" (or freshness increase). However, the probability that e_2 changes before the middle of the day is 1/2, so the "expected benefit" of refreshing e_2 is 1/2 × 1/2 day = 1/4 day. By the same reasoning, if we refresh e_1 in the middle of an interval, e_1 will remain up-to-date for the remaining half of the interval (1/18 of a day) with probability 1/2. Therefore, the expected benefit is 1/2 × 1/18 day = 1/36 day. From this crude estimation, we can see that it is more effective to select e_2 for refresh!
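The same reasoning can be written out in general (our own restatement, not a formula from the paper): for a page that changes exactly once in each of its n equal intervals per day, a single refresh in the middle of an interval has expected freshness benefit

\mathbb{E}[\text{benefit}] \;=\; \underbrace{\frac{1}{2}}_{\Pr[\text{already changed}]} \times \underbrace{\frac{1}{2n}\ \text{day}}_{\text{time left up-to-date}} \;=\; \frac{1}{4n}\ \text{day},

which gives 1/4 day for e_2 (n = 1) and 1/36 day for e_1 (n = 9), matching the numbers above.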
Of course, in practice, we do not know for sure that pages will change in a given interval. Furthermore, we may also want to worry about the age of data. (In our example, if we always visit e_2, the age of e_1 will grow indefinitely.)
In Cho and Garcia-Molina [2000c], we studied a more realistic scenario,
using the Poisson process model. In particular, we can mathematically
prove that the uniform policy is always superior or equal to the proportional one for any number of pages, change frequencies, and refresh rates,
and for both the freshness and the age metrics when page changes follow
Poisson processes. For a detailed proof, see Cho and Garcia-Molina [2000c].
Fig. 4. Change frequency vs. refresh frequency for freshness optimization. (Axes: page change frequency λ on the horizontal axis, refresh frequency f on the vertical axis.)
2.3 Conclusion
In this section we discussed the challenges that a crawler encounters when
it downloads large collections of pages from the Web. In particular, we
studied how a crawler should select and refresh the pages that it retrieves
and maintains.
There are, of course, still many open issues. For example, it is not clear
how a crawler and a Web site can negotiate or agree on an appropriate crawling policy, so that the crawler does not interfere with the primary operation of
the site while downloading the pages on the site. Also, existing work on
crawler parallelization is either ad hoc or quite preliminary, so we believe
this issue needs to be carefully studied. Finally, some of the information on
the Web is now “hidden” behind a search interface, where a query must be
submitted or a form filled out. Current crawlers cannot generate queries or
fill out forms, so they cannot visit the “dynamic” content. This problem will
get worse over time, as more and more sites generate their Web pages from
databases.
3. STORAGE
The page repository in Figure 1 is a scalable storage system for managing
large collections of Web pages. As shown in the figure, the repository needs
to perform two basic functions. First, it must provide an interface for the
crawler to store pages. Second, it must provide an efficient access API that
the indexer and collection analysis modules can use to retrieve the pages.
In the rest of the section, we present some key issues and techniques for
such a repository.
3.1 Challenges
A repository manages a large collection of “data objects,” namely, Web
pages. In that sense, it is conceptually quite similar to other systems that
store and manage data objects (e.g., file systems and database systems).
However, a Web repository does not have to provide a lot of the functionality that the other systems provide (e.g., transactions, logging, directory
structure) and, instead, can be targeted to address the following key
challenges:
Scalability: It must be possible to seamlessly distribute the repository
across a cluster of computers and disks, in order to cope with the size of the
Web (see Section 1).
Dual access modes: The repository must support two different access
modes equally efficiently. Random access is used to quickly retrieve a
specific Web page, given the page’s unique identifier. Streaming access is
used to receive the entire collection, or some significant subset, as a stream
of pages. Random access is used by the query engine to serve out cached
copies to the end-user. Streaming access is used by the indexer and analysis modules to process and analyze pages in bulk (a minimal interface sketch follows this list of challenges).
Large bulk updates: Since the Web changes rapidly (see Section 1), the
repository needs to handle a high rate of modifications. As new versions of
Web pages are received from the crawler, the space occupied by old versions
must be reclaimed¹ through space compaction and reorganization. In addition, excessive conflicts between the update process and the applications accessing pages must be avoided.

¹ Some repositories might maintain a temporal history of Web pages by storing multiple versions for each page. We do not consider this here.
Obsolete pages: In most file or data systems, objects are explicitly deleted
when no longer needed. However, when a Web page is removed from a Web
site, the repository is not notified. Thus, the repository must have a
mechanism for detecting and removing obsolete pages.
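To make the dual access modes concrete, here is a minimal in-memory sketch of a repository interface (our own illustration; the class and method names are assumptions, not the actual WebBase API):

from typing import Dict, Iterator, Optional, Tuple

class PageRepository:
    """Toy stand-in for the page repository: random access by page
    identifier for the query engine, streaming access for the indexer
    and collection analysis modules."""

    def __init__(self) -> None:
        self._pages: Dict[str, str] = {}

    def store(self, page_id: str, content: str) -> None:
        # Called by the crawler; a real repository would write to a
        # distributed, disk-based store and later reclaim the space
        # occupied by old versions.
        self._pages[page_id] = content

    def get(self, page_id: str) -> Optional[str]:
        # Random access: fetch a single cached page for an end user.
        return self._pages.get(page_id)

    def stream(self) -> Iterator[Tuple[str, str]]:
        # Streaming access: hand the whole collection to the indexer.
        yield from self._pages.items()

repo = PageRepository()
repo.store("page-1", "<html>stanford</html>")
print(repo.get("page-1"))
for page_id, content in repo.stream():
    pass   # the indexer would extract postings from each page here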
[Figure: distributed repository architecture; a node manager, update nodes, and read nodes connected over a LAN to the crawler, indexer, and analysis modules.]
3.4 Conclusion
The page repository is an important component of the overall Web search
architecture. It must support the different access patterns of the query
engine (random access) and the indexer modules (streaming access) efficiently.
4. INDEXING
The indexer and collection analysis modules in Figure 1 build a variety of
indexes on the collected pages. The indexer module builds two basic
indexes: a text (or content) index and a structure (or link) index. Using
these two indexes and the pages in the repository, the collection analysis
module builds a variety of other useful indexes. We present a short
description of each type of index, concentrating on their structure and use,
as follows:
Link index: To build a link index, the crawled portion of the Web is
modeled as a graph with nodes and edges. Each node in the graph is a Web
page, and a directed edge from node A to node B represents a hypertext
link in page A that points to page B . An index on the link structure must be
a scalable and efficient representation of this graph.
The most common structural information used by search algorithms [Brin and Page 1998; Kleinberg 1999] is neighborhood information: given a page P, retrieve the set of pages pointed to by P (outgoing links) or the set of pages pointing to P (incoming links). Disk-based adjacency list representations [Aho et al. 1983] of the original Web graph and of the inverted Web graph² can efficiently provide access to such neighborhood information. Other structural properties of the Web graph can be easily derived from the basic information stored in these adjacency lists. For example, the notion of sibling pages is often used as the basis for retrieving pages "related" to a given page (see Section 5). Such sibling information can be easily derived from the pair of adjacency list structures described above.
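A minimal in-memory version of this pair of adjacency list structures (our own sketch; a production link index would be disk-based and compressed) might look as follows:

from collections import defaultdict

class LinkIndex:
    """Adjacency lists for the Web graph and the inverted Web graph."""

    def __init__(self):
        self.outgoing = defaultdict(set)   # page -> pages it points to
        self.incoming = defaultdict(set)   # page -> pages that point to it

    def add_link(self, src, dst):
        self.outgoing[src].add(dst)
        self.incoming[dst].add(src)

    def siblings(self, page):
        """Pages that share at least one parent with `page`; such
        co-cited pages are often treated as 'related' (see Section 5)."""
        related = set()
        for parent in self.incoming[page]:
            related |= self.outgoing[parent]
        related.discard(page)
        return related

index = LinkIndex()
index.add_link("hub.html", "a.html")
index.add_link("hub.html", "b.html")
print(index.siblings("a.html"))   # {'b.html'}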
Small graphs of hundreds or even thousands of nodes can be represented
efficiently by any one of a variety of well-known data structures [Aho et al.
1983]. However, doing the same for a graph with several million nodes and billions of edges is a far more challenging problem.
² In the inverted Web graph, the direction of each hypertext link is reversed.
4.2 Challenges
Conceptually, building an inverted index involves processing each page to
extract postings, sorting the postings first on index terms and then on
locations, and finally writing out the sorted postings as a collection of
inverted lists on disk. For relatively small and static collections, as in the
environments traditionally targeted by information retrieval (IR) systems,
index-building times are not very critical. However, when dealing with
Web-scale collections, naive index-building schemes become unmanageable
and require huge resources, often taking days to complete. As a measure of
comparison with traditional IR systems, our 40-million page WebBase
repository (Section 3.3) represents less than 4% of the publicly indexable
Web but is already larger than the 100 GB very large TREC-7 collection
[Hawking and Craswell 1998], which is the benchmark for large IR systems.
In addition, since content on the Web changes rapidly (see Section 1),
periodic crawling and rebuilding of the index is necessary to maintain
“freshness.” Full index rebuilds become necessary because most incremental
index update techniques perform poorly when confronted with the huge
wholesale changes commonly observed between successive crawls of the
Web [Melnik et al. 2001].
Finally, storage formats for the inverted index must be carefully designed. A small compressed index improves query performance by allowing large portions of the index to be cached in memory. However, there is a tradeoff between this performance gain and the corresponding decompression overhead at query time [Moffat and Bell 1995; Anh and Moffat 1998; Witten et al. 1999]. Achieving the right balance becomes extremely challenging when dealing with Web-scale collections.
³ We discuss the statistician later, in Section 4.4.3.
[Figure 7: the three phases of the index-builder; pages from the page stream are loaded into memory, tokenized and stripped into postings (e.g., cat, dog, rat with their locations), sorted in memory, and flushed as sorted runs to disk.]
The WebBase indexing system builds the inverted index in two stages. In
the first stage, each distributor node runs a distributor process that
disseminates the pages to the indexers using the streaming access mode
provided by the repository. Each indexer receives a mutually disjoint
subset of pages and their associated identifiers. The indexers parse and
extract postings from the pages, sort the postings in memory, and flush
them to intermediate structures (sorted runs) on disk.
In the second stage, these intermediate structures are merged to create
one or more inverted files and their associated lexicons. An (inverted
file, lexicon) pair is generated by merging a subset of the sorted runs.
Each pair is transferred to one or more query servers depending on the
degree of index replication.
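The second stage is essentially a multi-way merge of sorted runs. The sketch below (our own simplification; the actual merger also builds the lexicon and applies compression) merges runs of (term, page_id, location) postings into a single inverted file held in memory:

import heapq
from collections import defaultdict

def merge_sorted_runs(runs):
    """Merge sorted runs of (term, page_id, location) postings into an
    inverted file: term -> list of (page_id, location) in sorted order."""
    inverted = defaultdict(list)
    # heapq.merge streams the runs in globally sorted order without
    # loading all of them into memory at once.
    for term, page_id, location in heapq.merge(*runs):
        inverted[term].append((page_id, location))
    return inverted

run_1 = [("cat", 1, 2), ("dog", 1, 3), ("dog", 2, 2)]
run_2 = [("cat", 3, 1), ("rat", 4, 2)]
inverted_file = merge_sorted_runs([run_1, run_2])
print(inverted_file["dog"])   # [(1, 3), (2, 2)]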
4.4.2 Parallelizing the Index-Builder. The core of our indexing engine is
the index-builder process that executes on each indexer. We demonstrate
below that this process can be effectively parallelized by structuring it as a
software pipeline.
The input to the index-builder is a sequence of Web pages and their
associated identifiers. The output of the index-builder is a set of sorted
runs, each containing postings extracted from a subset of the pages. The
process of generating these sorted runs can logically be split into three
phases as illustrated in Figure 7. We refer to these phases as loading,
processing, and flushing. During the loading phase, some number of pages
are read from the input stream and stored in memory. The processing
phase involves two steps. First, the pages are parsed, tokenized into
individual terms, and stored as a set of postings in a memory buffer. In the
second step, the postings are sorted in-place, first by term and then by
location. During the flushing phase, the sorted postings in the memory
buffer are saved on disk as a sorted run. These three phases are executed
repeatedly until the entire input stream of pages has been consumed.
Loading, processing, and flushing tend to use disjoint sets of system
resources. Processing is obviously CPU-intensive, whereas flushing primarily stresses secondary storage, and loading can be done directly from the
network or a separate disk. Hence indexing performance can be improved
by executing these three phases concurrently (see Figure 8). Since the
execution order of loading, processing, and flushing is fixed, these three
phases together form a software pipeline.
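The pipeline can be sketched with three threads connected by bounded queues (our own illustration of the structure, not WebBase code; note that CPython's global interpreter lock limits true parallelism for the CPU-bound processing phase, but the phase overlap is the same):

import queue
import threading

def load(page_batches, to_process):
    for batch in page_batches:          # loading: read pages into memory
        to_process.put(batch)
    to_process.put(None)                # signal the end of the stream

def process(to_process, to_flush):
    while (batch := to_process.get()) is not None:
        postings = sorted((term, page_id)              # processing: tokenize and
                          for page_id, text in batch   # sort postings in memory
                          for term in text.split())
        to_flush.put(postings)
    to_flush.put(None)

def flush(to_flush, sorted_runs):
    while (run := to_flush.get()) is not None:
        sorted_runs.append(run)         # flushing: write a sorted run to disk

page_batches = [[(1, "cat dog"), (2, "dog rat")], [(3, "cat rat")]]
to_process, to_flush, sorted_runs = queue.Queue(2), queue.Queue(2), []
threads = [threading.Thread(target=load, args=(page_batches, to_process)),
           threading.Thread(target=process, args=(to_process, to_flush)),
           threading.Thread(target=flush, args=(to_flush, sorted_runs))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted_runs)   # two sorted runs of (term, page_id) postings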
Fig. 8. Multithreaded execution of the index-builder. (Three indexing threads run the loading (L), processing (P), and flushing (F) phases concurrently; after a start-up period in which some resources are wasted, the pipeline settles into a steady state with period 3*p and near-optimal use of resources.)
[Figure: index-builder performance as a function of buffer size; loading is the bottleneck for small buffer sizes and processing for large ones.]
Even though the predicted optimum buffer size differs slightly from the observed optimum, the difference in running times between the two sizes is less than 15 minutes for a 5 million page collection. Figure 10
shows how pipelining impacts the time taken to process and generate
sorted runs for a variety of input sizes. Note that for small collections of
pages, the performance gain through pipelining, though noticeable, is not
substantial. This is because small collections require very few pipeline
executions and the overall time is dominated by the time required at
start-up (to load up the buffers) and shut-down (to flush the buffers). Our
experiments showed that in general, for large collections, a sequential index-builder is about 30–40% slower than a pipelined index-builder.
4.4.3 Efficient Global Statistics Collection. As mentioned in Section 4, term-level⁴ statistics are often used to rank the search results of a query. For example, one of the most commonly used statistics is inverse document frequency, or IDF. The IDF of a term w is defined as log(N / df_w), where N is the total number of pages in the collection and df_w is the number of pages that contain at least one occurrence of w [Salton 1989]. In a distributed indexing system, when the indexes are built and stored on a collection of machines, gathering global (i.e., collection-wide) term-level statistics with minimum overhead becomes an important issue [Viles and French 1995].
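For instance, if the pages are partitioned across query servers, collection-wide IDF values can be assembled from per-server document frequencies as in the sketch below (our own illustration; the function names are assumptions):

import math
from collections import Counter

def global_idf(local_dfs, local_page_counts):
    """Combine per-server document frequencies into collection-wide IDF.
    Because each page is stored on exactly one server, the local df
    values for a term simply add up."""
    total_pages = sum(local_page_counts)
    global_df = Counter()
    for dfs in local_dfs:
        global_df.update(dfs)
    return {term: math.log(total_pages / df) for term, df in global_df.items()}

server_1 = {"stanford": 120, "web": 900}
server_2 = {"stanford": 80, "web": 1100}
idf = global_idf([server_1, server_2], local_page_counts=[1000, 1000])
print(round(idf["stanford"], 2))   # log(2000 / 200), approximately 2.3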
Some authors suggest computing global statistics at query time. This would require an extra round of communication among the query servers to exchange local statistics.⁵ However, this communication adversely impacts query response times.
⁴ Term-level refers to the fact that any gathered statistic describes only single terms, and not higher-level entities such as pages or Web sites.
⁵ By local statistics we mean the statistics that a query server can deduce from the portion of the index that is stored on that node.
[Figure: overhead of global statistics collection as a function of the number of pages indexed (in 100,000's).]
⁶ The relative overhead of a strategy is given by (T_2 − T_1) / T_1, where T_2 is the time for full index creation with statistics collection using that strategy and T_1 is the time for full index creation with no statistics collection.
[Figure: gathering term-level statistics; the indexers send term counts from their local hash tables to the statistician, which aggregates them into collection-wide values (e.g., cat: 4, dog: 4, rat: 2).]
Our studies show that the relative overheads of both strategies are
acceptably small (less than 5% for a 2 million page collection) and exhibit
sublinear growth with increase in collection size. This indicates that
centralized statistics collection is feasible even for very large collections.
Table II summarizes the characteristics of the FL and ME statistics-
gathering strategies.
4.5 Conclusion
The fundamental issue in indexing Web pages, when compared with
indexing in traditional applications and systems, is scale. Instead of
representing hundred- or thousand-node graphs, we need to represent
graphs with millions of nodes and billions of edges. Instead of inverting 2- or 3-gigabyte collections with a few hundred thousand documents, we need
to build inverted indexes over millions of pages and hundreds of gigabytes.
This requires careful rethinking and redesign of traditional indexing archi-
tectures and data structures to achieve massive scalability.
In this section we provided an overview of the different indexes that are
normally used in a Web search service. We discussed how the fundamental
structure and content indexes as well as other utility indexes (such as the
PageRank index) fit into the overall search architecture. In this context, we
illustrated some of the techniques that we have developed as part of the
Stanford WebBase project to achieve high scalability in building inverted
indexes.
There are a number of challenges that still need to be addressed.
Techniques for incremental update of inverted indexes that can handle the
massive rate of change in Web content are yet to be developed. As new
indexes and ranking measures are invented, techniques to allow such
measures to be computed over massive distributed collections need to be
developed. At the other end of the spectrum, with the increasing importance of personalization, the ability to build some of these indexes and
measures on a smaller scale (customized for individuals or small groups of
users and using limited resources) also becomes important. For example,
Haveliwala [1999] discusses some techniques for efficiently evaluating
PageRank on modestly equipped machines.
5.1 PageRank
Page et al. [1998] define a global ranking scheme, called PageRank, which
tries to capture the notion of the “importance” of a page. For instance, the
Yahoo! homepage is intuitively more important than the homepage of the Stanford Database Group.
r(i) = \sum_{j \in B(i)} r(j) / N(j).

The division by N(j) captures the intuition that pages that point to page i evenly distribute their rank boost to all of the pages they point to. In the language of linear algebra [Golub and Van Loan 1989], this can be written as r = A^T r, where r is the m × 1 vector [r(1), r(2), ..., r(m)], and the elements a_{i,j} of the matrix A are given by a_{i,j} = 1 / N(i) if page i points to page j, and a_{i,j} = 0 otherwise. Thus the PageRank vector r is the eigenvector of the matrix A^T corresponding to the eigenvalue 1. Since the graph is strongly connected, it can be shown that 1 is an eigenvalue of A^T, and the eigenvector r is uniquely defined (when a suitable normalization is performed and the vector is nonnegative).
Fig. 13. (a) Simple PageRank; (b) modified PageRank with d = 0.8.
⁷ Power iteration is guaranteed to converge only if the graph is aperiodic. (A strongly connected directed graph is aperiodic if the greatest common divisor of the lengths of all closed walks is 1.) In practice, the Web is always aperiodic.
r(i) = d \cdot \sum_{j \in B(i)} r(j) / N(j) + (1 - d) / m,

where m is the total number of nodes in the graph. Note that simple PageRank (Section 5.1.1) is a special case that occurs when d = 1.
In the random surfer model, the modification models the surfer occasionally getting “bored” and making a jump to a random page on the Web
(instead of following a random link from the current page). The decay factor
d dictates how often the surfer gets bored.
Figure 13(b) shows the modified PageRank (for d = 0.8) for the graph of Figure 13(a) with the link 5 → 1 removed. Nodes 4 and 5 now have higher ranks than the other nodes, indicating that surfers will tend to gravitate to 4 and 5. However, the other nodes have nonzero ranks.
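A compact power-iteration implementation of this modified PageRank (our own sketch; leak nodes and the convergence test are handled very simply, and the example graph is ours rather than the exact graph of Figure 13) is:

def pagerank(outgoing, d=0.8, iterations=50):
    """Power iteration for modified PageRank. `outgoing` maps each node
    to the list of nodes it links to; d = 1 recovers simple PageRank on
    a strongly connected, aperiodic graph."""
    nodes = list(outgoing)
    m = len(nodes)
    r = {i: 1.0 / m for i in nodes}
    for _ in range(iterations):
        new_r = {i: (1 - d) / m for i in nodes}   # the random-jump term
        for j, targets in outgoing.items():
            if not targets:
                continue                # leak node: its rank is not propagated
            share = d * r[j] / len(targets)
            for i in targets:
                new_r[i] += share
        r = new_r
    return r

# A small example graph: nodes 4 and 5 link only to each other, so they
# accumulate rank, much like the discussion of Figure 13(b) above.
graph = {1: [2, 3], 2: [4], 3: [5], 4: [5], 5: [4]}
print(pagerank(graph, d=0.8))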
5.1.5 Computational Issues. In order for the power iteration to be practical, it is not only necessary that it converge to the PageRank, but that it do so in a relatively small number of iterations.
⁸ Thus, leak nodes get no PageRank. An alternative is to assume that leak nodes have links back to all the pages that point to them. This way, leak nodes that are reachable via high-rank pages will have a higher rank than leak nodes reachable through unimportant pages.
5.2 HITS
In this section we describe another important link-based search algorithm,
HITS (Hypertext Induced Topic Search). This algorithm was first proposed
by Kleinberg [1999]. In contrast to the PageRank technique, which assigns
a global rank to every page, the HITS algorithm is a query-dependent
ranking technique. Moreover, instead of producing a single ranking score,
the HITS algorithm produces two—the authority and the hub scores.
Authority pages are those most likely to be relevant to a particular query.
For instance, the Stanford University homepage is an authority page for
the query “Stanford University”, while a page that discusses the weather at
Stanford would be less so. The hub pages are pages that are not necessarily
authorities themselves but point to several authority pages. For instance,
the page “searchenginewatch.com” is likely to be a good hub page for the query “search engines.”
⁹ Google also uses other text-based techniques to enhance the quality of results. For example, anchor text is considered part of the page pointed at. That is, if a link in page A with anchor “Stanford University” points to page B, the text “Stanford University” will be considered part of B and may even be weighted more heavily than the actual text in B.
The algorithm takes as input the query string and two parameters t and
d . Parameter t limits the size of the root set, while parameter d limits the
number of pages added to the focused subgraph. The latter control limits
the influence of an extremely popular page like www.yahoo.com if it were
to appear in the root set.¹² The expanded set S should be rich in authorities, since it is likely that an authority is pointed to by at least one page in the root set. Likewise, a lot of good hubs are also likely to be included in S.

¹⁰ Interestingly, mutually reinforcing relationships have been identified and exploited for other Web tasks; see for instance Brin [1998].
¹¹ In Kleinberg [1999], Altavista is used to construct the root set in the absence of a local text index.
Link analysis. The link analysis phase of the HITS algorithm uses the mutually reinforcing property to identify the hubs and authorities from the expanded set S. (Note that this phase is oblivious to the query that was used to derive S.) Let the pages in the focused subgraph S be denoted 1, 2, ..., n. Let B(i) denote the set of pages that point to page i. Let F(i) denote the set of pages that page i points to. The link analysis algorithm produces an authority score a_i and a hub score h_i for each page in set S. To begin with, the authority scores and the hub scores are initialized to arbitrary values. The algorithm is an iterative one and it performs two kinds of operations in each step, called I and O. In the I operation, the authority score of each page is updated to the sum of the hub scores of all pages pointing to it. In the O step, the hub score of each page is updated to the sum of the authority scores of all pages that it points to. That is,

I step: a_i = \sum_{j \in B(i)} h_j

O step: h_i = \sum_{j \in F(i)} a_j
The I and the O steps capture the intuition that a good authority is
pointed to by many good hubs and a good hub points to many good
authorities. Note incidentally that a page can be, and often is, both a hub
and an authority. The HITS algorithm just computes two scores for each
page, the hub score and the authority score. The algorithm iteratively
repeats the I and O steps, with normalization, until the hub and authority
scores converge:
(1) Initialize a_i, h_i (1 ≤ i ≤ n) to arbitrary values
(2) Repeat until convergence
    (a) Apply the I operation
    (b) Apply the O operation
    (c) Normalize so that \sum_i a_i^2 = 1 and \sum_i h_i^2 = 1
(3) End
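The iteration can be written compactly as follows (our own sketch; for simplicity the convergence test is replaced by a fixed number of iterations):

import math

def hits(incoming, outgoing, iterations=50):
    """Authority and hub scores for the pages of a focused subgraph S,
    given B(i) (incoming links) and F(i) (outgoing links)."""
    nodes = list(incoming)
    a = {i: 1.0 for i in nodes}
    h = {i: 1.0 for i in nodes}
    for _ in range(iterations):
        # I step: authority score = sum of hub scores of pages pointing to i.
        a = {i: sum(h[j] for j in incoming[i]) for i in nodes}
        # O step: hub score = sum of authority scores of pages i points to.
        h = {i: sum(a[j] for j in outgoing[i]) for i in nodes}
        # Normalize so that the squared scores sum to one.
        norm_a = math.sqrt(sum(x * x for x in a.values())) or 1.0
        norm_h = math.sqrt(sum(x * x for x in h.values())) or 1.0
        a = {i: x / norm_a for i, x in a.items()}
        h = {i: x / norm_h for i, x in h.items()}
    return a, h

# Toy focused subgraph: pages 1-3 act as hubs pointing at pages 4 and 5.
outgoing = {1: [4], 2: [4, 5], 3: [5], 4: [], 5: []}
incoming = {i: [j for j in outgoing if i in outgoing[j]] for i in outgoing}
authority, hub = hits(incoming, outgoing)
print(max(authority, key=authority.get), max(hub, key=hub.get))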
An example of hub and authority calculation is shown in Figure 14. For example, the authority score of node 5 can be obtained by adding the hub scores of the nodes that point to 5 (i.e., 0.408 + 0.816 + 0.408) and dividing this value by \sqrt{(0.816)^2 + (1.632)^2 + (0.816)^2}.
¹² Several heuristics can also be used to eliminate nonuseful links in the focused subgraph. The reader is referred to Kleinberg [1999] for more details.
Fig. 14. HITS algorithm. (In the example graph, nodes 1, 2, and 3 have hub scores h = 0.408, 0.816, and 0.408, respectively, and authority scores a = 0.)
6. CONCLUSION
Searching the World-Wide Web successfully is the basis for many of our
information tasks today. Hence, search engines are increasingly being
relied upon to extract just the right information from a vast number of Web
pages. The engines are being asked to accomplish this task with minimal
input from users, usually just one or two keywords.
In Figure 1 we show how such engines are put together. Several main
functional blocks make up the typical architecture. Crawlers travel the
Web, retrieving pages. These pages are stored locally, at least until they
can be indexed and analyzed. A query engine then retrieves URLs that
seem relevant to user queries. A ranking module attempts to sort these
returned URLs such that the most promising results are presented to the
user first.
This simple set of mechanisms requires a substantial underlying design
effort. Much of the design and implementation complexity stems from the
Web’s vast scale. We explained how crawlers, with their limited resources
in bandwidth and storage, must use heuristics to ensure that the most
desirable pages are visited and that the search engine’s knowledge of
existing Web pages stays current.
We have shown how the large-scale storage of Web pages in search
engines must be organized to match a search engine’s crawling strategies.
Such local Web page repositories must also enable users to access pages
randomly and have the entire collection streamed to them.
The indexing process, while studied extensively for smaller, more homogeneous collections, requires new thinking when applied to the many
millions of Web pages that search engines must examine. We discussed how
indexing can be parallelized and how needed statistics can be computed
during the indexing process.
Fortunately, the interlinked nature of the Web offers special opportunities for enhancing search engine performance. We introduced the notion of
PageRank, a variant of the traditional citation count. The Web’s link graph
is analyzed and the number of links pointing to a page is taken as an
indicator of that page’s value. The HITS algorithm, or “Hubs and Authorities,” is another technique that takes advantage of Web linkage. This
algorithm classifies the Web into pages that primarily function as major
information sources for particular topics (authority pages) and other pages
that primarily point readers to authority pages (hubs). Both PageRank and
HITS are used to boost search selectivity by identifying “important” pages
through link analysis.
A substantial amount of work remains to be accomplished, as search
engines hustle to keep up with the ever expanding Web. New media, such
as images and video, pose new challenges for search and storage. We offer
an introduction into current search engine technologies and point to several
upcoming new directions.
ACKNOWLEDGMENTS
We would like to thank Gene Golub and the referees for many useful
suggestions that improved the paper.
REFERENCES
AHO, A., HOPCROFT, J., AND ULLMAN, J. 1983. Data Structures and Algorithms.
Addison-Wesley, Reading, MA.
ALBERT, R., BARABASI, A.-L., AND JEONG, H. 1999. Diameter of the World Wide Web. Nature
401, 6749 (Sept.).
AMENTO, B., TERVEEN, L., AND HILL, W. 2000. Does authority mean quality? Predicting expert
quality ratings of web documents. In Proceedings of the 23rd Annual International ACM
SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New
York, NY.
ANH, V. N. AND MOFFAT, A. 1998. Compressed inverted files with reduced decoding overheads.
In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval (SIGIR ’98, Melbourne, Australia, Aug. 24 –28), W. B.
Croft, A. Moffat, C. J. van Rijsbergen, R. Wilkinson, and J. Zobel, Chairs. ACM Press, New
York, NY, 290 –297.
BAR-YOSSEF, Z., BERG, A., CHIEN, S., AND WEITZ, J. F. D. 2000. Approximating aggregate
queries about web pages via random walks. In Proceedings of the 26th International
Conference on Very Large Data Bases.
BARABASI, A.-L. AND ALBERT, R. 1999. Emergence of scaling in random networks. Science 286,
5439 (Oct.), 509 –512.
BHARAT, K. AND BRODER, A. 1999. Mirror, mirror on the web: A study of host pairs with
replicated content. In Proceedings of the Eighth International Conference on The World-
Wide Web.
BHARAT, K., BRODER, A., HENZINGER, M., KUMAR, P., AND VENKATASUBRAMANIAN, S. 1998. The
connectivity server: fast access to linkage information on the Web. Comput. Netw. ISDN
Syst. 30, 1-7, 469–477.
BRIN, S. 1998. Extracting patterns and relations from the world wide web. In Proceedings of
the Sixth International Conference on Extending Database Technology (Valencia, Spain,
Mar.), H. -J. Schek, F. Saltor, I. Ramos, and G. Alonso, Eds.
BRIN, S. AND PAGE, L. 1998. The anatomy of a large-scale hypertextual Web search engine.
Comput. Netw. ISDN Syst. 30, 1-7, 107–117.
BRODER, A., KUMAR, R., MAGHOUL, F., RAGHAVAN, P., RAJAGOPALAN, S., STATA, R., TOMKINS, A.,
AND WIENER, J. 2000. Graph structure in the web: experiments and models. In Proceedings
of the Ninth International Conference on The World Wide Web.
CHAKRABARTI, S., DOM, B., GIBSON, D., KUMAR, S. R., RAGHAVAN, P., RAJAGOPALAN, S., AND
TOMKINS, A. 1998a. Spectral filtering for resource discovery. In Proceedings of the ACM
SIGIR Workshop on Hypertext Information Retrieval on the Web (Melbourne,
Australia). ACM Press, New York, NY.
CHAKRABARTI, S., DOM, B., AND INDYK, P. 1998b. Enhanced hypertext categorization using
hyperlinks. SIGMOD Rec. 27, 2, 307–318.
CHAKRABARTI, S., DOM, B., RAGHAVAN, P., RAJAGOPALAN, S., GIBSON, D., AND KLEINBERG, J.
1998c. Automatic resource compilation by analyzing hyperlink structure and associated
text. In Proceedings of the Seventh International Conference on The World Wide Web
(WWW7, Brisbane, Australia, Apr. 14 –18), P. H. Enslow and A. Ellis, Eds. Elsevier Sci.
Pub. B. V., Amsterdam, The Netherlands, 65–74.
CHAKRABARTI, S. AND MUTHUKRISHNAN, S. 1996. Resource scheduling for parallel database and
scientific applications. In Proceedings of the 8th Annual ACM Symposium on Parallel
Algorithms and Architectures (SPAA ’96, Padua, Italy, June 24 –26), G. E. Blelloch,
Chair. ACM Press, New York, NY, 329 –335.
CHAKRABARTI, S., VAN DEN BERG, M., AND DOM, B. 1999. Focused crawling: A new approach to
topic-specific web resource discovery. In Proceedings of the Eighth International Conference
on The World-Wide Web.
CHO, J. AND GARCIA-MOLINA, H. 2000a. Estimating frequency of change. Submitted for
publication.
CHO, J. AND GARCIA-MOLINA, H. 2000b. The evolution of the web and implications for an
incremental crawler. In Proceedings of the 26th International Conference on Very Large
Data Bases.
CHO, J. AND GARCIA-MOLINA, H. 2000c. Synchronizing a database to improve freshness. In
Proceedings of the ACM SIGMOD Conference on Management of Data (SIGMOD ’2000,
Dallas, TX, May). ACM Press, New York, NY.
CHO, J., GARCIA-MOLINA, H., AND PAGE, L. 1998. Efficient crawling through URL
ordering. Comput. Netw. ISDN Syst. 30, 1-7, 161–172.
COFFMAN, E. J., LIU, Z., AND WEBER, R. R. 1997. Optimal robot scheduling for web search
engines. Tech. Rep. INRIA, Rennes, France.
DEAN, J. AND HENZINGER, M. R. 1999. Finding related pages in the world wide web. In
Proceedings of the Eighth International Conference on The World-Wide Web.
DILIGENTI, M., COETZEE, F. M., LAWRENCE, S., GILES, C. L., AND GORI, M. 2000. Focused
crawling using context graphs. In Proceedings of the 26th International Conference on Very
Large Data Bases.
DOUGLIS, F., FELDMANN, A., AND KRISHNAMURTHY, B. 1999. Rate of change and other metrics:
a live study of the world wide web. In Proceedings of the USENIX Symposium on
Internetworking Technologies and Systems. USENIX Assoc., Berkeley, CA.
DUMAIS, S. T., FURNAS, G. W., LANDAUER, T. K., DEERWESTER, S., AND HARSHMAN, R. 1988.
Using latent semantic analysis to improve access to textual information. In Proceedings of
the ACM Conference on Human Factors in Computing Systems (CHI ’88, Washington, DC,
May 15–19), J. J. O’Hare, Ed. ACM Press, New York, NY, 281–285.
EGGHE, L. AND ROUSSEAU, R. 1990. Introduction to Informetrics. Elsevier Science Inc., New
York, NY.
FALOUTSOS, C. 1985. Access methods for text. ACM Comput. Surv. 17, 1 (Mar.), 49 –74.
FALOUTSOS, C. AND CHRISTODOULAKIS, S. 1984. Signature files: An access method for
documents and its analytical performance evaluation. ACM Trans. Inf. Syst. 2, 4 (Oct.),
267–288.
GARFIELD, E. 1972. Citation analysis as a tool in journal evaluation. Science 178, 471–479.
GIBSON, D., KLEINBERG, J., AND RAGHAVAN, P. 1998. Inferring Web communities from link
topology. In Proceedings of the 9th ACM Conference on Hypertext and Hypermedia: Links,
Objects, Time and Space—Structure in Hypermedia Systems (HYPERTEXT ’98, Pittsburgh,
PA, June 20 –24), R. Akscyn, Chair. ACM Press, New York, NY, 225–234.
GOLUB, G. AND VAN LOAN, C. F. 1989. Matrix Computations. 2nd ed. Johns Hopkins
University Press, Baltimore, MD.
HAVELIWALA, T. 1999. Efficient computation of pagerank. Tech. Rep. 1999-31. Computer
Systems Laboratory, Stanford University, Stanford, CA. http://dbpubs.stanford.edu/pub/1999-31.
HAWKING, D., CRASWELL, N., AND THISTLEWAITE, P. 1998. Overview of TREC-7 very large
collection track. In Proceedings of the 7th Conference on Text Retrieval (TREC-7).
HIRAI, J., RAGHAVAN, S., GARCIA-MOLINA, H., AND PAEPCKE, A. 2000. Webbase: A repository of
web pages. In Proceedings of the Ninth International Conference on The World Wide
Web. 277–293.
HUBERMAN, B. A. AND ADAMIC, L. A. 1999. Growth dynamics of the world wide web. Nature
401, 6749 (Sept.).
KLEINBERG, J. 1999. Authoritative sources in a hyperlinked environment. J. ACM 46, 6
(Nov.).
KOSTER, M. 1995. Robots in the web: trick or treat? ConneXions 9, 4 (Apr.).
KUMAR, R., RAGHAVAN, P., RAJAGOPALAN, S., AND TOMKINS, A. 1999. Trawling the web for
emerging cyber-communities. In Proceedings of the Eighth International Conference on The
World-Wide Web.
LAWRENCE, S. AND GILES, C. 1998. Searching the world wide web. Science 280, 98 –100.
LAWRENCE, S. AND GILES, C. 1999. Accessibility of information on the web. Nature 400,
107–109.
MANBER, U. AND MYERS, G. 1993. Suffix arrays: a new method for on-line string searches.
SIAM J. Comput. 22, 5 (Oct.), 935–948.
MACLEOD, I. A., MARTIN, P., AND NORDIN, B. 1986. A design of a distributed full text retrieval
system. In Proceedings of 1986 ACM Conference on Research and Development in Informa-
tion Retrieval (SIGIR ’86, Palazzo dei Congressi, Pisa, Italy, Sept. 8–10), F. Rabitti,
Ed. ACM Press, New York, NY, 131–137.
MELNIK, S., RAGHAVAN, S., YANG, B., AND GARCIA-MOLINA, H. 2000. Building a distributed
full-text index for the web. Tech. Rep. SIDL-WP-2000-0140, Stanford Digital Library
Project. Computer Systems Laboratory, Stanford University, Stanford, CA.
http://www-diglib.stanford.edu/cgi-bin/get/SIDL-WP-2000-0140.
MELNIK, S., RAGHAVAN, S., YANG, B., AND GARCIA-MOLINA, H. 2001. Building a distributed
full-text index for the web. In Proceedings of the Tenth International Conference on The
World-Wide Web.
MOFFAT, A. AND BELL, T. A. H. 1995. In situ generation of compressed inverted files. J. Am.
Soc. Inf. Sci. 46, 7 (Aug.), 537–550.
MOTWANI, R. AND RAGHAVAN, P. 1995. Randomized Algorithms. Cambridge University Press,
New York, NY.
PAGE, L., BRIN, S., MOTWANI, R., AND WINOGRAD, T. 1998. The pagerank citation ranking:
Bringing order to the web. Tech. Rep. Computer Systems Laboratory, Stanford University,
Stanford, CA.
PINSKI, G. AND NARIN, F. 1976. Citation influence for journal aggregates of scientific
publications: Theory, with application to the literature of physics. Inf. Process. Manage. 12.
PITKOW, J. AND PIROLLI, P. 1997. Life, death, and lawfulness on the electronic frontier. In
Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI ’97,
Atlanta, GA, Mar. 22–27), S. Pemberton, Ed. ACM Press, New York, NY, 383–390.
RIBEIRO-NETO, B. A. AND BARBOSA, R. A. 1998. Query performance for tightly coupled
distributed digital libraries. In Proceedings of the Third ACM Conference on Digital
Libraries (DL ’98, Pittsburgh, PA, June 23–26), I. Witten, R. Akscyn, and F. M. Shipman,
Eds. ACM Press, New York, NY, 182–190.
ROBOTS EXCLUSION PROTOCOL. 2000. Robots Exclusion Protocol. http://info.webcrawler.com/mak/projects/robots/exclusion.html.
SALTON, G. 1989. Automatic Text Processing. Addison-Wesley Series in Computer Science. Addison-Wesley Longman Publ. Co., Inc., Reading, MA.
TOMASIC, A. AND GARCIA-MOLINA, H. 1993. Performance of inverted indices in distributed text
document retrieval systems. In Proceedings of the 2nd International Conference on Parallel
and Distributed Systems (Dec.). 8 –17.
VILES, C. L. AND FRENCH, J. C. 1995. Dissemination of collection wide information in a
distributed information retrieval system. In Proceedings of the 18th Annual International
ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’95,
Seattle, WA, July 9 –13), E. A. Fox, P. Ingwersen, and R. Fidel, Eds. ACM Press, New York,
NY, 12–20.
WILLS, C. E. AND MIKHAILOV, M. 1999. Towards a better understanding of web resources and
server responses for improved caching. In Proceedings of the Eighth International Confer-
ence on The World-Wide Web.
WITTEN, I., MOFFAT, A., AND BELL, T. 1999. Managing Gigabytes: Compressing and Indexing
Documents and Images. 2nd ed. Morgan Kaufmann Publishers Inc., San Francisco, CA.