
Chapter 4

Text Document Indexing and Retrieval


4.1 INTRODUCTION
This chapter is devoted to text document retrieval techniques, commonly called IR techniques. IR techniques are
important in multimedia information management systems for two main reasons. First, there exist a large
number of text documents in many organizations such as libraries. Text is a very important information source
for an organization. To efficiently use information stored in these documents, an efficient IR system is needed.
Second, text can be used to annotate other media such as audio, images, and video, so conventional IR
techniques can be used for multimedia information retrieval.
The two major design issues of IR systems are how to represent documents and queries and how to
compare similarities between document and query representations. A retrieval model defines these two aspects.
The four commonly used retrieval models are exact match, vector space, probabilistic, and cluster-based. The
most common exact match technique is the Boolean model.
In Section 4.2, we describe the main differences between IR systems and DBMSs, and the general
information retrieval process.
Although different retrieval models use different document representation or indexing, the indexing
process used is similar. Section 4.3 discusses the general automatic document indexing process and the Boolean
retrieval model. Sections 4.4 to 4.6 discuss vector space, probabilistic, and cluster-based retrieval models,
respectively.
To improve retrieval performance, natural language processing and artificial intelligence techniques are
used. In Section 4.7, we briefly describe the applications of these two areas in IR.
Due to the ambiguity and variations of natural language, it is almost impossible to retrieve all relevant
items and reject all irrelevant items, so measurement of IR effectiveness is important. Section 4.8 is devoted to the
performance measurement issue. Section 4.9 briefly compares performance of different retrieval techniques.
IR techniques are popular because they are now used in search engines of the WWW. In Section 4.10, we
describe the basic architecture of the WWW, general issues of resource discovery in the WWW, the main
differences between IR systems and WWW search engines, the implications of these differences to the WWW
search engine design, and an example WWW search engine.
Section 4.11 summarizes the chapter.

4.2 DIFFERENCES BETWEEN IR SYSTEMS AND DBMS


An understanding of the differences between IR systems and DBMSs is helpful in understanding IR techniques.
A DBMS contains homogeneous structured records. Each record is characterized by a set of attributes, and
the values of the attributes attached to particular records describe these records unequivocally and completely.
In IR systems, records are not structured. They do not contain fixed attributes. They are just normal text
documents. These documents can be indexed with a number of keywords, document descriptors, or index terms.
Each index term is assumed to describe the text content only to some extent, not completely or unequivocally,
and large numbers of different index terms may be attached to each particular document or text. Because text retrieval operations depend directly on the content representations used to describe the stored records, a
substantial effort must be devoted to analyzing the content of the stored documents and dealing with the
generation of the keywords and indices.
In a DBMS the retrieval is based on an exact match between the query and the attribute values of records.
Each retrieved record contains the precise attribute values specified in the query (and possibly other attribute
values not mentioned in the query), while each nonretrieved record exhibits at least one mismatch between
attribute values attached to the query and those attached to the records. In IR systems, it may not be useful to
insist on an exact match between the query and document terms for particular documents to be retrieved.
Instead, the retrieval of an item may depend on a sufficient degree of coincidence between the sets of terms
attached to queries and documents, produced by some approximate or partial matching method. Further, the
same term may have different meanings. In other words, items retrieved in a DBMS are definitely relevant to the
query and useful to the user. But in IR systems, items considered relevant to the query by the system may not
be relevant and useful to the user.
The basic document retrieval process is shown in Figure 4.1. As shown on the right side of the figure, the
documents are processed off-line to obtain document representations. These representations are stored together

with documents themselves. During retrieval (left side of the figure), the user issues a query that is processed
(on-line) to obtain its representation. Then the query representation is compared with the document
representations. Documents deemed relevant by the system are retrieved and presented to the user, who
evaluates the returned documents and decides which ones are actually relevant to the information need. A good
IR system should then allow the user to provide relevance feedback to the system. The system uses this
information to modify the query, the query representation, and/or the document representations. Another retrieval is done
based on the modified query and document representations. If necessary, the retrieval-feedback process is
iterated a few times. Note that not all IR systems support the user relevance feedback process.
Different IR models use different methods for query and document representation, similarity comparison
and/or relevance feedback. We discuss Boolean, vector space, probabilistic, and clustering models in the
following sections.

[Figure 4.1 is a flowchart. On the right, text documents undergo (off-line) processing to produce document representations; on the left, the query undergoes (on-line) processing to produce a query representation. A similarity calculation compares the two representations to yield the retrieved documents, which the user subjects to relevance evaluation.]

Figure 4.1 The information retrieval process.

4.3 AUTOMATIC TEXT DOCUMENT INDEXING AND BOOLEAN RETRIEVAL MODEL


4.3.1 Basic Boolean Retrieval Model
The aim of an IR system is to retrieve relevant items from a document database in response to users' queries.
Most of the commercial IR systems today can be classified as Boolean IR systems or text-pattern search
systems. Text-pattern search queries are strings or regular expressions. During retrieval, all documents are
searched and those containing the query string are retrieved. Text-pattern systems are more common for
searching small document databases or collections. A well-known example of text-pattern search is the grep
family of tools in the UNIX environment.
In the Boolean retrieval system, documents are indexed by sets of keywords. (The indexing process will be
discussed later.) Queries are also represented by a set of keywords joined by logical (Boolean) operators that
supply relationships between the query terms. Three types of operators are in common use: OR, AND, and
NOT. Their retrieval rules are as follows:
The OR operator treats two terms as effectively synonymous. For example, given the query (term 1 OR
term 2), the presence of either term in a record (or a document) suffices to retrieve that record.
The AND operator combines terms (or keywords) into term phrases; thus the query (term 1 AND term 2)
indicates that both terms must be present in the document in order for it to be retrieved.
The NOT operator is a restriction, or term-narrowing, operator that is normally used in conjunction with
the AND operator to restrict the applicability of particular terms; thus the query (term 1 AND NOT term 2)
leads to the retrieval of records containing term 1 but not term 2.

4.3.2 File Structure
A fundamental decision in the design of IR systems is which type of file structure to use for the underlying
document database. File structures used in IR systems include flat files, inverted files, signature files, and others
such as PAT trees and graphs [1].
Using a flat-file approach, one or more documents are stored in a file, usually as ASCII or EBCDIC text.
Documents are not indexed. Flat-file searching is usually done via pattern searching. In UNIX, for example, one
can store a document collection one document per file in a UNIX directory. These files can be searched using
pattern searching tools such as grep and awk. This approach is not efficient, because for each query the
entire document collection must be searched to check the query text pattern.
Signature files contain signatures (bit patterns) that represent documents. There are many ways to generate
signatures for documents. The query is also represented by a signature that is compared with the document
signature during retrieval.
A commonly used file structure is an inverted file that is a kind of indexed file. The inverted file concept
and retrieval operations based on inverted files are described next.

Inverted Files
In an inverted file, for each term a separate index is constructed that stores the record identifiers for all records
containing that term. An inverted file entry usually contains a keyword (term) and a number of document-IDs.
Each keyword or term and the document-IDs of documents containing the keyword are organized into one row.
An example of an inverted file is shown below:
Term 1: Record 1, Record 3
Term 2: Record 1, Record 2
Term 3: Record 2, Record 3, Record 4
Term 4: Record 1, Record 2, Record 3, Record 4
where Term i (i being 1, 2, 3, or 4) is the ID number of index term i, Record i (i being 1, 2, 3, or 4) is the ID
number of record i or document i.
The first row means that Term 1 is used in Record 1 and Record 3. The other rows have analogous
meanings. Using an inverted file, searching and retrieval is fast. Only rows containing query terms are retrieved.
There is no need to search for every record in the database.
Retrieval rules using the Boolean model based on inverted files are as follows:
For the Boolean AND query, such as (Term i AND Term j), a merged list is produced for rows i and j of
the inverted file and all duplicated records (those containing both Term i and Term j) are presented as
output. Using the above inverted file as an example, for query (Term 1 AND Term 3), the output is Record 3.
For an OR query, such as (Term i OR Term j), the merged list is produced for rows i and j and all distinct
items of the merged list are presented as output. Using the above inverted file as an example, for query
(Term 1 OR Term 2), the output is:
Record 1, Record 2, Record 3.
For a NOT query, such as (Term i AND NOT Term j) the output is items appearing in row i but not in row
j. Using the above inverted file as an example, for query (Term 4 AND NOT Term 1), the output will be:
Record 2, Record 4. For query (Term 1 AND NOT Term 4), the output is nil.
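These retrieval rules amount to set operations on the inverted file rows. A minimal sketch in Python, using the example inverted file above (term and record names as in the text):

```python
# Inverted file from the example above: term -> set of record IDs.
inverted = {
    "Term1": {"R1", "R3"},
    "Term2": {"R1", "R2"},
    "Term3": {"R2", "R3", "R4"},
    "Term4": {"R1", "R2", "R3", "R4"},
}

def boolean_and(t1, t2):
    # Records containing both terms: set intersection of the two rows.
    return inverted[t1] & inverted[t2]

def boolean_or(t1, t2):
    # Records containing either term: set union of the two rows.
    return inverted[t1] | inverted[t2]

def boolean_and_not(t1, t2):
    # Records containing t1 but not t2: set difference.
    return inverted[t1] - inverted[t2]

print(boolean_and("Term1", "Term3"))              # {'R3'}
print(sorted(boolean_or("Term1", "Term2")))       # ['R1', 'R2', 'R3']
print(sorted(boolean_and_not("Term4", "Term1")))  # ['R2', 'R4']
print(boolean_and_not("Term1", "Term4"))          # set() -- nil
```

Only the rows for the query terms are consulted; the rest of the database is never touched.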

Extensions of the Inverted File Operation

So far we have ignored two important factors in document indexing and retrieval: term positions and the
significance of terms (term weights) in documents. In AND queries, all records containing both terms are
retrieved, regardless of the terms' positions in the documents. Each term is treated as equally important, regardless
of its occurring frequency in documents. To improve retrieval performance, these two factors must be taken into
account. We discuss term weights later. In the following we discuss position constraints.
The relationships specified between two or more terms can be strengthened by adding nearness parameters
to the query specification. When nearness parameters are included in a query specification, the topic is more
specifically defined, and the probable relevance of any retrieved item is larger.
Two possible parameters of this kind are the within sentence and adjacency specification:
(Term i within sentence Term j) means that terms i and j occur in a common sentence of a retrieved record.
(Term i adjacent Term j) means that Terms i and j occur adjacently in the retrieved documents.
To support this type of query, term location information must be included in the inverted file. The general
structure of the extended inverted file is
Term i: Record no., Paragraph no., Sentence no., Word no.
For example, if an inverted file has the following entries:
information: R99, 10, 8, 3; R155, 15, 3, 6; R166, 2, 3, 1
retrieval: R77, 9, 7, 2; R99, 10, 8, 4; R166, 10, 2, 5
then as a result of query (information within sentence retrieval), only document R99 will be retrieved.
In the above example, terms information and retrieval appear in the same sentence of document R99.
So it is very likely the record is about information retrieval. Although document R166 contains both
information and retrieval, they are at different places of the document, so the document may not be about
information retrieval. It is likely that the terms information and retrieval are used in different contexts.
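The within-sentence constraint can be checked directly against the extended inverted file. A sketch, assuming each posting is a (record, paragraph, sentence, word) tuple as in the entries above:

```python
# Extended inverted file: term -> list of (record, paragraph, sentence, word) postings.
postings = {
    "information": [("R99", 10, 8, 3), ("R155", 15, 3, 6), ("R166", 2, 3, 1)],
    "retrieval":   [("R77", 9, 7, 2), ("R99", 10, 8, 4), ("R166", 10, 2, 5)],
}

def within_sentence(t1, t2):
    # A record qualifies when both terms share record, paragraph, and sentence numbers.
    hits = set()
    for rec1, par1, sen1, _ in postings[t1]:
        for rec2, par2, sen2, _ in postings[t2]:
            if (rec1, par1, sen1) == (rec2, par2, sen2):
                hits.add(rec1)
    return hits

print(within_sentence("information", "retrieval"))  # {'R99'}
```

An adjacency test would use the same record, paragraph, and sentence comparison plus a check that the word numbers differ by one.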

4.3.3 Term Operations and Automatic Indexing


We mentioned that documents are indexed with keywords, but we have not described how the indexing is
actually done. We discuss the operations carried out on terms and an automatic indexing process next.
A document contains many terms or words. But not every word is useful and important. For example,
prepositions and articles such as of, the, and a are not useful to represent the contents of a document.
These terms are called stop words. An excerpt of common stop words is listed in Table 4.1. During the indexing
process, a document is treated as a list of words and stop words are removed from the list. The remaining terms
or words are further processed to improve indexing and retrieval efficiency and effectiveness. Common
operations carried out on these terms are stemming, thesaurus, and weighting.
Stemming is the automated conflation (fusing or combining) of related words, usually by reducing the
words to a common root form. For example, suppose that the words retrieval, retrieved, retrieving, and
retrieve all appear in a document. Instead of treating these as four different words, for indexing purposes these
four words are reduced to a common root retriev. The term retriev is used as an index term of the document.
A good description of stemming algorithms can be found in Chapter 8 of [11].
With stemming, the index file will be more compact and information retrieval will be more efficient.
Information recall will also be improved because the root is more general and more relevant documents will be
retrieved in response to queries. But the precision may be decreased as the root term is less specific. We discuss
performance measurement in recall and precision in Section 4.8.
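The idea behind stemming can be illustrated with a deliberately crude suffix-stripping sketch; the suffix list below is illustrative and is not Porter's rule set (see [11] for real algorithms):

```python
# Crude suffix-stripping stemmer: tries listed suffixes in order and keeps
# the stem only if at least three characters remain.
SUFFIXES = ("ing", "ed", "al", "e", "s")

def stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

for w in ("retrieval", "retrieved", "retrieving", "retrieve"):
    print(stem(w))  # retriev, four times
```

The minimum-stem-length guard prevents over-stripping short words (e.g., "thing" is left intact rather than reduced to "th").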
Table 4.1
Excerpt of Common Stop Words

A, ABOUT, ACROSS, AFTER, AFTERWARDS, AGAIN, AGAINST, ALL, ALSO,
ALTHOUGH, ALWAYS, AMONG, AMONGST, AN, AND, ANOTHER, ANY, ANYHOW,
ANYONE, ANYTHING, ANYWHERE, ARE, AROUND, AS, AT, BE, BECOME
Another way of conflating related terms is with a thesaurus that lists synonymous terms and sometimes the
relationships among them. For example, the words study, learning, schoolwork, and reading have
similar meanings. So instead of using four index terms, a general term study can be used to represent these
four terms. The thesaurus operation has a similar effect on retrieval efficiency and effectiveness as the stemming
operation.
Different indexing terms have different frequencies of occurrence and importance to the document. Note
that the occurring frequency of a term after stemming and thesaurus operations is the sum of the frequencies of
all its variations. For example, the term frequency of retriev is the sum of the occurring frequencies of the
terms retrieve, retrieval, retrieving, and retrieved. The introduction of term-importance weights for
document terms and query terms may help distinguish terms that are more important for retrieval purposes
from less important terms. When term weights are added to the inverted file, different documents have different
similarities to the query and the documents can be ranked at retrieval time in decreasing order of similarity.
An example of an inverted file with term weights is shown below:
Term 1: R1, 0.3; R3, 0.5; R6, 0.8; R7, 0.2; R11, 1
Term 2: R2, 0.7; R3, 0.6; R7, 0.5; R9, 0.5
Term 3: R1, 0.8; R2, 0.4; R9, 0.7
The first row means that the weight of Term 1 is 0.3 in Record 1, 0.5 in Record 3, 0.8 in Record 6, 0.2 in
Record 7, and 1 in Record 11. Other rows can be read similarly.
Boolean operations with term weights can be carried out as follows:

For the OR query, the higher weight among records containing the query terms is used as the similarity
between the query and the documents. The returned list is ordered in decreasing similarity. For example, for
query (Term 2 OR Term 3), we have R1 = 0.8, R2 = 0.7, R3 = 0.6, R7 = 0.5, R9 = 0.7, therefore the output order
is R1, R2, R9, R3, and R7.

For the AND query, the lower weight between the common records matching the query terms is used as the
similarity between the query and the documents. For example, for the query (Term 2 AND Term 3), we
have R2 = 0.4, R9 = 0.5. Therefore the output is R9 and R2.
For the NOT query, the similarity between the query and the documents is the difference between the
common entries in the inverted file. For example, for query (Term 2 AND NOT Term 3), we have R2 = 0.3,
R3 = 0.6, R7 = 0.5, R9 = 0, therefore the output is R3, R7, and R2.
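These weighted rules can be sketched as dictionary operations on the weighted inverted file above: OR takes the maximum weight, AND the minimum over common records, and NOT the difference floored at zero.

```python
# Weighted inverted file from the example above: term -> {record: weight}.
inverted = {
    "Term1": {"R1": 0.3, "R3": 0.5, "R6": 0.8, "R7": 0.2, "R11": 1.0},
    "Term2": {"R2": 0.7, "R3": 0.6, "R7": 0.5, "R9": 0.5},
    "Term3": {"R1": 0.8, "R2": 0.4, "R9": 0.7},
}

def ranked(scores):
    # Drop zero scores; order by decreasing similarity, record ID breaking ties.
    return sorted((r for r, s in scores.items() if s > 0),
                  key=lambda r: (-scores[r], r))

def or_query(t1, t2):
    # OR: the higher weight of a record under either term.
    a, b = inverted[t1], inverted[t2]
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in set(a) | set(b)}

def and_query(t1, t2):
    # AND: the lower weight, over records containing both terms.
    common = set(inverted[t1]) & set(inverted[t2])
    return {r: min(inverted[t1][r], inverted[t2][r]) for r in common}

def and_not_query(t1, t2):
    # AND NOT: weight under t1 minus weight under t2, floored at zero.
    return {r: max(w - inverted[t2].get(r, 0), 0)
            for r, w in inverted[t1].items()}

print(ranked(or_query("Term2", "Term3")))       # ['R1', 'R2', 'R9', 'R3', 'R7']
print(ranked(and_query("Term2", "Term3")))      # ['R9', 'R2']
print(ranked(and_not_query("Term2", "Term3")))  # ['R3', 'R7', 'R2']
```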

We discussed how the use of term weights can help rank the returned list. Ranked return is very important
because if the first few items are most similar or relevant to the query, they are normally the most useful to the
user. The user can just look at the first few items without going through a long list of items. Now let us look at
how to determine term weights for different index terms.
The assignment of index terms to documents and queries is carried out in the hope of distinguishing
documents that are relevant for information users from other documents.
In a particular document, the more often a term appears, the more important the term and the higher the
term weight should be.
In the context of an entire document collection, if a term appears in almost all documents, the term is not a
good index term, because it does not help in distinguishing relevant documents from others.
Therefore good index terms are those that appear often in a few documents but do not appear in other
documents. Term weight should be assigned taking into account both term frequency (tf) and document
frequency (df). The commonly used formula to calculate term weight is

Wij = tfij * log( N / df j )



where Wij is the weight of term j in document i, tfij is the frequency of term j in document i, N is the total
number of documents in the collection, and dfj is the number of documents containing term j. The above weight is
proportional to term frequency and inverse document frequency. Thus the above formula is commonly called
tf.idf.
Based on the above formula, if a term occurs in all documents in the collection (dfj = N), the weight of the
term is zero (i.e., the term should not be used as an index term because use of the term is not able to differentiate
documents). On the other hand, if a term appears often in only a few documents, the weight of the term is high
(i.e., it is a good index term).
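The tf.idf weight can be computed directly from the formula. A sketch over a small made-up three-document collection (document names and term counts are illustrative):

```python
import math

# Illustrative collection: document -> {term: frequency}.
docs = {
    "D1": {"retriev": 4, "index": 2, "system": 1},
    "D2": {"index": 3, "system": 2},
    "D3": {"system": 5},
}

N = len(docs)

def df(term):
    # Document frequency: number of documents containing the term.
    return sum(1 for terms in docs.values() if term in terms)

def tf_idf(doc, term):
    # Wij = tfij * log(N / dfj); zero when the term is absent from the document.
    tf = docs[doc].get(term, 0)
    return tf * math.log(N / df(term)) if tf else 0.0

print(tf_idf("D1", "retriev"))  # 4 * log(3/1), about 4.394
print(tf_idf("D1", "system"))   # 0.0: 'system' occurs in every document
```

Note that "system", which occurs in all three documents, gets weight zero, exactly as discussed above.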

4.3.4 Summary of Automatic Document Indexing


The aim of indexing is to find the best terms to represent each document so that documents can be retrieved
accurately during the retrieval process. The automatic indexing process consists of the following steps:
1. Identify words in the title, abstract, and/or document;
2. Eliminate stop words from the above words by consulting a special dictionary, or stop list, containing a
list of high-frequency function words;
3. Identify synonyms by consulting a thesaurus dictionary. All terms with similar meanings are replaced
with a common word;
4. Stem words using certain algorithms by removing derivational and inflectional affixes (suffix and
prefix);
5. Count stem frequencies in each document;
6. Calculate term (stem) weights;
7. Create the inverted file based on the above terms and weights.
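The seven steps can be strung together into a small indexing sketch. The stop list, synonym table, and stemming rule below are illustrative placeholders, not the real resources an IR system would use:

```python
import math
import re
from collections import defaultdict

STOP = {"a", "an", "and", "the", "of", "is", "to"}        # step 2 (excerpt)
SYNONYMS = {"learning": "study", "schoolwork": "study"}    # step 3 (illustrative)

def stem(word):                                            # step 4 (crude placeholder)
    for suffix in ("ing", "ed", "al", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

def index(documents):
    # documents: {doc_id: text}; returns inverted file {term: {doc_id: weight}}.
    counts = {}
    for doc_id, text in documents.items():
        words = re.findall(r"[a-z]+", text.lower())         # step 1
        terms = [stem(SYNONYMS.get(w, w)) for w in words if w not in STOP]
        freq = defaultdict(int)
        for t in terms:                                     # step 5
            freq[t] += 1
        counts[doc_id] = freq
    inverted = defaultdict(dict)
    n = len(documents)
    for doc_id, freq in counts.items():
        for term, tf in freq.items():
            d = sum(1 for f in counts.values() if term in f)
            inverted[term][doc_id] = tf * math.log(n / d)   # steps 6 and 7
    return dict(inverted)

inv = index({"D1": "indexing and retrieval of text", "D2": "text indexing"})
print(sorted(inv))  # ['index', 'retriev', 'text']
```

As expected from the tf.idf formula, terms appearing in every document ("index", "text") receive weight zero, while "retriev", unique to D1, receives a positive weight.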

4.4 VECTOR SPACE RETRIEVAL MODEL


4.4.1 Basic Vector Space Retrieval Model
The concept of the Boolean retrieval model is simple and used in most commercial systems. However, it is difficult
to formulate Boolean queries and the retrieval results are very sensitive to query formulation. Query term
weights are normally not used, as queries are often very short. To overcome these problems, alternative retrieval
models (vector space, probabilistic, and cluster-based models) have been proposed. This section discusses the
vector space model and the following two sections deal with the other two models.
The vector space model assumes that there is a fixed set of index terms to represent documents and queries.
A document Di and a query Qj are represented as
Di = [Ti1, Ti2, ..., TiN]
Qj = [Qj1, Qj2, ..., QjN]
where Tik is the weight of term k in document i, Qjk is the weight of term k in query j, and N is the total
number of terms used in documents and queries.
Term weights Tik and Qjk can be binary (i.e., either 1 or 0), or tf.idf or weights obtained by other means.
Retrieval in the vector space model is based on the similarity between the query and the documents. The
similarity between document Di and query Qj is calculated as follows:

S(Di, Qj) = Σ (k = 1 to N) Tik · Qjk

To compensate for differences in document sizes and query sizes, the above similarity can be normalized as
follows:

S(Di, Qj) = [ Σ (k = 1 to N) Tik · Qjk ] / sqrt[ ( Σ (k = 1 to N) Tik² ) · ( Σ (k = 1 to N) Qjk² ) ]
This is the well-known cosine coefficient between vectors Di and Qj. During retrieval, a ranked list in
descending order of similarity is returned to the user. For example, if four documents and a query are represented as
the following vectors:
D1 = [0.2, 0.1, 0.4, 0.5]
D2 = [0.5, 0.6, 0.3, 0]
D3 = [0.4, 0.5, 0.8, 0.3]
D4 = [0.1, 0, 0.7, 0.8]
Q = [0.5, 0.5, 0, 0]
then the similarities between the query and each of the documents are as follows:
S(D1, Q) = 0.31
S(D2, Q) = 0.93
S(D3, Q) = 0.60
S(D4, Q) = 0.07
The system will return the documents in the order D2, D3, D1, and D4.
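This ranking can be reproduced in a few lines (vectors as in the example above):

```python
import math

def cosine(d, q):
    # Cosine coefficient: dot product over the product of the vector lengths.
    dot = sum(dk * qk for dk, qk in zip(d, q))
    norm = math.sqrt(sum(dk * dk for dk in d)) * math.sqrt(sum(qk * qk for qk in q))
    return dot / norm

docs = {
    "D1": [0.2, 0.1, 0.4, 0.5],
    "D2": [0.5, 0.6, 0.3, 0.0],
    "D3": [0.4, 0.5, 0.8, 0.3],
    "D4": [0.1, 0.0, 0.7, 0.8],
}
q = [0.5, 0.5, 0.0, 0.0]

# Ranked list in descending order of similarity, as returned to the user.
ranking = sorted(docs, key=lambda name: -cosine(docs[name], q))
print(ranking)  # ['D2', 'D3', 'D1', 'D4']
```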
The main limitation of the vector space model is that it treats terms as unrelated and it only works well with
short documents and queries.

4.4.2 Relevance Feedback Techniques


As we mentioned, items relevant to the query according to the system may not actually be relevant to the query
as judged by the user. Techniques that employ users' relevance feedback information have been developed to
improve system effectiveness. Relevance feedback takes users' judgments about the relevance of documents and
uses them to modify query or document indexes.

Query Modification
Query modification based on user relevance feedback uses the following rules:
Terms occurring in documents previously identified as relevant are added to the original query, or the
weight of such terms is increased.
Terms occurring in documents previously identified as irrelevant are deleted from the query, or the weight
of such terms is reduced.
The new query is submitted again to retrieve documents. The above rules are expressed as follows:

Q(i+1) = Q(i) + Σ (Di ∈ Rel) Di − Σ (Di ∈ Nonrel) Di

where Q(i+1) is the new query, Q(i) is the current query, the first summation is over the relevant documents
among those retrieved in response to Q(i), and the second summation is over the nonrelevant documents
among them.
Experiments show that the performance is improved by using this technique. The principle behind this
approach is to find similar documents to the ones already judged as relevant to the query. Documents relevant to

the query should be similar to each other.
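The query-modification rule can be sketched as plain vector arithmetic. The document vectors and relevance judgments below are illustrative, and clipping negative weights to zero is a common convention not stated in the text:

```python
def modify_query(query, relevant, nonrelevant):
    # New query = old query + sum of relevant document vectors
    #                        - sum of nonrelevant document vectors,
    # with negative term weights clipped to zero (assumed convention).
    new_q = list(query)
    for doc in relevant:
        new_q = [qk + dk for qk, dk in zip(new_q, doc)]
    for doc in nonrelevant:
        new_q = [qk - dk for qk, dk in zip(new_q, doc)]
    return [max(qk, 0.0) for qk in new_q]

q = [0.5, 0.5, 0.0, 0.0]
relevant = [[0.5, 0.6, 0.3, 0.0]]     # documents the user judged relevant
nonrelevant = [[0.1, 0.0, 0.7, 0.8]]  # documents the user judged nonrelevant
print(modify_query(q, relevant, nonrelevant))
```

The weights of terms prominent in the relevant document rise, while terms that appear only in the nonrelevant document are pushed toward zero.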

Document Modification
In query modification based on user relevance feedback, queries are modified using the terms in the relevant
documents. Other users do not benefit from this modification. In document modification based on the user's
relevance feedback, document index terms are modified using query terms, so the change made affects other
users. Document modification uses the following rules based on relevance feedback:
Terms in the query, but not in the user-judged relevant documents, are added to the document index list
with an initial weight.
Weights of index terms in the query and also in relevant documents are increased by a certain amount.
Weights of index terms not in the query but in the relevant documents are decreased by a certain amount.

When subsequent queries similar to the queries used to modify the documents are issued, performance is
improved. But this approach may decrease the effectiveness if the subsequent queries are very different from
those used to modify the documents.

4.5 PROBABILISTIC RETRIEVAL MODEL


The probabilistic retrieval model considers term dependencies and relationships. It is based on the following
four parameters:
P(rel): the probability of relevance of a document;
P(nonrel): the probability of nonrelevance of a document;
a1: the cost associated with the retrieval of a nonrelevant document;
a2: the cost associated with the nonretrieval of a relevant document.
Since the retrieval of a nonrelevant document carries a loss of a1·P(nonrel) and the rejection of a relevant
document carries a loss of a2·P(rel), the total loss caused by a given retrieval process will be minimized if a
document is retrieved whenever

a2·P(rel) ≥ a1·P(nonrel)
The main issue of the probabilistic retrieval model is how to estimate P(rel) and P(nonrel). This is normally
done by assuming a certain term occurrence distribution in documents. We will not discuss the derivation of
these parameters further. Interested readers are referred to [24].
The probabilistic model provides an important guide for characterizing retrieval processes. However, it has
not improved retrieval effectiveness greatly, due to the difficulties of obtaining P(rel) and P(nonrel).

4.6 CLUSTER-BASED RETRIEVAL MODEL


In the information retrieval models discussed so far, similar documents may not be in close proximity in the file
system. In such a file organization, it is difficult to implement a browsing capability. Retrieval effectiveness and
efficiency are low because not all relevant items may be retrieved and the whole document space has to be searched.
To overcome these disadvantages, document clustering (grouping similar documents into clusters) was
introduced. In the following we briefly describe cluster generation methods and cluster-based retrieval
techniques. The coverage is quite brief, focusing on basic principles. Interested readers are referred to [2, 3] for
details about these topics.

4.6.1 Cluster Generation


There are two general approaches to cluster generation. The first one is based on all pair-wise document
similarities and assembles similar items into common clusters. The second uses heuristic methods that do not

require pairwise document similarities to be computed.


In the approach based on pairwise similarities, each document is represented as a document vector as in the
vector space model. Then the similarity between each pair of documents is calculated. During the clustering
process, each document is initially placed into a class by itself, and then the two most similar documents based on
the pairwise similarities are combined into a cluster. The similarities between the newly formed cluster and
other documents are calculated, and then the most similar documents (including the cluster) are combined into a new
cluster. The combining process continues until all documents are grouped into a supercluster. This is called a
hierarchical, agglomerative clustering process.
The hierarchical clustering methods are based on all pairwise similarities between documents and are
relatively expensive to perform. But these methods produce a unique set of well-formed clusters for each set of
documents.
In contrast, heuristic clustering methods produce rough cluster arrangements rapidly and at relatively little
expense. The simplest heuristic process, called a one-pass procedure, takes the documents to be clustered one at
a time in arbitrary order. The first document taken is placed in a cluster of its own. Each subsequent document is
then compared with all existing clusters, and is placed into an existing cluster if it is sufficiently similar to that
cluster. If the document is not similar enough to any existing cluster, the document is placed in a new cluster of
its own. This process continues until all documents are clustered. The cluster structure generated this way
depends on the order in which documents are processed and is uneven. Some control mechanisms are required
to produce usable clusters.
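The one-pass procedure can be sketched as follows; the cosine measure and the similarity threshold are illustrative choices, and the control mechanisms mentioned above are omitted:

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors; zero if either has no length.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    # Average vector of the cluster's documents.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def one_pass_cluster(doc_vectors, threshold=0.8):
    # Documents are taken one at a time; each joins the most similar existing
    # cluster if similarity to its centroid reaches the threshold, otherwise
    # it seeds a new cluster.
    clusters = []
    for vec in doc_vectors:
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, centroid(cluster))
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best.append(vec)
        else:
            clusters.append([vec])
    return clusters

print(len(one_pass_cluster([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])))  # 2
```

Reordering the input vectors can change the resulting clusters, which is exactly the order dependence noted above.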

4.6.2 Cluster-Based Retrieval


When clusters are formed, document search and retrieval is effective and efficient. Each cluster has a
representative vector, normally its centroid. A cluster centroid is typically calculated as the average vector of all
documents of the cluster (i.e., the weight of centroid term i is defined as the average of the weights of the ith
terms of all documents).
During document retrieval, the query vector is compared with the centroids of the clusters. After the cluster
with the highest similarity to the query vector is identified, there are two alternatives. In the first, all documents in
the cluster are retrieved. This option is normally taken when clusters are small. In the second alternative, the
query vector is compared with each document vector in the cluster and only the most similar documents are
retrieved.

4.7 NONTRADITIONAL IR METHODS


The main issues of IR are how to represent documents and information needs (queries) accurately and then how
to match users' information needs with documents. It is obvious that these two processes need to be improved to
improve IR performance. A number of new techniques have been proposed to achieve this. In this section we
briefly describe natural language processing (NLP) and concept-based (or knowledge-based) IR techniques
[3, 4].
Traditional IR models rely on statistical occurrences of terms as a basis for retrieval. There
are a number of problems with methods based on term occurrences only. First, individual words do not
contain all the information encoded in language. The order of words provides a lot of information; the same words
in a different order may have totally different meanings. Second, one word may have multiple meanings. This is
called polysemy. Third, a number of words may have a similar meaning. This is called synonymy. Fourth,
phrases have meanings beyond the sum of the individual words. Overall, natural language is ambiguous. To
improve IR performance, the system should be able to understand natural language. NLP attempts automatic
natural language understanding. The application of NLP techniques to IR improves retrieval performance.
Another way to improve IR performance is to use domain knowledge. In a knowledge-based IR
model, information specific to a domain, called domain knowledge, is used to model concepts (terms), events,
and relationships among concepts and events [5]. For example, terms multimedia, audio, video, images,
information, indexing, and retrieval are all associated with the topic multimedia information retrieval,
with different weights. If we build a complete relationship tree with weights attached to different terms for this
topic, documents with one or more of these terms will have different combined weights or similarities to the topic.
Retrieval can be based on these similarities.
Knowledge about the user, such as his/her preferences and background, can also be used to improve IR
performance. For example, if the system knows that the user is a tennis fan and the user issues the query
"sports news," the system will give higher weights to news items about tennis than to those about other sports.

Those who are interested in nontraditional IR using NLP, domain knowledge, and user profiles are referred
to references [3-5, 13-16].

4.8 PERFORMANCE MEASUREMENT


Information retrieval performance is normally measured using three parameters: retrieval speed, recall, and
precision. These three parameters are largely determined by the indexing scheme and similarity measurement
used. The meaning of speed is obvious: the higher the speed, the better the performance. This parameter
measures efficiency.
Recall and precision are collectively used to measure the effectiveness of a retrieval system. Recall
measures the capacity to retrieve relevant information items from the database. It is defined as the ratio between
the number of relevant items retrieved and the total number of relevant items in the database. During
performance testing, the total number of relevant items in the database for each testing query should be
determined first by an expert in the domain. The higher the recall, the better the performance.
Precision measures the retrieval accuracy. It is defined as the ratio between the number of relevant items
retrieved and the total number of retrieved items. If considered in isolation, the higher the precision, the higher
the retrieval performance. In practice, recall and precision are considered together. It is normally the case that
the higher the recall, the lower the precision. This is because in the process of trying to retrieve all relevant
items to a query some irrelevant items are also retrieved, reducing the precision. A system with high recall but
low precision means that the system will return a long list of items, many of which are irrelevant. On the other
hand, a system with high precision but low recall means that many items relevant to the query are not retrieved.
Thus a good retrieval system should balance recall and precision, and to compare the performance of two
information retrieval systems, both recall and precision should be considered. One technique to do this is to
determine precision values with recall values ranging from 0 to 1 and to plot a recall-precision graph for each
system, as shown in Figure 4.2. The system with the graph further from the origin has the higher performance.
The following example shows calculation of recall and precision.

Figure 4.2 Recall-precision graphs. A system with the graph further from the origin has higher
performance. So the system with graph B is better than the system with graph A.
Suppose a database has 1,000 information items in total, out of which 10 are relevant to a particular query. The
system returned the following list in response to the query:
R, R, I, I, R, R, I, I, R, I, R, R, I, I, R
where the Rs denote items relevant to the query judged by the user and the Is denote items judged irrelevant by
the user. Note that all items returned are deemed relevant by the system, but only some of them are actually
relevant to the query as judged by the user.
Recall-precision pairs are calculated by considering the different number of items returned as shown in
Table 4.2.
We see that the more the items returned, the higher the recall and the lower the precision. In performance
evaluation, recall-precision pairs are calculated at fixed intervals of recall. For instance, precision is calculated
when the recall value is at 0.1, 0.2, 0.3, ..., 0.9, 1.0. Experiments with many queries should be carried out. The
precision values at the same recall value are then averaged to obtain a set of average recall-precision pairs for
the system. At a fixed recall value, the higher the precision, the higher the system performance.


The index affects both recall and precision as well as system efficiency. If the index does not capture all the
information about items, the system is not able to find all the items relevant to queries, leading to a lower recall.
If the index is not precise, some irrelevant items are retrieved by the system, leading to a lower precision.
Similarity measurement is extremely important and should conform to human judgment. Otherwise, the
precision of the system will be low. As shown in the above example, when some returned items are judged not
relevant to the query by the user, the retrieval precision is decreased.

Table 4.2
Recall and Precision Calculation

Number of items returned   Recall   Precision
 1                         1/10     1/1
 2                         2/10     2/2
 3                         2/10     2/3
 4                         2/10     2/4
 5                         3/10     3/5
 6                         4/10     4/6
 7                         4/10     4/7
 8                         4/10     4/8
 9                         5/10     5/9
10                         5/10     5/10
11                         6/10     6/11
12                         7/10     7/12
13                         7/10     7/13
14                         7/10     7/14
15                         8/10     8/15
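The recall-precision pairs of this example can be computed mechanically. The sketch below walks down the ranked list, using the user's judgements and the known total of 10 relevant items:

```python
# Reproduce Table 4.2: recall and precision after each returned item.
# 'R' marks an item the user judges relevant; 'I' an irrelevant one.
returned = list("RRIIRRIIRIRRIIR")   # the ranked list from the example
total_relevant = 10                  # relevant items in the whole database

relevant_so_far = 0
for n, judgement in enumerate(returned, start=1):
    if judgement == "R":
        relevant_so_far += 1
    recall = relevant_so_far / total_relevant
    precision = relevant_so_far / n
    print(f"{n:2d}  recall={recall:.2f}  precision={precision:.2f}")
```

After the last item the loop reaches recall 8/10 and precision 8/15, matching the final row of the table.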

4.9 PERFORMANCE COMPARISON AMONG DIFFERENT IR TECHNIQUES


Studies have been carried out to evaluate the retrieval performance of different techniques [4, 6]. The following
are some of the findings:
Automatic indexing is as good as manual indexing, but performance will be better if a combination of
automatic and manual indexing is used.
When similar queries are used, the retrieval performance of partial match techniques is better than exact
match techniques (Boolean model).
The probabilistic model and vector space model have similar retrieval performance.
Cluster-based retrieval techniques and the probabilistic model have similar retrieval performance, but they
retrieve different documents.
Assuming not all relevant documents are found on the first pass, the use of relevance feedback will
improve the retrieval performance.
During query formulation and relevance feedback, significant user input produces higher retrieval
performance than no or limited user input.
The use of domain knowledge and user profile significantly improves the retrieval performance.

4.10 WWW SEARCH ENGINES


The WWW is currently one of the most popular applications of the Internet. It can be perceived as a collection
of interlinked documents distributed around the world. The main purpose of this section is to discuss how a user
can retrieve relevant documents from the WWW using tools called search engines. We first introduce the basic
concepts of the WWW, discuss the main differences between traditional IR systems and WWW search engines,
and then describe an example search engine.

4.10.1 A Brief Introduction to the WWW


A basic understanding of hypertext and hypermedia is required to understand the WWW. Hypertext is a way of


organizing information that allows non-sequential access. A hypertext document is made up of a number of
nodes and links. A node usually represents a single concept or idea. It is a container of information. Links
connect related nodes. An area within the content of a node indicating the existence of a link is called an anchor.
Anchors are normally highlighted in a special way (e.g., underlined or color shaded) or represented by a special
symbol. Selecting the anchor activates the link and brings up the destination node. Note that many people do not
distinguish between links and anchors. To them a highlighted area in a hypertext document is a link. Figure 4.3
shows an example hypertext document. It shows three of many nodes of a document about Monash University.
Initially, the first node with general information on Monash is shown. The underlined words are anchors
indicating that there are links leading to more information about the underlined items. If the user wants to find
more information on any underlined item, he or she can simply select the link associated with the anchor. For
example, if he or she selects "campuses," hoping to find out more about the campuses, the system brings up node 2
with information on the campuses. Again there are a number of anchors that the reader can select. For example,
if the user selects "Gippsland," node 3 is brought up. We see that with hypertext it is easy to retrieve related
information.
To summarize, hypertext is an approach to information management in which data is stored in a network of
nodes connected by computer-supported links. The modifier "computer-supported" is very important. In
traditional printed text, we can think of footnotes and references as links. These links are not computer
supported, and information pointed to by them cannot be retrieved quickly. Thus hypertext capability is
achieved through use of the storage, fast-searching, and fast-retrieval abilities of computers.

Node 1: "Monash University is Australia's largest university and a national leader in its innovative approach to
research and teaching. With seven campuses, nine faculties and two subfaculties, Monash provides a
comprehensive ..."

Node 2: "The seven campuses are Berwick, Caulfield, Gippsland, Parkville, Peninsula, and Sunway Malaysia.
These campuses provide students with easy access to higher education ..."

Node 3: "The Gippsland campus is located in the Latrobe Valley, about 168 kilometers east of Melbourne. The
valley is an important industrial, rural, and tourist region, rich in resources such as brown coal, oil and gas,
and timber. It also has highly productive agricultural ..."

Figure 4.3 An example hypertext document.
Hypermedia is an extension of hypertext in that anchors and nodes can be any type of media, such as
graphics, images, audio, and video, as well as text. For example, if the hypertext document in the above example
is extended into hypermedia, a map may be used to indicate the locations of the campuses, and sound
and video clips may be used to make the presentation more effective.
The WWW is the geographical extension of hypermedia in that the destination anchor or node of a link can
be located anywhere on the network. So a WWW document is distributed and different parts of the document
are stored at different locations on the network. In the above example, for instance, the information for
individual campuses is maintained and stored on servers at the respective campuses. The locations of these
nodes are almost transparent to the user. As in hypermedia, the user just selects the anchor and the associated
nodes are brought up. We say it is almost transparent because if the network connecting the selected node is
slow or busy, the user may find that it takes longer to bring up the node than when the node is stored locally.


Thus it may not be entirely transparent.


In principle, the network for linking nodes can be any type of network. However, due to the popularity and
wide availability of the Internet, the current WWW runs on the Internet. So the WWW is the integration of
hypermedia and the Internet.
Figure 4.4 shows the architecture of the WWW. The main components of the WWW are server, client, and
the connection between the server and the client. Note that although only one server is shown in the diagram, the
user may access multiple servers in any information retrieval session.

[Figure 4.4 shows a client connected to a server via HTTP; the server passes application-specific requests to an
application program through CGI.]

Figure 4.4 A simplified configuration of the WWW.


The user interacts with the client or browser through a user interface. In a typical session, the user enters
the document request through the interface. The client then sends the request to the appropriate server. The
server processes the request and retrieves and sends the requested document to the client if the server has the
document and the client has permission to access the document. The received document is presented to the user
by the client.
When the user's request is related to a specific application, such as searching a database, the server passes
the request to an application program through the common gateway interface (CGI). The result of processing the
request is passed back to the server which then sends it to the client.
WWW documents are formatted using Hypertext Markup Language (HTML). HTML structures the
document in a standard way so that clients can interpret and display the document correctly.
The communication between the client and server is carried out using Hypertext Transport Protocol
(HTTP). HTTP is a reliable protocol built on top of TCP/IP. In other words, HTTP guarantees the logical
correctness of the communication but does not provide a timeliness guarantee.
The term WWW has two meanings. First, it refers to a set of concepts and protocols including HTTP and
HTML. Second, it refers to a space of digitized information. The success of the WWW is largely due to user-friendly browsers that provide easy access to information on the Internet. In addition to native HTML
documents, we can access other servers, such as a file transfer protocol (FTP) server and gopher, through the
Web browser.
The WWW is now a popular tool to disseminate and retrieve information. We do not attempt to cover all
technical aspects of the WWW. There is a large amount of literature on the WWW on-line and in printed form.
Here we look at one important issue of the WWW: how to find useful information.
4.10.2 Resource Discovery
Resource discovery refers to the process of finding and retrieving information on the Internet. As the number of
users and the amount of information on the Internet grow rapidly, how to find relevant information becomes
very important. There are millions of users and millions of servers managing and storing information in one
form or another. How do we know the information that we want exists on the Internet? If it exists, how do we
know the location of the documents and how can we retrieve them? These are some of the issues of resource
discovery.
Let us first look at how locations of documents in the WWW and the Internet in general are specified. On
the Internet, document locations are specified using uniform resource locators (URL). The general format of
URLs is as follows:
Protocol://Server-name[:port]/Document-name


The URL has three parts. The first part specifies the Internet protocol used to access the document. The
protocols that can be used include ftp, http, gopher, and telnet.
The second part of the URL identifies the name of the document server, such as
www-gscit.fcit.monash.edu.au. This is the standard Internet domain specification. The example server
name means that the server is called www-gscit, which is in domain fcit (Faculty of Computing and
Information Technology) of monash (Monash University) of edu (education sector) of au (Australia).
Each server name has a corresponding Internet Protocol (IP) address. So if the IP address is known we can use it
directly instead of the machine name string. If the server is run on a nondefault port (for the WWW, the default
port is 80), we need to specify the port being used. The final part of the URL represents the file name of the
document to be retrieved. The file name must be complete, including the full path name.
The following are two example URLs:
http://www-gscit.fcit.monash.edu.au/gindex.html
ftp://ftp.monash.edu.au/pub/internet/readme.txt
The first URL refers to a document called gindex.html in the default directory of server www-gscit.fcit.monash.edu.au, accessible using HTTP. The second URL refers to a file called readme.txt in the
directory /pub/internet of the server ftp.monash.edu.au accessible using ftp.
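As a rough illustration, the three parts of a URL can be pulled apart with Python's standard library:

```python
# Split a URL into the three parts described above using the standard library.
from urllib.parse import urlparse

url = "http://www-gscit.fcit.monash.edu.au/gindex.html"
parts = urlparse(url)
print(parts.scheme)     # protocol: http
print(parts.hostname)   # server name: www-gscit.fcit.monash.edu.au
print(parts.port or 80) # port; defaults to 80 for the WWW when none is given
print(parts.path)       # document name: /gindex.html
```

The same call handles the ftp example above, with `parts.scheme` becoming `ftp`.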
Internet documents are uniquely specified with URLs. Now, how can we know that the file we want
exists, and what is its corresponding URL?
There are two general ways to find and retrieve documents on the Internet: organizing/browsing and
searching. Organizing refers to the human-guided process of deciding how to interrelate information, usually by
placing documents into some sort of a hierarchy. For example, documents on the Internet can be classified
according to their subject areas. One subject area may have multiple levels of subareas. Browsing refers to the
corresponding human-guided activity of exploring the organization and contents of a resource space or to the
human activity of following links or URLs to see what is there. Searching is a process in which the user
provides some description of the resources being sought, and a discovery system locates information that
matches the description, usually using the IR techniques discussed in the previous section.
Browsing is a slow process for finding information. It depends heavily on the quality of the information's
organization. It may be difficult to find all relevant information, and users can get disoriented and lost in the
information space.
Searching is more efficient than browsing, but it relies on the assumption that information is indexed.
There are currently many servers on the Internet that provide searching facilities. These servers use a program
called a crawler to visit major information servers around the world and to index information available on these
servers. The indexing technique used is similar to that used in IR discussed earlier in this chapter. In this case
the document identifiers are the URLs of the documents. These searching servers also rely on the document
creator to inform them about the contents and URLs of documents they created. So these searching servers
provide pointers to documents on the Internet.
In practice both browsing and searching are used in information discovery. The user may first browse
around to find a suitable search engine to use. Then he or she issues a query to the server. There may be many
documents returned in response to each query. These documents are normally ranked according to the similarity
between the query and documents. The user has to determine which documents are useful by browsing.
Resource discovery on the Internet is an extended case of IR. In this case, documents are distributed across
many servers on the Internet, making information organization, indexing, and retrieval more challenging. In the
following, we describe the main differences between WWW search engines and IR systems, their implications
to WWW search engine design, and an example WWW search engine.

4.10.3 Major Differences Between IR Systems and WWW Search Engines


The basic role of a WWW search engine is similar to that of an IR system: to index documents and retrieve
relevant documents in response to users' queries. But their operating environments differ significantly, leading
to many challenges in designing and developing a WWW search engine. The major differences are:
1. WWW documents are distributed around the Internet while documents in an IR system are centrally located;
2. The number of WWW documents is much greater than that of an IR system;
3. WWW documents are more dynamic and heterogeneous than documents in an IR system;
4. WWW documents are structured with HTML while the documents in an IR system are normally plain text;


5. WWW search engines are used by more users and more frequently than IR systems.
We now discuss these differences and their implications for WWW search engine design.

4.10.3.1 WWW Documents are Distributed


In the WWW, documents are stored on a huge number of servers located around the world. Before these
documents can be analyzed and indexed, they have to be retrieved from these distributed servers. The component
in a WWW search engine performing this function is called a crawler, robot, or spider [7, 8].
The spider visits WWW servers and retrieves HTML documents based on a URL database. URLs can be
submitted by Web document authors, extracted from embedded links in Web documents, or obtained from
databases of name servers.
The retrieved documents are sent to an indexing engine for analysis and indexing. Most current spiders
only retrieve HTML documents. Images, video, and other media are ignored.
It is sometimes not possible to crawl the entire WWW. So a search engine must decide which WWW
documents are visited and indexed. Even if crawling the entire WWW is possible, it is advantageous to visit and
index more important documents first because a significant amount of time is required to obtain and index all
documents. Cho, Garcia-Molina, and Page defined several importance metrics and proposed a number of
crawling order schemes [9]. Their experimental results show that a crawler with a good ordering scheme can
obtain important pages significantly faster than one without.
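A much-simplified crawler can be sketched as follows. The fetch step is stubbed with an in-memory page table and the URLs are hypothetical; a real spider would issue HTTP requests, respect robots.txt and politeness delays, and apply a crawl-ordering policy such as those studied by Cho, Garcia-Molina, and Page.

```python
# Toy crawler: keep a frontier of URLs, "fetch" each page, extract its links,
# and enqueue unseen URLs. Fetching is stubbed with a dictionary of pages.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values of <a> tags found in a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical pages standing in for distributed WWW servers.
pages = {
    "http://a.example/": '<a href="http://b.example/">B</a>',
    "http://b.example/": '<a href="http://a.example/">A</a>',
}

frontier, seen = ["http://a.example/"], set()
while frontier:
    url = frontier.pop(0)
    if url in seen or url not in pages:
        continue
    seen.add(url)
    extractor = LinkExtractor()
    extractor.feed(pages[url])       # a real crawler would index the page here
    frontier.extend(extractor.links)
print(sorted(seen))
```

The breadth-first frontier here is the simplest ordering; importance-based orderings replace the plain list with a priority queue.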

4.10.3.2 The Number of WWW Documents is Large


There are millions of WWW servers around the world and each server stores many HTML documents. This
large scale has many implications for search engine resources (CPU speed, bandwidth, storage) and retrieval
strategies.
A simple calculation shows the resource requirements of a WWW search engine. Suppose a total of 100
million documents need to be retrieved and indexed by a search engine and it takes 1 second to retrieve, analyze
and index each document. It would then take 1,157 days for the search engine to complete the task! This is
simply unacceptable. To overcome this problem, most search engines are built around multiple powerful computers with huge main memory, huge disk, fast CPUs, and high bandwidth Internet connections. For example,
AltaVista has 16 AlphaServer 8400 5/300s, each with 6 GB of memory, with 100-Mbps Internet access [10]. It
is claimed that it can visit and retrieve 6 million HTML documents per day.
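The 1,157-day figure above is simple arithmetic:

```python
# 100 million documents at 1 second each, converted to days.
seconds = 100_000_000 * 1
days = seconds / (60 * 60 * 24)
print(round(days))  # 1157
```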
Another effect of the large number of documents is on retrieval strategies. The returned list must be ranked,
with more relevant items being listed first. Otherwise, it would be hard for the user to find the relevant items
from the long return list. Indexing and retrieval should be designed to take into account the fact that retrieval
precision is more important than recall. This is to reduce the return list and make the first few items more
relevant to the users needs. Also, the low recall problem can be alleviated by the fact that retrieved relevant
documents may have links to other relevant items.

4.10.3.3 WWW Documents are Dynamic and Heterogeneous


Web pages are heterogeneous and dynamic. They are developed by a wide range of people according to
different standards and cover diverse topics. They are also constantly updated and changed without notice. These
facts have significant implications for search engine design:
First, a huge vocabulary must be used to cope with the large number and diversity of documents.
Second, it is difficult to make use of domain knowledge to improve retrieval effectiveness, as documents are
from many different domains.
Third, accurate document frequencies cannot be obtained for calculating term weights, as the Web database is
built progressively and is never complete.
Fourth, the vector space model is not suitable, because document size varies and this model favors short
documents.
Fifth, the index must be updated constantly, as the documents change constantly.
Sixth, the search engine must be robust, to cope with the unpredictable nature of documents and Web


servers.

4.10.3.4 WWW Documents are Structured


Web pages are normally structured according to HTML. There are tags to indicate different types of text such as
document title, section title, different fonts, and links to other pages. A search engine can make use of this
information to assign different weights to different types of text, leading to higher retrieval effectiveness. We
describe how this information is used in an example search engine in Section 4.10.5.
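One possible sketch of tag-based weighting follows. The weights are illustrative assumptions (the actual weights used by any search engine are not public): words in the document title count more than words in the body.

```python
# Weight terms by HTML text type: title (and, hypothetically, anchor) text
# counts more than body text. The weights are made up for illustration.
from html.parser import HTMLParser
from collections import Counter

TAG_WEIGHTS = {"title": 5, "a": 3}   # body and other text gets weight 1

class WeightedIndexer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack, self.weights = [], Counter()
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()
    def handle_data(self, data):
        w = TAG_WEIGHTS.get(self.stack[-1] if self.stack else "", 1)
        for word in data.lower().split():
            self.weights[word] += w

idx = WeightedIndexer()
idx.feed("<html><title>multimedia retrieval</title>"
         "<body>retrieval of text</body></html>")
print(idx.weights["retrieval"])   # 5 (title) + 1 (body) = 6
```

A word that appears in both the title and the body thus accumulates weight from both occurrences.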

4.10.3.5 WWW Search Engines are Heavily Used


Web search engines are used frequently by many diverse users. This means that the user interface must be easy
for novices to use and must provide advanced features for advanced users to improve the retrieval effectiveness.
In addition, the search engine must efficiently serve many users simultaneously with a reasonable response time.
This requires both powerful machines and appropriate index structures.

4.10.3.6 Other Issues


In addition to the above characteristics and implications, there are many other issues in designing and
developing Web search engines.
First, the designers have to decide whether the search engine has a central storage (i.e., whether indexed Web
pages should be stored centrally in the search engine or be discarded after indexing and users have to retrieve
them from their original servers). The decision has a number of implications. If Web pages are centrally stored,
a large amount of storage is required. As Web pages are constantly updated, the central storage may not have the
latest version of the document. In addition, there is a copyright issue. Is it legal to collect Web pages and
distribute them? On the other hand, if the Web pages are not centrally stored after indexing, the original
documents or even servers may not exist any more when the user tries to retrieve indexed documents.
Second, search engines are heavily commercialized. The search engine may favor the Web pages of its
sponsors, leading to low objective retrieval effectiveness.
Third, retrieval effectiveness can be manipulated by Web page designers. For example, most search
engines rank documents based on word occurrence. The more often the word occurs in a document, the higher
the relevance of the document to the query containing that word. A Web page designer can use this knowledge
to improve the rank of the page by repeating the word or phrase many times in the document title or body.
Fourth, the result of a search may give a wrong impression regarding the availability of required
information. For example, a user needing information on a particular topic may wrongly conclude that there is
no such information available on-line if the search engine returns "no match found," although the
information actually exists. This may be due to the fact that indexing by the search engine is not complete, the
retrieval effectiveness is very low, and/or the query is not well formulated.
Fifth, most likely, different search engines return totally different lists to the same query. There is currently
no comparison among common search engines in terms of completeness and effectiveness.

4.10.4 General Structure of WWW Search Engines


Design details differ from one search engine to another. However, all search engines have three major elements.
The first is the spider, crawler, or robot. The spider visits a Web page, reads it, and then follows links to other
pages within the site. The spider may return to the site on a regular basis, such as every month or two, to look
for changes.
Everything the spider finds goes into the second part of a search engine, the index. The index, sometimes
called the catalog, is like a giant book containing a copy of every Web page that the spider finds. If a Web page
changes, then this book is updated with new information. Sometimes it can take a while for new pages or
changes that the spider finds to be added to the index. Thus, a Web page may have been spidered but not yet
indexed. Until it is indexed, that is, added to the index, it is not available to those searching with the search engine.
Search engine software is the third part of a search engine. This is the program that sifts through the
millions of pages recorded in the index to find matches to a search and rank them in order of what it estimates is
most relevant. Different search engines use different similarity measurements and ranking functions. However,


they all use term frequency and term location in one way or another.
All search engines have the basic parts described above, but there are differences in how these parts are
tuned. That is why the same search on different search engines often produces different results.

4.10.5 An Example Search Engine


There are many search engines available on the WWW. For commercial reasons, their design details are rarely
publicized. But their main working principles are the same. In this subsection, we describe a research prototype
search engine, called Google [11, 12]. In the following we first provide an overview of the Google architecture
and then describe a number of major components in some detail.

4.10.5.1 Architecture Overview of Google


Figure 4.5 shows the high level architecture of Google. It has two interfaces to the Internet. One is used for
crawlers to visit and fetch documents from WWW servers distributed around the world, while the other is used
to serve users' search requests. Its operation can be described in two phases: Web crawling (downloading Web
pages) and indexing, and document searching.
During the Web crawling and indexing phase, the URL server sends lists of URLs to be fetched to the
crawlers. To improve the crawling speed, several distributed crawlers run simultaneously. The Web pages (or
documents) fetched by the crawlers are passed to the compression server, which compresses and stores Web
pages in the repository. The indexer reads the compressed Web pages, uncompresses them, and parses them. Based on
word occurrences, word positions, and word properties such as font size and capitalization, the indexer
generates a forward index file that is sorted by DocID. DocID is the ID number assigned to each Web page. In
addition, the parser also parses out all the links in every Web page and stores important information, including
link sources and destinations, and the text of each link (anchor text), in an anchor file.
The URL resolver reads URLs from the anchor file and converts relative URLs into absolute URLs. It
extracts anchor text and puts it into the forward index, associated with the DocID that the anchor points to. It
also generates a links database that contains pairs of documents linked by each link. PageRanks for all the
documents are then calculated, based on the link database.
The indexer and URL resolver also generate a document information (Doc Infor) file that contains
information about each document, including DocID, URL, whether it has been crawled, and a pointer to the repository.
Uncrawled URLs are fed to the URL server, waiting to be crawled.
The sorter generates an inverted index file from the forward index file. The inverted index is sorted by
WordID.
As a result of the crawling and indexing phase, entries for each Web page are created in the forward index,
inverted file, document information file, and PageRank file.
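The sorter's conversion from forward index to inverted index can be sketched as follows, with made-up DocIDs and WordIDs:

```python
# Turn a forward index (DocID -> WordIDs) into an inverted index
# (WordID -> DocIDs) sorted by WordID. All IDs are illustrative.
forward_index = {
    1: [10, 42, 7],   # DocID 1 contains WordIDs 10, 42, and 7
    2: [42, 99],
    3: [7, 10],
}

inverted_index = {}
for doc_id, word_ids in forward_index.items():
    for word_id in word_ids:
        inverted_index.setdefault(word_id, []).append(doc_id)

for word_id in sorted(inverted_index):
    print(word_id, inverted_index[word_id])
```

Sorting by WordID is what lets the searcher jump straight to the posting list for each query word.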
During the document searching phase, the user enters a query from a Web browser. The query is normally
a number of keywords. It is transmitted to the Google Web server which in turn passes the query to the Google
searcher. The searcher retrieves relevant information from the forward index, inverted file, document
information file, and PageRank file, and ranks documents according to their similarity to the query. The ranked
list is returned to the user who selects the Web pages to be retrieved from the repository and displayed.
In the following, we describe some of the main components or functions of the Google search engine.


Figure 4.5 The Google architecture.

4.10.5.2 Web Crawling


The two technical issues of Web crawling are performance and reliability. To improve performance, Google
runs a number of crawlers (typically three) at the same time. Each crawler keeps about 300 connections open at once.
At peak speed, the system can crawl over 100 Web pages per second using four crawlers. As a major bottleneck
of Web crawling is domain name server (DNS) lookup, each crawler maintains its own DNS cache so it does
not need to do a DNS lookup before crawling each document.
Reliability is a very important practical issue, as a crawler needs to visit millions of diverse Web pages.
These pages may use different versions of HTML, may not conform to the HTML standards, may contain many
typos, and may be in different stages of development. Careful design, testing, and monitoring of the crawling process are
required.

4.10.5.3 PageRanks and Anchor Text


Google improves retrieval effectiveness by using information present in Web page structure. The two main
features used in Google are PageRank and anchor text.
PageRank is based on the intuition that a page is important if many other pages have links pointing to it
and/or if one or more important pages have links pointing to it. The PageRank of page A is defined as:

PR(A) = (1 - d) + d(PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

where d is a damping factor between 0 and 1, T1 to Tn are the Web pages that have links pointing to A, and C(Ti)
is the number of links going out of page Ti.


PR(A) is calculated using a simple iterative algorithm. It was reported that PageRanks for 26 million Web
pages can be computed in a few hours on a medium-size workstation [12]. PageRank is a good way to prioritize
the results of Web keyword searches, leading to higher retrieval precision.
The second main feature of Google is that it considers anchor text in both source and destination pages. As
anchor text is part of the source document, most search engines consider it implicitly. In Google, anchor text has
more importance (higher weight) than normal text. In addition, it has a significant weight in the destination
document. This approach has a number of advantages. First, anchor text is normally made up of important terms
or concepts, otherwise the author would not bother to add a link to the text for further information. Second,
anchor text normally provides a good description of the destination document. Third, this approach makes it
possible to return Web pages that have not been crawled.
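The treatment of anchor text can be illustrated by a small indexing sketch. This is a hypothetical simplification, assuming a bag-of-words index; the weight value and data structures are invented for the example, but it shows the key property from the text: anchor terms are indexed under the destination URL with a higher weight, so even uncrawled pages acquire index entries.

```python
from collections import defaultdict

def build_index(crawled_pages, anchors, anchor_weight=3.0):
    """Build a term-weight index from body text and anchor text.

    crawled_pages: {url: body_text}
    anchors: list of (source_url, dest_url, anchor_text) triples
    Returns {url: {term: weight}}.
    """
    index = defaultdict(lambda: defaultdict(float))
    # Body text is indexed under the page it appears in.
    for url, text in crawled_pages.items():
        for term in text.lower().split():
            index[url][term] += 1.0
    # Anchor text is indexed under the *destination* page, with a
    # higher weight; the destination need not have been crawled.
    for _source, dest, text in anchors:
        for term in text.lower().split():
            index[dest][term] += anchor_weight
    return index
```

Note how a destination URL that appears only in the anchor list still receives index entries, which is how a search engine can return pages it has never fetched.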

4.10.5.4 Searching
When the Google Web server receives a user's query, it is passed to the searcher. The searcher parses the query
and converts it into WordIDs.
In the forward and inverted index files, many types of information about each word are stored. The
information includes the word's position in a document and the type of text it occurs in (title, anchor, URL, font size,
etc.). Different types of text have different weights. Word proximity is computed from word positions: the closer
the words, the higher the weight. An initial score is then calculated for each document based on the types of text,
the number of occurrences of each type, and word proximity. The final score is obtained by combining the initial
score with the PageRank. The resulting list is displayed in descending order of the final score.
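The scoring scheme described above can be sketched as follows. The concrete type weights, the proximity bonus, and the way the text score is combined with PageRank are all assumptions made for illustration; Google's actual values and combination function are not published.

```python
# Hypothetical per-text-type weights; the real values are not published.
TYPE_WEIGHTS = {"title": 5.0, "anchor": 3.0, "url": 2.0, "body": 1.0}

def text_score(hits):
    """Score one document from its query-word hits.

    hits: list of (text_type, position) pairs for the query words.
    """
    # Weighted contribution of each hit, by the type of text it occurs in.
    score = sum(TYPE_WEIGHTS.get(t, 1.0) for t, _ in hits)
    # Proximity bonus: the closer the query words, the higher the weight.
    positions = sorted(p for _, p in hits)
    if len(positions) > 1:
        gap = min(b - a for a, b in zip(positions, positions[1:]))
        score += 10.0 / max(gap, 1)
    return score

def final_score(hits, pagerank):
    # Combine the text-based score with PageRank (a simple product
    # here; the actual combination function is an assumption).
    return text_score(hits) * pagerank
```

With this sketch, a document whose query words appear close together in the title outscores one where the same words are far apart, and among documents with equal text scores the one with the higher PageRank ranks first.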

4.11 SUMMARY
The main design issue of IR systems is how to represent documents and queries and then how to compare the
similarity between the document and query representations. We have described a number of common techniques
(called retrieval models) that address this issue.
The retrieval effectiveness of IR systems is yet to be improved. A promising approach is to use domain
knowledge and NLP to automatically understand documents and queries. This is not surprising considering how
human beings determine if a document is relevant or useful.
WWW search engines are similar to IR systems but present many other challenging issues. Current WWW
search engines search text-only (HTML) documents. It is expected that future search engines will be able to
search various media, including text, image, video, and audio, by integrating IR techniques with the content-based multimedia retrieval techniques discussed in the following chapters.

