Chap 4 Text IR PDF
with documents themselves. During retrieval (left side of the figure), the user issues a query that is processed
(on-line) to obtain its representation. The query representation is then compared with the document
representations. Documents deemed relevant by the system are retrieved and presented to the user, who
evaluates the returned documents and decides which ones are actually relevant to the information need. A good
IR system should then allow the user to provide relevance feedback to the system. The system uses this
information to modify the query, the query representation, and/or the document representations. Another retrieval is done
based on the modified query and document representations. If necessary, the retrieval-feedback process is
iterated a few times. Note that not all IR systems support user relevance feedback.
Different IR models use different methods for query and document representation, similarity comparison
and/or relevance feedback. We discuss Boolean, vector space, probabilistic, and clustering models in the
following sections.
[Figure 4.1: The information retrieval process. Text documents and the query each undergo processing to produce document and query representations; a similarity calculation between the representations yields the retrieved documents, which the user assesses in a relevance evaluation step.]
In an OR query, such as (term 1 OR term 2), the presence of either term in a record (or a document) suffices to retrieve that record.
The AND operator combines terms (or keywords) into term phrases; thus the query (term 1 AND term 2)
indicates that both terms must be present in the document in order for it to be retrieved.
The NOT operator is a restriction, or term-narrowing, operator that is normally used in conjunction with
the AND operator to restrict the applicability of particular terms; thus the query (term 1 AND NOT term 2)
leads to the retrieval of records containing term 1 but not term 2.
4.3.2 File Structure
A fundamental decision in the design of IR systems is which type of file structure to use for the underlying
document database. File structures used in IR systems include flat files, inverted files, signature files, and others
such as PAT trees and graphs [1].
Using a flat-file approach, one or more documents are stored in a file, usually as ASCII or EBCDIC text.
Documents are not indexed. Flat-file searching is usually done via pattern searching. In UNIX, for example, one
can store a document collection one document per file in a UNIX directory. These files can be searched using
pattern searching tools such as grep and awk. This approach is not efficient, because for each query the
entire document collection must be scanned to check for the query text pattern.
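As a rough sketch of this brute-force approach, the following Python fragment (function and file names are illustrative, not from the text) scans every file in a directory for a pattern, much as grep would; note that every query reads the whole collection:

```python
import re
from pathlib import Path

def flat_file_search(directory, pattern):
    """Return the names of files whose full text matches the pattern.
    The entire collection is read on every query, which is exactly
    why flat-file searching scales poorly."""
    matches = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and re.search(pattern, path.read_text()):
            matches.append(path.name)
    return matches
```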
Signature files contain signatures (bit patterns) that represent documents. There are many ways to generate
signatures for documents. The query is also represented by a signature that is compared with the document
signature during retrieval.
A commonly used file structure is an inverted file that is a kind of indexed file. The inverted file concept
and retrieval operations based on inverted files are described next.
Inverted Files
In an inverted file, for each term a separate index is constructed that stores the record identifiers for all records
containing that term. An inverted file entry usually contains a keyword (term) and a number of document-IDs.
Each keyword or term and the document-IDs of documents containing the keyword are organized into one row.
An example of an inverted file is shown below:
Term 1: Record 1, Record 3
Term 2: Record 1, Record 2
Term 3: Record 2, Record 3, Record 4
Term 4: Record 1, Record 2, Record 3, Record 4
where Term i (i = 1, 2, 3, or 4) is the ID number of index term i, and Record i is the ID number of record (document) i.
The first row means that Term 1 is used in Record 1 and Record 3. The other rows have analogous
meanings. Using an inverted file, searching and retrieval is fast. Only rows containing query terms are retrieved.
There is no need to search for every record in the database.
Retrieval rules using the Boolean model based on inverted files are as follows:
For a Boolean AND query, such as (Term i AND Term j), a merged list is produced for rows i and j of
the inverted file, and all duplicated records (those containing both Term i and Term j) are presented as
output. Using the above inverted file as an example, for the query (Term 1 AND Term 3), the output is Record 3.
For an OR query, such as (Term i OR Term j), the merged list is produced for rows i and j and all distinct
items of the merged list are presented as output. Using the above inverted file as an example, for query
(Term 1 OR Term 2), the output is:
Record 1, Record 2, Record 3.
For a NOT query, such as (Term i AND NOT Term j) the output is items appearing in row i but not in row
j. Using the above inverted file as an example, for query (Term 4 AND NOT Term 1), the output will be:
Record 2, Record 4. For query (Term 1 AND NOT Term 4), the output is nil.
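The three retrieval rules map naturally onto set operations. The following sketch (a minimal illustration, not a production index structure) encodes the example inverted file as sets of record IDs:

```python
# The example inverted file: term -> set of record IDs containing it.
inverted = {
    "term1": {1, 3},
    "term2": {1, 2},
    "term3": {2, 3, 4},
    "term4": {1, 2, 3, 4},
}

def boolean_and(t1, t2):
    # Records appearing in both rows (set intersection).
    return sorted(inverted[t1] & inverted[t2])

def boolean_or(t1, t2):
    # All distinct records from the merged rows (set union).
    return sorted(inverted[t1] | inverted[t2])

def boolean_and_not(t1, t2):
    # Records in row t1 but not in row t2 (set difference).
    return sorted(inverted[t1] - inverted[t2])
```

Run on the example file, (Term 1 AND Term 3) gives Record 3, (Term 1 OR Term 2) gives Records 1, 2, 3, and (Term 4 AND NOT Term 1) gives Records 2, 4, matching the results above.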
So far we have ignored two important factors in document indexing and retrieval: term positions and the
significance of terms (term weights) in documents. In AND queries, all records containing both terms are
retrieved, regardless of the terms' positions in the documents. Each term is treated as equally important,
regardless of its frequency of occurrence in documents. To improve retrieval performance, these two factors
must be taken into account. We discuss term weights later; in the following we discuss position constraints.
The relationships specified between two or more terms can be strengthened by adding nearness parameters
to the query specification. When nearness parameters are included in a query specification, the topic is more
specifically defined, and the probable relevance of any retrieved item is higher.
Two possible parameters of this kind are the within-sentence and adjacency specifications:
(Term i within sentence Term j) means that terms i and j occur in a common sentence of a retrieved record.
(Term i adjacent Term j) means that Terms i and j occur adjacently in the retrieved documents.
To support this type of query, term location information must be included in the inverted file. The general
structure of the extended inverted file is
Term i: Record no., Paragraph no., Sentence no., Word no.
For example, if an inverted file has the following entries:
information: R99, 10, 8, 3; R155, 15, 3, 6; R166, 2, 3, 1
retrieval: R77, 9, 7, 2; R99, 10, 8, 4; R166, 10, 2, 5
then as a result of query (information within sentence retrieval), only document R99 will be retrieved.
In the above example, the terms information and retrieval appear in the same sentence of document R99,
so it is very likely that the record is about information retrieval. Although document R166 contains both
information and retrieval, they appear in different places in the document, so the document may not be about
information retrieval; it is likely that the two terms are used in different contexts.
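Assuming an extended inverted file that stores (record, paragraph, sentence, word) tuples as described, the two positional operators might be sketched as follows (the data is taken from the example entries above):

```python
# Extended inverted file entries from the example:
# term -> list of (record, paragraph_no, sentence_no, word_no).
postings = {
    "information": [("R99", 10, 8, 3), ("R155", 15, 3, 6), ("R166", 2, 3, 1)],
    "retrieval":   [("R77", 9, 7, 2), ("R99", 10, 8, 4), ("R166", 10, 2, 5)],
}

def within_sentence(t1, t2):
    """Records where both terms occur in the same paragraph and sentence."""
    return sorted({r1 for (r1, p1, s1, _) in postings[t1]
                      for (r2, p2, s2, _) in postings[t2]
                      if (r1, p1, s1) == (r2, p2, s2)})

def adjacent(t1, t2):
    """Records where the two terms occupy consecutive word positions
    in the same sentence."""
    return sorted({r1 for (r1, p1, s1, w1) in postings[t1]
                      for (r2, p2, s2, w2) in postings[t2]
                      if (r1, p1, s1) == (r2, p2, s2) and abs(w1 - w2) == 1})
```

For the query (information within sentence retrieval), only R99 qualifies; since the terms are at word positions 3 and 4, (information adjacent retrieval) also returns R99.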
[Table: part of a sample stop word list - A, ABOUT, ACROSS, AFTER, AFTERWARDS, AGAIN, AGAINST, ALL, ALSO, ALTHOUGH, ALWAYS, AMONG, AMONGST, AN, AND, ANOTHER, ANY, ANYHOW, ANYONE, ANYTHING, ANYWHERE, ARE, AROUND, AS, AT, BE, BECOME]
Another way of conflating related terms is with a thesaurus that lists synonymous terms and sometimes the
relationships among them. For example, the words study, learning, schoolwork, and reading have
similar meanings. So instead of using four index terms, a general term study can be used to represent these
four terms. The thesaurus operation has a similar effect on retrieval efficiency and effectiveness as the stemming
operation.
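A minimal sketch of the thesaurus operation, using a hypothetical mapping built around the study example (the variant terms and counts are illustrative):

```python
# Hypothetical thesaurus: each variant maps to one general index term.
thesaurus = {"study": "study", "learning": "study",
             "schoolwork": "study", "reading": "study"}

def conflate(term_counts, thesaurus):
    """Merge raw term frequencies so that the count of each general
    term is the sum of the counts of all its variants; terms not in
    the thesaurus are kept unchanged."""
    merged = {}
    for term, count in term_counts.items():
        general = thesaurus.get(term, term)
        merged[general] = merged.get(general, 0) + count
    return merged
```

The same function models the stemming operation if the mapping sends each variant (retrieve, retrieval, retrieving, retrieved) to its stem (retriev).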
Different index terms have different frequencies of occurrence and different importance to the document. Note
that the occurring frequency of a term after stemming and thesaurus operations is the sum of the frequencies of
all its variations. For example, the term frequency of retriev is the sum of the occurring frequencies of the
terms retrieve, retrieval, retrieving, and retrieved. The introduction of term-importance weights for
document terms and query terms may help distinguish terms that are more important for retrieval purposes
from less important terms. When term weights are added to the inverted file, different documents have different
similarities to the query, and the documents can be ranked at retrieval time in decreasing order of similarity.
An example of an inverted file with term weights is shown below:
Term 1: R1, 0.3; R3, 0.5; R6, 0.8; R7, 0.2; R11, 1
Term 2: R2, 0.7; R3, 0.6; R7, 0.5; R9, 0.5
Term 3: R1, 0.8; R2, 0.4; R9, 0.7
The first row means that the weight of Term 1 is 0.3 in Record 1, 0.5 in Record 3, 0.8 in Record 6, 0.2 in
Record 7, and 1 in Record 11. The other rows can be read similarly.
Boolean operations with term weights can be carried out as follows:
For an OR query, the higher weight among records containing the query terms is used as the similarity
between the query and the documents. The returned list is ordered in decreasing similarity. For example, for the
query (Term 2 OR Term 3), we have R1 = 0.8, R2 = 0.7, R3 = 0.6, R7 = 0.5, R9 = 0.7; therefore the output order
is R1, R2, R9, R3, and R7.
For an AND query, the lower weight between the common records matching the query terms is used as the
similarity between the query and the documents. For example, for the query (Term 2 AND Term 3), we
have R2 = 0.4, R9 = 0.5. Therefore the output is R9 and R2.
For a NOT query, the similarity between the query and the documents is the difference between the
common entries in the inverted file. For example, for the query (Term 2 AND NOT Term 3), we have R2 = 0.3,
R3 = 0.6, R7 = 0.5, R9 = 0; therefore the output is R3, R7, and R2.
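The three weighted rules (maximum for OR, minimum for AND, difference for NOT, with negative results dropped) can be sketched over the example weighted inverted file as follows:

```python
# Weighted inverted file from the example: term -> {record: weight}.
weights = {
    "term1": {"R1": 0.3, "R3": 0.5, "R6": 0.8, "R7": 0.2, "R11": 1.0},
    "term2": {"R2": 0.7, "R3": 0.6, "R7": 0.5, "R9": 0.5},
    "term3": {"R1": 0.8, "R2": 0.4, "R9": 0.7},
}

def ranked(sims):
    # Drop zero or negative similarities, sort in decreasing order.
    return sorted((r for r in sims if sims[r] > 0), key=lambda r: -sims[r])

def weighted_or(t1, t2):
    a, b = weights[t1], weights[t2]
    return ranked({r: max(a.get(r, 0), b.get(r, 0)) for r in a.keys() | b.keys()})

def weighted_and(t1, t2):
    a, b = weights[t1], weights[t2]
    return ranked({r: min(a[r], b[r]) for r in a.keys() & b.keys()})

def weighted_and_not(t1, t2):
    a, b = weights[t1], weights[t2]
    return ranked({r: a[r] - b.get(r, 0) for r in a})
```

On the example file, (Term 2 AND Term 3) yields R9 then R2, and (Term 2 AND NOT Term 3) yields R3, R7, R2, matching the results worked out above.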
We discussed how the use of term weights can help rank the returned list. Ranked return is very important
because if the first few items are most similar or relevant to the query, they are normally the most useful to the
user. The user can just look at the first few items without going through a long list of items. Now let us look at
how to determine term weights for different index terms.
The assignment of index terms to documents and queries is carried out in the hope of distinguishing
documents that are relevant for information users from other documents.
In a particular document, the more often a term appears, the more important the term and the higher the
term weight should be.
In the context of an entire document collection, if a term appears in almost all documents, the term is not a
good index term, because it does not help in distinguishing relevant documents from others.
Therefore good index terms are those that appear often in a few documents but do not appear in other
documents. Term weight should be assigned taking into account both term frequency (tf) and document
frequency (df). The commonly used formula to calculate term weight is

Wij = tfij × log(N / dfj)

where Wij is the weight of term j in document i, tfij is the frequency of term j in document i, N is the total
number of documents in the collection, and dfj is the number of documents containing term j. The weight is
proportional to the term frequency and to the inverse document frequency, so the formula is commonly called
tf.idf.
Based on the above formula, if a term occurs in all documents in the collection (dfj = N), the weight of the
term is zero (i.e., the term should not be used as an index term because it cannot differentiate
documents). On the other hand, if a term appears often in only a few documents, the weight of the term is high
(i.e., it is a good index term).
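The tf.idf formula above is a one-liner in practice; this sketch shows the two limiting behaviours just described:

```python
import math

def tf_idf(tf, df, n_docs):
    """Wij = tfij * log(N / dfj): proportional to the term frequency
    and to the inverse document frequency. A term occurring in every
    document (df == N) gets weight zero; a term concentrated in a few
    documents gets a high weight."""
    return tf * math.log(n_docs / df)
```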
S(Di, Qj) = Σ (k = 1 to N) Tik · Qjk
To compensate for differences in document sizes and query sizes, the above similarity can be normalized as
follows:
S(Di, Qj) = [ Σ (k = 1 to N) Tik · Qjk ] / sqrt( Σ (k = 1 to N) Tik² · Σ (k = 1 to N) Qjk² )
This is the well-known cosine coefficient between vectors Di and Qj. During retrieval, a ranked list in
descending order of similarity is returned to the user. For example, if four documents and a query are represented as
the following vectors:
D1 = [0.2, 0.1, 0.4, 0.5]
D2 = [0.5, 0.6, 0.3, 0]
D3 = [0.4, 0.5, 0.8, 0.3]
D4 = [0.1, 0, 0.7, 0.8]
Q = [0.5, 0.5, 0, 0]
then the similarities between the query and each of the documents are as follows:
S(D1, Q) = 0.31
S(D2, Q) = 0.93
S(D3, Q) = 0.60
S(D4, Q) = 0.07
The system will return the documents in the order D2, D3, D1, and D4.
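The example can be reproduced directly from the cosine formula; the following sketch computes the similarities and the ranked list for the four document vectors above:

```python
import math

def cosine(d, q):
    """Cosine coefficient: dot product normalized by vector lengths."""
    dot = sum(x * y for x, y in zip(d, q))
    norm = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in q))
    return dot / norm

docs = {
    "D1": [0.2, 0.1, 0.4, 0.5],
    "D2": [0.5, 0.6, 0.3, 0.0],
    "D3": [0.4, 0.5, 0.8, 0.3],
    "D4": [0.1, 0.0, 0.7, 0.8],
}
query = [0.5, 0.5, 0.0, 0.0]

# Ranked list in descending order of similarity.
ranking = sorted(docs, key=lambda name: -cosine(docs[name], query))
```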
The main limitation of the vector space model is that it treats terms as unrelated, and it works well only with
short documents and queries.
Query Modification
Query modification based on user relevance feedback uses the following rules:
Terms occurring in documents previously identified as relevant are added to the original query, or the
weight of such terms is increased.
Terms occurring in documents previously identified as irrelevant are deleted from the query, or the weight
of such terms is reduced.
The new query is submitted again to retrieve documents. The above rules are expressed as follows:
Q(i+1) = Q(i) + Σ (Di ∈ Rel) Di − Σ (Di ∈ Nonrel) Di

where Q(i+1) is the new query, Q(i) is the current query, Di ranges over the collection of documents retrieved in
response to Q(i), the first summation is over the documents judged relevant, and the second summation is over
the documents judged nonrelevant.
Experiments show that performance is improved by using this technique. The principle behind this
approach is to find documents similar to the ones already judged relevant to the query, based on the observation
that documents relevant to the same query tend to resemble one another.
Document Modification
In query modification based on user relevance feedback, queries are modified using the terms in the relevant
documents; other users do not benefit from this modification. In document modification based on user
relevance feedback, document index terms are modified using query terms, so the changes made affect other
users. Document modification uses the following rules based on relevance feedback:
Terms in the query, but not in the user-judged relevant documents, are added to the document index list
with an initial weight.
Weights of index terms in the query and also in relevant documents are increased by a certain amount.
Weights of index terms not in the query but in the relevant documents are decreased by a certain amount.
When subsequent queries similar to the queries used to modify the documents are issued, performance is
improved. But this approach may decrease the effectiveness if the subsequent queries are very different from
those used to modify the documents.
Those who are interested in nontraditional IR using NLP, domain knowledge, and user profiles are referred
to references [3-5, 13-16].
Figure 4.2 Recall-precision graphs. A system with the graph further from the origin has higher
performance. So the system with graph B is better than the system with graph A.
Suppose a database has 1,000 information items in total, out of which 10 are relevant to a particular query. The
system returned the following list in response to the query:
R, R, I, I, R, R, I, I, R, I, R, R, I, I, R
where the Rs denote items relevant to the query judged by the user and the Is denote items judged irrelevant by
the user. Note that all items returned are deemed relevant by the system, but only some of them are actually
relevant to the query as judged by the user.
Recall-precision pairs are calculated by considering the different number of items returned as shown in
Table 4.2.
We see that the more items returned, the higher the recall and the lower the precision. In performance
evaluation, recall-precision pairs are calculated at fixed intervals of recall; for instance, precision is calculated
when the recall value is 0.1, 0.2, 0.3, ..., 0.9, 1.0. Experiments with many queries should be carried out. The
precision values at the same recall value are then averaged to obtain a set of average recall-precision pairs for
the system. At a fixed recall value, the higher the precision, the higher the system performance.
The index affects both recall and precision as well as system efficiency. If the index does not capture all the
information about items, the system is not able to find all the items relevant to queries, leading to a lower recall.
If the index is not precise, some irrelevant items are retrieved by the system, leading to a lower precision.
Similarity measurement is extremely important and should conform to human judgment. Otherwise, the
precision of the system will be low. As shown in the above example, when some returned items are judged not
relevant to the query by the user, the retrieval precision is decreased.
Table 4.2
Recall and Precision Calculation

Recall    Precision
1/10      1/1
2/10      2/2
2/10      2/3
2/10      2/4
3/10      3/5
4/10      4/6
4/10      4/7
4/10      4/8
5/10      5/9
5/10      5/10
6/10      6/11
7/10      7/12
7/10      7/13
7/10      7/14
8/10      8/15
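The table can be generated mechanically from the returned list. The following sketch computes a recall-precision pair after each returned item:

```python
def recall_precision(returned, total_relevant):
    """After each returned item, recall = relevant retrieved so far /
    all relevant items in the collection, and precision = relevant
    retrieved so far / items returned so far."""
    pairs, hits = [], 0
    for i, judgment in enumerate(returned, start=1):
        if judgment == "R":
            hits += 1
        pairs.append((hits / total_relevant, hits / i))
    return pairs

# The returned list from the example: R = relevant, I = irrelevant.
pairs = recall_precision(list("RRIIRRIIRIRRIIR"), total_relevant=10)
```

The first pair is (1/10, 1/1) and the last is (8/10, 8/15), matching Table 4.2.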
organizing information that allows non-sequential access. A hypertext document is made up of a number of
nodes and links. A node usually represents a single concept or idea. It is a container of information. Links
connect related nodes. An area within the content of a node indicating the existence of a link is called an anchor.
Anchors are normally highlighted in a special way (e.g., underlined or color shaded) or represented by a special
symbol. Selecting the anchor activates the link and brings up the destination node. Note that many people do not
distinguish between links and anchors. To them a highlighted area in a hypertext document is a link. Figure 4.3
shows an example hypertext document. It shows three of many nodes of a document about Monash University.
Initially, the first node with general information on Monash is shown. The underlined words are anchors
indicating that there are links leading to more information about the underlined items. If the user wants to find
more information on any underlined item, he or she can simply select the link associated with the anchor. For
example, if he or she selects campuses hoping to find more about these campuses, the system brings up node 2
with information on the campuses. Again there are a number of anchors that the reader can select. For example,
if the user selects Gippsland, node 3 is brought up. We see that with hypertext it is easy to retrieve related
information.
To summarize, hypertext is an approach to information management in which data is stored in a network of
nodes connected by computer-supported links. The modifier computer supported is very important. In
traditional printed text, we can think of footnotes and references as links. These links are not computer
supported and information pointed to by these links cannot be retrieved quickly. Thus hypertext capability is
achieved through use of the storage, fast-searching, and fast-retrieval abilities of computers.
[Figure: WWW client-server architecture. The client communicates with the server using HTTP; the server invokes application programs through CGI.]
The URL has three parts. The first part specifies the Internet protocol used to access the document. The
protocols that can be used include ftp, http, gopher, and telnet.
The second part of the URL identifies the name of the document server, such as
www-gscit.fcit.monash.edu.au. This is the standard Internet domain specification. The example server
name means that the server is called www-gscit, which is in domain fcit (Faculty of Computing and
Information Technology) of monash (Monash University) of edu (education sector) of au (Australia).
Each server name has a corresponding Internet Protocol (IP) address, so if the IP address is known we can use it
directly instead of the machine name string. If the server runs on a nondefault port (for the WWW, the default
port is 80), we need to specify the port being used. The final part of the URL represents the file name of the
document to be retrieved. The file name must be complete, including the full path name.
The following are two example URLs:
http://www-gscit.fcit.monash.edu.au/gindex.html
ftp://ftp.monash.edu.au/pub/internet/readme.txt
The first URL refers to a document called gindex.html in the default directory of the server
www-gscit.fcit.monash.edu.au, accessible using HTTP. The second URL refers to a file called readme.txt in the
directory /pub/internet of the server ftp.monash.edu.au, accessible using ftp.
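The three parts of a URL can be pulled apart with Python's standard urlparse; the nondefault port 8080 in this example is illustrative, not from the text:

```python
from urllib.parse import urlparse

# Hypothetical URL on a nondefault port (8080 is illustrative).
url = "http://www-gscit.fcit.monash.edu.au:8080/gindex.html"
parts = urlparse(url)

protocol = parts.scheme   # first part: the access protocol ("http")
server = parts.hostname   # second part: the document server name
port = parts.port         # the nondefault port, if one is given
filename = parts.path     # final part: full path name of the document
```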
Internet documents are uniquely specified with URLs. Now, how can we know that the file we want
exists, and what is its corresponding URL?
There are two general ways to find and retrieve documents on the Internet: organizing/browsing and
searching. Organizing refers to the human-guided process of deciding how to interrelate information, usually by
placing documents into some sort of a hierarchy. For example, documents on the Internet can be classified
according to their subject areas. One subject area may have multiple levels of subareas. Browsing refers to the
corresponding human-guided activity of exploring the organization and contents of a resource space or to the
human activity of following links or URLs to see what is around there. Searching is a process where the user
provides some description of the resources being sought, and a discovery system locates information that
matches the description, usually using the IR techniques discussed in the previous section.
Browsing is a slow process for finding information. It depends heavily on the quality of the information's
organization. It may be difficult to find all relevant information, and users can become disoriented and lost in the
information space.
Searching is more efficient than browsing, but it relies on the assumption that information is indexed.
There are currently many servers on the Internet that provide searching facilities. These servers use a program
called a crawler to visit major information servers around the world and index the information available on
them. The indexing technique used is similar to that used in IR, discussed earlier in this chapter; in this case
the document identifiers are the URLs of the documents. These searching servers also rely on document
creators to inform them about the contents and URLs of documents they have created. In effect, these searching
servers provide pointers to documents on the Internet.
In practice both browsing and searching are used in information discovery. The user may first browse
around to find a suitable search engine to use. Then he or she issues a query to the server. There may be many
documents returned in response to each query. These documents are normally ranked according to the similarity
between the query and documents. The user has to determine which documents are useful by browsing.
Resource discovery on the Internet is an extended case of IR. In this case, documents are distributed across
many servers on the Internet, making information organization, indexing, and retrieval more challenging. In the
following, we describe the main differences between WWW search engines and IR systems, their implications
to WWW search engine design, and an example WWW search engine.
5. WWW search engines are used by more users and more frequently than IR systems.
We now discuss these differences and their implications for WWW search engine design.
servers.
they all use term frequency and term location in one way or another.
All search engines have the basic parts described above, but there are differences in how these parts are
tuned. That is why the same search on different search engines often produces different results.
PR(A) is calculated using a simple iterative algorithm. It was reported that PageRanks for 26 million Web
pages can be computed in a few hours on a medium-size workstation [12]. PageRank is a good way to prioritize
the results of Web keyword searches, leading to higher retrieval precision.
The second main feature of Google is that it considers anchor text in both source and destination pages. As
anchor text is part of the source document, most search engines consider it implicitly. In Google, anchor text has
more importance (higher weight) than normal text. In addition, it has a significant weight in the destination
document. This approach has a number of advantages. First, anchor text is normally made up of important terms
or concepts, otherwise the author would not bother to add a link to the text for further information. Second,
anchor text normally provides a good description of the destination document. Third, this approach makes it
possible to return Web pages that have not been crawled.
4.10.5.4 Searching
When the Google Web server receives a user's query, it is passed to the searcher. The searcher parses the query
and converts it into WordIDs.
In the forward and inverted index files, many types of information about each word are stored. The
information includes the position in a document and the type of text (title, anchor, URL, font size, etc.). Different
types of text have different weights. Word proximity is computed based on word positions: the closer the words,
the higher the weight. An initial score is then calculated for each document based on the types of text, the
occurrences of each type, and word proximity. The final score is obtained by combining the initial score with the
PageRank. The returned list is displayed in descending order of the final score.
4.11 SUMMARY
The main design issue of IR systems is how to represent documents and queries and then how to compare the
similarity between the document and query representations. We have described a number of common techniques
(called retrieval models) that address this issue.
The retrieval effectiveness of IR systems is yet to be improved. A promising approach is to use domain
knowledge and NLP to automatically understand documents and queries. This is not surprising considering how
human beings determine if a document is relevant or useful.
WWW search engines are similar to IR systems but present many other challenging issues. Current WWW
search engines search text-only (HTML) documents. It is expected that future search engines will be able to
search various media including text, image, video, and audio by integrating IR techniques with the content-based multimedia retrieval techniques discussed in the following chapters.