
UNIT-II (Part-I)

Information Retrieval
Introduction:
Information retrieval (IR) may be defined as a software system that deals with the
organization, storage, retrieval and evaluation of information, particularly textual
information, from document repositories. The system assists users in finding the information
they require, but it does not explicitly return the answers to their questions; it indicates the
existence and location of documents that might contain the required information. The
documents that satisfy the user's requirement are called relevant documents. A perfect IR
system would retrieve only relevant documents.
Classical Problem in Information Retrieval (IR) Systems
The main goal of IR research is to develop a model for retrieving information from
repositories of documents. Here, we discuss a classical problem related to IR systems, known
as the ad-hoc retrieval problem.
In ad-hoc retrieval, the user enters a query in natural language that describes the required
information, and the IR system returns the documents related to the desired information. For
example, when we search for something on the Internet, the results include some pages that
are exactly relevant to our requirement, but there may be some non-relevant pages too. This
is the ad-hoc retrieval problem.
Aspects of Ad-hoc Retrieval
Following are some aspects of ad-hoc retrieval that are addressed in IR research −
 How can users, with the help of relevance feedback, improve the original formulation
of a query?
 How can database merging be implemented, i.e., how can results from different text
databases be merged into one result set?
 How should partly corrupted data be handled? Which models are appropriate for this?

Design Features of Information Retrieval Systems:


With the help of the following diagram, we can understand the process of information
retrieval (IR) −
It is clear from the diagram that a user who needs information has to formulate a request in
the form of a query in natural language. The IR system then responds by retrieving the
relevant output, in the form of documents, about the required information.
Indexing: (Inverted Index)
The primary data structure of most IR systems is the index. We can define an index as a data
structure that lists, for every word, all documents that contain it and the frequency of its
occurrences in each document. This makes it easy to search for 'hits' of a query word.
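
A minimal sketch of such an index in Python, assuming lowercased whitespace tokenization
and two toy documents:

from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the documents that contain it, with occurrence counts."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return index

docs = {1: "information retrieval systems", 2: "retrieval of textual information"}
index = build_inverted_index(docs)
print(index["retrieval"])   # {1: 1, 2: 1} -- 'retrieval' occurs once in each document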

Stop Word Elimination:


Stop words are high-frequency words that are deemed unlikely to be useful for searching;
they carry little semantic weight. All such words are kept in a list called a stop list. For
example, articles such as "a", "an", "the" and prepositions such as "in", "of", "for", "at" are
stop words. The size of the inverted index can be significantly reduced by a stop list. As per
Zipf's law, a stop list covering a few dozen words reduces the size of an inverted index by
almost half. On the other hand, eliminating stop words can sometimes remove a term that is
useful for searching. For example, if we eliminate the word "A" from "Vitamin A", the
remaining query loses its significance.
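
A minimal sketch of stop word elimination; the stop list here is a small illustrative set, while
practical systems use larger, collection-specific lists:

# A tiny illustrative stop list; real systems use much larger ones.
STOP_LIST = {"a", "an", "the", "in", "of", "for", "at"}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_LIST]

print(remove_stop_words("the retrieval of documents in a collection".split()))
# ['retrieval', 'documents', 'collection']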
Stemming
Stemming, a simplified form of morphological analysis, is the heuristic process of extracting
the base form of a word by chopping off its ends. For example, the words laughing, laughs
and laughed would all be stemmed to the root word laugh.
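
A deliberately crude suffix-chopping stemmer as a sketch of the idea; production stemmers
such as Porter's algorithm apply many more rules and conditions:

def crude_stem(word, suffixes=("ing", "ed", "s")):
    """Chop off the first matching suffix, keeping a minimal stem length."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

for w in ["laughing", "laughs", "laughed"]:
    print(w, "->", crude_stem(w))   # all three stem to 'laugh'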
Zipf's law:
Zipf's law states that the frequency of a token in a text is inversely proportional to its rank in
the frequency-sorted list, i.e., frequency × rank ≈ constant. This law describes how tokens are
distributed in languages: a few tokens occur very frequently, some occur with intermediate
frequency, and many tokens occur rarely.
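
This can be checked empirically by counting token frequencies and verifying that
frequency × rank stays roughly constant; the file name below is a placeholder for any large
plain-text corpus:

from collections import Counter

# "corpus.txt" is a placeholder for any large plain-text file.
tokens = open("corpus.txt").read().lower().split()
ranked = Counter(tokens).most_common()

# Under Zipf's law, freq * rank stays roughly constant across ranks.
for rank, (token, freq) in enumerate(ranked[:10], start=1):
    print(rank, token, freq, freq * rank)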

Information Retrieval (IR) Model


Mathematically, models are used in many scientific areas with the objective of understanding
some phenomenon in the real world. A model of information retrieval predicts and explains
what a user will find relevant given a query. An IR model is basically a pattern that defines
the above-mentioned aspects of the retrieval procedure and consists of the following −
 A model for documents.
 A model for queries.
 A matching function that compares queries to documents.
Mathematically, a retrieval model consists of −
D − Representation for documents.
Q − Representation for queries.
F − The modelling framework for D and Q, along with the relationship between them.
R(q, di) − A similarity function which orders the documents with respect to the query. It is
also called the ranking function.

Types of Information Retrieval (IR) Model


An information retrieval (IR) model can be classified into the following three types −
Classical IR Model
It is the simplest and easiest IR model to implement. This model is based on mathematical
knowledge that is easily recognized and understood. Boolean, Vector and Probabilistic are
the three classical IR models.
Non-Classical IR Model
It is completely opposite to the classical IR model. Such IR models are based on principles
other than similarity, probability and Boolean operations. The information logic model, the
situation theory model and interaction models are examples of non-classical IR models.
Alternative IR Model
It is an enhancement of the classical IR model that makes use of specific techniques from
other fields. The cluster model, the fuzzy model and latent semantic indexing (LSI) models
are examples of alternative IR models.
Classical IR Models:

The Boolean Model:


It is the oldest information retrieval (IR) model. The model is based on set theory and
Boolean algebra, where documents are sets of terms and queries are Boolean expressions
over terms. The Boolean model can be defined as −
 D − A set of words, i.e., the indexing terms present in a document. Here, each term is
either present (1) or absent (0).
 Q − A Boolean expression, where the terms are index terms and the operators are
logical products (AND), logical sums (OR) and logical differences (NOT).
 F − Boolean algebra over sets of terms as well as over sets of documents.
If we talk about relevance feedback, then in the Boolean IR model the relevance prediction
can be defined as follows −
 R − A document is predicted as relevant to a query expression if and only if it
satisfies the query expression, e.g. −
((text ∨ information) ∧ retrieval ∧ ¬theory)
We can explain this model by viewing a query term as an unambiguous definition of a set of
documents.
For example, the query term “economic” defines the set of documents that are indexed with
the term “economic”.
Now, what would be the result of combining terms with the Boolean AND operator? It
defines a document set that is smaller than or equal to the document set of any of the single
terms. For example, a query with the terms "social" and "economic" produces the set of
documents that are indexed with both terms − in other words, the intersection of both sets.
Now, what would be the result of combining terms with the Boolean OR operator? It defines
a document set that is bigger than or equal to the document set of any of the single terms.
For example, a query with the terms "social" or "economic" produces the set of documents
that are indexed with either the term "social" or the term "economic" − in other words, the
union of both sets.
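
Both operators reduce to simple set operations over each term's postings, as the following
sketch with toy postings shows:

# Toy postings: each term maps to the set of documents indexed with it.
postings = {
    "social":   {1, 3, 5},
    "economic": {2, 3, 5, 7},
}

print(postings["social"] & postings["economic"])   # AND -> intersection: {3, 5}
print(postings["social"] | postings["economic"])   # OR  -> union: {1, 2, 3, 5, 7}
print(postings["economic"] - postings["social"])   # AND NOT -> difference: {2, 7}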
Advantages of the Boolean Model:
The advantages of the Boolean model are as follows −
 It is the simplest model, based on sets.
 It is easy to understand and implement.
 It retrieves only exact matches, so results are predictable.
 It gives the user a sense of control over the system.
Disadvantages of the Boolean Model:
The disadvantages of the Boolean model are as follows −
 The model's similarity function is Boolean, hence there are no partial matches. This
can be annoying for users.
 In this model, Boolean operator usage has much more influence than a critical word.
 The query language is expressive, but it is complicated too.
 No ranking is provided for retrieved documents.
Example of Information Retrieval in Boolean Model

 For example, the term Peanut Butter individually (or Jelly individually) defines the set of
all documents indexed with the term Peanut Butter (or Jelly) alone.
 If the information need is Peanut Butter AND Jelly, the query with the keywords Peanut
Butter AND Jelly returns the set of documents that contain both words.
 Using OR, the search returns documents containing Peanut Butter, documents containing
Jelly, or documents containing both.

Probabilistic Model:

Probabilistic models provide the foundation for reasoning under uncertainty in the realm
of information retrieval.

Let us understand why there is uncertainty while retrieving documents and the basis for
probability models in information retrieval.

Uncertainty in retrieval models: The probabilistic models in information retrieval are built
on the idea that the process of retrieval is inherently uncertain from multiple standpoints:

 There is uncertainty in the understanding of the user's information need − we cannot be
sure that the user correctly mapped their need into the query they presented.
 Even if the query represents the need well, there is uncertainty in the estimation of
document relevance for the query, stemming either from the choice of document
representation or from the matching of queries and documents.

Basis of the probabilistic retrieval model: The probabilistic model is based on the Probability
Ranking Principle, which states that an information retrieval system is supposed to rank the
documents based on their probability of relevance to the query, given all the other pieces of
evidence available.

 Probabilistic information retrieval models estimate how likely it is that a document is
relevant for a query.
 A variety of sources of evidence may be used by probabilistic retrieval methods; the most
common one is the statistical distribution of the terms in both the relevant and the
non-relevant documents.
 Probabilistic information retrieval models are also among the oldest, best performing and
most widely used IR models.
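
As one concrete, widely used instance of this family, a minimal BM25 scorer is sketched
below; BM25 itself is not derived in this text, and the parameter values and toy data are
illustrative assumptions:

import math

def bm25_score(query_terms, doc_tokens, df, N, avgdl, k1=1.5, b=0.75):
    """Score one document for a query with the BM25 ranking function."""
    score, dl = 0.0, len(doc_tokens)
    for term in query_terms:
        tf = doc_tokens.count(term)
        if tf == 0 or term not in df:
            continue  # term absent from the document or from the collection statistics
        idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return score

# Toy usage: a 3-document collection, average length 4 tokens,
# where 'retrieval' occurs in 2 of the 3 documents.
print(bm25_score(["retrieval"], "models for information retrieval".split(),
                 df={"retrieval": 2}, N=3, avgdl=4))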

The Vector Space Model: Also called the term vector model, the vector space model is
an algebraic model for representing text documents (and, more generally, many kinds of
multimedia objects) as vectors of identifiers such as index terms.

The vector space model is based on the notion of similarity between the documents and the
query prepared by the user, which should be similar to the documents needed for information
retrieval.

We can represent both documents and queries as t-dimensional vectors:

 dj = (w1,j, w2,j, ..., wt,j) for a document, and
 q = (w1,q, w2,q, ..., wt,q) for a query.
 Definition of terms: Each dimension of the document and query representations
corresponds to a separate term, where a term can be a single word, a keyword, or a longer
phrase.
o If the words of the documents are chosen as the terms in the above
representation, the dimensionality of the vectors is the number of distinct
words occurring in the vocabulary of the corpus.
 Computing the values of the terms: The value of a vector component is non-zero if the
corresponding term occurs in the document. Many different schemes can be used, for
example raw occurrence counts or normalized counts, but the most popular and effective
representation is tf-idf term weighting.
 The vector space model thus represents documents and queries as vectors in
a multidimensional space whose dimensions are the terms, which are further used to build
an index over the documents.

Notion of Similarity in Vector Space Model

The assumptions of document similarity theory are used to compute the relevancy rankings
of documents against the keywords of the search in vector space models.

 Angle of deviation between query and document: One way is to compare the deviation of
the angle between each document vector and the original query vector, where the query is
represented as a vector of the same kind as the documents.
 Cosine similarity as the metric: The most popular and easiest method in practice is to
calculate the cosine of the angle between the vectors − a cosine value of zero means that
the query and document vectors are orthogonal and have no match at all.
o A zero cosine similarity value implies that none of the terms of the query exist in the
document under consideration.
 Ranking the results using a similarity metric: The degree of similarity between the
representation of the query and the representations of the documents in the collection is
used to rank the search results.

Another way of looking at the similarity criterion in the vector space model is that the more
the two representations − the search documents and the user-prepared query − agree in their
elements and the distribution of those elements, the higher the probability that they represent
similar information.
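
A minimal sketch of cosine similarity over raw term-frequency vectors (toy texts assumed; a
real system would use tf-idf weights instead of raw counts):

import math
from collections import Counter

def cosine_similarity(doc_text, query_text):
    """Cosine of the angle between raw term-frequency vectors of two texts."""
    d, q = Counter(doc_text.lower().split()), Counter(query_text.lower().split())
    dot = sum(d[t] * q[t] for t in q)
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

print(cosine_similarity("information retrieval and ranking", "ranking retrieval"))
# ~0.707: the document shares both of the query's terms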

Index Creation for Terms in the Vector Space Model

The creation of the indices for the vector space model involves lexical scanning,
morphological analysis, and term value computation.

 Lexical scanning breaks each document into individual words to identify the significant
terms; morphological analysis reduces different word forms to common stems; and the
values of the terms are then computed on the basis of the stemmed words.
 The terms of the query are also weighted to take their importance into account; the
weights are computed using the statistical distributions of the terms in the collection and in
the documents.
 The vector space model assigns a high ranking score to a document that contains only a
few of the query terms if those terms occur infrequently in the collection as a whole but
frequently in the document.

Assumptions of the Vector Space Model

 The more similar a document vector is to a query vector, the more likely it is that the
document is relevant to that query.
 The words used to define the dimensions of the space are orthogonal, i.e., independent.
 The similarity assumption is a reasonable approximation, whereas the assumption that
words are pairwise independent does not hold in realistic scenarios.

Disadvantages of Vector Space Model

 Long documents are poorly represented because they have poor similarity values due to a
small scalar product relative to the large dimensionality of the model.
 Search keywords must precisely match document terms; matching on word substrings
might result in false positive matches.
 Semantic sensitivity: documents with similar context but different term vocabulary are not
associated, resulting in false negative matches.
 The order in which the terms appear in the document is lost in the vector space
representation.
 Weighting is intuitive but not represented formally in the model.
 Issues with implementation: because the similarity metric requires the values of all vector
components to be computed and stored, incremental updates of the index are problematic.
 Adding a single new document changes the document frequencies of the terms that occur
in it, which changes the vector lengths of every document that contains one or more of
these terms.

Term weighting:
Term weighting refers to the weights assigned to the terms in the vector space. The higher the
weight of a term, the greater the impact of the term on the cosine similarity. More weight
should be assigned to the more important terms in the model.
An effective method is to use term frequency (tfij), document frequency (dfi) and collection
frequency (cfi) together.
Term Frequency (tfij)
It may be defined as the number of occurrences of wi in dj. The information captured by term
frequency is how salient a word is within the given document; in other words, the higher the
term frequency, the better that word describes the content of the document.
Document Frequency (dfi)
It may be defined as the total number of documents in the collection in which wi occurs. It is
an indicator of informativeness: semantically focused words occur repeatedly in a document,
unlike semantically unfocused words.
Collection Frequency (cfi)
It may be defined as the total number of occurrences of wi in the collection.
Mathematically,
dfi ≤ cfi   and   Σj tfij = cfi
Forms of Document Frequency Weighting
Let us now learn about the different forms of document frequency weighting. The forms are
described below −
Term Frequency Factor
This means that if a term t appears often in a document, then a query containing t should
retrieve that document. We can combine a word's term frequency (tfij) and document
frequency (dfi) into a single weight as follows −
weight(i, j) = (1 + log(tfij)) · log(N / dfi)   if tfij ≥ 1
weight(i, j) = 0                                if tfij = 0
Here N is the total number of documents.
Inverse Document Frequency (idf)
This is another form of document frequency weighting, often called idf or inverse document
frequency weighting. The key idea of idf weighting is that a term's scarcity across the
collection is a measure of its importance, and importance is inversely proportional to
frequency of occurrence.
Mathematically,
idft = log(1 + N / nt)
idft = log((N − nt) / nt)
Here,
N = the total number of documents in the collection
nt = the number of documents containing term t
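
A minimal sketch of the combined weight defined above (toy numbers assumed; math.log is
the natural logarithm, and any consistent base works):

import math

def tfidf_weight(tf, df, N):
    """weight(i, j) = (1 + log tf) * log(N / df), and 0 when tf = 0."""
    if tf == 0:
        return 0.0
    return (1 + math.log(tf)) * math.log(N / df)

# A term occurring 3 times in a document and in 10 of 1000 documents:
print(round(tfidf_weight(tf=3, df=10, N=1000), 2))   # 9.66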
Non-Classical IR Model:
They differ from the classical models in that they are built upon propositional logic.
Examples of non-classical IR models include the information logic, situation theory, and
interaction models.
Alternative IR Model:
These take the principles of the classical IR models and enhance them to create more
functional models, such as the cluster model, alternative set-theoretic models like the fuzzy
set model, the latent semantic indexing (LSI) model, and alternative algebraic models like
the generalized vector space model.
Information System Evaluation:
The creation of the annual Text REtrieval Conference (TREC), sponsored by the
Defense Advanced Research Projects Agency (DARPA) and the National Institute of
Standards and Technology (NIST), changed the standard process of evaluating
information systems. The conference provides a standard database, consisting of
gigabytes of test data, search statements and the expected results from the searches, to
academic researchers and commercial companies for testing of their systems. This has
established a standard baseline for comparing algorithms.

In recent years, the evaluation of information retrieval systems and of techniques for
indexing, sorting, searching and retrieving information has become increasingly important.
There are many reasons to evaluate the effectiveness of an Information Retrieval
System:
 To aid in the selection of a system to procure
 To monitor and evaluate system effectiveness
 To evaluate query generation process for improvements
 To provide inputs to cost-benefit analysis of an information system
 To determine the effects of changes made to an existing information system.
Measures Used in System Evaluations

Measurements can be made from two perspectives: the user perspective and the system
perspective. Techniques for collecting measurements can also be objective or subjective. An
objective measure is one that is well defined and based upon numeric values derived from
the system's operation. A subjective measure can produce a number, but it is based upon an
individual user's judgments.

Measurements with automatic indexing of items arriving at a system are derived from the
standard performance monitoring associated with any computer program (e.g., resources
used, such as memory and processing cycles) and the time to process an item from arrival to
availability to the search process. When manual indexing is required, the measures are
associated with the indexing process instead. Response time is a metric frequently collected
to determine the efficiency of search execution; it is defined as the time it takes to execute
the search.

In addition to efficiency of the search process, the quality of the search results is also
measured by precision and recall.
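
For reference, these two measures are standardly defined as:

Precision = number of relevant items retrieved / total number of items retrieved
Recall = number of relevant items retrieved / total number of relevant items in the collection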

Another measure, directly related to retrieving non-relevant items, can be used to define how
effectively an information system is operating. This measure is called fallout and is defined
as the fraction of the non-relevant items in the collection that are retrieved:

Fallout = number of non-relevant items retrieved / total number of non-relevant items in the
collection

Other measures of search capabilities have also been proposed. A newer measure that
provides additional insight when comparing systems or algorithms is the "Unique Relevance
Recall" (URR) metric. URR is used to compare two or more algorithms or systems; it
measures the number of relevant items retrieved by one algorithm that are not retrieved by
the others.
Other measures have been proposed for judging the results of searches:
Novelty Ratio: the ratio of relevant items retrieved that were not previously known to the
user to the total relevant items retrieved.
Coverage Ratio: the ratio of relevant items retrieved to the total relevant items known to the
user before the search.
Sought Recall: the ratio of the total relevant items reviewed by the user after the search to the
total relevant items the user would have liked to examine.
Measurement Example: TREC Results
Until the creation of the Text REtrieval Conferences (TREC) by the Defense Advanced
Research Projects Agency (DARPA) and the National Institute of Standards and Technology
(NIST), experimentation in the area of information retrieval was constrained by the
researcher's ability to manually create a test database. One of the first test databases was
associated with the Cranfield I and II tests (Cleverdon-62, Cleverdon-66). It contained 1400
documents and 225 queries.

It became one of the standard test sets and has been used by a large number of researchers.
Other test collections have been created by Fox and Sparck Jones. There have been five
TREC conferences since 1992. TREC provides a set of training documents and a set of test
documents, each over 1 gigabyte in size. It also provides a set of training search topics
(along with relevance judgments from the database) and a set of test topics.

The researchers send the TREC sponsor a list of the top 200 items, in ranked order, that
satisfy the search statements. These lists are used in determining the items to be manually
reviewed for relevance and in calculating the results from each system. The search topics are
"user need" statements rather than specific queries. This allows maximum flexibility for each
researcher to translate the search statement into a query appropriate for their system, and it
assists in the determination of whether an item is relevant.
The search topics in the initial TREC consisted of a number, a domain (e.g., Science and
Technology), a title, a description of what constituted a relevant item, a narrative
natural-language text for the search, and concepts, which were specific search terms.

In addition to the search measurements, other standard information on system performance,
such as system timing, storage, and specific descriptions of the tests, is collected for each
system. This data is useful because the TREC objective is to support the migration of
techniques developed in a research environment into operational systems. TREC-5 was held
in November 1996. The results from each conference have varied based upon the
understanding gained from previous conferences and upon new objectives.

TREC-1 (1992) was constrained by researchers trying to get their systems to work with the
very large test databases. TREC-2, in August 1993, was the first real test of the algorithms; it
provided insights for the researchers into areas in which their systems needed work. The
search statements (user need statements) were very large and complex; they reflected
long-standing information needs rather than ad-hoc requests. By TREC-3, the participants
were experimenting with techniques for query expansion and with the importance of
constraining searches to passages within items versus the total item.

TREC-4 introduced significantly shorter queries (an average reduction from 119 terms in
TREC-3 to 16 terms in TREC-4) and introduced five new areas of testing called "tracks"
(Harman-96). The queries were shortened by dropping the title and the narrative field, which
had provided additional description of a relevant item. The multilingual track expanded
TREC-4 to test search on a Spanish test set of 200 megabytes of articles from the "El Norte"
newspaper.
