mod4 nlp

The document discusses Information Retrieval (IR), focusing on the organization, storage, and retrieval of information based on user queries. It outlines various models of IR, including Boolean, probabilistic, and vector models, detailing how documents and queries are represented and ranked. Additionally, it covers techniques like stop word elimination and stemming, as well as the significance of term weighting in enhancing retrieval effectiveness.

MODULE-4

BY MEGHA RANI R
• Information retrieval (IR) deals with the organization,
storage, retrieval, and evaluation of information relevant to a
user’s query.
• A user in need of information formulates a request in the
form of a query written in a natural language.

• The retrieval system responds by retrieving documents that seem relevant to the query.
• The process begins with the user’s information need; based on this need, the user formulates a query.
• The IR system returns documents that seem relevant to the query. This
is an engineering account of the IR system.
• The basic question involved is, ‘what constitutes the information in the
documents and the queries’.
• This, in turn is related to the problem of representation of documents
and queries.
• The retrieval is performed by matching the query representation with
document representation.
• The actual text of the document is not used in the retrieval process. Instead,
documents in a collection are frequently represented through a set of index
terms or keywords.

• Keywords can be single words or multi-word phrases. They might be extracted
automatically or manually (i.e., specified by a human). Such a representation
provides a logical view of the document. The process of transforming document
text into such a representation is known as indexing.
• There are different types of index structures. One data
structure commonly used by IR systems is the
inverted index.
• An inverted index is simply a list of keywords, with each
keyword carrying pointers to the documents containing
that keyword.
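As a sketch of this idea (the function name and the toy documents are illustrative, not from the text), an inverted index can be built in a few lines of Python:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each keyword to the set of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {
    1: "information retrieval systems",
    2: "retrieval of stored information",
    3: "natural language processing",
}
index = build_inverted_index(docs)
# index["retrieval"] -> {1, 2}; index["language"] -> {3}
```

Each posting set acts as the list of pointers described above; a real system would store positions and frequencies alongside the document IDs.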
• The computational cost involved in adopting a full text
logical view (i.e., using a full set of words to represent a
document) is high.
• Hence, some text operations are usually performed to
reduce the set of representative keywords.
The two most commonly used text operations are
1.stop word elimination
2.stemming.
• Stop word elimination removes grammatical or functional words, while
stemming reduces words to their common grammatical roots.
• Zipf’s law can be applied to further reduce the size of the index set.
• Not all the terms in a document are equally relevant.
• Some might be more important in conveying a document’s content.
• Attempts have been made to quantify the significance of index terms to a
document by assigning them numerical values, called weights.
• In a small collection of documents, an IR system can access a document to
decide its relevance to a query.
• However, in a large collection of documents, this technique poses practical
problems.

• Hence, a collection of raw documents is usually transformed into an easily
accessible representation. This process is known as indexing.

• Most indexing techniques involve identifying good document descriptors, such
as keywords or terms, which describe the information content of documents.

• A good descriptor is one that helps describe the content of the document and
discriminate it from other documents in the collection.




• The lexical processing of index terms involves elimination of stop words. Stop words are high frequency words which
have little semantic weight and are thus, unlikely to help in retrieval.
• These words play important grammatical roles in language, such as in the formation of phrases, but do not
contribute to the semantic content of a document in a keyword-based representation. Such words are commonly
used in documents, regardless of topics, and thus, have no topical specificity.
• Typical examples of stop words are articles and prepositions.
• Eliminating them considerably reduces the number of index terms. The drawback of eliminating stop words is that it
can sometimes result in the elimination of useful index terms, for instance the stop word A in Vitamin A.
• Some phrases, like to be or not to be, consist entirely of stop words.
• Eliminating stop words in such cases makes it impossible to search for these phrases correctly.
• Stemming normalizes morphological variants, though in a crude manner, by removing affixes from words to
reduce them to their stem, e.g., the words compute, computing, computes, and computer are all reduced to the
same word stem, comput.
• Thus, the keywords or terms used to represent text are stems, not the actual words.
• One of the most widely used stemming algorithms has been developed by Porter (1980). The stemmed
representation of the text, Design features of information retrieval systems, is
(design, featur, inform, retriev, system)
• One of the problems associated with stemming is that it may throw away useful distinctions. In some cases, it
usefully conflates similar terms, resulting in increased recall.
• In other cases, it may be harmful, resulting in reduced precision (e.g., when documents containing the term computation
are returned in response to the query phrase personal computer). Recall and precision are the two most commonly
used measures of the effectiveness of an information retrieval system.
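To illustrate these two text operations together, here is a deliberately crude suffix-stripping stemmer combined with stop word elimination. It is not the Porter algorithm, which applies staged rules with measure conditions, but on the example phrase above it happens to produce the same stems; the stop list and suffix list are invented for illustration:

```python
# A small, invented stop list; real systems use much longer ones.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "is", "for"}

# Longest-suffix-first stripping -- a crude stand-in for Porter (1980).
SUFFIXES = sorted(["ation", "ing", "al", "es", "ed", "s", "e"], key=len, reverse=True)

def crude_stem(word):
    """Strip the first matching suffix, keeping at least a 3-letter stem."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

def index_terms(text):
    """Stop word elimination followed by stemming."""
    return [crude_stem(w) for w in text.lower().split() if w not in STOP_WORDS]

print(index_terms("Design features of information retrieval systems"))
# -> ['design', 'featur', 'inform', 'retriev', 'system']
```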
• Zipf made an important observation on the distribution of words in natural
languages.
• This observation has been named Zipf’s law. Simply stated, Zipf’s law says that
the frequency of words multiplied by their ranks in a large corpus is more or less
constant.
• More formally, Frequency × rank ≈ constant.
• This means that if we compute the frequencies of the words in a corpus, and
arrange them in decreasing order of frequency, then the product of the
frequency of a word and its rank is approximately equal to the product of the
frequency and rank of another word.
• This indicates that the frequency of a word is inversely proportional to its rank.
• Empirical investigations of Zipf’s law on large corpora suggest that human languages contain a small
number of words that occur with high frequency and a large number of words that occur with low
frequency.
• In between is a moderate number of medium-frequency terms. This distribution has important
implications for IR.
• The high frequency words, being common, have less discriminating power, and thus, are not useful
for indexing. Low frequency words are less likely to be included in the query, and are also not useful
for indexing.
• As there are a large number of rare (low frequency) words, dropping them considerably reduces the
size of a list of index terms.
• The remaining medium frequency words are content-bearing terms and can be used for indexing.
• This can be implemented by defining thresholds for high and low frequency, and dropping words
that have frequencies above or below these thresholds. Stop word elimination can be thought of as
an implementation of Zipf’s law, where high frequency terms are dropped from a set of index terms
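The thresholding idea can be sketched as follows; the thresholds and token counts below are invented for illustration:

```python
from collections import Counter

def medium_frequency_terms(tokens, low, high):
    """Keep terms whose frequency lies between the thresholds, dropping
    very common (non-discriminating) and very rare terms."""
    freq = Counter(tokens)
    return {t for t, f in freq.items() if low <= f <= high}

# Toy corpus: "the" is a high-frequency stop-word-like term,
# "zipf" is a rare term; both fall outside the thresholds.
tokens = ["the"] * 50 + ["retrieval"] * 8 + ["indexing"] * 5 + ["zipf"]
print(medium_frequency_terms(tokens, low=2, high=20))
# -> {'retrieval', 'indexing'}
```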
• An IR model is a pattern that defines several aspects of the retrieval procedure, for
example,
 how documents and user's queries are represented,
 how a system retrieves relevant documents according to users' queries, and
 how retrieved documents are ranked.
• The IR system consists of a model for documents, a model for queries, and a matching
function which compares queries to documents.
• The central objective of the model is to retrieve all documents relevant to a query. This
defines the central task of an IR system.
• Several different IR models have been developed.
• These models differ in the way documents and queries are represented and retrieval is performed.
• Some of them consider documents as sets of terms and perform retrieval based merely on the presence or
absence of one or more query terms in the document.
• Others represent a document as a vector of term weights and perform retrieval based on the numeric score
assigned to each document, representing similarity between the query and the document.
• These models can be classified as follows:
• Classical models of IR
• Non-classical models of IR
• Alternative models of IR
• The three classical IR models — Boolean, vector, and probabilistic — are based on mathematical knowledge that is
easily recognized and well understood. These models are simple, efficient, and easy to implement.
Boolean model

• The Boolean model is the oldest of the three classical models.


• It is based on Boolean logic and classical set theory. In this model, documents are
represented as a set of keywords, usually stored in an inverted file.
• An inverted file is a list of keywords and identifiers of the documents in which they occur.
• Users are required to express their queries as a Boolean expression consisting of
keywords connected with Boolean logical operators (AND, OR, NOT).
• Retrieval is performed based on whether or not a document contains the query terms.
Example 9.1 Let the set of original documents be D = {D1, D2, D3}.
• This results in the retrieval of the original document D1, which has the representation d1.
• If more than one document have the same representation, every such document is
retrieved.
• Boolean information retrieval does not differentiate between these documents.
• With an inverted index, this simply means taking an intersection of the list of the
documents associated with the keywords information and retrieval.
• Boolean retrieval models have been used in IR systems for a long time. They are simple,
efficient, and easy to implement and perform well in terms of recall and precision if the
query is well formulated. However, the model suffers from certain drawbacks.
• No ranking of results (no concept of relevance or partial match)
• Results are either relevant or not — no middle ground
• Rigid query structure — requires exact term matches
• Can return too many or too few results
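A minimal sketch of Boolean retrieval over an inverted index, using set intersection for AND and union for OR; the postings below are invented:

```python
def boolean_and(index, *terms):
    """AND query: intersect the posting sets of all query terms."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

def boolean_or(index, *terms):
    """OR query: union of the posting sets of the query terms."""
    return set().union(*(index.get(t, set()) for t in terms))

index = {
    "information": {1, 2, 4},
    "retrieval": {1, 3, 4},
    "language": {5},
}
print(boolean_and(index, "information", "retrieval"))  # -> {1, 4}
print(boolean_or(index, "information", "language"))    # -> {1, 2, 4, 5}
```

Note that every document in the answer set is returned with equal status; nothing in the model distinguishes a document matching the query strongly from one matching it barely.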
PROBABILISTIC MODEL

• The probabilistic model applies a probabilistic framework to IR.
• It ranks documents based on the probability of their relevance to a given query.
• Retrieval depends on whether probability of relevance (relative to a query) of a document is higher
than that of non-relevance, i.e. whether it exceeds a threshold value.
• Given a set of documents D, a query q, and a cut-off value α, this model first calculates the
probability of relevance and irrelevance of a document to the query.
• It then ranks documents having probabilities of relevance at least that of irrelevance in decreasing
order of their relevance.
• Documents are retrieved if the probability of relevance in the ranked list exceeds the cut-off value.
More formally, if P(R|d) is the probability of relevance of a document d for query q, and P(I|d) is the
probability of irrelevance, then the set of documents retrieved in response to the query q is
{d ∈ D | P(R|d) ≥ P(I|d) and P(R|d) > α}
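The retrieval rule can be sketched as follows, assuming the relevance and irrelevance probabilities have already been estimated (in practice, estimating them from term statistics is the hard part; the values below are invented):

```python
def probabilistic_retrieve(p_rel, p_irr, alpha):
    """p_rel / p_irr: dicts mapping doc_id to estimated P(R|d) / P(I|d).
    Retrieve documents whose relevance probability is at least their
    irrelevance probability and exceeds the cut-off alpha, ranked in
    decreasing order of P(R|d)."""
    kept = [d for d in p_rel if p_rel[d] >= p_irr[d] and p_rel[d] > alpha]
    return sorted(kept, key=lambda d: p_rel[d], reverse=True)

p_rel = {"d1": 0.9, "d2": 0.4, "d3": 0.7}
p_irr = {"d1": 0.1, "d2": 0.6, "d3": 0.3}
print(probabilistic_retrieve(p_rel, p_irr, alpha=0.5))  # -> ['d1', 'd3']
```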
VECTOR MODEL

• The vector space model is one of the most well-studied retrieval models.
• The vector space model represents documents and queries as vectors of features representing terms
that occur within them.
• Each document is characterized by a Boolean or numerical vector.
• These vectors are represented in a multi-dimensional space, in which each dimension corresponds to
a distinct term in the corpus of documents.
• In its simplest form, each feature takes a value of either zero or one, indicating the absence or
presence of that term in a document or query.
• More generally, features are assigned numerical values that are usually a function of the frequency
of terms.
• Ranking algorithms compute the similarity between document and query vectors, to yield a retrieval
score for each document.
• This score is used to produce a ranked list of retrieved documents.
• To reduce the importance of the length of document vectors, we normalize document
vectors.
• Normalization changes all vectors to a standard length. We convert document vectors
to unit length by dividing each dimension by the overall length of the vector.
• Normalizing each column of a term-document matrix in this way yields a matrix in which every
document vector has unit length.
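The normalization step itself is simple: divide each component of the vector by the vector’s Euclidean length. A minimal sketch:

```python
import math

def normalize(vec):
    """Scale a document vector to unit (Euclidean) length."""
    length = math.sqrt(sum(w * w for w in vec))
    return [w / length for w in vec] if length else vec

print(normalize([3.0, 4.0]))  # -> [0.6, 0.8]
```

After normalization, every non-zero vector has length 1, so documents are compared by direction alone rather than by how long they are.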
TERM WEIGHTING

• Each term used as an indexing feature in a document helps discriminate that document from others.
• Term weighting is a technique used in information retrieval and text mining to assign importance to
terms (usually words) in documents. The goal is to reflect how relevant a term is within a specific
document and across a collection of documents.
• 1. Term Frequency (TF) – Local Importance
“The more a document contains a given word, the more it is about that word.”
This means if a term appears frequently in a document, it is probably important to that document.
• Represented as tfij (the frequency of term i in document j).
2. Inverse Document Frequency (IDF) – Global Importance
“The less a term occurs in the document collection, the more discriminating it is.”
Terms that appear in fewer documents are more useful for distinguishing those documents.
Terms that are very common across documents (like "the", "and", "of") are not helpful in finding relevant
or unique documents.
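These two factors are commonly combined as tf × idf. The sketch below uses idf = log(N/df), one common formulation among several; the toy documents are invented:

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one dict of tf*idf weights per
    document, with idf = log(N / df) -- one common variant among many."""
    N = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for doc in docs for t in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return weights

docs = [["information", "retrieval", "retrieval"],
        ["information", "extraction"]]
w = tf_idf(docs)
# "information" occurs in every document, so its idf -- and weight -- is 0.
```

This matches the intuition above: a term occurring in every document (df = N) gets idf = log(1) = 0 and contributes nothing, however frequent it is locally.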
• A third factor that may affect weighting function is the document length.
• A term appearing the same number of times in a short document and in a long document
is more valuable to the former.
• Most weighting schemes can thus be characterized by the following three factors:
1. Within-document frequency or term frequency (tf)
2. Collection frequency or inverse document frequency (idf)
3. Document length
Any term weighting scheme can be represented by a triple ABC. The letter A in this triple
represents the way the tf component is handled, B indicates the way the idf component is
incorporated, and C represents the length normalization component.
• Different combinations of options can be used to represent document and query vectors.
The retrieval models themselves can be represented by a pair of triples, such as nnn.nnn (doc =
‘nnn’, query = ‘nnn’), where the first triple corresponds to the weighting strategy used for
the documents and the second to the weighting strategy used for the query terms.
• Retrieval systems represent documents and queries as vectors, and the choice of ABC
affects how these vectors are constructed.
• For instance: the document might be weighted using ltc, and the query using lnc.

Examples:
• nnn → raw TF, no IDF, no normalization
• ltc → log TF, IDF applied, cosine normalization
• Non-classical IR models are based on principles other than similarity, probability, Boolean
operations, etc., on which classical retrieval models are based.
• Examples include information logic model, situation theory model, and interaction model.
• The information logic model is based on a special logic technique called logical imaging.
Retrieval is performed by making inferences from document to query.
• This is unlike classical models, where a search process is used. Unlike ordinary implication,
which is true in all cases except when the antecedent is true and the consequent is false, this
inference is uncertain.
• Hence, a measure of uncertainty is associated with this inference. The principle put forward
by van Rijsbergen is used to measure this uncertainty.
• This principle says: Given any two sentences x and y, a measure of the uncertainty of y → x
relative to a given data set is determined by the minimal extent to which one has to add
information to the data set in order to establish the truth of y → x.
• The situation theory model is also based on van Rijsbergen’s principle.
• Retrieval is considered as a flow of information from document to query.
• A structure called infon, denoted by ι, is used to describe the situation and to model
information flow. An infon represents an n-ary relation and its polarity.
• The polarity of an infon can be either 1 or 0, indicating that the infon carries either positive
or negative information.
For example, the information in the sentence, Adil is serving a dish, can be conveyed by an infon of the form ⟨⟨serving, Adil, dish; 1⟩⟩, where the final 1 is the polarity.
• A document d is considered relevant to a query q if it supports or entails it, written as d ⊨ q.
• However, if d does not support q, the document is not necessarily irrelevant: it may express the
same information using different words (e.g., synonyms or hyponyms), such as car vs. automobile,
or serve vs. offer. In that case, d can be transformed into a related situation d′ that does support q.
• This transformation (d → d′) is considered a flow of information between situations.

• The interaction IR model was first introduced in Dominich (1992, 1993) and Rijsbergen
(1996). In this model, the documents are not isolated; instead, they are interconnected.
• The query interacts with the interconnected documents, and retrieval is conceived as a result of
this interaction. This view of interaction is taken from the concept of interaction as realized
in the Copenhagen interpretation of quantum mechanics.
• Artificial neural networks can be used to implement this model.
• Each document is modelled as a neuron, and the document set as a whole forms a neural
network.
• The query is also modelled as a neuron and integrated into the network.
• This enables:
 formation of new connections,
 modification of existing connections, and
 interactive restructuring of relationships during retrieval.
• Retrieval is based on the measure of interaction between the query and the documents.
• The interaction score is used to rank or retrieve relevant documents.
SIMILARITY MEASURE

• The vector space model represents documents and queries as vectors in a
multi-dimensional space.
• Retrieval is performed by measuring the ‘closeness’ of the query vector
to the document vectors.
• Documents can then be ranked according to the numeric similarity
between the query and the document.
• In the vector space model, the documents selected are those that are
geometrically closest to the query according to some measure.
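Cosine similarity is a common choice for this closeness measure: the cosine of the angle between the two vectors, which is 1 for identical directions and 0 for orthogonal ones. A sketch, with vectors invented for illustration:

```python
import math

def cosine(q, d):
    """Cosine of the angle between query and document vectors."""
    dot = sum(a * b for a, b in zip(q, d))
    nq = math.sqrt(sum(a * a for a in q))
    nd = math.sqrt(sum(b * b for b in d))
    return dot / (nq * nd) if nq and nd else 0.0

query = [1.0, 1.0, 0.0]
docs = {"d1": [2.0, 2.0, 0.0], "d2": [0.0, 1.0, 3.0]}
ranked = sorted(docs, key=lambda k: cosine(query, docs[k]), reverse=True)
print(ranked)  # -> ['d1', 'd2']
```

Because the cosine divides by both vector lengths, it already incorporates the length normalization discussed earlier: d1 points in exactly the query's direction, so it scores 1.0 despite being twice as long.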
CLUSTER MODEL
• The cluster model is an attempt to reduce the number of matches during
retrieval.
• It is based on the cluster hypothesis: closely associated documents tend to be relevant to
the same requests.
• This hypothesis suggests that closely associated documents are likely to be
retrieved together.
• This means that by forming groups (classes or clusters) of related documents,
the search time can be reduced considerably.
• Instead of matching the query with every document in the collection, it is
matched with representatives of the classes, and only documents from a class
whose representative is close to the query are considered for individual matching.
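This two-stage matching can be sketched as follows, using centroids as cluster representatives and a simple dot-product score; both choices, and the toy vectors, are assumptions for illustration rather than anything prescribed by the model:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def centroid(vectors):
    """Component-wise mean of a cluster's document vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cluster_retrieve(query, clusters):
    """Stage 1: compare the query with each cluster's representative.
    Stage 2: score only the documents of the best-matching cluster."""
    best = max(clusters, key=lambda name: dot(query, centroid(clusters[name])))
    docs = clusters[best]
    ranked = sorted(range(len(docs)), key=lambda i: dot(query, docs[i]), reverse=True)
    return best, ranked

clusters = {
    "sports": [[3.0, 0.0], [2.0, 1.0]],
    "politics": [[0.0, 3.0], [1.0, 2.0]],
}
print(cluster_retrieve([1.0, 0.0], clusters))  # -> ('sports', [0, 1])
```

Only the two documents of the winning cluster are scored individually; the other cluster's documents are never touched, which is exactly where the saving in matches comes from.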
FUZZY MODEL

• In the fuzzy model, the document is represented as a fuzzy set of terms, i.e., a set of pairs [ti, μ(ti)], where
μ is the membership function.
• The membership function assigns to each term of the document a numeric membership degree.
• The membership degree expresses the significance of term to the information contained in the document.
• Usually, the significance values (weights) are assigned based on the number of occurrences of the term in
the document and in the entire document collection, as discussed earlier.
Each document in the collection
D = {d1, d2, ..., dj, ..., dn}
can thus be represented as a vector of term weights, as in the vector space model:
(w1j, w2j, w3j, ..., wij, ..., wmj)ᵗ
where wij is the degree to which term ti belongs to document dj.
• Each term in the document is considered a representative of a subject area and wij is the
membership function of document dj to the subject area represented by term ti.
• Each term ti is itself represented by a fuzzy set fi over the domain of documents, given by
fi = {(dj, wij) | j = 1, ..., n}, for i = 1, ..., m.
• This weighted representation makes it possible to rank the retrieved documents in
decreasing order of their relevance to the user’s query.
• Queries are Boolean queries. For each term that appears in the query, a set of documents
is retrieved. Fuzzy set operators are then applied to obtain the desired result.
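A sketch of fuzzy query evaluation using the standard fuzzy set operators, min for AND and max for OR; the membership degrees below are invented:

```python
# Fuzzy membership degrees wij of each term's fuzzy set over the documents.
memberships = {
    "information": {"d1": 0.8, "d2": 0.3, "d3": 0.0},
    "retrieval":   {"d1": 0.6, "d2": 0.9, "d3": 0.1},
}

def fuzzy_and(f, g):
    """Fuzzy set intersection: membership is the minimum of the two degrees."""
    return {d: min(f[d], g[d]) for d in f}

def fuzzy_or(f, g):
    """Fuzzy set union: membership is the maximum of the two degrees."""
    return {d: max(f[d], g[d]) for d in f}

# Boolean query: information AND retrieval.
result = fuzzy_and(memberships["information"], memberships["retrieval"])
ranked = sorted(result, key=result.get, reverse=True)
print(ranked)  # -> ['d1', 'd2', 'd3']
```

Unlike the crisp Boolean model, the answer carries graded membership degrees (here d1 scores 0.6, d2 scores 0.3), so the retrieved documents can be ranked by relevance rather than returned as an unordered set.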
