Module 4: NLP
BY MEGHA RANI R
• Information retrieval (IR) deals with the organization,
storage, retrieval, and evaluation of information relevant to a
user’s query.
• A user in need of information formulates a request in the
form of a query written in a natural language.
• A good descriptor is one that helps describe the content of the document and
discriminate it from other documents in the collection.
• The lexical processing of index terms involves the elimination of stop words. Stop words are high-frequency words which
have little semantic weight and are thus unlikely to help in retrieval.
• These words play important grammatical roles in language, such as in the formation of phrases, but do not
contribute to the semantic content of a document in a keyword-based representation. Such words are commonly
used in documents, regardless of topic, and thus have no topical specificity.
• Typical examples of stop words are articles and prepositions.
• Eliminating them considerably reduces the number of index terms. The drawback of eliminating stop words is that it
can sometimes result in the elimination of useful index terms, for instance the stop word A in Vitamin A.
• Some phrases, like to be or not to be, consist entirely of stop words.
• Eliminating stop words in such cases makes it impossible to search for such a document correctly.
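A minimal Python sketch of stop word elimination; the stop list below is a tiny illustrative sample, not a standard list:

```python
# Tiny illustrative stop list (real systems use much longer standard lists).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "is", "are", "and", "or"}

def remove_stop_words(tokens):
    """Drop high-frequency, low-semantic-weight tokens from an index-term list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("the design of an information retrieval system".split()))
# -> ['design', 'information', 'retrieval', 'system']

# The drawback noted above: the query "Vitamin A" loses its discriminating term.
print(remove_stop_words("Vitamin A".split()))  # -> ['Vitamin']
```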
• Stemming normalizes morphological variants, though in a crude manner, by removing affixes from words to
reduce them to their stem, e.g., the words compute, computing, computes, and computer are all reduced to the
same word stem, comput.
• Thus, the keywords or terms used to represent text are stems, not the actual words.
• One of the most widely used stemming algorithms has been developed by Porter (1980). The stemmed
representation of the text, Design features of information retrieval systems, is
(design, featur, inform, retriev, system)
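A short sketch reproducing this with NLTK's implementation of the Porter stemmer (assumes nltk is installed):

```python
# Requires `pip install nltk`.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
text = "Design features of information retrieval systems"
# Crude stop-word removal first, matching the slide's example (drops "of").
tokens = [t for t in text.lower().split() if t != "of"]
print([stemmer.stem(t) for t in tokens])
# Expected: ['design', 'featur', 'inform', 'retriev', 'system']
```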
• One of the problems associated with stemming is that it may throw away useful distinctions. In some cases, it is
useful, helping conflate similar terms and thereby increasing recall.
• In others, it may be harmful, resulting in reduced precision (e.g., when documents containing the term computation
are returned in response to the query phrase personal computer). Recall and precision are the two most commonly
used measures of the effectiveness of an information retrieval system.
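For reference, these two measures are standardly defined as follows (standard IR definitions, not specific to these notes):

```latex
\mathrm{Precision} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{retrieved}|},
\qquad
\mathrm{Recall} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{relevant}|}
```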
• Zipf made an important observation on the distribution of words in natural
languages.
• This observation has been named Zipf’s law. Simply stated, Zipf’s law says that
the frequency of words multiplied by their ranks in a large corpus is more or less
constant.
• More formally, Frequency × rank ≈ constant.
• This means that if we compute the frequencies of the words in a corpus, and
arrange them in decreasing order of frequency, then the product of the
frequency of a word and its rank is approximately equal to the product of the
frequency and rank of another word.
• This indicates that the frequency of a word is inversely proportional to its rank.
• Empirical investigation of Zipf’s law on large corpora suggests that human languages contain a small
number of words that occur with high frequency and a large number of words that occur with low
frequency.
• In between is a middling number of medium-frequency terms. This distribution has important
significance in IR.
• The high frequency words, being common, have less discriminating power, and thus, are not useful
for indexing. Low frequency words are less likely to be included in the query, and are also not useful
for indexing.
• As there are a large number of rare (low frequency) words, dropping them considerably reduces the
size of a list of index terms.
• The remaining medium frequency words are content-bearing terms and can be used for indexing.
• This can be implemented by defining thresholds for high and low frequency, and dropping words
that have frequencies above or below these thresholds. Stop word elimination can be thought of as
an implementation of Zipf’s law, where high-frequency terms are dropped from a set of index terms,
as sketched below.
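A small sketch checking Zipf's law and applying frequency thresholds; `corpus.txt` is a placeholder path, and the thresholds are illustrative, corpus-dependent values:

```python
from collections import Counter

# `corpus.txt` is a placeholder for any large plain-text corpus.
with open("corpus.txt", encoding="utf-8") as f:
    counts = Counter(f.read().lower().split())

# Zipf's law: frequency x rank is roughly constant across ranks.
for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
    print(f"{rank:>3}  {word:<15} freq={freq:<8} freq x rank = {freq * rank}")

# Keep only medium-frequency, content-bearing terms as index terms.
HIGH, LOW = 1000, 5  # illustrative thresholds
index_terms = [w for w, f in counts.items() if LOW <= f <= HIGH]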
• An IR model is a pattern that defines several aspects of the retrieval procedure, for
example,
how documents and user's queries are represented,
how a system retrieves relevant documents according to users' queries, and
how retrieved documents are ranked.
• The IR system consists of a model for documents, a model for queries, and a matching
function which compares queries to documents.
• The central objective of the model is to retrieve all documents relevant to a query. This
defines the central task of an IR system.
• Several different IR models have been developed.
• These models differ in the way documents and queries are represented and retrieval is performed.
• Some of them consider documents as sets of terms and perform retrieval based merely on the presence or
absence of one or more query terms in the document.
• Others represent a document as a vector of term weights and perform retrieval based on the numeric score
assigned to each document, representing similarity between the query and the document.
• These models can be classified as follows:
• Classical models of IR
• Non-classical models of IR
• Alternative models of IR
• The three classical IR models — Boolean, vector, and probabilistic — are based on mathematical knowledge that is
easily recognized and well understood. These models are simple, efficient, and easy to implement.
Vector space model
• The vector space model is one of the most well-studied retrieval models.
• The vector space model represents documents and queries as vectors of features representing terms
that occur within them.
• Each document is characterized by a Boolean or numerical vector.
• These vectors are represented in a multi-dimensional space, in which each dimension corresponds to
a distinct term in the corpus of documents.
• In its simplest form, each feature takes a value of either zero or one, indicating the absence or
presence of that term in a document or query.
• More generally, features are assigned numerical values that are usually a function of the frequency
of terms.
• Ranking algorithms compute the similarity between document and query vectors to yield a retrieval
score for each document.
• This score is used to produce a ranked list of retrieved documents.
• To reduce the importance of the length of document vectors, we normalize document
vectors.
• Normalization changes all vectors to a standard length. We convert document vectors
to unit length by dividing each dimension by the overall length of the vector.
• Normalizing a term-document matrix in this way converts each document vector to unit length.
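A minimal sketch of this normalization and the resulting cosine ranking, using an illustrative matrix (not the one from the original slides):

```python
import numpy as np

# rows = terms, columns = documents (raw term frequencies); illustrative values
td = np.array([[2.0, 0.0, 1.0],
               [0.0, 3.0, 1.0],
               [1.0, 1.0, 0.0]])

# Divide each document vector (column) by its length -> unit-length vectors.
td_norm = td / np.linalg.norm(td, axis=0, keepdims=True)

query = np.array([1.0, 0.0, 1.0])
query = query / np.linalg.norm(query)

# For unit vectors, cosine similarity is just the dot product.
scores = query @ td_norm
print(np.argsort(-scores))  # document indices in decreasing relevance
```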
TERM WEIGHTING
• Each term used as an indexing feature in a document helps discriminate that document from others.
• Term weighting is a technique used in information retrieval and text mining to assign importance to
terms (usually words) in documents. The goal is to reflect how relevant a term is within a specific
document and across a collection of documents.
• 1. Term Frequency (TF) – Local Importance
“The more a document contains a given word, the more it is about that word.”
This means if a term appears frequently in a document, it is probably important to that document.
Represented as tfᵢⱼ (the frequency of term i in document j).
• 2. Inverse Document Frequency (IDF) – Global Importance
“The less a term occurs in the document collection, the more discriminating it is.”
Terms that appear in fewer documents are more useful for distinguishing those documents.
Terms that are very common across documents (like "the", "and", "of") are not helpful in finding relevant
or unique documents.
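A minimal TF-IDF sketch combining these two factors; the toy documents and the plain-log formulation are illustrative choices, as smoothing conventions vary between systems:

```python
import math
from collections import Counter

docs = [["information", "retrieval", "system"],
        ["retrieval", "of", "information"],
        ["database", "system"]]

N = len(docs)
df = Counter(term for doc in docs for term in set(doc))  # document frequency

def tf_idf(term, doc):
    tf = doc.count(term)          # tf_ij: local importance within the document
    idf = math.log(N / df[term])  # idf_i: global discriminating power
    return tf * idf

print(tf_idf("database", docs[2]))   # rare term -> higher weight
print(tf_idf("retrieval", docs[0]))  # common term -> lower weight
```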
• A third factor that may affect weighting function is the document length.
• A term appearing the same number of times in a short document and in a long document
will be more valuable to the former.
• Most weighting schemes can thus be characterized by the following three factors:
1. Within-document frequency or term frequency (tf)
2. Collection frequency or inverse document frequency (idf)
3. Document length
Any term weighting scheme can be represented by a triple ABC. The letter A in this triple
represents the way the tf component is handled, B indicates the way the idf component is
incorporated, and C represents the length normalization component.
• Different combinations of options can be used to represent document and query vectors.
The retrieval strategies themselves can then be represented by a pair of triples such as nnn.nnn (doc =
‘nnn’, query = ‘nnn’), where the first triple corresponds to the weighting strategy used for
the documents and the second triple to the weighting strategy used for the query terms.
• Retrieval systems represent documents and queries as vectors, and the choice of ABC
affects how these vectors are constructed.
• For instance, documents might be weighted using ltc, while queries are weighted using lnc.
Examples:
• nnn → raw TF, no IDF, no normalization
• ltc → log TF, IDF applied, cosine normalization
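A sketch of how an ltc-style weight might be computed, under one common reading of the triple (l = 1 + log tf, t = log N/df, c = cosine normalization); exact conventions vary between descriptions of this notation:

```python
import math

def ltc_weights(tf_vector, df_vector, n_docs):
    """tf_vector: raw term counts for one document;
    df_vector: document frequency of each term; n_docs: collection size."""
    # l: dampened term frequency; t: idf; applied term by term.
    w = [(1 + math.log(tf)) * math.log(n_docs / df) if tf > 0 else 0.0
         for tf, df in zip(tf_vector, df_vector)]
    # c: cosine (unit-length) normalization of the weighted vector.
    norm = math.sqrt(sum(x * x for x in w)) or 1.0
    return [x / norm for x in w]

print(ltc_weights([3, 0, 1], [2, 5, 1], n_docs=10))
```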
• Non-classical IR models are based on principles other than similarity, probability, Boolean
operations, etc., on which classical retrieval models are based.
• Examples include information logic model, situation theory model, and interaction model.
• The information logic model is based on a special logic technique called logical imaging.
Retrieval is performed by making inferences from document to query.
• This is unlike classical models, where a search process is used. Unlike usual implication,
which is true in all cases except when the antecedent is true and the consequent is false, this
inference is uncertain.
• Hence, a measure of uncertainty is associated with this inference. The principle put forward
by van Rijsbergen is used to measure this uncertainty.
• This principle says: Given any two sentences x and y, a measure of the uncertainty of y → x
relative to a given data set is determined by the minimal extent to which one has to add
information to the data set in order to establish the truth of y → x.
• The situation theory model is also based on van Rijsbergen’s principle.
• Retrieval is considered as a flow of information from document to query.
• A structure called infon, denoted by ι, is used to describe the situation and to model
information flow. An infon represents an n-ary relation and its polarity.
• The polarity of an infon can be either 1 or 0, indicating that the infon carries either positive
or negative information.
For example, the information in the sentence, Adil is serving a dish, is conveyed by the infon:
⟨⟨serving, Adil, dish; 1⟩⟩
• A document d is considered relevant to a query q if it supports or entails it, written as:
• d ⊨ q
• But if d does not support q, it does not necessarily mean the document is irrelevant,
because it may use different words (e.g., synonyms, hyponyms).
• For example, "car" vs. "automobile", "serve" vs. "offer".
• In that case, the document d can be transformed into a document d′ that does support q.
This transformation (d → d′) is considered a flow of information between situations.
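A toy encoding of an infon as a data structure, purely illustrative and not a full situation-theory implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Infon:
    relation: str  # the n-ary relation, e.g., "serving"
    args: tuple    # the relation's arguments, e.g., ("Adil", "dish")
    polarity: int  # 1 = carries positive information, 0 = negative

# "Adil is serving a dish" as an infon:
i = Infon("serving", ("Adil", "dish"), polarity=1)
print(i)
```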
• The interaction IR model was first introduced in Dominich (1992, 1993) and Rijsbergen
(1996). In this model, the documents are not isolated; instead, they are interconnected.
• The query interacts with the interconnected documents. Retrieval is conceived as a result of
this interaction. This view of interaction is taken from the concept of interaction as realized
in the Copenhagen interpretation of quantum mechanics.
• Artificial neural networks can be used to implement this model.
• Each document is modelled as a neuron, the document set as a whole forms a neural
network.
• The query is also modelled as a neuron and integrated into the network.
• This enables:
Formation of new connections
Modification of existing connections
Interactive restructuring of relationships during retrieval
• Retrieval is based on the measure of interaction between the query and documents.
• The interaction score is used to rank or retrieve relevant documents.
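A heavily simplified, purely illustrative sketch of this idea: documents and the query as neurons, with activation spreading over their connections. The connection weights and update rule here are assumptions for illustration, not the published model:

```python
import numpy as np

# Pairwise connection strengths among 3 document neurons plus 1 query neuron
# (last row/column), e.g., derived from shared terms; illustrative values.
W = np.array([[0.0, 0.2, 0.1, 0.6],
              [0.2, 0.0, 0.4, 0.1],
              [0.1, 0.4, 0.0, 0.3],
              [0.6, 0.1, 0.3, 0.0]])

a = np.array([0.0, 0.0, 0.0, 1.0])  # activate the query neuron
for _ in range(5):                  # let activation spread and settle
    a = np.tanh(W @ a)

print(np.argsort(-a[:3]))  # rank documents by resulting interaction score
```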
FUZZY MODEL
• In the fuzzy model, the document is represented as a fuzzy set of terms, i.e., a set of pairs [tᵢ, μ(tᵢ)], where
μ is the membership function.
• The membership function assigns to each term of the document a numeric membership degree.
• The membership degree expresses the significance of the term to the information contained in the document.
• Usually, the significance values (weights) are assigned based on the number of occurrences of the term in
the document and in the entire document collection, as discussed earlier.
Each document in the collection
D = {d₁, d₂, ..., dⱼ, ..., dₙ}
can thus be represented as a vector of term weights, as in the following vector space model:
(w₁ⱼ, w₂ⱼ, w₃ⱼ, ..., wᵢⱼ, ..., wₘⱼ)ᵗ
where wᵢⱼ is the degree to which term tᵢ belongs to document dⱼ.
• Each term in the document is considered a representative of a subject area, and wᵢⱼ is the
membership function of document dⱼ to the subject area represented by term tᵢ.
• Each term tᵢ is itself represented by a fuzzy set fᵢ in the domain of documents, given by
fᵢ = {(dⱼ, wᵢⱼ) | j = 1, ..., n}, for i = 1, ..., m.
• This weighted representation makes it possible to rank the retrieved documents in
decreasing order of their relevance to the user’s query.
• Queries are Boolean queries. For each term that appears in the query, a set of documents
is retrieved. Fuzzy set operators are then applied to obtain the desired result.
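A minimal fuzzy-retrieval sketch, using min/max as the fuzzy AND/OR operators (one standard choice); all membership degrees below are illustrative:

```python
# Documents as fuzzy sets of terms: term -> membership degree w_ij.
docs = {
    "d1": {"information": 0.8, "retrieval": 0.6},
    "d2": {"information": 0.3, "database": 0.9},
    "d3": {"retrieval": 0.7, "database": 0.4},
}

def fuzzy_and(term_a, term_b):
    """Membership of each document in (term_a AND term_b): pointwise min."""
    return {d: min(w.get(term_a, 0.0), w.get(term_b, 0.0)) for d, w in docs.items()}

def fuzzy_or(term_a, term_b):
    """Membership of each document in (term_a OR term_b): pointwise max."""
    return {d: max(w.get(term_a, 0.0), w.get(term_b, 0.0)) for d, w in docs.items()}

# Boolean query "information AND retrieval", ranked by membership degree:
result = fuzzy_and("information", "retrieval")
print(sorted(result.items(), key=lambda kv: -kv[1]))
# -> [('d1', 0.6), ('d2', 0.0), ('d3', 0.0)]
```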