NLP Unit-II (Part-I)
Information Retrieval
Introduction:
Information retrieval (IR) may be defined as a software program that deals with the
organization, storage, retrieval and evaluation of information from document repositories,
particularly textual information. The system assists users in finding the information they
require, but it does not explicitly return answers to their questions; instead, it informs them of
the existence and location of documents that might contain the required information. The
documents that satisfy the user’s requirement are called relevant documents. A perfect IR
system would retrieve only relevant documents.
Classical Problem in Information Retrieval (IR) System
The main goal of IR research is to develop a model for retrieving information from
repositories of documents. Here we discuss a classical problem related to IR systems, the
ad-hoc retrieval problem.
In ad-hoc retrieval, the user enters a query in natural language that describes the required
information, and the IR system returns the documents related to the desired information. For
example, when we search for something on the Internet, it gives some pages that are exactly
relevant to our requirement, but there can be some non-relevant pages too. This is due to the
ad-hoc retrieval problem.
Aspects of Ad-hoc Retrieval
The following are some aspects of ad-hoc retrieval that are addressed in IR research −
How can users, with the help of relevance feedback, improve the original formulation of a
query?
How can database merging be implemented, i.e., how can results from different text
databases be merged into one result set?
How should partly corrupted data be handled? Which models are appropriate for this?
Boolean Model:
The Boolean model retrieves documents by applying Boolean operators (AND, OR) to index
terms. For example, the term Peanut Butter on its own (or Jelly on its own) identifies all the
documents indexed under Peanut Butter (or Jelly) alone.
If the information needed is based on Peanut Butter AND Jelly, the query with the keywords
Peanut Butter AND Jelly returns the set of documents that contain both terms.
Using OR, the search returns documents containing Peanut Butter, documents containing
Jelly, or documents containing both, as sketched below.
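As a rough sketch of this AND/OR behaviour, the following Python snippet builds a small inverted index over made-up documents; the documents and the build_index helper are illustrative assumptions, not part of any particular IR system.

```python
# Illustrative-only Boolean retrieval over a toy collection using an inverted index.

def build_index(docs):
    """Map each term to the set of document ids that contain it (an inverted index)."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index

docs = {
    1: "peanut butter sandwich",
    2: "grape jelly toast",
    3: "peanut butter and jelly sandwich",
}
index = build_index(docs)

# AND: documents containing both query terms (set intersection)
print(index["peanut"] & index["jelly"])   # {3}
# OR: documents containing either query term (set union)
print(index["peanut"] | index["jelly"])   # {1, 2, 3}
```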
Probabilistic Model:
Probabilistic models provide the foundation for reasoning under uncertainty in the realm
of information retrieval.
Let us understand why there is uncertainty while retrieving documents and the basis for
probability models in information retrieval.
Uncertainty in retrieval models: The probabilistic models in information retrieval are built
on the idea that the process of retrieval is inherently uncertain from multiple standpoints:
There is uncertainty in the understanding of the user’s information need - we cannot be sure
that the user has accurately mapped that need into the query they have presented.
Even if the query represents the need well, there is uncertainty in estimating the relevance of
a document to the query, which stems either from the choice of the document representation
or from the matching of the query and the documents.
The Vector Space Model: Also called the term vector model, the vector space model is
an algebraic model for representing text documents (and, more generally, many kinds of
multimedia objects) as vectors of identifiers such as index terms.
The vector space model is based on the notion of similarity: the representative query
prepared by the user should be similar to the documents needed for information retrieval.
We can represent both documents and queries as t-dimensional vectors of term weights, for
example
d_j = (w_1,j, w_2,j, ..., w_t,j) and q = (w_1,q, w_2,q, ..., w_t,q)
In vector space models, the assumptions of document similarity theory are used to compute
the relevance rankings of documents with respect to the keywords in the search.
Angle of deviation between query and document: One way is to compare the deviation of the
angle between each document vector and the original query vector, where the query is
represented as a vector of the same kind as the documents.
Cosine distance as a similarity metric: The most popular and easiest method in practice is
to calculate the cosine of the angle between the vectors. A cosine value of zero means that
the query and document vectors are orthogonal and have no match at all; that is, none of the
query terms occur in the document being considered.
Ranking the results using a similarity metric: The degree of similarity between the
representation of the user-prepared query and the representations of the documents in
the collection is used to rank the search results.
Another way of looking at the similarity criterion in the vector space model is that the
more the representations of the documents and the user-prepared query agree in the given
elements and their distribution, the higher the probability that they represent similar
information.
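A minimal sketch of cosine-based ranking follows; the term-weight vectors for the query and the three documents are made up for illustration (in practice the components would be weights such as tf-idf).

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

query = [1, 1, 0]                                   # query weights over a 3-term vocabulary
doc_vectors = {"d1": [2, 0, 1], "d2": [1, 1, 0], "d3": [0, 0, 3]}

# Rank documents by decreasing cosine similarity; d3 shares no terms with the query, so its score is 0.
ranking = sorted(doc_vectors.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
print(ranking)
```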
The creation of the indices for the vector space model involves lexical scanning,
morphological analysis, and term value computation.
Lexical scanning of the individual documents identifies the significant terms, morphological
analysis reduces the different word forms to common stems, and the values of the terms are
then computed on the basis of the stemmed words.
The terms of the query are also weighted to take their importance into account; these weights
are computed from the statistical distributions of the terms in the collection and in the
documents.
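These indexing steps can be sketched roughly as below; the naive_stem rule is a deliberately crude, assumed stand-in for a real morphological analyser such as a Porter-style stemmer.

```python
from collections import Counter

def naive_stem(word):
    """Crude suffix stripping; a stand-in for real morphological analysis."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_document(text):
    tokens = [t for t in text.lower().split() if t.isalpha()]   # lexical scanning
    stems = [naive_stem(t) for t in tokens]                     # morphological analysis
    return Counter(stems)                                       # term values from stem counts

print(index_document("Retrieving documents requires indexed document terms"))
```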
The vector space model assigns a high ranking score to a document that contains only a few
of the query terms if those terms occur infrequently in the collection but frequently in the
document.
The more similar a document vector is to a query vector, the more likely it is that the
document is relevant to that query.
The model assumes that the words used to define the dimensions of the space are orthogonal,
i.e., independent. The similarity assumption is a reasonable approximation, whereas the
assumption that words are pairwise independent does not hold in realistic scenarios.
Long documents are poorly represented because they have poor similarity values due to a
small scalar product and a large dimensionality of the terms in the model.
Search keywords must be precisely designed to match document terms, and word substrings
might result in false positive matches.
Semantic sensitivity: Documents with similar context but different term vocabulary won't be
associated resulting in false negative matches.
The order in which the terms appear in the document is lost in the vector space
representation.
Weighting is intuitive but not represented formally in the model.
Issues with implementation: Because the similarity metric calculation requires storing the
values of all vector components, incremental updates of the index are problematic.
Adding a single new document changes the document frequencies of the terms that occur in
it, which changes the vector lengths of every document that contains one or more of these
terms, as the small illustration below shows.
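A small illustration, with made-up documents, of why incremental updates are awkward: adding one document raises the document frequency of its terms, so idf-based weights of every other document sharing those terms must be recomputed.

```python
from collections import Counter

docs = [["peanut", "butter"], ["grape", "jelly"]]

def doc_freq(collection):
    """Number of documents each term occurs in."""
    df = Counter()
    for doc in collection:
        df.update(set(doc))
    return df

print(doc_freq(docs))              # before: every term has df = 1
docs.append(["peanut", "jelly"])   # add a single new document
print(doc_freq(docs))              # after: df of "peanut" and "jelly" is now 2, so the
                                   # idf-weighted lengths of the first two documents change too
```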
Term weighting:
Term weighting refers to the weights assigned to the terms in the vector space. The higher
the weight of a term, the greater its impact on the cosine similarity; more weight should be
assigned to the more important terms in the model.
Another method, which is more effective, is to use term frequency (tf_ij), document
frequency (df_i) and collection frequency (cf_i).
Term Frequency (tf_ij)
It may be defined as the number of occurrences of w_i in d_j. Term frequency captures how
salient a word is within the given document; in other words, the higher the term frequency,
the better that word describes the content of the document.
Document Frequency (df_i)
It may be defined as the total number of documents in the collection in which w_i occurs. It
is an indicator of informativeness: semantically focused words occur several times in a
document, unlike semantically unfocused words.
Collection Frequency (cf_i)
It may be defined as the total number of occurrences of w_i in the collection.
Mathematically,
df_i ≤ cf_i and Σ_j tf_ij = cf_i
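These three statistics and the relations above can be checked on a toy collection; the documents below are made up for illustration.

```python
from collections import Counter

docs = {
    "d1": "peanut butter peanut",
    "d2": "grape jelly",
    "d3": "peanut jelly jelly",
}

tf = {j: Counter(text.split()) for j, text in docs.items()}   # tf_ij: occurrences of w_i in d_j
df = Counter()                                                # df_i: documents containing w_i
cf = Counter()                                                # cf_i: total occurrences of w_i
for counts in tf.values():
    df.update(counts.keys())
    cf.update(counts)

for term in cf:
    assert df[term] <= cf[term]                                  # df_i <= cf_i
    assert sum(c.get(term, 0) for c in tf.values()) == cf[term]  # sum_j tf_ij = cf_i
print(dict(df), dict(cf))
```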
Forms of Document Frequency Weighting
Let us now learn about the different forms of document frequency weighting. The forms are
described below −
Term Frequency Factor
This is also known as the term frequency factor: if a term t appears often in a document,
then a query containing t should retrieve that document. We can combine a word’s term
frequency (tf_ij) and document frequency (df_i) into a single weight as follows −
weight(i, j) = (1 + log(tf_ij)) · log(N / df_i)   if tf_ij ≥ 1
weight(i, j) = 0                                  if tf_ij = 0
Here N is the total number of documents.
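A direct transcription of this weighting scheme in Python; the values of tf_ij, df_i and N below are made up for illustration.

```python
import math

def weight(tf_ij, df_i, N):
    """weight(i, j) = (1 + log tf_ij) * log(N / df_i) for tf_ij >= 1, else 0."""
    if tf_ij == 0:
        return 0.0
    return (1 + math.log(tf_ij)) * math.log(N / df_i)

print(weight(tf_ij=3, df_i=5, N=1000))    # frequent in the document, rare in the collection -> high weight
print(weight(tf_ij=3, df_i=900, N=1000))  # frequent everywhere -> weight close to 0
print(weight(tf_ij=0, df_i=5, N=1000))    # term absent from the document -> 0
```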
Inverse Document Frequency (idf)
This is another form of document frequency weighting, often called idf weighting or
inverse document frequency weighting. The key idea of idf weighting is that a term’s
scarcity across the collection is a measure of its importance, and importance is inversely
proportional to frequency of occurrence.
Mathematically, two common variants are:
idf_t = log(1 + N / n_t)
idf_t = log((N − n_t) / n_t)
Here,
N = total number of documents in the collection
n_t = number of documents containing term t
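The two idf variants written out in code, with made-up values of N and n_t; the function names are assumptions for illustration.

```python
import math

def idf_smoothed(N, n_t):
    """idf_t = log(1 + N / n_t)"""
    return math.log(1 + N / n_t)

def idf_alternative(N, n_t):
    """idf_t = log((N - n_t) / n_t)"""
    return math.log((N - n_t) / n_t)

print(idf_smoothed(1000, 10), idf_alternative(1000, 10))     # rare term -> large idf
print(idf_smoothed(1000, 900), idf_alternative(1000, 900))   # common term -> small (here even negative) idf
```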
Non-Classical IR Model:
They differ from classical models in that they are based on principles other than similarity,
probability and Boolean operations. Examples of non-classical IR models include the
Information Logic model, the Situation Theory model, and Interaction models.
Alternative IR Model:
These take the principles of the classical IR model and enhance them to create more
functional models, such as the Cluster model, alternative set-theoretic models (e.g., the
Fuzzy Set model), the Latent Semantic Indexing (LSI) model, and alternative algebraic
models (e.g., the Generalized Vector Space Model).
Information System Evaluation:
The creation of the annual Text Retrieval Evaluation Conference (TREC) sponsored
by the Defense Advanced Research Projects Agency (DARPA) and the National
Institute of Standards and Technology (NIST) changed the standard process of
evaluating information systems. The conference provides a standard database
consisting of gigabytes of test data, search statements and the expected results from
the searches to academic researchers and commercial companies for testing of their
systems. This has provided a standard baseline for comparing algorithms.
Measurements can be made from two perspectives: user perspective and system
perspective. Techniques for collecting measurements can also be objective or
subjective. An objective measure is one that is well-defined and based upon numeric
values derived from the system operation. A subjective measure can produce a
number, but is based upon an individual user’s judgments.
In addition to efficiency of the search process, the quality of the search results is also
measured by precision and recall.
Another measure, directly related to retrieving non-relevant items, can be used to define how
effectively an information system is operating. This measure is called fallout and is defined
as:
Fallout = (number of non-relevant items retrieved) / (total number of non-relevant items in
the collection)
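A sketch of precision, recall and fallout on made-up retrieved and relevant sets (the document ids and collection size are illustrative only):

```python
retrieved = {"d1", "d2", "d3", "d4"}   # items the system returned
relevant = {"d2", "d4", "d7"}          # items judged relevant
collection_size = 20                   # total items in the collection

relevant_retrieved = retrieved & relevant
non_relevant_in_collection = collection_size - len(relevant)

precision = len(relevant_retrieved) / len(retrieved)                                # 2 / 4
recall = len(relevant_retrieved) / len(relevant)                                    # 2 / 3
fallout = (len(retrieved) - len(relevant_retrieved)) / non_relevant_in_collection   # 2 / 17
print(precision, recall, fallout)
```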
Other measures of search capabilities have also been proposed. A newer measure that
provides additional insight when comparing systems or algorithms is the Unique Relevance
Recall (URR) metric. URR is used to compare two or more algorithms or systems: it
measures the number of relevant items that are retrieved by one algorithm but not by the
others.
Other measures have been proposed for judging the results of searches:
Novelty Ratio: the ratio of relevant items retrieved that were not previously known to the
user to the total relevant items retrieved.
Coverage Ratio: the ratio of relevant items retrieved to the total relevant items known to the
user before the search.
Sought Recall: the ratio of the total relevant items reviewed by the user after the search to
the total relevant items the user would have liked to examine.
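These three ratios can be sketched the same way, again with made-up sets of document ids standing in for the user's judgments.

```python
retrieved_relevant = {"d1", "d2", "d3", "d4"}   # relevant items the search retrieved
known_before_search = {"d2", "d3", "d9"}        # relevant items the user already knew about
reviewed_after_search = {"d1", "d2"}            # relevant items the user actually examined
wanted_to_examine = 3                           # relevant items the user would have liked to see

novelty_ratio = len(retrieved_relevant - known_before_search) / len(retrieved_relevant)     # 2 / 4
coverage_ratio = len(retrieved_relevant & known_before_search) / len(known_before_search)   # 2 / 3
sought_recall = len(reviewed_after_search) / wanted_to_examine                              # 2 / 3
print(novelty_ratio, coverage_ratio, sought_recall)
```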
Measurement Example - TREC Results
Until the creation of the Text Retrieval Conferences (TREC) by the Defense
Advanced Research Projects Agency (DARPA) and the National Institute of
Standards and Technology (NIST), experimentation in the area of information
retrieval was constrained by the researcher’s ability to manually create a test
database. One of the first test databases was associated with the Cranfield I and II
tests (Cleverdon-62, Cleverdon-66). It contained 1400 documents and 225 queries.
It became one of the standard test sets and has been used by a large number
of researchers. Other test collections have been created by Fox and Sparck Jones.
There have been five TREC conferences since 1992. TREC provides a set of
training documents and a set of test documents, each over 1 Gigabyte in size. It also
provides a set of training search topics (along with relevance judgments from the
database) and a set of test topics.
The researchers send the TREC sponsor the list of the top 200 items in
ranked order that satisfy the search statements. These lists are used in determining
the items to be manually reviewed for relevance and for calculating the results from
each system. The search topics are “user need” statements rather than specific
queries. This allows maximum flexibility for each researcher to translate the search
statement to a query appropriate for their system and assists in the determination of
whether an item is relevant.
The search topics in the initial TREC consisted of a Number, Domain (e.g.,
Science and Technology), Title, Description of what constituted a relevant item, Narrative
natural language text for the search, and Concepts which were specific search terms.
TREC-1 (1992) was constrained by researchers trying to get their systems to work
with the very large test databases. TREC-2 in August 1993 was the first real test of the
algorithms which provided insights for the researchers into areas in which their systems
needed work. The search statements (user need statements) were very large and complex.
They reflect long-standing information needs rather than ad hoc requests. By TREC-3, the
participants were experimenting with techniques for query expansion and the importance of
constraining searches to passages within items versus the total item.
TREC-4 introduced significantly shorter queries (average reduction from 119 terms
in TREC-3 to 16 terms in TREC-4) and introduced five new areas of testing called “tracks”
(Harman-96). The queries were shortened by dropping the title and a narrative field, which
provided additional description of a relevant item. The multilingual track expanded TREC-
4 to test a search in a Spanish test set of 200 Mbytes of articles from the “El Norte”
newspaper.