Information Retrieval Models
RETRIEVAL MODELS
UNIT 3
The three main retrieval models
The following major models have been developed to retrieve information:
– the Boolean model,
– the Statistical model, which includes the vector space and the probabilistic retrieval
models, and
– the Linguistic and Knowledge-based models.
1. Boolean Model
■ Disadvantages:
– Exact matching may retrieve too few or too many documents
– Difficult to rank output
– Difficult to control the number of documents retrieved, because all matched documents will
be returned
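The exact-matching behaviour behind these disadvantages can be seen in a minimal sketch. The toy collection, the document IDs, and the helper names (`build_postings`, `boolean_and`, `boolean_or`) are all invented for illustration and are not part of the course material.

```python
# Toy collection; the texts and IDs are made up for this example.
DOCS = {
    1: "information retrieval with boolean queries",
    2: "vector space retrieval of documents",
    3: "probabilistic ranking of documents",
}

def build_postings(docs):
    """Map each term to the set of document IDs that contain it."""
    postings = {}
    for doc_id, text in docs.items():
        for term in set(text.split()):
            postings.setdefault(term, set()).add(doc_id)
    return postings

def boolean_and(postings, terms):
    """IDs of documents containing every term (exact match, unranked)."""
    sets = [postings.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()

def boolean_or(postings, terms):
    """IDs of documents containing at least one term (exact match, unranked)."""
    result = set()
    for term in terms:
        result |= postings.get(term, set())
    return result

POSTINGS = build_postings(DOCS)
and_hits = boolean_and(POSTINGS, ["retrieval", "documents"])
or_hits = boolean_or(POSTINGS, ["retrieval", "documents"])
strict_hits = boolean_and(POSTINGS, ["retrieval", "documents", "boolean"])
```

Here the AND query matches a single document, the OR query matches the whole collection, and adding one more AND term matches nothing. In each case the result is an unordered set, so there is no score by which to rank or truncate the output.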
2. Statistical Model
■ The vector space and probabilistic models are the two major examples of the statistical retrieval
approach.
■ Both models use statistical information in the form of term frequencies to determine the
relevance of documents with respect to a query.
■ Although they differ in the way they use the term frequencies, both produce as their output a
list of documents ranked by their estimated relevance.
■ The statistical retrieval models address some of the problems of Boolean retrieval methods, but
they have disadvantages of their own.
■ The key features of the vector space and probabilistic approaches are summarized below.
2.1 Vector Space Model
■ The vector space model represents the documents and queries as vectors in a multidimensional
space, whose dimensions are the terms used to build an index to represent the documents.
■ Creating the index involves lexical scanning to identify the significant terms,
morphological analysis to reduce different word forms to common "stems", and
counting the occurrences of those stems.
■ The vector space model can assign a high ranking score to a document that contains only a few
of the query terms if these terms occur infrequently in the collection but frequently in the
document.
■ The vector space model makes the following assumptions:
■ 1) The more similar a document vector is to a query vector, the more likely it is that the
document is relevant to that query.
■ 2) The words used to define the dimensions of the space are orthogonal or independent. While
it is a reasonable first approximation, the assumption that words are pairwise independent is not
realistic.
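A minimal sketch of vector space ranking follows, assuming tf-idf term weights and cosine similarity as the similarity measure. These are common choices, but the slides do not name a specific weighting scheme, so treat both as assumptions. The documents, the query, and all function names are invented, and stemming is omitted to keep the example short.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (no stemming in this sketch)."""
    return text.lower().split()

def idf_table(docs):
    """Inverse document frequency: rare terms get larger weights."""
    n_docs = len(docs)
    doc_freq = Counter()
    for text in docs.values():
        doc_freq.update(set(tokenize(text)))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def tfidf_vector(text, idf):
    """Sparse vector: term -> term frequency times idf."""
    counts = Counter(tokenize(text))
    return {t: c * idf[t] for t, c in counts.items() if t in idf}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return (doc_id, score) pairs sorted by decreasing similarity."""
    idf = idf_table(docs)
    q_vec = tfidf_vector(query, idf)
    scores = {d: cosine(q_vec, tfidf_vector(t, idf)) for d, t in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

DOCS = {
    "d1": "zebra zebra zebra grass",
    "d2": "animal animal park grass",
    "d3": "animal park city",
}
RANKING = rank("zebra animal park", DOCS)
```

In this toy collection d1 ranks first even though it contains only one of the three query terms, because "zebra" is rare in the collection and frequent in d1. The dot product in `cosine` treats each term as its own axis, which is where the independence assumption enters.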
Advantages of the vector space model
■ Its term weighting scheme can improve retrieval performance
■ Allows partial matching
■ Retrieved documents are scored according to their degree of similarity
Disadvantage of the vector space model
■ Terms are assumed to be mutually independent, which can hurt performance in some cases.
2.2 Probabilistic Model/Inference Network
■ The probabilistic retrieval model is based on the Probability Ranking Principle, which states
that an information retrieval system is supposed to rank the documents based on their
probability of relevance to the query, given all the evidence available [Belkin and Croft 1992].
■ The principle takes into account that there is uncertainty in the representation of the information
need and the documents.
■ There can be a variety of sources of evidence that are used by the probabilistic retrieval
methods, and the most common one is the statistical distribution of the terms in both the
relevant and non-relevant documents.
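One classic way to turn the term distribution in relevant and non-relevant documents into a score is the Robertson/Sparck Jones relevance weight with 0.5 smoothing. The sketch below uses it as an example only; the slides do not say which weighting is intended, and the counts, term names, and function names are invented.

```python
import math

def rsj_weight(r, n, R, N):
    """Smoothed log odds ratio for one term.

    r: known relevant documents containing the term
    n: documents in the collection containing the term
    R: known relevant documents
    N: documents in the collection
    """
    relevant_odds = (r + 0.5) / (R - r + 0.5)
    nonrelevant_odds = (n - r + 0.5) / (N - n - R + r + 0.5)
    return math.log(relevant_odds / nonrelevant_odds)

def score(doc_terms, query_terms, term_stats, R, N):
    """Sum the weights of the query terms that occur in the document."""
    total = 0.0
    for term in query_terms:
        if term in doc_terms and term in term_stats:
            r, n = term_stats[term]
            total += rsj_weight(r, n, R, N)
    return total

# Hypothetical counts: term -> (r, n), with 10 known relevant documents
# in a collection of 100.
TERM_STATS = {"zebra": (8, 10), "park": (2, 40)}
R_KNOWN, N_DOCS = 10, 100
QUERY = ["zebra", "park"]

zebra_doc = score({"zebra", "grass"}, QUERY, TERM_STATS, R_KNOWN, N_DOCS)
park_doc = score({"park", "city"}, QUERY, TERM_STATS, R_KNOWN, N_DOCS)
```

With these counts "zebra" occurs mostly in relevant documents and gets a positive weight, while "park" occurs mostly in non-relevant documents and gets a negative one, so the document containing "zebra" is ranked above the one containing "park".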
■ The statistical approaches have the following strengths:
■ 1) They provide users with a relevance ranking of the retrieved documents. Hence, they enable users to control the
output by setting a relevance threshold or by specifying a certain number of documents to display.
■ 2) Queries can be easier to formulate because users do not have to learn a query language and can use natural language.
■ 3) The uncertainty inherent in the choice of query concepts can be represented.
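The first strength can be illustrated with a short sketch: once the output is a ranked list, limiting it is a simple cut. The scores and the helper names `top_k` and `above_threshold` are invented for illustration.

```python
# A ranked result list as (doc_id, estimated relevance); values are made up.
RANKED = [("d1", 0.88), ("d2", 0.40), ("d3", 0.21), ("d4", 0.05)]

def top_k(ranked, k):
    """Keep only the k highest-ranked documents."""
    return ranked[:k]

def above_threshold(ranked, threshold):
    """Keep only documents whose score reaches the threshold."""
    return [(doc_id, s) for doc_id, s in ranked if s >= threshold]

first_two = top_k(RANKED, 2)
confident = above_threshold(RANKED, 0.3)
```

Neither cut is possible with a Boolean result set, which carries no scores.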