Information Retrieval Models

Uploaded by

mihlemaza03

INFORMATION RETRIEVAL MODELS
UNIT 3
The three main retrieval models
The following major models have been developed to retrieve information:
– the Boolean model,
– the Statistical model, which includes the vector space and the probabilistic retrieval models, and
– the Linguistic and Knowledge-based models.
1. Boolean Model

■ Documents are represented as sets of terms
■ Queries are formed with the standard set-theoretic Boolean operators
– AND, OR and NOT
■ Retrieval and relevance are binary concepts
– 1 (True) or 0 (False): a document either matches the query or it does not
■ Lacks sophisticated ranking algorithms
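The points above can be illustrated with a minimal sketch. The three-document collection and the helper name postings are made up for illustration; the sketch only assumes that each document is a set of terms and that queries combine terms with AND, OR and NOT.

```python
# Toy collection (hypothetical): each document is represented as a set of terms.
docs = {
    1: {"information", "retrieval", "boolean"},
    2: {"information", "theory"},
    3: {"boolean", "algebra"},
}

# Inverted index: term -> set of ids of the documents that contain it.
index = {}
for doc_id, terms in docs.items():
    for term in terms:
        index.setdefault(term, set()).add(doc_id)

all_ids = set(docs)


def postings(term):
    """Return the ids of the documents containing the term (empty set if none)."""
    return index.get(term, set())


# information AND boolean: set intersection
and_hits = postings("information") & postings("boolean")
# information OR boolean: set union
or_hits = postings("information") | postings("boolean")
# information AND NOT boolean: intersection with the complement
not_hits = postings("information") & (all_ids - postings("boolean"))
```

Each result is an unordered set: a document is either in it or not, which is why the standard model offers no ranking of the matched documents.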
■ Standard Boolean
■ The standard Boolean approach has the following strengths:
■ 1) It is easy to implement and it is computationally efficient. Hence, it is the standard model for
the current large-scale, operational retrieval systems and many of the major on-line information
services use it.
■ 2) It enables users to express structural and conceptual constraints to describe important
linguistic features. Users find that synonym specifications (reflected by OR-clauses) and
phrases (represented by proximity relations) are useful in the formulation of queries. The
Boolean approach possesses a great expressive power and clarity.
■ 3) Boolean retrieval is very effective if a query requires an exhaustive and unambiguous
selection.
■ 4) The Boolean method offers a multitude of techniques to broaden or narrow a query.
■ 5) The Boolean approach can be especially effective in the later stages of the search process,
because of the clarity and exactness with which relationships between concepts can be
represented.
■ Narrowing and Broadening Techniques
■ A Boolean query can be described along four dimensions: the degree and type of coordination, proximity constraints, field specifications, and the degree of stemming (expressed in terms of word/string specifications).
■ If users want to (re)formulate a Boolean query then they need to make informed choices along
these four dimensions to create a query that is sufficiently broad or narrow depending on their
information needs.
■ Most narrowing techniques raise precision but lower recall, and most broadening techniques raise recall but lower precision. Any query can be reformulated to achieve the desired precision or the desired recall, but it is generally difficult to achieve both at once.
■ Each of the four kinds of operations in the query formulation has particular operators, some of
which tend to have a narrowing or broadening effect. For each operator with a narrowing effect,
there is one or more inverse operators with a broadening effect. Hence, users require help to
gain an understanding of how changes along these four dimensions will affect the broadness or
narrowness of a query.
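The precision/recall trade-off described in this section can be sketched numerically. The postings lists and the relevance judgments below are hypothetical; the sketch assumes the usual definitions (precision = relevant retrieved / retrieved, recall = relevant retrieved / relevant).

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document ids."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical relevance judgments and postings lists.
relevant = {1, 2, 3, 4}
index = {
    "retrieval": {1, 2, 3, 7, 8},
    "boolean": {1, 2, 5},
    "search": {3, 4, 6, 7},
}

# Narrowing: add a term with AND.
narrow = index["retrieval"] & index["boolean"]
# Broadening: add a synonym with OR.
broad = index["retrieval"] | index["search"]

narrow_p, narrow_r = precision_recall(narrow, relevant)
broad_p, broad_r = precision_recall(broad, relevant)
```

With these made-up numbers the AND query returns only relevant documents but misses half of them, while the OR query finds all relevant documents at the cost of three non-relevant ones.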
■ Smart Boolean
■ Smart Boolean tries to help users construct and modify a Boolean query as well as make better
choices along the four dimensions that characterize a Boolean query.
■ This method is a good example of the possible ways to make Boolean retrieval more user-friendly and effective.
■ Users start by specifying a natural language statement that is automatically translated into a
Boolean Topic representation that consists of a list of factors or concepts, which are automatically
coordinated using the AND operator.
■ If the user at the initial stage can or wants to include synonyms, then they are coordinated using
the OR operator. Hence, the Boolean Topic representation connects the different factors using the
AND operator, where the factors can consist of single terms or several synonyms connected by
the OR operator.
■ One of the goals of the Smart Boolean approach is to make use of the structural knowledge contained in the text surrogates, where the different fields represent contexts of useful information. Further, the Smart Boolean approach exploits the fact that related concepts can share a common stem. For example, the concepts "computers" and "computing" share the common stem comput*.
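The Boolean Topic representation can be sketched as a list of factors joined by AND, where each factor is a list of synonyms joined by OR. The function names, the trailing "*" truncation syntax and the sample documents are assumptions made for illustration; the automatic translation from a natural language statement is not shown.

```python
def matches_term(pattern, word):
    """A trailing '*' marks truncation: 'comput*' matches any word with that prefix."""
    if pattern.endswith("*"):
        return word.startswith(pattern[:-1])
    return word == pattern


def matches_topic(topic, doc_words):
    """topic: list of factors (ANDed); each factor: list of synonyms (ORed)."""
    return all(
        any(matches_term(pattern, word) for pattern in factor for word in doc_words)
        for factor in topic
    )


# Two factors: a truncated stem, and a pair of synonyms.
topic = [["comput*"], ["retrieval", "search"]]

docs = {
    "d1": "computers for document retrieval".split(),
    "d2": "computing search results quickly".split(),
    "d3": "manual document retrieval".split(),
}

matched = sorted(d for d, words in docs.items() if matches_topic(topic, words))
```

Here d1 and d2 match because "computers" and "computing" both satisfy the truncated factor and each contains one of the synonyms; d3 fails the first factor.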
■ Extended Boolean Models
■ Several methods have been developed to extend the Boolean model to address the following
issues:
■ 1) The Boolean operators are too strict and ways need to be found to soften them.
■ 2) The standard Boolean approach has no provision for ranking. The Smart Boolean approach and the methods described in this section provide users with relevance ranking.
■ 3) The Boolean model does not support the assignment of weights to the query or document terms.
■ The P-norm method developed by Fox (1983) allows query and document terms to have
weights, which have been computed by using term frequency statistics with the proper
normalization procedures. These normalized weights can be used to rank the documents in the
order of decreasing distance from the point (0, 0, ... , 0) for an OR query, and in order of
increasing distance from the point (1, 1, ... , 1) for an AND query.
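The distance-based ranking can be sketched as follows. This is a simplified version that assumes document term weights already normalized to the range 0 to 1 and equal query term weights; the function names and the sample weights are hypothetical. The OR score grows with the distance from (0, ..., 0), and the AND score is one minus the normalized distance from (1, ..., 1).

```python
def pnorm_or(weights, p=2.0):
    """OR score: normalized p-norm distance from the all-zeros point."""
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1.0 / p)


def pnorm_and(weights, p=2.0):
    """AND score: one minus the normalized p-norm distance from the all-ones point."""
    n = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / n) ** (1.0 / p)


# Hypothetical weights of two query terms in three documents.
doc_weights = {
    "d1": [1.0, 0.0],  # one term strongly present, the other absent
    "d2": [0.5, 0.5],  # both terms moderately present
    "d3": [1.0, 1.0],  # both terms strongly present
}

or_scores = {d: pnorm_or(w) for d, w in doc_weights.items()}
and_scores = {d: pnorm_and(w) for d, w in doc_weights.items()}
```

With p = 2, the OR query prefers d1 over d2 (one strong term is enough), while the AND query prefers d2 over d1 (both terms must contribute). Unlike strict Boolean AND, d1 still receives a non-zero AND score, which is the softening effect the extended models aim for.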
■ Advantages of Boolean Model:
– Clear Formulation
– Easy to implement

■ Disadvantages:
– Exact matching may retrieve too few or too many documents
– Difficult to rank output
– Difficult to control the number of documents retrieved, because all matched documents are returned
2. Statistical Model
■ The vector space and probabilistic models are the two major examples of the statistical retrieval
approach.
■ Both models use statistical information in the form of term frequencies to determine the
relevance of documents with respect to a query.
■ Although they differ in the way they use the term frequencies, both produce as their output a
list of documents ranked by their estimated relevance.
■ The statistical retrieval models address some of the problems of Boolean retrieval methods, but
they have disadvantages of their own.
■ The following sections summarize the key features of the vector space and probabilistic approaches.
2.1 Vector Space Model
■ The vector space model represents the documents and queries as vectors in a multidimensional
space, whose dimensions are the terms used to build an index to represent the documents.
■ The creation of an index involves lexical scanning to identify the significant terms, morphological analysis to reduce different word forms to common "stems", and counting the occurrences of those stems.
■ The vector space model can assign a high ranking score to a document that contains only a few of the query terms if these terms occur infrequently in the collection but frequently in the document.
■ The vector space model makes the following assumptions:
■ 1) The more similar a document vector is to a query vector, the more likely it is that the
document is relevant to that query.
■ 2) The words used to define the dimensions of the space are orthogonal or independent. While
it is a reasonable first approximation, the assumption that words are pairwise independent is not
realistic.
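A minimal sketch of vector space ranking follows. The slides do not name a weighting scheme or a similarity measure, so tf-idf weights and cosine similarity are assumed here as one common choice; the three documents are made up, and stemming is omitted.

```python
import math
from collections import Counter

# Hypothetical collection, already tokenized.
docs = {
    "d1": "boolean retrieval uses boolean operators".split(),
    "d2": "vector space retrieval ranks documents".split(),
    "d3": "probabilistic retrieval ranks documents".split(),
}

N = len(docs)
# Document frequency: in how many documents each term occurs.
df = Counter(term for words in docs.values() for term in set(words))


def tfidf(words):
    """Weight = term frequency * log(N / document frequency); unknown terms are dropped."""
    tf = Counter(words)
    return {t: tf[t] * math.log(N / df[t]) for t in tf if t in df}


def cosine(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


doc_vecs = {d: tfidf(words) for d, words in docs.items()}
query_vec = tfidf("boolean retrieval".split())
scores = {d: cosine(query_vec, vec) for d, vec in doc_vecs.items()}
ranking = sorted(docs, key=lambda d: scores[d], reverse=True)
```

In this toy collection "retrieval" occurs in every document, so its weight is zero and it does not separate the documents. "boolean" is rare in the collection and frequent in d1, so d1 ranks first, while d2 and d3 score zero even though they contain one of the two query terms.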
■ Advantages of Vector Space Model:
– Its term weighting scheme can improve retrieval performance
– Allows partial matching
– Retrieved documents are scored according to their degree of similarity to the query

■ Disadvantages:
– Terms are assumed to be mutually independent, which in some cases might hurt performance
2.2 Probabilistic Model/Inference Network
■ The probabilistic retrieval model is based on the Probability Ranking Principle, which states
that an information retrieval system is supposed to rank the documents based on their
probability of relevance to the query, given all the evidence available [Belkin and Croft 1992].
■ The principle takes into account that there is uncertainty in the representation of the information
need and the documents.
■ There can be a variety of sources of evidence that are used by the probabilistic retrieval
methods, and the most common one is the statistical distribution of the terms in both the
relevant and non-relevant documents.
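One well-known way to turn the distribution of a term over relevant and non-relevant documents into a weight is the Robertson/Sparck Jones relevance weight. The slides do not name a specific formula, so this is an illustrative choice rather than the method they describe; the counts below are hypothetical.

```python
import math


def rsj_weight(r, n, R, N):
    """Robertson/Sparck Jones relevance weight with 0.5 smoothing.

    r: relevant documents containing the term
    n: all documents containing the term
    R: relevant documents in total
    N: documents in the collection
    """
    odds_in_relevant = (r + 0.5) / (R - r + 0.5)
    odds_in_non_relevant = (n - r + 0.5) / (N - n - R + r + 0.5)
    return math.log(odds_in_relevant / odds_in_non_relevant)


# Hypothetical counts: 100 documents, 10 of them judged relevant.
weight_discriminating = rsj_weight(r=8, n=12, R=10, N=100)
weight_neutral = rsj_weight(r=5, n=50, R=10, N=100)


def score(doc_terms, term_weights):
    """Rank score of a document: sum of the weights of the query terms it contains."""
    return sum(w for t, w in term_weights.items() if t in doc_terms)
```

A term concentrated in the relevant documents receives a large positive weight, while a term that is spread evenly over relevant and non-relevant documents receives a weight of zero and so contributes nothing to the ranking.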
■ The statistical approaches have the following strengths:
■ 1) They provide users with a relevance ranking of the retrieved documents. Hence, they enable users to control the
output by setting a relevance threshold or by specifying a certain number of documents to display.
■ 2) Queries can be easier to formulate because users do not have to learn a query language and can use natural language.
■ 3) The uncertainty inherent in the choice of query concepts can be represented.

■ However, the statistical approaches have the following shortcomings:


■ 1) They have a limited expressive power. For example, the NOT operation cannot be represented because only positive weights are used. It can be proven that only a subset of the possible Boolean queries can be generated by the statistical approaches that use weighted linear sums to rank the documents.
■ 2) The statistical approach lacks the structure to express important linguistic features such as phrases. Proximity
constraints are also difficult to express, a feature that is of great use for experienced searchers.
■ 3) The computation of the relevance scores can be computationally expensive.
■ 4) A ranked linear list provides users with a limited view of the information space and it does not directly suggest how
to modify a query if the need arises.
■ 5) The queries have to contain a large number of words to improve the retrieval performance. As is the case for the
Boolean approach, users are faced with the problem of having to choose the appropriate words that are also used in the
relevant documents.
■ If users provide the retrieval system with relevance feedback, then this information is used by the statistical approaches
to recompute the weights as follows: the weights of the query terms in the relevant documents are increased, whereas
the weights of the query terms that do not appear in the relevant documents are decreased.
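The reweighting rule can be sketched directly. The multiplicative step and the function name are assumptions made for illustration; established schemes estimate the new weights from term statistics rather than from a fixed step.

```python
def reweight(query_weights, relevant_docs, step=0.5):
    """Raise the weight of query terms found in relevant documents, lower the rest.

    query_weights: dict mapping query term -> current weight
    relevant_docs: list of term sets for the documents the user marked relevant
    step: hypothetical adjustment factor between 0 and 1
    """
    updated = {}
    for term, weight in query_weights.items():
        in_relevant = any(term in doc for doc in relevant_docs)
        factor = 1.0 + step if in_relevant else 1.0 - step
        updated[term] = weight * factor
    return updated


query = {"boolean": 1.0, "ranking": 1.0}
feedback = [{"boolean", "retrieval"}, {"boolean", "model"}]

new_query = reweight(query, feedback)
```

After one round of feedback "boolean", which appears in the relevant documents, carries three times the weight of "ranking", which does not.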
2.3 Latent Semantic Indexing
■ Several statistical and AI techniques have been used in association with domain semantics to extend the vector space model
to help overcome some of the retrieval problems described above, such as the "dependence problem" or the "vocabulary
problem".
■ One such method is Latent Semantic Indexing (LSI). In LSI the associations among terms and documents are calculated
and exploited in the retrieval process. The assumption is that there is some "latent/hidden/dormant" structure in the pattern
of word usage across documents and that statistical techniques can be used to estimate this latent structure. An advantage of
this approach is that queries can retrieve documents even if they have no words in common.
■ The LSI technique captures deeper associative structure than simple term-to-term correlations and is completely automatic.
■ The only difference between LSI and vector space methods is that LSI represents terms and documents in a reduced
dimensional space of the derived indexing dimensions. As with the vector space method, differential term weighting and
relevance feedback can improve LSI performance substantially.
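The reduced-dimensional representation can be sketched on a toy collection. LSI is usually computed with a truncated singular value decomposition from a linear algebra library; to stay dependency-free, this sketch instead finds the top singular vectors by power iteration, which is adequate only for a tiny matrix. The documents, the choice of k = 2 and all function names are assumptions.

```python
import math

terms = ["car", "automobile", "engine", "flower", "petal"]
docs = [["car", "engine"], ["automobile", "engine"], ["flower", "petal"]]

# Term-by-document count matrix A (rows: terms, columns: documents).
A = [[float(doc.count(t)) for doc in docs] for t in terms]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def column(M, j):
    return [row[j] for row in M]


def top_right_singular_vectors(A, k, iters=200):
    """Power iteration with deflation on G = A^T A; returns [(sigma, v), ...]."""
    n = len(A[0])
    G = [[dot(column(A, i), column(A, j)) for j in range(n)] for i in range(n)]
    result = []
    for c in range(k):
        v = [1.0 + 0.1 * (i + c) for i in range(n)]  # deterministic start vector
        for _ in range(iters):
            w = [dot(row, v) for row in G]
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        eigenvalue = dot(v, [dot(row, v) for row in G])
        result.append((math.sqrt(eigenvalue), v))
        # Deflate so the next pass converges to the next-largest component.
        G = [[G[i][j] - eigenvalue * v[i] * v[j] for j in range(n)] for i in range(n)]
    return result


def cosine(u, v):
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu and nv else 0.0


k = 2
components = top_right_singular_vectors(A, k)

# Document j in the reduced space: its entries in the top-k right singular vectors.
doc_coords = [[v[j] for _, v in components] for j in range(len(docs))]


def fold_in(q):
    """Project a query vector: q_hat = Sigma^-1 U^T q, with U = A V Sigma^-1."""
    coords = []
    for sigma, v in components:
        u = [dot(row, v) / sigma for row in A]
        coords.append(dot(u, q) / sigma)
    return coords


q = [float(["automobile"].count(t)) for t in terms]
raw_scores = [cosine(q, column(A, j)) for j in range(len(docs))]
lsi_scores = [cosine(fold_in(q), d) for d in doc_coords]
```

The query "automobile" shares no word with the first document ("car engine"), so plain keyword matching gives it a score of zero. In the two-dimensional latent space both vehicle documents collapse onto the same direction because they share "engine", so the first document is retrieved with a cosine close to 1, while the flower document stays at about zero.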
■ Foltz and Dumais (1992) compared four retrieval methods that are based on the vector-space model.
■ The four methods were the result of crossing two factors, the first factor being whether the retrieval method used Latent
Semantic Indexing or keyword matching, and the second factor being whether the profile was based on words or phrases
provided by the user (Word profile), or documents that the user had previously rated as relevant (Document profile).
■ The LSI match-document profile method proved to be the most successful of the four methods.
■ This method combines the advantages of both LSI and the document profile. The document profile provides a simple, but
effective, representation of the user's interests. Indicating just a few documents that are of interest is as effective as
generating a long list of words and phrases that describe one's interest. Document profiles have an added advantage over
word profiles: users can just indicate documents they find relevant without having to generate a description of their interests.
3. Linguistic and Knowledge-based Approaches
■ In the simplest form of automatic text retrieval, users enter a string of keywords that are used to search the
inverted indexes of the document keywords.
■ This approach retrieves documents based solely on the presence or absence of exact single word strings as
specified by the logical representation of the query. Clearly this approach will miss many relevant documents
because it does not capture the complete or deep meaning of the user's query.
■ The Smart Boolean approach and the statistical retrieval approaches, each in their specific way, try to address
this problem.
■ Linguistic and knowledge-based approaches have also been developed to address this problem by performing
a morphological, syntactic and semantic analysis to retrieve documents more effectively [Lancaster and
Warner 1993].
■ In a morphological analysis, roots and affixes are analyzed to determine the part of speech (noun, verb,
adjective etc.) of the words. Next complete phrases have to be parsed using some form of syntactic analysis.
Finally, the linguistic methods have to resolve word ambiguities and/or generate relevant synonyms or quasi-
synonyms based on the semantic relationships between words.
■ The development of a sophisticated linguistic retrieval system is difficult and it requires complex knowledge
bases of semantic information and retrieval heuristics. Hence these systems often require techniques that are
commonly referred to as artificial intelligence or expert systems techniques.
