Information Retrieval Models

Uploaded by

mihlemaza03

INFORMATION RETRIEVAL MODELS
UNIT 3
The three main retrieval models
The following major models have been developed to retrieve information:
– the Boolean model,
– the Statistical model, which includes the vector space and the probabilistic retrieval models, and
– the Linguistic and Knowledge-based models.
1. Boolean Model

■ Documents represented as a set of terms

■ Form queries using standard Boolean logic set-theoretic operators
– AND, OR and NOT
■ Retrieval and relevance
– Binary concepts
– 1 (True) 0 (False)
■ Lacks sophisticated ranking algorithms
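The set-based view above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the toy documents, the `matching` helper and the query are all assumptions made up for the example.

```python
# Toy collection: each document is represented as a set of terms.
# The documents and the query below are illustrative assumptions.
docs = {
    "d1": {"information", "retrieval", "boolean"},
    "d2": {"information", "vector", "space"},
    "d3": {"boolean", "logic", "query"},
}

def matching(term):
    """Return the ids of all documents whose term set contains `term`."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}

all_ids = set(docs)

# Query: information AND (boolean OR vector) AND NOT logic
# AND -> set intersection, OR -> set union, NOT -> complement.
result = (
    matching("information")
    & (matching("boolean") | matching("vector"))
    & (all_ids - matching("logic"))
)
print(sorted(result))
```

Each document either matches or does not (1 or 0), so the result is an unordered set with no ranking among the matches.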
■ Standard Boolean
■ It has the following strengths:
■ 1) It is easy to implement and it is computationally efficient. Hence, it is the standard model for
the current large-scale, operational retrieval systems and many of the major on-line information
services use it.
■ 2) It enables users to express structural and conceptual constraints to describe important
linguistic features. Users find that synonym specifications (reflected by OR-clauses) and
phrases (represented by proximity relations) are useful in the formulation of queries. The
Boolean approach possesses a great expressive power and clarity.
■ 3) Boolean retrieval is very effective if a query requires an exhaustive and unambiguous
selection.
■ 4) The Boolean method offers a multitude of techniques to broaden or narrow a query.
■ 5) The Boolean approach can be especially effective in the later stages of the search process,
because of the clarity and exactness with which relationships between concepts can be
represented.
■ Narrowing and Broadening Techniques
■ A Boolean query can be described in terms of the following four operations: degree and type of
coordination, proximity constraints, field specifications and degree of stemming as expressed in
terms of word/string specifications.
■ If users want to (re)formulate a Boolean query then they need to make informed choices along
these four dimensions to create a query that is sufficiently broad or narrow depending on their
information needs.
■ Most narrowing techniques lower recall as well as raise precision, and most broadening
techniques lower precision as well as raise recall. Any query can be reformulated to achieve the
desired precision or recall characteristics, but generally it is difficult to achieve both.
■ Each of the four kinds of operations in the query formulation has particular operators, some of
which tend to have a narrowing or broadening effect. For each operator with a narrowing effect,
there is one or more inverse operators with a broadening effect. Hence, users require help to
gain an understanding of how changes along these four dimensions will affect the broadness or
narrowness of a query.
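The precision/recall trade-off described above can be made concrete with a small calculation. The sketch below assumes a hypothetical set of relevant documents and two hypothetical result sets, one from a narrow query and one from a broad query; the document ids are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ground truth and result sets.
relevant = {"d1", "d2", "d3", "d4"}
narrow = {"d1", "d2"}                             # e.g. concepts coordinated with AND
broad = {"d1", "d2", "d3", "d5", "d6", "d7"}      # e.g. concepts coordinated with OR

narrow_pr = precision_recall(narrow, relevant)
broad_pr = precision_recall(broad, relevant)
print(narrow_pr, broad_pr)
```

In this toy case the narrow query has higher precision and lower recall than the broad one, which is the pattern the narrowing and broadening techniques are meant to control.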
■ Smart Boolean
■ Smart Boolean tries to help users construct and modify a Boolean query as well as make better
choices along the four dimensions that characterize a Boolean query.
■ This method is a good example of the possible ways to make Boolean retrieval more user-friendly
and effective.
■ Users start by specifying a natural language statement that is automatically translated into a
Boolean Topic representation that consists of a list of factors or concepts, which are automatically
coordinated using the AND operator.
■ If the user at the initial stage can or wants to include synonyms, then they are coordinated using
the OR operator. Hence, the Boolean Topic representation connects the different factors using the
AND operator, where the factors can consist of single terms or several synonyms connected by
the OR operator.
■ One of the goals of the Smart Boolean approach is to make use of the structural knowledge
contained in the text surrogates, where the different fields represent contexts of useful
information. Further, the Smart Boolean approach exploits the fact that related concepts can
share a common stem. For example, the concepts "computers" and "computing" have the
common stem comput*.
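A rough sketch of the Boolean Topic idea follows. It is not the Smart Boolean system itself: the stop-word list, the synonym table and the function names are hypothetical stand-ins, and the truncation match simply treats a stem such as comput* as a prefix.

```python
# Hypothetical resources; a real system would use far richer ones.
STOP_WORDS = {"the", "of", "in", "for", "a", "an", "and", "on"}
SYNONYMS = {"retrieval": ["search"], "computers": ["machines"]}

def to_boolean_topic(statement):
    """Split a natural-language statement into factors.

    Each factor is a list of synonyms (to be joined with OR);
    the factors themselves are later joined with AND.
    """
    factors = []
    for word in statement.lower().split():
        if word in STOP_WORDS:
            continue
        factors.append([word] + SYNONYMS.get(word, []))
    return factors

def render(factors):
    """Render the factors as a Boolean query string."""
    groups = []
    for factor in factors:
        if len(factor) > 1:
            groups.append("(" + " OR ".join(factor) + ")")
        else:
            groups.append(factor[0])
    return " AND ".join(groups)

def matches_stem(stem, word):
    """Truncation match: 'comput*' matches any word starting with 'comput'."""
    return word.startswith(stem.rstrip("*"))

topic = to_boolean_topic("retrieval of information in computers")
query = render(topic)
print(query)
```

The rendered query keeps one AND-ed factor per concept, and `matches_stem("comput*", "computing")` shows how a shared stem lets one factor cover several word forms.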
■ Extended Boolean Models
■ Several methods have been developed to extend the Boolean model to address the following
issues:
■ 1) The Boolean operators are too strict and ways need to be found to soften them.
■ 2) The standard Boolean approach has no provision for ranking. The Smart Boolean approach
and the methods described in this section provide users with relevance ranking.
■ 3) The Boolean model does not support the assignment of weights to the query or document
terms.
■ The P-norm method developed by Fox (1983) allows query and document terms to have
weights, which have been computed by using term frequency statistics with the proper
normalization procedures. These normalized weights can be used to rank the documents in the
order of decreasing distance from the point (0, 0, ... , 0) for an OR query, and in order of
increasing distance from the point (1, 1, ... , 1) for an AND query.
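The distance-based ranking can be sketched as follows, using the commonly cited form of the P-norm similarity. The function names, the default p = 2 and the example weights are assumptions for illustration; document weights are taken to be already normalized to the range 0 to 1.

```python
def pnorm_or(doc_weights, query_weights, p=2.0):
    """OR query: normalized distance of the document from (0, 0, ..., 0)."""
    num = sum((a ** p) * (d ** p) for a, d in zip(query_weights, doc_weights))
    den = sum(a ** p for a in query_weights)
    return (num / den) ** (1.0 / p)

def pnorm_and(doc_weights, query_weights, p=2.0):
    """AND query: one minus the normalized distance from (1, 1, ..., 1)."""
    num = sum((a ** p) * ((1.0 - d) ** p) for a, d in zip(query_weights, doc_weights))
    den = sum(a ** p for a in query_weights)
    return 1.0 - (num / den) ** (1.0 / p)

query = [1.0, 1.0]   # two equally weighted query terms
doc_a = [1.0, 0.0]   # contains only the first term, strongly
doc_b = [0.5, 0.5]   # contains both terms, moderately

or_scores = (pnorm_or(doc_a, query), pnorm_or(doc_b, query))
and_scores = (pnorm_and(doc_a, query), pnorm_and(doc_b, query))
print(or_scores, and_scores)
```

With these weights the OR query prefers doc_a and the AND query prefers doc_b, so the operators are softened rather than strict: a document missing one term of an AND query still gets a nonzero score.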
■ Advantages of Boolean Model:
– Clear Formulation
– Easy to implement

■ Disadvantages:
– Exact matching may retrieve too few or too many documents
– Difficult to rank output
– Difficult to control the number of documents retrieved, since all matched documents will
be returned
2. Statistical Model
■ The vector space and probabilistic models are the two major examples of the statistical retrieval
approach.
■ Both models use statistical information in the form of term frequencies to determine the
relevance of documents with respect to a query.
■ Although they differ in the way they use the term frequencies, both produce as their output a
list of documents ranked by their estimated relevance.
■ The statistical retrieval models address some of the problems of Boolean retrieval methods, but
they have disadvantages of their own.
■ The following subsections summarize the key features of the vector space and probabilistic approaches.
2.1 Vector Space Model
■ The vector space model represents the documents and queries as vectors in a multidimensional
space, whose dimensions are the terms used to build an index to represent the documents.
■ The creation of an index involves lexical/verbal scanning to identify the significant terms,
where morphological/structural analysis reduces different word forms to common "stems", and
the occurrence of those stems is computed.
■ The vector space model can assign a high ranking score to a document that contains only a few
of the query terms if these terms occur infrequently in the collection but frequently in the
document.
■ The vector space model makes the following assumptions:
■ 1) The more similar a document vector is to a query vector, the more likely it is that the
document is relevant to that query.
■ 2) The words used to define the dimensions of the space are orthogonal or independent. While
it is a reasonable first approximation, the assumption that words are pairwise independent is not
realistic.
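A minimal sketch of vector space ranking is given here, assuming tf-idf term weights and cosine similarity as the similarity measure; the source does not fix a particular weighting scheme, and the toy documents and helper names are invented for the example.

```python
import math
from collections import Counter

# Toy collection; each document is a list of (already stemmed) terms.
docs = {
    "d1": "boolean retrieval uses boolean operators".split(),
    "d2": "vector space retrieval ranks documents".split(),
    "d3": "probabilistic retrieval ranks documents by relevance".split(),
}

def idf(term):
    """Inverse document frequency: rare terms get a higher weight."""
    df = sum(1 for terms in docs.values() if term in terms)
    return math.log(len(docs) / df) if df else 0.0

def tfidf_vector(terms):
    """Sparse vector mapping each term to tf * idf."""
    counts = Counter(terms)
    return {t: tf * idf(t) for t, tf in counts.items()}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

query = tfidf_vector("boolean retrieval".split())
scores = {d: cosine(query, tfidf_vector(terms)) for d, terms in docs.items()}
ranking = sorted(docs, key=lambda d: scores[d], reverse=True)
print(ranking, scores)
```

Here "retrieval" occurs in every document, so its idf is zero and it contributes nothing, while "boolean" is rare in the collection and frequent in d1, which therefore ranks first.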
Advantages of a vector model
■ Its term weighting scheme can improve retrieval performance
■ Allows partial matching
■ Retrieved documents are scored according to their degree of similarity

■ DISADVANTAGE
■ Terms are assumed to be mutually independent, which in some cases might hurt performance.
2.2 Probabilistic Model/Inference Network
■ The probabilistic retrieval model is based on the Probability Ranking Principle, which states
that an information retrieval system is supposed to rank the documents based on their
probability of relevance to the query, given all the evidence available [Belkin and Croft 1992].
■ The principle takes into account that there is uncertainty in the representation of the information
need and the documents.
■ There can be a variety of sources of evidence that are used by the probabilistic retrieval
methods, and the most common one is the statistical distribution of the terms in both the
relevant and non-relevant documents.
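One common way to turn the term distribution in relevant and non-relevant documents into a score is the Robertson/Sparck Jones relevance weight. The source does not name a specific formula, so the sketch below is an assumed choice, and the counts are invented for illustration.

```python
import math

def rsj_weight(r, R, n, N):
    """Robertson/Sparck Jones weight with 0.5 smoothing.

    r: relevant documents containing the term
    R: relevant documents in total
    n: documents containing the term
    N: documents in the collection
    """
    odds_relevant = (r + 0.5) / (R - r + 0.5)
    odds_non_relevant = (n - r + 0.5) / (N - n - R + r + 0.5)
    return math.log(odds_relevant / odds_non_relevant)

def score(doc_terms, query_terms, stats, R, N):
    """Sum the weights of the query terms that occur in the document."""
    return sum(
        rsj_weight(stats[t][0], R, stats[t][1], N)
        for t in query_terms
        if t in doc_terms
    )

# Hypothetical statistics: term -> (r, n), with R = 10 and N = 100.
stats = {"retrieval": (8, 20), "system": (5, 50)}
R, N = 10, 100
query_terms = ["retrieval", "system"]

score_d1 = score({"retrieval", "system"}, query_terms, stats, R, N)
score_d2 = score({"system"}, query_terms, stats, R, N)
print(score_d1, score_d2)
```

A term that is concentrated in the relevant documents ("retrieval") gets a positive weight, while a term spread evenly across relevant and non-relevant documents ("system") gets a weight of zero and does not affect the ranking.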
■ The statistical approaches have the following strengths:
■ 1) They provide users with a relevance ranking of the retrieved documents. Hence, they enable users to control the
output by setting a relevance threshold or by specifying a certain number of documents to display.
■ 2) Queries can be easier to formulate because users do not have to learn a query language and can use natural language.
■ 3) The uncertainty inherent in the choice of query concepts can be represented.

■ However, the statistical approaches have the following shortcomings:

■ 1) They have a limited expressive power. For example, the NOT operation cannot be represented because only positive
weights are used. It can be proven that only a subset of the possible Boolean queries can be generated by the statistical
approaches that use weighted linear sums to rank the documents.
■ 2) The statistical approach lacks the structure to express important linguistic features such as phrases. Proximity
constraints are also difficult to express, a feature that is of great use for experienced searchers.
■ 3) The computation of the relevance scores can be computationally expensive.
■ 4) A ranked linear list provides users with a limited view of the information space and it does not directly suggest how
to modify a query if the need arises.
■ 5) The queries have to contain a large number of words to improve the retrieval performance. As is the case for the
Boolean approach, users are faced with the problem of having to choose the appropriate words that are also used in the
relevant documents.
■ If users provide the retrieval system with relevance feedback, then this information is used by the statistical approaches
to recompute the weights as follows: the weights of the query terms in the relevant documents are increased, whereas
the weights of the query terms that do not appear in the relevant documents are decreased.
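The reweighting rule just described can be sketched very simply. The fixed step size, the floor at zero and the function name are assumptions made for this illustration; real systems use more refined update formulas.

```python
def update_weights(query_weights, relevant_docs, step=0.5):
    """Raise the weight of query terms found in relevant documents
    and lower the weight of the others (never below zero)."""
    seen = set().union(*relevant_docs)  # all terms in the relevant documents
    updated = {}
    for term, weight in query_weights.items():
        if term in seen:
            updated[term] = weight + step
        else:
            updated[term] = max(0.0, weight - step)
    return updated

# Hypothetical query and user feedback.
query_weights = {"boolean": 1.0, "ranking": 1.0}
relevant_docs = [{"ranking", "vector"}, {"ranking", "probabilistic"}]

new_weights = update_weights(query_weights, relevant_docs)
print(new_weights)
```

After one round of feedback the query leans towards "ranking", the term the user's relevant documents actually contain.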
2.3 Latent Semantic Indexing
■ Several statistical and AI techniques have been used in association with domain semantics to extend the vector space model
to help overcome some of the retrieval problems described above, such as the "dependence problem" or the "vocabulary
problem".
■ One such method is Latent Semantic Indexing (LSI). In LSI the associations among terms and documents are calculated
and exploited in the retrieval process. The assumption is that there is some "latent/hidden/dormant" structure in the pattern
of word usage across documents and that statistical techniques can be used to estimate this latent structure. An advantage of
this approach is that queries can retrieve documents even if they have no words in common.
■ The LSI technique captures deeper associative structure than simple term-to-term correlations and is completely automatic.
■ The only difference between LSI and vector space methods is that LSI represents terms and documents in a reduced
dimensional space of the derived indexing dimensions. As with the vector space method, differential term weighting and
relevance feedback can improve LSI performance substantially.
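The reduced dimensional space is usually obtained with a truncated singular value decomposition of the term-document matrix. The notation below is an assumption, since the source gives no formulas: X is the term-document matrix, k is the number of retained dimensions, x_j is the column of X for document j, and q is the query's term vector. Scaling conventions for the folded-in vectors vary between presentations.

```latex
X \approx X_k = U_k \Sigma_k V_k^{\top}, \qquad
\hat{q} = \Sigma_k^{-1} U_k^{\top} q, \qquad
\hat{d}_j = \Sigma_k^{-1} U_k^{\top} x_j, \qquad
\mathrm{sim}(q, d_j) = \frac{\hat{q} \cdot \hat{d}_j}{\lVert \hat{q} \rVert \, \lVert \hat{d}_j \rVert}
```

Under this convention the vector for document j is the j-th row of V_k. Because the query and the documents are compared in the k derived dimensions rather than in term space, a query can score well against a document with which it shares no words.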
■ Foltz and Dumais (1992) compared four retrieval methods that are based on the vector-space model.
■ The four methods were the result of crossing two factors, the first factor being whether the retrieval method used Latent
Semantic Indexing or keyword matching, and the second factor being whether the profile was based on words or phrases
provided by the user (Word profile), or documents that the user had previously rated as relevant (Document profile).
■ The LSI match-document profile method proved to be the most successful of the four methods.
■ This method combines the advantages of both LSI and the document profile. The document profile provides a simple, but
effective, representation of the user's interests. Indicating just a few documents that are of interest is as effective as
generating a long list of words and phrases that describe one's interest. Document profiles have an added advantage over
word profiles: users can just indicate documents they find relevant without having to generate a description of their interests.
3. Linguistic and Knowledge-based Approaches
■ In the simplest form of automatic text retrieval, users enter a string of keywords that are used to search the
inverted indexes of the document keywords.
■ This approach retrieves documents based solely on the presence or absence of exact single word strings as
specified by the logical representation of the query. Clearly this approach will miss many relevant documents
because it does not capture the complete or deep meaning of the user's query.
■ The Smart Boolean approach and the statistical retrieval approaches, each in their specific way, try to address
this problem.
■ Linguistic and knowledge-based approaches have also been developed to address this problem by performing
a morphological, syntactic and semantic analysis to retrieve documents more effectively [Lancaster and
Warner 1993].
■ In a morphological analysis, roots and affixes are analyzed to determine the part of speech (noun, verb,
adjective etc.) of the words. Next, complete phrases have to be parsed using some form of syntactic analysis.
Finally, the linguistic methods have to resolve word ambiguities and/or generate relevant synonyms or quasi-
synonyms based on the semantic relationships between words.
■ The development of a sophisticated linguistic retrieval system is difficult and it requires complex knowledge
bases of semantic information and retrieval heuristics. Hence these systems often require techniques that are
commonly referred to as artificial intelligence or expert systems techniques.
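The contrast between plain keyword search over an inverted index and a knowledge-based approach can be illustrated with a toy example. The documents and the one-entry synonym table are hypothetical; the table merely stands in for the semantic knowledge base that a real linguistic system would need.

```python
from collections import defaultdict

docs = {
    "d1": "car engines need regular maintenance",
    "d2": "automobile repair and maintenance guide",
}

# Inverted index: term -> ids of the documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def keyword_search(terms):
    """Exact single-word matching: a document must contain every term."""
    results = set(docs)
    for term in terms:
        results &= index.get(term, set())
    return results

# Hypothetical synonym table standing in for a semantic knowledge base.
SYNONYMS = {"car": {"automobile"}}

def expanded_search(terms):
    """Like keyword_search, but each term may be satisfied by a synonym."""
    results = set(docs)
    for term in terms:
        hits = set()
        for variant in {term} | SYNONYMS.get(term, set()):
            hits |= index.get(variant, set())
        results &= hits
    return results

exact = keyword_search(["car", "maintenance"])
expanded = expanded_search(["car", "maintenance"])
print(sorted(exact), sorted(expanded))
```

Exact matching misses d2 because it says "automobile" rather than "car"; the synonym expansion recovers it, at the cost of having to build and maintain the synonym knowledge.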
