Chapter 5: Retrieval Effectiveness
• Evaluation of IR systems
• Relevance judgement
• Performance measures
• Recall, Precision
• Single-valued measures
• User-centred measures
Why System Evaluation?
• Any system needs validation and verification
–Check whether the system is right or not
–Check whether it is the right system or not
• It provides the ability to measure the difference between
IR systems
–How well do our search engines work?
–Is system A better than B? Under what conditions?
• Evaluation drives what to study
–Identify techniques that work well and those that do not
–There are many retrieval models/algorithms. Which one is the
best?
–What is the best component for:
• Index term selection (tokenization, stop-word removal,
stemming, normalization…)
• Term weighting (TF, IDF, TF*IDF, P(R|D)…)
• Similarity measures (cosine, Euclidean, string editing…)
Evaluation Criteria
What are the main evaluation measures to check the
performance of an IR system?
• Efficiency
– Time and space complexity
❑ Speed: retrieval time, indexing time, query processing time
❑ The space taken by the corpus vs. the index file
• Index size: determine the index/corpus size ratio
• Is there a need for compression?
• Effectiveness
–How well is the system able to retrieve relevant documents
from the collection (measured by precision & recall)?
–Is system X better than other systems?
–User satisfaction: how “good” are the documents returned in
response to a user query?
–Relevance of results to the user's information need
Types of Evaluation Strategies
•User-centered evaluation
– Given several users, and at least two retrieval
systems
• Have each user try the same task on both systems
• Measure which system works the “best” for the user's
information need
• How do we measure user satisfaction?
•System-centered evaluation
– Given documents, queries, and relevance
judgments
• Try several variations of the system
• Measure which system returns the “best” hit list
The Notion of Relevance Judgment
• Relevance is a measure of the correspondence between a
document and a query.
–Construct a document-query relevance matrix as judged by:
(i) the user who posed the retrieval problem;
(ii) an external judge;
(iii) an information specialist
–Is the relevance judgment made by the user and by an
external judge the same?
                  Relevant      Irrelevant
Retrieved             A              B   (“type one error”)
Not retrieved         C              D
                 (“type two error”)

Recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|

Precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|

[Figure: the precision/recall trade-off. At low recall the system returns
relevant documents but misses many useful ones; as recall approaches 1 it
returns most of the relevant documents but also includes lots of junk.]
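As a minimal sketch of the two formulas above (the function names and example document IDs here are made up for illustration), precision and recall can be computed in Python directly from the sets of relevant and retrieved documents:

# Minimal sketch: precision and recall computed from the relevant and
# retrieved document sets. The IDs below are made-up examples.

def precision(relevant, retrieved):
    """|Relevant ∩ Retrieved| / |Retrieved|"""
    return len(relevant & retrieved) / len(retrieved) if retrieved else 0.0

def recall(relevant, retrieved):
    """|Relevant ∩ Retrieved| / |Relevant|"""
    return len(relevant & retrieved) / len(relevant) if relevant else 0.0

relevant = {"d1", "d3", "d5", "d8"}      # documents judged relevant
retrieved = {"d1", "d2", "d3", "d4"}     # documents returned by the system
print(precision(relevant, retrieved))    # 2/4 = 0.5
print(recall(relevant, retrieved))       # 2/4 = 0.5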
Need for Interpolation
•Two issues:
–How do you compare performance across
queries?
–Does the sawtooth shape give an intuitive picture of what's going on?
[Figure: raw recall/precision curve for a single query; precision (y-axis,
0-1) plotted against recall (x-axis, 0-1) shows a sawtooth shape.]
Solution: Interpolation!
Interpolate a precision value for each standard recall level
Interpolation
• It is a general form of precision/recall calculation
• Precision changes with recall (it is not a single fixed point)
– It is an empirical fact that on average as recall increases,
precision decreases
• Interpolate precision at 11 standard recall levels:
– r_j ∈ {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0},
where j = 0, 1, …, 10
• The interpolated precision at the j-th standard recall
level is the maximum known precision at any recall
level between the jth and (j + 1)th level:
[Figure: interpolated recall/precision curve for the example query
(precision on the y-axis, recall on the x-axis).]
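A short Python sketch of this interpolation step (using the common convention that the interpolated precision at level r_j is the maximum observed precision at any recall ≥ r_j; the sample points are taken from Exercise 2 later in the chapter):

# Sketch: 11-point interpolated precision for one query.
# Convention assumed here: interpolated precision at standard level r_j is
# the maximum observed precision at any recall level >= r_j.

def interpolate_11pt(observed):
    """observed: list of (recall, precision) points for one query."""
    levels = [j / 10 for j in range(11)]
    return [max([p for r, p in observed if r >= r_j], default=0.0)
            for r_j in levels]

# (recall, precision) points from Exercise 2
points = [(0.167, 1.0), (0.333, 0.667), (0.5, 0.6),
          (0.667, 0.5), (0.833, 0.556), (1.0, 0.429)]
print(interpolate_11pt(points))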
Exercise
• Let the total number of relevant documents = 6; compute recall and
precision at each cut-off point n:
n doc # relevant Recall Precision
1 588 x 0.167 1
2 589 x 0.333 1
3 576
4 590 x 0.5 0.75
5 986
6 592 x 0.667 0.667
7 984
8 988
9 578
10 985
11 103
12 591
13 772 x 0.833 0.38
14 990
One relevant document is missing, so the system never reaches 100% recall.
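A sketch of how the table above is computed (the relevant set is read off the rows marked “x”; one relevant document never appears in the ranking):

# Sketch: recall/precision at each cut-off point of the ranked list above.

ranking = [588, 589, 576, 590, 986, 592, 984, 988, 578, 985, 103, 591, 772, 990]
relevant_retrieved = {588, 589, 590, 592, 772}   # rows marked "x"
total_relevant = 6                               # one relevant doc is never retrieved

hits = 0
for n, doc in enumerate(ranking, start=1):
    if doc in relevant_retrieved:
        hits += 1
        print(f"n={n:2d}  doc={doc}  R={hits / total_relevant:.3f}  P={hits / n:.3f}")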
Interpolating a Recall/Precision
Curve: Exercise
[Figure: blank recall/precision grid (precision 0-1.0 on the y-axis) for
plotting the interpolated curve from the exercise above.]
Computing Recall/Precision Points:
Exercise 2
Let total # of relevant docs = 6. Check each new recall point:

n   doc #  relevant
1    588      x      R = 1/6 = 0.167;  P = 1/1 = 1
2    576
3    589      x      R = 2/6 = 0.333;  P = 2/3 = 0.667
4    342
5    590      x      R = 3/6 = 0.5;    P = 3/5 = 0.6
6    717
7    984
8    772      x      R = 4/6 = 0.667;  P = 4/8 = 0.5
9    321      x      R = 5/6 = 0.833;  P = 5/9 = 0.556
10   498
11   113
12   628
13   772
14   592      x      R = 6/6 = 1.0;    P = 6/14 = 0.429
Interpolating a Recall/Precision Curve:
Exercise 2
[Figure: blank recall/precision grid (precision 0-1.0 on the y-axis) for
plotting the interpolated curve from Exercise 2.]
Interpolating across queries
• For each query, calculate precision at 11 standard
recall levels
• Compute average precision at each standard recall
level across all queries.
• Plot average precision/recall curves to evaluate
overall system performance on a document/query
corpus.
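A hedged sketch of this averaging step (interpolate_11pt is the same helper sketched earlier; the per-query point lists below are placeholders, not real data):

# Sketch: average the 11-point interpolated precision over all queries.

def interpolate_11pt(observed):
    levels = [j / 10 for j in range(11)]
    return [max([p for r, p in observed if r >= r_j], default=0.0)
            for r_j in levels]

def average_11pt(per_query_points):
    """per_query_points: one (recall, precision) list per query."""
    curves = [interpolate_11pt(points) for points in per_query_points]
    return [sum(curve[j] for curve in curves) / len(curves) for j in range(11)]

# Placeholder data: two queries with made-up recall/precision points
query1 = [(0.2, 1.0), (0.6, 0.5), (1.0, 0.4)]
query2 = [(0.5, 0.8), (1.0, 0.3)]
print(average_11pt([query1, query2]))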
F Measure
• Harmonic mean of recall and precision:
F = 2PR / (P + R) = 2 / (1/R + 1/P)
• Compared to the arithmetic mean, both precision and recall need to
be high for the harmonic mean to be high.
• What if no relevant documents exist?
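A minimal sketch of the harmonic-mean F measure (returning 0 when both precision and recall are 0, e.g. when no relevant document is retrieved):

# Sketch: F measure as the harmonic mean of precision and recall.

def f_measure(precision, recall):
    if precision + recall == 0:        # e.g. nothing relevant was retrieved
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f_measure(0.75, 0.5))   # 0.6
print(f_measure(0.0, 0.0))    # 0.0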
E Measure
• Associated with Van Rijsbergen
• Allows user to specify importance of recall and
precision
• It is a parameterized F measure: a variant that allows weighting
the emphasis on precision versus recall:
E = (1 + β²)PR / (β²P + R) = (1 + β²) / (β²/R + 1/P)
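A sketch of the parameterized measure as written on this slide (β = 1 reduces to the harmonic-mean F measure; larger β puts more weight on recall, smaller β on precision):

# Sketch: parameterized E/F-beta measure, E = (1 + b^2)PR / (b^2*P + R).

def e_measure(precision, recall, beta=1.0):
    denom = beta**2 * precision + recall
    return (1 + beta**2) * precision * recall / denom if denom else 0.0

print(e_measure(0.75, 0.5, beta=1.0))   # 0.6, same as F
print(e_measure(0.75, 0.5, beta=2.0))   # ≈ 0.536, recall-weighted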
Other measures
• Noise = retrieved irrelevant docs / retrieved docs
• Silence/Miss = non-retrieved relevant docs / relevant docs
– Noise = 1 – Precision; Silence = 1 – Recall
Miss = |{Relevant} ∩ {NotRetrieved}| / |{Relevant}|

Fallout = |{Retrieved} ∩ {NotRelevant}| / |{NotRelevant}|
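A sketch of these ratios in terms of the four contingency-table cells from earlier in the chapter (a = relevant & retrieved, b = irrelevant & retrieved, c = relevant & not retrieved, d = irrelevant & not retrieved; the counts below are made up):

# Sketch: noise, silence (miss) and fallout from the contingency counts.

def noise(a, b):       # retrieved irrelevant / retrieved  = 1 - precision
    return b / (a + b) if (a + b) else 0.0

def silence(a, c):     # non-retrieved relevant / relevant = 1 - recall
    return c / (a + c) if (a + c) else 0.0

def fallout(b, d):     # retrieved irrelevant / all irrelevant documents
    return b / (b + d) if (b + d) else 0.0

print(noise(a=20, b=10))        # 10/30 ≈ 0.333
print(silence(a=20, c=5))       # 5/25  = 0.2
print(fallout(b=10, d=965))     # 10/975 ≈ 0.010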
Programming Assignment (Due date: ____)
• Design an IR system for one of the local languages, following the
principles discussed in class.
– Form a group of up to three members
1. Construct an inverted file (vocabulary file & posting file)
– Taking a corpus of N documents, generate content-bearing index terms and
organize them in an inverted file indexing structure; include frequencies
(TF, DF, CF) & the position/location of each term in each document (a
minimal starting-point sketch is given after this assignment).
2. Develop a vector space retrieval model
– Implement a vector space model that retrieves relevant documents in
ranked order
3. Test your system using five queries (with three to six words)
and report its performance
• Required: write a publishable report that has an abstract (½ page),
introduction, problem statement & objective (1 page), literature
review (2 pages), methods used & architecture of your system (1
page), experimentation (test results & findings) (2-3 pages),
concluding remarks with one basic recommendation (1 page), and
references (1 page).
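As a starting point only, and not a prescribed solution, the sketch below shows one possible shape for steps 1 and 2: an inverted file that keeps term frequencies and positions, and a TF*IDF vector space ranker using cosine similarity. The tokenizer, stop-word list, corpus and query are placeholders and would have to be replaced with ones suited to the chosen local language.

# Minimal starting-point sketch (not a prescribed solution): an inverted
# file with term positions, plus TF*IDF cosine ranking.

import math
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "of", "and"}            # placeholder stop-word list

def tokenize(text):
    # Placeholder analyzer: lowercase word tokens with stop words removed.
    return [t for t in re.findall(r"\w+", text.lower()) if t not in STOPWORDS]

def build_inverted_index(corpus):
    """corpus: {doc_id: text}. Returns {term: {doc_id: [positions]}}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in corpus.items():
        for pos, term in enumerate(tokenize(text)):
            index[term][doc_id].append(pos)  # TF = len(positions), DF = len(postings)
    return index

def rank(query, index, n_docs):
    """Rank documents against the query by TF*IDF cosine similarity."""
    doc_vectors = defaultdict(dict)              # doc_id -> {term: tf*idf weight}
    for term, postings in index.items():
        idf = math.log(n_docs / len(postings))
        for doc_id, positions in postings.items():
            doc_vectors[doc_id][term] = len(positions) * idf
    q_terms = tokenize(query)
    q_vec = {t: q_terms.count(t) * math.log(n_docs / len(index[t]))
             for t in set(q_terms) if t in index}
    q_norm = math.sqrt(sum(w * w for w in q_vec.values()))
    scores = {}
    for doc_id, vec in doc_vectors.items():
        dot = sum(w * q_vec.get(t, 0.0) for t, w in vec.items())
        d_norm = math.sqrt(sum(w * w for w in vec.values()))
        if dot and d_norm and q_norm:
            scores[doc_id] = dot / (d_norm * q_norm)
    return sorted(scores.items(), key=lambda s: s[1], reverse=True)

corpus = {1: "retrieval of information",
          2: "information systems and evaluation",
          3: "evaluation of retrieval systems"}
index = build_inverted_index(corpus)
print(rank("retrieval evaluation", index, n_docs=len(corpus)))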