
Retrieval Effectiveness

• Evaluation of IR systems
• Relevance judgement
• Performance measures
• Recall, Precision
• Single-valued measures
• User centred measures
Why System Evaluation?
• Any system needs validation and verification
–Check whether the system is right or not
–Check whether it is the right system or not
• It provides the ability to measure the difference between
IR systems
–How well do our search engines work?
–Is system A better than B? Under what conditions?
• Evaluation drives what to study
–Identify techniques that work well and those that do not
–There are many retrieval models/algorithms; which one is the best?
–What is the best component for:
• Index term selection (tokenization, stop-word removal,
stemming, normalization…)
• Term weighting (TF, IDF, TF*IDF, P(R|D)…)
• Similarity measures (cosine, Euclidean, string editing…)
Evaluation Criteria
What are the main evaluation measures to check the
performance of an IR system?
• Efficiency
– Time and space complexity
❑ Speed in terms of retrieval time, indexing time, query
processing time
❑ The space taken by corpus vs. index file
• Index size: determine Index/corpus size ratio
• Is there a need for compression?
• Effectiveness
–How capable is the system of retrieving relevant documents
from the collection (precision & recall)?
–Is system X better than other systems?
–User satisfaction: how “good” are the documents that are
returned in response to a user's query?
–Relevance of results to the information needs of users
Types of Evaluation Strategies
•User-centered evaluation
– Given several users, and at least two retrieval
systems
• Have each user try the same task on both systems
• Measure which system works “best” for the users'
information need
• How do we measure users' satisfaction?

•System-centered evaluation
– Given documents, queries, and relevance
judgments
• Try several variations of the system
• Measure which system returns the “best” hit list
The Notion of Relevance Judgment
• Relevance is a measure of the correspondence between a
document and a query.
–Construct a document-query matrix as determined by:
(i) the user who posed the retrieval problem;
(ii) an external judge;
(iii) an information specialist.
–Are the relevance judgments made by the user and by an
external judge the same?

• Relevance judgment is usually:


– Subjective: depends upon a specific user’s judgment.
– Situational: relates to user’s current needs.
– Cognitive: depends on human perception and behavior.
– Dynamic: changes over time.
Measuring Retrieval Effectiveness
These metrics are often used to evaluate the effectiveness of the system:

                 Relevant       Irrelevant
 Retrieved           A        B (Type I error)
 Not retrieved       C            D
              (Type II error)

• Retrieval of documents may result in:
 –False positives (errors of commission): some irrelevant
documents are retrieved by the system as if they were relevant (cell B).
 –False negatives (errors of omission, i.e. misses): some relevant
documents are not retrieved by the system, as if they were
irrelevant (cell C).
 –For many applications a good index should not permit any
misses (false negatives), but may permit a few false positives (false drops).
Measuring Retrieval Effectiveness

                 Relevant    Not relevant
 Retrieved           A            B          Collection size = A + B + C + D
 Not retrieved       C            D          Relevant = A + C;  Retrieved = A + B

 Recall    = |{Relevant} ∩ {Retrieved}| / |{Relevant}|  = A / (A + C)

 Precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}| = A / (A + B)

• When is precision important? When is recall important?
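As a concrete illustration, the set-based definitions above can be computed directly. A minimal sketch in Python, using hypothetical document-ID collections for a single query:

```python
# Minimal sketch: set-based precision and recall for a single query.
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                          # A = relevant AND retrieved
    precision = hits / len(retrieved) if retrieved else 0.0   # A / (A + B)
    recall = hits / len(relevant) if relevant else 0.0        # A / (A + C)
    return precision, recall

# Hypothetical example: 3 of the 4 retrieved documents are relevant,
# out of 10 relevant documents in the whole collection.
print(precision_recall(retrieved=[50, 34, 45, 8],
                       relevant=[50, 45, 8, 63, 119, 80, 7, 11, 12, 13]))
# -> (0.75, 0.3)
```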


Example
Assume that there are a total of 10 relevant documents.

 Ranking       Relevance   Recall   Precision
 1.  Doc. 50       R        0.10      1.00
 2.  Doc. 34       NR       0.10      0.50
 3.  Doc. 45       R        0.20      0.67
 4.  Doc. 8        NR       0.20      0.50
 5.  Doc. 23       NR       0.20      0.40
 6.  Doc. 16       NR       0.20      0.33
 7.  Doc. 63       R        0.30      0.43
 8.  Doc. 119      R        0.40      0.50
 9.  Doc. 21       NR       0.40      0.44
 10. Doc. 80       R        0.50      0.50
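The table above can be reproduced with a short sketch that walks down the ranking and recomputes recall and precision at each cut-off; the boolean list below encodes the R/NR column:

```python
# Sketch: recall and precision at each rank cut-off of one ranked result list.
def precision_recall_at_ranks(ranked_relevance, total_relevant):
    points, hits = [], 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        hits += int(is_relevant)
        points.append((hits / total_relevant, hits / rank))   # (recall, precision)
    return points

# R/NR column of the example above (True = relevant), 10 relevant docs in total.
ranking = [True, False, True, False, False, False, True, True, False, True]
for recall, precision in precision_recall_at_ranks(ranking, total_relevant=10):
    print(f"recall={recall:.2f}  precision={precision:.2f}")
```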
Graphing Precision and Recall
• Plot each (recall, precision) point on a graph
• Recall is a non-decreasing function of the number of
documents retrieved,
–Precision usually decreases (in a good system)
• Precision/Recall tradeoff
– Can increase recall by retrieving many documents (down to
a low level of relevance ranking),
•but many irrelevant documents would be fetched, reducing precision
– Can get high recall (but low precision) by retrieving all
documents for all queries
[Figure: precision (y-axis) vs. recall (x-axis), both from 0 to 1. The ideal system sits at the top right (precision = 1, recall = 1). A high-precision/low-recall system returns relevant documents but misses many useful ones; a high-recall/low-precision system returns most relevant documents but includes lots of junk.]
Need for Interpolation
•Two issues:
–How do you compare performance across
queries?
–Is the sawtooth shape an intuitive picture of what’s going on?

[Figure: sawtooth recall/precision curve for a single query, with precision on the y-axis and recall on the x-axis, both from 0 to 1.]

Solution: Interpolation!
Interpolate a precision value for each standard recall level
Interpolation
• It is a general form of precision/recall calculation
• Precision changes with recall (it is not a fixed point)
– It is an empirical fact that on average as recall increases,
precision decreases
• Interpolate precision at 11 standard recall levels:
 – r_j ∈ {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}, where j = 0, 1, ..., 10
• The interpolated precision at the j-th standard recall
level is the maximum known precision at any recall
level between the j-th and (j+1)-th level:

 P(r_j) = max { P(r) : r_j ≤ r ≤ r_{j+1} }
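A minimal sketch of the interpolation step; note that it uses the common TREC-style variant, which takes the maximum precision at any recall level greater than or equal to r_j rather than stopping at r_{j+1}:

```python
# Sketch: 11-point interpolated precision for one query.
def interpolate_11pt(points):
    """points: list of (recall, precision) pairs for one ranked result list."""
    curve = []
    for j in range(11):
        r_j = j / 10
        candidates = [p for r, p in points if r >= r_j]
        curve.append(max(candidates) if candidates else 0.0)
    return curve   # interpolated precision at recall 0.0, 0.1, ..., 1.0

# Recall/precision points of the earlier 10-document example:
points = [(0.1, 1.0), (0.1, 0.5), (0.2, 0.67), (0.2, 0.5), (0.2, 0.4),
          (0.2, 0.33), (0.3, 0.43), (0.4, 0.5), (0.4, 0.44), (0.5, 0.5)]
print(interpolate_11pt(points))
# -> [1.0, 1.0, 0.67, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
```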
Result of Interpolation

[Figure: the interpolated precision/recall curve for the example above, with precision on the y-axis (0–0.8) and recall on the x-axis (0–0.5).]
Exercise
• Let the total number of relevant documents = 6; compute recall and
precision at each cut-off point n:

 n    doc #   relevant   Recall   Precision
 1    588        x       0.167      1.00
 2    589        x       0.333      1.00
 3    576
 4    590        x       0.500      0.75
 5    986
 6    592        x       0.667      0.667
 7    984
 8    988
 9    578
 10   985
 11   103
 12   591
 13   772        x       0.833      0.385
 14   990

One relevant document is never retrieved, so recall never reaches 100%.
Interpolating a Recall/Precision Curve: Exercise

[Figure: plot the interpolated precision/recall curve for the exercise above, with precision on the y-axis and recall on the x-axis, both from 0 to 1.]
Computing Recall/Precision Points: Exercise 2
Let the total # of relevant docs = 6 and check each new recall point:

 n    doc #   relevant
 1    588        x      R = 1/6 = 0.167;  P = 1/1 = 1.00
 2    576
 3    589        x      R = 2/6 = 0.333;  P = 2/3 = 0.667
 4    342
 5    590        x      R = 3/6 = 0.500;  P = 3/5 = 0.60
 6    717
 7    984
 8    772        x      R = 4/6 = 0.667;  P = 4/8 = 0.50
 9    321        x      R = 5/6 = 0.833;  P = 5/9 = 0.556
 10   498
 11   113
 12   628
 13   772
 14   592        x      R = 6/6 = 1.000;  P = 6/14 = 0.429
Interpolating a Recall/Precision Curve: Exercise 2

[Figure: plot the interpolated precision/recall curve for Exercise 2, with precision on the y-axis and recall on the x-axis, both from 0 to 1.]
Interpolating across queries
• For each query, calculate precision at 11 standard
recall levels
• Compute average precision at each standard recall
level across all queries.
• Plot average precision/recall curves to evaluate
overall system performance on a document/query
corpus.

• Average precision favors systems that rank relevant documents highly.
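A small sketch of the averaging step, assuming each query has already been reduced to its 11 interpolated precision values (e.g. with the interpolation sketch above); the two query curves here are hypothetical:

```python
# Sketch: average interpolated precision across queries at each standard recall level.
def average_11pt(per_query_curves):
    n = len(per_query_curves)
    return [sum(curve[j] for curve in per_query_curves) / n for j in range(11)]

q1 = [1.0, 1.0, 0.67, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]   # hypothetical query 1
q2 = [1.0, 1.0, 1.0, 0.8, 0.6, 0.6, 0.5, 0.4, 0.2, 0.0, 0.0]    # hypothetical query 2
print(average_11pt([q1, q2]))   # averaged precision/recall curve for the system
```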
Single-valued measures
• Single value measures: calculate a single value for
each query to evaluate performance
– Mean Average Precision at seen relevant
documents
• Typically average performance over a large set of
queries.
– R-Precision
• Precision at rank R, where R is the number of relevant documents
MAP (Mean Average Precision)
• Computes the mean of the average precision over a set of test queries:

 MAP = (1/n) Σ_{Q_i} (1/|R_i|) Σ_{D_j ∈ R_i} ( j / r_ij )

 – r_ij = rank of the j-th relevant document for query Q_i
 – |R_i| = number of relevant documents for Q_i
 – n = number of test queries
• E.g. assume there are 3 relevant documents for Query 1 and 2
for Query 2. Calculate MAP.

 Relevant docs retrieved   Query 1   Query 2
 1st rel. doc.                1         4
 2nd rel. doc.                5         8
 3rd rel. doc.               10         -

 MAP = (1/2) [ (1/3)(1/1 + 2/5 + 3/10) + (1/2)(1/4 + 2/8) ] ≈ 0.41
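The same computation as a short sketch, taking for each query the ranks at which its relevant documents were retrieved (and assuming, as in the example, that every relevant document is eventually retrieved):

```python
# Sketch: Mean Average Precision from the ranks of the relevant documents.
def mean_average_precision(relevant_ranks_per_query):
    average_precisions = []
    for ranks in relevant_ranks_per_query:
        ranks = sorted(ranks)
        # precision measured at each relevant document: j / r_ij
        precisions = [j / rank for j, rank in enumerate(ranks, start=1)]
        average_precisions.append(sum(precisions) / len(ranks))
    return sum(average_precisions) / len(average_precisions)

# Example above: Query 1 -> relevant docs at ranks 1, 5, 10; Query 2 -> ranks 4, 8.
print(mean_average_precision([[1, 5, 10], [4, 8]]))   # ~0.408
```

If some relevant documents are never retrieved, the inner sum would instead be divided by the total number of relevant documents for the query.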
R- Precision
• Precision at the R-th position in the ranking of results
for a query, where R is the total number of relevant
documents.
– Calculate precision after R documents are seen
– Can be averaged over all queries
 n    doc #   relevant
 1    588        x
 2    589        x
 3    576
 4    590        x
 5    986
 6    592        x
 7    984
 8    988
 9    578
 10   985
 11   103
 12   591
 13   772        x
 14   990

 R = # of relevant docs = 6
 R-Precision = 4/6 = 0.67   (4 of the top R = 6 documents are relevant)
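A minimal R-precision sketch for the ranking above (1 = relevant, 0 = not relevant):

```python
# Sketch: R-precision = precision after the first R documents, R = total relevant docs.
def r_precision(ranked_relevance, total_relevant):
    top_r = ranked_relevance[:total_relevant]
    return sum(top_r) / total_relevant

# Ranking above: relevant documents at ranks 1, 2, 4, 6, and 13; R = 6.
ranking = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
print(r_precision(ranking, total_relevant=6))   # 4/6 ≈ 0.67
```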
F-Measure
• One measure of performance that takes both recall and
precision into account.
• The harmonic mean of recall and precision:

 F = 2PR / (P + R) = 2 / (1/R + 1/P)

• Compared to the arithmetic mean, both P and R need to be high
for the harmonic mean to be high.
• What if no relevant documents exist?

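A small sketch of the harmonic mean, returning 0 when precision and recall are both zero (one way of handling the edge case raised above):

```python
# Sketch: F-measure as the harmonic mean of precision and recall.
def f_measure(precision, recall):
    if precision + recall == 0:
        return 0.0          # e.g. nothing relevant was retrieved
    return 2 * precision * recall / (precision + recall)

print(f_measure(0.5, 0.5))   # 0.50
print(f_measure(0.9, 0.1))   # 0.18 -- one low value drags the harmonic mean down
```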
E Measure
• Associated with van Rijsbergen
• Allows the user to specify the relative importance of recall and
precision
• It is a parameterized F-measure: a variant of the F-measure
that allows weighting the emphasis on recall versus precision:

 E = (1 + β²) P R / (β² P + R) = (1 + β²) / (β²/R + 1/P)

• The value of β controls the trade-off:
 – β = 1: equal weight for precision and recall (E = F).
 – β > 1: weights recall more; it emphasizes recall.
 – β < 1: weights precision more; it emphasizes precision.
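A sketch of the parameterized form, written here as a hypothetical helper f_beta; β = 1 reduces to the plain F-measure:

```python
# Sketch: parameterized F-measure (the E / F_beta form above).
def f_beta(precision, recall, beta=1.0):
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_beta(0.4, 0.8, beta=1))    # ~0.53  (plain F)
print(f_beta(0.4, 0.8, beta=2))    # ~0.67  (recall weighted more)
print(f_beta(0.4, 0.8, beta=0.5))  # ~0.44  (precision weighted more)
```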
Problems with both precision and recall
• Number of irrelevant documents in the collection is not taken
into account.
• Recall is undefined when there is no relevant document in
the collection.
• Precision is undefined when no document is retrieved.

Other measures
• Noise = retrieved irrelevant docs / retrieved docs
• Silence/Miss = non-retrieved relevant docs / relevant docs
– Noise = 1 – Precision; Silence = 1 – Recall

 Miss    = |{Relevant} ∩ {Not Retrieved}| / |{Relevant}|

 Fallout = |{Retrieved} ∩ {Not Relevant}| / |{Not Relevant}|
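These ratios can be computed directly from sets; the collection and document IDs in the sketch below are hypothetical:

```python
# Sketch: noise, silence (miss), and fallout for one query, from document-ID sets.
def noise_silence_fallout(collection, retrieved, relevant):
    not_relevant = collection - relevant
    noise = len(retrieved - relevant) / len(retrieved) if retrieved else 0.0   # 1 - precision
    silence = len(relevant - retrieved) / len(relevant) if relevant else 0.0   # 1 - recall
    fallout = len(retrieved & not_relevant) / len(not_relevant) if not_relevant else 0.0
    return noise, silence, fallout

collection = set(range(1, 21))                 # 20 hypothetical documents
print(noise_silence_fallout(collection,
                            retrieved={1, 2, 3, 4},
                            relevant={1, 2, 5, 6}))
# -> (0.5, 0.5, 0.125)
```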
Programming Assignment (Due date: ____)
• Design an IR system for one of the local languages following the
principles discussed in class.
 – Form a group of up to three members.
1. Construct an inverted file (vocabulary file & posting file)
 – Taking an N-document corpus, generate content-bearing index terms and
organize them using an inverted-file indexing structure; include frequencies
(TF, DF, CF) and the position/location of each term in each document.
2. Develop a vector space retrieval model
 – Implement a vector space model that retrieves relevant documents in
ranked order (a simplified sketch of such an index and ranker follows after this assignment).
3. Test your system using five queries (with three to six words)
and report its performance.
• Required: write a publishable report that has an abstract (½ page),
introduction, problem statement & objectives (1 page), literature
review (2 pages), methods used & architecture of your system (1
page), experimentation (test results & findings) (2-3 pages),
concluding remarks with one basic recommendation (1 page), and
references (1 page).
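For orientation only (not a complete solution: the tiny English corpus, tokenization, and weighting choices below are simplified assumptions), a toy sketch of an inverted file with TF*IDF weights and cosine-ranked retrieval:

```python
# Toy sketch: inverted file + TF*IDF vector space ranking (simplified assumptions).
import math
from collections import Counter, defaultdict

docs = {                                   # doc_id -> text (hypothetical corpus)
    1: "information retrieval evaluates precision and recall",
    2: "vector space model ranks documents by cosine similarity",
    3: "precision and recall measure retrieval effectiveness",
}

# Inverted file: term -> {doc_id: term frequency}.
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term, tf in Counter(text.split()).items():
        index[term][doc_id] = tf

N = len(docs)
idf = {term: math.log(N / len(postings)) for term, postings in index.items()}

# Pre-compute document vector lengths for cosine normalization.
doc_len = defaultdict(float)
for term, postings in index.items():
    for doc_id, tf in postings.items():
        doc_len[doc_id] += (tf * idf[term]) ** 2
doc_len = {d: math.sqrt(s) for d, s in doc_len.items()}

def search(query, k=3):
    """Return the top-k documents ranked by cosine similarity of TF*IDF vectors."""
    q_weights = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query.split()).items()}
    scores = defaultdict(float)
    for term, q_w in q_weights.items():
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += q_w * tf * idf[term]
    # Dividing by the query vector length would not change the ranking, so it is skipped.
    ranked = [(doc_id, s / (doc_len[doc_id] or 1.0)) for doc_id, s in scores.items()]
    return sorted(ranked, key=lambda x: x[1], reverse=True)[:k]

print(search("precision recall retrieval"))
```

A real submission would add tokenization, stop-word removal, stemming, positional information, and a local-language corpus on top of this skeleton.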
