4 IRinArabic2021 Ranked Retrieval I
How can we get it?
For web search, companies conduct their own studies to assess the
performance of their search engines.
Web-search performance is monitored by:
● Traffic
● User clicks and session logs
● Labelling results for selected users’ queries
4. Ranked Retrieval I
(Vector Space Model & BM25)
Tamer Elsayed
Qatar University
Today’s Roadmap
Simple scoring
TF-IDF ranking
Vector Space Model
BM25 ranking
Boolean Retrieval
Thus far, our queries have all been Boolean.
● Documents either match or don’t.
Good for expert users with precise understanding of their needs
and the collection.
Not good for the majority of users.
● Most are incapable of writing Boolean queries.
● Most don’t want to go through 1000s of results.
• This is particularly true of web search.
Ranked Retrieval
Typical queries: free text queries
Results are “ranked” with respect to a query
Large result sets are not an issue
● We just show the top k ( ≈ 10) results
● We don’t overwhelm the user
How?
Top ranked documents are the most likely to satisfy user’s query.
Assign a score – say in [0, 1] – to each document.
Score (d, q) measures how well doc d matches a query q.
Scoring Example: Jaccard coefficient
Commonly-used measure of overlap of two sets A and B
● What are the 2 sets in our context?
jaccard(A, B) = |A ∩ B| / |A ∪ B|

jaccard(A, A) = 1, and jaccard(A, B) = 0 if A ∩ B = ∅
A and B don’t have to be of the same size.
Always assigns a number between 0 and 1.
Quiz: For the query "the arab world" and the document "fifa world cup in arab country",
what is the Jaccard similarity (after removing stop words)?
● 2/3
● 2/6
● 2/7
● 2/10
● 3/6
● 3/7
● 3/10
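The Jaccard computation can be sketched in Python. The stop-word list below is an assumption for illustration; note that the quiz answer depends on which words are treated as stop words (removing both "the" and "in" gives 2/5, while removing only "the" gives 2/6).

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Quiz example; the stop-word list here is an assumption.
stopwords = {"the", "in"}
query = set("the arab world".split()) - stopwords
doc = set("fifa world cup in arab country".split()) - stopwords
score = jaccard(query, doc)  # {arab, world} shared, 5 terms in the union
```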
Example
[Figure: a query and a collection of 5 documents, D1–D5, shown as sets of balls (terms)]
Which is the most relevant document?
Which is the least relevant document?
Term-Document Count Matrix
Each document is a count (frequency) vector in ℕ^|V|

Term        Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony      157                    73              0             0        0         0
Brutus      4                      157             0             1        0         0
Caesar      232                    227             0             2        1         1
Calpurnia   0                      10              0             0        0         0
Cleopatra   57                     0               0             0        0         0
mercy       2                      0               3             5        5         1
worser      2                      0               1             1        1         0
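A count matrix like this can be built directly from tokenized documents; the tiny two-document collection below is an assumption for illustration.

```python
from collections import Counter

# Toy documents (assumed for illustration), already tokenized.
docs = {
    "D1": "antony and cleopatra".split(),
    "D2": "brutus killed caesar and caesar".split(),
}

# Vocabulary = all distinct terms; each document becomes a
# vector of raw term frequencies over that vocabulary.
vocab = sorted({t for toks in docs.values() for t in toks})
counts = {d: Counter(toks) for d, toks in docs.items()}
matrix = {d: [counts[d][t] for t in vocab] for d in docs}
```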
Bag-of-Words Model
Doesn’t consider the ordering of words in a document
John is quicker than Mary
Mary is quicker than John
→ Same vectors!
1. Frequent Terms in a Document
Term Frequency
tft,d: the number of times that term t occurs in doc d.
We want to use tf when computing query-document match
scores. But how?
Log-Frequency Weighting
The log-frequency weight of term t in d is:

w_{t,d} = 1 + log10(tf_{t,d}),  if tf_{t,d} > 0
          0,                    otherwise

0 → 0, 1 → 1, 2 → ~1.3, 10 → 2, 1000 → 4, etc.
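The weighting above is a one-liner; a small sketch that reproduces the slide's example values:

```python
import math

def log_tf(tf: int) -> float:
    """Log-frequency weight: 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

# 0 -> 0, 1 -> 1, 2 -> ~1.3, 10 -> 2, 1000 -> 4
```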
Inverse Document Frequency, idf
idf (inverse document frequency) of t:

idf_t = log10(N / df_t)

● log(N/df_t) instead of N/df_t to “dampen” the effect of idf.
Suppose N = 1 million

term        df_t        idf_t
calpurnia   1           6
animal      100         4
sunday      1,000       3
fly         10,000      2
under       100,000     1
the         1,000,000   0
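The idf values in the table can be reproduced directly:

```python
import math

def idf(df: int, n_docs: int = 1_000_000) -> float:
    """Inverse document frequency: log10(N / df)."""
    return math.log10(n_docs / df)

# With N = 1 million: df=1 -> 6, df=100 -> 4, df=1,000,000 -> 0
```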
tf.idf Term Weighting
The tf-idf weight of a term is the product of its tf weight and its
idf weight.
w_{t,d} = (1 + log10(tf_{t,d})) × idf_t
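Combining the two components gives the tf-idf weight, sketched here as a single function:

```python
import math

def tf_idf(tf: int, df: int, n_docs: int) -> float:
    """tf-idf weight: (1 + log10(tf)) * log10(N / df); 0 if the term is absent."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)
```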
Quiz: ........ is a relation between a term and a collection.
● Term frequency
● Document frequency
● Collection frequency

Quiz: A rare term has high IDF.
● Yes
● No
Binary → Count → Weight Matrix
[Figure: the term-document matrix for Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, and Macbeth, with binary, count, and weight entries]
Sec. 6.3
Documents as Vectors
Documents are represented as vectors in a |V|-dimensional vector space.
Terms are the axes of the space.
Queries as Vectors
Key idea 1: Do the same for queries: represent them as vectors in
the space.
Sec. 6.3
Euclidean Distance?
Distance between the end points of the two vectors
How to compute?
Sec. 6.3
3. Length Normalization
A vector can be (length-) normalized by dividing each of its
components by its length – for this we use the L2 norm:

‖x‖₂ = √( Σᵢ xᵢ² )
Cosine “Similarity” (Query, Document)
qᵢ is the [tf-idf] weight of term i in the query
dᵢ is the [tf-idf] weight of term i in the doc

For normalized vectors:

cos(q, d) = q · d = Σ_{i=1..|V|} qᵢ dᵢ

For non-normalized vectors:

cos(q, d) = (q · d) / (‖q‖ ‖d‖) = Σ_{i=1..|V|} qᵢ dᵢ / ( √(Σ_{i=1..|V|} qᵢ²) · √(Σ_{i=1..|V|} dᵢ²) )
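The non-normalized cosine formula translates directly into code:

```python
import math

def cosine(q: list, d: list) -> float:
    """Cosine similarity of two (possibly non-normalized) weight vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    nq = math.sqrt(sum(qi * qi for qi in q))
    nd = math.sqrt(sum(di * di for di in d))
    return dot / (nq * nd) if nq and nd else 0.0
```

For already length-normalized vectors the norms are 1, so the dot product alone suffices.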
Computing Cosine Scores?
TAAT (term-at-a-time) query processing
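Term-at-a-time processing traverses one query term's postings list at a time, accumulating partial scores per document. A minimal in-memory sketch; the dictionary-based index layout is an assumption for illustration:

```python
from collections import defaultdict

def taat_scores(query_weights: dict, index: dict) -> dict:
    """Term-at-a-time scoring: accumulate q_t * w_{t,d} per document.

    query_weights: {term: query-side weight}
    index: {term: {doc_id: document-side weight}}  (toy in-memory postings)
    """
    scores = defaultdict(float)
    for term, q_w in query_weights.items():
        for doc_id, d_w in index.get(term, {}).items():
            scores[doc_id] += q_w * d_w
    return dict(scores)
```

For a cosine score, each accumulated total would finally be divided by the document's vector length.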
Sec. 6.4
Many search engines allow for different weightings for queries vs.
documents.
SMART notation: ddd.qqq, where ddd gives the document weighting and qqq the query weighting (acronyms from the SMART table).
A very standard weighting scheme is lnc.ltc.
Okapi BM25 Ranking Function
L_d : length of d
L̄ : average doc length in the collection

w_{t,d} = [tf component] × [idf component]
  tf component  = tf_{t,d} / (1.5 · (L_d / L̄) + tf_{t,d} + 0.5)
  idf component = log( (N − df_t + 0.5) / (df_t + 0.5) )

score(q, d) = Σ_{t ∈ q ∩ d} w_{t,d}

[Plots: the Okapi TF component as a function of tf_{t,d} for L_d/L̄ = 0.5, 1.0, 2.0, and the Okapi idf component vs. the classic idf as a function of df_t]
Okapi BM25 Ranking Function
L_d : length of d
L̄ : average doc length in the collection
k₁, b : parameters

w_{t,d} = ( (k₁ + 1) · tf_{t,d} ) / ( k₁ · (1 − b + b · (L_d / L̄)) + tf_{t,d} ) × log( (N − df_t + 0.5) / (df_t + 0.5) )

With k₁ = 2 and b = 0.75 (and dropping the constant factor k₁ + 1, which does not affect the ranking):

w_{t,d} = tf_{t,d} / (1.5 · (L_d / L̄) + tf_{t,d} + 0.5) × log( (N − df_t + 0.5) / (df_t + 0.5) )
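The full BM25 weight can be sketched as follows. The slide does not fix the logarithm base, so natural log is an assumption here; the base only rescales scores and does not change the ranking.

```python
import math

def bm25_weight(tf: int, df: int, n_docs: int, doc_len: float,
                avg_len: float, k1: float = 2.0, b: float = 0.75) -> float:
    """BM25 term weight: saturating tf component times Okapi idf component."""
    # tf component: grows with tf but saturates; longer-than-average
    # documents are penalized through b.
    tf_part = ((k1 + 1) * tf) / (k1 * (1 - b + b * doc_len / avg_len) + tf)
    # idf component (log base is an assumption; the slide just says "log").
    idf_part = math.log((n_docs - df + 0.5) / (df + 0.5))
    return tf_part * idf_part

# score(q, d) would sum bm25_weight over the terms in q ∩ d.
```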
Summary – Vector Space Ranking
Represent the query as a term-weighted (e.g., tf-idf) vector.
Represent each document as a term-weighted (e.g., tf-idf) vector.
Compute the cosine similarity score for the query vector and
each document vector.
Rank documents with respect to the query by score.
Return the top K (e.g., K = 10) to the user.