

Reusable Test Collections


 Document Collection
 Topics (sample of information needs)
 Relevance judgments (qrels)

How can we get it?
 For web search, companies run their own studies to assess the
performance of their search engines.
 Web-search performance is monitored by:
● Traffic
● User clicks and session logs
● Labelling results for selected users’ queries

 Academia (or lab settings):


● Someone goes out and builds them (expensive)
● As a byproduct of large scale evaluations (collaborative effort)
 IR Evaluation Campaigns are created for this reason
IR Evaluation Campaigns
 IR test collections are provided for scientific communities to develop
better IR methods.
 Collections and queries are provided; relevance judgments are built
during the campaign.
 TREC = Text REtrieval Conference http://trec.nist.gov/
● Main IR evaluation campaign, sponsored by NIST (US gov).
● Series of annual evaluations, started in 1992.
 Other evaluation campaigns
● CLEF: European version (since 2000)
● NTCIR: Asian version (since 1999)
● FIRE: Indian version (since 2008)
TREC Tracks and Tasks
 TREC (and other campaigns) is organized into a set of tracks; each track
addresses one or more search tasks.
● Each track/task is about searching a set of documents of a given genre and
domain.
 Examples
● TREC Web track
● TREC Medical track
● TREC Legal track
● CLEF-IP track
● NTCIR patent mining track
● TREC Microblog track
• Adhoc search task
• Filtering task
TREC Collection
A set of hundreds of thousands or millions of docs
● 1B in case of web search (TREC ClueWeb09)
 The typical format of a document:
<DOC>
<DOCNO> 1234 </DOCNO>
<TEXT>
This is the document.
Multilines of plain text.
</TEXT>
</DOC>
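As an illustration, a minimal Python sketch for reading documents in this format (assuming records are concatenated in one string/file and the tags are well formed):

    import re

    def parse_trec_docs(text):
        """Yield (docno, body) pairs from concatenated <DOC>...</DOC> records."""
        for doc in re.findall(r"<DOC>(.*?)</DOC>", text, flags=re.DOTALL):
            docno = re.search(r"<DOCNO>\s*(.*?)\s*</DOCNO>", doc, re.DOTALL).group(1)
            body = re.search(r"<TEXT>(.*?)</TEXT>", doc, re.DOTALL).group(1).strip()
            yield docno, body

    sample = "<DOC>\n<DOCNO> 1234 </DOCNO>\n<TEXT>\nThis is the document.\n</TEXT>\n</DOC>"
    for docno, body in parse_trec_docs(sample):
        print(docno, "->", body)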
TREC Topic
 Topic: a statement of information need
 Multiple topics (~50) are developed (mostly at NIST) for a collection.
 Developed by experts and associated with additional details.
● Title: the query text
● Description: description of what is meant by the query.
● Narrative: what should be considered relevant.
<num>189</num>
<title>Health and Computer Terminals</title>
<desc>Is it hazardous to the health of individuals to work with computer terminals on a
daily basis?</desc>
<narr>Relevant documents would contain any information that expands on any
physical disorder/problems that may be associated with the daily working with
computer terminals. Such things as carpal tunnel, cataracts, and fatigue have been
said to be associated, but how widespread are these or other problems and what is
being done to alleviate any health problems.</narr>
Relevance Judgments
 For each topic, the set of relevant docs needs to be known for an
effective evaluation!
 Exhaustive assessment is usually impractical
● TREC usually has 50 topics
● Collection usually has >1 million documents
 Random sampling won’t work
● If relevant docs are rare, none may be found!
 IR systems can help focus the sample (Pooling)
● Each system finds some relevant documents
● Different systems find different relevant documents
● Together, enough systems will find most of them
Pooled Assessment Methodology
1. Systems submit top 1000 documents per topic
2. Top 100 documents from each are manually judged
• Single pool, duplicates removed, arbitrary order
• Judged by the person who developed the topic
3. Treat unevaluated documents as not relevant
4. Compute MAP (or others) down to 1000 documents

 To make pooling work:


● Good number of participating systems
● Systems must do reasonably well
● Systems must be different (not all “do the same thing”)
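A minimal sketch of the pool construction step described above (the run names and data layout are assumptions for illustration):

    def build_pool(runs, depth=100):
        """Depth-k pooling for one topic: merge the top `depth` docs of every
        submitted run into a single judging pool, duplicates removed."""
        pool = set()
        for ranked_list in runs.values():   # runs: {run_name: [doc_id, ...]}
            pool.update(ranked_list[:depth])
        return pool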
Example
In one of the TREC tracks, 3 teams (T1, T2, and T3) participated and
were asked to retrieve up to 15 documents per query. In reality (i.e.,
with exhaustive judgments), a query Q has 9 relevant documents in the
collection: A, B, C, D, E, F, G, H, and I.
The submitted ranked lists are as follows:
    Rank   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
    T1     A   M   Y   R   K   L   B   Z   E   N   D   C   W
    T2     Y   A   J   R   N   Z   M   C   G   B   X   P   D   K   W
    T3     G   B   Y   K   E   A   Z   L   N   C   H   K   W   X
We construct the judging pool for Q using only the top 5 documents of
each of the submitted ranked lists.
What is the Average Precision of each of the 3 teams?
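A minimal sketch of how this exercise could be worked through in code, using the ranked lists above, a pool depth of 5, and the pooling convention that unjudged documents count as non-relevant (the AP denominator here is the number of relevant documents found in the pool):

    runs = {
        "T1": list("AMYRKLBZENDCW"),
        "T2": list("YAJRNZMCGBXPDKW"),
        "T3": list("GBYKEAZLNCHKWX"),
    }
    truly_relevant = set("ABCDEFGHI")

    # Depth-5 pool; assessors judge only pooled documents.
    pool = set()
    for ranked in runs.values():
        pool.update(ranked[:5])
    qrels = pool & truly_relevant          # relevant docs actually discovered

    def average_precision(ranked, qrels):
        hits, precision_sum = 0, 0.0
        for rank, doc in enumerate(ranked, start=1):
            if doc in qrels:               # unjudged docs are treated as non-relevant
                hits += 1
                precision_sum += hits / rank
        return precision_sum / len(qrels) if qrels else 0.0

    for name, ranked in runs.items():
        print(name, round(average_precision(ranked, qrels), 3))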
The "Information Retrieval" Course in Arabic – Summer 2021
Information Retrieval – Summer 2021

4. Ranked Retrieval I
(Vector Space Model & BM25)
Tamer Elsayed
Qatar University
Today’s Roadmap
 Simple scoring
 TF-IDF ranking
 Vector Space Model
 BM25 ranking

Boolean Retrieval
 Thus far, our queries have all been Boolean.
● Documents either match or don’t.
 Good for expert users with precise understanding of their needs
and the collection.
 Not good for the majority of users.
● Most are incapable of writing Boolean queries.
● Most don’t want to go through 1000s of results.
• This is particularly true of web search.

Ranked Retrieval
 Typical queries: free text queries
 Results are “ranked” with respect to a query
 Large result sets are not an issue
● We just show the top k ( ≈ 10) results
● We don’t overwhelm the user
How?
 Top-ranked documents are the most likely to satisfy the user's query.
 Assign a score – say in [0, 1] – to each document.
 score(d, q) measures how well doc d matches query q.
Scoring Example: Jaccard coefficient
 Commonly-used measure of overlap of two sets A and B
● What are the 2 sets in our context?

    jaccard(A, B) = |A ∩ B| / |A ∪ B|

 jaccard(A, A) = 1 and jaccard(A, B) = 0 if A ∩ B = ∅
 A and B don't have to be of the same size.
 Always assigns a number between 0 and 1.
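A minimal sketch of Jaccard as a scoring function over query and document term sets (the example strings are illustrative only):

    def jaccard(a, b):
        """Jaccard coefficient of two term sets."""
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if (a or b) else 0.0

    print(jaccard("ides of march".split(), "caesar died in march".split()))  # 1/6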

Is it a good scoring function?


Issues With Jaccard for Scoring
Term frequency?
 Jaccard doesn't consider term frequency (how many times a term occurs
in a document).
Term importance?
 It treats all terms equally!
● Rare terms in a collection are more informative than frequent terms.
Length?
 It needs a more sophisticated way of normalizing for document length.
Quiz
For the query "the arab world" and the document "fifa world cup in arab country",
what is the Jaccard similarity (after removing stop words)?
 2/3
 2/6
 2/7
 2/10
 3/6
 3/7
 3/10
Example
 Collection of 5 documents (balls = terms)
 Query
 Which is the least relevant document?
 Which is the most relevant document?

[Figure: five documents D1–D5, each shown as a bag of colored balls (terms), alongside the query.]
Term-Document Count Matrix
 Each document is represented as a count (frequency) vector in ℕ^|V|
                Antony and   Julius    The       Hamlet   Othello   Macbeth
                Cleopatra    Caesar    Tempest
    Antony         157          73        0         0         0         0
    Brutus           4         157        0         1         0         0
    Caesar         232         227        0         2         1         1
    Calpurnia        0          10        0         0         0         0
    Cleopatra       57           0        0         0         0         0
    mercy            2           0        3         5         5         1
    worser           2           0        1         1         1         0

Bag-of-Words Model
 Doesn’t consider the ordering of words in a document
 "John is quicker than Mary"
 "Mary is quicker than John"
● Same vectors!
1. Frequent Terms in a Document
Term Frequency
 tf_{t,d}: the number of times that term t occurs in doc d.
 We want to use tf when computing query-document match
scores. But how?

 Raw term frequency?


● A document with 10 occurrences of the term is more relevant than a
document with 1 occurrence of the term.
● But not 10 times more relevant.
 Relevance does not increase linearly with tf.
Log-Frequency Weighting
 The log-frequency weight of term t in d is:

    w_{t,d} = 1 + log10(tf_{t,d})   if tf_{t,d} > 0
    w_{t,d} = 0                     otherwise

 tf → weight: 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.

 Score for a document-query pair: sum over terms t in both q and d:

    score(q, d) = Σ_{t ∈ q ∩ d} w_{t,d}
 The score is 0 if none of the query terms is present in the document.
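A minimal sketch of log-frequency weighting and scoring (term counts are assumed to be available as a dictionary):

    import math

    def log_tf_weight(tf):
        """w = 1 + log10(tf) if tf > 0, else 0."""
        return 1 + math.log10(tf) if tf > 0 else 0.0

    def score(query_terms, doc_tf):
        """Sum of log-tf weights over query terms that appear in the document."""
        return sum(log_tf_weight(doc_tf[t]) for t in set(query_terms) if t in doc_tf)

    print(score(["brutus", "caesar", "rome"], {"antony": 157, "brutus": 4, "caesar": 232}))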
2. Informative Terms in a Collection
 Rare terms are more informative than frequent terms
● Recall stop words
 We want a high weight for rare terms.
 Collection Frequency, cf_t?
● the number of occurrences of term t in the collection
 Document Frequency, df_t?
● the number of documents that contain t
● an inverse measure of the informativeness of t
● df_t ≤ N (N = number of documents in the collection)

Inverse Document Frequency, idf
 The idf (inverse document frequency) of term t:

    idf_t = log10(N / df_t)

● log(N/df_t) is used instead of N/df_t to "dampen" the effect of idf.
 Suppose N = 1 million:

    term        df_t        idf_t
    calpurnia           1       6
    animal            100       4
    sunday          1,000       3
    fly            10,000       2
    under         100,000       1
    the         1,000,000       0
tf.idf Term Weighting
 The tf-idf weight of a term is the product of its tf weight and its
idf weight:

    w_{t,d} = (1 + log10(tf_{t,d})) × idf_t

 One of the best-known weighting schemes in IR:
● Increases with the number of occurrences within a document
● Increases with the rarity of the term in the collection

    score(q, d) = Σ_{t ∈ q ∩ d} w_{t,d}
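A minimal sketch combining both components (the toy statistics below are hypothetical):

    import math
    from collections import Counter

    def tf_idf_weight(tf, df, N):
        """(1 + log10 tf) * log10(N / df); 0 if the term is absent or unseen."""
        if tf == 0 or df == 0:
            return 0.0
        return (1 + math.log10(tf)) * math.log10(N / df)

    def score(query_terms, doc_terms, df, N):
        tf = Counter(doc_terms)
        return sum(tf_idf_weight(tf[t], df.get(t, 0), N) for t in set(query_terms))

    N = 1_000_000
    df = {"calpurnia": 1, "animal": 100, "the": 1_000_000}
    print(score(["calpurnia", "animal"], ["calpurnia", "and", "the", "animal", "animal"], df, N))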
Back to our example …
 Collection of 5 documents (balls = terms)
 Query the destructive storm
 Which is the least relevant document?
 Which is the most relevant document?

[Figure: the same five documents D1–D5, shown with rank positions 4, 5, 3, 2, and 1 respectively under the query.]
Quiz
........ is a relation between a term and a collection.
 Term frequency
 Document frequency
 Collection frequency

A rare term has high IDF.
 Yes
 No

A high tf-idf of a term indicates the importance of a term in …
 a document
 the collection
 both
Today’s Roadmap
 Simple scoring
 TF-IDF ranking
 Vector Space Model
 BM25 ranking

Binary → Count → Weight Matrix
                Antony and   Julius    The       Hamlet   Othello   Macbeth
                Cleopatra    Caesar    Tempest
    Antony        5.25        3.18        0        0         0        0.35
    Brutus        1.21        6.1         0        1         0        0
    Caesar        8.59        2.54        0        1.51      0.25     0
    Calpurnia     0           1.54        0        0         0        0
    Cleopatra     2.85        0           0        0         0        0
    mercy         1.51        0           1.9      0.12      5.25     0.88
    worser        1.37        0           0.11     4.15      0.25     1.95

 Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|.
Sec. 6.3

Documents as Vectors
 |V|-dimensional vector space
 Terms are axes of the space
 Documents are points or vectors in this space
 Very high-dimensional: tens of millions of dimensions when you
apply this to a web search engine.
 These are very sparse vectors: most entries are zero.

Vector Space Model


Sec. 6.3

Queries as Vectors
 Key idea 1: Do the same for queries: represent them as vectors in
the space.
 Key idea 2: Rank documents according to their proximity to the
query in this space.
 proximity = similarity of vectors. How?

Sec. 6.3

Euclidean Distance?
 Distance between the end points of the two vectors

 Large for vectors of different lengths.


 Thought experiment: take a document d and append it to itself.
Call this document d′.
● "Semantically" d and d′ have the same content
● But the Euclidean distance between them can be quite large
Sec. 6.3

Angle Instead of Distance


 The angle between the two documents d and d′ is 0, corresponding to
maximal similarity.
 Key idea: Rank documents according to their angle with the query.
● Rank documents in increasing order of the angle with the query
● Rank documents in decreasing order of cosine(query, document)
 Cosine is a monotonically decreasing function on the interval [0°, 180°].

How to compute?

Sec. 6.3

3. Length Normalization
 A vector can be (length-) normalized by dividing each of its
components by its length – for this we use the L2 norm:

    ||x||_2 = sqrt( Σ_i x_i² )

 Dividing a vector by its L2 norm makes it a unit (length) vector (on


surface of unit hypersphere)
 Effect on the two documents d and d′ (d appended to itself) from
earlier slide: they have identical vectors after length-normalization.
● Long and short documents now have comparable weights

Cosine “Similarity” (Query, Document)
 q_i is the [tf-idf] weight of term i in the query
 d_i is the [tf-idf] weight of term i in the doc
 For (length-)normalized vectors:

    cos(q, d) = q · d = Σ_{i=1..|V|} q_i · d_i

 For non-normalized vectors:

    cos(q, d) = (q · d) / (||q|| · ||d||)
              = Σ_{i=1..|V|} q_i · d_i / ( sqrt(Σ_{i=1..|V|} q_i²) · sqrt(Σ_{i=1..|V|} d_i²) )
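A minimal sketch of cosine similarity for sparse vectors stored as {term: weight} dictionaries:

    import math

    def cosine(q, d):
        """cos(q, d) = (q . d) / (||q|| ||d||); 0 if either vector is empty."""
        dot = sum(w * d.get(t, 0.0) for t, w in q.items())
        norm_q = math.sqrt(sum(w * w for w in q.values()))
        norm_d = math.sqrt(sum(w * w for w in d.values()))
        return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0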
Computing Cosine Scores?

Term-at-a-time (TAAT) query processing
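A minimal sketch of term-at-a-time cosine scoring over an inverted index (the index layout, precomputed document norms, and top-k selection are assumptions for illustration):

    import math
    from collections import defaultdict

    def cosine_scores_taat(query_weights, index, doc_norms, k=10):
        """query_weights: {term: query weight}
        index:            {term: [(doc_id, doc weight), ...]}  (postings lists)
        doc_norms:        {doc_id: L2 norm of the document vector}"""
        acc = defaultdict(float)                 # per-document score accumulators
        for term, q_w in query_weights.items():  # process one query term at a time
            for doc_id, d_w in index.get(term, []):
                acc[doc_id] += q_w * d_w
        q_norm = math.sqrt(sum(w * w for w in query_weights.values())) or 1.0
        scores = {d: s / (doc_norms[d] * q_norm) for d, s in acc.items()}
        return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:k]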

Sec. 6.4

Variants of tf-idf Weighting

 Many search engines allow for different weightings for queries vs.
documents.
 SMART notation: use notation ddd.qqq (first triple for the document weighting, second for the query weighting), with the acronyms from the SMART table (table not reproduced here)
 A very standard weighting scheme is: lnc.ltc

Okapi BM25 Ranking Function
 L_d: length of d
 L̄: average doc length in the collection

    w_{t,d} = [ tf_{t,d} / ( 1.5 · (L_d / L̄) + tf_{t,d} + 0.5 ) ] × log( (N − df_t + 0.5) / (df_t + 0.5) )
                 (tf component)                                     (idf component)

[Figure: left, the Okapi TF component as a function of raw tf (0–25) for L_d/L̄ = 0.5, 1.0, and 2.0, saturating as tf grows; right, the classic idf component as a function of raw df.]

    score(q, d) = Σ_{t ∈ q ∩ d} w_{t,d}
Okapi BM25 Ranking Function
 L_d: length of d
 L̄: average doc length in the collection
 k1, b: parameters

    w_{t,d} = [ (k1 + 1) · tf_{t,d} / ( k1 · (1 − b + b · (L_d / L̄)) + tf_{t,d} ) ] × log( (N − df_t + 0.5) / (df_t + 0.5) )

 With k1 = 2 and b = 0.75, and dropping the constant factor (k1 + 1) = 3 (which does not affect ranking), this reduces to:

    w_{t,d} = [ tf_{t,d} / ( 1.5 · (L_d / L̄) + tf_{t,d} + 0.5 ) ] × log( (N − df_t + 0.5) / (df_t + 0.5) )
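A minimal sketch of the full BM25 weight and document score (parameter defaults follow the slide; the unsmoothed idf variant shown above is used, though other idf variants exist):

    import math

    def bm25_weight(tf, df, N, doc_len, avg_doc_len, k1=2.0, b=0.75):
        """Okapi BM25 weight of one term in one document."""
        idf = math.log((N - df + 0.5) / (df + 0.5))
        tf_part = ((k1 + 1) * tf) / (k1 * (1 - b + b * doc_len / avg_doc_len) + tf)
        return tf_part * idf

    def bm25_score(query_terms, doc_tf, df, N, doc_len, avg_doc_len):
        """score(q, d) = sum of BM25 weights over query terms present in d."""
        return sum(bm25_weight(doc_tf[t], df[t], N, doc_len, avg_doc_len)
                   for t in set(query_terms) if t in doc_tf)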
Summary – Vector Space Ranking
 Represent the query as a term-weighted (e.g., tf-idf) vector.
 Represent each document as a term-weighted (e.g., tf-idf) vector.
 Compute the cosine similarity score for the query vector and
each document vector.
 Rank documents with respect to the query by score
 Return the top K (e.g., K = 10) to the user.
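As a usage-level illustration, the whole pipeline can be approximated with scikit-learn (an assumption for illustration; its TfidfVectorizer applies smoothed idf and L2 normalization by default, which differ slightly from the exact formulas above):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["antony and cleopatra", "julius caesar", "the tempest"]   # toy collection
    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(docs)                      # docs as tf-idf vectors
    query_vector = vectorizer.transform(["brutus and caesar"])
    scores = cosine_similarity(query_vector, doc_vectors).ravel()
    top_k = scores.argsort()[::-1][:10]                               # K = 10
    print([(docs[i], round(float(scores[i]), 3)) for i in top_k])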

Quiz
In the Vector Space Model, ....... are represented as vectors.
 documents only
 queries only
 both documents and queries

In the Vector Space Model, the dimensionality of the space is .....
 the vocabulary size
 the collection size
 the maximum length of documents in the collection

Cosine similarity can be used to compute similarity between .... (you can choose multiple)
 a query and a document
 a query and another query
 a document and another document
Can we “model the language”
to rank search results?

