

Reusable Test Collections


 Document Collection
 Topics (sample of information needs)
 Relevance judgments (qrels)

How can we get it?
 For web search, companies run their own studies to assess the
performance of their search engines.
 Web-search performance is monitored by:
● Traffic
● User clicks and session logs
● Labelling results for selected users’ queries

 Academia (or lab settings):


● Someone goes out and builds them (expensive)
● As a byproduct of large scale evaluations (collaborative effort)
 IR Evaluation Campaigns are created for this reason
IR Evaluation Campaigns
 IR test collections are provided for scientific communities to develop
better IR methods.
 Collections and queries are provided; relevance judgments are built
during the campaign.
 TREC = Text REtrieval Conference http://trec.nist.gov/
● Main IR evaluation campaign, sponsored by NIST (US gov).
● Series of annual evaluations, started in 1992.
 Other evaluation campaigns
● CLEF: European version (since 2000)
● NTCIR: Asian version (since 1999)
● FIRE: Indian version (since 2008)
TREC Tracks and Tasks
 TREC (and other campaigns) is organized into a set of tracks; each track
addresses one or more search tasks.
● Each track/task is about searching a set of documents of a given genre and
domain.
 Examples
● TREC Web track
● TREC Medical track
● TREC Legal track
● CLEF-IP track
● NTCIR patent mining track
● TREC Microblog track
• Adhoc search task
• Filtering task
TREC Collection
A set of hundreds of thousands or millions of docs
● 1B in case of web search (TREC ClueWeb09)
 The typical format of a document:
<DOC>
<DOCNO> 1234 </DOCNO>
<TEXT>
This is the document.
Multilines of plain text.
</TEXT>
</DOC>
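As an illustration, a minimal Python sketch for reading documents in this format (assuming records are concatenated in one string/file and the tags are well formed):

    import re

    def parse_trec_docs(text):
        """Yield (docno, body) pairs from concatenated <DOC>...</DOC> records."""
        for doc in re.findall(r"<DOC>(.*?)</DOC>", text, flags=re.DOTALL):
            docno = re.search(r"<DOCNO>\s*(.*?)\s*</DOCNO>", doc, re.DOTALL).group(1)
            body = re.search(r"<TEXT>(.*?)</TEXT>", doc, re.DOTALL).group(1).strip()
            yield docno, body

    sample = "<DOC>\n<DOCNO> 1234 </DOCNO>\n<TEXT>\nThis is the document.\n</TEXT>\n</DOC>"
    for docno, body in parse_trec_docs(sample):
        print(docno, "->", body)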
TREC Topic
 Topic: a statement of information need
 Multiple topics (~50) are developed (mostly at NIST) for a collection.
 Developed by experts and associated with additional details.
● Title: the query text
● Description: description of what is meant by the query.
● Narrative: what should be considered relevant.
<num>189</num>
<title>Health and Computer Terminals</title>
<desc>Is it hazardous to the health of individuals to work with computer terminals on a
daily basis?</desc>
<narr>Relevant documents would contain any information that expands on any
physical disorder/problems that may be associated with the daily working with
computer terminals. Such things as carpal tunnel, cataracts, and fatigue have been
said to be associated, but how widespread are these or other problems and what is
being done to alleviate any health problems.</narr>
Relevance Judgments
 For each topic, the set of relevant docs needs to be known for an
effective evaluation!
 Exhaustive assessment is usually impractical
● TREC usually has 50 topics
● Collection usually has >1 million documents
 Random sampling won’t work
● If relevant docs are rare, none may be found!
 IR systems can help focus the sample (Pooling)
● Each system finds some relevant documents
● Different systems find different relevant documents
● Together, enough systems will find most of them
Pooled Assessment Methodology
1. Systems submit top 1000 documents per topic
2. Top 100 documents from each are manually judged
• Single pool, duplicates removed, arbitrary order
• Judged by the person who developed the topic
3. Treat unevaluated documents as not relevant
4. Compute MAP (or others) down to 1000 documents

 To make pooling work:


● Good number of participating systems
● Systems must do reasonably well
● Systems must be different (not all “do the same thing”)
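A minimal sketch of the pool construction step described above (the run names and data layout are assumptions for illustration):

    def build_pool(runs, depth=100):
        """Depth-k pooling for one topic: merge the top `depth` docs of every
        submitted run into a single judging pool, duplicates removed."""
        pool = set()
        for ranked_list in runs.values():   # runs: {run_name: [doc_id, ...]}
            pool.update(ranked_list[:depth])
        return pool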
Example
In one of the TREC tracks, 3 teams (T1, T2, and T3) participated and
were asked to retrieve up to 15 documents per query. In reality (i.e.,
with exhaustive judgments), a query Q has 9 relevant documents in the
collection: A, B, C, D, E, F, G, H, and I.
The submitted ranked lists are as follows:
    Rank   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
    T1     A   M   Y   R   K   L   B   Z   E   N   D   C   W
    T2     Y   A   J   R   N   Z   M   C   G   B   X   P   D   K   W
    T3     G   B   Y   K   E   A   Z   L   N   C   H   K   W   X
We construct the judging pool for Q using only the top 5 documents of
each of the submitted ranked lists.
What is the Average Precision of each of the 3 teams?
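A minimal sketch of how this exercise could be worked through in code, using the ranked lists above, a pool depth of 5, and the pooling convention that unjudged documents count as non-relevant (the AP denominator here is the number of relevant documents found in the pool):

    runs = {
        "T1": list("AMYRKLBZENDCW"),
        "T2": list("YAJRNZMCGBXPDKW"),
        "T3": list("GBYKEAZLNCHKWX"),
    }
    truly_relevant = set("ABCDEFGHI")

    # Depth-5 pool; assessors judge only pooled documents.
    pool = set()
    for ranked in runs.values():
        pool.update(ranked[:5])
    qrels = pool & truly_relevant          # relevant docs actually discovered

    def average_precision(ranked, qrels):
        hits, precision_sum = 0, 0.0
        for rank, doc in enumerate(ranked, start=1):
            if doc in qrels:               # unjudged docs are treated as non-relevant
                hits += 1
                precision_sum += hits / rank
        return precision_sum / len(qrels) if qrels else 0.0

    for name, ranked in runs.items():
        print(name, round(average_precision(ranked, qrels), 3))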
The "Information Retrieval" Course in Arabic – Summer 2021
Information Retrieval – Summer 2021

4. Ranked Retrieval I
(Vector Space Model & BM25)
Tamer Elsayed
Qatar University
Today’s Roadmap
 Simple scoring
 TF-IDF ranking
 Vector Space Model
 BM25 ranking

Boolean Retrieval
 Thus far, our queries have all been Boolean.
● Documents either match or don’t.
 Good for expert users with precise understanding of their needs
and the collection.
 Not good for the majority of users.
● Most are incapable of writing Boolean queries.
● Most don’t want to go through 1000s of results.
• This is particularly true of web search.

Ranked Retrieval
 Typical queries: free text queries
 Results are “ranked” with respect to a query
 Large result sets are not an issue
● We just show the top k ( ≈ 10) results
● We don’t overwhelm the user
How?
 Top-ranked documents are the most likely to satisfy the user's query.
 Assign a score – say in [0, 1] – to each document.
 score(d, q) measures how well doc d matches query q.
Scoring Example: Jaccard coefficient
 Commonly-used measure of overlap of two sets A and B
● What are the 2 sets in our context?

    jaccard(A, B) = |A ∩ B| / |A ∪ B|

 jaccard(A, A) = 1 and jaccard(A, B) = 0 if A ∩ B = ∅
 A and B don't have to be of the same size.
 Always assigns a number between 0 and 1.
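A minimal sketch of Jaccard as a scoring function over query and document term sets (the example strings are illustrative only):

    def jaccard(a, b):
        """Jaccard coefficient of two term sets."""
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if (a or b) else 0.0

    print(jaccard("ides of march".split(), "caesar died in march".split()))  # 1/6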

Is it a good scoring function?


Issues With Jaccard for Scoring
Term frequency?
 Jaccard doesn't consider term frequency (how many times a term occurs
in a document).
Term importance?
 It treats all terms equally!
● Rare terms in a collection are more informative than frequent terms.
Length?
 It needs a more sophisticated way of normalizing for document length.
Quiz
For the query "the arab world" and the document "fifa world cup in arab country",
what is the Jaccard similarity (after removing stop words)?
 2/3
 2/6
 2/7
 2/10
 3/6
 3/7
 3/10
Example
 Collection of 5 documents (balls = terms)
 Query
 Which is the least relevant document?
 Which is the most relevant document?

[Figure: five documents D1–D5, each shown as a bag of colored balls (terms), alongside the query.]
Term-Document Count Matrix
 Each document is represented as a count (frequency) vector in ℕ^|V|
                Antony and   Julius    The       Hamlet   Othello   Macbeth
                Cleopatra    Caesar    Tempest
    Antony         157          73        0         0         0         0
    Brutus           4         157        0         1         0         0
    Caesar         232         227        0         2         1         1
    Calpurnia        0          10        0         0         0         0
    Cleopatra       57           0        0         0         0         0
    mercy            2           0        3         5         5         1
    worser           2           0        1         1         1         0

Bag-of-Words Model
 Doesn’t consider the ordering of words in a document
 "John is quicker than Mary"
 "Mary is quicker than John"
● Same vectors!
1. Frequent Terms in a Document
Term Frequency
 tf_{t,d}: the number of times that term t occurs in doc d.
 We want to use tf when computing query-document match
scores. But how?

 Raw term frequency?


● A document with 10 occurrences of the term is more relevant than a
document with 1 occurrence of the term.
● But not 10 times more relevant.
 Relevance does not increase linearly with tf.
Log-Frequency Weighting
 The log-frequency weight of term t in d is:

    w_{t,d} = 1 + log10(tf_{t,d})   if tf_{t,d} > 0
    w_{t,d} = 0                     otherwise

 tf → weight: 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.

 Score for a document-query pair: sum over terms t in both q and d:

    score(q, d) = Σ_{t ∈ q ∩ d} w_{t,d}
 The score is 0 if none of the query terms is present in the document.
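A minimal sketch of log-frequency weighting and scoring (term counts are assumed to be available as a dictionary):

    import math

    def log_tf_weight(tf):
        """w = 1 + log10(tf) if tf > 0, else 0."""
        return 1 + math.log10(tf) if tf > 0 else 0.0

    def score(query_terms, doc_tf):
        """Sum of log-tf weights over query terms that appear in the document."""
        return sum(log_tf_weight(doc_tf[t]) for t in set(query_terms) if t in doc_tf)

    print(score(["brutus", "caesar", "rome"], {"antony": 157, "brutus": 4, "caesar": 232}))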
2. Informative Terms in a Collection
 Rare terms are more informative than frequent terms
● Recall stop words
 We want a high weight for rare terms.
 Collection Frequency, cf_t?
● the number of occurrences of term t in the collection
 Document Frequency, df_t?
● the number of documents that contain t
● an inverse measure of the informativeness of t
● df_t ≤ N (N = number of documents in the collection)

Inverse Document Frequency, idf
 The idf (inverse document frequency) of term t:

    idf_t = log10(N / df_t)

● log(N/df_t) is used instead of N/df_t to "dampen" the effect of idf.
 Suppose N = 1 million:

    term        df_t        idf_t
    calpurnia           1       6
    animal            100       4
    sunday          1,000       3
    fly            10,000       2
    under         100,000       1
    the         1,000,000       0
tf.idf Term Weighting
 The tf-idf weight of a term is the product of its tf weight and its
idf weight:

    w_{t,d} = (1 + log10(tf_{t,d})) × idf_t

 One of the best-known weighting schemes in IR:
● Increases with the number of occurrences within a document
● Increases with the rarity of the term in the collection

    score(q, d) = Σ_{t ∈ q ∩ d} w_{t,d}
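A minimal sketch combining both components (the toy statistics below are hypothetical):

    import math
    from collections import Counter

    def tf_idf_weight(tf, df, N):
        """(1 + log10 tf) * log10(N / df); 0 if the term is absent or unseen."""
        if tf == 0 or df == 0:
            return 0.0
        return (1 + math.log10(tf)) * math.log10(N / df)

    def score(query_terms, doc_terms, df, N):
        tf = Counter(doc_terms)
        return sum(tf_idf_weight(tf[t], df.get(t, 0), N) for t in set(query_terms))

    N = 1_000_000
    df = {"calpurnia": 1, "animal": 100, "the": 1_000_000}
    print(score(["calpurnia", "animal"], ["calpurnia", "and", "the", "animal", "animal"], df, N))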
Back to our example …
 Collection of 5 documents (balls = terms)
 Query the destructive storm
 Which is the least relevant document?
 Which is the most relevant document?

[Figure: the same five documents D1–D5, shown with rank positions 4, 5, 3, 2, and 1 respectively under the query.]
Quiz
........ is a relation between a term and a collection.
 Term frequency
 Document frequency
 Collection frequency

A rare term has high IDF.
 Yes
 No

A high tf-idf of a term indicates the importance of a term in …
 a document
 the collection
 both
Today’s Roadmap
 Simple scoring
 TF-IDF ranking
 Vector Space Model
 BM25 ranking

Binary → Count → Weight Matrix
                Antony and   Julius    The       Hamlet   Othello   Macbeth
                Cleopatra    Caesar    Tempest
    Antony        5.25        3.18        0        0         0        0.35
    Brutus        1.21        6.1         0        1         0        0
    Caesar        8.59        2.54        0        1.51      0.25     0
    Calpurnia     0           1.54        0        0         0        0
    Cleopatra     2.85        0           0        0         0        0
    mercy         1.51        0           1.9      0.12      5.25     0.88
    worser        1.37        0           0.11     4.15      0.25     1.95

 Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|.
Sec. 6.3

Documents as Vectors
 |V|-dimensional vector space
 Terms are axes of the space
 Documents are points or vectors in this space
 Very high-dimensional: tens of millions of dimensions when you
apply this to a web search engine.
 These are very sparse vectors: most entries are zero.

Vector Space Model


Sec. 6.3

Queries as Vectors
 Key idea 1: Do the same for queries: represent them as vectors in
the space.
 Key idea 2: Rank documents according to their proximity to the
query in this space.
 proximity = similarity of vectors. How?

Sec. 6.3

Euclidean Distance?
 Distance between the end points of the two vectors

 Large for vectors of different lengths.


 Thought experiment: take a document d and append it to itself.
Call this document d′.
● "Semantically" d and d′ have the same content
● But the Euclidean distance between them can be quite large
Sec. 6.3

Angle Instead of Distance


 The angle between the two documents d and d′ is 0, corresponding to
maximal similarity.
 Key idea: Rank documents according to their angle with the query.
● Rank documents in increasing order of the angle with the query
● Rank documents in decreasing order of cosine(query, document)
 Cosine is a monotonically decreasing function on the interval [0°, 180°].

How to compute?

Sec. 6.3

3. Length Normalization
 A vector can be (length-) normalized by dividing each of its
components by its length – for this we use the L2 norm:

    ||x||_2 = sqrt( Σ_i x_i² )

 Dividing a vector by its L2 norm makes it a unit (length) vector (on


surface of unit hypersphere)
 Effect on the two documents d and d′ (d appended to itself) from
earlier slide: they have identical vectors after length-normalization.
● Long and short documents now have comparable weights

Cosine “Similarity” (Query, Document)
 q_i is the [tf-idf] weight of term i in the query
 d_i is the [tf-idf] weight of term i in the doc
 For (length-)normalized vectors:

    cos(q, d) = q · d = Σ_{i=1..|V|} q_i · d_i

 For non-normalized vectors:

    cos(q, d) = (q · d) / (||q|| · ||d||)
              = Σ_{i=1..|V|} q_i · d_i / ( sqrt(Σ_{i=1..|V|} q_i²) · sqrt(Σ_{i=1..|V|} d_i²) )
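A minimal sketch of cosine similarity for sparse vectors stored as {term: weight} dictionaries:

    import math

    def cosine(q, d):
        """cos(q, d) = (q . d) / (||q|| ||d||); 0 if either vector is empty."""
        dot = sum(w * d.get(t, 0.0) for t, w in q.items())
        norm_q = math.sqrt(sum(w * w for w in q.values()))
        norm_d = math.sqrt(sum(w * w for w in d.values()))
        return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0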
Computing Cosine Scores?

Term-at-a-time (TAAT) query processing
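A minimal sketch of term-at-a-time cosine scoring over an inverted index (the index layout, precomputed document norms, and top-k selection are assumptions for illustration):

    import math
    from collections import defaultdict

    def cosine_scores_taat(query_weights, index, doc_norms, k=10):
        """query_weights: {term: query weight}
        index:            {term: [(doc_id, doc weight), ...]}  (postings lists)
        doc_norms:        {doc_id: L2 norm of the document vector}"""
        acc = defaultdict(float)                 # per-document score accumulators
        for term, q_w in query_weights.items():  # process one query term at a time
            for doc_id, d_w in index.get(term, []):
                acc[doc_id] += q_w * d_w
        q_norm = math.sqrt(sum(w * w for w in query_weights.values())) or 1.0
        scores = {d: s / (doc_norms[d] * q_norm) for d, s in acc.items()}
        return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:k]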

Sec. 6.4

Variants of tf-idf Weighting

 Many search engines allow for different weightings for queries vs.
documents.
 SMART notation: use notation ddd.qqq (first triple for the document weighting, second for the query weighting), with the acronyms from the SMART table (table not reproduced here)
 A very standard weighting scheme is: lnc.ltc

Okapi BM25 Ranking Function
 L_d: length of d
 L̄: average doc length in the collection

    w_{t,d} = [ tf_{t,d} / ( 1.5 · (L_d / L̄) + tf_{t,d} + 0.5 ) ] × log( (N − df_t + 0.5) / (df_t + 0.5) )
                 (tf component)                                     (idf component)

[Figure: left, the Okapi TF component as a function of raw tf (0–25) for L_d/L̄ = 0.5, 1.0, and 2.0, saturating as tf grows; right, the classic idf component as a function of raw df.]

    score(q, d) = Σ_{t ∈ q ∩ d} w_{t,d}
Okapi BM25 Ranking Function
 L_d: length of d
 L̄: average doc length in the collection
 k1, b: parameters

    w_{t,d} = [ (k1 + 1) · tf_{t,d} / ( k1 · (1 − b + b · (L_d / L̄)) + tf_{t,d} ) ] × log( (N − df_t + 0.5) / (df_t + 0.5) )

 With k1 = 2 and b = 0.75, and dropping the constant factor (k1 + 1) = 3 (which does not affect ranking), this reduces to:

    w_{t,d} = [ tf_{t,d} / ( 1.5 · (L_d / L̄) + tf_{t,d} + 0.5 ) ] × log( (N − df_t + 0.5) / (df_t + 0.5) )
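A minimal sketch of the full BM25 weight and document score (parameter defaults follow the slide; the unsmoothed idf variant shown above is used, though other idf variants exist):

    import math

    def bm25_weight(tf, df, N, doc_len, avg_doc_len, k1=2.0, b=0.75):
        """Okapi BM25 weight of one term in one document."""
        idf = math.log((N - df + 0.5) / (df + 0.5))
        tf_part = ((k1 + 1) * tf) / (k1 * (1 - b + b * doc_len / avg_doc_len) + tf)
        return tf_part * idf

    def bm25_score(query_terms, doc_tf, df, N, doc_len, avg_doc_len):
        """score(q, d) = sum of BM25 weights over query terms present in d."""
        return sum(bm25_weight(doc_tf[t], df[t], N, doc_len, avg_doc_len)
                   for t in set(query_terms) if t in doc_tf)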
Summary – Vector Space Ranking
 Represent the query as a term-weighted (e.g., tf-idf) vector.
 Represent each document as a term-weighted (e.g., tf-idf) vector.
 Compute the cosine similarity score for the query vector and
each document vector.
 Rank documents with respect to the query by score
 Return the top K (e.g., K = 10) to the user.
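As a usage-level illustration, the whole pipeline can be approximated with scikit-learn (an assumption for illustration; its TfidfVectorizer applies smoothed idf and L2 normalization by default, which differ slightly from the exact formulas above):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["antony and cleopatra", "julius caesar", "the tempest"]   # toy collection
    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(docs)                      # docs as tf-idf vectors
    query_vector = vectorizer.transform(["brutus and caesar"])
    scores = cosine_similarity(query_vector, doc_vectors).ravel()
    top_k = scores.argsort()[::-1][:10]                               # K = 10
    print([(docs[i], round(float(scores[i]), 3)) for i in top_k])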

Quiz
In the Vector Space Model, ....... are represented as vectors.
 documents only
 queries only
 both documents and queries

In the Vector Space Model, the dimensionality of the space is .....
 the vocabulary size
 the collection size
 the maximum length of documents in the collection

Cosine similarity can be used to compute similarity between .... (you can choose multiple)
 a query and a document
 a query and another query
 a document and another document
Can we “model the language”
to rank search results?

