IR Chapter 2 Part II
Terms
Terms are usually stems. Terms can also be phrases, such as "Computer Science", "World Wide Web", etc.
Documents and queries are represented as vectors or "bags of words" (BOW).
Each vector holds a place for every term in the collection: position 1 corresponds to term 1, position 2 to term 2, ..., position n to term n.

Di = (wd_i1, wd_i2, ..., wd_in)
Q = (wq_1, wq_2, ..., wq_n)

where w = 0 if a term is absent.
Document Collection
A collection of n documents can be represented in the
vector space model by a term-document matrix.
An entry in the matrix corresponds to the “weight” of a
term in the document; zero means the term has no
significance in the document or it simply doesn’t exist
in the document.
       T1    T2    …    Tt
D1    w11   w21   …   wt1
D2    w12   w22   …   wt2
 ⋮      ⋮     ⋮          ⋮
Dn    w1n   w2n   …   wtn
Term frequency (TF) weights
The frequency of occurrence of a term is a useful indication
of its relative importance in describing a document.
In other words, term importance is related to frequency of
occurrence.
If term A is mentioned more than term B, then the
document is more about A than about B (assuming A
and B to be content bearing terms).
Such a measure assumes that the value, or weight, of a term assigned to a document is simply proportional to the term frequency (i.e., the frequency of occurrence of that particular term in that particular document).
The more frequently a term occurs in a document the
more likely it is to be of value in describing the content of
the document.
TF (term frequency) - Count the number of times a term occurs in a document.
f_ij = frequency of term i in document j
The more times a term t occurs in document d, the more likely it is that t is relevant to the document, i.e. more indicative of the topic.

docs   t1   t2   t3
D1      2    0    3
D2      1    0    0
D3      0    4    7
D4      3    0    0
D5      1    6    3
D6      3    5    0
D7      0    8    0
D8      0   10    0
D9      0    0    1
D10     0    3    5
D11     4    0    1
Accordingly, the weight of term i in document j, denoted by w_ij, might be determined by

w_ij = FREQ_ij

where FREQ_ij is the frequency of term i in document j.
It is a simple count of the number of occurrences of a term in a particular document (or query), and serves as a measure of term density in a document.
Despite its weaknesses, experiments have shown that this simple method gives better results than the Boolean model.
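As a quick illustration (a minimal sketch, not part of the original slides; the two toy documents are invented for the example), raw term frequencies can be computed with a simple counter:

```python
from collections import Counter

# Toy collection; the document texts are made up for illustration.
docs = {
    "D1": "information retrieval ranks documents by term weights",
    "D2": "term weights reflect term importance in a document",
}

# Raw TF: count how many times each term occurs in each document.
tf = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}

print(tf["D2"]["term"])  # -> 2
```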
Problems with Term frequency (TF)
Such a weighting system sometimes does not perform as expected, especially in cases where the high-frequency words are equally distributed throughout the collection, since it does not take into account the role of term i in any document other than document j.
This simple measure is not normalized to account for variance in the length of documents (i.e., long documents have an unfair advantage).
A one-page document with 10 mentions of A is "more about A" than a 100-page document with 20 mentions of A.
Used alone, TF favors common words and long documents: common words occur often in every document, and long documents simply contain more occurrences of every term.
Solutions to the Problems with Term frequency (TF)
Two solutions:
1. Divide each frequency count by the length of the document (length normalization). In this case the normalized frequency tf_ij is used instead of FREQ_ij.
2. Divide each frequency count by the maximum frequency count of any term in the document. The normalized tf is then given by

tf_ij = FREQ_ij / max_k(FREQ_kj)

where tf_ij is the normalized frequency of term i in document j, and max_k(FREQ_kj) is the maximum frequency of any term k in document j.
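A minimal sketch of both normalizations (the toy document is invented for the example):

```python
from collections import Counter

doc = "term weights reflect term importance in a document"
counts = Counter(doc.split())    # raw FREQ counts
doc_len = sum(counts.values())   # document length in tokens
max_freq = max(counts.values())  # frequency of the most frequent term

# Solution 1: length normalization.
tf_len = {t: c / doc_len for t, c in counts.items()}

# Solution 2: maximum-frequency normalization.
tf_max = {t: c / max_freq for t, c in counts.items()}

print(tf_len["term"], tf_max["term"])  # -> 0.25 1.0
```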
Document Frequency
Document frequency (DF) is the number of documents in the collection that contain a term.
Where:
N is the total number of documents in the collection,
d_k is the number of documents in which term k occurs, and
W_k is the weight assigned to term k (i.e., the inverse document frequency of term k):

W_k = log2(N / d_k)

That is, the weight of a term in a document is the logarithm of the number of documents in the collection divided by the number of documents in the collection that contain the term (with 2 as the base of the logarithm).
IDF measures the rarity of a term in the collection; it is a measure of the general importance of the term. It inverts the document frequency.
It diminishes the weight of terms that occur very
frequently in the collection and increases the weight of
terms that occur rarely.
Gives full weight to terms that occur in one
document only.
Gives lowest weight to terms that occur in all
documents.
Terms that appear in many different documents are
less indicative of overall topic.
The more a term t occurs throughout all documents, the
more poorly that term t discriminates between documents.
If a term occurs in many of the documents in the
collection, then it does not serve well as a document
identifier and should be given low weight as a
potential index term
As the collection frequency of a term decreases, its weight increases. (Collection frequency is the total number of occurrences of a term across the entire document collection.)
Emphasis is on terms exhibiting the lowest document frequency.
Term importance is:
inversely proportional to the number of documents in which the term appears;
biased towards terms appearing in a small number of documents or items.
• E.g.: given a collection of 1,000 documents and the document frequencies below, compute the IDF for each word:

Word    N     DF    IDF
the     1000  1000  0
some    1000  100   3.322
car     1000  10    6.644
merge   1000  1     9.966

• IDF provides high values for rare words and low values for common words.
• IDF is an indication of a term's discrimination power.
• The log is used to dampen the effect relative to tf.
• What is the difference between document frequency and corpus (collection) frequency?
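A minimal sketch reproducing the IDF column of the table (base-2 logarithm, as used throughout these slides):

```python
import math

N = 1000  # total number of documents in the collection

# Document frequencies from the example table.
df = {"the": 1000, "some": 100, "car": 10, "merge": 1}

# IDF_k = log2(N / d_k)
for word, d in df.items():
    print(f"{word}: {math.log2(N / d):.3f}")
# the: 0.000, some: 3.322, car: 6.644, merge: 9.966
```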
Problems with IDF weights
IDF identifies that a term appearing in many documents is not very useful for distinguishing relevant documents from non-relevant ones.
However, this function does not take into account the frequency of a term in a given document (i.e., FREQ_ij).
That is, a term may occur in only a few documents of the collection and, at the same time, only a small number of times within those documents; IDF alone would still give it a high weight, even though an author tends to use truly important terms again and again.
Solution to the problem of IDF weights
Weights should combine two measurements:
Weights should be in direct proportion to the frequency of the term in a document (= TF):

tf_ij = FREQ_ij / max_k(FREQ_kj)

This quantifies how well a term describes the document (or its content).
Weights should be in inverse proportion to the number of documents in the collection in which the term appears (= IDF):

w_k = log2(N / d_k)
TF*IDF weighting
When does TF*IDF register a high weight? When a term t occurs many times within a small number of documents.
The highest tf*idf for a term is obtained when the term has a high term frequency (in the given document) and a low document frequency (in the whole collection of documents).
The weights hence tend to filter out common terms, lending high discriminating power to the remaining terms.
A lower TF*IDF is registered when the term occurs fewer times in a document, or occurs in many documents, offering a less pronounced relevance signal.
The lowest TF*IDF is registered when the term occurs in virtually all documents.
Computing TF-IDF: An Example
Assume the collection contains 10,000 documents and statistical analysis shows that the document frequencies (DF) of three terms are: A(50), B(1300), C(250). The term frequencies (TF) of these terms in a given document are: A(3), B(2), C(1), with a maximum term frequency of 3. Compute TF*IDF for each term.
A: tf = 3/3=1.0 idf = log2(10000/50) = 7.644; tf*idf = 7.644
B: tf = 2/3=0.667 idf = log2(10000/1300) = 2.943; tf *idf = 1.962
C: tf = 1/3=0.33 idf = log2(10000/250) = 5.322; tf*idf = 1.774
Query vector is typically treated as a document and also tf*idf
weighted.
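A minimal sketch that verifies the computation above:

```python
import math

N = 10_000                            # documents in the collection
df = {"A": 50, "B": 1300, "C": 250}   # document frequencies
freq = {"A": 3, "B": 2, "C": 1}       # raw term frequencies in the document
max_freq = max(freq.values())

for term in freq:
    tf = freq[term] / max_freq     # maximum-frequency normalized TF
    idf = math.log2(N / df[term])  # IDF with a base-2 log
    print(f"{term}: tf={tf:.3f} idf={idf:.3f} tf*idf={tf * idf:.3f}")
# A: tf=1.000 idf=7.644 tf*idf=7.644
# B: tf=0.667 idf=2.943 tf*idf=1.962
# C: tf=0.333 idf=5.322 tf*idf=1.774
```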
Another Example
Consider a document containing 100 words in which the word cow appears 3 times. Now assume we have 10 million documents and cow appears in one thousand of these.
Using length-normalized tf and a base-2 logarithm: tf = 3/100 = 0.03; idf = log2(10,000,000 / 1,000) = log2(10,000) ≈ 13.29; so tf*idf ≈ 0.03 × 13.29 ≈ 0.40.
Exercise
• Let C = number of times a given word appears in a document;
• TW = total number of words in a document;
• TD = total number of documents in a corpus; and
• DF = total number of documents containing a given word.
• Compute the TF, IDF and TF*IDF score for each term.

Word      C   TW  TD  DF  TF  IDF  TF*IDF
airplane  5   46  3   1
blue      1   46  3   1
chair     7   46  3   3
computer  3   46  3   1
forest    2   46  3   1
justice   7   46  3   3
love      2   46  3   1
might     2   46  3   1
perl      5   46  3   2
rose      6   46  3   3
shoe      4   46  3   1
thesis    2   46  3   2
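A minimal sketch that fills in the empty columns, assuming TF = C/TW and IDF = log2(TD/DF) as in the preceding slides:

```python
import math

TW, TD = 46, 3  # words per document, documents in the corpus

# word -> (C, DF), taken from the exercise table
data = {
    "airplane": (5, 1), "blue": (1, 1), "chair": (7, 3),
    "computer": (3, 1), "forest": (2, 1), "justice": (7, 3),
    "love": (2, 1), "might": (2, 1), "perl": (5, 2),
    "rose": (6, 3), "shoe": (4, 1), "thesis": (2, 2),
}

for word, (c, df) in data.items():
    tf = c / TW               # length-normalized term frequency
    idf = math.log2(TD / df)  # inverse document frequency
    print(f"{word:9s} tf={tf:.3f} idf={idf:.3f} tf*idf={tf * idf:.3f}")
```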
Concluding remarks
Suppose that, from a set of English documents, we wish to determine which ones are the most relevant to the query "the brown cow."
A simple way to start is by eliminating documents that do not contain all three words "the," "brown," and "cow," but this still leaves many documents.
To further distinguish them, we might count the number of times each term occurs in each document and sum them together; the number of times a term occurs in a document is called its TF.
However, because the term "the" is so common, this will tend to incorrectly emphasize documents which happen to use the word "the" more often, without giving enough weight to the more meaningful terms "brown" and "cow."
The term "the" is not a good keyword for distinguishing relevant from non-relevant documents, while terms like "brown" and "cow," which occur rarely, are good keywords for distinguishing relevant documents from non-relevant ones.
Hence IDF is incorporated, which diminishes the weight of terms that occur very frequently in the collection and increases the weight of terms that occur rarely.
This leads to the use of TF*IDF as a better weighting technique.
On top of that, we apply similarity measures to calculate the distance between document i and query j.
Similarity Measure
We now have vectors for all documents in the collection and a vector for the query; how do we compute similarity?
A similarity measure is a function that computes the degree of similarity or distance between a document vector and a query vector.
[Figure: document vectors D1, D2 and query vector Q plotted in a three-dimensional term space with axes t1, t2, t3.]
Using a similarity measure between the query and each document:
It is possible to rank the retrieved documents in the order of presumed relevance.
It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled.
Intuition
[Figure: document vectors d1-d5 in the term space (t1, t2, t3), with angles θ and φ between vectors; a smaller angle indicates greater similarity.]
Similarity Measure: Techniques
• There are a number of similarity measures; the most common are Euclidean distance, the inner (or dot) product, and cosine similarity.
Euclidean distance
It is the most common distance measure. Euclidean distance is the square root of the sum of squared differences between the coordinates of a pair of document and query vectors.
Dot product
The dot product is also known as the scalar product or inner product. It is defined as the sum of the products of the corresponding components of the query and document vectors.
Cosine similarity (or normalized inner product)
It projects document and query vectors into a term space and calculates the cosine of the angle between them.
Euclidean distance
Similarity between the vectors for document dj and query q can be computed as:

sim(dj, q) = |dj − q| = sqrt( Σ_{i=1..n} (w_ij − w_iq)² )

(A smaller distance means a more similar document.)
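A minimal sketch of this measure in plain Python:

```python
import math

def euclidean_distance(d, q):
    # Root of the summed squared differences between weight vectors.
    return math.sqrt(sum((wd - wq) ** 2 for wd, wq in zip(d, q)))

# Example vectors (the ones used in the last slide of this section).
print(euclidean_distance([2, 3, 5], [0, 0, 2]))  # ~4.69
```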
Inner Product
Similarity between the vectors for document dj and query q can be computed as the vector inner product:

sim(dj, q) = dj • q = Σ_{i=1..n} w_ij · w_iq
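A minimal sketch; the binary-weight example from the slide below serves as a check:

```python
def inner_product(d, q):
    # Sum of products of corresponding weight components.
    return sum(wd * wq for wd, wq in zip(d, q))

# Binary weights for D and Q over a 7-term vocabulary (see below).
D = [1, 1, 1, 0, 1, 1, 0]
Q = [1, 0, 1, 0, 0, 1, 1]
print(inner_product(D, Q))  # -> 3
```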
Properties of Inner Product
Favors long documents with a large number of unique
terms.
Again, the issue of normalization.
Measures how many terms matched but not how many
terms are not matched.
Inner Product -- Examples
Binary weights:
Size of vector = size of vocabulary = 7

     Retrieval  Database  Term  Computer  Text  Manage  Data
D        1         1       1       0       1      1      0
Q        1         0       1       0       0      1      1

sim(D, Q) = 1·1 + 1·0 + 1·1 + 0·0 + 1·0 + 1·1 + 0·1 = 3
Inner Product: Exercise
[Figure: documents d1-d7 plotted in the term space of k1, k2, k3.]

      k1  k2  k3   q • dj
d1     1   0   1     ?
d2     1   0   0     ?
d3     0   1   1     ?
d4     1   0   0     ?
d5     1   1   1     ?
d6     1   1   0     ?
d7     0   1   0     ?
q      1   2   3
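A minimal sketch that fills in the ? column of the exercise:

```python
docs = {
    "d1": [1, 0, 1], "d2": [1, 0, 0], "d3": [0, 1, 1],
    "d4": [1, 0, 0], "d5": [1, 1, 1], "d6": [1, 1, 0],
    "d7": [0, 1, 0],
}
q = [1, 2, 3]

for name, d in docs.items():
    # Inner product of the document and query weight vectors.
    print(name, sum(wd * wq for wd, wq in zip(d, q)))
# -> d1: 4, d2: 1, d3: 5, d4: 1, d5: 6, d6: 3, d7: 2
```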
Cosine similarity
Measures the similarity between dj and q captured by the cosine of the angle θ between them:

sim(dj, q) = (dj • q) / (|dj| · |q|) = Σ_{i=1..n} (w_ij · w_iq) / ( sqrt(Σ_{i=1..n} w_ij²) · sqrt(Σ_{i=1..n} w_iq²) )

Or, between two documents dj and dk:

sim(dj, dk) = (dj • dk) / (|dj| · |dk|) = Σ_{i=1..n} (w_ij · w_ik) / ( sqrt(Σ_{i=1..n} w_ij²) · sqrt(Σ_{i=1..n} w_ik²) )

[Figure: query Q and document D2 plotted in a two-dimensional term space; cos θ1 = 0.74.]

Example (term weights in three documents):

Terms      D1     D2     D3
affection  0.996  0.993  0.847
jealous    0.087  0.120  0.466
gossip     0.017  0.000  0.254
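A minimal sketch of cosine similarity, applied to the weight columns of the table above:

```python
import math

def cosine(a, b):
    # Inner product normalized by the two vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Weights for (affection, jealous, gossip) from the table.
D1 = [0.996, 0.087, 0.017]
D2 = [0.993, 0.120, 0.000]
D3 = [0.847, 0.466, 0.254]

print(cosine(D1, D2))  # ~0.999: D1 and D2 are very similar
print(cosine(D1, D3))  # ~0.889: D3 is less similar to D1
```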
Cosine Similarity vs. Inner Product
Cosine similarity measures the cosine of the angle between two vectors: it is the inner product normalized by the vector lengths.

Cosine(dj, q) = (dj • q) / (|dj| · |q|) = Σ_{i=1..t} (w_ij · w_iq) / ( sqrt(Σ_{i=1..t} w_ij²) · sqrt(Σ_{i=1..t} w_iq²) )

InnerProduct(dj, q) = dj • q

E.g., for D1 = (2, 3, 5), D2 = (3, 7, 1) and Q = (0, 0, 2):
InnerProduct(D1, Q) = 10 and InnerProduct(D2, Q) = 2;
Cosine(D1, Q) = 10 / (√38 · 2) ≈ 0.81 and Cosine(D2, Q) = 2 / (√59 · 2) ≈ 0.13.
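A minimal sketch comparing the two measures on this example; cosine simply divides the inner product by the two vector lengths:

```python
import math

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b)))

D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]

print(inner_product(D1, Q), inner_product(D2, Q))        # 10 2
print(round(cosine(D1, Q), 2), round(cosine(D2, Q), 2))  # 0.81 0.13
```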