IR Chapter 2 Part II
Terms
Terms are usually stems. Terms can also be phrases, such as "Computer Science", "World Wide Web", etc.
Documents and queries are represented as vectors or "bags of words" (BOW).
Each vector holds a place for every term in the collection: position 1 corresponds to term 1, position 2 to term 2, ..., position n to term n.

Di = (wd_i1, wd_i2, ..., wd_in)
Q = (wq_1, wq_2, ..., wq_n)

where w = 0 if a term is absent.
Document Collection
A collection of n documents can be represented in the
vector space model by a term-document matrix.
An entry in the matrix corresponds to the “weight” of a
term in the document; zero means the term has no
significance in the document or it simply doesn’t exist
in the document.
       T1    T2    …    Tt
D1    w11   w21   …   wt1
D2    w12   w22   …   wt2
 ⋮      ⋮     ⋮          ⋮
Dn    w1n   w2n   …   wtn
Term frequency (TF) weights
The frequency of occurrence of a term is a useful indication
of its relative importance in describing a document.
In other words, term importance is related to frequency of
occurrence.
If term A is mentioned more than term B, then the
document is more about A than about B (assuming A
and B to be content bearing terms).
Such a measure assumes that the value, or weight, of a term assigned to a document is simply proportional to the term frequency (i.e., the frequency of occurrence of that particular term in that particular document).
The more frequently a term occurs in a document the
more likely it is to be of value in describing the content of
the document.
TF (term frequency) - Count the number of times a term occurs in a document.
f_ij = frequency of term i in document j
The more times a term t occurs in document d, the more likely it is that t is relevant to the document, i.e. more indicative of the topic.

docs   t1   t2   t3
D1      2    0    3
D2      1    0    0
D3      0    4    7
D4      3    0    0
D5      1    6    3
D6      3    5    0
D7      0    8    0
D8      0   10    0
D9      0    0    1
D10     0    3    5
D11     4    0    1
Accordingly, the weight of term i in document j, denoted by w_ij, might be determined by

w_ij = FREQ_ij

where FREQ_ij is the frequency of term i in document j.
It is a simple count of the number of occurrences of a term in a particular document (or query), and serves as a measure of term density in a document.
Despite its weaknesses, experiments have shown that this simple method gives better results than the Boolean model.
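As a quick illustration (a minimal sketch, not part of the original slides; the two toy documents are invented for the example), raw term frequencies can be computed with a simple counter:

```python
from collections import Counter

# Toy collection; the document texts are made up for illustration.
docs = {
    "D1": "information retrieval ranks documents by term weights",
    "D2": "term weights reflect term importance in a document",
}

# Raw TF: count how many times each term occurs in each document.
tf = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}

print(tf["D2"]["term"])  # -> 2
```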
Problems with Term frequency (TF)
Such a weighting system sometimes does not perform as expected, especially in cases where the high-frequency words are equally distributed throughout the collection, since it does not take into account the role of term i in any document other than document j.
This simple measure is not normalized to account for variance in the length of documents (i.e., long documents have an unfair advantage).
A one-page document with 10 mentions of A is "more about A" than a 100-page document with 20 mentions of A.
Used alone, TF favors common words and long documents: common words occur often in every document, and long documents simply contain more occurrences of every term.
Solutions to the Problems with Term frequency (TF)
Two solutions:
1. Divide each frequency count by the length of the document (length normalization). In this case the normalized frequency tf_ij is used instead of FREQ_ij.
2. Divide each frequency count by the maximum frequency count of any term in the document. The normalized tf is then given by

tf_ij = FREQ_ij / max_k(FREQ_kj)

where tf_ij is the normalized frequency of term i in document j, and max_k(FREQ_kj) is the maximum frequency of any term k in document j.
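A minimal sketch of both normalizations (the toy document is invented for the example):

```python
from collections import Counter

doc = "term weights reflect term importance in a document"
counts = Counter(doc.split())    # raw FREQ counts
doc_len = sum(counts.values())   # document length in tokens
max_freq = max(counts.values())  # frequency of the most frequent term

# Solution 1: length normalization.
tf_len = {t: c / doc_len for t, c in counts.items()}

# Solution 2: maximum-frequency normalization.
tf_max = {t: c / max_freq for t, c in counts.items()}

print(tf_len["term"], tf_max["term"])  # -> 0.25 1.0
```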
Document Frequency
Document frequency (DF) is the number of documents in the collection that contain a term.
Where:
N is the total number of documents in the collection,
d_k is the number of documents in which term k occurs, and
W_k is the weight assigned to term k (i.e., the inverse document frequency of term k):

W_k = log2(N / d_k)

That is, the weight of a term in a document is the logarithm of the number of documents in the collection divided by the number of documents in the collection that contain the term (with 2 as the base of the logarithm).
IDF measures the rarity of a term in the collection; it is a measure of the general importance of the term. It inverts the document frequency.
It diminishes the weight of terms that occur very
frequently in the collection and increases the weight of
terms that occur rarely.
Gives full weight to terms that occur in one
document only.
Gives lowest weight to terms that occur in all
documents.
Terms that appear in many different documents are
less indicative of overall topic.
The more a term t occurs throughout all documents, the
more poorly that term t discriminates between documents.
If a term occurs in many of the documents in the
collection, then it does not serve well as a document
identifier and should be given low weight as a
potential index term
As the collection frequency of a term decreases, its weight increases. (Collection frequency is the total number of occurrences of a term across the entire document collection.)
Emphasis is on terms exhibiting the lowest document frequency.
Term importance is:
inversely proportional to the number of documents in which the term appears;
biased towards terms appearing in a small number of documents or items.
• E.g.: given a collection of 1,000 documents and the document frequencies below, compute the IDF for each word:

Word    N     DF    IDF
the     1000  1000  0
some    1000  100   3.322
car     1000  10    6.644
merge   1000  1     9.966

• IDF provides high values for rare words and low values for common words.
• IDF is an indication of a term's discrimination power.
• The log is used to dampen the effect relative to tf.
• What is the difference between document frequency and corpus (collection) frequency?
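A minimal sketch reproducing the IDF column of the table (base-2 logarithm, as used throughout these slides):

```python
import math

N = 1000  # total number of documents in the collection

# Document frequencies from the example table.
df = {"the": 1000, "some": 100, "car": 10, "merge": 1}

# IDF_k = log2(N / d_k)
for word, d in df.items():
    print(f"{word}: {math.log2(N / d):.3f}")
# the: 0.000, some: 3.322, car: 6.644, merge: 9.966
```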
Problems with IDF weights
IDF identifies that a term appearing in many documents is not very useful for distinguishing relevant documents from non-relevant ones.
However, this function does not take into account the frequency of a term in a given document (i.e., FREQ_ij).
That is, a term may occur in only a few documents of the collection and, at the same time, only a small number of times within those documents; IDF alone would still give it a high weight, even though an author tends to use truly important terms again and again.
Solution to the problem of IDF weights
Weights should combine two measurements:
Weights should be in direct proportion to the frequency of the term in a document (= TF):

tf_ij = FREQ_ij / max_k(FREQ_kj)

This quantifies how well a term describes the document (or its content).
Weights should be in inverse proportion to the number of documents in the collection in which the term appears (= IDF):

w_k = log2(N / d_k)
TF*IDF weighting
When does TF*IDF register a high weight? When a term t occurs many times within a small number of documents.
The highest tf*idf for a term is obtained when the term has a high term frequency (in the given document) and a low document frequency (in the whole collection of documents).
The weights hence tend to filter out common terms, lending high discriminating power to the remaining terms.
A lower TF*IDF is registered when the term occurs fewer times in a document, or occurs in many documents, offering a less pronounced relevance signal.
The lowest TF*IDF is registered when the term occurs in virtually all documents.
Computing TF-IDF: An Example
Assume the collection contains 10,000 documents and statistical analysis shows that the document frequencies (DF) of three terms are: A(50), B(1300), C(250). The term frequencies (TF) of these terms in a given document are: A(3), B(2), C(1), with a maximum term frequency of 3. Compute TF*IDF for each term.
A: tf = 3/3=1.0 idf = log2(10000/50) = 7.644; tf*idf = 7.644
B: tf = 2/3=0.667 idf = log2(10000/1300) = 2.943; tf *idf = 1.962
C: tf = 1/3=0.33 idf = log2(10000/250) = 5.322; tf*idf = 1.774
Query vector is typically treated as a document and also tf*idf
weighted.
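A minimal sketch that verifies the computation above:

```python
import math

N = 10_000                            # documents in the collection
df = {"A": 50, "B": 1300, "C": 250}   # document frequencies
freq = {"A": 3, "B": 2, "C": 1}       # raw term frequencies in the document
max_freq = max(freq.values())

for term in freq:
    tf = freq[term] / max_freq     # maximum-frequency normalized TF
    idf = math.log2(N / df[term])  # IDF with a base-2 log
    print(f"{term}: tf={tf:.3f} idf={idf:.3f} tf*idf={tf * idf:.3f}")
# A: tf=1.000 idf=7.644 tf*idf=7.644
# B: tf=0.667 idf=2.943 tf*idf=1.962
# C: tf=0.333 idf=5.322 tf*idf=1.774
```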
Another Example
Consider a document containing 100 words in which the word cow appears 3 times. Now assume we have 10 million documents and cow appears in one thousand of these.
Using length-normalized tf and a base-2 logarithm: tf = 3/100 = 0.03; idf = log2(10,000,000 / 1,000) = log2(10,000) ≈ 13.29; so tf*idf ≈ 0.03 × 13.29 ≈ 0.40.
Exercise
• Let C = number of times a given word appears in a document;
• TW = total number of words in a document;
• TD = total number of documents in a corpus; and
• DF = total number of documents containing a given word.
• Compute the TF, IDF and TF*IDF score for each term.

Word      C   TW  TD  DF  TF  IDF  TF*IDF
airplane  5   46  3   1
blue      1   46  3   1
chair     7   46  3   3
computer  3   46  3   1
forest    2   46  3   1
justice   7   46  3   3
love      2   46  3   1
might     2   46  3   1
perl      5   46  3   2
rose      6   46  3   3
shoe      4   46  3   1
thesis    2   46  3   2
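A minimal sketch that fills in the empty columns, assuming TF = C/TW and IDF = log2(TD/DF) as in the preceding slides:

```python
import math

TW, TD = 46, 3  # words per document, documents in the corpus

# word -> (C, DF), taken from the exercise table
data = {
    "airplane": (5, 1), "blue": (1, 1), "chair": (7, 3),
    "computer": (3, 1), "forest": (2, 1), "justice": (7, 3),
    "love": (2, 1), "might": (2, 1), "perl": (5, 2),
    "rose": (6, 3), "shoe": (4, 1), "thesis": (2, 2),
}

for word, (c, df) in data.items():
    tf = c / TW               # length-normalized term frequency
    idf = math.log2(TD / df)  # inverse document frequency
    print(f"{word:9s} tf={tf:.3f} idf={idf:.3f} tf*idf={tf * idf:.3f}")
```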
Concluding remarks
Suppose that, from a set of English documents, we wish to determine which ones are the most relevant to the query "the brown cow."
A simple way to start is by eliminating documents that do not contain all three words "the," "brown," and "cow," but this still leaves many documents.
To further distinguish them, we might count the number of times each term occurs in each document and sum them together; the number of times a term occurs in a document is called its TF.
However, because the term "the" is so common, this will tend to incorrectly emphasize documents which happen to use the word "the" more often, without giving enough weight to the more meaningful terms "brown" and "cow."
The term "the" is not a good keyword for distinguishing relevant from non-relevant documents, while terms like "brown" and "cow," which occur rarely, are good keywords for distinguishing relevant documents from non-relevant ones.
Hence IDF is incorporated, which diminishes the weight of terms that occur very frequently in the collection and increases the weight of terms that occur rarely.
This leads to the use of TF*IDF as a better weighting technique.
On top of that, we apply similarity measures to calculate the distance between document i and query j.
Similarity Measure
We now have vectors for all documents in the collection and a vector for the query; how do we compute similarity?
A similarity measure is a function that computes the degree of similarity or distance between a document vector and a query vector.
[Figure: document vectors D1, D2 and query vector Q plotted in a three-dimensional term space with axes t1, t2, t3.]
Using a similarity measure between the query and each document:
It is possible to rank the retrieved documents in the order of presumed relevance.
It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled.
Intuition
[Figure: document vectors d1-d5 in the term space (t1, t2, t3), with angles θ and φ between vectors; a smaller angle indicates greater similarity.]
Similarity Measure: Techniques
• There are a number of similarity measures; the most common are Euclidean distance, the inner (or dot) product, and cosine similarity.
Euclidean distance
It is the most common distance measure. Euclidean distance is the square root of the sum of squared differences between the coordinates of a pair of document and query vectors.
Dot product
The dot product is also known as the scalar product or inner product. It is defined as the sum of the products of the corresponding components of the query and document vectors.
Cosine similarity (or normalized inner product)
It projects document and query vectors into a term space and calculates the cosine of the angle between them.
Euclidean distance
Similarity between the vectors for document dj and query q can be computed as:

sim(dj, q) = |dj − q| = sqrt( Σ_{i=1..n} (w_ij − w_iq)² )

(A smaller distance means a more similar document.)
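A minimal sketch of this measure in plain Python:

```python
import math

def euclidean_distance(d, q):
    # Root of the summed squared differences between weight vectors.
    return math.sqrt(sum((wd - wq) ** 2 for wd, wq in zip(d, q)))

# Example vectors (the ones used in the last slide of this section).
print(euclidean_distance([2, 3, 5], [0, 0, 2]))  # ~4.69
```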
Inner Product
Similarity between the vectors for document dj and query q can be computed as the vector inner product:

sim(dj, q) = dj • q = Σ_{i=1..n} w_ij · w_iq
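A minimal sketch; the binary-weight example from the slide below serves as a check:

```python
def inner_product(d, q):
    # Sum of products of corresponding weight components.
    return sum(wd * wq for wd, wq in zip(d, q))

# Binary weights for D and Q over a 7-term vocabulary (see below).
D = [1, 1, 1, 0, 1, 1, 0]
Q = [1, 0, 1, 0, 0, 1, 1]
print(inner_product(D, Q))  # -> 3
```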
Properties of Inner Product
Favors long documents with a large number of unique
terms.
Again, the issue of normalization.
Measures how many terms matched but not how many
terms are not matched.
Inner Product -- Examples
Binary weights:
Size of vector = size of vocabulary = 7

     Retrieval  Database  Term  Computer  Text  Manage  Data
D        1         1       1       0       1      1      0
Q        1         0       1       0       0      1      1

sim(D, Q) = 1·1 + 1·0 + 1·1 + 0·0 + 1·0 + 1·1 + 0·1 = 3
Inner Product: Exercise
[Figure: documents d1-d7 plotted in the term space of k1, k2, k3.]

      k1  k2  k3   q • dj
d1     1   0   1     ?
d2     1   0   0     ?
d3     0   1   1     ?
d4     1   0   0     ?
d5     1   1   1     ?
d6     1   1   0     ?
d7     0   1   0     ?
q      1   2   3
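A minimal sketch that fills in the ? column of the exercise:

```python
docs = {
    "d1": [1, 0, 1], "d2": [1, 0, 0], "d3": [0, 1, 1],
    "d4": [1, 0, 0], "d5": [1, 1, 1], "d6": [1, 1, 0],
    "d7": [0, 1, 0],
}
q = [1, 2, 3]

for name, d in docs.items():
    # Inner product of the document and query weight vectors.
    print(name, sum(wd * wq for wd, wq in zip(d, q)))
# -> d1: 4, d2: 1, d3: 5, d4: 1, d5: 6, d6: 3, d7: 2
```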
Cosine similarity
Measures the similarity between dj and q captured by the cosine of the angle θ between them:

sim(dj, q) = (dj • q) / (|dj| · |q|) = Σ_{i=1..n} (w_ij · w_iq) / ( sqrt(Σ_{i=1..n} w_ij²) · sqrt(Σ_{i=1..n} w_iq²) )

Or, between two documents dj and dk:

sim(dj, dk) = (dj • dk) / (|dj| · |dk|) = Σ_{i=1..n} (w_ij · w_ik) / ( sqrt(Σ_{i=1..n} w_ij²) · sqrt(Σ_{i=1..n} w_ik²) )

[Figure: query Q and document D2 plotted in a two-dimensional term space; cos θ1 = 0.74.]

Example (term weights in three documents):

Terms      D1     D2     D3
affection  0.996  0.993  0.847
jealous    0.087  0.120  0.466
gossip     0.017  0.000  0.254
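A minimal sketch of cosine similarity, applied to the weight columns of the table above:

```python
import math

def cosine(a, b):
    # Inner product normalized by the two vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Weights for (affection, jealous, gossip) from the table.
D1 = [0.996, 0.087, 0.017]
D2 = [0.993, 0.120, 0.000]
D3 = [0.847, 0.466, 0.254]

print(cosine(D1, D2))  # ~0.999: D1 and D2 are very similar
print(cosine(D1, D3))  # ~0.889: D3 is less similar to D1
```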
Cosine Similarity vs. Inner Product
Cosine similarity measures the cosine of the angle between two vectors: it is the inner product normalized by the vector lengths.

Cosine(dj, q) = (dj • q) / (|dj| · |q|) = Σ_{i=1..t} (w_ij · w_iq) / ( sqrt(Σ_{i=1..t} w_ij²) · sqrt(Σ_{i=1..t} w_iq²) )

InnerProduct(dj, q) = dj • q

E.g., for D1 = (2, 3, 5), D2 = (3, 7, 1) and Q = (0, 0, 2):
InnerProduct(D1, Q) = 10 and InnerProduct(D2, Q) = 2;
Cosine(D1, Q) = 10 / (√38 · 2) ≈ 0.81 and Cosine(D2, Q) = 2 / (√59 · 2) ≈ 0.13.
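A minimal sketch comparing the two measures on this example; cosine simply divides the inner product by the two vector lengths:

```python
import math

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b)))

D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]

print(inner_product(D1, Q), inner_product(D2, Q))        # 10 2
print(round(cosine(D1, Q), 2), round(cosine(D2, Q), 2))  # 0.81 0.13
```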