Text Similarity Metrics
Limitations
In this assignment, we only work with the NLTK library.
There are various text similarity metrics, such as Cosine similarity, Euclidean distance, and Jaccard similarity.

Jaccard Similarity Coefficient Score

The Jaccard similarity coefficient compares two documents $A$ and $B$ by dividing the size of their intersection by the size of their union:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
The numerator, or top part of the fraction, is the intersection between the documents. In Jaccard, it is literally the count of words that appear in both documents: $|A \cap B|$.

The denominator, or bottom part of the fraction, is the union between the documents. In Jaccard, it is literally the count of all unique words that appear in either document: $|A \cup B|$.
The intersection between these two documents is the single shared word "like", so the intersection count is 1.

The union between these two documents, or the count of unique words across both, is 9.

The Jaccard similarity coefficient is therefore $\frac{1}{9} \approx 0.11$.
from sklearn.metrics import jaccard_score
print(jaccard_score(X.toarray()[0], X.toarray()[1]))
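The line above assumes a document-term matrix X already exists. Below is a minimal self-contained sketch: the two documents are hypothetical stand-ins (the lesson's original documents are not reproduced here), chosen so that, as above, the only shared word is "like" and the combined vocabulary has nine unique words. It assumes X is built with sklearn's CountVectorizer with binary=True, so each cell records word presence (0 or 1) rather than a raw count.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import jaccard_score

# Hypothetical stand-in documents: they share only the word "like"
# and together contain nine unique words.
docs = [
    "we like hot chocolate drinks",
    "they like cold iced tea",
]

vectorizer = CountVectorizer(binary=True)  # 0/1 word presence
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the nine unique words
print(X.toarray())                         # one 0/1 row per document

# intersection (1 shared word) / union (9 unique words) = 1/9
print(jaccard_score(X.toarray()[0], X.toarray()[1]))  # about 0.111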
Try out different documents to see what their similarity score is!
Euclidean Distance
Euclidean distance is something you have probably seen in a math class. The basic idea behind it is the Pythagorean theorem ($a^2 + b^2 = c^2$ for the sides $a$ and $b$ of a right triangle, where $c$ is the hypotenuse).

After 3 dimensions, distance gets hard to visualize, but you can generalize the equation to $n$ dimensions:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$
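As a worked example, the two documents' 0/1 vectors (shown explicitly in the Cosine Similarity section below) differ in 8 of their 9 positions, and each differing position contributes $(1 - 0)^2 = 1$ to the sum:

$$d = \sqrt{8} \approx 2.83$$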
from sklearn.metrics.pairwise import euclidean_distances
print(euclidean_distances(X))
[[0. 2.82842712]
[2.82842712 0. ]]
The way to read this: the top-left entry is the distance between the first vector and itself, so it makes sense that the distance is 0.

The top-right and bottom-left entries both represent the distance between the first vector and the second vector. We see that the code gets the same 2.83 number we calculated above.

The bottom-right entry is the distance between the second vector and itself, so again the distance is 0.

This representation makes more sense when you are comparing more than two documents. Try adding a third document, as in the sketch below!
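For instance, here is a sketch with a third (again hypothetical) document added, assuming the same CountVectorizer setup as in the Jaccard sketch above. euclidean_distances now returns a 3x3 matrix, still symmetric and still with zeros on the diagonal:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

# Stand-in documents; the third overlaps with both of the others.
docs = [
    "we like hot chocolate drinks",
    "they like cold iced tea",
    "we like iced tea drinks",
]

vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)

# Entry [i][j] is the distance between document i and document j;
# the diagonal is all zeros (each document's distance to itself).
print(euclidean_distances(X))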
Cosine Similarity
If Euclidean distance is the straight-line distance between vectors, cosine similarity instead measures the angle between vectors:
(Figure: two document vectors and the angle between them)
The two documents' vector representations are:

[0, 1, 1, 1, 0, 1, 1, 0, 0]
[1, 0, 0, 0, 1, 0, 1, 1, 1]
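The cosine of the angle between two vectors $A$ and $B$ is their dot product divided by the product of their lengths:

$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

These two vectors overlap in exactly one position, so their dot product is 1, and each vector contains five 1s, so each has length $\sqrt{5}$:

$$\cos(\theta) = \frac{1}{\sqrt{5} \cdot \sqrt{5}} = \frac{1}{5} = 0.2$$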
Programming Cosine Similarity with sklearn.metrics.pairwise.cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity
print(cosine_similarity(X))
[[1. 0.2]
[0.2 1. ]]
The way to read this: the top-left entry is the cosine between the first vector and itself. Cosine ranges from 1 to -1, where $\cos(0^\circ) = 1$ and $\cos(180^\circ) = -1$, so it makes sense that the cosine of a vector with itself is 1.
The top-right and bottom-left entries both represent the cosine between the first vector and the second vector. We see that the code gets the same 0.2 number we calculated above.

The bottom-right entry is the cosine between the second vector and itself, so again it makes sense that the value is 1.
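One property worth seeing in code: because cosine similarity only measures the angle between vectors, a document's length does not matter. The sketch below (again using hypothetical stand-in documents) repeats a document, which doubles its count vector but leaves its direction, and therefore the cosine, unchanged:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# The second document repeats the first one twice, so its count vector
# is exactly twice the first vector: same direction, different length.
docs = [
    "we like hot chocolate drinks",
    "we like hot chocolate drinks we like hot chocolate drinks",
]

vectorizer = CountVectorizer()  # raw word counts this time
X = vectorizer.fit_transform(docs)

print(cosine_similarity(X))  # every entry is (approximately) 1.0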