Text Similarity Metrics
Limitations
In this assignment, we only work with the NLTK library.
There are various text similarity metrics, such as Cosine similarity, Euclidean distance, and Jaccard similarity.

Jaccard Similarity Coefficient Score

The Jaccard similarity coefficient compares two documents $A$ and $B$ by dividing the size of their intersection by the size of their union:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
The numerator, or top part of the fraction, is the intersection between the documents. In Jaccard, it is literally the count of words that appear in both documents: $|A \cap B|$.

The denominator, or bottom part of the fraction, is the union between the documents. In Jaccard, it is literally the count of all unique words that appear in either document: $|A \cup B|$.
The intersection between these two documents is the single shared word "like", so the intersection count is 1.

The union between these two documents, or the count of unique words across both, is 9.

The Jaccard similarity coefficient is therefore $\frac{1}{9} \approx 0.11$.
from sklearn.metrics import jaccard_score
print(jaccard_score(X.toarray()[0], X.toarray()[1]))
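The line above assumes a document-term matrix X already exists. Below is a minimal self-contained sketch: the two documents are hypothetical stand-ins (the lesson's original documents are not reproduced here), chosen so that, as above, the only shared word is "like" and the combined vocabulary has nine unique words. It assumes X is built with sklearn's CountVectorizer with binary=True, so each cell records word presence (0 or 1) rather than a raw count.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import jaccard_score

# Hypothetical stand-in documents: they share only the word "like"
# and together contain nine unique words.
docs = [
    "we like hot chocolate drinks",
    "they like cold iced tea",
]

vectorizer = CountVectorizer(binary=True)  # 0/1 word presence
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the nine unique words
print(X.toarray())                         # one 0/1 row per document

# intersection (1 shared word) / union (9 unique words) = 1/9
print(jaccard_score(X.toarray()[0], X.toarray()[1]))  # about 0.111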
Try out different documents to see what their similarity score is!
Euclidean Distance
Euclidean distance is something you have probably seen in a math class. The basic idea behind it is the Pythagorean theorem ($a^2 + b^2 = c^2$ for the sides $a$ and $b$ of a right triangle, where $c$ is the hypotenuse).

After 3 dimensions, distance gets hard to visualize, but you can generalize the equation to $n$ dimensions:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$
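As a worked example, the two documents' 0/1 vectors (shown explicitly in the Cosine Similarity section below) differ in 8 of their 9 positions, and each differing position contributes $(1 - 0)^2 = 1$ to the sum:

$$d = \sqrt{8} \approx 2.83$$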
from sklearn.metrics.pairwise import euclidean_distances
print(euclidean_distances(X))
[[0. 2.82842712]
[2.82842712 0. ]]
The way to read this: the top-left entry is the distance between the first vector and itself, so it makes sense that the distance is 0.

The top-right and bottom-left entries both represent the distance between the first vector and the second vector. We see that the code gets the same 2.83 number we calculated above.

The bottom-right entry is the distance between the second vector and itself, so again the distance is 0.

This representation makes more sense when you are comparing more than two documents. Try adding a third document, as in the sketch below!
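For instance, here is a sketch with a third (again hypothetical) document added, assuming the same CountVectorizer setup as in the Jaccard sketch above. euclidean_distances now returns a 3x3 matrix, still symmetric and still with zeros on the diagonal:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

# Stand-in documents; the third overlaps with both of the others.
docs = [
    "we like hot chocolate drinks",
    "they like cold iced tea",
    "we like iced tea drinks",
]

vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)

# Entry [i][j] is the distance between document i and document j;
# the diagonal is all zeros (each document's distance to itself).
print(euclidean_distances(X))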
Cosine Similarity
If Euclidean distance is the straight-line distance between vectors, cosine similarity instead measures the angle between vectors:
(Figure: two document vectors and the angle between them)
The two documents' vector representations are:

[0, 1, 1, 1, 0, 1, 1, 0, 0]
[1, 0, 0, 0, 1, 0, 1, 1, 1]
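The cosine of the angle between two vectors $A$ and $B$ is their dot product divided by the product of their lengths:

$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

These two vectors overlap in exactly one position, so their dot product is 1, and each vector contains five 1s, so each has length $\sqrt{5}$:

$$\cos(\theta) = \frac{1}{\sqrt{5} \cdot \sqrt{5}} = \frac{1}{5} = 0.2$$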
Programming Cosine Similarity with sklearn.metrics.pairwise.cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity
print(cosine_similarity(X))
[[1. 0.2]
[0.2 1. ]]
The way to read this: the top-left entry is the cosine between the first vector and itself. Cosine ranges from 1 to -1, where $\cos(0^\circ) = 1$ and $\cos(180^\circ) = -1$, so it makes sense that the cosine of a vector with itself is 1.
The top-right and bottom-left entries both represent the cosine between the first vector and the second vector. We see that the code gets the same 0.2 number we calculated above.

The bottom-right entry is the cosine between the second vector and itself, so again it makes sense that the value is 1.
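One property worth seeing in code: because cosine similarity only measures the angle between vectors, a document's length does not matter. The sketch below (again using hypothetical stand-in documents) repeats a document, which doubles its count vector but leaves its direction, and therefore the cosine, unchanged:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# The second document repeats the first one twice, so its count vector
# is exactly twice the first vector: same direction, different length.
docs = [
    "we like hot chocolate drinks",
    "we like hot chocolate drinks we like hot chocolate drinks",
]

vectorizer = CountVectorizer()  # raw word counts this time
X = vectorizer.fit_transform(docs)

print(cosine_similarity(X))  # every entry is (approximately) 1.0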