Similarity

This document discusses various measures of similarity and dissimilarity between data objects. It defines similarity and dissimilarity measures, and notes that proximity refers to either similarity or dissimilarity. It then discusses specific measures like Euclidean distance, Minkowski distance including L1 and L-infinity norms, Hamming distance, Jaccard similarity, and cosine similarity. It also outlines common properties that similarity and distance measures should satisfy such as symmetry, maximum value when objects are identical, and triangle inequality for distances.

Similarity and Dissimilarity Measures

• Similarity measure
• Numerical measure of how alike two data objects are.
• Higher when objects are more alike.
• Often falls in the range [0, 1].
• Dissimilarity measure
• Numerical measure of how different two data objects are.
• Lower when objects are more alike.
• Minimum dissimilarity is often 0.
• Upper limit varies.
• Proximity refers to either a similarity or a dissimilarity.
Similarity/Dissimilarity for Simple Attributes
The following table shows the similarity and dissimilarity
between two objects, x and y, with respect to a single, simple
attribute.
Euclidean Distance
• Euclidean Distance:

  dist(x, y) = √( Σₖ₌₁ⁿ (xₖ − yₖ)² )

where n is the number of dimensions (attributes) and xₖ and yₖ are,
respectively, the kth attributes (components) of data objects x and y.
• x = (3, 6, 0, 3, 6)
• y = (1, 2, 0, 1, 2)
• dist(x, y) = √((3 − 1)² + (6 − 2)² + (0 − 0)² + (3 − 1)² + (6 − 2)²) = √40 ≈ 6.325

Standardization is necessary if scales differ.
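The computation above can be sketched in a few lines of Python (a minimal illustration; the function name is ours, not a standard API):

```python
import math

def euclidean(x, y):
    # Square root of the sum of squared component differences
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

x = (3, 6, 0, 3, 6)
y = (1, 2, 0, 1, 2)
print(round(euclidean(x, y), 3))  # 6.325
```

In practice one would typically use `numpy.linalg.norm` or `scipy.spatial.distance.euclidean` rather than hand-rolling this.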


Euclidean Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

[Scatter plot of the four points omitted.]

       p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0

Distance Matrix
Minkowski Distance

• Minkowski Distance is a generalization of Euclidean Distance:

  dist(x, y) = ( Σₖ₌₁ⁿ |xₖ − yₖ|ʳ )^(1/r)

where r is a parameter, n is the number of dimensions (attributes),
and xₖ and yₖ are, respectively, the kth attributes (components) of
data objects x and y.

Minkowski Distance: Examples

• r = 1. City block (Manhattan, taxicab, L1 norm) distance.


• A common example of this for binary vectors is the Hamming
distance, which is just the number of bits that are different between
two binary vectors

• r = 2. Euclidean distance

• r → ∞. “Supremum” (Lmax norm, L∞ norm) distance.


• This is the maximum difference between any component of the
vectors
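The three special cases can be sketched with a single parameterized function (illustrative sketch; r = ∞ is handled as a special case since the general formula does not apply directly):

```python
def minkowski(x, y, r):
    # Supremum (L-infinity) norm: largest absolute component difference
    if r == float("inf"):
        return max(abs(xk - yk) for xk, yk in zip(x, y))
    # General Lr norm: (sum of |differences|^r)^(1/r)
    return sum(abs(xk - yk) ** r for xk, yk in zip(x, y)) ** (1 / r)

x, y = (0, 2), (2, 0)
print(minkowski(x, y, 1))             # 4.0 (city block)
print(round(minkowski(x, y, 2), 3))   # 2.828 (Euclidean)
print(minkowski(x, y, float("inf")))  # 2 (supremum)
```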
Hamming Distance
• Hamming distance is the number of positions in
which bit-vectors differ.
• Example: p1 = 10101
p2 = 10011.
• d(p1, p2) = 2 because the bit-vectors differ in the 3rd and 4th
positions.
• Hamming distance is the L1 norm for binary vectors.
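A direct translation of the definition (sketch; works on equal-length bit strings or sequences):

```python
def hamming(p1, p2):
    # Count the positions where the two bit-vectors differ
    return sum(b1 != b2 for b1, b2 in zip(p1, p2))

print(hamming("10101", "10011"))  # 2
```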

Distances for real vectors
• Vectors x = (x₁, …, x_d) and y = (y₁, …, y_d)
• Lp norms or Minkowski distance:

  Lp(x, y) = ( |x₁ − y₁|ᵖ + ⋯ + |x_d − y_d|ᵖ )^(1/p)

• L2 norm: Euclidean distance:

  L2(x, y) = √( (x₁ − y₁)² + ⋯ + (x_d − y_d)² )

• L1 norm: Manhattan distance:

  L1(x, y) = |x₁ − y₁| + ⋯ + |x_d − y_d|

• L∞ norm:

  L∞(x, y) = max( |x₁ − y₁|, …, |x_d − y_d| )

• The limit of Lp as p goes to infinity.
Minkowski Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1     p1  p2  p3  p4
p1     0   4   4   6
p2     4   0   2   4
p3     4   2   0   2
p4     6   4   2   0

L2     p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0

L∞     p1  p2  p3  p4
p1     0   2   3   5
p2     2   0   1   3
p3     3   1   0   2
p4     5   3   2   0

Distance Matrix
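The three matrices above can be recomputed from the point table as a quick check (sketch; the helper function is ours):

```python
def minkowski(x, y, r):
    # Lr distance between two vectors; r = inf gives the supremum norm
    if r == float("inf"):
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

points = [(0, 2), (2, 0), (3, 1), (5, 1)]  # p1..p4
for r in (1, 2, float("inf")):
    matrix = [[round(minkowski(a, b, r), 3) for b in points] for a in points]
    print(f"L{r}: {matrix}")
```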
Example of Distances
• x = (5, 5), y = (9, 8)
• L2 norm: dist(x, y) = √(4² + 3²) = 5
• L1 norm: dist(x, y) = 4 + 3 = 7
• L∞ norm: dist(x, y) = max(4, 3) = 4
Common Properties of a Distance
• Distances, such as the Euclidean distance, have
some well known properties.
1. d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 if and only if
x = y. (Positivity)
2. d(x, y) = d(y, x) for all x and y. (Symmetry)
3. d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.
(Triangle Inequality)
where d(x, y) is the distance (dissimilarity) between points
(data objects), x and y.

• A distance that satisfies these properties is a metric


Similarity Between Binary Vectors
• A common situation is that objects, x and y, have only binary
attributes
• Compute similarities using the following quantities
f01 = the number of attributes where x was 0 and y was 1
f10 = the number of attributes where x was 1 and y was 0
f00 = the number of attributes where x was 0 and y was 0
f11 = the number of attributes where x was 1 and y was 1

• Simple Matching and Jaccard Coefficients


SMC = number of matches / number of attributes
= (f11 + f00) / (f01 + f10 + f11 + f00)
J = number of 11 matches / number of non-zero attributes
= (f11) / (f01 + f10 + f11)
SMC versus Jaccard: Example
x= 1000000000
y= 0000001001

f01 = 2 (the number of attributes where x was 0 and y was 1)


f10 = 1 (the number of attributes where x was 1 and y was 0)
f00 = 7 (the number of attributes where x was 0 and y was 0)
f11 = 0 (the number of attributes where x was 1 and y was 1)

SMC = (f11 + f00) / (f01 + f10 + f11 + f00)


= (0+7) / (2+1+0+7) = 0.7

J = (f11) / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0
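The counts and both coefficients can be computed directly (sketch; the function name is illustrative):

```python
def smc_jaccard(x, y):
    # Tally agreement/disagreement counts over the binary attributes
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    smc = (f11 + f00) / (f01 + f10 + f11 + f00)
    # Jaccard ignores 0-0 matches; guard against all-zero vectors
    jaccard = f11 / (f01 + f10 + f11) if (f01 + f10 + f11) else 0.0
    return smc, jaccard

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc_jaccard(x, y))  # (0.7, 0.0)
```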


Cosine Similarity
• If d1 and d2 are two document vectors, then
  cos(d1, d2) = ⟨d1, d2⟩ / (‖d1‖ ‖d2‖),
  where ⟨d1, d2⟩ indicates the inner (dot) product of the vectors d1 and d2,
  and ‖d‖ is the length (Euclidean norm) of vector d.

  cos(X, Y) = ( Σᵢ xᵢ yᵢ ) / ( √(Σᵢ xᵢ²) · √(Σᵢ yᵢ²) )

• Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
⟨d1, d2⟩ = 3·1 + 2·0 + 0·0 + 5·0 + 0·0 + 0·0 + 0·0 + 2·1 + 0·0 + 0·2 = 5
‖d1‖ = (3·3 + 2·2 + 0·0 + 5·5 + 0·0 + 0·0 + 0·0 + 2·2 + 0·0 + 0·0)^0.5 = 42^0.5 ≈ 6.481
‖d2‖ = (1·1 + 0·0 + 0·0 + 0·0 + 0·0 + 0·0 + 0·0 + 1·1 + 0·0 + 2·2)^0.5 = 6^0.5 ≈ 2.449
cos(d1, d2) ≈ 5 / (6.481 · 2.449) ≈ 0.3150
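The worked example can be reproduced with a short function (sketch; in practice one might use `scipy.spatial.distance.cosine`, which returns 1 − similarity):

```python
import math

def cosine(d1, d2):
    # Dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine(d1, d2), 4))  # 0.315
```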
Common Properties of a Similarity
• Similarities also have some well-known properties:
1. s(x, y) = 1 (or maximum similarity) only if x = y.
2. s(x, y) = s(y, x) for all x and y. (Symmetry)

where s(x, y) is the similarity between points (data objects),


x and y.
Similarities into distances
• Jaccard distance:
  JDist(X, Y) = 1 − JSim(X, Y)

• Jaccard Distance is a metric

• Cosine distance:
  Dist(X, Y) = 1 − cos(X, Y)
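Both conversions are simple once the similarities are computed (sketch; function names are ours, and the Jaccard version assumes binary 0/1 vectors):

```python
import math

def jaccard_distance(x, y):
    # 1 - Jaccard similarity over binary vectors
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    nonzero = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return 1 - f11 / nonzero if nonzero else 0.0

def cosine_distance(x, y):
    # 1 - cosine similarity
    dot = sum(a * b for a, b in zip(x, y))
    return 1 - dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(jaccard_distance(x, y))  # 1.0
```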
