Similarity

This document discusses various measures of similarity and dissimilarity between data objects. It defines similarity and dissimilarity measures, and notes that proximity refers to either similarity or dissimilarity. It then discusses specific measures like Euclidean distance, Minkowski distance including L1 and L-infinity norms, Hamming distance, Jaccard similarity, and cosine similarity. It also outlines common properties that similarity and distance measures should satisfy such as symmetry, maximum value when objects are identical, and triangle inequality for distances.

Similarity and Dissimilarity Measures

• Similarity measure
• Numerical measure of how alike two data objects are.
• Higher when objects are more alike.
• Often falls in the range [0, 1].
• Dissimilarity measure
• Numerical measure of how different two data objects are.
• Lower when objects are more alike.
• Minimum dissimilarity is often 0.
• Upper limit varies.
• Proximity refers to either a similarity or a dissimilarity.
Similarity/Dissimilarity for Simple Attributes
The following table shows the similarity and dissimilarity
between two objects, x and y, with respect to a single, simple
attribute.
Euclidean Distance
• Euclidean Distance:

  dist(x, y) = √( Σₖ₌₁ⁿ (xₖ − yₖ)² )

where n is the number of dimensions (attributes) and xₖ and yₖ are,
respectively, the kth attributes (components) of data objects x and y.
• x = (3, 6, 0, 3, 6)
• y = (1, 2, 0, 1, 2)
• dist(x, y) = √((3 − 1)² + (6 − 2)² + (0 − 0)² + (3 − 1)² + (6 − 2)²) = √40 ≈ 6.325

Standardization is necessary if scales differ.
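The computation above can be sketched in a few lines of Python (a minimal illustration; the function name is ours, not a standard API):

```python
import math

def euclidean(x, y):
    # Square root of the sum of squared component differences
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

x = (3, 6, 0, 3, 6)
y = (1, 2, 0, 1, 2)
print(round(euclidean(x, y), 3))  # 6.325
```

In practice one would typically use `numpy.linalg.norm` or `scipy.spatial.distance.euclidean` rather than hand-rolling this.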


Euclidean Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

[Scatter plot of the four points omitted.]

       p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0

Distance Matrix
Minkowski Distance

• Minkowski Distance is a generalization of Euclidean Distance:

  dist(x, y) = ( Σₖ₌₁ⁿ |xₖ − yₖ|ʳ )^(1/r)

where r is a parameter, n is the number of dimensions (attributes),
and xₖ and yₖ are, respectively, the kth attributes (components) of
data objects x and y.

Minkowski Distance: Examples

• r = 1. City block (Manhattan, taxicab, L1 norm) distance.


• A common example of this for binary vectors is the Hamming
distance, which is just the number of bits that are different between
two binary vectors

• r = 2. Euclidean distance

• r → ∞. “Supremum” (Lmax norm, L∞ norm) distance.


• This is the maximum difference between any component of the
vectors
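The three special cases can be sketched with a single parameterized function (illustrative sketch; r = ∞ is handled as a special case since the general formula does not apply directly):

```python
def minkowski(x, y, r):
    # Supremum (L-infinity) norm: largest absolute component difference
    if r == float("inf"):
        return max(abs(xk - yk) for xk, yk in zip(x, y))
    # General Lr norm: (sum of |differences|^r)^(1/r)
    return sum(abs(xk - yk) ** r for xk, yk in zip(x, y)) ** (1 / r)

x, y = (0, 2), (2, 0)
print(minkowski(x, y, 1))             # 4.0 (city block)
print(round(minkowski(x, y, 2), 3))   # 2.828 (Euclidean)
print(minkowski(x, y, float("inf")))  # 2 (supremum)
```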
Hamming Distance
• Hamming distance is the number of positions in
which bit-vectors differ.
• Example: p1 = 10101
p2 = 10011.
• d(p1, p2) = 2 because the bit-vectors differ in the 3rd and 4th
positions.
• Hamming distance is the L1 norm for binary vectors.
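A direct translation of the definition (sketch; works on equal-length bit strings or sequences):

```python
def hamming(p1, p2):
    # Count the positions where the two bit-vectors differ
    return sum(b1 != b2 for b1, b2 in zip(p1, p2))

print(hamming("10101", "10011"))  # 2
```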

Distances for real vectors
• Vectors x = (x₁, …, x_d) and y = (y₁, …, y_d)
• Lp norms or Minkowski distance:

  Lp(x, y) = ( |x₁ − y₁|ᵖ + ⋯ + |x_d − y_d|ᵖ )^(1/p)

• L2 norm: Euclidean distance:

  L2(x, y) = √( (x₁ − y₁)² + ⋯ + (x_d − y_d)² )

• L1 norm: Manhattan distance:

  L1(x, y) = |x₁ − y₁| + ⋯ + |x_d − y_d|

• L∞ norm:

  L∞(x, y) = max( |x₁ − y₁|, …, |x_d − y_d| )

• The limit of Lp as p goes to infinity.
Minkowski Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1     p1  p2  p3  p4
p1     0   4   4   6
p2     4   0   2   4
p3     4   2   0   2
p4     6   4   2   0

L2     p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0

L∞     p1  p2  p3  p4
p1     0   2   3   5
p2     2   0   1   3
p3     3   1   0   2
p4     5   3   2   0

Distance Matrix
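The three matrices above can be recomputed from the point table as a quick check (sketch; the helper function is ours):

```python
def minkowski(x, y, r):
    # Lr distance between two vectors; r = inf gives the supremum norm
    if r == float("inf"):
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

points = [(0, 2), (2, 0), (3, 1), (5, 1)]  # p1..p4
for r in (1, 2, float("inf")):
    matrix = [[round(minkowski(a, b, r), 3) for b in points] for a in points]
    print(f"L{r}: {matrix}")
```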
Example of Distances
• x = (5, 5), y = (9, 8)
• L2 norm: dist(x, y) = √(4² + 3²) = 5
• L1 norm: dist(x, y) = 4 + 3 = 7
• L∞ norm: dist(x, y) = max(4, 3) = 4
Common Properties of a Distance
• Distances, such as the Euclidean distance, have
some well known properties.
1. d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 if and only if
x = y. (Positivity)
2. d(x, y) = d(y, x) for all x and y. (Symmetry)
3. d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.
(Triangle Inequality)
where d(x, y) is the distance (dissimilarity) between points
(data objects), x and y.

• A distance that satisfies these properties is a metric


Similarity Between Binary Vectors
• A common situation is that objects, x and y, have only binary
attributes
• Compute similarities using the following quantities
f01 = the number of attributes where x was 0 and y was 1
f10 = the number of attributes where x was 1 and y was 0
f00 = the number of attributes where x was 0 and y was 0
f11 = the number of attributes where x was 1 and y was 1

• Simple Matching and Jaccard Coefficients


SMC = number of matches / number of attributes
= (f11 + f00) / (f01 + f10 + f11 + f00)
J = number of 11 matches / number of non-zero attributes
= (f11) / (f01 + f10 + f11)
SMC versus Jaccard: Example
x= 1000000000
y= 0000001001

f01 = 2 (the number of attributes where x was 0 and y was 1)


f10 = 1 (the number of attributes where x was 1 and y was 0)
f00 = 7 (the number of attributes where x was 0 and y was 0)
f11 = 0 (the number of attributes where x was 1 and y was 1)

SMC = (f11 + f00) / (f01 + f10 + f11 + f00)


= (0+7) / (2+1+0+7) = 0.7

J = (f11) / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0
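The counts and both coefficients can be computed directly (sketch; the function name is illustrative):

```python
def smc_jaccard(x, y):
    # Tally agreement/disagreement counts over the binary attributes
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    smc = (f11 + f00) / (f01 + f10 + f11 + f00)
    # Jaccard ignores 0-0 matches; guard against all-zero vectors
    jaccard = f11 / (f01 + f10 + f11) if (f01 + f10 + f11) else 0.0
    return smc, jaccard

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc_jaccard(x, y))  # (0.7, 0.0)
```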


Cosine Similarity
• If d1 and d2 are two document vectors, then
  cos(d1, d2) = ⟨d1, d2⟩ / (‖d1‖ ‖d2‖),
  where ⟨d1, d2⟩ indicates the inner (dot) product of the vectors d1 and d2,
  and ‖d‖ is the length (Euclidean norm) of vector d.

  cos(X, Y) = ( Σᵢ xᵢ yᵢ ) / ( √(Σᵢ xᵢ²) · √(Σᵢ yᵢ²) )

• Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
⟨d1, d2⟩ = 3·1 + 2·0 + 0·0 + 5·0 + 0·0 + 0·0 + 0·0 + 2·1 + 0·0 + 0·2 = 5
‖d1‖ = (3·3 + 2·2 + 0·0 + 5·5 + 0·0 + 0·0 + 0·0 + 2·2 + 0·0 + 0·0)^0.5 = 42^0.5 ≈ 6.481
‖d2‖ = (1·1 + 0·0 + 0·0 + 0·0 + 0·0 + 0·0 + 0·0 + 1·1 + 0·0 + 2·2)^0.5 = 6^0.5 ≈ 2.449
cos(d1, d2) ≈ 5 / (6.481 · 2.449) ≈ 0.3150
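The worked example can be reproduced with a short function (sketch; in practice one might use `scipy.spatial.distance.cosine`, which returns 1 − similarity):

```python
import math

def cosine(d1, d2):
    # Dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine(d1, d2), 4))  # 0.315
```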
Common Properties of a Similarity
• Similarities also have some well-known properties:
1. s(x, y) = 1 (or maximum similarity) only if x = y.
2. s(x, y) = s(y, x) for all x and y. (Symmetry)

where s(x, y) is the similarity between points (data objects),


x and y.
Similarities into distances
• Jaccard distance:
  JDist(X, Y) = 1 − JSim(X, Y)

• Jaccard Distance is a metric

• Cosine distance:
  Dist(X, Y) = 1 − cos(X, Y)
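Both conversions are simple once the similarities are computed (sketch; function names are ours, and the Jaccard version assumes binary 0/1 vectors):

```python
import math

def jaccard_distance(x, y):
    # 1 - Jaccard similarity over binary vectors
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    nonzero = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return 1 - f11 / nonzero if nonzero else 0.0

def cosine_distance(x, y):
    # 1 - cosine similarity
    dot = sum(a * b for a, b in zip(x, y))
    return 1 - dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(jaccard_distance(x, y))  # 1.0
```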
