Module 3 (continued): Similarity and Dissimilarity

Measures of Similarity and Dissimilarity
Unit - II
Data Mining

● Similarity and dissimilarity are important because they are used by a number of data mining techniques
○ such as
■ clustering,
■ nearest neighbor classification, and
■ anomaly detection.
● Proximity is used to refer to either similarity or dissimilarity.
○ proximity between objects having only one simple attribute, and
○ proximity measures for objects with multiple attributes.

● Similarity between two objects is a numerical measure of the degree to which the two objects are alike.
○ Similarity is higher for objects that are more alike.
○ Similarities are usually non-negative and often fall between 0 (no similarity) and 1 (complete similarity).
● Dissimilarity between two objects is a numerical measure of the degree to which the two objects are different.
○ Dissimilarity is lower for objects that are more similar.
○ Distance is often used as a synonym for dissimilarity.

Transformations
● Transformations are often applied to
○ convert a similarity to a dissimilarity,
○ convert a dissimilarity to a similarity, or
○ transform a proximity measure to fall within a particular range, such as [0, 1].
● Example
○ Similarities between objects range from 1 (not at all similar) to 10 (completely similar).
○ We can make them fall within the range [0, 1] by using the transformation
■ s’ = (s − 1)/9
■ s - original similarity
■ s’ - new similarity value
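The rescaling in the example above can be sketched in Python. This is a minimal illustration, not part of the original slides: the function names are hypothetical, and the conversion d = 1 − s is only one common way of turning a similarity in [0, 1] into a dissimilarity.

```python
def rescale_similarity(s, old_min=1.0, old_max=10.0):
    """Map a similarity s from [old_min, old_max] onto [0, 1].

    With the defaults this is the slide's transformation s' = (s - 1) / 9.
    """
    if not old_min <= s <= old_max:
        raise ValueError("s must lie within [old_min, old_max]")
    return (s - old_min) / (old_max - old_min)


def similarity_to_dissimilarity(s):
    """Assumed conversion for a similarity already in [0, 1]: d = 1 - s."""
    return 1.0 - s


for s in (1, 5.5, 10):
    s_new = rescale_similarity(s)
    print(s, "->", s_new, "dissimilarity:", similarity_to_dissimilarity(s_new))
```

The same function handles any other original range by changing old_min and old_max.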
Dissimilarities between Data Objects

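Euclidean Distance
d(x, y) = sqrt( Σ (x_k − y_k)² ), summed over k = 1, ..., n, where n is the number of attributes and x_k and y_k are the k-th attributes of x and y.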
If d(x, y) is the distance between two points, x and y, then the following properties hold.

1. Positivity

(a) d(x, y) ≥ 0 for all x and y,

(b) d(x, y) = 0 only if x = y.

2. Symmetry

d(x, y) = d(y, x) for all x and y.

3. Triangle Inequality

d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.

Note: Measures that satisfy all three properties are known as metrics.
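As a rough illustration, the three properties can be checked numerically on a small sample of points. The helper name and the sample points below are hypothetical, and passing the check on a finite sample does not prove that a measure is a metric; it can only expose violations.

```python
import itertools
import math


def check_metric_properties(points, d, tol=1e-12):
    """Test the three metric properties of d on a finite sample of points."""
    pairs = list(itertools.product(points, repeat=2))
    triples = itertools.product(points, repeat=3)
    return {
        # d(x, y) >= 0, and d(x, y) = 0 exactly when x = y
        "positivity": all(
            d(x, y) >= 0 and ((d(x, y) <= tol) == (x == y)) for x, y in pairs
        ),
        # d(x, y) = d(y, x)
        "symmetry": all(abs(d(x, y) - d(y, x)) <= tol for x, y in pairs),
        # d(x, z) <= d(x, y) + d(y, z)
        "triangle_inequality": all(
            d(x, z) <= d(x, y) + d(y, z) + tol for x, y, z in triples
        ),
    }


# Euclidean distance (math.dist) on a few 2-D points
sample_points = [(0, 0), (3, 4), (1, 1), (-2, 5)]
result = check_metric_properties(sample_points, math.dist)
print(result)
```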


Non-metric Dissimilarities: Set Differences

If A = {1, 2, 3, 4} and B = {2, 3, 4}, then A − B = {1} and B − A = ∅, the empty set.

If d(A, B) = size(A − B), then d does not satisfy the second part of the positivity property, the symmetry property, or the triangle inequality.

The modified measure d(A, B) = size(A − B) + size(B − A) satisfies all three properties.
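The two set-based measures can be tried out with Python's built-in sets. This is a small sketch using the sets A and B from the example; the function names are made up for illustration.

```python
def set_difference_dissimilarity(a, b):
    """d(A, B) = size(A - B): not symmetric, so not a metric."""
    return len(a - b)


def symmetric_set_dissimilarity(a, b):
    """d(A, B) = size(A - B) + size(B - A): the modified measure."""
    return len(a - b) + len(b - a)


A = {1, 2, 3, 4}
B = {2, 3, 4}

print(set_difference_dissimilarity(A, B))  # size of {1}
print(set_difference_dissimilarity(B, A))  # size of the empty set
print(symmetric_set_dissimilarity(A, B), symmetric_set_dissimilarity(B, A))
```

Note that the first measure gives d(B, A) = 0 even though A and B differ, which is the violation of positivity (b) mentioned above.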


Non-metric Dissimilarities: Time
A dissimilarity measure that is not a metric, but is still useful:

d(1PM, 2PM) = 1 hour

d(2PM, 1PM) = 23 hours

● Example: this measure answers the question: “If an event occurs at 1PM every day, and it is now 2PM, how long do I have to wait for that event to occur again?”
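One way to express this time measure in code is with modular arithmetic on a 24-hour clock. The sketch below assumes whole hours written as 0 to 23 (so 1PM is 13 and 2PM is 14); the function name is hypothetical.

```python
def hours_until(now_hour, event_hour):
    """Hours to wait from now_hour until the next occurrence of event_hour.

    Both arguments are hours on a 24-hour clock (0-23).
    """
    return (event_hour - now_hour) % 24


print(hours_until(13, 14))  # d(1PM, 2PM)
print(hours_until(14, 13))  # d(2PM, 1PM)
```

Because hours_until(13, 14) and hours_until(14, 13) differ, the measure is not symmetric and therefore not a metric.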
Distance in Python
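The original code for this slide is not available in the text, so the following is only a minimal sketch of how Euclidean distance could be computed in Python. It shows a hand-written version next to the standard library function math.dist (Python 3.8+); the example points are made up.

```python
import math


def euclidean_distance(x, y):
    """Square root of the sum of squared attribute differences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of attributes")
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))


x = (1, 2, 3)
y = (4, 6, 3)

manual = euclidean_distance(x, y)
builtin = math.dist(x, y)  # standard library equivalent
print("manual:", manual)
print("math.dist:", builtin)
```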
Similarities between Data Objects
● Typical properties of similarities are the following:
○ 1. s(x, y) = 1 only if x = y. (0 ≤ s ≤ 1)
○ 2. s(x, y) = s(y, x) for all x and y. (Symmetry)
● A Non-symmetric Similarity Measure
○ Consider an experiment in which a small set of characters is flashed on a screen and must be classified.
○ The confusion matrix records how often each character is classified as itself, and how often each is classified as another character.
○ “0” appeared 200 times and was classified as
■ “0” 160 times,
■ “o” 40 times.
○ “o” appeared 200 times and was classified as
■ “o” 170 times,
■ “0” only 30 times.
● A similarity measure can be made symmetric by setting
○ s’(x, y) = s’(y, x) = (s(x, y) + s(y, x))/2,
■ s’ - new similarity measure.
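The symmetrization can be illustrated with the confusion counts from the example. Using the raw counts directly as the similarity s is an assumption made here for simplicity (they could equally be divided by the 200 presentations); the dictionary layout and function name are hypothetical.

```python
# confusion[(shown, classified_as)] = number of times
confusion = {
    ("0", "0"): 160,
    ("0", "o"): 40,
    ("o", "o"): 170,
    ("o", "0"): 30,
}


def symmetrized(s, x, y):
    """s'(x, y) = s'(y, x) = (s(x, y) + s(y, x)) / 2"""
    return (s[(x, y)] + s[(y, x)]) / 2


print(confusion[("0", "o")], confusion[("o", "0")])  # original, not symmetric
print(symmetrized(confusion, "0", "o"), symmetrized(confusion, "o", "0"))
```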
Examples of proximity measures
● Similarity Measures for Binary Data
○ Similarity measures between objects that contain only binary attributes are called similarity coefficients.
○ Let x and y be two objects that consist of n binary attributes.
○ The comparison of two objects (or two binary vectors) leads to the following four quantities (frequencies):
f00 = the number of attributes where x is 0 and y is 0
f01 = the number of attributes where x is 0 and y is 1
f10 = the number of attributes where x is 1 and y is 0
f11 = the number of attributes where x is 1 and y is 1
Simple Matching Coefficient (SMC)
SMC = (f11 + f00) / (f01 + f10 + f11 + f00)

Jaccard Coefficient
J = f11 / (f01 + f10 + f11)
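Both coefficients can be computed from the four frequencies f00, f01, f10 and f11. The sketch below uses the standard definitions, SMC = (f11 + f00) / (f01 + f10 + f11 + f00) and J = f11 / (f01 + f10 + f11). The function names and example vectors are illustrative, and returning 0.0 when the Jaccard denominator is zero is an assumed convention.

```python
def binary_frequencies(x, y):
    """Return (f00, f01, f10, f11) for two equal-length binary vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of attributes")
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    return f00, f01, f10, f11


def simple_matching_coefficient(x, y):
    """Matching attributes (0-0 and 1-1) divided by all attributes."""
    f00, f01, f10, f11 = binary_frequencies(x, y)
    return (f11 + f00) / (f00 + f01 + f10 + f11)


def jaccard_coefficient(x, y):
    """1-1 matches divided by attributes that are not 0-0 matches."""
    f00, f01, f10, f11 = binary_frequencies(x, y)
    denominator = f01 + f10 + f11
    # Assumed convention: two all-zero vectors get a Jaccard value of 0.0
    return f11 / denominator if denominator else 0.0


x = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)

print("frequencies:", binary_frequencies(x, y))
print("SMC:", simple_matching_coefficient(x, y))
print("Jaccard:", jaccard_coefficient(x, y))
```

The example shows why the two can disagree: SMC counts the many 0-0 matches as agreement, while Jaccard ignores them.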
Cosine similarity (Document similarity)
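If x and y are two document vectors, then cos(x, y) = (x · y) / (‖x‖ ‖y‖), where x · y is the dot product of x and y, and ‖x‖ is the length (Euclidean norm) of x.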

# import required libraries
import numpy as np
from numpy.linalg import norm

# define two arrays
A = np.array([2, 1, 2, 3, 2, 9])
B = np.array([3, 4, 2, 4, 5, 5])

print("A:", A)
print("B:", B)

# compute cosine similarity: dot product divided by the product of the vector lengths
cosine = np.dot(A, B) / (norm(A) * norm(B))
print("Cosine Similarity:", cosine)

● Cosine similarity is the cosine of the angle between x and y.
● Cosine similarity = 1: the angle is 0°, and x and y are the same except for magnitude (length).
● Cosine similarity = 0: the angle is 90°, and x and y do not share any terms (words).

Note: Dividing x and y by their lengths normalizes them to have a length of 1, which means that cosine similarity does not take the magnitude of the two vectors into account.
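The effect of this normalization can be seen in a short sketch using only the standard library. The function names and the example vectors are hypothetical: y is simply x scaled by 2, so the two point in the same direction but differ in length.

```python
import math


def normalize(v):
    """Divide a vector by its Euclidean length so that it has length 1."""
    length = math.sqrt(sum(c * c for c in v))
    if length == 0:
        raise ValueError("cannot normalize a zero vector")
    return tuple(c / length for c in v)


def cosine_similarity(x, y):
    """Dot product of the two length-normalized vectors."""
    return sum(a * b for a, b in zip(normalize(x), normalize(y)))


x = (3, 4)
y = (6, 8)  # same direction as x, twice the length

print("normalized x:", normalize(x))
print("normalized y:", normalize(y))
print("cosine similarity:", cosine_similarity(x, y))
```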

Extended Jaccard Coefficient (Tanimoto Coefficient)
EJ(x, y) = (x · y) / (‖x‖² + ‖y‖² − x · y)
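A minimal sketch of the extended Jaccard coefficient, using the standard definition EJ(x, y) = (x · y) / (‖x‖² + ‖y‖² − x · y). The function name is hypothetical, and returning 0.0 for two zero vectors is an assumed convention. For binary vectors the result reduces to the ordinary Jaccard coefficient, which the example illustrates.

```python
def tanimoto_coefficient(x, y):
    """Extended Jaccard: x.y / (|x|^2 + |y|^2 - x.y)"""
    dot = sum(a * b for a, b in zip(x, y))
    x_sq = sum(a * a for a in x)
    y_sq = sum(b * b for b in y)
    denominator = x_sq + y_sq - dot
    # Assumed convention: two all-zero vectors get a value of 0.0
    return dot / denominator if denominator else 0.0


# Binary vectors: f11 = 1, f10 = 1, f01 = 1, so Jaccard = 1/3
binary_x = (1, 0, 1)
binary_y = (1, 1, 0)
print(tanimoto_coefficient(binary_x, binary_y))

# Non-binary (e.g. term-count) vectors
print(tanimoto_coefficient((2, 0, 1), (1, 1, 0)))
```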


Pearson’s correlation
● The more tightly linear the relationship between two variables X and Y, the closer Pearson's correlation coefficient (PCC) is to −1 or +1.
○ PCC = −1 if the relationship is perfectly linear and negative:
■ an increase in the value of one variable decreases the value of the other variable.
○ PCC = +1 if the relationship is perfectly linear and positive:
■ an increase in the value of one variable increases the value of the other variable.
○ PCC = 0 if the variables are linearly uncorrelated (no linear relationship).
Pearson’s correlation (scipy.stats.pearsonr() - automatic)
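The slide's original code is not available in the text. A minimal sketch of the library route, assuming SciPy is installed, could look like this; the data values are made up for illustration. scipy.stats.pearsonr returns the correlation coefficient together with a p-value.

```python
from scipy.stats import pearsonr

# Illustrative data (not from the slides)
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

# pearsonr returns the correlation coefficient and a two-sided p-value
r, p_value = pearsonr(x, y)
print("Pearson correlation:", r)
print("p-value:", p_value)
```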
Pearson’s correlation (manual in python)
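The slide's original code is not available in the text, so the following is only a sketch of a manual computation using the standard definition: the covariance of x and y divided by the product of their standard deviations. The function name and data are illustrative. The 1/(n − 1) factors of the covariance and the standard deviations cancel, so they are omitted.

```python
import math


def pearson_correlation(x, y):
    """corr(x, y) = covariance(x, y) / (std(x) * std(y))"""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of products of deviations from the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / (dev_x * dev_y)


# Perfect negative linear relationship: y = -x / 3
print(pearson_correlation([-3, 6, 0, 3, -6], [1, -2, 0, -1, 2]))

# Non-linear relationship y = x^2: no linear correlation
xs = [-3, -2, -1, 0, 1, 2, 3]
print(pearson_correlation(xs, [v * v for v in xs]))
```

The second example shows that a correlation of 0 does not mean the variables are unrelated, only that there is no linear relationship.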
