ML12 Clustering

Clustering

• Cluster: a collection of data objects
  • Similar to one another within the same cluster
  • Dissimilar to the objects in other clusters
• Cluster analysis: the natural grouping of a set of data objects into clusters
• Clustering is unsupervised classification: there are no predefined classes or class labels
What Is Good Clustering?

• High intra-class similarity and low inter-class similarity
• Appropriateness of the method for the dataset, of the (dis)similarity measure used, and of its implementation
• Ability to discover some or all of the hidden patterns
Applications of Clustering Algorithms
• Pattern recognition
• Spatial data analysis, e.g., creating thematic maps in GIS by clustering feature spaces
• Image processing
• Business applications, e.g., customer segmentation
• Web applications, e.g., document classification, or clustering weblog data to discover groups of similar access patterns
Data Structures
• Data matrix (n objects × p variables):

  $\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$

• Dissimilarity matrix (n × n, lower triangular):

  $\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$
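The two structures are easy to relate in code: SciPy's pdist computes the pairwise dissimilarities from a data matrix, and squareform expands the condensed result into the full dissimilarity matrix. A minimal sketch (the data values below are made up for illustration):

import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1.0, 2.0],    # data matrix: n = 3 objects, p = 2 variables
              [2.0, 0.0],
              [5.0, 4.0]])

d = pdist(X, metric="euclidean")   # condensed vector of pairwise distances
D = squareform(d)                  # n x n dissimilarity matrix, zeros on the diagonal
print(D)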
Nominal (categorical) variables
• A generalization of the binary variable in that it can take more than two states, e.g., red, yellow, blue, green
• Method: simple matching
  • m: number of matches, p: total number of variables

  $d(i, j) = \dfrac{p - m}{p}$
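A one-line version of the simple-matching dissimilarity, sketched in Python (the two example objects are made up):

def nominal_dissimilarity(x, y):
    # Simple matching: d(i, j) = (p - m) / p, where m is the number of matching variables
    p = len(x)
    m = sum(a == b for a, b in zip(x, y))
    return (p - m) / p

print(nominal_dissimilarity(["red", "small", "round"], ["red", "large", "round"]))  # (3 - 2) / 3 ≈ 0.33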
Ordinal variables
• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank $r_{if} \in \{1, \ldots, M_f\}$
• Map the range of each variable onto [0, 1] by replacing the rank of the i-th object on the f-th variable by

  $z_{if} = \dfrac{r_{if} - 1}{M_f - 1}$

• Compute the dissimilarity using methods for interval-scaled variables
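The rank-to-[0, 1] mapping above is a one-liner; a quick sketch, assuming the ranks are already known:

def ordinal_to_unit_interval(r, M):
    # Map a rank r in {1, ..., M} onto [0, 1] via z = (r - 1) / (M - 1)
    return (r - 1) / (M - 1)

# ranks of three objects on a variable with M_f = 5 ordered states
print([ordinal_to_unit_interval(r, 5) for r in (1, 3, 5)])   # [0.0, 0.5, 1.0]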
Binary Variables
• A contingency table for binary data counts, for a pair of objects i and j: a = the number of variables where both are 1, b = where i is 1 and j is 0, c = where i is 0 and j is 1, and d = where both are 0
• Simple matching coefficient (invariant if the binary variable is symmetric):

  $\text{Symmetric: } d(i, j) = \dfrac{b + c}{a + b + c + d}$

• Jaccard coefficient (non-invariant if the binary variable is asymmetric):

  $\text{Asymmetric: } d(i, j) = \dfrac{b + c}{a + b + c}$

• In cases where the two binary states are not equally important, as with asymmetric binary data, the positive matches are usually considered more significant than the negative matches, so the 0–0 matches (d) are dropped from the denominator.
Dissimilarity between Binary Variables
Example
• The attributes are asymmetric binary
• Let the values Y and P be mapped to 1, and the value N to 0

Name  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack    Y      N      P       N       N       N
Mary    Y      N      P       N       P       N
Jim     Y      P      N       N       N       N

$d(\text{jack}, \text{mary}) = \dfrac{0 + 1}{2 + 0 + 1} = 0.33$

$d(\text{jack}, \text{jim}) = \dfrac{1 + 1}{1 + 1 + 1} = 0.67$

$d(\text{jim}, \text{mary}) = \dfrac{1 + 2}{1 + 1 + 2} = 0.75$
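These three values can be reproduced in a few lines of Python; the encoding (Y/P → 1, N → 0) follows the example, while the helper function is only an illustrative sketch of the asymmetric (Jaccard) dissimilarity:

import numpy as np

# Binary encoding of the example: Y/P -> 1, N -> 0 (Fever, Cough, Test-1..Test-4)
patients = {
    "jack": np.array([1, 0, 1, 0, 0, 0]),
    "mary": np.array([1, 0, 1, 0, 1, 0]),
    "jim":  np.array([1, 1, 0, 0, 0, 0]),
}

def asymmetric_dissim(i, j):
    # Jaccard-style dissimilarity d = (b + c) / (a + b + c), ignoring 0-0 matches
    a = np.sum((i == 1) & (j == 1))   # positive matches
    b = np.sum((i == 1) & (j == 0))
    c = np.sum((i == 0) & (j == 1))
    return (b + c) / (a + b + c)

print(asymmetric_dissim(patients["jack"], patients["mary"]))  # 0.33
print(asymmetric_dissim(patients["jack"], patients["jim"]))   # 0.67
print(asymmetric_dissim(patients["jim"],  patients["mary"]))  # 0.75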
Example: Calculating the values of a similarity coefficient
Suppose five individuals possess the characteristics listed in the accompanying table.
• The scores for individuals 1 and 2 on the p = 6 binary variables are tabulated, and the numbers of matches and mismatches are entered in a two-way array.
• Employing similarity coefficient 1, which gives equal weight to matches, gives the similarity between individuals 1 and 2. (Note: similarity is computed here instead of dissimilarity.)
• Calculate the remaining similarity coefficients for the other pairs of individuals.
• Based on the magnitudes of the similarity coefficients, individuals 2 and 5 are most similar and individuals 1 and 5 are least similar.
Similarity and Dissimilarity Between Objects

• Distances are normally used to measure the similarity or dissimilarity between two data objects (the data are generally standardized or normalized before computing distances).
• A popular family is the Minkowski distance:

  $d(i, j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \right)^{1/q}$

  where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer.
• If q = 1, d is the Manhattan distance:

  $d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$

• If q = 2, d is the Euclidean distance:

  $d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$

• Properties of a distance metric:
  • d(i, j) ≥ 0
  • d(i, i) = 0
  • d(i, j) = d(j, i)
  • d(i, j) ≤ d(i, k) + d(k, j)

• One can also use weighted distances, the parametric Pearson product-moment correlation, or other similarity measures.
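For numeric data these distances are available off the shelf in SciPy; a small sketch on two made-up 3-dimensional objects:

import numpy as np
from scipy.spatial.distance import minkowski, cityblock, euclidean

i = np.array([1.0, 2.0, 3.0])
j = np.array([4.0, 0.0, 3.0])

print(cityblock(i, j))        # Manhattan distance (q = 1): 5.0
print(euclidean(i, j))        # Euclidean distance (q = 2): ~3.61
print(minkowski(i, j, p=3))   # Minkowski distance with q = 3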
Variables of Mixed Types
• A database may contain mixed types of variables
• One may use a weighted formula to combine their effects, handling each variable f by type (see the sketch below):
  • f is binary: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, and $d_{ij}^{(f)} = 1$ otherwise
  • f is interval-based: use the normalized distance
  • f is ordinal: compute $z_{if}$ from the ranks $r_{if}$ and treat as interval-scaled
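A minimal sketch of such a weighted combination (a simplified Gower-style dissimilarity with equal per-variable weights); the variable types, ranges, and example objects below are assumptions for illustration:

import numpy as np

def mixed_dissimilarity(x, y, types, ranges):
    # Average of per-variable contributions; ordinal values are assumed to be
    # already mapped to [0, 1] via z = (r - 1) / (M - 1)
    d = []
    for f, t in enumerate(types):
        if t in ("binary", "nominal"):
            d.append(0.0 if x[f] == y[f] else 1.0)
        else:  # interval or ordinal: normalized absolute difference
            d.append(abs(x[f] - y[f]) / ranges[f])
    return np.mean(d)

# Hypothetical objects with one nominal, one interval, one ordinal variable
types  = ["nominal", "interval", "ordinal"]
ranges = [None, 50.0, 1.0]          # range of each non-nominal variable
print(mixed_dissimilarity(["red", 30.0, 0.25], ["blue", 40.0, 0.75], types, ranges))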
Example: Clustering of genomic data sets
Example: Measuring the similarities of 11 languages
The meanings of words change over the course of history; the meanings of the numbers, however, are one noticeable exception. Thus, a first comparison of languages might be based on the numerals alone. The table gives the first 10 numbers in English, Polish, Hungarian, and eight other modern European languages. (Only languages that use the Roman alphabet are considered, and accent marks, cedillas, diaereses, etc., are omitted.)
For illustrative purposes, compare languages by looking at the first letters of the numbers. Two languages are concordant on a number if it begins with the same first letter in both, and discordant if it does not.
The table suggests that
• The first five languages (English, Norwegian, Danish, Dutch, and German) are very much alike.
• French, Spanish, and Italian are in close agreement.
• Hungarian and Finnish seem to be closer to each other.
• Polish has some of the characteristics of the languages in each of the larger subgroups.
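The concordance measure is simple to compute; since the numeral table itself is not reproduced in this extract, the sketch below fills in the first-letter comparison for just three languages as assumed data:

# Illustrative sketch of the first-letter concordance measure described above
english   = ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"]
norwegian = ["en", "to", "tre", "fire", "fem", "seks", "sju", "atte", "ni", "ti"]
french    = ["un", "deux", "trois", "quatre", "cinq", "six", "sept", "huit", "neuf", "dix"]

def concordance(a, b):
    # Number of the ten numerals on which two languages share the same first letter
    return sum(w1[0] == w2[0] for w1, w2 in zip(a, b))

print(concordance(english, norwegian))  # high: the Germanic languages agree closely
print(concordance(english, french))     # lower agreement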
Hierarchical Clustering
Uses the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition.
[Figure: agglomerative nesting (AGNES) merges objects a–e bottom-up over steps 0–4 (a, b → ab; d, e → de; c, de → cde; ab, cde → abcde), while divisive analysis (DIANA) runs the same hierarchy top-down, from step 4 back to step 0.]
AGNES (Agglomerative Nesting)
• Introduced in Kaufmann and Rousseeuw (1990)
• Merge nodes that have the least dissimilarity
• Go on in a non-descending fashion
• Eventually all nodes belong to the same cluster
[Figure: three scatter plots on 0–10 axes showing AGNES progressively merging the nearest clusters.]
Distances between clusters
• Single. Single linkage defines the distance between two objects or clusters as the distance between the two closest members of those clusters.
• Complete. Complete linkage uses the most distant pair of objects in two clusters to compute between-cluster distances.
• Average. Average linkage averages all distances between pairs of objects in different clusters to decide how far apart they are.
• Median. Median linkage uses the median of the distances between pairs of objects in different clusters to decide how far apart they are.
• Centroid. Centroid linkage uses the average value of all objects in a cluster (the cluster centroid) as the reference point for distances to other objects or clusters.
• Ward. Ward's method averages all distances between pairs of objects in different clusters, with adjustments for covariances, to decide how far apart the clusters are.
• Weighted. Weighted average linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are; the weights are proportional to the cluster sizes.
Agglomerative hierarchical clustering algorithm for grouping N objects
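The algorithm itself appears as a figure in the original slides; below is a minimal, unoptimized sketch in Python of the standard scheme, assuming a precomputed N × N distance matrix and using single (or complete) linkage for the between-cluster distance:

import numpy as np

def agglomerative(dist, linkage="single"):
    # Naive agglomerative clustering over a precomputed N x N distance matrix;
    # returns the list of merges as (cluster_a, cluster_b, distance)
    clusters = {i: [i] for i in range(len(dist))}   # start: every object is its own cluster
    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        best = None
        for a in range(len(keys)):                  # find the closest pair of clusters
            for b in range(a + 1, len(keys)):
                pair_d = [dist[i][j] for i in clusters[keys[a]] for j in clusters[keys[b]]]
                d = min(pair_d) if linkage == "single" else max(pair_d)  # single vs complete
                if best is None or d < best[0]:
                    best = (d, keys[a], keys[b])
        d, ka, kb = best
        clusters[ka] = clusters[ka] + clusters[kb]  # merge the closest pair
        del clusters[kb]
        merges.append((ka, kb, d))
    return merges

# toy 4-object distance matrix (symmetric, zero diagonal)
D = np.array([[0, 2, 6, 10],
              [2, 0, 5,  9],
              [6, 5, 0,  4],
              [10, 9, 4, 0]])
print(agglomerative(D))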
Dendrogram
• A graphical representation that shows how the clusters are merged hierarchically.
• A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
[Figure: an example dendrogram.]
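Cutting a dendrogram at a desired level can be done programmatically with SciPy's fcluster; a small sketch on synthetic two-cluster data (the cut height of 1.0 is arbitrary here):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)),    # two well-separated blobs
               rng.normal(3, 0.3, (10, 2))])

Z = linkage(X, method="single")                      # hierarchical merge tree
labels = fcluster(Z, t=1.0, criterion="distance")    # cut the tree at height 1.0
print(labels)                                        # two connected components -> two clusters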
Exercise: Construct a dendrogram for the companies in the following table (using Euclidean distances between their ratings) with
(i) the single linkage method, (ii) the complete linkage method, and (iii) the average linkage method.
A code sketch for producing the three dendrograms follows the table.
SL Companies V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
1 3M India Ltd, Bangalore 3 3 1 1 4 4 2 5 5 1 1 1 1 1
2 A C C Machinery Co. Ltd, Mumbai 1 1 3 3 2 1 1 1 1 1 5 5 4 4
3 Ashok Leyland Ltd, Chennai 5 5 2 2 5 4 4 1 1 3 3 3 3 3
4 Atlas Cycles, Sonepat, Haryana 4 4 4 1 4 3 5 3 1 4 5 4 4 2
5 Audco India Ltd, Mumbai 1 1 3 3 2 1 1 1 1 1 5 4 4 4
6 Automotive Axles Ltd, Mysore 5 5 2 2 4 4 4 2 1 3 2 3 3 3
7 Avon Cycles Ltd, Ludhiana 3 3 5 2 3 1 5 2 1 3 4 3 3 3
8 B S Refrigerators Ltd, Mumbai 2 2 5 3 4 4 4 1 2 4 4 4 3 5
9 Batliboi Environmental Engg. Ltd, Mumbai 2 2 3 3 2 2 2 1 1 1 4 4 4 4
10 Cable Corporation of India, Mumbai 4 4 1 1 5 5 5 5 1 3 1 3 3 4
11 CMC Ltd, Hyderabad 3 3 1 1 4 4 3 5 5 1 1 2 2 2
12 Crompton Greaves Ltd, Mumbai 4 4 1 1 4 3 3 4 5 1 1 2 2 2
13 Delphi-TVS Diesel Systems Ltd, Chennai 5 5 2 1 4 3 3 1 1 2 3 3 3 3
14 D-Link (India) Ltd, Goa 3 3 1 1 4 4 4 5 4 1 1 1 1 1
15 Eitcher Motors Ltd, Dhar, MP 1 1 1 1 3 3 3 1 1 1 3 1 5 4
16 Ford India Ltd, Chennai 4 4 1 2 5 5 2 1 1 3 4 2 4 3
17 G K N Driveshafts (India) Ltd., Faridabad 5 4 2 1 3 3 4 1 1 2 3 3 3 3
18 Gabriel India Ltd, Pune 5 5 2 1 4 3 4 2 1 2 3 3 3 3
19 Godrej Boyce Manufacturing Co. Ltd., Mumbai 2 2 3 3 4 5 1 1 1 3 4 4 4 4
20 Guindy Machine Tools Ltd, Chennai 2 2 3 3 1 1 1 1 1 1 3 3 3 3
21 Hindustan Shipyard Ltd, New Delhi 1 1 1 1 1 1 4 3 1 1 4 3 3 3
22 International Auto Limited, Jamshedpur 5 3 1 1 2 3 2 2 2 3 4 3 3 2
23 Jaiprakash Industries Ltd, Lucknow 1 2 3 3 2 1 1 1 1 1 4 4 4 4
24 John Fowler (India) Ltd, Mumbai 2 2 4 4 1 1 2 1 2 2 5 5 5 5
25 Kirloskar Copeland Ltd, Pune 2 1 3 3 2 1 1 1 1 1 4 5 4 5
26 Lakshmi Machine Works Ltd, Coimbatore 1 1 3 3 1 1 2 1 1 1 4 5 4 4
27 Marvel India Ltd, Chennai 2 1 4 3 2 1 2 1 2 1 4 3 4 4
28 Menon Pistons Ltd, Kolhapur, Maharashtra 1 2 3 3 2 1 2 2 2 1 3 3 4 4
29 National Engineering Industries Ltd, Kolkata 1 1 1 1 1 1 1 5 1 1 1 1 1 1
30 Premier Instruments and Controls Ltd, Coimbatore 5 5 2 1 4 3 4 1 1 1 2 3 3 3
31 Rico Auto Industries Ltd, Rewari, Haryana 5 5 1 1 4 4 4 1 1 2 3 3 4 4
32 Rolta India Ltd, Mumbai 3 3 1 1 2 5 4 5 4 1 1 1 1 1
33 RSB Transmission (I) Ltd, Jamshedpur 4 3 1 1 4 3 3 2 4 4 3 3 2 3
34 Shriram Pistons and Rings Ltd, Gaziabad 5 5 1 1 4 3 3 1 2 2 3 3 3 3
35 Sundaram fasteners Ltd, Chennai 2 3 2 3 3 1 2 3 3 4 4 4 2 3
36 Swaraj Mazda Ltd, Ludhiana 4 4 1 1 5 4 2 2 2 3 4 4 4 4
37 TAFE, Chennai 4 4 2 2 5 5 4 1 1 4 3 3 3 3
38 Tata Honeywell Ltd, Mumbai 3 3 3 3 3 3 1 5 4 1 2 4 4 5
39 Tata Motors, Jamshedpur 5 3 1 1 5 4 1 2 1 1 3 3 3 3
40 TVS Electronics Ltd, Chennai 3 3 1 1 3 3 2 4 4 1 1 1 1 1
41 Vijai Electrical Ltd, Hyderabad 3 3 2 2 3 3 1 3 3 1 3 3 3 3
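One way to produce the three dendrograms (a sketch; it assumes the table has been saved as testdata.csv with the same column names used in the code at the end of this document, so adjust the path as needed):

import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Drop the identifier columns so only the 14 rating variables remain
df = pd.read_csv("testdata.csv").drop(columns=["SL", "Companies"])

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, method in zip(axes, ["single", "complete", "average"]):
    Z = linkage(df.values, method=method, metric="euclidean")
    dendrogram(Z, ax=ax, no_labels=True)
    ax.set_title(method + " linkage")
    ax.set_ylabel("Distance")
plt.tight_layout()
plt.show()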
[Figure: cluster trees for the company data (Euclidean distance metric). Left panel: single linkage (nearest neighbour); right panel: complete linkage (farthest neighbour). Leaves are the 41 cases; the horizontal axis shows the merge distances.]
[Figure: cluster trees using the average linkage method (left) and the median linkage method (right); the horizontal axis shows the merge distances.]
[Figure: cluster trees using the centroid linkage method (left) and the Ward minimum-variance method (right); the horizontal axis shows the merge distances.]
import pandas as pd

# Load the company ratings and drop the identifier columns before clustering
df = pd.read_csv("E:/MY DOCUMENTS/Desktop/Python/testdata.csv")
y = df.copy()                                   # keep the original table (with SL and company names)
df.drop(['SL', 'Companies'], axis=1, inplace=True)
print("Dimension of the data set is: ", df.shape)

# Perform agglomerative clustering into 5 clusters
from sklearn.cluster import AgglomerativeClustering
agg = AgglomerativeClustering(n_clusters=5)
ypred = agg.fit_predict(df)                     # cluster label for each company
x = agg.labels_                                 # same labels, via the fitted estimator
print(x)

# Create a dendrogram using Ward linkage
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, ward
result = ward(df.values)                        # Ward linkage on the raw feature matrix
dendrogram(result)
plt.title("DENDROGRAM")
plt.xlabel('Observations')
plt.ylabel('Distances')
plt.show()
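To relate the SciPy dendrogram above to the scikit-learn labels, the Ward tree can be cut into the same number of flat clusters; a short follow-up sketch (the two libraries may number the clusters differently even when the groupings agree):

from scipy.cluster.hierarchy import fcluster

# Cut the Ward linkage tree into 5 flat clusters, matching n_clusters above
scipy_labels = fcluster(result, t=5, criterion='maxclust')
print(scipy_labels)   # cluster IDs 1-5; compare the grouping with agg.labels_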
