
AB1202

Statistics and Analysis


Lecture 13
Cluster Analysis

Chin Chee Kai


cheekai@ntu.edu.sg
Nanyang Business School
Nanyang Technological University

Cluster Analysis
• Goals of Cluster Analysis
• Distance Function
• K-Means Clustering
• Agglomerative Clustering
• Divisive Clustering

Goals of Cluster Analysis


• Identify “similar” items in the same cluster
▫ Eg, low-quality bank loan applicants, big-spending customers, groups of suspicious accounts, etc
• Detect structures in terms of clusters
▫ Eg, a retail store may not know it is serving two distinct groups of customers until it clusters its members’ transactions.
• Filter dissimilar data
▫ Eg, applying multiple regression on two distinct clusters of data will produce meaningless results; instead, analyse each cluster separately.
• Noise removal
▫ Cluster centers can serve as summaries of clusters, identifying core characteristics and allowing outliers and noise to be ignored.

Distance Function
• A distance function is basically a formula applied to two data points to give a non-negative number (≥ 0).
• Remember that a data point may be
▫ a number (eg 2),
▫ a coordinate (eg (3, 4)), or
▫ a record of values (eg (20 years, 1.75 m, “Male”, 72 kg, 3.9999 GPA)).
• But to calculate distance, we convert all non-numerical values into numerical values (see Coding in Chapter 15).
▫ (20, 1.75, 1, 72, 3.9999) → we call this a “vector”
• So, not surprisingly, when we think of a data point p in clustering, it is also a vector (x₁, x₂, …, xₖ).
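
To make the coding step concrete, here is a small, hypothetical R sketch (the record fields and the “Male” = 1 coding are illustrative assumptions, not from the slides):

# Hypothetical example: coding a record into a purely numerical vector
record <- list(age = 20, height = 1.75, gender = "Male", weight = 72, gpa = 3.9999)
vec <- c(record$age,
         record$height,
         ifelse(record$gender == "Male", 1, 0),  # assumed coding: "Male" -> 1, otherwise 0
         record$weight,
         record$gpa)
vec   # 20.0000  1.7500  1.0000 72.0000  3.9999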

Distance Function
• Suppose we have 2 data points p₁ = (x₁, x₂, …, xₖ) and p₂ = (y₁, y₂, …, yₖ).
• 3 kinds of distance functions are commonly used.
• Euclidean distance:
▫ dist(p₁, p₂) = √((x₁ − y₁)² + ⋯ + (xₖ − yₖ)²)
 More “natural”, like how we usually measure distance
 But takes lots of CPU time. Sometimes we ignore taking the square root to save a bit of calculation
• Manhattan distance:
▫ dist(p₁, p₂) = |x₁ − y₁| + ⋯ + |xₖ − yₖ|
 Fast calculation; preserves the practical sense of “further means larger distance”.
• Max distance:
▫ dist(p₁, p₂) = max(|x₁ − y₁|, …, |xₖ − yₖ|)
 Also fast calculation
 A little hard to interpret from a layman’s perspective.
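
As a quick illustration, here is a minimal R sketch (not from the slides; the two points are assumed examples) computing the three distances; base R’s dist() gives the same results through its method argument:

p1 <- c(20, 1.75, 1, 72, 3.9999)   # assumed example vectors
p2 <- c(35, 1.60, 0, 80, 3.1000)

sqrt(sum((p1 - p2)^2))   # Euclidean distance
sum(abs(p1 - p2))        # Manhattan distance
max(abs(p1 - p2))        # Max distance

# Equivalently, using the built-in dist() function on a 2-row matrix:
dist(rbind(p1, p2), method = "euclidean")
dist(rbind(p1, p2), method = "manhattan")
dist(rbind(p1, p2), method = "maximum")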

Clustering Methods
• All clustering methods are programmable procedures that use a distance function to assign a set of data points to cluster numbers (1, 2, 3, …).
▫ ie all clustering methods need:
 Data points
 A distance function
 Any other input needed by the clustering method
• We will look at 3 clustering methods:
▫ K-Means Clustering
▫ Agglomerative Clustering
▫ Divisive Clustering

K-Means Clustering
• If we have a pre-determined number of clusters k (eg 2 clusters) in mind, then K-Means clustering can be used.
• This is not as restrictive as it sounds, since decision making through clustering typically does not involve a large number of clusters.
▫ Or we could also analyze clustering results from k = 2, 3 and 4 to compare and contrast.
• K-Means clustering also requires k starting cluster centers as input; different starting cluster centers may result in different clustering outcomes.

K-Means Clustering Example


• Suppose we have a data set 2, 6, 9, 3, 5, 7. If we impose 2 clusters on this data, which point belongs to which cluster?
• Using K-Means with starting centers 1 and 10, we calculate squared-Euclidean distances to center 1 (d1) and center 2 (d2) for every point.

[1] "Step 1: -------------"
  name x curCluster newCluster d1 d2
1    a 2          0          1  1 64
2    b 6          0          2 25 16
3    c 9          0          2 64  1
4    d 3          0          1  4 49
5    e 5          0          1 16 25
6    f 7          0          2 36  9
[1] " Centers ====="
[1] 3.333333 7.333333

• d1 and d2 are the squared distances to the initial cluster centers (1 and 10). Each point is assigned to the cluster whose center is closest to it, as indicated by the “newCluster” column.
• For the new clusters, new center points are calculated by averaging the data points in each cluster:
▫ Cluster 1 new center = (2 + 3 + 5)/3 = 3.3333
▫ Cluster 2 new center = (6 + 9 + 7)/3 = 7.3333

K-Means Clustering Example


• Again, we calculate squared-Euclidean distances from all data points to the new cluster centers (3.3333 and 7.3333) to get updated columns d1 and d2.
• Each data point is re-assigned to the new cluster closest to it.
• But we see no change in the assignments, and so K-Means stops.

[1] "Step 2: -------------"
  name x curCluster newCluster         d1         d2
1    a 2          1          1  1.7777778 28.4444444
2    b 6          2          2  7.1111111  1.7777778
3    c 9          2          2 32.1111111  2.7777778
4    d 3          1          1  0.1111111 18.7777778
5    e 5          1          1  2.7777778  5.4444444
6    f 7          2          2 13.4444444  0.1111111
[1] " Centers ====="
[1] 3.333333 7.333333

• The “newCluster” column gives the final clustering assignments, and 3.333333 and 7.333333 are the final cluster centers.
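
The same result can be reproduced with R’s built-in kmeans() function. A minimal sketch, assuming the toy data above (kmeans() uses squared Euclidean distances and stops when the assignments no longer change):

x  <- c(2, 6, 9, 3, 5, 7)
km <- kmeans(x, centers = c(1, 10))   # starting cluster centers 1 and 10
km$cluster    # cluster assignment of each point
km$centers    # final cluster centers: 3.333333 and 7.333333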

Agglomerative and Divisive Clustering


• Unlike K-Means, these methods do not require the number of clusters to be decided in advance. They are, therefore, great for detecting inherent clusters in a vast pool of data.
• Also unlike K-Means, they do not require guessing initial cluster centers, as they derive clustering results from the data values themselves.
• Both generate a hierarchical layering of data points that shows which points are clustered with which.
▫ Thus, these are also called “Hierarchical Clustering”.
▫ Drawing a tree diagram of the links results in what is called a “dendrogram” in clustering terminology (just an inverted tree with its leaves on the ground).

Agglomerative Clustering
• The great idea:
▫ i. Start by letting each data point be its own cluster of 1.
▫ ii. Calculate the average distance between all pairs of clusters.
▫ iii. Merge the pair of clusters with the shortest distance.
▫ iv. Then repeat steps (ii)–(iii) until we get only one big cluster.
• Because we keep a history of the merging steps, it is possible to draw the dendrogram with distance information:
▫ We can “cut” the tree (dendrogram) to get any number of clusters we want.
▫ If the distance before a merger is relatively large compared to other clusters, it is a good sign that the two clusters should be kept separate and not merged.
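
In R, average-linkage agglomerative clustering is available through hclust() in base R or agnes() in the cluster package (agnes() is the function whose output appears later in this example). A minimal sketch, assuming the same toy data:

library(cluster)
x   <- c(2, 6, 9, 3, 5, 7)   # the toy data set used in the example
dis <- dist(x)               # pairwise Euclidean distances
cst <- agnes(dis)            # agglomerative clustering; default linkage is "average"
pltree(cst)                  # draw the dendrogram (inverted tree)
cst$height                   # distances at which clusters were merged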

Agglomerative Clustering Example


• Let’s cluster again 2, 6, 9, 3, 5, 7.
[1] "Step 1: -------------"
  name x curCluster newCluster 1 2 3 4 5 6
1    a 2          1          7 -
2    b 6          2          2 4 -
3    c 9          3          3 7 3 -
4    d 3          4          7 1 3 6 -
5    e 5          5          5 3 1 4 2 -
6    f 7          6          6 5 1 2 4 2 -

▫ Clusters 1 and 4 are agglomerated into new cluster 7.

[1] "Step 2: -------------"
  name x curCluster newCluster   7 2 3 5 6
1    a 2          7          7
4    d 3          7          7   -
2    b 6          2          8 3.5 -
3    c 9          3          3 6.5 3 -
5    e 5          5          5 2.5 1 4 -
6    f 7          6          8 4.5 1 2 2 -

▫ Clusters 2 and 6 are agglomerated into new cluster 8.
▫ Average distances from the merged cluster 7 (points 2 and 3) to the other clusters:
 (dist(6,2)+dist(6,3))/2 = (4+3)/2 = 3.5
 (dist(9,2)+dist(9,3))/2 = (7+6)/2 = 6.5
 (dist(5,2)+dist(5,3))/2 = (3+2)/2 = 2.5
 (dist(7,2)+dist(7,3))/2 = (5+4)/2 = 4.5

Agglomerative Clustering Example


• Let’s cluster again 2, 6, 9, 3, 5, 7.

[1] "Step 3: -------------"
  name x curCluster newCluster   7   8 3 5
1    a 2          7          7
4    d 3          7          7   -
2    b 6          8          9
6    f 7          8          9   4   -
3    c 9          3          3 6.5 2.5 -
5    e 5          5          9 2.5 1.5 4 -

▫ Clusters 8 and 5 are merged into new cluster 9.
▫ Average distances behind the entries above:
 (dist(6,2)+dist(7,2)+dist(6,3)+dist(7,3))/4 = (4+5+3+4)/4 = 4
 (dist(9,2)+dist(9,3))/2 = (7+6)/2 = 6.5
 (dist(5,2)+dist(5,3))/2 = (3+2)/2 = 2.5
 (dist(9,6)+dist(9,7))/2 = (3+2)/2 = 2.5
 (dist(5,6)+dist(5,7))/2 = (1+2)/2 = 1.5

[1] "Step 4: -------------"
  name x curCluster newCluster   7   9 3
1    a 2          7          7
4    d 3          7          7   -
2    b 6          9         10
6    f 7          9         10
5    e 5          9         10 3.5   -
3    c 9          3         10 6.5   3 -

▫ Clusters 9 and 3 are merged into new cluster 10.
▫ Average distances behind the entries above:
 (dist(6,2)+dist(7,2)+dist(5,2)+dist(6,3)+dist(7,3)+dist(5,3))/6 = (4+5+3+3+4+2)/6 = 3.5
 (dist(9,2)+dist(9,3))/2 = (7+6)/2 = 6.5
 (dist(6,9)+dist(7,9)+dist(5,9))/3 = (3+2+4)/3 = 3

Agglomerative Clustering Example


• Let’s cluster again 2, 6, 9, 3, 5, 7.
[1] "Step 5: -------------"
  name x curCluster newCluster    7 10
1    a 2          7          7
4    d 3          7          7    -
2    b 6         10         10
6    f 7         10         10
5    e 5         10         10
3    c 9         10         10 4.25 -

▫ Clusters 7 and 10 are merged into new cluster 11. Done!
▫ (dist(6,2)+dist(7,2)+dist(5,2)+dist(9,2)+dist(6,3)+dist(7,3)+dist(5,3)+dist(9,3))/8 = (4+5+3+7+3+4+2+6)/8 = 4.25

Call: agnes(x = dis)
Agglomerative coefficient: 0.6666667
Order of objects:
[1] 2 3 6 7 5 9
Height (summary):
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00    1.00    1.50    2.15    3.00    4.25
> cst$height
[1] 1.00 4.25 1.00 1.50 3.00
> cst$order
[1] 1 4 2 6 5 3

[Dendrogram of the merges, with cluster heights: C7 = 1, C8 = 1, C9 = 1.5, C10 = 3, C11 = 4.25]
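
A minimal sketch (assuming the agnes object cst from the output above) of how the dendrogram can be “cut” into a chosen number of clusters:

library(cluster)
x   <- c(2, 6, 9, 3, 5, 7)
dis <- dist(x)
cst <- agnes(dis)
cutree(as.hclust(cst), k = 2)   # cut into 2 clusters: {2, 3} versus {6, 9, 5, 7}
cutree(as.hclust(cst), h = 4)   # or cut at height 4, just below the tallest merge (4.25)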

Divisive Clustering
• The great idea:
▫ i. Start by letting all data points be one big cluster.
▫ ii. Split a cluster into 2 new smaller clusters, starting with its two furthest data points.
▫ iii. Then calculate the average distance of every other remaining data point to the two new clusters, moving each data point to the closer new cluster.
▫ iv. Then repeat steps (ii)–(iii) until we get only one data point per cluster.
• It is also possible to draw the dendrogram with distance information:
▫ We can “cut” the tree (dendrogram) to get any number of clusters we want.
▫ If the distance before a merger is relatively large compared to other clusters, it is a good sign that the two clusters should be kept separate and not merged.
▫ The dendrogram may not be the same as the one from Agglomerative Clustering.
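
In R, divisive clustering is available through diana() in the cluster package (the same function whose output appears at the end of this example). A minimal sketch, assuming the same toy data:

library(cluster)
x   <- c(2, 6, 9, 3, 5, 7)
dis <- dist(x)       # pairwise Euclidean distances
cst <- diana(dis)    # DIvisive ANAlysis clustering
pltree(cst)          # dendrogram; may differ from the agglomerative one
cst$height           # heights at which clusters are split apart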

Divisive Clustering Example


• Let’s cluster yet again 2, 6, 9, 3, 5, 7.
[1] "Step 1: -------------"
  name x curCluster newCluster Pt 2 6 9 3 5 7
1    a 2          1          2  2 -
2    b 6          1          3  6 4 -
3    c 9          1          3  9 7 3 -
4    d 3          1          2  3 1 3 6 -
5    e 5          1          3  5 3 1 4 2 -
6    f 7          1          3  7 5 1 2 4 2 -

▫ Points 2 and 9 are the starting points for dividing into 2 clusters: 2 and 3.

[1] "Step 2: -------------"
  name x curCluster newCluster Pt 2 3 6 9 5 7
1    a 2          2          4  2 -
4    d 3          2          5  3 1 -
2    b 6          3          6  6 -
3    c 9          3          7  9 3 -
5    e 5          3          6  5 1 4 -
6    f 7          3          6  7 1 2 2 -

▫ Points 2 and 3 are the starting points for dividing into 2 clusters: 4 and 5 (then stop).
▫ Points 5 and 9 are the starting points for dividing into 2 clusters: 6 and 7.

Divisive Clustering Example


• Let’s cluster yet again 2, 6, 9, 3, 5, 7.
[1] "Step 3: -------------"
  name x curCluster newCluster Pt 2 3 6 5 7 9
1    a 2          4          4  2 -
4    d 3          5          5  3 -
2    b 6          6          9  6 -
5    e 5          6          8  5 1 -
6    f 7          6          9  7 1 2 -
3    c 9          7          7  9 -

▫ Points 5 and 7 are the starting points for dividing into 2 clusters: 8 and 9.

[1] "Step 4: -------------"
  name x curCluster newCluster Pt 2 3 6 7 5 9
1    a 2          4          4  2 -
4    d 3          5          5  3 -
2    b 6          9         10  6 -
6    f 7          9         11  7 1 -
5    e 5          8          8  5 -
3    c 9          7          7  9 -

▫ Points 6 and 7 are the starting points for dividing into 2 clusters: 10 and 11 (and stop).

Divisive Clustering Example


• Let’s cluster yet again 2, 6, 9, 3, 5, 7.
[1] "Step 4: -------------"
  name x curCluster newCluster Pt 2 3 6 7 5 9
1    a 2          4          4  2 -
4    d 3          5          5  3 -
2    b 6         10         10  6 -
6    f 7         11         11  7 -
5    e 5          8          8  5 -
3    c 9          7          7  9 -

> cst = diana(dis)
> cst$height
[1] 1 7 1 2 4
> cst$order
[1] 1 4 2 6 5 3

[Dendrogram of the splits, with cluster heights: C2 = 1, C9 = 1, C6 = 2, C3 = 4, C1 = 7]
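
A minimal sketch (assuming the diana object from the output above) showing how this divisive tree can also be cut into a chosen number of clusters and compared with the agglomerative result:

library(cluster)
x   <- c(2, 6, 9, 3, 5, 7)
dis <- dist(x)
dv  <- diana(dis)                 # divisive tree (heights 1, 7, 1, 2, 4)
ag  <- agnes(dis)                 # agglomerative tree (heights 1, 4.25, 1, 1.5, 3)
cutree(as.hclust(dv), k = 2)      # 2 clusters from the divisive tree
cutree(as.hclust(ag), k = 2)      # 2 clusters from the agglomerative tree
# Here both cuts separate {2, 3} from {6, 9, 5, 7}, even though the dendrograms differ.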
