Cluster Lecture-1
&
(Classification And
Regression Trees = CART)
James McCreight
mccreigh >at< gmail >dot< com
Why talk about them together?
Partitioning data:
• cluster analysis partitions vectors of data based on the properties of the vectors.
• CART partitions a response variable (one entry in a vector) based on predictor variables (the other entries in the vector).
If we are going to talk about clustering, it is worth the time to expose you to CART.
Cluster Analysis
Overview from Wikipedia (font of all fact checks) reveals a broad topic with lots of applications.

Cluster analysis
From Wikipedia, the free encyclopedia

Cluster analysis or clustering is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters.

Clustering is a main task of explorative data mining, and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.

Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to efficiently find them.

Clusters and clusterings

The notion of a cluster varies between algorithms and is one of the many decisions to take when choosing the appropriate algorithm for a particular problem. At first the terminology of a cluster seems obvious: a group of data objects. However, the clusters found by different algorithms vary significantly in their properties, and understanding these cluster models is key to understanding the differences between the various algorithms. Typical cluster models include:

■ Connectivity models: for example hierarchical clustering builds models based on distance connectivity.
■ Centroid models: for example the k-means algorithm represents each cluster by a single mean vector.
■ Distribution models: clusters are modeled using statistic distributions, such as multivariate normal distributions used by the Expectation-maximization algorithm.
■ Density models: for example DBSCAN and OPTICS define clusters as connected dense regions in the data space.
■ Subspace models: in Biclustering (also known as Co-clustering or two-mode-clustering), clusters are modeled with both cluster members and relevant attributes.
■ Group models: some algorithms (unfortunately) do not provide a refined model for their results and just provide the grouping information.

A clustering is essentially a set of such clusters, usually containing all objects in the data set. Additionally, it may specify the relationship of the clusters to each other, for example a hierarchy of clusters embedded in each other. Clusterings can be roughly distinguished in:

■ hard clustering: each object belongs to a cluster or not
■ soft clustering (also: fuzzy clustering): each object belongs to each cluster to a certain degree (e.g. a likelihood of belonging to the cluster)

There are also finer distinctions possible, for example:

■ strict partitioning clustering: here each object belongs to exactly one cluster
■ strict partitioning clustering with outliers: objects can also belong to no cluster, and are considered outliers
■ overlapping clustering (also: alternative clustering, multi-view clustering): while usually a hard clustering, objects may belong to more than one cluster
■ hierarchical clustering: objects that belong to a child cluster also belong to the parent cluster
■ subspace clustering: while an overlapping clustering, within a uniquely defined subspace, clusters are not expected to overlap

Clustering algorithms (article contents):
■ Connectivity-based clustering (Hierarchical clustering)
■ Centroid-based clustering
■ Distribution-based clustering
■ Density-based clustering
■ Newer Developments
Some Nomenclature
K-Means
• hard clustering
• centroid model
• quantitative variables

K-Medoids
• hard clustering
• medoid model (the center is a cluster member)
• quantitative + ordinal + categorical variables
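For contrast in R: kmeans() covers the quantitative case, while k-medoids is available as pam() in the cluster package and can run on any dissimilarity, e.g. a Gower dissimilarity over mixed variable types. A minimal sketch with an illustrative mixed-type data frame (all names and values below are made up for illustration):

library(cluster)
set.seed(1)
## illustrative data frame with quantitative, ordinal, and categorical columns
df <- data.frame(size  = rnorm(30),
                 grade = factor(sample(c("low","med","high"), 30, replace=TRUE),
                                levels=c("low","med","high"), ordered=TRUE),
                 color = factor(sample(c("red","blue"), 30, replace=TRUE)))
d  <- daisy(df, metric="gower")   ## dissimilarity that copes with mixed variable types
pm <- pam(d, k=2, diss=TRUE)      ## k-medoids on the dissimilarity matrix
df[pm$id.med, ]                   ## the medoids are actual rows of the data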
Outline
The algorithm
hierarchical clustering
1-D Clusters and Variance
The 1-D squared euclidean distance/dissimilarity

    d(x_i, \bar{x}_i) = (x_i - \bar{x}_i)^2

between any data point x_i and its associated centroid \bar{x}_i.

For a single 1-D cluster with centroid \mu, k-means clustering minimizes the within-cluster scatter, which looks like the (unnormalized) variance:

    W(C) = \sum_i d(x_i, \mu) = \sum_i (x_i - \mu)^2

For K clusters (K centroids), we have:

    W(C) = N_1 \sum_{C(i)=1} (x_i - \mu_1)^2 + \ldots + N_K \sum_{C(i)=K} (x_i - \mu_K)^2
         = \sum_{k=1}^{K} N_k \sum_{C(i)=k} (x_i - \mu_k)^2,
    where N_k = \sum_{i=1}^{N} I(C(i) = k) is the number of points in cluster k.
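To make W(C) concrete, a quick check in R (illustrative data, not from the slides). Note that kmeans() reports the plain within-cluster sum of squares, i.e. without the N_k weighting used in the notation above:

set.seed(1)
x  <- c(rnorm(20, 0, 1), rnorm(20, 6, 1))   ## two well-separated 1-D groups
km <- kmeans(x, centers=2)
## within-cluster scatter computed by hand from the assignments and the centroids
W <- sum( (x - km$centers[km$cluster])^2 )
c(by.hand = W, from.kmeans = km$tot.withinss)   ## the two agree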
Intuitive 1-D non-example
Amazon monthly rainfall, 3 ways
[Figure: Amazon average precipitation (mm/day) from GPCP, plotted three ways — the variables original.gpcp, cluster.1, and cluster.12 — by month (Jan-Dec) and, within each month, by year (1980-2010).]
non-example:
The example was an a priori clustering: the groups were fixed in advance (by month) rather than found by an algorithm.

It illustrates:
The total scatter, T, is a constant function of the data points; under the euclidean norm it is proportional to their total variance:

    T = W(C) + B(C)

To minimize W is to maximize B. W and B are functions of the specific cluster centers, C(K), and their number, K.
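A quick R illustration of T = W(C) + B(C) (illustrative data): kmeans() returns the three quantities as totss, tot.withinss, and betweenss, and T stays fixed as K changes while W falls and B rises:

set.seed(1)
x <- c(rnorm(20, 0, 1), rnorm(20, 6, 1))
for (K in 1:4) {
  km <- kmeans(x, centers=K)
  cat("K =", K, " T =", km$totss, " W =", km$tot.withinss, " B =", km$betweenss, "\n")
}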
Clustering in 2-d
The 2-d euclidean measure has x_i as a 2-d vector, and the within-cluster scatter that gets minimized is:

    W(C) = \sum_{k=1}^{K} N_k \sum_{C(i)=k} [ (x_{i1} - \mu_{k1})^2 + (x_{i2} - \mu_{k2})^2 ]
         = \sum_{k=1}^{K} N_k \sum_{C(i)=k} \sum_{d=1}^{2} (x_{id} - \mu_{kd})^2
         = \sum_{k=1}^{K} N_k \sum_{C(i)=k} ||x_i - \mu_k||^2

... example in R.
Clustering in D-d
Let x_i be a D-dimensional vector:

    W(C) = \sum_{k=1}^{K} N_k \sum_{C(i)=k} [ (x_{i1} - \mu_{k1})^2 + \ldots + (x_{iD} - \mu_{kD})^2 ]
         = \sum_{k=1}^{K} N_k \sum_{C(i)=k} \sum_{d=1}^{D} (x_{id} - \mu_{kd})^2
         = \sum_{k=1}^{K} N_k \sum_{C(i)=k} ||x_i - \mu_k||^2

Examples (rows = observations in a dataframe):
• 1-d: O rainfall observations
• 2-d: P points in 2-d space
• 3-d: P points in 3-d space
• 11-d: mtcars, 32 obs of 11 vars
• T-d: P points, each a length-T timeseries (homework)
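A minimal sketch of the 11-d case in R using the built-in mtcars data frame; scaling the columns and the choices K = 3 and nstart = 25 are illustrative assumptions, not from the slides:

## each row (car) is an 11-dimensional observation
km.cars <- kmeans(scale(mtcars), centers=3, nstart=25)
km.cars$size                               ## how many cars fall in each cluster
split(rownames(mtcars), km.cars$cluster)   ## which cars group together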
Lloyd’s “hill-climbing” algorithm
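For reference, a minimal sketch of the iteration named in the title (my own illustrative code, not the slide's): start from K random centroids, then alternate assigning each point to its nearest centroid and moving each centroid to the mean of its members, until the assignments stop changing. It omits empty-cluster handling; in practice kmeans(x, K, algorithm="Lloyd") does this for you.

## Lloyd's algorithm: alternate an assignment step and a centroid-update step
lloyd <- function(x, K, max.iter=100) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), K), , drop=FALSE]   ## random initial centroids
  assign.old <- rep(0L, nrow(x))
  for (iter in 1:max.iter) {
    ## assignment step: each point goes to its nearest centroid (squared euclidean distance)
    d2 <- sapply(1:K, function(k) rowSums(sweep(x, 2, centers[k, ])^2))
    assign.new <- max.col(-d2)
    if (all(assign.new == assign.old)) break         ## converged: assignments stopped changing
    ## update step: each centroid moves to the mean of its current members
    for (k in 1:K) centers[k, ] <- colMeans(x[assign.new == k, , drop=FALSE])
    assign.old <- assign.new
  }
  list(cluster=assign.new, centers=centers, iterations=iter)
}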
... on and on ...
clustering packages in R:
cluster, flexclust, mclust, pvclust, fpc, som, clusterfly
see: http://cran.r-project.org/web/views/Multivariate.html
The problem of K
In some situations, K is known. Fine. Otherwise:
• graph kink (look for an "elbow" in the plot of W(K) vs. K)
• hierarchical approach

Suppose we don't know anything about the physical problem; then consider W(K):

## Determine number of clusters, adapted
require(plyr)   ## provides laply(); sapply(2:maxclusts, ...) would also work
kink.wss <- function(data, maxclusts=15) {
  t <- kmeans(data, 1)$totss   ## with one cluster, W equals the total scatter T
  w <- laply( as.list(2:maxclusts), function(nc) kmeans(data, nc)$tot.withinss )
  plot(1:maxclusts, c(t, w), type="b",
       xlab="Number of Clusters", ylab="Within groups sum of squares",
       main=paste(deparse(substitute(data))) )
}
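Presumably the call that produces the next slide's figure is along these lines (clframe is the data frame the later slides reference; it is not defined in these slides):

kink.wss(clframe$original.gpcp)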
Amazon Rainfall redux continued
[Figure: kink.wss applied to clframe$original.gpcp — within groups sum of squares vs. number of clusters (1-15).]
aside...
EOF/PCA vs Cluster Analysis
mclust: 2 cluster mixture model via EM
[Figure: mclust fit — left: fitted mixture density; right: BIC vs. number of components (2-10) for the univariate models "E" and "V".]
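For reference, a minimal sketch of this kind of fit with the mclust package; the vector x is an illustrative stand-in for the rainfall series, and the choices of G are mine:

library(mclust)
set.seed(1)
x <- c(rnorm(150, 3, 1), rnorm(150, 8, 1.5))   ## stand-in for the rainfall series

fit <- Mclust(x, G=1:9)       ## EM fits with 1-9 components; BIC compares univariate models "E" and "V"
plot(fit, what="BIC")         ## BIC vs. number of components
fit2 <- Mclust(x, G=2)        ## or fix a 2-component mixture, as in the slide title
plot(fit2, what="density")    ## fitted mixture density
summary(fit2, parameters=TRUE)   ## mixing proportions, means, variances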
[Figure: Dendrogram of diana(x = clframe$original.gpcp) — divisive hierarchical clustering; Divisive Coefficient = 1; height 0-8; leaf labels are the observation indices.]
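For reference, a minimal sketch of such a diana (DIvisive ANAlysis) fit using the cluster package; the vector x here is an illustrative stand-in, whereas the slide used clframe$original.gpcp:

library(cluster)
set.seed(1)
x <- c(rnorm(150, 3, 1), rnorm(150, 8, 1.5))   ## stand-in for the rainfall series

dv <- diana(as.matrix(x))   ## divisive hierarchical clustering (euclidean dissimilarities by default)
dv$dc                       ## the divisive coefficient
pltree(dv, main="Dendrogram of diana(x)")   ## plot the dendrogram
cutree(as.hclust(dv), k=2)                  ## cut the tree into 2 hard clusters if labels are wanted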
Resolving spatial non-stationarity in snow depth
distribution
)" !"
*+ #$
," %"
** #&
-" '"
*& #(
18
New Observations
-> CART
require(ggplot2)

## generate 3 random clusters about fixed centroids (5,5), (5,-5) and (-5,-5);
## 'var' is passed to rnorm() as the standard deviation (var=0 puts every point exactly on its centroid)
clust.2d <- function(var=0) {
  data <- as.data.frame(rbind( cbind(x=rnorm(10, +5, var), y=rnorm(10, +5, var)),
                               cbind(  rnorm(15, +5, var),   rnorm(15, -5, var)),
                               cbind(  rnorm(12, -5, var),   rnorm(12, -5, var)) ))
  plot.frame <- as.data.frame(data)
  plot.frame$orig.clust <- factor( c(rep(1,10), rep(2,15), rep(3,12)) )   ## the clusters we generated
  plot.frame$k.clust    <- factor( kmeans(data, 3)$cluster )  ## make it a factor, since it's categorical
  ggplot( plot.frame, aes(x=x, y=y, color=orig.clust, shape=k.clust) ) + geom_point(size=3)
}
clust.2d()
clust.2d(var=2)
clust.2d(var=3)
clust.2d(var=10)
## calculate T = W + B
## clust.2d() keeps its data local, so regenerate a comparable data set here first
data <- as.data.frame(rbind( cbind(x=rnorm(10, +5, 2), y=rnorm(10, +5, 2)),
                             cbind(  rnorm(15, +5, 2),   rnorm(15, -5, 2)),
                             cbind(  rnorm(12, -5, 2),   rnorm(12, -5, 2)) ))
kdata <- kmeans(data, 3)
str(data)
str(kdata)
W <- sum( (data - kdata$centers[kdata$cluster, ])^2 )   ## within-cluster scatter by hand
W
kdata$tot.withinss                                      ## ... matches what kmeans reports
kink.wss(data, maxclusts=8)