
Cluster Lecture-1

This document discusses and compares cluster analysis and classification and regression trees (CART). Both techniques partition data into groups, but they do so in different ways. Cluster analysis partitions vectors of data based on the properties of the vectors, while CART partitions a response variable based on predictor variables. The document provides examples of how k-means clustering and CART both seek to minimize variance in their partitioning approaches. It also notes differences between the techniques in terms of whether they handle continuous or categorical data and their partitioning structures.


Cluster Analysis
&
Classification and Regression Trees (CART)

James McCreight
mccreigh >at< gmail >dot< com

1
Why talk about them together?

Both partition data:
  cluster analysis partitions vectors of data based on the properties of the vectors.
  CART partitions a response variable (one entry in a vector) based on predictor variables (the other entries in the vector).

K-means clustering and CART both select partitions that minimize variance.

Both handle continuous or categorical partitioning (regression vs. classification).

Hierarchical clustering and CART share the same nested partition structure.

If we are going to talk about clustering, it is worth the time to expose you to CART. (A minimal side-by-side sketch follows.)
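To make the comparison concrete, here is a minimal sketch (not from the slides, and only an illustration) that runs both techniques on the built-in mtcars data; it assumes the rpart package is available for CART.

library(rpart)

## cluster analysis: partition the rows of mtcars using all 11 variables
km <- kmeans(scale(mtcars), centers=3)
km$cluster                      # group label for each car

## CART: partition one response (mpg) using the other variables as predictors
tree <- rpart(mpg ~ ., data=mtcars)
print(tree)                     # each leaf is a group of cars with similar mpg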

2
Cluster Analysis
Overview from wikipedia (font of all fact checks) reveals a broad topic with lots of
applications.

Cluster analysis
From Wikipedia, the free encyclopedia

Cluster analysis or clustering is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters.

Clustering is a main task of explorative data mining, and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.

Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to efficiently find them.

Clusters and clusterings
The notion of a cluster varies between algorithms and is one of the many decisions to take when choosing the appropriate algorithm for a particular problem. At first the terminology of a cluster seems obvious: a group of data objects. However, the clusters found by different algorithms vary significantly in their properties, and understanding these cluster models is key to understanding the differences between the various algorithms. Typical cluster models include:
■ Connectivity models: for example, hierarchical clustering builds models based on distance connectivity.
■ Centroid models: for example, the k-means algorithm represents each cluster by a single mean vector.
■ Distribution models: clusters are modeled using statistical distributions, such as the multivariate normal distributions used by the expectation-maximization algorithm.
■ Density models: for example, DBSCAN and OPTICS define clusters as connected dense regions in the data space.
■ Subspace models: in biclustering (also known as co-clustering or two-mode clustering), clusters are modeled with both cluster members and relevant attributes.
■ Group models: some algorithms (unfortunately) do not provide a refined model for their results and just provide the grouping information.

A clustering is essentially a set of such clusters, usually containing all objects in the data set. Additionally, it may specify the relationship of the clusters to each other, for example a hierarchy of clusters embedded in each other. Clusterings can be roughly distinguished into:
■ hard clustering: each object either belongs to a cluster or not.
■ soft clustering (also: fuzzy clustering): each object belongs to each cluster to a certain degree (e.g. a likelihood of belonging to the cluster).
There are also finer distinctions possible, for example:
■ strict partitioning clustering: each object belongs to exactly one cluster.
■ strict partitioning clustering with outliers: objects can also belong to no cluster, and are then considered outliers.
■ overlapping clustering (also: alternative clustering, multi-view clustering): while usually a hard clustering, objects may belong to more than one cluster.
■ hierarchical clustering: objects that belong to a child cluster also belong to the parent cluster.
■ subspace clustering: while an overlapping clustering, within a uniquely defined subspace, clusters are not expected to overlap.

2 Clustering Algorithms
  2.1 Connectivity-based clustering (Hierarchical clustering)
  2.2 Centroid-based clustering
  2.3 Distribution-based clustering
  2.4 Density-based clustering
  2.5 Newer Developments

3
Some Nomenclature

Clustering is unsupervised learning: it doesn't require predictor variables; there is no reward function and there are no training examples; it's not regression.

Elements of Statistical Learning (2nd ed.):
  ch. 14 covers unsupervised learning.
  ch. 14.3 (pp. 501-528) focuses on the two most popular kinds of clustering for a wide variety of applications:

K-Means
• hard clustering
• centroid model
• quantitative variables

K-Medoids
• hard clustering
• medoid model (the center is a cluster member)
• quantitative + ordinal + categorical variables

Both require a distance/dissimilarity metric. (A minimal sketch of each follows.)
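A minimal sketch, assuming the cluster package is available, contrasting the two on the built-in mtcars data (the data choice is just an illustration, not from the slides):

library(cluster)

x  <- scale(mtcars)

km <- kmeans(x, centers=3)          # K-means: centers are mean vectors (centroids)
pm <- pam(x, k=3)                   # K-medoids: centers are actual observations (medoids)

km$centers                          # centroids need not be rows of x
pm$medoids                          # medoids are rows of x
table(km$cluster, pm$clustering)    # compare the two hard partitions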

4
Outline

1-d non-example: the idea of variance and clusters

2-d example, dissimilarity/variance in 2-d

Dissimilarity / variance in N-d

The algorithm

Problem of a priori selection of K

hierarchical clustering

5
1-D Clusters and Variance
The 1-D squared Euclidean distance/dissimilarity between a data point x_i and another point x_{i'} (e.g. its associated centroid) is

    d(x_i, x_{i'}) = (x_i - x_{i'})^2

For a single 1-D cluster with centroid \mu, k-means clustering minimizes the within-cluster scatter, which looks like the (unnormalized) variance:

    W(C) = \sum_i d(x_i, \mu) = \sum_i (x_i - \mu)^2

For K clusters (K centroids \mu_1, \ldots, \mu_K), we have:

    W(C) = N_1 \sum_{C(i)=1} (x_i - \mu_1)^2 + \ldots + N_K \sum_{C(i)=K} (x_i - \mu_K)^2
         = \sum_{k=1}^{K} N_k \sum_{C(i)=k} (x_i - \mu_k)^2 ,

where N_k = \sum_{i=1}^{N} I(C(i) = k) is the number of points assigned to cluster k.
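A minimal 1-D sketch (random data, purely illustrative) of the within-cluster scatter. Note that kmeans()$tot.withinss reports the un-weighted sum over clusters of (x_i - \mu_k)^2, i.e. without the N_k factor in the ESL-style formula above:

set.seed(1)
x  <- c(rnorm(50, 0), rnorm(50, 6))    # two 1-D groups
km <- kmeans(x, centers=2)

## within-cluster scatter, computed by hand from the assignments and centroids
W <- sum( (x - km$centers[km$cluster])^2 )
W
km$tot.withinss                        # should match W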
6
Intuitive 1-D non-example
Amazon monthly rainfall, 3 ways





[Figure: Amazon monthly rainfall shown three ways. Left panel: mm/day vs. time (1979-2009), with series labeled original.gpcp, cluster.1, and cluster.12. Right panel: Amazon avg precipitation (mm/day) vs. year, faceted by calendar month (Jan-Dec).]

7
Why a non-example:
The example was a priori clustering: the twelve "clusters" were simply the calendar months.

"Cluster analysis" proper is machine learning driven by an algorithm.

For a specified number of clusters, the algorithm would have found different centroids: it minimizes the scatter about the centroids.

What it illustrates:
The total scatter, T, is a constant function of the data points; under the Euclidean norm it is proportional to their total variance.

T is the sum of the within-cluster scatter and the between-cluster scatter:

    T = W(C) + B(C)

So minimizing W is the same as maximizing B.

W and B are functions of the specific cluster assignment C(K) and the number of clusters, K. (A quick numerical check follows.)
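A minimal numerical check of T = W(C) + B(C) using kmeans() output (random 1-D data, purely illustrative):

set.seed(2)
x  <- rnorm(100)
km <- kmeans(x, centers=4)

T <- sum( (x - mean(x))^2 )            # total scatter: constant for these data
T
km$totss                                # kmeans reports the same quantity
km$tot.withinss + km$betweenss          # W(C) + B(C): equals T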
8
Clustering in 2-d

The 2-D Euclidean measure takes x_i as a 2-D vector, and the within-cluster scatter to be minimized is:

    W(C) = \sum_{k=1}^{K} N_k \sum_{C(i)=k} \left[ (x_{i1} - \mu_{k1})^2 + (x_{i2} - \mu_{k2})^2 \right]
         = \sum_{k=1}^{K} N_k \sum_{C(i)=k} \sum_{d=1}^{2} (x_{id} - \mu_{kd})^2
         = \sum_{k=1}^{K} N_k \sum_{C(i)=k} \lVert x_i - \mu_k \rVert^2

... example in R.
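A minimal 2-D sketch in the meantime (random data; the fuller example appears in the R code on the last slide): compute the un-weighted within-cluster scatter directly from the squared norms and compare with what kmeans() reports.

set.seed(3)
xy <- cbind(x=c(rnorm(20, -3), rnorm(20, 3)),
            y=c(rnorm(20, -3), rnorm(20, 3)))
km <- kmeans(xy, centers=2)

## sum over clusters of ||x_i - mu_k||^2 (no N_k weighting, as kmeans reports it)
W <- sum( rowSums( (xy - km$centers[km$cluster, ])^2 ) )
W
km$tot.withinss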

9
Clustering in D-d
Let x_i be a D-dimensional vector:

    W(C) = \sum_{k=1}^{K} N_k \sum_{C(i)=k} \left[ (x_{i1} - \mu_{k1})^2 + \ldots + (x_{iD} - \mu_{kD})^2 \right]
         = \sum_{k=1}^{K} N_k \sum_{C(i)=k} \sum_{d=1}^{D} (x_{id} - \mu_{kd})^2
         = \sum_{k=1}^{K} N_k \sum_{C(i)=k} \lVert x_i - \mu_k \rVert^2

Examples (rows = observations in a data frame):
  1-d:  O rainfall observations
  2-d:  P points in 2-d space
  3-d:  P points in 3-d space
  11-d: mtcars, 32 obs. of 11 vars (a sketch follows)
  T-d:  P points, each a length-T time series (homework)
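A minimal sketch of the 11-d case, clustering the rows of the built-in mtcars data frame. Scaling the columns first is an assumption on my part, since the variables have very different units:

km <- kmeans(scale(mtcars), centers=3, nstart=25)   # nstart: rerun from several random starts
split(rownames(mtcars), km$cluster)                 # which cars fall in which cluster
km$tot.withinss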

10
Lloyd's "hill-climbing" algorithm

K-means clustering algorithm:

0. Assign an initial set of cluster centers, {\mu_1, ..., \mu_K}.
1. Assign each observation to its closest centroid in {\mu_1, ..., \mu_K}.
2. Update the centroids based on the last assignment.
3. Iterate steps 1 and 2 until the assignments in (1) do not change.

The exact problem is expensive (NP-hard in general; even for fixed d and k, an exhaustive search is O(n^{dk+1} log n)), hence the iterative heuristic.

This is a stochastic algorithm because of the initialization (step 0): results may vary from run to run!

Convergence depends on the assumptions of the model and the nature of the data:
  model: spherical clusters which are separable, so that their centroids converge.
  data: try clustering a smooth gradient.

(A bare-bones implementation of steps 0-3 is sketched below.)
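A bare-bones sketch of steps 0-3, for illustration only; real work should use kmeans(), which also handles empty clusters and multiple restarts:

lloyd <- function(x, K, max.iter=100) {
  x  <- as.matrix(x)
  mu <- x[sample(nrow(x), K), , drop=FALSE]          # 0. initial centers: K random observations
  assign <- rep(0, nrow(x))
  for (iter in 1:max.iter) {
    ## squared distance of every observation to every centroid (n x K matrix)
    d2 <- sapply(1:K, function(k)
            rowSums( (x - matrix(mu[k, ], nrow(x), ncol(x), byrow=TRUE))^2 ))
    new.assign <- max.col(-d2)                        # 1. closest centroid for each observation
    if (all(new.assign == assign)) break              # 3. stop when assignments do not change
    assign <- new.assign
    for (k in 1:K)                                    # 2. update centroids
      mu[k, ] <- colMeans(x[assign == k, , drop=FALSE])  # note: does not guard against empty clusters
  }
  list(cluster=assign, centers=mu)
}

## e.g. lloyd(scale(mtcars), K=3)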

11
... on and on ...

note: Gaussian mixtures can be viewed as soft k-means clustering (Hastie et al., p. 510).

mclust package: model-based clustering, BIC, ...

recent link of k-means and PCA under certain assumptions;
see: http://en.wikipedia.org/wiki/K-means_clustering

clustering built in to R (stats): kmeans, hclust

clustering packages in R:
  cluster, flexclust, mclust, pvclust, fpc, som, clusterfly
  see: http://cran.r-project.org/web/views/Multivariate.html

The Quick-R page on clustering has a useful overview:
  http://www.statmethods.net/advstats/cluster.html

12
The problem of K
In some situations, K is known. Fine.

When K is not known, we have a new problem. Some approaches:

  look for a kink ("elbow") in the graph of W(K)

  model-based clustering: EM/BIC approach

  hierarchical approach

Amazon Rainfall redux

A priori, we had a reason for 12 clusters: the months of the year.

Suppose we don't know anything about the physical problem; then consider W(K):

## Determine number of clusters, adapted
require(plyr)   # for laply()
kink.wss <- function(data, maxclusts=15) {
  ## total sum of squares = within-cluster SS for a single cluster (K = 1)
  t <- kmeans(data, 1)$totss
  ## within-cluster SS for K = 2, ..., maxclusts
  w <- laply( as.list(2:maxclusts), function(nc) kmeans(data, nc)$tot.withinss )
  plot(1:maxclusts, c(t, w), type="b",
       xlab="Number of Clusters", ylab="Within groups sum of squares",
       main=paste(deparse(substitute(data))) )
}
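Applied to the Amazon series (assuming the clframe data frame from the following slides holds the monthly values in the column original.gpcp), the call would look like:

kink.wss(clframe$original.gpcp, maxclusts=15)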

13
Amazon Rainfall redux continued

[Figure: output of kink.wss for clframe$original.gpcp — within groups sum of squares (roughly 1500 at K = 1, falling toward 0) vs. number of clusters (1-15).]

We are looking for a number of clusters after which W doesn't decrease much.

14
aside...
EOF/PCA vs Cluster Analysis

Dominant variability (modes) vs. similar observations (clusters):
one chooses the # of clusters, but not the # of modes.

EOF/PCA: data subspaces which explain maximum variance.

Cluster analysis: similarities/differences in observations;
identify observations which vary similarly,
decompose non-stationarity, homogenize a variable.

15
mclust: 2 cluster mixture model via EM

[Figure: left panel, BIC (about -1850 to -1650) vs. number of components (2-10) for the E and V model types; right panel, the fitted two-component mixture density (density roughly 0.05-0.20) over the data range (about 5-20 mm/day).]
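A minimal sketch of how such a figure might be produced with the mclust package (the data-object name is an assumption carried over from the earlier slides):

library(mclust)

fit <- Mclust(clframe$original.gpcp, G=1:10)   # EM fits for 1-10 components, compared by BIC
summary(fit)
plot(fit, what="BIC")                           # BIC vs. number of components
plot(fit, what="density")                       # fitted mixture density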

16
[Figure: Dendrogram of diana(x = clframe$original.gpcp); cluster: diana; Divisive Coefficient = 1; height axis 0-8; leaves are the individual monthly observations, labeled by index.]
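A minimal sketch of how the dendrogram is produced with the cluster package (same data-object assumption as above):

library(cluster)

dv <- diana(clframe$original.gpcp)    # divisive hierarchical clustering
dv$dc                                  # divisive coefficient
pltree(dv, main="Dendrogram of diana(x = clframe$original.gpcp)")
cutree(as.hclust(dv), k=2)             # cut the tree into, e.g., 2 clusters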

17
Resolving spatial non-stationarity in snow depth
distribution
[Figure: panels illustrating the snow depth distribution; panel labels not recoverable.]

18
New Observations

Classification: assign a new observation to the closest centroid of an existing clustering (sketched below).

But what does that get you??

Typically we want an estimate or prediction of some variable from the new data, not just a classification.

-> CART
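A minimal sketch of that classification step (random data, purely illustrative): fit kmeans, then assign new points to whichever fitted centroid is closest.

set.seed(4)
train <- cbind(x=c(rnorm(20, -3), rnorm(20, 3)), y=c(rnorm(20, -3), rnorm(20, 3)))
km    <- kmeans(train, centers=2)

## classify new observations by nearest centroid
new.obs <- cbind(x=c(-2.5, 3.1), y=c(-3.2, 2.8))
nearest <- apply(new.obs, 1, function(p)
             which.min( colSums( (t(km$centers) - p)^2 ) ))
nearest   # cluster label for each new observation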

19
require(ggplot2)
require(plyr)    # for laply() in kink.wss() below

## generate 3 random clusters about fixed centroids (5,5), (5,-5) and (-5,-5)
clust.2d <- function(var=0) {
  data <- as.data.frame(rbind( cbind(x=rnorm(10, +5, var), y=rnorm(10,  5, var)),
                               cbind(  rnorm(15, +5, var),   rnorm(15, -5, var)),
                               cbind(  rnorm(12, -5, var),   rnorm(12, -5, var)) ) )
  plot.frame <- as.data.frame(data)
  plot.frame$orig.clust <- factor( c(rep(1,10), rep(2,15), rep(3,12)) )
  plot.frame$k.clust <- factor( kmeans(data, 3)$cluster )  ## make it a factor, since it's categorical
  ggplot( plot.frame, aes(x=x, y=y, color=orig.clust, shape=k.clust) ) + geom_point(size=3)
}

clust.2d()           # var=0: each cluster collapses to a single point
clust.2d(var=2)
clust.2d(var=3)
clust.2d(var=10)     # clusters overlap heavily

## what is the total scatter?
var <- 1
data <- as.data.frame(rbind( cbind(x=rnorm(10, +5, var), y=rnorm(10,  5, var)),
                             cbind(  rnorm(15, +5, var),   rnorm(15, -5, var)),
                             cbind(  rnorm(12, -5, var),   rnorm(12, -5, var)) ) )

## calculate T = W + B
kdata <- kmeans(data, 3)
str(data)
str(kdata)

## total scatter, two ways
T <- sum( diag(var(data)) * (length(data[,1]) - 1) )  ## unbiased sample variance is used in var()
T
T2 <- sum( (data$x - mean(data$x))^2 + (data$y - mean(data$y))^2 )
T2

## within-cluster scatter by hand vs. what kmeans reports
W <- sum( (data - kdata$centers[kdata$cluster, ])^2 )
W
kdata$tot.withinss

## Determine number of clusters, adapted
kink.wss <- function(data, maxclusts=15) {
  t <- kmeans(data, 1)$totss
  w <- laply( as.list(2:maxclusts), function(nc) kmeans(data, nc)$tot.withinss )
  plot(1:maxclusts, c(t, w), type="b",
       xlab="Number of Clusters", ylab="Within groups sum of squares",
       main=paste(deparse(substitute(data))) )  ## oooh, fancy!
}

kink.wss(data, maxclusts=8)

20
