Lattin et al., Analyzing Multivariate Data, pp. 281-283

Agglomerative clustering builds clusters gradually by linking or grouping objects together according to their similarity, producing a hierarchical clustering represented as a dendrogram. There is no single definitive number of clusters; the dendrogram shows nested cluster solutions that can be read off at any chosen cut-off distance. Single-linkage clustering tends to produce elongated, non-compact clusters and is sensitive to outliers. Methods such as complete linkage, average linkage, and centroid clustering aim to produce more balanced clusters of comparable diameter, addressing the weaknesses of single linkage.


8.4 Agglomerative Clustering: How It Works

[Figure 8.11: Dendrogram for the single-linkage cluster solution for the data in Figure 8.10. Observations A, B, C, and D join at increasing heights; the horizontal axis shows the minimum distance between clusters, from 0 to 6.]

How Many Clusters?


Agglomerative clustering does not provide a definitive answer to the question: how many clusters are there? In fact, the dendrogram is a graphical representation of a hierarchy of nested cluster solutions: a one-cluster solution, a two-cluster solution, and so on, all the way up to an n-cluster solution. Drawing a vertical line on the dendrogram (corresponding to a particular distance value d) reveals the cluster solution at that level of distance and the membership of the different clusters. For example, a vertical line at d = 4 defines the two-cluster solution with clusters {A, B} and {C, D}.

So how does one tell, by looking at the dendrogram, whether one of these nested cluster solutions provides a "better" representation of the data? One thing to look for is a relatively wide range of distances over which the number of clusters in the solution does not change. In this simple example, the two-cluster structure is stable over the range of distances in the interval (3, 6). There is no question that reading the number of clusters from a dendrogram (just like reading the number of factors from a scree plot) involves a considerable amount of subjectivity and requires judgment on the part of the analyst.
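
To make the cut-off idea concrete, here is a minimal sketch in Python using SciPy's hierarchical-clustering routines. The distance matrix is hypothetical, chosen to be consistent with the dendrogram in Figure 8.11 (A and B merge at 1, C and D at 3, final merge at 6); cutting the tree at d = 4 recovers the two-cluster solution {A, B} and {C, D}:

```python
# Minimal sketch: reading a cluster solution off a dendrogram by cutting
# the tree at a chosen distance. The distances below are hypothetical,
# picked so that A-B merge at 1, C-D at 3, and the final merge is at 6.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

labels = ["A", "B", "C", "D"]
D = np.array([[0.0, 1.0, 6.0, 7.0],
              [1.0, 0.0, 6.0, 7.0],
              [6.0, 6.0, 0.0, 3.0],
              [7.0, 7.0, 3.0, 0.0]])

Z = linkage(squareform(D), method="single")   # single-linkage merge history
cut = fcluster(Z, t=4, criterion="distance")  # vertical line at d = 4
print(dict(zip(labels, cut)))                 # e.g. {'A': 1, 'B': 1, 'C': 2, 'D': 2}
```

Any cut-off between 3 and 6 yields the same two clusters, which is exactly the stability one looks for when judging the number of clusters.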

Properties of Single Linkage


The results from single-linkage clustering (and from all of the agglomerative approaches discussed subsequently in this section) are hierarchical in nature. That means that a cluster solution near the top of the tree can always be obtained by combining the clusters from any solution nearer the bottom of the tree. This property is a natural consequence of the algorithm.

Single-linkage clustering is computationally efficient. As the number of objects n increases, the worst-case amount of computational effort required increases on the order of n^2. The algorithm is even more efficient for sparse data (e.g., for network structures, where each object is connected to only a fraction of the other objects in the set). Here the computational effort is on the order of nA, where A is the average number of connections for each object in the set. Furthermore, single-linkage clustering does not require metric data. The implementation of the algorithm described above works just as well for ordinal measures of dissimilarity.

One drawback of single linkage is that it tends to be extremely myopic. An object will be added to a cluster so long as it is close to any one of the other objects in the cluster, even if it is relatively far from all the others. Thus, single linkage has a tendency to produce long, stringy clusters and nonconvex cluster shapes. If the true underlying clusters are nonconvex, then this property is not necessarily a bad thing; however, in most cases the naturally occurring modes in our data will tend to be convex and compact and a better reflection of internal homogeneity. As a direct result, the approach has not performed well in Monte Carlo studies (see, e.g., Milligan, 1980).
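
To make the algorithm concrete, here is an illustrative pure-Python sketch of the agglomeration loop (not the book's implementation; the matrix-rebuilding details are my own). The single-linkage character is all in one line: after each merge, the distance from the new cluster to every other cluster is the minimum of the two old distances, which is precisely the myopic rule that produces chaining:

```python
# Illustrative agglomeration loop with the single-linkage update rule.
# Step 1 finds the closest pair of clusters; step 3 sets the distance
# from the merged cluster to each remaining cluster to the MINIMUM of
# the two old distances -- the source of single linkage's chaining.

def single_linkage(D, labels):
    """D: symmetric distance matrix (list of lists); labels: object names."""
    D = [row[:] for row in D]
    clusters = [{name} for name in labels]
    merges = []
    while len(clusters) > 1:
        n = len(clusters)
        # Step 1: find the closest pair of clusters (i, j).
        i, j = min(((a, b) for a in range(n) for b in range(a + 1, n)),
                   key=lambda p: D[p[0]][p[1]])
        merges.append((clusters[i] | clusters[j], D[i][j]))
        # Step 3 (single linkage): d(new, k) = min(d(i, k), d(j, k)).
        new_row = [min(D[i][k], D[j][k]) for k in range(n) if k not in (i, j)]
        # Rebuild the matrix and cluster list with i and j replaced by the merge.
        keep = [k for k in range(n) if k not in (i, j)]
        D = [[D[a][b] for b in keep] for a in keep]
        for row, d_new in zip(D, new_row):
            row.append(d_new)
        D.append(new_row + [0.0])
        clusters = [clusters[k] for k in keep] + [merges[-1][0]]
    return merges

D = [[0, 1, 6, 7],
     [1, 0, 6, 7],
     [6, 6, 0, 3],
     [7, 7, 3, 0]]
for members, height in single_linkage(D, ["A", "B", "C", "D"]):
    print(sorted(members), "merged at distance", height)
```

Because the loop consults only pairwise dissimilarities, nothing in it requires metric data; any ordinal dissimilarity for which the minimum is well defined works the same way.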

Alternatives to Single Linkage


Many different approaches have been developed to deal with the weaknesses inherent in single linkage. Some of these approaches are described briefly below. Note that all of these approaches are agglomerative in nature and produce hierarchical cluster solutions.

Complete Linkage. Instead of defining the distance (or dissimilarity) between clusters as the distance between the closest pair of objects (as in single linkage), we use the distance between the farthest pair of objects. This ensures that each object added to a cluster is close to all objects in the cluster and not just one. The only change required to go from single linkage to complete linkage clustering is to rewrite step 3 as follows:

$$ d_{C_{n+1},\,C_k} = \max\{\, d_{C_i C_k},\; d_{C_j C_k} \,\} \qquad (8.6) $$

Compared to single linkage, complete linkage is much more likely to produce convex clusters that tend to be of comparable diameter. Although complete linkage has a tendency to produce convenient and homogeneous groupings, these are not necessarily driven by the natural modality of the data. Milligan (1980) found that complete linkage can be highly sensitive to outliers in the data. When a tie occurs at step 3 (i.e., more than two clusters can be joined together), the choice can affect the subsequent shape of the cluster solution (a problem that does not occur in the case of single linkage).
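
In code, moving from single to complete linkage is exactly the one-line change to step 3 described above; a minimal sketch of Eq. 8.6 (function name is my own):

```python
def step3_complete(d_ik, d_jk):
    """Eq. 8.6: distance from the merged cluster {C_i, C_j} to C_k is the
    MAXIMUM of the two old distances, so every object in the merged
    cluster lies within that distance of C_k."""
    return max(d_ik, d_jk)

# Swapping this in for min() in the single-linkage loop sketched earlier
# turns it into complete-linkage clustering.
print(step3_complete(2.0, 5.0))  # -> 5.0
```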

Average Linkage. This approach can be considered a sort of compromise between single linkage and complete linkage. Some authors prefer this method because it comes closest to fitting a tree that satisfies a least squares minimization criterion. Instead of using the minimum (single linkage) or the maximum (complete linkage), the new distance is defined as the average distance between cluster C_k and the new cluster C_{n+1} (formed by joining together clusters C_i and C_j). Thus, we rewrite step 3 as follows:

$$ d_{C_{n+1},\,C_k} = \frac{n_i\, d_{C_i C_k} + n_j\, d_{C_j C_k}}{n_i + n_j} \qquad (8.7) $$

where n_i + n_j is the number of objects in the newly formed cluster C_{n+1}. Note that if the data are nonmetric, the average can be replaced by the median (in which case the method is called median linkage).
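
As a sketch (function name is my own), the size-weighted update of Eq. 8.7:

```python
def step3_average(d_ik, d_jk, n_i, n_j):
    """Eq. 8.7: size-weighted average of the two old distances, i.e. the
    average distance from objects in the merged cluster to cluster C_k."""
    return (n_i * d_ik + n_j * d_jk) / (n_i + n_j)

# A 3-object cluster at distance 2 and a 1-object cluster at distance 6
# give a merged-cluster distance of 3, pulled toward the larger cluster.
print(step3_average(2.0, 6.0, 3, 1))  # -> 3.0
```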

Centroid Method. Instead of defining the distance between two clusters as the average distance between all pairs of objects, it is also possible to first "average" the objects in each cluster (in effect, calculating the cluster centroids) and then define the distance between the two centroids. For this method, it simplifies things if we work with squared distances. Let d^2_{ij} represent the squared Euclidean distance between objects i and j. If cluster C = {i, j}, then the squared distance between object k and the centroid of cluster C can be written as

$$ d^2_{kC} = \frac{d^2_{ik} + d^2_{jk}}{2} - \frac{d^2_{ij}}{4} \qquad (8.8) $$

In general, the squared distance between any cluster C_k and the new cluster C_{n+1} created by joining clusters C_i and C_j can be written as

$$ d^2_{C_{n+1},\,C_k} = \frac{n_{C_i}\, d^2_{C_i C_k} + n_{C_j}\, d^2_{C_j C_k}}{n_{C_i} + n_{C_j}} - \frac{n_{C_i}\, n_{C_j}\, d^2_{C_i C_j}}{(n_{C_i} + n_{C_j})^2} \qquad (8.9) $$

By writing the rule in step 3 as a function of squared Euclidean distance (rather than the attribute measures X), the centroid method can be used with directly assessed proximity measures as well as derived distance measures (e.g., squared distances calculated from attribute data). According to Milligan (1980), the centroid method is robust to outliers but may be outperformed by average linkage.
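
A minimal sketch of the centroid update in Eq. 8.9 (function name is my own), with a quick check that it reduces to Eq. 8.8 when two singletons are joined:

```python
def step3_centroid(d2_ik, d2_jk, d2_ij, n_i, n_j):
    """Eq. 8.9: squared distance from cluster C_k to the centroid of the
    cluster formed by merging C_i (size n_i) and C_j (size n_j)."""
    n = n_i + n_j
    return (n_i * d2_ik + n_j * d2_jk) / n - n_i * n_j * d2_ij / n**2

# With two singletons (n_i = n_j = 1) this reduces to Eq. 8.8:
# (d2_ik + d2_jk) / 2 - d2_ij / 4.
print(step3_centroid(9.0, 25.0, 4.0, 1, 1))  # -> 16.0
print((9.0 + 25.0) / 2 - 4.0 / 4)            # -> 16.0
```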

Ward's Method. The three methods just described (complete linkage, average linkage, and the centroid method) are all variants of a general agglomerative approach, called the pair group method, that differ only in terms of the distance relation specified in step 3. By contrast, Ward's method (sometimes referred to as the minimum variance method) adopts a slightly different strategy at step 1. Instead of joining the two closest clusters, Ward's method seeks to join the two clusters whose merger leads to the smallest within-cluster sum of squares (i.e., minimum within-group variance).

Ward's method has a tendency to produce equal-sized clusters (i.e., clusters with approximately the same number of observations in each) that are convex and compact. Because the approach is based on the minimization of within-cluster distances, it often produces a clustering solution, if the tree is "cut" in the right place, that is similar to those of the partitioning methods described in Section 8.5 below (which also focus on minimizing the within-group sum of squares).
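
A minimal sketch of Ward's merge criterion (my own helper, not the book's code), using the standard identity that merging two clusters with centroids c_i, c_j and sizes n_i, n_j increases the within-cluster sum of squares by (n_i n_j / (n_i + n_j)) ||c_i - c_j||^2; at each step, Ward's method joins the pair for which this increase is smallest:

```python
import numpy as np

def ward_increase(points_i, points_j):
    """Increase in within-cluster sum of squares if the two clusters are
    merged: (n_i * n_j / (n_i + n_j)) * ||c_i - c_j||^2."""
    c_i, c_j = points_i.mean(axis=0), points_j.mean(axis=0)
    n_i, n_j = len(points_i), len(points_j)
    return n_i * n_j / (n_i + n_j) * float(np.sum((c_i - c_j) ** 2))

# Hypothetical 2-D clusters: merging the two tight, nearby clusters costs
# far less than merging either of them with the distant one.
a = np.array([[0.0, 0.0], [0.0, 1.0]])
b = np.array([[1.0, 0.0], [1.0, 1.0]])
c = np.array([[8.0, 8.0], [9.0, 8.0]])
print(ward_increase(a, b))  # -> 1.0   (a and b would be merged first)
print(ward_increase(a, c))  # -> 128.5 (a distant merge is far more costly)
```

SciPy's linkage(X, method="ward") implements this criterion via the Lance-Williams recurrence, so the full hierarchy can be built without recomputing centroids from scratch at each step.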
