FIGURE 8.11 Dendrogram for single-linkage cluster solution for data in Figure 8.10. (Vertical axis: observations A, B, C, D; horizontal axis: minimum distance between clusters, 0 to 6.)
structures, where each object is connected to only a fraction of the other objects in the set). Here the computational effort is on the order of nA, where A is the average number of connections for each object in the set. Furthermore, single-linkage clustering does not require metric data. The implementation of the algorithm described above works just as well for ordinal measures of dissimilarity.
One drawback of single linkage is that it tends to be extremely myopic. An object will be added to a cluster so long as it is close to any one of the other objects in the cluster, even if it is relatively far from all the others. Thus, single linkage has a tendency to produce long, stringy clusters and nonconvex cluster shapes. If the true underlying clusters are nonconvex, then this property is not necessarily a bad thing; however, in most cases the naturally occurring modes in our data will tend to be convex and compact and a better reflection of internal homogeneity. As a direct result, the approach has not performed well in Monte Carlo studies (see, e.g., Milligan, 1980).
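To make this chaining tendency concrete, here is a minimal Python sketch (my own illustration, not from the text; the data, the random seed, and the SciPy calls are assumptions for illustration). It builds two compact groups joined by a thin bridge of points and runs SciPy's single-linkage routine.

import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
# Two compact groups about six units apart, connected by a thin "bridge".
group_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(20, 2))
group_b = rng.normal(loc=[6.0, 0.0], scale=0.3, size=(20, 2))
bridge = np.column_stack([np.linspace(1.0, 5.0, 8), np.zeros(8)])
X = np.vstack([group_a, group_b, bridge])

Z = linkage(X, method="single")   # single-linkage agglomeration

# The last row of the linkage matrix records the final merge; its third entry
# is the distance at which the last two clusters join.  Because single linkage
# needs only one close pair, the bridge chains the two groups together at a
# distance far smaller than the ~6-unit separation of the group centers.
print("final single-linkage merge distance:", round(float(Z[-1, 2]), 2))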
Complete Linkage. Instead of defining the distance (or dissimilarity) between clusters as the distance between the closest pair of objects (as in single linkage), we use the distance between the farthest pair of objects. This ensures that each object added to a cluster is close to all objects in the cluster and not just one. The only change required to go from single-linkage to complete-linkage clustering is to rewrite step 3 as follows:
d_{C_{n+1}, C_k} = \max\{ d_{C_i, C_k},\; d_{C_j, C_k} \} \qquad (8.6)
Compared to single linkage, complete linkage is much more likely to produce convex clusters that tend to be of comparable diameter. Although complete linkage has a tendency to produce convenient and homogeneous groupings, these are not necessarily driven by the natural modality of the data. Milligan (1980) found that complete linkage can be highly sensitive to outliers in the data. When a tie occurs at step 3 (i.e., more than two clusters can be joined together), the choice can affect the subsequent shape of the cluster solution (a problem that does not occur in the case of single linkage).
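To see how the update in equation (8.6) fits into the agglomerative loop, here is a small Python sketch (my own illustration, not from the text; the function name and the toy dissimilarity matrix are made up). It repeatedly joins the two closest active clusters and then applies the max rule when computing distances from the merged cluster.

import numpy as np

def complete_linkage_merges(D):
    """Agglomerate a full dissimilarity matrix; return the merge history."""
    D = np.asarray(D, dtype=float).copy()
    n = D.shape[0]
    clusters = {k: (k,) for k in range(n)}      # active cluster id -> members
    history, next_id = [], n
    while len(clusters) > 1:
        ids = sorted(clusters)
        # Steps 1-2: find and join the pair of active clusters that are closest.
        dist, a, b = min(((D[p, q], p, q)
                          for i, p in enumerate(ids) for q in ids[i + 1:]),
                         key=lambda t: t[0])
        # Step 3, equation (8.6): the distance from the merged cluster to any
        # other cluster is the larger of its distances to the two components.
        D = np.pad(D, ((0, 1), (0, 1)))
        for c in ids:
            if c not in (a, b):
                D[next_id, c] = D[c, next_id] = max(D[a, c], D[b, c])
        history.append((clusters[a], clusters[b], dist))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return history

# Toy symmetric dissimilarities among four objects labeled 0 through 3.
D = np.array([[0.0, 2.0, 6.0, 10.0],
              [2.0, 0.0, 5.0,  9.0],
              [6.0, 5.0, 0.0,  4.0],
              [10.0, 9.0, 4.0,  0.0]])
for left, right, d in complete_linkage_merges(D):
    print("merge", left, "+", right, "at distance", d)

For reference, SciPy's linkage applied to the condensed form of D with method="complete" should reproduce the same merge distances.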
Average Linkage. Instead of using the closest or farthest pair of objects, we can define the distance between two clusters as the average distance between all pairs of objects (one object from each cluster). The distance between any cluster C_k and the new cluster C_{n+1} (formed by joining together clusters C_i and C_j) can then be computed as a weighted average of the distances from C_k to the two component clusters. Thus, we rewrite step 3 as follows:
d_{C_{n+1}, C_k} = \frac{n_i d_{C_i, C_k} + n_j d_{C_j, C_k}}{n_i + n_j} \qquad (8.7)
where n_i + n_j is the number of objects in the newly formed cluster C_{n+1}. Note that if the data are nonmetric, the average can be replaced by the median (in which case the method is called median linkage).
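As a small sketch of the weighted-average update in equation (8.7), the following Python function (my own illustration, not from the text; the name and the numbers are made up) computes the distance from the merged cluster to another cluster:

def average_linkage_update(d_ik, d_jk, n_i, n_j):
    """Equation (8.7): distance from C_{n+1} (C_i joined with C_j) to C_k."""
    return (n_i * d_ik + n_j * d_jk) / (n_i + n_j)

# Example: C_i holds 3 objects at (average) distance 2.0 from C_k, and C_j
# holds 1 object at distance 6.0, so the merged cluster sits at
# (3 * 2.0 + 1 * 6.0) / 4 = 3.0 from C_k.
print(average_linkage_update(2.0, 6.0, 3, 1))

Weighting by cluster size is what makes this recursion equal to the plain average over all pairs of objects, one object from each cluster.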
Centroid Method. Instead of defining the distance between two clusters as the average distance between all pairs of objects, it is also possible to first "average" the objects in each cluster (in effect, calculating the cluster centroids) and then define the distance between the two centroids. For this method, it simplifies things if we work with squared distances. Let d_{ij}^2 represent the squared Euclidean distance between objects i and j. If cluster C = {i, j}, then the squared distance between object k and the centroid of cluster C can be written as
d_{kC}^2 = \frac{d_{ki}^2 + d_{kj}^2}{2} - \frac{d_{ij}^2}{4} \qquad (8.8)
In general, the squared distance between any cluster C_k and the new cluster C_{n+1} created by joining clusters C_i and C_j can be written as

d_{C_k, C_{n+1}}^2 = \frac{n_{C_i} d_{C_k, C_i}^2 + n_{C_j} d_{C_k, C_j}^2}{n_{C_i} + n_{C_j}} - \frac{n_{C_i} n_{C_j} d_{C_i, C_j}^2}{(n_{C_i} + n_{C_j})^2} \qquad (8.9)
By writing the rule in step 3 as a function of squared Euclidean distance (rather than the attribute measures X), the centroid method can be used with directly assessed proximity measures as well as derived distance measures (e.g., squared distances calculated from attribute data). According to Milligan (1980), the centroid method is robust to outliers but may be outperformed by average linkage.
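As a quick numerical check of equation (8.8), here is a minimal Python sketch (my own illustration, not from the text; the random points and the helper names are arbitrary) comparing the formula with the squared distance to the centroid computed directly from coordinates:

import numpy as np

def sq_dist_to_pair_centroid(d2_ki, d2_kj, d2_ij):
    """Equation (8.8): squared distance from object k to the centroid of {i, j}."""
    return (d2_ki + d2_kj) / 2.0 - d2_ij / 4.0

rng = np.random.default_rng(2)
i, j, k = rng.normal(size=(3, 4))              # three arbitrary points in 4-D

sq = lambda a, b: float(np.sum((a - b) ** 2))  # squared Euclidean distance
via_formula = sq_dist_to_pair_centroid(sq(k, i), sq(k, j), sq(i, j))
via_centroid = sq(k, (i + j) / 2.0)            # distance to the actual centroid

print(np.isclose(via_formula, via_centroid))   # True: the two values agree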
Ward's Method. The three methods just described (complete linkage, average linkage, and the centroid method) are all variants of a general agglomerative approach called the pair group method, differing only in terms of the distance relation specified in step 3. By contrast, Ward's method (sometimes referred to as the minimum variance method) adopts a slightly different strategy at step 1. Instead of joining the two closest clusters, Ward's method seeks to join the two clusters whose merger leads to the smallest within-cluster sum of squares (i.e., minimum within-group variance).
Ward's method has a tendency to produce equal-sized clusters (i.e., clusters with approximately the same number of observations in each) that are convex and compact. Because the approach is based on the minimization of within-cluster distances, it often produces a clustering solution (if the tree is "cut" in the right place) that is similar to those of the partitioning methods described in section 8.5 below (which also focus on minimizing the within-group sum of squares).
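As a closing sketch (my own illustration, not from the text; the data, the seed, and the choice of three clusters are made up), Ward's method can be run with SciPy, the tree "cut" at three clusters, and the within-cluster sum of squares reported for the resulting partition:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.8, size=(30, 2)) for c in centers])

Z = linkage(X, method="ward")                     # minimum-variance merging
labels = fcluster(Z, t=3, criterion="maxclust")   # "cut" the tree at 3 clusters

# Within-cluster sum of squares for the resulting partition.
wss = sum(((X[labels == g] - X[labels == g].mean(axis=0)) ** 2).sum()
          for g in np.unique(labels))
print("cluster sizes:", np.bincount(labels)[1:])
print("within-cluster sum of squares:", round(float(wss), 2))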