Clustering Methods
and Algorithms
Cluster analysis is the process of classifying objects into subsets that have meaning
in the context of a particular problem. The objects are thereby organized into an
efficient representation that characterizes the population being sampled. In this
chapter we present the clustering methods themselves and explain algorithms for
performing cluster analysis. Section 3.1 lists the factors involved in classifying
objects, and in Sections 3.2 and 3.3 we explain the two most common types of
classification. Computer software for cluster analysis is described in Section 3.4.
Section 3.5 outlines a methodology for using clustering algorithms to the best
advantage. This chapter focuses on the act of clustering itself by concentrating
on the inputs to and outputs from clustering algorithms. The need for the formal
validation methods in Chapter 4 will become apparent during the discussion.
[Figure: A tree of classification types. Classifications are either exclusive or nonexclusive (overlapping); exclusive classifications are either extrinsic (supervised) or intrinsic (unsupervised); intrinsic classifications are either hierarchical or partitional.]
Neither of the two clusterings above is nested into the following partition, and this partition is not nested into either of them:

{(x1, x2, x3, x4), (x5, x6, x7, x8), (x9, x10)}
A hierarchical clustering is a sequence of partitions in which each partition
is nested into the next partition in the sequence. An agglomerative algorithm for
hierarchical clustering starts with the disjoint clustering, which places each of the
n objects in an individual cluster. The clustering algorithm being employed dictates
how the proximity matrix should be interpreted to merge two or more of these
trivial clusters, thus nesting the trivial clustering into a second partition. The
process is repeated to form a sequence of nested clusterings in which the number
of clusters decreases as the sequence progresses until a single cluster containing
all n objects, called the conjoint clustering, remains. A divisive algorithm performs
the task in the reverse order.
A picture of a hierarchical clustering is much easier for a human being to
comprehend than is a list of abstract symbols. A dendrogram is a special type of
tree structure that provides a convenient picture of a hierarchical clustering. A
dendrogram consists of layers of nodes, each representing a cluster. Lines connect
nodes representing clusters which are nested into one another. Cutting a dendrogram
horizontally creates a clustering. Figure 3.2 provides a simple example. Section
3.2.2 explains the role of dendrograms in hierarchical clustering.
Other pictures can also be drawn to visualize a hierarchical clustering (Kleiner
and Hartigan, 1981; Friedman and Rafsky, 1981; Everitt and Nicholls, 1975).
Information other than the sequence in which clusterings appear will be of interest.
The level, or proximity value, at which a clustering is formed can also be recorded.
If objects are represented as patterns, or points in a space, the centroids of the
clusters can be important, as well as the spreads of the clusters.
Two specific hierarchical clustering methods, called the single-link and complete-link methods, are now defined. Section 3.2.1 explains algorithms for these
two commonly used hierarchical clustering methods. The sequences of clusterings
created by these two methods depend on the proximities only through their rank
order.

[Figure 3.2: A hierarchical clustering of five objects shown as a nested sequence of partitions: {(x1),(x2),(x3),(x4),(x5)} (disjoint), {(x1,x2),(x3),(x4),(x5)}, {(x1,x2),(x3,x4),(x5)}, {(x1,x2,x3,x4),(x5)}, and {(x1,x2,x3,x4,x5)} (conjoint), together with the corresponding dendrogram.]

Thus we first assume an ordinal scale for the proximities and use graph
theory to express algorithms. Single-link and complete-link hierarchical methods
are not limited to ordinal data. Sections 3.2.4 and 3.2.5 examine algorithms for
these two methods in terms of interval and ratio data. The effects of proximity
ties on hierarchical clustering are discussed in Section 3.2.6, while algorithms
defined for single-link and complete-link clustering are generalized in Sections
3.2.7 and 3.2.9 to establish new clustering methods. The issues in determining
whether or not a hierarchical classification is appropriate for a given proximity
matrix are postponed until Chapter 4.
Example 3.1
An example of an ordinal proximity matrix for n = 5 is given below.

      x1    x2    x3    x4    x5
x1     0     6     8     2     7
x2     6     0     1     5     3
x3     8     1     0    10     9
x4     2     5    10     0     4
x5     7     3     9     4     0
A threshold graph G(v) contains an edge between objects xi and xj whenever d(i, j) ≤ v; an asterisk in position (i, j) of the corresponding relation matrix means that the pair (xi, xj) belongs to the binary relation.
[Figure: The binary relation for a threshold graph on the five objects, displayed as a matrix with an asterisk in position (i, j) for each pair in the relation.]
For the single-link method:

Step 1. Begin with the disjoint clustering implied by threshold graph G(0), which contains no edges and which places every object in a unique cluster, as the current clustering. Set k ← 1.

Step 2. Form threshold graph G(k). If the number of components (maximally connected subgraphs) in G(k) is less than the number of clusters in the current clustering, redefine the current clustering by naming each component of G(k) as a cluster.

Step 3. If G(k) consists of a single component, stop; otherwise, set k ← k + 1 and go to step 2.
For the complete-link method:

Step 1. Begin with the disjoint clustering implied by threshold graph G(0), which contains no edges and which places every object in a unique cluster, as the current clustering. Set k ← 1.

Step 2. Form threshold graph G(k). If two of the current clusters form a clique (maximal complete subgraph) in G(k), redefine the current clustering by merging these two clusters into a single cluster.

Step 3. If all objects are in a single cluster, stop; otherwise, set k ← k + 1 and go to step 2.
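To make the threshold-graph formulation concrete, the sketch below traces the single-link algorithm on the ordinal proximity matrix of Example 3.1. The code is only an illustration, not part of the original algorithm statement: the union-find bookkeeping, variable names, and 0-based indexing (objects 0 through 4 stand for x1 through x5) are ours. A new clustering is printed each time the number of connected components drops.

import itertools

# Ordinal proximity matrix of Example 3.1 (objects indexed 0..4 for x1..x5).
D = [[0, 6, 8, 2, 7],
     [6, 0, 1, 5, 3],
     [8, 1, 0, 10, 9],
     [2, 5, 10, 0, 4],
     [7, 3, 9, 4, 0]]
n = len(D)

# Threshold graph G(k) contains the k lowest-ranked pairs.
pairs = sorted(itertools.combinations(range(n), 2), key=lambda p: D[p[0]][p[1]])

parent = list(range(n))                 # union-find over the objects

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]   # path halving
        i = parent[i]
    return i

for k, (i, j) in enumerate(pairs, start=1):
    ri, rj = find(i), find(j)
    if ri != rj:                        # the number of components has dropped
        parent[rj] = ri
        clusters = {}
        for obj in range(n):
            clusters.setdefault(find(obj), set()).add(obj)
        print(f"G({k}): {sorted(sorted(c) for c in clusters.values())}")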
[Figure 3.4: Threshold graphs and dendrograms for single-link and complete-link hierarchical clusterings.]
four threshold graphs in Figure 3.4. However, the first seven threshold graphs
are needed to determine the complete-link hierarchy. Once the two-cluster complete-
link clustering has been obtained, no more explicit threshold graphs need be drawn
because the two clusters will merge into the conjoint clustering only when all
n(n — 1)/2 edges have been inserted. This example demonstrates the significance
of nesting in the hierarchy. Objects {x2, x4, x5} form a clique, or complete subgraph, in threshold graph G(5), but the three objects are not a complete-link
cluster. Once complete-link clusters {x2, x3} and {x1, x4} have been established,
object x5 must merge with one of the two established clusters; once formed, clusters
cannot be dissolved and clusters cannot overlap. The dendrograms themselves
are drawn with each clustering shown on a separate level, even though, for example,
the two-cluster single-link clustering is obtained from G(3) and the two-cluster
complete-link clustering is obtained from G(7).
The interpretation of the dendrograms is not under consideration in this
chapter, but the two dendrograms in Figure 3.4 do raise a question about object x5. Does it belong to the cluster {x2, x3} or to the cluster {x1, x4}? A case can also be made for calling {x2, x4, x5} a cluster. Perhaps a hierarchical structure is
not appropriate for this proximity matrix. These issues are examined in Chapter
4.
Hubert (1974a) provides the following algorithms for generating hierarchical
clusterings by the single-link and complete-link methods. When the proximity
matrix contains no ties, clusterings are numbered 0, 1, . . . , (n − 1) and the mth clustering, Cm, contains n − m clusters:

Cm = {Cm1, Cm2, . . . , Cm(n−m)}
Step 1. Set m ← 0. Form the disjoint clustering with clustering number m:

C0 = {(x1), (x2), . . . , (xn)}
Step 2a. To find the next clustering (with clustering number m + 1) by the single-link method, define the function Qs for all pairs (r, t) of clusters in the current clustering as follows.
Qs(r, t) = min {d(i, j): the maximal subgraph of G(d(i, j)) defined by Cmr ∪ Cmt is connected}
Clusters Cmp and Cmq are merged to form the next clustering in the single-link hierarchy if

Qs(p, q) = min {Qs(r, t)}

where the minimum is taken over all pairs (r, t) of clusters in the current clustering.
Step 2b. The function Qc is used to find clustering number m + 1 by the complete-link method and is defined for all pairs (r, t) of clusters in the current clustering.
Qc(r, t) = min {d(i, j): the maximal subgraph of G(d(i, j)) defined by Cmr ∪ Cmt is complete}
Cluster Cmp is merged with cluster Cmq under the complete-link method if

Qc(p, q) = min {Qc(r, t)}
Step 3. Set m ← m + 1 and repeat step 2. Continue until all objects are in a single cluster.
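The functions Qs and Qc reduce to simple computations on the proximity matrix: the union of two clusters first becomes connected at the smallest between-cluster dissimilarity and first becomes complete at the largest pairwise dissimilarity within the union. The short sketch below is an illustration of this observation (the function names and 0-based indexing are ours), evaluated on the matrix of Example 3.1.

def Q_s(d, Cr, Ct):
    """Smallest threshold at which Cr U Ct is connected: the least between-cluster d."""
    return min(d[i][j] for i in Cr for j in Ct)

def Q_c(d, Cr, Ct):
    """Smallest threshold at which Cr U Ct is complete: the largest pairwise d."""
    union = list(Cr) + list(Ct)
    return max(d[i][j] for i in union for j in union if i != j)

# Ordinal proximity matrix of Example 3.1 (objects indexed 0..4 for x1..x5).
D = [[0, 6, 8, 2, 7],
     [6, 0, 1, 5, 3],
     [8, 1, 0, 10, 9],
     [2, 5, 10, 0, 4],
     [7, 3, 9, 4, 0]]

print(Q_s(D, [1, 2], [4]))   # Qs({x2, x3}, {x5}) = 3
print(Q_c(D, [1, 2], [4]))   # Qc({x2, x3}, {x5}) = 9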
Example 3.2
One way to understand the functions Qs and Qc is to consider the sequence of threshold graphs, even though the threshold graphs are not necessary to the evaluation of these functions. For example, the first seven threshold graphs for the proximity matrix of Example 3.1 are given in Figure 3.4. The third clustering (m = 2) can be numbered as follows.

C2 = {C21, C22, C23}
Several authors have objected to certain practical difficulties with clusters formed by the single-
link method. For example, single-link clusters easily chain together and are often
"straggly." Only a single edge between two large clusters is needed to merge
the clusters. On the other hand, complete-link clusters are conservative. All pairs
of objects must be related before the objects can form a complete-link cluster.
Completeness is a much stronger property than connectedness. Perceived deficien-
cies in these two clustering methods have led to a large number of alternatives,
some of which are explained in Sections 3.2.7 and 3.2.9. For example, Hansen
and DeLattre (1978) noted that single-link clusters may chain and have little homogeneity, while complete-link clusters may not be well separated.
Every component (maximally connected subgraph) of a threshold graph is a single-link cluster, but not every clique is a complete-link cluster. Peay (1975) proposed a nonexclusive,
overlapping, hierarchical clustering method based on cliques and extended it to
asymmetric proximity matrices. Matula (1977) noted that the number of possible
cliques is huge, so clustering based on cliques is practical only for small n.
Suppose that the latest clustering of {x1, x2, . . . , xn} in one of the hierarchies has been formed by merging clusters Cmp and Cmq in the clustering

{Cm1, Cm2, . . . , Cm(n−m)}
The following characterizations may help to distinguish the two clustering methods.
If the clustering was by the single-link method, we would know that
min {d(i, j): xi ∈ Cmp, xj ∈ Cmq} = min over pairs (r, s), r ≠ s, of [ min {d(i, j): xi ∈ Cmr, xj ∈ Cms} ]
If the clustering was by the complete-link method, we have that
max {d(i, j): xi ∈ Cmp, xj ∈ Cmq} = min over pairs (r, s), r ≠ s, of [ max {d(i, j): xi ∈ Cmr, xj ∈ Cms} ]
These characterizations show why the single-link method has been called
the "minimum" method and the complete-link method has been named the "maxi-
mum met (Jo nson, *1 owever, 1 t e proximities are similarities instead
of dissimilarities, this terminology would be confusing. This characterization also
explains why the complete-link method is referred to as the "diameter" method.
The diameter of a complete subgraph is the largest proximity among all proximities for pairs of objects in the subgraph. Although the complete-link method does not
generate clusters with minimum diameter, the diameter of a complete-link cluster
is known to equal the level at which the cluster is formed. By contrast, single-
link clusters are based on connectedness and are characterized by minimum path
length among all pairs of objects in the cluster.
In a threshold dendrogram, the level defines a clustering and identifies clusters. The level itself has no meaning
in terms of the scale of the proximity matrix.
A proximity graph is a threshold graph in which each edge is weighted
according to its proximity. The proximities used in Section 3.2.1 are ordinal, so
the weights are integers from 1 to n(n − 1)/2. The dendrogram drawn from a
proximity graph is called a proximity dendrogram and records both the clusterings
and the proximities at which they are formed. Proximity dendrograms are especially
useful when the proximities are on an interval or ratio scale.
Example 3.3
A ratio proximity matrix is given below. The threshold and proximity dendrograms are given in Figure 3.5. Also shown is the sequence of proximity graphs, which provides the actual dissimilarity values at which clusters are formed.

      x2    x3    x4    x5
x1    5.8   4.2   6.9   2.6
x2          6.7   1.7   7.2
x3                1.9   5.6
x4                      7.6
As before, the mth clustering contains n − m clusters:

Cm = {Cm1, Cm2, . . . , Cm(n−m)}
A level function, L, records the proximity at which each clustering is formed.
For a threshold dendrogram L(k) = k, because the levels in the dendrograrn are
evenly spaced. In general, L(m) records the actual proximity at which the clustering with sequence number m is formed. The cophenetic proximity dc(xi, xj) between two objects is the lowest level at which xi and xj appear together in the same cluster of the hierarchy.
[Figure 3.5: Examples of threshold and proximity graphs with corresponding single-link and complete-link dendrograms, in both threshold and proximity form.]
The matrix of values [dc(xi, xj)] is called the cophenetic matrix. The closer the
cophenetic matrix and the given proximity matrix, the better the hierarchy fits
the data. There can be no more than (n — 1) levels in a dendrogram, so there
can be no more than (n — 1) distinct cophenetic proximities. Since the cophenetic
matrix has n(n — 1)/2 entries, it must contain many ties.
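A cophenetic matrix can also be computed with standard software. The sketch below is an illustration only and assumes SciPy is available; linkage builds the single-link hierarchy for the ratio proximity matrix of Example 3.3, and cophenet returns the cophenetic dissimilarities, which should reproduce the matrix displayed next.

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, cophenet

# Ratio proximity matrix of Example 3.3 in condensed (upper-triangular) order.
d = np.array([5.8, 4.2, 6.9, 2.6,   # d(1,2) d(1,3) d(1,4) d(1,5)
              6.7, 1.7, 7.2,        # d(2,3) d(2,4) d(2,5)
              1.9, 5.6,             # d(3,4) d(3,5)
              7.6])                 # d(4,5)

Z = linkage(d, "single")            # single-link hierarchy
d_c = cophenet(Z)                   # condensed cophenetic dissimilarities
print(squareform(d_c))              # full cophenetic matrix, as displayed in the text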
The cophenetic matrix for the single-link dendrogram in Figure 3.5 is shown below and will be used to demonstrate some interesting properties of cophenetic matrices.
      x2    x3    x4    x5
x1    4.2   4.2   4.2   2.6
x2          1.9   1.7   4.2
x3                1.9   4.2
x4                      4.2
Applying the single-link clustering method to this cophenetic matrix reproduces the single-link dendrogram in Figure 3.5. This might be expected. However, applying the complete-link method to it generates the same (single-link) dendrogram. The complete-link method is usually ambiguous when the proximity matrix contains ties, as discussed in Section 3.2.6. However, the cophenetic matrix is so arranged
that tied proximities form complete subgraphs and no ambiguity occurs under
complete-link clustering. A cophenetic matrix is an example of a proximity matrix
with perfect hierarchical structure. Both the single-link and the complete-link meth-
ods generate exactly the same dendrogram when applied to a cophenetic matrix.
Repeating this exercise by starting with the complete-link dendrogram generates
the cophenetic matrix shown below.
      x2    x3    x4    x5
x1    7.6   5.6   7.6   2.6
x2          7.6   1.7   7.6
x3                7.6   5.6
x4                      7.6
This cophenetic matrix also has perfect hierarchical structure. The complete-link and single-link clustering methods will produce exactly the same dendrogram when applied to it, and that dendrogram will be identical to the complete-
link dendrogram in Figure 3.5. An important question in applications is: Which
dendrogram better describes the true structure of the data?
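The claim that a cophenetic matrix has perfect hierarchical structure is easy to check numerically. The sketch below is an illustration only and assumes SciPy is available; it applies single-link and complete-link clustering to the single-link cophenetic matrix given above and compares the resulting cophenetic dissimilarities.

import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform

# The single-link cophenetic matrix from above (objects x1..x5), condensed.
d_cs = squareform(np.array([
    [0.0, 4.2, 4.2, 4.2, 2.6],
    [4.2, 0.0, 1.9, 1.7, 4.2],
    [4.2, 1.9, 0.0, 1.9, 4.2],
    [4.2, 1.7, 1.9, 0.0, 4.2],
    [2.6, 4.2, 4.2, 4.2, 0.0],
]))

same = np.allclose(cophenet(linkage(d_cs, "single")),
                   cophenet(linkage(d_cs, "complete")))
print(same)    # expected: True, the two methods agree on a cophenetic matrix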
3.2.3 Hierarchical Structure and Ultrametricity
The fact that both the single-link and the complete-link methods generate exactly
the same proximity dendrogram when applied to a cophenetic matrix suggests
that the cophenetic matrix captures "true" or "perfect" hierarchical structure.
Whether or not a hierarchical structure is appropriate for a given data set has yet to be determined, but the type of structure exemplified by the cophenetic
matrix is very special. The justification for calling the cophenetic matrix "true"
hierarchical structure comes from the fact that a cophenetic proximity measure
dc defines the following equivalence relation, denoted Rc, on the set of objects:

Rc(a) = {(xi, xj): dc(i, j) ≤ a}
Relation Rc(a) can be shown to be an equivalence relation for any a ≥ 0 by checking the three conditions necessary for an equivalence relation. Since dc(i, i) = 0 for all i,

(xi, xi) ∈ Rc(a) for all a ≥ 0

so Rc(a) is reflexive. Since dc(i, j) = dc(j, i) for all (i, j), (xi, xj) ∈ Rc(a) whenever (xj, xi) ∈ Rc(a), so Rc(a) is symmetric. The final condition, transitivity, requires that for all a ≥ 0,

if (xi, xk) ∈ Rc(a) and if (xk, xj) ∈ Rc(a), then (xi, xj) ∈ Rc(a)
This condition must be satisfied for all triples (xi, xj, xk) of objects and all a. It can also be restated as

dc(i, j) ≤ max {dc(i, k), dc(k, j)}   for all triples (i, j, k)

which is called the ultrametric inequality. Hierarchical clustering methods can thus be viewed as mappings from the class of proximity matrices to the class of ultrametric proximity measures. That is, a hierarchical clustering method imposes a dendrogram on the given proximity matrix and establishes the cophenetic proximity measure, which satisfies the ultrametric inequality. Measures of fit between proximity measures and cophenetic proximity measures are discussed in Chapter 4. The property of ultrametricity is also called monotonicity; a cophenetic proximity measure satisfies monotonicity if the clusters form in a monotonic manner as the dissimilarity increases. In other words, the clusterings are nested in the hierarchy.
Single-link and complete-link clusterings are always monotonic, but other common
clustering methods defined in Section 3.2.7 can create the next clustering at a
smaller dissimilarity than the present one. This issue is discussed in Section
3.2.8.
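The ultrametric inequality is also straightforward to test directly. The sketch below is an illustration only (the function name and tolerance are ours); it checks every triple of objects in the single-link cophenetic matrix of Section 3.2.2.

from itertools import permutations

def is_ultrametric(d, tol=1e-9):
    """True if d(i, j) <= max(d(i, k), d(k, j)) for every triple of objects."""
    n = len(d)
    return all(d[i][j] <= max(d[i][k], d[k][j]) + tol
               for i, j, k in permutations(range(n), 3))

# The single-link cophenetic matrix of Section 3.2.2 (objects x1..x5).
d_cs = [[0.0, 4.2, 4.2, 4.2, 2.6],
        [4.2, 0.0, 1.9, 1.7, 4.2],
        [4.2, 1.9, 0.0, 1.9, 4.2],
        [4.2, 1.7, 1.9, 0.0, 4.2],
        [2.6, 4.2, 4.2, 4.2, 0.0]]
print(is_ultrametric(d_cs))    # True: a cophenetic matrix is an ultrametric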
3.2.4 Other Graph Theory Algorithms for Single-Link and Complete-Link
Step 1. Begin with the disjoint clustering, which places each object in its own cluster. Find an MST (minimum spanning tree) of the proximity graph.
Repeat steps 2 and 3 until all objects are in one cluster.
Step 2. Merge the two clusters connected by the MST edge with the smallest
weight to define the next clustering.
Step 3. Replace the weight of the edge selected in step 2 by a weight
larger than the largest proximity.
These algorithms generate the same single-link clusterings as the algorithms presented
earlier. Gower and Ross (1969) first proposed this algorithm. Rohlf (1973) provided
an implementation that examines each proximity value only once.
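The MST formulation can be sketched in a few lines. The code below is an illustration only (Prim's algorithm and the variable names are ours, and objects are indexed 0 through 4 for x1 through x5); applied to the ratio proximity matrix of Example 3.3, the sorted MST edge weights 1.7, 1.9, 2.6, and 4.2 are exactly the single-link merge levels of Figure 3.5.

def mst_edges(d):
    """Prim's algorithm on a full dissimilarity matrix; returns (weight, i, j) edges."""
    n = len(d)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        w, i, j = min((d[i][j], i, j)
                      for i in in_tree for j in range(n) if j not in in_tree)
        in_tree.add(j)
        edges.append((w, i, j))
    return edges

# Ratio proximity matrix of Example 3.3 (objects indexed 0..4 for x1..x5).
d = [[0.0, 5.8, 4.2, 6.9, 2.6],
     [5.8, 0.0, 6.7, 1.7, 7.2],
     [4.2, 6.7, 0.0, 1.9, 5.6],
     [6.9, 1.7, 1.9, 0.0, 7.6],
     [2.6, 7.2, 5.6, 7.6, 0.0]]

# Merging along MST edges in increasing order reproduces the single-link levels.
for w, i, j in sorted(mst_edges(d)):
    print(f"merge the clusters containing objects {i} and {j} at level {w}")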
Example 3.4
Examples of the two algorithms are given in Figure 3.6 for the proximity matrix defined
below.
      x2    x3    x4    x5
x1    2.3   3.4   1.2   3.7
x2          2.6   1.8   4.6
x3                4.2   0.7
x4                      4.4
[Figure 3.6: The complete proximity graph with the MST darkened, the minimum spanning tree itself, and the single-link dendrogram obtained from the MST by agglomerative and divisive procedures.]
A node coloring of a graph assigns a color to each node so that no two nodes joined by an edge in G(v) are colored the same. Baker and Hubert (1976) show how the set of
node colorings is related to hierarchical clustering. The connection between node
coloring and complete-link clustering is not as simple as is the relation between
single-link clustering and the MST. The last complete-link clustering achieved
for a given threshold graph G(v) corresponds to a coloring of the nodes of the
complement of G(v). Hansen and DeLattre (1978) provide other algorithms from
graph coloring.
Step 1. Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.
Step 2. Find the least dissimilar pair of clusters in the current clustering, say pair {(r), (s)}, according to

d[(r), (s)] = min {d[(i), (j)]}

where the minimum is taken over all pairs of distinct clusters (i), (j) in the current clustering.

Step 3. Increment the sequence number: m ← m + 1. Merge clusters (r) and (s) into a single cluster to form the next clustering, and set the level of this clustering to L(m) = d[(r), (s)].
Step 4. Update the proximity matrix, 9, by deleting the rows and columns
corresponding to clusters (r) and (s) and adding a row and column correspond-
ing to the newly formed cluster. The proximity between the new cluster,
denoted (r, s), and old cluster (k) is defined as follows. For the single-link method,

d[(k), (r, s)] = min {d[(k), (r)], d[(k), (s)]}

and for the complete-link method,

d[(k), (r, s)] = max {d[(k), (r)], d[(k), (s)]}
[Figure 3.7: Step-by-step matrix updating for the proximity matrix of Example 3.4, leading to the single-link and complete-link clusterings.]
The computational examples in Figure 3.7 demonstrate the construction of single-link and complete-link hierarchies.

Example 3.5

This example demonstrates the qualitative differences between
the single-link and complete-link hierarchies for the two artificial data sets defined in Section
2.4. The first data set, called DATA1, consists of 100 patterns in a four-dimensional
pattern space generated so as to have four categories, or true clusters. Patterns 1 through
24 were generated from category 1, patterns 25 through 59 from category 2, patterns 60
through 80 from category 3, and patterns 81 through 100 were generated in category 4.
An eigenvector projection is given in Figure 2.9. The proximity measure is squared Euclidean
distance in the pattern space. Since the proximity measure is Euclidean distance and since
the data were generated to several decimal places on a computer, we feel safe in assuming
that no proximity ties exist, so the hierarchies are both unique (see Section 3.2.6). Figures
3.8 and 3.9 show the proximity dendrograms for the single-link and complete-link hierarchies,
respectively.
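The comparison in this example can be repeated with standard software. The sketch below is an illustration only and assumes SciPy and NumPy are available; the cluster centers, spread, and sample sizes are ours and are not the DATA1 parameters. It generates four well-separated categories, builds single-link and complete-link hierarchies from squared Euclidean proximities, and cuts each dendrogram into four clusters.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Four well-separated categories of 25 patterns each in four dimensions
# (hypothetical centers and spread, not the DATA1 parameters).
centers = np.array([[0, 0, 0, 0], [5, 5, 0, 0], [0, 5, 5, 0], [5, 0, 0, 5]])
X = np.vstack([c + rng.normal(scale=0.5, size=(25, 4)) for c in centers])

d = pdist(X, metric="sqeuclidean")              # squared Euclidean proximities
for method in ("single", "complete"):
    labels = fcluster(linkage(d, method), t=4, criterion="maxclust")
    print(method, np.bincount(labels)[1:])      # sizes of the four recovered clusters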
[Figure 3.8: Single-link hierarchy for 100 clustered patterns in four dimensions.]
Figure 3.9 Complete-link hierarchy for 100 clustered patterns in four dimensions.
This example demonstrates the difficulty in comparing two dendrograms and motivates the development of methods for automatically isolating significant clusters that are presented
in Chapter 4. The complete-link dendrogram in Figure 3.9 can be cut at level 1.0 to
generate four clusters. These clusters recover the original four categories in the data perfectly.
The four-category structure is not at all apparent in the single-link hierarchical clustering
of Figure 3.8.
Clustering methods have the nasty habit of creating clusters in data even when no
natural clusters exist, so hierarchies and clusterings must be viewed with extreme suspicion.
Figures 3.10 and 3.11 demonstrate this statement on the two hierarchies for a data set, called DATA2, consisting of 100 points uniformly distributed over a unit hypercube in
six dimensions (see Section 2.4). The patterns are positioned at random, so it is barely
possible that they have arranged themselves into meaningful clusters; however, it is unlikely
that real clusters exist, especially considering the two-dimensional projections in Figure
2.11. We thus interpret Figures 3.10 and 3.11 as hierarchies in which no true clusters
exist. The single-link dendrogram in Figure 3.10 exhibits the chaining that is characteristic
of single-link hierarchies. This chaining can occur even when valid clusters exist, as in
Figure 3.8. The complete-link hierarchy in Figure 3.11 suggests some meaningful clusters;
it looks more clustered than the single-link hierarchy, and this is the lure of complete-link
clustering. It tends to produce dendrograms that form small clusters which combine nicely
into larger clusters even when such a hierarchy is not warranted, as with random data.
This example should demonstrate the difficulties inherent in letting the human eye scan
over the dendrogram to pick out believable clusters and clusterings.
[Figure 3.10: Single-link hierarchy for 100 random patterns in six dimensions.]
3.2.6 Ties in Proximity
The computational complexity of competing algorithms for implementing a particular
clustering method and the availability of software should determine which algorithm is
appropriate for a given application. The problem of choosing between the single-link
and complete-link methods is much more difficult than choosing an algorithm for
one of the methods. No list of characteristics exists that lets us choose between the
two methods in a calm, rational manner. Some theoretical and practical information
about the two methods is summarized in this section, especially the effects of ties in
the proximity matrix.
The single-link and complete-link methods differ in many respects, such as in the structures recovered and the procedures used. The two methods produce the same clusterings when the proximity matrix satisfies the ultrametric inequality, as discussed in Section 3.2.3. This section demonstrates that the two methods differ in the way they treat ties in the proximity matrix. Up to now, we have assumed that the proximity matrix contains no ties, so that two new clusters are never formed at the same level and the algorithms defined thus far produce unique dendrograms. A tie implies that two or more edges are added to the proximity graph at the same level.
[Figure 3.12: Effects of ties in proximity on single-link and complete-link clustering: (a) proximity matrix and threshold graphs; (b) proximity dendrograms; (c) altered proximity matrix and dendrograms.]
With the altered proximity matrix of Figure 3.12(c), a unique complete-link hierarchy is obtained because the two edges with the same proximity can be added in arbitrary order. However, the
hierarchy has two clusters forming at level 3. The single-link hierarchy is also
shown. The single- and complete-link dendrograms in Figure 3.12(c) resemble
one another much more closely than do those in Figure 3.12(b), which might
lead one to believe that the proximity matrix in Figure 3.12(c) had a good hierarchical
structure, whereas that in Figure 3.12(a) has a poor hierarchical structure. This
example raises the issue of sensitivity. It appears that the hierarchical structure
can change dramatically with small changes in the rank orders of the proximities.
The havoc that ties can create in a complete-link hierarchy has been noted
by several researchers. Sibson (1971) and Williams et al. (1971) argue against
the complete-link method as a feasible clustering procedure (see also Hubert,
1974a). The practical problem of ties is subtle. Software packages do not typically
check for ties. The order in which an edge is added from a set of edges with the
same proximity is at the whim of the programmer. The program will generate
only one complete-link clustering, even though a number of clusterings might be
equally justifiable. This problem is compounded when the proximity matrix contains
several ties. The comparative studies in Section 3.5.2 suggest that the complete-
link method produces more useful hierarchies in many applications than does the
single-link method, even though proximity ties make it ambiguous.
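Because most packages do not check for ties, it is prudent to test for them before interpreting a complete-link hierarchy. The sketch below is an illustration only; the function name and the small matrix are ours.

from collections import Counter

def tied_proximities(d):
    """Return dissimilarity values that occur for more than one pair of objects."""
    n = len(d)
    counts = Counter(d[i][j] for i in range(n) for j in range(i + 1, n))
    return sorted(v for v, c in counts.items() if c > 1)

# A hypothetical dissimilarity matrix with one tie: d(x1, x2) = d(x1, x4) = 2.
d = [[0, 2, 3, 2],
     [2, 0, 4, 5],
     [3, 4, 0, 6],
     [2, 5, 6, 0]]
print(tied_proximities(d))    # [2] -> a complete-link run on d may not be unique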
This section generalizes the algorithms in Section 3.2.5 and discusses issues in
the computation and application of these algorithms. Questions of the validity of
cluster structures are taken up in Chapter 4. The general paradigm for expressing
SAHN (Sequential, Agglomerative, Hierarchical, Nonoverlapping) clustering meth-
ods is given in Section 3.2.5. Step 4 of that algorithm specifies how the dissimilarity
matrix is to be updated by defining the formula for the dissimilarity between a
newly formed cluster, (r, s), and an existing cluster, (k) with nk objects. The
single-link and complete-link algorithms use the minimum and maximum, respec-
tively, of the dissimilarities between the pairs {(k), (r)} and {(k), (s)}. Other clustering
methods can be defined by specifying different combinations of the distances in-
volved. A general formula for step 4 that includes most of the commonly referenced
hierarchical clustering methods is given below.
d[(k), (r, s)] = αr d[(k), (r)] + αs d[(k), (s)] + β d[(r), (s)] + γ | d[(k), (r)] − d[(k), (s)] |

Table 3.1  Coefficients defining seven SAHN clustering methods

Clustering method                    αr                         αs                         β                        γ
Single-link                          1/2                        1/2                        0                        −1/2
Complete-link                        1/2                        1/2                        0                        1/2
UPGMA (group average)                nr/(nr + ns)               ns/(nr + ns)               0                        0
WPGMA (weighted average)             1/2                        1/2                        0                        0
UPGMC (unweighted centroid)          nr/(nr + ns)               ns/(nr + ns)               −nr ns/(nr + ns)²        0
WPGMC (weighted centroid)            1/2                        1/2                        −1/4                     0
Ward's method (minimum variance)     (nr + nk)/(nr + ns + nk)   (ns + nk)/(nr + ns + nk)   −nk/(nr + ns + nk)       0
An "unweighted" method treats each object in a cluster equally, regardless of the structure of the dendrogram.
A "weighted" method weights all clusters the same, so objects in small clusters
are weighted more heavily than objects in large clusters. The suffixes "A" and
"C" refer to "arithmetic averages" and "centroids." Thus "UPGMA" stands
for "unweighted pair group method using arithmetic averages" and "WPGMC"
refers to "weighted pair group method using centroids." Rohlf (1970) and Sneath
and Sokal (1973) have used this terminology. The UPGMC method has also been
called, simply, the centroid method, while the WPGMC method has been called
the median method (see Lance and Williams, 1967).
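Most of Table 3.1 can be exercised with a short program. The sketch below is an illustration only: the data structures and names are ours, only four of the seven methods are included, and for Ward's method the input should be squared Euclidean distances. Step 4 of the SAHN algorithm is implemented directly from the matrix-updating formula.

# Lance-Williams coefficients (alpha_r, alpha_s, beta, gamma) from Table 3.1,
# as functions of the cluster sizes n_r, n_s, n_k.
COEFFS = {
    "single":   lambda nr, ns, nk: (0.5, 0.5, 0.0, -0.5),
    "complete": lambda nr, ns, nk: (0.5, 0.5, 0.0, 0.5),
    "upgma":    lambda nr, ns, nk: (nr / (nr + ns), ns / (nr + ns), 0.0, 0.0),
    "ward":     lambda nr, ns, nk: ((nr + nk) / (nr + ns + nk),
                                    (ns + nk) / (nr + ns + nk),
                                    -nk / (nr + ns + nk), 0.0),
}

def sahn(d, method="single"):
    """Agglomerate via matrix updating; d is a full dissimilarity matrix."""
    coeff = COEFFS[method]
    clusters = {i: [i] for i in range(len(d))}
    dist = {(i, j): d[i][j] for i in clusters for j in clusters if i < j}
    merges = []
    while len(clusters) > 1:
        # Step 2: find the least dissimilar pair of clusters.
        (r, s), level = min(dist.items(), key=lambda kv: kv[1])
        merges.append((level, list(clusters[r]), list(clusters[s])))
        nr, ns = len(clusters[r]), len(clusters[s])
        # Step 4: dissimilarity between the new cluster and each old cluster (k).
        updated = {}
        for k in clusters:
            if k in (r, s):
                continue
            nk = len(clusters[k])
            ar, a_s, beta, gamma = coeff(nr, ns, nk)
            dkr = dist[(min(k, r), max(k, r))]
            dks = dist[(min(k, s), max(k, s))]
            updated[k] = ar * dkr + a_s * dks + beta * level + gamma * abs(dkr - dks)
        # Step 3: merge (r) and (s); the merged cluster keeps label r.
        clusters[r] = clusters.pop(r) + clusters.pop(s)
        dist = {pair: v for pair, v in dist.items() if r not in pair and s not in pair}
        for k, v in updated.items():
            dist[(min(k, r), max(k, r))] = v
    return merges

Applied to the ratio proximity matrix of Example 3.3, sahn(d, "single") should report merges at levels 1.7, 1.9, 2.6, and 4.2, matching the single-link proximity dendrogram in Figure 3.5.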
Example 3.6
Dendrograms for the seven algorithms in Table 3.1 are drawn in Figure 3.13. The six
objects involved are the six pattern vectors defined below in a three-dimensional space.

x1 = (1.0 2.0 2.0)T     x4 = (3.0 4.0 3.0)T
x2 = (2.0 1.0 2.0)T     x5 = (0.0 3.5 3.5)T
The tie in proximity between pairs of patterns (x1, x5) and (x3, x6) causes no ambiguity
in any of the dendrograms. The dendrograms for Ward's method, the two arithmetic average
methods, and the two centroid methods all have the same topology and differ only in
levels. They suggest that x4 is an outlier because it joins the cluster of the other five patterns last and the gap between the formation of the five-pattern cluster and the singleton cluster is large. The single-link dendrogram has much the same topology, except that x5 now appears to be the outlier. The complete-link dendrogram establishes cluster (x4, x5). All dendrograms agree that (x1, x2, x6) is a strong cluster. Quantitative measures of the
strength and quality of clusters and clusterings, defined in Chapter 4, should help answer
such questions as: If one of the dendrograms were to be cut to define a partition, where is
the best cutting level? Is (x1, x2, x6, x3) a good cluster?
[Figure 3.13: Dendrograms for the single-link, complete-link, UPGMA, WPGMA, UPGMC, WPGMC, and Ward's methods applied to the six patterns of Example 3.6.]
Ward's method is based on a square-error criterion. The centroid of cluster k, which contains nk patterns xi(k), is

m(k) = (1/nk) Σ xi(k)

where the sum runs over the nk patterns in cluster k. The square-error for cluster k is the sum of squared distances to the centroid for all patterns in cluster k:

ek² = Σ || xi(k) − m(k) ||²

Since the square-errors for all clusters except for the three clusters involved remain the same,

ΔEpq² = et² − ep² − eq²

where cluster t is formed by merging clusters p and q. After a bit of algebra, we find that the change in square-error depends only on the centroids:

ΔEpq² = ( np nq / (np + nq) ) || mp − mq ||²
The clusters p and q selected for merger are the clusters that minimize this
quantity. The square-error must increase as the number of clusters decreases, but
the increase is as small as possible in Ward's method. Once clusters p and q are
merged into cluster t, the proximity between all other clusters and the new cluster
t must be updated. Letting cluster r represent a cluster other than p, q, or t, the
following formula can be applied to find d[(r), (t)]:
d[(r), (t)] = ((nr + np)/(nr + np + nq)) d[(r), (p)] + ((nr + nq)/(nr + np + nq)) d[(r), (q)] − (nr/(nr + np + nq)) d[(p), (q)]
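The square-error identity behind Ward's method is easy to verify numerically. The sketch below is an illustration only; the two small clusters are hypothetical and are not data from the text.

import numpy as np

# Two small, hypothetical clusters (points chosen only for illustration).
P = np.array([[1.0, 2.0], [2.0, 1.0]])                 # cluster p, n_p = 2
Q = np.array([[6.0, 5.0], [7.0, 6.0], [6.5, 7.0]])     # cluster q, n_q = 3

def square_error(X):
    """Sum of squared distances from each pattern to the cluster centroid."""
    return float(((X - X.mean(axis=0)) ** 2).sum())

# Direct change in total square-error caused by merging p and q.
direct = square_error(np.vstack([P, Q])) - square_error(P) - square_error(Q)

# Closed form: n_p n_q / (n_p + n_q) * || m_p - m_q ||^2.
m_p, m_q = P.mean(axis=0), Q.mean(axis=0)
closed = len(P) * len(Q) / (len(P) + len(Q)) * float(((m_p - m_q) ** 2).sum())

print(direct, closed)    # both values are 54.3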
The choice of a suitable hierarchical clustering method is an important matter
in applications, but theory provides few guidelines for optimizing the choice.
Square-error is a familiar criterion in engineering, so one might feel comfortable
with a procedure that minimizes square-error, such as Ward's method. However,
the objective of cluster analysis is to investigate the structure of the data, so the
imposition of an a priori criterion, such as square-error, might not be appropriate.
All data do not occur as patterns, so we cannot limit our thinking to geometrical
constructs. Section 3.5.2 reviews several empirical studies that compare hierarchical
clustering methods and that guide the choice of a clustering method.
[Figure 3.14: A dissimilarity matrix on four objects and the dendrograms generated by the single-link, complete-link, UPGMC, and WPGMC methods; the centroid dendrograms exhibit a crossover.]
For both the single-link and complete-link updates, d[(k), (r, s)] ≥ d[(r), (s)] for all clusters (k) distinct from (r) and (s). That is, no dissimilarity in the updated matrix can be smaller than
the smallest entry in the previous matrix. Another
way of saying this is that the cophenetic matrix
generated by these two methods satisfies the ultrametric
inequality.
What can be said about the monotonicity of
SAHN algorithms expressed through matrix updating,
especially those defined by the matrix updating algorithm
in Table 3.1? Figure 3.14 provides a simple example of
a dissimilarity matrix and the dendrograms generated
by the single-link, complete-link, UPGMC, and WPGMC
methods. The dendrograms from the centroid methods
(UPGMC and WPGMC) are not monotone and exhibit
what is called a "crossover" or a "reversal" since clusters
(x1, x2) and (x3, x4) merge at a level lower than the level
at which (x3, x4) is first defined.
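A crossover is easy to reproduce. The sketch below is an illustration only; the three points are ours and are not the data of Figure 3.14. They form an almost equilateral triangle measured with squared Euclidean distance, and after the closest pair merges, the centroid update of Table 3.1 yields a dissimilarity smaller than the first merge level.

# Three points forming an almost equilateral triangle (hypothetical data),
# measured with squared Euclidean distance.
a, b, c = (0.0, 0.0), (2.0, 0.0), (1.0, 1.8)

def sqdist(p, q):
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

d_ab, d_ac, d_bc = sqdist(a, b), sqdist(a, c), sqdist(b, c)
print(d_ab, d_ac, d_bc)      # 4.0 4.24 4.24 -> (a, b) merges first at level 4.0

# Centroid update from Table 3.1 (UPGMC and WPGMC coincide for two singletons):
# alpha_r = alpha_s = 1/2, beta = -1/4, gamma = 0.
d_ab_to_c = 0.5 * d_ac + 0.5 * d_bc - 0.25 * d_ab
print(d_ab_to_c)             # 3.24 < 4.0: the next merge level drops, a crossover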
Monotonicity is clearly a property of the clustering method and has nothing to do with the proximity matrix. The advantage of the matrix updating formula is that the monotonicity of any SAHN algorithm that can be expressed in terms of this updating formula can be predicted from the coefficients. Assuming that αr > 0 and αs ≥ 0, Milligan (1979) provided the following results. The matrix
updating formula for step 4 of the SAHN algorithm is
repeated below for easy reference. Clusters (r) and (s)
are being merged into cluster (r, s) and the dissimilarity
between distinct cluster (k) and the newly formed
cluster is being established:

d[(k), (r, s)] = αr d[(k), (r)] + αs d[(k), (s)] + β d[(r), (s)] + γ | d[(k), (r)] − d[(k), (s)] |
Milligan's conditions on the coefficients guarantee that d[(k), (r, s)] ≥ d[(r), (s)] for all clusters (k) other than (r) and (s), which implies monotonicity.
To demonstrate this result, one considers the case d[(k), (r)] > d[(k), (s)] and uses the first inequality, together with the fact that γ is negative, in the matrix updating equation.
The centroid methods have drawn particular criticism: Williams and Lance (1977) call them obsolete. Anderberg (1973) claims that they
lack a useful interpretation for general proximities. Sneath and Sokal (1973) claim
that "the frequency of reversals and the relatively high degree of its distortion of
the original [proximity] matrix has led to the abandonment of this [UPGMC] tech-
nique." On the other hand, the performances of nonmonotone methods in several
comparative studies of clustering methods discussed in Section 3.5.2 do not suggest
that such methods be abandoned. Williams et al. (1971) argue that monotonicity
is not essential for the proper performance of hierarchical clustering.
Following the algorithm, clusters Cmp and Cmq are merged to form the next
clustering in the sequence if
Qp(k)(p, q) = min {Qp(k)(r, t)}
Some examples of property p are given below. Integer k is a parameter,
so, for example, p(k) could mean a node connectivity of k or a node degree of k.
Example 3.7
Figure 3.15 demonstrates hierarchical clustering methods defined by graph properties. An
ordinal proximity matrix is given below on eight objects. Threshold graph G(13) is pictured
in Figure 3.15(a) to help in establishing the dendrograms for several methods, as well as
for the single-link and complete-link methods. Proximity dendrograms are shown in Figure
3.15(b)—(h). A simple way to find these hierarchies with pencil and paper is first to list
the pairs of objects in rank order by proximity. Then construct a sequence of threshold
graphs and find the first threshold graph at which a property is satisfied. It is important to
check the property only on the subgraph formed by the union of the subgraphs for two
existing clusters.
     1    2    3    4    5    6    7    8
1    0   13   21   18    4    8    7   28
2         0    9   19   15   14   10   16
3              0   22   20   12   11   17
4                   0    3   23   27    1
5                        0    5   24    2
6                             0    6   25
7                                  0   26
8                                       0
[Figure 3.15: (a) Threshold graph G(13) for the eight-object ordinal proximity matrix, and (b)-(h) proximity dendrograms for the single-link, complete-link, and several graph-property clustering methods.]

3.3 PARTITIONAL CLUSTERING
Hierarchical clustering techniques organize the data into a nested sequence of
groups. An important characteristic of hierarchical clustering methods is the visual
impact of the dendrogram, which enables a data analyst to see how objects are
being merged into clusters or split at successive levels of proximity. The data
analyst can then try to decide whether the entire dendrogram describes the data
or can select a clustering, at some fixed level of proximity, which makes sense
for the application in hand. We refer to nonhierarchical clustering methods as
partitional clustering methods. They generate a single partition of the data in an
attempt to recover natural groups present in the data. Both clustering strategies
have their appropriate domains of applications. Hierarchical clustering methods
generally require only the proximity matrix among the objects, whereas partitional
techniques expect the data in the form of a pattern matrix. It is generally assumed
that the features have been measured on a ratio scale.
Hierarchical techniques are popular in biological, social, and behavioral sci-