
3

Clustering Methods and Algorithms

Cluster analysis is the process of classifying objects into subsets that have meaning
in the context of a particular problem. The objects are thereby organized into an
efficient representation that characterizes the population being sampled. In this
chapter we present the clustering methods themselves and explain algorithms for
performing cluster analysis. Section 3.1 lists the factors involved in classifying
objects, and in Sections 3.2 and 3.3 we explain the two most common types of
classification. Computer software for cluster analysis is described in Section 3.4.
Section 3.5 outlines a methodology for using clustering algorithms to the best
advantage. This chapter focuses on the act of clustering itself by concentrating
on the inputs to and outputs from clustering algorithms. The need for the formal
validation methods in Chapter 4 will become apparent during the discussion.

3.1 GENERAL INTRODUCTION

A clustering is a type of classification imposed on a finite set of objects. As


explained in Section 2.2, the relationship between objects is represented in a
proximity matrix in which rows and columns correspond to objects. If the objects
are characterized as patterns, or points in a d-dimensional metric space, the proximities
can be distances between pairs of points, such as Euclidean distance. Unless
a meaningful measure of distance, or proximity, between pairs of objects has
been established, no meaningful cluster analysis is possible. The proximity matrix
is the one and only input to a clustering algorithm.


Clustering is a special kind of classification. See Kendall (1966) for discussion


on the relationship between classification and clustering. Figure 3.1 shows a tree
of classification problems as suggested by Lance and Williams (1967). Each leaf
in the tree in Figure 3.1 defines a different genus of classification problem. The
nodes in the tree of Figure 3.1 are defined below.

a. Exclusive versus nonexclusive. An exclusive classification is a


partition of the set of objects. Each object belongs to exactly one subset, or
cluster. Nonexclusive, or overlapping, classification can assign an object
to several classes. For example, a grouping of people by age or sex is
exclusive, whereas a grouping by disease category is nonexclusive because
a person can have several diseases simultaneously. Shepard and Arabie
(1979) provide a review of nonexclusive or overlapping clustering methods.
This chapter treats only exclusive classification. Fuzzy clustering is a type
of nonexclusive classification in which a pattern is assigned a degree of
belongingness to each cluster in a partition and is explained in Section
3.3.8.

b. Intrinsic versus extrinsic. An intrinsic classification uses only the


proximity matrix to perform the classification. Intrinsic classification is called
"unsupervised learning" in pattern recognition because no category labels
denoting an a priori partition of the objects are used. (See Appendix A for
an introduction to pattern recognition.) Extrinsic classification uses
category labels on the objects as well

Figure 3.1 Tree of classification types. (Classifications divide into exclusive and nonexclusive (overlapping); exclusive classifications divide into extrinsic (supervised) and intrinsic (unsupervised); exclusive, intrinsic classifications divide into hierarchical and partitional.)



as the proximity matrix. The problem is then to establish a discriminant surface


that separates the objects according to category. In other words, an extrinsic classifier
relies on a "teacher," whereas an intrinsic classifier has only the proximity matrix.
One way to evaluate an intrinsic classification is to see how the cluster
labels, assigned to objects during clustering, match the category labels, assigned
a priori. For example, suppose that various indices of personal health were collected
from smokers and nonsmokers. An intrinsic classification would group the individu-
als based on similarities among the health indices and then try to determine whether
smoking was a factor in the propensity of individuals toward various diseases.
An extrinsic classification would study ways of discriminating smokers from non-
smokers based on health indices. We are concerned only with intrinsic classification
in this book; intrinsic classification is the essence of cluster analysis.

c. Hierarchical versus partitional. Exclusive, intrinsic classifications are sub-


divided into hierarchical and partitional classifications by the type of structure
imposed on the data. A hierarchical classification is a nested sequence of partitions
and is explained in Section 3.2, whereas a partitional classification is a single
partition and is defined in Section 3.3. Thus a hierarchical classification is a special
sequence of partitional classifications. We will use the term clustering for an
exclusive, intrinsic, partitional classification and the term hierarchical clustering
for an exclusive, intrinsic, hierarchical classification. Sneath and Sokal (1973)
apply the acronym SAHN (Sequential, Agglomerative, Hierarchical, Nonoverlap-
ping) to exclusive, intrinsic, hierarchical, agglomerative algorithms. The differences
and similarities between algorithms for generating these two types of classifications
are the topics of this chapter.

Several algorithms can be proposed to express the same exclusive, intrinsic


classification. One frequently uses an algorithm to express a clustering method,
then examines various computer implementations of the method. The primary
algorithmic options in common use are explained below.

1. Agglomerative versus divisive. An agglomerative, hierarchical


classification places each object in its own cluster and gradually merges
these atomic clusters into larger and larger clusters until all objects are in a
single cluster. Divisive, hierarchical classification reverses the process by
starting with all objects in one cluster and subdividing into smaller pieces.
Thus this option corresponds to a choice of procedure rather than to a different
kind of classification. Partitional classification can be characterized in the same
way. A single partition can be established by gluing together small clusters
(agglomerative) or by fragmenting a single all-inclusive cluster (divisive).
2. Serial versus simultaneous. Serial procedures handle the patterns
one by one, whereas simultaneous classification works with the entire set of
patterns at the same time (see Clifford and Stephenson, 1975).
3. Monothetic versus polythetic. This option is most applicable to
problems in taxonomy, where the objects to be clustered are represented as
patterns, or
points in a space. A monothetic clustering algorithm uses the features one


by one, whereas a polythetic procedure uses all the features at once. For
example, a different feature can be used to form each partition in a hierarchical
classification under a monothetic algorithm. We will consider only polythetic
algorithms.
4. Graph theory versus matrix algebra. What is the appropriate mathematical
formalism for expressing a clustering algorithm? We will express some algo-
rithms in terms of graph theory, using properties such as connectedness
and completeness to define classifications, and express other algorithms in
terms of algebraic constructs, such as mean-square-error. The choice is a matter
of clarity, convenience, and personal preference. When implementing an algorithm
on a computer, attention must be paid to questions of computational efficiency.
This issue is not related to human understanding of the classification method.
Some algorithms have convenient expressions under both options.

3.2 HIERARCHICAL CLUSTERING

A hierarchical clustering method is a procedure for transforming a proximity matrix


into a sequence of nested partitions. A hierarchical clustering algorithm is the
specification of steps for performing a hierarchical clustering. It is often convenient
to characterize a hierarchical clustering method by writing down an algorithm,
but the algorithm should be separated from the method itself. In addition to defining
algorithms and methods in this section, we define the type of mathematical
structure a hierarchical clustering imposes on data and describe ways of viewing that
structure.
First comes the notion of a sequence of nested partitions. The n objects to
be clustered are denoted by the set 𝒳:

𝒳 = {x1, x2, . . . , xn}

where xi is the ith object. A partition, 𝒞, of 𝒳 breaks 𝒳 into subsets {C1, C2, . . . , Cm} satisfying the following:

Ci ∩ Cj = ∅   for i and j from 1 to m, i ≠ j
C1 ∪ C2 ∪ . . . ∪ Cm = 𝒳

In this notation, "∩" stands for set intersection, "∪" stands for set union, and ∅ is the empty set. A clustering is a partition; the components of the partition are called clusters. Partition ℬ is nested into partition 𝒞 if every component of ℬ is a subset of a component of 𝒞. That is, 𝒞 is formed by merging components of ℬ. For example, if the clustering 𝒞 with three clusters and the clustering ℬ with five clusters are defined as follows, then ℬ is nested into 𝒞. Both 𝒞 and ℬ are clusterings of the set of objects {x1, x2, . . . , x10}.

𝒞 = {(x1, x3, x5, x7), (x2, x4, x6, x8), (x9, x10)}
ℬ = {(x1, x3), (x5, x7), (x2), (x4, x6, x8), (x9, x10)}

Neither 𝒞 nor ℬ is nested into the following partition, and this partition is
not nested into 𝒞 or ℬ.

{(x1, x2, x3, x4), (x5, x6, x7, x8), (x9, x10)}
A hierarchical clustering is a sequence of partitions in which each partition
is nested into the next partition in the sequence. An agglomerative algorithm for
hierarchical clustering starts with the disjoint clustering, which places each of the
n objects in an individual cluster. The clustering algorithm being employed dictates
how the proximity matrix should be interpreted to merge two or more of these
trivial clusters, thus nesting the trivial clustering into a second partition. The
process is repeated to form a sequence of nested clusterings in which the number
of clusters decreases as the sequence progresses until a single cluster containing
all n objects, called the conjoint clustering, remains. A divisive algorithm performs
the task in the reverse order.
A picture of a hierarchical clustering is much easier for a human being to
comprehend than is a list of abstract symbols. A dendrogram is a special type of
tree structure that provides a convenient picture of a hierarchical clustering. A
dendrogram consists of layers of nodes, each representing a cluster. Lines connect
nodes representing clusters which are nested into one another. Cutting a dendrogram
horizontally creates a clustering. Figure 3.2 provides a simple example. Section
3.2.2 explains the role of dendrograms in hierarchical clustering.
Other pictures can also be drawn to visualize a hierarchical clustering (Kleiner
and Hartigan, 1981; Friedman and Rafsky, 1981; Everitt and Nicholls, 1975).
Information other than the sequence in which clusterings appear will be of interest.
The level, or proximity value, at which a clustering is formed can also be recorded.
If objects are represented as patterns, or points in a space, the centroids of the
clusters can be important, as well as the spreads of the clusters.
Figure 3.2 Example of dendrogram: a nested sequence of clusterings of {x1, . . . , x5}, from the disjoint clustering {(x1), (x2), (x3), (x4), (x5)} through {(x1, x2), (x3), (x4), (x5)} and {(x1, x2), (x3, x4), (x5)} to the conjoint clustering {(x1, x2, x3, x4, x5)}.

Two specific hierarchical clustering methods are now defined: the single-link
and the complete-link methods. Section 3.2.1 explains algorithms for these
two commonly used hierarchical clustering methods. The sequences of clusterings
created by these two methods depend on the proximities only through their rank
order. Thus we first assume an ordinal scale for the proximities and use graph
theory to express algorithms. Single-link and complete-link hierarchical methods
are not limited to ordinal data. Sections 3.2.4 and 3.2.5 examine algorithms for
these two methods in terms of interval and ratio data. The effects of proximity
ties on hierarchical clustering are discussed in Section 3.2.6, while algorithms
defined for single-link and complete-link clustering are generalized in Sections
3.2.7 and 3.2.9 to establish new clustering methods. The issues in determining
whether or not a hierarchical classification is appropriate for a given proximity
matrix are postponed until Chapter 4.

3.2.1 Single-Link and Complete-Link Algorithms from


Graph Theory

We begin with a symmetric n × n proximity matrix 𝒟 = [d(i, j)], as defined in
Section 2.2. The n(n − 1)/2 entries on one side of the main diagonal are assumed
to contain a permutation of the integers from 1 to n(n − 1)/2 with no ties. That
is, the proximities are on an ordinal scale. We take the proximities to be dissimilarities;
d(1, 2) > d(1, 3) means that objects 1 and 3 are more like one another than
are objects 1 and 2.

Example 3.1
An example of an ordinal proximity matrix for n = 5 is given as matrix 𝒟1.

      x1   x2   x3   x4   x5
x1     0    6    8    2    7
x2     6    0    1    5    3
x3     8    1    0   10    9
x4     2    5   10    0    4
x5     7    3    9    4    0

A threshold graph is an undirected, unweighted graph on n nodes without
self-loops or multiple edges. Each node represents an object. See Appendix G
for a brief review of terms in graph theory. A threshold graph G(v) is defined
for each dissimilarity level v by inserting an edge (i, j) between nodes i and j if
objects i and j are less dissimilar than v. That is,

(i, j) ∈ G(v) if and only if d(i, j) ≤ v

As discussed in Section 2.2, we assume that d(i, i) = 0 for all i. Thus
G(v) defines a binary relation for any real number v that is reflexive and symmetric.
A binary relation is a subset of the product set 𝒳 × 𝒳, where 𝒳 is the set of
objects. Objects xi and xj are "related" if their dissimilarity is below the threshold
v. Reflexive, symmetric binary relations are pictured in a natural fashion by a
threshold graph. Figure 3.3 shows the binary relation obtained from proximity
matrix 𝒟1 above for a threshold of 5. The symbol "*" in position (i, j) of the
matrix means that the pair (xi, xj) belongs to the binary relation.

      x1   x2   x3   x4   x5
x1     *              *
x2          *    *    *    *
x3          *    *
x4     *    *         *    *
x5          *         *    *

Figure 3.3 Binary relation and threshold graph for threshold 5.
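As a concrete illustration, the small sketch below (an assumption of this text, not the book's software; Python and NumPy are used for all such sketches) builds the adjacency matrix of a threshold graph G(v) directly from the dissimilarity matrix 𝒟1 of Example 3.1.

```python
# Minimal sketch: the threshold graph G(v) as a Boolean adjacency matrix.
# An edge (i, j) is present exactly when d(i, j) <= v; self-loops are excluded.
import numpy as np

D1 = np.array([[0, 6, 8, 2, 7],
               [6, 0, 1, 5, 3],
               [8, 1, 0, 10, 9],
               [2, 5, 10, 0, 4],
               [7, 3, 9, 4, 0]], dtype=float)

def threshold_graph(D, v):
    A = D <= v
    np.fill_diagonal(A, False)   # no self-loops in a threshold graph
    return A

print(threshold_graph(D1, 5).astype(int))   # G(5): edges for the 5 smallest dissimilarities
```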

Simple algorithms for the single-link and complete-link clustering methods
based on threshold graphs are listed below. These algorithms should help one
conceptualize the way in which the two hierarchies are formed and can easily be
applied to small problems. Other algorithms are given later in this chapter that
are appropriate for computer implementation. Both algorithms assume an ordinal
dissimilarity matrix containing no tied entries and produce a nested sequence of
clusterings that can be pictured on a dendrogram.

AGGLOMERATIVE ALGORITHM FOR SINGLE-LINK CLUSTERING

Step 1. Begin with the disjoint clustering implied by threshold graph G(0),
which contains no edges and which places every object in a unique cluster,
as the current clustering. Set k ← 1.

Step 2. Form threshold graph G(k).
If the number of components (maximally connected subgraphs) in G(k)
is less than the number of clusters in the current clustering, redefine the
current clustering by naming each component of G(k) as a cluster.

Step 3. If G(k) consists of a single connected graph, stop. Else, set
k ← k + 1 and go to step 2.

AGGLOMERATIVE ALGORITHM FOR COMPLETE-LINK CLUSTERING

Step 1. Begin with the disjoint clustering implied by threshold graph G(0),
which contains no edges and which places every object in a unique cluster,
as the current clustering. Set k ← 1.

Step 2. Form threshold graph G(k).
If two of the current clusters form a clique (maximally complete subgraph)
in G(k), redefine the current clustering by merging these two clusters
into a single cluster.

Step 3. If k = n(n − 1)/2, so that G(k) is the complete graph on the n
nodes, stop. Else, set k ← k + 1 and go to step 2.
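The following Python sketch (an illustration assumed by this text, not the book's code) implements both algorithms literally: edges are inserted in order of increasing dissimilarity, components are merged for single-link, and two clusters are merged under complete-link only when every cross pair is an edge of the current threshold graph. Object indices are 0-based, so x1, . . . , x5 appear as 0, . . . , 4.

```python
# Threshold-graph algorithms for single-link and complete-link clustering,
# for a symmetric dissimilarity matrix with untied off-diagonal entries.
from itertools import combinations
import numpy as np

def single_link(D):
    """Merge components as edges are inserted in order of increasing dissimilarity."""
    n = len(D)
    clusters = [frozenset([i]) for i in range(n)]       # disjoint clustering, G(0)
    hierarchy = [list(clusters)]
    for i, j in sorted(combinations(range(n), 2), key=lambda e: D[e]):
        ci = next(c for c in clusters if i in c)
        cj = next(c for c in clusters if j in c)
        if ci != cj:                                     # the new edge joins two components
            clusters = [c for c in clusters if c not in (ci, cj)] + [ci | cj]
            hierarchy.append(list(clusters))
    return hierarchy

def complete_link(D):
    """Merge two clusters when every cross pair is an edge of the current threshold graph."""
    n = len(D)
    clusters = [frozenset([i]) for i in range(n)]
    hierarchy = [list(clusters)]
    for v in sorted(D[i, j] for i, j in combinations(range(n), 2)):  # G(1), G(2), ...
        for a, b in combinations(clusters, 2):
            if all(D[i, j] <= v for i in a for j in b):              # a ∪ b is a clique in G(v)
                clusters = [c for c in clusters if c not in (a, b)] + [a | b]
                hierarchy.append(list(clusters))
                break                                                # one new edge: at most one merge
    return hierarchy

D1 = np.array([[0, 6, 8, 2, 7], [6, 0, 1, 5, 3], [8, 1, 0, 10, 9],
               [2, 5, 10, 0, 4], [7, 3, 9, 4, 0]], dtype=float)
print(single_link(D1)[-2])    # two-cluster single-link clustering, formed in G(3)
print(complete_link(D1)[-2])  # two-cluster complete-link clustering, formed in G(7)
```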

These algorithms can be extended to dissimilarity matrices on interval and


ratio scales as long as no entries are tied. Simply view G(k) as the threshold
graph containing edges corresponding to the k smallest dissimilarities. A threshold
dendrogram records the clusterings in the order in which they are formed, irrespec-
tive of the dissimilarity level at which the clusterings first appear. A proximity
dendrogram lists the dissimilarity level at which each clustering forms and, in
effect, is a nonlinear transformation of the scale used with a threshold dendrogram.
Examples of proximity dendrograms are given in Section 3.2.2.
The single-link clustering on G(v) is defined in terms of connected subgraphs
in G(v); the complete-link clustering uses complete subgraphs. However, not all
maximally complete subgraphs in a threshold graph need be complete-link clusters.
The order in which the clusters are formed is crucial. Figure 3.4 exhibits the
single-link and complete-link hierarchical clusterings for the proximity matrix 𝒟1
of Example 3.1. The first seven threshold graphs in the sequence of 10 threshold
graphs are shown with nodes labeled so that node j denotes object xj.
Figure 3.4 Threshold graphs and dendrograms for single-link and complete-link hierarchical clusterings.

Please note the following peculiarities about forming hierarchical clusterings
from threshold graphs. The entire single-link hierarchy is defined by the first
four threshold graphs in Figure 3.4. However, the first seven threshold graphs
are needed to determine the complete-link hierarchy. Once the two-cluster complete-link
clustering has been obtained, no more explicit threshold graphs need be drawn
because the two clusters will merge into the conjoint clustering only when all
n(n − 1)/2 edges have been inserted. This example demonstrates the significance
of nesting in the hierarchy. Objects {x2, x4, x5} form a clique, or complete
subgraph, in threshold graph G(5), but the three objects are not a complete-link
cluster. Once complete-link clusters {x2, x3} and {x1, x4} have been established,
object x5 must merge with one of the two established clusters; once formed, clusters
cannot be dissolved and clusters cannot overlap. The dendrograms themselves
are drawn with each clustering shown on a separate level, even though, for example,
the two-cluster single-link clustering is obtained from G(3) and the two-cluster
complete-link clustering is obtained from G(7).
The interpretation of the dendrograms is not under consideration in this
chapter, but the two dendrograms in Figure 3.4 do raise a question about object
x5. Does it belong to the cluster {x2, x3} or to the cluster {x1, x4}? A case can
also be made for calling {x2, x4, x5} a cluster. Perhaps a hierarchical structure is
not appropriate for this proximity matrix. These issues are examined in Chapter
4.
Hubert (1974a) provides the following algorithms for generating hierarchical
clusterings by the single-link and complete-link methods. When the proximity
matrix contains no ties, clusterings are numbered 0, 1, . . . , (n − 1) and the
mth clustering, 𝒞m, contains n − m clusters:

𝒞m = {Cm1, Cm2, . . . , Cm(n−m)}

HUBERT'S ALGORITHM FOR SINGLE-LINK AND COMPLETE-LINK METHODS

Step 1. Set m ← 0. Form the disjoint clustering with clustering number m:

𝒞0 = {(x1), (x2), . . . , (xn)}

Step 2a. To find the next clustering (with clustering number m + 1) by the
single-link method, define the function Qs for all pairs (r, t) of clusters in
the current clustering as follows:

Qs(r, t) = min {d(i, j) : the maximal subgraph of G(d(i, j)) defined by Cmr ∪ Cmt is connected}

Clusters Cmp and Cmq are merged to form the next clustering in the single-link
hierarchy if

Qs(p, q) = min over all pairs (r, t) of {Qs(r, t)}

Step 2b. The function Qc is used to find clustering number m + 1 by the
complete-link method and is defined for all pairs (r, t) of clusters in the
current clustering:

Qc(r, t) = min {d(i, j) : the maximal subgraph of G(d(i, j)) defined by Cmr ∪ Cmt is complete}

Cluster Cmp is merged with cluster Cmq under the complete-link method if

Qc(p, q) = min over all pairs (r, t) of {Qc(r, t)}

Step 3. Set m ← m + 1 and repeat step 2. Continue until all objects are in
a single cluster.

The word "maximal" in the definitions of functions Q, and Q, means that


all nodes of the two clusters Cmr and Cm, must be considered when establishing
connectedness or completeness. Only existing clusters can be merged at the next
level.

Example 3.2
One way to understand the functions Qs and Qc is to consider the sequence of threshold
graphs, even though the threshold graphs are not necessary to the evaluation of these functions.
For example, the first seven threshold graphs for the proximity matrix 𝒟1 (Example 3.1)
are given in Figure 3.4. The third clustering (m = 2) can be numbered as follows:

𝒞2 = {C21, C22, C23}

The three clusters are defined as

C21 = {x5},   C22 = {x2, x3},   C23 = {x1, x4}

To evaluate Qs when m is 2, find the smallest proximity that will connect two of
the existing clusters. Clusters C21 and C22 become connected in threshold graph G(3).
Therefore, (p, q) is (1, 2) and Qs(p, q) is 3. Another way of understanding this function
is to realize that Qs(r, t) is the smallest dissimilarity that connects clusters Cmr and Cmt,
and the smallest of the dissimilarities so found defines the next clustering. In this case,
clusters C21 and C22 first connect at level 3, or in G(3), clusters C21 and C23 first connect
in G(4), and clusters C22 and C23 first form a connected subgraph in G(5). The minimum
of the levels (3, 4, 5) is 3.
The interpretation of Qc is much the same, with completeness replacing connectedness.
For example, Qc is found when m = 2 by searching the threshold graphs in sequence
until one is found that merges existing clusters from 𝒞2 into a complete subgraph. This
does not happen until G(7), so (p, q) is (1, 3) and Qc(p, q) is 7. Clusters C21 and C22 first
form a complete subgraph in threshold graph G(9), clusters C21 and C23 first merge into a
complete subgraph in G(7), and clusters C22 and C23 first form a complete subgraph in
G(10). The minimum of the levels (7, 9, 10) is 7, so the fourth complete-link clustering
(m = 3) is achieved at threshold 7 and merges clusters C21 and C23. The fact that other
complete subgraphs are formed in the process, such as {x2, x4, x5}, is immaterial.
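A short sketch (assumed helper names, not from the book) that evaluates Qs and Qc for the clustering 𝒞2 of this example reproduces the levels (3, 4, 5) and (7, 9, 10) found above. Note that Qs(r, t) reduces to the smallest cross-cluster dissimilarity, and Qc(r, t) to the largest dissimilarity within the union of the two clusters; object indices are 0-based.

```python
# Hubert's Q_s and Q_c for C_2 = {C_21, C_22, C_23} of Example 3.2 (x1..x5 -> 0..4).
from itertools import combinations
import numpy as np

D1 = np.array([[0, 6, 8, 2, 7],
               [6, 0, 1, 5, 3],
               [8, 1, 0, 10, 9],
               [2, 5, 10, 0, 4],
               [7, 3, 9, 4, 0]], dtype=float)

C2 = [frozenset({4}), frozenset({1, 2}), frozenset({0, 3})]   # C_21, C_22, C_23

def Q_s(r, t):
    # level at which C_r ∪ C_t becomes connected: the smallest cross dissimilarity
    return min(D1[i, j] for i in r for j in t)

def Q_c(r, t):
    # level at which C_r ∪ C_t becomes complete: the largest dissimilarity in the union
    return max(D1[i, j] for i, j in combinations(sorted(r | t), 2))

print([Q_s(r, t) for r, t in combinations(C2, 2)])   # [3.0, 4.0, 5.0] -> single-link merges at 3
print([Q_c(r, t) for r, t in combinations(C2, 2)])   # [9.0, 7.0, 10.0] -> complete-link merges at 7
```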

Single-link clusters are characterized as maximally connected subgraphs,
whereas complete-link clusters are cliques, or maximally complete subgraphs.
Jardine and Sibson (1971) have demonstrated several desirable theoretical properties
of single-link clusterings, but several authors (e.g., Wishart, 1969; Hubert, 1974b)
have objected to certain practical difficulties with clusters formed by the single-
link method. For example, single-link clusters easily chain together and are often
"straggly." Only a single edge between two large clusters is needed to merge
the clusters. On the other hand, complete-link clusters are conservative. All pairs
of objects must be related before the objects can form a complete-link cluster.
Completeness is a much stronger property than connectedness. Perceived deficien-
cies in these two clustering methods have led to a large number of alternatives,
some of which are explained in Sections 3.2.7 and 3.2.9. For example, Hansen
and DeLattre (1978) noted that single-link clusters may chain and have little homogeneity,
while complete-link clusters may not be well separated.
Every maximally connected subgraph of a threshold graph is a single-link cluster, but
not every clique is a complete-link cluster. Peay (1975) proposed a nonexclusive,
overlapping, hierarchical clustering method based on cliques and extended it to
asymmetric proximity matrices. Matula (1977) noted that the number of possible
cliques is huge, so clustering based on cliques is practical only for small n.
Suppose that the latest clustering of {x1, x2, . . . , xn} in one of the hierarchies
has been formed by merging clusters Cmp and Cmq in the clustering

{Cm1, Cm2, . . . , Cm(n−m)}

The following characterizations may help to distinguish the two clustering methods.
If the clustering was by the single-link method, we would know that

min {d(i, j) : xi ∈ Cmp, xj ∈ Cmq} = min over pairs (r, t), r ≠ t, of [ min {d(i, j) : xi ∈ Cmr, xj ∈ Cmt} ]

If the clustering was by the complete-link method, we have that

max {d(i, j) : xi ∈ Cmp, xj ∈ Cmq} = min over pairs (r, t), r ≠ t, of [ max {d(i, j) : xi ∈ Cmr, xj ∈ Cmt} ]

These characterizations show why the single-link method has been called
the "minimum" method and the complete-link method has been named the "maximum"
method (Johnson, 1967). However, if the proximities are similarities instead
of dissimilarities, this terminology would be confusing. This characterization also
explains why the complete-link method is referred to as the "diameter" method.
The diameter of a complete subgraph is the largest proximity among all proximities
for pairs of objects in the subgraph. Although the complete-link method does not
generate clusters with minimum diameter, the diameter of a complete-link cluster
is known to equal the level at which the cluster is formed. By contrast, single-link
clusters are based on connectedness and are characterized by minimum path
length among all pairs of objects in the cluster.

3.2.2 Dendrograms and Recovered Structure

An important objective of hierarchical cluster analysis is to provide a picture of


the data that can easily be interpreted, such as the dendrograms in Figure 3.4.
Dendrograms list the clusterings one after another. Cutting a dendrogram at any
level defines a clustering and identifies clusters. The level itself has no meaning
in terms of the scale of the proximity matrix.
A proximity graph is a threshold graph in which each edge is weighted
according to its proximity. The proximities used in Section 3.2.1 are ordinal, so
the weights are integers from 1 to n(n − 1)/2. The dendrogram drawn from a
proximity graph is called a proximity dendrogram and records both the clusterings
and the proximities at which they are formed. Proximity dendrograms are especially
useful when the proximities are on an interval or ratio scale.

Example 3.3
A ratio proximity matrix is given below as 𝒟2. The threshold and proximity dendrograms
are given in Figure 3.5. Also shown is the sequence of proximity graphs, which provides
the actual dissimilarity values at which clusters are formed.

        x2    x3    x4    x5
  x1   5.8   4.2   6.9   2.6
  x2         6.7   1.7   7.2
  x3               1.9   5.6
  x4                     7.6

A proximity dendrogram is drawn on a proximity scale from a sequence of
proximity graphs and highlights clusters that form at low proximity levels and "last" a long
time in the dendrogram. These observations are the basis for formal measures of
cluster validity in Chapter 4.
Any hierarchical clustering algorithm can be seen as a way of transforming
a proximity matrix into a dendrogram. Only the single-link and complete-link
methods of clustering have been discussed so far, but the statement applies to
hierarchical clustering methods defined in Sections 3.2.7 and 3.2.9 as well. Threshold
and proximity dendrograms represent the structure that the hierarchical clustering
method is imposing on the data. This imposed structure can be captured in another
proximity matrix called the cophenetic matrix. The agreement between the given
proximity matrix and the cophenetic matrix measures the degree to which the
hierarchical clustering method captures the actual structure of the data. Formal
methods for measuring this agreement are discussed in Chapter 4. Here the cophen-
etic matrix is defined to help explain the difference between the single-link and
complete-link methods.
We begin with a hierarchical clustering

𝒞0, 𝒞1, . . . , 𝒞n−1

where the mth clustering contains n − m clusters:

𝒞m = {Cm1, Cm2, . . . , Cm(n−m)}

A level function, L, records the proximity at which each clustering is formed.
For a threshold dendrogram L(k) = k, because the levels in the dendrogram are
evenly spaced. In general,

L(m) = min {d(xi, xj) : clustering 𝒞m is defined}

The cophenetic proximity measure dC on the n objects is the level at which
objects xi and xj are first in the same cluster:

dC(i, j) = L(kij)

where

kij = min {m : (xi, xj) ∈ Cmq for some q}
Figure 3.5 Examples of threshold and proximity graphs with corresponding threshold and proximity dendrograms for the single-link and complete-link methods.

The matrix of values [dC(xi, xj)] is called the cophenetic matrix. The closer the
cophenetic matrix and the given proximity matrix, the better the hierarchy fits
the data. There can be no more than (n − 1) levels in a dendrogram, so there
can be no more than (n − 1) distinct cophenetic proximities. Since the cophenetic
matrix has n(n − 1)/2 entries, it must contain many ties.
The cophenetic matrix for the single-link dendrogram in Figure 3.5 is shown
below as 𝒟CS and will be used to demonstrate some interesting properties of cophenetic
matrices.

        x2    x3    x4    x5
  x1   4.2   4.2   4.2   2.6
  x2         1.9   1.7   4.2
  x3               1.9   4.2
  x4                     4.2
Applying the single-link clustering method to 𝒟CS reproduces the single-link
dendrogram in Figure 3.5. This might be expected. However, applying the
complete-link method to 𝒟CS generates the same (single-link) dendrogram. The
complete-link method is usually ambiguous when the proximity matrix contains
ties, as discussed in Section 3.2.6. However, the cophenetic matrix is so arranged
that tied proximities form complete subgraphs and no ambiguity occurs under
complete-link clustering. A cophenetic matrix is an example of a proximity matrix
with perfect hierarchical structure. Both the single-link and the complete-link methods
generate exactly the same dendrogram when applied to a cophenetic matrix.
Repeating this exercise by starting with the complete-link dendrogram generates
the cophenetic matrix 𝒟CC.

        x2    x3    x4    x5
  x1   7.6   5.6   7.6   2.6
  x2         7.6   1.7   7.6
  x3               7.6   5.6
  x4                     7.6

The cophenetic matrix 𝒟CC also has perfect hierarchical structure. The complete-link
and single-link clustering methods will produce exactly the same dendrogram
when applied to 𝒟CC, and that dendrogram will be identical to the complete-link
dendrogram in Figure 3.5. An important question in applications is: Which
dendrogram better describes the true structure of the data?
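The cophenetic matrices above can be reproduced, and the claim that clustering a cophenetic matrix gives back the same dendrogram under both methods can be checked, with a few lines of SciPy. The use of scipy.cluster.hierarchy here is a convenience assumed by this note rather than the book's software.

```python
# Cophenetic matrices for D2 of Example 3.3, and a check of "perfect hierarchical structure".
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform

D2 = np.array([[0.0, 5.8, 4.2, 6.9, 2.6],
               [5.8, 0.0, 6.7, 1.7, 7.2],
               [4.2, 6.7, 0.0, 1.9, 5.6],
               [6.9, 1.7, 1.9, 0.0, 7.6],
               [2.6, 7.2, 5.6, 7.6, 0.0]])
y = squareform(D2)                        # condensed form expected by linkage()

Z_single = linkage(y, method='single')
Z_complete = linkage(y, method='complete')

D_CS = squareform(cophenet(Z_single))     # cophenetic matrix of the single-link dendrogram
D_CC = squareform(cophenet(Z_complete))   # cophenetic matrix of the complete-link dendrogram

# Single-link and complete-link applied to a cophenetic matrix give the same merge levels.
for D_C in (D_CS, D_CC):
    s = linkage(squareform(D_C), method='single')
    c = linkage(squareform(D_C), method='complete')
    print(np.allclose(s[:, 2], c[:, 2]))  # True: identical dendrogram heights
```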
3.2.3 Hierarchical Structure and Ultrametricity

The fact that both the single-link and the complete-link methods generate exactly
the same proximity dendrogram when applied to a cophenetic matrix suggests
that the cophenetic matrix captures "true" or "perfect" hierarchical structure.
Whether or not a hierarchical structure is appropriate for a given data set has
yet to be determined, but the type of structure exemplified by the cophenetic
matrix is very special. The justification for calling the cophenetic matrix "true"
hierarchical structure comes from the fact that a cophenetic proximity measure
dC defines the following equivalence relation, denoted RC, on the set of objects:

RC(a) = {(xi, xj) : dC(i, j) ≤ a}
Relation RC(a) can be shown to be an equivalence relation for any a ≥ 0
by checking the three conditions necessary for an equivalence relation. Since
dC(i, i) = 0 for all i,

(xi, xi) ∈ RC(a)   for all a ≥ 0

so RC(a) is reflexive. Since dC(i, j) = dC(j, i) for all (i, j),

(xj, xi) ∈ RC(a) if (xi, xj) ∈ RC(a),   for all a ≥ 0

so RC(a) is symmetric. The final condition, transitivity, requires that for all a ≥ 0,

if (xi, xk) ∈ RC(a) and if (xk, xj) ∈ RC(a), then (xi, xj) ∈ RC(a)

This condition must be satisfied for all triples (xi, xj, xk) of objects and all a. It
can also be restated as

dC(i, j) ≤ max {dC(i, k), dC(k, j)}   for all (i, j, k)

When stated in this way, the requirement is called the ultrametric inequality.
A close inspection of the cophenetic matrices for Figure 3.5 shows that they
satisfy the ultrametric inequality, so RC(a) is, indeed, an equivalence relation for
any a ≥ 0. The nesting of the clusterings forming the hierarchy assures transitivity.
The only way that the very restrictive ultrametric inequality can be satisfied is to
have many ties in the cophenetic proximities. Recall that, at most, only n − 1 of
the n(n − 1)/2 cophenetic proximities can be distinct. Since cophenetic proximity
measures represent perfect hierarchical structure, proximity measures encountered in applications will
seldom reflect true hierarchical structure. The concept of ultrametricity has been
developed separately in mathematics and has applications in physics. See Rammal
et al. (1986) for an excellent review of ultrametricity and Schikhof (1984) for a
mathematical treatment of ultrametricity in the realm of "p-adic" analysis.
Two items should be noted with regard to the ultrametric inequality. First,
the cophenetic proximities derived from single-link and complete-link clusterings
always satisfy the ultrametric inequality. However, Section 3.2.7 will introduce
some hierarchical clustering methods whose cophenetic proximities are not ultra-
metric. Second, a geometric interpretation of the ultrametric inequality demonstrates
why proximities measured in applications are very seldom ultrametric. Suppose
that each object is a pattern in a d-dimensional space. If Euclidean distance is
the measure of proximity and if the proximity matrix is to be ultrametric, the
triangles formed by all triples of points must be isosceles triangles with the unequal
leg no longer than the two legs of equal length.
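The inequality is easy to test mechanically. The helper below (an assumed name, not from the book) checks all triples and illustrates the point: a randomly generated Euclidean distance matrix essentially never passes, while the cophenetic matrix 𝒟CS of Section 3.2.2 does.

```python
# Check the ultrametric inequality d(i, j) <= max{d(i, k), d(k, j)} over all triples.
from itertools import permutations
import numpy as np

def is_ultrametric(D, tol=1e-12):
    n = len(D)
    return all(D[i, j] <= max(D[i, k], D[k, j]) + tol
               for i, j, k in permutations(range(n), 3))

rng = np.random.default_rng(0)
X = rng.random((6, 2))                                      # six random points in the plane
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # ordinary Euclidean distances
print(is_ultrametric(D))        # almost surely False

D_CS = np.array([[0.0, 4.2, 4.2, 4.2, 2.6],                 # single-link cophenetic matrix
                 [4.2, 0.0, 1.9, 1.7, 4.2],
                 [4.2, 1.9, 0.0, 1.9, 4.2],
                 [4.2, 1.7, 1.9, 0.0, 4.2],
                 [2.6, 4.2, 4.2, 4.2, 0.0]])
print(is_ultrametric(D_CS))     # True: every cophenetic matrix is ultrametric
```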
Jardine and Sibson (1971) characterize hierarchical clustering methods as
mappings from the class of proximity matrices to the class of ultrametric proximity
measures. That is, a hierarchical clustering method imposes a dendrogram on the
given proximity matrix and thereby establishes a cophenetic proximity measure, which
satisfies the ultrametric inequality. Measures of fit between proximity measures
and cophenetic proximity measures are discussed in Chapter 4. The property of
ultrametricity is also called monotonicity; a cophenetic proximity measure satisfies
it if the clusters form in a monotonic manner as the
dissimilarity increases. In other words, the clusterings are nested in the hierarchy.
Single-link and complete-link clusterings are always monotonic, but other common
clustering methods defined in Section 3.2.7 can create the next clustering at a
smaller dissimilarity than the present one. This issue is discussed in Section
3.2.8.
3.2.4 Other Graph Theory Algorithms for Single-Link and Complete-Link

The algorithms for single-link and complete-link hierarchical clusterings described


thus far establish step-by-step procedures for forming dendrograms. In this section
we present other algorithms for these clustering methods that provide insight into
the clustering methods and can be computationally attractive.
An algorithm for single-link clustering begins with the minimum spanning
tree (MST) for G(∞), which is the proximity graph containing all n(n − 1)/2
edges. Although the single-link hierarchy can be derived from the MST, the MST
cannot be found from a single-link hierarchical clustering. For convenience, we
assume that no two edges in the MST have the same weight, even though Section
3.2.6 shows that ties in proximity pose no problem with single-link clustering.
An agglomerative algorithm for single-link clustering is given below that assumes
a dissimilarity matrix.

GRAPH THEORY ALGORITHM FOR SINGLE-LINK CLUSTERING

Step 1. Begin with the disjoint clustering, which places each object in its
own cluster. Find an MST on G(∞).
Repeat steps 2 and 3 until all objects are in one cluster.
Step 2. Merge the two clusters connected by the MST edge with the smallest
weight to define the next clustering.
Step 3. Replace the weight of the edge selected in step 2 by a weight
larger than the largest proximity.

This algorithm follows from the characterization for single-link clustering
given in Section 3.2.1 and the definition of the MST. A divisive algorithm is just as
simple: cut the edges in the MST in order of weight, cutting the largest first.
Each cut defines a new clustering, with those objects connected in the MST at
any stage belonging to the same cluster. As long as no proximity ties occur,
these algorithms generate the same single-link clusterings as the algorithms presented
earlier. Gower and Ross (1969) first proposed this algorithm. Rohlf (1973) provided
an implementation that examines each proximity value only once.

Example 3.4
Examples of the two algorithms are given in Figure 3.6 for the proximity matrix 𝒟3 defined
below.

        x2    x3    x4    x5
  x1   2.3   3.4   1.2   3.7
  x2         2.6   1.8   4.6
  x3               4.2   0.7
  x4                     4.4
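A quick sketch of the MST route to single-link clustering on the matrix 𝒟3 above, using SciPy's minimum spanning tree routine (a library choice assumed here, not the book's software): sorting the MST edges by increasing weight gives the agglomeration order, and cutting the same edges in decreasing order gives the divisive variant.

```python
# MST-based single-link clustering for the matrix D3 of Example 3.4.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

D3 = np.array([[0.0, 2.3, 3.4, 1.2, 3.7],
               [2.3, 0.0, 2.6, 1.8, 4.6],
               [3.4, 2.6, 0.0, 4.2, 0.7],
               [1.2, 1.8, 4.2, 0.0, 4.4],
               [3.7, 4.6, 0.7, 4.4, 0.0]])

mst = minimum_spanning_tree(D3).tocoo()             # the n - 1 weighted MST edges
edges = sorted(zip(mst.data, mst.row, mst.col))     # agglomerative order: increasing weight
for w, i, j in edges:
    print(f"merge the clusters containing x{i + 1} and x{j + 1} at level {w}")
# Cutting the same edges in decreasing order of weight gives the divisive algorithm.
```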

Figure 3.6 Examples of agglomerative and divisive single-link algorithms based on the MST.

A node coloring of a threshold graph G(v) is an assignment of "colors," or labels,
to the n nodes in such a way that no two nodes connected by an edge
in G(v) are colored the same. Baker and Hubert (1976) show how the set of
node colorings is related to hierarchical clustering. The connection between node
coloring and complete-link clustering is not as simple as is the relation between
single-link clustering and the MST. The last complete-link clustering achieved
for a given threshold graph G(v) corresponds to a coloring of the nodes of the
complement of G(v). Hansen and DeLattre (1978) provide other algorithms from
graph coloring.

3.2.5 Matrix Updating Algorithms for Single-Link and


Complete-Link

In this section we discuss algorithms for single-link and complete-link clustering


in terms of a scheme for updating the proximity matrix. This approach was suggested
by King (1967) and popularized by Johnson (1967), who formalized the procedure.
The algorithm is an agglomerative scheme that erases rows and columns in the
proximity matrix as old clusters are merged into new ones. We again simplify
the algorithm by assuming no ties in the proximity matrix. Figure 3.7 provides
examples of this algorithm for the proximity matrix 𝒟3 (Example 3.4).
The n × n proximity matrix is 𝒟 = [d(i, j)]. The clusterings are assigned
sequence numbers 0, 1, . . . , (n − 1) and L(k) is the level of the kth clustering.
A cluster with sequence number m is denoted (m) and the proximity between
clusters (r) and (s) is denoted d[(r), (s)].

JOHNSON'S ALGORITHM FOR SINGLE-LINK AND COMPLETE-LINK CLUSTERING

Step 1. Begin with the disjoint clustering having level L(0) = 0 and sequence
number m = 0.

Step 2. Find the least dissimilar pair of clusters in the current clustering,
say pair {(r), (s)}, according to

d[(r), (s)] = min {d[(i), (j)]}

where the minimum is over all pairs of clusters in the current clustering.

Step 3. Increment the sequence number: m ← m + 1. Merge clusters (r)
and (s) into a single cluster to form the next clustering m. Set the level of
this clustering to

L(m) = d[(r), (s)]

Step 4. Update the proximity matrix, 𝒟, by deleting the rows and columns
corresponding to clusters (r) and (s) and adding a row and column corresponding
to the newly formed cluster. The proximity between the new cluster,
denoted (r, s), and old cluster (k) is defined as follows. For the single-link
method,

d[(k), (r, s)] = min {d[(k), (r)], d[(k), (s)]}

For the complete-link method,

d[(k), (r, s)] = max {d[(k), (r)], d[(k), (s)]}

Step 5. If all objects are in one cluster, stop. Else, go to step 2.
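A compact Python sketch of this matrix-updating scheme (assumed helper names, not the book's software): the dissimilarity entry for a merged cluster is obtained with min for the single-link method and max for the complete-link method, exactly as in step 4. Object indices are 0-based.

```python
# Sketch of Johnson's matrix-updating algorithm ('single': min, 'complete': max).
def johnson(D, method='single'):
    combine = min if method == 'single' else max
    members = {i: (i,) for i in range(len(D))}                   # cluster id -> objects
    d = {(i, j): float(D[i][j]) for j in members for i in members if i < j}
    levels = []
    while len(members) > 1:
        r, s = min(d, key=d.get)                                 # step 2: closest pair (r < s)
        levels.append((members[r] + members[s], d[r, s]))        # step 3: merge at level L(m)
        for k in members:                                        # step 4: update the matrix
            if k not in (r, s):
                kr, ks = tuple(sorted((k, r))), tuple(sorted((k, s)))
                d[kr] = combine(d[kr], d[ks])
                del d[ks]
        del d[r, s]
        members[r] += members[s]
        del members[s]
    return levels

D3 = [[0.0, 2.3, 3.4, 1.2, 3.7],
      [2.3, 0.0, 2.6, 1.8, 4.6],
      [3.4, 2.6, 0.0, 4.2, 0.7],
      [1.2, 1.8, 4.2, 0.0, 4.4],
      [3.7, 4.6, 0.7, 4.4, 0.0]]
print(johnson(D3, 'single'))     # merges at levels 0.7, 1.2, 1.8, 2.6
print(johnson(D3, 'complete'))   # merges at levels 0.7, 1.2, 2.3, 4.6
```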

Figure 3.7 Examples of matrix updating algorithms for single-link and complete-link clusterings.

Anderberg (1973) discusses three computational approaches to implementing


the algorithm above, called the stored matrix, sorted matrix, and stored data ap-
proaches. The three approaches differ as to whether the pattern matrix or the
dissimilarity matrix is stored in random access memory or in auxiliary storage,
such as disk. Note that the dissimilarity matrix requires more storage than the
pattern matrix when n ≫ d. The stored matrix approach, where the entire dissimilarity
matrix is stored in random access memory, is fastest.

Example 3.5
The computational examples in Figure 3.7 demonstrate the construction of single-link and
complete-link hierarchies. This example demonstrates the qualitative differences between
the single-link and complete-link hierarchies for the two artificial data sets defined in Section
2.4. The first data set, called DATA1, consists of 100 patterns in a four-dimensional
pattern space generated so as to have four categories, or true clusters. Patterns 1 through
24 were generated from category 1, patterns 25 through 59 from category 2, patterns 60
through 80 from category 3, and patterns 81 through 100 were generated in category 4.
An eigenvector projection is given in Figure 2.9. The proximity measure is squared Euclidean
distance in the pattern space. Since the proximity measure is Euclidean distance and since
the data were generated to several decimal places on a computer, we feel safe in assuming
that no proximity ties exist, so the hierarchies are both unique (see Section 3.2.6). Figures
3.8 and 3.9 show the proximity dendrograms for the single-link and complete-link hierarchies,
respectively.

Figure 3.8 Single-link hierarchy for 100 clustered patterns in four dimensions.

Figure 3.9 Complete-link hierarchy for 100 clustered patterns in four dimensions.

This example demonstrates the difficulty in comparing two dendrograms and motivates
the development of methods for automatically isolating significant clusters that are presented
in Chapter 4. The complete-link dendrogram in Figure 3.9 can be cut at level 1.0 to
generate four clusters. These clusters recover the original four categories in the data perfectly.
The four-category structure is not at all apparent in the single-link hierarchical clustering
of Figure 3.8.

Clustering methods have the nasty habit of creating clusters in data even when no
natural clusters exist, so hierarchies and clusterings must be viewed with extreme suspicion.
Figures 3.10 and 3.11 demonstrate this statement on the two hierarchies for a data set,
called DATA2, consisting of 100 points uniformly distributed over a unit hypercube in
six dimensions (see Section 2.4). The patterns are positioned at random, so it is barely
possible that they have arranged themselves into meaningful clusters; however, it is unlikely
that real clusters exist, especially considering the two-dimensional projections in Figure
2.11. We thus interpret Figures 3.10 and 3.11 as hierarchies in which no true clusters
exist. The single-link dendrogram in Figure 3.10 exhibits the chaining that is characteristic
of single-link hierarchies. This chaining can occur even when valid clusters exist, as in
Figure 3.8. The complete-link hierarchy in Figure 3.11 suggests some meaningful clusters;
it looks more clustered than the single-link hierarchy, and this is the lure of complete-link
clustering. It tends to produce dendrograms that form small clusters which combine nicely
into larger clusters even when such a hierarchy is not warranted, as with random data.
This example should demonstrate the difficulties inherent in letting the human eye scan
over the dendrogram to pick out believable clusters and clusterings.
Figure 3.10 Single-link hierarchy for 100 random patterns in six dimensions.
3.2.6 Ties in Proximity
The computational complexity of competing algorithms for implementing a particular
clustering method and the availability of software should determine which algorithm is
appropriate for a given application. The problem of choosing between the single-link
and complete-link methods is much more difficult than choosing an algorithm for
one of the methods. No list of characteristics exists that lets us choose between the
two methods in a calm, rational manner. Some theoretical and practical information
about the two methods is summarized in this section, especially the effects of ties in
the proximity matrix.
The single-link and complete-link methods differ in many respects, such as in
the structures recovered and the procedures used. The two methods produce
the same clusterings when the proximity matrix satisfies the ultrametric inequality, as
discussed in Section 3.2.3. This section demonstrates that the two methods differ in the
way they treat ties in the proximity matrix. Up to now, we have assumed that the
proximity matrix contains no ties, so that two new clusters are never formed at the
same level and the algorithms defined thus far produce unique dendrograms.

Figure 3.11 Complete-link hierarchy for 100 random patterns in six dimensions.

A tie implies that two or more edges are added to the proximity graph at once
and that the minimum and maximum functions required in matrix updating are not
unique. Jardine and Sibson (1971) showed that the single-link method does not
suffer from ambiguities due to ties because it has a continuity property. If the
ties are broken in the proximities by adding or subtracting a small amount from
the tied proximities, the resulting single-link dendrograms will merge smoothly
into the same dendrogram as the added amount tends to zero, no matter how the
ties are broken. This statement applies to all single-link algorithms as long as the
rank orders of the proximities are not changed by the added amounts. By contrast,
several complete-link dendrograms can be obtained by breaking ties in this way,
as demonstrated in Figure 3.12.
Figure 3.12(a) shows the first three threshold graphs for the given proximity
matrix. Two edges are added at once in G(3). The single-link hierarchy is the
same whether edge (2, 3) is inserted first or edge (3, 4) is inserted first. The
proximity dendrogram for the single-link method in Figure 3.12(b) is unique even
though more than one cluster can be formed at the same level. Algorithms based
on the MST or on matrix updating produce the same results. The situation is
very different with complete-link clustering. Figure 3.12(b) shows the hierarchy
defined by adding edge (2, 3) first and that formed when adding edge (3, 4) first.

Figure 3.12 Effects of ties in proximity on single-link and complete-link clustering: (a) proximity matrix and threshold graphs; (b) proximity dendrograms; (c) altered proximity matrix and dendrograms.

The two clustering structures are very different. This effect can also be
observed with the matrix-updating algorithm and with Hubert's algorithm (Section
3.2.1). Adding the two edges (2, 3) and (3, 4) simultaneously does not solve the
problem because the resulting four-edge graph is not a complete graph. In fact,
the next complete graph would be the one on all five nodes and the hierarchical
clustering would have only three levels.
Figure 3.12(c) emphasizes the seriousness of ties. The given proximity
matrix differs from that in Figure 3.12(a) in only two entries; the (3, 4) and (4, 5)
entries are interchanged, as might occur through a typing error when entering
data. In this case a unique complete-link hierarchy is obtained because the two
edges with the same proximity can be added in arbitrary order. However, the
hierarchy has two clusters forming at level 3. The single-link hierarchy is also
shown. The single- and complete-link dendrograms in Figure 3.12(c) resemble
one another much more closely than do those in Figure 3.12(b), which might
lead one to believe that the proximity matrix in Figure 3.12(c) had a good hierarchical
structure, whereas that in Figure 3.12(a) has a poor hierarchical structure. This
example raises the issue of sensitivity. It appears that the hierarchical structure
can change dramatically with small changes in the rank orders of the proximities.
The havoc that ties can create in a complete-link hierarchy has been noted
by several researchers. Sibson (1971) and Williams et al. (1971) argue against
the complete-link method as a feasible clustering procedure (see also Hubert,
1974a). The practical problem of ties is subtle. Software packages do not typically
check for ties. The order in which an edge is added from a set of edges with the
same proximity is at the whim of the programmer. The program will generate
only one complete-link clustering, even though a number of clusterings might be
equally justifiable. This problem is compounded when the proximity matrix contains
several ties. The comparative studies in Section 3.5.2 suggest that the complete-
link method produces more useful hierarchies in many applications than does the
single-link method, even though proximity ties make it ambiguous.

3.2.7 General Matrix Updating Algorithms and


Monotonicity

This section generalizes the algorithms in Section 3.2.5 and discusses issues in
the computation and application of these algorithms. Questions of the validity of
cluster structures are taken up in Chapter 4. The general paradigm for expressing
SAHN (Sequential, Agglomerative, Hierarchical, Nonoverlapping) clustering meth-
ods is given in Section 3.2.5. Step 4 of that algorithm specifies how the dissimilarity
matrix is to be updated by defining the formula for the dissimilarity between a
newly formed cluster, (r, s), and an existing cluster, (k) with nk objects. The
single-link and complete-link algorithms use the minimum and maximum, respec-
tively, of the dissimilarities between the pairs {(k), (r)} and 1(k), (s)}. Other clustering
methods can be defined by specifying different combinations of the distances in-
volved. A general formula for step 4 that includes most of the commonly referenced
hierarchical clustering methods is given below:

d[(k), (r, s)] = αr d[(k), (r)] + αs d[(k), (s)] + β d[(r), (s)] + γ |d[(k), (r)] − d[(k), (s)]|
This formula was first proposed by Lance and Williams (1967). Table 3.1
shows the parameter values for the most common algorithms. This table is also
given in Milligan (1979) and in Day and Edelsbrunner (1984).
The acronym "PGM" refers to the "pair group method"; the prefixes "U"
and "W" refer to unweighted and weighted, respectively. An "unweighted" method
80 Clustering Methods and Algorithms Chap.
TABLE 3.1 Coefficient Values for SAHN Matrix Updating
Algorithms

Clustering Method

0
—112
Single-link 1/2 112
1/2
Complete-link 112 112 0
n„ Ili 0 0
UPGMA (group average)
nr n , ll. + 11,
0
0
WPGMA (weighted average) 1/2 30 —
nt-ns
n, n, 0
(tr,120
UPGMC (unweighted centroid) n, + n, n, + P15
—1/4 0
WPGMC (weighted centroid) 1/2 1/2 —nk
n, + nk n, + nk 0
11, ± n, + nk
Ward's method (minimum variance) n, + n, + nk n, + ny + nA

treats each object in a cluster equally, regardless of the structure of the dendrogram.
A "weighted" method weights all clusters the same, so objects in small clusters
are weighted more heavily than objects in large clusters. The suffixes "A" and
"C" refer to "arithmetic averages" and "centroids." Thus "UPGMA" stands
for "unweighted pair group method using arithmetic averages" and "WPGMC"
refers to "weighted pair group method using centroids." Rohlf (1970) and Sneath
and Sokal (1973) have used this terminology. The UPGMC method has also been
called, simply, the centroid method, while the WPGMC method has been called
the median method (see Lance and Williams, 1967).

Sneath and Sokal (1973) provide a good discussion of the backgrounds of


these methods and define other SAHN algorithms. Arithmetic averaging attempts
to avoid the extremes of the single-link and complete-link methods. When measuring
the dissimilarity between an existing cluster and a prospective cluster, the single-
link method finds the closest pair of objects in the two clusters, the complete-
link method finds the most distant pair, and the UPGMA and WPGMA methods
use arithmetic averages of the dissimilarities. The arithmetic averaging methods
have no simple geometric interpretation. In contrast, the UPGMC and WPGMC
methods have direct geometric interpretations when the objects are represented as
patterns in a d-dimensional space. The centroid methods assess the dissimilarity
between two clusters by the distance between centroids. The UPGMC method
measures distance in terms of the centroid computed from all patterns in each
cluster. The WPGMC method computes centroids from the centroids of the two
clusters that merge to form a new cluster. The UPGMA weights the contribution
of each pattern equally by taking into account the sizes of the clusters, while the
WPGMC weights the patterns in small clusters more heavily than the patterns in
large clusters. Centroid methods should only be used when the objects are represented
as patterns and the proximity measure is squared Euclidean distance. An important
distinction between centroid methods and other SAHN algorithms is in monotonicity,
as explained later in this section.
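The geometric interpretation of the centroid methods can be checked numerically. The sketch below (our own code; the random patterns are illustrative) verifies that, with squared Euclidean dissimilarities between centroids, the UPGMC coefficients of Table 3.1 reproduce the squared distance from an outside cluster to the centroid of the merged cluster.

# Numerical check (illustrative, not from the text): with squared Euclidean
# dissimilarities, the UPGMC update equals the squared distance from cluster
# (k) to the centroid of the merged cluster (r, s).
import numpy as np

rng = np.random.default_rng(0)
r = rng.normal(size=(3, 2))   # patterns in cluster (r)
s = rng.normal(size=(5, 2))   # patterns in cluster (s)
k = rng.normal(size=(4, 2))   # patterns in cluster (k)

def sqdist(a, b):
    return float(np.sum((a - b) ** 2))

cr, cs, ck = r.mean(axis=0), s.mean(axis=0), k.mean(axis=0)
ct = np.vstack([r, s]).mean(axis=0)           # centroid of merged cluster (r, s)

nr, ns = len(r), len(s)
upgmc = (nr / (nr + ns)) * sqdist(ck, cr) \
      + (ns / (nr + ns)) * sqdist(ck, cs) \
      - (nr * ns / (nr + ns) ** 2) * sqdist(cr, cs)

print(np.isclose(upgmc, sqdist(ck, ct)))      # True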

Example 3.6
Dendrograms for the seven algorithms in Table 3.1 are drawn in Figure 3.13. The six
objects involved are the six pattern vectors defined below in a three-dimensional space.

x1 = (1.0  2.0  2.0)^T        x4 = (3.0  4.0  3.0)^T
x2 = (2.0  1.0  2.0)^T        x5 = (0.0  3.5  3.5)^T
x3 = (0.0  1.0  3.0)^T        x6 = (2.0  2.5  2.5)^T

Under a squared Euclidean distance measure of dissimilarity, the proximity matrix is given below.

       1      2      3      4      5      6
1      0    2.0    3.0    9.0    5.5    1.5
2             0    5.0   11.0   12.5    2.5
3                    0   18.0    6.5    6.5
4                           0    9.5    3.5
5                                  0    6.0
6                                         0
The tie in proximity between pairs of patterns (x3, x5) and (x3, x6) causes no ambiguity
in any of the dendrograms. The dendrograms for Ward's method, the two arithmetic average
methods, and the two centroid methods all have the same topology and differ only in
levels. They suggest that x4 is an outlier because it joins the cluster of the other five
patterns last and the gap between the formation of the five-pattern cluster and the singleton
cluster is large. The single-link dendrogram has much the same topology, except that x5
now appears to be the outlier. The complete-link dendrogram establishes cluster (x4, x5).
All dendrograms agree that (x1, x2, x6) is a strong cluster. Quantitative measures of the
strength and quality of clusters and clusterings, defined in Chapter 4, should help answer
such questions as: If one of the dendrograms were to be cut to define a partition, where is
the best cutting level? Is (x1, x2, x6, x3) a good cluster?
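The proximity matrix of this example is easy to reproduce. The sketch below (our own code, assuming NumPy and SciPy are available) computes the squared Euclidean dissimilarities and runs the single-link and complete-link methods on them; SciPy's centroid, median, and Ward options expect unsquared Euclidean input, so they are omitted here.

# Reproducing the proximity matrix of Example 3.6 (a sketch using NumPy/SciPy;
# the variable names are ours, not the text's).
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

X = np.array([[1.0, 2.0, 2.0],
              [2.0, 1.0, 2.0],
              [0.0, 1.0, 3.0],
              [3.0, 4.0, 3.0],
              [0.0, 3.5, 3.5],
              [2.0, 2.5, 2.5]])

d = pdist(X, metric="sqeuclidean")      # squared Euclidean dissimilarities
print(squareform(d))                    # the 6 x 6 proximity matrix above

# Single-link and complete-link operate directly on the given dissimilarities,
# so their merge levels are the levels plotted in the corresponding dendrograms
# of Figure 3.13.
print(linkage(d, method="single"))
print(linkage(d, method="complete"))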

Several of the comparative studies discussed in Section 3.5.2 conclude that


Ward's method (Ward, 1963), also called the minimum variance method, outper-
forms other hierarchical clustering methods. This method is based on notions of
square-error popularized in analysis-of-variance and other statistical procedures
(Wilks, 1963; Cooley and Lohnes, 1971). Square-error criteria are also used in
partitional clustering algorithms (Section 3.3.1). Ward's method is implemented
by the standard algorithm using the constants in Table 3.1. These constants are
derived below to see how square error is minimized.
Suppose that a clustering has been achieved with Ward's method and that
the next clustering in the hierarchy is to be obtained with the matrix updating
algorithm. Ward's method is designed for the situation when the data appear as
patterns. Thus we begin with a set of n patterns in a d-dimensional space. Let
x_ij^(k) be the value for feature j of pattern i when pattern i is in cluster k, for i from
1 to n_k and j from 1 to d. The centroid of cluster k, denoted m^(k) = [m_j^(k)], is the
cluster center, or the average of the n_k patterns in cluster k.

[Figure 3.13 panels: dendrograms for the single-link, complete-link, UPGMA, WPGMA, UPGMC, WPGMC, and Ward's methods applied to the six patterns of Example 3.6.]
Figure 3.13 Examples of dendrograms for matrix updating algorithms.

m_j^(k) = (1/n_k) Σ_{i=1}^{n_k} x_ij^(k)

The square-error for cluster k is the sum of squared distances to the centroid for
all patterns in cluster k:

e_k^2 = Σ_{i=1}^{n_k} Σ_{j=1}^{d} (x_ij^(k) − m_j^(k))^2

The square-error for the entire clustering, which contains K clusters, is the sum
of the square-errors for the individual clusters.
E_K^2 = Σ_{k=1}^{K} e_k^2
Ward's method merges the pair of clusters that minimizes ΔE_pq^2, the change in
E_K^2 caused by the merger of clusters p and q into cluster t to form the next clustering.

Since the square-errors for all clusters except for the three clusters involved remain
the same,
ΔE_pq^2 = e_t^2 − e_p^2 − e_q^2
After a bit of algebra, we find that the change in square-error depends only
on the centroids.
ΔE_pq^2 = [n_p n_q / (n_p + n_q)] Σ_{j=1}^{d} (m_j^(p) − m_j^(q))^2
The clusters p and q selected for merger are the clusters that minimize this
quantity. The square-error must increase as the number of clusters decreases, but
the increase is as small as possible in Ward's method. Once clusters p and q are
merged into cluster t, the proximity between all other clusters and the new cluster
t must be updated. Letting cluster r represent a cluster other than p, q, or t, the
following formula can be applied to find d[(r), (t)]:

d[(r), (t)] = [(n_r + n_p)/(n_r + n_p + n_q)] d[(r), (p)] + [(n_r + n_q)/(n_r + n_p + n_q)] d[(r), (q)] − [n_r/(n_r + n_p + n_q)] d[(p), (q)]
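A quick numerical check of the derivation is given below (our own code; the two random clusters are illustrative). It confirms that merging clusters p and q increases the total square-error by exactly n_p n_q / (n_p + n_q) times the squared distance between their centroids.

# A numerical check (illustrative, not from the text) of the square-error
# increase for Ward's method: merging clusters p and q increases the total
# square-error by  n_p * n_q / (n_p + n_q) * ||m_p - m_q||^2.
import numpy as np

rng = np.random.default_rng(1)
p = rng.normal(size=(4, 3))              # patterns in cluster p
q = rng.normal(loc=2.0, size=(6, 3))     # patterns in cluster q

def square_error(patterns):
    m = patterns.mean(axis=0)
    return float(np.sum((patterns - m) ** 2))

e_p, e_q = square_error(p), square_error(q)
e_t = square_error(np.vstack([p, q]))    # square-error after the merger

n_p, n_q = len(p), len(q)
m_p, m_q = p.mean(axis=0), q.mean(axis=0)
delta = n_p * n_q / (n_p + n_q) * float(np.sum((m_p - m_q) ** 2))

print(np.isclose(e_t - e_p - e_q, delta))   # True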
The choice of a suitable hierarchical clustering method is an important matter
in applications, but theory provides few guidelines for optimizing the choice.
Square-error is a familiar criterion in engineering, so one might feel comfortable
with a procedure that minimizes square-error, such as Ward's method. However,
the objective of cluster analysis is to investigate the structure of the data, so the
imposition of an a priori criterion, such as square-error, might not be appropriate.
All data do not occur as patterns, so we cannot limit our thinking to geometrical
constructs. Section 3.5.2 reviews several empirical studies that compare hierarchical
clustering methods and that guide the choice of a clustering method.

3.2.8 Crossovers and Monotonicity in Dendrograms

Section 3.2.2 defined perfect hierarchical structure as a proximity matrix
that satisfies the ultrametric inequality. The rationale was that the single-link and
complete-link methods produced the same dendrograms for an ultrametric proximity
matrix, and since these two methods search for very different types of structure,
the fact that they exhibit the same exact structure is meaningful. It is clear from
the single-link and complete-link algorithms based on threshold and proximity
graphs that these methods are monotonic. That is, the level at which the next
cluster forms is always larger, on a dissimilarity scale, than the level of the current
clustering. Monotone methods induce ultrametric cophenetic matrices. Monotonicity
can be expressed in mathematical terms by referring to the matrix updating formula
in Section 3.2.7. If a clustering method merges clusters (r) and (s) into cluster
(r, s), monotonicity demands that
d[(k), (r, s)] ≥ d[(r), (s)]

[Figure 3.14 shows a small dissimilarity matrix on four objects and the dendrograms produced by the single-link, complete-link, UPGMC, and WPGMC methods; the two centroid dendrograms exhibit crossovers.]

Figure 3.14 Examples of crossover or reversal in dendrograms.

for all clusters (k) distinct from (r) and (s). That is, no dissimilarity in the updated
matrix can be smaller than the smallest entry in the previous matrix. Another way
of saying this is that the cophenetic matrix generated by a monotone clustering
method satisfies the ultrametric inequality.

What can be said about the monotonicity of SAHN algorithms expressed through
matrix updating, especially those defined by the matrix updating algorithm in Table
3.1? Figure 3.14 provides a simple example of a dissimilarity matrix and the dendrograms
generated by the single-link, complete-link, UPGMC, and WPGMC methods. The
dendrograms from the centroid methods (UPGMC and WPGMC) are not monotone and
exhibit what is called a "crossover" or a "reversal," since clusters (x1, x2) and (x3, x4)
merge at a level lower than the level at which (x3, x4) is first defined.

Monotonicity is clearly a property of the clustering method and has nothing to do
with the proximity matrix. The advantage of the matrix updating formula is that the
monotonicity of any SAHN algorithm that can be expressed in terms of this updating
formula can be predicted from the coefficients. Assuming that α_r ≥ 0 and α_s ≥ 0,
Milligan (1979) provided the following results. The matrix updating formula for step 4
of the SAHN algorithm is repeated below for easy reference. Clusters (r) and (s) are
being merged into cluster (r, s) and the dissimilarity between distinct cluster (k) and
the newly formed cluster is being established.

d[(k), (r, s)] = α_r d[(k), (r)] + α_s d[(k), (s)] + β d[(r), (s)] + γ |d[(k), (r)] − d[(k), (s)]|

Result 1. If α_r + α_s + β ≥ 1 and γ ≥ 0, the clustering method is monotone.

This result is easily demonstrated. The first inequality can be rewritten as

β ≥ 1 − α_r − α_s

and substituting in the matrix updating formula shows that

d[(k), (r, s)] ≥ d[(r), (s)] + α_r {d[(k), (r)] − d[(r), (s)]}
                + α_s {d[(k), (s)] − d[(r), (s)]} + γ |d[(k), (r)] − d[(k), (s)]|
SAHN algorithms require that d[(r), (s)] be no greater than either d[(k), (r)] or
d[(k), (s)] for any distinct (k). Thus the condition that γ be nonnegative implies
that

d[(k), (r, s)] ≥ d[(r), (s)]

for all clusters (k) other than (r) and (s), which implies monotonicity.

Result 2. If α_r + α_s + β ≥ 1 and 0 > γ ≥ max{−α_r, −α_s}, the clustering
method is monotone.

To demonstrate this result, first consider the case when d[(k), (r)] >
d[(k), (s)]. Using the first inequality and recalling that γ is negative, the matrix
updating equation can be rewritten as

d[(k), (r, s)] ≥ d[(r), (s)] + (α_r − |γ|){d[(k), (r)] − d[(r), (s)]}
                + (α_s + |γ|){d[(k), (s)] − d[(r), (s)]}

Since α_r ≥ |γ|, the second term on the right is nonnegative for the case under
consideration. The last term on the right is nonnegative because of the way SAHN
algorithms are defined. Thus

d[(k), (r, s)] ≥ d[(r), (s)]


as in Result 1, and the clustering method is monotone. The case when d[(k), (s)]
> d[(k), (r)] can be proved in a similar fashion.
The inequality in Result 1 is not satisfied for either of the centroid methods
(UPGMC and WPGMC). It is easy to create examples for which these methods
do not produce monotone hierarchies, as indicated in Figure 3.14. However, all
other methods in Table 3.1 are monotone. Note that only single-link and complete-
link clusterings are invariant under monotone transformation of the dissimilarities.
Figure 3.13 demonstrates that nonmonotone clustering methods do not neces-
sarily produce crossovers.
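Results 1 and 2 are easy to apply mechanically. The sketch below (our own code; the cluster sizes are arbitrary example values) tests the sufficient conditions against the Table 3.1 coefficients and flags only the two centroid methods.

# Applying Results 1 and 2 to the Table 3.1 coefficients (a sketch; the cluster
# sizes n_r = 3, n_s = 2, n_k = 4 are arbitrary example values).
def is_monotone(ar, as_, b, g):
    if ar + as_ + b >= 1 and g >= 0:                     # Result 1
        return True
    if ar + as_ + b >= 1 and 0 > g >= max(-ar, -as_):    # Result 2
        return True
    return False                                          # conditions are sufficient only

nr, ns, nk = 3, 2, 4
coeffs = {
    "single":   (0.5, 0.5, 0.0, -0.5),
    "complete": (0.5, 0.5, 0.0, 0.5),
    "UPGMA":    (nr/(nr+ns), ns/(nr+ns), 0.0, 0.0),
    "WPGMA":    (0.5, 0.5, 0.0, 0.0),
    "UPGMC":    (nr/(nr+ns), ns/(nr+ns), -nr*ns/(nr+ns)**2, 0.0),
    "WPGMC":    (0.5, 0.5, -0.25, 0.0),
    "Ward":     ((nr+nk)/(nr+ns+nk), (ns+nk)/(nr+ns+nk), -nk/(nr+ns+nk), 0.0),
}
for name, c in coeffs.items():
    print(name, is_monotone(*c))   # only UPGMC and WPGMC fail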


One is tempted to reject nonmonotone methods out of hand.

Williams and Lance (1977) call them obsolete. Anderberg (1973) claims that they
lack a useful interpretation for general proximities. Sneath and Sokal (1973) claim
that "the frequency of reversals and the relatively high degree of its distortion of
the original [proximity] matrix has led to the abandonment of this [UPGMC] tech-
nique." On the other hand, the performances of nonmonotone methods in several
comparative studies of clustering methods discussed in Section 3.5.2 do not suggest
that such methods be abandoned. Williams et al. (1971) argue that monotonicity
is not essential for the proper performance of hierarchical clustering.

3.2.9 Clustering Methods Based on Graph Theory

The statements of the single-link and complete-link algorithms in terms of


graph theory in Section 3.2.1 suggest that properties other than connectedness
and completeness can be used to define clustering methods. The idea is to watch
the sequence of threshold graphs, or proximity graphs, for the appearance of a
suitable property. Hubert (1974a) suggests the following expression of algorithms
that define hierarchical clustering methods. Ties in the proximities can affect the
clusterings in unexpected ways, so we assume that no ties exist in the proximity
matrix.
New hierarchical clustering algorithms are formed by changing step 2 in
the algorithm of Section 3.2.1. The function Q_p(k) is defined as follows for all
pairs of clusters {C_mr, C_mt} in the clustering {C_m1, . . . , C_m(n_m)}:

Q_p(k)(r, t) = min {d(i, j) : the maximal subgraph of G[d(i, j)] defined by
C_mr ∪ C_mt is connected and either has property p(k) or is complete}

Following the algorithm, clusters C_mp and C_mq are merged to form the next
clustering in the sequence if

Q_p(k)(p, q) = min over all pairs (r, t) of Q_p(k)(r, t)
Some examples of property p are given below. Integer k is a parameter,
so, for example, p(k) could mean a node connectivity of k or a node degree of k.

Node connectivity. The node connectivity of a connected subgraph is the
largest integer n_n such that all pairs of nodes are joined by at least n_n paths
having no nodes in common.

Edge connectivity. The edge connectivity of a connected subgraph is the
largest integer n_e such that all pairs of nodes are joined by at least n_e paths
having no edges in common.

Node degree. The degree of a connected subgraph is the largest integer n_d
such that each node has at least n_d incident edges.

Diameter. The diameter of a connected subgraph is the maximum "distance"
between two nodes in the subgraph. The distance between two nodes is the
number of edges in the shortest path joining them.

Radius. The radius of a connected subgraph is the smallest integer n_r such
that at least one node is within distance n_r of all other nodes in the subgraph.

Specifying parameter k and property p defines a new clustering method.


Every cluster must at least be connected. Once all the edges have been inserted
into the subgraph, it is complete and no further properties can be applied. Certain
practical difficulties arise when trying to select a suitable property. Few guidelines
exist other than intuition and experience. Theorems from mathematics provide
some insight into these methods. For example, a node connectivity of k implies
an edge connectivity of k, but the reverse is not true. Similarly, an edge connectivity
of k implies a minimum degree of k, but the reverse does not hold. A compelling
reason must appear before one of these methods is used in place of the single-
link, complete-link, or other SAHN algorithms.

Example 3.7
Figure 3.15 demonstrates hierarchical clustering methods defined by graph properties. An
ordinal proximity matrix is given below on eight objects. Threshold graph G(13) is pictured
in Figure 3.15(a) to help in establishing the dendrograms for several methods, as well as
for the single-link and complete-link methods. Proximity dendrograms are shown in Figure
3.15(b)—(h). A simple way to find these hierarchies with pencil and paper is first to list
the pairs of objects in rank order by proximity. Then construct a sequence of threshold
graphs and find the first threshold graph at which a property is satisfied. It is important to
check the property only on the subgraph formed by the union of the subgraphs for two
existing clusters.
     1    2    3    4    5    6    7    8
1    0   13   21   18    4    8    7   28
2         0    9   19   15   14   10   16
3              0   22   20   12   11   17
4                   0    3   23   27    1
5                        0    5   24    2
6                             0    6   25
7                                  0   26
8                                       0
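The pencil-and-paper procedure described above can be mimicked with a small program. The sketch below (our own code, assuming the NetworkX library; the candidate cluster {4, 5, 8} is chosen only for illustration) builds the threshold graphs for this matrix and reports the first level at which the subgraph on a candidate cluster is connected and satisfies a chosen property.

# A sketch (not the text's algorithm verbatim) of checking graph properties on
# threshold graphs built from the ordinal matrix of Example 3.7 with NetworkX.
import networkx as nx

D = {   # upper triangle of the rank-order proximity matrix
    (1,2):13, (1,3):21, (1,4):18, (1,5):4,  (1,6):8,  (1,7):7,  (1,8):28,
    (2,3):9,  (2,4):19, (2,5):15, (2,6):14, (2,7):10, (2,8):16,
    (3,4):22, (3,5):20, (3,6):12, (3,7):11, (3,8):17,
    (4,5):3,  (4,6):23, (4,7):27, (4,8):1,
    (5,6):5,  (5,7):24, (5,8):2,
    (6,7):6,  (6,8):25,
    (7,8):26,
}

def threshold_graph(level):
    """Threshold graph G(level): one edge per pair with proximity <= level."""
    G = nx.Graph()
    G.add_nodes_from(range(1, 9))
    G.add_edges_from(pair for pair, v in D.items() if v <= level)
    return G

def first_level(nodes, prop):
    """Smallest level at which the subgraph on `nodes` is connected and has `prop`."""
    for level in sorted(D.values()):
        H = threshold_graph(level).subgraph(nodes)
        if nx.is_connected(H) and prop(H):
            return level
    return None

cluster = {4, 5, 8}   # an illustrative candidate cluster
print(first_level(cluster, lambda H: nx.node_connectivity(H) >= 2))  # 2-node connected
print(first_level(cluster, lambda H: nx.diameter(H) <= 2))           # 2-diameter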

Ling (1972) examined hierarchical clustering based on notions of connectivity


and compactness that are particularly appropriate for ordinal proximity matrices.
He assumed ordinal proximities with no ties, but his clustering method can also
be applied to interval and ratio proximity matrices. The properties p that Ling
proposed are defined below.
Consider a proximity graph on n nodes. A subgraph is r-connected if all
pairs of nodes in the subgraph are connected by r-chains. An r-chain between
two nodes is a sequence of nodes having d(i, j) ≤ r for all pairs (i, j) of nodes
in the sequence. A subgraph is (k, r)-bonded if every node in the subgraph is
directly connected to at least k nodes and if d(i, j) ≤ r for all k connections.
Finally, a subgraph is (k, r)-connected if it is both r-connected and (k, r)-bonded.

A subgraph becomes a (k, r)-cluster in the algorithm of Section 3.2.1 as


soon as it becomes (k, r)-connected. Note that (1, r) clusters are single-link clusters.
Ling (1972, 1973a) proposed the (k, r)-cluster as a way of identifying significant
clusters. Given any proximity graph, a (k, r)-cluster can be defined independent
of the hierarchical clustering algorithm. A subgraph is a (k, r)-cluster if r is the
smallest value of s for which the subgraph is (k, s)-connected and the
subgraph is not properly contained in any other (k, t)-connected subgraph for
t > r. Such clusters have several attractive mathematical properties and are intuitively
appealing since both connectedness and compactness are involved in the definition.
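Ling's definitions translate directly into a graph test. The sketch below (our own code, again assuming NetworkX; the three edge weights are the ranks of the pairs (4, 8), (5, 8), and (4, 5) from Example 3.7) checks whether a node set is (k, r)-connected by requiring the level-r subgraph to be connected with minimum degree at least k.

# A sketch of Ling's (k, r)-connectedness test with NetworkX (the example
# graph is built only from three pairs of Example 3.7 for illustration).
import networkx as nx

def is_k_r_connected(G, nodes, k, r):
    """True if the subgraph on `nodes`, keeping only edges with weight <= r,
    is connected (r-connected) and has minimum degree >= k ((k, r)-bonded)."""
    H = nx.Graph()
    H.add_nodes_from(nodes)
    H.add_edges_from((u, v) for u, v, w in G.edges(data="weight")
                     if u in nodes and v in nodes and w <= r)
    return nx.is_connected(H) and min(d for _, d in H.degree()) >= k

G = nx.Graph()
G.add_weighted_edges_from([(4, 8, 1), (5, 8, 2), (4, 5, 3)])
print(is_k_r_connected(G, {4, 5, 8}, k=2, r=3))   # True: every node has degree 2
print(is_k_r_connected(G, {4, 5, 8}, k=2, r=2))   # False: nodes 4 and 5 have degree 1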
We emphasize that no theory exists for choosing among the various properties
of graphs to select the "best" clustering method for a particular application. Section
3.5 provides some guidance, but familiarity with a method and confidence in the
results of previous applications of the method are the only practical ways of choosing
a method.

Figure 3.15 Examples of dendrograms from graph theory: (a) threshold graph G(13)
for proximity matrix in Example 3.7; (b) single-link; (c) complete-link; (d) 2-node con-
nected; (e) 2-edge connected; (f) 2-degree; (g) 2-diameter; (h) 2-radius.

Figure 3.15 (continued)

3.3 PARTITIONAL CLUSTERING
Hierarchical clustering techniques organize the data into a nested sequence of
groups. An important characteristic of hierarchical clustering methods is the visual
impact of the dendrogram, which enables a data analyst to see how objects are
being merged into clusters or split at successive levels of proximity. The data
analyst can then try to decide whether the entire dendrogram describes the data
or can select a clustering, at some fixed level of proximity, which makes sense
for the application in hand. We refer to nonhierarchical clustering methods as
partitional clustering methods. They generate a single partition of the data in an
attempt to recover natural groups present in the data. Both clustering strategies
have their appropriate domains of applications. Hierarchical clustering methods
generally require only the proximity matrix among the objects, whereas partitional
techniques expect the data in the form of a pattern matrix. It is generally assumed
that the features have been measured on a ratio scale.
Hierarchical techniques are popular in biological, social, and behavioral sci-
