How Many Clusters? An Information-Theoretic Perspective
Susanne Still
susanna@princeton.edu
William Bialek
wbialek@princeton.edu
Department of Physics and Lewis-Sigler Institute for Integrative Genomics,
Princeton University, Princeton, NJ 08544, U.S.A.
1 Introduction
Much of our intuition about the world around us involves the idea of clus-
tering: many different acoustic waveforms correspond to the same syllable,
many different images correspond to the same object, and so on. It is plausi-
ble that a mathematically precise notion of clustering in the space of sensory
data may approximate the problems solved by our brains. Clustering methods
also are used in many different scientific domains as a practical tool; the
question of how many clusters to use has a well-defined answer only
if there exists a true number of classes and if the data set is large enough to
allow us to resolve them.
The complexity of a clustering assignment P(c|x) is measured by the mutual
information between the data and the cluster labels,

I(c; x) = \sum_{x,c} P(c|x)\, P(x) \log_2 \frac{P(c|x)}{P(c)},    (2.1)
with the distortion playing the role of energy, and the normalization,

Z(x, T) = \sum_c P(c) \exp\left[ -\frac{1}{T}\, d(x, x_c) \right],    (2.5)

playing the role of a partition function (Rose, Gurewitz, & Fox, 1990). The
representatives, x_c, often simply called cluster centers, are determined by
the condition that all of the "forces" within each cluster balance for a test
point located at the cluster center,^2

\sum_x P(x|c)\, \frac{\partial}{\partial x_c} d(x, x_c) = 0.    (2.6)
^2 This condition is not independent of the original variational problem. Optimizing the
objective function with respect to x_c, we find \frac{\partial}{\partial x_c}\left[ \langle d(x, x_c)\rangle + T\, I(c;x) \right] = 0 \Leftrightarrow \frac{\partial}{\partial x_c}\langle d(x, x_c)\rangle =
P(c) \sum_x P(x|c)\, \frac{\partial}{\partial x_c} d(x, x_c) = 0, \forall c \Rightarrow equation 2.6.
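To make the annealed clustering picture concrete, the following is a minimal numerical
sketch of a fixed-temperature update, under assumptions that are ours rather than the
paper's: a squared-Euclidean distortion d(x, x_c) = ||x - x_c||^2, a uniform P(x) = 1/Nx,
and randomly chosen initial centers.

```python
import numpy as np

def da_cluster(X, n_clusters, T, n_iter=100, seed=0):
    """Fixed-temperature deterministic-annealing clustering (cf. eqs. 2.5-2.6).

    Assumes squared-Euclidean distortion, so the "force balance" of eq. 2.6
    makes each center the P(x|c)-weighted mean of the data.
    X : array (Nx, dim); all points taken equally likely, P(x) = 1/Nx.
    Returns (P_cx, centers) with P_cx of shape (n_clusters, Nx).
    """
    rng = np.random.default_rng(seed)
    Nx = X.shape[0]
    centers = X[rng.choice(Nx, n_clusters, replace=False)]
    P_c = np.full(n_clusters, 1.0 / n_clusters)
    eps = 1e-12
    for _ in range(n_iter):
        d = ((X[None, :, :] - centers[:, None, :]) ** 2).sum(-1)   # (Nc, Nx)
        # Boltzmann assignments, normalized by Z(x, T) as in eq. 2.5
        logits = np.log(P_c + eps)[:, None] - d / T
        logits -= logits.max(axis=0, keepdims=True)
        P_cx = np.exp(logits)
        P_cx /= P_cx.sum(axis=0, keepdims=True)
        P_c = P_cx.mean(axis=1)                   # P(c) = sum_x P(c|x) / Nx
        # eq. 2.6 for squared distortion: centers -> conditional means
        centers = (P_cx @ X) / (P_cx.sum(axis=1, keepdims=True) + eps)
    return P_cx, centers
```

Annealing then amounts to calling this routine at successively lower T, reusing the
previous solution as the starting point.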
For the information bottleneck (IB) problem, the effective distortion measure is the
Kullback-Leibler divergence,

D_{KL}\left[P(v|x)\,\|\,P(v|c)\right] = \sum_v P(v|x) \ln \frac{P(v|x)}{P(v|c)},    (2.9)

and the distributions that characterize each cluster are given by

P(v|c) = \frac{1}{P(c)} \sum_x P(v|x)\, P(c|x)\, P(x).    (2.10)
When we plot I(c; v) as a function of I(c; x), both evaluated at the optimum,
we obtain a curve similar to the rate distortion curve, the slope of which is
given by the trade-off between compression and preservation of relevant
information:
\frac{\delta I(c;v)}{\delta I(c;x)} = T.    (2.11)
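For the IB case, the analogous self-consistent update uses the KL distortion of
equation 2.9 and the cluster-conditional distributions of equation 2.10. The sketch
below is our own illustration; the function name, initialization, and fixed iteration
count are assumptions, not taken from the paper.

```python
import numpy as np

def ib_update(P_vx, P_x, n_clusters, T, n_iter=200, seed=0):
    """Self-consistent IB iteration at temperature T (uncorrected).

    P_vx : array (Kv, Nx), columns are the conditionals P(v|x)
    P_x  : array (Nx,), the marginal P(x)
    Returns the soft assignments P(c|x), shape (n_clusters, Nx).
    """
    rng = np.random.default_rng(seed)
    Nx = P_x.size
    eps = 1e-12
    P_cx = rng.random((n_clusters, Nx))            # random soft initialization
    P_cx /= P_cx.sum(axis=0, keepdims=True)
    for _ in range(n_iter):
        P_c = P_cx @ P_x                           # P(c) = sum_x P(c|x) P(x)
        P_vc = (P_vx * P_x) @ P_cx.T / (P_c + eps) # eq. 2.10, shape (Kv, Nc)
        # D_KL[P(v|x) || P(v|c)] for every (c, x), eq. 2.9 (natural log)
        logratio = np.log(P_vx + eps)[None, :, :] - np.log(P_vc.T[:, :, None] + eps)
        dkl = np.einsum('vx,cvx->cx', P_vx, logratio)
        # assignment rule: P(c|x) proportional to P(c) exp(-D_KL / T)
        logits = np.log(P_c + eps)[:, None] - dkl / T
        logits -= logits.max(axis=0, keepdims=True)
        P_cx = np.exp(logits)
        P_cx /= P_cx.sum(axis=0, keepdims=True)
    return P_cx
```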
We estimate the relevant conditional distribution from a finite sample, writing

\hat P(v|x) = P(v|x) + \delta P(v|x),    (3.1)

where we assume that \delta P(v|x) is some small perturbation and its average
over all possible realizations of the data is zero:

\langle \delta P(v|x) \rangle = 0.    (3.2)
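As a concrete (and entirely standard) illustration of where such a perturbation comes
from, not part of the original derivation: if, for each x, the variable v is observed
n(x) \approx N P(x) times and \hat P(v|x) is the empirical frequency, then

\hat P(v|x) = \frac{n(v,x)}{n(x)}, \qquad
\langle \hat P(v|x)\rangle = P(v|x) \;\Rightarrow\; \langle\delta P(v|x)\rangle = 0, \qquad
\langle (\delta P(v|x))^2\rangle = \frac{P(v|x)\,[1-P(v|x)]}{n(x)} \approx \frac{P(v|x)}{N P(x)},

the last approximation holding for P(v|x) \ll 1; this is the counting-statistics form used
in equation 3.11 below.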
Taylor expansion of I^{emp}(c;v) := I(c;v)|_{\hat P(v|x)} around P(v|x) leads to a
systematic error \langle\Delta I(c;v)\rangle, which, with the functional derivatives

\frac{\delta^n I(c;v)}{\prod_{k=1}^{n} \delta P(v|x^{(k)})} = \frac{(-1)^n (n-2)!}{\ln 2} \left[ \sum_c \frac{\prod_{k=1}^{n} P(c, x^{(k)})}{(P(c,v))^{n-1}} - \frac{\prod_{k=1}^{n} P(x^{(k)})}{(P(v))^{n-1}} \right],    (3.5)

is given by

\langle\Delta I(c;v)\rangle = \frac{1}{\ln 2} \sum_{n=2}^{\infty} \frac{(-1)^n}{n(n-1)} \sum_v \left\langle \sum_c \frac{\left(\sum_x \delta P(v|x)\, P(c,x)\right)^n}{(P(c,v))^{n-1}} - \frac{\left(\sum_x \delta P(v|x)\, P(x)\right)^n}{(P(v))^{n-1}} \right\rangle.    (3.6)
Note that the terms with n = 1 vanish because of equation 3.2 and that the
second term in the sum is constant with respect to P(c|x).
Our idea is to subtract this error from the objective function, equation 2.7,
and recompute the distribution that maximizes the corrected objective func-
tion,
\max_{P(c|x)} \left[ I^{emp}(c;v) - T\, I^{emp}(c;x) - \langle\Delta I(c;v)\rangle + \mu(x) \sum_c P(c|x) \right].    (3.7)
The last constraint ensures normalization, and the optimal assignment rule
P(c|x) is now given by
P(c|x) = \frac{P(c)}{Z(x,T)} \exp\left\{ -\frac{1}{T}\left[ D_{KL}\left[P(v|x)\,\|\,P(v|c)\right] - \sum_v \sum_{n=2}^{\infty} (-1)^n \left( P(v|x)\,\frac{\langle(\delta P(v|c))^n\rangle}{n\,(P(v|c))^n} - \frac{\langle \delta P(v|x)\,(\delta P(v|c))^{n-1}\rangle}{(n-1)\,(P(v|c))^{n-1}} \right) \right] \right\},    (3.8)
where we have made use of equation 2.10 and the approximation (for count-
ing statistics)
\langle \delta P(v|x)\, \delta P(v|x') \rangle \simeq \frac{P(v|x)}{N P(x)}\, \delta_{xx'}.    (3.11)
Can we say something about the shape of the resulting “corrected” op-
timal information curve by analyzing the leading-order error term, equa-
tion 3.10? This term is bounded from above by the value it assumes in the
deterministic limit (T → 0), in which assignments P(c|x) are either 1 or 0
and thus [P(c|x)]^2 = P(c|x),^4

\langle\Delta I(c;v)\rangle^{(2)}_{T\to 0} = \frac{1}{2\ln(2)}\,\frac{K_v}{N}\, N_c.    (3.12)
Kv is the number of bins we have used to obtain our estimate P̂(v|x). Note
that if we had adopted a continuous, rather than a discrete, treatment, then
the volume of the (finite) V-space would arise instead of Kv .5 If one does
not assume that P(x) is known, and instead calculates the bias by Taylor ex-
pansion of Iemp (c; v) around P(x, v) (see section 5), then the upper bound to
the leading order term (see equation 5.7) is given by Nc (Kv − 1)/(2N ln(2)).
This is the Nc -dependent part of the leading correction to the bias as de-
rived in Treves and Panzeri (1995, equation 2.11, term C1 ). Similar to what
these authors found when they computed higher-order corrections, we also
found in numerical experiments that the leading term is a surprisingly good
estimate of the total bias, and we therefore feel confident in approximating
the error by equation 3.10, although we cannot guarantee convergence of
the series in equation 3.6.^6
^4 Substitution of [P(c|x)]^2 = P(c|x) into equation 3.10 gives \langle\Delta I(c;v)\rangle^{(2)} =
\frac{1}{2\ln(2)N} \sum_{vc} \frac{\sum_x P(c|x)\, P(x|v)}{\sum_x P(c|x)\, P(x|v)} = \frac{1}{2\ln(2)N} \sum_{vc} 1 = \frac{1}{2\ln(2)N}\, K_v N_c.
^5 For choosing the number of bins, K_v, as a function of the data set size N, we refer to
the large body of literature on this problem, for example, Hall and Hannan (1988).
^6 For counting statistics (binomial distribution), we have

\langle(\delta P(v|x))^n\rangle = \frac{(P(v|x))^n}{N^{n-1}} \sum_{k=0}^{n} (-1)^{n-k} \frac{n!}{k!\,(n-k)!}\, \frac{N^k}{(P(v|x))^k}\, \frac{k!\, N!}{(N-l)!} \sum_{\{l_1 \cdots l_k\}} \prod_{q=1}^{k} \frac{p^{l_q}}{l_q!}\, (1-p)^{N-l},

where l = \sum_{q=1}^{k} l_q, the l_q are positive integers, and the sum over \{l_1 \cdots l_k\} runs over all
such partitions. Substituting this into equation 3.6, and considering only terms with
x^{(1)} = x^{(2)} = \cdots = x^{(n)}, we get

\frac{1}{\ln 2} \sum_{xvc} P(c,v) \sum_{k=1}^{\infty} \frac{(2k-1)!!}{2k(2k-1)} \left[ \frac{1}{N}\, \frac{P(x)}{(P(v))^2}\, \frac{(P(c|x))^2}{(P(c|v))^2} \left( P(v|x) - (P(v|x))^2 \right) \right]^k,

which is not guaranteed to converge.
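A quick way to get a feeling for the size of this bias is a Monte Carlo toy experiment
(our own illustration, not from the paper): draw hard cluster labels c and a relevant
variable v independently, so the true I(c;v) is zero and the plug-in estimate is pure
bias, and compare it with the hard-clustering bound Nc Kv / (2 ln(2) N) of equation 3.12
and with the classical Miller-Madow-style term (Nc - 1)(Kv - 1)/(2 ln(2) N).

```python
import numpy as np

def mutual_information(counts):
    """Plug-in mutual information in bits from a joint count table."""
    P = counts / counts.sum()
    px = P.sum(axis=1, keepdims=True)
    py = P.sum(axis=0, keepdims=True)
    nz = P > 0
    return float((P[nz] * np.log2(P[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(1)
Nc, Kv, N, trials = 5, 100, 10_000, 200
bound_312 = Nc * Kv / (2 * np.log(2) * N)                # eq. 3.12
miller_madow = (Nc - 1) * (Kv - 1) / (2 * np.log(2) * N)  # classical bias term
biases = []
for _ in range(trials):
    c = rng.integers(Nc, size=N)      # hard cluster labels
    v = rng.integers(Kv, size=N)      # relevant variable, independent of c
    counts = np.zeros((Nc, Kv))
    np.add.at(counts, (c, v), 1.0)
    biases.append(mutual_information(counts))
print(f"measured bias        {np.mean(biases):.4f} bits")
print(f"bound of eq. 3.12    {bound_312:.4f} bits")
print(f"Miller-Madow term    {miller_madow:.4f} bits")
```

With these (hypothetical) values the measured bias is a few hundredths of a bit, of the
same order as both analytic estimates, with equation 3.12 lying above it as a bound should.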
The lower bound of the leading-order error, equation 3.10, is given by^7

\frac{1}{2\ln(2)}\,\frac{1}{N}\, 2^{I(c;x)},    (3.13)

so that the corrected relevant information is bounded from above by

I^{corr}_{UB}(c;v) = I^{emp}(c;v) - \frac{1}{2\ln(2)}\,\frac{1}{N}\, 2^{I(c;x)}.    (3.15)
The slope of this upper bound is T - 2^{I(c;x)}/2N (using equation 2.11), and
there is a maximum at

T^*_{UB} = \frac{1}{2N}\, 2^{I(c;x)}.    (3.16)
If the hard clustering solution assigns equal numbers of data to each cluster,
then the upper bound on the error, equation 3.12, can be rewritten as
\frac{1}{2\ln(2)}\,\frac{K_v}{N}\, 2^{I(c;x)},    (3.17)

so that the corresponding lower bound on the corrected relevant information,

I^{corr}_{LB}(c;v) = I^{emp}(c;v) - \frac{1}{2\ln(2)}\,\frac{K_v}{N}\, 2^{I(c;x)},    (3.18)
has a maximum at
T^*_{LB} = \frac{K_v}{2N}\, 2^{I(c;x)}.    (3.19)
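As a rough numerical illustration with hypothetical values (not taken from the experiments
below): for N = 10^4 samples, K_v = 100 bins, and a hard clustering into N_c = 5 equally
populated clusters, so that I(c;x) = \log_2 5 \approx 2.32 bits,

T^*_{UB} = \frac{2^{I(c;x)}}{2N} = \frac{5}{2 \times 10^4} = 2.5 \times 10^{-4}, \qquad
T^*_{LB} = \frac{K_v\, 2^{I(c;x)}}{2N} = 2.5 \times 10^{-2},

so the two bounds locate the end point of the useful annealing range within a factor of K_v
of each other.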
Since both upper and lower bound coincide at the end point of the curve,
where T → 0 (see Figure 1), the actual corrected information curve must
have a maximum at

T^* = \frac{\gamma}{2N}\, 2^{I(c;x)},    (3.20)

for some γ between 1 and Kv.
^7 Proof: \sum_{xvc} \frac{[P(c|x)]^2\, P(x|v)}{P(c|v)} = \sum_{xc} P(x,c)\,\frac{P(c|x)}{P(c)} \sum_v \frac{P(v|x)}{P(v|c)} > \sum_{xc} P(x,c)\,\frac{P(c|x)}{P(c)} = \sum_{xc} P(x,c)\, 2^{\log_2\left[\frac{P(c|x)}{P(c)}\right]} \ge 2^{I(c;x)}.
[Figure 1 shows I^{emp}(c;v), I^{corr}_{UB}(c;v), and I^{corr}_{LB}(c;v) plotted against I(c;x); the maxima of the two bounds are marked by x.]
Figure 1: Sketch of the lower and upper bound on the corrected information
curve, which both have a maximum under some conditions (see equations 3.16
and 3.19), indicated by x, compared to the empirical information curve, which
is monotonically increasing.
In the hard clustering limit, the corrected relevant information becomes

I^{corr}_{T\to 0}(c;v) = I^{emp}_{T\to 0}(c;v) - N_c\, \frac{K_v}{2\ln(2)N},    (3.21)

where I^{emp}_{T→0}(c;v) is calculated by fixing the number of clusters and cooling
the temperature to obtain a hard clustering solution. While I^{emp}_{T→0}(c;v) in-
creases monotonically with Nc, we expect I^{corr}_{T→0}(c;v) to have a maximum (or
at least a plateau) at Nc∗, as we have argued above. Nc∗ is then the optimal
number of clusters in the sense that, using more clusters, we would not cap-
ture more meaningful structure (or, in other words, would overfit the data),
and although in principle we could always use fewer clusters, this comes
at the cost of keeping less relevant information I(c; v).
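In practice, selecting Nc∗ from equation 3.21 is a one-line correction followed by an
argmax. A minimal sketch follows; the numerical values in the example are invented
purely for illustration.

```python
import numpy as np

def corrected_relevant_info(I_emp, Nc, N, Kv):
    """Hard-clustering correction of eq. 3.21: I_emp - Nc * Kv / (2 ln2 N)."""
    return I_emp - Nc * Kv / (2 * np.log(2) * N)

def choose_n_clusters(I_emp_by_Nc, N, Kv):
    """Pick Nc* as the maximum of the corrected curve.

    I_emp_by_Nc : dict {Nc: empirical I_{T->0}(c;v) in bits}, e.g. obtained
                  by annealing an IB solution to T -> 0 for each Nc.
    """
    corrected = {Nc: corrected_relevant_info(I, Nc, N, Kv)
                 for Nc, I in I_emp_by_Nc.items()}
    Nc_star = max(corrected, key=corrected.get)
    return Nc_star, corrected

# Hypothetical empirical values, for illustration only:
example = {2: 0.17, 3: 0.20, 4: 0.215, 5: 0.224, 6: 0.227, 7: 0.229}
print(choose_n_clusters(example, N=10_000, Kv=100))   # corrected curve peaks at Nc* = 5
```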
4 Numerical Results
4.1 Simple Synthetic Test Data. We test our method for finding Nc∗ on
data that we understand well and where we know what the answer should
be. We thus created synthetic data drawn from normal distributions with
zero mean and five different variances (for Figures 2 and 3).^9 We emphasize
that we chose an example with gaussian distributions not because any of
our analysis makes use of gaussian assumptions, but rather because in the
gaussian case, we have a clear intuition about the similarity of different
distributions and hence about the difficulty of the clustering task. This will
become important later, when we make the discrimination task harder (see
Figure 6). We use Kv = 100 bins to estimate P̂(v|x). In Figures 2 and 3, we
compare how I^{emp}_{T→0}(c;v) and I^{corr}_{T→0}(c;v) behave as a function of the number
of clusters. The number of observations of v, given x, is Nv = N/Nx, and
the average number of observations per bin is given by Nv/Kv. Figure 3
shows the average I^{corr}_{T→0}(c;v), computed from 31 different realizations of
the data.^{10} All of the 31 individual curves have a maximum at the true
number of clusters, Nc∗ = 5, for Nv/Kv ≥ 2. They are offset with respect to
each other, which is the source of the error bars. When we have too few data
(Nv/Kv = 1), then we can resolve only four clusters (65% of the individual
curves peak at Nc∗ = 4, the others at Nc∗ = 5). As Nv/Kv becomes very large,
I^{corr}_{T→0}(c;v) approaches I^{emp}_{T→0}(c;v), as expected.
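For readers who want to reproduce this kind of experiment, the following sketch generates
data in the spirit of footnote 9 and bins it into an estimate P̂(v|x); the particular α
values, the use of α as a standard deviation, and the binning range are our own choices,
since the excerpt does not specify them.

```python
import numpy as np

def make_variance_data(Nx=50, n_classes=5, Nv=200, Kv=100, seed=0):
    """Synthetic data in the spirit of section 4.1 / footnote 9.

    Each object x gets a scale alpha(x); we draw Nv samples v ~ N(0, alpha(x))
    and estimate P(v|x) as a normalized histogram with Kv bins.
    """
    rng = np.random.default_rng(seed)
    alphas = np.array([0.5, 1.0, 1.5, 2.0, 2.5])[:n_classes]    # hypothetical values
    alpha_x = rng.choice(alphas, size=Nx)                        # class of each object
    samples = rng.normal(0.0, alpha_x[:, None], size=(Nx, Nv))   # Nv draws of v per x
    edges = np.linspace(samples.min(), samples.max(), Kv + 1)
    P_vx = np.stack([np.histogram(s, bins=edges)[0] for s in samples]).astype(float)
    P_vx /= P_vx.sum(axis=1, keepdims=True)                      # estimate of P(v|x)
    P_x = np.full(Nx, 1.0 / Nx)
    return P_vx.T, P_x, alpha_x                                  # P_vx.T has shape (Kv, Nx)
```

The returned arrays have the shapes expected by the IB iteration sketched earlier.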
The curves in Figures 2 and 3 differ in the average number of exam-
ples per bin, Nv /Kv . The classification problem becomes harder as we see
fewer data. However, it also becomes harder when the true distributions are
closer. To separate the two effects, we create synthetic data drawn from gaus-
sian distributions with unit variance and N_A different, equidistant means
α, which are dα apart.^{11} N_A is the true number of clusters. This problem
becomes intrinsically harder as dα becomes smaller. Examples are shown
in Figures 4 and 5. The problem becomes easier as we are allowed to look
at more data, which corresponds to an increase in Nv /Kv . We are interested
^9 P(x) = 1/N_x and P(v|x) = N(0, α(x)), where α(x) ∈ A, and |A| = 5, with P(α) = 1/5;
and N_x = 50. N_x is the number of "objects" we are clustering.
^{10} Each time we compute I^{emp}_{T→0}(c;v), we start at 100 different, randomly chosen initial
conditions to increase the probability of finding the global maximum of the objective
functional.
^{11} P(v|x) = N(α(x), 1), α(x) ∈ A, N_A := |A|.
Figure 2: Result of clustering synthetic data with P(v|x) = N (0, α(x)); five pos-
sible values for α. Displayed is the relevant information kept in the compression,
computed from the empirical distribution, I^{emp}_{T→0}(c;v), which increases monotoni-
cally as a function of the number of clusters. Each curve is computed as the mean
of 31 different curves, obtained by virtue of creating different realizations of the
data. The error bars are ± 1 standard deviation. Nv /Kv equals 1 (diamonds),
2 (squares), 3 (triangles), 5 (stars), and 50 (X’s). Nx = 50 and Kv = 100 for all
curves. The line indicates the value of the information I(x; v), estimated from
106 data points.
in the regime in the space spanned by Nv /Kv and dα in which our method
retrieves the correct number of clusters.
In Figure 6, points mark those values of dα and Nv /Kv (evaluated on the
shown grid) at which we find the true number of clusters. The different
shapes of the points summarize results for 2, 5, and 10 clusters. A missing
point on the grid indicates a value of dα and Nv /Kv at which we did not find
the correct number of clusters. All of these missing points lie in a regime char-
acterized by a strong overlap of the true distributions combined with scarce
data. In that regime, our method always tells us that we can resolve fewer
clusters than the true number of clusters. For small sample sizes, the correct
number of clusters is resolved only if the clusters are well separated, but
as we accumulate more data, we can recover the correct number of classes
for more and more overlapping clusters. To illustrate the performance of
Figure 3: Result of clustering synthetic data with P(v|x) = N (0, α(x)); five pos-
sible values for α. Displayed is the "corrected relevant information" in the hard
clustering limit, I^{corr}_{T→0}(c;v) (see equation 3.21), as a function of the number of
clusters. Each curve is computed as the mean of 31 different curves, obtained
by virtue of creating different realizations of the data. For Nv /Kv ≥ 2, all indi-
vidual 31 curves peak at Nc∗ = 5, but are offset with respect to each other. The
error bars are ± 1 standard deviation. Nv /Kv equals 1 (diamonds), 2 (squares),
3 (triangles), 5 (stars), and 50 (X’s). For Nv /Kv = 1, 20 of the 31 curves peak at
Nc∗ = 4, the other 11 at Nc∗ = 5. The line indicates the value of the information
I(x; v), estimated from 106 data points.
the method, we show in Figure 5 the distribution P(x, v) in which α(x) has
five different values that occur with equal probability, P(α(x)) = 1/5, and
differ by dα = 0.2. For this separation, our method still retrieves five as the
optimal number of clusters when we have at least Nv = 2000.^{12}
Our method detects when only one cluster is present, a case in which
many methods fail (Gordon, 1999). We verified this for data drawn from one
gaussian distribution and for data drawn from the uniform distribution.
Figure 4: A trivial example of those data sets on which we found the cor-
rect number of clusters (results are summarized in Figure 6). Here, P(v|x) =
N (α(x), 1) with five different values for α, spaced dα = 2 apart. Kv = 100,
Nx = 20, Nv /Kv = 20.
4.2 Synthetic Test Data That Explicitly Violate Mixture Model As-
sumptions. We consider data drawn from a radial normal distribution, ac-
cording to P(r) = N (1, 0.2), with x = rcos(φ), v = rsin(φ), and P(φ) = 1/2π ,
as shown in Figure 7. The empirical information curves (see Figure 8) and
corrected information curves (see Figure 9) are computed as the mean of
seven different realizations of the data for different sample sizes.^{13} The cor-
rected curves peak at Nc∗ , which is shown as a function of N in Figure 10.
For fewer than a few thousand samples, the optimal number of clusters
goes roughly as Nc∗ ∝ N^{2/3}, but there is a saturation around Nc∗ ≈ 25. This
number corresponds to half of the number of x-bins (and therefore half of
the number of “objects” we are trying to cluster), which makes sense given
the symmetry of the problem.
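A corresponding sketch for the ring data of this section (again our own illustration;
treating 0.2 as the standard deviation of r and the histogram ranges are assumptions):

```python
import numpy as np

def make_ring_data(N=20_000, Kx=50, Kv=50, seed=0):
    """Ring data in the spirit of section 4.2: r ~ N(1, 0.2), phi uniform,
    x = r cos(phi), v = r sin(phi).  The joint distribution is estimated as a
    normalized 2-D histogram (Figure 7 uses 50 bins along each axis)."""
    rng = np.random.default_rng(seed)
    r = rng.normal(1.0, 0.2, size=N)
    phi = rng.uniform(0.0, 2 * np.pi, size=N)
    x, v = r * np.cos(phi), r * np.sin(phi)
    P_xv, _, _ = np.histogram2d(x, v, bins=[Kx, Kv])
    P_xv /= P_xv.sum()                         # estimate of P(x, v)
    P_x = P_xv.sum(axis=1)                     # marginal over v
    P_vx = (P_xv / (P_x[:, None] + 1e-12)).T   # conditional P(v|x), shape (Kv, Kx)
    return P_vx, P_x
```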
^{13} Each time we compute I^{emp}_{T→0}(c;v), we start at 20 different, randomly chosen initial
conditions to increase the probability of finding the global maximum of the objective
functional. Increasing the number of initial conditions would decrease the error bars at
the cost of computational time.
Figure 5: One of the difficult examples of those data sets on which we found the
correct number of clusters (results are summarized in Figure 6). Here, P(v|x) =
N (α(x), 1) with five different values for α, spaced dα = 0.2 apart. Kv = 100,
Nx = 20, Nv /Kv = 20.
5 Uncertainty in P(x)
We now assume that the joint distribution must be estimated from the data,

\hat P(v,x) = P(v,x) + \delta P(v,x),    (5.1)

where δP(v,x) is some small perturbation and its average over all possible
realizations of the data is zero:

\langle \delta P(v,x) \rangle = 0.    (5.2)
Figure 6: Result of finding the correct number of clusters with our method for
a synthetic data set of size N = Nx Nv , (Nx = 20) with P(v|x) = N (α(x), 1) and
with either 2, 5, or 10 possible values for α, spaced dα apart. We indicate values
of dα and the “resolution” Nv /Kv (Kv = 100) at which the correct number of
clusters is found: for 2, 5, and 10 clusters (squares); for only 2 and 5 clusters
(stars); for only 2 clusters (circles). The classification error (on the training data)
is 0 for all points except for the one that is labeled with 95% correct.
Now, this estimate induces an error not only in I^{emp}(c;v) but also in
I^{emp}(c;x). Taylor expansion of these two terms gives

\langle\Delta I(c;v)\rangle = \frac{1}{\ln(2)} \sum_{vc} \sum_{n=2}^{\infty} \frac{(-1)^n}{n(n-1)} \left[ \frac{1}{(P(v,c))^{n-1}} - \frac{1}{(P(c))^{n-1}} \right] \left\langle \left( \sum_x P(c|x)\,\delta P(x,v) \right)^{n} \right\rangle - \langle\Delta(P(v))\rangle,    (5.3)

with

\langle\Delta(P(v))\rangle = \frac{1}{\ln(2)} \sum_{v} \sum_{n=2}^{\infty} \frac{(-1)^n}{n(n-1)} \frac{1}{(P(v))^{n-1}} \left\langle \left( \sum_x \delta P(x,v) \right)^{n} \right\rangle,    (5.4)

and

\langle\Delta I(c;x)\rangle = -\frac{1}{\ln(2)} \sum_{vc} \sum_{n=2}^{\infty} \frac{(-1)^n}{n(n-1)} \frac{1}{(P(c))^{n-1}} \left\langle \left( \sum_x P(c|x)\,\delta P(x,v) \right)^{n} \right\rangle.    (5.5)
Figure 7: Twenty thousand data points drawn from a radial distribution, ac-
cording to P(r) = N (1, 0.2), with x = rcos(φ), v = rsin(φ), and P(φ) = 1/2π.
Displayed is the estimated probability distribution (normalized histogram with
50 bins along each axis).
Figure 8: I^{emp}_{T→0}(c;v) as a function of the number of clusters, averaged over seven
different realizations of the data. Error bars are ± 1 standard deviation. The
information I(x; v), calculated from 100,000 data points, equals 0.58 bits (line).
Data set size N equals 100 (diamonds), 300 (squares), 1000 (triangles), 3000
(stars), and 100,000 (crosses).
point, let us consider, as before, the leading-order term of the error (using
the approximation in equation 3.11),

\langle\Delta F\rangle^{(2)} \simeq \frac{1}{2N\ln(2)} \sum_{cv} \frac{1}{P(c)} \left[ \frac{1}{P(v|c)} + T - 1 \right] \sum_x (P(c|x))^2\, P(x,v).    (5.7)
In the T → 0 limit, this term becomes Nc (Kv − 1)/2N ln(2), and we find
I^{corr}_{T\to 0}(c;v) = I^{emp}_{T\to 0}(c;v) - N_c\, \frac{K_v - 1}{2\ln(2)N},    (5.8)
which is insignificantly different from equation 3.21 in the regime K_v ≫ 1.
Only for very large temperatures, T ≫ 1 (i.e., at the onset of the anneal-
ing process), could the error that results from uncertainty in P(x) make a
significant difference.
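As a rough check of this statement (our own arithmetic, with hypothetical values): the two
corrections differ by

I^{corr}_{T\to 0}\big|_{(5.8)} - I^{corr}_{T\to 0}\big|_{(3.21)} = \frac{N_c}{2\ln(2)N},

which for, say, N_c = 5 and N = 3000 is about 1.2 \times 10^{-3} bits, negligible next to the
K_v-sized correction itself.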
The corrected objective function is now given by
F^{corr} = I^{emp}(c;v) - T\, I^{emp}(c;x) - \langle\Delta F\rangle + \mu(x) \sum_c P(c|x),    (5.9)
Figure 9: I^{corr}_{T→0}(c;v) as a function of the number of clusters, averaged over seven
different realizations of the data. Error bars are ± 1 standard deviation. The
information I(x; v) calculated from 100,000 data points equals 0.58 bits (line).
Data set size N equals: 100 (diamonds), 300 (squares), 1000 (triangles), 3000
(stars), and 100,000 (crosses).
Figure 10: Optimal number of clusters, Nc∗ , as found by the suggested method,
as a function of the data set size N. The middle curve (crosses) represents the
average over seven different realizations of the data, points on the upper/lower
curve are maximum/minimum values, respectively. Line at 25.
5.1 Rate Distortion Theory. Let us assume that we estimate the distri-
bution P(x) by P̂(x) = P(x) + δP(x), with ⟨δP(x)⟩ = 0, as before. While there
is no systematic error in the computation of ⟨d⟩, this uncertainty in P(x) does
produce a systematic underestimation of the information cost I(c;x):
\langle\Delta I(c;x)\rangle = -\frac{1}{\ln(2)} \sum_{n=2}^{\infty} \frac{(-1)^n}{n(n-1)} \sum_c \frac{\left\langle \left( \sum_x P(c|x)\,\delta P(x) \right)^{n} \right\rangle}{(P(c))^{n-1}}.    (5.12)
When we correct the cost functional for this error (with λ = 1/T),
F^{corr} := I(c;x) + \lambda\, \langle d(x, x_c) \rangle - \langle\Delta I(c;x)\rangle + \mu(x) \sum_c P(c|x),    (5.13)
Let us consider the leading-order term of the error made in calculating the
information cost,
\langle\Delta I(c;x)\rangle^{(2)} = -\frac{1}{2\ln(2)} \sum_c \frac{\left\langle \left( \sum_x P(c|x)\,\delta P(x) \right)^{2} \right\rangle}{P(c)}.    (5.15)
so that the corrected information cost is bounded from below by^{14}

I^{corr}_{LB}(c;x) = I(c;x) + \frac{1}{2\ln(2)N}\, 2^{I(c;x)},    (5.18)
^{14} Using \sum_{xc} P(x,c)\,\frac{P(c|x)}{P(c)} = \sum_{xc} P(x,c)\, 2^{\log_2\left[\frac{P(c|x)}{P(c)}\right]} \ge 2^{I(c;x)}.

6 Summary
this sense, there is never a single “best” clustering of the data, just a family
of solutions evolving as a function of temperature.
As we solve the clustering problem at lower temperatures, we find so-
lutions that reveal more and more detailed structure and hence have more
distinct clusters. If we have only finite data sets, however, we expect that
there is an end to the meaningful structure that can be resolved—at some
point, separating clusters into smaller groups just corresponds to fitting the
sampling noise. The traditional approach to this issue is to solve the cluster-
ing problem in full and then to test for significance or validity of the results
by some additional statistical criteria. What we have presented in this work
is, we believe, a new approach. Because clustering is formulated as an op-
timization problem, we can try to take account of the sampling errors and
biases directly in the objective functional. In particular, for the IB method,
all terms in the objective functional are mutual informations, and there is
a large literature on the systematic biases in information estimation. There
is a perturbative regime in which these biases have a universal form and
can be corrected. Applying these corrections, we find that at fixed sample
size, the trade-off between complexity and quality really does have an end
point beyond which lowering the temperature or increasing the number of
clusters does not resolve more relevant information. We have seen numeri-
cally that in model problems, this strategy is sufficient to set the maximum
number of resolvable clusters at the correct value.
Acknowledgments
References