How Many Clusters? An Information-Theoretic Perspective
Susanne Still
susanna@princeton.edu
William Bialek
wbialek@princeton.edu
Department of Physics and Lewis-Sigler Institute for Integrative Genomics,
Princeton University, Princeton, NJ 08544, U.S.A.
1 Introduction
Much of our intuition about the world around us involves the idea of clus-
tering: many different acoustic waveforms correspond to the same syllable,
many different images correspond to the same object, and so on. It is plausi-
ble that a mathematically precise notion of clustering in the space of sensory
data may approximate the problems solved by our brains. Clustering methods
also are used in many different scientific domains as a practical tool; the
question of how many clusters to use has a well-defined answer only
if there exists a true number of classes and if the data set is large enough to
allow us to resolve them.
The complexity of a clustering assignment P(c|x) is measured by the mutual
information between the data and the cluster labels,

I(c; x) = \sum_{x,c} P(c|x)\, P(x) \log_2 \frac{P(c|x)}{P(c)},    (2.1)
with the distortion playing the role of energy, and the normalization,

Z(x, T) = \sum_c P(c) \exp\left[ -\frac{1}{T}\, d(x, x_c) \right],    (2.5)

playing the role of a partition function (Rose, Gurewitz, & Fox, 1990). The
representatives, x_c, often simply called cluster centers, are determined by
the condition that all of the "forces" within each cluster balance for a test
point located at the cluster center,^2

\sum_x P(x|c)\, \frac{\partial}{\partial x_c} d(x, x_c) = 0.    (2.6)
^2 This condition is not independent of the original variational problem. Optimizing the
objective function with respect to x_c, we find \frac{\partial}{\partial x_c}\left[ \langle d(x, x_c)\rangle + T\, I(c;x) \right] = 0 \Leftrightarrow \frac{\partial}{\partial x_c}\langle d(x, x_c)\rangle =
P(c) \sum_x P(x|c)\, \frac{\partial}{\partial x_c} d(x, x_c) = 0, \forall c \Rightarrow equation 2.6.
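To make the annealed clustering picture concrete, the following is a minimal numerical
sketch of a fixed-temperature update, under assumptions that are ours rather than the
paper's: a squared-Euclidean distortion d(x, x_c) = ||x - x_c||^2, a uniform P(x) = 1/Nx,
and randomly chosen initial centers.

```python
import numpy as np

def da_cluster(X, n_clusters, T, n_iter=100, seed=0):
    """Fixed-temperature deterministic-annealing clustering (cf. eqs. 2.5-2.6).

    Assumes squared-Euclidean distortion, so the "force balance" of eq. 2.6
    makes each center the P(x|c)-weighted mean of the data.
    X : array (Nx, dim); all points taken equally likely, P(x) = 1/Nx.
    Returns (P_cx, centers) with P_cx of shape (n_clusters, Nx).
    """
    rng = np.random.default_rng(seed)
    Nx = X.shape[0]
    centers = X[rng.choice(Nx, n_clusters, replace=False)]
    P_c = np.full(n_clusters, 1.0 / n_clusters)
    eps = 1e-12
    for _ in range(n_iter):
        d = ((X[None, :, :] - centers[:, None, :]) ** 2).sum(-1)   # (Nc, Nx)
        # Boltzmann assignments, normalized by Z(x, T) as in eq. 2.5
        logits = np.log(P_c + eps)[:, None] - d / T
        logits -= logits.max(axis=0, keepdims=True)
        P_cx = np.exp(logits)
        P_cx /= P_cx.sum(axis=0, keepdims=True)
        P_c = P_cx.mean(axis=1)                   # P(c) = sum_x P(c|x) / Nx
        # eq. 2.6 for squared distortion: centers -> conditional means
        centers = (P_cx @ X) / (P_cx.sum(axis=1, keepdims=True) + eps)
    return P_cx, centers
```

Annealing then amounts to calling this routine at successively lower T, reusing the
previous solution as the starting point.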
For the information bottleneck (IB) problem, the effective distortion measure is the
Kullback-Leibler divergence,

D_{KL}\left[P(v|x)\,\|\,P(v|c)\right] = \sum_v P(v|x) \ln \frac{P(v|x)}{P(v|c)},    (2.9)

and the distributions that characterize each cluster are given by

P(v|c) = \frac{1}{P(c)} \sum_x P(v|x)\, P(c|x)\, P(x).    (2.10)
When we plot I(c; v) as a function of I(c; x), both evaluated at the optimum,
we obtain a curve similar to the rate distortion curve, the slope of which is
given by the trade-off between compression and preservation of relevant
information:
\frac{\delta I(c;v)}{\delta I(c;x)} = T.    (2.11)
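For the IB case, the analogous self-consistent update uses the KL distortion of
equation 2.9 and the cluster-conditional distributions of equation 2.10. The sketch
below is our own illustration; the function name, initialization, and fixed iteration
count are assumptions, not taken from the paper.

```python
import numpy as np

def ib_update(P_vx, P_x, n_clusters, T, n_iter=200, seed=0):
    """Self-consistent IB iteration at temperature T (uncorrected).

    P_vx : array (Kv, Nx), columns are the conditionals P(v|x)
    P_x  : array (Nx,), the marginal P(x)
    Returns the soft assignments P(c|x), shape (n_clusters, Nx).
    """
    rng = np.random.default_rng(seed)
    Nx = P_x.size
    eps = 1e-12
    P_cx = rng.random((n_clusters, Nx))            # random soft initialization
    P_cx /= P_cx.sum(axis=0, keepdims=True)
    for _ in range(n_iter):
        P_c = P_cx @ P_x                           # P(c) = sum_x P(c|x) P(x)
        P_vc = (P_vx * P_x) @ P_cx.T / (P_c + eps) # eq. 2.10, shape (Kv, Nc)
        # D_KL[P(v|x) || P(v|c)] for every (c, x), eq. 2.9 (natural log)
        logratio = np.log(P_vx + eps)[None, :, :] - np.log(P_vc.T[:, :, None] + eps)
        dkl = np.einsum('vx,cvx->cx', P_vx, logratio)
        # assignment rule: P(c|x) proportional to P(c) exp(-D_KL / T)
        logits = np.log(P_c + eps)[:, None] - dkl / T
        logits -= logits.max(axis=0, keepdims=True)
        P_cx = np.exp(logits)
        P_cx /= P_cx.sum(axis=0, keepdims=True)
    return P_cx
```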
We estimate the relevant conditional distribution from a finite sample, writing

\hat P(v|x) = P(v|x) + \delta P(v|x),    (3.1)

where we assume that \delta P(v|x) is some small perturbation and its average
over all possible realizations of the data is zero:

\langle \delta P(v|x) \rangle = 0.    (3.2)
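As a concrete (and entirely standard) illustration of where such a perturbation comes
from, not part of the original derivation: if, for each x, the variable v is observed
n(x) \approx N P(x) times and \hat P(v|x) is the empirical frequency, then

\hat P(v|x) = \frac{n(v,x)}{n(x)}, \qquad
\langle \hat P(v|x)\rangle = P(v|x) \;\Rightarrow\; \langle\delta P(v|x)\rangle = 0, \qquad
\langle (\delta P(v|x))^2\rangle = \frac{P(v|x)\,[1-P(v|x)]}{n(x)} \approx \frac{P(v|x)}{N P(x)},

the last approximation holding for P(v|x) \ll 1; this is the counting-statistics form used
in equation 3.11 below.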
Taylor expansion of I^{emp}(c;v) := I(c;v)|_{\hat P(v|x)} around P(v|x) leads to a
systematic error \langle\Delta I(c;v)\rangle, which, with the functional derivatives

\frac{\delta^n I(c;v)}{\prod_{k=1}^{n} \delta P(v|x^{(k)})} = \frac{(-1)^n (n-2)!}{\ln 2} \left[ \sum_c \frac{\prod_{k=1}^{n} P(c, x^{(k)})}{(P(c,v))^{n-1}} - \frac{\prod_{k=1}^{n} P(x^{(k)})}{(P(v))^{n-1}} \right],    (3.5)

is given by

\langle\Delta I(c;v)\rangle = \frac{1}{\ln 2} \sum_{n=2}^{\infty} \frac{(-1)^n}{n(n-1)} \sum_v \left\langle \sum_c \frac{\left(\sum_x \delta P(v|x)\, P(c,x)\right)^n}{(P(c,v))^{n-1}} - \frac{\left(\sum_x \delta P(v|x)\, P(x)\right)^n}{(P(v))^{n-1}} \right\rangle.    (3.6)
Note that the terms with n = 1 vanish because of equation 3.2 and that the
second term in the sum is constant with respect to P(c|x).
Our idea is to subtract this error from the objective function, equation 2.7,
and recompute the distribution that maximizes the corrected objective func-
tion,
\max_{P(c|x)} \left[ I^{emp}(c;v) - T\, I^{emp}(c;x) - \langle\Delta I(c;v)\rangle + \mu(x) \sum_c P(c|x) \right].    (3.7)
The last constraint ensures normalization, and the optimal assignment rule
P(c|x) is now given by
P(c|x) = \frac{P(c)}{Z(x,T)} \exp\left\{ -\frac{1}{T}\left[ D_{KL}\left[P(v|x)\,\|\,P(v|c)\right] - \sum_v \sum_{n=2}^{\infty} (-1)^n \left( P(v|x)\,\frac{\langle(\delta P(v|c))^n\rangle}{n\,(P(v|c))^n} - \frac{\langle \delta P(v|x)\,(\delta P(v|c))^{n-1}\rangle}{(n-1)\,(P(v|c))^{n-1}} \right) \right] \right\},    (3.8)
where we have made use of equation 2.10 and the approximation (for count-
ing statistics)
\langle \delta P(v|x)\, \delta P(v|x') \rangle \simeq \frac{P(v|x)}{N P(x)}\, \delta_{xx'}.    (3.11)
Can we say something about the shape of the resulting “corrected” op-
timal information curve by analyzing the leading-order error term, equa-
tion 3.10? This term is bounded from above by the value it assumes in the
deterministic limit (T → 0), in which assignments P(c|x) are either 1 or 0
and thus [P(c|x)]^2 = P(c|x),^4

\langle\Delta I(c;v)\rangle^{(2)}_{T\to 0} = \frac{1}{2\ln(2)}\,\frac{K_v}{N}\, N_c.    (3.12)
Kv is the number of bins we have used to obtain our estimate P̂(v|x). Note
that if we had adopted a continuous, rather than a discrete, treatment, then
the volume of the (finite) V-space would arise instead of Kv .5 If one does
not assume that P(x) is known, and instead calculates the bias by Taylor ex-
pansion of Iemp (c; v) around P(x, v) (see section 5), then the upper bound to
the leading order term (see equation 5.7) is given by Nc (Kv − 1)/(2N ln(2)).
This is the Nc -dependent part of the leading correction to the bias as de-
rived in Treves and Panzeri (1995, equation 2.11, term C1 ). Similar to what
these authors found when they computed higher-order corrections, we also
found in numerical experiments that the leading term is a surprisingly good
estimate of the total bias, and we therefore feel confident in approximating
the error by equation 3.10, although we cannot guarantee convergence of
the series in equation 3.6.^6
^4 Substitution of [P(c|x)]^2 = P(c|x) into equation 3.10 gives \langle\Delta I(c;v)\rangle^{(2)} =
\frac{1}{2\ln(2)N} \sum_{vc} \frac{\sum_x P(c|x)\, P(x|v)}{\sum_x P(c|x)\, P(x|v)} = \frac{1}{2\ln(2)N} \sum_{vc} 1 = \frac{1}{2\ln(2)N}\, K_v N_c.
^5 For choosing the number of bins, K_v, as a function of the data set size N, we refer to
the large body of literature on this problem, for example, Hall and Hannan (1988).
^6 For counting statistics (binomial distribution), we have

\langle(\delta P(v|x))^n\rangle = \frac{(P(v|x))^n}{N^{n-1}} \sum_{k=0}^{n} (-1)^{n-k} \frac{n!}{k!\,(n-k)!}\, \frac{N^k}{(P(v|x))^k}\, \frac{k!\, N!}{(N-l)!} \sum_{\{l_1 \cdots l_k\}} \prod_{q=1}^{k} \frac{p^{l_q}}{l_q!}\, (1-p)^{N-l},

where l = \sum_{q=1}^{k} l_q, the l_q are positive integers, and the sum over \{l_1 \cdots l_k\} runs over all
such partitions. Substituting this into equation 3.6, and considering only terms with
x^{(1)} = x^{(2)} = \cdots = x^{(n)}, we get

\frac{1}{\ln 2} \sum_{xvc} P(c,v) \sum_{k=1}^{\infty} \frac{(2k-1)!!}{2k(2k-1)} \left[ \frac{1}{N}\, \frac{P(x)}{(P(v))^2}\, \frac{(P(c|x))^2}{(P(c|v))^2} \left( P(v|x) - (P(v|x))^2 \right) \right]^k,

which is not guaranteed to converge.
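A quick way to get a feeling for the size of this bias is a Monte Carlo toy experiment
(our own illustration, not from the paper): draw hard cluster labels c and a relevant
variable v independently, so the true I(c;v) is zero and the plug-in estimate is pure
bias, and compare it with the hard-clustering bound Nc Kv / (2 ln(2) N) of equation 3.12
and with the classical Miller-Madow-style term (Nc - 1)(Kv - 1)/(2 ln(2) N).

```python
import numpy as np

def mutual_information(counts):
    """Plug-in mutual information in bits from a joint count table."""
    P = counts / counts.sum()
    px = P.sum(axis=1, keepdims=True)
    py = P.sum(axis=0, keepdims=True)
    nz = P > 0
    return float((P[nz] * np.log2(P[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(1)
Nc, Kv, N, trials = 5, 100, 10_000, 200
bound_312 = Nc * Kv / (2 * np.log(2) * N)                # eq. 3.12
miller_madow = (Nc - 1) * (Kv - 1) / (2 * np.log(2) * N)  # classical bias term
biases = []
for _ in range(trials):
    c = rng.integers(Nc, size=N)      # hard cluster labels
    v = rng.integers(Kv, size=N)      # relevant variable, independent of c
    counts = np.zeros((Nc, Kv))
    np.add.at(counts, (c, v), 1.0)
    biases.append(mutual_information(counts))
print(f"measured bias        {np.mean(biases):.4f} bits")
print(f"bound of eq. 3.12    {bound_312:.4f} bits")
print(f"Miller-Madow term    {miller_madow:.4f} bits")
```

With these (hypothetical) values the measured bias is a few hundredths of a bit, of the
same order as both analytic estimates, with equation 3.12 lying above it as a bound should.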
The lower bound of the leading-order error, equation 3.10, is given by^7

\frac{1}{2\ln(2)}\,\frac{1}{N}\, 2^{I(c;x)},    (3.13)

so that the corrected relevant information is bounded from above by

I^{corr}_{UB}(c;v) = I^{emp}(c;v) - \frac{1}{2\ln(2)}\,\frac{1}{N}\, 2^{I(c;x)}.    (3.15)
The slope of this upper bound is T - 2^{I(c;x)}/2N (using equation 2.11), and
there is a maximum at

T^*_{UB} = \frac{1}{2N}\, 2^{I(c;x)}.    (3.16)
If the hard clustering solution assigns equal numbers of data to each cluster,
then the upper bound on the error, equation 3.12, can be rewritten as
\frac{1}{2\ln(2)}\,\frac{K_v}{N}\, 2^{I(c;x)},    (3.17)

so that the corresponding lower bound on the corrected relevant information,

I^{corr}_{LB}(c;v) = I^{emp}(c;v) - \frac{1}{2\ln(2)}\,\frac{K_v}{N}\, 2^{I(c;x)},    (3.18)
has a maximum at
T^*_{LB} = \frac{K_v}{2N}\, 2^{I(c;x)}.    (3.19)
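As a rough numerical illustration with hypothetical values (not taken from the experiments
below): for N = 10^4 samples, K_v = 100 bins, and a hard clustering into N_c = 5 equally
populated clusters, so that I(c;x) = \log_2 5 \approx 2.32 bits,

T^*_{UB} = \frac{2^{I(c;x)}}{2N} = \frac{5}{2 \times 10^4} = 2.5 \times 10^{-4}, \qquad
T^*_{LB} = \frac{K_v\, 2^{I(c;x)}}{2N} = 2.5 \times 10^{-2},

so the two bounds locate the end point of the useful annealing range within a factor of K_v
of each other.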
Since both upper and lower bound coincide at the end point of the curve,
where T → 0 (see Figure 1), the actual corrected information curve must
have a maximum at

T^* = \frac{\gamma}{2N}\, 2^{I(c;x)},    (3.20)

for some γ between 1 and Kv.
^7 Proof: \sum_{xvc} \frac{[P(c|x)]^2\, P(x|v)}{P(c|v)} = \sum_{xc} P(x,c)\,\frac{P(c|x)}{P(c)} \sum_v \frac{P(v|x)}{P(v|c)} > \sum_{xc} P(x,c)\,\frac{P(c|x)}{P(c)} = \sum_{xc} P(x,c)\, 2^{\log_2\left[\frac{P(c|x)}{P(c)}\right]} \ge 2^{I(c;x)}.
[Figure 1 shows I^{emp}(c;v), I^{corr}_{UB}(c;v), and I^{corr}_{LB}(c;v) plotted against I(c;x); the maxima of the two bounds are marked by x.]
Figure 1: Sketch of the lower and upper bound on the corrected information
curve, which both have a maximum under some conditions (see equations 3.16
and 3.19), indicated by x, compared to the empirical information curve, which
is monotonically increasing.
In the hard clustering limit, the corrected relevant information becomes

I^{corr}_{T\to 0}(c;v) = I^{emp}_{T\to 0}(c;v) - N_c\, \frac{K_v}{2\ln(2)N},    (3.21)

where I^{emp}_{T→0}(c;v) is calculated by fixing the number of clusters and cooling
the temperature to obtain a hard clustering solution. While I^{emp}_{T→0}(c;v) in-
creases monotonically with Nc, we expect I^{corr}_{T→0}(c;v) to have a maximum (or
at least a plateau) at Nc∗, as we have argued above. Nc∗ is then the optimal
number of clusters in the sense that, using more clusters, we would not cap-
ture more meaningful structure (or, in other words, would overfit the data),
and although in principle we could always use fewer clusters, this comes
at the cost of keeping less relevant information I(c; v).
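In practice, selecting Nc∗ from equation 3.21 is a one-line correction followed by an
argmax. A minimal sketch follows; the numerical values in the example are invented
purely for illustration.

```python
import numpy as np

def corrected_relevant_info(I_emp, Nc, N, Kv):
    """Hard-clustering correction of eq. 3.21: I_emp - Nc * Kv / (2 ln2 N)."""
    return I_emp - Nc * Kv / (2 * np.log(2) * N)

def choose_n_clusters(I_emp_by_Nc, N, Kv):
    """Pick Nc* as the maximum of the corrected curve.

    I_emp_by_Nc : dict {Nc: empirical I_{T->0}(c;v) in bits}, e.g. obtained
                  by annealing an IB solution to T -> 0 for each Nc.
    """
    corrected = {Nc: corrected_relevant_info(I, Nc, N, Kv)
                 for Nc, I in I_emp_by_Nc.items()}
    Nc_star = max(corrected, key=corrected.get)
    return Nc_star, corrected

# Hypothetical empirical values, for illustration only:
example = {2: 0.17, 3: 0.20, 4: 0.215, 5: 0.224, 6: 0.227, 7: 0.229}
print(choose_n_clusters(example, N=10_000, Kv=100))   # corrected curve peaks at Nc* = 5
```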
4 Numerical Results
4.1 Simple Synthetic Test Data. We test our method for finding Nc∗ on
data that we understand well and where we know what the answer should
be. We thus created synthetic data drawn from normal distributions with
zero mean and five different variances (for Figures 2 and 3).^9 We emphasize
that we chose an example with gaussian distributions not because any of
our analysis makes use of gaussian assumptions, but rather because in the
gaussian case, we have a clear intuition about the similarity of different
distributions and hence about the difficulty of the clustering task. This will
become important later, when we make the discrimination task harder (see
Figure 6). We use Kv = 100 bins to estimate P̂(v|x). In Figures 2 and 3, we
compare how I^{emp}_{T→0}(c;v) and I^{corr}_{T→0}(c;v) behave as a function of the number
of clusters. The number of observations of v, given x, is Nv = N/Nx, and
the average number of observations per bin is given by Nv/Kv. Figure 3
shows the average I^{corr}_{T→0}(c;v), computed from 31 different realizations of
the data.^{10} All of the 31 individual curves have a maximum at the true
number of clusters, Nc∗ = 5, for Nv/Kv ≥ 2. They are offset with respect to
each other, which is the source of the error bars. When we have too few data
(Nv/Kv = 1), then we can resolve only four clusters (65% of the individual
curves peak at Nc∗ = 4, the others at Nc∗ = 5). As Nv/Kv becomes very large,
I^{corr}_{T→0}(c;v) approaches I^{emp}_{T→0}(c;v), as expected.
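For readers who want to reproduce this kind of experiment, the following sketch generates
data in the spirit of footnote 9 and bins it into an estimate P̂(v|x); the particular α
values, the use of α as a standard deviation, and the binning range are our own choices,
since the excerpt does not specify them.

```python
import numpy as np

def make_variance_data(Nx=50, n_classes=5, Nv=200, Kv=100, seed=0):
    """Synthetic data in the spirit of section 4.1 / footnote 9.

    Each object x gets a scale alpha(x); we draw Nv samples v ~ N(0, alpha(x))
    and estimate P(v|x) as a normalized histogram with Kv bins.
    """
    rng = np.random.default_rng(seed)
    alphas = np.array([0.5, 1.0, 1.5, 2.0, 2.5])[:n_classes]    # hypothetical values
    alpha_x = rng.choice(alphas, size=Nx)                        # class of each object
    samples = rng.normal(0.0, alpha_x[:, None], size=(Nx, Nv))   # Nv draws of v per x
    edges = np.linspace(samples.min(), samples.max(), Kv + 1)
    P_vx = np.stack([np.histogram(s, bins=edges)[0] for s in samples]).astype(float)
    P_vx /= P_vx.sum(axis=1, keepdims=True)                      # estimate of P(v|x)
    P_x = np.full(Nx, 1.0 / Nx)
    return P_vx.T, P_x, alpha_x                                  # P_vx.T has shape (Kv, Nx)
```

The returned arrays have the shapes expected by the IB iteration sketched earlier.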
The curves in Figures 2 and 3 differ in the average number of exam-
ples per bin, Nv /Kv . The classification problem becomes harder as we see
fewer data. However, it also becomes harder when the true distributions are
closer. To separate the two effects, we create synthetic data drawn from gaus-
sian distributions with unit variance and N_A different, equidistant means
α, which are dα apart.^{11} N_A is the true number of clusters. This problem
becomes intrinsically harder as dα becomes smaller. Examples are shown
in Figures 4 and 5. The problem becomes easier as we are allowed to look
at more data, which corresponds to an increase in Nv /Kv . We are interested
^9 P(x) = 1/N_x and P(v|x) = N(0, α(x)), where α(x) ∈ A, and |A| = 5, with P(α) = 1/5;
and N_x = 50. N_x is the number of "objects" we are clustering.
^{10} Each time we compute I^{emp}_{T→0}(c;v), we start at 100 different, randomly chosen initial
conditions to increase the probability of finding the global maximum of the objective
functional.
^{11} P(v|x) = N(α(x), 1), α(x) ∈ A, N_A := |A|.
Figure 2: Result of clustering synthetic data with P(v|x) = N (0, α(x)); five pos-
sible values for α. Displayed is the relevant information kept in the compression,
computed from the empirical distribution, I^{emp}_{T→0}(c;v), which increases monotoni-
cally as a function of the number of clusters. Each curve is computed as the mean
of 31 different curves, obtained by virtue of creating different realizations of the
data. The error bars are ± 1 standard deviation. Nv /Kv equals 1 (diamonds),
2 (squares), 3 (triangles), 5 (stars), and 50 (X’s). Nx = 50 and Kv = 100 for all
curves. The line indicates the value of the information I(x; v), estimated from
106 data points.
in the regime in the space spanned by Nv /Kv and dα in which our method
retrieves the correct number of clusters.
In Figure 6, points mark those values of dα and Nv /Kv (evaluated on the
shown grid) at which we find the true number of clusters. The different
shapes of the points summarize results for 2, 5, and 10 clusters. A missing
point on the grid indicates a value of dα and Nv /Kv at which we did not find
the correct number of clusters. All of these missing points lie in a regime char-
acterized by a strong overlap of the true distributions combined with scarce
data. In that regime, our method always tells us that we can resolve fewer
clusters than the true number of clusters. For small sample sizes, the correct
number of clusters is resolved only if the clusters are well separated, but
as we accumulate more data, we can recover the correct number of classes
for more and more overlapping clusters. To illustrate the performance of
Figure 3: Result of clustering synthetic data with P(v|x) = N (0, α(x)); five pos-
sible values for α. Displayed is the "corrected relevant information" in the hard
clustering limit, I^{corr}_{T→0}(c;v) (see equation 3.21), as a function of the number of
clusters. Each curve is computed as the mean of 31 different curves, obtained
by virtue of creating different realizations of the data. For Nv /Kv ≥ 2, all indi-
vidual 31 curves peak at Nc∗ = 5, but are offset with respect to each other. The
error bars are ± 1 standard deviation. Nv /Kv equals 1 (diamonds), 2 (squares),
3 (triangles), 5 (stars), and 50 (X’s). For Nv /Kv = 1, 20 of the 31 curves peak at
Nc∗ = 4, the other 11 at Nc∗ = 5. The line indicates the value of the information
I(x; v), estimated from 106 data points.
the method, we show in Figure 5 the distribution P(x, v) in which α(x) has
five different values that occur with equal probability, P(α(x)) = 1/5, and
differ by dα = 0.2. For this separation, our method still retrieves five as the
optimal number of clusters when we have at least Nv = 2000.^{12}
Our method detects when only one cluster is present, a case in which
many methods fail (Gordon, 1999). We verified this for data drawn from one
gaussian distribution and for data drawn from the uniform distribution.
Figure 4: A trivial example of those data sets on which we found the cor-
rect number of clusters (results are summarized in Figure 6). Here, P(v|x) =
N (α(x), 1) with five different values for α, spaced dα = 2 apart. Kv = 100,
Nx = 20, Nv /Kv = 20.
4.2 Synthetic Test Data That Explicitly Violate Mixture Model As-
sumptions. We consider data drawn from a radial normal distribution, ac-
cording to P(r) = N (1, 0.2), with x = rcos(φ), v = rsin(φ), and P(φ) = 1/2π ,
as shown in Figure 7. The empirical information curves (see Figure 8) and
corrected information curves (see Figure 9) are computed as the mean of
seven different realizations of the data for different sample sizes.^{13} The cor-
rected curves peak at Nc∗ , which is shown as a function of N in Figure 10.
For fewer than a few thousand samples, the optimal number of clusters
goes roughly as Nc∗ ∝ N^{2/3}, but there is a saturation around Nc∗ ≈ 25. This
number corresponds to half of the number of x-bins (and therefore half of
the number of “objects” we are trying to cluster), which makes sense given
the symmetry of the problem.
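A corresponding sketch for the ring data of this section (again our own illustration;
treating 0.2 as the standard deviation of r and the histogram ranges are assumptions):

```python
import numpy as np

def make_ring_data(N=20_000, Kx=50, Kv=50, seed=0):
    """Ring data in the spirit of section 4.2: r ~ N(1, 0.2), phi uniform,
    x = r cos(phi), v = r sin(phi).  The joint distribution is estimated as a
    normalized 2-D histogram (Figure 7 uses 50 bins along each axis)."""
    rng = np.random.default_rng(seed)
    r = rng.normal(1.0, 0.2, size=N)
    phi = rng.uniform(0.0, 2 * np.pi, size=N)
    x, v = r * np.cos(phi), r * np.sin(phi)
    P_xv, _, _ = np.histogram2d(x, v, bins=[Kx, Kv])
    P_xv /= P_xv.sum()                         # estimate of P(x, v)
    P_x = P_xv.sum(axis=1)                     # marginal over v
    P_vx = (P_xv / (P_x[:, None] + 1e-12)).T   # conditional P(v|x), shape (Kv, Kx)
    return P_vx, P_x
```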
^{13} Each time we compute I^{emp}_{T→0}(c;v), we start at 20 different, randomly chosen initial
conditions to increase the probability of finding the global maximum of the objective
functional. Increasing the number of initial conditions would decrease the error bars at
the cost of computational time.
Figure 5: One of the difficult examples of those data sets on which we found the
correct number of clusters (results are summarized in Figure 6). Here, P(v|x) =
N (α(x), 1) with five different values for α, spaced dα = 0.2 apart. Kv = 100,
Nx = 20, Nv /Kv = 20.
5 Uncertainty in P(x)
We now assume that the joint distribution must be estimated from the data,

\hat P(v,x) = P(v,x) + \delta P(v,x),    (5.1)

where δP(v,x) is some small perturbation and its average over all possible
realizations of the data is zero:

\langle \delta P(v,x) \rangle = 0.    (5.2)
Figure 6: Result of finding the correct number of clusters with our method for
a synthetic data set of size N = Nx Nv , (Nx = 20) with P(v|x) = N (α(x), 1) and
with either 2, 5, or 10 possible values for α, spaced dα apart. We indicate values
of dα and the “resolution” Nv /Kv (Kv = 100) at which the correct number of
clusters is found: for 2, 5, and 10 clusters (squares); for only 2 and 5 clusters
(stars); for only 2 clusters (circles). The classification error (on the training data)
is 0 for all points except for the one that is labeled with 95% correct.
Now, this estimate induces an error not only in I^{emp}(c;v) but also in
I^{emp}(c;x). Taylor expansion of these two terms gives

\langle\Delta I(c;v)\rangle = \frac{1}{\ln(2)} \sum_{vc} \sum_{n=2}^{\infty} \frac{(-1)^n}{n(n-1)} \left[ \frac{1}{(P(v,c))^{n-1}} - \frac{1}{(P(c))^{n-1}} \right] \left\langle \left( \sum_x P(c|x)\,\delta P(x,v) \right)^{n} \right\rangle - \langle\Delta(P(v))\rangle,    (5.3)

with

\langle\Delta(P(v))\rangle = \frac{1}{\ln(2)} \sum_{v} \sum_{n=2}^{\infty} \frac{(-1)^n}{n(n-1)} \frac{1}{(P(v))^{n-1}} \left\langle \left( \sum_x \delta P(x,v) \right)^{n} \right\rangle,    (5.4)

and

\langle\Delta I(c;x)\rangle = -\frac{1}{\ln(2)} \sum_{vc} \sum_{n=2}^{\infty} \frac{(-1)^n}{n(n-1)} \frac{1}{(P(c))^{n-1}} \left\langle \left( \sum_x P(c|x)\,\delta P(x,v) \right)^{n} \right\rangle.    (5.5)
Figure 7: Twenty thousand data points drawn from a radial distribution, ac-
cording to P(r) = N (1, 0.2), with x = rcos(φ), v = rsin(φ), and P(φ) = 1/2π.
Displayed is the estimated probability distribution (normalized histogram with
50 bins along each axis).
Figure 8: I^{emp}_{T→0}(c;v) as a function of the number of clusters, averaged over seven
different realizations of the data. Error bars are ± 1 standard deviation. The
information I(x; v), calculated from 100,000 data points, equals 0.58 bits (line).
Data set size N equals 100 (diamonds), 300 (squares), 1000 (triangles), 3000
(stars), and 100,000 (crosses).
point, let us consider, as before, the leading-order term of the error (using
the approximation in equation 3.11),

\langle\Delta F\rangle^{(2)} \simeq \frac{1}{2N\ln(2)} \sum_{cv} \frac{1}{P(c)} \left[ \frac{1}{P(v|c)} + T - 1 \right] \sum_x (P(c|x))^2\, P(x,v).    (5.7)
In the T → 0 limit, this term becomes Nc (Kv − 1)/2N ln(2), and we find
I^{corr}_{T\to 0}(c;v) = I^{emp}_{T\to 0}(c;v) - N_c\, \frac{K_v - 1}{2\ln(2)N},    (5.8)
which is insignificantly different from equation 3.21 in the regime K_v ≫ 1.
Only for very large temperatures, T ≫ 1 (i.e., at the onset of the anneal-
ing process), could the error that results from uncertainty in P(x) make a
significant difference.
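As a rough check of this statement (our own arithmetic, with hypothetical values): the two
corrections differ by

I^{corr}_{T\to 0}\big|_{(5.8)} - I^{corr}_{T\to 0}\big|_{(3.21)} = \frac{N_c}{2\ln(2)N},

which for, say, N_c = 5 and N = 3000 is about 1.2 \times 10^{-3} bits, negligible next to the
K_v-sized correction itself.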
The corrected objective function is now given by
F^{corr} = I^{emp}(c;v) - T\, I^{emp}(c;x) - \langle\Delta F\rangle + \mu(x) \sum_c P(c|x),    (5.9)
Figure 9: I^{corr}_{T→0}(c;v) as a function of the number of clusters, averaged over seven
different realizations of the data. Error bars are ± 1 standard deviation. The
information I(x; v) calculated from 100,000 data points equals 0.58 bits (line).
Data set size N equals: 100 (diamonds), 300 (squares), 1000 (triangles), 3000
(stars), and 100,000 (crosses).
Figure 10: Optimal number of clusters, Nc∗ , as found by the suggested method,
as a function of the data set size N. The middle curve (crosses) represents the
average over seven different realizations of the data, points on the upper/lower
curve are maximum/minimum values, respectively. Line at 25.
5.1 Rate Distortion Theory. Let us assume that we estimate the distri-
bution P(x) by P̂(x) = P(x) + δP(x), with ⟨δP(x)⟩ = 0, as before. While there
is no systematic error in the computation of ⟨d⟩, this uncertainty in P(x) does
produce a systematic underestimation of the information cost I(c;x):
\langle\Delta I(c;x)\rangle = -\frac{1}{\ln(2)} \sum_{n=2}^{\infty} \frac{(-1)^n}{n(n-1)} \sum_c \frac{\left\langle \left( \sum_x P(c|x)\,\delta P(x) \right)^{n} \right\rangle}{(P(c))^{n-1}}.    (5.12)
When we correct the cost functional for this error (with λ = 1/T),
F^{corr} := I(c;x) + \lambda\, \langle d(x, x_c) \rangle - \langle\Delta I(c;x)\rangle + \mu(x) \sum_c P(c|x),    (5.13)
Let us consider the leading-order term of the error made in calculating the
information cost,
\langle\Delta I(c;x)\rangle^{(2)} = -\frac{1}{2\ln(2)} \sum_c \frac{\left\langle \left( \sum_x P(c|x)\,\delta P(x) \right)^{2} \right\rangle}{P(c)}.    (5.15)
so that the corrected information cost is bounded from below by^{14}

I^{corr}_{LB}(c;x) = I(c;x) + \frac{1}{2\ln(2)N}\, 2^{I(c;x)},    (5.18)
^{14} Using \sum_{xc} P(x,c)\,\frac{P(c|x)}{P(c)} = \sum_{xc} P(x,c)\, 2^{\log_2\left[\frac{P(c|x)}{P(c)}\right]} \ge 2^{I(c;x)}.

6 Summary
this sense, there is never a single “best” clustering of the data, just a family
of solutions evolving as a function of temperature.
As we solve the clustering problem at lower temperatures, we find so-
lutions that reveal more and more detailed structure and hence have more
distinct clusters. If we have only finite data sets, however, we expect that
there is an end to the meaningful structure that can be resolved—at some
point, separating clusters into smaller groups just corresponds to fitting the
sampling noise. The traditional approach to this issue is to solve the cluster-
ing problem in full and then to test for significance or validity of the results
by some additional statistical criteria. What we have presented in this work
is, we believe, a new approach. Because clustering is formulated as an op-
timization problem, we can try to take account of the sampling errors and
biases directly in the objective functional. In particular, for the IB method,
all terms in the objective functional are mutual informations, and there is
a large literature on the systematic biases in information estimation. There
is a perturbative regime in which these biases have a universal form and
can be corrected. Applying these corrections, we find that at fixed sample
size, the trade-off between complexity and quality really does have an end
point beyond which lowering the temperature or increasing the number of
clusters does not resolve more relevant information. We have seen numeri-
cally that in model problems, this strategy is sufficient to set the maximum
number of resolvable clusters at the correct value.
Acknowledgments
References