Compiled by Alvin Wan from Professor Benjamin Recht's lecture and Samaneh's discussion.
1 Overview
With clustering, we have several key motivations. It is not trivial to choose an objective to minimize: in PCA, the algorithm was fixed regardless of the objective, but with clustering, different objectives can result in different algorithms. There is no single preferred way to do clustering, but we will explore several popular methods in this note. Here are three approaches to consider:
k-means (quantization)
agglomeration (hierarchy)
spectral (segmentation)
2 K-Means Clustering
In k-means clustering, we segment our data by describing each data point using a centroid $\mu_j$. In other words, $x_i$ is in cluster $j$ if $x_i$ is closer to centroid $j$ than to any other centroid: $\|x_i - \mu_j\| < \|x_i - \mu_{j'}\|$ for all $j' \neq j$. Given the centroids, this is how we assign points to clusters.
The question is now: how do we pick centroids? We have the following optimization problem:
$$\min_{\mu_1, \mu_2, \dots, \mu_k} \; \sum_{i=1}^{n} \min_{1 \le j_i \le k} \|x_i - \mu_{j_i}\|^2$$
($j_i$ is an index.) This is effectively like an SVM, where we are fitting parameters to some loss function. As it turns out, minimizing this cost is NP-hard.
2.1 Lloyd's Algorithm
The following is an instance of alternating minimization. If we fix the cluster assignments, the problem becomes easy: the objective is then a convex function of the means. Conversely, if we fix the means, we can easily assign points to clusters. In the following algorithm, we alternately fix either the cluster assignments or the means and minimize over the other.
1. Initialize $\mu_1, \dots, \mu_k$.
2. Assign each point $x_i$ to its nearest centroid.
3. Recompute each $\mu_j$ as the mean of the points assigned to cluster $j$.
4. If the assignments changed, go back to step 2.
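As a concrete illustration, here is a minimal NumPy sketch of Lloyd's algorithm; the function name, the random initialization, and the stopping rule (stop when the assignments no longer change) are my own choices for the example.

```python
import numpy as np

def lloyds_algorithm(X, k, n_iters=100, seed=0):
    """Minimal sketch of Lloyd's algorithm on data X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids by picking k distinct data points at random.
    mu = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assignments = None
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # (n, k)
        new_assignments = dists.argmin(axis=1)
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break  # assignments stopped changing, so the centroids are fixed too
        assignments = new_assignments
        # Step 3: recompute each centroid as the mean of the points assigned to it.
        for j in range(k):
            if np.any(assignments == j):
                mu[j] = X[assignments == j].mean(axis=0)
    return mu, assignments
```

Each of the two alternating steps can only decrease the objective $\sum_i \min_j \|x_i - \mu_j\|^2$, so the procedure terminates, though in general only at a local minimum.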
The number of clusters $k$ is in fact a hyper-parameter for this algorithm. How do we initialize the $\mu_i$? We have a few options:
Pick $\mu_1, \mu_2, \dots, \mu_k$ at random.
Initialize using k-means++, as sketched below. (See stronger results by Schulman, Rabani, Swamy, Ostrovsky.)
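For reference, here is a rough sketch of the k-means++ seeding rule, which picks each new centroid with probability proportional to its squared distance from the centroids chosen so far; the function name is a placeholder of mine.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Sketch of k-means++ seeding on data X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]  # first center chosen uniformly at random
    for _ in range(k - 1):
        # Squared distance from each point to its nearest already-chosen center.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        # Sample the next center proportionally to these squared distances.
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers, dtype=float)
```

The resulting centers can then be passed to Lloyd's algorithm in place of the purely random initialization.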
As it turns out, if there exists a good clustering, and we know the number of clusters, this
algorithm is guaranteed to find that clustering.
3 Hierarchical Clustering
Previously, we had a top-down approach, where we chose clusters and then assigned samples to them. Here, we take a bottom-up approach; we form clusters incrementally. We start with clusters of two points, merge the pairs, then the quadruples, etc. This inherently gives us a hierarchy. Let us define one possible distance metric between clusters, called average linkage:
$$d(A, B) = \frac{1}{|A||B|} \sum_{a \in A} \sum_{b \in B} \text{Dist}(a, b)$$
We can also define centroid linkage, where $\mu_A = \frac{1}{|A|} \sum_{a \in A} a$:
$$d(A, B) = \text{Dist}(\mu_A, \mu_B)$$
and max (complete) linkage:
$$d(A, B) = \max\{\text{Dist}(a, b) : a \in A, b \in B\}$$
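A minimal sketch of these three linkage rules, using Euclidean distance for Dist (my own choice for the example); each cluster is a NumPy array of points.

```python
import numpy as np

def pairwise_dists(A, B):
    # Dist(a, b) for every a in A, b in B, as an |A| x |B| matrix.
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def average_linkage(A, B):
    # d(A, B) = (1 / (|A||B|)) * sum of Dist(a, b) over all pairs.
    return pairwise_dists(A, B).mean()

def centroid_linkage(A, B):
    # d(A, B) = Dist(mu_A, mu_B), the distance between the cluster means.
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))

def max_linkage(A, B):
    # d(A, B) = max of Dist(a, b) over all pairs (complete linkage).
    return pairwise_dists(A, B).max()
```

Agglomerative clustering repeatedly merges the two clusters with the smallest linkage distance, whichever linkage rule is used.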
4 Spectral Clustering
View the data as a graph, where the nodes are the data points $x_1, \dots, x_n$ and the edge weights $w_{ij}$ denote the similarity of two data points, $\text{Sim}(x_i, x_j)$. One example similarity function is cosine similarity:
$$\text{Sim}(x_i, x_j) = \frac{x_i^T x_j}{\|x_i\| \, \|x_j\|}$$
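As a sketch, one might build the weight matrix $W$ from cosine similarities as follows; zeroing the diagonal (no self-loops) matches the affinity-matrix convention used later, while everything else here is an illustrative choice of mine.

```python
import numpy as np

def cosine_similarity_graph(X):
    """Weight matrix with w_ij = cosine similarity of x_i and x_j, no self-loops."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # normalize rows to unit length
    W = Xn @ Xn.T                                      # w_ij = x_i^T x_j / (||x_i|| ||x_j||)
    np.fill_diagonal(W, 0.0)
    return W
```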
4.1 Cuts
As it turns out, we can convert clustering into a graph partitioning problem. Let us formalize the problem parameters. Our goal is to find a cut of our graph. Let $V$ be the set of all nodes; then our partitions $V_1, V_2$ must satisfy the following:
$$V_1 \cup V_2 = V, \qquad V_1 \cap V_2 = \emptyset$$
The value of the cut is $\text{Cut}(V_1, V_2) = \sum_{i \in V_1} \sum_{j \in V_2} w_{ij}$. However, there is a trivial solution that minimizes the cut value, namely $V_1 = V$, $V_2 = \emptyset$. So, we introduce a balance requirement to force a balanced cut:
$$\min_{V_1, V_2} \; \text{Cut}(V_1, V_2) \quad \text{subject to } |V_1| = |V_2| = \tfrac{n}{2}$$
(We ignore the odd case for now.) This problem is also NP-hard.
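To make the definition concrete, here is a tiny sketch that evaluates $\text{Cut}(V_1, V_2)$ directly from the weight matrix; the function name and the index-list representation of the partition are my own.

```python
import numpy as np

def cut_value(W, V1, V2):
    """Cut(V1, V2) = sum of w_ij over i in V1, j in V2 (V1, V2 are index lists)."""
    return W[np.ix_(V1, V2)].sum()
```

For the trivial partition $V_1 = V$, $V_2 = \emptyset$ this returns $0$, which is exactly why the balance constraint is needed.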
We are now going to transform a discrete problem into a continuous problem.
4.2 Graph Laplacian
affinity matrix ($W$): entries are $s(i, j)$ if $i, j$ are connected and $0$ otherwise (no self-loops, so the diagonal entries are $0$)
degree matrix ($D$): in the derivation below, $D$ is a diagonal matrix whose entries are the row sums of $W$, i.e. $D_{ii} = \sum_j w_{ij}$
Let us call $\text{Mass}(G_1)$ the number of nodes in $G_1$, i.e. $|V_1|$. We wish to find two or more partitions of similar sizes, where we cut edges with low weight. Our problem can be formally expressed as the following:
$$\min \; \frac{\text{Cut}(G_1, G_2)}{\text{Mass}(G_1)\,\text{Mass}(G_2)}$$
Define the label vector $v$ by
$$v_i = \begin{cases} 1 & i \in V_1 \\ -1 & i \in V_2 \end{cases}$$
Then, summing over edges (each unordered pair counted once),
$$\text{Cut}(V_1, V_2) = \frac{1}{4} \sum_{(i,j) \in E} w_{ij} (v_i - v_j)^2$$
If the weight is high, we want nodes to be closer together, and if the weight is low, nodes
are repelled. As it turns out, we can simplify this expression.
$$\begin{aligned}
\text{Cut}(V_1, V_2) &= \sum_{i \in V_1} \sum_{j \in V_2} w_{ij} \\
&= \frac{1}{4} \sum_{(i,j) \in E} w_{ij} (v_i - v_j)^2 \\
&= \frac{1}{4} \sum_{(i,j) \in E} \left( w_{ij} v_i^2 - 2 w_{ij} v_i v_j + w_{ij} v_j^2 \right) \\
&= \frac{1}{4} \sum_{(i,j) \in E} \left( -2 w_{ij} v_i v_j \right) + \frac{1}{4} \sum_{(i,j) \in E} \left( w_{ij} v_i^2 + w_{ij} v_j^2 \right)
\end{aligned}$$
In the second summation, each edge $(i, j)$ contributes its weight $w_{ij}$ once for vertex $i$ and once for vertex $j$. This is equivalent to summing over all vertices and, for each vertex, adding the weights of all edges incident to it.
$$\begin{aligned}
\text{Cut}(V_1, V_2) &= \frac{1}{4} \sum_{(i,j) \in E} \left( -2 w_{ij} v_i v_j \right) + \frac{1}{4} \sum_{i=1}^{n} v_i^2 \sum_{k=1}^{n} w_{ik} \\
&= \frac{1}{4} v^T (D - W) v \\
&= \frac{1}{4} v^T L v
\end{aligned}$$
where
$$L_{ij} = \begin{cases} -w_{ij} & i \neq j \\ \sum_k w_{ik} & i = j \end{cases}$$
$L$ is known as the Graph Laplacian. This, like the adjacency matrix, uniquely identifies a graph. We know a few properties of this matrix $L$:
$L$ is symmetric.
$L$ is positive semidefinite if $w_{ij} \geq 0$: since all terms in $\sum_{(i,j) \in E} w_{ij} (v_i - v_j)^2$ are squared and non-negative, $v^T L v \geq 0$ for all $v$.
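The following sketch builds $L = D - W$ and numerically checks the identity $\text{Cut}(V_1, V_2) = \frac{1}{4} v^T L v$ on a small random graph; the helper name and the toy example are my own.

```python
import numpy as np

def graph_laplacian(W):
    """L = D - W, where D is the diagonal matrix of row sums of W."""
    return np.diag(W.sum(axis=1)) - W

# Toy check of Cut(V1, V2) = (1/4) v^T L v on a random symmetric weight matrix.
rng = np.random.default_rng(0)
W = rng.random((6, 6))
W = (W + W.T) / 2          # make the weights symmetric
np.fill_diagonal(W, 0.0)   # no self-loops
L = graph_laplacian(W)

v = np.array([1, 1, 1, -1, -1, -1])              # V1 = {0, 1, 2}, V2 = {3, 4, 5}
cut = W[np.ix_([0, 1, 2], [3, 4, 5])].sum()
assert np.isclose(cut, v @ L @ v / 4)
```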
We thus have a new objective:
$$\min_v \; \frac{1}{4} v^T L v$$
such that $v_i \in \{-1, 1\}$ and $\mathbf{1}^T v = 0$. To make this more explicit, note that along the diagonal of $L$ we have $\sum_j w_{ij}$. Since $w_{ii} = 0$, this diagonal entry equals the negative of the sum of all other terms in that row, so every row of $L$ sums to zero. Thus $L\mathbf{1} = 0$: since $\mathbf{1} \neq 0$, $\lambda = 0$ is an eigenvalue of $L$ with eigenvector $\mathbf{1}$.
Note that this minimization problem is exactly the same as the one proposed earlier. Now, we make an approximation that turns the discrete problem into a continuous one: instead of $v_i \in \{-1, 1\}$, we will subject our problem to $\|v\|_2^2 = n$ and $\mathbf{1}^T v = 0$. As it turns out, the solution to this relaxed problem is the eigenvector corresponding to the second-smallest eigenvalue of $L$. If the constraint $\mathbf{1}^T v = 0$ were not added, the solution would be the first eigenvector, $\mathbf{1}$, with eigenvalue $0$.
There are a variety of other cut problems related to the Graph Laplacian: the normalized cut, the maximum cut, etc. All of these are NP-hard.
Now, let us consider the denominator. We need to additionally constrain the sizes of the partitions to be similar. How can we ensure that $|V_1| = |V_2| = \frac{n}{2}$? We want the sum of all entries in $v$ to be $0$, i.e. $\mathbf{1}^T v = 0$. The problem is formally
$$\min_v \; v^T L v$$
subject to $v_i \in \{-1, 1\}$ and $\mathbf{1}^T v = 0$. The constraint $v_i \in \{-1, 1\}$ places $v$ at a corner of a hypercube; in two dimensions, we can relax this to the circle passing through all corners of the square. This is a circle of radius $\sqrt{2} = \sqrt{n}$. Generalizing to $n$ dimensions, we can relax this constraint to $\|v\|_2^2 = n$, i.e. $v^T v = n$. Without any constraints, note that
$$\min_v \; \frac{v^T L v}{v^T v} = \lambda_{\min}(L) = 0$$
Consider now the ellipsoid $\{x : x^T A x = 1\}$. The semi-axis lengths are given by $\frac{1}{\sqrt{\lambda_i}}$, and the principal directions are given by the eigenvectors $v_i$. When $A = L$, we have an eigenvalue $\lambda_1 = 0$, so we have one axis with infinite length. Seen geometrically, this set is a cylinder whose length runs along $v_1$. Since we want $v_1^T v = 0$, we want $v$ to be orthogonal to $v_1$; this is a hyperplane orthogonal to $v_1$. As before, we also want $\|v\|_2^2 = n$; in three-dimensional space this constraint is a sphere. Thus, we are looking for the point in the intersection of the hyperplane with the sphere that minimizes $v^T L v$, and this is precisely the direction of $v_2$, the eigenvector of the second-smallest eigenvalue.
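Putting the pieces together, here is a hedged sketch of spectral bisection: build $L$, take the eigenvector of the second-smallest eigenvalue, and split the nodes on its sign. Rounding the continuous solution back to $\{-1, 1\}$ by taking signs is a common convention rather than something derived in these notes.

```python
import numpy as np

def spectral_bisection(W):
    """Partition nodes by the sign of the second eigenvector of L = D - W."""
    L = np.diag(W.sum(axis=1)) - W
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues returned in ascending order
    v2 = eigvecs[:, 1]                     # eigenvector of the second-smallest eigenvalue
    V1 = np.where(v2 >= 0)[0]
    V2 = np.where(v2 < 0)[0]
    return V1, V2
```

Calling `spectral_bisection(W)` on a weight matrix such as the cosine-similarity graph above recovers the relaxed balanced cut described in this section.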