
Clustering

compiled by Alvin Wan from Professor Benjamin Recht's lecture, Samaneh's discussion

1 Overview
With clustering, we have several key motivations:

- archetypes (factor analysis)
- segmentation
- hierarchy
- faster lookups (quantization)

It's not trivial to choose an objective to minimize. In PCA, the algorithm was fixed, regardless
of the objective. With clustering, different objectives can result in different algorithms.
There is no single preferred way to do clustering, but we will explore several popular methods in
this note. Here are three approaches to consider:

- k-means (quantization)
- agglomeration (hierarchy)
- spectral (segmentation)

2 K-Means Clustering
In k-means clustering, we segment our data by describing each data point with one of $k$ centroids
$\mu_1, \ldots, \mu_k$. In other words, $x_i$ is in cluster $j$ if $x_i$ is closer to $\mu_j$ than to any
other centroid, i.e. $\|x_i - \mu_j\| < \|x_i - \mu_{j'}\|$ for all $j' \neq j$. Given centroids, this is
how we assign points to clusters. The question is now: how do we pick centroids? We have the
following optimization problem:

$$\text{Minimize}_{\mu_1, \mu_2, \ldots, \mu_k} \; \sum_{i=1}^{n} \min_{1 \leq j_i \leq k} \|x_i - \mu_{j_i}\|^2$$

($j_i$ is an index.) This is effectively like an SVM, where we're fitting parameters to some loss
function. As it turns out, minimizing this cost is NP-hard.

2.1 Lloyd's Algorithm

The following approach is called alternating minimization. If we fix the cluster assignments, the
problem becomes easy: with assignments fixed, the objective is a convex function of the means.
Conversely, if we fix the means, we can easily assign points to clusters. In the following algorithm,
we alternately fix the cluster assignments or the means and minimize over the other (a code sketch
follows the steps below).

1. Initialize $\mu_1, \ldots, \mu_k$.

2. Assign each point to its closest centroid: for $i = 1, \ldots, n$, assign $x_i$ to $C_j$ if
   $\|x_i - \mu_j\|^2 \leq \|x_i - \mu_{j'}\|^2$ for all $j' \neq j$.

3. If no assignments changed, return.

4. Set each mean to the mean of its cluster: $\mu_j \leftarrow \frac{1}{|C_j|} \sum_{i \in C_j} x_i$.

5. Go back to step 2.
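To make the alternation concrete, here is a minimal sketch in Python with NumPy. The function name
`lloyd` and the choice to initialize with k random data points are my own; the lecture leaves
initialization open (see the options below).

```python
import numpy as np

def lloyd(X, k, max_iters=100, seed=0):
    """Alternating minimization for k-means: fix assignments, then fix means."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize the centroids (here, k distinct data points chosen at random).
    mu = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assign = np.full(len(X), -1)
    for _ in range(max_iters):
        # Step 2: assign each point to its closest centroid.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        new_assign = dists.argmin(axis=1)
        # Step 3: if no assignments changed, return.
        if np.array_equal(new_assign, assign):
            break
        assign = new_assign
        # Step 4: set each mean to the mean of its cluster.
        for j in range(k):
            if np.any(assign == j):
                mu[j] = X[assign == j].mean(axis=0)
        # Step 5: loop back to the assignment step.
    return mu, assign
```

For example, `lloyd(np.random.randn(100, 2), k=3)` returns the centroids and the cluster index of
each point.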

The number of clusters $k$ is in fact a hyper-parameter for this algorithm. How do we initialize
the $\mu_i$? We have a few options:

- Pick $\mu_1, \mu_2, \ldots, \mu_k$ at random.

- Initialize using k-means++. (See the stronger results by Ostrovsky, Rabani, Schulman, and Swamy.)
  The procedure is as follows (a code sketch appears after this list):

  - Set $\mu_1$ to be a randomly selected $x_i$. (In high-dimensional space, a randomly selected
    point that is not a data point may easily be distant from all of our data.)
  - For $c = 1, \ldots, k-1$: for $i = 1, \ldots, n$, let $d_i = \min_{1 \leq j \leq c} \|x_i - \mu_j\|^2$,
    let $z = \sum_i d_i$, and set $p_i = \frac{d_i}{z}$. Choose $\mu_{c+1} = x_i$ with probability $p_i$.
  - If a point is far from the centroids chosen so far, we have a high probability of picking it;
    an already-chosen point has probability 0 of being picked again.
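Here is a minimal sketch of this k-means++ style initialization in NumPy; the name
`kmeans_pp_init` is my own choice for illustration.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Choose mu_1 as a random data point, then choose each mu_{c+1} = x_i
    with probability p_i = d_i / z, where d_i is the squared distance from
    x_i to its nearest already-chosen centroid."""
    rng = np.random.default_rng(seed)
    n = len(X)
    centroids = [X[rng.integers(n)].astype(float)]
    for _ in range(k - 1):
        C = np.array(centroids)
        # d_i = min_{1 <= j <= c} ||x_i - mu_j||^2
        d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        p = d / d.sum()   # already-chosen points have d_i = 0, hence p_i = 0
        centroids.append(X[rng.choice(n, p=p)].astype(float))
    return np.array(centroids)
```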

As it turns out, if a good clustering exists and we know the number of clusters, Lloyd's algorithm
with this initialization is guaranteed to find that clustering.

3 Hierarchical Clustering

Previously, we had a top-down approach, where we took clusters and then assigned samples.
Here, we take a bottom-up approach: we form clusters incrementally. Take clusters of two,
merge the pairs, then the quadruples, etc. This inherently gives us a hierarchy. Let us define
one possible distance metric, called average linkage:

$$d(A, B) = \frac{1}{|A||B|} \sum_{a \in A} \sum_{b \in B} \text{Dist}(a, b)$$

We can also define centroid linkage, where $\mu_A = \frac{1}{|A|} \sum_{a \in A} a$:

$$d(A, B) = \text{Dist}(\mu_A, \mu_B)$$

We could similarly and arbitrarily apply any valid metric, for example the maximum pairwise
distance:

$$d(A, B) = \max(\text{Dist}(a, b) : a \in A, b \in B)$$

3.1 Greedy Algorithm


1. Initialize with n clusters, $C_i = \{x_i\}$.
2. Repeat:
3. For all pairs of clusters (A, B), compute d(A, B).
4. Merge the closest pair: $C_{\text{new}} = A \cup B$, where (A, B) minimizes d(A, B).

The dendrogram represents our steps to union each set of clusters. A small code sketch of this
greedy procedure follows.
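Here is a deliberately naive sketch of the greedy agglomeration with average linkage (it recomputes
all pairwise linkages every round, so it is roughly cubic); the function name `agglomerate` and its
return format are my own choices.

```python
import numpy as np

def agglomerate(X, target_k=1):
    """Greedy agglomerative clustering: repeatedly merge the pair of clusters
    (A, B) with the smallest average-linkage distance d(A, B)."""
    clusters = [[i] for i in range(len(X))]        # start with n singleton clusters C_i = {x_i}
    merges = []                                    # the sequence of merges defines the dendrogram
    while len(clusters) > target_k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # average linkage: mean of Dist(a, b) over all cross-cluster pairs
                d = np.mean([np.linalg.norm(X[i] - X[j])
                             for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merges.append((list(clusters[a]), list(clusters[b])))
        clusters[a] = clusters[a] + clusters[b]    # C_new = A ∪ B
        clusters.pop(b)
    return clusters, merges
```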

We can also examine a randomized greedy algorithm:

- Choose A uniformly at random.
- $C_{\text{new}} = A \cup B$, where B is the cluster closest to A.

This reduces the runtime from $O(n^3)$ to $O(n^2)$ and often produces more stable results.

4 Spectral Clustering

View the data as a graph, where our nodes are data points $x_1, \ldots, x_n$ and edge weights $w_{ij}$
denote the similarity of two data points, $\text{Sim}(x_i, x_j)$. Here are a few sample similarity
functions:

- cosine similarity: $\frac{x_i^T x_j}{\|x_i\| \|x_j\|}$

- a kernel function $k(x_i, x_j)$

- a distance threshold:

$$w_{ij} = \begin{cases} 1 & \|x_i - x_j\| \leq D_0 \\ 0 & \text{otherwise} \end{cases}$$
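As an illustration, here is one way to build the weight matrix $w_{ij}$ from data using the threshold
similarity above; the name `threshold_affinity` and the cutoff argument `D0` are mine.

```python
import numpy as np

def threshold_affinity(X, D0):
    """w_ij = 1 if ||x_i - x_j|| <= D0 (and i != j), else 0."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    W = (dists <= D0).astype(float)
    np.fill_diagonal(W, 0.0)   # no self-loops
    return W
```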

4.1 Cuts

As it turns out, we can convert clustering into a graph partitioning problem. Let us formalize
the problem parameters. Our goal is to find a cut of our graph. Let V be the set of all nodes;
then our partition $V_1, V_2$ must satisfy the following:

- $V_1 \cup V_2 = V$
- $V_1 \cap V_2 = \emptyset$

The weight of the cut is $\text{Cut}(V_1, V_2) = \sum_{i \in V_1} \sum_{j \in V_2} w_{ij}$. However, there
is a trivial solution that minimizes the cut, which is to take $V_1 = V$, $V_2 = \emptyset$. So, we
introduce a penalty term to make a balanced cut:

$$\text{Minimize } \text{Cut}(V_1, V_2)$$

subject to $|V_1| = |V_2| = \frac{n}{2}$. (We ignore the odd case for now.) This problem is also NP-hard.
We are now going to transform a discrete problem into a continuous problem.

4.2 Graph Laplacian

We have several types of matrices that describe the structure of a graph.

- adjacency matrix (A): $A_{ij} = 1$ if i, j are connected and 0 otherwise

- affinity matrix (W): entries are $s(i, j)$ if i, j are connected and 0 otherwise (no self-loops,
  so diagonal entries are 0)

- degree matrix (D): in the derivation below, D is a diagonal matrix whose diagonal entries are the
  row sums of W, $D_{ii} = \sum_k w_{ik}$

- Laplacian matrix ($L = D - W$): symmetric, positive semidefinite, and always has $\lambda_1 = 0$
  with eigenvector $v_1 = \mathbf{1}$

Let us call $\text{Mass}(G_1)$ the number of nodes in $G_1$, or $|V_1|$. We wish to find two or more
partitions of similar sizes, where we cut edges with low weight. Our problem can be formally
expressed as the following:

$$\text{Minimize } \frac{\text{Cut}(G_1, G_2)}{\text{Mass}(G_1)\,\text{Mass}(G_2)}$$

4.3 Minimizing the Cut

Let us first define the cut indicator.

$$v_i = \begin{cases} 1 & i \in V_1 \\ -1 & i \in V_2 \end{cases}$$

We can then express the cut in terms of this indicator; the term $(v_i - v_j)^2$ tells us whether the
edge (i, j) crosses the cut (it is 4 if it does and 0 otherwise):

$$\text{Cut}(V_1, V_2) = \frac{1}{4} \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} (v_i - v_j)^2$$

If the weight is high, we want the nodes to end up in the same cluster, and if the weight is low,
the nodes are repelled. As it turns out, we can simplify this expression.

$$\begin{aligned}
\text{Cut}(V_1, V_2) &= \sum_{i \in G_1} \sum_{j \in G_2} w_{ij} \\
&= \frac{1}{4} \sum_{(i,j) \in E} w_{ij} (v_i - v_j)^2 \\
&= \frac{1}{4} \sum_{(i,j) \in E} (w_{ij} v_i^2 - 2 w_{ij} v_i v_j + w_{ij} v_j^2) \\
&= -\frac{1}{4} \sum_{(i,j) \in E} 2 w_{ij} v_i v_j + \frac{1}{4} \sum_{(i,j) \in E} (w_{ij} v_i^2 + w_{ij} v_j^2)
\end{aligned}$$

In the second summation, each edge (i, j) contributes its weight $w_{ij}$ once for vertex i and once
for vertex j. Regrouping by vertex, this is equivalent to summing over all vertices and, for each
vertex, adding the weights of all of its incident edges:

$$\begin{aligned}
\text{Cut}(V_1, V_2) &= -\frac{1}{4} \sum_{(i,j) \in E} 2 w_{ij} v_i v_j + \frac{1}{4} \sum_{i=1}^{n} v_i^2 \sum_{k=1}^{n} w_{ik} \\
&= \frac{1}{4} v^T (D - W) v \\
&= \frac{1}{4} v^T L v
\end{aligned}$$

where

$$L_{ij} = \begin{cases} -w_{ij} & i \neq j \\ \sum_k w_{ik} & i = j \end{cases}$$

L is known as the Graph Laplacian. This, like the adjacency matrix, can uniquely identify a graph.
We know a few properties of this matrix L:

- L is symmetric.

- L is positive semidefinite if $w_{ij} \geq 0$: since all terms in
  $\frac{1}{4}\sum_{(i,j) \in E} w_{ij}(v_i - v_j)^2$ are non-negative, $v^T L v \geq 0$ for all v.

- $L\mathbf{1} = 0$, where $\mathbf{1}$ is the vector of 1s.
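A quick numerical sanity check of these properties, and of the identity
$\text{Cut}(V_1, V_2) = \frac{1}{4} v^T L v$, on a small hand-made affinity matrix (the example graph is
mine, chosen only for illustration):

```python
import numpy as np

W = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
D = np.diag(W.sum(axis=1))          # degree matrix: D_ii = sum_k w_ik
L = D - W                           # Graph Laplacian

assert np.allclose(L, L.T)                        # L is symmetric
assert np.allclose(L @ np.ones(4), 0)             # L1 = 0
assert np.all(np.linalg.eigvalsh(L) >= -1e-10)    # L is positive semidefinite

v = np.array([1., 1., -1., -1.])    # cut indicator: V1 = {0, 1}, V2 = {2, 3}
cut = sum(W[i, j] for i in range(4) for j in range(4) if v[i] == 1 and v[j] == -1)
assert np.isclose(cut, v @ L @ v / 4)             # Cut(V1, V2) = (1/4) v^T L v
```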

We thus have a new objective.

$$\text{Minimize } \frac{1}{4} v^T L v$$

such that $v_i \in \{-1, 1\}$ and $\mathbf{1}^T v = 0$. To make this more explicit, note that along the
diagonal of L we have $\sum_j w_{ij}$. Since $w_{ii} = 0$, this diagonal entry equals the negated sum of
all other terms in that row. Thus $L\mathbf{1} = 0$; since $\mathbf{1} \neq 0$, $\lambda = 0$ is an
eigenvalue of L with eigenvector $\mathbf{1}$.

We claim that, for a connected graph, only one such zero eigenvalue exists. Note that if
$v = \mathbf{1}$, then $Lv = 0$ and $\text{Cut}(G_1, G_2) = \frac{1}{4} v^T L v = 0$: all nodes are in
$G_1$ (or all in $G_2$), so nothing is cut.

Proof sketch: assume for contradiction that there is another eigenvector $v_2$, not proportional to
$\mathbf{1}$, with eigenvalue 0, so $Lv_2 = 0$. Then
$v_2^T L v_2 = \sum_{(i,j) \in E} w_{ij} ((v_2)_i - (v_2)_j)^2 = 0$, so every edge with $w_{ij} > 0$
forces $(v_2)_i = (v_2)_j$. If the graph is connected, this forces all entries of $v_2$ to be equal,
i.e. $v_2 \propto \mathbf{1}$. Contradiction.

Note that this minimization problem is exactly the same (discrete) problem as the one proposed
earlier. We now make an approximation and allow continuous values: instead of requiring
$v_i \in \{-1, 1\}$, we subject our problem to $\|v\|^2 = n$ and $\mathbf{1}^T v = 0$. As it turns out,
the solution to this relaxed problem is the eigenvector corresponding to the second-smallest
eigenvalue of L. If the constraint $\mathbf{1}^T v = 0$ were not added, the solution would be the
eigenvector for the smallest eigenvalue.

There are a variety of other cut problems related to the Graph Laplacian (the normalized cut, the
maximum cut, etc.); all of these are NP-hard.

4.4 Minimizing the Masses

Now, let us consider the denominator. We need to additionally constrain the sizes of the
partitions to be similar. How can we ensure that $|V_1| = |V_2| = \frac{n}{2}$? We want the sum of all
entries in v to be 0, i.e. $\mathbf{1}^T v = 0$. The problem is formally

$$\text{Minimize } v^T L v$$

subject to the constraint that, for all i, $v_i = 1$ or $v_i = -1$. Consider a two-dimensional
representation, with only two coordinates. Plotting all combinations of $\{-1, 1\}$, the feasible
points are the corners of a square. We can loosen this constraint so that the point lies anywhere
on the circle that passes through all corners of the square; this is a circle of radius
$\sqrt{2} = \sqrt{n}$ (here $n = 2$). Generalizing to n dimensions, we relax the constraint to
$\|v\|_2^2 = n$, keeping $\mathbf{1}^T v = 0$. Without any constraints, note that

$$\text{Minimize } \frac{v^T L v}{v^T v} = \lambda_{\min}(L) = 0$$

The unconstrained minimizer is the eigenvector $v_1 = \mathbf{1}$ for $\lambda_1 = 0$, but $v_1$
violates $\mathbf{1}^T v = 0$, so it is not a feasible solution. The eigenvector $v_2$ for the
second-smallest eigenvalue is orthogonal to $v_1$, so $v_2$ satisfies the constraint. Our solution is
thus the eigenvector corresponding to the second-smallest eigenvalue.

Consider now the ellipsoid $\{x : x^T A x = 1\}$. The semi-axis lengths are given by
$\frac{1}{\sqrt{\lambda_i}}$, and the principal directions are given by the eigenvectors $v_i$. When
$A = L$, we have an eigenvalue $\lambda_1 = 0$, so one axis has infinite length. Seen geometrically,
this is a cylinder whose axis runs along $v_1$. Since we want $v_1^T v = 0$, we want v to be
orthogonal to $v_1$; this is a hyperplane orthogonal to $v_1$. As before, we also want
$\|v\|_2^2 = n$, which (in three dimensions) is a sphere. Thus, we are looking for the intersection of
the hyperplane with the sphere, and the minimizer over that intersection lies along $v_2$.
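Putting the pieces together, here is a sketch of the relaxed two-way cut: compute the eigenvector of
L for the second-smallest eigenvalue (often called the Fiedler vector) and split the nodes on its
sign. The function name and the sign-thresholding rule are my own simplifications; the lecture only
derives the relaxed eigenvector solution.

```python
import numpy as np

def spectral_bipartition(W):
    """Relaxed balanced cut: split on the sign of the second eigenvector of L = D - W."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues returned in ascending order
    v2 = eigvecs[:, 1]                     # eigenvector for the second-smallest eigenvalue
    return v2 >= 0                         # True -> V1, False -> V2
```

Given an affinity matrix such as `threshold_affinity(X, D0)` from earlier, `spectral_bipartition(W)`
returns a boolean assignment of each node to one of the two sides of the cut.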
