
Machine Learning by ambedkar@IISc

- Unsupervised Learning
- Dimensionality Reduction
- K-means Clustering
Agenda

- What is Unsupervised Learning
- Principal Component Analysis and Dimensionality Reduction
- Clustering
What is Unsupervised Learning

Unsupervised Learning

- Input: A set of unlabeled examples, D = {x_n}_{n=1}^N
- Objective: Find patterns in the observed data
- Challenge: Since there are no ground-truth labels, it is very difficult to evaluate the algorithms
Unsupervised Learning

- Examples:
  - Clustering: grouping observed data into unlabeled clusters
    - identifying social circles, summarizing observed data, etc.
  - Dimensionality Reduction: finding a low-dimensional representation of the data
    - visualization, compression, structure analysis, etc.
  - Anomaly Detection: spotting outliers in the data
    - detecting fraudulent transactions, data cleaning, etc. [1]
  - Density Estimation: finding the underlying probability distribution from which D has been sampled

[1] The discovery of the Higgs boson relied on one such algorithm.
Principal Component Analysis and Dimensionality Reduction

Dimensionality Reduction

- Input: A dataset D = {x_n}_{n=1}^N where each x_n ∈ R^d
- Objective: Find a low-dimensional representation x̃_n ∈ R^k of each point, where k < d
- In other words: Find a k-dimensional coordinate system and represent all the points in this coordinate system
  - Need to find orthonormal vectors v_1, v_2, . . . , v_k which form the basis of the new coordinate system
  - Need a way to represent the original points in this new coordinate system
- Main Question: How to choose the low-dimensional space and how to embed the points in it?
Dimensionality Reduction - Applications

- Visualization: Find a 2- or 3-dimensional representation of the data such that the essence of the data is not lost
  - Visualizing financial profiles of individuals in two dimensions to identify patterns
- Compression: Embed the points in a lower-dimensional space such that various topological properties are preserved, to optimize storage
  - Minimizing the number of colours needed to represent an image; efficient encoding schemes can then be used for compression
- Feature Selection: Remove redundant or less informative features
  - Identifying and eliminating functionally related or highly correlated features, like density, mass and volume
Dimensionality Reduction - Toy Example

- Given 7 points in two dimensions, we need 14 numbers to store the x and y coordinates of all points
- Idea 1: Discard the y coordinate of all points (green points). Only 7 numbers are needed now, but a lot of information is lost
- Idea 2: Discard the x coordinate of all points (orange points). Only 7 numbers are needed now. Better than the green points
- Idea 3: Save the slope of the pink line and the x (or y) coordinate of each point. We need to store 8 numbers, and no information is lost (see the sketch below)
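The encoding in Idea 3 can be sketched in a few lines of NumPy. The points and slope below are hypothetical stand-ins for the figure's pink line (assumed to pass through the origin, so that the slope plus the x coordinates suffice); they are not taken from the slides.

```python
import numpy as np

# Hypothetical 7 points lying exactly on a line through the origin
# (y = 0.5 * x); these are illustrative, not the points from the figure.
slope = 0.5
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
points = np.stack([x, slope * x], axis=1)      # 7 x 2 array: 14 numbers

# Idea 3: store only the slope plus each point's x coordinate: 8 numbers.
stored_slope, stored_x = slope, x

# Reconstruction is exact because every point lies on the line.
reconstructed = np.stack([stored_x, stored_slope * stored_x], axis=1)
assert np.allclose(points, reconstructed)      # no information lost
```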
Dimensionality Reduction - Toy Example

[Figure: the 7 toy points with the projections described above]
Dimensionality Reduction - Toy Example - Findings

- Simply discarding coordinates is not a good idea
  - Not all ways of dimensionality reduction are equally good
  - Need to quantify the amount of information lost while performing dimensionality reduction
  - Real data is not as neat as the toy example; we need a way to deal with noise
- Revised Objective: Find a k-dimensional subspace of R^d and linearly project the data onto this subspace while minimizing the "loss of information"
  - Non-linear dimensionality reduction methods exist but are beyond the current scope
  - We will consider Principal Component Analysis (PCA)
Dimensionality Reduction - Principal Component Analysis

- Let u ∈ R^d be a direction along which we want to project the data
- Thus, x̃_n = (x_n^T u) u. Note that one only needs to store x_n^T u for each n
- PCA uses the variance of the projected data as a measure of information
  - Information content is assumed to be proportional to the variance of the projected data
- We need to retain maximum information; thus, we need to find u such that the variance of the projected data is maximized:

  u* = arg max_{u : ||u|| = 1} Var({x̃_n}_{n=1}^N)
Dimensionality Reduction - PCA (contd...)

  Var({x̃_n}_{n=1}^N) = (1/N) Σ_{n=1}^N (x̃_n² − E[x̃_n]²)

- Assume WLOG that E[x_n] = 0, thus E[x̃_n] = E[x_n]^T u = 0
- Also, x̃_n² = u^T x_n x_n^T u, thus we get:

  Σ_{n=1}^N (x̃_n² − E[x̃_n]²) = Σ_{n=1}^N u^T x_n x_n^T u

- Note that Σ_{n=1}^N x_n x_n^T is the covariance matrix X of the observed data since E[x_n] = 0. Thus:

  Var({x̃_n}_{n=1}^N) = (1/N) u^T X u
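The identity above is easy to check numerically. The following is a minimal sketch using synthetic zero-mean data and a random unit direction; all names and numbers are illustrative, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic zero-mean data: N points in d dimensions.
N, d = 1000, 5
data = rng.standard_normal((N, d))
data -= data.mean(axis=0)                      # enforce E[x_n] = 0

# A random unit-norm direction u.
u = rng.standard_normal(d)
u /= np.linalg.norm(u)

# Variance of the scalar projections x_n^T u, computed directly ...
proj = data @ u
var_direct = np.mean(proj ** 2)                # projections have zero mean

# ... and via (1/N) u^T X u with X = sum_n x_n x_n^T.
X = data.T @ data
var_quadratic = (u @ X @ u) / N

assert np.isclose(var_direct, var_quadratic)
```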
Dimensionality Reduction - PCA (contd...)

- The constant 1/N can be dropped for the purpose of optimization. Hence the optimization problem becomes:

  u* = arg max_{u : ||u|| = 1} u^T X u

- This is a constrained optimization problem; the Lagrangian is given by:

  L(u, µ) = u^T X u + µ(u^T u − 1)
  ∇_u L = 0 ⇒ 2Xu + 2µu = 0 ⇒ Xu = −µu

- Thus, the optimal u must be an eigenvector of X. Since we want to maximize u^T X u, u must be the eigenvector corresponding to the largest eigenvalue. Hence:

  u* = eigenvector of X corresponding to the largest eigenvalue
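The eigenvector characterization can also be illustrated numerically: among unit vectors, none attains a larger value of u^T X u than the eigenvector with the largest eigenvalue. A small sketch with synthetic data and illustrative names:

```python
import numpy as np

rng = np.random.default_rng(1)

# Build X = sum_n x_n x_n^T from synthetic zero-mean data (symmetric PSD).
data = rng.standard_normal((500, 4))
data -= data.mean(axis=0)
X = data.T @ data

# Eigenvector corresponding to the largest eigenvalue (eigh sorts ascending).
eigvals, eigvecs = np.linalg.eigh(X)
u_star = eigvecs[:, -1]
best = u_star @ X @ u_star                     # equals the largest eigenvalue

# No random unit vector should exceed this value of u^T X u.
for _ in range(10000):
    u = rng.standard_normal(4)
    u /= np.linalg.norm(u)
    assert u @ X @ u <= best + 1e-9
```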
Dimensionality Reduction - PCA (contd...)

- Usually k > 1, thus we want to find u_1, u_2, . . . , u_k and not just u*
- Setting u_1 = u*, one can find u_2 as follows:

  u_2 = arg max_{u : ||u|| = 1, u^T u_1 = 0} u^T X u

- The constraint u^T u_1 = 0 is needed to avoid correlations in the projected data
- One can show that u_2 is the eigenvector of X corresponding to the second largest eigenvalue
- Similarly, u_1, u_2, . . . , u_k are the eigenvectors of X corresponding to the k largest eigenvalues. Also:

  x̃_n = U^T x_n

  where U ∈ R^{d×k} is a matrix containing u_1, u_2, . . . , u_k in its columns
Dimensionality Reduction - PCA (contd...)

Algorithm 1: Principal Component Analysis

  Input: Dataset D = {x_n}_{n=1}^N and number of dimensions k
  Output: Low-dimensional vectors D̃ = {x̃_n}_{n=1}^N
  Normalize the data so that it has zero mean
  Compute X = Σ_{n=1}^N x_n x_n^T
  Find U ∈ R^{d×k} containing the top k eigenvectors of X as columns
  Compute x̃_n ∈ R^k such that x̃_n = U^T x_n, for all n = 1, . . . , N
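A direct NumPy rendering of Algorithm 1 might look like the sketch below. The function name and the use of np.linalg.eigh (valid here because X is symmetric) are my choices, not prescribed by the slides.

```python
import numpy as np

def pca(data, k):
    """Project rows of `data` (N x d) onto the top-k principal directions.

    Follows Algorithm 1: center the data, form X = sum_n x_n x_n^T,
    take the top-k eigenvectors of X, and return U^T x_n for every n.
    """
    centered = data - data.mean(axis=0)            # zero-mean normalization
    X = centered.T @ centered                      # d x d matrix sum_n x_n x_n^T
    eigvals, eigvecs = np.linalg.eigh(X)           # eigenvalues in ascending order
    U = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # top-k eigenvectors as columns
    return centered @ U                            # N x k matrix of x̃_n = U^T x_n

# Example: reduce 10-dimensional synthetic data to 2 dimensions.
rng = np.random.default_rng(2)
data = rng.standard_normal((100, 10))
reduced = pca(data, k=2)
print(reduced.shape)                               # (100, 2)
```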
PCA Example

[Figure: PCA applied to an example dataset]
Clustering

Clustering

- Input: Data points D = {x_n}_{n=1}^N, a similarity/distance function d(·, ·) defined on elements of D, and the number of clusters K
- Objective: Partition the given N points into K subsets C_1, C_2, . . . , C_K such that:
  - C_k ⊆ D, C_k ≠ ∅ for all k = 1, . . . , K
  - C_i ∩ C_j = ∅ for all i, j = 1, . . . , K, i ≠ j
  - ∪_{k=1}^K C_k = D
  - Points in the same cluster are more similar than points across clusters (w.r.t. d(·, ·))
- Variants that allow fractional membership of points to clusters or overlapping clusters exist, but we will assume that each point belongs to exactly one cluster
Clustering (contd...)

|                  | x^(i)                                             | d(·, ·)            | Clusters                 |
|------------------|---------------------------------------------------|--------------------|--------------------------|
| Eye Gaze Tracker | (x, y) coordinate on screen where user is looking | Euclidean distance | Hot-spots on screen      |
| Social Media     | A binary vector indicating friends of person i    | 1/#common friends  | Friendship groups        |
| Documents        | Bag of words representation of document i         | 1/#common words    | Topics                   |
| Biology          | Genes                                             | Task dependent     | Gene expression patterns |

Table 1: Some examples related to clustering
Clustering - Approaches

- Agglomerative (bottom-up) vs Divisive (top-down)
- Monothetic (considers features sequentially) vs Polythetic (considers features all at once)
- Hard (single cluster membership) vs Fuzzy (mixed memberships allowed)
- Hierarchical (creates hierarchy) vs Partitional (disjoint, unordered clusters)

Any clustering algorithm can be classified based on this scheme.

Example: We will see that k-Means is a polythetic, hard and partitional clustering algorithm.
Clustering - Toy Example

[Figure: panels showing Data, Hard Clustering, Fuzzy Clustering and Hierarchical Clustering on a toy dataset]

Clustering is an exploratory data analysis problem. There is no single "correct" solution.
Clustering - Popular Algorithms

- k-Means and k-Medoids
- Spectral clustering
- Expectation Maximization for Gaussian Mixture Models
- Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
- etc.
Clustering - k-Means

- Let C denote the set of all possible cluster assignments for the given dataset D
- c ∈ C is such that c ∈ {1, . . . , k}^m, where c_i = j iff x^(i) ∈ C_j. Recall that:
  - k is the number of clusters
  - m is the number of data points
  - C_j is the j-th cluster
- Ideally one would like to solve the following problem (made concrete in the sketch below):

  c* = arg min_{c ∈ C} Σ_{j=1}^k Σ_{i1, i2 = 1}^m 1{c_{i1} = j, c_{i2} = j} ||x^(i1) − x^(i2)||²

  i.e. minimize the distances between points in the same cluster
- This optimization is NP-hard, so k-Means clustering solves a relaxed version of this problem
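To make the objective concrete, here is a small sketch that evaluates this within-cluster sum of pairwise squared distances for a given assignment vector; the helper name and the toy points are hypothetical.

```python
import numpy as np

def ideal_objective(points, c, k):
    """Sum over clusters j of the squared distances between every pair of
    points assigned to cluster j, i.e. the objective that k-Means relaxes."""
    total = 0.0
    for j in range(k):
        members = points[c == j]                   # all points assigned to cluster j
        diffs = members[:, None, :] - members[None, :, :]
        total += np.sum(diffs ** 2)                # counts each ordered pair, as in the formula
    return total

# Tiny example: 4 points, 2 clusters.
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
good = np.array([0, 0, 1, 1])
bad = np.array([0, 1, 0, 1])
print(ideal_objective(points, good, k=2))          # small: tight clusters
print(ideal_objective(points, bad, k=2))           # much larger
```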
Clustering - k-Means (contd...)

Algorithm 2: k-Means Clustering Algorithm

  Input: Dataset D = {x^(i)}_{i=1}^m and number of clusters k
  Output: Cluster assignment vector c ∈ {1, . . . , k}^m, cluster centers µ_1, . . . , µ_k
  Initialize µ_1, . . . , µ_k by randomly choosing k distinct points from D
  repeat
    Set c_i = arg min_j ||x^(i) − µ_j||² for i = 1, 2, . . . , m
    Set µ_j = (1 / |{i : c_i = j}|) Σ_{i=1}^m 1{c_i = j} x^(i) for all j = 1, 2, . . . , k
  until convergence

Breaks the optimization problem into two parts:

- Optimization over memberships c keeping µ_1, . . . , µ_k fixed
- Optimization over cluster centers µ_1, . . . , µ_k keeping c fixed
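A compact NumPy sketch of Algorithm 2 could look as follows. Names are my own, and convergence is declared when the assignments stop changing, which is one common reading of "until convergence".

```python
import numpy as np

def k_means(points, k, rng=None, max_iter=100):
    """Alternate between assigning each point to its nearest center and
    recomputing each center as the mean of its assigned points."""
    rng = np.random.default_rng() if rng is None else rng
    m = points.shape[0]
    # Initialize centers with k distinct points chosen at random from the data.
    centers = points[rng.choice(m, size=k, replace=False)].copy()
    c = np.full(m, -1)
    for _ in range(max_iter):
        # Membership step: c_i = argmin_j ||x^(i) - mu_j||^2.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_c = dists.argmin(axis=1)
        if np.array_equal(new_c, c):               # converged: assignments unchanged
            break
        c = new_c
        # Center step: mu_j = mean of points currently assigned to cluster j.
        for j in range(k):
            if np.any(c == j):                     # keep old center if a cluster empties
                centers[j] = points[c == j].mean(axis=0)
    return c, centers

# Usage on a tiny synthetic dataset with two well-separated blobs.
rng = np.random.default_rng(3)
points = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
c, centers = k_means(points, k=2, rng=rng)
```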
Clustering - k-Means (contd...)

[Figure: k-Means on a toy dataset]
Clustering - k-Means (contd...)

Limitations of k-Means:

- Not suitable for non-spherical clusters because of the use of Euclidean distance
  - Transform the data appropriately before performing k-Means (as we will see later for spectral clustering) or use kernel k-Means
- Not robust to outliers because of the use of the arithmetic mean
  - Remove outliers before clustering
- Susceptible to sub-optimal solutions
  - Run the algorithm multiple times with random initializations (see the sketch below)
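A rough sketch of the multiple-restart strategy, reusing the k_means function from the sketch after Algorithm 2 (so it is not self-contained on its own) and keeping the run with the lowest within-cluster cost:

```python
import numpy as np

def k_means_restarts(points, k, n_restarts=10, seed=0):
    """Run the k_means sketch above several times with different random
    initializations and keep the clustering with the lowest cost."""
    best_cost, best = np.inf, None
    for r in range(n_restarts):
        rng = np.random.default_rng(seed + r)
        c, centers = k_means(points, k, rng=rng)
        # Cost: sum of squared distances of points to their assigned centers.
        cost = np.sum((points - centers[c]) ** 2)
        if cost < best_cost:
            best_cost, best = cost, (c, centers)
    return best

# Libraries such as scikit-learn do this automatically, e.g. via the
# n_init argument of sklearn.cluster.KMeans.
```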
