
Data Clustering (Contd)

CS771: Introduction to Machine Learning


Piyush Rai
Plan
 K-means extensions
   Soft clustering
   Kernel K-means
 A few other popular clustering algorithms
   Hierarchical Clustering
     Agglomerative Clustering
     Divisive Clustering
   Graph Clustering
     Spectral Clustering
   Density-based clustering
     DBSCAN
 Basic idea of probabilistic clustering methods, such as Gaussian mixture models (details when we talk about latent variable models)
K-means: Hard vs Soft Clustering
 K-means makes hard assignments of points to clusters
 Hard assignment: A point either completely belongs to a cluster or doesn’t belong at all
 When clusters overlap, soft assignment is preferable: each point $\boldsymbol{x}_n$ gets a probability $\gamma_{nk}$ of being assigned to cluster $k$, with $\sum_{k=1}^{K} \gamma_{nk} = 1$
 A heuristic to get soft assignments: transform the distances of a point from the cluster means into probabilities that sum to one over the $K$ clusters (see the sketch below)
 A more principled extension of K-means for doing soft clustering is via probabilistic mixture models such as the Gaussian Mixture Model
4
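A minimal sketch of such a distance-to-probability heuristic (not from the slides; the softmax-of-negative-squared-distances form and the sharpness parameter beta are illustrative assumptions):

import numpy as np

def soft_assignments(X, means, beta=1.0):
    """Heuristic soft K-means assignments: turn squared Euclidean distances
    to each cluster mean into probabilities via a softmax over clusters.
    X: (N, D) data, means: (K, D) cluster means, beta: sharpness."""
    # Squared distances of every point to every cluster mean, shape (N, K)
    sq_dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    # Softmax over clusters (subtract the row max for numerical stability)
    logits = -beta * sq_dists
    logits -= logits.max(axis=1, keepdims=True)
    gamma = np.exp(logits)
    gamma /= gamma.sum(axis=1, keepdims=True)   # each row sums to 1
    return gamma  # gamma[n, k] = soft membership of point n in cluster k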
K-means: Decision Boundaries and Cluster Sizes/Shapes
 K-means assumes that the decision boundary between any two clusters is linear
 Reason: the use of Euclidean distances (a short derivation is given below)
 The K-means loss function also implicitly assumes equal-sized, spherical clusters
 May do badly if the clusters are not roughly equi-sized and convex-shaped
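To see why Euclidean distances give linear boundaries, here is a short standard derivation (not from the slides): for any point $\boldsymbol{x}$ and two cluster means $\boldsymbol{\mu}_j$ and $\boldsymbol{\mu}_k$,

$\|\boldsymbol{x} - \boldsymbol{\mu}_j\|^2 \le \|\boldsymbol{x} - \boldsymbol{\mu}_k\|^2 \iff 2(\boldsymbol{\mu}_k - \boldsymbol{\mu}_j)^\top \boldsymbol{x} \le \|\boldsymbol{\mu}_k\|^2 - \|\boldsymbol{\mu}_j\|^2$

which is a halfspace condition in $\boldsymbol{x}$, so the boundary between the two clusters is the hyperplane where equality holds.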
Kernel K-means
 Helps learn non-spherical clusters and nonlinear cluster boundaries
 Basic idea: replace the Euclidean distances in K-means by their kernelized versions (a code sketch follows below)
 Kernelized distance between input $\boldsymbol{x}_n$ and the mean of cluster $k$: $\|\phi(\boldsymbol{x}_n) - \phi(\boldsymbol{\mu}_k)\|^2$
 Here $k(\cdot,\cdot)$ denotes the kernel function and $\phi$ is its (implicit) feature map
 Note: $\phi(\boldsymbol{\mu}_k)$ denotes the mean of the mappings of the data points assigned to cluster $k$, i.e., $\phi(\boldsymbol{\mu}_k) = \frac{1}{|\mathcal{C}_k|} \sum_{n: z_n = k} \phi(\boldsymbol{x}_n)$, which is not the same as the mapping of the mean of the data points assigned to cluster $k$
 Can also use landmarks or the kernel random features idea to get new features and run standard K-means on those
 Note: Apart from kernels, it is also possible to use other distance functions in K-means. Bregman Divergence* is one such family of distances (Euclidean and Mahalanobis are special cases)
*Clustering with Bregman Divergences (Banerjee et al, 2005)
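A rough sketch of kernel K-means under these definitions (an illustrative assumption, not the slides' exact algorithm; the RBF kernel and the function names are my choices). It uses the standard expansion $\|\phi(\boldsymbol{x}_n) - \phi(\boldsymbol{\mu}_k)\|^2 = k(\boldsymbol{x}_n,\boldsymbol{x}_n) - \frac{2}{|\mathcal{C}_k|}\sum_{m \in \mathcal{C}_k} k(\boldsymbol{x}_n,\boldsymbol{x}_m) + \frac{1}{|\mathcal{C}_k|^2}\sum_{m,m' \in \mathcal{C}_k} k(\boldsymbol{x}_m,\boldsymbol{x}_{m'})$, so distances need only kernel evaluations:

import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # k(x, y) = exp(-gamma * ||x - y||^2)
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq)

def kernel_kmeans(X, K, n_iters=50, gamma=1.0, seed=0):
    """Assignments updated using distances computed entirely via the kernel
    matrix (the kernel trick); cluster means in feature space are implicit."""
    N = X.shape[0]
    Kmat = rbf_kernel(X, X, gamma)                 # (N, N) Gram matrix
    rng = np.random.default_rng(seed)
    z = rng.integers(0, K, size=N)                 # random initial assignments
    for _ in range(n_iters):
        dists = np.zeros((N, K))
        for k in range(K):
            idx = np.where(z == k)[0]
            if len(idx) == 0:
                dists[:, k] = np.inf               # empty cluster: never chosen
                continue
            # ||phi(x_n) - phi(mu_k)||^2 via the kernel expansion above
            dists[:, k] = (np.diag(Kmat)
                           - 2.0 * Kmat[:, idx].mean(axis=1)
                           + Kmat[np.ix_(idx, idx)].mean())
        z_new = dists.argmin(axis=1)
        if np.array_equal(z_new, z):
            break
        z = z_new
    return z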
Hierarchical Clustering
 A notion of similarity between two clusters (or two sets of points) is needed in HC algos (e.g., this can be the average pairwise similarity between the inputs in the two clusters)
 Can be done in two ways: Agglomerative or Divisive
 Agglomerative: Start with each point being in a singleton cluster
   At each step, greedily merge the two most "similar" sub-clusters
   Stop when there is a single cluster containing all the points
   Learns a dendrogram-like structure with the inputs at the leaf nodes; can then choose how many clusters we want
 Divisive: Start with all points being in a single cluster
   At each step, break a cluster into (at least) two smaller homogeneous sub-clusters
   Keep recursing until the desired number of clusters is found
   Tricky because there are no labels (unlike Decision Trees)
 Agglomerative is more popular and simpler than divisive (the latter usually needs complicated heuristics to decide cluster splitting); a small sketch of it follows below
 Neither uses any loss function
7
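A naive sketch of agglomerative clustering with average linkage (the linkage choice and the simple O(N^3)-style implementation are illustrative assumptions, not the slides' prescription):

import numpy as np

def agglomerative(X, num_clusters):
    """Start with singleton clusters and repeatedly merge the two closest
    (on average) clusters until num_clusters remain."""
    clusters = [[i] for i in range(len(X))]        # each point is its own cluster
    # Pairwise squared Euclidean distances between points
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    while len(clusters) > num_clusters:
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between the two clusters (average linkage)
                d = D[np.ix_(clusters[a], clusters[b])].mean()
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] = clusters[a] + clusters[b]    # merge the two closest clusters
        del clusters[b]
    return clusters                                # list of lists of point indices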
Graph Clustering
 Often the data is given in the form of a graph, not feature vectors
 Usually in the form of a pairwise similarity matrix $\boldsymbol{A}$ of size $N \times N$
 $A_{ij}$ is assumed to be the similarity between the two nodes/inputs with indices $i$ and $j$
 Examples: social networks and various interaction networks
 Goal is to cluster the nodes/inputs into $K$ clusters (flat partitioning)
 One scheme is to somehow get an embedding of the graph nodes (various graph embedding algorithms exist, e.g., node2vec) to obtain a feature vector for each node, and then run K-means or kernel K-means or any other clustering algo
 Another way is to perform direct graph clustering
 Spectral clustering is one such popular graph clustering algorithm

Spectral Clustering
 Spectral clustering has a beautiful theory behind it (we won't get into it in this course; you may refer to the very nice tutorial article listed below, if interested)
 We are given the node-node similarity matrix $\boldsymbol{A}$ of size $N \times N$
 Compute the graph Laplacian $\boldsymbol{L} = \boldsymbol{D} - \boldsymbol{A}$
   $\boldsymbol{D}$ is a diagonal matrix s.t. $D_{ii} = \sum_{j} A_{ij}$ (sum of similarities of node $i$ with all other nodes)
   Note: Often, we work with a normalized graph Laplacian, e.g., $\boldsymbol{L} = \boldsymbol{D}^{-1/2}(\boldsymbol{D} - \boldsymbol{A})\boldsymbol{D}^{-1/2}$
 Given the graph Laplacian, solve the spectral decomposition problem: stack the $K$ eigenvectors of $\boldsymbol{L}$ with the smallest eigenvalues as the columns of a matrix $\boldsymbol{U} \in \mathbb{R}^{N \times K}$, s.t. $\boldsymbol{U}^\top \boldsymbol{U} = \boldsymbol{I}$ (meaning $\boldsymbol{U}$ has orthonormal columns)
 Now run K-means on $\boldsymbol{U}$ as the feature matrix of the nodes (row $n$ of $\boldsymbol{U}$ is the feature vector of node $n$); see the sketch below
 Note: Spectral clustering* is also closely related to kernel K-means (but is more general since $\boldsymbol{A}$ can represent any graph) and to "normalized cuts" for graphs
*Kernel k-means, Spectral Clustering and Normalized Cuts (Dhillon et al, 2004); A Tutorial on Spectral Clustering (Ulrike von Luxburg, 2007)
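A minimal sketch of this recipe using the unnormalized Laplacian (the unnormalized-vs-normalized choice and the simple K-means helper are illustrative assumptions):

import numpy as np

def spectral_clustering(A, K, n_iters=100, seed=0):
    """Build the graph Laplacian from the similarity matrix A (N x N), take the
    K eigenvectors with the smallest eigenvalues as node features, then run
    K-means on those features."""
    D = np.diag(A.sum(axis=1))                 # degree matrix, D_ii = sum_j A_ij
    L = D - A                                  # (unnormalized) graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)       # eigh: L is symmetric, eigenvalues ascending
    U = eigvecs[:, :K]                         # N x K, orthonormal columns
    return simple_kmeans(U, K, n_iters, seed)  # cluster the rows of U

def simple_kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        d = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        z = d.argmin(axis=1)
        for k in range(K):
            if np.any(z == k):
                means[k] = X[z == k].mean(axis=0)
    return z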
Density-based Clustering - DBSCAN
 DBSCAN: Density-Based Spatial Clustering of Applications with Noise
 Uses the notion of density of points around a point (not in the sense of probability density)
 DBSCAN treats densely connected points as a cluster, regardless of the shape of the cluster; points left unclustered are most likely outliers
 Has some very nice properties
   Does not require specifying the number of clusters
   Can learn arbitrary-shaped clusters (since it only considers the density of points)
   Robust against outliers (leaves them unclustered!), unlike other clustering algos like K-means
 Note: The accuracy of DBSCAN depends crucially on the $\epsilon$ and minPoints hyperparameters
 The basic idea in DBSCAN is as follows
   Want the points within a cluster to be densely packed: each point should be within $\epsilon$ distance of other points of the cluster
   Want at least minPoints points within $\epsilon$ distance of a point (such a point is called a "core" point)
   Points that are within $\epsilon$ of a core point but don't themselves have minPoints neighbors within $\epsilon$ are called "border" points
   Points that are neither core nor border points are outliers
DBSCAN (Contd)
 The animation credited below shows DBSCAN in action
 The basic algorithm is as follows (see the sketch after this list)
   A point is chosen at random
   If it has at least minPoints neighbors within $\epsilon$ distance, call it a core point
   Check if more points fall within $\epsilon$ distance of the core point or its neighbors
   If yes, include them too in the same cluster
   Once done with this cluster, pick another unvisited point randomly and repeat
 In an example clustering obtained by DBSCAN (see the animation), green points are core points, blue points are border points, and red points are outliers
 DBSCAN is mostly a heuristic-based algorithm: there is no loss function, unlike K-means
Animation credit: https://dashee87.github.io/
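A basic, unoptimized sketch of this procedure (the function and variable names are illustrative; real implementations such as sklearn.cluster.DBSCAN use spatial indexing for the neighbor queries):

import numpy as np

def dbscan(X, eps=0.5, min_points=5):
    """Return a cluster id per point; -1 marks unclustered points (outliers)."""
    N = len(X)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))  # pairwise distances
    neighbors = [np.where(D[i] <= eps)[0] for i in range(N)]
    core = np.array([len(nb) >= min_points for nb in neighbors])
    labels = np.full(N, -1)                    # -1 = not assigned yet / outlier
    cluster_id = 0
    for i in range(N):
        if labels[i] != -1 or not core[i]:
            continue                           # new clusters only start from core points
        # Grow the cluster from this core point by expanding through core neighbors
        labels[i] = cluster_id
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster_id         # border or core point joins the cluster
                if core[j]:
                    frontier.extend(neighbors[j])   # only core points keep expanding
        cluster_id += 1
    return labels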


Going the Probabilistic Way..
 Assume a generative model $p(\boldsymbol{x}|\Theta)$ for the inputs, where $\Theta$ denotes all the unknown parameters
 Clustering then boils down to computing the posterior cluster probability $p(z_n = k \mid \boldsymbol{x}_n, \Theta)$, where $z_n$ denotes the cluster assignment of $\boldsymbol{x}_n$
 From Bayes rule (a small code sketch follows below): $p(z_n = k \mid \boldsymbol{x}_n, \Theta) = \frac{p(z_n = k \mid \Theta)\, p(\boldsymbol{x}_n \mid z_n = k, \Theta)}{\sum_{\ell=1}^{K} p(z_n = \ell \mid \Theta)\, p(\boldsymbol{x}_n \mid z_n = \ell, \Theta)}$
 Assume the prior $p(z_n \mid \Theta)$ to be multinoulli with probability vector $\boldsymbol{\pi} = [\pi_1, \ldots, \pi_K]$, and each of the class-conditionals to be a Gaussian, $p(\boldsymbol{x}_n \mid z_n = k, \Theta) = \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ (here $\Theta = \{\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}_{k=1}^{K}$)
 The posterior probability of a cluster assignment thus also depends on the prior probability $\pi_k$ (the fraction of points in that cluster, if using MLE)
 Different clusters can have different covariances (hence different shapes)
 We know how to estimate $\Theta$ if the $z_n$'s were known (recall generative classification)
 But since we don't know the $z_n$'s, we need to estimate both, just like in K-means (and ALT-OPT can be used)
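A small sketch of this Bayes-rule computation for a Gaussian mixture with known parameters (the use of scipy.stats.multivariate_normal and the argument layout are my choices, not from the slides):

import numpy as np
from scipy.stats import multivariate_normal

def cluster_posteriors(X, pis, means, covs):
    """p(z_n = k | x_n, Theta) via Bayes rule, for mixing proportions pis (K,),
    means (K, D), and covariances (K, D, D)."""
    K = len(pis)
    # Unnormalized posterior: prior * class-conditional Gaussian likelihood
    post = np.stack([pis[k] * multivariate_normal(means[k], covs[k]).pdf(X)
                     for k in range(K)], axis=1)           # shape (N, K)
    post /= post.sum(axis=1, keepdims=True)                # normalize over clusters
    return post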
Going the Probabilistic Way..
 At a high level, a probabilistic clustering algorithm would look somewhat like this (a sketch in code follows this list)
   Initialize the parameters $\Theta$ (akin to initializing the cluster means in K-means)
   Compute the posterior cluster probabilities $p(z_n = k \mid \boldsymbol{x}_n, \Theta)$ for each point (akin to computing cluster assignments in K-means)
   Re-estimate the parameters $\Theta$ using these posterior probabilities (akin to updating the cluster means in K-means), and repeat until convergence
 The above algorithm is an instance of a more general algorithm called Expectation Maximization (EM)
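A minimal sketch of these three steps for a Gaussian mixture, with soft (posterior-weighted) updates; this is essentially EM for a GMM written as an alternating loop, and the initialization and covariance regularizer are illustrative choices (numerical safeguards are omitted for brevity):

import numpy as np
from scipy.stats import multivariate_normal

def prob_clustering(X, K, n_iters=100, seed=0):
    """Initialize Theta, compute posterior cluster probabilities, re-estimate
    Theta from them, and repeat."""
    N, Dim = X.shape
    rng = np.random.default_rng(seed)
    # Step 1: initialize parameters (akin to initializing cluster means in K-means)
    means = X[rng.choice(N, size=K, replace=False)].copy()
    covs = np.array([np.eye(Dim) for _ in range(K)])
    pis = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        # Step 2: posterior cluster probabilities (akin to cluster assignments)
        gamma = np.stack([pis[k] * multivariate_normal(means[k], covs[k]).pdf(X)
                          for k in range(K)], axis=1)
        gamma /= gamma.sum(axis=1, keepdims=True)          # (N, K)
        # Step 3: re-estimate parameters (akin to updating cluster means)
        Nk = gamma.sum(axis=0)                             # effective cluster sizes
        pis = Nk / N
        means = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - means[k]
            covs[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(Dim)
    return gamma, (pis, means, covs)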
Clustering vs Classification
 Any clustering model (prob/non-prob) typically learns two types of quantities
   Parameters $\Theta$ of the clustering model (e.g., the cluster means in K-means)
   Cluster assignments $\boldsymbol{Z}$ for the points
 If the cluster assignments were known, learning the parameters is just like learning the parameters of a classification model (typically a generative classification model) using labeled data
 Thus it helps to think of clustering as (generative) classification with unknown labels
 Therefore many clustering problems are typically solved in the following fashion
1. Initialize $\Theta$ somehow
2. Predict $\boldsymbol{Z}$ given the current estimate of $\Theta$
3. Use the predicted $\boldsymbol{Z}$ to improve the estimate of $\Theta$ (like learning a generative classification model)
4. Go to step 2 if not converged yet
Clustering can help supervised learning, too
 Often "difficult" supervised learning problems can be seen as mixtures of simpler models
 Example: nonlinear regression or nonlinear classification as a mixture of linear models
 We don't know which point should be modeled by which linear model ⇒ clustering
 Such an approach is also an example of divide and conquer, and is also known as "mixture of experts" (we will see it more formally when we discuss latent variable models)
 Can therefore solve such problems as follows (see the sketch after this list)
1. Initialize each linear model somehow (maybe randomly)
2. Cluster the data by assigning each point to its "closest" linear model (the one that gives the lowest error)
3. (Re-)Learn a linear model for each cluster's data. Go to step 2 if not converged.
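A small sketch of this alternating procedure for nonlinear regression as a mixture of linear regression models (the least-squares refit and the bias-feature handling are illustrative assumptions):

import numpy as np

def mixture_of_linear_regressions(X, y, K, n_iters=50, seed=0):
    """Alternate between assigning each point to the linear model with the
    lowest error, and refitting each linear model on its assigned points."""
    N = X.shape[0]
    Xb = np.hstack([X, np.ones((N, 1))])           # add a bias feature
    rng = np.random.default_rng(seed)
    # Step 1: initialize the K linear models randomly
    W = rng.normal(size=(K, Xb.shape[1]))
    for _ in range(n_iters):
        # Step 2: assign each point to its "closest" model (smallest squared error)
        errors = (Xb @ W.T - y[:, None]) ** 2      # (N, K)
        z = errors.argmin(axis=1)
        # Step 3: refit each model on its cluster's data by least squares
        for k in range(K):
            idx = z == k
            if idx.sum() >= Xb.shape[1]:           # need enough points to fit
                W[k] = np.linalg.lstsq(Xb[idx], y[idx], rcond=None)[0]
    return W, z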
Coming up next
 Latent Variable Models
 Mixture models using latent variables
 Expectation Maximization algorithm

