Clustering and K-Means Algorithm

The document provides an overview of unsupervised learning, focusing on clustering techniques, particularly the K-Means algorithm. It explains the differences between supervised and unsupervised learning, types of clustering, and various clustering methods, including hierarchical and density-based clustering. Additionally, it discusses applications of clustering and the challenges associated with the K-Means method.

UNSUPERVISED LEARNING
(Clustering and K-Means Algorithm)
Outline of Presentation

❑Clustering
❑Types of Clustering
❑Examples and explanation of Clustering
❑K-Means Clustering
❑Working of K-Means Algorithm
❑K-Means Algorithm Examples
❑Hierarchical Clustering
❑Density-Based Clustering
Machine Learning

➢ Supervised Learning: task driven (defined labels); predicts future values. Tasks: Classification and Regression.
➢ Unsupervised Learning: data driven (no defined labels). Task: Clustering.
➢ Reinforcement Learning: learning from mistakes.
Supervised learning vs. Unsupervised learning

➢ Supervised learning: discover patterns in the data that relate data attributes with a target (class) attribute.
• These patterns are then utilized to predict the values of the target attribute in future data instances.
➢ Unsupervised learning: the data have no target attribute.
• We want to explore the data to find some intrinsic structures in them.
Clustering

❑ Clustering is a technique for finding similarity groups in data, called clusters. I.e.,
❖ it groups data instances that are similar to (near) each other into one cluster, and data instances that are very different (far away) from each other into different clusters.
❑ Clustering is often called an unsupervised learning task because no class values denoting an a priori grouping of the data instances are given, as would be the case in supervised learning.
❑ Due to historical reasons, clustering is often considered synonymous with unsupervised learning.
❖ In fact, association rule mining is also unsupervised.
What Is Clustering?

"Clustering is the process of dividing a dataset into groups consisting of similar data points."
• Points in the same group are as similar as possible.
• Points in different groups are as dissimilar as possible.
Examples of Clustering

• Items arranged in a mall
• Groups of diners in a restaurant
Illustration of Clustering
The data set has three natural groups of data points, i.e., 3 natural clusters.

➢ Example 1: group people of similar sizes together to make "small", "medium" and "large" T-shirts.
• Tailor-made for each person: too expensive.
• One-size-fits-all: does not fit all.
➢ Example 2: in marketing, segment customers according to their similarities, to do targeted marketing.
➢ Clustering has a long history and is used in almost every field, e.g., medicine, psychology, botany, sociology, biology, archeology, marketing, insurance, libraries, etc.
➢ In recent years, due to the rapid increase of online documents, text clustering has become important.
Aspects of Clustering
➢ A clustering algorithm
▪ Partitional clustering
▪ Hierarchical clustering
▪ …
➢ A distance (similarity, or dissimilarity) function
➢ Clustering quality
▪ Inter-cluster distance: maximized
▪ Intra-cluster distance: minimized
➢ The quality of a clustering result depends on the algorithm, the distance function, and the application.
What is Similarity?

• In the field of cluster analysis, similarity plays an important part.
• We shall now look at how similarity (alternatively judged as "dissimilarity") between any two data points can be measured.
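Two of the most common dissimilarity measures are the Euclidean (L2) and Manhattan (L1) distances. A minimal NumPy sketch (the function names are illustrative, not from the slides):

import numpy as np

def euclidean_distance(x, y):
    # Straight-line (L2) distance between two feature vectors.
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan_distance(x, y):
    # City-block (L1) distance between two feature vectors.
    return np.sum(np.abs(x - y))

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])
print(euclidean_distance(a, b))  # 5.0
print(manhattan_distance(a, b))  # 7.0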
Applications of clustering

• Recommendation systems
• Search results
• Banking / finance / insurance
• Grouping of photos
• Movie recommendation
• Retail stores
Applications of clustering (Face Clustering)
Types of CLUSTERING
❑ Exclusive Clustering
❑ Overlapping Clustering
❑ Hierarchical Clustering

Exclusive Clustering
• Hard clustering
• A data point/item belongs exclusively to one cluster
• For example: K-Means clustering

Overlapping Clustering
• Soft clustering
• A data point/item may belong to multiple clusters
• For example: Fuzzy C-Means clustering

Hierarchical Clustering
Methods of CLUSTERING
❑ Partitioning Method
❑ Hierarchical Method
❑ Density-Based Methods

Partitioning Methods
Partition a database D of n objects into a set of k clusters such that the sum of squared distances is minimized (where c_i is the centroid or medoid of cluster C_i):

E = \sum_{i=1}^{k} \sum_{p \in C_i} (p - c_i)^2

k-means (MacQueen '67, Lloyd '57/'82): each cluster is represented by the center of the cluster.
k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw '87): each cluster is represented by one of the objects in the cluster.
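As a toy check of this objective (numbers invented for illustration), take k = 2 with 1-D clusters C_1 = {1, 3} and C_2 = {10, 12}; the centroids are c_1 = 2 and c_2 = 11, so E = (1-2)^2 + (3-2)^2 + (10-11)^2 + (12-11)^2 = 4. The same computation in NumPy:

import numpy as np

# Toy 1-D example: two clusters and their mean centroids.
clusters = [np.array([1.0, 3.0]), np.array([10.0, 12.0])]
centroids = [c.mean() for c in clusters]  # [2.0, 11.0]

# E = sum over clusters i of sum over points p in C_i of (p - c_i)^2
E = sum(((c - m) ** 2).sum() for c, m in zip(clusters, centroids))
print(E)  # 4.0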
K-means Clustering

➢ K-Means is a clustering algorithm whose main goal is to group similar elements or data points into a cluster.
➢ "K" represents the number of clusters.

Examples:
• A pile of dirty clothes
• A document classifier
K-means Clustering: Working of the Algorithm
(The original slides illustrate the algorithm step by step with figures, which are not reproduced here.)
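Since the figures are not reproduced, here is a minimal NumPy sketch of the standard K-means loop (Lloyd's algorithm): pick initial centroids, assign each point to its nearest centroid, recompute each centroid as the mean of its assigned points, and repeat until the assignments stop changing. Function and variable names are illustrative.

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    # Minimal K-means (Lloyd's algorithm). X: (n, d) array of points.
    rng = np.random.default_rng(seed)
    # Step 1: choose k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(n_iters):
        # Step 2: assign each point to the nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments unchanged: the algorithm has converged
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Usage (toy data): two obvious groups in 2-D.
X = np.array([[1, 1], [1.5, 2], [1, 1.5], [8, 8], [8.5, 9], [9, 8]], dtype=float)
centroids, labels = kmeans(X, k=2)
print(labels)  # e.g. [0 0 0 1 1 1] (cluster numbering may vary)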
How will you find the value of k?
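The slides do not spell out the procedure here, but a common heuristic is the elbow method: run K-means for a range of k values, plot the within-cluster sum of squared errors (the objective E above) against k, and pick the k at the "elbow" where the curve flattens. A short sketch using scikit-learn (KMeans and its inertia_ attribute are standard scikit-learn API; the toy data are invented):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy data: three well-separated blobs in 2-D.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 5], [10, 0])])

for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))  # inertia_ = within-cluster sum of squared distances
# The drop in inertia slows sharply after k = 3: the "elbow".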
Summary of K-means algorithm
Application of K-means algorithm
Colour-Based Image Segmentation Using K-means
Step 1: Load a colour image of tissue stained with hematoxylin and eosin (H&E).
Step 2: Convert the image from RGB colour space to L*a*b* colour space.
• Unlike the RGB colour model, L*a*b* colour is designed to approximate human vision.
• There is a complicated transformation between RGB and L*a*b*:
(L*, a*, b*) = T(R, G, B)
(R, G, B) = T'(L*, a*, b*)
Step 3: Undertake clustering analysis in the (a*, b*) colour space with the K-means algorithm.
• In the L*a*b* colour space, each pixel has a feature vector (L*, a*, b*).
• As in feature selection, the L* feature is discarded; as a result, each pixel has a feature vector (a*, b*).
• Apply the K-means algorithm to the image in the (a*, b*) feature space, with K = 3 chosen using domain knowledge.
Step 4: Label every pixel in the image using the results from K-means clustering (indicated by three different grey levels).
Step 5: Create images that segment the H&E image by colour.
• Apply the label and the colour information of each pixel to obtain separate colour images corresponding to the three clusters: "blue" pixels, "white" pixels, and "pink" pixels.
Step 6: Segment the nuclei into a separate image with the L* feature.
• In cluster 1, there are dark and light blue objects (pixels). The dark blue objects (pixels) correspond to nuclei (from domain knowledge).
• The L* feature specifies the brightness value of each colour.
• With a threshold on L*, we obtain an image containing the nuclei only.
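A compact sketch of steps 1-4 using scikit-image and scikit-learn (rgb2lab and KMeans are standard APIs in those libraries; the file names are placeholders):

import numpy as np
from skimage import io, color
from sklearn.cluster import KMeans

# Step 1: load the H&E image (file name is a placeholder).
rgb = io.imread("he_tissue.png")[:, :, :3]

# Step 2: convert RGB -> L*a*b*.
lab = color.rgb2lab(rgb)

# Step 3: cluster on the (a*, b*) features only, with K = 3.
ab = lab[:, :, 1:3].reshape(-1, 2)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(ab)

# Step 4: label every pixel; save the labels as three grey levels.
label_image = labels.reshape(lab.shape[:2])
io.imsave("labels.png", (label_image * 127).astype(np.uint8))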
Problem of the K-Means Method

• The algorithm is only applicable if the mean is defined.
• For categorical data, use k-modes: the centroid is represented by the most frequent values.
• The user needs to specify k.
• The algorithm is sensitive to outliers.
• Outliers are data points that are very far away from other data points.
• Outliers could be errors in the data recording, or some special data points with very different values.
Problems with outliers
Problem of the K-Means Method

• The algorithm is sensitive to initial seeds.
• If we use different seeds, we may get good results; there are some methods to help choose good seeds (one is sketched below).
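One widely used seeding method (named here as an example; the slides do not specify one) is k-means++, which spreads the initial centroids out by sampling each new seed with probability proportional to its squared distance from the nearest seed already chosen. scikit-learn uses it by default:

from sklearn.cluster import KMeans

# init="k-means++" is scikit-learn's default; shown explicitly for emphasis.
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)

Well-spread seeds make it much less likely that two initial centroids land inside the same natural cluster.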
Problem of the K-Means Method

• The k-means algorithm is not suitable for discovering clusters that are not hyper-ellipsoids (or hyper-spheres).
Methods of CLUSTERING
❑ Partitioning Method
❑ Hierarchical Method
❑ Density-Based Methods

Hierarchical CLUSTERING

• Uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition.
(Figure: agglomerative nesting (AGNES) merges bottom-up over steps 0-4, combining a and b into ab, d and e into de, then cde, and finally abcde; divisive analysis (DIANA) traverses the same hierarchy top-down, from step 4 back to step 0.)
Types of hierarchical clustering

➢ Agglomerative (bottom-up) clustering: builds the dendrogram (tree) from the bottom level, and
▪ merges the most similar (or nearest) pair of clusters at each step;
▪ stops when all the data points are merged into a single cluster (i.e., the root cluster).
➢ Divisive (top-down) clustering: starts with all data points in one cluster, the root.
• Splits the root into a set of child clusters; each child cluster is recursively divided further.
• Stops when only singleton clusters of individual data points remain, i.e., each cluster contains only a single point.
An example: working of the algorithm
Hierarchical CLUSTERING

• Decompose the data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
• A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
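A minimal sketch of agglomerative clustering with SciPy (linkage builds the dendrogram bottom-up and fcluster cuts it at the desired level; both are standard scipy.cluster.hierarchy functions; the toy data are invented):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two obvious groups in 2-D.
X = np.array([[1, 1], [1.2, 1.1], [0.9, 1.3],
              [8, 8], [8.2, 8.1], [7.9, 8.3]])

# Agglomerative (bottom-up): repeatedly merge the two nearest clusters.
Z = linkage(X, method="single")  # Z encodes the full dendrogram

# "Cut" the dendrogram so that exactly 2 clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 1 2 2 2]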
Density-Based CLUSTERING

• Clustering based on density (a local cluster criterion), such as density-connected points.
• Major features:
➢ Discovers clusters of arbitrary shape
➢ Handles noise
➢ One scan
➢ Needs density parameters as a termination condition
• Several interesting studies:
DBSCAN: Ester, et al. (KDD'96)
OPTICS: Ankerst, et al. (SIGMOD'99)
DENCLUE: Hinneburg & Keim (KDD'98)
CLIQUE: Agrawal, et al. (SIGMOD'98) (more grid-based)
Density-Based CLUSTERING

• Two parameters: Eps and MinPts.
• Eps: maximum radius of the neighbourhood.
• MinPts: minimum number of points in an Eps-neighbourhood of that point.
• N_Eps(p) = {q ∈ D | dist(p, q) ≤ Eps}
• Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
▪ p belongs to N_Eps(q), and
▪ q satisfies the core point condition: |N_Eps(q)| ≥ MinPts.
(Figure: example with MinPts = 5, Eps = 1 cm.)
Density-Reachable and Density-Connected

• Density-reachable: a point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, with p1 = q and pn = p, such that p_{i+1} is directly density-reachable from p_i.
• Density-connected: a point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts.
DBSCAN: Density-Based Spatial Clustering of Applications with Noise

• Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points.
• Discovers clusters of arbitrary shape in spatial databases with noise.
(Figure: core, border, and outlier points, with Eps = 1 cm, MinPts = 5.)
DBSCAN: The Algorithm

• Arbitrarily select a point p.
• Retrieve all points density-reachable from p w.r.t. Eps and MinPts.
• If p is a core point, a cluster is formed.
• If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
• Continue the process until all of the points have been processed.
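A minimal NumPy sketch of this procedure (illustrative, not production code; a real implementation such as scikit-learn's DBSCAN uses spatial indexes rather than a full distance matrix):

import numpy as np

def dbscan(X, eps, min_pts):
    # Minimal DBSCAN sketch. Returns labels: -1 = noise, 0..k-1 = cluster ids.
    n = len(X)
    labels = np.full(n, -1)           # -1 means "noise" (or not yet assigned)
    visited = np.zeros(n, dtype=bool)
    # Precompute Eps-neighbourhoods: N_Eps(p) = {q | dist(p, q) <= eps}.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = [np.flatnonzero(dists[i] <= eps) for i in range(n)]

    cluster = 0
    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        if len(neighbours[p]) < min_pts:
            continue                  # border or noise point: visit the next point
        # p is a core point: a new cluster is formed and expanded from it.
        labels[p] = cluster
        seeds = list(neighbours[p])
        while seeds:
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cluster   # q is density-reachable from p
            if not visited[q]:
                visited[q] = True
                if len(neighbours[q]) >= min_pts:
                    seeds.extend(neighbours[q])  # q is also core: keep expanding
        cluster += 1
    return labels

# Usage: two dense blobs plus one far-away noise point.
X = np.array([[1, 1], [1.1, 1], [1, 1.2],
              [5, 5], [5.1, 5], [5, 5.2], [20, 20]], dtype=float)
print(dbscan(X, eps=0.5, min_pts=3))  # e.g. [0 0 0 1 1 1 -1]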
