
L08 Clustering

This document provides an overview of clustering techniques for unsupervised learning. It discusses popular clustering algorithms like K-means, agglomerative clustering, and DBSCAN. It also covers anomaly detection using clustering. K-means groups data by assigning points to centroids, while agglomerative clustering merges clusters iteratively. DBSCAN finds clusters of varying shapes based on density. Anomaly detection identifies outliers that are distant from clusters of normal data points.


UCCD2063

Artificial Intelligence Techniques

Unit 08:
Unsupervised Learning:
Clustering

Outline
• Unsupervised Learning - Clustering
• Anomaly Detection

What is Clustering?
▪ Clustering: the process of grouping data samples that are similar in some way into classes of similar objects
▪ Clustering is a form of unsupervised learning – class labels are not known in advance (i.e., no target y is provided)
▪ It is a method of data exploration – a way of looking for interesting patterns or structure in the data

Popular Clustering Algorithms

▪ K-means clustering – tries to separate samples into k groups of equal variance.

▪ Agglomerative clustering – a bottom-up approach in which each observation starts in its own cluster, and clusters are successively merged together.

▪ DBSCAN – views clusters as areas of high density separated by areas of low density; clusters found by DBSCAN can have any shape.

K-means Clustering

1. Randomly initialize k (e.g. 3) cluster centers
2. Assign each data point to the closest cluster center (using some distance measure)
3. Re-compute cluster centers (mean of data points in each cluster)
4. Repeat steps 2–3 and stop when there are no new re-assignments
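The four steps above can be sketched as a minimal NumPy implementation (illustrative only, not the full `sklearn.cluster.KMeans`; the function name and defaults here are assumptions):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means following the four steps above."""
    rng = np.random.default_rng(seed)
    # 1. Randomly initialize k cluster centers (here: k distinct data points)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # 2. Assign each point to the closest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # 4. Stop when there are no new re-assignments
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # 3. Re-compute each center as the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels
```

In practice, library implementations such as scikit-learn's `KMeans` run several random initializations and keep the best result, to mitigate the sensitivity to the starting centers.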

K-means Clustering

Problems with k-means:

• Must choose the number of clusters k in advance, which can be difficult
• Starts from a random choice of cluster centers, so different runs may yield different clustering results
• Assumes clusters are convex, so it cannot deal with complex cluster shapes

K-means Clustering Problems

Failure cases shown in the figures: (a) different starting cluster centers, (b) incorrect number of clusters, (c) non-convex clusters
Agglomerative Clustering

▪ Each point is initialized as its own cluster
▪ Compute the linkage between clusters
▪ Merge the two clusters with the smallest linkage
▪ Repeat the process until the desired number of clusters is obtained
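A minimal sketch of this bottom-up procedure, assuming single linkage (the smallest pairwise distance between two clusters; scikit-learn's `AgglomerativeClustering` supports several linkage criteria):

```python
import numpy as np

def agglomerative(X, n_clusters):
    """Bottom-up clustering: every point starts as its own cluster,
    then the two closest clusters are merged repeatedly."""
    clusters = [[i] for i in range(len(X))]          # one cluster per point
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: smallest pairwise distance between clusters
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])              # merge the closest pair
        del clusters[b]
    return clusters
```

This naive version recomputes all pairwise linkages on every merge; real implementations cache distances to avoid the cubic cost.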

DBSCAN
▪ DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
▪ Core point: a data point that has at least MinPts points within the Eps radius around it
▪ Border point: a data point that has fewer than MinPts points within Eps, but is in the neighborhood of a core point
▪ Noise point: any point that is neither a core point nor a border point

(Hyperparameters: MinPts and Eps)


DBSCAN

▪ Randomly select a core point to start a new cluster
▪ Iteratively add all points (core and border) within the Eps distance to the cluster
▪ Stop when no more points are within the Eps neighborhood
▪ Repeat the procedure with an unvisited core point, and stop when no unvisited core points remain
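The procedure above can be sketched as follows (a minimal implementation, following the common convention that a core point has at least MinPts neighbours, itself included, within Eps):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN. Returns one label per point; -1 marks noise."""
    n = len(X)
    # Eps-neighbourhood of every point (the point itself is included)
    nbrs = [np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]
            for i in range(n)]
    core = [len(nb) >= min_pts for nb in nbrs]
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue              # start clusters from unvisited core points only
        labels[i] = cluster
        queue = list(nbrs[i])
        while queue:              # grow the cluster through the Eps neighbourhood
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # core and border points join the cluster
                if core[j]:           # but only core points expand it further
                    queue.extend(nbrs[j])
        cluster += 1
    return labels
```

Points left with label -1 at the end are the noise points, which is what makes DBSCAN useful for the anomaly detection discussed later in this unit.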

DBSCAN

▪ Strengths:
• Handles noise (outliers) very well
• Handles clusters of different shapes and sizes

▪ Weaknesses:
• Does not work well with clusters of varying densities
• Sensitive to the hyperparameters
• May not work well on high-dimensional data

Anomaly Detection

▪ Anomaly detection identifies data points that do not fit well with the rest of the data. It has a wide range of applications such as fraud detection, surveillance, diagnosis, data cleanup, etc.
▪ Anomaly detection can be approached in many ways depending on the nature of the data (labeled or unlabeled, ordered or unordered, ...)
▪ Here we focus on anomaly detection for multivariate unordered data using clustering and the Mahalanobis distance.

Anomaly Detection with Clustering
▪ The underlying assumption is that if we cluster the data, normal data points will belong to (large) clusters, while anomalies will belong to no cluster or only to small clusters.
▪ A data point is considered an anomaly if its distance to the known large clusters is too large.

Problem with Euclidean Distance
▪ For the distance measure, Euclidean distance can fail here: it measures ordinary straight-line distance and ignores the shape and spread of the cluster.

Mahalanobis Distance
▪ The solution is the Mahalanobis distance, which takes the direction of the variance into account in order to normalize the distance properly:

D_M(x) = sqrt( (x − μ)ᵀ S⁻¹ (x − μ) )

where μ is the mean of the cluster and S is the covariance matrix of the cluster

Multivariate Outlier Detection With Mahalanobis Distance
▪ To detect outliers, we must specify a distance threshold
▪ The threshold is set by multiplying the STD (standard deviation) of the Mahalanobis distances by an extremeness degree k, such that:
thresh = k * std
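Putting the last three slides together, a sketch (assuming the cluster is given as an array of its points, and `k` is the extremeness degree above; function names are illustrative):

```python
import numpy as np

def mahalanobis(x, cluster):
    """Mahalanobis distance from point x to the cluster's distribution."""
    mu = cluster.mean(axis=0)
    S = np.cov(cluster, rowvar=False)      # covariance matrix of the cluster
    diff = x - mu
    return float(np.sqrt(diff @ np.linalg.inv(S) @ diff))

def is_anomaly(x, cluster, k=3.0):
    """Flag x if its distance exceeds thresh = k * std of the
    Mahalanobis distances of the cluster's own points."""
    d = np.array([mahalanobis(p, cluster) for p in cluster])
    thresh = k * d.std()
    return mahalanobis(x, cluster) > thresh
```

Because S⁻¹ rescales each direction by its variance, a point that is only moderately far in a low-variance direction can still receive a large Mahalanobis distance, which is exactly what Euclidean distance misses.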

Next:

Search: Problems and Algorithms
