
ECS784U/P DATA ANALYTICS

(WEEK 5, 2024)
UNSUPERVISED LEARNING

DR ANTHONY CONSTANTINOU
SCHOOL OF ELECTRONIC ENGINEERING AND COMPUTER SCIENCE
TIMETABLE

2
LECTURE OVERVIEW

Unsupervised learning
▪ Week 5 Lab.
▪ K-means clustering.
▪ Gaussian Mixture Models (GMMs).

3
WEEK 5 LAB OVERVIEW
Data Analytics case studies – supervised and
unsupervised
▪ Working with a data set based on loan approvals
(supervised) or Facebook sellers (unsupervised)
▪ Two demonstration coursework Jupyter Notebooks
▪ How to load and pre-process the data set (Pandas).
▪ How to use the data set to train different algorithms (Scikit-Learn).
▪ How to evaluate the performance of a trained model (see the sketch below).

▪ There are many different ways a case study can be analysed. This lab will cover a few of them.
▪ Useful for Coursework 1!
▪ The Week 6 Lab provides continued support for this and CW1 help.
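As a rough illustration of the workflow the lab notebooks walk through, a minimal sketch is given below. The file name, column names and model choice are placeholders, not the actual coursework data set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load and pre-process the data set (hypothetical file and column names).
df = pd.read_csv("loan_approvals.csv")
df = df.dropna()                                     # simple pre-processing step
X = pd.get_dummies(df.drop(columns=["approved"]))    # features x
y = df["approved"]                                   # target variable y

# Train a model on one split of the data and evaluate it on the other.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```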
SUPERVISED AND UNSUPERVISED LEARNING

Machine Learning
▪ Supervised learning: Regression, Classification.
▪ Unsupervised learning: Clustering, Dimensionality reduction.
5
RECALL FROM WEEK 2:
SUPERVISED LEARNING
Machine learning methods are usually distinguished between
supervised and unsupervised learning.
▪ In supervised learning:
▪ We have a target variable of interest which we want to predict.
▪ Usually referred to as target 𝒚.
▪ In the example below, we want to predict the risk of violence 6 months after
release.
▪ We have a set of features/variables that we can use to predict 𝑦.
▪ This set of variables is often denoted as 𝒙.
▪ The model learns how to predict 𝒚 given 𝒙.

[Figure: example data table with the features 𝒙 and the target variable 𝒚.]
RECALL FROM WEEK 2:
UNSUPERVISED LEARNING
Machine learning methods are usually distinguished between
supervised and unsupervised learning.
▪ Unsupervised learning:
▪ We do not have a target variable 𝑦.
▪ So we only give 𝑥 to the algorithm.
▪ We usually have a data set with unlabelled data.
▪ We want to find patterns in the data.
▪ We will see some examples after we go through supervised learning.
▪ We usually do not have definitive labels for the data.
▪ The entire data set is given as input without specifying 𝑥 and 𝑦; there is no target variable 𝑦 to predict.
UNSUPERVISED LEARNING:
DIMENSIONALITY REDUCTION WITH PCA
▪ Remember Dimensionality Reduction with PCA we covered in Week 2?
▪ Objective of PCA:
▪ There is no special target variable.
▪ Reduce the data dimensions 𝑥 into 𝑧 with minimum information loss.
▪ PCA is an unsupervised learning approach that focuses on
dimensions (data columns).
▪ In this lecture, we will look at unsupervised learning approaches that
perform clustering and focus on data instances (data rows/samples).
[Figure: dimensionality reduction projects the data from (𝑥1, 𝑥2) onto a new dimension 𝑧; clustering instead groups the data points within the (𝑥1, 𝑥2) space.]
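As a quick reminder of how this looks in practice, here is a minimal PCA sketch using scikit-learn; the 2-D data below are made up purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up 2-D data: two correlated columns x1 and x2.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = np.column_stack([x1, 2 * x1 + rng.normal(scale=0.3, size=100)])

# Reduce the two dimensions x into a single dimension z with minimum information loss.
pca = PCA(n_components=1)
z = pca.fit_transform(X)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```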
CLASSIFICATION VS CLUSTERING
Classification:
▪ The classification line is a model.
▪ Predicts the class to which new data belongs.

Clustering:
▪ Data have no labels.
▪ Learns clusters corresponding to possibly different data classes.
▪ Uses distance-based metrics to determine clusters.

[Figures: scatter plots of 𝑥1 (e.g., weight) vs 𝑥2 (e.g., length); classification separates labelled classes 𝑦1, 𝑦2, 𝑦3, …, 𝑦𝑛, while clustering groups unlabelled points.]

9
CLUSTERING: VISUAL EXAMPLES

10
EXAMPLES OF CLUSTERING
When might clustering be useful in the real world?
▪ Marketing:
▪ Separate household income into groups and then target each group with different
marketing strategies.
▪ E.g., data include family size and spending level.

▪ Sports analytics:
▪ Find recommended players.
▪ Separate teams by playing style and determine counter strategy.
▪ E.g., possession rate, attacks generated, long shots, distance of passes, tempo etc.

▪ Car insurance:
▪ Separate clients by risk level.
▪ E.g., number of accidents, driving offences, car value.

▪ Social network analysis:


▪ Find recommended friends.
▪ E.g., mutual friends, location, work, education, shared networks.

▪ Politics:
▪ Clustering voting patterns.
▪ E.g., who tends to vote in favour of, or against, A.
EXAMPLES OF CLUSTERING
Are there any ‘correct’ clusters available to discover?
▪ Suppose you have a clothing brand and you want to determine T-shirt sizes.
▪ How big should size M be?
▪ Should you offer XXS and XXL?

The weight and height parameters may depend on the population that is likely to buy from this brand, rather than on the entire population.
12
K-MEANS CLUSTERING
▪ K-means clustering is a classic and a simple unsupervised
learning algorithm.
▪ Just like other unsupervised methods, the objective of K-means
clustering is to discover patterns in the data.

▪ Input:
▪ An unlabelled data set.
▪ Number of clusters 𝐾.

▪ Objective:
▪ Group the data points into 𝐾 clusters.
▪ This is done by randomly initialising 𝐾 centroids and iteratively moving each centroid to a position that reduces the distance between the centroid and the data points that are part of its cluster (see the sketch below).

[Figure: unlabelled scatter plot of 𝑥1 (e.g., weight) vs 𝑥2 (e.g., length).]
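In practice K-means is usually run through a library rather than implemented by hand. A minimal scikit-learn sketch follows; the blob data are synthetic, purely for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic unlabelled data set and a chosen number of clusters K.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
K = 3

kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # learnt centroid positions
print(kmeans.labels_[:10])       # cluster assigned to the first 10 points
```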
K-MEANS: VISUAL ILLUSTRATION

Randomise the initial position of the centroids


14
K-MEANS: VISUAL ILLUSTRATION

Assign data points to the nearest centroid


15
K-MEANS: VISUAL ILLUSTRATION

Move each centroid to the position that minimises the distance across all the data points within its cluster.
K-MEANS: VISUAL ILLUSTRATION

Re-assign data points to the nearest centroid


17
K-MEANS: VISUAL ILLUSTRATION

Again, move each centroid to the position that minimises the distance across all the data points within its cluster.
K-MEANS: VISUAL ILLUSTRATION

Re-assign data points to the nearest centroid


19
K-MEANS: VISUAL ILLUSTRATION

The learning process ends here because no data point has been assigned to a different cluster.

Again, move each centroid to the position that minimises the distance across all the data points within its cluster.
K-MEANS: PSEUDOCODE (reading slide)

Input:
1. An unlabelled data set 𝐷 consisting of 𝑁 points/samples.
2. Number of clusters 𝐾.

Initialise 𝐾 centroids, 𝐾1 , 𝐾2 , … , 𝐾𝑛
while the position of the centroids changes do
    for each sample/data point 𝑁𝑖 in 𝑁 do
        assign 𝑁𝑖 to the closest centroid 𝐾𝑛
    for each centroid 𝐾𝑛 in 𝐾 do
        move 𝐾𝑛 to the position that minimises the average distance from the points 𝑁𝑖 assigned to 𝐾𝑛

21
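The pseudocode above translates almost line-for-line into NumPy. The sketch below is for illustration only: it initialises centroids at 𝐾 randomly chosen data points and includes no restarts or empty-cluster handling.

```python
import numpy as np

def k_means(D, K, max_iters=100, seed=0):
    """Cluster the N x d data set D into K clusters."""
    rng = np.random.default_rng(seed)
    # Initialise K centroids at the positions of K randomly chosen data points.
    centroids = D[rng.choice(len(D), size=K, replace=False)]
    for _ in range(max_iters):
        # Assign each data point to the closest centroid (Euclidean distance).
        distances = np.linalg.norm(D[:, None, :] - centroids[None, :, :], axis=2)
        assignments = distances.argmin(axis=1)
        # Move each centroid to the mean position of the points assigned to it.
        new_centroids = np.array([D[assignments == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):   # centroid positions no longer change
            break
        centroids = new_centroids
    return centroids, assignments

# Example usage on random 2-D data.
X = np.random.default_rng(1).normal(size=(200, 2))
centroids, labels = k_means(X, K=3)
```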
K-MEANS: DISTANCE (reading slide)

▪ As with other algorithms, different distance metrics and error measures can be used in this ‘family’ of methods.

▪ For K-means clustering:
▪ Centroids: mean positions.
▪ Distance: Euclidean.
▪ Error: sum of squared distances.

▪ For K-medians clustering:
▪ Centres: median in each dimension.
▪ Distance: Manhattan.
▪ Error: sum of Manhattan distances.

[Figure: Euclidean (green) vs. Manhattan (other colours) distance.]

22
K-MEANS: LOSS FUNCTION
▪ Formally, optimisation involves minimising the error (loss) 𝐸 given cluster centres µ1:𝐾 and cluster assignments 𝑐1:𝑁:

$$E_{KM}(D, c_{1:N}, \mu_{1:K}) = \sum_{i=1}^{N} \lVert N_i - \mu_{c_i} \rVert^2$$
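As a small worked example, the loss above can be computed directly. This sketch uses made-up points, assignments and centroids.

```python
import numpy as np

# Five data points N_i, their cluster assignments c_i, and two centroids mu_k.
points      = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9]])
assignments = np.array([0, 0, 0, 1, 1])
centroids   = np.array([[1.0, 1.0], [5.1, 4.95]])

# E_KM = sum over i of the squared Euclidean distance between N_i and its assigned centroid.
E = np.sum(np.linalg.norm(points - centroids[assignments], axis=1) ** 2)
print(E)
```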

▪ K-means can be viewed as a simplified version of the Expectation Maximisation (EM) process (i.e., with hard, instead of soft, assignment of clusters):

▪ Expectation (E-step):
▪ Assign each data point to its closest centroid.

▪ Maximisation (M-step):
▪ Find the new centroid of each cluster.
23
K-MEANS: CONVERGENCE
▪ K-means converges to a local optimum only.
▪ Because K-means initialises centroids at random, this means that the
algorithm may return a different result each time it runs.
▪ In practice, this issue can be partly resolved by multiple K-means
restarts and selecting the clusters that minimise error across all
runs.
[Figure: the error 𝐸 decreases over iterations, dropping with each E-step and M-step until convergence.]
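A minimal sketch of the restart strategy follows. Scikit-learn's `n_init` parameter already does this internally, but an explicit loop makes the idea clear; the data here are synthetic.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Run K-means several times from different random initialisations
# and keep the solution with the lowest error (inertia).
best = min(
    (KMeans(n_clusters=4, n_init=1, random_state=seed).fit(X) for seed in range(10)),
    key=lambda km: km.inertia_,
)
print("Lowest error across restarts:", best.inertia_)
```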
K-MEANS: SCALING

25
K-MEANS: OTHER LIMITATIONS
▪ Determining 𝐾 is an issue (there are some solutions, but it remains an open question).
▪ K-means is a distance-based method, so it is sensitive to data scaling.
▪ Normalise to [0, 1] or by standard deviation (see the sketch below).
▪ K-means works best when the clusters:
▪ are spherical, since it constructs a cluster around each centroid;
▪ are well-separated, since it aims to separate data points;
▪ are of similar volume, since data points of a large cluster may end up being closer to the centroid of another, smaller cluster;
▪ contain similar numbers of data points.

26
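Because K-means is distance-based, features measured on very different scales should be rescaled first. Below is a minimal sketch of both options mentioned above, on synthetic data with hypothetical feature meanings.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.cluster import KMeans

# Two features on very different scales, e.g. age (tens) and income (tens of thousands).
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(40, 10, 200), rng.normal(30000, 8000, 200)])

X_minmax = MinMaxScaler().fit_transform(X)      # normalise each column to [0, 1]
X_std    = StandardScaler().fit_transform(X)    # normalise by mean and standard deviation

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_std)
```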
10 MINUTES BREAK
K-MEANS CLUSTERING AND
GAUSSIAN MIXTURE MODELS
▪ A rather important issue with K-means is that it makes hard assignments to
data points.

▪ Can we be confident about making hard assignments to the data points in the figure above?
▪ What if we want to assign a probability to each data point belonging to a
particular cluster?
▪ Gaussian Mixture Models (GMMs) can be used for that purpose.
28
GAUSSIAN MIXTURE MODELS
▪ It is possible for data points that come from different distributions to have the
same values.
▪ In other words, a data point can belong to more than one cluster.
▪ However, one cluster might produce a given data point more often than another.
▪ Determining which probability distribution is responsible for a specific data point is a process of density estimation.
▪ The GMM solution explains the data as a weighted sum of 𝑲 Gaussian
distributions.

29
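In standard notation, this weighted sum of 𝐾 Gaussian distributions can be written as:

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1, \;\; \pi_k \geq 0$$

where the mixture weights 𝜋𝑘 give the proportion of the data each Gaussian is responsible for.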
GAUSSIAN MIXTURE MODELS

▪ Averaging Gaussians assumes that each of these categories appears in equal proportions in our data.
▪ E.g., an equal number of cars and trains.
▪ But because it is unlikely that the proportions between distributions will be equal, we take the weighted sum of Gaussian densities, instead of the average.
▪ E.g., the example in the figure indicates that the three distributions are responsible for 50%, 30% and 20% of the data points.
30
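A small sketch of this weighted sum, using the 50%/30%/20% weights from the example above; the means and standard deviations are made up for illustration.

```python
import numpy as np
from scipy.stats import norm

weights = [0.5, 0.3, 0.2]          # proportions from the example above
means   = [0.0, 4.0, 8.0]          # hypothetical component means
stds    = [1.0, 1.5, 0.5]          # hypothetical component standard deviations

x = np.linspace(-4, 11, 500)
# Mixture density: weighted sum of the three Gaussian densities.
mixture = sum(w * norm.pdf(x, loc=m, scale=s) for w, m, s in zip(weights, means, stds))
```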
GAUSSIAN MIXTURE MODELS
▪ GMMs work similarly to K-means:
▪ Instead of initialising random centroids, GMMs initialise Gaussian distributions with
random parameters µ and 𝜎 for each distribution.
▪ Assigning data points to distributions can be viewed as a soft version of K-means:
▪ Instead of assigning each data point to its closest cluster, GMMs return the
probability for each data point belonging to each Gaussian.

31
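A minimal scikit-learn sketch of this soft assignment follows; the data are synthetic, purely for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
probs = gmm.predict_proba(X)       # probability of each point belonging to each Gaussian
print(probs[:5].round(3))          # soft assignments for the first five points
print(gmm.weights_, gmm.means_)    # learnt mixture weights and Gaussian means
```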
GMMS: DENSITY ESTIMATION
▪ Density estimation is achieved using Maximum Likelihood (estimating parameters of an
assumed distribution):
▪ Which data points belong to which cluster?
▪ We can use this information to learn the parameters of the Gaussian distribution
of a cluster (Density Estimation: Gaussian)
▪ Given the clusters, what is the relative size of each cluster?
▪ We can use this information to plot the weighted mixture of Gaussians (Density
Estimation: Multinomial)
▪ If we know the Gaussians, we can then compute the probability for each data
point belonging to each Gaussian distribution (Bayes’ theorem).
▪ We will go through Bayes’ theorem in detail in one of the later lectures.

32
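In standard notation, the probability that data point 𝑁𝑖 belongs to Gaussian 𝑘 follows directly from Bayes' theorem, using the mixture weights 𝜋𝑘 and the Gaussian densities:

$$p(z_i = k \mid N_i) = \frac{\pi_k \, \mathcal{N}(N_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(N_i \mid \mu_j, \Sigma_j)}$$

where 𝑧𝑖 denotes the (unknown) cluster that generated 𝑁𝑖.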
GMMS: OPTIMISATION
▪ Optimisation is similar to K-means, but for Gaussians instead of centroids:

▪ As with K-means, the EM algorithm is used to:

▪ E-step: Given the learnt Gaussians, infer the probability that each point belongs to each cluster.
▪ M-step: Given the soft assignments, revise the Gaussians.

33
GMMS: VISUALISING OPTIMISATION
▪ Unlike K-means, GMMs:
▪ include shaded assignments indicating soft assignment;
▪ learn non-linear decision boundaries;
▪ learn non-spherical cluster shapes that capture the full covariance;
▪ do not require the Gaussians to contain the same number of data points.

[Figures: the same data clustered by K-means and by GMMs; figures from https://towardsdatascience.com/gaussian-mixture-models]


GMMS: PSEUDOCODE (reading slide)

K-means

Input:
1. An unlabelled data set 𝐷 consisting of 𝑁 points/samples.
2. Number of clusters 𝐾.

Initialise 𝐾 centroids, 𝐾1, 𝐾2, …, 𝐾𝑛
while the position of the centroids changes do
    for each sample/data point 𝑁𝑖 in 𝑁 do
        assign 𝑁𝑖 to the closest centroid 𝐾𝑛
    for each centroid 𝐾𝑛 in 𝐾 do
        find the new centroid of its cluster

GMMs

Input:
1. An unlabelled data set 𝐷 consisting of 𝑁 points/samples.
2. Number of clusters 𝐾.

Initialise 𝐾 Gaussians, 𝐾1, 𝐾2, …, 𝐾𝑛
while the parameters of the Gaussians change do
    for each sample/data point 𝑁𝑖 in 𝑁 do
        compute the probability of 𝑁𝑖 belonging to each cluster 𝐾
    for each Gaussian 𝐾𝑛 in 𝐾 do
        1. revise the Gaussian corresponding to each cluster
        2. revise the weight corresponding to each Gaussian

36
GMMS VS K-MEANS (reading slide)

K-means assumptions:
1. Clusters are spherical.
2. Clusters are of the same size.
3. Clusters have similar volume.
4. Data points are well-separated and can only belong to a single cluster (hard assignment).
5. Clusters have a centroid.

GMM assumptions:
1. Clusters are not necessarily spherical.
2. Clusters are not necessarily of the same size.
3. Clusters do not necessarily have similar volume.
4. Data points need not be well-separated, since they can belong to more than one cluster (soft assignment).
5. Clusters have a normally distributed mean with variance.

K-means optimisation:
Minimise the distance error 𝐸 given cluster centres/centroids µ1:𝐾 and cluster assignments 𝑐1:𝑁:

$$E_{KM}(D, c_{1:N}, \mu_{1:K}) = \sum_{i=1}^{N} \lVert N_i - \mu_{c_i} \rVert^2$$

GMM optimisation:
Maximise the log-likelihood of the data, i.e., maximise the probability that each data point has been generated by the learnt mixture of Gaussians.

37
GMMS: LIMITATIONS
GMMs address some of the limitations of K-means, but they also have
some limitations:
▪ Converge to a local optimum, just like K-means.
▪ Random restarts.

▪ Selecting 𝐾 is still an issue.


▪ Higher computational complexity.
▪ Requires a large sample size relative to the number of variables/columns to work well.
▪ E.g., not well-suited for biology data which often contain more
columns (e.g., genes) than rows (samples).

38
GMMS & K-MEANS: SELECTING K
A common option is the elbow method we've seen for PCA (Week 2):
▪ Increasing 𝐾 reduces the error 𝐸, so it is not possible to simply iterate over 𝐾 and pick the lowest error.
▪ Instead, plot the error 𝐸 as a function of 𝐾 and choose the elbow point (see the sketch below).

▪ You could also present the solutions/clusters found close to the elbow to the end users and consider their recommendations.
▪ They may have the domain expertise needed to determine what makes more sense.
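A minimal sketch of the elbow method with K-means; the data are synthetic, and in practice you would use your own data set.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=4, random_state=0)

# Fit K-means for a range of K and record the error E (inertia) for each.
ks = range(1, 10)
errors = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, errors, marker="o")
plt.xlabel("K")
plt.ylabel("Error E")
plt.show()                         # look for the 'elbow' in the curve
```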
GMMS & K-MEANS: SELECTING K
A more sophisticated approach, especially for those who work in ML, is to
perform Bayesian model selection:
▪ Bayesian Information Criterion (BIC).

▪ Balances accuracy with model dimensionality.


▪ Adding more clusters increases the complexity of the model.
▪ If an additional cluster increases the dimensionality of the model ‘faster’ than it increases its accuracy, then the algorithm will not add that cluster and will consider the previous model as the optimal one.
▪ We will go through BIC model selection in more detail in a future lecture (no need
to know BIC for Coursework 1 - only for Coursework 2).

40
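A minimal sketch of BIC-based selection of 𝐾 with GMMs, on synthetic data for illustration; lower BIC is better.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=400, centers=3, random_state=0)

# Fit a GMM for each candidate K and keep the one with the lowest BIC.
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X) for k in range(1, 8)}
best_k = min(bics, key=bics.get)
print(bics, "-> chosen K:", best_k)
```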
