
Clustering – K-means

Prerequisites
- Vector algebra
- Distance metrics
Objectives
- Understanding unsupervised learning
- What clustering is
- Different types of distance metrics
- K-means clustering
- Selection of the optimal k value (Elbow method and Silhouette method)
Unsupervised learning-
Unsupervised learning is a type of machine learning used when the target variable is absent from the dataset, i.e., the dataset is not labelled. The primary goal of unsupervised learning is to find hidden patterns within the data. This can be achieved by grouping the data points into homogeneous groups, also known as clusters. Unsupervised learning is harder than supervised learning because the results are difficult to evaluate in the absence of a target variable or labels. Another issue with unsupervised learning is the definition of the objective function. Despite these issues, unsupervised learning (clustering) is often used to gain insight into data before applying a classification model.
Clustering-
Clustering is one of the fundamental problems of unsupervised learning, defined as the process of dividing data points into homogeneous groups of similar points. The grouping is based on the similarities and dissimilarities between points: points within the same group are similar to one another, while points in different groups are dissimilar. This can be understood from the following figure, where similar data points are clustered on the basis of colour.

Figure 1 Clustering

Types of clustering-

Clustering is primarily classified into partitional clustering and hierarchical clustering. This note describes partitional clustering, the type of clustering in which the data points in the dataset are divided into different groups based on their similarities. Examples: K-means clustering, CLARA, etc.

Figure 2 Types of clustering

K-means Clustering:

K-means clustering is an unsupervised learning algorithm whose goal is to assign the data points to clusters on the basis of their similarity, so that points in the same cluster are similar to each other and points in different clusters are dissimilar. It was developed by James MacQueen in 1967. Here K denotes the number of clusters.

Figure 3 K means Clustering

The figure above shows the working of K-means clustering. The left part shows the data points before applying K-means, whereas the right part depicts the situation after clustering (the data points are grouped into clusters). K-means clustering is particularly useful when most of the data is unorganized. Before moving forward, let us define some terminology.

Cluster- A collection of data points grouped together because of certain similarities between them.

Centroid- A real or imaginary location that represents the centre of a cluster.

Parameter K- K is a hyperparameter that specifies the number of clusters (and hence centroids) to find in the given dataset. Each data point is assigned to one of the K clusters.

Means- In K-means clustering, 'means' refers to the averaging of the data points used to find the centroid of each cluster.

The working of K-means clustering depends on distance metrics, which are used to measure the similarity between data points. The popular distance metrics are:

Euclidean Distance-

It is the most commonly used distance metric, defined as the square root of the sum of squared differences between two points. Let the two points be P(x1, x2) and Q(y1, y2); the Euclidean distance is given by:

$$PQ_{\text{Euclidean}} = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}$$


In general

$$PQ_{\text{Euclidean}} = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$$

Manhattan Distance-

It is also known as the city block distance or absolute distance. The measure is inspired by the layout of Manhattan, where the distance between two points is measured along the city's road grid. The distance is defined as the sum of absolute differences between the coordinates of the two points.

$$PQ_{\text{Manhattan}} = |x_1 - y_1| + |x_2 - y_2|$$


or
$$PQ_{\text{Manhattan}} = \sum_{i=1}^{n}|x_i - y_i|$$

Chebyshev Distance-

This distance is also known as the maximum value distance or chessboard distance. It is based on the maximum absolute difference between the coordinates of a pair of points. This distance can be used with both quantitative and ordinal variables.

$$PQ_{\text{Chebyshev}} = \max(|x_1 - y_1|, |x_2 - y_2|)$$


or
$$PQ_{\text{Chebyshev}} = \max_{i}(|x_i - y_i|)$$

Minkowski Distance-

Minkowski distance is a generalized distance measure: by manipulating its parameter p, different distance measures can be obtained. The distance measures stated above are special cases of the Minkowski distance.
$$PQ_{\text{Minkowski}} = \left(\sum_{i=1}^{n}|x_i - y_i|^p\right)^{1/p}$$

When p = 1, the Minkowski distance becomes the Manhattan distance.
When p = 2, the Minkowski distance becomes the Euclidean distance.
When p = ∞, the Minkowski distance becomes the Chebyshev distance.

Figure: Distance metrics

Mahalanobis Distance-

This distance measure is used to calculate the distance between two points in multivariate space. The idea is to calculate the distance of a point P from a distribution D in terms of the standard deviation and mean of D. The main advantage of the Mahalanobis distance is that it includes the covariance of the distribution when measuring the similarity between two points. The distance is given by:
$$PQ_{\text{Mahalanobis}} = \sqrt{(P - Q)^T S^{-1} (P - Q)}$$
where P and Q are two random vectors from the same distribution and S is the covariance matrix.
NOTE: The most widely used distance measure is the Euclidean distance, but each distance has its own purpose and importance. One cannot claim that a single distance measure is always the most appropriate.
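
To make the formulas above concrete, here is a minimal NumPy sketch of the five metrics. The function names and the example points are illustrative choices, not part of the original note.

```python
import numpy as np

def euclidean(p, q):
    # square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((p - q) ** 2))

def manhattan(p, q):
    # sum of absolute coordinate differences
    return np.sum(np.abs(p - q))

def chebyshev(p, q):
    # maximum absolute coordinate difference
    return np.max(np.abs(p - q))

def minkowski(p, q, power):
    # generalized metric: power=1 gives Manhattan, power=2 gives Euclidean
    return np.sum(np.abs(p - q) ** power) ** (1.0 / power)

def mahalanobis(p, q, cov):
    # accounts for the covariance S of the underlying distribution
    diff = p - q
    return np.sqrt(diff @ np.linalg.inv(cov) @ diff)

p, q = np.array([11.0, 4.0]), np.array([6.0, 8.0])
print(euclidean(p, q))     # 6.40
print(manhattan(p, q))     # 9.0
print(chebyshev(p, q))     # 5.0
print(minkowski(p, q, 2))  # 6.40, same as Euclidean
```

Note how minkowski(p, q, 1) and minkowski(p, q, 2) reproduce the Manhattan and Euclidean values, matching the special cases listed above.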

Working of K-means Clustering- The working of K-means clustering can be summarized as follows:
Step 1- Initialize K random centroids (there are two common strategies for this):

i. Pick K random data points and use them as the starting centroids.

ii. Choose K random values for each variable.

Step 2- For each data point, calculate its distance from each of the K centroids $C_i$ and assign the point to the cluster whose centroid is nearest.

Step 3- Update each centroid by averaging the data points newly assigned to its cluster:

$$V_i = \frac{1}{|C_i|}\sum_{j=1}^{|C_i|} x_j$$

where $x_j$ are the data points within cluster $C_i$ and $V_i$ is the centroid vector of cluster $C_i$.

Step 4- Repeat the above process for a given number of iterations, or until the centroid assignments no longer change.

The algorithm is said to have converged once the centroid values no longer change. The following figure illustrates the process described above.

Figure: The K-means process: (a) raw data, (b) choosing the number of clusters, (c) assigning random centroids, (d) calculating similarity, (e) updating centroids, (f) final result.

Objective of clustering-

The objective of clustering is to minimize the distance between each data point and its centroid, which can be expressed as the squared error term:
$$\min J = \sum_{j=1}^{k}\sum_{i=1}^{n}\left\| x_i^{(j)} - C_j \right\|^2$$
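
As a concrete illustration, below is a minimal NumPy sketch of Steps 1 to 4 together with the objective J. The function kmeans and its parameters are illustrative names, empty clusters are not handled, and scikit-learn's KMeans should be preferred in practice.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1 (strategy i): pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: distance of every point to every centroid; nearest wins
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # objective J: total squared distance of the points to their centroids
    J = float(np.sum((X - centroids[labels]) ** 2))
    return labels, centroids, J
```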

Choosing the number of clusters (value of k)-

There is no exact method for determining the value of k, but a quick rule of thumb can be used as a heuristic for the maximum number of clusters: $k = \sqrt{N/2}$, where N is the number of data points. For example, with N = 200 data points, the number of clusters will be:

$$k = \sqrt{\frac{200}{2}} = \sqrt{100} = 10$$

Hence, at most 10 clusters would be formed for 200 data points.


There are other techniques that can be used to find an approximate or optimal value of k. These include:
1. The Elbow method
2. The Silhouette method
3. Information-theoretic approaches (the jump method, information criteria)
This note discusses the Elbow method (the most popular) and the Silhouette method for determining the optimal number of clusters.
Elbow method-
The Elbow method is the most popular and well-known method for finding the optimal number of clusters, i.e., the value of k. It is based on plotting the value of the cost function (distortion) against different values of k. As the number of clusters k increases, each cluster contains fewer points, and those points lie closer to their centroids, so the average distortion decreases. The point where the rate of decline in distortion changes most sharply is called the elbow point, and it defines the optimal number of clusters for the dataset.

Figure 4 Elbow method

As is clear from the figure above, the distortion declines most sharply at k = 3, so the optimal value of k for clustering is 3. In other words, the plot looks like an arm with an elbow at k = 3.
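
A sketch of the elbow method using scikit-learn is shown below; the make_blobs data is synthetic and purely illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

k_values = range(1, 10)
distortions = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    distortions.append(model.inertia_)  # within-cluster sum of squares

plt.plot(k_values, distortions, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Distortion (inertia)")
plt.show()  # look for the elbow in the curve, here at k = 3
```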

Silhouette Method-

The Silhouette method is another way to determine the optimal number of clusters for a given dataset. The silhouette coefficient measures how similar an observation is to its own cluster compared with other clusters. It ranges from -1 to 1: a value of 1 indicates that an observation is far from its neighbouring cluster and close to its own, whereas -1 indicates that an observation is closer to a neighbouring cluster than to its own. A value of 0 indicates that the observation lies on the boundary between two clusters. The silhouette coefficient is defined as:

$$s(i) = \begin{cases} 1 - \dfrac{a(i)}{b(i)} & \text{if } a(i) < b(i) \\ 0 & \text{if } a(i) = b(i) \\ \dfrac{b(i)}{a(i)} - 1 & \text{if } a(i) > b(i) \end{cases}$$

where $a(i)$ is the mean distance from observation i to the other points in its own cluster and $b(i)$ is the mean distance from observation i to the points in its nearest neighbouring cluster.

Figure 5 Silhouette coefficient
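
The silhouette method can be sketched with scikit-learn's silhouette_score, which averages s(i) over all points; the make_blobs data is again illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# silhouette analysis needs at least 2 clusters
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"k={k}: average silhouette = {silhouette_score(X, labels):.3f}")
# the k with the highest average silhouette coefficient is preferred
```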

Advantages of K-means clustering:

- Easy to implement.
- Scales well to large datasets.
- Guaranteed to converge.
- Easily adapts to new examples.

Disadvantages of K-means clustering:

- It is difficult to predict the number of clusters in advance.
- Initialization of the cluster centres is crucial and can strongly affect the result.
- K-means clustering handles only numeric data.

Example-
Suppose we are given the following dataset for clustering.

X Y
11 4
12 3
10 5
4 10
4 8
6 8

Step 1- Choose two random centres for clustering: C1(6, 8) and C2(10, 5).
Step 2- For each data point, calculate the distance from each cluster centroid and assign the point to the cluster with the minimum distance:

X   Y   C1(6,8)   C2(10,5)   Label
11  4   6.40      1.41       C2
12  3   7.81      2.83       C2
10  5   5.00      0.00       C2
4   10  2.83      7.81       C1
4   8   2.00      6.71       C1
6   8   0.00      5.00       C1

Step 3- Update each centroid by averaging the data points newly assigned to its cluster:
$$C_1 = \left(\frac{4 + 4 + 6}{3}, \frac{10 + 8 + 8}{3}\right) = (4.67, 8.67)$$
$$C_2 = \left(\frac{11 + 12 + 10}{3}, \frac{4 + 3 + 5}{3}\right) = (11, 4)$$

Step 4- Repeat the above steps with the new centroids C1(4.67, 8.67) and C2(11, 4).
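
The worked example can be verified with a short NumPy script (an illustrative sketch, not part of the original note):

```python
import numpy as np

# the data points and the Step 1 centroids C1(6, 8) and C2(10, 5)
X = np.array([[11, 4], [12, 3], [10, 5], [4, 10], [4, 8], [6, 8]], dtype=float)
centroids = np.array([[6, 8], [10, 5]], dtype=float)

for iteration in range(2):
    # Step 2: Euclidean distance of each point to each centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)  # 0 means C1, 1 means C2
    # Step 3: recompute each centroid as the mean of its assigned points
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(2)])
    print(f"iteration {iteration + 1}:", centroids.round(2))
# the first iteration yields C1 = (4.67, 8.67) and C2 = (11, 4), matching
# Step 3 above; the second leaves the centroids unchanged (convergence)
```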

*******
