Machine Learning Notes Anna University

K-Means Clustering Algorithm

K-Means Clustering is an unsupervised learning algorithm used to solve clustering problems in machine learning and data science. In this
topic, we will learn what the K-Means clustering algorithm is, how it works, and how to implement it in Python.
What is the K-Means Algorithm?
K-Means Clustering is an unsupervised learning algorithm that groups an unlabeled dataset into different clusters. Here K defines the number
of pre-defined clusters to be created in the process: if K=2, there will be two clusters; for K=3, there will be three clusters; and so on.
It is an iterative algorithm that divides the unlabeled dataset into K different clusters in such a way that each data point belongs to only one group
of points with similar properties.
It lets us cluster the data into different groups and is a convenient way to discover the categories of groups in an unlabeled dataset on its own,
without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of the algorithm is to minimize the sum of
distances between the data points and their corresponding cluster centroids.
The algorithm takes the unlabeled dataset as input, divides it into K clusters, and repeats the process until it can no longer improve the
clusters. The value of K must be chosen in advance.

The k-means clustering algorithm mainly performs two tasks:


Determines the best value for the K center points (centroids) through an iterative process.
Assigns each data point to its closest K-center. The data points that are near a particular K-center form a cluster, so each cluster contains
data points with some commonalities and is well separated from the other clusters.
How does the K-Means Algorithm Work? The working of the K-Means algorithm is explained in the steps below (a minimal Python sketch follows them):
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as the initial centroids. (They need not come from the input dataset.)
Step-3: Assign each data point to its closest centroid, which forms the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster (i.e., move each centroid to the mean of its cluster).
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid.
Step-6: If any reassignment occurs, go back to step 4; otherwise, go to FINISH.
Step-7: The model is ready.
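A minimal NumPy sketch of these steps (the toy data, variable names, and stopping test are illustrative assumptions, not values from the notes):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-Means on an (n_samples, n_features) array X."""
    rng = np.random.default_rng(seed)
    # Step 2: pick K random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign every point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its cluster
        # (keep the old centroid if a cluster happens to be empty)
        new_centroids = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
        # Step 6: stop once no centroid moves any more
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two variables M1 and M2, K=2 (toy data)
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.0, 0.6],
              [5.0, 8.0], [8.0, 8.0], [9.0, 11.0]])
centroids, labels = kmeans(X, k=2)
print(centroids, labels)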
Let's understand the above steps by considering the visual plots:
Suppose we have two variables, M1 and M2. The x-y scatter plot of these two variables is given below. Let's take the number of clusters as
K=2, i.e., we will try to group this dataset into two different clusters.
We need to choose K random points or centroids to form the clusters. These points can either be points from the dataset or any other points;
here we select two points that are not part of our dataset as the K points. Consider the image below:
Now we assign each data point of the scatter plot to its closest K-point or centroid. We compute this with the usual formula for the distance
between two points, and draw a median line between the two centroids. Consider the image below:

From the image it is clear that the points on the left side of the line are nearer to the K1 (blue) centroid, and the points on the right side of the
line are closer to the yellow centroid. Let's color them blue and yellow for clear visualization. As we need to find the closest clusters, we repeat
the process by choosing new centroids: we compute the center of gravity of each current cluster and place the new centroid there.
Next, we reassign each data point to its new closest centroid by repeating the same process of drawing a median line.
Hierarchical clustering
Hierarchical clustering is a connectivity-based clustering model that groups together data points that are close to each other, based on a
measure of similarity or distance. The assumption is that data points that are close to each other are more similar or related than data points that are
farther apart.
A dendrogram, a tree-like figure produced by hierarchical clustering, depicts the hierarchical relationships between groups. Individual data points
are located at the bottom of the dendrogram, while the largest clusters, which include all the data points, are located at the top. In order to generate
different numbers of clusters, the dendrogram can be sliced at various heights.
The dendrogram is created by iteratively merging or splitting clusters based on a measure of similarity or distance between data points. Clusters
are divided or merged repeatedly until all data points are contained within a single cluster, or until the predetermined number of clusters is
attained.
To estimate the ideal number of clusters, we can look at the dendrogram and find the height at which its branches separate into distinct
clusters. Cutting the dendrogram at this height then gives that number of clusters.
Types of Hierarchical Clustering
Basically, there are two types of hierarchical clustering:
1. Agglomerative clustering
2. Divisive clustering
Hierarchical Agglomerative Clustering
It is also known as the bottom-up approach or hierarchical agglomerative clustering (HAC). It produces a structure that is more informative than the
unstructured set of clusters returned by flat clustering, and it does not require us to prespecify the number of clusters. Bottom-up
algorithms treat each data point as a singleton cluster at the outset and then successively agglomerate pairs of clusters until all clusters have been
merged into a single cluster that contains all the data.
Steps:
• Consider each letter (data point) as a single cluster and calculate the distance from each cluster to all the other clusters.
• In the second step, comparable clusters are merged into a single cluster. Say cluster (B) and cluster (C) are very similar to each
other, so we merge them; similarly for clusters (D) and (E). We are left with the clusters [(A), (BC), (DE), (F)].
• We recalculate the proximities according to the algorithm and merge the two nearest clusters, (DE) and (F), to form the new clusters [(A),
(BC), (DEF)].
• Repeating the same process, the clusters (DEF) and (BC) are comparable and are merged into a new cluster. We are now left with the clusters
[(A), (BCDEF)].
• At last, the two remaining clusters are merged into a single cluster [(ABCDEF)].
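The merge sequence described above can be reproduced with SciPy. The six 2-D coordinates for A-F below are invented purely so that the merges happen in roughly that order:

import numpy as np
from scipy.cluster.hierarchy import linkage

# Six made-up points labelled A..F
points = {"A": [-5.0, -5.0], "B": [4.0, 4.0], "C": [4.2, 4.1],
          "D": [9.0, 9.0], "E": [9.1, 9.2], "F": [8.0, 8.5]}
X = np.array(list(points.values()))

# Bottom-up (agglomerative) clustering with single linkage
Z = linkage(X, method="single")

# Each row of Z is one merge: (cluster i, cluster j, distance, new cluster size).
# Indices 0..5 are the original points A..F; indices 6, 7, ... are merged clusters.
for step, (i, j, dist, size) in enumerate(Z, start=1):
    print(f"step {step}: merge {int(i)} and {int(j)} at distance {dist:.2f} (size {int(size)})")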
Hierarchical Divisive Clustering
It is also known as the top-down approach. This algorithm also does not require us to prespecify the number of clusters. Top-down clustering requires a
method for splitting a cluster that contains the whole data, and it proceeds by splitting clusters recursively until individual data points have been
separated into singleton clusters.
Computing the Distance Matrix
While merging two clusters, we check the distance between every pair of clusters and merge the pair with the least distance (greatest similarity).
But how is that distance determined? There are different ways of defining the inter-cluster distance/similarity. Some of them are:
1. Min distance: the minimum distance between any two points of the two clusters.
2. Max distance: the maximum distance between any two points of the two clusters.
3. Group average: the average distance between every pair of points from the two clusters.
4. Ward's method: the similarity of two clusters is based on the increase in squared error when the two clusters are merged.
For example, if we group a given dataset using different methods, we may get different results:
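The four criteria above correspond to SciPy's "single", "complete", "average", and "ward" linkage methods. A small sketch (toy data, purely illustrative) cuts each hierarchy into two clusters so the assignments can be compared; depending on the data, the methods may or may not agree:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2.0, 0.5],
              [3.4, 0], [3.4, 1], [4.4, 0], [4.4, 1]])

for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)                     # build the hierarchy with this criterion
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut it into 2 flat clusters
    print(f"{method:>8}: {labels}")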
Hierarchical Agglomerative vs Divisive Clustering
• Divisive clustering is more complex than agglomerative clustering: in divisive clustering we need a flat clustering
method as a "subroutine" to split each cluster until every data point has its own singleton cluster.
• Divisive clustering can be more efficient if we do not generate a complete hierarchy all the way down to individual data leaves. The time complexity
of naive agglomerative clustering is O(n³), because we exhaustively scan the N x N distance matrix (dist_mat) for the lowest distance in each of the N-1
iterations. Using a priority queue data structure, this can be reduced to O(n² log n), and with some further optimizations it can be brought
down to O(n²). For divisive clustering, given a fixed number of top levels and using an efficient flat algorithm such as K-Means, divisive
algorithms are linear in the number of patterns and clusters.
• A divisive algorithm can also be more accurate. Agglomerative clustering makes decisions by considering local patterns or neighbouring points
without initially taking the global distribution of the data into account, and these early decisions cannot be undone. Divisive clustering, in
contrast, takes the global distribution of the data into consideration when making top-level partitioning decisions.
Mean-Shift Clustering
Mean shift is a clustering algorithm in unsupervised learning that assigns data points to clusters iteratively by shifting points towards the mode
(in the context of mean shift, the mode is the region with the highest density of data points). As such, it is also known as the mode-seeking algorithm.
The mean-shift algorithm has applications in image processing and computer vision. Unlike the popular K-Means clustering algorithm, mean
shift does not require the number of clusters to be specified in advance; the number of clusters is determined by the algorithm from the data.
The mean-shift clustering algorithm can be summarized as follows (a minimal sketch is given after the steps):
1. Initialize the data points as cluster centroids.
2. Repeat the following until convergence or until a maximum number of iterations is reached:
   • For each data point, calculate the mean of all points within a certain radius (i.e., the "kernel") centered at the data point.
   • Shift the data point to that mean.
3. Identify the cluster centroids as the points that no longer move after convergence.
4. Return the final cluster centroids and the assignment of data points to clusters.
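A from-scratch sketch of these steps using a flat kernel (the radius value and the toy data are assumptions; scikit-learn's MeanShift provides a more complete implementation):

import numpy as np

def mean_shift(X, radius=2.0, max_iter=100, tol=1e-4):
    """Flat-kernel mean shift: each point is shifted to the mean of its neighbours."""
    shifted = X.copy()
    for _ in range(max_iter):
        new = np.empty_like(shifted)
        for i, p in enumerate(shifted):
            # all original points within `radius` of the current position (the "kernel")
            neighbours = X[np.linalg.norm(X - p, axis=1) <= radius]
            new[i] = neighbours.mean(axis=0)
        converged = np.max(np.linalg.norm(new - shifted, axis=1)) < tol
        shifted = new
        if converged:
            break
    # points that converged to (nearly) the same location form one cluster
    centroids = np.unique(np.round(shifted, 3), axis=0)
    labels = np.array([np.argmin(np.linalg.norm(centroids - p, axis=1)) for p in shifted])
    return centroids, labels

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
              [8.0, 8.0], [8.3, 7.9], [7.8, 8.2]])
centroids, labels = mean_shift(X, radius=2.0)
print(centroids, labels)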
One of the main advantages of mean-shift clustering is that it does not require the number of clusters to be specified beforehand.
It also does not make any assumptions about the distribution of the data, and can handle arbitrary shapes and sizes of clusters. However, it can be
sensitive to the choice of kernel and the radius of the kernel.
Mean-Shift clustering can be applied to various types of data, including image and video processing, object tracking and bioinformatics.
Kernel Density Estimation
The first step when applying the mean-shift clustering algorithm is to represent your data mathematically, i.e., as a set of points such as the
one shown below.

Mean shift builds upon the concept of kernel density estimation (KDE). Imagine that the above data was sampled from a probability
distribution. KDE is a method for estimating the underlying distribution (the probability density function) of a set of data. It works by
placing a kernel on each point in the data set.
A kernel is a fancy mathematical word for a weighting function generally used in convolution. There are many different types of kernels, but the
most popular one is the Gaussian kernel. Adding up all of the individual kernels generates a probability surface, i.e., a density function.
Depending on the kernel bandwidth parameter used, the resulting density function will vary.
Below is the KDE surface for our points above using a Gaussian kernel with a kernel bandwidth of 2.

[Figure: surface plot and contour plot of the KDE]
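A minimal sketch of how such a density surface can be computed, assuming scikit-learn's KernelDensity and made-up sample points (the Gaussian kernel and bandwidth of 2 follow the text):

import numpy as np
from sklearn.neighbors import KernelDensity

# Made-up 2-D points standing in for the data set shown above
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
              [6.0, 6.5], [6.5, 6.0], [7.0, 7.0]])

# Place a Gaussian kernel with bandwidth 2 on every point and sum them
kde = KernelDensity(kernel="gaussian", bandwidth=2.0).fit(X)

# Evaluate the estimated density on a grid: this is the KDE "surface"
xs, ys = np.meshgrid(np.linspace(-2, 10, 50), np.linspace(-2, 10, 50))
grid = np.column_stack([xs.ravel(), ys.ravel()])
density = np.exp(kde.score_samples(grid)).reshape(xs.shape)
print(density.shape)  # 50 x 50 grid of density values; plot it for the surface/contour view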
