ML Unit 4 Part A Material
ASSOCIATE PROFESSOR
DEPT OF CSE
SIRCRRCOE
Clustering:
Cluster analysis, or simply clustering, is an unsupervised learning method that divides
the data points into a number of groups or batches, such that data points in the same
group have similar properties and data points in different groups have dissimilar
properties in some sense.
E.g. K-Means (distance between points), Affinity propagation (graph distance), Mean-
shift (distance between points), DBSCAN (distance between nearest points), Gaussian
mixtures (Mahalanobis distance to centers), Spectral clustering (graph distance) etc.
Fundamentally, all clustering methods use the same approach: first we calculate
similarities (or distances) between data points, and then we use them to group the data
points into clusters or batches.
K-Means:
K-Means Clustering is an unsupervised learning algorithm that groups an unlabeled dataset
into different clusters. Here K defines the number of pre-defined clusters to be created
in the process; for example, if K=2 there will be two clusters, if K=3 there will be three
clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in
such a way that each data point belongs to only one group, and the points within a group
share similar properties.
The algorithm takes the unlabeled dataset as input, divides it into k clusters, and repeats
the process until the cluster assignments no longer change. The value of k must be
specified in advance for this algorithm.
➢ Assigns each data point to its closest k-center; the data points that are near a
particular k-center form a cluster.
Hence each cluster contains data points with some commonalities, and it is kept apart from
the other clusters.
The following steps explain the working of the K-means clustering algorithm (a short
scikit-learn sketch is given after the steps):
Step-1: Select the number K to decide how many clusters are to be formed.
Step-2: Select K random points as initial centroids (they need not come from the input dataset).
Step-3: Assign each data point to its closest centroid, which forms the predefined
K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster (the mean of the
points assigned to it).
Step-5: Repeat the third step, i.e. reassign each data point to the new closest
centroid of its cluster.
Step-6: If any reassignment occurred, go back to Step-4; otherwise the clusters are final.
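As an added illustration (not part of the original notes), here is a minimal scikit-learn sketch of these steps; the synthetic make_blobs data and the choice K=3 are assumptions made only for the demo:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three natural groups (assumption for the demo)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)  # K is fixed in advance (Step-1)
labels = kmeans.fit_predict(X)       # Steps 2-6: assignment/update repeated until convergence
print(kmeans.cluster_centers_)       # final centroids
print(labels[:10])                   # cluster index of the first 10 points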
The Elbow method is one of the most popular ways to find the optimal number of
clusters. This method uses the concept of WCSS value. WCSS stands for Within Cluster
Sum of Squares, which defines the total variations within a cluster. The formula to
calculate the value of WCSS (for 3 clusters) is given below:
WCSS = ∑(Pi in Cluster1) distance(Pi, C1)² + ∑(Pi in Cluster2) distance(Pi, C2)² + ∑(Pi in Cluster3) distance(Pi, C3)²
In the above formula of WCSS,
∑(Pi in Cluster1) distance(Pi, C1)²: this is the sum of the squared distances between each
data point in Cluster1 and its centroid C1; the other two terms are defined analogously for
Cluster2 and Cluster3.
To measure the distance between data points and centroid, we can use any method such
as Euclidean distance or Manhattan distance.
To find the optimal number of clusters, the elbow method follows the steps below:
➢ It runs K-means clustering on the given dataset for different K values
(typically ranging from 1 to 10).
➢ For each value of K, it calculates the WCSS value.
➢ It plots a curve between the calculated WCSS values and the number of clusters K.
➢ The sharp point of bend, where the plot looks like an arm, is taken as the best
value of K.
Since the graph shows a sharp bend that looks like an elbow, the method is known as
the elbow method. (Figure: the elbow-method graph, plotting WCSS against the number of clusters K.)
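A minimal sketch of the elbow method with scikit-learn (added here, not from the original notes); KMeans exposes the WCSS of a fitted model through its inertia_ attribute, and the make_blobs data is assumed only for illustration:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

wcss = []
for k in range(1, 11):                       # K values from 1 to 10
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)                 # inertia_ is the WCSS of the fitted model

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.show()                                   # look for the 'elbow' in the curve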
Example:
Cluster the following eight points (with (x, y) representing locations) into three clusters:
A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
The distance function between two points a = (x1, y1) and b = (x2, y2) is the Manhattan distance:
Ρ(a, b) = |x2 – x1| + |y2 – y1|
Iteration-01:
➢ We calculate the distance of each point from each of the centers of the three clusters.
➢ The distance is calculated by using the given distance function.
The following illustration shows the calculation of the distance between point A1(2, 10) and
each of the three cluster centers:
Ρ(A1, C1)
= |x2 – x1| + |y2 – y1|
= |2 – 2| + |10 – 10|
=0
Ρ(A1, C2)
= |x2 – x1| + |y2 – y1|
= |5 – 2| + |8 – 10|
=3+2
=5
Ρ(A1, C3)
= |x2 – x1| + |y2 – y1|
= |1 – 2| + |2 – 10|
=1+8
=9
In a similar manner, we calculate the distance of the other points from each of the three
cluster centers.
Next,
➢ We draw a table showing all the results.
➢ Using the table, we decide which point belongs to which cluster: a point belongs to the
cluster whose center is nearest to it.
Given point | Distance from C1(2, 10) | Distance from C2(5, 8) | Distance from C3(1, 2) | Assigned cluster
A1(2, 10)   | 0  | 5  | 9  | C1
A2(2, 5)    | 5  | 6  | 4  | C3
A3(8, 4)    | 12 | 7  | 9  | C2
A4(5, 8)    | 5  | 0  | 10 | C2
A5(7, 5)    | 10 | 5  | 9  | C2
A6(6, 4)    | 10 | 5  | 7  | C2
A7(1, 2)    | 9  | 10 | 0  | C3
A8(4, 9)    | 3  | 2  | 10 | C2
Cluster-01: A1(2, 10)
Cluster-02: A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A8(4, 9)
Cluster-03: A2(2, 5), A7(1, 2)
Now,
➢ We re-compute the new cluster centers.
➢ The new cluster center is computed by taking the mean of all the points contained in that cluster.
For Cluster-01:
Center of Cluster-01
= (2, 10)
(Cluster-01 contains only the point A1, so its center remains unchanged.)
For Cluster-02:
Center of Cluster-02
= ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5)
= (6, 6)
For Cluster-03:
Center of Cluster-03
= ((2 + 1)/2, (5 + 2)/2)
= (1.5, 3.5)
Iteration-02:
➢ We calculate the distance of each point from each of the centers of the three clusters.
➢ The distance is calculated by using the given distance function.
The following illustration shows the calculation of the distance between point A1(2, 10) and each of the
three cluster centers:
Ρ(A1, C1)
= |x2 – x1| + |y2 – y1|
= |2 – 2| + |10 – 10|
=0
Ρ(A1, C2)
= |x2 – x1| + |y2 – y1|
= |6 – 2| + |6 – 10|
=4+4
=8
Ρ(A1, C3)
= |x2 – x1| + |y2 – y1|
= |1.5 – 2| + |3.5 – 10|
= 0.5 + 6.5
=7
In a similar manner, we calculate the distance of the other points from each of the three
cluster centers.
Next,
➢ We draw a table showing all the results.
➢ Using the table, we decide which point belongs to which cluster.
➢ The given point belongs to that cluster whose center is nearest to it.
Given point | Distance from C1(2, 10) | Distance from C2(6, 6) | Distance from C3(1.5, 3.5) | Assigned cluster
A1(2, 10)   | 0  | 8 | 7 | C1
A2(2, 5)    | 5  | 5 | 2 | C3
A3(8, 4)    | 12 | 4 | 7 | C2
A4(5, 8)    | 5  | 3 | 8 | C2
A5(7, 5)    | 10 | 2 | 7 | C2
A6(6, 4)    | 10 | 2 | 5 | C2
A7(1, 2)    | 9  | 9 | 2 | C3
A8(4, 9)    | 3  | 5 | 8 | C1
Cluster-01: A1(2, 10), A8(4, 9)
Cluster-02: A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4)
Cluster-03: A2(2, 5), A7(1, 2)
Now,
• We re-compute the new cluster centers.
• The new cluster center is computed by taking the mean of all the points contained in that cluster.
For Cluster-01:
Center of Cluster-01
= ((2 + 4)/2, (10 + 9)/2)
= (3, 9.5)
For Cluster-02:
Center of Cluster-02
= ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4)
= (6.5, 5.25)
For Cluster-03:
Center of Cluster-03
= ((2 + 1)/2, (5 + 2)/2)
= (1.5, 3.5)
After the second iteration, the updated cluster centers are:
• C1(3, 9.5)
• C2(6.5, 5.25)
• C3(1.5, 3.5)
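The hand computation above can be reproduced with a short script (an added sketch, not part of the original notes); the manhattan helper and the loop structure are assumptions made only for illustration:

import numpy as np

points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
          "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9)}
centers = [(2, 10), (5, 8), (1, 2)]          # initial centers: A1, A4, A7

def manhattan(a, b):
    # distance function used in the example: |x2 - x1| + |y2 - y1|
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

for iteration in range(2):
    # assignment step: each point goes to its nearest center
    clusters = {i: [] for i in range(3)}
    for name, p in points.items():
        nearest = min(range(3), key=lambda i: manhattan(p, centers[i]))
        clusters[nearest].append(p)
    # update step: each center becomes the mean of the points assigned to it
    centers = [tuple(np.mean(clusters[i], axis=0)) for i in range(3)]
    print(f"Iteration {iteration + 1}: centers = {centers}")

Running this prints (2, 10), (6, 6), (1.5, 3.5) after the first iteration and (3, 9.5), (6.5, 5.25), (1.5, 3.5) after the second, matching the manual calculation.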
Limits of K-Means:
The number of clusters k must be specified in advance, and K-Means (like other partitioning
methods) is suitable mainly for compact, well-separated, convex clusters; it is also affected
by noise and outliers in the data (see the DBSCAN section below).
Using Clustering for Semi-Supervised Learning:
Another use case for clustering is in semi-supervised learning, when we have plenty of
unlabeled instances and very few labeled instances. Let’s train a logistic regression
model on a sample of 50 labeled instances from the digits dataset:
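The snippet below assumes X_train, y_train, X_test and y_test already exist; a minimal setup sketch (the loading and split shown here are assumptions, since the notes do not show them) could be:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Load the 8x8 digit images and hold out a test set (split ratio assumed)
X_digits, y_digits = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X_digits, y_digits, random_state=42)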
from sklearn.linear_model import LogisticRegression

n_labeled = 50  # use only the first 50 labeled training instances
log_reg = LogisticRegression()
log_reg.fit(X_train[:n_labeled], y_train[:n_labeled])
What is the performance of this model on the test set?
>>> log_reg.score(X_test, y_test)
0.8266666666666667
The accuracy is just 82.7%: it should come as no surprise that this is much lower than
earlier, when we trained the model on the full training set. Let’s see how we can do
better. First, let’s cluster the training set into 50 clusters, then for each cluster let’s find
the image closest to the centroid. We will call these images the representative images:
import numpy as np
from sklearn.cluster import KMeans

k = 50
kmeans = KMeans(n_clusters=k)
# fit_transform returns, for each training instance, its distance to every centroid
X_digits_dist = kmeans.fit_transform(X_train)
# index of the training instance closest to each centroid
representative_digit_idx = np.argmin(X_digits_dist, axis=0)
X_representative_digits = X_train[representative_digit_idx]
A figure (not reproduced in this material) shows these 50 representative images.
DBSCAN:
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise; it is a
density-based clustering method.
Clusters are dense regions in the data space, separated by regions of the lower density
of points. The DBSCAN algorithm is based on this intuitive notion of “clusters” and
“noise”. The key idea is that for each point of a cluster, the neighborhood of a given
radius has to contain at least a minimum number of points.
Partitioning methods (K-means, PAM clustering) and hierarchical clustering work for
finding spherical-shaped clusters or convex clusters. In other words, they are suitable
only for compact and well-separated clusters. Moreover, they are also severely
affected by the presence of noise and outliers in the data.
Consider a data set containing nonconvex clusters and outliers/noise (the figure is not
reproduced here). Given such data, the k-means algorithm has difficulty identifying these
arbitrarily shaped clusters.
The DBSCAN algorithm requires two parameters (a short usage sketch follows this list):
1. eps: defines the neighborhood around a data point, i.e. if the distance between
two points is less than or equal to eps, they are considered neighbors. If eps is
chosen too small, a large part of the data will be treated as outliers; if it is
chosen too large, clusters will merge and the majority of the data points will end
up in the same cluster. One way to choose eps is based on the k-distance graph.
2. MinPts: the minimum number of neighbors (data points) within the eps radius. The
larger the dataset, the larger the value of MinPts that should be chosen. As a general
rule, MinPts can be derived from the number of dimensions D in the dataset as
MinPts >= D + 1, and it should be at least 3.
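A minimal scikit-learn sketch of DBSCAN on nonconvex data (added for illustration; the make_moons data and the eps/min_samples values are assumptions, not values from the notes):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: nonconvex clusters that K-Means handles poorly
X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

dbscan = DBSCAN(eps=0.2, min_samples=5)   # eps neighborhood radius, MinPts = 5
labels = dbscan.fit_predict(X)            # label -1 marks points treated as noise
print(set(labels))                        # e.g. {0, 1}, plus -1 for any outliers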
Gaussian Mixtures:
A Gaussian mixture model (GMM) is a probabilistic model that assumes that the
instances were generated from a mixture of several Gaussian distributions whose
parameters are unknown. All the instances generated from a single Gaussian distribution
form a cluster that typically looks like an ellipsoid. Each cluster can have a different
ellipsoidal shape, size, density and orientation. When you observe
an instance, you know it was generated from one of the Gaussian distributions, but you
are not told which one, and you do not know what the parameters of these distributions
are. There are several GMM variants: in the simplest variant, implemented in scikit-learn's
GaussianMixture class, you must know in advance the number k of Gaussian
distributions.
The dataset X is assumed to have been generated through the following probabilistic
process:
• For each instance, a cluster is picked randomly from among the k clusters. The probability of
choosing the jth cluster is defined by the cluster’s weight ϕ(j). The index of the cluster
chosen for the ith instance is denoted z(i).
• If z(i)=j, meaning the ith instance has been assigned to the jth cluster, the location x(i)
of this instance is sampled randomly from the Gaussian distribution with mean μ(j) and
covariance matrix Σ(j).
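The two steps above can be mimicked with a few lines of NumPy (an added sketch; the weights, means and covariances below are made-up values used only to illustrate the sampling process):

import numpy as np

rng = np.random.default_rng(42)

phi = [0.2, 0.5, 0.3]                                   # cluster weights, sum to 1
mus = [np.array([0, 0]), np.array([5, 5]), np.array([0, 8])]
sigmas = [np.eye(2), np.diag([2.0, 0.5]), np.eye(2) * 0.3]

samples = []
for _ in range(1000):
    z = rng.choice(3, p=phi)                            # pick cluster j with probability phi(j)
    x = rng.multivariate_normal(mus[z], sigmas[z])      # sample x from N(mu(j), Sigma(j))
    samples.append(x)
X = np.array(samples)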
This generative process can be represented as a graphical model (the figure is not reproduced
here), i.e. a graph which represents the structure of the conditional dependencies between
random variables.
• The squiggly arrow from z(i) to x(i) represents a switch: depending on the value of
z(i), the instance x(i) will be sampled from a different Gaussian distribution. For
example, if z(i) = j, then x(i) ∼ N(μ(j), Σ(j)).
• Shaded nodes indicate that the value is known, so in this case only the random
variables x(i) have known values: they are called observed variables. The unknown
random variables z(i) are called latent variables.
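A minimal sketch of fitting this simplest variant with scikit-learn (added for illustration; the synthetic make_blobs data and the choice k=3 are assumptions):

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data drawn from three blobs (assumption for the demo)
X, _ = make_blobs(n_samples=500, centers=3, random_state=42)

gm = GaussianMixture(n_components=3, n_init=10, random_state=42)  # k chosen in advance
gm.fit(X)

print(gm.weights_)        # estimated cluster weights phi(j)
print(gm.means_)          # estimated means mu(j)
print(gm.predict(X[:5]))  # hard cluster assignments (most likely z for each instance)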