ML Unit-5
• The Euclidean distance formula gives the length of the line segment joining two points. Let us assume two points, (x1, y1) and (x2, y2), in the two-dimensional coordinate plane.
• Thus, the Euclidean distance formula is given by:
d = √((x2 − x1)² + (y2 − y1)²)
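As a quick check of the formula, here is a minimal Python sketch; the two points are made up for illustration:

```python
import math

# Two example points (x1, y1) and (x2, y2); values chosen arbitrarily
p1 = (1.0, 2.0)
p2 = (4.0, 6.0)

# Euclidean distance: sqrt((x2 - x1)^2 + (y2 - y1)^2)
d = math.sqrt((p2[0] - p1[0]) ** 2 + (p2[1] - p1[1]) ** 2)
print(d)  # 5.0, since the legs of the right triangle are 3 and 4
```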
We can choose the right number of clusters with the help of the Within-Cluster-Sum-of-Squares (WCSS) method. WCSS is the sum of the squared distances of the data points in each cluster from that cluster's centroid.
The main idea is to minimize the distance (e.g., the Euclidean distance) between the data points and the centroids of their clusters. The process is iterated until we reach a minimum value for the sum of distances.
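As a concrete illustration, the following sketch computes WCSS for a given assignment of points to two clusters using NumPy; the data points and the cluster labels are hypothetical:

```python
import numpy as np

# Hypothetical 2-D data points and an assumed assignment to 2 clusters
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
labels = np.array([0, 0, 1, 1])  # cluster index for each point

# Each centroid is the mean of the points assigned to it
centroids = np.array([X[labels == k].mean(axis=0) for k in (0, 1)])

# WCSS: sum of squared Euclidean distances of each point to its centroid
wcss = sum(((X[labels == k] - centroids[k]) ** 2).sum() for k in (0, 1))
print(wcss)
```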
Elbow Method
Here are the steps to follow in order to find the optimal number of clusters
using the elbow method:
Step 1: Execute K-means clustering on the given dataset for different K
values (ranging from 1 to 10).
Step 2: For each value of K, calculate the WCSS value.
Step 3: Plot a graph/curve between WCSS values and the respective number
of clusters K.
Step 4: The sharp bend in the plot (which looks like the elbow joint of an
arm) is taken as the best/optimal value of K.
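Here is a minimal sketch of these four steps, assuming scikit-learn and matplotlib are available; a fitted KMeans model stores its WCSS in the inertia_ attribute, and the dataset X below is made up for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)  # hypothetical dataset

wcss = []
for k in range(1, 11):  # Step 1: run K-means for K = 1..10
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # Step 2: inertia_ is the WCSS for this K

plt.plot(range(1, 11), wcss, marker="o")  # Step 3: WCSS vs. K
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.show()  # Step 4: read off the 'elbow' visually
```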
From the above image, it is clear that the points on the left side of the line
are near the K1 (blue) centroid, and the points to the right of the line are
close to the yellow centroid. Let's colour them blue and yellow for clear
visualization.
o As we need to find the closest cluster, we will repeat the process by
choosing new centroids. To choose the new centroids, we will compute
the center of gravity of the points in each cluster and move the centroids
there:
o Next, we will reassign each data point to the new centroids. For this, we
will repeat the same process of finding a median line. The median will be
as in the below image. From the above image, we can see that one yellow
point is on the left side of the line and two blue points are to the right of
the line, so these three points will be assigned to the new centroids.
o We can see in the above image that there are no dissimilar data points on
either side of the line, which means our model is formed. Consider the
below image:
As our model is ready, we can now remove the assumed centroids, and the two
final clusters will be as shown in the below image:
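The procedure walked through above (assume centroids, assign each point to its nearest centroid, recompute each centroid as the center of gravity of its points, and repeat until no point changes sides) can be sketched from scratch with NumPy. The function name and the synthetic two-blob dataset below are our own, for illustration only:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen data points as the assumed centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the center of gravity of its
        # points (this simple sketch assumes no cluster ever goes empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # no point changed side
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated synthetic blobs
X = np.vstack([np.random.rand(50, 2), np.random.rand(50, 2) + 3])
centroids, labels = kmeans(X, k=2)
print(centroids)
```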
o Scalability
Unsupervised learning algorithms handle large-scale datasets without
manual labelling, making them more scalable than supervised learning
in certain scenarios.
o Anomaly Detection
Unsupervised learning can effectively detect anomalies or outliers in
data, which is particularly useful for fraud detection, network security,
or identifying rare events.
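As one illustration, a model such as scikit-learn's IsolationForest can be fitted without any labels to flag outliers; the data below is synthetic:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Mostly normal points plus a few injected outliers (synthetic data)
X = np.vstack([np.random.randn(100, 2), [[8.0, 8.0], [-9.0, 7.0]]])

iso = IsolationForest(random_state=0).fit(X)
pred = iso.predict(X)  # +1 for inliers, -1 for detected anomalies
print(np.where(pred == -1)[0])  # indices of the flagged points
```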
o Data Preprocessing
Unsupervised learning techniques like dimensionality reduction can
help preprocess data by reducing noise, removing irrelevant features,
and improving efficiency in subsequent supervised learning tasks.
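For instance, PCA (a common dimensionality-reduction technique) can compress the features before a downstream task; a minimal scikit-learn sketch with hypothetical data:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 50)  # hypothetical high-dimensional features

# Keep enough components to explain ~95% of the variance, dropping the rest
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)
```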
Despite these advantages, unsupervised learning also has some drawbacks:
o Interpretability
Unsupervised learning algorithms often produce clusters or patterns
without explicit labels or explanations. Interpreting and understanding
the meaning of these clusters can be challenging and subjective.
o Overfitting and Model Selection
Without labels to validate against, unsupervised models can overfit
noise in the data, and selecting the right model or number of clusters
is difficult and often heuristic.
o Limited Guidance
Unlike supervised learning, where the algorithm learns from explicit
feedback, unsupervised learning lacks explicit guidance, which can
result in the algorithm discovering irrelevant or noisy patterns.
5.7. Difference between Supervised and Unsupervised learning
Step-2: K = 2
Generate candidate set C2 using L1 (this is called the join step). The
condition for joining Lk-1 with Lk-1 is that the two itemsets must have
(K-2) elements in common.
Check whether all subsets of each itemset are frequent, and if not, remove
that itemset (the prune step). (For example, the subsets of {I1, I2} are
{I1} and {I2}; both are frequent. Check this for each itemset.)
Now find the support count of these candidate itemsets by searching the dataset.
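A minimal Python sketch of this step, with hypothetical transactions, a hypothetical minimum support count, and L1 assumed to come from step 1; itemsets are stored as frozensets:

```python
from itertools import combinations

# Hypothetical transactions and minimum support count
transactions = [{"I1", "I2", "I5"}, {"I2", "I4"},
                {"I1", "I2", "I3"}, {"I1", "I3"}]
min_support = 2

# L1: frequent 1-itemsets (assumed already found in step 1)
L1 = [frozenset({"I1"}), frozenset({"I2"}), frozenset({"I3"})]

# Join step: union pairs of L(k-1) itemsets sharing K-2 items (K=2 -> 0 shared)
C2 = {a | b for a, b in combinations(L1, 2) if len(a & b) == 0}

# Prune step: all (K-1)-subsets must be frequent; for K=2 the subsets are
# the 1-itemsets, which are frequent by construction of L1.

# Support counting: scan the dataset for each candidate
L2 = [c for c in C2 if sum(c <= t for t in transactions) >= min_support]
print(L2)  # frequent 2-itemsets
```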
Step-3:
Step 2: Generating Association Rules
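Rules are generated from each frequent itemset by splitting it into an antecedent A and a consequent B, keeping splits whose confidence, support(A ∪ B) / support(A), meets a threshold. A hedged sketch, with hypothetical transactions and threshold:

```python
from itertools import combinations

transactions = [{"I1", "I2", "I5"}, {"I2", "I4"},
                {"I1", "I2", "I3"}, {"I1", "I3"}]
min_confidence = 0.6

def support(itemset):
    # Number of transactions containing the itemset
    return sum(itemset <= t for t in transactions)

# Split one frequent itemset into every non-empty antecedent A and
# consequent B = itemset - A, and keep the rules that are confident enough
itemset = frozenset({"I1", "I2"})
for r in range(1, len(itemset)):
    for antecedent in map(frozenset, combinations(itemset, r)):
        consequent = itemset - antecedent
        conf = support(itemset) / support(antecedent)
        if conf >= min_confidence:
            print(set(antecedent), "->", set(consequent), f"(conf={conf:.2f})")
```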