ML Unit 4 V1
MACHINE LEARNING

UNSUPERVISED LEARNING TECHNIQUES


BTECH III YEAR – II SEMESTER
Computer Science & Engineering

UNIT-IV
By

Dr.Satyabrata Dash
Professor
Department of Computer Science & Engineering
Ramachandra College of Engineering, Eluru

4/2/2023 by Dr.Satyabrata Dash 1


SYLLABUS
UNIT-IV
MACHINE LEARNING
Unsupervised Learning Techniques:
• Clustering,
• K-Means, Limits of K-Means,
• Using Clustering for Image Segmentation,
• Using Clustering for Preprocessing,
• Using Clustering for Semi-Supervised Learning,
• DBSCAN,
• Gaussian Mixtures.
Dimensionality Reduction:
• The Curse of Dimensionality,
• Main Approaches for Dimensionality Reduction,
• PCA,
• Using Scikit-Learn,
• Randomized PCA,
• Kernel PCA.




Introduction to Unsupervised Learning

1. Unsupervised learning differs from supervised learning: as its name suggests, it needs no
supervision.
2. In unsupervised machine learning, the machine is trained on an unlabeled dataset and
produces its output without any supervision.
3. The models are trained on data that is neither classified nor labelled, and the model acts
on that data without guidance.
4. The main aim of an unsupervised learning algorithm is to group or categorize the
unsorted dataset according to similarities, patterns, and differences.
5. The machine is instructed to find hidden patterns in the input dataset.
6. Unsupervised learning cannot be directly applied to a regression or classification problem
because, unlike supervised learning, we have the input data but no corresponding output data.
7. The goal of unsupervised learning is to find the underlying structure of the dataset, group the
data according to similarities, and represent the dataset in a compressed format.

Need for Unsupervised Learning


1. Unsupervised learning is helpful for finding useful insights in the data.
2. Unsupervised learning resembles the way a human learns to think from their own experiences,
which makes it closer to real AI.
3. Unsupervised learning works on unlabeled and uncategorized data, which makes it all the
more important.
4. In the real world, we do not always have input data with corresponding outputs;
unsupervised learning is needed to handle such cases.

Types of Unsupervised Learning Algorithm:

The unsupervised learning algorithm can be further categorized into two types of problems:

1. Clustering: Clustering is a method of grouping objects into clusters such that objects with
the most similarities remain in one group and have few or no similarities with the objects of
another group. Cluster analysis finds the commonalities between the data objects and categorizes
them according to the presence or absence of those commonalities.
2. Association: Association rule learning is an unsupervised learning method used for finding
relationships between variables in a large database. It determines the sets of items that
occur together in the dataset. Association rules make marketing strategy more effective: for
example, people who buy item X (say, bread) also tend to purchase item Y (butter or jam). A
typical example of association rule learning is Market Basket Analysis.
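
The bread and butter example can be made concrete with the two usual measures for an association rule, support and confidence. The sketch below is only an illustration: the five transactions are invented, and the helper names `support` and `confidence` are hypothetical, not taken from any library.

```python
# Toy market-basket data (invented for illustration).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "jam"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


rule_support = support({"bread", "butter"}, transactions)
rule_confidence = confidence({"bread"}, {"butter"}, transactions)
print(f"support(bread, butter)      = {rule_support:.2f}")
print(f"confidence(bread -> butter) = {rule_confidence:.2f}")
```

Here bread appears in 4 of 5 baskets and bread with butter in 3 of 5, so the rule "bread -> butter" has support 0.60 and confidence 0.75.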

Clustering

1. Clustering is a technique in machine learning and data analysis that involves grouping similar
data points together into clusters.
2. The goal of clustering is to identify patterns or structures in the data that may not be
immediately apparent, and to gain insights into the underlying relationships among the data
points.
3. Clustering is commonly used in a wide range of applications, including image analysis,
natural language processing, bioinformatics, and customer segmentation.
4. There are several different types of clustering algorithms, including hierarchical clustering,
k-means clustering, and density-based clustering.
5. Clustering can be used for a variety of purposes, including exploratory data analysis, data
visualization, and anomaly detection.



K-Means Clustering
1. K-means clustering is an unsupervised machine learning algorithm used to group similar
data points together into k clusters.
2. Here K is the number of clusters to be created, chosen in advance: if K=2 there will be
two clusters, and for K=3 there will be three clusters.
3. It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a
way that each data point belongs to only one group of points with similar properties.
4. The process continues until the assignment of data points to clusters no longer changes, or
until a predefined number of iterations is reached.

The K-means algorithm steps:

1. Initialization: Choose k initial centroids, for example k randomly selected data points.

2. Assignment step: Each data point is assigned to the cluster with the nearest centroid. The
distance between a data point and a centroid can be measured using a distance metric, such as
Euclidean distance.

3. Update step: The centroid of each cluster is updated to the mean of the data points assigned
to that cluster. This involves computing the mean of each feature across all the data points in
the cluster.

4. Repeat: Steps 2 and 3 are repeated until the assignment of data points to clusters no longer
changes, or until a predefined number of iterations is reached.
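
The assignment and update steps can be sketched directly in NumPy. This is a minimal illustration rather than a production implementation: the function name `kmeans`, its parameters `max_iter` and `seed`, the random choice of initial centroids, and the toy data are all assumptions made for the example.

```python
import numpy as np


def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means: returns (labels, centroids). Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    # Assumed initialization: k distinct data points picked at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: Euclidean distance from every point to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop when the assignment no longer changes.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids


# Toy data: two well-separated groups of 50 points each.
data_rng = np.random.default_rng(0)
X = np.vstack([
    data_rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2)),
    data_rng.normal(loc=[10.0, 10.0], scale=0.5, size=(50, 2)),
])
labels, centroids = kmeans(X, k=2)
print("Centroids:\n", centroids)
```

On data this well separated the two recovered centroids should lie close to (0, 0) and (10, 10), in either order.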



K-means using Python
Program:
import numpy as np
from sklearn.cluster import KMeans
# Generate some random data
X = np.random.rand(100, 2)
# Set the number of clusters
k = 4
# Initialize the k-means algorithm
kmeans = KMeans(n_clusters=k)
# Fit the data to the algorithm
kmeans.fit(X)
# Get the cluster labels for each point
labels = kmeans.labels_
# Get the cluster centers
centers = kmeans.cluster_centers_
# Print the results
print("random data", X)
print("Cluster Labels: ", labels)
print("Cluster Centers: ", centers)

Output (from one run, printed random data omitted; the data is random, so the values differ between runs):
Cluster Labels: [0 0 3 0 3 1 2 2 2 2 1 1 2 2 2 3 0
3 1 0 2 2 0 1 3 0 3 1 1 2 2 0 0 0 0 1 3 3 1 3 2 1 3
2 0 3 3 2 3 1 1 0 3 0 3 3 0 1 1 3 0 3 2 0 1 3 1 0 0
3 0 0 0 1 0 2 3 3 3 1 2 3 0 1 0 0 1 0 1 2 3 0 3 3 3
2 1 0 0 1]
Cluster Centers: [[0.68645396 0.70837678]
[0.20982496 0.75451388]
[0.21195521 0.22924606]
[0.78986404 0.2240606 ]]

[Figures: worked K-means example and illustration of the algorithm steps; images not reproduced]

Limitations of K-Means Clustering
K-Means clustering is a widely used clustering algorithm that works well on many types of data.
However, it has some limitations that should be considered.

1. Sensitivity to initial conditions: The final clustering result of K-Means can depend heavily on the
initial choice of cluster centers. Depending on the random initialization, K-Means can converge to
different local optima, resulting in different clustering results.
2. Assumes spherical clusters: K-Means algorithm assumes that the clusters are spherical in shape,
which may not be true for all types of data. If the clusters are elongated or have irregular shapes,
K-Means may fail to capture the true underlying structure of the data.
3. Requires predefined number of clusters: K-Means requires the number of clusters to be
predefined, which may not always be known in advance. Selecting the optimal number of clusters
can be a difficult task, and an incorrect choice can lead to poor clustering results.

4. Sensitive to outliers: K-Means is sensitive to outliers, as they can significantly affect the
location of cluster centers. Outliers can distort the clustering result and lead to incorrect
conclusions.
5. Cannot handle categorical data: K-Means is designed to work with numerical data, and
cannot handle categorical or binary data directly. Data preprocessing is required to convert
categorical data to numerical data before applying K-Means.
6. Cannot handle non-linear relationships: K-Means assumes that the clusters are separated
by linear boundaries. If the data has non-linear relationships between the features, K-Means
may not be able to capture the underlying structure of the data.
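
Two of these limitations, sensitivity to initialization and the need to fix K in advance, are often mitigated in practice. The sketch below is one possible approach, not the only one: it reruns K-means from several initializations through scikit-learn's `n_init` parameter and compares candidate values of K with the silhouette score. The synthetic three-cluster data and the candidate range 2 to 6 are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (invented for illustration).
rng = np.random.default_rng(0)
true_centers = np.array([[-5.0, -5.0], [0.0, 5.0], [5.0, -5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in true_centers])

scores = {}
for k in range(2, 7):
    # n_init=10 runs K-means from 10 initializations and keeps the best result,
    # which reduces the dependence on any single random start.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = silhouette_score(X, km.labels_)
    print(f"k={k}: inertia={km.inertia_:.1f}, silhouette={scores[k]:.3f}")

# A higher silhouette score indicates tighter, better separated clusters.
best_k = max(scores, key=scores.get)
print("Best k by silhouette score:", best_k)
```

On data like this the silhouette score should peak at k = 3. On real data the peak is usually less clear, so the score is a guide rather than a rule.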



Clustering for Image Segmentation

1. In image segmentation, clustering is used to group pixels or regions in an image into different
clusters based on their similarity in terms of color, texture, or other visual features.
2. The resulting clusters can be used to segment the image into different regions or objects.
3. There are different clustering algorithms that can be used for image segmentation, including
k-means clustering, fuzzy c-means clustering, and spectral clustering.
4. K-means clustering is a popular algorithm that partitions the data into k clusters by
minimizing the distance between the data points and the centroid of their cluster, and it is
commonly used for image segmentation.
5. To apply clustering for image segmentation, we first represent the image as a matrix of pixels
with each pixel having a set of visual features such as color and texture. We then apply the
clustering algorithm to group the pixels into different clusters based on their similarity. Finally,
we assign each pixel to the cluster it belongs to, and we can use the resulting clusters to segment
the image into different regions.



K-Mean Clustering for Image Segmentation

The algorithm for image segmentation

1. First, select the value of K for K-means clustering.
2. Select a feature vector for every pixel (color values such as RGB values, texture, etc.).
3. Define a similarity measure between feature vectors, such as Euclidean distance, to measure
the similarity between any two pixels.
4. Apply the K-means algorithm to the feature vectors to obtain the cluster centers and a
cluster label for every pixel.
5. Apply a connected-components algorithm to the labelled pixels.
6. Merge any component smaller than a size threshold into an adjacent component that is
similar to it, and repeat until no more components can be merged.
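
Steps 1 to 5 can be sketched with scikit-learn and SciPy. The example below is an assumption-laden illustration: it uses a small synthetic two-colour image instead of a real photograph, takes the RGB values as the only pixel features, sets K = 2, and omits the merging of small components (step 6).

```python
import numpy as np
from scipy import ndimage
from sklearn.cluster import KMeans

# Synthetic 40x60 RGB image: reddish left half, bluish right half, plus noise.
rng = np.random.default_rng(0)
height, width = 40, 60
image = np.empty((height, width, 3))
image[:, : width // 2] = [0.9, 0.1, 0.1]
image[:, width // 2 :] = [0.1, 0.1, 0.9]
image = np.clip(image + rng.normal(scale=0.03, size=image.shape), 0.0, 1.0)

# Steps 1-2: choose K and use each pixel's RGB triple as its feature vector.
n_segments = 2
pixels = image.reshape(-1, 3)

# Steps 3-4: K-means with Euclidean distance on the pixel features.
kmeans = KMeans(n_clusters=n_segments, n_init=10, random_state=0).fit(pixels)
label_image = kmeans.labels_.reshape(height, width)

# Replace every pixel by the colour of its cluster centre.
segmented = kmeans.cluster_centers_[label_image]

# Step 5: count the connected regions belonging to each cluster.
region_counts = []
for cluster_id in range(n_segments):
    _, n_regions = ndimage.label(label_image == cluster_id)
    region_counts.append(n_regions)
print("Connected regions per cluster:", region_counts)
```

For a real image, `image` would be loaded from a file and scaled to the range 0 to 1, and K would be tuned to the number of regions expected.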

[Figures: image segmentation example; segmentation results for different values of K; images not reproduced]


Clustering for Pre-Processing
1. Clustering can also be used as a preprocessing step in machine learning to help with data
cleaning, dimensionality reduction, and feature extraction.
2. Clustering is an unsupervised technique, meaning that it does not require labeled data to group
similar data points together.
3. One common use case of clustering for preprocessing is outlier detection. Outliers are data points
that are significantly different from the rest of the data and can negatively impact the performance
of machine learning models. Clustering can help identify outliers by grouping data points into
clusters and identifying the data points that are far away from their cluster centers. These data
points can be treated as outliers and removed from the dataset.
4. Another use case of clustering for preprocessing is dimensionality reduction. In some datasets,
there may be many features that are highly correlated or redundant, which can lead to overfitting
and decreased model performance. Clustering can be used to group similar features together and
reduce the dimensionality of the dataset. This can help simplify the model and improve its
generalization performance.
5. Clustering can also be used for feature extraction. In some datasets, the raw data may be high
dimensional and difficult to work with. Clustering can help identify the underlying structure in
the data and extract meaningful features that can be used for machine learning. For example, in
image processing, clustering can be used to extract texture features or color features from the
image.
6. Overall, clustering can be a useful preprocessing technique for machine learning: it can help
with data cleaning, dimensionality reduction, and feature extraction, which can improve the
performance and generalization of machine learning models. However, it is important to
carefully choose the clustering algorithm and parameters based on the characteristics of the
dataset and the desired preprocessing task.
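
The outlier-detection use case can be sketched as follows: fit K-means, measure each point's distance to its own cluster center, and flag points whose distance is unusually large. The synthetic data, the single injected outlier, and the "mean plus three standard deviations" threshold are assumptions chosen for illustration; a suitable threshold depends on the dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic groups of 100 points plus one injected outlier (index 200).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.5, size=(100, 2)),
    [[12.0, -8.0]],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# transform() returns the distance from each point to every cluster center;
# the minimum over centers is the distance to the point's own center.
dist_to_center = kmeans.transform(X).min(axis=1)

# Assumed rule of thumb: flag points farther than mean + 3 standard deviations.
threshold = dist_to_center.mean() + 3 * dist_to_center.std()
outlier_idx = np.flatnonzero(dist_to_center > threshold)
X_clean = X[dist_to_center <= threshold]

print("Flagged outlier indices:", outlier_idx)
print("Rows kept after cleaning:", len(X_clean))
```

The same `transform` output can also serve as a new feature set (one distance per cluster), which is one way clustering is used for feature extraction.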



DBSCAN

1. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering
algorithm commonly used in machine learning and data mining for grouping together data
points based on their proximity to each other.
2. DBSCAN works by defining clusters as regions of high-density points, separated by
regions of low-density points.
3. The algorithm groups together points that are close to each other, while identifying
isolated points as noise and leaving them out of every cluster.
4. The algorithm takes two input parameters: epsilon (ε) and the minimum number of points
(minPts).
5. Epsilon defines the radius of the neighborhood around each point.
6. MinPts defines the minimum number of points required to form a dense region.

• DBSCAN distinguishes three types of data points.
Core point: a point that has at least MinPts points (counting itself) within eps.
Border point: a point that has fewer than MinPts points within eps but lies in the
neighborhood of a core point.
Noise or outlier: a point that is neither a core point nor a border point.
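
The three point types can be computed directly from pairwise distances. The sketch below assumes that a point counts as its own neighbor when comparing against MinPts; the function name `classify_points` and the seven toy points are hypothetical and chosen so that each type occurs.

```python
import numpy as np


def classify_points(X, eps, min_pts):
    """Label each row of X as 'core', 'border' or 'noise' (illustrative sketch)."""
    # Pairwise Euclidean distances between all points.
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # neighbors[i, j] is True when point j lies within eps of point i
    # (each point is within eps of itself, so it is counted as well).
    neighbors = dists <= eps
    is_core = neighbors.sum(axis=1) >= min_pts
    # A non-core point is a border point if any of its neighbors is a core point.
    has_core_neighbor = (neighbors & is_core[None, :]).any(axis=1)
    return np.where(is_core, "core", np.where(has_core_neighbor, "border", "noise"))


X = np.array([
    [0.00, 0.00], [0.10, 0.00], [0.00, 0.10], [0.10, 0.10], [0.05, 0.05],  # dense group
    [0.32, 0.05],  # near the dense group but with too few neighbors of its own
    [2.00, 2.00],  # isolated point
])
kinds = classify_points(X, eps=0.25, min_pts=4)
for point, kind in zip(X, kinds):
    print(point, kind)
```

With eps = 0.25 and MinPts = 4, the five packed points are core points, the sixth is a border point, and the last one is noise.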
DBSCAN algorithm

1. For every point, find all the neighbor points within eps and mark the point as a core point
if it has at least MinPts neighbors.
2. For each core point that is not already assigned to a cluster, create a new cluster.
3. Recursively find all the points density connected to it and assign them to the same cluster
as the core point.
4. Two points a and b are said to be density connected if there exists a core point c (a point
with a sufficient number of neighbors within eps) from which both a and b can be reached
through a chain of core points, each within eps distance of the next. This is a chaining
process: if b is a neighbor of c, c is a neighbor of d, d is a neighbor of e, and e is a
neighbor of a, with c, d, and e core points, then b is density connected to a.
5. Iterate through the remaining unvisited points in the dataset. Points that do not belong
to any cluster are noise.
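
In practice the algorithm is rarely written by hand. The sketch below runs scikit-learn's `DBSCAN` on the two-moons toy dataset, a shape that K-means handles poorly. The dataset and the values `eps=0.2` and `min_samples=5` are assumptions chosen for illustration; `min_samples` plays the role of MinPts.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-circles with a little noise.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

dbscan = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = dbscan.labels_  # cluster index per point; -1 marks noise

# Number of clusters found, not counting the noise label.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

# Recover the three point types from the fitted model.
core_mask = np.zeros(len(X), dtype=bool)
core_mask[dbscan.core_sample_indices_] = True
noise_mask = labels == -1
border_mask = ~core_mask & ~noise_mask

print("Clusters found:", n_clusters)
print("Core points:   ", int(core_mask.sum()))
print("Border points: ", int(border_mask.sum()))
print("Noise points:  ", int(noise_mask.sum()))
```

Unlike K-means, the number of clusters is not supplied; it follows from eps and MinPts, so the result is sensitive to those two values.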



Gaussian Mixtures

1. Gaussian mixture models (GMMs) are a type of probabilistic machine learning model.
2. They are used to group data into different categories based on the probability distribution.
Gaussian mixture models can be used in many different areas, including finance and
marketing.
3. They are used to model real-world data sets.
4. GMMs can be used to find clusters in data sets where the clusters may not be clearly
defined.
5. Additionally, GMMs can be used to estimate the probability that a new data point belongs to
each cluster.
6. A GMM is described by mixing weights (π), mean vectors (μ), and covariance matrices (Σ),
one of each per Gaussian component. A Gaussian distribution is a continuous probability
distribution with a bell-shaped curve; it is also known as the normal distribution.


1. GMM has many applications, such as density estimation, clustering, and image segmentation. For
density estimation, GMM can be used to estimate the probability density function of a set of data
points.
2. For clustering, GMM can be used to group together data points that come from the same Gaussian
distribution.
3. And for image segmentation, GMM can be used to partition an image into different regions.
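
The clustering and probability-estimation uses can be sketched with scikit-learn's `GaussianMixture`. The data below is synthetic (two Gaussian groups invented for the example), and the choice of two components is an assumption; on real data the number of components has to be selected.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data drawn from two Gaussians (invented for illustration).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2)),
    rng.normal(loc=[5.0, 5.0], scale=1.0, size=(200, 2)),
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print("Mixing weights:", gmm.weights_)
print("Means:\n", gmm.means_)
print("Covariance matrices shape:", gmm.covariances_.shape)

# Soft assignment: probability that each new point belongs to each component.
new_points = np.array([[0.0, 0.0], [5.0, 5.0], [2.5, 2.5]])
probs = gmm.predict_proba(new_points)
print("Membership probabilities:\n", probs.round(3))

# Hard assignment: the most probable component for each point.
hard_labels = gmm.predict(new_points)
```

Each row of `probs` sums to 1. A point midway between the two groups receives a split probability instead of being forced into a single cluster, which is the main practical difference from K-means.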



Dimensionality Reduction:

1. The Curse of Dimensionality,
2. Main Approaches for Dimensionality Reduction,
3. PCA,
4. Using Scikit-Learn,
5. Randomized PCA,
6. Kernel PCA.



Thank You
