
MACHINE LEARNING

BTCSE023602
B. Tech CSE – III
Mr. Naresh A. Kamble
Asst. Professor,
Department of Computer Science and Engineering
D.Y.Patil Agriculture and Technical University,
Talsande, Kolhapur

UNIT-IV UNSUPERVISED LEARNING

CONTENTS
• Clustering Algorithms
• Dimensionality Reduction
• Anomaly Detection

UNIT-IV UNSUPERVISED LEARNING

CLUSTERING

Clustering, or cluster analysis, is a machine learning technique that groups an unlabeled dataset. It can be defined as "a way of grouping the data points into different clusters consisting of similar data points. Objects with possible similarities remain in a group that has few or no similarities with another group."

It does this by finding similar patterns in the unlabeled dataset, such as shape, size, color, behavior, etc., and divides the data points according to the presence or absence of those patterns.

It is an unsupervised learning method; hence no supervision is provided to the algorithm, and it deals with unlabeled data.

After applying a clustering technique, each cluster or group is assigned a cluster-ID. An ML system can use this ID to simplify the processing of large and complex datasets.

Example: Let's understand the clustering technique with a real-world example of a shopping mall. When we visit a mall, we can observe that items with similar usage are grouped together: t-shirts are grouped in one section and trousers in another, and similarly, in the vegetable section, apples, bananas, mangoes, etc., are kept in separate areas so that we can easily find things. The clustering technique works in the same way. Another example of clustering is grouping documents according to their topic.

Apart from these general usages, clustering is used by Amazon in its recommendation system to provide recommendations based on a user's past product searches. Netflix also uses this technique to recommend movies and web series to its users based on their watch history.

The diagram below illustrates the working of the clustering algorithm: different fruits are divided into several groups with similar properties.

TYPES OF CLUSTERING

Broadly speaking, there are 2 types of clustering that can be performed to group similar data points:

Hard Clustering: In this type of clustering, each data point either belongs to a cluster completely or not at all. For example, say there are 4 data points and we have to cluster them into 2 clusters: each data point will then belong to either cluster 1 or cluster 2.

Soft Clustering: In this type of clustering, instead of assigning each data point to a single cluster, a probability or likelihood of that point belonging to each cluster is evaluated. For example, say there are 4 data points and we have to cluster them into 2 clusters: we will evaluate, for every data point, a probability of it belonging to each of the two clusters.
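The following minimal sketch contrasts the two assignment styles on a handful of toy 2-D points. It assumes scikit-learn (which these notes do not prescribe), using KMeans for hard assignments and a Gaussian Mixture Model for soft assignments.

```python
# Minimal sketch: hard vs. soft clustering on toy 2-D points (assumes scikit-learn).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.3],
              [8.0, 8.0], [8.3, 7.9], [7.8, 8.2]])  # two obvious groups

# Hard clustering: each point is assigned to exactly one of the 2 clusters.
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("hard assignments:", hard_labels)             # e.g. [0 0 0 1 1 1]

# Soft clustering: each point gets a probability of belonging to each cluster.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print("soft assignments:\n", gmm.predict_proba(X))  # one row of probabilities per point
```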

TYPES OF CLUSTERING ALGORITHMS

1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering

PARTITIONING CLUSTERING

It is a type of clustering that divides the data into non-hierarchical groups. It is also known as the centroid-based method. The most common example of partitioning clustering is the K-Means clustering algorithm.

In this type, the dataset is divided into a set of k groups, where K defines the number of pre-defined groups. The cluster centers are created in such a way that the distance between the data points and their own cluster centroid is minimal compared to the distance to other cluster centroids.

DENSITY-BASED CLUSTERING

The density-based clustering method connects highly dense areas into clusters, and arbitrarily shaped clusters are formed as long as the dense regions can be connected. The algorithm does this by identifying different clusters in the dataset and connecting areas of high density into clusters. Dense areas in the data space are separated from each other by sparser areas.

These algorithms can face difficulty in clustering the data points if the dataset has varying densities and high dimensionality.

DISTRIBUTION MODEL-BASED CLUSTERING

In the distribution model-based clustering method, the data is divided based on the probability that a data point belongs to a particular distribution. The grouping is done by assuming the data follows certain distributions, most commonly the Gaussian distribution.

An example of this type is the Expectation-Maximization clustering algorithm, which uses Gaussian Mixture Models (GMM).

HIERARCHICAL CLUSTERING

Hierarchical clustering can be used as an alternative to partitioning clustering, as there is no requirement to pre-specify the number of clusters to be created. In this technique, the dataset is divided into clusters to create a tree-like structure, which is also called a dendrogram. Any desired number of clusters can be obtained by cutting the tree at the appropriate level. The most common example of this method is the Agglomerative Hierarchical Clustering algorithm.

K-MEANS CLUSTERING ALGORITHM

K-Means Clustering is an unsupervised learning algorithm which groups an unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process: if K=2, there will be two clusters, for K=3 there will be three clusters, and so on.

It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group and each group contains points with similar properties.

It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.

The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

Determines the best values for the K center points or centroids by an iterative process.

Assigns each data point to its closest k-center. The data points near a particular k-center form a cluster.

Hence each cluster has data points with some commonalities, and it is away from the other clusters.
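Formally, the objective that K-Means minimizes is the within-cluster sum of squared distances (a standard formulation; the notes above state it only in words):

J = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2

where C_i is the set of data points assigned to cluster i and \mu_i is the centroid of that cluster.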

The below diagram explains the working of the K-means Clustering Algorithm:

WORKING OF K-MEANS ALGORITHM

The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points as centroids (they need not be points from the input dataset).

Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.

Step-4: Calculate the variance and place a new centroid for each cluster.

Step-5: Repeat the third step, which means reassign each data point to the new closest centroid of its cluster.

Step-6: If any reassignment occurs, go to Step-4; otherwise go to FINISH.

Step-7: The model is ready.
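The steps above can be mirrored in a short, library-free NumPy sketch (illustrative only; the toy data and function below are not part of these notes):

```python
# Minimal NumPy sketch of the K-Means steps described above.
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step-2: pick K random points from the data as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step-3/5: assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step-4: recompute each centroid as the mean of its assigned points
        # (keep the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step-6: stop when no centroid moves, i.e. no reassignment changes anything.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1, 1], [1.5, 2], [0.5, 1.5], [8, 8], [8, 9], [9, 8]], dtype=float)
labels, centroids = kmeans(X, k=2)
print(labels)     # cluster index for every point
print(centroids)  # the two learned cluster centers
```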

HIERARCHICAL CLUSTERING ALGORITHM (HCA)

Hierarchical clustering is another unsupervised machine learning algorithm which is used to group unlabelled datasets into clusters; it is also known as hierarchical cluster analysis or HCA.

In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped structure is known as the dendrogram.

The hierarchical clustering technique has two approaches:

Agglomerative: A bottom-up approach in which the algorithm starts by taking every data point as a single cluster and keeps merging the closest clusters until only one cluster is left.

Divisive: The reverse of the agglomerative algorithm, as it is a top-down approach.

Problems with the K-Means Algorithm

The number of clusters must be predetermined.

It always tries to create clusters of the same size.

AGGLOMERATIVE HIERARCHICAL CLUSTERING

To group the datasets into clusters, it follows the bottom-up approach.

This means the algorithm considers each data point as a single cluster at the beginning and then starts combining the closest pair of clusters together.

It does this until all the clusters are merged into a single cluster that contains all the data points.

This hierarchy of clusters is represented in the form of a dendrogram.

WORKING OF AGGLOMERATIVE HCA

Step-1: Treat each data point as a single cluster. If there are N data points, there will be N clusters.

Step-2: Take the two closest data points or clusters and merge them to form one cluster. There will now be N-1 clusters.

Step-3: Again, take the two closest clusters and merge them to form one cluster. There will be N-2 clusters.

Step-4: Repeat Step-3 until only one cluster is left.

Step-5: Once all the clusters are combined into one big cluster, develop the
dendrogram to divide the clusters as per the problem.
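In practice, the merge-and-cut procedure described above is usually delegated to a library. A minimal sketch with SciPy follows (an assumption; the notes do not name a library): linkage() performs the repeated merging of the closest clusters, and fcluster() cuts the resulting dendrogram as in Step-5.

```python
# Minimal sketch of agglomerative hierarchical clustering with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [1.5, 2], [0.5, 1.5], [8, 8], [8, 9], [9, 8]], dtype=float)

Z = linkage(X, method="ward")                    # bottom-up merges, one row per merge
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters
print(labels)                                    # e.g. [1 1 1 2 2 2]
```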

DBSCAN ALGORITHM

DBSCAN, which stands for Density-Based Spatial Clustering of Applications with Noise, is a powerful clustering algorithm that groups points that are closely packed together in the data space.

Unlike some other clustering algorithms, DBSCAN doesn't require you to specify the number of clusters beforehand.

The algorithm works by defining clusters as dense regions separated by regions of lower density.

KEY CONCEPTS

Core Points: These are points that have at least a minimum number of other
points (MinPts) within a specified distance (ε or epsilon).

Border Points: These are points that are within the ε distance of a core point
but don't have MinPts neighbors themselves.

Noise Points: These are points that are neither core points nor border points.
They're not close enough to any cluster to be included.

PARAMETERS IN DBSCAN

ε (epsilon): The maximum distance between two points for them to be considered neighbors.

MinPts: The minimum number of points required to form a dense region.

WORKING OF DBSCAN

STEP-1: Parameter Selection

Choose ε (epsilon): the maximum distance between two points for them to be considered neighbors.

Choose MinPts: the minimum number of points required to form a dense region.

STEP-2: Select a Starting Point

The algorithm starts with an arbitrary unvisited point in the dataset.

STEP-3: Examine the Neighborhood

It retrieves all points within the ε distance of the starting point.

If the number of neighboring points is less than MinPts, the point is labeled as
noise (for now).

If there are at least MinPts points within ε distance, the point is marked as a
core point, and a new cluster is formed.

STEP-4: Expand the Cluster

All the neighbors of the core point are added to the cluster.

For each of these neighbors:

If it's a core point, its neighbors are added to the cluster recursively.

If it's not a core point, it's marked as a border point, and the expansion stops.

STEP-5: Repeat the Process

The algorithm moves to the next unvisited point in the dataset.

Steps 3-4 are repeated until all points have been visited.

STEP-6: Finalize Clusters

After all points have been processed, the algorithm identifies all clusters.

Points initially labeled as noise might now be border points if they're within ε
distance of a core point.

STEP-7: Handling Noise

Any points not belonging to any cluster remain classified as noise.
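A minimal sketch of the procedure above with scikit-learn's DBSCAN implementation (an assumption; the notes do not name a library); eps and min_samples correspond to ε and MinPts.

```python
# Minimal sketch of DBSCAN on toy 2-D data (assumes scikit-learn).
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 1], [1.2, 1.1], [0.9, 1.3],      # dense region A
              [8, 8], [8.1, 8.2], [7.9, 8.1],      # dense region B
              [50, 50]], dtype=float)              # an isolated (noise) point

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)                # cluster ids per point; -1 marks noise
print(db.core_sample_indices_)   # indices of the core points
```

With these parameters the two dense regions become clusters 0 and 1, and the isolated point is labeled -1 (noise), matching Step-7 above.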

DIMENSIONALITY REDUCTION

The number of input features, variables, or columns present in a given dataset is known as its dimensionality, and the process of reducing these features is called dimensionality reduction.

INTRODUCTION

In many cases a dataset contains a huge number of input features, which makes the predictive modeling task more complicated. Because it is very difficult to visualize or make predictions for a training dataset with a high number of features, dimensionality reduction techniques are required in such cases.

A dimensionality reduction technique can be defined as "a way of converting a higher-dimensional dataset into a lower-dimensional dataset while ensuring that it provides similar information."

It is commonly used in the fields that deal with high-dimensional data, such as
speech recognition, signal processing, bioinformatics, etc. It can also be used for
data visualization, noise reduction, cluster analysis, etc.

TYPES OF DIMENSIONALITY REDUCTION

FEATURE SELECTION

In this method, we are interested in finding the k of the total n features that give us the most information, and we discard the other (n-k) dimensions.

Example: Suppose we have 10 features (n=10) in a dataset but we need only 7 features (k=7); then we discard the remaining 3 features (n-k = 10-7 = 3).
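A minimal sketch of this idea, keeping the k most informative of n features: variance is used here as a simple, unsupervised notion of "information", which is an assumption not fixed by the notes.

```python
# Minimal sketch of feature selection: keep the k highest-variance features.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # n = 10 features
X[:, [2, 5, 7]] *= 0.01                 # three nearly-constant (uninformative) columns

k = 7
keep = np.argsort(X.var(axis=0))[-k:]   # indices of the k highest-variance features
X_reduced = X[:, np.sort(keep)]         # discard the other n-k = 3 features
print(sorted(keep), X_reduced.shape)    # selected feature indices, (100, 7)
```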

FEATURE EXTRACTION

In this method, we are interested in finding a new set of k features that are combinations of the original n features.

There are 2 types of feature extraction techniques:

1. Principal Component Analysis (PCA)
2. Linear Discriminant Analysis (LDA)

PRINCIPAL COMPONENT ANALYSIS (PCA)

Principal Component Analysis (PCA) is an unsupervised learning algorithm that is used for dimensionality reduction in machine learning. It is a statistical process that converts observations of correlated features into a set of linearly uncorrelated features with the help of an orthogonal transformation. These new transformed features are called the Principal Components. PCA generally tries to find the lower-dimensional surface onto which to project the high-dimensional data.

PCA works by considering the variance of each attribute, because a high variance indicates a good split between the classes, and hence it reduces dimensionality. Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing power allocation in various communication channels. It is a feature extraction technique, so it retains the important variables and drops the least important ones.

The PCA algorithm is based on some mathematical concepts such as:

• Variance and Covariance
• Eigenvalues and Eigenvectors

Some common terms used in PCA algorithm:

Dimensionality: It is the number of features or variables present in the given
dataset. More easily, it is the number of columns present in the dataset.

Correlation: It signifies that how strongly two variables are related to each
other. Such as if one changes, the other variable also gets changed. The
correlation value ranges from -1 to +1. Here, -1 occurs if variables are inversely
proportional to each other, and +1 indicates that variables are directly
proportional to each other.

Orthogonal: It defines that variables are not correlated to each other, and hence
the correlation between the pair of variables is zero.

Eigenvectors: If there is a square matrix M and a non-zero vector v, then v is an eigenvector of M if Mv is a scalar multiple of v, i.e., Mv = λv for some scalar λ (the corresponding eigenvalue).

Covariance Matrix: A matrix containing the covariances between each pair of variables is called the covariance matrix.

Principal Components in PCA

As described above, the transformed new features or the output of PCA are the
Principal Components. The number of these PCs are either equal to or less than
the original features present in the dataset. Some properties of these principal
components are given below:

The principal components must be linear combinations of the original features.

These components are orthogonal, i.e., the correlation between a pair of components is zero.

The importance of each component decreases when going from 1 to n: the 1st PC has the most importance, and the nth PC has the least importance.

Steps for PCA algorithm

Getting the dataset

Firstly, we need to take the input dataset and divide it into two subparts X and
Y, where X is the training set, and Y is the validation set.

Representing data into a structure

Now we will represent our dataset into a structure. Such as we will represent
the two-dimensional matrix of independent variable X. Here each row
corresponds to the data items, and the column corresponds to the Features. The
number of columns is the dimensions of the dataset.

Standardizing the data

In this step, we will standardize our dataset. Within a particular column, features with higher variance would otherwise be treated as more important than features with lower variance.

If the importance of features should be independent of their variance, we divide each data item in a column by the standard deviation of that column. The resulting matrix is named Z.

Calculating the Covariance of Z

To calculate the covariance of Z, we will take the matrix Z, and will transpose it.
After transpose, we will multiply it by Z. The output matrix will be the Covariance
matrix of Z.

Calculating the Eigenvalues and Eigenvectors

Now we need to calculate the eigenvalues and eigenvectors of the resultant covariance matrix. Eigenvectors of the covariance matrix are the directions of the axes with the highest information, and the corresponding eigenvalues indicate how much variance lies along each of those directions.

Sorting the Eigenvectors

In this step, we take all the eigenvalues and sort them in decreasing order, i.e., from largest to smallest, and simultaneously sort the eigenvectors accordingly into a matrix P. The resultant sorted matrix is named P*.

Calculating the new features, or Principal Components

Here we calculate the new features. To do this, we multiply the P* matrix by Z. In the resultant matrix Z*, each observation is a linear combination of the original features, and each column of Z* is independent of the others.

Remove less important features from the new dataset

Now that the new feature set is obtained, we decide what to keep and what to remove: only the relevant or important features are kept in the new dataset, and the unimportant features are removed.
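The steps above can be condensed into a short NumPy sketch (illustrative only; the synthetic data and variable names are not from the notes):

```python
# Minimal NumPy sketch of the PCA steps: standardize, covariance, eigendecompose,
# sort, project, and keep only the first k components.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated features

# Standardize: center each column and divide by its standard deviation (matrix Z).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance of Z: Z transposed times Z (scaled by the number of samples - 1).
C = Z.T @ Z / (len(Z) - 1)

# Eigenvalues and eigenvectors of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(C)

# Sort eigenvectors by decreasing eigenvalue (matrix P*).
order = np.argsort(eigvals)[::-1]
P_star = eigvecs[:, order]

# New features / principal components: project Z onto the sorted eigenvectors,
# then keep only the first k components and drop the less important ones.
k = 2
Z_star = Z @ P_star[:, :k]
print(Z_star.shape)                     # (200, 2)
print(eigvals[order] / eigvals.sum())   # fraction of variance explained per component
```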

Applications of Principal Component Analysis

PCA is mainly used as a dimensionality reduction technique in various AI applications such as computer vision and image compression.

It can also be used for finding hidden patterns when data has high dimensionality. Some fields where PCA is used are finance, data mining, psychology, etc.

LINEAR DISCRIMINANT ANALYSIS (LDA)

Linear Discriminant Analysis (LDA) is one of the commonly used dimensionality reduction techniques in machine learning, applied to two-class and multi-class classification problems.

What is Linear Discriminant Analysis (LDA)?

Linear Discriminant Analysis is one of the most popular dimensionality reduction techniques used for supervised classification problems in machine learning.

It is also considered a pre-processing step for modeling class differences in ML and in pattern-classification applications.

For example, suppose we have two classes with multiple features and need to separate them efficiently.

When we classify them using a single feature, the classes may overlap.

To overcome this overlapping issue in the classification process, we keep increasing the number of features.

EXAMPLE

Let's assume we have to classify two different classes having two sets of data points in a 2-dimensional plane, as shown in the image below.

To create a new axis, Linear Discriminant Analysis uses the following criteria:

It maximizes the distance between the means of the two classes.

It minimizes the variance within each individual class.
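A minimal sketch of LDA used as a supervised dimensionality-reduction step, assuming scikit-learn (not named in these notes); internally it finds the new axis that satisfies the two criteria above.

```python
# Minimal sketch: project two labeled classes onto a single LDA axis (assumes scikit-learn).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),    # class 0
               rng.normal([4, 4], 1.0, size=(50, 2))])   # class 1
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis(n_components=1)   # at most (number of classes - 1) axes
X_1d = lda.fit_transform(X, y)                     # project onto the new axis
print(X_1d.shape)                                  # (100, 1)
```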

ANOMALY DETECTION

Anomaly detection is the technique of identifying rare events or observations which can raise suspicion by being statistically different from the rest of the observations.

Such "anomalous" behavior typically translates to some kind of problem, such as credit card fraud, a failing machine or server, a cyber-attack, etc.

An anomaly can be broadly categorized into three categories:

Point Anomaly: A tuple in a dataset is said to be a point anomaly if it is far off from the rest of the data.

Contextual Anomaly: An observation is a contextual anomaly if it is an anomaly because of the context of the observation.

Collective Anomaly: A set of data instances collectively helps in finding an anomaly, even though the individual instances may not be anomalous on their own.

ISOLATION FOREST

Isolation Forest is an unsupervised anomaly detection algorithm that is particularly effective for high-dimensional data.

It operates under the principle that anomalies are rare and distinct, making them easier to isolate from the rest of the data.

Unlike other methods that profile normal data, Isolation Forests focus on isolating anomalies.

At its core, the Isolation Forest algorithm banks on the fundamental concept that anomalies deviate significantly from the rest of the data, thereby making them easier to identify.

The workings of isolation forests are defined below:

Building Isolation Trees: The algorithm starts by creating a set of isolation trees,
typically hundreds or even thousands of them.

These trees are similar to traditional decision trees, but with a key difference:
they are not built to classify data points into specific categories.

Instead, isolation trees aim to isolate individual data points by repeatedly splitting the data based on randomly chosen features and split values.

Splitting on Random Features: Isolation trees introduce randomness: at each node of the tree, a random feature from the dataset is selected.

Then, a random split value is chosen within the range of that particular feature's
values.

This randomness helps ensure that anomalies, which tend to be distinct from
the majority of data points, are not hidden within specific branches of the tree.

Isolating Data Points: The data points are then directed down the branches of
the isolation tree based on their feature values.

If a data point's value for the chosen feature falls below the split value, it goes
to the left branch. Otherwise, it goes to the right branch.

This process continues recursively until the data point reaches a leaf node, which
simply represents the isolated data point.

Anomaly Score: The key concept behind Isolation Forests lies in the path length
of a data point through an isolation tree.

Anomalies, by virtue of being different from the majority, tend to be easier to isolate. They require fewer random splits to reach a leaf node because they are likely to fall outside the typical range of values for the chosen features.

Conversely, normal data points, which share more similarities with each other,
might require more splits on their path down the tree before they are isolated.

Anomaly Score Calculation: Each data point is evaluated through all the
isolation trees in the forest.

For each tree, the path length (number of splits) required to isolate the data
point is recorded.

An anomaly score is then calculated for each data point by averaging the path
lengths across all the isolation trees in the forest.

Identifying Anomalies: Data points with shorter average path lengths are
considered more likely to be anomalies.

This is because they were easier to isolate, suggesting they deviate significantly
from the bulk of the data.

A threshold is set to define the anomaly score that separates normal data points
from anomalies.
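A minimal sketch with scikit-learn's IsolationForest (an assumption; the notes do not name a library); points that are easier to isolate receive lower scores and are labeled -1.

```python
# Minimal sketch of Isolation Forest anomaly detection (assumes scikit-learn).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_normal = rng.normal(0, 1, size=(200, 2))        # bulk of the data
X_anomaly = np.array([[8.0, 8.0], [-7.0, 9.0]])   # two obvious outliers
X = np.vstack([X_normal, X_anomaly])

iso = IsolationForest(n_estimators=200, contamination="auto", random_state=0).fit(X)
labels = iso.predict(X)          # +1 = normal, -1 = anomaly
scores = iso.score_samples(X)    # lower score -> shorter average path -> more anomalous
print(labels[-2:])               # expected: [-1 -1] for the injected outliers
print(scores[-2:])
```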

