Unit - IV Unsupervised Learning - Notes
BTCSE023602
B. Tech CSE – III
Mr. Naresh A. Kamble
Asst. Professor,
Department of Computer Science and Engineering
D.Y.Patil Agriculture and Technical University,
Talsande, Kolhapur
UNIT-IV
UNSUPERVISED LEARNING
CONTENTS
• Clustering Algorithms
• Dimensionality Reduction
• Anomaly Detection
UNIT-IV UNSUPERVISED LEARNING
CLUSTERING
Clustering is an unsupervised learning technique that groups an unlabeled dataset into clusters of similar data points. It does this by finding similar patterns in the unlabeled dataset, such as shape, size, color, or behavior, and divides the data points according to the presence and absence of those patterns.
After applying this clustering technique, each cluster or group is assigned a cluster-ID. The ML system can use this ID to simplify the processing of large and complex datasets.
Example: Let's understand the clustering technique with a real-world example of a shopping mall. When we visit a mall, we can observe that items with similar usage are grouped together: t-shirts are grouped in one section, trousers in another, and in the vegetable section apples, bananas, mangoes, etc., are kept in separate groups so that we can easily find what we need. The clustering technique works in the same way. Another example of clustering is grouping documents according to their topic.
A typical clustering diagram shows different fruits being divided into several groups with similar properties.
TYPES OF CLUSTERING
Broadly speaking, there are 2 types of clustering that can be performed to group
similar data points:
Hard Clustering: In this type of clustering, each data point either belongs to a cluster completely or does not belong to it at all. For example, suppose there are 4 data points and we have to cluster them into 2 clusters. Each data point will then belong to either cluster 1 or cluster 2.
Soft Clustering: In this type of clustering, instead of assigning each data point to exactly one cluster, a probability or likelihood of that point belonging to each cluster is evaluated. For example, suppose there are 4 data points and we have to cluster them into 2 clusters. We then evaluate, for every data point, the probability of it belonging to each of the two clusters.
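The difference is easy to see in code. Below is a minimal, illustrative sketch (not part of the original notes) contrasting the hard assignments of k-means with the soft, probabilistic assignments of a Gaussian mixture model in scikit-learn; the four 2-D points and the choice of 2 clusters are made-up assumptions.
# Illustrative sketch: hard vs. soft cluster assignments with scikit-learn.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])

# Hard clustering: every point gets exactly one cluster label.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Hard assignments:", kmeans.labels_)           # e.g. [0 0 1 1]

# Soft clustering: every point gets a probability for each cluster.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print("Soft assignments:\n", gmm.predict_proba(X))   # each row sums to 1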
TYPES OF CLUSTERING ALGORITHMS
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
PARTITIONING CLUSTERING
In this type, the dataset is divided into a set of K groups, where K defines the number of pre-defined groups. The cluster centers are created in such a way that the distance between the data points within one cluster is minimized compared to their distance from another cluster's centroid.
DENSITY-BASED CLUSTERING
The density-based clustering method connects highly dense regions into clusters, so arbitrarily shaped groups can form as long as the dense regions can be connected. However, these algorithms can face difficulty in clustering the data points if the dataset has varying densities and high dimensionality.
DISTRIBUTION MODEL-BASED CLUSTERING
In this technique, the data is divided into clusters based on the probability that each data point belongs to a particular probability distribution, most commonly the Gaussian (normal) distribution.
HIERARCHICAL CLUSTERING
In hierarchical clustering, the clusters form a tree-like structure known as a dendrogram, and the number of clusters does not need to be specified in advance. This approach is described in detail under the Hierarchical Clustering Algorithm (HCA) below.
K-MEAN CLUSTERING ALGORITHM
K-Means clustering is an unsupervised learning algorithm that allows us to cluster the data into different groups; it is a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.
It assigns each data point to its closest cluster center (k-center); the data points that are nearest to a particular k-center together form a cluster.
Hence each cluster contains data points with some commonalities and is kept away from the other clusters.
The K-means clustering algorithm works through the following steps:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids. (They can be points other than those in the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of each cluster. If any reassignment occurred, return to Step-4; otherwise the clusters are final.
EXAMPLE
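As a worked example, here is a minimal, illustrative sketch of K-means in Python using scikit-learn (this code is not from the original notes; the six sample points and K = 2 are assumptions):
# Minimal K-means sketch with scikit-learn; the data and K = 2 are illustrative.
import numpy as np
from sklearn.cluster import KMeans

# Six 2-D points forming two visually separate groups.
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)          # Steps 2-5 happen inside fit_predict

print("Cluster labels:   ", labels)            # e.g. [0 0 0 1 1 1]
print("Cluster centroids:\n", kmeans.cluster_centers_)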
HIERARCHICAL CLUSTERING ALGORITHM (HCA)
Agglomerative: The agglomerative algorithm is a bottom-up approach. It starts by treating each data point as a single cluster and then merges the closest pairs of clusters step by step. It does this until all the clusters are merged into a single cluster that contains the entire dataset.
Divisive: The divisive algorithm is the reverse of the agglomerative algorithm, as it is a top-down approach.
The agglomerative approach proceeds as follows:
Step-1: Create each data point as a single cluster. If there are N data points, the number of clusters will also be N.
Step-2: Take two closest data points or clusters and merge them to form one
cluster. So, there will now be N-1 clusters.
Step-3: Again, take the two closest clusters and merge them together to form
one cluster. There will be N-2 clusters.
Step-4: Repeat Step-3 until only one cluster is left.
Step-5: Once all the clusters are combined into one big cluster, develop the
dendrogram to divide the clusters as per the problem.
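A minimal, illustrative sketch of agglomerative clustering with SciPy is given below (not from the original notes; the five 2-D points and the choice of 'ward' linkage are assumptions):
# Minimal agglomerative clustering sketch with SciPy; the data are illustrative.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5], [9, 9]])

# 'ward' merges the two clusters that give the smallest increase in variance.
Z = linkage(X, method="ward")

# Cut the merge tree (dendrogram) to obtain, say, 2 clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print("Cluster labels:", labels)

# dendrogram(Z) would plot the merge tree built in Steps 1-5 above
# (requires matplotlib to display).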
DBSCAN ALGORITHM
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups together points that lie in densely packed regions and marks points that lie alone in low-density regions as noise.
KEY CONCEPTS
Core Points: These are points that have at least a minimum number of other
points (MinPts) within a specified distance (ε or epsilon).
Border Points: These are points that are within the ε distance of a core point
but don't have MinPts neighbors themselves.
Noise Points: These are points that are neither core points nor border points.
They're not close enough to any cluster to be included.
PARAMETERS IN DBSCAN
ε (epsilon): The maximum distance between two points for them to be considered neighbors.
MinPts: The minimum number of points required to form a dense region.
WORKING OF DBSCAN
STEP-1: Choose the Parameters
Select values for ε and MinPts.
STEP-2: Pick a Starting Point
Select an unvisited data point and find all of its neighbors within distance ε.
STEP-3: Examine the Neighborhood
If the number of neighboring points is less than MinPts, the point is labeled as noise (for now).
If there are at least MinPts points within ε distance, the point is marked as a core point, and a new cluster is formed.
STEP-4: Expand the Cluster
All the neighbors of the core point are added to the cluster, and each of them is examined in turn:
If it's a core point, its neighbors are added to the cluster recursively.
If it's not a core point, it's marked as a border point, and the expansion stops.
STEP-5: Repeat
Steps 3-4 are repeated until all points have been visited.
STEP-6: Identify Clusters
After all points have been processed, the algorithm has identified all the clusters.
STEP-7: Handling Noise
Points initially labeled as noise might now be border points if they're within ε distance of a core point; any points that remain unassigned at the end are reported as noise (outliers).
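A minimal, illustrative DBSCAN sketch with scikit-learn (not from the original notes; the sample points and the eps and min_samples values are assumptions):
# Minimal DBSCAN sketch; data, eps and min_samples are illustrative.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 1], [1.1, 1], [1, 1.2],      # dense region -> cluster 0
              [8, 8], [8.1, 8], [8, 8.2],      # dense region -> cluster 1
              [25, 25]])                       # isolated point -> noise

db = DBSCAN(eps=0.5, min_samples=3).fit(X)     # eps = ε, min_samples = MinPts
print("Labels:", db.labels_)                   # noise points are labeled -1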
DIMENSIONALITY REDUCTION
INTRODUCTION
Dimensionality reduction converts a dataset with a large number of features into a lower-dimensional representation while retaining as much of the meaningful information as possible. Because it is very difficult to visualize or make predictions for a training dataset with a high number of features, dimensionality reduction techniques are required in such cases.
It is commonly used in the fields that deal with high-dimensional data, such as
speech recognition, signal processing, bioinformatics, etc. It can also be used for
data visualization, noise reduction, cluster analysis, etc.
FEATURE SELECTION
In this method, we are interested in finding the k out of the total n features that give us the most information, and we discard the other (n - k) dimensions.
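As an illustration (not from the original notes), here is a minimal sketch of one simple selection criterion: keep the k features with the highest variance and discard the rest; the random data and the choice of k = 2 are assumptions.
# Minimal feature-selection sketch: keep the k highest-variance features.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) * np.array([0.1, 5.0, 0.2, 3.0, 0.05])  # n = 5 features

k = 2
variances = X.var(axis=0)
keep = np.argsort(variances)[-k:]          # indices of the k most variable features
X_reduced = X[:, np.sort(keep)]

print("Kept feature indices:", np.sort(keep))   # expected: [1 3]
print("Reduced shape:", X_reduced.shape)        # (100, 2)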
FEATURE EXTRACTION
In this method, we are interested in finding a new set of k features that are combinations of the original n features.
Common feature extraction techniques include:
1. Principal Component Analysis (PCA)
2. Linear Discriminant Analysis (LDA)
PRINCIPAL COMPONENT ANALYSIS (PCA)
PCA works by considering the variance of each attribute, because an attribute with high variance indicates a good split between the classes, and hence it reduces the dimensionality. Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing the power allocation in various communication channels. It is a feature extraction technique, so it keeps the important variables and drops the least important ones.
Dimensionality: It is the number of features or variables present in the given
dataset. More easily, it is the number of columns present in the dataset.
Correlation: It signifies how strongly two variables are related to each other, such that if one changes, the other variable also changes. The correlation value ranges from -1 to +1. Here, -1 occurs if the variables are inversely proportional to each other, and +1 indicates that the variables are directly proportional to each other.
Orthogonal: It means that the variables are not correlated with each other, and hence the correlation between the pair of variables is zero.
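The following tiny sketch (illustrative, not from the original notes; the arrays are made up) shows the two extremes of the correlation range and an uncorrelated, orthogonal pair:
# Tiny illustration of correlation values; the arrays are made up.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
print(np.corrcoef(x,  2 * x)[0, 1])   # +1.0 : directly proportional
print(np.corrcoef(x, -2 * x)[0, 1])   # -1.0 : inversely proportional

y = np.array([1.0, -1.0, 1.0, -1.0])
z = np.array([1.0, 1.0, -1.0, -1.0])
print(np.corrcoef(y, z)[0, 1])        #  0.0 : uncorrelated (orthogonal)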
As described above, the transformed new features, i.e., the output of PCA, are the principal components. The number of these PCs is either equal to or less than the number of original features present in the dataset. Some properties of these principal components are given below:
The principal component must be a linear combination of the original features.
These components are orthogonal, i.e., the correlation between any pair of components is zero.
The importance of each component decreases from PC-1 to PC-n: PC-1 carries the most information and PC-n the least.
Steps for PCA algorithm
Firstly, we need to take the input dataset and divide it into two subparts X and
Y, where X is the training set, and Y is the validation set.
Now we will represent our dataset in a structure, such as a two-dimensional matrix of the independent variable X. Here each row corresponds to a data item, and each column corresponds to a feature. The number of columns gives the dimensionality of the dataset.
In this step, we will standardize our dataset, so that within a particular column, features with high variance do not simply dominate features with lower variance. The standardized data matrix is named Z.
To calculate the covariance of Z, we will take the matrix Z and transpose it. After transposing, we will multiply it by Z. The output matrix will be the covariance matrix of Z.
Now we need to calculate the eigenvalues and eigenvectors of the resulting covariance matrix. The eigenvectors of the covariance matrix are the directions of the axes with the most information (highest variance), and the corresponding eigenvalues give the amount of variance along those directions.
In this step, we will take all the eigenvalues and sort them in decreasing order, i.e., from largest to smallest, and simultaneously sort the eigenvectors accordingly in a matrix P. The resulting matrix of sorted eigenvectors will be named P*.
Here we will calculate the new features. To do this, we multiply the matrix P* by Z. In the resulting matrix Z*, each observation is a linear combination of the original features, and the columns of the Z* matrix are independent of each other.
Now that the new feature set is obtained, we decide what to keep and what to remove: we keep only the relevant or important features in the new dataset, and the unimportant features are removed.
PCA can also be used for finding hidden patterns when the data has high dimensionality. Some fields where PCA is used are finance, data mining, psychology, etc.
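A minimal NumPy sketch of the PCA steps described above (not from the original notes; the random data and the choice of k = 2 components are assumptions):
# Minimal NumPy sketch of the PCA steps; the data and k = 2 are illustrative.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))              # dataset: 100 rows, 4 features

# Standardize each column (mean 0, unit variance) -> matrix Z.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance matrix of Z (Z transposed times Z, scaled by the sample count).
cov = (Z.T @ Z) / (Z.shape[0] - 1)

# Eigenvalues and eigenvectors of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(cov)

# Sort eigenvectors by decreasing eigenvalue -> matrix P*.
order = np.argsort(eigvals)[::-1]
P_star = eigvecs[:, order]

# Project Z onto the sorted eigenvectors and keep only the top k components.
k = 2
Z_star = Z @ P_star[:, :k]
print("Reduced shape:", Z_star.shape)      # (100, 2)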
LINEAR DISCRIMINANT ANALYSIS (LDA)
Linear Discriminant Analysis is a dimensionality reduction technique that projects the data onto a new axis in a way that separates the classes as well as possible. For example, if we have two classes with multiple features and need to separate them efficiently, classifying them using a single feature may show overlapping between the classes.
EXAMPLE
Let's assume we have to classify two different classes, each having its own set of data points in a 2-dimensional plane.
To create a new axis, Linear Discriminant Analysis uses the following criteria:
It maximizes the distance between the means of the two classes.
It minimizes the variation (within-class scatter) within each class.
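A minimal, illustrative LDA sketch with scikit-learn (not from the original notes; the six labeled 2-D points are made up, and note that LDA requires class labels):
# Minimal LDA sketch; the two-class 2-D data are illustrative.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two classes of 2-D points (labels are required: LDA is supervised).
X = np.array([[1, 2], [2, 3], [3, 3], [6, 8], [7, 8], [8, 9]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

lda = LinearDiscriminantAnalysis(n_components=1)   # project onto 1 new axis
X_new = lda.fit_transform(X, y)

print("Projected data:\n", X_new)                  # 6 samples, 1 dimension
print("Axis direction (scalings):\n", lda.scalings_)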
ANOMALY DETECTION
Anomaly detection is the task of identifying data points, events, or observations that deviate significantly from the majority of the data.
ISOLATION FOREST
It operates under the principle that anomalies are rare and distinct, making
them easier to isolate from the rest of the data.
Unlike other methods that profile normal data, Isolation Forests focus on
isolating anomalies.
At its core, the Isolation Forest algorithm relies on the fundamental idea that anomalies deviate significantly from the rest of the data, which makes them easier to isolate and identify.
The workings of isolation forests are defined below:
Building Isolation Trees: The algorithm starts by creating a set of isolation trees,
typically hundreds or even thousands of them.
These trees are similar to traditional decision trees, but with a key difference:
they are not built to classify data points into specific categories.
Instead, each tree is built by repeatedly choosing a feature at random; then a random split value is chosen within the range of that particular feature's values.
This randomness helps ensure that anomalies, which tend to be distinct from
the majority of data points, are not hidden within specific branches of the tree.
Isolating Data Points: The data points are then directed down the branches of
the isolation tree based on their feature values.
If a data point's value for the chosen feature falls below the split value, it goes
to the left branch. Otherwise, it goes to the right branch.
This process continues recursively until the data point reaches a leaf node, which
simply represents the isolated data point.
Anomaly Score: The key concept behind Isolation Forests lies in the path length of a data point through an isolation tree. Anomalies, being few and different, tend to be separated close to the root of the tree, so they need only a small number of splits before they are isolated. Conversely, normal data points, which share more similarities with each other, require more splits on their path down the tree before they are isolated.
Anomaly Score Calculation: Each data point is evaluated through all the
isolation trees in the forest.
For each tree, the path length (number of splits) required to isolate the data
point is recorded.
An anomaly score is then calculated for each data point by averaging the path
lengths across all the isolation trees in the forest.
Identifying Anomalies: Data points with shorter average path lengths are
considered more likely to be anomalies.
This is because they were easier to isolate, suggesting they deviate significantly
from the bulk of the data.
A threshold is set to define the anomaly score that separates normal data points
from anomalies.
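A minimal, illustrative Isolation Forest sketch with scikit-learn (not from the original notes; the synthetic data and the contamination value are assumptions):
# Minimal Isolation Forest sketch; data and contamination are illustrative.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # bulk of the data
outliers = np.array([[8.0, 8.0], [-9.0, 7.5]])           # two obvious anomalies
X = np.vstack([normal, outliers])

iso = IsolationForest(n_estimators=100, contamination=0.01, random_state=0).fit(X)
scores = iso.score_samples(X)       # lower score -> more anomalous (shorter paths)
labels = iso.predict(X)             # -1 = anomaly, 1 = normal

print("Predicted anomalies at rows:", np.where(labels == -1)[0])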