
UNIT - 4

K-Means Clustering Algorithm


K-Means Clustering is an unsupervised learning algorithm used to solve clustering problems in machine learning and data science. In this topic, we will learn what the K-means clustering algorithm is, how it works, and how to implement it in Python.
What is K-Means Algorithm?
K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process: if K=2, there will be two clusters; for K=3, there will be three clusters; and so on.
It is an iterative algorithm that divides the unlabeled dataset into K different clusters in such a way that each data point belongs to only one group of points with similar properties.
It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in the unlabeled dataset on its own, without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between each data point and its corresponding cluster centroid.
The algorithm takes the unlabeled dataset as input, divides the dataset into K clusters, and repeats the process until it finds the best clusters. The value of K should be predetermined in this algorithm.
The k-means clustering algorithm mainly performs two tasks:
o Determines the best value for K center points or centroids by
an iterative process.
o Assigns each data point to its closest K-center. The data points that are nearest to a particular K-center form a cluster.
Hence each cluster contains data points with some commonalities and is separated from the other clusters.
The below diagram explains the working of the K-means Clustering
Algorithm:

How does the K-Means Algorithm Work?


The working of the K-Means algorithm is explained in the below
steps:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids. (They need not be points from the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of its cluster.
Step-6: If any reassignment occurs, then go to Step-4; else go to FINISH.
Step-7: The model is ready.
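
A minimal sketch of these steps in practice, using scikit-learn; the synthetic dataset, K=3, and the random seed are assumed values for illustration only:

# A minimal K-means sketch using scikit-learn (illustrative values assumed).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unlabeled 2-D data with three natural groupings (assumed for the example).
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# K is predetermined; centroid initialization, assignment, and update
# iterations all happen inside fit_predict().
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)   # final centroids
print(labels[:10])               # cluster assignments of the first 10 points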
Hierarchical Clustering in Machine Learning
Hierarchical clustering is another unsupervised machine learning algorithm, which is used to group unlabelled datasets into clusters; it is also known as hierarchical cluster analysis or HCA.
In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped structure is known as the dendrogram.
Sometimes the results of K-means clustering and hierarchical clustering may look similar, but they differ in how they work: hierarchical clustering has no requirement to predetermine the number of clusters, as we did in the K-Means algorithm.
The hierarchical clustering technique has two approaches:
1. Agglomerative: Agglomerative is a bottom-up approach, in
which the algorithm starts with taking all data points as single
clusters and merging them until one cluster is left.
2. Divisive: Divisive algorithm is the reverse of the
agglomerative algorithm as it is a top-down approach.

Linkage Criteria:
o Different methods can be used to determine the distance
between clusters, which impacts how clusters are merged:
- Single Linkage: Merges clusters based on the shortest distance between two points in different clusters.
- Complete Linkage: Uses the largest distance between points in different clusters.
- Average Linkage: Considers the average distance between points across clusters.
- Ward’s Method: Minimizes the variance within clusters when merging.
No Need for Predefined Number of Clusters:
o Unlike methods like K-Means, hierarchical clustering
does not require the user to specify the number of clusters
in advance. The appropriate number of clusters can be
determined by cutting the dendrogram at different levels
based on the problem at hand.

Agglomerative Clustering is a type of hierarchical clustering that builds a hierarchy of clusters in a bottom-up fashion. Here are five key points to define it simply:
1. Bottom-Up Approach: It starts by treating each data point as
its own cluster and progressively merges the closest clusters
until all points are combined into one large cluster.
2. Merging the Closest Pair: At each step, the two clusters that are closest to each other (based on a distance metric) are merged together.
3. Linkage Criteria: Different methods like single linkage (closest
points), complete linkage (farthest points), or average linkage
(average distance) are used to determine which clusters to
merge.
4. Dendrogram Output: The result is a tree-like diagram called a
dendrogram that shows the merging process and helps
visualize the cluster hierarchy.
5. No Predefined Clusters: It doesn’t require specifying the
number of clusters beforehand. You can decide the number of
clusters by "cutting" the dendrogram at a specific level.
STEPS:
- Consider each alphabet as a single cluster and calculate the distance of one cluster from all the other clusters.
- In the second step, comparable clusters are merged together to form a single cluster. Let’s say cluster (B) and cluster (C) are very similar to each other, so we merge them in the second step; similarly for clusters (D) and (E). At the end of this step we get the clusters [(A), (BC), (DE), (F)].
- We recalculate the proximity according to the algorithm and merge the two nearest clusters ([(DE), (F)]) together to form new clusters, as [(A), (BC), (DEF)].
- Repeating the same process, the clusters DEF and BC are comparable and are merged together to form a new cluster. We’re now left with the clusters [(A), (BCDEF)].
- At last, the two remaining clusters are merged together to form a single cluster [(ABCDEF)].
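
A minimal SciPy sketch of this bottom-up process; the six 2-D points standing in for A–F and the choice of single linkage are assumptions made for illustration:

# A minimal agglomerative clustering sketch with SciPy (illustrative points assumed).
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

points = np.array([[1.0, 1.0],   # A
                   [1.5, 1.2],   # B
                   [1.6, 1.1],   # C
                   [5.0, 5.0],   # D
                   [5.1, 5.2],   # E
                   [5.6, 4.8]])  # F

# Build the merge hierarchy bottom-up; 'single' merges by shortest distance.
Z = linkage(points, method='single')

# Cut the dendrogram to obtain a chosen number of flat clusters (here 2, assumed).
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)

# dendrogram(Z) draws the tree with matplotlib to visualize the merge order.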
Divisive Clustering is a type of hierarchical clustering that works in
a top-down manner. Here are five simple points to define it:
1. Top-Down Approach: Divisive clustering starts with all data
points in a single large cluster and then progressively splits
them into smaller clusters.
2. Recursive Splitting: At each step, the algorithm selects a
cluster and divides it into two or more sub-clusters based on
the differences between data points.
3. Dendrogram Representation: The splitting process can be
visualized using a dendrogram, where the initial large cluster
at the top splits down into smaller branches (clusters).
4. No Need for Predefined Clusters: Like other hierarchical
methods, you don’t need to specify the number of clusters
beforehand. You can choose the number of clusters by
selecting a level in the dendrogram.
5. Less Common than Agglomerative: While divisive clustering is
effective, it’s less commonly used than agglomerative
clustering because splitting clusters is computationally more
challenging.
Divisive clustering is a useful method when you want to break
down a large group of data into meaningful subgroups step by step.
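
As a hedged sketch of a top-down split in practice, the snippet below uses scikit-learn's BisectingKMeans (available in scikit-learn 1.1+), which is one practical divisive-style variant rather than the classical divisive algorithm; the synthetic data and n_clusters=3 are assumed for illustration:

# A hedged top-down (divisive-style) clustering sketch using BisectingKMeans
# (requires scikit-learn >= 1.1; data and n_clusters are assumed).
from sklearn.cluster import BisectingKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Start from one large cluster and repeatedly split until 3 clusters remain.
model = BisectingKMeans(n_clusters=3, random_state=0)
labels = model.fit_predict(X)

print(labels[:10])
print(model.cluster_centers_)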
Measure for the distance between two clusters
As we have seen, the distance between two clusters is crucial for hierarchical clustering. There are various ways to calculate the distance between two clusters, and these ways decide the rule for clustering. These measures are called linkage methods. Some of the popular linkage methods are given below:
1. Single Linkage: It is the shortest distance between the closest points of the two clusters. Consider the below image:
2. Complete Linkage: It is the farthest distance between two points of two different clusters. It is one of the popular linkage methods, as it forms tighter clusters than single linkage.

Example using Single Linkage:
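
A small sketch of computing the single linkage (and, for contrast, the complete linkage) distance between two clusters; the cluster memberships and coordinates are assumed for illustration:

# Single vs. complete linkage distance between two clusters (illustrative points assumed).
import numpy as np
from scipy.spatial.distance import cdist

cluster_1 = np.array([[1.0, 2.0], [2.0, 2.5]])
cluster_2 = np.array([[6.0, 6.0], [7.0, 5.5], [6.5, 7.0]])

pairwise = cdist(cluster_1, cluster_2)   # all point-to-point Euclidean distances
single_linkage = pairwise.min()          # shortest distance between the two clusters
complete_linkage = pairwise.max()        # farthest distance between the two clusters

print(single_linkage, complete_linkage)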


Self-Organizing Map (or Kohonen Map or SOM) is a type of Artificial Neural Network, inspired by biological models of neural systems from the 1970s. It follows an unsupervised learning approach and trains its network through a competitive learning algorithm. SOM is used for clustering and mapping (or dimensionality reduction), mapping multidimensional data onto a lower-dimensional space, which reduces complex problems to a form that is easier to interpret. SOM has two layers: one is the Input layer and the other is the Output layer.
The architecture of a Self-Organizing Map with two clusters and n input features per sample is given below:

How does SOM work?


Let’s say we have input data of size (m, n), where m is the number of training examples and n is the number of features in each example. First, the algorithm initializes weights of size (n, C), where C is the number of clusters. Then, iterating over the input data, for each training example it updates the winning vector (the weight vector with the shortest distance, e.g. Euclidean distance, from the training example). The weight update rule is given by:
w_ij = w_ij(old) + α(t) * (x_ik – w_ij(old))
where α(t) is the learning rate at time t, j denotes the winning vector, i denotes the i-th feature of the training example, and k denotes the k-th training example from the input data. After training the SOM network, the trained weights are used for clustering new examples. A new example falls in the cluster of its winning vector.
Algorithm
Training:
Step 1: Initialize the weights w_ij to random values. Initialize the learning rate α.
Step 2: Calculate the squared Euclidean distance to each output unit j:
D(j) = Σ (w_ij – x_i)^2, where the sum runs over i = 1 to n
Step 3: Find the index J for which D(j) is minimum; this is taken as the winning unit.
Step 4: For each unit j within a specified neighborhood of J, and for all i, calculate the new weight:
w_ij(new) = w_ij(old) + α[x_i – w_ij(old)]
Step 5: Update the learning rate using:
α(t+1) = 0.5 * α(t)
Step 6: Test the stopping condition.
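
A minimal NumPy sketch of this training loop, assuming a tiny dataset, two output units (clusters), an update of the winning unit only (a neighborhood of size zero), and the halving learning-rate schedule of Step 5; all values are illustrative:

# Minimal SOM-style competitive learning sketch in NumPy (data and sizes assumed).
import numpy as np

rng = np.random.default_rng(0)

X = np.array([[0.1, 0.2, 0.1],   # m = 4 training examples, n = 3 features (assumed)
              [0.2, 0.1, 0.3],
              [0.9, 0.8, 0.9],
              [0.8, 0.9, 0.7]])
n_features, n_clusters = X.shape[1], 2

W = rng.random((n_features, n_clusters))   # Step 1: weights of size (n, C), random init
alpha = 0.5                                # Step 1: initial learning rate

for epoch in range(10):
    for x in X:
        D = ((W - x[:, None]) ** 2).sum(axis=0)   # Step 2: squared distance to each unit
        J = int(np.argmin(D))                     # Step 3: winning unit
        W[:, J] += alpha * (x - W[:, J])          # Step 4: move the winner toward the input
    alpha *= 0.5                                  # Step 5: decrease the learning rate

# After training, a new example falls in the cluster of its winning unit.
new_x = np.array([0.85, 0.8, 0.95])
print(int(np.argmin(((W - new_x[:, None]) ** 2).sum(axis=0))))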
Key Concepts
1. Structure:
o SOMs consist of two layers:
- Input Layer: Represents the dimensions of the input data.
- Output Layer: A 2D grid (or sometimes higher dimensions) of neurons, where each neuron has a weight vector of the same dimension as the input.
o Each neuron in the grid is connected to all the input features.
2. Training Process:
o SOMs use competitive learning, where neurons "compete" to
represent the input data.
o The training involves:
1. Initialization: Neurons' weights are initialized randomly or with small
values.
2. Competition: For each input, the neuron with the closest weight
vector (using a distance metric like Euclidean distance) is selected as the
Best Matching Unit (BMU).
3. Adaptation: The BMU and its neighbors adjust their weights to move
closer to the input. The adjustment is influenced by:
- A learning rate that decreases over time.
- A neighborhood function, which defines the region of influence around the BMU and also decreases over time.
o These steps repeat over many iterations until convergence.

3. Output:
o After training, similar input data points are mapped to nearby
neurons on the grid, creating clusters.
o The resulting 2D map provides a visual representation of the
relationships in the high-dimensional data.
Applications
- Data Visualization: Representing complex datasets in two dimensions for easier interpretation.
- Clustering: Grouping similar data points without predefined labels.
- Pattern Recognition: Identifying trends or anomalies in data.
- Dimensionality Reduction: Reducing the dimensionality of data while preserving its structure.
- Market Segmentation: Analyzing customer behaviors or preferences.
Advantages
- Preserves the topological structure of the input data.
- Does not require labeled data (unsupervised learning).
- Provides a visually interpretable map.
Disadvantages
- Requires careful tuning of parameters like learning rate and neighborhood size.
- Computationally expensive for large datasets.
- Limited to mostly clustering and visualization tasks.
Feature selection is a process in machine learning used to identify and
select the most relevant features (variables, attributes, or predictors) from a
dataset that contribute the most to the predictive power of a model. It aims
to improve model performance, reduce overfitting, and decrease
computation time by eliminating irrelevant, redundant, or noisy data.
Detailed Explanation
1. Purpose of Feature Selection
o Improves Model Performance: By focusing only on relevant
features, the model can better capture the underlying patterns,
leading to improved accuracy.
o Reduces Overfitting: Irrelevant or redundant features can lead to
overfitting, where the model performs well on training data but
poorly on unseen data. Removing such features improves
generalization.
o Decreases Training Time: A smaller feature set reduces the
computational complexity, speeding up the training process.
o Enhances Model Interpretability: A simpler model with fewer
features is easier to understand and explain.
Feature Selection vs. Feature Extraction
Feature Selection and Feature Extraction are both techniques used in
machine learning to reduce the dimensionality of a dataset, but they differ
in their approaches and outcomes.

1. Feature Selection
Definition: Feature selection is the process of selecting a subset of the
original features from the dataset that are most relevant or important for
the predictive model, while discarding the irrelevant or redundant ones.
Key Characteristics:
- Subset of Original Features: It retains a selection of the original features without altering them.
- Purpose: Improve model performance, reduce overfitting, and speed up training.
- Approaches: Includes filter methods (e.g., correlation), wrapper methods (e.g., recursive feature elimination), and embedded methods (e.g., regularization techniques like LASSO).
Example: If a dataset contains 10 features, feature selection might identify
that only 4 of them are important for the model and discard the other 6.

2. Feature Extraction
Definition: Feature extraction is the process of transforming the original
features into a new set of features that better capture the information
relevant to the task. These new features are combinations or
transformations of the original ones.
Key Characteristics:
- Derived Features: It creates new features by combining or transforming existing ones.
- Purpose: Capture underlying patterns in the data that the original features may not explicitly represent.
- Techniques: Includes methods like Principal Component Analysis (PCA), Singular Value Decomposition (SVD), Autoencoders, and t-SNE.
Example: If a dataset contains 10 features, feature extraction might reduce
them to 3 new features, each being a combination of the original features.
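
To make the contrast concrete, here is a hedged scikit-learn sketch; the Iris dataset, k=2 selected features, and 2 principal components are assumptions made for the example:

# Contrast of feature selection vs. feature extraction (illustrative choices assumed).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)            # 4 original features

# Feature selection: keep 2 of the original features, unchanged.
X_selected = SelectKBest(score_func=chi2, k=2).fit_transform(X, y)

# Feature extraction: build 2 new features as combinations of all 4 originals.
X_extracted = PCA(n_components=2).fit_transform(X)

print(X_selected.shape, X_extracted.shape)   # both (150, 2), but with different meanings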
Feature Selection Techniques
There are mainly two types of Feature Selection techniques, which are:
o Supervised Feature Selection techniques
Supervised feature selection techniques consider the target variable and can be used for labelled datasets.
o Unsupervised Feature Selection techniques
Unsupervised feature selection techniques ignore the target variable and can be used for unlabelled datasets.
1. Wrapper Methods
In wrapper methods, feature selection is treated as a search problem, in which different combinations of features are made, evaluated, and compared with other combinations. The algorithm is trained iteratively using these subsets of features.
On the basis of the output of the model, features are added or removed, and the model is trained again with the resulting feature set.
Some techniques of wrapper methods are:
o Forward selection - Forward selection is an iterative process which begins with an empty set of features. In each iteration, it adds a feature and evaluates the performance to check whether the performance improves. The process continues until adding a new variable/feature no longer improves the performance of the model.
o Backward elimination - Backward elimination is also an iterative approach, but it is the opposite of forward selection. This technique begins by considering all the features and removes the least significant feature. This elimination process continues until removing a feature no longer improves the performance of the model.
o Exhaustive Feature Selection - Exhaustive feature selection is one of the best feature selection methods, which evaluates every feature set by brute force. This method tries each possible combination of features and returns the best-performing feature set.
o Recursive Feature Elimination - Recursive feature elimination is a recursive greedy optimization approach, where features are selected by recursively considering smaller and smaller subsets of features. An estimator is trained on each set of features, and the importance of each feature is determined using the coef_ attribute or the feature_importances_ attribute.
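
As a hedged illustration of the wrapper idea, the sketch below applies scikit-learn's RFE; the Iris dataset, the logistic-regression estimator, and keeping 2 features are assumptions made for the example:

# A minimal Recursive Feature Elimination (RFE) sketch with scikit-learn (choices assumed).
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

estimator = LogisticRegression(max_iter=1000)
rfe = RFE(estimator, n_features_to_select=2)   # recursively drop the weakest feature
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # rank 1 = selected; higher ranks were eliminated earlier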
2. Filter Methods
In filter methods, features are selected on the basis of statistical measures. This method does not depend on the learning algorithm and chooses the features as a pre-processing step.
The filter method filters out irrelevant features and redundant columns from the dataset by ranking them with different metrics.
The advantage of using filter methods is that they need low computational time and do not overfit the data.
Some common techniques of Filter methods are as follows:
o Information Gain
o Chi-square Test
o Fisher's Score
o Missing Value Ratio
Information Gain: Information gain determines the reduction in entropy
while transforming the dataset. It can be used as a feature selection
technique by calculating the information gain of each variable with respect
to the target variable.
Chi-square Test: Chi-square test is a technique to determine the
relationship between the categorical variables. The chi-square value is
calculated between each feature and the target variable, and the desired
number of features with the best chi-square value is selected.
Fisher's Score:
Fisher's score is one of the popular supervised techniques for feature selection. It ranks the variables according to Fisher's criterion in descending order; we can then select the variables with a large Fisher's score.
Missing Value Ratio:
The missing value ratio can be used to evaluate a feature against a threshold value. The missing value ratio is the number of missing values in a column divided by the total number of observations. A variable whose missing value ratio is higher than the threshold can be dropped.
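
A small hedged sketch of the missing value ratio filter with pandas; the toy DataFrame and the 0.4 threshold are assumptions for illustration:

# Missing value ratio filter (illustrative DataFrame and threshold assumed).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 32, np.nan, 40, 29],
    "income": [np.nan, np.nan, np.nan, 52000, 61000],
    "city":   ["A", "B", "A", np.nan, "C"],
})

threshold = 0.4
missing_ratio = df.isna().mean()                 # missing values per column / total rows
keep = missing_ratio[missing_ratio <= threshold].index

print(missing_ratio.to_dict())
print(df[keep].columns.tolist())                 # 'income' (ratio 0.6) is dropped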

3. Embedded Methods
Embedded methods combine the advantages of both filter and wrapper methods by considering the interaction of features along with low computational cost. They are fast, like the filter methods, but more accurate.
These methods are also iterative: they evaluate each training iteration and find the features that contribute the most to the training in that iteration. Some techniques of embedded methods are:
o Regularization - Regularization adds a penalty term to the parameters of the machine learning model to avoid overfitting. This penalty is applied to the coefficients and shrinks some of them to zero; the features with zero coefficients can then be removed from the dataset. The types of regularization techniques are L1 Regularization (Lasso Regularization) and Elastic Nets (combined L1 and L2 regularization). (A code sketch follows this list.)
o Random Forest Importance - Different tree-based methods of feature selection use feature importance to provide a way of selecting features. Here, feature importance specifies which features matter most in model building or have the greatest impact on the target variable. Random Forest is such a tree-based method; it is a type of bagging algorithm that aggregates a number of decision trees. It automatically ranks the nodes by their performance, i.e. the decrease in impurity (Gini impurity) over all the trees. Nodes are arranged according to their impurity values, which allows pruning of the trees below a specific node. The remaining nodes correspond to a subset of the most important features.
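
As a hedged sketch of embedded selection via L1 regularization, the snippet below fits a Lasso model and keeps the features with non-zero coefficients; the diabetes dataset and alpha=1.0 are illustrative choices:

# Embedded feature selection via L1 (Lasso) regularization (illustrative choices assumed).
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

lasso = Lasso(alpha=1.0).fit(X, y)              # the L1 penalty shrinks some coefficients to zero
selector = SelectFromModel(lasso, prefit=True)  # keep only features with non-zero coefficients
X_reduced = selector.transform(X)

print(lasso.coef_)        # zero coefficients mark features that can be removed
print(X_reduced.shape)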
Dimensionality Reduction: Definition and Overview
Dimensionality reduction is a technique used to reduce the number of
features in a dataset while retaining as much of the important information
as possible. In other words, it is a process of transforming high-dimensional
data into a lower-dimensional space that still preserves the essence of the
original data.
In machine learning, high-dimensional data refers to data with a large
number of features or variables. The curse of dimensionality is a common
problem in machine learning, where the performance of the model
deteriorates as the number of features increases. This is because the
complexity of the model increases with the number of features, and it
becomes more difficult to find a good solution. In addition, high-
dimensional data can also lead to overfitting, where the model fits the
training data too closely and does not generalize well to new data.
Why is Dimensionality Reduction Important?
1. Curse of Dimensionality:
o As the number of features increases, the volume of the feature
space grows exponentially, making data sparse and harder to
analyze.
o Models trained on high-dimensional data may suffer from
overfitting or poor generalization to new data.
2. Improved Computational Efficiency:
o Reducing the number of dimensions lowers the computational
complexity of machine learning algorithms.
o It speeds up training and inference.
3. Better Data Visualization:
o High-dimensional data is difficult to visualize. Dimensionality
reduction techniques like PCA or t-SNE allow data to be
represented in 2D or 3D for easier interpretation.
4. Noise Reduction:
o It helps eliminate redundant or noisy features, improving the
overall quality of the dataset.
Types of Dimensionality Reduction
Dimensionality reduction techniques can be broadly classified into two
categories:
1. Feature Selection
- Reduces the number of dimensions by selecting a subset of the most important features from the original dataset.
- Methods include:
o Statistical tests (e.g., Chi-square, ANOVA).
o Recursive Feature Elimination (RFE).
o Regularization techniques like LASSO (L1).
2. Feature Extraction
- Reduces dimensions by transforming data into a lower-dimensional space, creating new features that capture the essence of the original data.
- Techniques include:
o Linear Methods:
- Principal Component Analysis (PCA): Identifies directions (principal components) that capture the most variance in the data.
- Linear Discriminant Analysis (LDA): Maximizes the separability between different classes.
Applications of Dimensionality Reduction
1. Data Visualization:
o Helps in exploring and understanding patterns in datasets.
o Commonly used with t-SNE or PCA to represent data in 2D or 3D.
2. Noise Reduction:
o Removes less important features to create a cleaner dataset for
machine learning models.
What is Principal Component Analysis (PCA)?
The Principal Component Analysis (PCA) technique was introduced by the mathematician Karl Pearson in 1901. It works on the condition that while the data in a higher-dimensional space is mapped to data in a lower-dimensional space, the variance of the data in the lower-dimensional space should be maximum.
- Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of correlated variables into a set of uncorrelated variables. PCA is the most widely used tool in exploratory data analysis and in machine learning for predictive models.
- Principal Component Analysis (PCA) is an unsupervised learning algorithm used to examine the interrelations among a set of variables. It is also known as general factor analysis, where regression determines a line of best fit.
- The main goal of Principal Component Analysis (PCA) is to reduce the dimensionality of a dataset while preserving the most important patterns or relationships between the variables, without any prior knowledge of the target variables.
Principal Component Analysis (PCA) is used to reduce the
dimensionality of a data set by finding a new set of variables, smaller
than the original set of variables, retaining most of the sample’s
information, and useful for the regression and classification of data.
Principal Component Analysis
1. Principal Component Analysis (PCA) is a technique for dimensionality
reduction that identifies a set of orthogonal axes, called principal
components, that capture the maximum variance in the data. The
principal components are linear combinations of the original variables
in the dataset and are ordered in decreasing order of importance. The
total variance captured by all the principal components is equal to the
total variance in the original dataset.
2. The first principal component captures the most variation in the data, while the second principal component captures the maximum remaining variance that is orthogonal to the first principal component, and so on.
3. Principal Component Analysis can be used for a variety of purposes,
including data visualization, feature selection, and data compression.
In data visualization, PCA can be used to plot high-dimensional data in
two or three dimensions, making it easier to interpret. In feature
selection, PCA can be used to identify the most important variables in
a dataset. In data compression, PCA can be used to reduce the size of
a dataset without losing important information.
4. In Principal Component Analysis, it is assumed that the information is carried in the variance of the features; that is, the higher the variation in a feature, the more information that feature carries.
Numerical example:
https://www.geeksforgeeks.org/mathematical-approach-to-pca/
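
As a hedged complement to the linked numerical, the sketch below follows the same mathematical approach in NumPy: center the data, compute the covariance matrix, take its eigenvectors, and project onto the top component; the small 2-D dataset is assumed for illustration:

# A minimal PCA-from-scratch sketch with NumPy (toy dataset assumed).
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

X_centered = X - X.mean(axis=0)           # 1. center each feature
cov = np.cov(X_centered, rowvar=False)    # 2. covariance matrix of the features
eigvals, eigvecs = np.linalg.eigh(cov)    # 3. eigen-decomposition (symmetric matrix)

order = np.argsort(eigvals)[::-1]         # 4. sort components by variance captured
components = eigvecs[:, order]

X_pca = X_centered @ components[:, :1]    # 5. project onto the first principal component
print(eigvals[order])                     # variance captured by each component
print(X_pca)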
