Unit 4 Self Made
Linkage Criteria:
o Different methods can be used to determine the distance
between clusters, which impacts how clusters are merged:
Single Linkage: Merges clusters based on the
shortest distance between two points in different
clusters.
Complete Linkage: Uses the largest distance
between points in different clusters.
Average Linkage: Considers the average distance
between points across clusters.
Ward’s Method: Minimizes the variance within
clusters when merging.
No Need for Predefined Number of Clusters:
o Unlike K-Means, hierarchical clustering does not require the user to specify the number of clusters in advance. The appropriate number of clusters can be determined by cutting the dendrogram at different levels, depending on the problem at hand (a short code sketch below illustrates this together with the linkage criteria).
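A minimal sketch of these ideas, assuming SciPy is available (the random 2-D data and the choice of 3 clusters are purely illustrative): the same data is merged under each linkage criterion, and fcluster "cuts" the resulting dendrogram at a chosen number of clusters.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))              # 30 points in 2-D (illustrative data)

for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)                     # merge history for this linkage criterion
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram at 3 clusters
    print(method, np.bincount(labels)[1:])            # resulting cluster sizes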
3. Output:
o After training, similar input data points are mapped to nearby
neurons on the grid, creating clusters.
o The resulting 2D map provides a visual representation of the
relationships in the high-dimensional data.
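To make the training and the resulting 2D map concrete, below is a minimal self-organizing map written from scratch in NumPy; the grid size, learning-rate schedule, and Gaussian neighbourhood are illustrative assumptions rather than a reference implementation.

import numpy as np

rng = np.random.default_rng(0)
data = rng.random((200, 3))                    # 200 samples with 3 features
rows, cols, dim = 10, 10, data.shape[1]
weights = rng.random((rows, cols, dim))        # one weight vector per grid neuron

n_iter, lr0, sigma0 = 2000, 0.5, 3.0           # illustrative schedule
for t in range(n_iter):
    x = data[rng.integers(len(data))]          # pick a random input sample
    # best-matching unit (BMU): neuron whose weights are closest to x
    d = np.linalg.norm(weights - x, axis=2)
    bmu = np.unravel_index(np.argmin(d), d.shape)
    # decay the learning rate and neighbourhood radius over time
    lr = lr0 * np.exp(-t / n_iter)
    sigma = sigma0 * np.exp(-t / n_iter)
    # Gaussian neighbourhood around the BMU on the 2D grid
    gy, gx = np.ogrid[:rows, :cols]
    grid_dist2 = (gy - bmu[0]) ** 2 + (gx - bmu[1]) ** 2
    h = np.exp(-grid_dist2 / (2 * sigma ** 2))[..., None]
    # pull the BMU and its neighbours toward the input
    weights += lr * h * (x - weights)

# after training, similar inputs are mapped to nearby neurons on the grid
d0 = np.linalg.norm(weights - data[0], axis=2)
print("BMU of the first sample:", np.unravel_index(np.argmin(d0), d0.shape))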
Applications
Data Visualization: Representing complex datasets in two dimensions
for easier interpretation.
Clustering: Grouping similar data points without predefined labels.
Pattern Recognition: Identifying trends or anomalies in data.
Dimensionality Reduction: Reducing the dimensionality of data while
preserving its structure.
Market Segmentation: Analyzing customer behaviors or preferences.
Advantages
Preserves the topological structure of the input data.
Does not require labeled data (unsupervised learning).
Provides a visually interpretable map.
Disadvantages
Requires careful tuning of parameters like learning rate and
neighborhood size.
Computationally expensive for large datasets.
Mostly limited to clustering and visualization tasks.
Feature Selection
Feature selection is a process in machine learning used to identify and
select the most relevant features (variables, attributes, or predictors) from a
dataset that contribute the most to the predictive power of a model. It aims
to improve model performance, reduce overfitting, and decrease
computation time by eliminating irrelevant, redundant, or noisy data.
Detailed Explanation
1. Purpose of Feature Selection
o Improves Model Performance: By focusing only on relevant
features, the model can better capture the underlying patterns,
leading to improved accuracy.
o Reduces Overfitting: Irrelevant or redundant features can lead to
overfitting, where the model performs well on training data but
poorly on unseen data. Removing such features improves
generalization.
o Decreases Training Time: A smaller feature set reduces the
computational complexity, speeding up the training process.
o Enhances Model Interpretability: A simpler model with fewer
features is easier to understand and explain.
Feature Selection vs. Feature Extraction
Feature Selection and Feature Extraction are both techniques used in
machine learning to reduce the dimensionality of a dataset, but they differ
in their approaches and outcomes.
1. Feature Selection
Definition: Feature selection is the process of selecting a subset of the
original features from the dataset that are most relevant or important for
the predictive model, while discarding the irrelevant or redundant ones.
Key Characteristics:
Subset of Original Features: It retains a selection of the original
features without altering them.
Purpose: Improve model performance, reduce overfitting, and speed
up training.
Approaches: Includes filter methods (e.g., correlation), wrapper
methods (e.g., recursive feature elimination), and embedded methods
(e.g., regularization techniques like LASSO).
Example: If a dataset contains 10 features, feature selection might identify
that only 4 of them are important for the model and discard the other 6.
2. Feature Extraction
Definition: Feature extraction is the process of transforming the original
features into a new set of features that better capture the information
relevant to the task. These new features are combinations or
transformations of the original ones.
Key Characteristics:
Derived Features: It creates new features by combining or
transforming existing ones.
Purpose: Capture underlying patterns in the data that the original
features may not explicitly represent.
Techniques: Includes methods like Principal Component Analysis
(PCA), Singular Value Decomposition (SVD), Autoencoders, and t-SNE.
Example: If a dataset contains 10 features, feature extraction might reduce
them to 3 new features, each being a combination of the original features.
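A short, hedged sketch of the contrast using scikit-learn (the Iris dataset and parameter choices are illustrative): SelectKBest keeps a subset of the original columns, while PCA creates new features that are linear combinations of all of them.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)             # 4 original features

# Feature selection: keep 2 of the 4 original columns
X_sel = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Feature extraction: build 2 new features (principal components)
X_ext = PCA(n_components=2).fit_transform(X)

print(X.shape, X_sel.shape, X_ext.shape)      # (150, 4) (150, 2) (150, 2)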
Feature Selection Techniques
There are mainly two types of Feature Selection techniques, which are:
o Supervised Feature Selection technique
Supervised Feature selection techniques consider the target variable
and can be used for the labelled dataset.
o Unsupervised Feature Selection technique
Unsupervised Feature selection techniques ignore the target variable
and can be used for the unlabelled dataset.
1. Wrapper Methods
Wrapper methods treat feature selection as a search problem: different
combinations of features are generated, evaluated, and compared with one
another. The learning algorithm is trained iteratively on candidate subsets
of features.
Based on the model's performance, features are added or removed, and the
model is trained again on the updated feature set.
Some techniques of wrapper methods are:
o Forward selection - Forward selection is an iterative process that
begins with an empty set of features. In each iteration it adds one
feature and evaluates whether the model's performance improves. The
process continues until adding a new feature no longer improves the
performance of the model.
o Backward elimination - Backward elimination is also an iterative
approach, but it works in the opposite direction of forward selection.
It begins with all the features and removes the least significant
feature in each iteration. This elimination continues until removing
features no longer improves the performance of the model.
o Exhaustive Feature Selection - Exhaustive feature selection evaluates
every possible combination of features by brute force and returns the
best-performing feature set. It is the most thorough wrapper method,
but also the most computationally expensive.
o Recursive Feature Elimination -
Recursive feature elimination is a greedy optimization approach in
which features are selected by recursively considering smaller and
smaller subsets of features. An estimator is trained on each subset,
and the importance of each feature is determined from the estimator's
coef_ or feature_importances_ attribute; the least important features
are eliminated at each step.
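As a hedged illustration of these wrapper methods using scikit-learn (the synthetic dataset, the logistic-regression estimator, and the subset size of 4 are arbitrary choices for the example):

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)
model = LogisticRegression(max_iter=1000)

# Forward selection: start from an empty set and greedily add features
forward = SequentialFeatureSelector(model, n_features_to_select=4,
                                    direction="forward").fit(X, y)
print("forward selection kept:", forward.get_support(indices=True))

# Recursive feature elimination: start with all features, drop the weakest
rfe = RFE(model, n_features_to_select=4).fit(X, y)
print("RFE kept:", rfe.get_support(indices=True))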
2. Filter Methods
In filter methods, features are selected on the basis of statistical measures.
These methods do not depend on the learning algorithm and choose the
features as a pre-processing step.
The filter method removes irrelevant features and redundant columns from
the model by ranking them with different metrics.
The advantage of filter methods is that they require little computational
time and do not overfit the data.
Some common techniques of Filter methods are as follows:
o Information Gain
o Chi-square Test
o Fisher's Score
o Missing Value Ratio
Information Gain: Information gain measures the reduction in entropy of
the target variable obtained by splitting the dataset on a feature. It can be
used as a feature selection technique by calculating the information gain of
each variable with respect to the target variable.
Chi-square Test: The chi-square test is a technique to determine the
relationship between categorical variables. The chi-square value is
calculated between each feature and the target variable, and the desired
number of features with the best chi-square values is selected.
Fisher's Score:
Fisher's score is a popular supervised feature selection technique. It ranks
the variables according to Fisher's criterion in descending order, and the
variables with the largest Fisher's scores can then be selected.
Missing Value Ratio:
The missing value ratio can be used to evaluate features against a
threshold value. It is computed as the number of missing values in a column
divided by the total number of observations. Variables whose missing value
ratio exceeds the threshold can be dropped.
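A brief sketch of some of these filter measures with scikit-learn (load_iris with as_frame=True returns a pandas DataFrame, so the missing value ratio can be computed directly; the data and the 0.2 threshold are illustrative):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

X, y = load_iris(return_X_y=True, as_frame=True)

# Information gain (mutual information) of each feature w.r.t. the target
print("mutual information:", mutual_info_classif(X, y).round(3))

# Chi-square scores (features must be non-negative); keep the best 2
chi2_selector = SelectKBest(chi2, k=2).fit(X, y)
print("chi-square scores:", chi2_selector.scores_.round(2))

# Missing value ratio: missing values per column / total observations
missing_ratio = X.isna().mean()
print("columns kept:", list(X.columns[missing_ratio <= 0.2]))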
3. Embedded Methods
Embedded methods combine the advantages of both filter and wrapper
methods by considering interactions between features while keeping the
computational cost low. They are fast, like filter methods, but generally
more accurate.
These methods are also iterative: each training iteration is evaluated, and
the features that contribute the most to that iteration are identified. Some
techniques of embedded methods are:
o Regularization - Regularization adds a penalty term to the
coefficients of the machine learning model to avoid overfitting. With
an L1 penalty, some coefficients are shrunk exactly to zero, and the
features with zero coefficients can be removed from the dataset.
Common regularization techniques are L1 regularization (LASSO) and
Elastic Net (a combination of L1 and L2 regularization).
o Random Forest Importance - Tree-based methods provide feature
importance scores that offer a natural way of selecting features; the
importance score indicates how much a feature contributes to model
building or how strongly it affects the target variable. Random Forest
is such a tree-based method: a bagging algorithm that aggregates many
decision trees. It ranks features by the decrease in impurity (e.g., Gini
impurity) they produce, averaged over all the trees. Features are then
ordered by these impurity-based scores, the least important ones are
pruned away, and the remaining ones form a subset of the most
important features.
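A hedged sketch of both embedded approaches with scikit-learn (the synthetic regression data and the alpha value are illustrative):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=10,
                       n_informative=4, noise=5.0, random_state=0)

# L1 regularization (LASSO): irrelevant coefficients are shrunk to exactly zero
lasso = Lasso(alpha=1.0).fit(X, y)
print("features kept by LASSO:", np.flatnonzero(lasso.coef_))

# Random Forest importance: mean decrease in impurity across all trees
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print("forest importances:", forest.feature_importances_.round(3))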
Dimensionality Reduction: Definition and Overview
Dimensionality reduction is a technique used to reduce the number of
features in a dataset while retaining as much of the important information
as possible. In other words, it is a process of transforming high-dimensional
data into a lower-dimensional space that still preserves the essence of the
original data.
In machine learning, high-dimensional data refers to data with a large
number of features or variables. The curse of dimensionality is a common
problem in machine learning, where the performance of the model
deteriorates as the number of features increases. This is because the
complexity of the model increases with the number of features, and it
becomes more difficult to find a good solution. In addition, high-
dimensional data can also lead to overfitting, where the model fits the
training data too closely and does not generalize well to new data.
Why is Dimensionality Reduction Important?
1. Curse of Dimensionality:
o As the number of features increases, the volume of the feature
space grows exponentially, making data sparse and harder to
analyze.
o Models trained on high-dimensional data may suffer from
overfitting or poor generalization to new data.
2. Improved Computational Efficiency:
o Reducing the number of dimensions lowers the computational
complexity of machine learning algorithms.
o It speeds up training and inference.
3. Better Data Visualization:
o High-dimensional data is difficult to visualize. Dimensionality
reduction techniques like PCA or t-SNE allow data to be
represented in 2D or 3D for easier interpretation.
4. Noise Reduction:
o It helps eliminate redundant or noisy features, improving the
overall quality of the dataset.
Types of Dimensionality Reduction
Dimensionality reduction techniques can be broadly classified into two
categories:
1. Feature Selection
Reduces the number of dimensions by selecting a subset of the most
important features from the original dataset.
Methods include:
o Statistical tests (e.g., Chi-square, ANOVA).
o Recursive Feature Elimination (RFE).
o Regularization techniques like LASSO (L1).
2. Feature Extraction
Reduces dimensions by transforming data into a lower-dimensional
space, creating new features that capture the essence of the original
data.
Techniques include:
o Linear Methods:
Principal Component Analysis (PCA): Identifies directions
(principal components) that capture the most variance in
the data.
Linear Discriminant Analysis (LDA): Maximizes the
separability between different classes.
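As a small, hedged sketch of the two linear methods (scikit-learn and the Iris data are used purely for illustration):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA: unsupervised, keeps the directions of maximum variance
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))

# LDA: supervised, maximizes separability between the classes
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
print("reduced shapes:", X_pca.shape, X_lda.shape)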
Applications of Dimensionality Reduction
1. Data Visualization:
o Helps in exploring and understanding patterns in datasets.
o Commonly used with t-SNE or PCA to represent data in 2D or 3D.
2. Noise Reduction:
o Removes less important features to create a cleaner dataset for
machine learning models.
What is Principal Component Analysis (PCA)?
The Principal Component Analysis (PCA) technique was introduced by
the mathematician Karl Pearson in 1901. It works on the principle that
when data in a higher-dimensional space is mapped to a lower-dimensional
space, the variance of the data in the lower-dimensional space should be
maximized.
Principal Component Analysis (PCA) is a statistical procedure that
uses an orthogonal transformation to convert a set of correlated
variables into a set of uncorrelated variables. PCA is one of the most
widely used tools in exploratory data analysis and in machine learning
for predictive models. Moreover, PCA is an unsupervised learning
technique used to examine the interrelations among a set of variables.
It is also known as general factor analysis, where regression determines
a line of best fit.
The main goal of Principal Component Analysis (PCA) is to reduce the
dimensionality of a dataset while preserving the most important
patterns or relationships between the variables without any prior
knowledge of the target variables.
Principal Component Analysis (PCA) is used to reduce the
dimensionality of a data set by finding a new set of variables, smaller
than the original set of variables, retaining most of the sample’s
information, and useful for the regression and classification of data.
Principal Component Analysis
1. Principal Component Analysis (PCA) is a technique for dimensionality
reduction that identifies a set of orthogonal axes, called principal
components, that capture the maximum variance in the data. The
principal components are linear combinations of the original variables
in the dataset and are ordered in decreasing order of importance. The
total variance captured by all the principal components is equal to the
total variance in the original dataset.
2. The first principal component captures the most variation in the data,
the second principal component captures the maximum variance that is
orthogonal to the first principal component, and so on.
3. Principal Component Analysis can be used for a variety of purposes,
including data visualization, feature selection, and data compression.
In data visualization, PCA can be used to plot high-dimensional data in
two or three dimensions, making it easier to interpret. In feature
selection, PCA can be used to identify the most important variables in
a dataset. In data compression, PCA can be used to reduce the size of
a dataset without losing important information.
4. In Principal Component Analysis, it is assumed that the information is
carried in the variance of the features; that is, the higher the variation
in a feature, the more information that feature carries.
Numerical:
https://www.geeksforgeeks.org/mathematical-approach-to-pca/
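Complementing the linked numerical walkthrough, here is a minimal from-scratch PCA sketch in NumPy with random illustrative data; the steps are: center the data, compute the covariance matrix, eigendecompose it, and project onto the leading eigenvectors.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # 100 samples, 5 features

# 1. Center the data (subtract the mean of each feature)
Xc = X - X.mean(axis=0)

# 2. Covariance matrix of the features
cov = np.cov(Xc, rowvar=False)

# 3. Eigendecomposition; sort eigenvectors by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Project onto the first k principal components
k = 2
X_reduced = Xc @ eigvecs[:, :k]

print("variance captured by 2 components:", eigvals[:k].sum() / eigvals.sum())
print("reduced shape:", X_reduced.shape)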