ML Unit 4 at VS
UNIT – IV
Dimensionality Reduction – Linear Discriminant Analysis –
Principal Component Analysis – Factor Analysis – Independent
Component Analysis – Locally Linear Embedding – Isomap – Least
Squares Optimization
There are two main approaches to dimensionality reduction: feature
selection and feature extraction.
Feature Selection:
Feature selection involves selecting a subset of the original features
that are most relevant to the problem at hand. The goal is to reduce the
dimensionality of the dataset while retaining the most important
features. There are several methods for feature selection, including
filter methods, wrapper methods, and embedded methods. Filter
methods rank the features based on their relevance to the target
variable, wrapper methods use the model performance as the criteria
for selecting features, and embedded methods combine feature
selection with the model training process.
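As a quick illustration of a filter method, the sketch below uses scikit-learn's SelectKBest to keep the features most strongly associated with the target; the breast-cancer dataset and the choice of k = 10 are illustrative assumptions, not part of the original notes.
Python3
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

# A dataset with 30 original features
X, y = load_breast_cancer(return_X_y=True)

# Filter method: keep the 10 features whose ANOVA F-score with the target is highest
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
print(X.shape, "->", X_selected.shape)  # (569, 30) -> (569, 10)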
Feature Extraction:
Feature extraction transforms the original features into a new, smaller set of features that still captures most of the information in the data.
For example, a classification problem that relies on both humidity and rainfall can often be collapsed into a single underlying feature, since the two are correlated to a high degree. Hence, we can reduce the number of features in such problems. A 3-D classification problem can be hard to visualize, whereas a 2-D one can be mapped to a simple two-dimensional plane, and a 1-D problem to a simple line. In this way a 3-D feature space can be split into two 2-D feature spaces, and, if those features are later found to be correlated, the number of features can be reduced even further.
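As a small worked sketch of this idea (the synthetic humidity/rainfall values below are illustrative assumptions), two strongly correlated features can be collapsed into a single extracted feature with PCA:
Python3
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
humidity = rng.normal(70, 10, size=500)
rainfall = 0.8 * humidity + rng.normal(0, 2, size=500)   # strongly correlated with humidity
X = np.column_stack([humidity, rainfall])

# Collapse the two correlated features into one underlying feature
pca = PCA(n_components=1)
X_1d = pca.fit_transform(X)
print("variance explained:", pca.explained_variance_ratio_[0])  # close to 1.0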
• Data Preprocessing: Dimensionality reduction can be used as a preprocessing step before applying machine learning algorithms, reducing the dimensionality of the data and hence improving the performance of the model (a short preprocessing sketch follows this list).
• Improved Performance: Dimensionality reduction can help in
improving the performance of machine learning models by
reducing the complexity of the data, and hence reducing the
noise and irrelevant information in the data.
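As a minimal sketch of dimensionality reduction used as a preprocessing step (illustrative and not part of the original notes), the snippet below reduces the 64-dimensional digits dataset with PCA before fitting a classifier; the choice of 20 components and of logistic regression are arbitrary assumptions.
Python3
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)                 # 64-dimensional digit images
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# PCA as a preprocessing step before the classifier
model = make_pipeline(PCA(n_components=20), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))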
Factor analysis, a method within the realm of statistics and part of the general linear model (GLM), serves to condense numerous variables into a smaller set of factors. By doing so, it captures the maximum shared variance among the variables and condenses them into a unified score, which can subsequently be utilized for further analysis.
Factor analysis operates under several assumptions: linearity in relationships, absence of multicollinearity among variables, inclusion of relevant variables in the analysis, and genuine correlations between variables and factors. While multiple extraction methods exist, principal component analysis is the most prevalent in practice.
What does Factor mean in Factor Analysis?
In the context of factor analysis, a “factor” refers to an underlying,
unobserved variable or latent construct that represents a common
source of variation among a set of observed variables. These observed
variables, also known as indicators or manifest variables, are the
measurable variables that are directly observed or measured in a study.
How to do Factor Analysis (Factor Analysis Steps)?
Factor analysis is a statistical method used to describe variability
among observed, correlated variables in terms of a potentially lower
number of unobserved variables called factors. Here are the general
steps involved in conducting a factor analysis:
• Prepare the data and assess its suitability for factor analysis (adequate sample size and sufficient correlations among the variables).
• Choose an extraction method (for example, principal component or principal axis factoring) and extract the initial factors.
• Decide how many factors to retain, for example by examining eigenvalues or a scree plot.
• Rotate the factors to improve interpretability and interpret each factor from the loadings of the variables on it.
• Validate the results using additional data or by conducting a confirmatory factor analysis if necessary.
Types of Factor Analysis
There are two main types of Factor Analysis used in data science:
1. Exploratory Factor Analysis (EFA)
Exploratory Factor Analysis (EFA) is used to uncover the underlying
structure of a set of observed variables without imposing preconceived
notions about how many factors there are or how the variables are
related to each factor. It explores complex interrelationships among
items and aims to group items that are part of unified concepts or
constructs.
• Researchers do not make a priori assumptions about the
relationships among factors, allowing the data to reveal the
structure organically.
• Exploratory Factor Analysis (EFA) helps in identifying the
number of factors needed to account for the variance in the
observed variables and understanding the relationships
between variables and factors.
2. Confirmatory Factor Analysis (CFA)
Confirmatory Factor Analysis (CFA) is a more structured approach
that tests specific hypotheses about the relationships between observed
variables and latent factors based on prior theoretical knowledge or
expectations. It uses structural equation modeling techniques to test a
measurement model, wherein the observed variables are assumed to
load onto specific factors.
• Confirmatory Factor Analysis (CFA) assesses the fit of the
hypothesized model to the actual data, examining how well
the observed variables align with the proposed factor
structure.
• This method allows for the evaluation of relationships
between observed variables and unobserved factors, and it
can accommodate measurement error.
• Researchers hypothesize the relationships between variables and factors before conducting the analysis, and the model is tested against empirical data to determine its validity.
In summary, while Exploratory Factor Analysis (EFA) is more exploratory and flexible, allowing the data to dictate the factor structure, Confirmatory Factor Analysis (CFA) is more confirmatory, testing specific hypotheses about how the observed variables are related to latent factors. Both methods are valuable tools in understanding the underlying structure of data and have their respective strengths and applications.
Types of Factor Extraction Methods
Some of the common factor extraction methods are discussed below:
1. Principal Component Analysis (PCA):
• PCA is a widely used method for factor extraction.
• It aims to extract factors that account for the
maximum possible variance in the observed
variables.
• Factor weights are computed to extract successive
factors until no further meaningful variance can be
extracted.
• After extraction, the factor model is often rotated for
further analysis to enhance interpretability.
2. Canonical Factor Analysis:
• Also known as Rao’s canonical factoring, this
method computes a similar model to PCA but uses
the principal axis method.
• It seeks factors that have the highest canonical
correlation with the observed variables.
• Canonical factor analysis is not affected by arbitrary
rescaling of the data, making it robust to certain data
transformations.
3. Common Factor Analysis:
• Also referred to as Principal Factor Analysis (PFA) or
Principal Axis Factoring (PAF).
• This method aims to identify the fewest factors
necessary to account for the common variance
(correlation) among a set of variables.
• Unlike PCA, common factor analysis focuses on capturing shared variance rather than overall variance (a brief scikit-learn sketch follows this list).
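As a rough illustration of factor extraction in code, the sketch below fits scikit-learn's FactorAnalysis with a varimax rotation to the iris measurements; the dataset, the choice of two factors, and the rotation are illustrative assumptions (the rotation argument requires a reasonably recent scikit-learn).
Python3
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

# Standardize the observed variables so they are on comparable scales
X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# Extract two rotated factors and obtain factor scores for each observation
fa = FactorAnalysis(n_components=2, rotation="varimax")
scores = fa.fit_transform(X_std)
print(fa.components_)  # loadings of each observed variable on the two factors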
Assumptions of Factor Analysis
Let's take a closer look at the assumptions of factor analysis, which are as follows:
1. Linearity: The relationships between variables and factors
are assumed to be linear.
2. Multivariate Normality: The variables in the dataset should
follow a multivariate normal distribution.
3. No Multicollinearity: Variables should not be highly
correlated with each other, as high multicollinearity can affect
the stability and reliability of the factor analysis results.
4. Adequate Sample Size: Factor analysis generally requires a
sufficient sample size to produce reliable results. The
adequacy of the sample size can depend on factors such as the
complexity of the model and the ratio of variables to cases.
5. Homoscedasticity: The variance of the variables should be
roughly equal across different levels of the factors.
6. Uniqueness: Each variable should have unique variance that
is not explained by the factors. This assumption is particularly
important in common factor analysis.
7. Independent Observations: The observations in the dataset
should be independent of each other.
8. Linearity of Factor Scores: The relationship between the
observed variables and the latent factors is assumed to be
linear, even though the observed variables may not be linearly
related to each other.
9. Interval or Ratio Scale: Factor analysis typically assumes
that the variables are measured on interval or ratio scales, as
opposed to nominal or ordinal scales.
Violation of these assumptions can lead to biased parameter estimates
and inaccurate interpretations of the results. Therefore, it’s important
to assess the data for these assumptions before conducting factor
analysis and to consider potential remedies or alternative methods if
the assumptions are not met.
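As a quick, illustrative check of the multicollinearity assumption (the data matrix below is randomly generated purely for demonstration), one can inspect the pairwise correlation matrix before running a factor analysis:
Python3
import numpy as np

# Illustrative data matrix: rows are observations, columns are observed variables
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 6))

# Off-diagonal correlations close to 1 in absolute value signal multicollinearity
corr = np.corrcoef(data, rowvar=False)
off_diag = np.abs(corr[~np.eye(corr.shape[0], dtype=bool)])
print("largest off-diagonal |correlation|:", off_diag.max())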
Independent Component Analysis (ICA)
Assumptions in ICA
1. The first assumption asserts that the source signals (original signals) are
statistically independent of each other.
2. The second assumption is that each source signal exhibits a non-Gaussian distribution.
Mathematical Representation of Independent Component Analysis
The observed random vector is X = (x1, x2, …, xm), representing the observed data with m components. The hidden components are represented by the random vector S = (s1, s2, …, sn), where n is the number of hidden sources.
Linear Static Transformation
The observed data X is transformed into the hidden components S using a linear static transformation represented by the matrix W, that is, S = WX.
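A minimal numpy illustration of this mixing and unmixing relationship is sketched below; the Laplace-distributed sources and the mixing matrix A are arbitrary assumptions, and in practice the unmixing matrix W is estimated by ICA rather than taken as the inverse of a known A.
Python3
import numpy as np

rng = np.random.default_rng(0)
S = rng.laplace(size=(1000, 2))        # two independent, non-Gaussian sources
A = np.array([[1.0, 0.5],              # illustrative mixing matrix
              [0.5, 1.0]])
X = S @ A.T                            # observed mixtures: each row satisfies x = A s
W = np.linalg.inv(A)                   # ideal unmixing matrix for this toy example
S_recovered = X @ W.T                  # s = W x recovers the sources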
Cocktail Party Problem
Consider Cocktail Party Problem or Blind Source Separation problem to understand
the problem which is solved by independent component analysis.
Problem: To extract independent sources’ signals from a mixed signal composed
of the signals from those sources.
Given: Mixed signal from five different independent sources.
Aim: To decompose the mixed signal into independent sources:
• Source 1
• Source 2
• Source 3
• Source 4
• Source 5
Solution: Independent Component Analysis
Here, there is a party going into a room full of people. There is ‘n’ number of
speakers in that room, and they are speaking simultaneously at the party. In the
same room, there are also ‘n’ microphones placed at different distances from the
speakers, which are recording ‘n’ speakers’ voice signals. Hence, the number of
speakers is equal to the number of microphones in the room.
Step 4: Visualize the signals
Python3
import matplotlib.pyplot as plt

# S, X and S_ are the original sources, the observed (mixed) signals and the
# FastICA estimates produced by the separation steps (see the pipeline sketch below)
# Plot the results
plt.figure(figsize=(8, 6))
plt.subplot(3, 1, 1)
plt.title('Original Sources')
plt.plot(S)
plt.subplot(3, 1, 2)
plt.title('Observed Signals')
plt.plot(X)
plt.subplot(3, 1, 3)
plt.title('Estimated Sources (FastICA)')
plt.plot(S_)
plt.tight_layout()
plt.show()
Now, using these microphones’ recordings, we want to separate all the ‘n’
speakers’ voice signals in the room, given that each microphone recorded the
voice signals coming from each speaker of different intensity due to the difference
in distances between them.
Decomposing the mixed signal of each microphone’s recording into an
independent source’s speech signal can be done by using the machine learning
technique, independent component analysis.
Here, X1, X2, …, Xn are the original signals present in the mixed signal, and Y1, Y2, …, Yn are the new features: independent components that are independent of each other.
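The full separation pipeline that produces the S, X and S_ arrays plotted in the earlier snippet can be sketched as follows; the sine, square-wave and noise sources and the mixing matrix A are illustrative assumptions rather than values from the original notes.
Python3
import numpy as np
from sklearn.decomposition import FastICA

# Step 1: generate independent, non-Gaussian source signals
t = np.linspace(0, 8, 2000)
S = np.column_stack([np.sin(2 * t),                       # sinusoid
                     np.sign(np.sin(3 * t)),              # square wave
                     np.random.default_rng(0).laplace(size=t.size)])  # noisy source

# Step 2: mix the sources with an assumed mixing matrix A (X = A S)
A = np.array([[1.0, 0.5, 1.5],
              [0.7, 2.0, 1.0],
              [1.2, 1.0, 2.0]])
X = S @ A.T

# Step 3: recover the independent components with FastICA (S_ is the estimate of S)
ica = FastICA(n_components=3, random_state=0)
S_ = ica.fit_transform(X)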
Locally Linear Embedding (LLE)
LLE (Locally Linear Embedding) is an unsupervised approach designed to transform data from its original high-dimensional space into a lower-dimensional representation, all while striving to retain the essential geometric characteristics of the underlying non-linear feature structure.
LLE operates in several key steps:
• Firstly, it constructs a nearest neighbors graph to capture these local
relationships. Then, it optimizes weight values for each data point,
aiming to minimize the reconstruction error when expressing a point as
a linear combination of its neighbors. This weight matrix reflects the
strength of connections between points.
• Next, LLE computes a lower dimensional representation of the data by
finding eigenvectors of a matrix derived from the weight matrix. These
eigenvectors represent the most relevant directions in the reduced
space. Users can specify the desired dimensionality for the output
space, and LLE selects the top eigenvectors accordingly.
Mathematical Implementation of LLE Algorithm
The key idea of LLE is that locally, in the vicinity of each data point, the data lies
approximately on a linear subspace. LLE attempts to unfold or unroll the data
while preserving these local linear relationships. Here is a mathematical
overview of the LLE algorithm:
Minimize: Σi | xi − Σj wij xj |²
Subject to: Σj wij = 1
Where:
• xi represents the i-th data point.
• wij are the weights that minimize the reconstruction error for data point
xi using its neighbors.
It aims to find a lower-dimensional representation of the data while preserving local relationships. The mathematical expression for LLE involves minimizing the reconstruction error of each data point by expressing it as a weighted sum of the contributions of its k nearest neighbors, subject to constraints ensuring that the weights sum to 1 for each data point.
In short, Locally Linear Embedding (LLE) is a dimensionality reduction technique used in machine learning and data analysis. It focuses on preserving local relationships between data points when mapping high-dimensional data to a lower-dimensional space.
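As a rough sketch of how these reconstruction weights can be computed for a single point (not scikit-learn's implementation; the regularization constant and the example neighbors are arbitrary assumptions):
Python3
import numpy as np

def lle_weights(x_i, neighbors, reg=1e-3):
    # neighbors: (k, d) array holding the k nearest neighbors of x_i
    Z = neighbors - x_i                          # shift so x_i sits at the origin
    G = Z @ Z.T                                  # local Gram matrix, shape (k, k)
    G += reg * np.trace(G) * np.eye(len(G))      # regularize for numerical stability
    w = np.linalg.solve(G, np.ones(len(G)))      # solve G w = 1
    return w / w.sum()                           # enforce the sum-to-one constraint

# Example with three illustrative neighbors of the origin
x = np.array([0.0, 0.0])
nbrs = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
print(lle_weights(x, nbrs))                      # weights sum to 1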
Here, we will explain the LLE algorithm and its parameters.
• Distance Metric (Optional): Nearest neighbors can be found with different distance metrics, such as Euclidean distance, Manhattan distance, or custom-defined distance functions. The choice of distance metric can impact the results.
• Regularization (Optional): In some cases, regularization terms are added
to the cost function to prevent overfitting. Regularization can be useful
when dealing with noisy data or when the number of neighbors is high.
• Optimization Algorithm (Optional): LLE often uses optimization
techniques like Singular Value Decomposition (SVD) or eigenvector
methods to find the lower-dimensional representation. These
optimization methods may have their own parameters that can be
adjusted.
Advantages of LLE
The dimensionality reduction method known as locally linear embedding (LLE) has
many benefits for data processing and visualization. The following are LLE’s main
benefits:
• Preservation of Local Structures: LLE is excellent at maintaining the in-
data local relationships or structures. It successfully captures the
inherent geometry of nonlinear manifolds by maintaining pairwise
distances between nearby data points.
• Handling Non-Linearity: LLE has the ability to capture nonlinear patterns
and structures in the data, in contrast to linear techniques like Principal
Component Analysis (PCA). When working with complicated, curved, or
twisted datasets, it is especially helpful.
• Dimensionality Reduction: LLE lowers the dimensionality of the data
while preserving its fundamental properties. Particularly when working
with high-dimensional datasets, this reduction makes data presentation,
exploration, and analysis simpler.
Disadvantages of LLE
• Curse of Dimensionality: LLE can experience the “curse of
dimensionality” when used with extremely high-dimensional data, just
like many other dimensionality reduction approaches. The number of
neighbors required to capture local interactions rises as dimensionality
does, potentially increasing the computational cost of the approach.
• Memory and Computational Requirements: For big datasets, creating a weighted adjacency matrix as part of LLE might be memory-intensive. The eigenvalue decomposition stage can also be computationally taxing for big datasets.
• Outliers and Noisy data: LLE is susceptible to anomalies and jittery data
points. The quality of the embedding may be affected and the local
linear relationships may be distorted by outliers.
The snippet below applies LLE with scikit-learn and visualizes the reduced data:
from sklearn.manifold import LocallyLinearEmbedding
import matplotlib.pyplot as plt

# X is the high-dimensional input and n_neighbors the neighborhood size (defined earlier)
lle = LocallyLinearEmbedding(n_neighbors=n_neighbors, n_components=2)
X_reduced = lle.fit_transform(X)

# Visualizing the Reduced Data
plt.figure(figsize=(12, 6))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1])
plt.title("Data reduced to 2 components with LLE")
plt.tight_layout()
plt.show()
Isomap (Isometric Mapping)
Understanding and representing complicated data structures is crucial in machine learning. To achieve this, manifold learning, a subset of unsupervised learning, plays a significant role. Among manifold learning techniques, ISOMAP (Isometric Mapping) stands out for its ability to capture the intrinsic geometry of high-dimensional data. It has proved particularly effective in situations where linear methods fall short.
ISOMAP is a flexible tool that blends manifold learning and dimensionality reduction to obtain a more detailed picture of the underlying structure of the data. This section looks at ISOMAP's inner workings and sheds light on its parameters, its operation, and its implementation with scikit-learn.
Isometric mapping is an approach to reducing the dimensionality of data in machine learning.
Relation between Geodesic Distances and Euclidean Distances
Understanding the distinction between geodesic and Euclidean distances is of vital importance for ISOMAP. The geodesic distance considers the shortest path along the curved surface of the manifold, as opposed to the Euclidean distance, which is measured as the straight-line distance in the input space. To provide a more precise representation of the data's internal structure, ISOMAP works with these geodesic distances.
ISOMAP Parameters
ISOMAP comes with several parameters, each influencing the dimensionality
reduction process:
• n_neighbors: Determines the number of neighbors used to approximate geodesic distances. Larger values take more of the data's structure into account but require more computing power.
• n_components: Determines the number of dimensions in a low
dimensional representation.
• eigen_solver: Determines the method used for the eigenvalue decomposition. Options include "auto", "arpack" and "dense".
• radius: You can designate a radius within which neighbors are taken into
account in place of using a set number of neighbors.
Outside of this range, data points are not regarded as neighbors.
• tol: tolerance in the eigenvalue solver to attain convergence. While a
lower value might result in a more accurate solution, it might also
lengthen the computation time.
• max_iter: The maximum number of times the eigenvalue solver can run.
It continues if None is selected, unless convergence or additional
stopping conditions are satisfied.
• path_method: chooses the approximation technique for geodesic
distances on the graph. 'auto' (automatic selection) and 'FW' (Floyd-
Warshall algorithm) are available options.
• neighbors_algorithm: A method for calculating the closest neighbors.
'Auto', 'ball_tree', 'kd_tree', and 'brute' are among the available options.
'auto' selects the best algorithm according to the input data.
• metric: The distance metric for the nearest neighbor search. 'minkowski' is the default; 'euclidean', 'manhattan', and several other options are also available.
Working of ISOMAP
• Calculate the pairwise distances: The algorithm starts by calculating the
Euclidean distances between the data points.
• Find nearest neighbors according to these distances: For each data point, its k nearest neighbors are determined using these distances.
• Create a neighborhood graph: each point is connected by edges to its nearest neighbors, creating a graph that represents the data's local structure.
• Calculate geodesic distances: The Floyd-Warshall algorithm computes the shortest paths between all pairs of data points in the neighborhood graph; these shortest-path lengths are the geodesic distances.
• Perform dimensionality reduction: Classical Multidimensional Scaling (MDS) is applied to the geodesic distance matrix, resulting in a low-dimensional embedding of the data (a rough sketch of these steps follows the list).
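The sketch below walks through these steps with numpy/scipy on an S-curve dataset; it is a rough illustration under the assumption of a connected neighborhood graph, not scikit-learn's optimized implementation.
Python3
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.datasets import make_s_curve
from sklearn.neighbors import kneighbors_graph

# Steps 1-3: pairwise distances and the k-nearest-neighbor graph
X, _ = make_s_curve(n_samples=300, random_state=0)
graph = kneighbors_graph(X, n_neighbors=10, mode="distance")

# Step 4: geodesic distances = shortest paths through the graph (Floyd-Warshall)
D = shortest_path(graph, method="FW", directed=False)

# Step 5: classical MDS on the geodesic distance matrix
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n            # centering matrix
B = -0.5 * J @ (D ** 2) @ J                    # double-centered squared distances
eigvals, eigvecs = np.linalg.eigh(B)
top = np.argsort(eigvals)[::-1][:2]            # two largest eigenpairs
embedding = eigvecs[:, top] * np.sqrt(eigvals[top])
The same procedure is available directly through scikit-learn's Isomap class, as in the snippet below.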
# Apply Isomap (using the S-curve data X generated above)
from sklearn.manifold import Isomap

isomap = Isomap(n_neighbors=10, n_components=2)
X_isomap = isomap.fit_transform(X)
This code sample illustrates how to apply the dimensionality reduction method Isomap to an S-curve dataset. S-curve data with 3-D coordinates is generated, Isomap reduces it to 2-D, and the original 3-D data can then be plotted next to the reduced 2-D data for visualization. The Isomap transformation preserves the fundamental connections between data points in the lower-dimensional space, capturing the underlying geometric structure. With the inherent patterns in the data still intact, the resulting visualization shows how effective Isomap is at unfolding the S-curve structure into a more manageable 2-D representation.
Least Square Method
• Step 5: The intercept c is calculated from the following formula: c = Y − mX, where X and Y here denote the means of the x and y values.
Thus, we obtain the line of best fit as y = mx + c, where the values of m and c can be calculated from the formulae defined above.
These formulas are used to calculate the parameters of the line that best fits the
data according to the criterion of the least squares, minimizing the sum of the
squared differences between the observed values and the values predicted by the
linear model.
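A small numpy sketch of these formulas is given below; the x and y values are illustrative and are not the data from the worked example that follows.
Python3
import numpy as np

# Illustrative sample data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

x_mean, y_mean = x.mean(), y.mean()
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)  # slope
c = y_mean - m * x_mean                                              # intercept
print(f"line of best fit: y = {m:.2f}x + {c:.2f}")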
The red points in the above plot represent the data points for the sample data
available. Independent variables are plotted as x-coordinates and dependent ones
are plotted as y-coordinates. The equation of the line of best fit obtained from the
Least Square method is plotted as the red line in the graph.
The above graph shows how the Least Square method helps us find a line that best fits the given data points, which can then be used to make further predictions about the value of the dependent variable where it is not known initially.
Limitations of the Least Square Method
The Least Square method assumes that the data is evenly distributed and doesn’t
contain any outliers for deriving a line of best fit. But, this method doesn’t provide
accurate results for unevenly distributed data or for data containing outliers.
In the worked example, for the sample point (xi, yi) = (4, 8): X − xi = 0.2, Y − yi = 0, (X − xi)(Y − yi) = 0 and (X − xi)² = 0.04. Summing over all the sample points gives Σ(X − xi) = 0, Σ(Y − yi) = 0, Σ(X − xi)(Y − yi) = 55 and Σ(X − xi)² = 32.8.
The slope of the line of best fit can be calculated from the formula as follows:
m = Σ(X − xi)(Y − yi) / Σ(X − xi)² = 55 / 32.8 = 1.68 (rounded to 2 decimal places)
Now, the intercept is calculated from the formula as follows:
c = Y − mX = 8 − 1.68 × 4.2 = 0.94
Thus, the equation of the line of best fit becomes y = 1.68x + 0.94.
Genetic Algorithms
Genetic Algorithms(GAs) are adaptive heuristic search algorithms that belong to
the larger part of evolutionary algorithms. Genetic algorithms are based on the
ideas of natural selection and genetics. These are intelligent exploitation of
random searches provided with historical data to direct the search into the region
of better performance in solution space. They are commonly used to generate
high-quality solutions for optimization problems and search problems.
Genetic algorithms simulate the process of natural selection which means those
species that can adapt to changes in their environment can survive and reproduce
and go to the next generation. In simple words, they simulate “survival of the
fittest” among individuals of consecutive generations to solve a problem. Each
generation consists of a population of individuals and each individual represents
a point in search space and possible solution. Each individual is represented as a
string of character/integer/float/bits. This string is analogous to the Chromosome.
Foundation of Genetic Algorithms
Genetic algorithms are based on an analogy with the genetic structure and
behavior of chromosomes of the population. Following is the foundation of GAs
based on this analogy –
1. Individuals in the population compete for resources and mate
2. Those individuals who are successful (fittest) then mate to create more
offspring than others
3. Genes from the “fittest” parent propagate throughout the generation,
that is sometimes parents create offspring which is better than either
parent.
4. Thus each successive generation is more suited for their environment.
Search space
The population of individuals is maintained within the search space. Each individual represents a solution in the search space for the given problem. Each individual is coded as a finite-length vector (analogous to a chromosome) of components. These variable components are analogous to genes. Thus a chromosome (individual) is composed of several genes (variable components).
Fitness Score
A fitness score is given to each individual which shows the ability of an individual to “compete”. The individuals having optimal (or near-optimal) fitness scores are sought.
A GA maintains a population of n individuals (chromosomes/solutions) along with their fitness scores. The individuals having better fitness scores are given more chance to reproduce than others: they are selected to mate and produce better offspring by combining the chromosomes of the parents. The population size is static, so room has to be created for new arrivals. Hence, some individuals die and get replaced by new arrivals, eventually creating a new generation once all the mating opportunities of the old population are exhausted. It is hoped that over successive generations better solutions will arrive while the least fit die. Each new generation has, on average, more “better genes” than the individuals (solutions) of previous generations, and thus each new generation has better “partial solutions” than previous generations. Once the offspring produced show no significant difference from the offspring produced by previous populations, the population has converged; the algorithm is then said to have converged to a set of solutions for the problem.
Operators of Genetic Algorithms
Once the initial generation is created, the algorithm evolves the generation using
following operators –
1) Selection Operator: The idea is to give preference to the individuals with
good fitness scores and allow them to pass their genes to successive generations.
2) Crossover Operator: This represents mating between individuals. Two individuals are selected using the selection operator and crossover sites are chosen randomly. Then the genes at these crossover sites are exchanged, creating a completely new individual (offspring).
3) Mutation Operator: The key idea is to insert random genes in the offspring to maintain diversity in the population and avoid premature convergence.
The whole algorithm can be summarized as:
1) Randomly initialize the population p
2) Determine the fitness of the population
3) Until convergence, repeat:
   a) Select parents from the population
   b) Crossover and generate the new population
   c) Perform mutation on the new population
   d) Calculate fitness for the new population
A minimal Python sketch of this loop follows.
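The sketch below is an illustrative, self-contained toy GA that evolves a string toward a target; the target string, gene alphabet, population size, mutation rate and tournament selection are all assumptions made for the example, not part of the original notes.
Python3
import random

TARGET = "hello world"                      # illustrative target solution
GENES = "abcdefghijklmnopqrstuvwxyz "       # allowed gene values
POP_SIZE, MUT_RATE = 100, 0.05

def random_individual():
    return [random.choice(GENES) for _ in range(len(TARGET))]

def fitness(ind):
    # Fitness score = number of genes matching the target
    return sum(g == t for g, t in zip(ind, TARGET))

def select(pop):
    # Tournament selection: prefer the fitter of two random individuals
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):
    point = random.randrange(1, len(TARGET))    # random crossover site
    return p1[:point] + p2[point:]

def mutate(ind):
    return [random.choice(GENES) if random.random() < MUT_RATE else g for g in ind]

# 1)-2) Randomly initialize the population and evaluate fitness
population = [random_individual() for _ in range(POP_SIZE)]
for generation in range(1000):
    best = max(population, key=fitness)
    if fitness(best) == len(TARGET):            # converged: target reached
        break
    # 3a)-3d) Select parents, crossover, mutate, and form the new population
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP_SIZE)]

print("generation:", generation, "best:", "".join(max(population, key=fitness)))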