
Regularization in Deep Learning: L1, L2 &

Dropout
What is Overfitting?

Simply put, when a model trains on sample data for too long or becomes overly complex, it may begin to learn "noise", or unimportant information, from the dataset. When the model memorizes the noise and fits the training set too closely, it becomes "overfitted" and unable to generalize well to new data. And if a model cannot generalize well to new data, it cannot carry out the classification or prediction tasks it was designed for.

What is Regularization?

Regularization is a collection of strategies that help prevent overfitting in neural networks and improve their accuracy by modifying the learning procedure slightly so that the model generalizes more successfully when completely new data from the problem domain is fed into it. As a result, the model performs better on unobserved data.

Why Regularization?

Through regularization, larger coefficients receive a "penalty", which ultimately reduces the variance of the model; in deep learning specifically, it is the nodes' weight matrices that are penalized. With regularization, a better-optimized and more accurate model is obtained.

How does Regularization work?

When modeling the data, a low-bias, high-variance scenario is referred to as overfitting. To handle this, regularization techniques trade a little more bias for less variance. Effective regularization strikes the optimal balance between bias and variance, with the final result being a notable decrease in variance at the least possible cost in bias; put another way, it achieves low variance without significantly raising the bias.

Additionally, regularization ranks candidate models from least to most overfit and adds larger penalties to more complicated models. Regularization makes the assumption that smaller weights lead to simpler models and thus help prevent overfitting.

Techniques of Regularization

Now that we have a better understanding of what overfitting is and how regularization makes deep learning models better and more effective, let's shift our focus to the techniques we need to use for regularization in deep learning.

L1 Regularization
Essentially, the L1 regularizer searches for parameter vectors that minimize the norm of the parameter vector (the length of the vector). The main question is how to best optimize the parameters of a single neuron and, more generally, of a single-layer feed-forward neural network.

Since L1 regularization produces sparse solutions, it is the favored method when there are many features. We also gain a computational advantage, since features with zero coefficients can be omitted.

The mathematical representation for the L1 regularization is:

Cost = Loss + λ * Σ |wi|

Here lambda (λ) is the regularization parameter. We penalize the absolute value of the weights, so weights may be reduced all the way to zero. Hence L1 regularization comes in very handy when we are trying to compress a deep learning model.
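As an illustration (not taken from the original text), here is a minimal NumPy sketch of adding an L1 penalty to a mean-squared-error loss; the data `X`, `y`, the weight vector `w`, and the strength `lam` are assumed names for the example.

```python
import numpy as np

def l1_regularized_loss(w, X, y, lam):
    """Mean-squared error plus an L1 penalty: lam * sum(|w_i|)."""
    residual = X @ w - y
    return np.mean(residual ** 2) + lam * np.sum(np.abs(w))

def l1_gradient(w, X, y, lam):
    """Gradient of the loss above; the L1 term contributes lam * sign(w)."""
    residual = X @ w - y
    return 2 * X.T @ residual / len(y) + lam * np.sign(w)

# Toy usage: a few gradient steps drive small, unhelpful weights toward exactly zero.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
w, lam, lr = np.zeros(5), 0.1, 0.01
for _ in range(200):
    w -= lr * l1_gradient(w, X, y, lam)
```

Because the penalty's gradient depends only on the sign of each weight, small weights are pushed to exactly zero, which is where the sparsity comes from.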

L2 Regularization

By shrinking the coefficients while keeping all the variables in the model, L2 regularization helps solve problems with multicollinearity (highly correlated independent variables). The importance of predictors can be estimated with L2 regression, and based on that, the unimportant predictors can be penalized.

The mathematical representation for the L2 regularization is:

Cost = Loss + λ * Σ wi²

The regularization parameter, in this case, is again lambda (λ). The value of this hyperparameter is generally tuned for better outcomes. Since L2 regularization causes the weights to decay towards zero (but not exactly zero), it is also known as weight decay.
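As a hedged counterpart to the L1 sketch above (same assumed names), the L2 penalty and its gradient look like this:

```python
import numpy as np

def l2_regularized_loss(w, X, y, lam):
    """Mean-squared error plus an L2 (weight-decay) penalty: lam * sum(w_i^2)."""
    residual = X @ w - y
    return np.mean(residual ** 2) + lam * np.sum(w ** 2)

def l2_gradient(w, X, y, lam):
    """The L2 term contributes 2 * lam * w, so each step shrinks the weights toward zero."""
    residual = X @ w - y
    return 2 * X.T @ residual / len(y) + 2 * lam * w
```

Because the penalty's gradient is proportional to the weight itself, the weights decay smoothly toward zero but rarely reach exactly zero, in contrast to L1.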

The key differences between L1 and L2 Regularization

A regression model is referred to as Lasso Regression if the L1 regularization method is used, and Ridge Regression if the L2 regularization method is employed.

The penalty for L1 regularization is equal to the absolute value of the coefficients. This form of regularization can produce sparse models with few coefficients: some coefficients may go to zero and be dropped from the model. The higher the penalty, the closer the coefficient values are to zero (ideal for producing simpler models).

On the other hand, L2 regularization does not produce sparse models and does not eliminate coefficients. As a result, Lasso Regression is simpler to interpret than Ridge Regression.

Apart from this, there are a few other factors where the L1 regularization technique differs from L2 regularization. These factors are as follows:

1. L1 regularization adds a penalty term to the cost function by taking the absolute value of the weight parameters into account, whereas L2 regularization adds the squared value of the weights to the cost function.

2. To avoid overfitting, L2 regularization produces estimates related to the mean of the data, whereas L1 regularization produces estimates related to the median.

3. Since the L2 penalty is a square of the weights, it leads to a closed-form solution; L1, which involves an absolute value and is non-differentiable at zero, does not. Because of this, L1 regularization requires more approximations, is computationally more costly, and cannot be handled purely within the framework of matrix computation.

Dropout Regularization

Dropout is a regularization method in which certain neurons are ignored at random: they "drop out" at random. This means that on the forward pass their contribution to the activation of downstream neurons is temporarily removed, and on the backward pass no weight updates are applied to them. As a neural network learns, neuron weights settle into their place in the network.

Neuron weights become tuned for particular features, which results in some specialization. Neighboring neurons then start to depend on this specialization, and if this goes too far it can produce a fragile model that is overly dependent on the training data.

In the dropout regularization technique, the term "complex co-adaptations" describes how a neuron becomes dependent on its particular context during training; dropout is intended to break up such co-adaptations.
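As a rough, framework-free sketch (not taken from the text), "inverted" dropout can be implemented by masking activations at random during training and rescaling the survivors; `activations` and `p_drop` are illustrative names.

```python
import numpy as np

def dropout_forward(activations, p_drop=0.5, training=True):
    """Inverted dropout: zero each unit with probability p_drop during training and
    rescale the survivors by 1 / (1 - p_drop) so the expected activation is unchanged."""
    if not training or p_drop == 0.0:
        return activations, None                      # at test time, nothing is dropped
    mask = np.random.rand(*activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop), mask  # the same mask is applied to gradients
```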

Radial Basis Function Network (RBF)

A Radial Basis Function network is a feed-forward artificial neural network with a single hidden layer that uses radial basis functions as its activation functions; it is widely used in mathematical modeling.

The output of the RBF network is a linear combination of neuron parameters and radial basis functions of the inputs. This network is used in time series prediction, function approximation, system control and classification.
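As an illustrative sketch (not from the original text), the output of a Gaussian RBF network can be computed as a weighted sum of radial functions of the distance between the input and a set of centres; the centre, width, and weight values below are made up.

```python
import numpy as np

def rbf_forward(x, centres, widths, weights, bias=0.0):
    """Gaussian RBF network: y = bias + sum_j w_j * exp(-||x - c_j||^2 / (2 * s_j^2))."""
    dists = np.linalg.norm(centres - x, axis=1)           # distance from x to each centre
    hidden = np.exp(-(dists ** 2) / (2 * widths ** 2))    # radial (Gaussian) hidden activations
    return bias + hidden @ weights                        # linear output layer

# Toy usage with made-up centres, widths, and output weights.
centres = np.array([[0.0, 0.0], [1.0, 1.0]])
widths = np.array([0.5, 0.5])
weights = np.array([1.0, -1.0])
print(rbf_forward(np.array([0.2, 0.1]), centres, widths, weights))
```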

Cover’s theorem on the separability of patterns:

This theorem justifies the use of a linear output layer in an RBF network. According to the theorem, when the transformation from the input space to the feature space is nonlinear and the dimensionality of the feature space is high compared to that of the input space, there is a high likelihood that a pattern-classification task that is not separable in the input space becomes linearly separable in the feature space.


Interpolation problem:

The interpolation problem requires every input vector to be mapped exactly onto the corresponding target vector; solving it amounts to determining the real coefficients of the expansion (and, in the generalized form, a polynomial term). A function is called a radial basis function if the corresponding interpolation problem has a unique solution.
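As a small sketch of exact RBF interpolation (an illustration with made-up data, not the text's own example), the coefficients are found by solving the linear system Phi w = t, where Phi[i, j] = phi(||x_i - x_j||).

```python
import numpy as np

def gaussian_phi(r, s=1.0):
    return np.exp(-(r ** 2) / (2 * s ** 2))

def fit_rbf_interpolant(X, t, s=1.0):
    """Solve Phi w = t, where Phi[i, j] = phi(||x_i - x_j||), so that the
    interpolant passes exactly through every (x_i, t_i) pair."""
    r = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    Phi = gaussian_phi(r, s)
    return np.linalg.solve(Phi, t)                               # unique if Phi is non-singular

# Toy usage: three 1-D points and their targets.
X = np.array([[0.0], [0.5], [1.0]])
t = np.array([0.0, 1.0, 0.0])
w = fit_rbf_interpolant(X, t)
```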

Supervised learning as an ill-posed hyper surface reconstruction:

The supervised part of the training procedure for the RBF network is concerned with determining suitable values for the weights connecting the hidden and output layers. The learning of a neural network, viewed as a hypersurface reconstruction problem, is an ill-posed inverse problem for the following reasons:

i) The training data lack the information needed to reconstruct the input-output mapping uniquely.

ii) The presence of noise in the input data adds uncertainty.

Regularization theory-

Regularization is a way of controlling the smoothness properties of a mapping function. It involves adding to the error function an extra term designed to penalize mappings that are not smooth. Instead of restricting the number of hidden units, an alternative approach for preventing overfitting in RBF networks comes from regularization theory.


Regularization network-

An RBF network can be seen as a special case of a regularization network; RBF networks have a sound theoretical foundation in regularization theory. They fit naturally into the framework of regularizing interpolation/approximation tasks, where regularization means smoothing the interpolation/approximation curve or surface. This view of the RBF network is also known as the regularization network.

Generalized Radial Basis Function networks (RBF):

Generalized RBF Network


RBF networks have good generalization ability and a simple network structure that avoids unnecessary and lengthy calculation. The modified or generalized RBF network has the following characteristics:

i) Gaussian function is modified

ii) Hidden neuron activation is normalized

iii) Output weights are the function of input variables

iv) A sequential learning algorithm is presented.

Regularization parameter estimation:

The unknown weights and the error variance are estimated by regularization. The regularization parameter has the effect of reducing the variances of the network parameter estimates. Maximum penalized likelihood estimates the weight parameters in the RBF network, and the regularization parameter is given as β = θα², where α² is the error variance.

RBF networks- Approximation properties:

RBFs are embedded in a two-layer neural network in which each hidden unit implements a radially activated function and the output units implement a weighted sum of the hidden-unit outputs. While the mapping from input to hidden layer is nonlinear, the output layer is typically linear. Owing to their nonlinear approximation capabilities, RBF networks are able to model complex mappings.

RBF networks and multilayer Perceptron comparison:


Similarities-

i) They are both non-linear feed-forward networks.

ii) They are both universal approximators.

iii) They are both used in similar application areas.

Dissimilarities-

An RBF network has a single hidden layer, whereas a multilayer perceptron can have any number of hidden layers.

Kernel regression and RBF networks relationship:

The theory of kernel regression provides another viewpoint for the use of RBF

network for function approximation. It provides a framework for estimating

regression function for noisy data using kernel density estimation technique. The

objective of function approximation is to find a mapping from input space to

output space. The mapping is provided by forming the regression, or conditional

average of target data, conditioned on input variables. The regression function is

known as the Nadaraya-Watson estimator.

Learning strategies:

Common learning strategies are the orthogonal least squares (OLS) method and the hybrid learning method. In OLS, the hidden neurons (RBF centers) are selected one by one in a supervised manner. The computationally more efficient hybrid learning method combines self-organized and supervised learning strategies.


Related Questions and Answer

Q1. What are the similarities of RBF and MLP networks?

Ans- Both networks are non-linear feed-forward networks, universal approximators, and used in similar application areas.

Q2. Write the difference between RBF and MLP networks?

Ans- An RBF network has a single hidden layer; MLPs can have more than one hidden layer.

Q3. Classify the learning strategies for RBF networks?

Ans- They are classified as: i) RBF networks with a fixed number of RBF centers;

ii) RBF networks employing a supervised procedure for selecting a fixed number of RBF centers.

Q4. What is Cover’s theorem?

Ans- According to this theorem, a complex pattern classification problem cast in a

high dimensional space nonlinearly is more likely to be linearly separable than in a

low dimensional space. It tells us that we can map the input space to a high

dimensional space, in which a linear function will be found.

Regularization Networks

https://www.youtube.com/watch?v=u1OpsSwOGe0

Comparison between MLP and RBF

| RBF | MLP |
| --- | --- |
| 1. RBFN has a single hidden layer. | 1. MLP can have multiple hidden layers. |
| 2. In RBF, the computation nodes of the hidden layer are different from the output nodes. | 2. MLP follows a common computational model in the hidden as well as the output layers. |
| 3. In RBF, the hidden layer is non-linear and the output layer is linear. | 3. In MLP, the hidden and output layers are usually both non-linear. |
| 4. The argument of the RBF activation function computes the Euclidean norm (distance) between the input vector and the centre. | 4. Each hidden unit computes the inner product of the input vector and its synaptic weight vector. |
| 5. Exponentially decaying, localized characteristics (local approximation). | 5. Global approximation to the non-linear input-output mapping. |
| 6. RBFN is fully connected. | 6. MLP can be partially connected. |
| 7. In RBFN, the hidden nodes operate differently, i.e. they may have different models. | 7. In MLP, the hidden nodes share a common model, though not necessarily the same activation function. |
| 8. In an RBF network we take the difference of the input vector and the weight (centre) vector. | 8. In an MLP network we take the product of the input vector and the weight vector. |
| 9. In RBF, training proceeds one layer at a time. | 9. In MLP, all layers are trained simultaneously. |
| 10. RBFN has a faster training process. | 10. MLP is slower to train. |
| 11. RBFN is slower when used in practice. | 11. MLP is faster when used in practice. |

Self Organizing Maps, or Kohonen's maps, are a type of artificial neural network introduced by Teuvo Kohonen in the 1980s.

SOM is trained using unsupervised learning and is a little bit different from other artificial neural networks: SOM doesn't learn by backpropagation with SGD; it uses competitive learning to adjust the weights in its neurons. We use this type of artificial neural network for dimension reduction, to reduce our data by creating a spatially organized representation; it also helps us discover correlations in the data.

SOM's architecture :

Self organizing maps have two layers: the first one is the input layer and the second one is the output layer, or the feature map.

Unlike other ANN types, SOM doesn't have an activation function in its neurons; inputs are passed directly to the output layer without any further transformation.

Each neuron in a SOM is assigned a weight vector with the same dimensionality d as the input space.

Self organizing maps training

As we mentioned before, SOM doesn't use backpropagation with SGD to update weights; this type of unsupervised artificial neural network uses competitive learning to update its weights.

Competitive learning is based on three processes:

• Competition

• Cooperation

• Adaptation
Let's explain those processes.

1) Competition:

As we said before, each neuron in a SOM is assigned a weight vector with the same dimensionality as the input space.

In the example below, each neuron of the output layer holds a vector of dimension n.

We compute the distance between each neuron (neuron from the output layer) and the input data, and the neuron with the lowest distance is the winner of the competition.

The Euclidean metric is commonly used to measure distance.


2) Cooperation:

We will update the vector of the winner neuron in the final process (adaptation), but it is not the only one: its neighbors will also be updated.

How do we choose the neighbors?

To choose neighbors we use a neighborhood kernel function. This function depends on two factors: time (incremented with each new input data point) and distance between the winner neuron and the other neuron (how far that neuron is from the winner neuron).

The image below shows how the winner neuron's (the most green one in the center) neighbors are chosen depending on the distance and time factors.
Time and distance factors

3) Adaptation:

After choosing the winner neuron and its neighbors, we compute the neuron updates. The chosen neurons are all updated, but not by the same amount: the farther a neuron is from the input data, the less we adjust it, as shown in the image below:

neurons of the output layer update

The winner neuron and its neighbors are updated using this formula:

w(t+1) = w(t) + α(t) * h(t) * (x(t) - w(t))

Here α(t) is the learning rate, which indicates how much we want to adjust our weights. As time t grows toward infinity, this learning rate converges to zero, so eventually there is no update even for the winner neuron.

The neighborhood kernel h(t) depends on the distance between the winner neuron and the other neuron (they are inversely related: as the distance d increases, h(t) decreases) and on the neighborhood size, which itself shrinks as time increases; this makes the neighborhood kernel function decrease as well.

Full SOM algorithm :
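As a hedged NumPy sketch of the full procedure described above (competition, cooperation with a Gaussian neighborhood kernel, and adaptation with a decaying learning rate), the grid size, decay schedules, and the `train_som` name below are illustrative choices rather than anything prescribed by the text.

```python
import numpy as np

def train_som(data, grid_h=10, grid_w=10, n_iter=1000, lr0=0.5, sigma0=3.0):
    """Minimal SOM training loop: competition, cooperation, adaptation."""
    rng = np.random.default_rng(0)
    n_features = data.shape[1]
    weights = rng.random((grid_h, grid_w, n_features))          # one weight vector per map node
    # Grid coordinates, used to measure distance between nodes on the map.
    coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij"), axis=-1)

    for t in range(n_iter):
        x = data[rng.integers(len(data))]                        # pick a random input vector
        lr = lr0 * np.exp(-t / n_iter)                           # decaying learning rate
        sigma = sigma0 * np.exp(-t / n_iter)                     # decaying neighborhood radius

        # Competition: the node whose weight vector is closest to x wins.
        dists = np.linalg.norm(weights - x, axis=-1)
        winner = np.unravel_index(np.argmin(dists), dists.shape)

        # Cooperation: Gaussian neighborhood kernel around the winner on the grid.
        grid_dist2 = np.sum((coords - np.array(winner)) ** 2, axis=-1)
        h = np.exp(-grid_dist2 / (2 * sigma ** 2))

        # Adaptation: move the winner and its neighbors toward the input.
        weights += lr * h[..., None] * (x - weights)
    return weights
```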


The concept of a self-organizing map, or SOM, was first put forth by Kohonen. It is a way to reduce data dimensions: an unsupervised neural network, trained with unsupervised learning techniques, that builds a low-dimensional, discretized representation of the input space of the training samples. This representation is known as a map.

In this article, we will be going through a Beginner’s guide to a popular Self Organizing Map

- The Kohonen Map. We will start with understanding what Self-Organizing Maps are.

What Are Self-Organizing Maps?

A sort of artificial neural network called a self-organizing map, often known as a Kohonen map

or SOM, was influenced by 1970s neural systems’ biological models. It employs an

unsupervised learning methodology and uses a competitive learning algorithm to train its

network. To minimize complex issues for straightforward interpretation, SOM is utilized for

mapping and clustering (or dimensionality reduction) procedures to map multidimensional data
onto lower-dimensional spaces. The output layer and the input layer are the two layers that make

up the SOM. This is also known as the Kohonen Map.

Now that we have discussed what SOMs are, we will now be discussing how the Kohonen Maps

work.

How Do SOMs Work?

Consider an input set with dimensions (m, n), where m represents the number of training examples and n represents the number of features present in each example. The weights, of size (n, C), where C is the number of clusters, are first initialized. Iterating over the input data, for each training example the winning vector (the weight vector with the shortest distance from the training example, for example the Euclidean distance) is updated. The weight update rule is given by:

w_ij = w_ij(old) + alpha(t) * (x_ik - w_ij(old))

Here, i stands for the ith feature of the training example, j is the winning vector, alpha is the learning rate at time t, and k is the kth training example of the input data. After the SOM network is trained, the trained weights are used to cluster new examples: a new example is assigned to the cluster of its winning vector.
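As a small illustration of this update rule (the `weights` matrix of size (n, C) and the example vector `x` are assumed names, not from the text), one training step can be written as:

```python
import numpy as np

def update_winner(weights, x, alpha):
    """One SOM update step following w_ij(new) = w_ij(old) + alpha(t) * (x_ik - w_ij(old)).
    weights: (n, C) matrix of cluster prototypes; x: one training example of length n."""
    j = np.argmin(np.linalg.norm(weights - x[:, None], axis=0))  # winning vector: smallest distance
    weights[:, j] += alpha * (x - weights[:, j])                  # move it toward the example
    return j
```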

Algorithm

The involved actions are:

• Step 1: Initialize each node's weights w_ij to a random value.

• Step 2: Select an input vector x_k at random.

• Step 3: For each node on the map, repeat steps 4 and 5.

• Step 4: Find the Euclidean distance between the input vector x(t) and the weight vector w_ij connected to the first node, where t, i, and j are initialized to 0.

• Step 5: Track the node that produces the smallest distance.

• Step 6: Determine the global Best Matching Unit (BMU), the node closest to the input vector among all calculated nodes.

• Step 7: Find the BMU's topological neighborhood and its radius in the Kohonen Map.

Note: Steps 2 onward represent the training phase, whereas step 1 represents the initialization phase.

Here,

X → input vector

X(t) → the input vector instance at iteration t

t → current iteration

w → weight vector

w_ij → the association weight of the grid node at row i, column j

i → the grid's row coordinate for nodes

j → the grid's column coordinate for nodes

o(t) → the radius of the neighborhood function, which determines how far neighbor nodes in the 2D grid are inspected when updating vectors; over time, it gradually gets smaller

β_ij → the neighborhood function, representing the distance between grid node (i, j) and the BMU

Let us now discuss the various uses of Self-Organizing or Kohonen Maps.


Self-Organizing Maps for Vector

Quantization
Self-Organizing Maps for Vector Quantization: A powerful technique for

data representation and compression in machine learning applications.

Self-Organizing Maps (SOMs) are a type of unsupervised learning

algorithm used in machine learning to represent high-dimensional data in a

lower-dimensional space. They are particularly useful for vector

quantization, a process that compresses data by approximating it with a

smaller set of representative vectors. This article explores the nuances,

complexities, and current challenges of using SOMs for vector

quantization, as well as recent research and practical applications.

Recent research in the field has focused on various aspects of vector

quantization, such as coordinate-independent quantization, ergodic

properties, constrained randomized quantization, and quantization of

Kähler manifolds. These studies have contributed to the development of

new techniques and approaches for quantization, including tautologically


tuned quantization, lattice vector quantization coupled with spatially

adaptive companding, and per-vector scaled quantization.

Three practical applications of SOMs for vector quantization include:

1. Image compression: SOMs can be used to compress images by reducing

the number of colors used in the image while maintaining its overall

appearance. This can lead to significant reductions in file size without a

noticeable loss in image quality.

2. Data clustering: SOMs can be used to group similar data points together,

making it easier to identify patterns and trends in large datasets. This can

be particularly useful in applications such as customer segmentation,

anomaly detection, and document classification.

3. Feature extraction: SOMs can be used to extract meaningful features

from complex data, such as images or audio signals. These features can

then be used as input for other machine learning algorithms, improving

their performance and reducing computational complexity.

A case study that demonstrates related ideas in vector quantization is LVQAC, a novel Lattice Vector Quantization scheme coupled with a spatially Adaptive Companding (LVQAC) mapping for efficient learned image compression. By replacing uniform quantizers with LVQAC, the authors achieved better rate-distortion performance without significantly increasing model complexity.

In conclusion, Self-Organizing Maps for Vector Quantization offer a

powerful and versatile approach to data representation and compression in

machine learning applications. By synthesizing information from various

research studies and connecting them to broader theories, we can continue

to advance our understanding of this technique and develop new,

innovative solutions for a wide range of problems.

How do Self-Organizing Maps work in vector quantization?

Self-Organizing Maps (SOMs) work in vector quantization by representing high-dimensional

data in a lower-dimensional space. They use unsupervised learning to create a grid of nodes,

where each node represents a prototype vector. During the training process, the algorithm

adjusts the prototype vectors to better represent the input data. The result is a compressed

representation of the data, where similar data points are grouped together in the lower-

dimensional space.

What are the advantages of using Self-Organizing Maps for vector quantization?

The advantages of using Self-Organizing Maps for vector quantization include:

1. Data compression: SOMs can significantly reduce the size of data by approximating it with a smaller set of representative vectors, making it more manageable and efficient to process.

2. Visualization: By representing high-dimensional data in a lower-dimensional space, SOMs make it easier to visualize complex data patterns and relationships.

3. Unsupervised learning: SOMs do not require labeled data for training, making them suitable for applications where labeled data is scarce or expensive to obtain.

4. Robustness: SOMs are less sensitive to noise and outliers in the data, making them more robust in real-world applications.

5. Adaptability: SOMs can be easily adapted to different types of data and problems, making them a versatile tool for various machine learning tasks.

What are the challenges in using Self-Organizing Maps for vector quantization?

Some challenges in using Self-Organizing Maps for vector quantization include:

1. Computational complexity: The training process for SOMs can be computationally intensive, especially for large datasets and high-dimensional data.

2. Parameter selection: Choosing appropriate parameters, such as the size of the map and the learning rate, can significantly impact the performance of the SOM.

3. Lack of a global optimum: SOMs do not guarantee convergence to a global optimum, which can result in suboptimal solutions.

4. Interpretability: While SOMs can provide a visual representation of the data, interpreting the results can still be challenging, especially for non-experts.

How does image compression using Self-Organizing Maps work?

Image compression using Self-Organizing Maps works by reducing the number of colors

used in the image while maintaining its overall appearance. During the training process, the

SOM learns a set of representative colors (prototype vectors) from the input image. The

original colors in the image are then replaced with the closest representative colors from the
trained SOM. This results in a compressed image with a smaller color palette, leading to

significant reductions in file size without a noticeable loss in image quality.
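As an illustrative sketch (not taken from the source), once a SOM's weight vectors have been learned from an image's pixel colours (for example with the `train_som` function sketched earlier, flattened into a 2-D codebook), compression amounts to replacing each pixel with its nearest prototype colour; the `pixels` and `codebook` arrays below are assumed shapes.

```python
import numpy as np

def quantize_image(pixels, codebook):
    """pixels: (num_pixels, 3) RGB values; codebook: (num_colors, 3) prototype colours
    learned by a SOM. Returns each pixel replaced by its nearest prototype colour."""
    dists = np.linalg.norm(pixels[:, None, :] - codebook[None, :, :], axis=-1)
    nearest = np.argmin(dists, axis=1)      # index of the closest prototype per pixel
    return codebook[nearest]                # compressed image uses only the codebook colours
```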

Are there any alternatives to Self-Organizing Maps for vector quantization?

Yes, there are several alternatives to Self-Organizing Maps for vector quantization, including:

1. K-means clustering: A popular unsupervised learning algorithm that partitions data into K clusters, where each cluster is represented by a centroid.

2. Principal Component Analysis (PCA): A linear dimensionality reduction technique that projects data onto a lower-dimensional space while preserving the maximum amount of variance.

3. Vector quantization using lattice quantizers: A method that uses a predefined lattice structure to quantize data points, resulting in a more regular and structured representation.

4. Autoencoders: A type of neural network that learns to compress and reconstruct input data, often used for dimensionality reduction and feature extraction.

Each of these alternatives has its own strengths and weaknesses, and the choice of method depends on the specific problem and requirements of the application.

Principal Component Analysis


Principal Component Analysis is an unsupervised learning algorithm used for dimensionality reduction in machine learning. It is a statistical process that converts observations of correlated features into a set of linearly uncorrelated features with the help of an orthogonal transformation. These new transformed features are called the Principal Components. PCA is one of the popular tools used for exploratory data analysis and predictive modeling, and it is a technique for drawing out the strong patterns in a dataset by reducing its dimensionality while retaining as much variance as possible.

PCA generally tries to find the lower-dimensional surface onto which to project the high-dimensional data. It works by considering the variance of each attribute, because attributes with high variance tend to give a good split between classes, and it uses this to reduce the dimensionality. Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing the power allocation in various communication channels. It is a feature extraction technique, so it keeps the important variables and drops the least important ones.

The PCA algorithm is based on some mathematical concepts such as:

o Variance and Covariance

o Eigenvalues and Eigenvectors

Some common terms used in the PCA algorithm:

o Dimensionality: The number of features or variables present in the given dataset; more simply, the number of columns in the dataset.

o Correlation: How strongly two variables are related to each other, such that if one changes, the other variable also changes. The correlation value ranges from -1 to +1: -1 occurs if the variables are inversely proportional to each other, and +1 indicates that they are directly proportional to each other.

o Orthogonal: The variables are not correlated with each other, and hence the correlation between a pair of variables is zero.

o Eigenvectors: Given a square matrix M and a non-zero vector v, v is an eigenvector of M if Mv is a scalar multiple of v.

o Covariance Matrix: A matrix containing the covariances between pairs of variables is called the covariance matrix.


Principal Components in PCA

As described above, the transformed new features, or the output of PCA, are the Principal Components. The number of these PCs is either equal to or less than the number of original features present in the dataset. Some properties of these principal components are given below:

o Each principal component must be a linear combination of the original features.

o The components are orthogonal, i.e., the correlation between a pair of components is zero.

o The importance of each component decreases when going from 1 to n: the first PC has the most importance, and the nth PC has the least importance.

Steps for PCA algorithm


1. Getting the dataset

Firstly, we need to take the input dataset and divide it into two subparts X and Y, where X is the

training set, and Y is the validation set.

2. Representing data into a structure

Now we will represent our dataset as a structure: a two-dimensional matrix of the independent variable X. Here each row corresponds to a data item, and each column corresponds to a feature. The number of columns gives the dimensionality of the dataset.

3. Standardizing the data

In this step, we will standardize our dataset. Within a particular column, features with high variance would otherwise appear more important than features with lower variance. If the importance of features should be independent of their variance, we divide each data item in a column by the standard deviation of the column. The resulting matrix is named Z.
4. Calculating the Covariance of Z

To calculate the covariance of Z, we take the matrix Z and transpose it; after transposing, we multiply it by Z. The output matrix is the covariance matrix of Z.

5. Calculating the Eigenvalues and Eigenvectors

Now we need to calculate the eigenvalues and eigenvectors of the resulting covariance matrix. The eigenvectors of the covariance matrix are the directions of the axes with the most information, and the eigenvalues are the coefficients attached to these eigenvectors.

6. Sorting the Eigenvectors

In this step, we take all the eigenvalues and sort them in decreasing order, from largest to smallest, and simultaneously sort the eigenvectors accordingly into a matrix P. The resulting sorted matrix is named P*.

7. Calculating the new features, or Principal Components

Here we calculate the new features by multiplying Z by the P* matrix. In the resulting matrix Z*, each observation is a linear combination of the original features, and each column of Z* is independent of the others.

8. Remove less important features from the new dataset

Now that the new feature set is obtained, we decide what to keep and what to remove: we keep only the relevant or important features in the new dataset, and the unimportant features are removed.
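As a hedged NumPy sketch tying steps 3 through 8 together (the `pca` function name and the toy data are illustrative assumptions, and step 1's split into training and validation sets is omitted):

```python
import numpy as np

def pca(X, n_components=2):
    """Minimal PCA following the steps above: standardize, covariance,
    eigen-decomposition, sorting, and projection onto the top components."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)          # Step 3: standardize each column
    cov = (Z.T @ Z) / (len(Z) - 1)                    # Step 4: covariance matrix of Z
    eigvals, eigvecs = np.linalg.eigh(cov)            # Step 5: eigh suits symmetric matrices
    order = np.argsort(eigvals)[::-1]                 # Step 6: sort by eigenvalue, largest first
    P_star = eigvecs[:, order]
    Z_star = Z @ P_star                               # Step 7: new features
    return Z_star[:, :n_components], eigvals[order]   # Step 8: keep only the top components

# Toy usage with random data.
X = np.random.default_rng(0).normal(size=(100, 5))
scores, explained = pca(X, n_components=2)
```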

Applications of Principal Component Analysis

o PCA is mainly used as the dimensionality reduction technique in various AI

applications such as computer vision, image compression, etc.

o It can also be used for finding hidden patterns if data has high dimensions. Some

fields where PCA is used are Finance, data mining, Psychology, etc.
The steps of PCA can also be walked through in more detail as follows.

Step 1: Standardization

The reason why it is critical to perform standardization prior to PCA is that PCA is quite sensitive to the variances of the initial variables. That is, if there are large differences between the ranges of the initial variables, those variables with larger ranges will dominate over those with small ranges (for example, a variable that ranges between 0 and 100 will dominate over a variable that ranges between 0 and 1), which will lead to biased results. So, transforming the data to comparable scales can prevent this problem.

Mathematically, this can be done by subtracting the mean and dividing by the

standard deviation for each value of each variable.

Once the standardization is done, all the variables will be transformed to the same

scale.

Step 2: Covariance Matrix Computation

The aim of this step is to understand how the variables of the input data set are

varying from the mean with respect to each other, or in other words, to see if there

is any relationship between them. Because sometimes, variables are highly

correlated in such a way that they contain redundant information. So, in order to

identify these correlations, we compute the covariance matrix.


The covariance matrix is a p × p symmetric matrix (where p is the number of

dimensions) that has as entries the covariances associated with all possible pairs of

the initial variables. For example, for a 3-dimensional data set with 3 variables x, y,

and z, the covariance matrix is a 3×3 matrix of this form:

| Cov(x,x)  Cov(x,y)  Cov(x,z) |
| Cov(y,x)  Cov(y,y)  Cov(y,z) |
| Cov(z,x)  Cov(z,y)  Cov(z,z) |

Covariance Matrix for 3-Dimensional Data.

Since the covariance of a variable with itself is its variance (Cov(a,a)=Var(a)), in

the main diagonal (Top left to bottom right) we actually have the variances of each

initial variable. And since the covariance is commutative (Cov(a,b)=Cov(b,a)), the

entries of the covariance matrix are symmetric with respect to the main diagonal,

which means that the upper and the lower triangular portions are equal.

What do the covariances that we have as entries of the matrix tell us about the

correlations between the variables?

It’s actually the sign of the covariance that matters:

• If positive then: the two variables increase or decrease together (correlated)

• If negative then: one increases when the other decreases (Inversely

correlated)
Now that we know that the covariance matrix is not more than a table that

summarizes the correlations between all the possible pairs of variables, let’s move

to the next step.
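As a brief illustration (with made-up data, not from the text), NumPy's np.cov can compute such a covariance matrix, and the two properties just described can be checked directly:

```python
import numpy as np

# Three variables (x, y, z) observed 100 times; np.cov expects rows to be variables.
rng = np.random.default_rng(0)
data = rng.normal(size=(3, 100))

cov = np.cov(data)                                               # 3x3 covariance matrix
print(np.allclose(np.diag(cov), np.var(data, axis=1, ddof=1)))   # diagonal = variances
print(np.allclose(cov, cov.T))                                   # symmetric: Cov(a,b) == Cov(b,a)
```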

Step 3: Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal

components

Eigenvectors and eigenvalues are the linear algebra concepts that we need to

compute from the covariance matrix in order to determine the principal

components of the data.

What you first need to know about eigenvectors and eigenvalues is that they

always come in pairs, so that every eigenvector has an eigenvalue. Also, their

number is equal to the number of dimensions of the data. For example, for a 3-

dimensional data set, there are 3 variables, therefore there are 3 eigenvectors with 3

corresponding eigenvalues.

It is the eigenvectors and eigenvalues that are behind all the magic of principal components: the eigenvectors of the covariance matrix are the directions of the axes with the most variance (most information), and these are what we call the Principal Components. The eigenvalues are simply the coefficients attached to the eigenvectors, and they give the amount of variance carried by each Principal Component.


By ranking your eigenvectors in order of their eigenvalues, highest to lowest, you

get the principal components in order of significance.

Principal Component Analysis Example:

Let’s suppose that our data set is 2-dimensional with 2 variables x,y and that the

eigenvectors and eigenvalues of the covariance matrix are as follows:

If we rank the eigenvalues in descending order, we get λ1>λ2, which means that

the eigenvector that corresponds to the first principal component (PC1) is v1 and

the one that corresponds to the second principal component (PC2) is v2.

After having the principal components, to compute the percentage of variance

(information) accounted for by each component, we divide the eigenvalue of each

component by the sum of eigenvalues. If we apply this on the example above, we

find that PC1 and PC2 carry respectively 96 percent and 4 percent of the variance

of the data.
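As a tiny numerical illustration (the eigenvalues below are made-up stand-ins for λ1 and λ2, since the example's actual values are not reproduced here), the fraction of variance per component is its eigenvalue divided by the sum of eigenvalues:

```python
import numpy as np

eigenvalues = np.array([1.30, 0.05])          # stand-ins for lambda_1 and lambda_2
explained = eigenvalues / eigenvalues.sum()   # fraction of variance per component
print(explained)                              # roughly [0.96, 0.04]: PC1 ~96%, PC2 ~4%
```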
Step 4: Create a Feature Vector

As we saw in the previous step, computing the eigenvectors and ordering them by their eigenvalues in descending order allows us to find the principal components in order of significance. In this step, what we do is choose whether to keep all these components or discard those of lesser significance (those with low eigenvalues), and form with the remaining ones a matrix of vectors that we call the feature vector.

So, the feature vector is simply a matrix that has as columns the eigenvectors of the

components that we decide to keep. This makes it the first step towards

dimensionality reduction, because if we choose to keep only p eigenvectors

(components) out of n, the final data set will have only p dimensions.

Principal Component Analysis Example:

Continuing with the example from the previous step, we can either form a feature

vector with both of the eigenvectors v1 and v2:

Or discard the eigenvector v2, which is the one of lesser significance, and form a

feature vector with v1 only:

Discarding the eigenvector v2 will reduce dimensionality by 1, and will consequently cause a loss of information in the final data set. But given that v2 was carrying only 4 percent of the information, the loss will therefore not be important, and we will still have the 96 percent of the information that is carried by v1.

So, as we saw in the example, it’s up to you to choose whether to keep all the

components or discard the ones of lesser significance, depending on what you are

looking for. Because if you just want to describe your data in terms of new

variables (principal components) that are uncorrelated without seeking to reduce

dimensionality, leaving out lesser significant components is not needed.

Step 5: Recast the Data Along the Principal Components Axes

In the previous steps, apart from standardization, you do not make any changes to the data; you just select the principal components and form the feature vector, but the input data set always remains in terms of the original axes (i.e., in terms of the initial variables).

In this step, which is the last one, the aim is to use the feature vector formed from the eigenvectors of the covariance matrix to reorient the data from the original axes to the axes represented by the principal components (hence the name Principal Component Analysis). This can be done by multiplying the transpose of the original data set by the transpose of the feature vector.
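As a final hedged sketch (array names and shapes are assumptions), the recast described here, FinalData^T = FeatureVector^T * StandardizedData^T, can be written as:

```python
import numpy as np

# Illustrative shapes: 100 standardized observations of 5 variables,
# and a feature vector keeping the top 2 eigenvectors as its columns.
Z = np.random.default_rng(0).normal(size=(100, 5))        # stand-in for standardized data
cov = np.cov(Z, rowvar=False)                             # 5x5 covariance matrix
feature_vector = np.linalg.eigh(cov)[1][:, ::-1][:, :2]   # top-2 eigenvectors as columns

# Recast the data along the principal component axes.
final_data = (feature_vector.T @ Z.T).T                   # shape (100, 2)
```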
