Unit 3

Principal Component Analysis (PCA):

As the number of features or dimensions in a dataset increases, the amount of data
required to obtain a statistically significant result increases exponentially. This can lead
to issues such as overfitting, increased computation time, and reduced accuracy of
machine learning models. This is known as the curse of dimensionality: the set of
problems that arise while working with high-dimensional data.
As the number of dimensions increases, the number of possible combinations of
features increases exponentially, which makes it computationally difficult to obtain a
representative sample of the data, and tasks such as clustering or classification become
expensive to perform. Additionally, some machine learning algorithms can be sensitive
to the number of dimensions, requiring more data to achieve the same level of accuracy
as lower-dimensional data.
To address the curse of dimensionality, feature engineering techniques are used, which
include feature selection and feature extraction. Dimensionality reduction is a type of
feature extraction technique that aims to reduce the number of input features while
retaining as much of the original information as possible.
In this article, we will discuss one of the most popular dimensionality reduction
techniques, i.e., Principal Component Analysis (PCA).

What is Principal Component Analysis(PCA)?


The Principal Component Analysis (PCA) technique was introduced by the
mathematician Karl Pearson in 1901. It works on the condition that while the data in a
higher dimensional space is mapped to data in a lower dimensional space, the variance of
the data in the lower dimensional space should be maximum.
• Principal Component Analysis (PCA) is a statistical procedure that uses an
orthogonal transformation to convert a set of correlated variables into a set
of uncorrelated variables. PCA is the most widely used tool in exploratory
data analysis and in machine learning for predictive models.
• Principal Component Analysis (PCA) is an unsupervised learning technique
used to examine the interrelations among a set of variables. It is
also known as a general factor analysis where regression determines a line of
best fit.
• The main goal of Principal Component Analysis (PCA) is to reduce the
dimensionality of a dataset while preserving the most important patterns or
relationships between the variables without any prior knowledge of the target
variables.
Principal Component Analysis (PCA) is used to reduce the dimensionality of a data set
by finding a new set of variables, smaller than the original set of variables, retaining
most of the sample’s information, and useful for the regression and classification of
data.

Principal Component Analysis

1. Principal Component Analysis (PCA) is a technique for dimensionality


reduction that identifies a set of orthogonal axes, called principal
components, that capture the maximum variance in the data. The principal
components are linear combinations of the original variables in the dataset
and are ordered in decreasing order of importance. The total variance
captured by all the principal components is equal to the total variance in the
original dataset.
2. The first principal component captures the most variation in the data, but the
second principal component captures the maximum variance that
is orthogonal to the first principal component, and so on.
3. Principal Component Analysis can be used for a variety of purposes,
including data visualization, feature selection, and data compression. In data
visualization, PCA can be used to plot high-dimensional data in two or three
dimensions, making it easier to interpret. In feature selection, PCA can be
used to identify the most important variables in a dataset. In data
compression, PCA can be used to reduce the size of a dataset without losing
important information.
4. In Principal Component Analysis, it is assumed that the information is carried
in the variance of the features; that is, the higher the variation in a feature, the
more information that feature carries.
Overall, PCA is a powerful tool for data analysis and can help to simplify complex
datasets, making them easier to understand and work with.
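A minimal sketch of this workflow with scikit-learn is shown below; the array X, its shape, and the choice of two components are illustrative assumptions, not part of the original text.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)                  # placeholder data: 100 samples, 10 features
X_std = StandardScaler().fit_transform(X)    # mean 0, variance 1 per feature
pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)            # coordinates on PC1 and PC2
print(pca.explained_variance_ratio_)         # fraction of variance captured per component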
Step-By-Step Explanation of PCA (Principal Component Analysis)
Step 1: Standardization
First, we need to standardize our dataset to ensure that each variable has a mean of 0
and a standard deviation of 1.

Step 2: Covariance Matrix Computation

Covariance measures the strength of joint variability between two or more variables,
indicating how much they change in relation to each other. To find the covariance we
can use the formula:

cov(x1, x2) = Σᵢ (x1ᵢ − x̄1)(x2ᵢ − x̄2) / (n − 1)

The value of covariance can be positive, negative, or zero.

• Positive: as x1 increases, x2 also increases.
• Negative: as x1 increases, x2 decreases.
• Zero: no direct relationship between x1 and x2.
Step 3: Compute Eigenvalues and Eigenvectors of the Covariance Matrix to
Identify Principal Components
The eigenvectors of the covariance matrix give the directions of the principal
components, and the corresponding eigenvalues give the amount of variance captured
along each direction. The eigenvectors are sorted in decreasing order of their
eigenvalues, and the data is projected onto the top components.
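A from-scratch sketch of Steps 1-3 with NumPy is given below; the placeholder array X and the choice of k = 2 components are assumptions for illustration.

import numpy as np

X = np.random.rand(200, 5)                         # placeholder data: 200 samples, 5 features

# Step 1: standardize each feature to mean 0 and standard deviation 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized features
cov = np.cov(X_std, rowvar=False)                  # shape (n_features, n_features)

# Step 3: eigenvalues/eigenvectors of the covariance matrix, sorted by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(cov)             # eigh: the covariance matrix is symmetric
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the first k principal components
k = 2
scores = X_std @ eigvecs[:, :k]
explained = eigvals[:k] / eigvals.sum()            # proportion of variance captured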

Advantages of Principal Component Analysis


1. Dimensionality Reduction: Principal Component Analysis is a popular
technique used for dimensionality reduction, which is the process of reducing
the number of variables in a dataset. By reducing the number of variables,
PCA simplifies data analysis, improves performance, and makes it easier to
visualize data.
2. Feature Selection: Principal Component Analysis can be used for feature
selection, which is the process of selecting the most important variables in a
dataset. This is useful in machine learning, where the number of variables can
be very large, and it is difficult to identify the most important variables.
3. Data Visualization: Principal Component Analysis can be used for data
visualization. By reducing the number of variables, PCA can plot high-
dimensional data in two or three dimensions, making it easier to interpret.
4. Multicollinearity: Principal Component Analysis can be used to deal
with multicollinearity, which is a common problem in a regression analysis
where two or more independent variables are highly correlated. PCA can help
identify the underlying structure in the data and create new, uncorrelated
variables that can be used in the regression model.
5. Noise Reduction: Principal Component Analysis can be used to reduce the
noise in data. By removing the principal components with low variance,
which are assumed to represent noise, Principal Component Analysis can
improve the signal-to-noise ratio and make it easier to identify the underlying
structure in the data.
6. Data Compression: Principal Component Analysis can be used for data
compression. By representing the data using a smaller number of principal
components, which capture most of the variation in the data, PCA can reduce
the storage requirements and speed up processing.
7. Outlier Detection: Principal Component Analysis can be used for outlier
detection. Outliers are data points that are significantly different from the
other data points in the dataset. Principal Component Analysis can identify
these outliers by looking for data points that are far from the other points in
the principal component space.
Disadvantages of Principal Component Analysis
1. Interpretation of Principal Components: The principal components created
by Principal Component Analysis are linear combinations of the original
variables, and it is often difficult to interpret them in terms of the original
variables. This can make it difficult to explain the results of PCA to others.
2. Data Scaling: Principal Component Analysis is sensitive to the scale of the
data. If the data is not properly scaled, then PCA may not work well.
Therefore, it is important to scale the data before applying Principal
Component Analysis.
3. Information Loss: Principal Component Analysis can result in information
loss. While Principal Component Analysis reduces the number of variables, it
can also lead to loss of information. The degree of information loss depends
on the number of principal components selected. Therefore, it is important to
carefully select the number of principal components to retain.
4. Non-linear Relationships: Principal Component Analysis assumes that the
relationships between variables are linear. However, if there are non-linear
relationships between variables, Principal Component Analysis may not work
well.
5. Computational Complexity: Computing Principal Component Analysis can
be computationally expensive for large datasets. This is especially true if the
number of variables in the dataset is large.
6. Overfitting: Principal Component Analysis can sometimes result
in overfitting, which is when the model fits the training data too well and
performs poorly on new data. This can happen if too many principal
components are used or if the model is trained on a small dataset.

Principal component analysis, or PCA, is a statistical procedure that allows


you to summarize the information content in large data tables by means of a
smaller set of “summary indices” that can be more easily visualized and
analyzed. The underlying data can be measurements describing
properties of production samples, chemical compounds or
reactions, process time points of a continuous process, batches
from a batch process, biological individuals or trials of a DOE-
protocol.

Principal component analysis today is one of the most popular multivariate statistical
techniques. It has been widely used in the areas of pattern recognition and signal
processing and is a statistical method under the broad title of factor analysis.
PCA forms the basis of multivariate data analysis based on projection methods. The
most important use of PCA is to represent a multivariate data table as a smaller set of
variables (summary indices) in order to observe trends, jumps, clusters and outliers.
This overview may uncover the relationships between observations and variables, and
among the variables.

PCA is a very flexible tool and allows analysis of datasets that may contain, for example,
multicollinearity, missing values, categorical data, and imprecise measurements. The goal is to
extract the important information from the data and to express this information as a set of
summary indices called principal components.

Statistically, PCA finds lines, planes and hyper-planes in the K-dimensional space that
approximate the data as well as possible in the least squares sense. A line or plane that is the least
squares approximation of a set of data points makes the variance of the coordinates on the line or
plane as large as possible.
PCA creates a visualization of data that minimizes residual variance in the least squares sense
and maximizes the variance of the projection coordinates.

How PCA works

In a previous article, we explained why pre-treating data for PCA is necessary. Now, let’s take a
look at how PCA works, using a geometrical approach.

Consider a matrix X with N rows (aka "observations") and K columns (aka "variables"). For this
matrix, we construct a variable space with as many dimensions as there are variables (see figure
below). Each variable represents one coordinate axis. For each variable, the length has been
standardized according to a scaling criterion, normally by scaling to unit variance. You can find
more details on scaling to unit variance in the previous blog post.

A K-dimensional variable space. For simplicity, only three variable axes are displayed.
The “length” of each coordinate axis has been standardized according to a specific criterion,
usually unit variance scaling.

In the next step, each observation (row) of the X-matrix is placed in the K-dimensional variable
space. Consequently, the rows in the data table form a swarm of points in this space.
The observations (rows) in the data matrix X can be understood as a swarm of points in the
variable space (K-space).

Mean centering

Next, mean-centering involves the subtraction of the variable averages from the data. The vector
of averages corresponds to a point in the K-space.

In the mean-centering procedure, you first compute the variable averages. This vector of
averages is interpretable as a point (here in red) in space. The point is situated in the middle of
the point swarm (at the center of gravity).
The subtraction of the averages from the data corresponds to a re-positioning of the coordinate
system, such that the average point now is the origin.

The mean-centering procedure corresponds to moving the origin of the coordinate system to
coincide with the average point (here in red).

The first principal component

After mean-centering and scaling to unit variance, the data set is ready for computation of the
first summary index, the first principal component (PC1). This component is the line in the K-
dimensional variable space that best approximates the data in the least squares sense. This line
goes through the average point. Each observation (yellow dot) may now be projected onto this
line in order to get a coordinate value along the PC-line. This new coordinate value is also
known as the score.

The first principal component (PC1) is the line that best accounts for the shape of the point
swarm. It represents the maximum variance direction in the data. Each observation (yellow dot)
may be projected onto this line in order to get a coordinate value along the PC-line. This value
is known as a score.

The second principal component

Usually, one summary index or principal component is insufficient to model the systematic
variation of a data set. Thus, a second summary index – a second principal component (PC2) – is
calculated. The second PC is also represented by a line in the K-dimensional variable space,
which is orthogonal to the first PC. This line also passes through the average point, and improves
the approximation of the X-data as much as possible.

The second principal component (PC2) is oriented such that it reflects the second largest source
of variation in the data while being orthogonal to the first PC. PC2 also passes through the
average point.
Two principal components define a model plane

When two principal components have been derived, they together define a plane, a window into
the K-dimensional variable space. By projecting all the observations onto the low-dimensional
sub-space and plotting the results, it is possible to visualize the structure of the investigated data
set. The coordinate values of the observations on this plane are called scores, and hence the
plotting of such a projected configuration is known as a score plot.

Two PCs form a plane. This plane is a window into the multidimensional space, which can be
visualized graphically. Each observation may be projected onto this plane, giving a score for
each.

Singular Value Decomposition (SVD)


The Singular Value Decomposition (SVD) of a matrix is a factorization of that matrix
into three matrices. It has some interesting algebraic properties and conveys important
geometrical and theoretical insights about linear transformations. It also has some
important applications in data science. In this article, I will try to explain the
mathematical intuition behind SVD and its geometrical meaning.
Mathematics behind SVD:
The SVD of an m×n matrix A is given by the formula

A = U Σ Vᵀ

where:

• U: m×m matrix of the orthonormal eigenvectors of AAᵀ.
• Vᵀ: transpose of an n×n matrix containing the orthonormal eigenvectors of AᵀA.
• Σ: diagonal m×n matrix whose r nonzero entries are the square roots of the positive
eigenvalues of AAᵀ or AᵀA (both matrices have the same positive eigenvalues).
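The factorization can be checked numerically with NumPy, as in the sketch below; the random 4×3 matrix is an arbitrary illustration.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))            # arbitrary 4x3 example matrix

U, s, Vt = np.linalg.svd(A)                # s holds the singular values, largest first
Sigma = np.zeros((4, 3))
Sigma[:3, :3] = np.diag(s)                 # embed the singular values in an m x n matrix

print(np.allclose(A, U @ Sigma @ Vt))      # True: A = U Σ Vᵀ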
Examples

• Find the SVD for the matrix A =

• To calculate the SVD, we first need to compute the singular values by
finding the eigenvalues of AAᵀ.
• The characteristic equation for the above matrix is:

so our singular values are:

• Now we find the right singular vectors, i.e., an orthonormal set of eigenvectors of
AᵀA. The eigenvalues of AᵀA are 25, 9, and 0, and since AᵀA is symmetric
we know that the eigenvectors will be orthogonal.
For λ = 25,

which can be row-reduced to:

A unit vector in its direction is:

Similarly, for λ = 9, the eigenvector is:

For the 3rd eigenvector, we could use the property that it is perpendicular to v1 and v2,
such that:

Solving the above equation generates the third eigenvector.

Now, we calculate U using the formula uᵢ = (1/σᵢ) A vᵢ, and this gives U.
Hence, our final SVD equation becomes A = U Σ Vᵀ with these factors.

Applications
• Calculation of Pseudo-inverse: The pseudo-inverse, or Moore-Penrose inverse, is
the generalization of the matrix inverse to matrices that may not be invertible (such as
low-rank matrices). If a matrix is invertible, its pseudo-inverse equals its inverse, but the
pseudo-inverse also exists for matrices that are not invertible. It is denoted by A⁺.
Suppose we need to calculate the pseudo-inverse of a matrix M.
Then, the SVD of M can be given as M = U W Vᵀ.
Multiply both sides by M⁻¹:
I = M⁻¹ U W Vᵀ
Multiply both sides on the right by V (using Vᵀ V = I):
V = M⁻¹ U W
Multiply on the right by W⁻¹ (W is the diagonal matrix of singular values, so W⁻¹ simply
inverts its nonzero diagonal entries):
V W⁻¹ = M⁻¹ U W W⁻¹ = M⁻¹ U
Multiply on the right by Uᵀ (using U Uᵀ = I):
V W⁻¹ Uᵀ = M⁻¹ U Uᵀ = M⁻¹
The above equation gives the pseudo-inverse: M⁺ = V W⁻¹ Uᵀ.
Solving a set of Homogeneous Linear Equations (Mx = b): if b = 0, calculate the SVD and
take any column of V (row of Vᵀ) associated with a singular value (in W) equal to 0.
If b ≠ 0, multiply both sides by M⁻¹: M⁻¹ M x = M⁻¹ b, so x = M⁻¹ b.

From the pseudo-inverse, we know that M⁻¹ = V W⁻¹ Uᵀ.

Hence, x = V W⁻¹ Uᵀ b.
• Rank, Range, and Null space:

• The rank of matrix M can be calculated from the SVD as the number
of nonzero singular values.
• The range of matrix M is spanned by the left singular vectors (columns of U)
corresponding to the nonzero singular values.
• The null space of matrix M is spanned by the right singular vectors (columns of V)
corresponding to the zero singular values.
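A small NumPy sketch of these facts, using an arbitrary rank-deficient example matrix:

import numpy as np

A = np.array([[1., 2., 3.],
              [2., 4., 6.],            # a multiple of the first row, so A has rank 2
              [1., 0., 1.]])

U, s, Vt = np.linalg.svd(A)
tol = 1e-10
rank = int(np.sum(s > tol))            # rank = number of nonzero singular values
range_basis = U[:, :rank]              # left singular vectors for nonzero singular values
null_basis = Vt[rank:, :].T            # right singular vectors for zero singular values

print(rank, np.linalg.matrix_rank(A))  # both print 2
print(np.allclose(A @ null_basis, 0))  # True: null-space vectors are mapped to zero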

• Curve Fitting Problem: Singular value decomposition can be used to
minimize the least-squares error. It uses the pseudo-inverse to approximate the solution.
• Besides the above applications, singular value decomposition and the pseudo-
inverse can also be used in digital signal processing and image processing.

• We need to represent the matrix in a form such that the most important
part of the matrix, which is needed for further computations, can be
extracted easily. That is where Singular Value Decomposition (SVD) comes into play.

• SVD is basically a matrix factorization technique, which decomposes


any matrix into 3 generic and familiar matrices. It has some cool
applications in Machine Learning and Image Processing. To understand
the concept of Singular Value Decomposition the knowledge
on eigenvalues and eigenvectors is essential. If you have a pretty good
understanding on eigenvalues and eigenvectors, scroll down a bit to
experience the Singular Value Decomposition.

• Eigenvalues and Eigenvectors


• Image Credits: https://commons.wikimedia.org/wiki/File:Eigenvectors.gif

• The multiplication of a matrix and a vector produces another vector,
which is defined as the transformation applied to that vector with
respect to the specific matrix in the given vector space. However, there
exist some vectors, for a given matrix, whose direction does
not change even after the transformation is applied (similar to the
vectors colored in blue in the above GIF). Such vectors are called
the eigenvectors of the given matrix, while the factor by which the
vector is scaled after the transformation is defined as the eigenvalue corresponding
to that eigenvector. This can be illustrated as A v = λ v.
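The relation A v = λ v can be checked numerically; the 2×2 matrix below is an arbitrary example.

import numpy as np

A = np.array([[2., 0.],
              [1., 3.]])

eigvals, eigvecs = np.linalg.eig(A)                 # columns of eigvecs are the eigenvectors

for i in range(len(eigvals)):
    v = eigvecs[:, i]
    print(np.allclose(A @ v, eigvals[i] * v))       # True for every eigenpair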

Autoencoders & Relation to PCA :-


PCA and auto-encoders are two popular methods for lowering the dimensionality of the
feature space.
Principal Component Analysis (PCA)
PCA simply projects the data into another space by learning a linear transformation with
projection vectors specified by the data’s variance. Dimensionality reduction may be
achieved by limiting the dimensionality to a small number of components that account
for the majority of the variation in the data set.
Autoencoders
Autoencoders are neural networks that stack numerous non-linear transformations
(layers) to reduce the input into a low-dimensional latent space. They use an encoder-decoder
system. The encoder converts the input into the latent space, while the decoder reconstructs
it. For accurate input reconstruction, they are trained through backpropagation.
Autoencoders may be used to reduce dimensionality when the latent space has fewer
dimensions than the input. Because they can rebuild the input, these low-dimensional
latent variables should store the most relevant properties, according to intuition.

Simple Illustration of a generic autoencoder

PCA vs Autoencoder
• Although PCA is fundamentally a linear transformation, auto-encoders may
describe complicated non-linear processes.
• Because PCA features are projections onto the orthogonal basis, they are
completely linearly uncorrelated. However, since autoencoded features are
only trained for correct reconstruction, they may have correlations.
• PCA is quicker and less expensive to compute than autoencoders.
• PCA is quite similar to a single layered autoencoder with a linear activation
function (a sketch of this appears after this list).
• Because of the large number of parameters, the autoencoder is prone to
overfitting. (However, regularization and proper planning might help to
prevent this).
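To illustrate the points above, here is a hedged Keras sketch of a single-hidden-layer autoencoder with linear activations and MSE loss, which spans the same subspace as the top principal components when the data is mean-centered; the data, layer sizes, and training settings are assumptions for illustration.

import numpy as np
from tensorflow.keras import layers, models

n_features, n_components = 20, 2
X = np.random.rand(500, n_features).astype("float32")
X -= X.mean(axis=0)                                   # mean-center, as PCA does

inp = layers.Input(shape=(n_features,))
code = layers.Dense(n_components, activation="linear", use_bias=False)(inp)
out = layers.Dense(n_features, activation="linear", use_bias=False)(code)

linear_ae = models.Model(inp, out)
linear_ae.compile(optimizer="adam", loss="mse")
linear_ae.fit(X, X, epochs=50, batch_size=32, verbose=0)   # reconstruct the input from the code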
What are Autoencoders?
Autoencoders are feed-forward neural networks where the input and the output are the same.
Autoencoders encode the image and then decode it to get the same image. The core idea of
autoencoders is that the middle layer must contain enough information to represent the input.
There are three important properties of autoencoders:
1. Data Specific: We can only use autoencoders for the data that it has previously been trained
on. For instance, to encode an MNIST digits image, we’ll have to use an autoencoder that
previously has been trained on the MNIST digits dataset.
2. Lossy: Information is lost while encoding and decoding the images using autoencoders, which
means that the reconstructed image will have some missing details compared to the original
image.
3. Unsupervised: Autoencoders belong to the unsupervised machine learning category because
we do not require explicit labels corresponding to the data; the data itself acts as input and
output.
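A minimal Keras sketch of such an autoencoder is shown below, assuming flattened 28×28 (784-dimensional) inputs such as MNIST digits scaled to [0, 1]; the layer sizes are illustrative.

from tensorflow.keras import layers, models

input_dim, code_dim = 784, 32

inp = layers.Input(shape=(input_dim,))
encoded = layers.Dense(code_dim, activation="relu")(inp)           # encoder: compress the input
decoded = layers.Dense(input_dim, activation="sigmoid")(encoded)   # decoder: reconstruct the input

autoencoder = models.Model(inp, decoded)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

# Training: the input and the target are the same data.
# autoencoder.fit(x_train, x_train, epochs=20, batch_size=256)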

Autoencoders are neural networks which are commonly used for feature
selection and extraction. However, when there are more nodes in the hidden
layer than there are inputs, the network risks learning the so-called
"identity function", also called the "null function", meaning that the output
equals the input, making the autoencoder useless.

Regularization of Autoencoders

1. What is regularization in autoencoder?


Regularized autoencoders use a loss function that encourages the model to have
other properties besides copying its input to its output.
2. What is the need for regularization while training a neural network?
If you've built a neural network before, you know how complex they are. This
makes them more prone to overfitting. Regularization is a technique that makes
slight modifications to the learning algorithm such that the model generalizes better.

Autoencoders are a variant of feed-forward neural networks that have an extra bias for calculating the
error of reconstructing the original input. After training, autoencoders are then used as a normal feed-
forward neural network for activations. This is an unsupervised form of feature extraction because the
neural network uses only the original input for learning weights, rather than the labels used in
supervised backpropagation. Deep networks can use either RBMs or autoencoders as building blocks for larger networks (a
single network rarely uses both).

Use of autoencoders
Autoencoders are used to learn compressed representations of datasets. Commonly, we use it in
reducing the dimensions of the dataset. The output of the autoencoder is a reformation of the
input data in the most efficient form.

Similarities of autoencoders to multilayer perceptron

Autoencoders are identical to multilayer perceptron neural networks because, like multilayer
perceptrons, autoencoders have an input layer, some hidden layers, and an output layer. The key
difference between a multilayer perceptron network and an autoencoder is that the output layer
of an autoencoder has the same number of neurons as that of the input layer.
Regularization
Regularization helps with the effects of out-of-control parameters by using different methods to
minimize parameter size over time.
In mathematical notation, we see regularization represented by the coefficient lambda,
controlling the trade-off between finding a good fit and keeping the value of certain feature
weights low as the exponents on features increase.
Regularization coefficients L1 and L2 help fight overfitting by making certain weights smaller.
Smaller-valued weights lead to simpler hypotheses, which are the most generalizable.
Unregularized weights with several higher-order polynomials in the feature sets tend to overfit
the training set.
As the input training set size grows, the effect of regularization decreases, and the parameters
tend to increase in magnitude. This is appropriate because an excess of features relative to
training set examples leads to overfitting in the first place. Bigger data is the ultimate regularizer.
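In Keras, for example, these penalties are attached to a layer's weights via kernel regularizers; the layer size and lambda values below are illustrative assumptions.

from tensorflow.keras import layers, regularizers

l2_dense = layers.Dense(
    64,
    activation="relu",
    kernel_regularizer=regularizers.l2(1e-4),   # L2 penalty: shrinks weights toward zero
)
l1_dense = layers.Dense(
    64,
    activation="relu",
    kernel_regularizer=regularizers.l1(1e-5),   # L1 penalty: drives some weights exactly to zero
)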
Regularized autoencoders
There are other ways to constrain the reconstruction of an autoencoder than to impose a hidden
layer of smaller dimensions than the input. The regularized autoencoders use a loss function that
helps the model to have other properties besides copying input to the output. We can generally
find two types of regularized autoencoder: the denoising autoencoder and the sparse
autoencoder.
Denoising autoencoder
One way we can modify the autoencoder to learn useful features is by changing the inputs: we can add
random noise to the input and recover the original form by removing the noise from the input
data. This prevents the autoencoder from copying the data from input to output because the input
contains random noise. We ask it to subtract the noise and produce the meaningful underlying data.
This is called a denoising autoencoder.

In the above diagram, the first row contains original images. We can see in the second row that random
noise is added to the original images; this noise is called Gaussian noise. The input of the autoencoder
will not get the original images, but autoencoders are trained in such a way that they will remove noise
and generate the original images.

The only difference between implementing the denoising autoencoder and the normal
autoencoder is a change in input data. The rest of the implementation is the same for both the
autoencoders. Below is the difference between training the autoencoder.
Training simple autoencoder:
autoencoder.fit(x_train, x_train)

Training denoising autoencoder:


autoencoder.fit(x_train_noisy, x_train)
Simple as that, everything else is exactly the same. The input to the autoencoder is the noisy
image, and the expected target is the original noise-free one.
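One common way to create x_train_noisy is to add Gaussian noise and clip back to the valid pixel range; the noise level of 0.3 is an illustrative assumption.

import numpy as np

noise_factor = 0.3
x_train_noisy = x_train + noise_factor * np.random.normal(size=x_train.shape)
x_train_noisy = np.clip(x_train_noisy, 0.0, 1.0)    # keep pixel values in [0, 1]

# autoencoder.fit(x_train_noisy, x_train)           # the target is still the clean data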

Sparse autoencoders

Another way of regularizing the autoencoder is by using a sparsity constraint. In this way of
regularization, only a fraction of the nodes are allowed to do forward and backward propagation. These
nodes have non-zero values and are called active nodes.
To do so, we add a penalty term to the loss function, which allows only a fraction of the
nodes to be active. This forces the autoencoder to represent each input as a combination of a small number of
nodes and demands that it discover interesting structure in the data. This method is efficient even
if the code size is large, because only a small subset of the nodes will be active.

Denoising in Autoencoders:

Denoising Autoencoders are neural network models that remove noise from corrupted or
noisy data by learning to reconstruct the initial data from its noisy counterpart. We train the
model to minimize the disparity between the original and reconstructed data. We can stack
these autoencoders together to form deep networks, increasing their performance.
Additionally, this architecture can be tailored to handle a variety of data formats, including
images, audio, and text, and the noise can be customised, for example as salt-and-pepper or
Gaussian noise. As the DAE reconstructs the image, it effectively learns the input features,
leading to enhanced extraction of latent representations. It is important to highlight that the
Denoising Autoencoder reduces the likelihood of learning the identity function compared to
a regular autoencoder.

Learning Objectives

• An overview of denoising automatic encoders (DAEs) and their use in obtaining a low-

dimensional representation by reconstructing the original data from noisy types.

• We will also cover aspects of DAE architecture, including encoder and decoder components.
• Examining their performance can provide insight into their role in reconstructing the

original data from their noisy counterparts.

• Furthermore, we consider various applications of DAE such as denoising, compression,

feature extraction, and representation learning. As an illustrative example, we focus on the

DAE implementation for image denoising using the Keras dataset.


What are Denoising Autoencoders?

Denoising autoencoders are a specific type of neural network that enables unsupervised
learning of data representations or encodings. Their primary objective is to reconstruct the
original version of the input signal corrupted by noise. This capability proves valuable in
problems such as image recognition or fraud detection, where the goal is to recover the
original signal from its noisy form.

An autoencoder consists of two main components:

• Encoder: This component maps the input data into a low-dimensional representation or

encoding.

• Decoder: This component returns the encoding to the original data space.

During the training phase, the autoencoder is presented with a set of clean input examples along
with their corresponding noisy versions. The objective is to learn, using an encoder-decoder
architecture, a mapping that efficiently transforms noisy input into clean output.
Architecture of DAE

The denoising autoencoder (DAE) architecture is similar to a standard autoencoder. It

consists of two main components:

Encoder

• The encoder creates a neural network equipped with one or more hidden layers.

• Its purpose is to receive noisy input data and generate an encoding, which represents a low-
dimensional representation of the data.

• An encoder can be understood as a compression function, because the encoding has fewer
parameters than the input data.


Decoder

• Decoder acts as an expansion function, which is responsible for reconstructing the original

data from the compressed encoding.

• It takes as input the encoding generated by the encoder and reconstructs the original data.

• Like encoders, decoders are implemented as neural networks featuring one or more hidden

layers.


During the training phase, the denoising autoencoder (DAE) is presented with a collection of
clean input examples along with their respective noisy counterparts. The objective is to
acquire a function that maps a noisy input to a relatively clean output using an encoder-
decoder architecture. To achieve this, a reconstruction loss function is typically employed to
evaluate the disparity between the clean input and the reconstructed output. A DAE is
trained by minimizing this loss through the use of backpropagation, which involves
updating the weights of both the encoder and decoder components.

Applications of Denoising Autoencoders (DAEs) span a variety of domains, including

computer vision, speech processing, and natural language processing.

Examples

• Image Denoising: DAEs are effective in removing noise from images, such as Gaussian

noise or salt-and-pepper noise.

• Fraud Detection: DAEs can contribute to identifying fraudulent transactions by learning to

reconstruct common transactions from their noisy counterparts.

• Data Imputation: To reconstruct missing values from available data by learning, DAEs can

facilitate data imputation in datasets with incomplete information.

• Data Compression: DAEs can compress data by obtaining a concise representation of the

data in the encoding space.

• Anomaly Detection: Using DAEs, anomalies in a dataset can be detected by training a
model to reconstruct normal data and then flagging inputs that are difficult to reconstruct as
potentially abnormal.
Denoising Autoencoders solve this problem by corrupting the data on purpose
by randomly turning some of the input values to zero. In general, the
percentage of input nodes which are being set to zero is about 50%. Other
sources suggest a lower count, such as 30%. It depends on the amount of data
and input nodes you have.
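A sketch of this masking-style corruption with NumPy follows; the 50% corruption level mirrors the figure above, and x_train is assumed to be the training array used earlier.

import numpy as np

corruption_level = 0.5                                            # ~50% of the input values set to zero
mask = np.random.binomial(1, 1.0 - corruption_level, size=x_train.shape)
x_train_corrupted = x_train * mask

# The loss is still computed against the original x_train, not x_train_corrupted.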

When calculating the Loss function, it is important to compare the output


values with the original input, not with the corrupted input. That way, the risk
of learning the identity function instead of extracting features is eliminated.
Applications of Denoising Autoencoders

Denoising autoencoders are used in various applications, including:

• Feature Learning: They can learn to extract useful features that are invariant
to the type of corruption applied, which can be beneficial for tasks such as
classification or recognition.
• Data Preprocessing:

They can be used to preprocess noisy data for other machine learning algorithms,
effectively cleaning the data before it is used for training.
• Image Processing:

In computer vision, denoising autoencoders can be used for tasks such as image
denoising, inpainting, and super-resolution.

• Anomaly Detection: They can be used to detect anomalies by learning a normal


data distribution and identifying examples that do not conform to this
distribution.

Advantages and Limitations

Advantages:

• Denoising autoencoders can learn more robust features compared to standard


autoencoders.
• They can improve the performance of other machine learning models by
providing cleaner data.
• The approach is unsupervised, which means it does not require labeled data for
training.

Limitations:

• The choice of noise and its level can greatly affect the performance of the
model and may require careful tuning.
• Like other neural networks, denoising autoencoders can be computationally
intensive to train, especially for large datasets or complex architectures.
• While they can remove noise, they may also lose some detail or relevant
information in the data if not properly regularized.

Sparse Autoencoders :
Sparse Autoencoders are a variant of autoencoders, which are neural
networks trained to reconstruct their input data. However, unlike traditional
autoencoders, sparse autoencoders are designed to be sensitive to specific
types of high-level features in the data, while being insensitive to most other
features. This is achieved by imposing a sparsity constraint on the hidden units during training,
which forces the autoencoder to respond to unique statistical features of the dataset it is trained
on.
How do Sparse Autoencoders work?
Sparse Autoencoders consist of an encoder, a decoder, and a loss function. The encoder is used
to compress the input into a latent-space representation, and the decoder is used to reconstruct
the input from this representation. The sparsity constraint is typically enforced by adding a
penalty term to the loss function that encourages the activations of the hidden units to be sparse.

The sparsity constraint can be implemented in various ways, such as by using a sparsity penalty,
a sparsity regularizer, or a sparsity proportion. The sparsity penalty is a term added to the loss
function that penalizes the network for having non-sparse activations. The sparsity regularizer is
a function that encourages the network to have sparse activations. The sparsity proportion is a
hyperparameter that determines the desired level of sparsity in the activations.
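One common realisation of the sparsity penalty in Keras is an L1 activity regularizer on the code layer, as sketched below; the dimensions and penalty weight are illustrative assumptions.

from tensorflow.keras import layers, models, regularizers

input_dim, code_dim = 784, 128                     # the code layer may even be wider than a strict bottleneck

inp = layers.Input(shape=(input_dim,))
code = layers.Dense(
    code_dim,
    activation="relu",
    activity_regularizer=regularizers.l1(1e-5),    # penalizes non-sparse (large) activations
)(inp)
out = layers.Dense(input_dim, activation="sigmoid")(code)

sparse_autoencoder = models.Model(inp, out)
sparse_autoencoder.compile(optimizer="adam", loss="binary_crossentropy")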

Why are Sparse Autoencoders important?


Sparse Autoencoders are important because they can learn useful features from unlabeled data,
which can be used for tasks such as anomaly detection, denoising, and dimensionality reduction.
They are particularly useful when the dimensionality of the input data is high, as they can learn a
lower-dimensional representation that captures the most important features of the data.

Furthermore, Sparse Autoencoders can be used to pretrain deep neural networks. Pretraining a
deep neural network with a sparse autoencoder can help the network learn a good initial set of
weights, which can improve the performance of the network on a subsequent supervised
learning task.

Applications of Sparse Autoencoders


Sparse Autoencoders have been used in a variety of applications, including:

• Anomaly detection: Sparse autoencoders can be used to learn a normal representation of the data,
and then detect anomalies as data points that have a high reconstruction error.
• Denoising: Sparse autoencoders can be used to learn a clean representation of the data, and then
reconstruct the clean data from a noisy input.
• Dimensionality reduction: Sparse autoencoders can be used to learn a lower-dimensional
representation of the data, which can be used for visualization or to reduce the computational
complexity of subsequent tasks.
• Pretraining deep neural networks: Sparse autoencoders can be used to pretrain the weights of a
deep neural network, which can improve the performance of the network on a subsequent
supervised learning task.
• Sparse Autoencoders are one of the valuable types of Autoencoders. The idea behind
Sparse Autoencoders is that we can achieve an information bottleneck (same information
with fewer neurons) without reducing the number of neurons in the hidden layers. The
number of neurons in the hidden layer can be greater than the number in the input layer.
• We achieve this by imposing a sparsity constraint on the learning. According to the
sparsity constraint, only some percentage of nodes can be active in a hidden layer. The
neurons with output close to 1 are active, whereas the neurons with output close to 0 are
inactive neurons.
• More specifically, we penalize the loss function such that only a few neurons are active in
a layer. We force the autoencoder to represent the input information in fewer neurons by
reducing the number of active neurons. Also, we can increase the code size, because only a
few neurons in a layer are active at a time.

Contractive Autoencoder (CAE)


Contractive Autoencoders were proposed by researchers at the Université de Montréal
in 2011 in the paper "Contractive auto-encoders: Explicit invariance during feature
extraction". The idea behind them is to make autoencoders robust to small changes in
the training dataset.
To deal with the above challenge that is posed in basic autoencoders, the authors
proposed to add another penalty term to the loss function of the autoencoder. We will
discuss this loss function in detail.
The Loss function:
The contractive autoencoder adds an extra term to the loss function of the autoencoder. It is
given as:

L = ||x − x̂||² + λ ||J_f(x)||²_F

i.e., the penalty term is the squared Frobenius norm of the Jacobian of the
encoder; the Frobenius norm is just a generalization of the Euclidean norm to matrices.
For this penalty term, we first need to calculate the Jacobian matrix of the hidden
layer; calculating the Jacobian of the hidden layer with respect to the input is similar to a
gradient calculation.
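A hedged TensorFlow sketch of such a loss is shown below; the encoder model, the weight lam, and the use of batch_jacobian are illustrative choices, not the paper's exact code.

import tensorflow as tf

def contractive_loss(encoder, x, x_hat, lam=1e-4):
    # Reconstruction term plus the squared Frobenius norm of the encoder's Jacobian.
    x = tf.convert_to_tensor(x)
    with tf.GradientTape() as tape:
        tape.watch(x)
        h = encoder(x)                        # hidden representation f(x)
    J = tape.batch_jacobian(h, x)             # Jacobian of h w.r.t. x, one per example
    frob_sq = tf.reduce_sum(tf.square(J))     # squared Frobenius norm of the Jacobian
    recon = tf.reduce_mean(tf.square(x - x_hat))
    return recon + lam * frob_sq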
Relationship with Sparse Autoencoder
In sparse autoencoder, our goal is to have the majority of components of representation
close to 0, for this to happen, they must be lying in the left saturated part of the sigmoid
function, where their corresponding sigmoid value is close to 0 with a very small first
derivative, which in turn leads to the very small entries in the Jacobian matrix. This
leads to highly contractive mapping in the sparse autoencoder, even though this is not
the goal in sparse Autoencoder.
Relationship with Denoising Autoencoder
The idea behind denoising autoencoder is just to increase the robustness of the encoder
to the small changes in the training data which is quite similar to the motivation of
Contractive Autoencoder. However, there is some difference:
• CAEs encourage robustness of representation f(x), whereas DAEs encourage
robustness of reconstruction, which only partially increases the robustness of
representation.
• DAE increases its robustness by stochastically training the model for the
reconstruction, whereas CAE increases robustness analytically, by penalizing the
Frobenius norm of the encoder's Jacobian (its first derivatives).
What is a Contractive Autoencoder?

A Contractive Autoencoder (CAE) is a specific type of autoencoder used in


unsupervised machine learning. Autoencoders are neural networks designed to learn
efficient representations of the input data, called encodings, by training the network to
ignore insignificant data (“noise”). These encodings can then be used for tasks such as
dimensionality reduction, feature learning, and more.

The "contractive" aspect of CAEs comes from the fact that they are regularized to be
insensitive to slight variations in the input data. This is achieved by adding a penalty
to the loss function during training, which forces the model to learn a representation
that is robust to small changes or noise in the input. The penalty is typically the
Frobenius norm of the Jacobian matrix of the encoder activations with respect to the
input and encourages the learned representations to contract around the training data.

How Contractive Autoencoders Work

A Contractive Autoencoder consists of two main components: an encoder and a


decoder. The encoder compresses the input into a lower-dimensional representation,
and the decoder reconstructs the input from this representation. The goal is for the
reconstructed output to be as close as possible to the original input.

The training process involves minimizing a loss function that has two terms. The first
term is the reconstruction loss, which measures the difference between the original
input and the reconstructed output. The second term is the regularization term, which
measures the sensitivity of the encoded representations to the input. By penalizing the
sensitivity, the CAE learns to produce encodings that do not change much when the
input is perturbed slightly, leading to more robust features.

Applications of Contractive Autoencoders

Contractive Autoencoders have several applications in the field of machine learning


and artificial intelligence:

• Feature Learning: CAEs can learn to capture the most salient features in the
data, which can then be used for various downstream tasks such as
classification or clustering.
• Dimensionality Reduction:

Like other autoencoders, CAEs can reduce the dimensionality of data, which is
useful for visualization or as a preprocessing step for other algorithms that
perform poorly with high-dimensional data.

• Denoising: Due to their contractive property, CAEs can be used to remove


noise from data, as they learn to ignore small variations in the input.
• Data Generation: While not their primary application, autoencoders can
generate new data points by decoding samples from the learned encoding
space.

Advantages of Contractive Autoencoders

Contractive Autoencoders offer several advantages:

• Robustness to Noise: By design, CAEs are robust to small perturbations or


noise in the input data.
• Improved Generalization: The contractive penalty encourages the model to
learn more general features that do not depend on the specific noise or
variations present in the training data.
• Stability: The regularization term helps to stabilize the training process by
preventing the model from learning trivial or overfitted representations.

Challenges with Contractive Autoencoders

Despite their advantages, CAEs also present some challenges:

• Computational Complexity: Calculating the Jacobian matrix for the


contractive penalty can be computationally expensive, especially for large
neural networks.
• Hyperparameter Tuning:

The strength of the contractive penalty is controlled by a hyperparameter that


needs to be carefully tuned to balance the reconstruction loss and the
regularization term.

• Choice of Regularization: The effectiveness of the CAE can depend on the


choice of regularization term, and different problems may require different
forms of the contractive penalty.

Contractive Autoencoders
A contractive autoencoder is considered an unsupervised deep learning technique. It helps a
neural network to encode unlabeled training data. The idea behind that is to make the
autoencoders robust to small changes in the training dataset.
We use autoencoders to learn a representation, or encoding, for a set of unlabeled data. It is
usually the first step towards dimensionality reduction or generating new data models.
Contractive autoencoder targets to learn invariant representations to unimportant transformations
for the given data.
Working of Contractive Autoencoders
A contractive autoencoder is less sensitive to slight variations in the training dataset. We can
achieve this by adding a penalty term or regularizer to whatever cost or objective function the
algorithm is trying to minimize. The result reduces the learned representation's sensitivity
towards the training input. This regularizer is the Frobenius norm of the
Jacobian matrix of the encoder activations with respect to the input.
If this value is zero, we don't observe any change in the learned hidden representations as we
change input values. But if the value is huge, then the learned model is unstable as the input
values change.
We generally employ Contractive autoencoders as one of several other autoencoder nodes. It is
in active mode only when other encoding schemes fail to label a data point.
