Unit 3
Principal component analysis (PCA) is today one of the most popular multivariate statistical
techniques. It has been widely used in the areas of pattern recognition and signal
processing and is a statistical method under the broad title of factor analysis.
PCA forms the basis of multivariate data analysis based on projection methods. The
most important use of PCA is to represent a multivariate data table as a smaller set of
variables (summary indices) in order to observe trends, jumps, clusters and outliers.
This overview may uncover the relationships between observations and variables, and
among the variables.
PCA is a very flexible tool and allows analysis of datasets that may contain, for example,
multicollinearity, missing values, categorical data, and imprecise measurements. The goal is to
extract the important information from the data and to express this information as a set of
summary indices called principal components.
Statistically, PCA finds lines, planes and hyper-planes in the K-dimensional space that
approximate the data as well as possible in the least squares sense. A line or plane that is the least
squares approximation of a set of data points makes the variance of the coordinates on the line or
plane as large as possible.
PCA creates a visualization of data that minimizes residual variance in the least squares sense
and maximizes the variance of the projection coordinates.
In a previous article, we explained why pre-treating data for PCA is necessary. Now, let’s take a
look at how PCA works, using a geometrical approach.
Consider a matrix X with N rows (aka "observations") and K columns (aka "variables"). For this
matrix, we construct a variable space with as many dimensions as there are variables (see figure
below). Each variable represents one coordinate axis. For each variable, the length has been
standardized according to a scaling criterion, normally by scaling to unit variance. You can find
more details on scaling to unit variance in the previous blog post.
A K-dimensional variable space. For simplicity, only three variable axes are displayed.
The “length” of each coordinate axis has been standardized according to a specific criterion,
usually unit variance scaling.
In the next step, each observation (row) of the X-matrix is placed in the K-dimensional variable
space. Consequently, the rows in the data table form a swarm of points in this space.
The observations (rows) in the data matrix X can be understood as a swarm of points in the
variable space (K-space).
Mean centering
Next, mean-centering involves the subtraction of the variable averages from the data. The vector
of averages corresponds to a point in the K-space.
In the mean-centering procedure, you first compute the variable averages. This vector of
averages is interpretable as a point (here in red) in space. The point is situated in the middle of
the point swarm (at the center of gravity).
The subtraction of the averages from the data corresponds to a re-positioning of the coordinate
system, such that the average point now is the origin.
The mean-centering procedure corresponds to moving the origin of the coordinate system to
coincide with the average point (here in red).
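As a minimal sketch of these two pre-treatment steps, assuming a small illustrative NumPy array X standing in for the data table (the numbers are made up, not from the text):

import numpy as np

# Hypothetical data table: N = 5 observations (rows), K = 3 variables (columns)
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.1],
              [1.9, 2.2, 0.4],
              [3.1, 3.0, 1.2]])

# Mean-centering: subtract the variable averages (the "average point")
X_centered = X - X.mean(axis=0)

# Unit-variance scaling: divide each variable by its standard deviation
X_scaled = X_centered / X_centered.std(axis=0, ddof=1)

print(X_scaled.mean(axis=0))            # approximately 0 for every variable
print(X_scaled.std(axis=0, ddof=1))     # exactly 1 for every variable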
After mean-centering and scaling to unit variance, the data set is ready for computation of the
first summary index, the first principal component (PC1). This component is the line in the K-
dimensional variable space that best approximates the data in the least squares sense. This line
goes through the average point. Each observation (yellow dot) may now be projected onto this
line in order to get a coordinate value along the PC-line. This new coordinate value is also
known as the score.
The first principal component (PC1) is the line that best accounts for the shape of the point
swarm. It represents the maximum variance direction in the data. Each observation (yellow dot)
may be projected onto this line in order to get a coordinate value along the PC-line. This value
is known as a score.
Usually, one summary index or principal component is insufficient to model the systematic
variation of a data set. Thus, a second summary index – a second principal component (PC2) – is
calculated. The second PC is also represented by a line in the K-dimensional variable space,
which is orthogonal to the first PC. This line also passes through the average point, and improves
the approximation of the X-data as much as possible.
The second principal component (PC2) is oriented such that it reflects the second largest source
of variation in the data while being orthogonal to the first PC. PC2 also passes through the
average point.
Two principal components define a model plane
When two principal components have been derived, they together define a plane, a window into
the K-dimensional variable space. By projecting all the observations onto the low-dimensional
sub-space and plotting the results, it is possible to visualize the structure of the investigated data
set. The coordinate values of the observations on this plane are called scores, and hence the
plotting of such a projected configuration is known as a score plot.
Two PCs form a plane. This plane is a window into the multidimensional space, which can be
visualized graphically. Each observation may be projected onto this plane, giving a score for
each.
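A short sketch of how the PC1 and PC2 scores could then be computed, here with scikit-learn's PCA on a small random data table (the data and variable names are illustrative, not from the text):

import numpy as np
from sklearn.decomposition import PCA

# An illustrative pre-treated data table: 20 observations, 3 variables
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

pca = PCA(n_components=2)
scores = pca.fit_transform(X_scaled)      # N x 2 matrix of scores (t1, t2)

print(pca.explained_variance_ratio_)      # share of variance captured by PC1 and PC2
# A score plot is simply a scatter of scores[:, 0] (PC1) against scores[:, 1] (PC2)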
Given the singular values \sigma_i and the right-singular vectors v_i of a matrix A, the columns of U are computed as u_i = \frac{1}{\sigma_i} A v_i.
Applications
• Calculation of the Pseudo-inverse: The pseudo-inverse, or Moore-Penrose inverse, is the
generalization of the matrix inverse to matrices that may not be invertible (such as
low-rank matrices). If the matrix is invertible, its pseudo-inverse equals its ordinary
inverse, but the pseudo-inverse also exists for matrices that are not invertible.
It is denoted by A^{+}.
Suppose we need to calculate the pseudo-inverse of a matrix M whose SVD is
M = U W V^{T}
Multiply both sides by M^{-1}:
I = M^{-1} U W V^{T}
Multiply both sides on the right by V (using V^{T} V = I):
V = M^{-1} U W
Multiply on the right by W^{-1}; since W is the diagonal matrix of singular values, its
inverse is obtained by taking the reciprocal of each non-zero singular value:
V W^{-1} = M^{-1} U W W^{-1} = M^{-1} U
Finally, multiply on the right by U^{T} (using U U^{T} = I):
V W^{-1} U^{T} = M^{-1} U U^{T} = M^{-1}
The above equation gives the pseudo-inverse: M^{+} = V W^{-1} U^{T}, which remains well
defined (by inverting only the non-zero singular values) even when M^{-1} itself does not exist.
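A minimal NumPy sketch of this construction, using an illustrative rank-deficient matrix M (not from the text); note that only the non-zero singular values are inverted, which is what keeps the pseudo-inverse well defined for non-invertible matrices:

import numpy as np

M = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])       # rank 1, so M has no ordinary inverse

U, w, Vt = np.linalg.svd(M, full_matrices=False)   # M = U @ diag(w) @ Vt

# Invert only the non-zero singular values
w_inv = np.zeros_like(w)
w_inv[w > 1e-10] = 1.0 / w[w > 1e-10]

M_pinv = Vt.T @ np.diag(w_inv) @ U.T               # M+ = V W^{-1} U^T
print(np.allclose(M_pinv, np.linalg.pinv(M)))      # True: matches NumPy's built-in pinv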
Solving a set of homogeneous linear equations (Mx = b): if b = 0, calculate the SVD of M and
take any column of V (equivalently, any row of V^{T}) associated with a singular value in W
equal to 0; such a vector solves Mx = 0.
If b ≠ 0, multiply both sides by M^{-1} (or, when M is not invertible, by the pseudo-inverse
M^{+}): M^{-1} M x = M^{-1} b, so x = M^{-1} b.
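A hedged sketch of the homogeneous case, again with an illustrative singular matrix: the right-singular vector belonging to a (near-)zero singular value solves Mx = 0.

import numpy as np

M = np.array([[1.0, 2.0],
              [2.0, 4.0]])       # singular: the rows are linearly dependent

U, w, Vt = np.linalg.svd(M)
x = Vt[np.argmin(w)]             # row of V^T for the smallest singular value

print(w)                         # one singular value is (numerically) 0
print(M @ x)                     # approximately [0, 0], so x solves Mx = 0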
• More generally, we often need to represent a matrix in a form from which the most
important part, the part needed for further computations, can be extracted easily.
That is where Singular Value Decomposition (SVD) comes into play.
• Image Credits: https://commons.wikimedia.org/wiki/File:Eigenvectors.gif
PCA vs Autoencoder
• Although PCA is fundamentally a linear transformation, autoencoders can model
complicated non-linear relationships.
• Because PCA features are projections onto the orthogonal basis, they are
completely linearly uncorrelated. However, since autoencoded features are
only trained for correct reconstruction, they may have correlations.
• PCA is quicker and less expensive to compute than autoencoders.
• PCA is quite similar to a single-layer autoencoder with a linear activation
function (a small sketch of this comparison follows this list).
• Because of the large number of parameters, the autoencoder is prone to
overfitting. (However, regularization and proper planning might help to
prevent this).
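As a rough sketch of that comparison (illustrative code, not a definitive benchmark): a single-layer autoencoder with linear activations and a 2-unit bottleneck, trained with mean squared error, ends up with a reconstruction error close to that of 2-component PCA, although its features are not forced to be orthogonal or sorted by variance the way principal components are.

import numpy as np
from sklearn.decomposition import PCA
from tensorflow import keras

# Illustrative data that lies close to a 2-D subspace of a 10-D space
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
W = rng.normal(size=(2, 10))
X = (latent @ W + 0.05 * rng.normal(size=(500, 10))).astype("float32")
X = X - X.mean(axis=0)

# PCA reconstruction from 2 components
pca = PCA(n_components=2)
X_pca = pca.inverse_transform(pca.fit_transform(X))

# Single-layer autoencoder: linear activations, 2-unit bottleneck
autoencoder = keras.Sequential([
    keras.Input(shape=(10,)),
    keras.layers.Dense(2, activation="linear"),
    keras.layers.Dense(10, activation="linear"),
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=200, batch_size=32, verbose=0)
X_ae = autoencoder.predict(X, verbose=0)

print(np.mean((X - X_pca) ** 2))   # PCA reconstruction error
print(np.mean((X - X_ae) ** 2))    # autoencoder error approaches the PCA error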
What are Autoencoders?
Autoencoders are feed-forward neural networks where the input and the output are the same.
Autoencoders encode the image and then decode it to get the same image. The core idea of
autoencoders is that the middle layer must contain enough information to represent the input.
There are three important properties of autoencoders:
1. Data Specific: We can only use autoencoders for the data that it has previously been trained
on. For instance, to encode an MNIST digits image, we’ll have to use an autoencoder that
previously has been trained on the MNIST digits dataset.
2. Lossy: Information is lost while encoding and decoding the images using autoencoders, which
means that the reconstructed image will have some missing details compared to the original
image.
3. Unsupervised: Autoencoders belong to the unsupervised machine learning category because
we do not require explicit labels corresponding to the data; the data itself acts as input and
output.
Autoencoders are neural networks which are commonly used for feature
selection and extraction. However, when there are more nodes in the hidden
layer than there are inputs, the network risks learning the so-called
"identity function", also called the "null function", meaning that the output
simply equals the input, rendering the autoencoder useless.
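A minimal Keras sketch of such an autoencoder, with a hidden layer smaller than the input so it cannot simply learn the identity function (the layer sizes and the random stand-in data are illustrative; for MNIST, x_train would be the flattened digit images scaled to [0, 1]):

import numpy as np
from tensorflow import keras

# Stand-in for flattened 28x28 images scaled to [0, 1]
x_train = np.random.default_rng(0).random((1000, 784)).astype("float32")

autoencoder = keras.Sequential([
    keras.Input(shape=(784,)),
    keras.layers.Dense(32, activation="relu"),      # encoder: 784 -> 32 bottleneck
    keras.layers.Dense(784, activation="sigmoid"),  # decoder: 32 -> 784
])
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

# Input and target are the same: the network learns to reconstruct its input
autoencoder.fit(x_train, x_train, epochs=5, batch_size=64, verbose=0)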
Regularization of Autoencoders
Autoencoders are a variant of feed-forward neural networks that add an extra term to the
objective: the error of reconstructing the original input. After training, autoencoders are then
used as a normal feed-forward neural network to produce activations. This is an unsupervised
form of feature extraction, because the neural network learns its weights from the original input
alone rather than from external labels. Deep networks can use either RBMs or autoencoders as
building blocks for larger networks (a single network rarely uses both).
Use of autoencoders
Autoencoders are used to learn compressed representations of datasets. Commonly, we use them to
reduce the dimensionality of a dataset. The output of the autoencoder is a reconstruction of the
input data from its most efficient compressed form.
Autoencoders are similar to multilayer perceptron neural networks because, like multilayer
perceptrons, autoencoders have an input layer, some hidden layers, and an output layer. The key
difference between a multilayer perceptron network and an autoencoder is that the output layer
of an autoencoder has the same number of neurons as the input layer.
Regularization
Regularization counteracts the effect of out-of-control parameters by using different methods to
minimize parameter size over time.
In mathematical notation, we see regularization represented by the coefficient lambda,
controlling the trade-off between finding a good fit and keeping the value of certain feature
weights low as the exponents on features increase.
Regularization coefficients L1 and L2 help fight overfitting by making certain weights smaller.
Smaller-valued weights lead to simpler hypotheses, which are the most generalizable.
Unregularized weights with several higher-order polynomials in the feature sets tend to overfit
the training set.
As the input training set size grows, the effect of regularization decreases, and the parameters
tend to increase in magnitude. This is appropriate because an excess of features relative to
training set examples leads to overfitting in the first place. Bigger data is the ultimate regularizer.
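As a hedged sketch of where this lambda appears in code, here as the coefficient of an L2 kernel regularizer on a Keras layer (the value 1e-4 is an illustrative choice, not a recommendation):

from tensorflow import keras
from tensorflow.keras import regularizers

# lambda = 1e-4: larger values push the weights more strongly towards zero
layer = keras.layers.Dense(
    64,
    activation="relu",
    kernel_regularizer=regularizers.l2(1e-4),   # adds lambda * sum(w^2) to the loss
)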
Regularized autoencoders
There are other ways to constrain the reconstruction of an autoencoder than to impose a hidden
layer of smaller dimensions than the input. The regularized autoencoders use a loss function that
helps the model to have other properties besides copying input to the output. We can generally
find two types of regularized autoencoder: the denoising autoencoder and the sparse
autoencoder.
Denoising autoencoder
One way to make the autoencoder learn useful features is by changing its inputs: we add
random noise to the input data and ask the network to recover the original, noise-free form.
This prevents the autoencoder from simply copying the data from input to output, because the
input contains random noise; the network must subtract the noise and produce the meaningful
underlying data. This is called a denoising autoencoder.
In the above diagram, the first row contains original images. In the second row, random
(Gaussian) noise has been added to the original images. The autoencoder never receives the
original images as input, but it is trained in such a way that it removes the noise and
regenerates the original images.
The only difference between implementing the denoising autoencoder and the normal
autoencoder is a change in the input data; the rest of the implementation is the same for both
autoencoders. Below is the difference between training the two.
Training a simple autoencoder:
autoencoder.fit(x_train, x_train)
Training a denoising autoencoder:
autoencoder.fit(x_train_noisy, x_train)
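A hedged sketch of how x_train_noisy could be produced, following the common recipe of adding Gaussian noise to the clean images and clipping back to the valid pixel range (the noise factor of 0.5 is illustrative):

import numpy as np

noise_factor = 0.5
x_train_noisy = x_train + noise_factor * np.random.normal(size=x_train.shape)
x_train_noisy = np.clip(x_train_noisy, 0.0, 1.0)   # keep pixel values in [0, 1]
# The denoising autoencoder is then fit with the noisy input and the clean target.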
Sparse autoencoders
Another way of regularizing the autoencoder is by using a sparsity constraint. In this form of
regularization, only a fraction of the nodes are allowed to participate in forward and backward
propagation; these nodes have non-zero activations and are called active nodes.
To do so, we add a penalty term to the loss function that encourages only a small fraction of
the nodes to be active. This forces the autoencoder to represent each input as a combination of
a small number of nodes and pushes it to discover interesting structure in the data. The method
is effective even if the code size is large, because only a small subset of the nodes will be active.
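One common way to add such a penalty term in Keras is an L1 activity regularizer on the code layer, which pushes most activations towards zero. A minimal sketch (the layer sizes and the 1e-5 coefficient are illustrative):

from tensorflow import keras
from tensorflow.keras import regularizers

sparse_autoencoder = keras.Sequential([
    keras.Input(shape=(784,)),
    # The code layer may even be large: the L1 penalty keeps only a few nodes active
    keras.layers.Dense(256, activation="relu",
                       activity_regularizer=regularizers.l1(1e-5)),
    keras.layers.Dense(784, activation="sigmoid"),
])
sparse_autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
# Trained exactly like a plain autoencoder: sparse_autoencoder.fit(x_train, x_train, ...)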
Denoising in Autoencoders:
Denoising autoencoders are neural network models that remove noise from corrupted or
noisy data by learning to reconstruct the initial data from its noisy counterpart. We train the
model to minimize the disparity between the original and the reconstructed data. Denoising
autoencoders can also be stacked to form deeper networks, and the architecture can be tailored
to handle a variety of data formats, including images, audio, and text. The noise itself can also
be customised, for example salt-and-pepper or Gaussian noise. Because the DAE has to
reconstruct the image from a corrupted version, it effectively learns the input features, so a
denoising autoencoder reduces the likelihood of learning the identity function compared to
a regular autoencoder.
Learning Objectives
• An overview of denoising autoencoders (DAEs) and their use in obtaining a low-dimensional
representation of the data.
• We will also cover aspects of DAE architecture, including the encoder and decoder components.
• Examining their performance can provide insight into their role in reconstructing the original
signal from its noisy version.
Denoising autoencoders are a specific type of neural network that enables unsupervised learning
of representations: they learn to recover the original version of an input signal that has been
corrupted by noise. This capability proves valuable in problems such as image recognition or
fraud detection, where the goal is to recover the clean underlying signal.
• Encoder: This component maps the input data into a low-dimensional representation or
encoding.
• Decoder: This component returns the encoding to the original data space.
During the training phase, we present the autoencoder with a set of clean input examples along
with their corresponding noisy versions. The objective is to learn a mapping, using an encoder-
decoder architecture, that efficiently transforms noisy input into clean output.
Architecture of DAE
Encoder
• The encoder is a neural network equipped with one or more hidden layers.
• Its purpose is to receive noisy input data and generate an encoding, which is a low-
dimensional representation of the data.
Decoder
• The decoder acts as an expansion function, responsible for reconstructing the original data
from the encoding.
• It takes as input the encoding generated by the encoder and reconstructs the original data.
• Like encoders, decoders are implemented as neural networks featuring one or more hidden
layers.
During the training phase, we present the denoising autoencoder (DAE) with a collection of
clean input examples along with their respective noisy counterparts. The objective is to
acquire a function that maps a noisy input to a relatively clean output using an encoder-
decoder architecture. A loss function is used to evaluate the disparity between the clean input
and the reconstructed output, and the DAE is trained by minimizing this loss through
backpropagation, which propagates the reconstruction error backwards through the network and
updates the weights accordingly.
Examples
• Image Denoising: DAEs are effective in removing noise from images, such as Gaussian or
salt-and-pepper noise.
• Data Imputation: DAEs can reconstruct missing values from the available data by learning
the underlying structure of the data.
• Data Compression: DAEs can compress data by obtaining a concise representation of the
data in the encoding.
• Anomaly Detection: We can train the model to reconstruct normal data and then flag inputs
that it struggles to reconstruct as potentially abnormal.
Denoising autoencoders also address the identity-function problem by corrupting the data on
purpose, randomly setting some of the input values to zero. In general, the percentage of input
nodes set to zero is about 50%; other sources suggest a lower fraction, such as 30%. It depends
on the amount of data and the number of input nodes you have.
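A short sketch of this corruption step (the 50% rate is the figure quoted above; everything else, including the function name, is illustrative):

import numpy as np

def mask_inputs(x, drop_fraction=0.5, seed=0):
    """Randomly set a fraction of the input values to zero (masking noise)."""
    rng = np.random.default_rng(seed)
    mask = rng.random(x.shape) > drop_fraction   # keep ~(1 - drop_fraction) of the values
    return x * mask

# x_corrupted = mask_inputs(x_train, drop_fraction=0.5)
# The denoising autoencoder is then trained to map x_corrupted back to x_train.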
• Feature Learning: They can learn to extract useful features that are invariant
to the type of corruption applied, which can be beneficial for tasks such as
classification or recognition.
• Data Preprocessing:
They can be used to preprocess noisy data for other machine learning algorithms,
effectively cleaning the data before it is used for training.
• Image Processing:
In computer vision, denoising autoencoders can be used for tasks such as image
denoising, inpainting, and super-resolution.
Advantages:
Limitations:
• The choice of noise and its level can greatly affect the performance of the
model and may require careful tuning.
• Like other neural networks, denoising autoencoders can be computationally
intensive to train, especially for large datasets or complex architectures.
• While they can remove noise, they may also lose some detail or relevant
information in the data if not properly regularized.
Sparse Autoencoders:
Sparse Autoencoders are a variant of autoencoders, which are neural
networks trained to reconstruct their input data. However, unlike traditional
autoencoders, sparse autoencoders are designed to be sensitive to specific
types of high-level features in the data, while being insensitive to most other
features. This is achieved by imposing a sparsity constraint on the hidden units during training,
which forces the autoencoder to respond to unique statistical features of the dataset it is trained
on.
How do Sparse Autoencoders work?
Sparse Autoencoders consist of an encoder, a decoder, and a loss function. The encoder is used
to compress the input into a latent-space representation, and the decoder is used to reconstruct
the input from this representation. The sparsity constraint is typically enforced by adding a
penalty term to the loss function that encourages the activations of the hidden units to be sparse.
The sparsity constraint can be implemented in various ways, such as by using a sparsity penalty,
a sparsity regularizer, or a sparsity proportion. The sparsity penalty is a term added to the loss
function that penalizes the network for having non-sparse activations. The sparsity regularizer is
a function that encourages the network to have sparse activations. The sparsity proportion is a
hyperparameter that determines the desired level of sparsity in the activations.
Furthermore, Sparse Autoencoders can be used to pretrain deep neural networks. Pretraining a
deep neural network with a sparse autoencoder can help the network learn a good initial set of
weights, which can improve the performance of the network on a subsequent supervised
learning task.
• Anomaly detection: Sparse autoencoders can be used to learn a normal representation of the data,
and then detect anomalies as data points that have a high reconstruction error.
• Denoising: Sparse autoencoders can be used to learn a clean representation of the data, and then
reconstruct the clean data from a noisy input.
• Dimensionality reduction: Sparse autoencoders can be used to learn a lower-dimensional
representation of the data, which can be used for visualization or to reduce the computational
complexity of subsequent tasks.
• Pretraining deep neural networks: Sparse autoencoders can be used to pretrain the weights of a
deep neural network, which can improve the performance of the network on a subsequent
supervised learning task.
• Sparse Autoencoders are one of the valuable types of Autoencoders. The idea behind
Sparse Autoencoders is that we can achieve an information bottleneck (same information
with fewer neurons) without reducing the number of neurons in the hidden layers. The
number of neurons in the hidden layer can be greater than the number in the input layer.
• We achieve this by imposing a sparsity constraint on the learning. According to the
sparsity constraint, only some percentage of nodes can be active in a hidden layer. The
neurons with output close to 1 are active, whereas neurons with output close to 0 are inactive.
• More specifically, we penalize the loss function such that only a few neurons are active in
a layer. We force the autoencoder to represent the input information using only a few active
neurons, rather than by reducing the number of neurons. Also, we can even increase the code
size, because only a few neurons in the layer are active for any given input.
The "contractive" aspect of CAEs comes from the fact that they are regularized to be
insensitive to slight variations in the input data. This is achieved by adding a penalty
to the loss function during training, which forces the model to learn a representation
that is robust to small changes or noise in the input. The penalty is typically the
Frobenius norm of the Jacobian matrix of the encoder activations with respect to the
input and encourages the learned representations to contract around the training data.
The training process involves minimizing a loss function that has two terms. The first
term is the reconstruction loss, which measures the difference between the original
input and the reconstructed output. The second term is the regularization term, which
measures the sensitivity of the encoded representations to the input. By penalizing the
sensitivity, the CAE learns to produce encodings that do not change much when the
input is perturbed slightly, leading to more robust features.
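A hedged TensorFlow sketch of this two-term loss, approximating the (squared) Frobenius norm of the encoder's Jacobian with tf.GradientTape; the layer sizes and the lam coefficient are illustrative, and the training loop itself is not shown:

import tensorflow as tf
from tensorflow import keras

encoder = keras.Sequential([keras.Input(shape=(784,)),
                            keras.layers.Dense(32, activation="sigmoid")])
decoder = keras.Sequential([keras.Input(shape=(32,)),
                            keras.layers.Dense(784, activation="sigmoid")])

def contractive_loss(x, lam=1e-4):
    # x: a batch of inputs as a float32 tensor, e.g. tf.convert_to_tensor(x_batch)
    with tf.GradientTape() as tape:
        tape.watch(x)
        h = encoder(x)                              # encoded representation
    # Jacobian dh/dx for every sample: shape (batch, 32, 784)
    jacobian = tape.batch_jacobian(h, x)
    frob_penalty = tf.reduce_sum(tf.square(jacobian), axis=[1, 2])

    x_hat = decoder(h)                              # reconstruction
    recon = tf.reduce_sum(tf.square(x - x_hat), axis=1)
    return tf.reduce_mean(recon + lam * frob_penalty)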
• Feature Learning: CAEs can learn to capture the most salient features in the
data, which can then be used for various downstream tasks such as
classification or clustering.
• Dimensionality Reduction:
Like other autoencoders, CAEs can reduce the dimensionality of data, which is
useful for visualization or as a preprocessing step for other algorithms that
perform poorly with high-dimensional data.
Contractive Autoencoders
A contractive autoencoder is considered an unsupervised deep learning technique. It helps a
neural network encode unlabeled training data. The idea behind it is to make the
autoencoder robust to small changes in the training dataset.
We use autoencoders to learn a representation, or encoding, for a set of unlabeled data. It is
usually the first step towards dimensionality reduction or generating new data models.
Contractive autoencoder targets to learn invariant representations to unimportant transformations
for the given data.
Working of Contractive Autoencoders
A contractive autoencoder is less sensitive to slight variations in the training dataset. We can
achieve this by adding a penalty term or regularizer to whatever cost or objective function the
algorithm is trying to minimize. The result reduces the learned representation's sensitivity
towards the training input. This regularizer is the Frobenius norm of the Jacobian matrix of the
encoder activations with respect to the input.
If this value is zero, we don't observe any change in the learned hidden representations as we
change input values. But if the value is huge, then the learned model is unstable as the input
values change.
We generally employ contractive autoencoders as one of several autoencoder nodes; it is in
active mode only when other encoding schemes fail to label a data point.