Regularization in Deep Learning (1)
Dropout
What is Overfitting?
Simply put, when a model trains on sample data for an excessively long time or becomes too complex, it can start to learn noise, or irrelevant information, from the dataset. The model becomes "overfitted" and unable to generalize successfully to new data when it memorizes the noise and fits the training set too closely. A model cannot carry out the classification or prediction tasks it was designed for if it cannot generalize successfully to new data.
What is Regularization?
Regularization is a collection of strategies that helps prevent overfitting in neural networks and improve their accuracy when completely new data from the problem domain is fed into them. It does so by modifying the learning procedure slightly so that the model generalizes more successfully; the model then performs better on unobserved data as a result.
Why Regularization?
Through regularization, the input parameters with larger coefficients receive a "penalty", which ultimately reduces the variance of the model; in deep learning in particular, it is the nodes' weight matrices that are penalized. With regularization we obtain a more optimized and more accurate model.
When modeling the data, a low-bias and high-variance scenario is referred to as overfitting. To handle this, regularization techniques trade a little more bias for less variance. Effective regularization strikes the optimal balance between bias and variance, with the final result being a notable decrease in variance at the least possible cost to bias. This means low variance without substantially increasing the bias, under the assumption that smaller weights result in simpler models and help prevent overfitting.
Techniques of Regularization
Now that we have seen how regularization helps in making deep learning models better and more effective, let's shift our focus to the techniques that we can use for regularization in deep learning.
L1 Regularization
Essentially, the L1 regularizer searches for parameter vectors that minimize the norm of the parameter vector (the length of the vector). The main issue here is how to best optimize the parameters of a single neuron or a single-layer neural network when there are many features; even so, we benefit from the computational advantage of the sparse solutions that L1 produces. Here lambda is the regularization parameter, and we penalize the absolute value of the weights. Because many weights are driven exactly to zero, L1 regularization techniques come in very handy when we are trying to compress the model.
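As a sketch, assuming the standard L1 formulation, the penalized cost has the form:

$$ J(\mathbf{w}) \;=\; \mathrm{Loss}(\mathbf{w}) \;+\; \lambda \sum_{i} \lvert w_i \rvert $$

where λ is the regularization parameter and the w_i are the model weights.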
L2 Regularization
By limiting the coefficients while maintaining all the variables, L2 regularization helps deal with problems such as multicollinearity. The importance of the predictors may be estimated using L2 regression, and based on that, the unimportant predictors can be penalized.
Because L2 regularization leads the weights to decay towards zero (but not exactly zero), it is also known as weight decay. Lasso Regression is the term used if the L1 regularization method is used, and Ridge Regression is the term used if the L2 regularization method is employed.
The key difference between the two lies in the penalty term: L1 penalizes the weights in absolute terms. With this form of regularization, sparse models with few coefficients are produced; some coefficients can become zero and be dropped from the model. Coefficient values move closer to zero as the penalty grows, which makes the resulting models simpler to understand.
Apart from this, there are a few other points on which L1 regularization differs from L2:
1. L1 regularization adds the penalty term to the cost function by taking the absolute value of the weight parameters into account. On the other hand, L2 regularization adds the squared value of the weights to the cost function.
2. In order to avoid overfitting, L2 regularization makes estimates for the data mean, whereas L1 regularization estimates the median of the data. Moreover, L2 has a differentiable, closed-form solution, whereas L1, which is a non-differentiable function and includes an absolute value, does not. Due to this, L1 regularization is computationally more expensive.
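As a minimal sketch of how these penalties enter the objective, the NumPy snippet below adds an L1 or L2 term to an ordinary squared-error loss; the function and variable names are illustrative and not taken from any particular library.

```python
import numpy as np

def regularized_loss(w, X, y, lam=0.1, kind="l2"):
    """Squared-error data loss plus an L1 or L2 penalty scaled by lam (the lambda above)."""
    residual = X @ w - y
    data_loss = 0.5 * np.mean(residual ** 2)       # ordinary squared error
    if kind == "l1":
        penalty = lam * np.sum(np.abs(w))          # L1: sum of absolute weights
    else:
        penalty = lam * np.sum(w ** 2)             # L2: sum of squared weights
    return data_loss + penalty

# Toy usage with synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_w = np.array([1.0, 0.0, 0.0, 2.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=100)
w = rng.normal(size=5)
print(regularized_loss(w, X, y, kind="l1"))
print(regularized_loss(w, X, y, kind="l2"))
```

In practice, deep learning frameworks expose the same idea through options such as weight decay or kernel regularizers.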
Dropout Regularization
Dropout is a regularization method in which certain neurons are disregarded at random: they "drop out" at random. This means that their effect on the activation of downstream neurons is temporarily removed on the forward pass, and no weight updates are applied to those neurons on the backward pass. As a neural network learns, neuron weights settle into their context within the network; they become tuned for specific features, providing some specialization, and neighbouring neurons come to rely on that specialization, which, if it goes too far, might produce a fragile model that is overly dependent on the training data.
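A minimal sketch of (inverted) dropout applied to a layer's activations, assuming NumPy; all names are illustrative:

```python
import numpy as np

def dropout_forward(activations, drop_prob=0.5, training=True, rng=None):
    """Inverted dropout: randomly zero out units and rescale the survivors."""
    if not training or drop_prob == 0.0:
        return activations
    rng = rng if rng is not None else np.random.default_rng()
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob   # which neurons survive this pass
    return activations * mask / keep_prob              # rescale so the expected activation is unchanged

h = np.ones((2, 4))                                    # toy layer activations
print(dropout_forward(h, drop_prob=0.5))
```

At test time (training=False) the layer is left untouched, which is why the surviving activations are rescaled during training.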
Radial Basis Function Networks
A Radial Basis Function (RBF) network is a feed-forward artificial neural network with a single hidden layer that uses radial basis functions, widely used in mathematical modeling, as its activation functions.
The output of the RBF network is a linear combination of neuron parameters and radial basis functions of the inputs. This network is used in time series prediction, function approximation, classification, and system control.
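Assuming the usual formulation, the linear combination mentioned above can be written as:

$$ y(\mathbf{x}) \;=\; \sum_{i=1}^{N} w_i \, \varphi\!\left(\lVert \mathbf{x} - \mathbf{c}_i \rVert\right), \qquad \text{e.g. } \varphi(r) = \exp\!\left(-\frac{r^2}{2\sigma^2}\right) $$

where the c_i are the centres, the w_i are the output weights, and φ is a radial basis function such as the Gaussian.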
This theorem (Cover's theorem on the separability of patterns) justifies the use of a linear output layer in an RBF network. According to the theorem, when the transformation from the input space to the feature space is nonlinear and the dimensionality of the feature space is high compared to that of the input space, there is a high likelihood that a non-separable pattern classification task in the input space is transformed into a linearly separable one in the feature space.
The interpolation problem requires every input vector to be mapped exactly onto the corresponding target vector; it amounts to determining the real coefficients and the polynomial term of the interpolating function. The function is called a radial basis function if its value depends only on the distance of the input from a centre.
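For concreteness, under the standard statement of the interpolation problem, one seeks a function F of the form below that maps every input vector x_i exactly onto its target d_i:

$$ F(\mathbf{x}) \;=\; \sum_{j=1}^{N} w_j \, \varphi\!\left(\lVert \mathbf{x} - \mathbf{x}_j \rVert\right), \qquad F(\mathbf{x}_i) = d_i,\; i = 1,\dots,N $$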
The supervised part of the training procedure for the RBF network is concerned with determining suitable values for the weight connections between the hidden and the output unit layers. The learning of a neural network can be viewed as a hypersurface reconstruction problem, and this reconstruction is ill-posed for the following reasons: the training samples are sparse and noisy, so they do not determine the mapping uniquely.
Regularization theory-
Regularization theory stabilizes the reconstruction of the mapping function. It involves adding to the error function an extra term which is designed to penalize mappings which are not smooth. Instead of restricting the number of hidden units, this provides an alternative approach for preventing overfitting in RBF networks.
An RBF network can be seen as a special case of a regularization network. The unknown weights and the error variance are estimated by regularization, and the relation between the RBF network parameter and the regularization parameter is given as β = θα², where α² is the error variance.
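As a sketch, assuming the standard Tikhonov-style smoothness term alluded to above, the regularized error function takes the form:

$$ \mathcal{E}(F) \;=\; \frac{1}{2}\sum_{i=1}^{N}\bigl(d_i - F(\mathbf{x}_i)\bigr)^2 \;+\; \frac{\lambda}{2}\,\lVert \mathbf{D}F \rVert^2 $$

where the second term penalizes non-smooth mappings, D is a (problem-dependent) differential operator, and λ is the regularization parameter.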
RBFs are embedded in a two-layer neural network, where each hidden unit implements a radially activated function and the output units implement a weighted sum of the hidden unit outputs. While the mapping performed inside the RBF network is nonlinear, the output is often linear. Owing to their nonlinear approximation capabilities, RBF networks are able to model complex mappings that perceptron-based networks can only model by means of multiple hidden layers.
Dissimilarities-
An RBF network has a single hidden layer, whereas a multilayer perceptron can have one or more hidden layers.
The theory of kernel regression provides another viewpoint for the use of RBF networks: it estimates the regression function for noisy data using the kernel density estimation technique.
Learning strategies:
Common learning strategies are the orthogonal least squares (OLS) method and the hybrid learning method. In OLS the hidden neurons, i.e. the RBF centres, are selected one by one in a systematic way until an adequate network is constructed.
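A minimal NumPy sketch of the hybrid strategy: the centres are chosen in an unsupervised way (here simply sampled from the data; k-means is another common choice), and the output weights are then obtained by linear least squares. All names and hyperparameter values are illustrative.

```python
import numpy as np

def gaussian_rbf(X, centres, sigma):
    # Pairwise squared Euclidean distances between inputs and centres
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))                    # toy 1-D inputs
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=200)        # noisy target values

centres = X[rng.choice(len(X), size=15, replace=False)]  # unsupervised part: pick centres
Phi = gaussian_rbf(X, centres, sigma=0.8)                # hidden-layer outputs (200 x 15)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)              # supervised part: linear output weights
print("training MSE:", np.mean((Phi @ w - y) ** 2))
```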
Ans- Both networks are nonlinear layered feedforward networks and universal approximators, and they are used in similar application areas.
Ans- An RBF network has a single hidden layer, while MLPs can have one or more hidden layers.
ii) RBF employing a supervised procedure for selecting a fixed number of RBF centres.
Cover's theorem states that a complex pattern-classification problem cast nonlinearly into a high-dimensional space is more likely to be linearly separable than in a low-dimensional space. It tells us that we can map the input space to a high-dimensional feature space in which the classification task becomes easier.
Regularization Networks
https://www.youtube.com/watch?v=u1OpsSwOGe0
RBF vs MLP

| # | RBF | MLP |
| --- | --- | --- |
| 2 | In RBF the hidden-layer computation nodes differ from the output nodes. | MLP follows a common computational model for its hidden and output nodes. |
| 3 | In RBF the hidden layer is non-linear and the output layer is linear. | In MLP the hidden layer and output layer are typically both non-linear. |
| 4 | The argument of the RBF activation function computes the Euclidean norm between the input vector and the centre. | Each hidden unit computes the inner product of the input vector and its synaptic weight vector. |
|  | RBF constructs a local approximation to the nonlinear input-output mapping. | MLP constructs a global approximation to the nonlinear input-output mapping. |
| 7 | In RBFN, the hidden nodes operate differently, i.e. they have different models. | In MLP, the hidden nodes share a common model, though not necessarily the same activation function. |
| 8 | In an RBF network we take the difference of the input vector and the centre. | In an MLP network we take the product of the input vector and the weight vector. |
| 10 | RBFN has a faster training process. | MLP is slower in the training process. |
| 11 | RBFN is slower when used in practice. | MLP is faster when used in practice. |
SOM's architecture:
Self-organizing maps have two layers: the first one is the input layer and the second one is the output layer, also called the map. Training a SOM involves three processes:
• Competition
• Cooperation
• Adaptation
Let’s explain those processes.
1) Competition:
In the example below, each neuron of the output layer holds a weight vector of the same dimension as the input data.
We compute the distance between each neuron (neuron from the output layer) and the input data, and the neuron with the lowest distance will be the winner.
2) Cooperation:
We will update the vector of the winner neuron in the final process (adaptation), but it is not the only one; its neighbours will be updated too. How much a neighbour is updated depends on two factors: time (incremented with each new input data point) and the distance between the winner neuron and the other neuron (how far the neuron is from the winner).
The image below shows how the winner neuron's neighbourhood (the greenest one) shrinks according to those two factors.
Time and distance factors
3) Adaptation:
After choosing the winner neuron and its neighbours, we compute the neuron updates. The chosen neurons are all updated, but not by the same amount: the greater the distance between a neuron and the input data, the less we adjust it.
The winner neuron and its neighbours will be updated using the update formula below.
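Assuming the standard Kohonen update rule, one common form of this formula is:

$$ w_i(t+1) \;=\; w_i(t) \;+\; h_{ci}(t)\,\alpha(t)\,\bigl[x(t) - w_i(t)\bigr] $$

where α(t) is the learning rate, h_{ci}(t) is the neighbourhood function centred on the winner c, and x(t) is the current input.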
The learning rate indicates how much we want to adjust the weights. As time t grows (towards infinity), this learning rate converges to zero, so the updates eventually become negligible. The neighbourhood function h(t) depends on the distance d between the winner neuron and the other neuron (they are inversely related: as d increases, h(t) decreases) and on the neighbourhood size, which itself depends on time (it shrinks as training progresses).
A self-organizing map produces a low-dimensional, discretized representation of the input space of the training samples. This representation is known as a map.
In this article, we will be going through a Beginner’s guide to a popular Self Organizing Map
- The Kohonen Map. We will start with understanding what Self-Organizing Maps are.
A self-organizing map, often known as a Kohonen map, is a sort of artificial neural network that follows an unsupervised learning methodology and uses a competitive learning algorithm to train its network. To reduce complex issues to a form that allows straightforward interpretation, SOM is utilized for mapping and clustering (or dimensionality reduction) procedures that map multidimensional data onto lower-dimensional spaces. The output layer and the input layer are the two layers that make up a SOM.
Now that we have discussed what SOMs are, let us discuss how Kohonen Maps work.
Consider an input set with the dimensions (m, n), where m represents the number of training examples and n represents the number of features present in each example. The weights, of size (n, C), where C is the number of clusters, are first initialized. We then iterate over the input data, and for each training example the winning vector (the weight vector with the shortest distance from the training example, for example the Euclidean distance) is updated. The weight update rule is given below.
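Using the notation defined just below, a standard form of this competitive update rule is:

$$ w_{ij} \;\leftarrow\; w_{ij} \;+\; \alpha(t)\,\bigl(x_i^{k} - w_{ij}\bigr) $$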
Here, i stands for the ith feature of the training example, j is the winning vector, alpha is the learning rate at time t, and k is the kth training example of the input data. After the SOM network is trained, the trained weights are used to cluster new examples: a new example falls into the cluster of its winning vector.
Algorithm
• Step 3: For each node on the map, repeat steps 4 and 5.
• Step 4: Compute the Euclidean distance between the input vector x(t) and the weight vector wij connected to the node.
• Step 5: Track the node that produces the smallest distance.
• Step 6: Determine the overall Best Matching Unit (BMU), i.e. the node with the smallest distance among all those calculated.
• Step 7: Find the topological neighborhood of the BMU and its radius in the Kohonen Map.
Note: Steps 2 through 9 represent the training phase, whereas step 1 represents the initiation
phase.
Here,
X → input vector
t → current iteration
w → weight vector
β_ij → the neighborhood function, which represents the distance between node (i, j) and the BMU
σ(t) → the radius of the neighborhood function, which determines how far neighbor nodes in the 2D grid are considered when updating
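The snippet below is a minimal NumPy sketch of the three phases (competition, cooperation, adaptation) described above; the grid size, decay schedules, and iteration count are illustrative choices rather than values given in the text.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.random((500, 3))                      # 500 samples with 3 features (e.g. RGB colours)
grid_h, grid_w = 10, 10
W = rng.random((grid_h, grid_w, 3))           # one weight vector per output node

# Grid coordinates of every node, used by the neighbourhood function
coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij"), axis=-1)

n_iter = 2000
alpha0 = 0.5                                   # initial learning rate
sigma0 = max(grid_h, grid_w) / 2.0             # initial neighbourhood radius
tau = n_iter / np.log(sigma0)

for t in range(n_iter):
    x = X[rng.integers(len(X))]                               # pick a random input
    # Competition: the best matching unit (BMU) is the node closest to x
    dists = np.linalg.norm(W - x, axis=-1)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)
    # Cooperation: Gaussian neighbourhood around the BMU, shrinking over time
    sigma = sigma0 * np.exp(-t / tau)
    alpha = alpha0 * np.exp(-t / n_iter)
    grid_d2 = ((coords - np.array(bmu)) ** 2).sum(axis=-1)
    h = np.exp(-grid_d2 / (2.0 * sigma ** 2))
    # Adaptation: move every node towards x, scaled by h and the learning rate
    W += alpha * h[..., None] * (x - W)

print("trained weight grid shape:", W.shape)
```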
Self-Organizing Maps for Vector Quantization
SOMs are a powerful technique for vector quantization, with several typical applications:
1. Image compression: SOMs can reduce the number of colors used in an image while maintaining its overall appearance.
2. Data clustering: SOMs can be used to group similar data points together, making it easier to identify patterns and trends in large datasets.
3. Feature extraction: SOMs can extract representative features from complex data, such as images or audio signals. These features can then be used for further analysis or modeling.
Company case studies also demonstrate the use of SOMs for vector quantization in practice.
Self-Organizing Maps represent high-dimensional data in a lower-dimensional space. They use unsupervised learning to create a grid of nodes, where each node represents a prototype vector. During the training process, the algorithm adjusts the prototype vectors to better represent the input data. The result is a compressed representation of the data, where similar data points are grouped together in the lower-dimensional space.
What are the advantages of using Self-Organizing Maps for vector quantization?
The advantages of using Self-Organizing Maps for vector quantization include:
1. Data compression: SOMs can significantly reduce the size of data by approximating it with a smaller set of representative vectors, making it more manageable and efficient to process.
2. Visualization: By representing high-dimensional data in a lower-dimensional space, SOMs make it easier to visualize complex data patterns and relationships.
3. Unsupervised learning: SOMs do not require labeled data for training, making them suitable for applications where labeled data is scarce or expensive to obtain.
4. Robustness: SOMs are less sensitive to noise and outliers in the data, making them more robust in real-world applications.
5. Adaptability: SOMs can be easily adapted to different types of data and problems, making them a versatile tool.
What are the challenges in using Self-Organizing Maps for vector quantization?
The challenges include:
1. Computational complexity: The training process for SOMs can be computationally intensive, especially for large datasets and high-dimensional data.
2. Parameter selection: Choosing appropriate parameters, such as the size of the map and the learning rate, can significantly impact the performance of the SOM.
3. Lack of a global optimum: SOMs do not guarantee convergence to a globally optimal solution.
4. Interpretability: While SOMs can provide a visual representation of the data, interpreting the map and extracting precise conclusions from it can still be difficult.
Image compression using Self-Organizing Maps works by reducing the number of colors used in the image while maintaining its overall appearance. During the training process, the SOM learns a set of representative colors (prototype vectors) from the input image. The original colors in the image are then replaced with the closest representative colors from the trained SOM. This results in a compressed image with a smaller color palette, leading to a reduced file size.
Yes, there are several alternatives to Self-Organizing Maps for vector quantization, including:
1. K-means clustering: A clustering algorithm that partitions data into K clusters, where each cluster is represented by a centroid.
2. Principal Component Analysis (PCA): A linear dimensionality reduction technique that projects data onto a lower-dimensional subspace.
3. Quantization using Lattice Quantizers: A method that uses a predefined lattice structure to quantize data points.
4. Autoencoders: A type of neural network that learns to compress and reconstruct input data, often used for dimensionality reduction and feature extraction.
Each of these alternatives has its own strengths and weaknesses, and the choice of method depends on the specific problem at hand.
Principal Component Analysis (PCA)
Principal Component Analysis is an unsupervised learning algorithm that is used for dimensionality reduction in machine learning. It is a statistical process that converts the observations of correlated features into a set of linearly uncorrelated features with the help of an orthogonal transformation. These new transformed features are called the Principal Components. It is one of the popular tools used for exploratory data analysis and predictive modeling. It is a technique for drawing strong patterns from the given dataset by reducing the variances.
PCA generally tries to find the lower-dimensional surface onto which to project the high-dimensional data. PCA works by considering the variance of each attribute, because an attribute with high variance shows a good split between the classes, and hence it reduces the dimensionality. Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing the power allocation in various communication channels.
o Correlation: It signifies how strongly two variables are related to each other, such that if one changes, the other also changes. The correlation value
ranges from -1 to +1. Here, -1 occurs if variables are inversely proportional to each
other, and +1 indicates that variables are directly proportional to each other.
o Orthogonal: It defines that variables are not correlated to each other, and hence the correlation between the pair of variables is zero.
As described above, the transformed new features or the output of PCA are the Principal Components. The number of these PCs is either equal to or less than the number of original features present in the dataset. Some properties of these principal components are given below:
o The principal component must be the linear combination of the original features.
o These components are orthogonal, i.e., the correlation between a pair of variables is
zero.
o The importance of each component decreases when going from 1 to n; the first PC has the most importance, and the nth PC will have the least importance.
Firstly, we need to take the input dataset and divide it into two subparts X and Y, where X is the training set and Y is the validation set.
Now we will represent our dataset in a structure: a two-dimensional matrix of the independent variables X. Here each row corresponds to a data item, and each column corresponds to a feature. The number of columns gives the dimensionality of the dataset.
In this step, we will standardize our dataset. For example, in a particular column, the features with high variance are more important compared to the features with lower variance. If the importance of features is independent of the variance of the feature, then we will divide each data item in a column by the standard deviation of the column. We will name the resulting matrix Z.
4. Calculating the Covariance of Z
To calculate the covariance of Z, we will take the matrix Z and transpose it. After transposing, we will multiply it by Z; the output will be the covariance matrix of Z.
Now we need to calculate the eigenvalues and eigenvectors of the resulting covariance matrix. The eigenvectors of the covariance matrix are the directions of the axes with high information, and the coefficients of these eigenvectors are defined as the eigenvalues.
In this step, we will take all the eigenvalues and sort them in decreasing order, i.e. from largest to smallest, and simultaneously sort the eigenvectors accordingly in the matrix P of eigenvalues. The resulting matrix will be named P*.
Here we will calculate the new features. To do this, we will multiply the P* matrix by Z. In the resulting matrix Z*, each observation is a linear combination of the original features. Each column of the Z* matrix is independent of the others.
The new feature set is obtained, so we will decide here what to keep and what to remove. That means we will only keep the relevant or important features in the new dataset, and unimportant features will be removed.
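A minimal NumPy sketch of the steps just described (standardize, covariance, eigen-decomposition, sorting, projection); matrix names follow the Z, P*, Z* notation above, while the data and the number of retained components are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # rows = data items, columns = features

Z = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize each column
cov = (Z.T @ Z) / (len(Z) - 1)                 # covariance matrix of Z
eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues/eigenvectors of the symmetric matrix

order = np.argsort(eigvals)[::-1]              # sort from largest to smallest eigenvalue
eigvals = eigvals[order]
P_star = eigvecs[:, order][:, :2]              # keep the 2 most important components (illustrative)
Z_star = Z @ P_star                            # new features: projections onto the principal components

print("explained variance ratio:", eigvals[:2] / eigvals.sum())
```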
o It can also be used for finding hidden patterns if data has high dimensions. Some
fields where PCA is used are Finance, data mining, Psychology, etc.
The reason why it is critical to perform standardization prior to PCA is that the latter is quite sensitive to the variances of the initial variables. That is, if there are large differences between the ranges of the initial variables, those variables with larger ranges will dominate over those with small ranges (for example, a variable that ranges between 0 and 100 will dominate over a variable that ranges between 0 and 1), which will lead to biased results. So, transforming the data to comparable scales can prevent this problem.
Mathematically, this can be done by subtracting the mean and dividing by the standard deviation for each value of each variable. Once the standardization is done, all the variables will be transformed to the same scale.
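In symbols, each value x of a variable is replaced by its z-score:

$$ z \;=\; \frac{x - \mu}{\sigma} $$

where μ is the mean and σ is the standard deviation of that variable.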
The aim of this step is to understand how the variables of the input data set are varying from the mean with respect to each other, or in other words, to see if there is any relationship between them. Sometimes variables are highly correlated in such a way that they contain redundant information. So, in order to identify these correlations, we compute the covariance matrix: a symmetric matrix (with as many rows and columns as there are dimensions) that has as entries the covariances associated with all possible pairs of the initial variables. For example, for a 3-dimensional data set with 3 variables x, y, and z, the covariance matrix is a 3×3 matrix.
Covariance Matrix for 3-Dimensional Data
Since the covariance of a variable with itself is its variance (Cov(a,a) = Var(a)), in the main diagonal (top left to bottom right) we actually have the variances of each initial variable. And since the covariance is commutative (Cov(a,b) = Cov(b,a)), the entries of the covariance matrix are symmetric with respect to the main diagonal, which means that the upper and the lower triangular portions are equal.
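For the 3-dimensional example with variables x, y, z, the covariance matrix therefore looks like:

$$ \Sigma \;=\; \begin{pmatrix} \mathrm{Var}(x) & \mathrm{Cov}(x,y) & \mathrm{Cov}(x,z) \\ \mathrm{Cov}(y,x) & \mathrm{Var}(y) & \mathrm{Cov}(y,z) \\ \mathrm{Cov}(z,x) & \mathrm{Cov}(z,y) & \mathrm{Var}(z) \end{pmatrix} $$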
What do the covariances that we have as entries of the matrix tell us about the relationships between the variables? It is actually the sign of the covariance that matters: if it is positive, the two variables increase or decrease together (they are correlated); if it is negative, one increases when the other decreases (they are inversely correlated).
Now that we know that the covariance matrix is no more than a table that summarizes the correlations between all the possible pairs of variables, let's move on to the next step.
Step 3: Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal
components
Eigenvectors and eigenvalues are the linear algebra concepts that we need to compute from the covariance matrix in order to determine the principal components of the data.
What you first need to know about eigenvectors and eigenvalues is that they
always come in pairs, so that every eigenvector has an eigenvalue. Also, their
number is equal to the number of dimensions of the data. For example, for a 3-
dimensional data set, there are 3 variables, therefore there are 3 eigenvectors with 3
corresponding eigenvalues.
It is eigenvectors and eigenvalues that are behind all the magic of principal components: the eigenvectors of the covariance matrix are actually the directions of the axes where there is the most variance (most information), and these are what we call Principal Components. Eigenvalues are simply the coefficients attached to the eigenvectors, giving the amount of variance carried by each Principal Component.
Let's suppose that our data set is 2-dimensional with 2 variables x, y, and that the eigenvectors and eigenvalues of its covariance matrix are v1 with eigenvalue λ1 and v2 with eigenvalue λ2.
If we rank the eigenvalues in descending order, we get λ1 > λ2, which means that the eigenvector that corresponds to the first principal component (PC1) is v1 and the one that corresponds to the second principal component (PC2) is v2.
Dividing each eigenvalue by the sum of the eigenvalues gives the percentage of variance (information) carried by each component; in this example, we find that PC1 and PC2 carry respectively 96 percent and 4 percent of the variance of the data.
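As a worked illustration with hypothetical eigenvalues (not taken from the source), suppose λ1 = 1.20 and λ2 = 0.05; then:

$$ \frac{\lambda_1}{\lambda_1 + \lambda_2} = \frac{1.20}{1.25} = 0.96 = 96\%, \qquad \frac{\lambda_2}{\lambda_1 + \lambda_2} = \frac{0.05}{1.25} = 0.04 = 4\% $$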
Step 4: Create a Feature Vector
As we saw in the previous step, computing the eigenvectors and ordering them by their eigenvalues in descending order allows us to find the principal components in order of significance. In this step, what we do is choose whether to keep all these components or discard those of lesser significance (of low eigenvalues), and form with the remaining ones a matrix of vectors that we call the feature vector.
So, the feature vector is simply a matrix that has as columns the eigenvectors of the components that we decide to keep. This makes it the first step towards dimensionality reduction, because if we choose to keep only p eigenvectors (components) out of n, the final data set will have only p dimensions.
Continuing with the example from the previous step, we can either form a feature vector with both eigenvectors v1 and v2, or discard the eigenvector v2, which is the one of lesser significance, and form a feature vector with v1 only. Discarding v2 will reduce the dimensionality by 1 and will consequently cause a loss of information in the final data set. But given that v2 was carrying only 4 percent of the information, the loss will therefore not be important, and we will still have the 96 percent of the information that is carried by v1.
So, as we saw in the example, it's up to you to choose whether to keep all the components or discard the ones of lesser significance, depending on what you are looking for. If you just want to describe your data in terms of new, uncorrelated variables (the principal components) without seeking to reduce dimensionality, there is no need to leave out the less significant components.
In the previous steps, apart from standardization, you do not make any changes to the data; you just select the principal components and form the feature vector, but the input data set always remains in terms of the original axes (i.e., in terms of the initial variables).
In this step, which is the last one, the aim is to use the feature vector formed from the eigenvectors of the covariance matrix to reorient the data from the original axes to the ones represented by the principal components (hence the name Principal Component Analysis). This can be done by multiplying the transpose of the standardized original data set by the transpose of the feature vector.
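In matrix form, this last step is usually written as:

$$ \text{FinalDataSet} \;=\; \text{FeatureVector}^{T} \times \text{StandardizedOriginalDataSet}^{T} $$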