Unit II
Regularization for Deep Learning: Parameter Norm Penalties, Norm Penalties as Constrained
Optimization, Regularization and Under-Constrained Problems, Dataset Augmentation, Noise
Robustness, Semi-Supervised Learning, Multi-Task Learning, Early Stopping, Parameter Tying and
Parameter Sharing, Sparse Representations, Bagging and Other Ensemble Methods, Dropout,
Adversarial Training, Tangent Distance, Tangent Prop, and Manifold Tangent Classifier.
Many strategies used in machine learning are explicitly designed to reduce the test error,
possibly at the expense of increased training error. These strategies are known collectively as
regularization.
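Parameter Norm Penalties
Many regularization approaches limit the capacity of a model by adding a parameter norm penalty
Ω(θ) to the objective function J. The regularized objective function is:
$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha \Omega(\theta)$
where α ∈ [0, ∞) is a hyperparameter that weights the contribution of the norm penalty term, Ω,
relative to the standard objective function J(θ; X, y). Setting α to 0 results in no
regularization. Larger values of α correspond to more regularization.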
L2 Parameter Regularization
The L2 parameter norm penalty is commonly known as weight decay. This regularization strategy
drives the weights closer to the origin by adding a regularization term
$\Omega(\theta) = \frac{1}{2}\|w\|_2^2$ to the
objective function. In other academic communities, L2 regularization is also known as ridge
regression or Tikhonov regularization.
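To simplify the presentation, we assume no bias parameter, so θ is just w. Such a model has the
following total objective function:
$\tilde{J}(w; X, y) = \frac{\alpha}{2} w^\top w + J(w; X, y)$
with the corresponding parameter gradient:
$\nabla_w \tilde{J}(w; X, y) = \alpha w + \nabla_w J(w; X, y)$
To take a single gradient step to update the weights, we perform this update, where ε is the
learning rate:
$w \leftarrow w - \epsilon (\alpha w + \nabla_w J(w; X, y))$
Written another way, the update is:
$w \leftarrow (1 - \epsilon \alpha) w - \epsilon \nabla_w J(w; X, y)$
Adding the weight decay term therefore shrinks the weight vector by a constant factor on each
step, just before the usual gradient update is applied.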
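The two forms of the weight decay update can be checked numerically. The following is a minimal sketch, assuming a mean squared error loss for J and hypothetical helper names (grad_J, step_additive, step_shrink); eps stands for the learning rate and alpha for the penalty weight. None of these names come from the original notes.

```python
import numpy as np

def grad_J(w, X, y):
    """Gradient of the unregularized loss J = 0.5 * mean((Xw - y)^2) (assumed loss)."""
    return X.T @ (X @ w - y) / len(y)

def step_additive(w, X, y, alpha, eps):
    # First form: w <- w - eps * (alpha * w + grad J)
    return w - eps * (alpha * w + grad_J(w, X, y))

def step_shrink(w, X, y, alpha, eps):
    # Second form: w <- (1 - eps * alpha) * w - eps * grad J
    return (1.0 - eps * alpha) * w - eps * grad_J(w, X, y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)

w_a = step_additive(w, X, y, alpha=0.1, eps=0.05)
w_b = step_shrink(w, X, y, alpha=0.1, eps=0.05)
print(np.allclose(w_a, w_b))
```

Both functions compute the same step; the second form simply makes the multiplicative shrinkage of the weights explicit.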
L1 Regularization
While L2 weight decay is the most common form of weight decay, there are other ways to
penalize the size of the model parameters. Another option is to use L1 regularization.
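For reference, the L1 penalty is the sum of the absolute values of the weights, Ω(θ) = ||w||_1 = Σ_i |w_i|. The sketch below contrasts its (sub)gradient with the L2 gradient; the function names and the example vector are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def l1_penalty(w):
    # Omega(theta) = ||w||_1 = sum_i |w_i|
    return np.sum(np.abs(w))

def l1_subgradient(w):
    # sign(w): constant magnitude for nonzero weights, 0 is used at w_i = 0
    return np.sign(w)

def l2_gradient(w):
    # Gradient of 0.5 * ||w||_2^2 is w itself: it scales with the weight
    return w

w = np.array([2.0, -0.5, 0.0, 0.01])
print(l1_penalty(w))
print(l1_subgradient(w))
print(l2_gradient(w))
```

The L1 term pushes every nonzero weight toward zero with the same strength regardless of its size, whereas the L2 term pushes small weights only weakly.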
Dataset Augmentation
The best way to make a machine learning model generalize better is to train it on more data. Of
course, in practice, the amount of data we have is limited. One way to get around this problem is
to create fake data and add it to the training set.
For some machine learning tasks, it is reasonably straightforward to create new fake data.
Dataset augmentation has been a particularly effective technique for a specific classification
problem: object recognition. Images are high dimensional and include an enormous variety of
factors of variation, many of which can be easily simulated. Operations like translating the
training images a few pixels in each direction can often greatly improve generalization, even if
the model has already been designed to be partially translation invariant by using the convolution
and pooling techniques. Many other operations such as rotating the image or scaling the image
have also proven quite effective.
One must be careful not to apply transformations that would change the correct class. For
example, optical character recognition tasks require recognizing the difference between ‘b’ and
‘d’ and the difference between ‘6’ and ‘9’, so horizontal flips and 180° rotations are not
appropriate ways of augmenting datasets for these tasks.
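Translation by a few pixels can be sketched as follows. This is a minimal illustration, assuming 2-D grayscale images stored as NumPy arrays and zero-filling of vacated pixels; translate and augment_with_shifts are hypothetical helper names.

```python
import numpy as np

def translate(image, dy, dx):
    """Shift a 2-D image by (dy, dx) pixels; vacated pixels are filled with zeros."""
    h, w = image.shape
    out = np.zeros_like(image)
    if abs(dy) >= h or abs(dx) >= w:
        return out  # the whole image has been shifted out of the frame
    out[max(0, dy):h + min(0, dy), max(0, dx):w + min(0, dx)] = \
        image[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    return out

def augment_with_shifts(image, label, max_shift=1):
    """Return shifted copies of one example; the label is left unchanged."""
    shifts = range(-max_shift, max_shift + 1)
    return [(translate(image, dy, dx), label) for dy in shifts for dx in shifts]

image = np.arange(16, dtype=float).reshape(4, 4)
augmented = augment_with_shifts(image, label=7)
print(len(augmented))
```

The label is copied unchanged, which is only valid for transformations that preserve the class.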
Noise Robustness
Applying noise to the inputs can be viewed as a dataset augmentation strategy. For some models, the
addition of noise with infinitesimal variance at the input of the model is equivalent to imposing a
penalty on the norm of the weights. In the general case, it is important to remember that noise
injection can be much more powerful than simply shrinking the parameters, especially when the
noise is added to the hidden units.
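One model for which the link between input noise and a weight norm penalty can be checked directly is linear regression with squared error. For zero-mean Gaussian input noise with variance σ², the expected noisy loss equals the clean loss plus σ²||w||². The sketch below verifies this by Monte Carlo; the specific numbers are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([0.5, -1.0, 2.0])   # weights of a linear model (assumed)
x = np.array([1.0, 2.0, 3.0])    # one input example (assumed)
y = 1.0                          # its target (assumed)
sigma = 0.1                      # standard deviation of the input noise

# Average squared error over many noisy copies of the same input
noise = rng.normal(0.0, sigma, size=(200_000, 3))
noisy_loss = np.mean(((x + noise) @ w - y) ** 2)

# Clean squared error plus a squared L2 norm penalty on the weights
penalized_loss = (x @ w - y) ** 2 + sigma ** 2 * np.sum(w ** 2)

print(noisy_loss, penalized_loss)
```

The two printed values should agree up to Monte Carlo error.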
Semi-Supervised Learning
In the paradigm of semi-supervised learning, both unlabeled examples from P(x) and labeled
examples from P(x, y) are used to estimate P(y | x) or predict y from x.
Instead of having separate unsupervised and supervised components in the model, one can
construct models in which a generative model of either P(x) or P(x, y) shares parameters with a
discriminative model of P(y | x).
Multi-Task Learning
Multi-task learning is a way to improve generalization by pooling the examples (which can be
seen as soft constraints imposed on the parameters) arising out of several tasks. In the same way
that additional training examples put more pressure on the parameters of the model towards
values that generalize well, when part of a model is shared across tasks, that part of the model is
more constrained towards good values (assuming the sharing is justified), often yielding better
generalization.
Multi-task learning can be cast in several ways in deep learning frameworks. The figure
illustrates the common situation where the tasks share a common input but involve different
target random variables. The lower layers of a deep network (whether it is supervised and
feedforward or includes a generative component with downward arrows) can be shared across
such tasks, yielding a shared representation h^(shared), while task-specific parameters
(associated respectively with the weights into and from h^(1) and h^(2)) can be learned on top
of it.
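The shared-representation structure can be sketched as a forward pass with one shared layer and two task-specific output layers. This is a simplified illustration with assumed layer sizes and a tanh nonlinearity; it omits the intermediate task-specific hidden layers, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_shared, n_out1, n_out2 = 4, 8, 3, 1   # assumed sizes

# Shared parameters: updated by examples from both tasks
W_shared = rng.normal(scale=0.1, size=(n_in, n_shared))
# Task-specific parameters: each updated only by examples from its own task
W_task1 = rng.normal(scale=0.1, size=(n_shared, n_out1))
W_task2 = rng.normal(scale=0.1, size=(n_shared, n_out2))

def forward(x):
    h_shared = np.tanh(x @ W_shared)   # shared representation
    y1 = h_shared @ W_task1            # prediction for task 1
    y2 = h_shared @ W_task2            # prediction for task 2
    return y1, y2

x = rng.normal(size=(5, n_in))
y1, y2 = forward(x)
print(y1.shape, y2.shape)
```

During training, gradients from both task losses would flow into W_shared, which is what constrains the shared part of the model more strongly.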
Early Stopping
When training large models with sufficient representational capacity to overfit the task, we often
observe that training error decreases steadily over time, but validation set error begins to rise
again.
The figure shows an example of this behavior, which occurs very reliably.
Observe that the training objective decreases consistently over time, but the validation set
average loss eventually begins to increase again, forming an asymmetric U-shaped curve.
This means we can obtain a model with better validation set error (and thus, hopefully better test
set error) by returning to the parameter setting at the point in time with the lowest validation set
error. Instead of running our optimization algorithm until we reach a (local) minimum of
validation error, we run it until the error on the validation set has not improved for some amount
of time. Every time the error on the validation set improves, we store a copy of the model
parameters. When the training algorithm terminates, we return these parameters, rather than the
latest parameters. This strategy is known as early stopping.
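The procedure can be sketched as a small training loop. This is a minimal sketch, assuming a patience parameter that counts consecutive evaluations without improvement; train_step and validation_error are hypothetical callables supplied by the user, and the toy functions below only imitate a U-shaped validation curve.

```python
import copy

def train_with_early_stopping(params, train_step, validation_error,
                              patience=5, max_steps=1000):
    """Return the parameters with the lowest validation error seen during training."""
    best_params = copy.deepcopy(params)
    best_error = validation_error(params)
    steps_without_improvement = 0
    for _ in range(max_steps):
        params = train_step(params)
        error = validation_error(params)
        if error < best_error:
            # Validation error improved: store a copy of the parameters
            best_error = error
            best_params = copy.deepcopy(params)
            steps_without_improvement = 0
        else:
            steps_without_improvement += 1
            if steps_without_improvement >= patience:
                break  # no improvement for `patience` evaluations
    return best_params, best_error

# Toy stand-ins: the "parameters" are a step counter, and the validation
# error is U-shaped with its minimum at step 10.
def toy_train_step(t):
    return t + 1

def toy_validation_error(t):
    return (t - 10) ** 2

best_params, best_error = train_with_early_stopping(
    0, toy_train_step, toy_validation_error, patience=5)
print(best_params, best_error)
```

The function returns the stored best parameters rather than the latest ones, so the extra steps taken while waiting for an improvement are discarded.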
Bagging and Other Ensemble Methods
Bagging (short for bootstrap aggregating) is a technique for reducing generalization error by
combining several models. The idea is to train several different models separately, then have all
of the models vote on the output for test examples. This is an example of a general strategy in
machine learning called model averaging. Techniques employing this strategy are known as
ensemble methods.
Suppose we train an ‘8’ detector on the dataset depicted in the figure, containing an ‘8’, a ‘6’ and a ‘9’.
Suppose we make two different resampled datasets. The bagging training procedure is to
construct each of these datasets by sampling with replacement. The first dataset omits the ‘9’ and
repeats the ‘8’. On this dataset, the detector learns that a loop on top of the digit corresponds to
an ‘8’. On the second dataset, we repeat the ‘9’ and omit the ‘6’. In this case, the detector learns
that a loop on the bottom of the digit corresponds to an ‘8’. Each of these individual
classification rules is brittle, but if we average their output then the detector is robust, achieving
maximal confidence only when both loops of the ‘8’ are present.
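The resampling and averaging steps can be sketched as follows. This is a minimal illustration that assumes a least-squares linear model as the base learner and averages real-valued predictions rather than votes; the data and helper names are made up for the example.

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw len(X) examples with replacement; some repeat, some are omitted."""
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]

def fit_linear(X, y):
    """Least-squares fit, used here as an assumed base learner."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def bagged_predict(models, X):
    """Average the predictions of all ensemble members."""
    return np.mean([X @ w for w in models], axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.1, size=50)

models = [fit_linear(*bootstrap_sample(X, y, rng)) for _ in range(10)]
predictions = bagged_predict(models, X)
print(np.mean((predictions - y) ** 2))
```

For classification, the averaging step would typically be replaced by a vote or by averaging predicted class probabilities.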
Dropout
Dropout trains an ensemble consisting of all sub-networks that can be constructed by removing
non-output units from an underlying base network. Here, we begin with a base network with two
visible units and two hidden units. There are sixteen possible subsets of these four units. We
show all sixteen subnetworks that may be formed by dropping out different subsets of units from
the original network. In this small example, a large proportion of the resulting networks have no
input units or no path connecting the input to the output. This problem becomes insignificant for
networks with wider layers, where the probability of dropping all possible paths from inputs to
outputs becomes smaller.
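The sixteen sub-networks of the small example can be enumerated directly. The sketch below assumes the base network is fully connected (two inputs, two hidden units, one output), so a sub-network has a path from input to output only if it keeps at least one input and at least one hidden unit; the unit names are illustrative.

```python
from itertools import product

units = ["x1", "x2", "h1", "h2"]   # two visible units and two hidden units

def has_path(mask):
    """True if at least one input and at least one hidden unit are kept."""
    keep = dict(zip(units, mask))
    return bool((keep["x1"] or keep["x2"]) and (keep["h1"] or keep["h2"]))

# Every binary mask over the four units defines one sub-network
masks = list(product([0, 1], repeat=len(units)))
connected = [m for m in masks if has_path(m)]
print(len(masks), len(connected))

def prob_layer_fully_dropped(width, keep_prob=0.5):
    """Probability that every unit in a layer of the given width is dropped."""
    return (1.0 - keep_prob) ** width

print(prob_layer_fully_dropped(2), prob_layer_fully_dropped(100))
```

The second function illustrates why the problem fades for wider layers: the chance of dropping an entire layer decays exponentially with its width.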
Adversarial Training
In many cases, neural networks have begun to reach human performance when evaluated
on an i.i.d. test set. It is natural therefore to wonder whether these models have obtained a
true human-level understanding of these tasks. In order to probe the level of
understanding a network has of the underlying task, we can search for examples that the
model misclassifies. It was found that even neural networks that perform at human level
accuracy have a nearly 100% error rate on examples that are intentionally constructed by
using an optimization procedure to search for an input x′ near a data point x such that
the model output is very different at x′. In many cases, x′ can be so similar to x that a
human observer cannot tell the difference between the original example and the
adversarial example, but the network can make highly different predictions.
Adversarial examples have many implications, for example, in computer security, that are
beyond the scope of this chapter. However, they are interesting in the context of
regularization because one can reduce the error rate on the original i.i.d. test set via
adversarial training—training on adversarially perturbed examples from the training set.
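One simple way to construct an adversarially perturbed example is the fast gradient sign method, which moves the input a small step in the direction that increases the loss. The notes do not name a specific construction, so this choice is an assumption. The sketch uses a logistic regression model with made-up weights; all names and numbers are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_grad_wrt_input(w, b, x, y):
    """Gradient of the binary cross-entropy loss with respect to the input x."""
    return (sigmoid(x @ w + b) - y) * w

def fgsm(w, b, x, y, epsilon):
    """Perturb x by epsilon per coordinate in the direction that increases the loss."""
    return x + epsilon * np.sign(loss_grad_wrt_input(w, b, x, y))

w = np.array([2.0, -3.0])   # assumed model weights
b = 0.0                     # assumed bias
x = np.array([0.2, 0.1])    # clean example, correctly classified as class 1
y = 1.0

x_adv = fgsm(w, b, x, y, epsilon=0.1)
print(sigmoid(x @ w + b), sigmoid(x_adv @ w + b))
```

Adversarial training would then add pairs such as (x_adv, y) to the training set, keeping the original label.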
Tangent Distance, Tangent Prop, and Manifold Tangent Classifier
One of the early attempts to take advantage of the manifold hypothesis is the tangent
distance algorithm. It is a non-parametric nearest-neighbor algorithm in which the metric
used is not the generic Euclidean distance but one that is derived from knowledge of the
manifolds near which probability concentrates. It is assumed that we are trying to classify
examples and that examples on the same manifold share the same category.
The tangent prop algorithm trains a neural net classifier with an extra penalty to make
each output f (x) of the neural net locally invariant to known factors of variation. These
factors of variation correspond to movement along the manifold near which examples of
the same class concentrate.
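The extra penalty can be written down explicitly. In the expression below, the symbols v^(i) denote the known tangent vectors of the manifold at x; this notation is introduced here for illustration and does not appear in the notes above. Local invariance is encouraged by requiring the gradient of f at x to be orthogonal to each tangent vector:

```latex
\Omega(f) = \sum_i \left( \left( \nabla_x f(x) \right)^\top v^{(i)} \right)^2
```

Each term is the squared directional derivative of f along one tangent direction, so the penalty is zero exactly when the output does not change, to first order, as x moves along the manifold.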
Tangent propagation is closely related to dataset augmentation. In both cases, the user of
the algorithm encodes his or her prior knowledge of the task by specifying a set of
transformations that should not alter the output of the network.
Tangent propagation is also related to double backprop and adversarial training.
The manifold tangent classifier eliminates the need to know the tangent vectors a priori.
Illustration of the main idea of the tangent prop algorithm and manifold tangent classifier, which
both regularize the classifier output function f(x). Each curve represents the manifold for a
different class, illustrated here as a one-dimensional manifold embedded in a two-dimensional
space.