
UNIT-II

Regularization for Deep Learning: Parameter Norm Penalties, Norm Penalties as Constrained
Optimization, Regularization and Under-Constrained Problems, Dataset Augmentation, Noise
Robustness, Semi-Supervised Learning, Multi-Task Learning, Early Stopping, Parameter Tying and
Parameter Sharing, Sparse Representations, Bagging and Other Ensemble Methods, Dropout,
Adversarial Training, Tangent Distance, Tangent Prop, and Manifold Tangent Classifier.

Regularization for Deep Learning


A central problem in machine learning is how to make an algorithm that will perform well not
just on the training data, but also on new inputs.

Many strategies used in machine learning are explicitly designed to reduce the test error,
possibly at the expense of increased training error. These strategies are known collectively as
regularization.

Parameter Norm Penalties


Many regularization approaches are based on limiting the capacity of models, such as neural
networks, linear regression, or logistic regression, by adding a parameter norm penalty Ω(θ) to
the objective function J. We denote the regularized objective function by J̃:

J̃(θ; X, y) = J(θ; X, y) + α Ω(θ)

where α ∈ [0, ∞) is a hyperparameter that weights the relative contribution of the norm penalty
term Ω relative to the standard objective function J. Setting α to 0 results in no regularization;
larger values of α correspond to more regularization.
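To make this concrete, the following minimal NumPy sketch (the helper names are illustrative,
not from the text) shows a norm penalty being added to an unregularized objective:

import numpy as np

def l2_penalty(w):
    # Omega(theta) = 1/2 * ||w||_2^2
    return 0.5 * np.sum(w ** 2)

def regularized_objective(unregularized_loss, w, alpha):
    # J~(theta; X, y) = J(theta; X, y) + alpha * Omega(theta)
    return unregularized_loss + alpha * l2_penalty(w)

Here alpha = 0 recovers the unregularized objective, while larger values of alpha weight the
penalty more heavily.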

L2 Parameter Regularization

The L2 parameter norm penalty is commonly known as weight decay. This regularization strategy
drives the weights closer to the origin by adding a regularization term Ω(θ) = ½‖w‖₂² to the
objective function. In other academic communities, L2 regularization is also known as ridge
regression or Tikhonov regularization.

To simplify the presentation, we assume no bias parameter, so θ is just w. Such a model has the
following total objective function:

J̃(w; X, y) = (α/2) wᵀw + J(w; X, y)

with the corresponding parameter gradient

∇_w J̃(w; X, y) = αw + ∇_w J(w; X, y)

To take a single gradient step with learning rate ε, we perform this update:

w ← w − ε(αw + ∇_w J(w; X, y))

Written another way, the update is:

w ← (1 − εα)w − ε ∇_w J(w; X, y)

We can see that the weight decay term modifies the learning rule to multiplicatively shrink the
weight vector by a constant factor on each step, just before performing the usual gradient update.
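This multiplicative-shrinkage view of the update is a one-liner in code (a sketch, with the
learning rate eps and the gradient of J assumed to be supplied by the caller):

import numpy as np

def weight_decay_step(w, grad_J, eps, alpha):
    # w <- (1 - eps * alpha) * w - eps * grad_J
    # The weights are shrunk by a constant factor just before the
    # usual gradient update is applied.
    return (1.0 - eps * alpha) * w - eps * grad_J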

L1 Regularization

While L2 weight decay is the most common form of weight decay, there are other ways to
penalize the size of the model parameters. Another option is to use L1 regularization.

Formally, L1 regularization on the model parameter w is defined as the sum of absolute values of
the individual parameters:

Ω(θ) = ‖w‖₁ = Σᵢ |wᵢ|

Thus, the regularized objective function J̃(w; X, y) is given by

J̃(w; X, y) = α‖w‖₁ + J(w; X, y)

with the corresponding gradient (actually, sub-gradient):

∇_w J̃(w; X, y) = α sign(w) + ∇_w J(w; X, y)

where sign(w) is simply the sign of w applied element-wise.
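A corresponding sub-gradient step looks like this (again a sketch, with grad_J computed by the
caller):

import numpy as np

def l1_step(w, grad_J, eps, alpha):
    # w <- w - eps * (alpha * sign(w) + grad_J)
    # np.sign applies the sign function element-wise. The L1 term
    # contributes a constant-magnitude push toward zero, which is
    # what drives some weights exactly to zero (sparsity).
    return w - eps * (alpha * np.sign(w) + grad_J)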

Norm Penalties as Constrained Optimization

Consider the cost function regularized by a parameter norm penalty:

J̃(θ; X, y) = J(θ; X, y) + α Ω(θ)

Recall that we can minimize a function subject to constraints by constructing a generalized
Lagrange function, consisting of the original objective function plus a set of penalties. Each
penalty is a product between a coefficient, called a Karush–Kuhn–Tucker (KKT) multiplier, and a
function representing whether the constraint is satisfied. If we wanted to constrain Ω(θ) to be
less than some constant k, we could construct a generalized Lagrange function:

L(θ, α; X, y) = J(θ; X, y) + α (Ω(θ) − k)
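Rather than working through the KKT multiplier, such a constraint can also be enforced explicitly
by reprojecting the parameters after each gradient step. This explicit-constraint alternative is
not derived above, but a minimal sketch for an L2-ball constraint Ω(θ) = ‖w‖₂ ≤ k is:

import numpy as np

def project_to_l2_ball(w, k):
    # If the weights have left the feasible region ||w||_2 <= k,
    # rescale them back onto its boundary; otherwise leave them alone.
    norm = np.linalg.norm(w)
    if norm > k:
        return w * (k / norm)
    return w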

Dataset Augmentation
The best way to make a machine learning model generalize better is to train it on more data. Of
course, in practice, the amount of data we have is limited. One way to get around this problem is
to create fake data and add it to the training set.

For some machine learning tasks, it is reasonably straightforward to create new fake data.
Dataset augmentation has been a particularly effective technique for a specific classification
problem: object recognition. Images are high dimensional and include an enormous variety of
factors of variation, many of which can be easily simulated. Operations like translating the
training images a few pixels in each direction can often greatly improve generalization, even if
the model has already been designed to be partially translation invariant by using the convolution
and pooling techniques. Many other operations such as rotating the image or scaling the image
have also proven quite effective.
One must be careful not to apply transformations that would change the correct class. For
example, optical character recognition tasks require recognizing the difference between ‘b’ and
‘d’ and the difference between ‘6’ and ‘9’, so horizontal flips and 180° rotations are not
appropriate ways of augmenting datasets for these tasks.
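A label-preserving translation of the kind described above can be written in a few lines of NumPy
(a sketch for 2-D grayscale images, padding the vacated pixels with zeros):

import numpy as np

def random_translate(image, max_shift=2, rng=None):
    # Shift an (H, W) image by up to max_shift pixels in each
    # direction: a label-preserving transformation for object
    # recognition tasks.
    rng = np.random.default_rng() if rng is None else rng
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    h, w = image.shape
    shifted = np.zeros_like(image)
    shifted[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        image[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return shifted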

Noise Robustness

Noise applied to the inputs can be viewed as a dataset augmentation strategy. For some models, the
addition of noise with infinitesimal variance at the input of the model is equivalent to imposing a
penalty on the norm of the weights. In the general case, it is important to remember that noise
injection can be much more powerful than simply shrinking the parameters, especially when the
noise is added to the hidden units.
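A sketch of noise injection at training time, applicable to either the inputs or a layer of hidden
activations:

import numpy as np

def inject_noise(h, sigma=0.1, rng=None):
    # Add zero-mean Gaussian noise to activations; applied only
    # during training. sigma controls the noise magnitude.
    rng = np.random.default_rng() if rng is None else rng
    return h + rng.normal(0.0, sigma, size=h.shape)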

Semi-Supervised Learning

In the paradigm of semi-supervised learning, both unlabeled examples from P(x) and labeled
examples from P(x, y) are used to estimate P(y | x) or predict y from x.

Instead of having separate unsupervised and supervised components in the model, one can
construct models in which a generative model of either P(x) or P(x, y) shares parameters with a
discriminative model of P(y | x).

Multi-Task Learning

Multi-task learning is a way to improve generalization by pooling the examples (which can be
seen as soft constraints imposed on the parameters) arising out of several tasks. In the same way
that additional training examples put more pressure on the parameters of the model towards
values that generalize well, when part of a model is shared across tasks, that part of the model is
more constrained towards good values (assuming the sharing is justified), often yielding better
generalization.
Multi-task learning can be cast in several ways in deep learning frameworks. A common situation is
one in which the tasks share a common input but involve different target random variables. The
lower layers of a deep network (whether it is supervised and feedforward or includes a generative
component with downward arrows) can be shared across such tasks, while task-specific parameters
(associated respectively with the weights into and from h(1) and h(2)) can be learned on top of
the shared representation h(shared).
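A minimal forward-pass sketch of this architecture (weight matrices are assumed to be initialized
elsewhere, and the task-specific heads stand in for h(1) and h(2)):

import numpy as np

def multitask_forward(x, W_shared, W_task1, W_task2):
    # The lower layer produces h_shared, which is reused by both
    # task-specific heads, so its parameters receive gradient signal
    # from every task's examples.
    h_shared = np.tanh(x @ W_shared)   # shared representation h(shared)
    y1 = h_shared @ W_task1            # head for task 1
    y2 = h_shared @ W_task2            # head for task 2
    return y1, y2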

Early Stopping

When training large models with sufficient representational capacity to overfit the task, we often
observe that training error decreases steadily over time, but validation set error begins to rise
again.
This behavior occurs very reliably.

Observe that the training objective decreases consistently over time, but the validation set
average loss eventually begins to increase again, forming an asymmetric U-shaped curve.
This means we can obtain a model with better validation set error (and thus, hopefully better test
set error) by returning to the parameter setting at the point in time with the lowest validation set
error. Instead of running our optimization algorithm until we reach a (local) minimum of
validation error, we run it until the error on the validation set has not improved for some amount
of time. Every time the error on the validation set improves, we store a copy of the model
parameters. When the training algorithm terminates, we return these parameters, rather than the
latest parameters. This strategy is known as early stopping.
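A sketch of the bookkeeping this requires, with train_step, validate, and get_params assumed to be
callbacks supplied by the caller:

import copy

def train_with_early_stopping(train_step, validate, get_params, patience):
    # Stop once validation error has failed to improve for `patience`
    # consecutive evaluations; return the best parameters seen,
    # not the latest ones.
    best_error = float("inf")
    best_params = None
    steps_since_best = 0
    while steps_since_best < patience:
        train_step()
        error = validate()
        if error < best_error:
            best_error = error
            best_params = copy.deepcopy(get_params())  # store a copy
            steps_since_best = 0
        else:
            steps_since_best += 1
    return best_params, best_error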

Bagging and Other Ensemble Methods

Bagging (short for bootstrap aggregating) is a technique for reducing generalization error by
combining several models. The idea is to train several different models separately, then have all
of the models vote on the output for test examples. This is an example of a general strategy in
machine learning called model averaging. Techniques employing this strategy are known as
ensemble methods.

As an example, suppose we train an ‘8’ detector on a dataset containing an ‘8’, a ‘6’ and a ‘9’,
and that we make two different resampled datasets. The bagging training procedure is to construct
each of these datasets by sampling with replacement. Suppose the first dataset omits the ‘9’ and
repeats the ‘8’. On this dataset, the detector learns that a loop on top of the digit corresponds to
an ‘8’. On the second dataset, we repeat the ‘9’ and omit the ‘6’. In this case, the detector learns
that a loop on the bottom of the digit corresponds to an ‘8’. Each of these individual
classification rules is brittle, but if we average their output then the detector is robust, achieving
maximal confidence only when both loops of the ‘8’ are present.
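The resampling and averaging steps are simple to sketch (the models are assumed to be NumPy-based
predictors exposing a predict method):

import numpy as np

def bootstrap_sample(X, y, rng=None):
    # Draw a training set of the same size, sampling with replacement;
    # some examples are repeated and others omitted, as in the
    # '8' detector example above.
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]

def bagged_predict(models, X):
    # Average the outputs of the independently trained models.
    return np.mean([m.predict(X) for m in models], axis=0)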

Dropout

Dropout provides a computationally inexpensive but powerful method of regularizing a broad family
of models. To a first approximation, dropout can be thought of as a method of making bagging
practical for ensembles of very many large neural networks.

Bagging involves training multiple models and evaluating multiple models on each test example.
This seems impractical when each model is a large neural network, since training and evaluating
such networks is costly in terms of runtime and memory. Dropout instead trains the ensemble
consisting of all sub-networks that can be formed by removing non-output units from an underlying
base network.

As an illustration, consider a base network with two visible units and two hidden units. There are
sixteen possible subsets of these four units, and thus sixteen sub-networks that may be formed by
dropping out different subsets of units from the original network. In this small example, a large
proportion of the resulting networks have no input units or no path connecting the input to the
output. This problem becomes insignificant for networks with wider layers, where the probability
of dropping all possible paths from inputs to outputs becomes smaller.
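A minimal sketch of the common "inverted dropout" formulation, in which activations are rescaled
at training time so that no change is needed at test time:

import numpy as np

def dropout(h, keep_prob=0.5, train=True, rng=None):
    # Multiply each unit's activation by an independent binary mask.
    # Dividing by keep_prob keeps the expected activation unchanged,
    # so the network can be used as-is at test time.
    if not train:
        return h
    rng = np.random.default_rng() if rng is None else rng
    mask = (rng.random(h.shape) < keep_prob).astype(h.dtype)
    return h * mask / keep_prob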
Adversarial Training

In many cases, neural networks have begun to reach human performance when evaluated on an i.i.d.
test set. It is natural, therefore, to wonder whether these models have obtained a true
human-level understanding of these tasks. To probe the level of understanding a network has of
the underlying task, we can search for examples that the model misclassifies. It has been found
that even neural networks that perform at human-level accuracy have a nearly 100% error rate on
examples that are intentionally constructed by using an optimization procedure to search for an
input x′ near a data point x such that the model output is very different at x′. In many cases,
x′ can be so similar to x that a human observer cannot tell the difference between the original
example and the adversarial example, yet the network makes highly different predictions.

A classic demonstration of adversarial example generation applied to GoogLeNet on ImageNet: by
adding an imperceptibly small vector whose elements are equal to the sign of the elements of the
gradient of the cost function with respect to the input, we can change GoogLeNet's classification
of the image.

Adversarial examples have many implications, for example in computer security, that are beyond
the scope of this chapter. However, they are interesting in the context of regularization because
one can reduce the error rate on the original i.i.d. test set via adversarial training: training
on adversarially perturbed examples from the training set.
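The sign-of-gradient perturbation described above (the fast gradient sign method) is a one-liner
once the input gradient is available; a sketch, with grad_x computed by the caller's framework:

import numpy as np

def fgsm_perturb(x, grad_x, epsilon=0.007):
    # x' = x + epsilon * sign(dJ/dx): an imperceptibly small step in
    # the direction that most increases the cost. Training on such x'
    # (with the original label) is adversarial training.
    return x + epsilon * np.sign(grad_x)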

Tangent Distance, Tangent Prop, and Manifold Tangent Classifier

One of the early attempts to take advantage of the manifold hypothesis is the tangent
distance algorithm. It is a non-parametric nearest-neighbor algorithm in which the metric
used is not the generic Euclidean distance but one that is derived from knowledge of the
manifolds near which probability concentrates. It is assumed that we are trying to classify
examples and that examples on the same manifold share the same category.
The tangent prop algorithm trains a neural net classifier with an extra penalty to make each
output f(x) of the neural net locally invariant to known factors of variation. These factors of
variation correspond to movement along the manifold near which examples of the same class
concentrate.

Tangent propagation is closely related to dataset augmentation: in both cases, the user of the
algorithm encodes his or her prior knowledge of the task by specifying a set of transformations
that should not alter the output of the network. Tangent propagation is also related to double
backprop and adversarial training. The manifold tangent classifier eliminates the need to know
the tangent vectors a priori.

Illustration of the main idea of the tangent prop algorithm and manifold tangent classifier, which
both regularize the classifier output function f(x). Each curve represents the manifold for a
different class, illustrated here as a one-dimensional manifold embedded in a two-dimensional
space.
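The tangent prop penalty itself is easy to state in code. A sketch, assuming the caller supplies
the gradient of one output f with respect to the input along with the known tangent vectors of
the class manifold at that point:

import numpy as np

def tangent_prop_penalty(grad_f_x, tangents):
    # grad_f_x: gradient of output f w.r.t. the input x, shape (d,).
    # tangents: array of shape (num_tangents, d), one known tangent
    # vector per row. Penalizing the squared directional derivatives
    # (grad_f_x . v)^2 encourages f to be locally invariant to
    # movement along the manifold.
    return np.sum((tangents @ grad_f_x) ** 2)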
