DL_UNIT_3_NOTES
UNIT – III
Introduction to Deep Learning, Historical Trends in Deep Learning, Deep Feedforward
Networks, Gradient-Based Learning, Hidden Units, Architecture Design,
Back-Propagation and Other Differentiation Algorithms
Deep learning allows the computer to build complex concepts out of simpler
concepts.
The figure below shows how a deep learning system can represent the concept of an image of
a person by combining simpler concepts, such as corners and contours, which are in turn defined
in terms of edges. The quintessential example of a deep learning model is the feedforward deep
network, or multilayer perceptron (MLP). A multilayer perceptron is just a mathematical function
mapping some set of input values to output values.
There are two main ways of measuring the depth of a model. The first view is based
on the number of sequential instructions that must be executed to evaluate the architecture. This
count depends on which operations we allow as individual steps, and the figure above illustrates
how this choice of language can give two different measurements for the same architecture.
Another approach, used by deep probabilistic models, regards the depth of a model as being not
the depth of the computational graph but the depth of the graph describing how concepts are
related to each other.
• Deep learning has had a long and rich history, but has gone by many names reflecting
different philosophical viewpoints, and has waxed and waned in popularity.
• Deep learning has become more useful as the amount of available training data has
increased.
• Deep learning models have grown in size over time as computer infrastructure (both
hardware and software) for deep learning has improved.
• Deep learning has solved increasingly complicated applications with increasing accuracy
over time.
Broadly speaking, there have been three waves of development of deep learning:
deep learning known as cybernetics in the 1940s–1960s, deep learning known as connectionism
in the 1980s–1990s, and the current resurgence under the name deep learning beginning in 2006.
Fig: This figure shows two of the three historical waves of artificial neural nets research,
as measured by the frequency of the phrases “cybernetics” and “connectionism” or “neural
networks” according to Google Books.
One may wonder why deep learning has only recently become recognized as a
crucial technology even though the first experiments with artificial neural networks were
conducted in the 1950s. One reason is the growth of available data: as more of our activities take
place on computers, more of what we do is recorded. As our computers are increasingly networked
together, it becomes easier to centralize these records and curate them into a dataset
appropriate for machine learning applications. As of
2016, a rough rule of thumb is that a supervised deep learning algorithm will generally achieve
acceptable performance with around 5,000 labeled examples per category, and will match or
exceed human performance when trained with a dataset containing at least 10 million labeled
examples. Working successfully with datasets smaller than this is an important research area,
focusing in particular on how we can take advantage of large quantities of unlabeled examples,
with unsupervised or semi-supervised learning.
Another key reason that neural networks are wildly successful today after enjoying
comparatively little success since the 1980s is that we have the computational resources to run
much larger models today. The increase in model size over time, due to the availability of faster
CPUs, the advent of general purpose GPUs, faster network connectivity and better software
infrastructure for distributed computing, is one of the most important trends in the history of deep
learning. This trend is generally expected to continue well into the future.
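A feedforward network defines a mapping y = f(x; θ) and learns the value of the
parameters θ that gives the best approximation of some target function f∗. These models are
called feedforward because information flows through the
function being evaluated from x, through the intermediate computations used to define f, and
finally to the output y. There are no feedback connections in which outputs of the model are fed
back into itself.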
Feedforward neural networks are called networks because they are typically
represented by composing together many different functions. The model is associated
with a directed acyclic graph describing how the functions are composed together. For
example, we might have three functions $f^{(1)}$, $f^{(2)}$, and $f^{(3)}$ connected in a chain, to form
$f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x)))$. These chain structures are the most commonly used structures of neural
networks. In this case, $f^{(1)}$ is called the first layer of the network, $f^{(2)}$ is called the second layer,
and so on. The overall length of the chain gives the depth of the model. It is from this terminology
that the name “deep learning” arises. The final layer of a feedforward network is called the output
layer. The training data specifies what the output layer must produce, but not what the other
layers should do; the learning algorithm must decide how to use those layers to best implement an
approximation of f∗. Because the training data does not show the desired output for each of these
layers, they are called hidden layers.
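The chain structure described above can be sketched in a few lines. This is only an illustration: the three scalar functions f1, f2 and f3 below are hypothetical stand-ins for layers and are not taken from these notes.

```python
from functools import reduce

def f1(x):
    # first layer (hypothetical): an affine map
    return 2.0 * x + 1.0

def f2(h):
    # second layer (hypothetical): a rectifier
    return max(0.0, h)

def f3(h):
    # output layer (hypothetical): another affine map
    return -0.5 * h + 3.0

def compose(*layers):
    """Return the function x -> layers[-1](...layers[1](layers[0](x)))."""
    def f(x):
        return reduce(lambda value, layer: layer(value), layers, x)
    return f

network = compose(f1, f2, f3)   # f(x) = f3(f2(f1(x)))
depth = 3                       # the length of the chain
print(network(1.0), network(-2.0))
```

Here the depth of the model is simply the number of functions in the chain.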
Finally, these networks are called neural because they are loosely inspired by
neuroscience. Each hidden layer of the network is typically vector-valued. The dimensionality of
these hidden layers determines the width of the model.
Feedforward networks have introduced the concept of a hidden layer, and this
requires us to choose the activation functions that will be used to compute the hidden layer values.
We must also design the architecture of the network, including how many layers the network
should contain, how these layers should be connected to each other, and how many units should
be in each layer. Learning in deep neural networks requires computing the gradients of complicated
functions. We present the back-propagation algorithm and its modern generalizations, which can
be used to efficiently compute these gradients.
Fig: An example of a feedforward network, drawn in two different styles. It has a single
hidden layer containing two units. (Left) In this style, we draw every
unit as a node in the graph. This style is very explicit and unambiguous, but for networks larger
than this example it can consume too much space. (Right) In this style, we draw a node in the graph
for each entire vector representing a layer’s activations. This style is much more compact.
Sometimes we annotate the edges in this graph with the name of the parameters that describe the
relationship between two layers. Here, we indicate that a matrix W describes the mapping from x
to h, and a vector w describes the mapping from h to y.
Gradient-Based Learning
Designing and training a neural network is not much different from training any
other machine learning model with gradient descent. Computing the gradient is slightly more
complicated for a neural network, but can still be done efficiently and exactly.
As with other machine learning models, to apply gradient-based learning we must
choose a cost function, and we must choose how to represent the output of the model.
Cost Functions
An important aspect of the design of a deep neural network is the choice of the cost
function. Fortunately, the cost functions for neural networks are more or less the same as those for
other parametric models, such as linear models.
In most cases, our parametric model defines a distribution p(y | x; θ) and we simply
use the principle of maximum likelihood. This means we use the cross-entropy between the
training data and the model’s predictions as the cost function.
The total cost function used to train a neural network will often combine one of the
primary cost functions described here with a regularization term.
Learning Conditional Distributions with Maximum Likelihood
Most modern neural networks are trained using maximum likelihood. This means
that the cost function is simply the negative log-likelihood, equivalently described as the cross-
entropy between the training data and the model distribution. This cost function is given by
$$J(\theta) = -\mathbb{E}_{\mathbf{x},\mathbf{y} \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(\mathbf{y} \mid \mathbf{x})$$
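For a classification task, the expectation over the empirical data distribution becomes an average over the training examples. The sketch below assumes the model's class probabilities are already available as plain lists; the function name and the example numbers are hypothetical.

```python
import math

def negative_log_likelihood(probabilities, labels):
    """Average of -log p_model(y | x) over the training examples.

    probabilities[i][k] is the model's probability of class k for example i.
    labels[i] is the observed class index of example i.
    """
    total = 0.0
    for p, y in zip(probabilities, labels):
        total -= math.log(p[y])
    return total / len(labels)

# Two made-up examples with three classes each.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
labels = [0, 1]
cost = negative_log_likelihood(probs, labels)
print(cost)
```

The cost shrinks toward zero as the model assigns probability closer to 1 to the observed labels.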
Output Units
The choice of cost function is tightly coupled with the choice of output unit. Most
of the time, we simply use the cross-entropy between the data distribution and the model
distribution. The choice of how to represent the output then determines the form of the cross-
entropy function.
Any kind of neural network unit that may be used as an output can also be used as
a hidden unit. We suppose that the feedforward network provides a set of hidden features defined
by h = f(x; θ). The role of the output layer is then to provide some additional transformation from
the features to complete the task that the network must perform.
Linear Units for Gaussian Output Distributions
One simple kind of output unit is an output unit based on an affine transformation
with no nonlinearity. These are often just called linear units.
Given features h, a layer of linear output units produces a vector
$$\hat{y} = W^\top h + b.$$
Linear output layers are often used to produce the mean of a conditional
Gaussian distribution:
$$p(y \mid x) = \mathcal{N}(y; \hat{y}, I).$$
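With an identity covariance, the negative log-likelihood of this Gaussian is half the squared error plus a constant, so maximizing the likelihood is equivalent to minimizing mean squared error. The sketch below checks that numerically; the weights, features and helper names are made up for illustration.

```python
import math

def linear_output(W, h, b):
    """Compute y_hat = W^T h + b, with W stored as len(h) rows by len(b) columns."""
    return [sum(W[i][j] * h[i] for i in range(len(h))) + b[j]
            for j in range(len(b))]

def gaussian_nll(y, y_hat):
    """-log N(y; y_hat, I) for a Gaussian with identity covariance."""
    n = len(y)
    squared_error = sum((a - m) ** 2 for a, m in zip(y, y_hat))
    return 0.5 * squared_error + 0.5 * n * math.log(2.0 * math.pi)

W = [[1.0, 0.0], [0.5, -1.0], [0.0, 2.0]]  # 3 features, 2 outputs (hypothetical)
h = [1.0, 2.0, 3.0]
b = [0.1, -0.1]
y_hat = linear_output(W, h, b)

constant = gaussian_nll(y_hat, y_hat)        # the part that does not depend on y_hat
penalty = gaussian_nll([3.1, 3.9], y_hat) - constant
print(y_hat, penalty)
```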
Hidden Units
The design of hidden units is an extremely active area of research and does not yet have
many definitive guiding theoretical principles. Rectified linear units are an excellent default choice
of hidden unit. The design process consists of trial and error, intuiting that a kind of hidden unit
may work well, and then training a network with that kind of hidden unit and evaluating its
performance on a validation set.
Some hidden units are not actually differentiable at all
input points. For example, the rectified linear function g(z) = max{0, z} is not differentiable at z =
0. This may seem like it invalidates g for use with a gradient-based learning algorithm. In practice,
gradient descent still performs well enough for these models to be used for machine learning tasks.
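A minimal sketch of the rectifier and a derivative for it, assuming the common convention of returning one of the one-sided derivatives at z = 0 (here the left derivative, 0). The choice of which side to return is an assumption of this example.

```python
def relu(z):
    """Rectified linear function g(z) = max{0, z}."""
    return max(0.0, z)

def relu_grad(z):
    """Derivative of g; at z == 0 this returns the left derivative, 0."""
    return 1.0 if z > 0.0 else 0.0

for z in (-2.0, 0.0, 3.0):
    print(z, relu(z), relu_grad(z))
```

Because training rarely lands on z = 0 exactly, the value chosen at that single point has little practical effect.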
Unless indicated otherwise, most hidden units can be described as accepting a
vector of inputs x, computing an affine transformation $z = W^\top x + b$, and then applying an
element-wise nonlinear function g(z).
Most hidden units are distinguished from each other only by the choice of the form
of the activation function g(z).
One possibility is to not have an activation g(z) at all. One can also think of this as using
the identity function as the activation function. We have already seen that a linear unit can be
useful as the output of a neural network. It may also be used as a hidden unit.
Softmax units are another kind of unit that is usually used as an output but may sometimes
be used as a hidden unit. Softmax units naturally represent a probability distribution over a discrete
variable with k possible values, so they may be used as a kind of switch.
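A minimal sketch of the softmax function over k scores. Subtracting the maximum score before exponentiating is a standard numerical safeguard that does not change the result.

```python
import math

def softmax(z):
    """Map a list of k real scores to a probability distribution over k values."""
    m = max(z)  # shifting by the max avoids overflow and leaves the output unchanged
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax([2.0, 1.0, 0.1])
print(p, sum(p))
```

When one score is much larger than the others, the output approaches a one-hot vector, which is what lets a softmax unit behave like a switch.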
A few other reasonably common hidden unit types include:
• Radial basis function or RBF unit: $h_i = \exp\left(-\frac{1}{\sigma_i^2} \lVert W_{:,i} - x \rVert^2\right)$. This function becomes
more active as x approaches a template $W_{:,i}$. Because it saturates to 0 for most x, it can be difficult to
optimize.
• Softplus: $g(a) = \zeta(a) = \log(1 + e^a)$. This is a smooth version of the rectifier, used for function
approximation and for the conditional distributions of undirected probabilistic models.
• Hard tanh: this is shaped similarly to the tanh and the rectifier but, unlike the latter, it is
bounded: $g(a) = \max(-1, \min(1, a))$.
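The three unit types listed above can be sketched as plain functions. The names are hypothetical, and the softplus is written in an algebraically equivalent form that avoids overflow for large inputs.

```python
import math

def rbf_unit(w, x, sigma):
    """h = exp(-||w - x||^2 / sigma^2) for a template vector w."""
    squared_distance = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    return math.exp(-squared_distance / sigma ** 2)

def softplus(a):
    """log(1 + e^a), computed as max(a, 0) + log(1 + e^(-|a|))."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))

def hard_tanh(a):
    """max(-1, min(1, a)): linear in the middle, clipped to [-1, 1]."""
    return max(-1.0, min(1.0, a))

print(rbf_unit([1.0, 2.0], [1.0, 2.0], 1.0))    # x equals the template
print(rbf_unit([0.0, 0.0], [10.0, 10.0], 1.0))  # x far from the template
print(softplus(0.0), hard_tanh(3.0))
```

The second RBF call shows the saturation mentioned above: away from the template the unit is essentially 0, so it passes back very little gradient.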
Architecture Design
The word architecture refers to the overall structure of the network: how many units
it should have and how these units should be connected to each other.
Most neural networks are organized into groups of units called layers. Most neural network
architectures arrange these layers in a chain structure, with each layer being a function of the layer
that preceded it. In this structure, the first layer is given by
$$h^{(1)} = g^{(1)}\left(W^{(1)\top} x + b^{(1)}\right),$$
the second layer is given by
$$h^{(2)} = g^{(2)}\left(W^{(2)\top} h^{(1)} + b^{(2)}\right),$$
and so on.
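A minimal sketch of this chain of layers as a forward pass. All parameter values are hypothetical, and each weight matrix is stored with one row per input and one column per unit so that the code computes W^T h + b.

```python
def relu(z):
    return [max(0.0, v) for v in z]

def identity(z):
    return list(z)

def layer(W, b, g, h_prev):
    """Compute g(W^T h_prev + b) for one layer."""
    z = [sum(W[i][j] * h_prev[i] for i in range(len(h_prev))) + b[j]
         for j in range(len(b))]
    return g(z)

def forward(layers, x):
    """Apply each (W, b, g) in order, so every layer is a function of the previous one."""
    h = x
    for W, b, g in layers:
        h = layer(W, b, g, h)
    return h

# Hypothetical network: 2 inputs -> 3 hidden units -> 1 output.
layers = [
    ([[1.0, -1.0, 0.5], [0.0, 2.0, -0.5]], [0.0, -1.0, 0.0], relu),
    ([[1.0], [1.0], [2.0]], [0.5], identity),
]
output = forward(layers, [1.0, 2.0])
print(output)
```

In this representation the depth is `len(layers)` and the width of each layer is the length of its bias vector.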
In these chain-based architectures, the main architectural considerations are to
choose the depth of the network and the width of each layer. The ideal network architecture for a
task must be found via experimentation guided by monitoring the validation set error.
Universal Approximation Properties and Depth
A linear model, mapping from features to outputs via matrix multiplication, can by
definition represent only linear functions. It has the advantage of being easy to train because many
loss functions result in convex optimization problems when applied to linear models.
The universal approximation theorem states that a feedforward network with a
linear output layer and at least one hidden layer with any “squashing” activation function (such as
the logistic sigmoid activation function) can approximate any Borel measurable function from one
finite-dimensional space to another with any desired non-zero amount of error, provided that the
network is given enough hidden units.
The universal approximation theorem means that regardless of what function we
are trying to learn, we know that a large MLP will be able to represent this function.
In summary, a feedforward network with a single layer is sufficient to represent any
function, but the layer may be infeasibly large and may fail to learn and generalize correctly. In
many circumstances, using deeper models can reduce the number of units required to represent the
desired function and can reduce the amount of generalization error.
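As a small illustration of the gap between linear models and networks with a hidden layer, consider the XOR function, which is not linearly separable and so cannot be represented by a linear model. XOR is used here only as an illustrative target and the weights are hand-chosen, not learned; the sketch shows that one hidden layer with two rectified linear units is enough to represent it exactly.

```python
def relu(z):
    return [max(0.0, v) for v in z]

def xor_network(x):
    """One hidden layer with two rectified linear units, then a linear output."""
    W = [[1.0, 1.0], [1.0, 1.0]]   # input-to-hidden weights (hand-chosen)
    c = [0.0, -1.0]                # hidden biases
    w = [1.0, -2.0]                # hidden-to-output weights
    h = relu([W[0][j] * x[0] + W[1][j] * x[1] + c[j] for j in range(2)])
    return w[0] * h[0] + w[1] * h[1]

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, xor_network(x))
```

This example only demonstrates representation. It says nothing about whether a learning algorithm would find these weights, which is the caveat the summary above raises.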
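For a deep rectifier network, the number of linear regions the network can represent is
exponential in the depth l. In the case of maxout networks with k filters per unit, the
number of linear regions is
$$O\left(k^{(l-1)+d}\right),$$
where d is the number of inputs.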
Figure: Empirical results showing that deeper networks generalize better when used
to transcribe multi-digit numbers from photographs of addresses.
During training, forward propagation through the network continues until
it produces a scalar cost J(θ). The back-propagation algorithm (Rumelhart et al., 1986a), often
simply called backprop, allows the information from the cost to then flow backwards through
the network, in order to compute the gradient.
The term back-propagation is often misunderstood as meaning the whole learning
algorithm for multi-layer neural networks. Actually, back-propagation refers only to the method
for computing the gradient, while another algorithm, such as stochastic gradient descent, is used
to perform learning using this gradient.
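The separation between computing the gradient and using it can be sketched as two independent functions. In this hypothetical example the gradient of a simple quadratic cost is written by hand; in a neural network, back-propagation would supply it instead. For simplicity the update is plain gradient descent on the full cost rather than a stochastic minibatch version.

```python
def gradient_step(theta, gradient, learning_rate):
    """One descent update; it does not care how `gradient` was computed."""
    return [t - learning_rate * g for t, g in zip(theta, gradient)]

def cost(theta):
    # Hypothetical cost J(theta) = (theta_0 - 3)^2 + (theta_1 + 1)^2
    return (theta[0] - 3.0) ** 2 + (theta[1] + 1.0) ** 2

def cost_gradient(theta):
    # Gradient of the cost above, derived by hand for this example.
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]

theta = [0.0, 0.0]
for _ in range(100):
    theta = gradient_step(theta, cost_gradient(theta), learning_rate=0.1)
print(theta, cost(theta))
```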
Computational Graphs
To describe the back-propagation algorithm more precisely, it is helpful to have a
more precise computational graph language. Many ways of formalizing computation as graphs are
possible. Here, we use each node in the graph to indicate a variable. The variable may be a scalar,
vector, matrix, tensor, or even a variable of another type. To formalize our graphs, we also need
to introduce the idea of an operation. An operation is a simple function of one or more variables.
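A minimal sketch of this formalization, in which each node is a variable and each non-input node records the operation that produced it. The class and function names are hypothetical and values are restricted to scalars for brevity.

```python
class Variable:
    """A node in the graph: either an input or the result of an operation."""
    def __init__(self, name, op=None, parents=()):
        self.name = name
        self.op = op            # function of the parents' values, or None for inputs
        self.parents = parents  # the variables this node depends on

def evaluate(node, inputs):
    """Compute the value of `node` given a dict of input values keyed by name."""
    if node.op is None:
        return inputs[node.name]
    return node.op(*(evaluate(p, inputs) for p in node.parents))

# Graph for z = x * y followed by u = z + x.
x = Variable("x")
y = Variable("y")
z = Variable("z", op=lambda a, b: a * b, parents=(x, y))
u = Variable("u", op=lambda a, b: a + b, parents=(z, x))

value = evaluate(u, {"x": 2.0, "y": 3.0})
print(value)
```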
Chain Rule of Calculus
Back-propagation is an algorithm that computes the chain rule, with a specific order
of operations that is highly efficient. Let x be a real number, and let f and g both be functions
mapping from a real number to a real number. Suppose that y = g(x) and z = f(g(x)) = f(y). Then
the chain rule states that
$$\frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}.$$
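The scalar chain rule can be checked numerically. The sketch below picks g(x) = x^2 and f(y) = sin(y) purely for illustration and compares the chain-rule derivative with a central finite difference.

```python
import math

def g(x):
    return x ** 2          # y = g(x)

def f(y):
    return math.sin(y)     # z = f(y)

def dz_dx(x):
    """Chain rule: dz/dx = (dz/dy) * (dy/dx)."""
    y = g(x)
    dz_dy = math.cos(y)
    dy_dx = 2.0 * x
    return dz_dy * dy_dx

def finite_difference(x, eps=1e-6):
    """Central-difference approximation of dz/dx, used only as a check."""
    return (f(g(x + eps)) - f(g(x - eps))) / (2.0 * eps)

print(dz_dx(1.3), finite_difference(1.3))
```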
Recursively Applying the Chain Rule to Obtain Backprop
Using the chain rule, it is straightforward to write down an algebraic expression for
the gradient of a scalar with respect to any node in the computational graph that produced that
scalar.
Actually evaluating that expression in a computer introduces some extra considerations.
Specifically, many subexpressions may be repeated several times within the overall
expression for the gradient. Any procedure that computes the gradient will need to choose whether
to store these subexpressions or to recompute them several times. An example of how these
repeated subexpressions arise is given in figure 6.9.
Figure 6.9: A computational graph that results in repeated subexpressions when computing
the gradient.
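The trade-off between storing and recomputing can be made concrete. The sketch below assumes a chain in which the same function f is applied three times, x = f(w), y = f(x), z = f(y), with f = sin chosen only for illustration, and counts how often f is evaluated under each strategy.

```python
import math

calls = {"f": 0}

def f(v):
    calls["f"] += 1
    return math.sin(v)

def f_prime(v):
    return math.cos(v)

def grad_with_stored_values(w):
    """dz/dw = f'(y) f'(x) f'(w), computing x and y once and reusing them."""
    x = f(w)
    y = f(x)
    return f_prime(y) * f_prime(x) * f_prime(w)

def grad_with_recomputation(w):
    """Same gradient written as f'(f(f(w))) f'(f(w)) f'(w); f(w) is evaluated twice."""
    return f_prime(f(f(w))) * f_prime(f(w)) * f_prime(w)

calls["f"] = 0
stored = grad_with_stored_values(0.5)
stored_calls = calls["f"]

calls["f"] = 0
recomputed = grad_with_recomputation(0.5)
recomputed_calls = calls["f"]

print(stored_calls, recomputed_calls)
```

Storing costs memory for x and y but saves evaluations of f; recomputing does the opposite. For longer chains the gap in evaluations grows.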
Symbol-to-Symbol Derivatives
Algebraic expressions and computational graphs both operate on symbols, or
variables that do not have specific values. These algebraic and graph-based representations are
called symbolic representations. When we actually use or train a neural network, we must assign
specific values to these symbols. We replace a symbolic input to the network x with a specific
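numeric value, such as $[1.2, 3.765, -1.8]^\top$.
Some approaches to back-propagation take a computational graph and a set of
numerical values for its inputs, then return numerical values describing the gradient at those
inputs; this is called symbol-to-number differentiation. Another approach is to take a
computational graph and add additional nodes to the
graph that provide a symbolic description of the desired derivatives.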
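A minimal sketch of the symbol-to-symbol idea, using a tiny hypothetical expression language with constants, variables, addition and multiplication. The `derivative` function returns a new expression graph rather than a number, so it can be evaluated later for any input or differentiated again.

```python
class Const:
    def __init__(self, value):
        self.value = value
    def eval(self, env):
        return self.value

class Var:
    def __init__(self, name):
        self.name = name
    def eval(self, env):
        return env[self.name]

class Add:
    def __init__(self, a, b):
        self.a, self.b = a, b
    def eval(self, env):
        return self.a.eval(env) + self.b.eval(env)

class Mul:
    def __init__(self, a, b):
        self.a, self.b = a, b
    def eval(self, env):
        return self.a.eval(env) * self.b.eval(env)

def derivative(node, var):
    """Build and return a new graph for d(node)/d(var); no numeric values are used."""
    if isinstance(node, Const):
        return Const(0.0)
    if isinstance(node, Var):
        return Const(1.0 if node.name == var.name else 0.0)
    if isinstance(node, Add):
        return Add(derivative(node.a, var), derivative(node.b, var))
    if isinstance(node, Mul):  # product rule
        return Add(Mul(derivative(node.a, var), node.b),
                   Mul(node.a, derivative(node.b, var)))
    raise TypeError("unsupported node type")

x = Var("x")
expr = Add(Mul(x, x), Mul(Const(3.0), x))   # x^2 + 3x
d_expr = derivative(expr, x)                # a graph equivalent to 2x + 3
d2_expr = derivative(d_expr, x)             # a graph equivalent to 2

print(d_expr.eval({"x": 2.0}), d2_expr.eval({"x": 5.0}))
```

Because the derivative is expressed in the same language as the original graph, higher-order derivatives come from applying the same procedure again.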
General Back-Propagation
More formally, each node in the graph G corresponds to a variable. To achieve
maximum generality, we describe this variable as being a tensor V. Tensors can in general have
any number of dimensions. They subsume scalars, vectors, and matrices.
The back-propagation algorithm itself does not need to know any differentiation
rules. It only needs to call each operation’s bprop rules with the right arguments. Formally,
op.bprop(inputs, X, G) must return
$$\sum_i \left(\nabla_{X}\, \mathrm{op.f}(\mathrm{inputs})_i\right) G_i.$$
Here, inputs is a list of inputs that are supplied to the operation, op.f is the
mathematical function that the operation implements, X is the input whose gradient we
wish to compute, and G is the gradient on the output of the operation.
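A minimal sketch of an operation exposing this interface, for scalar multiplication. As a simplifying assumption of this example, X is passed as the position of the input of interest rather than as the variable itself, and all values are plain floats.

```python
class Multiply:
    """Operation computing a * b on real numbers."""

    @staticmethod
    def f(inputs):
        a, b = inputs
        return a * b

    @staticmethod
    def bprop(inputs, X, G):
        """Return (d f / d input X) * G, where X is 0 for a and 1 for b."""
        a, b = inputs
        return G * (b if X == 0 else a)

# Gradient of c = a * b with respect to each input, given gradient G = 1 on c.
grad_a = Multiply.bprop((3.0, 4.0), 0, 1.0)
grad_b = Multiply.bprop((3.0, 4.0), 1, 1.0)

# For x * x the operation still treats its two arguments as distinct;
# the caller adds the two contributions to obtain 2x.
x = 5.0
grad_x = Multiply.bprop((x, x), 0, 1.0) + Multiply.bprop((x, x), 1, 1.0)
print(grad_a, grad_b, grad_x)
```

The caller only needs `bprop`; the differentiation rule for multiplication lives entirely inside the operation.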
Complications
Most software implementations need to support operations that can return more
than one tensor. For example, if we wish to compute both the maximum value in a tensor and the
index of that value, it is best to compute both in a single pass through memory, so it is most efficient
to implement this procedure as a single operation with two outputs.
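A minimal sketch of such a two-output operation, assuming the tensor has been flattened to a non-empty Python list. The function name is hypothetical.

```python
def max_and_argmax(values):
    """Return (maximum, index of the first maximum) using one pass over `values`."""
    best_index = 0
    best_value = values[0]
    for i in range(1, len(values)):
        if values[i] > best_value:
            best_value, best_index = values[i], i
    return best_value, best_index

print(max_and_argmax([3.0, 9.0, 2.0, 9.0]))
```

Computing the two results separately would read the data twice, which is the cost the single combined operation avoids.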
The deep learning community has been somewhat isolated from the broader
computer science community and has largely developed its own cultural attitudes concerning how
to perform differentiation. More generally, the field of automatic differentiation is concerned
with how to compute derivatives algorithmically.
The back-propagation algorithm described here is only one approach to automatic
differentiation. It is a special case of a broader class of techniques called reverse mode
accumulation. Other approaches evaluate the subexpressions of the chain rule in different orders.
In general, determining the order of evaluation that results in the lowest computational cost is a
difficult problem. Finding the optimal sequence of operations to compute the gradient is NP-
complete (Naumann, 2008), in the sense that it may require simplifying algebraic expressions into
their least expensive form.
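One alternative ordering is forward mode accumulation, which is not described in these notes but makes a useful contrast: derivatives are carried along with values as the computation proceeds from inputs to outputs. The sketch below uses a hypothetical `Dual` class supporting only addition and multiplication.

```python
class Dual:
    """A value together with its derivative with respect to one chosen input."""
    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # product rule, applied as the computation moves forward
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def f(x, y):
    return x * y + x * x

# d f / d x at (2, 3): seed x with derivative 1 and y with derivative 0.
wrt_x = f(Dual(2.0, 1.0), Dual(3.0, 0.0))
# d f / d y at (2, 3): a second forward pass with the seeds swapped.
wrt_y = f(Dual(2.0, 0.0), Dual(3.0, 1.0))
print(wrt_x.value, wrt_x.deriv, wrt_y.deriv)
```

Forward mode needs one pass per input, whereas reverse mode obtains the derivatives of a scalar output with respect to all inputs in a single backward pass. That is why reverse mode suits neural networks, which have many parameters and one scalar cost.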