DL_UNIT_3_NOTES

The document provides an overview of deep learning, emphasizing its historical development, key concepts, and architecture design. It explains the structure and function of deep feedforward networks, gradient-based learning, and the importance of hidden units and activation functions. The text also highlights the evolution of deep learning, the significance of increasing dataset and model sizes, and the role of cost functions in training neural networks.

Uploaded by

21ag1a6652
Neural Networks AND DEEP Learning Notes-1

Computer science Engineering (Sri Shakthi Institute of Engineering and Technology)

UNIT – III
Introduction to Deep Learning, Historical Trends in Deep Learning, Deep Feedforward Networks,
Gradient-Based Learning, Hidden Units, Architecture Design, Back-Propagation and Other
Differentiation Algorithms

Introduction to Deep Learning


Deep learning is a sub-field of machine learning dealing with algorithms, called artificial
neural networks, that are inspired by the structure and function of the brain. In other words, it
loosely mirrors the functioning of our brains. Deep learning algorithms are structured much like
the nervous system, in which each neuron is connected to others and passes information along.

Example of different representations: suppose we want to separate two categories of data
by drawing a line between them in a scatterplot.

Deep learning allows the computer to build complex concepts out of simpler concepts.
The figure below shows how a deep learning system can represent the concept of an image of
a person by combining simpler concepts, such as corners and contours, which are in turn defined
in terms of edges. The quintessential example of a deep learning model is the feedforward deep
network or multilayer perceptron (MLP). A multilayer perceptron is just a mathematical function
mapping some set of input values to output values.

Figure 1.2: Illustration of a deep learning model.

Figure 1.3: Illustration of computational graphs mapping an input to an output where
each node performs an operation.

There are two main ways of measuring the depth of a model. The first view is based
on the number of sequential instructions that must be executed to evaluate the architecture. This
number depends on which operations we allow as individual steps; the figure above illustrates how
this choice of language can give two different measurements for the same architecture. Another
approach, used by deep probabilistic models, regards the depth of a model as being not the depth
of the computational graph but the depth of the graph describing how concepts are related to each
other.

Historical Trends in Deep learning


It is easiest to understand deep learning with some historical context. Rather than
providing a detailed history of deep learning, we identify a few key trends:

• Deep learning has had a long and rich history, but has gone by many names reflecting
different philosophical viewpoints, and has waxed and waned in popularity.

• Deep learning has become more useful as the amount of available training data has
increased.
• Deep learning models have grown in size over time as computer infrastructure (both
hardware and software) for deep learning has improved.
• Deep learning has solved increasingly complicated applications with increasing accuracy
over time.

The Many Names and Changing Fortunes of Neural Networks

Broadly speaking, there have been three waves of development of deep learning:
deep learning known as cybernetics in the 1940s–1960s, deep learning known as connectionism
in the 1980s–1990s, and the current resurgence under the name deep learning beginning in 2006.

Some of the earliest learning algorithms we recognize today were intended to be
computational models of biological learning, i.e. models of how learning happens or could happen
in the brain. As a result, one of the names that deep learning has gone by is artificial neural
networks (ANNs).

Fig: This figure shows two of the three historical waves of artificial neural nets research,
as measured by the frequency of the phrases “cybernetics” and “connectionism” or “neural
networks” according to Google Books.

Increasing Dataset Sizes

One may wonder why deep learning has only recently become recognized as a
crucial technology though the first experiments with artificial neural networks were conducted in
the 1950s. A large part of the answer is the growing amount of training data. As more of our
activities take place on computers, more of what we do is recorded, and as our computers are
increasingly networked together, it becomes easier to centralize these records and curate them
into a dataset appropriate for machine learning applications. As of
2016, a rough rule of thumb is that a supervised deep learning algorithm will generally achieve
acceptable performance with around 5,000 labeled examples per category, and will match or
exceed human performance when trained with a dataset containing at least 10 million labeled
examples. Working successfully with datasets smaller than this is an important research area,
focusing in particular on how we can take advantage of large quantities of unlabeled examples,
with unsupervised or semi-supervised learning.

Increasing Model Sizes

Another key reason that neural networks are wildly successful today after enjoying
comparatively little success since the 1980s is that we have the computational resources to run
much larger models today. The increase in model size over time, due to the availability of faster
CPUs, the advent of general purpose GPUs, faster network connectivity and better software
infrastructure for distributed computing, is one of the most important trends in the history of deep
learning. This trend is generally expected to continue well into the future.

Deep Feedforward Networks

Deep feedforward networks, also often called feedforward neural networks, or
multilayer perceptrons (MLPs), are the quintessential deep learning models. The goal of a
feedforward network is to approximate some function $f^*$. For example, for a classifier,
$y = f^*(x)$ maps an input x to a category y. A feedforward network defines a mapping
$y = f(x; \theta)$ and learns the value of the parameters θ that result in the best function
approximation.

These models are called feedforward because information flows through the
function being evaluated from x, through the intermediate computations used to define f, and
finally to the output y. There are no feedback connections in which outputs of the model are fed
back into itself.

Feedforward neural networks are called networks because they are typically
represented by composing together many different functions. The model is associated
with a directed acyclic graph describing how the functions are composed together. For
example, we might have three functions $f^{(1)}$, $f^{(2)}$, and $f^{(3)}$ connected in a chain,
to form $f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x)))$. These chain structures are the most commonly used
structures of neural networks. In this case, $f^{(1)}$ is called the first layer of the network,
$f^{(2)}$ is called the second layer, and so on. The overall length of the chain gives the depth
of the model. It is from this terminology that the name “deep learning” arises. The final layer
of a feedforward network is called the output layer. The training data specifies what the output
layer must produce for each input, but not what the other layers should do. The learning algorithm
must decide how to use these layers to best implement an approximation of $f^*$. Because the
training data does not show the desired output for each of these layers, these layers are called
hidden layers.
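
As a minimal sketch of the chain structure described above, the following Python snippet composes three functions into a single model and reads off its depth. The three layer functions and their arithmetic are hypothetical placeholders chosen only for illustration; they do not come from these notes.

```python
from functools import reduce

# Hypothetical layer functions, chosen only to make the chain concrete.
def f1(x):
    return 2.0 * x          # first layer

def f2(h):
    return h + 1.0          # second layer

def f3(h):
    return max(0.0, h)      # third (output) layer

def compose(*layers):
    """Return f(x) = layers[-1](... layers[1](layers[0](x)) ...)."""
    return lambda x: reduce(lambda value, layer: layer(value), layers, x)

layers = (f1, f2, f3)
f = compose(*layers)        # f(x) = f3(f2(f1(x)))
depth = len(layers)         # length of the chain = depth of the model
```

Calling `f(3.0)` evaluates the layers in order, first to last, which is the feedforward flow of information from x to y.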

Finally, these networks are called neural because they are loosely inspired by
neuroscience. Each hidden layer of the network is typically vector-valued. The dimensionality of
these hidden layers determines the width of the model.

Feedforward networks have introduced the concept of a hidden layer, and this
requires us to choose the activation functions that will be used to compute the hidden layer values.
We must also design the architecture of the network, including how many layers the network

42

Downloaded by Sai Patibandla (ping2saas145@gmail.com)


lOMoARcPSD|45190830

٠
١
should contain, how these layers should be connected to each other, and how many units should
be in each layer. Learning in deep neural networks requires computing the gradients of complicated
functions. We present the back-propagation algorithm and its modern generalizations, which can
be used to efficiently compute these gradients.

Figure: An example of a feedforward network, drawn in two different styles. Specifically,
this is the feedforward network we use to solve the XOR example.

It has a single hidden layer containing two units. (Left) In this style, we draw every
unit as a node in the graph. This style is very explicit and unambiguous but for networks larger
than this example it can consume too much space. (Right) In this style, we draw a node in the graph
for each entire vector representing a layer’s activations. This style is much more compact.
Sometimes we annotate the edges in this graph with the name of the parameters that describe the
relationship between two layers. Here, we indicate that a matrix W describes the mapping from x
to h, and a vector w describes the mapping from h to y.
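
The XOR network in the figure can be sketched in a few lines of Python. The notes only name the parameters W and w; the numeric values below, the bias terms c and b, and the choice of a rectified linear hidden layer are assumptions taken from the standard textbook solution to this example, not something stated in these notes.

```python
def relu(z):
    return [max(0.0, v) for v in z]

# Assumed parameter values (textbook XOR solution); the notes only name W and w.
W = [[1.0, 1.0],
     [1.0, 1.0]]      # maps x to h
c = [0.0, -1.0]       # hidden-layer bias (assumed)
w = [1.0, -2.0]       # maps h to y
b = 0.0               # output bias (assumed)

def xor_net(x):
    # Hidden layer: h = relu(W^T x + c)
    z = [sum(W[i][j] * x[i] for i in range(2)) + c[j] for j in range(2)]
    h = relu(z)
    # Output layer: y = w^T h + b
    return sum(w[j] * h[j] for j in range(2)) + b

outputs = [xor_net(x) for x in ([0, 0], [0, 1], [1, 0], [1, 1])]
```

With these values the single hidden layer of two units is enough to reproduce the XOR truth table, which no purely linear model can do.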

Gradient-Based Learning
Designing and training a neural network is not much different from training any
other machine learning model with gradient descent. Computing the gradient is slightly more
complicated for a neural network, but can still be done efficiently and exactly.
As with other machine learning models, to apply gradient-based learning we must
choose a cost function, and we must choose how to represent the output of the model.

Cost Functions
An important aspect of the design of a deep neural network is the choice of the cost
function. Fortunately, the cost functions for neural networks are more or less the same as those for
other parametric models, such as linear models.

In most cases, our parametric model defines a distribution $p(y \mid x; \theta)$ and we simply
use the principle of maximum likelihood. This means we use the cross-entropy between the
training data and the model’s predictions as the cost function.
The total cost function used to train a neural network will often combine one of the
primary cost functions described here with a regularization term.
• Learning Conditional Distributions with Maximum Likelihood
Most modern neural networks are trained using maximum likelihood. This means
that the cost function is simply the negative log-likelihood, equivalently described as the
cross-entropy between the training data and the model distribution. This cost function is given by
$J(\theta) = -\mathbb{E}_{x, y \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(y \mid x)$
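
To make the cost function concrete, here is a small sketch that evaluates the negative log-likelihood on a toy classification set, with the expectation replaced by an average over the training examples. The probability vectors and labels are made-up values used only for illustration.

```python
import math

def negative_log_likelihood(probs, labels):
    """Average of -log p_model(y | x) over the examples."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += -math.log(p[y])
    return total / len(labels)

# Hypothetical model outputs: one probability vector per example.
probs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
labels = [0, 1, 0]
cost = negative_log_likelihood(probs, labels)
```

The cost shrinks toward zero as the model assigns probability closer to 1 to the correct label of every example.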

Output Units

The choice of cost function is tightly coupled with the choice of output unit. Most
of the time, we simply use the cross-entropy between the data distribution and the model
distribution. The choice of how to represent the output then determines the form of the cross-
entropy function.
Any kind of neural network unit that may be used as an output can also be used as
a hidden unit. We suppose that the feedforward network provides a set of hidden features defined
by $h = f(x; \theta)$. The role of the output layer is then to provide some additional
transformation from the features to complete the task that the network must perform.
• Linear Units for Gaussian Output Distributions

One simple kind of output unit is an output unit based on an affine transformation
with no nonlinearity. These are often just called linear units.
Given features h, a layer of linear output units produces a vector $\hat{y} = W^\top h + b$.
Linear output layers are often used to produce the mean of a conditional
Gaussian distribution:
$p(y \mid x) = \mathcal{N}(y; \hat{y}, I)$.
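
To see why linear output units pair naturally with this Gaussian model, one can expand the negative log-likelihood. The expansion below is a standard identity, sketched here with n denoting the dimension of y (a symbol introduced only for this illustration):

```latex
-\log p(y \mid x) = -\log \mathcal{N}(y; \hat{y}, I)
                  = \frac{1}{2} \lVert y - \hat{y} \rVert^2 + \frac{n}{2} \log(2\pi)
```

The second term does not depend on the parameters, so maximizing the log-likelihood under this model amounts to minimizing the squared error between y and the prediction.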
Hidden Units

The design of hidden units is an extremely active area of research and does not yet have
many definitive guiding theoretical principles. Rectified linear units are an excellent default choice
of hidden unit. The design process consists of trial and error, intuiting that a kind of hidden unit
may work well, and then training a network with that kind of hidden unit and evaluating its
performance on a validation set.
Some hidden units are not actually differentiable at all input points. For example, the
rectified linear function g(z) = max{0, z} is not differentiable at z = 0. This may seem like it
invalidates g for use with a gradient-based learning algorithm. In practice, gradient descent
still performs well enough for these models to be used for machine learning tasks.

Unless indicated otherwise, most hidden units can be described as accepting a
vector of inputs x, computing an affine transformation $z = W^\top x + b$, and then applying an
element-wise nonlinear function g(z).
Most hidden units are distinguished from each other only by the choice of the form
of the activation function g(z).

Rectified Linear Units and Their Generalizations


Rectified linear units use the activation function g(z) = max{0, z}.
Rectified linear units are easy to optimize because they are so similar to linear units. The
only difference between a linear unit and a rectified linear unit is that a rectified linear unit outputs
zero across half its domain.
Rectified linear units are typically used on top of an affine transformation:
$h = g(W^\top x + b)$
One drawback to rectified linear units is that they cannot learn via gradient-based methods
on examples for which their activation is zero.
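
The drawback just mentioned can be illustrated with a single scalar unit. In this sketch the weight, bias, and inputs are hypothetical values, and the derivative at exactly z = 0 is set to 0 by convention.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Derivative of max{0, z}; the value at z = 0 is a convention (0 here).
    return 1.0 if z > 0 else 0.0

# Hypothetical scalar unit h = relu(w * x + b).
w, b = 0.5, -1.0

def grad_wrt_w(x):
    z = w * x + b
    return relu_grad(z) * x   # chain rule: dh/dw = g'(z) * x

active = grad_wrt_w(4.0)      # z = 1.0 > 0, so the gradient equals x
inactive = grad_wrt_w(1.0)    # z = -0.5 < 0, so the gradient is zero
```

For the second input the unit is inactive, the gradient with respect to the weight vanishes, and a gradient-based update would leave the weight unchanged for that example.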
Logistic Sigmoid and Hyperbolic Tangent
Prior to the introduction of rectified linear units, most neural networks used the
logistic sigmoid activation function
$g(z) = \sigma(z)$
or the hyperbolic tangent activation function
$g(z) = \tanh(z)$.
These activation functions are closely related because $\tanh(z) = 2\sigma(2z) - 1$.
Sigmoidal activation functions are more common in settings other than feedforward
networks. Recurrent networks, many probabilistic models, and some autoencoders have additional
requirements that rule out the use of piecewise linear activation functions and make sigmoidal
units more appealing despite the drawbacks of saturation.
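
The relationship between the two sigmoidal activations stated in this section, tanh(z) = 2σ(2z) − 1, can be checked numerically. This is only a sanity check at a handful of arbitrarily chosen points, not a proof.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Arbitrary test points; any real values would do.
points = [-3.0, -0.5, 0.0, 0.5, 3.0]

# Largest discrepancy between tanh(z) and 2*sigmoid(2z) - 1 over the points.
max_gap = max(abs(math.tanh(z) - (2.0 * sigmoid(2.0 * z) - 1.0)) for z in points)
```

The gap is at the level of floating-point rounding, which reflects that tanh is a rescaled and shifted logistic sigmoid.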
Other Hidden Units
Many other types of hidden units are possible, but are used less frequently. In general, a
wide variety of differentiable functions perform perfectly well. Many unpublished activation
functions perform just as well as the popular ones.

One possibility is to not have an activation g(z) at all. One can also think of this as using
the identity function as the activation function. We have already seen that a linear unit can be
useful as the output of a neural network. It may also be used as a hidden unit.
Softmax units are another kind of unit that is usually used as an output but may sometimes
be used as a hidden unit. Softmax units naturally represent a probability distribution over a discrete
variable with k possible values, so they may be used as a kind of switch.
A few other reasonably common hidden unit types include:
• Radial basis function or RBF unit: $h_i = \exp\left(-\frac{1}{\sigma_i^2} \lVert W_{:,i} - x \rVert^2\right)$.
This function becomes more active as x approaches a template $W_{:,i}$. Because it saturates to 0
for most x, it can be difficult to optimize.
• Softplus: $g(a) = \zeta(a) = \log(1 + e^a)$. This is a smooth version of the rectifier, introduced
for function approximation and for the conditional distributions of undirected probabilistic models.
• Hard tanh: This is shaped similarly to the tanh and the rectifier but unlike the latter, it is
bounded, $g(a) = \max(-1, \min(1, a))$.
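
The three hidden unit types listed above can be written directly from their formulas. This is a minimal sketch for scalar or small-vector inputs; the function names are illustrative, and the softplus is rearranged into an algebraically equivalent form to avoid overflow for large inputs.

```python
import math

def rbf_unit(x, template, sigma):
    """h_i = exp(-||W_{:,i} - x||^2 / sigma_i^2) for one template column."""
    sq_dist = sum((t - v) ** 2 for t, v in zip(template, x))
    return math.exp(-sq_dist / sigma ** 2)

def softplus(a):
    """log(1 + e^a), computed as max(a, 0) + log(1 + e^(-|a|))."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))

def hard_tanh(a):
    """max(-1, min(1, a)): linear in [-1, 1], clipped outside."""
    return max(-1.0, min(1.0, a))
```

The RBF unit returns 1 when x equals the template and decays quickly to 0 elsewhere, which is the saturation behaviour that makes it hard to optimize.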

Architecture Design
The word architecture refers to the overall structure of the network: how many units
it should have and how these units should be connected to each other.
Most neural networks are organized into groups of units called layers. Most neural network
architectures arrange these layers in a chain structure, with each layer being a function of the layer
that preceded it. In this structure, the first layer is given by
$h^{(1)} = g^{(1)}(W^{(1)\top} x + b^{(1)})$,
the second layer is given by
$h^{(2)} = g^{(2)}(W^{(2)\top} h^{(1)} + b^{(2)})$,
and so on.
In these chain-based architectures, the main architectural considerations are to
choose the depth of the network and the width of each layer. The ideal network architecture for a
task must be found via experimentation guided by monitoring the validation set error.
Universal Approximation Properties and Depth
A linear model, mapping from features to outputs via matrix multiplication, can by
definition represent only linear functions. It has the advantage of being easy to train because many
loss functions result in convex optimization problems when applied to linear models.

The universal approximation theorem states that a feedforward network with a
linear output layer and at least one hidden layer with any “squashing” activation function (such as
the logistic sigmoid activation function) can approximate any Borel measurable function from one
finite-dimensional space to another with any desired non-zero amount of error, provided that the
network is given enough hidden units.
The universal approximation theorem means that regardless of what function we
are trying to learn, we know that a large MLP will be able to represent this function. It does not
guarantee, however, that the training algorithm will be able to learn that function.
In summary, a feedforward network with a single hidden layer is sufficient to represent any
function, but the layer may be infeasibly large and may fail to learn and generalize correctly. In
many circumstances, using deeper models can reduce the number of units required to represent the
desired function and can reduce the amount of generalization error.

Figure: An intuitive, geometric explanation of the exponential advantage of deeper rectifier
networks
More precisely, the main theorem in Montufar et al. states that the number of linear
regions carved out by a deep rectifier network with $d$ inputs, depth $l$, and $n$ units per hidden
layer, is
$O\left( \binom{n}{d}^{d(l-1)} n^d \right)$,
i.e., exponential in the depth $l$. In the case of maxout networks with $k$ filters per unit, the
number of linear regions is
$O\left( k^{(l-1)+d} \right)$.

Other Architectural Considerations


Many neural network architectures have been developed for specific tasks.
Specialized architectures for computer vision are called convolutional networks. Feedforward
networks may also be generalized to recurrent neural networks for sequence processing.
Many architectures build a main chain but then add extra architectural features to
it, such as skip connections going from layer i to layer i+2 or higher. These skip connections make
it easier for the gradient to flow from output layers to layers nearer the input.
Another key consideration of architecture design is exactly how to connect a pair
of layers to each other. In the default neural network layer described by a linear transformation via
a matrix W, every input unit is connected to every output unit.

Figure: Empirical results showing that deeper networks generalize better when used
to transcribe multi-digit numbers from photographs of addresses.

Back-Propagation and Other Differentiation Algorithms


When we use a feedforward neural network to accept an input x and produce an
output ŷ, information flows forward through the network. The inputs x provide the initial
information that then propagates up to the hidden units at each layer and finally produces ŷ. This
is called forward propagation. During training, forward propagation can continue onward until
it produces a scalar cost J(θ). The back-propagation algorithm (Rumelhart et al., 1986a), often
simply called backprop, allows the information from the cost to then flow backwards through
the network, in order to compute the gradient.
The term back-propagation is often misunderstood as meaning the whole learning
algorithm for multi-layer neural networks. Actually, back-propagation refers only to the method
for computing the gradient, while another algorithm, such as stochastic gradient descent, is used
to perform learning using this gradient.
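
The division of labour described above can be sketched with a one-parameter model. Here the gradient is written by hand for a squared-error cost; in a neural network that role would be played by back-propagation. The data, learning rate, step count, and function names are all hypothetical choices made for this illustration.

```python
import random

def gradient(theta, x, y):
    """Gradient of the squared error (theta * x - y)^2 with respect to theta."""
    return 2.0 * (theta * x - y) * x

def sgd(data, theta=0.0, learning_rate=0.05, steps=500, seed=0):
    """Stochastic gradient descent: the learning algorithm that uses the gradient."""
    rng = random.Random(seed)
    for _ in range(steps):
        x, y = rng.choice(data)                           # sample one example
        theta -= learning_rate * gradient(theta, x, y)    # learning step
    return theta

# Hypothetical data generated by y = 3x, so the ideal parameter is 3.
data = [(1.0, 3.0), (2.0, 6.0), (-1.0, -3.0)]
theta = sgd(data)
```

Swapping `sgd` for a different optimizer would not change `gradient` at all, which is the sense in which computing the gradient and performing learning are separate steps.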
Computational Graphs
To describe the back-propagation algorithm more precisely, it is helpful to have a
more precise computational graph language. Many ways of formalizing computation as graphs are
possible. Here, we use each node in the graph to indicate a variable. The variable may be a scalar,
vector, matrix, tensor, or even a variable of another type. To formalize our graphs, we also need
to introduce the idea of an operation. An operation is a simple function of one or more variables.

Figure: Examples of computational graphs

Chain Rule of Calculus
Back-propagation is an algorithm that computes the chain rule, with a specific order
of operations that is highly efficient. Let x be a real number, and let f and g both be functions
mapping from a real number to a real number. Suppose that y = g(x) and z = f(g(x)) = f(y). Then
the chain rule states that
$\frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}$.
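
The chain rule can be checked numerically on a concrete pair of functions. In this sketch g(x) = x² and f(y) = sin(y) are arbitrary choices made for illustration, and the finite-difference estimate serves only as an independent check of the analytic result.

```python
import math

def g(x):            # y = g(x)
    return x ** 2

def f(y):            # z = f(y)
    return math.sin(y)

def dz_dx_chain(x):
    y = g(x)
    dz_dy = math.cos(y)      # derivative of sin(y) with respect to y
    dy_dx = 2.0 * x          # derivative of x^2 with respect to x
    return dz_dy * dy_dx     # chain rule: dz/dx = (dz/dy)(dy/dx)

def dz_dx_numeric(x, eps=1e-6):
    # Central finite difference, used only as an independent check.
    return (f(g(x + eps)) - f(g(x - eps))) / (2.0 * eps)

x = 1.3
analytic = dz_dx_chain(x)
numeric = dz_dx_numeric(x)
```

The two values agree to within the accuracy of the finite-difference approximation.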
Recursively Applying the Chain Rule to Obtain Backprop
Using the chain rule, it is straightforward to write down an algebraic expression for
the gradient of a scalar with respect to any node in the computational graph that produced that
scalar. Actually evaluating that expression on a computer, however, introduces some extra
considerations.
Specifically, many subexpressions may be repeated several times within the overall
expression for the gradient. Any procedure that computes the gradient will need to choose whether
to store these subexpressions or to recompute them several times. An example of how these
repeated subexpressions arise is given in figure 6.9.

Figure 6.9: A computational graph that results in repeated subexpressions when computing
the gradient.

The back-propagation algorithm is designed to reduce the number of common
subexpressions without regard to memory.

Symbol-to-Symbol Derivatives

Algebraic expressions and computational graphs both operate on symbols, or
variables that do not have specific values. These algebraic and graph-based representations are
called symbolic representations. When we actually use or train a neural network, we must assign
specific values to these symbols. We replace a symbolic input to the network x with a specific
numeric value, such as $[1.2, 3.765, -1.8]^\top$.

Figure: An example of the symbol-to-symbol approach to computing derivatives. In
this approach, the back-propagation algorithm does not need to ever access any actual
specific numeric values. Instead, it adds nodes to a computational graph describing how
to compute these derivatives.

Some approaches to back-propagation take a computational graph and a set of
numerical values for the inputs to the graph, then return a set of numerical
values describing the gradient at those input values. We call this approach
“symbol-to-number” differentiation.

Another approach is to take a computational graph and add additional nodes to the
graph that provide a symbolic description of the desired derivatives. This is the
symbol-to-symbol approach.

General Back-Propagation

The back-propagation algorithm is very simple. To compute the gradient of some
scalar z with respect to one of its ancestors x in the graph, we begin by observing that the gradient
with respect to z is given by dz/dz = 1. We can then compute the gradient with respect to each
parent of z in the graph by multiplying the current gradient by the Jacobian of the operation that
produced z. We continue multiplying by Jacobians traveling backwards through the graph in this
way until we reach x. For any node that may be reached by going backwards from z through two
or more paths, we simply sum the gradients arriving from different paths at that node.
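
The procedure above can be sketched for scalar variables, where each Jacobian reduces to a single local derivative. The `Node` class and the helper names are hypothetical constructs for this illustration, not the API of any library. Following the text, the inputs of an operation are called the parents of its output.

```python
class Node:
    """A scalar variable in a computational graph (illustrative only)."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs of (parent node, local derivative)
        self.grad = 0.0

def add(a, b):
    return Node(a.value + b.value, ((a, 1.0), (b, 1.0)))

def mul(a, b):
    return Node(a.value * b.value, ((a, b.value), (b, a.value)))

def topological_order(z):
    """List every ancestor of z before the nodes that depend on it."""
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(z)
    return order

def backprop(z):
    z.grad = 1.0                                  # dz/dz = 1
    for node in reversed(topological_order(z)):   # travel backwards from z
        for parent, local in node.parents:
            parent.grad += node.grad * local      # sum over all paths

x = Node(3.0)
y = mul(x, x)        # y = x^2, reached from x along two edges
z = add(y, x)        # z = x^2 + x, a further path from x to z
backprop(z)
```

Here x can be reached from z along three edges in total, and the accumulated gradient 2x + 1 = 7 at x = 3 is the sum of the contributions arriving along each of them.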

More formally, each node in the graph G corresponds to a variable. To achieve
maximum generality, we describe this variable as being a tensor V. Tensors can in general have
any number of dimensions. They subsume scalars, vectors, and matrices.

We assume that each variable V is associated with the following subroutines:

• get_operation(V): This returns the operation that computes V, represented
by the edges coming into V in the computational graph. For example, there may be a Python
or C++ class representing the matrix multiplication operation. Suppose we have a variable
that is created by matrix multiplication, C = AB. Then get_operation(V), called on that
variable, returns a pointer to an instance of the corresponding C++ class.
• get_consumers(V, G): This returns the list of variables that are children of
V in the computational graph G.
• get_inputs(V, G): This returns the list of variables that are parents of V
in the computational graph G.

The back-propagation algorithm itself does not need to know any differentiation
rules. It only needs to call each operation’s bprop rules with the right arguments. Formally,
op.bprop(inputs, X, G) must return
$\sum_i \left( \nabla_X \, \mathrm{op.f}(\mathrm{inputs})_i \right) G_i$.

Here, inputs is a list of inputs that are supplied to the operation, op.f is the
mathematical function that the operation implements, X is the input whose gradient we
wish to compute, and G is the gradient on the output of the operation.

Software implementations of back-propagation usually provide both the operations
and their bprop methods, so that users of deep learning software libraries are able to
back-propagate through graphs built using common operations like matrix multiplication, exponents,
logarithms, and so on. Software engineers who build a new implementation of back-propagation
or advanced users who need to add their own operation to an existing library must usually derive
the op.bprop method for any new operations manually.
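
As an illustration of deriving a bprop method by hand, the sketch below implements it for the matrix multiplication C = AB mentioned earlier. For this operation the gradient with respect to A is G Bᵀ and the gradient with respect to B is Aᵀ G. The `MatMul` class, its method layout, and the list-of-lists matrix representation are assumptions made for this example; they are not the interface of any particular library.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(M):
    return [list(row) for row in zip(*M)]

class MatMul:
    """Hypothetical operation class for C = AB; not a real library API."""
    def f(self, inputs):
        A, B = inputs
        return matmul(A, B)

    def bprop(self, inputs, X, G):
        A, B = inputs
        if X is A:
            return matmul(G, transpose(B))   # gradient with respect to A
        if X is B:
            return matmul(transpose(A), G)   # gradient with respect to B
        raise ValueError("X must be one of the operation's inputs")

op = MatMul()
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
G = [[1.0, 1.0], [1.0, 1.0]]     # gradient arriving on the output C
grad_A = op.bprop([A, B], A, G)
grad_B = op.bprop([A, B], B, G)
```

The back-propagation routine itself never needs to know these two formulas; it only calls `bprop` once per input whose gradient is requested.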

Complications

Most software implementations need to support operations that can return more
than one tensor. For example, if we wish to compute both the maximum value in a tensor and the
index of that value, it is best to compute both in a single pass through memory, so it is most efficient
to implement this procedure as a single operation with two outputs.

We have not described how to control the memory consumption of back-propagation.
Back-propagation often involves summation of many tensors together. In the
naive approach, each of these tensors would be computed separately, then all of them would be
added in a second step. The naive approach has an overly high memory bottleneck that can be
avoided by maintaining a single buffer and adding each value to that buffer as it is computed.
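
The two summation strategies can be contrasted in a short sketch. The generator below stands in for gradient tensors produced one at a time during back-propagation; its contents and the function names are hypothetical.

```python
def gradient_contributions(n, size):
    """Yield n hypothetical gradient tensors (flat lists) one at a time."""
    for k in range(1, n + 1):
        yield [float(k)] * size

def naive_sum(n, size):
    tensors = list(gradient_contributions(n, size))   # all held in memory at once
    return [sum(column) for column in zip(*tensors)]  # added in a second step

def buffered_sum(n, size):
    buffer = [0.0] * size                              # single accumulation buffer
    for tensor in gradient_contributions(n, size):
        for i, value in enumerate(tensor):
            buffer[i] += value                         # add as soon as it is computed
    return buffer

total_naive = naive_sum(4, 3)
total_buffered = buffered_sum(4, 3)
```

Both functions return the same totals, but the buffered version only ever holds one contribution plus the buffer, rather than all n tensors.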

Real-world implementations of back-propagation also need to handle various data
types, such as 32-bit floating point, 64-bit floating point, and integer values. The policy for
handling each of these types takes special care to design.

Some operations have undefined gradients, and it is important to track these
cases and determine whether the gradient requested by the user is undefined.

Various other technicalities make real-world differentiation more
complicated. These technicalities are not insurmountable, and this chapter has described the key
intellectual tools needed to compute derivatives, but it is important to be aware that many more
subtleties exist.

Differentiation outside the Deep Learning Community

The deep learning community has been somewhat isolated from the broader
computer science community and has largely developed its own cultural attitudes concerning how
to perform differentiation. More generally, the field of automatic differentiation is concerned
with how to compute derivatives algorithmically.
The back-propagation algorithm described here is only one approach to automatic
differentiation. It is a special case of a broader class of techniques called reverse mode
accumulation. Other approaches evaluate the subexpressions of the chain rule in different orders.
In general, determining the order of evaluation that results in the lowest computational cost is a
difficult problem. Finding the optimal sequence of operations to compute the gradient is NP-
complete (Naumann, 2008), in the sense that it may require simplifying algebraic expressions into
their least expensive form.
