
Deep Feedforward Networks
Lecture slides for Chapter 6 of Deep Learning
www.deeplearningbook.org
Ian Goodfellow
Last updated 2016-10-04
Roadmap
• Example: Learning XOR

• Gradient-Based Learning

• Hidden Units

• Architecture Design

• Back-Propagation

(Goodfellow 2017)
XOR is not linearly separable

[Figure 6.1, left: the original x space, with x1 on the horizontal axis and x2 on the vertical axis; no single line can separate the two XOR classes.]

(Goodfellow 2017)
Rectified Linear Activation

g(z) = max{0, z}

[Figure 6.3: plot of the rectified linear activation function against z.]

(Goodfellow 2017)
Network Diagrams

[Figure 6.2: an example of a feedforward network, drawn in two different styles. On the left, every unit is drawn as a node (x1, x2; h1, h2; y); on the right, each layer is drawn as a single node (x, h, y).]

(Goodfellow 2017)
Solving XOR

The activation function g is typically chosen to be a function applied element-wise, with h_i = g(x⊤W_{:,i} + c_i). In modern neural networks, the default recommendation is to use the rectified linear unit, or ReLU (Jarrett et al., 2009; Nair and Hinton, 2010; Glorot et al., 2011a), defined by the activation function g(z) = max{0, z}, depicted in figure 6.3.

We can now specify our complete network as

f(x; W, c, w, b) = w⊤ max{0, W⊤x + c} + b.    (6.3)

We can then specify a solution to the XOR problem. Let

W = [1 1; 1 1],    (6.4)
c = [0, -1]⊤,      (6.5)
w = [1, -2]⊤,      (6.6)

and b = 0. We can now walk through how the model processes a batch of inputs X.

(Goodfellow 2017)
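As a quick check of the solution above, here is a minimal sketch (not part of the slides) that evaluates equation 6.3 with the weights from equations 6.4-6.6 on all four XOR inputs; the function and variable names are mine, and b = 0 as stated in the book.

```python
import jax.numpy as jnp

# Parameters from equations 6.4-6.6 (with b = 0).
W = jnp.array([[1., 1.],
               [1., 1.]])
c = jnp.array([0., -1.])
w = jnp.array([1., -2.])
b = 0.

def f(x):
    """Equation 6.3: f(x; W, c, w, b) = w^T max{0, W^T x + c} + b."""
    h = jnp.maximum(0., W.T @ x + c)   # ReLU hidden layer
    return w @ h + b

X = jnp.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
print([float(f(x)) for x in X])        # [0.0, 1.0, 1.0, 0.0], i.e. XOR
```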
Solving XOR

[Figure 6.1: solving the XOR problem by learning a representation. Left: the original x space (axes x1, x2), where the problem is not linearly separable. Right: the learned h space (axes h1, h2), in which the two classes can be separated by a line.]

(Goodfellow 2017)
Roadmap
• Example: Learning XOR

• Gradient-Based Learning

• Hidden Units

• Architecture Design

• Back-Propagation

(Goodfellow 2017)
Gradient-Based Learning
• Specify

  • Model

  • Cost

• Design model and cost so cost is smooth

• Minimize cost using gradient descent or related techniques (a minimal sketch follows)

(Goodfellow 2017)
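A minimal sketch of this recipe (my own illustration, not from the slides): a one-parameter linear model, a smooth mean-squared-error cost, and plain gradient descent, with jax.grad supplying the gradient.

```python
import jax
import jax.numpy as jnp

# Toy data, assumed for illustration: the target function is y = 2x.
x = jnp.linspace(-1., 1., 20)
y = 2. * x

def cost(theta):
    """Smooth cost: mean squared error of the model f(x) = theta * x."""
    return jnp.mean((y - theta * x) ** 2)

theta, lr = 0.0, 0.1
for _ in range(100):
    theta = theta - lr * jax.grad(cost)(theta)   # one gradient descent step
print(theta)   # close to 2.0
```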
Conditional Distributions and Cross-Entropy

Most modern neural networks are trained using maximum likelihood. This means the cost function is simply the negative log-likelihood, equivalently described as the cross-entropy between the training data and the model distribution. This cost function is given by

J(θ) = -E_{x,y∼p̂_data} log p_model(y | x).    (6.12)

The specific form of the cost function changes from model to model, depending on the form of log p_model. The expansion of the above equation typically yields some terms that do not depend on the model parameters and may be discarded. For example, as we saw in section 5.5.1, if p_model(y | x) = N(y; f(x; θ), I), then we recover the mean squared error cost,

J(θ) = (1/2) E_{x,y∼p̂_data} ||y - f(x; θ)||^2 + const.    (6.13)

(Goodfellow 2017)
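A small numerical sketch of equations 6.12 and 6.13 (mine, with assumed example values): for a Gaussian p_model with identity covariance, the negative log-likelihood of one example is half the squared error plus a constant that does not depend on the parameters.

```python
import jax.numpy as jnp

def gaussian_nll(y, f_x):
    """-log N(y; f(x), I) for a single example (the integrand of equation 6.12)."""
    d = y.shape[-1]
    return 0.5 * jnp.sum((y - f_x) ** 2) + 0.5 * d * jnp.log(2. * jnp.pi)

y   = jnp.array([1.0, -2.0])   # target
f_x = jnp.array([0.5, -1.0])   # model prediction f(x; theta)

half_sq_err = 0.5 * jnp.sum((y - f_x) ** 2)
# The difference is the constant 0.5 * d * log(2*pi): it does not depend on theta,
# so minimizing the NLL is the same as minimizing the squared error (equation 6.13).
print(gaussian_nll(y, f_x) - half_sq_err)
```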
Output Types

Output Type | Output Distribution | Output Layer    | Cost Function
Binary      | Bernoulli           | Sigmoid         | Binary cross-entropy
Discrete    | Multinoulli         | Softmax         | Discrete cross-entropy
Continuous  | Gaussian            | Linear          | Gaussian cross-entropy (MSE)
Continuous  | Mixture of Gaussian | Mixture Density | Cross-entropy
Continuous  | Arbitrary           | Various         | See part III: GAN, VAE, FVBN

(Goodfellow 2017)
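The first two rows of the table correspond to standard loss functions; below is a sketch (function names are mine) of the sigmoid/Bernoulli and softmax/Multinoulli pairings, written with log-space primitives so they stay numerically stable.

```python
import jax.numpy as jnp
from jax.nn import log_sigmoid, log_softmax

def binary_cross_entropy(z, y):
    """Bernoulli output: z is the pre-sigmoid logit, y is 0 or 1."""
    return -(y * log_sigmoid(z) + (1 - y) * log_sigmoid(-z))

def discrete_cross_entropy(z, k):
    """Multinoulli output: z is a vector of logits, k is the target class index."""
    return -log_softmax(z)[k]

print(binary_cross_entropy(jnp.array(2.0), 1))            # small loss: confident and correct
print(discrete_cross_entropy(jnp.array([2., 0., -1.]), 0))
```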
Mixture Density Outputs

[Figure 6.4: samples drawn from a neural network with a mixture density output layer; the input x is on the horizontal axis and the sampled output y on the vertical axis.]

(Goodfellow 2017)
Don't mix and match

Sigmoid output with target of 1

[Plot: σ(z), the cross-entropy loss, and the MSE loss as functions of z. For very negative z the MSE loss flattens out, so its gradient vanishes even though the prediction is badly wrong, while the cross-entropy loss keeps a strong gradient.]

(Goodfellow 2017)
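A sketch reproducing the point of this plot numerically (the example values are mine): with a sigmoid output unit and target 1, the cross-entropy gradient stays close to -1 when z is very negative, while the MSE gradient is close to 0.

```python
import jax
from jax.nn import sigmoid, log_sigmoid

xent = lambda z: -log_sigmoid(z)            # cross-entropy loss, target y = 1
mse  = lambda z: (sigmoid(z) - 1.0) ** 2    # MSE loss, target y = 1

for z in (-5.0, 0.0, 5.0):
    print(z, float(jax.grad(xent)(z)), float(jax.grad(mse)(z)))
# At z = -5 the cross-entropy gradient is about -0.99 (a strong learning signal),
# while the MSE gradient is about -0.01: the sigmoid has saturated the loss.
```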
Roadmap
• Example: Learning XOR

• Gradient-Based Learning

• Hidden Units

• Architecture Design

• Back-Propagation

(Goodfellow 2017)
Hidden units
• Use ReLUs, 90% of the time

• For RNNs, see Chapter 10

• For some research projects, get creative

• Many hidden units perform comparably to ReLUs. New hidden units that perform comparably are rarely interesting.

(Goodfellow 2017)
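For concreteness, a few of the hidden-unit activations this part of the chapter discusses, written as plain functions (a sketch; the leaky-ReLU slope of 0.01 is just a common default, not a value from the slides).

```python
import jax.numpy as jnp

relu       = lambda z: jnp.maximum(0., z)                   # the default recommendation
leaky_relu = lambda z, a=0.01: jnp.where(z > 0, z, a * z)   # a ReLU variant with a nonzero left slope
tanh       = jnp.tanh                                       # saturating; mostly of historical interest here

print(relu(-2.), leaky_relu(-2.), tanh(-2.))
```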
Roadmap
• Example: Learning XOR

• Gradient-Based Learning

• Hidden Units

• Architecture Design

• Back-Propagation

(Goodfellow 2017)
Architecture Basics

[Figure 6.2 again, annotated with depth and width: depth is the number of layers, stacked from input x to output y; width is the number of units in each layer. This is the feedforward network used to solve the XOR example, with a single hidden layer containing two units.]

(Goodfellow 2017)
Universal Approximator
Theorem
• One hidden layer is enough to represent (not learn)
an approximation of any function to an arbitrary
degree of accuracy

• So why deeper?

• Shallow net may need (exponentially) more width

• Shallow net may overfit more

(Goodfellow 2017)
Exponential Representation Advantage of Depth

[Figure 6.5: an intuitive, geometric explanation of the exponential advantage of deeper rectifier networks, shown formally by Montufar et al. (2014). Left: an absolute value rectification unit has the same output for every pair of mirror points in its input; the mirror axis of symmetry is the hyperplane defined by the unit's weights and bias, and a function computed on top of that unit is a mirror image of a simpler pattern across that axis. Center: the function can be obtained by folding the space around the axis of symmetry. Right: another repeating pattern can be folded on top of the first. Each hidden unit specifies where to fold the input space; by composing these folding operations, we obtain an exponentially large number of piecewise linear regions, which can capture all kinds of regular (e.g., repeating) patterns.]

(Goodfellow 2017)
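A small sketch of the folding intuition (the construction is mine, for illustration): treating one absolute-value rectifier unit as a "fold" of the interval [0, 1] about its midpoint, composing depth such folds produces a piecewise linear function whose number of zeros, and hence of linear pieces, doubles with every layer.

```python
import jax.numpy as jnp

def fold(u):
    """One absolute-value rectifier unit: mirrors its input around 0.5."""
    return jnp.abs(2. * u - 1.)

x = jnp.linspace(0., 1., 4001)
y = x
for depth in range(1, 5):
    y = fold(y)                       # compose one more folding layer
    # Each zero of the composed function sits at the bottom of a fold; the composed
    # function has 2**(depth - 1) zeros and 2**depth linear pieces.
    dips = y < 0.25
    n_zeros = int(jnp.sum(dips[1:] & ~dips[:-1]) + dips[0])
    print(depth, n_zeros)             # 1, 2, 4, 8
```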
Better Generalization with Greater Depth

[Figure 6.6: test accuracy (percent) versus number of hidden layers (3 to 11). Empirical results showing that deeper networks generalize better when used to transcribe multi-digit numbers from photographs of addresses; data from Goodfellow et al. (2014d). The test set accuracy consistently increases with increasing depth.]

(Goodfellow 2017)
Large, Shallow Models Overfit More

[Figure 6.7: test accuracy (percent) versus number of parameters (up to about 10^8) for networks of depth 3 (convolutional), depth 3 (fully connected), and depth 11 (convolutional). Deeper models tend to perform better, and not merely because they are larger: this experiment from Goodfellow et al. (2014d) shows that increasing the number of parameters in layers of convolutional networks without increasing their depth is not nearly as effective at increasing test set performance. The legend indicates the depth of the network used to make each curve.]

(Goodfellow 2017)
Roadmap
• Example: Learning XOR

• Gradient-Based Learning

• Hidden Units

• Architecture Design

• Back-Propagation

(Goodfellow 2017)
Back-Propagation

• Back-propagation is "just the chain rule" of calculus

• But it's a particular implementation of the chain rule

• Uses dynamic programming (table filling)

• Avoids recomputing repeated subexpressions

• Speed vs memory tradeoff

The chain rule states that

dz/dx = (dz/dy)(dy/dx).    (6.44)

We can generalize this beyond the scalar case. Suppose that x ∈ ℝ^m, y ∈ ℝ^n, g maps from ℝ^m to ℝ^n, and f maps from ℝ^n to ℝ. If y = g(x) and z = f(y), then

∂z/∂x_i = Σ_j (∂z/∂y_j)(∂y_j/∂x_i).    (6.45)

In vector notation, this may be equivalently written as

∇_x z = (∂y/∂x)⊤ ∇_y z,    (6.46)

where ∂y/∂x is the n × m Jacobian matrix of g. From this we see that the gradient of a variable x can be obtained by multiplying a Jacobian matrix ∂y/∂x by a gradient ∇_y z. The back-propagation algorithm consists of performing such a Jacobian-gradient product for each operation in the graph.

Usually we apply the back-propagation algorithm to tensors of arbitrary dimensionality, not merely to vectors. Conceptually, this is exactly the same as back-propagation with vectors; the only difference is how the numbers are arranged in a grid to form a tensor. We could imagine flattening each tensor into a vector before we run back-propagation, computing a vector-valued gradient, and then reshaping the gradient back into a tensor. In this rearranged view, back-propagation is still just multiplying Jacobians by gradients. To denote the gradient of a value z with respect to a tensor X, we write ∇_X z.

(Goodfellow 2017)
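A sketch of equation 6.46 (the functions and values are chosen by me for illustration): the gradient with respect to x is the transposed Jacobian of g times the gradient with respect to y, and back-propagation through the composite function computes exactly that product.

```python
import jax
import jax.numpy as jnp

def g(x):                              # g : R^3 -> R^2
    return jnp.array([x[0] * x[1], jnp.sin(x[2])])

def f(y):                              # f : R^2 -> R
    return y[0] ** 2 + 3. * y[1]

x = jnp.array([1.0, 2.0, 0.5])
y = g(x)

# Equation 6.46: grad_x z = (dy/dx)^T grad_y z, a single Jacobian-gradient product.
J        = jax.jacobian(g)(x)          # 2 x 3 Jacobian of g at x
grad_y_z = jax.grad(f)(y)              # gradient of z = f(y) with respect to y
grad_x_z = J.T @ grad_y_z

print(grad_x_z)
print(jax.grad(lambda x: f(g(x)))(x))  # back-propagation gives the same vector
```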
Simple Back-Prop Example

[Figure 6.2 network with annotations: forward prop computes the activations (x → h → y) and then the loss; back-prop then computes the derivatives, flowing in the reverse direction.]

(Goodfellow 2017)
Computation Graphs

[Figure 6.8: examples of computational graphs. (a) The graph using the × operation to compute z = x y. (b) The graph for the logistic regression prediction ŷ = σ(x⊤w + b): a dot node combines x and w, a + node adds b, and a sigmoid node produces ŷ. (c) A ReLU layer, H = max{0, XW + b}, built from matmul, + and relu nodes. (d) Linear regression with weight decay: the weights w feed both the prediction (dot) and the penalty (sqr, sum) branches of the graph.]

(Goodfellow 2017)
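One concrete way to look at a graph like figure 6.8(b): tracing the logistic regression prediction with jax.make_jaxpr prints the graph of primitive operations (a dot product, an addition, a sigmoid) that the expression builds. The shapes and values below are assumptions for the example.

```python
import jax
import jax.numpy as jnp

def logistic_regression(x, w, b):
    """Figure 6.8(b): yhat = sigma(x^T w + b)."""
    return jax.nn.sigmoid(jnp.dot(x, w) + b)

x = jnp.ones(3)
w = jnp.ones(3)
b = 0.5
print(jax.make_jaxpr(logistic_regression)(x, w, b))   # prints the traced operation graph
```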

Repeated Subexpressions

[Figure 6.9: a computational graph that results in repeated subexpressions when computing the gradient. Let w ∈ ℝ be the input to the graph. We use the same function f: ℝ → ℝ at every step of a chain: x = f(w), y = f(x), z = f(y).]

To compute ∂z/∂w, we apply equation 6.44 and obtain

∂z/∂w                                   (6.50)
= (∂z/∂y)(∂y/∂x)(∂x/∂w)                 (6.51)
= f'(y) f'(x) f'(w)                     (6.52)
= f'(f(f(w))) f'(f(w)) f'(w).           (6.53)

Equation 6.52 suggests an implementation in which we compute the value of f(w) only once and store it in the variable x. This is the approach taken by the back-propagation algorithm. An alternative approach is suggested by equation 6.53, where the subexpression f(w) is recomputed each time it is needed. When the memory required to store the values of these expressions is low, the back-propagation approach of equation 6.52 is clearly preferable. Back-prop avoids computing this subexpression twice.

(Goodfellow 2017)
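A sketch of the chain in figure 6.9 with the concrete choice f(w) = w² (mine, for illustration): the back-propagation strategy stores x and y on the forward pass and reuses them (equation 6.52), while equation 6.53 written out literally recomputes f(w) and f(f(w)) inside the derivative.

```python
f  = lambda w: w * w        # the function applied at every step of the chain
df = lambda w: 2. * w       # its derivative

w = 3.0

# Back-propagation style: the forward pass stores the intermediate values once...
x = f(w)
y = f(x)
# ...and the backward pass reuses them (equation 6.52): dz/dw = f'(y) f'(x) f'(w).
dz_dw_stored = df(y) * df(x) * df(w)

# Equation 6.53 written out directly: f(w) and f(f(w)) are recomputed inside the derivative.
dz_dw_recomputed = df(f(f(w))) * df(f(w)) * df(w)

print(dz_dw_stored, dz_dw_recomputed)   # same value; the first needs fewer evaluations of f
```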
Symbol-to-Symbol Differentiation

[Figure 6.10: on the left, a graph computing z from w through the chain w → x → y → z, applying f at each step. On the right, symbol-to-symbol differentiation has appended extra nodes to the same graph: f' nodes and × nodes that build dz/dy, then dz/dx, then dz/dw, so the derivatives are themselves symbolic expressions in the graph and can be evaluated or differentiated further.]

(Goodfellow 2017)
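A sketch of the symbol-to-symbol idea (the function is mine): jax.grad builds a new symbolic graph for the derivative, so that graph can be printed, evaluated at any input, or differentiated again to get higher derivatives.

```python
import jax
import jax.numpy as jnp

f = lambda w: jnp.sin(w) ** 3          # z as a symbolic expression of w

dz_dw   = jax.grad(f)                  # the derivative is another graph, i.e. another function of w
d2z_dw2 = jax.grad(dz_dw)              # so it can be differentiated again

w = 1.0
print(f(w), dz_dw(w), d2z_dw2(w))
print(jax.make_jaxpr(dz_dw)(w))        # the derivative really is just more graph
```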
Neural Network Loss Function

[Figure 6.11: the computational graph used to compute the cost used to train our example of a single-layer MLP with cross-entropy loss and weight decay. X and W(1) feed a matmul to give U(1), then relu gives H; H and W(2) feed a second matmul to give U(2), and cross_entropy(U(2), y) gives J_MLE. Each weight matrix also passes through sqr and sum nodes, and the resulting weight decay term u(8) is added to J_MLE to form the total cost J.]

(Goodfellow 2017)
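A sketch of the cost graph in figure 6.11 as code (the shapes, the softmax form of the cross-entropy, and the weight-decay coefficient are my assumptions): cross-entropy of a single-hidden-layer ReLU network plus a squared penalty on both weight matrices.

```python
import jax
import jax.numpy as jnp
from jax.nn import relu, log_softmax

def cost(params, X, labels, lam=1e-4):
    W1, W2 = params
    H      = relu(X @ W1)                          # U(1) = X W(1), then relu -> H
    logits = H @ W2                                # U(2) = H W(2)
    j_mle  = -jnp.mean(log_softmax(logits)[jnp.arange(len(labels)), labels])   # cross_entropy(U(2), y)
    decay  = lam * (jnp.sum(W1 ** 2) + jnp.sum(W2 ** 2))                       # sqr/sum nodes on the weights
    return j_mle + decay                           # J = J_MLE + weight decay

# Tiny random example.
k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
W1 = jax.random.normal(k1, (5, 4))
W2 = jax.random.normal(k2, (4, 3))
X  = jax.random.normal(k3, (2, 5))
labels = jnp.array([0, 2])
print(cost((W1, W2), X, labels))
```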
Hessian-vector Products

Krylov methods are a set of iterative techniques for performing various operations, like approximately inverting a matrix or finding approximations to its eigenvectors or eigenvalues, without using any operation other than matrix-vector products. To use Krylov methods on the Hessian, we only need to be able to compute the product between the Hessian matrix H and an arbitrary vector v. A straightforward technique (Christianson, 1992) for doing so is to compute

Hv = ∇_x [ (∇_x f(x))⊤ v ].    (6.59)

Both of the gradient computations in this expression may be computed automatically by the appropriate software library. Note that the outer gradient expression is a function of the inner gradient expression.

If v is itself a vector produced by a computational graph, it is important to specify that the automatic differentiation software should not differentiate through the graph that produced v.

(Goodfellow 2017)
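Equation 6.59 translates almost directly into code; here is a minimal JAX sketch (the test function is mine), checked against the explicit Hessian. Because v enters only as a constant, automatic differentiation does not differentiate through anything that produced it, which is the caveat the slide ends with.

```python
import jax
import jax.numpy as jnp

def hvp(f, x, v):
    """Equation 6.59: Hv = grad_x[ (grad_x f(x))^T v ], with v held constant."""
    return jax.grad(lambda x: jnp.vdot(jax.grad(f)(x), v))(x)

f = lambda x: jnp.sum(x ** 3) + x[0] * x[1]   # a small test function
x = jnp.array([1.0, 2.0, 3.0])
v = jnp.array([1.0, 0.0, -1.0])

print(hvp(f, x, v))                # Hessian-vector product without forming H
print(jax.hessian(f)(x) @ v)       # same result, but builds the full 3 x 3 Hessian
```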
Questions

(Goodfellow 2017)
