Deep Feedforward Networks
Lecture slides for Chapter 6 of Deep Learning
www.deeplearningbook.org
Ian Goodfellow
Last updated 2016-10-04
Roadmap
• Example: Learning XOR
• Gradient-Based Learning
• Hidden Units
• Architecture Design
• Back-Propagation
XOR is not linearly separable
[Figure 6.1, left panel: the original x space, with axes x1 and x2; no single linear decision boundary can separate the XOR outputs.]
[Figure 6.3: the rectified linear activation function g(z) = max{0, z}.]
Network Diagrams
[Figure 6.2: an example of a feedforward network, drawn in two different styles: one node per unit (x1, x2, h1, h2, y) and one node per layer (x, h, y).]
Solving XOR

The activation function g is typically chosen to be a function applied element-wise, with h_i = g(x⊤W_{:,i} + c_i). In modern neural networks, the default recommendation is to use the rectified linear unit, or ReLU (Jarrett et al., 2009; Nair and Hinton, 2010; Glorot et al., 2011a), defined by the activation function g(z) = max{0, z}, depicted in figure 6.3.

We can now specify our complete network as

    f(x; W, c, w, b) = w⊤ max{0, W⊤x + c} + b.    (6.3)

We can then specify a solution to the XOR problem. Let

    W = [ 1  1 ]
        [ 1  1 ],    (6.4)

    c = [  0 ]
        [ −1 ],    (6.5)

    w = [  1 ]
        [ −2 ],    (6.6)

and b = 0. We can now walk through how the model processes a batch of inputs. Let X …
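As a quick check of equations 6.3–6.6, a minimal NumPy sketch (not part of the original slides) evaluates the forward pass on all four XOR inputs:

import numpy as np

# The four XOR inputs, one example per row (the design matrix X).
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]])

# Parameters from equations 6.4-6.6, with b = 0.
W = np.array([[1, 1],
              [1, 1]])
c = np.array([0, -1])
w = np.array([1, -2])
b = 0

# f(x; W, c, w, b) = w^T max{0, W^T x + c} + b, applied to every row of X.
H = np.maximum(0, X @ W + c)   # hidden layer: rectified affine transformation
y = H @ w + b                  # output layer: linear

print(y)   # prints [0 1 1 0]: the XOR of each input pair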
Solving XOR
[Figure 6.1: Solving the XOR problem by learning a representation. Left: the original x space (axes x1, x2). Right: the learned h space (axes h1, h2), in which the points become linearly separable.]
Roadmap
• Example: Learning XOR
• Gradient-Based Learning
• Hidden Units
• Architecture Design
• Back-Propagation
Gradient-Based Learning
• Specify
  • Model
  • Cost
Conditional Distributions and Cross-Entropy

Learning conditional distributions with maximum likelihood: most modern neural networks are trained using maximum likelihood. This means that the cost function is simply the negative log-likelihood, equivalently described as the cross-entropy between the training data and the model distribution. This cost function is given by

    J(θ) = −E_{x,y∼p̂_data} log p_model(y | x).    (6.12)
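To make the cost concrete, here is a small NumPy sketch (mine, with made-up probabilities and labels) that estimates the negative log-likelihood by averaging over a few examples:

import numpy as np

# Hypothetical model outputs: predicted class probabilities, one row per example.
p_model = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.3, 0.3, 0.4]])
y = np.array([0, 1, 2])   # observed class labels

# J(theta) = -E_{x,y ~ p_data} log p_model(y | x),
# estimated by averaging over the training examples.
nll = -np.mean(np.log(p_model[np.arange(len(y)), y]))
print(nll)   # cross-entropy between the empirical distribution and the model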
[Figure 6.4: samples drawn from a neural network with a mixture density output layer.]
Don’t mix and match
Sigmoid output with target of 1
[Plot: cross-entropy loss and MSE loss as functions of z, for a sigmoid output σ(z) with a target of 1. The MSE loss flattens out where the sigmoid saturates (large negative z), while the cross-entropy loss keeps a strong slope there.]
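The message of the plot can be reproduced numerically; the sketch below (my own, not from the slides) compares the gradient of each loss with respect to z when the target is 1:

import numpy as np

z = np.linspace(-3, 3, 7)
sigma = 1 / (1 + np.exp(-z))          # sigmoid output

# Cross-entropy loss with target 1: -log(sigma(z)); gradient w.r.t. z is sigma(z) - 1.
grad_ce = sigma - 1

# MSE loss with target 1: (sigma(z) - 1)^2; gradient is 2 (sigma(z) - 1) sigma'(z).
grad_mse = 2 * (sigma - 1) * sigma * (1 - sigma)

# For very negative z (a confidently wrong prediction), the cross-entropy gradient
# stays near -1 while the MSE gradient vanishes, so MSE learns much more slowly.
print(np.round(grad_ce, 3))
print(np.round(grad_mse, 3))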
Roadmap
• Example: Learning XOR
• Gradient-Based Learning
• Hidden Units
• Architecture Design
• Back-Propagation
Hidden units
• Use ReLUs, 90% of the time
Roadmap
• Example: Learning XOR
• Gradient-Based Learning
• Hidden Units
• Architecture Design
• Back-Propagation
Architecture Basics
[Figure 6.2, annotated with depth and width: an example of a feedforward network, drawn in two different styles. This is the feedforward network we use to solve the XOR example; it has a single hidden layer containing two units. Depth is the number of layers; width is the number of units per layer.]
Universal Approximator
Theorem
• One hidden layer is enough to represent (not learn)
an approximation of any function to an arbitrary
degree of accuracy
• So why deeper?
Exponential Representation

A network with absolute value rectification creates mirror images of the function computed on top of some hidden unit, with respect to the input of that hidden unit. Each hidden unit specifies where to fold the input space in order to create mirror responses (on both sides of the absolute value nonlinearity). By composing these folding operations, we obtain an exponentially large number of piecewise linear regions, which can capture all kinds of regular (e.g., repeating) patterns.
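A toy 1-D illustration of the folding argument (my own sketch, not from the slides): composing an absolute-value "fold" with itself k times yields about 2^k linear pieces, which we can count numerically:

import numpy as np

def fold(x):
    # One absolute-value "fold" of the interval [0, 1] onto itself (a tent map).
    return 1 - np.abs(2 * x - 1)

x = np.linspace(0, 1, 200001)
y = x
for depth in range(1, 6):
    y = fold(y)                       # compose the fold once more
    slopes = np.diff(y) / np.diff(x)
    # Count linear regions by counting slope changes; at a true fold point
    # the slope jumps by at least 4 in magnitude, so a threshold of 1 is safe.
    regions = 1 + np.count_nonzero(np.abs(np.diff(slopes)) > 1.0)
    print(depth, regions)             # 2, 4, 8, 16, 32: exponential in depth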
Advantage of Depth

[Figure 6.6: test accuracy (percent) versus number of hidden layers (3 to 11). Empirical results showing that deeper networks generalize better when used to transcribe multi-digit numbers from photographs of addresses. Data from Goodfellow et al. (2014d). The test set accuracy consistently increases with increasing depth.]
Large, Shallow Models Overfit More
[Figure 6.7: test accuracy (percent) versus number of parameters (×10^8), for models of depth 3 (convolutional), depth 3 (fully connected), and depth 11 (convolutional). Deeper models tend to perform better. This is not merely because the model is larger: this experiment from Goodfellow et al. (2014d) shows that increasing the number of parameters in layers of convolutional networks without increasing their depth is not nearly as effective at increasing test set performance. The legend indicates the depth of each model.]
Roadmap
• Example: Learning XOR
• Gradient-Based Learning
• Hidden Units
• Architecture Design
• Back-Propagation
Back-Propagation

• Back-propagation is "just the chain rule" of calculus

The chain rule states that

    dz/dx = (dz/dy)(dy/dx).    (6.44)

We can generalize this beyond the scalar case. Suppose that x ∈ R^m, y ∈ R^n, g maps from R^m to R^n, and f maps from R^n to R. If y = g(x) and z = f(y), then

    ∂z/∂x_i = Σ_j (∂z/∂y_j)(∂y_j/∂x_i).    (6.45)

In vector notation, this may be equivalently written as

    ∇_x z = (∂y/∂x)⊤ ∇_y z,    (6.46)

where ∂y/∂x is the n × m Jacobian matrix of g.

[Figure 6.2, annotated: forward prop computes the activations h, the output y, and the loss; back-prop computes derivatives with respect to the activations and the weights w, flowing in the reverse direction.]
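A minimal NumPy sketch of equation 6.45 (mine; the choice of g and f is arbitrary), with a finite-difference check of one gradient component:

import numpy as np

# z = f(g(x)) with g: R^m -> R^n affine, f: R^n -> R.
rng = np.random.default_rng(0)
m, n = 3, 4
W = rng.normal(size=(n, m))
x = rng.normal(size=m)

y = W @ x                    # y = g(x)
z = np.sum(np.tanh(y))       # z = f(y)

# Chain rule (equation 6.45): dz/dx_i = sum_j (dz/dy_j)(dy_j/dx_i),
# i.e. grad_x z = (dy/dx)^T grad_y z, where dy/dx is the Jacobian of g (here W).
grad_y = 1 - np.tanh(y) ** 2       # dz/dy_j for f(y) = sum_j tanh(y_j)
grad_x = W.T @ grad_y

# Finite-difference check of one component.
eps = 1e-6
e0 = np.zeros(m); e0[0] = eps
numeric = (np.sum(np.tanh(W @ (x + e0))) - z) / eps
print(grad_x[0], numeric)    # the two values should closely agree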
Computation Graphs

[Figure 6.8: examples of computational graphs. (a) Multiplication: the graph using the × operation to compute z = xy. (b) Logistic regression: the graph for the prediction ŷ = σ(x⊤w + b), built from dot, +, and σ nodes. (c) The graph for the expression H = max{0, XW + b}, computing rectified linear unit activations for a minibatch of inputs X using matmul and + nodes. (d) A graph that applies more than one operation to the weights w: they feed both a linear prediction (via dot and +) and a weight decay penalty (via sqr).]
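A minimal sketch (mine) of graphs (b) and (d) written as straight-line code, one graph node per statement, so the correspondence with figure 6.8 is visible; the value of the weight decay coefficient is an assumption:

import numpy as np

x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
b = 0.1

# Graph (b): logistic regression prediction, one node per line.
u1 = np.dot(x, w)                 # dot node
u2 = u1 + b                       # + node
y_hat = 1 / (1 + np.exp(-u2))     # sigmoid node: y_hat = sigma(x^T w + b)

# Graph (d): the same weights w feed both a prediction and a weight decay penalty.
lam = 0.01                        # weight decay coefficient (assumed value)
prediction = np.dot(x, w) + b     # dot and + nodes
penalty = lam * np.sum(w ** 2)    # sqr and sum nodes

print(y_hat, prediction, penalty)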
Repeated Subexpressions

• Back-prop avoids computing this twice

[Figure 6.9: a computational graph that results in repeated subexpressions when computing the gradient: a chain w → x → y → z in which the same operation f is applied at every step.]

Let w ∈ R be the input to the graph. We use the same function f : R → R as the operation that we apply at every step of a chain: x = f(w), y = f(x), z = f(y). To compute ∂z/∂w, we apply equation 6.44 and obtain:

    ∂z/∂w                               (6.50)
      = (∂z/∂y)(∂y/∂x)(∂x/∂w)           (6.51)
      = f′(y) f′(x) f′(w)               (6.52)
      = f′(f(f(w))) f′(f(w)) f′(w).     (6.53)

Equation 6.52 suggests an implementation in which we compute the value of f(w) only once and store it in the variable x. This is the approach taken by the back-propagation algorithm. An alternative approach is suggested by equation 6.53, where the subexpression f(w) appears more than once. In the alternative approach, f(w) is recomputed each time it is needed. When the memory required to store the value of these expressions is low, the back-propagation approach of equation 6.52 is clearly preferable because of its reduced runtime.
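A small sketch (mine) makes the trade-off concrete by counting calls to f while computing ∂z/∂w both ways, here with f = exp so that f′ = f:

import numpy as np

calls = 0
def f(w):
    global calls
    calls += 1
    return np.exp(w)

def df(w):
    return np.exp(w)   # derivative of exp is exp again

w = 0.3

# Back-propagation style (equation 6.52): compute and store x = f(w), y = f(x) once.
calls = 0
x = f(w); y = f(x)
grad_stored = df(y) * df(x) * df(w)
print("stored:", calls, "calls to f")        # 2 calls

# Recomputation style (equation 6.53): f(w) and f(f(w)) are evaluated from scratch.
calls = 0
grad_recomputed = df(f(f(w))) * df(f(w)) * df(w)
print("recomputed:", calls, "calls to f")    # 3 calls, and the gap grows with chain length

print(np.isclose(grad_stored, grad_recomputed))  # same gradient value either way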
Symbol-to-Symbol Differentiation

[Figure 6.10: the symbol-to-symbol approach to computing derivatives. Left: the original graph computing z from w through the chain w → x → y → z, applying f at each step. Right: back-propagation adds nodes computing f′(w), f′(x), f′(y) and the products dz/dy, dz/dx = (dy/dx)(dz/dy), and dz/dw = (dx/dw)(dz/dx), so the derivatives are themselves described by a computational graph.]
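A sketch of the symbol-to-symbol idea using SymPy as a stand-in graph builder (my choice of tool, not the book's): the derivative comes back as a symbolic expression that can be evaluated later for any numeric w:

import sympy as sp

w = sp.Symbol('w')
f = sp.tanh                     # the function applied at every step of the chain

# Build the original graph symbolically: x = f(w), y = f(x), z = f(y).
x = f(w); y = f(x); z = f(y)

# Symbol-to-symbol differentiation: dz/dw is a new symbolic expression (a graph),
# not a number; no numeric values are needed to construct it.
dz_dw = sp.diff(z, w)
print(dz_dw)                    # a product of three factors, mirroring equation 6.53

# The derivative graph can be evaluated later for any particular input.
print(float(dz_dw.subs(w, 0.5)))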
Neural Network Loss Function

[Figure 6.11: the computational graph used to compute the cost used to train our example of a single-layer MLP. matmul nodes involving the inputs X, the weights W(1), and a second weight matrix produce U(1) and then U(2); U(2) enters a cross_entropy node with the targets y, giving J_MLE. sqr and sum nodes applied to W(1) produce a weight decay term, and a + node combines the two into the total cost J.]
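A minimal NumPy sketch (mine; the data, the weight decay coefficient, and the use of a sigmoid cross-entropy are assumptions) of the same cost structure, a maximum-likelihood term plus a weight decay term summed into J:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))          # a small batch of inputs (assumed data)
y = rng.integers(0, 2, size=8)       # binary targets
W1 = rng.normal(size=(3, 4))
w2 = rng.normal(size=4)
lam = 0.01                           # weight decay coefficient (assumed)

H = np.maximum(0, X @ W1)            # relu(matmul): hidden layer
logits = H @ w2                      # matmul: pre-activation of the output
p = 1 / (1 + np.exp(-logits))        # sigmoid output probabilities

J_MLE = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))   # cross-entropy term
decay = lam * np.sum(W1 ** 2)                               # sqr + sum on W1
J = J_MLE + decay                                           # total cost
print(J)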
Hessian-vector Products

Krylov methods are a set of iterative techniques for performing operations like approximately inverting a matrix or finding approximations to its eigenvectors or eigenvalues, without using any operation other than matrix-vector products. In order to use Krylov methods on the Hessian, we only need to be able to compute the product between the Hessian matrix H and an arbitrary vector v. A straightforward technique (Christianson, 1992) for doing so is to compute

    Hv = ∇_x [ (∇_x f(x))⊤ v ].    (6.59)
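A numerical sketch of equation 6.59 (mine): for a quadratic f whose gradient is easy to write down, differentiating the scalar (∇_x f(x))⊤ v by finite differences reproduces Hv:

import numpy as np

# f(x) = 0.5 x^T A x with a fixed symmetric A, so grad f(x) = A x and H = A.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4)); A = 0.5 * (A + A.T)

def grad_f(x):
    return A @ x

x = rng.normal(size=4)
v = rng.normal(size=4)

# Equation 6.59: Hv = grad_x [ (grad_x f(x))^T v ], here estimated by finite
# differences of the scalar function s(x) = grad_f(x)^T v.
def s(x):
    return grad_f(x) @ v

eps = 1e-6
Hv = np.array([(s(x + eps * e) - s(x - eps * e)) / (2 * eps)
               for e in np.eye(4)])

print(np.allclose(Hv, A @ v, atol=1e-4))   # matches the explicit Hessian-vector product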