Deep Feedforward Networks
Lecture slides for Chapter 6 of Deep Learning
www.deeplearningbook.org
Ian Goodfellow
Last updated 2016-10-04
Roadmap
• Example: Learning XOR
• Gradient-Based Learning
• Hidden Units
• Architecture Design
• Back-Propagation
XOR is not linearly separable
[Figure 6.1, left panel: the original x space, with axes x1 and x2; no single linear decision boundary can separate the XOR outputs.]
[Figure 6.3: the rectified linear activation function g(z) = max{0, z}.]
Network Diagrams
[Figure 6.2: an example of a feedforward network, drawn in two different styles: one node per unit (x1, x2, h1, h2, y) and one node per layer (x, h, y).]
Solving XOR

The activation function g is typically chosen to be a function applied element-wise, with h_i = g(x⊤W_{:,i} + c_i). In modern neural networks, the default recommendation is to use the rectified linear unit, or ReLU (Jarrett et al., 2009; Nair and Hinton, 2010; Glorot et al., 2011a), defined by the activation function g(z) = max{0, z}, depicted in figure 6.3.

We can now specify our complete network as

    f(x; W, c, w, b) = w⊤ max{0, W⊤x + c} + b.    (6.3)

We can then specify a solution to the XOR problem. Let

    W = [ 1  1 ]
        [ 1  1 ],    (6.4)

    c = [  0 ]
        [ −1 ],    (6.5)

    w = [  1 ]
        [ −2 ],    (6.6)

and b = 0. We can now walk through how the model processes a batch of inputs. Let X …
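As a quick check of equations 6.3–6.6, a minimal NumPy sketch (not part of the original slides) evaluates the forward pass on all four XOR inputs:

import numpy as np

# The four XOR inputs, one example per row (the design matrix X).
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]])

# Parameters from equations 6.4-6.6, with b = 0.
W = np.array([[1, 1],
              [1, 1]])
c = np.array([0, -1])
w = np.array([1, -2])
b = 0

# f(x; W, c, w, b) = w^T max{0, W^T x + c} + b, applied to every row of X.
H = np.maximum(0, X @ W + c)   # hidden layer: rectified affine transformation
y = H @ w + b                  # output layer: linear

print(y)   # prints [0 1 1 0]: the XOR of each input pair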
Solving XOR
[Figure 6.1: Solving the XOR problem by learning a representation. Left: the original x space (axes x1, x2). Right: the learned h space (axes h1, h2), in which the points become linearly separable.]
Roadmap
• Example: Learning XOR
• Gradient-Based Learning
• Hidden Units
• Architecture Design
• Back-Propagation
Gradient-Based Learning
• Specify
  • Model
  • Cost
Conditional Distributions and Cross-Entropy

Learning conditional distributions with maximum likelihood: most modern neural networks are trained using maximum likelihood. This means that the cost function is simply the negative log-likelihood, equivalently described as the cross-entropy between the training data and the model distribution. This cost function is given by

    J(θ) = −E_{x,y∼p̂_data} log p_model(y | x).    (6.12)
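To make the cost concrete, here is a small NumPy sketch (mine, with made-up probabilities and labels) that estimates the negative log-likelihood by averaging over a few examples:

import numpy as np

# Hypothetical model outputs: predicted class probabilities, one row per example.
p_model = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.3, 0.3, 0.4]])
y = np.array([0, 1, 2])   # observed class labels

# J(theta) = -E_{x,y ~ p_data} log p_model(y | x),
# estimated by averaging over the training examples.
nll = -np.mean(np.log(p_model[np.arange(len(y)), y]))
print(nll)   # cross-entropy between the empirical distribution and the model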
[Figure 6.4: samples drawn from a neural network with a mixture density output layer.]
Don’t mix and match
Sigmoid output with target of 1
[Plot: cross-entropy loss and MSE loss as functions of z, for a sigmoid output σ(z) with a target of 1. The MSE loss flattens out where the sigmoid saturates (large negative z), while the cross-entropy loss keeps a strong slope there.]
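The message of the plot can be reproduced numerically; the sketch below (my own, not from the slides) compares the gradient of each loss with respect to z when the target is 1:

import numpy as np

z = np.linspace(-3, 3, 7)
sigma = 1 / (1 + np.exp(-z))          # sigmoid output

# Cross-entropy loss with target 1: -log(sigma(z)); gradient w.r.t. z is sigma(z) - 1.
grad_ce = sigma - 1

# MSE loss with target 1: (sigma(z) - 1)^2; gradient is 2 (sigma(z) - 1) sigma'(z).
grad_mse = 2 * (sigma - 1) * sigma * (1 - sigma)

# For very negative z (a confidently wrong prediction), the cross-entropy gradient
# stays near -1 while the MSE gradient vanishes, so MSE learns much more slowly.
print(np.round(grad_ce, 3))
print(np.round(grad_mse, 3))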
Roadmap
• Example: Learning XOR
• Gradient-Based Learning
• Hidden Units
• Architecture Design
• Back-Propagation
Hidden units
• Use ReLUs, 90% of the time
Roadmap
• Example: Learning XOR
• Gradient-Based Learning
• Hidden Units
• Architecture Design
• Back-Propagation
Architecture Basics
[Figure 6.2, annotated with depth and width: an example of a feedforward network, drawn in two different styles. This is the feedforward network we use to solve the XOR example; it has a single hidden layer containing two units. Depth is the number of layers; width is the number of units per layer.]
Universal Approximator
Theorem
• One hidden layer is enough to represent (not learn)
an approximation of any function to an arbitrary
degree of accuracy
• So why deeper?
Exponential Representation

A network with absolute value rectification creates mirror images of the function computed on top of some hidden unit, with respect to the input of that hidden unit. Each hidden unit specifies where to fold the input space in order to create mirror responses (on both sides of the absolute value nonlinearity). By composing these folding operations, we obtain an exponentially large number of piecewise linear regions, which can capture all kinds of regular (e.g., repeating) patterns.
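A toy 1-D illustration of the folding argument (my own sketch, not from the slides): composing an absolute-value "fold" with itself k times yields about 2^k linear pieces, which we can count numerically:

import numpy as np

def fold(x):
    # One absolute-value "fold" of the interval [0, 1] onto itself (a tent map).
    return 1 - np.abs(2 * x - 1)

x = np.linspace(0, 1, 200001)
y = x
for depth in range(1, 6):
    y = fold(y)                       # compose the fold once more
    slopes = np.diff(y) / np.diff(x)
    # Count linear regions by counting slope changes; at a true fold point
    # the slope jumps by at least 4 in magnitude, so a threshold of 1 is safe.
    regions = 1 + np.count_nonzero(np.abs(np.diff(slopes)) > 1.0)
    print(depth, regions)             # 2, 4, 8, 16, 32: exponential in depth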
Advantage of Depth

[Figure 6.6: test accuracy (percent) versus number of hidden layers (3 to 11). Empirical results showing that deeper networks generalize better when used to transcribe multi-digit numbers from photographs of addresses. Data from Goodfellow et al. (2014d). The test set accuracy consistently increases with increasing depth.]
Large, Shallow Models Overfit More
[Figure 6.7: test accuracy (percent) versus number of parameters (×10^8), for models of depth 3 (convolutional), depth 3 (fully connected), and depth 11 (convolutional). Deeper models tend to perform better. This is not merely because the model is larger: this experiment from Goodfellow et al. (2014d) shows that increasing the number of parameters in layers of convolutional networks without increasing their depth is not nearly as effective at increasing test set performance. The legend indicates the depth of each model.]
Roadmap
• Example: Learning XOR
• Gradient-Based Learning
• Hidden Units
• Architecture Design
• Back-Propagation
Back-Propagation

• Back-propagation is "just the chain rule" of calculus

The chain rule states that

    dz/dx = (dz/dy)(dy/dx).    (6.44)

We can generalize this beyond the scalar case. Suppose that x ∈ R^m, y ∈ R^n, g maps from R^m to R^n, and f maps from R^n to R. If y = g(x) and z = f(y), then

    ∂z/∂x_i = Σ_j (∂z/∂y_j)(∂y_j/∂x_i).    (6.45)

In vector notation, this may be equivalently written as

    ∇_x z = (∂y/∂x)⊤ ∇_y z,    (6.46)

where ∂y/∂x is the n × m Jacobian matrix of g.

[Figure 6.2, annotated: forward prop computes the activations h, the output y, and the loss; back-prop computes derivatives with respect to the activations and the weights w, flowing in the reverse direction.]
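A minimal NumPy sketch of equation 6.45 (mine; the choice of g and f is arbitrary), with a finite-difference check of one gradient component:

import numpy as np

# z = f(g(x)) with g: R^m -> R^n affine, f: R^n -> R.
rng = np.random.default_rng(0)
m, n = 3, 4
W = rng.normal(size=(n, m))
x = rng.normal(size=m)

y = W @ x                    # y = g(x)
z = np.sum(np.tanh(y))       # z = f(y)

# Chain rule (equation 6.45): dz/dx_i = sum_j (dz/dy_j)(dy_j/dx_i),
# i.e. grad_x z = (dy/dx)^T grad_y z, where dy/dx is the Jacobian of g (here W).
grad_y = 1 - np.tanh(y) ** 2       # dz/dy_j for f(y) = sum_j tanh(y_j)
grad_x = W.T @ grad_y

# Finite-difference check of one component.
eps = 1e-6
e0 = np.zeros(m); e0[0] = eps
numeric = (np.sum(np.tanh(W @ (x + e0))) - z) / eps
print(grad_x[0], numeric)    # the two values should closely agree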
Computation Graphs

[Figure 6.8: examples of computational graphs. (a) Multiplication: the graph using the × operation to compute z = xy. (b) Logistic regression: the graph for the prediction ŷ = σ(x⊤w + b), built from dot, +, and σ nodes. (c) The graph for the expression H = max{0, XW + b}, computing rectified linear unit activations for a minibatch of inputs X using matmul and + nodes. (d) A graph that applies more than one operation to the weights w: they feed both a linear prediction (via dot and +) and a weight decay penalty (via sqr).]
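A minimal sketch (mine) of graphs (b) and (d) written as straight-line code, one graph node per statement, so the correspondence with figure 6.8 is visible; the value of the weight decay coefficient is an assumption:

import numpy as np

x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
b = 0.1

# Graph (b): logistic regression prediction, one node per line.
u1 = np.dot(x, w)                 # dot node
u2 = u1 + b                       # + node
y_hat = 1 / (1 + np.exp(-u2))     # sigmoid node: y_hat = sigma(x^T w + b)

# Graph (d): the same weights w feed both a prediction and a weight decay penalty.
lam = 0.01                        # weight decay coefficient (assumed value)
prediction = np.dot(x, w) + b     # dot and + nodes
penalty = lam * np.sum(w ** 2)    # sqr and sum nodes

print(y_hat, prediction, penalty)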
Repeated Subexpressions

• Back-prop avoids computing this twice

[Figure 6.9: a computational graph that results in repeated subexpressions when computing the gradient: a chain w → x → y → z in which the same operation f is applied at every step.]

Let w ∈ R be the input to the graph. We use the same function f : R → R as the operation that we apply at every step of a chain: x = f(w), y = f(x), z = f(y). To compute ∂z/∂w, we apply equation 6.44 and obtain:

    ∂z/∂w                               (6.50)
      = (∂z/∂y)(∂y/∂x)(∂x/∂w)           (6.51)
      = f′(y) f′(x) f′(w)               (6.52)
      = f′(f(f(w))) f′(f(w)) f′(w).     (6.53)

Equation 6.52 suggests an implementation in which we compute the value of f(w) only once and store it in the variable x. This is the approach taken by the back-propagation algorithm. An alternative approach is suggested by equation 6.53, where the subexpression f(w) appears more than once. In the alternative approach, f(w) is recomputed each time it is needed. When the memory required to store the value of these expressions is low, the back-propagation approach of equation 6.52 is clearly preferable because of its reduced runtime.
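A small sketch (mine) makes the trade-off concrete by counting calls to f while computing ∂z/∂w both ways, here with f = exp so that f′ = f:

import numpy as np

calls = 0
def f(w):
    global calls
    calls += 1
    return np.exp(w)

def df(w):
    return np.exp(w)   # derivative of exp is exp again

w = 0.3

# Back-propagation style (equation 6.52): compute and store x = f(w), y = f(x) once.
calls = 0
x = f(w); y = f(x)
grad_stored = df(y) * df(x) * df(w)
print("stored:", calls, "calls to f")        # 2 calls

# Recomputation style (equation 6.53): f(w) and f(f(w)) are evaluated from scratch.
calls = 0
grad_recomputed = df(f(f(w))) * df(f(w)) * df(w)
print("recomputed:", calls, "calls to f")    # 3 calls, and the gap grows with chain length

print(np.isclose(grad_stored, grad_recomputed))  # same gradient value either way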
Symbol-to-Symbol Differentiation

[Figure 6.10: the symbol-to-symbol approach to computing derivatives. Left: the original graph computing z from w through the chain w → x → y → z, applying f at each step. Right: back-propagation adds nodes computing f′(w), f′(x), f′(y) and the products dz/dy, dz/dx = (dy/dx)(dz/dy), and dz/dw = (dx/dw)(dz/dx), so the derivatives are themselves described by a computational graph.]
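A sketch of the symbol-to-symbol idea using SymPy as a stand-in graph builder (my choice of tool, not the book's): the derivative comes back as a symbolic expression that can be evaluated later for any numeric w:

import sympy as sp

w = sp.Symbol('w')
f = sp.tanh                     # the function applied at every step of the chain

# Build the original graph symbolically: x = f(w), y = f(x), z = f(y).
x = f(w); y = f(x); z = f(y)

# Symbol-to-symbol differentiation: dz/dw is a new symbolic expression (a graph),
# not a number; no numeric values are needed to construct it.
dz_dw = sp.diff(z, w)
print(dz_dw)                    # a product of three factors, mirroring equation 6.53

# The derivative graph can be evaluated later for any particular input.
print(float(dz_dw.subs(w, 0.5)))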
Neural Network Loss Function

[Figure 6.11: the computational graph used to compute the cost used to train our example of a single-layer MLP. matmul nodes involving the inputs X, the weights W(1), and a second weight matrix produce U(1) and then U(2); U(2) enters a cross_entropy node with the targets y, giving J_MLE. sqr and sum nodes applied to W(1) produce a weight decay term, and a + node combines the two into the total cost J.]
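A minimal NumPy sketch (mine; the data, the weight decay coefficient, and the use of a sigmoid cross-entropy are assumptions) of the same cost structure, a maximum-likelihood term plus a weight decay term summed into J:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))          # a small batch of inputs (assumed data)
y = rng.integers(0, 2, size=8)       # binary targets
W1 = rng.normal(size=(3, 4))
w2 = rng.normal(size=4)
lam = 0.01                           # weight decay coefficient (assumed)

H = np.maximum(0, X @ W1)            # relu(matmul): hidden layer
logits = H @ w2                      # matmul: pre-activation of the output
p = 1 / (1 + np.exp(-logits))        # sigmoid output probabilities

J_MLE = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))   # cross-entropy term
decay = lam * np.sum(W1 ** 2)                               # sqr + sum on W1
J = J_MLE + decay                                           # total cost
print(J)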
Hessian-vector Products

Krylov methods are a set of iterative techniques for performing operations like approximately inverting a matrix or finding approximations to its eigenvectors or eigenvalues, without using any operation other than matrix-vector products. In order to use Krylov methods on the Hessian, we only need to be able to compute the product between the Hessian matrix H and an arbitrary vector v. A straightforward technique (Christianson, 1992) for doing so is to compute

    Hv = ∇_x [ (∇_x f(x))⊤ v ].    (6.59)
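A numerical sketch of equation 6.59 (mine): for a quadratic f whose gradient is easy to write down, differentiating the scalar (∇_x f(x))⊤ v by finite differences reproduces Hv:

import numpy as np

# f(x) = 0.5 x^T A x with a fixed symmetric A, so grad f(x) = A x and H = A.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4)); A = 0.5 * (A + A.T)

def grad_f(x):
    return A @ x

x = rng.normal(size=4)
v = rng.normal(size=4)

# Equation 6.59: Hv = grad_x [ (grad_x f(x))^T v ], here estimated by finite
# differences of the scalar function s(x) = grad_f(x)^T v.
def s(x):
    return grad_f(x) @ v

eps = 1e-6
Hv = np.array([(s(x + eps * e) - s(x - eps * e)) / (2 * eps)
               for e in np.eye(4)])

print(np.allclose(Hv, A @ v, atol=1e-4))   # matches the explicit Hessian-vector product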