
10707

Deep Learning
Russ Salakhutdinov
Machine Learning Department
rsalakhu@cs.cmu.edu
http://www.cs.cmu.edu/~rsalakhu/10707/

Neural Networks I
Neural Networks Online Course
• Disclaimer: Much of the material and slides for this lecture were
borrowed from Hugo Larochelle’s class on Neural Networks:
https://sites.google.com/site/deeplearningsummerschool2016/

• Hugo's class covers many other topics: convolutional networks, neural language models, Boltzmann machines, autoencoders, sparse coding, etc.

• We will use his material for some of the other lectures.
Feedforward Neural Networks

‣ How neural networks predict f(x) given an input x:
 - Forward propagation
 - Types of units
 - Capacity of neural networks

‣ How to train neural nets:
 - Loss function
 - Backpropagation with gradient descent

‣ More recent techniques:
 - Dropout
 - Batch normalization
 - Unsupervised pre-training
Artificial Neuron

• Neuron pre-activation (or input activation):
  a(x) = b + \sum_i w_i x_i = b + w^\top x

• Neuron output activation:
  h(x) = g(a(x)) = g(b + \sum_i w_i x_i)

• where:
 - w are the weights (parameters)
 - b is the bias term
 - g(\cdot) is called the activation function
Artificial Neuron

• Output activation of the neuron:
  h(x) = g(a(x)) = g(b + \sum_i w_i x_i)

• The range of the output is determined by g(\cdot)
• The bias b only changes the position of the riff (from Pascal Vincent's slides)

• For a layer of neurons:
  a(x) = b^{(1)} + W^{(1)} x, \quad a(x)_i = b_i^{(1)} + \sum_j W_{i,j}^{(1)} x_j
Activation Function

• Linear activation function:
  g(a) = a
 Ø No nonlinear transformation
 Ø No input squashing
Activation Function

• Sigmoid activation function:
  g(a) = \mathrm{sigm}(a) = \frac{1}{1 + \exp(-a)}
 Ø Squashes the neuron's output between 0 and 1
 Ø Always positive
 Ø Bounded
 Ø Strictly increasing
Activation Function

• Hyperbolic tangent (''tanh'') activation function:
  g(a) = \tanh(a) = \frac{\exp(a) - \exp(-a)}{\exp(a) + \exp(-a)} = \frac{\exp(2a) - 1}{\exp(2a) + 1}
 Ø Squashes the neuron's activation between -1 and 1
 Ø Can be positive or negative
 Ø Bounded
 Ø Strictly increasing
Activation Function

• Rectified linear (ReLU) activation function:
  g(a) = \mathrm{reclin}(a) = \max(0, a)
 Ø Bounded below by 0 (always non-negative)
 Ø Tends to produce units with sparse activities
 Ø Not upper bounded
 Ø Monotonically increasing (strictly so for a > 0)
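To make the four units concrete, here is a minimal NumPy sketch (not part of the original slides; function names follow the slide notation):

```python
import numpy as np

def linear(a):
    return a                          # g(a) = a: no squashing

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))   # in (0, 1), always positive

def tanh(a):
    return np.tanh(a)                 # in (-1, 1), equals (exp(2a)-1)/(exp(2a)+1)

def reclin(a):
    return np.maximum(0.0, a)         # ReLU: non-negative, sparse (exact zeros)

a = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigm(a), tanh(a), reclin(a), sep="\n")
```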
Decision Boundary of a Neuron

• Binary classification:
 - With a sigmoid output, one can interpret the neuron as estimating p(y = 1|x)
 - Interpret it as a logistic classifier
 - If the activation is greater than 0.5, predict 1
 - Otherwise predict 0

• The decision boundary is linear (from Pascal Vincent's slides)

• The same idea can be applied to a tanh activation
Capacity of a Single Neuron

• Can solve linearly separable problems.
 [figure: OR(x_1, x_2), AND(x_1, x_2) and AND(\bar{x}_1, x_2) are all linearly separable in the (x_1, x_2) plane]

Capacity of a Single Neuron

• Can not solve non-linearly separable problems.
 [figure: XOR(x_1, x_2) has no linear separator in the (x_1, x_2) plane, but becomes separable in the (AND(x_1, \bar{x}_2), AND(\bar{x}_1, x_2)) representation]

• Need to transform the input into a better representation.
• Remember basis functions!

(Figure caption, translated from French: Example of modeling XOR with a one-hidden-layer network. Top, from left to right: illustrations of the boolean functions OR(x_1, x_2), AND(x_1, \bar{x}_2) and AND(\bar{x}_1, x_2). Bottom: an illustration of the function XOR(x_1, x_2) as a function of the values of x_1 and x_2 (left), then as a function of AND(x_1, \bar{x}_2) and AND(\bar{x}_1, x_2) (right).)
Single Hidden Layer Neural Net

• Hidden layer pre-activation:
  a(x) = b^{(1)} + W^{(1)} x, \quad a(x)_i = b_i^{(1)} + \sum_j W_{i,j}^{(1)} x_j

• Hidden layer activation:
  h(x) = g(a(x))

• Output layer activation:
  f(x) = o(b^{(2)} + w^{(2)\top} h^{(1)}(x)),
  where o(\cdot) is the output activation function
Softmax Activation Function

‣ Remember multi-way classification:
 - We need multiple outputs (1 output per class)
 - We need to estimate the conditional probability p(y = c|x)
 - Discriminative learning

• Softmax activation function at the output:
  o(a) = \mathrm{softmax}(a) = \left[ \frac{\exp(a_1)}{\sum_c \exp(a_c)}, \ldots, \frac{\exp(a_C)}{\sum_c \exp(a_c)} \right]^\top
 - strictly positive
 - sums to one

‣ Predict the class with the highest estimated class conditional probability.
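A minimal NumPy sketch of the softmax (the max-subtraction is a standard numerical-stability trick, not from the slides; it changes nothing because softmax is shift-invariant):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))   # shift for numerical stability
    return e / e.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))
print(p)           # strictly positive
print(p.sum())     # sums to one
print(p.argmax())  # predict the class with highest estimated probability
```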
Multilayer Neural Net

• Consider a network with L hidden layers (a runnable sketch of forward propagation follows below).

 - Layer pre-activation for k > 0 (with h^{(0)}(x) = x):
   a^{(k)}(x) = b^{(k)} + W^{(k)} h^{(k-1)}(x)

 - Hidden layer activation, for k from 1 to L:
   h^{(k)}(x) = g(a^{(k)}(x))

 - Output layer activation (k = L+1):
   h^{(L+1)}(x) = o(a^{(L+1)}(x)) = f(x)
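A minimal runnable sketch of this forward propagation, following the slide equations; the tanh nonlinearity, softmax output, and toy layer sizes are illustrative assumptions:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def forward(x, Ws, bs, g=np.tanh, o=softmax):
    h = x                                   # h^(0)(x) = x
    for W, b in zip(Ws[:-1], bs[:-1]):      # hidden layers k = 1..L
        h = g(b + W @ h)                    # h^(k) = g(b^(k) + W^(k) h^(k-1))
    return o(bs[-1] + Ws[-1] @ h)           # output layer k = L+1: f(x)

rng = np.random.default_rng(0)
sizes = [4, 5, 5, 3]                        # input, two hidden layers, 3 classes
Ws = [rng.normal(0, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
print(forward(rng.normal(size=4), Ws, bs))  # class probabilities
```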


Capacity of Neural Nets

• Consider a single hidden layer neural network.
 [figure: a two-input network (input units x_1, x_2; hidden units y_1, y_2; a bias unit; output unit z, with weights w_{ji} and w_{kj}) and the decision regions it carves out, e.g. z = -1 inside region R_2 (from Pascal Vincent's slides; labels translated: sortie = output, cachée = hidden, biais = bias, entrée = input)]
Capacity of Neural Nets

• The expressive power of neural networks: consider a single hidden layer neural network.
 [figure: with four hidden units y_1, ..., y_4 combined into an output z_1, the network produces a more complex decision region over (x_1, x_2) (from Pascal Vincent's slides)]


Capacity of Neural Nets

• Consider a single hidden layer neural network.
 [figure omitted (from Pascal Vincent's slides)]


Universal Approximation

• Universal approximation theorem (Hornik, 1991):

 - ''a single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units''

• This applies for the sigmoid, tanh and many other activation functions.

• However, this does not mean that there is a learning algorithm that can find the necessary parameter values.
Feedforward Neural Networks

‣ How neural networks predict f(x) given an input x:
 - Forward propagation
 - Types of units
 - Capacity of neural networks

‣ How to train neural nets:
 - Loss function
 - Backpropagation with gradient descent

‣ More recent techniques:
 - Dropout
 - Batch normalization
 - Unsupervised pre-training
Training

• Empirical risk minimization:
  \arg\min_\theta \frac{1}{T} \sum_t l(f(x^{(t)}; \theta), y^{(t)}) + \lambda \Omega(\theta)
  (the first term is the loss function, the second the regularizer)

• Learning is cast as optimization.

 Ø For classification problems, we would like to minimize classification error.

 Ø The loss function can sometimes be viewed as a surrogate for what we want to optimize (e.g. an upper bound).
Stochastic Gradient Descent

• Perform updates after seeing each example (a training-loop sketch follows below):

 - Initialize: \theta \equiv \{W^{(1)}, b^{(1)}, \ldots, W^{(L+1)}, b^{(L+1)}\}
 - For N epochs (an epoch is one iteration over all examples):
   - for each training example (x^{(t)}, y^{(t)}):
     \Delta = -\nabla_\theta l(f(x^{(t)}; \theta), y^{(t)}) - \lambda \nabla_\theta \Omega(\theta)
     \theta \leftarrow \theta + \alpha \Delta

• To train a neural net, we need:

 Ø A loss function: l(f(x^{(t)}; \theta), y^{(t)})
 Ø A procedure to compute the gradients: \nabla_\theta l(f(x^{(t)}; \theta), y^{(t)})
 Ø A regularizer \Omega(\theta) and its gradient: \nabla_\theta \Omega(\theta)
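A minimal runnable sketch of this loop on a toy model (a single sigmoid neuron, so it stays self-contained; the data, learning rate alpha and regularization strength lam are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

theta = {"w": np.zeros(2), "b": 0.0}        # initialize theta
alpha, lam = 0.1, 1e-3

for epoch in range(10):                     # N epochs
    for x_t, y_t in zip(X, y):              # one update per training example
        f = 1.0 / (1.0 + np.exp(-(theta["w"] @ x_t + theta["b"])))
        d_a = f - y_t                       # loss gradient w.r.t. pre-activation
        # Delta = -grad(loss) - lam * grad(Omega), with Omega = ||w||^2 here
        theta["w"] -= alpha * (d_a * x_t + lam * 2 * theta["w"])
        theta["b"] -= alpha * d_a
print(theta)
```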
Loss Function

• Let us start by considering a classification problem with a softmax output layer.

• We need to estimate f(x)_c = p(y = c|x).

• We can maximize the log-probability of the correct class given an input:
  \log p(y^{(t)} = c | x^{(t)})

• Alternatively, we can minimize the negative log-likelihood:
  l(f(x), y) = -\sum_c 1_{(y=c)} \log f(x)_c = -\log f(x)_y

• As seen before, this is also known as the cross-entropy loss for multi-class classification problems.
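A minimal sketch of this loss on toy probabilities (values assumed for illustration):

```python
import numpy as np

def nll(fx, y):
    # negative log-likelihood l(f(x), y) = -log f(x)_y
    return -np.log(fx[y])

fx = np.array([0.2, 0.7, 0.1])   # f(x)_c = estimated p(y = c | x)
print(nll(fx, 1))                # small: correct class got high probability
print(nll(fx, 2))                # large: correct class got low probability
```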
Stochastic Gradient Descent (recap)

• To train a neural net, we need:

 Ø A loss function: l(f(x^{(t)}; \theta), y^{(t)})  (done)
 Ø A procedure to compute the gradients: \nabla_\theta l(f(x^{(t)}; \theta), y^{(t)})  (next)
 Ø A regularizer \Omega(\theta) and its gradient: \nabla_\theta \Omega(\theta)
Multilayer Neural Net: Reminder

• Consider a network with L hidden layers.

 - Layer pre-activation for k > 0 (with h^{(0)}(x) = x):
   a^{(k)}(x) = b^{(k)} + W^{(k)} h^{(k-1)}(x)

 - Hidden layer activation, for k from 1 to L:
   h^{(k)}(x) = g(a^{(k)}(x))

 - Output layer activation (k = L+1):
   h^{(L+1)}(x) = o(a^{(L+1)}(x)) = f(x),
   where o is the softmax activation function
Gradient Computation

• Loss gradient at the output:

 - Partial derivative:
   \frac{\partial}{\partial f(x)_c} \left[ -\log f(x)_y \right] = -\frac{1_{(y=c)}}{f(x)_y}

 - Gradient:
   \nabla_{f(x)} \left[ -\log f(x)_y \right] = -\frac{1}{f(x)_y} \left[ 1_{(y=0)}, \ldots, 1_{(y=C-1)} \right]^\top = -\frac{e(y)}{f(x)_y},

   where e(y) is the one-hot indicator vector for class y.

 - Remember: f(x)_y = \mathrm{softmax}(a^{(L+1)}(x))_y = \frac{\exp(a^{(L+1)}(x)_y)}{\sum_{c'} \exp(a^{(L+1)}(x)_{c'})}
Gradient Computation

• Loss gradient at the output pre-activation:

 - Partial derivative:
   \frac{\partial}{\partial a^{(L+1)}(x)_c} \left[ -\log f(x)_y \right] = -\left( 1_{(y=c)} - f(x)_c \right)

 - Gradient:
   \nabla_{a^{(L+1)}(x)} \left[ -\log f(x)_y \right] = -(e(y) - f(x))

• Derivation, using the quotient rule \frac{\partial}{\partial x} \frac{g(x)}{h(x)} = \frac{1}{h(x)} \frac{\partial g(x)}{\partial x} - \frac{g(x)}{h(x)^2} \frac{\partial h(x)}{\partial x}:

  \frac{\partial}{\partial a^{(L+1)}(x)_c} \left[ -\log f(x)_y \right]
   = -\frac{1}{f(x)_y} \frac{\partial f(x)_y}{\partial a^{(L+1)}(x)_c}
   = -\frac{1}{f(x)_y} \frac{\partial}{\partial a^{(L+1)}(x)_c} \frac{\exp(a^{(L+1)}(x)_y)}{\sum_{c'} \exp(a^{(L+1)}(x)_{c'})}
   = -\frac{1}{f(x)_y} \left( \frac{1_{(y=c)} \exp(a^{(L+1)}(x)_y)}{\sum_{c'} \exp(a^{(L+1)}(x)_{c'})} - \frac{\exp(a^{(L+1)}(x)_y) \exp(a^{(L+1)}(x)_c)}{\left( \sum_{c'} \exp(a^{(L+1)}(x)_{c'}) \right)^2} \right)
   = -\frac{1}{f(x)_y} \left( 1_{(y=c)} \, \mathrm{softmax}(a^{(L+1)}(x))_y - \mathrm{softmax}(a^{(L+1)}(x))_y \, \mathrm{softmax}(a^{(L+1)}(x))_c \right)
   = -\frac{1}{f(x)_y} \left( 1_{(y=c)} f(x)_y - f(x)_y f(x)_c \right)
   = -\left( 1_{(y=c)} - f(x)_c \right)
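The final expression can be sanity-checked numerically; here is a minimal sketch comparing the analytic gradient f(x) - e(y) against centered finite differences (toy values assumed):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def loss(a, y):
    return -np.log(softmax(a)[y])   # -log f(x)_y with a = a^(L+1)(x)

a, y, eps = np.array([0.5, -1.0, 2.0]), 2, 1e-6
analytic = softmax(a) - np.eye(3)[y]   # -(e(y) - f(x)) = f(x) - e(y)
numeric = np.array([
    (loss(a + eps * np.eye(3)[c], y) - loss(a - eps * np.eye(3)[c], y)) / (2 * eps)
    for c in range(3)])
print(np.allclose(analytic, numeric, atol=1e-4))  # True
```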


Gradient Computation

• Loss gradient for hidden layers:

 - This is getting complicated!

• Chain rule: assume a function p(a) can be written as a function of intermediate results q_i(a). Then:

  \frac{\partial p(a)}{\partial a} = \sum_i \frac{\partial p(a)}{\partial q_i(a)} \frac{\partial q_i(a)}{\partial a}

• We can invoke it by setting:
 - a to be a hidden unit
 - q_i(a) to be a pre-activation in the layer above
 - p(a) to be the loss function

Gradient Computation

• Loss gradient at hidden layers:

 - Partial derivative:
   \frac{\partial}{\partial h^{(k)}(x)_j} \left[ -\log f(x)_y \right]
    = \sum_i \frac{\partial -\log f(x)_y}{\partial a^{(k+1)}(x)_i} \frac{\partial a^{(k+1)}(x)_i}{\partial h^{(k)}(x)_j}
    = \sum_i \frac{\partial -\log f(x)_y}{\partial a^{(k+1)}(x)_i} W_{i,j}^{(k+1)}

   (remember: a^{(k+1)}(x)_i = b_i^{(k+1)} + \sum_j W_{i,j}^{(k+1)} h^{(k)}(x)_j)

 - Gradient:
   \nabla_{h^{(k)}(x)} \left[ -\log f(x)_y \right] = W^{(k+1)\top} \nabla_{a^{(k+1)}(x)} \left[ -\log f(x)_y \right]

   We already know how to compute \nabla_{a^{(L+1)}(x)} at the top of the network.
Gradient Computation

• Loss gradient at hidden layers (pre-activation):

 - Partial derivative (remember: h^{(k)}(x)_j = g(a^{(k)}(x)_j)):
   \frac{\partial}{\partial a^{(k)}(x)_j} \left[ -\log f(x)_y \right]
    = \frac{\partial -\log f(x)_y}{\partial h^{(k)}(x)_j} \frac{\partial h^{(k)}(x)_j}{\partial a^{(k)}(x)_j}
    = \frac{\partial -\log f(x)_y}{\partial h^{(k)}(x)_j} \, g'(a^{(k)}(x)_j)

 - Gradient:
   \nabla_{a^{(k)}(x)} \left[ -\log f(x)_y \right] = \nabla_{h^{(k)}(x)} \left[ -\log f(x)_y \right] \odot [\ldots, g'(a^{(k)}(x)_j), \ldots]^\top

   (elementwise product with the gradient of the activation function)

• Let's look at the gradients of the activation functions.
Linear Activation Function Gradient

• Linear activation function: g(a) = a

 - Partial derivative: g'(a) = 1
Sigmoid Activation Function Gradient

• Sigmoid activation function: g(a) = \mathrm{sigm}(a) = \frac{1}{1 + \exp(-a)}

 - Partial derivative: g'(a) = g(a)(1 - g(a))
Tanh Activation Function Gradient

• Hyperbolic tangent (''tanh'') activation function: g(a) = \tanh(a)

 - Partial derivative: g'(a) = 1 - g(a)^2
ReLU Activation Function Gradient

• Rectified linear (ReLU) activation function: g(a) = \mathrm{reclin}(a) = \max(0, a)

 - Partial derivative: g'(a) = 1_{a > 0}
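A minimal sketch of these four derivatives (not from the slides; expressed through g(a) where the slides do so):

```python
import numpy as np

def d_linear(a):
    return np.ones_like(a)               # g'(a) = 1

def d_sigm(a):
    g = 1.0 / (1.0 + np.exp(-a))
    return g * (1.0 - g)                 # g'(a) = g(a)(1 - g(a))

def d_tanh(a):
    g = np.tanh(a)
    return 1.0 - g ** 2                  # g'(a) = 1 - g(a)^2

def d_reclin(a):
    return (a > 0).astype(float)         # g'(a) = 1_{a > 0}

a = np.linspace(-2.0, 2.0, 5)
print(d_linear(a), d_sigm(a), d_tanh(a), d_reclin(a), sep="\n")
```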
Stochastic Gradient Descent (recap)

• To train a neural net, we need:

 Ø A loss function: l(f(x^{(t)}; \theta), y^{(t)})  (done)
 Ø A procedure to compute the gradients: \nabla_\theta l(f(x^{(t)}; \theta), y^{(t)})  (almost there: the parameter gradients are next)
 Ø A regularizer \Omega(\theta) and its gradient: \nabla_\theta \Omega(\theta)
Gradient Computation

• Loss gradient of the parameters:

 - Partial derivative (weights):
   \frac{\partial}{\partial W_{i,j}^{(k)}} \left[ -\log f(x)_y \right]
    = \frac{\partial -\log f(x)_y}{\partial a^{(k)}(x)_i} \frac{\partial a^{(k)}(x)_i}{\partial W_{i,j}^{(k)}}
    = \frac{\partial -\log f(x)_y}{\partial a^{(k)}(x)_i} \, h_j^{(k-1)}(x)

   (remember: a^{(k)}(x)_i = b_i^{(k)} + \sum_j W_{i,j}^{(k)} h^{(k-1)}(x)_j)

 - Gradient (weights):
   \nabla_{W^{(k)}} \left[ -\log f(x)_y \right] = \nabla_{a^{(k)}(x)} \left[ -\log f(x)_y \right] \, h^{(k-1)}(x)^\top

 - Partial derivative (biases):
   \frac{\partial}{\partial b_i^{(k)}} \left[ -\log f(x)_y \right]
    = \frac{\partial -\log f(x)_y}{\partial a^{(k)}(x)_i} \frac{\partial a^{(k)}(x)_i}{\partial b_i^{(k)}}
    = \frac{\partial -\log f(x)_y}{\partial a^{(k)}(x)_i}

 - Gradient (biases):
   \nabla_{b^{(k)}} \left[ -\log f(x)_y \right] = \nabla_{a^{(k)}(x)} \left[ -\log f(x)_y \right]
Backpropagation Algorithm

• Perform forward propagation.

• Compute the output gradient (before activation):
  \nabla_{a^{(L+1)}(x)} \left[ -\log f(x)_y \right] \Leftarrow -(e(y) - f(x))

• For k = L+1 down to 1:

 - Compute gradients w.r.t. the hidden layer parameters:
   \nabla_{W^{(k)}} \left[ -\log f(x)_y \right] \Leftarrow \nabla_{a^{(k)}(x)} \left[ -\log f(x)_y \right] \, h^{(k-1)}(x)^\top
   \nabla_{b^{(k)}} \left[ -\log f(x)_y \right] \Leftarrow \nabla_{a^{(k)}(x)} \left[ -\log f(x)_y \right]

 - Compute gradients w.r.t. the hidden layer below:
   \nabla_{h^{(k-1)}(x)} \left[ -\log f(x)_y \right] \Leftarrow W^{(k)\top} \nabla_{a^{(k)}(x)} \left[ -\log f(x)_y \right]

 - Compute gradients w.r.t. the hidden layer below (before activation):
   \nabla_{a^{(k-1)}(x)} \left[ -\log f(x)_y \right] \Leftarrow \nabla_{h^{(k-1)}(x)} \left[ -\log f(x)_y \right] \odot [\ldots, g'(a^{(k-1)}(x)_j), \ldots]^\top

 (a runnable sketch of the whole procedure follows below)
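A minimal runnable sketch of backpropagation for a tanh network with a softmax output (layer sizes and the toy input are illustrative assumptions):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def backprop(x, y, Ws, bs):
    hs, pre = [x], []                        # h^(0) = x; cache a^(k) and h^(k)
    for W, b in zip(Ws[:-1], bs[:-1]):       # forward propagation
        pre.append(b + W @ hs[-1])
        hs.append(np.tanh(pre[-1]))
    pre.append(bs[-1] + Ws[-1] @ hs[-1])
    f = softmax(pre[-1])
    d_a = f - np.eye(len(f))[y]              # output gradient: -(e(y) - f(x))
    dWs, dbs = [None] * len(Ws), [None] * len(bs)
    for k in reversed(range(len(Ws))):       # k = L+1 down to 1
        dWs[k] = np.outer(d_a, hs[k])        # grad_W^(k) = d_a h^(k-1)^T
        dbs[k] = d_a                         # grad_b^(k) = d_a
        if k > 0:
            d_h = Ws[k].T @ d_a              # grad_h^(k-1) = W^(k)^T d_a
            d_a = d_h * (1 - np.tanh(pre[k - 1]) ** 2)  # times g'(a^(k-1))
    return dWs, dbs

rng = np.random.default_rng(0)
sizes = [4, 5, 3]
Ws = [rng.normal(0, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
dWs, dbs = backprop(rng.normal(size=4), 1, Ws, bs)
print([dW.shape for dW in dWs])              # [(5, 4), (3, 5)]
```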
Computational Flow Graph

• Forward propagation can be represented as an acyclic flow graph.

• Forward propagation can be implemented in a modular way:

 Ø Each box can be an object with an fprop method, which computes the value of the box given its children.

 Ø Calling the fprop method of each box in the right order yields forward propagation.
Computational Flow Graph

• Each object also has a bprop method:

 - it computes the gradient of the loss with respect to each child box.

 - fprop depends on the fprop output of a box's children, while bprop depends on the bprop of a box's parents.

• By calling bprop in the reverse order, we obtain backpropagation.
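A minimal sketch of this modular design on a toy two-box graph (a multiply feeding an add; the box granularity here is an illustrative assumption, not the course's actual framework):

```python
class Mul:
    def fprop(self, a, b):            # value of the box given its children
        self.a, self.b = a, b
        return a * b
    def bprop(self, grad_out):        # loss gradient w.r.t. each child
        return grad_out * self.b, grad_out * self.a

class Add:
    def fprop(self, a, b):
        return a + b
    def bprop(self, grad_out):
        return grad_out, grad_out

mul, add = Mul(), Add()
out = add.fprop(mul.fprop(2.0, 3.0), 4.0)  # fprop in topological order
g_mul, g_c = add.bprop(1.0)                # bprop in reverse order
g_a, g_b = mul.bprop(g_mul)
print(out, g_a, g_b, g_c)                  # 10.0 3.0 2.0 1.0
```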
Stochastic Gradient Descent (recap)

• To train a neural net, we need:

 Ø A loss function: l(f(x^{(t)}; \theta), y^{(t)})  (done)
 Ø A procedure to compute the gradients: \nabla_\theta l(f(x^{(t)}; \theta), y^{(t)})  (done: backpropagation)
 Ø A regularizer \Omega(\theta) and its gradient: \nabla_\theta \Omega(\theta)  (next)
Weight Decay

• L2 regularization:
  \Omega(\theta) = \sum_k \sum_i \sum_j \left( W_{i,j}^{(k)} \right)^2 = \sum_k \|W^{(k)}\|_F^2

• Gradient: \nabla_{W^{(k)}} \Omega(\theta) = 2 W^{(k)}

 - Only applies to weights, not biases (hence ''weight decay'').
 - Can be interpreted as having a Gaussian prior over the weights, while performing MAP estimation.
 - We will later look at Bayesian methods.
Other Regularizers

• Using a more general regularizer, we get a family of penalties ranging from the Lasso (L1) to the quadratic (L2) penalty.
 [figure: contours of the different norm penalties]

L1 Regularization

• L1 regularization:
  \Omega(\theta) = \sum_k \sum_i \sum_j |W_{i,j}^{(k)}|

• Gradient: \nabla_{W^{(k)}} \Omega(\theta) = \mathrm{sign}(W^{(k)}),
  where \mathrm{sign}(W^{(k)})_{i,j} = 1_{W_{i,j}^{(k)} > 0} - 1_{W_{i,j}^{(k)} < 0}

 - Only applies to weights, not biases (weight decay).
 - Can be interpreted as having a Laplace prior over the weights, while performing MAP estimation.
 - Unlike L2, L1 will push some weights to be exactly 0.
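A minimal sketch of both regularizers and their gradients (applied to weight matrices only, as the slides prescribe):

```python
import numpy as np

def l2_omega(Ws):
    return sum((W ** 2).sum() for W in Ws)   # sum_k ||W^(k)||_F^2

def l2_grad(W):
    return 2 * W                             # pulls weights smoothly toward 0

def l1_omega(Ws):
    return sum(np.abs(W).sum() for W in Ws)  # sum of |W_ij|

def l1_grad(W):
    return np.sign(W)                        # 1 if W_ij > 0, -1 if W_ij < 0

W = np.array([[0.5, -2.0], [0.0, 1.0]])
print(l2_omega([W]), l1_omega([W]))
print(l2_grad(W))
print(l1_grad(W))
```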


Bias-Variance Trade-off

[figure: three contributions to generalization error: (1) bias, where average predictions over all datasets differ from the optimal regression function; (2) variance, where solutions for individual datasets vary around their averages, i.e. how sensitive the function is to the particular choice of dataset; (3) noise, the intrinsic variability of the target values]

• Trade-off between bias and variance: with very flexible models (high complexity) we have low bias and high variance; with relatively rigid models (low complexity) we have high bias and low variance.
• The model with the optimal predictive capabilities has to balance bias and variance.
Bias-Variance Trade-off

• Consider the sinusoidal dataset. We generate 100 datasets, each containing N = 25 points, drawn independently from the underlying sinusoidal distribution.
 [figure: flexible fits give high variance / low bias; rigid fits give low variance / high bias]

Bias-Variance Trade-off

• Generalization error can be seen as the sum of the (squared) bias and the variance.
Initialization

• Initialize biases to 0.

• For weights:

 - Can not initialize weights to 0 with the tanh activation:
   Ø All gradients would be zero (saddle point).

 - Can not initialize all weights to the same value:
   Ø All hidden units in a layer would always behave the same.
   Ø Need to break symmetry.

 - Sample W_{i,j}^{(k)} \sim U[-b, b], where b = \frac{\sqrt{6}}{\sqrt{H_k + H_{k-1}}} and H_k is the size of h^{(k)}(x):
   Ø Samples around 0 break the symmetry.
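A minimal sketch of this recipe (the uniform scheme above is what is now usually called Glorot/Xavier initialization):

```python
import numpy as np

def init_layer(h_prev, h_k, rng):
    b = np.sqrt(6.0) / np.sqrt(h_k + h_prev)      # b = sqrt(6)/sqrt(H_k + H_{k-1})
    W = rng.uniform(-b, b, size=(h_k, h_prev))    # centered at 0, breaks symmetry
    return W, np.zeros(h_k)                       # biases initialized to 0

rng = np.random.default_rng(0)
W, bias = init_layer(h_prev=4, h_k=5, rng=rng)
print(W.shape, float(W.min()), float(W.max()))
```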

Model Selection

• Training protocol:

 - Train your model on the training set D^{train} = \{(x^{(t)}, y^{(t)})\}.

 - For model selection, use the validation set D^{valid}:
   Ø Hyper-parameter search: hidden layer size, learning rate, number of iterations/epochs, etc.

 - Estimate generalization performance using the test set D^{test}.

• Remember: generalization is the behavior of the model on unseen examples.
Early Stopping
• To select the number of epochs, stop training when validation set
error increases (with some look ahead).
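A minimal sketch of this rule with a ''patience'' counter as the look-ahead; train_one_epoch and valid_error are hypothetical callables standing in for your training loop:

```python
def early_stopping(train_one_epoch, valid_error, patience=5, max_epochs=200):
    best_err, best_epoch, since_best = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_one_epoch()
        err = valid_error()
        if err < best_err:                    # validation error still improving
            best_err, best_epoch, since_best = err, epoch, 0
        else:
            since_best += 1
            if since_best >= patience:        # stopped improving: halt
                break
    return best_epoch, best_err

# toy demo with a scripted validation-error curve
errs = iter([1.0, 0.8, 0.7, 0.72, 0.71, 0.74, 0.75])
print(early_stopping(lambda: None, lambda: next(errs), patience=3, max_epochs=7))
```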
Tricks of the Trade:
• Normalizing your (real-valued) data:
Ø for each dimension xi subtract its training set mean
Ø divide each dimension xi by its training set standard deviation
Ø this can speed up training

• Decreasing the learning rate: as we get closer to the optimum, take smaller update steps:
 i. start with a large learning rate (e.g. 0.1)
 ii. maintain it until the validation error stops improving
 iii. divide the learning rate by 2 and go back to (ii)
Mini-batch, Momentum

• Make updates based on a mini-batch of examples (instead of a single example):
 Ø the gradient is the average regularized loss for that mini-batch
 Ø can give a more accurate estimate of the gradient
 Ø can leverage matrix/matrix operations, which are more efficient

• The learning rate schedule \alpha_t should satisfy \sum_{t=1}^{\infty} \alpha_t = \infty and \sum_{t=1}^{\infty} \alpha_t^2 < \infty, e.g. \alpha_t = \frac{\alpha}{1 + \delta t}.

• Momentum: can use an exponential average of previous gradients (with 0.5 \leq \beta < 1):
  \bar{\nabla}_\theta^{(t)} = \nabla_\theta l(f(x^{(t)}), y^{(t)}) + \beta \bar{\nabla}_\theta^{(t-1)}
 Ø can get past plateaus more quickly, by ''gaining momentum''
Adapting Learning Rates

• Updates with adaptive learning rates (''one learning rate per parameter''):

 Ø Adagrad: learning rates are scaled by the square root of the cumulative sum of squared gradients:
   \gamma^{(t)} = \gamma^{(t-1)} + \left( \nabla_\theta l(f(x^{(t)}), y^{(t)}) \right)^2, \quad
   \bar{\nabla}_\theta^{(t)} = \frac{\nabla_\theta l(f(x^{(t)}), y^{(t)})}{\sqrt{\gamma^{(t)} + \epsilon}}

 Ø RMSProp: instead of a cumulative sum, use an exponential moving average:
   \gamma^{(t)} = \beta \gamma^{(t-1)} + (1 - \beta) \left( \nabla_\theta l(f(x^{(t)}), y^{(t)}) \right)^2, \quad
   \bar{\nabla}_\theta^{(t)} = \frac{\nabla_\theta l(f(x^{(t)}), y^{(t)})}{\sqrt{\gamma^{(t)} + \epsilon}}

 Ø Adam: essentially combines RMSProp with momentum.
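A minimal sketch of the RMSProp-style update on a toy quadratic loss (the alpha, beta, eps values are illustrative assumptions):

```python
import numpy as np

theta = np.array([5.0, -3.0])
gamma = np.zeros_like(theta)
alpha, beta, eps = 0.1, 0.9, 1e-8

for t in range(200):
    grad = 2 * theta                               # gradient of toy loss ||theta||^2
    gamma = beta * gamma + (1 - beta) * grad ** 2  # exponential moving average
    theta -= alpha * grad / np.sqrt(gamma + eps)   # one learning rate per parameter
print(theta)                                       # near the optimum at 0
```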
Gradient Checking

• To debug your implementation of fprop/bprop, you can compare it with a finite-difference approximation of the gradient:

  \frac{\partial f(x)}{\partial x} \approx \frac{f(x + \epsilon) - f(x - \epsilon)}{2\epsilon}

 Ø f(x) would be the loss
 Ø x would be a parameter
 Ø f(x + \epsilon) would be the loss if you add \epsilon to the parameter
 Ø f(x - \epsilon) would be the loss if you subtract \epsilon from the parameter
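A minimal runnable sketch of this check on a single sigmoid neuron (the model and data are toy assumptions; the recipe is exactly the centered difference above):

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), 1.0
w = rng.normal(size=3)

def loss(w):
    p = 1.0 / (1.0 + np.exp(-(w @ x)))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

p = 1.0 / (1.0 + np.exp(-(w @ x)))
analytic = (p - y) * x                  # gradient from backprop algebra

eps, numeric = 1e-6, np.zeros_like(w)
for i in range(len(w)):
    e = np.zeros_like(w)
    e[i] = eps
    numeric[i] = (loss(w + e) - loss(w - e)) / (2 * eps)  # finite difference

print(np.max(np.abs(analytic - numeric)))  # tiny, e.g. ~1e-10
```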
Debugging on Small Dataset
• Next, make sure your model can overfit on a smaller dataset
(~ 500-1000 examples)
• If not, investigate the following situations:
Ø Are some of the units saturated, even before the first update?
• scale down the initialization of your parameters for these units
• properly normalize the inputs
Ø Is the training error bouncing up and down?
• decrease the learning rate

• This does not mean that you have computed gradients correctly:
Ø You could still overfit with some of the gradients being wrong
