Lecture NN Part1
Deep Learning
Russ Salakhutdinov
Machine Learning Department
rsalakhu@cs.cmu.edu
http://www.cs.cmu.edu/~rsalakhu/10707/
Neural Networks I
Neural Networks Online Course
• Disclaimer: Much of the material and slides for this lecture were
borrowed from Hugo Larochelle’s class on Neural Networks:
https://sites.google.com/site/deeplearningsummerschool2016/
Artificial Neuron
• Neuron pre-activation (input):
a(x) = b + Σ_j w_j x_j = b + wᵀx
• Neuron (output) activation:
h(x) = g(a(x)) = g(b + Σ_j w_j x_j)
• w are the connection weights, b is the neuron bias, g(·) is the activation function
• The range of h(x) is determined by g(·); the bias b only changes the position of the riff
(from Pascal Vincent's slides)
Activation Function
• Linear activation function:
g(a) = a
Ø No nonlinear transformation
Ø No input squashing
• Sigmoid activation function:
g(a) = sigm(a) = 1 / (1 + exp(−a))
Ø Squashes the neuron's activation between 0 and 1
Ø Bounded
Ø Strictly increasing
Activation Function
• Hyperbolic tangent (‘‘tanh’’) activation function:
g(a) = tanh(a) = (exp(a) − exp(−a)) / (exp(a) + exp(−a)) = (exp(2a) − 1) / (exp(2a) + 1)
Ø Squashes the neuron's activation between −1 and 1
Ø Can be positive or negative
Ø Bounded
Ø Strictly increasing
Activation Function
• Rectified linear (ReLU) activation function:
g(a) = reclin(a) = max(0, a)
Ø Bounded below by 0 (always non-negative)
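The activation functions above can be sketched in NumPy (the function names are mine, not from the slides):

```python
import numpy as np

def linear(a):
    # g(a) = a: no nonlinear transformation, no input squashing
    return a

def sigm(a):
    # g(a) = 1 / (1 + exp(-a)): bounded in (0, 1), strictly increasing
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    # g(a) = (exp(2a) - 1) / (exp(2a) + 1): bounded in (-1, 1)
    return np.tanh(a)

def reclin(a):
    # g(a) = max(0, a): bounded below by 0, always non-negative
    return np.maximum(0.0, a)
```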
Capacity of a Single Neuron
• A single neuron can represent linearly separable Boolean functions such as OR(x1, x2) and AND(x1, x2), but not XOR(x1, x2).
• Need to transform the input into a better representation.
• Remember basis functions!
(Figure 1.8, from Pascal Vincent's slides — example of modeling XOR with a one-hidden-layer network. Top, left to right: the Boolean functions OR(x1, x2), AND(x1, x2), and AND(x̄1, x2). Bottom: XOR(x1, x2) as a function of x1 and x2 (left), then as a function of AND(x̄1, x2) and AND(x1, x̄2) (right).)
Single Hidden Layer Neural Net
• Hidden layer pre-activation:
a(x) = b^(1) + W^(1) x,  with a(x)_i = b_i^(1) + Σ_j W_{i,j}^(1) x_j
• Hidden layer activation:
h(x) = g(a(x))
• Output layer activation:
f(x) = o(b^(2) + w^(2)ᵀ h(x))
where o(·) is the output activation function; for binary classification with a sigmoid output, f(x) estimates p(y = 1|x).
Softmax Activation Function
‣ Remember multi-way classification:
- We need multiple outputs (1 output per class)
- We need to estimate the conditional probability: p(y = c|x)
- Discriminative learning
‣ Softmax activation function at the output:
o(a) = softmax(a) = [ exp(a_1)/Σ_c exp(a_c), …, exp(a_C)/Σ_c exp(a_c) ]ᵀ
- strictly positive
- sums to one
‣ Predict the class with the highest estimated class-conditional probability.
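A minimal NumPy sketch of the softmax above; subtracting the maximum before exponentiating is a standard numerical-stability trick, not something stated on the slide:

```python
import numpy as np

def softmax(a):
    # Subtracting max(a) before exponentiating avoids overflow
    # and leaves the result unchanged.
    e = np.exp(a - np.max(a))
    return e / e.sum()
```

The output is strictly positive and sums to one, as required for p(y = c|x).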
Multilayer Neural Net
• Consider a network with L hidden layers.
- Layer pre-activation for k > 0 (with h^(0)(x) = x):
a^(k)(x) = b^(k) + W^(k) h^(k−1)(x)
- Hidden layer activation, for k from 1 to L:
h^(k)(x) = g(a^(k)(x))
- Output layer activation (k = L+1):
h^(L+1)(x) = o(a^(L+1)(x)) = f(x)
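The forward-propagation recurrence above can be sketched as follows, assuming tanh hidden units and a softmax output (that choice of activations is mine):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def forward(x, Ws, bs):
    """Forward propagation with L hidden layers.
    Ws, bs hold W^(1..L+1) and b^(1..L+1)."""
    h = x                                   # h^(0)(x) = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        # a^(k) = b^(k) + W^(k) h^(k-1),  h^(k) = g(a^(k))
        h = np.tanh(b + W @ h)
    return softmax(bs[-1] + Ws[-1] @ h)     # f(x) = o(a^(L+1)), here o = softmax
```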
(Figure, from Pascal Vincent's slides: a single-hidden-layer network computing XOR, with input layer (entrée) x1, x2, hidden layer (cachée), bias (biais) units, weights w_ji and w_kj, and output (sortie) z, together with the decision regions the network carves out.)
Capacity of Neural Nets
• The expressive power of neural networks: consider a single hidden layer neural network.
• Universal approximation: with enough hidden units, a single hidden layer neural network can approximate any continuous function arbitrarily well (Hornik, 1991).
• However, this does not mean that there is a learning algorithm that can find the necessary parameter values.
Feedforward Neural Networks
‣ How neural networks predict f(x) given an input x:
- Forward propagation
- Types of units
- Capacity of neural networks
• Training is cast as empirical risk minimization:
arg min_θ (1/T) Σ_t l(f(x^(t); θ), y^(t)) + λ Ω(θ)
where l is the loss function and Ω(θ) the regularizer (λ controls the regularization strength).
• Let us start by considering a classification problem with a softmax output layer.
• We need to estimate: f(x)_c = p(y = c|x)
• We can maximize the log-probability of the correct class given an input: log p(y^(t)|x^(t))
• Equivalently, we can minimize the negative log-likelihood:
l(f(x), y) = −Σ_c 1_(y=c) log f(x)_c = −log f(x)_y
• As seen before, this is also known as the cross-entropy loss for multi-class classification problems.
Stochastic Gradient Descent
• Perform updates after seeing each example:
- Initialize: θ ≡ {W^(1), b^(1), …, W^(L+1), b^(L+1)}
- For each training epoch:
  - for each training example (x^(t), y^(t)):
    Δ = −∇_θ l(f(x^(t); θ), y^(t)) − λ ∇_θ Ω(θ)
    θ ← θ + α Δ
• A training epoch is one iteration over all examples.
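The update loop above can be sketched as follows; `grad_loss` and `grad_reg` are hypothetical callables standing in for the gradient procedures derived later in the lecture:

```python
def sgd(theta, grad_loss, grad_reg, data, alpha=0.1, lam=0.01, epochs=5):
    """Stochastic gradient descent as on the slide:
    Delta = -grad l(f(x; theta), y) - lambda * grad Omega(theta),
    theta <- theta + alpha * Delta, updating after each example."""
    for _ in range(epochs):            # one epoch = iteration over all examples
        for x, y in data:              # update after seeing each example
            delta = -grad_loss(theta, x, y) - lam * grad_reg(theta)
            theta = theta + alpha * delta
    return theta
```

With a toy quadratic loss l = (θ − y)², the iterates converge to the target y.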
• To train a neural net, we need:
Ø A loss function: l(f(x^(t); θ), y^(t))
Ø A procedure to compute the gradients: ∇_θ l(f(x^(t); θ), y^(t))
Ø A regularizer and its gradient: Ω(θ), ∇_θ Ω(θ)
Gradient Computation
• Loss gradient at the output (pre-activation):
∂/∂a^(L+1)(x)_c [−log f(x)_y]
= −(1/f(x)_y) · ∂f(x)_y/∂a^(L+1)(x)_c
= −(1/f(x)_y) · ∂softmax(a^(L+1)(x))_y/∂a^(L+1)(x)_c
= −(1/f(x)_y) · ∂/∂a^(L+1)(x)_c [ exp(a^(L+1)(x)_y) / Σ_c′ exp(a^(L+1)(x)_c′) ]
= −(1/f(x)_y) · [ 1_(y=c) exp(a^(L+1)(x)_y) · Σ_c′ exp(a^(L+1)(x)_c′) − exp(a^(L+1)(x)_y) · exp(a^(L+1)(x)_c) ] / [ Σ_c′ exp(a^(L+1)(x)_c′) ]²
= −(1/f(x)_y) · [ 1_(y=c) softmax(a^(L+1)(x))_y − softmax(a^(L+1)(x))_y · softmax(a^(L+1)(x))_c ]
= −( 1_(y=c) − f(x)_c )
• In vector form:
∇_{a^(L+1)(x)} [−log f(x)_y] = −(e(y) − f(x))
where e(y) is the one-hot indicator vector of class y.
Gradient Computation
• Loss gradient at the hidden layers:
- Chain rule reminder: for p(a) depending on a through intermediates q_i(a):
∂p(a)/∂a = Σ_i [∂p(a)/∂q_i(a)] · [∂q_i(a)/∂a]
- Partial derivative:
∂(−log f(x)_y)/∂h^(k)(x)_j = Σ_i [∂(−log f(x)_y)/∂a^(k+1)(x)_i] · [∂a^(k+1)(x)_i/∂h^(k)(x)_j]
= Σ_i [∂(−log f(x)_y)/∂a^(k+1)(x)_i] · W_{i,j}^(k+1)
(Remember: a^(k)(x)_i = b_i^(k) + Σ_j W_{i,j}^(k) h^(k−1)(x)_j)
- Gradient:
∇_{h^(k)(x)} (−log f(x)_y) = W^(k+1)ᵀ ∇_{a^(k+1)(x)} (−log f(x)_y)
• Loss gradient at the hidden layers (pre-activation):
- Partial derivative:
∂(−log f(x)_y)/∂a^(k)(x)_j = [∂(−log f(x)_y)/∂h^(k)(x)_j] · g′(a^(k)(x)_j)
(Remember: h^(k)(x)_j = g(a^(k)(x)_j))
- Gradient:
∇_{a^(k)(x)} (−log f(x)_y) = ∇_{h^(k)(x)} (−log f(x)_y) ⊙ […, g′(a^(k)(x)_j), …]ᵀ
(elementwise product with the gradient of the activation function)
Let's look at the gradients of activation functions.
Linear Activation Function Gradient
• Linear activation function: g(a) = a
- Partial derivative: g′(a) = 1
Sigmoid Activation Function Gradient
• Sigmoid activation function: g(a) = sigm(a) = 1 / (1 + exp(−a))
- Partial derivative: g′(a) = g(a)(1 − g(a))
Tanh Activation Function Gradient
• Hyperbolic tangent (‘‘tanh’’) activation function:
g(a) = tanh(a) = (exp(a) − exp(−a)) / (exp(a) + exp(−a)) = (exp(2a) − 1) / (exp(2a) + 1)
- Partial derivative: g′(a) = 1 − g(a)²
ReLU Activation Function Gradient
• Rectified linear (ReLU) activation function: g(a) = reclin(a) = max(0, a)
- Partial derivative: g′(a) = 1_(a>0)
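As a quick check, each activation gradient above can be verified against a finite-difference estimate (the function names are mine):

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def d_sigm(a):
    # g'(a) = g(a) (1 - g(a))
    return sigm(a) * (1.0 - sigm(a))

def d_tanh(a):
    # g'(a) = 1 - g(a)^2
    return 1.0 - np.tanh(a) ** 2

def d_reclin(a):
    # g'(a) = 1 if a > 0, else 0
    return (a > 0).astype(float)
```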
Gradient Computation
• Loss gradient of the parameters:
- Partial derivative (weights):
∂(−log f(x)_y)/∂W_{i,j}^(k) = [∂(−log f(x)_y)/∂a^(k)(x)_i] · h^(k−1)(x)_j
- Gradient (weights):
∇_{W^(k)} (−log f(x)_y) = ∇_{a^(k)(x)} (−log f(x)_y) · h^(k−1)(x)ᵀ
- Partial derivative (biases):
∂(−log f(x)_y)/∂b_i^(k) = [∂(−log f(x)_y)/∂a^(k)(x)_i] · [∂a^(k)(x)_i/∂b_i^(k)] = ∂(−log f(x)_y)/∂a^(k)(x)_i
- Gradient (biases):
∇_{b^(k)} (−log f(x)_y) = ∇_{a^(k)(x)} (−log f(x)_y)
(Remember: a^(k)(x)_i = b_i^(k) + Σ_j W_{i,j}^(k) h^(k−1)(x)_j)
Backpropagation Algorithm
• Perform forward propagation.
• Compute the output gradient (before activation):
∇_{a^(L+1)(x)} (−log f(x)_y) ⇐ −(e(y) − f(x))
• For k = L+1 to 1:
- Compute gradients w.r.t. the hidden layer parameters:
∇_{W^(k)} (−log f(x)_y) ⇐ ∇_{a^(k)(x)} (−log f(x)_y) · h^(k−1)(x)ᵀ
∇_{b^(k)} (−log f(x)_y) ⇐ ∇_{a^(k)(x)} (−log f(x)_y)
- Compute gradients w.r.t. the hidden layer below:
∇_{h^(k−1)(x)} (−log f(x)_y) ⇐ W^(k)ᵀ ∇_{a^(k)(x)} (−log f(x)_y)
- Compute gradients w.r.t. the hidden layer below (before activation):
∇_{a^(k−1)(x)} (−log f(x)_y) ⇐ ∇_{h^(k−1)(x)} (−log f(x)_y) ⊙ […, g′(a^(k−1)(x)_j), …]ᵀ
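The algorithm above can be sketched end-to-end, assuming tanh hidden layers and a softmax output with the −log f(x)_y loss; this is a sketch under those assumptions, not the course's reference code:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def backprop(x, y, Ws, bs):
    """One forward + backward pass for tanh hidden layers and a softmax
    output, with loss -log f(x)_y. Ws/bs hold W^(1..L+1), b^(1..L+1).
    Returns (dWs, dbs, f) with gradients shaped like Ws and bs."""
    # Forward propagation, storing h^(0) = x and each h^(k)
    hs = [x]
    for W, b in zip(Ws[:-1], bs[:-1]):
        hs.append(np.tanh(b + W @ hs[-1]))
    f = softmax(bs[-1] + Ws[-1] @ hs[-1])          # f(x)

    # Output gradient (before activation): -(e(y) - f(x))
    e_y = np.zeros_like(f)
    e_y[y] = 1.0
    grad_a = -(e_y - f)

    dWs, dbs = [], []
    for k in range(len(Ws) - 1, -1, -1):
        dWs.append(np.outer(grad_a, hs[k]))        # grad wrt W^(k): grad_a h^(k-1)^T
        dbs.append(grad_a.copy())                  # grad wrt b^(k): grad_a
        if k > 0:
            grad_h = Ws[k].T @ grad_a              # grad wrt h^(k-1)
            grad_a = grad_h * (1.0 - hs[k] ** 2)   # times g'(a) = 1 - tanh(a)^2
    return dWs[::-1], dbs[::-1], f
```

A finite-difference check on one weight confirms the gradients agree with the loss.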
Computational Flow Graph
• Forward propagation can be represented
as an acyclic flow graph
L2 Regularization
• Ω(θ) = Σ_k Σ_i Σ_j (W_{i,j}^(k))² = Σ_k ||W^(k)||²_F
• Gradient: ∇_{W^(k)} Ω(θ) = 2 W^(k)

L1 Regularization
• Ω(θ) = Σ_k Σ_i Σ_j |W_{i,j}^(k)|
• Gradient: ∇_{W^(k)} Ω(θ) = sign(W^(k))
where sign(W^(k))_{i,j} = 1_(W_{i,j}^(k) > 0) − 1_(W_{i,j}^(k) < 0)
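Both regularizers and their gradients can be sketched directly from the formulas above (the function names are mine):

```python
import numpy as np

def l2_penalty(Ws):
    # Omega(theta) = sum_k ||W^(k)||_F^2; gradient is 2 W^(k)
    return sum((W ** 2).sum() for W in Ws), [2.0 * W for W in Ws]

def l1_penalty(Ws):
    # Omega(theta) = sum_k sum_{i,j} |W^(k)_{i,j}|; gradient is sign(W^(k))
    # (np.sign matches 1_{w>0} - 1_{w<0}, and is 0 at w = 0)
    return sum(np.abs(W).sum() for W in Ws), [np.sign(W) for W in Ws]
```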
Bias-Variance Trade-off
• Trade-off between bias and variance: with very flexible models (high complexity) we have low bias and high variance; with relatively rigid models (low complexity) we have high bias and low variance.
• The model with the optimal predictive capabilities has to balance bias and variance.
• Consider the sinusoidal dataset: we generate 100 datasets, each containing N = 25 points, drawn independently from a sinusoidal curve with additive noise.
(Figure: fitted curves exhibiting high variance vs. low variance.)
Initialization
• If all weights in a layer are initialized to the same value:
Ø All hidden units in a layer will always behave the same
Ø Need to break symmetry
• Instead, sample each weight from a uniform distribution centered at 0:
W_{i,j}^(k) ∼ U[−b, b],  where b = √6 / √(H_k + H_{k−1})
and H_k is the size of h^(k)(x)
Ø Sampling around 0 breaks the symmetry
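A minimal sketch of this initialization scheme; starting the biases at 0 is a common default, not something stated in the excerpt above:

```python
import numpy as np

def init_weights(sizes, rng=None):
    """sizes = [H_0 (input dim), H_1, ..., H_{L+1}].
    Samples W^(k)_{ij} ~ U[-b, b] with b = sqrt(6) / sqrt(H_k + H_{k-1});
    biases start at 0 (a common default, not from the slide)."""
    if rng is None:
        rng = np.random.default_rng(0)
    Ws, bs = [], []
    for h_prev, h in zip(sizes[:-1], sizes[1:]):
        b = np.sqrt(6.0) / np.sqrt(h + h_prev)
        Ws.append(rng.uniform(-b, b, size=(h, h_prev)))
        bs.append(np.zeros(h))
    return Ws, bs
```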
Model Selection
• Training Protocol:
- Train your model on the Training Set: D^train = {(x^(t), y^(t))}
- For model selection, use the Validation Set: D^valid
Ø Hyper-parameter search: hidden layer size, learning rate, number of iterations/epochs, etc.
- Estimate generalization performance using the Test Set: D^test
Mini-batch, Momentum
• Make updates based on a mini-batch of examples (instead of a single example):
Ø the gradient is that of the average regularized loss over the mini-batch
Ø can give a more accurate estimate of the gradient
Ø can leverage matrix/matrix operations, which are more efficient
• For convergence, the learning rate α_t should satisfy Σ_t α_t = ∞ and Σ_t α_t² < ∞, e.g. α_t = α / (1 + δt), or α_t = α t^(−δ) with 0.5 < δ ≤ 1.
• Momentum: can use an exponential average of previous gradients:
∇̄_θ^(t) = ∇_θ l(f(x^(t)), y^(t)) + β ∇̄_θ^(t−1)
Ø can get past plateaus more quickly, by ‘‘gaining momentum’’
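The momentum update can be sketched as follows; the momentum coefficient is written `beta` here (the symbol is mine), and the step applies θ ← θ − α ∇̄:

```python
def momentum_step(theta, grad, velocity, alpha=0.1, beta=0.9):
    """Exponential average of previous gradients:
    v^(t) = grad^(t) + beta * v^(t-1);  theta <- theta - alpha * v^(t).
    Works for scalars or NumPy arrays alike."""
    velocity = grad + beta * velocity
    return theta - alpha * velocity, velocity
```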
Adapting Learning Rates
• Updates with adaptive learning rates (“one learning rate per parameter”):
Ø Adagrad: learning rates are scaled by the square root of the cumulative sum of squared gradients:
γ^(t) = γ^(t−1) + ( ∇_θ l(f(x^(t)), y^(t)) )²
∇̄_θ^(t) = ∇_θ l(f(x^(t)), y^(t)) / √(γ^(t) + ε)
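A sketch of one Adagrad step following the two formulas above:

```python
import numpy as np

def adagrad_step(theta, grad, accum, alpha=0.1, eps=1e-8):
    """Adagrad: gamma^(t) = gamma^(t-1) + grad^2, and the step is
    grad / sqrt(gamma^(t) + eps), one effective rate per parameter."""
    accum = accum + grad ** 2
    return theta - alpha * grad / np.sqrt(accum + eps), accum
```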
Gradient Checking
• To debug your gradient implementation, compare it against a finite-difference approximation:
∂f(x)/∂x ≈ ( f(x + ε) − f(x − ε) ) / (2ε)
Ø f(x) would be the loss
Ø x would be a parameter
Ø f(x + ε) would be the loss if you add ε to the parameter
Ø f(x − ε) would be the loss if you subtract ε from the parameter
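The finite-difference check can be sketched as follows, sweeping over every entry of a (contiguous) parameter array:

```python
import numpy as np

def finite_diff_grad(loss, theta, eps=1e-6):
    """Central-difference estimate of d loss / d theta_i for every
    parameter: (f(x + eps) - f(x - eps)) / (2 eps)."""
    grad = np.zeros_like(theta, dtype=float)
    flat, gflat = theta.ravel(), grad.ravel()   # views into theta and grad
    for i in range(flat.size):
        old = flat[i]
        flat[i] = old + eps
        lp = loss(theta)                        # loss with eps added
        flat[i] = old - eps
        lm = loss(theta)                        # loss with eps subtracted
        flat[i] = old                           # restore the parameter
        gflat[i] = (lp - lm) / (2.0 * eps)
    return grad
```

Compare the result against your analytic gradients entry by entry.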
Debugging on Small Dataset
• Next, make sure your model can overfit on a smaller dataset
(~ 500-1000 examples)
• If not, investigate the following situations:
Ø Are some of the units saturated, even before the first update?
• scale down the initialization of your parameters for these units
• properly normalize the inputs
Ø Is the training error bouncing up and down?
• decrease the learning rate
• This does not mean that you have computed gradients correctly:
Ø You could still overfit with some of the gradients being wrong