Lecture NN Part1
Deep Learning
Russ Salakhutdinov
Machine Learning Department
rsalakhu@cs.cmu.edu
http://www.cs.cmu.edu/~rsalakhu/10707/
Neural Networks I
Neural Networks Online Course
• Disclaimer: Much of the material and slides for this lecture were
borrowed from Hugo Larochelle’s class on Neural Networks:
https://sites.google.com/site/deeplearningsummerschool2016/
Artificial Neuron
• Neuron pre-activation (input):
a(x) = b + Σ_j w_j x_j = b + wᵀx
• Neuron (output) activation:
h(x) = g(a(x)) = g(b + Σ_j w_j x_j)
• w are the connection weights, b is the neuron bias, g(·) is the activation function
• The range of h(x) is determined by g(·); the bias b only changes the position of the riff
(from Pascal Vincent's slides)
Activation Function
• Linear activation function:
g(a) = a
Ø No nonlinear transformation
Ø No input squashing
• Sigmoid activation function:
g(a) = sigm(a) = 1 / (1 + exp(−a))
Ø Squashes the neuron's activation between 0 and 1
Ø Bounded
Ø Strictly increasing
Activation Function
• Hyperbolic tangent (‘‘tanh’’) activation function:
g(a) = tanh(a) = (exp(a) − exp(−a)) / (exp(a) + exp(−a)) = (exp(2a) − 1) / (exp(2a) + 1)
Ø Squashes the neuron's activation between −1 and 1
Ø Can be positive or negative
Ø Bounded
Ø Strictly increasing
Activation Function
• Rectified linear (ReLU) activation function:
g(a) = reclin(a) = max(0, a)
Ø Bounded below by 0 (always non-negative)
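The activation functions above can be sketched in NumPy (the function names are mine, not from the slides):

```python
import numpy as np

def linear(a):
    # g(a) = a: no nonlinear transformation, no input squashing
    return a

def sigm(a):
    # g(a) = 1 / (1 + exp(-a)): bounded in (0, 1), strictly increasing
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    # g(a) = (exp(2a) - 1) / (exp(2a) + 1): bounded in (-1, 1)
    return np.tanh(a)

def reclin(a):
    # g(a) = max(0, a): bounded below by 0, always non-negative
    return np.maximum(0.0, a)
```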
Capacity of a Single Neuron
• A single neuron can represent linearly separable Boolean functions such as OR(x1, x2) and AND(x1, x2), but not XOR(x1, x2).
• Need to transform the input into a better representation.
• Remember basis functions!
(Figure 1.8, from Pascal Vincent's slides — example of modeling XOR with a one-hidden-layer network. Top, left to right: the Boolean functions OR(x1, x2), AND(x1, x2), and AND(x̄1, x2). Bottom: XOR(x1, x2) as a function of x1 and x2 (left), then as a function of AND(x̄1, x2) and AND(x1, x̄2) (right).)
Single Hidden Layer Neural Net
• Hidden layer pre-activation:
a(x) = b^(1) + W^(1) x,  with a(x)_i = b_i^(1) + Σ_j W_{i,j}^(1) x_j
• Hidden layer activation:
h(x) = g(a(x))
• Output layer activation:
f(x) = o(b^(2) + w^(2)ᵀ h(x))
where o(·) is the output activation function; for binary classification with a sigmoid output, f(x) estimates p(y = 1|x).
Softmax Activation Function
‣ Remember multi-way classification:
- We need multiple outputs (1 output per class)
- We need to estimate the conditional probability: p(y = c|x)
- Discriminative learning
‣ Softmax activation function at the output:
o(a) = softmax(a) = [ exp(a_1)/Σ_c exp(a_c), …, exp(a_C)/Σ_c exp(a_c) ]ᵀ
- strictly positive
- sums to one
‣ Predict the class with the highest estimated class-conditional probability.
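A minimal NumPy sketch of the softmax above; subtracting the maximum before exponentiating is a standard numerical-stability trick, not something stated on the slide:

```python
import numpy as np

def softmax(a):
    # Subtracting max(a) before exponentiating avoids overflow
    # and leaves the result unchanged.
    e = np.exp(a - np.max(a))
    return e / e.sum()
```

The output is strictly positive and sums to one, as required for p(y = c|x).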
Multilayer Neural Net
• Consider a network with L hidden layers.
- Layer pre-activation for k > 0 (with h^(0)(x) = x):
a^(k)(x) = b^(k) + W^(k) h^(k−1)(x)
- Hidden layer activation, for k from 1 to L:
h^(k)(x) = g(a^(k)(x))
- Output layer activation (k = L+1):
h^(L+1)(x) = o(a^(L+1)(x)) = f(x)
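The forward-propagation recurrence above can be sketched as follows, assuming tanh hidden units and a softmax output (that choice of activations is mine):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def forward(x, Ws, bs):
    """Forward propagation with L hidden layers.
    Ws, bs hold W^(1..L+1) and b^(1..L+1)."""
    h = x                                   # h^(0)(x) = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        # a^(k) = b^(k) + W^(k) h^(k-1),  h^(k) = g(a^(k))
        h = np.tanh(b + W @ h)
    return softmax(bs[-1] + Ws[-1] @ h)     # f(x) = o(a^(L+1)), here o = softmax
```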
(Figure, from Pascal Vincent's slides: a single-hidden-layer network computing XOR, with input layer (entrée) x1, x2, hidden layer (cachée), bias (biais) units, weights w_ji and w_kj, and output (sortie) z, together with the decision regions the network carves out.)
Capacity of Neural Nets
• The expressive power of neural networks: consider a single hidden layer neural network.
• Universal approximation: with enough hidden units, a single hidden layer neural network can approximate any continuous function arbitrarily well (Hornik, 1991).
• However, this does not mean that there is a learning algorithm that can find the necessary parameter values.
Feedforward Neural Networks
‣ How neural networks predict f(x) given an input x:
- Forward propagation
- Types of units
- Capacity of neural networks
• Training is cast as empirical risk minimization:
arg min_θ (1/T) Σ_t l(f(x^(t); θ), y^(t)) + λ Ω(θ)
where l is the loss function and Ω(θ) the regularizer (λ controls the regularization strength).
• Let us start by considering a classification problem with a softmax output layer.
• We need to estimate: f(x)_c = p(y = c|x)
• We can maximize the log-probability of the correct class given an input: log p(y^(t)|x^(t))
• Equivalently, we can minimize the negative log-likelihood:
l(f(x), y) = −Σ_c 1_(y=c) log f(x)_c = −log f(x)_y
• As seen before, this is also known as the cross-entropy loss for multi-class classification problems.
Stochastic Gradient Descent
• Perform updates after seeing each example:
- Initialize: θ ≡ {W^(1), b^(1), …, W^(L+1), b^(L+1)}
- For each training epoch:
  - for each training example (x^(t), y^(t)):
    Δ = −∇_θ l(f(x^(t); θ), y^(t)) − λ ∇_θ Ω(θ)
    θ ← θ + α Δ
• A training epoch is one iteration over all examples.
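The update loop above can be sketched as follows; `grad_loss` and `grad_reg` are hypothetical callables standing in for the gradient procedures derived later in the lecture:

```python
def sgd(theta, grad_loss, grad_reg, data, alpha=0.1, lam=0.01, epochs=5):
    """Stochastic gradient descent as on the slide:
    Delta = -grad l(f(x; theta), y) - lambda * grad Omega(theta),
    theta <- theta + alpha * Delta, updating after each example."""
    for _ in range(epochs):            # one epoch = iteration over all examples
        for x, y in data:              # update after seeing each example
            delta = -grad_loss(theta, x, y) - lam * grad_reg(theta)
            theta = theta + alpha * delta
    return theta
```

With a toy quadratic loss l = (θ − y)², the iterates converge to the target y.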
• To train a neural net, we need:
Ø A loss function: l(f(x^(t); θ), y^(t))
Ø A procedure to compute the gradients: ∇_θ l(f(x^(t); θ), y^(t))
Ø A regularizer and its gradient: Ω(θ), ∇_θ Ω(θ)
Gradient Computation
• Loss gradient at the output (pre-activation):
∂/∂a^(L+1)(x)_c [−log f(x)_y]
= −(1/f(x)_y) · ∂f(x)_y/∂a^(L+1)(x)_c
= −(1/f(x)_y) · ∂softmax(a^(L+1)(x))_y/∂a^(L+1)(x)_c
= −(1/f(x)_y) · ∂/∂a^(L+1)(x)_c [ exp(a^(L+1)(x)_y) / Σ_c′ exp(a^(L+1)(x)_c′) ]
= −(1/f(x)_y) · [ 1_(y=c) exp(a^(L+1)(x)_y) · Σ_c′ exp(a^(L+1)(x)_c′) − exp(a^(L+1)(x)_y) · exp(a^(L+1)(x)_c) ] / [ Σ_c′ exp(a^(L+1)(x)_c′) ]²
= −(1/f(x)_y) · [ 1_(y=c) softmax(a^(L+1)(x))_y − softmax(a^(L+1)(x))_y · softmax(a^(L+1)(x))_c ]
= −( 1_(y=c) − f(x)_c )
• In vector form:
∇_{a^(L+1)(x)} [−log f(x)_y] = −(e(y) − f(x))
where e(y) is the one-hot indicator vector of class y.
Gradient Computation
• Loss gradient at the hidden layers:
- Chain rule reminder: for p(a) depending on a through intermediates q_i(a):
∂p(a)/∂a = Σ_i [∂p(a)/∂q_i(a)] · [∂q_i(a)/∂a]
- Partial derivative:
∂(−log f(x)_y)/∂h^(k)(x)_j = Σ_i [∂(−log f(x)_y)/∂a^(k+1)(x)_i] · [∂a^(k+1)(x)_i/∂h^(k)(x)_j]
= Σ_i [∂(−log f(x)_y)/∂a^(k+1)(x)_i] · W_{i,j}^(k+1)
(Remember: a^(k)(x)_i = b_i^(k) + Σ_j W_{i,j}^(k) h^(k−1)(x)_j)
- Gradient:
∇_{h^(k)(x)} (−log f(x)_y) = W^(k+1)ᵀ ∇_{a^(k+1)(x)} (−log f(x)_y)
• Loss gradient at the hidden layers (pre-activation):
- Partial derivative:
∂(−log f(x)_y)/∂a^(k)(x)_j = [∂(−log f(x)_y)/∂h^(k)(x)_j] · g′(a^(k)(x)_j)
(Remember: h^(k)(x)_j = g(a^(k)(x)_j))
- Gradient:
∇_{a^(k)(x)} (−log f(x)_y) = ∇_{h^(k)(x)} (−log f(x)_y) ⊙ […, g′(a^(k)(x)_j), …]ᵀ
(elementwise product with the gradient of the activation function)
Let's look at the gradients of activation functions.
Linear Activation Function Gradient
• Linear activation function: g(a) = a
- Partial derivative: g′(a) = 1
Sigmoid Activation Function Gradient
• Sigmoid activation function: g(a) = sigm(a) = 1 / (1 + exp(−a))
- Partial derivative: g′(a) = g(a)(1 − g(a))
Tanh Activation Function Gradient
• Hyperbolic tangent (‘‘tanh’’) activation function:
g(a) = tanh(a) = (exp(a) − exp(−a)) / (exp(a) + exp(−a)) = (exp(2a) − 1) / (exp(2a) + 1)
- Partial derivative: g′(a) = 1 − g(a)²
ReLU Activation Function Gradient
• Rectified linear (ReLU) activation function: g(a) = reclin(a) = max(0, a)
- Partial derivative: g′(a) = 1_(a>0)
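As a quick check, each activation gradient above can be verified against a finite-difference estimate (the function names are mine):

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def d_sigm(a):
    # g'(a) = g(a) (1 - g(a))
    return sigm(a) * (1.0 - sigm(a))

def d_tanh(a):
    # g'(a) = 1 - g(a)^2
    return 1.0 - np.tanh(a) ** 2

def d_reclin(a):
    # g'(a) = 1 if a > 0, else 0
    return (a > 0).astype(float)
```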
Gradient Computation
• Loss gradient of the parameters:
- Partial derivative (weights):
∂(−log f(x)_y)/∂W_{i,j}^(k) = [∂(−log f(x)_y)/∂a^(k)(x)_i] · h^(k−1)(x)_j
- Gradient (weights):
∇_{W^(k)} (−log f(x)_y) = ∇_{a^(k)(x)} (−log f(x)_y) · h^(k−1)(x)ᵀ
- Partial derivative (biases):
∂(−log f(x)_y)/∂b_i^(k) = [∂(−log f(x)_y)/∂a^(k)(x)_i] · [∂a^(k)(x)_i/∂b_i^(k)] = ∂(−log f(x)_y)/∂a^(k)(x)_i
- Gradient (biases):
∇_{b^(k)} (−log f(x)_y) = ∇_{a^(k)(x)} (−log f(x)_y)
(Remember: a^(k)(x)_i = b_i^(k) + Σ_j W_{i,j}^(k) h^(k−1)(x)_j)
Backpropagation Algorithm
• Perform forward propagation.
• Compute the output gradient (before activation):
∇_{a^(L+1)(x)} (−log f(x)_y) ⇐ −(e(y) − f(x))
• For k = L+1 to 1:
- Compute gradients w.r.t. the hidden layer parameters:
∇_{W^(k)} (−log f(x)_y) ⇐ ∇_{a^(k)(x)} (−log f(x)_y) · h^(k−1)(x)ᵀ
∇_{b^(k)} (−log f(x)_y) ⇐ ∇_{a^(k)(x)} (−log f(x)_y)
- Compute gradients w.r.t. the hidden layer below:
∇_{h^(k−1)(x)} (−log f(x)_y) ⇐ W^(k)ᵀ ∇_{a^(k)(x)} (−log f(x)_y)
- Compute gradients w.r.t. the hidden layer below (before activation):
∇_{a^(k−1)(x)} (−log f(x)_y) ⇐ ∇_{h^(k−1)(x)} (−log f(x)_y) ⊙ […, g′(a^(k−1)(x)_j), …]ᵀ
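The algorithm above can be sketched end-to-end, assuming tanh hidden layers and a softmax output with the −log f(x)_y loss; this is a sketch under those assumptions, not the course's reference code:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def backprop(x, y, Ws, bs):
    """One forward + backward pass for tanh hidden layers and a softmax
    output, with loss -log f(x)_y. Ws/bs hold W^(1..L+1), b^(1..L+1).
    Returns (dWs, dbs, f) with gradients shaped like Ws and bs."""
    # Forward propagation, storing h^(0) = x and each h^(k)
    hs = [x]
    for W, b in zip(Ws[:-1], bs[:-1]):
        hs.append(np.tanh(b + W @ hs[-1]))
    f = softmax(bs[-1] + Ws[-1] @ hs[-1])          # f(x)

    # Output gradient (before activation): -(e(y) - f(x))
    e_y = np.zeros_like(f)
    e_y[y] = 1.0
    grad_a = -(e_y - f)

    dWs, dbs = [], []
    for k in range(len(Ws) - 1, -1, -1):
        dWs.append(np.outer(grad_a, hs[k]))        # grad wrt W^(k): grad_a h^(k-1)^T
        dbs.append(grad_a.copy())                  # grad wrt b^(k): grad_a
        if k > 0:
            grad_h = Ws[k].T @ grad_a              # grad wrt h^(k-1)
            grad_a = grad_h * (1.0 - hs[k] ** 2)   # times g'(a) = 1 - tanh(a)^2
    return dWs[::-1], dbs[::-1], f
```

A finite-difference check on one weight confirms the gradients agree with the loss.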
Computational Flow Graph
• Forward propagation can be represented
as an acyclic flow graph
L2 Regularization
• Ω(θ) = Σ_k Σ_i Σ_j (W_{i,j}^(k))² = Σ_k ||W^(k)||²_F
• Gradient: ∇_{W^(k)} Ω(θ) = 2 W^(k)

L1 Regularization
• Ω(θ) = Σ_k Σ_i Σ_j |W_{i,j}^(k)|
• Gradient: ∇_{W^(k)} Ω(θ) = sign(W^(k))
where sign(W^(k))_{i,j} = 1_(W_{i,j}^(k) > 0) − 1_(W_{i,j}^(k) < 0)
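Both regularizers and their gradients can be sketched directly from the formulas above (the function names are mine):

```python
import numpy as np

def l2_penalty(Ws):
    # Omega(theta) = sum_k ||W^(k)||_F^2; gradient is 2 W^(k)
    return sum((W ** 2).sum() for W in Ws), [2.0 * W for W in Ws]

def l1_penalty(Ws):
    # Omega(theta) = sum_k sum_{i,j} |W^(k)_{i,j}|; gradient is sign(W^(k))
    # (np.sign matches 1_{w>0} - 1_{w<0}, and is 0 at w = 0)
    return sum(np.abs(W).sum() for W in Ws), [np.sign(W) for W in Ws]
```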
Bias-Variance Trade-off
• Trade-off between bias and variance: with very flexible models (high complexity) we have low bias and high variance; with relatively rigid models (low complexity) we have high bias and low variance.
• The model with the optimal predictive capabilities has to balance bias and variance.
• Consider the sinusoidal dataset: we generate 100 datasets, each containing N = 25 points, drawn independently from a sinusoidal curve with additive noise.
(Figure: fitted curves exhibiting high variance vs. low variance.)
Initialization
• If all weights in a layer are initialized to the same value:
Ø All hidden units in a layer will always behave the same
Ø Need to break symmetry
• Instead, sample each weight from a uniform distribution centered at 0:
W_{i,j}^(k) ∼ U[−b, b],  where b = √6 / √(H_k + H_{k−1})
and H_k is the size of h^(k)(x)
Ø Sampling around 0 breaks the symmetry
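A minimal sketch of this initialization scheme; starting the biases at 0 is a common default, not something stated in the excerpt above:

```python
import numpy as np

def init_weights(sizes, rng=None):
    """sizes = [H_0 (input dim), H_1, ..., H_{L+1}].
    Samples W^(k)_{ij} ~ U[-b, b] with b = sqrt(6) / sqrt(H_k + H_{k-1});
    biases start at 0 (a common default, not from the slide)."""
    if rng is None:
        rng = np.random.default_rng(0)
    Ws, bs = [], []
    for h_prev, h in zip(sizes[:-1], sizes[1:]):
        b = np.sqrt(6.0) / np.sqrt(h + h_prev)
        Ws.append(rng.uniform(-b, b, size=(h, h_prev)))
        bs.append(np.zeros(h))
    return Ws, bs
```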
Model Selection
• Training Protocol:
- Train your model on the Training Set: D^train = {(x^(t), y^(t))}
- For model selection, use the Validation Set: D^valid
Ø Hyper-parameter search: hidden layer size, learning rate, number of iterations/epochs, etc.
- Estimate generalization performance using the Test Set: D^test
Mini-batch, Momentum
• Make updates based on a mini-batch of examples (instead of a single example):
Ø the gradient is that of the average regularized loss over the mini-batch
Ø can give a more accurate estimate of the gradient
Ø can leverage matrix/matrix operations, which are more efficient
• For convergence, the learning rate α_t should satisfy Σ_t α_t = ∞ and Σ_t α_t² < ∞, e.g. α_t = α / (1 + δt), or α_t = α t^(−δ) with 0.5 < δ ≤ 1.
• Momentum: can use an exponential average of previous gradients:
∇̄_θ^(t) = ∇_θ l(f(x^(t)), y^(t)) + β ∇̄_θ^(t−1)
Ø can get past plateaus more quickly, by ‘‘gaining momentum’’
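The momentum update can be sketched as follows; the momentum coefficient is written `beta` here (the symbol is mine), and the step applies θ ← θ − α ∇̄:

```python
def momentum_step(theta, grad, velocity, alpha=0.1, beta=0.9):
    """Exponential average of previous gradients:
    v^(t) = grad^(t) + beta * v^(t-1);  theta <- theta - alpha * v^(t).
    Works for scalars or NumPy arrays alike."""
    velocity = grad + beta * velocity
    return theta - alpha * velocity, velocity
```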
Adapting Learning Rates
• Updates with adaptive learning rates (“one learning rate per parameter”):
Ø Adagrad: learning rates are scaled by the square root of the cumulative sum of squared gradients:
γ^(t) = γ^(t−1) + ( ∇_θ l(f(x^(t)), y^(t)) )²
∇̄_θ^(t) = ∇_θ l(f(x^(t)), y^(t)) / √(γ^(t) + ε)
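A sketch of one Adagrad step following the two formulas above:

```python
import numpy as np

def adagrad_step(theta, grad, accum, alpha=0.1, eps=1e-8):
    """Adagrad: gamma^(t) = gamma^(t-1) + grad^2, and the step is
    grad / sqrt(gamma^(t) + eps), one effective rate per parameter."""
    accum = accum + grad ** 2
    return theta - alpha * grad / np.sqrt(accum + eps), accum
```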
Gradient Checking
• To debug your gradient implementation, compare it against a finite-difference approximation:
∂f(x)/∂x ≈ ( f(x + ε) − f(x − ε) ) / (2ε)
Ø f(x) would be the loss
Ø x would be a parameter
Ø f(x + ε) would be the loss if you add ε to the parameter
Ø f(x − ε) would be the loss if you subtract ε from the parameter
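The finite-difference check can be sketched as follows, sweeping over every entry of a (contiguous) parameter array:

```python
import numpy as np

def finite_diff_grad(loss, theta, eps=1e-6):
    """Central-difference estimate of d loss / d theta_i for every
    parameter: (f(x + eps) - f(x - eps)) / (2 eps)."""
    grad = np.zeros_like(theta, dtype=float)
    flat, gflat = theta.ravel(), grad.ravel()   # views into theta and grad
    for i in range(flat.size):
        old = flat[i]
        flat[i] = old + eps
        lp = loss(theta)                        # loss with eps added
        flat[i] = old - eps
        lm = loss(theta)                        # loss with eps subtracted
        flat[i] = old                           # restore the parameter
        gflat[i] = (lp - lm) / (2.0 * eps)
    return grad
```

Compare the result against your analytic gradients entry by entry.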
Debugging on Small Dataset
• Next, make sure your model can overfit on a smaller dataset
(~ 500-1000 examples)
• If not, investigate the following situations:
Ø Are some of the units saturated, even before the first update?
• scale down the initialization of your parameters for these units
• properly normalize the inputs
Ø Is the training error bouncing up and down?
• decrease the learning rate
• This does not mean that you have computed gradients correctly:
Ø You could still overfit with some of the gradients being wrong