Derivations For Back Propagation of Multilayer Neural Network
Noshaba
April 2, 2018
Consider a multilayer neural network as shown in figure 1. There are L layers and $m^{[l]}$ units in each layer, where l is the layer number. The following notation will be used in these derivations:
• $w_{ij}^{[l]}$: the weight of the connection between the ith unit in layer l-1 and the jth unit in layer l
• $a_j^{[0]}$: equal to the input $x_j$

$z_j^{[l]} = \sum_{i=1}^{m^{[l-1]}} w_{ij}^{[l]}\, a_i^{[l-1]} + b_j^{[l]}$  (2)

$z_j^{[L]} = \sum_{i=1}^{m^{[L-1]}} w_{ij}^{[L]}\, a_i^{[L-1]} + b_j^{[L]}$  (4)
During forward propagation, $a_j^{[l]}$ will be saved for use in backward propagation.
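To make the forward pass concrete, here is a minimal Python/NumPy sketch of equations (2) and (4) for a single input vector. The function names, the use of NumPy, and the sigmoid activation are illustrative assumptions, not part of the derivation.

import numpy as np

def sigmoid(z):
    # f(z) = 1 / (1 + exp(-z)); any differentiable activation f could be used here
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    # weights[l-1][i, j] is w_ij^[l], connecting unit i of layer l-1 to unit j of layer l;
    # biases[l-1][j] is b_j^[l]. x is the input vector a^[0].
    a = np.asarray(x, dtype=float)
    activations = [a]                  # a^[l] is saved for use in backward propagation
    for W, b in zip(weights, biases):
        z = a @ W + b                  # eq. (2): z_j^[l] = sum_i w_ij^[l] a_i^[l-1] + b_j^[l]
        a = sigmoid(z)                 # a_j^[l] = f(z_j^[l])
        activations.append(a)
    return activations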
There are many ways to define the error; here we use the following equation, for one instance. This error sums the squares of the errors at the output units:

$E = \frac{1}{2} \sum_{i=1}^{m^{[L]}} \left(a_i^{[L]} - y_i\right)^2$  (5)
Equation (6) can also be written as

$\frac{\partial E}{\partial w_{ij}^{[l]}} = \delta_j^{[l]}\, a_i^{[l-1]}$  (9)

where $\delta_j^{[l]}$ is the error information²:

$\delta_j^{[l]} = \frac{\partial E}{\partial a_j^{[l]}} \frac{\partial a_j^{[l]}}{\partial z_j^{[l]}} = \frac{\partial E}{\partial a_j^{[l]}}\, f'(z_j^{[l]})$  (10)
$\frac{\partial E}{\partial a_j^{[L]}} = \frac{\partial}{\partial a_j^{[L]}} \left[ \frac{1}{2} \sum_{i=1}^{m^{[L]}} \left(a_i^{[L]} - y_i\right)^2 \right] = a_j^{[L]} - y_j$  (11)
$\frac{\partial E}{\partial w_{ij}^{[L]}} = \delta_j^{[L]} \cdot a_i^{[L-1]}$  (13)
The derivative of E with respect to the bias $b_j^{[L]}$ is given by the following equation, since the input on this connection is 1:

$\frac{\partial E}{\partial b_j^{[L]}} = \delta_j^{[L]}$  (14)
The 2nd and 3rd partial derivatives can be found from equations (7) and (8). The first one is derived using the chain rule, as given in (16). The summation shows that $a_j^{[L-1]}$ contributes to all the units in layer L. The first two partial derivatives in (16) are the error information of the kth unit of layer L.

$\frac{\partial E}{\partial a_j^{[L-1]}} = \sum_{k=1}^{m^{[L]}} \frac{\partial E}{\partial a_k^{[L]}} \frac{\partial a_k^{[L]}}{\partial z_k^{[L]}} \frac{\partial z_k^{[L]}}{\partial a_j^{[L-1]}} = \sum_{k=1}^{m^{[L]}} \delta_k^{[L]}\, w_{jk}^{[L]}$  (16)
² $\delta_j^{[l]} = \frac{\partial E}{\partial z_j^{[l]}}$
Plugging the values of equations (16), (7) and (8) into equation (15):

$\frac{\partial E}{\partial w_{ij}^{[L-1]}} = \left[ \sum_{k=1}^{m^{[L]}} \delta_k^{[L]}\, w_{jk}^{[L]} \right] f'(z_j^{[L-1]})\, a_i^{[L-2]} = \delta_j^{[L-1]} \cdot a_i^{[L-2]}$  (17)
Generalizing to any layer l, the gradient and update equations are:

$\frac{\partial E}{\partial w_{ij}^{[l]}} = \delta_j^{[l]} \cdot a_i^{[l-1]}$  (20)

$\frac{\partial E}{\partial b_j^{[l]}} = \delta_j^{[l]}$  (21)

$\Delta w_{ij}^{[l]} = -\alpha \frac{\partial E}{\partial w_{ij}^{[l]}}$  (22)

$\Delta b_j^{[l]} = -\alpha \frac{\partial E}{\partial b_j^{[l]}}$  (23)

$w_{ij}^{[l]} = w_{ij}^{[l]} + \Delta w_{ij}^{[l]}$  (24)

$b_j^{[l]} = b_j^{[l]} + \Delta b_j^{[l]}$  (25)
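The derivation above translates directly into code. The following is a minimal NumPy sketch of the backward pass and the parameter update for one training instance, following equations (9) through (25); it assumes the sigmoid activation (so that $f'(z) = a(1-a)$) and reuses the activations saved by the forward pass sketched earlier. The names are illustrative.

import numpy as np

def backward_and_update(y, weights, biases, activations, alpha):
    # activations = [a^[0], ..., a^[L]] as saved during the forward pass
    # (sigmoid assumed, so f'(z^[l]) = a^[l] (1 - a^[l])).
    L = len(weights)
    a_L = activations[-1]
    delta = (a_L - y) * a_L * (1.0 - a_L)            # eqs. (10), (11): delta^[L]
    for l in range(L, 0, -1):                        # layers L, L-1, ..., 1
        a_prev = activations[l - 1]
        dW = np.outer(a_prev, delta)                 # eq. (20): dE/dw_ij^[l] = delta_j^[l] a_i^[l-1]
        db = delta                                   # eq. (21): dE/db_j^[l]  = delta_j^[l]
        if l > 1:
            # eq. (16): delta^[l-1] uses the weights of layer l before they are updated
            delta = (weights[l - 1] @ delta) * a_prev * (1.0 - a_prev)
        weights[l - 1] += -alpha * dW                # eqs. (22), (24)
        biases[l - 1] += -alpha * db                 # eqs. (23), (25)
    return weights, biases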
1.3 Training Algorithm
Stochastic Gradient Descent is used to train the network.³
³ This algorithm is adapted from Laurene Fausett's Fundamentals of Neural Networks, Section 6.1.2.
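The following is a rough Python sketch of the stochastic gradient descent loop, reusing the forward and backward sketches above; the iteration count, the data names, and the absence of a stopping criterion are simplifying assumptions.

def train_sgd(X, Y, weights, biases, alpha=1.0, iterations=2):
    # Stochastic gradient descent: the weights and biases are updated after
    # every single training instance, for a fixed number of iterations (epochs).
    for _ in range(iterations):
        for x, y in zip(X, Y):
            activations = forward(x, weights, biases)
            weights, biases = backward_and_update(y, weights, biases, activations, alpha)
    return weights, biases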
2 Example
The following is a training example on the training data given in the figure, using the network architecture shown and the sigmoid as the activation function:

$a = f(z) = \frac{1}{1 + \exp(-z)}$  (26)

$f'(z) = f(z)\left(1 - f(z)\right) = a(1 - a)$  (27)
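As a quick numerical check of (26) and (27) (a small sketch; the test point z = 3 is arbitrary but matches the first net value in the trace below):

import numpy as np

z = 3.0
a = 1.0 / (1.0 + np.exp(-z))    # eq. (26): a = f(3) ≈ 0.95257413
fp = a * (1.0 - a)              # eq. (27): f'(3) = a(1 - a) ≈ 0.045177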
Step 1:
Initialize all the weights and biases to 1, and set α = 1.
Step 2:
Iteration 1
Training Instance 0
Input $x_0 = [1\ 1]$, target $y_0 = [0\ 1]$
Forward Propagation
Activation A of layer 0: $a_1^{[0]}, a_2^{[0]} = 1, 1$
Net Z of layer 1: $z_1^{[1]}, z_2^{[1]} = 3.0, 3.0$
Activation A of layer 1: $a_1^{[1]}, a_2^{[1]} = 0.95257413, 0.95257413$
Net Z of layer 2: $z_1^{[2]}, z_2^{[2]} = 2.90514825, 2.90514825$
Activation A of layer 2: $a_1^{[2]}, a_2^{[2]} = 0.94810035, 0.94810035$
Back Propagation
Error info of layer 2: $\delta_1^{[2]}, \delta_2^{[2]} = 0.0466523, -0.00255378$
Error info of layer 1: $\delta_1^{[1]}, \delta_2^{[1]} = 0.00199222, 0.00199222$
Change in W layer 1: $\Delta w_{11}^{[1]}, \Delta w_{12}^{[1]}, \Delta w_{21}^{[1]}, \Delta w_{22}^{[1]} = -0.00199222, -0.00199222, -0.00199222, -0.00199222$
New weights in layer 1: $w_{11}^{[1]}, w_{12}^{[1]}, w_{21}^{[1]}, w_{22}^{[1]} = 0.99800778, 0.99800778, 0.99800778, 0.99800778$
Change in b layer 1: $\Delta b_1^{[1]}, \Delta b_2^{[1]} = -0.00199222, -0.00199222$
New bias in layer 1: $b_1^{[1]}, b_2^{[1]} = 0.99800778, 0.99800778$
Change in W layer 2: $\Delta w_{11}^{[2]}, \Delta w_{12}^{[2]}, \Delta w_{21}^{[2]}, \Delta w_{22}^{[2]} = -0.04443977, 0.00243266, -0.04443977, 0.00243266$
New weights in layer 2: $w_{11}^{[2]}, w_{12}^{[2]}, w_{21}^{[2]}, w_{22}^{[2]} = 0.95556023, 1.00243266, 0.95556023, 1.00243266$
Change in b layer 2: $\Delta b_1^{[2]}, \Delta b_2^{[2]} = -0.0466523, 0.00255378$
New bias in layer 2: $b_1^{[2]}, b_2^{[2]} = 0.9533477, 1.00255378$
Training Instance 1
Input $x_1 = [0\ 0]$, target $y_1 = [0\ 1]$
Forward Propagation
Activation A of layer 0: $a_1^{[0]}, a_2^{[0]} = 0, 0$
Net Z of layer 1: $z_1^{[1]}, z_2^{[1]} = 0.99800778, 0.99800778$
Activation A of layer 1: $a_1^{[1]}, a_2^{[1]} = 0.7306667, 0.7306667$
Net Z of layer 2: $z_1^{[2]}, z_2^{[2]} = 2.34973978, 2.46744212$
Activation A of layer 2: $a_1^{[2]}, a_2^{[2]} = 0.91291354, 0.92182764$
Back Propagation
Error info of layer 2: $\delta_1^{[2]}, \delta_2^{[2]} = 0.07257882, -0.00563321$
Error info of layer 1: $\delta_1^{[1]}, \delta_2^{[1]} = 0.01253699, 0.01253699$
Change in W layer 1: $\Delta w_{11}^{[1]}, \Delta w_{12}^{[1]}, \Delta w_{21}^{[1]}, \Delta w_{22}^{[1]} = -0., -0., -0., -0.$
New weights in layer 1: $w_{11}^{[1]}, w_{12}^{[1]}, w_{21}^{[1]}, w_{22}^{[1]} = 0.99800778, 0.99800778, 0.99800778, 0.99800778$
Change in b layer 1: $\Delta b_1^{[1]}, \Delta b_2^{[1]} = -0.01253699, -0.01253699$
New bias in layer 1: $b_1^{[1]}, b_2^{[1]} = 0.98547079, 0.98547079$
Change in W layer 2: $\Delta w_{11}^{[2]}, \Delta w_{12}^{[2]}, \Delta w_{21}^{[2]}, \Delta w_{22}^{[2]} = -0.05303093, 0.004116, -0.05303093, 0.004116$
New weights in layer 2: $w_{11}^{[2]}, w_{12}^{[2]}, w_{21}^{[2]}, w_{22}^{[2]} = 0.9025293, 1.00654866, 0.9025293, 1.00654866$
Change in b layer 2: $\Delta b_1^{[2]}, \Delta b_2^{[2]} = -0.07257882, 0.00563321$
New bias in layer 2: $b_1^{[2]}, b_2^{[2]} = 0.88076888, 1.00818699$
Training Instance 2
Input $x_2 = [0\ 1]$, target $y_2 = [1\ 0]$
Forward Propagation
Activation A of layer 0: $a_1^{[0]}, a_2^{[0]} = 0, 1$
Net Z of layer 1: $z_1^{[1]}, z_2^{[1]} = 1.98347856, 1.98347856$
Activation A of layer 1: $a_1^{[1]}, a_2^{[1]} = 0.87905149, 0.87905149$
Net Z of layer 2: $z_1^{[2]}, z_2^{[2]} = 2.46750832, 2.7778032$
Activation A of layer 2: $a_1^{[2]}, a_2^{[2]} = 0.92183241, 0.9414645$
Back Propagation
Error info of layer 2: $\delta_1^{[2]}, \delta_2^{[2]} = -0.00563255, 0.05188326$
Error info of layer 1: $\delta_1^{[1]}, \delta_2^{[1]} = 0.00501187, 0.00501187$
Change in W layer 1: $\Delta w_{11}^{[1]}, \Delta w_{12}^{[1]}, \Delta w_{21}^{[1]}, \Delta w_{22}^{[1]} = -0., -0., -0.00501187, -0.00501187$
New weights in layer 1: $w_{11}^{[1]}, w_{12}^{[1]}, w_{21}^{[1]}, w_{22}^{[1]} = 0.99800778, 0.99800778, 0.99299591, 0.99299591$
Change in b layer 1: $\Delta b_1^{[1]}, \Delta b_2^{[1]} = -0.00501187, -0.00501187$
New bias in layer 1: $b_1^{[1]}, b_2^{[1]} = 0.98045892, 0.98045892$
Change in W layer 2: $\Delta w_{11}^{[2]}, \Delta w_{12}^{[2]}, \Delta w_{21}^{[2]}, \Delta w_{22}^{[2]} = 0.00495131, -0.04560806, 0.00495131, -0.04560806$
New weights in layer 2: $w_{11}^{[2]}, w_{12}^{[2]}, w_{21}^{[2]}, w_{22}^{[2]} = 0.9074806, 0.96094061, 0.9074806, 0.96094061$
Change in b layer 2: $\Delta b_1^{[2]}, \Delta b_2^{[2]} = 0.00563255, -0.05188326$
New bias in layer 2: $b_1^{[2]}, b_2^{[2]} = 0.88640143, 0.95630373$
Training Instance 3
Input $x_3 = [1\ 0]$, target $y_3 = [1\ 0]$
Forward Propagation
Activation A of layer 0: $a_1^{[0]}, a_2^{[0]} = 1, 0$
Net Z of layer 1: $z_1^{[1]}, z_2^{[1]} = 1.9784667, 1.9784667$
Activation A of layer 1: $a_1^{[1]}, a_2^{[1]} = 0.87851762, 0.87851762$
Net Z of layer 2: $z_1^{[2]}, z_2^{[2]} = 2.48087682, 2.64471024$
Activation A of layer 2: $a_1^{[2]}, a_2^{[2]} = 0.92279029, 0.93368421$
Back Propagation
Error info of layer 2: $\delta_1^{[2]}, \delta_2^{[2]} = -0.00550107, 0.05781186$
Error info of layer 1: $\delta_1^{[1]}, \delta_2^{[1]} = 0.00539616, 0.00539616$
Change in W layer 1: $\Delta w_{11}^{[1]}, \Delta w_{12}^{[1]}, \Delta w_{21}^{[1]}, \Delta w_{22}^{[1]} = -0.00539616, -0.00539616, -0., -0.$
New weights in layer 1: $w_{11}^{[1]}, w_{12}^{[1]}, w_{21}^{[1]}, w_{22}^{[1]} = 0.99261161, 0.99261161, 0.99299591, 0.99299591$
Change in b layer 1: $\Delta b_1^{[1]}, \Delta b_2^{[1]} = -0.00539616, -0.00539616$
New bias in layer 1: $b_1^{[1]}, b_2^{[1]} = 0.97506276, 0.97506276$
Change in W layer 2: $\Delta w_{11}^{[2]}, \Delta w_{12}^{[2]}, \Delta w_{21}^{[2]}, \Delta w_{22}^{[2]} = 0.00483278, -0.05078874, 0.00483278, -0.05078874$
New weights in layer 2: $w_{11}^{[2]}, w_{12}^{[2]}, w_{21}^{[2]}, w_{22}^{[2]} = 0.91231338, 0.91015187, 0.91231338, 0.91015187$
Change in b layer 2: $\Delta b_1^{[2]}, \Delta b_2^{[2]} = 0.00550107, -0.05781186$
New bias in layer 2: $b_1^{[2]}, b_2^{[2]} = 0.8919025, 0.89849187$
Iteration 2
Training Instance 0
Input $x_0 = [1\ 1]$, target $y_0 = [0\ 1]$
Forward Propagation
Activation A of layer 0: $a_1^{[0]}, a_2^{[0]} = 1, 1$
Net Z of layer 1: $z_1^{[1]}, z_2^{[1]} = 2.96067028, 2.96067028$
Activation A of layer 1: $a_1^{[1]}, a_2^{[1]} = 0.95076538, 0.95076538$
Net Z of layer 2: $z_1^{[2]}, z_2^{[2]} = 2.62669446, 2.62917364$
Activation A of layer 2: $a_1^{[2]}, a_2^{[2]} = 0.93255995, 0.93271571$
Back Propagation
Error info of layer 2: $\delta_1^{[2]}, \delta_2^{[2]} = 0.05865045, -0.00422257$
Error info of layer 1: $\delta_1^{[1]}, \delta_2^{[1]} = 0.00232482, 0.00232482$
Change in W layer 1: $\Delta w_{11}^{[1]}, \Delta w_{12}^{[1]}, \Delta w_{21}^{[1]}, \Delta w_{22}^{[1]} = -0.00232482, -0.00232482, -0.00232482, -0.00232482$
New weights in layer 1: $w_{11}^{[1]}, w_{12}^{[1]}, w_{21}^{[1]}, w_{22}^{[1]} = 0.99028679, 0.99028679, 0.99067109, 0.99067109$
Change in b layer 1: $\Delta b_1^{[1]}, \Delta b_2^{[1]} = -0.00232482, -0.00232482$
New bias in layer 1: $b_1^{[1]}, b_2^{[1]} = 0.97273794, 0.97273794$
Change in W layer 2: $\Delta w_{11}^{[2]}, \Delta w_{12}^{[2]}, \Delta w_{21}^{[2]}, \Delta w_{22}^{[2]} = -0.05576282, 0.00401467, -0.05576282, 0.00401467$
New weights in layer 2: $w_{11}^{[2]}, w_{12}^{[2]}, w_{21}^{[2]}, w_{22}^{[2]} = 0.85655056, 0.91416654, 0.85655056, 0.91416654$
Change in b layer 2: $\Delta b_1^{[2]}, \Delta b_2^{[2]} = -0.05865045, 0.00422257$
New bias in layer 2: $b_1^{[2]}, b_2^{[2]} = 0.83325204, 0.90271444$
Training Instance 1
Input $x_1 = [0\ 0]$, target $y_1 = [0\ 1]$
Forward Propagation
Activation A of layer 0: $a_1^{[0]}, a_2^{[0]} = 0, 0$
Net Z of layer 1: $z_1^{[1]}, z_2^{[1]} = 0.97273794, 0.97273794$
Activation A of layer 1: $a_1^{[1]}, a_2^{[1]} = 0.72566489, 0.72566489$
Net Z of layer 2: $z_1^{[2]}, z_2^{[2]} = 2.07638938, 2.22947156$
Activation A of layer 2: $a_1^{[2]}, a_2^{[2]} = 0.88858708, 0.90286502$
Back Propagation
Error info of layer 2: $\delta_1^{[2]}, \delta_2^{[2]} = 0.08797019, -0.00851872$
Error info of layer 1: $\delta_1^{[1]}, \delta_2^{[1]} = 0.01345021, 0.01345021$
Change in W layer 1: $\Delta w_{11}^{[1]}, \Delta w_{12}^{[1]}, \Delta w_{21}^{[1]}, \Delta w_{22}^{[1]} = -0., -0., -0., -0.$
New weights in layer 1: $w_{11}^{[1]}, w_{12}^{[1]}, w_{21}^{[1]}, w_{22}^{[1]} = 0.99028679, 0.99028679, 0.99067109, 0.99067109$
Change in b layer 1: $\Delta b_1^{[1]}, \Delta b_2^{[1]} = -0.01345021, -0.01345021$
New bias in layer 1: $b_1^{[1]}, b_2^{[1]} = 0.95928773, 0.95928773$
Change in W layer 2: $\Delta w_{11}^{[2]}, \Delta w_{12}^{[2]}, \Delta w_{21}^{[2]}, \Delta w_{22}^{[2]} = -0.06383688, 0.00618173, -0.06383688, 0.00618173$
New weights in layer 2: $w_{11}^{[2]}, w_{12}^{[2]}, w_{21}^{[2]}, w_{22}^{[2]} = 0.79271368, 0.92034827, 0.79271368, 0.92034827$
Change in b layer 2: $\Delta b_1^{[2]}, \Delta b_2^{[2]} = -0.08797019, 0.00851872$
New bias in layer 2: $b_1^{[2]}, b_2^{[2]} = 0.74528185, 0.91123315$
Training Instance 2
Input $x_2 = [0\ 1]$, target $y_2 = [1\ 0]$
Forward Propagation
Activation A of layer 0: $a_1^{[0]}, a_2^{[0]} = 0, 1$
Net Z of layer 1: $z_1^{[1]}, z_2^{[1]} = 1.94995882, 1.94995882$
Activation A of layer 1: $a_1^{[1]}, a_2^{[1]} = 0.87544215, 0.87544215$
Net Z of layer 2: $z_1^{[2]}, z_2^{[2]} = 2.1332318, 2.52265649$
Activation A of layer 2: $a_1^{[2]}, a_2^{[2]} = 0.89409142, 0.92571494$
Back Propagation
Error info of layer 2: $\delta_1^{[2]}, \delta_2^{[2]} = -0.01002869, 0.06365844$
Error info of layer 1: $\delta_1^{[1]}, \delta_2^{[1]} = 0.00552174, 0.00552174$
Change in W layer 1: $\Delta w_{11}^{[1]}, \Delta w_{12}^{[1]}, \Delta w_{21}^{[1]}, \Delta w_{22}^{[1]} = -0., -0., -0.00552174, -0.00552174$
New weights in layer 1: $w_{11}^{[1]}, w_{12}^{[1]}, w_{21}^{[1]}, w_{22}^{[1]} = 0.99028679, 0.99028679, 0.98514935, 0.98514935$
Change in b layer 1: $\Delta b_1^{[1]}, \Delta b_2^{[1]} = -0.00552174, -0.00552174$
New bias in layer 1: $b_1^{[1]}, b_2^{[1]} = 0.95376599, 0.95376599$
Change in W layer 2: $\Delta w_{11}^{[2]}, \Delta w_{12}^{[2]}, \Delta w_{21}^{[2]}, \Delta w_{22}^{[2]} = 0.00877954, -0.05572929, 0.00877954, -0.05572929$
New weights in layer 2: $w_{11}^{[2]}, w_{12}^{[2]}, w_{21}^{[2]}, w_{22}^{[2]} = 0.80149322, 0.86461899, 0.80149322, 0.86461899$
Change in b layer 2: $\Delta b_1^{[2]}, \Delta b_2^{[2]} = 0.01002869, -0.06365844$
New bias in layer 2: $b_1^{[2]}, b_2^{[2]} = 0.75531054, 0.84757471$
Training Instance 3
Input $x_3 = [1\ 0]$, target $y_3 = [1\ 0]$
Forward Propagation
Activation A of layer 0: $a_1^{[0]}, a_2^{[0]} = 1, 0$
Net Z of layer 1: $z_1^{[1]}, z_2^{[1]} = 1.94405279, 1.94405279$
Activation A of layer 1: $a_1^{[1]}, a_2^{[1]} = 0.87479671, 0.87479671$
Net Z of layer 2: $z_1^{[2]}, z_2^{[2]} = 2.15759781, 2.36030639$
Activation A of layer 2: $a_1^{[2]}, a_2^{[2]} = 0.89637663, 0.91374996$
Back Propagation
Error info of layer 2: $\delta_1^{[2]}, \delta_2^{[2]} = -0.00962512, 0.07201352$
Error info of layer 1: $\delta_1^{[1]}, \delta_2^{[1]} = 0.0059747, 0.0059747$
Change in W layer 1: $\Delta w_{11}^{[1]}, \Delta w_{12}^{[1]}, \Delta w_{21}^{[1]}, \Delta w_{22}^{[1]} = -0.0059747, -0.0059747, -0., -0.$
New weights in layer 1: $w_{11}^{[1]}, w_{12}^{[1]}, w_{21}^{[1]}, w_{22}^{[1]} = 0.98431209, 0.98431209, 0.98514935, 0.98514935$
Change in b layer 1: $\Delta b_1^{[1]}, \Delta b_2^{[1]} = -0.0059747, -0.0059747$
New bias in layer 1: $b_1^{[1]}, b_2^{[1]} = 0.9477913, 0.9477913$
Change in W layer 2: $\Delta w_{11}^{[2]}, \Delta w_{12}^{[2]}, \Delta w_{21}^{[2]}, \Delta w_{22}^{[2]} = 0.00842002, -0.06299719, 0.00842002, -0.06299719$
New weights in layer 2: $w_{11}^{[2]}, w_{12}^{[2]}, w_{21}^{[2]}, w_{22}^{[2]} = 0.80991324, 0.80162179, 0.80991324, 0.80162179$
Change in b layer 2: $\Delta b_1^{[2]}, \Delta b_2^{[2]} = 0.00962512, -0.07201352$
New bias in layer 2: $b_1^{[2]}, b_2^{[2]} = 0.76493566, 0.77556118$
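For readers who want to verify the trace above, the following self-contained NumPy sketch reproduces it under the stated assumptions (2-2-2 architecture, all weights and biases initialized to 1, α = 1, sigmoid activations, instances presented in the order shown). The printed values should match the hand computation up to rounding.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR-style data with one-hot targets, in the order used in the example above
X = np.array([[1., 1.], [0., 0.], [0., 1.], [1., 0.]])
Y = np.array([[0., 1.], [0., 1.], [1., 0.], [1., 0.]])

W1 = np.ones((2, 2)); b1 = np.ones(2)    # layer 1, all weights and biases start at 1
W2 = np.ones((2, 2)); b2 = np.ones(2)    # layer 2
alpha = 1.0

for iteration in range(2):
    for x, y in zip(X, Y):
        # forward propagation
        z1 = x @ W1 + b1;  a1 = sigmoid(z1)
        z2 = a1 @ W2 + b2; a2 = sigmoid(z2)
        # backward propagation (deltas use the weights before the update)
        d2 = (a2 - y) * a2 * (1 - a2)          # error info of layer 2
        d1 = (W2 @ d2) * a1 * (1 - a1)         # error info of layer 1
        # updates, eqs. (22)-(25)
        W2 += -alpha * np.outer(a1, d2); b2 += -alpha * d2
        W1 += -alpha * np.outer(x, d1);  b1 += -alpha * d1
        print("iteration", iteration, "a2", a2, "d2", d2, "d1", d1)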
3 Notes
Activation function
Using the tanh or ReLU activation function in the hidden layers will give better results. The equations are given as follows:
$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$  (28)

$\tanh'(z) = 1 - \left(\tanh(z)\right)^2$  (29)
Loss/Cost Function
Using the cross-entropy loss function instead of the error function above will also work well in finding the optimal values. The derivations are then made according to that loss/cost function.
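As an illustration of how the output-layer error information changes (a sketch, not derived in this report): for one instance, the cross-entropy loss for a sigmoid output layer and the resulting error information are

$E = -\sum_{i=1}^{m^{[L]}} \left[ y_i \ln a_i^{[L]} + (1 - y_i) \ln\left(1 - a_i^{[L]}\right) \right], \qquad \delta_j^{[L]} = \frac{\partial E}{\partial a_j^{[L]}}\, f'(z_j^{[L]}) = a_j^{[L]} - y_j$

so the $f'(z_j^{[L]})$ factor cancels, while the hidden-layer equations are unchanged.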
Initialization
Initialize the weights and biases randomly.
Gradient Descent
This report uses Stochastic Gradient Descent. If you are using (batch) Gradient Descent instead, the for loop in line 3 of the algorithm will be omitted and the cost function will be used (the cost is the sum of the loss over all training examples).
Mini Batch
In between gradient descent over all the training data and stochastic gradient descent, there is a mini-batch version, where the training data is divided into chunks and an update is made after traversing all the data in one mini-batch. The cost is the sum or average of the loss over all training examples in one mini-batch. The derivations are then made according to that loss/cost function.
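A minimal Python sketch of how the training data can be split into mini-batches (the generator name and the chunking scheme are assumptions; the gradient accumulation itself follows the equations already derived):

def iterate_minibatches(X, Y, batch_size):
    # Split the training data into chunks of batch_size instances; one parameter
    # update is made per chunk, using the sum or average of its per-instance gradients.
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], Y[start:start + batch_size]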
Vectorization
The operations can be vectorized by representing $W^{[l]}$, $b^{[l]}$, $z^{[l]}$, $a^{[l]}$, $\delta^{[l]}$, $\Delta W^{[l]}$, x and y as vectors/matrices.
For example, $W^{[l]}$ is the weight matrix connecting layers l-1 and l, with entry $w_{ij}^{[l]}$ in row i and column j; its dimensions are $(m^{[l-1]}, m^{[l]})$. $b^{[l]}$ is the vector of biases for layer l, with dimensions $(m^{[l]}, 1)$, and $a^{[l-1]}$ is the column vector of activations of layer l-1. With these dimensions the net input of layer l can be written in vector form as

$z^{[l]} = (W^{[l]})^{T} a^{[l-1]} + b^{[l]}$
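A minimal NumPy sketch of this vectorized forward pass, under the dimension conventions above (column-vector activations, sigmoid activation assumed; the names are illustrative):

import numpy as np

def forward_vectorized(x, weights, biases):
    # weights[l-1] is W^[l] with shape (m^[l-1], m^[l]); biases[l-1] is b^[l] with
    # shape (m^[l], 1); activations are kept as column vectors of shape (m^[l], 1).
    a = np.asarray(x, dtype=float).reshape(-1, 1)      # a^[0] as a column vector
    for W, b in zip(weights, biases):
        z = W.T @ a + b                                # z^[l] = (W^[l])^T a^[l-1] + b^[l]
        a = 1.0 / (1.0 + np.exp(-z))                   # sigmoid assumed for f
    return a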
4 Exercise
1. Train a 2-layer neural network on the following data. Use 2 units in the hidden layer; the numbers of input and output units follow from the dimensions of the data. Perform 2 iterations. Use α = 1 and the sigmoid function as the activation in all layers.
x1 x2 y
1 -1 0
-1 1 0
1 1 1
-1 -1 0
2. Train the same neural network with ReLU as the activation function for the hidden layer and sigmoid for the output layer.
3. Use the perceptron rule to train a single perceptron on the training data given in the following table. Perform only one iteration, i.e. go once through each training instance and show the updating of the weights. Use α = 1 and the step function as the activation function.
After you have completed one iteration, draw the decision boundary in the x1-x2 plane. Is there a need for more iterations if the goal is to achieve 100% accuracy on the training data?⁴
x1 x2 y
0 0 1
0.5 0 1
1 0 1
0.5 1 -1
1 1 -1
0 1 1
4. For the data given in question 3, train a 2-layer neural network. Use 2 units in the hidden layer; the numbers of input and output units follow from the dimensions of the data. Use activation functions appropriate for this problem.
5. Derive the equations of gradient descent if we use the cost of the complete training set before updating the weights. The cost function will be

$E = \frac{1}{2} \sum_{j=1}^{N} \sum_{i=1}^{m^{[L]}} \left(a_i^{j[L]} - y_i^{j}\right)^2$  (32)

where N is the number of training instances, $y_i^{j}$ is the ith component of the desired output of the jth instance, and $a_i^{j[L]}$ is the ith component of the neural net's output for the jth instance.
Hint: Use C instead of E.
6. Using the dataset given in the example (Section 2), train a neural network with 2 hidden layers. Perform one iteration, using 2 units per hidden layer; the numbers of input and output units follow from the dimensions of the data. Use an activation function of your choice.
7. The vectorized form of the net input is given in the Notes section; find the vectorized equations for $a^{[l]}$, $\delta^{[l]}$ and $\Delta W^{[l]}$. Please mention the dimensions of each vector/matrix you use.
8. Given the following network, find the error information of the units marked with ?, when the input is x = [1.5 1 2] and the desired output is y = -1. Use tanh as the activation function for all units.
⁴ The perceptron rule was covered in class; please refer to the slides.
9. For the same network as in question 8, find the required change in the weights if the input is x = [-1.5 -1 -2] and the desired output is y = 1. Use tanh as the activation function for all units, and α = 0.1.
10. Design a feed-forward fully connected neural network to classify a 28x28 image into the class cat or dog. What are the dimensions of the input layer and the output layer? Which activation functions will you use in the output layer and in the hidden layers? Assume that converting your image to a 100-dimensional space, then to 50 dimensions and then to 10 dimensions will give good performance; how many hidden layers will you use, and how many units will you use in each layer?
11. Given the task of classifying 28x28 gray-scale images of handwritten digits [0 to 9] using a feed-forward fully connected neural network, what will be the dimensions of the input layer and the output layer? How will you convert the output to a class value?