
Derivations for Back Propagation of Multilayer Neural Network

Noshaba
April 2, 2018

1 Back Propagation of Multilayer Neural Network


This document gives derivations of the training equations of a multilayer feed-forward neural network trained with the back propagation algorithm.

Consider the multilayer neural network shown in figure 1. There are $L$ layers and $m^{[l]}$ units in each layer, where $l$ is the layer number. The following notation will be used throughout these derivations:

• $w_{ij}^{[l]}$: weight of the connection between the $i$th unit in layer $l-1$ and the $j$th unit in layer $l$ ¹

• $x^i$: feature vector of the $i$th instance

• $x_j^i$: feature $j$ of the $i$th instance

• $y_j^i$: $j$th component of the desired output of the $i$th instance

• $y_j^{\prime\, i}$: $j$th component of the neural net's output for the $i$th input instance, also equal to the $j$th activation of the last layer, $a_j^{[L]}$

• $b_j^{[l]}$: bias ($w_0$) of unit $j$ in layer $l$

• $z_j^{[l]}$: net input of unit $j$ in layer $l$

• $a_j^{[l]}$: activation of unit $j$ in layer $l$

• $a_j^{[0]}$: equal to the input $x_j$

• $m^{[l]}$: number of units in layer $l$

• $\Delta$: change

• $\delta_j^{[l]}$: error information of unit $j$ in layer $l$

• $E$: error function

• $\alpha$: learning rate

¹ The notation $[l]$ is adapted from the deeplearning course on Coursera.

1.1 Forward Propagation Equations


The forward propagation process starts from layer 1; each layer takes its input from the previous layer and computes its activations, until the final layer is reached. The activations of the final layer are the output. Note that each layer may have a different activation function; however, every unit within one layer uses the same activation function. The activation of each layer is given by eq. (1).

$$a_j^{[l]} = f^{[l]}(z_j^{[l]}) \qquad (1)$$

$$z_j^{[l]} = \sum_{i=1}^{m^{[l-1]}} w_{ij}^{[l]} \, a_i^{[l-1]} + b_j^{[l]} \qquad (2)$$

The activations of the last layer (layer $L$) can be written as follows:

$$a_j^{[L]} = f^{[L]}(z_j^{[L]}) \qquad (3)$$

$$z_j^{[L]} = \sum_{i=1}^{m^{[L-1]}} w_{ij}^{[L]} \, a_i^{[L-1]} + b_j^{[L]} \qquad (4)$$

While performing forward propagation, each $a_j^{[l]}$ is saved for use in backward propagation.

There are many ways to define the error; here we use the following equation for one instance, which sums the squared errors of the output units:

$$E = \frac{1}{2} \sum_{i=1}^{m^{[L]}} (a_i^{[L]} - y_i)^2 \qquad (5)$$
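To make the forward pass concrete, here is a minimal NumPy sketch of eqs. (1)-(5). The function names, the choice of sigmoid for every layer, and the storage of $W^{[l]}$ with shape $(m^{[l-1]}, m^{[l]})$ are illustrative assumptions, not part of the derivation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Forward pass of eqs. (1)-(4); returns the activations of every layer.

    weights[l] has shape (m[l-1], m[l]) so that z_j = sum_i w_ij * a_i + b_j,
    matching eq. (2). Sigmoid is assumed for every layer here.
    """
    activations = [x]                       # a[0] = x
    for W, b in zip(weights, biases):
        z = W.T @ activations[-1] + b       # eq. (2): net input of the layer
        activations.append(sigmoid(z))      # eq. (1): activation of the layer
    return activations

def squared_error(a_L, y):
    """Eq. (5): half the sum of squared output errors for one instance."""
    return 0.5 * np.sum((a_L - y) ** 2)
```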

1.2 Backward Propagation Equations and Derivations


In backward propagation we find the amount of change required in the weights with respect to the error. The generalized equation can be written as follows, using the chain rule:

$$\frac{\partial E}{\partial w_{ij}^{[l]}} = \frac{\partial E}{\partial a_j^{[l]}} \, \frac{\partial a_j^{[l]}}{\partial z_j^{[l]}} \, \frac{\partial z_j^{[l]}}{\partial w_{ij}^{[l]}} \qquad (6)$$

where the last two partial derivatives are given by

$$\frac{\partial a_j^{[l]}}{\partial z_j^{[l]}} = f'(z_j^{[l]}) \qquad (7)$$

$$\frac{\partial z_j^{[l]}}{\partial w_{ij}^{[l]}} = a_i^{[l-1]} \qquad (8)$$

Equation (6) can also be written as

$$\frac{\partial E}{\partial w_{ij}^{[l]}} = \delta_j^{[l]} \, a_i^{[l-1]} \qquad (9)$$

where $\delta_j^{[l]}$ is the error information ²

$$\delta_j^{[l]} = \frac{\partial E}{\partial a_j^{[l]}} \, \frac{\partial a_j^{[l]}}{\partial z_j^{[l]}} = \frac{\partial E}{\partial a_j^{[l]}} \, f'(z_j^{[l]}) \qquad (10)$$

² $\delta_j^{[l]} = \dfrac{\partial E}{\partial z_j^{[l]}}$

1.2.1 Gradient of last layer


For the last layer $L$, $\dfrac{\partial E}{\partial a_j^{[L]}}$ is as follows, because $a_j^{[L]}$ only contributes to the error term of $y_j$:

$$\frac{\partial E}{\partial a_j^{[L]}} = \frac{\partial}{\partial a_j^{[L]}} \left[ \frac{1}{2} \sum_{i=1}^{m^{[L]}} (a_i^{[L]} - y_i)^2 \right] = (a_j^{[L]} - y_j) \qquad (11)$$

Plugging the value of eq. (11) into eqs. (10) and (9) for layer $L$:

$$\delta_j^{[L]} = (a_j^{[L]} - y_j) \, f'(z_j^{[L]}) \qquad (12)$$

$$\frac{\partial E}{\partial w_{ij}^{[L]}} = \delta_j^{[L]} \, a_i^{[L-1]} \qquad (13)$$

The derivative of $E$ with respect to the bias $b_j^{[L]}$ is given by the following equation, since the input from the bias connection is 1:

$$\frac{\partial E}{\partial b_j^{[L]}} = \delta_j^{[L]} \qquad (14)$$
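A small sketch of eqs. (12)-(14) for a sigmoid output layer; the function name, argument layout, and the sigmoid assumption are mine, not part of the derivation.

```python
import numpy as np

def output_layer_grads(a_L, a_prev, y):
    """Eqs. (12)-(14) for a sigmoid output layer.

    a_L:    activations of the last layer, shape (m[L],)
    a_prev: activations of layer L-1, shape (m[L-1],)
    y:      desired output, shape (m[L],)
    """
    f_prime = a_L * (1.0 - a_L)             # sigmoid: f'(z) = a(1 - a)
    delta_L = (a_L - y) * f_prime           # eq. (12)
    dE_dW = np.outer(a_prev, delta_L)       # eq. (13): dE/dw_ij = delta_j * a_i
    dE_db = delta_L                         # eq. (14)
    return delta_L, dE_dW, dE_db
```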

1.2.2 Gradient of hidden layers


For the last hidden layer ($L-1$), eq. (6) becomes

$$\frac{\partial E}{\partial w_{ij}^{[L-1]}} = \frac{\partial E}{\partial a_j^{[L-1]}} \, \frac{\partial a_j^{[L-1]}}{\partial z_j^{[L-1]}} \, \frac{\partial z_j^{[L-1]}}{\partial w_{ij}^{[L-1]}} \qquad (15)$$

The second and third partial derivatives can be found from equations (7) and (8). The first one is derived using the chain rule, as given in (16). The summation reflects that $a_j^{[L-1]}$ contributes to all the units in layer $L$; the first two partial derivatives in (16) form the error information of the $k$th unit of layer $L$.

$$\frac{\partial E}{\partial a_j^{[L-1]}} = \sum_{k=1}^{m^{[L]}} \frac{\partial E}{\partial a_k^{[L]}} \, \frac{\partial a_k^{[L]}}{\partial z_k^{[L]}} \, \frac{\partial z_k^{[L]}}{\partial a_j^{[L-1]}} = \sum_{k=1}^{m^{[L]}} \delta_k^{[L]} \, w_{jk}^{[L]} \qquad (16)$$

Plugging the values of eqs. (16), (7) and (8) into eq. (15):

$$\frac{\partial E}{\partial w_{ij}^{[L-1]}} = \left[ \sum_{k=1}^{m^{[L]}} \delta_k^{[L]} \, w_{jk}^{[L]} \right] f'(z_j^{[L-1]}) \, a_i^{[L-2]} = \delta_j^{[L-1]} \, a_i^{[L-2]} \qquad (17)$$

Therefore the error information of the $j$th unit in layer $L-1$ is

$$\delta_j^{[L-1]} = \left[ \sum_{k=1}^{m^{[L]}} \delta_k^{[L]} \, w_{jk}^{[L]} \right] f'(z_j^{[L-1]}) \qquad (18)$$

Generalizing (18) and (17) for any hidden layer $l$:

$$\delta_j^{[l]} = \left[ \sum_{k=1}^{m^{[l+1]}} \delta_k^{[l+1]} \, w_{jk}^{[l+1]} \right] f'(z_j^{[l]}) \qquad (19)$$

$$\frac{\partial E}{\partial w_{ij}^{[l]}} = \delta_j^{[l]} \, a_i^{[l-1]} \qquad (20)$$

$$\frac{\partial E}{\partial b_j^{[l]}} = \delta_j^{[l]} \qquad (21)$$
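Similarly, a sketch of eqs. (19)-(21) for one sigmoid hidden layer, assuming the weights of layer $l+1$ are stored with shape $(m^{[l]}, m^{[l+1]})$ so that element $(j,k)$ is $w_{jk}^{[l+1]}$; the names are illustrative.

```python
import numpy as np

def hidden_layer_grads(delta_next, W_next, a_l, a_prev):
    """Eqs. (19)-(21) for a sigmoid hidden layer.

    delta_next: error info of layer l+1, shape (m[l+1],)
    W_next:     weights of layer l+1, shape (m[l], m[l+1])
    a_l:        activations of layer l, shape (m[l],)
    a_prev:     activations of layer l-1, shape (m[l-1],)
    """
    f_prime = a_l * (1.0 - a_l)                  # sigmoid: f'(z) = a(1 - a)
    delta_l = (W_next @ delta_next) * f_prime    # eq. (19): sum_k delta_k * w_jk
    dE_dW = np.outer(a_prev, delta_l)            # eq. (20)
    dE_db = delta_l                              # eq. (21)
    return delta_l, dE_dW, dE_db
```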

1.2.3 Change in Weights and bias


The changes and the new weights and biases will be

$$\Delta w_{ij}^{[l]} = -\alpha \frac{\partial E}{\partial w_{ij}^{[l]}} \qquad (22)$$

$$\Delta b_j^{[l]} = -\alpha \frac{\partial E}{\partial b_j^{[l]}} \qquad (23)$$

$$w_{ij}^{[l]} = w_{ij}^{[l]} + \Delta w_{ij}^{[l]} \qquad (24)$$

$$b_j^{[l]} = b_j^{[l]} + \Delta b_j^{[l]} \qquad (25)$$
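As a sketch, applying eqs. (22)-(25) amounts to stepping each parameter against its gradient, scaled by $\alpha$; the function and variable names below are illustrative.

```python
import numpy as np

def apply_update(W_l, b_l, dE_dW, dE_db, alpha):
    """Eqs. (22)-(25): Delta w = -alpha * dE/dw, Delta b = -alpha * dE/db, then add."""
    W_l += -alpha * dE_dW    # eqs. (22) and (24)
    b_l += -alpha * dE_db    # eqs. (23) and (25)
    return W_l, b_l
```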

1.3 Training Algorithm
Using stochastic gradient descent to train³:

Algorithm 1 Training a Multilayer Neural Network

1: Initialization: randomly initialize all weights and biases in the network
2: while termination condition is false do
3:   for each training instance $x^h$ in the training data do
4:     $a^{[0]} = x^h$
5:     Forward Propagation:
6:     for each layer $l$ from 1 to $L$ do
7:       for each unit $j$ in layer $l$, find the net and activation:
8:         $z_j^{[l]} = \sum_{i=1}^{m^{[l-1]}} w_{ij}^{[l]} a_i^{[l-1]} + b_j^{[l]}$
9:         $a_j^{[l]} = f^{[l]}(z_j^{[l]})$
10:    Back Propagation:
11:    for each unit $j$ in the last layer $L$ do
12:      find the error information and the change in weights and bias:
13:        $\delta_j^{[L]} = (a_j^{[L]} - y_j)\, f'(z_j^{[L]})$
14:        $\Delta w_{ij}^{[L]} = -\alpha\, \delta_j^{[L]} a_i^{[L-1]}$
15:        $\Delta b_j^{[L]} = -\alpha\, \delta_j^{[L]}$
16:    for each hidden layer $l$ from $L-1$ down to 1 do
17:      for each unit $j$ in layer $l$ do
18:        find the error information and the change in weights and bias:
19:          $\delta_j^{[l]} = \left[\sum_{k=1}^{m^{[l+1]}} \delta_k^{[l+1]} w_{jk}^{[l+1]}\right] f'(z_j^{[l]})$
20:          $\Delta w_{ij}^{[l]} = -\alpha\, \delta_j^{[l]} a_i^{[l-1]}$
21:          $\Delta b_j^{[l]} = -\alpha\, \delta_j^{[l]}$
22:    Update:
23:    update each weight and bias:
24:      $w_{ij}^{[l]} = w_{ij}^{[l]} + \Delta w_{ij}^{[l]}$
25:      $b_j^{[l]} = b_j^{[l]} + \Delta b_j^{[l]}$
   Test the termination condition

³ This algorithm is adapted from Laurene Fausett's Fundamentals of Neural Networks, Section 6.1.2.
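The following NumPy sketch puts Algorithm 1 together as a stochastic-gradient-descent training loop. The sigmoid-everywhere choice, the random-normal initialization, and the fixed epoch count used as the termination condition are assumptions made for illustration, not requirements of the algorithm.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sgd(X, Y, layer_sizes, alpha=1.0, epochs=100, seed=0):
    """Algorithm 1: SGD training of a fully connected network, sigmoid in every layer.

    X: (N, m[0]) inputs; Y: (N, m[L]) desired outputs; layer_sizes: [m[0], ..., m[L]].
    """
    rng = np.random.default_rng(seed)
    # Initialization: random weights W[l] of shape (m[l-1], m[l]) and zero biases
    Ws = [rng.normal(0.0, 0.5, (m_in, m_out))
          for m_in, m_out in zip(layer_sizes[:-1], layer_sizes[1:])]
    bs = [np.zeros(m_out) for m_out in layer_sizes[1:]]

    for _ in range(epochs):                          # termination: fixed epoch count
        for x, y in zip(X, Y):                       # one update per instance (SGD)
            # Forward propagation (lines 6-9)
            activations = [x]
            for W, b in zip(Ws, bs):
                activations.append(sigmoid(W.T @ activations[-1] + b))
            # Back propagation (lines 11-21), starting from the last layer
            a_L = activations[-1]
            delta = (a_L - y) * a_L * (1.0 - a_L)
            for l in range(len(Ws) - 1, -1, -1):
                dW = np.outer(activations[l], delta)  # dE/dw_ij = delta_j * a_i
                db = delta
                if l > 0:                             # error info of the layer below
                    a_l = activations[l]
                    delta = (Ws[l] @ delta) * a_l * (1.0 - a_l)
                # Update (lines 23-25)
                Ws[l] -= alpha * dW
                bs[l] -= alpha * db
    return Ws, bs
```

For instance, `train_sgd(X, Y, [2, 2, 2])` would build the same 2-2-2 shape as the network used in the example below, though with random rather than all-one initial weights.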

2 Example
Following is a training example on the training data and network architecture given in the figure, using the sigmoid activation function:

$$a = f(z) = \frac{1}{1 + \exp(-z)} \qquad (26)$$

$$f'(z) = f(z)\,(1 - f(z)) = a(1 - a) \qquad (27)$$
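As a quick sanity check of the trace below, the first forward step of Training Instance 0 (all weights and biases initialized to 1, input [1 1]) gives $z^{[1]}_j = 1 + 1 + 1 = 3$ and $f(3) \approx 0.95257413$; the following snippet reproduces those values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# First forward pass of Training Instance 0: all weights/biases = 1, input x0 = [1, 1]
z1 = 1 * 1 + 1 * 1 + 1     # net of each unit in layer 1 = 3
a1 = sigmoid(z1)           # ~0.95257413, as in the trace below
z2 = 1 * a1 + 1 * a1 + 1   # net of each unit in layer 2 = ~2.90514825
a2 = sigmoid(z2)           # ~0.94810035
print(a1, a2)
```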

Step 1:
Initializing all the weights to 1, $\alpha = 1$.
Step 2:
Iteration 1

Training Instance 0
Input $x^0$ = [1 1], desired output $y^0$ = [0 1]
Forward Propagation
Activation of layer 0: $a_1^{[0]}, a_2^{[0]}$ = 1, 1
Net Z of layer 1: $z_1^{[1]}, z_2^{[1]}$ = 3.0, 3.0
Activation A of layer 1: $a_1^{[1]}, a_2^{[1]}$ = 0.95257413, 0.95257413
Net Z of layer 2: $z_1^{[2]}, z_2^{[2]}$ = 2.90514825, 2.90514825
Activation A of layer 2: $a_1^{[2]}, a_2^{[2]}$ = 0.94810035, 0.94810035
Back Propagation
Error info of layer 2: $\delta_1^{[2]}, \delta_2^{[2]}$ = 0.0466523, -0.00255378
Error info of layer 1: $\delta_1^{[1]}, \delta_2^{[1]}$ = 0.00199222, 0.00199222
Change in W layer 1: $\Delta w_{11}^{[1]}, \Delta w_{12}^{[1]}, \Delta w_{21}^{[1]}, \Delta w_{22}^{[1]}$ = -0.00199222, -0.00199222, -0.00199222, -0.00199222
New weights in layer 1: $w_{11}^{[1]}, w_{12}^{[1]}, w_{21}^{[1]}, w_{22}^{[1]}$ = 0.99800778, 0.99800778, 0.99800778, 0.99800778
Change in b layer 1: $\Delta b_1^{[1]}, \Delta b_2^{[1]}$ = -0.00199222, -0.00199222
New bias in layer 1: $b_1^{[1]}, b_2^{[1]}$ = 0.99800778, 0.99800778
Change in W layer 2: $\Delta w_{11}^{[2]}, \Delta w_{12}^{[2]}, \Delta w_{21}^{[2]}, \Delta w_{22}^{[2]}$ = -0.04443977, 0.00243266, -0.04443977, 0.00243266
New weights in layer 2: $w_{11}^{[2]}, w_{12}^{[2]}, w_{21}^{[2]}, w_{22}^{[2]}$ = 0.95556023, 1.00243266, 0.95556023, 1.00243266
Change in b layer 2: $\Delta b_1^{[2]}, \Delta b_2^{[2]}$ = -0.0466523, 0.00255378
New bias in layer 2: $b_1^{[2]}, b_2^{[2]}$ = 0.9533477, 1.00255378
Training Instance 1
Input $x^1$ = [0 0], desired output $y^1$ = [0 1]
Forward Propagation
Activation of layer 0: $a_1^{[0]}, a_2^{[0]}$ = 0, 0
Net Z of layer 1: $z_1^{[1]}, z_2^{[1]}$ = 0.99800778, 0.99800778
Activation A of layer 1: $a_1^{[1]}, a_2^{[1]}$ = 0.7306667, 0.7306667
Net Z of layer 2: $z_1^{[2]}, z_2^{[2]}$ = 2.34973978, 2.46744212
Activation A of layer 2: $a_1^{[2]}, a_2^{[2]}$ = 0.91291354, 0.92182764
Back Propagation
Error info of layer 2: $\delta_1^{[2]}, \delta_2^{[2]}$ = 0.07257882, -0.00563321
Error info of layer 1: $\delta_1^{[1]}, \delta_2^{[1]}$ = 0.01253699, 0.01253699
Change in W layer 1: $\Delta w_{11}^{[1]}, \Delta w_{12}^{[1]}, \Delta w_{21}^{[1]}, \Delta w_{22}^{[1]}$ = -0., -0., -0., -0.
New weights in layer 1: $w_{11}^{[1]}, w_{12}^{[1]}, w_{21}^{[1]}, w_{22}^{[1]}$ = 0.99800778, 0.99800778, 0.99800778, 0.99800778
Change in b layer 1: $\Delta b_1^{[1]}, \Delta b_2^{[1]}$ = -0.01253699, -0.01253699
New bias in layer 1: $b_1^{[1]}, b_2^{[1]}$ = 0.98547079, 0.98547079
Change in W layer 2: $\Delta w_{11}^{[2]}, \Delta w_{12}^{[2]}, \Delta w_{21}^{[2]}, \Delta w_{22}^{[2]}$ = -0.05303093, 0.004116, -0.05303093, 0.004116
New weights in layer 2: $w_{11}^{[2]}, w_{12}^{[2]}, w_{21}^{[2]}, w_{22}^{[2]}$ = 0.9025293, 1.00654866, 0.9025293, 1.00654866
Change in b layer 2: $\Delta b_1^{[2]}, \Delta b_2^{[2]}$ = -0.07257882, 0.00563321
New bias in layer 2: $b_1^{[2]}, b_2^{[2]}$ = 0.88076888, 1.00818699
Training Instance 2
Input $x^2$ = [0 1], desired output $y^2$ = [1 0]
Forward Propagation
Activation of layer 0: $a_1^{[0]}, a_2^{[0]}$ = 0, 1
Net Z of layer 1: $z_1^{[1]}, z_2^{[1]}$ = 1.98347856, 1.98347856
Activation A of layer 1: $a_1^{[1]}, a_2^{[1]}$ = 0.87905149, 0.87905149
Net Z of layer 2: $z_1^{[2]}, z_2^{[2]}$ = 2.46750832, 2.7778032
Activation A of layer 2: $a_1^{[2]}, a_2^{[2]}$ = 0.92183241, 0.9414645
Back Propagation
Error info of layer 2: $\delta_1^{[2]}, \delta_2^{[2]}$ = -0.00563255, 0.05188326
Error info of layer 1: $\delta_1^{[1]}, \delta_2^{[1]}$ = 0.00501187, 0.00501187
Change in W layer 1: $\Delta w_{11}^{[1]}, \Delta w_{12}^{[1]}, \Delta w_{21}^{[1]}, \Delta w_{22}^{[1]}$ = -0., -0., -0.00501187, -0.00501187
New weights in layer 1: $w_{11}^{[1]}, w_{12}^{[1]}, w_{21}^{[1]}, w_{22}^{[1]}$ = 0.99800778, 0.99800778, 0.99299591, 0.99299591
Change in b layer 1: $\Delta b_1^{[1]}, \Delta b_2^{[1]}$ = -0.00501187, -0.00501187
New bias in layer 1: $b_1^{[1]}, b_2^{[1]}$ = 0.98045892, 0.98045892
Change in W layer 2: $\Delta w_{11}^{[2]}, \Delta w_{12}^{[2]}, \Delta w_{21}^{[2]}, \Delta w_{22}^{[2]}$ = 0.00495131, -0.04560806, 0.00495131, -0.04560806
New weights in layer 2: $w_{11}^{[2]}, w_{12}^{[2]}, w_{21}^{[2]}, w_{22}^{[2]}$ = 0.9074806, 0.96094061, 0.9074806, 0.96094061
Change in b layer 2: $\Delta b_1^{[2]}, \Delta b_2^{[2]}$ = 0.00563255, -0.05188326
New bias in layer 2: $b_1^{[2]}, b_2^{[2]}$ = 0.88640143, 0.95630373
Training Instance 3
Input $x^3$ = [1 0], desired output $y^3$ = [1 0]
Forward Propagation
Activation of layer 0: $a_1^{[0]}, a_2^{[0]}$ = 1, 0
Net Z of layer 1: $z_1^{[1]}, z_2^{[1]}$ = 1.9784667, 1.9784667
Activation A of layer 1: $a_1^{[1]}, a_2^{[1]}$ = 0.87851762, 0.87851762
Net Z of layer 2: $z_1^{[2]}, z_2^{[2]}$ = 2.48087682, 2.64471024
Activation A of layer 2: $a_1^{[2]}, a_2^{[2]}$ = 0.92279029, 0.93368421
Back Propagation
Error info of layer 2: $\delta_1^{[2]}, \delta_2^{[2]}$ = -0.00550107, 0.05781186
Error info of layer 1: $\delta_1^{[1]}, \delta_2^{[1]}$ = 0.00539616, 0.00539616
Change in W layer 1: $\Delta w_{11}^{[1]}, \Delta w_{12}^{[1]}, \Delta w_{21}^{[1]}, \Delta w_{22}^{[1]}$ = -0.00539616, -0.00539616, -0., -0.
New weights in layer 1: $w_{11}^{[1]}, w_{12}^{[1]}, w_{21}^{[1]}, w_{22}^{[1]}$ = 0.99261161, 0.99261161, 0.99299591, 0.99299591
Change in b layer 1: $\Delta b_1^{[1]}, \Delta b_2^{[1]}$ = -0.00539616, -0.00539616
New bias in layer 1: $b_1^{[1]}, b_2^{[1]}$ = 0.97506276, 0.97506276
Change in W layer 2: $\Delta w_{11}^{[2]}, \Delta w_{12}^{[2]}, \Delta w_{21}^{[2]}, \Delta w_{22}^{[2]}$ = 0.00483278, -0.05078874, 0.00483278, -0.05078874
New weights in layer 2: $w_{11}^{[2]}, w_{12}^{[2]}, w_{21}^{[2]}, w_{22}^{[2]}$ = 0.91231338, 0.91015187, 0.91231338, 0.91015187
Change in b layer 2: $\Delta b_1^{[2]}, \Delta b_2^{[2]}$ = 0.00550107, -0.05781186
New bias in layer 2: $b_1^{[2]}, b_2^{[2]}$ = 0.8919025, 0.89849187
Iteration 2

Training Instance 0
Input $x^0$ = [1 1], desired output $y^0$ = [0 1]
Forward Propagation
Activation of layer 0: $a_1^{[0]}, a_2^{[0]}$ = 1, 1
Net Z of layer 1: $z_1^{[1]}, z_2^{[1]}$ = 2.96067028, 2.96067028
Activation A of layer 1: $a_1^{[1]}, a_2^{[1]}$ = 0.95076538, 0.95076538
Net Z of layer 2: $z_1^{[2]}, z_2^{[2]}$ = 2.62669446, 2.62917364
Activation A of layer 2: $a_1^{[2]}, a_2^{[2]}$ = 0.93255995, 0.93271571
Back Propagation
Error info of layer 2: $\delta_1^{[2]}, \delta_2^{[2]}$ = 0.05865045, -0.00422257
Error info of layer 1: $\delta_1^{[1]}, \delta_2^{[1]}$ = 0.00232482, 0.00232482
Change in W layer 1: $\Delta w_{11}^{[1]}, \Delta w_{12}^{[1]}, \Delta w_{21}^{[1]}, \Delta w_{22}^{[1]}$ = -0.00232482, -0.00232482, -0.00232482, -0.00232482
New weights in layer 1: $w_{11}^{[1]}, w_{12}^{[1]}, w_{21}^{[1]}, w_{22}^{[1]}$ = 0.99028679, 0.99028679, 0.99067109, 0.99067109
Change in b layer 1: $\Delta b_1^{[1]}, \Delta b_2^{[1]}$ = -0.00232482, -0.00232482
New bias in layer 1: $b_1^{[1]}, b_2^{[1]}$ = 0.97273794, 0.97273794
Change in W layer 2: $\Delta w_{11}^{[2]}, \Delta w_{12}^{[2]}, \Delta w_{21}^{[2]}, \Delta w_{22}^{[2]}$ = -0.05576282, 0.00401467, -0.05576282, 0.00401467
New weights in layer 2: $w_{11}^{[2]}, w_{12}^{[2]}, w_{21}^{[2]}, w_{22}^{[2]}$ = 0.85655056, 0.91416654, 0.85655056, 0.91416654
Change in b layer 2: $\Delta b_1^{[2]}, \Delta b_2^{[2]}$ = -0.05865045, 0.00422257
New bias in layer 2: $b_1^{[2]}, b_2^{[2]}$ = 0.83325204, 0.90271444
Training Instance 1
Input $x^1$ = [0 0], desired output $y^1$ = [0 1]
Forward Propagation
Activation of layer 0: $a_1^{[0]}, a_2^{[0]}$ = 0, 0
Net Z of layer 1: $z_1^{[1]}, z_2^{[1]}$ = 0.97273794, 0.97273794
Activation A of layer 1: $a_1^{[1]}, a_2^{[1]}$ = 0.72566489, 0.72566489
Net Z of layer 2: $z_1^{[2]}, z_2^{[2]}$ = 2.07638938, 2.22947156
Activation A of layer 2: $a_1^{[2]}, a_2^{[2]}$ = 0.88858708, 0.90286502
Back Propagation
Error info of layer 2: $\delta_1^{[2]}, \delta_2^{[2]}$ = 0.08797019, -0.00851872
Error info of layer 1: $\delta_1^{[1]}, \delta_2^{[1]}$ = 0.01345021, 0.01345021
Change in W layer 1: $\Delta w_{11}^{[1]}, \Delta w_{12}^{[1]}, \Delta w_{21}^{[1]}, \Delta w_{22}^{[1]}$ = -0., -0., -0., -0.
New weights in layer 1: $w_{11}^{[1]}, w_{12}^{[1]}, w_{21}^{[1]}, w_{22}^{[1]}$ = 0.99028679, 0.99028679, 0.99067109, 0.99067109
Change in b layer 1: $\Delta b_1^{[1]}, \Delta b_2^{[1]}$ = -0.01345021, -0.01345021
New bias in layer 1: $b_1^{[1]}, b_2^{[1]}$ = 0.95928773, 0.95928773
Change in W layer 2: $\Delta w_{11}^{[2]}, \Delta w_{12}^{[2]}, \Delta w_{21}^{[2]}, \Delta w_{22}^{[2]}$ = -0.06383688, 0.00618173, -0.06383688, 0.00618173
New weights in layer 2: $w_{11}^{[2]}, w_{12}^{[2]}, w_{21}^{[2]}, w_{22}^{[2]}$ = 0.79271368, 0.92034827, 0.79271368, 0.92034827
Change in b layer 2: $\Delta b_1^{[2]}, \Delta b_2^{[2]}$ = -0.08797019, 0.00851872
New bias in layer 2: $b_1^{[2]}, b_2^{[2]}$ = 0.74528185, 0.91123315
Training Instance 2
Input $x^2$ = [0 1], desired output $y^2$ = [1 0]
Forward Propagation
Activation of layer 0: $a_1^{[0]}, a_2^{[0]}$ = 0, 1
Net Z of layer 1: $z_1^{[1]}, z_2^{[1]}$ = 1.94995882, 1.94995882
Activation A of layer 1: $a_1^{[1]}, a_2^{[1]}$ = 0.87544215, 0.87544215
Net Z of layer 2: $z_1^{[2]}, z_2^{[2]}$ = 2.1332318, 2.52265649
Activation A of layer 2: $a_1^{[2]}, a_2^{[2]}$ = 0.89409142, 0.92571494
Back Propagation
Error info of layer 2: $\delta_1^{[2]}, \delta_2^{[2]}$ = -0.01002869, 0.06365844
Error info of layer 1: $\delta_1^{[1]}, \delta_2^{[1]}$ = 0.00552174, 0.00552174
Change in W layer 1: $\Delta w_{11}^{[1]}, \Delta w_{12}^{[1]}, \Delta w_{21}^{[1]}, \Delta w_{22}^{[1]}$ = -0., -0., -0.00552174, -0.00552174
New weights in layer 1: $w_{11}^{[1]}, w_{12}^{[1]}, w_{21}^{[1]}, w_{22}^{[1]}$ = 0.99028679, 0.99028679, 0.98514935, 0.98514935
Change in b layer 1: $\Delta b_1^{[1]}, \Delta b_2^{[1]}$ = -0.00552174, -0.00552174
New bias in layer 1: $b_1^{[1]}, b_2^{[1]}$ = 0.95376599, 0.95376599
Change in W layer 2: $\Delta w_{11}^{[2]}, \Delta w_{12}^{[2]}, \Delta w_{21}^{[2]}, \Delta w_{22}^{[2]}$ = 0.00877954, -0.05572929, 0.00877954, -0.05572929
New weights in layer 2: $w_{11}^{[2]}, w_{12}^{[2]}, w_{21}^{[2]}, w_{22}^{[2]}$ = 0.80149322, 0.86461899, 0.80149322, 0.86461899
Change in b layer 2: $\Delta b_1^{[2]}, \Delta b_2^{[2]}$ = 0.01002869, -0.06365844
New bias in layer 2: $b_1^{[2]}, b_2^{[2]}$ = 0.75531054, 0.84757471
Training Instance 3
Input $x^3$ = [1 0], desired output $y^3$ = [1 0]
Forward Propagation
Activation of layer 0: $a_1^{[0]}, a_2^{[0]}$ = 1, 0
Net Z of layer 1: $z_1^{[1]}, z_2^{[1]}$ = 1.94405279, 1.94405279
Activation A of layer 1: $a_1^{[1]}, a_2^{[1]}$ = 0.87479671, 0.87479671
Net Z of layer 2: $z_1^{[2]}, z_2^{[2]}$ = 2.15759781, 2.36030639
Activation A of layer 2: $a_1^{[2]}, a_2^{[2]}$ = 0.89637663, 0.91374996
Back Propagation
Error info of layer 2: $\delta_1^{[2]}, \delta_2^{[2]}$ = -0.00962512, 0.07201352
Error info of layer 1: $\delta_1^{[1]}, \delta_2^{[1]}$ = 0.0059747, 0.0059747
Change in W layer 1: $\Delta w_{11}^{[1]}, \Delta w_{12}^{[1]}, \Delta w_{21}^{[1]}, \Delta w_{22}^{[1]}$ = -0.0059747, -0.0059747, -0., -0.
New weights in layer 1: $w_{11}^{[1]}, w_{12}^{[1]}, w_{21}^{[1]}, w_{22}^{[1]}$ = 0.98431209, 0.98431209, 0.98514935, 0.98514935
Change in b layer 1: $\Delta b_1^{[1]}, \Delta b_2^{[1]}$ = -0.0059747, -0.0059747
New bias in layer 1: $b_1^{[1]}, b_2^{[1]}$ = 0.9477913, 0.9477913
Change in W layer 2: $\Delta w_{11}^{[2]}, \Delta w_{12}^{[2]}, \Delta w_{21}^{[2]}, \Delta w_{22}^{[2]}$ = 0.00842002, -0.06299719, 0.00842002, -0.06299719
New weights in layer 2: $w_{11}^{[2]}, w_{12}^{[2]}, w_{21}^{[2]}, w_{22}^{[2]}$ = 0.80991324, 0.80162179, 0.80991324, 0.80162179
Change in b layer 2: $\Delta b_1^{[2]}, \Delta b_2^{[2]}$ = 0.00962512, -0.07201352
New bias in layer 2: $b_1^{[2]}, b_2^{[2]}$ = 0.76493566, 0.77556118

The iterations will continue until convergence occurs.

3 Notes
Activation function
Using the tanh or ReLU activation function in the hidden layers will give better results. The equations are given as follows:

$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} \qquad (28)$$

$$\tanh'(z) = 1 - (\tanh(z))^2 \qquad (29)$$

$$\mathrm{relu}(z) = \max(z, 0) \qquad (30)$$

$$\mathrm{relu}'(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{otherwise} \end{cases} \qquad (31)$$
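A small sketch of these activation functions and their derivatives (the function names are mine):

```python
import numpy as np

def tanh(z):                      # eq. (28); equivalent to np.tanh(z)
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

def tanh_prime(z):                # eq. (29)
    return 1.0 - np.tanh(z) ** 2

def relu(z):                      # eq. (30)
    return np.maximum(z, 0.0)

def relu_prime(z):                # eq. (31): 1 if z >= 0, else 0
    return np.where(z >= 0, 1.0, 0.0)
```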

Loss/Cost Function
Using a cross-entropy loss function instead of the squared error function will also work in finding optimal values. The derivations are made according to the chosen loss/cost function.
Initialization
Initialize the weights and biases randomly.
Gradient Descent
This report uses stochastic gradient descent. If you use gradient descent instead, the for loop in line 3 of the algorithm is omitted and a cost function is used (the cost is the sum of the loss over all training examples).
Mini Batch
Between gradient descent over all the training data and stochastic gradient descent, there is a mini-batch version, where the training data is divided into chunks and updates are made after traversing all the data in one mini-batch. The cost is the sum or average of the loss over all training examples in one mini-batch. The derivations are made according to the chosen loss/cost function. A sketch is given below.
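A hedged sketch of one epoch of the mini-batch variant; the batch size and the assumed helper `grad_fn`, which returns per-instance gradients computed with the back propagation equations above, are illustrative, not part of the original notes.

```python
import numpy as np

def minibatch_epoch(X, Y, Ws, bs, grad_fn, alpha=1.0, batch_size=4):
    """One epoch of mini-batch gradient descent.

    grad_fn(x, y, Ws, bs) is assumed to return per-instance gradient lists
    (dWs, dbs) computed with the back propagation equations above.
    """
    N = X.shape[0]
    for start in range(0, N, batch_size):
        xb, yb = X[start:start + batch_size], Y[start:start + batch_size]
        # Accumulate the per-instance gradients over the mini-batch
        sum_dWs = [np.zeros_like(W) for W in Ws]
        sum_dbs = [np.zeros_like(b) for b in bs]
        for x, y in zip(xb, yb):
            dWs, dbs = grad_fn(x, y, Ws, bs)
            sum_dWs = [s + d for s, d in zip(sum_dWs, dWs)]
            sum_dbs = [s + d for s, d in zip(sum_dbs, dbs)]
        # One update per mini-batch, using the averaged gradients
        for l in range(len(Ws)):
            Ws[l] -= alpha * sum_dWs[l] / len(xb)
            bs[l] -= alpha * sum_dbs[l] / len(xb)
    return Ws, bs
```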
Vectorization
The operations can be vectorized by representing $W^{[l]}$, $b^{[l]}$, $z^{[l]}$, $a^{[l]}$, $\delta^{[l]}$, $\Delta W^{[l]}$, $x$ and $y$ as vectors/matrices.
For example, if $W^{[l]}$ is the weight matrix, it connects layers $l-1$ and $l$ and has dimensions $(m^{[l-1]}, m^{[l]})$. If $b^{[l]}$ is the bias vector of layer $l$, its dimensions are $(m^{[l]}, 1)$, and $a^{[l-1]}$ holds the activations of layer $l-1$. The net of layer $l$ can then be written in vector form as

$$z^{[l]} = W^{[l]\top} a^{[l-1]} + b^{[l]}$$

(the transpose is needed so that the product conforms with the stated dimensions of $W^{[l]}$).
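Assuming $W^{[l]}$ is stored with the dimensions stated above, the forward pass collapses to a few matrix products. The sketch below also vectorizes over a batch of $N$ instances stored as rows, which is an extra assumption of mine beyond the per-instance form given in the text.

```python
import numpy as np

def forward_batch(X, Ws, bs):
    """Vectorized forward pass, one matrix product per layer.

    X:  (N, m[0]) batch of inputs, one instance per row.
    Ws: list of weight matrices, Ws[l] of shape (m[l-1], m[l]).
    bs: list of bias vectors, bs[l] of shape (m[l],).
    Returns the (N, m[L]) matrix of network outputs.
    """
    A = X
    for W, b in zip(Ws, bs):
        Z = A @ W + b                 # row form of z[l] = W[l]^T a[l-1] + b[l]
        A = 1.0 / (1.0 + np.exp(-Z))  # sigmoid assumed for every layer
    return A
```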

4 Exercise
1. Train a 2-layer neural network on the following data, using 2 units in the hidden layer. The numbers of input and output units will be according to the dimensions of the data. Perform 2 iterations. Use α = 1 and the sigmoid function as the activation in all layers.

   x1   x2   y
    1   -1   0
   -1    1   0
    1    1   1
   -1   -1   0

2. Train the same neural network with ReLU as the activation function for the hidden layer and sigmoid for the output layer.

3. Use the perceptron rule to train a single perceptron on the training data given in the following table. Perform only one iteration, i.e. go once through each training instance and show the updating of the weights. Use α = 1 and the step function as the activation function.
After you have completed one iteration, draw the decision boundary in the x1-x2 plane. Is there a need for more iterations if the goal is to achieve 100% accuracy on the training data? ⁴

   x1    x2    y
   0     0     1
   0.5   0     1
   1     0     1
   0.5   1    -1
   1     1    -1
   0     1     1

⁴ The perceptron rule was covered in class; please refer to the slides.

4. For the data given in question 3, train a 2-layer neural network using 2 units in the hidden layer. The numbers of input and output units will be according to the dimensions of the data. Use activation functions appropriate for this problem.
5. Derive the equations of gradient descent if we use the cost of the complete set before updating the weights. The cost function will be

$$E = \frac{1}{2} \sum_{j=1}^{N} \sum_{i=1}^{m^{[L]}} (a_i^{j[L]} - y_i^j)^2 \qquad (32)$$

where $N$ is the number of training instances, $y_i^j$ is the $i$th component of the desired output of the $j$th instance, and $a_i^{j[L]}$ is the $i$th component of the neural net's output for the $j$th instance.
Hint: use $C$ instead of $E$.

6. Using the dataset given in the example (Section 2), train a neural network with 2 hidden layers. Perform one iteration, using 2 units per hidden layer. The numbers of input and output units will be according to the dimensions of the data. Use an activation function of your choice.
7. The vectorized form of the net is given in the Notes section; find the vectorized equations for $a^{[l]}$, $\delta^{[l]}$ and $\Delta W^{[l]}$. Please mention the dimensions of each vector/matrix you use.
8. Given the following network, find the error information of the units marked with ?, when the input is x = [1.5 1 2] and the desired output is y = -1. Use tanh as the activation function of all units.
9. For the same network as in question 8, find the required change in weights if the input is x = [-1.5 -1 -2] and the desired output is y = 1. Use tanh as the activation function of all units, and α = 0.1.
10. Design a fully connected feed-forward neural network to classify a 28x28 image into the class cat or dog. What are the dimensions of the input layer and the output layer? Which activation functions will you use in the output layer and in the hidden layers? Assume that converting your image to a 100-dimensional space, then to 50 dimensions and then to 10 dimensions will give good performance; how many hidden layers will you use and how many units will you use in each layer?
11. Given the task of classifying 28x28 gray-scale images of handwritten digits [0 to 9] using a fully connected feed-forward neural network, what will be the dimensions of the input layer and the output layer? How will you convert the output to a class value?

