Neural Networks Skimmed
We want to build a model to discriminate between red and green points in 2-dimensional space.
Given an input point $x = (x_1, x_2)$, we need to predict the output, either red or green (0 means red, 1 means green).
To build a good classifier for this dataset, we need to implement a neural network, which is quite similar to Logistic Regression (or Softmax
Regression), but with more layers.
In this example:
The input layer has 2 neurons, one for each input feature.
The first hidden layer has parameters: weights $W^{[1]}$ and bias $b^{[1]}$. The first hidden layer transforms the input $x$ into the output $a^{[1]}$:

$$z^{[1]} = W^{[1]} x + b^{[1]}, \qquad a^{[1]} = g(z^{[1]})$$
g() is the activation function. Here, we use sigmoid as the activation function.
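For reference, the sigmoid function used here is

$$g(z) = \frac{1}{1 + e^{-z}}$$

applied element-wise when $z$ is a vector.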
The second hidden layer has parameters $W^{[2]}$ and $b^{[2]}$ and transforms $a^{[1]}$ into $a^{[2]}$:

$$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}, \qquad a^{[2]} = g(z^{[2]})$$

$W^{[2]}$ is of size $(3 \times 3)$, $a^{[1]}$ is of size $(3 \times 1)$, and $b^{[2]}$ is of size $(3 \times 1)$.
We also have the sigmoid function g() as the activation function for this layer.
The output layer produces the prediction $a^{[3]}$:

$$z^{[3]} = W^{[3]} a^{[2]} + b^{[3]}, \qquad a^{[3]} = g(z^{[3]})$$

We also have the sigmoid function $g()$ as the activation function for this layer.
We can write the feed-forward phase for the above neural network as follows.
$$a^{[3]} = g(W^{[3]} a^{[2]} + b^{[3]})$$
Can we do the following instead, i.e., drop the activation function in the hidden layers?
$$a^{[3]} = g(W^{[3]}(W^{[2]}(W^{[1]} x + b^{[1]}) + b^{[2]}) + b^{[3]})$$
If we expand this expression, the weight matrices collapse into a single matrix $W = W^{[3]} W^{[2]} W^{[1]}$ and the bias terms collapse into a single vector $b = W^{[3]} W^{[2]} b^{[1]} + W^{[3]} b^{[2]} + b^{[3]}$, so

$$a^{[3]} = g(Wx + b) = g(w^T x + b)$$

(since the output is a single number, $W$ is just a row vector $w^T$), which is actually a Logistic Regression model, which can only produce a linear classifier.
Why do we need an activation function $g(x)$ between each layer of a neural network?
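As a quick numerical check of this point, stacking linear layers without activations in between collapses to a single linear map. The sketch below assumes the layer sizes of this example (2 inputs, two hidden layers of 3 neurons, 1 output); it is only an illustration, not the notebook's code.

import numpy as np

rng = np.random.default_rng(1)
W1, W2, W3 = rng.normal(size=(3, 2)), rng.normal(size=(3, 3)), rng.normal(size=(1, 3))
b1, b2, b3 = rng.normal(size=(3, 1)), rng.normal(size=(3, 1)), rng.normal(size=(1, 1))
x = rng.normal(size=(2, 1))

# three layers with no activation in between
out_stacked = W3 @ (W2 @ (W1 @ x + b1) + b2) + b3
# one equivalent linear layer: W = W3 W2 W1, b = W3 W2 b1 + W3 b2 + b3
out_single = (W3 @ W2 @ W1) @ x + (W3 @ W2 @ b1 + W3 @ b2 + b3)
print(np.allclose(out_stacked, out_single))   # True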
def forward(x, W1, b1, W2, b2, W3, b3):
    z1 = np.matmul(W1, x) + b1            # z1 here is a vector
    a1 = sigmoid(z1)                      # a1 here is not a number, but a vector
    a2 = sigmoid(np.matmul(W2, a1) + b2)  # second hidden layer
    a3 = sigmoid(np.matmul(W3, a2) + b3)  # output layer
    return a3
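The cell above relies on sigmoid and on parameter values that are not defined yet. A minimal sketch of those missing pieces, assuming the layer sizes described above (2 inputs, two hidden layers of 3 neurons, 1 output); the initialization shown here is just an illustration:

import numpy as np

def sigmoid(z):
    # element-wise sigmoid: 1 / (1 + e^(-z))
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros((3, 1))   # first hidden layer
W2, b2 = rng.normal(size=(3, 3)), np.zeros((3, 1))   # second hidden layer
W3, b3 = rng.normal(size=(1, 3)), np.zeros((1, 1))   # output layer

x = np.array([[0.5], [-1.2]])                        # one input point, shape (2, 1)
print(forward(x, W1, b1, W2, b2, W3, b3))            # a value between 0 and 1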
Because the parameters of the neural network have random values, its predictions are not good.
For the neural network in this example, we have the cost function
$$J = -\sum_{i=1}^{N} \left[ y^{(i)} \log(a^{[3](i)}) + (1 - y^{(i)}) \log(1 - a^{[3](i)}) \right]$$
Because we have only two classes, this loss function is also called Binary Cross Entropy Loss.
If we have more than two classes, we use the general Cross Entropy Loss function.
$$J = -\sum_{i=1}^{N} \sum_{j=1}^{C} y_j^{(i)} \log(a_j^{[3](i)})$$
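Both losses are easy to write with numpy. A minimal sketch (the names y_true, a3, Y_true, A3 are illustrative; the binary version takes the 0/1 labels and the scalar outputs for N samples, the general version takes one-hot labels and class probabilities of shape (N, C)):

import numpy as np

def binary_cross_entropy(y_true, a3):
    # J = -sum over samples of [ y log(a) + (1 - y) log(1 - a) ]
    return -np.sum(y_true * np.log(a3) + (1 - y_true) * np.log(1 - a3))

def cross_entropy(Y_true, A3):
    # J = -sum over samples and classes of y_j log(a_j)
    return -np.sum(Y_true * np.log(A3))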
Backpropagation
We also use Gradient Descent to find the parameter values for the neural network.
In order to find the parameter values that minimize the cost/loss function, we need to compute the gradients/derivatives of the loss function $L$ with respect to the parameters at each layer of the neural network:
$$\frac{dL}{dW^{[i]}}, \quad \frac{dL}{db^{[i]}}$$
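Once these gradients are available, the gradient descent update for each layer has the standard form (with learning rate $\alpha$):

$$W^{[i]} := W^{[i]} - \alpha \frac{dL}{dW^{[i]}}, \qquad b^{[i]} := b^{[i]} - \alpha \frac{dL}{db^{[i]}}$$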
We apply the chain rule in the opposite direction to the feed-forward direction (hence the name backpropagation).
For the sake of simplicity, we will use only one sample here, so we can drop the notation $(i)$.
For a single sample, the loss is

$$L = -[y \log(a^{[3]}) + (1 - y) \log(1 - a^{[3]})]$$

and, because the output unit is a sigmoid, $\frac{dL}{dz^{[3]}} = a^{[3]} - y$.

Next, we compute $\frac{dz^{[3]}}{dW^{[3]}}$ and $\frac{dz^{[3]}}{db^{[3]}}$.
$$[z^{[3]}] = \begin{bmatrix} w^{[3]}_{1,1} & w^{[3]}_{1,2} & w^{[3]}_{1,3} \end{bmatrix} \begin{bmatrix} a^{[2]}_1 \\ a^{[2]}_2 \\ a^{[2]}_3 \end{bmatrix} + [b^{[3]}] = w^{[3]}_{1,1} a^{[2]}_1 + w^{[3]}_{1,2} a^{[2]}_2 + w^{[3]}_{1,3} a^{[2]}_3 + b^{[3]}$$

Therefore,

$$\frac{dz^{[3]}}{dW^{[3]}} = \begin{bmatrix} \frac{dz^{[3]}}{dw^{[3]}_{1,1}} & \frac{dz^{[3]}}{dw^{[3]}_{1,2}} & \frac{dz^{[3]}}{dw^{[3]}_{1,3}} \end{bmatrix} = \begin{bmatrix} a^{[2]}_1 & a^{[2]}_2 & a^{[2]}_3 \end{bmatrix} = (a^{[2]})^T$$
Similarly, if we change $b^{[3]}$ by a small amount, $z^{[3]}$ will also change by the same amount in the same direction.
$$\frac{dz^{[3]}}{db^{[3]}} = 1$$

$$\frac{dL}{db^{[3]}} = \frac{dL}{dz^{[3]}} \frac{dz^{[3]}}{db^{[3]}} = (a^{[3]} - y)$$
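These layer-3 results translate directly into numpy. A minimal sketch, assuming a2, a3, and the label y are available from the forward pass (the variable names are illustrative):

dz3 = a3 - y                   # dL/dz3, shape (1, 1)
dW3 = np.matmul(dz3, a2.T)     # dL/dW3 = dL/dz3 * (a2)^T, shape (1, 3)
db3 = dz3                      # dL/db3 = dL/dz3, shape (1, 1)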
We need to compute

$$\frac{dL}{da^{[2]}} = \frac{dL}{dz^{[3]}} \frac{dz^{[3]}}{da^{[2]}}$$
$$[z^{[3]}] = \begin{bmatrix} w^{[3]}_{1,1} & w^{[3]}_{1,2} & w^{[3]}_{1,3} \end{bmatrix} \begin{bmatrix} a^{[2]}_1 \\ a^{[2]}_2 \\ a^{[2]}_3 \end{bmatrix} + [b^{[3]}] = w^{[3]}_{1,1} a^{[2]}_1 + w^{[3]}_{1,2} a^{[2]}_2 + w^{[3]}_{1,3} a^{[2]}_3 + b^{[3]}$$
$$\frac{dz^{[3]}}{da^{[2]}} = \begin{bmatrix} \frac{dz^{[3]}}{da^{[2]}_1} \\ \frac{dz^{[3]}}{da^{[2]}_2} \\ \frac{dz^{[3]}}{da^{[2]}_3} \end{bmatrix} = \begin{bmatrix} w^{[3]}_{1,1} \\ w^{[3]}_{1,2} \\ w^{[3]}_{1,3} \end{bmatrix} = (W^{[3]})^T$$
Therefore,

$$\frac{dL}{da^{[2]}} = \frac{dL}{dz^{[3]}} \frac{dz^{[3]}}{da^{[2]}} = (a^{[3]} - y)(W^{[3]})^T$$
Next, we compute $\frac{dL}{dz^{[2]}}$.
We have $a^{[2]} = g(z^{[2]})$. That is,

$$a^{[2]}_1 = g(z^{[2]}_1) = \frac{1}{1 + e^{-z^{[2]}_1}}$$

$$a^{[2]}_2 = g(z^{[2]}_2) = \frac{1}{1 + e^{-z^{[2]}_2}}$$

$$a^{[2]}_3 = g(z^{[2]}_3) = \frac{1}{1 + e^{-z^{[2]}_3}}$$
Thus, we have

$$\frac{da^{[2]}}{dz^{[2]}} = \begin{bmatrix} \frac{da^{[2]}_1}{dz^{[2]}_1} \\ \frac{da^{[2]}_2}{dz^{[2]}_2} \\ \frac{da^{[2]}_3}{dz^{[2]}_3} \end{bmatrix} = \begin{bmatrix} a^{[2]}_1 (1 - a^{[2]}_1) \\ a^{[2]}_2 (1 - a^{[2]}_2) \\ a^{[2]}_3 (1 - a^{[2]}_3) \end{bmatrix} = a^{[2]} \circ (1 - a^{[2]})$$
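A quick finite-difference check of this derivative (a sketch; z and eps are arbitrary choices, not values from the notebook):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.array([[0.3], [-1.0], [2.0]])     # an arbitrary z[2]
a = sigmoid(z)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)   # central differences
analytic = a * (1 - a)                   # a[2] ∘ (1 − a[2])
print(np.allclose(numeric, analytic))    # True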
Therefore,

$$\frac{dL}{dz^{[2]}} = \begin{bmatrix} \frac{dL}{dz^{[2]}_1} \\ \frac{dL}{dz^{[2]}_2} \\ \frac{dL}{dz^{[2]}_3} \end{bmatrix} = \begin{bmatrix} \frac{dL}{da^{[2]}_1} \frac{da^{[2]}_1}{dz^{[2]}_1} \\ \frac{dL}{da^{[2]}_2} \frac{da^{[2]}_2}{dz^{[2]}_2} \\ \frac{dL}{da^{[2]}_3} \frac{da^{[2]}_3}{dz^{[2]}_3} \end{bmatrix} = \frac{dL}{da^{[2]}} \circ (a^{[2]} \circ (1 - a^{[2]}))$$
We have

$$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$$
$$\frac{dL}{dW^{[2]}} = \begin{bmatrix} \frac{dL}{dw^{[2]}_{1,1}} & \frac{dL}{dw^{[2]}_{1,2}} & \frac{dL}{dw^{[2]}_{1,3}} \\ \frac{dL}{dw^{[2]}_{2,1}} & \frac{dL}{dw^{[2]}_{2,2}} & \frac{dL}{dw^{[2]}_{2,3}} \\ \frac{dL}{dw^{[2]}_{3,1}} & \frac{dL}{dw^{[2]}_{3,2}} & \frac{dL}{dw^{[2]}_{3,3}} \end{bmatrix} = \begin{bmatrix} \frac{dL}{dz^{[2]}_1} \frac{dz^{[2]}_1}{dw^{[2]}_{1,1}} & \frac{dL}{dz^{[2]}_1} \frac{dz^{[2]}_1}{dw^{[2]}_{1,2}} & \frac{dL}{dz^{[2]}_1} \frac{dz^{[2]}_1}{dw^{[2]}_{1,3}} \\ \frac{dL}{dz^{[2]}_2} \frac{dz^{[2]}_2}{dw^{[2]}_{2,1}} & \frac{dL}{dz^{[2]}_2} \frac{dz^{[2]}_2}{dw^{[2]}_{2,2}} & \frac{dL}{dz^{[2]}_2} \frac{dz^{[2]}_2}{dw^{[2]}_{2,3}} \\ \frac{dL}{dz^{[2]}_3} \frac{dz^{[2]}_3}{dw^{[2]}_{3,1}} & \frac{dL}{dz^{[2]}_3} \frac{dz^{[2]}_3}{dw^{[2]}_{3,2}} & \frac{dL}{dz^{[2]}_3} \frac{dz^{[2]}_3}{dw^{[2]}_{3,3}} \end{bmatrix} = \frac{dL}{dz^{[2]}} (a^{[1]})^T$$

and

$$\frac{dL}{db^{[2]}} = \frac{dL}{dz^{[2]}}$$
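In code, the layer-2 gradients follow the same pattern. A minimal sketch, assuming a1, a2, W3, and dz3 = a3 - y are available from the steps above (names are illustrative):

da2 = np.matmul(W3.T, dz3)     # dL/da2 = (W3)^T * dL/dz3, shape (3, 1)
dz2 = da2 * a2 * (1 - a2)      # dL/dz2 = dL/da2 ∘ a2 ∘ (1 − a2), shape (3, 1)
dW2 = np.matmul(dz2, a1.T)     # dL/dW2 = dL/dz2 * (a1)^T, shape (3, 3)
db2 = dz2                      # dL/db2 = dL/dz2, shape (3, 1)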
Next, we compute

$$\frac{dL}{da^{[1]}} = \frac{dL}{dz^{[2]}} \frac{dz^{[2]}}{da^{[1]}}$$
Again, we have

$$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$$

$$\begin{bmatrix} z^{[2]}_1 \\ z^{[2]}_2 \\ z^{[2]}_3 \end{bmatrix} = \begin{bmatrix} w^{[2]}_{1,1} & w^{[2]}_{1,2} & w^{[2]}_{1,3} \\ w^{[2]}_{2,1} & w^{[2]}_{2,2} & w^{[2]}_{2,3} \\ w^{[2]}_{3,1} & w^{[2]}_{3,2} & w^{[2]}_{3,3} \end{bmatrix} \begin{bmatrix} a^{[1]}_1 \\ a^{[1]}_2 \\ a^{[1]}_3 \end{bmatrix} + \begin{bmatrix} b^{[2]}_1 \\ b^{[2]}_2 \\ b^{[2]}_3 \end{bmatrix}$$
$$\frac{dz^{[2]}}{da^{[1]}} = \begin{bmatrix} \frac{dz^{[2]}}{da^{[1]}_1} \\ \frac{dz^{[2]}}{da^{[1]}_2} \\ \frac{dz^{[2]}}{da^{[1]}_3} \end{bmatrix} = \begin{bmatrix} \frac{dz^{[2]}_1}{da^{[1]}_1} & \frac{dz^{[2]}_2}{da^{[1]}_1} & \frac{dz^{[2]}_3}{da^{[1]}_1} \\ \frac{dz^{[2]}_1}{da^{[1]}_2} & \frac{dz^{[2]}_2}{da^{[1]}_2} & \frac{dz^{[2]}_3}{da^{[1]}_2} \\ \frac{dz^{[2]}_1}{da^{[1]}_3} & \frac{dz^{[2]}_2}{da^{[1]}_3} & \frac{dz^{[2]}_3}{da^{[1]}_3} \end{bmatrix} = \begin{bmatrix} w^{[2]}_{1,1} & w^{[2]}_{2,1} & w^{[2]}_{3,1} \\ w^{[2]}_{1,2} & w^{[2]}_{2,2} & w^{[2]}_{3,2} \\ w^{[2]}_{1,3} & w^{[2]}_{2,3} & w^{[2]}_{3,3} \end{bmatrix} = (W^{[2]})^T$$
We then have

$$\frac{dL}{da^{[1]}} = \frac{dz^{[2]}}{da^{[1]}} \frac{dL}{dz^{[2]}} = (W^{[2]})^T \frac{dL}{dz^{[2]}}$$
For the first layer, we again have $z^{[1]} = W^{[1]} x + b^{[1]}$:

$$\begin{bmatrix} z^{[1]}_1 \\ z^{[1]}_2 \\ z^{[1]}_3 \end{bmatrix} = \begin{bmatrix} w^{[1]}_{1,1} & w^{[1]}_{1,2} \\ w^{[1]}_{2,1} & w^{[1]}_{2,2} \\ w^{[1]}_{3,1} & w^{[1]}_{3,2} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} b^{[1]}_1 \\ b^{[1]}_2 \\ b^{[1]}_3 \end{bmatrix}$$

Repeating the layer-2 steps with $a^{[1]} = g(z^{[1]})$ gives $\frac{dL}{dz^{[1]}} = \frac{dL}{da^{[1]}} \circ (a^{[1]} \circ (1 - a^{[1]}))$, and therefore

$$\frac{dL}{dW^{[1]}} = \frac{dL}{dz^{[1]}} x^T$$

and

$$\frac{dL}{db^{[1]}} = \frac{dL}{dz^{[1]}}$$
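Putting all the pieces together, one full training step (forward pass, backward pass, gradient descent update) could look like the sketch below. It assumes the layer sizes used throughout this notebook and an arbitrarily chosen learning rate lr; it is an illustration of the derivations above, not the notebook's own training code.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_step(x, y, params, lr=0.1):
    W1, b1, W2, b2, W3, b3 = params
    # forward pass
    a1 = sigmoid(np.matmul(W1, x) + b1)
    a2 = sigmoid(np.matmul(W2, a1) + b2)
    a3 = sigmoid(np.matmul(W3, a2) + b3)
    # backward pass for one sample, following the derivations above
    dz3 = a3 - y
    dW3, db3 = np.matmul(dz3, a2.T), dz3
    dz2 = np.matmul(W3.T, dz3) * a2 * (1 - a2)
    dW2, db2 = np.matmul(dz2, a1.T), dz2
    dz1 = np.matmul(W2.T, dz2) * a1 * (1 - a1)
    dW1, db1 = np.matmul(dz1, x.T), dz1
    # gradient descent update
    W1, b1 = W1 - lr * dW1, b1 - lr * db1
    W2, b2 = W2 - lr * dW2, b2 - lr * db2
    W3, b3 = W3 - lr * dW3, b3 - lr * db3
    return W1, b1, W2, b2, W3, b3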