Neural Networks Skimmed - ipynb - Colab

Neural Networks

We want to build a model to discriminate red and green points in a 2-dimensional space.


Given an input point $x = (x_1, x_2)$, we need to predict the output, either red or green (0 means red, 1 means green).

This dataset is NOT linearly separable.

To build a good classifier for this dataset, we need to implement a neural network, which is quite similar to Logistic Regression (or Softmax
Regression), but with more layers.

Computation Graph

In this example,

- The input layer has 2 neurons, corresponding to each input feature.
- The first hidden layer has parameters: weights $W^{[1]}$ and bias $b^{[1]}$. The first hidden layer transforms input $x$ into output $a^{[1]}$.
- The second hidden layer has parameters: weights $W^{[2]}$ and bias $b^{[2]}$. The second hidden layer takes the output $a^{[1]}$ of the previous layer as its input and transforms it into output $a^{[2]}$.
- The third layer is the output layer, which has parameters: weights $W^{[3]}$ and bias $b^{[3]}$. This last layer takes the output $a^{[2]}$ of the previous layer as its input and transforms it into output $a^{[3]}$.

$a^{[3]}$ is the output of the neural network in this example.

Our prediction for the output would be

$$\text{Predicted class} = \begin{cases} 0, & \text{if } a^{[3]} < 0.5 \\ 1, & \text{otherwise.} \end{cases}$$
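
In code, this thresholding rule is just a comparison (a minimal sketch; `a3` stands for the scalar network output computed later in this notebook):

def predict_class(a3):
    # Predict 0 (red) if a3 < 0.5, otherwise 1 (green)
    return 0 if a3 < 0.5 else 1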

Feed-forward Phase


Given the parameters of a neural network and an input $x$, how do we produce an output?

For this example, we have

$$z^{[1]} = W^{[1]} x + b^{[1]}$$

$$a^{[1]} = g(z^{[1]})$$

$W^{[1]}$ is of size $(3 \times 2)$, $x$ is of size $(2 \times 1)$, $b^{[1]}$ is of size $(3 \times 1)$.

$g(\cdot)$ is the activation function. Here, we use sigmoid as the activation function.
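
As a minimal sketch of this first layer (the `sigmoid` helper and the random parameter values are assumptions here; the notebook's own definitions are not shown in this excerpt):

import numpy as np

def sigmoid(z):
    # Element-wise sigmoid activation: g(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

W1 = np.random.randn(3, 2)       # weights of layer 1, size (3 x 2)
b1 = np.random.randn(3, 1)       # bias of layer 1, size (3 x 1)
x = np.array([[0.65], [0.0]])    # an input point, size (2 x 1)

z1 = np.matmul(W1, x) + b1       # size (3 x 1)
a1 = sigmoid(z1)                 # size (3 x 1)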

$$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$$

$$a^{[2]} = g(z^{[2]})$$

$W^{[2]}$ is of size $(3 \times 3)$, $a^{[1]}$ is of size $(3 \times 1)$, $b^{[2]}$ is of size $(3 \times 1)$.

We also have the sigmoid function $g(\cdot)$ as the activation function for this layer.

$$z^{[3]} = W^{[3]} a^{[2]} + b^{[3]}$$

$$a^{[3]} = g(z^{[3]})$$

$W^{[3]}$ is of size $(1 \times 3)$, $a^{[2]}$ is of size $(3 \times 1)$, $b^{[3]}$ is of size $(1 \times 1)$ (i.e., a scalar).

We also have the sigmoid function $g(\cdot)$ as the activation function for this layer.
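
The remaining parameters can be initialized with the same sizes (random initialization is an assumption here; this simply continues the sketch above):

W2 = np.random.randn(3, 3)       # weights of layer 2, size (3 x 3)
b2 = np.random.randn(3, 1)       # bias of layer 2, size (3 x 1)
W3 = np.random.randn(1, 3)       # weights of layer 3, size (1 x 3)
b3 = np.random.randn(1, 1)       # bias of layer 3, size (1 x 1)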

We can write the feed-forward phase for the above neural network as follows.

$$\begin{aligned}
a^{[3]} &= g(W^{[3]} a^{[2]} + b^{[3]}) \\
&= g(W^{[3]} g(z^{[2]}) + b^{[3]}) \\
&= g(W^{[3]} g(W^{[2]} a^{[1]} + b^{[2]}) + b^{[3]}) \\
&= g(W^{[3]} g(W^{[2]} g(z^{[1]}) + b^{[2]}) + b^{[3]}) \\
&= g(W^{[3]} g(W^{[2]} g(W^{[1]} x + b^{[1]}) + b^{[2]}) + b^{[3]})
\end{aligned}$$

Can we do this?

$$\begin{aligned}
a^{[3]} &= g(W^{[3]}(W^{[2]}(W^{[1]} x + b^{[1]}) + b^{[2]}) + b^{[3]}) \\
&= g(W^{[3]}(W^{[2]} W^{[1]} x + W^{[2]} b^{[1]} + b^{[2]}) + b^{[3]}) \\
&= g(W^{[3]} W^{[2]} W^{[1]} x + W^{[3]} W^{[2]} b^{[1]} + W^{[3]} b^{[2]} + b^{[3]}) \\
&= g(W x + b)
\end{aligned}$$

where $W = W^{[3]} W^{[2]} W^{[1]}$ and $b = W^{[3]} W^{[2]} b^{[1]} + W^{[3]} b^{[2]} + b^{[3]}$.

Because $W$ is actually a row vector (of size $(1 \times 2)$ here), we would have $a^{[3]} = g(w^T x + b)$, which is actually a Logistic Regression model and can only produce a linear classifier.
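
A quick numerical check of this collapse (a sketch that reuses the parameters and input point from the sketches above; `numpy` is assumed to be imported as `np`):

deep = W3 @ (W2 @ (W1 @ x + b1) + b2) + b3   # three stacked linear layers, no activation
W = W3 @ W2 @ W1                             # collapsed weights, size (1 x 2)
b = W3 @ W2 @ b1 + W3 @ b2 + b3              # collapsed bias, size (1 x 1)
print(np.allclose(deep, W @ x + b))          # True: without g(), this is a single linear layer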

Why do we need an activation function $g(\cdot)$ between the layers of a neural network?
def forward(x, W1, b1, W2, b2, W3, b3):
    z1 = np.matmul(W1, x) + b1   # z1 here is a vector
    a1 = sigmoid(z1)             # a1 here is not a number, but a vector

    z2 = np.matmul(W2, a1) + b2  # z2 here is a vector
    a2 = sigmoid(z2)             # a2 is a vector

    z3 = np.matmul(W3, a2) + b3  # z3 here is a (1x1) vector --> a number
    a3 = sigmoid(z3)             # a3 here is a number

    return z1, a1, z2, a2, z3, a3
    # return a3                  # a3 is the output of our neural network

# Let's try to make some prediction with this neural network

The coordinates of the input point is [0.65 0. ]


True label is 0
Predicted value is 0.6126327783832988
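
The code cell that produced this output is not shown in this excerpt; a hedged sketch of what it might have looked like (reusing `forward` and the randomly initialized parameters above, so the exact printed number will differ):

x = np.array([[0.65], [0.0]])   # the input point shown above
y = 0                           # its true label

z1, a1, z2, a2, z3, a3 = forward(x, W1, b1, W2, b2, W3, b3)
print("The coordinates of the input point is", x.ravel())
print("True label is", y)
print("Predicted value is", a3.item())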

Because the parameters of the neural network have random values, its predictions are not good.

How to learn these parameters properly?

Cross Entropy Loss Function


We want to find parameter values of our neural network that minimize a cost function. We use the same cost function (or loss function) as in Logistic Regression or Softmax Regression.

For the neural network in this example, we have the cost function

$$J = -\sum_{i=1}^{N} \left[ y^{(i)} \log(a^{[3](i)}) + (1 - y^{(i)}) \log(1 - a^{[3](i)}) \right]$$

Because we have only two classes, this loss function is also called Binary Cross Entropy Loss.

If we have more than two classes, we use the general Cross Entropy Loss function.

$$J = -\sum_{i=1}^{N} \sum_{j=1}^{C} y_j^{(i)} \log(a_j^{[3](i)})$$

The loss function for a single sample $i$ in our dataset is

$$L = -\left( y^{(i)} \log(a^{[3](i)}) + (1 - y^{(i)}) \log(1 - a^{[3](i)}) \right)$$
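
A minimal NumPy sketch of this per-sample loss (the clipping to avoid `log(0)` is an added safeguard, not part of the original formula):

def binary_cross_entropy(y, a3, eps=1e-12):
    # L = -(y * log(a3) + (1 - y) * log(1 - a3))
    a3 = np.clip(a3, eps, 1.0 - eps)
    return -(y * np.log(a3) + (1 - y) * np.log(1 - a3))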

Backpropagation
We also use Gradient Descent to find the parameter values for the neural network.

In order to find the parameter values that minimize the cost/loss function, we need to compute the gradients (derivatives) of the loss function $L$ with respect to the parameters at each layer of the neural network:

$$\frac{dL}{dW^{[i]}}, \quad \frac{dL}{db^{[i]}}$$

We use the Chain Rule to compute these derivatives.

We apply the chain rule in the opposite direction to the feed-forward direction (hence the name backpropagation).

For the sake of simplicity, we will use only one sample here, so we can drop the notation $(i)$.

In Logistic Regression, we already computed

$$\frac{dL}{dz^{[3]}} = a^{[3]} - y$$

Next, we compute $\frac{dz^{[3]}}{dW^{[3]}}$ and $\frac{dz^{[3]}}{db^{[3]}}$.

$$z^{[3]} = W^{[3]} a^{[2]} + b^{[3]}$$

$$[z^{[3]}] = \begin{bmatrix} w^{[3]}_{1,1} & w^{[3]}_{1,2} & w^{[3]}_{1,3} \end{bmatrix} \begin{bmatrix} a^{[2]}_1 \\ a^{[2]}_2 \\ a^{[2]}_3 \end{bmatrix} + [b^{[3]}] = w^{[3]}_{1,1} a^{[2]}_1 + w^{[3]}_{1,2} a^{[2]}_2 + w^{[3]}_{1,3} a^{[2]}_3 + b^{[3]}$$

We can see that:

- If we change $w^{[3]}_{1,1}$, $z^{[3]}$ will change proportionally to $a^{[2]}_1$.
- If we change $w^{[3]}_{1,2}$, $z^{[3]}$ will change proportionally to $a^{[2]}_2$.
- If we change $w^{[3]}_{1,3}$, $z^{[3]}$ will change proportionally to $a^{[2]}_3$.
- Thus, if we change $W^{[3]}$, $z^{[3]}$ will change proportionally to $a^{[2]}$.

Therefore,

$$\frac{dz^{[3]}}{dW^{[3]}} = \begin{bmatrix} \frac{dz^{[3]}}{dw^{[3]}_{1,1}} & \frac{dz^{[3]}}{dw^{[3]}_{1,2}} & \frac{dz^{[3]}}{dw^{[3]}_{1,3}} \end{bmatrix} = \begin{bmatrix} a^{[2]}_1 & a^{[2]}_2 & a^{[2]}_3 \end{bmatrix} = (a^{[2]})^T$$

Similarly, if we change $b^{[3]}$ by a small amount, $z^{[3]}$ will also change by the same amount in the same direction:

$$\frac{dz^{[3]}}{db^{[3]}} = 1.$$

Now, we can compute

$$\frac{dL}{dW^{[3]}} = \frac{dL}{dz^{[3]}} \frac{dz^{[3]}}{dW^{[3]}} = (a^{[3]} - y)(a^{[2]})^T$$

$$\frac{dL}{db^{[3]}} = \frac{dL}{dz^{[3]}} \frac{dz^{[3]}}{db^{[3]}} = (a^{[3]} - y)$$
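
In code, these layer-3 gradients might look like the following sketch (assuming `a2`, `a3`, and the scalar label `y` from the forward pass above):

dz3 = a3 - y                  # dL/dz3, size (1 x 1)
dW3 = np.matmul(dz3, a2.T)    # dL/dW3 = (a3 - y) (a2)^T, size (1 x 3)
db3 = dz3                     # dL/db3 = (a3 - y), size (1 x 1)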

Next, we need to calculate $\frac{dL}{dW^{[2]}}$ and $\frac{dL}{db^{[2]}}$.

We need to compute

$$\frac{dL}{da^{[2]}} = \frac{dL}{dz^{[3]}} \frac{dz^{[3]}}{da^{[2]}}$$

$$z^{[3]} = W^{[3]} a^{[2]} + b^{[3]}$$

$$[z^{[3]}] = \begin{bmatrix} w^{[3]}_{1,1} & w^{[3]}_{1,2} & w^{[3]}_{1,3} \end{bmatrix} \begin{bmatrix} a^{[2]}_1 \\ a^{[2]}_2 \\ a^{[2]}_3 \end{bmatrix} + [b^{[3]}] = w^{[3]}_{1,1} a^{[2]}_1 + w^{[3]}_{1,2} a^{[2]}_2 + w^{[3]}_{1,3} a^{[2]}_3 + b^{[3]}$$

$$\frac{dz^{[3]}}{da^{[2]}} = \begin{bmatrix} \frac{dz^{[3]}}{da^{[2]}_1} \\ \frac{dz^{[3]}}{da^{[2]}_2} \\ \frac{dz^{[3]}}{da^{[2]}_3} \end{bmatrix} = \begin{bmatrix} w^{[3]}_{1,1} \\ w^{[3]}_{1,2} \\ w^{[3]}_{1,3} \end{bmatrix} = (W^{[3]})^T$$

Therefore,

$$\frac{dL}{da^{[2]}} = \frac{dL}{dz^{[3]}} \frac{dz^{[3]}}{da^{[2]}} = (a^{[3]} - y)(W^{[3]})^T$$

Next, we compute $\frac{dL}{dz^{[2]}}$.

We have

$$a^{[2]} = g(z^{[2]}) = \frac{1}{1 + e^{-z^{[2]}}}$$

where $g(z)$ is the sigmoid function.

That is,

$$a^{[2]}_1 = g(z^{[2]}_1) = \frac{1}{1 + e^{-z^{[2]}_1}}$$

$$a^{[2]}_2 = g(z^{[2]}_2) = \frac{1}{1 + e^{-z^{[2]}_2}}$$

$$a^{[2]}_3 = g(z^{[2]}_3) = \frac{1}{1 + e^{-z^{[2]}_3}}$$

In Logistic Regression, we already derived the derivative of the sigmoid function:

$$\frac{da}{dz} = a(1 - a)$$

Thus, we have

$$\frac{da^{[2]}}{dz^{[2]}} = \begin{bmatrix} \frac{da^{[2]}_1}{dz^{[2]}_1} \\ \frac{da^{[2]}_2}{dz^{[2]}_2} \\ \frac{da^{[2]}_3}{dz^{[2]}_3} \end{bmatrix} = \begin{bmatrix} a^{[2]}_1 (1 - a^{[2]}_1) \\ a^{[2]}_2 (1 - a^{[2]}_2) \\ a^{[2]}_3 (1 - a^{[2]}_3) \end{bmatrix} = a^{[2]} \circ (1 - a^{[2]})$$

Therefore,

$$\frac{dL}{dz^{[2]}} = \begin{bmatrix} \frac{dL}{dz^{[2]}_1} \\ \frac{dL}{dz^{[2]}_2} \\ \frac{dL}{dz^{[2]}_3} \end{bmatrix} = \begin{bmatrix} \frac{dL}{da^{[2]}_1} \frac{da^{[2]}_1}{dz^{[2]}_1} \\ \frac{dL}{da^{[2]}_2} \frac{da^{[2]}_2}{dz^{[2]}_2} \\ \frac{dL}{da^{[2]}_3} \frac{da^{[2]}_3}{dz^{[2]}_3} \end{bmatrix} = \frac{dL}{da^{[2]}} \circ \left( a^{[2]} \circ (1 - a^{[2]}) \right)$$

Now, we can calculate $\frac{dL}{dW^{[2]}}$ and $\frac{dL}{db^{[2]}}$.

We have

$$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$$

$$\begin{bmatrix} z^{[2]}_1 \\ z^{[2]}_2 \\ z^{[2]}_3 \end{bmatrix} = \begin{bmatrix} w^{[2]}_{1,1} & w^{[2]}_{1,2} & w^{[2]}_{1,3} \\ w^{[2]}_{2,1} & w^{[2]}_{2,2} & w^{[2]}_{2,3} \\ w^{[2]}_{3,1} & w^{[2]}_{3,2} & w^{[2]}_{3,3} \end{bmatrix} \begin{bmatrix} a^{[1]}_1 \\ a^{[1]}_2 \\ a^{[1]}_3 \end{bmatrix} + \begin{bmatrix} b^{[2]}_1 \\ b^{[2]}_2 \\ b^{[2]}_3 \end{bmatrix}$$

$$\frac{dL}{dW^{[2]}} = \begin{bmatrix} \frac{dL}{dw^{[2]}_{1,1}} & \frac{dL}{dw^{[2]}_{1,2}} & \frac{dL}{dw^{[2]}_{1,3}} \\ \frac{dL}{dw^{[2]}_{2,1}} & \frac{dL}{dw^{[2]}_{2,2}} & \frac{dL}{dw^{[2]}_{2,3}} \\ \frac{dL}{dw^{[2]}_{3,1}} & \frac{dL}{dw^{[2]}_{3,2}} & \frac{dL}{dw^{[2]}_{3,3}} \end{bmatrix} = \begin{bmatrix} \frac{dL}{dz^{[2]}_1} \frac{dz^{[2]}_1}{dw^{[2]}_{1,1}} & \frac{dL}{dz^{[2]}_1} \frac{dz^{[2]}_1}{dw^{[2]}_{1,2}} & \frac{dL}{dz^{[2]}_1} \frac{dz^{[2]}_1}{dw^{[2]}_{1,3}} \\ \frac{dL}{dz^{[2]}_2} \frac{dz^{[2]}_2}{dw^{[2]}_{2,1}} & \frac{dL}{dz^{[2]}_2} \frac{dz^{[2]}_2}{dw^{[2]}_{2,2}} & \frac{dL}{dz^{[2]}_2} \frac{dz^{[2]}_2}{dw^{[2]}_{2,3}} \\ \frac{dL}{dz^{[2]}_3} \frac{dz^{[2]}_3}{dw^{[2]}_{3,1}} & \frac{dL}{dz^{[2]}_3} \frac{dz^{[2]}_3}{dw^{[2]}_{3,2}} & \frac{dL}{dz^{[2]}_3} \frac{dz^{[2]}_3}{dw^{[2]}_{3,3}} \end{bmatrix}$$

$$= \begin{bmatrix} \frac{dL}{dz^{[2]}_1} a^{[1]}_1 & \frac{dL}{dz^{[2]}_1} a^{[1]}_2 & \frac{dL}{dz^{[2]}_1} a^{[1]}_3 \\ \frac{dL}{dz^{[2]}_2} a^{[1]}_1 & \frac{dL}{dz^{[2]}_2} a^{[1]}_2 & \frac{dL}{dz^{[2]}_2} a^{[1]}_3 \\ \frac{dL}{dz^{[2]}_3} a^{[1]}_1 & \frac{dL}{dz^{[2]}_3} a^{[1]}_2 & \frac{dL}{dz^{[2]}_3} a^{[1]}_3 \end{bmatrix} = \begin{bmatrix} \frac{dL}{dz^{[2]}_1} \\ \frac{dL}{dz^{[2]}_2} \\ \frac{dL}{dz^{[2]}_3} \end{bmatrix} \begin{bmatrix} a^{[1]}_1 & a^{[1]}_2 & a^{[1]}_3 \end{bmatrix}$$

$$\frac{dL}{dW^{[2]}} = \frac{dL}{dz^{[2]}} (a^{[1]})^T$$

where we already computed $\frac{dL}{dz^{[2]}}$ above.

And, similarly as above, we have

$$\frac{dL}{db^{[2]}} = \frac{dL}{dz^{[2]}}.$$
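
Collected in code, the layer-2 gradients might look like this sketch (continuing from the quantities computed above):

da2 = np.matmul(W3.T, dz3)    # dL/da2 = (W3)^T dL/dz3, size (3 x 1)
dz2 = da2 * a2 * (1 - a2)     # dL/dz2 = dL/da2 * a2 * (1 - a2), element-wise, size (3 x 1)
dW2 = np.matmul(dz2, a1.T)    # dL/dW2 = dL/dz2 (a1)^T, size (3 x 3)
db2 = dz2                     # dL/db2 = dL/dz2, size (3 x 1)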

Next, we compute

$$\frac{dL}{da^{[1]}} = \frac{dL}{dz^{[2]}} \frac{dz^{[2]}}{da^{[1]}}$$

Again, we have

$$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$$

$$\begin{bmatrix} z^{[2]}_1 \\ z^{[2]}_2 \\ z^{[2]}_3 \end{bmatrix} = \begin{bmatrix} w^{[2]}_{1,1} & w^{[2]}_{1,2} & w^{[2]}_{1,3} \\ w^{[2]}_{2,1} & w^{[2]}_{2,2} & w^{[2]}_{2,3} \\ w^{[2]}_{3,1} & w^{[2]}_{3,2} & w^{[2]}_{3,3} \end{bmatrix} \begin{bmatrix} a^{[1]}_1 \\ a^{[1]}_2 \\ a^{[1]}_3 \end{bmatrix} + \begin{bmatrix} b^{[2]}_1 \\ b^{[2]}_2 \\ b^{[2]}_3 \end{bmatrix}$$

$$\frac{dz^{[2]}}{da^{[1]}} = \begin{bmatrix} \frac{dz^{[2]}_1}{da^{[1]}_1} & \frac{dz^{[2]}_2}{da^{[1]}_1} & \frac{dz^{[2]}_3}{da^{[1]}_1} \\ \frac{dz^{[2]}_1}{da^{[1]}_2} & \frac{dz^{[2]}_2}{da^{[1]}_2} & \frac{dz^{[2]}_3}{da^{[1]}_2} \\ \frac{dz^{[2]}_1}{da^{[1]}_3} & \frac{dz^{[2]}_2}{da^{[1]}_3} & \frac{dz^{[2]}_3}{da^{[1]}_3} \end{bmatrix} = \begin{bmatrix} w^{[2]}_{1,1} & w^{[2]}_{2,1} & w^{[2]}_{3,1} \\ w^{[2]}_{1,2} & w^{[2]}_{2,2} & w^{[2]}_{3,2} \\ w^{[2]}_{1,3} & w^{[2]}_{2,3} & w^{[2]}_{3,3} \end{bmatrix} = (W^{[2]})^T$$

We then have

$$\frac{dL}{da^{[1]}} = \frac{dz^{[2]}}{da^{[1]}} \frac{dL}{dz^{[2]}} = (W^{[2]})^T \frac{dL}{dz^{[2]}}$$

Similarly as above, we can compute $\frac{dL}{dz^{[1]}}$ as well.

Now, we proceed to compute $\frac{dL}{dW^{[1]}}$.

$$z^{[1]} = W^{[1]} x + b^{[1]}$$

$$\begin{bmatrix} z^{[1]}_1 \\ z^{[1]}_2 \\ z^{[1]}_3 \end{bmatrix} = \begin{bmatrix} w^{[1]}_{1,1} & w^{[1]}_{1,2} \\ w^{[1]}_{2,1} & w^{[1]}_{2,2} \\ w^{[1]}_{3,1} & w^{[1]}_{3,2} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} b^{[1]}_1 \\ b^{[1]}_2 \\ b^{[1]}_3 \end{bmatrix}$$

Similarly as above, we have

$$\frac{dL}{dW^{[1]}} = \frac{dL}{dz^{[1]}} (a^{[0]})^T = \frac{dL}{dz^{[1]}} x^T$$

and

$$\frac{dL}{db^{[1]}} = \frac{dL}{dz^{[1]}}.$$

Now we have all the gradients, and we can do backpropagation.
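
Putting all of the formulas above together, a backward pass for this network might look like the following sketch (a single-sample implementation matching the `forward` function above; the gradient-descent update and the `learning_rate` value are illustrative assumptions):

def backward(x, y, z1, a1, z2, a2, z3, a3, W1, W2, W3):
    # Layer 3
    dz3 = a3 - y                   # dL/dz3
    dW3 = np.matmul(dz3, a2.T)     # dL/dW3
    db3 = dz3                      # dL/db3
    # Layer 2
    da2 = np.matmul(W3.T, dz3)     # dL/da2
    dz2 = da2 * a2 * (1 - a2)      # dL/dz2 (sigmoid derivative, element-wise)
    dW2 = np.matmul(dz2, a1.T)     # dL/dW2
    db2 = dz2                      # dL/db2
    # Layer 1
    da1 = np.matmul(W2.T, dz2)     # dL/da1
    dz1 = da1 * a1 * (1 - a1)      # dL/dz1
    dW1 = np.matmul(dz1, x.T)      # dL/dW1 = dL/dz1 x^T
    db1 = dz1                      # dL/db1
    return dW1, db1, dW2, db2, dW3, db3

# One gradient-descent step on a single sample
learning_rate = 0.1
z1, a1, z2, a2, z3, a3 = forward(x, W1, b1, W2, b2, W3, b3)
dW1, db1, dW2, db2, dW3, db3 = backward(x, y, z1, a1, z2, a2, z3, a3, W1, W2, W3)
W1 = W1 - learning_rate * dW1;  b1 = b1 - learning_rate * db1
W2 = W2 - learning_rate * dW2;  b2 = b2 - learning_rate * db2
W3 = W3 - learning_rate * dW3;  b3 = b3 - learning_rate * db3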

Decision Boundary
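
The content of this section is cut off in this excerpt. As a rough illustration only (the dataset array `X` of shape `(N, 2)`, its labels `y_all`, the grid range, and the use of `matplotlib` are all assumptions), the decision boundary of the trained network could be visualized like this:

import matplotlib.pyplot as plt

# Evaluate the network on a grid covering the 2-D input space
xx, yy = np.meshgrid(np.linspace(-1.5, 1.5, 100), np.linspace(-1.5, 1.5, 100))
points = np.stack([xx.ravel(), yy.ravel()], axis=1)          # shape (100*100, 2)
preds = np.array([forward(p.reshape(2, 1), W1, b1, W2, b2, W3, b3)[-1].item()
                  for p in points])

plt.contourf(xx, yy, (preds >= 0.5).reshape(xx.shape), alpha=0.3, cmap="RdYlGn")
plt.scatter(X[:, 0], X[:, 1], c=y_all, cmap="RdYlGn", edgecolors="k")
plt.title("Decision boundary of the trained network")
plt.show()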
