
Neural Networks: Forward and Backpropagation

Nikhil Sardana
October 2017

1 Introduction
We have covered Perceptrons, the fundamental unit of the Neural Network (see "Neural Networks: Introduction and Overview" for the previous lecture). Additionally, Multi-Layer Perceptrons, or Neural Networks, were introduced as a solution for approximating non-linearly separable data. This lecture continues exploring Neural Networks, but introduces vectorization for more efficient notation and computation. We also cover how neural networks learn, which is significantly more complex than the perceptron learning algorithm.

2 The Neuron
A single node of a neural network (a neuron) differs from a perceptron in one
way: the activation function. Consider this diagram of a neuron:

[Figure: diagram of a single neuron with inputs, weights, a bias, and activation σ.]

The symbol σ represents the Sigmoid activation function σ(x) = 1/(1 + e^{−x}).

[Figure: graphs of σ(x) = 1/(1 + e^{−x}), g(x) = 1/(1 + e^{−5x}), and h(x) = 1/(1 + e^{−100x}) for −5 ≤ x ≤ 5.]
Notice how as the coefficient of x approaches infinity, σ(x) approaches the
step function from before. We use σ(x) because it is differentiable, which is
necessary for networks to learn. Other activation functions include tanh(x) and
ReLU, but we will use Sigmoid for our examples.
The rest of a neuron is identical to a perceptron: multiply each input by its
weight, add them together with the bias, and compute the activation function
of the sum.
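
As a minimal sketch in Python with NumPy (the library the problems at the end assume), a neuron might look like the following; the input, weight, and bias values are arbitrary illustrative choices, not anything fixed by the lecture:

import numpy as np

def sigmoid(x):
    # Sigmoid activation: maps any real number into (0, 1)
    return 1 / (1 + np.exp(-x))

def neuron(inputs, weights, bias):
    # Identical to a perceptron except for the activation function:
    # weight each input, add the bias, then apply sigmoid to the sum.
    return sigmoid(np.dot(weights, inputs) + bias)

# Two inputs with arbitrary illustrative values
print(neuron(np.array([0.2, 0.3]), np.array([4.0, 5.0]), 1.0))  # ~0.964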

3 Forward Propagation
3.1 Non-Vectorized Forward Propagation
Forward Propagation is a fancy term for computing the output of a neural
network. We must compute all the values of the neurons in the second layer
before we begin the third, but we can compute the individual neurons in any
given layer in any order. Consider the following network:

[Figure: a network with input nodes n1 and n2, hidden nodes n3, n4, n5, and output node n6.]
We denote the value of node i as ni, and the bias of node i as bi. Computing
the network using these variables, we get:

n3 = σ(w13 n1 + w23 n2 + b3)
n4 = σ(w14 n1 + w24 n2 + b4)
n5 = σ(w15 n1 + w25 n2 + b5)
n6 = σ(w36 n3 + w46 n4 + w56 n5 + b6)
Continuing this example of forward propagation, let's assign some numbers
and compute the output of this network. Let n1 = 0.2 and n2 = 0.3. Let
w13 = 4, w14 = 5, w15 = 6, w23 = 5, w24 = 6, w25 = 7, w36 = 9, w46 = 10 and
w56 = 11, just so they are easy to remember. Let all the biases b3..6 = 1 (input
nodes do not have biases; the "input nodes" are simply values given to the
network). In practice, the weights and biases of a network are initialized randomly
between −1 and 1. Given these numbers, we compute:

n3 = σ(4 ∗ 0.2 + 5 ∗ 0.3 + 1) = σ(3.3) = 0.964
n4 = σ(5 ∗ 0.2 + 6 ∗ 0.3 + 1) = σ(3.8) = 0.978
n5 = σ(6 ∗ 0.2 + 7 ∗ 0.3 + 1) = σ(4.3) = 0.987
n6 = σ(9 ∗ 0.964 + 10 ∗ 0.978 + 11 ∗ 0.987 + 1) = σ(30.313) ≈ 1
This example actually illustrates one of the weak points of the Sigmoid
function: it quickly approaches 1 for large numbers. The reason for using the
Sigmoid function will be shown in the section on backpropagation.
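
For reference, here is a minimal non-vectorized sketch in Python that reproduces the computation above, one neuron at a time:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Values from the worked example above
n1, n2 = 0.2, 0.3
w13, w14, w15 = 4, 5, 6
w23, w24, w25 = 5, 6, 7
w36, w46, w56 = 9, 10, 11
b3 = b4 = b5 = b6 = 1

# The hidden layer must be finished before the output layer,
# but n3, n4, n5 themselves could be computed in any order.
n3 = sigmoid(w13 * n1 + w23 * n2 + b3)             # sigmoid(3.3)    ~ 0.964
n4 = sigmoid(w14 * n1 + w24 * n2 + b4)             # sigmoid(3.8)    ~ 0.978
n5 = sigmoid(w15 * n1 + w25 * n2 + b5)             # sigmoid(4.3)    ~ 0.987
n6 = sigmoid(w36 * n3 + w46 * n4 + w56 * n5 + b6)  # sigmoid(30.313) ~ 1
print(n3, n4, n5, n6)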

3.2 Vectorized Forward Propagation


Look again at these nodes of the network:

n3 = σ(w13 n1 + w23 n2 + b3)
n4 = σ(w14 n1 + w24 n2 + b4)
n5 = σ(w15 n1 + w25 n2 + b5)
We can rewrite this as
     !
n3 w13 w23   b3
n
n4  = σ w14 w24  1 + b4 
n2
n5 w15 w25 b5

Notice how the nodes in each layer of the network are in their own column
vector, in the order they appear. Let’s relabel this network by layers:

[Figure: the same network relabeled by layers: input layer x0, hidden layer x1, and output layer x2, connected by weight matrices w1 and w2.]

Here, x0 and x2 represent the input and output layers, and x1 is the middle
layer (called a hidden layer). Mathematically speaking, these are represented
as column vectors of dimension n × 1, where n is the number of nodes in the
layer. Thinking back to the non-vectorized network in section 3.1,
 
x0 = (n1, n2)^T    x1 = (n3, n4, n5)^T    x2 = (n6)

w1 and w2 are the weight matrices. Thinking back to the non-vectorized network
in section 3.1, w1 corresponds to

     [w13 w23]
     [w14 w24]
     [w15 w25]

and w2 refers to

     [w36 w46 w56]
Each layer (except the input) also has a bias vector, which has the same
dimension as the layer itself (each node has a bias). Again thinking back to the
non-vectorized network in section 3.1, we define b1 to be

     (b3, b4, b5)^T

and b2 to be

     (b6)
We can now re-write the forward propagation formula in a far more compact
form. In any n-layer network, for a given layer x_{i+1} (assuming 0 ≤ i < n − 1):

x_{i+1} = σ(w_{i+1} x_i + b_{i+1})
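
In code, this reduces to one line per layer. Below is a minimal vectorized sketch in Python with NumPy; the 2–3–1 shapes and the random initialization between −1 and 1 follow the example network above, but the function names are my own:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def forward(x0, weights, biases):
    # One line per layer: x_{i+1} = sigmoid(w_{i+1} x_i + b_{i+1})
    x = x0
    for w, b in zip(weights, biases):
        x = sigmoid(w @ x + b)
    return x

# A 2-3-1 network like the example above: w1 is 3x2, w2 is 1x3,
# with weights and biases initialized randomly between -1 and 1
rng = np.random.default_rng(0)
weights = [rng.uniform(-1, 1, (3, 2)), rng.uniform(-1, 1, (1, 3))]
biases = [rng.uniform(-1, 1, (3, 1)), rng.uniform(-1, 1, (1, 1))]
x0 = np.array([[0.2], [0.3]])  # 2x1 input column vector
print(forward(x0, weights, biases))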

4 Backpropagation
Backpropagation is how neural networks learn. It is essential to understand
not only the theory behind backpropagation, but also the mathematics behind
it. This is one of the few mathematically rigorous sections of our material. (Of
course, for anyone who has taken multivariable calculus, the material should
be relatively straightforward. Nevertheless, high school students not entirely
comfortable with the math will no doubt have some trouble.)

4.1 Learning
A neural network learns when it is given training data and labels. The data
(inputs) can be in the form of text, images, numbers, etc. The label is the ground
truth, the correct answer for the given input. Given enough data-label pairs,
a network can learn to generalize the relationship between the data and label.
After training, it is tested or validated on a set of data it has never seen before
(i.e. data not part of the training set). This validation accuracy shows just how
well a network has learned to generalize through training. Backpropagation is
the method of updating the weights and biases of the network to minimize the
error when training.

4.2 Error
Consider the following network:
[Figure: a 3–4–2 network with layers x0, x1, x2 and weight matrices W1, W2.]

For the input x0, let y represent the target vector, or the ground truth. We
define the error as

E = (1/2) ||x2 − y||^2
Essentially, this is the magnitude of the difference between the target and
the network’s output. In order for a network to become more accurate, we want
to minimize this error.
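
In NumPy, this error is a one-liner; a minimal sketch with made-up output and target vectors:

import numpy as np

def error(x2, y):
    # E = (1/2) ||x2 - y||^2
    return 0.5 * np.linalg.norm(x2 - y) ** 2

# Made-up 2x1 output and target column vectors
print(error(np.array([[0.9], [0.2]]), np.array([[1.0], [0.0]])))  # 0.025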
Let’s think of E as a function. Only x2 can vary, and we can only control
this by changing the weight matrices (and the bias). Thus, for a neuron with n
weights and a bias, the error can be graphed as an n + 2 dimensional function

(y = f(x) has 1 input, so it is graphed in two dimensions). For this network,
each of the weights (3 ∗ 4 + 4 ∗ 2 = 20) and the biases (6) determines the error,
so the error has many, many dimensions. If we get to the minimum of this
function, we have minimized the error and trained the network.

4.3 Gradient Descent


We can't visualize that many dimensions (at least, I can't), so let's pretend we
are working with a three-dimensional function. How do we get to the minimum?
We use gradient descent, of course!

[Figure: a multi-dimensional function. Look at that minimum!]

Gradient descent is simple: starting at some point, we move in the direction
of steepest decline for a certain length. Then, at our new point, we again
compute the direction of steepest decline, and move in that direction for a
certain length. We repeat this process over and over until every single direction
is an incline, at which point we are at the minimum.

This has three issues. First, how do we know how long our steps are? Take a
step too long, and we could overshoot the minimum. Take a step too short, and
it will take us many steps to reach the minimum. The step length is actually
just a constant set by the programmer, and normally ranges from 0.1 to 0.0001.
Adjusting this constant to get the best results is an important practical topic,
and we will discuss it in Part 3 of the lecture. For now, just know it's a constant.
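
Here is a minimal sketch of the loop just described, in Python, minimizing the one-dimensional parabola f(x) = x² (whose derivative, 2x, is worked out a few paragraphs below); the step length and starting point are arbitrary choices:

def gradient_descent(grad, x, step=0.1, steps=100):
    # Repeatedly move against the gradient (the direction of steepest ascent)
    for _ in range(steps):
        x = x - step * grad(x)
    return x

# Minimize f(x) = x^2, whose derivative f'(x) = 2x is derived below
print(gradient_descent(lambda x: 2 * x, x=5.0))  # close to the minimum at 0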
Secondly, doesn’t gradient descent just get us to a minimum? What if there
are multiple minima, and we just happen to land in a local minimum, like the
many in the function below?

[Figure: a function with many local minima.]
Getting out of local minima to reach the global minimum is another impor-
tant machine learning topic. Different optimizers can help the network pop out
of local minima using momentum, but this topic is complex and modern, so it
is covered in depth in Part 3 of this lecture. For the purposes of explaining
gradient descent, we’ll just pretend we’re working with an error function with
one minimum.
The third and final issue is: how do we know which direction is the steepest?
We can’t just sample each direction, as there are infinite possibilities. Instead,
we mathematically compute the best direction. Let’s consider a simple two-
dimensional parabola:

From elementary calculus, we know that:

f(x) = x²
f′(x) = 2x
The derivative gives us the instantaneous rate of change for any x. If we
have a function in terms of x and y, we can take the derivative of f(x, y) with
respect to x to find the rate of change in the x direction, and the derivative with
respect to y to find the rate of change in the y direction. These are called partial
derivatives. When taking a partial derivative with respect to one variable, we
treat the other variables as constants.

7
Let's do an example. Given f(x, y) = 2x² + 3xy + y³, the partial derivatives
are:

∂f/∂x = 4x + 3y
∂f/∂y = 3x + 3y²

The gradient of f(x, y), or ∇f(x, y), is just the vector:

(∂f/∂x, ∂f/∂y)

For our example, the gradient is:

(4x + 3y, 3x + 3y²)
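
As a quick sanity check, here is a minimal sketch that compares these partial derivatives against finite-difference approximations at an arbitrarily chosen point:

def f(x, y):
    return 2 * x**2 + 3 * x * y + y**3

def gradient(x, y):
    # (df/dx, df/dy) from the formulas above
    return (4 * x + 3 * y, 3 * x + 3 * y**2)

# Compare against central finite differences at an arbitrary point
x, y, h = 1.0, 2.0, 1e-6
approx = ((f(x + h, y) - f(x - h, y)) / (2 * h),
          (f(x, y + h) - f(x, y - h)) / (2 * h))
print(gradient(x, y))  # (10.0, 15.0)
print(approx)          # approximately (10.0, 15.0)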

This is the direction of steepest ascent. How do we know that? First, let's
consider the directional derivative. ∇_u f(x0, y0) is the rate of change of f(x, y)
at the point (x0, y0) in the direction of the unit vector u. It is also defined in
terms of the gradient as:

∇_u f(x0, y0) = ∇f(x0, y0) · u
We know from our standard dot product rule:

a · b = ||a|| ||b|| cos(θ)

And cos(θ) is maximized at θ = 0. Thus, when two vectors are in the same
direction, their dot product is maximized. From this information, the maximum
of the directional derivative must be when ∇f(x, y) and u are in the same
direction. This means that the direction of steepest ascent (maximum rate of
change) is the direction of the gradient.
Great! Now our third issue has been solved. In order to find the minimum
of a multi-dimensional function, we just need to compute the gradient, move in
that direction for a certain length, and repeat until the gradient is 0. The only
problem is: how do we compute the gradient? Our function is

E(W, b) = (1/2) ||o − t||^2

where o is the network output and t is the target. Since the error is in terms
of the weights and biases, we need to compute:

(∂E/∂W1, ∂E/∂W2, ..., ∂E/∂bn)
This is why backpropagation is a fundamental concept in machine learning.
It allows us to compute this gradient in a computationally efficient manner.

4.4 Non-Vectorized Backpropagation


Don’t do it. Seriously.

4.5 Vectorized Backpropagation
Consider the network from Section 4.2 again.

[Figure: the 3–4–2 network from Section 4.2, with layers x0, x1, x2 and weight matrices W1, W2.]

Ignoring biases (which we will see follow a relatively simple rule), we know
from forward propagation that:

x1 = σ(W1 x0)
x2 = σ(W2 x1)

And the error is, assuming some 2 × 1 target vector y:

E = (1/2) ||x2 − y||^2
Let's first take the partial derivative of E with respect to W2. This is just
like taking a normal derivative (using the chain rule).

∂E/∂W2 = (x2 − y) ∂(σ(W2 x1))/∂W2

∂E/∂W2 = [(x2 − y) ⊙ σ′(W2 x1)] ∂(W2 x1)/∂W2

Here, ⊙ is the Hadamard product, or element-wise multiplication (remember,
these are all vectors). For the sake of simplification, let's define

δ2 = (x2 − y) ⊙ σ′(W2 x1)

Then, we can rewrite the partial as

∂E/∂W2 = δ2 ∂(W2 x1)/∂W2 = δ2 x1^T

Note that x1^T means that the x1 vector has been transposed (i.e. it is a row
vector). This is essential for the dimensions to work out, which we can check
now.

Since the whole point is to update the weights by some factor every time we
backpropagate in the direction of fastest descent to minimize the error, we want
to subtract the partial matrix (since it is in the direction of fastest ascent):

Wi = Wi − α ∂E/∂Wi

where α is the learning rate. This requires ∂E/∂Wi to have the same dimensions
as Wi. Using W2 as an example, we know that

x2 = σ(W2 x1)

where x2 is a 2 × 1 vector and x1 is a 4 × 1 vector, so W2 is a 2 × 4 matrix. Thus, both
∂E/∂W2 and δ2 x1^T are also 2 × 4 matrices. Since δ2 = (x2 − y) ⊙ σ′(W2 x1),
and we know y is a 2 × 1 vector, δ2 has dimensions 2 × 1. If δ2 is 2 × 1, then it
must be multiplied by a 1 × 4 vector to create a 2 × 4 matrix. Since x1 is 4 × 1,
it must be transposed to become 1 × 4.
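
This dimension bookkeeping can also be checked mechanically. A minimal sketch, with randomly chosen values standing in for the 3–4–2 network (the identity σ′(x) = σ(x)(1 − σ(x)), stated at the end of this section, is used to compute σ′):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)
W2 = rng.uniform(-1, 1, (2, 4))  # maps the 4x1 hidden layer to the 2x1 output
x1 = rng.uniform(0, 1, (4, 1))   # stand-in hidden layer values
y = np.array([[1.0], [0.0]])     # 2x1 target

x2 = sigmoid(W2 @ x1)
delta2 = (x2 - y) * (x2 * (1 - x2))  # elementwise * is the Hadamard product
grad_W2 = delta2 @ x1.T              # (2x1)(1x4) -> 2x4
assert grad_W2.shape == W2.shape     # same shape as W2, as required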
Let's continue to the next weight matrix.

∂E/∂W1 = (x2 − y) ∂(σ(W2 x1))/∂W1

∂E/∂W1 = [(x2 − y) ⊙ σ′(W2 x1)] ∂(W2 x1)/∂W1

∂E/∂W1 = δ2 ∂(W2 x1)/∂W1 = W2^T δ2 ∂x1/∂W1

Substituting in for x1, we get:

∂E/∂W1 = W2^T δ2 ∂(σ(W1 x0))/∂W1

∂E/∂W1 = [W2^T δ2 ⊙ σ′(W1 x0)] ∂(W1 x0)/∂W1

Again, we simplify this:

δ1 = W2^T δ2 ⊙ σ′(W1 x0)

and we finish with

∂E/∂W1 = δ1 ∂(W1 x0)/∂W1 = δ1 x0^T

We can generalize this for any layer. The only difference is the delta for the
last layer:

δL = (xL − y) ⊙ σ′(WL x_{L−1})

The delta for every other layer is:

δi = W_{i+1}^T δ_{i+1} ⊙ σ′(Wi x_{i−1})

And the gradient for every weight matrix is calculated, and the weight
matrices are updated, as follows:

∂E/∂Wi = δi x_{i−1}^T

Wi = Wi − α ∂E/∂Wi

For biases, the rule is simpler:

bi = bi − α δi
That is the essence of backpropagation. Note that these formulas work for
any activation function. The reason Sigmoid is used for teaching is that its
derivative is fairly straightforward:

σ′(x) = σ(x)(1 − σ(x))
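
Putting it all together, here is a minimal vectorized sketch of one backpropagation step in Python with NumPy, following the delta and update rules above; the 2–3–1 layer sizes, helper names, and training loop are illustrative choices, not fixed by the lecture:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_prime(x):
    # sigma'(x) = sigma(x)(1 - sigma(x))
    s = sigmoid(x)
    return s * (1 - s)

def train_step(x0, y, weights, biases, alpha=0.1):
    # Forward pass, keeping each pre-activation z_i = W_i x_{i-1} + b_i
    xs, zs = [x0], []
    for W, b in zip(weights, biases):
        zs.append(W @ xs[-1] + b)
        xs.append(sigmoid(zs[-1]))

    # Backward pass: delta_L = (x_L - y) (Hadamard) sigma'(z_L)
    delta = (xs[-1] - y) * sigmoid_prime(zs[-1])
    for i in reversed(range(len(weights))):
        prev_delta = None
        if i > 0:
            # delta_i = W_{i+1}^T delta_{i+1} (Hadamard) sigma'(z_i),
            # computed with the old weights before they are updated
            prev_delta = (weights[i].T @ delta) * sigmoid_prime(zs[i - 1])
        weights[i] -= alpha * (delta @ xs[i].T)  # dE/dW_i = delta_i x_{i-1}^T
        biases[i] -= alpha * delta               # bias rule: b_i -= alpha delta_i
        delta = prev_delta
    return 0.5 * np.linalg.norm(xs[-1] - y) ** 2  # error before this update

# Train a 2-3-1 network on a single made-up data-label pair
rng = np.random.default_rng(0)
weights = [rng.uniform(-1, 1, (3, 2)), rng.uniform(-1, 1, (1, 3))]
biases = [rng.uniform(-1, 1, (3, 1)), rng.uniform(-1, 1, (1, 1))]
x0, y = np.array([[0.2], [0.3]]), np.array([[0.7]])
for _ in range(1000):
    err = train_step(x0, y, weights, biases)
print(err)  # shrinks toward 0 as the network fits the pair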

5 Problems
1. Given the following network:

[Figure: a 3–4–2 network with layers x0, x1, x2 and weight matrices W1, W2.]

the weight matrices, bias vectors and input are as follows:

     [2 3 4]
W1 = [2 1 2]   W2 = [3 1 1 1]
     [3 5 1]        [1 4 2 2]
     [2 3 4]

x0 = (2, 1, 3)^T   b1 = (4, 1, 1, 2)^T   b2 = (2, 3)^T

Instead of using the Sigmoid activation function, use a linear function
y = x, which always has a derivative of 1. Compute the output of one
forward pass, then compute a backward pass using the following target
and learning rate:

t = (4, 5)^T   α = 0.1

2. Write out the forward propagation algorithm in Python. Use the NumPy
library for matrices.
3. Write out the backpropagation algorithm in Python. Use the NumPy
library for matrices.
4. Write an entire Neural Network in Python, using the NumPy library.

