NN 2
Nikhil Sardana
October 2017
1 Introduction
We have covered Perceptrons, the fundamental unit of the Neural Network (see "Neural Networks: Introduction and Overview" for the previous lecture). Additionally, Multi-Layer Perceptrons, or Neural Networks, were introduced as a solution for approximating non-linearly separable data. This lecture continues exploring Neural Networks, but introduces vectorization for more efficient notation and computation. We also cover how neural networks learn, which is significantly more complex than the perceptron learning algorithm.
2 The Neuron
A single node of a neural network (a neuron) differs from a perceptron in one
way: the activation function. Consider this diagram of a neuron:
The symbol σ represents the Sigmoid activation function σ(x) = 1/(1 + e^(−x)).

[Figure: plots of σ(x) = 1/(1 + e^(−x)), g(x) = 1/(1 + e^(−5x)), and h(x) = 1/(1 + e^(−100x)) for −5 ≤ x ≤ 5.]
Notice how as the coefficient of x approaches infinity, σ(x) approaches the step function from before. We use σ(x) because it is differentiable, which is necessary for networks to learn. Other activation functions include tanh(x) and ReLU, but we will use Sigmoid for our examples.
The rest of a neuron is identical to a perceptron: multiply each input by its weight, add them up along with the bias, and compute the activation function of the sum.
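As a quick sketch of this computation (the variable names and example numbers here are ours, not from the lecture), a single sigmoid neuron in NumPy:

```python
import numpy as np

def sigmoid(x):
    # Sigmoid activation: squashes any real number into (0, 1)
    return 1 / (1 + np.exp(-x))

def neuron(inputs, weights, bias):
    # Multiply each input by its weight, add the bias,
    # then apply the activation function to the sum
    return sigmoid(np.dot(weights, inputs) + bias)

out = neuron(np.array([0.2, 0.3]), np.array([4.0, 5.0]), 1.0)
```

Replacing sigmoid with a step function here would recover the perceptron from the previous lecture.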
3 Forward Propagation
3.1 Non-Vectorized Forward Propagation
Forward Propagation is a fancy term for computing the output of a neural
network. We must compute all the values of the neurons in the second layer
before we begin the third, but we can compute the individual neurons in any
given layer in any order. Consider the following network:
We denote the value of node i as ni , and the bias of node i as bi . Computing
the network using these variables, we get:
n3 = σ(w13 n1 + w23 n2 + b3 )
n4 = σ(w14 n1 + w24 n2 + b4 )
n5 = σ(w15 n1 + w25 n2 + b5 )
n6 = σ(w36 n3 + w46 n4 + w56 n5 + b6 )
Continuing this example of forward propagation, let's assign some numbers and compute the output of this network. Let n1 = 0.2 and n2 = 0.3. Let w13 = 4, w14 = 5, w15 = 6, w23 = 5, w24 = 6, w25 = 7, w36 = 9, w46 = 10 and w56 = 11, just so they are easy to remember. Let all the biases b3, ..., b6 = 1 (input nodes do not have biases; the "input nodes" are simply values given to the network). In practice, the weights and biases of a network are initialized randomly between −1 and 1. Given these numbers, we compute:
n3 = σ(w13 n1 + w23 n2 + b3 )
n4 = σ(w14 n1 + w24 n2 + b4 )
n5 = σ(w15 n1 + w25 n2 + b5 )
We can rewrite this as

[n3]     ( [w13 w23]        [b3] )
[n4] = σ ( [w14 w24] [n1] + [b4] )
[n5]     ( [w15 w25] [n2]   [b5] )
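The matrix form can be checked against the numbers above (a sketch; the variable names are ours). The whole hidden layer comes out of a single matrix-vector product, and the output node follows the same pattern:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x0 = np.array([0.2, 0.3])            # n1, n2
W1 = np.array([[4.0, 5.0],           # w13, w23
               [5.0, 6.0],           # w14, w24
               [6.0, 7.0]])          # w15, w25
b1 = np.array([1.0, 1.0, 1.0])       # b3, b4, b5

x1 = sigmoid(W1 @ x0 + b1)           # n3, n4, n5 in one step

W2 = np.array([[9.0, 10.0, 11.0]])   # w36, w46, w56
b2 = np.array([1.0])                 # b6
x2 = sigmoid(W2 @ x1 + b2)           # n6, the network's output
```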
Notice how the nodes in each layer of the network are in their own column
vector, in the order they appear. Let’s relabel this network by layers:
[Figure: the network relabeled by layers: input layer x0, hidden layer x1, and output layer x2, connected by weight matrices w1 and w2.]
Here, x0 and x2 represent the input and output layers, and x1 is the middle
layer (called a hidden layer). Mathematically speaking, these are represented
as column vectors of dimension n × 1, where n is the number of nodes in the
layer. Thinking back to the non-vectorized network in section 3.1,
x0 = [n1]    x1 = [n3]    x2 = [n6]
     [n2]         [n4]         [n7]
                  [n5]
w1 and w2 are the weight matrices. Thinking back to the non-vectorized network
in section 3.1, w1 corresponds to
[w13 w23]
[w14 w24]
[w15 w25]

and w2 refers to

[w36 w46 w56]
[w37 w47 w57]
Each layer (except the input) also has a bias vector, which has the same di-
mension as the layer itself (each node has a bias). Again thinking back to the
non-vectorized network in section 3.1, we define b1 to be
[b3]
[b4]
[b5]

and b2 to be

[b6]
[b7]
We can now re-write the forward propagation formula in a far more compact form. In any n layer network, for a given layer xi+1 (assuming 0 ≤ i < n − 1):

xi+1 = σ(wi+1 xi + bi+1)
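This compact form makes forward propagation a short loop over the layers (a sketch, assuming the weight matrices and bias vectors are stored in Python lists of NumPy arrays):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def forward(x0, weights, biases):
    # weights[i] maps layer i to layer i+1;
    # biases[i] belongs to layer i+1
    x = x0
    for W, b in zip(weights, biases):
        x = sigmoid(W @ x + b)
    return x
```

For example, the 2-3-2 network above would use a 3×2 and a 2×3 weight matrix, with a length-3 and a length-2 bias vector.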
4 Backpropagation
Backpropagation is how neural networks learn. It is essential to understand not only the theory behind backpropagation, but also the mathematics behind it. This is one of the few mathematically rigorous sections of our material. (Of course, for anyone who has taken Multivariable calculus, the material should be relatively straightforward. Nevertheless, high school students not entirely comfortable with the math will no doubt have some trouble.)
4.1 Learning
A neural network learns when it is given training data and labels. The data
(inputs) can be in the form of text, images, numbers, etc. The label is the ground
truth, the correct answer for the given input. Given enough data-label pairs,
a network can learn to generalize the relationship between the data and label.
After training, it is tested or validated on a set of data it has never seen before
(i.e. data not part of the training set). This validation accuracy shows just how
well a network has learned to generalize through training. Backpropagation is
the method of updating the weights and biases of the network to minimize the
error when training.
4.2 Error
Consider the following network:
[Figure: a three-layer network: layers x0, x1, and x2, connected by weight matrices W1 and W2.]
For the input x0 , let y represent the target vector, or the ground truth. We
define the error as
E = (1/2)||x2 − y||²
Essentially, this is the magnitude of the difference between the target and
the network’s output. In order for a network to become more accurate, we want
to minimize this error.
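In code, this error is half the squared Euclidean norm of the difference between the output and the target (a sketch; the example vectors are ours):

```python
import numpy as np

def error(x2, y):
    # E = 1/2 * ||x2 - y||^2
    return 0.5 * np.sum((x2 - y) ** 2)

E = error(np.array([0.9, 0.2]), np.array([1.0, 0.0]))  # 0.5 * (0.01 + 0.04)
```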
Let’s think of E as a function. Only x2 can vary, and we can only control
this by changing the weight matrices (and the bias). Thus, for a neuron with n
weights and a bias, the error can be graphed as an n + 2 dimensional function
(y = f(x) has 1 input, so it is graphed in two dimensions). For this network, each of the weights (3 · 4 + 4 · 2 = 20) and the biases (6) determines the error, so the error has many, many dimensions. If we get to the minimum of this function, we have minimized the error and trained the network.
Getting out of local minima to reach the global minimum is another impor-
tant machine learning topic. Different optimizers can help the network pop out
of local minima using momentum, but this topic is complex and modern, so it
is covered in depth in Part 3 of this lecture. For the purposes of explaining
gradient descent, we’ll just pretend we’re working with an error function with
one minimum.
The third and final issue is: how do we know which direction is the steepest?
We can’t just sample each direction, as there are infinite possibilities. Instead,
we mathematically compute the best direction. Let’s consider a simple two-
dimensional parabola:
f(x) = x²
f′(x) = 2x
The derivative gives us the instantaneous rate of change for any x. If we
have a function in terms of x and y, we can take the derivative of f (x, y) with
respect to x to find the rate of change in the x direction, and the derivative with
respect to y to find the rate of change in the y direction. These are called partial derivatives: when taking one, we treat the other variables as constants.
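Gradient descent on the parabola f(x) = x² above is only a few lines: start somewhere and repeatedly step against the derivative (a sketch; the starting point and step size are our choices):

```python
def f_prime(x):
    # Derivative of f(x) = x^2
    return 2 * x

x = 5.0                    # arbitrary starting point
for _ in range(100):
    x -= 0.1 * f_prime(x)  # step opposite the slope
# x is now extremely close to the minimum at x = 0
```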
Let's do an example. Given f(x, y) = 2x² + 3xy + y³, the partial derivatives are:

∂f/∂x = 4x + 3y

∂f/∂y = 3x + 3y²
The gradient of f(x, y), or ∇f(x, y), is just the vector:

(∂f/∂x, ∂f/∂y)

For our example, the gradient is:

(4x + 3y, 3x + 3y²)
This is the direction of steepest ascent. How do we know that? First, let's consider the directional derivative. ∇u f(x0, y0) is the rate of change of f(x, y) at the point (x0, y0) in the direction of the unit vector u. It is also defined in terms of the gradient as:

∇u f(x0, y0) = ∇f(x0, y0) · u
We know from our standard dot product rule:

a · b = ||a|| ||b|| cos(θ)
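The partial derivatives from the example above can also be sanity-checked numerically with central finite differences (a sketch; the test point (1, 2) is arbitrary):

```python
def f(x, y):
    return 2 * x**2 + 3 * x * y + y**3

def grad(x, y):
    # Analytic gradient: (df/dx, df/dy)
    return (4 * x + 3 * y, 3 * x + 3 * y**2)

# Central-difference approximations at (1, 2)
h = 1e-6
num_dx = (f(1 + h, 2) - f(1 - h, 2)) / (2 * h)
num_dy = (f(1, 2 + h) - f(1, 2 - h)) / (2 * h)
# num_dx and num_dy should closely match grad(1, 2)
```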
4.5 Vectorized Backpropagation
Consider the network from Section 4.2 again.
[Figure: the three-layer network from Section 4.2: layers x0, x1, and x2, connected by weight matrices W1 and W2.]
Ignoring biases (which we will see follow a relatively simple rule), we know
from forward propagation that:
x1 = σ(W1 x0 )
x2 = σ(W2 x1 )
And the error is, assuming some 2 × 1 target vector y:
E = (1/2)||x2 − y||²
Let’s first take the partial derivative of E with respect to W2 . This is just
like taking a normal derivative (using the chain rule).
∂E/∂W2 = (x2 − y) ∂(σ(W2 x1))/∂W2

∂E/∂W2 = [(x2 − y) ⊙ σ′(W2 x1)] ∂(W2 x1)/∂W2
Here, ⊙ is the Hadamard product, or element-wise multiplication (remember, these are all vectors). For the sake of simplification, let's define

δ2 = (x2 − y) ⊙ σ′(W2 x1)
Then the derivative reduces to

∂E/∂W2 = δ2 x1ᵀ

Note that x1ᵀ means that the x1 vector has been transposed (i.e. it is a row vector). This is essential for the dimensions to work out, which we can check now.
Since the whole point is to update the weights every time we backpropagate, stepping in the direction of fastest descent to minimize the error, we subtract the gradient (since it points in the direction of fastest ascent):

Wi = Wi − α ∂E/∂Wi

where α is the learning rate. This requires ∂E/∂Wi to have the same dimensions as Wi. Using W2 as an example, we know that
x2 = σ(W2 x1 )
δ1 = (W2ᵀ δ2) ⊙ σ′(W1 x0)
The gradient for every weight matrix is calculated, and the weight matrices are updated, as follows:

∂E/∂Wi = δi (xi−1)ᵀ

Wi = Wi − α ∂E/∂Wi
For biases, the rule is simpler:
bi = bi − αδi
That is the essence of backpropagation. Note that these formulas work for any activation function. The reason Sigmoid is used for teaching is that its derivative is fairly straightforward:

σ′(x) = σ(x)(1 − σ(x))
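Putting the forward pass, the deltas, and the update rules together, one training loop for a small network might look like this (a sketch; the layer sizes, learning rate, input, and target are our own choices):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_prime(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1 - s)

rng = np.random.default_rng(0)
W1 = rng.uniform(-1, 1, (4, 3))       # 3 inputs -> 4 hidden
W2 = rng.uniform(-1, 1, (2, 4))       # 4 hidden -> 2 outputs
b1 = rng.uniform(-1, 1, (4, 1))
b2 = rng.uniform(-1, 1, (2, 1))

x0 = np.array([[0.5], [0.1], [0.9]])  # input column vector
y = np.array([[1.0], [0.0]])          # target vector
alpha = 0.5                           # learning rate

for _ in range(200):
    # Forward pass, keeping pre-activations for the backward pass
    z1 = W1 @ x0 + b1
    x1 = sigmoid(z1)
    z2 = W2 @ x1 + b2
    x2 = sigmoid(z2)
    # Backward pass: delta_2, then delta_1 (element-wise products)
    d2 = (x2 - y) * sigmoid_prime(z2)
    d1 = (W2.T @ d2) * sigmoid_prime(z1)
    # Updates: dE/dWi = delta_i @ x_{i-1}^T, and bi -= alpha * delta_i
    W2 -= alpha * (d2 @ x1.T)
    b2 -= alpha * d2
    W1 -= alpha * (d1 @ x0.T)
    b1 -= alpha * d1

E = 0.5 * np.sum((x2 - y) ** 2)       # error after training
```

With each pass the error shrinks as x2 is pulled toward the target y.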
5 Problems
1. Given the following network:
[Figure: a network with input layer x0, hidden layer x1, and output layer x2, connected by weight matrices W1 and W2; the weights, biases, and input values are given in the figure.]
Instead of using the Sigmoid activation function, use a linear function
y = x, which always has a derivative of 1. Compute the output of one
forward pass, then compute a backward pass using the following target
and learning rate:
t = [4]    α = 0.1
    [5]
2. Write out the forward propagation algorithm in Python. Use the NumPy library for matrices.
3. Write out the backpropagation algorithm in Python. Use the NumPy library for matrices.
4. Write an entire Neural Network in Python, using the NumPy library.