UNIT 3 - Backpropagation Algorithm
Unit 3
Artificial Neural Networks
By
Dr. G. Sunitha
Professor & BoS Chairperson
Department of CSE
1
Neural Network Learning with Backpropagation Algorithm
2
Classification by Backpropagation
❖ In 1969, Bryson and Ho devised a method for learning in multi-layer networks, which later became known as Backpropagation.
❖ The Backpropagation algorithm is a sensible approach for apportioning the contribution of each weight to the network's error.
❖ It works in basically the same way as Perceptron learning.
❖ Backpropagation learns by iteratively processing a set of training data (samples).
❖ For each sample, the weights are modified so as to minimize the error between the network's classification and the actual
classification.
11
Steps in Backpropagation Algorithm
❖ STEP ONE: Initialize the weights and biases.
✓ The weights and biases in the network are initialized to random numbers from the interval [-1, 1].
❖ STEP TWO: Feed the training sample to the input layer.
❖ STEP THREE: Propagate the inputs forward from the input layer, through the hidden layers, to the output layer; the net input
and output of each unit in the hidden and output layers are computed.
❖ STEP FOUR: Compute the errors, starting from the output layer and working backwards into the hidden layers, i.e. backpropagate
the error.
❖ STEP FIVE: Update the weights and biases to reflect the propagated errors.
12
Propagate the Inputs Forward
❖ For unit j in the input layer, its output is equal to its input, that is, Oj = Ij.
❖ The net input to each unit in the hidden and output layers is computed as follows.
Given a unit j in a hidden or output layer, the net input is
Ij = Σi wij Oi + bj
where wij is the weight of the connection from unit i in the previous layer to unit j; Oi is the output of unit i
from the previous layer; bj is the bias of the unit.
• Each unit in the hidden and output layers takes its net input and then applies an activation function.
Given a net input Ij to unit j, then Oj = f(Ij), the output of unit j, is computed as
Oj = 1 / (1 + e^(-Ij))
13
Backpropagate the Error
❖ For a unit j in the output layer, the error is
Errj = Oj (1 - Oj) (Tj - Oj)
where Tj is the true (target) output of unit j.
❖ For a unit j in a hidden layer, the error is
Errj = Oj (1 - Oj) Σk Errk wjk
where wjk is the weight of the connection from unit j to unit k in the next higher layer, and
Errk is the error of unit k.
14
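A minimal NumPy sketch of the forward pass and the error terms defined above, assuming a network with a single hidden layer; the function and variable names (forward, errors, W_ih, W_ho) are illustrative choices and not part of the original slides.

import numpy as np

def sigmoid(I):
    # Activation function: O = f(I) = 1 / (1 + e^(-I))
    return 1.0 / (1.0 + np.exp(-I))

def forward(x, W_ih, b_h, W_ho, b_o):
    # Propagate the inputs forward.
    #   x    : outputs of the input layer (O_i = I_i)
    #   W_ih : weights from input units to hidden units, shape (n_in, n_hid)
    #   W_ho : weights from hidden units to output units, shape (n_hid, n_out)
    I_h = x @ W_ih + b_h          # net input of hidden units: I_j = sum_i w_ij O_i + b_j
    O_h = sigmoid(I_h)
    I_o = O_h @ W_ho + b_o        # net input of output units
    O_o = sigmoid(I_o)
    return O_h, O_o

def errors(O_h, O_o, target, W_ho):
    # Backpropagate the error.
    #   output unit j: Err_j = O_j (1 - O_j) (T_j - O_j)
    #   hidden unit j: Err_j = O_j (1 - O_j) * sum_k Err_k w_jk
    err_o = O_o * (1.0 - O_o) * (target - O_o)
    err_h = O_h * (1.0 - O_h) * (err_o @ W_ho.T)
    return err_h, err_o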
Update Weights and Biases
❖ Weights and biases are updated by the following equations, where l is a constant between 0.0 and 1.0 reflecting the
learning rate; this learning rate is fixed for the implementation.
Δwij = (l) Errj Oi ;   wij = wij + Δwij
Δbj = (l) Errj ;   bj = bj + Δbj
❖ We are updating the weights and biases after the presentation of each sample. This is called case updating.
❖ Epoch - one iteration through the training set is called an epoch.
❖ Epoch updating - alternatively, the weight and bias increments could be accumulated in variables, and the
weights and biases updated only after all of the samples in the training set have been presented.
❖ Case updating generally yields more accurate results.
15
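Continuing the sketch above, a hedged illustration of case updating: the update equations are applied immediately after each sample; l = 0.2 is chosen only as an example learning rate.

import numpy as np

def update_case(x, O_h, err_h, err_o, W_ih, b_h, W_ho, b_o, l=0.2):
    # Delta w_ij = (l) Err_j O_i ;  w_ij = w_ij + Delta w_ij
    # Delta b_j  = (l) Err_j     ;  b_j  = b_j  + Delta b_j
    W_ih += l * np.outer(x, err_h)      # input -> hidden weights
    b_h  += l * err_h                   # hidden biases
    W_ho += l * np.outer(O_h, err_o)    # hidden -> output weights
    b_o  += l * err_o                   # output biases
    return W_ih, b_h, W_ho, b_o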
Terminating Conditions
Training stops when one of the following criteria is met:
❖ All Δwij in the previous epoch were so small as to be below some specified threshold, or
❖ The percentage of samples misclassified in the previous epoch is below some threshold, or
❖ A prespecified number of epochs has expired.
In practice, several hundred thousand epochs may be required before the weights converge.
16
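A small sketch of how the terminating conditions might be checked at the end of each epoch; the threshold values are illustrative assumptions.

def should_stop(epoch, misclassified_fraction, weight_deltas,
                delta_threshold=1e-6, err_threshold=0.05, max_epochs=100_000):
    # Stop when any one of the terminating conditions is met.
    if all(abs(d) < delta_threshold for d in weight_deltas):
        return True      # all weight changes in the previous epoch were negligible
    if misclassified_fraction < err_threshold:
        return True      # misclassified percentage fell below the threshold
    if epoch + 1 >= max_epochs:
        return True      # prespecified number of epochs has expired
    return False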
Features of ANNs
❖ ANNs perform well, generally better with a larger number of hidden units.
❖ More hidden units generally produce lower error.
❖ Determining the network topology is difficult:
• Connections between nodes
• Number of units
• Training level
❖ Choosing a single learning rate is impossible.
❖ It is difficult to reduce training time by altering the network topology or learning parameters.
❖ ANNs give highly accurate predictive models for a large number of different types of problems.
❖ Ease of use and deployment is poor.
❖ Learning capability - the model is built one record at a time.
18
Features of ANNs . . .
❖ Weaknesses
• Long training time
• Require a number of parameters typically best determined empirically, e.g., the network topology or "structure"
• Poor interpretability: difficult to interpret the symbolic meaning behind the learned weights and the "hidden units" in the network
❖ Strengths
• High tolerance to noisy data
• Ability to classify untrained patterns
• Well-suited for continuous-valued inputs and outputs
• Successful on a wide array of real-world data
• Algorithms are inherently parallel
• Techniques have recently been developed for the extraction of rules from trained neural networks
19
Hypothesis Space and Inductive Bias
❖ Every possible combination of network weights is a potential candidate.
❖ All potential candidates form the Hypothesis Space.
❖ The hypothesis space can be defined as an N-dimensional Euclidean space, where N is the number of network weights.
❖ The hypothesis space of a neural network is a continuous space.
❖ Error E of a network is differentiable with respect to the continuous parameters of the hypothesis space.
❖ The above two factors result in a well-defined error gradient, which leads to efficient search strategies.
❖ Inductive Bias – can be defined as the set of assumptions (implicit or explicit) made by learning
algorithms in order to perform induction (or generalization).
❖ Inductive Bias of NN – “Smooth Interpolation between the data points”.
20
Representational Power of Feedforward NNs
Representational power specifies which set of functions (problems) a NN can represent.
❖ Boolean Functions –
• Every Boolean function can be represented by a NN with 2 layers (see the XOR sketch below).
• The maximum number of hidden neurons required = the number of samples in the training data.
• The NN may be designed with a smaller number of hidden neurons.
❖ Continuous Functions –
• Every bounded continuous function can be approximated with arbitrarily small error E by a NN with
2 layers.
• The hidden layer uses the sigmoid function.
• The output layer uses an unthresholded linear function.
❖ Arbitrary Functions –
• Any arbitrary function can be approximated by a NN with 3 layers.
• The hidden layers use the sigmoid function.
• The output layer uses an unthresholded linear function.
21
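As an illustration of the Boolean-function result, the sketch below hard-codes a 2-layer network of sigmoid units (one hidden layer of 2 neurons plus the output layer) that represents XOR; the specific weight values are an illustrative choice, and backpropagation could learn an equivalent set.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Columns of W_ih correspond to the two hidden units:
#   hidden unit 1 ~ OR(x1, x2), hidden unit 2 ~ AND(x1, x2)
W_ih = 10.0 * np.array([[1.0, 1.0],
                        [1.0, 1.0]])
b_h  = 10.0 * np.array([-0.5, -1.5])
W_ho = 10.0 * np.array([1.0, -1.0])    # output unit ~ OR AND (NOT AND) = XOR
b_o  = 10.0 * -0.5

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = sigmoid(np.array([x1, x2]) @ W_ih + b_h)
    y = sigmoid(h @ W_ho + b_o)
    print(x1, x2, "->", round(float(y)))   # prints the XOR truth table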
Hidden Layer Representations
❖ Hidden layers capture the characteristics of the training data in order to learn the target function.
❖ The training samples only constrain the number of input neurons and output neurons.
❖ The representations formed in the hidden layers and hidden neurons are not explicitly introduced by the human designer.
❖ Hence, a NN has the capability to adjust its structure and parameters to discover an efficient network that solves the given
problem with the minimum possible error E.
❖ This ability of multilayer NNs to automatically discover useful hidden layer representations is a key feature of
ANN learning.
❖ The greater the number of hidden layers/neurons, the more complex the problems that can be represented by the NN.
22
Convergence and Local Minima
❖ BPN uses Gradient Descent to search through the hypothesis space.
❖ The objective is to move through the hypothesis space in the direction that reduces the error E.
❖ Because the error surface for a multilayer NN may contain multiple local minima, the gradient descent search
may get stuck at one of these local minima.
❖ As a result, BPN is not guaranteed to converge to the global minimum; it may converge to a local minimum.
23
Convergence and Local Minima . . .
❖ In spite of this disadvantage, BPN is a highly popular model, even for NNs consisting of a large number of weights.
❖ A large number of weights corresponds to a very high-dimensional error surface;
→ a local minimum with respect to one weight may not be a local minimum with respect to the other weights;
→ hence the extra dimensions provide escape routes → BPN may not get stuck at a local minimum.
❖ If the initial weights are near zero
→ during the early iterations, the sigmoid function behaves as a smooth, nearly linear function;
→ as iterations pass, the weights tend to increase in value in order to reduce the error E;
→ this is when the NN comes to represent complex/nonlinear functions;
→ this is the region which may have more local minima, and BPN may get stuck at one of them;
→ but it may be hoped that by this time BPN has moved close enough to the global minimum;
→ so it is acceptable even if BPN gets stuck at a local minimum close to the global minimum.
24
Convergence and Local Minima . . .
❖ Regarding gradient descent over complex error surfaces, the following heuristics may be attempted to alleviate
the problem of local minima:
1) Add a momentum term to the weight-update rule –
Δwij(n) = (l) Errj Oi + α Δwij(n-1)
where α (0 ≤ α < 1) is the momentum constant and α Δwij(n-1), the fraction of the previous weight update, is the momentum term.
25
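A hedged sketch of heuristic 1: the previous update Δwij(n-1) is remembered and a fraction α of it is added to the current update, which can carry the search through small local minima and flat regions; the values of l and α are illustrative.

import numpy as np

def update_with_momentum(W, grad_term, prev_delta, l=0.2, alpha=0.9):
    # Delta w(n) = (l) * grad_term + alpha * Delta w(n-1)
    # where grad_term plays the role of Err_j * O_i in the plain update rule.
    delta = l * grad_term + alpha * prev_delta
    W = W + delta
    return W, delta        # delta becomes prev_delta on the next iteration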
Convergence and Local Minima . . .
❖ Regarding gradient descent over complex error surfaces, the following heuristics may be attempted to alleviate
the problem of local minima: . . .
2) Use Stochastic Gradient Descent instead of Standard Gradient Descent –
Stochastic Gradient Descent descends on a different, approximate error surface for each training sample,
and these surfaces have different local minima. Hence, it can be hoped that BPN will not get stuck in any one of these local minima.
26
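A small sketch contrasting the two descent styles; grad(w, sample) is a placeholder gradient function supplied by the caller, and the learning rate is an illustrative value.

def standard_gd_epoch(w, samples, grad, l=0.1):
    # Standard (batch) gradient descent: accumulate the gradient over the whole
    # training set, then take a single step on the true error surface.
    total = sum(grad(w, s) for s in samples)
    return w - l * total

def stochastic_gd_epoch(w, samples, grad, l=0.1):
    # Stochastic gradient descent: step after every sample, i.e. descend on a
    # sequence of per-sample (approximate) error surfaces whose local minima
    # generally differ from one another.
    for s in samples:
        w = w - l * grad(w, s)
    return w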
Generalization, Overfitting
❖ Generalization is the capability of the model to perform well on unseen data.
❖ Overfitting – why does overfitting occur at later iterations of the learning process?
→ Through the iterations, weights tend to increase their values to reduce the error E
→ Larger weight values increase model complexity
→ This leads to overfitting
❖ Solutions:
→ Weight Decay
→ Validation Data
→ K-Fold Cross Validation
27
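Hedged sketches of two of the listed solutions: weight decay, which shrinks every weight slightly on each update so that large (overfitting-prone) weights are penalized, and early stopping with validation data, which keeps the weights from the iteration with the lowest validation error. The function names, decay constant, and iteration count are illustrative assumptions.

def update_with_weight_decay(W, delta, decay=1e-4):
    # Apply the usual backpropagation update, then shrink the weights slightly.
    return (W + delta) * (1.0 - decay)

def train_with_validation(w, step_fn, val_error_fn, n_iters=1000):
    # step_fn(w) returns the weights after one training iteration;
    # val_error_fn(w) measures the error on the held-out validation data.
    best_w, best_err = w, val_error_fn(w)
    for _ in range(n_iters):
        w = step_fn(w)
        err = val_error_fn(w)
        if err < best_err:           # remember the best weights seen so far
            best_w, best_err = w, err
    return best_w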
Case Study: ALVINN
❖ Autonomous Land Vehicle In a Neural Network - 1989
❖ ALVINN is a neural network designed to steer an
autonomous vehicle driving at normal speeds on public
highways.
❖ A forward-pointing camera is mounted on the vehicle.
❖ The camera takes images of resolution 120 x 128 pixels.
❖ Currently ALVINN takes images from a camera and a laser
range finder as input and produces as output the direction
the vehicle should travel in order to follow the road.
❖ ALVINN is trained for 5 minutes by observing and learning from a human driver.
❖ It has subsequently been tested successfully, driving autonomously for 90 miles
at speeds of up to 70 miles per hour on public highways (driving in the left
lane of a divided highway, with other vehicles present).
28
Case Study: ALVINN . . .
29
Case Study: ALVINN . . .
❖ The 120 x 128 image taken by camera is converted
into a coarse-resolution image of 30 x 32.
❖ Each coarse resolution pixel intensity is obtained by
selecting the intensity of a single pixel at random
from the appropriate region within the high-
resolution image.
❖ This 30 x 32 coarse-resolution image is used
as input to the network.
❖ This method significantly reduces the computation
required to produce the coarse-resolution image from
the available high-resolution image.
❖ This efficiency is especially important when the
network must be used to process many images per
second while autonomously driving the vehicle.
❖ Output from each output unit corresponds to a
particular steering direction, and the output values
of these units determine which steering direction is
recommended most strongly.
30
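A hedged NumPy sketch of the coarse-resolution trick described above: each of the 30 x 32 pixels takes the intensity of one randomly chosen pixel from the corresponding 4 x 4 region of the 120 x 128 camera image (rather than, say, the region mean). The function name and the stand-in random image are illustrative.

import numpy as np

def coarse_image(high_res, out_rows=30, out_cols=32, rng=np.random.default_rng()):
    rows, cols = high_res.shape                    # expected 120 x 128
    rh, cw = rows // out_rows, cols // out_cols    # each region is 4 x 4
    low_res = np.empty((out_rows, out_cols), dtype=high_res.dtype)
    for r in range(out_rows):
        for c in range(out_cols):
            # pick a single pixel at random from the corresponding region
            i = r * rh + rng.integers(rh)
            j = c * cw + rng.integers(cw)
            low_res[r, c] = high_res[i, j]
    return low_res

camera_frame = np.random.randint(0, 256, size=(120, 128))   # stand-in image
net_input = coarse_image(camera_frame).ravel()               # 960 input values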
Case Study: ALVINN . . .
31
Case Study: Face Recognition
❖ Application of a Backpropagation NN to learn the direction in which a person is facing in an image – left, right, straight ahead, up.
❖ The learning task here involves classifying camera images of faces of various people in various poses.
32
Case Study: Face Recognition . . . .
Data Collection
❖ Images of 20 different people were collected, with approximately 32 images per person, varying:
• the person's expression (happy, sad, angry, neutral),
• the direction in which they were looking (left, right, straight ahead, up),
• whether or not they were wearing sunglasses,
• the background behind the person,
• the clothing worn by the person,
• the position of the person's face within the image.
❖ In total, 624 greyscale images were collected, each with a resolution of 120 x 128, with each image pixel
described by a greyscale intensity value between 0 (black) and 255 (white).
33
Case Study: Face Recognition . . . .
Input Encoding
❖ The 120 x 128 image is encoded into a coarse-resolution image of 30 x 32 pixels.
❖ Each coarse resolution pixel intensity is calculated as the mean of the corresponding high-resolution pixel
intensities.
❖ This 30 x 32 coarse-resolution image is used as input to the network.
❖ Data Scaling - The pixel intensity values ranging from 0 to 255 were linearly scaled to range from 0 to 1 so that
network inputs would have values in the same interval as the hidden unit and output unit activations.
Output Encoding
❖ 1-of-n output encoding is used: one output neuron per direction.
❖ Target values of 0.1 and 0.9 are used (rather than 0 and 1), since sigmoid units cannot produce outputs of exactly 0 or 1
with finite weights; each output neuron therefore produces a real value between 0.1 and 0.9.
❖ The NN's prediction is the direction corresponding to the output neuron with the highest value.
34
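A hedged sketch of the input and output encodings described above: mean pooling of 4 x 4 blocks followed by scaling to [0, 1], and 1-of-n targets with 0.1 / 0.9 values. The function names and the ordering of DIRECTIONS are illustrative assumptions.

import numpy as np

DIRECTIONS = ["left", "right", "straight", "up"]

def encode_input(image_120x128):
    # Mean of each 4 x 4 block -> 30 x 32 coarse image, scaled to [0, 1].
    blocks = image_120x128.reshape(30, 4, 32, 4).astype(float)
    coarse = blocks.mean(axis=(1, 3))
    return (coarse / 255.0).ravel()              # 960 network inputs

def encode_target(direction):
    # 1-of-n encoding: 0.9 for the correct output unit, 0.1 for the others.
    t = np.full(len(DIRECTIONS), 0.1)
    t[DIRECTIONS.index(direction)] = 0.9
    return t

def decode_output(outputs):
    # The prediction is the direction whose output unit has the highest value.
    return DIRECTIONS[int(np.argmax(outputs))]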
Case Study: Face Recognition . . . .
Network Graph Structure
❖ The Backpropagation network is an acyclic directed graph of sigmoid units.
❖ It is a feedforward network.
❖ It is a fully connected network.
❖ A 2-layer NN with 960 input neurons and 4 output neurons (2,899 weights for the 3-hidden-neuron network).
❖ Experimentation was done with
• 3 hidden neurons – produced a model with an accuracy of 90% (less training time),
• up to 30 hidden neurons – produced models with accuracies of 91% - 92% (more training time).
❖ Using 260 training images, the training time on a Sun Sparc5 workstation was approximately
• 5 minutes for the 3 hidden unit network,
• 1 hour for the 30 hidden unit network.
35
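A quick check of the quoted parameter count for the 3-hidden-neuron network: a fully connected 960 -> 3 -> 4 topology plus one bias per hidden and output unit.

n_in, n_hidden, n_out = 960, 3, 4
weights = n_in * n_hidden + n_hidden * n_out     # 2880 + 12 = 2892
biases = n_hidden + n_out                        # 3 + 4 = 7
print(weights + biases)                          # 2899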
Case Study: Face Recognition . . . .
36
Case Study: Face Recognition . . . .
❖ Number of training iterations was selected by partitioning the available data into a training set and a separate
validation set.
❖ Gradient descent was used to minimize the error over the training set, and after every 50 gradient descent
steps the performance of the network was evaluated over the validation set.
❖ The final selected network was the one with the highest accuracy over the validation set.
❖ The final reported accuracy was measured over a separate test dataset.
37
Case Study: Face Recognition . . . .
Visualization of the learned network weights (large +ve weight vs. large -ve weight).
38