
Machine Learning

Unit 3
Artificial Neural Networks

By
Dr. G. Sunitha
Professor & BoS Chairperson
Department of CSE

Department of Computer Science and Engineering

Sree Sainath Nagar, A. Rangampet, Tirupati – 517 102

1 – 9
Neural Network Learning with Backpropagation Algorithm

[Slides 1 – 9: figures of a multilayer feedforward network, showing inputs propagated forward through the hidden layers to the output layer and errors propagated backward.]

❖ The inputs are fed simultaneously into the units of the input layer.
❖ The weighted outputs of these units are fed into the hidden layer.
❖ The weighted outputs of the last hidden layer are the inputs to the units making up the output layer.
❖ The errors are propagated backward from the output layer to the hidden layers.
❖ Weights and bias values are updated.
❖ The process is repeated until a convergence criterion is met.
❖ Once the neural network model is ready, it can be used for prediction.
❖ Chain rule. After each forward pass through the network, backpropagation applies the chain rule of calculus in a backward pass, adjusting the model’s parameters (weights and biases).

Propagate the inputs forward → Propagate the errors backward (see the sketch below)
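A minimal sketch of this forward/backward flow for a single training sample, assuming an illustrative 3-input, 2-hidden-unit, 1-output sigmoid network (all names, sizes and values below are assumptions for illustration, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative network: 3 inputs -> 2 hidden sigmoid units -> 1 sigmoid output unit
W_h = rng.uniform(-1, 1, size=(3, 2))   # input-to-hidden weights
b_h = rng.uniform(-1, 1, size=2)        # hidden biases
W_o = rng.uniform(-1, 1, size=(2, 1))   # hidden-to-output weights
b_o = rng.uniform(-1, 1, size=1)        # output bias

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([0.2, 0.7, 0.1])           # one training sample
t = np.array([1.0])                     # its expected (target) output
lr = 0.5                                # learning rate

# Propagate the input forward
O_h = sigmoid(x @ W_h + b_h)            # hidden-layer outputs
O_o = sigmoid(O_h @ W_o + b_o)          # output-layer outputs

# Propagate the error backward
Err_o = O_o * (1 - O_o) * (t - O_o)     # output-layer error terms
Err_h = O_h * (1 - O_h) * (W_o @ Err_o) # hidden-layer error terms

# Update weights and biases
W_o += lr * np.outer(O_h, Err_o);  b_o += lr * Err_o
W_h += lr * np.outer(x, Err_h);    b_h += lr * Err_h
```

Repeating this forward/backward pair over all training samples, epoch after epoch, is what the step-by-step algorithm on the following slides describes.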

10
Classification by Backpropagation
❖ In 1969, Bryson and Ho introduced a method for learning in multi-layer networks that later became known as Backpropagation.
❖ The Backpropagation algorithm is a sensible approach for apportioning the network’s error among the individual weights.
❖ Learning works much as it does for Perceptrons: weights are adjusted iteratively to reduce the classification error.
❖ Backpropagation learns by iteratively processing a set of training data (samples).
❖ For each sample, the weights are modified so as to minimize the error between the network’s classification and the actual classification.

11
Steps in Backpropagation Algorithm
❖ STEP ONE: Initialize the weights and biases.
✓ The weights and biases in the network are initialized to random numbers from the interval [-1, 1].

❖ STEP TWO: Feed the training sample to the input layer.

❖ STEP THREE: Propagate the inputs forward from the input layer through the hidden layers to the output layer; compute the net input and output of each unit in the hidden and output layers.

❖ STEP FOUR: Compute the errors, starting at the output layer and moving backwards into the hidden layers, and backpropagate the error.

❖ STEP FIVE: Update the weights and biases to reflect the propagated errors.

❖ STEP SIX: Check the terminating conditions.
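Read as a program, the six steps form a training loop. The sketch below is one possible rendering under illustrative assumptions (a tiny 2-3-1 sigmoid network, a toy XOR-style dataset, case updating, and a fixed number of epochs as the only terminating condition); none of the names or values come from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy training set: 4 samples with 2 features each, binary targets (XOR-like)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

# STEP ONE: initialize weights and biases to random numbers in [-1, 1]
W_h, b_h = rng.uniform(-1, 1, (2, 3)), rng.uniform(-1, 1, 3)
W_o, b_o = rng.uniform(-1, 1, (3, 1)), rng.uniform(-1, 1, 1)
lr = 0.5

for epoch in range(5000):                    # STEP SIX: stop after a fixed number of epochs
    for x, t in zip(X, T):                   # STEP TWO: feed each training sample
        # STEP THREE: propagate the inputs forward
        O_h = sigmoid(x @ W_h + b_h)
        O_o = sigmoid(O_h @ W_o + b_o)
        # STEP FOUR: compute the errors and backpropagate them
        Err_o = O_o * (1 - O_o) * (t - O_o)
        Err_h = O_h * (1 - O_h) * (W_o @ Err_o)
        # STEP FIVE: update weights and biases (case updating)
        W_o += lr * np.outer(O_h, Err_o);  b_o += lr * Err_o
        W_h += lr * np.outer(x, Err_h);    b_h += lr * Err_h
```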

12
Propagate the Inputs Forward
❖ For unit j in the input layer, its output is equal to its input, that is, Oj = Ij.

❖ The net input to each unit in the hidden and output layers is computed as follows.
Given a unit j in a hidden or output layer, the net input is

Ij = Σi wij Oi + bj

where wij is the weight of the connection from unit i in the previous layer to unit j; Oi is the output of unit i from the previous layer; and bj is the bias of the unit.

❖ Each unit in the hidden and output layers takes its net input and then applies an activation function.
Given a net input Ij to unit j, the output of unit j is computed as

Oj = f(Ij) = 1 / (1 + e^(-Ij))

Neurons in the hidden layers and the output layer are called Sigmoid Units because their activation function f is the Sigmoid Function (a non-linear function).
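A small sketch of these two formulas for one layer, assuming the layer’s weights are stored in a matrix W with W[i, j] = wij and its biases in a vector b (all names and numbers are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer_forward(O_prev, W, b):
    """Net input Ij = sum_i(wij * Oi) + bj, then output Oj = 1 / (1 + e^-Ij)."""
    I = O_prev @ W + b
    return sigmoid(I)

# Example: 3 outputs from the previous layer feeding 2 sigmoid units
O_prev = np.array([0.1, 0.8, 0.5])
W = np.array([[ 0.2, -0.4],
              [ 0.7,  0.1],
              [-0.5,  0.3]])
b = np.array([0.1, -0.2])
print(layer_forward(O_prev, W, b))
```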
13
Backpropagate the Error
❖ When the output layer is reached, the error is computed and propagated backwards.
❖ For a unit k in the output layer, the error is computed by the formula

Errk = Ok (1 - Ok) (Tk - Ok)

where Ok is the actual output of output neuron k;
Tk is the true output based on the known class label of the training sample (the expected output); and
Ok (1 - Ok) is the derivative (rate of change) of the activation function.
❖ The error is propagated backwards by updating the weights and biases to reflect the error of the network’s classification.
❖ For a unit j in the hidden layer, the error is computed by the formula

Errj = Oj (1 - Oj) Σk Errk wjk

where wjk is the weight of the connection from unit j to unit k in the next higher layer, and Errk is the error of unit k.
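A sketch of the two error formulas with illustrative array names and values (Err_out holds the Errk of the output layer, Err_hid the Errj of the hidden layer):

```python
import numpy as np

# Illustrative values after a forward pass
O_out = np.array([0.62])          # actual outputs Ok of the output layer
T     = np.array([1.0])           # expected outputs Tk (true class labels)
O_hid = np.array([0.55, 0.30])    # outputs Oj of the hidden layer
W_jk  = np.array([[ 0.4],         # weights wjk from hidden unit j to output unit k
                  [-0.7]])

# Output layer: Errk = Ok (1 - Ok) (Tk - Ok)
Err_out = O_out * (1 - O_out) * (T - O_out)

# Hidden layer: Errj = Oj (1 - Oj) * sum_k(Errk * wjk)
Err_hid = O_hid * (1 - O_hid) * (W_jk @ Err_out)

print(Err_out, Err_hid)
```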

14
Update Weights and Biases

❖ Weights are updated by the following equations, where l, the learning rate, is a constant between 0.0 and 1.0 that is typically held fixed during training:

Δwij = (l) Errj Oi
wij = wij + Δwij

❖ Biases are updated by the following equations:

Δbj = (l) Errj
bj = bj + Δbj

❖ Case updating – the weights and biases are updated after the presentation of each sample.
❖ Epoch – one iteration through the entire training set is called an epoch.
❖ Epoch updating – alternatively, the weight and bias increments can be accumulated in variables and the weights and biases updated only after all of the samples in the training set have been presented.
❖ In practice, case updating is more common and tends to yield more accurate results.
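A sketch contrasting case updating and epoch updating for a single weight matrix, using the increments Δwij = (l) Errj Oi and Δbj = (l) Errj (function and variable names are illustrative):

```python
import numpy as np

l = 0.3  # learning rate

def deltas(O_prev, Err, l):
    """Increments: dW[i, j] = l * Err_j * O_i and db[j] = l * Err_j."""
    return l * np.outer(O_prev, Err), l * Err

# Case updating: apply the increments right after each sample
def case_update(W, b, O_prev, Err, l):
    dW, db = deltas(O_prev, Err, l)
    return W + dW, b + db

# Epoch updating: accumulate the increments and apply them once per epoch
def epoch_update(W, b, samples, l):
    acc_W, acc_b = np.zeros_like(W), np.zeros_like(b)
    for O_prev, Err in samples:
        dW, db = deltas(O_prev, Err, l)
        acc_W += dW
        acc_b += db
    return W + acc_W, b + acc_b

W, b = np.zeros((2, 1)), np.zeros(1)
W, b = case_update(W, b, np.array([0.5, 0.2]), np.array([0.1]), l)
```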
15
Terminating Conditions
Training stops when one of the following criteria is met:

❖ All Δwij in the previous epoch were so small as to be below some specified threshold, or

❖ The percentage of samples misclassified in the previous epoch is below some threshold, or

❖ A prespecified number of epochs has expired.

In practice, several hundred thousand epochs may be required before the weights converge.
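A sketch of the three criteria combined into a single stopping check; the threshold values and names below are illustrative assumptions, not prescribed values:

```python
import numpy as np

def should_stop(weight_deltas, misclassified_fraction, epoch,
                delta_threshold=1e-4, error_threshold=0.05, max_epochs=100_000):
    """Stop when all weight changes are tiny, or few samples are misclassified,
    or a prespecified number of epochs has expired."""
    all_deltas_small = all(np.all(np.abs(d) < delta_threshold) for d in weight_deltas)
    few_errors = misclassified_fraction < error_threshold
    return all_deltas_small or few_errors or epoch >= max_epochs

# Example check after some epoch
print(should_stop([np.array([1e-5, -2e-5])], misclassified_fraction=0.12, epoch=480))
```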

16
17
Features of ANNs
❖ ANNs generally perform better with a larger number of hidden units; more hidden units generally produce lower error.
❖ Determining the network topology is difficult; the designer must settle the connections between nodes, the number of units, and the training level.
❖ Choosing a single best learning rate in advance is impossible; it has to be tuned.
❖ It is difficult to reduce training time by altering the network topology or learning parameters.
❖ ANNs yield highly accurate predictive models for a large number of different types of problems.
❖ Ease of use and deployment – poor.
❖ Learning capability – the model is built one record at a time.

18
Features of ANNs . . .
❖ Weaknesses
• Long training time
• Require a number of parameters typically best determined empirically, e.g., the network topology or “structure”
• Poor interpretability: difficult to interpret the symbolic meaning behind the learned weights and of “hidden units” in the network

❖ Strengths
• High tolerance to noisy data
• Ability to classify untrained patterns
• Well-suited for continuous-valued inputs and outputs
• Successful on a wide array of real-world data
• Algorithms are inherently parallel
• Techniques have recently been developed for the extraction of rules from trained neural networks

19
Hypothesis Space and Inductive Bias
❖ Every possible combination of network weights is a potential candidate.
❖ All potential candidates form the Hypothesis Space.
❖ The hypothesis space can be viewed as an N-dimensional Euclidean space, with one dimension for each of the N network weights.
❖ The hypothesis space of a neural network is a continuous space.
❖ The error E of a network is differentiable with respect to the continuous parameters of the hypothesis space.
❖ These two factors result in a well-defined error gradient, which leads to efficient search strategies.
❖ Inductive Bias – can be defined as the set of assumptions (implicit or explicit) made by learning
algorithms in order to perform induction (or generalization).
❖ Inductive Bias of NN – “Smooth Interpolation between the data points”.

20
Representational Power of Feedforward NNs
Representational power specifies what set of functions (problems) can be represented by a NN.
❖ Boolean Functions –
• Every Boolean function can be represented by a NN with 2 layers of units.
• In the worst case, the number of hidden neurons required equals the number of samples in the training data (one hidden neuron per sample).
• In practice, the NN may be designed with fewer hidden neurons.
❖ Continuous Functions –
• Every bounded continuous function can be approximated with arbitrarily small error E by a NN with 2 layers.
• The hidden layer uses sigmoid units.
• The output layer uses unthresholded linear units.
❖ Arbitrary Functions –
• Any arbitrary function can be approximated by a NN with 3 layers.
• The two hidden layers use sigmoid units.
• The output layer uses unthresholded linear units.

21
Hidden Layer Representations
❖ Hidden layers capture the characteristics of the training data needed to learn the target function.
❖ The training samples constrain only the number of input neurons and output neurons.
❖ The representations formed in the hidden layers and hidden neurons are not explicitly specified by the human designer.
❖ Hence, the NN has the capability to adjust its internal structure and parameters so as to solve the given problem with the minimal possible error E.
❖ This ability of multilayer NNs to automatically discover useful hidden-layer representations is a key feature of ANN learning.
❖ The more hidden layers/neurons a network has, the more complex the problems it can represent.

22
Convergence and Local Minima
❖ BPN uses Gradient Descent to search through the hypothesis space.
❖ The objective is to move through the hypothesis space in the direction that reduces the error E.
❖ Because the error surface for multilayer NNs may contain multiple local minima, the gradient descent search may get stuck at one of these local minima.
❖ As a result, BPN is not guaranteed to converge to the global minimum; it may converge to a local minimum instead.

23
Convergence and Local Minima . . .
❖ In spite of this disadvantage, BPN remains a highly popular model, especially for NNs with a large number of weights.
❖ A large number of weights corresponds to a very high-dimensional error surface;
→ a local minimum with respect to one weight may not be a local minimum with respect to the other weights;
→ hence the extra dimensions provide escape routes → BPN may not get stuck at a local minimum.
❖ If the initial weights are near zero
→ during early iterations, the sigmoid behaves as a smooth, nearly linear function;
→ as iterations pass, the weights tend to grow in magnitude in order to reduce the error E;
→ this is when the NN comes to represent complex/nonlinear functions;
→ this is the region which may contain more local minima, and BPN may get stuck at one of them;
→ but it may be hoped that by this time BPN has moved close enough to the global minimum;
→ so it is acceptable even if BPN gets stuck at a local minimum close to the global minimum.

24
Convergence and Local Minima . . .
❖ Regarding gradient descent over complex error surfaces, the following heuristics may be attempted to alleviate
the problem of local minima:
1) Add a momentum term to the weight-update rule –

Δwij(n) = (l) Errj Oi + α Δwij(n - 1)

where α (0 ≤ α < 1) is the momentum constant and Δwij(n - 1) is the weight increment from the previous iteration (the momentum term).

Momentum has two effects on the gradient descent:

✓ It keeps the descent moving in the same direction across iterations, and
✓ It helps the descent roll through local minima and flat regions.
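A sketch of this momentum-augmented update for one weight matrix, where the previous increment, scaled by α, is added to the current gradient-descent increment (names and values are illustrative):

```python
import numpy as np

l, alpha = 0.3, 0.9          # learning rate and momentum constant

def momentum_step(W, O_prev, Err, prev_dW):
    """dW(n) = l * Err_j * O_i + alpha * dW(n-1)."""
    dW = l * np.outer(O_prev, Err) + alpha * prev_dW
    return W + dW, dW        # return the increment so it can be reused next iteration

W = np.zeros((2, 1))
prev_dW = np.zeros_like(W)
W, prev_dW = momentum_step(W, np.array([0.5, 0.2]), np.array([0.1]), prev_dW)
```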

25
Convergence and Local Minima . . .
❖ Regarding gradient descent over complex error surfaces, the following heuristics may be attempted to alleviate
the problem of local minima: . . .
2) Use Stochastic Gradient Descent instead of Standard Gradient Descent.
Stochastic Gradient Descent descends an approximate error surface for each sample, and these surfaces have different local minima. Hence, it can be hoped that BPN will not get stuck in any one of these local minima.

3) Train multiple NNs with different initial weights.
Different initial weights lead to different search paths and possibly different local minima;
the NN with the best performance can be selected as the final model.

26
Generalization, Overfitting
❖ Generalization is the capability of the model to perform well on unseen data.
❖ Overfitting – why does overfitting occur at later iterations of the learning process?

Through the iterations
→ weights tend to increase their values to reduce the error E
→ larger weight values increase model complexity
→ this leads to overfitting

Solutions (a weight-decay sketch follows below):
→ Weight Decay
→ Validation Data
→ K-Fold Cross Validation
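A sketch of the first remedy, weight decay: each update shrinks every weight slightly so the weights cannot grow without bound; the decay factor and names are illustrative assumptions:

```python
import numpy as np

l, decay = 0.3, 1e-4   # learning rate and weight-decay factor (illustrative values)

def decayed_update(W, O_prev, Err):
    """Ordinary backpropagation increment plus a small shrinkage of W."""
    dW = l * np.outer(O_prev, Err)
    return (1 - decay) * W + dW

W = np.full((2, 1), 0.8)
W = decayed_update(W, np.array([0.5, 0.2]), np.array([0.1]))
```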

27
Case Study: ALVINN
❖ Autonomous Land Vehicle In a Neural Network - 1989
❖ ALVINN is a neural network designed to steer an
autonomous vehicle driving at normal speeds on public
highways.
❖ A forward-pointing camera is mounted on the vehicle.
❖ The camera takes images of resolution 120 x 128 pixels.
❖ ALVINN takes images from the camera and a laser range finder as input and produces as output the direction the vehicle should travel in order to follow the road.
❖ ALVINN is trained for about 5 minutes by observing and learning from a human driver.
❖ It has then been tested successfully for autonomous driving over 90 miles at speeds of up to 70 miles per hour on public highways (driving in the left lane of a divided highway, with other vehicles present).

28
Case Study: ALVINN . . .

❖ ALVINN is a 2-layer Backpropagation NN with 960 input neurons, 4 hidden neurons and 30 output neurons.
❖ The individual units are interconnected in layers that form a directed acyclic graph.
❖ It is a feedforward network.
❖ It is a fully connected network.
❖ The output layer is a linear representation of the direction the vehicle should travel in order to keep the vehicle on the road.

29
Case Study: ALVINN . . .
❖ The 120 x 128 image taken by camera is converted
into a coarse-resolution image of 30 x 32.
❖ Each coarse resolution pixel intensity is obtained by
selecting the intensity of a single pixel at random
from the appropriate region within the high-
resolution image.
❖ This 30 x 32 coarse-resolution image is used
as input to the network.
❖ This method significantly reduces the computation
required to produce the coarse-resolution image from
the available high-resolution image.
❖ This efficiency is especially important when the
network must be used to process many images per
second while autonomously driving the vehicle.
❖ Output from each output unit corresponds to a particular steering direction, and the output values of these units determine which steering direction is recommended most strongly.

30
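A sketch of the coarse-resolution step described above: each of the 30 x 32 coarse pixels takes the intensity of one pixel chosen at random from its 4 x 4 region of the 120 x 128 image (array names are illustrative, and the camera image is replaced by random data):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(120, 128))   # stand-in for a 120 x 128 camera image

coarse = np.empty((30, 32), dtype=image.dtype)
for r in range(30):
    for c in range(32):
        # pick one pixel at random from the corresponding 4 x 4 region
        rr = 4 * r + rng.integers(0, 4)
        cc = 4 * c + rng.integers(0, 4)
        coarse[r, c] = image[rr, cc]
```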
Case Study: ALVINN . . .

❖ The large matrix of black and white boxes depicts the weights from the 30 x 32 pixel inputs into the hidden unit. Here, a white box indicates a positive weight, a black box a negative weight, and the size of the box indicates the weight magnitude.
❖ The smaller rectangular diagram directly above the large matrix shows the weights from this hidden unit to each of the 30 output units.

31
Case Study: Face Recognition
❖ Application of a Backpropagation NN to learn the direction in which a face is looking in an image – left, right, straight ahead, or up.
❖ The learning task involves classifying camera images of the faces of various people in various poses.

32
Case Study: Face Recognition . . . .
Data Collection
❖ Images of 20 different people were collected, with approximately 32 images per person, varying
• the person’s expression (happy, sad, angry, neutral),
• the direction in which the person was looking (left, right, straight ahead, up),
• whether or not the person was wearing sunglasses,
• the background behind the person,
• the clothing worn by the person, and
• the position of the person’s face within the image.
❖ In total, 624 greyscale images were collected, each with a resolution of 120 x 128 pixels, with each image pixel described by a greyscale intensity value between 0 (black) and 255 (white).

33
Case Study: Face Recognition . . . .
Input Encoding
❖ The 120 x 128 image is encoded into a coarse-resolution image of 30 x 32 pixels.
❖ Each coarse resolution pixel intensity is calculated as the mean of the corresponding high-resolution pixel
intensities.
❖ This 30 x 32 coarse-resolution image is used as input to the network.
❖ Data Scaling - The pixel intensity values ranging from 0 to 255 were linearly scaled to range from 0 to 1 so that
network inputs would have values in the same interval as the hidden unit and output unit activations.
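A sketch of this input encoding: mean pooling from 120 x 128 down to 30 x 32 (each coarse pixel is the mean of a 4 x 4 block), followed by linear scaling of the intensities from [0, 255] to [0, 1]; names and the stand-in image are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(120, 128)).astype(float)  # stand-in greyscale image

# Mean of each 4 x 4 block gives one coarse pixel (120 / 30 = 128 / 32 = 4)
coarse = image.reshape(30, 4, 32, 4).mean(axis=(1, 3))

# Linear scaling from [0, 255] to [0, 1]; flatten to 960 values, one per input neuron
net_input = (coarse / 255.0).ravel()
```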

Output Encoding
❖ 1-of-n output encoding is used, with one output neuron per face direction.
❖ Target output values of 0.1 and 0.9 are used for each output neuron (rather than 0 and 1, which sigmoid units cannot produce exactly).
❖ The NN’s prediction is the class whose output neuron has the highest value.
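A sketch of the 1-of-n output encoding with 0.1 / 0.9 target values and of decoding a prediction; the class ordering and names are illustrative assumptions:

```python
import numpy as np

directions = ["left", "right", "straight", "up"]

def encode_target(direction):
    """1-of-n target: 0.9 for the true class, 0.1 for all others."""
    t = np.full(len(directions), 0.1)
    t[directions.index(direction)] = 0.9
    return t

def decode_prediction(outputs):
    """The prediction is the class whose output neuron has the highest value."""
    return directions[int(np.argmax(outputs))]

print(encode_target("up"))                       # [0.1 0.1 0.1 0.9]
print(decode_prediction([0.2, 0.7, 0.3, 0.1]))   # right
```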

34
Case Study: Face Recognition . . . .
Network Graph Structure
❖ The Backpropagation network is an acyclic directed graph of sigmoid units.
❖ It is a feedforward network.
❖ It is a fully connected network.
❖ A 2-layer NN with 960 input neurons and 4 output neurons (2899 weights).
❖ Experimentation was done with
• 3 hidden neurons – produced a model with an accuracy of 90% (less training time),
• up to 30 hidden neurons – produced models with accuracies of 91% – 92% (more training time).

❖ Using 260 training images, the training time on a Sun Sparc5 workstation was approximately
• 5 minutes for the 3 hidden unit network,
• 1 hour for the 30 hidden unit network.

35
Case Study: Face Recognition . . . .
❖ In these learning experiments,
• the learning rate l was set to 0.3, and
• the momentum α was set to 0.3.


❖ Lower values for both parameters produced roughly equivalent generalization accuracy, but longer training
times.
❖ If these values are set too high, training fails to converge to a network with acceptable error over the training
set.
❖ Full gradient descent was used in all these experiments (in contrast to the stochastic approximation to gradient
descent).
❖ Input unit weights were initialized to zero.
❖ Network weights in the output units were initialized to small random values.

36
Case Study: Face Recognition . . . .
❖ Number of training iterations was selected by partitioning the available data into a training set and a separate
validation set.
❖ Gradient descent was used to minimize the error over the training set, and after every 50 gradient descent
steps the performance of the network was evaluated over the validation set.
❖ The final selected network was the one with the highest accuracy over the validation set.
❖ The final reported accuracy was measured over a separate test dataset.
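A sketch of this selection scheme: evaluate the network on the validation set every 50 gradient-descent steps and keep the checkpoint with the highest validation accuracy. The train_step and validation_accuracy callables below are hypothetical placeholders standing in for the actual training and evaluation code:

```python
import copy

def select_best_network(network, train_step, validation_accuracy, total_steps=5000):
    """Every 50 gradient-descent steps, checkpoint the network if it improves
    on the validation set; return the best checkpoint at the end."""
    best_acc, best_net = -1.0, copy.deepcopy(network)
    for step in range(1, total_steps + 1):
        train_step(network)                       # one gradient-descent step (placeholder)
        if step % 50 == 0:
            acc = validation_accuracy(network)    # accuracy on the held-out validation set
            if acc > best_acc:
                best_acc, best_net = acc, copy.deepcopy(network)
    return best_net, best_acc

# Tiny illustration with dummy placeholders
net = {"w": 0.0}
best, acc = select_best_network(
    net,
    train_step=lambda n: n.update(w=n["w"] + 0.01),
    validation_accuracy=lambda n: 1 - abs(1.0 - n["w"]),
    total_steps=200)
```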

37
Case Study: Face Recognition . . . .
[Figure: visualization of the learned network weights after training; large positive and large negative weights are highlighted.]

38
