Back Propagation
Reference book: Machine Learning by Tom Mitchell
Perceptron Training Rule
One way to learn an acceptable weight vector is to begin with random weights, then
iteratively apply the perceptron to each training example, modifying the perceptron weights
whenever it misclassifies an example. This process is repeated, iterating through the training
examples as many times as needed until the perceptron classifies all training examples
correctly. Weights are modified at each step according to the perceptron training rule, which
revises the weight wi associated with input xi according to the rule

wi ← wi + Δwi,  where Δwi = η(t - o)xi

Here t is the target output for the current training example, o is the output generated by
the perceptron, and η is a positive constant called the learning rate.
For example, if xi = 0.8, η = 0.1, t = 1, and o = -1, then the weight update is
Δwi = η(t - o)xi = 0.1(1 - (-1))(0.8) = 0.16, so the weight is increased.
On the other hand, if t = -1 and o = 1, then weights associated with positive xi will
be decreased rather than increased.
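
A minimal sketch of this training loop in Python (the data layout, the sign-threshold output, and the weight initialization are assumptions for illustration, not taken from the slides):

import numpy as np

def perceptron_train(X, t, eta=0.1, max_epochs=100):
    """Perceptron training rule: wi <- wi + eta * (t - o) * xi.
    X: (n_samples, n_features) inputs; t: targets in {-1, +1}."""
    w = np.random.uniform(-0.05, 0.05, X.shape[1])   # small random initial weights
    for _ in range(max_epochs):
        converged = True
        for x, target in zip(X, t):
            o = 1 if np.dot(w, x) >= 0 else -1       # perceptron output
            if o != target:                          # update only on misclassification
                w += eta * (target - o) * x
                converged = False
        if converged:                                 # all training examples classified correctly
            break
    return w

With the numbers from the example above, a misclassified input with xi = 0.8 moves the corresponding weight by 0.16.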
The training error is defined as

E(w) = ½ Σd∈D (td - od)²

where D is the set of training examples, td is the target output for training example d,
and od is the output of the linear unit for training example d. By this definition, E(w) is
simply half the squared difference between the target output td and the unit output od,
summed over all training examples.
Error Surface
DERIVATION OF THE GRADIENT DESCENT RULE
How can we calculate the direction of steepest descent along the error surface?
This direction can be found by computing the derivative of E with respect to each
component of the vector w. This vector derivative is called the gradient of E with
respect to w, written

∇E(w) = [ ∂E/∂w0, ∂E/∂w1, ..., ∂E/∂wn ]
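
The remaining steps of the derivation reduce to differentiating E term by term. A sketch in LaTeX, assuming the linear unit od = w · xd as in Mitchell's treatment:

\begin{align*}
\frac{\partial E}{\partial w_i}
  &= \frac{\partial}{\partial w_i}\,\frac{1}{2}\sum_{d \in D}(t_d - o_d)^2
   = \sum_{d \in D}(t_d - o_d)\,\frac{\partial}{\partial w_i}\bigl(t_d - \vec{w}\cdot\vec{x}_d\bigr)
   = \sum_{d \in D}(t_d - o_d)(-x_{id}) \\
\Delta w_i &= -\eta\,\frac{\partial E}{\partial w_i}
   = \eta \sum_{d \in D}(t_d - o_d)\,x_{id}
\end{align*}

The resulting update moves each weight along the negative gradient, i.e., in the direction of steepest descent on the error surface.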
BACK PROPAGATION
“How does backpropagation work?”
Backpropagation learns by iteratively processing a data set of training tuples, comparing the
network’s prediction for each tuple with the actual known target value.
The target value may be the known class label of the training tuple (for classification problems)
or a continuous value (for numeric prediction).
For each training tuple, the weights are modified so as to minimize the mean-squared error
between the network’s prediction and the actual target value.
These modifications are made in the “backwards” direction (i.e., from the output layer)
through each hidden layer down to the first hidden layer (hence the name backpropagation).
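
A minimal sketch of one such backward pass in Python, for a single-hidden-layer network with sigmoid units (the layer shapes, variable names, and omission of bias terms are assumptions for illustration; the error expressions match the output-layer and hidden-layer formulas given later in this section):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, target, W1, W2, eta=0.1):
    # Forward pass
    h = sigmoid(W1 @ x)                           # hidden-layer outputs
    o = sigmoid(W2 @ h)                           # output-layer outputs
    # Errors, propagated backwards from the output layer
    err_out = o * (1 - o) * (target - o)          # output-unit error
    err_hid = h * (1 - h) * (W2.T @ err_out)      # hidden-unit error: weighted sum of downstream errors
    # Weight updates (gradient step on the squared error)
    W2 += eta * np.outer(err_out, h)
    W1 += eta * np.outer(err_hid, x)
    return W1, W2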
Gradient Descent (GD) is a widely used optimization algorithm in machine learning and deep
learning that minimizes the cost function of a neural network model during training. It works by
iteratively adjusting the weights or parameters of the model in the direction of the negative
gradient of the cost function until the minimum of the cost function is reached.
The cost function evaluates the difference between the actual and predicted outputs.
The algorithm calculates gradients, i.e., the partial derivatives of the cost function
with respect to each parameter.
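
A minimal sketch of this loop in Python for the linear unit and squared-error function defined earlier (the batch update, step count, and function name are assumptions for illustration):

import numpy as np

def gradient_descent(X, t, eta=0.05, n_steps=1000):
    # Minimizes E(w) = 1/2 * sum_d (t_d - o_d)^2 with o_d = w . x_d
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        o = X @ w                        # linear-unit outputs for all training examples
        grad = -(X.T @ (t - o))          # dE/dw_i = -sum_d (t_d - o_d) * x_id
        w -= eta * grad                  # step in the direction of the negative gradient
    return w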
Multilayer feed-forward networks, given enough hidden units and enough training samples, can
closely approximate any function.
For a unit j in the output layer, the error is

Errj = Oj(1 - Oj)(Tj - Oj)

where Oj is the actual output of unit j and Tj is the known target value of the given training tuple.
To compute the error of a hidden-layer unit j, the weighted sum of the errors of
the units connected to unit j in the next layer is considered. The error of a
hidden-layer unit j is

Errj = Oj(1 - Oj) Σk Errk wjk

where wjk is the weight of the connection from unit j to a unit k in the next higher layer,
and Errk is the error of unit k.
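
As an illustration with assumed values: if an output unit produces Oj = 0.7 when the target is Tj = 1, its error is Errj = 0.7(1 - 0.7)(1 - 0.7) = 0.063. A hidden unit with output 0.5 connected to that output unit by a weight of 0.2 then has error 0.5(1 - 0.5)(0.063 × 0.2) = 0.00315.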
Multilayer Feed-Forward Network Problem