Neural Networks and Fuzzy Systems: Multi-Layer Feed Forward Networks
3
Why do we need Multi-Layer Feed-Forward Networks (MLFF)?
[Figure: example decision regions (e.g., Region 2) that a single straight decision boundary cannot separate]
4
Why do we need Multi-Layer Feed-Forward Networks (MLFF)?
• Second suggestion:
• In some cases we need a curved decision boundary, or we are trying to solve more complicated classification and regression problems.
• So, we need to (see the sketch after this list):
• Add more layers.
• Increase the number of neurons in each layer.
• Use a non-linear activation function in the hidden layers.
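As an illustration (a minimal sketch, not taken from the slides): a tiny network with one non-linear hidden layer can represent the XOR function, which no single-layer perceptron can. All weight values below are hand-picked assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hand-picked (illustrative) weights for a 2-2-1 network that approximates XOR.
W1 = np.array([[ 20.0,  20.0],    # hidden unit 1: roughly an OR of the inputs
               [-20.0, -20.0]])   # hidden unit 2: roughly a NAND of the inputs
b1 = np.array([-10.0, 30.0])
W2 = np.array([[20.0, 20.0]])     # output unit: roughly an AND of the two hidden units
b2 = np.array([-30.0])

def forward(x):
    a1 = sigmoid(W1 @ x + b1)     # hidden layer (non-linear activation)
    y  = sigmoid(W2 @ a1 + b2)    # output layer
    return y

for x in [np.array([0., 0.]), np.array([0., 1.]),
          np.array([1., 0.]), np.array([1., 1.])]:
    print(x, np.round(forward(x), 3))   # outputs close to 0, 1, 1, 0 (XOR)
```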
5
Notation for Multi-Layer Networks
• Dealing with multi-layer networks is easy if a sensible notation is adopted.
• We simply need another label (n) to tell us which layer in the network we are dealing with.
• Each unit j in layer n receives activations from the previous layer of
processing units and sends activations to the next layer of units.
6
ANN Representation
(1 input layer + 1 hidden layer + 1 output layer)
[Figure: a fully connected network; inputs x_i = a_i^(0) in layer (0), hidden units (z_j^(1) | a_j^(1)) in layer (1), and output units (z_k^(2) | a_k^(2)) with y_k = a_k^(2) in layer (2), connected by weights w^(1) and w^(2)]
• Notation, consistent with the figure (w_ij^(n) connects unit i in layer n−1 to unit j in layer n):
  x_i = a_i^(0)
  z_j^(1) = Σ_i w_ij^(1) · a_i^(0) + b_j^(1),   a_j^(1) = σ(z_j^(1))
  z_k^(2) = Σ_j w_jk^(2) · a_j^(1) + b_k^(2),   y_k = a_k^(2) = σ(z_k^(2))
• Example:
  y_1 = σ( Σ_j w_j1^(2) · σ( Σ_i w_ij^(1) · x_i + b_j^(1) ) + b_1^(2) )
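A minimal NumPy sketch of this forward pass; the layer sizes (2 inputs, 3 hidden units, 2 outputs), the random weights, and the input values are assumptions for illustration, not values taken from the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative 2-3-2 network.
# W1[i, j] is the weight from input i to hidden unit j  (w_ij^(1)).
W1 = rng.normal(size=(2, 3))
b1 = np.zeros(3)
# W2[j, k] is the weight from hidden unit j to output k (w_jk^(2)).
W2 = rng.normal(size=(3, 2))
b2 = np.zeros(2)

x = np.array([0.5, -1.0])           # a^(0) = x

z1 = x @ W1 + b1                    # z_j^(1) = sum_i w_ij^(1) a_i^(0) + b_j^(1)
a1 = sigmoid(z1)                    # a_j^(1) = sigma(z_j^(1))

z2 = a1 @ W2 + b2                   # z_k^(2) = sum_j w_jk^(2) a_j^(1) + b_k^(2)
y = sigmoid(z2)                     # y_k = a_k^(2)

print("hidden activations:", a1)
print("network outputs  :", y)
```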
7
Gradient Descent
and Backpropagation
Error Function
● How can we evaluate the performance of a neuron?
● We can use an Error function (also called a cost function or loss function) to measure how far off we are from the expected value.
● Choosing an appropriate error function helps the learning algorithm reach the best values for the weights and biases.
● We'll use the following variables:
○ D to represent the true (desired) value
○ y to represent the neuron's prediction
9
Error Functions
(Cost Function or Loss Function)
• There are many formulas for error functions.
• In this course, we will deal with two error function formulas.
1. Sum Squared Error (SSE): for a single perceptron
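For reference, a standard form of the SSE, written with the variables above (D the desired value, y the prediction, summed over the training patterns p), is:

```latex
E_{\text{SSE}} = \frac{1}{2} \sum_{p} \left( D_p - y_p \right)^2
```

The factor 1/2 is a common convention chosen so that the derivative of E comes out without an extra constant; some texts omit it.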
10
Why does the error in an ANN occur?
• Each weight and bias in the network contributes to the occurrence of the error.
11
Gradient Descent (in 1 dimension)
• Assume we have an error function E and we need to use it to update one weight w.
• The figure shows the error function in terms of w.
• Our target is to learn the value of w that produces the minimum value of E.
How?
[Figure: the error E plotted against w, with the minimum marked]
12
Gradient Descent (in 1 dimension)
• In the Gradient Descent algorithm, we use the following equation (called the Delta rule) to get a better value of w:

  w_new = w_old + Δw,   where  Δw = −η · dE/dw    (3)

Where:
• η is the learning rate
• Δw can be computed mathematically using the derivative of E with respect to w (dE/dw)
[Figure: the error E plotted against w, showing the update stepping toward the minimum]
13
Local Minima problem
14
Choosing the learning rate
15
Gradient Descent (multi-dimension)
• In an ANN with many layers and many neurons in each layer, the error function will be a multi-variable function.
• So, the derivative in equation (3) should be a partial derivative:

  Δw_i = −η · ∂E/∂w_i    (4)
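A minimal sketch of this rule in code; the toy error function, starting weights, and learning rate are assumptions for illustration. Gradient descent repeatedly nudges each weight against its partial derivative.

```python
import numpy as np

def error(w):
    # Toy multi-variable error function (illustrative only).
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2

def gradient(w):
    # Partial derivatives dE/dw_i of the toy error function.
    return np.array([2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)])

eta = 0.1                          # learning rate
w = np.array([0.0, 0.0])           # initial weights

for step in range(50):
    delta_w = -eta * gradient(w)   # Delta rule: dw_i = -eta * dE/dw_i
    w = w + delta_w                # w_new = w_old + dw

print("learned weights:", w)       # approaches (3, -1), the minimum of E
print("final error    :", error(w))
```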
16
Derivative of activation functions
• Sigmoid: f(z) = 1 / (1 + e^(−z)), and its derivative is f′(z) = f(z) · (1 − f(z))
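A quick numerical check of this identity (a sketch; the test point is arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)            # f'(z) = f(z) * (1 - f(z))

z = 0.7
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)   # central difference
print(sigmoid_derivative(z), numeric)                    # the two values agree closely
```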
17
Learning Rule in the output layer
• Using SSE as the error function and sigmoid as the activation function:

  δ_k = (D_k − y_k) * y_k * (1 − y_k)

Where:
• (D_k − y_k) is the output error coming from the SSE, and y_k(1 − y_k) is the sigmoid derivative.
• So (How?), by the chain rule, ∂E/∂w_jk = −δ_k · a_j, since w_jk affects E only through z_k = Σ_j w_jk a_j + b_k.
• Then:

  Δw_jk = −η · ∂E/∂w_jk = η · δ_k · a_j
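A worked numeric example of this rule; all values (desired output, prediction, hidden activation, learning rate) are assumptions chosen for illustration.

```python
# Output-layer weight update with SSE + sigmoid (illustrative numbers).
D_k  = 1.0      # desired output
y_k  = 0.73     # neuron's prediction (sigmoid output)
a_j  = 0.42     # activation of hidden unit j feeding this weight
eta  = 0.5      # learning rate

delta_k = (D_k - y_k) * y_k * (1.0 - y_k)   # delta_k = (D - y) * y * (1 - y)
dw_jk   = eta * delta_k * a_j               # Delta w_jk = eta * delta_k * a_j

print("delta_k =", round(delta_k, 5))       # 0.27 * 0.73 * 0.27 ≈ 0.05322
print("dw_jk   =", round(dw_jk, 5))         # ≈ 0.01118
```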
19
Learning Rule in the Hidden layer
• Now we have to determine the appropriate
weight change for an input to hidden weight.
• This is more complicated because it depends on
the error at all of the nodes this weighted
connection can lead to.
• The mathematical proof is out our scope.
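For reference, the standard backpropagation result for a sigmoid hidden unit j, stated here without proof and written in the same index convention (w_ij connects unit i to unit j) used above, is:

```latex
\delta_j \;=\; a_j\,(1 - a_j)\,\sum_{k}\delta_k\, w^{(2)}_{jk},
\qquad
\Delta w^{(1)}_{ij} \;=\; \eta\,\delta_j\, x_i
```

That is, a hidden unit's error signal is the weighted sum of the error signals of all the units it feeds, scaled by its own sigmoid derivative.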
20
Gradient Descent (Notes)
Note 1:
• The neuron activation function (f) must be a defined and differentiable function.
Note 2:
• The previous calculation has to be repeated for each weight and for each bias in the ANN.
• So, we need big computational power (what about deeper networks?).
Note 3:
• Calculating Δw for the hidden layers will be more difficult (Why?).
21
Gradient Descent (Notes)
• Δw represents the change in the value of w needed to get a better output.
• The equation for Δw depends on the choice of error (cost) function and activation function.
• The Gradient Descent algorithm helps in calculating the new values of the weights and biases.
• Question: is one iteration (one trial) enough to get the best values for the weights and biases?
• Answer: No, we need an extended version:
Backpropagation
22
How Does Backpropagation Work?
[Figure: forward propagation passes the inputs through layer 0 → layer 1 → layer 2 to produce y; back propagation sends the output error back through the same weights w^(1) and w^(2) to compute the weight updates]
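A minimal sketch of one forward and one backward pass for a small 2-3-1 network with SSE and sigmoid activations; the layer sizes, input pattern, and target are assumptions for illustration, and the update rules are the ones stated on the previous slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)   # input -> hidden  (w_ij^(1))
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)   # hidden -> output (w_jk^(2))
eta = 0.5

x = np.array([1.0, 0.0])    # input pattern (illustrative)
D = np.array([1.0])         # desired output (illustrative)

# Forward propagation
a1 = sigmoid(x @ W1 + b1)
y  = sigmoid(a1 @ W2 + b2)

# Back propagation of the error
delta2 = (D - y) * y * (1 - y)              # output-layer deltas
delta1 = a1 * (1 - a1) * (W2 @ delta2)      # hidden-layer deltas

# Weight and bias updates (Delta rule)
W2 += eta * np.outer(a1, delta2); b2 += eta * delta2
W1 += eta * np.outer(x, delta1);  b1 += eta * delta1

print("output before update:", y)
```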
23
Online Learning vs. Offline Learning
• Online: pattern-by-pattern learning
  • Error calculated for each pattern
  • Weights updated after each individual pattern
• Offline: batch learning
  • Error calculated for all patterns
  • Weights updated once at the end of each epoch
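A sketch of the difference in code; the one-weight model, data, and gradient function are placeholders assumed for illustration. Only the position of the update step differs between the two schedules.

```python
# Placeholder pieces, assumed for illustration.
patterns = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)]   # (input, desired) pairs
w, eta = 0.0, 0.1

def grad(w, x, d):
    # dE/dw for a single pattern with E = 0.5 * (d - w*x)^2
    return -(d - w * x) * x

# Online (pattern-by-pattern): update after every pattern.
for epoch in range(10):
    for x, d in patterns:
        w += -eta * grad(w, x, d)

# Offline (batch): accumulate over all patterns, update once per epoch.
w_batch = 0.0
for epoch in range(10):
    total_grad = sum(grad(w_batch, x, d) for x, d in patterns)
    w_batch += -eta * total_grad

print("online w:", round(w, 3), " batch w:", round(w_batch, 3))  # both near 2.0
```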
24
Choosing Appropriate Activation and Cost
Functions
• From our earlier consideration of single-layer networks, we already know what output
activation and cost functions should be used for particular problem types.
• We have also seen that non-linear hidden unit activations are needed,
such as sigmoids.
• So we can summarize the required network properties:
• Regression/ Function Approximation Problems
• SSE cost function, linear output activations, sigmoid hidden activations
• Classification Problems (2 classes, 1 output)
• Cross-Entropy (CE) cost function, sigmoid output and hidden activations
• Classification Problems (multiple-classes, 1 output per class)
• CE cost function, softmax outputs, sigmoid hidden activations
• In each case, application of the gradient descent learning algorithm (by
computing the partial derivatives) leads to appropriate back-propagation
weight update equations.
25
Overall picture: the learning process in an ANN
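Putting the pieces together, a sketch of the whole learning process: forward pass, error, backpropagation, weight update, repeated over many epochs. The XOR data, layer sizes, and hyperparameters are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative task: XOR with a 2-3-1 network, SSE error, sigmoid activations.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(42)
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)
eta = 0.5

for epoch in range(20000):               # repeat until the error becomes small
    for x, d in zip(X, D):                # online (pattern-by-pattern) learning
        # 1) forward propagation
        a1 = sigmoid(x @ W1 + b1)
        y  = sigmoid(a1 @ W2 + b2)
        # 2) error signals (backpropagation)
        delta2 = (d - y) * y * (1 - y)
        delta1 = a1 * (1 - a1) * (W2 @ delta2)
        # 3) weight and bias updates (gradient descent)
        W2 += eta * np.outer(a1, delta2); b2 += eta * delta2
        W1 += eta * np.outer(x, delta1);  b1 += eta * delta1

# Outputs should approach the XOR targets; with an unlucky initialization,
# training can stall in a local minimum (see the Local Minima slide).
for x in X:
    print(x, np.round(sigmoid(sigmoid(x @ W1 + b1) @ W2 + b2), 2))
```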
26
Neural network simulator
• Search the internet to find a simulator and report on it.
For example:
• https://www.mladdict.com/neural-network-simulator
• http://playground.tensorflow.org/
27