Multilayer Perceptron and Neural Networks
Abstract: The attempts to solve linearly inseparable problems have led to different variations in the number of layers of neurons and in the activation functions used. The backpropagation algorithm is the best-known and most widely used supervised learning algorithm. Also called the generalized delta algorithm, because it extends the training method of the Adaline network, it is based on minimizing the difference between the desired output and the actual output through the gradient descent method (the gradient tells us how a function varies in different directions). Training a multilayer perceptron is often quite slow, requiring thousands or tens of thousands of epochs for complex problems. The best known methods to accelerate learning are the momentum method and the application of a variable learning rate. The paper presents the possibility of controlling an induction motor drive using neural systems.
activation functions are linear, because a linear function of linear functions is also a linear function.

f(s) = (1 − e^(−a·s)) / (1 + e^(−a·s)). (2)

It may be noted that the sigmoid functions act approximately linearly for small absolute values of the argument and saturate for high absolute values of the argument, somewhat taking over the role of a threshold. It has been shown [4] that a network (possibly infinite) with one hidden layer is able to approximate any continuous function.

Fig. 6: Example of a "feed forward" network. Each circle represents a unit of the type shown in Figure 6. Each connection between units has a weight. Each unit also has a bias input, not shown.
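For illustration (our own sketch, not part of the paper), the unipolar sigmoid referenced later as equation (1) and the bipolar sigmoid of equation (2) can be evaluated numerically to see the near-linear behaviour around the origin and the saturation for large arguments; the function names are ours.

```python
import math

def unipolar_sigmoid(s, a=1.0):
    # single-pole (logistic) sigmoid, the standard form of equation (1)
    return 1.0 / (1.0 + math.exp(-a * s))

def bipolar_sigmoid(s, a=1.0):
    # bipolar sigmoid, equation (2): f(s) = (1 - e^(-a*s)) / (1 + e^(-a*s))
    return (1.0 - math.exp(-a * s)) / (1.0 + math.exp(-a * s))

# near the origin the bipolar sigmoid is approximately linear (slope a/2)
print(bipolar_sigmoid(0.01))   # close to 0.005 for a = 1
# for large |s| it saturates, acting like a threshold
print(bipolar_sigmoid(10.0))   # close to +1
print(bipolar_sigmoid(-10.0))  # close to -1
```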
For some types of applications, recurrent networks (i.e. not "feed forward"), in which some interconnections form loops, are also used. We have seen in Figure 6 an example of a feed forward network. As mentioned, the interconnections between the units of this type of network do not form loops, so the network is called feed forward. Networks in which there are one or more loops of interconnections, as represented in Figure 7.a, are called recurrent.

In feed forward networks, units are usually arranged in levels (layers), as in Figure 7.b, but other topologies can also be used. Figure 7.c shows a type of network that is useful in some applications, in which direct links between input and output units are used. Figure 7.d shows a network with 3 units which is fully connected, i.e. all the interconnections allowed by the feed forward restriction are present.

Fig. 7: Common types of networks: a) a recurrent network; b) a stratified network; c) a network with links between units of input and output; d) a fully connected feed forward network.

2 The backpropagation algorithm
Learning in networks is typically achieved in a supervised manner. It can be assumed that a learning environment is available, containing both the learning patterns and the desired output patterns corresponding to the inputs (known as "target patterns"). As we will see, learning is typically based on the minimization of the measured error between the network outputs and the desired outputs. This implies a backward propagation through a network similar to the one being trained. For this reason, the learning algorithm is called back-propagation. The method was first proposed by [2], but at that time it was virtually ignored, because it required a volume of computation too large for that period. It was then rediscovered by [20], but only in the mid-'80s was it established by Williams [18] as a generally accepted tool for training the multilayer perceptron. The idea is to find the minimum of the error function e(w) with respect to the connection weights. The algorithm for a multilayer perceptron with one hidden layer is the following [8]:

Step 1: Initialization. All network weights and thresholds are initialized with random values, distributed evenly in a small range, for example (−2.4/Fi, 2.4/Fi), where Fi is the total number of inputs of neuron i [6]. If these values are 0, the gradients calculated during training will also be 0 (if there is no direct link between input and output) and the network will not learn. Several training attempts, with different initial weights, are indicated in order to find the best value of the cost function (minimum error). Conversely, if the initial values are large, they tend to saturate the units. In this case, the derivative of the sigmoid function is very small. It acts as a multiplying factor during the learning process and thus the saturated units will be nearly blocked, which makes learning very slow.

Step 2: A new training epoch. An epoch means the presentation of all the examples in the training set. In most cases, training the network requires many training epochs. To maintain mathematical rigor, the weights will be adjusted only after all the training vectors have been applied to the network. Therefore, the gradients of the weights must be memorized and updated after each pattern in the training set, and at the end of a training epoch the weights are changed only once (there is also a simpler, "on-line" variant, in which the weights are updated directly; in this case, the order in which the vectors are presented to the network might matter). All the gradients of the weights and the current error are initialized with 0 (Δwij = 0 and E = 0).

Step 3: The forward propagation of the signal.
3.1 An example from the training set is applied to the inputs.
3.2 The outputs of the neurons from the hidden layer are calculated:

yj(p) = f( Σi=1..n xi(p)·wij − θj ), (3)

where n is the number of inputs of neuron j from the hidden layer, and f is the sigmoid activation function.
3.3 The real outputs of the network are calculated, and the current error is accumulated:

E = E + (ek(p))² / 2. (5)

Step 4: The backward propagation of the errors and the adjustment of the weights.
4.1 The error gradients for the neurons in the output layer are calculated:

δk(p) = f′ · ek(p), (6)

where f′ is the derivative of the activation function and the error is ek(p) = yd,k(p) − yk(p).
If we use the single-pole sigmoid (equation 1), its derivative is:

f′(x) = e^(−x) / (1 + e^(−x))² = f(x)·(1 − f(x)). (7)

If we use the bipolar sigmoid (equation 2), its derivative is:

f′(x) = 2a·e^(−a·x) / (1 + e^(−a·x))² = (a/2)·(1 − f(x))·(1 + f(x)). (8)

For the single-pole sigmoid, equation (6) becomes:

δk(p) = yk(p)·(1 − yk(p))·ek(p). (9)

4.2 The gradients of the weights between the hidden layer and the output layer are updated:

Δwjk(p) = Δwjk(p) + yj(p)·δk(p). (10)

4.3 The error gradients for the neurons in the hidden layer are calculated:

δj(p) = yj(p)·(1 − yj(p))·Σk=1..l δk(p)·wjk(p), (11)

where l is the number of outputs of the network.
4.4 The gradients of the weights between the input layer and the hidden layer are updated:

Δwij(p) = Δwij(p) + xi(p)·δj(p). (12)

At the end of the epoch, the weights are corrected using the accumulated gradients, where η is the learning rate. If an epoch is completed, we test whether the termination criterion is fulfilled (E < Emax, or a maximum number of training epochs has been reached). If not, we pass to step 2. If yes, the algorithm ends.

Example: The MATLAB program [11] allows the generation of the logical OR function, which means that the perceptron separates the class of 0 from the class of 1. In the Matlab workspace we obtain:
epoch: 1 SSE: 3
epoch: 2 SSE: 1
epoch: 3 SSE: 1
epoch: 4 SSE: 0
Test on the input [0 1]: s = 1
After the fourth iteration, the perceptron separates the two classes (0 and 1) by a line. The perceptron was tested in the presence of the input vector [0; 1].
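The batch ("off-line") variant of Steps 1–4 can be sketched in NumPy as follows. This is our illustrative sketch, not the paper's MATLAB program [11]: the OR training data, the random seed, and the variable names are ours, and the thresholds θ are represented as biases (b = −θ).

```python
import numpy as np

def f(s):
    # single-pole (logistic) sigmoid, equation (1)
    return 1.0 / (1.0 + np.exp(-s))

# training set for the logical OR function (illustrative data)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [1.]])

rng = np.random.default_rng(1)
# Step 1: initialization in (-2.4/Fi, 2.4/Fi), Fi = fan-in of the neuron (here 2)
W1 = rng.uniform(-1.2, 1.2, (2, 2))   # input -> hidden weights
b1 = rng.uniform(-1.2, 1.2, (1, 2))   # hidden biases (play the role of -theta)
W2 = rng.uniform(-1.2, 1.2, (2, 1))   # hidden -> output weights
b2 = rng.uniform(-1.2, 1.2, (1, 1))

eta, E_max = 0.2, 0.001
for epoch in range(20000):
    # Step 2: gradients and current error start each epoch at 0
    dW1 = np.zeros_like(W1); db1 = np.zeros_like(b1)
    dW2 = np.zeros_like(W2); db2 = np.zeros_like(b2)
    E = 0.0
    for x, t in zip(X, T):                # Steps 3-4 for every pattern p
        x, t = x[None, :], t[None, :]
        yh = f(x @ W1 + b1)               # eq (3): hidden outputs
        y = f(yh @ W2 + b2)               # real outputs of the network
        e = t - y
        E += (e.item() ** 2) / 2.0        # eq (5): accumulate the error
        dk = y * (1 - y) * e              # eq (9): output error gradients
        dW2 += yh.T @ dk; db2 += dk       # eq (10)
        dj = yh * (1 - yh) * (dk @ W2.T)  # eq (11): hidden error gradients
        dW1 += x.T @ dj; db1 += dj        # eq (12)
    # weights changed once per epoch, with learning rate eta
    W1 += eta * dW1; b1 += eta * db1
    W2 += eta * dW2; b2 += eta * db2
    if E < E_max:                         # termination criterion
        break
```

After training, `f(f(X @ W1 + b1) @ W2 + b2)` rounds to the OR truth table.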
The perceptron realizes the logical OR function, for which the classes are linearly separable; this is one of the conditions for the perceptron. If the previous program is run for the exclusive OR function, we will observe that there is no line that allows the separation of the two classes (0 and 1).

3 Methods to accelerate the learning
The momentum method [18] proposes adding a term to the weight adjustment. This term is proportional to the last modification of the weight, i.e. the values with which the weights are adjusted are stored and they directly influence all further adjustments:

Δwij(p) = Δwij(p) + α·Δwij(p − 1). (14)

The new term is added after the update of the weight gradients from equations 10 and 12.
The method of the variable learning rate [19] consists in using an individual learning rate for each weight and adapting these parameters at each iteration, depending on the successive signs of the gradients [9]:

ηij(p) = u·ηij(p − 1), if sgn(Δwij(p)) = sgn(Δwij(p − 1));
ηij(p) = d·ηij(p − 1), if sgn(Δwij(p)) = −sgn(Δwij(p − 1)). (15)

If during training the error starts to increase, rather than decrease, the learning rates are reset to their initial values and the process continues.

4 Practical considerations of working with multilayer perceptrons
For relatively simple problems, a learning rate of η = 0.7 is acceptable, but in general it is recommended that the learning rate be around 0.2. For acceleration through the momentum method, a satisfactory value for α is 0.9. If the learning rate is variable, typical values that work well in most situations are u = 1.2 and d = 0.8.

Choosing the activation function for the output layer of the network depends on the nature of the problem to be solved. For the hidden layers of neurons, sigmoid functions are preferred, because they have the advantage of both non-linearity and differentiability (a prerequisite for applying the backpropagation algorithm). The biggest influence of a sigmoid on the performance of the algorithm seems to be its symmetry about the origin [1]. The bipolar sigmoid is symmetrical about the origin, while the unipolar sigmoid is symmetrical about the point (0, 0.5), which decreases the speed of convergence. For the output neurons, activation functions adapted to the distribution of the output data are recommended. Therefore, for binary classification problems (0/1), the single-pole sigmoid is appropriate. For a classification with n classes, each corresponding to a binary output of the network (for example, an application of optical character recognition), the softmax extension of the single-pole sigmoid may be used:

y′k = e^(yk) / Σi=1..n e^(yi). (16)

For continuous values, we can apply a pre-processing and a post-processing of the data, so that the network operates with scaled values, for example in the range [−0.9, 0.9] for the hyperbolic tangent. Also, for continuous values, the activation function of the output neurons may be linear, especially if there are no known limits for the range in which these values can be found.

In a local minimum, the gradients of the error become 0 and the learning no longer continues. A solution is multiple independent trials, with weights initialized differently at the beginning, which raises the probability of finding the global minimum. For large problems this can be hard to achieve, and local minima may be accepted, on the condition that the errors are small enough. Also, different configurations of the network might be tried, with a larger number of neurons in the hidden layer or with more hidden layers, which in general lead to smaller local minima. Still, although local minima are indeed a problem, in practice they are not unsolvable.

An important issue is the choice of the best configuration for the network in terms of the number of neurons in the hidden layers. In most situations, a single hidden layer is sufficient. There are no precise rules for choosing the number of neurons. In general, the network can be seen as a system in which the number of training vectors multiplied by the number of outputs is the number of equations, and the number of weights represents the number of unknowns. The equations are generally nonlinear and very complex, so it is very difficult to solve them exactly through conventional means. The training algorithm aims precisely to find approximate solutions that minimize the errors. If the network approximates the training set well, this is not a guarantee that it will find equally good solutions for the data in another set, the testing set. Generalization implies the existence of regularities in the data, of a model that can be learned. In analogy with classical linear systems, this would mean some redundant equations. Thus, if the number of weights is less than the number of training vectors, for a correct approximation the network must rely on intrinsic patterns of the data, patterns which are to be found in the test data as well. A heuristic rule states that the number of weights should be around or below one tenth of the product of the number of training vectors and the number of outputs. In some situations however (e.g., if the training data are relatively few), the number of weights can be even half of this product. For a multilayer perceptron, it is considered that the number of neurons in a layer must be sufficiently large so that this layer provides three or more edges for each convex region identified by the next layer [5]. So the number of neurons in a layer should be more than three times that of the next layer. As mentioned before, an insufficient number of weights leads to under-fitting, while too many weights lead to over-fitting, phenomena presented in Figure 9.

Example: We associate an input vector X = [1 −0.5] and a target vector T = [0.5 1], of sizes imposed by two restrictions that can be reduced to the two degrees of freedom (the weight W and the bias B) of a single Adaline neuron [9]. We suggest solving the linear system of 2 equations with 2 unknowns [12]:

w + b = 0.5, −0.5·w + b = 1, (17)

obtaining in the end the solutions w = −1/3 and b = 5/6. The Matlab program offers the solutions obtained with the help of the Adaline neuron, either by points or by slopes [3], [7], [10], [21].
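Equations (14)–(16) from the two sections above can be sketched as follows, using the recommended constants α = 0.9, u = 1.2, d = 0.8; the function and variable names, and the toy gradient values, are our own illustration, not from the paper.

```python
import numpy as np

alpha, u, d = 0.9, 1.2, 0.8   # momentum and rate-adaptation constants

def momentum_step(delta_w, prev_delta_w):
    # equation (14): add a term proportional to the previous weight change
    return delta_w + alpha * prev_delta_w

def adapt_rates(eta, delta_w, prev_delta_w):
    # equation (15): per-weight learning rates grow by u while the gradient
    # keeps its sign, and shrink by d when the sign flips
    same_sign = np.sign(delta_w) == np.sign(prev_delta_w)
    return np.where(same_sign, u * eta, d * eta)

def softmax(y):
    # equation (16): softmax extension of the single-pole sigmoid
    ey = np.exp(y - y.max())   # shifted for numerical stability
    return ey / ey.sum()       # components sum to 1

# toy usage with illustrative gradient values
eta = np.full(3, 0.2)
prev = np.array([0.1, -0.2, 0.05])
curr = np.array([0.3, 0.1, 0.02])
print(momentum_step(curr, prev))     # [0.39, -0.08, 0.065]
print(adapt_rates(eta, curr, prev))  # [0.24, 0.16, 0.24]
```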
… the speed of the learning procedure.

Output units and target values. Most practical applications of multilayer perceptrons can be clearly divided into two different classes. In one class, the target outputs have a continuous range of values, and the network has to perform an operation of non-linear regression. Normally, in this case it is not convenient to put a non-linearity at the output of the network. In fact, we normally want outputs that are able to cover the entire range of possible target values, which is often wider than the range of the sigmoid. One could, of course, scale the output amplitudes of the sigmoid, but this rarely brings any advantage relative to the simple use of units without non-linearity at the output. Such output units are said to be linear: they simply output the weighted sum of their inputs plus their bias term.

In another class, which includes mainly applications of classification and pattern recognition, the target outputs are binary, i.e., they take only 2 values. In this case it is usual to use output units with a sigmoid non-linearity, similar to the other units in the network. The binary target values that are most appropriate depend on the sigmoid used. Often the target values are chosen to be equal to the 2 asymptote values of the sigmoid (0 and 1 for the logistic function and ±1 for tanh and scaled arctan). In this case, in order to bring the error to 0, the output units would need to reach complete saturation, i.e. the sum of their inputs should become infinite. This would tend to drive the weights of these units to increase indefinitely in absolute value and would slow the learning process. To improve the speed of learning, target values which are close, but not equal, to the asymptotes of the sigmoid are therefore usually used (e.g. 0.05 and 0.95 for the logistic function and ±0.9 for tanh and scaled arctan).

Weight initialization. Before the back-propagation algorithm can be started, it is necessary to set the weights of the network to some initial values. A natural choice would be to initialize all of them with the value 0, so as not to bias the learning outcome in a particular direction. However, it can easily be seen, by applying the back propagation rule, that if the initial weights are all 0 the gradient is 0 (except for the weights of direct links between input and output units, if such links exist in the network). Furthermore, the gradient components will always remain 0 during learning, even if there are such direct links. Therefore, it is normally necessary to initialize the weights with values different from 0. The most common procedure is to initialize them with random values drawn from a uniform distribution on a symmetric interval [−a, a]. As mentioned above, several learning runs with independent random initializations can be used to find the best minimum of the cost function. It is understandable that large weights (resulting from high values of a) will tend to saturate the units. At saturation, the derivative of the sigmoid non-linearity is very small. Since these derivatives act as multipliers in the back propagation, the derivatives with respect to the weights of the unit's inputs will be very small. The unit will be largely "locked", learning very slowly.

If the inputs of a unit in the network all have the same root mean square (rms) value, are all independent of each other, and the weights are initialized in a fixed interval, then the rms value of the unit's input sum will be proportional to fi^(1/2), where fi is the number of inputs of the unit (often called the fan-in of the unit). To keep the rms input sums similar to each other, and to avoid the saturation of units with high fan-in, the parameter a, controlling the size of the initialization range, is sometimes varied from one unit to another, taking a = k/fi^(1/2). There are various options for the choice of k. Some prefer to initialize the weights very close to zero, choosing a very small k (e.g. 0.01 to 0.1), thus keeping the units in their central, linear region at the beginning of the learning process. Others prefer high values of k (e.g. 1 or higher), driving their units into the non-linear region even at the beginning of the learning process.

Decorrelation and normalization of inputs. Let us consider the simplest network one can design, consisting of a single linear unit. Networks with a single linear unit (adalines) have been used for a long time in the area of discrete-time signal processing. Finite impulse response (FIR) filters can be seen as single linear units without a bias. The inputs are consecutive samples of the input signal and the filter coefficients are the weights. Therefore, adaptive filtering with FIR filters is an essential form of real-time learning in linear networks. It is thus no surprise that the first adaptive filtering algorithms were derived from the delta rule [14]. It is well known in adaptive filter theory that learning is fastest, because the error surface is well-conditioned (has no narrow valleys), if the inputs are uncorrelated with each other, which means that <xi·xj> = 0 for i ≠ j, and have equal mean squares, <xi²> = <xj²> for all i, j. Here <.> denotes the expected value (often, when training perceptrons, the expected value can be estimated simply by averaging over the learning set). If a bias is also used in the units, it acts as a further input which is always equal to 1. This means that its mean square is 1, and therefore the mean squares of the other inputs should all be equal to 1. On the other hand, the cross-correlations of the other inputs with this new input are simply the expected values of those inputs, which should be equal to 0, like all the cross-correlations between inputs:

<xi> = <xj> = 0. (18)
7 Conclusion
Multilayer perceptrons are the most commonly used types of neural networks. Using the backpropagation algorithm for training, they can be used for a wide range of applications, from functional approximation to prediction in various fields, such as estimating the load of a computing system or modelling the evolution of the chemical reactions of polymerization, described by complex systems of differential equations. In implementing the algorithm, there are a number of practical problems, mostly related to the choice of the parameters and of the network configuration. First, a small learning rate leads to a slow convergence of the algorithm, while a too high rate may cause failure (the algorithm will "jump" over the solution). Another problem characteristic of this training method is given by local minima. A neural network must be capable of generalization.

The advantage of the fuzzy logic controller disappears when it is compared to an anti-windup PI controller, knowing that the latter works in a linear regime. On the other hand, an anti-windup PI controller does not raise any problems when the output variable reaches the saturation value, since the signal corresponding to the difference between the limited and the unlimited output is fed back to the controller for desaturation. For the same control surface, the advantage of using a neural controller consists in the decrease of the computation time compared to that spent when a fuzzy controller with a larger number of linguistic labels is used.

Fig. 12: Control surface approximation of the fuzzy controller by a neural network: a) normalized coordinates; b) actual values.
Fig. 13: On-load start-up of the drive, using a fuzzy controller (f) and a neural controller of the perceptron type (n) with 4 layers: a) speed shape; b) stator current shape.