
Adapting Resilient Propagation for Deep Learning

Alan Mosca and George D. Magoulas
Department of Computer Science and Information Systems
Birkbeck, University of London
Malet Street, London WC1E 7HX - United Kingdom
Email: a.mosca@dcs.bbk.ac.uk, gmagoulas@dcs.bbk.ac.uk
arXiv:1509.04612v2 [cs.NE] 16 Sep 2015

Abstract—The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
I. INTRODUCTION

Deep Learning techniques have generated many of the state-of-the-art models [1], [2], [3] that reached impressive results on benchmark datasets like MNIST [4]. Such models are usually trained with variations of the standard Backpropagation method, with stochastic gradient descent (SGD). In the field of shallow neural networks, there have been several developments in training algorithms that have sped up convergence [5], [6]. This paper aims to bridge the gap between the field of Deep Learning and these advanced training methods, by combining Resilient Propagation (Rprop) [5], Dropout [7] and Deep Neural Network Ensembles.
A. Rprop

The Resilient Propagation [5] weight update rule was initially introduced as a possible solution to the "vanishing gradients" problem: as the depth and complexity of an artificial neural network increase, the gradient propagated backwards by standard SGD backpropagation becomes increasingly smaller, leading to negligible weight updates, which slow down training considerably. Rprop solves this problem by using a fixed update value ∆ij, which is increased or decreased multiplicatively at each iteration by the asymmetric factors η+ and η− respectively, depending on whether the gradient with respect to wij has changed sign between two iterations or not. This "backtracking" allows Rprop to still converge to a local minimum, while the acceleration provided by the multiplicative factor η+ helps it skip over flat regions much more quickly. To avoid double punishment in the backtracking phase, Rprop artificially forces the gradient product to be 0, so that the following iteration is skipped. An illustration of Rprop can be found in Algorithm 1.

Algorithm 1 Rprop
 1: η+ = 1.2, η− = 0.5, ∆max = 50, ∆min = 10^−6
 2: pick ∆ij(0)
 3: ∆wij(0) = −sgn(∂E(0)/∂wij) · ∆ij(0)
 4: for all t ∈ [1..T] do
 5:   if ∂E(t)/∂wij · ∂E(t−1)/∂wij > 0 then
 6:     ∆ij(t) = min{∆ij(t−1) · η+, ∆max}
 7:     ∆wij(t) = −sgn(∂E(t)/∂wij) · ∆ij(t)
 8:     wij(t+1) = wij(t) + ∆wij(t)
 9:     ∂E(t−1)/∂wij = ∂E(t)/∂wij
10:   else if ∂E(t)/∂wij · ∂E(t−1)/∂wij < 0 then
11:     ∆ij(t) = max{∆ij(t−1) · η−, ∆min}
12:     ∂E(t−1)/∂wij = 0
13:   else
14:     ∆wij(t) = −sgn(∂E(t)/∂wij) · ∆ij(t)
15:     wij(t+1) = wij(t) + ∆wij(t)
16:     ∂E(t−1)/∂wij = ∂E(t)/∂wij
17:   end if
18: end for
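As a concrete illustration, the per-weight rule of Algorithm 1 can be written as the following vectorised NumPy sketch. This is a minimal sketch only: the function name rprop_update and the stateless calling convention are our own, and the gradient computation is assumed to be supplied by the caller.

import numpy as np

ETA_PLUS, ETA_MINUS = 1.2, 0.5
STEP_MAX, STEP_MIN = 50.0, 1e-6

def rprop_update(w, grad, prev_grad, step):
    """One Rprop iteration (Algorithm 1) applied element-wise to a weight array w.
    grad, prev_grad and step have the same shape as w."""
    product = grad * prev_grad
    grew = product > 0                                  # same sign: accelerate
    shrank = product < 0                                # sign change: backtrack
    step = np.where(grew, np.minimum(step * ETA_PLUS, STEP_MAX), step)
    step = np.where(shrank, np.maximum(step * ETA_MINUS, STEP_MIN), step)
    dw = np.where(shrank, 0.0, -np.sign(grad) * step)   # no weight change when backtracking
    new_prev = np.where(shrank, 0.0, grad)              # force the next gradient product to 0
    return w + dw, new_prev, step

The caller carries step and prev_grad between iterations, exactly as ∆ij(t−1) and ∂E(t−1)/∂wij are carried over in Algorithm 1.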
B. Dropout

Dropout [7] is a regularisation method by which only a random selection of nodes in the network is updated during each training iteration, while at the final evaluation stage the whole network is used. The selection is performed by sampling a dropout mask Dm from a Bernoulli distribution with P(muted_i) = Dr, where P(muted_i) is the probability of node i being muted during the weight update step of backpropagation, and Dr is the dropout rate, which is usually 0.5 for the middle layers, 0.2 or 0 for the input layers, and 0 for the output layer. For convenience this dropout mask is represented as a binary weight matrix D ∈ {0, 1}^(M×N), covering all the weights in the network, which can be used to multiply the weight-space of the network to obtain what is called a thinned network for the current training iteration, where each weight wij is zeroed out based on the probability of its parent node i being muted.
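As an illustration of the mask described above, the following sketch samples a node-level Bernoulli mask and expands it to the weight-level matrix D. The function name is our own, and it assumes the weights of a layer are stored as an M×N matrix with wij in row i, column j.

import numpy as np

def sample_weight_mask(m_in, n_out, drop_rate, rng=np.random):
    """Sample a node-level Bernoulli mask for the m_in source nodes and expand it
    to a binary weight matrix D in {0,1}^(M x N): row i is zeroed with
    probability drop_rate, muting every weight w_ij leaving node i."""
    node_kept = rng.binomial(1, 1.0 - drop_rate, size=m_in)   # 1 = node i stays active
    return np.repeat(node_kept[:, None], n_out, axis=1)

# Thinned weights for the current training iteration, as described above:
# W_thin = W * sample_weight_mask(*W.shape, drop_rate=0.5)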
The remainder of this paper is structured as follows:
• In section II we explain why using Dropout causes an incompatibility with Rprop, and propose a modification to solve the issue.
• In section III we show experimental results using the MNIST dataset, first to highlight how Rprop is able to converge much more quickly during the initial epochs, and then use this to speed up the training of a Stacked Ensemble.
• Finally, in section IV we look at how this work can be extended with further evaluation and development.
II. RPROP AND DROPOUT

In this section we explain the zero-gradient problem, and propose a solution by adapting the Rprop algorithm to be aware of Dropout.

A. The zero-gradient problem

In order to avoid double punishment when there is a change of sign in the gradient, Rprop artificially sets the gradient product associated with weight ij for the next iteration to ∂E(t)/∂wij · ∂E(t+1)/∂wij = 0. This condition is checked during the following iteration, and if it holds no updates to the weight wij or the learning rate ∆ij are performed.

Using the zero-valued gradient product as an indication to skip an iteration is acceptable in normal gradient descent, because the only other occurrence of this would be when learning has terminated. When Dropout is introduced, a number of additional events can produce these zero values:
• When neuron i is skipped, the dropout mask for all weights wij going to the layer above has a value of 0.
• When neuron j in the layer above is skipped, the gradient propagated back to all the weights wij is also 0.
These additional zero-gradient events force additional skipped training iterations and missed learning-rate adaptations that slow down training unnecessarily.
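To make the effect concrete, the small sketch below (our own illustration, not from the paper) shows how muting one source node and one node in the layer above zeroes the gradient product for those weights, which standard Rprop cannot distinguish from its deliberate skip signal.

import numpy as np

rng = np.random.default_rng(0)
grad_t = rng.standard_normal((4, 3))         # current gradients for a 4x3 weight matrix
grad_prev = rng.standard_normal((4, 3))      # gradients stored from the previous iteration

mask = np.ones((4, 3))
mask[1, :] = 0                               # source node i = 1 is muted this iteration
mask[:, 2] = 0                               # node j = 2 in the layer above is muted

masked_grad = grad_t * mask                  # what backpropagation delivers under Dropout
product = masked_grad * grad_prev            # the quantity Rprop's sign test looks at
skipped = product == 0                       # standard Rprop reads these as its skip signal
print(skipped.sum(), "of", skipped.size, "weights skipped")   # prints: 6 of 12 weights skipped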
B. Adaptations to Rprop

By making Rprop aware of the dropout mask Dm, we are able to distinguish whether a zero-gradient event occurs as a signal to skip the next weight update, or whether it occurs for a different reason and therefore the w and ∆ updates should be allowed. The new version of the Rprop update rule for each weight ij is shown in Algorithm 2. We use t to indicate the current training example, t−1 the previous training example, t+1 the next training example, and where a value with (0) appears, it is intended to be the initial value. All other notation is the same as used in the original Rprop:
• E(t) is the error function (in this case the negative log likelihood),
• ∆ij(t) is the current update value for the weight at index ij,
• ∆wij(t) is the current weight update value for index ij.
In particular, the conditions at line 5 and line 18 provide the necessary protection from the additional zero gradients, and correctly implement the recipe prescribed by Dropout by completely skipping every weight for which Dmij = 0 (which means that neuron j was dropped out and therefore the gradient will necessarily be 0). We expect that this methodology can be extended to other variants of Rprop, such as, but not limited to, iRprop+ [8] and JRprop [6].

Algorithm 2 Rprop adapted for Dropout
 1: η+ = 1.2, η− = 0.5, ∆max = 50, ∆min = 10^−6
 2: pick ∆ij(0)
 3: ∆wij(0) = −sgn(∂E(0)/∂wij) · ∆ij(0)
 4: for all t ∈ [1..T] do
 5:   if Dmij = 0 then
 6:     ∆ij(t) = ∆ij(t−1)
 7:     ∆wij(t) = 0
 8:   else
 9:     if ∂E(t)/∂wij · ∂E(t−1)/∂wij > 0 then
10:       ∆ij(t) = min{∆ij(t−1) · η+, ∆max}
11:       ∆wij(t) = −sgn(∂E(t)/∂wij) · ∆ij(t)
12:       wij(t+1) = wij(t) + ∆wij(t)
13:       ∂E(t−1)/∂wij = ∂E(t)/∂wij
14:     else if ∂E(t)/∂wij · ∂E(t−1)/∂wij < 0 then
15:       ∆ij(t) = max{∆ij(t−1) · η−, ∆min}
16:       ∂E(t−1)/∂wij = 0
17:     else
18:       if ∂E(t−1)/∂wij = 0 then
19:         ∆wij(t) = −sgn(∂E(t)/∂wij) · ∆ij(t)
20:         wij(t+1) = wij(t) + ∆wij(t)
21:       else
22:         ∆ij(t) = ∆ij(t−1)
23:         ∆wij(t) = 0
24:       end if
25:     end if
26:   end if
27: end for
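A vectorised sketch of Algorithm 2, in the same style as the earlier Rprop sketch, could look as follows. The function name and array layout are our own assumptions, not the authors' implementation.

import numpy as np

ETA_PLUS, ETA_MINUS = 1.2, 0.5
STEP_MAX, STEP_MIN = 50.0, 1e-6

def rprop_dropout_update(w, grad, prev_grad, step, mask):
    """One iteration of the dropout-aware rule of Algorithm 2.
    mask is the binary weight matrix Dm: 0 marks weights of muted nodes."""
    active = mask != 0
    product = grad * prev_grad
    grew = active & (product > 0)
    shrank = active & (product < 0)
    flat = active & (product == 0)
    # lines 5-7: muted weights keep their step, their stored gradient and their value
    step = np.where(grew, np.minimum(step * ETA_PLUS, STEP_MAX), step)
    step = np.where(shrank, np.maximum(step * ETA_MINUS, STEP_MIN), step)
    # lines 18-23: in the zero-product case, update only if the zero was forced by an
    # earlier backtracking step (prev_grad == 0); a zero caused by Dropout is ignored
    do_update = grew | (flat & (prev_grad == 0))
    dw = np.where(do_update, -np.sign(grad) * step, 0.0)
    new_prev = np.where(grew, grad, prev_grad)
    new_prev = np.where(shrank, 0.0, new_prev)
    return w + dw, new_prev, step

The two extra checks correspond to lines 5 and 18 of Algorithm 2; everything else is unchanged from the plain Rprop sketch.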
III. EVALUATING ON MNIST

In this section we describe an initial evaluation of performance on the MNIST dataset. For all experiments we use a Deep Neural Network (DNN) with five middle layers of 2500, 2000, 1500, 1000 and 500 neurons respectively, a dropout rate Dr_mid = 0.5 for the middle layers and no Dropout on the inputs. This dropout rate has been shown to be an optimal choice for the MNIST dataset in [9]. A similar architecture has been used to produce state-of-the-art results [3]; however, those authors used the entire training set for validation and graphical transformations of said set for training. These added transformations lead to a "virtually infinite" training set, whereby at every epoch a new training set is generated, while the original 60000 images serve as a much larger validation set. The test set remains the original 10000-image test set. An explanation of these transformations is provided in [10], which also confirms that:

  "The most important practice is getting a training set as large as possible: we expand the training set by adding a new form of distorted data"

We therefore attribute these big improvements to the transformations applied, and have not found it a primary goal to replicate these additional transformations to obtain the state-of-the-art results; instead we focused on utilising the untransformed dataset, using 50000 images for training, 10000 for validation and 10000 for testing. Subsequently, we performed a search using the validation set as an indicator to find the optimal hyperparameters of the modified version of Rprop. We found that the best results were reached with η+ = 0.01, η− = 0.1, ∆max = 5 and ∆min = 10^−3. We trained all models to the maximum of 2000 allowed epochs, and measured the error on the validation set at every epoch, so that it could be used to select the model to be applied to the test set. We also measured the time it took to reach the best validation error, and report its approximate magnitude, to be used as a comparison of orders of magnitude. The results presented are an average of 5 repeated runs, limited to a maximum of 2000 training epochs.
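For reference, the experimental configuration described above can be summarised in code roughly as follows. This is our own encoding of the setup, not the authors' code; the weight initialisation and the initial step value ∆ij(0) are assumptions, since the paper only says "pick ∆ij(0)".

import numpy as np

LAYER_SIZES = [784, 2500, 2000, 1500, 1000, 500, 10]   # MNIST inputs, five middle layers, 10 classes
DROPOUT_RATES = [0.0, 0.5, 0.5, 0.5, 0.5, 0.5]          # Dr of the source layer of each weight matrix

rng = np.random.default_rng(0)
weights = [rng.standard_normal((m, n)) * 0.01           # small random initialisation (our assumption)
           for m, n in zip(LAYER_SIZES[:-1], LAYER_SIZES[1:])]
steps = [np.full_like(w, 1e-3) for w in weights]        # initial Delta_ij(0); the value is our assumption
# Per-layer masks would be sampled each iteration with the helper sketched in section I-B,
# using the corresponding entry of DROPOUT_RATES.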
A. Compared to SGD

From the results in Table I we see that the modified version of Rprop is able to start up much more quickly and reaches an error value that is close to the minimum much sooner. SGD reaches a higher error value, and after a much longer time. Although the overall error improvement is significant, the speed gain from using Rprop is more appealing because it saves a large number of iterations that could be used for improving the model in different ways. Rprop obtains its best validation error after only 35 epochs, whilst SGD reached the minimum after 473. An illustration of the first 200 epochs can be seen in Figure 1.

[Fig. 1: Validation Error (%) vs Training Epoch, first 200 epochs - SGD vs Modified Rprop]

Method      Min Val Err   Epochs   Time      Test Err   1st Epoch
SGD         2.85%         1763     320 min   3.50%      88.65%
Rprop       3.03%         105      25 min    3.53%      12.81%
Mod Rprop   2.57%         35       10 min    3.49%      13.54%

TABLE I: Simulation results
B. Compared to unmodified Rprop

We can see from Figure 2 that the modified version of Rprop has a faster start-up than the unmodified version, and stays below it consistently until it reaches its minimum. Moreover, the unmodified version does not reach the same final error as the modified version, starts overtraining much sooner, and does not reach a better error than SGD. Table I shows in more detail how the performance of the two methods compares over the first 200 epochs.

[Fig. 2: Validation Error (%) vs Training Epoch, first 200 epochs - Unmodified vs Modified Rprop]
C. Using Modified Rprop to speed up training of Deep Learning Ensembles

The increase in speed of convergence can make it practical to produce Ensembles of Deep Neural Networks, as the time to train each member DNN is considerably reduced without undertraining the network. We have been able to train these Ensembles in less than 12 hours in total on a single-GPU, single-CPU desktop system (a Nvidia GTX-770 graphics card on a Core i5 processor, programmed with Theano in Python). We have trained different Ensemble types, and we report the final results in Table II. The methods used are Bagging [11] and Stacking [12], with 3 and 10 member DNNs. Each member was trained for a maximum of 50 epochs.
• Bagging is an ensemble method by which several different training sets are created by random resampling of the original training set, and each of these is used to train a new classifier. The entire set of trained classifiers is usually then aggregated by taking an average or a majority vote to reach a single classification decision.
• Stacking is an ensemble method by which the different classifiers are aggregated using an additional learning algorithm that takes the outputs of these first-space classifiers as its inputs and learns how to reach a better classification result. This additional learning algorithm is called a second-space classifier.
In the case of Stacking, the final second-space classifier was another DNN with two middle layers, of sizes 200N and 100N respectively, where N is the number of DNNs in the Ensemble, trained for a maximum of 200 epochs with the modified Rprop. We used the same original train, validation and test sets for this, and collected the average over 5 repeated runs. The results are still not comparable to what is presented in [3], which is consistent with the observations above about the importance of the dataset transformations; however, we note that we are able to improve the error in less time than it took to train a single network with SGD. A Wilcoxon signed-ranks test shows that the increase in performance obtained from using the ensembles of size 10 compared to the ensembles of size 3 is significant at the 98% confidence level.

Method     Size   Test Err   Time
Bagging    3      2.56%      35 min
Bagging    10     2.13%      128 min
Stacking   3      2.48%      39 min
Stacking   10     2.19%      145 min

TABLE II: Ensemble performance
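For illustration, the two aggregation schemes can be sketched as follows. The function names are our own, the paper does not give an implementation, and the second-space DNN itself is omitted.

import numpy as np

def bagging_predict(member_probs):
    """member_probs: array of shape (n_members, n_examples, n_classes).
    Average the members' class probabilities and take the arg max per example."""
    return member_probs.mean(axis=0).argmax(axis=1)

def stacking_features(member_probs):
    """Build the second-space classifier's input: one row per example containing
    the concatenated outputs of the N first-space DNNs."""
    n_members, n_examples, n_classes = member_probs.shape
    return member_probs.transpose(1, 0, 2).reshape(n_examples, n_members * n_classes)

# The second-space classifier used in the paper is itself a DNN with two middle
# layers of sizes 200N and 100N trained with the modified Rprop; any classifier
# accepting stacking_features(...) could stand in for it in this sketch.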

IV. CONCLUSIONS AND FUTURE WORK


We have highlighted that many training methods that have been used in shallow learning may be adapted for use in Deep Learning. We have looked at Rprop and how the appearance of zero gradients during training, as a side effect of Dropout, poses a challenge to learning, and we have proposed a solution which allows Rprop to train DNNs to a better error while still being much faster than standard SGD backpropagation.

We then showed that this increase in training speed can be used to effectively train an Ensemble of DNNs on a commodity desktop system, and to reap the added benefits of Ensemble methods in less time than it would take to train a Deep Neural Network with SGD.

It remains to be assessed in further work whether this improved methodology would lead to a new state-of-the-art error when applying the pre-training and dataset enhancements that have been used in other methods, and how the improvements to Rprop can be ported to its numerous variants.

ACKNOWLEDGEMENT
The authors would like to thank the School of Business,
Economics and Informatics, Birkbeck College, University of
London, for the grant received to support this research.

REFERENCES
[1] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus, "Regularization of neural networks using dropconnect," in Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013, pp. 1058-1066.
[2] D. Ciresan, U. Meier, and J. Schmidhuber, "Multi-column deep neural networks for image classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Press, 2012, pp. 3642-3649.
[3] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber, "Deep, big, simple neural nets for handwritten digit recognition," Neural Computation, vol. 22, no. 12, pp. 3207-3220, 2010.
[4] Y. LeCun and C. Cortes, "The MNIST database of handwritten digits." [Online]. Available: http://yann.lecun.com/exdb/mnist/
[5] M. Riedmiller and H. Braun, "A direct adaptive method for faster backpropagation learning: The Rprop algorithm," in Proceedings of the IEEE International Conference on Neural Networks. IEEE, 1993, pp. 586-591.
[6] A. D. Anastasiadis, G. D. Magoulas, and M. N. Vrahatis, "New globally convergent training scheme based on the resilient propagation algorithm," Neurocomputing, vol. 64, pp. 253-270, 2005.
[7] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," CoRR, vol. abs/1207.0580, 2012.
[8] C. Igel and M. Hüsken, "Improving the Rprop learning algorithm," in Proceedings of the Second International ICSC Symposium on Neural Computation (NC 2000), vol. 2000. Citeseer, 2000, pp. 115-121.
[9] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929-1958, 2014.
[10] P. Y. Simard, D. Steinkraus, and J. C. Platt, "Best practices for convolutional neural networks applied to visual document analysis," 2003. [Online]. Available: http://research.microsoft.com/apps/pubs/default.aspx?id=68920
[11] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123-140, 1996.
[12] D. H. Wolpert, "Stacked generalization," Neural Networks, vol. 5, pp. 241-259, 1992.
