LSTM: A Search Space Odyssey
Klaus Greff, Rupesh K. Srivastava, Jan Koutník, Bas R. Steunebrink, Jürgen Schmidhuber
Abstract—Several variants of the Long Short-Term Memory (LSTM) architecture for recurrent neural networks have been proposed since its inception in 1995. In recent years, these networks have become the state-of-the-art models for a variety of machine learning problems. This has led to a renewed interest in understanding the role and utility of various computational components of typical LSTM variants. In this paper, we present the first large-scale analysis of eight LSTM variants on three representative tasks: speech recognition, handwriting recognition, and polyphonic music modeling.

… synthesis [10], protein secondary structure prediction [11], analysis of audio [12], and video data [13], among others.

The central idea behind the LSTM architecture is a memory cell, which can maintain its state over time, and non-linear gating units, which regulate the information flow into and out of the cell. Most modern studies incorporate many improvements that have been made to the LSTM architecture since its …
Figure 1. Detailed schematic of the Simple Recurrent Network (SRN) unit (left) and a Long Short-Term Memory block (right) as used in the hidden layers of a recurrent neural network. (Graphic omitted; its legend distinguishes unweighted connections, weighted connections, connections with time-lag, branching points, multiplication, and sums over all inputs, and notes that the gate activation functions are always sigmoid while the input and output activation functions are usually tanh.)
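To make the block diagram concrete, the following is a minimal NumPy sketch of one time step of such an LSTM block, using the commonly cited peephole formulation (block input z, input/forget/output gates, cell c, block output y). It is an illustrative sketch rather than the authors' implementation; the parameter layout, the initialization scale, and the use of the updated cell state for the output-gate peephole are assumptions consistent with standard descriptions of the vanilla LSTM.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, y_prev, c_prev, P):
    """One time step of a peephole LSTM block.
    x: input vector; y_prev: previous block output; c_prev: previous cell state.
    P: dict with input weights W*, recurrent weights R*, peepholes p*, biases b*."""
    z = np.tanh(P["Wz"] @ x + P["Rz"] @ y_prev + P["bz"])                     # block input (g = tanh)
    i = sigmoid(P["Wi"] @ x + P["Ri"] @ y_prev + P["pi"] * c_prev + P["bi"])  # input gate
    f = sigmoid(P["Wf"] @ x + P["Rf"] @ y_prev + P["pf"] * c_prev + P["bf"])  # forget gate
    c = i * z + f * c_prev                                                    # cell state (the CEC)
    o = sigmoid(P["Wo"] @ x + P["Ro"] @ y_prev + P["po"] * c + P["bo"])       # output gate
    y = o * np.tanh(c)                                                        # block output (h = tanh)
    return y, c

def init_params(n_in, n_hid, seed=0):
    rng = np.random.default_rng(seed)
    P = {}
    for g in "zifo":
        P["W" + g] = 0.1 * rng.standard_normal((n_hid, n_in))
        P["R" + g] = 0.1 * rng.standard_normal((n_hid, n_hid))
        P["b" + g] = np.zeros(n_hid)
    for g in "ifo":                      # peephole weights exist only for the gates
        P["p" + g] = 0.1 * rng.standard_normal(n_hid)
    return P
```

Processing a sequence simply means looping lstm_step over time while carrying y and c forward.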
III. HISTORY OF LSTM

The initial version of the LSTM block [14, 15] included (possibly multiple) cells, and input and output gates, but no forget gate and no peephole connections. The output gate, unit biases, or input activation function were omitted for certain experiments. Training was done using a mixture of Real Time Recurrent Learning (RTRL) [23, 24] and Backpropagation Through Time (BPTT) [24, 25]. Only the gradient of the cell was propagated back through time, and the gradient for the other recurrent connections was truncated. Thus, that study did not use the exact gradient for training. Another feature of that version was the use of full gate recurrence, which means that all gates received recurrent inputs from all gates at the previous time step in addition to the recurrent inputs from the block outputs. This feature did not appear in any of the later papers.

A. Forget Gate

The first paper to suggest a modification of the LSTM architecture introduced the forget gate [21], enabling the LSTM to reset its own state. This allowed learning of continual tasks such as embedded Reber grammar.

… A study of this kind was later done by Jozefowicz et al. [30]. Sak et al. [9] introduced a linear projection layer that projects the output of the LSTM layer down before the recurrent and forward connections, in order to reduce the number of parameters for LSTM networks with many blocks. By introducing a trainable scaling parameter for the slope of the gate activation functions, Doetsch et al. [5] were able to improve the performance of LSTM on an offline handwriting recognition dataset. In what they call Dynamic Cortex Memory, Otte et al. [31] improved the convergence speed of LSTM by adding recurrent connections between the gates of a single block (but not between blocks).

Cho et al. [32] proposed a simplified variant of the LSTM architecture called the Gated Recurrent Unit (GRU). It uses neither peephole connections nor output activation functions, and it couples the input and the forget gate into an update gate. Finally, its output gate (called the reset gate) only gates the recurrent connections to the block input (Wz). Chung et al. [33] performed an initial comparison between GRU and vanilla LSTM and reported mixed results.
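For comparison with the LSTM block, here is an equally minimal NumPy sketch of one GRU step under the formulation summarized above: an update gate replaces the separate input and forget gates, the reset gate acts only on the recurrent part of the block input, and there is no peephole, output gate, or output activation. The z / (1 − z) convention follows Cho et al. [32]; the weight names are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, P):
    """One GRU time step. P holds input weights W*, recurrent weights U*, biases b*."""
    z = sigmoid(P["Wz"] @ x + P["Uz"] @ h_prev + P["bz"])              # update gate (coupled input/forget)
    r = sigmoid(P["Wr"] @ x + P["Ur"] @ h_prev + P["br"])              # reset gate
    h_cand = np.tanh(P["Wh"] @ x + P["Uh"] @ (r * h_prev) + P["bh"])   # candidate state; the reset gate
                                                                       # only touches the recurrent part
    return z * h_prev + (1.0 - z) * h_cand                             # interpolate old and candidate state
```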
IV. EVALUATION SETUP

… raw audio we extract 12 Mel Frequency Cepstrum Coefficients (MFCCs) [35] + energy over 25 ms Hamming windows with a stride of 10 ms and a pre-emphasis coefficient of 0.97. This preprocessing is standard in speech recognition and was chosen in order to stay comparable with earlier LSTM-based results (e.g. [20, 36]). The 13 coefficients along with their first and second derivatives comprise the 39 inputs to the network and were normalized to have zero mean and unit variance. The performance is measured as classification error percentage. The training, testing, and validation sets are split in line with Halberstadt [37] into 3696, 400, and 192 sequences, having 304 frames on average.

We restrict our study to the core test set, which is an established subset of the full TIMIT corpus, and use the splits into training, testing, and validation sets as detailed by Halberstadt [37]. In short, that means we only use the core test set and drop the SA samples³ from the training set. The validation set is built from some of the discarded samples from the full test set.
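The TIMIT front end described above can be approximated with standard tooling. The sketch below uses librosa, which is an assumption (the authors do not name their toolchain), and librosa's coefficient 0 stands in for the energy term, so the exact values will differ from the original pipeline.

```python
import numpy as np
import librosa

def timit_features(wav_path, sr=16000):
    """12 MFCCs + an energy-like C0 over 25 ms Hamming windows with 10 ms stride,
    pre-emphasis 0.97, plus first and second derivatives -> 39 features per frame."""
    y, sr = librosa.load(wav_path, sr=sr)
    y = librosa.effects.preemphasis(y, coef=0.97)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr),       # 25 ms window
                                hop_length=int(0.010 * sr),  # 10 ms stride
                                window="hamming")
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc, order=1),
                       librosa.feature.delta(mfcc, order=2)]).T   # shape (frames, 39)
    # zero mean, unit variance (in practice the statistics come from the training set)
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)
```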
IAM Online: The IAM Online Handwriting Database [38]⁴ consists of English sentences as time series of pen movements that have to be mapped to characters. The IAM-OnDB dataset splits into one training set, two validation sets, and one test set, having 775, 192, 216, and 544 boards each. Each board, see Figure 2(a), contains multiple hand-written lines, which in turn consist of several strokes. We use one line per sequence, and […] correction, etc.) was used.

The networks were trained using the Connectionist Temporal Classification (CTC) error function by Graves et al. [39] with 82 outputs (81 characters plus the special empty label). We measure performance in terms of the Character Error Rate (CER) after decoding using best-path decoding [39].
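Best-path decoding for CTC reduces to a frame-wise argmax, collapsing repeated labels, and removing the blank; the CER is then an edit distance. A small sketch (treating index 0 as the blank label is an assumption):

```python
import numpy as np

def best_path_decode(frame_probs, blank=0):
    """frame_probs: (T, K) network outputs over K labels
    (here K = 82: 81 characters plus the special empty/blank label)."""
    path = np.argmax(frame_probs, axis=-1)
    collapsed = [int(k) for i, k in enumerate(path) if i == 0 or k != path[i - 1]]
    return [k for k in collapsed if k != blank]

def character_error_rate(ref, hyp):
    """CER = edit distance between reference and hypothesis label sequences,
    divided by the reference length."""
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1,
                          d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1]))
    return d[-1, -1] / max(len(ref), 1)
```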
JSB Chorales: JSB Chorales is a collection of 382 four-part harmonized chorales by J. S. Bach [40], consisting of 202 chorales in major keys and 180 chorales in minor keys. We used the preprocessed piano-rolls provided by Boulanger-Lewandowski et al. [41].⁵ These piano-rolls were generated by transposing each MIDI sequence to C major or C minor and sampling frames every quarter note. The networks were trained to do next-step prediction by minimizing the negative log-likelihood. The complete dataset consists of 229, 76, and 77 sequences (training, validation, and test sets respectively) with an average length of 61.
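With one sigmoid output per pitch, next-step prediction on a binary piano-roll amounts to a sum of Bernoulli negative log-likelihoods; a short sketch (the averaging over frames is an assumption about the reported normalization):

```python
import numpy as np

def next_step_nll(pred, roll, eps=1e-8):
    """pred: (T-1, P) sigmoid outputs produced from frames 0..T-2;
    roll: (T, P) binary piano-roll. The target for frame t is frame t+1."""
    target = roll[1:]
    ll = target * np.log(pred + eps) + (1 - target) * np.log(1 - pred + eps)
    return -ll.sum(axis=1).mean()   # NLL summed over pitches, averaged over frames
```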
² Note that in linguistics a phone represents a distinct speech sound independent of the language. In contrast, a phoneme refers to a sound that distinguishes two words in a given language [34]. These terms are often confused in the machine learning literature.
³ The dialect sentences (the SA samples) were meant to expose the dialectal variants of the speakers and were read by all 630 speakers. We follow [37] and remove them because they bias the distribution of phones.
⁴ The IAM-OnDB was obtained from http://www.iam.unibe.ch/fki/databases/iam-on-line-handwriting-database
⁵ Available at http://www-etud.iro.umontreal.ca/∼boulanni/icml2012 at the time of writing.
B. Network Architectures & Training

A network with a single LSTM hidden layer and a sigmoid output layer was used for the JSB Chorales task. Bidirectional LSTM [20] was used for the TIMIT and IAM Online tasks, consisting of two hidden layers, one processing the input forwards and the other one backwards in time, both connected to a single softmax output layer. As loss function we employed Cross-Entropy Error for TIMIT and JSB Chorales, while for the IAM Online task the Connectionist Temporal Classification (CTC) loss by Graves et al. [39] was used. The initial weights for all networks were drawn from a normal distribution with standard deviation of 0.1. Training was done using Stochastic Gradient Descent with Nesterov-style momentum [42] with updates after each sequence. The learning rate was rescaled by a factor of (1 − momentum). Gradients were computed using full BPTT for LSTMs [20]. Training stopped after 150 epochs or once there was no improvement on the validation set for more than fifteen epochs.
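The update rule described above, per-sequence SGD with Nesterov-style momentum [42] and the learning rate rescaled by (1 − momentum), can be sketched as follows. Nesterov momentum can be written in several equivalent forms; the look-ahead formulation and the gradient callback used here are illustrative choices, not the authors' code.

```python
import numpy as np

def nesterov_sgd_step(params, velocity, grad_fn, learning_rate, momentum):
    """One per-sequence update. params/velocity: dicts of arrays;
    grad_fn: callback returning gradients for a given parameter setting."""
    lr = learning_rate * (1.0 - momentum)          # learning rate rescaled by (1 - momentum)
    lookahead = {k: p + momentum * velocity[k] for k, p in params.items()}
    grads = grad_fn(lookahead)                     # Nesterov: gradient at the look-ahead point
    for k in params:
        velocity[k] = momentum * velocity[k] - lr * grads[k]
        params[k] = params[k] + velocity[k]
    return params, velocity
```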
C. LSTM Variants

The vanilla LSTM from Section II is referred to as Vanilla (V). For activation functions we follow the standard and use the logistic sigmoid for σ and the hyperbolic tangent for both g and h. The derived eight variants of the V architecture are the following. We only report differences to the forward pass formulas presented in Section II-A:

NIG: No Input Gate: $i^t = 1$
NFG: No Forget Gate: $f^t = 1$
NOG: No Output Gate: $o^t = 1$
NIAF: No Input Activation Function: $g(x) = x$
NOAF: No Output Activation Function: $h(x) = x$
CIFG: Coupled Input and Forget Gate: $f^t = 1 - i^t$
NP: No Peepholes:
$\bar{i}^t = W_i x^t + R_i y^{t-1} + b_i$
$\bar{f}^t = W_f x^t + R_f y^{t-1} + b_f$
$\bar{o}^t = W_o x^t + R_o y^{t-1} + b_o$
FGR: Full Gate Recurrence:
$\bar{i}^t = W_i x^t + R_i y^{t-1} + p_i \odot c^{t-1} + b_i + R_{ii}\, i^{t-1} + R_{fi}\, f^{t-1} + R_{oi}\, o^{t-1}$
$\bar{f}^t = W_f x^t + R_f y^{t-1} + p_f \odot c^{t-1} + b_f + R_{if}\, i^{t-1} + R_{ff}\, f^{t-1} + R_{of}\, o^{t-1}$
$\bar{o}^t = W_o x^t + R_o y^{t-1} + p_o \odot c^{t-1} + b_o + R_{io}\, i^{t-1} + R_{fo}\, f^{t-1} + R_{oo}\, o^{t-1}$

The first six variants are self-explanatory. The CIFG variant uses only one gate for gating both the input and the cell recurrent self-connection – a modification of LSTM referred to as Gated Recurrent Units (GRU) [32]. This is equivalent to setting $f^t = 1 - i^t$ instead of learning the forget gate weights independently. The FGR variant adds recurrent connections between all the gates, as in the original formulation of the LSTM [15]. It adds nine additional recurrent weight matrices, thus significantly increasing the number of parameters.
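The following sketch spells out how the NP, CIFG, and FGR variants change the gate computations relative to the vanilla block; the remaining variants simply replace a gate or an activation function with the identity. The parameter dictionaries and the use of the previous cell state for all peepholes are illustrative simplifications (in the vanilla formulation the output-gate peephole reads the updated cell state).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_gate_activations(x, y_prev, c_prev, W, R, p, b,
                          Rg=None, gates_prev=None,
                          peepholes=True, cifg=False, fgr=False):
    """Gate activations for one block. W, R, p, b map 'i'/'f'/'o' to parameters;
    Rg maps gate pairs such as 'fi' to the extra FGR matrices, and gates_prev
    holds the gate activations of the previous time step."""
    pre = {}
    for g in ("i", "f", "o"):
        pre[g] = W[g] @ x + R[g] @ y_prev + b[g]
        if peepholes:                      # NP: drop these peephole terms
            pre[g] = pre[g] + p[g] * c_prev
        if fgr:                            # FGR: add the nine gate-to-gate recurrences
            for h in ("i", "f", "o"):
                pre[g] = pre[g] + Rg[h + g] @ gates_prev[h]
    i = sigmoid(pre["i"])
    f = 1.0 - i if cifg else sigmoid(pre["f"])   # CIFG: f^t = 1 - i^t
    o = sigmoid(pre["o"])
    return i, f, o
```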
D. Hyperparameter Search

While there are other methods to efficiently search for good hyperparameters (cf. [43, 44]), random search has several advantages for our setting: it is easy to implement, trivial to parallelize, and covers the search space more uniformly, thereby improving the follow-up analysis of hyperparameter importance. … For every trial the following hyperparameters were drawn at random (a sampling sketch is given at the end of this subsection):
• learning rate: log-uniform samples from [10⁻⁶, 10⁻²];
• momentum: 1 − log-uniform samples from [0.01, 1.0];
• standard deviation of Gaussian input noise: uniform samples from [0, 1].

In the case of the TIMIT dataset, two additional (boolean) hyperparameters were considered (not tuned for the other two datasets). The first one was the choice between traditional momentum and Nesterov-style momentum [42]. Our analysis showed that this had no measurable effect on performance, so the latter was arbitrarily chosen for all further experiments. The second one was whether to clip the gradients to the range [−1, 1]. This turned out to hurt overall performance,⁶ therefore the gradients were never clipped in the case of the other two datasets.

Note that, unlike an earlier small-scale study [33], the number of parameters was not kept fixed for all variants. Since different variants can utilize their parameters differently, fixing this number can bias comparisons.

⁶ Although this may very well be the result of the range having been chosen …
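A sketch of drawing one random-search trial from the ranges listed above. The hidden-size range [20, 200] matches the hidden-size axis of Figure 4 but is otherwise an assumption, as are the trial count of 200 per variant and dataset and the hypothetical train_and_evaluate callback.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_trial():
    return {
        "learning_rate": 10 ** rng.uniform(-6, -2),                 # log-uniform in [1e-6, 1e-2]
        "momentum": 1.0 - 10 ** rng.uniform(np.log10(0.01), 0.0),   # 1 - log-uniform in [0.01, 1.0]
        "input_noise_std": rng.uniform(0.0, 1.0),                   # uniform in [0, 1]
        "hidden_size": int(round(10 ** rng.uniform(np.log10(20), np.log10(200)))),
    }

def random_search(train_and_evaluate, n_trials=200):
    """Run independent trials and keep everything for the later fANOVA-style analysis."""
    trials = [sample_trial() for _ in range(n_trials)]
    results = [train_and_evaluate(**t) for t in trials]             # e.g. validation/test error
    return trials, results
```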
V. RESULTS & DISCUSSION

Each of the 5400 experiments was run on one of 128 AMD Opteron CPUs at 2.5 GHz and took 24.3 h on average to complete. This sums up to a total single-CPU computation time of just below 15 years.

For TIMIT, the test set performance of the best trial was 29.6% classification error (CIFG), which is close to the best reported result of 26.9% [20]. Our best result of −8.38 log-likelihood (NIAF) on the JSB Chorales dataset, on the other hand, is well below the −5.56 from Boulanger-Lewandowski et al. [41]. For the IAM Online dataset our best result was a Character Error Rate of 9.26% (NP) on the test set. The best previously published result is 11.5% CER by Graves et al. [45], using a different and much more extensive preprocessing.⁷ Note, though, that the goal of this study is not to provide state-of-the-art results, but to do a fair comparison of different LSTM variants, so these numbers are only meant as a rough orientation for the reader.

A. Comparison of the Variants

A summary of the random search results is shown in Figure 3. Welch's t-test at a significance level of p = 0.05 was used⁸ to determine whether the mean test set performance of each variant was significantly different from that of the baseline. The box for a variant is highlighted in blue if its mean performance differs significantly from the mean performance of the vanilla LSTM.
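The significance test behind the highlighting in Figure 3 is available off the shelf: SciPy's ttest_ind with equal_var=False implements Welch's t-test. A sketch with illustrative variable names:

```python
from scipy import stats

def differs_from_vanilla(variant_errors, vanilla_errors, alpha=0.05):
    """Welch's t-test on the test-set performances of one variant vs. the vanilla
    LSTM; returns whether the means differ at significance level alpha."""
    t_stat, p_value = stats.ttest_ind(variant_errors, vanilla_errors, equal_var=False)
    return p_value < alpha, p_value
```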
Figure 3. Test set performance for all 200 trials (top) and for the best 10% (bottom) trials (according to the validation set) for each dataset and variant. Boxes
show the range between the 25th and the 75th percentile of the data, while the whiskers indicate the whole range. The red dot represents the mean and the red
line the median of the data. The boxes of variants that differ significantly from the vanilla LSTM are shown in blue with thick lines. The grey histogram in the
background presents the average number of parameters for the top 10% performers of every variant.
The results in the top half of Figure 3 represent the distribution of all 200 test set performances over the whole search space. Any conclusions drawn from them are therefore specific to our choice of search ranges. We have tried to choose reasonable ranges for the hyperparameters that include the best settings for each variant and are still small enough to allow for an effective search. The means and variances tend to be rather similar for the different variants and datasets, but even here some significant differences can be found.

In order to draw some more interesting conclusions we restrict our further analysis to the top 10% performing trials for each combination of dataset and variant (see bottom half of Figure 3). This way our findings will be less dependent on the chosen search space and will be representative for the case of “reasonable hyperparameter tuning efforts.”⁹

The first important observation based on Figure 3 is that removing the output activation function (NOAF) or the forget gate (NFG) significantly hurt performance on all three datasets. Apart from the CEC, the ability to forget old information and the squashing of the cell state appear to be critical for the LSTM architecture. Indeed, without the output activation function, the block output can in principle grow unbounded. Coupling the input and the forget gate avoids this problem and might render the use of an output non-linearity less important, which could explain why GRU performs well without it.

Input and forget gate coupling (CIFG) did not significantly change mean performance on any of the datasets, although the best performance improved slightly on music modeling. Similarly, removing peephole connections (NP) also did not lead to significant changes, but the best performance improved slightly for handwriting recognition. Both of these variants simplify LSTMs and reduce the computational complexity, so it might be worthwhile to incorporate these changes into the architecture.

Adding full gate recurrence (FGR) did not significantly change performance on TIMIT or IAM Online, but led to worse results on the JSB Chorales dataset. Given that this variant greatly increases the number of parameters, we generally advise against using it. Note that this feature was present in the original proposal of LSTM [14, 15], but has been absent in all following studies.

Removing the input gate (NIG), the output gate (NOG), and the input activation function (NIAF) led to a significant reduction in performance on speech and handwriting recognition. However, there was no significant effect on music modeling performance. A small (but statistically insignificant) average performance improvement was observed for the NIG and NIAF architectures on music modeling. We hypothesize that these behaviors will generalize to similar problems such as language modeling. For supervised learning on continuous real-valued data (such as speech and handwriting recognition), the input gate, output gate, and input activation function are all crucial for obtaining good performance.

⁹ How much effort is “reasonable” will still depend on the search space. If the ranges are chosen much larger, the search will take much longer to find good hyperparameters.
B. Impact of Hyperparameters

The fANOVA framework for assessing hyperparameter importance by Hutter et al. [19] is based on the observation that marginalizing over dimensions can be done efficiently in regression trees. This allows predicting the marginal error for one hyperparameter while averaging over all the others. Traditionally this would require a full hyperparameter grid search, whereas here the hyperparameter space can be sampled at random.

Average performance for any slice of the hyperparameter space is obtained by first training a regression tree and then summing over its predictions along the corresponding subset of dimensions. To be precise, a random regression forest of 100 trees is trained and their prediction performance is averaged. This improves the generalization and allows for an estimation of uncertainty of those predictions. The obtained marginals can then be used to decompose the variance into additive components using the functional ANalysis Of VAriance (fANOVA) method [46], which provides an insight into the overall importance of hyperparameters and their interactions.
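As a rough stand-in for this procedure, one can fit a random regression forest of 100 trees on (hyperparameter setting → test set error) with scikit-learn and average its predictions over the remaining dimensions. The Monte-Carlo averaging below only approximates the exact tree-based marginalization that fANOVA [19, 46] performs, and the spread over trees serves as the uncertainty estimate mentioned above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def marginal_curve(X, y, dim, grid, seed=0):
    """X: (n_trials, n_hyperparams) sampled settings; y: observed test error.
    Returns mean and std over trees of the predicted marginal error along `grid`."""
    forest = RandomForestRegressor(n_estimators=100, random_state=seed).fit(X, y)
    background = X.copy()
    means, stds = [], []
    for value in grid:
        background[:, dim] = value                  # slice the space at this value
        per_tree = np.array([t.predict(background).mean() for t in forest.estimators_])
        means.append(per_tree.mean())
        stds.append(per_tree.std())                 # spread over trees ~ reliability of the marginal
    return np.array(means), np.array(stds)
```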
Learning rate: Learning rate is the most important hyperparameter, therefore it is very important to understand how to set it correctly in order to achieve good performance. Figure 4 shows (in blue) how setting the learning rate value affects the predicted average performance on the test set. It is important to note that this is an average over all other hyperparameters and over all the trees in the regression forest. The shaded area around the curve indicates the standard deviation over tree predictions (not over other hyperparameters), thus quantifying the reliability of the average. The same is shown in green with the predicted average training time.

The plots in Figure 4 show that the optimal value for the learning rate is dependent on the dataset. For each dataset, there is a large basin (up to two orders of magnitude) of good learning rates inside of which the performance does not vary much. A related but unsurprising observation is that there is a sweet spot for the learning rate at the high end of the basin.¹⁰ In this region, the performance is good and the training time is small. So while searching for a good learning rate for the LSTM, it is sufficient to do a coarse search by starting with a high value (e.g. 1.0) and dividing it by ten until performance stops increasing.
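That recipe in code form; train_and_evaluate is a hypothetical callback that trains a (small) network with the given learning rate and returns a validation error, lower being better:

```python
def coarse_learning_rate_search(train_and_evaluate, start=1.0, max_steps=6):
    """Start high (e.g. 1.0) and divide by ten until performance stops improving."""
    best_lr, best_error, lr = None, float("inf"), start
    for _ in range(max_steps):
        error = train_and_evaluate(lr)
        if error >= best_error:        # performance stopped improving
            break
        best_lr, best_error = lr, error
        lr /= 10.0
    return best_lr
```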
Figure 5 also shows that the fraction of variance caused by the learning rate is much bigger than the fraction due to interaction between learning rate and hidden layer size (some part of the “higher order” piece; for more see below at Interaction of Hyperparameters). This suggests that the learning rate can be quickly tuned on a small network and then used to train a large one.

¹⁰ Note that it is unfortunately outside the investigated range for IAM Online and JSB Chorales. This means that ideally we should have chosen the range of learning rates to include higher values as well.

Hidden Layer Size: Not surprisingly, the hidden layer size is an important hyperparameter affecting the LSTM network performance. As expected, larger networks perform better, but with diminishing returns. It can also be seen in Figure 4 (middle, green) that the required training time increases with the network size. Note that the scale here is wall-time and thus factors in both the increased computation time for each epoch as well as the convergence speed.

Input Noise: Additive Gaussian noise on the inputs, a traditional regularizer for neural networks, has been used for LSTM as well. However, we find that not only does it almost always hurt performance, it also slightly increases training times. The only exception is TIMIT, where a small dip in error for the range of [0.2, 0.5] is observed.

Momentum: One unexpected result of this study is that momentum affects neither performance nor training time in any significant way. This follows from the observation that for none of the datasets did momentum account for more than 1% of the variance of test set performance. It should be noted that for TIMIT the interaction between learning rate and momentum accounts for 2.5% of the total variance, but as with learning rate × hidden size (cf. Interaction of Hyperparameters below) it does not reveal any interpretable structure. This may be the result of our choice to scale learning rates dependent on momentum (Section IV-B). These observations suggest that momentum does not offer substantial benefits when training LSTMs with online stochastic gradient descent.

Analysis of Variance: Figure 5 shows what fraction of the test set performance variance can be attributed to different hyperparameters. It is obvious that the learning rate is by far the most important hyperparameter, always accounting for more than two thirds of the variance. The next most important hyperparameter is the hidden layer size, followed by the input noise, leaving the momentum with less than one percent of the variance. Higher order interactions play an important role in the case of TIMIT, but are much less important for the other two datasets.

Interaction of Hyperparameters: Some hyperparameters interact with each other, resulting in different performance from what could be expected by looking at them individually. As shown in Figure 5, all these interactions together explain between 5% and 20% of the variance in test set performance. Understanding these interactions might allow us to speed up the search for good combinations of hyperparameters. To that end we visualize the interaction between all pairs of hyperparameters in Figure 6. Each heat map in the left part shows marginal performance for different values of the respective two hyperparameters. This is the average performance predicted by the decision forest when marginalizing over all other hyperparameters. So each one is the 2D version of the performance plots from Figure 4.

The right side employs the idea of ANOVA to better illustrate the interaction between the hyperparameters. This means that the variance of performance that can be explained by varying a single hyperparameter has been removed. In case two hyperparameters do not interact at all (are perfectly independent), that residual would thus be all zero (grey).
(Figure 4 graphic omitted. Axes: learning rate on a log scale from 10⁻⁶ to 10⁻², hidden size from 20 to 200, and input noise standard deviation from 0 to 1; the blue curves show the error measure for each dataset (Classification Error, Character Error Rate, Negative Log-Likelihood) and the green curves show total training time in hours.)
Figure 4. Predicted marginal error (blue) and marginal time for different values of the learning rate, hidden size, and the input noise (columns) for the test set
of all three datasets (rows). The shaded area indicates the standard deviation between the tree-predicted marginals and thus the reliability of the predicted mean
performance. Note that each plot is for the vanilla LSTM but curves for all variants that are not significantly worse look very similar.
For example, looking at the pair hidden size and learning rate on the left side for the TIMIT dataset, we can see that performance varies strongly along the x-axis (learning rate), first decreasing and then increasing again. This is what we would expect knowing the valley-shape of the learning rate from Figure 4. Along the y-axis (hidden size) performance seems to decrease slightly from top to bottom. Again this is roughly what we would expect from the hidden size plot in Figure 4.

On the right side of Figure 6 we can see for the same pair of hyperparameters how their interaction differs from the case of them being completely independent. This heat map exhibits less structure, and it may in fact be the case that we would need more samples to properly analyze the interplay between them. However, given our observations so far this might not be worth the effort. In any case, it is clear from the plot on the left that varying the hidden size does not change the region of optimal learning rate.

One clear interaction pattern can be observed in the IAM Online and JSB datasets between learning rate and input noise. Here it can be seen that for high learning rates (≳ 10⁻⁴) lower input noise (≲ 0.5) is better, as also observed in the marginals from Figure 4. But this trend reverses for lower learning rates, where higher values of input noise are beneficial. Though interesting, this is not of any practical relevance, because performance is generally bad in that region of low learning rates. Apart from this, however, it is difficult to discern any regularities in the analyzed hyperparameter interactions. We conclude that there is little practical value in attending to the interplay between hyperparameters. So for practical purposes hyperparameters can be treated as approximately independent and thus optimized separately.

Figure 5. Pie charts showing which fraction of variance of the test set performance can be attributed to each of the hyperparameters. The percentage of variance that is due to interactions between multiple parameters is indicated as “higher order.”
Figure 6. Total marginal predicted performance for all pairs of hyperparameters (left) and the variation only due to their interaction (right). The plot is divided vertically into three subplots, one for every dataset (TIMIT, IAM Online, and JSB Chorales). The subplots themselves are divided horizontally into two parts, each containing a lower triangular matrix of heat maps. The rows and columns of these matrices represent the different hyperparameters (learning rate, momentum, hidden size, and input noise) and there is one heat map for every combination. The color encodes the performance as measured by the Classification Error for TIMIT, Character Error Rate for IAM Online, and Negative Log-Likelihood for the JSB Chorales dataset. For all datasets low (blue) is better than high (red).
VI. CONCLUSION

This paper reports the results of a large-scale study on variants of the LSTM architecture. We conclude that the most commonly used LSTM architecture (vanilla LSTM) performs reasonably well on various datasets. None of the eight investigated modifications significantly improves performance. However, certain modifications, such as coupling the input and forget gates (CIFG) or removing peephole connections (NP), simplified LSTMs in our experiments without significantly decreasing performance. These two variants are also attractive because they reduce the number of parameters and the computational cost of the LSTM.

The forget gate and the output activation function are the most critical components of the LSTM block. Removing any of them significantly impairs performance. We hypothesize that the output activation function is needed to prevent the unbounded cell state from propagating through the network and destabilizing learning. This would explain why the LSTM variant GRU can perform reasonably well without it: its cell state is bounded because of the coupling of input and forget gate.

As expected, the learning rate is the most crucial hyperparameter, followed by the network size. Surprisingly though, the use of momentum was found to be unimportant in our setting of online gradient descent. Gaussian noise on the inputs was found to be moderately helpful for TIMIT, but harmful for the other datasets.

The analysis of hyperparameter interactions revealed no apparent structure. Furthermore, even the highest measured interaction (between learning rate and network size) is quite small. This implies that for practical purposes the hyperparameters can be treated as approximately independent. In particular, the learning rate can be tuned first using a fairly small network, thus saving a lot of experimentation time.

Neural networks can be tricky to use for many practitioners compared to other methods whose properties are already well understood. This has remained a hurdle for newcomers to the field, since a lot of practical choices are based on the intuitions of experts, as well as experiences gained over time. With this study, we have attempted to back some of these intuitions with experimental results. We have also presented new insights, both on architecture selection and hyperparameter tuning, for LSTM networks, which have emerged as the method of choice for solving complex sequence learning problems. In future work, we plan to explore more complex modifications of the LSTM architecture.
REFERENCES

[1] Sepp Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Master's thesis, Technische Universität München, München, 1991.
[2] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In S. C. Kremer and J. F. Kolen, editors, A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press, 2001.
[3] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber. A Novel Connectionist System for Improved Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5), 2009.
[4] Vu Pham, Théodore Bluche, Christopher Kermorvant, and Jérôme Louradour. Dropout improves Recurrent Neural Networks for Handwriting Recognition. arXiv:1312.4569 [cs], November 2013. URL http://arxiv.org/abs/1312.4569.
[5] Patrick Doetsch, Michal Kozielski, and Hermann Ney. Fast and robust training of recurrent neural networks for offline handwriting recognition. In 14th International Conference on Frontiers in Handwriting Recognition, 2014. URL http://people.sabanciuniv.edu/berrin/cs581/Papers/icfhr2014/data/4334a279.pdf.
[6] Alex Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850 [cs], August 2013. URL http://arxiv.org/abs/1308.0850.
[7] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent Neural Network Regularization. arXiv:1409.2329 [cs], September 2014. URL http://arxiv.org/abs/1409.2329.
[8] Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. Addressing the Rare Word Problem in Neural Machine Translation. arXiv:1410.8206, 2014. URL http://arxiv.org/abs/1410.8206.
[9] Hasim Sak, Andrew Senior, and Françoise Beaufays. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2014. URL http://193.6.4.39/∼czap/letoltes/IS14/IS2014/PDF/AUTHOR/IS141304.PDF.
[10] Yuchen Fan, Yao Qian, Fenglong Xie, and Frank K. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014.
[11] Søren Kaae Sønderby and Ole Winther. Protein Secondary Structure Prediction with Long Short Term Memory Networks. arXiv:1412.7828 [cs, q-bio], December 2014. URL http://arxiv.org/abs/1412.7828.
[12] E. Marchi, G. Ferroni, F. Eyben, L. Gabrielli, S. Squartini, and B. Schuller. Multi-resolution linear prediction based features for audio onset detection with bidirectional LSTM neural networks. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2164–2168, May 2014. doi: 10.1109/ICASSP.2014.6853982.
[13] Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. arXiv:1411.4389 [cs], November 2014. URL http://arxiv.org/abs/1411.4389.
[14] Sepp Hochreiter and Jürgen Schmidhuber. Long Short Term Memory. Technical Report FKI-207-95, Technische Universität München, München, August 1995. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.51.3117.
[15] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, November 1997. doi: 10.1162/neco.1997.9.8.1735. URL http://www.bioinf.jku.at/publications/older/2604.pdf.
[16] R. L. Anderson. Recent Advances in Finding Best Operating Conditions. Journal of the American Statistical Association, 48(264):789–798, December 1953. doi: 10.2307/2281072. URL http://www.jstor.org/stable/2281072.
[17] Francisco J. Solis and Roger J.-B. Wets. Minimization by Random Search Techniques. Mathematics of Operations Research, 6(1):19–30, February 1981. doi: 10.1287/moor.6.1.19. URL http://pubsonline.informs.org/doi/abs/10.1287/moor.6.1.19.
[18] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 13(1):281–305, 2012. URL http://dl.acm.org/citation.cfm?id=2188395.
[19] Frank Hutter, Holger Hoos, and Kevin Leyton-Brown. An Efficient Approach for Assessing Hyperparameter Importance. Pages 754–762, 2014. URL http://jmlr.org/proceedings/papers/v32/hutter14.html.
[20] Alex Graves and Jürgen Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5–6):602–610, July 2005. doi: 10.1016/j.neunet.2005.06.042. URL http://www.sciencedirect.com/science/article/pii/S0893608005001206.
[21] Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. In Artificial Neural Networks, 1999. ICANN 99. Ninth International Conference on (Conf. Publ. No. 470), volume 2, pages 850–855, 1999.
[22] Felix A. Gers and Jürgen Schmidhuber. Recurrent nets that time and count. In Neural Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on, volume 3, pages 189–194. IEEE, 2000.
[23] A. J. Robinson and Frank Fallside. The utility driven dynamic error propagation network. University of Cambridge Department of Engineering, 1987.
[24] R. J. Williams. Complexity of exact gradient computation algorithms for recurrent neural networks. Technical Report NU-CCS-89-27, Boston: Northeastern University, College of Computer Science, 1989.
[25] P. J. Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1, 1988.
[26] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren. DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM. National Institute of Standards and Technology, NTIS Order No PB91-505065, 1993.
[27] Felix A. Gers, Juan Antonio Pérez-Ortiz, Douglas Eck, and Jürgen Schmidhuber. DEKF-LSTM. In ESANN 2002, Proceedings of the 10th European Symposium on Artificial Neural Networks, 2002.
[28] J. Schmidhuber, D. Wierstra, M. Gagliolo, and F. J. Gomez. Training Recurrent Networks by EVOLINO. Neural Computation, 19(3):757–779, 2007.
[29] Justin Bayer, Daan Wierstra, Julian Togelius, and Jürgen Schmidhuber. Evolving memory cell structures for sequence learning. In Artificial Neural Networks – ICANN 2009, pages 755–764. Springer, 2009. URL http://link.springer.com/chapter/10.1007/978-3-642-04277-5_76.
[30] Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. An empirical exploration of recurrent network architectures. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 2342–2350, 2015.
[31] Sebastian Otte, Marcus Liwicki, and Andreas Zell. Dynamic Cortex Memory: Enhancing Recurrent Neural Networks for Gradient-Based Sequence Learning. In Artificial Neural Networks and Machine Learning – ICANN 2014, number 8681 in Lecture Notes in Computer Science, pages 1–8. Springer International Publishing, September 2014. URL http://link.springer.com/chapter/10.1007/978-3-319-11179-7_1.
[32] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv:1406.1078, 2014. URL http://arxiv.org/abs/1406.1078.
[33] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv:1412.3555 [cs], December 2014. URL http://arxiv.org/abs/1412.3555.
[34] David Crystal. Dictionary of Linguistics and Phonetics, volume 30. John Wiley & Sons, 2011.
[35] P. Mermelstein. Distance measures for speech recognition: Psychological and instrumental. In C. H. Chen, editor, Pattern Recognition and Artificial Intelligence, pages 374–388. Academic Press, New York, 1976.
[36] Alexander Graves. Supervised Sequence Labelling with Recurrent Neural Networks. PhD thesis, Technical University of Munich, 2008.
[37] Andrew K. Halberstadt. Heterogeneous acoustic measurements and multiple classifiers for speech recognition. PhD thesis, Massachusetts Institute of Technology, 1998.
[38] Marcus Liwicki and Horst Bunke. IAM-OnDB: an on-line English sentence database acquired from handwritten text on a whiteboard. In Document Analysis and Recognition, 2005. Proceedings. Eighth International Conference on, pages 956–961. IEEE, 2005.
[39] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376, 2006. URL http://dl.acm.org/citation.cfm?id=1143891.
[40] Moray Allan and Christopher K. I. Williams. Harmonising chorales by probabilistic inference. Advances in Neural Information Processing Systems, 17:25–32, 2005.
[41] Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription. Pages 1159–1166, 2012. URL http://icml.cc/discuss/2012/590.html.
[42] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In JMLR, pages 1139–1147, 2013. URL http://jmlr.org/proceedings/papers/v28/sutskever13.html.
[43] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian Optimization of Machine Learning Algorithms. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2951–2959. Curran Associates, Inc., 2012.
[44] F. Hutter, H. H. Hoos, and K. Leyton-Brown. Sequential Model-Based Optimization for General Algorithm Configuration. In Proc. of LION-5, pages 507–523, 2011.
[45] Alex Graves, Marcus Liwicki, Horst Bunke, Jürgen Schmidhuber, and Santiago Fernández. Unconstrained on-line handwriting recognition with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 577–584, 2008.
[46] Giles Hooker. Generalized Functional ANOVA Diagnostics for High-Dimensional Functions of Dependent Variables. Journal of Computational and Graphical Statistics, 16(3):709–732, September 2007. doi: 10.1198/106186007X237892. URL http://www.tandfonline.com/doi/abs/10.1198/106186007X237892.
Klaus Greff received his Diploma in Computer Science from the University of Kaiserslautern, Germany in 2011. Currently he is pursuing his PhD at IDSIA in Lugano, Switzerland, under the supervision of Prof. Jürgen Schmidhuber in the field of Machine Learning. His research interests include Sequence Learning and Recurrent Neural Networks.

Bas R. Steunebrink is a postdoctoral researcher at the Swiss AI lab IDSIA. He received his PhD in 2010 from Utrecht University, the Netherlands. Bas's interests and expertise include Artificial General Intelligence (AGI), cognitive robotics, machine learning, resource-constrained control, and affective computing.