LSTM: A Search Space Odyssey
Klaus Greff, Rupesh K. Srivastava, Jan Koutník, Bas R. Steunebrink, Jürgen Schmidhuber
Abstract—Several variants of the Long Short-Term Memory (LSTM) architecture for recurrent neural networks have been proposed since its inception in 1995. In recent years, these networks have become the state-of-the-art models for a variety of machine learning problems. This has led to a renewed interest in understanding the role and utility of various computational components of typical LSTM variants. In this paper, we present the first large-scale analysis of eight LSTM variants on three representative tasks: speech recognition, handwriting recognition, and polyphonic music modeling.

… synthesis [10], protein secondary structure prediction [11], analysis of audio [12], and video data [13], among others.

The central idea behind the LSTM architecture is a memory cell, which can maintain its state over time, and non-linear gating units, which regulate the information flow into and out of the cell. Most modern studies incorporate many improvements that have been made to the LSTM architecture since its …
Figure 1. Detailed schematic of the Simple Recurrent Network (SRN) unit (left) and a Long Short-Term Memory block (right) as used in the hidden layers of a recurrent neural network. (Graphic omitted; its legend distinguishes unweighted connections, weighted connections, connections with time-lag, branching points, multiplication, and sums over all inputs, and notes that the gate activation functions are always sigmoid while the input and output activation functions are usually tanh.)
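To make the block diagram concrete, the following is a minimal NumPy sketch of one time step of such an LSTM block, using the commonly cited peephole formulation (block input z, input/forget/output gates, cell c, block output y). It is an illustrative sketch rather than the authors' implementation; the parameter layout, the initialization scale, and the use of the updated cell state for the output-gate peephole are assumptions consistent with standard descriptions of the vanilla LSTM.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, y_prev, c_prev, P):
    """One time step of a peephole LSTM block.
    x: input vector; y_prev: previous block output; c_prev: previous cell state.
    P: dict with input weights W*, recurrent weights R*, peepholes p*, biases b*."""
    z = np.tanh(P["Wz"] @ x + P["Rz"] @ y_prev + P["bz"])                     # block input (g = tanh)
    i = sigmoid(P["Wi"] @ x + P["Ri"] @ y_prev + P["pi"] * c_prev + P["bi"])  # input gate
    f = sigmoid(P["Wf"] @ x + P["Rf"] @ y_prev + P["pf"] * c_prev + P["bf"])  # forget gate
    c = i * z + f * c_prev                                                    # cell state (the CEC)
    o = sigmoid(P["Wo"] @ x + P["Ro"] @ y_prev + P["po"] * c + P["bo"])       # output gate
    y = o * np.tanh(c)                                                        # block output (h = tanh)
    return y, c

def init_params(n_in, n_hid, seed=0):
    rng = np.random.default_rng(seed)
    P = {}
    for g in "zifo":
        P["W" + g] = 0.1 * rng.standard_normal((n_hid, n_in))
        P["R" + g] = 0.1 * rng.standard_normal((n_hid, n_hid))
        P["b" + g] = np.zeros(n_hid)
    for g in "ifo":                      # peephole weights exist only for the gates
        P["p" + g] = 0.1 * rng.standard_normal(n_hid)
    return P
```

Processing a sequence simply means looping lstm_step over time while carrying y and c forward.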
III. HISTORY OF LSTM

The initial version of the LSTM block [14, 15] included (possibly multiple) cells, and input and output gates, but no forget gate and no peephole connections. The output gate, unit biases, or input activation function were omitted for certain experiments. Training was done using a mixture of Real Time Recurrent Learning (RTRL) [23, 24] and Backpropagation Through Time (BPTT) [24, 25]. Only the gradient of the cell was propagated back through time, and the gradient for the other recurrent connections was truncated. Thus, that study did not use the exact gradient for training. Another feature of that version was the use of full gate recurrence, which means that all gates received recurrent inputs from all gates at the previous time step in addition to the recurrent inputs from the block outputs. This feature did not appear in any of the later papers.

A. Forget Gate

The first paper to suggest a modification of the LSTM architecture introduced the forget gate [21], enabling the LSTM to reset its own state. This allowed learning of continual tasks such as embedded Reber grammar.

… A study of this kind was later done by Jozefowicz et al. [30]. Sak et al. [9] introduced a linear projection layer that projects the output of the LSTM layer down before the recurrent and forward connections, in order to reduce the number of parameters for LSTM networks with many blocks. By introducing a trainable scaling parameter for the slope of the gate activation functions, Doetsch et al. [5] were able to improve the performance of LSTM on an offline handwriting recognition dataset. In what they call Dynamic Cortex Memory, Otte et al. [31] improved the convergence speed of LSTM by adding recurrent connections between the gates of a single block (but not between blocks).

Cho et al. [32] proposed a simplified variant of the LSTM architecture called the Gated Recurrent Unit (GRU). It uses neither peephole connections nor output activation functions, and it couples the input and the forget gate into an update gate. Finally, its output gate (called the reset gate) only gates the recurrent connections to the block input (Wz). Chung et al. [33] performed an initial comparison between GRU and vanilla LSTM and reported mixed results.
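For comparison with the LSTM block, here is an equally minimal NumPy sketch of one GRU step under the formulation summarized above: an update gate replaces the separate input and forget gates, the reset gate acts only on the recurrent part of the block input, and there is no peephole, output gate, or output activation. The z / (1 − z) convention follows Cho et al. [32]; the weight names are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, P):
    """One GRU time step. P holds input weights W*, recurrent weights U*, biases b*."""
    z = sigmoid(P["Wz"] @ x + P["Uz"] @ h_prev + P["bz"])              # update gate (coupled input/forget)
    r = sigmoid(P["Wr"] @ x + P["Ur"] @ h_prev + P["br"])              # reset gate
    h_cand = np.tanh(P["Wh"] @ x + P["Uh"] @ (r * h_prev) + P["bh"])   # candidate state; the reset gate
                                                                       # only touches the recurrent part
    return z * h_prev + (1.0 - z) * h_cand                             # interpolate old and candidate state
```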
IV. EVALUATION SETUP

… raw audio we extract 12 Mel Frequency Cepstrum Coefficients (MFCCs) [35] + energy over 25 ms Hamming windows with a stride of 10 ms and a pre-emphasis coefficient of 0.97. This preprocessing is standard in speech recognition and was chosen in order to stay comparable with earlier LSTM-based results (e.g. [20, 36]). The 13 coefficients along with their first and second derivatives comprise the 39 inputs to the network and were normalized to have zero mean and unit variance. The performance is measured as classification error percentage. The training, testing, and validation sets are split in line with Halberstadt [37] into 3696, 400, and 192 sequences, having 304 frames on average.

We restrict our study to the core test set, which is an established subset of the full TIMIT corpus, and use the splits into training, testing, and validation sets as detailed by Halberstadt [37]. In short, that means we only use the core test set and drop the SA samples³ from the training set. The validation set is built from some of the discarded samples from the full test set.
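The TIMIT front end described above can be approximated with standard tooling. The sketch below uses librosa, which is an assumption (the authors do not name their toolchain), and librosa's coefficient 0 stands in for the energy term, so the exact values will differ from the original pipeline.

```python
import numpy as np
import librosa

def timit_features(wav_path, sr=16000):
    """12 MFCCs + an energy-like C0 over 25 ms Hamming windows with 10 ms stride,
    pre-emphasis 0.97, plus first and second derivatives -> 39 features per frame."""
    y, sr = librosa.load(wav_path, sr=sr)
    y = librosa.effects.preemphasis(y, coef=0.97)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr),       # 25 ms window
                                hop_length=int(0.010 * sr),  # 10 ms stride
                                window="hamming")
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc, order=1),
                       librosa.feature.delta(mfcc, order=2)]).T   # shape (frames, 39)
    # zero mean, unit variance (in practice the statistics come from the training set)
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)
```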
IAM Online: The IAM Online Handwriting Database [38]⁴ consists of English sentences as time series of pen movements that have to be mapped to characters. The IAM-OnDB dataset splits into one training set, two validation sets, and one test set, having 775, 192, 216, and 544 boards each. Each board, see Figure 2(a), contains multiple hand-written lines, which in turn consist of several strokes. We use one line per sequence, and […] correction, etc.) was used.

The networks were trained using the Connectionist Temporal Classification (CTC) error function by Graves et al. [39] with 82 outputs (81 characters plus the special empty label). We measure performance in terms of the Character Error Rate (CER) after decoding using best-path decoding [39].
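Best-path decoding for CTC reduces to a frame-wise argmax, collapsing repeated labels, and removing the blank; the CER is then an edit distance. A small sketch (treating index 0 as the blank label is an assumption):

```python
import numpy as np

def best_path_decode(frame_probs, blank=0):
    """frame_probs: (T, K) network outputs over K labels
    (here K = 82: 81 characters plus the special empty/blank label)."""
    path = np.argmax(frame_probs, axis=-1)
    collapsed = [int(k) for i, k in enumerate(path) if i == 0 or k != path[i - 1]]
    return [k for k in collapsed if k != blank]

def character_error_rate(ref, hyp):
    """CER = edit distance between reference and hypothesis label sequences,
    divided by the reference length."""
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1,
                          d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1]))
    return d[-1, -1] / max(len(ref), 1)
```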
JSB Chorales: JSB Chorales is a collection of 382 four-part harmonized chorales by J. S. Bach [40], consisting of 202 chorales in major keys and 180 chorales in minor keys. We used the preprocessed piano-rolls provided by Boulanger-Lewandowski et al. [41].⁵ These piano-rolls were generated by transposing each MIDI sequence to C major or C minor and sampling frames every quarter note. The networks were trained to do next-step prediction by minimizing the negative log-likelihood. The complete dataset consists of 229, 76, and 77 sequences (training, validation, and test sets respectively) with an average length of 61.
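With one sigmoid output per pitch, next-step prediction on a binary piano-roll amounts to a sum of Bernoulli negative log-likelihoods; a short sketch (the averaging over frames is an assumption about the reported normalization):

```python
import numpy as np

def next_step_nll(pred, roll, eps=1e-8):
    """pred: (T-1, P) sigmoid outputs produced from frames 0..T-2;
    roll: (T, P) binary piano-roll. The target for frame t is frame t+1."""
    target = roll[1:]
    ll = target * np.log(pred + eps) + (1 - target) * np.log(1 - pred + eps)
    return -ll.sum(axis=1).mean()   # NLL summed over pitches, averaged over frames
```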
² Note that in linguistics a phone represents a distinct speech sound independent of the language. In contrast, a phoneme refers to a sound that distinguishes two words in a given language [34]. These terms are often confused in the machine learning literature.
³ The dialect sentences (the SA samples) were meant to expose the dialectal variants of the speakers and were read by all 630 speakers. We follow [37] and remove them because they bias the distribution of phones.
⁴ The IAM-OnDB was obtained from http://www.iam.unibe.ch/fki/databases/iam-on-line-handwriting-database
⁵ Available at http://www-etud.iro.umontreal.ca/∼boulanni/icml2012 at the time of writing.
B. Network Architectures & Training

A network with a single LSTM hidden layer and a sigmoid output layer was used for the JSB Chorales task. Bidirectional LSTM [20] was used for the TIMIT and IAM Online tasks, consisting of two hidden layers, one processing the input forwards and the other one backwards in time, both connected to a single softmax output layer. As loss function we employed Cross-Entropy Error for TIMIT and JSB Chorales, while for the IAM Online task the Connectionist Temporal Classification (CTC) loss by Graves et al. [39] was used. The initial weights for all networks were drawn from a normal distribution with standard deviation of 0.1. Training was done using Stochastic Gradient Descent with Nesterov-style momentum [42] with updates after each sequence. The learning rate was rescaled by a factor of (1 − momentum). Gradients were computed using full BPTT for LSTMs [20]. Training stopped after 150 epochs or once there was no improvement on the validation set for more than fifteen epochs.
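The update rule described above, per-sequence SGD with Nesterov-style momentum [42] and the learning rate rescaled by (1 − momentum), can be sketched as follows. Nesterov momentum can be written in several equivalent forms; the look-ahead formulation and the gradient callback used here are illustrative choices, not the authors' code.

```python
import numpy as np

def nesterov_sgd_step(params, velocity, grad_fn, learning_rate, momentum):
    """One per-sequence update. params/velocity: dicts of arrays;
    grad_fn: callback returning gradients for a given parameter setting."""
    lr = learning_rate * (1.0 - momentum)          # learning rate rescaled by (1 - momentum)
    lookahead = {k: p + momentum * velocity[k] for k, p in params.items()}
    grads = grad_fn(lookahead)                     # Nesterov: gradient at the look-ahead point
    for k in params:
        velocity[k] = momentum * velocity[k] - lr * grads[k]
        params[k] = params[k] + velocity[k]
    return params, velocity
```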
C. LSTM Variants

The vanilla LSTM from Section II is referred to as Vanilla (V). For activation functions we follow the standard and use the logistic sigmoid for σ and the hyperbolic tangent for both g and h. The derived eight variants of the V architecture are the following. We only report differences to the forward pass formulas presented in Section II-A:

NIG: No Input Gate: $i^t = 1$
NFG: No Forget Gate: $f^t = 1$
NOG: No Output Gate: $o^t = 1$
NIAF: No Input Activation Function: $g(x) = x$
NOAF: No Output Activation Function: $h(x) = x$
CIFG: Coupled Input and Forget Gate: $f^t = 1 - i^t$
NP: No Peepholes:
$\bar{i}^t = W_i x^t + R_i y^{t-1} + b_i$
$\bar{f}^t = W_f x^t + R_f y^{t-1} + b_f$
$\bar{o}^t = W_o x^t + R_o y^{t-1} + b_o$
FGR: Full Gate Recurrence:
$\bar{i}^t = W_i x^t + R_i y^{t-1} + p_i \odot c^{t-1} + b_i + R_{ii}\, i^{t-1} + R_{fi}\, f^{t-1} + R_{oi}\, o^{t-1}$
$\bar{f}^t = W_f x^t + R_f y^{t-1} + p_f \odot c^{t-1} + b_f + R_{if}\, i^{t-1} + R_{ff}\, f^{t-1} + R_{of}\, o^{t-1}$
$\bar{o}^t = W_o x^t + R_o y^{t-1} + p_o \odot c^{t-1} + b_o + R_{io}\, i^{t-1} + R_{fo}\, f^{t-1} + R_{oo}\, o^{t-1}$

The first six variants are self-explanatory. The CIFG variant uses only one gate for gating both the input and the cell recurrent self-connection – a modification of LSTM referred to as Gated Recurrent Units (GRU) [32]. This is equivalent to setting $f^t = 1 - i^t$ instead of learning the forget gate weights independently. The FGR variant adds recurrent connections between all the gates, as in the original formulation of the LSTM [15]. It adds nine additional recurrent weight matrices, thus significantly increasing the number of parameters.
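The following sketch spells out how the NP, CIFG, and FGR variants change the gate computations relative to the vanilla block; the remaining variants simply replace a gate or an activation function with the identity. The parameter dictionaries and the use of the previous cell state for all peepholes are illustrative simplifications (in the vanilla formulation the output-gate peephole reads the updated cell state).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_gate_activations(x, y_prev, c_prev, W, R, p, b,
                          Rg=None, gates_prev=None,
                          peepholes=True, cifg=False, fgr=False):
    """Gate activations for one block. W, R, p, b map 'i'/'f'/'o' to parameters;
    Rg maps gate pairs such as 'fi' to the extra FGR matrices, and gates_prev
    holds the gate activations of the previous time step."""
    pre = {}
    for g in ("i", "f", "o"):
        pre[g] = W[g] @ x + R[g] @ y_prev + b[g]
        if peepholes:                      # NP: drop these peephole terms
            pre[g] = pre[g] + p[g] * c_prev
        if fgr:                            # FGR: add the nine gate-to-gate recurrences
            for h in ("i", "f", "o"):
                pre[g] = pre[g] + Rg[h + g] @ gates_prev[h]
    i = sigmoid(pre["i"])
    f = 1.0 - i if cifg else sigmoid(pre["f"])   # CIFG: f^t = 1 - i^t
    o = sigmoid(pre["o"])
    return i, f, o
```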
D. Hyperparameter Search

While there are other methods to efficiently search for good hyperparameters (cf. [43, 44]), random search has several advantages for our setting: it is easy to implement, trivial to parallelize, and covers the search space more uniformly, thereby improving the follow-up analysis of hyperparameter importance. … For every trial the following hyperparameters were drawn at random (a sampling sketch is given at the end of this subsection):
• learning rate: log-uniform samples from [10⁻⁶, 10⁻²];
• momentum: 1 − log-uniform samples from [0.01, 1.0];
• standard deviation of Gaussian input noise: uniform samples from [0, 1].

In the case of the TIMIT dataset, two additional (boolean) hyperparameters were considered (not tuned for the other two datasets). The first one was the choice between traditional momentum and Nesterov-style momentum [42]. Our analysis showed that this had no measurable effect on performance, so the latter was arbitrarily chosen for all further experiments. The second one was whether to clip the gradients to the range [−1, 1]. This turned out to hurt overall performance,⁶ therefore the gradients were never clipped in the case of the other two datasets.

Note that, unlike an earlier small-scale study [33], the number of parameters was not kept fixed for all variants. Since different variants can utilize their parameters differently, fixing this number can bias comparisons.

⁶ Although this may very well be the result of the range having been chosen …
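A sketch of drawing one random-search trial from the ranges listed above. The hidden-size range [20, 200] matches the hidden-size axis of Figure 4 but is otherwise an assumption, as are the trial count of 200 per variant and dataset and the hypothetical train_and_evaluate callback.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_trial():
    return {
        "learning_rate": 10 ** rng.uniform(-6, -2),                 # log-uniform in [1e-6, 1e-2]
        "momentum": 1.0 - 10 ** rng.uniform(np.log10(0.01), 0.0),   # 1 - log-uniform in [0.01, 1.0]
        "input_noise_std": rng.uniform(0.0, 1.0),                   # uniform in [0, 1]
        "hidden_size": int(round(10 ** rng.uniform(np.log10(20), np.log10(200)))),
    }

def random_search(train_and_evaluate, n_trials=200):
    """Run independent trials and keep everything for the later fANOVA-style analysis."""
    trials = [sample_trial() for _ in range(n_trials)]
    results = [train_and_evaluate(**t) for t in trials]             # e.g. validation/test error
    return trials, results
```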
V. RESULTS & DISCUSSION

Each of the 5400 experiments was run on one of 128 AMD Opteron CPUs at 2.5 GHz and took 24.3 h on average to complete. This sums up to a total single-CPU computation time of just below 15 years.

For TIMIT, the test set performance of the best trial was 29.6% classification error (CIFG), which is close to the best reported result of 26.9% [20]. Our best result of −8.38 log-likelihood (NIAF) on the JSB Chorales dataset, on the other hand, is well below the −5.56 from Boulanger-Lewandowski et al. [41]. For the IAM Online dataset our best result was a Character Error Rate of 9.26% (NP) on the test set. The best previously published result is 11.5% CER by Graves et al. [45], using a different and much more extensive preprocessing.⁷ Note, though, that the goal of this study is not to provide state-of-the-art results, but to do a fair comparison of different LSTM variants, so these numbers are only meant as a rough orientation for the reader.

A. Comparison of the Variants

A summary of the random search results is shown in Figure 3. Welch's t-test at a significance level of p = 0.05 was used⁸ to determine whether the mean test set performance of each variant was significantly different from that of the baseline. The box for a variant is highlighted in blue if its mean performance differs significantly from the mean performance of the vanilla LSTM.
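The significance test behind the highlighting in Figure 3 is available off the shelf: SciPy's ttest_ind with equal_var=False implements Welch's t-test. A sketch with illustrative variable names:

```python
from scipy import stats

def differs_from_vanilla(variant_errors, vanilla_errors, alpha=0.05):
    """Welch's t-test on the test-set performances of one variant vs. the vanilla
    LSTM; returns whether the means differ at significance level alpha."""
    t_stat, p_value = stats.ttest_ind(variant_errors, vanilla_errors, equal_var=False)
    return p_value < alpha, p_value
```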
Figure 3. Test set performance for all 200 trials (top) and for the best 10% (bottom) trials (according to the validation set) for each dataset and variant. Boxes
show the range between the 25th and the 75th percentile of the data, while the whiskers indicate the whole range. The red dot represents the mean and the red
line the median of the data. The boxes of variants that differ significantly from the vanilla LSTM are shown in blue with thick lines. The grey histogram in the
background presents the average number of parameters for the top 10% performers of every variant.
The results in the top half of Figure 3 represent the distribution of all 200 test set performances over the whole search space. Any conclusions drawn from them are therefore specific to our choice of search ranges. We have tried to choose reasonable ranges for the hyperparameters that include the best settings for each variant and are still small enough to allow for an effective search. The means and variances tend to be rather similar for the different variants and datasets, but even here some significant differences can be found.

In order to draw some more interesting conclusions we restrict our further analysis to the top 10% performing trials for each combination of dataset and variant (see bottom half of Figure 3). This way our findings will be less dependent on the chosen search space and will be representative for the case of “reasonable hyperparameter tuning efforts.”⁹

The first important observation based on Figure 3 is that removing the output activation function (NOAF) or the forget gate (NFG) significantly hurt performance on all three datasets. Apart from the CEC, the ability to forget old information and the squashing of the cell state appear to be critical for the LSTM architecture. Indeed, without the output activation function, the block output can in principle grow unbounded. Coupling the input and the forget gate avoids this problem and might render the use of an output non-linearity less important, which could explain why GRU performs well without it.

Input and forget gate coupling (CIFG) did not significantly change mean performance on any of the datasets, although the best performance improved slightly on music modeling. Similarly, removing peephole connections (NP) also did not lead to significant changes, but the best performance improved slightly for handwriting recognition. Both of these variants simplify LSTMs and reduce the computational complexity, so it might be worthwhile to incorporate these changes into the architecture.

Adding full gate recurrence (FGR) did not significantly change performance on TIMIT or IAM Online, but led to worse results on the JSB Chorales dataset. Given that this variant greatly increases the number of parameters, we generally advise against using it. Note that this feature was present in the original proposal of LSTM [14, 15], but has been absent in all following studies.

Removing the input gate (NIG), the output gate (NOG), and the input activation function (NIAF) led to a significant reduction in performance on speech and handwriting recognition. However, there was no significant effect on music modeling performance. A small (but statistically insignificant) average performance improvement was observed for the NIG and NIAF architectures on music modeling. We hypothesize that these behaviors will generalize to similar problems such as language modeling. For supervised learning on continuous real-valued data (such as speech and handwriting recognition), the input gate, output gate, and input activation function are all crucial for obtaining good performance.

⁹ How much effort is “reasonable” will still depend on the search space. If the ranges are chosen much larger, the search will take much longer to find good hyperparameters.
B. Impact of Hyperparameters

The fANOVA framework for assessing hyperparameter importance by Hutter et al. [19] is based on the observation that marginalizing over dimensions can be done efficiently in regression trees. This allows predicting the marginal error for one hyperparameter while averaging over all the others. Traditionally this would require a full hyperparameter grid search, whereas here the hyperparameter space can be sampled at random.

Average performance for any slice of the hyperparameter space is obtained by first training a regression tree and then summing over its predictions along the corresponding subset of dimensions. To be precise, a random regression forest of 100 trees is trained and their prediction performance is averaged. This improves the generalization and allows for an estimation of uncertainty of those predictions. The obtained marginals can then be used to decompose the variance into additive components using the functional ANalysis Of VAriance (fANOVA) method [46], which provides an insight into the overall importance of hyperparameters and their interactions.
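As a rough stand-in for this procedure, one can fit a random regression forest of 100 trees on (hyperparameter setting → test set error) with scikit-learn and average its predictions over the remaining dimensions. The Monte-Carlo averaging below only approximates the exact tree-based marginalization that fANOVA [19, 46] performs, and the spread over trees serves as the uncertainty estimate mentioned above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def marginal_curve(X, y, dim, grid, seed=0):
    """X: (n_trials, n_hyperparams) sampled settings; y: observed test error.
    Returns mean and std over trees of the predicted marginal error along `grid`."""
    forest = RandomForestRegressor(n_estimators=100, random_state=seed).fit(X, y)
    background = X.copy()
    means, stds = [], []
    for value in grid:
        background[:, dim] = value                  # slice the space at this value
        per_tree = np.array([t.predict(background).mean() for t in forest.estimators_])
        means.append(per_tree.mean())
        stds.append(per_tree.std())                 # spread over trees ~ reliability of the marginal
    return np.array(means), np.array(stds)
```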
Learning rate: Learning rate is the most important hyperparameter, therefore it is very important to understand how to set it correctly in order to achieve good performance. Figure 4 shows (in blue) how setting the learning rate value affects the predicted average performance on the test set. It is important to note that this is an average over all other hyperparameters and over all the trees in the regression forest. The shaded area around the curve indicates the standard deviation over tree predictions (not over other hyperparameters), thus quantifying the reliability of the average. The same is shown in green with the predicted average training time.

The plots in Figure 4 show that the optimal value for the learning rate is dependent on the dataset. For each dataset, there is a large basin (up to two orders of magnitude) of good learning rates inside of which the performance does not vary much. A related but unsurprising observation is that there is a sweet spot for the learning rate at the high end of the basin.¹⁰ In this region, the performance is good and the training time is small. So while searching for a good learning rate for the LSTM, it is sufficient to do a coarse search by starting with a high value (e.g. 1.0) and dividing it by ten until performance stops increasing.
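That recipe in code form; train_and_evaluate is a hypothetical callback that trains a (small) network with the given learning rate and returns a validation error, lower being better:

```python
def coarse_learning_rate_search(train_and_evaluate, start=1.0, max_steps=6):
    """Start high (e.g. 1.0) and divide by ten until performance stops improving."""
    best_lr, best_error, lr = None, float("inf"), start
    for _ in range(max_steps):
        error = train_and_evaluate(lr)
        if error >= best_error:        # performance stopped improving
            break
        best_lr, best_error = lr, error
        lr /= 10.0
    return best_lr
```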
Figure 5 also shows that the fraction of variance caused by the learning rate is much bigger than the fraction due to interaction between learning rate and hidden layer size (some part of the “higher order” piece; for more see below at Interaction of Hyperparameters). This suggests that the learning rate can be quickly tuned on a small network and then used to train a large one.

¹⁰ Note that it is unfortunately outside the investigated range for IAM Online and JSB Chorales. This means that ideally we should have chosen the range of learning rates to include higher values as well.

Hidden Layer Size: Not surprisingly, the hidden layer size is an important hyperparameter affecting the LSTM network performance. As expected, larger networks perform better, but with diminishing returns. It can also be seen in Figure 4 (middle, green) that the required training time increases with the network size. Note that the scale here is wall-time and thus factors in both the increased computation time for each epoch as well as the convergence speed.

Input Noise: Additive Gaussian noise on the inputs, a traditional regularizer for neural networks, has been used for LSTM as well. However, we find that not only does it almost always hurt performance, it also slightly increases training times. The only exception is TIMIT, where a small dip in error for the range of [0.2, 0.5] is observed.

Momentum: One unexpected result of this study is that momentum affects neither performance nor training time in any significant way. This follows from the observation that for none of the datasets did momentum account for more than 1% of the variance of test set performance. It should be noted that for TIMIT the interaction between learning rate and momentum accounts for 2.5% of the total variance, but as with learning rate × hidden size (cf. Interaction of Hyperparameters below) it does not reveal any interpretable structure. This may be the result of our choice to scale learning rates dependent on momentum (Section IV-B). These observations suggest that momentum does not offer substantial benefits when training LSTMs with online stochastic gradient descent.

Analysis of Variance: Figure 5 shows what fraction of the test set performance variance can be attributed to different hyperparameters. It is obvious that the learning rate is by far the most important hyperparameter, always accounting for more than two thirds of the variance. The next most important hyperparameter is the hidden layer size, followed by the input noise, leaving the momentum with less than one percent of the variance. Higher order interactions play an important role in the case of TIMIT, but are much less important for the other two datasets.

Interaction of Hyperparameters: Some hyperparameters interact with each other, resulting in different performance from what could be expected by looking at them individually. As shown in Figure 5, all these interactions together explain between 5% and 20% of the variance in test set performance. Understanding these interactions might allow us to speed up the search for good combinations of hyperparameters. To that end we visualize the interaction between all pairs of hyperparameters in Figure 6. Each heat map in the left part shows marginal performance for different values of the respective two hyperparameters. This is the average performance predicted by the decision forest when marginalizing over all other hyperparameters. So each one is the 2D version of the performance plots from Figure 4.

The right side employs the idea of ANOVA to better illustrate the interaction between the hyperparameters. This means that the variance of performance that can be explained by varying a single hyperparameter has been removed. In case two hyperparameters do not interact at all (are perfectly independent), that residual would thus be all zero (grey).
(Figure 4 graphic omitted. Axes: learning rate on a log scale from 10⁻⁶ to 10⁻², hidden size from 20 to 200, and input noise standard deviation from 0 to 1; the blue curves show the error measure for each dataset (Classification Error, Character Error Rate, Negative Log-Likelihood) and the green curves show total training time in hours.)
Figure 4. Predicted marginal error (blue) and marginal time for different values of the learning rate, hidden size, and the input noise (columns) for the test set
of all three datasets (rows). The shaded area indicates the standard deviation between the tree-predicted marginals and thus the reliability of the predicted mean
performance. Note that each plot is for the vanilla LSTM but curves for all variants that are not significantly worse look very similar.
For example, looking at the pair hidden size and learning rate on the left side for the TIMIT dataset, we can see that performance varies strongly along the x-axis (learning rate), first decreasing and then increasing again. This is what we would expect knowing the valley-shape of the learning rate from Figure 4. Along the y-axis (hidden size) performance seems to decrease slightly from top to bottom. Again this is roughly what we would expect from the hidden size plot in Figure 4.

On the right side of Figure 6 we can see for the same pair of hyperparameters how their interaction differs from the case of them being completely independent. This heat map exhibits less structure, and it may in fact be the case that we would need more samples to properly analyze the interplay between them. However, given our observations so far this might not be worth the effort. In any case, it is clear from the plot on the left that varying the hidden size does not change the region of optimal learning rate.

One clear interaction pattern can be observed in the IAM Online and JSB datasets between learning rate and input noise. Here it can be seen that for high learning rates (≳ 10⁻⁴) lower input noise (≲ 0.5) is better, as also observed in the marginals from Figure 4. But this trend reverses for lower learning rates, where higher values of input noise are beneficial. Though interesting, this is not of any practical relevance, because performance is generally bad in that region of low learning rates. Apart from this, however, it is difficult to discern any regularities in the analyzed hyperparameter interactions. We conclude that there is little practical value in attending to the interplay between hyperparameters. So for practical purposes hyperparameters can be treated as approximately independent and thus optimized separately.

Figure 5. Pie charts showing which fraction of variance of the test set performance can be attributed to each of the hyperparameters. The percentage of variance that is due to interactions between multiple parameters is indicated as “higher order.”
Figure 6. Total marginal predicted performance for all pairs of hyperparameters (left) and the variation only due to their interaction (right). The plot is divided vertically into three subplots, one for every dataset (TIMIT, IAM Online, and JSB Chorales). The subplots themselves are divided horizontally into two parts, each containing a lower triangular matrix of heat maps. The rows and columns of these matrices represent the different hyperparameters (learning rate, momentum, hidden size, and input noise) and there is one heat map for every combination. The color encodes the performance as measured by the Classification Error for TIMIT, Character Error Rate for IAM Online, and Negative Log-Likelihood for the JSB Chorales dataset. For all datasets low (blue) is better than high (red).
VI. CONCLUSION

This paper reports the results of a large-scale study on variants of the LSTM architecture. We conclude that the most commonly used LSTM architecture (vanilla LSTM) performs reasonably well on various datasets. None of the eight investigated modifications significantly improves performance. However, certain modifications, such as coupling the input and forget gates (CIFG) or removing peephole connections (NP), simplified LSTMs in our experiments without significantly decreasing performance. These two variants are also attractive because they reduce the number of parameters and the computational cost of the LSTM.

The forget gate and the output activation function are the most critical components of the LSTM block. Removing any of them significantly impairs performance. We hypothesize that the output activation function is needed to prevent the unbounded cell state from propagating through the network and destabilizing learning. This would explain why the LSTM variant GRU can perform reasonably well without it: its cell state is bounded because of the coupling of input and forget gate.

As expected, the learning rate is the most crucial hyperparameter, followed by the network size. Surprisingly though, the use of momentum was found to be unimportant in our setting of online gradient descent. Gaussian noise on the inputs was found to be moderately helpful for TIMIT, but harmful for the other datasets.

The analysis of hyperparameter interactions revealed no apparent structure. Furthermore, even the highest measured interaction (between learning rate and network size) is quite small. This implies that for practical purposes the hyperparameters can be treated as approximately independent. In particular, the learning rate can be tuned first using a fairly small network, thus saving a lot of experimentation time.

Neural networks can be tricky to use for many practitioners compared to other methods whose properties are already well understood. This has remained a hurdle for newcomers to the field, since a lot of practical choices are based on the intuitions of experts, as well as experiences gained over time. With this study, we have attempted to back some of these intuitions with experimental results. We have also presented new insights, both on architecture selection and hyperparameter tuning, for LSTM networks, which have emerged as the method of choice for solving complex sequence learning problems. In future work, we plan to explore more complex modifications of the LSTM architecture.
REFERENCES

[1] Sepp Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Master's thesis, Technische Universität München, München, 1991.
[2] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In S. C. Kremer and J. F. Kolen, editors, A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press, 2001.
[3] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber. A Novel Connectionist System for Improved Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5), 2009.
[4] Vu Pham, Théodore Bluche, Christopher Kermorvant, and Jérôme Louradour. Dropout improves Recurrent Neural Networks for Handwriting Recognition. arXiv:1312.4569 [cs], November 2013. URL http://arxiv.org/abs/1312.4569.
[5] Patrick Doetsch, Michal Kozielski, and Hermann Ney. Fast and robust training of recurrent neural networks for offline handwriting recognition. In 14th International Conference on Frontiers in Handwriting Recognition, 2014. URL http://people.sabanciuniv.edu/berrin/cs581/Papers/icfhr2014/data/4334a279.pdf.
[6] Alex Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850 [cs], August 2013. URL http://arxiv.org/abs/1308.0850.
[7] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent Neural Network Regularization. arXiv:1409.2329 [cs], September 2014. URL http://arxiv.org/abs/1409.2329.
[8] Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. Addressing the Rare Word Problem in Neural Machine Translation. arXiv:1410.8206, 2014. URL http://arxiv.org/abs/1410.8206.
[9] Hasim Sak, Andrew Senior, and Françoise Beaufays. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2014. URL http://193.6.4.39/∼czap/letoltes/IS14/IS2014/PDF/AUTHOR/IS141304.PDF.
[10] Yuchen Fan, Yao Qian, Fenglong Xie, and Frank K. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014.
[11] Søren Kaae Sønderby and Ole Winther. Protein Secondary Structure Prediction with Long Short Term Memory Networks. arXiv:1412.7828 [cs, q-bio], December 2014. URL http://arxiv.org/abs/1412.7828.
[12] E. Marchi, G. Ferroni, F. Eyben, L. Gabrielli, S. Squartini, and B. Schuller. Multi-resolution linear prediction based features for audio onset detection with bidirectional LSTM neural networks. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2164–2168, May 2014. doi: 10.1109/ICASSP.2014.6853982.
[13] Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. arXiv:1411.4389 [cs], November 2014. URL http://arxiv.org/abs/1411.4389.
[14] Sepp Hochreiter and Jürgen Schmidhuber. Long Short Term Memory. Technical Report FKI-207-95, Technische Universität München, München, August 1995. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.51.3117.
[15] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, November 1997. doi: 10.1162/neco.1997.9.8.1735. URL http://www.bioinf.jku.at/publications/older/2604.pdf.
[16] R. L. Anderson. Recent Advances in Finding Best Operating Conditions. Journal of the American Statistical Association, 48(264):789–798, December 1953. doi: 10.2307/2281072. URL http://www.jstor.org/stable/2281072.
[17] Francisco J. Solis and Roger J.-B. Wets. Minimization by Random Search Techniques. Mathematics of Operations Research, 6(1):19–30, February 1981. doi: 10.1287/moor.6.1.19. URL http://pubsonline.informs.org/doi/abs/10.1287/moor.6.1.19.
[18] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 13(1):281–305, 2012. URL http://dl.acm.org/citation.cfm?id=2188395.
[19] Frank Hutter, Holger Hoos, and Kevin Leyton-Brown. An Efficient Approach for Assessing Hyperparameter Importance. Pages 754–762, 2014. URL http://jmlr.org/proceedings/papers/v32/hutter14.html.
[20] Alex Graves and Jürgen Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5–6):602–610, July 2005. doi: 10.1016/j.neunet.2005.06.042. URL http://www.sciencedirect.com/science/article/pii/S0893608005001206.
[21] Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. In Artificial Neural Networks, 1999. ICANN 99. Ninth International Conference on (Conf. Publ. No. 470), volume 2, pages 850–855, 1999.
[22] Felix A. Gers and Jürgen Schmidhuber. Recurrent nets that time and count. In Neural Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on, volume 3, pages 189–194. IEEE, 2000.
[23] A. J. Robinson and Frank Fallside. The utility driven dynamic error propagation network. University of Cambridge Department of Engineering, 1987.
[24] R. J. Williams. Complexity of exact gradient computation algorithms for recurrent neural networks. Technical Report NU-CCS-89-27, Boston: Northeastern University, College of Computer Science, 1989.
[25] P. J. Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1, 1988.
[26] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren. DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM. National Institute of Standards and Technology, NTIS Order No PB91-505065, 1993.
[27] Felix A. Gers, Juan Antonio Pérez-Ortiz, Douglas Eck, and Jürgen Schmidhuber. DEKF-LSTM. In ESANN 2002, Proceedings of the 10th European Symposium on Artificial Neural Networks, 2002.
[28] J. Schmidhuber, D. Wierstra, M. Gagliolo, and F. J. Gomez. Training Recurrent Networks by EVOLINO. Neural Computation, 19(3):757–779, 2007.
[29] Justin Bayer, Daan Wierstra, Julian Togelius, and Jürgen Schmidhuber. Evolving memory cell structures for sequence learning. In Artificial Neural Networks – ICANN 2009, pages 755–764. Springer, 2009. URL http://link.springer.com/chapter/10.1007/978-3-642-04277-5_76.
[30] Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. An empirical exploration of recurrent network architectures. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 2342–2350, 2015.
[31] Sebastian Otte, Marcus Liwicki, and Andreas Zell. Dynamic Cortex Memory: Enhancing Recurrent Neural Networks for Gradient-Based Sequence Learning. In Artificial Neural Networks and Machine Learning – ICANN 2014, number 8681 in Lecture Notes in Computer Science, pages 1–8. Springer International Publishing, September 2014. URL http://link.springer.com/chapter/10.1007/978-3-319-11179-7_1.
[32] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv:1406.1078, 2014. URL http://arxiv.org/abs/1406.1078.
[33] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv:1412.3555 [cs], December 2014. URL http://arxiv.org/abs/1412.3555.
[34] David Crystal. Dictionary of Linguistics and Phonetics, volume 30. John Wiley & Sons, 2011.
[35] P. Mermelstein. Distance measures for speech recognition: Psychological and instrumental. In C. H. Chen, editor, Pattern Recognition and Artificial Intelligence, pages 374–388. Academic Press, New York, 1976.
[36] Alexander Graves. Supervised Sequence Labelling with Recurrent Neural Networks. PhD thesis, Technical University of Munich, 2008.
[37] Andrew K. Halberstadt. Heterogeneous acoustic measurements and multiple classifiers for speech recognition. PhD thesis, Massachusetts Institute of Technology, 1998.
[38] Marcus Liwicki and Horst Bunke. IAM-OnDB: an on-line English sentence database acquired from handwritten text on a whiteboard. In Document Analysis and Recognition, 2005. Proceedings. Eighth International Conference on, pages 956–961. IEEE, 2005.
[39] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376, 2006. URL http://dl.acm.org/citation.cfm?id=1143891.
[40] Moray Allan and Christopher K. I. Williams. Harmonising chorales by probabilistic inference. Advances in Neural Information Processing Systems, 17:25–32, 2005.
[41] Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription. Pages 1159–1166, 2012. URL http://icml.cc/discuss/2012/590.html.
[42] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In JMLR, pages 1139–1147, 2013. URL http://jmlr.org/proceedings/papers/v28/sutskever13.html.
[43] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian Optimization of Machine Learning Algorithms. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2951–2959. Curran Associates, Inc., 2012.
[44] F. Hutter, H. H. Hoos, and K. Leyton-Brown. Sequential Model-Based Optimization for General Algorithm Configuration. In Proc. of LION-5, pages 507–523, 2011.
[45] Alex Graves, Marcus Liwicki, Horst Bunke, Jürgen Schmidhuber, and Santiago Fernández. Unconstrained on-line handwriting recognition with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 577–584, 2008.
[46] Giles Hooker. Generalized Functional ANOVA Diagnostics for High-Dimensional Functions of Dependent Variables. Journal of Computational and Graphical Statistics, 16(3):709–732, September 2007. doi: 10.1198/106186007X237892. URL http://www.tandfonline.com/doi/abs/10.1198/106186007X237892.
Klaus Greff received his Diploma in Computer Science from the University of Kaiserslautern, Germany in 2011. Currently he is pursuing his PhD at IDSIA in Lugano, Switzerland, under the supervision of Prof. Jürgen Schmidhuber in the field of Machine Learning. His research interests include Sequence Learning and Recurrent Neural Networks.

Bas R. Steunebrink is a postdoctoral researcher at the Swiss AI lab IDSIA. He received his PhD in 2010 from Utrecht University, the Netherlands. Bas's interests and expertise include Artificial General Intelligence (AGI), cognitive robotics, machine learning, resource-constrained control, and affective computing.