

LSTM: A Search Space Odyssey


Klaus Greff, Rupesh K. Srivastava, Jan Koutník, Bas R. Steunebrink, Jürgen Schmidhuber

Abstract—Several variants of the Long Short-Term Memory (LSTM) architecture for recurrent neural networks have been proposed since its inception in 1995. In recent years, these networks have become the state-of-the-art models for a variety of machine learning problems. This has led to a renewed interest in understanding the role and utility of various computational components of typical LSTM variants. In this paper, we present the first large-scale analysis of eight LSTM variants on three representative tasks: speech recognition, handwriting recognition, and polyphonic music modeling. The hyperparameters of all LSTM variants for each task were optimized separately using random search, and their importance was assessed using the powerful fANOVA framework. In total, we summarize the results of 5400 experimental runs (≈ 15 years of CPU time), which makes our study the largest of its kind on LSTM networks. Our results show that none of the variants can improve upon the standard LSTM architecture significantly, and demonstrate the forget gate and the output activation function to be its most critical components. We further observe that the studied hyperparameters are virtually independent and derive guidelines for their efficient adjustment.

Index Terms—Recurrent neural networks, Long Short-Term Memory, LSTM, sequence learning, random search, fANOVA.

arXiv:1503.04069v2 [cs.NE] 4 Oct 2017. © 2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Manuscript received May 15, 2015; revised March 17, 2016; accepted June 9, 2016. Date of publication July 8, 2016; date of current version June 20, 2016. DOI: 10.1109/TNNLS.2016.2582924. This research was supported by the Swiss National Science Foundation grants "Theory and Practice of Reinforcement Learning 2" (#138219) and "Advanced Reinforcement Learning" (#156682), and by EU projects "NASCENCE" (FP7-ICT-317662), "NeuralDynamics" (FP7-ICT-270247), and "WAY" (FP7-ICT-288551). K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber are with the Istituto Dalle Molle di studi sull'Intelligenza Artificiale (IDSIA), the Scuola universitaria professionale della Svizzera italiana (SUPSI), and the Università della Svizzera italiana (USI). Author e-mail addresses: {klaus, rupesh, hkou, bas, juergen}@idsia.ch

I. INTRODUCTION

Recurrent neural networks with Long Short-Term Memory (which we will concisely refer to as LSTMs) have emerged as an effective and scalable model for several learning problems related to sequential data. Earlier methods for attacking these problems have either been tailored towards a specific problem or did not scale to long time dependencies. LSTMs, on the other hand, are both general and effective at capturing long-term temporal dependencies. They do not suffer from the optimization hurdles that plague simple recurrent networks (SRNs) [1, 2] and have been used to advance the state of the art for many difficult problems. This includes handwriting recognition [3–5] and generation [6], language modeling [7] and translation [8], acoustic modeling of speech [9], speech synthesis [10], protein secondary structure prediction [11], and analysis of audio [12] and video data [13], among others.

The central idea behind the LSTM architecture is a memory cell, which can maintain its state over time, and non-linear gating units, which regulate the information flow into and out of the cell. Most modern studies incorporate many improvements that have been made to the LSTM architecture since its original formulation [14, 15]. However, LSTMs are now applied to many learning problems which differ significantly in scale and nature from the problems that these improvements were initially tested on. A systematic study of the utility of the various computational components which comprise LSTMs (see Figure 1) was missing. This paper fills that gap and systematically addresses the open question of improving the LSTM architecture.

We evaluate the most popular LSTM architecture (vanilla LSTM; Section II) and eight different variants thereof on three benchmark problems: acoustic modeling, handwriting recognition, and polyphonic music modeling. Each variant differs from the vanilla LSTM by a single change. This allows us to isolate the effect of each of these changes on the performance of the architecture. Random search [16–18] is used to find the best-performing hyperparameters for each variant on each problem, enabling a reliable comparison of the performance of the different variants. We also provide insights gained about hyperparameters and their interaction using fANOVA [19].

II. VANILLA LSTM

The LSTM setup most commonly used in the literature was originally described by Graves and Schmidhuber [20]. We refer to it as vanilla LSTM and use it as a reference for comparison of all the variants. The vanilla LSTM incorporates changes by Gers et al. [21] and Gers and Schmidhuber [22] into the original LSTM [15] and uses full gradient training. Section III provides descriptions of these major LSTM changes.

A schematic of the vanilla LSTM block can be seen in Figure 1. It features three gates (input, forget, and output), the block input, a single cell (the Constant Error Carousel), an output activation function, and peephole connections.¹ The output of the block is recurrently connected back to the block input and to all of the gates.

¹ Some studies omit peephole connections, described in Section III-B.
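As a reading aid for the block description above, the forward pass of such a block can be written out. The following is a minimal sketch assuming the standard vanilla-LSTM-with-peepholes formulation, not a reproduction of the paper's own equations: x^t denotes the input at time t, y^t the block output, c^t the cell state, W, R, p, and b the input, recurrent, peephole, and bias parameters, σ the gate nonlinearity, and g and h the input and output activation functions.

\begin{align*}
\mathbf{z}^t &= g(\mathbf{W}_z \mathbf{x}^t + \mathbf{R}_z \mathbf{y}^{t-1} + \mathbf{b}_z) && \text{block input} \\
\mathbf{i}^t &= \sigma(\mathbf{W}_i \mathbf{x}^t + \mathbf{R}_i \mathbf{y}^{t-1} + \mathbf{p}_i \odot \mathbf{c}^{t-1} + \mathbf{b}_i) && \text{input gate} \\
\mathbf{f}^t &= \sigma(\mathbf{W}_f \mathbf{x}^t + \mathbf{R}_f \mathbf{y}^{t-1} + \mathbf{p}_f \odot \mathbf{c}^{t-1} + \mathbf{b}_f) && \text{forget gate} \\
\mathbf{c}^t &= \mathbf{z}^t \odot \mathbf{i}^t + \mathbf{c}^{t-1} \odot \mathbf{f}^t && \text{cell (Constant Error Carousel)} \\
\mathbf{o}^t &= \sigma(\mathbf{W}_o \mathbf{x}^t + \mathbf{R}_o \mathbf{y}^{t-1} + \mathbf{p}_o \odot \mathbf{c}^{t} + \mathbf{b}_o) && \text{output gate} \\
\mathbf{y}^t &= h(\mathbf{c}^t) \odot \mathbf{o}^t && \text{block output}
\end{align*}

In this sketch the input and forget gates peek at the previous cell state c^{t-1} while the output gate peeks at the updated state c^t, and the appearance of y^{t-1} in every equation reflects the statement above that the block output is recurrently connected back to the block input and all of the gates.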
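Looking back at the evaluation protocol in the Introduction (an independent random search over hyperparameters for each variant on each task, with the resulting trials analysed by fANOVA), the search loop itself is straightforward to picture. The sketch below is illustrative only: the hyperparameter names, ranges, trial count, and the train_and_evaluate callback are placeholders and do not reflect the search space actually used in the study.

import random

# Hypothetical search space -- the study's actual hyperparameters and
# ranges are defined later in the paper, not here.
def sample_hyperparameters():
    return {
        "hidden_size": random.choice([32, 64, 128, 256]),
        "learning_rate": 10 ** random.uniform(-6, -2),  # log-uniform draw
        "momentum": random.uniform(0.0, 0.99),
        "input_noise_std": random.uniform(0.0, 1.0),
    }

def random_search(train_and_evaluate, num_trials=200):
    """Draw independent hyperparameter samples, train once per sample,
    and keep every (error, hyperparameters) pair for later analysis."""
    trials = []
    for _ in range(num_trials):
        params = sample_hyperparameters()
        error = train_and_evaluate(params)  # validation error; lower is better
        trials.append((error, params))
    best = min(trials, key=lambda t: t[0])
    # All trials, not only the best one, would feed an fANOVA-style
    # analysis of hyperparameter importance.
    return best, trials

Because each draw is independent, the same loop can be run separately for every variant-task pair, which is what makes a like-for-like comparison of the variants possible.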
