
Listen, Attend and Spell

William Chan                              Navdeep Jaitly, Quoc V. Le, Oriol Vinyals
Carnegie Mellon University                Google Brain
williamchan@cmu.edu                       {ndjaitly,qvl,vinyals}@google.com
arXiv:1508.01211v2 [cs.CL] 20 Aug 2015

Abstract

We present Listen, Attend and Spell (LAS), a neural network that learns to tran-
scribe speech utterances to characters. Unlike traditional DNN-HMM models, this
model learns all the components of a speech recognizer jointly. Our system has
two components: a listener and a speller. The listener is a pyramidal recurrent net-
work encoder that accepts filter bank spectra as inputs. The speller is an attention-
based recurrent network decoder that emits characters as outputs. The network
produces character sequences without making any independence assumptions be-
tween the characters. This is the key improvement of LAS over previous end-to-
end CTC models. On a subset of the Google voice search task, LAS achieves a
word error rate (WER) of 14.1% without a dictionary or a language model, and
10.3% with language model rescoring over the top 32 beams. By comparison, the
state-of-the-art CLDNN-HMM model achieves a WER of 8.0%.

1 Introduction
Deep Neural Networks (DNNs) have led to improvements in various components of speech recog-
nizers. They are commonly used in hybrid DNN-HMM speech recognition systems for acoustic
modeling [1, 2, 3, 4, 5, 6]. DNNs have also produced significant gains in pronunciation models that
map words to phoneme sequences [7, 8]. In language modeling, recurrent models have been shown
to improve speech recognition accuracy by rescoring n-best lists [9]. Traditionally these compo-
nents – acoustic, pronunciation and language models – have all been trained separately, each with a
different objective. Recent work in this area attempts to rectify this disjoint training issue by design-
ing models that are trained end-to-end – from speech directly to transcripts [10, 11, 12, 13, 14, 15].
Two main approaches for this are Connectionist Temporal Classification (CTC) [10] and sequence
to sequence models with attention [16]. Both of these approaches have limitations that we try to
address: CTC assumes that the label outputs are conditionally independent of each other; whereas
the sequence to sequence approach has only been applied to phoneme sequences [14, 15], and not
trained end-to-end for speech recognition.
In this paper we introduce Listen, Attend and Spell (LAS), a neural network that improves upon the
previous attempts [12, 14, 15]. The network learns to transcribe an audio sequence signal to a word
sequence, one character at a time. Unlike previous approaches, LAS does not make independence
assumptions in the label sequence and it does not rely on HMMs. LAS is based on the sequence to
sequence learning framework with attention [17, 18, 16, 14, 15]. It consists of an encoder recurrent
neural network (RNN), which is named the listener, and a decoder RNN, which is named the speller.
The listener is a pyramidal RNN that converts low level speech signals into higher level features.
The speller is an RNN that converts these higher level features into output utterances by specifying
a probability distribution over sequences of characters using the attention mechanism [16, 14, 15].
The listener and the speller are trained jointly.
Key to our approach is the fact that we use a pyramidal RNN model for the listener, which reduces
the number of time steps that the attention model has to extract relevant information from. Rare and
out-of-vocabulary (OOV) words are handled automatically, since the model outputs the character
sequence, one character at a time. Another advantage of modeling characters as outputs is that the
network is able to generate multiple spelling variants naturally. For example, for the phrase “triple a”
the model produces both “triple a” and “aaa” in the top beams (see section 4.5). A model like CTC
may have trouble producing such diverse transcripts for the same utterance because of conditional
independence assumptions between frames.
In our experiments, we find that these components are necessary for LAS to work well. Without the
attention mechanism, the model overfits the training data significantly, in spite of our large training
set of three million utterances - it memorizes the training transcripts without paying attention to the
acoustics. Without the pyramid structure in the encoder side, our model converges too slowly - even
after a month of training, the error rates were significantly higher than the errors we report here.
Both of these problems arise because the acoustic signals can have hundreds to thousands of frames
which makes it difficult to train the RNNs. Finally, to reduce the overfitting of the speller to the
training transcripts, we use a sampling trick during training [19].
With these improvements, LAS achieves 14.1% WER on a subset of the Google voice search task,
without a dictionary or a language model. When combined with language model rescoring, LAS
achieves 10.3% WER. By comparison, the Google state-of-the-art CLDNN-HMM system achieves
8.0% WER on the same data set [20].

2 Related Work

Even though deep networks have been successfully used in many applications, until recently, they
have mainly been used in classification: mapping a fixed-length vector to an output category [21].
For structured problems, such as mapping one variable-length sequence to another variable-length
sequence, neural networks have to be combined with other sequential models such as Hidden
Markov Models (HMMs) [22] and Conditional Random Fields (CRFs) [23]. A drawback of this
combining approach is that the resulting models cannot be easily trained end-to-end and they make
simplistic assumptions about the probability distribution of the data.
Sequence to sequence learning is a framework that attempts to address the problem of learning
variable-length input and output sequences [17]. It uses an encoder RNN to map the sequential
variable-length input into a fixed-length vector. A decoder RNN then uses this vector to produce
the variable-length output sequence, one token at a time. During training, the model feeds the
groundtruth labels as inputs to the decoder. During inference, the model performs a beam search to
generate suitable candidates for next step predictions.
Sequence to sequence models can be improved significantly by the use of an attention mechanism
that provides the decoder RNN more information when it produces the output tokens [16]. At each
output step, the last hidden state of the decoder RNN is used to generate an attention vector over
the input sequence of the encoder. The attention vector is used to propagate information from the
encoder to the decoder at every time step, instead of just once, as with the original sequence to
sequence model [17]. This attention vector can be thought of as skip connections that allow the
information and the gradients to flow more effectively in an RNN.
The sequence to sequence framework has been used extensively for many applications: machine
translation [24, 25], image captioning [26, 27], parsing [28] and conversational modeling [29]. The
generality of this framework suggests that speech recognition can also be a direct application [14,
15].

3 Model

In this section, we will formally describe LAS, which accepts acoustic features as inputs and emits
English characters as outputs. Let x = (x_1, ..., x_T) be our input sequence of filter bank spectra
features, and let y = (⟨sos⟩, y_1, ..., y_S, ⟨eos⟩), with
y_i ∈ {a, b, c, ..., z, 0, ..., 9, ⟨space⟩, ⟨comma⟩, ⟨period⟩, ⟨apostrophe⟩, ⟨unk⟩},
be the output sequence of characters. Here ⟨sos⟩ and ⟨eos⟩ are the special start-of-sentence and
end-of-sentence tokens, respectively.

We want to model each character output y_i as a conditional distribution over the previous characters
y_{<i} and the input signal x using the chain rule:

P(y | x) = Π_i P(y_i | x, y_{<i})                                            (1)

Our Listen, Attend and Spell (LAS) model consists of two sub-modules: the listener and the speller.
The listener is an acoustic model encoder, whose key operation is Listen. The speller is an attention-
based character decoder, whose key operation is AttendAndSpell. The Listen function transforms
the original signal x into a high level representation h = (h_1, ..., h_U) with U ≤ T, while the
AttendAndSpell function consumes h and produces a probability distribution over character
sequences:

h = Listen(x)                                                                (2)
P(y | x) = AttendAndSpell(h, y)                                              (3)

Figure 1 visualizes LAS with these two components. We provide more details of these components
in the following sections.

[Figure 1: schematic of the LAS model. The speller (top) uses decoder states s_i, context vectors c_i
produced by AttentionContext from h and s_i, and a CharacterDistribution over the grapheme characters
y_i; the listener (bottom) is a pyramidal BLSTM that encodes the long input sequence x = (x_1, ..., x_T)
into the shorter sequence h = (h_1, ..., h_U).]

Figure 1: Listen, Attend and Spell (LAS) model: the listener is a pyramidal BLSTM encoding our input
sequence x into high level features h; the speller is an attention-based decoder generating the characters y
from h.

3.1 Listen

The Listen operation uses a Bidirectional Long Short Term Memory RNN (BLSTM) [30, 31, 12]
with a pyramid structure. This modification is required to reduce the length U of h, from T , the
length of the input x, because the input speech signals can be hundreds to thousands of frames long.
A direct application of BLSTM for the operation Listen converged slowly and produced results
inferior to those reported here, even after a month of training time. This is presumably because the
operation AttendAndSpell has a hard time extracting the relevant information from a large number
of input time steps.
We circumvent this problem by using a pyramid BLSTM (pBLSTM) similar to the Clockwork RNN
[33]. In each successive stacked pBLSTM layer, we reduce the time resolution by a factor of 2. In a
typical deep BLSTM architecture, the output at the i-th time step from the j-th layer is computed as
follows:

h_i^j = BLSTM(h_{i-1}^j, h_i^{j-1})                                          (4)

In the pBLSTM model, we concatenate the outputs at consecutive steps of each layer before feeding
it to the next layer, i.e.:
h_i^j = pBLSTM(h_{i-1}^j, [h_{2i}^{j-1}, h_{2i+1}^{j-1}])                    (5)

In our model, we stack 3 pBLSTMs on top of the bottom BLSTM layer to reduce the time resolution
2^3 = 8 times. This allows the attention model (see next section) to extract the relevant information
from a smaller number of time steps. In addition to reducing the resolution, the deep architecture al-
lows the model to learn nonlinear feature representations of the data. See Figure 1 for a visualization
of the pBLSTM.
The pyramid structure also reduces the computational complexity. In the next section we show that
the attention mechanism over U features has a computational complexity of O(U S). Thus, reducing
U speeds up learning and inference significantly.
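To make the time-resolution reduction of Eq. (5) concrete, here is a minimal PyTorch-style sketch of a single pBLSTM layer. It assumes batch-first tensors and ignores padding/masking; the class name and layer sizes are illustrative choices, not details from the paper.

```python
import torch
import torch.nn as nn

class PBLSTMLayer(nn.Module):
    """One pyramidal BLSTM layer: concatenate pairs of consecutive frames
    from the layer below, then run a bidirectional LSTM, halving the time
    resolution (Eq. 5)."""

    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        # Each input step is a pair of concatenated frames, hence 2 * input_dim.
        self.blstm = nn.LSTM(2 * input_dim, hidden_dim,
                             batch_first=True, bidirectional=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat); drop a trailing frame if time is odd.
        batch, time, feat = x.shape
        if time % 2 == 1:
            x = x[:, :-1, :]
            time -= 1
        # Merge frames (2i, 2i+1) into one step of width 2 * feat.
        x = x.reshape(batch, time // 2, 2 * feat)
        out, _ = self.blstm(x)
        return out  # (batch, time // 2, 2 * hidden_dim)

# Stacking 3 such layers on top of a bottom BLSTM reduces T by 2^3 = 8.
```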

3.2 Attend and Spell

We now describe the AttendAndSpell function. The function is computed using an attention-based
LSTM transducer [16, 15]. At every output step, the transducer produces a probability distribution
over the next character conditioned on all the characters seen previously. The distribution for yi is
a function of the decoder state si and context ci . The decoder state si is a function of the previous
state si−1 , the previously emitted character yi−1 and context ci−1 . The context vector ci is produced
by an attention mechanism. Specifically,

ci = AttentionContext(si , h) (6)
si = RNN(si−1 , yi−1 , ci−1 ) (7)
P (yi |x, y<i ) = CharacterDistribution(si , ci ) (8)

where CharacterDistribution is an MLP with softmax outputs over characters, and RNN is a 2
layer LSTM.
At each time step i, the attention mechanism AttentionContext generates a context vector c_i
encapsulating the information in the acoustic signal needed to generate the next character. The
attention model is content based: the contents of the decoder state s_i are matched to the contents
of h_u, representing time step u of h, to generate an attention vector α_i, which is used to linearly
blend the vectors h_u to create c_i.
Specifically, at each decoder time step i, the AttentionContext function computes the scalar energy
e_{i,u} for each encoder time step u, using the vector h_u ∈ h and s_i. The scalar energies e_{i,u} are
converted into a probability distribution over time steps (the attention) α_i using a softmax function.
This is used to create the context vector c_i by linearly blending the listener features h_u at different
time steps:
e_{i,u} = ⟨φ(s_i), ψ(h_u)⟩                                                   (9)
α_{i,u} = exp(e_{i,u}) / Σ_{u'} exp(e_{i,u'})                                (10)
c_i = Σ_u α_{i,u} h_u                                                        (11)

where φ and ψ are MLP networks. On convergence, the α_i distribution is typically very sharp and
focused on only a few frames of h; c_i can be seen as a continuous bag of weighted features of h.
Figure 1 shows the LAS architecture.
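As an illustration, the content-based attention of Eqs. (9)-(11) reduces to a few tensor operations. The sketch below assumes single-layer linear maps for φ and ψ and batch-first tensors; these are our own simplifying assumptions, not details given in the paper.

```python
import torch
import torch.nn as nn

class AttentionContext(nn.Module):
    """Content-based attention: e_{i,u} = <phi(s_i), psi(h_u)> (Eqs. 9-11)."""

    def __init__(self, decoder_dim: int, encoder_dim: int, attn_dim: int):
        super().__init__()
        self.phi = nn.Linear(decoder_dim, attn_dim)  # phi(s_i)
        self.psi = nn.Linear(encoder_dim, attn_dim)  # psi(h_u)

    def forward(self, s_i: torch.Tensor, h: torch.Tensor):
        # s_i: (batch, decoder_dim), h: (batch, U, encoder_dim)
        query = self.phi(s_i).unsqueeze(1)           # (batch, 1, attn_dim)
        keys = self.psi(h)                           # (batch, U, attn_dim)
        energies = (query * keys).sum(dim=-1)        # (batch, U) dot products
        alpha = torch.softmax(energies, dim=-1)      # attention over time steps
        c_i = torch.bmm(alpha.unsqueeze(1), h)       # (batch, 1, encoder_dim)
        return c_i.squeeze(1), alpha                 # context vector and weights
```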

3.3 Learning

The Listen and AttendAndSpell functions can be trained jointly for end-to-end speech recognition.
Sequence to sequence methods condition the next step prediction on the previous characters [17,
16] and maximize the log probability:

max_θ Σ_i log P(y_i | x, y_{<i}; θ)                                          (12)

where y_{<i} is the groundtruth of the previous characters.
However during inference, the groundtruth is missing and the predictions can suffer because the
model was not trained to be resilient to feeding in bad predictions at some time steps. To ameliorate
this effect, we use a trick that was proposed in [19]. During training, instead of always feeding in the
ground truth transcript for next step prediction, we sometimes sample from our previous character
distribution and use that as the inputs in the next step predictions:
ỹ_i ∼ CharacterDistribution(s_i, c_i)                                        (13)
max_θ Σ_i log P(y_i | x, ỹ_{<i}; θ)                                          (14)

where ỹi−1 is the character chosen from the ground truth, or sampled from the model with a certain
sampling rate. Unlike [19], we do not use a schedule and simply use a constant sampling rate of
10% right from the start of training.
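The constant 10% sampling can be implemented as a per-step coin flip inside the training loop. The sketch below assumes a simplified decoder interface (decoder_step, embed, output_head are hypothetical hooks standing in for the speller's LSTM, character embedding and CharacterDistribution head); it is not the authors' code.

```python
import torch

def speller_training_step(decoder_step, embed, output_head, h, targets,
                          sos_id: int, sampling_rate: float = 0.10):
    """Teacher forcing with a 10% chance of feeding the model's own sample.

    decoder_step(prev_char_emb, state, h) -> (s_i, c_i, state) and
    output_head(s_i, c_i) -> logits over characters are assumed interfaces.
    targets: (batch, max_len) long tensor of groundtruth character ids.
    """
    batch, max_len = targets.shape
    prev_char = torch.full((batch,), sos_id, dtype=torch.long)
    state = None
    total_log_prob = 0.0
    for i in range(max_len):
        s_i, c_i, state = decoder_step(embed(prev_char), state, h)
        logits = output_head(s_i, c_i)                   # CharacterDistribution
        log_probs = torch.log_softmax(logits, dim=-1)
        total_log_prob = total_log_prob + log_probs.gather(
            1, targets[:, i:i + 1]).sum()                # log P(y_i | x, y_<i)
        # With probability 10%, feed a sampled character instead of groundtruth.
        sampled = torch.multinomial(log_probs.exp(), num_samples=1).squeeze(1)
        use_sample = torch.rand(batch) < sampling_rate
        prev_char = torch.where(use_sample, sampled, targets[:, i])
    return -total_log_prob / batch                       # loss to minimize
```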
As the system is a very deep network it may appear that some type of pretraining would be required.
However, in our experiments, we found no need for pretraining. In particular, we attempted to
pretrain the Listen function with context independent or context dependent phonemes generated
from a conventional GMM-HMM system. A softmax network was attached to the output units
hu ∈ h of the listener and used to make multi-frame phoneme state predictions [34] but led to no
improvements. We also attempted to use the phonemes as a joint objective target [35], but found no
improvements.

3.4 Decoding and Rescoring

During inference we want to find the most likely character sequence given the input acoustics:
ŷ = arg max_y log P(y | x)                                                   (15)

Decoding is performed with a simple left-to-right beam search algorithm similar to [17]. We main-
tain a set of β partial hypotheses, starting with the start-of-sentence ⟨sos⟩ token. At each timestep,
each partial hypothesis in the beam is expanded with every possible character and only the β most
likely beams are kept. When the ⟨eos⟩ token is encountered, it is removed from the beam and added
to the set of complete hypotheses. A dictionary can optionally be used to constrain the search space
to valid words; however, we found that this was not necessary since the model learns to spell real
words almost all the time.
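A minimal character-level beam search matching this description might look as follows; score_next is an assumed hook returning next-character log-probabilities for a given prefix, and the token ids and signature are illustrative rather than the paper's implementation.

```python
def beam_search(score_next, beam_width: int, max_len: int,
                sos_id: int, eos_id: int, vocab_size: int):
    """Left-to-right beam search over characters.

    score_next(prefix) -> list of log-probabilities over the vocabulary for
    the next character, given the character prefix (an assumed model hook).
    """
    beams = [([sos_id], 0.0)]          # (prefix, cumulative log-probability)
    completed = []
    for _ in range(max_len):
        candidates = []
        for prefix, logp in beams:
            next_logps = score_next(prefix)
            for char_id in range(vocab_size):
                candidates.append((prefix + [char_id], logp + next_logps[char_id]))
        # Keep only the beam_width most likely partial hypotheses.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for prefix, logp in candidates[:beam_width]:
            if prefix[-1] == eos_id:
                completed.append((prefix, logp))   # move finished beams aside
            else:
                beams.append((prefix, logp))
        if not beams:
            break
    return sorted(completed, key=lambda item: item[1], reverse=True)
```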
We have vast quantities of text data [36], compared to the amount of transcribed speech utterances.
We can use language models trained on text corpora alone, similar to conventional speech systems
[37]. To do so we can rescore our beams with the language model. We find that our model has a
small bias for shorter utterances, so we normalize our probabilities by the number of characters |y|_c
in the hypothesis and combine them with a language model probability P_LM(y):

s(y | x) = log P(y | x) / |y|_c + λ log P_LM(y)                              (16)

where λ is our language model weight and can be determined on a held-out validation set.
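The rescoring combination in Eq. (16) is a one-line score per hypothesis; the helper below assumes you already have the LAS log-probability for each beam entry and an external hook lm_log_prob giving the language model log-probability of the text.

```python
def rescore(beams, lm_log_prob, lm_weight: float):
    """Rank hypotheses by s(y|x) = log P(y|x) / |y|_c + lambda * log P_LM(y).

    beams: list of (char_sequence, las_log_prob); lm_log_prob(chars) is an
    assumed hook returning the language model log-probability of the text.
    """
    scored = []
    for chars, las_log_prob in beams:
        length_norm = las_log_prob / max(len(chars), 1)   # normalize by |y|_c
        score = length_norm + lm_weight * lm_log_prob(chars)
        scored.append((chars, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)
```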

4 Experiments
We used a dataset of approximately three million Google voice search utterances (representing 2000
hours of data) for our experiments. Approximately 10 hours of utterances were randomly selected as
a held-out validation set. Data augmentation was performed using a room simulator, adding different
types of noise and reverberations; the noise sources were obtained from YouTube and environmental
recordings of daily events [20]. This increased the amount of audio data by 20 times. 40-dimensional
log-mel filter bank features were computed every 10ms and used as the acoustic inputs to the listener.
A separate set of 22K utterances, representing approximately 16 hours of data, was used as the test
data. A noisy test data set was also created using the same corruption strategy that was applied to
the training data. All training sets are anonymized and hand-transcribed, and are representative of
Google's speech traffic.
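For reference, 40-dimensional log-mel filter bank features at a 10 ms hop can be computed with a standard audio library; the 16 kHz sample rate and 25 ms analysis window below are common defaults we assume, not values stated in the paper.

```python
import numpy as np
import librosa

def log_mel_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Compute 40-dim log-mel filter bank features every 10 ms."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_mels=40,
        hop_length=int(0.010 * sr),   # 10 ms hop
        n_fft=int(0.025 * sr))        # assumed 25 ms analysis window
    return np.log(mel + 1e-6).T       # (num_frames, 40), the listener input x
```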
The text was normalized by converting all characters to lower case English alphanumerics (including
digits). The punctuation tokens space, comma, period and apostrophe were kept, while all other tokens
were converted to the unknown ⟨unk⟩ token. As mentioned earlier, all utterances were padded with
the start-of-sentence ⟨sos⟩ and end-of-sentence ⟨eos⟩ tokens.
The state-of-the-art model on this dataset is a CLDNN-HMM system that was described in [20].
The CLDNN system achieves a WER of 8.0% on the clean test set and 8.9% on the noisy test set.
However, we note that the CLDNN system is unidirectional and would certainly benefit even
further from a bidirectional architecture.
For the Listen function we used 3 layers of 512 pBLSTM nodes (i.e., 256 nodes per direction) on
top of a BLSTM that operates on the input. This reduced the time resolution by 2^3 = 8 times. The
Spell function used a two layer LSTM with 512 nodes each. The weights were initialized with a
uniform distribution U(−0.1, 0.1).
Asynchronous Stochastic Gradient Descent (ASGD) was used for training our model [38]. A learn-
ing rate of 0.2 was used with a geometric decay of 0.98 per 3M utterances (i.e., 1/20-th of an epoch).
We used the DistBelief framework [38] with 32 replicas, each with a minibatch of 32 utterances.
In order to further speed up training, the sequences were grouped into buckets based on their frame
length [17].
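For concreteness, the stated geometric decay corresponds to a learning rate schedule of the form sketched below; treat this as our reading of the schedule, not code from the paper.

```python
def learning_rate(utterances_seen: float,
                  base_lr: float = 0.2,
                  decay: float = 0.98,
                  decay_interval: float = 3e6) -> float:
    """Geometric learning-rate decay: 0.98 per 3M utterances (1/20 epoch)."""
    return base_lr * decay ** (utterances_seen / decay_interval)

# Example: after one full (augmented) epoch of roughly 60M utterances,
# lr = 0.2 * 0.98 ** 20 ≈ 0.13.
```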
The model was trained using groundtruth previous characters until results on the validation set
stopped improving. This took approximately two weeks. The model was decoded using beam width
β = 32 and achieved 16.2% WER on the clean test set and 19.0% WER on the noisy test set without
any dictionary or language model. We found that constraining the beam search with a dictionary had
no impact on the WER. Rescoring the top 32 beams with the same n-gram language model that was
used by the CLDNN system using a language model weight of λ = 0.008 improved the results for
the clean and noisy test sets to 12.6% and 14.7% respectively. Note that for convenience, we did not
decode with a language model, but rather only rescored the top 32 beams. It is possible that further
gains could have been achieved by using the language model during decoding.
As mentioned in Section 3.3, there is a mismatch between training and testing. During training
the model is conditioned on the correct previous characters but during testing mistakes made by
the model corrupt future predictions. We trained another model by sampling from our previous
character distribution with a probability of 10% (we did not use a schedule as described in [19]).
This improved our results on the clean and noisy test sets to 14.1% and 16.5% WER respectively
when no language model rescoring was used. With language model rescoring, we achieved 10.3%
and 12.0% WER on the clean and noisy test sets, respectively. Table 1 summarizes these results.
On the clean test set, this model is within 2.5% absolute WER of the state-of-the-art CLDNN-HMM
system, while on the noisy set it is less than 3.0% absolute WER worse.

Table 1: WER comparison on the clean and noisy Google voice search task. The CLDNN-HMM system is
the state-of-the-art system, the Listen, Attend and Spell (LAS) models are decoded with a beam size of 32.
Language Model (LM) rescoring was applied to our beams, and a sampling trick was applied to bridge the gap
between training and inference.

Model                              Clean WER (%)   Noisy WER (%)

CLDNN-HMM [20]                     8.0             8.9
LAS                                16.2            19.0
LAS + LM Rescoring                 12.6            14.7
LAS + Sampling                     14.1            16.5
LAS + Sampling + LM Rescoring      10.3            12.0

We suspect that convolutional filters could lead to improved results, as they have been reported to
improve performance by 5% relative WER on clean speech and 7% relative WER on noisy speech
compared to non-convolutional architectures [20].

4.1 Attention Visualization

Figure 2: Alignments between character outputs and audio signal produced by the Listen, Attend and Spell
(LAS) model for the utterance “how much would a woodchuck chuck”. The content based attention mechanism
was able to identify the start position in the audio sequence for the first character correctly. The alignment
produced is generally monotonic without a need for any location based priors.

The content-based attention mechanism creates an explicit alignment between the characters and
audio signal. We can visualize the attention mechanism by recording the attention distribution on
the acoustic sequence at every character output timestep. Figure 2 visualizes the attention align-
ment between the characters and the filterbanks for the utterance “how much would a woodchuck
chuck”. For this particular utterance, the model learnt a monotonic distribution without any location
priors. The words “woodchuck” and “chuck” are acoustically similar, so the attention mechanism
was slightly confused when emitting “woodchuck”, resulting in a diluted attention distribution. The
attention model was also able to identify the start and end of the utterance properly.

In the following sections, we report results of control experiments that were conducted to understand
the effects of beam widths, utterance lengths and word frequency on the WER of our model.

4.2 Effects of Beam Width

We investigate the correlation between the performance of the model and the width of beam search,
with and without the language model rescoring. Figure 3 shows the effect of the decode beam width,
β, on the WER for the clean test set. We see consistent WER improvements by increasing the beam
width up to 16, after which we observe no significant benefits. At a beam width of 32, the WER
is 14.1% and 10.3% after language model rescoring. Rescoring the top 32 beams with an oracle
produces a WER of 4.3% on the clean test set and 5.5% on the noisy test set.

[Figure 3: “Beam Width vs. WER”: WER, WER with LM rescoring, and oracle WER on the clean test set
plotted against beam widths of 1, 2, 4, 8, 16 and 32.]

Figure 3: The effect of the decode beam width on WER for the clean Google voice search task. The reported
WERs are without a dictionary or language model, with language model rescoring and the oracle WER for
different beam widths. The figure shows that good results can be obtained even with a relatively small beam
size.

4.3 Effects of Utterance Length

We measure the performance of our model as a function of the number of words in the utterance. We
expect the model to do poorly on longer utterances due to the limited number of long utterances in our
training distribution, so it is not surprising that longer utterances have a larger error rate. Deletions
dominate the errors for long utterances, suggesting we may be missing out on words. It is surprising
that short utterances (e.g., 2 words or fewer) perform quite poorly. Here, substitutions and insertions
are the main sources of errors, suggesting the model may split words apart.
Figure 4 also suggests that our model struggles to generalize to long utterances when trained on a
distribution of shorter utterances. It is possible that location-based priors may help in these situations,
as reported in [15].

4.4 Word Frequency

We study the performance of our model on rare words. We use the recall metric to indicate whether
a word appears in the utterance regardless of position (higher is better). Figure 5 reports the recall
of each word in the test distribution as a function of the word frequency in the training distribution.
Rare words have higher variance and lower recall, while more frequent words typically have higher
recall.

[Figure 4: “Utterance Length vs. Error”: percentages of insertions, deletions, substitutions, WER, WER
with LM rescoring and oracle WER plotted against the number of words in an utterance (up to 25+), with
the data distribution overlaid.]

Figure 4: The correlation between error rates (insertion, deletion, substitution and WER) and the number of
words in an utterance. The WER is reported without a dictionary or language model, with language model
rescoring and the oracle WER for the clean Google voice search task. The data distribution with respect to the
number of words in an utterance is overlaid in the figure. LAS performs poorly with short utterances despite
an abundance of data. LAS also fails to generalize well on longer utterances when trained on a distribution
of shorter utterances. Insertions and substitutions are the main sources of errors for short utterances, while
deletions dominate the error for long utterances.

The word “and” occurs 85k times in the training set; however, it has a recall of only 80% even
after language model rescoring. The word “and” is frequently mis-transcribed as “in” (which has
95% recall). This suggests improvements are needed in the language model. By contrast, the word
“walkerville” occurs just once in the training set but has a recall of 100%. This suggests that the
recall for a word depends both on its frequency in the training set and its acoustic uniqueness.

4.5 Interesting Decoding Examples

In this section, we show the outputs of the model on several utterances to demonstrate the capabilities
of LAS. All the results in this section are decoded without a dictionary or a language model.
During our experiments, we observed that LAS can learn multiple spelling variants given the same
acoustics. Table 2 shows top beams for the utterance that includes “triple a”. As can be seen,
the model produces both “triple a” and “aaa” within the top four beams. The decoder is able to
generate such varied parses because the next step prediction makes no independence assumptions on
the output distribution, thanks to the chain rule decomposition. It would be difficult to produce such
differing transcripts using CTC due to its conditional independence assumption, under which y_i is
conditionally independent of y_{i+1} given x. Conventional DNN-HMM systems would require both
spellings to be in the pronunciation dictionary to generate both spelling variants.
It can also be seen that the model produced “xxx” even though acoustically “x” is very different
from “a” - this is presumably because the language model overpowers the acoustic signal in this
case. In the training corpus “xxx” is a very common phrase, and we suspect the implicit language
model in the speller learns to associate “triple” with “xxx”.

[Figure 5: “Word Frequency vs. Recall”: word recall percentage (0-100) plotted against word frequency in
the training set (10^0 to 10^6, log scale).]

Figure 5: The correlation between word frequency in the training distribution and recall in the test distribution.
In general, rare words report worse recall compared to more frequent words.

Table 2: Example 1: “triple a” vs. “aaa” spelling variants.

Beam    Text                                 Log Probability   WER (%)

Truth   call aaa roadside assistance         -                 -
1       call aaa roadside assistance         -0.5740           0.00
2       call triple a roadside assistance    -1.5399           50.00
3       call trip way roadside assistance    -3.5012           50.00
4       call xxx roadside assistance         -4.4375           25.00

We note that “triple a” occurs 4 times
in the training distribution and “aaa” (when pronounced “triple a” rather than “a”-“a”-“a”) occurs
only once in the training distribution.
We are also surprised that the model is capable of handling utterances with repeated words despite
the fact that it uses content-based attention. Table 3 shows an example of an utterance with a repeated
word. Since LAS implements content-based attention, one might expect it to “lose its attention” during
decoding and produce a word more or fewer times than it was spoken. As can be seen from this
example, even though “seven” is repeated three times, the model successfully outputs “seven” three
times. This hints that location-based priors (e.g., location based attention or location based
regularization) may not be needed for repeated content.

Table 3: Example 2: Repeated “seven”s.

Beam    Text                                         Log Probability   WER (%)

Truth   eight nine four minus seven seven seven      -                 -
1       eight nine four minus seven seven seven      -0.2145           0.00
2       eight nine four nine seven seven seven       -1.9071           14.29
3       eight nine four minus seven seventy seven    -4.7316           14.29
4       eight nine four nine s seven seven seven     -5.1252           28.57

5 Conclusions

We have presented Listen, Attend and Spell (LAS), an attention-based neural network that can di-
rectly transcribe acoustic signals to characters. LAS is based on the sequence to sequence framework
with a pyramid structure in the encoder that reduces the number of timesteps that the decoder has
to attend to. LAS is trained end-to-end and has two main components. The first component, the
listener, is a pyramidal acoustic RNN encoder that transforms the input sequence into a high level
feature representation. The second component, the speller, is an RNN decoder that attends to the
high level features and spells out the transcript one character at a time. Our system does not use
the concepts of phonemes, nor does it rely on pronunciation dictionaries or HMMs. We bypass the
conditional independence assumptions of CTC, and show how we can learn an implicit language
model that can generate multiple spelling variants given the same acoustics. To further improve
the results, we used samples from the softmax classifier in the decoder as inputs to the next step
prediction during training. Finally, we showed how a language model trained on additional text can
be used to rerank our top hypotheses.

Acknowledgements

We thank Tara Sainath and Babak Damavandi for helping us with the data and language models, and
for helpful comments. We also thank Andrew Dai, Ashish Agarwal, Samy Bengio, Eugene Brevdo,
Greg Corrado, Jeff Dean, Rajat Monga, Christopher Olah, Mike Schuster, Noam Shazeer,
Ilya Sutskever, Vincent Vanhoucke and the Google Brain team for helpful comments, suggestions
and technical assistance.

References
[1] Nelson Morgan and Herve Bourlard. Continuous Speech Recognition using Multilayer Per-
ceptrons with Hidden Markov Models. In IEEE International Conference on Acoustics, Speech
and Signal Processing, 1990.
[2] Abdel-rahman Mohamed, George E. Dahl, and Geoffrey E. Hinton. Deep belief networks for
phone recognition. In Neural Information Processing Systems: Workshop on Deep Learning
for Speech Recognition and Related Applications, 2009.
[3] George E. Dahl, Dong Yu, Li Deng, and Alex Acero. Large vocabulary continuous speech
recognition with context-dependent dbn-hmms. In IEEE International Conference on Acous-
tics, Speech and Signal Processing, 2011.
[4] Abdel-rahman Mohamed, George E. Dahl, and Geoffrey Hinton. Acoustic modeling us-
ing deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing,
20(1):14–22, 2012.
[5] Navdeep Jaitly, Patrick Nguyen, Andrew W. Senior, and Vincent Vanhoucke. Application
of Pretrained Deep Neural Networks to Large Vocabulary Speech Recognition. In INTER-
SPEECH, 2012.
[6] Tara Sainath, Abdel-rahman Mohamed, Brian Kingsbury, and Bhuvana Ramabhadran. Deep
Convolutional Neural Networks for LVCSR. In IEEE International Conference on Acoustics,
Speech and Signal Processing, 2013.
[7] Kanishka Rao, Fuchun Peng, Hasim Sak, and Francoise Beaufays. Grapheme-to-phoneme
conversion using long short-term memory recurrent neural networks. In IEEE International
Conference on Acoustics, Speech and Signal Processing, 2015.
[8] Kaisheng Yao and Geoffrey Zweig. Sequence-to-Sequence Neural Net Models for Grapheme-
to-Phoneme Conversion. 2015.
[9] Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. Recur-
rent neural network based language model. In INTERSPEECH, 2010.
[10] Alex Graves, Santiago Fernandez, Faustino Gomez, and Jurgen Schmidhuber. Connectionist
Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Net-
works. In International Conference on Machine Learning, 2006.

[11] Alex Graves. Sequence Transduction with Recurrent Neural Networks. In International Con-
ference on Machine Learning: Representation Learning Workshop, 2012.
[12] Alex Graves and Navdeep Jaitly. Towards End-to-End Speech Recognition with Recurrent
Neural Networks. In International Conference on Machine Learning, 2014.
[13] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan
Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Ng. Deep Speech:
Scaling up end-to-end speech recognition. In http://arxiv.org/abs/1412.5567, 2014.
[14] Jan Chorowski, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. End-to-end Con-
tinuous Speech Recognition using Attention-based Recurrent NN: First Results. In Neural In-
formation Processing Systems: Workshop Deep Learning and Representation Learning Work-
shop, 2014.
[15] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio.
Attention-Based Models for Speech Recognition. In http://arxiv.org/abs/1506.07503, 2015.
[16] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by
Jointly Learning to Align and Translate. In International Conference on Learning Representa-
tions, 2015.
[17] Ilya Sutskever, Oriol Vinyals, and Quoc Le. Sequence to Sequence Learning with Neural
Networks. In Neural Information Processing Systems, 2014.
[18] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares,
Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder-
Decoder for Statistical Machine Translation. In Conference on Empirical Methods in Natural
Language Processing, 2014.
[19] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled Sampling for Se-
quence Prediction with Recurrent Neural Networks. In http://arxiv.org/abs/1506.03099, 2015.
[20] Tara N. Sainath, Oriol Vinyals, Andrew Senior, and Hasim Sak. Convolutional, Long Short-
Term Memory, Fully Connected Deep Neural Networks. In IEEE International Conference on
Acoustics, Speech and Signal Processing, 2015.
[21] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep
Convolutional Neural Networks. In Neural Information Processing Systems, 2012.
[22] Leonard E. Baum and Ted Petrie. Statistical Inference for Probabilistic Functions of Finite
State Markov Chains. The Annals of Mathematical Statistics, 37:1554–1563, 1966.
[23] John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional Random Fields: Proba-
bilistic Models for Segmenting and Labeling Sequence Data. In International Conference on
Machine Learning, 2001.
[24] Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. Ad-
dressing the Rare Word Problem in Neural Machine Translation. In Association for Computa-
tional Linguistics, 2015.
[25] Sebastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. On Using Very
Large Target Vocabulary for Neural Machine Translation. In Association for Computational
Linguistics, 2015.
[26] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and Tell: A Neural
Image Caption Generator. In IEEE Conference on Computer Vision and Pattern Recognition,
2015.
[27] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov,
Richard Zemel, and Yoshua Bengio. Show, Attend and Tell: Neural Image Caption Generation
with Visual Attention. In International Conference on Machine Learning, 2015.
[28] Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey E. Hinton.
Grammar as a foreign language. In http://arxiv.org/abs/1412.7449, 2014.
[29] Oriol Vinyals and Quoc V. Le. A Neural Conversational Model. In International Conference
on Machine Learning: Deep Learning Workshop, 2015.
[30] Sepp Hochreiter and Jurgen Schmidhuber. Long Short-Term Memory. Neural Computation,
9(8):1735–1780, November 1997.

[31] Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. Hybrid Speech Recognition with
Bidirectional LSTM. In Automatic Speech Recognition and Understanding Workshop, 2013.
[32] Salah Hihi and Yoshua Bengio. Hierarchical Recurrent Neural Networks for Long-Term De-
pendencies. In Neural Information Processing Systems, 1996.
[33] Jan Koutnik, Klaus Greff, Faustino Gomez, and Jurgen Schmidhuber. A Clockwork RNN. In
International Conference on Machine Learning, 2014.
[34] Navdeep Jaitly, Vincent Vanhoucke, and Geoffrey Hinton. Autoregressive product of multi-
frame predictions can improve the accuracy of hybrid models. In INTERSPEECH, 2014.
[35] Hasim Sak, Andrew Senior, Kanishka Rao, and Francoise Beaufays. Fast and Accurate Recur-
rent Neural Network Acoustic Models for Speech Recognition. In INTERSPEECH, 2015.
[36] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Repre-
sentations of Words and Phrases and their Compositionality. In Neural Information Processing
Systems, 2013.
[37] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra
Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg
Stemmer, and Karel Vesely. The Kaldi Speech Recognition Toolkit. In Automatic Speech
Recognition and Understanding Workshop, 2011.
[38] Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z.
Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. Large
Scale Distributed Deep Networks. In Neural Information Processing Systems, 2012.

A Alignment Examples
In this section, we give additional visualization examples of our model and the attention distribution.

Figure 6: The spelling variants “aaa” and “triple a” produce different attention distributions; both spelling
variants appear in our top beams. The ground truth is: “aaa emergency roadside service”.

Figure 7: The spelling variants “st” and “saint” produce different attention distributions; both spelling
variants appear in our top beams. The ground truth is: “st mary’s animal clinic”.

Figure 8: The word “cancel” is repeated three times. Note the parallel diagonals: the content-based attention
mechanism gets slightly confused, but the model still emits the correct hypothesis. The ground truth is:
“cancel cancel cancel”.
