
SPEECH RECOGNITION BY SIMPLY FINE-TUNING BERT

Wen-Chin Huang^{1,2}, Chia-Hua Wu^{2}, Shang-Bao Luo^{2}, Kuan-Yu Chen^{3}, Hsin-Min Wang^{2}, Tomoki Toda^{1}

^{1} Nagoya University, Japan; ^{2} Academia Sinica, Taiwan; ^{3} National Taiwan University of Science and Technology, Taiwan

ABSTRACT

We propose a simple method for automatic speech recognition (ASR) by fine-tuning BERT, a language model (LM) trained on large-scale unlabeled text data that can generate rich contextual representations. Our assumption is that, given a history context sequence, a powerful LM can narrow the range of possible choices, so that the speech signal can be used as a simple clue. Hence, compared to conventional ASR systems that train a powerful acoustic model (AM) from scratch, we believe that speech recognition is possible by simply fine-tuning a BERT model. As an initial study, we demonstrate the effectiveness of the proposed idea on the AISHELL dataset and show that stacking a very simple AM on top of BERT can yield reasonable performance.

Index Terms: speech recognition, BERT, language model

1. INTRODUCTION

Conventional automatic speech recognition (ASR) systems consist of multiple separately optimized modules, including an acoustic model (AM), a language model (LM), and a lexicon. In recent years, end-to-end (E2E) ASR models have attracted much attention, due to the belief that jointly optimizing a single model helps avoid not only task-specific engineering but also error propagation. Current mainstream E2E approaches include connectionist temporal classification (CTC) [1], neural transducers [2], and attention-based sequence-to-sequence (seq2seq) learning [3].

LMs play an essential role in ASR. Even E2E models that implicitly integrate an LM into their optimization can benefit from LM fusion. It is therefore worth asking: how can we make full use of the power of LMs? Consider a situation in which we are in the middle of transcribing a speech utterance, have already correctly recognized a sequence of history words, and want to determine the next word being said. From a probabilistic point of view, a strong LM can then generate a list of candidates, each of which is highly likely to be the next word. The list may be so short that only one answer is left. As a result, we need few to no clues from the speech signal to correctly recognize the next word.

There has been rapid development of LMs in the field of natural language processing (NLP), and one of the most epoch-making approaches is BERT [4]. Its success comes from a framework in which a pretraining stage is followed by a task-specific fine-tuning stage. Thanks to the un-/self-supervised objectives adopted in pretraining, large-scale unlabeled datasets can be used for training, making it possible to learn enriched language representations that are powerful on various NLP tasks. BERT and its variants have created a dominant paradigm in NLP in the past year [5, 6, 7].

In this work, we propose a novel approach to ASR, which is to simply fine-tune a pretrained BERT model. Our method, which we call BERT-ASR, formulates ASR as a classification problem, where the objective is to correctly classify the next word given the acoustic speech signals and the history words. We show that even an AM that simply averages the frame-based acoustic features corresponding to a word can be applied to BERT-ASR to correctly transcribe speech to a certain extent, and that the performance can be further boosted by using a more complex model.

2. BERT

BERT [4], which stands for Bidirectional Encoder Representations from Transformers, is a pretraining method based on an LM objective with a Transformer encoder architecture [8]. The full power of BERT is released only through a pretraining and fine-tuning framework, where the model is first trained on a large-scale unlabeled text dataset, and then all or some parameters are fine-tuned on a labeled dataset for the downstream task.

The original usage of BERT mainly focused on NLP tasks, ranging from token-level to sequence-level classification, including question answering [9, 10], document summarization [11, 12], information retrieval [13, 14], and machine translation [15, 16], just to name a few. There have also been attempts to combine BERT with ASR, including rescoring [17, 18] and generating soft labels for training [19]. In this section, we review the fundamentals of BERT.

2.1. Model architecture and input representations

BERT adopts a multi-layer Transformer [8] encoder, where each layer contains a multi-head self-attention sublayer followed by a position-wise fully connected feedforward network. Each layer is equipped with residual connections and layer normalization.

The input representation of BERT was designed to handle a variety of downstream tasks, as visualized in Figure 1. First, a token embedding is assigned to each token in the vocabulary. Some special tokens were added in the original BERT, including a classification token ([CLS]), which is prepended to every sequence and whose final hidden state is used as the aggregate sequence representation for classification tasks, and a separation token ([SEP]) for separating two sentences. Second, a segment embedding is added to every token to indicate whether it belongs to sentence A or B. Finally, a learned positional embedding is added so that the model is aware of the relative or absolute position of each token. The input representation of every token is then constructed by summing the corresponding token, segment, and position embeddings.
2.2. Training and fine-tuning

Two self-supervised objectives were used for pretraining BERT. The first one is masked language modeling (MLM), a denoising objective that asks the model to reconstruct randomly masked input tokens based on context information. Specifically, 15% of the input tokens are first chosen; each chosen token is then (1) replaced with [MASK] 80% of the time, (2) replaced with a random token 10% of the time, or (3) kept unchanged 10% of the time.

During fine-tuning, depending on the downstream task, minimal task-specific parameters are introduced, so that fine-tuning can be cheap in terms of data and training effort.
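To make the masking rule concrete, the following is a minimal sketch of the 80%/10%/10% corruption procedure described above. It is illustrative only, not the authors' implementation; the token-ID representation, `mask_id`, `vocab_size`, and the use of -100 as an ignore label are our assumptions.

```python
import random

def mask_for_mlm(token_ids, mask_id, vocab_size, select_prob=0.15):
    """Apply the BERT MLM corruption rule: choose 15% of tokens, then
    replace with [MASK] 80% of the time, with a random token 10% of the
    time, and keep the original token 10% of the time."""
    inputs, targets = list(token_ids), [-100] * len(token_ids)  # -100 = position not predicted
    for i, tok in enumerate(token_ids):
        if random.random() >= select_prob:
            continue
        targets[i] = tok                                  # model must reconstruct this token
        r = random.random()
        if r < 0.8:
            inputs[i] = mask_id                           # (1) replace with [MASK]
        elif r < 0.9:
            inputs[i] = random.randrange(vocab_size)      # (2) replace with a random token
        # (3) otherwise: keep the token unchanged
    return inputs, targets
```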
Fig. 1: The input representation of the original BERT and the proposed BERT-ASR.
Fig. 2: Illustration of the decoding process of the proposed BERT-ASR.

3. PROPOSED METHOD

In this section, we explain how we fine-tune a pretrained BERT to formulate an LM, and then further extend it to consume acoustic speech signals to achieve ASR.

Assume we have an ASR training dataset containing N speech utterances, D_ASR = {<X^(i), y^(i)>}_{i=1}^{N}, with each y = (y_1, ..., y_T) being the transcription consisting of T tokens, and each X = (x_1, ..., x_{T'}) denoting a sequence of T' input acoustic feature frames. The acoustic features are of dimension d, i.e., x_t ∈ R^d, and the tokens are drawn from a vocabulary of size V.

3.1. Training a probabilistic LM with BERT

We first show how we formulate a probabilistic LM using BERT, which we will refer to as BERT-LM. The probability of observing a symbol sequence y can be formulated as:

    P(y) = P(y_1, \dots, y_T) = \prod_{t=1}^{T} P(y_t \mid y_1, \dots, y_{t-1})    (1)

The decoding (or scoring) of a given symbol sequence then becomes an iterative process that calculates all the terms in the product, as illustrated in Figure 2. At the t-th time step, the BERT model takes a sequence of previously decoded symbols and the [CLS] token as input, i.e., ([CLS], y_1, ..., y_{t-1}). The final hidden state corresponding to [CLS] is then fed into a linear classifier, which outputs the probability distribution P(y_t | y_1, ..., y_{t-1}).
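As an illustration of this decoding step, the sketch below builds such a classifier on top of a HuggingFace BertModel: the final hidden state of [CLS] is passed through a linear layer over the vocabulary to obtain P(y_t | y_1, ..., y_{t-1}). The class name, checkpoint, and layer sizes are our assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class BertLMHead(nn.Module):
    """BERT-LM-style classifier: the [CLS] hidden state predicts the next token."""
    def __init__(self, pretrained="bert-base-chinese", vocab_size=None):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained)
        vocab_size = vocab_size or self.bert.config.vocab_size
        self.classifier = nn.Linear(self.bert.config.hidden_size, vocab_size)

    def next_token_log_probs(self, input_ids):
        # input_ids: (batch, t) holding ([CLS], y_1, ..., y_{t-1})
        hidden = self.bert(input_ids).last_hidden_state        # (batch, t, hidden)
        cls_state = hidden[:, 0]                                # final hidden state of [CLS]
        return torch.log_softmax(self.classifier(cls_state), dim=-1)
```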
To train the model, assume we have a text training dataset with N sentences, D_text = {y^(i)}_{i=1}^{N}. An essential technique for training the model is to exhaustively enumerate all possible training samples. That is, each sentence with T symbols is extended to T different training samples following the rule in Equation (2):

    (y_1, \dots, y_T) \rightarrow \{ ([CLS]),\ ([CLS], y_1),\ ([CLS], y_1, y_2),\ \dots,\ ([CLS], y_1, \dots, y_{T-1}) \}    (2)

The training of the BERT-LM then amounts to minimizing the following cross-entropy objective:

    \mathcal{L}_{LM} = -\sum_{i=1}^{N} \sum_{t=1}^{T} \log P(y_t^{(i)} \mid [CLS], y_1^{(i)}, \dots, y_{t-1}^{(i)})    (3)
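A minimal sketch of the exhaustive enumeration in Equation (2) is given below (illustrative only; `cls_id` denotes the vocabulary index of the [CLS] token).

```python
def enumerate_training_samples(token_ids, cls_id):
    """Expand one sentence (y_1, ..., y_T) into T (input, target) pairs
    following Equation (2): the t-th input is ([CLS], y_1, ..., y_{t-1})
    and its target is y_t."""
    samples = []
    for t in range(len(token_ids)):
        history = [cls_id] + token_ids[:t]
        samples.append((history, token_ids[t]))
    return samples

# A 3-token sentence yields 3 samples:
# ([CLS]) -> y1, ([CLS], y1) -> y2, ([CLS], y1, y2) -> y3
print(enumerate_training_samples([11, 12, 13], cls_id=101))
```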

3.2. BERT-ASR

We introduce our proposed BERT-ASR in this section. Since the time resolutions of text and speech are at completely different scales, for the model described in Section 3.1 to be capable of taking acoustic features as input, we first assume that we know the nonlinear alignment between an acoustic feature sequence and the corresponding text, as depicted in Figure 3. Specifically, for a pair of training transcription and acoustic feature sequence <(x_1, ..., x_{T'}), (y_1, ..., y_T)>, we denote F_i as the segment of features corresponding to y_i: F_i = (x_{t_{i-1}+1}, ..., x_{t_i}) ∈ R^{(t_i - t_{i-1}) × d}, which runs from frame t_{i-1}+1 to frame t_i, with t_0 = 0. Thus, the T' acoustic frames can be segmented into T groups, X = (F_1, ..., F_T), and a new dataset containing segmented acoustic feature and text pairs can be obtained: D_seg = {<F^(i), y^(i)>}_{i=1}^{N}.

We can then augment the BERT-LM into the BERT-ASR by injecting the acoustic information extracted from D_seg into the BERT-LM. Specifically, as depicted in Figure 1, an acoustic encoder, which will be described later, consumes the raw acoustic feature segments to generate acoustic embeddings. These are summed with the three types of embeddings in the original BERT and then sent into BERT. Note that the acoustic embedding corresponding to the current word to be transcribed is added to the [CLS] token, as shown in Figure 2.

The probability of observing a symbol sequence y in Equation (1) can therefore be reformulated as:

    P(y) = \prod_{t=1}^{T} P(y_t \mid y_1, \dots, y_{t-1}, F_1, \dots, F_t)    (4)

Note that the acoustic segment of the current time step is also taken into consideration, which is essential for the model to correctly transcribe the current word being said. The training objective can be derived by reformulating Equation (3) as:

    \mathcal{L}_{ASR} = -\sum_{i=1}^{N} \sum_{t=1}^{T} \log P\big(y_t^{(i)} \mid \langle [CLS], F_t \rangle, \langle y_1^{(i)}, F_1 \rangle, \dots, \langle y_{t-1}^{(i)}, F_{t-1} \rangle\big)    (5)

In a nutshell, the training of BERT-ASR involves three steps:

1. Pretrain a BERT model using a large-scale text dataset.
2. Fine-tune a BERT-LM on the transcriptions of the ASR training dataset, as described in Section 3.1.
3. Fine-tune a BERT-ASR using both the text and speech data of the ASR training dataset.
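The injection of acoustic embeddings described above can be sketched with HuggingFace's `inputs_embeds` interface: the acoustic embeddings are added to the token embeddings, shifted so that the embedding of the segment to be transcribed aligns with [CLS], and BERT then adds its segment and position embeddings internally. This is our reading of Figures 1 and 2, not the authors' implementation; function and argument names are hypothetical.

```python
import torch
from transformers import BertModel

def bert_asr_hidden_states(bert: BertModel, input_ids, acoustic_embeddings):
    """input_ids: (batch, t) = ([CLS], y_1, ..., y_{t-1}).
    acoustic_embeddings: (batch, t, hidden) = (AE_t, AE_1, ..., AE_{t-1}),
    so that the segment of the word to be transcribed is aligned with [CLS]."""
    token_embeddings = bert.embeddings.word_embeddings(input_ids)
    fused = token_embeddings + acoustic_embeddings
    # When inputs_embeds is given, BERT still adds its own segment and
    # position embeddings inside the embedding layer.
    return bert(inputs_embeds=fused).last_hidden_state
```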
Fig. 3: Illustration of the alignment between the text and the acoustic frames.
Fig. 4: The average encoder.
Fig. 5: The conv1d resnet encoder.

3.3. Acoustic encoder

We now describe two kinds of architectures for the acoustic encoder mentioned in Section 3.2. Formally, the acoustic encoder takes the whole acoustic frame sequence X as input and outputs the corresponding acoustic embeddings (AE_1, ..., AE_T), where AE_t ∈ R^{d_model}, with d_model being the BERT embedding dimension. The acoustic encoder must contain a segment layer to obtain the acoustic segments.

3.3.1. Average encoder

We first consider a very simple average encoder, as depicted in Figure 4. First, the segmentation is performed. Then, each F_t is averaged over the time axis, and the resulting vector is passed through a linear layer to distill the useful information while scaling the dimension from d to d_model. Simple as it seems, as we will show later, initial results can already be obtained with this average encoder.
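A minimal sketch of such an average encoder is given below, assuming 83-dimensional input features (80 log Mel filterbanks plus 3 pitch features, cf. Section 4.1) and the base BERT dimension of 768; the (start, end) boundary format is our assumption.

```python
import torch
import torch.nn as nn

class AverageEncoder(nn.Module):
    """Average each acoustic segment F_t over time, then map d -> d_model."""
    def __init__(self, feat_dim=83, d_model=768):
        super().__init__()
        self.linear = nn.Linear(feat_dim, d_model)

    def forward(self, frames, boundaries):
        # frames: (T', d) acoustic features of one utterance
        # boundaries: list of (start, end) frame indices, one pair per token
        segment_means = torch.stack(
            [frames[s:e].mean(dim=0) for s, e in boundaries])   # (T, d)
        return self.linear(segment_means)                        # (T, d_model) acoustic embeddings
```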
3.3.2. Conv1d resnet encoder

The drawback of the average encoder is that temporal dependencies between different acoustic segments are not considered. We therefore investigate a second encoder, which we will refer to as the conv1d resnet encoder, as illustrated in Figure 5. While it has the same segment and linear layers as the average encoder, we add L learnable residual blocks that operate on X. Each residual block contains two conv1d layers over the time axis followed by ReLU activations. It is expected that taking the temporal relationship between segments into account can boost performance.
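A sketch of this encoder is given below; the kernel size and the exact placement of the residual connection are assumptions, since the paper does not specify them.

```python
import torch
import torch.nn as nn

class ResBlock1d(nn.Module):
    """Two conv1d layers over the time axis with ReLU activations and a skip connection."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=pad)

    def forward(self, x):                                     # x: (batch, d, T')
        return torch.relu(x + self.conv2(torch.relu(self.conv1(x))))

class Conv1dResnetEncoder(nn.Module):
    """L residual blocks over the whole frame sequence, followed by the same
    segment-averaging and linear layers as the average encoder."""
    def __init__(self, feat_dim=83, d_model=768, num_blocks=3, kernel_size=3):
        super().__init__()
        self.blocks = nn.Sequential(
            *[ResBlock1d(feat_dim, kernel_size) for _ in range(num_blocks)])
        self.linear = nn.Linear(feat_dim, d_model)

    def forward(self, frames, boundaries):                    # frames: (T', d)
        x = self.blocks(frames.t().unsqueeze(0))              # (1, d, T'): conv over time
        x = x.squeeze(0).t()                                   # back to (T', d)
        segment_means = torch.stack([x[s:e].mean(dim=0) for s, e in boundaries])
        return self.linear(segment_means)                      # (T, d_model)
```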
4. EXPERIMENTS

4.1. Experimental settings

We evaluate the proposed method on the AISHELL-1 dataset [20], which contains 170 hours of Mandarin speech. We used the Kaldi toolkit [21] to extract 80-dimensional log Mel-filterbank plus 3-dimensional pitch features and normalized them. The training data contained around 120k utterances, and the exhaustive enumeration process described in Section 3.1 resulted in 1.7M training samples. For the first step of the proposed BERT-ASR, i.e., pretraining a BERT model using a large-scale text dataset (cf. Section 3.2), we adopt an updated version of BERT with whole word masking (WWM), whose effectiveness was verified in [22]. The major difference between the updated BERT and the classic BERT lies in the masking procedure of MLM training: if a masked token belongs to a word, all the tokens that complete the word are masked altogether. This is a much more challenging task, since the model is forced to recover the whole word rather than just individual tokens. We directly used the hfl/chinese-bert-wwm pretrained model provided by [22] (available at https://github.com/ymcui/Chinese-BERT-wwm), which was trained on Chinese Wikipedia. The modeling unit was the Mandarin character. We conducted the experiments using the HuggingFace Transformers toolkit [23]. The alignment used during training was obtained by forced alignment with an HMM/DNN model trained on the same AISHELL-1 training set.
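For reference, the pretrained checkpoint can be loaded with the HuggingFace toolkit roughly as follows; this is a usage sketch, not the authors' training code.

```python
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("hfl/chinese-bert-wwm")
bert = BertModel.from_pretrained("hfl/chinese-bert-wwm")

# The hidden size (768 for this base-sized model) is the d_model used by the
# acoustic encoders, and the tokenizer provides the [CLS] token id.
print(bert.config.hidden_size, tokenizer.cls_token_id)
```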
We considered two decoding scenarios with respect to the alignment strategy. First, in oracle decoding, we assumed that the alignment is accessible. Second, to match a practical decoding setting, as a naive attempt, we assumed that the alignment between each utterance and the underlying text is linear, and partitioned the acoustic frames into segments of equal length. The length was set to the average number of frames per word in the training set, which was 25 frames. In both scenarios, we used beam decoding with a beam size of 10.
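The equal-length partitioning can be sketched as follows; how the remaining frames of an utterance are assigned to the last character is our assumption.

```python
def equal_length_boundaries(num_frames, num_tokens, frames_per_word=25):
    """Partition an utterance into equal-length segments for practical decoding,
    assuming a linear alignment between frames and characters.
    (Very short utterances would need extra handling.)"""
    boundaries = []
    for t in range(num_tokens):
        start = min(t * frames_per_word, num_frames)
        if t < num_tokens - 1:
            end = min((t + 1) * frames_per_word, num_frames)
        else:
            end = num_frames            # last character takes the remaining frames
        boundaries.append((start, end))
    return boundaries

print(equal_length_boundaries(num_frames=130, num_tokens=5))
# [(0, 25), (25, 50), (50, 75), (75, 100), (100, 130)]
```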
Table 1: Results on the AISHELL-1 dataset. "Orac." and "Prac." denote oracle decoding and practical decoding, respectively. "Conv1d resnet X" denotes the conv1d resnet encoder with X resnet blocks. CER and SER are given in %.

Model       | Acoustic encoder | PPL (Dev/Test) | CER Orac. (Dev/Test) | CER Prac. (Dev/Test) | SER Orac. (Dev/Test) | SER Prac. (Dev/Test)
Trigram-LM  | -                | 133.32/127.88  | -                    | -                    | -                    | -
LSTM-LM     | -                | 79.97/78.80    | -                    | -                    | -                    | -
BERT-LM     | -                | 39.74/41.72    | -                    | -                    | -                    | -
BERT-ASR    | Average          | 5.88/9.02      | 65.8/68.9            | 96.4/105.8           | 60.3/63.5            | 91.5/100.3
BERT-ASR    | Conv1d resnet 1  | 4.91/7.63      | 55.8/59.0            | 89.6/99.6            | 50.0/53.8            | 84.4/94.1
BERT-ASR    | Conv1d resnet 2  | 4.77/6.94      | 54.6/58.8            | 89.7/99.1            | 49.5/53.6            | 84.6/93.5
BERT-ASR    | Conv1d resnet 3  | 4.83/7.41      | 54.8/58.9            | 89.8/99.4            | 49.6/53.6            | 84.6/93.9
BERT-ASR    | Conv1d resnet 4  | 4.78/7.29      | 54.6/59.0            | 89.5/99.3            | 49.4/53.9            | 84.4/93.8
GMM-HMM     | -                | -              | -                    | 10.4/12.2            | -                    | -
DNN-HMM     | -                | -              | -                    | 7.2/8.4              | -                    | -

4.2. Main results

We report the perplexity (PPL) and character error rate (CER) in Table 1, with the former being a metric to compare the performance of LMs and the latter to compare the performance of ASR systems. As a reference, we first compare the PPL of the different LMs. It can be clearly observed that the BERT-LM outperformed the conventional trigram-LM and LSTM-LM, again showing the power of BERT as an LM.

We then compare BERT-ASR with BERT-LM. Even with the simple average encoder, a significantly lower PPL could be obtained, showing that acoustic clues can greatly help guide recognition. Moreover, models with a more complex acoustic encoder, such as the conv1d resnet encoder, could further reduce the PPL. Looking at the CERs, we observed that even with the simple average encoder, a preliminary success could still be obtained. Furthermore, the conv1d resnet encoders reduced the CER by almost 10%, showing that it is essential to have access to global temporal dependencies before segmentation.

Finally, we consider the practical decoding scenario. There is a significant performance degradation with the equal-length segmentation, which is evidence of the nonlinear nature of the alignment. Thus, finding an alignment-free approach will be urgent future work [24, 25]. The performance of two conventional ASR systems, taken directly from the original AISHELL-1 paper [20], is also listed, and a significant gap exists between our method and these baselines, showing that there is still much room for improvement. Nevertheless, to the best of our knowledge, this is the first study to obtain an ASR system by fine-tuning a pretrained large-scale LM. Moreover, it is worth noting that the proposed method is readily applicable to n-best rescoring [17], though in this paper we mainly focus on building an ASR system. We thus leave this as future work.

4.3. Error Analysis

In this section, we examine two possible reasons for the currently unsatisfying results.

4.3.1. Polyphone in Mandarin

Mandarin is a character-based language, where the same pronunciation can map to different characters. As our method is based on a character-based BERT, it might be infeasible for the model to learn to map the same acoustic signal to different characters. To examine whether our model actually suffers from this problem, we calculated syllable error rates (SERs) and report them in Table 1. The SERs are clearly much lower than the CERs, confirming the existence of this problem. Thus, learning phonetically aware representations will be a future direction.
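The paper does not describe how the SERs were computed. One plausible way to compare hypotheses at the syllable level is to map characters to toneless pinyin (e.g., with the pypinyin package) and compute an edit distance over syllables, as sketched below; this is purely our illustration, not the authors' scoring tool.

```python
# pip install pypinyin
from pypinyin import lazy_pinyin

def edit_distance(ref, hyp):
    """Standard Levenshtein distance over two sequences."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(hyp) + 1)] for i in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))
    return d[len(ref)][len(hyp)]

def syllable_error_rate(ref_chars, hyp_chars):
    """Convert Chinese characters to toneless pinyin syllables and compare."""
    ref, hyp = lazy_pinyin(ref_chars), lazy_pinyin(hyp_chars)
    return edit_distance(ref, hyp) / max(len(ref), 1)
```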

4.3.2. Error propagation

BERT benefits from the self-attention mechanism and is therefore known for its ability to capture global relationships and long-term dependencies. The full power of BERT may not be exerted with a relatively short context; that is to say, our BERT-ASR can be poor at the earlier decoding steps. As a result, errors at the beginning might propagate due to the recursive decoding process. To examine this problem, we assume that the starting characters up to a certain ratio are correctly recognized, and start the decoding process conditioned on those characters. Although we expected the error rates to decrease as the ratio increases, as shown in Table 2, the CERs and SERs were not lower. Thus, we conclude that error propagation was not a major issue.

Table 2: Development set results when a given ratio of the leading characters is assumed to be correctly recognized. CER and SER are given in %.

Model           | Ratio | CER  | SER
Conv1d resnet 3 | 0     | 54.8 | 49.6
Conv1d resnet 3 | 1/3   | 61.9 | 55.3
Conv1d resnet 3 | 1/2   | 57.3 | 51.4

5. CONCLUSION

In this work, we proposed a novel approach to ASR that simply fine-tunes BERT, and described the detailed formulation and several essential techniques. To verify the proposed BERT-ASR, we demonstrated initial results on the Mandarin AISHELL-1 dataset and analyzed two possible sources of error. In the future, we will investigate more complex model architectures and the possibility of multi-task learning, in order to close the gap between our system and conventional ASR systems. We also plan to evaluate the BERT-ASR on other languages and to apply the proposed method to n-best rescoring [17].
6. REFERENCES

[1] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proc. ICML, 2006, pp. 369–376.
[2] A. Graves, "Sequence transduction with recurrent neural networks," arXiv preprint arXiv:1211.3711, 2012.
[3] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in Proc. ICASSP, 2016, pp. 4960–4964.
[4] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. NAACL-HLT, 2019, pp. 4171–4186.
[5] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, "Improving language understanding by generative pre-training," 2018.
[6] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
[7] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, "ALBERT: A lite BERT for self-supervised learning of language representations," arXiv preprint arXiv:1909.11942, 2019.
[8] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. NeurIPS, 2017, pp. 5998–6008.
[9] W. Liu, P. Zhou, Z. Zhao, Z. Wang, Q. Ju, H. Deng, and P. Wang, "K-BERT: Enabling language representation with knowledge graph," in Proc. AAAI, 2020.
[10] C. Qu, L. Yang, M. Qiu, W. B. Croft, Y. Zhang, and M. Iyyer, "BERT with history answer embedding for conversational question answering," in Proc. SIGIR, 2019, pp. 1133–1136.
[11] Y. Liu, "Fine-tune BERT for extractive summarization," arXiv preprint arXiv:1903.10318, 2019.
[12] J. Xu, Z. Gan, Y. Cheng, and J. Liu, "Discourse-aware neural extractive text summarization," in Proc. ACL, 2020, pp. 5021–5031.
[13] W. Lu, J. Jiao, and R. Zhang, "TwinBERT: Distilling knowledge to twin-structured BERT models for efficient retrieval," arXiv preprint arXiv:2002.06275, 2020.
[14] R. Nogueira and K. Cho, "Passage re-ranking with BERT," arXiv preprint arXiv:1901.04085, 2019.
[15] J. Zhu, Y. Xia, L. Wu, D. He, T. Qin, W. Zhou, H. Li, and T. Liu, "Incorporating BERT into neural machine translation," in Proc. ICLR, 2020.
[16] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, "BERTScore: Evaluating text generation with BERT," in Proc. ICLR, 2020.
[17] J. Shin, Y. Lee, and K. Jung, "Effective sentence scoring method using BERT for speech recognition," in Proc. ACML, 2019, vol. 101, pp. 1081–1093.
[18] J. Salazar, D. Liang, T. Q. Nguyen, and K. Kirchhoff, "Masked language model scoring," in Proc. ACL, 2020, pp. 2699–2712.
[19] H. Futami, H. Inaguma, S. Ueno, M. Mimura, S. Sakai, and T. Kawahara, "Distilling the knowledge of BERT for sequence-to-sequence ASR," arXiv preprint arXiv:2008.03822, 2020.
[20] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, "AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline," in Proc. O-COCOSDA, 2017, pp. 1–5.
[21] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," in Proc. ASRU, 2011.
[22] Y. Cui, W. Che, T. Liu, B. Qin, S. Wang, and G. Hu, "Revisiting pre-trained models for Chinese natural language processing," in Proc. Findings of EMNLP, 2020.
[23] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al., "HuggingFace's Transformers: State-of-the-art natural language processing," arXiv preprint arXiv:1910.03771, 2019.
[24] N. Moritz, T. Hori, and J. Le Roux, "Triggered attention for end-to-end speech recognition," in Proc. ICASSP, 2019, pp. 5666–5670.
[25] L. Dong and B. Xu, "CIF: Continuous integrate-and-fire for end-to-end speech recognition," in Proc. ICASSP, 2020, pp. 6079–6083.
