Speech Recognition by Simply Fine-tuning BERT (arXiv:2102.00291)
Wen-Chin Huang^{1,2}, Chia-Hua Wu^{2}, Shang-Bao Luo^{2}, Kuan-Yu Chen^{3}, Hsin-Min Wang^{2}, Tomoki Toda^{1}
^{1} Nagoya University, Japan   ^{2} Academia Sinica, Taiwan   ^{3} National Taiwan University of Science and Technology, Taiwan
In the pretraining stage, BERT adopts a denoising objective that asks the model to reconstruct randomly masked input tokens based on context information. Specifically, 15% of the input tokens are first chosen. Then, each chosen token is (1) replaced with [MASK] 80% of the time, (2) replaced with a random token 10% of the time, or (3) kept unchanged 10% of the time. During fine-tuning, depending on the downstream task, minimal task-specific parameters are introduced, so that fine-tuning can be cheap in terms of data and training effort.
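The 80/10/10 rule above can be made concrete with a short sketch. This is a minimal illustration rather than code from the paper; the mask-token id, vocabulary size, and selection probability are placeholders supplied by the caller.

```python
import random

def mask_for_mlm(token_ids, mask_id, vocab_size, select_prob=0.15):
    """BERT-style denoising: choose ~15% of the tokens; of those,
    replace 80% with [MASK], 10% with a random token, and keep 10%."""
    corrupted = list(token_ids)
    targets = [-1] * len(token_ids)          # -1 marks positions that are not predicted
    for i, tok in enumerate(token_ids):
        if random.random() >= select_prob:
            continue
        targets[i] = tok                     # the model must reconstruct this token
        r = random.random()
        if r < 0.8:
            corrupted[i] = mask_id           # (1) 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = random.randrange(vocab_size)  # (2) 10%: random token
        # (3) remaining 10%: keep the original token unchanged
    return corrupted, targets
```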
3. PROPOSED METHOD

In this section, we explain how we fine-tune a pretrained BERT to formulate an LM, and then further extend it to consume acoustic speech signals to achieve ASR.

3.1. BERT-LM

Each training sentence (y_1, \dots, y_T) is decomposed into T partial sentences, which serve as training samples following the rule in Equation (2):

(y_1, \dots, y_T) \rightarrow \{\, ([CLS]),\; ([CLS], y_1),\; ([CLS], y_1, y_2),\; \dots,\; ([CLS], y_1, \dots, y_{T-1}) \,\}.   (2)

The training of the BERT-LM becomes simply minimizing the following cross-entropy objective:

L_{LM} = -\sum_{i=1}^{N} \sum_{t=1}^{T} \log P(y_t^{(i)} \mid [CLS], y_1^{(i)}, \dots, y_{t-1}^{(i)}).   (3)
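Equations (2) and (3) can be sketched compactly as follows. The `predict_next_logits` callable stands in for a BERT with a vocabulary-sized prediction head; how the next character is read out of the network is not specified in this excerpt, so that part is an illustrative assumption rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

CLS_ID = 101  # placeholder id for [CLS]; in practice it comes from the tokenizer

def enumerate_prefix_samples(token_ids):
    """Equation (2): a sentence (y_1, ..., y_T) yields T samples, each pairing
    a [CLS]-prefixed partial sentence with the next character as the target."""
    return [([CLS_ID] + token_ids[:t], token_ids[t]) for t in range(len(token_ids))]

def bert_lm_loss(predict_next_logits, sentences):
    """Equation (3): accumulate the next-character cross-entropy over all
    partial sentences of all N training sentences."""
    loss = torch.zeros(())
    for token_ids in sentences:
        for prefix, target in enumerate_prefix_samples(token_ids):
            logits = predict_next_logits(torch.tensor([prefix]))      # shape (1, vocab_size)
            loss = loss + F.cross_entropy(logits, torch.tensor([target]))
    return loss
```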
Equation (1) can therefore be reformulated as:

P(y) = \prod_{t=1}^{T} P(y_t \mid y_1, \dots, y_{t-1}, F_1, \dots, F_t).   (4)

Note that the acoustic segment of the current time step is also taken into consideration, which is essential for the model to correctly transcribe the current word being said. The training objective can be derived by reformulating Equation (3) as:

L_{ASR} = -\sum_{i=1}^{N} \sum_{t=1}^{T} \log P(y_t^{(i)} \mid \langle [CLS], F_t^{(i)} \rangle, \langle y_1^{(i)}, F_1^{(i)} \rangle, \dots, \langle y_{t-1}^{(i)}, F_{t-1}^{(i)} \rangle).   (5)
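How a pair ⟨y_k, F_k⟩ in Equation (5) is realized inside the network is not spelled out in this excerpt. The sketch below assumes the simplest option: the acoustic embedding AE_k produced by the acoustic encoder (Section 3.3) is added to the corresponding token embedding before the BERT layers, in the same spirit as BERT's positional embeddings. Module and argument names are illustrative.

```python
import torch
import torch.nn as nn

class BertAsrInputBuilder(nn.Module):
    """Builds the input of Equation (5) for predicting y_t: position 0 holds
    <[CLS], F_t> and positions 1..t-1 hold <y_1, F_1> ... <y_{t-1}, F_{t-1}>.
    Each pair is realized here (an assumption) as token embedding + acoustic embedding."""

    def __init__(self, token_embedding: nn.Embedding):
        super().__init__()
        self.token_embedding = token_embedding   # shared with the pretrained BERT

    def forward(self, prefix_ids, acoustic_embeddings, t):
        # prefix_ids: (t,) long tensor = ([CLS], y_1, ..., y_{t-1})
        # acoustic_embeddings: (T, d_model) = (AE_1, ..., AE_T) from the acoustic encoder
        tok = self.token_embedding(prefix_ids)                       # (t, d_model)
        ae = torch.cat([acoustic_embeddings[t - 1:t],                # AE_t, paired with [CLS]
                        acoustic_embeddings[:t - 1]], dim=0)         # AE_1 ... AE_{t-1}
        return tok + ae                                              # fed to the BERT encoder
```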
In a nutshell, the training of BERT-ASR involves three steps:

1. Pretrain a BERT using a large-scale text dataset.
2. Fine-tune a BERT-LM on the transcriptions of the ASR training dataset, as described in Section 3.1.
3. Fine-tune a BERT-ASR using both the text and the speech data of the ASR training dataset.

3.3. Acoustic encoder

We now describe two kinds of architectures for the acoustic encoder mentioned in Section 3.2. Formally, the acoustic encoder takes the whole acoustic frame sequence X as input and outputs the corresponding acoustic embeddings (AE_1, \dots, AE_T), where AE_t ∈ R^{d_model}, with d_model being the BERT embedding dimension. The acoustic encoder must contain the segment layer to obtain the acoustic segments.

3.3.1. Average encoder

We first consider a very simple average encoder, as depicted in Figure 4. First, the segmentation is performed. Then, each F_t is averaged over the time axis, and the resulting vector is passed through a linear layer to distill the useful information while scaling the dimension from d to d_model. Simple as it seems, as we will show later, initial results can already be obtained with this average encoder.
encoder. 1 https://github.com/ymcui/Chinese-BERT-wwm
4. EXPERIMENTS

4.1. Experimental settings

We evaluate the proposed method on the AISHELL-1 dataset [20], which contains 170 hours of Mandarin speech. We used the Kaldi toolkit [21] to extract 80-dim log Mel-filterbank plus 3-dim pitch features and normalized them. The training data contained around 120k utterances, and the exhaustive enumeration process described in Section 3.1 resulted in 1.7M training samples. For the first step of the proposed BERT-ASR, i.e., pretraining a BERT model using a large-scale text dataset (cf. Section 3.2), we adopt an updated version of BERT trained with whole word masking (WWM), whose effectiveness was verified in [22]. The major difference between the updated BERT and the classic BERT lies in the masking procedure of MLM training: if a masked token belongs to a word, then all the tokens that complete the word are masked altogether. This is a much more challenging task, since the model is forced to recover the whole word rather than just individual tokens. We directly used the hfl/chinese-bert-wwm pretrained model provided by [22] (https://github.com/ymcui/Chinese-BERT-wwm), which was trained on Chinese Wikipedia. The modeling unit was the Mandarin character. We conducted the experiments using the HuggingFace Transformers toolkit [23]. The alignment used during training was obtained by forced alignment with an HMM/DNN model trained on the same AISHELL-1 training set.

We considered two decoding scenarios w.r.t. the alignment strategy. First, in oracle decoding, we assumed that the alignment is accessible. Second, to match a practical decoding setting, as a naive attempt, we assumed that the alignment between each utterance and the underlying text is linear, and partitioned the acoustic frames into segments of equal length. The segment length was set to the average number of frames per word in the training set, which was 25 frames. In both scenarios, we used beam decoding with a beam size of 10.
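Two pieces of this setup can be sketched directly: loading the pretrained Chinese WWM BERT through HuggingFace Transformers, and the naive equal-length segmentation used for practical decoding. The choice of BertModel as the model class is illustrative; the paper's exact model head is not shown in this excerpt.

```python
import torch
from transformers import BertModel, BertTokenizer

# Pretrained whole-word-masking Chinese BERT [22] as the starting point.
tokenizer = BertTokenizer.from_pretrained("hfl/chinese-bert-wwm")
bert = BertModel.from_pretrained("hfl/chinese-bert-wwm")

def equal_length_boundaries(num_frames: int, frames_per_char: int = 25):
    """Practical decoding: assume a linear alignment and cut the utterance into
    consecutive segments of 25 frames (the training-set average per character)."""
    return [(s, min(s + frames_per_char, num_frames))
            for s in range(0, num_frames, frames_per_char)]

# Example: an utterance of 120 frames with 83-dim features (80 fbank + 3 pitch)
feats = torch.randn(120, 83)
print(equal_length_boundaries(feats.size(0)))
# [(0, 25), (25, 50), (50, 75), (75, 100), (100, 120)]
```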
Table 1: Results on the AISHELL-1 dataset. "Orac." and "Prac." denote the oracle decoding and practical decoding, respectively. "Conv1d resnet X" denotes the conv1d resnet encoder with X resnet blocks. The best performance of the BERT-ASR is shown in bold.

4.2. Main results

We report the perplexity (PPL) and the character error rate (CER) in Table 1, with the former being a metric to compare the performance of LMs and the latter to compare the performance of ASR systems. As a reference, we first compared the PPL between different LMs. It can be clearly observed that the BERT-LM outperformed the conventional trigram-LM and LSTM-LM, again showing the power of BERT as an LM.

We then compare BERT-ASR with BERT-LM. By using a simple average encoder, a significantly lower PPL could be obtained, showing that acoustic clues can greatly help guide recognition. Moreover, models with a more complex acoustic encoder like the conv1d resnet encoder could further reduce the PPL. Looking at the CERs, we observed that even with the simple average encoder, a preliminary success could still be obtained. Furthermore, the conv1d resnet encoders reduced the CER by almost 10%, showing that it is essential to have access to global temporal dependencies before segmentation.

We finally consider the practical decoding scenario. There is a significant performance degradation with the equal segmentation, which is evidence of the nonlinear nature of the alignment. Thus, finding an alignment-free approach will be urgent future work [24, 25]. The performance of two conventional ASR systems, taken directly from the original AISHELL-1 paper [20], is also listed, and a significant gap exists between our method and the baselines, showing that there is still much room for improvement. Nevertheless, to the best of our knowledge, this is the first study to obtain an ASR system by fine-tuning a pretrained large-scale LM. Moreover, it is worth noting that the proposed method is readily applicable to n-best re-scoring [17], though in this paper we mainly focus on building an ASR system. We thus leave this as future work.

4.3. Error analysis

In this section, we examine two possible reasons for the currently unsatisfying results.

4.3.1. Polyphone in Mandarin

Mandarin is a character-based language, where the same pronunciation can be mapped to different characters. As our method is based on a character-based BERT, it might be infeasible for the model to learn to map the same acoustic signal to different characters. To examine whether our model actually suffered from this problem, syllable error rates (SERs) were calculated and reported in Table 1. The SERs are much lower than the CERs, confirming the existence of this problem. Thus, learning phonetically aware representations will be a future direction.
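One way such an SER could be computed is sketched below: convert hypothesis and reference characters to toneless pinyin syllables and measure the edit distance over syllable sequences, so that homophone substitutions are not counted as errors. The use of pypinyin and the decision to ignore tones are assumptions of this illustration, not details given in the paper.

```python
from pypinyin import lazy_pinyin

def edit_distance(a, b):
    """Levenshtein distance between two sequences (dynamic programming)."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def syllable_error_rate(hyp: str, ref: str) -> float:
    """SER sketch: compare toneless syllable sequences instead of characters."""
    hyp_syl = lazy_pinyin(list(hyp))   # one syllable per character
    ref_syl = lazy_pinyin(list(ref))
    return edit_distance(hyp_syl, ref_syl) / max(len(ref_syl), 1)
```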
4.3.2. Error propagation

BERT benefits from the self-attention mechanism and is therefore known for its ability to capture global relationships and long-term dependencies. The full power of BERT may not be exerted with a relatively short context; that is to say, our BERT-ASR can be poor at the earlier decoding steps. As a result, errors at the beginning might propagate due to the recursive decoding process. To examine this problem, we assume that the starting characters up to a certain ratio are correctly recognized, and start the decoding process conditioned on those characters. Although we expected the error rates to decrease as the ratio increases, as shown in Table 2, the CERs and SERs were not lower. Thus, we conclude that error propagation was not a major issue.

Table 2: Development set results when a given ratio of the leading characters is assumed to be correctly recognized.

Model             Ratio   CER    SER
Conv1d resnet 3   0       54.8   49.6
                  1/3     61.9   55.3
                  1/2     57.3   51.4

5. CONCLUSION

In this work, we proposed a novel approach to ASR by simply fine-tuning BERT, and described the detailed formulation and several essential techniques. To verify the proposed BERT-ASR, we demonstrated initial results on the Mandarin AISHELL-1 dataset and analyzed two possible sources of error. In the future, we will investigate more complex model architectures and the possibility of multi-task learning, in order to close the gap between our method and conventional ASR systems. We also plan to evaluate the BERT-ASR on other languages, and to apply the proposed method to n-best re-scoring [17].
6. REFERENCES

[1] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proc. ICML, 2006, pp. 369–376.
[2] A. Graves, "Sequence transduction with recurrent neural networks," arXiv preprint arXiv:1211.3711, 2012.
[3] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in Proc. ICASSP, 2016, pp. 4960–4964.
[4] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. NAACL-HLT, 2019, pp. 4171–4186.
[5] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, "Improving language understanding by generative pre-training," 2018.
[6] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
[7] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, "ALBERT: A lite BERT for self-supervised learning of language representations," arXiv preprint arXiv:1909.11942, 2019.
[8] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. NeurIPS, 2017, pp. 5998–6008.
[9] W. Liu, P. Zhou, Z. Zhao, Z. Wang, Q. Ju, H. Deng, and P. Wang, "K-BERT: Enabling language representation with knowledge graph," in Proc. AAAI, 2020.
[10] C. Qu, L. Yang, M. Qiu, W. B. Croft, Y. Zhang, and M. Iyyer, "BERT with history answer embedding for conversational question answering," in Proc. SIGIR, 2019, pp. 1133–1136.
[11] Y. Liu, "Fine-tune BERT for extractive summarization," arXiv preprint arXiv:1903.10318, 2019.
[12] J. Xu, Z. Gan, Y. Cheng, and J. Liu, "Discourse-aware neural extractive text summarization," in Proc. ACL, 2020, pp. 5021–5031.
[13] W. Lu, J. Jiao, and R. Zhang, "TwinBERT: Distilling knowledge to twin-structured BERT models for efficient retrieval," arXiv preprint arXiv:2002.06275, 2020.
[14] R. Nogueira and K. Cho, "Passage re-ranking with BERT," arXiv preprint arXiv:1901.04085, 2019.
[15] J. Zhu, Y. Xia, L. Wu, D. He, T. Qin, W. Zhou, H. Li, and T. Liu, "Incorporating BERT into neural machine translation," in Proc. ICLR, 2020.
[16] T. Zhang*, V. Kishore*, F. Wu*, K. Q. Weinberger, and Y. Artzi, "BERTScore: Evaluating text generation with BERT," in Proc. ICLR, 2020.
[17] J. Shin, Y. Lee, and K. Jung, "Effective sentence scoring method using BERT for speech recognition," in Proc. ACML, 2019, vol. 101, pp. 1081–1093.
[18] J. Salazar, D. Liang, T. Q. Nguyen, and K. Kirchhoff, "Masked language model scoring," in Proc. ACL, 2020, pp. 2699–2712.
[19] H. Futami, H. Inaguma, S. Ueno, M. Mimura, S. Sakai, and T. Kawahara, "Distilling the knowledge of BERT for sequence-to-sequence ASR," arXiv preprint arXiv:2008.03822, 2020.
[20] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, "AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline," in Proc. O-COCOSDA, 2017, pp. 1–5.
[21] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," in Proc. ASRU, 2011.
[22] Y. Cui, W. Che, T. Liu, B. Qin, S. Wang, and G. Hu, "Revisiting pre-trained models for Chinese natural language processing," in Proc. Findings of EMNLP, 2020.
[23] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al., "HuggingFace's Transformers: State-of-the-art natural language processing," arXiv preprint arXiv:1910.03771, 2019.
[24] N. Moritz, T. Hori, and J. Le Roux, "Triggered attention for end-to-end speech recognition," in Proc. ICASSP, 2019, pp. 5666–5670.
[25] L. Dong and B. Xu, "CIF: Continuous integrate-and-fire for end-to-end speech recognition," in Proc. ICASSP, 2020, pp. 6079–6083.