
SPEECH RECOGNITION BY SIMPLY FINE-TUNING BERT

Wen-Chin Huang^{1,2}, Chia-Hua Wu^{2}, Shang-Bao Luo^{2}, Kuan-Yu Chen^{3}, Hsin-Min Wang^{2}, Tomoki Toda^{1}

^{1} Nagoya University, Japan; ^{2} Academia Sinica, Taiwan; ^{3} National Taiwan University of Science and Technology, Taiwan

ABSTRACT

We propose a simple method for automatic speech recognition (ASR) by fine-tuning BERT, a language model (LM) trained on large-scale unlabeled text data that can generate rich contextual representations. Our assumption is that, given a history context sequence, a powerful LM can narrow the range of possible choices, so that the speech signal can be used as a simple clue. Hence, compared to conventional ASR systems that train a powerful acoustic model (AM) from scratch, we believe that speech recognition is possible by simply fine-tuning a BERT model. As an initial study, we demonstrate the effectiveness of the proposed idea on the AISHELL dataset and show that stacking a very simple AM on top of BERT can yield reasonable performance.

Index Terms: speech recognition, BERT, language model

1. INTRODUCTION

Conventional automatic speech recognition (ASR) systems consist of multiple separately optimized modules, including an acoustic model (AM), a language model (LM), and a lexicon. In recent years, end-to-end (E2E) ASR models have attracted much attention, due to the belief that jointly optimizing a single model helps avoid not only task-specific engineering but also error propagation. Current mainstream E2E approaches include connectionist temporal classification (CTC) [1], neural transducers [2], and attention-based sequence-to-sequence (seq2seq) learning [3].

LMs play an essential role in ASR. Even E2E models that implicitly integrate an LM into their optimization can benefit from LM fusion. It is therefore worth asking: how can we make full use of the power of LMs? Consider a situation in which we are in the middle of transcribing a speech utterance, have already correctly recognized a sequence of history words, and want to determine the next word being said. From a probabilistic point of view, a strong LM can then generate a list of candidates, each of which is highly likely to be the next word. The list may be so short that only one answer is left. As a result, we need few to no clues from the speech signal to correctly recognize the next word.

There has been rapid development of LMs in the field of natural language processing (NLP), and one of the most epoch-making approaches is BERT [4]. Its success comes from a framework in which a pretraining stage is followed by a task-specific fine-tuning stage. Thanks to the un-/self-supervised objectives adopted in pretraining, large-scale unlabeled datasets can be used for training, making it possible to learn enriched language representations that are powerful on various NLP tasks. BERT and its variants have created a dominant paradigm in NLP in the past year [5, 6, 7].

In this work, we propose a novel approach to ASR, which is to simply fine-tune a pretrained BERT model. Our method, which we call BERT-ASR, formulates ASR as a classification problem, where the objective is to correctly classify the next word given the acoustic speech signals and the history words. We show that even an AM that simply averages the frame-based acoustic features corresponding to a word can be applied to BERT-ASR to correctly transcribe speech to a certain extent, and that the performance can be further boosted by using a more complex model.

2. BERT

BERT [4], which stands for Bidirectional Encoder Representations from Transformers, is a pretraining method based on an LM objective with a Transformer encoder architecture [8]. The full power of BERT is released only through a pretraining and fine-tuning framework, where the model is first trained on a large-scale unlabeled text dataset, and then all or some parameters are fine-tuned on a labeled dataset for the downstream task.

The original usage of BERT mainly focused on NLP tasks, ranging from token-level to sequence-level classification, including question answering [9, 10], document summarization [11, 12], information retrieval [13, 14], and machine translation [15, 16], just to name a few. There have also been attempts to combine BERT with ASR, including rescoring [17, 18] and generating soft labels for training [19]. In this section, we review the fundamentals of BERT.

2.1. Model architecture and input representations

BERT adopts a multi-layer Transformer [8] encoder, where each layer contains a multi-head self-attention sublayer followed by a position-wise fully connected feedforward network. Each layer is equipped with residual connections and layer normalization.

The input representation of BERT was designed to handle a variety of downstream tasks, as visualized in Figure 1. First, a token embedding is assigned to each token in the vocabulary. Some special tokens were added in the original BERT, including a classification token ([CLS]), which is prepended to every sequence and whose final hidden state is used as the aggregate sequence representation for classification tasks, and a separation token ([SEP]) for separating two sentences. Second, a segment embedding is added to every token to indicate whether it belongs to sentence A or B. Finally, a learned positional embedding is added so that the model is aware of the relative or absolute position of each token. The input representation of every token is then constructed by summing the corresponding token, segment, and position embeddings.
2.2. Training and fine-tuning

Two self-supervised objectives were used for pretraining BERT. The first one is masked language modeling (MLM), a denoising objective that asks the model to reconstruct randomly masked input tokens based on context information. Specifically, 15% of the input tokens are first chosen; each chosen token is then (1) replaced with [MASK] 80% of the time, (2) replaced with a random token 10% of the time, or (3) kept unchanged 10% of the time.

During fine-tuning, depending on the downstream task, minimal task-specific parameters are introduced, so that fine-tuning can be cheap in terms of data and training effort.
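To make the masking rule concrete, the following is a minimal sketch of the 80%/10%/10% corruption procedure described above. It is illustrative only, not the authors' implementation; the token-ID representation, `mask_id`, `vocab_size`, and the use of -100 as an ignore label are our assumptions.

```python
import random

def mask_for_mlm(token_ids, mask_id, vocab_size, select_prob=0.15):
    """Apply the BERT MLM corruption rule: choose 15% of tokens, then
    replace with [MASK] 80% of the time, with a random token 10% of the
    time, and keep the original token 10% of the time."""
    inputs, targets = list(token_ids), [-100] * len(token_ids)  # -100 = position not predicted
    for i, tok in enumerate(token_ids):
        if random.random() >= select_prob:
            continue
        targets[i] = tok                                  # model must reconstruct this token
        r = random.random()
        if r < 0.8:
            inputs[i] = mask_id                           # (1) replace with [MASK]
        elif r < 0.9:
            inputs[i] = random.randrange(vocab_size)      # (2) replace with a random token
        # (3) otherwise: keep the token unchanged
    return inputs, targets
```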
Fig. 1: The input representation of the original BERT and the proposed BERT-ASR.
Fig. 2: Illustration of the decoding process of the proposed BERT-ASR.

3. PROPOSED METHOD

In this section, we explain how we fine-tune a pretrained BERT to formulate an LM, and then further extend it to consume acoustic speech signals to achieve ASR.

Assume we have an ASR training dataset containing N speech utterances, D_ASR = {<X^(i), y^(i)>}_{i=1}^{N}, with each y = (y_1, ..., y_T) being the transcription consisting of T tokens, and each X = (x_1, ..., x_{T'}) denoting a sequence of T' input acoustic feature frames. The acoustic features are of dimension d, i.e., x_t ∈ R^d, and the tokens are drawn from a vocabulary of size V.

3.1. Training a probabilistic LM with BERT

We first show how we formulate a probabilistic LM using BERT, which we will refer to as BERT-LM. The probability of observing a symbol sequence y can be formulated as:

    P(y) = P(y_1, \dots, y_T) = \prod_{t=1}^{T} P(y_t \mid y_1, \dots, y_{t-1})    (1)

The decoding (or scoring) of a given symbol sequence then becomes an iterative process that calculates all the terms in the product, as illustrated in Figure 2. At the t-th time step, the BERT model takes a sequence of previously decoded symbols and the [CLS] token as input, i.e., ([CLS], y_1, ..., y_{t-1}). The final hidden state corresponding to [CLS] is then fed into a linear classifier, which outputs the probability distribution P(y_t | y_1, ..., y_{t-1}).
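As an illustration of this decoding step, the sketch below builds such a classifier on top of a HuggingFace BertModel: the final hidden state of [CLS] is passed through a linear layer over the vocabulary to obtain P(y_t | y_1, ..., y_{t-1}). The class name, checkpoint, and layer sizes are our assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class BertLMHead(nn.Module):
    """BERT-LM-style classifier: the [CLS] hidden state predicts the next token."""
    def __init__(self, pretrained="bert-base-chinese", vocab_size=None):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained)
        vocab_size = vocab_size or self.bert.config.vocab_size
        self.classifier = nn.Linear(self.bert.config.hidden_size, vocab_size)

    def next_token_log_probs(self, input_ids):
        # input_ids: (batch, t) holding ([CLS], y_1, ..., y_{t-1})
        hidden = self.bert(input_ids).last_hidden_state        # (batch, t, hidden)
        cls_state = hidden[:, 0]                                # final hidden state of [CLS]
        return torch.log_softmax(self.classifier(cls_state), dim=-1)
```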
To train the model, assume we have a text training dataset with N sentences, D_text = {y^(i)}_{i=1}^{N}. An essential technique for training the model is to exhaustively enumerate all possible training samples. That is, each sentence with T symbols is extended to T different training samples following the rule in Equation (2):

    (y_1, \dots, y_T) \rightarrow \{ ([CLS]),\ ([CLS], y_1),\ ([CLS], y_1, y_2),\ \dots,\ ([CLS], y_1, \dots, y_{T-1}) \}    (2)

The training of the BERT-LM then amounts to minimizing the following cross-entropy objective:

    \mathcal{L}_{LM} = -\sum_{i=1}^{N} \sum_{t=1}^{T} \log P(y_t^{(i)} \mid [CLS], y_1^{(i)}, \dots, y_{t-1}^{(i)})    (3)
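A minimal sketch of the exhaustive enumeration in Equation (2) is given below (illustrative only; `cls_id` denotes the vocabulary index of the [CLS] token).

```python
def enumerate_training_samples(token_ids, cls_id):
    """Expand one sentence (y_1, ..., y_T) into T (input, target) pairs
    following Equation (2): the t-th input is ([CLS], y_1, ..., y_{t-1})
    and its target is y_t."""
    samples = []
    for t in range(len(token_ids)):
        history = [cls_id] + token_ids[:t]
        samples.append((history, token_ids[t]))
    return samples

# A 3-token sentence yields 3 samples:
# ([CLS]) -> y1, ([CLS], y1) -> y2, ([CLS], y1, y2) -> y3
print(enumerate_training_samples([11, 12, 13], cls_id=101))
```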

3.2. BERT-ASR

We introduce our proposed BERT-ASR in this section. Since the time resolutions of text and speech are at completely different scales, for the model described in Section 3.1 to be capable of taking acoustic features as input, we first assume that we know the nonlinear alignment between an acoustic feature sequence and the corresponding text, as depicted in Figure 3. Specifically, for a pair of training transcription and acoustic feature sequence <(x_1, ..., x_{T'}), (y_1, ..., y_T)>, we denote F_i as the segment of features corresponding to y_i: F_i = (x_{t_{i-1}+1}, ..., x_{t_i}) ∈ R^{(t_i - t_{i-1}) × d}, which runs from frame t_{i-1}+1 to frame t_i, with t_0 = 0. Thus, the T' acoustic frames can be segmented into T groups, X = (F_1, ..., F_T), and a new dataset containing segmented acoustic feature and text pairs can be obtained: D_seg = {<F^(i), y^(i)>}_{i=1}^{N}.

We can then augment the BERT-LM into the BERT-ASR by injecting the acoustic information extracted from D_seg into the BERT-LM. Specifically, as depicted in Figure 1, an acoustic encoder, which will be described later, consumes the raw acoustic feature segments to generate acoustic embeddings. These are summed with the three types of embeddings in the original BERT and then sent into BERT. Note that the acoustic embedding corresponding to the current word to be transcribed is added to the [CLS] token, as shown in Figure 2.

The probability of observing a symbol sequence y in Equation (1) can therefore be reformulated as:

    P(y) = \prod_{t=1}^{T} P(y_t \mid y_1, \dots, y_{t-1}, F_1, \dots, F_t)    (4)

Note that the acoustic segment of the current time step is also taken into consideration, which is essential for the model to correctly transcribe the current word being said. The training objective can be derived by reformulating Equation (3) as:

    \mathcal{L}_{ASR} = -\sum_{i=1}^{N} \sum_{t=1}^{T} \log P\big(y_t^{(i)} \mid \langle [CLS], F_t \rangle, \langle y_1^{(i)}, F_1 \rangle, \dots, \langle y_{t-1}^{(i)}, F_{t-1} \rangle\big)    (5)

In a nutshell, the training of BERT-ASR involves three steps:

1. Pretrain a BERT model using a large-scale text dataset.
2. Fine-tune a BERT-LM on the transcriptions of the ASR training dataset, as described in Section 3.1.
3. Fine-tune a BERT-ASR using both the text and speech data of the ASR training dataset.
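The injection of acoustic embeddings described above can be sketched with HuggingFace's `inputs_embeds` interface: the acoustic embeddings are added to the token embeddings, shifted so that the embedding of the segment to be transcribed aligns with [CLS], and BERT then adds its segment and position embeddings internally. This is our reading of Figures 1 and 2, not the authors' implementation; function and argument names are hypothetical.

```python
import torch
from transformers import BertModel

def bert_asr_hidden_states(bert: BertModel, input_ids, acoustic_embeddings):
    """input_ids: (batch, t) = ([CLS], y_1, ..., y_{t-1}).
    acoustic_embeddings: (batch, t, hidden) = (AE_t, AE_1, ..., AE_{t-1}),
    so that the segment of the word to be transcribed is aligned with [CLS]."""
    token_embeddings = bert.embeddings.word_embeddings(input_ids)
    fused = token_embeddings + acoustic_embeddings
    # When inputs_embeds is given, BERT still adds its own segment and
    # position embeddings inside the embedding layer.
    return bert(inputs_embeds=fused).last_hidden_state
```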
Fig. 3: Illustration of the alignment between the text and the acoustic frames.
Fig. 4: The average encoder.
Fig. 5: The conv1d resnet encoder.

3.3. Acoustic encoder

We now describe two kinds of architectures for the acoustic encoder mentioned in Section 3.2. Formally, the acoustic encoder takes the whole acoustic frame sequence X as input and outputs the corresponding acoustic embeddings (AE_1, ..., AE_T), where AE_t ∈ R^{d_model}, with d_model being the BERT embedding dimension. The acoustic encoder must contain a segment layer to obtain the acoustic segments.

3.3.1. Average encoder

We first consider a very simple average encoder, as depicted in Figure 4. First, the segmentation is performed. Then, each F_t is averaged over the time axis, and the resulting vector is passed through a linear layer to distill the useful information while scaling the dimension from d to d_model. Simple as it seems, as we will show later, initial results can already be obtained with this average encoder.
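A minimal sketch of such an average encoder is given below, assuming 83-dimensional input features (80 log Mel filterbanks plus 3 pitch features, cf. Section 4.1) and the base BERT dimension of 768; the (start, end) boundary format is our assumption.

```python
import torch
import torch.nn as nn

class AverageEncoder(nn.Module):
    """Average each acoustic segment F_t over time, then map d -> d_model."""
    def __init__(self, feat_dim=83, d_model=768):
        super().__init__()
        self.linear = nn.Linear(feat_dim, d_model)

    def forward(self, frames, boundaries):
        # frames: (T', d) acoustic features of one utterance
        # boundaries: list of (start, end) frame indices, one pair per token
        segment_means = torch.stack(
            [frames[s:e].mean(dim=0) for s, e in boundaries])   # (T, d)
        return self.linear(segment_means)                        # (T, d_model) acoustic embeddings
```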
3.3.2. Conv1d resnet encoder

The drawback of the average encoder is that temporal dependencies between different acoustic segments are not considered. We therefore investigate a second encoder, which we will refer to as the conv1d resnet encoder, as illustrated in Figure 5. While it has the same segment and linear layers as the average encoder, we add L learnable residual blocks that operate on X. Each residual block contains two conv1d layers over the time axis followed by ReLU activations. It is expected that taking the temporal relationship between segments into account can boost performance.
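A sketch of this encoder is given below; the kernel size and the exact placement of the residual connection are assumptions, since the paper does not specify them.

```python
import torch
import torch.nn as nn

class ResBlock1d(nn.Module):
    """Two conv1d layers over the time axis with ReLU activations and a skip connection."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=pad)

    def forward(self, x):                                     # x: (batch, d, T')
        return torch.relu(x + self.conv2(torch.relu(self.conv1(x))))

class Conv1dResnetEncoder(nn.Module):
    """L residual blocks over the whole frame sequence, followed by the same
    segment-averaging and linear layers as the average encoder."""
    def __init__(self, feat_dim=83, d_model=768, num_blocks=3, kernel_size=3):
        super().__init__()
        self.blocks = nn.Sequential(
            *[ResBlock1d(feat_dim, kernel_size) for _ in range(num_blocks)])
        self.linear = nn.Linear(feat_dim, d_model)

    def forward(self, frames, boundaries):                    # frames: (T', d)
        x = self.blocks(frames.t().unsqueeze(0))              # (1, d, T'): conv over time
        x = x.squeeze(0).t()                                   # back to (T', d)
        segment_means = torch.stack([x[s:e].mean(dim=0) for s, e in boundaries])
        return self.linear(segment_means)                      # (T, d_model)
```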
4. EXPERIMENTS

4.1. Experimental settings

We evaluate the proposed method on the AISHELL-1 dataset [20], which contains 170 hours of Mandarin speech. We used the Kaldi toolkit [21] to extract 80-dimensional log Mel-filterbank plus 3-dimensional pitch features and normalized them. The training data contained around 120k utterances, and the exhaustive enumeration process described in Section 3.1 resulted in 1.7M training samples. For the first step of the proposed BERT-ASR, i.e., pretraining a BERT model using a large-scale text dataset (cf. Section 3.2), we adopt an updated version of BERT with whole word masking (WWM), whose effectiveness was verified in [22]. The major difference between the updated BERT and the classic BERT lies in the masking procedure of MLM training: if a masked token belongs to a word, all the tokens that complete the word are masked altogether. This is a much more challenging task, since the model is forced to recover the whole word rather than just individual tokens. We directly used the hfl/chinese-bert-wwm pretrained model provided by [22] (available at https://github.com/ymcui/Chinese-BERT-wwm), which was trained on Chinese Wikipedia. The modeling unit was the Mandarin character. We conducted the experiments using the HuggingFace Transformers toolkit [23]. The alignment used during training was obtained by forced alignment with an HMM/DNN model trained on the same AISHELL-1 training set.
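For reference, the pretrained checkpoint can be loaded with the HuggingFace toolkit roughly as follows; this is a usage sketch, not the authors' training code.

```python
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("hfl/chinese-bert-wwm")
bert = BertModel.from_pretrained("hfl/chinese-bert-wwm")

# The hidden size (768 for this base-sized model) is the d_model used by the
# acoustic encoders, and the tokenizer provides the [CLS] token id.
print(bert.config.hidden_size, tokenizer.cls_token_id)
```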
We considered two decoding scenarios with respect to the alignment strategy. First, in oracle decoding, we assumed that the alignment is accessible. Second, to match a practical decoding setting, as a naive attempt, we assumed that the alignment between each utterance and the underlying text is linear, and partitioned the acoustic frames into segments of equal length. The length was set to the average number of frames per word in the training set, which was 25 frames. In both scenarios, we used beam decoding with a beam size of 10.
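The equal-length partitioning can be sketched as follows; how the remaining frames of an utterance are assigned to the last character is our assumption.

```python
def equal_length_boundaries(num_frames, num_tokens, frames_per_word=25):
    """Partition an utterance into equal-length segments for practical decoding,
    assuming a linear alignment between frames and characters.
    (Very short utterances would need extra handling.)"""
    boundaries = []
    for t in range(num_tokens):
        start = min(t * frames_per_word, num_frames)
        if t < num_tokens - 1:
            end = min((t + 1) * frames_per_word, num_frames)
        else:
            end = num_frames            # last character takes the remaining frames
        boundaries.append((start, end))
    return boundaries

print(equal_length_boundaries(num_frames=130, num_tokens=5))
# [(0, 25), (25, 50), (50, 75), (75, 100), (100, 130)]
```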
Table 1: Results on the AISHELL-1 dataset. "Orac." and "Prac." denote oracle decoding and practical decoding, respectively. "Conv1d resnet X" denotes the conv1d resnet encoder with X resnet blocks. CER and SER are given in %.

Model       | Acoustic encoder | PPL (Dev/Test) | CER Orac. (Dev/Test) | CER Prac. (Dev/Test) | SER Orac. (Dev/Test) | SER Prac. (Dev/Test)
Trigram-LM  | -                | 133.32/127.88  | -                    | -                    | -                    | -
LSTM-LM     | -                | 79.97/78.80    | -                    | -                    | -                    | -
BERT-LM     | -                | 39.74/41.72    | -                    | -                    | -                    | -
BERT-ASR    | Average          | 5.88/9.02      | 65.8/68.9            | 96.4/105.8           | 60.3/63.5            | 91.5/100.3
BERT-ASR    | Conv1d resnet 1  | 4.91/7.63      | 55.8/59.0            | 89.6/99.6            | 50.0/53.8            | 84.4/94.1
BERT-ASR    | Conv1d resnet 2  | 4.77/6.94      | 54.6/58.8            | 89.7/99.1            | 49.5/53.6            | 84.6/93.5
BERT-ASR    | Conv1d resnet 3  | 4.83/7.41      | 54.8/58.9            | 89.8/99.4            | 49.6/53.6            | 84.6/93.9
BERT-ASR    | Conv1d resnet 4  | 4.78/7.29      | 54.6/59.0            | 89.5/99.3            | 49.4/53.9            | 84.4/93.8
GMM-HMM     | -                | -              | -                    | 10.4/12.2            | -                    | -
DNN-HMM     | -                | -              | -                    | 7.2/8.4              | -                    | -

4.2. Main results

We report the perplexity (PPL) and character error rate (CER) in Table 1, with the former being a metric to compare the performance of LMs and the latter to compare the performance of ASR systems. As a reference, we first compare the PPL of the different LMs. It can be clearly observed that the BERT-LM outperformed the conventional trigram-LM and LSTM-LM, again showing the power of BERT as an LM.

We then compare BERT-ASR with BERT-LM. Even with the simple average encoder, a significantly lower PPL could be obtained, showing that acoustic clues can greatly help guide recognition. Moreover, models with a more complex acoustic encoder, such as the conv1d resnet encoder, could further reduce the PPL. Looking at the CERs, we observed that even with the simple average encoder, a preliminary success could still be obtained. Furthermore, the conv1d resnet encoders reduced the CER by almost 10%, showing that it is essential to have access to global temporal dependencies before segmentation.

Finally, we consider the practical decoding scenario. There is a significant performance degradation with the equal-length segmentation, which is evidence of the nonlinear nature of the alignment. Thus, finding an alignment-free approach will be urgent future work [24, 25]. The performance of two conventional ASR systems, taken directly from the original AISHELL-1 paper [20], is also listed, and a significant gap exists between our method and these baselines, showing that there is still much room for improvement. Nevertheless, to the best of our knowledge, this is the first study to obtain an ASR system by fine-tuning a pretrained large-scale LM. Moreover, it is worth noting that the proposed method is readily applicable to n-best rescoring [17], though in this paper we mainly focus on building an ASR system. We thus leave this as future work.

4.3. Error Analysis

In this section, we examine two possible reasons for the currently unsatisfying results.

4.3.1. Polyphone in Mandarin

Mandarin is a character-based language, where the same pronunciation can map to different characters. As our method is based on a character-based BERT, it might be infeasible for the model to learn to map the same acoustic signal to different characters. To examine whether our model actually suffers from this problem, we calculated syllable error rates (SERs) and report them in Table 1. The SERs are clearly much lower than the CERs, confirming the existence of this problem. Thus, learning phonetically aware representations will be a future direction.
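The paper does not describe how the SERs were computed. One plausible way to compare hypotheses at the syllable level is to map characters to toneless pinyin (e.g., with the pypinyin package) and compute an edit distance over syllables, as sketched below; this is purely our illustration, not the authors' scoring tool.

```python
# pip install pypinyin
from pypinyin import lazy_pinyin

def edit_distance(ref, hyp):
    """Standard Levenshtein distance over two sequences."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(hyp) + 1)] for i in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))
    return d[len(ref)][len(hyp)]

def syllable_error_rate(ref_chars, hyp_chars):
    """Convert Chinese characters to toneless pinyin syllables and compare."""
    ref, hyp = lazy_pinyin(ref_chars), lazy_pinyin(hyp_chars)
    return edit_distance(ref, hyp) / max(len(ref), 1)
```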

4.3.2. Error propagation

BERT benefits from the self-attention mechanism and is therefore known for its ability to capture global relationships and long-term dependencies. The full power of BERT may not be exerted with a relatively short context; that is to say, our BERT-ASR can be poor at the earlier decoding steps. As a result, errors at the beginning might propagate due to the recursive decoding process. To examine this problem, we assume that the starting characters up to a certain ratio are correctly recognized, and start the decoding process conditioned on those characters. Although we expected the error rates to decrease as the ratio increases, as shown in Table 2, the CERs and SERs were not lower. Thus, we conclude that error propagation was not a major issue.

Table 2: Development set results when a given ratio of the leading characters is assumed to be correctly recognized. CER and SER are given in %.

Model           | Ratio | CER  | SER
Conv1d resnet 3 | 0     | 54.8 | 49.6
Conv1d resnet 3 | 1/3   | 61.9 | 55.3
Conv1d resnet 3 | 1/2   | 57.3 | 51.4

5. CONCLUSION

In this work, we proposed a novel approach to ASR that simply fine-tunes BERT, and described the detailed formulation and several essential techniques. To verify the proposed BERT-ASR, we demonstrated initial results on the Mandarin AISHELL-1 dataset and analyzed two possible sources of error. In the future, we will investigate more complex model architectures and the possibility of multi-task learning, in order to close the gap between our system and conventional ASR systems. We also plan to evaluate the BERT-ASR on other languages and to apply the proposed method to n-best rescoring [17].
6. REFERENCES

[1] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proc. ICML, 2006, pp. 369–376.
[2] A. Graves, "Sequence transduction with recurrent neural networks," arXiv preprint arXiv:1211.3711, 2012.
[3] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in Proc. ICASSP, 2016, pp. 4960–4964.
[4] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. NAACL-HLT, 2019, pp. 4171–4186.
[5] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, "Improving language understanding by generative pre-training," 2018.
[6] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
[7] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, "ALBERT: A lite BERT for self-supervised learning of language representations," arXiv preprint arXiv:1909.11942, 2019.
[8] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. NeurIPS, 2017, pp. 5998–6008.
[9] W. Liu, P. Zhou, Z. Zhao, Z. Wang, Q. Ju, H. Deng, and P. Wang, "K-BERT: Enabling language representation with knowledge graph," in Proc. AAAI, 2020.
[10] C. Qu, L. Yang, M. Qiu, W. B. Croft, Y. Zhang, and M. Iyyer, "BERT with history answer embedding for conversational question answering," in Proc. SIGIR, 2019, pp. 1133–1136.
[11] Y. Liu, "Fine-tune BERT for extractive summarization," arXiv preprint arXiv:1903.10318, 2019.
[12] J. Xu, Z. Gan, Y. Cheng, and J. Liu, "Discourse-aware neural extractive text summarization," in Proc. ACL, 2020, pp. 5021–5031.
[13] W. Lu, J. Jiao, and R. Zhang, "TwinBERT: Distilling knowledge to twin-structured BERT models for efficient retrieval," arXiv preprint arXiv:2002.06275, 2020.
[14] R. Nogueira and K. Cho, "Passage re-ranking with BERT," arXiv preprint arXiv:1901.04085, 2019.
[15] J. Zhu, Y. Xia, L. Wu, D. He, T. Qin, W. Zhou, H. Li, and T. Liu, "Incorporating BERT into neural machine translation," in Proc. ICLR, 2020.
[16] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, "BERTScore: Evaluating text generation with BERT," in Proc. ICLR, 2020.
[17] J. Shin, Y. Lee, and K. Jung, "Effective sentence scoring method using BERT for speech recognition," in Proc. ACML, 2019, vol. 101, pp. 1081–1093.
[18] J. Salazar, D. Liang, T. Q. Nguyen, and K. Kirchhoff, "Masked language model scoring," in Proc. ACL, 2020, pp. 2699–2712.
[19] H. Futami, H. Inaguma, S. Ueno, M. Mimura, S. Sakai, and T. Kawahara, "Distilling the knowledge of BERT for sequence-to-sequence ASR," arXiv preprint arXiv:2008.03822, 2020.
[20] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, "AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline," in Proc. O-COCOSDA, 2017, pp. 1–5.
[21] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," in Proc. ASRU, 2011.
[22] Y. Cui, W. Che, T. Liu, B. Qin, S. Wang, and G. Hu, "Revisiting pre-trained models for Chinese natural language processing," in Proc. Findings of EMNLP, 2020.
[23] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al., "HuggingFace's Transformers: State-of-the-art natural language processing," arXiv preprint arXiv:1910.03771, 2019.
[24] N. Moritz, T. Hori, and J. Le Roux, "Triggered attention for end-to-end speech recognition," in Proc. ICASSP, 2019, pp. 5666–5670.
[25] L. Dong and B. Xu, "CIF: Continuous integrate-and-fire for end-to-end speech recognition," in Proc. ICASSP, 2020, pp. 6079–6083.
