

arXiv:2409.09554v1 [cs.CL] 14 Sep 2024

ASR Error Correction using Large Language Models

Rao Ma∗, Mengjie Qian∗, Mark Gales, Fellow, IEEE, Kate Knill, Senior Member, IEEE
∗ Equal Contribution.

Abstract—Error correction (EC) models play a crucial role in refining Automatic Speech Recognition (ASR) transcriptions, enhancing the readability and quality of transcriptions. Without requiring access to the underlying code or model weights, EC can improve performance and provide domain adaptation for black-box ASR systems. This work investigates the use of large language models (LLMs) for error correction across diverse scenarios. 1-best ASR hypotheses are commonly used as the input to EC models. We propose building high-performance EC models using ASR N-best lists, which should provide more contextual information for the correction process. Additionally, the generation process of a standard EC model is unrestricted in the sense that any output sequence can be generated. For some scenarios, such as unseen domains, this flexibility may impact performance. To address this, we introduce a constrained decoding approach based on the N-best list or an ASR lattice. Finally, most EC models are trained for a specific ASR system, requiring retraining whenever the underlying ASR system is changed. This paper explores the ability of EC models to operate on the output of different ASR systems. This concept is further extended to zero-shot error correction using LLMs, such as ChatGPT. Experiments on three standard datasets demonstrate the efficacy of our proposed methods for both Transducer and attention-based encoder-decoder ASR systems. In addition, the proposed method can serve as an effective method for model ensembling.

Index Terms—Automatic speech recognition, error correction, large language model, supervised training, zero-shot prompting

I. INTRODUCTION

Automatic speech recognition (ASR) aims to transcribe speech audio into text and is the key component for human-computer interaction [1]. In recent years, the performance of ASR technology has dramatically advanced, evolving from traditional Hidden Markov Model (HMM)-based architectures to modern end-to-end (E2E) systems like Listen, Attend and Spell (LAS) or RNN-T [2]–[5]. Large-scale models such as Whisper [6] and Google USM [7] have demonstrated state-of-the-art performance, leveraging vast amounts of labeled and unlabeled speech data, which can be costly to obtain. While achieving impressive results on test sets, practical deployment of ASR systems encounters challenges, especially when faced with domain-specific or previously unseen speech data.

Accessing ASR services via APIs has emerged as a popular alternative to training in-house models, offering a pragmatic and economical choice. Fine-tuning these models for specific tasks, however, is often impractical due to restricted access to proprietary models. Various approaches have been proposed to enhance such restricted-access ASR systems. Two common methods are language model (LM) rescoring and error correction (EC). LM rescoring involves reranking the N-best list generated by the ASR system using an external LM, which can improve the overall performance of the ASR system [8]. Recent research, however, has shown that E2E ASR models often learn an internal language model (ILM) on the training data, which can reduce the effectiveness of traditional shallow fusion techniques [9]. Methods to address the impact of ILMs, such as those proposed by [10]–[12], generally involve code modification during the inference stage; therefore, they are out of the scope of discussion in this paper.

Error correction, applied as a post-processing step for ASR systems, offers a promising alternative [13]–[15]. This approach requires only the decoding hypotheses and reference data to train a model, eliminating the need for deep access to the ASR system. Early work in this area focused on rule-based systems which rely on statistical analysis [16]. More recent developments have introduced end-to-end models with attention modules, which can automatically identify errors within sentences and learn to generate the correct counterparts implicitly [17]–[19]. Large-scale pre-trained language models (PLMs) are trained on massive and diverse text datasets, far exceeding the scale of data used in ASR training. Approaches to transfer knowledge from PLMs for accurate ASR error detection and correction have recently been proposed. For example, Hrinchuk et al. [14] propose a Transformer-based architecture to “translate” an ASR model output into grammatically and semantically correct text. Zhao et al. [15] introduce a BART-based semantic correction system for the Mandarin ASR system. Shen et al. [20] propose a masking strategy to train the model to correct the original error tokens and predict the masked tokens based on their context information. Ma et al. [21], [22] propose an N-best T5 model based on pre-trained T5 models to perform error correction using the ASR N-best list. By fine-tuning these pre-trained large language models (LLMs), the implicit knowledge acquired from vast amounts of text data can be effectively transferred to the target error correction task.

The recent advent of generative LLMs has further advanced EC techniques. Within the field of NLP, studies such as [23], [24] have applied ChatGPT models to grammatical error correction tasks. In the context of ASR error correction, previous research has examined zero-shot performance using LLMs [25]. Everson et al. [26] utilized word confusion networks generated by the ASR system and performed EC with in-context learning, demonstrating improved performance with one-shot examples compared to 1-best hypotheses.

Chen et al. [27] generated N-best lists in the ASR decoding and built LLM EC systems using various methods including fine-tuning, LoRA tuning, and in-context learning. Hu et al. developed a multi-modal EC model incorporating audio as an additional input [28] and used a cloze-test task approach instead of a generative correction method. Additionally, Li et al. [29] explored knowledge transfer within LLMs by fine-tuning a multilingual LLM across various languages to correct 1-best hypothesis errors from different speech foundation models.

Previous research has also explored various methods to improve ASR error correction by leveraging N-best lists, which offer richer information compared to single 1-best hypotheses. For instance, Guo et al. [18] generate an 8-best list with the ASR model and rescore candidates with an LSTM language model [30]. Zhu et al. [31] concatenated N-best hypotheses for input to a bidirectional encoder, and Leng et al. [32] investigated non-autoregressive models with similar approaches. More recent work by Ma et al. [25] and Chen et al. [27] has integrated N-best lists with generative LLMs to enhance error correction performance.

Building on these advances, our paper introduces a novel approach that uses LLMs to improve ASR error correction. We compare fine-tuning versus zero-shot error correction methods and investigate how ASR N-best lists can be effectively utilized. A major contribution of our work is the innovative use of ASR N-best lists as extended inputs, which provides richer context and more accurate cues for error correction. Additionally, we introduce advanced decoding strategies, including constrained decoding, to enhance model robustness and alignment with the original utterance. Our approach also addresses data contamination concerns by developing methods to evaluate the impact of training data biases. By integrating these elements, our study sets a new benchmark in ASR error correction, advancing the state-of-the-art in the field.

This paper is structured as follows. In Section II we present the error correction method utilizing foundation language models, including a supervised approach employing the T5 model and a zero-shot approach based on generative LLMs. Section III introduces several decoding algorithms to build more robust ASR error correction systems, aiming to address the inherent problems of the standard beam search approach. In Section IV, we describe the method used to investigate data contamination. The experimental setup and results are detailed in Section V, while Section VI covers N-best analysis, an ablation study of the proposed approach, and a discussion on data contamination. Finally, Section VII presents the conclusions.

II. ASR ERROR CORRECTION MODELS

A. ASR Error Correction using N-best Lists

Building upon previous findings, we explore new methodologies for effectively incorporating supplementary ASR outputs into LLM-based EC models. One promising approach involves utilizing N-best ASR hypotheses, which are generated by the ASR system as a byproduct of the beam search process [33], [34]. The integration of N-best T5, using the N-best list as input, has demonstrated significant performance improvements in error correction compared to the original model [21]. The rationale behind this is that the N-best list contains alternative sequences that have a strong possibility of being the correct transcription, thus providing valuable cues for the EC model during predictions. We modify the input of the EC model to be the sentences in the N-best list, sorted based on ASR scores and concatenated with a special token, improving both interpretability and effectiveness.

In our study, we introduce methods for developing both fine-tuning and zero-shot error correction models using ASR N-best lists and propose several decoding methods. In traditional ASR error correction models, the decoding process allows the model to generate any sequence based on the given context, meaning that the output space is not constrained; this is denoted as unconstrained decoding (uncon). When an N-best list is used, we can constrain the model to generate from a limited space. For this, we propose N-best constrained decoding (constr) and N-best closest (closest). The details of these decoding methods will be introduced in Section III.

B. Fine-tuning based Approach

Fig. 1. The model structure of a supervised error correction model using ASR N-best lists as input. Here, we set N to 2 for illustration.

EC models aim to correct recognition errors in ASR transcriptions, serving as a post-processing step for speech recognition. A standard supervised EC model adopts an E2E structure, taking the ASR transcription as input, and is trained to generate the corrected sentence [14], [15], [18]. Adapting an EC model from a PLM yields superior performance when compared to training the EC model from scratch, as this approach leverages the prior knowledge embedded in the language models [21]. When training an EC model, direct access to the ASR system is unnecessary as only the decoded hypotheses are required. This flexibility in data accessibility makes the method highly practical, especially in situations where adapting a black-box, cloud-based speech-to-text system is essential.

The structure of our proposed EC model is illustrated in Figure 1, where ASR N-best lists are given as input to the encoder and the model is trained to generate the manual reference. Here, sentences are concatenated with a special token [SEP]. The model is trained to automatically detect errors and generate the corrected hypothesis on specific training data. For supervised models, generalization capability is crucial, as it enhances their applicability across various practical scenarios. A model that generalizes well can be used in diverse and practical contexts without the need for further updates.
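
To make the input format above concrete, the following is a minimal sketch, not the authors' released code, of how the concatenated N-best source and the reference target could be prepared for a T5-based EC model. It assumes the Hugging Face transformers implementation of T5; the [SEP] separator follows Figure 1, and the example hypotheses, model name and variable names are purely illustrative.

```python
# Minimal sketch (assumes Hugging Face transformers; not the authors' code):
# build the N-best input for a T5-style EC model and run one teacher-forced
# training step against the manual reference.
from transformers import T5ForConditionalGeneration, T5Tokenizer


def build_nbest_input(hypotheses, sep_token="[SEP]"):
    """Concatenate N-best hypotheses (already sorted by descending ASR score)."""
    return f" {sep_token} ".join(hypotheses)


tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

nbest = [
    "the cat sad on the mat",   # 1-best hypothesis (contains an error)
    "the cat sat on the mat",   # lower-ranked alternative from beam search
]
reference = "the cat sat on the mat"  # manual reference used as the training target

inputs = tokenizer(build_nbest_input(nbest), return_tensors="pt")
labels = tokenizer(reference, return_tensors="pt").input_ids

# Cross-entropy loss for generating the reference from the N-best input;
# in practice this is wrapped in a standard fine-tuning loop over the corpus.
loss = model(input_ids=inputs.input_ids,
             attention_mask=inputs.attention_mask,
             labels=labels).loss
```

In practice, the same construction is applied to every utterance in the training corpus, with the hypotheses ordered by descending ASR score as described above.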

Fig. 2. Prompt design for zero-shot ASR error correction. Here we use a 3-best list generated by the ASR system as input to ChatGPT for illustration.

In our experiments, we applied a model trained on transcriptions of corpus A generated by a specific ASR system to out-of-domain test sets and to outputs from other ASR systems. By doing this, we demonstrate the generalization ability of our proposed method, highlighting its robustness and adaptability across varying ASR outputs and domains.

C. Zero-shot Approach

Supervised training has long been popular for developing EC systems; however, it requires the availability of training data and the systems can be computationally expensive to build. To address these constraints, we present our approach to utilizing generative LLMs for zero-shot error correction, using ChatGPT as an illustrative example. This task can be challenging as ChatGPT lacks prior knowledge about the error patterns of the ASR system and has no access to the original utterance. To mitigate the difficulty, similar to the supervised approach, we provide hypotheses from the N-best list as valuable hints, helping the model to detect and correct errors effectively. The prompts we used in the experiments are shown in Figure 2. In the prompt design, hypotheses are sorted by descending ASR posterior score. Additionally, tags such as <hypothesis1> and </hypothesis1> enclose each N-best hypothesis. We experimented with other input formats, such as using numbers instead of tags or employing plain sentences without an explicitly specified order, but these variants showed degraded performance compared to our chosen prompt. In the ablation study detailed in Section VI-A, we highlight the importance of using a reasonable number of N for the model to achieve optimal performance. When only the top one ASR hypothesis is used as input, ChatGPT-based error correction may experience a degradation in performance. Additionally, initial experiments explored the few-shot setting, revealing unstable performance improvements compared to the zero-shot approach and higher computational costs. Therefore we mainly focus on the zero-shot results in this paper.

III. ASR ERROR CORRECTION DECODING

Previous sections introduced the proposed fine-tuning and zero-shot error correction methods. Specifically, we highlighted how to incorporate more information into the input space of the proposed model with ASR N-best lists and evaluate the impact on performance. Another intriguing aspect is guiding the decoding process to achieve controllable generation. In this section, we delve into this challenge, exploring methods to direct the model's output in a more controlled and predictable manner.

A. Unconstrained Decoding

For an ASR error correction model with parameters θ_EC, the N-best input is denoted as Z = {ẑ^(1), ẑ^(2), · · · , ẑ^(n)}. Our decoding objective is to find ŷ_uncon that satisfies

$\hat{y}_{\mathrm{uncon}} = \arg\max_{y} \log P(y \mid \mathcal{Z}; \theta_{\mathrm{EC}})$   (1)

where y represents potential output sequences. Given the computational cost of finding the globally optimal sequence, heuristic algorithms such as beam search are commonly used for decoding. In this context, the decoding method is referred to as unconstrained decoding (uncon), as no explicit constraints are applied to the generated sequences.

Beam search is a practical tool for approximating optimal generation results across the entire decoding space, balancing efficiency and performance. However, this method grants the model too much freedom, limiting our control over the decoding process [35]. Specifically, we aim for the proposed model to retain correct words from the original transcription and only correct inaccurate ones. In addition, the model is expected to generate homophones for detected erroneous words. While these aspects can be implicitly learned from the training data, there is no guarantee they will be applied during decoding. The model might produce synonyms with high embedding similarity to words in the reference text, which can be problematic. To address these concerns, we introduce several alternative decoding algorithms to enhance the model's ability to achieve the desired decoding objectives. These methods aim to exert more control over the output, ensuring that corrections are precise and that the generated sequences meet specific criteria for accuracy and relevance.
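
As a concrete illustration of unconstrained decoding, the sketch below runs beam search with a (nominally fine-tuned) T5 EC model through the Hugging Face generate API and keeps the returned hypotheses together with their sequence-level scores; these EC-model scores are also what the interpolation in the next subsection requires. The model path, beam settings and input string are assumptions, not the authors' configuration.

```python
# Illustrative sketch of unconstrained decoding (uncon) with a T5 EC model via
# Hugging Face beam search; "t5-base" stands in for a fine-tuned EC checkpoint.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

source = "the cat sad on the mat [SEP] the cat sat on the mat"  # concatenated N-best input
inputs = tokenizer(source, return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        **inputs,
        num_beams=10,                 # beam search over the unconstrained output space
        num_return_sequences=10,      # keep the EC model's own N-best outputs
        max_length=128,
        output_scores=True,
        return_dict_in_generate=True,
    )

hypotheses = tokenizer.batch_decode(out.sequences, skip_special_tokens=True)
ec_scores = out.sequences_scores.tolist()   # per-beam log scores from the EC model
```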

B. N-best Constrained Decoding

In unconstrained decoding, the decoding space of the EC model is unbounded. However, we want the generated correction results to closely resemble the original utterance. One method to introduce constraints on the decoding space involves leveraging the ASR N-best list, which comprises the top N hypotheses generated by the ASR system, representing the transcriptions most likely to be correct given the input audio. This approach, denoted as N-best constrained decoding (constr), forces the model to only generate sentences within the ASR N-best list.

Furthermore, each path in the N-best list is associated with a score calculated by the ASR system, indicating the likelihood of it being the correct output. These scores can be combined with the scores from the EC model, using an interpolation weight λ to gain insights from both models. To be more specific, the decoding result ŷ_constr is derived by maximizing the equation:

$\hat{y}_{\mathrm{constr}} = \arg\max_{y \in \mathcal{Z}} \left[ (1-\lambda) \cdot \log P(y \mid x; \theta_{\mathrm{ASR}}) + \lambda \cdot \log P(y \mid \mathcal{Z}; \theta_{\mathrm{EC}}) \right]$   (2)

where x and Z denote the input acoustic features of the ASR system and the obtained ASR N-best list, respectively. When λ is set to 1, the scores from the ASR system are ignored, and only the probabilities from the EC model are considered. This approach requires obtaining the probability scores from the EC model, which can be implemented in the supervised EC method. Section V-B will demonstrate its effectiveness and highlight situations where its utility becomes evident.

In zero-shot EC scenarios, the method is applied differently. Instead of generating a correction from scratch, ChatGPT is tasked with selecting the most likely correct ASR transcription from a list of candidates. As illustrated in Figure 2, all the N-best sentences are listed as input, such as <option1> ASR hypothesis </option1>, and ChatGPT is instructed to return the selected option in the format of <option?> The selected ASR transcription </option?>. While this method is similar to language model rescoring to some extent, it differs in that the selection occurs in a single step. More importantly, ChatGPT sees all the candidates before determining the best one. This contrasts with the rescoring process, where language model scores are generated individually for each of the N-best hypotheses without considering their similarity and correlation.

C. N-best Closest Decoding

The closest mapping method (closest) is based on the assumption that during unconstrained error correction, LLMs first choose the best hypothesis from the given N-best list and then modify this sentence to yield the final output. In experiments, we aim to identify this "closest match" by performing a reverse process, in which we locate the hypothesis within the ASR N-best list that has the smallest Levenshtein distance to the unconstrained generation result, as shown in Equation 3:

$\hat{y}_{\mathrm{uncon}} = \arg\max_{y} \log P(y \mid \mathcal{Z}; \theta_{\mathrm{EC}}), \qquad \hat{y}_{\mathrm{closest}} = \arg\min_{z \in \mathcal{Z}} \mathrm{LevenshteinDist}(\hat{y}_{\mathrm{uncon}}, z)$   (3)

To illustrate, consider the zero-shot uncon example in Figure 2, where the Levenshtein distance of the ChatGPT output to the 3-best ASR hypotheses is 1, 0, 1, respectively. In this scenario, the second hypothesis would be selected as the corrected result for the given utterance. Notably, this method is different from the N-best constrained method, which explicitly tasks the LLMs with selecting from the N-best list. The N-best closest approach does not constrain the output space initially but rather finds the closest match within the N-best list after the unconstrained generation.

D. Lattice Constrained Decoding

Fig. 3. (a) Example BPE-level lattice generated in ASR decoding. (b) Converted word lattice. (c) Converted lattice with LLM BPE tokens.

Algorithm 1 Lattice Constrained Decoding for N-best T5
Data: lattice node set V, lattice edge set E, beam width b, T5 encoder outputs {h_j}
  Q ← topological_sort(V)
  for v in V do
      H_v ← min_heap()
  end for
  n_0.history ← ε; n_0.score ← 0
  H_start.put(n_0)
  for v in Q do
      for ⟨v, x⟩ in E do
          for n in H_v do
              o ← Decoder({h_j}, n.history, v.word)
              n′.history ← concat(n.history, v.word)
              n′.score ← n.score + λ · log(o[x.word]) + (1 − λ) · log s_vx
              if H_x.size ≥ b ∧ H_x.score.min() < n′.score then
                  H_x.remove_min()
              end if
              if H_x.size < b then
                  H_x.put(n′)
              end if
          end for
      end for
  end for
  I ← max_heap(H_end.items)
  return I.max()

While an ASR N-best list is useful for capturing likely correct candidates, it represents only a limited set of possible decoding outputs. Instead of strictly constraining to the N-best list, we can explore a more flexible approach by expanding the decoding space to include the lattice generated through path merging. As depicted in Equation 4, we focus on the paths G within the lattice and integrate ASR scores during decoding. This lattice-constrained decoding approach enhances flexibility, allowing for a broader exploration of potential corrections beyond the N-best list, and hence has the potential to improve overall decoding accuracy.
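
The two N-best-based decoding rules of Sections III-B and III-C reduce to simple selection procedures once ASR and EC scores are available. The sketch below is an illustrative implementation of Equations (2) and (3); it assumes comparable (e.g. length-normalised) log scores and a word-level Levenshtein distance, neither of which is prescribed by the paper.

```python
# Sketch of the N-best constrained (Eq. 2) and N-best closest (Eq. 3) rules.
# asr_scores / ec_scores are assumed to be comparable log-probabilities; how
# they are obtained from the ASR system and the EC model is not shown here.
def levenshtein(a, b):
    """Edit distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def nbest_constrained(nbest, asr_scores, ec_scores, lam=0.5):
    """Eq. (2): pick the N-best entry maximising the interpolated log score."""
    combined = [(1 - lam) * a + lam * e for a, e in zip(asr_scores, ec_scores)]
    return nbest[max(range(len(nbest)), key=combined.__getitem__)]


def nbest_closest(nbest, uncon_output):
    """Eq. (3): map the unconstrained EC output back to the closest N-best entry."""
    return min(nbest, key=lambda z: levenshtein(z.split(), uncon_output.split()))


nbest = ["i red a book", "i read a book", "i read the book"]
best_constr = nbest_constrained(nbest, asr_scores=[-1.2, -1.5, -2.0],
                                ec_scores=[-3.0, -0.4, -1.1])
best_closest = nbest_closest(nbest, "i have read a book")
```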

$\hat{y}_{\mathrm{lattice}} = \arg\max_{y \in \mathcal{G}} \left[ (1-\lambda) \cdot \log P(y \mid x; \theta_{\mathrm{ASR}}) + \lambda \cdot \log P(y \mid \mathcal{Z}; \theta_{\mathrm{EC}}) \right]$   (4)

Since this method requires ASR decoding probabilities at each time step to generate the lattice, it needs access to the ASR model; hence it is only applicable in certain scenarios. We tested this approach in the supervised EC model, the N-best T5 model to be specific. Notably, the ASR model and the pre-trained language model employ different tokenizers. Consequently, we need to convert the original lattice into an equivalent form suitable for the N-best T5 to process. To achieve this, the lattice with ASR BPE tokens is first converted into a word lattice with dynamic programming, as shown in Figure 3b. Then the words along the edges are segmented into BPE tokens using the T5 tokenizer, as demonstrated in Figure 3c. The decoding algorithm is adapted from [36], and the details are presented in Algorithm 1. Since this method requires access to the beam search candidates in the decoding process, we only applied it to T5 rather than the closed-source LLM ChatGPT in the experiments. Our preliminary experiments indicate that performance remains consistent as beam sizes increase while the associated costs rise. Therefore, in our experiments, we run the lattice-constrained decoding with a beam size of 1.

IV. DATA CONTAMINATION

In the context of commercial use, most generative LLMs are developed with closed access to their training data. This lack of transparency presents challenges in determining if specific test data has been encountered during the model's pre-training phase. Utilizing test data that has already been exposed to the LLM during pre-training can skew evaluations, resulting in inaccurate assessments of model performance – an issue commonly referred to as data contamination [37], [38].

To address this concern, we adapted the method from [39] to measure the degree of the data contamination problem on the datasets we utilized in this paper. For each test utterance, we use GPT-4 to rewrite the original manual reference and generate paraphrased candidates. As presented in Figure 4, instructions are given to GPT-4 to generate options by altering the words without affecting the sentence's meaning. We then randomly select one paraphrased sentence and feed it to the LLM along with the original test sample to evaluate the degree of data contamination.

Fig. 4. Prompt for generating options for the data contamination quiz.

Figure 5 depicts the basic format of the designed 3-choice data contamination quiz. In this example, both sentences convey similar meanings while answer A) is copied from the test set and answer B) is the paraphrased one generated by an LLM. In addition, we provide option C), which denotes the non-appearance of both sentences. If the model generates A) in this scenario, it suggests potential data contamination in the model pre-training. We estimate the level of contamination with the percentage of test samples where the model selected the original sentence. This value is expected to be close to zero for a model that has never been pre-trained on the test set. Previous works have observed implicit positional bias in the LLM generation process [40]. To target this problem, we run the method twice, changing the order of the given options, and compute the overall classification performance.

Fig. 5. Demonstration of the data contamination quiz. Here, answer A) is selected from the original test set, and B) is the paraphrased one. We also switch the order of options A) and B) to mitigate the possible positional bias.
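
A hedged sketch of how such a three-choice quiz might be assembled and scored is given below. The paraphrase is assumed to come from a separate GPT-4 call (the Figure 4 prompt is not reproduced here), and the quiz wording itself is an assumption rather than the exact text of Figure 5.

```python
# Illustrative sketch of the 3-choice contamination quiz described above.
# `paraphrase` is assumed to be produced by GPT-4 from the original reference;
# the quiz wording is an assumption, not the exact prompt from Figure 5.
def build_quiz(original, paraphrase, swap=False):
    a, b = (paraphrase, original) if swap else (original, paraphrase)
    prompt = (
        "Which of the following sentences appears verbatim in the ASR test set?\n"
        f"A) {a}\n"
        f"B) {b}\n"
        "C) Neither sentence appears in the test set.\n"
        "Answer with A, B or C."
    )
    contaminated_letter = "B" if swap else "A"   # answer that points at the original
    return prompt, contaminated_letter


def contamination_rate(answers_pass1, answers_pass2):
    """Fraction of samples where the model picked the original sentence.
    The two passes use opposite A/B orders to mitigate positional bias."""
    hits = sum(a == "A" for a in answers_pass1) + sum(a == "B" for a in answers_pass2)
    return hits / (len(answers_pass1) + len(answers_pass2))
```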

V. EXPERIMENTS

A. Experimental Setup

Three standard datasets are used for training and evaluation, namely LibriSpeech [41], TED-LIUM3 [42], and the Artie bias corpus [43]. LibriSpeech is an audiobook-based English speech corpus, which covers a wide range of speakers with different accents. TED-LIUM3 is an audio dataset collected from TED talks, encompassing various topics such as science, education, and entertainment. The Artie bias corpus is a subset of the Common Voice dataset [44], which is also read speech. Detailed statistics of these datasets are listed in Table I.

TABLE I
STATISTICS OF ASR TEST SETS USED IN THE EXPERIMENTS.

Dataset     | Subset     | # Utts  | # Words | Hours
LibriSpeech | train      | 281,231 | 9.4M    | 960.9
LibriSpeech | test clean | 2,620   | 53K     | 5.4
LibriSpeech | test other | 2,939   | 52K     | 5.1
TED-LIUM3   | test       | 1,155   | 28K     | 2.6
Artie Bias  | test       | 1,712   | 15K     | 2.4

The experiments were conducted on two ASR models: a Conformer-Transducer model [45] and the OpenAI Whisper ASR [6]. The Conformer-Transducer's encoder features 12 Conformer layers with a hidden size of 512 and its predictor has one LSTM layer. Both the jointer and predictor have hidden dimensions of 512. This model is trained on the 960hr LibriSpeech dataset following the ESPnet recipe [46]. SpecAugment [47] and speed perturbation are used for training data augmentation. We use a beam size of 10 in the decoding and save the generated top hypotheses. For Whisper, we adopt the small.en model due to its comparative performance to larger models and faster processing speed. The original decoding result only returns the 1-best hypothesis and the sentence-level confidence score for each utterance. We modified the code to also save the 10-best lists during inference and to extract token-level softmax probabilities for calculating word-level ASR confidence scores. This is to simulate the scenario where the ASR service provides extra information for downstream tasks. In the evaluation, we run text normalization scripts on both reference and hypothesis before calculating WER results, following [6].

For our fine-tuning ASR error correction method, we experimented with a T5 base model, an encoder-decoder model pre-trained on various text-to-text tasks. We aimed to build each EC model using the same ASR training corpus with input transcriptions generated by the corresponding trained ASR system. The Transducer model fit the ASR training set so well that it achieved an extremely low WER, making the development of an ASR error correction model impractical. To address this, we employed data augmentation methods to generate erroneous transcriptions for training the correction model. Specifically, we applied SpecAugment to each utterance in the ASR decoding process. We utilized two frequency masks with F = 30, eight time masks with T = 40, and time warping with W = 40 on the training speech data. In the decoding results, sentences with WERs higher than 0.25 were filtered out, resulting in a training text corpus comprising 262K sentence pairs. The T5 model is fine-tuned for 3 epochs on this corpus using the AdamW [48] optimizer. The initial learning rate is set to 5e-5 and the training batch size is 32. A dropout rate of 0.1 is applied to the network to prevent overfitting. For the zero-shot experiments, we used two versions of ChatGPT models, gpt-3.5-turbo-0613 and gpt-4-0125-preview, which are abbreviated to GPT-3.5 and GPT-4 in the paper.

B. Experiments on Fine-tuning Approach

TABLE II
RESULTS (% WER) FOR A CONFORMER-TRANSDUCER SYSTEM AND WHISPER USING A T5 ERROR CORRECTION MODEL, COMPARING DIFFERENT (UN)CONSTRAINED DECODING ALGORITHMS.

System              | Transducer clean | Transducer other | Whisper clean | Whisper other
Baseline            | 2.79 | 6.90 | 3.52 | 7.37
5-best Oracle       | 1.42 | 4.59 | 2.38 | 5.24
10-best Oracle      | 1.31 | 4.25 | 2.14 | 4.73
10-best T5, uncon   | 2.54 | 6.37 | 2.90 | 6.39
10-best T5, constr  | 2.42 | 6.15 | 3.10 | 6.69
10-best T5, closest | 2.50 | 6.24 | 3.11 | 6.52
10-best T5, lattice | 2.41 | 6.10 | -    | -

Table II presents the results of two supervised error correction models, each trained for one of the two ASR models with its outputs, evaluated on the LibriSpeech test clean and test other datasets. For the Transducer model, the oracle WER improves by 33.5% with the 5-best list and 38.4% with the 10-best list on the test other set compared to the baseline. Similarly, for the Whisper model, the 5-best and 10-best lists achieve oracle WER improvements of 29.0% and 35.8% on the same set. These results suggest that the N-best lists have potential in helping the models recover the correct transcription.

We compare the WER results of the 10-best T5 model using various decoding algorithms, as detailed in Section III. The EC model trained for the Transducer model with its outputs shows improved performance when additional constraints are applied during decoding. Specifically, in the constrained decoding process, the optimal interpolation weight λ is searched within the range [0.0, 1.0] with a grid size of 0.05. With N-best constrained decoding and closest mapping, the model effectively generates homophones for the mistaken words. Lattice-constrained decoding, which provides more potential paths than an N-best list and thus has a lower oracle WER, results in slightly better performance on the test sets (13.2% and 11.6% WERR on the test clean and test other sets, respectively). Unlike the N-best T5 for the Transducer ASR, the EC model tailored for Whisper outputs adeptly detects and corrects errors for LibriSpeech corpus utterances in unconstrained decoding, yielding WERRs of 17.6% on test clean and 13.3% on test other. This indicates that the model has effectively learned to correct errors in the Whisper ASR from its training on 960hr of LibriSpeech speech. Constrained decoding yields less improvement: it limits the model to the N-best list, which might restrict performance compared to unconstrained decoding. Given the strong results from unconstrained decoding with Whisper and the complexity involved in generating lattices, lattice-constrained decoding is not considered here.

Results in Table II have shown a significant improvement in recognition accuracy using the fine-tuned N-best T5 method on the target test sets. However, the method will be more applicable if it can be applied to out-of-domain datasets or to outputs from a different ASR system. We therefore evaluate the generalization ability of the fine-tuning EC method from both perspectives.
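
The WER figures in the tables and the relative reductions (WERR) quoted in the text follow the standard definitions; the short sketch below shows the computation, with the text normalisation applied before scoring in the paper omitted for brevity.

```python
# Sketch of the scoring used throughout the tables: word error rate (WER) and
# the relative WER reduction (WERR) quoted in the text. The text normalisation
# applied to reference and hypothesis before scoring is omitted here.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                               # deletion
                          d[i][j - 1] + 1,                               # insertion
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))  # substitution
    return d[-1][-1] / max(len(ref), 1)


def werr(baseline_wer: float, corrected_wer: float) -> float:
    """Relative WER reduction."""
    return (baseline_wer - corrected_wer) / baseline_wer


print(f"{werr(6.90, 6.37):.1%}")  # approx. 7.7%, the 10-best T5 gain on test other
```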

TABLE III
ERROR CORRECTION RESULTS (% WER) FOR CONFORMER-TRANSDUCER AND WHISPER SYSTEMS ON OUT-OF-DOMAIN ASR TEST SETS AND ON OUTPUTS FROM A DIFFERENT ASR.

EC training source  | Transducer      | Whisper         | Whisper
EC applied to       | Transducer      | Whisper         | Transducer
Test sets           | TED     Artie   | TED     Artie   | clean   other
Baseline            | 13.53   23.67   | 3.89    9.03    | 2.79    6.90
10-best Oracle      | 10.21   16.69   | 2.59    5.60    | 1.31    4.25
10-best T5, uncon   | 12.00   21.24   | 4.56    9.16    | 3.86    7.72
10-best T5, constr  | 12.12   21.36   | 3.64    8.14    | 2.62    6.65

Generalization on out-of-domain datasets: To examine the generalization ability of the proposed method on out-of-domain datasets, we directly applied the EC models trained on LibriSpeech transcriptions to the ASR outputs from other test sets (TED-LIUM and the Artie bias corpus) without fine-tuning. The results are presented in Table III, where we studied the performance of two supervised EC models trained with Transducer ASR and Whisper ASR outputs, respectively. In the baseline results, Whisper achieved much lower WERs for these datasets than the Transducer model, which was only trained with LibriSpeech data. The N-best T5 model trained for Transducer outputs improves ASR performance on out-of-domain datasets like TED and Artie without any fine-tuning, in both unconstrained and constrained decoding settings. Unconstrained decoding yields slightly better performance than N-best constrained decoding, with relative word error rate reductions (WERRs) of 11.3% and 10.3% on TED and Artie, respectively. Although Whisper demonstrates very low baseline WERs, making further improvements challenging in a zero-shot transfer setting, the N-best T5 model still manages to reduce WER by 6.4% and 9.9% for TED and Artie, respectively, using N-best constrained decoding.

Generalization on other ASR systems: The practical utility of the EC model increases significantly if it can effectively correct outputs from ASR systems different from the one used for its training. In Table III, we investigated this aspect of the model's generalization ability. We applied the EC model trained with LibriSpeech transcriptions generated with the Whisper model directly to the ASR outputs from the Transducer model. Under unconstrained decoding conditions, the model struggled to achieve performance gains, highlighting the challenge of domain mismatch. However, employing constrained decoding successfully improved performance across both the LibriSpeech test clean and test other datasets, resulting in a reduction of WER by 6.1% and 3.6%, respectively. This underscores the robustness of the proposed EC method in addressing different ASR system outputs.

C. Experiments on Zero-shot Approach

TABLE IV
RESULTS (% WER) FOR A CONFORMER-TRANSDUCER SYSTEM AND WHISPER WITH ZERO-SHOT ERROR CORRECTION USING GPTS.

System           | Transducer (LB   TED    Artie) | Whisper (LB   TED   Artie)
Baseline         | 6.90   13.53   23.67           | 7.37   3.89   9.03
5-best Oracle    | 4.59   10.71   17.95           | 5.24   2.59   5.59
10-best T5       | 6.10   -       -               | 6.39   -      -
GPT-3.5, uncon   | 6.64   11.35   18.73           | 7.71   5.84   8.30
GPT-3.5, constr  | 6.52   12.61   21.88           | 7.24   4.19   8.47
GPT-3.5, closest | 6.29   11.97   20.64           | 7.15   4.56   8.21
GPT-4, uncon     | 5.79   9.09    17.35           | 6.67   4.60   7.53
GPT-4, constr    | 6.55   11.91   20.97           | 7.17   4.58   8.46
GPT-4, closest   | 5.98   11.67   20.40           | 6.76   4.25   7.86

This section introduces the zero-shot experiments and results using LLMs. Table IV presents the performance of ChatGPT (GPT-3.5 and GPT-4) for ASR error correction using either a Transducer-based ASR model or a Whisper model on the test other data set. We compare zero-shot error correction results under different decoding constraints. GPT-3.5 with unconstrained decoding shows improvement; however, further analysis reveals an increase in deletion errors on the test sets. This is due to the fact that some sentences are truncated in the ChatGPT output, displaying only the initial few words instead of complete sentences. Constrained generation methods limit the output to the N-best list, effectively reducing deletions. The closest method, which identifies the hypothesis in the N-best list closest to the generated correction, outperforms methods that require ChatGPT to directly select the best option from the N-best list. Compared to GPT-3.5, GPT-4 shows improved performance on all test sets. The results indicate that for a powerful LLM like GPT-4, unconstrained decoding, giving the model more freedom, yields better results than constrained decoding approaches.

When comparing the zero-shot method with the fine-tuned N-best T5 model, we notice that GPT-3.5 matches the performance of the 10-best T5 for the Transducer ASR, while GPT-4 exceeds it across all three test sets. However, this success does not extend to outputs from the Whisper ASR, where LLMs in the zero-shot setting struggle more with error detection and correction. Using the closest match decoding approach, GPT-3.5 outperforms the baseline for the LB test set and Artie but does not reach the 10-best T5's performance. Notably, even GPT-4 falls short of surpassing the baseline on the TED test set. The ineffectiveness of zero-shot methods on TED-LIUM3 decoded by Whisper will be discussed in Section VI-C. Specifically, GPT-4 achieves an average WERR of 25.2% on the three test sets for Transducer outputs, while only an average of 2.6% WERR is achieved on the three sets for Whisper outputs. This is likely due to the nature of the N-best lists; Section VI-C will discuss this in detail.
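
The zero-shot results above are obtained by prompting the ChatGPT models directly. A minimal sketch of such a query is shown below, assuming the openai Python client (v1 interface) and the <hypothesisN> tag format of Section II-C; the instruction wording is an assumption, not the exact prompt of Figure 2.

```python
# Illustrative zero-shot EC query (assumes the openai v1 Python client and an
# API key in the environment; the instruction wording is an assumption).
from openai import OpenAI

client = OpenAI()


def build_prompt(nbest):
    """nbest: hypotheses sorted by descending ASR posterior score."""
    tagged = "\n".join(
        f"<hypothesis{i}>{hyp}</hypothesis{i}>" for i, hyp in enumerate(nbest, start=1)
    )
    return (
        "The following are N-best hypotheses from a speech recognition system, "
        "ordered from most to least likely. Correct any recognition errors and "
        "return only the corrected transcription.\n" + tagged
    )


def zero_shot_correct(nbest, model="gpt-3.5-turbo-0613"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_prompt(nbest)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```

The closest variant reported in Table IV would then map this free-form output back onto the N-best list using the Levenshtein-based selection sketched in Section III-C.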

VI. DISCUSSION

A. Ablation on N-best Inputs

This section presents the ablation experiments with diverse N-best inputs for the fine-tuning and zero-shot EC methods.

TABLE V
RESULTS FOR ASR BASELINE AND N-BEST T5 MODELS WITH DIFFERENT MODEL INPUTS ON LIBRISPEECH TEST CLEAN AND TEST OTHER SETS.

Model      | Transducer clean | Transducer other | Whisper clean | Whisper other
Baseline   | 2.79 | 6.90 | 3.52 | 7.37
1-best T5  | 2.94 | 7.00 | 3.16 | 7.03
5-best T5  | 2.63 | 6.49 | 2.86 | 6.59
10-best T5 | 2.54 | 6.37 | 2.90 | 6.39

Fine-tuned EC approach: Table V presents results with baseline and N-best T5 models for the two ASR models on the LibriSpeech test clean and test other datasets. Training an EC model using the 1-best hypotheses from the Transducer ASR, as described in Section II, did not improve performance, highlighting the challenge of surpassing a strong baseline with limited input information. However, using 5-best and 10-best lists as model input, the T5 EC model can achieve relative performance gains of 6.0% and 7.7% on the test other set, respectively. For the Whisper small.en model, the supervised error correction method showed performance gains even with the 1-best hypothesis, with improvements of 10.2% on the test clean set and 4.6% on the test other set. Using the 5-best and 10-best lists yielded even better results, with the 10-best lists achieving reductions of 17.6% and 13.3% on the test clean and test other sets, respectively. These findings indicate that while a larger N provides more diverse input and can improve error detection and correction, increasing N does not always guarantee proportional benefits.

TABLE VI
ABLATION OF N-BEST LIST SIZES UTILIZING GPT-3.5 FOR ERROR CORRECTION ON CONFORMER-TRANSDUCER OUTPUTS ACROSS THREE TEST SETS WITH UNCONSTRAINED DECODING (% WER).

Method   | Input   | LB   | TED   | Artie
Baseline | -       | 6.90 | 13.53 | 23.67
GPT-3.5  | 1-best  | 8.25 | 11.95 | 21.19
GPT-3.5  | 3-best  | 7.01 | 11.31 | 18.84
GPT-3.5  | 5-best  | 6.64 | 11.35 | 18.73
GPT-3.5  | 10-best | 6.69 | 11.29 | 18.72

Zero-shot EC approach: Table VI demonstrates the impact of varying N for the zero-shot error correction method, with the details of the method introduced in Section II-C. Specifically, we tested the GPT-3.5 model on the Transducer outputs across three datasets, LibriSpeech test other, TED-LIUM, and Artie, using the unconstrained decoding setup where the model generates sequences based on the input context. The model faces challenges in enhancing performance when using only the top one ASR hypothesis as input. Notably, while performance improved on the TED and Artie datasets, it declined on LibriSpeech. Increasing the number of input contexts generally helps the model detect and correct errors more effectively. However, our findings indicate that increasing N does not consistently improve results; for example, performance with the 10-best context was comparable to that with the 5-best. This finding is consistent with the observation on the fine-tuning EC method.

B. Ablation on ASR Model Sizes

TABLE VII
ABLATION OF ZERO-SHOT ERROR CORRECTION RESULTS ON DIFFERENT SIZES OF WHISPER.

System    | Oracle WER | Baseline (All  Sub  Del  Ins) | +GPT-4 (All  Sub  Del  Ins) | WERR
base.en   | 6.75       | 9.50  7.1  1.0  1.4           | 7.91  5.6  1.0  1.3         | 17.6%
small.en  | 5.24       | 7.37  4.9  1.7  0.8           | 6.67  4.3  1.6  0.8         | 9.5%
medium.en | 3.82       | 5.60  4.1  0.8  0.7           | 5.18  3.6  0.9  0.7         | 7.5%
large-v2  | 3.57       | 4.93  3.5  0.8  0.7           | 4.86  3.3  0.8  0.7         | 1.4%

Previous experiments on Whisper are based on the small.en model, which yields good performance with relatively low decoding latency. In Table VII we test Whisper models of different sizes as the underlying ASR model and apply GPT-4 as the correction approach. The results indicate that increasing the ASR model size generally improves baseline ASR performance, but also makes the correction task more challenging. Specifically, error correction yields higher WERR with smaller Whisper models. Except for Whisper large-v2, our method achieves a WERR ranging from 7.5% to 17.6%, demonstrating the effectiveness of zero-shot error correction.

C. N-best Analysis

TABLE VIII
ABLATION ANALYSIS WITH DISTURBED 10-BEST LISTS FOR N-BEST T5 MODELS.

10-best    | LibriSpeech clean | LibriSpeech other | TED  | Artie | MGB
Sorted     | 2.90 | 6.39 | 3.64 | 8.14 | 12.71
Randomized | 3.31 | 6.82 | 3.74 | 8.50 | 13.01
Reversed   | 3.50 | 7.18 | 3.75 | 8.57 | 12.99

In Section VI-A, we observed that using an N-best list instead of the 1-best transcription significantly improves error correction performance. For the N-best T5 model, hypotheses are concatenated sequentially without explicitly encoding the ranking information. Can the correction model infer and utilize this ranking knowledge from the input to enhance its performance? To investigate this, we conducted experiments with N-best lists that were either randomly shuffled or sorted in reverse order of ASR scores. As shown in Table VIII, applying unconstrained decoding to the LibriSpeech test sets and N-best constrained decoding to the other datasets, we found that randomizing the N-best list led to performance degradation, while reversing the order of input hypotheses resulted in the worst performance. This indicates that the ranking information is implicitly learned and crucial for the N-best T5 model to perform well. However, when applying similar randomization or reversal strategies to the GPT-3.5 and GPT-4 models in zero-shot experiments, there was no significant difference in performance. This suggests that, unlike the N-best T5 model, the GPTs might not rely on or benefit from the ranking information in the same way.

Experiments in Section V-C reveal that our proposed methods are less effective on Whisper outputs in some cases. To examine this, we calculate the Uniq and Cross WER metrics in Table IX. When calculating statistics, we remove punctuation and special symbols from the ASR hypotheses, leaving only English characters and numbers to focus on the meaningful content. The Uniq metric represents the average number of unique hypotheses within an N-best list in the test set. For Transducer outputs, this number is close to 5, matching the size of the given N-best list. However, Whisper outputs show more repeated entries. This occurs because Whisper learns to generate sentences with inverse text normalisation (ITN) to enhance readability, i.e. adding capitalisation, including punctuation, and removing disfluencies. As a result, multiple hypotheses in an N-best list often differ only in format rather than content. This limits the diversity of the N-best list, which is crucial for our proposed methods to work well.

TABLE IX
STATISTICS OF THE ASR 5-BEST LISTS GENERATED BY THE CONFORMER-TRANSDUCER AND THE WHISPER MODEL ON LIBRISPEECH (LB), TED-LIUM3 (TED) AND ARTIE BIAS TEST SETS.

Data  | Model      | Uniq | Cross WER % (All  Sub  Del  Ins)
LB    | Transducer | 4.9  |  9.1   7.1  1.0  1.0
LB    | Whisper    | 3.0  | 12.9   7.5  2.7  2.7
TED   | Transducer | 5.0  |  7.4   5.4  1.0  1.0
TED   | Whisper    | 2.6  |  9.9   3.9  3.0  3.0
Artie | Transducer | 4.8  | 19.9  15.3  2.3  2.3
Artie | Whisper    | 2.9  | 21.1  14.5  3.3  3.3

Another notable observation is that for Whisper, even when the N-best list contains diverse hypotheses, the differences often come from the omission or insertion of irrelevant words. This is demonstrated by the Cross WER metric in Table IX. In this evaluation, we take the unique hypotheses in each N-best list and calculate the WER between each pair of hypotheses, summing the results for the entire set. This metric helps measure the difference between hypotheses within the same N-best list. The results indicate that Whisper has significantly higher deletion and insertion rates on Cross WER compared to the Transducer model, particularly on TED-LIUM3. This suggests that Whisper may struggle to consistently transcribe utterances accurately across all N-best hypotheses, resulting in sentences of varying lengths. ChatGPT tends to select the more coherent hypotheses in the zero-shot setting, leading to a higher rate of deletion errors in the output.

In Table X we demonstrate the outputs from different models on a specific example. Here, unconstrained decoding is employed and 5-best lists are used as the model input for error correction. Compared to the desired reference, we highlight the errors in the generated ASR hypotheses and the error correction outputs from the LLMs. The word ligatures, a medical term related to surgery, is wrongly transcribed in the top-1 ASR hypothesis. However, it appears in the correct format in the second-best hypothesis. With the T5 model, the errors are not recovered in the output. Meanwhile, as the GPT models are pre-trained on more data, they present a better understanding of general world knowledge, leading to fewer errors when utilizing the given contextual information.

TABLE X
CASE ANALYSIS FOR UNCONSTRAINED ERROR CORRECTION RESULTS ON CONFORMER-TRANSDUCER OUTPUTS.

Ref       | the gut and the gullet being cut across between these ligatures the stomach may be removed entire without spilling its contents
Hyp-1     | the gut and the gullet being cut across between these ligatches the stomach may be removed entire without spinning its contents
Hyp-2     | the gut and the gullet being cut across between these ligatures the stomach may be removed entire without spinning its contents
Hyp-3     | the gut and the gullet being cut across between these ligages the stomach may be removed entire without spinning its contents
Hyp-4     | the gut and the gullet being cut across between these ligatches the stomach may be removed entire without spinning as contents
Hyp-5     | the gut and the gullet being cut across between these ligatches the stomach may be removed entire without spinning his contents
5-best T5 | the gut and the gullet being cut across between these ligatches the stomach may be removed entire without spinning its contents
GPT-3.5   | The gut and the gullet being cut across between these ligatures the stomach may be removed entire without spinning its contents.
GPT-4     | The gut and the gullet being cut across between these ligatures the stomach may be removed entire without spilling its contents.

D. Multi-Model N-best Lists

In previous experiments, we demonstrated that LLMs can make use of N-best lists generated by a single ASR model to perform error correction and enhance ASR performance. In this section, we extend this approach by combining the N-best decoding hypotheses from different ASR systems with LLMs. We experiment with two scenarios: (a) combining outputs from ASR models with different architectures and (b) combining outputs from ASR models trained on different datasets. ASR systems with different architectures exhibit unique strengths and weaknesses. For instance, a LAS model is good at utilizing global context information from the given input but tends to be less robust. An RNN-T model alleviates the problem of repeating and skipping word chunks compared to a LAS model, although it shows worse performance in general. By combining outputs, a more robust ASR system that takes advantage of different components can be built. Additionally, for ASR models trained on diverse datasets, combining the outputs acts as a form of model ensembling and is thus expected to improve system performance.

We also draw on the Recognizer Output Voting Error Reduction (ROVER) technique, which employs a majority voting approach to combine the recognition results of several ASR systems into a single recognition hypothesis [49]. ROVER converts multiple ASR outputs into Word Transition Networks (WTNs), aligns and combines these WTNs using edit distance, and then uses weighted voting to determine the final hypothesis. Since ROVER is a simple, training-free technique for integrating information from different sentences, we use it as a baseline in the experiments.

TABLE XI
SYSTEM COMBINATION RESULTS ON TEST OTHER USING 5-BEST LISTS FROM TRANSDUCER (T), WHISPER SMALL.EN (E) AND SMALL (S) MODELS.

N-best Input     | T    | E    | S    | Comb1 | Comb2
Baseline         | 6.90 | 7.37 | 7.20 | 5.95  | 6.82
5-best Oracle    | 4.59 | 5.24 | 5.21 | 3.35  | 4.40
GPT-3.5, uncon   | 6.64 | 7.71 | 7.46 | 6.78  | 7.08
GPT-3.5, closest | 6.29 | 7.15 | 7.06 | 6.01  | 6.46
GPT-4, uncon     | 5.79 | 6.67 | 6.49 | 4.72  | 5.70
GPT-4, closest   | 5.98 | 6.76 | 6.51 | 5.00  | 5.80

In Table XI, we list the ASR error correction results using the N-best list generated by a single system, denoted as T1T2T3T4T5 (shortened as T) for Transducer outputs and E1E2E3E4E5 (shortened as E) for Whisper small.en outputs. Ti and Ei refer to the i-best hypothesis generated by the Transducer and the Whisper model, respectively. For the output combination experiments, we take the 5-best lists generated by both the Transducer and the Whisper model for each test utterance. There are multiple ways to combine the two N-best lists to form a new 5-best list. In our preliminary experiments, we tried different ways of combining the N-best hypotheses and different sentence orders to find the best combination. We use ROVER to determine the performance of these different inputs, achieving the best performance with an input of E1E2T1T2T3, denoted as Comb1 in Table XI, yielding a WER of 5.95%. Experiments on the GPT models show that using outputs from diverse systems rather than a single system leads to performance boosts. LLMs could utilize information from both model outputs to generate a more robust answer. The best WER performance is 4.72%, which is 32% and 36% lower than the Transducer ASR and Whisper ASR baselines, respectively.

Additionally, we combine the ASR N-best lists generated by two different versions of Whisper models – small and small.en, denoted as S for the N-best list S1S2S3S4S5 from the Whisper small model and E for the N-best list E1E2E3E4E5 from the Whisper small.en model, respectively. Both models are the same size but are trained on different training data. The small.en model was pre-trained on English-only weakly supervised data, while the small model was pre-trained on a larger, multilingual dataset. Experiments on ROVER show that the combination S1S2E1E2E3 (Comb2) works best, leading to a WER of 6.82% on the test set. The zero-shot error correction results indicate that LLMs serve as an effective method for model ensembling. With uncon decoding using GPT-4, WERRs of 23% and 21% over the ASR baselines can be seen on the test set. This showcases that LLMs can effectively improve ASR accuracy.
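
For reference, forming the combined inputs used above is a simple slicing operation; the helper below is an illustrative sketch (Comb1 corresponds to E1E2T1T2T3, Comb2 to S1S2E1E2E3), and the hypothesis strings are placeholders.

```python
# Illustrative sketch: build a combined 5-best list from two systems' N-best
# lists before passing it to the LLM. Comb1 = E1 E2 T1 T2 T3 in the notation above.
def combine_nbest(primary, secondary, n_primary=2, n_secondary=3):
    return primary[:n_primary] + secondary[:n_secondary]

whisper_en_nbest = ["e1 text", "e2 text", "e3 text", "e4 text", "e5 text"]   # placeholders
transducer_nbest = ["t1 text", "t2 text", "t3 text", "t4 text", "t5 text"]   # placeholders

comb1 = combine_nbest(whisper_en_nbest, transducer_nbest)  # 5 hypotheses fed to the LLM
```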

E. Potential Data Contamination

Previous results highlight the effectiveness of zero-shot error correction using LLMs, attributed to their robust language understanding capabilities. However, a key concern with this approach is the possibility that text from ASR test sets may have been included in the LLM pre-training data, potentially leading to biased evaluations in error correction tasks. In this section, we explore the potential issue of data leakage from ASR test sets during LLM pre-training.

Following the practice in [39], we randomly select 100 utterances from each test set for evaluation. In addition to the three public test sets, we also applied our proposed method to two internal ASR datasets, MGB-3 [50] and Linguaskill [51], that are less likely to be contaminated. These internal datasets are less susceptible to contamination due to their unique characteristics and restricted access. MGB-3 is a dataset specifically designed for the multi-genre broadcast challenge, containing broadcast media recordings that are carefully curated and controlled. Linguaskill, on the other hand, consists of educational and skill-based assessments that are not publicly available, ensuring a low risk of data contamination. Results on these datasets are intended to be contrasted with the results on the public datasets.

TABLE XII
RESULTS OF THE DATA CONTAMINATION QUIZ. A LOWER PERCENTAGE IMPLIES LESS CONTAMINATION IN THE LLM PRE-TRAINING.

Datasets                        | GPT-3.5 | GPT-4
LibriSpeech (test other)        | 0.07    | 0.33
TED-LIUM 3 (test)               | 0.05    | 0.22
Artie Bias (test)               | 0.14    | 0.15
MGB-3 (test)                    | 0.04    | 0.02
Linguaskill (ling test general) | 0.03    | 0.09

For the designed data contamination quiz, the LLM is asked to identify which sentence is from the ASR test set, choosing between an actual ASR test reference and a rewritten sentence. A lower percentage of correct selections by the LLM indicates less severe data contamination. The results in Table XII suggest that data contamination is not a significant issue for GPT-3.5. Although GPT-4 shows some level of data contamination on LibriSpeech and TED-LIUM3, this does not invalidate our method. The potential slight contamination observed with GPT-4 highlights an area for future improvement and caution but does not undermine the robustness and effectiveness of our proposed approach. The method shows consistent performance improvements across different datasets and ASR systems, indicating its generalizability and robustness despite the slight contamination in some instances.

VII. CONCLUSION

In this work, we proposed and thoroughly investigated two advanced error correction methods to enhance ASR accuracy: supervised EC with pre-trained language models and zero-shot EC with LLMs. Various decoding strategies were explored for both supervised and zero-shot EC methods, including unconstrained decoding, N-best constrained decoding, and closest mapping decoding, each offering unique advantages in different scenarios. Our experiments demonstrated the robustness and generalization capabilities of the proposed methods across multiple dimensions. First, we tested models trained on outputs from a specific ASR system on outputs from different ASR systems, showcasing their adaptability. Second, we evaluated the models on datasets from diverse domains and applied an EC model trained on one dataset to other datasets, proving the versatility of the method. We also extended the approach to incorporate N-best lists from multiple ASR systems, demonstrating that the model can serve as an effective form of model ensembling. Another crucial aspect of our study was addressing the potential data contamination issue, particularly in the use of LLMs for ASR error correction. Our systematic evaluation, using a combination of public and proprietary datasets to ensure comprehensive coverage, revealed minimal contamination in GPT-3.5 but identified some level of contamination in GPT-4, especially on certain datasets. These findings underline the importance of vigilance in data handling and provide valuable guidelines for future research.

ACKNOWLEDGMENTS

This paper reports on research supported by EPSRC Project EP/V006223/1 (Multimodal Video Search by Examples) and Cambridge University Press & Assessment, a department of The Chancellor, Masters, and Scholars of the University of Cambridge.
REFERENCES

[1] C. M. Rebman Jr, M. W. Aiken, and C. G. Cegielski, "Speech recognition in the human–computer interface," Information & Management, vol. 40, no. 6, pp. 509–519, 2003.
[2] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, "Listen, Attend and Spell: A neural network for large vocabulary conversational speech recognition," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 4960–4964.
[3] A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in Proc. International Conference on Machine Learning. PMLR, 2014, pp. 1764–1772.
[4] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., "Deep Speech 2: End-to-end speech recognition in English and Mandarin," in International Conference on Machine Learning. PMLR, 2016, pp. 173–182.
[5] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[6] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," in International Conference on Machine Learning. PMLR, 2023, pp. 28492–28518.
[7] Y. Zhang, W. Han, J. Qin, Y. Wang, A. Bapna, Z. Chen, N. Chen, B. Li, V. Axelrod, G. Wang et al., "Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages," arXiv preprint arXiv:2303.01037, 2023.
[8] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, "Recurrent neural network based language model," in Proc. Interspeech 2010, 2010, pp. 1045–1048.
[9] Z. Meng, S. Parthasarathy, E. Sun, Y. Gaur, N. Kanda, L. Lu, X. Chen, R. Zhao, J. Li, and Y. Gong, "Internal language model estimation for domain-adaptive end-to-end speech recognition," in 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021, pp. 243–250.
[10] E. McDermott, H. Sak, and E. Variani, "A density ratio approach to language model fusion in end-to-end automatic speech recognition," in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 434–441.
[11] Y. Liu, R. Ma, H. Xu, Y. He, Z. Ma, and W. Zhang, "Internal Language Model Estimation Through Explicit Context Vector Learning for Attention-based Encoder-decoder ASR," in Proc. Interspeech 2022, 2022, pp. 1666–1670.
[12] M. Zeineldeen, A. Glushko, W. Michel, A. Zeyer, R. Schlüter, and H. Ney, "Investigating Methods to Improve Language Model Integration for Attention-Based Encoder-Decoder ASR Models," in Proc. Interspeech 2021, 2021, pp. 2856–2860.
[13] R. Errattahi, A. El Hannani, and H. Ouahmane, "Automatic speech recognition errors detection and correction: A review," Procedia Computer Science, vol. 128, pp. 32–37, 2018.
[14] O. Hrinchuk, M. Popova, and B. Ginsburg, "Correction of automatic speech recognition with transformer sequence-to-sequence model," in Proc. 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7074–7078.
[15] Y. Zhao, X. Yang, J. Wang, Y. Gao, C. Yan, and Y. Zhou, "BART Based Semantic Correction for Mandarin Automatic Speech Recognition System," in Proc. Interspeech 2021, 2021, pp. 2017–2021.
[16] H. Cucu, A. Buzo, L. Besacier, and C. Burileanu, "Statistical error correction methods for domain-specific ASR systems," in Statistical Language and Speech Processing: First International Conference, SLSP 2013, Tarragona, Spain, July 29–31, 2013, Proceedings 1. Springer, 2013, pp. 83–92.
[17] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, "FastSpeech: Fast, robust and controllable text to speech," Advances in Neural Information Processing Systems, vol. 32, 2019.
[18] J. Guo, T. N. Sainath, and R. J. Weiss, "A spelling correction model for end-to-end speech recognition," in Proc. 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5651–5655.
[19] A. Mani, S. Palaskar, N. V. Meripo, S. Konam, and F. Metze, "ASR error correction and domain adaptation using machine translation," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6344–6348.
[20] K. Shen, Y. Leng, X. Tan, S. Tang, Y. Zhang, W. Liu, and E. Lin, "Mask the Correct Tokens: An Embarrassingly Simple Approach for Error Correction," in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2022, pp. 10367–10380.
[21] R. Ma, M. J. F. Gales, K. M. Knill, and M. Qian, "N-best T5: Robust ASR Error Correction using Multiple Input Hypotheses and Constrained Decoding Space," in Proc. INTERSPEECH 2023, 2023, pp. 3267–3271.
[22] R. Ma, M. Qian, M. J. F. Gales, and K. M. Knill, "Adapting an Unadaptable ASR System," in Proc. INTERSPEECH 2023, 2023, pp. 989–993.
[23] H. Wu, W. Wang, Y. Wan, W. Jiao, and M. Lyu, "ChatGPT or Grammarly? Evaluating ChatGPT on grammatical error correction benchmark," arXiv preprint arXiv:2303.13648, 2023.
[24] T. Fang, S. Yang, K. Lan, D. F. Wong, J. Hu, L. S. Chao, and Y. Zhang, "Is ChatGPT a highly fluent grammatical error correction system? A comprehensive evaluation," arXiv preprint arXiv:2304.01746, 2023.
[25] R. Ma, M. Qian, P. Manakul, M. Gales, and K. Knill, "Can generative large language models perform ASR error correction?" arXiv preprint arXiv:2307.04172, 2023.
[26] K. Everson, Y. Gu, H. Yang, P. G. Shivakumar, G.-T. Lin, J. Kolehmainen, I. Bulyko, A. Gandhe, S. Ghosh, W. Hamza et al., "Towards ASR robust spoken language understanding through in-context learning with word confusion networks," in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 12856–12860.
[27] C. Chen, Y. Hu, C.-H. H. Yang, S. M. Siniscalchi, P.-Y. Chen, and E.-S. Chng, "HyPoradise: An open baseline for generative speech recognition with large language models," in Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023, pp. 31665–31688.
[28] Y. Hu, C. Chen, C. Qin, Q. Zhu, E. S. Chng, and R. Li, "Listen again and choose the right answer: A new paradigm for automatic speech recognition with large language models," arXiv preprint arXiv:2405.10025, 2024.
[29] S. Li, C. Chen, C. Y. Kwok, C. Chu, E. S. Chng, and H. Kawai, "Investigating ASR error correction with large language model and multilingual 1-best hypotheses," in Proc. Interspeech 2024, 2024, pp. 1315–1319.
[30] M. Sundermeyer, R. Schlüter, and H. Ney, "LSTM neural networks for language modeling," in Proc. Interspeech 2012, 2012, pp. 194–197.
[31] L. Zhu, W. Liu, L. Liu, and E. Lin, "Improving ASR error correction using N-best hypotheses," in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2021, pp. 83–89.
[32] Y. Leng, X. Tan, R. Wang, L. Zhu, J. Xu, W. Liu, L. Liu, X.-Y. Li, T. Qin, E. Lin et al., "FastCorrect 2: Fast Error Correction on Multiple Candidates for Automatic Speech Recognition," in Findings of the Association for Computational Linguistics: EMNLP 2021, 2021, pp. 4328–4337.
[33] X. Liu, M. Li, L. Chen, P. Wanigasekara, W. Ruan, H. Khan, W. Hamza, and C. Su, "ASR N-best fusion nets," in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 7618–7622.
[34] K. Ganesan, P. Bamdev, B. Jaivarsan, A. Venugopal, and A. Tushar, "N-Best ASR Transformer: Enhancing SLU Performance using Multiple ASR Hypotheses," in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 2021, pp. 93–98.
[35] S. Dathathri, A. Madotto, J. Lan, J. Hung, E. Frank, P. Molino, J. Yosinski, and R. Liu, "Plug and Play Language Models: A Simple Approach to Controlled Text Generation," in Proc. International Conference on Learning Representations, 2020.
[36] M. Auli, M. Galley, C. Quirk, and G. Zweig, "Joint language and translation modeling with recurrent neural networks," in Proc. EMNLP, 2013.
[37] O. Sainz, J. Campos, I. García-Ferrero, J. Etxaniz, O. L. de Lacalle, and E. Agirre, "NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark," in Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 10776–10787.
[38] C. Li and J. Flanigan, "Task contamination: Language models may not be few-shot anymore," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 16, 2024, pp. 18471–18480.
[39] S. Golchin and M. Surdeanu, "Data contamination quiz: A tool to detect and estimate contamination in large language models," arXiv preprint arXiv:2311.06233, 2023.
[40] A. Liusie, P. Manakul, and M. Gales, "Mitigating word bias in zero-shot prompt-based classifiers," in Findings of the Association for Computational Linguistics: IJCNLP-AACL 2023 (Findings), 2023, pp. 327–335.
[41] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: an ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.
[42] F. Hernandez, V. Nguyen, S. Ghannay, N. Tomashenko, and Y. Estève, "TED-LIUM 3: Twice as much data and corpus repartition for experiments on speaker adaptation," in Speech and Computer: 20th International Conference, SPECOM 2018, Leipzig, Germany, September 18–22, 2018, Proceedings 20. Springer, 2018, pp. 198–208.
[43] J. Meyer, L. Rauchenstein, J. D. Eisenberg, and N. Howell, "Artie bias corpus: An open dataset for detecting demographic bias in speech applications," in Proceedings of the Twelfth Language Resources and Evaluation Conference, 2020, pp. 6462–6468.
[44] R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, "Common Voice: A Massively-Multilingual Speech Corpus," in Proceedings of the Twelfth Language Resources and Evaluation Conference, 2020, pp. 4218–4222.
[45] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu et al., "Conformer: Convolution-augmented Transformer for Speech Recognition," in Proc. Interspeech 2020, 2020, pp. 5036–5040.
[46] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen et al., "ESPnet: End-to-End Speech Processing Toolkit," in Proc. Interspeech 2018, 2018, pp. 2207–2211.
[47] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition," in Proc. Interspeech 2019, 2019, pp. 2613–2617.
[48] I. Loshchilov and F. Hutter, "Decoupled Weight Decay Regularization," in International Conference on Learning Representations, 2019.
[49] J. G. Fiscus, "A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER)," in IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings. IEEE, 1997, pp. 347–354.
[50] P. Bell, M. J. Gales, T. Hain, J. Kilgour, P. Lanchantin, X. Liu, A. McParland, S. Renals, O. Saz, M. Wester et al., "The MGB challenge: Evaluating multi-genre broadcast media recognition," in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2015, pp. 687–693.
[51] J. Xu, M. Brenchley, E. Jones, A. Pinnington, T. Benjamin, K. Knill, G. Seal-Coon, M. Robinson, and A. Geranpayeh, "Linguaskill: Building a validity argument for the speaking test," Linguaskill Research Reports, UCLES, Tech. Rep., 2020.