Abstract—Error correction (EC) models play a crucial role in refining Automatic Speech Recognition (ASR) transcriptions, enhancing the readability and quality of transcriptions. Without requiring access to the underlying code or model weights, EC …

… to enhance such restricted access ASR systems. Two common methods are language model (LM) rescoring and error correction (EC). LM rescoring involves reranking the N-best list generated by the ASR system using an external LM, which …
… in-context learning, demonstrating improved performance with one-shot examples compared to 1-best hypotheses. Chen et al. [27] generated N-best lists during ASR decoding and built LLM EC systems using various methods, including fine-tuning, LoRA tuning, and in-context learning. Hu et al. developed a multi-modal EC model incorporating audio as an additional input [28], using a cloze-test task approach instead of a generative correction method. Additionally, Li et al. [29] explored knowledge transfer within LLMs by fine-tuning a multilingual LLM across various languages to correct 1-best hypothesis errors from different speech foundation models.
Previous research has also explored various methods to improve ASR error correction by leveraging N-best lists, which offer richer information than a single 1-best hypothesis. For instance, Guo et al. [18] generated an 8-best list with the ASR model and rescored the candidates with an LSTM language model [30]. Zhu et al. [31] concatenated N-best hypotheses as input to a bidirectional encoder, and Leng et al. [32] investigated non-autoregressive models with similar approaches. More recent work by Ma et al. [25] and Chen et al. [27] has integrated N-best lists with generative LLMs to enhance error correction performance.
Building on these advances, our paper introduces a novel approach that uses LLMs to improve ASR error correction. We compare fine-tuning with zero-shot error correction methods and investigate how ASR N-best lists can be effectively utilized. A major contribution of our work is the innovative use of ASR N-best lists as extended inputs, which provides richer context and more accurate cues for error correction. Additionally, we introduce advanced decoding strategies, including constrained decoding, to enhance model robustness and alignment with the original utterance. Our approach also addresses data contamination concerns by developing methods to evaluate the impact of training data biases. By integrating these elements, our study sets a new benchmark in ASR error correction, advancing the state of the art in the field.
This paper is structured as follows. In Section II we present the error correction method utilizing foundation language models, including a supervised approach employing the T5 model and a zero-shot approach based on generative LLMs. Section III introduces several decoding algorithms to build more robust ASR error correction systems, aiming to address the inherent problems of the standard beam search approach. In Section IV, we describe the method used to investigate data contamination. The experimental setup and results are detailed in Section V, while Section VI covers N-best analysis, an ablation study of the proposed approach, and a discussion on data contamination. Finally, Section VII presents the conclusions.
II. ASR ERROR CORRECTION MODELS

A. ASR Error Correction using N-best Lists

Building upon previous findings, we explore new methodologies for effectively incorporating supplementary ASR outputs into LLM-based EC models. One promising approach involves utilizing N-best ASR hypotheses, which are generated by the ASR system as a byproduct of the beam search process [33], [34]. The integration of N-best T5, which uses the N-best list as input, has demonstrated significant performance improvements in error correction compared to the original model [21]. The rationale behind this is that the N-best list contains alternative sequences that have a strong possibility of being the correct transcription, thus providing valuable cues for the EC model during prediction. We modify the input of the EC model to be the sentences in the N-best list, sorted by ASR score and concatenated with a special token, improving both interpretability and effectiveness.

In our study, we introduce methods for developing both fine-tuned and zero-shot error correction models using ASR N-best lists, and we propose several decoding methods. In traditional ASR error correction models, the decoding process allows the model to generate any sequence based on the given context; the output space is not constrained, which is denoted unconstrained decoding (uncon). When an N-best list is used, we can instead constrain the model to generate from a limited space. For this, we propose N-best constrained decoding (constr) and N-best closest (closest). The details of these decoding methods are introduced in Section III.

B. Fine-tuning based Approach

Fig. 1. The model structure of a supervised error correction model using ASR N-best lists as input. Here, we set N to 2 for illustration.

EC models aim to correct recognition errors in ASR transcriptions, serving as a post-processing step for speech recognition. A standard supervised EC model adopts an E2E structure, taking the ASR transcription as input, and is trained to generate the corrected sentence [14], [15], [18]. Adapting an EC model from a PLM yields superior performance compared to training the EC model from scratch, as this approach leverages the prior knowledge embedded in the language model [21]. When training an EC model, direct access to the ASR system is unnecessary, as only the decoded hypotheses are required. This flexibility in data accessibility makes the method highly practical, especially in situations where adapting a black-box, cloud-based speech-to-text system is essential.

The structure of our proposed EC model is illustrated in Figure 1, where ASR N-best lists are given as input to the encoder and the model is trained to generate the manual reference. Sentences in the list are concatenated with a special token, [SEP]. The model is trained to automatically detect errors and generate the corrected hypothesis on specific training data. For supervised models, generalization capability is crucial, as it enhances their applicability across various practical scenarios: a model that generalizes well can be used in diverse contexts without the need for further updates. In our experiments, we applied a model trained on transcriptions of corpus A generated by a specific ASR system to out-of-domain test sets and to outputs from other ASR systems. By doing this, we demonstrate the generalization ability of our proposed method, highlighting its robustness and adaptability across varying ASR outputs and domains.
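To make the fine-tuning input format concrete, the sketch below flattens an N-best list into a single encoder input for a T5-style model. This is a minimal illustration of the scheme described above, not the paper's released code; the base checkpoint, the way [SEP] is registered, and the toy hypotheses are all our assumptions.

    from transformers import T5ForConditionalGeneration, T5TokenizerFast

    # Register [SEP] as an extra special token and resize the embeddings,
    # since the base T5 vocabulary does not contain it.
    tokenizer = T5TokenizerFast.from_pretrained("t5-base")
    tokenizer.add_special_tokens({"additional_special_tokens": ["[SEP]"]})
    model = T5ForConditionalGeneration.from_pretrained("t5-base")
    model.resize_token_embeddings(len(tokenizer))

    def build_ec_input(nbest, scores):
        # Sort hypotheses by descending ASR score, then join with [SEP].
        ranked = [hyp for _, hyp in sorted(zip(scores, nbest), reverse=True)]
        return " [SEP] ".join(ranked)

    nbest = ["i red a book last night", "i read a book last night"]  # N = 2
    scores = [-1.2, -1.5]  # e.g. ASR log-probabilities
    batch = tokenizer(build_ec_input(nbest, scores), return_tensors="pt")
    labels = tokenizer("i read a book last night", return_tensors="pt").input_ids
    loss = model(**batch, labels=labels).loss  # standard seq2seq training loss

Training then proceeds as ordinary sequence-to-sequence fine-tuning, with the manual reference as the target.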
Fig. 2. Prompt design for zero-shot ASR error correction. Here we use a 3-best list generated by the ASR system as input to ChatGPT for illustration.
C. Zero-shot Approach

Supervised training has long been popular for developing EC systems; however, it requires the availability of training data, and the systems can be computationally expensive to build. To address these constraints, we present our approach to utilizing generative LLMs for zero-shot error correction, using ChatGPT as an illustrative example. This task can be challenging, as ChatGPT lacks prior knowledge about the error patterns of the ASR system and has no access to the original utterance. To mitigate the difficulty, similar to the supervised approach, we provide hypotheses from the N-best list as valuable hints, helping the model to detect and correct errors effectively. The prompts we used in the experiments are shown in Figure 2. In the prompt design, hypotheses are sorted by descending ASR posterior score. Additionally, tags such as <hypothesis1> and </hypothesis1> enclose each N-best hypothesis. We experimented with other input formats, such as using numbers instead of tags or employing plain sentences without an explicitly specified order, but these variants showed degraded performance compared to our chosen prompt. In the ablation study detailed in Section VI-A, we highlight the importance of using a reasonable value of N for the model to achieve optimal performance: when only the top ASR hypothesis is used as input, ChatGPT-based error correction may degrade performance. Additionally, initial experiments explored the few-shot setting, revealing unstable performance improvements over the zero-shot approach and higher computational costs. We therefore focus mainly on zero-shot results in this paper.
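To illustrate the tagged input format, the helper below assembles a zero-shot prompt from a sorted N-best list. The instruction wording is a paraphrase for illustration only; the exact prompt used in the experiments is the one shown in Figure 2.

    def build_zero_shot_prompt(nbest):
        # Hypotheses are assumed to be sorted by descending ASR posterior
        # score; each one is wrapped in <hypothesisN> ... </hypothesisN>.
        tagged = [f"<hypothesis{i}>{hyp}</hypothesis{i}>"
                  for i, hyp in enumerate(nbest, start=1)]
        instruction = ("The following are N-best hypotheses from a speech "
                       "recognizer, ordered from most to least likely. "
                       "Output the corrected transcription.")
        return instruction + "\n" + "\n".join(tagged)

    print(build_zero_shot_prompt([
        "i red a book last night",
        "i read a book last night",
        "i read a book last knight",
    ]))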
III. ASR ERROR CORRECTION DECODING
Previous sections introduced the proposed fine-tuning and zero-shot error correction methods. Specifically, we highlighted how to incorporate more information into the input space of the proposed model with ASR N-best lists and evaluate the impact on performance. Another intriguing aspect is guiding the decoding process to achieve controllable generation. In this section, we delve into this challenge, exploring methods to direct the model's output in a more controlled and predictable manner.

A. Unconstrained Decoding

For an ASR error correction model with parameters θ_EC, the N-best input is denoted as Z = {ẑ^(1), ẑ^(2), …, ẑ^(n)}. Our decoding objective is to find ŷ_uncon that satisfies

    ŷ_uncon = arg max_y log P(y | Z; θ_EC)    (1)

where y denotes a potential output sequence. Given the computational cost of finding the globally optimal sequence, heuristic algorithms such as beam search are commonly used for decoding. In this context, the decoding method is referred to as unconstrained decoding (uncon), as no explicit constraints are applied to the generated sequences.

Beam search is a practical tool for approximating optimal generation results across the entire decoding space, balancing efficiency and performance. However, this method grants the model too much freedom, limiting our control over the decoding process [35]. Specifically, we aim for the proposed model to retain correct words from the original transcription and to correct only the inaccurate ones. In addition, the model is expected to generate homophones for detected erroneous words. While these aspects can be implicitly learned from the training data, there is no guarantee they will be applied during decoding. The model might produce synonyms with high embedding similarity to words in the reference text, which can be problematic. To address these concerns, we introduce several alternative decoding algorithms to enhance the model's ability to achieve the desired decoding objectives. These methods aim to exert more control over the output, ensuring that corrections are precise and that the generated sequences meet specific criteria for accuracy and relevance.
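Operationally, Equation (1) is approximated with standard beam search over the EC model. A minimal sketch, reusing the hypothetical N-best T5 setup from Section II (the helper name and generation settings are illustrative):

    import torch

    def decode_unconstrained(model, tokenizer, nbest, num_beams=5):
        # Approximate arg max_y log P(y | Z; theta_EC) with beam search;
        # Z is the [SEP]-joined N-best list and the output space is open.
        inputs = tokenizer(" [SEP] ".join(nbest), return_tensors="pt")
        with torch.no_grad():
            ids = model.generate(**inputs, num_beams=num_beams,
                                 max_new_tokens=128)
        return tokenizer.decode(ids[0], skip_special_tokens=True)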
B. N-best Constrained Decoding

In unconstrained decoding, the decoding space of the EC model is unbounded. However, we want the generated correction results to closely resemble the original utterance. One …
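Based on the definitions in Section II-A, one plausible realisation of the two constrained variants is sketched below: constr is read here as restricting the output space to the N-best hypotheses themselves, scored under the EC model, and closest as selecting the hypothesis nearest to the unconstrained output under a supplied edit distance. Both readings are our assumptions, not the paper's exact algorithms.

    import torch

    def decode_nbest_constrained(model, tokenizer, nbest):
        # "constr" (assumed reading): restrict the output space to the
        # N-best hypotheses and return the one the EC model scores highest.
        inputs = tokenizer(" [SEP] ".join(nbest), return_tensors="pt")
        best_hyp, best_score = None, float("-inf")
        for hyp in dict.fromkeys(nbest):  # unique hypotheses, order kept
            labels = tokenizer(hyp, return_tensors="pt").input_ids
            with torch.no_grad():
                nll = model(**inputs, labels=labels).loss  # mean token NLL
            score = -nll.item() * labels.size(1)  # total log-probability
            if score > best_score:
                best_hyp, best_score = hyp, score
        return best_hyp

    def decode_nbest_closest(nbest, uncon_output, distance):
        # "closest" (assumed reading): pick the hypothesis nearest to the
        # unconstrained decoding result under the given edit distance.
        return min(dict.fromkeys(nbest), key=lambda h: distance(h, uncon_output))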
TABLE I
STATISTICS OF ASR TEST SETS USED IN THE EXPERIMENTS.

Fig. 4. Prompt for generating options for the data contamination quiz.

Zero-shot GPT-3.5 error correction, WER (%) on Conformer-Transducer outputs with different numbers of input hypotheses (cf. Section VI-A):

              LS test_other  TED-LIUM  Artie
    Baseline  6.90           13.53     23.67
    1-best    8.25           11.95     21.19
    3-best    7.01           11.31     18.84
    5-best    6.64           11.35     18.73
    10-best   6.69           11.29     18.72

TABLE VIII
WER (%) OF THE N-BEST T5 MODEL WITH SORTED, RANDOMIZED, AND REVERSED N-BEST INPUTS ACROSS FIVE TEST SETS.

    Sorted      2.90  6.39  3.64  8.14  12.71
    Randomized  3.31  6.82  3.74  8.50  13.01
    Reversed    3.50  7.18  3.75  8.57  12.99
… the details of the method introduced in Section II-C. Specifically, we tested the GPT-3.5 model on the Transducer outputs across three datasets: LibriSpeech test_other, TED-LIUM, and Artie, using the unconstrained decoding setup, where the model generates sequences based on the input context. The model faces challenges in enhancing performance when only the top ASR hypothesis is used as input: while performance improved on the TED-LIUM and Artie datasets, it declined on LibriSpeech. Increasing the number of input hypotheses generally helps the model detect and correct errors more effectively. However, our findings indicate that increasing N does not consistently improve results; for example, performance with the 10-best context was comparable to that with the 5-best. This finding is consistent with the observation on the fine-tuned EC method.
B. Ablation on ASR Model Sizes

Previous experiments on Whisper are based on the small.en model, which yields good performance with relatively low decoding latency. In Table VII we test Whisper models of different sizes as the underlying ASR model and apply GPT-4 as the correction approach. The results indicate that increasing the ASR model size generally improves baseline ASR performance, but it also makes the correction task more challenging. Specifically, error correction yields a higher WERR with smaller Whisper models. Except for Whisper large-v2, our method achieves a WERR ranging from 7.5% to 17.6%, demonstrating the effectiveness of zero-shot error correction.
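For reference, WERR here is the relative word error rate reduction of the corrected output over the ASR baseline (our formulation of the standard definition):

    def werr(baseline_wer, corrected_wer):
        # Relative WER reduction in percent, e.g. werr(10.0, 8.5) -> 15.0.
        return 100.0 * (baseline_wer - corrected_wer) / baseline_wer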
C. N-best Analysis

In Section VI-A, we observed that using an N-best list instead of the 1-best transcription significantly improves error correction performance. For the N-best T5 model, hypotheses are concatenated sequentially without explicitly encoding the ranking information. This raises an important question: can the correction model infer and utilize this ranking knowledge from the input to enhance its performance? To investigate this, we conducted experiments with N-best lists that were either randomly shuffled or sorted in reverse order of ASR scores. As shown in Table VIII, applying unconstrained decoding to the LibriSpeech test sets and N-best constrained decoding to the other datasets, we found that randomizing the N-best list led to performance degradation, while reversing the order of the input hypotheses resulted in the worst performance. This indicates that the ranking information is implicitly learned and is crucial for the N-best T5 model to perform well. However, when applying similar randomization or reversal strategies to the GPT-3.5 and GPT-4 models in zero-shot experiments, there was no significant difference in performance. This suggests that, unlike the N-best T5 model, the GPTs might not rely on or benefit from the ranking information in the same way.

Experiments in Section V-C reveal that our proposed methods are less effective on Whisper outputs in some cases. To examine this, we calculate the Uniq and Cross WER metrics in Table IX. When calculating these statistics, we remove punctuation and special symbols from the ASR hypotheses, leaving only English characters and numbers, to focus on the meaningful content. The Uniq metric represents the average number of unique hypotheses within an N-best list in the test set. For Transducer outputs, this number is close to 5, matching the size of the given N-best list. However, Whisper outputs show more repeated entries. This occurs because Whisper learns to generate sentences with inverse text normalisation (ITN) to enhance readability, i.e. adding capitalisation, including punctuation, and removing disfluencies. As a result, multiple hypotheses in an N-best list often differ only in format rather than content. This limits the diversity of the N-best list, which is crucial for our proposed methods to work well.

Another notable observation is that for Whisper, even when the N-best list contains diverse hypotheses, the differences often come from the omission or insertion of irrelevant words. This is demonstrated by the Cross WER metric in Table IX. In this evaluation, we retain all unique hypotheses in an N-best …
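A possible implementation of the two statistics is sketched below. The normalisation follows the description above; treating Cross WER as the average pairwise WER between the unique hypotheses in a list is our reading of the (truncated) definition, not necessarily the paper's exact formulation.

    import re
    from itertools import combinations

    def normalise(text):
        # Keep lowercase letters, digits, and spaces only, then tokenise.
        return re.sub(r"[^a-z0-9 ]", "", text.lower()).split()

    def edit_distance(a, b):
        # Word-level Levenshtein distance (single-row dynamic programming).
        d = list(range(len(b) + 1))
        for i, wa in enumerate(a, 1):
            prev, d[0] = d[0], i
            for j, wb in enumerate(b, 1):
                prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                       prev + (wa != wb))
        return d[len(b)]

    def uniq(nbest_lists):
        # Uniq: average number of unique normalised hypotheses per list.
        return sum(len({tuple(normalise(h)) for h in lst})
                   for lst in nbest_lists) / len(nbest_lists)

    def cross_wer(nbest_list):
        # Assumed reading: mean pairwise WER (%) between unique hypotheses.
        hyps = list({tuple(normalise(h)) for h in nbest_list})
        pairs = list(combinations(hyps, 2))
        if not pairs:
            return 0.0
        return 100.0 * sum(edit_distance(a, b) / max(len(a), 1)
                           for a, b in pairs) / len(pairs)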
TABLE X
CASE ANALYSIS FOR UNCONSTRAINED ERROR CORRECTION RESULTS ON CONFORMER-TRANSDUCER OUTPUTS.
Ref the gut and the gullet being cut across between these ligatures the stomach may be removed entire without spilling its contents
Hyp-1 the gut and the gullet being cut across between these ligatches the stomach may be removed entire without spinning its contents
Hyp-2 the gut and the gullet being cut across between these ligatures the stomach may be removed entire without spinning its contents
Hyp-3 the gut and the gullet being cut across between these ligages the stomach may be removed entire without spinning its contents
Hyp-4 the gut and the gullet being cut across between these ligatches the stomach may be removed entire without spinning as contents
Hyp-5 the gut and the gullet being cut across between these ligatches the stomach may be removed entire without spinning his contents
5-best T5 the gut and the gullet being cut across between these ligatches the stomach may be removed entire without spinning its contents
GPT-3.5 The gut and the gullet being cut across between these ligatures the stomach may be removed entire without spinning its contents.
GPT-4 The gut and the gullet being cut across between these ligatures the stomach may be removed entire without spilling its contents.