Applying Wav2vec2 For Speech Recognition On Bengali Common Voices Dataset
Abstract—Speech is inherently continuous: discrete words, phonemes and other units are not clearly segmented, and so speech recognition has been an active research problem for decades. In this work we fine-tuned wav2vec 2.0 to recognize and transcribe Bengali speech, training it on the Bengali Common Voice Speech Dataset. After training for 71 epochs on a training set consisting of 36,919 mp3 files, we achieved a training loss of 0.3172 and a WER of 0.2524 on a validation set of size 7,747. Using a 5-gram language model, the Levenshtein Distance was 2.6446 on a test set of size 7,747. The training and validation sets were then combined, shuffled and split in an 85-15 ratio; training for 7 more epochs on this combined dataset yielded an improved Levenshtein Distance of 2.60753 on the test set. Our model was the best performing one on a hidden dataset, achieving a Levenshtein Distance of 6.2341, which was 1.1049 units lower than other competing submissions.1

1 https://www.kaggle.com/competitions/dlsprint/discussion/349991

I. INTRODUCTION

Speech recognition and transcription is one of the quintessential applications of Machine Learning, with every advancement having wide-reaching effects on society as a whole. English speech recognition in particular has reached commercial viability, with products like Alexa, Google Assistant and Siri becoming household names. Unfortunately, the state of the art for Bengali speech recognition lags behind, especially considering that Bengali is one of the most widely spoken languages in the world. Several factors, such as the lack of commercial incentive, the geographical clustering of native speakers and a relatively young Bengali IT industry, have contributed to this.

"LRLs (Low Resource Languages) can be understood as less studied, resource scarce, less computerized, less privileged, less commonly taught, or low density, among other denominations." [1] Bengali can be considered a low-resource language in the sense that transcribed Bengali speech (to be used in supervised learning) is very scarce.

Recurrent Neural Networks (RNNs) have long been the go-to solution for sequence-to-sequence tasks such as speech recognition. However, they run into the problem of vanishing gradients and incur very high computational costs, owing to the lack of parallelizability when fitting long inputs such as speech: each second of audio sampled at 16 kHz produces 16,000 elements in its vector representation. Modifications to the RNN such as Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU) can mitigate, but not quite overcome, the limitations of a recurrence-based approach.

The Transformer [2] is a model architecture which eschews recurrence and instead relies entirely on an attention mechanism to draw dependencies between input and output. The Transformer allows for significantly more parallelization and has reached a new state of the art in sequence-to-sequence translation.

This brings us to wav2vec 2.0, a great match for our requirement of a self-supervised model suited to low-resource languages. wav2vec 2.0 is "a framework for self-supervised learning of representations from raw audio data" [3]. The model uses a multi-layer convolutional neural network (CNN) to encode speech audio and produce latent speech representations. Spans of these representations are masked and contextualized using a Transformer network, and the model is trained on a contrastive task to distinguish the true latents from distractors. The model learns discrete speech units as part of this pre-training on unlabeled speech. Afterward, labeled data is used to fine-tune the model with a Connectionist Temporal Classification (CTC) loss, which aids in downstream speech recognition tasks.

The CNN-based speech-to-embedding method applied by the wav2vec 2.0 model is not language dependent; rather, it is able to learn the patterns that all spoken languages have in common. As such, pretraining on multiple languages was found to be beneficial even when the target language was not included in the pretraining phase [4]. We therefore chose facebook/wav2vec2-large-xlsr-53, which was pretrained on about 56 thousand hours of multilingual speech from the MLS, CommonVoice and BABEL datasets.2

2 https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec
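To make the CTC fine-tuning objective mentioned above concrete, the following toy sketch (our illustration, not the authors' code) shows how a CTC loss is computed in pytorch over a batch of frame-level log-probabilities; all shapes and the blank id are placeholder assumptions:

    import torch
    import torch.nn as nn

    # Toy CTC example: T time steps, batch of N, vocabulary of C tokens.
    T, N, C = 50, 1, 32
    log_probs = torch.randn(T, N, C).log_softmax(dim=-1)  # model outputs
    targets = torch.randint(1, C, (N, 10))                # label ids, blank=0 excluded
    input_lengths = torch.full((N,), T, dtype=torch.long)
    target_lengths = torch.full((N,), 10, dtype=torch.long)

    # CTC marginalizes over all alignments of the 10 target tokens onto
    # the 50 output frames, with id 0 acting as the blank token.
    loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)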
Recently, an addition to the repository of transcribed Bengali speech was made in the form of the Bengali Common Voice Speech Dataset [5]. This dataset allows us to fine-tune Meta's pretrained wav2vec 2.0 for transcription specific to Bengali. In this paper we showcase the training pipeline used to fit wav2vec 2.0 to the Bengali Common Voice Speech Dataset and achieve a considerable Word Error Rate (WER).

II. METHODOLOGY
A. Model Selection

wav2vec2 currently yields state-of-the-art WERs of 1.4% / 2.6% on the LibriSpeech test/test-other sets [6]. As such, it was a natural choice for Bengali ASR. Two different approaches were considered as the starting point for training: a self-supervised pre-trained model (facebook/wav2vec2-large-xlsr-53)3 and an already fine-tuned model (arijitx/wav2vec2-xls-r-300m-bengali)4 convergent on another, similar dataset [7] (i.e. transfer learning). After preliminary testing, we found that further fine-tuning an already convergent model slightly degrades performance on the target dataset. Our results concur with similar findings where a wav2vec2 model was applied to the CALLHOME-MA dataset [8].

Therefore, we determined that the self-supervised pretrained model facebook/wav2vec2-large-xlsr-53 should serve as the basis for further fine-tuning.

3 https://huggingface.co/facebook/wav2vec2-large-xlsr-53
4 https://huggingface.co/arijitx/wav2vec2-xls-r-300m-bengali
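As a minimal sketch of this starting point (an assumed setup, not the exact configuration used in this work), the chosen checkpoint can be loaded with a freshly initialized CTC head via huggingface transformers; the vocab_size here is a placeholder for the size of the forked Bengali vocabulary:

    from transformers import Wav2Vec2ForCTC

    model = Wav2Vec2ForCTC.from_pretrained(
        "facebook/wav2vec2-large-xlsr-53",
        ctc_loss_reduction="mean",
        vocab_size=64,  # placeholder: set to the forked vocabulary's size
    )
    # Freezing the CNN feature encoder is common practice when fine-tuning
    # wav2vec2; whether the authors did so is not stated in the paper.
    model.freeze_feature_encoder()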
B. Preprocessing

The train split of the Bengali Common Voice Speech Dataset comprises 206,951 mp3 files with their corresponding Bengali transcriptions, along with some metadata such as upvotes, downvotes, gender, etc. On quick analysis, the following metrics were found.

TABLE I
DATASET ANALYSIS FINDINGS

Criterion                                      Count
upvotes > downvotes                            37405
upvotes < downvotes                             5536
upvotes = downvotes = 0                       161380
upvotes > downvotes and 1 ≤ duration ≤ 10 s    36919
Since 5536 (∼13 percent) of the 42941 voted files were unreliable, we opted to train using only the subset with upvotes > downvotes.
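For illustration, assuming the metadata is loaded into a pandas DataFrame with upvotes, downvotes and duration columns (the column and file names are our assumptions, not the dataset's documented schema), the filtering could look like:

    import pandas as pd

    meta = pd.read_csv("train_metadata.csv")  # hypothetical file name

    # Keep reliably labelled clips of usable length (cf. Table I, last row).
    mask = (
        (meta["upvotes"] > meta["downvotes"])
        & (meta["duration"] >= 1.0)
        & (meta["duration"] <= 10.0)
    )
    train_meta = meta[mask]  # the ~36,919-file subset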
The mp3 files were then resampled to 16 kHz, and the characters in the label strings were cast to integer ids using a vocabulary dictionary forked from the huggingface model arijitx/wav2vec2-xls-r-300m-bengali.
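A sketch of this step, assuming torchaudio for decoding and resampling and a vocab.json file holding the forked character-to-id dictionary (both the file name and its layout are assumptions):

    import json
    import torchaudio

    with open("vocab.json", encoding="utf-8") as f:
        vocab = json.load(f)  # maps each Bengali character to an integer id

    def load_audio_16k(path):
        # Decode the mp3 and resample to the 16 kHz rate wav2vec2 expects.
        waveform, sr = torchaudio.load(path)
        if sr != 16000:
            waveform = torchaudio.transforms.Resample(sr, 16000)(waveform)
        return waveform.squeeze(0)

    def encode_label(text):
        # Cast each character of the transcription to its integer id.
        return [vocab[ch] for ch in text if ch in vocab]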
The Transformers [2] implementation from the Transformers library5 pads each array to match the longest array in the same batch. For faster training, we therefore removed the starting and ending portions of each sound array where the value at that index was less than some fraction of the maximum value in that array; after testing different values, the cutoff threshold x < max(array)/30 was chosen. Finally, only sound files between 1 and 10 seconds long were kept for training, yielding a final train set of 36,919 files.

5 https://huggingface.co/docs/transformers/index
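The trimming rule above can be written as a short numpy routine; this is our reconstruction of the described heuristic, not the authors' published code:

    import numpy as np

    def trim_silence(array, fraction=30):
        # Drop leading and trailing samples whose magnitude falls below
        # max(|array|) / fraction; the paper settled on fraction = 30.
        threshold = np.max(np.abs(array)) / fraction
        voiced = np.nonzero(np.abs(array) >= threshold)[0]
        if voiced.size == 0:
            return array
        return array[voiced[0] : voiced[-1] + 1]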
C. Training

For training we used the pytorch [9] based implementation of the Transformer [2] model provided and maintained by huggingface.co. We fine-tuned the pretrained facebook/wav2vec2-large-xlsr-53, which was trained on unlabelled multilingual speech and is intended for further downstream training on labelled data.

First-phase training was done for a total of 71 epochs, totalling around 100 hours on a single Nvidia K80 GPU, of which the training run-time proper was approximately 80 hours. The AdamW optimizer was used, starting with a learning rate of 5 × 10^−4 and a weight decay of 2.5 × 10^−6. These values were decided on after preliminary attempts which either failed to converge with larger hyperparameters or were slow to show appreciable convergence with lower rates.
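A hedged sketch of this first-phase configuration with the huggingface Trainer (the learning rate, weight decay and epoch count are from the paper; every other value is an illustrative assumption):

    from transformers import TrainingArguments, Trainer

    args = TrainingArguments(
        output_dir="wav2vec2-bengali",      # illustrative path
        num_train_epochs=71,
        learning_rate=5e-4,                 # from the paper
        weight_decay=2.5e-6,                # from the paper
        per_device_train_batch_size=8,      # assumption: not stated
        evaluation_strategy="epoch",
    )
    # The Trainer optimizes with AdamW by default, matching the paper.
    # trainer = Trainer(model=model, args=args,
    #                   train_dataset=train_ds, eval_dataset=valid_ds)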
Around 0.25 WER (∼ epoch 55), we noticed the training and evaluation metrics plateauing considerably. On the 7747 hidden test samples, we achieved a Levenshtein Distance of 2.64446. Speech recognition tasks often hinge on exposure to a large vocabulary, and since we had initially limited ourselves to a subset of the entire dataset, we used a portion of the validation set for further training.

The main motivation behind two-phase training was to increase the model's exposure to new speech and the corresponding text labels. We merged the training and validation datasets and applied an 85-15 split. With a weight decay of 2.5 × 10^−9 and a learning rate of 5 × 10^−6, we trained for a further 7 epochs, which took another 9 hours on the Nvidia K80 GPU. The hyperparameters were significantly lowered to prevent the model weights from rapidly deviating. This produced a slight but noticeable improvement in the Levenshtein Distance, now 2.60753 (down from 2.6446).
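The merge-and-resplit for the second phase might look like the following sketch, using the huggingface datasets library (train_ds and valid_ds are assumed to be the phase-one splits):

    from datasets import concatenate_datasets

    # Merge the phase-one train and validation splits, shuffle,
    # then re-split 85-15 for the second training phase.
    combined = concatenate_datasets([train_ds, valid_ds]).shuffle(seed=42)
    split = combined.train_test_split(test_size=0.15)
    train_ds2, valid_ds2 = split["train"], split["test"]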
D. Post-processing

Three separate post-processing functions were used to generate the final predictions on the hidden test set: n-gram language model decoding, Unicode normalization, and appending a common character to the end of sentences.

Our language model of choice was forked from the huggingface model arijitx/wav2vec2-xls-r-300m-bengali and was generated from the IndicCorp language corpus [10]. The language model improved spelling accuracy, and the Levenshtein Distance decreased from 3.30648 to 2.72243 once it was applied.
Unicode normalization using bnunicodenormalizer6 and appending the devanagari danda (Unicode code point 2404, i.e. U+0964) to output sentences also resulted in slight improvements.

Since most punctuation was removed during preprocessing, the model outputs did not contain any. A suitable language model could possibly reinsert punctuation, but we did not explore this possibility in this iteration.

6 https://github.com/mnansary/bnUnicodeNormalizer
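A sketch of these two fixes follows; we assume bnunicodenormalizer's word-level API (a Normalizer callable returning a dict with a "normalized" entry), so treat the exact call shape as an assumption:

    from bnunicodenormalizer import Normalizer

    bnorm = Normalizer()

    def postprocess(sentence):
        # Normalize the Unicode form of each word, falling back to the
        # original word when the normalizer returns nothing.
        words = [bnorm(w)["normalized"] or w for w in sentence.split()]
        # Append the devanagari danda (U+0964) that ends Bengali sentences.
        return " ".join(words) + "\u0964"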
III. RESULTS

After a total of about 77 epochs of training in two phases, our model settled on a Levenshtein Mean Distance score of 2.60753 on the test set. The training error and the evaluation word-error metric gradually decreased throughout the training process. After training phase 1, the model had a training loss of 0.3172 and an evaluation WER of 0.2524 on the validation set.

TABLE II
TRAINING PHASE 2 METRICS

Epoch   Training Loss   Eval. WER
1.26    0.2552          0.1497
2.51    0.2482          0.1500
3.77    0.2497          0.1498
5.02    0.2474          0.1499
6.28    0.2493          0.1499
The final model scored a Levenshtein Mean Distance of 6.234 on a new hidden dataset.7
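Since the competition metric is a mean Levenshtein distance over sentence pairs, we include a self-contained sketch of the metric (our illustration, not the official scoring script):

    def levenshtein(a, b):
        # Classic dynamic-programming edit distance between two strings.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                curr.append(min(prev[j] + 1,                  # deletion
                                curr[j - 1] + 1,              # insertion
                                prev[j - 1] + (ca != cb)))    # substitution
            prev = curr
        return prev[-1]

    def mean_levenshtein(predictions, references):
        assert len(predictions) == len(references)
        return sum(levenshtein(p, r)
                   for p, r in zip(predictions, references)) / len(references)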
[Fig. 1. Training Loss vs Epoch. Plot not reproduced; axes: Training Loss (0 to 1) vs Epoch (10 to 80).]

[Fig. 2. WER vs Epoch. Plot not reproduced; axes: WER vs Epoch.]

From figure 2 we can see that the WER plateaued at around 60 epochs, whereas the training loss, as shown in figure 1, descended steadily.

IV. COMPARISON WITH OTHER WORKS

This work was the champion of the kaggle community competition "DL Sprint"8, under the team name "YellowKing". Among 59 participating teams, our model was the best performing one on a hidden dataset, achieving a Levenshtein Distance of 6.2349, which was 1.1049 units lower than other competing submissions. Overall, this training pipeline generalized well on completely unseen data compared to other models. Our submission also achieved the highest combined score, 92.59 out of 100, taking the public dataset, the hidden dataset and the methodology into account.

A. Base Model

Most of the contenders chose arijitx/wav2vec2-xls-r-300m-bengali as the model from which to start training. That model had already been fine-tuned for 180,000 steps on the OPENSLR-SLR53-Bengali dataset [7]. The rationale for this choice was to utilize the weights already learned by that model, i.e. to incorporate transfer learning. This choice allowed such pipelines to quickly achieve a low Levenshtein Distance and to fit the train-validation set better than our pipeline10.
While the Bengali Common Voice Speech Dataset and OpenSLR-Bengali are datasets of the same language, the Bengali Common Voice Speech Dataset has a higher median of voice segments per second, a higher dynamic range and smaller pauses between voice segments [5]. The Bengali Common Voice Speech Dataset is crowd-sourced and therefore recorded in a more uncontrolled environment. In addition, OpenSLR-Bengali was recorded by a total of 505 speakers, whereas the Bengali Common Voice Speech Dataset has about 20,000 speakers.