Applying wav2vec2 for Speech Recognition on Bengali Common Voices Dataset

1st H.A.Z Sameen Shahgir
Undergraduate, Computer Science and Engineering
Bangladesh University of Engineering and Technology
Dhaka, Bangladesh
1805053@ugrad.cse.buet.ac.bd

2nd Khondker Salman Sayeed
Undergraduate, Computer Science and Engineering
Bangladesh University of Engineering and Technology
Dhaka, Bangladesh
1805050@ugrad.cse.buet.ac.bd

3rd Tanjeem Azwad Zaman
Undergraduate, Computer Science and Engineering
Bangladesh University of Engineering and Technology
Dhaka, Bangladesh
1805006@ugrad.cse.buet.ac.bd

arXiv:2209.06581v1 [eess.AS] 11 Sep 2022

Abstract—Speech is inherently continuous, where discrete words, phonemes and other units are not clearly segmented, and so speech recognition has been an active research problem for decades. In this work we have fine-tuned wav2vec 2.0 to recognize and transcribe Bengali speech, training it on the Bengali Common Voice Speech Dataset. After training for 71 epochs on a training set consisting of 36919 mp3 files, we achieved a training loss of 0.3172 and a WER of 0.2524 on a validation set of size 7,747. Using a 5-gram language model, the Levenshtein Distance was 2.6446 on a test set of size 7,747. Then the training set and validation set were combined, shuffled and split in an 85-15 ratio. Training for 7 more epochs on this combined dataset yielded an improved Levenshtein Distance of 2.60753 on the test set. Our model was the best performing one on a hidden dataset, achieving a Levenshtein Distance of 6.234 (https://www.kaggle.com/competitions/dlsprint/discussion/349991), which was 1.1049 units lower than other competing submissions.

I. INTRODUCTION

Speech recognition and transcription is one of the quintessential applications of Machine Learning, with every advancement having wide-reaching effects on society as a whole. English speech recognition in particular is at the stage of commercial viability, with products like Alexa, Google Assistant and Siri becoming household names. Unfortunately, the state of the art for Bengali speech recognition is lagging behind, especially considering that it is one of the most widely spoken languages in the world. Several factors such as the lack of commercial incentive, geographical clustering of native speakers and a relatively young Bengali IT industry have caused this.

"LRLs (Low resource Languages) can be understood as less studied, resource scarce, less computerized, less privileged, less commonly taught, or low density, among other denominations." [1] Bengali can be considered a low-resource language in the sense that transcribed speech for Bengali (to be used in supervised learning) is very scarce.

Recurrent Neural Networks (RNNs) have long been the go-to solution for sequence-to-sequence machine translation tasks like Speech Recognition. However, they run into the problem of vanishing gradients and carry a very high computational cost owing to the lack of parallelizability when fitting long data such as speech: each second of audio sampled at 16 kHz produces 16,000 elements in its vector representation. Modifications to RNNs such as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) can mitigate, but not quite overcome, the limitations of a recurrence-based approach.

The Transformer [2] is a model architecture which eschews recurrence and instead relies entirely on an attention mechanism to draw dependencies between input and output. The Transformer allows for significantly more parallelization and has reached a new state of the art in sequence-to-sequence translation.

This brings us to wav2vec 2.0, a great match for our requirement of an unsupervised model suited to low-resource languages. wav2vec 2.0 is "a framework for self-supervised learning of representations from raw audio data" [3]. This model uses a multi-layer convolutional neural network (CNN) to encode speech audio and produce latent speech representations. Spans of these representations are masked and contextualized using a Transformer network. The model then distinguishes the true latents from distractors through training via contrastive tasks. The model learns discrete speech units as part of this pre-training on unlabeled speech. Afterward, labeled data is used to fine-tune the model with a Connectionist Temporal Classification (CTC) loss. This aids in downstream speech recognition tasks.

The CNN-based speech-to-embedding method applied by the wav2vec 2.0 model is not language dependent; rather, it is able to learn the patterns that all spoken languages have in common. As such, pretraining on multiple languages was found to be beneficial even when the target language was not included in the pretraining phase [4]. Accordingly, we chose
facebook/wav2vec2-large-xlsr-53, which was pretrained on about 56 thousand hours of multilingual speech from the MLS, CommonVoice and BABEL datasets (https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec).

Recently, an addition to the repository of Bengali speech transcription was made by the Bengali Common Voice Speech Dataset [5]. This dataset allows us to fine-tune Meta's pretrained wav2vec 2.0 for transcription specific to Bengali. In this paper we showcase the training pipeline used to train wav2vec 2.0 on the Bengali Common Voice Speech Dataset to achieve a considerable Word Error Rate (WER).

II. METHODOLOGY

A. Model Selection

wav2vec2 currently yields state-of-the-art WERs of 1.4% / 2.6% on the LibriSpeech test/test-other sets [6]. As such, it was a natural choice for Bengali ASR. Two different approaches were considered as the starting point for training, namely a self-supervised pre-trained model (facebook/wav2vec2-large-xlsr-53, https://huggingface.co/facebook/wav2vec2-large-xlsr-53) and an already fine-tuned model (arijitx/wav2vec2-xls-r-300m-bengali, https://huggingface.co/arijitx/wav2vec2-xls-r-300m-bengali) convergent on another similar dataset [7] (i.e. transfer learning). After preliminary testing, we found that further fine-tuning an already convergent model slightly degrades performance on the target dataset. Our results concur with similar findings where the wav2vec2 model was applied to the CALLHOME-MA dataset [8]. Therefore, we determined that the self-supervised pretrained model facebook/wav2vec2-large-xlsr-53 should serve as the basis for further fine-tuning.

B. Preprocessing

The train split of the Bengali Common Voice Speech Dataset comprises 206,951 mp3 files with their corresponding Bengali transcriptions, along with some metadata such as upvotes, downvotes, gender etc. On quick analysis, the following metrics were found.

TABLE I
DATASET ANALYSIS FINDINGS

Criterion                                      Count
upvotes > downvotes                            37405
upvotes < downvotes                            5536
upvotes = downvotes = 0                        161380
upvotes > downvotes and 1 ≤ duration ≤ 10      36919

Since 5536 (∼13 percent) out of 42941 voted samples were unreliable, we opted to train using only the subset with upvotes > downvotes.

The mp3 files were then sampled at 16 kHz, and characters in the label strings were cast to integer hashes using a vocabulary dictionary forked from the huggingface model arijitx/wav2vec2-xls-r-300m-bengali.

The Transformers [2] implementation from the Transformers library (https://huggingface.co/docs/transformers/index) pads each array to match the longest array in the same batch. For faster training, we removed the starting and ending portions of each sound array if the value at that index was less than some fraction of the maximum value in that array. After testing different values, the cutoff threshold x < max(array)/30 was chosen.

Finally, only sound files between 1 and 10 seconds were chosen for training, yielding a final train set of length 36919.
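To make the preprocessing concrete, the trimming and filtering described above can be sketched as follows. This is a minimal illustration rather than our exact pipeline: the use of librosa for mp3 decoding, the absolute-value reading of the amplitude threshold, and the metadata column names are assumptions.

# Sketch of the preprocessing described above; loaders and column names are assumed.
import librosa
import numpy as np
import pandas as pd

def trim_silence(wave: np.ndarray, fraction: float = 1 / 30) -> np.ndarray:
    # Drop leading/trailing samples whose (absolute) value is below max(array)/30.
    threshold = np.abs(wave).max() * fraction
    keep = np.where(np.abs(wave) >= threshold)[0]
    if keep.size == 0:
        return wave
    return wave[keep[0]:keep[-1] + 1]

def load_clip(mp3_path: str, target_sr: int = 16000) -> np.ndarray:
    # Decode an mp3 file, resample to 16 kHz and trim the quiet edges.
    wave, _ = librosa.load(mp3_path, sr=target_sr)
    return trim_silence(wave)

def select_training_rows(metadata: pd.DataFrame) -> pd.DataFrame:
    # Keep reliably labelled clips of 1-10 seconds, mirroring the last row of Table I.
    reliable = metadata["upvotes"] > metadata["downvotes"]
    in_range = metadata["duration"].between(1, 10)
    return metadata[reliable & in_range]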
C. Training

For training we used the pytorch [9] based implementation of the Transformer [2] model provided and maintained by huggingface.co. We fine-tuned the pretrained facebook/wav2vec2-large-xlsr-53, which was trained on unlabelled multilingual speech and is intended for further downstream training on labelled data.

First-phase training was done for a total of 71 epochs, totalling around 100 hours on a single Nvidia K80 GPU; the training run-time itself was approximately 80 hours for the 71 epochs. The AdamW optimizer was used, starting with a learning rate of 5 × 10^-4 and a weight decay of 2.5 × 10^-6. These values were decided on after some preliminary attempts which either failed to converge with larger hyper-parameters or were slow to show appreciable convergence with lower ones.
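A minimal sketch of the first-phase fine-tuning setup is given below. It assumes the standard huggingface Trainer API (whose default optimizer is AdamW) and that the forked vocabulary can be loaded as a processor from arijitx/wav2vec2-xls-r-300m-bengali; the batch size, the frozen feature encoder and the dataset/collator placeholders are assumptions, not details taken from our runs.

# Sketch of first-phase fine-tuning with the huggingface Trainer (assumptions noted below).
from transformers import Trainer, TrainingArguments, Wav2Vec2ForCTC, Wav2Vec2Processor

# Processor (feature extractor + character vocabulary), assumed loadable from the
# model whose vocabulary dictionary we forked.
processor = Wav2Vec2Processor.from_pretrained("arijitx/wav2vec2-xls-r-300m-bengali")

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
model.freeze_feature_encoder()  # common practice for wav2vec 2.0 fine-tuning; an assumption here

train_dataset = eval_dataset = None  # preprocessed huggingface Datasets go here
data_collator = None                 # a CTC padding collator goes here

training_args = TrainingArguments(
    output_dir="wav2vec2-large-xlsr-53-bengali",
    num_train_epochs=71,                 # first-phase epochs from the text
    learning_rate=5e-4,                  # first-phase learning rate from the text
    weight_decay=2.5e-6,                 # first-phase weight decay from the text
    per_device_train_batch_size=8,       # assumption; bounded by K80 memory
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    tokenizer=processor.feature_extractor,
)
# trainer.train()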
Around 0.25 WER (∼ epoch 55), we noticed the training and evaluation metrics plateauing considerably. On the 7747 hidden test samples, we achieved a Levenshtein Distance of 2.6446. Speech recognition tasks often hinge on exposure to a large vocabulary, and since we initially limited ourselves to a subset of the entire dataset, we used a portion of the validation set for further training.

The main motivation behind two-phase training was to increase the model's exposure to new speech and the corresponding text labels. We merged the training and validation datasets and applied an 85-15 split. With a weight decay of 2.5 × 10^-9 and a learning rate of 5 × 10^-6, we trained for a further 7 epochs, which took another 9 hours on the Nvidia K80 GPU. The hyper-parameters were significantly lowered to prevent the model weights from rapidly deviating. This produced a slight but noticeable improvement in the Levenshtein Distance, now 2.60753 (down from 2.6446).
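The merge-and-resplit step just described can be expressed with the huggingface datasets library; the sketch below assumes the two preprocessed splits are Dataset objects.

# Sketch of the phase-two data preparation: merge, shuffle and re-split 85-15.
from datasets import Dataset, concatenate_datasets

def make_phase2_splits(train_ds: Dataset, valid_ds: Dataset, seed: int = 42):
    merged = concatenate_datasets([train_ds, valid_ds]).shuffle(seed=seed)
    split = merged.train_test_split(test_size=0.15)  # 85-15 split as in the text
    return split["train"], split["test"]

# Phase two then reuses the same Trainer setup with the lowered hyper-parameters
# from the text: learning_rate=5e-6, weight_decay=2.5e-9, num_train_epochs=7.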
D. Post-processing

Three separate post-processing functions were used to generate the final predictions on the hidden test set, namely n-gram Language Model decoding, Unicode normalization and appending a common character to the end of sentences.

Our Language Model of choice was forked from the huggingface model arijitx/wav2vec2-xls-r-300m-bengali and was generated from the IndicCorp language corpus [10]. The language model improved spelling accuracy, and the Levenshtein Distance decreased from 3.30648 to 2.72243 once it was applied.

Unicode normalization using bnunicodenormalizer (https://github.com/mnansary/bnUnicodeNormalizer) and appending the "devanagari danda" [Unicode#2404] to output sentences also resulted in slight improvements.
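The Unicode normalization and sentence-ending steps can be sketched as follows, assuming the word-level API of bnunicodenormalizer; the n-gram language model decoding stage is not shown here.

# Sketch of the Unicode post-processing: normalize each word and append the
# devanagari danda (code point 2404, i.e. U+0964) if it is missing.
from bnunicodenormalizer import Normalizer

bnorm = Normalizer()
DANDA = "\u0964"

def postprocess(sentence: str) -> str:
    words = []
    for word in sentence.split():
        result = bnorm(word)  # returns a dict; "normalized" may be None for unfixable words
        words.append(result["normalized"] or word)
    text = " ".join(words)
    return text if text.endswith(DANDA) else text + DANDA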
Since most punctuation was removed during preprocessing, the model outputs did not contain it. A suitable language model could possibly reinsert punctuation, but we did not explore this possibility in this iteration.

III. RESULTS

After a total of about 77 epochs of training in two phases, our model settled on a Levenshtein Mean Distance Score of 2.60753 on the test set. The training error and evaluation word error metrics gradually decreased throughout the training process. After training phase 1, the model had a training loss of 0.3172 and an evaluation WER of 0.2524 on the validation set.

The final model scored a Levenshtein Mean Distance of 6.234 on a new hidden dataset (https://www.kaggle.com/competitions/dlsprint/discussion/349991).
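For reference, a mean Levenshtein distance over a set of predictions can be computed as in the sketch below; character-level distance averaged over utterances is assumed here, which matches the scale of the scores reported, although the competition's exact definition may differ.

# Sketch of a mean Levenshtein distance metric (character-level, averaged over utterances).
import Levenshtein  # pip install python-Levenshtein

def mean_levenshtein(predictions: list[str], references: list[str]) -> float:
    assert len(predictions) == len(references)
    total = sum(Levenshtein.distance(p, r) for p, r in zip(predictions, references))
    return total / len(references)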
Fig. 1. Training Loss vs Epoch (y-axis: training loss, 0 to 1; x-axis: epoch, 10 to 80)

Fig. 2. WER vs Epoch (y-axis: WER, 0.1 to 1; x-axis: epoch, 10 to 80)

From figure 2 we can see that the WER plateaued at around 60 epochs, whereas the training loss, as in figure 1, showed a steady descent.

TABLE II
TRAINING PHASE 2 METRICS

Epoch    Training Loss    Eval. WER
1.26     0.2552           0.1497
2.51     0.2482           0.1500
3.77     0.2497           0.1498
5.02     0.2474           0.1499
6.28     0.2493           0.1499

IV. COMPARISON WITH OTHER WORKS

This work was the champion of the kaggle community competition "DL Sprint" (https://www.kaggle.com/competitions/dlsprint), under the name "YellowKing". Among 59 participating teams, our model was the best performing one on a hidden dataset, achieving a Levenshtein Distance of 6.234 (https://www.kaggle.com/competitions/dlsprint/discussion/349991), which was 1.1049 units lower than other competing submissions. Overall, this training pipeline generalized well on completely unseen data compared to other models. Our submission also achieved the highest combined score of 92.59 out of 100, taking the public dataset, hidden dataset and methodology into account.

A. Base Model

Most of the contenders chose arijitx/wav2vec2-xls-r-300m-bengali as the model from which to start training. That model was already fine-tuned for 180,000 steps on the OPENSLR-SLR53-Bengali dataset [7]. The rationale for this choice was to utilize the already learned weights from that model, i.e. to incorporate transfer learning. This choice allowed such pipelines to quickly achieve a low Levenshtein Distance and fit the train-validation set better than our pipeline (https://www.kaggle.com/competitions/dlsprint/leaderboard).

While the Bengali Common Voice Speech Dataset and OpenSLR-Bengali are datasets of the same language, the Bengali Common Voice Speech Dataset has a higher median of voice segments per second, higher dynamic range and smaller pauses between voice segments [5]. The Bengali Common Voice Speech Dataset is crowd-sourced and therefore recorded in a more uncontrolled environment. In addition, OpenSLR-Bengali was recorded by a total of 505 speakers, whereas the Bengali Common Voice Speech Dataset has about 20,000 speakers, with accents originating from different parts of Bangladesh and India. This allows models trained on that dataset to perform better on unseen practical data with a varying range of speaking patterns.

We chose to start training from facebook/wav2vec2-large-xlsr-53, which was not fine-tuned for Bengali speech transcription. We trained it solely on the Bengali Common Voice Speech Dataset, which allowed it to be exposed to data from a more uncontrolled environment and varying accents from the beginning. Therefore, our model slowly improved its Levenshtein
Distance over time, as opposed to the rapid improvement observed in the contender pipelines, but it performed better on unseen practical data.

B. Filtering Data

All of the contenders trained their models on almost the entirety of the Bengali Common Voice Speech Dataset [5], about 200,000 samples. The natural reasoning for this is that increasing the number of samples allows the model to be trained on a richer corpus of words and sentences.

The dataset was divided on the basis of manual verification, using "up vote" and "down vote" metrics, which allowed us to separate the verified data from the unverified. We chose to train on a small subset of the data, comprising the samples that had upvotes > downvotes. Upon randomly sampling the data excluded from our training set, we observed that its flaws were stuttering, restarting a sentence after a mistake, and mispronunciation. Therefore we chose to train on a cleaner dataset, consisting of about 37,500 training samples and 7,747 validation samples.

C. Training Sequence

Most of the contenders trained their models on the bulk 200,000 samples. This data includes the errors mentioned earlier, so their models learned to capture the ambiguous pronunciations, sentence structures and unnatural pauses. We instead chose to train our model in two phases. The first phase included the majority of our training set, consisting of 37,500 samples. Later in the training process, we combined the training and validation sets to obtain 45,000 samples. This allowed for a gradual exposure of new vocabulary to the model.

V. DISCUSSION

A. Training Loss and WER Correlation

While training, it was observed that the training loss did not correspond strongly to the Word Error Rate. Since WER is the final evaluation metric, epochs in the later stages of training sometimes resulted in slight but temporary increases in training loss that could persist for multiple epochs before lowering once again. Unlike other neural network models, the risk of over-fitting appears to be low in the Transformer model.

B. Effect of Exposure to New Data

The model was also able to adapt well to exposure to new data, and this did not result in any sudden spike in the evaluation metrics. On the second dataset, we observed that even with training loss and WER plateauing, further training resulted in a slightly improved Levenshtein Distance. This is likely due to the exposure to new vocabulary, which ultimately improves performance on previously unseen data.

C. About the facebook/wav2vec2-large-xlsr-53 pretrained model

The wav2vec2-large-xlsr-53 model which we used for further fine-tuning was pretrained on about 56 thousand hours of multilingual speech from the MLS, CommonVoice and BABEL datasets (https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec). However, the majority (50.7 thousand hours) came from the MLS dataset [11], which comprises Dutch, English, French, German, Italian, Polish, Portuguese and Spanish, i.e. modern European languages. For this reason, the effect of pretraining on cross-lingual speech appears somewhat muted for non-European languages. The xlsr model showed a significant phonetic token error rate (PTER) improvement for Georgian [4], a language similar to the languages it was pretrained on, but marginal improvement on Bengali and even deterioration on Vietnamese compared to the previous state of the art [12] on low-resource ASR.

TABLE III
PTER ON BABEL LANGUAGES 6 DATASET

Language      Gao et al. [12]    wav2vec2-large-xlsr-53 [4]
Georgian      38.6               27.6
Bengali       38.2               36.1
Vietnamese    32.0               40.7

We believe that a model pretrained on languages similar to Bengali would offer similar improvements for Bengali ASR.

D. Effectiveness of Transformer Model

It is our belief that we did not push the limits of the wav2vec2 Transformer model, since in the ideal case (pretraining, fine-tuning and inference on English) it achieved a best-case WER of 0.018 [8] without using a language model. With more labelled data and better compute, it is highly likely that a better result on Bengali speech recognition can be achieved using the transformer architecture.

VI. CONCLUSION AND FUTURE WORK

In this work, we present an effective scheme for fine-tuning a pretrained wav2vec 2.0 model for speech transcription of a low-resource language. We applied the scheme to a subset of 45 thousand audio samples from the Bengali Common Voice Dataset to transcribe Bengali audio, achieving a final WER of 0.2524. This demonstrates the viability of the wav2vec2 model for Bengali Speech Recognition. Furthermore, this result was achieved using only 17.84% of the Bengali Common Voices Dataset, making it very likely that even better results can be achieved if the entire dataset is utilized. The model was, moreover, pretrained on mostly modern European languages, with very little exposure to Bengali or languages similar to it. We anticipate that pretraining the model on languages similar to the target language will yield better results as well. We will train networks using our scheme on future iterations of the dataset to create robust models.
R EFERENCES
[1] Alexandre Magueresse, Vincent Carles, and Evan Heetderks. Low-
resource languages: A review of past work and future challenges, 2020.
[2] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion
Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention
is all you need. Advances in neural information processing systems, 30,
2017.
[3] Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael
Auli. wav2vec 2.0: A framework for self-supervised learning of speech
representations, 2020.
[4] Qiantong Xu, Alexei Baevski, and Michael Auli. Simple and ef-
fective zero-shot cross-lingual phoneme recognition. arXiv preprint
arXiv:2109.11680, 2021.
[5] Samiul Alam, Asif Sushmit, Zaowad Abdullah, Shahrin Nakkhatra,
MD. Nazmuddoha Ansary, Syed Mobassir Hossen, Sazia Morshed
Mehnaz, Tahsin Reasat, and Ahmed Imtiaz Humayun. Bengali common
voice speech dataset for automatic speech recognition, 2022.
[6] Yu Zhang, James Qin, Daniel S Park, Wei Han, Chung-Cheng Chiu,
Ruoming Pang, Quoc V Le, and Yonghui Wu. Pushing the limits of semi-
supervised learning for automatic speech recognition. arXiv preprint
arXiv:2010.10504, 2020.
[7] Oddur Kjartansson, Supheakmungkol Sarin, Knot Pipatsrisawat, Martin
Jansche, and Linne Ha. Crowd-Sourced Speech Corpora for Javanese,
Sundanese, Sinhala, Nepali, and Bangladeshi Bengali. In Proc. The 6th
Intl. Workshop on Spoken Language Technologies for Under-Resourced
Languages (SLTU), pages 52–55, Gurugram, India, August 2018.
[8] Cheng Yi, Jianzhong Wang, Ning Cheng, Shiyu Zhou, and Bo Xu.
Applying wav2vec2.0 to speech recognition in various low-resource
languages. arXiv preprint arXiv:2012.12121, 2020.
[9] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Brad-
bury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein,
Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary
DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit
Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An
imperative style, high-performance deep learning library. In Advances
in Neural Information Processing Systems 32, pages 8024–8035. Curran
Associates, Inc., 2019.
[10] Divyanshu Kakwani, Anoop Kunchukuttan, Satish Golla, Gokul N.C.,
Avik Bhattacharyya, Mitesh M. Khapra, and Pratyush Kumar. IndicNLP-
Suite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained
Multilingual Language Models for Indian Languages. In Findings of
EMNLP, 2020.
[11] Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and
Ronan Collobert. Mls: A large-scale multilingual dataset for speech
research. ArXiv, abs/2012.03411, 2020.
[12] Heting Gao, Junrui Ni, Yang Zhang, Kaizhi Qian, Shiyu Chang, and
Mark Hasegawa-Johnson. Zero-shot cross-lingual phonetic recognition
with external language embedding. In Interspeech, pages 1304–1308,
2021.
