Applying Wav2vec2 For Speech Recognition On Bengali Common Voices Dataset
Abstract—Speech is inherently continuous: discrete words, phonemes and other units are not clearly segmented, and so speech recognition has been an active research problem for decades. In this work we fine-tuned wav2vec 2.0 to recognize and transcribe Bengali speech, training it on the Bengali Common Voice Speech Dataset. After training for 71 epochs on a training set consisting of 36,919 mp3 files, we achieved a training loss of 0.3172 and a WER of 0.2524 on a validation set of size 7,747. Using a 5-gram language model, the Levenshtein Distance was 2.6446 on a test set of size 7,747. The training and validation sets were then combined, shuffled and split in an 85-15 ratio; training for 7 more epochs on this combined dataset yielded an improved Levenshtein Distance of 2.60753 on the test set. Our model was the best performing one on a hidden dataset, achieving a Levenshtein Distance of 6.2341, which was 1.1049 units lower than other competing submissions.1

1 https://www.kaggle.com/competitions/dlsprint/discussion/349991

I. INTRODUCTION

Speech recognition and transcription is one of the quintessential applications of Machine Learning, with every advancement having wide-reaching effects on society as a whole. English speech recognition in particular has reached commercial viability, with products like Alexa, Google Assistant and Siri becoming household names. Unfortunately, the state of the art for Bengali speech recognition lags behind, especially considering that Bengali is one of the most widely spoken languages in the world. Several factors, such as the lack of commercial incentive, the geographical clustering of native speakers and a relatively young Bengali IT industry, have contributed to this.

"LRLs (Low Resource Languages) can be understood as less studied, resource scarce, less computerized, less privileged, less commonly taught, or low density, among other denominations." [1] Bengali can be considered a low-resource language in the sense that transcribed Bengali speech (to be used in supervised learning) is very scarce.

Recurrent Neural Networks (RNNs) have long been the go-to solution for sequence-to-sequence tasks such as speech recognition. However, they run into the problem of vanishing gradients and incur very high computational costs, owing to the lack of parallelizability when fitting long inputs such as speech: each second of audio sampled at 16 kHz produces 16,000 elements in its vector representation. Modifications to the RNN such as Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU) can mitigate, but not quite overcome, the limitations of a recurrence-based approach.

The Transformer [2] is a model architecture which eschews recurrence and instead relies entirely on an attention mechanism to draw dependencies between input and output. The Transformer allows for significantly more parallelization and has reached a new state of the art in sequence-to-sequence translation.

This brings us to wav2vec 2.0, a great match for our requirement of a self-supervised model suited to low-resource languages. wav2vec 2.0 is "a framework for self-supervised learning of representations from raw audio data" [3]. The model uses a multi-layer convolutional neural network (CNN) to encode speech audio and produce latent speech representations. Spans of these representations are masked and contextualized using a Transformer network, and the model is trained on a contrastive task to distinguish the true latents from distractors. The model learns discrete speech units as part of this pre-training on unlabeled speech. Afterward, labeled data is used to fine-tune the model with a Connectionist Temporal Classification (CTC) loss, which aids in downstream speech recognition tasks.

The CNN-based speech-to-embedding method applied by the wav2vec 2.0 model is not language dependent; rather, it is able to learn the patterns that all spoken languages have in common. As such, pretraining on multiple languages was found to be beneficial even when the target language was not included in the pretraining phase [4]. We therefore chose facebook/wav2vec2-large-xlsr-53, which was pretrained on about 56 thousand hours of multilingual speech from the MLS, CommonVoice and BABEL datasets.2

2 https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec
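To make the CTC fine-tuning objective mentioned above concrete, the following toy sketch (our illustration, not the authors' code) shows how a CTC loss is computed in pytorch over a batch of frame-level log-probabilities; all shapes and the blank id are placeholder assumptions:

    import torch
    import torch.nn as nn

    # Toy CTC example: T time steps, batch of N, vocabulary of C tokens.
    T, N, C = 50, 1, 32
    log_probs = torch.randn(T, N, C).log_softmax(dim=-1)  # model outputs
    targets = torch.randint(1, C, (N, 10))                # label ids, blank=0 excluded
    input_lengths = torch.full((N,), T, dtype=torch.long)
    target_lengths = torch.full((N,), 10, dtype=torch.long)

    # CTC marginalizes over all alignments of the 10 target tokens onto
    # the 50 output frames, with id 0 acting as the blank token.
    loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)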
Recently, an addition to the repository of transcribed Bengali speech was made in the form of the Bengali Common Voice Speech Dataset [5]. This dataset allows us to fine-tune Meta's pretrained wav2vec 2.0 for transcription specific to Bengali. In this paper we showcase the training pipeline used to fit wav2vec 2.0 to the Bengali Common Voice Speech Dataset and achieve a considerable Word Error Rate (WER).

II. METHODOLOGY
A. Model Selection

wav2vec2 currently yields state-of-the-art WERs of 1.4% / 2.6% on the LibriSpeech test/test-other sets [6]. As such, it was a natural choice for Bengali ASR. Two different approaches were considered as the starting point for training: a self-supervised pre-trained model (facebook/wav2vec2-large-xlsr-53)3 and an already fine-tuned model (arijitx/wav2vec2-xls-r-300m-bengali)4 convergent on another, similar dataset [7] (i.e. transfer learning). After preliminary testing, we found that further fine-tuning an already convergent model slightly degrades performance on the target dataset. Our results concur with similar findings where a wav2vec2 model was applied to the CALLHOME-MA dataset [8].

Therefore, we determined that the self-supervised pretrained model facebook/wav2vec2-large-xlsr-53 should serve as the basis for further fine-tuning.

3 https://huggingface.co/facebook/wav2vec2-large-xlsr-53
4 https://huggingface.co/arijitx/wav2vec2-xls-r-300m-bengali
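As a minimal sketch of this starting point (an assumed setup, not the exact configuration used in this work), the chosen checkpoint can be loaded with a freshly initialized CTC head via huggingface transformers; the vocab_size here is a placeholder for the size of the forked Bengali vocabulary:

    from transformers import Wav2Vec2ForCTC

    model = Wav2Vec2ForCTC.from_pretrained(
        "facebook/wav2vec2-large-xlsr-53",
        ctc_loss_reduction="mean",
        vocab_size=64,  # placeholder: set to the forked vocabulary's size
    )
    # Freezing the CNN feature encoder is common practice when fine-tuning
    # wav2vec2; whether the authors did so is not stated in the paper.
    model.freeze_feature_encoder()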
B. Preprocessing

The train split of the Bengali Common Voice Speech Dataset comprises 206,951 mp3 files with their corresponding Bengali transcriptions, along with some metadata such as upvotes, downvotes, gender, etc. On quick analysis, the following metrics were found.

TABLE I
DATASET ANALYSIS FINDINGS

Criterion                                      Count
upvotes > downvotes                            37405
upvotes < downvotes                             5536
upvotes = downvotes = 0                       161380
upvotes > downvotes and 1 ≤ duration ≤ 10 s    36919
Since 5536 (∼13 percent) of the 42941 voted files were unreliable, we opted to train using only the subset with upvotes > downvotes.
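For illustration, assuming the metadata is loaded into a pandas DataFrame with upvotes, downvotes and duration columns (the column and file names are our assumptions, not the dataset's documented schema), the filtering could look like:

    import pandas as pd

    meta = pd.read_csv("train_metadata.csv")  # hypothetical file name

    # Keep reliably labelled clips of usable length (cf. Table I, last row).
    mask = (
        (meta["upvotes"] > meta["downvotes"])
        & (meta["duration"] >= 1.0)
        & (meta["duration"] <= 10.0)
    )
    train_meta = meta[mask]  # the ~36,919-file subset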
The mp3 files were then resampled to 16 kHz, and the characters in the label strings were cast to integer ids using a vocabulary dictionary forked from the huggingface model arijitx/wav2vec2-xls-r-300m-bengali.
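A sketch of this step, assuming torchaudio for decoding and resampling and a vocab.json file holding the forked character-to-id dictionary (both the file name and its layout are assumptions):

    import json
    import torchaudio

    with open("vocab.json", encoding="utf-8") as f:
        vocab = json.load(f)  # maps each Bengali character to an integer id

    def load_audio_16k(path):
        # Decode the mp3 and resample to the 16 kHz rate wav2vec2 expects.
        waveform, sr = torchaudio.load(path)
        if sr != 16000:
            waveform = torchaudio.transforms.Resample(sr, 16000)(waveform)
        return waveform.squeeze(0)

    def encode_label(text):
        # Cast each character of the transcription to its integer id.
        return [vocab[ch] for ch in text if ch in vocab]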
The Transformers [2] implementation from the Transformers library5 pads each array to match the longest array in the same batch. For faster training, we therefore removed the starting and ending portions of each sound array where the value at that index was less than some fraction of the maximum value in that array; after testing different values, the cutoff threshold x < max(array)/30 was chosen. Finally, only sound files between 1 and 10 seconds long were kept for training, yielding a final train set of 36,919 files.

5 https://huggingface.co/docs/transformers/index
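The trimming rule above can be written as a short numpy routine; this is our reconstruction of the described heuristic, not the authors' published code:

    import numpy as np

    def trim_silence(array, fraction=30):
        # Drop leading and trailing samples whose magnitude falls below
        # max(|array|) / fraction; the paper settled on fraction = 30.
        threshold = np.max(np.abs(array)) / fraction
        voiced = np.nonzero(np.abs(array) >= threshold)[0]
        if voiced.size == 0:
            return array
        return array[voiced[0] : voiced[-1] + 1]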
C. Training

For training we used the pytorch [9] based implementation of the Transformer [2] model provided and maintained by huggingface.co. We fine-tuned the pretrained facebook/wav2vec2-large-xlsr-53, which was trained on unlabelled multilingual speech and is intended for further downstream training on labelled data.

First-phase training was done for a total of 71 epochs, totalling around 100 hours on a single Nvidia K80 GPU, of which the training run-time proper was approximately 80 hours. The AdamW optimizer was used, starting with a learning rate of 5 × 10^−4 and a weight decay of 2.5 × 10^−6. These values were decided on after preliminary attempts which either failed to converge with larger hyperparameters or were slow to show appreciable convergence with lower rates.
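A hedged sketch of this first-phase configuration with the huggingface Trainer (the learning rate, weight decay and epoch count are from the paper; every other value is an illustrative assumption):

    from transformers import TrainingArguments, Trainer

    args = TrainingArguments(
        output_dir="wav2vec2-bengali",      # illustrative path
        num_train_epochs=71,
        learning_rate=5e-4,                 # from the paper
        weight_decay=2.5e-6,                # from the paper
        per_device_train_batch_size=8,      # assumption: not stated
        evaluation_strategy="epoch",
    )
    # The Trainer optimizes with AdamW by default, matching the paper.
    # trainer = Trainer(model=model, args=args,
    #                   train_dataset=train_ds, eval_dataset=valid_ds)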
Around 0.25 WER (∼ epoch 55), we noticed the training and evaluation metrics plateauing considerably. On the 7747 hidden test samples, we achieved a Levenshtein Distance of 2.64446. Speech recognition tasks often hinge on exposure to a large vocabulary, and since we had initially limited ourselves to a subset of the entire dataset, we used a portion of the validation set for further training.

The main motivation behind two-phase training was to increase the model's exposure to new speech and the corresponding text labels. We merged the training and validation datasets and applied an 85-15 split. With a weight decay of 2.5 × 10^−9 and a learning rate of 5 × 10^−6, we trained for a further 7 epochs, which took another 9 hours on the Nvidia K80 GPU. The hyperparameters were significantly lowered to prevent the model weights from rapidly deviating. This produced a slight but noticeable improvement in the Levenshtein Distance, now 2.60753 (down from 2.6446).
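The merge-and-resplit for the second phase might look like the following sketch, using the huggingface datasets library (train_ds and valid_ds are assumed to be the phase-one splits):

    from datasets import concatenate_datasets

    # Merge the phase-one train and validation splits, shuffle,
    # then re-split 85-15 for the second training phase.
    combined = concatenate_datasets([train_ds, valid_ds]).shuffle(seed=42)
    split = combined.train_test_split(test_size=0.15)
    train_ds2, valid_ds2 = split["train"], split["test"]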
D. Post-processing

Three separate post-processing functions were used to generate the final predictions on the hidden test set: n-gram language model decoding, Unicode normalization, and appending a common character to the end of sentences.

Our language model of choice was forked from the huggingface model arijitx/wav2vec2-xls-r-300m-bengali and was generated from the IndicCorp language corpus [10]. The language model improved spelling accuracy, and the Levenshtein Distance decreased from 3.30648 to 2.72243 once it was applied.
Unicode normalization using bnunicodenormalizer6 and appending the devanagari danda (Unicode code point 2404, i.e. U+0964) to output sentences also resulted in slight improvements.

Since most punctuation was removed during preprocessing, the model outputs did not contain any. A suitable language model could possibly reinsert punctuation, but we did not explore this possibility in this iteration.

6 https://github.com/mnansary/bnUnicodeNormalizer
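A sketch of these two fixes follows; we assume bnunicodenormalizer's word-level API (a Normalizer callable returning a dict with a "normalized" entry), so treat the exact call shape as an assumption:

    from bnunicodenormalizer import Normalizer

    bnorm = Normalizer()

    def postprocess(sentence):
        # Normalize the Unicode form of each word, falling back to the
        # original word when the normalizer returns nothing.
        words = [bnorm(w)["normalized"] or w for w in sentence.split()]
        # Append the devanagari danda (U+0964) that ends Bengali sentences.
        return " ".join(words) + "\u0964"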
III. RESULTS

After a total of about 77 epochs of training in two phases, our model settled on a Levenshtein Mean Distance score of 2.60753 on the test set. The training error and the evaluation word-error metric gradually decreased throughout the training process. After training phase 1, the model had a training loss of 0.3172 and an evaluation WER of 0.2524 on the validation set.

TABLE II
TRAINING PHASE 2 METRICS

Epoch   Training Loss   Eval. WER
1.26    0.2552          0.1497
2.51    0.2482          0.1500
3.77    0.2497          0.1498
5.02    0.2474          0.1499
6.28    0.2493          0.1499
The final model scored a Levenshtein Mean Distance of 6.234 on a new hidden dataset.7
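Since the competition metric is a mean Levenshtein distance over sentence pairs, we include a self-contained sketch of the metric (our illustration, not the official scoring script):

    def levenshtein(a, b):
        # Classic dynamic-programming edit distance between two strings.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                curr.append(min(prev[j] + 1,                  # deletion
                                curr[j - 1] + 1,              # insertion
                                prev[j - 1] + (ca != cb)))    # substitution
            prev = curr
        return prev[-1]

    def mean_levenshtein(predictions, references):
        assert len(predictions) == len(references)
        return sum(levenshtein(p, r)
                   for p, r in zip(predictions, references)) / len(references)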
[Fig. 1. Training Loss vs Epoch. Plot not reproduced; axes: Training Loss (0 to 1) vs Epoch (10 to 80).]

[Fig. 2. WER vs Epoch. Plot not reproduced; axes: WER vs Epoch.]

From figure 2 we can see that the WER plateaued at around 60 epochs, whereas the training loss, as shown in figure 1, descended steadily.

IV. COMPARISON WITH OTHER WORKS

This work was the champion of the kaggle community competition "DL Sprint"8, under the team name "YellowKing". Among 59 participating teams, our model was the best performing one on a hidden dataset, achieving a Levenshtein Distance of 6.2349, which was 1.1049 units lower than other competing submissions. Overall, this training pipeline generalized well on completely unseen data compared to other models. Our submission also achieved the highest combined score, 92.59 out of 100, taking the public dataset, the hidden dataset and the methodology into account.

A. Base Model

Most of the contenders chose arijitx/wav2vec2-xls-r-300m-bengali as the model from which to start training. That model had already been fine-tuned for 180,000 steps on the OPENSLR-SLR53-Bengali dataset [7]. The rationale for this choice was to utilize the weights already learned by that model, i.e. to incorporate transfer learning. This choice allowed such pipelines to quickly achieve a low Levenshtein Distance and to fit the train-validation set better than our pipeline10.
While the Bengali Common Voice Speech Dataset and OpenSLR-Bengali are datasets of the same language, the Bengali Common Voice Speech Dataset has a higher median of voice segments per second, a higher dynamic range and smaller pauses between voice segments [5]. The Bengali Common Voice Speech Dataset is crowd-sourced and therefore recorded in a more uncontrolled environment. In addition, OpenSLR-Bengali was recorded by a total of 505 speakers, whereas the Bengali Common Voice Speech Dataset has about 20,000 speakers.