Optimized Self-supervised Training with BEST-RQ for Speech Recognition

Baumann, Ilja; Wagner, Dominik; Riedhammer, Korbinian; Bocklet, Tobias

arXiv:2501.16131 (cs)

[Submitted on 27 Jan 2025]

Abstract: Self-supervised learning has been successfully used for various speech-related tasks, including automatic speech recognition. BERT-based Speech pre-Training with Random-projection Quantizer (BEST-RQ) has achieved state-of-the-art results in speech recognition. In this work, we further optimize the BEST-RQ approach using Kullback-Leibler divergence as an additional regularizing loss and a multi-codebook extension per cluster derived from low-level feature clustering. Preliminary experiments on the train-100 split of LibriSpeech result in a relative improvement of 11.2% on test-clean by using multiple codebooks; utilizing a combination of cross-entropy and Kullback-Leibler divergence further reduces the word error rate by 4.5%. The proposed optimizations on full LibriSpeech pre-training and fine-tuning result in relative word error rate improvements of up to 23.8% on test-clean and 30.6% on test-other using 6 codebooks. Furthermore, the proposed setup leads to faster convergence in pre-training and fine-tuning and additionally stabilizes the pre-training.
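As a rough illustration of the ingredients named in the abstract, the following minimal sketch combines a frozen random-projection quantizer, several codebooks, and a cross-entropy loss with an added Kullback-Leibler term. It is not the authors' implementation. The abstract does not say which distributions the KL term compares, how codebooks are assigned per cluster, or how the losses are weighted, so the soft target (a softmax over negative codebook distances), the shared projection, the `kl_weight` value, and all function names here are assumptions made purely for illustration.

```python
import math
import random


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_quantizer(feat_dim, code_dim, num_codes, num_codebooks, seed=0):
    """Frozen random projection and frozen random codebooks (never trained)."""
    rng = random.Random(seed)
    # Assumption: one projection shared by all codebooks, for brevity.
    projection = [[rng.gauss(0, 1) for _ in range(feat_dim)] for _ in range(code_dim)]
    codebooks = [
        [l2_normalize([rng.gauss(0, 1) for _ in range(code_dim)]) for _ in range(num_codes)]
        for _ in range(num_codebooks)
    ]
    return projection, codebooks


def squared_distances(frame, projection, codebook):
    """Squared distance from the projected, normalized frame to every code."""
    projected = l2_normalize([sum(w * x for w, x in zip(row, frame)) for row in projection])
    return [sum((p - c) ** 2 for p, c in zip(projected, code)) for code in codebook]


def ce_plus_kl(logits, dists, kl_weight=0.1):
    """Cross-entropy on the nearest code plus a weighted KL term.

    Assumption: the KL target is a softmax over negative distances.
    """
    pred = softmax(logits)
    hard = min(range(len(dists)), key=dists.__getitem__)  # nearest code index
    soft = softmax([-d for d in dists])
    ce = -math.log(pred[hard])
    kl = sum(q * math.log(q / p) for q, p in zip(soft, pred) if q > 0)
    return ce + kl_weight * kl


def multi_codebook_loss(frame, logits_per_codebook, projection, codebooks, kl_weight=0.1):
    """Average the per-codebook losses for one (masked) frame."""
    losses = [
        ce_plus_kl(logits, squared_distances(frame, projection, cb), kl_weight)
        for logits, cb in zip(logits_per_codebook, codebooks)
    ]
    return sum(losses) / len(losses)


# Toy usage: 6 codebooks of 16 codes each, one random 8-dimensional frame,
# and uniform (all-zero) logits standing in for the encoder's predictions.
projection, codebooks = make_quantizer(feat_dim=8, code_dim=4, num_codes=16, num_codebooks=6)
frame_rng = random.Random(1)
frame = [frame_rng.gauss(0, 1) for _ in range(8)]
uniform_logits = [[0.0] * 16 for _ in codebooks]
loss = multi_codebook_loss(frame, uniform_logits, projection, codebooks)
```

With uniform predictions the cross-entropy part equals log(16) for every codebook, and the KL part is non-negative, so the combined loss can only be at least that large. In a real setup the logits would come from one softmax head per codebook on top of the speech encoder, evaluated at masked positions.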

Comments:	ICASSP 2025
Subjects:	Sound (cs.SD)
Cite as:	arXiv:2501.16131 [cs.SD]
	(or arXiv:2501.16131v1 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2501.16131

Submission history

From: Ilja Baumann
[v1] Mon, 27 Jan 2025 15:20:50 UTC (1,356 KB)
