
Special Issue, RISEM–2025 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165 https://doi.org/10.38124/ijisrt/25jun167

Optimizing Speech Models with Freezing


Revanth Reddy Pasula¹
¹Department of Computer Science, Wichita State University, Wichita, United States

Publication Date: 2025/07/14

Abstract: Adapting speech models to new languages requires optimizing the trade-off between accuracy and computational cost. In this work, we investigate the optimization of Mozilla’s DeepSpeech model when adapted from English to German and Swiss German through selective freezing of layers. Employing a transfer learning strategy, we analyze the performance impact of freezing different numbers of network layers during fine-tuning. The experiments reveal that freezing the initial layers achieves significant performance improvements: training time decreases and accuracy increases. This layer-freezing technique hence offers an extensible way to improve automatic speech recognition for under-resourced languages.

Keywords: Automatic Speech Recognition (ASR); Deep Speech; German; Layer Freezing; Low-Resource Languages; Swiss
German; Transfer Learning.

How to Cite: Revanth Reddy Pasula (2025). Optimizing Speech Models with Freezing. International Journal of Innovative Science and Research Technology, (RISEM–2025), 69-73. https://doi.org/10.38124/ijisrt/25jun167

I. INTRODUCTION

ASR systems have improved mostly for the English language, leading to very well-optimized models for speech tasks (e.g., text-to-speech systems [15]). In contrast, languages with few data sources, such as standard German and Swiss German, are under-resourced because they lack large training sets and domain-specific models. The contribution of the current work is to bridge this gap: we adapt Mozilla's DeepSpeech implementation¹ of Baidu's DeepSpeech architecture [1] to both German and Swiss German. We use transfer learning from a proven pre-trained English model, and we thoroughly investigate the impact of freezing various network layers during fine-tuning.

Previous attempts at deploying DeepSpeech for German [2] and Swiss German [3] have delivered early evidence; nonetheless, differences in data composition and training methods call for further inquiry. In this research, the emphasis is on isolating the effects of selective layer freezing and examining the contribution it makes towards improving recognizer performance while minimizing training time. The research is framed against the backdrop of recent developments in transfer learning methods and the growing interest in computationally efficient ASR model adaptation for limited-resource environments.

II. TRANSFER LEARNING AND LAYER FREEZING

Transfer learning is now an essential method in deep learning whereby models can reuse knowledge acquired from one task or dataset for another. By pre-training a network on an enormous, varied dataset and then initializing with these pre-trained parameters, fine-tuning on a more modest target dataset can be rendered more efficient in terms of both time and performance [4]. This exploits the hierarchical representations obtained during training of the network: after exposure to large quantities of data, the layers within the network have extracted helpful features that transfer well to similar tasks without needing to start from scratch.

It is common practice in computer vision to freeze parts of pre-trained models during fine-tuning for a novel task in order to keep previously acquired features [5]. The practice has been adapted for end-to-end ASR models such as DeepSpeech [4][6]. The idea is that the lower layers normally extract basic acoustic patterns (comparable to low-level visual features), whereas higher layers represent more abstract, language-dependent information. Recent analyses of end-to-end ASR models show that, while the feature hierarchy of speech may not always be as apparent as in vision, the higher layers do represent higher-order phonetic and linguistic features [7]. Practically, then, the earlier layers capturing common acoustic features can have their parameters frozen while the higher layers are allowed to adapt to the new language.
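To make the idea concrete, the following minimal sketch (our own illustration in TensorFlow/Keras terms, not Mozilla's training code) freezes the first k weight-bearing layers of a pre-trained model before fine-tuning; the file name is a placeholder.

    import tensorflow as tf

    def freeze_lower_layers(model: tf.keras.Model, num_frozen: int) -> tf.keras.Model:
        # Layers below `num_frozen` keep their pre-trained weights fixed;
        # the remaining layers stay trainable and adapt to the target language.
        weight_layers = [layer for layer in model.layers if layer.weights]
        for i, layer in enumerate(weight_layers):
            layer.trainable = i >= num_frozen
        return model

    # Illustrative usage: load a pre-trained English model and freeze its first two layers.
    english_model = tf.keras.models.load_model("english_deepspeech.h5")  # placeholder path
    model = freeze_lower_layers(english_model, num_frozen=2)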


III. METHODOLOGY

An experimental framework was formulated to investigate the effects of layer freezing in the context of ASR transfer learning. The methodology comprises four principal elements: the DeepSpeech architecture, the training procedure with layer-freezing settings, the hyperparameters and computing environment, and the dataset preparation and preprocessing pipeline.

A. DeepSpeech Architecture
DeepSpeech version 0.7 from Mozilla was utilized as the base ASR architecture. This implementation, which deviates only minimally from the model originally proposed by Hannun et al. [1], is documented in greater detail in the official documentation². The processing pipeline starts with MFCC [8] extraction from the raw audio input, followed by six layers forming a deep recurrent neural network. The network structure is shown in Table 1. Briefly, layers 1–3 are ReLU-activated fully connected layers, layer 4 is an LSTM recurrent layer [11], layer 5 is an additional fully connected, ReLU-activated layer, and layer 6 is the output layer generating character probabilities through a softmax. The model is trained with the Connectionist Temporal Classification (CTC) loss [9] and optimized with the Adam optimizer [10].

Table 1 summarizes the DeepSpeech architecture and data flow, from input audio through feature extraction to output character probabilities (adapted from the official documentation).

Table 1 Structure of the DeepSpeech Architecture.


Layer Description Activation/Notes
1–3 Fully connected ReLU
4 Recurrent (LSTM) Long Short-Term Memory [11]
5 Fully connected ReLU
6 Output layer Softmax (character probabilities)
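For readers who prefer code to tables, the following is a simplified Keras sketch of the six-layer stack in Table 1, mapping MFCC frames to per-frame character probabilities. Layer widths and the character-set size are illustrative assumptions, not Mozilla's exact configuration, and the CTC loss wiring is only indicated in a comment.

    import tensorflow as tf

    N_MFCC = 26      # MFCC features per frame (assumed value)
    N_HIDDEN = 2048  # hidden width (assumed value)
    N_CHARS = 29     # target characters plus CTC blank (assumed value)

    def build_deepspeech_like_model() -> tf.keras.Model:
        inputs = tf.keras.Input(shape=(None, N_MFCC), name="mfcc_frames")
        x = inputs
        # Layers 1-3: fully connected with ReLU, applied per time step.
        for i in range(1, 4):
            x = tf.keras.layers.Dense(N_HIDDEN, activation="relu", name=f"dense_{i}")(x)
            x = tf.keras.layers.Dropout(0.4)(x)
        # Layer 4: LSTM recurrent layer.
        x = tf.keras.layers.LSTM(N_HIDDEN, return_sequences=True, name="lstm_4")(x)
        # Layer 5: fully connected with ReLU.
        x = tf.keras.layers.Dense(N_HIDDEN, activation="relu", name="dense_5")(x)
        # Layer 6: softmax over characters; trained with CTC loss (e.g. tf.nn.ctc_loss).
        outputs = tf.keras.layers.Dense(N_CHARS, activation="softmax", name="char_probs")(x)
        return tf.keras.Model(inputs, outputs)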

B. Training Procedure and Layer Freezing
We performed a series of training experiments to measure the effect of frozen layers in transfer learning. For weight initialization, we utilized an English pre-trained DeepSpeech model offered by Mozilla. Six training setups were run for both German and Swiss German, which are compiled in Table 2. In addition, we trained one model entirely from scratch with random initialization as our baseline comparison point (labeled the “Reference” condition, with no transfer learning).

During fine-tuning, the selected layers were frozen by marking them as non-trainable, while the remaining layers were trained on the target data. The output layer of all transfer learning models was re-initialized, as the character set (output labels) differs between English and the target language. This re-initialization provided compatibility with German and Swiss German transcripts.

Table 2 Training Conditions for Evaluating the Impact of Layer Freezing.


Condition Description
Reference Trained from scratch (random initialization, no pre-trained model).
0 Frozen Layers Initialized from the English model; all layers are fine-tuned on target data.
1 Frozen Layer Freeze the first layer; fine-tune layers 2–6 on target data.
2 Frozen Layers Freeze the first two layers; fine-tune layers 3–6 on target data.
3 Frozen Layers Freeze the first three layers; fine-tune layers 4–6 on target data.
4 Frozen Layers Freeze the first four layers; fine-tune only the last two layers on target data.
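A minimal sketch of how one transfer-learning condition from Table 2 could be assembled, assuming the Keras-style model sketched earlier: the English output layer is replaced by a freshly initialized one sized for the target character set, and the first k weight-bearing layers are frozen. The helper and its names are illustrative, not the exact training script used.

    import tensorflow as tf

    def make_condition(pretrained: tf.keras.Model, num_frozen: int,
                       n_target_chars: int) -> tf.keras.Model:
        # Re-initialize the output layer: the German/Swiss German character set
        # (umlauts etc.) differs from the English one, so a new softmax layer is attached.
        penultimate = pretrained.layers[-2].output
        new_output = tf.keras.layers.Dense(n_target_chars, activation="softmax",
                                           name="char_probs_target")(penultimate)
        model = tf.keras.Model(pretrained.input, new_output)
        # Freeze the first `num_frozen` weight-bearing layers; the rest are fine-tuned.
        weight_layers = [layer for layer in model.layers if layer.weights]
        for i, layer in enumerate(weight_layers):
            layer.trainable = i >= num_frozen
        return model

    # The conditions of Table 2 then correspond to num_frozen = 0, 1, 2, 3, 4;
    # the "Reference" condition instead trains a randomly initialized model from scratch.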

C. Hyperparameters and Computational Environment
The same set of hyperparameters was used in all experiments (Table 3), with no further tuning beyond these preselected values. Training was carried out on a Linux server with 96 Intel Xeon Platinum 8160 CPU cores. All models (both German and Swiss German) were trained for the same number of epochs under the same conditions to allow an unbiased comparison of the different freezing strategies.

Table 3 Hyperparameter Settings Used for Training.


Hyperparameter Value Notes
Batch Size 24 –
Learning Rate 0.0005 –
Dropout Rate 0.4 –
Training Epochs 30 Per model (each experiment)
Optimizer Adam –
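For concreteness, the values in Table 3 translate into a training configuration along these lines (a sketch only; Mozilla's DeepSpeech trainer accepts equivalent settings as command-line flags, whose exact names we do not reproduce here).

    import tensorflow as tf

    # Hyperparameters from Table 3, applied unchanged to every experiment.
    HPARAMS = {
        "batch_size": 24,
        "learning_rate": 0.0005,
        "dropout_rate": 0.4,
        "epochs": 30,
    }

    optimizer = tf.keras.optimizers.Adam(learning_rate=HPARAMS["learning_rate"])
    # A training loop or model.fit(..., epochs=HPARAMS["epochs"]) would then use these values.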

D. Datasets and Preprocessing
The data we used for our experiments are tabulated in Table 4. For the German models, we utilized the training data from Mozilla’s German corpus [12]. This data comprises around 315 hours of speech, provided by about 4,823 speakers, with utterances lasting around 3 to 5 seconds. For the Swiss German models, we drew upon an even smaller dataset of 70 hours of Swiss German speech derived from Bernese parliamentary debates [13].


The Swiss German dataset covers formal speech with relatively few speakers (around 191), and its size is considerably smaller than that of the German corpus.

The initial English DeepSpeech model was trained on a much larger dataset (over 6,500 hours of English speech audio) aggregated from heterogeneous sources such as LibriSpeech and the English portion of the Common Voice dataset (more information can be found in footnote 5). Before training, all datasets underwent common preprocessing steps, for example audio normalization and cleaning of the transcript text (e.g., lowercasing and punctuation removal), to make them consistent. Table 5 contains an itemized description of each component of the data set, as well as the preprocessing pipeline.

Along with the acoustic data, we used an external language model at inference time to enhance the accuracy of the recognizer. To do this, we trained a tri-gram language model with the KenLM toolkit [14] on a large text corpus consisting of public-domain German-language text from Wikipedia articles and Europarl parliamentary debates. We incorporated this language model into the DeepSpeech decoder for both the German and Swiss German trials, helping the system produce more accurate transcripts through language contextualization.
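As a sketch of this inference-time component: after a tri-gram ARPA model has been built from the normalized text corpus with KenLM's lmplz tool, it can be queried from Python through the kenlm bindings. File and corpus names below are placeholders, and the actual integration goes through the DeepSpeech decoder's scorer mechanism, which is not shown here.

    import kenlm  # Python bindings for the KenLM toolkit [14]

    # Tri-gram model built beforehand, e.g. with:  lmplz -o 3 < corpus_de.txt > lm_de.arpa
    lm = kenlm.Model("lm_de.arpa")  # placeholder file name

    # Log10 probability of a candidate transcript under the language model; the decoder
    # combines such scores with the acoustic model's output to re-rank hypotheses.
    print(lm.score("das ist ein test", bos=True, eos=True))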

Table 4 Summary of the Datasets Used for Training.


Dataset Language Hours of Audio Number of Speakers
Pre-training English > 6500 —
Training German 315 4,823
Training Swiss German 70 191

Table 5 Description of Each Dataset and Key Preprocessing Details.


Component              Description
German Dataset         Collected from Mozilla Common Voice; crowd-sourced speech with diverse speakers; average utterance length ~3–5 seconds.
Swiss German Dataset   Collected from Bernese Parliament speeches; formal register, fewer speakers; significantly lower volume of data compared to the German set.
English Pretraining    Combined from large-scale English corpora (LibriSpeech + Common Voice English); provides broad acoustic coverage for subsequent adaptation.
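The transcript-cleaning steps summarized in Table 5 (lowercasing, punctuation removal) amount to a normalization function roughly like the following; the retained character set is an assumption for German-language text, not the exact pipeline used.

    import re
    import unicodedata

    # Characters retained for German/Swiss German transcripts (illustrative choice).
    ALLOWED = set("abcdefghijklmnopqrstuvwxyzäöüß '")

    def normalize_transcript(text: str) -> str:
        # Lowercase, normalize unicode, drop punctuation and symbols, collapse whitespace.
        text = unicodedata.normalize("NFC", text).lower()
        text = "".join(ch for ch in text if ch in ALLOWED)
        return re.sub(r"\s+", " ", text).strip()

    print(normalize_transcript("Das ist ein Test!"))  # -> "das ist ein test"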

IV. RESULTS AND DISCUSSION

We compared the performance of the six training schemes in terms of word error rate (WER) and character error rate (CER) on test sets for both German and Swiss German. Table 6 shows the WER and CER obtained by each model configuration for German, while Table 7 shows the WER and CER for Swiss German. “Reference” in these tables indicates the model trained from scratch without any transfer learning, and the “Improvement” column shows the percentage-point improvement in WER with respect to that baseline.

For the German ASR task, the baseline model trained without any transfer learning obtained a WER of 70.0% with a CER of 42.0%. Employing the English pre-trained model with no frozen layers (0 frozen, full fine-tuning) reduced the WER to 63.0% (CER 37.0%), a modest improvement of 7.0 points. However, partially freezing the initial layers produced much greater improvements. Freezing only the first layer improved the WER to 48.0% (CER 26.0%), a 22-point WER improvement over the baseline. Freezing the first two layers improved the WER further to 44.0% (CER 22.0%), the best performance and an improvement of 26 points over the baseline. Notably, two and three frozen layers showed the same WER (44.0%), meaning the third frozen layer brought no additional improvement over the first two. With four frozen layers, performance actually decreased slightly, with the WER rising to 46.0% and the CER to 25.0%, though this was still significantly better than the baseline. These results indicate that, for German, retaining the lower-level acoustic feature layers (up to two or three layers) gives the best result, significantly outperforming both the baseline and the fully fine-tuned model.

For Swiss German, we see the same pattern with a different magnitude. The baseline Swiss German model (no transfer) achieved a WER of 74.0% (CER 52.0%). Fine-tuning all model layers on Swiss German data (0 frozen) caused the WER to worsen slightly to 76.0%, which shows that such indiscriminate fine-tuning with no freezing can overfit or mis-adapt to the small Swiss German corpus. Freezing the early layers, in contrast, worked: with one layer frozen, the WER improved to 69.0% (CER 48.0%), about a 5-point improvement over the baseline, and with two layers frozen, the WER further improved to 67.0% (CER 45.0%), the best performance for Swiss German (a 7-point improvement over the baseline). Freezing three or four layers showed no additional improvements (WER ~68.0% in each case, ~6 points better than the baseline). So, for Swiss German, freezing the first two layers of the pre-trained model provided the greatest improvement, with freezing beyond two layers bringing no additional advantage and retaining approximately the same performance.

In total, selective freezing of layers resulted in notably improved accuracy for both languages over training from scratch. The advantage was particularly dramatic for German, with its larger dataset; the transfer learning method reduced the WER by 26 absolute points.


Swiss German, with its much smaller dataset and higher dialectal variation, also showed improved performance with transfer learning, though the relative improvement was smaller. These results show that the lowest layers of the deep model encode general acoustic representations that are relevant for many languages. By holding these layers constant, the fine-tuning procedure can concentrate on adapting the higher-level layers to the target language’s idiosyncrasies. However, freezing too many layers starts to restrict the model’s flexibility: the modest decline in performance when four layers were frozen indicates that the model required some of the later layers to adapt to language-specific features.

Interestingly, we found that models with varying numbers of frozen layers showed very comparable training convergence patterns. This suggests that retaining the pre-trained low-level feature extractors did not impede training; the models all trained at around the same speed but reached different final accuracy levels depending on the number of layers updated. This result indicates that much of the key learning of the new language happens higher up in the model once an effective set of building-block features is established.

Table 6 below presents the performance of the different training strategies for German, and Table 7 presents the respective results for Swiss German. Table 8 presents a high-level comparison of each language’s optimal freezing configuration and the resulting error rates.

Table 6 German ASR Performance with Various Layer-Freezing Strategies (WER = Word Error Rate, CER = Character Error
Rate). The Improvement Column Indicates WER Improvement Compared to the Baseline (Reference) Model.
Training Strategy WER (%) CER (%) WER Improvement
Reference (No Transfer; Random Init.) 70.0 42.0 —
0 Frozen Layers (Full fine-tuning) 63.0 37.0 +7.0
1 Frozen Layer 48.0 26.0 +22.0
2 Frozen Layers 44.0 22.0 +26.0
3 Frozen Layers 44.0 22.0 +26.0
4 Frozen Layers 46.0 25.0 +24.0

Table 7 ASR Performance for Swiss German under Different Layer-Freezing Strategies.
Training Strategy WER (%) CER (%) WER Improvement
Reference (No Transfer; Random Init.) 74.0 52.0 —
0 Frozen Layers (Full fine-tuning) 76.0 54.0 –2.0
1 Frozen Layer 69.0 48.0 +5.0
2 Frozen Layers 67.0 45.0 +7.0
3 Frozen Layers 68.0 47.0 +6.0
4 Frozen Layers 68.0 46.0 +6.0

Table 8 Summary of Optimal Performance Results Across Languages.


Language Optimal # of Frozen Layers Best WER (%) Best CER (%)
German 2–3 44.0 22.0
Swiss German 2 67.0 45.0
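For reference, the WER and CER figures in Tables 6 to 8 are standard edit-distance metrics; a minimal self-contained implementation is shown below (our own helper for illustration, not the evaluation code shipped with DeepSpeech).

    def edit_distance(ref, hyp) -> int:
        # Levenshtein distance between two sequences (words for WER, characters for CER).
        prev = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, start=1):
            curr = [i]
            for j, h in enumerate(hyp, start=1):
                curr.append(min(prev[j] + 1,              # deletion
                                curr[j - 1] + 1,          # insertion
                                prev[j - 1] + (r != h)))  # substitution
            prev = curr
        return prev[-1]

    def wer(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        return edit_distance(ref, hyp) / max(len(ref), 1)

    def cer(reference: str, hypothesis: str) -> float:
        return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)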

V. CONCLUSION

In this work, we have shown that transfer learning combined with selective layer freezing is an affordable approach to enhancing ASR systems for low-resource languages. We experimented extensively with Mozilla’s DeepSpeech setup on German and Swiss German and confirmed that, by initializing the network from a pre-trained English model and freezing the initial layers, recognition performance can be improved significantly. The best gains were found in models with two to three frozen layers, suggesting that the low-level phonetic features learned by English ASR systems are highly transferable. By preserving these features through freezing, fine-tuning can more quickly specialize the higher layers of the model to the target language. In contrast, models trained from scratch (i.e., with no pre-training and no transfer learning) performed significantly worse, demonstrating the utility of abundant English data in low-resource settings.

Our investigation also showed that freezing beyond the second layer for Swiss German and beyond the third layer for German yields no further accuracy gains, but selective freezing of the lower dense layers remains very advantageous. It not only increases accuracy but also makes fine-tuning easier by decreasing the number of trainable parameters. There appears to be a trade-off between keeping pre-learned representations and retaining enough flexibility for language-specific adaptation. Freezing more layers (even four) harms performance slightly, so the higher layers still need some retraining to handle the nuances of the target language. This trade-off likely depends on the amount and quality of the training data available in the target language and deserves further exploration.

Overall, transfer learning with selective layer freezing is a powerful technique for closing the performance gap between high- and low-resource languages in speech recognition. The results also motivate additional research into adaptive freezing strategies (e.g., deciding at runtime which layers to freeze), and demonstrate the potential of such methodology for building scalable, robust, and computationally efficient multilingual ASR systems. Further studies in this area are likely to result in ASR technology that is more accessible across languages and dialects, and thus increase the inclusivity of ASR systems globally.


FUTURE WORK

Future work should focus on improving this layer-freezing method and extending it to other models and languages. One direction is to make the selective layer-freezing strategy more adaptive or dynamic. For instance, the optimal number of frozen layers could be adjusted according to the size and quality of the target dataset, as the trade-off between preserving pre-learned features and allowing adaptation may differ. It remains for future work to determine whether techniques that automatically select or gradually unfreeze layers could also benefit performance (a simple schedule of this kind is sketched at the end of this section). It would also be interesting to try other pre-trained models as the base: testing the freezing strategy on other state-of-the-art ASR models would reveal whether the gains achieved are consistent across different network designs and could exploit richer pre-trained representations to improve performance. In addition, an interesting direction is to extend the proposed transfer learning to multiple languages or more complex datasets to examine its generality. Finally, it will be important to apply the presented method to languages beyond German and Swiss German (other language families, and languages with phonetic characteristics quite different from those considered here) in order to see whether the low-level acoustic features learned from English can generally be employed successfully, or whether language-specific fine-tuning is required. Likewise, generalizing to more challenging and more diverse datasets (such as larger speech corpora with more speakers, dialectal variability, and noisier audio) is important for evaluating the robustness of the method in real-world settings. Such experiments would help confirm the effectiveness of the approach in multilingual settings and provide practical guidance for scalable and efficient speech model adaptation in low-resource scenarios.
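As one illustration of the adaptive strategies mentioned above, gradual unfreezing could be expressed as a simple epoch-based schedule that releases one additional layer every few epochs. This is a hypothetical sketch in Keras terms, not a method evaluated in this paper; note that in Keras the model must be re-compiled after the trainable flags change.

    import tensorflow as tf

    def gradual_unfreeze(model: tf.keras.Model, epoch: int,
                         start_frozen: int = 4, unfreeze_every: int = 5) -> None:
        # Start with `start_frozen` lower layers frozen, then unfreeze one more layer
        # every `unfreeze_every` epochs, working down from the top of the frozen block.
        num_frozen = max(0, start_frozen - epoch // unfreeze_every)
        weight_layers = [layer for layer in model.layers if layer.weights]
        for i, layer in enumerate(weight_layers):
            layer.trainable = i >= num_frozen

    # Could be invoked from a tf.keras.callbacks.LambdaCallback(on_epoch_begin=...).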
REFERENCES

[1]. A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Y. Ng, “Deep Speech: Scaling up end-to-end speech recognition,” 2014.
[2]. A. Agarwal and T. Zesch, “German end-to-end speech recognition based on DeepSpeech,” in Proc. of the 15th Conference on Natural Language Processing (KONVENS 2019): Long Papers, Erlangen, Germany: German Society for Computational Linguistics & Language Technology, 2019, pp. 111–119.
[3]. “LTL-UDE at low-resource speech-to-text shared task: Investigating Mozilla DeepSpeech in a low-resource setting,” 2020.
[4]. J. Kunze, L. Kirsch, I. Kurenkov, A. Krug, J. Johannsmeier, and S. Stober, “Transfer learning for speech recognition on a budget,” in Proc. of the 2nd Workshop on Representation Learning for NLP (RepL4NLP@ACL 2017), Vancouver, Canada, Aug. 2017, Association for Computational Linguistics, pp. 168–177.
[5]. M. Huh, P. Agrawal, and A. A. Efros, “What makes ImageNet good for transfer learning?” 2016.
[6]. B. Li, X. Wang, and H. S. M. Beigi, “Cantonese automatic speech recognition using transfer learning from Mandarin,” CoRR, 2019.
[7]. Y. Belinkov and J. Glass, “Analyzing hidden representations in end-to-end automatic speech recognition systems,” in Advances in Neural Information Processing Systems, vol. 30, 2017, pp. 2441–2451.
[8]. S. Imai, “Cepstral analysis synthesis on the mel frequency scale,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’83), vol. 8, 1983, pp. 93–96.
[9]. A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proc. of the 23rd International Conference on Machine Learning, 2006, pp. 369–376.
[10]. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2014.
[11]. S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[12]. R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, “Common Voice: A massively-multilingual speech corpus,” in Proc. of the 12th Language Resources and Evaluation Conference (LREC 2020), Marseille, France, May 2020, European Language Resources Association, pp. 4218–4222.
[13]. M. Plüss, L. Neukom, and M. Vogel, “GermEval 2020 Task 4: Low-resource speech-to-text,” 2020.
[14]. K. Heafield, “KenLM: Faster and smaller language model queries,” in Proc. of the 6th Workshop on Statistical Machine Translation, Association for Computational Linguistics, 2011, pp. 187–197.
[15]. M. Schröder and J. Trouvain, “The German text-to-speech synthesis system MARY: A tool for research, development and teaching,” International Journal of Speech Technology, vol. 6, no. 4, pp. 365–377, 2003.

