Optimizing Speech Models with Freezing
Abstract: Adapting speech models to new languages requires optimizing the trade-off between accuracy and
computational cost. In this work, we investigate the adaptation of Mozilla’s DeepSpeech model from English to German and
Swiss German through selective freezing of layers. Using a transfer learning strategy, we analyze how freezing different
numbers of network layers during fine-tuning affects performance. The experiments reveal that freezing the initial layers
yields significant improvements: training time decreases and accuracy increases. This layer-freezing technique hence offers
a scalable way to improve automatic speech recognition for under-resourced languages.
Keywords: Automatic Speech Recognition (ASR); Deep Speech; German; Layer Freezing; Low-Resource Languages; Swiss
German; Transfer Learning.
How to Cite: Revanth Reddy Pasula; (2025). Optimizing Speech Models with Freezing. International Journal of Innovative Science
and Research Technology, (RISEM–2025), 69-73. https://doi.org/10.38124/ijisrt/25jun167
This section describes the Deep Speech architecture, the training procedure and layer freezing, the hyperparameters and computing environment, and dataset preparation as well as the preprocessing pipeline.

A. Deep Speech Architecture
Mozilla’s DeepSpeech, version 0.7, was utilized as the base ASR architecture. The implementation, which deviates only minimally from the model originally proposed by Hannun et al. [1], is documented in greater detail in the official documentation². The processing pipeline starts with MFCC [8] extraction from the raw audio input, followed by six layers forming a deep recurrent neural network. The network structure is shown in Table I. Briefly, layers 1–3 are ReLU-activated fully connected layers, layer 4 is an LSTM recurrent layer [11], layer 5 is an additional fully connected, ReLU-activated layer, and layer 6 is the output layer that generates character probabilities through a softmax. The model is trained with the Connectionist Temporal Classification (CTC) loss [9] and optimized with the Adam optimizer [10].

Table I shows the DeepSpeech architecture and data flow, from input audio through feature extraction to output character probabilities (adapted from the official documentation).
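To make the layer structure concrete, the following is a minimal sketch of a DeepSpeech-style six-layer network. It is written in PyTorch purely for illustration (Mozilla’s DeepSpeech 0.7 is implemented in TensorFlow and differs in many details), and the layer widths, context window, and alphabet size are placeholder values, not the actual configuration.

```python
# Minimal sketch of a DeepSpeech-style network (illustrative only).
import torch
import torch.nn as nn

class DeepSpeechSketch(nn.Module):
    def __init__(self, n_mfcc=26, n_context=9, n_hidden=2048, n_chars=29):
        super().__init__()
        n_input = n_mfcc * (2 * n_context + 1)    # MFCC frame plus context window
        self.fc1 = nn.Linear(n_input, n_hidden)   # layers 1-3: ReLU fully connected
        self.fc2 = nn.Linear(n_hidden, n_hidden)
        self.fc3 = nn.Linear(n_hidden, n_hidden)
        self.lstm = nn.LSTM(n_hidden, n_hidden, batch_first=True)  # layer 4: LSTM
        self.fc5 = nn.Linear(n_hidden, n_hidden)  # layer 5: ReLU fully connected
        self.out = nn.Linear(n_hidden, n_chars)   # layer 6: character logits

    def forward(self, x):                         # x: (batch, time, features)
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        x, _ = self.lstm(x)
        x = torch.relu(self.fc5(x))
        return self.out(x)                        # softmax + CTC loss applied outside

# Training uses the CTC loss and the Adam optimizer, e.g.:
# ctc = nn.CTCLoss(blank=0)
# log_probs = nn.functional.log_softmax(model(batch), dim=-1).transpose(0, 1)
# loss = ctc(log_probs, targets, input_lengths, target_lengths)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```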
B. Training Procedure and Layer Freezing
We performed a series of training experiments to measure the effect of frozen layers in transfer learning. For weight initialization, we utilized an English pre-trained DeepSpeech model provided by Mozilla. Six training setups were run for both German and Swiss German; these are compiled in Table II. In addition, we trained one model entirely from scratch with random initialization as our baseline comparison point (labeled the “Reference” condition, with no transfer learning).

During fine-tuning, the frozen layers were marked as non-trainable, while the remaining layers were trained on the target data. The output layer of all transfer learning models was re-initialized, as the character set (output labels) differs between English and the target language. This re-initialization ensured compatibility with German or Swiss German transcripts.
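A minimal sketch of this freezing and re-initialization step, continuing the illustrative PyTorch model above (the actual DeepSpeech training code realizes this through its own checkpoint-loading mechanism): the first k layers keep their pre-trained English weights and receive no gradient updates, while the output layer is replaced to match the target alphabet.

```python
# Sketch of freezing the first n_frozen layers of a pre-trained model and
# re-initializing the output layer for a new character set (illustrative).
import torch
import torch.nn as nn

def prepare_for_transfer(model, n_frozen, n_target_chars):
    # Layer order as in Table I; names refer to the sketch above.
    layers = [model.fc1, model.fc2, model.fc3, model.lstm, model.fc5]

    # Frozen layers keep their English weights and receive no gradients.
    for layer in layers[:n_frozen]:
        for param in layer.parameters():
            param.requires_grad = False

    # The output layer is always re-initialized, because the German /
    # Swiss German character set differs from the English one.
    model.out = nn.Linear(model.out.in_features, n_target_chars)
    return model

# Only the unfrozen parameters are passed to the optimizer, e.g.:
# trainable = [p for p in model.parameters() if p.requires_grad]
# optimizer = torch.optim.Adam(trainable, lr=1e-4)
```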
C. Hyperparameters and Computational Environment
The same set of hyperparameters was used in all experiments (Table III), with no further tuning beyond these preselected values. Training was carried out on a Linux server with 96 Intel Xeon Platinum 8160 CPU cores. All models (both German and Swiss German) were trained for the same number of epochs under the same conditions, to allow an unbiased comparison of the different freezing strategies.
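Conceptually, the controlled comparison amounts to reusing one shared configuration across every run and varying only the language and the number of frozen layers; the sketch below enumerates such a grid. The values shown are illustrative placeholders only, not the hyperparameters listed in Table III.

```python
# Illustrative experiment grid: one shared hyperparameter set for every run
# (placeholder values, not the actual settings from Table III).
SHARED_CONFIG = {
    "epochs": 30,            # identical for every freezing strategy
    "batch_size": 24,
    "learning_rate": 1e-4,   # Adam, as in the base model
    "dropout": 0.25,
}

# None = train from scratch ("Reference"); 0-4 = number of frozen layers.
FREEZING_STRATEGIES = [None, 0, 1, 2, 3, 4]

def experiment_grid(languages=("de", "gsw")):   # illustrative language codes
    """Enumerate the twelve configurations reported in Tables VI and VII."""
    for language in languages:
        for n_frozen in FREEZING_STRATEGIES:
            yield {"language": language, "n_frozen": n_frozen, **SHARED_CONFIG}

for run in experiment_grid():
    print(run)
```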
D. Datasets and Preprocessing
The data we used for our experiments are tabulated in Table IV. For the German models, we utilized the training data from Mozilla’s German corpus [12]. This data comprises around 315 hours of speech, provided by about 4,823 speakers, with utterances lasting around 3 to 5 seconds. For the Swiss German models, we drew upon an even smaller dataset of 70 hours of Swiss German speech derived from Bernese parliamentary debates [13]. The Swiss German dataset covers formal speech by relatively few speakers (around 191), and its size is considerably smaller than that of the German corpus.
The initial English DeepSpeech model was trained on a much larger dataset (over 6,500 hours of English speech audio) aggregated from heterogeneous sources such as LibriSpeech and the English portion of the Common Voice dataset (more information can be found in footnote 5). Before training, all datasets underwent common preprocessing steps, for example audio normalization and cleaning of the transcript text (e.g. lowercasing and punctuation removal), to make them consistent. Table V contains an itemized description of each component of the dataset, as well as the preprocessing pipeline.

Along with the acoustic data, we used an external language model at inference time to enhance the accuracy of the recognizer. For this, we trained a tri-gram language model with the KenLM toolkit [14] on a large text corpus consisting of public-domain German-language text from Wikipedia articles and Europarl parliamentary debates. This language model was incorporated into the DeepSpeech decoder for both the German and the Swiss German trials, helping the system produce more accurate transcripts by providing language context.
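As an illustration of the transcript-cleaning step, the sketch below applies lowercasing, punctuation removal, and whitespace normalization to a German sentence; the exact rules of the actual pipeline are those described in Table V and may differ. The commented lines at the end show how a trained KenLM model can be queried from Python.

```python
# Minimal sketch of transcript normalization (illustrative; the actual rules
# of the preprocessing pipeline are described in Table V).
import re
import unicodedata

def normalize_transcript(text: str) -> str:
    text = unicodedata.normalize("NFC", text)     # canonical form for umlauts etc.
    text = text.lower()                           # lowercasing
    text = re.sub(r"[^a-zäöüß' ]", " ", text)     # drop punctuation/digits (German alphabet assumed)
    return re.sub(r"\s+", " ", text).strip()      # collapse whitespace

print(normalize_transcript("Grüezi! Das ist, ehm, ein Test."))
# -> "grüezi das ist ehm ein test"

# Scoring a sentence with a KenLM tri-gram model (requires the `kenlm`
# Python bindings and a trained model file, e.g. a hypothetical "de_trigram.binary"):
# import kenlm
# lm = kenlm.Model("de_trigram.binary")
# print(lm.score("das ist ein test", bos=True, eos=True))
```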
IV. RESULTS AND DISCUSSION

We compared the performance of the six training schemes in terms of word error rate (WER) and character error rate (CER) on test sets for both German and Swiss German. Table VI shows the WER and CER obtained by each model configuration for German, while Table VII shows the WER and CER for Swiss German. “Reference” in these tables denotes the model trained from scratch without any transfer learning, and the “Improvement” column shows the percentage-point improvement in WER with respect to that baseline.

For the German ASR task, the baseline model trained without any transfer learning obtained a WER of 70.0% with a CER of 42.0%. Employing the English pre-trained model with no frozen layers (0 frozen, full fine-tuning) reduced the WER to 63.0% (CER 37.0%), a modest improvement of only 7.0 points. However, partially freezing the initial layers produced much greater improvements. Freezing just the first layer improved the WER to 48.0% (CER 26.0%), a 22-point WER improvement over the baseline. Freezing the first two layers improved the WER further to 44.0% (CER 22.0%), the best performance and an improvement of 26 points over the baseline. Notably, two and three frozen layers showed the same WER (44.0%), meaning the third layer brought no additional improvement over the first two. With four frozen layers, performance actually declined slightly, with the WER increasing to 46.0% and the CER to 25.0%, though it remained significantly better than the baseline. All of this indicates that, for German, retaining the lower-level acoustic feature layers (up to two or three layers) gives the best result, significantly outperforming both the baseline and the fully fine-tuned model.

For Swiss German, we see the same pattern at a different magnitude. The baseline Swiss German model (no transfer) achieved a WER of 74.0% (CER 52.0%). Fine-tuning all model layers on Swiss German data (0 frozen) caused the WER to worsen slightly to 76.0%, which shows that such indiscriminate fine-tuning without freezing can overfit or mis-adapt on the small Swiss German corpus. Freezing the early layers, in contrast, worked: with one layer frozen, the WER improved to 69.0% (CER 48.0%), about a 5-point improvement over the baseline, and with two layers frozen, the WER further improved to 67.0% (CER 45.0%), the best performance for Swiss German (a 7-point improvement over the baseline). Freezing three or four layers brought no additional improvement (WER ~68.0% in each case, ~6 points better than the baseline). So, for Swiss German, freezing the first two layers of the pre-trained model provided the greatest improvement, with freezing beyond two layers bringing no further advantage and retaining approximately the same performance.

In total, selective freezing of layers resulted in notably improved accuracy for both languages compared with training from scratch. The advantage was particularly dramatic for German, with its larger dataset, where the transfer learning method reduced the WER by 26 absolute points.
Swiss German, with its much smaller dataset and higher dialectal variation, also showed improved performance with transfer learning, though the relative improvement was smaller. These results show that the lowest layers of the deep model encode general acoustic representations that are relevant across many languages. By holding these layers constant, the fine-tuning procedure can concentrate on adapting the higher-level layers to the target language’s idiosyncrasies. But freezing too many layers starts to restrict the model’s flexibility: the modest decline in performance when four layers were frozen indicates that the model required some of the later layers to adapt to language-specific features.

Interestingly, we found that models with varying numbers of frozen layers showed very comparable training convergence patterns. This suggests that retaining the pre-trained low-level feature extractors did not impede training; all models trained at around the same speed and simply reached different final accuracy levels depending on the number of layers updated. This result indicates that much of the key learning of the new language happens higher up in the model, once an effective set of building-block features is established.

Table VI below presents the performance of the different training strategies for German, and Table VII presents the corresponding results for Swiss German. Table VIII presents a high-level comparison of each language’s optimal freezing configuration and the resulting error rates.
Table VI German ASR Performance with Various Layer-Freezing Strategies (WER = Word Error Rate, CER = Character Error Rate). The Improvement Column Indicates the WER Improvement in Percentage Points Compared to the Baseline (Reference) Model.

Training Strategy                        WER (%)   CER (%)   WER Improvement
Reference (No Transfer; Random Init.)     70.0      42.0      —
0 Frozen Layers (Full fine-tuning)        63.0      37.0      +7.0
1 Frozen Layer                            48.0      26.0      +22.0
2 Frozen Layers                           44.0      22.0      +26.0
3 Frozen Layers                           44.0      22.0      +26.0
4 Frozen Layers                           46.0      25.0      +24.0
Table VII ASR Performance for Swiss German under Different Layer-Freezing Strategies.

Training Strategy                        WER (%)   CER (%)   WER Improvement
Reference (No Transfer; Random Init.)     74.0      52.0      —
0 Frozen Layers (Full fine-tuning)        76.0      54.0      –2.0
1 Frozen Layer                            69.0      48.0      +5.0
2 Frozen Layers                           67.0      45.0      +7.0
3 Frozen Layers                           68.0      47.0      +6.0
4 Frozen Layers                           68.0      46.0      +6.0
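The WER and CER figures in Tables VI and VII are standard edit-distance metrics. For reference, the following is a minimal Python sketch of how such metrics can be computed; it is an illustration, not the evaluation code used in these experiments.

```python
# Edit-distance based WER/CER computation (illustrative only).
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, start=1):
            prev, d[j] = d[j], min(d[j] + 1,         # delete
                                   d[j - 1] + 1,     # insert
                                   prev + (r != h))  # substitute
    return d[-1]

def wer(reference: str, hypothesis: str) -> float:
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference: str, hypothesis: str) -> float:
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

print(wer("das ist ein test", "das ist kein test"))  # 0.25 -> 25.0% WER
```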
freeze), and demonstrate the potential of such a methodology for building scalable, robust, and computationally efficient multilingual ASR systems. Further studies in this area are likely to result in ASR technology that is more accessible across languages and dialects, and thus to increase the inclusivity of ASR systems globally.

FUTURE WORK

Future work should focus on refining this layer-freezing method and extending it to other models and languages. One direction is to optimize the selective layer-freezing strategy through more adaptive or dynamic approaches. For instance, the optimal number of frozen layers could be adjusted according to the size and quality of the target dataset, since the trade-off between preserving pre-learned features and adapting to the target can differ. It also remains to be investigated whether techniques that automatically determine which layers to freeze, or that gradually unfreeze layers, could further benefit performance. Another avenue is to try other pretrained models as the base: testing the freezing strategy on other state-of-the-art ASR models would reveal whether the gains achieved are consistent across different network designs, and could exploit richer pretrained representations for improved performance. In addition, an interesting direction is to extend the proposed transfer learning approach to multiple languages or more complex datasets to examine its generality. Finally, it will be important to apply the presented method to languages outside German and Swiss German (other language families, and languages with phonetic characteristics quite different from those considered here) in order to see whether the low-level acoustic features learnt from English can generally be employed successfully, or whether language-specific fine-tuning of those layers is required. Likewise, generalizing to more challenging and more diverse datasets (such as larger speech corpora with more speakers, dialectal variability, and noisier audio) is important to evaluate the robustness of the method in real-world settings. Such experiments would help confirm the effectiveness of the approach in multilingual settings and provide practical guidance for scalable and efficient speech model adaptation in low-resource scenarios.

REFERENCES

[1]. A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Y. Ng, “Deep speech: Scaling up end-to-end speech recognition,” 2014.
[2]. A. Agarwal and T. Zesch, “German end-to-end speech recognition based on DeepSpeech,” in Proc. of the 15th Conf. on Natural Language Processing (KONVENS 2019): Long Papers, Erlangen, Germany: German Society for Computational Linguistics & Language Technology, 2019, pp. 111–119.
[3]. “LTL-UDE at low-resource speech-to-text shared task: Investigating Mozilla DeepSpeech in a low-resource setting,” 2020.
[4]. J. Kunze, L. Kirsch, I. Kurenkov, A. Krug, J. Johannsmeier, and S. Stober, “Transfer learning for speech recognition on a budget,” in Proc. of the 2nd Workshop on Representation Learning for NLP (RepL4NLP@ACL 2017), Vancouver, Canada, Aug. 2017, Association for Computational Linguistics, pp. 168–177.
[5]. M. Huh, P. Agrawal, and A. A. Efros, “What makes ImageNet good for transfer learning?” 2016.
[6]. B. Li, X. Wang, and H. S. M. Beigi, “Cantonese automatic speech recognition using transfer learning from Mandarin,” CoRR, 2019.
[7]. Y. Belinkov and J. Glass, “Analyzing hidden representations in end-to-end automatic speech recognition systems,” in Advances in Neural Information Processing Systems, vol. 30, 2017, pp. 2441–2451.
[8]. S. Imai, “Cepstral analysis synthesis on the mel frequency scale,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’83), vol. 8, 1983, pp. 93–96.
[9]. A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proc. of the 23rd International Conference on Machine Learning, 2006, pp. 369–376.
[10]. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2014.
[11]. S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[12]. R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, “Common Voice: A massively-multilingual speech corpus,” in Proc. of the 12th Language Resources and Evaluation Conference (LREC 2020), Marseille, France, May 2020, European Language Resources Association, pp. 4218–4222.
[13]. M. Plüss, L. Neukom, and M. Vogel, “GermEval 2020 Task 4: Low-resource speech-to-text,” 2020.
[14]. K. Heafield, “KenLM: Faster and smaller language model queries,” in Proc. of the 6th Workshop on Statistical Machine Translation, Association for Computational Linguistics, 2011, pp. 187–197.
[15]. M. Schröder and J. Trouvain, “The German text-to-speech synthesis system MARY: A tool for research, development and teaching,” International Journal of Speech Technology, vol. 6, no. 4, pp. 365–377, 2003.