
Procesamiento del Lenguaje Natural, Revista nº 74, March 2025, pp. 395-398. Received 04-12-2024, revised 21-01-2025, accepted 31-01-2025.
ISSN 1135-5948. DOI 10.26342/2025-74-28. ©2025 Sociedad Española para el Procesamiento del Lenguaje Natural.

Deep learning applied to speech processing:


Development of novel models and techniques
Aprendizaje profundo aplicado al procesamiento de voz:
Desarrollo de nuevos modelos y técnicas
Roberto Andrés Carofilis Vasco
Departamento de Ingeniería Eléctrica y de Sistemas y Automática, Universidad de León
Campus de Vegazana, s/n, 24007 León, España
andres.vasco@unileon.es

Abstract: This is the summary of the Ph.D. thesis conducted by Roberto Andrés
Carofilis Vasco, under the supervision of Prof. Enrique Alegre Gutiérrez and Prof.
Laura Fernández Robles at the University of León. The thesis defense took place in
León, Spain, on December 20, 2023, in the presence of a committee formed by Dr.
Luis Fernando D’Haro (Polytechnic University of Madrid, Spain), Dr. Kenneth P.
Camilleri (University of Malta, Malta), and Dr. Victor González Castro (University
of León, Spain). The thesis received international mention following a 3-month stay
at the Idiap Research Institute in Switzerland, under the supervision of Dr. Petr
Motlicek. The thesis was awarded the outstanding cum laude distinction.
Keywords: Speech processing, language identification, accent identification, speaker identification.
Resumen: Este es el resumen de la tesis doctoral realizada por Roberto Andrés
Carofilis Vasco, bajo la dirección del Prof. Enrique Alegre Gutiérrez y la Prof. Laura
Fernández Robles en la Universidad de León. La defensa de la tesis se realizó en
León, España, el 20 de diciembre de 2023 ante un tribunal compuesto por el Dr. Luis
Fernando D’Haro (Universidad Politécnica de Madrid, España), el Dr. Kenneth P.
Camilleri (Universidad de Malta, Malta), y el Dr. Victor González Castro (Universidad
de León, España). La tesis obtuvo la mención internacional tras una estancia
de 3 meses en el Idiap Research Institute, en Suiza, bajo la supervisión del Dr. Petr
Motlicek. La tesis obtuvo la calificación de sobresaliente cum laude.
Palabras clave: Procesamiento del habla, identificación de idiomas, identificación
de acentos, identificación de hablantes.

1 Introduction

Speech-processing models have gained increasing importance across various fields, including law enforcement and cybersecurity. These models play a crucial role in the fight against crimes like child exploitation and human trafficking, helping in suspect identification and providing evidence in criminal investigations. They can also be used for other applications, such as speech recognition in personal assistants, voice control systems, and language learning tools.

However, speech processing models face numerous challenges, a major one being the scarcity of data. Acquiring sufficient and relevant speech data poses an obstacle, as it makes it difficult to train models that show accuracy and robustness on specific tasks. Moreover, these models are often complex and require significant computational resources for both the training and inference phases, resulting in significant costs and time investments.

The thesis focuses on three speech-processing tasks: language identification, accent identification, and speaker identification. All three tasks are crucial in academia, industry, and cybersecurity, being useful in tasks such as victim and fugitive identification and tracking, crime prevention, and suspect segmentation. In addition, they have the potential to improve automatic speech recognition systems, addressing the challenges of creating robust systems that are resilient to the particularities of speech in different regions.

In this thesis, we propose new techniques, models, and datasets to address the described speech processing tasks, which facilitate the creation of systems with state-of-the-art performance and require a relatively low amount of data and computational resources to train. Motivated by our collaboration with the European project "Global Response Against Child Exploitation" (GRACE), we focus on the creation of applications that are useful for law enforcement agencies in their fight against cybercrime and child sexual exploitation.

Several contributions presented in this thesis will be used by Europol and the national law enforcement agencies of the European Union countries. The objective of the GRACE project is the creation of tools that allow the monitoring and generation of automatic alerts in cases of possible risk involving minors.

Among the proposals of this thesis are new systems capable of achieving competitive results even though they have been trained with a limited amount of data. In addition, it presents two new models capable of being trained with limited computational resources and at the same time achieving results superior to those of other state-of-the-art models. We also include a new highly balanced dataset, and the experimental setup used in all the experiments carried out, to allow reproducibility of the results and to make the results presented comparable with future tools.

2 Thesis Overview

This thesis is composed of 6 chapters, which are described below:

Chapter 1 presents the objectives, motivations, and introduces the contributions of this thesis.

Chapter 2 contains a detailed review of state-of-the-art approaches related to language identification, accent identification, and speaker identification tasks, and related work on the proposed contributions. We also mention the main limitations of the methods reviewed and possible improvements that can be applied.

In Chapter 3, entitled "Improvement of accent classification models through Grad-Transfer from Spectrograms and Gradient-weighted Class Activation Mapping", we present Grad-Transfer, a novel descriptor based on the concatenation of a flattened spectrogram and the dimensionality-reduced heatmaps generated with the Grad-CAM interpretability method (Selvaraju et al., 2020). This descriptor is capable of transferring knowledge extracted from a Convolutional Neural Network (CNN) specialized in accent identification, to be used as additional information in a Classical Machine Learning Algorithm (CMLA), improving the results of the CMLA by enriching the data it receives as input.

We used Grad-Transfer for the classification of native English accents and compared it with the results achieved by CMLAs and by state-of-the-art deep learning models fed only by spectrograms. Grad-Transfer is especially useful in data-poor tasks, where a CMLA may give better results than larger models.

The description of the pipeline and the experimental results achieved were published in the IEEE/ACM Transactions on Audio, Speech, and Language Processing journal (Carofilis et al., 2023a).
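The pipeline is fully specified in (Carofilis et al., 2023a); the Python/PyTorch sketch below is only a hedged illustration of the idea. The tiny CNN, the choice of PCA for reducing the dimensionality of the heatmaps, and the support vector machine standing in for the CMLA are assumptions made for the example, not the published configuration.

```python
# Illustrative sketch only: a hypothetical accent CNN, Grad-CAM computed by hand,
# and an SVM standing in for the classical machine learning algorithm (CMLA).
import numpy as np
import torch
import torch.nn as nn
from sklearn.decomposition import PCA
from sklearn.svm import SVC

class TinyAccentCNN(nn.Module):
    """Hypothetical CNN over (1, n_mels, n_frames) spectrograms."""
    def __init__(self, n_classes=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, n_classes)
        )

    def forward(self, x):
        self.feat_maps = self.features(x)   # keep the activation maps for Grad-CAM
        self.feat_maps.retain_grad()        # and their gradients
        return self.head(self.feat_maps)

def grad_cam(model, spec):
    """Gradient-weighted class activation map for the top predicted class."""
    model.zero_grad()
    logits = model(spec.unsqueeze(0))
    logits[0, logits.argmax()].backward()
    weights = model.feat_maps.grad.mean(dim=(2, 3), keepdim=True)  # pooled gradients
    cam = torch.relu((weights * model.feat_maps).sum(dim=1))       # (1, n_mels, n_frames)
    return cam.squeeze(0).detach().numpy().ravel()

def grad_transfer(model, specs, cam_dims=32):
    """Concatenate flattened spectrograms with dimensionality-reduced Grad-CAM maps."""
    cams = np.stack([grad_cam(model, s) for s in specs])
    cams_small = PCA(n_components=cam_dims).fit_transform(cams)  # PCA is an assumption
    flat_specs = np.stack([s.numpy().ravel() for s in specs])
    return np.concatenate([flat_specs, cams_small], axis=1)

# Toy usage with random tensors standing in for mel spectrograms.
specs = [torch.rand(1, 64, 100) for _ in range(40)]
labels = np.arange(len(specs)) % 6
features = grad_transfer(TinyAccentCNN(), specs)
clf = SVC().fit(features, labels)   # the CMLA, enriched with Grad-Transfer features
```

In a real setting the CNN would first be trained on the accent task, and the dimensionality reduction and the classifier would be fitted only on training data.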
Chapter 4, entitled "MeWEHV: Mel and Wave Embeddings for Human Voice Tasks", presents a novel embedding enrichment procedure that combines the outputs of two concatenated models as independent branches of the same model. One branch is an embedding generation model fed by raw audio waves, called the wave encoder, and the other is a CNN fed by MFCCs of the raw audio, called the MFCC encoder.

We designed an architecture, named MeWEHV, capable of interacting with the two branches through a set of layers, including LSTM layers and attention mechanisms, combining the information extracted from both representations. MeWEHV was tested on the language identification, accent identification, and speaker identification tasks.

We empirically evaluated the hypothesis that there is a complementarity between the embeddings of the wave encoder, this being a non-imposed representation of the acoustic information, and the embeddings of the MFCC encoder, generated from MFCCs, this being an imposed representation.

We presented a new speaker identification dataset, named YouSpeakers204, which is highly balanced in terms of speaker accent and gender. We compared the MeWEHV model with six state-of-the-art models on the proposed tasks using nine datasets, including YouSpeakers204.
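The exact MeWEHV configuration is given in (Carofilis et al., 2023b); the PyTorch sketch below only illustrates the general two-branch layout described above, with a stand-in module playing the role of the pretrained wave encoder and with all layer sizes chosen arbitrarily for the example.

```python
# Orientation sketch only (not the published MeWEHV configuration): a stand-in
# wave encoder over raw audio and a CNN over MFCCs, fused with LSTM + attention
# layers before a classification head. All sizes are illustrative assumptions.
import torch
import torch.nn as nn

class MeWEHVSketch(nn.Module):
    def __init__(self, n_classes=6, dim=128):
        super().__init__()
        # Stand-in for a frozen pretrained speech encoder (e.g. a wav2vec-style model).
        self.wave_encoder = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=8, stride=4), nn.ReLU(),
        )
        # MFCC branch: a small CNN over (n_mfcc, frames) treated as a 1-channel image.
        self.mfcc_encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),   # collapse the frequency axis, keep time
        )
        self.lstm = nn.LSTM(2 * dim, dim, batch_first=True)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, wave, mfcc):
        # wave: (B, samples), mfcc: (B, n_mfcc, frames)
        w = self.wave_encoder(wave.unsqueeze(1)).transpose(1, 2)             # (B, Tw, dim)
        m = self.mfcc_encoder(mfcc.unsqueeze(1)).squeeze(2).transpose(1, 2)  # (B, Tm, dim)
        T = min(w.size(1), m.size(1))
        x = torch.cat([w[:, :T], m[:, :T]], dim=-1)    # align and fuse the two branches
        x, _ = self.lstm(x)
        x, _ = self.attn(x, x, x)                      # self-attention over time
        return self.head(x.mean(dim=1))                # pool over time and classify

logits = MeWEHVSketch()(torch.randn(2, 16000), torch.randn(2, 40, 100))
print(logits.shape)   # torch.Size([2, 6])
```

In the thesis the wave branch is a pretrained speech model with frozen weights; a small convolutional stack is used here only so that the sketch runs without external downloads.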


Details of the MeWEHV architecture, dataset information, and experimental results were published in the IEEE Access journal (Carofilis et al., 2023b).

Chapter 5, entitled "Squeeze-and-excitation for embeddings weighting in speech classification tasks", presents the Squeeze-and-excitation for Embeddings Network (SaEENet), an update of the MeWEHV architecture. SaEENet is built using novel neural layers and several optimizations inspired by recent advances in other deep learning fields, such as the use of depthwise separable convolutions (Chollet, 2017) and squeeze-and-excitation blocks (Hu et al., 2020), initially proposed in the image processing field, and GRU layers (Cho et al., 2014), originally used in text processing.
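As a reminder of what depthwise separable convolutions contribute here, the snippet below compares the trainable parameter count of a standard convolution with that of its depthwise separable counterpart; the 1D layout and the channel sizes are illustrative assumptions, not the actual SaEENet configuration.

```python
# Parameter count of a standard convolution vs. a depthwise separable one
# (per-channel depthwise filtering followed by a pointwise 1x1 mixing convolution).
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

standard = nn.Conv1d(256, 256, kernel_size=3, padding=1)
separable = nn.Sequential(
    nn.Conv1d(256, 256, kernel_size=3, padding=1, groups=256),  # depthwise
    nn.Conv1d(256, 256, kernel_size=1),                         # pointwise
)
print(n_params(standard), n_params(separable))   # 196864 vs. 66816
```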
In the SaEENet model, we introduce a novel implementation of the squeeze-and-excitation block, which processes the stacked embeddings considering time as a dimension containing the target channels. Instead of weighting the relevance of the 2D channels of a convolutional network, SaEENet weights each 1D embedding according to its relevance. This allows the next layer of the model to have the context of which embedding is more relevant, reducing the impact of embeddings generated from audio segments that do not contain speech or contain unnecessary information, and increasing the relevance of the segments that contain information of interest to the model.
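The full SaEENet layer stack is left to the corresponding article; the minimal sketch below, with an arbitrary reduction ratio and embedding size, only illustrates the idea of squeezing each stacked embedding to a single value and re-weighting the sequence, rather than weighting the 2D channels of a convolutional feature map.

```python
# Minimal sketch of squeeze-and-excitation applied over stacked 1D embeddings:
# each time step (one embedding per audio segment) receives its own relevance
# weight, instead of the 2D channels of a convolutional feature map being weighted.
import torch
import torch.nn as nn

class EmbeddingSqueezeExcite(nn.Module):
    def __init__(self, n_steps, reduction=4):   # the reduction ratio is an assumption
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(n_steps, n_steps // reduction), nn.ReLU(),
            nn.Linear(n_steps // reduction, n_steps), nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (batch, time, dim), a sequence of embeddings, one per audio segment
        squeezed = x.mean(dim=-1)          # squeeze: one scalar per embedding -> (batch, time)
        weights = self.gate(squeezed)      # excite: one relevance weight per embedding
        return x * weights.unsqueeze(-1)   # re-scale each 1D embedding

embeddings = torch.randn(2, 16, 256)       # 16 segment embeddings of dimension 256
weighted = EmbeddingSqueezeExcite(n_steps=16)(embeddings)
print(weighted.shape)                      # torch.Size([2, 16, 256])
```

Segments whose embeddings carry little speech information can thus be down-weighted before reaching the following layers, which is the behaviour described above.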
We compared SaEENet with other state-of-the-art models, including MeWEHV, using three datasets, for the language identification, accent identification, and speaker identification tasks.

This chapter has been presented in an article detailing the work done, which has been submitted to a journal.

Chapter 6 summarizes the conclusions of this thesis and provides an outlook for possible future research lines to extend the presented work.

3 Contributions

The main contributions of this thesis are presented below:

Grad-Transfer feature extractor. We introduced the new Grad-Transfer feature extractor to represent distinctive audio features that combine information from both the CNN-based class-discriminative localization technique Grad-CAM and spectrograms.

Novel accent classification approach. We proposed a new method for accent classification using Grad-Transfer, in which the method transfers knowledge from a CNN to a CMLA, achieving better results than other state-of-the-art models. This is the first time in the literature that a Grad-CAM-based method has been proposed for knowledge transfer between machine learning models.

Benchmark setup for VCTK. We publicly present a setup for the Voice Cloning Toolkit (VCTK) dataset (Veaux et al., 2017) in the accent identification task, along with the results achieved by Grad-Transfer using that setup, so that it can be used by researchers to test their models and compare the results with those of this work.

Multi-representation audio pipeline. We introduced a new pipeline to generate rich embeddings by merging multiple audio representations. This approach establishes a basis for improving large pre-trained models and increasing their performance without the need for retraining all their weights.

MeWEHV model architecture. Based on this pipeline, we proposed the MeWEHV deep learning model architecture, which efficiently handles three speech classification tasks and achieves state-of-the-art performance on nine datasets. MeWEHV leverages the knowledge of frozen weights of pre-trained speech processing models and improves their performance by enriching the embeddings generated by them with information extracted from MFCCs, as a complementary representation. The MeWEHV architecture requires a relatively low number of trainable parameters, making it suitable for resource-constrained environments.

YouSpeakers204 dataset. We created a new dataset for speaker identification and accent identification, called YouSpeakers204, with 19607 audio clips and 204 speakers, built from public YouTube videos. The dataset is highly balanced according to the gender of the speakers and six accents: United States, Canada, Scotland, England, Ireland, and Australia.

Benchmarking Latin American Spanish Corpora. We used, for the first time in the literature, the publicly available Latin American Spanish Corpora dataset (Guevara-Rukoz et al., 2020) in the accent identification task, providing benchmark results with the systems we designed and an experimental setup made publicly available for reproducibility and future research.


SaEENet model architecture. We proposed SaEENet, a novel model architecture that achieves competitive results in speaker, language, and accent identification tasks. For the first time in the literature, we introduced the use of squeeze-and-excitation blocks to weight and filter compressed information in embeddings generated from audio clips.

Squeeze-and-excitation variants evaluation. We evaluated three variants of squeeze-and-excitation blocks and presented which variants work best for weighting embeddings of state-of-the-art models trained with self-supervised learning, and feature maps generated by a CNN.

State-of-the-art performance. We successfully outperformed the results of the MeWEHV model and other state-of-the-art models using the SaEENet architecture in the tasks of speaker identification, language identification, and accent identification. Among the other novelties of SaEENet are the use of depthwise separable convolution layers and GRU layers, reducing the number of trainable parameters.

Real-world application. We applied the models and techniques developed in this work to real-world scenarios, focusing specifically on extracting speaker information to identify offenders and victims. This work contributes to the efforts of the GRACE project to leverage machine learning techniques to combat child sexual exploitation.

Acknowledgements

This work was supported in part by the European Union's Horizon 2020 Research and Innovation Framework Programme under the Global Response Against Child Exploitation (GRACE) Project under Grant 883341; in part by the Predoctoral Grant of the Junta de Castilla y León, under Grant EDU/875/2021; and in part by the framework agreement between the University of León and the Spanish National Cybersecurity Institute (INCIBE) under Addendum 01.

References

Carofilis, A., E. Alegre, E. Fidalgo, y L. Fernández-Robles. 2023a. Improvement of accent classification models through Grad-Transfer from spectrograms and gradient-weighted class activation mapping. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:2859–2871.

Carofilis, A., L. Fernández-Robles, E. Alegre, y E. Fidalgo. 2023b. MeWEHV: Mel and wave embeddings for human voice tasks. IEEE Access, 11:80089–80104.

Cho, K., B. van Merrienboer, D. Bahdanau, y Y. Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. En D. Wu, M. Carpuat, X. Carreras, y E. M. Vecchi, editores, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, páginas 103–111. Association for Computational Linguistics.

Chollet, F. 2017. Xception: Deep learning with depthwise separable convolutions. En IEEE Conference on Computer Vision and Pattern Recognition, páginas 1800–1807. IEEE Computer Society.

Guevara-Rukoz, A., I. Demirsahin, F. He, S. C. Chu, S. Sarin, K. Pipatsrisawat, A. Gutkin, A. Butryna, y O. Kjartansson. 2020. Crowdsourcing Latin American Spanish for low-resource text-to-speech. En N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, y S. Piperidis, editores, Proceedings of The 12th Language Resources and Evaluation Conference, LREC 2020, páginas 6504–6513. European Language Resources Association.

Hu, J., L. Shen, S. Albanie, G. Sun, y E. Wu. 2020. Squeeze-and-excitation networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(8):2011–2023.

Selvaraju, R. R., M. Cogswell, A. Das, R. Vedantam, D. Parikh, y D. Batra. 2020. Grad-CAM: Visual explanations from deep networks via gradient-based localization. International Journal of Computer Vision, 128(2):336–359.

Veaux, C., J. Yamagishi, K. MacDonald, y others. 2017. Superseded-CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit.

