Abstract: This is the summary of the Ph.D. thesis conducted by Roberto Andrés
Carofilis Vasco, under the supervision of Prof. Enrique Alegre Gutiérrez and Prof.
Laura Fernández Robles at the University of León. The thesis defense took place in
León, Spain, on December 20, 2023, in the presence of a committee formed by Dr.
Luis Fernando D’Haro (Polytechnic University of Madrid, Spain), Dr. Kenneth P.
Camilleri (University of Malta, Malta), and Dr. Victor González Castro (University
of León, Spain). The thesis received international mention following a 3-month stay
at the Idiap Research Institute in Switzerland, under the supervision of Dr. Petr
Motlicek. The thesis was awarded the outstanding cum laude distinction.
Keywords: Speech processing, language identification, accent identification, speaker identification.
ISSN 1135-5948 DOI 10.26342/2025-74-28 ©2025 Sociedad Española para el Procesamiento del Lenguaje Natural
Roberto Andrés Carofilis Vasco
In this thesis, we propose new techniques, models, and datasets to address the described speech processing tasks. These contributions facilitate the creation of systems that reach state-of-the-art performance while requiring a relatively small amount of data and computational resources to train. Motivated by our collaboration with the European project “Global Response Against Child Exploitation” (GRACE), we focus on the creation of applications that are useful for law enforcement agencies in their fight against cybercrime and child sexual exploitation.

Several contributions presented in this thesis will be used by Europol and the national law enforcement agencies of the European Union countries. The objective of the GRACE project is the creation of tools that allow the monitoring and generation of automatic alerts in cases of possible risk involving minors.

Among the proposals of this thesis are new systems capable of achieving competitive results even though they have been trained with a limited amount of data. In addition, the thesis presents two new models that can be trained with limited computational resources while achieving results superior to those of other state-of-the-art models. We also include a new highly balanced dataset and the experimental setup used in all the experiments, to allow reproducibility of the results and to make them comparable with future tools.

2 Thesis Overview
This thesis is composed of 6 chapters, which are described below:

Chapter 1 presents the objectives and motivations, and introduces the contributions of this thesis.

Chapter 2 contains a detailed review of state-of-the-art approaches related to language identification, accent identification, and speaker identification tasks, and related work on the proposed contributions. We also mention the main limitations of the methods reviewed and possible improvements that can be applied.

In Chapter 3, entitled “Improvement of accent classification models through Grad-Transfer from Spectrograms and Gradient-weighted Class Activation Mapping”, we present Grad-Transfer, a novel descriptor based on the concatenation of a flattened spectrogram and the dimensionality-reduced heatmaps generated with the Grad-CAM interpretability method (Selvaraju et al., 2020). This descriptor transfers knowledge extracted from a Convolutional Neural Network (CNN) specialized in accent identification to a Classical Machine Learning Algorithm (CMLA), improving the results of the CMLA by enriching the data it receives as input.

We used Grad-Transfer for the classification of native English accents and compared it with the results achieved by CMLA and state-of-the-art deep learning models fed only by spectrograms. Grad-Transfer is especially useful in data-poor tasks, where CMLA may give better results than larger models.

The description of the pipeline and the experimental results achieved were published in the IEEE/ACM Transactions on Audio, Speech, and Language Processing journal (Carofilis et al., 2023a).

Chapter 4, entitled “MeWEHV: Mel and Wave Embeddings for Human Voice Tasks”, presents a novel embedding enrichment procedure that combines the outputs of two concatenated models as independent branches of the same model: on the one hand, a branch with an embedding generation model fed by raw audio waves, called wave encoder, and, on the other hand, a branch with a CNN fed by MFCCs of the raw audio, called MFCC encoder.

We designed an architecture, named MeWEHV, capable of interacting with the two branches through a set of layers, including LSTM layers and attention mechanisms, combining the information extracted from both representations. MeWEHV was tested on the language identification, accent identification, and speaker identification tasks.

We empirically evaluated the hypothesis that there is a complementarity between the embeddings of the wave encoder, which are a non-imposed representation of the acoustic information, and the embeddings of the MFCC encoder, which are generated from MFCCs and are therefore an imposed representation.

We presented a new speaker identification dataset, named YouSpeakers204, which is highly balanced in terms of speaker accent and gender. We compared the MeWEHV model with six state-of-the-art models on the proposed tasks using nine datasets, including
Deep learning applied to speech processing: Development of novel models and techniques
tion task, providing benchmark results with the systems we designed and an experimental setup made publicly available for reproducibility and future research.

SaEENet model architecture. We proposed SaEENet, a novel model architecture that achieves competitive results in speaker, language, and accent identification tasks. For the first time in the literature, we introduced the use of squeeze-and-excitation blocks to weight and filter compressed information in embeddings generated from audio clips.

Squeeze-and-excitation variants evaluation. We evaluated three variants of squeeze-and-excitation blocks and reported which variants work best for weighting embeddings of state-of-the-art models trained with self-supervised learning, and feature maps generated by a CNN.

State-of-the-art performance. We successfully outperformed the results of the MeWEHV model and other state-of-the-art models using the SaEENet architecture in the tasks of speaker identification, language identification, and accent identification. Among the other novelties of SaEENet are the use of depthwise separable convolution layers and GRU layers, reducing the number of trainable parameters.

Real-world application. We applied the models and techniques developed in this work to real-world scenarios, focusing specifically on extracting speaker information to identify offenders and victims. This work contributes to the efforts of the GRACE project to leverage machine learning techniques to combat child sexual exploitation.

Acknowledgements
This work was supported in part by the European Union’s Horizon 2020 Research and Innovation Framework Programme under the Global Response Against Child Exploitation (GRACE) Project under Grant 883341; in part by the Predoctoral Grant of the Junta de Castilla y León, under Grant EDU/875/2021; and in part by the framework agreement between the University of León and Spanish National Cybersecurity Institute (INCIBE) under Addendum 01.

References
[...] through Grad-Transfer from spectrograms and gradient-weighted class activation mapping. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:2859–2871.

Carofilis, A., L. Fernández-Robles, E. Alegre, and E. Fidalgo. 2023b. MeWEHV: Mel and wave embeddings for human voice tasks. IEEE Access, 11:80089–80104.

Cho, K., B. van Merrienboer, D. Bahdanau, and Y. Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. In D. Wu, M. Carpuat, X. Carreras, and E. M. Vecchi, editors, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111. Association for Computational Linguistics.

Chollet, F. 2017. Xception: Deep learning with depthwise separable convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1800–1807. IEEE Computer Society.

Guevara-Rukoz, A., I. Demirsahin, F. He, S. C. Chu, S. Sarin, K. Pipatsrisawat, A. Gutkin, A. Butryna, and O. Kjartansson. 2020. Crowdsourcing Latin American Spanish for low-resource text-to-speech. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, and S. Piperidis, editors, Proceedings of The 12th Language Resources and Evaluation Conference, LREC 2020, pages 6504–6513. European Language Resources Association.

Hu, J., L. Shen, S. Albanie, G. Sun, and E. Wu. 2020. Squeeze-and-excitation networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(8):2011–2023.

Selvaraju, R. R., M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. 2020. Grad-CAM: Visual explanations from deep networks via gradient-based localization. International Journal of Computer Vision, 128(2):336–359.
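
As a rough illustration of the Grad-Transfer descriptor summarized for Chapter 3 (a flattened spectrogram concatenated with dimensionality-reduced Grad-CAM heatmaps), the following minimal Python sketch builds such a feature vector. It is not the thesis implementation: the summary does not state which dimensionality reduction is applied to the heatmaps, so average pooling is used here purely as an assumed stand-in, and the function names, toy values, and `pool_size` parameter are hypothetical.

```python
def flatten(matrix):
    """Turn a 2-D list (rows x columns) into a flat list of values."""
    return [value for row in matrix for value in row]


def average_pool(matrix, size):
    """Reduce a 2-D map by averaging non-overlapping size x size blocks.

    Assumption: stands in for the (unspecified) dimensionality reduction
    applied to the Grad-CAM heatmaps.
    """
    rows, cols = len(matrix), len(matrix[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            block = [matrix[r + i][c + j]
                     for i in range(size) for j in range(size)]
            pooled_row.append(sum(block) / len(block))
        pooled.append(pooled_row)
    return pooled


def grad_transfer_descriptor(spectrogram, heatmap, pool_size=2):
    """Concatenate the flattened spectrogram with the reduced heatmap."""
    return flatten(spectrogram) + flatten(average_pool(heatmap, pool_size))


# Toy 4x4 spectrogram and Grad-CAM heatmap (made-up values).
spectrogram = [[0.1, 0.2, 0.3, 0.4],
               [0.5, 0.6, 0.7, 0.8],
               [0.9, 1.0, 1.1, 1.2],
               [1.3, 1.4, 1.5, 1.6]]
heatmap = [[0.0, 0.0, 1.0, 1.0],
           [0.0, 0.0, 1.0, 1.0],
           [0.5, 0.5, 0.2, 0.2],
           [0.5, 0.5, 0.2, 0.2]]

descriptor = grad_transfer_descriptor(spectrogram, heatmap)
print(len(descriptor))  # 16 spectrogram values + 4 pooled heatmap values = 20
```

In a setup of this kind, the resulting vector would be the input of a classical machine learning classifier, so the heatmap part carries what the CNN considered relevant for the accent decision.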
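
The use of squeeze-and-excitation blocks to weight embeddings, mentioned among the SaEENet contributions, can be sketched in the same hedged way. The code below follows the standard formulation of Hu et al. (2020): squeeze by global averaging, excitation through a bottleneck with ReLU and a sigmoid gate, then channel-wise scaling. It does not reproduce any of the three variants evaluated in the thesis, which this summary does not detail; the sizes `T`, `C`, `R`, the random weights, and the helper names are assumptions for illustration only.

```python
import math
import random


def squeeze_excite(frames, w1, w2):
    """Reweight the channels of a sequence of embedding frames.

    frames: list of T frames, each a list of C floats.
    w1: R x C bottleneck weights; w2: C x R expansion weights.
    Returns the rescaled frames and the per-channel gates.
    """
    n_frames, n_channels = len(frames), len(frames[0])
    # Squeeze: average every channel over time, one value per channel.
    squeezed = [sum(frame[c] for frame in frames) / n_frames
                for c in range(n_channels)]
    # Excitation: bottleneck layer with ReLU, then expansion with sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: multiply each channel of each frame by its gate in (0, 1).
    scaled = [[x * g for x, g in zip(frame, gates)] for frame in frames]
    return scaled, gates


def random_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
T, C, R = 5, 8, 2  # frames, embedding channels, bottleneck size (hypothetical)
frames = [[rng.uniform(-1.0, 1.0) for _ in range(C)] for _ in range(T)]
w1 = random_matrix(R, C, rng)
w2 = random_matrix(C, R, rng)

scaled, gates = squeeze_excite(frames, w1, w2)
print(len(scaled), len(scaled[0]), [round(g, 3) for g in gates])
```

The gates act as a learned soft filter: channels with a gate close to 0 are suppressed and channels with a gate close to 1 pass almost unchanged, which matches the idea of weighting and filtering the information compressed in an embedding.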