Speech Emotion Recognition with Distilled Prosodic and Linguistic Affect Representations

Shome, Debaditya; Etemad, Ali

arXiv:2309.04849 (cs)

[Submitted on 9 Sep 2023 (v1), last revised 14 Mar 2024 (this version, v2)]

Abstract: We propose EmoDistill, a novel speech emotion recognition (SER) framework that leverages cross-modal knowledge distillation during training to learn strong linguistic and prosodic representations of emotion from speech. During inference, our method uses only a stream of speech signals to perform unimodal SER, thus reducing computational overhead and avoiding run-time transcription and prosodic feature extraction errors. During training, our method distills information at both the embedding and logit levels from a pair of pre-trained Prosodic and Linguistic teachers that are fine-tuned for SER. Experiments on the IEMOCAP benchmark demonstrate that our method outperforms other unimodal and multimodal techniques by a considerable margin, and achieves state-of-the-art performance of 77.49% unweighted accuracy and 78.91% weighted accuracy. Detailed ablation studies demonstrate the impact of each component of our method.
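
As an illustration of what distilling "at both embedding and logit levels" from two teachers could look like, here is a minimal pure-Python sketch. The abstract does not specify the loss functions, so everything below is an assumption: the logit-level term is a temperature-softened KL divergence (a common knowledge distillation choice), the embedding-level term is a cosine distance, and the names `logit_kd_loss`, `embedding_kd_loss`, `total_loss`, `alpha`, and `beta` are hypothetical. It should not be read as the authors' implementation.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def logit_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Assumed logit-level term: KL(teacher || student) on softened outputs."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2  # conventional T^2 scaling


def embedding_kd_loss(student_emb, teacher_emb):
    """Assumed embedding-level term: 1 - cosine similarity."""
    dot = sum(s * t for s, t in zip(student_emb, teacher_emb))
    norm_s = math.sqrt(sum(s * s for s in student_emb))
    norm_t = math.sqrt(sum(t * t for t in teacher_emb))
    return 1.0 - dot / (norm_s * norm_t)


def cross_entropy(logits, label):
    """Supervised loss on the ground-truth emotion label."""
    return -math.log(softmax(logits)[label])


def total_loss(student_logits, label, student_embs, teachers,
               alpha=1.0, beta=1.0):
    """Supervised loss plus distillation terms from each teacher.

    teachers: list of (teacher_logits, teacher_embedding) pairs, e.g. one
    prosodic and one linguistic teacher.
    student_embs: one student embedding per teacher, assumed to be already
    projected to that teacher's embedding dimension.
    alpha, beta: hypothetical weights for the logit and embedding terms.
    """
    loss = cross_entropy(student_logits, label)
    for (t_logits, t_emb), s_emb in zip(teachers, student_embs):
        loss += alpha * logit_kd_loss(student_logits, t_logits)
        loss += beta * embedding_kd_loss(s_emb, t_emb)
    return loss


if __name__ == "__main__":
    student_logits = [1.2, 0.3, -0.5, 0.1]  # four emotion classes (toy values)
    prosodic = ([1.0, 0.5, -0.4, 0.0], [0.2, 0.9, -0.1])
    linguistic = ([1.5, 0.1, -0.9, 0.2], [0.7, -0.3, 0.4])
    student_embs = [[0.1, 0.8, 0.0], [0.6, -0.2, 0.5]]
    print(total_loss(student_logits, 0, student_embs, [prosodic, linguistic]))
```

In this sketch the teachers appear only inside the training loss, which matches the abstract's point that inference needs the speech-only student alone.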

Comments:	Accepted at ICASSP 2024
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2309.04849 [cs.CL]
	(or arXiv:2309.04849v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2309.04849

Submission history

From: Debaditya Shome
[v1] Sat, 9 Sep 2023 17:30:35 UTC (185 KB)
[v2] Thu, 14 Mar 2024 21:46:37 UTC (457 KB)
