Speech Emotion Recognition Using Multi-Scale Global–Local Representation Learning with Feature Pyramid Network

Authors :
Yuhua Wang
Jianxing Huang
Zhengdao Zhao
Haiyan Lan
Xinjia Zhang
Source :
Applied Sciences, Vol 14, Iss 24, p 11494 (2024)
Publication Year :
2024
Publisher :
MDPI AG, 2024.

Abstract

Speech emotion recognition (SER) is important in facilitating natural human–computer interaction. In speech sequence modeling, a key challenge is to learn context-aware sentence-level representations and the temporal dynamics of paralinguistic features so that emotional semantics can be interpreted unambiguously. Previous SER methods based on a single-scale cascade feature extraction module could not effectively preserve the temporal structure of speech signals in the deep layers, which degraded sequence modeling performance. To address these challenges, this paper proposes a novel multi-scale feature pyramid network. The enhanced multi-scale convolutional neural networks (MSCNNs) significantly improve the ability to extract multi-granular emotional features. Experimental results on the IEMOCAP corpus demonstrate the effectiveness of the proposed approach, which achieves a weighted accuracy (WA) of 71.79% and an unweighted accuracy (UA) of 73.39%. On the RAVDESS dataset, the model achieves a UA of 86.5%. These results validate the system's performance and highlight its competitive advantage.
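
The WA and UA figures reported in the abstract can be read as follows. This record does not define the two metrics, so the sketch below assumes the convention common in the SER literature: WA is overall accuracy across all utterances, and UA is the mean of the per-class recalls. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: fraction of all utterances classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so each emotion class counts equally."""
    total = defaultdict(int)  # utterances per true class
    hit = defaultdict(int)    # correctly classified utterances per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hit[t] += int(t == p)
    return sum(hit[c] / total[c] for c in total) / len(total)


# Toy, imbalanced label set (illustrative only, not data from the paper).
y_true = ["neutral"] * 6 + ["angry"] * 2 + ["sad"] * 2
y_pred = ["neutral"] * 6 + ["angry", "neutral"] + ["neutral", "neutral"]

wa = weighted_accuracy(y_true, y_pred)    # 7 of 10 utterances correct
ua = unweighted_accuracy(y_true, y_pred)  # (6/6 + 1/2 + 0/2) / 3
print(f"WA = {wa:.2f}, UA = {ua:.2f}")
```

Under this convention the two metrics diverge when classes are imbalanced: in the toy example the majority class dominates WA, while UA is pulled down by the minority classes that are classified poorly.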

Details

Language :
English
ISSN :
2076-3417
Volume :
14
Issue :
24
Database :
Directory of Open Access Journals
Journal :
Applied Sciences
Publication Type :
Academic Journal
Accession number :
edsdoj.8aae40bdcaa04fc8948306376979b90a
Document Type :
article
Full Text :
https://doi.org/10.3390/app142411494