Exploring the potential of Wav2vec 2.0 for speech emotion recognition using classifier combination and attention-based feature fusion.
- Authors
- Nasersharif, Babak and Namvarpour, Mohammad
- Subjects
- EMOTION recognition; SPEECH; ATTENTION
- Abstract
Self-supervised learning models, such as Wav2vec 2.0, extract efficient features for speech processing applications including speech emotion recognition. In this study, we propose a Dimension Reduction Module (DRM) applied to the output of each transformer block in the Wav2vec 2.0 model. Our DRM consists of an attentive average pooling layer, a linear layer with a maxout activation function, and a linear layer that reduces the number of dimensions to the number of classes. We then propose two methods, classifier combination and feature fusion, to generate the final decision from the DRM outputs. In the classifier combination method, the output of each DRM is fed to a distinct Additive Angular Margin (AAM) Softmax loss function, constructing an individual classifier per DRM; the outputs of these classifiers are then combined using five different statistical methods. In the feature fusion method, the outputs of the DRMs are concatenated and an attention mechanism is applied to them; the attended outputs are fed to an AAM-Softmax loss function, which trains all DRMs together with the attention mechanism. The proposed models have been evaluated on the EMODB, IEMOCAP, and ShEMO datasets. Our best method, attention-based feature fusion, obtained unweighted accuracies of 94.80% on EMODB, 74.00% on IEMOCAP, and 80.60% on ShEMO, which are competitive with the best baseline methods.
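The abstract's DRM description (attentive average pooling, a maxout linear layer, then a projection to class logits) can be illustrated with a minimal NumPy sketch. All class and parameter names below are hypothetical, and the layer sizes and the use of two maxout pieces are assumptions, not details from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class DimensionReductionModule:
    """Hypothetical sketch of the paper's DRM: attentive average pooling
    over frames, a maxout linear layer, then a linear projection whose
    output dimension equals the number of emotion classes."""

    def __init__(self, dim, hidden, num_classes, pieces=2, seed=0):
        rng = np.random.default_rng(seed)
        self.attn_w = rng.standard_normal(dim)                # attention scoring vector
        self.W1 = rng.standard_normal((pieces, dim, hidden))  # maxout pieces (assumed: 2)
        self.b1 = np.zeros((pieces, hidden))
        self.W2 = rng.standard_normal((hidden, num_classes))
        self.b2 = np.zeros(num_classes)

    def __call__(self, h):
        # h: (T, dim) frame-level features from one Wav2vec 2.0 transformer block
        scores = softmax(h @ self.attn_w)          # attentive average pooling weights
        pooled = scores @ h                        # weighted average, shape (dim,)
        z = np.einsum('d,pdh->ph', pooled, self.W1) + self.b1
        z = z.max(axis=0)                          # maxout activation over pieces
        return z @ self.W2 + self.b2               # (num_classes,) logits
```

One such module would be attached to each transformer block; the per-block outputs are then either classified separately and combined, or concatenated for the attention-based fusion the abstract describes.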
- Published
- 2024