1. Multimodal Emotion Recognition Using Contextualized Audio Information and Ground Transcripts on Multiple Datasets.
- Author
- Chauhan, Krishna; Sharma, Kamalesh Kumar; Varma, Tarun
- Subjects
- *EMOTION recognition, *CONVOLUTIONAL neural networks, *AMERICAN English language, *LANGUAGE models, *MOTION capture (Cinematography), *DEEP learning, *MOTION capture (Human mechanics)
- Abstract
The widespread applications of emotion recognition (ER) across various fields have recently attracted considerable attention from researchers. Consequently, an array of advanced techniques has emerged, driven by the need to enhance the accuracy and robustness of these recognition systems. Because emotional dialogue comprises both sound and spoken content, the proposed model encodes the information from audio and text sequences using two separate channels and merges them for emotion classification. The audio channel is encoded using a deep convolutional neural network with residual connections and further transformed using a self-attention-based multihead attention network called channel-wise global head pooling; unlike the vanilla multihead attention network, an adaptive global pooling is applied after concatenating all the heads. The text channel is encoded using a pre-trained BERT model. The proposed ER method is validated on four benchmark databases: the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus in English, the Berlin emotional speech dataset (EMO-DB) in German, the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) in North American English, and the Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D) in English. The classification accuracies on these corpora are 85.71%, 79.52%, 76.71%, and 73.91%, respectively. Furthermore, a cross-corpus analysis is presented to examine the variability of speech and text features and the robustness of the proposed architecture. [ABSTRACT FROM AUTHOR]
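The record describes the two-channel architecture but contains no implementation. Purely as an illustration, a minimal PyTorch sketch of such a design might look as follows; the layer counts, dimensions, the `bert-base-uncased` checkpoint, and the use of `nn.AdaptiveAvgPool1d` to stand in for the paper's "channel-wise global head pooling" step are all assumptions for illustration, not details confirmed by the paper.

```python
import torch
import torch.nn as nn
from transformers import BertModel


class ResidualConvBlock(nn.Module):
    """1-D convolutional block with a residual (skip) connection."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)  # residual connection


class AudioTextER(nn.Module):
    """Sketch of a two-channel ER model: CNN+attention audio branch, BERT text branch.

    All hyperparameters below are illustrative assumptions, not values from the paper.
    """

    def __init__(self, n_mels=64, d_model=256, n_heads=8, n_classes=4):
        super().__init__()
        # Audio channel: deep CNN with residual connections.
        self.proj = nn.Conv1d(n_mels, d_model, kernel_size=1)
        self.res_blocks = nn.Sequential(*[ResidualConvBlock(d_model) for _ in range(4)])
        # Multihead self-attention over time; per the abstract, the concatenated
        # heads are followed by an adaptive global pooling (approximating the
        # "channel-wise global head pooling" step).
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head_pool = nn.AdaptiveAvgPool1d(1)
        # Text channel: pre-trained BERT encoder (checkpoint choice is assumed).
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Fusion: merge both channel representations, then classify.
        self.classifier = nn.Linear(d_model + self.bert.config.hidden_size, n_classes)

    def forward(self, mel, input_ids, attention_mask):
        # mel: (batch, n_mels, time) log-mel spectrogram
        a = self.res_blocks(self.proj(mel))                  # (batch, d_model, time)
        a = a.transpose(1, 2)                                # (batch, time, d_model)
        a, _ = self.attn(a, a, a)                            # self-attention over frames
        a = self.head_pool(a.transpose(1, 2)).squeeze(-1)    # (batch, d_model)
        t = self.bert(input_ids=input_ids,
                      attention_mask=attention_mask).pooler_output  # (batch, 768)
        return self.classifier(torch.cat([a, t], dim=-1))


# Example forward pass with dummy inputs (shapes are illustrative).
model = AudioTextER()
mel = torch.randn(2, 64, 300)            # (batch, mel bins, frames)
ids = torch.randint(0, 30522, (2, 32))   # token ids
mask = torch.ones_like(ids)
logits = model(mel, ids, mask)           # (2, n_classes)
```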
- Published
- 2024