
An Active Learning Paradigm for Online Audio-Visual Emotion Recognition.

Authors :
Kansizoglou, Ioannis
Bampis, Loukas
Gasteratos, Antonios
Source :
IEEE Transactions on Affective Computing; Apr-Jun 2022, Vol. 13 Issue 2, p756-768, 13p
Publication Year :
2022

Abstract

The advancement of Human-Robot Interaction (HRI) drives research into advanced emotion recognition architectures that exploit the audio-visual (A-V) modalities of human emotion. State-of-the-art methods in multi-modal emotion recognition mainly focus on classifying complete video sequences, so the resulting systems cannot operate online: they predict emotions only after a video has concluded, which restricts their applicability in practical scenarios. This article presents a novel paradigm for online emotion classification that exploits both the audio and the visual modality and produces a prediction as soon as the system is sufficiently confident. We propose two deep Convolutional Neural Network (CNN) models for extracting emotion features, one for each modality, and a Deep Neural Network (DNN) for their fusion. To capture the temporal nature of human emotion in interactive scenarios, we train in cascade a Long Short-Term Memory (LSTM) layer and a Reinforcement Learning (RL) agent that monitors the speaker, decides when to stop feature extraction, and makes the final prediction. A comparison of our results on two publicly available A-V emotional datasets, viz. RML and BAUM-1s, against other state-of-the-art models demonstrates the beneficial capabilities of our work. [ABSTRACT FROM AUTHOR]
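To make the described pipeline concrete, the sketch below outlines one plausible reading of the architecture in PyTorch: a CNN feature extractor per modality, a DNN fusion layer, an LSTM for temporal modelling, and a policy head standing in for the RL agent that decides when to stop and emit a prediction. All layer sizes, module names, and the threshold-based stopping rule are illustrative assumptions rather than the authors' configuration, and the RL training procedure (rewards, policy updates) is omitted.

```python
# Minimal sketch of an online A-V emotion pipeline, assuming PyTorch.
# Layer sizes, input shapes, and the confidence-threshold stopping rule
# are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn


class ModalityCNN(nn.Module):
    """Per-frame emotion-feature extractor for one modality (audio or visual)."""

    def __init__(self, in_channels: int, feat_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)


class OnlineAVEmotion(nn.Module):
    def __init__(self, feat_dim: int = 128, hidden: int = 256, n_classes: int = 6):
        super().__init__()
        self.audio_cnn = ModalityCNN(in_channels=1, feat_dim=feat_dim)   # e.g. spectrogram frames
        self.visual_cnn = ModalityCNN(in_channels=3, feat_dim=feat_dim)  # e.g. face crops
        self.fusion = nn.Sequential(nn.Linear(2 * feat_dim, hidden), nn.ReLU())  # DNN fusion
        self.lstm = nn.LSTMCell(hidden, hidden)                          # temporal modelling
        self.classifier = nn.Linear(hidden, n_classes)                   # emotion prediction
        self.policy = nn.Linear(hidden, 2)                               # agent head: [continue, stop]

    def forward(self, audio_frames, visual_frames, stop_threshold: float = 0.9):
        # audio_frames: (T, 1, H, W) spectrogram patches; visual_frames: (T, 3, H, W) face crops
        h = torch.zeros(1, self.lstm.hidden_size)
        c = torch.zeros(1, self.lstm.hidden_size)
        t = 0
        for t in range(audio_frames.size(0)):
            a = self.audio_cnn(audio_frames[t:t + 1])
            v = self.visual_cnn(visual_frames[t:t + 1])
            fused = self.fusion(torch.cat([a, v], dim=-1))
            h, c = self.lstm(fused, (h, c))
            p_stop = torch.softmax(self.policy(h), dim=-1)[0, 1]
            if p_stop > stop_threshold:  # confident enough: stop early and predict
                break
        return self.classifier(h), t


# Usage with dummy data (10 frames per modality):
model = OnlineAVEmotion()
logits, stopped_at = model(torch.randn(10, 1, 64, 64), torch.randn(10, 3, 64, 64))
```

The key design point this sketch illustrates is the online behaviour: the prediction is emitted at the first timestep where the stopping head is confident, rather than after the whole sequence has been processed.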

Details

Language :
English
ISSN :
1949-3045
Volume :
13
Issue :
2
Database :
Complementary Index
Journal :
IEEE Transactions on Affective Computing
Publication Type :
Academic Journal
Accession number :
157228768
Full Text :
https://doi.org/10.1109/TAFFC.2019.2961089