Metric Learning-Based Multimodal Audio-Visual Emotion Recognition.
- Source :
- IEEE MultiMedia; Jan/Mar 2020, Vol. 27 Issue 1, p37-48, 12p
- Publication Year :
- 2020
Abstract
- People express their emotions through multiple channels, such as visual and audio ones. Consequently, automatic emotion recognition can benefit significantly from multimodal learning. Although each modality exhibits unique characteristics, multimodal learning takes advantage of the complementary information that diverse modalities provide when measuring the same instance, resulting in an enhanced understanding of emotions. Yet these dependencies and relations are not fully exploited in audio–video emotion recognition. Furthermore, learning an effective metric across modalities is a crucial goal for many machine-learning applications. Therefore, in this article, we propose multimodal emotion recognition metric learning (MERML), learned jointly to obtain a discriminative score and a robust latent-space representation for both modalities. The learned metric is then used efficiently through a radial basis function (RBF) based support vector machine (SVM) kernel. The evaluation of our framework shows significant performance gains, improving on the state-of-the-art results on the eNTERFACE and CREMA-D datasets. [ABSTRACT FROM AUTHOR]
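- The abstract's core idea, plugging a learned metric into an RBF SVM kernel, can be illustrated in isolation. Below is a minimal sketch, assuming a generic Mahalanobis-style metric matrix `M`; the actual MERML training objective is not reproduced, and all names, data, and parameters here are illustrative, not the authors' implementation.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical learned metric matrix M. MERML learns it jointly across
# modalities; here M is a random positive semi-definite stand-in.
rng = np.random.default_rng(0)
d = 16
A = rng.normal(size=(d, d))
M = A @ A.T  # symmetric PSD, so it induces a valid Mahalanobis distance

def metric_rbf_kernel(X, Y, M=M, gamma=0.1):
    """RBF kernel computed under the Mahalanobis distance induced by M."""
    XM = X @ M
    # Squared Mahalanobis distances between every pair (x_i, y_j):
    # (x - y)^T M (x - y) = x^T M x + y^T M y - 2 x^T M y
    d2 = (np.einsum('ij,ij->i', XM, X)[:, None]
          + np.einsum('ij,ij->i', Y @ M, Y)[None, :]
          - 2 * XM @ Y.T)
    return np.exp(-gamma * np.clip(d2, 0, None))

# Toy fused audio-visual feature vectors and emotion labels (stand-ins).
X_train = rng.normal(size=(40, d))
y_train = rng.integers(0, 2, size=40)

clf = SVC(kernel=metric_rbf_kernel)  # scikit-learn accepts a callable kernel
clf.fit(X_train, y_train)
print(clf.predict(X_train[:5]))
```

- Passing a callable kernel to `SVC` is what lets the Mahalanobis-based RBF replace the default Euclidean one, so the SVM's decision boundary respects whatever geometry the learned metric encodes.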
Details
- Language :
- English
- ISSN :
- 1070-986X
- Volume :
- 27
- Issue :
- 1
- Database :
- Complementary Index
- Journal :
- IEEE MultiMedia
- Publication Type :
- Academic Journal
- Accession number :
- 142470898
- Full Text :
- https://doi.org/10.1109/MMUL.2019.2960219