1. Towards Learning a Joint Representation from Transformer in Multimodal Emotion Recognition
- Author
James J. Deng and Clement H. C. Leung
- Subjects
Facial expression, Modality (human–computer interaction), Computer science, ComputerApplications_MISCELLANEOUS, Speech recognition, Representation (systemics), Feature (machine learning), Construct (python library), Transfer of learning, Joint (audio engineering), Transformer (machine learning model)
- Abstract
Emotion recognition has been studied extensively for single modalities over the last decade. However, humans usually express their emotions through multiple modalities, such as voice, facial expressions, and text. This paper proposes a new method to learn a joint emotion representation for multimodal emotion recognition. Emotion-related features for speech audio are learned with an unsupervised triplet-loss objective, and a text-to-text transformer network is used to extract text embeddings that capture latent emotional meaning. Transfer learning provides a powerful and reusable technique for fine-tuning emotion recognition models pre-trained on large audio and text corpora, respectively. The emotional information extracted from speech audio and the text embeddings are processed by dedicated transformer networks. An alternating co-attention mechanism is used to construct a deep transformer network, and multimodal fusion is implemented by this deep co-attention transformer network. Experimental results show that the proposed method for learning a joint emotion representation achieves good performance in multimodal emotion recognition.
- Published
- 2021
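
The abstract describes fusing audio and text representations with an alternating co-attention transformer. The snippet below is a minimal sketch of such a fusion block in PyTorch, not the authors' implementation: the module names (`CoAttentionBlock`, `CoAttentionFusion`), the shared 256-dimensional feature space, the mean pooling, and the classification head are all illustrative assumptions.

```python
# Minimal sketch of alternating co-attention fusion between audio and text
# token features, built from standard PyTorch modules. All names, dimensions,
# and the pooling/classifier choices are assumptions for illustration only.
import torch
import torch.nn as nn


class CoAttentionBlock(nn.Module):
    """One alternating co-attention step: audio attends over text, then text
    attends over the updated audio, each with a residual connection + LayerNorm."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.audio_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, text: torch.Tensor):
        # Audio queries attend over text keys/values.
        a_ctx, _ = self.audio_attn(query=audio, key=text, value=text)
        audio = self.norm_a(audio + a_ctx)
        # Text queries attend over the updated audio representation.
        t_ctx, _ = self.text_attn(query=text, key=audio, value=audio)
        text = self.norm_t(text + t_ctx)
        return audio, text


class CoAttentionFusion(nn.Module):
    """Stack of co-attention blocks followed by mean pooling and a classifier head."""

    def __init__(self, dim: int = 256, depth: int = 2, num_classes: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList([CoAttentionBlock(dim) for _ in range(depth)])
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, audio: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            audio, text = block(audio, text)
        # Pool each modality over its sequence dimension and concatenate.
        fused = torch.cat([audio.mean(dim=1), text.mean(dim=1)], dim=-1)
        return self.classifier(fused)


if __name__ == "__main__":
    # Dummy pre-extracted embeddings: batch of 8, 50 audio frames and 32 text
    # tokens, both assumed to be projected upstream into a shared 256-dim space.
    audio_feats = torch.randn(8, 50, 256)
    text_feats = torch.randn(8, 32, 256)
    model = CoAttentionFusion(dim=256, depth=2, num_classes=4)
    logits = model(audio_feats, text_feats)
    print(logits.shape)  # torch.Size([8, 4])
```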