
Prediction of evoked expression from videos with temporal position fusion.

Authors: Huynh, Van Thong; Yang, Hyung-Jeong; Lee, Guee-Sang; Kim, Soo-Hyung
Source: Pattern Recognition Letters, Aug 2023, Vol. 172, p. 245-251. 7p.
Publication Year: 2023

Abstract

Highlights:

• A temporal convolutional network with temporal position fusion for estimating evoked expressions in video.
• Temporal position fusion addresses the inconsistency of time steps between training and testing.
• State-of-the-art result on the Evoked Expressions from Videos (EEV) dataset.
• Substantial improvements from temporal position fusion on the MediaEval 2018 dataset.

This paper introduces an approach for estimating evoked expression categories from videos with temporal position fusion. Models pre-trained on large-scale computer vision and audio datasets were used to extract deep representations for the timestamps in a video. A temporal convolutional network, rather than an RNN-like architecture, was applied to model temporal relationships because of its advantages in memory consumption and parallelism. Furthermore, to cope with noisy labels, the temporal position was fused with the learned deep features so that the network can still differentiate time steps after noisy-label steps are removed from the training set. This technique gives the system a considerable improvement over other methods. We conducted experiments on EEV, a large-scale dataset for evoked expressions from videos, and achieved a state-of-the-art Pearson correlation coefficient of 0.054. Further experiments on a subset of the LIRIS-ACCEDE dataset, the MediaEval 2018 benchmark, also demonstrated the effectiveness of our approach. [ABSTRACT FROM AUTHOR]
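As a concrete illustration of the fusion the abstract describes, the sketch below concatenates each time step's normalized position t/T with its pre-extracted deep feature vector and passes the result through a stack of dilated 1D convolutions. It is a minimal reconstruction from the abstract alone, not the authors' released code: the class name, layer sizes, the choice of concatenation as the fusion operator, and the default of 15 output channels (assumed to match the EEV expression categories) are all assumptions.

    # Minimal PyTorch sketch of "temporal position fusion + TCN",
    # reconstructed from the abstract; all names and sizes are assumptions.
    import torch
    import torch.nn as nn

    class TemporalPositionFusionTCN(nn.Module):
        def __init__(self, feat_dim: int, hidden: int = 256,
                     n_outputs: int = 15, n_layers: int = 4):
            super().__init__()
            layers = []
            in_ch = feat_dim + 1  # +1 channel for the fused temporal position
            for i in range(n_layers):
                dilation = 2 ** i  # exponentially growing receptive field
                layers += [
                    nn.Conv1d(in_ch, hidden, kernel_size=3,
                              padding=dilation, dilation=dilation),
                    nn.ReLU(),
                ]
                in_ch = hidden
            self.tcn = nn.Sequential(*layers)
            self.head = nn.Conv1d(hidden, n_outputs, kernel_size=1)

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            # feats: (batch, time, feat_dim) deep representations per timestamp,
            # e.g. from pre-trained vision/audio backbones.
            b, t, _ = feats.shape
            # Normalized temporal position in [0, 1]; it gives each time step a
            # distinguishable coordinate even after noisy-label steps are dropped.
            pos = torch.linspace(0.0, 1.0, t, device=feats.device)
            pos = pos.view(1, t, 1).expand(b, t, 1)
            x = torch.cat([feats, pos], dim=-1)   # fuse position with features
            x = x.transpose(1, 2)                 # (batch, channels, time)
            out = self.head(self.tcn(x))          # (batch, n_outputs, time)
            return out.transpose(1, 2)            # (batch, time, n_outputs)

    if __name__ == "__main__":
        model = TemporalPositionFusionTCN(feat_dim=2048)
        dummy = torch.randn(2, 60, 2048)  # 2 clips, 60 time steps
        print(model(dummy).shape)         # torch.Size([2, 60, 15])

Because the position channel is computed from the sequence the network actually sees, time steps remain distinguishable when noisy-label steps have been removed from training, which is the property the abstract credits for the improvement; the dilated convolutions give the memory and parallelism advantages cited over RNN-like architectures.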

Details

Language: English
ISSN: 0167-8655
Volume: 172
Database: Academic Search Index
Journal: Pattern Recognition Letters
Publication Type: Academic Journal
Accession Number: 169814888
Full Text: https://doi.org/10.1016/j.patrec.2023.07.002