
Video Q&A based on two-stage deep exploration of temporally-evolving features with enhanced cross-modal attention mechanism.

Authors:
Luo, Yuanmao
Wang, Ruomei
Zhang, Fuwei
Zhou, Fan
Liu, Mingyang
Feng, Jiawei
Source:
Neural Computing & Applications. May 2024, Vol. 36 Issue 14, p8055-8071. 17p.
Publication Year:
2024

Abstract

Multi-modal attention learning in video question answering (VideoQA) is a challenging task, as it requires consideration of information recognition within modalities and information interaction and fusion between modalities. Existing methods employ the cross-attention mechanism to compute feature similarity between modalities, thereby aggregating relevant information in a shared space. However, heterogeneous features have different distributions in the shared space, making it difficult to match semantics directly, which may affect the similarity calculation. To address this issue, this paper proposes a novel enhanced cross-modal attention mechanism (ECAM) that pre-fuses the two modalities to generate an enhanced key carrying feature importance distributions, effectively resolving the semantic mismatch. Compared with the existing cross-attention mechanism, ECAM matches semantics between modalities more accurately and pays more attention to the relevant feature regions. In the multi-modal fusion phase, a two-stage fusion strategy is proposed to exploit the advantages of the two fusion methods and to deeply explore the complex and diverse dependency relationships between the multi-modal features. Collectively supported by these two newly designed modules, we propose a VideoQA solution based on two-stage deep exploration of temporally-evolving features with an enhanced cross-modal attention mechanism, which is able to tackle challenging semantic understanding and question answering tasks. Extensive experiments on four VideoQA datasets show that the new approach attains superior results in comparison with state-of-the-art peer methods. Moreover, experiments on the latest joint-task datasets demonstrate that ECAM is a general mechanism that can be easily adapted to other visual-linguistic tasks. [ABSTRACT FROM AUTHOR]
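
The abstract does not spell out the ECAM formulation. The following is a minimal illustrative sketch, in PyTorch, of the general idea it describes: a cross-attention block whose key is built from a pre-fusion of the text and video features rather than from the video features alone. All class and variable names, tensor shapes, and the mean-pooling used for pre-fusion are assumptions made for illustration, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class EnhancedCrossModalAttentionSketch(nn.Module):
    """Hypothetical ECAM-style block: the key is derived from a pre-fusion
    of the two modalities, so it carries a cross-modal importance signal."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)        # query from the question tokens
        self.k_proj = nn.Linear(2 * dim, dim)    # key from pre-fused video + text
        self.v_proj = nn.Linear(dim, dim)        # value from the video tokens
        self.scale = dim ** -0.5

    def forward(self, text: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # text:  (B, Lt, D) question token features
        # video: (B, Lv, D) frame/clip features
        # Pre-fuse the modalities: pair each video token with a pooled text
        # summary (assumed mean pooling) before forming the attention key.
        text_summary = text.mean(dim=1, keepdim=True).expand(-1, video.size(1), -1)
        fused = torch.cat([video, text_summary], dim=-1)               # (B, Lv, 2D)

        q = self.q_proj(text)                                          # (B, Lt, D)
        k = self.k_proj(fused)                                         # (B, Lv, D)
        v = self.v_proj(video)                                         # (B, Lv, D)

        attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)   # (B, Lt, Lv)
        return attn @ v                                                # question-conditioned video features


# Usage example with random features
ecam = EnhancedCrossModalAttentionSketch(dim=512)
out = ecam(torch.randn(2, 20, 512), torch.randn(2, 32, 512))
print(out.shape)  # torch.Size([2, 20, 512])

The only change from a standard cross-attention block is the input to the key projection; how the paper actually constructs the enhanced key and the two-stage fusion is described in the full text linked below.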

Subjects

Subjects:
*MODAL logic
*VIDEOS

Details

Language:
English
ISSN:
09410643
Volume:
36
Issue:
14
Database:
Academic Search Index
Journal:
Neural Computing & Applications
Publication Type:
Academic Journal
Accession number:
177776152
Full Text:
https://doi.org/10.1007/s00521-024-09482-8