
ACF-net: appearance-guided content filter network for video captioning.

Authors :
Li, Min
Liu, Dongmei
Liu, Chunsheng
Chang, Faliang
Wang, Wenqian
Wang, Bin
Source :
Multimedia Tools & Applications; Mar 2024, Vol. 83 Issue 10, p31103-31122, 20p
Publication Year :
2024

Abstract

Video captioning refers to the automatic generation of natural language sentences describing a given video. Two open problems remain for this task: how to effectively combine multimodal features to better represent video content, and how to extract useful features from complex visual and linguistic information to generate more detailed descriptions. Addressing these two difficulties together, we propose a video captioning method named ACF-Net (Appearance-guided Content Filter Network), which utilizes appearance information as a content filter to guide the network toward discriminative information in both motion and object features. Specifically, we propose a new multimodal fusion method to alleviate the problem of insufficient fusion of video information. Unlike previous feature fusion methods that directly concatenate features, our fusion mechanism selects relevant content information through content filters to form unified multimodal features. Moreover, a hierarchical decoder with temporal semantic aggregation is proposed, which dynamically aggregates visual and linguistic features while generating each word, focusing on the most relevant temporal and semantic information. Extensive experiments on two benchmark datasets, MSVD and MSR-VTT, demonstrate the effectiveness of the proposed method. [ABSTRACT FROM AUTHOR]
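The abstract describes appearance features acting as a "content filter" over motion and object features, in contrast to plain concatenation. The paper's exact formulation is not given here, so the following is only a minimal illustrative sketch under an assumption: that the content filter behaves like a sigmoid gate computed from appearance features, which weights projected motion and object features before they are combined. All names (`content_filter_fusion`, `W_gate`, `W_m`, `W_o`) and the gating form are hypothetical, not the authors' method.

```python
import numpy as np

# Hypothetical sketch of appearance-guided content filtering.
# Assumption (not from the paper): the filter is a sigmoid gate derived
# from appearance features that interpolates between projected motion
# and object features.

rng = np.random.default_rng(0)
d = 8  # illustrative feature dimension

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Stand-ins for learned projection matrices (random here, trained in practice).
W_gate = rng.standard_normal((d, d)) * 0.1
W_m = rng.standard_normal((d, d)) * 0.1
W_o = rng.standard_normal((d, d)) * 0.1

def content_filter_fusion(appearance, motion, obj):
    """Fuse motion and object features under an appearance-derived gate."""
    gate = sigmoid(appearance @ W_gate)                 # values in (0, 1)
    return gate * (motion @ W_m) + (1.0 - gate) * (obj @ W_o)

a = rng.standard_normal(d)   # appearance feature
m = rng.standard_normal(d)   # motion feature
o = rng.standard_normal(d)   # object feature
fused = content_filter_fusion(a, m, o)
print(fused.shape)  # (8,)
```

The point of the sketch is only the contrast with concatenation: instead of stacking `[a, m, o]` into one long vector, the appearance feature decides, per dimension, how much of the motion versus object content enters the unified representation.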

Details

Language :
English
ISSN :
1380-7501
Volume :
83
Issue :
10
Database :
Complementary Index
Journal :
Multimedia Tools & Applications
Publication Type :
Academic Journal
Accession number :
175896956
Full Text :
https://doi.org/10.1007/s11042-023-16580-7