
Contrastive topic-enhanced network for video captioning.

Authors:
Zeng, Yawen
Wang, Yiru
Liao, Dongliang
Li, Gongfu
Xu, Jin
Man, Hong
Liu, Bo
Xu, Xiangmin
Source:
Expert Systems with Applications, Mar 2024, Vol. 237, Part B.
Publication Year:
2024

Abstract

In the field of video captioning, recent works usually focus on multi-modal video content understanding, in which transcripts extracted from speech are often adopted as an informational supplement. However, most existing works treat transcripts only as a supplementary modality, neglecting their potential for capturing high-level semantics such as multi-modal topics. In fact, transcripts, as a textual attribute derived from the video, reflect the same high-level topics as the video content. Nonetheless, how to resolve the heterogeneity of multi-modal topics remains under-investigated and worth exploring. In this paper, we introduce a contrastive topic-enhanced network to model heterogeneous topics consistently; that is, we inject an alignment module in advance to learn a comprehensive latent topic space and to guide caption generation. Specifically, our method comprises a local semantic alignment module and a global topic fusion module. In the local semantic alignment module, fine-grained semantic alignment at the clip-sentence granularity reduces the semantic gap between modalities. Extensive experiments have verified the effectiveness of our solution. [ABSTRACT FROM AUTHOR]
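To make the clip-sentence alignment concrete, the sketch below shows a generic symmetric InfoNCE-style contrastive loss between pooled clip embeddings and pooled transcript-sentence embeddings. This is a minimal illustration of the general technique the abstract names, not the paper's actual module: the function name, temperature value, and embedding shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def clip_sentence_contrastive_loss(clip_emb, sent_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning clip and sentence embeddings.

    clip_emb: (B, D) pooled video-clip features
    sent_emb: (B, D) pooled transcript-sentence features
    Matched clip-sentence pairs share the same batch index;
    all other pairs in the batch serve as negatives.
    (Temperature 0.07 is a common default, assumed here.)
    """
    clip_emb = F.normalize(clip_emb, dim=-1)
    sent_emb = F.normalize(sent_emb, dim=-1)
    logits = clip_emb @ sent_emb.t() / temperature  # (B, B) similarities
    targets = torch.arange(clip_emb.size(0), device=clip_emb.device)
    # Average the clip-to-sentence and sentence-to-clip retrieval losses.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

A loss of this form pulls each clip toward its matching transcript sentence in a shared latent space while pushing apart mismatched pairs, which is one standard way to reduce the cross-modal semantic gap that the local alignment module targets.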

Details

Language:
English
ISSN:
0957-4174
Volume:
237
Database:
Academic Search Index
Journal:
Expert Systems with Applications
Publication Type:
Academic Journal
Accession number:
173609361
Full Text:
https://doi.org/10.1016/j.eswa.2023.121601