Bidirectional interactive alignment network for image captioning.

Authors :: Cao, Xinrong
Yan, Peixin
Hu, Rong
Li, Zuoyong
Source :: Multimedia Systems; Dec2024, Vol. 30 Issue 6, p1-13, 13p
Publication Year :: 2024
Abstract: In recent years, many researchers have improved image captioning performance by fusing region features and grid features. However, the semantic gap between the two feature types is often overlooked during fusion, and multimodal feature interaction remains underexplored. In this paper, we propose a Bidirectional Interactive Alignment Network (BIANet) that performs multi-feature and multimodal fusion in both the encoder and the decoder. We propose a bidirectional interactive encoder that uses cross-interaction so that the two image features complement each other, enriching their visual information. In the decoder, we propose a cross-alignment module in which the text features interact with the image features in two orders, “region feature-grid feature” and “grid feature-region feature”, yielding two new text features. Increasing the similarity between these two text features indirectly narrows the semantic gap between region features and grid features. Extensive experiments on the MS COCO dataset demonstrate that the proposed model achieves competitive results on the Karpathy test split. [ABSTRACT FROM AUTHOR]
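
The cross-alignment idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the attention form, any learned projections, or the similarity measure, so the sketch assumes plain scaled dot-product attention with a residual connection and a cosine-similarity loss. The names `attend`, `cross_align`, and `alignment_loss` and the toy feature values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, memory):
    """Scaled dot-product attention with a residual connection.

    Assumption: `memory` serves as both keys and values, and no learned
    projections are applied. The abstract does not describe these details.
    """
    dim = len(memory[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in memory]
        weights = softmax(scores)
        context = [sum(w * v[i] for w, v in zip(weights, memory)) for i in range(dim)]
        outputs.append([qi + ci for qi, ci in zip(q, context)])  # residual add
    return outputs


def cross_align(text, region, grid):
    """Let text features attend to the two image features in both orders."""
    text_rg = attend(attend(text, region), grid)   # region first, then grid
    text_gr = attend(attend(text, grid), region)   # grid first, then region
    return text_rg, text_gr


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def alignment_loss(text_rg, text_gr):
    """Assumed loss: 1 minus the mean cosine similarity of paired text features."""
    sims = [cosine(u, v) for u, v in zip(text_rg, text_gr)]
    return 1.0 - sum(sims) / len(sims)


# Toy inputs (made-up values): 2 text tokens, 3 region features, 2 grid features.
text = [[1.0, 0.0, 0.5], [0.2, 0.9, 0.1]]
region = [[0.9, 0.1, 0.4], [0.1, 0.8, 0.3], [0.5, 0.5, 0.5]]
grid = [[0.7, 0.2, 0.6], [0.3, 0.7, 0.2]]

text_rg, text_gr = cross_align(text, region, grid)
loss = alignment_loss(text_rg, text_gr)
print(f"alignment loss: {loss:.4f}")
```

Minimizing such a loss during training would push the two text features toward each other, which is one plausible reading of how the semantic gap between region and grid features is "indirectly alleviated"; the actual objective used by BIANet may differ.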
