1. Bidirectional interactive alignment network for image captioning.
- Authors
Cao, Xinrong; Yan, Peixin; Hu, Rong; Li, Zuoyong
- Abstract
In recent years, many researchers have improved image captioning performance by fusing region features and grid features. However, the semantic gap between the two feature types is often overlooked during fusion, and multimodal feature interaction remains underexplored. In this paper, we propose a Bidirectional Interactive Alignment Network (BIANet) that deepens multi-feature and multimodal fusion in both the encoder and the decoder. We propose a bidirectional interactive encoder that uses cross-interaction so that the two types of image features complement each other's strengths, enriching their visual information. In the decoder, we propose a cross-alignment module that lets the text features interact with the visual features in two orders, "region feature then grid feature" and "grid feature then region feature", producing two new text features. By increasing the similarity between these two text features, the semantic gap between region features and grid features is indirectly alleviated. Extensive experiments on the MS COCO dataset demonstrate that the proposed model achieves competitive results on the Karpathy test split.
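The cross-alignment idea in the abstract can be sketched roughly as below. This is a minimal illustration under assumptions, not the authors' implementation: the class name CrossAlignment, the inputs text, region_feats, and grid_feats, the choice of multi-head cross-attention, and the cosine-similarity alignment loss are all inferred only from the description above.

```python
# Illustrative sketch of a two-order cross-alignment step (assumed design,
# not the paper's code): text features attend to region and grid features
# in both orders, and a similarity loss pulls the two results together.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossAlignment(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Cross-attention layers over region and grid features, shared by both orders.
        self.attn_region = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_grid = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, text, region_feats, grid_feats):
        # Order 1: "region feature then grid feature".
        t1, _ = self.attn_region(text, region_feats, region_feats)
        t1, _ = self.attn_grid(t1, grid_feats, grid_feats)
        # Order 2: "grid feature then region feature".
        t2, _ = self.attn_grid(text, grid_feats, grid_feats)
        t2, _ = self.attn_region(t2, region_feats, region_feats)
        return t1, t2

    @staticmethod
    def alignment_loss(t1, t2):
        # Encourage the two text features to agree (1 - cosine similarity),
        # which indirectly narrows the region/grid semantic gap.
        return (1.0 - F.cosine_similarity(t1, t2, dim=-1)).mean()


if __name__ == "__main__":
    model = CrossAlignment()
    text = torch.randn(2, 20, 512)          # (batch, caption length, d_model)
    region_feats = torch.randn(2, 36, 512)  # e.g. 36 detected regions
    grid_feats = torch.randn(2, 49, 512)    # e.g. 7x7 grid features
    t1, t2 = model(text, region_feats, grid_feats)
    print(t1.shape, t2.shape, model.alignment_loss(t1, t2).item())
```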
- Published
- 2024