COREN: Multi-Modal Co-Occurrence Transformer Reasoning Network for Image-Text Retrieval.
- Source :
- Neural Processing Letters; Oct 2023, Vol. 55 Issue 5, p5959-5978, 20p
- Publication Year :
- 2023
- Abstract :
- Cross-modal image-text retrieval aims at retrieving images according to given query texts and vice versa, which is a challenging task due to the inherent heterogeneous gap between computer vision and natural language processing. Most previous methods mine the intra-modal and inter-modal interactions independently, which may lead to a fragmented understanding of the visual-linguistic modalities. In contrast, in this paper we address this challenge by proposing a unified multi-modal Co-Occurrence transformer Reasoning Network, dubbed COREN, to comprehensively discover the semantic correlations of the two modalities. Specifically, we use a unified multi-modal transformer encoder and decompose the reasoning over intra-modal and inter-modal co-occurrence relationships into a two-stage learning architecture. In the first learning stage, we utilize the multi-modal transformer as a shared siamese encoder for both the visual and textual branches to reason about the intra-modal co-occurrence relationships. In this way, we obtain modality-specific contextualized representations for each input image and text instance, and the model is equipped with the representation and reasoning ability for both visual and textual entities. In the second learning stage, we stack the visual and textual features together and jointly feed them into the same multi-modal transformer encoder to reason about the inter-modal co-occurrence relationships between the two modalities. Additionally, we propose a novel Adaptive Similarity Aggregation (ASA) module to achieve a more accurate cross-modal similarity measurement based on the generated contextualized representations. The experimental results on benchmark datasets demonstrate the effectiveness and superiority of our proposed method. [ABSTRACT FROM AUTHOR]
- Subjects :
- TRANSFORMER models
- COMPUTER vision
- NATURAL language processing
- Language :
- English
- ISSN :
- 1370-4621
- Volume :
- 55
- Issue :
- 5
- Database :
- Complementary Index
- Journal :
- Neural Processing Letters
- Publication Type :
- Academic Journal
- Accession number :
- 172445323
- Full Text :
- https://doi.org/10.1007/s11063-022-11121-z
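
The two-stage scheme described in the abstract (one shared encoder applied first to each modality on its own, then to the stacked image and text features) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the record gives no implementation details, so `encode` is a toy, parameter-free self-attention layer rather than the paper's learned transformer, and mean-pooled cosine similarity is only a stand-in for the ASA module, whose design is not specified in the abstract. All function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def encode(tokens):
    """Toy shared encoder (assumption): one parameter-free self-attention
    layer with a residual connection. The real model uses learned weights."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Scaled dot-product scores of this token against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Attention-weighted average of all tokens.
        ctx = [sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)]
        out.append([qj + cj for qj, cj in zip(q, ctx)])
    return out


def mean_pool(tokens):
    """Average a list of token vectors into one vector."""
    n = len(tokens)
    return [sum(t[j] for t in tokens) / n for j in range(len(tokens[0]))]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def two_stage_similarity(regions, words):
    """Score an image (list of region vectors) against a text (list of word vectors)."""
    # Stage 1: the same encoder runs on each modality separately (intra-modal).
    img_ctx = encode(regions)
    txt_ctx = encode(words)
    # Stage 2: stack both sequences and run the same encoder jointly (inter-modal).
    joint = encode(img_ctx + txt_ctx)
    img_joint = joint[:len(regions)]
    txt_joint = joint[len(regions):]
    # Stand-in for ASA (assumption): cosine similarity of mean-pooled features.
    return cosine(mean_pool(img_joint), mean_pool(txt_joint))


if __name__ == "__main__":
    regions = [[1.0, 0.0], [0.9, 0.1]]   # made-up image region features
    matching_text = [[1.0, 0.1]]         # made-up text aligned with the image
    unrelated_text = [[0.0, 1.0]]        # made-up text pointing elsewhere
    print(two_stage_similarity(regions, matching_text))
    print(two_stage_similarity(regions, unrelated_text))
```

Reusing the single `encode` function in both stages mirrors the abstract's point that one multi-modal encoder serves both the intra-modal and the inter-modal reasoning steps; the toy feature vectors are invented and carry no experimental meaning.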