COREN: Multi-Modal Co-Occurrence Transformer Reasoning Network for Image-Text Retrieval.

Authors :: Wang, Yaodong
Ji, Zhong
Chen, Kexin
Pang, Yanwei
Zhang, Zhongfei
Source :: Neural Processing Letters; Oct 2023, Vol. 55 Issue 5, p5959-5978, 20p
Publication Year :: 2023
Abstract: Cross-modal image-text retrieval aims at retrieving the images according to the given query texts and vice versa, which is a challenging task due to the inherent heterogeneous gap between computer vision and natural language processing. Most previous methods mine the intra-modal interactions and inter-modal interactions independently, which may lead to a fragmented understanding of the visual-linguistic modalities. Different from them, in this paper, we address this challenge by proposing a unified multi-modal Co-Occurrence transformer Reasoning Network, dubbed as COREN, to comprehensively discover the semantic correlations of the two modalities. Specifically, we resort to a unified multi-modal transformer encoder to decompose the intra-modal and inter-modal co-occurrence relationships reasoning into a two-stage learning architecture. In the first learning stage, we utilize the multi-modal transformer as a shared siamese encoder for both visual and textual branch to reason the intra-modal co-occurrence relationships. In this way, we obtain modality-specific contextualized representations for each input image and text instance, and the model is equipped with the representation and reasoning ability of both visual and textual entities. In the second learning stage, we stack the visual and textual features together and jointly feed them into the same multi-modal transformer encoder to reason the inter-modal co-occurrence relationships between the two modalities. Additionally, we propose a novel Adaptive Similarity Aggregation (ASA) module to achieve a more accurate cross-modal similarity measurement based on the generated contextualized representations. The experimental results on benchmark datasets demonstrate the effectiveness and superiority of our proposed method. [ABSTRACT FROM AUTHOR]
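
The two-stage data flow described in the abstract can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not the authors' implementation: `self_attention` is a parameter-free, single-head stand-in for the shared multi-modal transformer encoder, `two_stage_encode` and `mean_pool_cosine` are hypothetical helper names, and the mean-pooled cosine score is a simple placeholder, since the record gives no details of the Adaptive Similarity Aggregation (ASA) module. The feature vectors are made-up values.

```python
import math


def self_attention(tokens):
    """Parameter-free single-head self-attention (toy stand-in for the encoder)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Scaled dot-product scores of this token against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
        z = sum(exps)
        weights = [e / z for e in exps]
        # Attention-weighted average of the token vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def two_stage_encode(regions, words):
    """Apply the same encoder per modality, then to the stacked sequence."""
    # Stage 1: intra-modal reasoning, one shared encoder used on each branch.
    vis = self_attention(regions)
    txt = self_attention(words)
    # Stage 2: inter-modal reasoning over the concatenated features.
    joint = self_attention(vis + txt)
    return joint[:len(vis)], joint[len(vis):]


def mean_pool_cosine(a, b):
    """Placeholder similarity (NOT the ASA module): cosine of mean-pooled tokens."""
    pa = [sum(col) / len(a) for col in zip(*a)]
    pb = [sum(col) / len(b) for col in zip(*b)]
    dot = sum(x * y for x, y in zip(pa, pb))
    norm = math.sqrt(sum(x * x for x in pa)) * math.sqrt(sum(y * y for y in pb))
    return dot / norm


# Hypothetical inputs: 3 image-region features and 2 word features, dimension 2.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[1.0, 0.0], [0.5, 0.5]]

img_ctx, txt_ctx = two_stage_encode(regions, words)
score = mean_pool_cosine(img_ctx, txt_ctx)
print(score)
```

The point of the sketch is the reuse of one encoder: the same function processes each modality alone in the first stage and the stacked sequence in the second, which mirrors the "shared siamese encoder" wording of the abstract.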

Subjects :: TRANSFORMER models
COMPUTER vision
NATURAL language processing

Details

Language :: English
ISSN :: 1370-4621
Volume :: 55
Issue :: 5
Database :: Complementary Index
Journal :: Neural Processing Letters
Publication Type :: Academic Journal
Accession number :: 172445323
Full Text :: https://doi.org/10.1007/s11063-022-11121-z

