Hierarchical cross-modal contextual attention network for visual grounding.
- Source :
- Multimedia Systems. Aug 2023, Vol. 29, Issue 4, p2073-2083. 11p.
- Publication Year :
- 2023
Abstract
- This paper explores the task of visual grounding (VG), which aims to localize image regions referred to by a sentence query. VG has advanced significantly with Transformer-based frameworks, which capture image and text context without region proposals. However, previous research has rarely explored hierarchical semantics or the cross-interactions between the two uni-modal encoders. This paper therefore proposes a Hierarchical Cross-modal Contextual Attention Network (HCCAN) for the VG task. HCCAN combines a visual-guided text contextual attention module, a text-guided visual contextual attention module, and a Transformer-based multi-modal feature fusion module. This design not only captures intra-modality and inter-modality relationships through self-attention mechanisms but also models the hierarchical semantics of textual and visual content in a common space. Experiments on four standard benchmarks, Flickr30K Entities, RefCOCO, RefCOCO+, and RefCOCOg, demonstrate the effectiveness of the proposed method. The code is publicly available at https://www.github.com/cutexin66/HCCAN. [ABSTRACT FROM AUTHOR]
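- To make the abstract's description concrete, below is a minimal PyTorch sketch of the general pattern it describes: one modality's tokens act as queries over the other (inter-modality cross-attention), followed by self-attention over the fused result (intra-modality). This is an illustrative reading of "visual-guided text" and "text-guided visual" contextual attention, not the authors' released implementation (see the linked repository for that); the class name, dimensions, and layer layout are assumptions.

```python
import torch
import torch.nn as nn

class CrossModalContextualAttention(nn.Module):
    """Hypothetical cross-modal block: the guiding modality queries the
    other modality, then self-attention refines the fused features.
    A sketch of the pattern in the abstract, not HCCAN's actual code."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, guide: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # Inter-modality: guiding tokens attend to the other modality.
        ctx, _ = self.cross_attn(query=guide, key=target, value=target)
        x = self.norm1(guide + ctx)
        # Intra-modality: self-attention over the fused representation.
        refined, _ = self.self_attn(x, x, x)
        return self.norm2(x + refined)

# Illustrative usage with made-up shapes: (batch, tokens, dim).
text = torch.randn(2, 20, 256)     # word features
vision = torch.randn(2, 400, 256)  # patch/region features
visual_guided_text = CrossModalContextualAttention()(text, vision)
text_guided_visual = CrossModalContextualAttention()(vision, text)
```

- Running both directions, as above, yields text features contextualized by the image and visual features contextualized by the query, which a fusion Transformer could then combine; how HCCAN actually stacks and fuses these modules is specified only in the paper and repository.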
- Subjects :
- *DEEP learning
- Language :
- English
- ISSN :
- 0942-4962
- Volume :
- 29
- Issue :
- 4
- Database :
- Academic Search Index
- Journal :
- Multimedia Systems
- Publication Type :
- Academic Journal
- Accession number :
- 164947959
- Full Text :
- https://doi.org/10.1007/s00530-023-01097-8