
Multi-task hierarchical convolutional network for visual-semantic cross-modal retrieval.

Authors :
Ji, Zhong
Lin, Zhigang
Wang, Haoran
Pang, Yanwei
Li, Xuelong
Source :
Pattern Recognition. Jul 2024, Vol. 151.
Publication Year :
2024

Abstract

Bridging visual and textual representations plays a central role in understanding multimedia data. The main challenge arises from the fact that images and texts exist in heterogeneous spaces, which makes it difficult to preserve semantic consistency between the two modalities. To narrow the modality gap, most recent methods resort to extra object detectors or parsers to obtain hierarchical representations. In this work, we address this problem by introducing a Multi-Task Hierarchical Convolutional Network (MT-HCN), which mines hierarchical semantic information without the aid of any extra supervision. First, from the perspective of the representation architecture, we leverage the intrinsic hierarchical structure of Convolutional Neural Networks (CNNs) to decompose the representations of both modalities into two semantically complementary levels, i.e., exterior representations and concept representations. The former focuses on discovering fine-grained low-level associations between the modalities, while the latter captures more abstract high-level semantics. Specifically, we present a Self-Supervised Clustering (SSC) loss that preserves fine-grained semantic cues in the exterior representations; it is built by treating multiple image/text pairs with similar exteriors as a single category. In addition, a novel Harmonious Bidirectional Triplet Ranking (HBTR) loss is proposed, which mitigates the adverse effects of biased and noisy negative samples. Besides the hardest negatives, it also constrains the distance between positive pairs and the centroid of the negative pairs. Extensive experiments on two popular cross-modal retrieval benchmarks demonstrate that the proposed MT-HCN achieves competitive results compared with state-of-the-art methods.

• This paper proposes a novel Multi-Task Hierarchical Convolutional Network (MT-HCN) for visual-semantic cross-modal retrieval, which adopts a classification task to improve hierarchical multi-modal representation learning.
• This paper proposes a novel Self-Supervised Clustering (SSC) loss to learn exterior representations that fully exploit low-level fine-grained correlations for associating images and texts.
• This paper presents an effective bidirectional ranking loss, the Harmonious Bidirectional Triplet Ranking (HBTR) loss, for cross-modal correlation preservation. It not only helps seek out more representative hard negative samples, but also leverages the category center of the negatives to enhance the robustness of cross-modal representations.
• Extensive experiments on two benchmark datasets validate the superiority of the proposed model in comparison with state-of-the-art approaches. [ABSTRACT FROM AUTHOR]
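
As an illustration of the two-level decomposition described above, here is a minimal, hypothetical PyTorch sketch of an image encoder that taps an intermediate CNN stage for exterior features and the final stage for concept features. The ResNet-50 backbone, the layer3/layer4 split, and the embedding size are assumptions made for illustration, not the paper's confirmed design.

```python
import torch
import torchvision.models as models

class HierarchicalImageEncoder(torch.nn.Module):
    """Sketch: exterior features from an intermediate CNN stage,
    concept features from the final stage (backbone/split assumed)."""

    def __init__(self, embed_dim=1024):
        super().__init__()
        resnet = models.resnet50(weights=None)
        # Early stages up to layer3 -> low-level "exterior" features.
        self.stem = torch.nn.Sequential(
            resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
            resnet.layer1, resnet.layer2, resnet.layer3)
        # Final stage -> high-level "concept" features.
        self.top = resnet.layer4
        self.pool = torch.nn.AdaptiveAvgPool2d(1)
        self.ext_proj = torch.nn.Linear(1024, embed_dim)  # layer3 has 1024 channels
        self.con_proj = torch.nn.Linear(2048, embed_dim)  # layer4 has 2048 channels

    def forward(self, x):
        mid = self.stem(x)
        exterior = self.ext_proj(self.pool(mid).flatten(1))
        concept = self.con_proj(self.pool(self.top(mid)).flatten(1))
        return exterior, concept
```

A text encoder would expose the same two levels (e.g., from lower and upper layers of its own network) so that each level can be aligned across modalities.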
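
The SSC loss, as described, treats image/text pairs with similar exteriors as one pseudo-category and turns alignment into a classification task. A rough sketch under stated assumptions follows: the k-means step, the cluster count, and the shared linear classifier head are illustrative guesses, since the abstract gives no implementation details.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def ssc_loss(img_ext, txt_ext, classifier, num_clusters=100):
    """Hypothetical Self-Supervised Clustering (SSC) loss.

    img_ext, txt_ext: (N, D) exterior features of matched image/text pairs.
    classifier: a shared nn.Linear(D, num_clusters) head (an assumption).
    """
    # Cluster the image-side exterior features to obtain pseudo-labels:
    # one plausible reading of "pairs with similar exterior as a category".
    feats = img_ext.detach().cpu().numpy()
    labels = KMeans(n_clusters=num_clusters, n_init=10).fit_predict(feats)
    labels = torch.as_tensor(labels, dtype=torch.long, device=img_ext.device)

    # Both modalities must predict the shared pseudo-label of their pair,
    # which encourages fine-grained cross-modal agreement.
    return (F.cross_entropy(classifier(img_ext), labels)
            + F.cross_entropy(classifier(txt_ext), labels))
```

In practice the clustering would more plausibly be run over the whole training set at intervals rather than per batch, but the per-batch form keeps the sketch self-contained.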
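
Likewise, a hedged sketch of an HBTR-style objective: a bidirectional hardest-negative triplet term (in the style of VSE++) plus a term that keeps each positive pair above the centroid of its negatives, which is how the abstract describes the robustness mechanism. The margins and the equal weighting of the terms are assumptions.

```python
import torch
import torch.nn.functional as F

def hbtr_loss(img, txt, margin=0.2, centroid_margin=0.1):
    """Sketch of a harmonious bidirectional triplet ranking loss.

    img, txt: (N, D) embeddings where row i of each is a matched pair.
    margin/centroid_margin are illustrative values, not the paper's.
    """
    img = F.normalize(img, dim=1)
    txt = F.normalize(txt, dim=1)
    sim = img @ txt.t()          # (N, N) cosine similarity matrix
    pos = sim.diag()             # similarities of the matched pairs

    n = sim.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(eye, float('-inf'))

    # Hardest-negative triplet terms in both retrieval directions.
    hard_i2t = (margin - pos + neg.max(dim=1).values).clamp(min=0)
    hard_t2i = (margin - pos + neg.max(dim=0).values).clamp(min=0)

    # Centroid terms: the mean similarity to the negatives equals the
    # similarity to the centroid of the (normalized) negatives, so this
    # pushes each positive pair away from the negatives' center, softening
    # the influence of any single biased or noisy hard negative.
    cent_i2t = sim.masked_fill(eye, 0.0).sum(dim=1) / (n - 1)
    cent_t2i = sim.masked_fill(eye, 0.0).sum(dim=0) / (n - 1)
    soft_i2t = (centroid_margin - pos + cent_i2t).clamp(min=0)
    soft_t2i = (centroid_margin - pos + cent_t2i).clamp(min=0)

    return (hard_i2t + hard_t2i + soft_i2t + soft_t2i).mean()
```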

Details

Language :
English
ISSN :
0031-3203
Volume :
151
Database :
Academic Search Index
Journal :
Pattern Recognition
Publication Type :
Academic Journal
Accession Number :
176406953
Full Text :
https://doi.org/10.1016/j.patcog.2024.110398