
Structure-CLIP: Enhance Multi-modal Language Representations with Structure Knowledge

Authors:
Huang, Yufeng
Tang, Jiji
Chen, Zhuo
Zhang, Rongsheng
Zhang, Xinfeng
Chen, Weijie
Zhao, Zeng
Lv, Tangjie
Hu, Zhipeng
Zhang, Wen
Publication Year:
2023

Abstract

Large-scale vision-language pre-training has shown promising advances on various downstream tasks and achieved significant performance in multi-modal understanding and generation. However, existing methods often perform poorly on image-text matching tasks that require a detailed semantic understanding of the text. Although some works have addressed this problem, they do not sufficiently exploit the structural knowledge present in sentences to enhance multi-modal language representations, which leads to poor performance. In this paper, we present Structure-CLIP, an end-to-end framework that integrates latent detailed semantics from the text to enhance fine-grained semantic representations. Specifically, (1) we use scene graphs to focus learning on the detailed semantics of the text and to fully explore the structured knowledge between fine-grained semantics, and (2) we utilize a knowledge-enhanced framework that leverages the scene graph to make full use of structured-knowledge representations. To verify the effectiveness of the proposed method, we pre-train our models with this approach and conduct experiments on different downstream tasks. Numerical results show that Structure-CLIP often achieves state-of-the-art performance on both the VG-Attribution and VG-Relation datasets. Extensive experiments show that its components are effective and its predictions are interpretable, demonstrating that our method effectively enhances detailed semantic representations.

Work in progress.
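The abstract gives no implementation details, but the knowledge-enhancement idea it describes can be sketched: (subject, relation, object) triples parsed from a caption's scene graph are embedded and fused with the sentence-level feature of a CLIP-style text encoder. The PyTorch module below is a minimal illustrative sketch, not the authors' released code; the triple-encoder architecture, the role embeddings, and the residual-addition fusion are all assumptions made for the example.

```python
import torch
import torch.nn as nn

class TripleKnowledgeEncoder(nn.Module):
    """Embeds scene-graph triples and fuses them with a text feature.

    Hypothetical module illustrating the knowledge-enhanced framework
    described in the abstract; sizes and fusion choice are assumptions.
    """

    def __init__(self, vocab_size: int, dim: int = 512, num_layers: int = 2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)
        # Role embeddings distinguish the subject / relation / object slots.
        self.role_emb = nn.Embedding(3, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.proj = nn.Linear(dim, dim)

    def forward(self, triples: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # triples: (batch, num_triples, 3) vocabulary ids for (subj, rel, obj)
        # text_feat: (batch, dim) sentence embedding from a CLIP-style encoder
        x = self.token_emb(triples)                      # (b, t, 3, dim)
        x = x + self.role_emb.weight                     # broadcast role over slots
        x = x.flatten(1, 2)                              # (b, t * 3, dim)
        knowledge = self.encoder(x).mean(dim=1)          # (b, dim) pooled triples
        fused = text_feat + self.proj(knowledge)         # residual fusion
        return fused / fused.norm(dim=-1, keepdim=True)  # unit-normalized feature

# Toy usage: random ids stand in for parsed triples such as
# ("dog", "chasing", "cat") mapped to vocabulary indices.
enc = TripleKnowledgeEncoder(vocab_size=10000)
triples = torch.randint(0, 10000, (4, 5, 3))  # 4 captions, 5 triples each
text_feat = torch.randn(4, 512)
out = enc(triples, text_feat)                 # (4, 512) knowledge-enhanced features
```

Under this sketch, image-text similarity would still be computed on the normalized fused feature, so the structured knowledge sharpens the text side without changing the contrastive training recipe; the actual Structure-CLIP design may differ in its details.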

Details

Language:
English
Database:
OpenAIRE
Accession number:
edsair.doi.dedup.....c7892b10a958bef12592bcfd4100985a