Structure-CLIP: Enhance Multi-modal Language Representations with Structure Knowledge

Authors :: Huang, Yufeng
Tang, Jiji
Chen, Zhuo
Zhang, Rongsheng
Zhang, Xinfeng
Chen, Weijie
Zhao, Zeng
Lv, Tangjie
Hu, Zhipeng
Zhang, Wen
Publication Year :: 2023
Abstract: Large-scale vision-language pre-training has shown promising advances on various downstream tasks and achieved significant performance in multi-modal understanding and generation tasks. However, existing methods often perform poorly on image-text matching tasks that require a detailed semantic understanding of the text. Although some works have addressed this problem, they do not sufficiently exploit the structural knowledge present in sentences to enhance multi-modal language representations, which leads to poor performance. In this paper, we present an end-to-end framework, Structure-CLIP, which integrates latent detailed semantics from the text to enhance fine-grained semantic representations. Specifically, (1) we use scene graphs to pay more attention to detailed semantic learning in the text and to fully explore structured knowledge between fine-grained semantics, and (2) we utilize a knowledge-enhanced framework, with the help of the scene graph, to make full use of representations of structured knowledge. To verify the effectiveness of our proposed method, we pre-train our models with the aforementioned approach and conduct experiments on different downstream tasks. Numerical results show that Structure-CLIP can often achieve state-of-the-art performance on both the VG-Attribution and VG-Relation datasets. Extensive experiments show that its components are effective and its predictions are interpretable, which indicates that our proposed method can enhance detailed semantic representation well.
Work in progress

Subjects :: FOS: Computer and information sciences
Artificial Intelligence (cs.AI)
Computer Science - Computation and Language
Computer Science - Artificial Intelligence
Computation and Language (cs.CL)
Computer Science - Multimedia
Multimedia (cs.MM)

Details

Language :: English
Database :: OpenAIRE
Accession number :: edsair.doi.dedup.....c7892b10a958bef12592bcfd4100985a
