MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning
- Publication Year :
- 2022
Abstract
- Multimodal representation learning has shown promising improvements on various vision-language tasks. Most existing methods excel at building global-level alignment between vision and language while lacking effective fine-grained image-text interaction. In this paper, we propose a jointly masked multimodal modeling method to learn fine-grained multimodal representations. Our method performs joint masking on image-text input and integrates both implicit and explicit targets for the masked signals to recover. The implicit target provides a unified and debiased objective for vision and language, where the model predicts latent multimodal representations of the unmasked input. The explicit target further enriches the multimodal representations by recovering high-level and semantically meaningful information: momentum visual features of image patches and concepts of word tokens. Through such a masked modeling process, our model not only learns fine-grained multimodal interaction, but also avoids the semantic gap between high-level representations and low- or mid-level prediction targets (e.g., image pixels), thus producing semantically rich multimodal representations that perform well in both zero-shot and fine-tuned settings. Our pre-trained model (named MAMO) achieves state-of-the-art performance on various downstream vision-language tasks, including image-text retrieval, visual question answering, visual reasoning, and weakly-supervised visual grounding.
- Comment: SIGIR 2023, 10 pages
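
The masked-modeling recipe the abstract describes (joint masking, an implicit latent-prediction target shared by both modalities, and explicit targets of momentum visual features and word concepts) can be summarized in a short sketch. Everything below is an illustrative assumption based only on the abstract, not MAMO's released implementation: the module names (`student`, `momentum_teacher`, `momentum_vision_encoder`, `concept_head`), their call signatures, and the 40% mask ratio are all hypothetical.

```python
# Hypothetical sketch of the joint masked-modeling losses described in the
# abstract; module names, signatures, and the mask ratio are assumptions.
import torch
import torch.nn.functional as F

def mamo_masked_losses(student, momentum_teacher, momentum_vision_encoder,
                       concept_head, image_patches, text_tokens,
                       concept_labels, mask_ratio=0.4):
    """Return (implicit_loss, explicit_loss) for one image-text batch."""
    B, P, _ = image_patches.shape            # (batch, patches, patch_dim)
    _, T = text_tokens.shape                 # (batch, tokens)
    device = image_patches.device

    # 1) Jointly mask image patches and word tokens.
    img_mask = torch.rand(B, P, device=device) < mask_ratio   # (B, P) bool
    txt_mask = torch.rand(B, T, device=device) < mask_ratio   # (B, T) bool

    # Student encodes the masked pair; the momentum teacher encodes the
    # unmasked pair and supplies latent targets (no gradients flow into it).
    s_img, s_txt = student(image_patches, text_tokens, img_mask, txt_mask)
    with torch.no_grad():
        t_img, t_txt = momentum_teacher(image_patches, text_tokens)
        v_feats = momentum_vision_encoder(image_patches)       # (B, P, D)

    # 2) Implicit target: regress the teacher's latent multimodal
    #    representations at masked positions, a unified objective shared
    #    by vision and language.
    implicit_loss = (F.mse_loss(s_img[img_mask], t_img[img_mask]) +
                     F.mse_loss(s_txt[txt_mask], t_txt[txt_mask]))

    # 3) Explicit targets: recover high-level semantics, i.e. momentum
    #    visual features for masked patches and concept labels for masked
    #    tokens (modeled here as classification over a concept vocabulary).
    explicit_img = F.mse_loss(s_img[img_mask], v_feats[img_mask])
    explicit_txt = F.cross_entropy(concept_head(s_txt[txt_mask]),
                                   concept_labels[txt_mask])

    return implicit_loss, explicit_img + explicit_txt
```

In an actual pre-training loop these two losses would presumably be combined with standard vision-language alignment objectives, with the momentum modules updated as exponential moving averages of the student; those details are not specified in the abstract.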
Details
- Database :
- arXiv
- Publication Type :
- Report
- Accession number :
- edsarx.2210.04183
- Document Type :
- Working Paper
- Full Text :
- https://doi.org/10.1145/3539618.3591721