
WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training

Authors :
Huo, Yuqi
Zhang, Manli
Liu, Guangzhen
Lu, Haoyu
Gao, Yizhao
Yang, Guoxing
Wen, Jingyuan
Zhang, Heng
Xu, Baogui
Zheng, Weihao
Xi, Zongzheng
Yang, Yueqian
Hu, Anwen
Zhao, Jinming
Li, Ruichen
Zhao, Yida
Zhang, Liang
Song, Yuqing
Hong, Xin
Cui, Wanqing
Hou, Danyang
Li, Yingyan
Li, Junyi
Liu, Peiyu
Gong, Zheng
Jin, Chuhao
Sun, Yuchong
Chen, Shizhe
Lu, Zhiwu
Dou, Zhicheng
Jin, Qin
Lan, Yanyan
Zhao, Wayne Xin
Song, Ruihua
Wen, Ji-Rong
Publication Year :
2021

Abstract

Multi-modal pre-training models have been intensively explored to bridge vision and language in recent years. However, most of them explicitly model the cross-modal interaction between image-text pairs by assuming that a strong semantic correlation exists between the text and image modalities. Since this strong assumption is often invalid in real-world scenarios, we choose to implicitly model the cross-modal correlation for large-scale multi-modal pre-training, which is the focus of the Chinese project 'WenLan' led by our team. Specifically, under the weak correlation assumption over image-text pairs, we propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework. Unlike OpenAI CLIP, which adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the latest method, MoCo, to the cross-modal scenario. By building a large queue-based dictionary, our BriVL can incorporate more negative samples with limited GPU resources. We further construct a large Chinese multi-source image-text dataset called RUC-CAS-WenLan for pre-training our BriVL model. Extensive experiments demonstrate that the pre-trained BriVL model outperforms both UNITER and OpenAI CLIP on various downstream tasks.

Comment: This paper is the outcome of the Chinese multi-modal pre-training project called 'WenLan'.
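The abstract's key algorithmic idea, adapting MoCo's queue-based dictionary to cross-modal contrastive learning, can be illustrated with a short sketch. Below is a minimal PyTorch example of an image-to-text InfoNCE loss with queued text negatives; the function name cross_modal_info_nce, the dimensions, and the temperature value are assumptions for illustration, not the authors' actual BriVL implementation.

# Minimal sketch of a cross-modal MoCo-style contrastive objective with a
# memory queue of negatives, as described at a high level in the abstract.
# All names, dimensions, and hyperparameters are illustrative assumptions,
# not the authors' BriVL code.
import torch
import torch.nn.functional as F

def cross_modal_info_nce(img_emb, txt_emb, txt_queue, temperature=0.07):
    """InfoNCE loss for image-to-text matching with extra queued negatives.

    img_emb:   (B, D) image-tower embeddings for the current batch
    txt_emb:   (B, D) text-tower embeddings for the matching captions
    txt_queue: (K, D) text embeddings from previous batches (negatives)
    """
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    txt_queue = F.normalize(txt_queue, dim=1)

    # Positive logits: cosine similarity of each image with its paired text.
    l_pos = (img_emb * txt_emb).sum(dim=1, keepdim=True)      # (B, 1)
    # Negative logits: similarity with every queued text embedding.
    l_neg = img_emb @ txt_queue.t()                           # (B, K)

    logits = torch.cat([l_pos, l_neg], dim=1) / temperature   # (B, 1+K)
    # The positive is always at index 0 of each row.
    labels = torch.zeros(img_emb.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)

# Toy usage: the queue size K is typically much larger than the batch size B,
# which is the point of the queue-based dictionary.
B, D, K = 8, 128, 4096
loss = cross_modal_info_nce(torch.randn(B, D), torch.randn(B, D), torch.randn(K, D))
print(loss.item())

A full MoCo-style recipe would presumably add a symmetric text-to-image term and maintain the queue with a momentum-updated encoder; the queue is what lets the effective number of negatives K greatly exceed the batch size, which is how the abstract's claim of incorporating more negatives with limited GPU resources is realized.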

Details

Database :
OAIster
Publication Type :
Electronic Resource
Accession number :
edsoai.on1269534538
Document Type :
Electronic Resource