mCLIP: Multilingual CLIP via Cross-lingual Transfer

Authors :: Chen, Guanhua
Hou, Lu
Chen, Yun
Dai, Wenliang
Shang, Lifeng
Jiang, Xin
Liu, Qun
Pan, Jia
Wang, Wenping
Publication Year :: 2023
Abstract: Large-scale vision-language pretrained (VLP) models like CLIP have shown remarkable performance on various downstream cross-modal tasks. However, they are usually biased towards English due to the lack of sufficient non-English image-text pairs. Existing multilingual VLP methods often learn retrieval-inefficient single-stream models by translation-augmented non-English image-text pairs. In this paper, we introduce mCLIP, a retrieval-efficient dual-stream multilingual VLP model, trained by aligning the CLIP model and a Multilingual Text Encoder (MTE) through a novel Triangle Cross-modal Knowledge Distillation (TriKD) method. It is parameter-efficient as only two light projectors on top of them are updated during distillation. Furthermore, to enhance the token- and sentence-level multilingual representation of the MTE, we propose to train it with machine translation and contrastive learning jointly before the TriKD to provide a better initialization. Empirical results show that mCLIP achieves new state-of-the-art performance for both zero-shot and finetuned multilingual image-text retrieval tasks. © 2023 Association for Computational Linguistics.
