1. MaXM: Towards Multilingual Visual Question Answering
- Author
Changpinyo, Soravit; Xue, Linting; Szpektor, Idan; Thapliyal, Ashish V.; Amelot, Julien; Yarom, Michal; Chen, Xi; Soricut, Radu
- Subjects
FOS: Computer and information sciences, Computer Science - Computation and Language, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computation and Language (cs.CL)
- Abstract
Visual Question Answering (VQA) has been primarily studied through the lens of the English language. Yet, tackling VQA in other languages in the same manner would require a considerable amount of resources. In this paper, we propose scalable solutions to multilingual visual question answering (mVQA), on both data and modeling fronts. We first propose a translation-based framework for mVQA data generation that requires much less human annotation effort than the conventional approach of directly collecting questions and answers. Then, we apply our framework to the multilingual captions in the Crossmodal-3600 dataset and develop an efficient annotation protocol to create MaXM, a test-only VQA benchmark in 7 diverse languages. Finally, we propose an approach to unified, extensible, open-ended, and end-to-end mVQA modeling and demonstrate strong performance in 13 languages. https://github.com/google-research-datasets/maxm
- Published
2022