1. Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
- Author
Yang, Antoine; Miech, Antoine; Sivic, Josef; Laptev, Ivan; Schmid, Cordelia
- Affiliations
Models of visual object recognition and scene understanding (WILLOW), Département d'informatique - ENS Paris (DI-ENS), École normale supérieure - Paris (ENS-PSL), Université Paris sciences et lettres (PSL), Institut National de Recherche en Informatique et en Automatique (Inria), Centre National de la Recherche Scientifique (CNRS), Inria de Paris; DeepMind [London], DeepMind Technologies; Czech Institute of Informatics, Robotics and Cybernetics [Prague] (CIIRC), Czech Technical University in Prague (CTU)
- Funding
This work was granted access to the HPC resources of IDRIS under the allocation 2022-AD011011670R2 made by GENCI. The work was funded by a Google gift; the French government under the management of the Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR-19-P3IA-0001 (PRAIRIE, PaRis Artificial Intelligence Research InstitutE, 2019); the Louis Vuitton ENS Chair on Artificial Intelligence; and the European Regional Development Fund under project IMPACT (reg. no. CZ.02.1.01/0.0/0.0/15 003/0000468).
- Subjects
FOS: Computer and information sciences; ACM: I.: Computing Methodologies/I.5: PATTERN RECOGNITION; Computer Science - Machine Learning; Computer Science - Computation and Language; Video Understanding; Computer Vision and Pattern Recognition (cs.CV); Computer Vision; Computer Science - Computer Vision and Pattern Recognition; [INFO.INFO-CV]Computer Science [cs]/Computer Vision and Pattern Recognition [cs.CV]; Video Question Answering; [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]; Machine Learning (cs.LG); ACM: I.: Computing Methodologies/I.2: ARTIFICIAL INTELLIGENCE/I.2.7: Natural Language Processing; ACM: I.: Computing Methodologies/I.2: ARTIFICIAL INTELLIGENCE/I.2.10: Vision and Scene Understanding; ACM: I.: Computing Methodologies/I.5: PATTERN RECOGNITION/I.5.1: Models; [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]; ACM: I.: Computing Methodologies/I.2: ARTIFICIAL INTELLIGENCE; [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG]; Vision and Language; Zero-Shot Learning; [INFO]Computer Science [cs]; Computation and Language (cs.CL); ACM: I.: Computing Methodologies/I.2: ARTIFICIAL INTELLIGENCE/I.2.6: Learning
- Abstract
Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training. Manual annotation of questions and answers for videos, however, is tedious and prohibits scalability. To tackle this problem, recent methods consider zero-shot settings with no manual annotation of visual question-answer pairs. In particular, a promising approach adapts frozen autoregressive language models pretrained on Web-scale text-only data to multi-modal inputs. In contrast, we here build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA. In particular, (i) we combine visual inputs with the frozen BiLM using light trainable modules, (ii) we train such modules using Web-scraped multi-modal data, and finally (iii) we perform zero-shot VideoQA inference through masked language modeling, where the masked text is the answer to a given question. Our proposed approach, FrozenBiLM, outperforms the state of the art in zero-shot VideoQA by a significant margin on a variety of datasets, including LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA, TGIF-FrameQA, How2QA and TVQA. It also demonstrates competitive performance in the few-shot and fully-supervised settings. Our code and models are publicly available at https://github.com/antoyang/FrozenBiLM.
- Notes
NeurIPS 2022 Camera-Ready; Project Webpage: https://antoyang.github.io/frozenbilm.html; 25 pages; 5 figures
- Published
- 2022
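- Example
The inference scheme described in step (iii) of the abstract — casting the answer as masked text in a cloze-style prompt — can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the scorer below is a hypothetical stand-in, whereas FrozenBiLM scores candidates with a frozen bidirectional language model conditioned on video features through light trainable modules.

```python
# Toy sketch of zero-shot QA via masked language modeling: the answer is
# whichever candidate the (bidirectional) language model deems most likely
# at the [MASK] position. The scoring function is a hypothetical stand-in
# for a real BiLM; FrozenBiLM additionally conditions on video features.

def answer_by_masked_lm(tokens, mask_index, candidates, logprob):
    """Pick the candidate that best fills the [MASK] slot in the prompt."""
    scores = {c: logprob(tokens, mask_index, c) for c in candidates}
    return max(scores, key=scores.get)

# Hypothetical log-probabilities a BiLM might assign at the masked slot.
fake_bilm_scores = {"dog": -0.4, "car": -5.1, "cat": -2.3}

tokens = "Question: What animal is playing ? Answer: [MASK] .".split()
best = answer_by_masked_lm(
    tokens,
    tokens.index("[MASK]"),
    list(fake_bilm_scores),
    lambda t, i, c: fake_bilm_scores[c],  # stand-in scorer
)
print(best)  # → dog
```

Because a bidirectional model sees context on both sides of the mask, the whole question (and, in FrozenBiLM, the video) informs the score of each candidate answer, which is what makes the cloze formulation a natural fit for zero-shot VideoQA.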