Robust visual question answering via semantic cross modal augmentation.
- Author
Mashrur, Akib; Luo, Wei; Zaidi, Nayyar A.; Robles-Kelly, Antonio
- Subjects
Language models; Data augmentation; Prediction models
- Abstract
Recent advances in vision-language models have resulted in improved accuracy in visual question answering (VQA) tasks. However, their robustness remains limited when faced with out-of-distribution data containing unanswerable questions. In this study, we first construct a simple randomised VQA dataset, incorporating unanswerable questions from the VQA v2 dataset, to evaluate the robustness of a state-of-the-art VQA model. Our findings reveal that the model either struggles to predict the "unknown" answer or provides inaccurate responses with high confidence scores for irrelevant questions. To address this issue without retraining the large backbone models, we propose Cross Modal Augmentation (CMA), a model-agnostic, test-time-only, multi-modal semantic augmentation technique. CMA generates multiple semantically consistent but heterogeneous instances from the visual and textual inputs, which are fed to the model, and the predictions are combined to achieve a more robust output. We demonstrate that CMA enables the VQA model to provide more reliable answers in scenarios involving unanswerable questions, and show that the approach generalises across different categories of pre-trained vision-language models.
Highlights:
• VQA models often confidently give incorrect answers to irrelevant questions.
• We enhance model robustness at test time through multi-modal semantic augmentation.
• The proposed CMA creates varied inputs for the model and merges predictions for stability.
• CMA variants improve VQA reliability and performance in ambiguous environments.
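The abstract describes the CMA procedure only at a high level: generate several semantically consistent variants of the image and the question, run the frozen VQA model on each, and combine the predictions. The sketch below is a rough, generic illustration of that test-time idea, not the authors' implementation; the `vqa_model` and `paraphrase_fn` interfaces are assumed for illustration, and the specific augmentations are placeholders.

```python
# Minimal sketch of test-time cross-modal augmentation for a VQA model.
# NOT the paper's implementation: `vqa_model` and `paraphrase_fn` are
# hypothetical interfaces assumed for this example.
import torch
import torchvision.transforms as T


def cma_predict(vqa_model, image, question, paraphrase_fn, n_aug=4):
    """Average answer distributions over semantically consistent augmented inputs.

    vqa_model(image, question_str) is assumed to return a tensor of answer
    logits; paraphrase_fn(question, k) is assumed to return k
    meaning-preserving rewordings of the question.
    """
    # Mild, label-preserving visual augmentations (crop/flip/colour jitter).
    visual_aug = T.Compose([
        T.RandomResizedCrop(224, scale=(0.8, 1.0)),
        T.RandomHorizontalFlip(),
        T.ColorJitter(brightness=0.2, contrast=0.2),
    ])

    # Textual augmentation: the original question plus meaning-preserving rewordings.
    questions = [question] + list(paraphrase_fn(question, n_aug - 1))

    probs = []
    with torch.no_grad():
        for q in questions:
            img = visual_aug(image)        # one heterogeneous visual instance
            logits = vqa_model(img, q)     # one forward pass per augmented pair
            probs.append(torch.softmax(logits, dim=-1))

    # Combine per-instance answer distributions into a more robust prediction.
    return torch.stack(probs).mean(dim=0)
```

Averaging softmax outputs is just one plausible way to merge the per-instance predictions; the paper's CMA variants may combine them differently, so treat this purely as a conceptual outline of test-time multi-modal augmentation.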
- Published
2024