
Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation

Authors :
Uehara, Kohei
Goswami, Nabarun
Wang, Hanqin
Baba, Toshiaki
Tanaka, Kohtaro
Hashimoto, Tomohiro
Wang, Kai
Ito, Rei
Takagi, Naoya
Umagami, Ryo
Wen, Yingyi
Anakewat, Tanachai
Harada, Tatsuya
Publication Year :
2024

Abstract

The increasing demand for intelligent systems capable of interpreting and reasoning about visual content requires the development of large Vision-and-Language Models (VLMs) that are not only accurate but also capable of explicit reasoning. This paper presents a novel approach to developing a VLM that can conduct explicit reasoning based on visual content and textual instructions. We introduce a system that can ask questions to acquire necessary knowledge, thereby enhancing the robustness and explainability of the reasoning process. To this end, we developed a novel dataset generated by a Large Language Model (LLM), designed to promote chain-of-thought reasoning combined with a question-asking mechanism. The dataset covers a range of tasks, from common ones such as caption generation to specialized VQA tasks that require expert knowledge. We then fine-tuned an existing VLM on this dataset, enabling the model to generate questions and perform iterative reasoning during inference. The results demonstrate a step toward a more robust, accurate, and interpretable VLM, capable of reasoning explicitly and proactively seeking information when confronted with ambiguous visual input.
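To make the question-asking mechanism concrete, the sketch below shows one way the iterative reason-then-ask inference loop described in the abstract could be structured: the fine-tuned VLM may pause its chain of reasoning to emit a question, an external knowledge source answers it, and the answer is fed back before reasoning continues. Everything here is an illustrative assumption, not the paper's actual interface: the tag names, the stubbed generate_step and answer_question functions, and the loop structure are hypothetical.

# Hypothetical sketch of the iterative reason-then-ask inference loop.
# The VLM and the external knowledge source are stubbed out; all names
# are illustrative assumptions, not the paper's actual API.

QUESTION_TAG = "[QUESTION]"  # assumed marker the fine-tuned VLM emits when it lacks knowledge
ANSWER_TAG = "[ANSWER]"      # assumed marker used to feed the acquired knowledge back

def generate_step(image, prompt):
    """Stub for one generation pass of the fine-tuned VLM (placeholder logic)."""
    if QUESTION_TAG not in prompt:
        return f"{QUESTION_TAG} What breed is the dog in the image?"
    return "Reasoning: the breed is built for pulling loads over snow. Final answer: sled dog."

def answer_question(question):
    """Stub for the knowledge source (e.g. a human or an LLM) that answers the VLM's question."""
    return "It is a Siberian Husky."

def iterative_inference(image, instruction, max_rounds=3):
    """Alternate between reasoning and question-asking until the model stops asking."""
    prompt = instruction
    for _ in range(max_rounds):
        output = generate_step(image, prompt)
        if QUESTION_TAG in output:
            # The model asked for missing knowledge; obtain it and continue reasoning.
            question = output.split(QUESTION_TAG, 1)[1].strip()
            prompt += f"\n{QUESTION_TAG} {question}\n{ANSWER_TAG} {answer_question(question)}"
        else:
            return output  # explicit chain of reasoning ending in a final answer
    return output

if __name__ == "__main__":
    print(iterative_inference(image=None, instruction="What is this dog used for?"))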

Details

Database :
arXiv
Publication Type :
Report
Accession Number :
edsarx.2401.10005
Document Type :
Working Paper