Detect2Interact: Localizing Object Key Field in Visual Question Answering with LLMs
- Author
Wang, Jialou; Zhu, Manli; Li, Yulei; Li, Honglei; Yang, Longzhi; Woo, Wai Lok
- Abstract
Localization plays a crucial role in enhancing the practicality and precision of visual question answering (VQA) systems. By enabling fine-grained identification of, and interaction with, specific parts of an object, it significantly improves a system's ability to provide contextually relevant and spatially accurate responses. In this article, we introduce "Detect2Interact," which addresses the challenge of accurately mapping objects within images to generate nuanced, spatially aware responses through an advanced approach to fine-grained detection of an object's visual key field. First, we use the Segment Anything Model (SAM) to generate detailed spatial maps of objects in images. Next, we use Vision Studio to extract semantic object descriptions. Third, we employ GPT-4's commonsense knowledge to reason over these descriptions and spatial maps. As a result, Detect2Interact achieves consistent qualitative results on object key field detection across extensive test cases and outperforms existing VQA systems that rely on object detection by providing a more reasonable and finer-grained visual representation.
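The abstract outlines a three-stage pipeline (SAM segmentation, semantic description, GPT-4 reasoning). Below is a minimal sketch of how such a pipeline could be wired together, not the authors' implementation: the SAM and OpenAI calls follow those libraries' public Python APIs, while `describe_objects()` is a hypothetical placeholder standing in for the Vision Studio step and the prompt wording is purely illustrative.

```python
import cv2
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry
from openai import OpenAI


def describe_objects(image, masks):
    """Placeholder for the semantic-description stage (the paper uses Vision Studio).

    A real implementation would caption or tag each masked region; here we return
    a generic label per mask so the sketch stays self-contained and runnable.
    """
    return ["object"] * len(masks)


def detect_key_field(image_path: str, question: str) -> str:
    # Stage 1: spatial maps via SAM automatic mask generation.
    image = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    masks = SamAutomaticMaskGenerator(sam).generate(image)  # each mask dict has 'bbox', 'area', 'segmentation'

    # Stage 2: semantic descriptions of the segmented regions.
    labels = describe_objects(image, masks)

    # Stage 3: GPT-4 commonsense reasoning to pick the region most relevant to the question.
    regions = "\n".join(
        f"region {i}: bbox={m['bbox']}, label={lbl}"
        for i, (m, lbl) in enumerate(zip(masks, labels))
    )
    client = OpenAI()
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\nSegmented regions:\n{regions}\n"
                "Which region is the key field to interact with? Reply with its index."
            ),
        }],
    )
    return reply.choices[0].message.content
```

A usage call would look like `detect_key_field("kettle.jpg", "Where should I grab the kettle?")`, with the returned index mapped back to the corresponding SAM mask for visualization.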
- Published
2024