
Detect2Interact: Localizing Object Key Field in Visual Question Answering with LLMs

Authors :
Wang, Jialou
Zhu, Manli
Li, Yulei
Li, Honglei
Yang, Longzhi
Woo, Wai Lok
Source :
IEEE Intelligent Systems; 2024, Vol. 39, Issue 3, pp. 35-44, 10p
Publication Year :
2024

Abstract

Localization plays a crucial role in enhancing the practicality and precision of visual question answering (VQA) systems. By enabling fine-grained identification of and interaction with specific parts of an object, it significantly improves the system’s ability to provide contextually relevant and spatially accurate responses. In this article, we introduce “Detect2Interact,” an advanced approach to fine-grained detection of an object’s visual key field that addresses the challenge of accurately mapping objects within images so as to generate nuanced and spatially aware responses. First, we use the segment anything model (SAM) to generate detailed spatial maps of the objects in an image. Next, we use Vision Studio to extract semantic descriptions of those objects. Third, we employ GPT-4’s commonsense knowledge to connect the detected objects and their descriptions to the key field relevant to the user’s question. As a result, Detect2Interact achieves consistent qualitative results on object key field detection across extensive test cases and outperforms existing VQA systems with object detection by providing a more reasonable and finer visual representation.
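
The abstract outlines a three-stage pipeline: SAM for spatial maps, Vision Studio for semantic object descriptions, and GPT-4 for commonsense reasoning over those descriptions. The following is a minimal sketch of how such a pipeline could be wired together, not the authors' implementation: the SAM and OpenAI calls follow those libraries' public Python APIs, while the Vision Studio step is stubbed as a placeholder and the prompt and return formats are purely illustrative.

```python
# Sketch of a Detect2Interact-style pipeline (illustrative only).
# Assumes the public segment-anything and openai Python packages,
# a local SAM checkpoint file, and OPENAI_API_KEY in the environment.
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator
from openai import OpenAI


def segment_objects(image: np.ndarray, checkpoint: str = "sam_vit_h_4b8939.pth"):
    """Step 1: use SAM to produce a spatial mask for every object in the image."""
    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
    mask_generator = SamAutomaticMaskGenerator(sam)
    # Each mask dict carries 'segmentation', 'bbox', 'area', etc.
    return mask_generator.generate(image)


def describe_objects(image: np.ndarray, masks) -> list[str]:
    """Step 2 (placeholder): obtain a semantic description per masked region.
    The article uses Vision Studio; any captioning/tagging service could stand in here."""
    return [f"object {i} at bbox {m['bbox']}" for i, m in enumerate(masks)]


def locate_key_field(question: str, descriptions: list[str]) -> str:
    """Step 3: ask GPT-4 to reason over the descriptions and pick the object,
    and the part of it, that the question is actually about."""
    client = OpenAI()
    prompt = (
        "Given these detected objects:\n"
        + "\n".join(descriptions)
        + f"\n\nQuestion: {question}\n"
        "Answer with the index of the relevant object and the part of it to interact with."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```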

Details

Language :
English
ISSN :
1541-1672
Volume :
39
Issue :
3
Database :
Supplemental Index
Journal :
IEEE Intelligent Systems
Publication Type :
Periodical
Accession number :
ejs66752228
Full Text :
https://doi.org/10.1109/MIS.2024.3384513