
An Accuracy Enhanced Vision Language Grounding Method Fused with Gaze Intention.

Authors :
Zhang, Junqian
Tu, Long
Zhang, Yakun
Xie, Liang
Xu, Minpeng
Ming, Dong
Yan, Ye
Yin, Erwei
Source :
Electronics (2079-9292); Dec 2023, Vol. 12, Issue 24, p5007, 16p
Publication Year :
2023

Abstract

Visual grounding aims to recognize and locate the target in an image according to human intention, offering a new intelligent interaction paradigm for augmented reality (AR) and virtual reality (VR) devices. However, existing vision language grounding relies on the language modality alone, and it performs poorly on images containing multiple similar objects. Gaze interaction is an important interaction mode in AR/VR devices, and it offers a promising remedy for such inaccurate vision language grounding cases. Based on the above issues and analysis, a vision language grounding framework fused with gaze intention is proposed. Firstly, we collect manual gaze annotations using an AR device and construct a novel multi-modal dataset, RefCOCOg-Gaze, in combination with the proposed data augmentation methods. Secondly, an attention-based multi-modal feature fusion model is designed, providing a baseline framework for vision language grounding with gaze intention (VLG-Gaze). Through a series of carefully designed experiments, we analyze the proposed dataset and framework qualitatively and quantitatively. Compared with the state-of-the-art vision language grounding model, our proposed scheme improves accuracy by 5.3%, which indicates the significance of gaze fusion in multi-modal grounding tasks. [ABSTRACT FROM AUTHOR]
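The abstract describes an attention-based multi-modal feature fusion model that injects gaze intention into vision language grounding. The record does not include the authors' code, so the short PyTorch sketch below is only an illustration of one way such a fusion head could be wired up; the class name GazeFusedGroundingHead, the feature dimensions, and the two-stage cross-attention layout are assumptions for exposition, not the authors' implementation.

import torch
import torch.nn as nn

class GazeFusedGroundingHead(nn.Module):
    # Hypothetical sketch: fuse visual tokens with language tokens and gaze
    # fixations via cross-attention, then regress one bounding box.
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.gaze_proj = nn.Linear(2, dim)  # embed 2-D fixation coordinates
        self.lang_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gaze_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.box_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, visual_tokens, lang_tokens, gaze_points):
        # visual_tokens: (B, Nv, dim) image patch features
        # lang_tokens:   (B, Nl, dim) referring-expression token features
        # gaze_points:   (B, Ng, 2)   normalized gaze fixation coordinates
        gaze_tokens = self.gaze_proj(gaze_points)                       # (B, Ng, dim)
        x, _ = self.lang_attn(visual_tokens, lang_tokens, lang_tokens)  # language-conditioned visual features
        x, _ = self.gaze_attn(x, gaze_tokens, gaze_tokens)              # re-weight by gaze intention
        return self.box_head(x.mean(dim=1)).sigmoid()                   # box as normalized (cx, cy, w, h)

A quick check with random tensors, e.g. GazeFusedGroundingHead()(torch.randn(1, 196, 256), torch.randn(1, 12, 256), torch.rand(1, 8, 2)), returns a (1, 4) box prediction, showing how gaze enters the pipeline as an extra attention source rather than a replacement for the language query.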

Details

Language :
English
ISSN :
20799292
Volume :
12
Issue :
24
Database :
Complementary Index
Journal :
Electronics (2079-9292)
Publication Type :
Academic Journal
Accession number :
174440492
Full Text :
https://doi.org/10.3390/electronics12245007