
VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation

Authors :
Zhao, Wei
Ding, Pengxiang
Zhang, Min
Gong, Zhefei
Bai, Shuanghao
Zhao, Han
Wang, Donglin
Publication Year :
2025

Abstract

Vision-language-action models (VLAs) have become increasingly popular in robot manipulation for their end-to-end design and remarkable performance. However, existing VLAs rely heavily on vision-language models (VLMs) that only support text-based instructions, neglecting the more natural speech modality for human-robot interaction. Traditional speech integration methods usually involve a separate speech recognition system, which complicates the model and introduces error propagation. Moreover, the transcription procedure discards non-semantic information in the raw speech, such as the voiceprint, which may be crucial for robots to successfully complete customized tasks. To overcome the above challenges, we propose VLAS, a novel end-to-end VLA that integrates speech recognition directly into the robot policy model. VLAS allows the robot to understand spoken commands through inner speech-text alignment and produces corresponding actions to fulfill the task. We also present two new datasets, SQA and CSI, to support a three-stage tuning process for speech instructions, which empowers VLAS with the ability of multimodal interaction across text, image, speech, and robot actions. Taking a step further, a voice retrieval-augmented generation (RAG) paradigm is designed to enable our model to effectively handle tasks that require individual-specific knowledge. Our extensive experiments show that VLAS can effectively accomplish robot manipulation tasks with diverse speech commands, offering a seamless and customized interaction experience.

Comment: Accepted as a conference paper at ICLR 2025
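As a rough illustration of the voice RAG idea summarized above, the sketch below matches a query voiceprint against stored speaker embeddings and prepends the retrieved user-specific knowledge to the spoken instruction before it reaches the policy. This is a minimal sketch under our own assumptions: the embedding dimension, similarity threshold, user database, and function names are illustrative and do not come from the paper.

```python
import numpy as np

# Hypothetical per-user knowledge base: speaker embedding (voiceprint) -> user facts.
# In a real system the voiceprint would come from a speech encoder; random vectors
# are used here purely as stand-ins.
rng = np.random.default_rng(0)
USER_DB = {
    "alice": {"embedding": rng.normal(size=128), "knowledge": "Alice's cup is the blue mug."},
    "bob": {"embedding": rng.normal(size=128), "knowledge": "Bob's cup is the red mug."},
}

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def voice_rag(voiceprint, instruction, threshold=0.5):
    # Retrieve the closest enrolled speaker; augment the instruction with their
    # stored knowledge if the match is confident enough.
    best_user, best_score = None, -1.0
    for name, entry in USER_DB.items():
        score = cosine(voiceprint, entry["embedding"])
        if score > best_score:
            best_user, best_score = name, score
    if best_score < threshold:
        return instruction  # unknown speaker: fall back to the raw instruction
    return f"{USER_DB[best_user]['knowledge']} {instruction}"

# Example: a query voiceprint close to Alice's stored embedding.
query = USER_DB["alice"]["embedding"] + 0.05 * rng.normal(size=128)
print(voice_rag(query, "Bring me my cup."))
```

The design choice illustrated here is that retrieval keys on who is speaking rather than on what is said, which is why the abstract argues that dropping the voiceprint during transcription would lose information needed for customized tasks.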

Subjects

Computer Science - Robotics

Details

Database :
arXiv
Publication Type :
Report
Accession number :
edsarx.2502.13508
Document Type :
Working Paper