
SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation

Authors:
Zhang, Junjie
Bai, Chenjia
He, Haoran
Xia, Wenke
Wang, Zhigang
Zhao, Bin
Li, Xiu
Li, Xuelong
Publication Year:
2024

Abstract

Acquiring a multi-task imitation policy in 3D manipulation poses challenges in terms of scene understanding and action prediction. Current methods employ both 3D representations and multi-view 2D representations to predict the poses of the robot's end-effector. However, they still require a considerable number of high-quality robot trajectories, and they suffer from limited generalization to unseen tasks and inefficient execution in long-horizon reasoning. In this paper, we propose SAM-E, a novel architecture for robot manipulation that leverages a vision foundation model for generalizable scene understanding and sequence imitation for long-term action reasoning. Specifically, we adopt Segment Anything (SAM), pre-trained on a huge number of images and promptable masks, as the foundation model for extracting task-relevant features, and we employ parameter-efficient fine-tuning on robot data for a better understanding of embodied scenarios. To address long-horizon reasoning, we develop a novel multi-channel heatmap that enables prediction of the action sequence in a single pass, notably enhancing execution efficiency. Experimental results on various instruction-following tasks demonstrate that SAM-E achieves superior performance with higher execution efficiency compared to the baselines, and it also significantly improves generalization in few-shot adaptation to new tasks.

Comment: ICML 2024. Project page: https://sam-embodied.github.io
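
The abstract names two technical ingredients: parameter-efficient fine-tuning of the pre-trained SAM encoder, and a multi-channel heatmap head that emits an action sequence in a single forward pass. The sketch below is not the authors' code; it is a minimal illustration of those two ideas under assumptions of our own (a LoRA-style low-rank adapter as the parameter-efficient method, and a toy MLP standing in for the SAM image encoder; all module names, shapes, and the 6-step horizon are illustrative).

```python
# Hedged sketch: LoRA-style adapters on a frozen encoder + a multi-channel
# heatmap head predicting several future timesteps in one pass.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + B A x."""

    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep pre-trained weights frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # start as a zero (identity) update

    def forward(self, x):
        return self.base(x) + self.lora_b(self.lora_a(x))


class SequenceHeatmapHead(nn.Module):
    """Maps pooled visual features to a heatmap with one channel per future step."""

    def __init__(self, feat_dim: int, horizon: int, map_size: int = 64):
        super().__init__()
        self.horizon = horizon
        self.map_size = map_size
        self.proj = nn.Linear(feat_dim, horizon * map_size * map_size)

    def forward(self, feats):  # feats: (batch, feat_dim)
        out = self.proj(feats)
        return out.view(-1, self.horizon, self.map_size, self.map_size)


# Toy usage: a stand-in "encoder" with LoRA-adapted layers and a 6-step head.
encoder = nn.Sequential(LoRALinear(nn.Linear(512, 512)), nn.GELU(),
                        LoRALinear(nn.Linear(512, 256)))
head = SequenceHeatmapHead(feat_dim=256, horizon=6)
heatmaps = head(encoder(torch.randn(2, 512)))  # shape: (2, 6, 64, 64)
print(heatmaps.shape)
```

Only the adapter and head parameters are trainable here, which mirrors the general pattern of parameter-efficient fine-tuning; the paper's actual architecture (multi-view inputs, SAM's prompt encoder, the exact action parameterization) should be taken from the project page linked above.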

Details

Database:
arXiv
Publication Type:
Report
Accession number:
edsarx.2405.19586
Document Type:
Working Paper