ET tu, CLIP? Addressing Common Object Errors for Unseen Environments

Authors :: Byun, Ye Won
Jiao, Cathy
Noroozizadeh, Shahriar
Sun, Jimin
Vitiello, Rosa
Source :: Conference on Computer Vision and Pattern Recognition (CVPR 2022) - Embodied AI Workshop
Publication Year :: 2024
Abstract :: We introduce a simple method that employs pre-trained CLIP encoders to enhance model generalization in the ALFRED task. In contrast to previous literature where CLIP replaces the visual encoder, we suggest using CLIP as an additional module through an auxiliary object detection objective. We validate our method on the recently proposed Episodic Transformer architecture and demonstrate that incorporating CLIP improves task performance on the unseen validation set. Additionally, our analysis suggests that CLIP especially helps with leveraging object descriptions, detecting small objects, and interpreting rare words.

Subjects :: Computer Science - Computer Vision and Pattern Recognition
Computer Science - Artificial Intelligence
Computer Science - Computation and Language
Computer Science - Machine Learning
Computer Science - Robotics

Database :: arXiv
Journal :: Conference on Computer Vision and Pattern Recognition (CVPR 2022) - Embodied AI Workshop
Publication Type :: Report
Accession number :: edsarx.2406.17876
Document Type :: Working Paper
