
ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model

Authors :
Chen, Guiming Hardy
Chen, Shunian
Zhang, Ruifei
Chen, Junying
Wu, Xiangbo
Zhang, Zhiyi
Chen, Zhihong
Li, Jianquan
Wan, Xiang
Wang, Benyou
Publication Year :
2024

Abstract

Recent advancements in Large Vision-Language Models (LVLMs) have enabled language models to process multimodal inputs, but deploying them requires significant computational resources, especially on edge devices. This study aims to bridge the performance gap between traditional-scale LVLMs and resource-friendly lite versions by adopting high-quality training data. To this end, a synthetic dataset is created by leveraging GPT-4V's ability to generate detailed captions, complex reasoning instructions, and detailed answers from images. The resulting model trained on our data, ALLaVA, achieves competitive performance on 12 benchmarks among LVLMs of up to 3B parameters. This work highlights the feasibility of adopting high-quality data in crafting more efficient LVLMs. Our online demo is available at https://allava.freedomai.cn.
Comment: 19 pages
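The abstract describes synthesizing training data by prompting GPT-4V for detailed captions, complex reasoning instructions, and detailed answers from images. Below is a minimal, hypothetical sketch of what one such synthesis call could look like using the OpenAI chat-completions API; the model name, prompt wording, and helper function are illustrative assumptions and not the authors' actual ALLaVA pipeline.

```python
# Hypothetical sketch of one GPT-4V-style annotation request; the prompt text,
# model name, and function name are placeholders, not the ALLaVA pipeline.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def synthesize_annotation(image_path: str) -> str:
    """Ask a GPT-4V-class model for a detailed caption plus a reasoning Q&A pair."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # placeholder model identifier
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            "Describe this image in detail, then write one complex "
                            "reasoning question about it and a detailed answer."
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                ],
            }
        ],
        max_tokens=1024,
    )
    return response.choices[0].message.content
```

Running such a call over a large image collection and storing the returned caption/instruction/answer triples would yield a synthetic instruction-tuning dataset of the kind the abstract describes.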

Details

Database :
OAIster
Publication Type :
Electronic Resource
Accession number :
edsoai.on1438527342
Document Type :
Electronic Resource