
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

Authors :
Zhang, Pan
Dong, Xiaoyi
Zang, Yuhang
Cao, Yuhang
Qian, Rui
Chen, Lin
Guo, Qipeng
Duan, Haodong
Wang, Bin
Ouyang, Linke
Zhang, Songyang
Zhang, Wenwei
Li, Yining
Gao, Yang
Sun, Peng
Zhang, Xinyue
Li, Wei
Li, Jingwen
Wang, Wenhai
Yan, Hang
He, Conghui
Zhang, Xingcheng
Chen, Kai
Dai, Jifeng
Qiao, Yu
Lin, Dahua
Wang, Jiaqi
Publication Year :
2024

Abstract

We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large vision-language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V-level capabilities with merely a 7B LLM backend. Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation. This long-context capability allows IXC-2.5 to excel in tasks requiring extensive input and output contexts. Compared to its previous 2.0 version, InternLM-XComposer-2.5 features three major upgrades in vision-language comprehension: (1) Ultra-High Resolution Understanding, (2) Fine-Grained Video Understanding, and (3) Multi-Turn Multi-Image Dialogue. In addition to comprehension, IXC-2.5 extends to two compelling applications using extra LoRA parameters for text-image composition: (1) Crafting Webpages and (2) Composing High-Quality Text-Image Articles. IXC-2.5 has been evaluated on 28 benchmarks, outperforming existing open-source state-of-the-art models on 16 benchmarks. It also surpasses or competes closely with GPT-4V and Gemini Pro on 16 key tasks. InternLM-XComposer-2.5 is publicly available at https://github.com/InternLM/InternLM-XComposer.
Comment: Technical Report. https://github.com/InternLM/InternLM-XComposer

Details

Database :
arXiv
Publication Type :
Report
Accession number :
edsarx.2407.03320
Document Type :
Working Paper