MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

Authors:
McKinzie, Brandon
Gan, Zhe
Fauconnier, Jean-Philippe
Dodge, Sam
Zhang, Bowen
Dufter, Philipp
Shah, Dhruti
Du, Xianzhi
Peng, Futang
Weers, Floris
Belyi, Anton
Zhang, Haotian
Singh, Karanjeet
Kang, Doug
Jain, Ankur
Hè, Hongyu
Schwarzer, Max
Gunter, Tom
Kong, Xiang
Zhang, Aonan
Wang, Jianyu
Wang, Chong
Du, Nan
Lei, Tao
Wiseman, Sam
Yin, Guoli
Lee, Mark
Wang, Zirui
Pang, Ruoming
Grasch, Peter
Toshev, Alexander
Yang, Yinfei
Publication Year:
2024

Abstract

In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision-language connector, and various pre-training data choices, we identify several crucial design lessons. For example, we demonstrate that, for large-scale multimodal pre-training, using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks relative to other published pre-training results. Further, we show that the image encoder, together with the image resolution and the image token count, has a substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models with up to 30B parameters, including both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning and multi-image reasoning, enabling few-shot chain-of-thought prompting.
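
To make the three ablated components concrete, the sketch below is a minimal, illustrative PyTorch layout of an image encoder feeding a vision-language connector that produces a fixed budget of image tokens for a decoder LLM. It is not the authors' code: the ViT-style patch encoder, the average-pooling connector, the generic Transformer stand-in for the LLM, and all dimensions are assumptions chosen only to show how the pieces fit together.

```python
# Minimal sketch (assumed layout, not MM1's actual configuration) of:
#   image encoder -> vision-language connector -> decoder LLM.
import torch
import torch.nn as nn


class VisionLanguageConnector(nn.Module):
    """Pools patch features to a fixed number of image tokens and projects
    them into the LLM embedding space. Pooling choice is an assumption."""

    def __init__(self, vision_dim: int, llm_dim: int, num_image_tokens: int = 64):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(num_image_tokens)  # sets the image token count
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        pooled = self.pool(patch_features.transpose(1, 2)).transpose(1, 2)
        return self.proj(pooled)  # (batch, num_image_tokens, llm_dim)


class ToyMultimodalLM(nn.Module):
    """Prepends image tokens to text embeddings before the LLM; one common
    interleaving strategy, used here purely for illustration."""

    def __init__(self, vision_encoder: nn.Module, connector: VisionLanguageConnector,
                 text_embed: nn.Embedding, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.connector = connector
        self.text_embed = text_embed
        self.llm = llm

    def forward(self, images: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
        image_tokens = self.connector(self.vision_encoder(images))
        text_tokens = self.text_embed(text_ids)
        return self.llm(torch.cat([image_tokens, text_tokens], dim=1))


if __name__ == "__main__":
    # Stand-in ViT-style patch encoder; sizes are arbitrary assumptions.
    class PatchEncoder(nn.Module):
        def __init__(self, dim: int = 256, patch: int = 16):
            super().__init__()
            self.conv = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.conv(x).flatten(2).transpose(1, 2)  # (batch, num_patches, dim)

    model = ToyMultimodalLM(
        vision_encoder=PatchEncoder(dim=256),
        connector=VisionLanguageConnector(vision_dim=256, llm_dim=512, num_image_tokens=64),
        text_embed=nn.Embedding(32000, 512),
        llm=nn.TransformerEncoder(nn.TransformerEncoderLayer(512, 8, batch_first=True), 2),
    )
    out = model(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 16)))
    print(out.shape)  # (1, 64 + 16, 512): image tokens followed by text tokens
```

The split mirrors the abstract's finding: the encoder, input resolution, and the number of image tokens emitted by the connector are the levers that matter most, while the connector's internal design (here a simple pool-and-project) is comparatively minor.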

Details

Database:
arXiv
Publication Type:
Report
Accession number:
edsarx.2403.09611
Document Type:
Working Paper