Start Over

Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models

Authors :: Li, You
Huang, Heyu
Chen, Chi
Huang, Kaiyu
Huang, Chao
Guo, Zonghao
Liu, Zhiyuan
Xu, Jinan
Li, Yuhua
Li, Ruixuan
Sun, Maosong
Publication Year :: 2025
Abstract: The recent advancement of Multimodal Large Language Models (MLLMs) has significantly improved their fine-grained perception of single images and general comprehension across multiple images. However, existing MLLMs still face challenges in achieving precise grounding in complex multi-image scenarios. To address this, we first explore a Chain-of-Thought (CoT) framework that integrates single-image grounding with multi-image comprehension. While partially effective, it remains unstable and struggles to capture abstract visual information due to its non-end-to-end nature. Therefore, we introduce Migician, the first multi-image grounding model capable of performing free-form and accurate grounding across multiple images. To support this, we present the MGrounding-630k dataset, which comprises data for several multi-image grounding tasks derived from existing datasets, along with newly generated free-form grounding instruction-following data. Furthermore, we propose MIG-Bench, a comprehensive benchmark specifically designed for evaluating multi-image grounding capabilities. Experimental results demonstrate that our model achieves significantly superior multi-image grounding capabilities, outperforming the best existing MLLMs by 21.61% and even surpassing much larger 70B models. Our code, model, dataset, and benchmark are fully open-sourced.<br />Comment: 20 pages, 8 figures

Subjects :: Computer Science - Computation and Language
Computer Science - Artificial Intelligence
Computer Science - Computer Vision and Pattern Recognition

Details

Database :: arXiv
Publication Type :: Report
Accession number :: edsarx.2501.05767
Document Type :: Working Paper

Tools

Email
Cite

Printer

Authors Abstract Subjects Details

Searchworks

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources

Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models

Abstract

Subjects

Details

Tools

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models

Abstract

Subjects

Details

Tools

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources