
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

Authors:
Microsoft
Abouelenin, Abdelrahman
Ashfaq, Atabak
Atkinson, Adam
Awadalla, Hany
Bach, Nguyen
Bao, Jianmin
Benhaim, Alon
Cai, Martin
Chaudhary, Vishrav
Chen, Congcong
Chen, Dong
Chen, Dongdong
Chen, Junkun
Chen, Weizhu
Chen, Yen-Chun
Chen, Yi-ling
Dai, Qi
Dai, Xiyang
Fan, Ruchao
Gao, Mei
Gao, Min
Garg, Amit
Goswami, Abhishek
Hao, Junheng
Hendy, Amr
Hu, Yuxuan
Jin, Xin
Khademi, Mahmoud
Kim, Dongwoo
Kim, Young Jin
Lee, Gina
Li, Jinyu
Li, Yunsheng
Liang, Chen
Lin, Xihui
Lin, Zeqi
Liu, Mengchen
Liu, Yang
Lopez, Gilsinia
Luo, Chong
Madan, Piyush
Mazalov, Vadim
Mitra, Arindam
Mousavi, Ali
Nguyen, Anh
Pan, Jing
Perez-Becker, Daniel
Platin, Jacob
Portet, Thomas
Qiu, Kai
Ren, Bo
Ren, Liliang
Roy, Sambuddha
Shang, Ning
Shen, Yelong
Singhal, Saksham
Som, Subhojit
Song, Xia
Sych, Tetyana
Vaddamanu, Praneetha
Wang, Shuohang
Wang, Yiming
Wang, Zhenghao
Wu, Haibin
Xu, Haoran
Xu, Weijian
Yang, Yifan
Yang, Ziyi
Yu, Donghan
Zabir, Ishmam
Zhang, Jianwen
Zhang, Li Lyna
Zhang, Yunan
Zhou, Xiren
Publication Year:
2025

Abstract

We introduce Phi-4-Mini and Phi-4-Multimodal, compact yet highly capable language and multimodal models. Phi-4-Mini is a 3.8-billion-parameter language model trained on high-quality web and synthetic data, significantly outperforming recent open-source models of similar size and matching the performance of models twice its size on math and coding tasks requiring complex reasoning. This achievement is driven by a carefully curated synthetic data recipe emphasizing high-quality math and coding datasets. Compared to its predecessor, Phi-3.5-Mini, Phi-4-Mini features an expanded vocabulary of 200K tokens to better support multilingual applications, as well as group query attention for more efficient long-sequence generation. Phi-4-Multimodal is a multimodal model that integrates text, vision, and speech/audio input modalities into a single model. Its novel modality-extension approach leverages LoRA adapters and modality-specific routers to enable multiple inference modes that combine various modalities without interference. For example, it currently ranks first on the OpenASR leaderboard, even though the LoRA component of the speech/audio modality has just 460 million parameters. Phi-4-Multimodal supports scenarios involving (vision + language), (vision + speech), and (speech/audio) inputs, outperforming larger vision-language and speech-language models on a wide range of tasks. Additionally, we experiment with further training Phi-4-Mini to enhance its reasoning capabilities. Despite its compact 3.8-billion-parameter size, this experimental version achieves reasoning performance on par with or surpassing significantly larger models, including DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B.

Comment: 39 pages
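To make the modality-extension idea concrete, the sketch below shows one way a frozen base projection could be augmented with per-modality LoRA adapters chosen by a simple modality switch, so vision and speech updates stay isolated from each other and from the text-only path. This is a minimal illustration under assumed dimensions, rank, class and parameter names (e.g. MixtureOfLoRALinear, a hard switch standing in for the router); it is not the paper's actual implementation.

```python
# Minimal sketch (not the paper's implementation): a frozen base projection with
# per-modality LoRA adapters selected by the declared input modality. Dimensions,
# rank, class name, and the hard-switch "router" are illustrative assumptions.
from typing import Optional

import torch
import torch.nn as nn


class MixtureOfLoRALinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, modalities=("vision", "speech"), rank: int = 8):
        super().__init__()
        # Shared base weights, frozen: stands in for the pretrained language-model layer.
        self.base = nn.Linear(d_in, d_out)
        for p in self.base.parameters():
            p.requires_grad_(False)
        # One trainable low-rank adapter (A, B) per modality.
        self.lora_A = nn.ModuleDict({m: nn.Linear(d_in, rank, bias=False) for m in modalities})
        self.lora_B = nn.ModuleDict({m: nn.Linear(rank, d_out, bias=False) for m in modalities})
        for m in modalities:
            nn.init.zeros_(self.lora_B[m].weight)  # adapters start as a no-op

    def forward(self, x: torch.Tensor, modality: Optional[str]) -> torch.Tensor:
        out = self.base(x)
        # "Router" here is a hard switch on the modality tag: text-only inputs skip
        # every adapter, leaving the base model untouched, while each modality only
        # ever sees (and trains) its own low-rank update.
        if modality is not None:
            out = out + self.lora_B[modality](self.lora_A[modality](x))
        return out


if __name__ == "__main__":
    layer = MixtureOfLoRALinear(d_in=64, d_out=64)
    tokens = torch.randn(2, 10, 64)
    print(layer(tokens, modality="speech").shape)  # adapter path -> torch.Size([2, 10, 64])
    print(layer(tokens, modality=None).shape)      # text-only path, base weights only
```

Zero-initializing the B matrices is the standard LoRA choice: each adapter contributes nothing until it is trained, so the frozen backbone's behavior is preserved and adding a new modality does not disturb the others.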

Details

Database:
arXiv
Publication Type:
Report
Accession number:
edsarx.2503.01743
Document Type:
Working Paper