Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

Authors:
Huang, Ailin
Wu, Boyong
Wang, Bruce
Yan, Chao
Hu, Chen
Feng, Chengli
Tian, Fei
Shen, Feiyu
Li, Jingbei
Chen, Mingrui
Liu, Peng
Miao, Ruihang
You, Wang
Chen, Xi
Yang, Xuerui
Huang, Yechang
Zhang, Yuxiang
Gong, Zheng
Zhang, Zixin
Zhou, Hongyu
Sun, Jianjian
Li, Brian
Feng, Chengting
Wan, Changyi
Hu, Hanpeng
Wu, Jianchang
Zhen, Jiangjie
Ming, Ranchen
Yuan, Song
Zhang, Xuelin
Zhou, Yu
Li, Bingxin
Ma, Buyun
Wang, Hongyuan
An, Kang
Ji, Wei
Li, Wen
Wen, Xuan
Kong, Xiangwen
Ma, Yuankai
Liang, Yuanwei
Mou, Yun
Ahmidi, Bahtiyar
Wang, Bin
Li, Bo
Miao, Changxin
Xu, Chen
Wang, Chenrun
Shi, Dapeng
Sun, Deshan
Hu, Dingyuan
Sai, Dula
Liu, Enle
Huang, Guanzhe
Yan, Gulin
Wang, Heng
Jia, Haonan
Zhang, Haoyang
Gong, Jiahao
Guo, Junjing
Liu, Jiashuai
Liu, Jiahong
Feng, Jie
Wu, Jie
Wu, Jiaoren
Yang, Jie
Wang, Jinguo
Zhang, Jingyang
Lin, Junzhe
Li, Kaixiang
Xia, Lei
Zhou, Li
Zhao, Liang
Gu, Longlong
Chen, Mei
Wu, Menglin
Li, Ming
Li, Mingxiao
Li, Mingliang
Liang, Mingyao
Wang, Na
Hao, Nie
Wu, Qiling
Tan, Qinyuan
Sun, Ran
Shuai, Shuai
Pang, Shaoliang
Yang, Shiliang
Gao, Shuli
Yuan, Shanshan
Liu, Siqi
Deng, Shihong
Jiang, Shilei
Liu, Sitong
Cao, Tiancheng
Wang, Tianyu
Deng, Wenjin
Xie, Wuxun
Ming, Weipeng
He, Wenqing
Sun, Wen
Han, Xin
Huang, Xin
Deng, Xiaomin
Liu, Xiaojia
Wu, Xin
Zhao, Xu
Wei, Yanan
Yu, Yanbo
Cao, Yang
Li, Yangguang
Ma, Yangzhen
Xu, Yanming
Wang, Yaoyu
Shi, Yaqiang
Wang, Yilei
Zhou, Yizhuang
Zhong, Yinmin
Zhang, Yang
Wei, Yaoben
Luo, Yu
Lu, Yuanwei
Yin, Yuhe
Luo, Yuchu
Ding, Yuanhao
Yan, Yuting
Dai, Yaqi
Yang, Yuxiang
Xie, Zhe
Ge, Zheng
Sun, Zheng
Huang, Zhewei
Chang, Zhichao
Guan, Zhisheng
Yang, Zidong
Zhang, Zili
Jiao, Binxing
Jiang, Daxin
Shum, Heung-Yeung
Chen, Jiansheng
Li, Jing
Zhou, Shuchang
Zhang, Xiangyu
Zhang, Xinhao
Zhu, Yibo
Publication Year: 2025

Abstract

Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as the high cost of voice data collection, weak dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. On our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in instruction following. On open-source benchmarks such as LLaMA Question, it shows a 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.
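
The abstract points to two open-sourced checkpoints, Step-Audio-Chat and Step-Audio-TTS-3B, hosted at the linked repository. The sketch below is a hypothetical illustration of loading such a release through a generic Hugging Face transformers interface; the model identifier, the trust_remote_code usage, and the text-only prompt are assumptions rather than the confirmed Step-Audio API, for which the repository's README is authoritative.

# Hypothetical sketch only: assumes a Hugging Face-style interface for the
# open-sourced Step-Audio-Chat checkpoint; see the stepfun-ai/Step-Audio
# repository for the actual, supported usage.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "stepfun-ai/Step-Audio-Chat"  # assumed hub ID for the released chat model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    device_map="auto",  # a 130B-parameter model must be sharded across several GPUs
)

# Text-only probe of the unified speech-text model; real speech-to-speech
# interaction would feed audio tokens from the model's speech tokenizer.
prompt = "In one sentence, what makes unified speech-text modeling useful?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

Because the 130B chat model and the distilled 3B TTS model ship in the same release, the same loading pattern would presumably apply to Step-Audio-TTS-3B with a far smaller memory footprint.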

Details

Database: arXiv
Publication Type: Report
Accession number: edsarx.2502.11946
Document Type: Working Paper