Author: "Du, Zhihao" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Du, Zhihao"' showing total 249 results

Start Over Author "Du, Zhihao"

249 results on '"Du, Zhihao"'

1. Unispeaker: A Unified Approach for Multimodality-driven Speaker Generation

Author: Sheng, Zhengyan, Du, Zhihao, Lu, Heng, Zhang, Shiliang, and Ling, Zhen-Hua
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent advancements in personalized speech generation have brought synthetic speech increasingly close to the realism of target speakers' recordings, yet multimodal speaker generation remains on the rise. This paper introduces UniSpeaker, a unified approach for multimodality-driven speaker generation. Specifically, we propose a unified voice aggregator based on KV-Former, applying soft contrastive loss to map diverse voice description modalities into a shared voice space, ensuring that the generated voice aligns more closely with the input descriptions. To evaluate multimodality-driven voice control, we build the first multimodality-based voice control (MVC) benchmark, focusing on voice suitability, voice diversity, and speech quality. UniSpeaker is evaluated across five tasks using the MVC benchmark, and the experimental results demonstrate that UniSpeaker outperforms previous modality-specific models. Speech samples are available at \url{https://UniSpeaker.github.io}.
Published: 2025

2. MinMo: A Multimodal Large Language Model for Seamless Voice Interaction

Author: Chen, Qian, Chen, Yafeng, Chen, Yanni, Chen, Mengzhe, Chen, Yingda, Deng, Chong, Du, Zhihao, Gao, Ruize, Gao, Changfeng, Gao, Zhifu, Li, Yabin, Lv, Xiang, Liu, Jiaqing, Luo, Haoneng, Ma, Bin, Ni, Chongjia, Shi, Xian, Tang, Jialong, Wang, Hui, Wang, Hao, Wang, Wen, Wang, Yuxuan, Xu, Yunlan, Yu, Fan, Yan, Zhijie, Yang, Yexin, Yang, Baosong, Yang, Xian, Yang, Guanrou, Zhao, Tianyu, Zhang, Qinglin, Zhang, Shiliang, Zhao, Nan, Zhang, Pei, Zhang, Chong, and Zhou, Jinren
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Human-Computer Interaction, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent advancements in large language models (LLMs) and multimodal speech-text models have laid the groundwork for seamless voice interactions, enabling real-time, natural, and human-like conversations. Previous models for voice interactions are categorized as native and aligned. Native models integrate speech and text processing in one framework but struggle with issues like differing sequence lengths and insufficient pre-training. Aligned models maintain text LLM capabilities but are often limited by small datasets and a narrow focus on speech tasks. In this work, we introduce MinMo, a Multimodal Large Language Model with approximately 8B parameters for seamless voice interaction. We address the main limitations of prior aligned multimodal models. We train MinMo through multiple stages of speech-to-text alignment, text-to-speech alignment, speech-to-speech alignment, and duplex interaction alignment, on 1.4 million hours of diverse speech data and a broad range of speech tasks. After the multi-stage training, MinMo achieves state-of-the-art performance across various benchmarks for voice comprehension and generation while maintaining the capabilities of text LLMs, and also facilitates full-duplex conversation, that is, simultaneous two-way communication between the user and the system. Moreover, we propose a novel and simple voice decoder that outperforms prior models in voice generation. The enhanced instruction-following capabilities of MinMo supports controlling speech generation based on user instructions, with various nuances including emotions, dialects, and speaking rates, and mimicking specific voices. For MinMo, the speech-to-text latency is approximately 100ms, full-duplex latency is approximately 600ms in theory and 800ms in practice. The MinMo project web page is https://funaudiollm.github.io/minmo, and the code and models will be released soon., Comment: Work in progress. Authors are listed in alphabetical order by family name
Published: 2025

3. CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

Author: Du, Zhihao, Wang, Yuxuan, Chen, Qian, Shi, Xian, Lv, Xiang, Zhao, Tianyu, Gao, Zhifu, Yang, Yexin, Gao, Changfeng, Wang, Hui, Yu, Fan, Liu, Huadai, Sheng, Zhengyan, Gu, Yue, Deng, Chong, Wang, Wen, Zhang, Shiliang, Yan, Zhijie, and Zhou, Jingren
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In our previous work, we introduced CosyVoice, a multilingual speech synthesis model based on supervised discrete speech tokens. By employing progressive semantic decoding with two popular generative models, language models (LMs) and Flow Matching, CosyVoice demonstrated high prosody naturalness, content consistency, and speaker similarity in speech in-context learning. Recently, significant progress has been made in multi-modal large language models (LLMs), where the response latency and real-time factor of speech synthesis play a crucial role in the interactive experience. Therefore, in this report, we present an improved streaming speech synthesis model, CosyVoice 2, which incorporates comprehensive and systematic optimizations. Specifically, we introduce finite-scalar quantization to improve the codebook utilization of speech tokens. For the text-speech LM, we streamline the model architecture to allow direct use of a pre-trained LLM as the backbone. In addition, we develop a chunk-aware causal flow matching model to support various synthesis scenarios, enabling both streaming and non-streaming synthesis within a single model. By training on a large-scale multilingual dataset, CosyVoice 2 achieves human-parity naturalness, minimal response latency, and virtually lossless synthesis quality in the streaming mode. We invite readers to listen to the demos at https://funaudiollm.github.io/cosyvoice2., Comment: Tech report, work in progress
Published: 2024

4. OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation

Author: Zhang, Qinglin, Cheng, Luyao, Deng, Chong, Chen, Qian, Wang, Wen, Zheng, Siqi, Liu, Jiaqing, Yu, Hai, Tan, Chaohong, Du, Zhihao, and Zhang, Shiliang
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Full-duplex spoken dialogue systems significantly surpass traditional turn-based dialogue systems, as they allow simultaneous bidirectional communication, closely mirroring human-human interactions. However, achieving low latency and natural interactions in full-duplex dialogue systems remains a significant challenge, especially considering human conversation dynamics such as interruptions, backchannels, and overlapping speech. In this paper, we introduce a novel End-to-End GPT-based model OmniFlatten for full-duplex conversation, capable of effectively modeling the complex behaviors inherent to natural conversations with low latency. To achieve full-duplex conversation capabilities, we propose a multi-stage post-training scheme that progressively adapts a text large language model (LLM) backbone into a speech-text dialogue LLM, capable of generating text and speech in real time, without modifying the architecture of the backbone LLM. The training process comprises three stages: modality alignment, half-duplex dialogue learning, and full-duplex dialogue learning. In all training stages, we standardize the data using a flattening operation, which enables unifying the training methods and the GPT backbone across different modalities and tasks. Our approach offers a simple modeling technique and a promising research direction for developing efficient and natural end-to-end full-duplex spoken dialogue systems. Audio samples of dialogues generated by OmniFlatten can be found at this web site (https://omniflatten.github.io/)., Comment: Work in progress
Published: 2024

5. Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap

Author: Yang, Guanrou, Yu, Fan, Ma, Ziyang, Du, Zhihao, Gao, Zhifu, Zhang, Shiliang, and Chen, Xie
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: While automatic speech recognition (ASR) systems have achieved remarkable performance with large-scale datasets, their efficacy remains inadequate in low-resource settings, encompassing dialects, accents, minority languages, and long-tail hotwords, domains with significant practical relevance. With the advent of versatile and powerful text-to-speech (TTS) models, capable of generating speech with human-level naturalness, expressiveness, and diverse speaker profiles, leveraging TTS for ASR data augmentation provides a cost-effective and practical approach to enhancing ASR performance. Comprehensive experiments on an unprecedentedly rich variety of low-resource datasets demonstrate consistent and substantial performance improvements, proving that the proposed method of enhancing low-resource ASR through a versatile TTS model is highly effective and has broad application prospects. Furthermore, we delve deeper into key characteristics of synthesized speech data that contribute to ASR improvement, examining factors such as text diversity, speaker diversity, and the volume of synthesized data, with text diversity being studied for the first time in this work. We hope our findings provide helpful guidance and reference for the practical application of TTS-based data augmentation and push the advancement of low-resource ASR one step further.
Published: 2024

6. IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities

Author: Zhang, Xin, Lyu, Xiang, Du, Zhihao, Chen, Qian, Zhang, Dong, Hu, Hangrui, Tan, Chaohong, Zhao, Tianyu, Wang, Yuxuan, Zhang, Bin, Lu, Heng, Zhou, Yaqian, and Qiu, Xipeng
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence
Abstract: Current methods of building LLMs with voice interaction capabilities rely heavily on explicit text autoregressive generation before or during speech response generation to maintain content quality, which unfortunately brings computational overhead and increases latency in multi-turn interactions. To address this, we introduce IntrinsicVoic,e an LLM designed with intrinsic real-time voice interaction capabilities. IntrinsicVoice aims to facilitate the transfer of textual capabilities of pre-trained LLMs to the speech modality by mitigating the modality gap between text and speech. Our novelty architecture, GroupFormer, can reduce speech sequences to lengths comparable to text sequences while generating high-quality audio, significantly reducing the length difference between speech and text, speeding up inference, and alleviating long-text modeling issues. Additionally, we construct a multi-turn speech-to-speech dialogue dataset named \method-500k which includes nearly 500k turns of speech-to-speech dialogues, and a cross-modality training strategy to enhance the semantic alignment between speech and text. Experimental results demonstrate that IntrinsicVoice can generate high-quality speech response with latency lower than 100ms in multi-turn dialogue scenarios. Demos are available at https://instrinsicvoice.github.io/.
Published: 2024

7. CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

Author: Du, Zhihao, Chen, Qian, Zhang, Shiliang, Hu, Kai, Lu, Heng, Yang, Yexin, Hu, Hangrui, Zheng, Siqi, Gu, Yue, Ma, Ziyang, Gao, Zhifu, and Yan, Zhijie
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent years have witnessed a trend that large language model (LLM) based text-to-speech (TTS) emerges into the mainstream due to their high naturalness and zero-shot capacity. In this paradigm, speech signals are discretized into token sequences, which are modeled by an LLM with text as prompts and reconstructed by a token-based vocoder to waveforms. Obviously, speech tokens play a critical role in LLM-based TTS models. Current speech tokens are learned in an unsupervised manner, which lacks explicit semantic information and alignment to the text. In this paper, we propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder. Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis. Experimental results show that supervised semantic tokens significantly outperform existing unsupervised tokens in terms of content consistency and speaker similarity for zero-shot voice cloning. Moreover, we find that utilizing large-scale data further improves the synthesis performance, indicating the scalable capacity of CosyVoice. To the best of our knowledge, this is the first attempt to involve supervised speech tokens into TTS models., Comment: work in progress. arXiv admin note: substantial text overlap with arXiv:2407.04051
Published: 2024

8. FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

Author: An, Keyu, Chen, Qian, Deng, Chong, Du, Zhihao, Gao, Changfeng, Gao, Zhifu, Gu, Yue, He, Ting, Hu, Hangrui, Hu, Kai, Ji, Shengpeng, Li, Yabin, Li, Zerui, Lu, Heng, Luo, Haoneng, Lv, Xiang, Ma, Bin, Ma, Ziyang, Ni, Chongjia, Song, Changhe, Shi, Jiaqi, Shi, Xian, Wang, Hao, Wang, Wen, Wang, Yuxuan, Xiao, Zhangyu, Yan, Zhijie, Yang, Yexin, Zhang, Bin, Zhang, Qinglin, Zhang, Shiliang, Zhao, Nan, and Zheng, Siqi
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, speaking style, and speaker identity. SenseVoice-Small delivers exceptionally low-latency ASR for 5 languages, and SenseVoice-Large supports high-precision ASR for over 50 languages, while CosyVoice excels in multi-lingual voice generation, zero-shot in-context learning, cross-lingual voice cloning, and instruction-following capabilities. The models related to SenseVoice and CosyVoice have been open-sourced on Modelscope and Huggingface, along with the corresponding training, inference, and fine-tuning codes released on GitHub. By integrating these models with LLMs, FunAudioLLM enables applications such as speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration, thereby pushing the boundaries of voice interaction technology. Demos are available at https://fun-audio-llm.github.io, and the code can be accessed at https://github.com/FunAudioLLM., Comment: Work in progress. Authors are listed in alphabetical order by family name
Published: 2024

9. An Embarrassingly Simple Approach for LLM with Strong ASR Capacity

Author: Ma, Ziyang, Yang, Guanrou, Yang, Yifan, Gao, Zhifu, Wang, Jiaming, Du, Zhihao, Yu, Fan, Chen, Qian, Zheng, Siqi, Zhang, Shiliang, and Chen, Xie
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Multimedia, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this paper, we focus on solving one of the most important tasks in the field of speech processing, i.e., automatic speech recognition (ASR), with speech foundation encoders and large language models (LLM). Recent works have complex designs such as compressing the output temporally for the speech encoder, tackling modal alignment for the projector, and utilizing parameter-efficient fine-tuning for the LLM. We found that delicate designs are not necessary, while an embarrassingly simple composition of off-the-shelf speech encoder, LLM, and the only trainable linear projector is competent for the ASR task. To be more specific, we benchmark and explore various combinations of LLMs and speech encoders, leading to the optimal LLM-based ASR system, which we call SLAM-ASR. The proposed SLAM-ASR provides a clean setup and little task-specific design, where only the linear projector is trained. To the best of our knowledge, SLAM-ASR achieves the best performance on the Librispeech benchmark among LLM-based ASR models and even outperforms the latest LLM-based audio-universal model trained on massive pair data. Finally, we explore the capability emergence of LLM-based ASR in the process of modal alignment. We hope that our study can facilitate the research on extending LLM with cross-modality capacity and shed light on the LLM-based ASR community., Comment: Working in progress and will open-source soon
Published: 2024

10. Correction: Exosomes derived from human exfoliated deciduous teeth ameliorate adult bone loss in mice through promoting osteogenesis

Author: Wei, Jizhen, Song, Yeqing, Du, Zhihao, Yu, Feiyan, Zhang, Yimei, Jiang, Nan, and Ge, Xuejun
Published: 2025
Full Text: View/download PDF

11. Effect of heat treatment and electromagnetic forming on springback and related properties of 7075 aluminum alloy sheet

Author: Du, Zhihao, Han, Yanbin, Han, Dong, Zhang, Haibao, Mao, Xiaobo, Zhang, Zhao, and Cui, Xiaohui
Published: 2024
Full Text: View/download PDF

12. Mind–body exercise for patients with stable COPD on lung function and exercise capacity: a systematic review and meta-analysis of RCTs

Author: Zhu, Yutong, Zhang, Zhihao, Du, Zhihao, and Zhai, Feng
Published: 2024
Full Text: View/download PDF

13. SA-Paraformer: Non-autoregressive End-to-End Speaker-Attributed ASR

Author: Li, Yangze, Yu, Fan, Liang, Yuhao, Guo, Pengcheng, Shi, Mohan, Du, Zhihao, Zhang, Shiliang, and Xie, Lei
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Joint modeling of multi-speaker ASR and speaker diarization has recently shown promising results in speaker-attributed automatic speech recognition (SA-ASR).Although being able to obtain state-of-the-art (SOTA) performance, most of the studies are based on an autoregressive (AR) decoder which generates tokens one-by-one and results in a large real-time factor (RTF). To speed up inference, we introduce a recently proposed non-autoregressive model Paraformer as an acoustic model in the SA-ASR model.Paraformer uses a single-step decoder to enable parallel generation, obtaining comparable performance to the SOTA AR transformer models. Besides, we propose a speaker-filling strategy to reduce speaker identification errors and adopt an inter-CTC strategy to enhance the encoder's ability in acoustic modeling. Experiments on the AliMeeting corpus show that our model outperforms the cascaded SA-ASR model by a 6.1% relative speaker-dependent character error rate (SD-CER) reduction on the test set. Moreover, our model achieves a comparable SD-CER of 34.8% with only 1/10 RTF compared with the SOTA joint AR SA-ASR model.
Published: 2023

14. LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT

Author: Du, Zhihao, Wang, Jiaming, Chen, Qian, Chu, Yunfei, Gao, Zhifu, Li, Zerui, Hu, Kai, Zhou, Xiaohuan, Xu, Jin, Ma, Ziyang, Wang, Wen, Zheng, Siqi, Zhou, Chang, Yan, Zhijie, and Zhang, Shiliang
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Generative Pre-trained Transformer (GPT) models have achieved remarkable performance on various natural language processing tasks, and have shown great potential as backbones for audio-and-text large language models (LLMs). Previous mainstream audio-and-text LLMs use discrete audio tokens to represent both input and output audio; however, they suffer from performance degradation on tasks such as automatic speech recognition, speech-to-text translation, and speech enhancement over models using continuous speech features. In this paper, we propose LauraGPT, a novel unified audio-and-text GPT-based LLM for audio recognition, understanding, and generation. LauraGPT is a versatile LLM that can process both audio and text inputs and generate outputs in either modalities. We propose a novel data representation that combines continuous and discrete features for audio: LauraGPT encodes input audio into continuous representations using an audio encoder and generates output audio from discrete codec codes. We propose a one-step codec vocoder to overcome the prediction challenge caused by the multimodal distribution of codec tokens. We fine-tune LauraGPT using supervised multi-task learning. Extensive experiments show that LauraGPT consistently achieves comparable to superior performance compared to strong baselines on a wide range of audio tasks related to content, semantics, paralinguistics, and audio-signal analysis, such as automatic speech recognition, speech-to-text translation, text-to-speech synthesis, speech enhancement, automated audio captioning, speech emotion recognition, and spoken language understanding., Comment: 10 pages, work in progress
Published: 2023

15. The second multi-channel multi-party meeting transcription challenge (M2MeT) 2.0): A benchmark for speaker-attributed ASR

Author: Liang, Yuhao, Shi, Mohan, Yu, Fan, Li, Yangze, Zhang, Shiliang, Du, Zhihao, Chen, Qian, Xie, Lei, Qian, Yanmin, Wu, Jian, Chen, Zhuo, Lee, Kong Aik, Yan, Zhijie, and Bu, Hui
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: With the success of the first Multi-channel Multi-party Meeting Transcription challenge (M2MeT), the second M2MeT challenge (M2MeT 2.0) held in ASRU2023 particularly aims to tackle the complex task of \emph{speaker-attributed ASR (SA-ASR)}, which directly addresses the practical and challenging problem of ``who spoke what at when" at typical meeting scenario. We particularly established two sub-tracks. The fixed training condition sub-track, where the training data is constrained to predetermined datasets, but participants can use any open-source pre-trained model. The open training condition sub-track, which allows for the use of all available data and models without limitation. In addition, we release a new 10-hour test set for challenge ranking. This paper provides an overview of the dataset, track settings, results, and analysis of submitted systems, as a benchmark to show the current state of speaker-attributed ASR., Comment: 8 pages, Accepted by ASRU2023
Published: 2023

16. FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec

Author: Du, Zhihao, Zhang, Shiliang, Hu, Kai, and Zheng, Siqi
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper presents FunCodec, a fundamental neural speech codec toolkit, which is an extension of the open-source speech processing toolkit FunASR. FunCodec provides reproducible training recipes and inference scripts for the latest neural speech codec models, such as SoundStream and Encodec. Thanks to the unified design with FunASR, FunCodec can be easily integrated into downstream tasks, such as speech recognition. Along with FunCodec, pre-trained models are also provided, which can be used for academic or generalized purposes. Based on the toolkit, we further propose the frequency-domain codec models, FreqCodec, which can achieve comparable speech quality with much lower computation and parameter complexity. Experimental results show that, under the same compression ratio, FunCodec can achieve better reconstruction quality compared with other toolkits and released models. We also demonstrate that the pre-trained models are suitable for downstream tasks, including automatic speech recognition and personalized text-to-speech synthesis. This toolkit is publicly available at https://github.com/alibaba-damo-academy/FunCodec., Comment: 5 pages, 3 figures, submitted to ICASSP 2024
Published: 2023

17. CASA-ASR: Context-Aware Speaker-Attributed ASR

Author: Shi, Mohan, Du, Zhihao, Chen, Qian, Yu, Fan, Li, Yangze, Zhang, Shiliang, Zhang, Jie, and Dai, Li-Rong
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Recently, speaker-attributed automatic speech recognition (SA-ASR) has attracted a wide attention, which aims at answering the question ``who spoke what''. Different from modular systems, end-to-end (E2E) SA-ASR minimizes the speaker-dependent recognition errors directly and shows a promising applicability. In this paper, we propose a context-aware SA-ASR (CASA-ASR) model by enhancing the contextual modeling ability of E2E SA-ASR. Specifically, in CASA-ASR, a contextual text encoder is involved to aggregate the semantic information of the whole utterance, and a context-dependent scorer is employed to model the speaker discriminability by contrasting with speakers in the context. In addition, a two-pass decoding strategy is further proposed to fully leverage the contextual modeling ability resulting in a better recognition performance. Experimental results on AliMeeting corpus show that the proposed CASA-ASR model outperforms the original E2E SA-ASR system with a relative improvement of 11.76% in terms of speaker-dependent character error rate., Comment: Accepted by Interspeech2023
Published: 2023

18. FunASR: A Fundamental End-to-End Speech Recognition Toolkit

Author: Gao, Zhifu, Li, Zerui, Wang, Jiaming, Luo, Haoneng, Shi, Xian, Chen, Mengzhe, Li, Yabin, Zuo, Lingyun, Du, Zhihao, Xiao, Zhangyu, and Zhang, Shiliang
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper introduces FunASR, an open-source speech recognition toolkit designed to bridge the gap between academic research and industrial applications. FunASR offers models trained on large-scale industrial corpora and the ability to deploy them in applications. The toolkit's flagship model, Paraformer, is a non-autoregressive end-to-end speech recognition model that has been trained on a manually annotated Mandarin speech recognition dataset that contains 60,000 hours of speech. To improve the performance of Paraformer, we have added timestamp prediction and hotword customization capabilities to the standard Paraformer backbone. In addition, to facilitate model deployment, we have open-sourced a voice activity detection model based on the Feedforward Sequential Memory Network (FSMN-VAD) and a text post-processing punctuation model based on the controllable time-delay Transformer (CT-Transformer), both of which were trained on industrial corpora. These functional modules provide a solid foundation for building high-precision long audio speech recognition services. Compared to other models trained on open datasets, Paraformer demonstrates superior performance., Comment: 5 pages, 3 figures, accepted by INTERSPEECH 2023
Published: 2023

19. TOLD: A Novel Two-Stage Overlap-Aware Framework for Speaker Diarization

Author: Wang, Jiaming, Du, Zhihao, and Zhang, Shiliang
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recently, end-to-end neural diarization (EEND) is introduced and achieves promising results in speaker-overlapped scenarios. In EEND, speaker diarization is formulated as a multi-label prediction problem, where speaker activities are estimated independently and their dependency are not well considered. To overcome these disadvantages, we employ the power set encoding to reformulate speaker diarization as a single-label classification problem and propose the overlap-aware EEND (EEND-OLA) model, in which speaker overlaps and dependency can be modeled explicitly. Inspired by the success of two-stage hybrid systems, we further propose a novel Two-stage OverLap-aware Diarization framework (TOLD) by involving a speaker overlap-aware post-processing (SOAP) model to iteratively refine the diarization results of EEND-OLA. Experimental results show that, compared with the original EEND, the proposed EEND-OLA achieves a 14.39% relative improvement in terms of diarization error rates (DER), and utilizing SOAP provides another 19.33% relative improvement. As a result, our method TOLD achieves a DER of 10.14% on the CALLHOME dataset, which is a new state-of-the-art result on this benchmark to the best of our knowledge., Comment: Accepted by ICASSP2023
Published: 2023

20. Speaker Overlap-aware Neural Diarization for Multi-party Meeting Analysis

Author: Du, Zhihao, Zhang, Shiliang, Zheng, Siqi, and Yan, Zhijie
Subjects: Computer Science - Sound, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recently, hybrid systems of clustering and neural diarization models have been successfully applied in multi-party meeting analysis. However, current models always treat overlapped speaker diarization as a multi-label classification problem, where speaker dependency and overlaps are not well considered. To overcome the disadvantages, we reformulate overlapped speaker diarization task as a single-label prediction problem via the proposed power set encoding (PSE). Through this formulation, speaker dependency and overlaps can be explicitly modeled. To fully leverage this formulation, we further propose the speaker overlap-aware neural diarization (SOND) model, which consists of a context-independent (CI) scorer to model global speaker discriminability, a context-dependent scorer (CD) to model local discriminability, and a speaker combining network (SCN) to combine and reassign speaker activities. Experimental results show that using the proposed formulation can outperform the state-of-the-art methods based on target speaker voice activity detection, and the performance can be further improved with SOND, resulting in a 6.30% relative diarization error reduction., Comment: Accepted by EMNLP 2022
Published: 2022

21. A Comparative Study on Multichannel Speaker-Attributed Automatic Speech Recognition in Multi-party Meetings

Author: Shi, Mohan, Zhang, Jie, Du, Zhihao, Yu, Fan, Chen, Qian, Zhang, Shiliang, and Dai, Li-Rong
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Speaker-attributed automatic speech recognition (SA-ASR) in multi-party meeting scenarios is one of the most valuable and challenging ASR task. It was shown that single-channel frame-level diarization with serialized output training (SC-FD-SOT), single-channel word-level diarization with SOT (SC-WD-SOT) and joint training of single-channel target-speaker separation and ASR (SC-TS-ASR) can be exploited to partially solve this problem. In this paper, we propose three corresponding multichannel (MC) SA-ASR approaches, namely MC-FD-SOT, MC-WD-SOT and MC-TS-ASR. For different tasks/models, different multichannel data fusion strategies are considered, including channel-level cross-channel attention for MC-FD-SOT, frame-level cross-channel attention for MC-WD-SOT and neural beamforming for MC-TS-ASR. Results on the AliMeeting corpus reveal that our proposed models can consistently outperform the corresponding single-channel counterparts in terms of the speaker-dependent character error rate.
Published: 2022

22. MFCCA:Multi-Frame Cross-Channel attention for multi-speaker ASR in Multi-party meeting scenario

Author: Yu, Fan, Zhang, Shiliang, Guo, Pengcheng, Liang, Yuhao, Du, Zhihao, Lin, Yuxiao, and Xie, Lei
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recently cross-channel attention, which better leverages multi-channel signals from microphone array, has shown promising results in the multi-party meeting scenario. Cross-channel attention focuses on either learning global correlations between sequences of different channels or exploiting fine-grained channel-wise information effectively at each time step. Considering the delay of microphone array receiving sound, we propose a multi-frame cross-channel attention, which models cross-channel information between adjacent frames to exploit the complementarity of both frame-wise and channel-wise knowledge. Besides, we also propose a multi-layer convolutional mechanism to fuse the multi-channel output and a channel masking strategy to combat the channel number mismatch problem between training and inference. Experiments on the AliMeeting, a real-world corpus, reveal that our proposed model outperforms single-channel model by 31.7\% and 37.0\% CER reduction on Eval and Test sets. Moreover, with comparable model parameters and training data, our proposed model achieves a new SOTA performance on the AliMeeting corpus, as compared with the top ranking systems in the ICASSP2022 M2MeT challenge, a recently held multi-channel multi-speaker ASR challenge., Comment: Accepted by SLT 2022
Published: 2022

23. The influence of physical activity on internet addiction among Chinese college students: the mediating role of self-esteem and the moderating role of gender

Author: Du Zhihao, Wang Tao, Sun Yingjie, and Zhai Feng
Subjects: Physical activity, Self-esteem, Internet addiction, Mediating effect, Moderating effect, Public aspects of medicine, RA1-1270
Abstract: Abstract Objectives The significance of self-esteem in the relationship between physical activity and Internet addiction among college students cannot be over, as it lays a solid foundation for the prevention and control of Internet addiction. Methods A total of 950 college students were surveyed using the Physical Activity Rating Scale (PARS-3), Rosenberg Self-Esteem Scale (SES), and Chinese Internet Addiction Scale (CIAS-R) through a cluster random sampling method. Descriptive statistics, independent sample t-test, partial correlation analysis, mediation effect, moderation effect, and Bootstrap testing were conducted on the collected data to analyze and interpret the results. Results (1) Significant gender differences were found in the amount of physical activity and the degree of Internet addiction among college students(P&& lt;0.001); (2) Physical activity was significantly and positively correlated with self-esteem (r = 0.26, P
Published: 2024
Full Text: View/download PDF

24. A Comparative Study on Speaker-attributed Automatic Speech Recognition in Multi-party Meetings

Author: Yu, Fan, Du, Zhihao, Zhang, Shiliang, Lin, Yuxiao, and Xie, Lei
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this paper, we conduct a comparative study on speaker-attributed automatic speech recognition (SA-ASR) in the multi-party meeting scenario, a topic with increasing attention in meeting rich transcription. Specifically, three approaches are evaluated in this study. The first approach, FD-SOT, consists of a frame-level diarization model to identify speakers and a multi-talker ASR to recognize utterances. The speaker-attributed transcriptions are obtained by aligning the diarization results and recognized hypotheses. However, such an alignment strategy may suffer from erroneous timestamps due to the modular independence, severely hindering the model performance. Therefore, we propose the second approach, WD-SOT, to address alignment errors by introducing a word-level diarization model, which can get rid of such timestamp alignment dependency. To further mitigate the alignment issues, we propose the third approach, TS-ASR, which trains a target-speaker separation module and an ASR module jointly. By comparing various strategies for each SA-ASR approach, experimental results on a real meeting scenario corpus, AliMeeting, reveal that the WD-SOT approach achieves 10.7% relative reduction on averaged speaker-dependent character error rate (SD-CER), compared with the FD-SOT approach. In addition, the TS-ASR approach also outperforms the FD-SOT approach and brings 16.5% relative average SD-CER reduction., Comment: accepted by INTERSPEECH 2022, 5 pages, 2 figures
Published: 2022

25. Speaker Embedding-aware Neural Diarization: an Efficient Framework for Overlapping Speech Diarization in Meeting Scenarios

Author: Du, Zhihao, Zhang, Shiliang, Zheng, Siqi, and Yan, Zhijie
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Overlapping speech diarization has been traditionally treated as a multi-label classification problem. In this paper, we reformulate this task as a single-label prediction problem by encoding multiple binary labels into a single label with the power set, which represents the possible combinations of target speakers. This formulation has two benefits. First, the overlaps of target speakers are explicitly modeled. Second, threshold selection is no longer needed. Through this formulation, we propose the speaker embedding-aware neural diarization (SEND) framework, where a speech encoder, a speaker encoder, two similarity scorers, and a post-processing network are jointly optimized to predict the encoded labels according to the similarities between speech features and speaker embeddings. Experimental results show that SEND has a stable learning process and can be trained on highly overlapped data without extra initialization. More importantly, our method achieves the state-of-the-art performance in real meeting scenarios with fewer model parameters and lower computational complexity., Comment: Submitted to INTERSPEECH 2022, 5 parges, 2 figure
Published: 2022

26. Summary On The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Grand Challenge

Author: Yu, Fan, Zhang, Shiliang, Guo, Pengcheng, Fu, Yihui, Du, Zhihao, Zheng, Siqi, Huang, Weilong, Xie, Lei, Tan, Zheng-Hua, Wang, DeLiang, Qian, Yanmin, Lee, Kong Aik, Yan, Zhijie, Ma, Bin, Xu, Xin, and Bu, Hui
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The ICASSP 2022 Multi-channel Multi-party Meeting Transcription Grand Challenge (M2MeT) focuses on one of the most valuable and the most challenging scenarios of speech technologies. The M2MeT challenge has particularly set up two tracks, speaker diarization (track 1) and multi-speaker automatic speech recognition (ASR) (track 2). Along with the challenge, we released 120 hours of real-recorded Mandarin meeting speech data with manual annotation, including far-field data collected by 8-channel microphone array as well as near-field data collected by each participants' headset microphone. We briefly describe the released dataset, track setups, baselines and summarize the challenge results and major techniques used in the submissions., Comment: Accepted by ICASSP 2022
Published: 2022

27. Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information

Author: Du, Zhihao, Zhang, Shiliang, Zheng, Siqi, Huang, Weilong, and Lei, Ming
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Overlapping speech diarization is always treated as a multi-label classification problem. In this paper, we reformulate this task as a single-label prediction problem by encoding the multi-speaker labels with power set. Specifically, we propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels according to the similarities between speech features and given speaker embeddings. Our method is further extended and integrated with downstream tasks by utilizing the textual information, which has not been well studied in previous literature. The experimental results show that our method achieves lower diarization error rate than the target-speaker voice activity detection. When textual information is involved, the diarization errors can be further reduced. For the real meeting scenario, our method can achieve 34.11% relative improvement compared with the Bayesian hidden Markov model based clustering algorithm., Comment: Submitted to ICASSP 2022, 5 pages, 2 figures
Published: 2021

28. M2MeT: The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Challenge

Author: Yu, Fan, Zhang, Shiliang, Fu, Yihui, Xie, Lei, Zheng, Siqi, Du, Zhihao, Huang, Weilong, Guo, Pengcheng, Yan, Zhijie, Ma, Bin, Xu, Xin, and Bu, Hui
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent development of speech processing, such as speech recognition, speaker diarization, etc., has inspired numerous applications of speech technologies. The meeting scenario is one of the most valuable and, at the same time, most challenging scenarios for the deployment of speech technologies. Specifically, two typical tasks, speaker diarization and multi-speaker automatic speech recognition have attracted much attention recently. However, the lack of large public meeting data has been a major obstacle for the advancement of the field. Therefore, we make available the AliMeeting corpus, which consists of 120 hours of recorded Mandarin meeting data, including far-field data collected by 8-channel microphone array as well as near-field data collected by headset microphone. Each meeting session is composed of 2-4 speakers with different speaker overlap ratio, recorded in rooms with different size. Along with the dataset, we launch the ICASSP 2022 Multi-channel Multi-party Meeting Transcription Challenge (M2MeT) with two tracks, namely speaker diarization and multi-speaker ASR, aiming to provide a common testbed for meeting rich transcription and promote reproducible research in this field. In this paper we provide a detailed introduction of the AliMeeting dateset, challenge rules, evaluation methods and baseline systems., Comment: Accepted by ICASSP 2022
Published: 2021

29. Microstructure evolution of 5052 aluminum alloy after electromagnetic-driven stamping following different heat treatment conditions

Author: Li, Rui, Du, Zhihao, Yang, Huan, He, Liming, and Cui, Xiaohui
Published: 2024
Full Text: View/download PDF

30. Eliminating warping due to stamping by using pulsed radial magnetic force

Author: Du, Zhihao, Yang, Huan, Xiao, Ang, and Cui, Xiaohui
Published: 2023
Full Text: View/download PDF

31. A combined analysis of multi-omics data reveals the prognostic values and immunotherapy response of LAG3 in human cancers

Author: Peng, Jinwu, Du, Zhihao, Sun, Yuwei, and Zhou, Zhiyang
Published: 2023
Full Text: View/download PDF

32. On the Relationship Between Newton’s Law and Sports : Taking Track and Field as an Example

Author: Li, Kehan, Zhou, Enming, Tao, Fangzheng, Du, Zhihao, Striełkowski, Wadim, Editor-in-Chief, Black, Jessica M., Series Editor, Butterfield, Stephen A., Series Editor, Chang, Chi-Cheng, Series Editor, Cheng, Jiuqing, Series Editor, Dumanig, Francisco Perlas, Series Editor, Al-Mabuk, Radhi, Series Editor, Scheper-Hughes, Nancy, Series Editor, Urban, Mathias, Series Editor, Webb, Stephen, Series Editor, Sun, Jian, editor, Chew, Fong Peng, editor, Khan, Intakhab Alam, editor, and Jenks, Christopher, editor
Published: 2023
Full Text: View/download PDF

33. Polyelectrolyte microgels-enhanced double-crosslinking polyacrylamide hydrogel sensing with stretchable, transparent, and fast response

Author: Zhu, Longxiang, Pan, Yujing, Wu, Jiamin, Du, Zhihao, and Shao, Zhu-Bao
Published: 2023
Full Text: View/download PDF

34. The process and mechanical properties of Nb–Si–Mo–Ti alloy prepared by spark plasma sintering

Author: Du, Zhihao, Yang, Zengyan, Jia, Yong, Jiang, Shaosong, Lu, Zhen, Ma, Shibang, Li, Liang, and Zhang, Congzheng
Published: 2023
Full Text: View/download PDF

35. A Monaural Speech Enhancement Method for Robust Small-Footprint Keyword Spotting

Author: Gu, Yue, Du, Zhihao, Zhang, Hui, and Zhang, Xueliang
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Robustness against noise is critical for keyword spotting (KWS) in real-world environments. To improve the robustness, a speech enhancement front-end is involved. Instead of treating the speech enhancement as a separated preprocessing before the KWS system, in this study, a pre-trained speech enhancement front-end and a convolutional neural networks (CNNs) based KWS system are concatenated, where a feature transformation block is used to transform the output from the enhancement front-end into the KWS system's input. The whole model is trained jointly, thus the linguistic and other useful information from the KWS system can be back-propagated to the enhancement front-end to improve its performance. To fit the small-footprint device, a novel convolution recurrent network is proposed, which needs fewer parameters and computation and does not degrade performance. Furthermore, by changing the input features from the power spectrogram to Mel-spectrogram, less computation and better performance are obtained. our experimental results demonstrate that the proposed method significantly improves the KWS system with respect to noise robustness.
Published: 2019

36. Acoustic Scene Classification by Implicitly Identifying Distinct Sound Events

Author: Song, Hongwei, Han, Jiqing, Deng, Shiwen, and Du, Zhihao
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this paper, we propose a new strategy for acoustic scene classification (ASC) , namely recognizing acoustic scenes through identifying distinct sound events. This differs from existing strategies, which focus on characterizing global acoustical distributions of audio or the temporal evolution of short-term audio features, without analysis down to the level of sound events. To identify distinct sound events for each scene, we formulate ASC in a multi-instance learning (MIL) framework, where each audio recording is mapped into a bag-of-instances representation. Here, instances can be seen as high-level representations for sound events inside a scene. We also propose a MIL neural networks model, which implicitly identifies distinct instances (i.e., sound events). Furthermore, we propose two specially designed modules that model the multi-temporal scale and multi-modal natures of the sound events respectively. The experiments were conducted on the official development set of the DCASE2018 Task1 Subtask B, and our best-performing model improves over the official baseline by 9.4% (68.3% vs 58.9%) in terms of classification accuracy. This study indicates that recognizing acoustic scenes by identifying distinct sound events is effective and paves the way for future studies that combine this strategy with previous ones., Comment: code URL typo, code is available at https://github.com/hackerekcah/distinct-events-asc.git
Published: 2019
Full Text: View/download PDF

37. An RCT-reticulated meta-analysis of six MBE therapies affecting college students' negative psychology

Author: Li, Haojie, Du, Zhihao, Shen, Shunze, Du, Wenya, Kang, Junhao, and Gong, Deming
Published: 2023
Full Text: View/download PDF

38. Deformation and fracture behavior of 5052 aluminum alloy by electromagnetic-driven stamping

Author: Du, Zhihao, Cui, Xiaohui, Yang, Huan, and Xia, Wenzhen
Published: 2022
Full Text: View/download PDF

39. Effect of electromagnetic forming–heat treatment process on mechanical and corrosion properties of 2024 aluminum alloy

Author: Xiao, Ang, Lin, Yuhong, Huang, Changqing, Cui, Xiaohui, Yan, Ziqin, and Du, Zhihao
Published: 2023
Full Text: View/download PDF

40. Investigation of Monaural Front-End Processing for Robust ASR without Retraining or Joint-Training

Author: Du, Zhihao, Zhang, Xueliang, and Han, Jiqing
Subjects: Computer Science - Sound, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In recent years, monaural speech separation has been formulated as a supervised learning problem, which has been systematically researched and shown the dramatical improvement of speech intelligibility and quality for human listeners. However, it has not been well investigated whether the methods can be employed as the front-end processing and directly improve the performance of a machine listener, i.e., an automatic speech recognizer, without retraining or joint-training the acoustic model. In this paper, we explore the effectiveness of the independent front-end processing for the multi-conditional trained ASR on the CHiME-3 challenge. We find that directly feeding the enhanced features to ASR can make 36.40% and 11.78% relative WER reduction for the GMM-based and DNN-based ASR respectively. We also investigate the affect of noisy phase and generalization ability under unmatched noise condition., Comment: 5 pages, 0 figures, 4 tables, conference
Published: 2018

41. On the Relationship Between Newton’s Law and Sports

Author: Li, Kehan, primary, Zhou, Enming, additional, Tao, Fangzheng, additional, and Du, Zhihao, additional
Published: 2022
Full Text: View/download PDF

42. Investigation of electromagnetic incremental forming of single-curvature thin-walled aluminum alloy skins

Author: Du, Zhihao, Lin, Lei, Ye, Shengping, Cui, Xiaohui, Sun, Xiaoming, Zhang, Lei, and Yang, Huan
Published: 2022
Full Text: View/download PDF

43. Low-rise steel moment frames with energy-dissipating exterior walls: Computer modelling and seismic performance assessment

Author: Hou, Hetao, Du, Zhihao, Yan, Xuexue, Qu, Bing, Gao, Mengqi, Fang, Haibo, Zeng, Xiaozhen, and Xiong, Fangming
Published: 2022
Full Text: View/download PDF

44. Numerical simulation of the electromagnetic hot forming process of 7075-T6 aluminum alloy

Author: Huang, Changqing, Deng, Zanshi, Cui, Xiaohui, and Du, Zhihao
Published: 2022
Full Text: View/download PDF

45. The superplastic forming/diffusion bonding of TA7 titanium alloy for manufacturing hollow structure with stiffeners

Author: Du, Zhihao, Wang, Chunxu, Liu, Qing, Wang, Shan, Liu, Yang, and Wang, Guofeng
Published: 2022
Full Text: View/download PDF

46. Springback control and large skin manufacturing by high-speed vibration using electromagnetic forming

Author: Du, Zhihao, Yan, Ziqin, Cui, Xiaohui, Chen, Baoguo, Yu, Hailiang, Qiu, Dongyang, Xia, Wenzhen, and Deng, Zanshi
Published: 2022
Full Text: View/download PDF

47. Springback control with small vibration using electromagnetic forming

Author: Xia, Wenzhen, Cui, Xiaohui, Du, Zhihao, Deng, Zanshi, and Lin, Yuhong
Published: 2022
Full Text: View/download PDF

48. Effect of the aging process on the micro-structure & properties of 7075 aluminum alloy using electromagnetic bulging

Author: Du, Zhihao, Deng, Zanshi, Xiao, Ang, Cui, Xiaohui, Yu, Hailiang, and Feng, Zhuangzhuang
Published: 2021
Full Text: View/download PDF

49. Cyclic tests of steel tee energy absorbers for precast exterior wall panels in steel building frames

Author: Hou, Hetao, Yan, Xuexue, Qu, Bing, Du, Zhihao, and Lu, Yuxi
Published: 2021
Full Text: View/download PDF

50. Large ellipsoid parts manufacture using electromagnetic incremental forming with variable blankholder structure

Author: Cui, Xiaohui, Yan, Ziqin, Chen, Baoguo, Du, Zhihao, Xiao, Ang, and Yu, Hailiang
Published: 2021
Full Text: View/download PDF

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Category

Publication Type

Journal

Database

Publisher

249 results on '"Du, Zhihao"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources