Author: "Yan, Zhijie" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Yan, Zhijie"' showing total 333 results

1. CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

Author: Du, Zhihao, Chen, Qian, Zhang, Shiliang, Hu, Kai, Lu, Heng, Yang, Yexin, Hu, Hangrui, Zheng, Siqi, Gu, Yue, Ma, Ziyang, Gao, Zhifu, and Yan, Zhijie
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent years have witnessed a trend that large language model (LLM) based text-to-speech (TTS) emerges into the mainstream due to their high naturalness and zero-shot capacity. In this paradigm, speech signals are discretized into token sequences, which are modeled by an LLM with text as prompts and reconstructed by a token-based vocoder to waveforms. Obviously, speech tokens play a critical role in LLM-based TTS models. Current speech tokens are learned in an unsupervised manner, which lacks explicit semantic information and alignment to the text. In this paper, we propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder. Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis. Experimental results show that supervised semantic tokens significantly outperform existing unsupervised tokens in terms of content consistency and speaker similarity for zero-shot voice cloning. Moreover, we find that utilizing large-scale data further improves the synthesis performance, indicating the scalable capacity of CosyVoice. To the best of our knowledge, this is the first attempt to involve supervised speech tokens into TTS models., Comment: work in progress. arXiv admin note: substantial text overlap with arXiv:2407.04051
Published: 2024

2. FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

Author: An, Keyu, Chen, Qian, Deng, Chong, Du, Zhihao, Gao, Changfeng, Gao, Zhifu, Gu, Yue, He, Ting, Hu, Hangrui, Hu, Kai, Ji, Shengpeng, Li, Yabin, Li, Zerui, Lu, Heng, Luo, Haoneng, Lv, Xiang, Ma, Bin, Ma, Ziyang, Ni, Chongjia, Song, Changhe, Shi, Jiaqi, Shi, Xian, Wang, Hao, Wang, Wen, Wang, Yuxuan, Xiao, Zhangyu, Yan, Zhijie, Yang, Yexin, Zhang, Bin, Zhang, Qinglin, Zhang, Shiliang, Zhao, Nan, and Zheng, Siqi
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, speaking style, and speaker identity. SenseVoice-Small delivers exceptionally low-latency ASR for 5 languages, and SenseVoice-Large supports high-precision ASR for over 50 languages, while CosyVoice excels in multi-lingual voice generation, zero-shot in-context learning, cross-lingual voice cloning, and instruction-following capabilities. The models related to SenseVoice and CosyVoice have been open-sourced on Modelscope and Huggingface, along with the corresponding training, inference, and fine-tuning codes released on GitHub. By integrating these models with LLMs, FunAudioLLM enables applications such as speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration, thereby pushing the boundaries of voice interaction technology. Demos are available at https://fun-audio-llm.github.io, and the code can be accessed at https://github.com/FunAudioLLM., Comment: Work in progress. Authors are listed in alphabetical order by family name
Published: 2024

3. TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes

Author: Jin, Bu, Zheng, Yupeng, Li, Pengfei, Li, Weize, Zheng, Yuhang, Hu, Sujie, Liu, Xinyu, Zhu, Jinwei, Yan, Zhijie, Sun, Haiyang, Zhan, Kun, Jia, Peng, Long, Xiaoxiao, Chen, Yilun, and Zhao, Hao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: 3D dense captioning stands as a cornerstone in achieving a comprehensive understanding of 3D scenes through natural language. It has recently witnessed remarkable achievements, particularly in indoor settings. However, the exploration of 3D dense captioning in outdoor scenes is hindered by two major challenges: 1) the domain gap between indoor and outdoor scenes, such as dynamics and sparse visual inputs, makes it difficult to directly adapt existing indoor methods; 2) the lack of data with comprehensive box-caption pair annotations specifically tailored for outdoor scenes. To this end, we introduce the new task of outdoor 3D dense captioning. As input, we assume a LiDAR point cloud and a set of RGB images captured by the panoramic camera rig. The expected output is a set of object boxes with captions. To tackle this task, we propose the TOD3Cap network, which leverages the BEV representation to generate object box proposals and integrates Relation Q-Former with LLaMA-Adapter to generate rich captions for these objects. We also introduce the TOD3Cap dataset, the largest one to our knowledge for 3D dense captioning in outdoor scenes, which contains 2.3M descriptions of 64.3K outdoor objects from 850 scenes. Notably, our TOD3Cap network can effectively localize and caption 3D objects in outdoor scenes, which outperforms baseline methods by a significant margin (+9.6 CiDEr@0.5IoU). Code, data, and models are publicly available at https://github.com/jxbbb/TOD3Cap., Comment: Code, data, and models are publicly available at https://github.com/jxbbb/TOD3Cap
Published: 2024

4. Large Language Models Powered Context-aware Motion Prediction in Autonomous Driving

Author: Zheng, Xiaoji, Wu, Lixiu, Yan, Zhijie, Tang, Yuanrong, Zhao, Hao, Zhong, Chen, Chen, Bokui, and Gong, Jiangtao
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics, 68T45
Abstract: Motion prediction is among the most fundamental tasks in autonomous driving. Traditional methods of motion forecasting primarily encode vector information of maps and historical trajectory data of traffic participants, lacking a comprehensive understanding of overall traffic semantics, which in turn affects the performance of prediction tasks. In this paper, we utilized Large Language Models (LLMs) to enhance the global traffic context understanding for motion prediction tasks. We first conducted systematic prompt engineering, visualizing complex traffic environments and historical trajectory information of traffic participants into image prompts -- Transportation Context Map (TC-Map), accompanied by corresponding text prompts. Through this approach, we obtained rich traffic context information from the LLM. By integrating this information into the motion prediction model, we demonstrate that such context can enhance the accuracy of motion predictions. Furthermore, considering the cost associated with LLMs, we propose a cost-effective deployment strategy: enhancing the accuracy of motion prediction tasks at scale with 0.7% LLM-augmented datasets. Our research offers valuable insights into enhancing the understanding of traffic scenes of LLMs and the motion prediction performance of autonomous driving. The source code is available at https://github.com/AIR-DISCOVER/LLM-Augmented-MTR and https://aistudio.baidu.com/projectdetail/7809548., Comment: 6 pages, 4 figures
Published: 2024

5. Influence of Annealing Temperature on Microstructure and Magnetic Properties of FeSiNbCuBP Amorphous Alloys

Author: Zhu, Qianke, Liu, Yan, Zhu, Ziteng, Chen, Zhe, Kang, Shujie, Li, Wang, Zhang, Kewei, Hu, Jifan, and Yan, Zhijie
Published: 2024

6. Advancing VAD Systems Based on Multi-Task Learning with Improved Model Structures

Author: Zuo, Lingyun, An, Keyu, Zhang, Shiliang, and Yan, Zhijie
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In a speech recognition system, voice activity detection (VAD) is a crucial frontend module. Addressing the issues of poor noise robustness in traditional binary VAD systems based on DFSMN, the paper further proposes semantic VAD based on multi-task learning with improved models for real-time and offline systems, to meet specific application requirements. Evaluations on internal datasets show that, compared to the real-time VAD system based on DFSMN, the real-time semantic VAD system based on RWKV achieves relative decreases in CER of 7.0%, DCF of 26.1% and relative improvement in NRR of 19.2%. Similarly, when compared to the offline VAD system based on DFSMN, the offline VAD system based on SAN-M demonstrates relative decreases in CER of 4.4%, DCF of 18.6% and relative improvement in NRR of 3.5%.
Published: 2023

7. Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Author: Chu, Yunfei, Xu, Jin, Zhou, Xiaohuan, Yang, Qian, Zhang, Shiliang, Yan, Zhijie, Zhou, Chang, and Zhou, Jingren
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Recently, instruction-following audio-language models have received broad attention for audio interaction with humans. However, the absence of pre-trained audio models capable of handling diverse audio types and tasks has hindered progress in this field. Consequently, most existing works have only been able to support a limited range of interaction capabilities. In this paper, we develop the Qwen-Audio model and address this limitation by scaling up audio-language pre-training to cover over 30 tasks and various audio types, such as human speech, natural sounds, music, and songs, to facilitate universal audio understanding abilities. However, directly co-training all tasks and datasets can lead to interference issues, as the textual labels associated with different datasets exhibit considerable variations due to differences in task focus, language, granularity of annotation, and text structure. To overcome the one-to-many interference, we carefully design a multi-task training framework by conditioning on a sequence of hierarchical tags to the decoder for encouraging knowledge sharing and avoiding interference through shared and specified tags respectively. Remarkably, Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts. Building upon the capabilities of Qwen-Audio, we further develop Qwen-Audio-Chat, which allows for input from various audios and text inputs, enabling multi-turn dialogues and supporting various audio-central scenarios., Comment: The code, checkpoints and demo are released at https://github.com/QwenLM/Qwen-Audio
Published: 2023

8. LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT

Author: Du, Zhihao, Wang, Jiaming, Chen, Qian, Chu, Yunfei, Gao, Zhifu, Li, Zerui, Hu, Kai, Zhou, Xiaohuan, Xu, Jin, Ma, Ziyang, Wang, Wen, Zheng, Siqi, Zhou, Chang, Yan, Zhijie, and Zhang, Shiliang
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Generative Pre-trained Transformer (GPT) models have achieved remarkable performance on various natural language processing tasks, and have shown great potential as backbones for audio-and-text large language models (LLMs). Previous mainstream audio-and-text LLMs use discrete audio tokens to represent both input and output audio; however, they suffer from performance degradation on tasks such as automatic speech recognition, speech-to-text translation, and speech enhancement over models using continuous speech features. In this paper, we propose LauraGPT, a novel unified audio-and-text GPT-based LLM for audio recognition, understanding, and generation. LauraGPT is a versatile LLM that can process both audio and text inputs and generate outputs in either modalities. We propose a novel data representation that combines continuous and discrete features for audio: LauraGPT encodes input audio into continuous representations using an audio encoder and generates output audio from discrete codec codes. We propose a one-step codec vocoder to overcome the prediction challenge caused by the multimodal distribution of codec tokens. We fine-tune LauraGPT using supervised multi-task learning. Extensive experiments show that LauraGPT consistently achieves comparable to superior performance compared to strong baselines on a wide range of audio tasks related to content, semantics, paralinguistics, and audio-signal analysis, such as automatic speech recognition, speech-to-text translation, text-to-speech synthesis, speech enhancement, automated audio captioning, speech emotion recognition, and spoken language understanding., Comment: 10 pages, work in progress
Published: 2023

9. The second multi-channel multi-party meeting transcription challenge (M2MeT 2.0): A benchmark for speaker-attributed ASR

Author: Liang, Yuhao, Shi, Mohan, Yu, Fan, Li, Yangze, Zhang, Shiliang, Du, Zhihao, Chen, Qian, Xie, Lei, Qian, Yanmin, Wu, Jian, Chen, Zhuo, Lee, Kong Aik, Yan, Zhijie, and Bu, Hui
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: With the success of the first Multi-channel Multi-party Meeting Transcription challenge (M2MeT), the second M2MeT challenge (M2MeT 2.0) held in ASRU2023 particularly aims to tackle the complex task of speaker-attributed ASR (SA-ASR), which directly addresses the practical and challenging problem of "who spoke what at when" at typical meeting scenario. We particularly established two sub-tracks. The fixed training condition sub-track, where the training data is constrained to predetermined datasets, but participants can use any open-source pre-trained model. The open training condition sub-track, which allows for the use of all available data and models without limitation. In addition, we release a new 10-hour test set for challenge ranking. This paper provides an overview of the dataset, track settings, results, and analysis of submitted systems, as a benchmark to show the current state of speaker-attributed ASR., Comment: 8 pages, Accepted by ASRU2023
Published: 2023

10. Accurate and Reliable Confidence Estimation Based on Non-Autoregressive End-to-End Speech Recognition System

Author: Shi, Xian, Luo, Haoneng, Gao, Zhifu, Zhang, Shiliang, and Yan, Zhijie
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Estimating confidence scores for recognition results is a classic task in ASR field and of vital importance for kinds of downstream tasks and training strategies. Previous end-to-end (E2E) based confidence estimation models (CEM) predict score sequences of equal length with input transcriptions, leading to unreliable estimation when deletion and insertion errors occur. In this paper we proposed CIF-Aligned confidence estimation model (CA-CEM) to achieve accurate and reliable confidence estimation based on novel non-autoregressive E2E ASR model - Paraformer. CA-CEM utilizes the modeling character of continuous integrate-and-fire (CIF) mechanism to generate token-synchronous acoustic embedding, which solves the estimation failure issue above. We measure the quality of estimation with AUC and RMSE in token level and ECE-U - a proposed metric in utterance level. CA-CEM gains 24% and 19% relative reduction on ECE-U and also better AUC and RMSE on two test sets. Furthermore, we conduct analysis to explore the potential of CEM for different ASR related usage., Comment: 5 pages, 4 figures, Interspeech2023
Published: 2023

11. M²Sim: A Long-Term Interactive Driving Simulator

Author: Han, Zhengxiao, Yan, Zhijie, Li, Yang, Li, Pengfei, Shi, Yifeng, Luo, Nairui, Gao, Xu, Shi, Yongliang, Huang, Pengfei, Gong, Jiangtao, Zhou, Guyue, Chen, Yilun, Zhao, Hang, and Zhao, Hao
Editor: Goos, Gerhard (Founding Editor), Hartmanis, Juris (Founding Editor), Bertino, Elisa (Editorial Board Member), Gao, Wen (Editorial Board Member), Steffen, Bernhard (Editorial Board Member), Yung, Moti (Editorial Board Member), Fang, Lu, Pei, Jian, Zhai, Guangtao, and Wang, Ruiping
Published: 2024

12. Long-Term Interactive Driving Simulation: MPC to the Rescue

Author: Han, Zhengxiao, Yan, Zhijie, Li, Yang, Li, Pengfei, Shi, Yifeng, Luo, Nairui, Gao, Xu, Shi, Yongliang, Huang, Pengfei, Gong, Jiangtao, Zhou, Guyue, Chen, Yilun, Zhao, Hang, and Zhao, Hao
Editor: Goos, Gerhard (Founding Editor), Hartmanis, Juris (Founding Editor), Bertino, Elisa (Editorial Board Member), Gao, Wen (Editorial Board Member), Steffen, Bernhard (Editorial Board Member), Yung, Moti (Editorial Board Member), Fang, Lu, Pei, Jian, Zhai, Guangtao, and Wang, Ruiping
Published: 2024

13. Correction: Star Generative Adversarial VGG Network-Based Sample Augmentation for Insulator Defect Detection

Author: Zhang, Linghao, Wang, Luqing, Yan, Zhijie, Jia, Zhentang, Wang, Hongjun, and Tang, Xinyu
Published: 2024

14. Star Generative Adversarial VGG Network-Based Sample Augmentation for Insulator Defect Detection

Author: Zhang, Linghao, Wang, Luqing, Yan, Zhijie, Jia, Zhentang, Wang, Hongjun, and Tang, Xinyu
Published: 2024

15. Development of Mg–Zn–Mn–Y magnesium alloy with high thermal conductivity and compression properties via injection molding

Author: Song, Xin, Hu, Yong, Xue, Kaijiang, Wang, Yapeng, and Yan, Zhijie
Published: 2024

16. Effects of Addition of Yttrium on the Microstructure, Compression Properties and Corrosion Resistance of Injection Molding AZ91D-1.6Ca Magnesium Alloy

Author: Song, Xin, Hu, Yong, Tian, Jinsong, Wang, Yapeng, and Yan, Zhijie
Published: 2024

17. MUG: A General Meeting Understanding and Generation Benchmark

Author: Zhang, Qinglin, Deng, Chong, Liu, Jiaqing, Yu, Hai, Chen, Qian, Wang, Wen, Yan, Zhijie, Liu, Jinglin, Ren, Yi, and Zhao, Zhou
Subjects: Computer Science - Computation and Language
Abstract: Listening to long video/audio recordings from video conferencing and online courses for acquiring information is extremely inefficient. Even after ASR systems transcribe recordings into long-form spoken language documents, reading ASR transcripts only partly speeds up seeking information. It has been observed that a range of NLP applications, such as keyphrase extraction, topic segmentation, and summarization, significantly improve users' efficiency in grasping important information. The meeting scenario is among the most valuable scenarios for deploying these spoken language processing (SLP) capabilities. However, the lack of large-scale public meeting datasets annotated for these SLP tasks severely hinders their advancement. To prompt SLP advancement, we establish a large-scale general Meeting Understanding and Generation Benchmark (MUG) to benchmark the performance of a wide range of SLP tasks, including topic segmentation, topic-level and session-level extractive summarization and topic title generation, keyphrase extraction, and action item detection. To facilitate the MUG benchmark, we construct and release a large-scale meeting dataset for comprehensive long-form SLP development, the AliMeeting4MUG Corpus, which consists of 654 recorded Mandarin meeting sessions with diverse topic coverage, with manual annotations for SLP tasks on manual transcripts of meeting recordings. To the best of our knowledge, the AliMeeting4MUG Corpus is so far the largest meeting corpus in scale and facilitates most SLP tasks. In this paper, we provide a detailed introduction of this corpus, SLP tasks and evaluation methods, baseline systems and their performance., Comment: Paper accepted to the 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2023), Rhodes, Greece
Published: 2023

18. Overview of the ICASSP 2023 General Meeting Understanding and Generation Challenge (MUG)

Author: Zhang, Qinglin, Deng, Chong, Liu, Jiaqing, Yu, Hai, Chen, Qian, Wang, Wen, Yan, Zhijie, Liu, Jinglin, Ren, Yi, and Zhao, Zhou
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: ICASSP2023 General Meeting Understanding and Generation Challenge (MUG) focuses on prompting a wide range of spoken language processing (SLP) research on meeting transcripts, as SLP applications are critical to improve users' efficiency in grasping important information in meetings. MUG includes five tracks, including topic segmentation, topic-level and session-level extractive summarization, topic title generation, keyphrase extraction, and action item detection. To facilitate MUG, we construct and release a large-scale meeting dataset, the AliMeeting4MUG Corpus., Comment: Paper accepted to the 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2023), Rhodes, Greece
Published: 2023

19. Achieving Timestamp Prediction While Recognizing with Non-Autoregressive End-to-End ASR Model

Author: Shi, Xian, Chen, Yanni, Zhang, Shiliang, and Yan, Zhijie
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Conventional ASR systems use frame-level phoneme posterior to conduct force-alignment (FA) and provide timestamps, while end-to-end ASR systems especially AED based ones are short of such ability. This paper proposes to perform timestamp prediction (TP) while recognizing by utilizing continuous integrate-and-fire (CIF) mechanism in non-autoregressive ASR model - Paraformer. Focusing on the fire place bias issue of CIF, we conduct post-processing strategies including fire-delay and silence insertion. Besides, we propose to use scaled-CIF to smooth the weights of CIF output, which is proved beneficial for both ASR and TP task. Accumulated averaging shift (AAS) and diarization error rate (DER) are adopted to measure the quality of timestamps and we compare these metrics of proposed system and conventional hybrid force-alignment system. The experiment results over manually-marked timestamps testset show that the proposed optimization methods significantly improve the accuracy of CIF timestamps, reducing 66.7% and 82.1% of AAS and DER respectively. Comparing to Kaldi force-alignment trained with the same data, optimized CIF timestamps achieved 12.3% relative AAS reduction.
Published: 2023

20. M²Sim: A Long-Term Interactive Driving Simulator

Author: Han, Zhengxiao, Yan, Zhijie, Li, Yang, Li, Pengfei, Shi, Yifeng, Luo, Nairui, Gao, Xu, Shi, Yongliang, Huang, Pengfei, Gong, Jiangtao, Zhou, Guyue, Chen, Yilun, Zhao, Hang, and Zhao, Hao
Published: 2024

21. Long-Term Interactive Driving Simulation: MPC to the Rescue

Author: Han, Zhengxiao, Yan, Zhijie, Li, Yang, Li, Pengfei, Shi, Yifeng, Luo, Nairui, Gao, Xu, Shi, Yongliang, Huang, Pengfei, Gong, Jiangtao, Zhou, Guyue, Chen, Yilun, Zhao, Hang, and Zhao, Hao
Published: 2024

22. MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition

Author: Zhou, Xiaohuan, Wang, Jiaming, Cui, Zeyu, Zhang, Shiliang, Yan, Zhijie, Zhou, Jingren, and Zhou, Chang
Subjects: Computer Science - Multimedia, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this paper, we propose a novel multi-modal multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR), which employs both unlabeled speech and text data. The main difficulty in speech-text joint pre-training comes from the significant difference between speech and text modalities, especially for Mandarin speech and text. Unlike English and other languages with an alphabetic writing system, Mandarin uses an ideographic writing system where character and sound are not tightly mapped to one another. Therefore, we propose to introduce the phoneme modality into pre-training, which can help capture modality-invariant information between Mandarin speech and text. Specifically, we employ a multi-task learning framework including five self-supervised and supervised tasks with speech and text data. For end-to-end pre-training, we introduce self-supervised speech-to-pseudo-codes (S2C) and phoneme-to-text (P2T) tasks utilizing unlabeled speech and text data, where speech-pseudo-codes pairs and phoneme-text pairs are a supplement to the supervised speech-text pairs. To train the encoder to learn better speech representation, we introduce self-supervised masked speech prediction (MSP) and supervised phoneme prediction (PP) tasks to learn to map speech into phonemes. Besides, we directly add the downstream supervised speech-to-text (S2T) task into the pre-training process, which can further improve the pre-training performance and achieve better recognition results even without fine-tuning. Experiments on AISHELL-1 show that our proposed method achieves state-of-the-art performance, with a more than 40% relative improvement compared with other pre-training methods., Comment: Submitted to ICASSP 2023
Published: 2022

23. Speaker Overlap-aware Neural Diarization for Multi-party Meeting Analysis

Author: Du, Zhihao, Zhang, Shiliang, Zheng, Siqi, and Yan, Zhijie
Subjects: Computer Science - Sound, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recently, hybrid systems of clustering and neural diarization models have been successfully applied in multi-party meeting analysis. However, current models always treat overlapped speaker diarization as a multi-label classification problem, where speaker dependency and overlaps are not well considered. To overcome the disadvantages, we reformulate overlapped speaker diarization task as a single-label prediction problem via the proposed power set encoding (PSE). Through this formulation, speaker dependency and overlaps can be explicitly modeled. To fully leverage this formulation, we further propose the speaker overlap-aware neural diarization (SOND) model, which consists of a context-independent (CI) scorer to model global speaker discriminability, a context-dependent scorer (CD) to model local discriminability, and a speaker combining network (SCN) to combine and reassign speaker activities. Experimental results show that using the proposed formulation can outperform the state-of-the-art methods based on target speaker voice activity detection, and the performance can be further improved with SOND, resulting in a 6.30% relative diarization error reduction., Comment: Accepted by EMNLP 2022
Published: 2022

24. Minimizing heat generation in quantum dot light-emitting diodes by increasing quasi-Fermi-level splitting

Author: Gao, Yan, Li, Bo, Liu, Xiaonan, Shen, Huaibin, Song, Yang, Song, Jiaojiao, Yan, Zhijie, Yan, Xiaohan, Chong, Yihua, Yao, Ruyun, Wang, Shujie, Li, Lin Song, Fan, Fengjia, and Du, Zuliang
Published: 2023

25. Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition

Author: Gao, Zhifu, Zhang, Shiliang, McLoughlin, Ian, and Yan, Zhijie
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Transformers have recently dominated the ASR field. Although able to yield good performance, they involve an autoregressive (AR) decoder to generate tokens one by one, which is computationally inefficient. To speed up inference, non-autoregressive (NAR) methods, e.g. single-step NAR, were designed, to enable parallel generation. However, due to an independence assumption within the output tokens, performance of single-step NAR is inferior to that of AR models, especially with a large-scale corpus. There are two challenges to improving single-step NAR: Firstly to accurately predict the number of output tokens and extract hidden variables; secondly, to enhance modeling of interdependence between output tokens. To tackle both challenges, we propose a fast and accurate parallel transformer, termed Paraformer. This utilizes a continuous integrate-and-fire based predictor to predict the number of tokens and generate hidden variables. A glancing language model (GLM) sampler then generates semantic embeddings to enhance the NAR decoder's ability to model context interdependence. Finally, we design a strategy to generate negative samples for minimum word error rate training to further improve performance. Experiments using the public AISHELL-1, AISHELL-2 benchmark, and an industrial-level 20,000 hour task demonstrate that the proposed Paraformer can attain comparable performance to the state-of-the-art AR transformer, with more than 10x speedup., Comment: 5 pages, 3 figures, accepted by INTERSPEECH 2022
Published: 2022
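
Illustrative note: the continuous integrate-and-fire predictor named in the Paraformer abstract can be pictured with a scalar toy version. The sketch below is not the authors' implementation. It assumes per-frame weights between 0 and 1 and a firing threshold of 1.0, and the function name `cif_fire_frames` is hypothetical. It only tracks where the accumulator fires; the published mechanism also integrates encoder states, which is omitted here.

```python
def cif_fire_frames(alphas, threshold=1.0):
    """Return the frame indices at which the weight accumulator fires.

    Assumes every weight lies in [0, 1], so at most one firing can
    happen per frame. The number of firings is the token-count estimate.
    """
    fired = []
    acc = 0.0
    for t, alpha in enumerate(alphas):
        acc += alpha                 # integrate the frame weight
        if acc >= threshold:         # boundary reached: emit one token
            fired.append(t)
            acc -= threshold         # carry the remainder to the next token
    return fired


frames = cif_fire_frames([0.5, 0.5, 0.25, 0.75])
token_count = len(frames)
```

Under these assumptions the number of firings equals the number of predicted tokens, and the firing frame indices are the kind of positions that a timestamp-prediction step could build on.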

26. Speaker Embedding-aware Neural Diarization: an Efficient Framework for Overlapping Speech Diarization in Meeting Scenarios

Author: Du, Zhihao, Zhang, Shiliang, Zheng, Siqi, and Yan, Zhijie
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Overlapping speech diarization has been traditionally treated as a multi-label classification problem. In this paper, we reformulate this task as a single-label prediction problem by encoding multiple binary labels into a single label with the power set, which represents the possible combinations of target speakers. This formulation has two benefits. First, the overlaps of target speakers are explicitly modeled. Second, threshold selection is no longer needed. Through this formulation, we propose the speaker embedding-aware neural diarization (SEND) framework, where a speech encoder, a speaker encoder, two similarity scorers, and a post-processing network are jointly optimized to predict the encoded labels according to the similarities between speech features and speaker embeddings. Experimental results show that SEND has a stable learning process and can be trained on highly overlapped data without extra initialization. More importantly, our method achieves the state-of-the-art performance in real meeting scenarios with fewer model parameters and lower computational complexity., Comment: Submitted to INTERSPEECH 2022, 5 pages, 2 figures
Published: 2022

27. ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech

Author: Ren, Yi, Lei, Ming, Huang, Zhiying, Zhang, Shiliang, Chen, Qian, Yan, Zhijie, and Zhao, Zhou
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Sound
Abstract: Expressive text-to-speech (TTS) has become a hot research topic recently, mainly focusing on modeling prosody in speech. Prosody modeling has several challenges: 1) the extracted pitch used in previous prosody modeling works has inevitable errors, which hurt the prosody modeling; 2) different attributes of prosody (e.g., pitch, duration and energy) are dependent on each other and produce the natural prosody together; and 3) due to the high variability of prosody and the limited amount of high-quality data for TTS training, the distribution of prosody cannot be fully shaped. To tackle these issues, we propose ProsoSpeech, which enhances the prosody using quantized latent vectors pre-trained on large-scale unpaired and low-quality text and speech data. Specifically, we first introduce a word-level prosody encoder, which quantizes the low-frequency band of the speech and compresses prosody attributes in the latent prosody vector (LPV). Then we introduce an LPV predictor, which predicts the LPV given a word sequence. We pre-train the LPV predictor on large-scale text and low-quality speech data and fine-tune it on the high-quality TTS dataset. Finally, our model can generate expressive speech conditioned on the predicted LPV. Experimental results show that ProsoSpeech can generate speech with richer prosody compared with baseline methods., Comment: Accepted by ICASSP 2022
Published: 2022

28. Summary On The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Grand Challenge

Author: Yu, Fan, Zhang, Shiliang, Guo, Pengcheng, Fu, Yihui, Du, Zhihao, Zheng, Siqi, Huang, Weilong, Xie, Lei, Tan, Zheng-Hua, Wang, DeLiang, Qian, Yanmin, Lee, Kong Aik, Yan, Zhijie, Ma, Bin, Xu, Xin, and Bu, Hui
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The ICASSP 2022 Multi-channel Multi-party Meeting Transcription Grand Challenge (M2MeT) focuses on one of the most valuable and most challenging scenarios of speech technologies. The M2MeT challenge has set up two tracks, speaker diarization (track 1) and multi-speaker automatic speech recognition (ASR) (track 2). Along with the challenge, we released 120 hours of real-recorded Mandarin meeting speech data with manual annotation, including far-field data collected by an 8-channel microphone array as well as near-field data collected by each participant's headset microphone. We briefly describe the released dataset, track setups, and baselines, and summarize the challenge results and major techniques used in the submissions., Comment: Accepted by ICASSP 2022
Published: 2022

29. Focus review on γ′ coarsening in theorical development and application in Ni-base superalloys and high/medium-entropy alloys

Author: Chen, Yongan, Li, Dazhao, Yan, Zhijie, Bai, Shaobin, Xie, Ruofei, and Sheng, Jian
Published: 2024
Full Text: View/download PDF

30. Intergranular corrosion and stress corrosion cracking properties evaluation and mechanism study of multi-microalloyed 2519 Al alloy

Author: Qin, Jin, Zhao, Wang, Yan, Zhijie, Wang, Rui, Yu, Zhiqiang, Shi, Zhiyue, Liu, Chengzhi, and Wang, Bin
Published: 2024
Full Text: View/download PDF

31. M2MeT: The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Challenge

Author: Yu, Fan, Zhang, Shiliang, Fu, Yihui, Xie, Lei, Zheng, Siqi, Du, Zhihao, Huang, Weilong, Guo, Pengcheng, Yan, Zhijie, Ma, Bin, Xu, Xin, and Bu, Hui
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent development of speech processing, such as speech recognition, speaker diarization, etc., has inspired numerous applications of speech technologies. The meeting scenario is one of the most valuable and, at the same time, most challenging scenarios for the deployment of speech technologies. Specifically, two typical tasks, speaker diarization and multi-speaker automatic speech recognition, have attracted much attention recently. However, the lack of large public meeting data has been a major obstacle for the advancement of the field. Therefore, we make available the AliMeeting corpus, which consists of 120 hours of recorded Mandarin meeting data, including far-field data collected by an 8-channel microphone array as well as near-field data collected by headset microphone. Each meeting session is composed of 2-4 speakers with different speaker overlap ratios, recorded in rooms of different sizes. Along with the dataset, we launch the ICASSP 2022 Multi-channel Multi-party Meeting Transcription Challenge (M2MeT) with two tracks, namely speaker diarization and multi-speaker ASR, aiming to provide a common testbed for meeting rich transcription and promote reproducible research in this field. In this paper we provide a detailed introduction of the AliMeeting dataset, challenge rules, evaluation methods and baseline systems., Comment: Accepted by ICASSP 2022
Published: 2021

32. Engineering electronic structures and oxygen vacancies of manganese-doped nickel molybdate porous nanosheets for efficient oxygen evolution reaction

Author: Miao, Fang, Cui, Peng, Gu, Tao, Sun, Bo, and Yan, Zhijie
Published: 2024
Full Text: View/download PDF

33. BeamTransformer: Microphone Array-based Overlapping Speech Detection

Author: Zheng, Siqi, Zhang, Shiliang, Huang, Weilong, Chen, Qian, Suo, Hongbin, Lei, Ming, Feng, Jinwei, and Yan, Zhijie
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We propose BeamTransformer, an efficient architecture to leverage the beamformer's edge in spatial filtering and the transformer's capability in context sequence modeling. BeamTransformer seeks to optimize modeling of the sequential relationship among signals from different spatial directions. Overlapping speech detection is one of the tasks where such optimization is favorable. In this paper we apply BeamTransformer to detect overlapping segments. Compared to a single-channel approach, BeamTransformer excels at learning to identify the relationship among different beam sequences and is hence able to make predictions not only from the acoustic signals but also from the localization of the source. The results indicate that a successful incorporation of microphone array signals can lead to remarkable gains. Moreover, BeamTransformer takes one step further, as speech from overlapped speakers has been internally separated into different beams.
Published: 2021

34. A Real-time Speaker Diarization System Based on Spatial Spectrum

Author: Zheng, Siqi, Huang, Weilong, Wang, Xianliang, Suo, Hongbin, Feng, Jinwei, and Yan, Zhijie
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this paper we describe a speaker diarization system that enables localization and identification of all speakers present in a conversation or meeting. We propose a novel systematic approach to tackle several long-standing challenges in speaker diarization tasks: (1) to segment and separate overlapping speech from two speakers; (2) to estimate the number of speakers when participants may enter or leave the conversation at any time; (3) to provide accurate speaker identification on short text-independent utterances; (4) to track speakers' movement during the conversation; (5) to detect speaker change incidence in real time. First, a differential directional microphone array-based approach is exploited to capture the target speakers' voices in far-field adverse environments. Second, an online speaker-location joint clustering approach is proposed to keep track of speaker location. Third, an instant speaker number detector is developed to trigger the mechanism that separates overlapped speech. The results suggest that our system effectively incorporates spatial information and achieves significant gains., Comment: Published in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Published: 2021
Full Text: View/download PDF

35. Effect of titanium on the wettability between Fe-P-Ti alloy and Al2O3 substrate

Author: Wang, Rui, Bai, Shengjie, Chen, Xu, Xie, Likui, Yu, Zhiqiang, Kang, Yang, Li, Yihong, Hu, Yong, Shi, Zhiyue, and Yan, Zhijie
Published: 2024
Full Text: View/download PDF

36. Surface nanoprecipitation induced by severe plastic deformation in the Fe49.3Co23Ni23C0.85Mn1Si2.85 biphasic multicomponent alloy

Author: Liu, Pingping, Zhang, Mingzhi, Kou, Zongde, Gao, Qingwei, Gong, Jianhong, Yan, Zhijie, Lv, Wenquan, Xie, Meiting, and Song, Kaikai
Published: 2024
Full Text: View/download PDF

37. FeCoNiMgB high-entropy boride powder with a fluffy cotton structure and enhanced activity in the oxygen evolution reaction

Author: Miao, Fang, Cui, Peng, Jing, Zhiyuan, Wu, Wei, Zhang, Zhibin, Gu, Tao, Yan, Zhijie, and Liang, Xiubing
Published: 2024
Full Text: View/download PDF

38. Streaming Chunk-Aware Multihead Attention for Online End-to-End Speech Recognition

Author: Zhang, Shiliang, Gao, Zhifu, Luo, Haoneng, Lei, Ming, Gao, Jie, Yan, Zhijie, and Xie, Lei
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recently, streaming end-to-end automatic speech recognition (E2E-ASR) has gained more and more attention. Much effort has been devoted to turning the non-streaming attention-based E2E-ASR system into a streaming architecture. In this work, we propose a novel online E2E-ASR system by using Streaming Chunk-Aware Multihead Attention (SCAMA) and a latency control memory equipped self-attention network (LC-SAN-M). LC-SAN-M uses chunk-level input to control the latency of the encoder. As to SCAMA, a jointly trained predictor is used to control the output of the encoder when feeding to the decoder, which enables the decoder to generate output in a streaming manner. Experimental results on the open 170-hour AISHELL-1 and an industrial-level 20000-hour Mandarin speech recognition task show that our approach can significantly outperform the MoChA-based baseline system under a comparable setup. On the AISHELL-1 task, our proposed method achieves a character error rate (CER) of 7.39%, which, to the best of our knowledge, is the best published performance for online ASR., Comment: submitted to INTERSPEECH 2020
Published: 2020

39. Effect of P and Ti on the agglomeration behavior of Al2O3 inclusions in Fe–P–Ti alloys

Author: Dong, Siyue, Wang, Rui, Xie, Likui, Kang, Yan, Li, Yihong, Fan, Jing, Yu, Zhiqiang, and Yan, Zhijie
Subjects: inclusions, agglomeration, if steel with high p, capillary force, Technology, Chemical technology, TP1-1185, Chemicals: Manufacture, use, etc., TP200-248
Abstract: Nozzle clogging occurs in interstitial free (IF) steel with high phosphorus (P) more frequently than in IF steel with lower P. To explore the effect of P and Ti on the inclusion behavior in liquid steel, in situ experiments and theoretical calculations were conducted. High-temperature confocal laser scanning microscopy was used to observe in situ the inclusion behavior at the liquid Fe–P–Ti alloy surfaces, and the attractive and capillary forces were also calculated to quantitatively estimate the effect of P and Ti on the inclusion behavior. The results show that the agglomeration of Al2O3 inclusions involves four steps: dispersion of Al2O3 particles in the liquid alloy; formation of an Al2O3 chain structure; bending of the Al2O3 chain; and sintering and densification of the chain structure. The addition of Ti and P to the steel can increase the agglomeration time of inclusions, indicating the impeding effect of P and Ti on the inclusion aggregation. Furthermore, the orientation factor is proposed to estimate the direction of movement of a small inclusion crossing between large inclusions, and the experimental results confirm its validity.
Published: 2023
Full Text: View/download PDF

40. Neural Zero-Inflated Quality Estimation Model For Automatic Speech Recognition System

Author: Fan, Kai, Wang, Jiayi, Li, Bo, Zhang, Shiliang, Chen, Boxing, Ge, Niyu, and Yan, Zhijie
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The performance of automatic speech recognition (ASR) systems is usually evaluated by the word error rate (WER) metric when manually transcribed data are provided, which are, however, expensive to obtain in real scenarios. In addition, the empirical distribution of WER for most ASR systems usually tends to put a significant mass near zero, making it difficult to simulate with a single continuous distribution. In order to address these two issues of ASR quality estimation (QE), we propose a novel neural zero-inflated model to predict the WER of the ASR result without transcripts. We design a neural zero-inflated beta regression on top of a bidirectional transformer language model conditional on speech features (speech-BERT). We adopt the pre-training strategy of token-level masked language modeling for speech-BERT as well, and further fine-tune with our zero-inflated layer for the mixture of discrete and continuous outputs. The experimental results show that our approach achieves better performance on WER prediction in the metrics of Pearson and MAE, compared with most existing quality estimation algorithms for ASR or machine translation., Comment: InterSpeech 2020
Published: 2019

41. Influences of partial substitution of C by N on the microstructure and mechanical properties of 9Cr18Mo martensitic stainless steel

Author: Wang, Rui, Li, Fenghao, Yu, Zhiqiang, Kang, Yan, Li, Meng, Hu, Yong, An, Haoran, Fan, Jing, Miao, Fang, Zhao, Yuhong, Eckert, Jürgen, and Yan, Zhijie
Published: 2023
Full Text: View/download PDF

42. Study on thermal stability of 2519(Sc) Al alloy and establishment of coarsening model

Author: Qin, Jin, Yan, Zhijie, Wei, Qirong, and Wang, Bin
Published: 2023
Full Text: View/download PDF

43. Workability optimization and microstructure regulation for a multi-angular extrusion process of AZ80 magnesium alloy

Author: Su, Zexing, Sun, Chaoyang, Qian, Lingyun, Liu, Chengzhi, Yan, Zhijie, Zhang, Li, Wang, Rui, and Liu, Yanlian
Published: 2023
Full Text: View/download PDF

44. A multi-angular extrusion process for fine-grain magnesium alloy plate

Author: Su, Zexing, Sun, Chaoyang, Qian, Lingyun, Liu, Chengzhi, Wang, Zhijian, Zhang, Li, and Yan, Zhijie
Published: 2023
Full Text: View/download PDF

45. Microstructural evolution and mechanical properties of 6Cr13 martensitic stainless steel subjected to cold rolling and heat treatments

Author: Wang, Rui, Yan, Zhijie, He, Jie, Fan, Weihui, Li, Yihong, Hu, Yong, Kang, Yan, Fan, Jing, Yu, Zhiqiang, Zhao, Yuhong, and Eckert, Jürgen
Published: 2023
Full Text: View/download PDF

46. Self-ignition of amorphous alloys activated by exothermic crystallization

Author: Yan, Zhijie, Song, Kaikai, Hu, Yong, Dai, Fuping, and Eckert, Jürgen
Published: 2023
Full Text: View/download PDF

47. Achieving Timestamp Prediction While Recognizing with Non-autoregressive End-to-End ASR Model

Author: Shi, Xian, primary, Chen, Yanni, additional, Zhang, Shiliang, additional, and Yan, Zhijie, additional
Published: 2023
Full Text: View/download PDF

48. Automatic Spelling Correction with Transformer for CTC-based End-to-End Speech Recognition

Author: Zhang, Shiliang, Lei, Ming, and Yan, Zhijie
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Neural and Evolutionary Computing, Computer Science - Sound
Abstract: Connectionist Temporal Classification (CTC) based end-to-end speech recognition systems usually need to incorporate an external language model by using WFST-based decoding in order to achieve promising results. This is more essential to Mandarin speech recognition since it has a special phenomenon, namely homophones, which causes a lot of substitution errors. The linguistic information introduced by the language model helps to resolve these substitution errors. In this work, we propose a transformer-based spelling correction model to automatically correct errors, especially the substitution errors made by a CTC-based Mandarin speech recognition system. Specifically, we investigate using the recognition results generated by CTC-based systems as input and the ground-truth transcriptions as output to train a transformer with encoder-decoder architecture, which is much like machine translation. Results on a 20,000-hour Mandarin speech recognition task show that the proposed spelling correction model can achieve a CER of 3.41%, which results in 22.9% and 53.2% relative improvement compared to the baseline CTC-based systems decoded with and without a language model, respectively., Comment: 6 pages, 5 figures
Published: 2019

49. Surface Defect Detection and Classification Based on Fusing Multiple Computer Vision Techniques

Author: Zhu, Min, Shen, Bingqing, Sun, Yan, Wang, Chongyu, Hou, Guoxin, Yan, Zhijie, Cai, Hongming, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Fujita, Hamido, editor, Fournier-Viger, Philippe, editor, Ali, Moonis, editor, and Wang, Yinglin, editor
Published: 2022
Full Text: View/download PDF

50. Promoting knowledge recommendation in innovative engineering design: a BERT-GAT-based patent representation learning approach

Author: Li, Mingrui, primary, Wang, Zuoxu, additional, Yan, Zhijie, additional, Liang, Xinxin, additional, and Liu, Jihong, additional
Published: 2024
Full Text: View/download PDF
