Author: "Meng, Lingwei" / Database: arXiv - Searchworks@Jio Institute Digital Library Search Results

Search keyword "Meng, Lingwei": 21 results

1. ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation

Author: Li, Zongyi, Hu, Shujie, Liu, Shujie, Zhou, Long, Choi, Jeongsoo, Meng, Lingwei, Guo, Xun, Li, Jinyu, Ling, Hefei, and Wei, Furu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Text-to-video models have recently undergone rapid and substantial advancements. Nevertheless, due to limitations in data and computational resources, achieving efficient generation of long videos with rich motion dynamics remains a significant challenge. To generate high-quality, dynamic, and temporally consistent long videos, this paper presents ARLON, a novel framework that boosts diffusion Transformers with autoregressive models for long video generation, by integrating the coarse spatial and long-range temporal information provided by the AR model to guide the DiT model. Specifically, ARLON incorporates several key innovations: 1) A latent Vector Quantized Variational Autoencoder (VQ-VAE) compresses the input latent space of the DiT model into compact visual tokens, bridging the AR and DiT models and balancing the learning complexity and information density; 2) An adaptive norm-based semantic injection module integrates the coarse discrete visual units from the AR model into the DiT model, ensuring effective guidance during video generation; 3) To enhance the tolerance capability of noise introduced from the AR inference, the DiT model is trained with coarser visual latent tokens incorporated with an uncertainty sampling module. Experimental results demonstrate that ARLON significantly outperforms the baseline OpenSora-V1.2 on eight out of eleven metrics selected from VBench, with notable improvements in dynamic degree and aesthetic quality, while delivering competitive results on the remaining three and simultaneously accelerating the generation process. In addition, ARLON achieves state-of-the-art performance in long video generation. Detailed analyses of the improvements in inference efficiency are presented, alongside a practical application that demonstrates the generation of long videos using progressive text prompts. See demos of ARLON at http://aka.ms/arlon.
Published: 2024

2. Towards Within-Class Variation in Alzheimer's Disease Detection from Spontaneous Speech

Author: Kang, Jiawen, Han, Dongrui, Meng, Lingwei, Zhou, Jingyan, Li, Jinchao, Wu, Xixin, and Meng, Helen
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Sound, Quantitative Biology - Neurons and Cognition
Abstract: Alzheimer's Disease (AD) detection has emerged as a promising research area that employs machine learning classification models to distinguish between individuals with AD and those without. Unlike conventional classification tasks, we identify within-class variation as a critical challenge in AD detection: individuals with AD exhibit a spectrum of cognitive impairments. Given that many AD detection tasks lack fine-grained labels, simplistic binary classification may overlook two crucial aspects: within-class differences and instance-level imbalance. The former compels the model to map AD samples with varying degrees of impairment to a single diagnostic label, disregarding certain changes in cognitive function, while the latter biases the model towards overrepresented severity levels. This work presents early efforts to address these challenges. We propose two novel methods: Soft Target Distillation (SoTD) and Instance-level Re-balancing (InRe), targeting two problems respectively. Experiments on the ADReSS and ADReSSo datasets demonstrate that the proposed methods significantly improve detection accuracy. Further analysis reveals that SoTD effectively harnesses the strengths of multiple component models, while InRe substantially alleviates model over-fitting. These findings provide insights for developing more robust and reliable AD detection models.
Published: 2024

3. Disentangling Speakers in Multi-Talker Speech Recognition with Speaker-Aware CTC

Author: Kang, Jiawen, Meng, Lingwei, Cui, Mingyu, Wang, Yuejiao, Wu, Xixin, Liu, Xunying, and Meng, Helen
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Sound
Abstract: Multi-talker speech recognition (MTASR) faces unique challenges in disentangling and transcribing overlapping speech. To address these challenges, this paper investigates the role of Connectionist Temporal Classification (CTC) in speaker disentanglement when incorporated with Serialized Output Training (SOT) for MTASR. Our visualization reveals that CTC guides the encoder to represent different speakers in distinct temporal regions of acoustic embeddings. Leveraging this insight, we propose a novel Speaker-Aware CTC (SACTC) training objective, based on the Bayes risk CTC framework. SACTC is a tailored CTC variant for multi-talker scenarios; it explicitly models speaker disentanglement by constraining the encoder to represent different speakers' tokens at specific time frames. When integrated with SOT, the SOT-SACTC model consistently outperforms standard SOT-CTC across various degrees of speech overlap. Specifically, we observe relative word error rate reductions of 10% overall and 15% on low-overlap speech. This work represents an initial exploration of CTC-based enhancements for MTASR tasks, offering a new perspective on speaker disentanglement in multi-talker speech recognition.
Published: 2024

4. Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions

Author: Meng, Lingwei, Hu, Shujie, Kang, Jiawen, Li, Zhaoqing, Wang, Yuejiao, Wu, Wenxuan, Wu, Xixin, Liu, Xunying, and Meng, Helen
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent advancements in large language models (LLMs) have revolutionized various domains, bringing significant progress and new opportunities. Despite progress in speech-related tasks, LLMs have not been sufficiently explored in multi-talker scenarios. In this work, we present a pioneering effort to investigate the capability of LLMs in transcribing speech in multi-talker environments, following versatile instructions related to multi-talker automatic speech recognition (ASR), target talker ASR, and ASR based on specific talker attributes such as sex, occurrence order, language, and keyword spoken. Our approach utilizes WavLM and Whisper encoder to extract multi-faceted speech representations that are sensitive to speaker characteristics and semantic context. These representations are then fed into an LLM fine-tuned using LoRA, enabling the capabilities for speech comprehension and transcription. Comprehensive experiments reveal the promising performance of our proposed system, MT-LLM, in cocktail party scenarios, highlighting the potential of LLM to handle speech-related tasks based on user instructions in such complex settings.
Published: 2024

5. LibriheavyMix: A 20,000-Hour Dataset for Single-Channel Reverberant Multi-Talker Speech Separation, ASR and Speaker Diarization

Author: Jin, Zengrui, Yang, Yifan, Shi, Mohan, Kang, Wei, Yang, Xiaoyu, Yao, Zengwei, Kuang, Fangjun, Guo, Liyong, Meng, Lingwei, Lin, Long, Xu, Yong, Zhang, Shi-Xiong, and Povey, Daniel
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The evolving speech processing landscape is increasingly focused on complex scenarios like meetings or cocktail parties with multiple simultaneous speakers and far-field conditions. Existing methodologies for addressing these challenges fall into two categories: multi-channel and single-channel solutions. Single-channel approaches, notable for their generality and convenience, do not require specific information about microphone arrays. This paper presents a large-scale far-field overlapping speech dataset, crafted to advance research in speech separation, recognition, and speaker diarization. This dataset is a critical resource for decoding "Who said What and When" in multi-talker, reverberant environments, a daunting challenge in the field. Additionally, we introduce a pipeline system encompassing speech separation, recognition, and diarization as a foundational benchmark. Evaluations on the WHAMR! dataset validate the broad applicability of the proposed data.
Comment: InterSpeech 2024
Published: 2024

6. Large Language Model-based FMRI Encoding of Language Functions for Subjects with Neurocognitive Disorder

Author: Wang, Yuejiao, Gong, Xianmin, Meng, Lingwei, Wu, Xixin, and Meng, Helen
Subjects: Quantitative Biology - Neurons and Cognition, Computer Science - Computation and Language
Abstract: Functional magnetic resonance imaging (fMRI) is essential for developing encoding models that identify functional changes in language-related brain areas of individuals with Neurocognitive Disorders (NCD). While large language model (LLM)-based fMRI encoding has shown promise, existing studies predominantly focus on healthy, young adults, overlooking older NCD populations and cognitive level correlations. This paper explores language-related functional changes in older NCD adults using LLM-based fMRI encoding and brain scores, addressing current limitations. We analyze the correlation between brain scores and cognitive scores at both whole-brain and language-related ROI levels. Our findings reveal that higher cognitive abilities correspond to better brain scores, with correlations peaking in the middle temporal gyrus. This study highlights the potential of fMRI encoding models and brain scores for detecting early functional changes in NCD patients.
Comment: 5 pages, accepted by Interspeech 2024
Published: 2024

7. Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System

Author: Meng, Lingwei, Kang, Jiawen, Wang, Yuejiao, Jin, Zengrui, Wu, Xixin, Liu, Xunying, and Meng, Helen
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Multi-talker speech recognition and target-talker speech recognition, both of which involve transcription in multi-talker contexts, remain significant challenges. However, existing methods rarely attempt to simultaneously address both tasks. In this study, we propose a pioneering approach to empower Whisper, which is a speech foundation model, to tackle joint multi-talker and target-talker speech recognition tasks. Specifically, (i) we freeze Whisper and plug a Sidecar separator into its encoder to separate mixed embedding for multiple talkers; (ii) a Target Talker Identifier is introduced to identify the embedding flow of the target talker on the fly, requiring only three-second enrollment speech as a cue; (iii) soft prompt tuning for the decoder is explored for better task adaptation. Our method outperforms previous methods on two- and three-talker LibriMix and LibriSpeechMix datasets for both tasks, and delivers acceptable zero-shot performance on multi-talker ASR on the AishellMix Mandarin dataset.
Comment: Accepted to INTERSPEECH 2024
Published: 2024

8. Autoregressive Speech Synthesis without Vector Quantization

Author: Meng, Lingwei, Zhou, Long, Liu, Shujie, Chen, Sanyuan, Han, Bing, Hu, Shujie, Liu, Yanqing, Li, Jinyu, Zhao, Sheng, Wu, Xixin, Meng, Helen, and Wei, Furu
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We present MELLE, a novel continuous-valued tokens based language modeling approach for text to speech synthesis (TTS). MELLE autoregressively generates continuous mel-spectrogram frames directly from text condition, bypassing the need for vector quantization, which was originally designed for audio compression and sacrifices fidelity compared to mel-spectrograms. Specifically, (i) instead of cross-entropy loss, we apply regression loss with a proposed spectrogram flux loss function to model the probability distribution of the continuous-valued tokens; (ii) we incorporate variational inference into MELLE to facilitate sampling mechanisms, thereby enhancing the output diversity and model robustness. Experiments demonstrate that, compared to the two-stage codec language models VALL-E and its variants, the single-stage MELLE mitigates robustness issues by avoiding the inherent flaws of sampling discrete codes, achieves superior performance across multiple metrics, and, most importantly, offers a more streamlined paradigm. See https://aka.ms/melle for demos of our work.
Published: 2024

9. VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment

Author: Han, Bing, Zhou, Long, Liu, Shujie, Chen, Sanyuan, Meng, Lingwei, Qian, Yanming, Liu, Yanqing, Zhao, Sheng, Li, Jinyu, and Wei, Furu
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: With the help of discrete neural audio codecs, large language models (LLMs) have increasingly been recognized as a promising methodology for zero-shot Text-to-Speech (TTS) synthesis. However, sampling based decoding strategies bring astonishing diversity to generation, but also pose robustness issues such as typos, omissions and repetition. In addition, the high sampling rate of audio also brings huge computational overhead to the inference process of autoregression. To address these issues, we propose VALL-E R, a robust and efficient zero-shot TTS system, building upon the foundation of VALL-E. Specifically, we introduce a phoneme monotonic alignment strategy to strengthen the connection between phonemes and acoustic sequence, ensuring a more precise alignment by constraining the acoustic tokens to match their associated phonemes. Furthermore, we employ a codec-merging approach to downsample the discrete codes in shallow quantization layer, thereby accelerating the decoding speed while preserving the high quality of speech output. Benefiting from these strategies, VALL-E R obtains controllability over phonemes and demonstrates its strong robustness by approaching the WER of ground truth. In addition, it requires fewer autoregressive steps, with over 60% time reduction during inference. This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia. Audio samples will be available at: https://aka.ms/valler.
Comment: 15 pages, 5 figures
Published: 2024

10. WavLLM: Towards Robust and Adaptive Speech Large Language Model

Author: Hu, Shujie, Zhou, Long, Liu, Shujie, Chen, Sanyuan, Meng, Lingwei, Hao, Hongkun, Pan, Jing, Liu, Xunying, Li, Jinyu, Sivasankaran, Sunit, Liu, Linquan, and Wei, Furu
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The recent advancements in large language models (LLMs) have revolutionized the field of natural language processing, progressively broadening their scope to multimodal perception and generation. However, effectively integrating listening capabilities into LLMs poses significant challenges, particularly with respect to generalizing across varied contexts and executing complex auditory tasks. In this work, we introduce WavLLM, a robust and adaptive speech large language model with dual encoders, and a prompt-aware LoRA weight adapter, optimized by a two-stage curriculum learning approach. Leveraging dual encoders, we decouple different types of speech information, utilizing a Whisper encoder to process the semantic content of speech, and a WavLM encoder to capture the unique characteristics of the speaker's identity. Within the curriculum learning framework, WavLLM first builds its foundational capabilities by optimizing on mixed elementary single tasks, followed by advanced multi-task training on more complex tasks such as combinations of the elementary tasks. To enhance the flexibility and adherence to different tasks and instructions, a prompt-aware LoRA weight adapter is introduced in the second advanced multi-task training stage. We validate the proposed model on universal speech benchmarks including tasks such as ASR, ST, SV, ER, and also apply it to specialized datasets like Gaokao English listening comprehension set for SQA, and speech Chain-of-Thought (CoT) evaluation set. Experiments demonstrate that the proposed model achieves state-of-the-art performance across a range of speech tasks on the same model size, exhibiting robust generalization capabilities in executing complex tasks using CoT approach. Furthermore, our model successfully completes Gaokao tasks without specialized training. The codes, models, audio, and Gaokao evaluation set can be accessed at aka.ms/wavllm.
Comment: accepted by EMNLP2024 findings
Published: 2024

11. UNIT-DSR: Dysarthric Speech Reconstruction System Using Speech Unit Normalization

Author: Wang, Yuejiao, Wu, Xixin, Wang, Disong, Meng, Lingwei, and Meng, Helen
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Dysarthric speech reconstruction (DSR) systems aim to automatically convert dysarthric speech into normal-sounding speech. The technology eases communication with speakers affected by the neuromotor disorder and enhances their social inclusion. NED-based (Neural Encoder-Decoder) systems have significantly improved the intelligibility of the reconstructed speech as compared with GAN-based (Generative Adversarial Network) approaches, but the approach is still limited by training inefficiency caused by the cascaded pipeline and auxiliary tasks of the content encoder, which may in turn affect the quality of reconstruction. Inspired by self-supervised speech representation learning and discrete speech units, we propose a Unit-DSR system, which harnesses the powerful domain-adaptation capacity of HuBERT for training efficiency improvement and utilizes speech units to constrain the dysarthric content restoration in a discrete linguistic space. Compared with NED approaches, the Unit-DSR system only consists of a speech unit normalizer and a Unit HiFi-GAN vocoder, which is considerably simpler without cascaded sub-modules or auxiliary tasks. Results on the UASpeech corpus indicate that Unit-DSR outperforms competitive baselines in terms of content restoration, reaching a 28.2% relative average word error rate reduction when compared to original dysarthric speech, and shows robustness against speed perturbation and noise.
Comment: Accepted to ICASSP 2024
Published: 2024

12. Cross-Speaker Encoding Network for Multi-Talker Speech Recognition

Author: Kang, Jiawen, Meng, Lingwei, Cui, Mingyu, Guo, Haohan, Wu, Xixin, Liu, Xunying, and Meng, Helen
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: End-to-end multi-talker speech recognition has garnered great interest as an effective approach to directly transcribe overlapped speech from multiple speakers. Current methods typically adopt either 1) single-input multiple-output (SIMO) models with a branched encoder, or 2) single-input single-output (SISO) models based on attention-based encoder-decoder architecture with serialized output training (SOT). In this work, we propose a Cross-Speaker Encoding (CSE) network to address the limitations of SIMO models by aggregating cross-speaker representations. Furthermore, the CSE model is integrated with SOT to leverage both the advantages of SIMO and SISO while mitigating their drawbacks. To the best of our knowledge, this work represents an early effort to integrate SIMO and SISO for multi-talker speech recognition. Experiments on the two-speaker LibrispeechMix dataset show that the CSE model reduces word error rate (WER) by 8% over the SIMO baseline. The CSE-SOT model reduces WER by 10% overall and by 16% on high-overlap speech compared to the SOT model. Code is available at https://github.com/kjw11/CSEnet-ASR.
Comment: Accepted by ICASSP2024
Published: 2024

13. Unified Modeling of Multi-Talker Overlapped Speech Recognition and Diarization with a Sidecar Separator

Author: Meng, Lingwei, Kang, Jiawen, Cui, Mingyu, Wu, Haibin, Wu, Xixin, and Meng, Helen
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Multi-talker overlapped speech poses a significant challenge for speech recognition and diarization. Recent research indicated that these two tasks are inter-dependent and complementary, motivating us to explore a unified modeling method to address them in the context of overlapped speech. A recent study proposed a cost-effective method to convert a single-talker automatic speech recognition (ASR) system into a multi-talker one, by inserting a Sidecar separator into the frozen well-trained ASR model. Extending on this, we incorporate a diarization branch into the Sidecar, allowing for unified modeling of both ASR and diarization with a negligible overhead of only 768 parameters. The proposed method yields better ASR results compared to the baseline on LibriMix and LibriSpeechMix datasets. Moreover, without sophisticated customization on the diarization task, our method achieves acceptable diarization results on the two-speaker subset of CALLHOME with only a few adaptation steps.
Comment: Accepted to INTERSPEECH 2023
Published: 2023

14. The defender's perspective on automatic speaker verification: An overview

Author: Wu, Haibin, Kang, Jiawen, Meng, Lingwei, Meng, Helen, and Lee, Hung-yi
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Automatic speaker verification (ASV) plays a critical role in security-sensitive environments. Regrettably, the reliability of ASV has been undermined by the emergence of spoofing attacks, such as replay and synthetic speech, as well as adversarial attacks and the relatively new partially fake speech. While there are several review papers that cover replay and synthetic speech, and adversarial attacks, there is a notable gap in a comprehensive review that addresses defense against adversarial attacks and the recently emerged partially fake speech. Thus, the aim of this paper is to provide a thorough and systematic overview of the defense methods used against these types of attacks.
Comment: Accepted to IJCAI 2023 Workshop
Published: 2023

15. A Sidecar Separator Can Convert a Single-Talker Speech Recognition System to a Multi-Talker One

Author: Meng, Lingwei, Kang, Jiawen, Cui, Mingyu, Wang, Yuejiao, Wu, Xixin, and Meng, Helen
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Although automatic speech recognition (ASR) can perform well in common non-overlapping environments, sustaining performance in multi-talker overlapping speech recognition remains challenging. Recent research revealed that an ASR model's encoder captures different levels of information with different layers -- the lower layers tend to have more acoustic information, and the upper layers more linguistic. This inspires us to develop a Sidecar separator to empower a well-trained ASR model for multi-talker scenarios by separating the mixed speech embedding between two suitable layers. We experimented with a wav2vec 2.0-based ASR model with a Sidecar mounted. By freezing the parameters of the original model and training only the Sidecar (8.7 M, 8.4% of all parameters), the proposed approach outperforms the previous state-of-the-art by a large margin for the 2-speaker mixed LibriMix dataset, reaching a word error rate (WER) of 10.36%; and obtains comparable results (7.56%) for LibriSpeechMix dataset with limited training.
Comment: Accepted by IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023
Published: 2023

16. 2D and 3D CT Radiomic Features Performance Comparison in Characterization of Gastric Cancer: A Multi-center Study

Author: Meng, Lingwei, Dong, Di, Chen, Xin, Fang, Mengjie, Wang, Rongpin, Li, Jing, Liu, Zaiyi, and Tian, Jie
Subjects: Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Signal Processing, Quantitative Biology - Quantitative Methods
Abstract: Objective: Radiomics, an emerging tool for medical image analysis, has the potential to precisely characterize gastric cancer (GC). Whether using one-slice 2D annotation or whole-volume 3D annotation remains a long-standing debate, especially for heterogeneous GC. We comprehensively compared 2D and 3D radiomic features' representation and discrimination capacity regarding GC, via three tasks. Methods: A total of 539 GC patients from four centers were retrospectively enrolled and divided into the training and validation cohorts. From 2D or 3D regions of interest (ROIs) annotated by radiologists, radiomic features were extracted respectively. Feature selection and model construction procedures were customized for each combination of two modalities (2D or 3D) and three tasks. Subsequently, six machine learning models (Model_2D^LNM, Model_3D^LNM; Model_2D^LVI, Model_3D^LVI; Model_2D^pT, Model_3D^pT) were derived and evaluated to reflect modalities' performances in characterizing GC. Furthermore, we performed an auxiliary experiment to assess modalities' performances when resampling spacing is different. Results: Regarding three tasks, the yielded areas under the curve (AUCs) were: Model_2D^LNM's 0.712 (95% confidence interval, 0.613-0.811), Model_3D^LNM's 0.680 (0.584-0.775); Model_2D^LVI's 0.677 (0.595-0.761), Model_3D^LVI's 0.615 (0.528-0.703); Model_2D^pT's 0.840 (0.779-0.901), Model_3D^pT's 0.813 (0.747-0.879). Moreover, the auxiliary experiment indicated that Models_2D are statistically more advantageous than Models_3D with different resampling spacings. Conclusion: Models constructed with 2D radiomic features revealed comparable performances with those constructed with 3D features in characterizing GC. Significance: Our work indicated that time-saving 2D annotation would be the better choice in GC, and provided a related reference to further radiomics-based research.
Comment: Published in IEEE Journal of Biomedical and Health Informatics
Published: 2022

17. Exploring linguistic feature and model combination for speech recognition based automatic AD detection

Author: Wang, Yi, Wang, Tianzi, Ye, Zi, Meng, Lingwei, Hu, Shoukang, Wu, Xixin, Liu, Xunying, and Meng, Helen
Subjects: Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Early diagnosis of Alzheimer's disease (AD) is crucial in facilitating preventive care and delaying progression. Speech based automatic AD screening systems provide a non-intrusive and more scalable alternative to other clinical screening techniques. Scarcity of such specialist data leads to uncertainty in both model selection and feature learning when developing such systems. To this end, this paper investigates the use of feature and model combination approaches to improve the robustness of domain fine-tuning of BERT and RoBERTa pre-trained text encoders on limited data, before the resulting embedding features are fed into an ensemble of backend classifiers to produce the final AD detection decision via majority voting. Experiments conducted on the ADReSS20 Challenge dataset suggest consistent performance improvements were obtained using model and feature combination in system development. State-of-the-art AD detection accuracies of 91.67 percent and 93.75 percent were obtained using manual and ASR speech transcripts respectively on the ADReSS20 test set consisting of 48 elderly speakers.
Comment: Accepted by INTERSPEECH 2022
Published: 2022

18. Tackling Spoofing-Aware Speaker Verification with Multi-Model Fusion

Author: Wu, Haibin, Kang, Jiawen, Meng, Lingwei, Zhang, Yang, Wu, Xixin, Wu, Zhiyong, Lee, Hung-yi, and Meng, Helen
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent years have witnessed the extraordinary development of automatic speaker verification (ASV). However, previous works show that state-of-the-art ASV models are seriously vulnerable to voice spoofing attacks, and the recently proposed high-performance spoofing countermeasure (CM) models focus solely on the standalone anti-spoofing tasks, and ignore the subsequent speaker verification process. How to integrate the CM and ASV together remains an open question. A spoofing aware speaker verification (SASV) challenge has recently taken place with the argument that better performance can be delivered when both CM and ASV subsystems are optimized jointly. Under the challenge's scenario, the integrated systems proposed by the participants are required to reject both impostor speakers and spoofing attacks from target speakers, which intuitively and effectively matches the expectation of a reliable, spoofing-robust ASV system. This work focuses on fusion-based SASV solutions and proposes a multi-model fusion framework to leverage the power of multiple state-of-the-art ASV and CM models. The proposed framework vastly improves the SASV-EER from 8.75% to 1.17%, which is an 86% relative improvement compared to the best baseline system in the SASV challenge.
Comment: Accepted by Odyssey 2022
Published: 2022

19. Spoofing-Aware Speaker Verification by Multi-Level Fusion

Author: Wu, Haibin, Meng, Lingwei, Kang, Jiawen, Li, Jinchao, Li, Xu, Wu, Xixin, Lee, Hung-yi, and Meng, Helen
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recently, many novel techniques have been introduced to deal with spoofing attacks, and achieve promising countermeasure (CM) performances. However, these works only take the stand-alone CM models into account. A spoofing aware speaker verification (SASV) challenge is now taking place; it aims to facilitate the research of integrated CM and ASV models, arguing that jointly optimizing CM and ASV models will lead to better performance. In this paper, we propose a novel multi-model and multi-level fusion strategy to tackle the SASV task. Compared with purely scoring fusion and embedding fusion methods, this framework first utilizes embeddings from CM models, propagating CM embeddings into a CM block to obtain a CM score. In the second-level fusion, the CM score and ASV scores directly from ASV systems will be concatenated into a prediction block for the final decision. As a result, the best single fusion system has achieved the SASV-EER of 0.97% on the evaluation set. Then by ensembling the top-5 fusion systems, the final SASV-EER reached 0.89%.
Comment: Submitted to Interspeech 2022
Published: 2022

20. The CUHK-TENCENT speaker diarization system for the ICASSP 2022 multi-channel multi-party meeting transcription challenge

Author: Zheng, Naijun, Li, Na, Wu, Xixin, Meng, Lingwei, Kang, Jiawen, Wu, Haibin, Weng, Chao, Su, Dan, and Meng, Helen
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: This paper describes our speaker diarization system submitted to the Multi-channel Multi-party Meeting Transcription (M2MeT) challenge, where Mandarin meeting data were recorded in multi-channel format for diarization and automatic speech recognition (ASR) tasks. In these meeting scenarios, the uncertainty of the speaker number and the high ratio of overlapped speech present great challenges for diarization. Based on the assumption that there is valuable complementary information among acoustic, spatial-related, and speaker-related features, we propose a target-speaker voice activity detection system based on a multi-level feature fusion mechanism (FFM-TS-VAD) to improve the performance of the conventional TS-VAD system. Furthermore, we propose a data augmentation method during training to improve system robustness when the angular difference between two speakers is relatively small. We provide comparisons of the different sub-systems we used in the M2MeT challenge. Our submission is a fusion of several sub-systems and ranks second in the diarization task. Comment: Submitted to ICASSP 2022
Published: 2022

21. PM2.5-GNN: A Domain Knowledge Enhanced Graph Neural Network For PM2.5 Forecasting

Author: Wang, Shuo, Li, Yanran, Zhang, Jiang, Meng, Qingye, Meng, Lingwei, and Gao, Fei
Subjects: Electrical Engineering and Systems Science - Signal Processing, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: When predicting PM2.5 concentrations, it is necessary to consider complex information sources, since the concentrations are influenced by various factors over a long period. In this paper, we identify a set of critical domain knowledge for PM2.5 forecasting and develop a novel graph-based model, PM2.5-GNN, capable of capturing long-term dependencies. On a real-world dataset, we validate the effectiveness of the proposed model and examine its ability to capture both fine-grained and long-term influences in the PM2.5 process. The proposed PM2.5-GNN has also been deployed online to provide a free forecasting service. Comment: Pre-print version of an ACM SIGSPATIAL 2020 poster [paper](https://dl.acm.org/doi/10.1145/3397536.3422208). The code is available at [Github](https://github.com/shawnwang-tech/PM2.5-GNN), and the talk is available at [YouTube](https://www.youtube.com/watch?v=VX93vMthkGM)
Published: 2020
Full Text: View/download PDF
