Author: "Xie, Xurong" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Xie, Xurong"' showing total 120 results

1. Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation

Author: Geng, Mengzhe, Xie, Xurong, Deng, Jiajun, Jin, Zengrui, Li, Guinan, Wang, Tianzi, Hu, Shujie, Li, Zhaoqing, Meng, Helen, and Liu, Xunying
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Human-Computer Interaction, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The application of data-intensive automatic speech recognition (ASR) technologies to dysarthric and elderly adult speech is confronted by their mismatch against healthy and non-aged voices, data scarcity and large speaker-level variability. To this end, this paper proposes two novel data-efficient methods to learn homogeneous dysarthric and elderly speaker-level features for rapid, on-the-fly test-time adaptation of DNN/TDNN and Conformer ASR models. These include: 1) speaker-level variance-regularized spectral basis embedding (VR-SBE) features that exploit a special regularization term to enforce homogeneity of speaker features in adaptation; and 2) feature-based learning hidden unit contributions (f-LHUC) transforms that are conditioned on VR-SBE features. Experiments are conducted on four tasks across two languages: the English UASpeech and TORGO dysarthric speech datasets, the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech corpora. The proposed on-the-fly speaker adaptation techniques consistently outperform baseline iVector and xVector adaptation by statistically significant word or character error rate reductions up to 5.32% absolute (18.57% relative) and batch-mode LHUC speaker adaptation by 2.24% absolute (9.20% relative), while operating with real-time factors speeding up to 33.6 times against xVectors during adaptation. The efficacy of the proposed adaptation techniques is demonstrated in a comparison against current ASR technologies including SSL pre-trained systems on UASpeech, where our best system produces a state-of-the-art WER of 23.33%. Analyses show VR-SBE features and f-LHUC transforms are insensitive to speaker-level data quantity in test-time adaptation. T-SNE visualization reveals they have stronger speaker-level homogeneity than baseline iVectors, xVectors and batch-mode LHUC transforms., Comment: In submission to IEEE/ACM Transactions on Audio, Speech, and Language Processing
Published: 2024

2. Self-supervised ASR Models and Features For Dysarthric and Elderly Speech Recognition

Author: Hu, Shujie, Xie, Xurong, Geng, Mengzhe, Jin, Zengrui, Deng, Jiajun, Li, Guinan, Wang, Yi, Cui, Mingyu, Wang, Tianzi, Meng, Helen, and Liu, Xunying
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Sound
Abstract: Self-supervised learning (SSL) based speech foundation models have been applied to a wide range of ASR tasks. However, their application to dysarthric and elderly speech via data-intensive parameter fine-tuning is confronted by in-domain data scarcity and mismatch. To this end, this paper explores a series of approaches to integrate domain fine-tuned SSL pre-trained models and their features into TDNN and Conformer ASR systems for dysarthric and elderly speech recognition. These include: a) input feature fusion between standard acoustic frontends and domain fine-tuned SSL speech representations; b) frame-level joint decoding between TDNN systems separately trained using standard acoustic features alone and those with additional domain fine-tuned SSL features; and c) multi-pass decoding involving the TDNN/Conformer system outputs to be rescored using domain fine-tuned pre-trained ASR models. In addition, fine-tuned SSL speech features are used in acoustic-to-articulatory (A2A) inversion to construct multi-modal ASR systems. Experiments are conducted on four tasks: the English UASpeech and TORGO dysarthric speech corpora; and the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech datasets. The TDNN systems constructed by integrating domain-adapted HuBERT, wav2vec2-conformer or multi-lingual XLSR models and their features consistently outperform the standalone fine-tuned SSL pre-trained models. These systems produced statistically significant WER or CER reductions of 6.53%, 1.90%, 2.04% and 7.97% absolute (24.10%, 23.84%, 10.14% and 31.39% relative) on the four tasks respectively. Consistent improvements in Alzheimer's Disease detection accuracy are also obtained using the DementiaBank Pitt elderly speech recognition outputs., Comment: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Published: 2024

3. Joint Speaker Features Learning for Audio-visual Multichannel Speech Separation and Recognition

Author: Li, Guinan, Deng, Jiajun, Chen, Youjun, Geng, Mengzhe, Hu, Shujie, Li, Zhe, Jin, Zengrui, Wang, Tianzi, Xie, Xurong, Meng, Helen, and Liu, Xunying
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper proposes joint speaker feature learning methods for zero-shot adaptation of audio-visual multichannel speech separation and recognition systems. xVector and ECAPA-TDNN speaker encoders are connected using purpose-built fusion blocks and tightly integrated with the complete system training. Experiments conducted on LRS3-TED data simulated multichannel overlapped speech suggest that joint speaker feature learning consistently improves speech separation and recognition performance over the baselines without joint speaker feature estimation. Further analyses reveal performance improvements are strongly correlated with increased inter-speaker discrimination measured using cosine similarity. The best-performing joint speaker feature learning adapted system outperformed the baseline fine-tuned WavLM model by statistically significant WER reductions of 21.6% and 25.3% absolute (67.5% and 83.5% relative) on Dev and Test sets after incorporating WavLM features and video modality., Comment: Accepted by Interspeech 2024
Published: 2024

4. Towards Effective and Efficient Non-autoregressive Decoding Using Block-based Attention Mask

Author: Wang, Tianzi, Xie, Xurong, Li, Zhaoqing, Hu, Shoukang, Jin, Zengrui, Deng, Jiajun, Cui, Mingyu, Hu, Shujie, Geng, Mengzhe, Li, Guinan, Meng, Helen, and Liu, Xunying
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper proposes a novel non-autoregressive (NAR) block-based Attention Mask Decoder (AMD) that flexibly balances performance-efficiency trade-offs for Conformer ASR systems. AMD performs parallel NAR inference within contiguous blocks of output labels that are concealed using attention masks, while conducting left-to-right AR prediction and history context amalgamation between blocks. A beam search algorithm is designed to leverage a dynamic fusion of CTC, AR Decoder, and AMD probabilities. Experiments on the LibriSpeech-100hr corpus suggest the tripartite Decoder incorporating the AMD module produces a maximum decoding speed-up ratio of 1.73x over the baseline CTC+AR decoding, while incurring no statistically significant word error rate (WER) increase on the test sets. When operating with the same decoding real time factors, statistically significant WER reductions of up to 0.7% and 0.3% absolute (5.3% and 6.1% relative) were obtained over the CTC+AR baseline., Comment: 5 pages, 2 figures, 2 tables, Interspeech24 conference
Published: 2024

5. Perceiver-Prompt: Flexible Speaker Adaptation in Whisper for Chinese Disordered Speech Recognition

Author: Jiang, Yicong, Wang, Tianzi, Xie, Xurong, Liu, Juan, Sun, Wei, Yan, Nan, Chen, Hui, Wang, Lan, Liu, Xunying, and Tian, Feng
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Sound
Abstract: Disordered speech recognition has profound implications for improving the quality of life for individuals afflicted with, for example, dysarthria. Dysarthric speech recognition encounters challenges including limited data, substantial dissimilarities between dysarthric and non-dysarthric speakers, and significant speaker variations stemming from the disorder. This paper introduces Perceiver-Prompt, a method for speaker adaptation that utilizes P-Tuning on the Whisper large-scale model. We first fine-tune Whisper using LoRA and then integrate a trainable Perceiver to generate fixed-length speaker prompts from variable-length inputs, to improve model recognition of Chinese dysarthric speech. Experimental results from our Chinese dysarthric speech dataset demonstrate consistent improvements in recognition performance with Perceiver-Prompt. Relative reduction up to 13.04% in CER is obtained over the fine-tuned Whisper., Comment: Accepted by Interspeech 2024
Published: 2024

6. Towards Automatic Data Augmentation for Disordered Speech Recognition

Author: Jin, Zengrui, Xie, Xurong, Wang, Tianzi, Geng, Mengzhe, Deng, Jiajun, Li, Guinan, Hu, Shujie, and Liu, Xunying
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Automatic recognition of disordered speech remains a highly challenging task to date due to data scarcity. This paper presents a reinforcement learning (RL) based on-the-fly data augmentation approach for training state-of-the-art PyChain TDNN and end-to-end Conformer ASR systems on such data. The handcrafted temporal and spectral mask operations in the standard SpecAugment method that are task and system dependent, together with additionally introduced minimum and maximum cut-offs of these time-frequency masks, are now automatically learned using an RNN-based policy controller and tightly integrated with ASR system training. Experiments on the UASpeech corpus suggest the proposed RL-based data augmentation approach consistently produced performance superior or comparable to that obtained using expert or handcrafted SpecAugment policies. Our RL auto-augmented PyChain TDNN system produced an overall WER of 28.79% on the UASpeech test set of 16 dysarthric speakers., Comment: To appear at IEEE ICASSP 2024
Published: 2023

7. Factorised Speaker-environment Adaptive Training of Conformer Speech Recognition Systems

Author: Deng, Jiajun, Li, Guinan, Xie, Xurong, Jin, Zengrui, Cui, Mingyu, Wang, Tianzi, Hu, Shujie, Geng, Mengzhe, and Liu, Xunying
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language
Abstract: Rich sources of variability in natural speech present significant challenges to current data intensive speech recognition technologies. To model both speaker and environment level diversity, this paper proposes a novel Bayesian factorised speaker-environment adaptive training and test time adaptation approach for Conformer ASR models. Speaker and environment level characteristics are separately modeled using compact hidden output transforms, which are then linearly or hierarchically combined to represent any speaker-environment combination. Bayesian learning is further utilized to model the adaptation parameter uncertainty. Experiments on the 300-hr WHAM noise corrupted Switchboard data suggest that factorised adaptation consistently outperforms the baseline and speaker label only adapted Conformers by up to 3.1% absolute (10.4% relative) word error rate reductions. Further analysis shows the proposed method offers potential for rapid adaptation to unseen speaker-environment conditions., Comment: Accepted by INTERSPEECH 2023
Published: 2023

8. Use of Speech Impairment Severity for Dysarthric Speech Recognition

Author: Geng, Mengzhe, Jin, Zengrui, Wang, Tianzi, Hu, Shujie, Deng, Jiajun, Cui, Mingyu, Li, Guinan, Yu, Jianwei, Xie, Xurong, and Liu, Xunying
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Sound
Abstract: A key challenge in dysarthric speech recognition is the speaker-level diversity attributed to both speaker-identity associated factors such as gender, and speech impairment severity. Most prior research on addressing this issue has focused on using speaker-identity only. To this end, this paper proposes a novel set of techniques to use both severity and speaker-identity in dysarthric speech recognition: a) multitask training incorporating severity prediction error; b) speaker-severity aware auxiliary feature adaptation; and c) structured LHUC transforms separately conditioned on speaker-identity and severity. Experiments conducted on UASpeech suggest incorporating additional speech impairment severity into state-of-the-art hybrid DNN, E2E Conformer and pre-trained Wav2vec 2.0 ASR systems produced statistically significant WER reductions up to 4.78% (14.03% relative). Using the best system the lowest published WER of 17.82% (51.25% on very low intelligibility) was obtained on UASpeech., Comment: Accepted to INTERSPEECH 2023
Published: 2023

9. Exploring Self-supervised Pre-trained ASR Models For Dysarthric and Elderly Speech Recognition

Author: Hu, Shujie, Xie, Xurong, Jin, Zengrui, Geng, Mengzhe, Wang, Yi, Cui, Mingyu, Deng, Jiajun, Liu, Xunying, and Meng, Helen
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Automatic recognition of disordered and elderly speech remains a highly challenging task to date due to the difficulty in collecting such data in large quantities. This paper explores a series of approaches to integrate domain adapted SSL pre-trained models into TDNN and Conformer ASR systems for dysarthric and elderly speech recognition: a) input feature fusion between standard acoustic frontends and domain adapted wav2vec2.0 speech representations; b) frame-level joint decoding of TDNN systems separately trained using standard acoustic features alone and with additional wav2vec2.0 features; and c) multi-pass decoding involving the TDNN/Conformer system outputs to be rescored using domain adapted wav2vec2.0 models. In addition, domain adapted wav2vec2.0 representations are utilized in acoustic-to-articulatory (A2A) inversion to construct multi-modal dysarthric and elderly speech recognition systems. Experiments conducted on the UASpeech dysarthric and DementiaBank Pitt elderly speech corpora suggest TDNN and Conformer ASR systems integrating domain adapted wav2vec2.0 models consistently outperform the standalone wav2vec2.0 models by statistically significant WER reductions of 8.22% and 3.43% absolute (26.71% and 15.88% relative) on the two tasks respectively. The lowest published WERs of 22.56% (52.53% on very low intelligibility, 39.09% on unseen words) and 18.17% are obtained on the UASpeech test set of 16 dysarthric speakers, and the DementiaBank Pitt test set respectively., Comment: Accepted by ICASSP 2023
Published: 2023

10. Confidence Score Based Speaker Adaptation of Conformer Speech Recognition Systems

Author: Deng, Jiajun, Xie, Xurong, Wang, Tianzi, Cui, Mingyu, Xue, Boyang, Jin, Zengrui, Li, Guinan, Hu, Shujie, and Liu, Xunying
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Speaker adaptation techniques provide a powerful solution to customise automatic speech recognition (ASR) systems for individual users. Practical application of unsupervised model-based speaker adaptation techniques to data intensive end-to-end ASR systems is hindered by the scarcity of speaker-level data and performance sensitivity to transcription errors. To address these issues, a set of compact and data efficient speaker-dependent (SD) parameter representations are used to facilitate both speaker adaptive training and test-time unsupervised speaker adaptation of state-of-the-art Conformer ASR systems. The sensitivity to supervision quality is reduced using a confidence score-based selection of the less erroneous subset of speaker-level adaptation data. Two lightweight confidence score estimation modules are proposed to produce more reliable confidence scores. The data sparsity issue, which is exacerbated by data selection, is addressed by modelling the SD parameter uncertainty using Bayesian learning. Experiments on the benchmark 300-hour Switchboard and the 233-hour AMI datasets suggest that the proposed confidence score-based adaptation schemes consistently outperformed the baseline speaker-independent (SI) Conformer model and conventional non-Bayesian, point estimate-based adaptation using no speaker data selection. Similar consistent performance improvements were retained after external Transformer and LSTM language model rescoring. In particular, on the 300-hour Switchboard corpus, statistically significant WER reductions of 1.0%, 1.3%, and 1.4% absolute (9.5%, 10.9%, and 11.3% relative) were obtained over the baseline SI Conformer on the NIST Hub5'00, RT02, and RT03 evaluation sets respectively. Similar WER reductions of 2.7% and 3.3% absolute (8.9% and 10.2% relative) were also obtained on the AMI development and evaluation sets., Comment: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Published: 2023

11. Unsupervised Model-based speaker adaptation of end-to-end lattice-free MMI model for speech recognition

Author: Xie, Xurong, Liu, Xunying, Chen, Hui, and Wang, Hongan
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Modeling the speaker variability is a key challenge for automatic speech recognition (ASR) systems. In this paper, the learning hidden unit contributions (LHUC) based adaptation techniques with compact speaker dependent (SD) parameters are used to facilitate both speaker adaptive training (SAT) and unsupervised test-time speaker adaptation for end-to-end (E2E) lattice-free MMI (LF-MMI) models. An unsupervised model-based adaptation framework is proposed to estimate the SD parameters in the E2E paradigm using LF-MMI and cross entropy (CE) criteria. Various regularization methods of the standard LHUC adaptation, e.g., the Bayesian LHUC (BLHUC) adaptation, are systematically investigated to mitigate the risk of overfitting, on E2E LF-MMI CNN-TDNN and CNN-TDNN-BLSTM models. Lattice-based confidence score estimation is used for adaptation data selection to reduce the supervision label uncertainty. Experiments on the 300-hour Switchboard task suggest that applying BLHUC in the proposed unsupervised E2E adaptation framework to byte pair encoding (BPE) based E2E LF-MMI systems consistently outperformed the baseline systems by relative word error rate (WER) reductions up to 10.5% and 14.7% on the NIST Hub5'00 and RT03 evaluation sets, and achieved the best performance in WERs of 9.0% and 9.7%, respectively. These results are comparable to the results of state-of-the-art adapted LF-MMI hybrid systems and adapted Conformer-based E2E systems., Comment: 6 pages, 2 figures, submitted to ICASSP 2023
Published: 2022

12. Adversarial Data Augmentation Using VAE-GAN for Disordered Speech Recognition

Author: Jin, Zengrui, Xie, Xurong, Geng, Mengzhe, Wang, Tianzi, Hu, Shujie, Deng, Jiajun, Li, Guinan, and Liu, Xunying
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Automatic recognition of disordered speech remains a highly challenging task to date. The underlying neuro-motor conditions, often compounded with co-occurring physical disabilities, lead to the difficulty in collecting large quantities of impaired speech required for ASR system development. This paper presents novel variational auto-encoder generative adversarial network (VAE-GAN) based personalized disordered speech augmentation approaches that simultaneously learn to encode, generate and discriminate synthesized impaired speech. Separate latent features are derived to learn dysarthric speech characteristics and phoneme context representations. Self-supervised pre-trained Wav2vec 2.0 embedding features are also incorporated. Experiments conducted on the UASpeech corpus suggest the proposed adversarial data augmentation approach consistently outperformed the baseline speed perturbation and non-VAE GAN augmentation methods with trained hybrid TDNN and End-to-end Conformer systems. After LHUC speaker adaptation, the best system using VAE-GAN based augmentation produced an overall WER of 27.78% on the UASpeech test set of 16 dysarthric speakers, and the lowest published WER of 57.31% on the subset of speakers with "Very Low" intelligibility., Comment: Submitted to ICASSP 2023
Published: 2022

13. Two-pass Decoding and Cross-adaptation Based System Combination of End-to-end Conformer and Hybrid TDNN ASR Systems

Author: Cui, Mingyu, Deng, Jiajun, Hu, Shoukang, Xie, Xurong, Wang, Tianzi, Hu, Shujie, Geng, Mengzhe, Xue, Boyang, Liu, Xunying, and Meng, Helen
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence
Abstract: Fundamental modelling differences between hybrid and end-to-end (E2E) automatic speech recognition (ASR) systems create large diversity and complementarity among them. This paper investigates multi-pass rescoring and cross adaptation based system combination approaches for hybrid TDNN and Conformer E2E ASR systems. In multi-pass rescoring, a state-of-the-art hybrid LF-MMI trained CNN-TDNN system featuring speed perturbation, SpecAugment and Bayesian learning hidden unit contributions (LHUC) speaker adaptation was used to produce initial N-best outputs before being rescored by the speaker adapted Conformer system using a 2-way cross system score interpolation. In cross adaptation, the hybrid CNN-TDNN system was adapted to the 1-best output of the Conformer system or vice versa. Experiments on the 300-hour Switchboard corpus suggest that the combined systems derived using either of the two system combination approaches outperformed the individual systems. The best combined system obtained using multi-pass rescoring produced statistically significant word error rate (WER) reductions of 2.5% to 3.9% absolute (22.5% to 28.9% relative) over the standalone Conformer system on the NIST Hub5'00, RT03 and RT02 evaluation data., Comment: It's accepted to ISCA 2022
Published: 2022

14. Confidence Score Based Conformer Speaker Adaptation for Speech Recognition

Author: Deng, Jiajun, Xie, Xurong, Wang, Tianzi, Cui, Mingyu, Xue, Boyang, Jin, Zengrui, Geng, Mengzhe, Li, Guinan, Liu, Xunying, and Meng, Helen
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: A key challenge for automatic speech recognition (ASR) systems is to model the speaker level variability. In this paper, compact speaker dependent learning hidden unit contributions (LHUC) are used to facilitate both speaker adaptive training (SAT) and test time unsupervised speaker adaptation for state-of-the-art Conformer based end-to-end ASR systems. The sensitivity during adaptation to supervision error rate is reduced using confidence score based selection of the more "trustworthy" subset of speaker specific data. A confidence estimation module is used to smooth the over-confident Conformer decoder output probabilities before serving as confidence scores. The increased data sparsity due to speaker level data selection is addressed using Bayesian estimation of LHUC parameters. Experiments on the 300-hour Switchboard corpus suggest that the proposed LHUC-SAT Conformer with confidence score based test time unsupervised adaptation outperformed the baseline speaker independent and i-vector adapted Conformer systems by up to 1.0%, 1.0%, and 1.2% absolute (9.0%, 7.9%, and 8.9% relative) word error rate (WER) reductions on the NIST Hub5'00, RT02, and RT03 evaluation sets respectively. Consistent performance improvements were retained after external Transformer and LSTM language models were used for rescoring., Comment: It's accepted to INTERSPEECH 2022. arXiv admin note: text overlap with arXiv:2206.11596
Published: 2022

15. Exploiting Cross-domain And Cross-Lingual Ultrasound Tongue Imaging Features For Elderly And Dysarthric Speech Recognition

Author: Hu, Shujie, Xie, Xurong, Geng, Mengzhe, Cui, Mingyu, Deng, Jiajun, Li, Guinan, Wang, Tianzi, Liu, Xunying, and Meng, Helen
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence
Abstract: Articulatory features are inherently invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition (ASR) systems designed for normal speech. Their practical application to atypical task domains such as elderly and disordered speech across languages is often limited by the difficulty in collecting such specialist data from target speakers. This paper presents a cross-domain and cross-lingual A2A inversion approach that utilizes the parallel audio and ultrasound tongue imaging (UTI) data of the 24-hour TaL corpus in A2A model pre-training before being cross-domain and cross-lingual adapted to three datasets across two languages: the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech corpora; and the English TORGO dysarthric speech data, to produce UTI based articulatory features. Experiments conducted on three tasks suggested incorporating the generated articulatory features consistently outperformed the baseline TDNN and Conformer ASR systems constructed using acoustic features only by statistically significant word or character error rate reductions up to 4.75%, 2.59% and 2.07% absolute (14.69%, 10.64% and 22.72% relative) after data augmentation, speaker adaptation and cross system multi-pass decoding were applied., Comment: accepted by INTERSPEECH 2023
Published: 2022

16. On-the-Fly Feature Based Rapid Speaker Adaptation for Dysarthric and Elderly Speech Recognition

Author: Geng, Mengzhe, Xie, Xurong, Su, Rongfeng, Yu, Jianwei, Jin, Zengrui, Wang, Tianzi, Hu, Shujie, Ye, Zi, Meng, Helen, and Liu, Xunying
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Sound
Abstract: Accurate recognition of dysarthric and elderly speech remains a challenging task to date. Speaker-level heterogeneity attributed to accent or gender, when aggregated with age and speech impairment, creates large diversity among these speakers. Scarcity of speaker-level data limits the practical use of data-intensive model based speaker adaptation methods. To this end, this paper proposes two novel forms of data-efficient, feature-based on-the-fly speaker adaptation methods: variance-regularized spectral basis embedding (SVR) and spectral feature driven f-LHUC transforms. Experiments conducted on UASpeech dysarthric and DementiaBank Pitt elderly speech corpora suggest the proposed on-the-fly speaker adaptation approaches consistently outperform baseline iVector adapted hybrid DNN/TDNN and E2E Conformer systems by statistically significant WER reduction of 2.48%-2.85% absolute (7.92%-8.06% relative), and offline model based LHUC adaptation by 1.82% absolute (5.63% relative) respectively., Comment: Accepted to INTERSPEECH 2023
Published: 2022

17. Exploiting Cross Domain Acoustic-to-articulatory Inverted Features For Disordered Speech Recognition

Author: Hu, Shujie, Liu, Shansong, Xie, Xurong, Geng, Mengzhe, Wang, Tianzi, Hu, Shoukang, Cui, Mingyu, Liu, Xunying, and Meng, Helen
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence
Abstract: Articulatory features are inherently invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition (ASR) systems for normal speech. Their practical application to disordered speech recognition is often limited by the difficulty in collecting such specialist data from impaired speakers. This paper presents a cross-domain acoustic-to-articulatory (A2A) inversion approach that utilizes the parallel acoustic-articulatory data of the 15-hour TORGO corpus in model training before being cross-domain adapted to the 102.7-hour UASpeech corpus and to produce articulatory features. Mixture density networks based neural A2A inversion models were used. A cross-domain feature adaptation network was also used to reduce the acoustic mismatch between the TORGO and UASpeech data. On both tasks, incorporating the A2A generated articulatory features consistently outperformed the baseline hybrid DNN/TDNN, CTC and Conformer based end-to-end systems constructed using acoustic features only. The best multi-modal system incorporating video modality and the cross-domain articulatory features as well as data augmentation and learning hidden unit contributions (LHUC) speaker adaptation produced the lowest published word error rate (WER) of 24.82% on the 16 dysarthric speakers of the benchmark UASpeech task., Comment: accepted by ICASSP 2022
Published: 2022

18. Speaker Adaptation Using Spectro-Temporal Deep Features for Dysarthric and Elderly Speech Recognition

Author: Geng, Mengzhe, Xie, Xurong, Ye, Zi, Wang, Tianzi, Li, Guinan, Hu, Shujie, Liu, Xunying, and Meng, Helen
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Sound, Quantitative Biology - Quantitative Methods
Abstract: Despite the rapid progress of automatic speech recognition (ASR) technologies targeting normal speech in recent decades, accurate recognition of dysarthric and elderly speech remains a highly challenging task to date. Sources of heterogeneity commonly found in normal speech including accent or gender, when further compounded with the variability over age and speech pathology severity level, create large diversity among speakers. To this end, speaker adaptation techniques play a key role in personalization of ASR systems for such users. Motivated by the spectro-temporal level differences between dysarthric, elderly and normal speech that systematically manifest in articulatory imprecision, decreased volume and clarity, slower speaking rates and increased dysfluencies, novel spectro-temporal subspace basis deep embedding features derived using SVD speech spectrum decomposition are proposed in this paper to facilitate auxiliary feature based speaker adaptation of state-of-the-art hybrid DNN/TDNN and end-to-end Conformer speech recognition systems. Experiments were conducted on four tasks: the English UASpeech and TORGO dysarthric speech corpora; the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech datasets. The proposed spectro-temporal deep feature adapted systems outperformed baseline i-Vector and xVector adaptation by up to 2.63% absolute (8.63% relative) reduction in word error rate (WER). Consistent performance improvements were retained after model based speaker adaptation using learning hidden unit contributions (LHUC) was further applied. The best speaker adapted system using the proposed spectral basis embedding features produced the lowest published WER of 25.05% on the UASpeech test set of 16 dysarthric speakers., Comment: In submission to IEEE/ACM Transactions on Audio Speech and Language Processing
Published: 2022

19. Investigation of Deep Neural Network Acoustic Modelling Approaches for Low Resource Accented Mandarin Speech Recognition

Author: Xie, Xurong, Sui, Xiang, Liu, Xunying, and Wang, Lan
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: The Mandarin Chinese language is known to be strongly influenced by a rich set of regional accents, while Mandarin speech with each accent is quite low resource. Hence, an important task in Mandarin speech recognition is to appropriately model the acoustic variabilities imposed by accents. In this paper, an investigation of implicit and explicit use of accent information on a range of deep neural network (DNN) based acoustic modelling techniques is conducted. Meanwhile, approaches of multi-accent modelling including multi-style training, multi-accent decision tree state tying, DNN tandem and multi-level adaptive network (MLAN) tandem hidden Markov model (HMM) modelling are combined and compared in this paper. On a low resource accented Mandarin speech recognition task consisting of four regional accents, an improved MLAN tandem HMM system explicitly leveraging the accent information was proposed and significantly outperformed the baseline accent independent DNN tandem systems by 0.8%-1.5% absolute (6%-9% relative) in character error rate after sequence level discriminative training and adaptation., Comment: Published in JOURNAL OF INTEGRATION TECHNOLOGY CNKI:SUN:JCJI.0.2015-06-003
Published: 2022

20. Variational Auto-Encoder Based Variability Encoding for Dysarthric Speech Recognition

Author: Xie, Xurong, Ruzi, Rukiye, Liu, Xunying, and Wang, Lan
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Dysarthric speech recognition is a challenging task due to acoustic variability and the limited amount of available data. Diverse conditions of dysarthric speakers account for the acoustic variability, which makes the variability difficult to model precisely. This paper presents a variational auto-encoder based variability encoder (VAEVE) to explicitly encode such variability for dysarthric speech. The VAEVE makes use of both phoneme information and a low-dimensional latent variable to reconstruct the input acoustic features, so that the latent variable is forced to encode the phoneme-independent variability. The stochastic gradient variational Bayes algorithm is applied to model the distribution for generating variability encodings, which are further used as auxiliary features for DNN acoustic modeling. Experimental results on the UASpeech corpus show that the VAEVE based variability encodings have a complementary effect to the learning hidden unit contributions (LHUC) speaker adaptation. The systems using variability encodings consistently outperform the comparable baseline systems without using them, and obtain absolute word error rate (WER) reductions of up to 2.2% on dysarthric speech with the "Very low" intelligibility level, and up to 2% on the "Mixed" type of dysarthric speech with diverse or uncertain conditions., Comment: Published in Interspeech 2021, 4808-4812
Published: 2022
Full Text: View/download PDF

21. Recent Progress in the CUHK Dysarthric Speech Recognition System

Author: Liu, Shansong, Geng, Mengzhe, Hu, Shoukang, Xie, Xurong, Cui, Mingyu, Yu, Jianwei, Liu, Xunying, and Meng, Helen
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Sound
Abstract: Despite the rapid progress of automatic speech recognition (ASR) technologies in the past few decades, recognition of disordered speech remains a highly challenging task to date. Disordered speech presents a wide spectrum of challenges to current data intensive deep neural networks (DNNs) based ASR technologies that predominantly target normal speech. This paper presents recent research efforts at the Chinese University of Hong Kong (CUHK) to improve the performance of disordered speech recognition systems on the largest publicly available UASpeech dysarthric speech corpus. A set of novel modelling techniques including neural architectural search, data augmentation using spectra-temporal perturbation, model based speaker adaptation and cross-domain generation of visual features within an audio-visual speech recognition (AVSR) system framework were employed to address the above challenges. The combination of these techniques produced the lowest published word error rate (WER) of 25.21% on the UASpeech test set of 16 dysarthric speakers, and an overall WER reduction of 5.4% absolute (17.6% relative) over the CUHK 2018 dysarthric speech recognition system featuring a 6-way DNN system combination and cross adaptation of out-of-domain normal speech data trained systems. Bayesian model adaptation further allows rapid adaptation to individual dysarthric speakers to be performed using as little as 3.06 seconds of speech. The efficacy of these techniques was further demonstrated on a CUDYS Cantonese dysarthric speech recognition task.
Published: 2022
Full Text: View/download PDF

22. Investigation of Data Augmentation Techniques for Disordered Speech Recognition

Author: Geng, Mengzhe, Xie, Xurong, Liu, Shansong, Yu, Jianwei, Hu, Shoukang, Liu, Xunying, and Meng, Helen
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Disordered speech recognition is a highly challenging task. The underlying neuro-motor conditions of people with speech disorders, often compounded with co-occurring physical disabilities, lead to the difficulty in collecting large quantities of speech required for system development. This paper investigates a set of data augmentation techniques for disordered speech recognition, including vocal tract length perturbation (VTLP), tempo perturbation and speed perturbation. Both normal and disordered speech were exploited in the augmentation process. Variability among impaired speakers in both the original and augmented data was modeled using learning hidden unit contributions (LHUC) based speaker adaptive training. The final speaker adapted system constructed using the UASpeech corpus and the best augmentation approach based on speed perturbation produced up to 2.92% absolute (9.3% relative) word error rate (WER) reduction over the baseline system without data augmentation, and gave an overall WER of 26.37% on the test set containing 16 dysarthric speakers., Comment: Proceedings of INTERSPEECH 2020
Published: 2022
Full Text: View/download PDF

23. Spectro-Temporal Deep Features for Disordered Speech Assessment and Recognition

Author: Geng, Mengzhe, Liu, Shansong, Yu, Jianwei, Xie, Xurong, Hu, Shoukang, Ye, Zi, Jin, Zengrui, Liu, Xunying, and Meng, Helen
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Automatic recognition of disordered speech remains a highly challenging task to date. Sources of variability commonly found in normal speech including accent, age or gender, when further compounded with the underlying causes of speech impairment and varying severity levels, create large diversity among speakers. To this end, speaker adaptation techniques play a vital role in current speech recognition systems. Motivated by the spectro-temporal level differences between disordered and normal speech that systematically manifest in articulatory imprecision, decreased volume and clarity, slower speaking rates and increased dysfluencies, novel spectro-temporal subspace basis embedding deep features derived by SVD decomposition of speech spectrum are proposed to facilitate both accurate speech intelligibility assessment and auxiliary feature based speaker adaptation of state-of-the-art hybrid DNN and end-to-end disordered speech recognition systems. Experiments conducted on the UASpeech corpus suggest the proposed spectro-temporal deep feature adapted systems consistently outperformed baseline i-Vector adaptation by up to 2.63% absolute (8.6% relative) reduction in word error rate (WER) with or without data augmentation. Learning hidden unit contribution (LHUC) based speaker adaptation was further applied. The final speaker adapted system using the proposed spectral basis embedding features gave an overall WER of 25.6% on the UASpeech test set of 16 dysarthric speakers., Comment: Proceedings of INTERSPEECH 2021
Published: 2022
Full Text: View/download PDF

24. Neural Architecture Search For LF-MMI Trained Time Delay Neural Networks

Author: Hu, Shoukang, Xie, Xurong, Cui, Mingyu, Deng, Jiajun, Liu, Shansong, Yu, Jianwei, Geng, Mengzhe, Liu, Xunying, and Meng, Helen
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: State-of-the-art automatic speech recognition (ASR) system development is data and computation intensive. The optimal design of deep neural networks (DNNs) for these systems often requires expert knowledge and empirical evaluation. In this paper, a range of neural architecture search (NAS) techniques are used to automatically learn two types of hyper-parameters of factored time delay neural networks (TDNN-Fs): i) the left and right splicing context offsets; and ii) the dimensionality of the bottleneck linear projection at each hidden layer. These techniques include the differentiable neural architecture search (DARTS) method integrating architecture learning with lattice-free MMI training; Gumbel-Softmax and pipelined DARTS methods reducing the confusion over candidate architectures and improving the generalization of architecture selection; and Penalized DARTS incorporating resource constraints to balance the trade-off between performance and system complexity. Parameter sharing among TDNN-F architectures allows an efficient search over up to 7^28 different systems. Statistically significant word error rate (WER) reductions of up to 1.2% absolute and relative model size reduction of 31% were obtained over a state-of-the-art 300-hour Switchboard corpus trained baseline LF-MMI TDNN-F system featuring speed perturbation, i-Vector and learning hidden unit contribution (LHUC) based speaker adaptation as well as RNNLM rescoring. Performance contrasts on the same task against recent end-to-end systems reported in the literature suggest the best NAS auto-configured system achieves state-of-the-art WERs of 9.9% and 11.1% on the NIST Hub5'00 and RT03S test sets respectively with up to 96% model size reduction. Further analysis using Bayesian learning shows that ..., Comment: Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP). arXiv admin note: text overlap with arXiv:2007.08818
Published: 2022

25. A Multi-level Acoustic Feature Extraction Framework for Transformer Based End-to-End Speech Recognition

Author: Li, Jin, Su, Rongfeng, Xie, Xurong, Yan, Nan, and Wang, Lan
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Transformer based end-to-end modelling approaches with multiple stream inputs have achieved great success in various automatic speech recognition (ASR) tasks. An important issue associated with such approaches is that the intermediate features derived from each stream might have similar representations and thus lack feature diversity, such as the descriptions related to speaker characteristics. To address this issue, this paper proposes a novel multi-level acoustic feature extraction framework that can be easily combined with Transformer based ASR models. The framework consists of two input streams: a shallow stream with high-resolution spectrograms and a deep stream with low-resolution spectrograms. The shallow stream is used to acquire traditional shallow features that are beneficial for the classification of phones or words, while the deep stream is used to obtain utterance-level speaker-invariant deep features for improving the feature diversity. A feature correlation based fusion strategy is used to aggregate both features across the frequency and time domains, and the result is then fed into the Transformer encoder-decoder module. By using the proposed multi-level acoustic feature extraction framework, state-of-the-art word error rates of 21.7% and 2.5% were obtained on the HKUST Mandarin telephone and Librispeech speech recognition tasks respectively., Comment: Accepted by Interspeech 2022
Published: 2021

26. Adversarial Data Augmentation for Disordered Speech Recognition

Author: Jin, Zengrui, Geng, Mengzhe, Xie, Xurong, Yu, Jianwei, Liu, Shansong, Liu, Xunying, and Meng, Helen
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Automatic recognition of disordered speech remains a highly challenging task to date. The underlying neuro-motor conditions, often compounded with co-occurring physical disabilities, lead to the difficulty in collecting large quantities of impaired speech required for ASR system development. To this end, data augmentation techniques play a vital role in current disordered speech recognition systems. In contrast to existing data augmentation techniques only modifying the speaking rate or overall shape of spectral contour, fine-grained spectro-temporal differences between disordered and normal speech are modelled using deep convolutional generative adversarial networks (DCGAN) during data augmentation to modify normal speech spectra into those closer to disordered speech. Experiments conducted on the UASpeech corpus suggest the proposed adversarial data augmentation approach consistently outperformed the baseline augmentation methods using tempo or speed perturbation on a state-of-the-art hybrid DNN system. An overall word error rate (WER) reduction up to 3.05% (9.7% relative) was obtained over the baseline system using no data augmentation. The final learning hidden unit contribution (LHUC) speaker adapted system using the best adversarial augmentation approach gives an overall WER of 25.89% on the UASpeech test set of 16 dysarthric speakers., Comment: 5 pages, 3 figures, INTERSPEECH 2021
Published: 2021

27. Bayesian Learning for Deep Neural Network Adaptation

Author: Xie, Xurong, Liu, Xunying, Lee, Tan, and Wang, Lan
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Statistics - Machine Learning
Abstract: A key task for speech recognition systems is to reduce the mismatch between training and evaluation data that is often attributable to speaker differences. Speaker adaptation techniques play a vital role to reduce the mismatch. Model-based speaker adaptation approaches often require sufficient amounts of target speaker data to ensure robustness. When the amount of speaker level data is limited, speaker adaptation is prone to overfitting and poor generalization. To address the issue, this paper proposes a full Bayesian learning based DNN speaker adaptation framework to model speaker-dependent (SD) parameter uncertainty given limited speaker specific adaptation data. This framework is investigated in three forms of model based DNN adaptation techniques: Bayesian learning of hidden unit contributions (BLHUC), Bayesian parameterized activation functions (BPAct), and Bayesian hidden unit bias vectors (BHUB). In the three methods, deterministic SD parameters are replaced by latent variable posterior distributions for each speaker, whose parameters are efficiently estimated using a variational inference based approach. Experiments conducted on 300-hour speed perturbed Switchboard corpus trained LF-MMI TDNN/CNN-TDNN systems suggest the proposed Bayesian adaptation approaches consistently outperform the deterministic adaptation on the NIST Hub5'00 and RT03 evaluation sets. When using only the first five utterances from each speaker as adaptation data, significant word error rate reductions up to 1.4% absolute (7.2% relative) were obtained on the CallHome subset. The efficacy of the proposed Bayesian adaptation techniques is further demonstrated in a comparison against the state-of-the-art performance obtained on the same task using the most recent systems reported in the literature., Comment: published in TASLP, and with extra appendices of released codes and updated results
Published: 2020
Full Text: View/download PDF

28. Bayesian Learning of LF-MMI Trained Time Delay Neural Networks for Speech Recognition

Author: Hu, Shoukang, Xie, Xurong, Liu, Shansong, Yu, Jianwei, Ye, Zi, Geng, Mengzhe, Liu, Xunying, and Meng, Helen
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Discriminative training techniques define state-of-the-art performance for automatic speech recognition systems. However, they are inherently prone to overfitting, leading to poor generalization performance when using limited training data. In order to address this issue, this paper presents a full Bayesian framework to account for model uncertainty in sequence discriminative training of factored TDNN acoustic models. Several Bayesian learning based TDNN variant systems are proposed to model the uncertainty over weight parameters and choices of hidden activation functions, or the hidden layer outputs. Efficient variational inference approaches using as few as one single parameter sample ensure their computational cost in both training and evaluation time is comparable to that of the baseline TDNN systems. Statistically significant word error rate (WER) reductions of 0.4%-1.8% absolute (5%-11% relative) were obtained over a state-of-the-art 900 hour speed perturbed Switchboard corpus trained baseline LF-MMI factored TDNN system using multiple regularization methods including F-smoothing, L2 norm penalty, natural gradient, model averaging and dropout, in addition to i-Vector plus learning hidden unit contribution (LHUC) based speaker adaptation and RNNLM rescoring. Consistent performance improvements were also obtained on a 450 hour HKUST conversational Mandarin telephone speech recognition task. On a third cross domain adaptation task requiring rapidly porting a 1000 hour LibriSpeech data trained system to a small DementiaBank elderly speech corpus, the proposed Bayesian TDNN LF-MMI systems outperformed the baseline system using direct weight fine-tuning by up to 2.5% absolute WER reduction., Comment: Published in TASLP
Published: 2020

29. Neural Architecture Search For LF-MMI Trained Time Delay Neural Networks

Author: Hu, Shoukang, Xie, Xurong, Liu, Shansong, Cui, Mingyu, Geng, Mengzhe, Liu, Xunying, and Meng, Helen
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Sound
Abstract: Deep neural networks (DNNs) based automatic speech recognition (ASR) systems are often designed using expert knowledge and empirical evaluation. In this paper, a range of neural architecture search (NAS) techniques are used to automatically learn two types of hyper-parameters of state-of-the-art factored time delay neural networks (TDNNs): i) the left and right splicing context offsets; and ii) the dimensionality of the bottleneck linear projection at each hidden layer. These include the DARTS method integrating architecture selection with lattice-free MMI (LF-MMI) TDNN training; Gumbel-Softmax and pipelined DARTS reducing the confusion over candidate architectures and improving the generalization of architecture selection; and Penalized DARTS incorporating resource constraints to adjust the trade-off between performance and system complexity. Parameter sharing among candidate architectures allows efficient search over up to 7^28 different TDNN systems. Experiments conducted on the 300-hour Switchboard corpus suggest the auto-configured systems consistently outperform the baseline LF-MMI TDNN systems using manual network design or random architecture search after LHUC speaker adaptation and RNNLM rescoring. Absolute word error rate (WER) reductions up to 1.0% and relative model size reduction of 28% were obtained. Consistent performance improvements were also obtained on a UASpeech disordered speech recognition task using the proposed NAS approaches., Comment: Accepted by ICASSP 2021
Published: 2020

30. Towards High-Performance and Low-Latency Feature-Based Speaker Adaptation of Conformer Speech Recognition Systems

Author: Deng, Jiajun, primary, Xie, Xurong, additional, Li, Guinan, additional, Cui, Mingyu, additional, Geng, Mengzhe, additional, Jin, Zengrui, additional, Wang, Tianzi, additional, Hu, Shujie, additional, Li, Zhaoqing, additional, and Liu, Xunying, additional
Published: 2024
Full Text: View/download PDF

31. Towards Automatic Data Augmentation for Disordered Speech Recognition

Author: Jin, Zengrui, primary, Xie, Xurong, additional, Wang, Tianzi, additional, Geng, Mengzhe, additional, Deng, Jiajun, additional, Li, Guinan, additional, Hu, Shujie, additional, and Liu, Xunying, additional
Published: 2024
Full Text: View/download PDF

32. Probing Lexical Ambiguity in Chinese Characters via Their Word Formations: Convergence of Perceived and Computed Metrics

Author: Wang, Tianqi, primary, Xu, Xu, additional, Xie, Xurong, additional, and Ng, Manwa Lawrence, additional
Published: 2023
Full Text: View/download PDF

33. Factorised Speaker-environment Adaptive Training of Conformer Speech Recognition Systems

Author: Deng, Jiajun, primary, Li, Guinan, additional, Xie, Xurong, additional, Jin, Zengrui, additional, Cui, Mingyu, additional, Wang, Tianzi, additional, Hu, Shujie, additional, Geng, Mengzhe, additional, and Liu, Xunying, additional
Published: 2023
Full Text: View/download PDF

34. On-the-Fly Feature Based Rapid Speaker Adaptation for Dysarthric and Elderly Speech Recognition

Author: Geng, Mengzhe, primary, Xie, Xurong, additional, Su, Rongfeng, additional, Yu, Jianwei, additional, Jin, Zengrui, additional, Wang, Tianzi, additional, Hu, Shujie, additional, Ye, Zi, additional, Meng, Helen, additional, and Liu, Xunying, additional
Published: 2023
Full Text: View/download PDF

35. Use of Speech Impairment Severity for Dysarthric Speech Recognition

Author: Geng, Mengzhe, primary, Jin, Zengrui, additional, Wang, Tianzi, additional, Hu, Shujie, additional, Deng, Jiajun, additional, Cui, Mingyu, additional, Li, Guinan, additional, Yu, Jianwei, additional, Xie, Xurong, additional, and Liu, Xunying, additional
Published: 2023
Full Text: View/download PDF

36. Exploiting Cross-Domain And Cross-Lingual Ultrasound Tongue Imaging Features For Elderly And Dysarthric Speech Recognition

Author: Hu, Shujie, primary, Xie, Xurong, additional, Geng, Mengzhe, additional, Cui, Mingyu, additional, Deng, Jiajun, additional, Li, Guinan, additional, Wang, Tianzi, additional, Meng, Helen, additional, and Liu, Xunying, additional
Published: 2023
Full Text: View/download PDF

37. Adversarial Data Augmentation Using VAE-GAN for Disordered Speech Recognition

Author: Jin, Zengrui, primary, Xie, Xurong, additional, Geng, Mengzhe, additional, Wang, Tianzi, additional, Hu, Shujie, additional, Deng, Jiajun, additional, Li, Guinan, additional, and Liu, Xunying, additional
Published: 2023
Full Text: View/download PDF

38. Exploring Self-Supervised Pre-Trained ASR Models for Dysarthric and Elderly Speech Recognition

Author: Hu, Shujie, primary, Xie, Xurong, additional, Jin, Zengrui, additional, Geng, Mengzhe, additional, Wang, Yi, additional, Cui, Mingyu, additional, Deng, Jiajun, additional, Liu, Xunying, additional, and Meng, Helen, additional
Published: 2023
Full Text: View/download PDF

39. Unsupervised Model-Based Speaker Adaptation of End-To-End Lattice-Free MMI Model for Speech Recognition

Author: Xie, Xurong, primary, Liu, Xunying, additional, Chen, Hui, additional, and Wang, Hongan, additional
Published: 2023
Full Text: View/download PDF

40. ChallengeDetect: Investigating the Potential of Detecting In-Game Challenge Experience from Physiological Measures

Author: Peng, Xiaolan, primary, Xie, Xurong, additional, Huang, Jin, additional, Jiang, Chutian, additional, Wang, Haonian, additional, Denisova, Alena, additional, Chen, Hui, additional, Tian, Feng, additional, and Wang, Hongan, additional
Published: 2023
Full Text: View/download PDF

41. ChallengeDetect: Investigating the Potential of Detecting In-Game Challenge Experience from Physiological Measures

Author: Peng, Xiaolan, Xie, Xurong, Huang, Jin, Jiang, Chutian, Wang, Haonian, Denisova, Alena, Chen, Hui, Tian, Feng, and Wang, Hongan
Abstract: Challenge is the core element of digital games. The wide spectrum of physical, cognitive, and emotional challenge experiences provided by modern digital games can be evaluated subjectively using a questionnaire, the CORGIS, which allows for a post hoc evaluation of the overall experience that occurred during game play. Measuring this experience dynamically and objectively, however, would allow for a more holistic view of the moment-to-moment experiences of players. This study, therefore, explored the potential of detecting perceived challenge from physiological signals. For this, we collected physiological responses from 32 players who engaged in three typical game scenarios. Using perceived challenge ratings from players and extracted physiological features, we applied multiple machine learning methods and metrics to detect challenge experiences. Results show that most methods achieved a detection accuracy of around 80%. We discuss in-game challenge perception, challenge-related physiological indicators and AI-supported challenge detection to inform future work on challenge evaluation.
Published: 2023

42. Confidence Score Based Speaker Adaptation of Conformer Speech Recognition Systems

Author: Deng, Jiajun, primary, Xie, Xurong, additional, Wang, Tianzi, additional, Cui, Mingyu, additional, Xue, Boyang, additional, Jin, Zengrui, additional, Li, Guinan, additional, Hu, Shujie, additional, and Liu, Xunying, additional
Published: 2023
Full Text: View/download PDF

43. Photosynthetic index and nitrogen assimilation in rapeseed seedlings transplanted in soil with ammonium glufosinate/Indice fotossintetico e assimilacao de nitrogenio em plantulas de colza transplantadas em resposta a residuos de glufosinato presentes no solo

Author: Cui, Cui, Xie, Xurong, Wang, Liu-Yan, Wang, Rui-Li, Lei, Wei, Lv, Jun, Chen, Liuyi, Gao, Huan-Huan, Ye, Sang, Huang, Linya, and Zhou, Qing-Yuan
Published: 2020
Full Text: View/download PDF

44. A Multi-level Acoustic Feature Extraction Framework for Transformer Based End-to-End Speech Recognition

Author: Li, Jin, primary, Su, Rongfeng, additional, Xie, Xurong, additional, Wang, Lan, additional, and Yan, Nan, additional
Published: 2022
Full Text: View/download PDF

45. Confidence Score Based Conformer Speaker Adaptation for Speech Recognition

Author: DENG, Jiajun, primary, Xie, Xurong, additional, Wang, Tianzi, additional, Cui, Mingyu, additional, Xue, Boyang, additional, Jin, Zengrui, additional, Geng, Mengzhe, additional, Li, Guinan, additional, Liu, Xunying, additional, and Meng, Helen, additional
Published: 2022
Full Text: View/download PDF

46. Two-pass Decoding and Cross-adaptation Based System Combination of End-to-end Conformer and Hybrid TDNN ASR Systems

Author: Cui, Mingyu, primary, DENG, Jiajun, additional, Hu, Shoukang, additional, Xie, Xurong, additional, Wang, Tianzi, additional, HU, Shujie, additional, Geng, Mengzhe, additional, Xue, Boyang, additional, Liu, Xunying, additional, and Meng, Helen, additional
Published: 2022
Full Text: View/download PDF

47. Exploiting Cross Domain Acoustic-to-Articulatory Inverted Features for Disordered Speech Recognition

Author: Hu, Shujie, primary, Liu, Shansong, additional, Xie, Xurong, additional, Geng, Mengzhe, additional, Wang, Tianzi, additional, Hu, Shoukang, additional, Cui, Mingyu, additional, Liu, Xunying, additional, and Meng, Helen, additional
Published: 2022
Full Text: View/download PDF

48. Detecting challenge from physiological signals: A primary study with a typical game scenario

Author: Peng, Xiaolan, primary, Meng, Chenyu, additional, Xie, Xurong, additional, Huang, Jin, additional, Chen, Hui, additional, and Wang, Hongan, additional
Published: 2022
Full Text: View/download PDF

49. Neural Architecture Search for LF-MMI Trained Time Delay Neural Networks

Author: Hu, Shoukang, primary, Xie, Xurong, additional, Cui, Mingyu, additional, Deng, Jiajun, additional, Liu, Shansong, additional, Yu, Jianwei, additional, Geng, Mengzhe, additional, Liu, Xunying, additional, and Meng, Helen, additional
Published: 2022
Full Text: View/download PDF

50. Speaker Adaptation Using Spectro-Temporal Deep Features for Dysarthric and Elderly Speech Recognition

Author: Geng, Mengzhe, primary, Xie, Xurong, additional, Ye, Zi, additional, Wang, Tianzi, additional, Li, Guinan, additional, Hu, Shujie, additional, Liu, Xunying, additional, and Meng, Helen, additional
Published: 2022
Full Text: View/download PDF
