Author: "Zhang, Wei-Qiang" / Topic: fos: electrical engineering, electronic engineering, information engineering - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Zhang, Wei-Qiang"' showing total 12 results

Start Over Author "Zhang, Wei-Qiang" Topic fos: electrical engineering, electronic engineering, information engineering

12 results on '"Zhang, Wei-Qiang"'

1. DistilXLSR: A Light Weight Cross-Lingual Speech Representation Model

Author: Wang, Haoyu, Wang, Siyuan, Zhang, Wei-Qiang, and Bai, Jinfeng
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Multilingual self-supervised speech representation models have greatly enhanced the speech recognition performance for low-resource languages, and the compression of these huge models has also become a crucial prerequisite for their industrial application. In this paper, we propose DistilXLSR, a distilled cross-lingual speech representation model. By randomly shuffling the phonemes of existing speech, we reduce the linguistic information and distill cross-lingual models using only English data. We also design a layer-jumping initialization method to fully leverage the teacher's pre-trained weights. Experiments on 2 kinds of teacher models and 15 low-resource languages show that our method can reduce the parameters by 50% while maintaining cross-lingual representation ability. Our method is proven to be generalizable to various languages/teacher models and has the potential to improve the cross-lingual performance of the English pre-trained models., Accepted by INTERSPEECH 2023
Published: 2023

2. Cross-lingual Alzheimer's Disease detection based on paralinguistic and pre-trained features

Author: Chen, Xuchu, Pu, Yu, Li, Jinpeng, and Zhang, Wei-Qiang
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We present our submission to the ICASSP-SPGC-2023 ADReSS-M Challenge Task, which aims to investigate which acoustic features can be generalized and transferred across languages for Alzheimer's Disease (AD) prediction. The challenge consists of two tasks: one is to classify the speech of AD patients and healthy individuals, and the other is to infer Mini Mental State Examination (MMSE) score based on speech only. The difficulty is mainly embodied in the mismatch of the dataset, in which the training set is in English while the test set is in Greek. We extract paralinguistic features using openSmile toolkit and acoustic features using XLSR-53. In addition, we extract linguistic features after transcribing the speech into text. These features are used as indicators for AD detection in our method. Our method achieves an accuracy of 69.6% on the classification task and a root mean squared error (RMSE) of 4.788 on the regression task. The results show that our proposed method is expected to achieve automatic multilingual Alzheimer's Disease detection through spontaneous speech., accepted by ICASSP 2023
Published: 2023

3. Task-Agnostic Structured Pruning of Speech Representation Models

Author: Wang, Haoyu, Wang, Siyuan, Zhang, Wei-Qiang, Suo, Hongbin, and Wan, Yulong
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Self-supervised pre-trained models such as Wav2vec2, Hubert, and WavLM have been shown to significantly improve many speech tasks. However, their large memory and strong computational requirements hinder their industrial applicability. Structured pruning is a hardware-friendly model compression technique but usually results in a larger loss of accuracy. In this paper, we propose a fine-grained attention head pruning method to compensate for the performance degradation. In addition, we also introduce the straight through estimator into the L0 regularization to further accelerate the pruned model. Experiments on the SUPERB benchmark show that our model can achieve comparable performance to the dense model in multiple tasks and outperforms the Wav2vec 2.0 base model on average, with 72% fewer parameters and 2 times faster inference speed., Comment: Accepted by INTERSPEECH 2023
Published: 2023
Full Text: View/download PDF

4. BERT-LID: Leveraging BERT to Improve Spoken Language Identification

Author: Nie, Yuting, Zhao, Junhong, Zhang, Wei-Qiang, and Bai, Jinfeng
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Uncategorized
Abstract: Language identification is the task of automatically determining the identity of a language conveyed by a spoken segment. It has a profound impact on the multilingual interoperability of an intelligent speech system. Despite language identification attaining high accuracy on medium or long utterances(>3s), the performance on short utterances (, accepted by ISCSLP 2022
Published: 2023
Full Text: View/download PDF

5. Exploring Effective Fusion Algorithms for Speech Based Self-Supervised Learning Models

Author: Tang, Changli, Wang, Yujin, Chen, Xie, and Zhang, Wei-Qiang
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Self-supervised learning (SSL) has achieved great success in various areas including speech processing. Recently, it is proven that speech based SSL models are able to extract superior universal representations on a range of downstream tasks compared to traditional hand-craft feature (e.g. FBank, MFCC) in the SUPERB benchmark. However, different types of SSL models might exhibit distinct strengths on different downstream tasks. In order to better utilize the potential power of SSL models, in this work, we explore the effective fusion on multiple SSL models. A series of model fusion algorithms are investigated and compared by combining two types of SSL models, Hubert and Data2vec, on two representative tasks from SUPERB benchmark, which are speaker identification (SID) and automatic speech recognition (ASR) tasks. The experimental results demonstrate that our proposed fusion algorithms can further boost the individual model significantly., Accepted by NCMMSC2022
Published: 2022

6. Multilingual Zero Resource Speech Recognition Base on Self-Supervise Pre-Trained Acoustic Models

Author: Wang, Haoyu, Zhang, Wei-Qiang, Suo, Hongbin, and Wan, Yulong
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Labeled audio data is insufficient to build satisfying speech recognition systems for most of the languages in the world. There have been some zero-resource methods trying to perform phoneme or word-level speech recognition without labeled audio data of the target language, but the error rate of these methods is usually too high to be applied in real-world scenarios. Recently, the representation ability of self-supervise pre-trained models has been found to be extremely beneficial in zero-resource phoneme recognition. As far as we are concerned, this paper is the first attempt to extend the use of pre-trained models into word-level zero-resource speech recognition. This is done by fine-tuning the pre-trained models on IPA phoneme transcriptions and decoding with a language model trained on extra texts. Experiments on Wav2vec 2.0 and HuBERT models show that this method can achieve less than 20% word error rate on some languages, and the average error rate on 8 languages is 33.77%., accepted by ISCSLP 2022
Published: 2022

7. LMD: A Learnable Mask Network to Detect Adversarial Examples for Speaker Verification

Author: Chen, Xing, Wang, Jie, Zhang, Xiao-Lei, Zhang, Wei-Qiang, and Yang, Kunde
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Machine Learning, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Machine Learning (cs.LG), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Although the security of automatic speaker verification (ASV) is seriously threatened by recently emerged adversarial attacks, there have been some countermeasures to alleviate the threat. However, many defense approaches not only require the prior knowledge of the attackers but also possess weak interpretability. To address this issue, in this paper, we propose an attacker-independent and interpretable method, named learnable mask detector (LMD), to separate adversarial examples from the genuine ones. It utilizes score variation as an indicator to detect adversarial examples, where the score variation is the absolute discrepancy between the ASV scores of an original audio recording and its transformed audio synthesized from its masked complex spectrogram. A core component of the score variation detector is to generate the masked spectrogram by a neural network. The neural network needs only genuine examples for training, which makes it an attacker-independent approach. Its interpretability lies that the neural network is trained to minimize the score variation of the targeted ASV, and maximize the number of the masked spectrogram bins of the genuine training examples. Its foundation is based on the observation that, masking out the vast majority of the spectrogram bins with little speaker information will inevitably introduce a large score variation to the adversarial example, and a small score variation to the genuine example. Experimental results with 12 attackers and two representative ASV systems show that our proposed method outperforms five state-of-the-art baselines. The extensive experimental results can also be a benchmark for the detection-based ASV defenses., 13 pages, 9 figures
Published: 2022

8. Summary on the ISCSLP 2022 Chinese-English Code-Switching ASR Challenge

Author: Deng, Shuhao, Li, Chengfei, Bai, Jinfeng, Zhang, Qingqing, Zhang, Wei-Qiang, Yang, Runyan, Cheng, Gaofeng, Zhang, Pengyuan, and Yan, Yonghong
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Code-switching automatic speech recognition becomes one of the most challenging and the most valuable scenarios of automatic speech recognition, due to the code-switching phenomenon between multilingual language and the frequent occurrence of code-switching phenomenon in daily life. The ISCSLP 2022 Chinese-English Code-Switching Automatic Speech Recognition (CSASR) Challenge aims to promote the development of code-switching automatic speech recognition. The ISCSLP 2022 CSASR challenge provided two training sets, TAL_CSASR corpus and MagicData-RAMC corpus, a development and a test set for participants, which are used for CSASR model training and evaluation. Along with the challenge, we also provide the baseline system performance for reference. As a result, more than 40 teams participated in this challenge, and the winner team achieved 16.70% Mixture Error Rate (MER) performance on the test set and has achieved 9.8% MER absolute improvement compared with the baseline system. In this paper, we will describe the datasets, the associated baselines system and the requirements, and summarize the CSASR challenge results and major techniques and tricks used in the submitted systems., Comment: accepted by ISCSLP 2022
Published: 2022
Full Text: View/download PDF

9. Exploring Effective Distillation of Self-Supervised Speech Models for Automatic Speech Recognition

Author: Wang, Yujin, Tang, Changli, Ma, Ziyang, Zheng, Zhisheng, Chen, Xie, and Zhang, Wei-Qiang
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent years have witnessed great strides in self-supervised learning (SSL) on the speech processing. The SSL model is normally pre-trained on a great variety of unlabelled data and a large model size is preferred to increase the modeling capacity. However, this might limit its potential applications due to the expensive computation and memory costs introduced by the oversize model. Miniaturization for SSL models has become an important research direction of practical value. To this end, we explore the effective distillation of HuBERT-based SSL models for automatic speech recognition (ASR). First, in order to establish a strong baseline, a comprehensive study on different student model structures is conducted. On top of this, as a supplement to the regression loss widely adopted in previous works, a discriminative loss is introduced for HuBERT to enhance the distillation performance, especially in low-resource scenarios. In addition, we design a simple and effective algorithm to distill the front-end input from waveform to Fbank feature, resulting in 17% parameter reduction and doubling inference speed, at marginal performance degradation.
Published: 2022
Full Text: View/download PDF

10. Full Attention Bidirectional Deep Learning Structure for Single Channel Speech Enhancement

Author: Yan, Yuzi, Zhang, Wei-Qiang, and Johnson, Michael T.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Machine Learning (cs.LG)
Abstract: As the cornerstone of other important technologies, such as speech recognition and speech synthesis, speech enhancement is a critical area in audio signal processing. In this paper, a new deep learning structure for speech enhancement is demonstrated. The model introduces a "full" attention mechanism to a bidirectional sequence-to-sequence method to make use of latent information after each focal frame. This is an extension of the previous attention-based RNN method. The proposed bidirectional attention-based architecture achieves better performance in terms of speech quality (PESQ), compared with OM-LSA, CNN-LSTM, T-GSA and the unidirectional attention-based LSTM baseline., Comment: 4 pages
Published: 2021
Full Text: View/download PDF

11. AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style

Author: Yan, Yuzi, Tan, Xu, Li, Bohan, Zhang, Guangyan, Qin, Tao, Zhao, Sheng, Shen, Yuan, Zhang, Wei-Qiang, and Liu, Tie-Yan
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Machine Learning (cs.LG)
Abstract: While recent text to speech (TTS) models perform very well in synthesizing reading-style (e.g., audiobook) speech, it is still challenging to synthesize spontaneous-style speech (e.g., podcast or conversation), mainly because of two reasons: 1) the lack of training data for spontaneous speech; 2) the difficulty in modeling the filled pauses (um and uh) and diverse rhythms in spontaneous speech. In this paper, we develop AdaSpeech 3, an adaptive TTS system that fine-tunes a well-trained reading-style TTS model for spontaneous-style speech. Specifically, 1) to insert filled pauses (FP) in the text sequence appropriately, we introduce an FP predictor to the TTS model; 2) to model the varying rhythms, we introduce a duration predictor based on mixture of experts (MoE), which contains three experts responsible for the generation of fast, medium and slow speech respectively, and fine-tune it as well as the pitch predictor for rhythm adaptation; 3) to adapt to other speaker timbre, we fine-tune some parameters in the decoder with few speech data. To address the challenge of lack of training data, we mine a spontaneous speech dataset to support our research this work and facilitate future research on spontaneous TTS. Experiments show that AdaSpeech 3 synthesizes speech with natural FP and rhythms in spontaneous styles, and achieves much better MOS and SMOS scores than previous adaptive TTS systems., Comment: Accepted by INTERSPEECH 2021
Published: 2021
Full Text: View/download PDF

12. THUEE system description for NIST 2019 SRE CTS Challenge

Author: Liu, Yi, Liang, Tianyu, Xu, Can, Zhang, Xianwei, Chen, Xianhong, Zhang, Wei-Qiang, He, Liang, song, Dandan, Li, Ruyun, Wu, Yangcheng, Ouyang, Peng, and Yin, Shouyi
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper describes the systems submitted by the department of electronic engineering, institute of microelectronics of Tsinghua university and TsingMicro Co. Ltd. (THUEE) to the NIST 2019 speaker recognition evaluation CTS challenge. Six subsystems, including etdnn/ams, ftdnn/as, eftdnn/ams, resnet, multitask and c-vector are developed in this evaluation., Comment: This is the system description of THUEE submitted to NIST SRE 2019
Published: 2019
Full Text: View/download PDF

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources

Refine your results

12 results on '"Zhang, Wei-Qiang"'

1. DistilXLSR: A Light Weight Cross-Lingual Speech Representation Model

2. Cross-lingual Alzheimer's Disease detection based on paralinguistic and pre-trained features

3. Task-Agnostic Structured Pruning of Speech Representation Models

4. BERT-LID: Leveraging BERT to Improve Spoken Language Identification

5. Exploring Effective Fusion Algorithms for Speech Based Self-Supervised Learning Models

6. Multilingual Zero Resource Speech Recognition Base on Self-Supervise Pre-Trained Acoustic Models

7. LMD: A Learnable Mask Network to Detect Adversarial Examples for Speaker Verification

8. Summary on the ISCSLP 2022 Chinese-English Code-Switching ASR Challenge

9. Exploring Effective Distillation of Self-Supervised Speech Models for Automatic Speech Recognition

10. Full Attention Bidirectional Deep Learning Structure for Single Channel Speech Enhancement

11. AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style

12. THUEE system description for NIST 2019 SRE CTS Challenge

Catalog

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Database

Publisher

12 results on '"Zhang, Wei-Qiang"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources