Author: "Zhang, Huayun" / Database: arXiv - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Zhang, Huayun"' showing total 7 results

1. Semi-supervised Learning For Robust Speech Evaluation

Author: Zhang, Huayun, Wong, Jeremy H. M., Lin, Geyu, and Chen, Nancy F.
Subjects: Computer Science - Artificial Intelligence
Abstract: Speech evaluation measures a learner's oral proficiency using automatic models. Corpora for training such models often pose sparsity challenges given that there often is limited scored data from teachers, in addition to the score distribution across proficiency levels being often imbalanced among student cohorts. Automatic scoring is thus not robust when faced with under-represented samples or out-of-distribution samples, which inevitably exist in real-world deployment scenarios. This paper proposes to address such challenges by exploiting semi-supervised pre-training and objective regularization to approximate subjective evaluation criteria. In particular, normalized mutual information is used to quantify the speech characteristics from the learner and the reference. An anchor model is trained using pseudo labels to predict the correctness of pronunciation. An interpolated loss function is proposed to minimize not only the prediction error with respect to ground-truth scores but also the divergence between two probability distributions estimated by the speech evaluation model and the anchor model. Compared to other state-of-the-art methods on a public data-set, this approach not only achieves high performance while evaluating the entire test-set as a whole, but also brings the most evenly distributed prediction error across distinct proficiency levels. Furthermore, empirical results show the model accuracy on out-of-distribution data also compares favorably with competitive baselines.
Comment: 6 pages
Published: 2024
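
The interpolated loss described in the abstract above can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption: mean squared error stands in for the prediction error, KL divergence stands in for the divergence between the two distributions, and the function names and the `weight` parameter are hypothetical.

```python
import math


def mse(pred, target):
    """Mean squared error between predicted and ground-truth scores."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def interpolated_loss(pred_scores, true_scores, p_eval, p_anchor, weight=0.5):
    """Blend score-prediction error with divergence from the anchor model.

    Assumption: a simple convex combination controlled by `weight`;
    the paper's actual loss may differ.
    """
    error_term = mse(pred_scores, true_scores)
    divergence_term = kl_divergence(p_eval, p_anchor)
    return (1.0 - weight) * error_term + weight * divergence_term


# Example: the evaluation model agrees with the anchor, so only the error term remains.
loss = interpolated_loss([1.0, 2.0], [1.0, 4.0], [0.5, 0.5], [0.5, 0.5], weight=0.5)
```

Setting `weight` to 0 in this sketch recovers plain supervised regression, while larger values pull the evaluation model's distribution toward the anchor model's.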

2. Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization

Author: Gao, Xiaoxue, Zhang, Chen, Chen, Yiming, Zhang, Huayun, and Chen, Nancy F.
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound, Electrical Engineering and Systems Science - Signal Processing
Abstract: Current emotional text-to-speech (TTS) models predominantly conduct supervised training to learn the conversion from text and desired emotion to its emotional speech, focusing on a single emotion per text-speech pair. These models only learn the correct emotional outputs without fully comprehending other emotion characteristics, which limits their capabilities of capturing the nuances between different emotions. We propose a controllable Emo-DPO approach, which employs direct preference optimization to differentiate subtle emotional nuances between emotions through optimizing towards preferred emotions over less preferred emotional ones. Instead of relying on traditional neural architectures used in existing emotional TTS models, we propose utilizing the emotion-aware LLM-TTS neural architecture to leverage LLMs' in-context learning and instruction-following capabilities. Comprehensive experiments confirm that our proposed method outperforms the existing baselines.
Comment: 5 pages
Published: 2024

3. PRESENT: Zero-Shot Text-to-Prosody Control

Author: Lam, Perry, Zhang, Huayun, Chen, Nancy F., Sisman, Berrak, and Herremans, Dorien
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning
Abstract: Current strategies for achieving fine-grained prosody control in speech synthesis entail extracting additional style embeddings or adopting more complex architectures. To enable zero-shot application of pretrained text-to-speech (TTS) models, we present PRESENT (PRosody Editing without Style Embeddings or New Training), which exploits explicit prosody prediction in FastSpeech2-based models by modifying the inference process directly. We apply our text-to-prosody framework to zero-shot language transfer using a JETS model exclusively trained on English LJSpeech data. We obtain character error rates (CER) of 12.8%, 18.7% and 5.9% for German, Hungarian and Spanish respectively, beating the previous state-of-the-art CER by over 2x for all three languages. Furthermore, we allow subphoneme-level control, a first in this field. To evaluate its effectiveness, we show that PRESENT can improve the prosody of questions, and use it to generate Mandarin, a tonal language where vowel pitch varies at subphoneme level. We attain 25.3% hanzi CER and 13.0% pinyin CER with the JETS model. All our code and audio samples are available online.
Published: 2024

4. Multiple output samples per input in a single-output Gaussian process

Author: Wong, Jeremy H. M., Zhang, Huayun, and Chen, Nancy F.
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The standard Gaussian Process (GP) only considers a single output sample per input in the training set. Datasets for subjective tasks, such as spoken language assessment, may be annotated with output labels from multiple human raters per input. This paper proposes to generalise the GP to allow for these multiple output samples in the training set, and thus make use of available output uncertainty information. This differs from a multi-output GP, as all output samples are from the same task here. The output density function is formulated to be the joint likelihood of observing all output samples, and latent variables are not repeated to reduce computation cost. The test set predictions are inferred similarly to a standard GP, with a difference being in the optimised hyper-parameters. This is evaluated on speechocean762, showing that it allows the GP to compute a test set output distribution that is more similar to the collection of reference outputs from the multiple human raters.
Comment: This paper is presented in the "Symposium for Celebrating 40 Years of Bayesian Learning in Speech and Language Processing and Beyond", which is a satellite event of the ASRU workshop, on 20 December 2023. https://bayesian40.github.io/
Published: 2023

5. SNIPER Training: Single-Shot Sparse Training for Text-to-Speech

Author: Lam, Perry, Zhang, Huayun, Chen, Nancy F., Sisman, Berrak, and Herremans, Dorien
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Text-to-speech (TTS) models have achieved remarkable naturalness in recent years, yet like most deep neural models, they have more parameters than necessary. Sparse TTS models can improve on dense models via pruning and extra retraining, or converge faster than dense models with some performance loss. Thus, we propose training TTS models using decaying sparsity, i.e. a high initial sparsity to accelerate training first, followed by a progressive rate reduction to obtain better eventual performance. This decremental approach differs from current methods of incrementing sparsity to a desired target, which costs significantly more time than dense training. We call our method SNIPER training: Single-shot Initialization Pruning Evolving-Rate training. Our experiments on FastSpeech2 show that we were able to obtain better losses in the first few training epochs with SNIPER, and that the final SNIPER-trained models outperformed constant-sparsity models and edged out dense models, with negligible difference in training time.
Published: 2022
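
The decaying-sparsity idea in the abstract above (high sparsity early in training, progressively reduced later) can be sketched as a simple schedule. The abstract does not state the actual decay rule, so the linear decay, the function name `sparsity_schedule`, and all default values below are hypothetical choices for illustration only.

```python
def sparsity_schedule(epoch, initial=0.8, final=0.0, decay_epochs=10):
    """Return the target sparsity (fraction of weights masked) for an epoch.

    Assumption: sparsity decays linearly from `initial` to `final` over
    `decay_epochs` epochs, then stays at `final`. The paper's schedule
    may use a different shape or rate.
    """
    if epoch >= decay_epochs:
        return final
    fraction_done = epoch / decay_epochs
    return initial + (final - initial) * fraction_done


# Example: target sparsity for the first twelve epochs.
schedule = [sparsity_schedule(e) for e in range(12)]
```

This contrasts with schedules that increase sparsity toward a target, which the abstract describes as the current approach.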

6. EPIC TTS Models: Empirical Pruning Investigations Characterizing Text-To-Speech Models

Author: Lam, Perry, Zhang, Huayun, Chen, Nancy F., and Sisman, Berrak
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning
Abstract: Neural models are known to be over-parameterized, and recent work has shown that sparse text-to-speech (TTS) models can outperform dense models. Although a plethora of sparse methods has been proposed for other domains, such methods have rarely been applied in TTS. In this work, we seek to answer the question: what are the characteristics of selected sparse techniques on the performance and model complexity? We compare a Tacotron2 baseline and the results of applying five techniques. We then evaluate the performance via the factors of naturalness, intelligibility and prosody, while reporting model size and training time. Complementary to prior research, we find that pruning before or during training can achieve similar performance to pruning after training and can be trained much faster, while removing entire neurons degrades performance much more than removing parameters. To the best of our knowledge, this is the first work that compares sparsity paradigms in text-to-speech synthesis.
Published: 2022

7. Multilingual Speech Evaluation: Case Studies on English, Malay and Tamil

Author: Zhang, Huayun, Shi, Ke, and Chen, Nancy F.
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Speech evaluation is an essential component in computer-assisted language learning (CALL). While speech evaluation on English has been popular, automatic speech scoring on low resource languages remains challenging. Work in this area has focused on monolingual specific designs and handcrafted features stemming from resource-rich languages like English. Such approaches are often difficult to generalize to other languages, especially if we also want to consider suprasegmental qualities such as rhythm. In this work, we examine three different languages that possess distinct rhythm patterns: English (stress-timed), Malay (syllable-timed), and Tamil (mora-timed). We exploit robust feature representations inspired by music processing and vector representation learning. Empirical validations show consistent gains for all three languages when predicting pronunciation, rhythm and intonation performance.
Comment: Accepted at INTERSPEECH 2021
Published: 2021
