572 results on "Wang, Hsin-Min"
Search Results
2. Effective Noise-aware Data Simulation for Domain-adaptive Speech Enhancement Leveraging Dynamic Stochastic Perturbation
- Author
-
Wang, Chien-Chun, Chen, Li-Wei, Lee, Hung-Shin, Chen, Berlin, and Wang, Hsin-Min
- Subjects
Computer Science - Sound ,Computer Science - Artificial Intelligence ,Computer Science - Computation and Language ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Cross-domain speech enhancement (SE) is often faced with severe challenges due to the scarcity of noise and background information in an unseen target domain, leading to a mismatch between training and test conditions. This study puts forward a novel data simulation method to address this issue, leveraging noise-extractive techniques and generative adversarial networks (GANs) with only limited target noisy speech data. Notably, our method employs a noise encoder to extract noise embeddings from target-domain data. These embeddings aptly guide the generator to synthesize utterances acoustically fitted to the target domain while authentically preserving the phonetic content of the input clean speech. Furthermore, we introduce the notion of dynamic stochastic perturbation, which can inject controlled perturbations into the noise embeddings during inference, thereby enabling the model to generalize well to unseen noise conditions. Experiments on the VoiceBank-DEMAND benchmark dataset demonstrate that our domain-adaptive SE method outperforms an existing strong baseline based on data simulation., Comment: Accepted to IEEE SLT 2024
- Published
- 2024
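The "dynamic stochastic perturbation" described in the entry above amounts to injecting a controlled random offset into the noise embedding at inference time. Below is a minimal sketch of one way this could look, assuming the embedding is a fixed-size vector and the perturbation scale is a tunable hyperparameter; the function name and scaling rule are illustrative, not taken from the paper.

```python
import numpy as np

def perturb_noise_embedding(embedding, scale=0.1, rng=None):
    """Add a controlled random perturbation to a noise embedding (illustrative).

    The perturbation direction is a random unit vector and its magnitude is
    `scale` times the embedding norm, so the amount of drift stays bounded.
    """
    rng = rng or np.random.default_rng()
    direction = rng.standard_normal(embedding.shape)
    direction /= np.linalg.norm(direction) + 1e-8
    return embedding + scale * np.linalg.norm(embedding) * direction
```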
3. SVSNet+: Enhancing Speaker Voice Similarity Assessment Models with Representations from Speech Foundation Models
- Author
-
Yin, Chun, Chi, Tai-Shih, Tsao, Yu, and Wang, Hsin-Min
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing ,Computer Science - Machine Learning ,Computer Science - Sound - Abstract
Representations from pre-trained speech foundation models (SFMs) have shown impressive performance in many downstream tasks. However, the potential benefits of incorporating pre-trained SFM representations into speaker voice similarity assessment have not been thoroughly investigated. In this paper, we propose SVSNet+, a model that integrates pre-trained SFM representations to improve performance in assessing speaker voice similarity. Experimental results on the Voice Conversion Challenge 2018 and 2020 datasets show that SVSNet+ incorporating WavLM representations shows significant improvements compared to baseline models. In addition, while fine-tuning WavLM with a small dataset of the downstream task does not improve performance, using the same dataset to learn a weighted-sum representation of WavLM can substantially improve performance. Furthermore, when WavLM is replaced by other SFMs, SVSNet+ still outperforms the baseline models and exhibits strong generalization ability., Comment: Accepted to INTERSPEECH 2024
- Published
- 2024
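The entry above notes that learning a weighted-sum representation over WavLM layers helps even when fine-tuning does not. A common way to implement such a layer-weighted representation is sketched below in PyTorch; the module name and tensor layout are assumptions for illustration, not the exact SVSNet+ code.

```python
import torch
import torch.nn as nn

class WeightedSumSSL(nn.Module):
    """Learnable weighted sum over the hidden layers of a frozen SSL model."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_outputs: torch.Tensor) -> torch.Tensor:
        # layer_outputs: stacked hidden states, shape (num_layers, batch, time, dim)
        w = torch.softmax(self.weights, dim=0)
        return (w.view(-1, 1, 1, 1) * layer_outputs).sum(dim=0)
```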
4. Unmasking Illusions: Understanding Human Perception of Audiovisual Deepfakes
- Author
-
Hashmi, Ammarah, Shahzad, Sahibzada Adil, Lin, Chia-Wen, Tsao, Yu, and Wang, Hsin-Min
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Artificial Intelligence ,Computer Science - Computers and Society ,Computer Science - Machine Learning ,Computer Science - Multimedia - Abstract
The emergence of contemporary deepfakes has attracted significant attention in machine learning research, as artificial intelligence (AI) generated synthetic media increases the incidence of misinterpretation and is difficult to distinguish from genuine content. Currently, machine learning techniques have been extensively studied for automatically detecting deepfakes. However, human perception has been less explored. Malicious deepfakes could ultimately cause public and social problems. Can we humans correctly perceive the authenticity of the content of the videos we watch? The answer is obviously uncertain; therefore, this paper aims to evaluate the human ability to discern deepfake videos through a subjective study. We present our findings by comparing human observers to five state-of-the-art audiovisual deepfake detection models. To this end, we used gamification concepts to provide 110 participants (55 native English speakers and 55 non-native English speakers) with a web-based platform where they could access a series of 40 videos (20 real and 20 fake) to determine their authenticity. Each participant performed the experiment twice with the same 40 videos in different random orders. The videos are manually selected from the FakeAVCeleb dataset. We found that all AI models performed better than humans when evaluated on the same 40 videos. The study also reveals that while deception is not impossible, humans tend to overestimate their detection capabilities. Our experimental results may help benchmark human versus machine performance, advance forensics analysis, and enable adaptive countermeasures.
- Published
- 2024
5. SpeechCLIP+: Self-supervised multi-task representation learning for speech via CLIP and speech-image data
- Author
-
Wang, Hsuan-Fu, Shih, Yi-Jen, Chang, Heng-Jui, Berry, Layne, Peng, Puyuan, Lee, Hung-yi, Wang, Hsin-Min, and Harwath, David
- Subjects
Computer Science - Computation and Language ,Computer Science - Sound ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
The recently proposed visually grounded speech model SpeechCLIP is an innovative framework that bridges speech and text through images via CLIP without relying on text transcription. On this basis, this paper introduces two extensions to SpeechCLIP. First, we apply the Continuous Integrate-and-Fire (CIF) module to replace a fixed number of CLS tokens in the cascaded architecture. Second, we propose a new hybrid architecture that merges the cascaded and parallel architectures of SpeechCLIP into a multi-task learning framework. Our experimental evaluation is performed on the Flickr8k and SpokenCOCO datasets. The results show that in the speech keyword extraction task, the CIF-based cascaded SpeechCLIP model outperforms the previous cascaded SpeechCLIP model using a fixed number of CLS tokens. Furthermore, through our hybrid architecture, cascaded task learning boosts the performance of the parallel branch in image-speech retrieval tasks., Comment: Accepted to ICASSP 2024, Self-supervision in Audio, Speech, and Beyond (SASB) workshop
- Published
- 2024
6. HAAQI-Net: A Non-intrusive Neural Music Audio Quality Assessment Model for Hearing Aids
- Author
-
Wisnu, Dyah A. M. G., Rini, Stefano, Zezario, Ryandhimas E., Wang, Hsin-Min, and Tsao, Yu
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing ,Computer Science - Machine Learning ,Computer Science - Sound - Abstract
This paper introduces HAAQI-Net, a non-intrusive deep learning model for music audio quality assessment tailored for hearing aid users. Unlike traditional methods like the Hearing Aid Audio Quality Index (HAAQI), which rely on intrusive comparisons to a reference signal, HAAQI-Net offers a more accessible and efficient alternative. Using a bidirectional Long Short-Term Memory (BLSTM) architecture with attention mechanisms and features from the pre-trained BEATs model, HAAQI-Net predicts HAAQI scores directly from music audio clips and hearing loss patterns. Results show HAAQI-Net's effectiveness, with predicted scores achieving a Linear Correlation Coefficient (LCC) of 0.9368, a Spearman's Rank Correlation Coefficient (SRCC) of 0.9486, and a Mean Squared Error (MSE) of 0.0064, reducing inference time from 62.52 seconds to 2.54 seconds. Although effective, feature extraction via the large BEATs model incurs computational overhead. To address this, a knowledge distillation strategy creates a student distillBEATs model, distilling information from the teacher BEATs model during HAAQI-Net training, reducing required parameters. The distilled HAAQI-Net maintains strong performance with an LCC of 0.9071, an SRCC of 0.9307, and an MSE of 0.0091, while reducing parameters by 75.85% and inference time by 96.46%. This reduction enhances HAAQI-Net's efficiency and scalability, making it viable for real-world music audio quality assessment in hearing aid settings. This work also opens avenues for further research into optimizing deep learning models for specific applications, contributing to audio signal processing and quality assessment by providing insights into developing efficient and accurate models for practical applications in hearing aid technology.
- Published
- 2024
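The knowledge-distillation step described in the entry above (training a smaller distillBEATs student alongside HAAQI-Net) can be summarized as adding a feature-matching term to the regression objective. The sketch below shows one plausible form; the loss weighting and the use of MSE for both terms are assumptions, not the paper's exact recipe.

```python
import torch.nn.functional as F

def distillation_loss(student_feats, teacher_feats, student_score, target_score, alpha=0.5):
    """Feature-matching distillation combined with the downstream regression loss.

    The student mimics the frozen teacher's features while also learning to
    predict the target quality score; `alpha` balances the two terms.
    """
    feat_loss = F.mse_loss(student_feats, teacher_feats.detach())
    task_loss = F.mse_loss(student_score, target_score)
    return alpha * feat_loss + (1 - alpha) * task_loss
```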
7. Multi-objective Non-intrusive Hearing-aid Speech Assessment Model
- Author
-
Chiang, Hsin-Tien, Fu, Szu-Wei, Wang, Hsin-Min, Tsao, Yu, and Hansen, John H. L.
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing ,Computer Science - Sound - Abstract
Without the need for a clean reference, non-intrusive speech assessment methods have caught great attention for objective evaluations. While deep learning models have been used to develop non-intrusive speech assessment methods with promising results, there is limited research on hearing-impaired subjects. This study proposes a multi-objective non-intrusive hearing-aid speech assessment model, called HASA-Net Large, which predicts speech quality and intelligibility scores based on input speech signals and specified hearing-loss patterns. Our experiments showed that the utilization of pre-trained SSL models leads to a significant boost in speech quality and intelligibility predictions compared to using spectrograms as input. Additionally, we examined three distinct fine-tuning approaches that resulted in further performance improvements. Furthermore, we demonstrated that incorporating SSL models resulted in greater transferability to an out-of-domain (OOD) dataset. Finally, this study introduces HASA-Net Large, which is a non-invasive approach for evaluating speech quality and intelligibility. HASA-Net Large utilizes raw waveforms and hearing-loss patterns to accurately predict speech quality and intelligibility levels for individuals with normal and impaired hearing and demonstrates superior prediction performance and transferability.
- Published
- 2023
8. AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency for Video Deepfake Detection
- Author
-
Shahzad, Sahibzada Adil, Hashmi, Ammarah, Peng, Yan-Tsung, Tsao, Yu, and Wang, Hsin-Min
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Artificial Intelligence ,Computer Science - Machine Learning ,Computer Science - Multimedia ,Computer Science - Sound ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Multimodal manipulations (also known as audio-visual deepfakes) make it difficult for unimodal deepfake detectors to detect forgeries in multimedia content. To avoid the spread of false propaganda and fake news, timely detection is crucial. The damage to either modality (i.e., visual or audio) can only be discovered through multi-modal models that can exploit both pieces of information simultaneously. Previous methods mainly adopt uni-modal video forensics and use supervised pre-training for forgery detection. This study proposes a new method based on a multi-modal self-supervised-learning (SSL) feature extractor to exploit inconsistency between audio and visual modalities for multi-modal video forgery detection. We use the transformer-based SSL pre-trained Audio-Visual HuBERT (AV-HuBERT) model as a visual and acoustic feature extractor and a multi-scale temporal convolutional neural network to capture the temporal correlation between the audio and visual modalities. Since AV-HuBERT only extracts visual features from the lip region, we also adopt another transformer-based video model to exploit facial features and capture spatial and temporal artifacts caused during the deepfake generation process. Experimental results show that our model outperforms all existing models and achieves new state-of-the-art performance on the FakeAVCeleb and DeepfakeTIMIT datasets.
- Published
- 2023
9. AVTENet: Audio-Visual Transformer-based Ensemble Network Exploiting Multiple Experts for Video Deepfake Detection
- Author
-
Hashmi, Ammarah, Shahzad, Sahibzada Adil, Lin, Chia-Wen, Tsao, Yu, and Wang, Hsin-Min
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Artificial Intelligence ,Computer Science - Machine Learning ,Computer Science - Multimedia ,Computer Science - Sound ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Forged content shared widely on social media platforms is a major social problem that requires increased regulation and poses new challenges to the research community. The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries. Most previous work on detecting AI-generated fake videos only utilizes visual modality or audio modality. While there are some methods in the literature that exploit audio and visual modalities to detect forged videos, they have not been comprehensively evaluated on multi-modal datasets of deepfake videos involving acoustic and visual manipulations. Moreover, these existing methods are mostly based on CNN and suffer from low detection accuracy. Inspired by the recent success of Transformer in various fields, to address the challenges posed by deepfake technology, in this paper, we propose an Audio-Visual Transformer-based Ensemble Network (AVTENet) framework that considers both acoustic manipulation and visual manipulation to achieve effective video forgery detection. Specifically, the proposed model integrates several purely transformer-based variants that capture video, audio, and audio-visual salient cues to reach a consensus in prediction. For evaluation, we use the recently released benchmark multi-modal audio-video FakeAVCeleb dataset. For a detailed analysis, we evaluate AVTENet, its variants, and several existing methods on multiple test sets of the FakeAVCeleb dataset. Experimental results show that our best model outperforms all existing methods and achieves state-of-the-art performance on Testset-I and Testset-II of the FakeAVCeleb dataset.
- Published
- 2023
10. The VoiceMOS Challenge 2023: Zero-shot Subjective Speech Quality Prediction for Multiple Domains
- Author
-
Cooper, Erica, Huang, Wen-Chin, Tsao, Yu, Wang, Hsin-Min, Toda, Tomoki, and Yamagishi, Junichi
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
We present the second edition of the VoiceMOS Challenge, a scientific event that aims to promote the study of automatic prediction of the mean opinion score (MOS) of synthesized and processed speech. This year, we emphasize real-world and challenging zero-shot out-of-domain MOS prediction with three tracks for three different voice evaluation scenarios. Ten teams from industry and academia in seven different countries participated. Surprisingly, we found that the two sub-tracks of French text-to-speech synthesis had large differences in their predictability, and that singing voice-converted samples were not as difficult to predict as we had expected. Use of diverse datasets and listener information during training appeared to be successful approaches., Comment: Accepted to ASRU 2023
- Published
- 2023
11. A Study on Incorporating Whisper for Robust Speech Assessment
- Author
-
Zezario, Ryandhimas E., Chen, Yu-Wen, Fu, Szu-Wei, Tsao, Yu, Wang, Hsin-Min, and Fuh, Chiou-Shann
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing ,Computer Science - Sound - Abstract
This research introduces an enhanced version of the multi-objective speech assessment model--MOSA-Net+, by leveraging the acoustic features from Whisper, a large-scale weakly supervised model. We first investigate the effectiveness of Whisper in deploying a more robust speech assessment model. After that, we explore combining representations from Whisper and SSL models. The experimental results reveal that Whisper's embedding features can contribute to more accurate prediction performance. Moreover, combining the embedding features from Whisper and SSL models only leads to marginal improvement. As compared to intrusive methods, MOSA-Net, and other SSL-based speech assessment models, MOSA-Net+ yields notable improvements in estimating subjective quality and intelligibility scores across all evaluation metrics on the Taiwan Mandarin Hearing In Noise test - Quality & Intelligibility (TMHINT-QI) dataset. To further validate its robustness, MOSA-Net+ was tested in the noisy-and-enhanced track of the VoiceMOS Challenge 2023, where it obtained the top-ranked performance among nine systems., Comment: Accepted to IEEE ICME 2024
- Published
- 2023
12. Deep Complex U-Net with Conformer for Audio-Visual Speech Enhancement
- Author
-
Ahmed, Shafique, Chen, Chia-Wei, Ren, Wenze, Li, Chin-Jou, Chu, Ernie, Chen, Jun-Cheng, Hussain, Amir, Wang, Hsin-Min, Tsao, Yu, and Hou, Jen-Cheng
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing ,Computer Science - Sound - Abstract
Recent studies have increasingly acknowledged the advantages of incorporating visual data into speech enhancement (SE) systems. In this paper, we introduce a novel audio-visual SE approach, termed DCUC-Net (deep complex U-Net with conformer network). The proposed DCUC-Net leverages complex domain features and a stack of conformer blocks. The encoder and decoder of DCUC-Net are designed using a complex U-Net-based framework. The audio and visual signals are processed using a complex encoder and a ResNet-18 model, respectively. These processed signals are then fused using the conformer blocks and transformed into enhanced speech waveforms via a complex decoder. The conformer blocks consist of a combination of self-attention mechanisms and convolutional operations, enabling DCUC-Net to effectively capture both global and local audio-visual dependencies. Our experimental results demonstrate the effectiveness of DCUC-Net, as it outperforms the baseline model from the COG-MHEAR AVSE Challenge 2023 by a notable margin of 0.14 in terms of PESQ. Additionally, the proposed DCUC-Net performs comparably to a state-of-the-art model and outperforms all other compared models on the Taiwan Mandarin speech with video (TMSV) dataset.
- Published
- 2023
13. Non-Intrusive Speech Intelligibility Prediction for Hearing Aids using Whisper and Metadata
- Author
-
Zezario, Ryandhimas E., Chen, Fei, Fuh, Chiou-Shann, Wang, Hsin-Min, and Tsao, Yu
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing ,Computer Science - Machine Learning ,Computer Science - Sound - Abstract
Automated speech intelligibility assessment is pivotal for hearing aid (HA) development. In this paper, we present three novel methods to improve intelligibility prediction accuracy and introduce MBI-Net+, an enhanced version of MBI-Net, the top-performing system in the 1st Clarity Prediction Challenge. MBI-Net+ leverages Whisper's embeddings to create cross-domain acoustic features and includes metadata from speech signals by using a classifier that distinguishes different enhancement methods. Furthermore, MBI-Net+ integrates the hearing-aid speech perception index (HASPI) as a supplementary metric into the objective function to further boost prediction performance. Experimental results demonstrate that MBI-Net+ surpasses several intrusive baseline systems and MBI-Net on the Clarity Prediction Challenge 2023 dataset, validating the effectiveness of incorporating Whisper embeddings, speech metadata, and related complementary metrics to improve prediction performance for HA., Comment: Accepted to Interspeech 2024
- Published
- 2023
14. Multi-Task Pseudo-Label Learning for Non-Intrusive Speech Quality Assessment Model
- Author
-
Zezario, Ryandhimas E., Bai, Bo-Ren Brian, Fuh, Chiou-Shann, Wang, Hsin-Min, and Tsao, Yu
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing ,Computer Science - Machine Learning ,Computer Science - Sound - Abstract
This study proposes a multi-task pseudo-label learning (MPL)-based non-intrusive speech quality assessment model called MTQ-Net. MPL consists of two stages: obtaining pseudo-label scores from a pretrained model and performing multi-task learning. The 3QUEST metrics, namely Speech-MOS (S-MOS), Noise-MOS (N-MOS), and General-MOS (G-MOS), are the assessment targets. The pretrained MOSA-Net model is utilized to estimate three pseudo labels: perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and speech distortion index (SDI). Multi-task learning is then employed to train MTQ-Net by combining a supervised loss (derived from the difference between the estimated score and the ground-truth label) and a semi-supervised loss (derived from the difference between the estimated score and the pseudo label), where the Huber loss is employed as the loss function. Experimental results first demonstrate the advantages of MPL compared to training a model from scratch and using a direct knowledge transfer mechanism. Second, the benefit of the Huber loss for improving the predictive ability of MTQ-Net is verified. Finally, the MTQ-Net with the MPL approach exhibits higher overall predictive power compared to other SSL-based speech assessment models., Comment: Accepted to IEEE ICASSP 2024
- Published
- 2023
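The entry above describes combining a supervised loss (against ground-truth 3QUEST scores) and a semi-supervised loss (against pseudo labels from a pretrained model), both using the Huber loss. A minimal PyTorch sketch of that combined objective follows; the mixing weight and delta value are illustrative hyperparameters rather than the paper's settings.

```python
import torch.nn.functional as F

def mpl_loss(pred_scores, ground_truth, pseudo_labels, beta=0.5, delta=1.0):
    """Multi-task pseudo-label objective: Huber loss against ground-truth scores
    plus a Huber loss against pseudo labels, weighted by `beta`."""
    supervised = F.huber_loss(pred_scores, ground_truth, delta=delta)
    semi_supervised = F.huber_loss(pred_scores, pseudo_labels, delta=delta)
    return supervised + beta * semi_supervised
```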
15. Mandarin Electrolaryngeal Speech Voice Conversion using Cross-domain Features
- Author
-
Chen, Hsin-Hao, Chien, Yung-Lun, Yen, Ming-Chi, Tsai, Shu-Wei, Tsao, Yu, Chi, Tai-shih, and Wang, Hsin-Min
- Subjects
Computer Science - Sound ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Patients who have had their entire larynx removed, including the vocal folds, owing to throat cancer may experience difficulties in speaking. In such cases, electrolarynx devices are often prescribed to produce speech, which is commonly referred to as electrolaryngeal speech (EL speech). However, the quality and intelligibility of EL speech are poor. To address this problem, EL voice conversion (ELVC) is a method used to improve the intelligibility and quality of EL speech. In this paper, we propose a novel ELVC system that incorporates cross-domain features, specifically spectral features and self-supervised learning (SSL) embeddings. The experimental results show that applying cross-domain features can notably improve the conversion performance for the ELVC task compared with utilizing only traditional spectral features., Comment: Accepted to INTERSPEECH 2023
- Published
- 2023
16. Audio-Visual Mandarin Electrolaryngeal Speech Voice Conversion
- Author
-
Chien, Yung-Lun, Chen, Hsin-Hao, Yen, Ming-Chi, Tsai, Shu-Wei, Wang, Hsin-Min, Tsao, Yu, and Chi, Tai-Shih
- Subjects
Computer Science - Sound ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Electrolarynx is a commonly used assistive device to help patients with removed vocal cords regain their ability to speak. Although the electrolarynx can generate excitation signals like the vocal cords, the naturalness and intelligibility of electrolaryngeal (EL) speech are very different from those of natural (NL) speech. Many deep-learning-based models have been applied to electrolaryngeal speech voice conversion (ELVC) for converting EL speech to NL speech. In this study, we propose a multimodal voice conversion (VC) model that integrates acoustic and visual information into a unified network. We compared different pre-trained models as visual feature extractors and evaluated the effectiveness of these features in the ELVC task. The experimental results demonstrate that the proposed multimodal VC model outperforms single-modal models in both objective and subjective metrics, suggesting that the integration of visual information can significantly improve the quality of ELVC., Comment: Accepted to INTERSPEECH 2023
- Published
- 2023
17. BASPRO: a balanced script producer for speech corpus collection based on the genetic algorithm
- Author
-
Chen, Yu-Wen, Wang, Hsin-Min, and Tsao, Yu
- Subjects
Computer Science - Neural and Evolutionary Computing ,Computer Science - Artificial Intelligence ,Computer Science - Computation and Language ,Computer Science - Machine Learning ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
The performance of speech-processing models is heavily influenced by the speech corpus that is used for training and evaluation. In this study, we propose the BAlanced Script PROducer (BASPRO) system, which can automatically construct a phonetically balanced and rich set of Chinese sentences for collecting Mandarin Chinese speech data. First, we used pretrained natural language processing systems to extract ten-character candidate sentences from a large corpus of Chinese news texts. Then, we applied a genetic algorithm-based method to select 20 phonetically balanced sentence sets, each containing 20 sentences, from the candidate sentences. Using BASPRO, we obtained a recording script called TMNews, which contains 400 ten-character sentences. TMNews covers 84% of the syllables used in the real world. Moreover, the syllable distribution has 0.96 cosine similarity to the real-world syllable distribution. We converted the script into a speech corpus using two text-to-speech systems. Using the designed speech corpus, we tested the performances of speech enhancement (SE) and automatic speech recognition (ASR), which are among the most important regression- and classification-based speech processing tasks, respectively. The experimental results show that the SE and ASR models trained on the designed speech corpus outperform their counterparts trained on a randomly composed speech corpus., Comment: Accepted by APSIPA Transactions on Signal and Information Processing
- Published
- 2022
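A central quantity in BASPRO, as described in the entry above, is how closely a candidate sentence set's syllable distribution matches the real-world distribution (reported as 0.96 cosine similarity). The sketch below shows a fitness function of that kind, the sort of objective a genetic algorithm could maximize; the data structures are illustrative assumptions, not BASPRO's actual implementation.

```python
import numpy as np

def syllable_fitness(selected_sentences, syllable_index, real_world_dist):
    """Cosine similarity between a candidate set's syllable distribution and a
    reference real-world distribution (higher is better).

    `selected_sentences` is a list of sentences, each a list of syllables;
    `syllable_index` maps a syllable to its position in the distribution vector.
    """
    counts = np.zeros(len(real_world_dist))
    for sentence in selected_sentences:
        for syllable in sentence:
            counts[syllable_index[syllable]] += 1
    dist = counts / (counts.sum() + 1e-8)
    denom = np.linalg.norm(dist) * np.linalg.norm(real_world_dist) + 1e-8
    return float(dist @ real_world_dist / denom)
```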
18. A Training and Inference Strategy Using Noisy and Enhanced Speech as Target for Speech Enhancement without Clean Speech
- Author
-
Chen, Li-Wei, Cheng, Yao-Fei, Lee, Hung-Shin, Tsao, Yu, and Wang, Hsin-Min
- Subjects
Computer Science - Sound ,Computer Science - Artificial Intelligence ,Computer Science - Computation and Language ,Computer Science - Machine Learning ,Computer Science - Multimedia ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
The lack of clean speech is a practical challenge to the development of speech enhancement systems, which means that there is an inevitable mismatch between their training criterion and evaluation metric. In response to this unfavorable situation, we propose a training and inference strategy that additionally uses enhanced speech as a target by improving the previously proposed noisy-target training (NyTT). Because homogeneity between in-domain noise and extraneous noise is the key to the effectiveness of NyTT, we train various student models by remixing 1) the teacher model's estimated speech and noise for enhanced-target training or 2) raw noisy speech and the teacher model's estimated noise for noisy-target training. Experimental results show that our proposed method outperforms several baselines, especially with the teacher/student inference, where predicted clean speech is derived successively through the teacher and final student models., Comment: Accepted by Interspeech 2023
- Published
- 2022
19. CasNet: Investigating Channel Robustness for Speech Separation
- Author
-
Wang, Fan-Lin, Cheng, Yao-Fei, Lee, Hung-Shin, Tsao, Yu, and Wang, Hsin-Min
- Subjects
Computer Science - Sound ,Computer Science - Artificial Intelligence ,Computer Science - Computation and Language ,Computer Science - Machine Learning ,Computer Science - Multimedia ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Recording channel mismatch between training and testing conditions has been shown to be a serious problem for speech separation. This situation greatly reduces the separation performance, and cannot meet the requirement of daily use. In this study, inheriting the use of our previously constructed TAT-2mix corpus, we address the channel mismatch problem by proposing a channel-aware audio separation network (CasNet), a deep learning framework for end-to-end time-domain speech separation. CasNet is implemented on top of TasNet. Channel embedding (characterizing channel information in a mixture of multiple utterances) generated by Channel Encoder is introduced into the separation module by the FiLM technique. Through two training strategies, we explore two roles that channel embedding may play: 1) a real-life noise disturbance, making the model more robust, or 2) a guide, instructing the separation model to retain the desired channel information. Experimental results on TAT-2mix show that CasNet trained with both training strategies outperforms the TasNet baseline, which does not use channel embeddings., Comment: Submitted to ICASSP 2023
- Published
- 2022
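CasNet, per the entry above, injects the channel embedding into the separation module via FiLM (feature-wise linear modulation). A generic FiLM layer is sketched below in PyTorch; the dimensions and tensor layout are assumptions for illustration rather than CasNet's exact configuration.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise linear modulation: an embedding produces per-channel scale
    and shift that are applied to the separator's feature maps."""

    def __init__(self, embed_dim: int, feature_dim: int):
        super().__init__()
        self.scale = nn.Linear(embed_dim, feature_dim)
        self.shift = nn.Linear(embed_dim, feature_dim)

    def forward(self, features: torch.Tensor, channel_emb: torch.Tensor) -> torch.Tensor:
        # features: (batch, feature_dim, time); channel_emb: (batch, embed_dim)
        gamma = self.scale(channel_emb).unsqueeze(-1)
        beta = self.shift(channel_emb).unsqueeze(-1)
        return gamma * features + beta
```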
20. Blind Clustering of Popular Music Recordings Based on Singer Voice Characteristics
- Author
-
Tsai, Wei-Ho, Rodgers, Dwight, and Wang, Hsin-Min
- Published
- 2004
21. Mandarin Singing Voice Synthesis with Denoising Diffusion Probabilistic Wasserstein GAN
- Author
-
Cho, Yin-Ping, Tsao, Yu, Wang, Hsin-Min, and Liu, Yi-Wen
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing ,Computer Science - Sound ,Electrical Engineering and Systems Science - Signal Processing - Abstract
Singing voice synthesis (SVS) is the computer production of a human-like singing voice from given musical scores. To accomplish end-to-end SVS effectively and efficiently, this work adopts the acoustic model-neural vocoder architecture established for high-quality speech and singing voice synthesis. Specifically, this work aims to pursue a higher level of expressiveness in synthesized voices by combining the denoising diffusion probabilistic model (DDPM) and the Wasserstein generative adversarial network (WGAN) to construct the backbone of the acoustic model. On top of the proposed acoustic model, a HiFi-GAN neural vocoder is adopted with integrated fine-tuning to ensure optimal synthesis quality for the resulting end-to-end SVS system. This end-to-end system was evaluated with the multi-singer Mpop600 Mandarin singing voice dataset. In the experiments, the proposed system exhibits improvements over previous landmark counterparts in terms of musical expressiveness and high-frequency acoustic details. Moreover, the adversarial acoustic model converged stably without the need to enforce reconstruction objectives, indicating the convergence stability of the proposed DDPM and WGAN combined architecture over alternative GAN-based SVS systems.
- Published
- 2022
22. NASTAR: Noise Adaptive Speech Enhancement with Target-Conditional Resampling
- Author
-
Lee, Chi-Chang, Hu, Cheng-Hung, Lin, Yu-Chen, Chen, Chu-Song, Wang, Hsin-Min, and Tsao, Yu
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing ,Computer Science - Machine Learning - Abstract
For deep learning-based speech enhancement (SE) systems, the training-test acoustic mismatch can cause notable performance degradation. To address the mismatch issue, numerous noise adaptation strategies have been derived. In this paper, we propose a novel method, called noise adaptive speech enhancement with target-conditional resampling (NASTAR), which reduces mismatches with only one sample (one-shot) of noisy speech in the target environment. NASTAR uses a feedback mechanism to simulate adaptive training data via a noise extractor and a retrieval model. The noise extractor estimates the target noise from the noisy speech, called pseudo-noise. The noise retrieval model retrieves relevant noise samples from a pool of noise signals according to the noisy speech, called relevant-cohort. The pseudo-noise and the relevant-cohort set are jointly sampled and mixed with the source speech corpus to prepare simulated training data for noise adaptation. Experimental results show that NASTAR can effectively use one noisy speech sample to adapt an SE model to a target condition. Moreover, both the noise extractor and the noise retrieval model contribute to model adaptation. To our best knowledge, NASTAR is the first work to perform one-shot noise adaptation through noise extraction and retrieval., Comment: Accepted to Interspeech 2022
- Published
- 2022
23. A Study of Using Cepstrogram for Countermeasure Against Replay Attacks
- Author
-
Lee, Shih-Kuang, Tsao, Yu, and Wang, Hsin-Min
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing ,Computer Science - Cryptography and Security ,Computer Science - Sound - Abstract
This study investigated the properties of the cepstrogram and demonstrated its effectiveness as a powerful countermeasure against replay attacks. A cepstrum analysis of replay attacks suggests that crucial information for anti-spoofing against replay attacks may be retained in the cepstrogram. When building countermeasures against replay attacks, experiments on the ASVspoof 2019 physical access database demonstrate that the cepstrogram is more effective than other features in both single and fusion systems. Our LCNN-based single and fusion systems with the cepstrogram feature outperformed the corresponding LCNN-based systems without the cepstrogram feature and several state-of-the-art single and fusion systems in the literature., Comment: Submitted to SLT 2022
- Published
- 2022
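The cepstrogram used in the entry above is the frame-wise cepstrum stacked over time, i.e. the DCT of each frame's log-magnitude spectrum. A minimal sketch follows; the window, frame length, and hop size are generic choices, not necessarily those used in the paper.

```python
import numpy as np
from scipy.fftpack import dct

def cepstrogram(wave, frame_len=512, hop=256):
    """Frame-wise cepstrum: DCT of the log-magnitude spectrum of each frame.

    Returns an array of shape (num_frames, num_quefrency_bins).
    """
    window = np.hanning(frame_len)
    frames = [wave[i:i + frame_len] * window
              for i in range(0, len(wave) - frame_len + 1, hop)]
    spectra = np.abs(np.fft.rfft(np.stack(frames), axis=1))
    return dct(np.log(spectra + 1e-8), axis=1, norm="ortho")
```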
24. MTI-Net: A Multi-Target Speech Intelligibility Prediction Model
- Author
-
Zezario, Ryandhimas E., Fu, Szu-wei, Chen, Fei, Fuh, Chiou-Shann, Wang, Hsin-Min, and Tsao, Yu
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing ,Computer Science - Machine Learning ,Computer Science - Sound - Abstract
Recently, deep learning (DL)-based non-intrusive speech assessment models have attracted great attention. Many studies report that these DL-based models yield satisfactory assessment performance and good flexibility, but their performance in unseen environments remains a challenge. Furthermore, compared to quality scores, fewer studies elaborate deep learning models to estimate intelligibility scores. This study proposes a multi-task speech intelligibility prediction model, called MTI-Net, for simultaneously predicting human and machine intelligibility measures. Specifically, given a speech utterance, MTI-Net is designed to predict human subjective listening test results and word error rate (WER) scores. We also investigate several methods that can improve the prediction performance of MTI-Net. First, we compare different features (including low-level features and embeddings from self-supervised learning (SSL) models) and prediction targets of MTI-Net. Second, we explore the effect of transfer learning and multi-task learning on training MTI-Net. Finally, we examine the potential advantages of fine-tuning SSL embeddings. Experimental results demonstrate the effectiveness of using cross-domain features, multi-task learning, and fine-tuning SSL embeddings. Furthermore, it is confirmed that the intelligibility and WER scores predicted by MTI-Net are highly correlated with the ground-truth scores., Comment: Accepted to Interspeech 2022
- Published
- 2022
25. MBI-Net: A Non-Intrusive Multi-Branched Speech Intelligibility Prediction Model for Hearing Aids
- Author
-
Zezario, Ryandhimas E., Chen, Fei, Fuh, Chiou-Shann, Wang, Hsin-Min, and Tsao, Yu
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing ,Computer Science - Machine Learning ,Computer Science - Sound - Abstract
Improving the user's hearing ability to understand speech in noisy environments is critical to the development of hearing aid (HA) devices. For this, it is important to derive a metric that can fairly predict speech intelligibility for HA users. A straightforward approach is to conduct a subjective listening test and use the test results as an evaluation metric. However, conducting large-scale listening tests is time-consuming and expensive. Therefore, several evaluation metrics were derived as surrogates for subjective listening test results. In this study, we propose a multi-branched speech intelligibility prediction model (MBI-Net), for predicting the subjective intelligibility scores of HA users. MBI-Net consists of two branches of models, with each branch consisting of a hearing loss model, a cross-domain feature extraction module, and a speech intelligibility prediction model, to process speech signals from one channel. The outputs of the two branches are fused through a linear layer to obtain predicted speech intelligibility scores. Experimental results confirm the effectiveness of MBI-Net, which produces higher prediction scores than the baseline system in Track 1 and Track 2 on the Clarity Prediction Challenge 2022 dataset., Comment: Accepted to Interspeech 2022
- Published
- 2022
26. Filter-based Discriminative Autoencoders for Children Speech Recognition
- Author
-
Tai, Chiang-Lin, Lee, Hung-Shin, Tsao, Yu, and Wang, Hsin-Min
- Subjects
Computer Science - Computation and Language ,Computer Science - Machine Learning ,Computer Science - Sound ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Children speech recognition is indispensable but challenging due to the diversity of children's speech. In this paper, we propose a filter-based discriminative autoencoder for acoustic modeling. To filter out the influence of various speaker types and pitches, auxiliary information of the speaker and pitch features is input into the encoder together with the acoustic features to generate phonetic embeddings. In the training phase, the decoder uses the auxiliary information and the phonetic embedding extracted by the encoder to reconstruct the input acoustic features. The autoencoder is trained by simultaneously minimizing the ASR loss and feature reconstruction error. The framework can make the phonetic embedding purer, resulting in more accurate senone (triphone-state) scores. Evaluated on the test set of the CMU Kids corpus, our system achieves a 7.8% relative WER reduction compared to the baseline system. In the domain adaptation experiment, our system also outperforms the baseline system on the British-accent PF-STAR task., Comment: Published in EUSIPCO 2022
- Published
- 2022
27. Disentangling the Impacts of Language and Channel Variability on Speech Separation Networks
- Author
-
Wang, Fan-Lin, Lee, Hung-Shin, Tsao, Yu, and Wang, Hsin-Min
- Subjects
Computer Science - Sound ,Computer Science - Computation and Language ,Computer Science - Machine Learning ,Computer Science - Multimedia ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Because the performance of speech separation is excellent for speech in which two speakers completely overlap, research attention has been shifted to dealing with more realistic scenarios. However, domain mismatch between training/test situations due to factors, such as speaker, content, channel, and environment, remains a severe problem for speech separation. Speaker and environment mismatches have been studied in the existing literature. Nevertheless, there are few studies on speech content and channel mismatches. Moreover, the impacts of language and channel in these studies are mostly tangled. In this study, we create several datasets for various experiments. The results show that the impacts of different languages are small enough to be ignored compared to the impacts of different channels. In our experiments, training on data recorded by Android phones leads to the best generalizability. Moreover, we provide a new solution for channel mismatch by evaluating projection, where the channel similarity can be measured and used to effectively select additional training data to improve the performance of in-the-wild test data., Comment: Published in Interspeech 2022
- Published
- 2022
28. Generation of Speaker Representations Using Heterogeneous Training Batch Assembly
- Author
-
Peng, Yu-Huai, Lee, Hung-Shin, Huang, Pin-Tuan, and Wang, Hsin-Min
- Subjects
Computer Science - Sound ,Computer Science - Artificial Intelligence ,Computer Science - Computation and Language ,Computer Science - Machine Learning ,Computer Science - Multimedia ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
In traditional speaker diarization systems, a well-trained speaker model is a key component to extract representations from consecutive and partially overlapping segments in a long speech session. To be more consistent with the back-end segmentation and clustering, we propose a new CNN-based speaker modeling scheme, which takes into account the heterogeneity of the speakers in each training segment and batch. We randomly and synthetically augment the training data into a set of segments, each of which contains more than one speaker and some overlapping parts. A soft label is imposed on each segment based on its speaker occupation ratio, and the standard cross entropy loss is implemented in model training. In this way, the speaker model should have the ability to generate a geometrically meaningful embedding for each multi-speaker segment. Experimental results show that our system is superior to the baseline system using x-vectors in two speaker diarization tasks. In the CALLHOME task trained on the NIST SRE and Switchboard datasets, our system achieves a relative reduction of 12.93% in DER. In Track 2 of CHiME-6, our system provides 13.24%, 12.60%, and 5.65% relative reductions in DER, JER, and WER, respectively., Comment: Published in APSIPA ASC 2021
- Published
- 2022
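The soft label described in the entry above is the speaker occupation ratio of each synthetically mixed segment, trained with the standard cross entropy. A minimal PyTorch sketch of that loss is given below; it assumes the ratios are already normalized to sum to one per segment, which is an illustrative simplification.

```python
import torch
import torch.nn.functional as F

def soft_label_loss(logits: torch.Tensor, occupation_ratios: torch.Tensor) -> torch.Tensor:
    """Cross entropy against a soft label built from per-speaker occupation ratios.

    logits: (batch, num_speakers); occupation_ratios: (batch, num_speakers),
    each row summing to 1.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    return -(occupation_ratios * log_probs).sum(dim=-1).mean()
```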
29. Multi-target Extractor and Detector for Unknown-number Speaker Diarization
- Author
-
Cheng, Chin-Yi, Lee, Hung-Shin, Tsao, Yu, and Wang, Hsin-Min
- Subjects
Computer Science - Sound ,Computer Science - Multimedia ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Strong representations of target speakers can help extract important information about speakers and detect corresponding temporal regions in multi-speaker conversations. In this study, we propose a neural architecture that simultaneously extracts speaker representations consistent with the speaker diarization objective and detects the presence of each speaker on a frame-by-frame basis regardless of the number of speakers in a conversation. A speaker representation (called z-vector) extractor and a time-speaker contextualizer, implemented by a residual network and processing data in both temporal and speaker dimensions, are integrated into a unified framework. Tests on the CALLHOME corpus show that our model outperforms most of the methods proposed so far. Evaluations in a more challenging case with simultaneous speakers ranging from 2 to 7 show that our model achieves 6.4% to 30.9% relative diarization error rate reductions over several typical baselines., Comment: Accepted by IEEE Signal Processing Letters
- Published
- 2022
- Full Text
30. Subspace-based Representation and Learning for Phonotactic Spoken Language Recognition
- Author
-
Lee, Hung-Shin, Tsao, Yu, Jeng, Shyh-Kang, and Wang, Hsin-Min
- Subjects
Computer Science - Sound ,Computer Science - Computation and Language ,Computer Science - Machine Learning ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Phonotactic constraints can be employed to distinguish languages by representing a speech utterance as a multinomial distribution of phone events. In the present study, we propose a new learning mechanism based on subspace-based representation, which can extract concealed phonotactic structures from utterances, for language verification and dialect/accent identification. The framework mainly involves two successive parts. The first part involves subspace construction. Specifically, it decodes each utterance into a sequence of vectors filled with phone-posteriors and transforms the vector sequence into a linear orthogonal subspace based on low-rank matrix factorization or dynamic linear modeling. The second part involves subspace learning based on kernel machines, such as support vector machines and the newly developed subspace-based neural networks (SNNs). The input layer of SNNs is specifically designed for the sample represented by subspaces. The topology ensures that the same output can be derived from identical subspaces by modifying the conventional feed-forward pass to fit the mathematical definition of subspace similarity. Evaluated on the "General LR" test of NIST LRE 2007, the proposed method achieved up to 52%, 46%, 56%, and 27% relative reductions in equal error rates over the sequence-based PPR-LM, PPR-VSM, and PPR-IVEC methods and the lattice-based PPR-LM method, respectively. Furthermore, on the dialect/accent identification task of NIST LRE 2009, the SNN-based system performed better than the aforementioned four baseline methods., Comment: Published in IEEE/ACM Trans. Audio, Speech, Lang. Process., 2020, vol. 28, pp. 3065-3079
- Published
- 2022
31. Speech-enhanced and Noise-aware Networks for Robust Speech Recognition
- Author
-
Lee, Hung-Shin, Chen, Pin-Yuan, Cheng, Yao-Fei, Tsao, Yu, and Wang, Hsin-Min
- Subjects
Computer Science - Sound ,Computer Science - Artificial Intelligence ,Computer Science - Computation and Language ,Computer Science - Machine Learning ,Computer Science - Multimedia ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Compensation for channel mismatch and noise interference is essential for robust automatic speech recognition. Enhanced speech has been introduced into the multi-condition training of acoustic models to improve their generalization ability. In this paper, a noise-aware training framework based on two cascaded neural structures is proposed to jointly optimize speech enhancement and speech recognition. The feature enhancement module is composed of a multi-task autoencoder, where noisy speech is decomposed into clean speech and noise. By concatenating its enhanced, noise-aware, and noisy features for each frame, the acoustic-modeling module maps each feature-augmented frame into a triphone state by optimizing the lattice-free maximum mutual information and cross entropy between the predicted and actual state sequences. On top of the factorized time delay neural network (TDNN-F) and its convolutional variant (CNN-TDNNF), both with SpecAug, the two proposed systems achieve word error rate (WER) of 3.90% and 3.55%, respectively, on the Aurora-4 task. Compared with the best existing systems that use bigram and trigram language models for decoding, the proposed CNN-TDNNF-based system achieves a relative WER reduction of 15.20% and 33.53%, respectively. In addition, the proposed CNN-TDNNF-based system also outperforms the baseline CNN-TDNNF system on the AMI task., Comment: Published in ISCSLP 2022
- Published
- 2022
32. Chain-based Discriminative Autoencoders for Speech Recognition
- Author
-
Lee, Hung-Shin, Huang, Pin-Tuan, Cheng, Yao-Fei, and Wang, Hsin-Min
- Subjects
Computer Science - Sound ,Computer Science - Artificial Intelligence ,Computer Science - Computation and Language ,Computer Science - Machine Learning ,Computer Science - Multimedia ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
In our previous work, we proposed a discriminative autoencoder (DcAE) for speech recognition. DcAE combines two training schemes into one. First, since DcAE aims to learn encoder-decoder mappings, the squared error between the reconstructed speech and the input speech is minimized. Second, in the code layer, frame-based phonetic embeddings are obtained by minimizing the categorical cross-entropy between ground truth labels and predicted triphone-state scores. DcAE is developed based on the Kaldi toolkit by treating various TDNN models as encoders. In this paper, we further propose three new versions of DcAE. First, a new objective function that considers both categorical cross-entropy and mutual information between ground truth and predicted triphone-state sequences is used. The resulting DcAE is called a chain-based DcAE (c-DcAE). For application to robust speech recognition, we further extend c-DcAE to hierarchical and parallel structures, resulting in hc-DcAE and pc-DcAE. In these two models, both the error between the reconstructed noisy speech and the input noisy speech and the error between the enhanced speech and the reference clean speech are taken into the objective function. Experimental results on the WSJ and Aurora-4 corpora show that our DcAE models outperform baseline systems., Comment: Published in Interspeech 2022
- Published
- 2022
33. The VoiceMOS Challenge 2022
- Author
-
Huang, Wen-Chin, Cooper, Erica, Tsao, Yu, Wang, Hsin-Min, Toda, Tomoki, and Yamagishi, Junichi
- Subjects
Computer Science - Sound ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
We present the first edition of the VoiceMOS Challenge, a scientific event that aims to promote the study of automatic prediction of the mean opinion score (MOS) of synthetic speech. This challenge drew 22 participating teams from academia and industry who tried a variety of approaches to tackle the problem of predicting human ratings of synthesized speech. The listening test data for the main track of the challenge consisted of samples from 187 different text-to-speech and voice conversion systems spanning over a decade of research, and the out-of-domain track consisted of data from more recent systems rated in a separate listening test. Results of the challenge show the effectiveness of fine-tuning self-supervised speech models for the MOS prediction task, as well as the difficulty of predicting MOS ratings for unseen speakers and listeners, and for unseen systems in the out-of-domain setting., Comment: Accepted to Interspeech 2022
- Published
- 2022
34. Partially Fake Audio Detection by Self-attention-based Fake Span Discovery
- Author
-
Wu, Haibin, Kuo, Heng-Cheng, Zheng, Naijun, Hung, Kuo-Hsuan, Lee, Hung-Yi, Tsao, Yu, Wang, Hsin-Min, and Meng, Helen
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing ,Computer Science - Machine Learning ,Computer Science - Sound - Abstract
The past few years have witnessed significant advances in speech synthesis and voice conversion technologies. However, such technologies can undermine the robustness of broadly implemented biometric identification models and can be harnessed by in-the-wild attackers for illegal uses. The ASVspoof challenge mainly focuses on synthesized audios by advanced speech synthesis and voice conversion models, and replay attacks. Recently, the first Audio Deep Synthesis Detection challenge (ADD 2022) extended the attack scenarios into more aspects. ADD 2022 is also the first challenge to propose the partially fake audio detection task. Such brand-new attacks are dangerous, and how to tackle them remains an open question. Thus, we propose a novel framework by introducing the question-answering (fake span discovery) strategy with the self-attention mechanism to detect partially fake audios. The proposed fake span detection module tasks the anti-spoofing model to predict the start and end positions of the fake clip within the partially fake audio, directs the model's attention toward discovering the fake spans rather than other shortcuts with less generalization, and finally equips the model with the discrimination capacity between real and partially fake audios. Our submission ranked second in the partially fake audio detection track of ADD 2022., Comment: Submitted to ICASSP 2022
- Published
- 2022
35. EMGSE: Acoustic/EMG Fusion for Multimodal Speech Enhancement
- Author
-
Wang, Kuan-Chen, Liu, Kai-Chun, Wang, Hsin-Min, and Tsao, Yu
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing ,Computer Science - Machine Learning ,Computer Science - Sound ,Quantitative Biology - Quantitative Methods - Abstract
Multimodal learning has been proven to be an effective method to improve speech enhancement (SE) performance, especially in challenging situations such as low signal-to-noise ratios, speech noise, or unseen noise types. In previous studies, several types of auxiliary data have been used to construct multimodal SE systems, such as lip images, electropalatography, or electromagnetic midsagittal articulography. In this paper, we propose a novel EMGSE framework for multimodal SE, which integrates audio and facial electromyography (EMG) signals. Facial EMG is a biological signal containing articulatory movement information, which can be measured in a non-invasive way. Experimental results show that the proposed EMGSE system can achieve better performance than the audio-only SE system. The benefits of fusing EMG signals with acoustic signals for SE are notable under challenging circumstances. Furthermore, this study reveals that cheek EMG is sufficient for SE., Comment: 5 pages, 4 figures, and 3 tables
- Published
- 2022
36. HASA-net: A non-intrusive hearing-aid speech assessment network
- Author
-
Chiang, Hsin-Tien, Wu, Yi-Chiao, Yu, Cheng, Toda, Tomoki, Wang, Hsin-Min, Hu, Yih-Chun, and Tsao, Yu
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing ,Computer Science - Artificial Intelligence ,Computer Science - Sound - Abstract
Without the need for a clean reference, non-intrusive speech assessment methods have caught great attention for objective evaluations. Recently, deep neural network (DNN) models have been applied to build non-intrusive speech assessment approaches and have been confirmed to provide promising performance. However, most DNN-based approaches are designed for normal-hearing listeners without considering hearing-loss factors. In this study, we propose a DNN-based hearing aid speech assessment network (HASA-Net), formed by a bidirectional long short-term memory (BLSTM) model, to predict speech quality and intelligibility scores simultaneously according to input speech signals and specified hearing-loss patterns. To the best of our knowledge, HASA-Net is the first work to incorporate quality and intelligibility assessments utilizing a unified DNN-based non-intrusive model for hearing aids. Experimental results show that the predicted speech quality and intelligibility scores of HASA-Net are highly correlated to two well-known intrusive hearing-aid evaluation metrics, hearing aid speech quality index (HASQI) and hearing aid speech perception index (HASPI), respectively.
- Published
- 2021
37. Deep Learning-based Non-Intrusive Multi-Objective Speech Assessment Model with Cross-Domain Features
- Author
-
Zezario, Ryandhimas E., Fu, Szu-Wei, Chen, Fei, Fuh, Chiou-Shann, Wang, Hsin-Min, and Tsao, Yu
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing ,Computer Science - Machine Learning ,Computer Science - Sound - Abstract
In this study, we propose a cross-domain multi-objective speech assessment model called MOSA-Net, which can estimate multiple speech assessment metrics simultaneously. Experimental results show that MOSA-Net can improve the linear correlation coefficient (LCC) by 0.026 (0.990 vs 0.964 in seen noise environments) and 0.012 (0.969 vs 0.957 in unseen noise environments) in PESQ prediction, compared to Quality-Net, an existing single-task model for PESQ prediction, and improve LCC by 0.021 (0.985 vs 0.964 in seen noise environments) and 0.047 (0.836 vs 0.789 in unseen noise environments) in STOI prediction, compared to STOI-Net (based on CRNN), an existing single-task model for STOI prediction. Moreover, MOSA-Net, originally trained to assess objective scores, can be used as a pre-trained model to be effectively adapted to an assessment model for predicting subjective quality and intelligibility scores with a limited amount of training data. Experimental results show that MOSA-Net can improve LCC by 0.018 (0.805 vs 0.787) in MOS prediction, compared to MOS-SSL, a strong single-task model for MOS prediction. In light of the confirmed prediction capability, we further adopt the latent representations of MOSA-Net to guide the speech enhancement (SE) process and derive a quality-intelligibility (QI)-aware SE (QIA-SE) approach accordingly. Experimental results show that QIA-SE provides superior enhancement performance compared with the baseline SE system in terms of objective evaluation metrics and qualitative evaluation test. For example, QIA-SE can improve PESQ by 0.301 (2.953 vs 2.652 in seen noise environments) and 0.18 (2.658 vs 2.478 in unseen noise environments) over a CNN-based baseline SE model.
- Published
- 2021
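The improvements quoted above are differences in the linear correlation coefficient between predicted and ground-truth metric scores. A minimal sketch of that comparison is shown below; the score arrays are placeholders, not data from the paper.

```python
# Computing the linear correlation coefficient (LCC) between predicted and
# true metric scores, as used to compare assessment models. The arrays are
# placeholders, not data from the paper.
import numpy as np
from scipy.stats import pearsonr

true_pesq = np.array([1.8, 2.4, 3.1, 3.6, 2.9])
pred_pesq = np.array([1.9, 2.3, 3.0, 3.5, 3.1])

lcc, _ = pearsonr(true_pesq, pred_pesq)
print(f"LCC: {lcc:.3f}")
```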
38. Speech Enhancement-assisted Voice Conversion in Noisy Environments
- Author
-
Chan, Yun-Ju, Peng, Chiang-Jen, Wang, Syu-Siang, Wang, Hsin-Min, Tsao, Yu, and Chi, Tai-Shih
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing ,Computer Science - Sound - Abstract
Numerous voice conversion (VC) techniques have been proposed for the conversion of voices among different speakers. Although the converted speech is of good quality when VC is applied in a clean environment, the quality degrades drastically when the system is run in noisy conditions. To address this issue, we propose a novel speech enhancement (SE)-assisted VC system that utilizes SE techniques for signal pre-processing, where the VC and SE components are optimized with a joint training strategy that aims to provide high-quality converted speech signals. We adopt a popular model, StarGAN, as the VC component and thus call the combined system EStarGAN. We test the proposed EStarGAN system using a Mandarin speech corpus. The experimental results first verify the effectiveness of the joint training strategy used in EStarGAN. Moreover, EStarGAN demonstrates robust performance in various unseen noisy environments. The subjective listening test results further show that EStarGAN can improve the sound quality of speech signals converted from noise-corrupted source utterances. A minimal sketch of the joint-training idea follows this entry.
- Published
- 2021
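A minimal sketch of the joint-training idea: an SE front-end and a VC model are cascaded and updated together with a weighted sum of their losses. The stand-in modules, loss terms, and weighting below are illustrative assumptions only (the actual system uses StarGAN and adversarial objectives).

```python
# Joint training of a speech enhancement (SE) front-end and a voice
# conversion (VC) model, sketched with toy modules. The architectures and
# the loss weighting are illustrative assumptions only.
import torch
import torch.nn as nn

se_model = nn.Sequential(nn.Linear(257, 257))           # stand-in SE network
vc_model = nn.Sequential(nn.Linear(257, 257))           # stand-in VC network
optimizer = torch.optim.Adam(list(se_model.parameters()) +
                             list(vc_model.parameters()), lr=1e-4)

noisy_src = torch.randn(4, 100, 257)                    # noisy source spectra
clean_src = torch.randn(4, 100, 257)                    # clean source spectra
target_spk = torch.randn(4, 100, 257)                   # target-speaker spectra

enhanced = se_model(noisy_src)
converted = vc_model(enhanced)

se_loss = nn.functional.l1_loss(enhanced, clean_src)    # SE reconstruction term
vc_loss = nn.functional.l1_loss(converted, target_spk)  # VC term (GAN losses omitted)
loss = vc_loss + 0.5 * se_loss                          # jointly optimize both parts

optimizer.zero_grad()
loss.backward()
optimizer.step()
```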
39. Time Alignment using Lip Images for Frame-based Electrolaryngeal Voice Conversion
- Author
-
Liou, Yi-Syuan, Huang, Wen-Chin, Yen, Ming-Chi, Tsai, Shu-Wei, Peng, Yu-Huai, Toda, Tomoki, Tsao, Yu, and Wang, Hsin-Min
- Subjects
Computer Science - Sound ,Computer Science - Computation and Language ,Computer Science - Computer Vision and Pattern Recognition ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Voice conversion (VC) is an effective approach to electrolaryngeal (EL) speech enhancement, a task that aims to improve the quality of the artificial voice from an electrolarynx device. In frame-based VC methods, time alignment needs to be performed prior to model training, and the dynamic time warping (DTW) algorithm is widely adopted to compute the best time alignment between each utterance pair. The validity of this approach rests on the assumption that the same phonemes produced by different speakers have similar features and can be mapped by measuring a pre-defined distance between speech frames of the source and the target. However, the special characteristics of EL speech can break this assumption, resulting in a sub-optimal DTW alignment. In this work, we propose to use lip images for time alignment, as we assume that the lip movements of laryngectomees remain normal compared with those of healthy people. We investigate two naive lip representations and distance metrics, and experimental results demonstrate that the proposed method can significantly outperform the audio-only alignment in terms of objective and subjective evaluations. A plain DTW sketch follows this entry., Comment: Accepted to APSIPA ASC 2021
- Published
- 2021
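A minimal DTW sketch over two sequences of per-frame lip features using Euclidean frame distance. How the lip features are extracted (e.g., flattened grayscale lip crops) and their dimensionality are assumptions, not the paper's exact representation.

```python
# Plain dynamic time warping (DTW) over two sequences of per-frame lip
# features, using Euclidean frame distance. The lip-feature extraction
# (e.g., flattened grayscale lip crops) is an assumption here.
import numpy as np

def dtw_path(src_feats, tgt_feats):
    n, m = len(src_feats), len(tgt_feats)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(src_feats[i - 1] - tgt_feats[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack to recover the frame-level alignment.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return cost[n, m], path[::-1]

if __name__ == "__main__":
    src = np.random.rand(40, 64)   # 40 frames of 64-dim lip features
    tgt = np.random.rand(50, 64)
    total_cost, path = dtw_path(src, tgt)
    print(total_cost, len(path))
```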
40. SurpriseNet: Melody Harmonization Conditioning on User-controlled Surprise Contours
- Author
-
Chen, Yi-Wei, Lee, Hung-Shin, Chen, Yen-Hsing, and Wang, Hsin-Min
- Subjects
Computer Science - Sound ,Computer Science - Multimedia ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
The surprisingness of a song is an essential and seemingly subjective factor in determining whether the listener likes it. With the help of information theory, it can be described in terms of the transition probabilities of a music sequence modeled as a Markov chain. In this study, we introduce the concept of deriving entropy variations over time, so that the surprise contour of each chord sequence can be extracted. Based on this, we propose a user-controllable framework that uses a conditional variational autoencoder (CVAE) to harmonize the melody based on the given chord surprise indication. Through explicit conditioning, the model can randomly generate varied and harmonious chord progressions for a melody, and Spearman's correlation and its p-value show that the resulting chord progressions match the given surprise contour quite well. The vanilla CVAE model was evaluated in a basic melody harmonization task (no surprise control) in terms of six objective metrics. The results of experiments on the Hooktheory Lead Sheet Dataset show that our model achieves performance comparable to the state-of-the-art melody harmonization model. A minimal surprisal-contour sketch follows this entry., Comment: Proceedings of the 22nd International Society for Music Information Retrieval Conference, ISMIR 2021
- Published
- 2021
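A minimal sketch of deriving a surprise contour from a chord sequence: estimate Markov transition probabilities from a corpus of chord sequences and take the surprisal (negative log probability) of each observed transition. The toy corpus, the first-order Markov assumption, and the smoothing are illustrative choices, not the paper's setup.

```python
# Deriving a surprise contour from a chord sequence: fit first-order Markov
# transition probabilities on a toy corpus, then take the surprisal
# (-log2 p) of each observed transition. Corpus and smoothing are toy choices.
import math
from collections import Counter, defaultdict

corpus = [
    ["C", "G", "Am", "F", "C", "G", "F", "C"],
    ["C", "F", "G", "C", "Am", "F", "G", "C"],
]

# Count chord-to-chord transitions.
transitions = defaultdict(Counter)
for seq in corpus:
    for prev, nxt in zip(seq, seq[1:]):
        transitions[prev][nxt] += 1

def surprise_contour(seq, alpha=1.0):
    """Surprisal of each transition in seq, with add-alpha smoothing."""
    vocab = set(c for s in corpus for c in s) | set(seq)
    contour = []
    for prev, nxt in zip(seq, seq[1:]):
        counts = transitions[prev]
        p = (counts[nxt] + alpha) / (sum(counts.values()) + alpha * len(vocab))
        contour.append(-math.log2(p))
    return contour

print(surprise_contour(["C", "G", "Am", "Bdim", "C"]))
```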
41. SVSNet: An End-to-end Speaker Voice Similarity Assessment Model
- Author
-
Hu, Cheng-Hung, Peng, Yu-Huai, Yamagishi, Junichi, Tsao, Yu, and Wang, Hsin-Min
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing ,Computer Science - Machine Learning ,Computer Science - Sound - Abstract
Neural evaluation metrics derived for numerous speech generation tasks have recently attracted great attention. In this paper, we propose SVSNet, the first end-to-end neural network model to assess the speaker voice similarity between converted speech and natural speech for voice conversion tasks. Unlike most neural evaluation metrics that use hand-crafted features, SVSNet directly takes the raw waveform as input to more completely utilize speech information for prediction. SVSNet consists of encoder, co-attention, distance calculation, and prediction modules and is trained in an end-to-end manner. The experimental results on the Voice Conversion Challenge 2018 and 2020 (VCC2018 and VCC2020) datasets show that SVSNet outperforms well-known baseline systems in the assessment of speaker similarity at the utterance and system levels., Comment: To appear in IEEE Signal Processing Letters (SPL)
- Published
- 2021
42. Dual-Path Filter Network: Speaker-Aware Modeling for Speech Separation
- Author
-
Wang, Fan-Lin, Peng, Yu-Huai, Lee, Hung-Shin, and Wang, Hsin-Min
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Speech separation has been extensively studied in recent years to deal with the cocktail party problem. Related approaches can be divided into two categories: time-frequency domain methods and time domain methods. In addition, some methods try to generate speaker vectors to support source separation. In this study, we propose a new model called dual-path filter network (DPFN). Our model focuses on the post-processing of speech separation to improve separation performance. DPFN is composed of two parts: the speaker module and the separation module. First, the speaker module infers the identities of the speakers. Then, the separation module uses the speakers' information to extract the voices of individual speakers from the mixture. DPFN, constructed based on DPRNN-TasNet, is not only superior to DPRNN-TasNet but also avoids the problem of permutation-invariant training (PIT)., Comment: Accepted by Interspeech2021
- Published
- 2021
43. Relational Data Selection for Data Augmentation of Speaker-dependent Multi-band MelGAN Vocoder
- Author
-
Wu, Yi-Chiao, Hu, Cheng-Hung, Lee, Hung-Shin, Peng, Yu-Huai, Huang, Wen-Chin, Tsao, Yu, Wang, Hsin-Min, and Toda, Tomoki
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Nowadays, neural vocoders can generate very high-fidelity speech when a large amount of training data is available. Although a speaker-dependent (SD) vocoder usually outperforms a speaker-independent (SI) vocoder, it is impractical to collect a large amount of data from a specific target speaker for most real-world applications. To tackle the problem of limited target data, this paper proposes a data augmentation method based on speaker representations and the similarity measurement used in speaker verification. The proposed method selects utterances whose speaker identity is similar to that of the target speaker from an external corpus, and then combines the selected utterances with the limited target data for SD vocoder adaptation. The evaluation results show that, compared with the vocoder adapted using only limited target data, the vocoder adapted using augmented data improves both the quality and similarity of synthesized speech. A minimal similarity-based selection sketch follows this entry., Comment: 5 pages, 1 figure, 3 tables, Proc. Interspeech, 2021
- Published
- 2021
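A minimal sketch of similarity-based data selection: rank external-corpus utterances by cosine similarity between their speaker embeddings and the target speaker's average embedding, and keep the most similar ones. The embeddings below are random placeholders; in practice they would come from a speaker-verification model, and the embedding size and top-k value are assumptions.

```python
# Selecting external utterances whose speaker embeddings are most similar
# to the target speaker, for vocoder adaptation. Embeddings are random
# placeholders; in practice they would come from a speaker-verification model.
import numpy as np

rng = np.random.default_rng(0)
target_embs = rng.normal(size=(20, 192))      # embeddings of limited target data
external_embs = rng.normal(size=(1000, 192))  # embeddings of an external corpus

def select_similar(target_embs, external_embs, top_k=100):
    centroid = target_embs.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    ext = external_embs / np.linalg.norm(external_embs, axis=1, keepdims=True)
    sims = ext @ centroid                     # cosine similarity to the target centroid
    return np.argsort(sims)[::-1][:top_k]     # indices of the most similar utterances

selected = select_similar(target_embs, external_embs)
print(selected[:10])
```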
44. A Preliminary Study of a Two-Stage Paradigm for Preserving Speaker Identity in Dysarthric Voice Conversion
- Author
-
Huang, Wen-Chin, Kobayashi, Kazuhiro, Peng, Yu-Huai, Liu, Ching-Feng, Tsao, Yu, Wang, Hsin-Min, and Toda, Tomoki
- Subjects
Computer Science - Sound ,Computer Science - Computation and Language ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
We propose a new paradigm for maintaining speaker identity in dysarthric voice conversion (DVC). The poor quality of dysarthric speech can be greatly improved by statistical VC, but because the normal speech utterances of a dysarthric patient are nearly impossible to collect, previous work failed to recover the individuality of the patient. In light of this, we suggest a novel, two-stage approach for DVC, which is highly flexible in that no normal speech of the patient is required. First, a powerful parallel sequence-to-sequence model converts the input dysarthric speech into the normal speech of a reference speaker as an intermediate product. Then, a nonparallel, frame-wise VC model realized with a variational autoencoder converts the speaker identity of the reference speech back to that of the patient, and is assumed to be capable of preserving the enhanced quality. We investigate several design options. Experimental evaluation results demonstrate the potential of our approach to improve the quality of dysarthric speech while maintaining the speaker identity., Comment: Accepted to Interspeech 2021. 5 pages, 3 figures, 1 table
- Published
- 2021
45. Sequence to General Tree: Knowledge-Guided Geometry Word Problem Solving
- Author
-
Tsai, Shih-hung, Liang, Chao-Chun, Wang, Hsin-Min, and Su, Keh-Yih
- Subjects
Computer Science - Artificial Intelligence - Abstract
With the recent advancements in deep learning, neural solvers have achieved promising results in solving math word problems. However, these SOTA solvers only generate binary expression trees that contain basic arithmetic operators and do not explicitly use math formulas. As a result, the expression trees they produce are lengthy and uninterpretable because they need multiple operators and constants to represent a single formula. In this paper, we propose sequence-to-general tree (S2G), which learns to generate interpretable and executable operation trees whose nodes can be formulas with an arbitrary number of arguments. With nodes now allowed to be formulas, S2G can learn to incorporate mathematical domain knowledge into problem-solving, making the results more interpretable. Experiments show that S2G achieves better performance than strong baselines on problems that require domain knowledge. A minimal operation-tree sketch follows this entry., Comment: ACL2021
- Published
- 2021
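A minimal sketch of the kind of operation tree S2G targets: nodes may be formulas with an arbitrary number of arguments rather than only binary operators, so a tree can be both shorter and directly executable. The formula table and the example tree are illustrative, not the paper's generated output.

```python
# A general operation tree whose nodes can be formulas with any number of
# arguments, not just binary operators. The formula table and the example
# tree are illustrative, not the paper's generated output.
import math

FORMULAS = {
    "add": lambda *xs: sum(xs),
    "mul": lambda *xs: math.prod(xs),
    "circle_area": lambda r: math.pi * r * r,            # one-argument formula node
    "cuboid_volume": lambda l, w, h: l * w * h,          # three-argument formula node
}

class Node:
    def __init__(self, op, children=None, value=None):
        self.op, self.children, self.value = op, children or [], value

    def evaluate(self):
        if self.op == "const":
            return self.value
        args = [c.evaluate() for c in self.children]
        return FORMULAS[self.op](*args)

# "The volume of a 2 x 3 x 4 box plus the area of a circle with radius 1."
tree = Node("add", [
    Node("cuboid_volume", [Node("const", value=v) for v in (2, 3, 4)]),
    Node("circle_area", [Node("const", value=1)]),
])
print(tree.evaluate())  # 24 + pi
```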
46. AlloST: Low-resource Speech Translation without Source Transcription
- Author
-
Cheng, Yao-Fei, Lee, Hung-Shin, and Wang, Hsin-Min
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence ,Computer Science - Machine Learning ,Computer Science - Multimedia ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
The end-to-end architecture has made promising progress in speech translation (ST). However, the ST task is still challenging under low-resource conditions. Most ST models have shown unsatisfactory results, especially in the absence of word information from the source speech utterance. In this study, we survey methods to improve ST performance without using source transcription, and propose a learning framework that utilizes a language-independent universal phone recognizer. The framework is based on an attention-based sequence-to-sequence model, where the encoder generates phonetic embeddings and phone-aware acoustic representations, and the decoder controls the fusion of the two embedding streams to produce the target token sequence. In addition to investigating different fusion strategies, we explore the specific usage of byte pair encoding (BPE), which compresses a phone sequence into a syllable-like segmented sequence. Owing to this symbol conversion, a segmented sequence represents not only pronunciation but also language-dependent information that is lacking in phones. Experiments conducted on the Fisher Spanish-English and Taigi-Mandarin drama corpora show that our method outperforms the conformer-based baseline, and its performance is close to that of the existing best method using source transcription. A minimal sketch of applying BPE to phone sequences follows this entry., Comment: Accepted by Interspeech2021
- Published
- 2021
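A minimal sketch of learning byte-pair-encoding merges directly over phone sequences, so that frequent phone pairs collapse into syllable-like units. The toy phone transcriptions and the number of merges are illustrative assumptions.

```python
# Learning BPE merges over phone sequences so frequent phone pairs collapse
# into syllable-like units. The toy phone transcriptions and the number of
# merges are illustrative assumptions.
from collections import Counter

def learn_bpe(sequences, num_merges=10):
    seqs = [list(s) for s in sequences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        merged = "+".join(best)
        new_seqs = []
        for seq in seqs:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(merged)   # apply the merge greedily left to right
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return merges, seqs

phones = [
    ["n", "i", "h", "au", "m", "a"],      # toy phone sequences
    ["n", "i", "m", "en", "h", "au"],
    ["h", "au", "p", "eng", "iou"],
]
merges, segmented = learn_bpe(phones, num_merges=4)
print(merges)
print(segmented)
```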
47. The AS-NU System for the M2VoC Challenge
- Author
-
Hu, Cheng-Hung, Wu, Yi-Chiao, Huang, Wen-Chin, Peng, Yu-Huai, Chen, Yu-Wen, Ku, Pin-Jui, Toda, Tomoki, Tsao, Yu, and Wang, Hsin-Min
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing ,Computer Science - Machine Learning ,Computer Science - Sound - Abstract
This paper describes the AS-NU systems for two tracks of the Multi-Speaker Multi-Style Voice Cloning Challenge (M2VoC). The first track uses a relatively small set of 100 target utterances for voice cloning, while the second track uses only 5 target utterances. Due to the serious lack of data in the second track, we selected the speaker most similar to the target speaker from the training data of the TTS system, and used that speaker's utterances together with the given 5 target utterances to fine-tune our model. The evaluation results show that our systems on the two tracks perform similarly in terms of quality, but there is still a clear gap between the similarity scores of the second track and the first track.
- Published
- 2021
48. Speech Recognition by Simply Fine-tuning BERT
- Author
-
Huang, Wen-Chin, Wu, Chia-Hua, Luo, Shang-Bao, Chen, Kuan-Yu, Wang, Hsin-Min, and Toda, Tomoki
- Subjects
Computer Science - Sound ,Computer Science - Computation and Language ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
We propose a simple method for automatic speech recognition (ASR) by fine-tuning BERT, a language model (LM) trained on large-scale unlabeled text data that can generate rich contextual representations. Our assumption is that, given a history context sequence, a powerful LM can narrow the range of possible choices, so the speech signal can be used as a simple clue. Hence, compared to conventional ASR systems that train a powerful acoustic model (AM) from scratch, we believe that speech recognition is possible by simply fine-tuning a BERT model. As an initial study, we demonstrate the effectiveness of the proposed idea on the AISHELL dataset and show that stacking a very simple AM on top of BERT can yield reasonable performance. A minimal sketch of this idea follows this entry., Comment: Accepted to ICASSP 2021
- Published
- 2021
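A minimal sketch of the idea of placing a very simple acoustic module in front of a pre-trained BERT: acoustic features are projected into BERT's embedding space, fed through the model via `inputs_embeds`, and a token classifier is fine-tuned on top. The projection design, the assumption that acoustic features are already pooled to one vector per output token, and all dimensions are illustrative; the sketch uses the Hugging Face `transformers` API.

```python
# Sketch of fine-tuning BERT for ASR by projecting acoustic features into
# BERT's embedding space (via inputs_embeds) and adding a token classifier.
# The projection, pooling, and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import BertModel

class ToyBertASR(nn.Module):
    def __init__(self, n_acoustic=80, vocab_size=4000,
                 bert_name="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        hidden = self.bert.config.hidden_size
        # A very simple "acoustic model": map per-token acoustic features
        # into the space BERT expects as input embeddings.
        self.acoustic_proj = nn.Linear(n_acoustic, hidden)
        self.classifier = nn.Linear(hidden, vocab_size)

    def forward(self, acoustic_feats):
        # acoustic_feats: (batch, tokens, n_acoustic), e.g., filterbank
        # features already pooled to one vector per output token.
        embeds = self.acoustic_proj(acoustic_feats)
        out = self.bert(inputs_embeds=embeds).last_hidden_state
        return self.classifier(out)           # (batch, tokens, vocab_size)

if __name__ == "__main__":
    model = ToyBertASR()
    logits = model(torch.randn(2, 16, 80))
    print(logits.shape)
```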
49. Speech Enhancement with Zero-Shot Model Selection
- Author
-
Zezario, Ryandhimas E., Fuh, Chiou-Shann, Wang, Hsin-Min, and Tsao, Yu
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing ,Computer Science - Machine Learning ,Computer Science - Sound - Abstract
Recent research on speech enhancement (SE) has seen the emergence of deep-learning-based methods. It remains challenging to determine effective ways to increase the generalizability of SE under diverse test conditions. In this study, we combine zero-shot learning and ensemble learning to propose a zero-shot model selection (ZMOS) approach that increases the generalization of SE performance. The proposed approach is realized in offline and online phases. The offline phase clusters the entire set of training data into multiple subsets and trains a specialized SE model (termed a component SE model) on each subset. The online phase selects the most suitable component SE model to perform the enhancement. Furthermore, two selection strategies are developed: selection based on the quality score (QS) and selection based on the quality embedding (QE). Both QS and QE are obtained using Quality-Net, a non-intrusive quality assessment network. Experimental results confirm that the proposed ZMOS approach achieves better performance in both seen and unseen noise types compared to the baseline systems and other model selection systems, which indicates the effectiveness of the proposed approach in providing robust SE performance. A minimal cluster-and-select sketch follows this entry., Comment: Accepted in EUSIPCO 2021
- Published
- 2020
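A minimal sketch of the offline/online split: cluster training utterances by their quality embeddings, train one component SE model per cluster (training itself omitted), and at inference route an utterance to the component whose cluster centroid is nearest to its embedding. The embedding source, dimensions, and KMeans settings are assumptions; in the paper the embeddings come from Quality-Net.

```python
# Zero-shot model selection sketch: cluster training utterances by quality
# embeddings (offline) and route a test utterance to the component model of
# the nearest cluster (online). Embeddings here are random placeholders;
# in the paper they come from Quality-Net.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
train_qe = rng.normal(size=(5000, 64))   # quality embeddings of training data

# Offline phase: partition the training set; one SE model per cluster
# would then be trained on each partition (training itself omitted here).
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(train_qe)

def select_component(test_qe):
    """Online phase: pick the component SE model index for one test utterance."""
    return int(kmeans.predict(test_qe.reshape(1, -1))[0])

print(select_component(rng.normal(size=64)))   # index of the chosen SE model
```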
50. STOI-Net: A Deep Learning based Non-Intrusive Speech Intelligibility Assessment Model
- Author
-
Zezario, Ryandhimas E., Fu, Szu-Wei, Fuh, Chiou-Shann, Tsao, Yu, and Wang, Hsin-Min
- Subjects
Computer Science - Sound ,Computer Science - Machine Learning ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
The calculation of most objective speech intelligibility assessment metrics requires clean speech as a reference. Such a requirement may limit the applicability of these metrics in real-world scenarios. To overcome this limitation, we propose a deep learning-based non-intrusive speech intelligibility assessment model, namely STOI-Net. The input and output of STOI-Net are speech spectral features and predicted STOI scores, respectively. The model combines a convolutional neural network and bidirectional long short-term memory (CNN-BLSTM) architecture with a multiplicative attention mechanism. Experimental results show that the STOI score estimated by STOI-Net correlates well with the actual STOI score when tested with noisy and enhanced speech utterances. The correlation values are 0.97 and 0.83, respectively, for the seen test condition (the test speakers and noise types are included in the training set) and the unseen test condition (the test speakers and noise types are not included in the training set). The results confirm the capability of STOI-Net to accurately predict STOI scores without referring to clean speech. A minimal sketch of the attention-pooled assessor follows this entry., Comment: Accepted in APSIPA 2020
- Published
- 2020
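A minimal sketch of a CNN-BLSTM assessor with multiplicative (dot-product) attention pooling over frame-level representations, in the spirit of STOI-Net. The single conv layer, layer sizes, and pooling details are illustrative assumptions, not the published architecture.

```python
# Sketch of a CNN-BLSTM assessor with multiplicative attention pooling,
# in the spirit of STOI-Net. Layer sizes and the single conv layer are
# illustrative assumptions, not the published architecture.
import torch
import torch.nn as nn

class ToyStoiNet(nn.Module):
    def __init__(self, n_spec=257, hidden=128):
        super().__init__()
        self.conv = nn.Conv1d(n_spec, hidden, kernel_size=3, padding=1)
        self.blstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.query = nn.Parameter(torch.randn(2 * hidden))  # learned attention query
        self.frame_score = nn.Linear(2 * hidden, 1)

    def forward(self, spec):
        # spec: (batch, frames, n_spec)
        h = self.conv(spec.transpose(1, 2)).transpose(1, 2)
        h, _ = self.blstm(h)                               # (batch, frames, 2*hidden)
        # Multiplicative (dot-product) attention over frames.
        weights = torch.softmax(h @ self.query, dim=1)     # (batch, frames)
        frame_scores = self.frame_score(h).squeeze(-1)     # (batch, frames)
        return (weights * frame_scores).sum(dim=1)         # utterance-level STOI estimate

if __name__ == "__main__":
    model = ToyStoiNet()
    print(model(torch.randn(2, 120, 257)).shape)  # torch.Size([2])
```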