146 results on "Wang, Hsin-Min"
Search Results
2. Robust Audio-Visual Speech Enhancement: Correcting Misassignments in Complex Environments with Advanced Post-Processing
- Author
Ren, Wenze, Hung, Kuo-Hsuan, Chao, Rong, Li, YouJin, Wang, Hsin-Min, and Tsao, Yu
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
- Abstract
This paper addresses the prevalent issue of incorrect speech output in audio-visual speech enhancement (AVSE) systems, which is often caused by poor video quality and mismatched training and test data. We introduce a post-processing classifier (PPC) to rectify these erroneous outputs, ensuring that the enhanced speech corresponds accurately to the intended speaker. We also adopt a mixup strategy in PPC training to improve its robustness. Experimental results on the AVSE-challenge dataset show that integrating PPC into the AVSE model can significantly improve AVSE performance, and combining PPC with the AVSE model trained with permutation invariant training (PIT) yields the best performance. The proposed method outperforms the baseline model by a large margin. This work highlights the potential for broader applications across various modalities and architectures, providing a promising direction for future research in this field., Comment: The 27th International Conference of the Oriental COCOSDA
- Published
- 2024
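The mixup strategy mentioned in the abstract of entry 2 is a standard augmentation technique; the sketch below shows one way it could be applied to PPC training data. The tensor shapes, label format, and function names are illustrative assumptions, not details taken from the paper.

```python
import torch

def mixup_batch(features: torch.Tensor, labels: torch.Tensor, alpha: float = 0.2):
    """Classic mixup: convex-combine random pairs of training examples
    and their (one-hot) labels.

    features: (batch, ...) input tensor, e.g. log-mel spectrograms
    labels:   (batch, num_classes) one-hot targets for the classifier
    """
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(features.size(0))
    mixed_x = lam * features + (1.0 - lam) * features[perm]
    mixed_y = lam * labels + (1.0 - lam) * labels[perm]
    return mixed_x, mixed_y

# Hypothetical usage inside a PPC training loop:
# x, y = mixup_batch(x, y); loss = loss_fn(ppc_model(x), y)
```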
3. Channel-Aware Domain-Adaptive Generative Adversarial Network for Robust Speech Recognition
- Author
Wang, Chien-Chun, Chen, Li-Wei, Chou, Cheng-Kang, Lee, Hung-Shin, Chen, Berlin, and Wang, Hsin-Min
- Subjects
Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
While pre-trained automatic speech recognition (ASR) systems demonstrate impressive performance on matched domains, their performance often degrades when confronted with channel mismatch stemming from unseen recording environments and conditions. To mitigate this issue, we propose a novel channel-aware data simulation method for robust ASR training. Our method harnesses the synergistic power of channel-extractive techniques and generative adversarial networks (GANs). We first train a channel encoder capable of extracting embeddings from arbitrary audio. On top of this, channel embeddings are extracted using a minimal amount of target-domain data and used to guide a GAN-based speech synthesizer. This synthesizer generates speech that faithfully preserves the phonetic content of the input while mimicking the channel characteristics of the target domain. We evaluate our method on the challenging Hakka Across Taiwan (HAT) and Taiwanese Across Taiwan (TAT) corpora, achieving relative character error rate (CER) reductions of 20.02% and 9.64%, respectively, compared to the baselines. These results highlight the efficacy of our channel-aware data simulation method for bridging the gap between source- and target-domain acoustics., Comment: Submitted to ICASSP 2025
- Published
- 2024
4. Leveraging Joint Spectral and Spatial Learning with MAMBA for Multichannel Speech Enhancement
- Author
Ren, Wenze, Wu, Haibin, Lin, Yi-Cheng, Chen, Xuanjun, Chao, Rong, Hung, Kuo-Hsuan, Li, You-Jin, Ting, Wen-Yuan, Wang, Hsin-Min, and Tsao, Yu
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
- Abstract
In multichannel speech enhancement, effectively capturing spatial and spectral information across different microphones is crucial for noise reduction. Traditional methods, such as CNN or LSTM, attempt to model the temporal dynamics of full-band and sub-band spectral and spatial features. However, these approaches face limitations in fully modeling complex temporal dependencies, especially in dynamic acoustic environments. To overcome these challenges, we modify the current advanced model McNet by introducing an improved version of Mamba, a state-space model, and further propose MCMamba. MCMamba has been completely reengineered to integrate full-band and narrow-band spatial information with sub-band and full-band spectral features, providing a more comprehensive approach to modeling spatial and spectral information. Our experimental results demonstrate that MCMamba significantly improves the modeling of spatial and spectral features in multichannel speech enhancement, outperforming McNet and achieving state-of-the-art performance on the CHiME-3 dataset. Additionally, we find that Mamba performs exceptionally well in modeling spectral information.
- Published
- 2024
5. A Study on Zero-shot Non-intrusive Speech Assessment using Large Language Models
- Author
Zezario, Ryandhimas E., Siniscalchi, Sabato M., Wang, Hsin-Min, and Tsao, Yu
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
- Abstract
This work investigates two strategies for zero-shot non-intrusive speech assessment leveraging large language models. First, we explore the audio analysis capabilities of GPT-4o. Second, we propose GPT-Whisper, which uses Whisper as an audio-to-text module and evaluates the naturalness of text via targeted prompt engineering. We evaluate the assessment metrics predicted by GPT-4o and GPT-Whisper, examining their correlations with human-based quality and intelligibility assessments, and with the character error rate (CER) of automatic speech recognition. Experimental results show that GPT-4o alone is not effective for audio analysis, whereas GPT-Whisper demonstrates higher prediction performance, showing moderate correlation with speech quality and intelligibility, and high correlation with CER. Compared to supervised non-intrusive neural speech assessment models, namely MOS-SSL and MTI-Net, GPT-Whisper yields a notably higher Spearman's rank correlation with the CER of Whisper. These findings validate GPT-Whisper as a reliable method for accurate zero-shot speech assessment without requiring additional training data (speech data and corresponding assessment scores).
- Published
- 2024
6. Exploring the Impact of Data Quantity on ASR in Extremely Low-resource Languages
- Author
Cheng, Yao-Fei, Chen, Li-Wei, Lee, Hung-Shin, and Wang, Hsin-Min
- Subjects
Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
This study investigates the efficacy of data augmentation techniques for low-resource automatic speech recognition (ASR), focusing on two endangered Austronesian languages, Amis and Seediq. Recognizing the potential of self-supervised learning (SSL) in low-resource settings, we explore the impact of data volume on the continued pre-training of SSL models. We propose a novel data-selection scheme leveraging a multilingual corpus to augment the limited target language data. This scheme utilizes a language classifier to extract utterance embeddings and employs one-class classifiers to identify utterances phonetically and phonologically proximate to the target languages. Utterances are ranked and selected based on their decision scores, ensuring the inclusion of highly relevant data in the SSL-ASR pipeline. Our experimental results demonstrate the effectiveness of this approach, yielding substantial improvements in ASR performance for both Amis and Seediq. These findings underscore the feasibility and promise of data augmentation through cross-lingual transfer learning for low-resource language ASR.
- Published
- 2024
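Entry 6 describes ranking multilingual utterances with one-class classifiers over language-classifier embeddings and selecting the most target-like data. The following is a minimal sketch of such a selection step using scikit-learn's OneClassSVM; the embedding dimensions, classifier settings, and function names are assumptions for illustration only.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def select_augmentation_utterances(target_embs: np.ndarray,
                                   candidate_embs: np.ndarray,
                                   top_k: int = 1000):
    """Rank candidate utterances by how close they are to the target
    language, using a one-class classifier fitted on target embeddings."""
    occ = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
    occ.fit(target_embs)                            # learn the target-language region
    scores = occ.decision_function(candidate_embs)  # higher = more target-like
    ranked = np.argsort(-scores)                    # sort candidates by score, descending
    return ranked[:top_k], scores[ranked[:top_k]]

# Toy example with random vectors standing in for language-classifier embeddings:
rng = np.random.default_rng(0)
target = rng.normal(size=(200, 64))
candidates = rng.normal(size=(5000, 64))
idx, sc = select_augmentation_utterances(target, candidates, top_k=10)
print(idx, sc)
```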
7. The VoiceMOS Challenge 2024: Beyond Speech Quality Prediction
- Author
Huang, Wen-Chin, Fu, Szu-Wei, Cooper, Erica, Zezario, Ryandhimas E., Toda, Tomoki, Wang, Hsin-Min, Yamagishi, Junichi, and Tsao, Yu
- Subjects
Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
We present the third edition of the VoiceMOS Challenge, a scientific initiative designed to advance research into automatic prediction of human speech ratings. There were three tracks. The first track was on predicting the quality of "zoomed-in" high-quality samples from speech synthesis systems. The second track was to predict ratings of samples from singing voice synthesis and voice conversion with a large variety of systems, listeners, and languages. The third track was semi-supervised quality prediction for noisy, clean, and enhanced speech, where a very small amount of labeled training data was provided. Among the eight teams from both academia and industry, we found that many were able to outperform the baseline systems. Successful techniques included retrieval-based methods and the use of non-self-supervised representations like spectrograms and pitch histograms. These results showed that the challenge has advanced the field of subjective speech rating prediction., Comment: Accepted to SLT2024
- Published
- 2024
8. Effective Noise-aware Data Simulation for Domain-adaptive Speech Enhancement Leveraging Dynamic Stochastic Perturbation
- Author
Wang, Chien-Chun, Chen, Li-Wei, Lee, Hung-Shin, Chen, Berlin, and Wang, Hsin-Min
- Subjects
Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
Cross-domain speech enhancement (SE) is often faced with severe challenges due to the scarcity of noise and background information in an unseen target domain, leading to a mismatch between training and test conditions. This study puts forward a novel data simulation method to address this issue, leveraging noise-extractive techniques and generative adversarial networks (GANs) with only limited target noisy speech data. Notably, our method employs a noise encoder to extract noise embeddings from target-domain data. These embeddings aptly guide the generator to synthesize utterances acoustically fitted to the target domain while authentically preserving the phonetic content of the input clean speech. Furthermore, we introduce the notion of dynamic stochastic perturbation, which can inject controlled perturbations into the noise embeddings during inference, thereby enabling the model to generalize well to unseen noise conditions. Experiments on the VoiceBank-DEMAND benchmark dataset demonstrate that our domain-adaptive SE method outperforms an existing strong baseline based on data simulation., Comment: Accepted to IEEE SLT 2024
- Published
- 2024
9. SVSNet+: Enhancing Speaker Voice Similarity Assessment Models with Representations from Speech Foundation Models
- Author
Yin, Chun, Chi, Tai-Shih, Tsao, Yu, and Wang, Hsin-Min
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound
- Abstract
Representations from pre-trained speech foundation models (SFMs) have shown impressive performance in many downstream tasks. However, the potential benefits of incorporating pre-trained SFM representations into speaker voice similarity assessment have not been thoroughly investigated. In this paper, we propose SVSNet+, a model that integrates pre-trained SFM representations to improve performance in assessing speaker voice similarity. Experimental results on the Voice Conversion Challenge 2018 and 2020 datasets show that SVSNet+ incorporating WavLM representations shows significant improvements compared to baseline models. In addition, while fine-tuning WavLM with a small dataset of the downstream task does not improve performance, using the same dataset to learn a weighted-sum representation of WavLM can substantially improve performance. Furthermore, when WavLM is replaced by other SFMs, SVSNet+ still outperforms the baseline models and exhibits strong generalization ability., Comment: Accepted to INTERSPEECH 2024
- Published
- 2024
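The "weighted-sum representation" of WavLM mentioned in entry 9 is commonly implemented as a learnable softmax-weighted combination of a frozen model's hidden layers. A minimal sketch, with illustrative layer counts and shapes, follows; it is not the authors' code.

```python
import torch
import torch.nn as nn

class WeightedLayerSum(nn.Module):
    """Learnable weighted sum over the hidden layers of a frozen speech
    foundation model (e.g., WavLM). Only the scalar layer weights are trained."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_outputs: torch.Tensor) -> torch.Tensor:
        # layer_outputs: (num_layers, batch, time, dim)
        norm_w = torch.softmax(self.weights, dim=0)
        return (norm_w.view(-1, 1, 1, 1) * layer_outputs).sum(dim=0)

# Toy example: 13 hidden layers, batch of 2, 50 frames, 768-dim features.
ws = WeightedLayerSum(num_layers=13)
fake_layers = torch.randn(13, 2, 50, 768)
fused = ws(fake_layers)
print(fused.shape)  # torch.Size([2, 50, 768])
```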
10. Unmasking Illusions: Understanding Human Perception of Audiovisual Deepfakes
- Author
Hashmi, Ammarah, Shahzad, Sahibzada Adil, Lin, Chia-Wen, Tsao, Yu, and Wang, Hsin-Min
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computers and Society, Computer Science - Machine Learning, Computer Science - Multimedia
- Abstract
The emergence of contemporary deepfakes has attracted significant attention in machine learning research, as artificial intelligence (AI) generated synthetic media increases the incidence of misinterpretation and is difficult to distinguish from genuine content. Currently, machine learning techniques have been extensively studied for automatically detecting deepfakes. However, human perception has been less explored. Malicious deepfakes could ultimately cause public and social problems. Can we humans correctly perceive the authenticity of the content of the videos we watch? The answer is obviously uncertain; therefore, this paper aims to evaluate the human ability to discern deepfake videos through a subjective study. We present our findings by comparing human observers to five state-of-the-art audiovisual deepfake detection models. To this end, we used gamification concepts to provide 110 participants (55 native English speakers and 55 non-native English speakers) with a web-based platform where they could access a series of 40 videos (20 real and 20 fake) to determine their authenticity. Each participant performed the experiment twice with the same 40 videos in different random orders. The videos were manually selected from the FakeAVCeleb dataset. We found that all AI models performed better than humans when evaluated on the same 40 videos. The study also reveals that while deception is not impossible, humans tend to overestimate their detection capabilities. Our experimental results may help benchmark human versus machine performance, advance forensics analysis, and enable adaptive countermeasures.
- Published
- 2024
11. SpeechCLIP+: Self-supervised multi-task representation learning for speech via CLIP and speech-image data
- Author
Wang, Hsuan-Fu, Shih, Yi-Jen, Chang, Heng-Jui, Berry, Layne, Peng, Puyuan, Lee, Hung-yi, Wang, Hsin-Min, and Harwath, David
- Subjects
Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
The recently proposed visually grounded speech model SpeechCLIP is an innovative framework that bridges speech and text through images via CLIP without relying on text transcription. On this basis, this paper introduces two extensions to SpeechCLIP. First, we apply the Continuous Integrate-and-Fire (CIF) module to replace a fixed number of CLS tokens in the cascaded architecture. Second, we propose a new hybrid architecture that merges the cascaded and parallel architectures of SpeechCLIP into a multi-task learning framework. Our experimental evaluation is performed on the Flickr8k and SpokenCOCO datasets. The results show that in the speech keyword extraction task, the CIF-based cascaded SpeechCLIP model outperforms the previous cascaded SpeechCLIP model using a fixed number of CLS tokens. Furthermore, through our hybrid architecture, cascaded task learning boosts the performance of the parallel branch in image-speech retrieval tasks., Comment: Accepted to ICASSP 2024, Self-supervision in Audio, Speech, and Beyond (SASB) workshop
- Published
- 2024
12. HAAQI-Net: A Non-intrusive Neural Music Audio Quality Assessment Model for Hearing Aids
- Author
Wisnu, Dyah A. M. G., Rini, Stefano, Zezario, Ryandhimas E., Wang, Hsin-Min, and Tsao, Yu
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound
- Abstract
This paper introduces HAAQI-Net, a non-intrusive deep learning model for music audio quality assessment tailored for hearing aid users. Unlike traditional methods like the Hearing Aid Audio Quality Index (HAAQI), which rely on intrusive comparisons to a reference signal, HAAQI-Net offers a more accessible and efficient alternative. Using a bidirectional Long Short-Term Memory (BLSTM) architecture with attention mechanisms and features from the pre-trained BEATs model, HAAQI-Net predicts HAAQI scores directly from music audio clips and hearing loss patterns. Results show HAAQI-Net's effectiveness, with predicted scores achieving a Linear Correlation Coefficient (LCC) of 0.9368, a Spearman's Rank Correlation Coefficient (SRCC) of 0.9486, and a Mean Squared Error (MSE) of 0.0064, reducing inference time from 62.52 seconds to 2.54 seconds. Although effective, feature extraction via the large BEATs model incurs computational overhead. To address this, a knowledge distillation strategy creates a student distillBEATs model, distilling information from the teacher BEATs model during HAAQI-Net training, reducing required parameters. The distilled HAAQI-Net maintains strong performance with an LCC of 0.9071, an SRCC of 0.9307, and an MSE of 0.0091, while reducing parameters by 75.85% and inference time by 96.46%. This reduction enhances HAAQI-Net's efficiency and scalability, making it viable for real-world music audio quality assessment in hearing aid settings. This work also opens avenues for further research into optimizing deep learning models for specific applications, contributing to audio signal processing and quality assessment by providing insights into developing efficient and accurate models for practical applications in hearing aid technology.
- Published
- 2024
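Entry 12 mentions distilling the large BEATs feature extractor into a smaller student (distillBEATs). The sketch below shows a generic feature-matching distillation step under assumed dimensions; the layer sizes, loss choice, and names are hypothetical rather than taken from the paper.

```python
import torch
import torch.nn as nn

class DistillBEATsStudent(nn.Module):
    """Tiny student network trained to mimic frame-level teacher features.
    Layer sizes are illustrative only."""

    def __init__(self, in_dim: int = 128, teacher_dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, teacher_dim),
        )

    def forward(self, x):   # x: (batch, time, in_dim), e.g. mel features
        return self.net(x)

def distillation_loss(student_feats, teacher_feats):
    """Feature-matching (L2) distillation between student and frozen teacher."""
    return nn.functional.mse_loss(student_feats, teacher_feats)

# One illustrative step: in practice the teacher features would come from frozen BEATs.
student = DistillBEATsStudent()
mel = torch.randn(4, 100, 128)
teacher_feats = torch.randn(4, 100, 768)   # placeholder for BEATs outputs
loss = distillation_loss(student(mel), teacher_feats)
loss.backward()
```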
13. Multi-objective Non-intrusive Hearing-aid Speech Assessment Model
- Author
Chiang, Hsin-Tien, Fu, Szu-Wei, Wang, Hsin-Min, Tsao, Yu, and Hansen, John H. L.
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
- Abstract
Without the need for a clean reference, non-intrusive speech assessment methods have caught great attention for objective evaluations. While deep learning models have been used to develop non-intrusive speech assessment methods with promising results, there is limited research on hearing-impaired subjects. This study proposes a multi-objective non-intrusive hearing-aid speech assessment model, called HASA-Net Large, which predicts speech quality and intelligibility scores based on input speech signals and specified hearing-loss patterns. Our experiments showed that the utilization of pre-trained SSL models leads to a significant boost in speech quality and intelligibility predictions compared to using spectrograms as input. Additionally, we examined three distinct fine-tuning approaches that resulted in further performance improvements. Furthermore, we demonstrated that incorporating SSL models resulted in greater transferability to an out-of-domain (OOD) dataset. Finally, this study introduces HASA-Net Large, which is a non-invasive approach for evaluating speech quality and intelligibility. HASA-Net Large utilizes raw waveforms and hearing-loss patterns to accurately predict speech quality and intelligibility levels for individuals with normal and impaired hearing and demonstrates superior prediction performance and transferability.
- Published
- 2023
14. AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency for Video Deepfake Detection
- Author
Shahzad, Sahibzada Adil, Hashmi, Ammarah, Peng, Yan-Tsung, Tsao, Yu, and Wang, Hsin-Min
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Multimedia, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
Multimodal manipulations (also known as audio-visual deepfakes) make it difficult for unimodal deepfake detectors to detect forgeries in multimedia content. To avoid the spread of false propaganda and fake news, timely detection is crucial. The damage to either modality (i.e., visual or audio) can only be discovered through multi-modal models that can exploit both pieces of information simultaneously. Previous methods mainly adopt uni-modal video forensics and use supervised pre-training for forgery detection. This study proposes a new method based on a multi-modal self-supervised-learning (SSL) feature extractor to exploit inconsistency between audio and visual modalities for multi-modal video forgery detection. We use the transformer-based SSL pre-trained Audio-Visual HuBERT (AV-HuBERT) model as a visual and acoustic feature extractor and a multi-scale temporal convolutional neural network to capture the temporal correlation between the audio and visual modalities. Since AV-HuBERT only extracts visual features from the lip region, we also adopt another transformer-based video model to exploit facial features and capture spatial and temporal artifacts caused during the deepfake generation process. Experimental results show that our model outperforms all existing models and achieves new state-of-the-art performance on the FakeAVCeleb and DeepfakeTIMIT datasets.
- Published
- 2023
15. AVTENet: Audio-Visual Transformer-based Ensemble Network Exploiting Multiple Experts for Video Deepfake Detection
- Author
Hashmi, Ammarah, Shahzad, Sahibzada Adil, Lin, Chia-Wen, Tsao, Yu, and Wang, Hsin-Min
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Multimedia, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
Forged content shared widely on social media platforms is a major social problem that requires increased regulation and poses new challenges to the research community. The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries. Most previous work on detecting AI-generated fake videos only utilizes visual modality or audio modality. While there are some methods in the literature that exploit audio and visual modalities to detect forged videos, they have not been comprehensively evaluated on multi-modal datasets of deepfake videos involving acoustic and visual manipulations. Moreover, these existing methods are mostly based on CNN and suffer from low detection accuracy. Inspired by the recent success of Transformer in various fields, to address the challenges posed by deepfake technology, in this paper, we propose an Audio-Visual Transformer-based Ensemble Network (AVTENet) framework that considers both acoustic manipulation and visual manipulation to achieve effective video forgery detection. Specifically, the proposed model integrates several purely transformer-based variants that capture video, audio, and audio-visual salient cues to reach a consensus in prediction. For evaluation, we use the recently released benchmark multi-modal audio-video FakeAVCeleb dataset. For a detailed analysis, we evaluate AVTENet, its variants, and several existing methods on multiple test sets of the FakeAVCeleb dataset. Experimental results show that our best model outperforms all existing methods and achieves state-of-the-art performance on Testset-I and Testset-II of the FakeAVCeleb dataset.
- Published
- 2023
16. The VoiceMOS Challenge 2023: Zero-shot Subjective Speech Quality Prediction for Multiple Domains
- Author
Cooper, Erica, Huang, Wen-Chin, Tsao, Yu, Wang, Hsin-Min, Toda, Tomoki, and Yamagishi, Junichi
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
We present the second edition of the VoiceMOS Challenge, a scientific event that aims to promote the study of automatic prediction of the mean opinion score (MOS) of synthesized and processed speech. This year, we emphasize real-world and challenging zero-shot out-of-domain MOS prediction with three tracks for three different voice evaluation scenarios. Ten teams from industry and academia in seven different countries participated. Surprisingly, we found that the two sub-tracks of French text-to-speech synthesis had large differences in their predictability, and that singing voice-converted samples were not as difficult to predict as we had expected. Use of diverse datasets and listener information during training appeared to be successful approaches., Comment: Accepted to ASRU 2023
- Published
- 2023
17. A Study on Incorporating Whisper for Robust Speech Assessment
- Author
Zezario, Ryandhimas E., Chen, Yu-Wen, Fu, Szu-Wei, Tsao, Yu, Wang, Hsin-Min, and Fuh, Chiou-Shann
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
- Abstract
This research introduces an enhanced version of the multi-objective speech assessment model, MOSA-Net+, by leveraging the acoustic features from Whisper, a large-scale weakly supervised model. We first investigate the effectiveness of Whisper in deploying a more robust speech assessment model. After that, we explore combining representations from Whisper and SSL models. The experimental results reveal that Whisper's embedding features can contribute to more accurate prediction performance. Moreover, combining the embedding features from Whisper and SSL models only leads to marginal improvement. As compared to intrusive methods, MOSA-Net, and other SSL-based speech assessment models, MOSA-Net+ yields notable improvements in estimating subjective quality and intelligibility scores across all evaluation metrics on the Taiwan Mandarin Hearing In Noise test - Quality & Intelligibility (TMHINT-QI) dataset. To further validate its robustness, MOSA-Net+ was tested in the noisy-and-enhanced track of the VoiceMOS Challenge 2023, where it obtained the top-ranked performance among nine systems., Comment: Accepted to IEEE ICME 2024
- Published
- 2023
18. Deep Complex U-Net with Conformer for Audio-Visual Speech Enhancement
- Author
Ahmed, Shafique, Chen, Chia-Wei, Ren, Wenze, Li, Chin-Jou, Chu, Ernie, Chen, Jun-Cheng, Hussain, Amir, Wang, Hsin-Min, Tsao, Yu, and Hou, Jen-Cheng
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
- Abstract
Recent studies have increasingly acknowledged the advantages of incorporating visual data into speech enhancement (SE) systems. In this paper, we introduce a novel audio-visual SE approach, termed DCUC-Net (deep complex U-Net with conformer network). The proposed DCUC-Net leverages complex domain features and a stack of conformer blocks. The encoder and decoder of DCUC-Net are designed using a complex U-Net-based framework. The audio and visual signals are processed using a complex encoder and a ResNet-18 model, respectively. These processed signals are then fused using the conformer blocks and transformed into enhanced speech waveforms via a complex decoder. The conformer blocks consist of a combination of self-attention mechanisms and convolutional operations, enabling DCUC-Net to effectively capture both global and local audio-visual dependencies. Our experimental results demonstrate the effectiveness of DCUC-Net, as it outperforms the baseline model from the COG-MHEAR AVSE Challenge 2023 by a notable margin of 0.14 in terms of PESQ. Additionally, the proposed DCUC-Net performs comparably to a state-of-the-art model and outperforms all other compared models on the Taiwan Mandarin speech with video (TMSV) dataset.
- Published
- 2023
19. Non-Intrusive Speech Intelligibility Prediction for Hearing Aids using Whisper and Metadata
- Author
Zezario, Ryandhimas E., Chen, Fei, Fuh, Chiou-Shann, Wang, Hsin-Min, and Tsao, Yu
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound
- Abstract
Automated speech intelligibility assessment is pivotal for hearing aid (HA) development. In this paper, we present three novel methods to improve intelligibility prediction accuracy and introduce MBI-Net+, an enhanced version of MBI-Net, the top-performing system in the 1st Clarity Prediction Challenge. MBI-Net+ leverages Whisper's embeddings to create cross-domain acoustic features and includes metadata from speech signals by using a classifier that distinguishes different enhancement methods. Furthermore, MBI-Net+ integrates the hearing-aid speech perception index (HASPI) as a supplementary metric into the objective function to further boost prediction performance. Experimental results demonstrate that MBI-Net+ surpasses several intrusive baseline systems and MBI-Net on the Clarity Prediction Challenge 2023 dataset, validating the effectiveness of incorporating Whisper embeddings, speech metadata, and related complementary metrics to improve prediction performance for HA., Comment: Accepted to Interspeech 2024
- Published
- 2023
20. Multi-Task Pseudo-Label Learning for Non-Intrusive Speech Quality Assessment Model
- Author
Zezario, Ryandhimas E., Bai, Bo-Ren Brian, Fuh, Chiou-Shann, Wang, Hsin-Min, and Tsao, Yu
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound
- Abstract
This study proposes a multi-task pseudo-label learning (MPL)-based non-intrusive speech quality assessment model called MTQ-Net. MPL consists of two stages: obtaining pseudo-label scores from a pretrained model and performing multi-task learning. The 3QUEST metrics, namely Speech-MOS (S-MOS), Noise-MOS (N-MOS), and General-MOS (G-MOS), are the assessment targets. The pretrained MOSA-Net model is utilized to estimate three pseudo labels: perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and speech distortion index (SDI). Multi-task learning is then employed to train MTQ-Net by combining a supervised loss (derived from the difference between the estimated score and the ground-truth label) and a semi-supervised loss (derived from the difference between the estimated score and the pseudo label), where the Huber loss is employed as the loss function. Experimental results first demonstrate the advantages of MPL compared to training a model from scratch and using a direct knowledge transfer mechanism. Second, the benefit of the Huber loss for improving the predictive ability of MTQ-Net is verified. Finally, the MTQ-Net with the MPL approach exhibits higher overall predictive power compared to other SSL-based speech assessment models., Comment: Accepted to IEEE ICASSP 2024
- Published
- 2023
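The MPL objective in entry 20 combines a supervised Huber loss against ground-truth 3QUEST scores with a semi-supervised Huber loss against pseudo labels (PESQ, STOI, SDI) estimated by a pretrained model. A minimal sketch of such a combined loss is given below; the dictionary keys and weighting factors are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

huber = nn.HuberLoss()

def mpl_loss(pred, ground_truth, pseudo_labels, alpha: float = 1.0, beta: float = 1.0):
    """Supervised Huber term on 3QUEST targets plus a semi-supervised
    Huber term on pseudo labels; alpha/beta weights are assumed."""
    supervised = (huber(pred["smos"], ground_truth["smos"])
                  + huber(pred["nmos"], ground_truth["nmos"])
                  + huber(pred["gmos"], ground_truth["gmos"]))
    semi = (huber(pred["pesq"], pseudo_labels["pesq"])
            + huber(pred["stoi"], pseudo_labels["stoi"])
            + huber(pred["sdi"], pseudo_labels["sdi"]))
    return alpha * supervised + beta * semi

# Sanity check: loss is zero when predictions equal both targets.
toy = {k: torch.randn(8) for k in ["smos", "nmos", "gmos", "pesq", "stoi", "sdi"]}
print(mpl_loss(toy, toy, toy))
```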
21. Mandarin Electrolaryngeal Speech Voice Conversion using Cross-domain Features
- Author
Chen, Hsin-Hao, Chien, Yung-Lun, Yen, Ming-Chi, Tsai, Shu-Wei, Tsao, Yu, Chi, Tai-shih, and Wang, Hsin-Min
- Subjects
Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
Patients who have had their entire larynx removed, including the vocal folds, owing to throat cancer may experience difficulties in speaking. In such cases, electrolarynx devices are often prescribed to produce speech, which is commonly referred to as electrolaryngeal speech (EL speech). However, the quality and intelligibility of EL speech are poor. To address this problem, EL voice conversion (ELVC) is a method used to improve the intelligibility and quality of EL speech. In this paper, we propose a novel ELVC system that incorporates cross-domain features, specifically spectral features and self-supervised learning (SSL) embeddings. The experimental results show that applying cross-domain features can notably improve the conversion performance for the ELVC task compared with utilizing only traditional spectral features., Comment: Accepted to INTERSPEECH 2023
- Published
- 2023
22. Audio-Visual Mandarin Electrolaryngeal Speech Voice Conversion
- Author
Chien, Yung-Lun, Chen, Hsin-Hao, Yen, Ming-Chi, Tsai, Shu-Wei, Wang, Hsin-Min, Tsao, Yu, and Chi, Tai-Shih
- Subjects
Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
Electrolarynx is a commonly used assistive device to help patients with removed vocal cords regain their ability to speak. Although the electrolarynx can generate excitation signals like the vocal cords, the naturalness and intelligibility of electrolaryngeal (EL) speech are very different from those of natural (NL) speech. Many deep-learning-based models have been applied to electrolaryngeal speech voice conversion (ELVC) for converting EL speech to NL speech. In this study, we propose a multimodal voice conversion (VC) model that integrates acoustic and visual information into a unified network. We compared different pre-trained models as visual feature extractors and evaluated the effectiveness of these features in the ELVC task. The experimental results demonstrate that the proposed multimodal VC model outperforms single-modal models in both objective and subjective metrics, suggesting that the integration of visual information can significantly improve the quality of ELVC., Comment: Accepted to INTERSPEECH 2023
- Published
- 2023
23. BASPRO: a balanced script producer for speech corpus collection based on the genetic algorithm
- Author
Chen, Yu-Wen, Wang, Hsin-Min, and Tsao, Yu
- Subjects
Computer Science - Neural and Evolutionary Computing, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
The performance of speech-processing models is heavily influenced by the speech corpus that is used for training and evaluation. In this study, we propose the BAlanced Script PROducer (BASPRO) system, which can automatically construct a phonetically balanced and rich set of Chinese sentences for collecting Mandarin Chinese speech data. First, we used pretrained natural language processing systems to extract ten-character candidate sentences from a large corpus of Chinese news texts. Then, we applied a genetic algorithm-based method to select 20 phonetically balanced sentence sets, each containing 20 sentences, from the candidate sentences. Using BASPRO, we obtained a recording script called TMNews, which contains 400 ten-character sentences. TMNews covers 84% of the syllables used in the real world. Moreover, the syllable distribution has 0.96 cosine similarity to the real-world syllable distribution. We converted the script into a speech corpus using two text-to-speech systems. Using the designed speech corpus, we tested the performance of speech enhancement (SE) and automatic speech recognition (ASR), which are among the most important regression- and classification-based speech processing tasks, respectively. The experimental results show that the SE and ASR models trained on the designed speech corpus outperform their counterparts trained on a randomly composed speech corpus., Comment: accepted by APSIPA Transactions on Signal and Information Processing
- Published
- 2022
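Entry 23 reports a 0.96 cosine similarity between the selected script's syllable distribution and the real-world distribution. The sketch below shows how such a distribution and similarity can be computed; the toy syllable inventory and helper names are illustrative only, and the paper's exact counting scheme may differ.

```python
import numpy as np

def syllable_distribution(sentences, syllable_inventory):
    """Normalized syllable-frequency vector for a set of sentences.
    `sentences` is a list of syllable lists."""
    counts = np.zeros(len(syllable_inventory))
    index = {s: i for i, s in enumerate(syllable_inventory)}
    for sent in sentences:
        for syl in sent:
            counts[index[syl]] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts

def cosine_similarity(p, q):
    """Cosine similarity between two distributions (the 0.96 figure in the
    abstract is this kind of comparison against real-world statistics)."""
    return float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q) + 1e-12))

inventory = ["ba", "ma", "shi", "de"]
script = [["ba", "shi", "de"], ["ma", "ma", "de"]]
real_world = np.array([0.10, 0.25, 0.15, 0.50])
print(cosine_similarity(syllable_distribution(script, inventory), real_world))
```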
24. A Training and Inference Strategy Using Noisy and Enhanced Speech as Target for Speech Enhancement without Clean Speech
- Author
Chen, Li-Wei, Cheng, Yao-Fei, Lee, Hung-Shin, Tsao, Yu, and Wang, Hsin-Min
- Subjects
Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
The lack of clean speech is a practical challenge to the development of speech enhancement systems, which means that there is an inevitable mismatch between their training criterion and evaluation metric. In response to this unfavorable situation, we propose a training and inference strategy that additionally uses enhanced speech as a target by improving the previously proposed noisy-target training (NyTT). Because homogeneity between in-domain noise and extraneous noise is the key to the effectiveness of NyTT, we train various student models by remixing 1) the teacher model's estimated speech and noise for enhanced-target training or 2) raw noisy speech and the teacher model's estimated noise for noisy-target training. Experimental results show that our proposed method outperforms several baselines, especially with the teacher/student inference, where predicted clean speech is derived successively through the teacher and final student models., Comment: Accepted by Interspeech 2023
- Published
- 2022
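Entry 24 describes remixing the teacher model's estimates to build training pairs for student models. The sketch below is one reading of those two remixing recipes; the target assignments and function signature are assumptions, not the authors' exact procedure.

```python
import torch

def make_training_pairs(noisy, teacher_speech_est, teacher_noise_est):
    """Two remixing recipes described in the abstract (tensors: (batch, samples)).

    1) enhanced-target training: remix the teacher's estimated speech with its
       estimated noise; the estimated (enhanced) speech serves as the target.
    2) noisy-target training: remix raw noisy speech with the teacher's
       estimated noise; the original noisy speech serves as the target.
    """
    enhanced_target_input = teacher_speech_est + teacher_noise_est
    enhanced_target_label = teacher_speech_est

    noisy_target_input = noisy + teacher_noise_est
    noisy_target_label = noisy
    return ((enhanced_target_input, enhanced_target_label),
            (noisy_target_input, noisy_target_label))

# Toy shapes: a batch of 2 one-second 16 kHz waveforms.
noisy = torch.randn(2, 16000)
speech_est, noise_est = torch.randn(2, 16000), torch.randn(2, 16000)
(et_pair, nt_pair) = make_training_pairs(noisy, speech_est, noise_est)
```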
25. CasNet: Investigating Channel Robustness for Speech Separation
- Author
Wang, Fan-Lin, Cheng, Yao-Fei, Lee, Hung-Shin, Tsao, Yu, and Wang, Hsin-Min
- Subjects
Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
Recording channel mismatch between training and testing conditions has been shown to be a serious problem for speech separation. This mismatch greatly degrades separation performance and prevents such systems from meeting the requirements of daily use. In this study, inheriting the use of our previously constructed TAT-2mix corpus, we address the channel mismatch problem by proposing a channel-aware audio separation network (CasNet), a deep learning framework for end-to-end time-domain speech separation. CasNet is implemented on top of TasNet. A channel embedding (characterizing channel information in a mixture of multiple utterances) generated by the Channel Encoder is introduced into the separation module via the FiLM technique. Through two training strategies, we explore two roles that the channel embedding may play: 1) a real-life noise disturbance, making the model more robust, or 2) a guide, instructing the separation model to retain the desired channel information. Experimental results on TAT-2mix show that CasNet trained with both training strategies outperforms the TasNet baseline, which does not use channel embeddings., Comment: Submitted to ICASSP 2023
- Published
- 2022
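CasNet in entry 25 injects a channel embedding into the separator via FiLM (feature-wise linear modulation). A generic FiLM layer, with illustrative dimensions, is sketched below; it is not the CasNet implementation itself.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: a conditioning embedding is mapped to a
    per-feature scale (gamma) and shift (beta) that modulate intermediate
    features of the separator. Dimensions are illustrative."""

    def __init__(self, emb_dim: int, feat_dim: int):
        super().__init__()
        self.to_gamma = nn.Linear(emb_dim, feat_dim)
        self.to_beta = nn.Linear(emb_dim, feat_dim)

    def forward(self, feats: torch.Tensor, channel_emb: torch.Tensor):
        # feats: (batch, time, feat_dim); channel_emb: (batch, emb_dim)
        gamma = self.to_gamma(channel_emb).unsqueeze(1)   # (batch, 1, feat_dim)
        beta = self.to_beta(channel_emb).unsqueeze(1)
        return gamma * feats + beta

film = FiLM(emb_dim=64, feat_dim=256)
modulated = film(torch.randn(2, 100, 256), torch.randn(2, 64))
print(modulated.shape)  # torch.Size([2, 100, 256])
```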
26. Mandarin Singing Voice Synthesis with Denoising Diffusion Probabilistic Wasserstein GAN
- Author
Cho, Yin-Ping, Tsao, Yu, Wang, Hsin-Min, and Liu, Yi-Wen
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound, Electrical Engineering and Systems Science - Signal Processing
- Abstract
Singing voice synthesis (SVS) is the computer production of a human-like singing voice from given musical scores. To accomplish end-to-end SVS effectively and efficiently, this work adopts the acoustic model-neural vocoder architecture established for high-quality speech and singing voice synthesis. Specifically, this work aims to pursue a higher level of expressiveness in synthesized voices by combining the denoising diffusion probabilistic model (DDPM) and the Wasserstein generative adversarial network (WGAN) to construct the backbone of the acoustic model. On top of the proposed acoustic model, a HiFi-GAN neural vocoder is adopted with integrated fine-tuning to ensure optimal synthesis quality for the resulting end-to-end SVS system. This end-to-end system was evaluated with the multi-singer Mpop600 Mandarin singing voice dataset. In the experiments, the proposed system exhibits improvements over previous landmark counterparts in terms of musical expressiveness and high-frequency acoustic details. Moreover, the adversarial acoustic model converged stably without the need to enforce reconstruction objectives, indicating the convergence stability of the proposed DDPM and WGAN combined architecture over alternative GAN-based SVS systems.
- Published
- 2022
27. NASTAR: Noise Adaptive Speech Enhancement with Target-Conditional Resampling
- Author
Lee, Chi-Chang, Hu, Cheng-Hung, Lin, Yu-Chen, Chen, Chu-Song, Wang, Hsin-Min, and Tsao, Yu
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning
- Abstract
For deep learning-based speech enhancement (SE) systems, the training-test acoustic mismatch can cause notable performance degradation. To address the mismatch issue, numerous noise adaptation strategies have been derived. In this paper, we propose a novel method, called noise adaptive speech enhancement with target-conditional resampling (NASTAR), which reduces mismatches with only one sample (one-shot) of noisy speech in the target environment. NASTAR uses a feedback mechanism to simulate adaptive training data via a noise extractor and a retrieval model. The noise extractor estimates the target noise from the noisy speech, called pseudo-noise. The noise retrieval model retrieves relevant noise samples from a pool of noise signals according to the noisy speech, called relevant-cohort. The pseudo-noise and the relevant-cohort set are jointly sampled and mixed with the source speech corpus to prepare simulated training data for noise adaptation. Experimental results show that NASTAR can effectively use one noisy speech sample to adapt an SE model to a target condition. Moreover, both the noise extractor and the noise retrieval model contribute to model adaptation. To our best knowledge, NASTAR is the first work to perform one-shot noise adaptation through noise extraction and retrieval., Comment: Accepted to Interspeech 2022
- Published
- 2022
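Entry 27 prepares adaptation data by mixing source clean speech with the extracted pseudo-noise and the retrieved relevant-cohort noises. A minimal simulation loop in that spirit is sketched below; the SNR range, sampling scheme, and function names are assumptions rather than the paper's exact recipe.

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a clean utterance with a noise clip at a given SNR (in dB)."""
    if len(noise) < len(clean):                       # loop the noise if it is too short
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    clean_power = np.mean(clean ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

def simulate_adaptation_data(clean_corpus, pseudo_noise, relevant_cohort, rng):
    """Sample noises from {pseudo-noise} plus the relevant-cohort set and mix
    them with source clean speech, loosely following the abstract."""
    pool = [pseudo_noise] + list(relevant_cohort)
    for clean in clean_corpus:
        noise = pool[rng.integers(len(pool))]
        snr = rng.uniform(-5.0, 15.0)                 # assumed SNR range
        yield mix_at_snr(clean, noise, snr), clean

rng = np.random.default_rng(0)
clean_corpus = [rng.standard_normal(16000) for _ in range(3)]   # three 1-second clips
pseudo = rng.standard_normal(8000)
cohort = [rng.standard_normal(16000) for _ in range(5)]
noisy, clean = next(simulate_adaptation_data(clean_corpus, pseudo, cohort, rng))
```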
28. A Study of Using Cepstrogram for Countermeasure Against Replay Attacks
- Author
Lee, Shih-Kuang, Tsao, Yu, and Wang, Hsin-Min
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Cryptography and Security, Computer Science - Sound
- Abstract
This study investigated the cepstrogram properties and demonstrated their effectiveness as powerful countermeasures against replay attacks. A cepstrum analysis of replay attacks suggests that crucial information for anti-spoofing against replay attacks may be retained in the cepstrogram. When building countermeasures against replay attacks, experiments on the ASVspoof 2019 physical access database demonstrate that the cepstrogram is more effective than other features in both single and fusion systems. Our LCNN-based single and fusion systems with the cepstrogram feature outperformed the corresponding LCNN-based systems without the cepstrogram feature and several state-of-the-art single and fusion systems in the literature., Comment: Submitted to SLT 2022
- Published
- 2022
29. MTI-Net: A Multi-Target Speech Intelligibility Prediction Model
- Author
Zezario, Ryandhimas E., Fu, Szu-wei, Chen, Fei, Fuh, Chiou-Shann, Wang, Hsin-Min, and Tsao, Yu
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound
- Abstract
Recently, deep learning (DL)-based non-intrusive speech assessment models have attracted great attention. Many studies report that these DL-based models yield satisfactory assessment performance and good flexibility, but their performance in unseen environments remains a challenge. Furthermore, compared to quality scores, fewer studies have developed deep learning models to estimate intelligibility scores. This study proposes a multi-task speech intelligibility prediction model, called MTI-Net, for simultaneously predicting human and machine intelligibility measures. Specifically, given a speech utterance, MTI-Net is designed to predict human subjective listening test results and word error rate (WER) scores. We also investigate several methods that can improve the prediction performance of MTI-Net. First, we compare different features (including low-level features and embeddings from self-supervised learning (SSL) models) and prediction targets of MTI-Net. Second, we explore the effect of transfer learning and multi-task learning on training MTI-Net. Finally, we examine the potential advantages of fine-tuning SSL embeddings. Experimental results demonstrate the effectiveness of using cross-domain features, multi-task learning, and fine-tuning SSL embeddings. Furthermore, it is confirmed that the intelligibility and WER scores predicted by MTI-Net are highly correlated with the ground-truth scores., Comment: Accepted to Interspeech 2022
- Published
- 2022
30. MBI-Net: A Non-Intrusive Multi-Branched Speech Intelligibility Prediction Model for Hearing Aids
- Author
Zezario, Ryandhimas E., Chen, Fei, Fuh, Chiou-Shann, Wang, Hsin-Min, and Tsao, Yu
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound
- Abstract
Improving the user's hearing ability to understand speech in noisy environments is critical to the development of hearing aid (HA) devices. For this, it is important to derive a metric that can fairly predict speech intelligibility for HA users. A straightforward approach is to conduct a subjective listening test and use the test results as an evaluation metric. However, conducting large-scale listening tests is time-consuming and expensive. Therefore, several evaluation metrics were derived as surrogates for subjective listening test results. In this study, we propose a multi-branched speech intelligibility prediction model (MBI-Net), for predicting the subjective intelligibility scores of HA users. MBI-Net consists of two branches of models, with each branch consisting of a hearing loss model, a cross-domain feature extraction module, and a speech intelligibility prediction model, to process speech signals from one channel. The outputs of the two branches are fused through a linear layer to obtain predicted speech intelligibility scores. Experimental results confirm the effectiveness of MBI-Net, which produces higher prediction scores than the baseline system in Track 1 and Track 2 on the Clarity Prediction Challenge 2022 dataset., Comment: Accepted to Interspeech 2022
- Published
- 2022
31. Filter-based Discriminative Autoencoders for Children Speech Recognition
- Author
Tai, Chiang-Lin, Lee, Hung-Shin, Tsao, Yu, and Wang, Hsin-Min
- Subjects
Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
Children's speech recognition is indispensable but challenging due to the diversity of children's speech. In this paper, we propose a filter-based discriminative autoencoder for acoustic modeling. To filter out the influence of various speaker types and pitches, auxiliary information of the speaker and pitch features is input into the encoder together with the acoustic features to generate phonetic embeddings. In the training phase, the decoder uses the auxiliary information and the phonetic embedding extracted by the encoder to reconstruct the input acoustic features. The autoencoder is trained by simultaneously minimizing the ASR loss and feature reconstruction error. The framework can make the phonetic embedding purer, resulting in more accurate senone (triphone-state) scores. Evaluated on the test set of the CMU Kids corpus, our system achieves a 7.8% relative WER reduction compared to the baseline system. In the domain adaptation experiment, our system also outperforms the baseline system on the British-accent PF-STAR task., Comment: Published in EUSIPCO 2022
- Published
- 2022
32. Disentangling the Impacts of Language and Channel Variability on Speech Separation Networks
- Author
Wang, Fan-Lin, Lee, Hung-Shin, Tsao, Yu, and Wang, Hsin-Min
- Subjects
Computer Science - Sound, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
Because the performance of speech separation is excellent for speech in which two speakers completely overlap, research attention has been shifted to dealing with more realistic scenarios. However, domain mismatch between training/test situations due to factors, such as speaker, content, channel, and environment, remains a severe problem for speech separation. Speaker and environment mismatches have been studied in the existing literature. Nevertheless, there are few studies on speech content and channel mismatches. Moreover, the impacts of language and channel in these studies are mostly tangled. In this study, we create several datasets for various experiments. The results show that the impacts of different languages are small enough to be ignored compared to the impacts of different channels. In our experiments, training on data recorded by Android phones leads to the best generalizability. Moreover, we provide a new solution for channel mismatch by evaluating projection, where the channel similarity can be measured and used to effectively select additional training data to improve the performance of in-the-wild test data., Comment: Published in Interspeech 2022
- Published
- 2022
33. Generation of Speaker Representations Using Heterogeneous Training Batch Assembly
- Author
Peng, Yu-Huai, Lee, Hung-Shin, Huang, Pin-Tuan, and Wang, Hsin-Min
- Subjects
Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
In traditional speaker diarization systems, a well-trained speaker model is a key component to extract representations from consecutive and partially overlapping segments in a long speech session. To be more consistent with the back-end segmentation and clustering, we propose a new CNN-based speaker modeling scheme, which takes into account the heterogeneity of the speakers in each training segment and batch. We randomly and synthetically augment the training data into a set of segments, each of which contains more than one speaker and some overlapping parts. A soft label is imposed on each segment based on its speaker occupation ratio, and the standard cross entropy loss is implemented in model training. In this way, the speaker model should have the ability to generate a geometrically meaningful embedding for each multi-speaker segment. Experimental results show that our system is superior to the baseline system using x-vectors in two speaker diarization tasks. In the CALLHOME task trained on the NIST SRE and Switchboard datasets, our system achieves a relative reduction of 12.93% in DER. In Track 2 of CHiME-6, our system provides 13.24%, 12.60%, and 5.65% relative reductions in DER, JER, and WER, respectively., Comment: Published in APSIPA ASC 2021
- Published
- 2022
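Entry 33 trains the speaker model with soft labels given by each speaker's occupation ratio in a synthetic multi-speaker segment, using cross-entropy. The sketch below shows that labeling and loss under assumed shapes; the names and toy example are illustrative only.

```python
import torch

def soft_speaker_labels(frame_speaker_ids: torch.Tensor, num_speakers: int) -> torch.Tensor:
    """Soft label for a synthetic multi-speaker segment: the fraction of
    frames occupied by each speaker (speaker occupation ratio)."""
    counts = torch.bincount(frame_speaker_ids, minlength=num_speakers).float()
    return counts / counts.sum()

def soft_cross_entropy(logits: torch.Tensor, soft_targets: torch.Tensor) -> torch.Tensor:
    """Cross-entropy against a soft (non one-hot) target distribution."""
    log_probs = torch.log_softmax(logits, dim=-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean()

# Toy segment: 60 frames of speaker 3 and 40 frames of speaker 7, out of 10 speakers.
frames = torch.cat([torch.full((60,), 3, dtype=torch.long),
                    torch.full((40,), 7, dtype=torch.long)])
target = soft_speaker_labels(frames, num_speakers=10)   # 0.6 and 0.4 at indices 3 and 7
logits = torch.randn(1, 10)
print(soft_cross_entropy(logits, target.unsqueeze(0)))
```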
34. Multi-target Extractor and Detector for Unknown-number Speaker Diarization
- Author
Cheng, Chin-Yi, Lee, Hung-Shin, Tsao, Yu, and Wang, Hsin-Min
- Subjects
Computer Science - Sound, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
Strong representations of target speakers can help extract important information about speakers and detect corresponding temporal regions in multi-speaker conversations. In this study, we propose a neural architecture that simultaneously extracts speaker representations consistent with the speaker diarization objective and detects the presence of each speaker on a frame-by-frame basis regardless of the number of speakers in a conversation. A speaker representation (called z-vector) extractor and a time-speaker contextualizer, implemented by a residual network and processing data in both temporal and speaker dimensions, are integrated into a unified framework. Tests on the CALLHOME corpus show that our model outperforms most of the methods proposed so far. Evaluations in a more challenging case with simultaneous speakers ranging from 2 to 7 show that our model achieves 6.4% to 30.9% relative diarization error rate reductions over several typical baselines., Comment: Accepted by IEEE Signal Processing Letters
- Published
- 2022
35. Subspace-based Representation and Learning for Phonotactic Spoken Language Recognition
- Author
Lee, Hung-Shin, Tsao, Yu, Jeng, Shyh-Kang, and Wang, Hsin-Min
- Subjects
Computer Science - Sound, Computer Science - Computation and Language, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
Phonotactic constraints can be employed to distinguish languages by representing a speech utterance as a multinomial distribution or phone events. In the present study, we propose a new learning mechanism based on subspace-based representation, which can extract concealed phonotactic structures from utterances, for language verification and dialect/accent identification. The framework mainly involves two successive parts. The first part involves subspace construction. Specifically, it decodes each utterance into a sequence of vectors filled with phone-posteriors and transforms the vector sequence into a linear orthogonal subspace based on low-rank matrix factorization or dynamic linear modeling. The second part involves subspace learning based on kernel machines, such as support vector machines and the newly developed subspace-based neural networks (SNNs). The input layer of SNNs is specifically designed for the sample represented by subspaces. The topology ensures that the same output can be derived from identical subspaces by modifying the conventional feed-forward pass to fit the mathematical definition of subspace similarity. Evaluated on the "General LR" test of NIST LRE 2007, the proposed method achieved up to 52%, 46%, 56%, and 27% relative reductions in equal error rates over the sequence-based PPR-LM, PPR-VSM, and PPR-IVEC methods and the lattice-based PPR-LM method, respectively. Furthermore, on the dialect/accent identification task of NIST LRE 2009, the SNN-based system performed better than the aforementioned four baseline methods., Comment: Published in IEEE/ACM Trans. Audio, Speech, Lang. Process., 2020, vol. 28, pp. 3065-3079
- Published
- 2022
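Entry 35 represents an utterance by a linear subspace derived from its phone-posterior sequence and compares subspaces for recognition. The sketch below builds such a subspace with a truncated SVD and scores similarity via principal angles; the rank, the specific similarity measure, and the names are assumptions and may differ from the paper's formulation.

```python
import numpy as np

def utterance_subspace(phone_posteriors: np.ndarray, rank: int = 5) -> np.ndarray:
    """Represent an utterance (a time x phone matrix of posteriors) by the
    rank-r orthonormal basis of its dominant directions via truncated SVD."""
    # Columns of u span the dominant subspace of the posterior sequence.
    u, _, _ = np.linalg.svd(phone_posteriors.T, full_matrices=False)
    return u[:, :rank]                      # (num_phones, rank), orthonormal columns

def subspace_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity of two subspaces from the principal angles between them:
    the mean of the squared cosines of the principal angles."""
    cosines = np.linalg.svd(a.T @ b, compute_uv=False)
    return float(np.mean(cosines ** 2))

rng = np.random.default_rng(0)
post_a = rng.random((300, 40))              # 300 frames, 40 phone posteriors
post_b = rng.random((250, 40))
print(subspace_similarity(utterance_subspace(post_a), utterance_subspace(post_b)))
```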
36. Speech-enhanced and Noise-aware Networks for Robust Speech Recognition
- Author
Lee, Hung-Shin, Chen, Pin-Yuan, Cheng, Yao-Fei, Tsao, Yu, and Wang, Hsin-Min
- Subjects
Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
Compensation for channel mismatch and noise interference is essential for robust automatic speech recognition. Enhanced speech has been introduced into the multi-condition training of acoustic models to improve their generalization ability. In this paper, a noise-aware training framework based on two cascaded neural structures is proposed to jointly optimize speech enhancement and speech recognition. The feature enhancement module is composed of a multi-task autoencoder, where noisy speech is decomposed into clean speech and noise. By concatenating its enhanced, noise-aware, and noisy features for each frame, the acoustic-modeling module maps each feature-augmented frame into a triphone state by optimizing the lattice-free maximum mutual information and cross entropy between the predicted and actual state sequences. On top of the factorized time delay neural network (TDNN-F) and its convolutional variant (CNN-TDNNF), both with SpecAug, the two proposed systems achieve word error rates (WERs) of 3.90% and 3.55%, respectively, on the Aurora-4 task. Compared with the best existing systems that use bigram and trigram language models for decoding, the proposed CNN-TDNNF-based system achieves relative WER reductions of 15.20% and 33.53%, respectively. In addition, the proposed CNN-TDNNF-based system also outperforms the baseline CNN-TDNNF system on the AMI task., Comment: Published in ISCSLP 2022
- Published
- 2022
37. Chain-based Discriminative Autoencoders for Speech Recognition
- Author
Lee, Hung-Shin, Huang, Pin-Tuan, Cheng, Yao-Fei, and Wang, Hsin-Min
- Subjects
Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
In our previous work, we proposed a discriminative autoencoder (DcAE) for speech recognition. DcAE combines two training schemes into one. First, since DcAE aims to learn encoder-decoder mappings, the squared error between the reconstructed speech and the input speech is minimized. Second, in the code layer, frame-based phonetic embeddings are obtained by minimizing the categorical cross-entropy between ground truth labels and predicted triphone-state scores. DcAE is developed based on the Kaldi toolkit by treating various TDNN models as encoders. In this paper, we further propose three new versions of DcAE. First, a new objective function that considers both categorical cross-entropy and mutual information between ground truth and predicted triphone-state sequences is used. The resulting DcAE is called a chain-based DcAE (c-DcAE). For application to robust speech recognition, we further extend c-DcAE to hierarchical and parallel structures, resulting in hc-DcAE and pc-DcAE. In these two models, both the error between the reconstructed noisy speech and the input noisy speech and the error between the enhanced speech and the reference clean speech are taken into the objective function. Experimental results on the WSJ and Aurora-4 corpora show that our DcAE models outperform baseline systems., Comment: Published in Interspeech 2022
- Published
- 2022
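The basic DcAE objective described above, reconstruction error plus frame-level cross-entropy sharing one code layer, can be pictured with the following minimal PyTorch sketch. The class name `DiscriminativeAE`, the layer sizes, and the omission of the chain (LF-MMI-style) sequence term are assumptions for illustration, not the paper's Kaldi implementation.

```python
import torch
import torch.nn as nn

class DiscriminativeAE(nn.Module):
    """Toy DcAE: the code layer serves both input reconstruction and
    frame-level triphone-state classification (sizes are illustrative)."""
    def __init__(self, feat_dim=40, code_dim=256, num_states=2000):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, code_dim), nn.ReLU())
        self.decoder = nn.Linear(code_dim, feat_dim)        # reconstructs the input
        self.classifier = nn.Linear(code_dim, num_states)   # phonetic code -> states

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), self.classifier(code)

model = DiscriminativeAE()
feats = torch.randn(16, 40)                  # input speech frames
states = torch.randint(0, 2000, (16,))       # ground-truth triphone states
recon, logits = model(feats)

# Combined objective: squared reconstruction error + categorical cross-entropy.
# (The chain-based variant additionally uses a sequence-level objective,
#  which is omitted in this sketch.)
loss = nn.functional.mse_loss(recon, feats) + nn.functional.cross_entropy(logits, states)
loss.backward()
```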
38. The VoiceMOS Challenge 2022
- Author
-
Huang, Wen-Chin, Cooper, Erica, Tsao, Yu, Wang, Hsin-Min, Toda, Tomoki, and Yamagishi, Junichi
- Subjects
Computer Science - Sound ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
We present the first edition of the VoiceMOS Challenge, a scientific event that aims to promote the study of automatic prediction of the mean opinion score (MOS) of synthetic speech. This challenge drew 22 participating teams from academia and industry who tried a variety of approaches to tackle the problem of predicting human ratings of synthesized speech. The listening test data for the main track of the challenge consisted of samples from 187 different text-to-speech and voice conversion systems spanning over a decade of research, and the out-of-domain track consisted of data from more recent systems rated in a separate listening test. Results of the challenge show the effectiveness of fine-tuning self-supervised speech models for the MOS prediction task, as well as the difficulty of predicting MOS ratings for unseen speakers and listeners, and for unseen systems in the out-of-domain setting., Comment: Accepted to Interspeech 2022
- Published
- 2022
39. Partially Fake Audio Detection by Self-attention-based Fake Span Discovery
- Author
-
Wu, Haibin, Kuo, Heng-Cheng, Zheng, Naijun, Hung, Kuo-Hsuan, Lee, Hung-Yi, Tsao, Yu, Wang, Hsin-Min, and Meng, Helen
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing ,Computer Science - Machine Learning ,Computer Science - Sound - Abstract
The past few years have witnessed significant advances in speech synthesis and voice conversion technologies. However, such technologies can undermine the robustness of widely deployed biometric identification models and can be harnessed by in-the-wild attackers for illegal uses. The ASVspoof challenge mainly focuses on audio synthesized by advanced speech synthesis and voice conversion models, as well as replay attacks. Recently, the first Audio Deep Synthesis Detection challenge (ADD 2022) extended the attack scenarios into more aspects, and it is also the first challenge to propose the partially fake audio detection task. Such brand-new attacks are dangerous, and how to tackle them remains an open question. We therefore propose a novel framework that introduces a question-answering (fake span discovery) strategy with a self-attention mechanism to detect partially fake audio. The proposed fake span detection module tasks the anti-spoofing model with predicting the start and end positions of the fake clip within the partially fake audio, directs the model's attention toward discovering the fake spans rather than other shortcuts with less generalization, and finally equips the model with the capacity to discriminate between real and partially fake audio. Our submission ranked second in the partially fake audio detection track of ADD 2022. [A minimal code sketch of a span-prediction head follows this record.], Comment: Submitted to ICASSP 2022
- Published
- 2022
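The fake span discovery strategy above amounts to a question-answering-style head that predicts the start and end frames of the fake segment. The sketch below is a hypothetical illustration: the `FakeSpanHead` module, the single self-attention layer, and the feature dimensions are assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class FakeSpanHead(nn.Module):
    """QA-style span head: per-frame start/end logits over an utterance,
    analogous to answer-span prediction in reading comprehension."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.span = nn.Linear(feat_dim, 2)   # column 0: start logit, column 1: end logit

    def forward(self, frames):                        # frames: (batch, time, feat_dim)
        ctx, _ = self.attn(frames, frames, frames)    # self-attention over frames
        logits = self.span(ctx)                       # (batch, time, 2)
        return logits[..., 0], logits[..., 1]

head = FakeSpanHead()
frames = torch.randn(4, 200, 256)            # frame-level embeddings from a front-end
start_true = torch.randint(0, 200, (4,))     # ground-truth start frame of the fake clip
end_true = torch.randint(0, 200, (4,))       # ground-truth end frame
start_logits, end_logits = head(frames)
loss = (nn.functional.cross_entropy(start_logits, start_true)
        + nn.functional.cross_entropy(end_logits, end_true))
loss.backward()
```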
40. EMGSE: Acoustic/EMG Fusion for Multimodal Speech Enhancement
- Author
-
Wang, Kuan-Chen, Liu, Kai-Chun, Wang, Hsin-Min, and Tsao, Yu
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing ,Computer Science - Machine Learning ,Computer Science - Sound ,Quantitative Biology - Quantitative Methods - Abstract
Multimodal learning has proven to be an effective way to improve speech enhancement (SE) performance, especially in challenging situations such as low signal-to-noise ratios, speech noise, or unseen noise types. In previous studies, several types of auxiliary data have been used to construct multimodal SE systems, such as lip images, electropalatography, or electromagnetic midsagittal articulography. In this paper, we propose a novel EMGSE framework for multimodal SE, which integrates audio and facial electromyography (EMG) signals. Facial EMG is a biological signal containing articulatory movement information, which can be measured non-invasively. Experimental results show that the proposed EMGSE system can achieve better performance than the audio-only SE system, and the benefits of fusing EMG signals with acoustic signals for SE are notable under challenging circumstances. Furthermore, this study reveals that cheek EMG is sufficient for SE. [A minimal code sketch of the audio/EMG fusion follows this record.], Comment: 5 pages, 4 figures, and 3 tables
- Published
- 2022
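A rough picture of the acoustic/EMG fusion described above: each modality is encoded separately and the concatenated embeddings drive a masking-based enhancer. The `EMGFusionSE` module, the feature dimensions, and the mask-based formulation below are illustrative assumptions rather than the published EMGSE architecture.

```python
import torch
import torch.nn as nn

class EMGFusionSE(nn.Module):
    """Toy audio/EMG fusion for SE: each modality is encoded separately,
    the embeddings are concatenated, and a mask over the noisy spectrogram
    is predicted (dimensions are illustrative)."""
    def __init__(self, spec_dim=257, emg_dim=64, hidden=256):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(spec_dim, hidden), nn.ReLU())
        self.emg_enc = nn.Sequential(nn.Linear(emg_dim, hidden), nn.ReLU())
        self.mask_net = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, spec_dim), nn.Sigmoid(),   # per-bin mask in [0, 1]
        )

    def forward(self, noisy_spec, emg_feat):
        fused = torch.cat([self.audio_enc(noisy_spec), self.emg_enc(emg_feat)], dim=-1)
        return self.mask_net(fused) * noisy_spec          # masked (enhanced) spectrogram

model = EMGFusionSE()
noisy_spec = torch.rand(8, 100, 257)    # (batch, frames, frequency bins)
emg_feat = torch.randn(8, 100, 64)      # frame-aligned facial-EMG features
clean_spec = torch.rand(8, 100, 257)
enhanced = model(noisy_spec, emg_feat)
loss = nn.functional.mse_loss(enhanced, clean_spec)
loss.backward()
```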
41. HASA-net: A non-intrusive hearing-aid speech assessment network
- Author
-
Chiang, Hsin-Tien, Wu, Yi-Chiao, Yu, Cheng, Toda, Tomoki, Wang, Hsin-Min, Hu, Yih-Chun, and Tsao, Yu
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing ,Computer Science - Artificial Intelligence ,Computer Science - Sound - Abstract
Without the need for a clean reference, non-intrusive speech assessment methods have attracted great attention for objective evaluation. Recently, deep neural network (DNN) models have been applied to build non-intrusive speech assessment approaches and have been confirmed to provide promising performance. However, most DNN-based approaches are designed for normal-hearing listeners without considering hearing-loss factors. In this study, we propose a DNN-based hearing-aid speech assessment network (HASA-Net), formed by a bidirectional long short-term memory (BLSTM) model, to predict speech quality and intelligibility scores simultaneously from input speech signals and specified hearing-loss patterns. To the best of our knowledge, HASA-Net is the first work to incorporate quality and intelligibility assessments into a unified DNN-based non-intrusive model for hearing aids. Experimental results show that the speech quality and intelligibility scores predicted by HASA-Net are highly correlated with two well-known intrusive hearing-aid evaluation metrics, the hearing aid speech quality index (HASQI) and the hearing aid speech perception index (HASPI), respectively. [A minimal code sketch of the multi-task assessment model follows this record.]
- Published
- 2021
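The HASA-Net idea above, a BLSTM conditioned on a hearing-loss pattern with joint quality and intelligibility outputs, can be sketched as follows. The module name, the audiogram dimension, and the mean-pooling readout are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class HearingAidAssessor(nn.Module):
    """Toy non-intrusive assessor: a BLSTM over spectral frames, conditioned
    on a hearing-loss pattern (e.g., an audiogram vector), with two heads
    for quality and intelligibility scores (sizes are illustrative)."""
    def __init__(self, spec_dim=257, hl_dim=6, hidden=128):
        super().__init__()
        self.blstm = nn.LSTM(spec_dim + hl_dim, hidden, batch_first=True, bidirectional=True)
        self.quality_head = nn.Linear(2 * hidden, 1)
        self.intell_head = nn.Linear(2 * hidden, 1)

    def forward(self, spec, hearing_loss):
        # Broadcast the hearing-loss vector to every frame and concatenate.
        hl = hearing_loss.unsqueeze(1).expand(-1, spec.size(1), -1)
        out, _ = self.blstm(torch.cat([spec, hl], dim=-1))
        pooled = out.mean(dim=1)                       # utterance-level embedding
        return self.quality_head(pooled), self.intell_head(pooled)

model = HearingAidAssessor()
spec = torch.rand(4, 300, 257)        # (batch, frames, frequency bins)
audiogram = torch.rand(4, 6)          # hearing-loss thresholds at 6 frequencies
quality, intelligibility = model(spec, audiogram)
```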
42. Deep Learning-based Non-Intrusive Multi-Objective Speech Assessment Model with Cross-Domain Features
- Author
-
Zezario, Ryandhimas E., Fu, Szu-Wei, Chen, Fei, Fuh, Chiou-Shann, Wang, Hsin-Min, and Tsao, Yu
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing ,Computer Science - Machine Learning ,Computer Science - Sound - Abstract
In this study, we propose a cross-domain multi-objective speech assessment model called MOSA-Net, which can estimate multiple speech assessment metrics simultaneously. Experimental results show that MOSA-Net improves the linear correlation coefficient (LCC) by 0.026 (0.990 vs 0.964 in seen noise environments) and 0.012 (0.969 vs 0.957 in unseen noise environments) in PESQ prediction, compared to Quality-Net, an existing single-task model for PESQ prediction, and improves LCC by 0.021 (0.985 vs 0.964 in seen noise environments) and 0.047 (0.836 vs 0.789 in unseen noise environments) in STOI prediction, compared to STOI-Net (based on CRNN), an existing single-task model for STOI prediction. Moreover, MOSA-Net, originally trained to assess objective scores, can be used as a pre-trained model that is effectively adapted to an assessment model for predicting subjective quality and intelligibility scores with a limited amount of training data. Experimental results show that MOSA-Net improves LCC by 0.018 (0.805 vs 0.787) in MOS prediction, compared to MOS-SSL, a strong single-task model for MOS prediction. In light of the confirmed prediction capability, we further adopt the latent representations of MOSA-Net to guide the speech enhancement (SE) process and derive a quality-intelligibility (QI)-aware SE (QIA-SE) approach accordingly. Experimental results show that QIA-SE provides superior enhancement performance compared with the baseline SE system in terms of objective evaluation metrics and a qualitative evaluation test. For example, QIA-SE improves PESQ by 0.301 (2.953 vs 2.652 in seen noise environments) and 0.180 (2.658 vs 2.478 in unseen noise environments) over a CNN-based baseline SE model. [A minimal code sketch of the multi-objective assessment model follows this record.]
- Published
- 2021
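A minimal sketch of the multi-objective, cross-domain idea above: spectral features and features from a self-supervised model are combined, and separate heads predict the different assessment metrics. The `MultiObjectiveAssessor` module and all dimensions are illustrative assumptions, not the published MOSA-Net architecture.

```python
import torch
import torch.nn as nn

class MultiObjectiveAssessor(nn.Module):
    """Toy multi-objective assessor: spectral features and features from a
    (frozen) self-supervised model are concatenated per frame, and separate
    heads predict PESQ-like and STOI-like scores (sizes are illustrative)."""
    def __init__(self, spec_dim=257, ssl_dim=768, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(spec_dim + ssl_dim, hidden, batch_first=True, bidirectional=True)
        self.pesq_head = nn.Linear(2 * hidden, 1)
        self.stoi_head = nn.Linear(2 * hidden, 1)

    def forward(self, spec, ssl_feat):
        out, _ = self.rnn(torch.cat([spec, ssl_feat], dim=-1))
        pooled = out.mean(dim=1)                 # utterance-level embedding
        return self.pesq_head(pooled), self.stoi_head(pooled)

model = MultiObjectiveAssessor()
spec = torch.rand(2, 300, 257)       # spectral (frequency-domain) features
ssl_feat = torch.randn(2, 300, 768)  # frame-aligned SSL (cross-domain) features
pesq_pred, stoi_pred = model(spec, ssl_feat)
pesq_true, stoi_true = torch.rand(2, 1) * 3.5 + 1.0, torch.rand(2, 1)
loss = nn.functional.mse_loss(pesq_pred, pesq_true) + nn.functional.mse_loss(stoi_pred, stoi_true)
loss.backward()
```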
43. Ginsenoside compound K reduces the progression of Huntington's disease via the inhibition of oxidative stress and overactivation of the ATM/AMPK pathway
- Author
-
Hua, Kuo-Feng, Chao, A-Ching, Lin, Ting-Yu, Chen, Wan-Tze, Lee, Yu-Chieh, Hsu, Wan-Han, Lee, Sheau-Long, Wang, Hsin-Min, Yang, Ding-I., and Ju, Tz-Chuen
- Published
- 2022
- Full Text
- View/download PDF
44. Multi-Modal Pedestrian Crossing Intention Prediction with Transformer-Based Model.
- Author
-
Wang, Ting-Wei, Lai, Shang-Hong, Wang, Jia-Ching, Wang, Hsin-Min, Peng, Wen-Hsiao, and Yeh, Chia-Hung
- Subjects
DRIVER assistance systems ,COMPUTER vision ,TRAFFIC safety ,PREDICTION models ,INFORMATION resources - Abstract
Pedestrian crossing intention prediction based on computer vision plays a pivotal role in enhancing the safety of autonomous driving and advanced driver assistance systems. In this paper, we present a novel multi-modal pedestrian crossing intention prediction framework built on the transformer model. By integrating diverse sources of information and exploiting the transformer's sequential modeling and parallelization capabilities, our system accurately predicts pedestrian crossing intentions. We introduce a novel representation of traffic environment data and incorporate lifted 3D human pose and head orientation data to enhance the model's understanding of pedestrian behavior. Experimental results demonstrate the state-of-the-art accuracy of our proposed system on benchmark datasets. [ABSTRACT FROM AUTHOR] [A minimal code sketch of a multi-modal transformer classifier follows this record.]
- Published
- 2024
- Full Text
- View/download PDF
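The multi-modal transformer formulation above can be approximated by projecting per-modality token sequences to a shared width and classifying with a learned [CLS] token. The modalities, dimensions, and the `CrossingIntentionModel` module below are hypothetical stand-ins, not the authors' architecture.

```python
import torch
import torch.nn as nn

class CrossingIntentionModel(nn.Module):
    """Toy multi-modal transformer: per-frame tokens from several modalities
    (e.g., bounding box, 3D pose, head orientation, traffic context) are
    projected to a shared width, concatenated along time, and classified
    from a [CLS] token (dimensions and modalities are illustrative)."""
    def __init__(self, dims=(4, 51, 3, 16), d_model=128, num_layers=2):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in dims])
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, 2)   # cross / not cross

    def forward(self, modalities):          # list of (batch, time, dim_i) tensors
        tokens = torch.cat([p(m) for p, m in zip(self.proj, modalities)], dim=1)
        cls = self.cls.expand(tokens.size(0), -1, -1)
        out = self.encoder(torch.cat([cls, tokens], dim=1))
        return self.head(out[:, 0])         # prediction from the [CLS] position

model = CrossingIntentionModel()
batch, T = 2, 16
inputs = [torch.randn(batch, T, d) for d in (4, 51, 3, 16)]
logits = model(inputs)                      # shape: (2, 2)
```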
45. End-to-End Singing Transcription Based on CTC and HSMM Decoding with a Refined Score Representation.
- Author
-
Deng, Tengyu, Nakamura, Eita, Nishikimi, Ryo, Yoshii, Kazuyoshi, Wang, Jia-Ching, Wang, Hsin-Min, Peng, Wen-Hsiao, and Yeh, Chia-Hung
- Subjects
ARTIFICIAL neural networks ,MUSIC scores ,PROBLEM solving ,SINGING ,PRIOR learning - Abstract
This paper describes an end-to-end automatic singing transcription (AST) method that translates a music audio signal containing a vocal part into a symbolic musical score of sung notes. A common approach to sequence-to-sequence learning for this problem is connectionist temporal classification (CTC), where a target score is represented as a sequence of notes with discrete pitches and note values. However, if the note value of a note is estimated incorrectly, the score times of the following notes are also estimated incorrectly, and the metrical structure of the estimated score collapses. To solve this problem, we propose a refined score representation based on the metrical positions of note onsets. To decode a musical score from the output of a deep neural network (DNN), we use a hidden semi-Markov model (HSMM) that incorporates prior knowledge about musical scores and temporal fluctuation in human performance. We show that the proposed method achieves state-of-the-art performance and confirm the efficacy of the refined score representation and the decoding method. [ABSTRACT FROM AUTHOR] [A minimal code sketch of CTC training over note tokens follows this record.]
- Published
- 2024
- Full Text
- View/download PDF
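The CTC half of the method above can be sketched with PyTorch's built-in CTC loss over a note-token vocabulary; the vocabulary size, the toy encoder, and the token design below are assumptions, and the HSMM decoder is only noted in a comment.

```python
import torch
import torch.nn as nn

# Toy CTC setup for note-level transcription: the network emits per-frame
# logits over a note-token vocabulary (pitch/metrical-position symbols plus
# the CTC blank); nn.CTCLoss aligns them with the target token sequence.
vocab_size = 101                      # 100 note tokens + blank (index 0); illustrative
T, batch, feat_dim = 400, 2, 80

encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, vocab_size))
frames = torch.randn(T, batch, feat_dim)              # (time, batch, features)
log_probs = encoder(frames).log_softmax(dim=-1)       # CTC expects log-probabilities

targets = torch.randint(1, vocab_size, (batch, 30))   # note-token targets (no blanks)
input_lengths = torch.full((batch,), T, dtype=torch.long)
target_lengths = torch.full((batch,), 30, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
# At inference, the paper decodes the network output with an HSMM that encodes
# score priors and tempo fluctuation; a greedy argmax over `log_probs` would be
# the naive alternative.
```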
46. Estimating 3D Hand Poses and Shapes from Silhouettes.
- Author
-
Chang, Li-Jen, Liao, Yu-Cheng, Lin, Chia-Hui, Yang-Mao, Shys-Fang, Chen, Hwann-Tzong, Wang, Jia-Ching, Wang, Hsin-Min, Peng, Wen-Hsiao, and Yeh, Chia-Hung
- Subjects
SILHOUETTES ,ANNOTATIONS ,RECORDING & registration ,FORECASTING - Abstract
We present Mask2Hand, a self-trainable method for predicting 3D hand pose and shape from a single 2D binary silhouette. Without additional manual annotations, our method uses differentiable rendering to project 3D estimations onto the 2D silhouette. A tailored loss function, applied between the rendered and input silhouettes, provides a self-guidance mechanism during end-to-end optimization, which constrains global mesh registration and hand pose estimation. Our experiments show that Mask2Hand, using only a binary mask input, achieves accuracy comparable to state-of-the-art methods requiring RGB or depth inputs on both unaligned and aligned datasets. [ABSTRACT FROM AUTHOR] [A minimal code sketch of a silhouette loss follows this record.]
- Published
- 2024
- Full Text
- View/download PDF
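The self-guidance loss described above compares a differentiably rendered silhouette with the input mask. The sketch below assumes such a renderer exists upstream and shows only a plausible silhouette loss (soft IoU plus binary cross-entropy); the exact loss terms used in Mask2Hand may differ.

```python
import torch
import torch.nn as nn

def silhouette_loss(rendered, target, eps=1e-6):
    """Soft-IoU plus binary cross-entropy between a differentiably rendered
    silhouette and the input binary mask; both are (batch, H, W) in [0, 1]."""
    inter = (rendered * target).sum(dim=(1, 2))
    union = (rendered + target - rendered * target).sum(dim=(1, 2))
    soft_iou = 1.0 - (inter + eps) / (union + eps)
    bce = nn.functional.binary_cross_entropy(rendered, target, reduction="none").mean(dim=(1, 2))
    return (soft_iou + bce).mean()

# The rendered silhouettes would normally come from a differentiable renderer
# (e.g., a soft rasterizer) applied to the predicted hand mesh; random tensors
# stand in for them here.
rendered = torch.rand(2, 224, 224, requires_grad=True)   # stand-in rendered silhouettes
target = (torch.rand(2, 224, 224) > 0.5).float()          # input binary masks
loss = silhouette_loss(rendered, target)
loss.backward()
```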
47. Meta Soft Prompting and Learning.
- Author
-
Chien, Jen-Tzung, Chen, Ming-Yen, Lee, Ching-hsien, Xue, Jing-Hao, Wang, Jia-Ching, Wang, Hsin-Min, Peng, Wen-Hsiao, and Yeh, Chia-Hung
- Subjects
LANGUAGE models ,SOFT sets ,NATURAL languages ,LANGUAGE attrition ,CLASSIFICATION - Abstract
Traditionally, applying hard prompts built from handcrafted templates or directly optimizing soft (continuous) prompts may not generalize sufficiently to unseen domain data. This paper presents a parameter-efficient learning method for a domain-agnostic soft prompt, developed for few-shot unsupervised domain adaptation. A pre-trained language model (PLM) is frozen and utilized to extract knowledge for unseen domains in various language understanding tasks. Meta learning and optimization over a set of trainable soft tokens is performed by minimizing the masked-language-model cross-entropy loss on support and query data from the source and target domains, respectively, where the masked tokens for the text category and for random masking are predicted. The meta soft prompt is learned through a doubly-looped optimization over individual learners and a meta learner when implementing the unsupervised domain adaptation. The PLM is then closely adapted to compensate for the domain shift in the target domain. The domain adaptation loss and the prompt-based classification loss are jointly minimized through meta learning. Experiments on multi-domain natural language understanding illustrate the merit of the proposed meta soft prompt for pre-trained language modeling under the few-shot setting. [ABSTRACT FROM AUTHOR] [A minimal code sketch of soft-prompt tuning with a frozen encoder follows this record.]
- Published
- 2024
- Full Text
- View/download PDF
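The core mechanism above, trainable soft prompt vectors prepended to a frozen PLM's token embeddings, can be sketched as follows. The stand-in encoder, embedding table, and prompt length are assumptions; the meta-learning outer loop is only noted in a comment.

```python
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    """Toy soft-prompt setup: a small set of trainable prompt vectors is
    prepended to the (frozen) token embeddings before the encoder. The
    encoder here is a stand-in for a pre-trained masked language model."""
    def __init__(self, frozen_encoder, embed: nn.Embedding, n_prompts=20):
        super().__init__()
        self.encoder, self.embed = frozen_encoder, embed
        for p in list(self.encoder.parameters()) + list(self.embed.parameters()):
            p.requires_grad = False                      # the PLM stays frozen
        d = embed.embedding_dim
        self.prompts = nn.Parameter(torch.randn(n_prompts, d) * 0.02)  # only trainable part

    def forward(self, input_ids):
        tok = self.embed(input_ids)                                    # (B, L, d)
        prm = self.prompts.unsqueeze(0).expand(tok.size(0), -1, -1)    # (B, P, d)
        return self.encoder(torch.cat([prm, tok], dim=1))              # (B, P+L, d)

# Stand-ins for a real PLM's embedding table and transformer body.
vocab, d_model = 30522, 128
embed = nn.Embedding(vocab, d_model)
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

model = SoftPromptWrapper(encoder, embed, n_prompts=20)
hidden = model(torch.randint(0, vocab, (4, 32)))         # shape: (4, 52, 128)
# In the paper, the prompts are meta-learned across source/target domains by
# minimizing masked-LM cross-entropy; only `model.prompts` receives gradients.
```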
48. A Lightweight Enhancement Approach for Real-Time Semantic Segmentation by Distilling Rich Knowledge from Pre-Trained Vision-Language Model.
- Author
-
Lin, Chia-Yi, Chen, Jun-Cheng, Wu, Ja-Ling, Wang, Jia-Ching, Wang, Hsin-Min, Peng, Wen-Hsiao, and Yeh, Chia-Hung
- Subjects
LEARNING strategies ,SPINE - Abstract
In this work, we propose a lightweight approach to enhance real-time semantic segmentation by leveraging pre-trained vision-language models, specifically utilizing the text encoder of Contrastive Language-Image Pretraining (CLIP) to generate rich semantic embeddings for text labels. Our method then distills this textual knowledge into the segmentation model, integrating the image and text embeddings to align visual and textual information. Additionally, we implement learnable prompt embeddings for better class-specific semantic comprehension. We propose a two-stage training strategy for efficient learning: the segmentation backbone initially learns from fixed text embeddings and subsequently optimizes the prompt embeddings to streamline the learning process. Extensive evaluations and ablation studies validate our approach's ability to effectively improve the semantic segmentation model's performance over the compared methods. [ABSTRACT FROM AUTHOR] [A minimal code sketch of text-embedding-guided pixel classification follows this record.]
- Published
- 2024
- Full Text
- View/download PDF
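The distillation idea above, scoring per-pixel embeddings against frozen text-label embeddings, can be sketched as follows. The random stand-in text embeddings, the 1x1 projection, and the temperature value are assumptions; in the paper the label embeddings come from the frozen CLIP text encoder.

```python
import torch
import torch.nn as nn

# Toy distillation of text-label knowledge into a segmentation model:
# class names are encoded once by a frozen text encoder (a CLIP text encoder
# in the paper; random vectors stand in here), and per-pixel embeddings from
# the segmentation backbone are scored against them by cosine similarity.
num_classes, text_dim, pix_dim = 19, 512, 256

text_embeds = nn.functional.normalize(torch.randn(num_classes, text_dim), dim=-1)  # frozen
pixel_proj = nn.Conv2d(pix_dim, text_dim, kernel_size=1)   # aligns pixels to text space

pixel_feats = torch.randn(2, pix_dim, 64, 128)              # backbone features (B, C, H, W)
pix = nn.functional.normalize(pixel_proj(pixel_feats), dim=1)

# Cosine-similarity logits per pixel: (B, num_classes, H, W), scaled by a temperature.
logits = torch.einsum("bchw,kc->bkhw", pix, text_embeds) / 0.07

labels = torch.randint(0, num_classes, (2, 64, 128))         # ground-truth segmentation map
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
```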
49. Unmasking Illusions: Understanding Human Perception of Audiovisual Deepfakes
- Author
-
Hashmi, Ammarah, primary, Shahzad, Sahibzada Adil, additional, Lin, Chia-Wen, additional, Tsao, Yu, additional, and Wang, Hsin-Min, additional
- Published
- 2024
- Full Text
- View/download PDF
50. Multi-Task Pseudo-Label Learning for Non-Intrusive Speech Quality Assessment Model
- Author
-
Zezario, Ryandhimas E., primary, Brian Bai, Bo-Ren, additional, Fuh, Chiou-Shann, additional, Wang, Hsin-Min, additional, and Tsao, Yu, additional
- Published
- 2024
- Full Text
- View/download PDF