308 results for "Hsin-Min Wang"
Search Results
2. Generalization Ability Improvement of Speaker Representation and Anti-Interference for Speaker Verification
- Author
Qian-Bei Hong, Chung-Hsien Wu, and Hsin-Min Wang
- Subjects
Computational Mathematics, Acoustics and Ultrasonics, Computer Science (miscellaneous), Electrical and Electronic Engineering
- Published
- 2023
- Full Text
- View/download PDF
3. Deep Learning-Based Non-Intrusive Multi-Objective Speech Assessment Model With Cross-Domain Features
- Author
Chiou-Shann Fuh, Szu-Wei Fu, Ryandhimas Zezario, Hsin-Min Wang, Yu Tsao, and Fei Chen
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Computational Mathematics, Acoustics and Ultrasonics, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science (miscellaneous), Electrical and Electronic Engineering, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Machine Learning (cs.LG)
- Abstract
In this study, we propose a cross-domain multi-objective speech assessment model called MOSA-Net, which can estimate multiple speech assessment metrics simultaneously. Experimental results show that MOSA-Net can improve the linear correlation coefficient (LCC) by 0.026 (0.990 vs 0.964 in seen noise environments) and 0.012 (0.969 vs 0.957 in unseen noise environments) in PESQ prediction, compared to Quality-Net, an existing single-task model for PESQ prediction, and improve LCC by 0.021 (0.985 vs 0.964 in seen noise environments) and 0.047 (0.836 vs 0.789 in unseen noise environments) in STOI prediction, compared to STOI-Net (based on CRNN), an existing single-task model for STOI prediction. Moreover, MOSA-Net, originally trained to assess objective scores, can be used as a pre-trained model to be effectively adapted to an assessment model for predicting subjective quality and intelligibility scores with a limited amount of training data. Experimental results show that MOSA-Net can improve LCC by 0.018 (0.805 vs 0.787) in MOS prediction, compared to MOS-SSL, a strong single-task model for MOS prediction. In light of the confirmed prediction capability, we further adopt the latent representations of MOSA-Net to guide the speech enhancement (SE) process and derive a quality-intelligibility (QI)-aware SE (QIA-SE) approach accordingly. Experimental results show that QIA-SE provides superior enhancement performance compared with the baseline SE system in terms of objective evaluation metrics and a qualitative evaluation test. For example, QIA-SE can improve PESQ by 0.301 (2.953 vs 2.652 in seen noise environments) and 0.18 (2.658 vs 2.478 in unseen noise environments) over a CNN-based baseline SE model. (A toy illustration of the LCC comparison follows this entry.)
- Published
- 2023
- Full Text
- View/download PDF
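The linear correlation coefficient (LCC) reported above is the Pearson correlation between predicted and ground-truth scores. A minimal sketch of how such a comparison is computed, using made-up scores rather than the paper's data:

import numpy as np

def lcc(predicted, actual):
    # Pearson linear correlation coefficient between two score arrays.
    return np.corrcoef(predicted, actual)[0, 1]

# Hypothetical PESQ ground truths and predictions from two models.
true_pesq = np.array([1.8, 2.4, 3.1, 3.6, 4.2])
mosa_net_pred = np.array([1.9, 2.3, 3.0, 3.7, 4.1])
quality_net_pred = np.array([2.1, 2.2, 3.4, 3.3, 4.0])

print(lcc(mosa_net_pred, true_pesq) - lcc(quality_net_pred, true_pesq))  # LCC gain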
4. Global Epidemiology of Hip Fractures
- Author
Chor‐Wing Sing, Tzu‐Chieh Lin, Sharon Bartholomew, J Simon Bell, Corina Bennett, Kebede Beyene, Pauline Bosco‐Levy, Brian D. Bradbury, Amy Hai Yan Chan, Manju Chandran, Cyrus Cooper, Maria de Ridder, Caroline Y. Doyon, Cécile Droz‐Perroteau, Ganga Ganesan, Sirpa Hartikainen, Jenni Ilomaki, Han Eol Jeong, Douglas P. Kiel, Kiyoshi Kubota, Edward Chia‐Cheng Lai, Jeff L. Lange, E. Michael Lewiecki, Julian Lin, Jiannong Liu, Joe Maskell, Mirhelen Mendes de Abreu, James O'Kelly, Nobuhiro Ooba, Alma B. Pedersen, Albert Prats‐Uribe, Daniel Prieto‐Alhambra, Simon Xiwen Qin, Ju‐Young Shin, Henrik T. Sørensen, Kelvin Bryan Tan, Tracy Thomas, Anna‐Maija Tolppanen, Katia M.C. Verhamme, Grace Hsin‐Min Wang, Sawaeng Watcharathanakij, Stephen J Wood, Ching‐Lung Cheung, and Ian C.K. Wong
- Subjects
SDG 3 - Good Health and Well-being, Endocrinology, Diabetes and Metabolism, Orthopedics and Sports Medicine
- Abstract
In this international study, we examined the incidence of hip fractures, postfracture treatment, and all-cause mortality following hip fractures, based on demographics, geography, and calendar year. We used patient-level healthcare data from 19 countries and regions to identify patients aged 50 years and older hospitalized with a hip fracture from 2005 to 2018. The age- and sex-standardized incidence rates of hip fractures, post-hip fracture treatment (defined as the proportion of patients receiving anti-osteoporosis medication with various mechanisms of action [bisphosphonates, denosumab, raloxifene, strontium ranelate, or teriparatide] following a hip fracture), and the all-cause mortality rates after hip fractures were estimated using a standardized protocol and common data model. The number of hip fractures in 2050 was projected based on trends in the incidence and estimated future population demographics. In total, 4,115,046 hip fractures were identified from 20 databases. The reported age- and sex-standardized incidence rates of hip fractures ranged from 95.1 (95% confidence interval [CI] 94.8-95.4) in Brazil to 315.9 (95% CI 314.0-317.7) in Denmark per 100,000 population. Incidence rates decreased over the study period in most countries; however, the estimated total annual number of hip fractures nearly doubled from 2018 to 2050. Within 1 year following a hip fracture, post-hip fracture treatment ranged from 11.5% (95% CI 11.1% to 11.9%) in Germany to 50.3% (95% CI 50.0% to 50.7%) in the United Kingdom, and all-cause mortality rates ranged from 14.4% (95% CI 14.0% to 14.8%) in Singapore to 28.3% (95% CI 28.0% to 28.6%) in the United Kingdom. Males had lower use of anti-osteoporosis medication than females, higher rates of all-cause mortality, and a larger increase in the projected number of hip fractures by 2050. Substantial variations exist in the global epidemiology of hip fractures and postfracture outcomes. Our findings inform possible actions to reduce the projected public health burden of osteoporotic fractures among the aging population. (A toy sketch of direct rate standardization follows this entry.)
- Published
- 2023
- Full Text
- View/download PDF
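Age- and sex-standardized rates like those above are conventionally obtained by direct standardization: stratum-specific rates are weighted by a reference population. A toy sketch with invented strata, not the study's data:

# Direct standardization of a hip-fracture incidence rate (toy numbers).
# Each stratum: (cases, person_years, reference_population_weight).
strata = [
    (120,  90_000, 0.40),   # e.g., females 50-69
    (300,  60_000, 0.25),   # females 70+
    (80,  100_000, 0.20),   # males 50-69
    (150,  50_000, 0.15),   # males 70+
]

standardized = sum(
    (cases / person_years) * weight for cases, person_years, weight in strata
) * 100_000  # express per 100,000 population

print(f"{standardized:.1f} per 100,000")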
5. Improved Lite Audio-Visual Speech Enhancement
- Author
Yu Tsao, Shang-Yi Chuang, and Hsin-Min Wang
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Computational Mathematics, Acoustics and Ultrasonics, Audio and Speech Processing (eess.AS), Image and Video Processing (eess.IV), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science (miscellaneous), Electrical Engineering and Systems Science - Image and Video Processing, Electrical and Electronic Engineering, Electrical Engineering and Systems Science - Audio and Speech Processing, Machine Learning (cs.LG)
- Abstract
Numerous studies have investigated the effectiveness of audio-visual multimodal learning for speech enhancement (AVSE) tasks, seeking a solution that uses visual data as auxiliary and complementary input to reduce the noise of noisy speech signals. Recently, we proposed a lite audio-visual speech enhancement (LAVSE) algorithm for a car-driving scenario. Compared to conventional AVSE systems, LAVSE requires less online computation and to some extent solves the user privacy problem on facial data. In this study, we extend LAVSE to improve its ability to address three practical issues often encountered in implementing AVSE systems, namely, the additional cost of processing visual data, audio-visual asynchronization, and low-quality visual data. The proposed system is termed improved LAVSE (iLAVSE), which uses a convolutional recurrent neural network architecture as the core AVSE model. We evaluate iLAVSE on the Taiwan Mandarin speech with video dataset. Experimental results confirm that compared to conventional AVSE systems, iLAVSE can effectively overcome the aforementioned three practical issues and can improve enhancement performance. The results also confirm that iLAVSE is suitable for real-world scenarios, where high-quality audio-visual sensors may not always be available.
- Published
- 2022
- Full Text
- View/download PDF
6. Speaker-Specific Articulatory Feature Extraction Based on Knowledge Distillation for Speaker Recognition
- Author
Hsin-Min Wang, Chung-Hsien Wu, and Qian-Bei Hong
- Subjects
Signal Processing, Information Systems
- Published
- 2023
- Full Text
- View/download PDF
7. BASPRO: a balanced script producer for speech corpus collection based on the genetic algorithm
- Author
Yu Tsao, Hsin-Min Wang, and Yu-Wen Chen
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Neural and Evolutionary Computing, Machine Learning (cs.LG), Artificial Intelligence (cs.AI), Audio and Speech Processing (eess.AS), Signal Processing, FOS: Electrical engineering, electronic engineering, information engineering, Neural and Evolutionary Computing (cs.NE), Computation and Language (cs.CL), Information Systems, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
The performance of speech-processing models is heavily influenced by the speech corpus that is used for training and evaluation. In this study, we propose the BAlanced Script PROducer (BASPRO) system, which can automatically construct a phonetically balanced and rich set of Chinese sentences for collecting Mandarin Chinese speech data. First, we used pretrained natural language processing systems to extract ten-character candidate sentences from a large corpus of Chinese news texts. Then, we applied a genetic algorithm-based method to select 20 phonetically balanced sentence sets, each containing 20 sentences, from the candidate sentences. Using BASPRO, we obtained a recording script called TMNews, which contains 400 ten-character sentences. TMNews covers 84% of the syllables used in the real world. Moreover, the syllable distribution has 0.96 cosine similarity to the real-world syllable distribution. We converted the script into a speech corpus using two text-to-speech systems. Using the designed speech corpus, we tested the performance of speech enhancement (SE) and automatic speech recognition (ASR), which are among the most important regression- and classification-based speech-processing tasks, respectively. The experimental results show that the SE and ASR models trained on the designed speech corpus outperform their counterparts trained on a randomly composed speech corpus. (Accepted by APSIPA Transactions on Signal and Information Processing. A toy genetic-algorithm selection sketch follows this entry.)
- Published
- 2022
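A genetic algorithm for this kind of selection evolves candidate sentence sets toward a fitness target. The sketch below, with entirely synthetic data, uses cosine similarity between a set's syllable distribution and a reference distribution as the fitness; BASPRO's actual operators and fitness function are more elaborate.

import random
import numpy as np

random.seed(0)
rng = np.random.default_rng(0)
N_SYLLABLES, SET_SIZE, POP, GENERATIONS = 50, 20, 40, 200
corpus = rng.integers(0, 3, size=(1000, N_SYLLABLES))  # syllable counts per candidate sentence
target = rng.dirichlet(np.ones(N_SYLLABLES))           # reference syllable distribution

def fitness(indices):
    counts = corpus[np.asarray(indices)].sum(axis=0).astype(float)
    dist = counts / counts.sum()
    return float(dist @ target / (np.linalg.norm(dist) * np.linalg.norm(target)))

population = [rng.choice(len(corpus), SET_SIZE, replace=False) for _ in range(POP)]
for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    survivors = population[: POP // 2]
    children = []
    while len(survivors) + len(children) < POP:
        a, b = random.sample(survivors, 2)             # crossover: mix two parents
        pool = list(set(a.tolist()) | set(b.tolist()))
        random.shuffle(pool)
        child = pool[:SET_SIZE]
        if random.random() < 0.3:                      # mutation: swap in a new sentence
            child[random.randrange(SET_SIZE)] = random.randrange(len(corpus))
        children.append(np.array(child))
    population = survivors + children
best = max(population, key=fitness)
print(f"best cosine similarity: {fitness(best):.3f}")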
8. Lip Sync Matters: A Novel Multimodal Forgery Detector
- Author
Sahibzada Adil Shahzad, Ammarah Hashmi, Sarwar Khan, Yan-Tsung Peng, Yu Tsao, and Hsin-Min Wang
- Published
- 2022
- Full Text
- View/download PDF
9. Multimodal Forgery Detection Using Ensemble Learning
- Author
Ammarah Hashmi, Sahibzada Adil Shahzad, Wasim Ahmad, Chia Wen Lin, Yu Tsao, and Hsin-Min Wang
- Published
- 2022
- Full Text
- View/download PDF
10. Detecting Replay Attacks Using Single-Channel Audio: The Temporal Autocorrelation of Speech
- Author
Shih-Kuang Lee, Yu Tsao, and Hsin-Min Wang
- Published
- 2022
- Full Text
- View/download PDF
11. Continued potassium supplementation use following loop diuretic discontinuation in older adults: An evaluation of a prescribing cascade relic
- Author
Grace Hsin‐Min Wang, Earl J. Morris, Steven M. Smith, Jesper Hallas, and Scott M. Vouri
- Subjects
Geriatrics and Gerontology
- Abstract
The use of a new medication (e.g., potassium supplementation) for managing a drug-induced adverse event (e.g., loop diuretic-induced hypokalemia) constitutes a prescribing cascade. However, loop diuretics are often stopped while potassium may be unnecessarily continued (i.e., a relic). We aimed to quantify the occurrence of relics, using older adults who previously experienced a loop diuretic-potassium prescribing cascade as an example. We conducted a prescription sequence symmetry analysis using population-based Medicare Fee-For-Service data (2011-2018) and partitioned the 150 days following potassium initiation by day to assess the daily treatment scenarios (i.e., loop diuretics alone, potassium alone, a combination of loop diuretics and potassium, or neither). We calculated the proportion of patients developing the relic, the proportion of person-days under potassium alone, the daily probability of the relic, and the proportion of patients filling potassium after loop diuretic discontinuation. We also identified the risk factors for the relic. We identified 284,369 loop diuretic initiators, who were 8 times more likely to receive potassium supplementation simultaneously with or after (i.e., the prescribing cascade), rather than before, loop diuretic initiation (aSR 8.0, 95% CI 7.9-8.2). Among the 66,451 loop diuretic initiators who subsequently (≤30 days) initiated potassium, 20,445 (30.8%) remained on potassium after loop diuretic discontinuation, and 9365 (14.1%) subsequently filled another potassium prescription. Following loop diuretic initiation, 4.0% of person-days were for potassium alone, and the daily probability of the relic was highest after day 90 of loop diuretic initiation (5.6%). Older age, female sex, higher diuretic daily dose, and greater baseline comorbidities were risk factors for the relic, while patients having the same prescriber or pharmacy involved in the use of both medications were less likely to experience the relic. Our findings suggest that clinicians should be aware of the potential for a relic to avoid unnecessary drug use. (A toy sequence symmetry computation follows this entry.)
- Published
- 2022
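Sequence symmetry analysis, used above, compares how many patients start the marker drug (potassium) after versus before the index drug (loop diuretic); a ratio well above 1 suggests a cascade. A minimal sketch with toy records (the published aSR additionally adjusts for secular prescribing trends, which is omitted here):

from datetime import date

# Toy records: (first_loop_diuretic_date, first_potassium_date) per patient.
patients = [
    (date(2016, 3, 1), date(2016, 3, 20)),   # potassium after diuretic
    (date(2016, 5, 2), date(2016, 4, 1)),    # potassium before diuretic
    (date(2017, 1, 5), date(2017, 1, 5)),    # same day counted as "after"
    (date(2017, 6, 1), date(2017, 6, 25)),
]

after = sum(k >= d for d, k in patients)
before = sum(k < d for d, k in patients)
print(f"crude sequence ratio = {after / before:.1f}")  # >1 suggests a cascade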
12. Learning to Visualize Music Through Shot Sequence for Automatic Concert Video Mashup
- Author
Hsin-Min Wang, Tyng-Luh Liu, Hong-Yuan Mark Liao, Jen-Chun Lin, Wen-Li Wei, and Hsiao-Rong Tyan
- Subjects
Computer science, Knowledge engineering, Computer Science Applications, Visualization, Human–computer interaction, Signal Processing, Media Technology, Task analysis, Mashup, Electrical and Electronic Engineering, Amateur, Storytelling
- Abstract
An experienced director usually switches among different types of shots to make visual storytelling more touching. When filming a musical performance, appropriate switching shots can produce some special effects, such as enhancing the expression of emotion or heating up the atmosphere. However, while the visual storytelling technique is often used in making professional recordings of a live concert, amateur recordings of audiences often lack such storytelling concepts and skills when filming the same event. Thus, a versatile system that can perform video mashup to create a refined high-quality video from such amateur clips is desirable. To this end, we aim at translating the music into an attractive shot (type) sequence by learning the relation between music and visual storytelling of shots. The resulting shot sequence can then be used to better portray the visual storytelling of a song and guide the concert video mashup process. To achieve the task, we first introduce a novel probability-based fusion approach, named multi-resolution fused recurrent neural networks (MF-RNNs) with film-language, which integrates multi-resolution fused RNNs and a film-language model for boosting the translation performance. We then distill the knowledge in MF-RNNs with film-language into a lightweight RNN, which is more efficient and easier to deploy. The results from objective and subjective experiments demonstrate that both MF-RNNs with film-language and the lightweight RNN can generate attractive shot sequences for music, thereby enhancing the viewing and listening experience.
- Published
- 2021
- Full Text
- View/download PDF
13. Association between gabapentinoids and oedema treated with loop diuretics: A pooled sequence symmetry analysis from the USA and Denmark
- Author
Scott Martin Vouri, Earl J. Morris, Grace Hsin‐Min Wang, Alyaa Hashim Jaber Bilal, Jesper Hallas, and Daniel Pilsgaard Henriksen
- Subjects
Pharmacology, Adult, Sodium Potassium Chloride Symporter Inhibitors, Denmark, Humans, Edema, Pharmacology (medical), Serotonin and Noradrenaline Reuptake Inhibitors, Medicare, Diuretics, United States
- Abstract
To assess the gabapentinoid-oedema-loop diuretic prescribing cascade in adults using large administrative health care databases from the USA and Denmark. This study used a sequence symmetry analysis to assess loop diuretic initiation before and after the initiation of gabapentinoids among patients aged 20 years or older without heart failure or chronic kidney disease. Data from the MarketScan Commercial and Medicare Supplemental Claims databases (2005 to 2019) and the Danish National Prescription Register (2005 to 2018) were analyzed. Use of loop diuretics associated with initiation of serotonin-norepinephrine reuptake inhibitors (SNRIs) was used as a negative control. We assessed the pooled temporality of loop diuretic initiation relative to gabapentinoid or SNRI initiation across the two countries. Secular trend-adjusted sequence ratios (aSRs) with 95% confidence intervals (CIs) were calculated using data from 90 days before and after initiation of gabapentinoids. A pooled ratio of aSRs was calculated by comparing gabapentinoids with SNRIs. Among the 1 511 493 gabapentinoid initiators (Denmark [n = 338 941]; USA [n = 1 172 552]), 20 139 patients had a new loop diuretic prescription 90 days before or after gabapentinoid initiation, resulting in a pooled aSR of 1.33 (95% CI 1.06-1.67). The pooled aSR for the negative control (i.e., SNRIs) was 0.84 (95% CI 0.75-0.94), which resulted in a pooled ratio of aSRs of 1.58 (95% CI 1.23-2.04). The pooled estimated incidence of the gabapentinoid-loop diuretic prescribing cascade was 8.14 (95% CI 1.92-34.49) events per 1000 patient-years. We identified evidence of the gabapentinoid-oedema-loop diuretic prescribing cascade in two countries.
- Published
- 2022
14. Partially Fake Audio Detection by Self-Attention-Based Fake Span Discovery
- Author
Haibin Wu, Heng-Cheng Kuo, Naijun Zheng, Kuo-Hsuan Hung, Hung-Yi Lee, Yu Tsao, Hsin-Min Wang, and Helen Meng
- Subjects
FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Machine Learning, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Machine Learning (cs.LG), Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
The past few years have witnessed significant advances in speech synthesis and voice conversion technologies. However, such technologies can undermine the robustness of broadly implemented biometric identification models and can be harnessed by in-the-wild attackers for illegal uses. The ASVspoof challenge mainly focuses on audio synthesized by advanced speech synthesis and voice conversion models, as well as replay attacks. Recently, the first Audio Deep Synthesis Detection challenge (ADD 2022) extended the attack scenarios into more aspects; ADD 2022 is also the first challenge to propose the partially fake audio detection task. Such brand-new attacks are dangerous, and how to tackle them remains an open question. Thus, we propose a novel framework that introduces a question-answering (fake span discovery) strategy with a self-attention mechanism to detect partially fake audio. The proposed fake span detection module tasks the anti-spoofing model with predicting the start and end positions of the fake clip within the partially fake audio, directs the model's attention to discovering the fake spans rather than other shortcuts with less generalization, and finally equips the model with the capacity to discriminate between real and partially fake audio. Our submission ranked second in the partially fake audio detection track of ADD 2022. (Submitted to ICASSP 2022. A toy span-prediction sketch follows this entry.)
- Published
- 2022
- Full Text
- View/download PDF
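The fake span discovery strategy mirrors the start/end span prediction head of extractive question answering, applied to frame-level audio features. A PyTorch-style toy sketch with hypothetical shapes, not the authors' architecture:

import torch
import torch.nn as nn

class FakeSpanHead(nn.Module):
    # Predict start/end frames of a fake span from frame-level features.
    def __init__(self, feat_dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.span = nn.Linear(feat_dim, 2)   # per-frame start and end logits

    def forward(self, frames):               # frames: (batch, time, feat_dim)
        ctx, _ = self.attn(frames, frames, frames)   # self-attention over time
        logits = self.span(ctx)                      # (batch, time, 2)
        return logits[..., 0], logits[..., 1]

head = FakeSpanHead(feat_dim=128)
start_logits, end_logits = head(torch.randn(2, 300, 128))
print(start_logits.argmax(dim=1), end_logits.argmax(dim=1))  # predicted span bounds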
15. MBI-Net: A Non-Intrusive Multi-Branched Speech Intelligibility Prediction Model for Hearing Aids
- Author
Ryandhimas Edo Zezario, Fei Chen, Chiou-Shann Fuh, Hsin-Min Wang, and Yu Tsao
- Subjects
FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Machine Learning, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Machine Learning (cs.LG), Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
Improving the user's hearing ability to understand speech in noisy environments is critical to the development of hearing aid (HA) devices. For this, it is important to derive a metric that can fairly predict speech intelligibility for HA users. A straightforward approach is to conduct a subjective listening test and use the test results as an evaluation metric. However, conducting large-scale listening tests is time-consuming and expensive. Therefore, several evaluation metrics were derived as surrogates for subjective listening test results. In this study, we propose a multi-branched speech intelligibility prediction model (MBI-Net) for predicting the subjective intelligibility scores of HA users. MBI-Net consists of two branches of models, with each branch consisting of a hearing loss model, a cross-domain feature extraction module, and a speech intelligibility prediction model, to process speech signals from one channel. The outputs of the two branches are fused through a linear layer to obtain predicted speech intelligibility scores. Experimental results confirm the effectiveness of MBI-Net, which produces higher prediction scores than the baseline system in Track 1 and Track 2 on the Clarity Prediction Challenge 2022 dataset. (Accepted to Interspeech 2022.)
- Published
- 2022
16. Unsupervised Representation Disentanglement Using Cross Domain Features and Adversarial Learning in Variational Autoencoder Based Voice Conversion
- Author
Chen-Chou Lo, Hsin-Te Hwang, Hao Luo, Hsin-Min Wang, Wen-Chin Huang, Yu-Huai Peng, and Yu Tsao
- Subjects
FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Machine Learning, Computer Science - Computation and Language, Control and Optimization, Computer science, Speech recognition, Speech processing, Autoencoder, Computer Science - Sound, Machine Learning (cs.LG), Computer Science Applications, Domain, Constraint, Computational Mathematics, Audio and Speech Processing (eess.AS), Artificial Intelligence, Similarity, Classifier, FOS: Electrical engineering, electronic engineering, information engineering, Code, Representation, Computation and Language (cs.CL), Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
An effective approach for voice conversion (VC) is to disentangle linguistic content from other components in the speech signal. The effectiveness of variational autoencoder (VAE) based VC (VAE-VC), for instance, strongly relies on this principle. In our prior work, we proposed a cross-domain VAE-VC (CDVAE-VC) framework, which utilized acoustic features of different properties, to improve the performance of VAE-VC. We believed that the success came from more disentangled latent representations. In this paper, we extend the CDVAE-VC framework by incorporating the concept of adversarial learning in order to further increase the degree of disentanglement, thereby improving the quality and similarity of converted speech. More specifically, we first investigate the effectiveness of incorporating generative adversarial networks (GANs) with CDVAE-VC. Then, we consider the concept of domain adversarial training and add an explicit constraint to the latent representation, realized by a speaker classifier, to explicitly eliminate the speaker information that resides in the latent code. Experimental results confirm that the degree of disentanglement of the learned latent representation can be enhanced by both GANs and the speaker classifier. Meanwhile, subjective evaluation results in terms of quality and similarity scores demonstrate the effectiveness of our proposed methods. (Accepted to IEEE Transactions on Emerging Topics in Computational Intelligence. A toy sketch of the gradient-reversal mechanism follows this entry.)
- Published
- 2020
- Full Text
- View/download PDF
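Domain adversarial training of the kind described is commonly implemented with a gradient reversal layer: a speaker classifier is trained on the latent code while its gradient is flipped into the encoder, pushing speaker information out of the code. A minimal sketch of that mechanism (the paper's exact adversarial setup may differ):

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; negate the gradient in the backward pass.
    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad):
        return -grad

encoder = nn.Linear(80, 16)      # toy encoder: acoustic features -> latent code
speaker_clf = nn.Linear(16, 4)   # tries to identify 4 toy speakers from the code

feats = torch.randn(8, 80)
speakers = torch.randint(0, 4, (8,))
z = encoder(feats)
loss = nn.functional.cross_entropy(speaker_clf(GradReverse.apply(z)), speakers)
loss.backward()  # the reversed gradient pushes the encoder to hide speaker identity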
17. Speech Enhancement Based on Denoising Autoencoder With Multi-Branched Encoders
- Author
Ryandhimas E. Zezario, Xugang Lu, Syu-Siang Wang, Hsin-Min Wang, Cheng Yu, Jonathan Sherman, Yu Tsao, and Yi-Yen Hsieh
- Subjects
FOS: Computer and information sciences, Sound (cs.SD), Acoustics and Ultrasonics, Computer science, Speech recognition, Noise reduction, Computer Science - Sound, Data modeling, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science (miscellaneous), Electrical and Electronic Engineering, Noise measurement, Deep learning, Speech enhancement, Computational Mathematics, Noise, Artificial intelligence, Encoder, Decoding methods, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
Deep learning-based models have greatly advanced the performance of speech enhancement (SE) systems. However, two problems remain unsolved, which are closely related to model generalizability to noisy conditions: (1) mismatched noisy condition during testing, i.e., the performance is generally sub-optimal when models are tested with unseen noise types that are not involved in the training data; (2) local focus on specific noisy conditions, i.e., models trained using multiple types of noises cannot optimally remove a specific noise type even though the noise type has been involved in the training data. These problems are common in real applications. In this article, we propose a novel denoising autoencoder with a multi-branched encoder (termed DAEME) model to deal with these two problems. In the DAEME model, two stages are involved: training and testing. In the training stage, we build multiple component models to form a multi-branched encoder based on a dynamically sized decision tree (DSDT). The DSDT is built based on prior knowledge of speech and noisy conditions (the speaker, environment, and signal factors are considered in this paper), where each component of the multi-branched encoder performs a particular mapping from noisy to clean speech along the branch in the DSDT. Finally, a decoder is trained on top of the multi-branched encoder. In the testing stage, noisy speech is first processed by each component model. The multiple outputs from these models are then integrated into the decoder to determine the final enhanced speech. Experimental results show that DAEME is superior to several baseline models in terms of objective evaluation metrics, automatic speech recognition results, and quality in subjective human listening tests.
- Published
- 2020
- Full Text
- View/download PDF
18. Subspace-Based Representation and Learning for Phonotactic Spoken Language Recognition
- Author
Yu Tsao, Hung-Shin Lee, Hsin-Min Wang, and Shyh-Kang Jeng
- Subjects
Computer Science - Machine Learning, Sequence, Computer Science - Computation and Language, Acoustics and Ultrasonics, Artificial neural network, Computer science, Speech recognition, Linear subspace, Computer Science - Sound, Matrix decomposition, Support vector machine, Computational Mathematics, Kernel, Computer Science (miscellaneous), NIST, Electrical and Electronic Engineering, Subspace, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
Phonotactic constraints can be employed to distinguish languages by representing a speech utterance as a multinomial distribution over phone events. In the present study, we propose a new learning mechanism based on subspace-based representation, which can extract concealed phonotactic structures from utterances, for language verification and dialect/accent identification. The framework mainly involves two successive parts. The first part involves subspace construction. Specifically, it decodes each utterance into a sequence of vectors filled with phone-posteriors and transforms the vector sequence into a linear orthogonal subspace based on low-rank matrix factorization or dynamic linear modeling. The second part involves subspace learning based on kernel machines, such as support vector machines and the newly developed subspace-based neural networks (SNNs). The input layer of SNNs is specifically designed for the sample represented by subspaces. The topology ensures that the same output can be derived from identical subspaces by modifying the conventional feed-forward pass to fit the mathematical definition of subspace similarity. Evaluated on the "General LR" test of NIST LRE 2007, the proposed method achieved up to 52%, 46%, 56%, and 27% relative reductions in equal error rates over the sequence-based PPR-LM, PPR-VSM, and PPR-IVEC methods and the lattice-based PPR-LM method, respectively. Furthermore, on the dialect/accent identification task of NIST LRE 2009, the SNN-based system performed better than the aforementioned four baseline methods. (Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 3065-3079, 2020. A toy subspace-similarity sketch follows this entry.)
- Published
- 2020
- Full Text
- View/download PDF
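The subspace construction step can be illustrated with a truncated SVD: an orthonormal basis is extracted from a phone-posterior sequence, and the principal angles between two bases give a subspace similarity. A small numpy sketch under those assumptions, not the paper's exact formulation:

import numpy as np

def utterance_subspace(posteriors, rank=5):
    # Orthonormal basis (phones x rank) spanning a phone-posterior sequence.
    u, _, _ = np.linalg.svd(posteriors.T, full_matrices=False)
    return u[:, :rank]

def subspace_similarity(a, b):
    # Mean squared cosine of the principal angles between two subspaces.
    return float(np.mean(np.linalg.svd(a.T @ b, compute_uv=False) ** 2))

rng = np.random.default_rng(1)
utt1 = rng.random((200, 40))   # 200 frames x 40 phone posteriors
utt2 = rng.random((150, 40))
print(subspace_similarity(utterance_subspace(utt1), utterance_subspace(utt2)))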
19. Speech-enhanced and Noise-aware Networks for Robust Speech Recognition
- Author
Hung-Shin Lee, Pin-Yuan Chen, Yao-Fei Cheng, Yu Tsao, and Hsin-Min Wang
- Subjects
FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Sound, Machine Learning (cs.LG), Multimedia (cs.MM), Artificial Intelligence (cs.AI), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
Compensation for channel mismatch and noise interference is essential for robust automatic speech recognition. Enhanced speech has been introduced into the multi-condition training of acoustic models to improve their generalization ability. In this paper, a noise-aware training framework based on two cascaded neural structures is proposed to jointly optimize speech enhancement and speech recognition. The feature enhancement module is composed of a multi-task autoencoder, where noisy speech is decomposed into clean speech and noise. By concatenating its enhanced, noise-aware, and noisy features for each frame, the acoustic-modeling module maps each feature-augmented frame into a triphone state by optimizing the lattice-free maximum mutual information and cross entropy between the predicted and actual state sequences. On top of the factorized time delay neural network (TDNN-F) and its convolutional variant (CNN-TDNNF), both with SpecAug, the two proposed systems achieve word error rates (WERs) of 3.90% and 3.55%, respectively, on the Aurora-4 task. Compared with the best existing systems that use bigram and trigram language models for decoding, the proposed CNN-TDNNF-based system achieves relative WER reductions of 15.20% and 33.53%, respectively. In addition, the proposed CNN-TDNNF-based system also outperforms the baseline CNN-TDNNF system on the AMI task. (Published in ISCSLP 2022.)
- Published
- 2022
20. EMGSE: Acoustic/EMG Fusion for Multimodal Speech Enhancement
- Author
Kuan-Chen Wang, Kai-Chun Liu, Hsin-Min Wang, and Yu Tsao
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Biological sciences, FOS: Electrical engineering, electronic engineering, information engineering, Quantitative Biology - Quantitative Methods, Computer Science - Sound, Quantitative Methods (q-bio.QM), Electrical Engineering and Systems Science - Audio and Speech Processing, Machine Learning (cs.LG)
- Abstract
Multimodal learning has been proven to be an effective method to improve speech enhancement (SE) performance, especially in challenging situations such as low signal-to-noise ratios, speech noise, or unseen noise types. In previous studies, several types of auxiliary data have been used to construct multimodal SE systems, such as lip images, electropalatography, or electromagnetic midsagittal articulography. In this paper, we propose a novel EMGSE framework for multimodal SE, which integrates audio and facial electromyography (EMG) signals. Facial EMG is a biological signal containing articulatory movement information, which can be measured in a non-invasive way. Experimental results show that the proposed EMGSE system can achieve better performance than the audio-only SE system. The benefits of fusing EMG signals with acoustic signals for SE are notable under challenging circumstances. Furthermore, this study reveals that cheek EMG is sufficient for SE.
- Published
- 2022
- Full Text
- View/download PDF
21. Disentangling the Impacts of Language and Channel Variability on Speech Separation Networks
- Author
Fan-Lin Wang, Hung-Shin Lee, Yu Tsao, and Hsin-Min Wang
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Computer Science - Sound, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing, Machine Learning (cs.LG), Multimedia (cs.MM)
- Abstract
Because the performance of speech separation is excellent for speech in which two speakers completely overlap, research attention has shifted to dealing with more realistic scenarios. However, domain mismatch between training and test situations due to factors such as speaker, content, channel, and environment remains a severe problem for speech separation. Speaker and environment mismatches have been studied in the existing literature. Nevertheless, there are few studies on speech content and channel mismatches. Moreover, the impacts of language and channel in these studies are mostly tangled. In this study, we create several datasets for various experiments. The results show that the impacts of different languages are small enough to be ignored compared to the impacts of different channels. In our experiments, training on data recorded by Android phones leads to the best generalizability. Moreover, we provide a new solution for channel mismatch by evaluating projection, where the channel similarity can be measured and used to effectively select additional training data to improve the performance on in-the-wild test data. (Published in Interspeech 2022.)
- Published
- 2022
- Full Text
- View/download PDF
22. Multi-target Extractor and Detector for Unknown-number Speaker Diarization
- Author
Chin-Yi Cheng, Hung-Shin Lee, Yu Tsao, and Hsin-Min Wang
- Subjects
FOS: Computer and information sciences, Sound (cs.SD), Audio and Speech Processing (eess.AS), Applied Mathematics, Signal Processing, FOS: Electrical engineering, electronic engineering, information engineering, Electrical and Electronic Engineering, Computer Science - Sound, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing, Multimedia (cs.MM)
- Abstract
Strong representations of target speakers can help extract important information about speakers and detect corresponding temporal regions in multi-speaker conversations. In this study, we propose a neural architecture that simultaneously extracts speaker representations consistent with the speaker diarization objective and detects the presence of each speaker on a frame-by-frame basis regardless of the number of speakers in a conversation. A speaker representation (called z-vector) extractor and a time-speaker contextualizer, implemented by a residual network and processing data in both temporal and speaker dimensions, are integrated into a unified framework. Tests on the CALLHOME corpus show that our model outperforms most of the methods proposed so far. Evaluations in a more challenging case with simultaneous speakers ranging from 2 to 7 show that our model achieves 6.4% to 30.9% relative diarization error rate reductions over several typical baselines. (Accepted by IEEE Signal Processing Letters.)
- Published
- 2022
- Full Text
- View/download PDF
23. Mandarin Electrolaryngeal Speech Voice Conversion with Sequence-to-Sequence Modeling
- Author
Ming-Chi Yen, Wen-Chin Huang, Kazuhiro Kobayashi, Yu-Huai Peng, Shu-Wei Tsai, Yu Tsao, Tomoki Toda, Jyh-Shing Roger Jang, and Hsin-Min Wang
- Published
- 2021
- Full Text
- View/download PDF
24. Investigation of a Single-Channel Frequency-Domain Speech Enhancement Network to Improve End-to-End Bengali Automatic Speech Recognition Under Unseen Noisy Conditions
- Author
Md Mahbub E Noor, Yen-Ju Lu, Syu-Siang Wang, Supratip Ghose, Chia-Yu Chang, Ryandhimas E. Zezario, Shafique Ahmed, Wei-Ho Chung, Yu Tsao, and Hsin-Min Wang
- Published
- 2021
- Full Text
- View/download PDF
25. HASA-net: A non-intrusive hearing-aid speech assessment network
- Author
Hsin-Tien Chiang, Yi-Chiao Wu, Cheng Yu, Tomoki Toda, Hsin-Min Wang, Yih-Chun Hu, and Yu Tsao
- Subjects
FOS: Computer and information sciences, Sound (cs.SD), Artificial Intelligence (cs.AI), Audio and Speech Processing (eess.AS), Computer Science - Artificial Intelligence, FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
Without the need for a clean reference, non-intrusive speech assessment methods have caught great attention for objective evaluations. Recently, deep neural network (DNN) models have been applied to build non-intrusive speech assessment approaches and confirmed to provide promising performance. However, most DNN-based approaches are designed for normal-hearing listeners without considering hearing-loss factors. In this study, we propose a DNN-based hearing aid speech assessment network (HASA-Net), formed by a bidirectional long short-term memory (BLSTM) model, to predict speech quality and intelligibility scores simultaneously according to input speech signals and specified hearing-loss patterns. To the best of our knowledge, HASA-Net is the first work to incorporate quality and intelligibility assessments utilizing a unified DNN-based non-intrusive model for hearing aids. Experimental results show that the predicted speech quality and intelligibility scores of HASA-Net are highly correlated to two well-known intrusive hearing-aid evaluation metrics, hearing aid speech quality index (HASQI) and hearing aid speech perception index (HASPI), respectively.
- Published
- 2021
26. Use of antipsychotic drugs and cholinesterase inhibitors and risk of falls and fractures: self-controlled case series
- Author
Tzu-Chi Liao, Kenneth K.C. Man, Edward Chia Cheng Lai, Grace Hsin-Min Wang, and Wei-Hung Chang
- Subjects
Male, Databases, Factual, Neurocognitive Disorders, Taiwan, Rate ratio, Risk Assessment, Fractures, Bone, Internal medicine, Humans, Poisson regression, Medical prescription, Antipsychotic Agents, Cholinesterase Inhibitors, Aged, Aged, 80 and over, Incidence (epidemiology), Research, General Medicine, Confidence interval, Accidental Falls, Female
- Abstract
Objective: To evaluate the association between the use of antipsychotic drugs and cholinesterase inhibitors and the risk of falls and fractures in elderly patients with major neurocognitive disorders.
Design: Self-controlled case series.
Setting: Taiwan's National Health Insurance Database.
Participants: 15 278 adults, aged ≥65, with newly prescribed antipsychotic drugs and cholinesterase inhibitors, who had an incident fall or fracture between 2006 and 2017. Prescription records of cholinesterase inhibitors confirmed the diagnosis of major neurocognitive disorders; all use of cholinesterase inhibitors was reviewed by experts.
Main outcome measures: Conditional Poisson regression was used to derive incidence rate ratios and 95% confidence intervals for evaluating the risk of falls and fractures for different treatment periods: use of cholinesterase inhibitors alone, antipsychotic drugs alone, and a combination of cholinesterase inhibitors and antipsychotic drugs, compared with the non-treatment period in the same individual. A 14 day pretreatment period was defined before starting the study drugs because of concerns about confounding by indication.
Results: The incidence of falls and fractures per 100 person years was 8.30 (95% confidence interval 8.14 to 8.46) for the non-treatment period, 52.35 (48.46 to 56.47) for the pretreatment period, and 10.55 (9.98 to 11.14), 10.34 (9.80 to 10.89), and 9.41 (8.98 to 9.86) for use of a combination of cholinesterase inhibitors and antipsychotic drugs, antipsychotic drugs alone, and cholinesterase inhibitors alone, respectively. Compared with the non-treatment period, the highest risk of falls and fractures was during the pretreatment period (adjusted incidence rate ratio 6.17, 95% confidence interval 5.69 to 6.69), followed by treatment with the combination of cholinesterase inhibitors and antipsychotic drugs (1.35, 1.26 to 1.45), antipsychotic drugs alone (1.33, 1.24 to 1.43), and cholinesterase inhibitors alone (1.17, 1.10 to 1.24).
Conclusions: The incidence of falls and fractures was high in the pretreatment period, suggesting that factors other than the study drugs, such as underlying diseases, should be taken into consideration when evaluating the association between the risk of falls and fractures and use of cholinesterase inhibitors and antipsychotic drugs. The treatment periods were also associated with a higher risk of falls and fractures compared with the non-treatment period, although the magnitude was much lower than during the pretreatment period. Strategies for prevention and close monitoring of the risk of falls are still necessary until patients regain a more stable physical and mental state. (A toy conditional Poisson sketch follows this entry.)
- Published
- 2021
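The within-person comparison of a self-controlled case series can be approximated with a Poisson regression that includes patient fixed effects and an offset for person-time. A toy statsmodels sketch (fabricated person-time table, hypothetical column names):

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Toy person-time table: one row per patient x treatment period.
df = pd.DataFrame({
    "pid":     ["a", "a", "b", "b", "c", "c"],
    "treated": [0, 1, 0, 1, 0, 1],           # 1 = on the study drug
    "falls":   [1, 2, 0, 1, 1, 3],           # events in the period
    "days":    [300, 60, 280, 90, 310, 45],  # person-time in the period
})

# Patient fixed effects + log person-time offset approximate the
# conditional Poisson model of a self-controlled case series.
fit = smf.glm("falls ~ treated + C(pid)", data=df,
              family=sm.families.Poisson(),
              offset=np.log(df["days"])).fit()
print(np.exp(fit.params["treated"]))  # incidence rate ratio, treated vs baseline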
27. Relational Data Selection for Data Augmentation of Speaker-Dependent Multi-Band MelGAN Vocoder
- Author
Hsin-Min Wang, Yu Tsao, Hung-Shin Lee, Yu-Huai Peng, Yi-Chiao Wu, Tomoki Toda, Cheng-Hung Hu, and Wen-Chin Huang
- Subjects
Multi-band, Speaker verification, Similarity, Training set, Audio and Speech Processing (eess.AS), Relational database, Computer science, Speech recognition, FOS: Electrical engineering, electronic engineering, information engineering, Selection, Representation, Identity, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
Nowadays, neural vocoders can generate very high-fidelity speech when a large amount of training data is available. Although a speaker-dependent (SD) vocoder usually outperforms a speaker-independent (SI) vocoder, it is impractical to collect a large amount of data of a specific target speaker for most real-world applications. To tackle the problem of limited target data, a data augmentation method based on speaker representation and the similarity measurement of speaker verification is proposed in this paper. The proposed method selects utterances that have similar speaker identity to the target speaker from an external corpus, and then combines the selected utterances with the limited target data for SD vocoder adaptation. The evaluation results show that, compared with the vocoder adapted using only limited target data, the vocoder adapted using augmented data improves both the quality and similarity of synthesized speech. (Published in Proc. Interspeech 2021. A toy similarity-based selection sketch follows this entry.)
- Published
- 2021
- Full Text
- View/download PDF
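The selection step reduces to ranking external utterances by speaker-embedding similarity to the target speaker. A compact sketch with random stand-in embeddings (the paper derives embeddings and similarity from a speaker verification model):

import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(7)
target_embedding = rng.standard_normal(192)   # e.g., an x-vector-style embedding
external = {f"utt{i:03d}": rng.standard_normal(192) for i in range(500)}

# Rank external utterances by similarity to the target speaker, keep the top 50.
scores = {utt: cosine(emb, target_embedding) for utt, emb in external.items()}
augmentation = sorted(scores, key=scores.get, reverse=True)[:50]
print(augmentation[:5])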
28. Global epidemiology of hip fractures: a study protocol using a common analytical platform among multiple countries
- Author
Kebede Beyene, Jiannong Liu, Amy Hai Yan Chan, Chor-Wing Sing, Caroline Y. Doyon, Hongxin Zhao, Henrik Toft Sørensen, E. Michael Lewiecki, Anna-Maija Tolppanen, Katia M.C. Verhamme, Kenneth K.C. Man, Ju-Young Shin, Kiyoshi Kubota, Jeff Lange, Ian C. K. Wong, Tzu-Chieh Lin, Grace Hsin-Min Wang, Jenni Ilomäki, Mirhelen Mendes de Abreu, Douglas P. Kiel, Pauline Bosco-Lévy, Sharon Bartholomew, Nicholas Moore, Corina W Bennett, Sawaeng Watcharathanakij, Daniel Prieto-Alhambra, Sirpa Hartikainen, Ganga Ganesan, Nobuhiro Ooba, Edward Chia Cheng Lai, Alma B Pedersen, Kelvin Bryan Tan, James O'Kelly, Manju Chandran, Ching-Lung Cheung, J. Simon Bell, Han Eol Jeong, and Cécile Droz-Perroteau
- Subjects
Asia, Population, Diabetes & endocrinology, Global Health, Health care, Epidemiology, Humans, Aged, Retrospective Studies, Hip Fractures, Public health, Incidence (epidemiology), Retrospective cohort study, General Medicine, Middle Aged, South America, Calcium & bone, SDG 3 - Good Health and Well-being, Europe, Systematic review, Demography
- Abstract
Introduction: Hip fractures are associated with a high burden of morbidity and mortality. Globally, there is wide variation in the incidence of hip fracture in people aged 50 years and older. Longitudinal and cross-geographical comparisons of health data can provide insights on aetiology, risk factors, and healthcare practices. However, systematic reviews of studies that use different methods and study periods do not permit direct comparison across geographical regions. Thus, the objective of this study is to investigate global secular trends in hip fracture incidence, mortality and use of postfracture pharmacological treatment across Asia, Oceania, North and South America, and Western and Northern Europe using a unified methodology applied to health records.
Methods and analysis: This retrospective cohort study will use a common protocol and an analytical common data model approach to examine incidence of hip fracture across population-based databases in different geographical regions and healthcare settings. The study period will be from 2005 to 2018 subject to data availability in study sites. Patients aged 50 years and older and hospitalised due to hip fracture during the study period will be included. The primary outcome will be expressed as the annual incidence of hip fracture. Secondary outcomes will be the pharmacological treatment rate and mortality within 12 months following initial hip fracture by year. For the primary outcome, crude and standardised incidence of hip fracture will be reported. Linear regression will be used to test for time trends in the annual incidence. For secondary outcomes, the crude mortality and standardised mortality incidence will be reported.
Ethics and dissemination: Each participating site will follow the relevant local ethics and regulatory frameworks for study approval. The results of the study will be submitted for peer-reviewed scientific publications and presented at scientific conferences.
- Published
- 2021
- Full Text
- View/download PDF
29. Ginsenoside compound K reduces the progression of Huntington's disease via the inhibition of oxidative stress and overactivation of the ATM/AMPK pathway
- Author
A-Ching Chao, Sheau-Long Lee, Wan-Tze Chen, Yu-Chieh Lee, Tz-Chuen Ju, Hsin-Min Wang, Wan-Han Hsu, Ding-I. Yang, Ting-Yu Lin, and Kuo-Feng Hua
- Subjects
Genetically modified mouse, Huntingtin, Chemistry, DNA damage, AMPK, Biochemistry, Genetics and Molecular Biology (miscellaneous), Complementary and alternative medicine, Huntington's disease, Cancer research, Phosphorylation, Protein kinase A, Oxidative stress, Biotechnology
- Abstract
Background: Huntington's disease (HD) is a neurodegenerative disorder caused by the expansion of a trinucleotide CAG repeat in the Huntingtin (Htt) gene. The major pathogenic pathways underlying HD involve the impairment of cellular energy homeostasis and DNA damage in the brain. The protein kinase ataxia-telangiectasia mutated (ATM) is an important regulator of the DNA damage response. ATM is involved in the phosphorylation of AMP-activated protein kinase (AMPK), suggesting that AMPK plays a critical role in response to DNA damage. Herein, we demonstrated that expression of polyQ-expanded mutant Htt (mHtt) enhanced the phosphorylation of ATM. Ginsenoside is the main and most effective component of Panax ginseng. However, the protective effect of a ginsenoside (compound K, CK) in HD remains unclear and warrants further investigation.
Methods: This study used the R6/2 transgenic mouse model of HD and performed behavioral tests, survival rate analyses, histological analyses, and immunoblot assays.
Results: The systematic administration of CK to R6/2 mice suppressed the activation of ATM/AMPK and reduced neuronal toxicity and mHtt aggregation. Most importantly, CK increased neuronal density and lifespan and improved motor dysfunction in R6/2 mice. CK also enhanced the expression of Bcl2 and protected striatal cells from the toxicity induced by the overactivation of mHtt and AMPK.
Conclusions: The oral administration of CK reduced disease progression and markedly enhanced lifespan in the R6/2 transgenic mouse model of HD.
- Published
- 2021
30. ACL Size and Notch Width Between ACLR and Healthy Individuals: A Pilot Study
- Author
David H. Perrin, Hsin-Min Wang, Robert A. Henson, Sandra J. Shultz, Scott E. Ross, and Randy J. Schmitz
- Subjects
Adolescent, Anterior Cruciate Ligament, Pilot Projects, Physical Therapy, Sports Therapy and Rehabilitation, Young Adult, Recurrence, Risk Factors, Notch width, Body Size, Humans, Femur, Anterior Cruciate Ligament Reconstruction, Anterior Cruciate Ligament Injuries, Orthopedics and Sports Medicine, Musculoskeletal system, Current Research, Magnetic Resonance Imaging, Cross-Sectional Studies, Healthy individuals, Female
- Abstract
Background: Given the relatively high risk of contralateral anterior cruciate ligament (ACL) injury in patients with ACL reconstruction (ACLR), there is a need to understand intrinsic risk factors that may contribute to contralateral injury.
Hypothesis: The ACLR group would have smaller ACL volume and a narrower femoral notch width than healthy individuals after accounting for relevant anthropometrics.
Study Design: Cross-sectional study.
Level of Evidence: Level 3.
Methods: Magnetic resonance imaging data of the left knee were obtained from uninjured (N = 11) and unilateral ACL-reconstructed (N = 10) active, female, collegiate-level recreational athletes. ACL volume was obtained from T2-weighted images. Femoral notch width and notch width index were measured from T1-weighted images. Independent-samples t tests examined differences in all measures between healthy and ACLR participants.
Results: The ACLR group had a smaller notch width index (0.22 ± 0.02 vs 0.25 ± 0.01; P = 0.004; effect size, 1.41) and ACL volume (25.6 ± 4.0 vs 32.6 ± 8.2 mm³/(kg·m); P = 0.025; effect size, 1.08) after normalizing by body size.
Conclusion: Only after normalizing for relevant anthropometrics, the contralateral ACLR limb had smaller ACL size and narrower relative femoral notch size than healthy individuals. These findings suggest that risk factor studies of ACL size and femoral notch size should account for relevant body size when determining their association with contralateral ACL injury.
Clinical Relevance: The present study shows that the methods used to identify intrinsic risk factors for contralateral ACL injury could be applied in future clinical screening settings.
- Published
- 2019
- Full Text
- View/download PDF
31. Sex Comparisons of In Vivo Anterior Cruciate Ligament Morphometry
- Author
Hsin-Min Wang, Robert A. Henson, Scott E. Ross, Randy J. Schmitz, Sandra J. Shultz, Robert A. Kraft, and David H. Perrin
- Subjects
Adult, Male, Anterior Cruciate Ligament, Physical Therapy, Sports Therapy and Rehabilitation, Body Mass Index, Sex Factors, In vivo, Humans, Knee, Orthopedics and Sports Medicine, Anthropometry, Anterior Cruciate Ligament Injuries, Magnetic Resonance Imaging, Organ Size, General Medicine, Anatomy, Musculoskeletal system, Cross-Sectional Studies, Female
- Abstract
Context: Females have consistently higher anterior cruciate ligament (ACL) injury rates than males. The reasons for this disparity are not fully understood. Whereas ACL morphometric characteristics are associated with injury risk and females have a smaller absolute ACL size, comprehensive sex comparisons that adequately account for sex differences in body mass index (BMI) have been limited.
Objective: To investigate sex differences among in vivo ACL morphometric measures before and after controlling for femoral notch width and BMI.
Design: Cross-sectional study.
Setting: Laboratory.
Patients or Other Participants: Twenty recreationally active men (age = 23.2 ± 2.9 years, height = 180.4 ± 6.7 cm, mass = 84.0 ± 10.9 kg) and 20 recreationally active women (age = 21.3 ± 2.3 years, height = 166.9 ± 7.7 cm, mass = 61.9 ± 7.2 kg) participated.
Main Outcome Measure(s): Structural magnetic resonance imaging sequences were performed on the left knee. Anterior cruciate ligament volume, width, and cross-sectional area measures were obtained from T2-weighted images and normalized to femoral notch width and BMI. Femoral notch width was measured from T1-weighted images. We used independent-samples t tests to examine sex differences in absolute and normalized measures.
Results: Men had greater absolute ACL volume (1712.2 ± 356.3 versus 1200.1 ± 337.8 mm³; t(38) = −4.67, P < .001) and ACL width (8.5 ± 2.3 versus 7.0 ± 1.2 mm; t(38) = −2.53, P = .02) than women. The ACL volume remained greater in men than in women after controlling for femoral notch width (89.31 ± 15.63 versus 72.42 ± 16.82 mm³/mm; t(38) = −3.29, P = .002) and BMI (67.13 ± 15.40 versus 54.69 ± 16.39 mm³/kg/m²; t(38) = −2.47, P = .02).
Conclusions: Whereas men had greater ACL volume and width than women, only ACL volume remained different when we accounted for femoral notch width and BMI. This suggests that ACL volume may be an appropriate measure of ACL anatomy in investigations of ACL morphometry and ACL injury risk that include sex comparisons. (A toy normalized-comparison sketch follows this entry.)
- Published
- 2019
- Full Text
- View/download PDF
32. Sequence to General Tree: Knowledge-Guided Geometry Word Problem Solving
- Author
-
Keh-Yih Su, Chao-Chun Liang, Shih-hung Tsai, and Hsin-Min Wang
- Subjects
FOS: Computer and information sciences ,Sequence ,Theoretical computer science ,Uninterpretable ,business.industry ,Computer science ,Computer Science - Artificial Intelligence ,Deep learning ,Binary number ,computer.file_format ,Tree (data structure) ,Artificial Intelligence (cs.AI) ,Domain knowledge ,Binary expression tree ,Executable ,Artificial intelligence ,business ,computer - Abstract
With recent advancements in deep learning, neural solvers have achieved promising results in solving math word problems. However, these state-of-the-art (SOTA) solvers only generate binary expression trees that contain basic arithmetic operators and do not explicitly use math formulas. As a result, the expression trees they produce are lengthy and uninterpretable because multiple operators and constants are needed to represent a single formula. In this paper, we propose sequence-to-general tree (S2G), which learns to generate interpretable and executable operation trees in which the nodes can be formulas with an arbitrary number of arguments. With nodes now allowed to be formulas, S2G can learn to incorporate mathematical domain knowledge into problem-solving, making the results more interpretable. Experiments show that S2G achieves better performance than strong baselines on problems that require domain knowledge., Comment: Accepted to ACL 2021
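To make the general-tree idea concrete, below is a minimal Python sketch of an executable operation tree whose nodes are formulas with arbitrary arity; the node and formula names are illustrative assumptions, not the paper's actual formula set or decoder.

    # Hypothetical formula table: nodes may be basic operators or domain formulas.
    FORMULAS = {
        "add": lambda *xs: sum(xs),                       # basic arithmetic operator
        "triangle_area": lambda base, h: 0.5 * base * h,  # domain-knowledge formula
    }

    class Node:
        def __init__(self, op, children=None, value=None):
            self.op, self.children, self.value = op, children or [], value

        def evaluate(self):
            if self.op == "const":           # leaf: a number copied from the problem
                return self.value
            args = [c.evaluate() for c in self.children]
            return FORMULAS[self.op](*args)  # formula nodes take any number of arguments

    # "Area of a triangle with base 6 and height 4" as one interpretable node,
    # instead of the longer binary tree (0.5 * 6) * 4:
    tree = Node("triangle_area", [Node("const", value=6.0), Node("const", value=4.0)])
    print(tree.evaluate())  # 12.0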
- Published
- 2021
33. Speech Recognition by Simply Fine-tuning BERT
- Author
-
Chia-Hua Wu, Hsin-Min Wang, Kuan-Yu Chen, Shang-Bao Luo, Tomoki Toda, and Wen-Chin Huang
- Subjects
FOS: Computer and information sciences ,Sequence ,Signal processing ,Sound (cs.SD) ,Computer Science - Computation and Language ,Computer science ,Speech recognition ,SIGNAL (programming language) ,Acoustic model ,Context (language use) ,Computer Science - Sound ,Data modeling ,Simple (abstract algebra) ,Audio and Speech Processing (eess.AS) ,FOS: Electrical engineering, electronic engineering, information engineering ,Language model ,Computation and Language (cs.CL) ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
We propose a simple method for automatic speech recognition (ASR) by fine-tuning BERT, which is a language model (LM) trained on large-scale unlabeled text data and can generate rich contextual representations. Our assumption is that given a history context sequence, a powerful LM can narrow the range of possible choices, and the speech signal can be used as a simple clue. Hence, compared with conventional ASR systems that train a powerful acoustic model (AM) from scratch, we believe that speech recognition is possible by simply fine-tuning a BERT model. As an initial study, we demonstrate the effectiveness of the proposed idea on the AISHELL dataset and show that stacking a very simple AM on top of BERT can yield reasonable performance., Accepted to ICASSP 2021
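As a rough illustration of the idea (not the paper's exact architecture), the PyTorch/Transformers sketch below stacks a one-projection acoustic model on a pretrained BERT; the feature dimension, additive fusion, and pooled per-position acoustic input are all assumptions.

    import torch
    import torch.nn as nn
    from transformers import BertModel

    class BertASR(nn.Module):
        def __init__(self, feat_dim=80, bert_name="bert-base-chinese"):
            super().__init__()
            self.bert = BertModel.from_pretrained(bert_name)  # pretrained LM
            hidden = self.bert.config.hidden_size
            self.am = nn.Linear(feat_dim, hidden)             # very simple AM: one projection
            self.out = nn.Linear(hidden, self.bert.config.vocab_size)

        def forward(self, history_ids, frame_feats):
            # history_ids: (B, T) decoded token history; frame_feats: (B, D) pooled
            # acoustic features for the next position (a simplifying assumption).
            h = self.bert(input_ids=history_ids).last_hidden_state[:, -1]  # LM context
            fused = h + self.am(frame_feats)  # speech acts as a clue added to the LM state
            return self.out(fused)            # logits over the next token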
- Published
- 2021
34. MoEVC: A Mixture of Experts Voice Conversion System With Sparse Gating Mechanism for Online Computation Acceleration
- Author
-
Tai-Shih Chi, Yu-Huai Peng, Yuan-Hong Yang, Hsin-Min Wang, Yu-Tao Chang, Yu Tsao, and Syu-Siang Wang
- Subjects
business.industry ,Computer science ,Deep learning ,Speech recognition ,Computation ,Latency (audio) ,020206 networking & telecommunications ,02 engineering and technology ,Gating ,FLOPS ,Convolution ,030507 speech-language pathology & audiology ,03 medical and health sciences ,Task (computing) ,Computer engineering ,Feature (computer vision) ,0202 electrical engineering, electronic engineering, information engineering ,Artificial intelligence ,0305 other medical science ,business - Abstract
Owing to the recent advancements in deep learning technology, the performance of voice conversion (VC) in terms of quality and similarity has significantly improved. However, complex computation is generally required for deep-learning-based VC systems. This can cause a notable latency, which limits the deployment of such VC systems in real-world applications. Therefore, increasing the efficiency of online computing has become an important task. In this study, we propose a novel mixture-of-experts (MoE) based VC system, termed MoEVC. The MoEVC system uses a gating mechanism to assign weights to feature maps to increase VC performance. In addition, applying sparse constraints on the gating mechanism can skip some convolution processes through elimination of redundant feature maps, thereby accelerating online computing. Experimental results show that by using proper sparse constraints, we can effectively reduce the FLOPs (floating-point operations) count by 70%, while improving VC performance in both objective evaluation and human subjective listening tests.
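A minimal sketch of the sparse gating idea, assuming channel-level gates with an L1 penalty; the actual MoEVC gating network and layer shapes may differ.

    import torch
    import torch.nn as nn

    class GatedConv(nn.Module):
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.conv = nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1)
            self.gate = nn.Parameter(torch.ones(out_ch))  # one gate per feature map

        def forward(self, x):
            g = torch.relu(self.gate)  # non-negative; exact zeros mark skippable channels
            return self.conv(x) * g.view(1, -1, 1)

        def sparsity_loss(self):
            return torch.relu(self.gate).sum()  # L1 constraint pushing gates to zero

    layer = GatedConv(64, 128)
    y = layer(torch.randn(2, 64, 100))
    loss = y.pow(2).mean() + 1e-3 * layer.sparsity_loss()  # task loss + sparse gating term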
- Published
- 2021
- Full Text
- View/download PDF
35. SVSNet: An End-to-end Speaker Voice Similarity Assessment Model
- Author
-
Cheng-Hung Hu, Yu-Huai Peng, Junichi Yamagishi, Yu Tsao, and Hsin-Min Wang
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Sound (cs.SD) ,Audio and Speech Processing (eess.AS) ,Applied Mathematics ,Signal Processing ,FOS: Electrical engineering, electronic engineering, information engineering ,Electrical and Electronic Engineering ,Computer Science - Sound ,Electrical Engineering and Systems Science - Audio and Speech Processing ,Machine Learning (cs.LG) - Abstract
Neural evaluation metrics derived for numerous speech generation tasks have recently attracted great attention. In this paper, we propose SVSNet, the first end-to-end neural network model to assess the speaker voice similarity between converted speech and natural speech for voice conversion tasks. Unlike most neural evaluation metrics that use hand-crafted features, SVSNet directly takes the raw waveform as input to more completely utilize speech information for prediction. SVSNet consists of encoder, co-attention, distance calculation, and prediction modules and is trained in an end-to-end manner. The experimental results on the Voice Conversion Challenge 2018 and 2020 (VCC2018 and VCC2020) datasets show that SVSNet outperforms well-known baseline systems in the assessment of speaker similarity at the utterance and system levels., Comment: To appear in IEEE Signal Processing Letters (SPL)
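The module flow can be sketched roughly as follows; every concrete layer choice here (conv encoder, multi-head attention standing in for co-attention, L1 distance) is an assumption for illustration, not SVSNet's published design.

    import torch
    import torch.nn as nn

    class SVSNetSketch(nn.Module):
        def __init__(self, hidden=256):
            super().__init__()
            self.encoder = nn.Conv1d(1, hidden, kernel_size=400, stride=160)  # raw waveform in
            self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
            self.pred = nn.Linear(hidden, 1)

        def forward(self, wav_a, wav_b):             # converted vs natural speech, (B, 1, T)
            a = self.encoder(wav_a).transpose(1, 2)  # (B, T', H)
            b = self.encoder(wav_b).transpose(1, 2)
            a2, _ = self.attn(a, b, b)               # co-attention: align a against b
            b2, _ = self.attn(b, a, a)
            d = (a2.mean(1) - b2.mean(1)).abs()      # distance between utterance embeddings
            return self.pred(d)                      # similarity score

    score = SVSNetSketch()(torch.randn(2, 1, 16000), torch.randn(2, 1, 16000))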
- Published
- 2021
- Full Text
- View/download PDF
36. AlloST: Low-resource Speech Translation without Source Transcription
- Author
-
Yao-Fei Cheng, Hung-Shin Lee, and Hsin-Min Wang
- Subjects
Byte pair encoding ,FOS: Computer and information sciences ,Sequence ,Computer Science - Machine Learning ,Computer Science - Computation and Language ,Computer science ,Computer Science - Artificial Intelligence ,Speech recognition ,Pronunciation ,Machine Learning (cs.LG) ,Multimedia (cs.MM) ,Artificial Intelligence (cs.AI) ,Phone ,Audio and Speech Processing (eess.AS) ,Speech translation ,FOS: Electrical engineering, electronic engineering, information engineering ,Transcription (software) ,Encoder ,Computation and Language (cs.CL) ,Word (computer architecture) ,Computer Science - Multimedia ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
The end-to-end architecture has made promising progress in speech translation (ST). However, the ST task is still challenging under low-resource conditions. Most ST models have shown unsatisfactory results, especially in the absence of word information from the source speech utterance. In this study, we survey methods to improve ST performance without using source transcription, and propose a learning framework that utilizes a language-independent universal phone recognizer. The framework is based on an attention-based sequence-to-sequence model, where the encoder generates the phonetic embeddings and phone-aware acoustic representations, and the decoder controls the fusion of the two embedding streams to produce the target token sequence. In addition to investigating different fusion strategies, we explore the specific usage of byte pair encoding (BPE), which compresses a phone sequence into a syllable-like segmented sequence. Due to the conversion of symbols, a segmented sequence represents not only pronunciation but also language-dependent information lacking in phones. Experiments conducted on the Fisher Spanish-English and Taigi-Mandarin drama corpora show that our method outperforms the conformer-based baseline, and the performance is close to that of the existing best method using source transcription., Comment: Accepted by Interspeech2021
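For illustration, syllable-like units can be learned from space-separated phone strings with SentencePiece BPE as below; the file name, vocabulary size, and the use of SentencePiece are assumptions for this sketch, not necessarily the paper's tooling.

    # Minimal BPE-over-phones sketch (Python, sentencepiece). phones.txt is assumed to
    # contain one utterance per line with phones separated by spaces.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        input="phones.txt", model_prefix="phone_bpe",
        vocab_size=500, model_type="bpe",
    )

    sp = spm.SentencePieceProcessor(model_file="phone_bpe.model")
    # Frequent phone n-grams are merged into syllable-like segments.
    print(sp.encode("t a i g i", out_type=str))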
- Published
- 2021
- Full Text
- View/download PDF
37. A Preliminary Study of a Two-Stage Paradigm for Preserving Speaker Identity in Dysarthric Voice Conversion
- Author
-
Hsin-Min Wang, Ching-Feng Liu, Kazuhiro Kobayashi, Yu-Huai Peng, Yu Tsao, Wen-Chin Huang, and Tomoki Toda
- Subjects
FOS: Computer and information sciences ,Sound (cs.SD) ,Computer Science - Computation and Language ,Computer science ,Speech recognition ,media_common.quotation_subject ,Dysarthric speech ,Autoencoder ,Computer Science - Sound ,Poor quality ,Identity (music) ,Dysarthria ,Audio and Speech Processing (eess.AS) ,medicine ,FOS: Electrical engineering, electronic engineering, information engineering ,Quality (business) ,medicine.symptom ,Normal speech ,Computation and Language (cs.CL) ,Electrical Engineering and Systems Science - Audio and Speech Processing ,media_common - Abstract
We propose a new paradigm for maintaining speaker identity in dysarthric voice conversion (DVC). The poor quality of dysarthric speech can be greatly improved by statistical VC, but because the normal speech utterances of a dysarthria patient are nearly impossible to collect, previous work failed to recover the individuality of the patient. In light of this, we suggest a novel two-stage approach for DVC, which is highly flexible in that no normal speech of the patient is required. First, a powerful parallel sequence-to-sequence model converts the input dysarthric speech into the normal speech of a reference speaker as an intermediate product; then a nonparallel, frame-wise VC model realized with a variational autoencoder converts the speaker identity of the reference speech back to that of the patient, and is assumed to preserve the enhanced quality. We investigate several design options. Experimental evaluation results demonstrate the potential of our approach to improving the quality of dysarthric speech while maintaining the speaker identity., Comment: Accepted to Interspeech 2021. 5 pages, 3 figures, 1 table
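The two-stage flow can be summarized in a few lines of Python; the callables below are placeholders for the paper's sequence-to-sequence model and VAE-based VC model, not real APIs.

    def two_stage_dvc(dysarthric_wav, seq2seq_vc, vae_vc, reference_speaker, patient_speaker):
        # Stage 1: convert dysarthric speech into normal speech of a reference speaker.
        intermediate = seq2seq_vc(dysarthric_wav, target=reference_speaker)
        # Stage 2: a nonparallel, frame-wise VAE-based VC converts the speaker identity
        # back to the patient while (assumed) preserving the enhanced quality.
        return vae_vc(intermediate, target=patient_speaker)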
- Published
- 2021
- Full Text
- View/download PDF
38. SurpriseNet: Melody Harmonization Conditioning on User-controlled Surprise Contours
- Author
-
Yi-Wei Chen, Hung-Shin Lee, Yen-Hsing Chen, and Hsin-Min Wang
- Subjects
FOS: Computer and information sciences ,Sound (cs.SD) ,Audio and Speech Processing (eess.AS) ,FOS: Electrical engineering, electronic engineering, information engineering ,Computer Science - Sound ,Computer Science - Multimedia ,Electrical Engineering and Systems Science - Audio and Speech Processing ,Multimedia (cs.MM) - Abstract
The surprisingness of a song is an essential and seemingly subjective factor in determining whether the listener likes it. With the help of information theory, it can be described as the transition probability of a music sequence modeled as a Markov chain. In this study, we introduce the concept of deriving entropy variations over time, so that the surprise contour of each chord sequence can be extracted. Based on this, we propose a user-controllable framework that uses a conditional variational autoencoder (CVAE) to harmonize the melody based on the given chord surprise indication. Through explicit conditions, the model can randomly generate varied and harmonic chord progressions for a melody, and Spearman's correlation and its significance (p-value) show that the resulting chord progressions match the given surprise contour quite well. The vanilla CVAE model was evaluated in a basic melody harmonization task (no surprise control) in terms of six objective metrics. The results of experiments on the Hooktheory Lead Sheet Dataset show that our model achieves performance comparable to the state-of-the-art melody harmonization model., Comment: Proceedings of the 22nd International Society for Music Information Retrieval Conference, ISMIR 2021
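The surprise contour can be made concrete with a toy first-order Markov chain over chords, scoring each transition by its surprisal -log2 P(c_t | c_(t-1)); the transition probabilities below are invented for illustration.

    import math

    trans = {  # P(next chord | current chord), toy numbers
        "C":  {"G": 0.6, "Am": 0.3, "F#": 0.1},
        "G":  {"C": 0.7, "Am": 0.2, "F#": 0.1},
        "Am": {"F#": 0.5, "C": 0.3, "G": 0.2},
        "F#": {"C": 0.5, "G": 0.5},
    }

    def surprise_contour(chords):
        # Surprisal of each transition; rare moves score high.
        return [-math.log2(trans[a][b]) for a, b in zip(chords, chords[1:])]

    print(surprise_contour(["C", "G", "C", "Am", "F#"]))
    # The low-probability C -> Am transition yields the largest surprise value.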
- Published
- 2021
- Full Text
- View/download PDF
39. Dual-Path Filter Network: Speaker-Aware Modeling for Speech Separation
- Author
-
Hsin-Min Wang, Yu-Huai Peng, Hung-Shin Lee, and Fan-Lin Wang
- Subjects
Network speaker ,Computer science ,Filter (video) ,Audio and Speech Processing (eess.AS) ,Speech recognition ,Path (graph theory) ,Source separation ,FOS: Electrical engineering, electronic engineering, information engineering ,Time domain ,Cocktail party effect ,Electrical Engineering and Systems Science - Audio and Speech Processing ,Domain (software engineering) ,Dual (category theory) - Abstract
Speech separation has been extensively studied in recent years to deal with the cocktail party problem. Related approaches can be divided into two categories: time-frequency domain methods and time domain methods. In addition, some methods try to generate speaker vectors to support source separation. In this study, we propose a new model called dual-path filter network (DPFN). Our model focuses on the post-processing of speech separation to improve speech separation performance. DPFN is composed of two parts: the speaker module and the separation module. First, the speaker module infers the identities of the speakers. Then, the separation module uses the speakers' information to extract the voices of individual speakers from the mixture. DPFN, which is constructed based on DPRNN-TasNet, is not only superior to DPRNN-TasNet but also avoids the need for permutation-invariant training (PIT)., Comment: Accepted by Interspeech2021
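The post-processing flow can be sketched as follows; the callables are placeholders standing in for DPRNN-TasNet and DPFN's two modules, not actual implementations.

    def dpfn(mixture, initial_separation, speaker_module, separation_module):
        rough_sources = initial_separation(mixture)                # e.g., DPRNN-TasNet outputs
        speaker_vecs = [speaker_module(s) for s in rough_sources]  # infer identities first
        # Extraction conditioned on each speaker vector sidesteps the permutation
        # ambiguity that PIT is normally needed for.
        return [separation_module(mixture, v) for v in speaker_vecs]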
- Published
- 2021
- Full Text
- View/download PDF
40. Speech Enhancement with Zero-Shot Model Selection
- Author
-
Ryandhimas E. Zezario, Chiou-Shann Fuh, Hsin-Min Wang, and Yu Tsao
- Subjects
FOS: Computer and information sciences ,Sound (cs.SD) ,Computer Science - Machine Learning ,Audio and Speech Processing (eess.AS) ,FOS: Electrical engineering, electronic engineering, information engineering ,Computer Science - Sound ,Machine Learning (cs.LG) ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Recent research on speech enhancement (SE) has seen the emergence of deep-learning-based methods. It is still a challenging task to determine the effective ways to increase the generalizability of SE under diverse test conditions. In this study, we combine zero-shot learning and ensemble learning to propose a zero-shot model selection (ZMOS) approach to increase the generalization of SE performance. The proposed approach is realized in the offline and online phases. The offline phase clusters the entire set of training data into multiple subsets and trains a specialized SE model (termed component SE model) with each subset. The online phase selects the most suitable component SE model to perform the enhancement. Furthermore, two selection strategies were developed: selection based on the quality score (QS) and selection based on the quality embedding (QE). Both QS and QE were obtained using a Quality-Net, a non-intrusive quality assessment network. Experimental results confirmed that the proposed ZMOS approach can achieve better performance in both seen and unseen noise types compared to the baseline systems and other model selection systems, which indicates the effectiveness of the proposed approach in providing robust SE performance., Accepted in EUSIPCO 2021
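A minimal sketch of the offline/online split using k-means (scikit-learn); per-cluster model training is abstracted away, and in the paper the embeddings come from Quality-Net rather than the random vectors used here.

    import numpy as np
    from sklearn.cluster import KMeans

    def offline_phase(train_embs, n_components=4):
        # Cluster the training set; one specialized (component) SE model
        # would then be trained per cluster.
        return KMeans(n_clusters=n_components, n_init=10).fit(train_embs)

    def online_select(km, test_emb):
        # Pick the component SE model whose training cluster best matches the test clip.
        return int(km.predict(test_emb.reshape(1, -1))[0])

    km = offline_phase(np.random.randn(1000, 128))
    print("use component SE model #%d" % online_select(km, np.random.randn(128)))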
- Published
- 2020
41. Quadriceps muscle volume positively contributes to ACL volume
- Author
-
Hsin-Min Wang, Sandra J. Shultz, Jeffrey D. Labban, Anthony S. Kulas, and Randy J. Schmitz
- Subjects
Male ,medicine.medical_specialty ,Knee Joint ,Anterior cruciate ligament ,Thigh ,Muscle mass ,Quadriceps Muscle ,Risk Factors ,Internal medicine ,medicine ,Humans ,Orthopedics and Sports Medicine ,Clinical significance ,Femur ,Anterior Cruciate Ligament ,business.industry ,Anterior Cruciate Ligament Injuries ,Quadriceps muscle ,medicine.disease ,ACL injury ,Magnetic Resonance Imaging ,medicine.anatomical_structure ,Cardiology ,Female ,business ,Body mass index ,Hamstring - Abstract
Females have smaller anterior cruciate ligaments (ACLs) than males and smaller ACLs have been associated with a greater risk of ACL injury. Overall body dimensions do not adequately explain these sex differences. This study examined the extent to which quadriceps muscle volume (VOLQUAD) positively predicts ACL volume (VOLACL) once sex and other body dimensions were accounted for. Physically active males (N = 10) and females (N = 10) were measured for height, weight, and body mass index (BMI). Three-Tesla magnetic resonance images of their dominant and nondominant thigh and knee were then obtained to measure VOLACL, quadriceps, and hamstring muscle volumes, femoral notch width, and femoral notch width index. Separate three-step regressions estimated associations between VOLQUAD and VOLACL (third step), after controlling for sex (first step) and one body dimension (second step). When controlling for sex and sex plus BMI, VOLHAM, notch width, or notch width index, VOLQUAD consistently exhibited a positive association with VOLACL in the dominant leg, nondominant leg, and leg-averaged models (p < 0.05). Findings were inconsistent when controlling for sex and height (p = 0.038-0.102). Once VOLQUAD was included, only notch width and notch width index retained a statistically significant individual association with VOLACL (p < 0.01). Statement of Clinical Significance: The positive association between VOLQUAD and VOLACL suggests ACL size may in part be modifiable. Future studies are needed to determine the extent to which an appropriate training stimulus (focused on optimizing overall lower extremity muscle mass development) can positively impact ACL size and structure in young females.
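For readers who want the analysis pattern, here is a toy three-step regression with statsmodels; the data are simulated and the column names are assumptions, not the study's measurements.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "sex": rng.integers(0, 2, 40),          # step 1 predictor
        "bmi": rng.normal(24, 3, 40),           # step 2: one body dimension
        "vol_quad": rng.normal(1800, 250, 40),  # step 3 predictor of interest
    })
    df["vol_acl"] = 600 + 120 * df["sex"] + 0.3 * df["vol_quad"] + rng.normal(0, 50, 40)

    for cols in (["sex"], ["sex", "bmi"], ["sex", "bmi", "vol_quad"]):
        fit = sm.OLS(df["vol_acl"], sm.add_constant(df[cols])).fit()
        print(cols, "R2 = %.3f" % fit.rsquared)  # R2 gain at step 3 reflects VOLQUAD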
- Published
- 2020
42. Improving the Intelligibility of Speech for Simulated Electric and Acoustic Stimulation Using Fully Convolutional Neural Networks
- Author
-
Natalie Yu-Hsien Wang, Yu Tsao, Tao-Wei Wang, Hsin-Min Wang, Xugan Lu, Szu-Wei Fu, and Hsiao Lan Sharon Wang
- Subjects
Computer science ,Speech recognition ,Noise reduction ,medicine.medical_treatment ,Biomedical Engineering ,Intelligibility (communication) ,Convolutional neural network ,030507 speech-language pathology & audiology ,03 medical and health sciences ,0302 clinical medicine ,Cochlear implant ,Internal Medicine ,medicine ,Humans ,030223 otorhinolaryngology ,Noise measurement ,business.industry ,General Neuroscience ,Deep learning ,Rehabilitation ,Speech Intelligibility ,Electric Stimulation ,Speech enhancement ,Cochlear Implants ,Acoustic Stimulation ,QUIET ,Speech Perception ,Artificial intelligence ,Neural Networks, Computer ,0305 other medical science ,business - Abstract
Combined electric and acoustic stimulation (EAS) has demonstrated better speech recognition than a conventional cochlear implant (CI) and yielded satisfactory performance under quiet conditions. However, when noise signals are involved, both the electric signal and the acoustic signal may be distorted, thereby resulting in poor recognition performance. To suppress noise effects, speech enhancement (SE) is a necessary unit in EAS devices. Recently, a time-domain speech enhancement algorithm based on fully convolutional neural networks (FCN) with a short-time objective intelligibility (STOI)-based objective function (termed FCN(S) in short) has received increasing attention due to its simple structure and effectiveness in restoring clean speech signals from noisy counterparts. With evidence showing the benefits of FCN(S) for normal speech, this study sets out to assess its ability to improve the intelligibility of EAS-simulated speech. Objective evaluations and listening tests were conducted to examine the performance of FCN(S) in improving the speech intelligibility of normal and vocoded speech in noisy environments. The experimental results show that, compared with the traditional minimum mean-square error SE method and the deep denoising autoencoder SE method, FCN(S) can obtain a better gain in speech intelligibility for normal as well as vocoded speech. This study, being the first to evaluate deep learning SE approaches for EAS, confirms that FCN(S) is an effective SE approach that may potentially be integrated into an EAS processor to benefit users in noisy environments.
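A minimal sketch of a time-domain fully convolutional SE network in PyTorch; depth, widths, and kernel sizes are assumptions, and the STOI-based objective is not reproduced here.

    import torch
    import torch.nn as nn

    fcn = nn.Sequential(                      # waveform in, waveform out; no dense layers
        nn.Conv1d(1, 64, 55, padding=27), nn.LeakyReLU(),
        nn.Conv1d(64, 64, 55, padding=27), nn.LeakyReLU(),
        nn.Conv1d(64, 1, 55, padding=27), nn.Tanh(),
    )
    enhanced = fcn(torch.randn(1, 1, 16000))  # same length as the noisy input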
- Published
- 2020
43. Using Taigi Dramas with Mandarin Chinese Subtitles to Improve Taigi Speech Recognition
- Author
-
Hsin-Min Wang, Ming-Tat Ko, Chia-Hua Wu, Pin-Yuan Chen, Shao-Kang Tsao, and Hung-Shin Lee
- Subjects
Training set ,Computer science ,Speech recognition ,language ,Acoustic model ,Subtitle ,Written language ,Language model ,Chinese characters ,Mandarin Chinese ,language.human_language ,Spoken language - Abstract
An obvious problem with automatic speech recognition (ASR) for Taigi is that the amount of training data is far from enough to build a practical ASR system. Collecting speech data with reliable transcripts for training the acoustic model (AM) is feasible but expensive. Moreover, text data for language model (LM) training is extremely scarce and difficult to collect because Taigi is a spoken language, not a commonly written one. Interestingly, the subtitles of Taigi dramas in Taiwan have long been written in Mandarin Chinese characters. Since a large number of Taigi drama episodes with Mandarin Chinese subtitles are available on YouTube, we propose a method to augment the training data for the AM and LM of Taigi ASR. The idea is to use an initial Taigi ASR system to convert a Mandarin Chinese subtitle into the most likely Taigi word sequence by referring to the speech, as sketched below. Experimental results show that our ASR system can be remarkably improved by such training data augmentation.
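A rough sketch of the data augmentation loop; the helper functions and the scoring interface are hypothetical placeholders for the paper's subtitle-to-Taigi conversion.

    def augment_training_data(episodes, initial_asr, subtitle_to_taigi_candidates):
        augmented = []
        for speech, mandarin_subtitle in episodes:
            # Enumerate plausible Taigi word sequences for the Mandarin subtitle.
            candidates = subtitle_to_taigi_candidates(mandarin_subtitle)
            # Keep the candidate the initial ASR system finds most likely given the audio.
            best = max(candidates, key=lambda words: initial_asr.score(speech, words))
            augmented.append((speech, best))
        return augmented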
- Published
- 2020
- Full Text
- View/download PDF
44. Oriental COCOSDA – Country Report 2020 Language Resources Developed in Taiwan
- Author
-
Hsin-Min Wang and Sin-Horng Chen
- Subjects
Speech enhancement ,History ,Task analysis ,Pragmatics ,Linguistics - Published
- 2020
- Full Text
- View/download PDF
45. ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech
- Author
-
Hirokazu Kameoka, Hsin-Te Hwang, Driss Matrouf, Markus Becker, Quan Wang, Sahidullah, Ye Jia, Yu Zhang, Lauri Juvela, Hsin-Min Wang, Wen-Chin Huang, Zhen-Hua Ling, Yuan Jiang, Yi-Chiao Wu, Héctor Delgado, Massimiliano Todisco, Yu Tsao, Li-Juan Liu, Junichi Yamagishi, Jean-François Bonastre, Tomoki Toda, Nicholas Evans, Robert A. J. Clark, Kai Onuma, Yu-Huai Peng, Sébastien Le Maguer, Avashna Govender, Takashi Kaneda, Andreas Nautsch, Kong Aik Lee, Xin Wang, Srikanth Ronanki, Ville Vestman, Koji Mushika, Ingmar Steiner, Tomi Kinnunen, Fergus Henderson, Jing-Xuan Zhang, Kou Tanaka, and Paavo Alku
- Subjects
Signal Processing (eess.SP) ,FOS: Computer and information sciences ,Sound (cs.SD) ,ASVspoof challenge ,biometrics ,Computer Science - Cryptography and Security ,voice conversion ,Computer science ,Speech synthesis ,02 engineering and technology ,computer.software_genre ,01 natural sciences ,Computer Science - Sound ,[INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI] ,text-to-speech synthesis ,[INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing ,Audio and Speech Processing (eess.AS) ,0202 electrical engineering, electronic engineering, information engineering ,Replay ,Use case ,media forensics ,010301 acoustics ,Protocol (object-oriented programming) ,Text-to-speech synthesis ,Database ,presentation attack ,ComputingMilieux_MANAGEMENTOFCOMPUTINGANDINFORMATIONSYSTEMS ,Automatic speaker verification ,Cryptography and Security (cs.CR) ,[SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing ,Electrical Engineering and Systems Science - Audio and Speech Processing ,automatic speaker verification ,Voice conversion ,Spoofing attack ,Biometrics ,anti-spoofing ,Reliability (computer networking) ,Database design ,Theoretical Computer Science ,replay ,presentation attack detection ,0103 physical sciences ,FOS: Electrical engineering, electronic engineering, information engineering ,Electrical Engineering and Systems Science - Signal Processing ,[SPI.ACOU]Engineering Sciences [physics]/Acoustics [physics.class-ph] ,ComputerSystemsOrganization_COMPUTER-COMMUNICATIONNETWORKS ,[INFO.INFO-CV]Computer Science [cs]/Computer Vision and Pattern Recognition [cs.CV] ,020206 networking & telecommunications ,Human-Computer Interaction ,Physical access ,computer ,countermeasure ,Software - Abstract
Automatic speaker verification (ASV) is one of the most natural and convenient means of biometric person recognition. Unfortunately, just like all other biometric systems, ASV is vulnerable to spoofing, also referred to as "presentation attacks." These vulnerabilities are generally unacceptable and call for spoofing countermeasures or "presentation attack detection" systems. In addition to impersonation, ASV systems are vulnerable to replay, speech synthesis, and voice conversion attacks. The ASVspoof 2019 edition is the first to consider all three spoofing attack types within a single challenge. While they originate from the same source database and same underlying protocol, they are explored in two specific use case scenarios. Spoofing attacks within a logical access (LA) scenario are generated with the latest speech synthesis and voice conversion technologies, including state-of-the-art neural acoustic and waveform model techniques. Replay spoofing attacks within a physical access (PA) scenario are generated through carefully controlled simulations that support much more revealing analysis than possible previously. Also new to the 2019 edition is the use of the tandem detection cost function metric, which reflects the impact of spoofing and countermeasures on the reliability of a fixed ASV system. This paper describes the database design, protocol, spoofing attack implementations, and baseline ASV and countermeasure results. It also describes a human assessment on spoofed data in logical access. It was demonstrated that the spoofing data in the ASVspoof 2019 database have varied degrees of perceived quality and similarity to the target speakers, including spoofed data that cannot be differentiated from bona-fide utterances even by human subjects., Accepted, Computer Speech and Language. This manuscript version is made available under the CC-BY-NC-ND 4.0. For the published version on Elsevier website, please visit https://doi.org/10.1016/j.csl.2020.101114
- Published
- 2020
- Full Text
- View/download PDF
46. SERIL: Noise Adaptive Speech Enhancement Using Regularization-Based Incremental Learning
- Author
-
Yu-Chen Lin, Chi-Chang Lee, Hsin-Min Wang, Hsuan-Tien Lin, and Yu Tsao
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Forgetting ,Training set ,Computer science ,business.industry ,Deep learning ,Adaptation strategies ,Machine learning ,computer.software_genre ,Regularization (mathematics) ,Machine Learning (cs.LG) ,Speech enhancement ,Audio and Speech Processing (eess.AS) ,Digital storage ,Incremental learning ,FOS: Electrical engineering, electronic engineering, information engineering ,Artificial intelligence ,business ,computer ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Numerous noise adaptation techniques have been proposed to fine-tune deep-learning models in speech enhancement (SE) for mismatched noise environments. Nevertheless, adaptation to a new environment may lead to catastrophic forgetting of the previously learned environments. The catastrophic forgetting issue degrades the performance of SE in real-world embedded devices, which often revisit previous noise environments. The nature of embedded devices does not allow solving the issue with additional storage of all pre-trained models or earlier training data. In this paper, we propose a regularization-based incremental learning SE (SERIL) strategy, complementing existing noise adaptation strategies without using additional storage. With a regularization constraint, the parameters are updated to the new noise environment while retaining the knowledge of the previous noise environments. The experimental results show that, when faced with a new noise domain, the SERIL model outperforms the unadapted SE model. Meanwhile, compared with the current adaptive technique based on fine-tuning, the SERIL model can reduce the forgetting of previous noise environments by 52%. The results verify that the SERIL model can effectively adjust itself to new noise environments while overcoming the catastrophic forgetting issue. The results make SERIL a favorable choice for real-world SE applications, where the noise environment changes frequently., Accepted to Interspeech 2020
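A minimal sketch of the regularization-based update, written as an EWC-style quadratic penalty in PyTorch; the importance weights and lambda value are assumptions about SERIL's exact recipe.

    import torch

    def seril_loss(se_loss, model, prev_params, importance, lam=0.1):
        # se_loss: enhancement loss on the new noise domain.
        # prev_params / importance: {name: tensor} snapshots from the previous environment.
        penalty = sum(
            (importance[n] * (p - prev_params[n]).pow(2)).sum()
            for n, p in model.named_parameters()
        )
        # Large-importance parameters are anchored, retaining old-domain knowledge.
        return se_loss + lam * penalty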
- Published
- 2020
- Full Text
- View/download PDF
47. Combining Deep Embeddings of Acoustic and Articulatory Features for Speaker Identification
- Author
-
Qian-Bei Hong, Chung-Hsien Wu, Chien-Lin Huang, and Hsin-Min Wang
- Subjects
Speech production ,Artificial neural network ,Computer science ,Speech recognition ,Feature vector ,02 engineering and technology ,Convolutional neural network ,Signal ,030507 speech-language pathology & audiology ,03 medical and health sciences ,Feature (computer vision) ,Multilayer perceptron ,0202 electrical engineering, electronic engineering, information engineering ,Embedding ,020201 artificial intelligence & image processing ,0305 other medical science - Abstract
In this study, deep embeddings of acoustic and articulatory features are combined for speaker identification. First, a convolutional neural network (CNN)-based universal background model (UBM) is constructed to generate acoustic feature (AC) embedding. In addition, as the articulatory features (AFs) represent some important phonological properties during speech production, a multilayer perceptron (MLP)-based AF embedding extraction model is also constructed. The extracted AC and AF embeddings are concatenated as a combined feature vector for speaker identification using a fully connected neural network. The proposed system was evaluated on three corpora (King-ASR, LibriSpeech, and SITW), and the experiments were conducted according to the properties of each dataset. We adopted all three corpora to evaluate the effect of AF embedding, and the results showed that combining AF embedding into the input feature vector improved the performance of speaker identification. The LibriSpeech corpus was used to evaluate the effect of the number of enrolled speakers; the proposed system achieved an EER of 7.80%, outperforming the method based on x-vector with PLDA (8.25%). We further evaluated the effect of signal mismatch using the SITW corpus, on which the proposed system achieved an EER of 25.19%, outperforming the other baseline methods.
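A minimal PyTorch sketch of the embedding concatenation and fully connected classifier; all dimensions are assumptions for illustration.

    import torch
    import torch.nn as nn

    class CombinedSpeakerID(nn.Module):
        def __init__(self, ac_dim=512, af_dim=128, n_speakers=1000):
            super().__init__()
            self.classifier = nn.Sequential(
                nn.Linear(ac_dim + af_dim, 512), nn.ReLU(),
                nn.Linear(512, n_speakers),
            )

        def forward(self, ac_emb, af_emb):
            x = torch.cat([ac_emb, af_emb], dim=-1)  # combined feature vector
            return self.classifier(x)                # speaker posteriors

    logits = CombinedSpeakerID()(torch.randn(4, 512), torch.randn(4, 128))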
- Published
- 2020
- Full Text
- View/download PDF
48. Self-Supervised Denoising Autoencoder with Linear Regression Decoder for Speech Enhancement
- Author
-
Xugang Lu, Yu Tsao, Tassadaq Hussain, Hsin-Min Wang, and Ryandhimas E. Zezario
- Subjects
Computer Science::Machine Learning ,Speech enhancement ,Nonlinear system ,Denoising autoencoder ,Noise ,ComputingMethodologies_PATTERNRECOGNITION ,Training set ,Computer science ,Speech recognition ,Linear regression ,Supervised learning ,Unsupervised learning ,Point (geometry) - Abstract
Nonlinear spectral mapping models based on supervised learning have been successfully applied to speech enhancement. However, as supervised learning approaches, they require a large amount of labelled data (noisy-clean speech pairs) for training, and their performance in unseen noisy conditions is not guaranteed, a common weak point of supervised learning. In this study, we propose an unsupervised learning approach for speech enhancement: a denoising autoencoder with a linear regression decoder (DAELD). The DAELD is trained with noisy speech as both input and target output, in a self-supervised manner. In addition, by properly setting a shrinkage threshold on the internal hidden representations, noise can be removed during reconstruction from the hidden representations via the linear regression decoder. Speech enhancement experiments were carried out to test the proposed model. The results confirmed that DAELD achieves comparable, and sometimes even better, enhancement performance than conventional supervised speech enhancement approaches in both seen and unseen noise environments. Moreover, we observe that DAELD tends to achieve higher performance when the training data cover more diverse noise types and signal-to-noise ratio (SNR) levels.
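A minimal PyTorch sketch of a DAELD-style model: noisy speech autoencoded through a soft-threshold (shrinkage) hidden layer and a linear regression decoder; the dimensions and threshold value are assumptions.

    import torch
    import torch.nn as nn

    class DAELD(nn.Module):
        def __init__(self, dim=257, hidden=1024, tau=0.1):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
            self.dec = nn.Linear(hidden, dim)   # linear regression decoder
            self.tau = tau                      # shrinkage threshold

        def forward(self, noisy):
            h = self.enc(noisy)
            # Soft thresholding: small activations (assumed noise-dominated) are zeroed.
            h = torch.sign(h) * torch.clamp(h.abs() - self.tau, min=0.0)
            return self.dec(h)

    model = DAELD()
    enhanced = model(torch.randn(8, 257))  # trained with noisy speech as input and target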
- Published
- 2020
- Full Text
- View/download PDF
49. Statistics Pooling Time Delay Neural Network Based on X-Vector for Speaker Verification
- Author
-
Chien-Lin Huang, Qian-Bei Hong, Hsin-Min Wang, and Chung-Hsien Wu
- Subjects
Structure (mathematical logic) ,0209 industrial biotechnology ,Computer science ,Time delay neural network ,Feature vector ,Pooling ,02 engineering and technology ,020901 industrial engineering & automation ,Transformation (function) ,Statistics ,0202 electrical engineering, electronic engineering, information engineering ,Embedding ,020201 artificial intelligence & image processing ,Representation (mathematics) - Abstract
This paper aims to improve speaker embedding representation based on the x-vector for extracting more detailed information for speaker verification. We propose a statistics pooling time delay neural network (TDNN), in which the TDNN structure integrates statistics pooling at each layer, to consider the variation of temporal context in the frame-level transformation. The proposed feature vectors, named stats-vectors, are compared with baseline x-vector features on the VoxCeleb dataset and the Speakers in the Wild (SITW) dataset for speaker verification. The experimental results show that the proposed stats-vector with score fusion achieved the best performance on the VoxCeleb1 dataset. Furthermore, considering the interference from other speakers in the recordings, we found that the proposed stats-vector effectively reduced the interference and improved speaker verification performance on the SITW dataset.
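A minimal PyTorch sketch of statistics pooling inside a TDNN layer, where per-utterance mean and standard deviation are appended to every frame; layer sizes are assumptions, not the paper's configuration.

    import torch
    import torch.nn as nn

    class StatsPoolTDNNLayer(nn.Module):
        def __init__(self, in_dim, out_dim, dilation=1):
            super().__init__()
            # The frame-level transform sees each frame plus utterance-level statistics.
            self.tdnn = nn.Conv1d(in_dim * 3, out_dim, kernel_size=3,
                                  dilation=dilation, padding=dilation)

        def forward(self, x):                      # x: (batch, dim, time)
            mu = x.mean(dim=2, keepdim=True).expand_as(x)
            sd = x.std(dim=2, keepdim=True).expand_as(x)
            return torch.relu(self.tdnn(torch.cat([x, mu, sd], dim=1)))

    y = StatsPoolTDNNLayer(30, 512)(torch.randn(2, 30, 200))  # -> (2, 512, 200)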
- Published
- 2020
- Full Text
- View/download PDF
50. Lite Audio-Visual Speech Enhancement
- Author
-
Chen-Chou Lo, Shang-Yi Chuang, Hsin-Min Wang, Yu Tsao, Meng, Helen, Xu, Bo, and Zheng, Thomas Fang
- Subjects
FOS: Computer and information sciences ,Sound (cs.SD) ,Computer Science - Computation and Language ,Computer science ,Speech recognition ,Noise reduction ,Feature extraction ,Model parameters ,Online computation ,Computer Science - Sound ,Speech enhancement ,030507 speech-language pathology & audiology ,03 medical and health sciences ,Audio and Speech Processing (eess.AS) ,Face (geometry) ,Audio visual ,FOS: Electrical engineering, electronic engineering, information engineering ,0305 other medical science ,Computation and Language (cs.CL) ,Data compression ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Previous studies have confirmed the effectiveness of incorporating visual information into speech enhancement (SE) systems. Despite improved denoising performance, two problems may be encountered when implementing an audio-visual SE (AVSE) system: (1) additional processing costs are incurred to incorporate visual input and (2) the use of face or lip images may cause privacy problems. In this study, we propose a Lite AVSE (LAVSE) system to address these problems. The system includes two visual data compression techniques and removes the visual feature extraction network from the training model, yielding better online computation efficiency. Our experimental results indicate that the proposed LAVSE system can provide notably better performance than an audio-only SE system with a similar number of model parameters. In addition, the experimental results confirm the effectiveness of the two techniques for visual data compression., Comment: Accepted to Interspeech 2020
- Published
- 2020
- Full Text
- View/download PDF