290 results on "Hsin-Min Wang"
Search Results
2. Generalization Ability Improvement of Speaker Representation and Anti-Interference for Speaker Verification
- Author
- Qian-Bei Hong, Chung-Hsien Wu, and Hsin-Min Wang
- Subjects
- Computational Mathematics, Acoustics and Ultrasonics, Computer Science (miscellaneous), Electrical and Electronic Engineering
- Published
- 2023
3. Deep Learning-Based Non-Intrusive Multi-Objective Speech Assessment Model With Cross-Domain Features
- Author
- Chiou-Shann Fuh, Szu-Wei Fu, Ryandhimas Zezario, Hsin-Min Wang, Yu Tsao, and Fei Chen
- Subjects
- FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Computational Mathematics, Acoustics and Ultrasonics, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science (miscellaneous), Electrical and Electronic Engineering, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Machine Learning (cs.LG)
- Abstract
In this study, we propose a cross-domain multi-objective speech assessment model called MOSA-Net, which can estimate multiple speech assessment metrics simultaneously. Experimental results show that MOSA-Net can improve the linear correlation coefficient (LCC) by 0.026 (0.990 vs 0.964 in seen noise environments) and 0.012 (0.969 vs 0.957 in unseen noise environments) in PESQ prediction, compared to Quality-Net, an existing single-task model for PESQ prediction, and improve LCC by 0.021 (0.985 vs 0.964 in seen noise environments) and 0.047 (0.836 vs 0.789 in unseen noise environments) in STOI prediction, compared to STOI-Net (based on CRNN), an existing single-task model for STOI prediction. Moreover, MOSA-Net, originally trained to assess objective scores, can be used as a pre-trained model to be effectively adapted to an assessment model for predicting subjective quality and intelligibility scores with a limited amount of training data. Experimental results show that MOSA-Net can improve LCC by 0.018 (0.805 vs 0.787) in MOS prediction, compared to MOS-SSL, a strong single-task model for MOS prediction. In light of the confirmed prediction capability, we further adopt the latent representations of MOSA-Net to guide the speech enhancement (SE) process and derive a quality-intelligibility (QI)-aware SE (QIA-SE) approach accordingly. Experimental results show that QIA-SE provides superior enhancement performance compared with the baseline SE system in terms of objective evaluation metrics and qualitative evaluation test. For example, QIA-SE can improve PESQ by 0.301 (2.953 vs 2.652 in seen noise environments) and 0.18 (2.658 vs 2.478 in unseen noise environments) over a CNN-based baseline SE model.
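The LCC gains quoted above are plain linear (Pearson) correlations between predicted and ground-truth scores; a minimal sketch of the metric (the score arrays below are hypothetical, not taken from the paper):

```python
import numpy as np

def lcc(pred, true):
    """Linear correlation coefficient (Pearson's r) between two score lists."""
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    return np.corrcoef(pred, true)[0, 1]

# Hypothetical PESQ predictions vs. ground truth for five utterances.
pesq_true = [1.8, 2.4, 3.1, 3.6, 4.2]
pesq_pred = [1.9, 2.2, 3.3, 3.5, 4.0]
print(f"LCC = {lcc(pesq_pred, pesq_true):.3f}")
```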
- Published
- 2023
4. Improved Lite Audio-Visual Speech Enhancement
- Author
- Yu Tsao, Shang-Yi Chuang, and Hsin-Min Wang
- Subjects
- FOS: Computer and information sciences, Computer Science - Machine Learning, Computational Mathematics, Acoustics and Ultrasonics, Audio and Speech Processing (eess.AS), Image and Video Processing (eess.IV), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science (miscellaneous), Electrical Engineering and Systems Science - Image and Video Processing, Electrical and Electronic Engineering, Electrical Engineering and Systems Science - Audio and Speech Processing, Machine Learning (cs.LG)
- Abstract
Numerous studies have investigated the effectiveness of audio-visual multimodal learning for speech enhancement (AVSE) tasks, seeking a solution that uses visual data as auxiliary and complementary input to reduce the noise of noisy speech signals. Recently, we proposed a lite audio-visual speech enhancement (LAVSE) algorithm for a car-driving scenario. Compared to conventional AVSE systems, LAVSE requires less online computation and to some extent solves the user privacy problem on facial data. In this study, we extend LAVSE to improve its ability to address three practical issues often encountered in implementing AVSE systems, namely, the additional cost of processing visual data, audio-visual asynchronization, and low-quality visual data. The proposed system is termed improved LAVSE (iLAVSE), which uses a convolutional recurrent neural network architecture as the core AVSE model. We evaluate iLAVSE on the Taiwan Mandarin speech with video dataset. Experimental results confirm that compared to conventional AVSE systems, iLAVSE can effectively overcome the aforementioned three practical issues and can improve enhancement performance. The results also confirm that iLAVSE is suitable for real-world scenarios, where high-quality audio-visual sensors may not always be available.
- Published
- 2022
5. Speaker-Specific Articulatory Feature Extraction Based on Knowledge Distillation for Speaker Recognition
- Author
- Hsin-Min Wang, Chung-Hsien Wu, and Qian-Bei Hong
- Subjects
- Signal Processing, Information Systems
- Published
- 2023
6. Lip Sync Matters: A Novel Multimodal Forgery Detector
- Author
- Sahibzada Adil Shahzad, Ammarah Hashmi, Sarwar Khan, Yan-Tsung Peng, Yu Tsao, and Hsin-Min Wang
- Published
- 2022
7. Multimodal Forgery Detection Using Ensemble Learning
- Author
- Ammarah Hashmi, Sahibzada Adil Shahzad, Wasim Ahmad, Chia Wen Lin, Yu Tsao, and Hsin-Min Wang
- Published
- 2022
8. Detecting Replay Attacks Using Single-Channel Audio: The Temporal Autocorrelation of Speech
- Author
- Shih-Kuang Lee, Yu Tsao, and Hsin-Min Wang
- Published
- 2022
9. Continued potassium supplementation use following loop diuretic discontinuation in older adults: An evaluation of a prescribing cascade relic
- Author
- Grace Hsin‐Min Wang, Earl J. Morris, Steven M. Smith, Jesper Hallas, and Scott M. Vouri
- Subjects
- Geriatrics and Gerontology
- Abstract
The use of a new medication (e.g., potassium supplementation) for managing a drug-induced adverse event (e.g., loop diuretic-induced hypokalemia) constitutes a prescribing cascade. However, loop diuretics are often stopped while potassium may be unnecessarily continued (i.e., a relic). We aimed to quantify the occurrence of relics using older adults who previously experienced a loop diuretic-potassium prescribing cascade as an example. We conducted a prescription sequence symmetry analysis using the population-based Medicare Fee-For-Service data (2011-2018) and partitioned the 150 days following potassium initiation by day to assess the daily treatment scenarios (i.e., loop diuretics alone, potassium alone, a combination of loop diuretics and potassium, or neither). We calculated the proportion of patients developing the relic, the proportion of person-days under potassium alone, the daily probability of the relic, and the proportion of patients filling potassium after loop diuretic discontinuation. We also identified the risk factors for the relic. We identified 284,369 loop diuretic initiators, who were 8 times more likely to receive potassium supplementation simultaneously with or after (i.e., the prescribing cascade), rather than before, loop diuretic initiation (aSR 8.0, 95% CI 7.9-8.2). Among the 66,451 loop diuretic initiators who subsequently (≤30 days) initiated potassium, 20,445 (30.8%) patients remained on potassium after loop diuretic discontinuation, and 9,365 (14.1%) patients subsequently filled another potassium supplementation. Following loop diuretic initiation, 4.0% of person-days were for potassium alone, and the daily probability of the relic was highest after day 90 of loop diuretic initiation (5.6%). Older age, female sex, higher diuretic daily dose, and greater baseline comorbidities were risk factors for the relic, while patients who had the same prescriber or pharmacy involved in the use of both medications were less likely to experience the relic. Our findings suggest that clinicians need to be aware of the potential for relics to avoid unnecessary drug use.
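The underlying sequence symmetry analysis reduces, in its crude form, to comparing how many patients start the second drug after versus before the first within a fixed window (the published aSR additionally adjusts for secular prescribing trends). A toy sketch with hypothetical records:

```python
from datetime import date

# Hypothetical patient records: first loop-diuretic fill and first potassium fill.
patients = [
    {"loop": date(2015, 3, 1), "k": date(2015, 3, 20)},
    {"loop": date(2016, 6, 5), "k": date(2016, 5, 30)},
    {"loop": date(2017, 1, 10), "k": date(2017, 2, 1)},
]

WINDOW = 150  # days; the study partitioned the 150 days after initiation

after = sum(1 for p in patients
            if 0 <= (p["k"] - p["loop"]).days <= WINDOW)
before = sum(1 for p in patients
             if 0 < (p["loop"] - p["k"]).days <= WINDOW)

# Crude sequence ratio: drug-B-after-drug-A vs. drug-B-before-drug-A.
print("crude SR =", after / before)
```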
- Published
- 2022
10. Association between gabapentinoids and oedema treated with loop diuretics: A pooled sequence symmetry analysis from the USA and Denmark
- Author
- Scott Martin Vouri, Earl J. Morris, Grace Hsin‐Min Wang, Alyaa Hashim Jaber Bilal, Jesper Hallas, and Daniel Pilsgaard Henriksen
- Subjects
- Pharmacology, Adult, Sodium Potassium Chloride Symporter Inhibitors, Denmark, Humans, Edema, Pharmacology (medical), Serotonin and Noradrenaline Reuptake Inhibitors, Medicare, Diuretics, United States
- Abstract
To assess the gabapentinoid-oedema-loop diuretic prescribing cascade in adults using large administrative health care databases from the USA and Denmark. This study used a sequence symmetry analysis to assess loop diuretic initiation before and after the initiation of gabapentinoids among patients aged 20 years or older without heart failure or chronic kidney disease. Data from the MarketScan Commercial and Medicare Supplemental Claims databases (2005 to 2019) and the Danish National Prescription Register (2005 to 2018) were analyzed. Use of loop diuretics associated with initiation of selective norepinephrine reuptake inhibitors (SNRIs) was used as a negative control. We assessed the pooled temporality of loop diuretic initiation relative to gabapentinoid or SNRI initiation across the 2 countries. Secular trend-adjusted sequence ratios (aSRs) with 95% confidence intervals (CIs) were calculated using data from 90 days before and after initiation of gabapentinoids. The pooled ratio of aSRs was calculated by comparing gabapentinoids to SNRIs. Among the 1 511 493 gabapentinoid initiators (Denmark [n = 338 941]; USA [n = 1 172 552]), 20 139 patients had a new loop diuretic prescription 90 days before or after gabapentinoid initiation, resulting in a pooled aSR of 1.33 (95% CI 1.06-1.67). The pooled aSR for the negative control (i.e., SNRIs) was 0.84 (95% CI 0.75-0.94), which resulted in a pooled ratio of aSRs of 1.58 (95% CI 1.23-2.04). The pooled estimated incidence of the gabapentinoid-loop diuretic prescribing cascade was 8.14 (95% CI 1.92-34.49) events per 1000 patient-years. We identified evidence of the gabapentinoid-oedema-loop diuretic prescribing cascade in 2 countries.
- Published
- 2022
11. Partially Fake Audio Detection by Self-Attention-Based Fake Span Discovery
- Author
- Haibin Wu, Heng-Cheng Kuo, Naijun Zheng, Kuo-Hsuan Hung, Hung-Yi Lee, Yu Tsao, Hsin-Min Wang, and Helen Meng
- Subjects
- FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Machine Learning, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, ComputingMilieux_LEGALASPECTSOFCOMPUTING, Computer Science - Sound, Machine Learning (cs.LG), Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
The past few years have witnessed significant advances in speech synthesis and voice conversion technologies. However, such technologies can undermine the robustness of broadly implemented biometric identification models and can be harnessed by in-the-wild attackers for illegal uses. The ASVspoof challenge mainly focuses on synthesized audio produced by advanced speech synthesis and voice conversion models, as well as replay attacks. Recently, the first Audio Deep Synthesis Detection challenge (ADD 2022) extended the attack scenarios into more aspects; ADD 2022 is also the first challenge to propose the partially fake audio detection task. Such brand-new attacks are dangerous, and how to tackle them remains an open question. Thus, we propose a novel framework that introduces the question-answering (fake span discovery) strategy with the self-attention mechanism to detect partially fake audio. The proposed fake span detection module tasks the anti-spoofing model with predicting the start and end positions of the fake clip within the partially fake audio, directs the model's attention to discovering the fake spans rather than other shortcuts with less generalization, and finally equips the model with the capacity to discriminate between real and partially fake audio. Our submission ranked second in the partially fake audio detection track of ADD 2022., Comment: Submitted to ICASSP 2022
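The fake span discovery strategy borrows the start/end prediction head of extractive question answering, applied over frame-level embeddings. A minimal PyTorch sketch of such a head (the layer sizes and attention setup are illustrative assumptions, not the authors' exact configuration):

```python
import torch
import torch.nn as nn

class FakeSpanHead(nn.Module):
    """Predicts start/end frame logits of a fake clip, QA-style."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.span = nn.Linear(feat_dim, 2)  # one logit each for start and end

    def forward(self, frames: torch.Tensor):
        # frames: (batch, time, feat_dim) frame-level embeddings
        ctx, _ = self.attn(frames, frames, frames)      # self-attention over time
        start_logits, end_logits = self.span(ctx).unbind(dim=-1)
        return start_logits, end_logits                 # each (batch, time)

head = FakeSpanHead()
s, e = head(torch.randn(2, 100, 256))
print(s.shape, e.shape)  # torch.Size([2, 100]) twice
```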
- Published
- 2022
12. Learning to Visualize Music Through Shot Sequence for Automatic Concert Video Mashup
- Author
- Hsin-Min Wang, Tyng-Luh Liu, Hong-Yuan Mark Liao, Jen-Chun Lin, Wen-Li Wei, and Hsiao-Rong Tyan
- Subjects
- Computer science, Knowledge engineering, Computer Science Applications, Visualization, Human–computer interaction, ComputerApplications_MISCELLANEOUS, Signal Processing, Media Technology, Task analysis, Mashup, Electrical and Electronic Engineering, Amateur, Storytelling
- Abstract
An experienced director usually switches among different types of shots to make visual storytelling more touching. When filming a musical performance, appropriate switching shots can produce special effects, such as enhancing the expression of emotion or heating up the atmosphere. However, while the visual storytelling technique is often used in making professional recordings of a live concert, amateur recordings of audiences often lack such storytelling concepts and skills when filming the same event. Thus, a versatile system that can perform video mashup to create a refined high-quality video from such amateur clips is desirable. To this end, we aim at translating the music into an attractive shot (type) sequence by learning the relation between music and the visual storytelling of shots. The resulting shot sequence can then be used to better portray the visual storytelling of a song and guide the concert video mashup process. To achieve this task, we first introduce a novel probabilistic fusion approach, named multi-resolution fused recurrent neural networks (MF-RNNs) with film-language, which integrates multi-resolution fused RNNs and a film-language model for boosting the translation performance. We then distill the knowledge in MF-RNNs with film-language into a lightweight RNN, which is more efficient and easier to deploy. The results from objective and subjective experiments demonstrate that both MF-RNNs with film-language and the lightweight RNN can generate attractive shot sequences for music, thereby enhancing the viewing and listening experience.
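The distillation step follows the standard teacher-student recipe: the lightweight RNN is trained to match the softened shot-type distribution of the MF-RNNs teacher. A generic sketch (temperature and loss weighting are illustrative, not the paper's settings):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Hinton-style KD: CE on hard labels + KL to the teacher's soft targets."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                      # standard T^2 gradient rescaling
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Hypothetical batch: 8 time steps, 9 shot types.
s = torch.randn(8, 9)
t = torch.randn(8, 9)
y = torch.randint(0, 9, (8,))
print(distillation_loss(s, t, y).item())
```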
- Published
- 2021
13. Unsupervised Representation Disentanglement Using Cross Domain Features and Adversarial Learning in Variational Autoencoder Based Voice Conversion
- Author
- Chen-Chou Lo, Hsin-Te Hwang, Hao Luo, Hsin-Min Wang, Wen-Chin Huang, Yu-Huai Peng, and Yu Tsao
- Subjects
- FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Machine Learning, Computer Science - Computation and Language, Control and Optimization, Computer science, Speech recognition, Speech processing, Autoencoder, Computer Science - Sound, Machine Learning (cs.LG), Computer Science Applications, Domain (software engineering), Constraint (information theory), Computational Mathematics, Audio and Speech Processing (eess.AS), Artificial Intelligence, Similarity (psychology), Classifier (linguistics), FOS: Electrical engineering, electronic engineering, information engineering, Code (cryptography), Representation (mathematics), Computation and Language (cs.CL), Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
An effective approach for voice conversion (VC) is to disentangle linguistic content from other components in the speech signal. The effectiveness of variational autoencoder (VAE) based VC (VAE-VC), for instance, strongly relies on this principle. In our prior work, we proposed a cross-domain VAE-VC (CDVAE-VC) framework, which utilized acoustic features of different properties to improve the performance of VAE-VC. We believed that the success came from more disentangled latent representations. In this paper, we extend the CDVAE-VC framework by incorporating the concept of adversarial learning, in order to further increase the degree of disentanglement, thereby improving the quality and similarity of converted speech. More specifically, we first investigate the effectiveness of incorporating generative adversarial networks (GANs) into CDVAE-VC. Then, we consider the concept of domain adversarial training and add an explicit constraint to the latent representation, realized by a speaker classifier, to explicitly eliminate the speaker information that resides in the latent code. Experimental results confirm that the degree of disentanglement of the learned latent representation can be enhanced by both GANs and the speaker classifier. Meanwhile, subjective evaluation results in terms of quality and similarity scores demonstrate the effectiveness of our proposed methods., Comment: Accepted to IEEE Transactions on Emerging Topics in Computational Intelligence
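One common way to realize such a speaker-adversarial constraint is a gradient reversal layer in front of the speaker classifier; the sketch below is a generic illustration of that idea, not necessarily the authors' exact training recipe:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; multiplies the gradient by -lambda on backward."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lamb * grad_out, None

# Speaker classifier applied to the (hypothetical 64-dim) latent code z.
clf = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
z = torch.randn(4, 64, requires_grad=True)
logits = clf(GradReverse.apply(z, 1.0))
# Minimizing this CE trains the classifier while pushing the encoder,
# through the reversed gradient, to remove speaker information from z.
loss = nn.functional.cross_entropy(logits, torch.randint(0, 10, (4,)))
loss.backward()
```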
- Published
- 2020
14. Speech Enhancement Based on Denoising Autoencoder With Multi-Branched Encoders
- Author
- Ryandhimas E. Zezario, Xugang Lu, Syu-Siang Wang, Hsin-Min Wang, Cheng Yu, Jonathan Sherman, Yu Tsao, and Yi-Yen Hsieh
- Subjects
- FOS: Computer and information sciences, Sound (cs.SD), Acoustics and Ultrasonics, Computer science, Speech recognition, Noise reduction, Computer Science - Sound, Data modeling, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science (miscellaneous), Electrical and Electronic Engineering, Noise measurement, Deep learning, Speech enhancement, Computational Mathematics, Noise, Artificial intelligence, Encoder, Decoding methods, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
Deep learning-based models have greatly advanced the performance of speech enhancement (SE) systems. However, two problems remain unsolved, which are closely related to model generalizability to noisy conditions: (1) mismatched noisy condition during testing, i.e., the performance is generally sub-optimal when models are tested with unseen noise types that are not involved in the training data; (2) local focus on specific noisy conditions, i.e., models trained using multiple types of noises cannot optimally remove a specific noise type even though the noise type has been involved in the training data. These problems are common in real applications. In this article, we propose a novel denoising autoencoder with a multi-branched encoder (termed DAEME) model to deal with these two problems. In the DAEME model, two stages are involved: training and testing. In the training stage, we build multiple component models to form a multi-branched encoder based on a decision tree (DSDT). The DSDT is built based on prior knowledge of speech and noisy conditions (the speaker, environment, and signal factors are considered in this paper), where each component of the multi-branched encoder performs a particular mapping from noisy to clean speech along the branch in the DSDT. Finally, a decoder is trained on top of the multi-branched encoder. In the testing stage, noisy speech is first processed by each component model. The multiple outputs from these models are then integrated into the decoder to determine the final enhanced speech. Experimental results show that DAEME is superior to several baseline models in terms of objective evaluation metrics, automatic speech recognition results, and quality in subjective human listening tests.
- Published
- 2020
15. Subspace-Based Representation and Learning for Phonotactic Spoken Language Recognition
- Author
- Yu Tsao, Hung-Shin Lee, Hsin-Min Wang, and Shyh-Kang Jeng
- Subjects
- Computer Science - Machine Learning, Sequence, Computer Science - Computation and Language, Acoustics and Ultrasonics, Artificial neural network, Computer science, Speech recognition, Linear subspace, Computer Science - Sound, Matrix decomposition, Support vector machine, Computational Mathematics, Kernel (linear algebra), Computer Science (miscellaneous), NIST, Electrical and Electronic Engineering, Subspace topology, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
Phonotactic constraints can be employed to distinguish languages by representing a speech utterance as a multinomial distribution of phone events. In the present study, we propose a new learning mechanism based on subspace-based representation, which can extract concealed phonotactic structures from utterances, for language verification and dialect/accent identification. The framework mainly involves two successive parts. The first part involves subspace construction. Specifically, it decodes each utterance into a sequence of vectors filled with phone-posteriors and transforms the vector sequence into a linear orthogonal subspace based on low-rank matrix factorization or dynamic linear modeling. The second part involves subspace learning based on kernel machines, such as support vector machines and the newly developed subspace-based neural networks (SNNs). The input layer of SNNs is specifically designed for the sample represented by subspaces. The topology ensures that the same output can be derived from identical subspaces by modifying the conventional feed-forward pass to fit the mathematical definition of subspace similarity. Evaluated on the "General LR" test of NIST LRE 2007, the proposed method achieved up to 52%, 46%, 56%, and 27% relative reductions in equal error rates over the sequence-based PPR-LM, PPR-VSM, and PPR-IVEC methods and the lattice-based PPR-LM method, respectively. Furthermore, on the dialect/accent identification task of NIST LRE 2009, the SNN-based system performed better than the aforementioned four baseline methods., Comment: Published in IEEE/ACM Trans. Audio, Speech, Lang. Process., 2020, vol. 28, pp. 3065-3079
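The subspace construction step can be pictured as orthogonalizing each utterance's phone-posterior matrix and comparing utterances through the principal angles between their subspaces; a numpy sketch (rank and dimensions are arbitrary assumptions):

```python
import numpy as np

def utterance_subspace(posteriors: np.ndarray, rank: int = 5) -> np.ndarray:
    """Orthonormal basis of a phone-posterior sequence via truncated SVD.

    posteriors: (num_phones, num_frames) matrix of frame-level posteriors.
    """
    U, _, _ = np.linalg.svd(posteriors, full_matrices=False)
    return U[:, :rank]                                # (num_phones, rank) basis

def subspace_similarity(U1: np.ndarray, U2: np.ndarray) -> float:
    """Average squared cosine of the principal angles between two subspaces."""
    s = np.linalg.svd(U1.T @ U2, compute_uv=False)    # singular values = cos(angles)
    return float(np.mean(s ** 2))

rng = np.random.default_rng(0)
A = utterance_subspace(rng.random((50, 300)))
B = utterance_subspace(rng.random((50, 300)))
print(subspace_similarity(A, B))  # 1.0 would mean identical subspaces
```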
- Published
- 2020
16. EMGSE: Acoustic/EMG Fusion for Multimodal Speech Enhancement
- Author
- Kuan-Chen Wang, Kai-Chun Liu, Hsin-Min Wang, and Yu Tsao
- Subjects
- FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Biological sciences, FOS: Electrical engineering, electronic engineering, information engineering, Quantitative Biology - Quantitative Methods, Computer Science - Sound, Quantitative Methods (q-bio.QM), Electrical Engineering and Systems Science - Audio and Speech Processing, Machine Learning (cs.LG)
- Abstract
Multimodal learning has been proven to be an effective method to improve speech enhancement (SE) performance, especially in challenging situations such as low signal-to-noise ratios, speech noise, or unseen noise types. In previous studies, several types of auxiliary data have been used to construct multimodal SE systems, such as lip images, electropalatography, or electromagnetic midsagittal articulography. In this paper, we propose a novel EMGSE framework for multimodal SE, which integrates audio and facial electromyography (EMG) signals. Facial EMG is a biological signal containing articulatory movement information, which can be measured in a non-invasive way. Experimental results show that the proposed EMGSE system can achieve better performance than the audio-only SE system. The benefits of fusing EMG signals with acoustic signals for SE are notable under challenging circumstances. Furthermore, this study reveals that cheek EMG is sufficient for SE., Comment: 5 pages, 4 figures, and 3 tables
- Published
- 2022
17. Disentangling the Impacts of Language and Channel Variability on Speech Separation Networks
- Author
- Fan-Lin Wang, Hung-Shin Lee, Yu Tsao, and Hsin-Min Wang
- Subjects
- FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Computer Science - Sound, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing, Machine Learning (cs.LG), Multimedia (cs.MM)
- Abstract
Because the performance of speech separation is excellent for speech in which two speakers completely overlap, research attention has shifted to dealing with more realistic scenarios. However, domain mismatch between training and test situations due to factors such as speaker, content, channel, and environment remains a severe problem for speech separation. Speaker and environment mismatches have been studied in the existing literature. Nevertheless, there are few studies on speech content and channel mismatches. Moreover, the impacts of language and channel in these studies are mostly tangled. In this study, we create several datasets for various experiments. The results show that the impacts of different languages are small enough to be ignored compared to the impacts of different channels. In our experiments, training on data recorded by Android phones leads to the best generalizability. Moreover, we provide a new solution for channel mismatch by evaluating projection, whereby the channel similarity can be measured and used to effectively select additional training data to improve the performance on in-the-wild test data., Comment: Published in Interspeech 2022
- Published
- 2022
18. Multi-target Extractor and Detector for Unknown-number Speaker Diarization
- Author
- Chin-Yi Cheng, Hung-Shin Lee, Yu Tsao, and Hsin-Min Wang
- Subjects
- FOS: Computer and information sciences, Sound (cs.SD), Audio and Speech Processing (eess.AS), Applied Mathematics, Signal Processing, FOS: Electrical engineering, electronic engineering, information engineering, Electrical and Electronic Engineering, Computer Science - Sound, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing, Multimedia (cs.MM)
- Abstract
Strong representations of target speakers can help extract important information about speakers and detect corresponding temporal regions in multi-speaker conversations. In this study, we propose a neural architecture that simultaneously extracts speaker representations consistent with the speaker diarization objective and detects the presence of each speaker on a frame-by-frame basis regardless of the number of speakers in a conversation. A speaker representation (called z-vector) extractor and a time-speaker contextualizer, implemented by a residual network and processing data in both temporal and speaker dimensions, are integrated into a unified framework. Tests on the CALLHOME corpus show that our model outperforms most of the methods proposed so far. Evaluations in a more challenging case with simultaneous speakers ranging from 2 to 7 show that our model achieves 6.4% to 30.9% relative diarization error rate reductions over several typical baselines., Comment: Accepted by IEEE Signal Processing Letters
- Published
- 2022
19. Mandarin Electrolaryngeal Speech Voice Conversion with Sequence-to-Sequence Modeling
- Author
- Ming-Chi Yen, Wen-Chin Huang, Kazuhiro Kobayashi, Yu-Huai Peng, Shu-Wei Tsai, Yu Tsao, Tomoki Toda, Jyh-Shing Roger Jang, and Hsin-Min Wang
- Published
- 2021
20. Investigation of a Single-Channel Frequency-Domain Speech Enhancement Network to Improve End-to-End Bengali Automatic Speech Recognition Under Unseen Noisy Conditions
- Author
- Md Mahbub E Noor, Yen-Ju Lu, Syu-Siang Wang, Supratip Ghose, Chia-Yu Chang, Ryandhimas E. Zezario, Shafique Ahmed, Wei-Ho Chung, Yu Tsao, and Hsin-Min Wang
- Published
- 2021
21. Use of antipsychotic drugs and cholinesterase inhibitors and risk of falls and fractures: self-controlled case series
- Author
- Tzu-Chi Liao, Kenneth K.C. Man, Edward Chia Cheng Lai, Grace Hsin-Min Wang, and Wei-Hung Chang
- Subjects
- Male, Databases, Factual, Neurocognitive Disorders, Taiwan, Rate ratio, Risk Assessment, Fractures, Bone, Internal medicine, Humans, Poisson regression, Medical prescription, Antipsychotic Agents, Cholinesterase Inhibitors, Aged, Aged, 80 and over, Incidence (epidemiology), Research, General Medicine, Confidence interval, Accidental Falls, Female
- Abstract
Objective To evaluate the association between the use of antipsychotic drugs and cholinesterase inhibitors and the risk of falls and fractures in elderly patients with major neurocognitive disorders. Design Self-controlled case series. Setting Taiwan’s National Health Insurance Database. Participants 15 278 adults, aged ≥65, with newly prescribed antipsychotic drugs and cholinesterase inhibitors, who had an incident fall or fracture between 2006 and 2017. Prescription records of cholinesterase inhibitors confirmed the diagnosis of major neurocognitive disorders; all use of cholinesterase inhibitors was reviewed by experts. Main outcome measures Conditional Poisson regression was used to derive incidence rate ratios and 95% confidence intervals for evaluating the risk of falls and fractures for different treatment periods: use of cholinesterase inhibitors alone, antipsychotic drugs alone, and a combination of cholinesterase inhibitors and antipsychotic drugs, compared with the non-treatment period in the same individual. A 14 day pretreatment period was defined before starting the study drugs because of concerns about confounding by indication. Results The incidence of falls and fractures per 100 person years was 8.30 (95% confidence interval 8.14 to 8.46) for the non-treatment period, 52.35 (48.46 to 56.47) for the pretreatment period, and 10.55 (9.98 to 11.14), 10.34 (9.80 to 10.89), and 9.41 (8.98 to 9.86) for use of a combination of cholinesterase inhibitors and antipsychotic drugs, antipsychotic drugs alone, and cholinesterase inhibitors alone, respectively. Compared with the non-treatment period, the highest risk of falls and fractures was during the pretreatment period (adjusted incidence rate ratio 6.17, 95% confidence interval 5.69 to 6.69), followed by treatment with the combination of cholinesterase inhibitors and antipsychotic drugs (1.35, 1.26 to 1.45), antipsychotic drugs alone (1.33, 1.24 to 1.43), and cholinesterase inhibitors alone (1.17, 1.10 to 1.24). Conclusions The incidence of falls and fractures was high in the pretreatment period, suggesting that factors other than the study drugs, such as underlying diseases, should be taken into consideration when evaluating the association between the risk of falls and fractures and use of cholinesterase inhibitors and antipsychotic drugs. The treatment periods were also associated with a higher risk of falls and fractures compared with the non-treatment period, although the magnitude was much lower than during the pretreatment period. Strategies for prevention and close monitoring of the risk of falls are still necessary until patients regain a more stable physical and mental state.
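At their core, the reported figures are incidence rate ratios, i.e., ratios of events per person-time; a bare-bones sketch of the point estimate with a Wald confidence interval (toy counts, and without the conditional Poisson adjustment used in the paper):

```python
import math

def irr(events_t, pt_t, events_0, pt_0, z=1.96):
    """Incidence rate ratio (treatment vs. non-treatment) with a Wald 95% CI."""
    ratio = (events_t / pt_t) / (events_0 / pt_0)
    se = math.sqrt(1 / events_t + 1 / events_0)   # SE of log(IRR)
    lo, hi = (math.exp(math.log(ratio) + s * z * se) for s in (-1, 1))
    return ratio, lo, hi

# Toy counts: 120 falls over 1100 person-years during treatment,
# 830 falls over 10000 person-years during non-treatment.
print(irr(120, 1100, 830, 10000))
```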
- Published
- 2021
22. Relational Data Selection for Data Augmentation of Speaker-Dependent Multi-Band MelGAN Vocoder
- Author
- Hsin-Min Wang, Yu Tsao, Hung-Shin Lee, Yu-Huai Peng, Yi-Chiao Wu, Tomoki Toda, Cheng-Hung Hu, and Wen-Chin Huang
- Subjects
- Multi band, Speaker verification, Similarity (geometry), Training set, Audio and Speech Processing (eess.AS), Relational database, Computer science, Speech recognition, FOS: Electrical engineering, electronic engineering, information engineering, Selection (linguistics), Representation (mathematics), Identity (music), Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
Nowadays, neural vocoders can generate very high-fidelity speech when a large amount of training data is available. Although a speaker-dependent (SD) vocoder usually outperforms a speaker-independent (SI) vocoder, it is impractical to collect a large amount of data of a specific target speaker for most real-world applications. To tackle the problem of limited target data, a data augmentation method based on speaker representation and the similarity measurement of speaker verification is proposed in this paper. The proposed method selects utterances that have speaker identities similar to the target speaker from an external corpus, and then combines the selected utterances with the limited target data for SD vocoder adaptation. The evaluation results show that, compared with the vocoder adapted using only limited target data, the vocoder adapted using augmented data improves both the quality and similarity of synthesized speech., Comment: 5 pages, 1 figure, 3 tables, Proc. Interspeech, 2021
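Selecting augmentation utterances by speaker similarity typically amounts to ranking an external corpus by cosine similarity between each utterance's speaker embedding and the target's embedding; a schematic sketch (embedding dimension and top-k are placeholders):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_utterances(target_emb, corpus_embs, top_k=100):
    """Rank external utterances by speaker similarity to the target."""
    scores = [cosine(target_emb, e) for e in corpus_embs]
    order = np.argsort(scores)[::-1]          # most similar first
    return order[:top_k]

rng = np.random.default_rng(1)
target = rng.standard_normal(192)             # e.g., a 192-dim speaker vector
corpus = rng.standard_normal((5000, 192))     # embeddings of an external corpus
print(select_utterances(target, corpus, top_k=5))
```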
- Published
- 2021
23. Ginsenoside compound K reduces the progression of Huntington's disease via the inhibition of oxidative stress and overactivation of the ATM/AMPK pathway
- Author
- A-Ching Chao, Sheau-Long Lee, Wan-Tze Chen, Yu-Chieh Lee, Tz-Chuen Ju, Hsin-Min Wang, Wan-Han Hsu, Ding-I. Yang, Ting-Yu Lin, and Kuo-Feng Hua
- Subjects
- Genetically modified mouse, Huntingtin, Chemistry, DNA damage, AMPK, Biochemistry, Genetics and Molecular Biology (miscellaneous), Complementary and alternative medicine, Huntington's disease, Cancer research, Phosphorylation, Protein kinase A, Oxidative stress, Biotechnology
- Abstract
Background Huntington's disease (HD) is a neurodegenerative disorder caused by the expansion of a trinucleotide CAG repeat in the Huntingtin (Htt) gene. The major pathogenic pathways underlying HD involve the impairment of cellular energy homeostasis and DNA damage in the brain. The protein kinase ataxia-telangiectasia mutated (ATM) is an important regulator of the DNA damage response. ATM is involved in the phosphorylation of AMP-activated protein kinase (AMPK), suggesting that AMPK plays a critical role in the response to DNA damage. Herein, we demonstrated that expression of polyQ-expanded mutant Htt (mHtt) enhanced the phosphorylation of ATM. Ginsenoside is the main and most effective component of Panax ginseng. However, the protective effect of a ginsenoside (compound K, CK) in HD remains unclear and warrants further investigation. Methods This study used the R6/2 transgenic mouse model of HD and performed behavioral tests, survival rate, histological analyses, and immunoblot assays. Results The systematic administration of CK into R6/2 mice suppressed the activation of ATM/AMPK and reduced neuronal toxicity and mHtt aggregation. Most importantly, CK increased neuronal density and lifespan and improved motor dysfunction in R6/2 mice. Moreover, CK enhanced the expression of Bcl2 and protected striatal cells from the toxicity induced by the overactivation of mHtt and AMPK. Conclusions Thus, the oral administration of CK reduced disease progression and markedly enhanced lifespan in the transgenic mouse model (R6/2) of HD.
- Published
- 2021
24. ACL Size and Notch Width Between ACLR and Healthy Individuals: A Pilot Study
- Author
- David H. Perrin, Hsin-Min Wang, Robert A. Henson, Sandra J. Shultz, Scott E. Ross, and Randy J. Schmitz
- Subjects
- Adolescent, Anterior cruciate ligament, Pilot Projects, Physical Therapy, Sports Therapy and Rehabilitation, Young Adult, Recurrence, Risk Factors, Notch width, Body Size, Humans, Orthopedics and Sports Medicine, Femur, Anterior Cruciate Ligament, Anterior Cruciate Ligament Reconstruction, Anterior Cruciate Ligament Injuries, musculoskeletal system, Magnetic Resonance Imaging, Cross-Sectional Studies, Healthy individuals, Female, human activities
- Abstract
Background: Given the relatively high risk of contralateral anterior cruciate ligament (ACL) injury in patients with ACL reconstruction (ACLR), there is a need to understand intrinsic risk factors that may contribute to contralateral injury. Hypothesis: The ACLR group would have smaller ACL volume and a narrower femoral notch width than healthy individuals after accounting for relevant anthropometrics. Study Design: Cross-sectional study. Level of Evidence: Level 3. Methods: Magnetic resonance imaging data of the left knee were obtained from uninjured (N = 11) and unilateral ACL-reconstructed (N = 10) active, female, collegiate-level recreational athletes. ACL volume was obtained from T2-weighted images. Femoral notch width and notch width index were measured from T1-weighted images. Independent-samples t tests examined differences in all measures between healthy and ACLR participants. Results: The ACLR group had a smaller notch width index (0.22 ± 0.02 vs 0.25 ± 0.01; P = 0.004; effect size, 1.41) and ACL volume (25.6 ± 4.0 vs 32.6 ± 8.2 mm^3/(kg·m); P = 0.025; effect size, 1.08) after normalizing by body size. Conclusion: Only after normalizing for relevant anthropometrics did the contralateral ACLR limb have smaller ACL size and narrower relative femoral notch size than healthy individuals. These findings suggest that risk factor studies of ACL size and femoral notch size should account for relevant body size when determining their association with contralateral ACL injury. Clinical Relevance: The present study shows that the method of identifying intrinsic risk factors for contralateral ACL injury could be used in future clinical screening settings.
- Published
- 2019
25. Sex Comparisons of In Vivo Anterior Cruciate Ligament Morphometry
- Author
- Hsin-Min Wang, Robert A. Henson, Scott E. Ross, Randy J. Schmitz, Sandra J. Shultz, Robert A. Kraft, and David H. Perrin
- Subjects
- Adult, Male, Anterior cruciate ligament, Physical Therapy, Sports Therapy and Rehabilitation, Context (language use), Body Mass Index, Sex Factors, In vivo, Humans, Knee, Orthopedics and Sports Medicine, Anterior Cruciate Ligament, Anthropometry, Anterior Cruciate Ligament Injuries, Magnetic resonance imaging, Organ Size, General Medicine, Anatomy, musculoskeletal system, Magnetic Resonance Imaging, Cross-Sectional Studies, Female, human activities
- Abstract
Context Females have consistently higher anterior cruciate ligament (ACL) injury rates than males. The reasons for this disparity are not fully understood. Whereas ACL morphometric characteristics are associated with injury risk and females have a smaller absolute ACL size, comprehensive sex comparisons that adequately account for sex differences in body mass index (BMI) have been limited. Objective To investigate sex differences among in vivo ACL morphometric measures before and after controlling for femoral notch width and BMI. Design Cross-sectional study. Setting Laboratory. Patients or Other Participants Twenty recreationally active men (age = 23.2 ± 2.9 years, height = 180.4 ± 6.7 cm, mass = 84.0 ± 10.9 kg) and 20 recreationally active women (age = 21.3 ± 2.3 years, height = 166.9 ± 7.7 cm, mass = 61.9 ± 7.2 kg) participated. Main Outcome Measure(s) Structural magnetic resonance imaging sequences were performed on the left knee. Anterior cruciate ligament volume, width, and cross-sectional area measures were obtained from T2-weighted images and normalized to femoral notch width and BMI. Femoral notch width was measured from T1-weighted images. We used independent-samples t tests to examine sex differences in absolute and normalized measures. Results Men had greater absolute ACL volume (1712.2 ± 356.3 versus 1200.1 ± 337.8 mm^3; t(38) = −4.67, P < .001) and ACL width (8.5 ± 2.3 versus 7.0 ± 1.2 mm; t(38) = −2.53, P = .02) than women. The ACL volume remained greater in men than in women after controlling for femoral notch width (89.31 ± 15.63 versus 72.42 ± 16.82 mm^3/mm; t(38) = −3.29, P = .002) and BMI (67.13 ± 15.40 versus 54.69 ± 16.39 mm^3/(kg/m^2); t(38) = −2.47, P = .02). Conclusions Whereas men had greater ACL volume and width than women, only ACL volume remained different when we accounted for femoral notch width and BMI. This suggests that ACL volume may be an appropriate measure of ACL anatomy in investigations of ACL morphometry and ACL injury risk that include sex comparisons.
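The comparisons above reduce to normalizing each measure by a body dimension and running independent-samples t tests; a sketch with an effect-size helper (the numbers are simulated, not study data):

```python
import numpy as np
from scipy import stats

def cohens_d(a, b):
    """Effect size using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * np.var(a, ddof=1) +
                      (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2))
    return (np.mean(a) - np.mean(b)) / pooled

rng = np.random.default_rng(3)
# Hypothetical ACL volumes normalized by BMI for men and women.
men = rng.normal(67.1, 15.4, 20)
women = rng.normal(54.7, 16.4, 20)
t, p = stats.ttest_ind(men, women)
print(f"t({len(men) + len(women) - 2}) = {t:.2f}, p = {p:.3f}, "
      f"d = {cohens_d(men, women):.2f}")
```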
- Published
- 2019
26. MoEVC: A Mixture of Experts Voice Conversion System With Sparse Gating Mechanism for Online Computation Acceleration
- Author
- Tai-Shih Chi, Yu-Huai Peng, Yuan-Hong Yang, Hsin-Min Wang, Yu-Tao Chang, Yu Tsao, and Syu-Siang Wang
- Subjects
- Computer science, Deep learning, Speech recognition, Computation, Latency (audio), Gating, FLOPS, Convolution, Task (computing), Computer engineering, Feature (computer vision), Artificial intelligence
- Abstract
Owing to the recent advancements in deep learning technology, the performance of voice conversion (VC) in terms of quality and similarity has significantly improved. However, complex computation is generally required for deep-learning-based VC systems. This can cause a notable latency, which limits the deployment of such VC systems in real-world applications. Therefore, increasing the efficiency of online computing has become an important task. In this study, we propose a novel mixture-of-experts (MoE) based VC system, termed MoEVC. The MoEVC system uses a gating mechanism to assign weights to feature maps to increase VC performance. In addition, applying sparse constraints on the gating mechanism can skip some convolution processes through elimination of redundant feature maps, thereby accelerating online computing. Experimental results show that by using proper sparse constraints, we can effectively reduce the FLOPs (floating-point operations) count by 70%, while improving VC performance in both objective evaluation and human subjective listening tests.
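The sparse gating idea is that a learnable per-channel gate, pushed toward zero by an L1 penalty during training, lets whole feature maps (and the convolutions that produce them) be skipped at inference. A minimal sketch (shapes and penalty weight are illustrative):

```python
import torch
import torch.nn as nn

class SparseGatedConv(nn.Module):
    """1-D convolution whose output channels are scaled by learnable gates."""
    def __init__(self, cin=80, cout=128, k=3):
        super().__init__()
        self.conv = nn.Conv1d(cin, cout, k, padding=k // 2)
        self.gate = nn.Parameter(torch.ones(cout))

    def forward(self, x):
        keep = self.gate.abs() > 1e-3          # channels whose gate survived
        y = self.conv(x) * self.gate.view(1, -1, 1)
        return y, keep

layer = SparseGatedConv()
y, keep = layer(torch.randn(2, 80, 200))
l1_penalty = 1e-4 * layer.gate.abs().sum()     # added to the training loss
print(y.shape, int(keep.sum()), "active channels")
```

At inference, the convolution filters whose gates have collapsed to zero can be pruned outright, which is where the FLOPs saving comes from.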
- Published
- 2021
27. SVSNet: An End-to-end Speaker Voice Similarity Assessment Model
- Author
- Cheng-Hung Hu, Yu-Huai Peng, Junichi Yamagishi, Yu Tsao, and Hsin-Min Wang
- Subjects
- FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Audio and Speech Processing (eess.AS), Applied Mathematics, Signal Processing, FOS: Electrical engineering, electronic engineering, information engineering, Electrical and Electronic Engineering, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Machine Learning (cs.LG)
- Abstract
Neural evaluation metrics derived for numerous speech generation tasks have recently attracted great attention. In this paper, we propose SVSNet, the first end-to-end neural network model to assess the speaker voice similarity between converted speech and natural speech for voice conversion tasks. Unlike most neural evaluation metrics that use hand-crafted features, SVSNet directly takes the raw waveform as input to more completely utilize speech information for prediction. SVSNet consists of encoder, co-attention, distance calculation, and prediction modules and is trained in an end-to-end manner. The experimental results on the Voice Conversion Challenge 2018 and 2020 (VCC2018 and VCC2020) datasets show that SVSNet outperforms well-known baseline systems in the assessment of speaker similarity at the utterance and system levels., Comment: To appear in IEEE Signal Processing Letters (SPL)
- Published
- 2021
28. AlloST: Low-resource Speech Translation without Source Transcription
- Author
- Yao-Fei Cheng, Hung-Shin Lee, and Hsin-Min Wang
- Subjects
- Byte pair encoding, FOS: Computer and information sciences, Sequence, Computer Science - Machine Learning, Computer Science - Computation and Language, Computer science, Computer Science - Artificial Intelligence, Speech recognition, Pronunciation, Machine Learning (cs.LG), Multimedia (cs.MM), Artificial Intelligence (cs.AI), Phone, Audio and Speech Processing (eess.AS), Speech translation, FOS: Electrical engineering, electronic engineering, information engineering, Transcription (software), Encoder, Computation and Language (cs.CL), Word (computer architecture), Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
The end-to-end architecture has made promising progress in speech translation (ST). However, the ST task is still challenging under low-resource conditions. Most ST models have shown unsatisfactory results, especially in the absence of word information from the source speech utterance. In this study, we survey methods to improve ST performance without using source transcription, and propose a learning framework that utilizes a language-independent universal phone recognizer. The framework is based on an attention-based sequence-to-sequence model, where the encoder generates the phonetic embeddings and phone-aware acoustic representations, and the decoder controls the fusion of the two embedding streams to produce the target token sequence. In addition to investigating different fusion strategies, we explore the specific usage of byte pair encoding (BPE), which compresses a phone sequence into a syllable-like segmented sequence. Due to the conversion of symbols, a segmented sequence represents not only pronunciation but also language-dependent information lacking in phones. Experiments conducted on the Fisher Spanish-English and Taigi-Mandarin drama corpora show that our method outperforms the conformer-based baseline, and the performance is close to that of the existing best method using source transcription., Comment: Accepted by Interspeech2021
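BPE over phones works exactly like BPE over characters: the most frequent adjacent pair is repeatedly merged into a new unit, yielding syllable-like segments. A compact sketch (the phone strings and merge count are made up):

```python
from collections import Counter

def bpe(seq, num_merges=3):
    """Greedy byte pair encoding over a phone sequence."""
    seq = list(seq)
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                merged.append(a + "+" + b)     # new syllable-like unit
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return seq

phones = ["t", "a", "j", "t", "a", "j", "g", "i"]
print(bpe(phones, num_merges=2))  # ['t+a+j', 't+a+j', 'g', 'i']
```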
- Published
- 2021
29. A Preliminary Study of a Two-Stage Paradigm for Preserving Speaker Identity in Dysarthric Voice Conversion
- Author
- Hsin-Min Wang, Ching-Feng Liu, Kazuhiro Kobayashi, Yu-Huai Peng, Yu Tsao, Wen-Chin Huang, and Tomoki Toda
- Subjects
- FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, Computer science, Speech recognition, Dysarthric speech, Autoencoder, Computer Science - Sound, Poor quality, Identity (music), Dysarthria, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Quality (business), Normal speech, Computation and Language (cs.CL), Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
We propose a new paradigm for maintaining speaker identity in dysarthric voice conversion (DVC). The poor quality of dysarthric speech can be greatly improved by statistical VC, but as the normal speech utterances of a dysarthria patient are nearly impossible to collect, previous work failed to recover the individuality of the patient. In light of this, we suggest a novel, two-stage approach for DVC, which is highly flexible in that no normal speech of the patient is required. First, a powerful parallel sequence-to-sequence model converts the input dysarthric speech into a normal speech of a reference speaker as an intermediate product, and a nonparallel, frame-wise VC model realized with a variational autoencoder then converts the speaker identity of the reference speech back to that of the patient while assumed to be capable of preserving the enhanced quality. We investigate several design options. Experimental evaluation results demonstrate the potential of our approach to improving the quality of the dysarthric speech while maintaining the speaker identity., Comment: Accepted to Interspeech 2021. 5 pages, 3 figures, 1 table
- Published
- 2021
30. SurpriseNet: Melody Harmonization Conditioning on User-controlled Surprise Contours
- Author
- Yi-Wei Chen, Hung-Shin Lee, Yen-Hsing Chen, and Hsin-Min Wang
- Subjects
- FOS: Computer and information sciences, Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing, Multimedia (cs.MM)
- Abstract
The surprisingness of a song is an essential and seemingly subjective factor in determining whether the listener likes it. With the help of information theory, it can be described as the transition probability of a music sequence modeled as a Markov chain. In this study, we introduce the concept of deriving entropy variations over time, so that the surprise contour of each chord sequence can be extracted. Based on this, we propose a user-controllable framework that uses a conditional variational autoencoder (CVAE) to harmonize the melody based on the given chord surprise indication. Through explicit conditions, the model can randomly generate various and harmonic chord progressions for a melody, and the Spearman's correlation and p-value significance show that the resulting chord progressions match the given surprise contour quite well. The vanilla CVAE model was evaluated in a basic melody harmonization task (no surprise control) in terms of six objective metrics. The results of experiments on the Hooktheory Lead Sheet Dataset show that our model achieves performance comparable to the state-of-the-art melody harmonization model., Comment: Proceedings of the 22nd International Society for Music Information Retrieval Conference, ISMIR 2021
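With the chord sequence modeled as a Markov chain, the surprisingness of each step is the negative log transition probability, and the surprise contour is simply this value over time; a small sketch (the toy corpus stands in for Hooktheory-style training data):

```python
import math
from collections import Counter

def surprise_contour(sequence, corpus):
    """Per-step surprisal -log2 P(next | current) from bigram counts."""
    bigrams, contexts = Counter(), Counter()
    for seq in corpus:
        for a, b in zip(seq, seq[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    contour = []
    for a, b in zip(sequence, sequence[1:]):
        p = (bigrams[(a, b)] + 1) / (contexts[a] + len(contexts))  # add-one smoothing
        contour.append(-math.log2(p))
    return contour

corpus = [["C", "F", "G", "C"], ["C", "Am", "F", "G"], ["C", "F", "G", "C"]]
print(surprise_contour(["C", "F", "G", "Am"], corpus))
```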
- Published
- 2021
31. Dual-Path Filter Network: Speaker-Aware Modeling for Speech Separation
- Author
- Hsin-Min Wang, Yu-Huai Peng, Hung-Shin Lee, and Fan-Lin Wang
- Subjects
- Network speaker, Computer science, Filter (video), Audio and Speech Processing (eess.AS), Speech recognition, Path (graph theory), Source separation, FOS: Electrical engineering, electronic engineering, information engineering, Time domain, Cocktail party effect, Electrical Engineering and Systems Science - Audio and Speech Processing, Domain (software engineering), Dual (category theory)
- Abstract
Speech separation has been extensively studied in recent years to deal with the cocktail party problem. All related approaches can be divided into two categories: time-frequency domain methods and time domain methods. In addition, some methods try to generate speaker vectors to support source separation. In this study, we propose a new model called dual-path filter network (DPFN). Our model focuses on the post-processing of speech separation to improve speech separation performance. DPFN is composed of two parts: the speaker module and the separation module. First, the speaker module infers the identities of the speakers. Then, the separation module uses the speakers' information to extract the voices of individual speakers from the mixture. DPFN, which is constructed based on DPRNN-TasNet, is not only superior to DPRNN-TasNet but also avoids the problem of permutation-invariant training (PIT)., Comment: Accepted by Interspeech2021
- Published
- 2021
32. Quadriceps muscle volume positively contributes to ACL volume
- Author
- Hsin-Min Wang, Sandra J. Shultz, Jeffrey D. Labban, Anthony S. Kulas, and Randy J. Schmitz
- Subjects
- Male, Knee Joint, Anterior cruciate ligament, Thigh, Muscle mass, Quadriceps Muscle, Risk Factors, Internal medicine, Humans, Orthopedics and Sports Medicine, Clinical significance, Femur, Anterior Cruciate Ligament, Anterior Cruciate Ligament Injuries, Quadriceps muscle, ACL injury, Magnetic Resonance Imaging, Cardiology, Female, Body mass index, Hamstring
- Abstract
Females have smaller anterior cruciate ligaments (ACLs) than males, and smaller ACLs have been associated with a greater risk of ACL injury. Overall body dimensions do not adequately explain these sex differences. This study examined the extent to which quadriceps muscle volume (VOL_QUAD) positively predicts ACL volume (VOL_ACL) once sex and other body dimensions are accounted for. Physically active males (N = 10) and females (N = 10) were measured for height, weight, and body mass index (BMI). Three-Tesla magnetic resonance images of their dominant and nondominant thigh and knee were then obtained to measure VOL_ACL, quadriceps and hamstring muscle volumes, femoral notch width, and femoral notch width index. Separate three-step regressions estimated associations between VOL_QUAD and VOL_ACL (third step), after controlling for sex (first step) and one body dimension (second step). When controlling for sex and sex plus BMI, VOL_HAM, notch width, or notch width index, VOL_QUAD consistently exhibited a positive association with VOL_ACL in the dominant-leg, nondominant-leg, and leg-averaged models (p < 0.05). Findings were inconsistent when controlling for sex and height (p = 0.038-0.102). Once VOL_QUAD was included, only notch width and notch width index retained a statistically significant individual association with VOL_ACL (p < 0.01). Statement of Clinical Significance: The positive association between VOL_QUAD and VOL_ACL suggests that ACL size may in part be modifiable. Future studies are needed to determine the extent to which an appropriate training stimulus (focused on optimizing overall lower extremity muscle mass development) can positively impact ACL size and structure in young females.
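The three-step models are hierarchical regressions: each step adds a predictor block, and the increment in explained variance attributable to VOL_QUAD is read off the last step. A sketch with statsmodels (all variables and data below are invented for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 20
sex = rng.integers(0, 2, n)                  # step 1: sex
bmi = rng.normal(24, 3, n)                   # step 2: one body dimension
vol_quad = rng.normal(1500, 250, n)          # step 3: quadriceps volume
vol_acl = 800 + 40 * sex + 5 * bmi + 0.1 * vol_quad + rng.normal(0, 30, n)

r2 = []
for cols in ([sex], [sex, bmi], [sex, bmi, vol_quad]):
    X = sm.add_constant(np.column_stack(cols))
    r2.append(sm.OLS(vol_acl, X).fit().rsquared)

print("R^2 by step:", [round(v, 3) for v in r2])
print("Delta R^2 for VOL_QUAD:", round(r2[2] - r2[1], 3))
```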
- Published
- 2020
33. Improving the Intelligibility of Speech for Simulated Electric and Acoustic Stimulation Using Fully Convolutional Neural Networks
- Author
- Natalie Yu-Hsien Wang, Yu Tsao, Tao-Wei Wang, Hsin-Min Wang, Xugang Lu, Szu-Wei Fu, and Hsiao Lan Sharon Wang
- Subjects
- Computer science, Speech recognition, Noise reduction, Biomedical Engineering, Intelligibility (communication), Convolutional neural network, Cochlear implant, Internal Medicine, Humans, Noise measurement, General Neuroscience, Deep learning, Rehabilitation, Speech Intelligibility, Electric Stimulation, Speech enhancement, Cochlear Implants, Acoustic Stimulation, QUIET, Speech Perception, Artificial intelligence, Neural Networks, Computer
- Abstract
Combined electric and acoustic stimulation (EAS) has demonstrated better speech recognition than conventional cochlear implant (CI) and yielded satisfactory performance under quiet conditions. However, when noise signals are involved, both the electric signal and the acoustic signal may be distorted, thereby resulting in poor recognition performance. To suppress noise effects, speech enhancement (SE) is a necessary unit in EAS devices. Recently, a time-domain speech enhancement algorithm based on the fully convolutional neural networks (FCN) with a short-time objective intelligibility (STOI)-based objective function (termed FCN(S) in short) has received increasing attention due to its simple structure and effectiveness of restoring clean speech signals from noisy counterparts. With evidence showing the benefits of FCN(S) for normal speech, this study sets out to assess its ability to improve the intelligibility of EAS simulated speech. Objective evaluations and listening tests were conducted to examine the performance of FCN(S) in improving the speech intelligibility of normal and vocoded speech in noisy environments. The experimental results show that, compared with the traditional minimum-mean square-error SE method and the deep denoising autoencoder SE method, FCN(S) can obtain better gain in the speech intelligibility for normal as well as vocoded speech. This study, being the first to evaluate deep learning SE approaches for EAS, confirms that FCN(S) is an effective SE approach that may potentially be integrated into an EAS processor to benefit users in noisy environments.
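A fully convolutional time-domain enhancer is a stack of 1-D convolutions mapping a noisy waveform directly to an enhanced waveform, which is what allows an intelligibility-oriented objective such as STOI to be applied to the output signal; a minimal sketch (depth, widths, and kernel size are arbitrary; the actual FCN(S) architecture and STOI loss follow the cited work):

```python
import torch
import torch.nn as nn

class TimeDomainFCN(nn.Module):
    """Waveform-in, waveform-out enhancer with no fully connected layers."""
    def __init__(self, channels=32, layers=4, k=55):
        super().__init__()
        blocks, cin = [], 1
        for _ in range(layers):
            blocks += [nn.Conv1d(cin, channels, k, padding=k // 2), nn.LeakyReLU()]
            cin = channels
        blocks.append(nn.Conv1d(channels, 1, k, padding=k // 2))
        self.net = nn.Sequential(*blocks)

    def forward(self, wav):                 # wav: (batch, 1, samples)
        return torch.tanh(self.net(wav))    # bounded enhanced waveform

model = TimeDomainFCN()
noisy = torch.randn(2, 1, 16000)            # two 1-second clips at 16 kHz
print(model(noisy).shape)                   # torch.Size([2, 1, 16000])
```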
- Published
- 2020
34. Using Taigi Dramas with Mandarin Chinese Subtitles to Improve Taigi Speech Recognition
- Author
- Hsin-Min Wang, Ming-Tat Ko, Chia-Hua Wu, Pin-Yuan Chen, Shao-Kang Tsao, and Hung-Shin Lee
- Subjects
- Training set, Computer science, Speech recognition, Acoustic model, Subtitle, Written language, Language model, Chinese characters, Mandarin Chinese, Spoken language
- Abstract
An obvious problem with automatic speech recognition (ASR) for Taigi is that the amount of training data is far from sufficient to build a practical ASR system. Collecting speech data with reliable transcripts for training the acoustic model (AM) is feasible but expensive. Moreover, text data for language model (LM) training is extremely scarce and difficult to collect, because Taigi is a spoken language rather than a commonly used written language. Interestingly, the subtitles of Taigi dramas in Taiwan have long been written in Chinese characters for Mandarin. Since a large number of Taigi drama episodes with Mandarin Chinese subtitles are available on YouTube, we propose a method to augment the training data for the AM and LM of Taigi ASR. The idea is to use an initial Taigi ASR system to convert a Mandarin Chinese subtitle into the most likely Taigi word sequence by referring to the speech. Experimental results show that our ASR system can be remarkably improved by such training data augmentation.
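The augmentation step can be viewed as constrained decoding: for each drama utterance, the initial ASR system scores Taigi candidate transcripts derived from the Mandarin subtitle and keeps the most likely one. The sketch below shows only this selection loop with stub scoring functions; the candidate generation and scoring details are assumptions, not the paper's exact procedure.

    import random

    random.seed(0)

    def taigi_candidates(mandarin_subtitle):
        """Stub: enumerate plausible Taigi word sequences for a subtitle
        (e.g., via a Mandarin-to-Taigi lexicon); here it returns dummies."""
        return [f"candidate_{i}" for i in range(5)]

    def am_score(audio, taigi_text):
        return random.uniform(-100, 0)   # stub acoustic-model log-likelihood

    def lm_score(taigi_text):
        return random.uniform(-20, 0)    # stub language-model log-probability

    def best_transcript(audio, mandarin_subtitle, lm_weight=0.5):
        candidates = taigi_candidates(mandarin_subtitle)
        return max(candidates,
                   key=lambda c: am_score(audio, c) + lm_weight * lm_score(c))

    # each selected pair (audio, transcript) augments the AM/LM training data
    print(best_transcript(audio=None, mandarin_subtitle="..."))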
- Published
- 2020
35. Oriental COCOSDA – Country Report 2020 Language Resources Developed in Taiwan
- Author
-
Hsin-Min Wang and Sin-Horng Chen
- Subjects
Speech enhancement ,History ,Task analysis ,Pragmatics ,Linguistics - Published
- 2020
36. SERIL: Noise Adaptive Speech Enhancement Using Regularization-Based Incremental Learning
- Author
-
Yu-Chen Lin, Chi-Chang Lee, Hsin-Min Wang, Hsuan-Tien Lin, and Yu Tsao
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Forgetting ,Training set ,Computer science ,business.industry ,Deep learning ,Adaptation strategies ,Machine learning ,computer.software_genre ,Regularization (mathematics) ,Machine Learning (cs.LG) ,Speech enhancement ,Audio and Speech Processing (eess.AS) ,Digital storage ,Incremental learning ,FOS: Electrical engineering, electronic engineering, information engineering ,Artificial intelligence ,business ,computer ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Numerous noise adaptation techniques have been proposed to fine-tune deep-learning models in speech enhancement (SE) for mismatched noise environments. Nevertheless, adaptation to a new environment may lead to catastrophic forgetting of the previously learned environments. The catastrophic forgetting issue degrades the performance of SE in real-world embedded devices, which often revisit previous noise environments. The nature of embedded devices does not allow solving the issue with additional storage of all pre-trained models or earlier training data. In this paper, we propose a regularization-based incremental learning SE (SERIL) strategy, complementing existing noise adaptation strategies without using additional storage. With a regularization constraint, the parameters are updated to the new noise environment while retaining the knowledge of the previous noise environments. The experimental results show that, when faced with a new noise domain, the SERIL model outperforms the unadapted SE model. Meanwhile, compared with the current adaptive technique based on fine-tuning, the SERIL model can reduce the forgetting of previous noise environments by 52%. The results verify that the SERIL model can effectively adjust itself to new noise environments while overcoming the catastrophic forgetting issue. The results make SERIL a favorable choice for real-world SE applications, where the noise environment changes frequently., Accepted to Interspeech 2020
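The regularization constraint described above penalizes parameter drift away from the values learned on previous noise environments. Below is a minimal PyTorch sketch of such a penalty, in the spirit of elastic-weight-consolidation-style methods; the per-parameter importance weights, their uniform initialization, and the stand-in linear SE model are illustrative assumptions, not the exact SERIL formulation.

    import torch

    def regularized_loss(se_loss, model, old_params, importance, lam=0.1):
        """se_loss: task loss on the new noise domain.
        old_params/importance: dicts snapshotted after the previous domain."""
        penalty = 0.0
        for name, p in model.named_parameters():
            penalty = penalty + (importance[name] * (p - old_params[name]) ** 2).sum()
        return se_loss + lam * penalty

    # usage sketch: snapshot before adapting to a new environment
    model = torch.nn.Linear(257, 257)      # stand-in SE model
    old_params = {n: p.detach().clone() for n, p in model.named_parameters()}
    importance = {n: torch.ones_like(p) for n, p in model.named_parameters()}  # assumed uniform
    x, y = torch.randn(8, 257), torch.randn(8, 257)
    loss = regularized_loss(torch.nn.functional.mse_loss(model(x), y),
                            model, old_params, importance)
    loss.backward()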
- Published
- 2020
37. Combining Deep Embeddings of Acoustic and Articulatory Features for Speaker Identification
- Author
-
Qian-Bei Hong, Chung-Hsien Wu, Chien-Lin Huang, and Hsin-Min Wang
- Subjects
Speech production ,Artificial neural network ,Computer science ,Speech recognition ,Feature vector ,02 engineering and technology ,Convolutional neural network ,Signal ,030507 speech-language pathology & audiology ,03 medical and health sciences ,Feature (computer vision) ,Multilayer perceptron ,0202 electrical engineering, electronic engineering, information engineering ,Embedding ,020201 artificial intelligence & image processing ,0305 other medical science - Abstract
In this study, deep embeddings of acoustic and articulatory features are combined for speaker identification. First, a convolutional neural network (CNN)-based universal background model (UBM) is constructed to generate acoustic feature (AC) embedding. In addition, as articulatory features (AFs) represent important phonological properties of speech production, a multilayer perceptron (MLP)-based model is constructed for AF embedding extraction. The extracted AC and AF embeddings are concatenated as a combined feature vector for speaker identification using a fully-connected neural network. The proposed system was evaluated on three corpora, King-ASR, LibriSpeech, and SITW, and the experiments were conducted according to the properties of the datasets. We adopted all three corpora to evaluate the effect of AF embedding, and the results showed that combining AF embedding into the input feature vector improved the performance of speaker identification. The LibriSpeech corpus was used to evaluate the effect of the number of enrolled speakers. The proposed system achieved an equal error rate (EER) of 7.80%, outperforming the method based on x-vector with PLDA (8.25%). We further evaluated the effect of signal mismatch using the SITW corpus, on which the proposed system achieved an EER of 25.19%, outperforming the other baseline methods.
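The fusion step reduces to concatenating the two embeddings before a fully-connected classifier. Below is a minimal PyTorch sketch with assumed embedding sizes; the paper's actual CNN and MLP extractors are not reproduced here.

    import torch
    import torch.nn as nn

    class FusionSpeakerID(nn.Module):
        def __init__(self, ac_dim=512, af_dim=128, n_speakers=100):
            super().__init__()
            self.classifier = nn.Sequential(
                nn.Linear(ac_dim + af_dim, 256), nn.ReLU(),
                nn.Linear(256, n_speakers))

        def forward(self, ac_emb, af_emb):
            # concatenate acoustic and articulatory embeddings, then classify
            return self.classifier(torch.cat([ac_emb, af_emb], dim=-1))

    model = FusionSpeakerID()
    logits = model(torch.randn(4, 512), torch.randn(4, 128))  # 4 utterances
    print(logits.shape)  # torch.Size([4, 100])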
- Published
- 2020
38. Self-Supervised Denoising Autoencoder with Linear Regression Decoder for Speech Enhancement
- Author
-
Xugang Lu, Yu Tsao, Tassadaq Hussain, Hsin-Min Wang, and Ryandhimas E. Zezario
- Subjects
Computer Science::Machine Learning ,Speech enhancement ,Nonlinear system ,Denoising autoencoder ,Noise ,ComputingMethodologies_PATTERNRECOGNITION ,Training set ,Computer science ,Speech recognition ,Linear regression ,Supervised learning ,Unsupervised learning ,Point (geometry) - Abstract
Nonlinear spectral mapping models based on supervised learning have been successfully applied to speech enhancement. However, as supervised learning approaches, they require a large amount of labelled data (noisy-clean speech pairs) for training. In addition, their performance under unseen noisy conditions is not guaranteed, a common weak point of supervised learning approaches. In this study, we propose an unsupervised learning approach for speech enhancement: a denoising autoencoder with a linear regression decoder (DAELD). The DAELD is trained with noisy speech as both input and target output, in a self-supervised manner. By properly setting a shrinkage threshold on the internal hidden representations, noise can be removed during reconstruction from the hidden representations via the linear regression decoder. Speech enhancement experiments were carried out to test the proposed model. The results confirm that DAELD achieves comparable, and sometimes better, enhancement performance than conventional supervised speech enhancement approaches, in both seen and unseen noise environments. Moreover, DAELD tends to achieve higher performance when the training data cover more diverse noise types and signal-to-noise ratio (SNR) levels.
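The key mechanism, soft-thresholding the hidden representation before a linear decoder, can be sketched in a few lines of NumPy. The encoder weights, threshold value, and dimensions below are illustrative assumptions, not trained parameters.

    import numpy as np

    rng = np.random.default_rng(0)
    D, H = 257, 512                      # spectral bins, hidden units (assumed)
    W_enc = rng.standard_normal((H, D)) * 0.01
    W_dec = rng.standard_normal((D, H)) * 0.01
    tau = 0.05                           # shrinkage threshold (assumed)

    def daeld_forward(x_noisy):
        h = np.maximum(W_enc @ x_noisy, 0.0)                      # nonlinear encoder
        h_shrunk = np.sign(h) * np.maximum(np.abs(h) - tau, 0.0)  # drop small (noisy) activations
        return W_dec @ h_shrunk                                   # linear regression decoder

    x = np.abs(rng.standard_normal(D))   # stand-in noisy spectral frame
    print(daeld_forward(x).shape)        # (257,)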
- Published
- 2020
39. Statistics Pooling Time Delay Neural Network Based on X-Vector for Speaker Verification
- Author
-
Chien-Lin Huang, Qian-Bei Hong, Hsin-Min Wang, and Chung-Hsien Wu
- Subjects
Structure (mathematical logic) ,0209 industrial biotechnology ,Computer science ,Time delay neural network ,Feature vector ,Pooling ,02 engineering and technology ,020901 industrial engineering & automation ,Transformation (function) ,Statistics ,0202 electrical engineering, electronic engineering, information engineering ,Embedding ,020201 artificial intelligence & image processing ,Representation (mathematics) - Abstract
This paper aims to improve the x-vector speaker embedding representation to extract more detailed information for speaker verification. We propose a statistics pooling time delay neural network (TDNN), in which the TDNN structure integrates statistics pooling at each layer to consider the variation of temporal context in the frame-level transformation. The proposed feature vector, named the stats-vector, is compared with the baseline x-vector features on the VoxCeleb dataset and the Speakers in the Wild (SITW) dataset for speaker verification. The experimental results show that the proposed stats-vector with score fusion achieved the best performance on the VoxCeleb1 dataset. Furthermore, considering the interference from other speakers in the recordings, we found that the proposed stats-vector efficiently reduced the interference and improved the speaker verification performance on the SITW dataset.
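The core idea, computing mean and standard deviation over time at every TDNN layer rather than only once before the embedding layer, can be sketched as below. The layer sizes and dilations are assumptions, not the paper's configuration.

    import torch
    import torch.nn as nn

    def stats_pool(x):                    # x: (batch, channels, time)
        return torch.cat([x.mean(dim=2), x.std(dim=2)], dim=1)

    class StatsPoolTDNN(nn.Module):
        def __init__(self, in_dim=30, hid=512):
            super().__init__()
            self.tdnn = nn.ModuleList()
            ch = in_dim
            for d in (1, 2, 3):           # increasing dilation widens temporal context
                self.tdnn.append(nn.Sequential(
                    nn.Conv1d(ch, hid, 5, dilation=d, padding=2 * d), nn.ReLU()))
                ch = hid

        def forward(self, x):             # x: (batch, feat, time)
            pooled = []
            for layer in self.tdnn:
                x = layer(x)
                pooled.append(stats_pool(x))       # per-layer statistics
            return torch.cat(pooled, dim=1)        # utterance-level "stats-vector"

    emb = StatsPoolTDNN()(torch.randn(2, 30, 200))
    print(emb.shape)  # torch.Size([2, 3072])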
- Published
- 2020
40. Lite Audio-Visual Speech Enhancement
- Author
-
Chen-Chou Lo, Shang-Yi Chuang, Hsin-Min Wang, Yu Tsao, Helen Meng, Bo Xu, and Thomas Fang Zheng
- Subjects
FOS: Computer and information sciences ,Sound (cs.SD) ,Computer Science - Computation and Language ,Computer science ,Speech recognition ,Noise reduction ,Feature extraction ,Model parameters ,Online computation ,Computer Science - Sound ,Speech enhancement ,030507 speech-language pathology & audiology ,03 medical and health sciences ,Audio and Speech Processing (eess.AS) ,Face (geometry) ,Audio visual ,FOS: Electrical engineering, electronic engineering, information engineering ,0305 other medical science ,Computation and Language (cs.CL) ,Data compression ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Previous studies have confirmed the effectiveness of incorporating visual information into speech enhancement (SE) systems. Despite improved denoising performance, two problems may be encountered when implementing an audio-visual SE (AVSE) system: (1) additional processing costs are incurred to incorporate visual input and (2) the use of face or lip images may cause privacy problems. In this study, we propose a Lite AVSE (LAVSE) system to address these problems. The system includes two visual data compression techniques and removes the visual feature extraction network from the training model, yielding better online computation efficiency. Our experimental results indicate that the proposed LAVSE system can provide notably better performance than an audio-only SE system with a similar number of model parameters. In addition, the experimental results confirm the effectiveness of the two techniques for visual data compression., Comment: Accepted to Interspeech 2020
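The abstract does not spell out the two visual data compression techniques. A plausible sketch, reducing visual processing cost by spatially downsampling lip frames and quantizing them to 8-bit integers, is shown below; both choices are assumptions for illustration, not necessarily the paper's scheme.

    import numpy as np

    def compress_visual(frames, factor=4):
        """frames: (T, H, W) float images in [0, 1].
        Downsample spatially, then quantize to 8 bits (assumed scheme)."""
        small = frames[:, ::factor, ::factor]              # spatial downsampling
        return np.round(small * 255).astype(np.uint8)      # 8-bit quantization

    frames = np.random.rand(75, 96, 96).astype(np.float32)  # 3 s of 25-fps lip crops
    compact = compress_visual(frames)
    print(compact.shape, compact.dtype)  # (75, 24, 24) uint8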
- Published
- 2020
- Full Text
- View/download PDF
41. WaveCRN: An Efficient Convolutional Recurrent Neural Network for End-to-end Speech Enhancement
- Author
-
Tsun-An Hsieh, Yu Tsao, Hsin-Min Wang, and Xugang Lu
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Sound (cs.SD) ,Computer science ,Pipeline (computing) ,Speech recognition ,Feature extraction ,Machine Learning (stat.ML) ,02 engineering and technology ,Convolutional neural network ,Computer Science - Sound ,Machine Learning (cs.LG) ,Statistics - Machine Learning ,Audio and Speech Processing (eess.AS) ,0202 electrical engineering, electronic engineering, information engineering ,Feature (machine learning) ,FOS: Electrical engineering, electronic engineering, information engineering ,Electrical and Electronic Engineering ,Noise measurement ,Applied Mathematics ,020206 networking & telecommunications ,Speech enhancement ,Recurrent neural network ,Signal Processing ,Noise (video) ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Due to the simple design pipeline, end-to-end (E2E) neural models for speech enhancement (SE) have attracted great interest. To improve the performance of an E2E model, the locality and temporal sequential properties of speech should be efficiently taken into account during modeling. However, in most current E2E models for SE, these properties are either not fully considered or are too complex to be realized. In this paper, we propose an efficient E2E SE model, termed WaveCRN. In WaveCRN, the speech locality feature is captured by a convolutional neural network (CNN), while the temporal sequential property of the locality feature is modeled by stacked simple recurrent units (SRUs). Unlike a conventional temporal sequential model that uses a long short-term memory (LSTM) network, which is difficult to parallelize, SRUs can be efficiently parallelized in calculation with even fewer model parameters. In addition, to more effectively suppress the noise components in the input noisy speech, we derive a novel restricted feature masking (RFM) approach that performs enhancement on the feature maps in the hidden layers; this differs from the common approach in speech separation methods of applying an estimated ratio mask to the noisy spectral features. Experimental results on speech denoising and compressed speech restoration tasks confirm that, with the lightweight architecture of the SRU and the feature-mapping-based RFM, WaveCRN performs comparably with other state-of-the-art approaches with notably reduced model complexity and inference time.
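The restricted feature masking step, applying a learned mask to hidden feature maps rather than to the noisy spectrogram, can be sketched as below. A bidirectional GRU stands in for the SRU (which is not in core PyTorch), and all dimensions are assumed.

    import torch
    import torch.nn as nn

    class TinyWaveCRN(nn.Module):
        def __init__(self, ch=64, kernel=96, stride=48):
            super().__init__()
            self.enc = nn.Conv1d(1, ch, kernel, stride=stride, padding=kernel // 2)
            self.rnn = nn.GRU(ch, ch, bidirectional=True, batch_first=True)  # SRU stand-in
            self.mask = nn.Linear(2 * ch, ch)
            self.dec = nn.ConvTranspose1d(ch, 1, kernel, stride=stride, padding=kernel // 2)

        def forward(self, wav):                     # wav: (batch, 1, samples)
            f = torch.relu(self.enc(wav))           # hidden feature maps
            h, _ = self.rnn(f.transpose(1, 2))
            m = torch.sigmoid(self.mask(h)).transpose(1, 2)
            return self.dec(f * m)                  # restricted feature masking, then decode

    out = TinyWaveCRN()(torch.randn(2, 1, 16000))
    print(out.shape)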
- Published
- 2020
- Full Text
- View/download PDF
42. Melody Harmonization Using Orderless NADE, Chord Balancing, and Blocked Gibbs Sampling
- Author
-
Yi-Wei Chen, Hung-Shin Lee, Chung-En Sun, Yen-Hsing Chen, and Hsin-Min Wang
- Subjects
FOS: Computer and information sciences ,Ground truth ,Sound (cs.SD) ,Computer science ,business.industry ,Process (computing) ,Inference ,Pattern recognition ,Harmonization ,Coherence (statistics) ,Speech processing ,Computer Science - Sound ,Multimedia (cs.MM) ,symbols.namesake ,Audio and Speech Processing (eess.AS) ,symbols ,FOS: Electrical engineering, electronic engineering, information engineering ,Chord (music) ,Artificial intelligence ,business ,Computer Science - Multimedia ,Electrical Engineering and Systems Science - Audio and Speech Processing ,Gibbs sampling - Abstract
Coherence and interestingness are two criteria for evaluating the performance of melody harmonization, which aims to generate a chord progression from a symbolic melody. In this study, we apply the concept of orderless NADE to the training process: the melody and its partially masked chord sequence are taken as the input of BiLSTM-based networks to learn the masked ground truth. In addition, class weights are used to compensate for reasonable chord labels that are rarely seen in the training set. Consistent with the stochasticity in training, blocked Gibbs sampling with a proper number of masking/generating loops is used in the inference phase to progressively trade off the coherence of the generated chord sequence against its interestingness. The experiments were conducted on a dataset of 18,005 melody/chord pairs. Our proposed model outperforms the state-of-the-art system MTHarmonizer in five of six objective metrics based on chord/melody harmonicity and chord progression. Subjective test results with more than 100 participants also show the superiority of our model., Comment: Accepted by ICASSP 2021, and Demo is available at: https://chord-generation.herokuapp.com/demo
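The inference loop alternates between masking a block of chord positions and re-predicting them with the trained model. The sketch below shows only this control flow with a stub predictor; the block size, loop count, and vocabulary size are assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    N_CHORDS, SEQ_LEN = 48, 32          # chord vocabulary and sequence length (assumed)

    def stub_model(melody, chords, mask):
        """Stand-in for the trained orderless-NADE net: returns probabilities
        over chords at every position."""
        probs = rng.random((SEQ_LEN, N_CHORDS))
        return probs / probs.sum(axis=1, keepdims=True)

    def blocked_gibbs(melody, loops=20, block=8):
        chords = rng.integers(N_CHORDS, size=SEQ_LEN)       # random initialization
        for _ in range(loops):
            start = rng.integers(SEQ_LEN - block + 1)
            mask = np.zeros(SEQ_LEN, dtype=bool)
            mask[start:start + block] = True                # mask a contiguous block
            probs = stub_model(melody, chords, mask)
            for t in np.flatnonzero(mask):                  # resample masked positions
                chords[t] = rng.choice(N_CHORDS, p=probs[t])
        return chords

    print(blocked_gibbs(melody=None)[:8])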
- Published
- 2020
- Full Text
- View/download PDF
43. Spoken Multiple-Choice Question Answering Using Multimodal Convolutional Neural Networks
- Author
-
Shang-Bao Luo, Hsin-Min Wang, Hung-Shin Lee, and Kuan-Yu Chen
- Subjects
Word embedding ,Modalities ,Offset (computer science) ,business.industry ,Computer science ,computer.software_genre ,Mandarin Chinese ,Convolutional neural network ,language.human_language ,Question answering ,language ,Artificial intelligence ,business ,Hidden Markov model ,computer ,Natural language processing ,Multiple choice - Abstract
In a spoken multiple-choice question answering (MCQA) task, where passages, questions, and choices are given in the form of speech, usually only the auto-transcribed text is considered in system development. However, the acoustic-level information may contain useful cues for answer prediction. To the best of our knowledge, only a few studies have focused on using the acoustic-level information, or fusing it with the text-level information, for a spoken MCQA task. Therefore, this paper presents a hierarchical multistage multimodal (HMM) framework based on convolutional neural networks (CNNs) to integrate text- and acoustic-level statistics into neural modeling for spoken MCQA. Specifically, the acoustic-level statistics are expected to offset text inaccuracies caused by automatic speech recognition (ASR) systems or representation inadequacy lurking in word embedding generators, thereby making the spoken MCQA system robust. In the proposed HMM framework, the two modalities are first manipulated to separately derive the acoustic- and text-level representations for the passage, question, and choices. Next, these features are jointly involved in inferring the relationships among the passage, question, and choices. Then, a final representation is derived for each choice, which encodes the relationship of the choice to the passage and question. Finally, the most likely answer is determined based on the individual final representations of all choices. Evaluated on the data of “Formosa Grand Challenge - Talk to AI”, a Mandarin Chinese spoken MCQA contest held in 2018, the proposed HMM framework achieves remarkable improvements in accuracy over the text-only baseline.
- Published
- 2019
44. Sequential Speaker Embedding and Transfer Learning for Text-Independent Speaker Identification
- Author
-
Chung-Hsien Wu, Qian-Bei Hong, Ming-Hsiang Su, and Hsin-Min Wang
- Subjects
Computer science ,Speech recognition ,Feature extraction ,Word error rate ,020206 networking & telecommunications ,02 engineering and technology ,Convolutional neural network ,Identification (information) ,0202 electrical engineering, electronic engineering, information engineering ,Spectrogram ,Embedding ,Speaker identification ,020201 artificial intelligence & image processing ,Transfer of learning - Abstract
In this study, an approach to speaker identification based on a convolutional neural network (CNN) model with sequential speaker embedding and transfer learning is proposed. First, a CNN-based universal background model (UBM) is constructed, and a transfer learning mechanism is applied to obtain speaker embedding using a small amount of enrollment data. Second, considering the temporal variation of acoustic features within an utterance, sequential speaker embedding is generated to capture the temporal characteristics of a speaker's speech features. Experiments were conducted on the King-ASR series database for UBM training, and the LibriSpeech corpus was adopted for evaluation. The experimental results showed that the proposed method using sequential speaker embedding and transfer learning achieved an equal error rate (EER) of 6.89%, outperforming the x-vector with PLDA method (8.25%). Furthermore, we considered the effect of the number of enrolled speakers on speaker identification. As the number of enrolled speakers increased from 50 to 1,172, the identification accuracy of the proposed method degraded from 82.99% to 73.26%, whereas that of the x-vector with PLDA method degraded dramatically from 83.17% to 60.95%.
- Published
- 2019
45. Investigation of Neural Network Approaches for Unified Spectral and Prosodic Feature Enhancement
- Author
-
Hsin-Min Wang, Wei-Cheng Lin, Yu Tsao, and Fei Chen
- Subjects
Correctness ,Artificial neural network ,Noise measurement ,Computer science ,business.industry ,Feature extraction ,020206 networking & telecommunications ,Pattern recognition ,02 engineering and technology ,Fundamental frequency ,Intelligibility (communication) ,Extractor ,Speech enhancement ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Artificial intelligence ,business - Abstract
Most speech enhancement (SE) systems focus on spectral-feature or raw-waveform enhancement. However, many speech-related applications rely on features other than the spectral features, such as the intensity and fundamental frequency (f0). Therefore, unified enhancement of different types of features is worth investigating. In this work, we train our neural network (NN)-based SE system in a manner that simultaneously minimizes the spectral loss and preserves the correctness of the intensity and f0 contours extracted from the enhanced speech. The idea is to introduce into the SE framework an NN-based feature extractor that imitates the feature extraction of Praat. We can then train the SE system by minimizing the combined loss over the spectral feature, intensity, and f0. We investigate three bidirectional long short-term memory (BLSTM)-based unified feature enhancement systems: fixed-concat, joint-concat, and multi-task. The results of experiments on the Taiwan Mandarin hearing in noise test (TMHINT) dataset demonstrate that all three systems show improved intensity and f0 extraction accuracy without sacrificing the perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) scores compared with the baseline SE system. Further analysis shows that the improvement mostly comes from better f0 contours under difficult conditions such as low signal-to-noise ratios and nonstationary noises. Our work demonstrates the advantage of unified feature enhancement and provides new insights for SE.
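The training objective combines a spectral loss with intensity and f0 losses computed by an NN-based extractor that imitates Praat. Below is a minimal sketch of such a combined loss with stub linear networks and assumed loss weights; the real systems are BLSTM-based and the extractor is trained separately.

    import torch
    import torch.nn as nn

    se_model = nn.Linear(257, 257)        # stand-in SE network (per-frame)
    extractor = nn.Linear(257, 2)         # stub NN extractor: [intensity, f0] per frame
    for p in extractor.parameters():      # extractor is fixed during SE training
        p.requires_grad_(False)

    def unified_loss(noisy, clean, w_int=0.1, w_f0=0.1):
        enhanced = se_model(noisy)
        spec_loss = nn.functional.mse_loss(enhanced, clean)
        feats_enh, feats_ref = extractor(enhanced), extractor(clean)
        int_loss = nn.functional.mse_loss(feats_enh[..., 0], feats_ref[..., 0])
        f0_loss = nn.functional.mse_loss(feats_enh[..., 1], feats_ref[..., 1])
        return spec_loss + w_int * int_loss + w_f0 * f0_loss

    loss = unified_loss(torch.randn(8, 257), torch.randn(8, 257))
    loss.backward()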
- Published
- 2019
46. Improving Automatic Jazz Melody Generation by Transfer Learning Techniques
- Author
-
Yi-Hsuan Yang, Hsiao-Tzu Hung, Chung-Yang Wang, and Hsin-Min Wang
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Sound (cs.SD) ,MIDI ,Computer science ,business.industry ,computer.file_format ,Machine learning ,computer.software_genre ,Autoencoder ,Computer Science - Sound ,Machine Learning (cs.LG) ,Data modeling ,Generative model ,Audio and Speech Processing (eess.AS) ,FOS: Electrical engineering, electronic engineering, information engineering ,Task analysis ,Artificial intelligence ,Jazz ,business ,Transfer of learning ,computer ,Classifier (UML) ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
In this paper, we tackle the problem of transfer learning for automatic Jazz melody generation. Jazz is a representative musical genre, but the lack of Jazz data in the MIDI format hinders the construction of a generative model for Jazz. Transfer learning is an approach that aims to solve the problem of data insufficiency by transferring common features from one domain to another. In view of its success in other machine learning problems, we investigate whether, and how much, it can help improve automatic music generation for under-resourced musical genres. Specifically, we use a recurrent variational autoencoder as the generative model, a genre-unspecified dataset as the source dataset, and a Jazz-only dataset as the target dataset. Two transfer learning methods are evaluated using six levels of source-to-target data ratios. The first method trains the model on the source dataset and then fine-tunes the resulting model parameters on the target dataset. The second method trains the model on both the source and target datasets at the same time, but adds genre labels to the latent vectors and uses a genre classifier to improve Jazz generation. The evaluation results show that the second method seems to perform better overall, but it cannot take full advantage of the genre-unspecified dataset., Comment: 8 pages, Accepted to APSIPA ASC (Asia-Pacific Signal and Information Processing Association Annual Summit and Conference) 2019
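The first transfer method, pretraining on the genre-unspecified source set and then fine-tuning on the Jazz-only target set, follows the standard recipe sketched below. The model, learning rates, and epoch counts are stand-ins; the paper's actual generator is a recurrent variational autoencoder.

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 128))  # stand-in net

    def train(model, data, lr, epochs):
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            for x in data:
                loss = nn.functional.mse_loss(model(x), x)   # reconstruction stand-in
                opt.zero_grad()
                loss.backward()
                opt.step()

    source = [torch.randn(32, 128) for _ in range(10)]   # genre-unspecified batches
    target = [torch.randn(32, 128) for _ in range(2)]    # scarce Jazz batches
    train(model, source, lr=1e-3, epochs=3)              # pretrain on source
    train(model, target, lr=1e-4, epochs=5)              # fine-tune on Jazz at a lower LR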
- Published
- 2019
47. Compressed Multimodal Hierarchical Extreme Learning Machine for Speech Enhancement
- Author
-
Sabato Marco Siniscalchi, Jia-Ching Wang, Hsin-Min Wang, Yu Tsao, Tassadaq Hussain, and Wen-Hung Liao
- Subjects
Noise measurement ,Computer science ,Speech recognition ,Binary number ,020206 networking & telecommunications ,02 engineering and technology ,Electronic mail ,Visualization ,Speech enhancement ,Signal-to-noise ratio ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Objective evaluation ,Extreme learning machine - Abstract
Recently, model compression, which aims to facilitate the use of deep models in real-world applications, has attracted considerable attention. Several model compression techniques have been proposed to reduce computational costs without significantly degrading the achievable performance. In this paper, we propose a multimodal framework for speech enhancement (SE) that utilizes a hierarchical extreme learning machine (HELM) to enhance the performance of conventional HELM-based SE frameworks, which consider audio information only. Furthermore, we investigate the performance of the HELM-based multimodal SE framework trained with binary weights and quantized input data to reduce the computational requirements. The experimental results show that the proposed multimodal SE framework outperforms the conventional HELM-based SE framework in terms of three standard objective evaluation metrics. The results also show that the performance of the proposed multimodal SE framework is only slightly degraded when the model is compressed through weight binarization and input data quantization.
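The compression scheme, binary weights plus quantized inputs, can be sketched as below. The sign-and-scale binarization and the 8-bit uniform input quantization are common choices used here for illustration and are not necessarily the paper's exact scheme.

    import numpy as np

    rng = np.random.default_rng(0)

    def binarize(W):
        """Replace real weights with sign(W) scaled by the mean magnitude."""
        return np.sign(W) * np.abs(W).mean()

    def quantize_input(x, bits=8):
        """Uniformly quantize inputs in [0, 1] to the given bit depth."""
        levels = 2 ** bits - 1
        return np.round(x * levels) / levels

    W = rng.standard_normal((512, 257))      # one HELM layer (assumed size)
    x = rng.random(257)                      # normalized input feature frame
    h = np.maximum(binarize(W) @ quantize_input(x), 0.0)
    print(h.shape)  # (512,)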
- Published
- 2019
48. Multi-task Learning for Acoustic Modeling Using Articulatory Attributes
- Author
-
Hung-Shin Lee, Hsin-Min Wang, Xuan-Bo Chen, Yueh-Ting Lee, and Jyh-Shing Roger Jang
- Subjects
Artificial neural network ,Time delay neural network ,Computer science ,Speech recognition ,Frame (networking) ,Feature extraction ,Word error rate ,Multi-task learning ,020206 networking & telecommunications ,02 engineering and technology ,0202 electrical engineering, electronic engineering, information engineering ,Task analysis ,020201 artificial intelligence & image processing ,Hidden Markov model - Abstract
In addition to phone sequences, articulatory attributes in spoken utterances have been shown to provide salient cues for supervised training of acoustic models in automatic speech recognition (ASR). In this paper, a multi-task learning (MTL) scheme for neural network-based acoustic modeling is proposed. It aims to simultaneously minimize the cross-entropy losses of the triphone states and articulatory attributes, given their corresponding true alignments. Assuming that the articulatory information associated with the physical production process is not as abstract and composite as the phonetic descriptions, layer-wise neuron sharing occurs only in the first few layers. Moreover, instead of fully-connected feed-forward networks (FFNs), the well-known structure of time-delay neural networks (TDNNs) is adopted to efficiently model the long-term context of each acoustic input frame. The results of experiments on the MATBN Mandarin Chinese broadcast news corpus show that our proposed framework achieves relative character error rate reductions of 3.3% and 5.7% over the non-MTL TDNN-based system and the MTL-FFN-based system, respectively.
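Architecturally, the scheme is a TDNN trunk whose first few layers are shared, followed by separate heads for triphone states and articulatory attributes, trained with the sum of the two cross-entropy losses. The sketch below uses dilated 1-D convolutions as TDNN layers with assumed sizes; it is an illustration of the MTL structure, not the paper's exact configuration.

    import torch
    import torch.nn as nn

    class MTLAcousticModel(nn.Module):
        def __init__(self, feat=40, hid=256, n_states=2000, n_attrs=20):
            super().__init__()
            self.shared = nn.Sequential(                       # shared first layers
                nn.Conv1d(feat, hid, 5, dilation=1, padding=2), nn.ReLU(),
                nn.Conv1d(hid, hid, 3, dilation=2, padding=2), nn.ReLU())
            self.state_head = nn.Conv1d(hid, n_states, 1)      # triphone-state logits
            self.attr_head = nn.Conv1d(hid, n_attrs, 1)        # articulatory-attribute logits

        def forward(self, x):                                  # x: (batch, feat, time)
            h = self.shared(x)
            return self.state_head(h), self.attr_head(h)

    model = MTLAcousticModel()
    state_logits, attr_logits = model(torch.randn(4, 40, 100))
    state_tgt = torch.randint(2000, (4, 100))                  # frame-level alignments
    attr_tgt = torch.randint(20, (4, 100))
    loss = (nn.functional.cross_entropy(state_logits, state_tgt)
            + nn.functional.cross_entropy(attr_logits, attr_tgt))
    loss.backward()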
- Published
- 2019
49. A Rotational Actuator Using a Thermomagnetic-Induced Magnetic Force Interaction
- Author
-
Tien-Kan Chung, Hsin-Min Wang, Chih-Cheng Cheng, and Chin Chung Chen
- Subjects
010302 applied physics ,Stall torque ,Materials science ,Rotational speed ,02 engineering and technology ,Thermomagnetic convection ,Mechanics ,021001 nanoscience & nanotechnology ,01 natural sciences ,Computer Science::Other ,Electronic, Optical and Magnetic Materials ,Magnetic field ,Computer Science::Robotics ,Operating temperature ,0103 physical sciences ,Torque ,Electrical and Electronic Engineering ,0210 nano-technology ,Actuator ,Beam (structure) - Abstract
In this paper, we demonstrate a rotational actuator using a thermomagnetic-induced magnetic force interaction. The actuator consists of a magnetic rotary beam, a stainless-steel bearing, a mechanical frame, thermomagnetic gadolinium sheets, and thermoelectric generators (TEGs). Experimental results show that applying a sequence of currents to the TEGs successfully produces sequential magnetic forces. Consequently, these sequential magnetic forces rotate the beam through full revolutions. When applying a sequence of currents of −0.5 and 1.3 A, the maximum rotation speed and maximum stall torque of the actuator are 3.81 rpm and 136.2 μN·m, respectively. Most importantly, whereas the operating temperatures of other thermomagnetic (and electrothermal) actuators are usually high, the operating temperature of our actuator is approximately room temperature (13 °C–27 °C), which makes it more suitable for practical applications. Given these features, we believe our actuator is an important alternative approach to developing future rotational actuators and motors.
- Published
- 2018
50. An Information Distillation Framework for Extractive Summarization
- Author
-
Kuan-Yu Chen, Hsin-Min Wang, Berlin Chen, and Shih-Hung Liu
- Subjects
Context model ,Acoustics and Ultrasonics ,Computer science ,business.industry ,020206 networking & telecommunications ,Context (language use) ,02 engineering and technology ,computer.software_genre ,Automatic summarization ,030507 speech-language pathology & audiology ,03 medical and health sciences ,Computational Mathematics ,0202 electrical engineering, electronic engineering, information engineering ,Computer Science (miscellaneous) ,Relevance (information retrieval) ,Artificial intelligence ,Electrical and Electronic Engineering ,Paragraph ,0305 other medical science ,Representation (mathematics) ,business ,computer ,Feature learning ,Natural language processing ,Sentence - Abstract
In the context of natural language processing, representation learning has emerged as a newly active research subject because of its excellent performance in many applications. Learning representations of words is a pioneering study in this school of research. However, paragraph (or sentence and document) embedding learning is more suitable/reasonable for some realistic tasks such as document summarization. Nevertheless, classic paragraph embedding methods infer the representation of a given paragraph by considering all of the words occurring in the paragraph. Consequently, those stop or function words that occur frequently may mislead the embedding learning process to produce a misty paragraph representation. Motivated by these observations, our major contributions in this paper are threefold. First, we propose a novel unsupervised paragraph embedding method, named the essence vector (EV) model, which aims at not only distilling the most representative information from a paragraph but also excluding the general background information to produce a more informative low-dimensional vector representation for the paragraph of interest. Second, in view of the increasing importance of spoken content processing, an extension of the EV model, named the denoising essence vector (D-EV) model, is proposed. The D-EV model not only inherits the advantages of the EV model but also can infer a more robust representation for a given spoken paragraph against imperfect speech recognition. Third, a new summarization framework, which can take both relevance and redundancy information into account simultaneously, is also introduced. We evaluate the proposed embedding methods (i.e., EV and D-EV) and the summarization framework on two benchmark summarization corpora. The experimental results demonstrate the effectiveness and applicability of the proposed framework in relation to several well-practiced and state-of-the-art summarization methods.
- Published
- 2018