Database: Academic Search Index / Journal: speech communication / Topic: speech perception - Searchworks@Jio Institute Digital Library Search Results

Showing total 263 results

1. Analysis-by-synthesis based training target extraction of the DNN for noise masking.

Author: Cui, Zihao and Bao, Changchun
Subjects: *ARTIFICIAL neural networks, *AUDITORY masking, *SPEECH enhancement, *SPEECH perception, *SPEECH, *FOURIER transforms, *COMPUTATIONAL complexity
Abstract: • An ideal real-valued ratio mask (IRVRM) extraction method is proposed based on the analysis-by-synthesis (ABS) to import spectral dependency. In the synthesis process, the enhanced speech is obtained by inverse short-time Fourier transform of the masked spectrum, whereas in the analysis process, the IRVRM is obtained by maximizing the speech quality of the reconstructed speech from mask space. • The ABS loop method is proposed to reduce the computational complexity of the ABS-based mask design by loop searching in the iteratively generated subspace. • The generated subspace in this paper is linearly spanned by the projection of a specific basis matrix. The specific basis matrix is the descending direction of the mean square error between the reconstructed speech and clean speech. In conventional speech enhancement methods, the target of noise mask in the time-frequency domain is based on deep neural networks (DNN), such as ideal ratio mask and phase-sensitive mask, which do not consider the dependency of spectrum. In this paper, an ideal real-valued ratio mask (IRVRM) extraction method is proposed based on the analysis-by-synthesis (ABS) for utilizing the dependency of spectrum. In the synthesis process, the enhanced speech is obtained by inverse short-time Fourier transform (ISTFT) of the masked spectrum, whereas in the analysis process, the IRVRM is determined by maximizing speech quality of the reconstructed speech from mask space. The ABS loop algorithm is proposed to reduce computational complexity, namely, the best mask in the specifically generated subspace is determined in each loop. After the ABS loop, the approximated IRVRM is obtained. This IRVRM is further utilized as the training target of the DNN. The experimental results show that when the extracted IRVRM with the ABS loop is employed as the training target of the DNN, the speech quality is effectively improved in the DNN-based noise masking. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

2. Review of analysis methods for speech applications.

Author: O'Shaughnessy, Douglas
Subjects: *SPEECH perception, *AUTOMATIC speech recognition, *SPEECH, *TIME-frequency analysis, *TEXT recognition
Abstract: • Methods used to analyze speech signals for automatic recognition of their associated text, speaker identity, and other pertinent information are reviewed. • The survey focuses on the requirements of different applications, as diverse speech tasks have often used the same methods, despite having very different objectives. • As relevant information in a speech signal is distributed highly non-uniformly, a wide variety of time and frequency analysis techniques is examined. • The utility of methods is noted in terms of performance, using accuracy, complexity, cost, and latency as measures. This paper reviews methods used to analyze speech signals for various applications such as automatic recognition of associated text and speaker identity, and coding. The survey focuses on the requirements of different applications, as diverse speech tasks have often used the same methods, despite having very different objectives. As relevant information in a speech signal is distributed highly non-uniformly, a variety of time and frequency analysis techniques is examined. The utility of methods is noted in terms of performance, using accuracy, complexity, cost, and latency as criteria. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

3. Analyzing the influence of different speech data corpora and speech features on speech emotion recognition: A review.

Author: Rathi, Tarun and Tripathy, Manoj
Subjects: *ARTIFICIAL neural networks, *CONVOLUTIONAL neural networks, *AFFECTIVE computing, *SPEECH perception, *EMOTION recognition
Abstract: • A thorough analysis of the speech data corpora used in various studies, highlighting their size, diversity, and relevance to real-world applications. • An examination of the different speech features employed for emotion classification, including prosody, spectral, and linguistic features, and their impact on classification accuracy. • A comprehensive study table of the dataset features and classifiers commonly used in different research papers on speech emotion classification is presented to emphasize their strengths and limitations. Emotion recognition from speech has become crucial in human-computer interaction and affective computing applications. This review paper examines the complex relationship between two critical factors: the selection of speech data corpora and the extraction of speech features regarding speech emotion classification accuracy. Through an extensive analysis of literature from 2014 to 2023, publicly available speech datasets are explored and categorized based on their diversity, scale, linguistic attributes, and emotional classifications. The importance of various speech features, from basic spectral features to sophisticated prosodic cues, and their influence on emotion recognition accuracy is analyzed. In the context of speech data corpora, this review paper unveils trends and insights from comparative studies exploring the repercussions of dataset choice on recognition efficacy. Various datasets such as IEMOCAP, EMODB, and MSP-IMPROV are scrutinized in terms of their influence on the classification accuracy of the speech emotion recognition (SER) system. At the same time, potential challenges associated with dataset limitations are also examined. Notable features like Mel-frequency cepstral coefficients, pitch, intensity, and prosodic patterns are evaluated for their contributions to emotion recognition. Advanced feature extraction methods, too, are explored for their potential to capture intricate emotional dynamics.
Moreover, this review paper offers insights into the methodological aspects of emotion recognition, shedding light on the diverse machine learning and deep learning approaches employed. Through a holistic synthesis of research findings, this review paper observes connections between the choice of speech data corpus, selection of speech features, and resulting emotion recognition accuracy. As the field continues to evolve, avenues for future research are proposed, ranging from enhanced feature extraction techniques to the development of standardized benchmark datasets. In essence, this review serves as a compass guiding researchers and practitioners through the intricate landscape of speech emotion recognition, offering a nuanced understanding of the factors shaping its recognition accuracy. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

4. Native language identification for Indian-speakers by an ensemble of phoneme-specific, and text-independent convolutions.

Author: Humayun, Mohammad Ali, Yassin, Hayati, and Abas, Pg Emeroylariffion
Subjects: *SPEECH perception, *NATIVE language, *CONVOLUTIONAL neural networks, *SPEECH, *VERBAL behavior testing
Abstract: • The proposed model fuses hierarchical CNNs, applied to vowel segments and the complete utterances of speech, for native language identification. • The model achieves up to 83.6% average accuracy, over 80–20% train-test ratios from the dataset comprising five native Indian languages with 55 speakers each. • The paper presents classification results for short and long durations of speech and individually for all ARPABET vowel segments. • The model effectively employs Low-Pass-Filtering based speech augmentation to improve the classification accuracy. Identifying the social background of an unknown speaker by speech accent has multiple applications including in forensic profiling and adaptation of speech recognition. The most effective accent classification models based on phoneme pronunciation require the presence of certain phonemes in the test speech and hence are applicable only to longer test samples. On the other hand, the text-independent classifiers disregard the phoneme and linguistic information completely. This paper proposes an ensemble of convolutional neural networks for phoneme-based short-term and text-independent long-term classification of speech regarding speaker background profiling. The model is evaluated by classifying the native language of Indian speakers by their English speech. The two classifiers within the ensemble complement each other, giving higher classification accuracy than either classifier achieves individually. Low-pass filtering based speech augmentation has been proven to further improve the classification performance and average accuracy, with up to 79% and 73.7% accuracies achieved for speaker-level and sentence-level tests, respectively. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

5. Combining hybrid DNN-HMM ASR systems with attention-based models using lattice rescoring.

Author: Li, Qiujia, Zhang, Chao, and Woodland, Philip C.
Subjects: *AUTOMATIC speech recognition, *HYBRID systems, *SPEECH perception, *MARKOV processes, *ERROR rates
Abstract: The traditional hybrid deep neural network (DNN)–hidden Markov model (HMM) system and attention-based encoder–decoder (AED) model are both commonly used automatic speech recognition (ASR) approaches with distinct characteristics and advantages. While hybrid systems are per-frame-based and highly modularised to leverage external phonetic and linguistic knowledge, AED models operate on a per-label basis and jointly learn the acoustic and language information using a single model in an end-to-end trainable fashion. In this paper, we propose combining these two approaches in a two-pass rescoring framework. The first-pass uses hybrid ASR systems to facilitate streaming and controllable ASR, and the second-pass re-scores the N-best hypotheses or lattices produced by the first-pass hybrid DNN-HMM system with AED models. We also propose an improved algorithm for lattice rescoring with AED models. Experiments show the combined two-pass systems achieve competitive performance without using extra speech or text data on two standard ASR tasks. For the 80-hour AMI IHM dataset, the combined system has a 13.7% word error rate (WER) on the evaluation set and is up to a 29% relative WER reduction over the individual systems. For the 300-hour Switchboard dataset, the WERs of the combined system are 5.7% and 12.1% on Switchboard and CallHome subsets of Hub5'00, and 13.2% and 7.6% on Switchboard Cellular and Fisher subsets of RT03, and are up to a 33% relative reduction in WER over the individual systems. • Simple and effective framework to combine HMM-based and attention-based ASR systems. • Attention-based models viewed as audio-grounded LMs for 2nd-pass rescoring. • Proposed an improved lattice rescoring algorithm for attention-based models. • Discussed and compared LSTM and Transformer decoders for N-best and lattice rescoring. • Achieved very competitive WERs on AMI and Switchboard datasets. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF
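
Note on record 5: the two-pass idea of re-scoring first-pass N-best hypotheses with a second model can be illustrated with a minimal sketch. It assumes each hypothesis carries a first-pass log score and that a second-pass model supplies its own log score; the function name, the linear interpolation, the `weight` parameter, and the example scores are illustrative assumptions, not the authors' algorithm, and lattice rescoring is not shown.

```python
from typing import List, Tuple


def rescore_nbest(
    nbest: List[Tuple[str, float]],
    second_pass_scores: List[float],
    weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Interpolate first- and second-pass log scores and sort best first.

    nbest: (hypothesis text, first-pass log score) pairs.
    second_pass_scores: one log score per hypothesis from the second model.
    weight: hypothetical interpolation weight given to the second pass.
    """
    if len(nbest) != len(second_pass_scores):
        raise ValueError("need exactly one second-pass score per hypothesis")
    combined = [
        (text, (1.0 - weight) * first + weight * second)
        for (text, first), second in zip(nbest, second_pass_scores)
    ]
    # Higher (less negative) log score is better.
    return sorted(combined, key=lambda item: item[1], reverse=True)


# Made-up scores: the first pass slightly prefers the wrong hypothesis,
# while the second pass strongly prefers the other one.
hypotheses = [("recognise speech", -12.0), ("wreck a nice beach", -11.5)]
second_pass = [-3.0, -9.0]
best_text, best_score = rescore_nbest(hypotheses, second_pass)[0]
```

Setting `weight=0.0` reproduces the first-pass ranking, which makes the effect of the second pass easy to inspect.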

6. Controllable speech synthesis by learning discrete phoneme-level prosodic representations.

Author: Ellinas, Nikolaos, Christidou, Myrsini, Vioni, Alexandra, Sung, June Sig, Chalamandaris, Aimilios, Tsiakoulis, Pirros, and Mastorocostas, Paris
Subjects: *SPEECH synthesis, *VERSIFICATION, *SPEECH, *PROSODIC analysis (Linguistics), *COGNITIVE styles, *SPEECH perception
Abstract: In this paper, we present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels. We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset. These features are fed as an input sequence of prosodic labels to a prosody encoder module which augments an autoregressive attention-based text-to-speech model. We utilize various methods in order to improve prosodic control range and coverage, such as augmentation, F0 normalization, balanced clustering for duration and speaker-independent clustering. The final model enables fine-grained phoneme-level prosody control for all speakers contained in the training set, while maintaining the speaker identity. Instead of relying on reference utterances for inference, we introduce a prior prosody encoder which learns the style of each speaker and enables speech synthesis without the requirement of reference audio. We also fine-tune the multispeaker model to unseen speakers with limited amounts of data, as a realistic application scenario and show that the prosody control capabilities are maintained, verifying that the speaker-independent prosodic clustering is effective. Experimental results show that the model has high output speech quality and that the proposed method allows efficient prosody control within each speaker's range despite the variability that a multispeaker setting introduces. • Prosodic clustering for fine-grained phoneme-level prosody control. • Controllable end-to-end text-to-speech synthesis using intuitive discrete labels. • Multispeaker prosody control, with application to unseen speaker adaptation. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

7. Accurate synthesis of dysarthric Speech for ASR data augmentation.

Author: Soleymanpour, Mohammad, Johnson, Michael T., Soleymanpour, Rahim, and Berry, Jeffrey
Subjects: *AUTOMATIC speech recognition, *SPEECH perception, *SPEECH synthesis, *SPEECH disorders, *INTELLIGIBILITY of speech
Abstract: • Modified a neural multi-talker TTS by adding a dysarthria severity level coefficient and a pause insertion model to synthesize dysarthric speech for varying severity levels. • Providing data augmentation for machine learning tasks such as automatic speech recognition. • Evaluating the effectiveness of the approach for dysarthria-specific speech recognition on the TORGO dataset, with results provided for two different experiments: the first includes augmented speech across 3 severities with pause insertion, and the second includes augmented speech across a larger number of variables that include severity, pause, pitch, energy, and duration. • A relative improvement of 12.2 % on word error rate (WER), demonstrating that using dysarthric synthetic speech to increase the amount of dysarthric-patterned speech for training has the potential for significant impact on dysarthric ASR systems. • Two subjective evaluations of the synthesized dysarthric speech are provided. This includes an evaluation of Dysarthric-ness that shows that the perceived level of the dysarthria tracks with the target synthesized dysarthric severity, as well as an evaluation of speaker similarity that shows a higher rating of similarity between synthesis target speaker and actual speaker when these are the same individual. • A demonstration web page with audio results of the synthesis is available at https://mohammadelc.github.io/SpeechGroupUKY/. Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility through slow, uncoordinated control of speech production muscles. Automatic Speech Recognition (ASR) systems can help dysarthric talkers communicate more effectively. However, robust dysarthria-specific ASR requires a significant amount of training speech, which is not readily available for dysarthric talkers. This paper presents a new dysarthric speech synthesis method for the purpose of ASR training data augmentation.
Differences in prosodic and acoustic characteristics of dysarthric spontaneous speech at varying severity levels are important components for dysarthric speech modeling, synthesis, and augmentation. For dysarthric speech synthesis, a modified neural multi-talker TTS is implemented by adding a dysarthria severity level coefficient and a pause insertion model to synthesize dysarthric speech for varying severity levels. To evaluate the effectiveness for synthesis of training data for ASR, dysarthria-specific speech recognition was used. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves a relative Word Error Rate (WER) improvement of 12.2 % compared to the baseline, and that the addition of the severity level and pause insertion controls decreases WER by 6.5 %, showing the effectiveness of adding these parameters. Overall results on the TORGO database demonstrate that using dysarthric synthetic speech to increase the amount of dysarthric-patterned speech for training has significant impact on the dysarthric ASR systems. In addition, we have conducted a subjective evaluation to evaluate the dysarthricness and similarity of synthesized speech. Our subjective evaluation shows that the perceived dysarthricness of synthesized speech is similar to that of true dysarthric speech, especially for higher levels of dysarthria. Audio samples are available at https://mohammadelc.github.io/SpeechGroupUKY/ [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF
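
Note on record 7: several abstracts in this list report relative rather than absolute WER changes. The sketch below shows the usual arithmetic for a relative reduction. The baseline and improved WER values are invented for illustration and are not taken from the paper; only the 12.2 % figure they reproduce appears in the abstract.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers: a drop from 50.0 % to 43.9 % WER is 6.1 points
# absolute, which is a 12.2 % reduction relative to the baseline.
example = relative_wer_reduction(50.0, 43.9)
```

The distinction matters when comparing records: an absolute reduction of a few points can correspond to a large relative reduction when the baseline WER is low.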

8. Decoupled structure for improved adaptability of end-to-end models.

Author: Deng, Keqi and Woodland, Philip C.
Subjects: *AUTOMATIC speech recognition, *LANGUAGE models, *SPEECH perception, *SPEECH, *NEUROPLASTICITY
Abstract: Although end-to-end (E2E) trainable automatic speech recognition (ASR) has shown great success by jointly learning acoustic and linguistic information, it still suffers from the effect of domain shifts, thus limiting potential applications. The E2E ASR model implicitly learns an internal language model (LM) which characterises the training distribution of the source domain, and the E2E trainable nature makes the internal LM difficult to adapt to the target domain with text-only data. To solve this problem, this paper proposes decoupled structures for attention-based encoder–decoder (Decoupled-AED) and neural transducer (Decoupled-Transducer) models, which can achieve flexible domain adaptation in both offline and online scenarios while maintaining robust intra-domain performance. To this end, the acoustic and linguistic parts of the E2E model decoder (or prediction network) are decoupled, making the linguistic component (i.e. internal LM) replaceable. When encountering a domain shift, the internal LM can be directly replaced during inference by a target-domain LM, without re-training or using domain-specific paired speech-text data. Experiments for E2E ASR models trained on the LibriSpeech-100h corpus showed that the proposed decoupled structure gave 15.1% and 17.2% relative word error rate reductions on the TED-LIUM 2 and AESRC2020 corpora while still maintaining performance on intra-domain data. It is also shown that the decoupled structure can be used to boost cross-domain speech translation quality while retaining the intra-domain performance. • Decoupled structure to retain the adaptation advantage from DNN-HMM in end-to-end models. • Applied to attention-based encoder–decoder and neural transducer models. • Flexible domain adaptation with internal language model directly replaced. • Boosted cross-domain speech recognition accuracy while maintaining competitive intra-domain word error rates. 
• Consistent effectiveness across diverse tasks including the end-to-end speech translation. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

9. Arabic Automatic Speech Recognition: Challenges and Progress.

Author: Besdouri, Fatma Zahra, Zribi, Inès, and Belguith, Lamia Hadrich
Subjects: *AUTOMATIC speech recognition, *CODE switching (Linguistics), *LINGUISTIC complexity, *SPEECH, *SPEECH perception
Abstract: This paper provides a structured examination of Arabic Automatic Speech Recognition (ASR), focusing on the complexity posed by the language's diverse forms and dialectal variations. We first explore the Arabic language forms, delimiting the challenges encountered with Dialectal Arabic, including issues such as code-switching and non-standardized orthography and, thus, the scarcity of large annotated datasets. Subsequently, we delve into the landscape of Arabic resources, distinguishing between Modern Standard Arabic (MSA) and Dialectal Arabic (DA) Speech Resources and highlighting the disparities in available data between these two categories. Finally, we analyze both traditional and modern approaches in Arabic ASR, assessing their effectiveness in addressing the unique challenges inherent to the language. Through this comprehensive examination, we aim to provide insights into the current state and future directions of Arabic ASR research and development. • Limited annotated data hampers the development of Dialectal Arabic speech recognition. • Dialectal variations and non-standard orthography complicate Arabic speech recognition. • Addressing dialects and code-switching is key to effective Arabic ASR systems. • Comparing traditional and modern methods reveals the best approach for Arabic ASR. • Advancing Arabic ASR requires innovative research and extensive data resources. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

10. A unified system for multilingual speech recognition and language identification.

Author: Liu, Danyang, Xu, Ji, Zhang, Pengyuan, and Yan, Yonghong
Subjects: *AUTOMATIC speech recognition, *LANGUAGE policy, *LINGUISTIC identity, *ACOUSTIC models, *SPEECH perception, *SEARCH algorithms
Abstract: In this paper, a multilingual automatic speech recognition (ASR) and language identification (LID) system is designed. In contrast to conventional multilingual ASR systems, this paper takes advantage of the complementarity of the ASR and LID modules. First, the LID module contributes to the language adaptive training of the multilingual acoustic model. Then, the ASR decoding information acts as the confidence metric to balance the LID results. To simulate complex multilingual speech recognition situations, two types of LID strategies are conducted. For a multilingual speech recognition task in which only one language is contained in the speech stream, the language information can be directly determined based on utterance-level judgment. Under this condition, a segment-level statistical component and a two-stage update strategy are designed to assist in the utterance-level language classification. In another multilingual speech recognition task, where the speech stream contains multiple languages simultaneously, the Viterbi language state retrieval method based on neural network (NN) classification is used to perform dynamic detection of the language state. In both cases, the ASR decoding information is used to adjust the language classification results. Without prior knowledge of language identity information, the enhanced LID module achieves an accuracy of 99.3% for utterance-level language judgment and 92.4% for dynamic language detection, and the multilingual ASR system also provides performance comparable to that of monolingual ASR systems. • This paper presents a multilingual speech recognition system. • The ASR module and LID module are constructed in a unified architecture to complement each other. • The LID module contributes to the language adaptive training of the acoustic model. • The ASR decoding information acts as the confidence metric to balance the LID results. 
• The Viterbi beam search algorithm is applied to dynamic language identification. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF
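
Note on record 10: the abstract mentions Viterbi-based retrieval of the language state from neural network classifications. A minimal sketch of that general technique follows, assuming per-segment language posteriors are already available and using a single hypothetical `switch_prob` parameter to penalise language changes. The paper's actual state topology, its beam search, and its use of ASR decoding information are not reproduced here.

```python
import math
from typing import List, Sequence


def viterbi_language_path(
    posteriors: Sequence[Sequence[float]], switch_prob: float = 0.1
) -> List[int]:
    """Return the most likely language index per segment.

    posteriors[t][k] is the classifier's probability of language k at
    segment t. switch_prob is an assumed probability of changing language
    between consecutive segments, shared equally among the other languages.
    """
    if not posteriors:
        return []
    n_lang = len(posteriors[0])
    eps = 1e-12  # guards against log(0)
    stay = math.log(1.0 - switch_prob)
    switch = math.log(switch_prob / (n_lang - 1)) if n_lang > 1 else float("-inf")

    score = [math.log(p + eps) for p in posteriors[0]]
    backpointers: List[List[int]] = []
    for frame in posteriors[1:]:
        new_score, pointers = [], []
        for j in range(n_lang):
            # Best previous state for landing in state j.
            best_i = max(
                range(n_lang),
                key=lambda i: score[i] + (stay if i == j else switch),
            )
            transition = stay if best_i == j else switch
            new_score.append(score[best_i] + transition + math.log(frame[j] + eps))
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_lang), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Made-up posteriors for two languages; the third segment weakly favours
# language 1, but the switch penalty smooths it away.
demo = [[0.9, 0.1], [0.8, 0.2], [0.45, 0.55], [0.9, 0.1],
        [0.1, 0.9], [0.2, 0.8], [0.1, 0.9]]
demo_path = viterbi_language_path(demo, switch_prob=0.05)
```

A per-segment argmax on the same input would flip languages at the third segment; the transition penalty is what makes the decoded path change language only once.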

11. Low-resource automatic speech recognition and error analyses of oral cancer speech.

Author: Halpern, Bence Mark, Feng, Siyuan, van Son, Rob, van den Brekel, Michiel, and Scharenborg, Odette
Subjects: *AUTOMATIC speech recognition, *SPEECH perception, *ORAL cancer, *SPEECH, *ACOUSTIC models, *FILM adaptations
Abstract: In this paper, we introduce a new corpus of oral cancer speech and present our study on the automatic recognition and analysis of oral cancer speech. A two-hour English oral cancer speech dataset is collected from YouTube. Formulating this as a low-resource oral cancer ASR task, we investigate three acoustic modelling approaches that previously have worked well with low-resource scenarios using two different architectures, a hybrid architecture and a transformer-based end-to-end (E2E) model: (1) a retraining approach; (2) a speaker adaptation approach; and (3) a disentangled representation learning approach (only using the hybrid architecture). The approaches achieve a (1) 4.7% (hybrid) and 7.5% (E2E); (2) 7.7%; and (3) 2.0% absolute word error rate reduction, respectively, compared to a baseline system which is not trained on oral cancer speech. A detailed analysis of the speech recognition results shows that (1) plosives and certain vowels are the most difficult sounds to recognise in oral cancer speech (this problem is successfully alleviated by our proposed approaches); (2) however, these sounds are also relatively poorly recognised in the case of healthy speech, with the exception of /p/; (3) recognition performance of certain phonemes is strongly data-dependent; (4) in terms of the manner of articulation, E2E performs better with the exception of vowels, although vowels have a large contribution to overall performance. As for the place of articulation, vowels, labiodentals, dentals and glottals are better captured by hybrid models, while E2E is better on bilabial, alveolar, postalveolar, palatal and velar information. (5) Finally, our analysis provides some guidelines for selecting words that can be used as voice commands for ASR systems for oral cancer speakers. • We introduce a new annotated dataset of oral cancer speech. • We propose three different training strategies to recognise oral cancer speech. 
• Plosives are recognised well, even though these sounds are known to be impacted. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

12. Perceptual effects of interpolated Austrian and German standard varieties.

Author: Pucher, Michael, Kranawetter, Katharina, Reinisch, Eva, Koppensteiner, Wolfgang, and Lenz, Alexandra
Subjects: *INTELLIGIBILITY of speech, *GERMAN language, *VARIATION in language, *JUDGMENT (Psychology), *STANDARD language, *FREQUENCY spectra, *GEOGRAPHIC boundaries
Abstract: • Pluricentric Research on German by means of a listener judgment experiment. • Interpolation of speech samples between Austrian and German German standard language. • Interpolated samples are perceived as a continuum by Austrian and German listeners. • Listener groups judge the original and interpolated speech as closer to their own respective standard. • Judgments are based more on spectral information than the F0 contour. This article focuses on the perception of standard varieties produced by Austrian and German TV newscasters from the perspective of listeners from both countries, Germany and Austria. Thus, the paper's sociolinguistic scope is located in the pluricentric realm. It assesses (the perception of) standard language variation on fine phonetic levels. In addition to naturally produced stimuli (read sentences), "intermediate" samples were generated by means of a two-step interpolation procedure: First, the symbolic phone sequences were aligned with a Levenshtein-distance algorithm. Second, a phone-level Dynamic-Time-Warping (DTW) algorithm was applied to align the two utterances on a frame level, taking phoneme boundaries into account. Additionally, the spectrum and the fundamental frequency (F0) contour of the utterances were manipulated to be either interpolated or fixed to a given speaker. This procedure allowed for assessing these features' contribution to listeners' judgments of whether a given utterance sounds as if spoken by a speaker from (rather) Germany or (rather) Austria. Results of the judgment task showed that the interpolated samples were perceived in a continuous fashion, similarly by both groups, and with overall greater reliance on spectral than F0 information. These findings suggest that even fine phonetic differences between Standard German from Germany and Austria are recognized and evaluated by listeners from both countries. 
An overall bias towards perceiving all speech samples as closer to the listeners' own national standard points to a factor of familiarity with one's own and uncertainty towards the 'foreign' standard. The similar degree of this bias indicates that both groups have similar levels of familiarity with their own and uncertainty about the other standard variety. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF
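
Note on record 12: the first interpolation step described in the abstract, aligning two symbolic phone sequences with a Levenshtein-distance algorithm, can be sketched as below. Unit costs for substitution, insertion, and deletion are an assumption, as are the function name and the toy phone sequences; the second step (frame-level DTW) is not shown.

```python
from typing import List, Optional, Sequence, Tuple

Pair = Tuple[Optional[str], Optional[str]]


def align_phones(a: Sequence[str], b: Sequence[str]) -> List[Pair]:
    """Align two phone sequences with unit-cost Levenshtein distance.

    Returns (phone_from_a, phone_from_b) pairs; None marks a gap, i.e. a
    phone present in only one of the two sequences.
    """
    n, m = len(a), len(b)
    # dist[i][j] = edit distance between a[:i] and b[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = dist[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            dist[i][j] = min(substitution, dist[i - 1][j] + 1, dist[i][j - 1] + 1)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


# Toy sequences (not from the paper): the second has one extra phone.
alignment = align_phones(["k", "a", "t"], ["k", "a", "r", "t"])
```

When several alignments share the minimum cost, this backtrace prefers a match or substitution over a gap; a different tie-breaking order would be equally valid.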

13. Automatic assessment of English proficiency for Japanese learners without reference sentences based on deep neural network acoustic models.

Author: Fu, Jiang, Chiba, Yuya, Nose, Takashi, and Ito, Akinori
Subjects: *ARTIFICIAL neural networks, *ACOUSTIC models, *PSYCHOACOUSTICS, *COMPUTER assisted language instruction, *GAUSSIAN mixture models, *ABILITY, *SPEECH perception
Abstract: • A novel machine score for automatic pronunciation evaluation is proposed: Reference-free Error Rate (RER). • The non-native acoustic models and native ones are combined together as an ASR-based automatic English proficiency evaluation system. • The DNN-based acoustic models significantly improved the accuracy of recognition. • The established evaluation system has the ability to evaluate the utterance from the speaker without knowing the transcription in advance. • The performance of the proposed RER score has a high correlation with human proficiency score. Speech-based computer-assisted language learning (CALL) systems should recognize the utterances of the learner with high accuracy and evaluate the language proficiency of the specific speaker with appropriate methods. In this paper, we discuss the automatic assessment of the second language (L2) for non-native speakers. There are many existing works on pronunciation evaluation by applying the goodness of pronunciation (GOP) method. This paper introduces an automatic proficiency evaluation system that combines various kinds of non-native acoustic models and native ones, such as Gaussian mixture model (GMM)-hidden Markov model (HMM) and deep neural network (DNN)-HMM. Most of existing works assume that we know the transcription of an utterance (the reference sentence) when evaluating the utterance, especially in reading and repeating tasks. To realize a reference-free proficiency evaluation, we propose a novel machine score named as the reference-free error rate (RER) to evaluate English proficiency. In our experiments, the DNN-based non-native acoustic models outperformed the traditional acoustic models on non-native speech recognition. Thus, we calculated the RER by regarding the recognition result from the DNN-based non-native acoustic model as "reference" and the result from the native acoustic model as "recognition result". 
The proposed RER has high correlation with human proficiency scores, which indicates the effectiveness of RER for automatically estimating the proficiency. By combining the RER with other machine scores such as the log-likelihood scores, we obtained high correlation (reading aloud task: r = 0.826, p < 0.001, N = 190; constrained interactive dialogue task: r = 0.803, p < 0.001, N = 26; spontaneous English conversation task: r = 0.799, p < 0.001, N = 28) with the human scores. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

14. Uneven success: automatic speech recognition and ethnicity-related dialects.

Author: Wassink, Alicia Beckford, Gansen, Cady, and Bartholomew, Isabel
Subjects: *AUTOMATIC speech recognition, *SPEECH perception, *RACISM, *DIALECTS, *ERROR rates, *SPEECH, *NATIVE Americans, *ETHNIC groups
Abstract: • Comparison of accuracy of the Microsoft Speech Services conversational ASR system finds that phonetic error rates are higher for speech samples of nonwhite than for white speakers. • Sociophonetic variables account for 20% of ASR errors. • CLOx allows automated transcription of conversational speech in one-fifth of the time required to manually produce an orthographic transcription. • Steady improvements to ASR systems greatly expedite and simplify the task of sociolinguistic data analysis. Addressing racial bias in automatic speech recognition is an area of concern in fields associated with human-computer interaction. Research to date suggests that sociolinguistic variation, namely systematic sources of sociophonetic variation, has yet to be extensively exploited in acoustic model architectures. This paper reports a study that evaluates the performance of one ASR system for a multi-ethnic sample of speakers from the American Pacific Northwest (including Native American, African American, European American and ChicanX speakers). Using a sociophonetic approach to characterizing vocalic and consonantal variation, we ask which dialect features appear to be most challenging for our ASR system. We also ask which error types are particular to the four ethnic dialects sampled. Recordings of both conversational and read speech were coded for a common set of 18 sociophonetic variables with distinct phonetic profiles. Automatic transcription was achieved using CLOx, a custom-built ASR system created for sociolinguistic analysis. Normalized error frequency rates were compared across ethnic samples to evaluate CLOx performance. The normalized error frequency rates demonstrate clear differential performance in the ASR system, pointing to racial bias in system output. Specific predictions are made regarding approaches that might be taken to leverage sociophonetic knowledge to improve social dialect-recognition accuracy in ASR systems. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

15. Multilingual speech recognition for GlobalPhone languages.

Author: Tachbelie, Martha Yifiru, Abate, Solomon Teferra, and Schultz, Tanja
Subjects: *SPEECH perception, *AUTOMATIC speech recognition, *LINGUISTIC complexity, *ACOUSTIC models
Abstract: In this paper, we present our investigations towards the development of multilingual automatic speech recognition (ML ASR) systems using the GlobalPhone database. In addition to GlobalPhone, we have included 4 Ethiopian languages (Amharic, Oromo, Tigrigna and Wolaytta), as well as Uyghur and English in our investigation. In order to see the impact of language relatedness in ML ASR training, we have analyzed both phonetic overlap and morphological complexity of the languages. Deep Neural Network based ML ASR systems have been developed using ML mix, transfer and multitask learning approaches. Relative word error rate (WER) reductions up to 33.21% have been achieved as a result of using resources of other languages in ML acoustic model training. Our experimental results show that languages with small amounts of monolingual training data benefit a lot from ML training. Moreover, using phonetically related languages in ML training is more beneficial than using phonetically less related languages. It seems that the nature of the corpus (single or mixed domain, noisy or noise free, etc.) also has an impact on ML training, although it is not as important as the phonetic relatedness of the languages. • MLASR for low-resourced languages using data from 26 languages and three approaches • Transfer learning leads to better performance for most of the languages • Languages with small amount of data benefit a lot from ML training • High WER reduction is obtained when phonetically related languages are used in MLASR • Using resources of any language is better than using small monolingual data in ASR [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

16. Survey on bimodal speech emotion recognition from acoustic and linguistic information fusion.

Author: Atmaja, Bagus Tris, Sasou, Akira, and Akagi, Masato
Subjects: *EMOTION recognition, *AUTOMATIC speech recognition, *EMOTIONS, *SPEECH perception, *FEATURE extraction, *AFFECTIVE computing, *SUPPORT vector machines
Abstract: Speech emotion recognition (SER) is traditionally performed using merely acoustic information. Acoustic features, commonly extracted per frame, are mapped into emotion labels using classifiers such as support vector machines for machine learning or multi-layer perceptrons for deep learning. Previous research has shown that acoustic-only SER suffers from many issues, chiefly low performance. On the other hand, not only acoustic information but also linguistic information can be extracted from speech. The linguistic features can be extracted from the text transcribed by an automatic speech recognition system. The fusion of acoustic and linguistic information could improve the SER performance. This paper presents a survey of the works on bimodal emotion recognition fusing acoustic and linguistic information. Five components of bimodal SER are reviewed: emotion models, datasets, features, classifiers, and fusion methods. Some major findings, including state-of-the-art results and their methods from the commonly used datasets, are also presented to give insights for the current research and to surpass these results. Finally, this survey proposes the remaining issues in the bimodal SER research for future research directions. • Survey of new paradigms in emotion recognition by fusing acoustics and linguistics • Review more than a hundred works from model-based to data-driven approaches • Describe fusion methods by their similarities with their descriptions • Discuss current findings from the major datasets and future challenges [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

17. Unsupervised Automatic Speech Recognition: A review.

Author: Aldarmaki, Hanan, Ullah, Asad, Ram, Sreepratha, and Zaki, Nazar
Subjects: *AUTOMATIC speech recognition, *SPEECH perception, *SPEECH, *MAPS
Abstract: Automatic Speech Recognition (ASR) systems can be trained to achieve remarkable performance given large amounts of manually transcribed speech, but large labeled data sets can be difficult or expensive to acquire for all languages of interest. In this paper, we review the research literature to identify models and ideas that could lead to fully unsupervised ASR, including unsupervised sub-word and word modeling, unsupervised segmentation of the speech signal, and unsupervised mapping from speech segments to text. The objective of the study is to identify the limitations of what can be learned from speech data alone and to understand the minimum requirements for speech recognition. Identifying these limitations would help optimize the resources and efforts in ASR development for low-resource languages. • Steps in unsupervised speech recognition include segmentation, embedding, clustering, and mapping. • Unsupervised segmentation can be performed at the level of phones, syllables, or words. • Unsupervised word segmentation can be used for word-level speech-to-text mapping. • Speech segments can be automatically mapped to corresponding text segments using adversarial networks. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

18. Neural speech-rate conversion with multispeaker WaveNet vocoder.

Author: Okamoto, Takuma, Matsubara, Keisuke, Toda, Tomoki, Shiga, Yoshinori, and Kawai, Hisashi
Subjects: *VOCODER, *CORPORA, *INFERENCE (Logic), *SPEECH perception
Abstract: Speech-rate conversion technology, which can expand or compress speech waveforms while preserving the pitch of the sound, is traditionally realized by signal-processing-based approaches. To improve the synthesis quality, this paper proposes a machine-learning-based approach using neural vocoders, to perform neural speech-rate conversion. The proposed approach introduces a multispeaker WaveNet vocoder trained with a multispeaker corpus. Speech-rate conversion for many and unspecified speakers, not included in the training data, is realized by resampling acoustic features or hidden features along the time direction in inference. In experiments, the multispeaker WaveNet vocoder was trained using the JVS corpus and two types of resampling methods were compared. Conventional WSOLA and STRAIGHT were also compared as signal-processing-based baselines. The test sets included Japanese speaker corpora for the monolingual condition, and an English multispeaker corpus (CMU ARCTIC) for the cross-lingual condition. The results of the experiments demonstrate that the proposed approach with resampling of hidden features can achieve higher quality speech-rate conversion than the conventional methods, in both monolingual and cross-lingual conditions, except for speakers with low fundamental frequency in conversion of fast speech. • Neural-network-based speech-rate conversion is proposed. • Neural speech-rate conversion is realized by expanding or compressing hidden features along the time direction. • Proposed neural speech-rate conversion outperforms conventional WSOLA and STRAIGHT. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

19. Pronunciation error detection model based on feature fusion.

Author: Zhu, Cuicui, Wumaier, Aishan, Wei, Dongping, Fan, Zhixing, Yang, Jianlei, Yu, Heng, Kadeer, Zaokere, and Wang, Liejun
Subjects: *RECOGNITION (Psychology), *PRONUNCIATION, *SPEECH perception, *ERROR functions, *PHONEME (Linguistics)
Abstract: Mispronunciation detection and diagnosis (MDD) is a specific speech recognition task that aims to recognize the phoneme sequence produced by a user, compare it with the standard phoneme sequence, and identify the type and location of any mispronunciations. However, the lack of large amounts of phoneme-level annotated data limits the performance improvement of the model. In this paper, we propose a joint training approach, Acoustic Error_Type Linguistic (AEL) that utilizes the error type information, acoustic information, and linguistic information from the annotated data, and achieves feature fusion through multiple attention mechanisms. To address the issue of uneven distribution of phonemes in the MDD data, which can cause the model to make overconfident predictions when using the CTC loss, we propose a new loss function, Focal Attention Loss, to improve the performance of the model, such as F1 score accuracy and other metrics. The proposed method in this paper was evaluated on the TIMIT and L2-Arctic public corpora. In ideal conditions, it was compared with the baseline model CNN-RNN-CTC. The F1 score, diagnostic accuracy, and precision were improved by 31.24%, 16.6%, and 17.35% respectively. Compared to the baseline model, our model reduced the phoneme error rate from 29.55% to 8.49% and showed significant improvements in other metrics. Furthermore, experimental results demonstrated that when we have a model capable of accurately obtaining pronunciation error types, our model can achieve results close to the ideal conditions. • The utilization of pronunciation error types in the pronunciation error detection model significantly enhances its performance. • Jointly using Focal loss and multi-task loss effectively resolves overconfidence caused by CTC loss. • The model excels across multiple evaluation metrics by incorporating joint loss functions and error type information. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF
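
The abstract above reports the phoneme error rate falling from 29.55% to 8.49%. As a small worked example, the sketch below converts those two figures into an absolute and a relative reduction; the function names are hypothetical, and the relative figure is derived here rather than reported in the abstract.

```python
def absolute_reduction(baseline_pct, new_pct):
    """Difference in percentage points between two error rates."""
    return baseline_pct - new_pct


def relative_reduction(baseline_pct, new_pct):
    """Reduction expressed as a percentage of the baseline error rate."""
    if baseline_pct <= 0:
        raise ValueError("the baseline error rate must be positive")
    return 100.0 * (baseline_pct - new_pct) / baseline_pct


# Phoneme error rates quoted in the abstract (baseline vs. proposed model).
per_abs = absolute_reduction(29.55, 8.49)  # percentage points
per_rel = relative_reduction(29.55, 8.49)  # percent of the baseline
print(f"absolute: {per_abs:.2f} points, relative: {per_rel:.2f}%")
```

The same relative form is what other records in this list mean by a "relative WER reduction" or "relative CER reduction", so figures stated that way are not directly comparable with absolute percentage-point differences.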

20. A two-level Item Response Theory model to evaluate speech synthesis and recognition.

Author: Oliveira, Chaina S., Moraes, João V.C., Filho, Telmo Silva, and Prudêncio, Ricardo B.C.
Subjects: *ITEM response theory, *SPEECH synthesis, *SPEECH perception, *AUTOMATIC speech recognition, *MODEL theory, *VERBAL behavior testing
Abstract: Automatic speech recognition (ASR) systems should be tested ideally using diverse speech test data. A promising alternative to produce such test data is to synthesize speeches from diverse sentences and speakers. However, despite the great amount of test data that can be produced, not all speeches are equally relevant. This paper proposes a two-level Item Response Theory (IRT) model to simultaneously evaluate ASR systems, speakers and sentences. In the first level, the transcription rates obtained by a pool of ASR systems on a set of synthesized speeches are recorded and then analyzed to estimate: each speech's difficulty and each ASR system's ability. In the second level, each speech's difficulty is decomposed as a function of two factors: the sentence's difficulty and the speaker's quality. Thus, the speech's difficulty is high when generated from a difficult sentence and a bad speaker, while an ASR is good when it is robust to hard speeches. Performed experiments revealed useful insights on how the quality of speech synthesis and recognition can be affected by distinct factors (e.g., sentence difficulty and speaker ability). • An original solution for simultaneously evaluating speech synthesis and recognition using Item Response Theory. • The difficulty of a synthesized speech depends on the performance of automatic speech recognition systems with different abilities when transcribing it. • Specific sentences may have a more significant influence on the synthesis quality than the speakers' abilities. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

21. Data augmentation based non-parallel voice conversion with frame-level speaker disentangler.

Author: Chen, Bo, Xu, Zhihang, and Yu, Kai
Subjects: *DATA augmentation, *PROSODIC analysis (Linguistics), *DATABASES, *SPEECH synthesis, *DATA conversion, *SPEECH perception
Abstract: Non-parallel data voice conversion is a popular and challenging research area. The main task is to build acoustic mappings from the source speaker to the target speaker in different units (e.g., frame, phoneme, cluster, sentence). With the help of the recent high-quality speech synthesis techniques, it is possible to directly produce parallel speech using non-parallel data. This paper proposes ParaGen: a data augmentation based technique for non-parallel data voice conversion. The system consists of a speaker disentangler based text-to-speech model and a simple frame-to-frame spectrogram conversion model. The text-to-speech model takes the text and reference audio as input to produce the speech in the target speaker identity with the time-aligned local speaking style from the reference audio. The spectrogram conversion model directly converts the source spectrogram to the target speaker framewisely. The local speaking style is extracted by an acoustic encoder while the speaker identity is eliminated by a conditional convolutional disentangler. The local style encodings are time-aligned with the text encodings by the attention mechanism. The attention contexts are decoded by a conditional recurrent decoder. The experiment shows that the speaker identity of the source speech is converted to the target speaker while the local speaking style (e.g., prosody) is preserved after the augmentation. The method is compared to the augmentation model with typical statistical parameter speech synthesis (SPSS) with pre-aligned phoneme duration. The result shows that the converted speech has better speech naturalness than the SPSS system, while the speaker similarities of the converted speech are close. • We propose a data augmentation based technique for non-parallel voice conversion. • It produces time-aligned parallel data with the same frame-level speaking style. • We use the frame-level adversarial loss to reduce the speaker identity. 
• We propose two separate speaker embeddings before and after the attention mechanism. • We use stacked 2D CNNs with conditional 1D CNNs to extract local speaking style. • We can use a simple network to build voice conversion model with the augmented data. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

22. Seamless equal accuracy ratio for inclusive CTC speech recognition.

Author: Gao, Heting, Wang, Xiaoxuan, Kang, Sunghun, Mina, Rusty, Issa, Dias, Harvill, John, Sari, Leda, Hasegawa-Johnson, Mark, and Yoo, Chang D.
Subjects: *AUTOMATIC speech recognition, *SPEECH perception, *ARRANGEMENT (Musical composition), *HISPANIC Americans, *ERROR rates, *AFRICAN Americans
Abstract: Concerns have been raised regarding performance disparity in automatic speech recognition (ASR) systems as they provide unequal transcription accuracy for different user groups defined by different attributes that include gender, dialect, and race. In this paper, we propose "equal accuracy ratio", a novel inclusiveness measure for ASR systems that can be seamlessly integrated into the standard connectionist temporal classification (CTC) training pipeline of an end-to-end neural speech recognizer to increase the recognizer's inclusiveness. We also create a novel multi-dialect benchmark dataset to study the inclusiveness of ASR, by combining data from existing corpora in seven dialects of English (African American, General American, Latino English, British English, Indian English, Afrikaaner English, and Xhosa English). Experiments on this multi-dialect corpus show that using the equal accuracy ratio as a regularization term along with CTC loss, succeeds in lowering the accuracy gap between user groups and reduces the recognition error rate compared with a non-regularized baseline. Experiments on additional speech corpora that have different user groups also confirm our findings. • A seven-dialect corpus is created by combining existing English corpora. • "Equal accuracy ratio" is proposed as an inclusiveness measure of speech recognizers. • Inclusiveness regularization to the neural network can reduce performance disparity. • Carefully tuned inclusiveness regularization can improve average recognition accuracy. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

23. Progress of machine learning based automatic phoneme recognition and its prospect.

Author: Malakar, Mousumi and Keskar, Ravindra B.
Subjects: *AUTOMATIC speech recognition, *MACHINE learning, *PHONEME (Linguistics), *SPEECH perception, *ALGORITHMS
Abstract: • A comprehensive survey related to automatic phoneme recognition is presented. • Necessity of explicit phoneme recognition, its applications and scope of machine learning tools for the same are highlighted. • Both acoustic-phonetic and machine learning approaches are discussed, with focus on machine learning based approaches and their descriptions. • Accuracies of various machine learning techniques in different research works are presented in a tabular form. A phoneme is the smallest perceptually distinct sound unit that can be distinguished among words in a particular language. Every language has its own set of phonemes, and all possible words can be considered as ordered sequences of phonemes. The total number of phonemes contained in a language is always very small in comparison to the size of the vocabulary supported by the language. These facts have made phoneme recognition an attractive proposition throughout the history of Automatic Speech Processing (ASP) to date. As a result, the classification and recognition of phonemes are considered the primary tasks of automatic speech recognition (ASR) systems irrespective of application domain. The dynamic nature of phonemes and several sources of their variability create many barriers to the accurate identification of phonemes from an acoustic signal. The contribution of Machine Learning (ML) based techniques in overcoming these obstructions in automatic phoneme recognition (APR) is remarkable. Nowadays, with large amounts of data available, ML based ASR is preferred over acoustic-phonetic based methods because of its simplicity. The ML based techniques do not follow the conventional method based on identification of acoustic properties. Rather, ML techniques build their own trained model (algorithm) using readily available data. They do so by finding the hidden patterns in speech signals, and acquire predictive intelligence through learning.
Therefore, ML techniques can be said to provide a more generalized model for phoneme classification. In this paper, we present a comprehensive survey of ML tools to build phoneme recognizers. We also highlight some applications of speech (especially phoneme) recognition which illustrate the current scope as well as future prospects of APR. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

24. A study on the perception of prosodic cues to focus by Egyptian listeners: Some make use of them, but most of them don't.

Author: Zarka, Dina El and Hödl, Petra
Subjects: *PERCEPTION testing, *EGYPTIANS, *SPEECH perception, *PROSODIC analysis (Linguistics)
Abstract: • Speakers of a non-deaccenting language are less successful than speakers of a deaccenting language in identifying the information structure of utterances perceptually solely on the basis of prosodic differences. • Where identification is successful, listeners use on-focus acoustic cues as well as post-focus compression to identify focus. • Significant acoustic differences in production do not automatically imply successful perceptual identification. • High inter-speaker and intra-speaker variation as well as non-categorical (less salient) prosodic differences may lead to a weaker conventionalization of a form-to-meaning mapping. The role of prosody in the encoding of information structure in Egyptian Arabic is not yet fully understood. A recent production study found significant acoustic differences between morpho-syntactically identical utterances expressing a topic-comment and a focus-background structure. This paper presents the results from two perception experiments testing Egyptian Arabic listeners' ability to infer the two pragmatically different structures from acoustically different prosodic realizations. In the first experiment, naturalistic stimuli were used while in the second experiment, the stimuli were gated after the first target word whose pitch contour was manipulated. In addition, the first experiment was repeated with German-speaking listeners to compare their performance to that of the Egyptian listeners. The results suggest that Egyptian listeners perform only slightly above chance-level in both conditions, while the German-speaking listeners were more successful (75% correct). We interpret the results as a difference in conventionalization of the mapping between prosodic form and pragmatic meaning. This difference may be due to the non-categorical character of post-focus compression and the strong variation within and across speakers in Egyptian Arabic. 
Thus, the results lend some support to the distinction between 'deaccenting' and 'non-deaccenting' languages in terms of the conventionalization of a prosodic structure. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

25. Data augmentation using generative adversarial networks for robust speech recognition.

Author: Qian, Yanmin, Hu, Hu, and Tan, Tian
Subjects: *SPEECH perception, *ACOUSTIC models, *ARTIFICIAL neural networks, *VIDEO coding, *DATA reduction
Abstract: • This paper utilizes three different GANs for data augmentation to improve speech recognition under noise conditions. • The experiments show that our proposed data augmentation approaches obtain performance improvements under all noisy conditions, which include additive noise, channel distortion and reverberation. • With the proposed approach, we can use GAN to generate more training data under noisy conditions, which can be used in multi-condition training of acoustic modeling in robust speech recognition. For noise robust speech recognition, data mismatch between training and testing is a significant challenge. Data augmentation is an effective way to enlarge the size and diversity of training data and solve this problem. Different from the traditional approaches that directly add noise to the original waveform, in this work we utilize generative adversarial networks (GAN) for data generation to improve speech recognition under noise conditions. In this paper we investigate different configurations of GANs. Firstly the basic GAN is applied: the generated speech samples are based on spectrum feature level and produced frame by frame without dependence among them, and there are no true labels. Thus, an unsupervised learning framework is proposed to utilize these untranscribed data for acoustic modeling. Then, in order to better guide the data generation, condition information is introduced into GAN structures, and the conditional GAN is utilized: two different conditions are explored, including the acoustic state of each speech frame and the original paired clean speech of each speech frame. With the incorporation of specific condition information into data generation, these conditional GANs can provide true labels directly, which can be used for later acoustic modeling. During the acoustic model training, these true labels are combined with the soft labels, which makes the model better.
The proposed GAN-based data augmentation approaches are evaluated on two different noisy tasks: Aurora4 (simulated data with additive noise and channel distortion) and the AMI meeting transcription task (real data with significant reverberation). The experiments show that the new data augmentation approaches obtain performance improvements under all noisy conditions, which include additive noise, channel distortion and reverberation. With the data augmented by the basic GAN / conditional GAN, a relative 6% to 14% WER reduction can be obtained upon an advanced acoustic model. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

26. The Hearing-Aid Speech Perception Index (HASPI) Version 2.

Author: Kates, James M. and Arehart, Kathryn H.
Subjects: *SPEECH perception, *INTELLIGIBILITY of speech, *AUDITORY masking, *HEARING disorders, *SIGNAL processing, *PARAMETRIC modeling
Abstract: • We propose a revised intelligibility index based on the outputs of an auditory model. • The auditory model incorporates peripheral hearing loss and is accurate for both normal and impaired hearing. • The index compares the model outputs for a processed signal with the outputs for an unprocessed reference signal. • The index uses measurements of envelope fidelity calculated using a modulation filterbank. • Index results are presented for noise, distortion, reverberation, and nonlinear signal processing outputs. This paper presents a revised version of the Hearing-Aid Speech Perception Index (HASPI). The index is based on a model of the auditory periphery that incorporates changes due to hearing loss and is valid for both normal-hearing and hearing-impaired listeners. It is an intrusive metric that compares the time-frequency envelope and temporal fine structure (TFS) of a degraded signal to an unprocessed reference. The first modification to HASPI is an extension to the range of envelope modulation rates considered in the metric. HASPI applies a lowpass filter to the time-frequency envelope modulation, and in the new version this single filter is replaced by a modulation filterbank. The temporal fine structure (TFS) analysis in the original version of HASPI is replaced by the filterbank outputs at higher modulation rates that represent auditory roughness and periodicity. The second modification is replacing the parametric model combining envelope and TFS measurements used in the original version with an ensemble of neural networks. The improved version of HASPI is compared to the original version for datasets from five experiments that encompass noise and nonlinear distortion, frequency compression, ideal binary mask noise suppression, speech modified using a noise vocoder, and speech in reverberation.
The new version of HASPI is shown to have a statistically-significant reduction in RMS error compared to the original version for most of the data considered, and to be significantly more accurate for speech in reverberation. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

27. Factorized and progressive knowledge distillation for CTC-based ASR models.

Author: Tian, Sanli, Li, Zehan, Lyv, Zhaobiao, Cheng, Gaofeng, Xiao, Qing, Li, Ta, and Zhao, Qingwei
Subjects: *DISTILLATION, *FORMATIVE tests, *DISTRIBUTION (Probability theory), *KNOWLEDGE transfer, *SPEECH perception
Abstract: Knowledge distillation (KD) is a popular model compression method to improve the performance of lightweight models by transferring knowledge from a teacher model to a student model. However, applying KD to connectionist temporal classification (CTC) ASR models is challenging due to their peaky posterior property. In this paper, we propose to address this issue by treating non-blank and blank frames differently for two main reasons. First, the non-blank frames in the teacher model's posterior matrix and hidden representations provide more acoustic and linguistic information than the blank frames, but the non-blank frames account for only a small fraction of all frames, leading to a severe learning imbalance problem. Second, the non-blank tokens in the teacher's blank-frame posteriors exhibit irregular probability distributions, negatively impacting the student model's learning. Thus, we propose to factorize the distillation of non-blank and blank frames and further combine them into a progressive KD framework, which contains three incremental stages to help the student model gradually build up its knowledge. The first stage involves a simple binary classification KD task, in which the student learns to distinguish between non-blank and blank frames, as the two types of frames are learned separately in subsequent stages. The second stage is a factorized representation-based KD, in which hidden representations are divided into non-blank and blank frames so that both can be distilled in a balanced manner. In the third stage, the student learns from the teacher's posterior matrix through our proposed method, factorized KL-divergence (FKL), which performs different operations on blank and non-blank frame posteriors to alleviate the imbalance issue and reduce the influence of irregular probability distributions.
Compared to the baseline, our proposed method achieves 22.5% relative CER reduction on the Aishell-1 dataset, 23.0% relative WER reduction on the Tedlium-2 dataset, and 17.6% relative WER reduction on the LibriSpeech dataset. To show the generalization of our method, we also evaluate our method on the hybrid CTC/Attention architecture as well as on scenarios with cross-model topology KD. • We explore why the conventional KD underperforms when applied to CTC models. • We propose Factorized KL-divergence for CTC-based models' KD. • We propose a progressive KD framework to gradually build up the student's knowledge. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

28. Visual-articulatory cues facilitate children with CIs to better perceive Mandarin tones in sentences.

Author: Tang, Ping, Li, Shanpeng, Shen, Yanan, Yu, Qianxi, and Feng, Yan
Subjects: *COCHLEAR implants, *ORAL communication, *SPEECH, *ABSOLUTE pitch, *HEARING disorders
Abstract: • This paper examines the perception of Mandarin tones by children with cochlear implants in both audio-only (AO) and audiovisual (AV) conditions across quiet and noisy environments. • A picture-pointing task was adopted. • Children with cochlear implants demonstrated higher tonal perception accuracy in the AV condition compared to the AO condition, implying that they can use visual-articulatory cues to enhance their tonal perception, although in noisy environments only. • Children who were implanted earlier are better able to use visual cues to facilitate tonal perception. • These findings highlight the importance of visual cues in speech communication for individuals with hearing impairments and their strong ability to perceive visual speech. Children with cochlear implants (CIs) face challenges in tonal perception under noise. Nevertheless, our previous research demonstrated that seeing visual-articulatory cues (speakers' facial/head movements) helped these children perceive isolated tones better, particularly in noisy environments, with those implanted earlier gaining more benefits. However, tones in daily speech typically occur in sentence contexts where visual cues are largely reduced compared to those in isolated contexts. It was thus unclear if visual benefits on tonal perception still hold in these challenging sentence contexts. Therefore, this study tested 64 children with CIs and 64 age-matched NH children. Target tones in sentence-medial position were presented in audio-only (AO) or audiovisual (AV) conditions, in quiet and noisy environments. Children selected the target tone using a picture-pointing task. The results showed that, while NH children did not show any perception difference between AO and AV conditions, children with CIs significantly improved their perceptual accuracy from AO to AV conditions. The degree of improvement was negatively correlated with their implantation ages.
Therefore, children with CIs were able to use visual-articulatory cues to facilitate their tonal perception even in sentence contexts, and earlier auditory experience might be important in shaping this ability. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

29. The effect of musical expertise on whistled vowel identification.

Author: Tran Ngoc, Anaïs, Meyer, Julien, and Meunier, Fanny
Subjects: *VOWELS, *SPEECH perception, *KNOWLEDGE representation (Information theory), *SPEECH, *EXPERTISE
Abstract: • Musicians and non-musicians process the whistled speech signal differently. • Musical expertise affects whistled vowel perception with advantages for lower vowels. • Inter-whistler variation affects both musicians and non-musicians. • Musical advantages are more marked with a wider whistled production range and stable frequencies. In this paper, we looked at the impact of musical experience on whistled vowel categorization by native French speakers. Whistled speech, a natural, yet modified speech type, augments speech amplitude while transposing the signal to a range of fairly high frequencies, i.e. 1 to 4 kHz. The whistled vowels are simple pitches of different heights depending on the vowel position, and generally represent the most stable part of the signal, just as in modal speech. They are modulated by consonant coarticulation(s), resulting in characteristic pitch movements. This change in speech mode can liken the speech signal to musical notes and their modulations; however, the mechanisms used to categorize whistled phonemes rely on abstract phonological knowledge and representation. Here we explore the impact of musical expertise on such a process by focusing on four whistled vowels (/i, e, a, o/) which have been used in previous experiments with non-musicians. We also included inter-speaker production variations, adding variability to the vowel pitches. Our results showed that all participants categorize whistled vowels well over chance, with musicians showing advantages for the middle whistled vowels (/a/ and /e/) as well as for the lower whistled vowel /o/. The whistler variability also affects musicians more than non-musicians and impacts their advantage, notably for the vowels /e/ and /o/. However, we find no specific training advantage for musicians over the whole experiment, but rather training effects for /a/ and /e/ when taking into account all participants. 
This suggests that though musical experience may help structure the vowel hierarchy when the whistler has a larger range, this advantage cannot be generalized when listening to another whistler. Thus, the transfer of musical knowledge present in this task only influences certain aspects of speech perception. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

30. Language fusion via adapters for low-resource speech recognition.

Author: Hu, Qing, Zhang, Yan, Zhang, Xianlei, Han, Zongyu, and Liang, Xiuxia
Subjects: *AUTOMATIC speech recognition, *SPEECH perception, *COMPUTATIONAL linguistics, *SPEECH processing systems
Abstract: Data scarcity makes low-resource speech recognition systems suffer from severe overfitting. Although fine-tuning addresses this issue to some extent, it leads to parameter-inefficient training. In this paper, a novel language knowledge fusion method, named LanFusion, is proposed. It is built on the recent popular adapter-tuning technique, thus maintaining better parameter efficiency compared with conventional fine-tuning methods. LanFusion is a two-stage method. Specifically, multiple adapters are first trained on several source languages to extract language-specific and language-invariant knowledge. Then, the trained adapters are re-trained on the target low-resource language to fuse the learned knowledge. Compared with Vanilla-adapter, LanFusion obtains a relative average word error rate (WER) reduction of 9.8% and 8.6% on the Common Voice and FLEURS corpora, respectively. Extensive experiments demonstrate the proposed method is not only simple and effective but also parameter-efficient. Besides, using source languages that are geographically similar to the target language yields better results on both datasets. • LanFusion's key innovation lies in improving LLSR from a new perspective of language knowledge fusion. • LanFusion is based on the recent popular adapter-tuning. Besides, language-invariant adapters are introduced. • Extensive experiments on Common Voice and FLEURS show the effectiveness and parameter efficiency of LanFusion. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

31. Monaural multi-talker speech recognition using factorial speech processing models.

Author: Khademian, Mahdi and Homayounpour, Mohammad Mehdi
Subjects: *ARTIFICIAL neural networks, *SIGNAL processing, *SPEECH perception, *SPEECH processing systems
Abstract: A Pascal challenge entitled monaural speech separation and recognition challenge was developed, targeting the problem of robust automatic speech recognition against speech-like noises which significantly degrade the performance of automatic speech recognition systems. In this challenge, two competing speakers say a simple command simultaneously and the objective is to recognize speech of the target speaker. Surprisingly, a team from IBM research could achieve performance better than human listeners on this task during the challenge. The IBM system consists of an intermediate speech separation and two single-talker speech recognition modules. This paper reconsiders the recognition task of this challenge based on gain adapted factorial speech processing models. It develops a joint-token passing algorithm for direct joint-decoding of target and masker speakers’ mixed-signals, simultaneously. It uses maximum uncertainty during the joint-decoding, which cannot be used in the two-phased IBM system. This paper provides a detailed derivation of inference on these models based on the general inference procedures of probabilistic graphical models. Additionally, it uses deep neural networks for joint-speaker identification and their gain estimation, which makes these two steps easier than before while producing competitive results for these steps. The proposed method of this work outperforms past super-human results and even the results recently achieved using deep neural networks by Microsoft research. It achieved 5.3% absolute task performance improvement compared to the first super-human system and 2.5% absolute task performance improvement compared to its recent competitor. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

32. An Iterative Graph Spectral Subtraction Method for Speech Enhancement.

Author: Yan, Xue, Yang, Zhen, Wang, Tingting, and Guo, Haiyan
Subjects: *SPEECH enhancement, *SIGNAL-to-noise ratio, *SIGNAL processing, *INTELLIGIBILITY of speech, *SPEECH perception
Abstract: • The graph speech signals are constructed based on the combined k-shift operators. • The graph spectra of graph speech and graph noise signals in the graph Fourier domain differ significantly. • Spectral subtraction methods are proposed in the graph frequency domain. • The iterative graph spectral subtraction (IGSS) method performs better, especially under low SNR conditions. In this paper, we investigate the application of graph signal processing (GSP) theory in speech enhancement. We first propose a set of shift operators to construct graph speech signals, and then analyze their spectrum in the graph Fourier domain. By leveraging the differences between the spectrum of graph speech and graph noise signals, we further propose the graph spectral subtraction (GSS) method to suppress the noise interference in noisy speech. Moreover, based on GSS, we propose the iterative graph spectral subtraction (IGSS) method to further improve the speech enhancement performance. Our experimental results show that the proposed operators are suitable for graph speech signals, and the proposed methods outperform the traditional basic spectral subtraction (BSS) method and iterative basic spectral subtraction (IBSS) method in terms of both signal-to-noise ratios (SNR) and mean Perceptual Evaluation of Speech Quality (PESQ). [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF
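
For orientation on item 32, the sketch below shows conventional magnitude spectral subtraction on a single frame, i.e. the kind of basic (non-graph) baseline the abstract compares against. It is not the authors' GSS or IGSS method. The function name `spectral_subtract`, the `floor` parameter, and the toy spectrum are assumptions made for illustration only.

```python
import cmath

def spectral_subtract(noisy_spectrum, noise_mag, floor=0.01):
    """Subtract an estimated noise magnitude from each frequency bin.

    noisy_spectrum: complex DFT coefficients of one noisy frame
    noise_mag: estimated noise magnitude per bin (assumed known here)
    floor: fraction of the noisy magnitude kept as a lower bound, so
           that no bin ends up with a negative magnitude
    """
    enhanced = []
    for coeff, noise in zip(noisy_spectrum, noise_mag):
        mag = abs(coeff)
        clean_mag = max(mag - noise, floor * mag)
        # Keep the noisy phase; only the magnitude is modified.
        enhanced.append(cmath.rect(clean_mag, cmath.phase(coeff)))
    return enhanced

# Toy frame (hypothetical values): one strong bin and one noise-dominated bin.
enhanced = spectral_subtract([3 + 4j, 0.6 + 0.8j], [2.0, 2.0])
```

In this toy frame the strong bin keeps its phase and loses the estimated noise magnitude, while the noise-dominated bin is clamped to the spectral floor rather than going negative.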

33. A review of multi-objective deep learning speech denoising methods.

Author: Azarang, Arian and Kehtarnavaz, Nasser
Subjects: *DEEP learning, *SPEECH, *SIGNAL-to-noise ratio, *SPEECH perception
Abstract: This paper presents a review of multi-objective deep learning methods that have been introduced in the literature for speech denoising. After stating an overview of conventional, single objective deep learning, and hybrid or combined conventional and deep learning methods, a review of the mathematical framework of the multi-objective deep learning methods for speech denoising is provided. A representative method from each speech denoising category, whose codes are publicly available, is selected and a comparison is carried out by considering the same public domain dataset and four widely used objective metrics. The comparison results indicate the effectiveness of the multi-objective method compared with the other methods, in particular when the signal-to-noise ratio is low. Possible future improvements that can be achieved are also mentioned. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

34. Subspace Gaussian mixture based language modeling for large vocabulary continuous speech recognition.

Author: Sun, Ri Hyon and Chol, Ri Jong
Subjects: *SPEECH perception, *AUTOMATIC speech recognition, *GAUSSIAN mixture models, *PSYCHOACOUSTICS, *RECURRENT neural networks, *ACOUSTIC models
Abstract: • Features embedding word-history information could improve modeling accuracy. • Word-clustering with decision trees could increase modeling accuracy and handle OOVs. • Subspace Gaussian mixture models would alleviate the sparseness of language modeling. • Models of a family of Gaussian distribution have been already developed deeply. This paper focuses on an adaptable continuous space language modeling approach that combines the longer context information of a recurrent neural network (RNN) with the adaptation ability of the subspace Gaussian mixture model (SGMM), which has been widely used in acoustic modeling for automatic speech recognition (ASR). In large vocabulary continuous speech recognition (LVCSR) it is a challenging problem to construct language models that can capture the longer context information of words and ensure generalization and adaptation ability. Recently, language modeling based on RNN and its variants has been broadly studied in this field. The goal of our approach is to obtain the history feature vectors of a word with longer context information and model every word by subspace Gaussian mixture model such as Tandem system used in acoustic modeling for ASR. Also, it is to apply fMLLR adaptation method, which is widely used in SGMM based acoustic modeling, for adaptation of subspace Gaussian mixture based language model (SGMLM). After fMLLR adaptation, SGMLMs based on Top-Down and Bottom-Up obtain WERs of 5.70% and 6.01%, which are better than 4.15% and 4.61% of that without adaptation, respectively. Also, with fMLLR adaptation, Top-Down and Bottom-Up based SGMLMs yield absolute word error rate reduction of 1.48%, 1.02% and a relative perplexity reduction of 10.02%, 6.46% compared to RNNLM without adaptation, respectively. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

35. Perceptual motivation for rhotics as a class.

Author: Howson, Phil J. and Monahan, Philip J.
Subjects: *GEOGRAPHICAL perception, *NATIVE language, *SPEECH perception, *INVENTORIES, *SENSORY perception
Abstract: • Tested perception of three rhotics /r ɻ ʀ/, against three stops /dj ɖ ɟ/, three nasals /nj ɳ ɲ/, three fricatives /z̪ ʐ ʑ/, and three laterals /ɫ lj ɭ/. • Performed AX discrimination task and used D-primes to calculate perceptual maps with multidimensional scaling. • Revealed low D-prime for rhotic-rhotic comparisons, small perceptual space for rhotics, and distinct grouping in the perceptual space for each of the natural classes. • Suggests acoustic-perceptual similarity between rhotics which may contribute to class membership. • May explain typological tendency to have small rhotic inventories. Finding phonetic correlates of rhotics as a natural class has been elusive, leading to the suggestion that any class-based relationship between different rhotic categories is purely phonological in nature. This paper examines native English speakers' perception of three different non-native rhotics (i.e., /r ɻ ʀ/) compared to non-native sounds from four other manners of articulation (stops, nasals, fricatives, and laterals). The results revealed that speakers cannot reliably discriminate between the rhotics examined here and that the perceptual distance between members of the class of rhotics is smaller than the other tested classes, aside from the comparison with laterals. The current results suggest that there is an acoustic-perceptual correlate to rhotics as a natural class and that their perception explains the relative rarity of large rhotic inventories cross-linguistically. The comparison with laterals also suggests why rhotics are often paired with laterals in inventories with two or more liquids. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF
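
The abstract of item 35 reports D-primes from an AX discrimination task without giving the formula. As a minimal sketch, the code below uses the standard yes/no definition d' = z(hit rate) - z(false-alarm rate). Whether the authors used this form or a same-different (differencing) model is not stated, so that choice is an assumption, as are the log-linear correction, the function name `d_prime`, and all counts shown.

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(H) - z(F) for a yes/no style analysis.

    A log-linear correction (add 0.5 to each count, 1 to each total)
    keeps the rates away from 0 and 1, where the z-transform diverges.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)

# Hypothetical counts: "different" responses on different vs. same trials.
easy_pair = d_prime(hits=18, misses=2, false_alarms=2, correct_rejections=18)
hard_pair = d_prime(hits=12, misses=8, false_alarms=8, correct_rejections=12)
chance = d_prime(hits=10, misses=10, false_alarms=10, correct_rejections=10)
```

A pair that listeners confuse often (such as the rhotic-rhotic comparisons described above) yields a d' near zero, and a matrix of pairwise d' values can then serve as the distance input to multidimensional scaling.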

36. Performance of single-channel speech enhancement algorithms on Mandarin listeners with different immersion conditions in New Zealand English.

Author: Zhang, Yunqi C., Hioka, Yusuke, Hui, C.T. Justine, and Watson, Catherine I.
Subjects: *SPEECH enhancement, *ENGLISH language, *INTELLIGIBILITY of speech, *SPEECH perception, *ALGORITHMS
Abstract: Speech enhancement (SE) is a widely used technology to improve the quality and intelligibility of noisy speech. So far, SE algorithms were designed and evaluated on native listeners only, but not on non-native listeners who are known to be more disadvantaged when listening in noisy environments. This paper investigates the performance of five widely used single-channel SE algorithms on early-immersed New Zealand English (NZE) listeners and native Mandarin listeners with different immersion conditions in NZE under negative input signal-to-noise ratio (SNR) by conducting a subjective listening test in NZE sentences. The performance of the SE algorithms in terms of speech intelligibility in the three participant groups was investigated. The result showed that the early-immersed group always achieved the highest intelligibility. The late-immersed group outperformed the non-immersed group for higher input SNR conditions, possibly due to the increasing familiarity with the NZE accent, whereas this advantage disappeared at the lowest tested input SNR conditions. The SE algorithms tested in this study failed to improve and rather degraded the speech intelligibility, indicating that these SE algorithms may not be able to reduce the perception gap between early-, late- and non-immersed listeners, nor able to improve the speech intelligibility under negative input SNR in general. These findings have implications for the future development of SE algorithms tailored to Mandarin listeners, and for understanding the impact of language immersion on speech perception in noise. • Early study explores speech enhancement impact on listeners with varied immersion. • Tested algorithms failed to improve the intelligibility of extremely noisy speech. • Early-immersed group has higher intelligibility than late- and non-immersed groups. • Intelligibility of Mandarin groups at lower SNRs was similar regardless of immersion. 
• Both Mandarin groups found it difficult to distinguish NZE vowels in noise. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

37. Dithering techniques in automatic recognition of speech corrupted by MP3 compression: Analysis, solutions and experiments.

Author: Borsky, Michal, Mizera, Petr, Pollak, Petr, and Nouza, Jan
Subjects: *SPEECH perception, *MPEG (Video coding standard), *AUTOMATIC speech recognition, *NEURAL circuitry, *GAUSSIAN mixture models, *DATABASE management
Abstract: A large portion of the audio files distributed over the Internet or those stored in personal and corporate media archives are in a compressed form. There exist several compression techniques and algorithms but it is the MPEG Layer-3 (known as MP3) that has achieved a really wide popularity in general audio coding, and in speech, too. However, the algorithm is lossy in nature and introduces distortion into spectral and temporal characteristics of a signal. In this paper we study its impact on automatic speech recognition (ASR). We show that with decreasing MP3 bitrates the major source of ASR performance degradation is deep spectral valleys (i.e. bins with almost zero energy) caused by the masking effect of the MP3 algorithm. We demonstrate that these unnatural gaps in spectrum can be effectively compensated by adding a certain amount of noise to the distorted signal. We provide theoretical background for this approach where we show that the added noise affects mainly the spectral valleys. They are filled by the noise while the spectral bins with speech remain almost unchanged. This helps to restore a more natural shape of log spectrum and cepstrum, and consequently has a positive impact on ASR performance. In our previous work, we have proposed two types of the signal dithering (noise addition) technique, one applied globally, the other in a more selective way. In this paper, we offer a more detailed insight into their performance. We provide results from many experiments where we test them in various scenarios, using a large vocabulary continuous speech recognition (LVCSR) system, acoustic models based on Gaussian mixture models (GMM) as well as on deep neural networks (DNN), and multiple speech databases in three languages (Czech, English and German). Our results prove that both the proposed techniques, and the selective dithering method, in particular, yield consistent compensation of the negative impact of the MP3 compressed speech on ASR performance. [ABSTRACT FROM AUTHOR]
Published: 2017
Full Text: View/download PDF
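
The noise-filling effect described in the abstract of item 37 can be illustrated with a small, self-contained sketch. It is not the authors' dithering method: the pure-tone test signal, the noise level `noise_std`, and the helper names are assumptions, and a naive DFT stands in for a real front end. The point is only that low-level additive noise lifts near-zero-energy bins in the log spectrum while leaving the high-energy bin almost unchanged.

```python
import cmath
import math
import random

def log_power_spectrum(x):
    """Log power (dB) of the non-negative-frequency DFT bins of x."""
    n = len(x)
    spectrum_db = []
    for k in range(n // 2 + 1):
        acc = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        # Small constant avoids log10(0) in empty bins.
        spectrum_db.append(10.0 * math.log10(abs(acc) ** 2 + 1e-20))
    return spectrum_db

def add_dither(x, noise_std, seed=0):
    """Add white Gaussian noise with standard deviation noise_std."""
    rng = random.Random(seed)
    return [sample + rng.gauss(0.0, noise_std) for sample in x]

n = 64
tone_bin = 8
# A pure tone: every bin except tone_bin has almost zero energy,
# mimicking deep spectral valleys.
signal = [math.sin(2 * math.pi * tone_bin * t / n) for t in range(n)]
dithered = add_dither(signal, noise_std=0.01)

clean_db = log_power_spectrum(signal)
dither_db = log_power_spectrum(dithered)

valley_bins = [k for k in range(n // 2 + 1) if k != tone_bin]
lowest_valley_clean = min(clean_db[k] for k in valley_bins)
lowest_valley_dithered = min(dither_db[k] for k in valley_bins)
tone_change_db = abs(dither_db[tone_bin] - clean_db[tone_bin])
```

Comparing `lowest_valley_clean` with `lowest_valley_dithered` shows the valleys being filled, while `tone_change_db` stays small, which matches the abstract's claim that bins carrying signal energy remain almost unchanged.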

38. Speaker recognition using PCA-based feature transformation.

Author: Ahmed, Ahmed Isam, Chiverton, John P., Ndzi, David L., and Becerra, Victor M.
Subjects: *RECURRENT neural networks, *SPEECH perception, *ACCURACY, *COVARIANCE matrices, *GAUSSIAN mixture models
Abstract: This paper introduces a Weighted-Correlation Principal Component Analysis (WCR-PCA) for efficient transformation of speech features in speaker recognition. A Recurrent Neural Network (RNN) technique is also introduced to perform the weighted PCA. The weights are taken as the log-likelihood values from a fitted Single Gaussian-Background Model (SG-BM). For speech features, we show that there are large differences between feature variances which makes covariance based PCA less optimal. A comparative study of the performance of speaker recognition is presented using weighted and unweighted correlation and covariance based PCA. Extensions to improve the extraction of MFCC and LPCC features of speech are also proposed. These are Odd Even filter banks MFCC (OE-MFCC) and Multitaper-Fitted LPCC. The methodologies are evaluated for the i-vector speaker recognition system. A subset of the 2010 NIST speaker recognition evaluation set is used in the performance testing in addition to evaluations on the VoxCeleb1 dataset. A relative improvement of 44% in terms of EER is found in the system performance using the NIST data and 18% using the VoxCeleb1 dataset. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

39. Advances in phase-aware signal processing in speech communication.

Author: Mowlaee, Pejman, Saeidi, Rahim, and Stylianou, Yannis
Subjects: *ORAL communication, *SIGNAL processing, *AUTOMATIC speech recognition, *SPEECH perception, *VOICEPRINTS
Abstract: During the past three decades, the issue of processing spectral phase has been largely neglected in speech applications. There is no doubt that the interest of speech processing community towards the use of phase information in a big spectrum of speech technologies, from automatic speech and speaker recognition to speech synthesis, from speech enhancement and source separation to speech coding, is constantly increasing. In this paper, we elaborate on why phase was believed to be unimportant in each application. We provide an overview of advancements in phase-aware signal processing with applications to speech, showing that considering phase-aware speech processing can be beneficial in many cases, while it can complement the possible solutions that magnitude-only methods suggest. Our goal is to show that phase-aware signal processing is an important emerging field with high potential in the current speech communication applications. The paper provides an extended and up-to-date bibliography on the topic of phase aware speech processing aiming at providing the necessary background to the interested readers for following the recent advancements in the area. Our review expands the step initiated by our organized special session and exemplifies the usefulness of spectral phase information in a wide range of speech processing applications. Finally, the overview will provide some future work directions. [ABSTRACT FROM AUTHOR]
Published: 2016
Full Text: View/download PDF

40. The Role of Auditory and Visual Cues in the Perception of Mandarin Emotional Speech in Male Drug Addicts.

Author: Geng, Puyang, Fan, Ningxue, Ling, Rong, Guo, Hong, Lu, Qimeng, and Chen, Xingwen
Subjects: *FACIAL expression & emotions (Psychology), *PEOPLE with drug addiction, *REHABILITATION of people with drug addiction, *STROOP effect, *SPEECH, *VISUAL perception, *SPEECH perception
Abstract: • Fill the research gap in the field of speech perception among drug addicts. • Reveal the presence of a disorder or deficit in multi-modal emotional speech processing in drug addicts. • Suggest that visual cues, such as facial expressions, could be beneficial in improving drug addicts' interpretation of emotional expressions during their drug rehabilitation therapy and social interpersonal communication. • Contribute to the theoretical foundations of detoxification and speech rehabilitation for drug addicts. Evidence from previous neurological studies has revealed that drugs can cause severe damage to the human brain structure, leading to significant cognitive disorders in emotion processing, such as psychotic-like symptoms (e.g., speech illusion: reporting positive/negative responses when hearing white noise) and negative reinforcement. Due to these emotion processing disorders, drug addicts may experience difficulties in emotion recognition and speech illusion, which are essential for interpersonal communication and a healthy life experience. However, previous research has yielded divergent results regarding whether drug addicts are more attracted to negative stimuli or positive stimuli. Additionally, little attention has been paid to the speech illusion experienced by drug addicts. Therefore, the current study aimed to investigate the effect of drugs on patterns of emotion recognition through two basic channels: auditory (speech) and visual (facial expression), as well as the speech illusions of drug addicts. The current study conducted a perceptual experiment in which 52 stimuli of four emotions (happy, angry, sad, and neutral) in three modalities (auditory, visual, auditory + visual [congruent & incongruent]) were presented to address Question 1 regarding the multi-modal emotional speech perception of drug addicts. 
Additionally, 26 stimuli of white noise and speech of three emotions in two noise conditions were presented to investigate Question 2 concerning the speech illusion of drug addicts. A total of thirty-five male drug addicts (25 heroin addicts and 10 ketamine addicts) and thirty-five male healthy controls were recruited for the perception experiment. The results, with heroin and ketamine addicts as examples, revealed that drug addicts exhibited lower accuracies in multi-modal emotional speech perception and relied more on visual cues for emotion recognition, especially when auditory and visual inputs were incongruent. Furthermore, both heroin and ketamine addicts showed a higher incidence of emotional responses when only exposed to white noise, suggesting the presence of psychotic-like symptoms (i.e., speech illusion) in drug addicts. Our results preliminarily indicate a disorder or deficit in multi-modal emotional speech processing among drug addicts, and the use of visual cues (e.g., facial expressions) may be recommended to improve their interpretation of emotional expressions. Moreover, the speech illusions experienced by drug addicts warrant greater attention and awareness. This paper not only fills the research gap in understanding multi-modal emotion processing and speech illusion in drug addicts but also contributes to a deeper understanding of the effects of drugs on human behavior and provides insights for the theoretical foundations of detoxification and speech rehabilitation for drug addicts. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

41. Deep feature for text-dependent speaker verification.

Author: Liu, Yuan, Qian, Yanmin, Chen, Nanxin, Fu, Tianfan, Zhang, Ya, and Yu, Kai
Subjects: *DEEP learning, *SPEECH perception, *DISCRIMINANT analysis, *CLASSIFIERS (Linguistics), *CONFIRMATION (Logic), *MATHEMATICAL models
Abstract: Recently deep learning has been successfully used in speech recognition; however, it has not been carefully explored and widely accepted for speaker verification. To incorporate deep learning into speaker verification, this paper proposes novel approaches of extracting and using features from deep learning models for text-dependent speaker verification. In contrast to the traditional short-term spectral feature, such as MFCC or PLP, in this paper, outputs from the hidden layers of various deep models are employed as deep features for text-dependent speaker verification. Four types of deep models are investigated: deep Restricted Boltzmann Machines, speech-discriminant Deep Neural Network (DNN), speaker-discriminant DNN, and multi-task joint-learned DNN. Once deep features are extracted, they may be used within either the GMM-UBM framework or the identity vector (i-vector) framework. Joint linear discriminant analysis and probabilistic linear discriminant analysis are proposed as effective back-end classifiers for identity vector based deep features. These approaches were evaluated on the RSR2015 data corpus. Experiments showed that deep feature based methods can obtain significant performance improvements compared to the traditional baselines, no matter if they are directly applied in the GMM-UBM system or utilized as identity vectors. The EER of the best system using the proposed identity vector is 0.10%, only one fifteenth of that in the GMM-UBM baseline. [ABSTRACT FROM AUTHOR]
Published: 2015
Full Text: View/download PDF

42. Robust speech recognition in reverberant environments by using an optimal synthetic room impulse response model.

Author: Liu, Jindong and Yang, Guang-Zhong
Subjects: *ROBUST control, *SPEECH perception, *IMPULSE response, *AUTOMATIC control systems, *SIMULATION methods & models, *HUMAN-robot interaction
Abstract: This paper presents a practical technique for automatic speech recognition (ASR) in multiple reverberant environment selection. Multiple ASR models are trained with artificial synthetic room impulse responses (IRs), i.e. simulated room IRs, with different reverberation times (T60_Model) and tested on real room IRs with varying T60_Room. To apply our method, the biggest challenge is to choose a proper artificial room IR model for training ASR models. In this paper, a generalised statistical IR model with attenuated reverberation after an early reflection period, named attenuated IR model, has been adopted based on three time-domain statistical IR models. Its optimal values of the reverberation-attenuation factor and the early reflection period on the recognition rate have been searched and determined. Extensive testing has been performed over four real room IR sets (63 IRs in total) with varying T60_Room and speaker microphone distances (SMDs). The optimised attenuated IR model had the best performance in terms of recognition rate over others. Specific considerations of the practical use of the method have been taken into account including: (i) the maximal training step of T60_Model in order to get the minimal number of models with acceptable performance; (ii) the impact of selection errors on the ASR caused by the estimation error of T60_Room; and (iii) the performance over SMD and direct-to-reverberation energy ratio (DRR). It is shown that recognition rates of over 80-90% are achieved in most cases. One important advantage of the method is that T60_Room can be estimated either from reverberant sound directly (Takeda et al., 2009; Falk and Chan, 2010; Löllmann et al., 2010) or from an IR measured from any point of the room as it remains constant in the same room (Kuttruff, 2000), thus it is particularly suited to mobile applications. Compared to many classical dereverberation methods, the proposed method is more suited to ASR tasks in multiple reverberant environments, such as human-robot interaction. [ABSTRACT FROM AUTHOR]
Published: 2015
Full Text: View/download PDF

43. Fusion of bottleneck, spectral and modulation spectral features for improved speaker verification of neutral and whispered speech.

Author: Sarria-Paja, Milton and Falk, Tiago H.
Subjects: *SPEECH perception, *BIOMETRIC identification, *MODULATION spectroscopy, *NOISE (Work environment), *GENDER identity
Abstract: Speech based biometrics is becoming a preferred method of identity management amongst users and companies. Current state-of-the-art speaker verification (SV) systems, however, are known to be strongly dependent on the condition of the speech material provided as input, and can be affected by unexpected variability presented during testing, such as with environmental noise or changes in vocal effort. In this paper, SV using whispered speech is explored, as whispered speech is known to be a natural speaking style with reduced perceptibility but containing relevant information regarding speaker identity and gender. We propose to fuse information from spectral, modulation spectral and so-called bottleneck features computed via deep neural networks at the feature- and score-levels. Bottleneck features have been recently shown to provide robustness against train/test mismatch conditions and have yet to be tested for whispered speech. Experimental results showed that relative improvements as high as 79% and 60% could be achieved for neutral and whispered speech, respectively, relative to a baseline system trained with i-vectors extracted from mel frequency cepstral coefficients. Results from our fusion experiments show that the proposed strategies make efficient use of the limited resources available and result in whispered speech performance in line with that obtained with normal speech. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

44. Sequence discriminative training for deep learning based acoustic keyword spotting.

Author: Chen, Zhehuai, Qian, Yanmin, and Yu, Kai
Subjects: *SEQUENCE (Linguistics), *DEEP learning, *SPEECH perception, *TASK performance, *VOCABULARY
Abstract: Speech recognition is a sequence prediction problem. Besides employing various deep learning approaches for frame-level classification, sequence-level discriminative training has been proved to be indispensable to achieve the state-of-the-art performance in large vocabulary continuous speech recognition (LVCSR). However, keyword spotting (KWS), as one of the most common speech recognition tasks, almost only benefits from frame-level deep learning due to the difficulty of getting competing sequence hypotheses. The few studies on sequence discriminative training for KWS are limited for fixed vocabulary or LVCSR based methods and have not been compared to the state-of-the-art deep learning based KWS approaches. In this paper, a sequence discriminative training framework is proposed for both fixed vocabulary and unrestricted acoustic KWS. Sequence discriminative training for both sequence-level generative and discriminative models are systematically investigated. By introducing word-independent phone lattices or non-keyword blank symbols to construct competing hypotheses, feasible and efficient sequence discriminative training approaches are proposed for acoustic KWS. Experiments showed that the proposed approaches obtained consistent and significant improvement in both fixed vocabulary and unrestricted KWS tasks, compared to previous frame-level deep learning based acoustic KWS methods. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

45. Voice conversion for emotional speech: Rule-based synthesis with degree of emotion controllable in dimensional space.

Author: Xue, Yawen, Hamada, Yasuhiro, and Akagi, Masato
Subjects: *SPEECH perception, *EMOTIONS, *RULE-based programming, *AROUSAL (Physiology), *MATRIX inversion
Abstract: This paper proposes a rule-based voice conversion system for emotion which is capable of converting neutral speech to emotional speech using dimensional space (arousal and valence) to control the degree of emotion on a continuous scale. We propose an inverse three-layered model with acoustic features as output at the top layer, semantic primitives at the middle layer and emotion dimension as input at the bottom layer; an adaptive-based fuzzy inference system acts as connectors to extract the non-linear rules among the three layers. The rules are applied by modifying the acoustic features of neutral speech to create the different types of emotional speech. The prosody-related acoustic features of F0 and power envelope are parameterized using the Fujisaki model and target prediction model separately. Perceptual evaluation results show that the degree of emotion can be perceived well in the dimensional space of valence and arousal. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

46. Automatic context window composition for distant speech recognition.

Author: Ravanelli, Mirco and Omologo, Maurizio
Subjects: *SPEECH perception, *DEEP learning, *HIDDEN Markov models, *GAUSSIAN mixture models, *ARTIFICIAL neural networks
Abstract: Distant speech recognition is being revolutionized by deep learning, which has contributed to significantly outperforming previous HMM-GMM systems. A key aspect behind the rapid rise and success of DNNs is their ability to better manage large time contexts. In this regard, asymmetric context windows that embed more past than future frames have been recently used with feed-forward neural networks. This context configuration turns out to be useful not only to address low-latency speech recognition, but also to boost the recognition performance under reverberant conditions. This paper investigates the mechanisms occurring inside DNNs that lead to an effective application of asymmetric contexts. In particular, we propose a novel method for automatic context window composition based on a gradient analysis. The experiments, performed with different acoustic environments, features, DNN architectures, microphone settings, and recognition tasks, show that our simple and efficient strategy leads to a less redundant frame configuration, which makes DNN training more effective in reverberant scenarios. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

47. Investigating different representations for modeling and controlling multiple emotions in DNN-based speech synthesis.

Author: Lorenzo-Trueba, Jaime, Eje Henter, Gustav, Takaki, Shinji, Yamagishi, Junichi, Morino, Yosuke, and Ochiai, Yuta
Subjects: *SPEECH synthesis, *ARTIFICIAL neural networks, *SPEECH perception, *EMOTION recognition, *ACOUSTIC signal processing
Abstract: In this paper, we investigate the simultaneous modeling of multiple emotions in DNN-based expressive speech synthesis, and how to represent the emotional labels, such as emotional class and strength, for this task. Our goal is to answer two questions: First, what is the best way to annotate speech data with multiple emotions – should we use the labels that the speaker intended to express, or labels based on listener perception of the resulting speech signals? Second, how should the emotional information be represented as labels for supervised DNN training, e.g., should emotional class and emotional strength be factorized into separate inputs or not? We evaluate on a large-scale corpus of emotional speech from a professional voice actress, additionally annotated with perceived emotional labels from crowdsourced listeners. By comparing DNN-based speech synthesizers that utilize different emotional representations, we assess the impact of these representations and design decisions on human emotion recognition rates, perceived emotional strength, and subjective speech quality. Simultaneously, we also study which representations are most appropriate for controlling the emotional strength of synthetic speech. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

48. Audiovisual perception of gemination and pharyngealization in Arabic.

Author: Scott, Mark and Idrissi, Ali
Subjects: *AUDITORY perception, *VISUAL perception, *OPTICAL information processing, *ORAL communication, *SPEECH perception
Abstract: This paper addresses a gap in the literature on audiovisual speech perception. Existing literature has largely examined the degree to which the audiovisual perception of primary place of articulation is influenced by visual information. Visual influences on the audiovisual categorization of a consonant as long (geminate) or short (singleton) have not, however, previously been examined. Furthermore, no experiment, to the authors’ knowledge, has examined audiovisual perception of the presence or absence of the secondary articulation of pharyngealization. The experiments reported in this article fill this gap by demonstrating that the audiovisual perception, by Arabic speakers, of both singleton versus geminate and pharyngealized versus non-pharyngealized is susceptible to visual influence. These experiments also serve to address the general lack of research on audiovisual speech processing in Arabic. Finally, these experiments provide a methodological advance in dealing with temporal asynchrony when investigating audiovisual speech perception. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

49. Introduction to the special issue on auditory-visual expressive speech and gesture in humans and machines.

Author: Kim, Jeesun, Bailly, Gérard, and Davis, Chris
Subjects: *GESTURE, *FACIAL expression, *SPEECH perception, *ORAL communication, *MACHINERY
Abstract: We speak to express ourselves. Sometimes words can capture what we mean; sometimes we mean more than can be said. This is where our visible gestures (those dynamic oscillations of our gaze, face, head, hand, arms and bodies) help. Not only do these co-verbal visual signals help express our intentions, attitudes and emotion, they also help us engage with our conversational partners to get our message across. Understanding how and when a message is supplemented, shaped and changed by auditory and visual signals is crucial for a science ultimately interested in the correct interpretation of transmitted meaning. This special issue highlights research articles that explore co-verbal and nonverbal signals, a key topic in speech communication since these are crucial ingredients in the interpretation of meaning. That is, the meaning of speech is calibrated, augmented and even changed by co-verbal/speech behaviours and gestures including the talker's facial expression, eye-contact, gaze-direction, arm movements, hand gestures, body motion and orientation, posture, proximity, physical contact, and so on. Understanding expressive signals is a vital step for developing machines that can properly decipher intention and engage as social agents. The special issue is divided into three parts: Auditory-visual speech perception; Characterization and perception of auditory-visual prosody; Computer-generated auditory-visual speech. Below, we introduce these papers with a brief review of relevant issues and previous studies, when needed. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

50. Speech excitation signal recovering based on a novel error mitigation scheme under erasure channel conditions.

Author: López-Oller, Domingo, Benamirouche, Nadir, Gomez, Angel M., and Pérez-Córdoba, José Luis
Subjects: *SPEECH perception, *SIGNAL processing, *ERROR analysis in mathematics, *SPEECH codecs, *ALGORITHMS, *PARAMETER estimation
Abstract: Voice over IP (VoIP) communications are prone to transmission delays and data losses as they are carried out over packet-switched networks which are unable to guarantee real-time packet delivery. Speech codecs used in these channels strongly rely on Packet Loss Concealment (PLC) algorithms, the performance of which can be compromised as frame losses often occur in bursts. Thus, advanced PLC algorithms for erasure channels have already been proposed in the literature, but these frequently focus on the speech envelope, disregarding the excitation signal. In this paper we propose an error mitigation scheme focused on the estimation of this excitation signal whenever lost frames appear. These estimates are obtained by applying a minimum mean square error (MMSE) estimation technique based on the last correctly received frame. To this end, an excitation signal representation and quantization approach which compares the resulting synthesized signal with the original speech is considered. In addition, we propose the combination of this approach with a recursive least squares (RLS) technique which provides a better excitation signal estimate for the first consecutive lost frames. The proposed error mitigation scheme has been tested on the iLBC codec, where objective and subjective tests have shown a noticeable improvement in speech quality for transmissions over erasure channels. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF
