Descriptor: "Noise Robustness" / Journal: speech communication - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Noise Robustness"' showing total 20 results

1. Automatic word count estimation from daylong child-centered recordings in various language environments using language-independent syllabification of speech.

Author: Räsänen, Okko, Seshadri, Shreyas, Karadayi, Julien, Riebling, Eric, Bunce, John, Cristia, Alejandrina, Metze, Florian, Casillas, Marisa, Rosemberg, Celia, Bergelson, Elika, and Soderstrom, Melanie
Subjects: *LANGUAGE Environment Analysis System, *ORAL communication, *MICROPHONE arrays, *LANGUAGE acquisition, *ORTHOGRAPHY & spelling
Abstract: Automatic word count estimation (WCE) from audio recordings can be used to quantify the amount of verbal communication in a recording environment. One key application of WCE is to measure language input heard by infants and toddlers in their natural environments, as captured by daylong recordings from microphones worn by the infants. Although WCE is nearly trivial for high-quality signals in high-resource languages, daylong recordings are substantially more challenging due to the unconstrained acoustic environments and the presence of near- and far-field speech. Moreover, many use cases of interest involve languages for which reliable ASR systems or even well-defined lexicons are not available. A good WCE system should also perform similarly for low- and high-resource languages in order to enable unbiased comparisons across different cultures and environments. Unfortunately, the current state-of-the-art solution, the LENA system, is based on proprietary software and has only been optimized for American English, limiting its applicability. In this paper, we build on existing work on WCE and present the steps we have taken towards a freely available system for WCE that can be adapted to different languages or dialects with a limited amount of orthographically transcribed speech data. Our system is based on language-independent syllabification of speech, followed by a language-dependent mapping from syllable counts (and a number of other acoustic features) to the corresponding word count estimates. We evaluate our system on samples from daylong infant recordings from six different corpora consisting of several languages and socioeconomic environments, all manually annotated with the same protocol to allow direct comparison. We compare a number of alternative techniques for the two key components in our system: speech activity detection and automatic syllabification of speech. 
As a result, we show that our system can reach relatively consistent WCE accuracy across multiple corpora and languages (with some limitations). In addition, the system outperforms LENA on three of the four corpora consisting of different varieties of English. We also demonstrate how an automatic neural network-based syllabifier, when trained on multiple languages, generalizes well to novel languages beyond the training data, outperforming two previously proposed unsupervised syllabifiers as a feature extractor for WCE. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

2. Comparison of spectral tilt measures for sentence prominence in speech—Effects of dimensionality and adverse noise conditions.

Author: Kakouros, Sofoklis, Räsänen, Okko, and Alku, Paavo
Subjects: *NOISE, *EMPHASIS (Linguistics), *GLOTTALIZATION, *ACOUSTIC variables measurement, *BANDPASS filters
Highlights:
• The role of spectral tilt in sentence prominence was investigated.
• Estimators for tilt of glottal source spectrum and surface spectrum were compared.
• Multidimensional representations for spectral tilt were also investigated.
• Robustness of different tilt estimators was tested for noisy and coded speech.
• Experiments were conducted on both French and Dutch speech data.
Abstract: Linguistic prominence in speech is known to correlate with the acoustic measures of energy, F0, and duration. In contrast, the role of spectral tilt in the realization of prominence has remained more inconsistent between previous empirical investigations. This may be partially due to the lack of a standard method for quantifying spectral tilt or due to difficulties in estimating the acoustical source of spectral tilt, the glottal flow, from continuous speech. These issues have rendered interpretations and comparisons between studies difficult. In addition, (i) little is known about the robustness of tilt estimators for prominence detection in the case when speech is not clean but corrupted, as in real life, by environmental noise or telephone transmission (i.e. degradation caused by bandpass filtering and quantization noise). Moreover, (ii) little attention has been paid to multidimensional representations of source spectrum that can potentially incorporate more information about the phonation style than purely scalar measures. In this work, we study spectral tilt in signaling prominence in spoken Dutch and French under different levels of additive noise, and for telephone-band coded speech, and compare several one-dimensional tilt measures that have been previously encountered in the literature as well as multidimensional tilt measures. We also compare spectral tilt measures with other standard acoustic correlates for prominence, namely, energy, F0, and duration.
Our results provide further empirical support for the finding that tilt is a systematic correlate of prominence in Dutch, that the role is smaller in French, and that energy, F0, and duration appear still to be the most robust features for discriminating prominent and non-prominent words. In addition, our results show that there are notable differences between different tilt measures at different levels of noise, and that multidimensional representations for tilt improve class separability from the scalar measures. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

3. Noise robust exemplar matching with alpha-beta divergence.

Author: Yılmaz, Emre, Gemmeke, Jort F., and Van hamme, Hugo
Subjects: *NOISE control, *ROBUST control, *DIVERGENCE theorem, *AUTOMATIC speech recognition, *VOCABULARY
Abstract: The noise robust exemplar matching (N-REM) framework performs automatic speech recognition using exemplars, which are the labeled spectrographic representations of speech segments extracted from training data. By incorporating a sparse representations formulation, this technique remedies the inherent noise modeling problem of conventional exemplar matching-based automatic speech recognition systems. In this framework, noisy speech segments are approximated as a sparse linear combination of the exemplars of multiple lengths, each associated with a single speech unit such as words, half-words or phones. On account of the reconstruction error-based back end, the recognition accuracy highly depends on the congruence of the speech features and the divergence metric used to compare the speech segments with exemplars. In this work, we replace the conventional Kullback-Leibler divergence (KLD) with a generalized divergence family called the Alpha-Beta divergence with two parameters, α and β, in conjunction with mel-scaled magnitude spectral features. The proposed recognizer traverses the (α, β) plane depending on the amount of contamination to provide better separation of speech and noise sources. Moreover, we apply our recently proposed active noise exemplar selection (ANES) technique in a more realistic scenario where the target utterances are degraded by genuine room noise. Recognition experiments on the small vocabulary track of the 2nd CHiME Challenge and the AURORA-2 database have shown that the novel recognizer with the AB divergence and ANES outperforms the baseline system using the generalized KLD with tuned sparsity, especially at lower SNR levels. [ABSTRACT FROM AUTHOR]
Published: 2016
Full Text: View/download PDF

4. Comparison of spectral tilt measures for sentence prominence in speech — Effects of dimensionality and adverse noise conditions

Author: Paavo Alku, Sofoklis Kakouros, and Okko Räsänen (affiliation: Dept Signal Process and Acoust, Aalto-yliopisto, Aalto University)
Subjects: Linguistics and Language, Sentence prominence, Computer science, Speech recognition, Prosody, Language and Linguistics, Acoustic measures, Phonation, Environmental noise, Communication, Estimator, Contrast (statistics), Computer Science::Computation and Language (Computational Linguistics and Natural Language and Speech Processing), Spectral tilt, Computer Science Applications, Noise, Tilt (optics), Modeling and Simulation, Computer Vision and Pattern Recognition, Noise robustness, Software, Realization (probability), DNN
Abstract: Linguistic prominence in speech is known to correlate with the acoustic measures of energy, F0, and duration. In contrast, the role of spectral tilt in the realization of prominence has remained more inconsistent between previous empirical investigations. This may be partially due to the lack of a standard method for quantifying spectral tilt or due to difficulties in estimating the acoustical source of spectral tilt, the glottal flow, from continuous speech. These issues have rendered interpretations and comparisons between studies difficult. In addition, (i) little is known about the robustness of tilt estimators for prominence detection in the case when speech is not clean but corrupted, as in real life, by environmental noise or telephone transmission (i.e. degradation caused by bandpass filtering and quantization noise). Moreover, (ii) little attention has been paid to multidimensional representations of source spectrum that can potentially incorporate more information about the phonation style than purely scalar measures. In this work, we study spectral tilt in signaling prominence in spoken Dutch and French under different levels of additive noise, and for telephone-band coded speech, and compare several one-dimensional tilt measures that have been previously encountered in the literature as well as multidimensional tilt measures. We also compare spectral tilt measures with other standard acoustic correlates for prominence, namely, energy, F0, and duration. Our results provide further empirical support for the finding that tilt is a systematic correlate of prominence in Dutch, that the role is smaller in French, and that energy, F0, and duration appear still to be the most robust features for discriminating prominent and non-prominent words. 
In addition, our results show that there are notable differences between different tilt measures at different levels of noise, and that multidimensional representations for tilt improve class separability from the scalar measures.
Published: 2018
Full Text: View/download PDF

5. A noise-robust speech recognition approach incorporating normalized speech/non-speech likelihood into hypothesis scores

Author: Oonishi, Tasuku, Iwano, Koji, and Furui, Sadaoki
Subjects: *AUTOMATIC speech recognition, *HYPOTHESIS, *ERRORS, *KALMAN filtering, *ESTIMATION theory, *JAPANESE language, *NOISE control
Abstract: In noisy environments, speech recognition decoders often incorrectly produce speech hypotheses for non-speech periods, and non-speech hypotheses, such as silence or a short pause, for speech periods. It is crucial to reduce such errors to improve the performance of speech recognition systems. This paper proposes an approach using normalized speech/non-speech likelihoods calculated using adaptive speech and non-speech GMMs to weight the scores of recognition hypotheses produced by the decoder. To achieve good decoding performance, the GMMs are adapted to the variations of acoustic characteristics of input utterances and environmental noise, using either of the two modern on-line unsupervised adaptation methods, switching Kalman filter (SKF) or maximum a posteriori (MAP) estimation. Experimental results on real-world in-car speech, the Drivers’ Japanese Speech Corpus in a Car Environment (DJSC), and the AURORA-2 database show that the proposed method significantly improves recognition accuracy compared to a conventional approach using front-end voice activity detection (VAD). Results also confirm that our method significantly improves recognition accuracy under various noise and task conditions. [Copyright Elsevier]
Published: 2013
Full Text: View/download PDF

6. Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge

Author: Schuller, Björn, Batliner, Anton, Steidl, Stefan, and Seppi, Dino
Subjects: *EMOTIONS, *SPEECH, *HUMAN activity recognition, *ORATORS, *ANNOTATIONS, *AUTOMATIC classification, *NOISE, *STANDARDIZATION
Abstract: More than a decade has passed since research on automatic recognition of emotion from speech became a new field of research in line with its ‘big brothers’ speech and speaker recognition. This article attempts to provide a short overview of where we are today, how we got there and what this can reveal to us about where to go next and how we could arrive there. In a first part, we address the basic phenomenon reflecting the last fifteen years, commenting on databases, modelling and annotation, the unit of analysis and prototypicality. We then shift to automatic processing including discussions on features, classification, robustness, evaluation, and implementation and system integration. From there we go to the first comparative challenge on emotion recognition from speech – the INTERSPEECH 2009 Emotion Challenge, organised by (part of) the authors, including the description of the Challenge’s database, Sub-Challenges, participants and their approaches, the winners, and the fusion of results to the actual learnt lessons before we finally address the ever-lasting problems and future promising attempts. [Copyright Elsevier]
Published: 2011
Full Text: View/download PDF

7. Cepstral normalisation and the signal to noise ratio spectrum in automatic speech recognition

Author: Garner, Philip N.
Subjects: *SIGNAL-to-noise ratio, *SPECTRUM analysis, *CEPSTRUM analysis (Mechanics), *ARTICULATION (Speech), *AUTOMATIC speech recognition, *PSYCHOACOUSTICS, *ROBUST control, *NOISE
Abstract: Cepstral normalisation in automatic speech recognition is investigated in the context of robustness to additive noise. In this paper, it is argued that such normalisation leads naturally to a speech feature based on signal to noise ratio rather than absolute energy (or power). Explicit calculation of this SNR-cepstrum by means of a noise estimate is shown to have theoretical and practical advantages over the usual (energy based) cepstrum. The relationship between the SNR-cepstrum and the articulation index, known in psycho-acoustics, is discussed. Experiments are presented suggesting that the combination of the SNR-cepstrum with the well known perceptual linear prediction method can be beneficial in noisy environments. [Copyright Elsevier]
Published: 2011
Full Text: View/download PDF
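
Illustrative sketch: the abstract above gives no formulas, so the following shows only conventional cepstral mean normalisation (subtracting the per-coefficient mean over an utterance), which is the kind of normalisation the paper takes as its starting point. It is not the author's SNR-cepstrum. The function name and the list-of-frames layout are assumptions made for this example.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient mean over all frames of one utterance.

    `frames` is a list of equal-length lists of cepstral coefficients
    (a hypothetical layout chosen for this sketch).
    """
    n = len(frames)
    dim = len(frames[0])
    # Mean of each cepstral coefficient across the utterance.
    means = [sum(frame[d] for frame in frames) / n for d in range(dim)]
    # Remove the mean so each coefficient is zero-mean over the utterance.
    return [[frame[d] - means[d] for d in range(dim)] for frame in frames]


frames = [[1.0, 4.0], [3.0, 8.0]]
normalized = cepstral_mean_normalize(frames)

# A constant offset added to every frame is cancelled by the normalisation.
shifted = [[c0 + 0.5, c1 + 2.0] for c0, c1 in frames]
normalized_shifted = cepstral_mean_normalize(shifted)
```

Because a fixed channel appears as an additive constant in the cepstral domain, `normalized` and `normalized_shifted` come out identical here.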

8. Unsupervised learning of time–frequency patches as a noise-robust representation of speech

Author: Van Segbroeck, Maarten and Van hamme, Hugo
Subjects: *TIME-frequency analysis, *LANGUAGE acquisition, *ALGORITHMS, *ORAL communication -- Digital techniques, *ROBUST control, *AUTOMATIC speech recognition, *NONNEGATIVE matrices, *FACTORIZATION
Abstract: We present a self-learning algorithm using a bottom-up based approach to automatically discover, acquire and recognize the words of a language. First, an unsupervised technique using non-negative matrix factorization (NMF) discovers phone-sized time–frequency patches into which speech can be decomposed. The input matrix for the NMF is constructed for static and dynamic speech features using a spectral representation of both short and long acoustic events. By describing speech in terms of the discovered time–frequency patches, patch activations are obtained which express to what extent each patch is present across time. We then show that speaker-independent patterns appear to recur in these patch activations and how they can be discovered by applying a second NMF-based algorithm on the co-occurrence counts of activation events. By providing information about the word identity to the learning algorithm, the retrieved patterns can be associated with meaningful objects of the language. In case of a small vocabulary task, the system is able to learn patterns corresponding to words and subsequently detects the presence of these words in speech utterances. Without the prior requirement of expert knowledge about the speech as is the case in conventional automatic speech recognition, we illustrate that the learning algorithm achieves a promising accuracy and noise robustness. [Copyright Elsevier]
Published: 2009
Full Text: View/download PDF

9. A non-linear efferent-inspired model of the auditory system; matching human confusions in stationary noise

Author: Messing, David P., Delhorne, Lorraine, Bruckert, Ed, Braida, Louis D., and Ghitza, Oded
Subjects: *POETICS, *AUTHORSHIP, *LITERARY theory, *FIGURES of speech
Abstract: Current predictors of speech intelligibility are inadequate for understanding and predicting speech confusions caused by acoustic interference. We develop a model of auditory speech processing that includes a phenomenological representation of the action of the Medial Olivocochlear efferent pathway and that is capable of predicting consonant confusions made by normal hearing listeners in speech-shaped Gaussian noise. We then use this model to predict human error patterns of initial consonants in consonant–vowel–consonant words in the context of a Dynamic Rhyme Test. In the process we demonstrate its potential for speech discrimination in noise. Our results produced performance that was robust to varying levels of stationary additive speech-shaped noise and which mimicked human performance in discrimination of synthetic speech as measured by the Chi-squared test. [Copyright Elsevier]
Published: 2009
Full Text: View/download PDF

10. Incorporating the voicing information into HMM-based automatic speech recognition in noisy environments

Author: Jančovič, Peter and Köküer, Münevver
Subjects: *PHONEME (Linguistics), *HIDDEN Markov models, *BINOMIAL distribution, *SPEECH perception, *ORAL communication, *VOICE analysis
Abstract: In this paper, we propose a model for the incorporation of voicing information into a speech recognition system in noisy environments. The employed voicing information is estimated by a novel method that can provide this information for each filter-bank channel and does not require information about the fundamental frequency. The voicing information is modelled by employing the Bernoulli distribution. The voicing model is obtained for each HMM state and mixture by a Viterbi-style training procedure. The proposed voicing incorporation is evaluated both within a standard model and two other models that had compensated for the noise effect, the missing-feature and the multi-conditional training model. Experiments are first performed on noisy speech data from the Aurora 2 database. Significant performance improvements are achieved when the voicing information is incorporated within the standard model as well as the noise-compensated models. The employment of voicing information is also demonstrated on a phoneme recognition task on the noise-corrupted TIMIT database and considerable improvements are observed. [Copyright Elsevier]
Published: 2009
Full Text: View/download PDF
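
Illustrative sketch: the abstract above states that per-channel voicing information is modelled with the Bernoulli distribution. One minimal reading of that statement is scored below. Treating the filter-bank channels as independent, and the names `voicing_log_likelihood`, `voicing` and `p_voiced`, are assumptions of this example; how the paper combines this score with the acoustic likelihood is not described in the abstract and is not shown.

```python
import math


def voicing_log_likelihood(voicing, p_voiced):
    """Log-probability of binary voicing flags under independent Bernoullis.

    voicing  : list of 0/1 flags, one per filter-bank channel (assumed layout)
    p_voiced : per-channel probability of being voiced, strictly in (0, 1),
               e.g. as stored for one HMM state and mixture
    """
    total = 0.0
    for v, p in zip(voicing, p_voiced):
        # Bernoulli pmf: p if the channel is voiced, (1 - p) otherwise.
        total += math.log(p if v else 1.0 - p)
    return total


# A state that expects low channels voiced and high channels unvoiced.
p_state = [0.9, 0.8, 0.2, 0.1]
matching = voicing_log_likelihood([1, 1, 0, 0], p_state)
mismatched = voicing_log_likelihood([0, 0, 1, 1], p_state)
```

An observed voicing pattern that agrees with the state's expectations (`matching`) scores higher than one that contradicts them (`mismatched`).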

11. Spatial separation of speech signals using amplitude estimation based on interaural comparisons of zero-crossings

Author: Park, Hyung-Min and Stern, Richard M.
Subjects: *SPEECH perception, *ALGORITHMS, *AUDITORY pathways, *ORAL communication, *COMMUNICATION & technology, *AUTOMATIC speech recognition
Abstract: This paper describes an algorithm called zero-crossing-based amplitude estimation (ZCAE) that enhances speech by reconstructing the desired signal from a mixture of two signals using continuously-variable weighting factors, based on pre-processing that is motivated by the well-known ability of the human auditory system to resolve spatially-separated signals. Although most conventional methods of signal separation have been based on interaural time differences (ITDs) derived from cross-correlation information, the ZCAE approach provides sound segregation based on estimates of ITD from comparisons of zero-crossings [Kim, Y.-I., An, S.J., Kil, R.M., Park, H.-M., 2005. Sound segregation based on binaural zero-crossings. In: Proc. European Conf. on Speech Communication and Technology (INTERSPEECH-2005), Lisbon, Portugal, pp. 2325–2328]. These ITD estimates are used to determine the relative contribution of the desired source in a mixture and subsequently to reconstruct a closer approximation to the desired signal. The estimation of relative target intensity in a given time-frequency segment is accomplished by analytically deriving a monotonic function that maps the estimated ITD in each time-frequency segment to the putative relative intensity of each source. The ZCAE method is evaluated by comparing the sample standard deviation of ITD estimates derived using cross-correlation and using zero-crossing information, by comparing the speech recognition accuracy that is obtained by applying the proposed methods to speech in the presence of interfering speech sources, and by comparing recognition accuracy obtained using a continuous weighting versus a binary weighting of the target and masker. 
It is found that better results are obtained when ITDs are estimated using zero-crossing information rather than cross-correlation information, and when continuous weighting functions are used in place of binary weighting of the target and masker in each time-frequency segment. [Copyright Elsevier]
Published: 2009
Full Text: View/download PDF

12. Unsupervised intra-speaker variability compensation based on Gestalt and model adaptation in speaker verification with telephone speech

Author: Yoma, Nestor Becerra, Garretón, Claudio, Molina, Carlos, and Huenupán, Fernando
Subjects: *TELEPHONE calls, *ROBUST control, *GESTALT psychology, *NOISE, *VERSIFICATION, *COMMUNICATION models, *DATABASES
Abstract: In this paper, an unsupervised intra-speaker variability compensation (ISVC) method based on Gestalt is proposed to address the problem of limited enrolling data and noise robustness in text-dependent speaker verification (SV). Experiments with two databases show that: ISVC can lead to reductions in EER as high as 20% or 40% and ISVC provides reductions in the integral below the ROC curve between 30% and 60%. Also, the observed improvements are independent of the number of enrolling utterances. In contrast to model adaptation methods, ISVC is memoryless with respect to previous verification attempts. As shown here, unsupervised model adaptation can lead to substantial improvements in EER but is highly dependent on the sequence of client/impostor verification events. In adverse scenarios, such as massive impostor attacks and verification from alternated telephone line, unsupervised model adaptation might even provide reductions in verification accuracy when compared with the baseline system. In those cases, ISVC can even outperform adaptation schemes. It is worth emphasizing that ISVC and unsupervised model adaptation are compatible and the combination of both methods always improves the performance of model adaptation. The combination of both schemes can lead to improvements in EER as high as 34%. Due to the restrictions of commercially available databases for text-dependent SV research, the results presented here are based on local databases in Spanish. By doing so, the visibility of research in Iberian Languages is highlighted. [Copyright Elsevier]
Published: 2008
Full Text: View/download PDF

13. Issues with uncertainty decoding for noise robust automatic speech recognition

Author: Liao, H. and Gales, M.J.F.
Subjects: *ALGORITHMS, *SPEECH perception, *FACTORIZATION, *SIGNAL-to-noise ratio, *DECODERS & decoding, *AUDITORY perception
Abstract: Interest continues in a class of robustness algorithms for speech recognition that exploit the notion of uncertainty introduced by environmental noise. These techniques share the property that the uncertainty varies with the noise level and is propagated to the decoding stage, resulting in increased model variances. In observation uncertainty forms, the uncertainty variance is simply the variance of the error in enhancement that is added to the model variances. Another form, called uncertainty decoding, refers to a factorisation which results in a linear feature transform and model variance bias that increases with noise; using appropriate approximations, efficient implementations may be obtained, with the goal of achieving near model-based performance without the associated computational cost. Unfortunately, uncertainty decoding forms that compute the uncertainty in the front-end and pass this to the decoder may suffer from a theoretical problem in low signal-to-noise ratio conditions. This report discusses how this fundamental issue arises, and demonstrates it through two schemes: SPLICE with uncertainty and front-end joint uncertainty decoding (FE-Joint). A method to mitigate this for FE-Joint compensation is presented, as well as how SPLICE implicitly addresses it. However, it is shown that a model-based joint uncertainty decoding approach does not suffer from this limitation, like these front-end forms do, and is more computationally attractive. The issues described and performance of the various schemes are examined on two artificially corrupted corpora: the AURORA 2.0 digit string recognition and 1000-word Resource Management tasks. [Copyright Elsevier]
Published: 2008
Full Text: View/download PDF
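
Illustrative sketch: the abstract above describes the observation uncertainty form, in which the variance of the enhancement error is added to the model variance before the likelihood is evaluated. The one-dimensional Gaussian below shows only that single step. The function names are assumptions of this example, and the SPLICE and FE-Joint schemes discussed in the paper are not reproduced.

```python
import math


def log_gaussian(x, mean, var):
    """Log-density of a one-dimensional Gaussian N(x; mean, var)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def log_gaussian_with_uncertainty(x, mean, model_var, uncertainty_var):
    """Evaluate the model Gaussian with the uncertainty variance added."""
    return log_gaussian(x, mean, model_var + uncertainty_var)


# An enhanced feature that lies far from the model mean.
x, mean, model_var = 3.0, 0.0, 1.0
confident = log_gaussian_with_uncertainty(x, mean, model_var, 0.0)
uncertain = log_gaussian_with_uncertainty(x, mean, model_var, 3.0)
```

With a larger uncertainty variance the Gaussian is broader, so the mismatched observation is penalised less (`uncertain` exceeds `confident`). This is the sense in which uncertainty that grows with noise level increases the effective model variances.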

14. Analysis and recognition of whispered speech.

Author: Ito, Taisuke, Takeda, Kazuya, and Itakura, Fumitada
Subjects: *SPEECH, *COMMUNICATION, *CELL phones, *MODULATION theory, *VIBRATION (Mechanics), *VOCAL cords, *MICROPHONES, *DETECTORS
Abstract: In this study, we have examined the acoustic characteristics of whispered speech and addressed some of the issues involved in recognition of whispered speech used for communication over a mobile phone in a noisy environment. The acoustic analysis shows that there is an upward shift of formant frequencies of vowels as observed in the whispered speech data compared to the normal speech data. Voiced consonants in the whispered speech have lower energy at low frequencies up to 1.5 kHz and their spectral flatness is greater compared to the normal speech. In experiments on whispered speech recognition, results of our studies on adaptation of the whispered speech models have shown that adaptation using a small amount of whispered speech data from a target speaker can be effectively used for recognition of the whispered speech. In a noisy environment, the recognition accuracy decreases significantly for the whispered speech compared to the normal speaking of the same speech. A method to increase the SNR by covering the mouth with a hand has been shown to give a higher recognition accuracy for the whispered speech frequently encountered for private communication in a noisy environment. [ABSTRACT FROM AUTHOR]
Published: 2005
Full Text: View/download PDF

15. α-Jacobian environmental adaptation

Author: Cerisara, Christophe, Rigazio, Luca, and Junqua, Jean-Claude
Subjects: *ROBUST control, *AUTOMATIC speech recognition, *NOISE, *ALGORITHMS
Abstract: The robustness of automatic speech recognition systems to noise is still a problem, especially for small footprint systems. This paper addresses the problem of noise robustness using model compensation methods. Such algorithms are already available, but their complexity is usually high. An often-referenced method for achieving noise robustness is parallel model combination (PMC). Several algorithms have been proposed to develop more computationally efficient methods than PMC. For example, Jacobian adaptation approximates PMC with a linear transformation function in the cepstral domain. However, the Jacobian approximation is valid only for test environments that are close to the training conditions whereas, in real test conditions, the mismatch between the test and training environments is usually large. In this paper, we propose two methods, respectively called static and dynamic α-Jacobian adaptation (or α-JAC), to compute new linear approximations of PMC for realistic test environments. We further extend both algorithms to compensate for additive and convolutional noise and we derive the corresponding non-linear algorithm that is approximated. All these algorithms are experimentally compared in important mismatch conditions. As compared to Jacobian adaptation, improvements are observed with both static and dynamic α-Jacobian adaptation. [Copyright Elsevier]
Published: 2004
Full Text: View/download PDF

16. Matching training and test data distributions for robust speech recognition

Author: Molau, Sirko, Keysers, Daniel, and Ney, Hermann
Subjects: *COMPUTATIONAL complexity, *PROBABILITY measures, *HEARING
Abstract: In this work normalization techniques in the acoustic feature space are studied that aim at reducing the mismatch between training and test by matching their distributions. Histogram normalization is the first technique explored in detail. The effect of normalization at different signal analysis stages as well as training and test data normalization are investigated. The basic normalization approach is improved by taking care of the variable silence fraction. Feature space rotation is the second technique that is introduced. It accounts for undesired variations in the acoustic signal that are correlated in the feature space dimensions. The interaction of rotation and histogram normalization is analyzed and it is shown that the recognition accuracy is significantly improved by both techniques on corpora with different complexity, acoustic conditions, and speaking styles. The word error rate is reduced from 24.6% to 21.8% on VerbMobil II, a German large vocabulary conversational speech task, and from 16.5% to 15.5% on EuTrans II, an Italian speech corpus of conversational speech over telephone. On the CarNavigation task, a German isolated-word corpus recorded partly in noisy car environments, the word error rate is reduced from 74.2% to 11.1% for heavy mismatch conditions between training and test. [Copyright Elsevier]
Published: 2003
Full Text: View/download PDF
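
Illustrative sketch: the abstract above names histogram normalization as a way of matching test and training feature distributions, and reports absolute word error rates. The code below shows one generic way to match empirical distributions (quantile mapping on a single feature dimension) and converts the reported error rates into relative reductions. The quantile-mapping details and both function names are assumptions of this example. The abstract does not specify the authors' exact procedure, their silence-fraction handling, or the feature space rotation, and none of these is reproduced.

```python
from bisect import bisect_right


def histogram_normalize(values, reference):
    """Map each value to the reference value at the same empirical CDF level."""
    sorted_vals = sorted(values)
    sorted_ref = sorted(reference)
    n, m = len(sorted_vals), len(sorted_ref)
    out = []
    for v in values:
        # Rank of v within its own data set (1..n), i.e. n times its CDF.
        rank = bisect_right(sorted_vals, v)
        # Reference index at the same CDF level, using integer arithmetic.
        idx = (rank * m + n - 1) // n - 1
        out.append(sorted_ref[idx])
    return out


def relative_reduction(before, after):
    """Relative error-rate reduction as a fraction of the baseline rate."""
    return (before - after) / before


# Test-condition features with a different scale than the training features.
mapped = histogram_normalize([10.0, 40.0, 20.0, 30.0], [1.0, 2.0, 3.0, 4.0])

# Word error rates quoted in the abstract (percent).
verbmobil = relative_reduction(24.6, 21.8)
eutrans = relative_reduction(16.5, 15.5)
car_navigation = relative_reduction(74.2, 11.1)
```

Expressed this way, the reported improvements correspond to relative word error rate reductions of roughly 11.4% (VerbMobil II), 6.1% (EuTrans II) and 85.0% (CarNavigation, heavy mismatch).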

17. Hidden-articulator Markov models for speech recognition

Author: Richardson, Matthew, Bilmes, Jeff, and Diorio, Chris
Subjects: *SPEECH perception, *AUTOMATION, *MARKOV processes
Abstract: Most existing automatic speech recognition systems today do not explicitly use knowledge about human speech production. We show that the incorporation of articulatory knowledge into these systems is a promising direction for speech recognition, with the potential for lower error rates and more robust performance. To this end, we introduce the Hidden-Articulator Markov model (HAMM), a model which directly integrates articulatory information into speech recognition. The HAMM is an extension of the articulatory-feature model introduced by Erler in 1996. We extend the model by using diphone units, developing a new technique for model initialization, and constructing a novel articulatory feature mapping. We also introduce a method to decrease the number of parameters, making the HAMM comparable in size to standard HMMs. We demonstrate that the HAMM can reasonably predict the movement of articulators, which results in a decreased word error rate (WER). The articulatory knowledge also proves useful in noisy acoustic conditions. When combined with a standard model, the HAMM reduces WER 28–35% relative to the standard model alone. [Copyright Elsevier]
Published: 2003
Full Text: View/download PDF

18. Robust speaker verification with state duration modeling

Author: Yoma, Nestor Becerra and Pegoraro, Tarciano Facco
Subjects: *SPEECH processing systems, *ALGORITHMS
Abstract: This paper addresses the problem of state duration modeling in the Viterbi algorithm in a text-dependent speaker verification task. The results presented in this paper suggest that temporal constraints can lead to reductions of 10% and 20% in the error rates with signals corrupted by noise at SNR equal to 6 and 0 dB, respectively, and that the accurate statistical modeling of state duration (e.g. with gamma probability distribution) does not seem to be very relevant if maximal and minimal state duration restrictions are imposed. In contrast, temporal restrictions do not seem to give any improvement in a speaker verification task with clean speech or high SNR. It is also shown that state duration constraints can easily be applied with the likelihood normalization metrics based on speaker-dependent temporal parameters. Finally, the results here presented show that word position-dependent state duration parameters give no significant improvement when compared with the word position-independent approach if the coarticulation effect between contiguous words is low. [Copyright Elsevier]
Published: 2002
Full Text: View/download PDF

19. An integrated study of speaker normalisation and HMM adaptation for noise robust speaker-independent speech recognition

Author: Hariharan, Ramalingam and Viikki, Olli
Subjects: *SPEECH perception, *SPEECH processing systems, *ACOUSTIC models, *VOCAL tract
Abstract: Inter-speaker variability and sensitivity to background noise are two major problems in modern speech recognition systems. In this paper, we investigate different techniques that have been developed to overcome these issues. These methods include vocal tract length normalisation (VTLN), on-line HMM adaptation and gender-dependent acoustic modelling. Our objective in this paper is to combine these techniques so that the system recognition performance is maximised. Moreover, we propose a vocal tract length normalisation technique, which is more implementation-friendly than the previously published utterance-specific VTLN (u-VTLN). In order to ensure the wide applicability of the methods to be studied, the performance evaluation is done both in connected digit recognition and monophone-based isolated word recognition. The recognition results obtained indicate the importance of the combined use of these techniques. The integrated use of VTLN and on-line adaptation always provided the highest performance in both types of recognition experiments using gender-independent models. As expected, on-line HMM adaptation provided the major performance improvement with respect to a gender- and speaker-independent baseline system. The combination of speaker-specific VTLN (s-VTLN) or gender-dependent acoustic modelling further improved the system accuracy. However, while the joint use of s-VTLN and gender-dependent HMMs improved the recognition rate with original unadapted models, a minor performance degradation was observed when s-VTLN was applied to on-line adapted gender-dependent HMMs. [Copyright Elsevier]
Published: 2002
Full Text: View/download PDF

20. Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge

Author: Stefan Steidl, Dino Seppi, Björn Schuller, and Anton Batliner
Subjects: Goto, Computer science, standardisation, adaptation, Engineering and technology, perception, Language and Linguistics, noise robustness, memory, feature selection, Phenomenon, Electrical engineering, electronic engineering, information engineering, Hidden Markov model, controversy, evaluation, Communication, Speaker recognition, Unit of analysis, Computer Science Applications, usability, Modeling and Simulation, feature types, System integration, Artificial intelligence & image processing, Computer Vision and Pattern Recognition, performance, Natural language processing, Linguistics and Language, affect recognition, linear prediction, emotion, selection, automatic classification, framework, algorithm, Networking & telecommunications, Data science, affect, Artificial intelligence, Software
Abstract: More than a decade has passed since research on automatic recognition of emotion from speech has become a new field of research in line with its 'big brothers' speech and speaker recognition. This article attempts to provide a short overview on where we are today, how we got there and what this can reveal us on where to go next and how we could arrive there. In a first part, we address the basic phenomenon reflecting the last fifteen years, commenting on databases, modelling and annotation, the unit of analysis and prototypicality. We then shift to automatic processing including discussions on features, classification, robustness, evaluation, and implementation and system integration. From there we go to the first comparative challenge on emotion recognition from speech, the INTERSPEECH 2009 Emotion Challenge, organised by (part of) the authors, including the description of the Challenge's database, Sub-Challenges, participants and their approaches, the winners, and the fusion of results to the actual learnt lessons before we finally address the ever-lasting problems and future promising attempts. © 2011 Elsevier B.V. All rights reserved. Schuller B., Batliner A., Steidl S., Seppi D., "Recognising realistic emotions and affect in speech: state of the art and lessons learnt from the first challenge", Speech Communication, vol. 53, no. 9-10, pp. 1062-1087, November 2011.
Published: 2011
Full Text: View/download PDF
