Journal: computer speech & language / Topic: speech perception - Searchworks@Jio Institute Digital Library Search Results

Showing total 113 results

1. A novel approach to remove outliers for parallel voice conversion.

Author: Shah, Nirmesh J. and Patil, Hemant A.
Subjects: *SPEECH perception, *OUTLIER detection, *VOICE change, *VOICE frequency, *VOCAL differences, *PRINCIPAL components analysis
Abstract: Alignment is a key step before learning a mapping function between a source and a target speaker's spectral features in various state-of-the-art parallel data Voice Conversion (VC) techniques. After alignment, some corresponding pairs are still inconsistent with the rest of the data and are considered outliers. These outliers shift the parameters of the mapping function from their true value and hence, negatively affect the learning of mapping function during the training phase of the VC task. To the best of the authors' knowledge, the effect of outliers (and hence, their removal) on quality of the converted voice has not been much explored in the VC literature. Recent research has shown the effectiveness of the outlier removal as a pre-processing step in the VC. In this paper, we extend this study with a detailed theory and analysis. The proposed method uses a score distance that is estimated using Robust Principal Component Analysis (ROBPCA) to detect the outliers. In particular, the outliers are determined using a fixed cut-off on the score distances, based on the degrees of freedom in a chi-squared distribution, which is speaker-pair independent. The fixed cut-off is due to the assumption that the score distances follow the normal (i.e., Gaussian) distribution. However, this is a weak statistical assumption even in the cases where quite many samples are available. Hence, in this paper, we propose to explore speaker-pair dependent cut-offs to detect the outliers. In addition, we have presented our results on two state-of-the-art databases, namely, CMU-ARCTIC and Voice Conversion Challenge (VCC) 2016 by developing various state-of-the-art methods in the VC. In particular, we have presented the effectiveness of the outlier removal on Gaussian Mixture Model (GMM), Artificial Neural Network (ANN), and Deep Neural Network (DNN)-based VC techniques. 
Furthermore, we have presented subjective and objective evaluations using a 95% confidence interval for the statistical significance of the tests. We obtained an average 0.56% relative reduction in Mel Cepstral Distortion (MCD) with the proposed outlier removal approach as a pre-processing step. In particular, with the proposed speaker-pair dependent cut-off, we have observed relative improvement of 12.24% and 30.51% in the speech quality, and 39.7% and 4.27% absolute improvement in the speaker similarity for the CMU-ARCTIC and the VCC 2016, respectively. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF
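
The abstract above reports a relative reduction in Mel Cepstral Distortion (MCD). As a reading aid, here is a minimal sketch of the commonly used MCD formula and of a relative-reduction calculation. It is not the authors' code: the function names, the exclusion of the 0th coefficient, and the sample values are assumptions made for illustration.

```python
import math


def mel_cepstral_distortion(target_frames, converted_frames):
    """Mean MCD in dB over time-aligned frames.

    Each frame is a list of mel-cepstral coefficients (assumption: the 0th,
    energy-related coefficient has already been dropped).
    """
    scale = 10.0 / math.log(10.0)
    per_frame = [
        scale * math.sqrt(2.0 * sum((t - c) ** 2 for t, c in zip(tf, cf)))
        for tf, cf in zip(target_frames, converted_frames)
    ]
    return sum(per_frame) / len(per_frame)


def relative_reduction(baseline, improved):
    """Relative reduction in percent, e.g. of MCD after outlier removal."""
    return 100.0 * (baseline - improved) / baseline


# Hypothetical two-dimensional frames, for illustration only.
example_mcd = mel_cepstral_distortion([[0.0, 0.0]], [[1.0, 0.0]])
example_gain = relative_reduction(5.0, 4.5)
```

Lower MCD means the converted spectrum is closer to the target, so a positive relative reduction indicates an improvement.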

2. Perception and prediction of speaker appeal – A single speaker study.

Author: Cullen, Ailbhe, Hines, Andrew, and Harte, Naomi
Subjects: *AUTOMATIC speech recognition, *SPEECH processing systems, *POLITICAL oratory, *APPEAL to authority (Logical fallacy), *CHARISMA, *ACOUSTICS, *SPEECH perception
Abstract: In this paper we explore the automatic prediction of speaker appeal from recordings of political speech. The database used contains recordings of a single speaker in a wide range of situations (interview, election rally etc.) which has been annotated for six speaker traits: boring; charismatic; enthusiastic; inspiring; likeable; and persuasive. The aim of this study is to predict these ratings using acoustic features of the speech. We offer three key contributions in this paper. Firstly, we explore the effect of acoustic environment on the perception of speaker ability. We find significant biases in the perception of all six traits, with interview speech being consistently rated as less appealing, and election rally speech as more appealing. In our second contribution, we attempt to exploit this bias by modelling speech from each situation separately, which gives a significant improvement in classification performance. Finally, the database covers 7 years. Thus, our third contribution is an analysis of the variance in both annotations and acoustic features over time to uncover temporal trends in speaker appeal. We find significant trends which show a decline in the speaker’s prosodic activity over time, which mirror a decline in the perception of speaker appeal as measured by the database annotations. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

3. Combining evidences from magnitude and phase information using VTEO for person recognition using humming.

Author: Patil, Hemant A. and Madhavi, Maulik C.
Subjects: *AUTOMATIC speech recognition, *SPEECH perception, *QUERYING (Computer science), *INFORMATION retrieval, *HUMAN-computer interaction, *VERSIFICATION, *FEATURE extraction
Abstract: Most state-of-the-art speaker recognition systems use natural speech signals (i.e., real speech, spontaneous speech or contextual speech) from the subjects. In this paper, recognition of a person is attempted from his or her hum with the help of machines. This kind of application can be useful to design a person-dependent Query-by-Humming (QBH) system and hence plays an important role in music information retrieval (MIR) systems. In addition, it can also be useful for other interesting speech technology applications such as human-computer interaction, speech prosody analysis of disordered speech, and speaker forensics. This paper develops a new feature extraction technique to exploit perceptually meaningful (due to mel frequency warping to imitate the human perception process for hearing) phase spectrum information along with magnitude spectrum information from the hum signal. In particular, the structure of the state-of-the-art feature set, namely, Mel Frequency Cepstral Coefficients (MFCCs), is modified to capture the phase spectrum information. In addition, a new energy measure, namely, Variable length Teager Energy Operator (VTEO), is employed to compute subband energies of different time-domain subband signals (i.e., the outputs of 24 triangular-shaped filters used in the mel filterbank). We refer to this proposed feature set as MFCC-VTMP (i.e., mel frequency cepstral coefficients to capture perceptually meaningful magnitude and phase information via VTEO). The polynomial classifier (which is in principle similar to other discriminatively trained classifiers such as the support vector machine (SVM) with polynomial kernel) is used as the basis for all the experiments. 
The effectiveness of the proposed feature set is evaluated and consistently found to be better than the MFCC feature set for several evaluation factors, such as comparison with other phase-based features, the order of the polynomial classifier, person (speaker) modeling approach (such as GMM-UBM and i-vector), the dimension of the feature vector, robustness under signal degradation conditions, static vs. dynamic features, feature discrimination measures and intersession variability. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

4. PaSCoNT - Parallel Speech Corpus of Northern-central Thai for automatic speech recognition.

Author: Taerungruang, Supawat, Taninpong, Phimphaka, Chunwijitra, Vataya, Thatphithakkul, Sumonmas, Kasuriya, Sawit, Inthanon, Viroj, Paksaranuwat, Pawat, Thumronglaohapun, Salinee, Nakharutai, Nawapon, Inkeaw, Papangkorn, and Bootkrajang, Jakramate
Subjects: *THAI language, *AUTOMATIC speech recognition, *SPEECH perception
Abstract: • PaSCoNT is a parallel speech corpus of Northern-Central Thai with 100 h of recorded speech. • PaSCoNT consists of 907,832 words and 6,279 vocabulary items. • There were statistically significant differences in speech tempo between Central Thai and Northern Thai. • An ASR model trained on PaSCoNT can be used for both Northern Thai and Central Thai dialect speech recognition. This paper proposes a Parallel Speech Corpus of Northern-central Thai (PaSCoNT). The purpose of this research is not only to understand the different linguistic characteristics of Northern and Central Thai, but also to utilize this corpus for automatic speech recognition. The corpus is composed of speech data from dialogues of daily life among northern Thai people. We designed 2,000 Northern Thai sentences covering all phonemes, in collaboration with linguists specialized in the Northern Thai dialect. The samples in this study are 200 Northern Thai dialect speakers who had been living in Chiang Mai province for more than 18 years. The speech was recorded in both open and closed environments. In the speech recording, each speaker read 100 pairs of Northern-Central Thai sentences to ensure that the speech data comes from the same speaker. In total, 100 h of speech were recorded: 50 h of Northern Thai and 50 h of Central Thai. Overall, PaSCoNT consists of 907,832 words and 6,279 vocabulary items. Statistical analysis of the PaSCoNT corpus revealed that 49.64% of the words in the lexicon belong to the Northern Thai dialect and 50.36% to the Central Thai dialect, and that 1,621 vocabulary items appeared in both Northern and Central Thai. Statistical analysis is used to examine the difference in speech tempo, i.e., time per phoneme (TTP) and syllables per minute (SPM), between Northern and Central Thai. The results revealed that there were statistically significant differences in speech tempo between Central and Northern Thai. 
The TTP speaking and articulation rate of Central Thai is lower than that of Northern Thai, whereas the SPM speaking and articulation rate of Central Thai is higher than that of Northern Thai. The results also showed that an ASR model trained on the Northern Thai speech corpus yields a lower WER when tested on Northern Thai speech data and a higher WER when tested on Central Thai speech data, and vice versa. However, an ASR model trained on the PaSCoNT speech corpus yields a lower WER for both Northern Thai and Central Thai test speech data. [ABSTRACT FROM AUTHOR]
Published: 2025
Full Text: View/download PDF
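
The abstract above compares ASR models by word error rate (WER). The sketch below shows the standard WER definition, the word-level edit distance divided by the reference length. It is a generic illustration, not the evaluation script used for PaSCoNT; the function name and the toy token sequences are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / len(reference).

    Both arguments are lists of word tokens; the reference must be non-empty.
    """
    # previous[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(reference)


# Toy tokens, for illustration only: one substitution in three words.
example_wer = word_error_rate("a b c".split(), "a x c".split())
```

Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.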

5. A hearing-inspired approach for distant-microphone speech recognition in the presence of multiple sources

Author: Ma, Ning, Barker, Jon, Christensen, Heidi, and Green, Phil
Subjects: *MICROPHONES, *SPEECH perception, *AUTOMATIC speech recognition, *AUDIOLOGY, *SOUND reverberation, *AUDITORY scene analysis, *STATISTICAL models, *PROBABILITY theory, *SPECTROGRAMS, *MISSING data (Statistics)
Abstract: This paper addresses the problem of speech recognition in reverberant multisource noise conditions using distant binaural microphones. Our scheme employs a two-stage fragment decoding approach inspired by Bregman's account of auditory scene analysis, in which innate primitive grouping ‘rules’ are balanced by the role of learnt schema-driven processes. First, the acoustic mixture is split into local time-frequency fragments of individual sound sources using signal-level primitive grouping cues. Second, statistical models are employed to select fragments belonging to the sound source of interest, and the hypothesis-driven stage simultaneously searches for the most probable speech/background segmentation and the corresponding acoustic model state sequence. The paper reports recent advances in combining adaptive noise floor modelling and binaural localisation cues within this framework. By integrating signal-level grouping cues with acoustic models of the target sound source in a probabilistic framework, the system is able to simultaneously separate and recognise the sound of interest from the mixture, and derive significant recognition performance benefits from different grouping cue estimates despite their inherent unreliability in noisy conditions. Finally, the paper will show that missing data imputation can be applied via fragment decoding to allow reconstruction of a clean spectrogram that can be further processed and used as input to conventional ASR systems. The best performing system achieves an average keyword recognition accuracy of 85.83% on the PASCAL CHiME Challenge task. [Copyright Elsevier]
Published: 2013
Full Text: View/download PDF

6. The PASCAL CHiME speech separation and recognition challenge

Author: Barker, Jon, Vincent, Emmanuel, Ma, Ning, Christensen, Heidi, and Green, Phil
Subjects: *PASCAL (Computer program language), *SPEECH, *ROBUST programming, *PERFORMANCE technology, *MICROPHONES, *SPEECH perception, *COMPARATIVE studies, *AUDITORY scene analysis
Abstract: Distant microphone speech recognition systems that operate with human-like robustness remain a distant goal. The key difficulty is that operating in everyday listening conditions entails processing a speech signal that is reverberantly mixed into a noise background composed of multiple competing sound sources. This paper describes a recent speech recognition evaluation that was designed to bring together researchers from multiple communities in order to foster novel approaches to this problem. The task was to identify keywords from sentences reverberantly mixed into audio backgrounds binaurally recorded in a busy domestic environment. The challenge was designed to model the essential difficulties of the multisource environment problem while remaining on a scale that would make it accessible to a wide audience. Compared to previous ASR evaluations, a particular novelty of the task is that the utterances to be recognised were provided in a continuous audio background rather than as pre-segmented utterances, thus allowing a range of background modelling techniques to be employed. The challenge attracted thirteen submissions. This paper describes the challenge problem, provides an overview of the systems that were entered and provides a comparison alongside both a baseline recognition system and human performance. The paper discusses insights gained from the challenge and lessons learnt for the design of future such evaluations. [Copyright Elsevier]
Published: 2013
Full Text: View/download PDF

7. The efficient incorporation of MLP features into automatic speech recognition systems

Author: Park, J., Diehl, F., Gales, M.J.F., Tomalin, M., and Woodland, P.C.
Subjects: *AUTOMATIC speech recognition, *SPEECH perception, *ACOUSTIC models, *CORPORA, *LATTICE theory, *VOCABULARY, *ARABIC language, *PERCEPTRONS
Abstract: In recent years, the use of Multi-Layer Perceptron (MLP) derived acoustic features has become increasingly popular in automatic speech recognition systems. These features are typically used in combination with standard short-term spectral-based features, and have been found to yield consistent performance improvements. However, there are a number of design decisions and issues associated with the use of MLP features for state-of-the-art speech recognition systems. Two modifications to the standard training/adaptation procedures are described in this work. First, the paper examines how MLP features, and the associated acoustic models, can be trained efficiently on large training corpora using discriminative training techniques. An approach that combines multiple individual MLPs is proposed, and this reduces the time needed to train MLPs on large amounts of data. In addition, to further speed up discriminative training, a lattice re-use method is proposed. The paper also examines how systems with MLP features can be adapted to particular speakers or acoustic environments. In contrast to previous work (where standard HMM adaptation schemes are used), linear input network adaptation is investigated. System performance is investigated within a multi-pass adaptation/combination framework. This allows the performance gains of individual techniques to be evaluated at various stages, as well as the impact in combination with other sub-systems. All the approaches considered in this paper are evaluated on an Arabic large vocabulary speech recognition task which includes both Broadcast News and Broadcast Conversation test data. [Copyright Elsevier]
Published: 2011
Full Text: View/download PDF

8. The use of phase in complex spectrum subtraction for robust speech recognition

Author: Kleinschmidt, Tristan, Sridharan, Sridha, and Mason, Michael
Subjects: *AUTOMATIC speech recognition, *SPEECH perception, *SUBTRACTION (Mathematics), *SIGNAL-to-noise ratio, *ALGORITHMS, *MAGNITUDE estimation, *EXPERIMENTS, *ORACLE software
Abstract: In this paper we propose a new method for utilising phase information by complementing it with traditional magnitude-only spectral subtraction speech enhancement through complex spectrum subtraction (CSS). The proposed approach has the following advantages over traditional magnitude-only spectral subtraction: (a) it introduces complementary information to the enhancement algorithm; (b) it reduces the total number of algorithmic parameters; and (c) it is designed for improving clean speech magnitude spectra and is therefore suitable for both automatic speech recognition (ASR) and speech perception applications. Oracle-based ASR experiments verify this approach, showing an average of 20% relative word accuracy improvements when accurate estimates of the phase spectrum are available. Based on sinusoidal analysis and assuming stationarity between observations (which is shown to be better approximated as the frame rate is increased), this paper also proposes a novel method for acquiring the phase information called Phase Estimation via Delay Projection (PEDEP). Further oracle ASR experiments validate the potential for the proposed PEDEP technique in ideal conditions. Realistic implementation of CSS with PEDEP shows performance comparable to state-of-the-art spectral subtraction techniques in a range of 15–20 dB signal-to-noise ratio environments. These results clearly demonstrate the potential for using phase spectra in spectral subtractive enhancement applications, and at the same time highlight the need for deriving more accurate phase estimates in a wider range of noise conditions. [Copyright Elsevier]
Published: 2011
Full Text: View/download PDF
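
The abstract above contrasts magnitude-only spectral subtraction with subtraction in the complex spectrum. The sketch below shows that difference for a single frequency bin, assuming the noise magnitude and, for the complex case, the noise phase are known exactly (an oracle setting). It is a generic illustration under those assumptions and does not reproduce the paper's CSS or PEDEP algorithms; all names and values are hypothetical.

```python
import cmath


def magnitude_subtraction(noisy_bin, noise_magnitude):
    """Subtract the noise magnitude and reuse the noisy phase."""
    magnitude = max(abs(noisy_bin) - noise_magnitude, 0.0)
    return cmath.rect(magnitude, cmath.phase(noisy_bin))


def complex_subtraction(noisy_bin, noise_magnitude, noise_phase):
    """Subtract the noise as a complex number (magnitude and phase)."""
    return noisy_bin - cmath.rect(noise_magnitude, noise_phase)


# One hypothetical frequency bin: clean speech plus additive noise.
clean = cmath.rect(1.0, 0.3)
noise = cmath.rect(0.5, 2.0)
noisy = clean + noise

magnitude_estimate = magnitude_subtraction(noisy, abs(noise))
complex_estimate = complex_subtraction(noisy, abs(noise), cmath.phase(noise))

magnitude_error = abs(magnitude_estimate - clean)
complex_error = abs(complex_estimate - clean)
```

With an exact noise phase the complex estimate recovers the clean bin up to rounding, while the magnitude-only estimate does not, because magnitudes of complex numbers do not add linearly. In practice the phase has to be estimated, which is where the errors discussed in the abstract come from.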

9. Comparing human and automatic speech recognition in simple and complex acoustic scenes.

Author: Spille, Constantin, Kollmeier, Birger, and Meyer, Bernd T.
Subjects: *AUTOMATIC speech recognition, *SPEECH perception, *SPEECH research, *HUMAN-machine relationship, *ARTIFICIAL neural networks
Abstract: Former comparisons of human speech recognition (HSR) and automatic speech recognition (ASR) have shown that humans outperform ASR systems in nearly all speech recognition tasks. However, recent progress in ASR has led to substantial improvements of recognition accuracy, and it is therefore unclear how large the task-dependent human-machine gap still remains. This paper investigates this gap between HSR and ASR based on deep neural networks (DNNs) in different acoustic conditions, with the aim of comparing differences and identifying processing strategies that should be considered in ASR. We find that DNN-based ASR reaches human performance for single-channel, small-vocabulary tasks in the presence of speech-shaped noise and in multi-talker babble noise, which is an important difference to previous human-machine comparisons: The speech reception threshold, i.e., the signal-to-noise ratio with 50% word recognition rate is at about −7 to −8 dB both for HSR and ASR. However, in more complex spatial scenes with diffuse noise and moving talkers, the SRT gap amounts to approximately 12 dB. Based on cross comparisons that use oracle knowledge (e.g., the speakers’ true position), incorrect responses are attributed to localization errors or missing pitch information to distinguish between speakers with different gender. In terms of the SRT, localization errors and missing spectral information amount to 2.1 and 3.2 dB, respectively. The comparison hence identifies specific components in ASR that can profit from learning from auditory signal processing. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF
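
The abstract above defines the speech reception threshold (SRT) as the signal-to-noise ratio at which the word recognition rate is 50%. One simple way to read an SRT off a handful of measurements is linear interpolation between the two points that bracket 50%, sketched below. The interpolation method, the function name, and the data points are assumptions for illustration; the study may have used a different fitting procedure.

```python
def speech_reception_threshold(snrs_db, recognition_rates, target=0.5):
    """Interpolate the SNR (dB) at which the recognition rate hits `target`.

    Assumes the recognition rate increases with SNR around the target.
    """
    points = sorted(zip(snrs_db, recognition_rates))
    for (snr_lo, rate_lo), (snr_hi, rate_hi) in zip(points, points[1:]):
        if rate_lo <= target <= rate_hi and rate_hi > rate_lo:
            fraction = (target - rate_lo) / (rate_hi - rate_lo)
            return snr_lo + fraction * (snr_hi - snr_lo)
    raise ValueError("target rate is not bracketed by the measurements")


# Hypothetical measurements, not taken from the paper.
example_srt = speech_reception_threshold([-12.0, -8.0, -4.0], [0.2, 0.4, 0.8])
```

A lower SRT means the listener or recognizer tolerates more noise, so the gap between two systems can be stated as a difference in dB.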

10. Speech fragment decoding techniques for simultaneous speaker identification and speech recognition

Author: Barker, Jon, Ma, Ning, Coy, André, and Cooke, Martin
Subjects: *AUDITORY scene analysis, *SPEECH perception, *LECTURERS, *NOISE control, *ROBUST control, *LISTENING, *INFORMATION processing, *PROGRAMMING languages
Abstract: This paper addresses the problem of recognising speech in the presence of a competing speaker. We review a speech fragment decoding technique that treats segregation and recognition as coupled problems. Data-driven techniques are used to segment a spectro-temporal representation into a set of fragments, such that each fragment is dominated by one or other of the speech sources. A speech fragment decoder is used which employs missing data techniques and clean speech models to simultaneously search for the set of fragments and the word sequence that best matches the target speaker model. The paper investigates the performance of the system on a recognition task employing artificially mixed target and masker speech utterances. The fragment decoder produces significantly lower error rates than a conventional recogniser, and mimics the pattern of human performance that is produced by the interplay between energetic and informational masking. However, at around 0 dB the performance is generally quite poor. An analysis of the errors shows that a large number of target/masker confusions are being made. The paper presents a novel fragment-based speaker identification approach that allows the target speaker to be reliably identified across a wide range of SNRs. This component is combined with the recognition system to produce significant improvements. When the target and masker utterance have the same gender, the recognition system has a performance at 0 dB equal to that of humans; in other conditions the error rate is roughly twice the human error rate. [Copyright Elsevier]
Published: 2010
Full Text: View/download PDF

11. Combining missing-feature theory, speech enhancement, and speaker-dependent/-independent modeling for speech separation

Author: Ming, Ji, Hazen, Timothy J., and Glass, James R.
Subjects: *SPEECH perception, *MATHEMATICAL analysis, *AUDITORY scene analysis, *MATHEMATICAL models, *SENTENCES (Grammar), *SCIENTIFIC observation, *ROBUST control, *NOISE control, *DATABASES
Abstract: This paper considers the separation and recognition of overlapped speech sentences assuming single-channel observation. A system based on a combination of several different techniques is proposed. The system uses a missing-feature approach for improving crosstalk/noise robustness, a Wiener filter for speech enhancement, hidden Markov models for speech reconstruction, and speaker-dependent/-independent modeling for speaker and speech recognition. We develop the system on the Speech Separation Challenge database, involving a task of separating and recognizing two mixed sentences without assuming advance knowledge about the identity of the speakers or about the signal-to-noise ratio. The paper is an extended version of a previous conference paper submitted for the challenge. [Copyright Elsevier]
Published: 2010
Full Text: View/download PDF

12. On-line Stochastic Matching compensation for non-stationary noise

Author: Barreaud, V., Illina, I., and Fohr, D.
Subjects: *ALGORITHMS, *SPEECH perception, *LIPREADING, *PSYCHOLINGUISTICS
Abstract: This paper treats the problem of noise compensation in speech recognition when training and testing conditions do not match. We are interested in two types of non-stationary noise that may be present during test, namely slowly varying and abruptly varying noises. The context of our work is the Stochastic Matching framework. The Stochastic Matching compensation method transforms test data using an affine compensation function whose parameters are computed off-line. Stochastic Matching approaches are interesting since they make few assumptions about the nature and the level of the noise, but they are best suited for the compensation of stationary noise. In this paper we propose an original contribution to the Stochastic Matching framework. It is based on an on-line frame-synchronous version of the Stochastic Matching method to compensate for slowly varying noise. Our contribution extends this compensation algorithm in order to compensate for abruptly varying noise. The basic idea of the proposed methods is to perform the compensation and the recognition steps at the same time. The environment changes are identified using monitoring algorithms. The performance of our proposed methods is evaluated on two speech databases, one recorded in moving cars (VODIS), and another one obtained by corrupting VODIS with abruptly varying noise from NOISEX. The proposed approaches significantly outperform classical compensation methods (Off-line Stochastic Matching, Sequential Mean Cepstrum Removal, Parallel Model Combination, Spectral Subtraction). For instance, we obtain up to 32.6% word error rate reduction over S-MCR on a database corrupted by a 10 dB abruptly varying white noise. [Copyright Elsevier]
Published: 2008
Full Text: View/download PDF

13. Differences between acoustic characteristics of spontaneous and read speech and their effects on speech recognition performance

Author: Nakamura, Masanobu, Iwano, Koji, and Furui, Sadaoki
Subjects: *SPEECH research, *COMPUTER science, *SPEECH perception, *AUDITORY adaptation, *EXTEMPORANEOUS speaking, *JAPANESE people
Abstract: Although speech derived from read texts, news broadcasts, and other similar prepared contexts can be recognized with high accuracy, recognition performance drastically decreases for spontaneous speech. This is due to the fact that spontaneous speech and read speech are significantly different acoustically as well as linguistically. This paper statistically and quantitatively analyzes differences in acoustic features between spontaneous and read speech using two large-scale speech corpora, “Corpus of Spontaneous Japanese (CSJ)” and “Japanese Newspaper Article Sentences (JNAS)”. Experimental results show that spontaneous speech can be characterized by reduced spectral space in comparison with that of read speech, and that the more spontaneous, the more the spectral space shrinks. This paper also clarifies that reduction in the spectral space leads to reduction in phoneme recognition accuracy. This result indicates that spectral reduction is one major reason for the decrease of recognition accuracy in spontaneous speech. [Copyright Elsevier]
Published: 2008
Full Text: View/download PDF

14. Soft indexing of speech content for search in spoken documents

Author: Chelba, Ciprian, Silva, Jorge, and Acero, Alex
Subjects: *SPEECH perception, *COMPUTER input-output equipment, *NATURAL language processing, *ELECTRONIC data processing, *HUMAN-computer interaction, *AUTOMATIC speech recognition, *INTELLIGIBILITY of speech, *SPEECH processing systems, *COMPUTATIONAL linguistics
Abstract: The paper presents the Position Specific Posterior Lattice (PSPL), a novel lossy representation of automatic speech recognition lattices that naturally lends itself to efficient indexing and subsequent relevance ranking of spoken documents. This technique explicitly takes into consideration the content uncertainty by means of using soft-hits. Indexing position information allows one to approximate N-gram expected counts and at the same time use more general proximity features in the relevance score calculation. In fact, one can easily port any state-of-the-art text-retrieval algorithm to the scenario of indexing ASR lattices for spoken documents, rather than using the 1-best recognition result. Experiments performed on a collection of lecture recordings—MIT iCampus database—show that the spoken document ranking performance was improved by 17–26% relative over the commonly used baseline of indexing the 1-best output from an automatic speech recognizer (ASR). The paper also addresses the problem of integrating speech and text content sources for the document search problem, as well as its usefulness from an ad hoc retrieval—keyword search—point of view. In this context, the PSPL formulation is naturally extended to deal with both speech and text content for a given document, where a new relevance ranking framework is proposed for integrating the different sources of information available. Experimental results on the MIT iCampus corpus show a relative improvement of 302% in Mean Average Precision (MAP) when using speech content and text-only metadata as opposed to just text-only metadata (which constitutes about 1% of the amount of data in the transcription of the speech content, measured in number of words). 
Further experiments show that even in scenarios for which the metadata size is artificially augmented such that it contains more than 10% of the spoken document transcription, the speech content still provides significant performance gains in MAP with respect to only using the text-metadata for relevance ranking. [Copyright Elsevier]
Published: 2007
Full Text: View/download PDF

15. Features based on filtering and spectral peaks in autocorrelation domain for robust speech recognition

Author: Farahani, G., Ahadi, S.M., and Homayounpour, M.M.
Subjects: *SPEECH perception, *SPEECH processing systems, *WORD recognition, *AUTOMATIC speech recognition
Abstract: In this paper, a set of features derived by filtering and spectral peak extraction in the autocorrelation domain is proposed. We focus on the effect of additive noise on speech recognition. Assuming that the channel characteristics and additive noises are stationary, these new features improve the robustness of speech recognition in noisy conditions. In this approach, initially, the autocorrelation sequence of a speech signal frame is computed. Filtering of the autocorrelation of the speech signal is carried out in the second step, and then the short-time power spectrum of speech is obtained from the speech signal through the fast Fourier transform. The power spectrum peaks are then calculated by differentiating the power spectrum with respect to frequency. The magnitudes of these peaks are then projected onto the mel-scale and passed through the filter bank. Finally, a set of cepstral coefficients is derived from the outputs of the filter bank. The effectiveness of the new features for speech recognition in noisy conditions will be shown in this paper through a number of speech recognition experiments. A task of multi-speaker isolated-word recognition and another one of multi-speaker continuous speech recognition with various artificially added noises such as factory, babble, car and F16 were used in these experiments. Also, a set of experiments was carried out on the Aurora 2 task. Experimental results show significant improvements under noisy conditions in comparison to the results obtained using traditional feature extraction methods. We have also reported the results obtained by applying cepstral mean normalization on the methods to get robust features against both additive noise and channel distortion. [Copyright Elsevier]
Published: 2007
Full Text: View/download PDF

16. Reformulating the HMM as a trajectory model by imposing explicit relationships between static and dynamic feature vector sequences

Author: Zen, Heiga, Tokuda, Keiichi, and Kitamura, Tadashi
Subjects: *SPEECH perception, *MARKOV processes, *SPEECH processing systems, *AUTOMATIC speech recognition
Abstract: In the present paper, a trajectory model, derived from a hidden Markov model (HMM) by imposing explicit relationships between static and dynamic feature vector sequences, is developed and evaluated. The derived model, named a trajectory HMM, can alleviate two limitations of the standard HMM, which are (i) piece-wise constant statistics within a state and (ii) conditional independence assumption of state output probabilities, without increasing the number of model parameters. In the present paper, a Viterbi-type training algorithm based on the maximum likelihood criterion is also derived. The performance of the trajectory HMM was evaluated both in speech recognition and synthesis. In a speaker-dependent continuous speech recognition experiment, the trajectory HMM achieved an error reduction over the corresponding standard HMM. Subjective listening test results showed that the introduction of the trajectory HMM improved the naturalness of synthetic speech. [Copyright Elsevier]
Published: 2007
Full Text: View/download PDF

17. MAP adaptation of stochastic grammars

Author: Bacchiani, Michiel, Riley, Michael, Roark, Brian, and Sproat, Richard
Subjects: *SPEECH perception, *NUMERICAL analysis, *INTERPOLATION, *HYPOTHESIS
Abstract: This paper investigates supervised and unsupervised adaptation of stochastic grammars, including n-gram language models and probabilistic context-free grammars (PCFGs), to a new domain. It is shown that the commonly used approaches of count merging and model interpolation are special cases of a more general maximum a posteriori (MAP) framework, which additionally allows for alternate adaptation approaches. This paper investigates the effectiveness of different adaptation strategies, and, in particular, focuses on the need for supervision in the adaptation process. We show that n-gram models as well as PCFGs benefit from either supervised or unsupervised MAP adaptation in various tasks. For n-gram models, we compare the benefit from supervised adaptation with that of unsupervised adaptation on a speech recognition task with an adaptation sample of limited size (about 17h), and show that unsupervised adaptation can obtain 51% of the 7.7% adaptation gain obtained by supervised adaptation. We also investigate the benefit of using multiple word hypotheses (in the form of a word lattice) for unsupervised adaptation on a speech recognition task for which there was a much larger adaptation sample available. The use of word lattices for adaptation required the derivation of a generalization of the well-known Good-Turing estimate. Using this generalization, we derive a method that uses Monte Carlo sampling for building Katz backoff models. The adaptation results show that, for adaptation samples of limited size (several tens of hours), unsupervised adaptation on lattices gives a performance gain over using transcripts. The experimental results also show that with a very large adaptation sample (1050h), the benefit from transcript-based adaptation matches that of lattice-based adaptation. 
Finally, we show that PCFG domain adaptation using the MAP framework provides similar gains in F-measure accuracy on a parsing task as was seen in ASR accuracy improvements with n-gram adaptation. Experimental results show that unsupervised adaptation provides 37% of the 10.35% gain obtained by supervised adaptation. [Copyright Elsevier]
Published: 2006
Full Text: View/download PDF
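
To illustrate the abstract's claim that count merging and model interpolation are special cases of MAP adaptation, here is a minimal unigram sketch. It assumes a Dirichlet-style prior with a single weight `tau`; the function names, the toy counts and this parameterization are illustrative assumptions and are not taken from the paper, which treats n-gram histories and PCFGs.

```python
def relative_freq(counts):
    """Maximum-likelihood distribution from raw counts."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def map_estimate(in_counts, prior_probs, tau):
    """MAP estimate: in-domain counts plus tau pseudo-counts from the prior."""
    total = sum(in_counts.values())
    vocab = set(in_counts) | set(prior_probs)
    return {w: (in_counts.get(w, 0) + tau * prior_probs.get(w, 0.0))
               / (total + tau)
            for w in vocab}


# Hypothetical toy counts: a large out-of-domain sample, a small in-domain one.
out_counts = {"call": 60, "flight": 30, "hotel": 10}
in_counts = {"call": 2, "flight": 6, "hotel": 2}
prior = relative_freq(out_counts)

# Count merging: tau equals the out-of-domain sample size.
merged = map_estimate(in_counts, prior, tau=sum(out_counts.values()))

# Interpolation with prior weight lam: tau = N_in * lam / (1 - lam).
lam = 0.3
interpolated = map_estimate(
    in_counts, prior, tau=sum(in_counts.values()) * lam / (1 - lam))
```

With these settings `merged` equals the relative frequencies of the pooled counts, and `interpolated` equals `(1 - lam) * P_in + lam * P_out`, which is the sense in which both schemes fall out of one estimator. On the reported figures, 51% of a 7.7% gain is roughly 3.9 points, and 37% of 10.35% is roughly 3.8 points.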

18. The effect of pruning and compression on graphical representations of the output of a speech recognizer

Author: Liu, Yang, Harper, Mary P., Johnson, Michael T., and Jamieson, Leah H.
Subjects: *SPEECH perception, *HEARING, *ALGORITHMS, *GRAPHIC methods
Abstract: Large vocabulary continuous speech recognition can benefit from an efficient data structure for representing a large number of acoustic hypotheses compactly. Word graphs or lattices have been chosen as such an efficient interface between acoustic recognition engines and subsequent language processing modules. This paper first investigates the effect of pruning during acoustic decoding on the quality of word lattices and shows that by combining different pruning options (at the model level and word level), we can obtain word lattices with comparable accuracy to the original lattices and a manageable size. In order to use the word lattices as the input for a post-processing language module, they should preserve the target hypotheses and their scores while being as small as possible. In this paper, we introduce a word graph compression algorithm that significantly reduces the number of words in the graphical representation without eliminating utterance hypotheses or distorting their acoustic scores. We compare this word graph compression algorithm with several other lattice size-reducing approaches and demonstrate the relative strength of the new word graph compression algorithm for decreasing the number of words in the representation. Experiments are conducted across corpora and vocabulary sizes to determine the consistency of the pruning and compression results. [Copyright Elsevier]
Published: 2003
Full Text: View/download PDF

19. Robust speech recognition and feature extraction using HMM2

Author: Weber, Katrin, Ikbal, Shajith, Bengio, Samy, and Bourlard, Hervé
Subjects: *SPEECH perception, *MARKOV processes, *PROBABILITY theory, *VOCAL tract
Abstract: This paper presents the theoretical basis and preliminary experimental results of a new HMM model, referred to as HMM2, which can be considered as a mixture of HMMs. In this new model, the emission probabilities of the temporal (primary) HMM are estimated through secondary, state specific, HMMs working in the acoustic feature space. Thus, while the primary HMM is performing the usual time warping and integration, the secondary HMMs are responsible for extracting/modeling the possible feature dependencies, while performing frequency warping and integration. Such a model has several potential advantages, such as a more flexible modeling of the time/frequency structure of the speech signal. When working with spectral features, such a system can also perform nonlinear spectral warping, effectively implementing a form of nonlinear vocal tract normalization. Furthermore, it will be shown that HMM2 can be used to extract noise robust features, supposed to be related to formant regions, which can be used as extra features for traditional HMM recognizers to improve their performance. These issues are evaluated in the present paper, and different experimental results are reported on the Numbers95 database. [Copyright Elsevier]
Published: 2003
Full Text: View/download PDF

20. Speech recognition with unknown partial feature corruption – a review of the union model

Author: Ming, Ji and Jack Smith, F.
Subjects: *SPEECH perception, *NOISE, *STATISTICS
Abstract: This paper provides a summary of our studies on robust speech recognition based on a new statistical approach – the probabilistic union model. We consider speech recognition given that part of the acoustic features may be corrupted by noise. The union model is a method for basing the recognition on the clean part of the features, thereby reducing the effect of the noise on recognition. To this end, the union model is similar to the missing feature method. However, the two methods achieve this end through different routes. The missing feature method usually requires the identity of the noisy data for noise removal, while the union model combines the local features based on the union of random events, to reduce the dependence of the model on information about the noise. We previously investigated the applications of the union model to speech recognition involving unknown partial corruption in frequency band, in time duration, and in feature streams. Additionally, a combination of the union model with conventional noise-reduction techniques was studied, as a means of dealing with a mixture of known or trainable noise and unknown unexpected noise. In this paper, a unified review, in the context of dealing with unknown partial feature corruption, is provided into each of these applications, giving the appropriate theory and implementation algorithms, along with an experimental evaluation. [Copyright Elsevier]
Published: 2003
Full Text: View/download PDF

21. Large scale discriminative training of hidden Markov models for speech recognition

Author: Woodland, P. C. and Povey, D.
Subjects: *SPEECH perception, *GAUSSIAN processes, *MARKOV processes
Abstract: This paper describes, and evaluates on a large scale, the lattice based framework for discriminative training of large vocabulary speech recognition systems based on Gaussian mixture hidden Markov models (HMMs). This paper concentrates on the maximum mutual information estimation (MMIE) criterion which has been used to train HMM systems for conversational telephone speech transcription using up to 265 hours of training data. These experiments represent the largest-scale application of discriminative training techniques for speech recognition of which the authors are aware. Details are given of the MMIE lattice-based implementation used with the extended Baum-Welch algorithm, which makes training of such large systems computationally feasible. Techniques for improving generalization using acoustic scaling and weakened language models are discussed. The overall technique has allowed the estimation of triphone and quinphone HMM parameters which has led to significant reductions in word error rate for the transcription of conversational telephone speech relative to our best systems trained using maximum likelihood estimation (MLE). This is in contrast to some previous studies, which have concluded that there is little benefit in using discriminative training for the most difficult large vocabulary speech recognition tasks. The lattice MMIE-based discriminative training scheme is also shown to out-perform the frame discrimination technique. Various properties of the lattice-based MMIE training scheme are investigated including comparisons of different lattice processing strategies (full search and exact-match) and the effect of lattice size on performance. Furthermore a scheme based on the linear interpolation of the MMIE and MLE objective functions is shown to reduce the danger of over-training. It is shown that HMMs trained with MMIE benefit as much as MLE-trained HMMs from applying model adaptation using maximum likelihood linear regression (MLLR). 
This has allowed the straightforward integration of MMIE-trained HMMs into complex multi-pass systems for transcription of conversational telephone speech and has contributed to our MMIE-trained systems giving the lowest word error rates in both the 2000 and 2001 NIST Hub5 evaluations. [Copyright Elsevier]
Published: 2002
Full Text: View/download PDF

22. Sequential use of spectral models to reduce deletion and insertion errors in vowel detection.

Author: Kashani, Hamidreza Baradaran and Sayadiyan, Abolghasem
Subjects: *SEQUENCE (Linguistics), *SPEECH errors, *SPEECH perception, *VOWELS, *DELETION (Linguistics), *CONSONANTS
Abstract: From both perspectives of speech production and speech perception, vowels as syllable nuclei can be considered as the most significant speech events. Detection of vowel events from a speech signal is usually performed by a two-step procedure. First, a temporal objective contour (TOC), as a time-varying measure of vowel similarity, is generated from the speech signal. Second, vowel landmarks, as the places of vowel events, are extracted by locating prominent peaks of the TOC. In this paper, by employing some spectral models in a sequential manner, we propose a new framework that directly addresses three possible errors in the vowel detection problem, namely vowel deletion, consonant insertion, and vowel insertion. The proposed framework consists of three main steps as follows. At the first step, two solutions are proposed to essentially reduce the initial vowel deletion error. The first solution is to use the peaks detected by a conventional energy-based TOC, but without utilizing TOC smoothing and peak thresholding processes. The peaks detected by a spectral-based TOC generated on the basis of GMM models are also put forward as the second solution for achieving a smaller vowel deletion error. At the second step, a two-class support vector machine (SVM) classifier is adopted to identify the consonant peaks from the vowel ones. Removing the peaks classified as consonants reduces the consonant insertion error. Finally, a two-class SVM classifier is proposed to classify the consecutive peaks detected within the same vowel from the others. The merging of the peaks classified as “same vowel” considerably reduces the vowel insertion error. Experiments are separately conducted on three standard speech corpora, namely FARSDAT, TIMIT and TFARSDAT. The effectiveness of the techniques proposed to reduce three types of detection errors is verified. 
The criteria of total error (as the summation of three detection errors) and F-measure, respectively result in about 9.7% and 95.1% for FARSDAT, 17.5% and 91.3% for TIMIT, and 19.6% and 90.2% for the TFARSDAT corpus. The evaluation results show that the proposed framework outperforms the existing well-known methods in terms of both total error and F-measure on both read and spontaneous speech corpora. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

23. Conversational telephone speech recognition for Lithuanian.

Author: Lileikytė, Rasa, Lamel, Lori, Gauvain, Jean-Luc, and Gorin, Arseniy
Subjects: *SPEECH perception, *LITHUANIAN language, *COLLOQUIAL language, *PHONETIC transcriptions, *KEYWORD searching, *LINGUISTICS databases, *PRONUNCIATION, *VOCABULARY
Abstract: The research presented in the paper addresses conversational telephone speech recognition and keyword spotting for the Lithuanian language. Lithuanian can be considered a low e-resourced language as little transcribed audio data, and more generally, only limited linguistic resources are available electronically. Part of this research explores the impact of reducing the amount of linguistic knowledge and manual supervision when developing the transcription system. Since designing a pronunciation dictionary requires language-specific expertise, the need for manual supervision was assessed by comparing phonemic and graphemic units for acoustic modeling. Although the Lithuanian language is generally described in the linguistic literature with 56 phonemes, under low-resourced conditions some phonemes may not be sufficiently observed to be modeled. Therefore different phoneme inventories were explored to assess the effects of explicitly modeling diphthongs, affricates and soft consonants. The impact of using Web data for language modeling and additional untranscribed audio data for semi-supervised training was also measured. Out-of-vocabulary (OOV) keywords are a well-known challenge for keyword search. While word-based keyword search is quite effective for in-vocabulary words, OOV keywords are largely undetected. Morpheme-based subword units are compared with character n-gram-based units for their capacity to detect OOV keywords. Experimental results are reported for two training conditions defined in the IARPA Babel program: the full language pack and the very limited language pack, for which, respectively, 40 h and 3 h of transcribed training data are available. For both conditions, grapheme-based and phoneme-based models are shown to obtain comparable transcription and keyword spotting results. The use of Web texts for language modeling is shown to significantly improve both speech recognition and keyword spotting performance. 
Combining full-word and subword units leads to the best keyword spotting results. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

24. Exploiting automatic speech recognition errors to enhance partial and synchronized caption for facilitating second language listening.

Author: Mirzaei, Maryam Sadat, Meshgi, Kourosh, and Kawahara, Tatsuya
Subjects: *AUTOMATIC speech recognition, *SPEECH errors, *SECOND language acquisition, *LISTENING comprehension, *SYNCHRONIZATION software, *WORD frequency, *SPEECH perception, *TEMPO (Phonetics)
Abstract: This paper addresses the viability of using Automatic Speech Recognition (ASR) errors as the predictor of difficulties in speech segments, thereby exploiting them to improve Partial and Synchronized Caption (PSC), which we have proposed to train second language (L2) listening skill by encouraging listening over reading. The system uses ASR technology to make word-level text-to-speech synchronization and generates a partial caption. The baseline system determines difficult words based on three features: speech rate, word frequency and specificity. While it encompasses most of the difficult words, it does not cover a wide range of features that hinder L2 listening. Therefore, we propose the use of ASR systems as a model of L2 listeners and hypothesize that ASR errors can predict challenging speech segments for these learners. Among different cases of ASR errors, annotation results suggest the usefulness of four categories of homophones, minimal pairs, negatives, and breached boundaries for L2 listeners. A preliminary experiment with L2 learners focusing on these four categories of the ASR errors revealed that these cases highlight the problematic speech regions for L2 listeners. Based on the findings, the PSC system is enhanced to incorporate these kinds of useful ASR errors. An experiment with L2 learners demonstrated that the enhanced version of PSC is not only preferable, but also more helpful to facilitate the L2 listening process. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

25. Sparse coding based features for speech units classification.

Author: Sharma, Pulkit, Abrol, Vinayak, Dileep, A.D., and Sao, Anil Kumar
Subjects: *SPEECH, *MULTIPLE correspondence analysis (Statistics), *HINDI language, *SPEECH perception, *HIDDEN Markov models
Abstract: In this work, we propose sparse representation based features for speech units classification tasks. In order to effectively capture the variations in a speech unit, the proposed method employs multiple class specific dictionaries. Here, the training data belonging to each class is clustered into multiple clusters, and a principal component analysis (PCA) based dictionary is learnt for each cluster. It has been observed that coefficients corresponding to middle principal components can effectively discriminate among different speech units. Exploiting this observation, we propose to use a transformation function known as weighted decomposition (WD) of principal components, which is used to emphasize the discriminative information present in the PCA-based dictionary. In this paper, both raw speech samples and mel frequency cepstral coefficients (MFCC) are used as an initial representation for feature extraction. For comparison, various popular dictionary learning techniques such as K-singular value decomposition (KSVD), simultaneous codeword optimization (SimCO) and greedy adaptive dictionary (GAD) are also employed in the proposed framework. The effectiveness of the proposed features is demonstrated using continuous density hidden Markov model (CDHMM) based classifiers for (i) classification of isolated utterances of E-set of English alphabet, (ii) classification of consonant-vowel (CV) segments in Hindi language and (iii) classification of phoneme from TIMIT phonetic corpus. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

26. Dempster-Shafer theory for enhanced statistical model-based voice activity detection.

Author: Park, Tae-Jun and Chang, Joon-Hyuk
Subjects: *DEMPSTER-Shafer theory, *SPEECH perception, *WHITE noise, *TRAFFIC noise, *SIGNAL-to-noise ratio
Abstract: In this paper, we propose to combine the posterior probabilities of voice activity derived from different statistical model-based algorithms for enhanced voice activity detection. For this, the Dempster-Shafer (DS) theory of evidence is employed to represent and combine the different probabilities estimated by three different statistical model-based VAD algorithms, namely Sohn’s likelihood ratio test (LRT)-based method, the smoothed LRT-based method, and the multiple observation LRT-based method. By considering a generalization of the Bayesian framework and permitting the characterization of uncertainty and ignorance through the DS theory, the probability of an ignorant state is eliminated through the orthogonal sum of several speech presence probabilities, which improves performance when detecting voice activity. Objective test results show that the proposed DS theory-based VAD method offers significant improvements over the conventional approaches. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF
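
The orthogonal sum mentioned in the abstract above is Dempster's rule of combination. The sketch below applies it to a two-hypothesis frame (speech, noise) plus an explicit ignorance mass. How the paper maps each detector's speech presence probability to a mass function is not stated in the abstract, so `mass_from_probability` and its `ignorance` parameter are hypothetical, as are the example numbers.

```python
from functools import reduce
from itertools import product

SPEECH = frozenset({"speech"})
NOISE = frozenset({"noise"})
EITHER = SPEECH | NOISE  # mass on the whole frame represents ignorance


def combine(m1, m2):
    """Dempster's rule: multiply masses, keep non-empty intersections,
    and renormalize by the non-conflicting mass."""
    joint = {}
    conflict = 0.0
    for (a, pa), (b, pb) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            joint[common] = joint.get(common, 0.0) + pa * pb
        else:
            conflict += pa * pb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    return {focal: mass / (1.0 - conflict) for focal, mass in joint.items()}


def mass_from_probability(p_speech, ignorance):
    """Hypothetical mapping: reserve `ignorance` for the whole frame and
    split the remainder according to the detector's probability."""
    return {
        SPEECH: p_speech * (1.0 - ignorance),
        NOISE: (1.0 - p_speech) * (1.0 - ignorance),
        EITHER: ignorance,
    }


# Three detectors with assumed outputs and assumed ignorance levels.
detectors = [
    mass_from_probability(0.7, 0.2),
    mass_from_probability(0.6, 0.3),
    mass_from_probability(0.8, 0.1),
]
fused = reduce(combine, detectors)
```

In this toy setting the fused ignorance mass ends up far below that of any single detector while the speech mass grows, which is the qualitative behaviour the abstract describes.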

27. Uncertainty weighting and propagation in DNN–HMM-based speech recognition.

Author: Novoa, José, Fredes, Josué, Poblete, Víctor, and Yoma, Néstor Becerra
Subjects: *SPEECH perception, *PHONOLOGICAL decoding, *AUTOMATIC speech recognition, *ARTIFICIAL neural networks, *NOISE control
Abstract: In this paper an uncertainty weighting scheme for DNN–HMM-based speech recognition is proposed to increase discriminability in the decoding process. To this end, the DNN pseudo-log-likelihoods are weighted according to the uncertainty variance assigned to the acoustic observation. The results presented here suggest that substantial reduction in WER is achieved with clean training. Moreover, modelling the uncertainty propagation through the DNN is not required and no approximations for non-linear activation functions are made. The presented method can be applied to any network topology that delivers log-likelihood-like scores. It can be combined with any noise removal technique and adds a minimal computational cost. This technique was exhaustively evaluated and combined with uncertainty-propagation-based schemes for computing the pseudo-log-likelihoods and uncertainty variance at the DNN output. Two proposed methods optimized the parameters of the weighting function by leveraging the grid search either on a development database representing the given task or on each utterance based on discrimination metrics. Experiments with Aurora-4 task showed that, with clean training, the proposed weighting scheme can reduce WER by a maximum of 21% compared with a baseline system with spectral subtraction and uncertainty propagation using the unscented transform. The uncertainty weighting method reduced the gap between clean and multi-noise/multi-condition training. This can be useful when it is not easy to train a DNN–HMM system in conditions that are similar to the testing ones. Finally, the presented results on the use of uncertainty are very competitive with those published elsewhere using the same database as the one employed here. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

28. Influence of speaker familiarity on blind and visually impaired children’s and young adults’ perception of synthetic voices.

Author: Pucher, Michael, Zillinger, Bettina, Woltron, Thomas, Toman, Markus, Schabus, Dietmar, Valentini-Botinhao, Cassia, Yamagishi, Junichi, and Schmid, Erich
Subjects: *SPEECH perception in children, *SPEECH synthesis, *FAMILIARITY (Psychology), *BLIND children, *YOUNG adult psychology, *YOUNG adults with disabilities, *PSYCHOLOGY
Abstract: In this paper, we evaluate how speaker familiarity influences the engagement times and performance of blind children and young adults when playing audio games made with different synthetic voices. We also show how speaker familiarity influences speaker and synthetic speech recognition. For the first experiment we develop synthetic voices of school children, their teachers and of speakers that are unfamiliar to them and use each of these voices to create variants of two audio games: a memory game and a labyrinth game. Results show that pupils have significantly longer engagement times and better performance when playing games that use synthetic voices built with their own voices. These findings can be used to improve the design of audio games and lecture books for blind and visually impaired children and young adults. In the second experiment we show that blind children and young adults are better at recognizing synthetic voices than their visually impaired companions. We also show that the average familiarity with a speaker and the similarity between a speaker’s synthetic and natural voice are correlated with the speaker’s synthetic voice recognition rate. [ABSTRACT FROM AUTHOR]
Published: 2017
Full Text: View/download PDF

29. Vocoder-free text-to-speech synthesis incorporating generative adversarial networks using low-/multi-frequency STFT amplitude spectra

Author: Shinnosuke Takamichi, Hiroshi Saruwatari, and Yuki Saito
Subjects: Speech perception, Computer science, Speech recognition, Short-time Fourier transform, Acoustic model, Networking & telecommunications, Speech synthesis, Engineering and technology, Filter (signal processing), Function (mathematics), Natural sciences, Theoretical Computer Science, Human-Computer Interaction, Amplitude, Fourier transform, Computer Science::Sound, Physical sciences, Electrical engineering, electronic engineering, information engineering, Acoustics, Software
Abstract: This paper proposes novel training algorithms for vocoder-free text-to-speech (TTS) synthesis based on generative adversarial networks (GANs) that compensate for short-time Fourier transform (STFT) amplitude spectra in low/multi frequency resolution. Vocoder-free TTS using STFT amplitude spectra can avoid degradation of synthetic speech quality caused by the vocoder-based parameterization used in conventional TTS. Our previous work for the vocoder-based TTS proposed a method for incorporating the GAN-based distribution compensation into acoustic model training to improve synthetic speech quality. This paper extends the algorithm to the vocoder-free TTS and proposes a GAN-based training algorithm using low-frequency-resolution amplitude spectra to overcome the difficulty in modeling the complicated distribution of the high-dimensional spectra. In the proposed algorithm, amplitude spectra are transformed into low-frequency-resolution amplitude spectra by applying an average pooling function along the frequency axis; then the GAN-based distribution compensation is performed in the low-frequency-resolution domain. Because the low-frequency-resolution amplitude spectra approximately emulate filter banks, the proposed algorithm is expected to improve synthetic speech quality by reducing differences in spectral envelopes of natural and synthetic speech. Furthermore, various frequency scales that are related to human speech perception (e.g., mel and inverse mel frequency scales) can be introduced to the proposed training algorithm by applying a frequency warping function to amplitude spectra. This paper also proposes a GAN-based training algorithm using multi-frequency-resolution amplitude spectra that uses both low- and original-frequency-resolution amplitude spectra to reduce the differences in not only spectral envelopes but also fine structures. 
Experimental results demonstrate that (1) GANs using low-frequency-resolution amplitude spectra improve speech quality and work robustly against the settings of the frequency resolution and hyperparameters, (2) in a comparison among low-, original-, and multi-frequency-resolution amplitude spectra, the low-frequency-resolution spectra work best for improving synthetic speech quality, and (3) the use of the inverse mel frequency scale for obtaining low-frequency-resolution amplitude spectra further improves synthetic speech quality.
Published: 2019
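
The average pooling step in the abstract above, which turns amplitude spectra into low-frequency-resolution spectra, can be pictured with the small sketch below. It assumes non-overlapping pooling windows (stride equal to the window size) and simply drops leftover bins at the top of the spectrum; the function name and both choices are illustrative assumptions, since the abstract does not give the pooling configuration.

```python
def average_pool_frequency(spectrogram, pool_size):
    """Average groups of `pool_size` adjacent frequency bins in each frame.

    `spectrogram` is a list of frames, each a list of amplitude values
    ordered by frequency bin. Bins that do not fill a whole window are
    dropped (an assumption of this sketch).
    """
    if pool_size < 1:
        raise ValueError("pool_size must be a positive integer")
    pooled = []
    for frame in spectrogram:
        usable = len(frame) - len(frame) % pool_size
        pooled.append([sum(frame[i:i + pool_size]) / pool_size
                       for i in range(0, usable, pool_size)])
    return pooled


# Toy example: two frames with five bins each, pooled in pairs.
low_res = average_pool_frequency([[1, 3, 5, 7, 9], [2, 2, 4, 4, 6]], 2)
```

Each pooled value is the mean over a band of neighbouring bins, which is why the abstract notes that the low-resolution spectra approximately emulate filter bank outputs.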

30. Directly data-derived articulatory gesture-like representations retain discriminatory information about phone categories.

Author: Ramanarayanan, Vikram, Van Segbroeck, Maarten, and Narayanan, Shrikanth S.
Subjects: *SPEECH, *SPEECH perception, *AUDITORY perception, *ARTICULATION (Speech), *ENUNCIATION, *ORAL communication, *COMPUTATIONAL linguistics, *INTELLIGIBILITY of speech, *HEARING
Abstract: How the speech production and perception systems evolved in humans still remains a mystery today. Previous research suggests that human auditory systems are able, and have possibly evolved, to preserve maximal information about the speaker's articulatory gestures. This paper attempts an initial step toward answering the complementary question of whether speakers’ articulatory mechanisms have also evolved to produce sounds that can be optimally discriminated by the listener's auditory system. To this end we explicitly model, using computational methods, the extent to which derived representations of “primitive movements” of speech articulation can be used to discriminate between broad phone categories. We extract interpretable spatio-temporal primitive movements as recurring patterns in a data matrix of human speech articulation, i.e., representing the trajectories of vocal tract articulators over time. To this end, we propose a weakly-supervised learning method that attempts to find a part-based representation of the data in terms of recurring basis trajectory units (or primitives) and their corresponding activations over time. For each phone interval, we then derive a feature representation that captures the co-occurrences between the activations of the various bases over different time-lags. We show that this feature, derived entirely from activations of these primitive movements, is able to achieve a greater discrimination relative to using conventional features on an interval-based phone classification task. We discuss the implications of these findings in furthering our understanding of speech signal representations and the links between speech production and perception systems. [ABSTRACT FROM AUTHOR]
Published: 2016
Full Text: View/download PDF

31. Comparing human and automatic speech recognition in a perceptual restoration experiment.

Author: Remes, Ulpu, Ramírez López, Ana, Juvela, Lauri, Palomäki, Kalle, Brown, Guy J., Alku, Paavo, and Kurimo, Mikko
Subjects: *AUTOMATIC speech recognition, *SPEECH perception, *SPEECH processing systems, *INTELLIGIBILITY of speech, *CLEAN speech, *MISSING data (Statistics)
Abstract: Speech that has been distorted by introducing spectral or temporal gaps is still perceived as continuous and complete by human listeners, so long as the gaps are filled with additive noise of sufficient intensity. When such perceptual restoration occurs, the speech is also more intelligible compared to the case in which noise has not been added in the gaps. This observation has motivated so-called ‘missing data’ systems for automatic speech recognition (ASR), but there have been few attempts to determine whether such systems are a good model of perceptual restoration in human listeners. Accordingly, the current paper evaluates missing data ASR in a perceptual restoration task. We evaluated two systems that use a new approach to bounded marginalisation in the cepstral domain, and a bounded conditional mean imputation method. Both methods model available speech information as a clean-speech posterior distribution that is subsequently passed to an ASR system. The proposed missing data ASR systems were evaluated using distorted speech, in which spectro-temporal gaps were optionally filled with additive noise. Speech recognition performance of the proposed systems was compared against a baseline ASR system, and with human speech recognition performance on the same task. We conclude that missing data methods improve speech recognition performance in a manner that is consistent with perceptual restoration in human listeners. [ABSTRACT FROM AUTHOR]
Published: 2016
Full Text: View/download PDF

32. Employing distance-based semantics to interpret spoken referring expressions.

Author: Zukerman, Ingrid, Kim, Su Nam, Kleinbauer, Thomas, and Moshtaghi, Masud
Subjects: *EXPRESSIVE behavior, *NUMERICAL analysis, *SPEECH perception, *SEMANTICS, *COMPARATIVE linguistics
Abstract: In this paper, we present Scusi?, an anytime numerical mechanism for the interpretation of spoken referring expressions. Our contributions are: (1) an anytime interpretation process that considers multiple alternatives at different interpretation stages (speech, syntax, semantics and pragmatics), which enables Scusi? to defer decisions to the end of the interpretation process; (2) a mechanism that combines scores associated with the output of the different interpretation stages, taking into account the uncertainty arising from a variety of sources, such as ambiguity or inaccuracy in a description, speech recognition errors and out-of-vocabulary terms; and (3) distance-based functions with probabilistic semantics that represent lexical similarity between objects’ names and similarity between stated requirements and physical properties of objects (viz colour, size and positional relations). We considered two approaches for combining these descriptive attributes, viz multiplicative and additive, and determined whether prioritizing certain interpretation stages and descriptive attributes affects interpretation performance. We conducted two experiments to evaluate different aspects of Scusi?'s performance: Interpretive, where we compared Scusi?'s understanding of descriptions that are mainly ambiguous or inaccurate with people's understanding of these descriptions, and Generative, where we assessed Scusi?'s understanding of naturally occurring spoken descriptions. Our results show that Scusi?'s understanding of the descriptions in the Interpretive trial is comparable to that of people; and that its performance is encouraging when given arbitrary spoken descriptions in diverse scenarios, and excellent for the corresponding written descriptions. In both experiments, Scusi? significantly outperformed a baseline system that maintains only top same-score interpretations. [ABSTRACT FROM AUTHOR]
Published: 2015
Full Text: View/download PDF

33. Capitalising on North American speech resources for the development of a South African English large vocabulary speech recognition system.

Author: Kamper, Herman, de Wet, Febe, Hain, Thomas, and Niesler, Thomas
Subjects: *NORTH American language, *SPEECH perception, *ENGLISH language, *VOCABULARY, *PRONUNCIATION
Abstract: South African English is currently considered an under-resourced variety of English. Extensive speech resources are, however, available for North American (US) English. In this paper we consider the use of these US resources in the development of a South African large vocabulary speech recognition system. Specifically we consider two research questions. Firstly, we determine the performance penalties that are incurred when using US instead of South African language models, pronunciation dictionaries and acoustic models. Secondly, we determine whether US acoustic and language modelling data can be used in addition to the much more limited South African resources to improve speech recognition performance. In the first case we find that using a US pronunciation dictionary or a US language model in a South African system results in fairly small penalties. However, a substantial penalty is incurred when using a US acoustic model. In the second investigation we find that small but consistent improvements over a baseline South African system can be obtained by the additional use of US acoustic data. Larger improvements are obtained when complementing the South African language modelling data with US and/or UK material. We conclude that, when developing resources for an under-resourced variety of English, the compilation of acoustic data should be prioritised, language modelling data has a weaker effect on performance and the pronunciation dictionary the smallest. [ABSTRACT FROM AUTHOR]
Published: 2014
Full Text: View/download PDF

34. Efficient data selection for speech recognition based on prior confidence estimation using speech and monophone models.

Author: Satoshi Kobashikawa, Taichi Asami, Yoshikazu Yamaguchi, Hirokazu Masataki, and Satoshi Takahashi
Subjects: *SPEECH perception, *DATA analysis, *ESTIMATION theory, *MAXIMUM likelihood statistics, *GAUSSIAN mixture models
Abstract: This paper proposes an efficient speech data selection technique that can identify those data that will be well recognized. Conventional confidence measure techniques can also identify well-recognized speech data. However, those techniques require a lot of computation time for speech recognition processing to estimate confidence scores. Speech data with low confidence should not go through the time-consuming recognition process since they will yield erroneous spoken documents that will eventually be rejected. The proposed technique can select the speech data that will be acceptable for speech recognition applications. It rapidly selects speech data with high prior confidence based on acoustic likelihood values and using only speech and monophone models. Experiments show that the proposed confidence estimation technique is over 50 times faster than the conventional posterior confidence measure while providing equivalent data selection performance for speech recognition and spoken document retrieval. [ABSTRACT FROM AUTHOR]
Published: 2014
Full Text: View/download PDF

35. An iterative longest matching segment approach to speech enhancement with additive noise and channel distortion.

Author: Ji Ming and Crookes, Danny
Subjects: *ITERATIVE methods (Mathematics), *MATCHING theory, *IMAGE segmentation, *SPEECH enhancement, *COMPUTER algorithms, *SPEECH perception
Abstract: This paper presents a new approach to speech enhancement from single-channel measurements involving both noise and channel distortion (i.e., convolutional noise), and demonstrates its applications for robust speech recognition and for improving noisy speech quality. The approach is based on finding longest matching segments (LMS) from a corpus of clean, wideband speech. The approach adds three novel developments to our previous LMS research. First, we address the problem of channel distortion as well as additive noise. Second, we present an improved method for modeling noise for speech estimation. Third, we present an iterative algorithm which updates the noise and channel estimates of the corpus data model. In experiments using speech recognition as a test with the Aurora 4 database, the use of our enhancement approach as a preprocessor for feature extraction significantly improved the performance of a baseline recognition system. In another comparison against conventional enhancement algorithms, both the PESQ and the segmental SNR ratings of the LMS algorithm were superior to the other methods for noisy speech enhancement. [ABSTRACT FROM AUTHOR]
Published: 2014
Full Text: View/download PDF

36. Paraphrastic language models.

Author: Liu, X., Gales, M. J. F., and Woodland, P. C.
Subjects: *PARAPHRASE, *NATURAL language processing, *SENTENCES (Grammar), *VOCABULARY, *SPEECH perception, *ERROR rates
Abstract: Natural languages are known for their expressive richness. Many sentences can be used to represent the same underlying meaning. Only modelling the observed surface word sequence can result in poor context coverage and generalization, for example, when using n-gram language models (LMs). This paper proposes a novel form of language model, the paraphrastic LM, that addresses these issues. A phrase level paraphrase model statistically learned from standard text data with no semantic annotation is used to generate multiple paraphrase variants. LM probabilities are then estimated by maximizing their marginal probability. Multi-level language models estimated at both the word level and the phrase level are combined. An efficient weighted finite state transducer (WFST) based paraphrase generation approach is also presented. Significant error rate reductions of 0.5-0.6% absolute were obtained over the baseline n-gram LMs on two state-of-the-art recognition tasks for English conversational telephone speech and Mandarin Chinese broadcast speech using a paraphrastic multi-level LM modelling both word and phrase sequences. When it is further combined with word and phrase level feed-forward neural network LMs, a significant error rate reduction of 0.9% absolute (9% relative) and 0.5% absolute (5% relative) were obtained over the baseline n-gram and neural network LMs respectively. [ABSTRACT FROM AUTHOR]
Published: 2014
Full Text: View/download PDF

37. Automatic glottal inverse filtering with the Markov chain Monte Carlo method.

Author: Auvinen, Harri, Raitio, Tuomo, Airaksinen, Manu, Siltanen, Samuli, Story, Brad H., and Alku, Paavo
Subjects: *GLOTTALIZATION, *MARKOV chain Monte Carlo, *ESTIMATION theory, *SPEECH perception, *APPROXIMATION theory, *VOCAL tract
Abstract: This paper presents a new glottal inverse filtering (GIF) method that utilizes a Markov chain Monte Carlo (MCMC) algorithm. First, initial estimates of the vocal tract and glottal flow are evaluated by an existing GIF method, iterative adaptive inverse filtering (IAIF). Simultaneously, the initially estimated glottal flow is synthesized using the Rosenberg–Klatt (RK) model and filtered with the estimated vocal tract filter to create a synthetic speech frame. In the MCMC estimation process, the first few poles of the initial vocal tract model and the RK excitation parameter are refined in order to minimize the error between the synthetic and original speech signals in the time and frequency domain. MCMC approximates the posterior distribution of the parameters, and the final estimate of the vocal tract is found by averaging the parameter values of the Markov chain. Experiments with synthetic vowels produced by a physical modeling approach show that the MCMC-based GIF method gives more accurate results compared to two known reference methods. [Copyright Elsevier]
Published: 2014
Full Text: View/download PDF

38. Analysis and HMM-based synthesis of hypo and hyperarticulated speech.

Author: Picart, Benjamin, Drugman, Thomas, and Dutoit, Thierry
Subjects: *INTELLIGIBILITY of speech, *SPEECH synthesis, *AUTOMATIC speech recognition, *SPEECH perception, *ACOUSTIC signal processing, *PATTERN recognition systems
Abstract: Hypo and hyperarticulation refer to the production of speech with, respectively, a reduction and an increase of the articulatory efforts compared to the neutral style. Produced consciously or not, these variations of articulatory efforts depend upon the surrounding environment, the communication context and the motivation of the speaker with regard to the listener. The goal of this work is to integrate hypo and hyperarticulation into speech synthesizers, such that they are more realistic by automatically adapting their way of speaking to the contextual situation, like humans do. Based on our preliminary work, this paper provides a thorough and detailed study on the analysis and synthesis of hypo and hyperarticulated speech. It is divided into three parts. In the first one, we focus on both acoustic and phonetic modifications due to articulatory effort changes. The second part aims at developing an HMM-based speech synthesizer allowing a continuous control of the degree of articulation. This requires first tackling the issue of speaking style adaptation to derive hypo and hyperarticulated speech from the neutral synthesizer. Once this is done, an interpolation and extrapolation of the resulting models makes it possible to finely tune the voice so that it is generated with the desired articulatory efforts. Finally, the third and last part focuses on a perceptual study of speech with a variable articulation degree, which analyzes how intelligibility and various other voice dimensions are affected. [Copyright Elsevier]
Published: 2014
Full Text: View/download PDF

39. Animated Lombard speech: Motion capture, facial animation and visual intelligibility of speech produced in adverse conditions.

Author: Alexanderson, Simon and Beskow, Jonas
Subjects: *MOTION capture (Cinematography), *INTELLIGIBILITY of speech, *SPEECH perception, *AUTOMATIC speech recognition, *ACCURACY, *WORD recognition
Abstract: In this paper we study the production and perception of speech in diverse conditions for the purposes of accurate, flexible and highly intelligible talking face animation. We recorded audio, video and facial motion capture data of a talker uttering a set of 180 short sentences, under three conditions: normal speech (in quiet), Lombard speech (in noise), and whispering. We then produced an animated 3D avatar with similar shape and appearance as the original talker and used an error minimization procedure to drive the animated version of the talker in a way that matched the original performance as closely as possible. In a perceptual intelligibility study with degraded audio we then compared the animated talker against the real talker and the audio alone, in terms of audio-visual word recognition rate across the three different production conditions. We found that the visual intelligibility of the animated talker was on par with the real talker for the Lombard and whisper conditions. In addition we created two incongruent conditions where normal speech audio was paired with animated Lombard speech or whispering. When compared to the congruent normal speech condition, Lombard animation yields a significant increase in intelligibility, despite the AV-incongruence. In a separate evaluation, we gathered subjective opinions on the different animations, and found that some degree of incongruence was generally accepted. [Copyright Elsevier]
Published: 2014
Full Text: View/download PDF

40. Joint speaker diarization and speech recognition based on region proposal networks.

Author: Huang, Zili, Delcroix, Marc, Garcia, Leibny Paola, Watanabe, Shinji, Raj, Desh, and Khudanpur, Sanjeev
Subjects: *SPEECH perception, *AUTOMATIC speech recognition, *DATA
Abstract: Speaker diarization, the process of partitioning an input audio stream into homogeneous segments according to the speaker identity, is an important task for speech processing. The standard clustering-based diarization pipeline (1) segments the whole utterance into small chunks, (2) extracts speaker embedding for each chunk, and (3) groups the chunks into clusters, where each cluster represents one speaker. It has two major disadvantages: first, it contains several individually optimized modules in the pipeline, and second, it cannot handle overlapping speech. To address these issues, we proposed region proposal network-based speaker diarization (RPNSD) (Huang et al., 2020). In this paper, we perform a detailed study of the RPNSD system, and make two important contributions. First, we report its diarization performance on additional datasets and empirically investigate the impact of different system settings. Second, we integrate an automatic speech recognition (ASR) component into the RPNSD system and propose a new framework called RPN-JOINT that simultaneously performs diarization and ASR. Our experiments reveal that (1) the RPNSD system can consistently achieve diarization results that are competitive with state-of-the-art methods, and (2) the RPN-JOINT system offers several advantages over the conventional cascade of diarization and ASR systems. • Describes how to apply Region Proposal Network (RPN) to the speaker diarization task. • Shows how a diarization system addresses the overlapping speech problem using RPN. • Proposes a system that jointly performs speaker diarization and speech recognition. • Reports the performance on multiple public datasets with different settings. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

41. Dereverberation of autoregressive envelopes for far-field speech recognition.

Author: Purushothaman, Anurenjan, Sreeram, Anirudh, Kumar, Rohit, and Ganapathy, Sriram
Subjects: *SOUND reverberation, *SPEECH perception, *VOICE recognition software
Abstract: The task of speech recognition in far-field environments is adversely affected by reverberant artifacts that manifest as temporal smearing of the sub-band envelopes. In this paper, we develop a neural model for speech dereverberation using the long-term sub-band envelopes of speech. The sub-band envelopes are derived using frequency domain linear prediction (FDLP), which performs an autoregressive estimation of the Hilbert envelopes. The neural dereverberation model estimates the envelope gain which, when applied to reverberant signals, suppresses the late reflection components in the far-field signal. The dereverberated envelopes are used for feature extraction in speech recognition. Further, the sequence of steps involved in envelope dereverberation, feature extraction and acoustic modeling for ASR can be implemented as a single neural processing pipeline, which allows the joint learning of the dereverberation network and the acoustic model. Several experiments are performed on the REVERB challenge dataset, CHiME-3 dataset and VOiCES dataset. In these experiments, the joint learning of envelope dereverberation and acoustic model yields significant performance improvements over the baseline ASR system based on log-mel spectrogram as well as other past approaches for dereverberation (average relative improvements of 10–24% over the baseline system). A detailed analysis on the choice of hyper-parameters and the cost function involved in envelope dereverberation is also provided. • Deriving a signal model for reverberation effects on sub-band speech envelope. • Dereverberation of the autoregressive estimates of the sub-band envelope using a CLSTM model followed by feature extraction for ASR. • Joint learning of the dereverberation model parameters and the acoustic model for ASR in a single neural pipeline. • Illustrating the performance benefits of the proposed approach for multiple ASR tasks. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

42. ASR-based exercises for listening comprehension practice in European Portuguese

Author: Pellegrini, Thomas, Correia, Rui, Trancoso, Isabel, Baptista, Jorge, Mamede, Nuno, and Eskenazi, Maxine
Subjects: *COMPREHENSION, *PORTUGUESE language, *VOWEL reduction, *SPEECH perception, *CURRICULUM, *TECHNOLOGICAL innovations, *SECOND language acquisition, *SURVEYS
Abstract: Spoken European Portuguese (EP) is known to be difficult to understand for L2 learners, due to phenomena such as strong vowel reduction. In this paper, we present a method to automatically generate exercises aimed at improving listening comprehension skills in EP. Learners identify the words pronounced in real speech utterances. The exercises introduce two innovative aspects: using broadcast news videos for curriculum and automatically generating exercises with material updated on a daily basis. The videos are automatically transcribed by a speech recognition engine. A filtering chain, used to select appropriate sentences, was validated by a first survey comprising both manually and automatically selected sentences. Both sets were assigned good to very good subjective quality scores. A second survey concerned the features of the exercise interface. Subjects with varying self-reported exposure to Portuguese as a second language tested several interfaces and functionalities and highlighted their preferred features. The results confirmed that the largest difficulty was the fast speech rate. All participants valued slowed-down audio and video documents, though this feature was more often used by the lowest proficiency subjects. The exercises were integrated into a Web platform where they are automatically updated daily. Though further evaluation is needed to find whether the platform affords skill acquisition, it is expected to be particularly valuable for distance learners who need opportunities to access authentic audio documents in EP. [Copyright Elsevier]
Published: 2013
Full Text: View/download PDF

43. Improved automatic detection of creak

Author: Kane, John, Drugman, Thomas, and Gobl, Christer
Subjects: *COMPUTER algorithms, *SPEECH perception, *SIGNAL processing, *PREDICTION models, *DATABASES, *ROBUST control, *SIGNAL-to-noise ratio, *COMPUTER networks
Abstract: This paper describes a new algorithm for automatically detecting creak in speech signals. Detection is made by utilising two new acoustic parameters which are designed to characterise creaky excitations following previous evidence in the literature combined with new insights from observations in the current work. In particular the new method focuses on features in the Linear Prediction (LP) residual signal including the presence of secondary peaks as well as prominent impulse-like excitation peaks. These parameters are used as input features to a decision tree classifier for identifying creaky regions. The algorithm was evaluated on a range of read and conversational speech databases and was shown to clearly outperform the state-of-the-art. Further experiments involving degradations of the speech signal demonstrated robustness to both white and babble noise, providing better results than the state-of-the-art down to at least 20 dB signal-to-noise ratio. [Copyright Elsevier]
Published: 2013
Full Text: View/download PDF

44. Multiple cameras for audio-visual speech recognition in an automotive environment

Author: Navarathna, Rajitha, Dean, David, Sridharan, Sridha, and Lucey, Patrick
Subjects: *SPEECH perception, *MOTOR vehicles, *LIPREADING, *CAMERAS, *NOISE, *ORATORS, *HIDDEN Markov models, *WORD recognition
Abstract: Audio-visual speech recognition, or the combination of visual lip-reading with traditional acoustic speech recognition, has been previously shown to provide a considerable improvement over acoustic-only approaches in noisy environments, such as that present in an automotive cabin. The research presented in this paper extends the established audio-visual speech recognition literature to show that further improvements in speech recognition accuracy can be obtained when multiple frontal or near-frontal views of a speaker's face are available. A series of visual speech recognition experiments using a four-stream visual synchronous hidden Markov model (SHMM) are conducted on the four-camera AVICAR automotive audio-visual speech database. We study the relative contribution between the side and central orientated cameras in improving visual speech recognition accuracy. Finally, the combination of the four visual streams with a single audio stream in a five-stream SHMM demonstrates a relative improvement of over 56% in word recognition accuracy when compared to the acoustic-only approach in the noisiest conditions of the AVICAR database. [Copyright Elsevier]
Published: 2013
Full Text: View/download PDF

45. Speech recognition in living rooms: Integrated speech enhancement and recognition system based on spatial, spectral and temporal modeling of sounds

Author: Delcroix, Marc, Kinoshita, Keisuke, Nakatani, Tomohiro, Araki, Shoko, Ogawa, Atsunori, Hori, Takaaki, Watanabe, Shinji, Fujimoto, Masakiyo, Yoshioka, Takuya, Oba, Takanobu, Kubo, Yotaro, Souden, Mehrez, Hahm, Seong-Jun, and Nakamura, Atsushi
Subjects: *SPEECH perception, *SPATIAL systems, *STATIONARY processes, *LIVING rooms, *SPEECH processing systems, *PATTERN recognition systems, *SPECTRAL theory, *ACOUSTIC models, *REGRESSION analysis
Abstract: Research on noise robust speech recognition has mainly focused on dealing with relatively stationary noise that may differ from the noise conditions in most living environments. In this paper, we introduce a recognition system that can recognize speech in the presence of multiple rapidly time-varying noise sources as found in a typical family living room. To deal with such severe noise conditions, our recognition system exploits all available information about speech and noise; that is spatial (directional), spectral and temporal information. This is realized with a model-based speech enhancement pre-processor, which consists of two complementary elements, a multi-channel speech–noise separation method that exploits spatial and spectral information, followed by a single channel enhancement algorithm that uses the long-term temporal characteristics of speech obtained from clean speech examples. Moreover, to compensate for any mismatch that may remain between the enhanced speech and the acoustic model, our system employs an adaptation technique that combines conventional maximum likelihood linear regression with the dynamic adaptive compensation of the variance of the Gaussians of the acoustic model. Our proposed system approaches human performance levels by greatly improving the audible quality of speech and substantially improving the keyword recognition accuracy. [Copyright Elsevier]
Published: 2013
Full Text: View/download PDF

46. BBN TransTalk: Robust multilingual two-way speech-to-speech translation for mobile platforms

Author: Prasad, Rohit, Natarajan, Prem, Stallard, David, Saleem, Shirin, Ananthakrishnan, Shankar, Tsakalidis, Stavros, Kao, Chia-lin, Choi, Fred, Meermeier, Ralf, Rawls, Mark, Devlin, Jacob, Krstovski, Kriste, and Challenner, Aaron
Subjects: *ROBUST control, *MULTILINGUAL communication, *SPEECH, *SPEECH perception, *TEXT-to-speech software, *MACHINE translating, *AMBIGUITY, *PRONUNCIATION
Abstract: In this paper we present a speech-to-speech (S2S) translation system called the BBN TransTalk that enables two-way communication between speakers of English and speakers who do not understand or speak English. The BBN TransTalk has been configured for several languages including Iraqi Arabic, Pashto, Dari, Farsi, Malay, Indonesian, and Levantine Arabic. We describe the key components of our system: automatic speech recognition (ASR), machine translation (MT), text-to-speech (TTS), dialog manager, and the user interface (UI). In addition, we present novel techniques for overcoming specific challenges in developing high-performing S2S systems. For ASR, we present techniques for dealing with lack of pronunciation and linguistic resources and effective modeling of ambiguity in pronunciations of words in these languages. For MT, we describe techniques for dealing with data sparsity as well as modeling context. We also present and compare different user confirmation techniques for detecting errors that can cause the dialog to drift or stall. [Copyright Elsevier]
Published: 2013
Full Text: View/download PDF

47. Personalising speech-to-speech translation: Unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis

Author: Dines, John, Liang, Hui, Saheer, Lakshmi, Gibson, Matthew, Byrne, William, Oura, Keiichiro, Tokuda, Keiichi, Yamagishi, Junichi, King, Simon, Wester, Mirjam, Hirsimäki, Teemu, Karhila, Reima, and Kurimo, Mikko
Subjects: *TRANSLATIONS, *SPEECH perception, *ALGORITHMS, *STATISTICS, *ORATORS, *HIDDEN Markov models, *SIMILARITY (Language learning), *AUDITORY perception
Abstract: In this paper we present results of unsupervised cross-lingual speaker adaptation applied to text-to-speech synthesis. The application of our research is the personalisation of speech-to-speech translation in which we employ an HMM statistical framework for both speech recognition and synthesis. This framework provides a logical mechanism to adapt synthesised speech output to the voice of the user by way of speech recognition. In this work we present results of several different unsupervised and cross-lingual adaptation approaches as well as an end-to-end speaker adaptive speech-to-speech translation system. Our experiments show that we can successfully apply speaker adaptation in both unsupervised and cross-lingual scenarios and our proposed algorithms seem to generalise well for several language pairs. We also discuss important future directions including the need for better evaluation metrics. [Copyright Elsevier]
Published: 2013
Full Text: View/download PDF

48. Stereo hidden Markov modeling for noise robust speech recognition

Author: Cui, Xiaodong, Afify, Mohamed, Gao, Yuqing, and Zhou, Bowen
Subjects: *MARKOV processes, *SPEECH perception, *ROBUST control, *GAUSSIAN mixture models, *COMPARATIVE studies, *VOCABULARY, *TRANSLATING & interpreting, *PERFORMANCE
Abstract: This paper investigates a noise robust technique for automatic speech recognition which exploits hidden Markov modeling of stereo speech features from clean and noisy channels. The HMM trained this way, referred to as stereo HMM, has in each state a Gaussian mixture model (GMM) with a joint distribution of both clean and noisy speech features. Given the noisy speech input, the stereo HMM gives rise to a two-pass compensation and decoding process where MMSE denoising based on N-best hypotheses is first performed, followed by decoding the denoised speech in a reduced search space on the lattice. Compared to the feature space GMM-based denoising approaches, the stereo HMM is advantageous as it has finer-grained noise compensation and makes use of information from the whole noisy feature sequence for the prediction of each individual clean feature. Experiments on large vocabulary spontaneous speech from speech-to-speech translation applications show that the proposed technique outperforms its feature space counterpart in noisy conditions while still maintaining decent performance in clean conditions. [Copyright Elsevier]
Published: 2013
Full Text: View/download PDF

49. A monotonic statistical machine translation approach to speaking style transformation

Author: Neubig, Graham, Akita, Yuya, Mori, Shinsuke, and Kawahara, Tatsuya
Subjects: *MACHINE translating, *SPEECH, *TRANSCRIPTION (Linguistics), *STENOGRAPHERS, *PUNCTUATION, *COLLOQUIAL language, *TRANSDUCERS, *SPEECH perception
Abstract: This paper presents a method for automatically transforming faithful transcripts or ASR results into clean transcripts for human consumption using a framework we label speaking style transformation (SST). We perform a detailed analysis of the types of corrections performed by human stenographers when creating clean transcripts, and propose a model that is able to handle the majority of the most common corrections. In particular, the proposed model uses a framework of monotonic statistical machine translation to perform not only the deletion of disfluencies and insertion of punctuation, but also correction of colloquial expressions, insertions of omitted words, and other transformations. We provide a detailed description of the model implementation in the weighted finite state transducer (WFST) framework. An evaluation of the proposed model on both faithful transcripts and speech recognition results of parliamentary and lecture speech demonstrates the effectiveness of the proposed model in performing the wide variety of corrections necessary for creating clean transcripts. [Copyright Elsevier]
Published: 2012
Full Text: View/download PDF

50. A segmental non-parametric-based phoneme recognition approach at the acoustical level

Author: Golipour, Ladan and O’Shaughnessy, Douglas
Subjects: *HIDDEN Markov models, *SPEECH perception, *LINGUISTICS, *INFORMATION & communication technologies, *ESTIMATION theory, *APPROXIMATION theory, *PHONEME (Linguistics), *GRAPH theory
Abstract: Although Hidden Markov Models (HMMs) are still the mainstream approach towards speech recognition, their intrinsic limitations, such as the first-order Markov models in use or the assumption of independent and identically distributed frames, lead to the extensive use of higher level linguistic information to produce satisfactory results. Therefore, researchers began investigating the incorporation of various discriminative techniques at the acoustical level to induce more discrimination between speech units. As is known, the k-nearest neighbour (k-NN) density estimation is discriminant by nature and is widely used in the pattern recognition field. However, its application to speech recognition has been limited to a few experiments. In this paper, we introduce a new segmental k-NN-based phoneme recognition technique. In this approach, a group-delay-based method generates phoneme boundary hypotheses, and an approximate version of k-NN density estimation is used for the classification and scoring of variable-length segments. During the decoding, the construction of the phonetic graph starts from the best phoneme boundary setting and progresses through splitting and merging segments using the remaining boundary hypotheses and constraints such as phoneme duration and broad-class similarity information. To perform the k-NN search, we take advantage of a similarity search algorithm called Spatial Approximate Sample Hierarchy (SASH). One major advantage of the SASH algorithm is that its computational complexity is independent of the dimensionality of the data. This allows us to use high-dimensional feature vectors to represent phonemes. By using phonemes as units of speech, the search space is very limited and the decoding process fast. Evaluation of the proposed algorithm with the sole use of the best hypothesis for every segment and excluding phoneme transitional probabilities, context-based, and language model information results in an accuracy of 58.5% with correctness of 67.8% on the TIMIT test dataset. [Copyright Elsevier]
Published: 2012
Full Text: View/download PDF
