Author: "Daniel Povey" - Searchworks@Jio Institute Digital Library Search Results

251. Advances in Arabic Speech Transcription at IBM Under the DARPA GALE Program

Author: Ahmad Emami, Hong-Kwang Kuo, Hagen Soltau, Lidia Mangu, George Saon, Brian Kingsbury, and Daniel Povey
Subjects: Acoustics and Ultrasonics, Artificial neural network, Machine translation, business.industry, Computer science, Speech recognition, Decision tree, Word error rate, computer.software_genre, Data modeling, Linguistic Data Consortium, Unsupervised learning, Artificial intelligence, Language model, Electrical and Electronic Engineering, business, computer, Natural language processing
Abstract: This paper describes the Arabic broadcast transcription system fielded by IBM in the GALE Phase 2.5 machine translation evaluation. Key advances include the use of additional training data from the Linguistic Data Consortium (LDC), use of a very large vocabulary comprising 737 K words and 2.5 M pronunciation variants, automatic vowelization using flat-start training, cross-adaptation between unvowelized and vowelized acoustic models, and rescoring with a neural-network language model. The resulting system achieves word error rates below 10% on Arabic broadcasts. Very large scale experiments with unsupervised training demonstrate that the utility of unsupervised data depends on the amount of supervised data available. While unsupervised training improves system performance when a limited amount (135 h) of supervised data is available, these gains disappear when a greater amount (848 h) of supervised data is used, even with a very large (7069 h) corpus of unsupervised data. We also describe a method for modeling Arabic dialects that avoids the problem of data sparseness entailed by dialect-specific acoustic models via the use of non-phonetic, dialect questions in the decision trees. We show how this method can be used with a statically compiled decoding graph by partitioning the decision trees into a static component and a dynamic component, with the dynamic component being replaced by a mapping that is evaluated at run-time.
Published: 2009
Full Text: View/download PDF

252. Audio augmentation for speech recognition

Author: Daniel Povey, Tom Ko, Vijayaditya Peddinti, and Sanjeev Khudanpur
Subjects: Audio signal, Voice activity detection, Robustness (computer science), Computer science, Speech recognition, Speech technology, Acoustic model, Overfitting, Speech processing
Abstract: Data augmentation is a common strategy adopted to increase the quantity of training data, avoid overfitting and improve robustness of the models. In this paper, we investigate audio-level speech augmentation methods which directly process the raw signal. The method we particularly recommend is to change the speed of the audio signal, producing 3 versions of the original signal with speed factors of 0.9, 1.0 and 1.1. The proposed technique has a low implementation cost, making it easy to adopt. We present results on 4 different LVCSR tasks with training data ranging from 100 hours to 1000 hours, to examine the effectiveness of audio augmentation in a variety of data scenarios. An average relative improvement of 4.3% was observed across the 4 tasks.
Published: 2015
Full Text: View/download PDF
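
Illustrative note: the three-way speed perturbation described in this abstract can be sketched as follows. The abstract does not say how the audio is resampled, so the linear interpolation used here is an assumption chosen for simplicity, and the function names `change_speed` and `three_way_speed_perturb` are hypothetical. A production pipeline would more likely rely on a dedicated resampling tool.

```python
def change_speed(samples, factor):
    """Return `samples` played `factor` times faster.

    Assumption: plain linear interpolation between neighbouring samples;
    the abstract does not specify the resampling method.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_out = int(len(samples) / factor)
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * factor              # fractional read position in the input
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


def three_way_speed_perturb(samples, factors=(0.9, 1.0, 1.1)):
    """Produce one copy of the signal per speed factor (0.9, 1.0 and 1.1 by default)."""
    return {factor: change_speed(samples, factor) for factor in factors}
```

A factor of 0.9 slows the signal down and therefore lengthens it, while 1.1 shortens it; the 1.0 copy is the original signal.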

253. Reverberation robust acoustic modeling using i-vectors with time delay neural networks

Author: Daniel Povey, Sanjeev Khudanpur, Guoguo Chen, and Vijayaditya Peddinti
Subjects: Reverberation, Artificial neural network, Computer science, Time delay neural network, Test set, Speech recognition, Word error rate, Invariant (mathematics)
Abstract: In reverberant environments there are long term interactions between speech and corrupting sources. In this paper a time delay neural network (TDNN) architecture, capable of learning long term temporal relationships and translation invariant representations, is used for reverberation robust acoustic modeling. Further, iVectors are used as an input to the neural network to perform instantaneous speaker and environment adaptation, providing 10% relative improvement in word error rate. By subsampling the outputs at TDNN layers across time steps, training time is reduced. Using a parallel training algorithm we show that the TDNN can be trained on ∼ 5500 hours of speech data in 3 days using up to 32 GPUs. The TDNN is shown to provide results competitive with state of the art systems in the IARPA ASpIRE challenge, with 27.7% WER on the dev test set.
Published: 2015
Full Text: View/download PDF

254. A diversity-penalizing ensemble training method for deep learning

Author: Daniel Povey, Sanjeev Khudanpur, and Xiaohui Zhang
Subjects: ComputingMethodologies_PATTERNRECOGNITION, Artificial neural network, business.industry, Computer science, Deep learning, Artificial intelligence, business, Training methods, Machine learning, computer.software_genre, computer, Diversity (business)
Abstract: A common way to improve the performance of deep learning is to train an ensemble of neural networks and combine them during decoding. However, this is computationally expensive at test time. In this paper, we propose a diversity-penalizing ensemble training (DPET) procedure, which trains an ensemble of DNNs, whose parameters were differently initialized, and penalizes differences between each individual DNN’s output and their average output. This way each model learns to emulate the average of the whole ensemble of models, and at test time we can use one arbitrarily chosen member of the ensemble. Experimental results on a variety of speech recognition tasks show that this technique is effective, and gives us most of the WER improvement of the ensemble method while being no more expensive at test time than using a single model.
Published: 2015
Full Text: View/download PDF
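
Illustrative note: the penalty on differences between each network's output and the ensemble average can be sketched as below. The abstract does not give the exact form of the penalty or how it is weighted against the training objective, so the squared difference used here is an assumption, and `ensemble_average` and `diversity_penalties` are hypothetical names.

```python
def ensemble_average(outputs):
    """Element-wise mean of the output vectors of all ensemble members."""
    n_models = len(outputs)
    dim = len(outputs[0])
    return [sum(o[k] for o in outputs) / n_models for k in range(dim)]


def diversity_penalties(outputs):
    """Per-model penalty for deviating from the ensemble average.

    Assumption: squared Euclidean distance; the abstract only says that
    differences from the average output are penalized.
    """
    avg = ensemble_average(outputs)
    return [sum((o[k] - avg[k]) ** 2 for k in range(len(avg))) for o in outputs]
```

When every member already produces the same output, all penalties are zero, which is the situation in which any single member can stand in for the ensemble.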

255. Modeling phonetic context with non-random forests for speech recognition

Author: Daniel Povey, Hainan Xu, Guoguo Chen, and Sanjeev Khudanpur
Subjects: Computer science, Entropy (statistical thermodynamics), business.industry, Speech recognition, Decision tree, computer.software_genre, Random forest, Entropy (classical thermodynamics), Entropy (information theory), Artificial intelligence, Entropy (energy dispersal), business, Entropy (arrow of time), computer, Natural language processing, Entropy (order and disorder)
Abstract: Modern speech recognition systems typically cluster triphone phonetic contexts using decision trees. In this paper we describe a way to build multiple complementary decision trees from the same data, for the purpose of system combination. We do this by jointly building the decision trees using an objective function that has an added entropy term to encourage diversity among the decision trees. After the trees are built, the systems are built in the standard way and the emission probabilities are combined during decoding. Experiments on multiple datasets show gains from the use of multiple trees, at the expense of evaluating multiple models in test time.
Published: 2015
Full Text: View/download PDF

256. Semi-supervised maximum mutual information training of deep neural network acoustic models

Author: Daniel Povey, Vimal Manohar, and Sanjeev Khudanpur
Subjects: Conditional entropy, Artificial neural network, Discriminative model, Computer science, business.industry, Speech recognition, Pattern recognition, Mutual information, Artificial intelligence, Transcription (software), business, Weighting
Abstract: Maximum Mutual Information (MMI) is a popular discriminative criterion that has been used in supervised training of acoustic models for automatic speech recognition. However, standard discriminative training is very sensitive to the accuracy of the transcription and hence its implementation in a semi-supervised setting requires extensive filtering of data. We will show that if the supervision transcripts are not known, the natural analogue of MMI is to minimize the conditional entropy of the lattice of possible transcripts of the data. This is equivalent to the weighted average of the MMI criterion over different reference transcripts, taking those reference transcripts and their weighting from the lattice itself. In this paper we describe experiments where we applied this method to the semi-supervised training of Deep Neural Network acoustic models. In our experimental setup, the proposed method gives up to 0.5% absolute WER improvement over a DNN trained with sMBR only on the transcribed part of the data. This is 37% of the improvement that we would get from doing sMBR training if we had the transcripts for the untranscribed part of the data.
Published: 2015
Full Text: View/download PDF
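
Illustrative note: the quantity being minimized, the conditional entropy over possible transcripts, can be sketched on a toy example. The assumption here is that the lattice has been flattened into an explicit list of path log-scores; a real system would compute the same quantity with forward-backward passes over the lattice rather than by enumerating paths. The names `path_posteriors` and `conditional_entropy` are hypothetical.

```python
import math


def path_posteriors(log_scores):
    """Normalize path log-scores into posterior probabilities (stable softmax)."""
    top = max(log_scores)
    exps = [math.exp(s - top) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]


def conditional_entropy(log_scores):
    """Entropy, in nats, of the posterior distribution over candidate transcripts."""
    return -sum(p * math.log(p) for p in path_posteriors(log_scores) if p > 0.0)
```

The entropy is zero when a single transcript carries all the posterior mass and grows as the mass spreads over competing transcripts.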

257. Librispeech: An ASR corpus based on public domain audio books

Author: Daniel Povey, Sanjeev Khudanpur, Vassil Panayotov, and Guoguo Chen
Subjects: Computer science, business.industry, Speech recognition, Word error rate, Speech corpus, computer.file_format, computer.software_genre, VoxForge, Scripting language, Language model, Artificial intelligence, RDF, business, computer, Natural language processing
Abstract: This paper introduces a new corpus of read English speech, suitable for training and evaluating speech recognition systems. The LibriSpeech corpus is derived from audiobooks that are part of the LibriVox project, and contains 1000 hours of speech sampled at 16 kHz. We have made the corpus freely available for download, along with separately prepared language-model training data and pre-built language models. We show that acoustic models trained on LibriSpeech give lower error rate on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself. We are also releasing Kaldi scripts that make it easy to build these systems.
Published: 2015
Full Text: View/download PDF

258. A Coarse-Grained Model for Optimal Coupling of ASR and SMT Systems for Speech Translation

Author: Jan Trmal, Sanjeev Khudanpur, Gaurav Kumar, Graeme Blackwood, and Daniel Povey
Subjects: Machine translation, Computer science, business.industry, Speech recognition, Interface (computing), InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, Automatic translation, Word error rate, Translation (geometry), computer.software_genre, ComputingMethodologies_ARTIFICIALINTELLIGENCE, Coupling (computer programming), Speech translation, Language model, Artificial intelligence, business, computer, Natural language processing
Abstract: Speech translation is conventionally carried out by cascading an automatic speech recognition (ASR) and a statistical machine translation (SMT) system. The hypotheses chosen for translation are based on the ASR system’s acoustic and language model scores, and typically optimized for word error rate, ignoring the intended downstream use: automatic translation. In this paper, we present a coarse-to-fine model that uses features from the ASR and SMT systems to optimize this coupling. We demonstrate that several standard features utilized by ASR and SMT systems can be used in such a model at the speech-translation interface, and we provide empirical results on the Fisher Spanish-English speech translation corpus.
Published: 2015
Full Text: View/download PDF

259. Advances in speech transcription at IBM under the DARPA EARS program

Author: Lidia Mangu, George Saon, Geoffrey Zweig, Hagen Soltau, Stanley F. Chen, Brian Kingsbury, and Daniel Povey
Subjects: Acoustics and Ultrasonics, business.industry, Computer science, Speech recognition, Word error rate, Context (language use), Speech synthesis, Machine learning, computer.software_genre, Speech processing, Test set, Systems architecture, Artificial intelligence, Electrical and Electronic Engineering, IBM, business, computer, Test data
Abstract: This paper describes the technical and system building advances made in IBM's speech recognition technology over the course of the Defense Advanced Research Projects Agency (DARPA) Effective Affordable Reusable Speech-to-Text (EARS) program. At a technical level, these advances include the development of a new form of feature-based minimum phone error training (fMPE), the use of large-scale discriminatively trained full-covariance Gaussian models, the use of septaphone acoustic context in static decoding graphs, and improvements in basic decoding algorithms. At a system building level, the advances include a system architecture based on cross-adaptation and the incorporation of 2100 h of training data in every system component. We present results on English conversational telephony test data from the 2003 and 2004 NIST evaluations. The combination of technical advances and an order of magnitude more training data in 2004 reduced the error rate on the 2003 test set by approximately 21% relative (from 20.4% to 16.1%) over the most accurate system in the 2003 evaluation, and produced the most accurate results on the 2004 test sets in every speed category.
Published: 2006
Full Text: View/download PDF

260. Automatic transcription of conversational telephone speech

Author: L. Wang, Xunying Liu, M.J.F. Gales, G.L. Moore, Daniel Povey, P.C. Woodland, Thomas Hain, and G. Evermann
Subjects: Acoustics and Ultrasonics, business.industry, Computer science, media_common.quotation_subject, Speech recognition, Speech coding, Word error rate, Pronunciation, computer.software_genre, Conversation, Computer Vision and Pattern Recognition, Artificial intelligence, Language model, Electrical and Electronic Engineering, Transcription (software), business, Hidden Markov model, computer, Software, Natural language processing, Natural language, media_common
Abstract: This paper discusses the Cambridge University HTK (CU-HTK) system for the automatic transcription of conversational telephone speech. A detailed discussion of the most important techniques in front-end processing, acoustic modeling and model training, and language and pronunciation modeling is presented. These include the use of conversation side based cepstral normalization, vocal tract length normalization, heteroscedastic linear discriminant analysis for feature projection, minimum phone error training and speaker adaptive training, lattice-based model adaptation, confusion network based decoding and confidence score estimation, pronunciation selection, language model interpolation, and class based language models. The transcription system developed for participation in the 2002 NIST Rich Transcription evaluations of English conversational telephone speech data is presented in detail. In this evaluation the CU-HTK system gave an overall word error rate of 23.9%, which was the best performance by a statistically significant margin. Further details on the derivation of faster systems with moderate performance degradation are discussed in the context of the 2002 CU-HTK 10×RT conversational speech transcription system.
Published: 2005
Full Text: View/download PDF

261. Removing redundancy from lattices

Author: Pegah Ghahremani, Hagen Soltau, Lidia Mangu, Hermann Ney, Daniel Povey, and David Nolden
Subjects: Theoretical computer science, Computer science, Redundancy (engineering)
Published: 2014
Full Text: View/download PDF

262. Combination of FST and CN search in spoken term detection

Author: Yun Wang, Guoguo Chen, Daniel Povey, Jan Trmal, Justin T. Chiu, and Alexander I. Rudnicky
Subjects: FOS: Computer and information sciences, Phrase, business.industry, Computer science, Pipeline (computing), Speech recognition, 89999 Information and Computing Sciences not elsewhere classified, Artificial intelligence, computer.software_genre, business, computer, Natural language processing, Term (time)
Abstract: Spoken Term Detection (STD) focuses on finding instances of a particular spoken word or phrase in an audio corpus. Most STD systems have a two-step pipeline: ASR followed by search. Two approaches to search are common: Confusion Network (CN) based search and Finite State Transducer (FST) based search. In this paper, we examine the combination of these two different search approaches, using the same ASR output. We find that the CN search performs better on shorter queries, and FST search performs better on longer queries. By combining the different search results from the same ASR decoding, we achieve better performance compared to either search approach on its own. We also find that this improvement is additive to the usual combination of decoder results using different modeling techniques.
Published: 2014
Full Text: View/download PDF

263. Improving deep neural network acoustic models using generalized maxout networks

Author: Jan Trmal, Xiaohui Zhang, Daniel Povey, and Sanjeev Khudanpur
Subjects: Normalization (statistics), Vocabulary, Artificial neural network, Computer science, Generalization, business.industry, media_common.quotation_subject, Speech recognition, Pattern recognition, Rectifier (neural networks), Nonlinear system, Artificial intelligence, Hidden Markov model, business, media_common
Abstract: Recently, maxout networks have brought significant improvements to various speech recognition and computer vision tasks. In this paper we introduce two new types of generalized maxout units, which we call p-norm and soft-maxout. We investigate their performance in Large Vocabulary Continuous Speech Recognition (LVCSR) tasks in various languages with 10 hours and 60 hours of data, and find that the p-norm generalization of maxout consistently performs well. Because, in our training setup, we sometimes see instability during training when training unbounded-output nonlinearities such as these, we also present a method to control that instability. This is the “normalization layer”, which is a nonlinearity that scales down all dimensions of its input in order to stop the average squared output from exceeding one. The performance of our proposed nonlinearities is compared with maxout, rectified linear units (ReLU), tanh units, and also with a discriminatively trained SGMM/HMM system, and our p-norm units with p equal to 2 are found to perform best.
Published: 2014
Full Text: View/download PDF
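
Illustrative note: the two nonlinearities named in this abstract can be sketched as follows. The abstract states only that p-norm generalizes maxout and that the normalization layer scales its input down so that the average squared output does not exceed one; grouping consecutive inputs and dividing by the root mean square are assumptions made for this sketch, and `pnorm_layer` and `normalization_layer` are hypothetical names.

```python
import math


def pnorm_layer(x, group_size, p=2.0):
    """Collapse each group of `group_size` inputs into its p-norm.

    Assumption: groups are formed from consecutive, non-overlapping inputs.
    """
    if group_size <= 0 or len(x) % group_size != 0:
        raise ValueError("len(x) must be a positive multiple of group_size")
    return [
        sum(abs(v) ** p for v in x[i:i + group_size]) ** (1.0 / p)
        for i in range(0, len(x), group_size)
    ]


def normalization_layer(x):
    """Scale `x` down only when its mean squared value would exceed one."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    if rms <= 1.0:
        return list(x)          # already within bounds: leave unchanged
    return [v / rms for v in x]
```

With p equal to 2, the setting the abstract reports as performing best, each output is the Euclidean length of its input group.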

264. A keyword search system using open source software

Author: David Yarowsky, Florian Metze, Chunxi Liu, Aren Jansen, Dietrich Klakow, Daniel Povey, Vimal Manohar, Guoguo Chen, Jan Trmal, Xiaohui Zhang, Sanjeev Khudanpur, and Pegah Ghahremani
Subjects: FOS: Computer and information sciences, Keyword search, Computer science, business.industry, Open source software, computer.software_genre, Pipeline (software), Systems architecture, Deep neural networks, 89999 Information and Computing Sciences not elsewhere classified, Artificial intelligence, business, computer, Natural language processing
Abstract: Provides an overview of a speech-to-text (STT) and keyword search (KWS) system architecture built primarily on top of the Kaldi toolkit and expands on a few highlights. The system was developed as a part of the research efforts of the Radical team while participating in the IARPA Babel program. Our aim was to develop a general system pipeline which could be easily and rapidly deployed in any language, independently of the language script and the phonological and linguistic features of the language.
Published: 2014
Full Text: View/download PDF

265. Using proxies for OOV keywords in the keyword search task

Author: Oguz Yilmaz, Guoguo Chen, Jan Trmal, Sanjeev Khudanpur, and Daniel Povey
Subjects: Vocabulary, Computer science, business.industry, Speech recognition, media_common.quotation_subject, InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, Confusion matrix, Pronunciation, computer.software_genre, Lexicon, Index (publishing), Phone, NIST, Artificial intelligence, business, computer, Natural language processing, Word (computer architecture), media_common
Abstract: We propose a simple but effective weighted finite state transducer (WFST) based framework for handling out-of-vocabulary (OOV) keywords in a speech search task. State-of-the-art large vocabulary continuous speech recognition (LVCSR) and keyword search (KWS) systems are developed for conversational telephone speech in Tagalog. Word-based and phone-based indexes are created from word lattices, the latter by using the LVCSR system's pronunciation lexicon. Pronunciations of OOV keywords are hypothesized via a standard grapheme-to-phoneme method. In-vocabulary proxies (word or phone sequences) are generated for each OOV keyword using WFST techniques that permit incorporation of a phone confusion matrix. Empirical results when searching for the Babel/NIST evaluation keywords in the Babel 10 hour development-test speech collection show that (i) searching for word proxies in the word index significantly outperforms searching for phonetic representations of OOV words in a phone index, and (ii) while phone confusion information yields minor improvement when searching a phone index, it yields up to 40% improvement in actual term weighted value when searching a word index with word proxies.
Published: 2013
Full Text: View/download PDF

266. Revisiting semi-continuous hidden Markov models

Author: Daniel Povey, Korbinian Riedhammer, Arnab Ghoshal, and Tobias Bocklet
Subjects: Signal processing, symbols.namesake, Training set, Computer science, Estimation theory, Speech recognition, symbols, Acoustic model, Hidden Markov model, Gaussian process, Data modeling
Published: 2012
Full Text: View/download PDF

267. Revisiting Recurrent Neural Networks for robust ASR

Author: Daniel Povey, Oriol Vinyals, and Suman V. Ravuri
Subjects: Artificial neural network, Time delay neural network, business.industry, Computer science, Speech recognition, Deep learning, Computer Science::Neural and Evolutionary Computation, Markov process, Context (language use), Perceptron, symbols.namesake, Recurrent neural network, symbols, State space, Artificial intelligence, Hidden Markov model, business
Abstract: In this paper, we show how new training principles and optimization techniques for neural networks can be used for different network structures. In particular, we revisit the Recurrent Neural Network (RNN), which explicitly models the Markovian dynamics of a set of observations through a non-linear function with a much larger hidden state space than traditional sequence models such as an HMM. We apply pretraining principles used for Deep Neural Networks (DNNs) and second-order optimization techniques to train an RNN. Moreover, we explore its application in the Aurora2 speech recognition task under mismatched noise conditions using a Tandem approach. We observe top performance on clean speech, and under high noise conditions, compared to multi-layer perceptrons (MLPs) and DNNs, with the added benefit of being a “deeper” model than an MLP but more compact than a DNN.
Published: 2012
Full Text: View/download PDF

268. Generating exact lattices in the WFST framework

Author: Korbinian Riedhammer, Mirko Hannemann, Lukas Burget, Daniel Povey, Milos Janda, Ngoc Thang Vu, Stefan Kombrink, Gilles Boulianne, Petr Motlicek, Yanmin Qian, Martin Karafiat, Karel Vesely, and Arnab Ghoshal
Subjects: symbols.namesake, Theoretical computer science, Lattice (order), symbols, dblp, Hidden Markov model, Viterbi algorithm, Decoding methods, Mathematics
Abstract: We describe a lattice generation method that is exact, i.e. it satisfies all the natural properties we would want from a lattice of alternative transcriptions of an utterance. This method does not introduce substantial overhead above one-best decoding. Our method is most directly applicable when using WFST decoders where the WFST is “fully expanded”, i.e. where the arcs correspond to HMM transitions. It outputs lattices that include HMM-state-level alignments as well as word labels. The general idea is to create a state-level lattice during decoding, and to do a special form of determinization that retains only the best-scoring path for each word sequence. This special determinization algorithm is a solution to the following problem: Given a WFST A, compute a WFST B that, for each input-symbol-sequence of A, contains just the lowest-cost path through A.
Published: 2012
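
Illustrative note: the problem statement at the end of this abstract (for each word sequence, keep only the lowest-cost path) can be shown on a toy example. This sketch enumerates paths explicitly, which is an assumption made for clarity; it is not the special determinization algorithm the abstract refers to, which operates on the WFST directly. The name `best_path_per_word_sequence` is hypothetical.

```python
def best_path_per_word_sequence(paths):
    """Keep the lowest-cost state-level alignment for each distinct word sequence.

    `paths` is an iterable of (words, alignment, cost) triples, standing in
    for the paths through a state-level lattice.
    """
    best = {}
    for words, alignment, cost in paths:
        key = tuple(words)
        if key not in best or cost < best[key][1]:
            best[key] = (alignment, cost)
    return best
```

The result holds one entry per word sequence, each with its state-level alignment, mirroring the property that the output lattice carries both word labels and HMM-state-level alignments.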

269. Strategies for training large scale neural network language models

Author: Daniel Povey, Jan Cernocky, Lukas Burget, Anoop Deoras, and Tomas Mikolov
Subjects: Computational complexity theory, Artificial neural network, Computer science, Time delay neural network, business.industry, Speech recognition, Principle of maximum entropy, Hash function, Word error rate, Machine learning, computer.software_genre, Reduction (complexity), Artificial intelligence, Language model, business, computer
Abstract: We describe how to effectively train neural network based language models on large data sets. Fast convergence during training and better overall performance are observed when the training data are sorted by their relevance. We introduce a hash-based implementation of a maximum entropy model that can be trained as a part of the neural network model. This leads to a significant reduction of computational complexity. We achieved around 10% relative reduction of word error rate on an English Broadcast News speech recognition task, against a large 4-gram model trained on 400M tokens.
Published: 2011
Full Text: View/download PDF

270. Strategies for using MLP based features with limited target-language training data

Author: Daniel Povey, Ji Xu, Jia Liu, and Yanmin Qian
Subjects: Training set, Transcription (linguistics), System combination, Computer science, Speech recognition, Language training, Automatic speech
Abstract: Recently there has been some interest in the question of how to build LVCSR systems when there is only a limited amount of acoustic training data in the target language, but possibly more plentiful data in other languages. In this paper we investigate approaches using MLP based features. We experiment with two approaches: One is based on Automatic Speech Attribute Transcription (ASAT), in which we train classifiers to learn articulatory features. The other approach uses only the target-language data and relies on combination of multiple MLPs trained on different subsets. After system combination we get large improvements of more than 10% relative versus a conventional baseline. These feature-level approaches may also be combined with other, model-level methods for the multilingual or low-resource scenario.
Published: 2011
Full Text: View/download PDF

271. A symmetrization of the Subspace Gaussian Mixture Model

Author: Arnab Ghoshal, Daniel Povey, Petr Schwarz, and Martin Karafiat
Subjects: symbols.namesake, Subspace Gaussian Mixture Model, Estimation theory, Computer science, Speech recognition, symbols, Word error rate, Symmetrization, State (functional analysis), Hidden Markov model, Gaussian process
Abstract: Last year we introduced the Subspace Gaussian Mixture Model (SGMM), and we demonstrated Word Error Rate improvements on a fairly small-scale task. Here we describe an extension to the SGMM, which we call the symmetric SGMM. It makes the model fully symmetric between the “speech-state vectors” and “speaker vectors” by making the mixture weights depend on the speaker as well as the speech state. We had previously avoided this as it introduces difficulties for efficient likelihood evaluation and parameter estimation, but we have found a way to overcome those difficulties. We find that the symmetric SGMM can give a very worthwhile improvement over the previously described model. We will also describe some larger-scale experiments with the SGMM, and report on progress toward releasing open-source software that supports SGMMs.
Published: 2011
Full Text: View/download PDF

272. A basis method for robust estimation of constrained MLLR

Author: Kaisheng Yao and Daniel Povey
Subjects: Transformation matrix, Covariance matrix, business.industry, Robustness (computer science), Pattern recognition, Regression analysis, Affine transformation, Artificial intelligence, Hidden Markov model, Speaker recognition, business, Speaker adaptation, Mathematics
Abstract: Constrained Maximum Likelihood Linear Regression (CMLLR) is a widely used speaker adaptation technique in which an affine transform of the features is estimated for each speaker. However, when the amount of speech data available is very small (e.g. a few seconds), it can be difficult to get sufficiently accurate estimates of the transform parameters. In this paper we describe a method of estimating CMLLR robustly from less data. We do this by representing the CMLLR transform matrix as a weighted sum over basis matrices, where the basis is constructed in such a way that the most important variation is concentrated in the leading coefficients. Depending on the amount of data available, we can choose to estimate a smaller or larger number of coefficients.
Published: 2011
Full Text: View/download PDF
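
Illustrative note: representing the transform as a weighted sum over basis matrices, truncated to the leading coefficients, can be sketched as below. How the basis is constructed and how the coefficients are estimated are not covered here, and the small square matrices in the usage are toy values; an actual CMLLR transform is an affine transform of the features. The name `cmllr_from_basis` is hypothetical.

```python
def cmllr_from_basis(basis, coeffs):
    """Combine the first len(coeffs) basis matrices into one transform matrix.

    `basis` is a list of equally shaped matrices (lists of rows), ordered so
    that the most important directions come first. Passing fewer
    coefficients uses fewer basis matrices, as one would with less data.
    """
    if len(coeffs) > len(basis):
        raise ValueError("more coefficients than basis matrices")
    rows, cols = len(basis[0]), len(basis[0][0])
    transform = [[0.0] * cols for _ in range(rows)]
    for weight, matrix in zip(coeffs, basis):
        for r in range(rows):
            for c in range(cols):
                transform[r][c] += weight * matrix[r][c]
    return transform
```

For example, with `basis = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 0.0]]]`, calling `cmllr_from_basis(basis, [2.0])` uses only the leading basis matrix, while `cmllr_from_basis(basis, [2.0, 0.5])` also adds the second.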

273. Approaches to automatic lexicon learning with limited training examples

Author: Arnab Ghoshal, Samuel Thomas, Nagendra Kumar Goel, Pinar Akyazi, Ariya Rastrow, Kai Feng, Mohit Agarwal, Martin Karafiat, Lukas Burget, Richard Rose, Ondrej Glembek, Petr Schwarz, and Daniel Povey
Subjects: Vocabulary, Training set, Computer science, business.industry, media_common.quotation_subject, Bootstrapping (linguistics), LVCSR, computer.software_genre, Lexicon, Artificial intelligence, business, computer, Natural language processing, Natural language, Lexicon Learning, media_common
Abstract: Preparation of a lexicon for speech recognition systems can be a significant effort in languages where the written form is not exactly phonetic. On the other hand, in languages where the written form is quite phonetic, some common words are often mispronounced. In this paper, we use a combination of lexicon learning techniques to explore whether a lexicon can be learned when only a small lexicon is available for boot-strapping. We discover that for a phonetic language such as Spanish, it is possible to do that better than what is possible from generic rules or hand-crafted pronunciations. For a more complex language such as English, we find that it is still possible but with some loss of accuracy.
Published: 2010
Full Text: View/download PDF

274. Multilingual acoustic modeling for speech recognition based on subspace Gaussian Mixture Models

Author: Mohit Agarwal, Lukas Burget, Martin Karafiat, Pinar Akyazi, Richard Rose, Ondrej Glembek, Petr Schwarz, Samuel Thomas, Nagendra Kumar Goel, Daniel Povey, Arnab Ghoshal, Kai Feng, and Ariya Rastrow
Subjects: Structure (mathematical logic), Training set, Computer science, business.industry, Speech recognition, Computer Science::Computation and Language (Computational Linguistics and Natural Language and Speech Processing), Pattern recognition, Multilingual acoustic modeling, Mixture model, Large vocabulary speech recognition, Set (abstract data type), symbols.namesake, symbols, Artificial intelligence, Hidden Markov model, business, Subspace Gaussian mixture model, Gaussian process, Subspace topology
Abstract: Although research has previously been done on multilingual speech recognition, it has been found to be very difficult to improve over separately trained systems. The usual approach has been to use some kind of “universal phone set” that covers multiple languages. We report experiments on a different approach to multilingual speech recognition, in which the phone sets are entirely distinct but the model has parameters not tied to specific states that are shared across languages. We use a model called a “Subspace Gaussian Mixture Model” where states' distributions are Gaussian Mixture Models with a common structure, constrained to lie in a subspace of the total parameter space. The parameters that define this subspace can be shared across languages. We obtain substantial WER improvements with this approach, especially with very small amounts of in-language training data.
Published: 2010
Full Text: View/download PDF

275. The IBM 2008 GALE Arabic speech transcription system

Author: Upendra V. Chaudhari, Hong-Kwang Kuo, Lidia Mangu, Hagen Soltau, George Saon, Brian Kingsbury, Daniel Povey, and Stephen M. Chu
Subjects: Machine translation, Computer science, business.industry, Arabic, Speech recognition, Speech coding, Word error rate, computer.software_genre, language.human_language, ComputingMethodologies_PATTERNRECOGNITION, Discriminative model, Test set, language, Language model, Artificial intelligence, Transcription (software), business, Hidden Markov model, computer, Natural language processing
Abstract: This paper describes the Arabic broadcast transcription system fielded by IBM in the GALE Phase 3.5 machine translation evaluation. Key advances compared to our Phase 2.5 system include improved discriminative training, the use of Subspace Gaussian Mixture Models (SGMM), neural network acoustic features, variable frame rate decoding, training data partitioning experiments, unpruned n-gram language models and neural network language models. These advances were instrumental in achieving a word error rate of 8.9% on the evaluation test set.
Published: 2010
Full Text: View/download PDF

276. An improved consensus-like method for Minimum Bayes Risk decoding and lattice combination

Author: Daniel Povey, Haihua Xu, Jie Zhu, and Lidia Mangu
Subjects: Bayes' theorem, business.industry, Computer science, Lattice (order), Word error rate, Computer Science::Computation and Language (Computational Linguistics and Natural Language and Speech Processing), Pattern recognition, Artificial intelligence, business, Decoding methods
Abstract: In this paper we describe a method for Minimum Bayes Risk decoding for speech recognition. This is a technique similar to Consensus a.k.a. Confusion Network Decoding, in which we attempt to find the hypothesis that minimizes the Bayes' Risk with respect to the word error rate, based on a lattice of alternative outputs. Our method is an E-M like technique which makes approximations which we believe are less severe than the approximations made in Consensus, and our experimental results show an improvement in WER both for lattice rescoring and lattice-based system combination, versus baselines such as Consensus, Confusion Network Combination and ROVER.
Published: 2010
Full Text: View/download PDF

277. The 2009 IBM GALE Mandarin broadcast transcription system

Author: Daniel Povey, Hong-Kwang Kuo, Lidia Mangu, Stephen M. Chu, Yong Qin, Shilei Zhang, and Qin Shi
Subjects: Normalization (statistics), Computer science, Speech recognition, language, Word error rate, Hidden Markov model, Mandarin Chinese, language.human_language
Abstract: This paper gives an up-to-date description of the IBM Mandarin broadcast transcription system developed under the DARPA GALE program. Technical advances over our previous system include a novel acoustic modeling approach using subspace Gaussian mixture models, a speaking rate adaptation method using frame rate normalization, and an effective recipe for lattice combination. We present results on three consortium-defined test sets. It is shown that with these advances, the new system attains a 9% relative reduction in character error rate compared to our previous GALE evaluation system. The reported 9.1% error rate on the phase three evaluation set represents the state of the art in Mandarin broadcast speech transcription.
Published: 2010
Full Text: View/download PDF

278. Minimum hypothesis phone error as a decoding method for speech recognition

Author: Daniel Povey, Jie Zhu, Guanyong Wu, and Haihua Xu
Subjects: Bayes' theorem, Discriminative model, Phone, Computer science, Speech recognition, Maximum a posteriori estimation, Word error rate, Bayes error rate, Sentence, Decoding methods
Abstract: In this paper we show how methods for approximating phone error as normally used for Minimum Phone Error (MPE) discriminative training, can be used instead as a decoding criterion for lattice rescoring. This is an alternative to Confusion Networks (CN) which are commonly used in speech recognition. The standard (Maximum A Posteriori) decoding approach is a Minimum Bayes Risk estimate with respect to the Sentence Error Rate (SER); however, we are typically more interested in the Word Error Rate (WER). Methods such as CN and our proposed Minimum Hypothesis Phone Error (MHPE) aim to get closer to minimizing the expected WER. Based on preliminary experiments we find that our approach gives more improvement than CN, and is conceptually simpler.
Published: 2009
Full Text: View/download PDF

279. Large margin semi-tied covariance transforms for discriminative training

Author: Hagen Soltau, George Saon, and Daniel Povey
Subjects: Covariance matrix, business.industry, Computer science, Gaussian, Pattern recognition, Covariance, Machine learning, computer.software_genre, symbols.namesake, Discriminative model, Margin (machine learning), symbols, Artificial intelligence, business, Hidden Markov model, computer, Gaussian process
Abstract: We discuss the applicability of large margin techniques to the problem of estimating linear transforms for discriminative training of a semi-tied covariance (STC) model. Since STC models are good proxies for full-covariance (FC) Gaussian models, the idea is to combine the benefit of the latest discriminative training techniques and the modeling advantage of FC Gaussians at a much lower computational cost. We study the interaction of these transforms with feature-space and model-space discriminative training on state-of-the-art speaker adapted systems built for a large-scale Arabic broadcast news transcription task.
Published: 2009
Full Text: View/download PDF

280. Universal background model based speech recognition

Author: Stephen M. Chu, Balakrishnan Varadarajan, and Daniel Povey
Subjects: Estimation theory, Computer science, business.industry, Speech recognition, Speaker recognition, Machine learning, computer.software_genre, Field (computer science), Set (abstract data type), Tree (data structure), Pruning (decision trees), Artificial intelligence, Transcription (software), business, computer, Smoothing
Abstract: The universal background model (UBM) is an effective framework widely used in speaker recognition. But so far it has received little attention from the speech recognition field. In this work, we make a first attempt to apply the UBM to acoustic modeling in ASR. We propose a tree-based parameter estimation technique for UBMs, and describe a set of smoothing and pruning methods to facilitate learning. The proposed UBM approach is benchmarked on a state-of-the-art large-vocabulary continuous speech recognition platform on a broadcast transcription task. Preliminary experiments reported in this paper already show very exciting results.
Published: 2008
Full Text: View/download PDF

281. Boosted MMI for model and feature-space discriminative training

Author: Karthik Visweswariah, Bhuvana Ramabhadran, Daniel Povey, Dimitri Kanevsky, George Saon, and Brian Kingsbury
Subjects: Boosting (machine learning), business.industry, Speech recognition, Feature vector, Feature extraction, Pattern recognition, Mutual information, symbols.namesake, Discriminative model, Phone, symbols, Accuracy function, Artificial intelligence, business, Gaussian process, Mathematics
Abstract: We present a modified form of the maximum mutual information (MMI) objective function which gives improved results for discriminative training. The modification consists of boosting the likelihoods of paths in the denominator lattice that have a higher phone error relative to the correct transcript, by using the same phone accuracy function that is used in Minimum Phone Error (MPE) training. We combine this with another improvement to our implementation of the Extended Baum-Welch update equations for MMI, namely the canceling of any shared part of the numerator and denominator statistics on each frame (a procedure that is already done in MPE). This change affects the Gaussian-specific learning rate. We also investigate another modification whereby we replace I-smoothing to the ML estimate with I-smoothing to the previous iteration's value. Boosted MMI gives better results than MPE in both model and feature-space discriminative training, although not consistently.
Published: 2008
Full Text: View/download PDF
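
For readers unfamiliar with the objective described in the abstract above, the boosted MMI criterion is commonly written as follows. This is a sketch in notation chosen here, not taken from this record: $\mathbf{x}_r$ is the $r$-th training utterance, $s_r$ its reference transcript, $\mathcal{M}_s$ the HMM sequence for hypothesis $s$, $\kappa$ the acoustic scale, $A(s, s_r)$ the phone accuracy of $s$ against the reference, and $b$ the boosting factor.

```latex
\mathcal{F}_{\mathrm{bMMI}}(\lambda) =
  \sum_{r=1}^{R} \log
  \frac{p_\lambda(\mathbf{x}_r \mid \mathcal{M}_{s_r})^{\kappa} \, P(s_r)}
       {\sum_{s} p_\lambda(\mathbf{x}_r \mid \mathcal{M}_{s})^{\kappa} \, P(s) \,
        \exp\bigl(-b \, A(s, s_r)\bigr)}
```

With $b = 0$ this reduces to the standard MMI objective. For $b > 0$, denominator paths with lower accuracy (more phone errors) receive a larger weight, which is the "boosting" the abstract refers to.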

282. Quick fMLLR for speaker adaptation in speech recognition

Author: Daniel Povey, Stephen M. Chu, and Balakrishnan Varadarajan
Subjects: business.industry, Speech recognition, Feature vector, Word error rate, Pattern recognition, FMLLR, Reduction (complexity), Artificial intelligence, Transcription (software), business, Hidden Markov model, Adaptation (computer science), Sufficient statistic, Mathematics
Abstract: Feature space maximum likelihood linear regression (fMLLR) is a widely used technique for speaker adaptation in HMM-based speech recognition. However, in extremely resource constrained systems the time required to perform the sufficient statistics accumulation for fMLLR adaptation can be considerable. In this paper we describe a novel method that can lead to significant reduction in the time taken for statistics accumulation while preserving the adaptation gains. The proposed quick fMLLR (Q-fMLLR) algorithm is implemented in a state-of-the-art large-vocabulary continuous speech recognition system, and evaluated on a broadcast transcription task. We present results both in terms of the average likelihood after adaptation and the character error rate. It is shown that Q-fMLLR attains the performance of regular fMLLR with a fraction of the computation.
Published: 2008
Full Text: View/download PDF

283. The IBM 2006 GALE Arabic ASR System

Author: Daniel Povey, Hagen Soltau, J. Kuo, Lidia Mangu, George Saon, Brian Kingsbury, and Geoffrey Zweig
Subjects: Vocabulary, Machine translation, Arabic, business.industry, Computer science, Speech recognition, media_common.quotation_subject, Word error rate, computer.software_genre, Speech processing, language.human_language, Discriminative model, language, Artificial intelligence, IBM, Transcription (software), business, computer, Natural language processing, Natural language, media_common
Abstract: This paper describes the advances made in IBM's Arabic broadcast news transcription system which was fielded in the 2006 GALE ASR and machine translation evaluation. These advances were instrumental in lowering the word error rate by 42% relative over the course of one year and include: training on additional LDC data, large-scale discriminative training on 1800 hours of unsupervised data, automatic vowelization using a flat-start approach, use of a large vocabulary with 617K words and 2 million pronunciations and lastly, a system architecture based on cross-adaptation between unvowelized and vowelized acoustic models.
Published: 2007
Full Text: View/download PDF

284. fMPE: Discriminatively Trained Features for Speech Recognition

Author: Daniel Povey, Brian Kingsbury, George Saon, Hagen Soltau, Geoffrey Zweig, and Lidia Mangu
Subjects: Training set, business.industry, Computer science, Speech recognition, Feature vector, Acoustic model, Pattern recognition, symbols.namesake, Discriminative model, Robustness (computer science), symbols, Artificial intelligence, business, Hidden Markov model, Gaussian process
Abstract: MPE (minimum phone error) is a previously introduced technique for discriminative training of HMM parameters. fMPE applies the same objective function to the features, transforming the data with a kernel-like method and training millions of parameters, comparable to the size of the acoustic model. Despite the large number of parameters, fMPE is robust to over-training. The method is to train a matrix projecting from posteriors of Gaussians to a normal size feature space, and then to add the projected features to normal features such as PLP. The matrix is trained from a zero start using a linear method. Sparsity of posteriors ensures speed in both training and test time. The technique gives similar improvements to MPE (around 10% relative). MPE on top of fMPE results in error rates up to 6.5% relative better than MPE alone, or more if multiple layers of transform are trained.
Published: 2006
Full Text: View/download PDF

285. The IBM 2004 Conversational Telephony System for Rich Transcription

Author: Hagen Soltau, Geoffrey Zweig, Lidia Mangu, Brian Kingsbury, George Saon, and Daniel Povey
Subjects: Context model, Computer science, business.industry, Speech recognition, Test set, Systems architecture, System testing, Word error rate, Telephony, IBM, Transcription (software), business
Abstract: This paper describes the technical advances in IBM's conversational telephony submission to the DARPA-sponsored 2004 rich transcription evaluation (RT-04). These advances include a system architecture based on cross-adaptation; a new form of feature-based MPE training; the use of a full-scale discriminatively trained full covariance Gaussian system; the use of septaphone cross-word acoustic context in static decoding graphs; and the incorporation of 2100 hours of training data in every system component. These advances reduced the error rate by approximately 21% relative, on the 2003 test set, over the best-performing system in last year's evaluation, and produced the best results on the RT-04 current and progress CTS data.
Published: 2006
Full Text: View/download PDF

286. Morpheme-Based Language Modeling for Arabic LVCSR

Author: Geoffrey Zweig, Stanley F. Chen, Daniel Povey, and G. Choueiter
Subjects: Vocabulary, Arabic, business.industry, Computer science, Speech recognition, media_common.quotation_subject, Speech coding, Word error rate, Isolating language, computer.software_genre, language.human_language, Morpheme, language, Artificial intelligence, Language model, business, computer, Natural language, Word (computer architecture), Natural language processing, media_common
Abstract: In this paper, we concentrate on Arabic speech recognition. Taking advantage of the rich morphological structure of the language, we use morpheme-based language modeling to improve the word error rate. We propose a simple constraining method to rid the decoding output of illegal morpheme sequences. We report the results obtained for word and morpheme language models using medium (
Published: 2006
Full Text: View/download PDF

287. Automated Quality Monitoring in the Call Center with ASR and Maximum Entropy

Author: George Saon, Lidia Mangu, Bhuvana Ramabhadran, Daniel Povey, Olivier Siohan, Geoffrey Zweig, and Brian Kingsbury
Subjects: Computer science, Entropy (statistical thermodynamics), Speech recognition, Principle of maximum entropy, Entropy (information theory), Sampling (statistics), Active listening, Entropy (energy dispersal), Precision and recall
Abstract: This paper describes an automated system for assigning quality scores to recorded call center conversations. The system combines speech recognition, pattern matching, and maximum entropy classification to rank calls according to their measured quality. Calls at both ends of the spectrum are flagged as "interesting" and made available for further human monitoring. In this process, pattern matching on the ASR transcript is used to answer a set of standard quality control questions such as "did the agent use courteous words and phrases," and to generate a question-based score. This is interpolated with the probability of a call being "bad," as determined by maximum entropy operating on a set of ASR-derived features such as "maximum silence length" and the occurrence of selected n-gram word sequences. The system is trained on a set of calls with associated manual evaluation forms. We present precision and recall results from IBM's North American Help Desk indicating that for a given amount of listening effort, this system triples the number of bad calls that are identified, over the current policy of randomly sampling calls.
Published: 2006
Full Text: View/download PDF

288. Automated quality monitoring for call centers using speech and NLP technologies

Author: Lidia Mangu, George Saon, Brian Kingsbury, Geoffrey Zweig, Bhuvana Ramabhadran, Olivier Siohan, and Daniel Povey
Subjects: business.industry, Computer science, Principle of maximum entropy, Speech recognition, Rank (computer programming), computer.software_genre, Conjunction (grammar), Set (abstract data type), Pattern matching, Artificial intelligence, IBM, business, Precision and recall, computer, Word (computer architecture), Natural language processing
Abstract: This paper describes an automated system for assigning quality scores to recorded call center conversations. The system combines speech recognition, pattern matching, and maximum entropy classification to rank calls according to their measured quality. Calls at both ends of the spectrum are flagged as "interesting" and made available for further human monitoring. In this process, the ASR transcript is used to answer a set of standard quality control questions such as "did the agent use courteous words and phrases," and to generate a question-based score. This is interpolated with the probability of a call being "bad," as determined by maximum entropy operating on a set of ASR-derived features such as "maximum silence length" and the occurrence of selected n-gram word sequences. The system is trained on a set of calls with associated manual evaluation forms. We present precision and recall results from IBM's North American Help Desk indicating that for a given amount of listening effort, this system triples the number of bad calls that are identified, over the current policy of randomly sampling calls. The application that will be demonstrated is a research prototype that was built in conjunction with IBM's North American call centers.
Published: 2006
Full Text: View/download PDF

289. The IBM Rich Transcription Spring 2006 Speech-to-Text System for Lecture Meetings

Author: Alvaro Soneiro, Daniel Povey, Gerasimos Potamianos, Martin Westphal, Thomas Ross, Vit Libal, Henrik Schulz, Jing Huang, Stanley F. Chen, and Olivier Siohan
Subjects: Computer science, Microphone, Speech recognition, Headset, Acoustic model, Word error rate, Speech synthesis, computer.software_genre, Sound recording and reproduction, Phone, NIST, Language model, computer, Vocal tract
Abstract: We describe the IBM systems submitted to the NIST RT06s Speech-to-Text (STT) evaluation campaign on the CHIL lecture meeting data for three conditions: Multiple distant microphone (MDM), single distant microphone (SDM), and individual headset microphone (IHM). The system building process is similar to the IBM conversational telephone speech recognition system. However, the best models for the far-field conditions (SDM and MDM) proved to be the ones that use neither variance normalization nor vocal tract length normalization. Instead, feature-space minimum-phone error discriminative training yielded the best results. Due to the relatively small amount of CHIL-domain data, the acoustic models of our systems are built on publicly available meeting corpora, with maximum a-posteriori adaptation applied twice on CHIL data during training: First, at the initial speaker-independent model, and subsequently at the minimum phone error model. For language modeling, we utilized meeting transcripts, text from scientific conference proceedings, and spontaneous telephone conversations. On development data, chosen in our work to be the 2005 CHIL-internal STT evaluation test set, the resulting language model provided a 4% absolute gain in word error rate (WER), compared to the model used in last year's CHIL evaluation. Furthermore, the developed STT system significantly outperformed our last year's results, by reducing close-talking microphone data WER from 36.9% to 25.4% on our development set. In the NIST RT06s evaluation campaign, both MDM and SDM systems scored well; however, the IHM system did poorly due to unsuccessful cross-talk removal.
Published: 2006
Full Text: View/download PDF

290. Discriminatively trained features using fMPE for multi-stream audio-visual speech recognition

Author: Daniel Povey and Jing Huang
Subjects: business.industry, Computer science, Speech recognition, Headset, Process (computing), Audio-visual speech recognition, Pattern recognition, Multi stream, Transformation (function), Discriminative model, Phone, Artificial intelligence, business, Hidden Markov model, Test data
Abstract: fMPE is a recently introduced discriminative training technique that uses the Minimum Phone Error (MPE) discriminative criterion to train a feature-level transformation. In this paper we investigate fMPE trained audio/visual features for multistream HMM-based audio-visual speech recognition. A flexible, layer-based implementation of fMPE allows us to combine the visual information with the audio stream using the discriminative training process, and dispense with the multiple stream approach. Experiments are reported on the IBM infrared headset audio-visual database. On average of 20-speaker 1 hour speaker independent test data, the fMPE trained acoustic features achieve 33% relative gain. Adding video layers on top of audio layers gives an additional 10% gain over fMPE trained features from the audio stream alone. The fMPE trained visual features achieve 14% relative gain, while the decision fusion of audio/visual streams with fMPE trained features achieves 29% relative gain. However, fMPE trained models do not improve over the original models on the mismatched noisy test data.
Published: 2005
Full Text: View/download PDF

291. Feature space Gaussianization

Author: Daniel Povey, Satya Dharanipragada, and George Saon
Subjects: Basis (linear algebra), business.industry, Gaussian, Cumulative distribution function, Pattern recognition, Probability density function, Empirical distribution function, symbols.namesake, Transformation (function), Dimension (vector space), symbols, Artificial intelligence, business, Divergence (statistics), Algorithm, Mathematics
Abstract: We propose a non-linear feature space transformation for speaker/environment adaptation which forces the individual dimensions of the acoustic data for every speaker to be Gaussian distributed. The transformation is given by the preimage under the Gaussian cumulative distribution function (CDF) of the empirical CDF on a per dimension basis. We show that, for a given dimension, this transformation achieves minimum divergence between the density function of the transformed adaptation data and the normal density with zero mean and unit variance. Experimental results on both small and large vocabulary tasks show consistent improvements over the application of linear adaptation transforms only.
Published: 2004
Full Text: View/download PDF
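
The transformation in the abstract above (the preimage under the Gaussian CDF of the empirical CDF, applied per dimension) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `gaussianize`, the `(rank + 0.5) / n` form of the empirical CDF, and the arbitrary handling of tied values are all assumptions made here for illustration.

```python
from statistics import NormalDist


def gaussianize(values):
    """Map one feature dimension of one speaker's data to roughly N(0, 1).

    Each sample is replaced by inv_cdf(empirical_cdf(sample)), where
    inv_cdf is the inverse of the standard normal CDF.
    """
    n = len(values)
    std_normal = NormalDist()  # zero mean, unit variance
    # Indices of the samples in ascending order of value.
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, idx in enumerate(order):
        # Assumed empirical CDF: the 0.5 offset keeps the value strictly
        # inside (0, 1), where the inverse normal CDF is finite.
        ecdf = (rank + 0.5) / n
        out[idx] = std_normal.inv_cdf(ecdf)
    return out


# Hypothetical usage: one dimension of a tiny adaptation set.
transformed = gaussianize([3.0, 1.0, 2.0])
```

The mapping depends only on the rank of each sample within the speaker's data, so any monotonic distortion of that dimension is undone. A multi-dimensional feature vector would be handled by calling the function once per dimension.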

292. Discriminative training for HMM-based offline handwritten character recognition

Author: Daniel Povey and Roongroj Nopsuwanchai
Subjects: Training set, Computer science, business.industry, Intelligent character recognition, Speech recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Pattern recognition, Mutual information, Intelligent word recognition, ComputingMethodologies_PATTERNRECOGNITION, Discriminative model, Handwriting recognition, Principal component analysis, Artificial intelligence, Hidden Markov model, business
Abstract: In this paper we report the use of discriminative training and other techniques to improve performance in a HMM-based isolated handwritten character recognition system. The discriminative training is maximum mutual information (MMI) training; we also improve results by using composite images which are the concatenation of the raw images, rotated and polar transformed versions of them; and we describe a technique called block-based principal component analysis (PCA). For effective discriminative training we need to increase the size of our training database, which we do by eroding and dilating the images to give a three-fold increase in training data. Although these techniques are tested using isolated Thai characters, both MMI and block-based PCA are applicable to the more difficult task of cursive handwriting recognition.
Published: 2004
Full Text: View/download PDF

293. Discriminative MAP for acoustic model adaptation

Author: Philip C. Woodland, Mark J. F. Gales, and Daniel Povey
Subjects: Quasi-maximum likelihood, Estimation theory, Computer science, business.industry, Speech recognition, Maximum likelihood, Acoustic model, Word error rate, Pattern recognition, Mutual information, Maximum likelihood sequence estimation, Discriminative model, Prior probability, Expectation–maximization algorithm, Maximum a posteriori estimation, Artificial intelligence, Likelihood function, business, Hidden Markov model
Abstract: In this paper we show how a discriminative objective function such as Maximum Mutual Information (MMI) can be combined with a prior distribution over the HMM parameters to give a discriminative Maximum A Posteriori (MAP) estimate for HMM training. The prior distribution can be based around the Maximum Likelihood (ML) parameter estimates, leading to a technique previously referred to as I-smoothing; or for adaptation it can be based around a MAP estimate of the ML parameters, leading to what we call MMI-MAP. This latter approach is shown to be effective for task adaptation, where data from one task (Voicemail) is used to adapt a HMM set trained on another task (Switchboard). It is shown that MMI-MAP results in a 2.1% absolute reduction in word error rate relative to standard ML-MAP with 30 hours of Voicemail task adaptation data starting from a MMI-trained Switchboard system.
Published: 2003
Full Text: View/download PDF

294. Porting: SwitchBoard to the VoiceMail task

Author: Philip C. Woodland, M.E. Gales, Y. Dong, and Daniel Povey
Subjects: Discriminative model, law, Computer science, Speech recognition, Voicemail, Word error rate, Porting, Natural language, law.invention
Abstract: The paper examines techniques that allow a well-trained source system built on one task to be rapidly adapted, or ported, to another target task. The two tasks considered are Hub5, or SwitchBoard, as the source system and VoiceMail as the target task. The two tasks are acoustically similar, both being telephone-bandwidth speech tasks, but differ in speaking style: SwitchBoard is conversational speech, whereas VoiceMail is a set of voicemail messages. Various porting schemes for acoustic models are examined, including discriminative MAP and heteroscedastic LDA. Using around 28 hours of data, the error rate on VoiceMail was reduced by 42% relative compared to the baseline SwitchBoard performance.
Published: 2003
Full Text: View/download PDF

295. Improved discriminative training techniques for large vocabulary continuous speech recognition

Author: Daniel Povey and Philip C. Woodland
Subjects: Vocabulary, business.industry, Computer science, media_common.quotation_subject, Maximum likelihood, Speech recognition, Mutual information, Machine learning, computer.software_genre, Discriminative model, Artificial intelligence, Hidden Markov model, business, computer, media_common
Abstract: Investigates the use of discriminative training techniques for large vocabulary speech recognition with training datasets up to 265 hours. Techniques for improving lattice-based maximum mutual information estimation (MMIE) training are described and compared to frame discrimination (FD). An objective function which is an interpolation of MMIE and standard maximum likelihood estimation (MLE) is also discussed. Experimental results on both the Switchboard and North American Business News tasks show that MMIE training can yield significant performance improvements over standard MLE even for the most complex speech recognition problems with very large training sets.
Published: 2002
Full Text: View/download PDF

296. New features in the CU-HTK system for transcription of conversational telephone speech

Author: P.C. Woodland, Thomas Hain, Gunnar Evermann, and Daniel Povey
Subjects: Computer science, business.industry, Speech recognition, Posterior probability, Word error rate, NIST, Mutual information, Telephony, Pronunciation, Transcription (software), Hidden Markov model, business
Abstract: Discusses new features integrated into the Cambridge University HTK (CU-HTK) system for the transcription of conversational telephone speech. Major improvements have been achieved by the use of maximum mutual information estimation in training as well as maximum likelihood estimation; the use of a full variance transform for adaptation; the inclusion of unigram pronunciation probabilities; and word-level posterior probability estimation using confusion networks for use in minimum word error rate decoding, confidence score estimation and system combination. Improvements are demonstrated via performance on the NIST March 2000 evaluation of English conversational telephone speech transcription (Hub5E). In this evaluation the CU-HTK system gave an overall word error rate of 25.4%, which was the best performance by a statistically significant margin.
Published: 2002
Full Text: View/download PDF

297. Corrections to 'Automatic Transcription of Conversational Telephone Speech'

Author: G.L. Moore, Gunnar Evermann, Philip C. Woodland, Xunying Liu, Daniel Povey, Mark J. F. Gales, Thomas Hain, and Lan Wang
Subjects: Acoustics and Ultrasonics, business.industry, Computer science, Speech recognition, Telephony, Electrical and Electronic Engineering, Transcription (software), business
Published: 2006
Full Text: View/download PDF

298. Improved feature processing for deep neural networks

Author: Jan Cernocký, Daniel Povey, Shakti P. Rath, and Karel Veselý
Subjects: Computer science, business.industry, Speech recognition, Pipeline (computing), Pattern recognition, Feature Dimension, Dimension (vector space), Feature (computer vision), Deep neural networks, Artificial intelligence, Mel-frequency cepstrum, business, Baseline (configuration management), Decorrelation
Abstract: In this paper, we investigate alternative ways of processing MFCC-based features to use as the input to Deep Neural Networks (DNNs). Our baseline is a conventional feature pipeline that involves splicing the 13-dimensional front-end MFCCs across 9 frames, followed by applying LDA to reduce the dimension to 40 and then further decorrelation using MLLT. Confirming the results of other groups, we show that speaker adaptation applied on the top of these features using feature-space MLLR is helpful. The fact that the number of parameters of a DNN is not strongly sensitive to the input feature dimension (unlike GMM-based systems) motivated us to investigate ways to increase the dimension of the features. In this paper, we investigate several approaches to derive higher-dimensional features and verify their performance with DNN. Our best result is obtained from splicing our baseline 40-dimensional speaker adapted features again across 9 frames, followed by reducing the dimension to 200 or 300 using another LDA. Our final result is about 3% absolute better than our best GMM system, which is a discriminatively trained model.
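
The frame-splicing step mentioned in the abstract above (13-dimensional MFCCs spliced across 9 frames, giving 117 dimensions before LDA) can be sketched as below. This is an illustration only: the function name `splice` and the edge handling (repeating the first or last frame) are assumptions, and the LDA and MLLT steps are not shown.

```python
def splice(frames, context=4):
    """Concatenate each frame with its `context` left and right neighbours.

    `frames` is a list of equal-length feature vectors (lists of floats).
    With context=4 each output row spans 2 * 4 + 1 = 9 input frames.
    """
    n = len(frames)
    spliced = []
    for t in range(n):
        row = []
        for offset in range(-context, context + 1):
            # Assumed edge policy: clamp the index, so frames near the
            # start or end reuse the first or last frame.
            src = min(max(t + offset, 0), n - 1)
            row.extend(frames[src])
        spliced.append(row)
    return spliced


# Hypothetical usage: 20 frames of 13-dimensional features.
mfcc = [[float(t)] * 13 for t in range(20)]
wide = splice(mfcc)  # each row now has 13 * 9 = 117 values
```

The same function applied to 40-dimensional speaker-adapted features yields 360-dimensional rows, which corresponds to the input of the second LDA described in the abstract.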

299. Sequence-discriminative training of deep neural networks

Author: Lukas Burget, Karel Veselý, Arnab Ghoshal, and Daniel Povey
Subjects: Bayes' theorem, Sequence, Artificial neural network, Discriminative model, business.industry, Computer science, Deep learning, Speech recognition, Artificial intelligence, Mutual information, Heuristics, business, Word (computer architecture)
Abstract: Sequence-discriminative training of deep neural networks (DNNs) is investigated on a standard 300 hour American English conversational telephone speech task. Different sequence-discriminative criteria are compared: maximum mutual information (MMI), minimum phone error (MPE), state-level minimum Bayes risk (sMBR), and boosted MMI. Two different heuristics are investigated to improve the performance of the DNNs trained using sequence-based criteria: lattices are regenerated after the first iteration of training; and, for MMI and BMMI, the frames where the numerator and denominator hypotheses are disjoint are removed from the gradient computation. Starting from a competitive DNN baseline trained using cross-entropy, different sequence-discriminative criteria are shown to lower word error rates by 7-9% relative, on average. Little difference is noticed between the different sequence-based criteria that are investigated. The experiments are done using the open-source Kaldi toolkit, which makes it possible for the wider community to reproduce these results. Index Terms: speech recognition, deep learning, sequence-criterion training, neural networks, reproducible research

300. Subspace Gaussian Mixture Models for speech recognition

Author: Mohit Agarwal, Daniel Povey, Arnab Ghoshal, Samuel Thomas, Nagendra Kumar Goel, Ondrej Glembek, Petr Schwarz, Martin Karafiat, Lukas Burget, Richard Rose, Ariya Rastrow, Kai Feng, and Pinar Akyazi
Subjects: Gaussian Mixture Models, Training set, Computer science, business.industry, Speech recognition, Acoustic model, Pattern recognition, Parameter space, Mixture model, symbols.namesake, Speech Recognition, symbols, Artificial intelligence, Representation (mathematics), Hidden Markov model, business, Gaussian process, Subspace topology, Hidden Markov Models
Abstract: We describe an acoustic modeling approach in which all phonetic states share a common Gaussian Mixture Model structure, and the means and mixture weights vary in a subspace of the total parameter space. We call this a Subspace Gaussian Mixture Model (SGMM). Globally shared parameters define the subspace. This style of acoustic model allows for a much more compact representation and gives better results than a conventional modeling approach, particularly with smaller amounts of training data.

367 results on '"Daniel Povey"'
