Author: "Daniel Povey" / Database: OpenAIRE - Searchworks@Jio Institute Digital Library Search Results

1. Fast and parallel decoding for transducer

Author: Wei Kang, Liyong Guo, Fangjun Kuang, Long Lin, Mingshuang Luo, Zengwei Yao, Xiaoyu Yang, Piotr Żelasko, and Daniel Povey
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Machine Learning, Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Computer Science - Sound, Machine Learning (cs.LG), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The transducer architecture is becoming increasingly popular in the field of speech recognition, because it is naturally streaming as well as high in accuracy. One of the drawbacks of the transducer is that it is difficult to decode in a fast and parallel way due to an unconstrained number of symbols that can be emitted per time step. In this work, we introduce a constrained version of transducer loss to learn strictly monotonic alignments between the sequences; we also improve the standard greedy search and beam search algorithms by limiting the number of symbols that can be emitted per time step in transducer decoding, making it more efficient to decode in parallel with batches. Furthermore, we propose a finite state automaton (FSA)-based parallel beam search algorithm that can run with graphs on GPU efficiently. The experimental results show that we achieve slight word error rate (WER) improvement as well as significant speedup in decoding. Our work is open-sourced and publicly available (https://github.com/k2-fsa/icefall)., Submitted to 2023 IEEE International Conference on Acoustics, Speech and Signal Processing
Published: 2022
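
Illustrative note: the idea of limiting the number of symbols emitted per time step in greedy search can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' implementation: `greedy_search`, `score_fn`, `toy_scores`, and the choice of `blank_id = 0` are hypothetical names introduced only for this example, and a real system would score with a trained joiner network.

```python
from typing import Callable, List, Sequence


def greedy_search(
    num_frames: int,
    score_fn: Callable[[int, List[int]], Sequence[float]],
    blank_id: int = 0,
    max_symbols_per_frame: int = 1,
) -> List[int]:
    """Greedy transducer-style decoding with a cap on symbols per frame.

    score_fn(t, hyp) is a hypothetical stand-in for the joiner: it returns
    one score per vocabulary entry given frame t and the symbols so far.
    """
    hyp: List[int] = []
    for t in range(num_frames):
        # Emit at most `max_symbols_per_frame` non-blank symbols, then
        # move on to the next frame even if blank was never predicted.
        for _ in range(max_symbols_per_frame):
            scores = score_fn(t, hyp)
            best = max(range(len(scores)), key=scores.__getitem__)
            if best == blank_id:
                break
            hyp.append(best)
    return hyp


def toy_scores(t: int, hyp: List[int]) -> Sequence[float]:
    """Toy scorer (assumption): prefers symbol 1 until two symbols per
    frame have been emitted, then prefers blank (index 0)."""
    return [0.0, 1.0] if len(hyp) < 2 * (t + 1) else [1.0, 0.0]


if __name__ == "__main__":
    print(greedy_search(3, toy_scores, max_symbols_per_frame=1))
    print(greedy_search(3, toy_scores, max_symbols_per_frame=2))
```

With the cap set to 1, every frame contributes at most one symbol, so the inner loop has a fixed length and is easier to run over a batch in lockstep.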

2. Pruned RNN-T for fast, memory-efficient ASR training

Author: Fangjun Kuang, Liyong Guo, Wei Kang, Long Lin, Mingshuang Luo, Zengwei Yao, and Daniel Povey
Published: 2022
Full Text: View/download PDF

3. LET-Decoder: A WFST-Based Lazy-Evaluation Token-Group Decoder With Exact Lattice Generation

Author: Mahsa Yarmohammadi, Daniel Povey, Hang Lv, Li Ke, Lei Xie, Yiming Wang, and Sanjeev Khudanpur
Subjects: Computer science, Applied Mathematics, Frame (networking), 020206 networking & telecommunications, 02 engineering and technology, Security token, Token passing, Signal Processing, 0202 electrical engineering, electronic engineering, information engineering, Overhead (computing), Electrical and Electronic Engineering, Lazy evaluation, Hidden Markov model, Algorithm, Word (computer architecture), Decoding methods
Abstract: We propose a novel lazy-evaluation token-group decoding algorithm with on-the-fly composition of weighted finite-state transducers (WFSTs) for large vocabulary continuous speech recognition. In the standard on-the-fly composition decoder, a base WFST and one or more incremental WFSTs are composed during decoding, and then a token-passing algorithm is employed to generate the lattice on the composed search space, resulting in substantial computation overhead. To improve speed, the proposed algorithm adopts 1) a token-group method, which groups tokens with the same state in the base WFST on each frame and limits the capacity of the group, and 2) a lazy-evaluation method, which does not expand a token group and its source token groups until it processes a word label during decoding. Experiments show that the proposed decoder runs up to 3 times faster than the standard on-the-fly composition decoder.
Published: 2021
Full Text: View/download PDF

4. Delay-penalized transducer for low-latency streaming ASR

Author: Wei Kang, Zengwei Yao, Fangjun Kuang, Liyong Guo, Xiaoyu Yang, Long Lin, Piotr Żelasko, and Daniel Povey
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Machine Learning (cs.LG)
Abstract: In streaming automatic speech recognition (ASR), it is desirable to reduce latency as much as possible while having minimum impact on recognition accuracy. Although a few existing methods are able to achieve this goal, they are difficult to implement due to their dependency on external alignments. In this paper, we propose a simple way to penalize symbol delay in the transducer model, so that we can balance the trade-off between symbol delay and accuracy for streaming models without external alignments. Specifically, our method adds a small constant times (T/2 - t), where T is the number of frames and t is the current frame, to all the non-blank log-probabilities (after normalization) that are fed into the two-dimensional transducer recursion. For both streaming Conformer models and unidirectional long short-term memory (LSTM) models, experimental results show that it can significantly reduce the symbol delay with an acceptable performance degradation. Our method achieves a similar delay-accuracy trade-off to the previously published FastEmit, but we believe our method is preferable because it has a better justification: it is equivalent to penalizing the average symbol delay. Our work is open-sourced and publicly available (https://github.com/k2-fsa/k2)., Comment: Submitted to 2023 IEEE International Conference on Acoustics, Speech and Signal Processing
Published: 2022
Full Text: View/download PDF
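
Illustrative note: the penalty described in the abstract, a small constant times (T/2 - t) added to the non-blank log-probabilities, can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' code: the function name `add_delay_penalty`, the nested-list layout `log_probs[t][u][v]`, the 0-based frame index, `blank_id = 0`, and the example constant are all choices made only for this example.

```python
from typing import List

LogProbs = List[List[List[float]]]  # indexed as [t][u][v]


def add_delay_penalty(
    log_probs: LogProbs, blank_id: int = 0, penalty: float = 0.01
) -> LogProbs:
    """Return a copy with penalty * (T/2 - t) added to non-blank entries.

    log_probs holds already-normalized log-probabilities; T is the number
    of frames and t the (0-based, by assumption) frame index.
    """
    num_frames = len(log_probs)
    out: LogProbs = []
    for t, frame in enumerate(log_probs):
        # Positive for early frames, negative for late frames, so that
        # emitting a symbol earlier is rewarded relative to emitting later.
        offset = penalty * (num_frames / 2 - t)
        out.append(
            [
                [lp if v == blank_id else lp + offset for v, lp in enumerate(row)]
                for row in frame
            ]
        )
    return out


if __name__ == "__main__":
    # T = 4 frames, one label position, vocabulary of {blank, one symbol}.
    toy = [[[-1.0, -1.0]] for _ in range(4)]
    for t, frame in enumerate(add_delay_penalty(toy, penalty=0.1)):
        print(t, frame[0])
```

The blank entries are left untouched, so only the relative preference for emitting a symbol at an earlier or later frame changes.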

5. A Parallelizable Lattice Rescoring Strategy with Neural Language Models

Author: Daniel Povey, Sanjeev Khudanpur, and Li Ke
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Signal processing, Computer Science - Computation and Language, Parallelizable manifold, Computer science, High Energy Physics::Lattice, Computer Science::Computation and Language (Computational Linguistics and Natural Language and Speech Processing), Lattice expansion, Computer Science - Sound, Audio and Speech Processing (eess.AS), Lattice (order), Path (graph theory), FOS: Electrical engineering, electronic engineering, information engineering, Signal processing algorithms, Language model, Computation and Language (cs.CL), Algorithm, Decoding methods, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper proposes a parallel computation strategy and a posterior-based lattice expansion algorithm for efficient lattice rescoring with neural language models (LMs) for automatic speech recognition. First, lattices from first-pass decoding are expanded by the proposed posterior-based lattice expansion algorithm. Second, each expanded lattice is converted into a minimal list of hypotheses that covers every arc. Each hypothesis is constrained to be the best path for at least one arc it includes. For each lattice, the neural LM scores of the minimal list are computed in parallel and are then integrated back to the lattice in the rescoring stage. Experiments on the Switchboard dataset show that the proposed rescoring strategy obtains comparable recognition performance and generates more compact lattices than a competitive baseline method. Furthermore, the parallel rescoring method offers more flexibility by simplifying the integration of PyTorch-trained neural LMs for lattice rescoring with Kaldi., To appear at ICASSP 2021. 5 pages, 1 figure
Published: 2021
Full Text: View/download PDF

6. speechocean762: An Open-Source Non-native English Speech Corpus For Pronunciation Assessment

Author: Zhiyong Yan, Daniel Povey, Yongqing Wang, Zhiwen Zhang, Qiong Song, Yujun Wang, Huang Yukai, Junbo Zhang, and Ke Li
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, business.industry, Computer science, Speech corpus, Pronunciation, computer.software_genre, ComputingMethodologies_ARTIFICIALINTELLIGENCE, Computer Science - Sound, Native english, Open source, Workflow, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Baseline system, Artificial intelligence, business, Computation and Language (cs.CL), computer, Natural language processing, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper introduces a new open-source speech corpus named "speechocean762" designed for pronunciation assessment use, consisting of 5000 English utterances from 250 non-native speakers, where half of the speakers are children. Five experts annotated each of the utterances at sentence-level, word-level and phoneme-level. A baseline system is released in open source to illustrate the phoneme-level pronunciation assessment workflow on this corpus. This corpus is allowed to be used freely for commercial and non-commercial purposes. It is available for free download from OpenSLR, and the corresponding baseline system is published in the Kaldi speech recognition toolkit., Accepted in INTERSPEECH 2021
Published: 2021

7. DOVER-Lap: A Method for Combining Overlap-Aware Diarization Outputs

Author: Andreas Stolcke, Sanjeev Khudanpur, Shinji Watanabe, Leibny Paola Garcia-Perera, Daniel Povey, Desh Raj, and Zili Huang
Subjects: FOS: Computer and information sciences, Beamforming, Sound (cs.SD), Majority rule, Voice activity detection, Computer science, Speech recognition, Region proposal, Approximation algorithm, 020206 networking & telecommunications, 02 engineering and technology, Computer Science - Sound, Speaker diarisation, 030507 speech-language pathology & audiology, 03 medical and health sciences, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, 0202 electrical engineering, electronic engineering, information engineering, 0305 other medical science, Natural language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Several advances have been made recently towards handling overlapping speech for speaker diarization. Since speech and natural language tasks often benefit from ensemble techniques, we propose an algorithm for combining outputs from such diarization systems through majority voting. Our method, DOVER-Lap, is inspired by the recently proposed DOVER algorithm, but is designed to handle overlapping segments in diarization outputs. We also modify the pair-wise incremental label mapping strategy used in DOVER, and propose an approximation algorithm based on weighted k-partite graph matching, which performs this mapping using a global cost tensor. We demonstrate the strength of our method by combining outputs from diverse systems (clustering-based, region proposal networks, and target-speaker voice activity detection) on AMI and LibriCSS datasets, where it consistently outperforms the single best system. Additionally, we show that DOVER-Lap can be used for late fusion in multichannel diarization, and compares favorably with early fusion methods like beamforming., Comment: Accepted to IEEE SLT 2021
Published: 2021
Full Text: View/download PDF

8. An Asynchronous WFST-Based Decoder For Automatic Speech Recognition

Author: Hang Lv, Zhehuai Chen, Lei Xie, Daniel Povey, Sanjeev Khudanpur, and Hainan Xu
Subjects: FOS: Computer and information sciences, Signal processing, Sound (cs.SD), Computer science, Speech recognition, Computation, Process (computing), Data_CODINGANDINFORMATIONTHEORY, Computer Science - Sound, Asynchronous communication, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Overhead (computing), Language model, Pruning (decision trees), Decoding methods, Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science::Information Theory
Abstract: We introduce an asynchronous dynamic decoder, which adopts an efficient A* algorithm to incorporate big language models in one-pass decoding for large vocabulary continuous speech recognition. Unlike standard one-pass decoding with an on-the-fly composition decoder, which might induce a significant computation overhead, the asynchronous dynamic decoder has a novel design with two fronts, one performing "exploration" and the other "backfill". The computation of the two fronts alternates in the decoding process, resulting in more effective pruning than standard one-pass decoding with an on-the-fly composition decoder. Experiments show that the proposed decoder works notably faster than standard one-pass decoding with an on-the-fly composition decoder, and the speedup becomes more pronounced as data complexity increases., Comment: 5 pages, 5 figures, ICASSP
Published: 2021
Full Text: View/download PDF

9. Wake Word Detection with Streaming Transformers

Author: Daniel Povey, Hang Lv, Lei Xie, Sanjeev Khudanpur, and Yiming Wang
Subjects: FOS: Computer and information sciences, Sequence, Sound (cs.SD), Computer Science - Computation and Language, Artificial neural network, Computer science, Computer Science - Sound, Power (physics), Constant false alarm rate, Convolution, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Time complexity, Algorithm, Computation and Language (cs.CL), Word (computer architecture), Electrical Engineering and Systems Science - Audio and Speech Processing, Transformer (machine learning model)
Abstract: Modern wake word detection systems usually rely on neural networks for acoustic modeling. Transformers have recently shown superior performance over LSTM and convolutional networks in various sequence modeling tasks with their better temporal modeling power. However, it is not clear whether this advantage still holds for short-range temporal modeling like wake word detection. Besides, the vanilla Transformer is not directly applicable to the task due to its non-streaming nature and its quadratic time and space complexity. In this paper we explore the performance of several variants of chunk-wise streaming Transformers tailored for wake word detection in a recently proposed LF-MMI system, including looking ahead to the next chunk, gradient stopping, different positional embedding methods and adding same-layer dependency between chunks. Our experiments on the Mobvoi wake word dataset demonstrate that our proposed Transformer model outperforms the baseline convolution network by 25% on average in false rejection rate at the same false alarm rate with a comparable model size, while still maintaining linear complexity w.r.t. the sequence length., Comment: Accepted at IEEE ICASSP 2021. 5 pages, 3 figures
Published: 2021
Full Text: View/download PDF

10. GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio

Author: Yujun Wang, Wei Zou, Guoguo Chen, Shuaijiang Zhao, Guan-Bo Wang, Mingjie Jin, Yongqing Wang, Wei-Qiang Zhang, Jiayu Du, Shuzhou Chai, Daniel Povey, Zhiyong Yan, Jan Trmal, Shinji Watanabe, Xuchen Yao, Sanjeev Khudanpur, Zhao You, Dan Su, Junbo Zhang, Chao Weng, and Xiangang Li
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, Computer science, Speech recognition, Word error rate, Filter (signal processing), Variety (linguistics), Pipeline (software), Computer Science - Sound, Multi domain, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Segmentation, Transcription (software), Computation and Language (cs.CL), Sentence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper introduces GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 40,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles, and a variety of topics, such as arts, science, sports, etc. A new forced alignment and segmentation pipeline is proposed to create sentence segments suitable for speech recognition training, and to filter out segments with low-quality transcription. For system training, GigaSpeech provides five subsets of different sizes, 10h, 250h, 1000h, 2500h, and 10000h. For our 10,000-hour XL training subset, we cap the word error rate at 4% during the filtering/validation stage, and for all our other smaller training subsets, we cap it at 0%. The DEV and TEST evaluation sets, on the other hand, are re-processed by professional human transcribers to ensure high transcription quality. Baseline systems are provided for popular speech recognition toolkits, namely Athena, ESPnet, Kaldi and Pika.
Published: 2021
Full Text: View/download PDF

11. Lattice-Free Maximum Mutual Information Training of Multilingual Speech Recognition Systems

Author: Daniel Povey, Hervé Bourlard, Banriskhem K. Khonglah, Sibo Tong, Petr Motlicek, and Srikanth Madikeri
Subjects: Lattice (module), Computer science, Speech recognition, Mutual information
Published: 2020
Full Text: View/download PDF

12. An Alternative to MFCCs for ASR

Author: Sanjeev Khudanpur, Daniel Povey, Hynek Hermansky, Hossein Hadian, and Pegah Ghahramani
Subjects: Computer science, Speech recognition
Published: 2020
Full Text: View/download PDF

13. Neural Language Modeling With Implicit Cache Pointers

Author: Daniel Povey, Ke Li, and Sanjeev Khudanpur
Subjects: Recurrent neural network, Dependency (UML), Perplexity, Audio and Speech Processing (eess.AS), Computer science, Pointer (computer programming), Speech recognition, FOS: Electrical engineering, electronic engineering, information engineering, Treebank, Cache, Language model, Layer (object-oriented design), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: A cache-inspired approach is proposed for neural language models (LMs) to improve long-range dependency and better predict rare words from long contexts. This approach is a simpler alternative to the attention-based pointer mechanism that enables neural LMs to reproduce words from recent history. Without using attention or a mixture structure, the method only involves appending extra tokens that represent words in history to the output layer of a neural LM and modifying training supervisions accordingly. A memory-augmentation unit is introduced to learn words that are particularly likely to repeat. We experiment with both recurrent neural network- and Transformer-based LMs. Perplexity evaluation on Penn Treebank and WikiText-2 shows the proposed model outperforms both LSTM and LSTM with an attention-based pointer mechanism and is more effective on rare words. N-best rescoring experiments on Switchboard indicate that it benefits both very rare and frequent words. However, it is challenging for the proposed model as well as two other models with an attention-based pointer mechanism to obtain good overall WER reductions., To appear at Interspeech 2020
Published: 2020

14. Efficient MDI Adaptation for n-gram Language Models

Author: Ashish Arora, Ke Li, Ruizhe Huang, Sanjeev Khudanpur, and Daniel Povey
Subjects: FOS: Computer and information sciences, Vocabulary, Perplexity, Computer Science - Computation and Language, Computational complexity theory, Computer science, Principle of maximum entropy, media_common.quotation_subject, Word error rate, n-gram, Scalability, Language model, Algorithm, Computation and Language (cs.CL), media_common
Abstract: This paper presents an efficient algorithm for n-gram language model adaptation under the minimum discrimination information (MDI) principle, where an out-of-domain language model is adapted to satisfy the constraints of marginal probabilities of the in-domain data. The challenge for MDI language model adaptation is its computational complexity. By taking advantage of the backoff structure of the n-gram model and the idea of the hierarchical training method, originally proposed for maximum entropy (ME) language models, we show that MDI adaptation can be computed in time linear in the size of the inputs in each iteration. The complexity remains the same as for ME models, although MDI is more general than ME. This makes MDI adaptation practical for large corpora and vocabularies. Experimental results confirm the scalability of our algorithm on very large datasets, while MDI adaptation gets slightly worse perplexity but better word error rate results compared to simple linear interpolation., To appear in INTERSPEECH 2020. Appendix A of this full version will be filled soon
Published: 2020

15. PyChain: A Fully Parallelized PyTorch Implementation of LF-MMI for End-to-End ASR

Author: Yiming Wang, Sanjeev Khudanpur, Yiwen Shao, and Daniel Povey
Subjects: FOS: Computer and information sciences, Flexibility (engineering), Sound (cs.SD), Computer Science - Computation and Language, Artificial neural network, Computer science, Speech recognition, Mutual information, Computer Science - Sound, Espresso, End-to-end principle, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We present PyChain, a fully parallelized PyTorch implementation of end-to-end lattice-free maximum mutual information (LF-MMI) training for the so-called chain models in the Kaldi automatic speech recognition (ASR) toolkit. Unlike other PyTorch and Kaldi based ASR toolkits, PyChain is designed to be as flexible and light-weight as possible so that it can be easily plugged into new ASR projects, or other existing PyTorch-based ASR tools, as exemplified respectively by a new project PyChain-example, and Espresso, an existing end-to-end ASR toolkit. PyChain's efficiency and flexibility are demonstrated through such novel features as full GPU training on numerator/denominator graphs, and support for unequal length sequences. Experiments on the WSJ dataset show that with simple neural networks and commonly used machine learning techniques, PyChain can achieve competitive results that are comparable to Kaldi and better than other end-to-end ASR systems., Submitted to Interspeech 2020
Published: 2020

16. CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings

Author: Naoyuki Kanda, David Snyder, Ashish Arora, Bar Ben Yair, Christoph Boeddeker, Neville Ryant, Jan Trmal, Xuankai Chang, Emmanuel Vincent, Daniel Povey, Aswin Shanmugam Subramanian, Shinji Watanabe, Michael I. Mandel, Zhaoheng Ni, Shota Horiguchi, Takuya Yoshioka, Sanjeev Khudanpur, Vimal Manohar, Yusuke Fujita, Desh Raj, and Jon Barker
Affiliations: Center for Language and Speech Processing [Baltimore] (CLSP), Johns Hopkins University (JHU); Brooklyn College [CUNY, New York], City University of New York [New York] (CUNY); Department of Computer Science [Sheffield], University of Sheffield [Sheffield]; Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria), Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Centre National de la Recherche Scientifique (CNRS), Université de Lorraine (UL); Universität Paderborn (UPB); Hitachi, Ltd; Microsoft Research [Redmond], Microsoft Corporation [Redmond, Wash.]; Linguistic Data Consortium (LDC)
Subjects: CHiME challenge, computational paralinguistics, Conversational speech, Computer science, Speech recognition, speech recognition, 020206 networking & telecommunications, 02 engineering and technology, Speaker diarisation, Speech enhancement, Open source, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, Synchronization (computer science), 0202 electrical engineering, electronic engineering, information engineering, Natural (music), speech enhancement, speaker diarization, 020201 artificial intelligence & image processing, speech separation, Set (psychology)
Abstract: Following the success of the 1st, 2nd, 3rd, 4th and 5th CHiME challenges, we organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6). The new challenge revisits the previous CHiME-5 challenge and further considers the problem of distant multi-microphone conversational speech diarization and recognition in everyday home environments. Speech material is the same as the previous CHiME-5 recordings except for accurate array synchronization. The material was elicited using a dinner party scenario with efforts taken to capture data that is representative of natural conversational speech. This paper provides a baseline description of the CHiME-6 challenge for both segmented multispeaker speech recognition (Track 1) and unsegmented multispeaker speech recognition (Track 2). Of note, Track 2 is the first challenge activity in the community to tackle an unsegmented multispeaker speech recognition scenario with a complete set of reproducible open source baselines providing speech enhancement, speaker diarization, and speech recognition modules.
Published: 2020
Full Text: View/download PDF

17. An Empirical Study of Transformer-Based Neural Language Model Adaptation

Author: Daniel Povey, Hongzhao Huang, Li Ke, Fuchun Peng, Zhe Liu, Tianxing He, and Sanjeev Khudanpur
Subjects: Text corpus, Empirical research, Computer science, Speech recognition, 0502 economics and business, 05 social sciences, Language model, 050207 economics, 010501 environmental sciences, 01 natural sciences, 0105 earth and related environmental sciences, Transformer (machine learning model), Weighting
Abstract: We explore two adaptation approaches for deep Transformer-based neural language models (LMs) for automatic speech recognition. The first approach is a pretrain-finetune framework, where we first pretrain a Transformer LM on a large-scale text corpus from scratch and then adapt it to relatively small target domains via finetuning. The second approach is a mixer of dynamically weighted models that are separately trained on source and target domains, aiming to improve simple linear interpolation with dynamic weighting. We compare the two approaches with three baselines (without adaptation, merging data, and simple interpolation) on Switchboard (SWBD) and Wall Street Journal (WSJ). Experiments show that the mixer model generally performs better than baselines and finetuning. Compared with no adaptation, finetuning and the mixer approach obtain relative WER reductions of up to 11.5% and 14.1% on SWBD, respectively. The mixer model also outperforms linear interpolation and merging data. On WSJ, the mixer approach achieves a new state-of-the-art WER result.
Published: 2020
Full Text: View/download PDF
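
Illustrative note: the "simple linear interpolation" baseline named in the abstract combines the source-domain and target-domain LM probabilities with one fixed weight. The sketch below shows only that baseline, not the paper's dynamically weighted mixer; the function name `interpolate`, the dictionary representation of a next-word distribution, and the example numbers are assumptions made for illustration.

```python
from typing import Dict


def interpolate(
    p_source: Dict[str, float], p_target: Dict[str, float], weight: float
) -> Dict[str, float]:
    """Fixed-weight linear interpolation of two next-word distributions.

    Returns weight * p_source(w) + (1 - weight) * p_target(w) for every
    word w seen in either distribution (missing words count as 0.0).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    vocab = set(p_source) | set(p_target)
    return {
        w: weight * p_source.get(w, 0.0) + (1.0 - weight) * p_target.get(w, 0.0)
        for w in vocab
    }


if __name__ == "__main__":
    source = {"a": 0.5, "b": 0.5}  # toy source-domain distribution
    target = {"a": 0.9, "b": 0.1}  # toy target-domain distribution
    print(interpolate(source, target, weight=0.25))
```

Because both inputs sum to one and the weights sum to one, the result is again a probability distribution; a dynamically weighted mixer would replace the single fixed `weight` with a context-dependent one.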

18. OOV Recovery with Efficient 2nd Pass Decoding and Open-vocabulary Word-level RNNLM Rescoring for Hybrid ASR

Author: Sanjeev Khudanpur, Xiaohui Zhang, and Daniel Povey
Subjects: Vocabulary, Computer science, Pipeline (computing), Speech recognition, media_common.quotation_subject, 02 engineering and technology, 030507 speech-language pathology & audiology, 03 medical and health sciences, Recurrent neural network, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Language model, 0305 other medical science, Decoding methods, Word (computer architecture), media_common
Abstract: In this paper, we investigate out-of-vocabulary (OOV) word recovery in hybrid automatic speech recognition (ASR) systems, with emphasis on dynamic vocabulary expansion for both Weighted Finite State Transducer (WFST)-based decoding and word-level RNNLM rescoring. We first describe our OOV candidate generation method based on a hybrid lexical model (HLM) with phoneme-sequence constraints. Next, we introduce a framework for efficient second pass OOV recovery with a dynamically expanded vocabulary, showing that, by calibrating OOV candidates' language model (LM) scores, it significantly improves OOV recovery and overall decoding performance compared to HLM-based first pass decoding. Finally, we propose an open-vocabulary word-level recurrent neural network language model (RNNLM) re-scoring framework, making it possible to re-score ASR hypotheses containing recovered OOVs, using a single word-level RNNLM ignorant of OOVs when it was trained. By evaluating OOV recovery and overall decoding performance on Spanish/English ASR tasks, we show the proposed OOV recovery pipeline has the potential of an efficient open-vocab word-based ASR decoding framework, with minimal extra computation versus a standard WFST-based decoding and RNNLM rescoring pipeline.
Published: 2020
Full Text: View/download PDF

19. Speaker Diarization with Region Proposal Network

Author: Yiwen Shao, Zili Huang, Yusuke Fujita, Shinji Watanabe, Daniel Povey, Sanjeev Khudanpur, and Paola Garcia
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Artificial neural network, Computer science, Speech recognition, Pipeline (computing), Region proposal, 010501 environmental sciences, 01 natural sciences, Computer Science - Sound, Speaker diarisation, 030507 speech-language pathology & audiology, 03 medical and health sciences, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, 0305 other medical science, Baseline (configuration management), 0105 earth and related environmental sciences, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Speaker diarization is an important pre-processing step for many speech applications, and it aims to solve the "who spoke when" problem. Although the standard diarization systems can achieve satisfactory results in various scenarios, they are composed of several independently-optimized modules and cannot deal with overlapped speech. In this paper, we propose a novel speaker diarization method: Region Proposal Network based Speaker Diarization (RPNSD). In this method, a neural network generates overlapped speech segment proposals and computes their speaker embeddings at the same time. Compared with standard diarization systems, RPNSD has a shorter pipeline and can handle overlapped speech. Experimental results on three diarization datasets reveal that RPNSD achieves remarkable improvements over the state-of-the-art x-vector baseline., Accepted to ICASSP 2020
Published: 2020

20. Wake Word Detection with Alignment-Free Lattice-Free MMI

Author: Hang Lv, Lei Xie, Sanjeev Khudanpur, Yiming Wang, and Daniel Povey
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, Computer science, Speech recognition, Wake, Computer Science - Sound, Data set, Reduction (complexity), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, False alarm, Free lattice, Hidden Markov model, Computation and Language (cs.CL), Word (computer architecture), Electrical Engineering and Systems Science - Audio and Speech Processing, Spoken language
Abstract: Always-on spoken language interfaces, e.g. personal digital assistants, rely on a wake word to start processing spoken input. We present novel methods to train a hybrid DNN/HMM wake word detection system from partially labeled training data, and to use it in on-line applications: (i) we remove the prerequisite of frame-level alignments in the LF-MMI training algorithm, permitting the use of un-transcribed training examples that are annotated only for the presence/absence of the wake word; (ii) we show that the classical keyword/filler model must be supplemented with an explicit non-speech (silence) model for good performance; (iii) we present an FST-based decoder to perform online detection. We evaluate our methods on two real data sets, showing 50% to 90% reduction in false rejection rates at pre-specified false alarm rates over the best previously published figures, and re-validate them on a third (large) data set. Comment: Accepted at Interspeech 2020. 5 pages, 3 figures
Published: 2020
Full Text: View/download PDF

21. Multistream CNN for Robust Acoustic Modeling

Author: Jing Pan, Daniel Povey, Kyu Jeong Han, Tao Ma, and Venkata Krishna Naveen Tadala
Subjects: FOS: Computer and information sciences, Signal processing, Sound (cs.SD), Computer Science - Computation and Language, Artificial neural network, Computer science, Time delay neural network, Speech recognition, Convolutional neural network, Computer Science - Sound, Data modeling, Set (abstract data type), Robustness (computer science), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Electrical Engineering and Systems Science - Audio and Speech Processing, Communication channel
Abstract: This paper proposes multistream CNN, a novel neural network architecture for robust acoustic modeling in speech recognition tasks. The proposed architecture processes input speech with diverse temporal resolutions by applying different dilation rates to convolutional neural networks across multiple streams to achieve the robustness. The dilation rates are selected from the multiples of a sub-sampling rate of 3 frames. Each stream stacks TDNN-F layers (a variant of 1D CNN), and output embedding vectors from the streams are concatenated then projected to the final layer. We validate the effectiveness of the proposed multistream CNN architecture by showing consistent improvements against Kaldi's best TDNN-F model across various data sets. Multistream CNN improves the WER of the test-other set in the LibriSpeech corpus by 12% (relative). On custom data from ASAPP's production ASR system for a contact center, it records a relative WER improvement of 11% for customer channel audio to prove its robustness to data in the wild. In terms of real-time factor, multistream CNN outperforms the baseline TDNN-F by 15%, which also suggests its practicality on production systems. When combined with self-attentive SRU LM rescoring, multistream CNN contributes for ASAPP to achieve the best WER of 1.75% on test-clean in LibriSpeech. Comment: Accepted to ICASSP 2021
Published: 2020
Full Text: View/download PDF

22. Flat-Start Single-Stage Discriminatively Trained HMM-Based Models for ASR

Author: Sanjeev Khudanpur, Hossein Sameti, Daniel Povey, and Hossein Hadian
Subjects: Context model, Acoustics and Ultrasonics, Artificial neural network, Computer science, Pipeline (computing), Speech recognition, 020206 networking & telecommunications, 02 engineering and technology, Mutual information, Reduction (complexity), 030507 speech-language pathology & audiology, 03 medical and health sciences, Computational Mathematics, 0202 electrical engineering, electronic engineering, information engineering, Computer Science (miscellaneous), Electrical and Electronic Engineering, 0305 other medical science, Hidden Markov model, Decoding methods, Word (computer architecture)
Abstract: In recent years, end-to-end approaches to automatic speech recognition have received considerable attention as they are much faster in terms of preparing resources. However, conventional multistage approaches, which rely on a pipeline of training hidden Markov model (HMM)-GMM models and tree-building steps, still give the state-of-the-art results on most databases. In this study, we investigate flat-start one-stage training of neural networks using lattice-free maximum mutual information (LF-MMI) objective function with HMM for large vocabulary continuous speech recognition. We thoroughly look into different issues that arise in such a setup and propose a standalone system, which achieves word error rates (WER) comparable with that of the state-of-the-art multi-stage systems while being much faster to prepare. We propose to use full biphones to enable flat-start context-dependent (CD) modeling and show through experiments that our CD modeling approach can be almost as effective as regular tree-based CD modeling. We show that our flat-start LF-MMI setup together with this tree-free CD modeling technique achieves 10 to 25% relative WER reduction compared to other end-to-end methods on well-known databases. The improvements are larger for smaller databases.
Published: 2018
Full Text: View/download PDF

23. Low Latency Acoustic Modeling Using Temporal Convolution and LSTMs

Author: Vijayaditya Peddinti, Daniel Povey, Sanjeev Khudanpur, and Yiming Wang
Subjects: Artificial neural network, Microphone, Time delay neural network, Computer science, Applied Mathematics, Speech recognition, Word error rate, 020206 networking & telecommunications, Context (language use), 02 engineering and technology, Frame rate, Convolution, Signal Processing, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Electrical and Electronic Engineering, Latency (engineering)
Abstract: Bidirectional long short-term memory (BLSTM) acoustic models provide a significant word error rate reduction compared to their unidirectional counterpart, as they model both the past and future temporal contexts. However, it is nontrivial to deploy bidirectional acoustic models for online speech recognition due to an increase in latency. In this letter, we propose the use of temporal convolution, in the form of time-delay neural network (TDNN) layers, along with unidirectional LSTM layers to limit the latency to 200 ms. This architecture has been shown to outperform the state-of-the-art low frame rate (LFR) BLSTM models. We further improve these LFR BLSTM acoustic models by operating them at higher frame rates at lower layers and show that the proposed model performs similarly to these mixed frame rate BLSTMs. We present results on the Switchboard 300 h LVCSR task and the AMI LVCSR task, in the three microphone conditions.
Published: 2018
Full Text: View/download PDF

24. Probing the Information Encoded in X-Vectors

Author: Sanjeev Khudanpur, Desh Raj, David Snyder, and Daniel Povey
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, Speaker verification, Artificial neural network, Computer science, Speech recognition, 020206 networking & telecommunications, 02 engineering and technology, I vector, Speaker recognition, Computer Science - Sound, Extractor, Transcription (linguistics), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Computation and Language (cs.CL), Utterance, Sentence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Deep neural network based speaker embeddings, such as x-vectors, have been shown to perform well in text-independent speaker recognition/verification tasks. In this paper, we use simple classifiers to investigate the contents encoded by x-vector embeddings. We probe these embeddings for information related to the speaker, channel, transcription (sentence, words, phones), and meta information about the utterance (duration and augmentation type), and compare these with the information encoded by i-vectors across a varying number of dimensions. We also study the effect of data augmentation during extractor training on the information captured by x-vectors. Experiments on the RedDots data set show that x-vectors capture spoken content and channel-related information, while performing well on speaker verification tasks. Comment: Accepted at IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) 2019
Published: 2019
Full Text: View/download PDF

25. Incremental Lattice Determinization for WFST Decoders

Author: Daniel Povey, Lei Xie, Mahsa Yarmohammadi, Hainan Xu, Zhehuai Chen, Sanjeev Khudanpur, and Hang Lv
Subjects: Computer Science::Sound, Computer science, High Energy Physics::Lattice, Lattice (order), Computer Science::Computation and Language (Computational Linguistics and Natural Language and Speech Processing), Algorithm, Computer Science::Databases, Utterance, Semiring
Abstract: We introduce a lattice determinization algorithm that can operate incrementally. That is, a word-level lattice can be generated for a partial utterance and then, once we have processed more audio, we can obtain a word-level lattice for the extended utterance without redoing all the work of lattice determinization. This is relevant for ASR decoders such as those used in Kaldi, which first generate a state-level lattice and then convert it to a word-level lattice using a determinization algorithm in a special semiring. Our incremental determinization algorithm is useful when word-level lattices are needed prior to the end of the utterance, and also reduces the latency due to determinization at the end of the utterance.
Published: 2019
Full Text: View/download PDF

26. The JHU ASR System for VOiCES from a Distance Challenge 2019

Author: Hainan Xu, Sanjeev Khudanpur, Daniel Povey, Yiming Wang, David Snyder, Vimal Manohar, and Phani Sankar Nidadavolu
Subjects: Computer science, Speech recognition
Published: 2019
Full Text: View/download PDF

27. Multi-PLDA Diarization on Children’s Speech

Author: Daniel Povey, Sanjeev Khudanpur, Jiamin Xie, and Leibny Paola Garcia-Perera
Subjects: Speaker diarisation, Computer science, Speech recognition
Published: 2019
Full Text: View/download PDF

28. State-of-the-Art Speaker Recognition for Telephone and Video Speech: The JHU-MIT Submission for NIST SRE18

Author: David Snyder, Leibny Paola Garcia-Perera, Pedro A. Torres-Carrasquillo, Jonas Borgstrom, Sanjeev Khudanpur, Gregory Sell, Fred Richardson, Daniel Garcia-Romero, Nanxin Chen, Daniel Povey, Francois Grondin, Alan V. McCree, Najim Dehak, Réda Dehak, Jesús Villalba, and Suwon Shon
Subjects: Computer science, Speech recognition, NIST, State (computer science), Speaker recognition
Published: 2019
Full Text: View/download PDF

29. Advances in Automatic Speech Recognition for Child Speech Using Factored Time Delay Neural Network

Author: Fei Wu, Leibny Paola Garcia-Perera, Sanjeev Khudanpur, and Daniel Povey
Subjects: Time delay neural network, Computer science, Speech recognition
Published: 2019
Full Text: View/download PDF

30. Improving Emotion Identification Using Phone Posteriors in Raw Speech Waveform Based DNN

Author: Najim Dehak, Nagendra Kumar Goel, Mousmita Sarma, Pegah Ghahremani, Daniel Povey, and Kandarpa Kumar Sarma
Subjects: Phone, Computer science, Speech recognition, Emotion identification, Waveform
Published: 2019
Full Text: View/download PDF

31. Speaker Recognition Benchmark Using the CHiME-5 Corpus

Author: Sanjeev Khudanpur, Daniel Povey, Daniel Garcia-Romero, Shinji Watanabe, Alan V. McCree, Gregory Sell, and David Snyder
Subjects: Computer science, Robustness (computer science), Speech recognition, Benchmark (computing), Speaker recognition
Published: 2019
Full Text: View/download PDF

32. The JHU Speaker Recognition System for the VOiCES 2019 Challenge

Author: Jesús Villalba, David Snyder, Gregory Sell, Nanxin Chen, Daniel Povey, Sanjeev Khudanpur, and Najim Dehak
Subjects: Computer science, Speech recognition, Speaker recognition system
Published: 2019
Full Text: View/download PDF

33. x-Vector DNN Refinement with Full-Length Recordings for Speaker Recognition

Author: David Snyder, Gregory Sell, Daniel Povey, Daniel Garcia-Romero, Sanjeev Khudanpur, and Alan V. McCree
Subjects: Computer science, Speech recognition, Speaker recognition
Published: 2019
Full Text: View/download PDF

34. Optical Character Recognition with Chinese and Korean Character Decomposition

Author: David Etter, Sanjeev Khudanpur, Chun Chieh Chang, Daniel Povey, Leibny Paola García Perera, and Ashish Arora
Subjects: 050101 languages & linguistics, Artificial neural network, Computer science, business.industry, 05 social sciences, Word error rate, 02 engineering and technology, Optical character recognition, computer.software_genre, Character (mathematics), Transcription (linguistics), Handwriting recognition, 0202 electrical engineering, electronic engineering, information engineering, Decomposition (computer science), 020201 artificial intelligence & image processing, 0501 psychology and cognitive sciences, Artificial intelligence, business, computer, Natural language processing
Abstract: We present our work on Optical Character Recognition on Chinese and Korean Characters for line level transcriptions. One challenge for recognizing Chinese and Korean is that there are thousands of characters for a system to recognize. In addition, many uncommon characters only appear a couple of times in training. We use character decomposition methods to break characters into smaller constituent graphemes. CangJie is used for Chinese character decomposition and Korean Jamo is used for Korean character decomposition. Character decomposition reduces the size of the Neural Network models and allows training examples to be shared across uncommon characters with the same graphemes. We report that a CNNTDNN neural network model using character decomposition has significantly fewer parameters than the baseline while also improving character error rate.
Published: 2019
Full Text: View/download PDF

35. Using ASR Methods for OCR

Author: Desh Raj, Chun Chieh Chang, Jan Trmal, Hossein Hadian, Yiwen Shao, Sanjeev Khudanpur, Babak Rekabdar, Paola Garcia, Bagher BabaAli, David Etter, Shinji Watanabe, Vimal Manohar, Daniel Povey, and Ashish Arora
Subjects: Vocabulary, Training set, Artificial neural network, Machine translation, Arabic, Time delay neural network, Computer science, media_common.quotation_subject, Speech recognition, Lexicon, computer.software_genre, Convolutional neural network, language.human_language, ComputingMethodologies_PATTERNRECOGNITION, language, Language model, Hidden Markov model, computer, media_common
Abstract: Hybrid deep neural network hidden Markov models (DNN-HMM) have achieved impressive results on large vocabulary continuous speech recognition (LVCSR) tasks. However, recent approaches using DNN-HMM models have not been explored much for text recognition. Inspired by the current work in automatic speech recognition (ASR) and machine translation, we present an open vocabulary sub-word text recognition system. The sub-word lexicon and sub-word language model (LM) help in overcoming the challenge of recognizing out of vocabulary (OOV) words, and a time delay neural network (TDNN) and convolutional neural network (CNN) based DNN-HMM optical model (OM) efficiently models the sequence dependency in the line image. We present results on 12 datasets with training data varying from 6k lines to 600k lines. The system is built for 8 languages, i.e., English, French, Arabic, Chinese, Farsi, Tamil, Russian, and Korean. We report competitive results on several commonly used handwritten and printed text datasets.
Published: 2019
Full Text: View/download PDF

36. Speaker Recognition for Multi-speaker Conversations Using X-vectors

Author: Gregory Sell, Daniel Povey, Sanjeev Khudanpur, Alan V. McCree, Daniel Garcia-Romero, and David Snyder
Subjects: Computer science, Speech recognition, Word error rate, 020206 networking & telecommunications, 02 engineering and technology, Speaker recognition, Measure (mathematics), Domain (software engineering), Speaker diarisation, 030507 speech-language pathology & audiology, 03 medical and health sciences, 0202 electrical engineering, electronic engineering, information engineering, Embedding, 0305 other medical science, Cluster analysis
Abstract: Recently, deep neural networks that map utterances to fixed-dimensional embeddings have emerged as the state-of-the-art in speaker recognition. Our prior work introduced x-vectors, an embedding that is very effective for both speaker recognition and diarization. This paper combines our previous work and applies it to the problem of speaker recognition on multi-speaker conversations. We measure performance on Speakers in the Wild and report what we believe are the best published error rates on this dataset. Moreover, we find that diarization substantially reduces error rate when there are multiple speakers, while maintaining excellent performance on single-speaker recordings. Finally, we introduce an easily implemented method to remove the domain-sensitive threshold typically used in the clustering stage of a diarization system. The proposed method is more robust to domain shifts, and achieves similar results to those obtained using a well-tuned threshold.
Published: 2019
Full Text: View/download PDF

37. GPU-Accelerated Viterbi Exact Lattice Decoder for Batched Online and Offline Speech Recognition

Author: Justin Luitjens, Tim Kaldewey, Ryan Leary, Daniel Povey, and Hugo Braun
Subjects: FOS: Computer and information sciences, Speedup, Computer Science - Computation and Language, Computer science, 020206 networking & telecommunications, 02 engineering and technology, Parallel computing, Viterbi algorithm, 030507 speech-language pathology & audiology, 03 medical and health sciences, symbols.namesake, Audio and Speech Processing (eess.AS), 0202 electrical engineering, electronic engineering, information engineering, Memory footprint, symbols, FOS: Electrical engineering, electronic engineering, information engineering, 0305 other medical science, Computation and Language (cs.CL), Decoding methods, Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science::Information Theory
Abstract: We present an optimized weighted finite-state transducer (WFST) decoder capable of online streaming and offline batch processing of audio using Graphics Processing Units (GPUs). The decoder is efficient in memory utilization, input/output (I/O) bandwidth, and uses a novel Viterbi implementation designed to maximize parallelism. The reduced memory footprint allows the decoder to process significantly larger graphs than previously possible, while optimizing I/O increases the number of simultaneous streams supported. GPU preprocessing of lattice segments enables intermediate lattice results to be returned to the requestor during streaming inference. Collectively, the proposed algorithm yields up to a 240x speedup over single core CPU decoding, and up to 40x faster decoding than the current state-of-the-art GPU decoder, while returning equivalent results. This decoder design enables deployment of production-grade ASR models on a large spectrum of systems, ranging from large data center servers to low-power edge devices. Comment: Accepted to ICASSP 2020
Published: 2019
Full Text: View/download PDF

38. Improving LF-MMI Using Unconstrained Supervisions for ASR

Author: Jan Trmal, Hossein Sameti, Hossein Hadian, Daniel Povey, and Sanjeev Khudanpur
Subjects: Artificial neural network, Computer science, Time delay neural network, Speech recognition, 02 engineering and technology, Mutual information, Graph, 030507 speech-language pathology & audiology, 03 medical and health sciences, Discriminative model, Error analysis, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, 0305 other medical science, Hidden Markov model
Abstract: We present our work on improving the numerator graph for discriminative training using the lattice-free maximum mutual information (MMI) criterion. Specifically, we propose a scheme for creating unconstrained numerator graphs by removing time constraints from the baseline numerator graphs. This leads to much smaller graphs and therefore faster preparation of training supervisions. By testing the proposed unconstrained supervisions using factorized time-delay neural network (TDNN) models, we observe 0.5% to 2.6% relative improvement over the state-of-the-art word error rates on various large-vocabulary speech recognition databases.
Published: 2018
Full Text: View/download PDF

39. A Teacher-Student Learning Approach for Unsupervised Domain Adaptation of Sequence-Trained ASR Models

Author: Vimal Manohar, Sanjeev Khudanpur, Pegah Ghahremani, and Daniel Povey
Subjects: Sequence, Kullback–Leibler divergence, Artificial neural network, Computer science, business.industry, Bandwidth (signal processing), 02 engineering and technology, Machine learning, computer.software_genre, Data modeling, 030507 speech-language pathology & audiology, 03 medical and health sciences, 020204 information systems, ComputingMilieux_COMPUTERSANDEDUCATION, 0202 electrical engineering, electronic engineering, information engineering, Artificial intelligence, 0305 other medical science, Transfer of learning, Hidden Markov model, business, computer, Communication channel
Abstract: Teacher-student (T-S) learning is a transfer learning approach, where a teacher network is used to “teach” a student network to make the same predictions as the teacher. Originally formulated for model compression, this approach has also been used for domain adaptation, and is particularly effective when parallel data is available in source and target domains. The standard approach uses a frame-level objective of minimizing the KL divergence between the frame-level posteriors of the teacher and student networks. However, for sequence-trained models for speech recognition, it is more appropriate to train the student to mimic the sequence-level posterior of the teacher network. In this work, we compare this sequence-level KL divergence objective with another semi-supervised sequence-training method, namely the lattice-free MMI, for unsupervised domain adaptation. We investigate the approaches in multiple scenarios including adapting from clean to noisy speech, bandwidth mismatch and channel mismatch.
Published: 2018
Full Text: View/download PDF

40. JHU Diarization System Description

Author: Najim Dehak, L. Paola García-Perera, Zili Huang, Jesús Villalba, and Daniel Povey
Subjects: Speaker diarisation, Computer science, Speech recognition
Published: 2018
Full Text: View/download PDF

41. Acoustic Modeling from Frequency Domain Representations of Speech

Author: Sanjeev Khudanpur, Daniel Povey, Pegah Ghahremani, Hang Lv, and Hossein Hadian
Subjects: 0209 industrial biotechnology, 020901 industrial engineering & automation, Computer science, Frequency domain, Speech recognition, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, 02 engineering and technology
Published: 2018
Full Text: View/download PDF

42. End-to-end Speech Recognition Using Lattice-free MMI

Author: Daniel Povey, Sanjeev Khudanpur, Hossein Hadian, and Hossein Sameti
Subjects: 030507 speech-language pathology & audiology, 03 medical and health sciences, Lattice (module), End-to-end principle, Computer science, Quantum mechanics, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, 02 engineering and technology, 0305 other medical science
Published: 2018
Full Text: View/download PDF

43. Output-Gate Projected Gated Recurrent Unit for Speech Recognition

Author: Ji Xu, Sanjeev Khudanpur, Yonghong Yan, Gaofeng Cheng, Lu Huang, and Daniel Povey
Subjects: 030507 speech-language pathology & audiology, 03 medical and health sciences, Computer science, Speech recognition, 0202 electrical engineering, electronic engineering, information engineering, 020206 networking & telecommunications, 02 engineering and technology, 0305 other medical science, Unit (housing)
Published: 2018
Full Text: View/download PDF

44. Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks

Author: Hainan Xu, Daniel Povey, Mahsa Yarmohammadi, Yiming Wang, Ke Li, Sanjeev Khudanpur, and Gaofeng Cheng
Subjects: 030507 speech-language pathology & audiology, 03 medical and health sciences, Low rank matrix factorization, Computer science, 0202 electrical engineering, electronic engineering, information engineering, Deep neural networks, 020201 artificial intelligence & image processing, 02 engineering and technology, 0305 other medical science, Algorithm
Published: 2018
Full Text: View/download PDF

45. Diarization is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge

Author: Daniel Povey, David Snyder, Vimal Manohar, Jesús Villalba, Najim Dehak, Shinji Watanabe, Sanjeev Khudanpur, Daniel Garcia-Romero, Gregory Sell, Alan V. McCree, and Matthew Maciejewski
Subjects: business.industry, Computer science, 020206 networking & telecommunications, 02 engineering and technology, computer.software_genre, Speaker diarisation, 030507 speech-language pathology & audiology, 03 medical and health sciences, 0202 electrical engineering, electronic engineering, information engineering, Artificial intelligence, 0305 other medical science, business, computer, Natural language processing
Published: 2018
Full Text: View/download PDF

46. End-to-end Deep Neural Network Age Estimation

Author: Pegah Ghahremani, Jesús Villalba, Najim Dehak, Sanjeev Khudanpur, Phani Sankar Nidadavolu, Nanxin Chen, and Daniel Povey
Subjects: 030507 speech-language pathology & audiology, 03 medical and health sciences, End-to-end principle, Artificial neural network, Computer science, Age estimation, business.industry, 0202 electrical engineering, electronic engineering, information engineering, 020206 networking & telecommunications, 02 engineering and technology, Artificial intelligence, 0305 other medical science, business
Published: 2018
Full Text: View/download PDF

47. Emotion Identification from Raw Speech Signals Using DNNs

Author: Daniel Povey, Mousmita Sarma, Nagendra Kumar Goel, Najim Dehak, Pegah Ghahremani, and Kandarpa Kumar Sarma
Subjects: 030507 speech-language pathology & audiology, 03 medical and health sciences, Computer science, Speech recognition, 0202 electrical engineering, electronic engineering, information engineering, Emotion identification, 020206 networking & telecommunications, 02 engineering and technology, 0305 other medical science
Published: 2018
Full Text: View/download PDF

48. Self-Attentive Speaker Embeddings for Text-Independent Speaker Verification

Author: Tom Ko, Daniel Povey, David Snyder, Brian Mak, and Yingke Zhu
Subjects: 030507 speech-language pathology & audiology, 03 medical and health sciences, Speaker verification, Computer science, Speech recognition, Text independent, 0202 electrical engineering, electronic engineering, information engineering, 020206 networking & telecommunications, 02 engineering and technology, 0305 other medical science
Published: 2018
Full Text: View/download PDF

49. Recurrent Neural Network Language Model Adaptation for Conversational Speech Recognition

Author: Yiming Wang, Ke Li, Sanjeev Khudanpur, Hainan Xu, and Daniel Povey
Subjects: Conversational speech, Recurrent neural network, Computer science, Speech recognition, 0202 electrical engineering, electronic engineering, information engineering, 020206 networking & telecommunications, 020201 artificial intelligence & image processing, 02 engineering and technology, Language model, Adaptation (computer science)
Published: 2018
Full Text: View/download PDF

50. A GPU-based WFST Decoder with Exact Lattice Generation

Author: Daniel Povey, Zhehuai Chen, Justin Luitjens, Sanjeev Khudanpur, Hainan Xu, and Yiming Wang
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Speedup, Computer science, I.2.7, 68T10, 02 engineering and technology, Parallel computing, Security token, 01 natural sciences, Scheduling (computing), Lattice (module), Token passing, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Pruning (decision trees), Graphics, Computation and Language (cs.CL), 010301 acoustics, Decoding methods
Abstract: We describe initial work on an extension of the Kaldi toolkit that supports weighted finite-state transducer (WFST) decoding on Graphics Processing Units (GPUs). We implement token recombination as an atomic GPU operation in order to fully parallelize the Viterbi beam search, and propose a dynamic load balancing strategy for more efficient token passing scheduling among GPU threads. We also redesign the exact lattice generation and lattice pruning algorithms for better utilization of the GPUs. Experiments on the Switchboard corpus show that the proposed method achieves identical 1-best results and lattice quality in recognition and confidence measure tasks, while running 3 to 15 times faster than the single process Kaldi decoder. The above results are reported on different GPU architectures. Additionally, we obtain a 46-fold speedup with sequence parallelism and multi-process service (MPS) in GPU. Accepted by INTERSPEECH 2018.
Published: 2018
Full Text: View/download PDF


151 results on '"Daniel Povey"'
