Author: "Audhkhasi, Kartik" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Audhkhasi, Kartik"' showing total 153 results

1. STAB: Speech Tokenizer Assessment Benchmark

Author: Vashishth, Shikhar, Singh, Harman, Bharadwaj, Shikhar, Ganapathy, Sriram, Asawaroengchai, Chulayuth, Audhkhasi, Kartik, Rosenberg, Andrew, Bapna, Ankur, and Ramabhadran, Bhuvana
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Representing speech as discrete tokens provides a framework for transforming speech into a format that closely resembles text, thus enabling the use of speech as an input to the widely successful large language models (LLMs). Currently, while several speech tokenizers have been proposed, there is ambiguity regarding the properties that are desired from a tokenizer for specific downstream tasks and its overall generalizability. Evaluating the performance of tokenizers across different downstream tasks is a computationally intensive effort that poses challenges for scalability. To circumvent this requirement, we present STAB (Speech Tokenizer Assessment Benchmark), a systematic evaluation framework designed to assess speech tokenizers comprehensively and shed light on their inherent characteristics. This framework provides a deeper understanding of the underlying mechanisms of speech tokenization, thereby offering a valuable resource for expediting the advancement of future tokenizer models and enabling comparative analysis using a standardized benchmark. We evaluate the STAB metrics and correlate this with downstream task performance across a range of speech tasks and tokenizer choices., Comment: 5 pages
Published: 2024

2. O-1: Self-training with Oracle and 1-best Hypothesis

Author: Baskar, Murali Karthick, Rosenberg, Andrew, Ramabhadran, Bhuvana, and Audhkhasi, Kartik
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We introduce O-1, a new self-training objective to reduce training bias and unify training and evaluation metrics for speech recognition. O-1 is a faster variant of Expected Minimum Bayes Risk (EMBR) that boosts the oracle hypothesis and can accommodate both supervised and unsupervised data. We demonstrate the effectiveness of our approach in terms of recognition on publicly available SpeechStew datasets and a large-scale, in-house data set. On SpeechStew, the O-1 objective closes the gap between the actual and oracle performance by 80% relative compared to EMBR, which bridges the gap by 43% relative. O-1 achieves 13% to 25% relative improvement over EMBR on the various datasets that SpeechStew comprises, and a 12% relative gap reduction with respect to the oracle WER over EMBR training on the in-house dataset. Overall, O-1 results in a 9% relative improvement in WER over EMBR, thereby speaking to the scalability of the proposed objective for large-scale datasets.
Published: 2023

3. Large-scale Language Model Rescoring on Long-form Data

Author: Chen, Tongzhou, Allauzen, Cyril, Huang, Yinghui, Park, Daniel, Rybach, David, Huang, W. Ronny, Cabrera, Rodrigo, Audhkhasi, Kartik, Ramabhadran, Bhuvana, Moreno, Pedro J., and Riley, Michael
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language
Abstract: In this work, we study the impact of Large-scale Language Models (LLM) on Automated Speech Recognition (ASR) of YouTube videos, which we use as a source for long-form ASR. We demonstrate up to 8% relative reduction in Word Error Rate (WER) on US English (en-us) and code-switched Indian English (en-in) long-form ASR test sets and a reduction of up to 30% relative on Salient Term Error Rate (STER) over a strong first-pass baseline that uses a maximum-entropy based language model. Improved lattice processing that results in a lattice with a proper (non-tree) digraph topology and carrying context from the 1-best hypothesis of the previous segment(s) results in significant wins in rescoring with LLMs. We also find that the gains in performance from the combination of LLMs trained on vast quantities of available data (such as C4) and conventional neural LMs are additive and significantly outperform a strong first-pass baseline with a maximum entropy LM., Comment: 5 pages, accepted in ICASSP 2023
Published: 2023

4. Robust Knowledge Distillation from RNN-T Models With Noisy Training Labels Using Full-Sum Loss

Author: Zeineldeen, Mohammad, Audhkhasi, Kartik, Baskar, Murali Karthick, and Ramabhadran, Bhuvana
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Statistics - Machine Learning
Abstract: This work studies knowledge distillation (KD) and addresses its constraints for recurrent neural network transducer (RNN-T) models. In hard distillation, a teacher model transcribes large amounts of unlabelled speech to train a student model. Soft distillation is another popular KD method that distills the output logits of the teacher model. Due to the nature of RNN-T alignments, applying soft distillation between RNN-T architectures having different posterior distributions is challenging. In addition, bad teachers having high word-error-rate (WER) reduce the efficacy of KD. We investigate how to effectively distill knowledge from variable quality ASR teachers, which has not been studied before to the best of our knowledge. We show that a sequence-level KD, full-sum distillation, outperforms other distillation methods for RNN-T models, especially for bad teachers. We also propose a variant of full-sum distillation that distills the sequence discriminative knowledge of the teacher leading to further improvement in WER. We conduct experiments on public datasets namely SpeechStew and LibriSpeech, and on in-house production data., Comment: Accepted at ICASSP 2023
Published: 2023

5. Modular Hybrid Autoregressive Transducer

Author: Meng, Zhong, Chen, Tongzhou, Prabhavalkar, Rohit, Zhang, Yu, Wang, Gary, Audhkhasi, Kartik, Emond, Jesse, Strohman, Trevor, Ramabhadran, Bhuvana, Huang, W. Ronny, Variani, Ehsan, Huang, Yinghui, and Moreno, Pedro J.
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Text-only adaptation of a transducer model remains challenging for end-to-end speech recognition since the transducer has no clearly separated acoustic model (AM), language model (LM) or blank model. In this work, we propose a modular hybrid autoregressive transducer (MHAT) that has structurally separated label and blank decoders to predict label and blank distributions, respectively, along with a shared acoustic encoder. The encoder and label decoder outputs are directly projected to AM and internal LM scores and then added to compute label posteriors. We train MHAT with an internal LM loss and a HAT loss to ensure that its internal LM becomes a standalone neural LM that can be effectively adapted to text. Moreover, text adaptation of MHAT fosters a much better LM fusion than internal LM subtraction-based methods. On Google's large-scale production data, a multi-domain MHAT adapted with 100B sentences achieves relative WER reductions of up to 12.4% without LM fusion and 21.5% with LM fusion from 400K-hour trained HAT., Comment: 8 pages, 1 figure, in SLT 2022
Published: 2022

6. Analysis of Self-Attention Head Diversity for Conformer-based Automatic Speech Recognition

Author: Audhkhasi, Kartik, Huang, Yinghui, Ramabhadran, Bhuvana, and Moreno, Pedro J.
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Attention layers are an integral part of modern end-to-end automatic speech recognition systems, for instance as part of the Transformer or Conformer architecture. Attention is typically multi-headed, where each head has an independent set of learned parameters and operates on the same input feature sequence. The output of multi-headed attention is a fusion of the outputs from the individual heads. We empirically analyze the diversity between representations produced by the different attention heads and demonstrate that the heads become highly correlated during the course of training. We investigate a few approaches to increasing attention head diversity, including using different attention mechanisms for each head and auxiliary training loss functions to promote head diversity. We show that introducing diversity-promoting auxiliary loss functions during training is a more effective approach, and obtain WER improvements of up to 6% relative on the Librispeech corpus. Finally, we draw a connection between the diversity of attention heads and the similarity of the gradients of head parameters., Comment: Accepted for publication in Interspeech 2022
Published: 2022

7. Leveraging Unpaired Text Data for Training End-to-End Speech-to-Intent Systems

Author: Huang, Yinghui, Kuo, Hong-Kwang, Thomas, Samuel, Kons, Zvi, Audhkhasi, Kartik, Kingsbury, Brian, Hoory, Ron, and Picheny, Michael
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, I.2.7
Abstract: Training an end-to-end (E2E) neural network speech-to-intent (S2I) system that directly extracts intents from speech requires large amounts of intent-labeled speech data, which is time consuming and expensive to collect. Initializing the S2I model with an ASR model trained on copious speech data can alleviate data sparsity. In this paper, we attempt to leverage NLU text resources. We implemented a CTC-based S2I system that matches the performance of a state-of-the-art, traditional cascaded SLU system. We performed controlled experiments with varying amounts of speech and text training data. When only a tenth of the original data is available, intent classification accuracy degrades by 7.6% absolute. Assuming we have additional text-to-intent data (without speech) available, we investigated two techniques to improve the S2I system: (1) transfer learning, in which acoustic embeddings for intent classification are tied to fine-tuned BERT text embeddings; and (2) data augmentation, in which the text-to-intent data is converted into speech-to-intent data using a multi-speaker text-to-speech system. The proposed approaches recover 80% of performance lost due to using limited intent-labeled speech., Comment: 5 pages, published in ICASSP 2020
Published: 2020

8. End-to-End Spoken Language Understanding Without Full Transcripts

Author: Kuo, Hong-Kwang J., Tüske, Zoltán, Thomas, Samuel, Huang, Yinghui, Audhkhasi, Kartik, Kingsbury, Brian, Kurata, Gakuto, Kons, Zvi, Hoory, Ron, and Lastras, Luis
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, I.2.7
Abstract: An essential component of spoken language understanding (SLU) is slot filling: representing the meaning of a spoken utterance using semantic entity labels. In this paper, we develop end-to-end (E2E) spoken language understanding systems that directly convert speech input to semantic entities and investigate if these E2E SLU models can be trained solely on semantic entity annotations without word-for-word transcripts. Training such models is very useful as they can drastically reduce the cost of data collection. We created two types of such speech-to-entities models, a CTC model and an attention-based encoder-decoder model, by adapting models trained originally for speech recognition. Given that our experiments involve speech input, these systems need to recognize both the entity label and words representing the entity value correctly. For our speech-to-entities experiments on the ATIS corpus, both the CTC and attention models showed impressive ability to skip non-entity words: there was little degradation when trained on just entities versus full transcripts. We also explored the scenario where the entities are in an order not necessarily related to spoken order in the utterance. With its ability to do re-ordering, the attention model did remarkably well, achieving only about 2% degradation in speech-to-bag-of-entities F1 score., Comment: 5 pages, to be published in Interspeech 2020
Published: 2020

9. AVLnet: Learning Audio-Visual Language Representations from Instructional Videos

Author: Rouditchenko, Andrew, Boggust, Angie, Harwath, David, Chen, Brian, Joshi, Dhiraj, Thomas, Samuel, Audhkhasi, Kartik, Kuehne, Hilde, Panda, Rameswar, Feris, Rogerio, Kingsbury, Brian, Picheny, Michael, Torralba, Antonio, and Glass, James
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language, Computer Science - Multimedia, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Current methods for learning visually grounded language from videos often rely on text annotation, such as human generated captions or machine generated automatic speech recognition (ASR) transcripts. In this work, we introduce the Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs. To circumvent the need for text annotation, we learn audio-visual representations from randomly segmented video clips and their raw audio waveforms. We train AVLnet on HowTo100M, a large corpus of publicly available instructional videos, and evaluate on image retrieval and video retrieval tasks, achieving state-of-the-art performance. We perform analysis of AVLnet's learned representations, showing our model utilizes speech and natural sounds to learn audio-visual concepts. Further, we propose a tri-modal model that jointly processes raw audio, video, and text captions from videos to learn a multi-modal semantic embedding space useful for text-video retrieval. Our code, data, and trained models will be released at avlnet.csail.mit.edu, Comment: A version of this work has been accepted to Interspeech 2021
Published: 2020

10. Single headed attention based sequence-to-sequence model for state-of-the-art results on Switchboard

Author: Tüske, Zoltán, Saon, George, Audhkhasi, Kartik, and Kingsbury, Brian
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, 68T10, I.2.7
Abstract: It is generally believed that direct sequence-to-sequence (seq2seq) speech recognition models are competitive with hybrid models only when a large amount of data, at least a thousand hours, is available for training. In this paper, we show that state-of-the-art recognition performance can be achieved on the Switchboard-300 database using a single headed attention, LSTM based model. Using a cross-utterance language model, our single-pass speaker independent system reaches 6.4% and 12.5% word error rate (WER) on the Switchboard and CallHome subsets of Hub5'00, without a pronunciation lexicon. While careful regularization and data augmentation are crucial in achieving this level of performance, experiments on Switchboard-2000 show that nothing is more useful than more data. Overall, the combination of various regularizations and a simple but fairly large model results in a new state of the art, 4.7% and 7.8% WER on the Switchboard and CallHome sets, using SWB-2000 without any external data resources., Comment: 5 pages, 2 figures
Published: 2020

11. Challenging the Boundaries of Speech Recognition: The MALACH Corpus

Author: Picheny, Michael, Tüske, Zoltán, Kingsbury, Brian, Audhkhasi, Kartik, Cui, Xiaodong, and Saon, George
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: There has been huge progress in speech recognition over the last several years. Tasks once thought extremely difficult, such as SWITCHBOARD, now approach levels of human performance. The MALACH corpus (LDC catalog LDC2012S05), a 375-Hour subset of a large archive of Holocaust testimonies collected by the Survivors of the Shoah Visual History Foundation, presents significant challenges to the speech community. The collection consists of unconstrained, natural speech filled with disfluencies, heavy accents, age-related coarticulations, un-cued speaker and language switching, and emotional speech - all still open problems for speech recognition systems. Transcription is challenging even for skilled human annotators. This paper proposes that the community place focus on the MALACH corpus to develop speech recognition systems that are more robust with respect to accents, disfluencies and emotional speech. To reduce the barrier for entry, a lexicon and training and testing setups have been created and baseline results using current deep learning technologies are presented. The metadata has just been released by LDC (LDC2019S11). It is hoped that this resource will enable the community to build on top of these baselines so that the extremely important information in these and related oral histories becomes accessible to a wider audience., Comment: Accepted for publication at INTERSPEECH 2019
Published: 2019

12. Guiding CTC Posterior Spike Timings for Improved Posterior Fusion and Knowledge Distillation

Author: Kurata, Gakuto and Audhkhasi, Kartik
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Conventional automatic speech recognition (ASR) systems trained from frame-level alignments can easily leverage posterior fusion to improve ASR accuracy and build a better single model with knowledge distillation. End-to-end ASR systems trained using the Connectionist Temporal Classification (CTC) loss do not require frame-level alignment and hence simplify model training. However, sparse and arbitrary posterior spike timings from CTC models pose a new set of challenges in posterior fusion from multiple models and knowledge distillation between CTC models. We propose a method to train a CTC model so that its spike timings are guided to align with those of a pre-trained guiding CTC model. As a result, all models that share the same guiding model have aligned spike timings. We show the advantage of our method in various scenarios including posterior fusion of CTC models and knowledge distillation between CTC models with different architectures. With the 300-hour Switchboard training data, the single word CTC model distilled from multiple models improved the word error rates to 13.7%/23.1% from 14.9%/24.1% on the Hub5 2000 Switchboard/CallHome test sets without using any data augmentation, language model, or complex decoder., Comment: Accepted to Interspeech 2019
Published: 2019
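
The absolute word error rates quoted in the abstract above can be converted into relative reductions, the form used by several other records in this list. The following is a minimal sketch of that arithmetic; the helper name `relative_wer_reduction` is hypothetical and does not come from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction, in percent, of new_wer vs. baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Figures taken from the abstract: 14.9%/24.1% -> 13.7%/23.1%
# on the Hub5 2000 Switchboard/CallHome test sets.
swb = relative_wer_reduction(14.9, 13.7)
ch = relative_wer_reduction(24.1, 23.1)
print(f"Switchboard: {swb:.1f}% relative, CallHome: {ch:.1f}% relative")
```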

13. Acoustically Grounded Word Embeddings for Improved Acoustics-to-Word Speech Recognition

Author: Settle, Shane, Audhkhasi, Kartik, Livescu, Karen, and Picheny, Michael
Subjects: Computer Science - Computation and Language
Abstract: Direct acoustics-to-word (A2W) systems for end-to-end automatic speech recognition are simpler to train, and more efficient to decode with, than sub-word systems. However, A2W systems can have difficulties at training time when data is limited, and at decoding time when recognizing words outside the training vocabulary. To address these shortcomings, we investigate the use of recently proposed acoustic and acoustically grounded word embedding techniques in A2W systems. The idea is based on treating the final pre-softmax weight matrix of an A2W recognizer as a matrix of word embedding vectors, and using an externally trained set of word embeddings to improve the quality of this matrix. In particular we introduce two ideas: (1) enforcing similarity at training time between the external embeddings and the recognizer weights, and (2) using the word embeddings at test time for predicting out-of-vocabulary words. Our word embedding model is acoustically grounded, that is, it is learned jointly with acoustic embeddings so as to encode the words' acoustic-phonetic content; and it is parametric, so that it can embed any arbitrary (potentially out-of-vocabulary) sequence of characters. We find that both techniques improve the performance of an A2W recognizer on conversational telephone speech., Comment: To appear at ICASSP 2019
Published: 2019

14. Task Vector Algebra for ASR Models

Author: Ramesh, Gowtham, Audhkhasi, Kartik, and Ramabhadran, Bhuvana
Published: 2024

15. Joint Modeling of Accents and Acoustics for Multi-Accent Speech Recognition

Author: Yang, Xuesong, Audhkhasi, Kartik, Rosenberg, Andrew, Thomas, Samuel, Ramabhadran, Bhuvana, and Hasegawa-Johnson, Mark
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The performance of automatic speech recognition systems degrades with increasing mismatch between the training and testing scenarios. Differences in speaker accents are a significant source of such mismatch. The traditional approach to deal with multiple accents involves pooling data from several accents during training and building a single model in multi-task fashion, where tasks correspond to individual accents. In this paper, we explore an alternate model where we jointly learn an accent classifier and a multi-task acoustic model. Experiments on the American English Wall Street Journal and British English Cambridge corpora demonstrate that our joint model outperforms the strong multi-task acoustic model baseline. We obtain a 5.94% relative improvement in word error rate on British English, and 9.47% relative improvement on American English. This illustrates that jointly modeling with accent information improves acoustic model performance., Comment: Accepted in The 43rd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP2018)
Published: 2018

16. Building competitive direct acoustics-to-word models for English conversational speech recognition

Author: Audhkhasi, Kartik, Kingsbury, Brian, Ramabhadran, Bhuvana, Saon, George, and Picheny, Michael
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Neural and Evolutionary Computing, Statistics - Machine Learning
Abstract: Direct acoustics-to-word (A2W) models in the end-to-end paradigm have received increasing attention compared to conventional sub-word based automatic speech recognition models using phones, characters, or context-dependent hidden Markov model states. This is because A2W models recognize words from speech without any decoder, pronunciation lexicon, or externally-trained language model, making training and decoding with such models simple. Prior work has shown that A2W models require orders of magnitude more training data in order to perform comparably to conventional models. Our work also showed this accuracy gap when using the English Switchboard-Fisher data set. This paper describes a recipe to train an A2W model that closes this gap and is at-par with state-of-the-art sub-word based models. We achieve a word error rate of 8.8%/13.9% on the Hub5-2000 Switchboard/CallHome test sets without any decoder or language model. We find that model initialization, training data order, and regularization have the most impact on the A2W model performance. Next, we present a joint word-character A2W model that learns to first spell the word and then recognize it. This model provides a rich output to the user instead of simple word hypotheses, making it especially useful in the case of words unseen or rarely-seen during training., Comment: Submitted to IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018
Published: 2017

17. Direct Acoustics-to-Word Models for English Conversational Speech Recognition

Author: Audhkhasi, Kartik, Ramabhadran, Bhuvana, Saon, George, Picheny, Michael, and Nahamoo, David
Subjects: Computer Science - Computation and Language, Computer Science - Neural and Evolutionary Computing, Statistics - Machine Learning
Abstract: Recent work on end-to-end automatic speech recognition (ASR) has shown that the connectionist temporal classification (CTC) loss can be used to convert acoustics to phone or character sequences. Such systems are used with a dictionary and separately-trained Language Model (LM) to produce word sequences. However, they are not truly end-to-end in the sense of mapping acoustics directly to words without an intermediate phone representation. In this paper, we present the first results employing direct acoustics-to-word CTC models on two well-known public benchmark tasks: Switchboard and CallHome. These models do not require an LM or even a decoder at run-time and hence recognize speech with minimal complexity. However, due to the large number of word output units, CTC word models require orders of magnitude more data to train reliably compared to traditional systems. We present some techniques to mitigate this issue. Our CTC word model achieves a word error rate of 13.0%/18.8% on the Hub5-2000 Switchboard/CallHome test sets without any LM or decoder compared with 9.6%/16.0% for phone-based CTC with a 4-gram LM. We also present rescoring results on CTC word model lattices to quantify the performance benefits of a LM, and contrast the performance of word and phone CTC models., Comment: Submitted to Interspeech-2017
Published: 2017

18. English Conversational Telephone Speech Recognition by Humans and Machines

Author: Saon, George, Kurata, Gakuto, Sercu, Tom, Audhkhasi, Kartik, Thomas, Samuel, Dimitriadis, Dimitrios, Cui, Xiaodong, Ramabhadran, Bhuvana, Picheny, Michael, Lim, Lynn-Li, Roomi, Bergul, and Hall, Phil
Subjects: Computer Science - Computation and Language
Abstract: One of the most difficult speech recognition tasks is accurate recognition of human to human communication. Advances in deep learning over the last few years have produced major speech recognition improvements on the representative Switchboard conversational corpus. Word error rates that just a few years ago were 14% have dropped to 8.0%, then 6.6% and most recently 5.8%, and are now believed to be within striking range of human performance. This then raises two issues - what IS human performance, and how far down can we still drive speech recognition error rates? A recent paper by Microsoft suggests that we have already achieved human performance. In trying to verify this statement, we performed an independent set of human performance measurements on two conversational tasks and found that human performance may be considerably better than what was earlier reported, giving the community a significantly harder goal to achieve. We also report on our own efforts in this area, presenting a set of acoustic and language modeling techniques that lowered the word error rate of our own English conversational telephone LVCSR system to the level of 5.5%/10.3% on the Switchboard/CallHome subsets of the Hub5 2000 evaluation, which - at least at the writing of this paper - is a new performance milestone (albeit not at what we measure to be human performance!). On the acoustic side, we use a score fusion of three models: one LSTM with multiple feature inputs, a second LSTM trained with speaker-adversarial multi-task learning and a third residual net (ResNet) with 25 convolutional layers and time-dilated convolutions. On the language modeling side, we use word and character LSTMs and convolutional WaveNet-style language models.
Published: 2017

19. End-to-End ASR-free Keyword Search from Speech

Author: Audhkhasi, Kartik, Rosenberg, Andrew, Sethy, Abhinav, Ramabhadran, Bhuvana, and Kingsbury, Brian
Subjects: Computer Science - Computation and Language, Computer Science - Information Retrieval, Computer Science - Learning, Computer Science - Neural and Evolutionary Computing
Abstract: End-to-end (E2E) systems have achieved competitive results compared to conventional hybrid hidden Markov model (HMM)-deep neural network based automatic speech recognition (ASR) systems. Such E2E systems are attractive due to the lack of dependence on alignments between input acoustic and output grapheme or HMM state sequence during training. This paper explores the design of an ASR-free end-to-end system for text query-based keyword search (KWS) from speech trained with minimal supervision. Our E2E KWS system consists of three sub-systems. The first sub-system is a recurrent neural network (RNN)-based acoustic auto-encoder trained to reconstruct the audio through a finite-dimensional representation. The second sub-system is a character-level RNN language model using embeddings learned from a convolutional neural network. Since the acoustic and text query embeddings occupy different representation spaces, they are input to a third feed-forward neural network that predicts whether the query occurs in the acoustic utterance or not. This E2E ASR-free KWS system performs respectably despite lacking a conventional ASR system and trains much faster., Comment: Published in the IEEE 2017 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2017), scheduled for 5-9 March 2017 in New Orleans, Louisiana, USA
Published: 2017

20. Invariant Representations for Noisy Speech Recognition

Author: Serdyuk, Dmitriy, Audhkhasi, Kartik, Brakel, Philémon, Ramabhadran, Bhuvana, Thomas, Samuel, and Bengio, Yoshua
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Learning, Computer Science - Sound, Statistics - Machine Learning
Abstract: Modern automatic speech recognition (ASR) systems need to be robust under acoustic variability arising from environmental, speaker, channel, and recording conditions. Ensuring such robustness to variability is a challenge in modern day neural network-based ASR systems, especially when all types of variability are not seen during training. We attempt to address this problem by encouraging the neural network acoustic model to learn invariant feature representations. We use ideas from recent research on image generation using Generative Adversarial Networks and domain adaptation ideas extending adversarial gradient-based training. A recent work from Ganin et al. proposes to use adversarial training for image domain adaptation by using an intermediate representation from the main target classification network to deteriorate the domain classifier performance through a separate neural network. Our work focuses on investigating neural architectures which produce representations invariant to noise conditions for ASR. We evaluate the proposed architecture on the Aurora-4 task, a popular benchmark for noise robust ASR. We show that our method generalizes better than the standard multi-condition training especially when only a few noise categories are seen during training., Comment: 5 pages, 1 figure, 1 table, NIPS workshop on end-to-end speech recognition
Published: 2016

21. Diverse Embedding Neural Network Language Models

Author: Audhkhasi, Kartik, Sethy, Abhinav, and Ramabhadran, Bhuvana
Subjects: Computer Science - Computation and Language, Computer Science - Learning, Computer Science - Neural and Evolutionary Computing
Abstract: We propose Diverse Embedding Neural Network (DENN), a novel architecture for language models (LMs). A DENNLM projects the input word history vector onto multiple diverse low-dimensional sub-spaces instead of a single higher-dimensional sub-space as in conventional feed-forward neural network LMs. We encourage these sub-spaces to be diverse during network training through an augmented loss function. Our language modeling experiments on the Penn Treebank data set show the performance benefit of using a DENNLM. Comment: Under review as workshop contribution at ICLR 2015
Published: 2014

22. Noise can speed backpropagation learning and deep bidirectional pretraining

Author: Kosko, Bart, Audhkhasi, Kartik, and Osoba, Osonde
Published: 2020
Full Text: View/download PDF

23. Generalized Ambiguity Decomposition for Understanding Ensemble Diversity

Author: Audhkhasi, Kartik, Sethy, Abhinav, Ramabhadran, Bhuvana, and Narayanan, Shrikanth S.
Subjects: Statistics - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Learning, I.5
Abstract: Diversity or complementarity of experts in ensemble pattern recognition and information processing systems is widely observed by researchers to be crucial for achieving performance improvement upon fusion. Understanding this link between ensemble diversity and fusion performance is thus an important research question. However, prior works have theoretically characterized ensemble diversity and have linked it with ensemble performance in very restricted settings. We present a generalized ambiguity decomposition (GAD) theorem as a broad framework for answering these questions. The GAD theorem applies to a generic convex ensemble of experts for any arbitrary twice-differentiable loss function. It shows that the ensemble performance approximately decomposes into a difference of the average expert performance and the diversity of the ensemble. It thus provides a theoretical explanation for the empirically observed benefit of fusing outputs from diverse classifiers and regressors. It also provides a loss function-dependent, ensemble-dependent, and data-dependent definition of diversity. We present extensions of this decomposition to common regression and classification loss functions, and report a simulation-based analysis of the diversity term and the accuracy of the decomposition. We finally present experiments on standard pattern recognition data sets which indicate the accuracy of the decomposition for real-world classification and regression problems. Comment: 32 pages, 10 figures
Published: 2013

24. O-1: Self-training with Oracle and 1-best Hypothesis

Author: Baskar, Murali Karthick, primary, Rosenberg, Andrew, additional, Ramabhadran, Bhuvana, additional, and Audhkhasi, Kartik, additional
Published: 2023
Full Text: View/download PDF

25. Large-Scale Language Model Rescoring on Long-Form Data

Author: Chen, Tongzhou, primary, Allauzen, Cyril, additional, Huang, Yinghui, additional, Park, Daniel, additional, Rybach, David, additional, Huang, W. Ronny, additional, Cabrera, Rodrigo, additional, Audhkhasi, Kartik, additional, Ramabhadran, Bhuvana, additional, Moreno, Pedro J., additional, and Riley, Michael, additional
Published: 2023
Full Text: View/download PDF

26. Robust Knowledge Distillation from RNN-T Models with Noisy Training Labels Using Full-Sum Loss

Author: Zeineldeen, Mohammad, primary, Audhkhasi, Kartik, additional, Baskar, Murali Karthick, additional, and Ramabhadran, Bhuvana, additional
Published: 2023
Full Text: View/download PDF

27. Modular Conformer Training for Flexible End-to-End ASR

Author: Audhkhasi, Kartik, primary, Farris, Brian, additional, Ramabhadran, Bhuvana, additional, and Moreno, Pedro J., additional
Published: 2023
Full Text: View/download PDF

28. Modular Hybrid Autoregressive Transducer

Author: Meng, Zhong, primary, Chen, Tongzhou, additional, Prabhavalkar, Rohit, additional, Zhang, Yu, additional, Wang, Gary, additional, Audhkhasi, Kartik, additional, Emond, Jesse, additional, Strohman, Trevor, additional, Ramabhadran, Bhuvana, additional, Huang, W. Ronny, additional, Variani, Ehsan, additional, Huang, Yinghui, additional, and Moreno, Pedro J., additional
Published: 2023
Full Text: View/download PDF

29. Noise-enhanced convolutional neural networks

Author: Audhkhasi, Kartik, Osoba, Osonde, and Kosko, Bart
Published: 2016
Full Text: View/download PDF

30. Detecting paralinguistic events in audio stream using context in features and probabilistic decisions

Author: Gupta, Rahul, Audhkhasi, Kartik, Lee, Sungbok, and Narayanan, Shrikanth
Published: 2016
Full Text: View/download PDF

31. Analysis of Self-Attention Head Diversity for Conformer-based Automatic Speech Recognition

Author: Audhkhasi, Kartik, primary, Huang, Yinghui, additional, Ramabhadran, Bhuvana, additional, and Moreno, Pedro J., additional
Published: 2022
Full Text: View/download PDF

32. Applying Machine Learning to Facilitate Autism Diagnostics: Pitfalls and Promises

Author: Bone, Daniel, Goodwin, Matthew S., Black, Matthew P., Lee, Chi-Chun, Audhkhasi, Kartik, and Narayanan, Shrikanth
Subjects: Machine learning -- Analysis, Autism -- Diagnosis -- Research, Health
Abstract: Machine learning has immense potential to enhance diagnostic and intervention research in the behavioral sciences, and may be especially useful in investigations involving the highly prevalent and heterogeneous syndrome of autism spectrum disorder. However, use of machine learning in the absence of clinical domain expertise can be tenuous and lead to misinformed conclusions. To illustrate this concern, the current paper critically evaluates and attempts to reproduce results from two studies (Wall et al. in Transl Psychiatry 2(4):e100, 2012a; PloS One 7(8), 2012b) that claim to drastically reduce time to diagnose autism using machine learning. Our failure to generate comparable findings to those reported by Wall and colleagues using larger and more balanced data underscores several conceptual and methodological problems associated with these studies. We conclude with proposed best-practices when using machine learning in autism research, and highlight some especially promising areas for collaborative work at the intersection of computational and behavioral science. Author(s): Daniel Bone (1), Matthew S. Goodwin (2), Matthew P. Black (3), Chi-Chun Lee (4), Kartik Audhkhasi (1), Shrikanth Narayanan (1). Author Affiliations: (1) Signal Analysis & Interpretation Laboratory (SAIL), University [...]
Published: 2015
Full Text: View/download PDF

33. Regularizing Word Segmentation by Creating Misspellings

Author: Xu, Hainan, primary, Audhkhasi, Kartik, additional, Huang, Yinghui, additional, Emond, Jesse, additional, and Ramabhadran, Bhuvana, additional
Published: 2021
Full Text: View/download PDF

34. Mixture Model Attention: Flexible Streaming and Non-Streaming Automatic Speech Recognition

Author: Audhkhasi, Kartik, primary, Chen, Tongzhou, additional, Ramabhadran, Bhuvana, additional, and Moreno, Pedro J., additional
Published: 2021
Full Text: View/download PDF

35. AVLnet: Learning Audio-Visual Language Representations from Instructional Videos

Author: Rouditchenko, Andrew, primary, Boggust, Angie, additional, Harwath, David, additional, Chen, Brian, additional, Joshi, Dhiraj, additional, Thomas, Samuel, additional, Audhkhasi, Kartik, additional, Kuehne, Hilde, additional, Panda, Rameswar, additional, Feris, Rogerio, additional, Kingsbury, Brian, additional, Picheny, Michael, additional, Torralba, Antonio, additional, and Glass, James, additional
Published: 2021
Full Text: View/download PDF

36. Convolutional Dropout and Wordpiece Augmentation for End-to-End Speech Recognition

Author: Xu, Hainan, primary, Huang, Yinghui, additional, Zhu, Yun, additional, Audhkhasi, Kartik, additional, and Ramabhadran, Bhuvana, additional
Published: 2021
Full Text: View/download PDF

37. Transliteration Based Data Augmentation for Training Multilingual ASR Acoustic Models in Low Resource Settings

Author: Thomas, Samuel, primary, Audhkhasi, Kartik, additional, and Kingsbury, Brian, additional
Published: 2020
Full Text: View/download PDF

38. End-to-End Spoken Language Understanding Without Full Transcripts

Author: Kuo, Hong-Kwang J., primary, Tüske, Zoltán, additional, Thomas, Samuel, additional, Huang, Yinghui, additional, Audhkhasi, Kartik, additional, Kingsbury, Brian, additional, Kurata, Gakuto, additional, Kons, Zvi, additional, Hoory, Ron, additional, and Lastras, Luis, additional
Published: 2020
Full Text: View/download PDF

39. Single Headed Attention Based Sequence-to-Sequence Model for State-of-the-Art Results on Switchboard

Author: Tüske, Zoltán, primary, Saon, George, additional, Audhkhasi, Kartik, additional, and Kingsbury, Brian, additional
Published: 2020
Full Text: View/download PDF

40. Alignment-Length Synchronous Decoding for RNN Transducer

Author: Saon, George, primary, Tüske, Zoltán, additional, and Audhkhasi, Kartik, additional
Published: 2020
Full Text: View/download PDF

41. Leveraging Unpaired Text Data for Training End-To-End Speech-to-Intent Systems

Author: Huang, Yinghui, primary, Kuo, Hong-Kwang, additional, Thomas, Samuel, additional, Kons, Zvi, additional, Audhkhasi, Kartik, additional, Kingsbury, Brian, additional, Hoory, Ron, additional, and Picheny, Michael, additional
Published: 2020
Full Text: View/download PDF

42. Simplified LSTMs for Speech Recognition

Author: Saon, George, primary, Tüske, Zoltán, additional, Audhkhasi, Kartik, additional, Kingsbury, Brian, additional, Picheny, Michael, additional, and Thomas, Samuel, additional
Published: 2019
Full Text: View/download PDF

43. Multi-Task CTC Training with Auxiliary Feature Reconstruction for End-to-End Speech Recognition

Author: Kurata, Gakuto, primary and Audhkhasi, Kartik, additional
Published: 2019
Full Text: View/download PDF

44. Challenging the Boundaries of Speech Recognition: The MALACH Corpus

Author: Picheny, Michael, primary, Tüske, Zoltán, additional, Kingsbury, Brian, additional, Audhkhasi, Kartik, additional, Cui, Xiaodong, additional, and Saon, George, additional
Published: 2019
Full Text: View/download PDF

45. Advancing Sequence-to-Sequence Based Speech Recognition

Author: Tüske, Zoltán, primary, Audhkhasi, Kartik, additional, and Saon, George, additional
Published: 2019
Full Text: View/download PDF

46. Detection and Recovery of OOVs for Improved English Broadcast News Captioning

Author: Thomas, Samuel, primary, Audhkhasi, Kartik, additional, Tüske, Zoltán, additional, Huang, Yinghui, additional, and Picheny, Michael, additional
Published: 2019
Full Text: View/download PDF

47. Guiding CTC Posterior Spike Timings for Improved Posterior Fusion and Knowledge Distillation

Author: Kurata, Gakuto, primary and Audhkhasi, Kartik, additional
Published: 2019
Full Text: View/download PDF

48. Forget a Bit to Learn Better: Soft Forgetting for CTC-Based Automatic Speech Recognition

Author: Audhkhasi, Kartik, primary, Saon, George, additional, Tüske, Zoltán, additional, Kingsbury, Brian, additional, and Picheny, Michael, additional
Published: 2019
Full Text: View/download PDF

49. Sequence Noise Injected Training for End-to-end Speech Recognition

Author: Saon, George, primary, Tüske, Zoltán, additional, Audhkhasi, Kartik, additional, and Kingsbury, Brian, additional
Published: 2019
Full Text: View/download PDF

50. Acoustically Grounded Word Embeddings for Improved Acoustics-to-word Speech Recognition

Author: Settle, Shane, primary, Audhkhasi, Kartik, additional, Livescu, Karen, additional, and Picheny, Michael, additional
Published: 2019
Full Text: View/download PDF
