Author: "Mahadeokar, Jay" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Mahadeokar, Jay"' showing total 109 results

Start Over Author "Mahadeokar, Jay"

109 results on '"Mahadeokar, Jay"'

1. Efficient Streaming LLM for Speech Recognition

Author: Jia, Junteng, Keren, Gil, Zhou, Wei, Lakomkin, Egor, Zhang, Xiaohui, Wu, Chunyang, Seide, Frank, Mahadeokar, Jay, and Kalinli, Ozlem
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent works have shown that prompting large language models with audio encodings can unlock speech recognition capabilities. However, existing techniques do not scale efficiently, especially while handling long form streaming audio inputs -- not only do they extrapolate poorly beyond the audio length seen during training, but they are also computationally inefficient due to the quadratic cost of attention. In this work, we introduce SpeechLLM-XL, a linear scaling decoder-only model for streaming speech recognition. We process audios in configurable chunks using limited attention window for reduced computation, and the text tokens for each audio chunk are generated auto-regressively until an EOS is predicted. During training, the transcript is segmented into chunks, using a CTC forced alignment estimated from encoder output. SpeechLLM-XL with 1.28 seconds chunk size achieves 2.7%/6.7% WER on LibriSpeech test clean/other, and it shows no quality degradation on long form utterances 10x longer than the training utterances.
Published: 2024

2. Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech

Author: Kang, Wonjune, Jia, Junteng, Wu, Chunyang, Zhou, Wei, Lakomkin, Egor, Gaur, Yashesh, Sari, Leda, Kim, Suyoun, Li, Ke, Mahadeokar, Jay, and Kalinli, Ozlem
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Sound
Abstract: As speech becomes an increasingly common modality for interacting with large language models (LLMs), it is becoming desirable to develop systems where LLMs can take into account users' emotions or speaking styles when providing their responses. In this work, we study the potential of an LLM to understand these aspects of speech without fine-tuning its weights. To do this, we utilize an end-to-end system with a speech encoder; the encoder is trained to produce token embeddings such that the LLM's response to an expressive speech prompt is aligned with its response to a semantically matching text prompt where the speaker's emotion has also been specified. We find that this training framework allows the encoder to generate tokens that capture both semantic and paralinguistic information in speech and effectively convey it to the LLM, even when the LLM remains completely frozen. We also explore training on additional emotion and style-related response alignment tasks, finding that they further increase the amount of paralinguistic information explicitly captured in the speech tokens. Experiments demonstrate that our system is able to produce higher quality and more empathetic responses to expressive speech prompts compared to several baselines.
Published: 2024

3. M-BEST-RQ: A Multi-Channel Speech Foundation Model for Smart Glasses

Author: Yang, Yufeng, Raj, Desh, Lin, Ju, Moritz, Niko, Jia, Junteng, Keren, Gil, Lakomkin, Egor, Huang, Yiteng, Donley, Jacob, Mahadeokar, Jay, and Kalinli, Ozlem
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: The growing popularity of multi-channel wearable devices, such as smart glasses, has led to a surge of applications such as targeted speech recognition and enhanced hearing. However, current approaches to solve these tasks use independently trained models, which may not benefit from large amounts of unlabeled data. In this paper, we propose M-BEST-RQ, the first multi-channel speech foundation model for smart glasses, which is designed to leverage large-scale self-supervised learning (SSL) in an array-geometry agnostic approach. While prior work on multi-channel speech SSL only evaluated on simulated settings, we curate a suite of real downstream tasks to evaluate our model, namely (i) conversational automatic speech recognition (ASR), (ii) spherical active source localization, and (iii) glasses wearer voice activity detection, which are sourced from the MMCSG and EasyCom datasets. We show that a general-purpose M-BEST-RQ encoder is able to match or surpass supervised models across all tasks. For the conversational ASR task in particular, using only 8 hours of labeled speech, our model outperforms a supervised ASR baseline that is trained on 2000 hours of labeled data, which demonstrates the effectiveness of our approach., Comment: In submission to IEEE ICASSP 2025
Published: 2024

4. Faster Speech-LLaMA Inference with Multi-token Prediction

Author: Raj, Desh, Keren, Gil, Jia, Junteng, Mahadeokar, Jay, and Kalinli, Ozlem
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Large language models (LLMs) have become proficient at solving a wide variety of tasks, including those involving multi-modal inputs. In particular, instantiating an LLM (such as LLaMA) with a speech encoder and training it on paired data imparts speech recognition (ASR) abilities to the decoder-only model, hence called Speech-LLaMA. Nevertheless, due to the sequential nature of auto-regressive inference and the relatively large decoder, Speech-LLaMA models require relatively high inference time. In this work, we propose to speed up Speech-LLaMA inference by predicting multiple tokens in the same decoding step. We explore several model architectures that enable this, and investigate their performance using threshold-based and verification-based inference strategies. We also propose a prefix-based beam search decoding method that allows efficient minimum word error rate (MWER) training for such models. We evaluate our models on a variety of public benchmarks, where they reduce the number of decoder calls by ~3.2x while maintaining or improving WER performance., Comment: Submitted to IEEE ICASSP 2025
Published: 2024

5. The Llama 3 Herd of Models

Author: Dubey, Abhimanyu, Jauhri, Abhinav, Pandey, Abhinav, Kadian, Abhishek, Al-Dahle, Ahmad, Letman, Aiesha, Mathur, Akhil, Schelten, Alan, Yang, Amy, Fan, Angela, Goyal, Anirudh, Hartshorn, Anthony, Yang, Aobo, Mitra, Archi, Sravankumar, Archie, Korenev, Artem, Hinsvark, Arthur, Rao, Arun, Zhang, Aston, Rodriguez, Aurelien, Gregerson, Austen, Spataru, Ava, Roziere, Baptiste, Biron, Bethany, Tang, Binh, Chern, Bobbie, Caucheteux, Charlotte, Nayak, Chaya, Bi, Chloe, Marra, Chris, McConnell, Chris, Keller, Christian, Touret, Christophe, Wu, Chunyang, Wong, Corinne, Ferrer, Cristian Canton, Nikolaidis, Cyrus, Allonsius, Damien, Song, Daniel, Pintz, Danielle, Livshits, Danny, Esiobu, David, Choudhary, Dhruv, Mahajan, Dhruv, Garcia-Olano, Diego, Perino, Diego, Hupkes, Dieuwke, Lakomkin, Egor, AlBadawy, Ehab, Lobanova, Elina, Dinan, Emily, Smith, Eric Michael, Radenovic, Filip, Zhang, Frank, Synnaeve, Gabriel, Lee, Gabrielle, Anderson, Georgia Lewis, Nail, Graeme, Mialon, Gregoire, Pang, Guan, Cucurell, Guillem, Nguyen, Hailey, Korevaar, Hannah, Xu, Hu, Touvron, Hugo, Zarov, Iliyan, Ibarra, Imanol Arrieta, Kloumann, Isabel, Misra, Ishan, Evtimov, Ivan, Copet, Jade, Lee, Jaewon, Geffert, Jan, Vranes, Jana, Park, Jason, Mahadeokar, Jay, Shah, Jeet, van der Linde, Jelmer, Billock, Jennifer, Hong, Jenny, Lee, Jenya, Fu, Jeremy, Chi, Jianfeng, Huang, Jianyu, Liu, Jiawen, Wang, Jie, Yu, Jiecao, Bitton, Joanna, Spisak, Joe, Park, Jongsoo, Rocca, Joseph, Johnstun, Joshua, Saxe, Joshua, Jia, Junteng, Alwala, Kalyan Vasuden, Upasani, Kartikeya, Plawiak, Kate, Li, Ke, Heafield, Kenneth, Stone, Kevin, El-Arini, Khalid, Iyer, Krithika, Malik, Kshitiz, Chiu, Kuenley, Bhalla, Kunal, Rantala-Yeary, Lauren, van der Maaten, Laurens, Chen, Lawrence, Tan, Liang, Jenkins, Liz, Martin, Louis, Madaan, Lovish, Malo, Lubo, Blecher, Lukas, Landzaat, Lukas, de Oliveira, Luke, Muzzi, Madeline, Pasupuleti, Mahesh, Singh, Mannat, Paluri, Manohar, Kardas, Marcin, Oldham, Mathew, Rita, Mathieu, Pavlova, Maya, Kambadur, Melanie, Lewis, Mike, Si, Min, Singh, Mitesh Kumar, Hassan, Mona, Goyal, Naman, Torabi, Narjes, Bashlykov, Nikolay, Bogoychev, Nikolay, Chatterji, Niladri, Duchenne, Olivier, Çelebi, Onur, Alrassy, Patrick, Zhang, Pengchuan, Li, Pengwei, Vasic, Petar, Weng, Peter, Bhargava, Prajjwal, Dubal, Pratik, Krishnan, Praveen, Koura, Punit Singh, Xu, Puxin, He, Qing, Dong, Qingxiao, Srinivasan, Ragavan, Ganapathy, Raj, Calderer, Ramon, Cabral, Ricardo Silveira, Stojnic, Robert, Raileanu, Roberta, Girdhar, Rohit, Patel, Rohit, Sauvestre, Romain, Polidoro, Ronnie, Sumbaly, Roshan, Taylor, Ross, Silva, Ruan, Hou, Rui, Wang, Rui, Hosseini, Saghar, Chennabasappa, Sahana, Singh, Sanjay, Bell, Sean, Kim, Seohyun Sonia, Edunov, Sergey, Nie, Shaoliang, Narang, Sharan, Raparthy, Sharath, Shen, Sheng, Wan, Shengye, Bhosale, Shruti, Zhang, Shun, Vandenhende, Simon, Batra, Soumya, Whitman, Spencer, Sootla, Sten, Collot, Stephane, Gururangan, Suchin, Borodinsky, Sydney, Herman, Tamar, Fowler, Tara, Sheasha, Tarek, Georgiou, Thomas, Scialom, Thomas, Speckbacher, Tobias, Mihaylov, Todor, Xiao, Tong, Karn, Ujjwal, Goswami, Vedanuj, Gupta, Vibhor, Ramanathan, Vignesh, Kerkez, Viktor, Gonguet, Vincent, Do, Virginie, Vogeti, Vish, Petrovic, Vladan, Chu, Weiwei, Xiong, Wenhan, Fu, Wenyin, Meers, Whitney, Martinet, Xavier, Wang, Xiaodong, Tan, Xiaoqing Ellen, Xie, Xinfeng, Jia, Xuchao, Wang, Xuewei, Goldschlag, Yaelle, Gaur, Yashesh, Babaei, Yasmine, Wen, Yi, Song, Yiwen, Zhang, Yuchen, Li, Yue, Mao, Yuning, Coudert, Zacharie Delpierre, Yan, Zheng, Chen, Zhengxing, Papakipos, Zoe, Singh, Aaditya, Grattafiori, Aaron, Jain, Abha, Kelsey, Adam, Shajnfeld, Adam, Gangidi, Adithya, Victoria, Adolfo, Goldstand, Ahuva, Menon, Ajay, Sharma, Ajay, Boesenberg, Alex, Vaughan, Alex, Baevski, Alexei, Feinstein, Allie, Kallet, Amanda, Sangani, Amit, Yunus, Anam, Lupu, Andrei, Alvarado, Andres, Caples, Andrew, Gu, Andrew, Ho, Andrew, Poulton, Andrew, Ryan, Andrew, Ramchandani, Ankit, Franco, Annie, Saraf, Aparajita, Chowdhury, Arkabandhu, Gabriel, Ashley, Bharambe, Ashwin, Eisenman, Assaf, Yazdan, Azadeh, James, Beau, Maurer, Ben, Leonhardi, Benjamin, Huang, Bernie, Loyd, Beth, De Paola, Beto, Paranjape, Bhargavi, Liu, Bing, Wu, Bo, Ni, Boyu, Hancock, Braden, Wasti, Bram, Spence, Brandon, Stojkovic, Brani, Gamido, Brian, Montalvo, Britt, Parker, Carl, Burton, Carly, Mejia, Catalina, Wang, Changhan, Kim, Changkyu, Zhou, Chao, Hu, Chester, Chu, Ching-Hsiang, Cai, Chris, Tindal, Chris, Feichtenhofer, Christoph, Civin, Damon, Beaty, Dana, Kreymer, Daniel, Li, Daniel, Wyatt, Danny, Adkins, David, Xu, David, Testuggine, Davide, David, Delia, Parikh, Devi, Liskovich, Diana, Foss, Didem, Wang, Dingkang, Le, Duc, Holland, Dustin, Dowling, Edward, Jamil, Eissa, Montgomery, Elaine, Presani, Eleonora, Hahn, Emily, Wood, Emily, Brinkman, Erik, Arcaute, Esteban, Dunbar, Evan, Smothers, Evan, Sun, Fei, Kreuk, Felix, Tian, Feng, Ozgenel, Firat, Caggioni, Francesco, Guzmán, Francisco, Kanayet, Frank, Seide, Frank, Florez, Gabriela Medina, Schwarz, Gabriella, Badeer, Gada, Swee, Georgia, Halpern, Gil, Thattai, Govind, Herman, Grant, Sizov, Grigory, Guangyi, Zhang, Lakshminarayanan, Guna, Shojanazeri, Hamid, Zou, Han, Wang, Hannah, Zha, Hanwen, Habeeb, Haroun, Rudolph, Harrison, Suk, Helen, Aspegren, Henry, Goldman, Hunter, Damlaj, Ibrahim, Molybog, Igor, Tufanov, Igor, Veliche, Irina-Elena, Gat, Itai, Weissman, Jake, Geboski, James, Kohli, James, Asher, Japhet, Gaya, Jean-Baptiste, Marcus, Jeff, Tang, Jeff, Chan, Jennifer, Zhen, Jenny, Reizenstein, Jeremy, Teboul, Jeremy, Zhong, Jessica, Jin, Jian, Yang, Jingyi, Cummings, Joe, Carvill, Jon, Shepard, Jon, McPhie, Jonathan, Torres, Jonathan, Ginsburg, Josh, Wang, Junjie, Wu, Kai, U, Kam Hou, Saxena, Karan, Prasad, Karthik, Khandelwal, Kartikay, Zand, Katayoun, Matosich, Kathy, Veeraraghavan, Kaushik, Michelena, Kelly, Li, Keqian, Huang, Kun, Chawla, Kunal, Lakhotia, Kushal, Huang, Kyle, Chen, Lailin, Garg, Lakshya, A, Lavender, Silva, Leandro, Bell, Lee, Zhang, Lei, Guo, Liangpeng, Yu, Licheng, Moshkovich, Liron, Wehrstedt, Luca, Khabsa, Madian, Avalani, Manav, Bhatt, Manish, Tsimpoukelli, Maria, Mankus, Martynas, Hasson, Matan, Lennie, Matthew, Reso, Matthias, Groshev, Maxim, Naumov, Maxim, Lathi, Maya, Keneally, Meghan, Seltzer, Michael L., Valko, Michal, Restrepo, Michelle, Patel, Mihir, Vyatskov, Mik, Samvelyan, Mikayel, Clark, Mike, Macey, Mike, Wang, Mike, Hermoso, Miquel Jubert, Metanat, Mo, Rastegari, Mohammad, Bansal, Munish, Santhanam, Nandhini, Parks, Natascha, White, Natasha, Bawa, Navyata, Singhal, Nayan, Egebo, Nick, Usunier, Nicolas, Laptev, Nikolay Pavlovich, Dong, Ning, Zhang, Ning, Cheng, Norman, Chernoguz, Oleg, Hart, Olivia, Salpekar, Omkar, Kalinli, Ozlem, Kent, Parkin, Parekh, Parth, Saab, Paul, Balaji, Pavan, Rittner, Pedro, Bontrager, Philip, Roux, Pierre, Dollar, Piotr, Zvyagina, Polina, Ratanchandani, Prashant, Yuvraj, Pritish, Liang, Qian, Alao, Rachad, Rodriguez, Rachel, Ayub, Rafi, Murthy, Raghotham, Nayani, Raghu, Mitra, Rahul, Li, Raymond, Hogan, Rebekkah, Battey, Robin, Wang, Rocky, Maheswari, Rohan, Howes, Russ, Rinott, Ruty, Bondu, Sai Jayesh, Datta, Samyak, Chugh, Sara, Hunt, Sara, Dhillon, Sargun, Sidorov, Sasha, Pan, Satadru, Verma, Saurabh, Yamamoto, Seiji, Ramaswamy, Sharadh, Lindsay, Shaun, Feng, Sheng, Lin, Shenghao, Zha, Shengxin Cindy, Shankar, Shiva, Zhang, Shuqiang, Wang, Sinong, Agarwal, Sneha, Sajuyigbe, Soji, Chintala, Soumith, Max, Stephanie, Chen, Stephen, Kehoe, Steve, Satterfield, Steve, Govindaprasad, Sudarshan, Gupta, Sumit, Cho, Sungmin, Virk, Sunny, Subramanian, Suraj, Choudhury, Sy, Goldman, Sydney, Remez, Tal, Glaser, Tamar, Best, Tamara, Kohler, Thilo, Robinson, Thomas, Li, Tianhe, Zhang, Tianjun, Matthews, Tim, Chou, Timothy, Shaked, Tzook, Vontimitta, Varun, Ajayi, Victoria, Montanez, Victoria, Mohan, Vijai, Kumar, Vinay Satish, Mangla, Vishal, Albiero, Vítor, Ionescu, Vlad, Poenaru, Vlad, Mihailescu, Vlad Tiberiu, Ivanov, Vladimir, Li, Wei, Wang, Wenchen, Jiang, Wenwen, Bouaziz, Wes, Constable, Will, Tang, Xiaocheng, Wang, Xiaofang, Wu, Xiaojian, Wang, Xiaolan, Xia, Xide, Wu, Xilun, Gao, Xinbo, Chen, Yanjun, Hu, Ye, Jia, Ye, Qi, Ye, Li, Yenda, Zhang, Yilin, Zhang, Ying, Adi, Yossi, Nam, Youngjin, Yu, Wang, Hao, Yuchen, Qian, Yundi, He, Yuzi, Rait, Zach, DeVito, Zachary, Rosnbrick, Zef, Wen, Zhaoduo, Yang, Zhenyu, and Zhao, Zhiwei
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.
Published: 2024

6. Towards scalable efficient on-device ASR with transfer learning

Author: Pandey, Laxmi, Li, Ke, Guo, Jinxi, Paul, Debjyoti, Guo, Arthur, Mahadeokar, Jay, and Zhang, Xuedong
Subjects: Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Multilingual pretraining for transfer learning significantly boosts the robustness of low-resource monolingual ASR models. This study systematically investigates three main aspects: (a) the impact of transfer learning on model performance during initial training or fine-tuning, (b) the influence of transfer learning across dataset domains and languages, and (c) the effect on rare-word recognition compared to non-rare words. Our finding suggests that RNNT-loss pretraining, followed by monolingual fine-tuning with Minimum Word Error Rate (MinWER) loss, consistently reduces Word Error Rates (WER) across languages like Italian and French. WER Reductions (WERR) reach 36.2% and 42.8% compared to monolingual baselines for MLS and in-house datasets. Out-of-domain pretraining leads to 28% higher WERR than in-domain pretraining. Both rare and non-rare words benefit, with rare words showing greater improvements with out-of-domain pretraining, and non-rare words with in-domain pretraining.
Published: 2024

7. Effective internal language model training and fusion for factorized transducer model

Author: Guo, Jinxi, Moritz, Niko, Ma, Yingyi, Seide, Frank, Wu, Chunyang, Mahadeokar, Jay, Kalinli, Ozlem, Fuegen, Christian, and Seltzer, Mike
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: The internal language model (ILM) of the neural transducer has been widely studied. In most prior work, it is mainly used for estimating the ILM score and is subsequently subtracted during inference to facilitate improved integration with external language models. Recently, various of factorized transducer models have been proposed, which explicitly embrace a standalone internal language model for non-blank token prediction. However, even with the adoption of factorized transducer models, limited improvement has been observed compared to shallow fusion. In this paper, we propose a novel ILM training and decoding strategy for factorized transducer models, which effectively combines the blank, acoustic and ILM scores. Our experiments show a 17% relative improvement over the standard decoding method when utilizing a well-trained ILM and the proposed decoding strategy on LibriSpeech datasets. Furthermore, when compared to a strong RNN-T baseline enhanced with external LM fusion, the proposed model yields a 5.5% relative improvement on general-sets and an 8.9% WER reduction for rare words. The proposed model can achieve superior performance without relying on external language models, rendering it highly efficient for production use-cases. To further improve the performance, we propose a novel and memory-efficient ILM-fusion-aware minimum word error rate (MWER) training method which improves ILM integration significantly., Comment: Accepted to ICASSP 2024
Published: 2024

8. AudioChatLlama: Towards General-Purpose Speech Abilities for LLMs

Author: Fathullah, Yassir, Wu, Chunyang, Lakomkin, Egor, Li, Ke, Jia, Junteng, Shangguan, Yuan, Mahadeokar, Jay, Kalinli, Ozlem, Fuegen, Christian, and Seltzer, Mike
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: In this work, we extend the instruction-tuned Llama-2 model with end-to-end general-purpose speech processing and reasoning abilities while maintaining the wide range of original LLM capabilities, without using any carefully curated paired data. The resulting end-to-end model, named AudioChatLlama, can utilize audio prompts as a replacement for text and sustain a conversation. Such a model also has extended cross-modal capabilities such as being able to perform spoken question answering (QA), speech translation, and audio summarization amongst many other closed and open-domain tasks. This is unlike prior approaches in speech, in which LLMs are extended to handle audio for a limited number of pre-designated tasks. On both synthesized and recorded speech QA test sets, evaluations show that our end-to-end approach is on par with or outperforms cascaded systems (speech recognizer + LLM) in terms of modeling the response to a prompt. Furthermore, unlike cascades, our approach can interchange text and audio modalities and intrinsically utilize prior context in a conversation to provide better results.
Published: 2023

9. Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of A Multilingual ASR Model

Author: Xie, Jiamin, Li, Ke, Guo, Jinxi, Tjandra, Andros, Shangguan, Yuan, Sari, Leda, Wu, Chunyang, Jia, Junteng, Mahadeokar, Jay, and Kalinli, Ozlem
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Sound
Abstract: Neural network pruning offers an effective method for compressing a multilingual automatic speech recognition (ASR) model with minimal performance loss. However, it entails several rounds of pruning and re-training needed to be run for each language. In this work, we propose the use of an adaptive masking approach in two scenarios for pruning a multilingual ASR model efficiently, each resulting in sparse monolingual models or a sparse multilingual model (named as Dynamic ASR Pathways). Our approach dynamically adapts the sub-network, avoiding premature decisions about a fixed sub-network structure. We show that our approach outperforms existing pruning methods when targeting sparse monolingual models. Further, we illustrate that Dynamic ASR Pathways jointly discovers and trains better sub-networks (pathways) of a single multilingual model by adapting from different sub-network initializations, thereby reducing the need for language-specific pruning.
Published: 2023

10. TODM: Train Once Deploy Many Efficient Supernet-Based RNN-T Compression For On-device ASR Models

Author: Shangguan, Yuan, Yang, Haichuan, Li, Danni, Wu, Chunyang, Fathullah, Yassir, Wang, Dilin, Dalmia, Ayushi, Krishnamoorthi, Raghuraman, Kalinli, Ozlem, Jia, Junteng, Mahadeokar, Jay, Lei, Xin, Seltzer, Mike, and Chandra, Vikas
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Automatic Speech Recognition (ASR) models need to be optimized for specific hardware before they can be deployed on devices. This can be done by tuning the model's hyperparameters or exploring variations in its architecture. Re-training and re-validating models after making these changes can be a resource-intensive task. This paper presents TODM (Train Once Deploy Many), a new approach to efficiently train many sizes of hardware-friendly on-device ASR models with comparable GPU-hours to that of a single training job. TODM leverages insights from prior work on Supernet, where Recurrent Neural Network Transducer (RNN-T) models share weights within a Supernet. It reduces layer sizes and widths of the Supernet to obtain subnetworks, making them smaller models suitable for all hardware types. We introduce a novel combination of three techniques to improve the outcomes of the TODM Supernet: adaptive dropouts, an in-place Alpha-divergence knowledge distillation, and the use of ScaledAdam optimizer. We validate our approach by comparing Supernet-trained versus individually tuned Multi-Head State Space Model (MH-SSM) RNN-T using LibriSpeech. Results demonstrate that our TODM Supernet either matches or surpasses the performance of manually tuned models by up to a relative of 3% better in word error rate (WER), while efficiently keeping the cost of training many models at a small constant., Comment: Meta AI; Submitted to ICASSP 2024
Published: 2023

11. Prompting Large Language Models with Speech Recognition Abilities

Author: Fathullah, Yassir, Wu, Chunyang, Lakomkin, Egor, Jia, Junteng, Shangguan, Yuan, Li, Ke, Guo, Jinxi, Xiong, Wenhan, Mahadeokar, Jay, Kalinli, Ozlem, Fuegen, Christian, and Seltzer, Mike
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Large language models have proven themselves highly flexible, able to solve a wide range of generative tasks, such as abstractive summarization and open-ended question answering. In this paper we extend the capabilities of LLMs by directly attaching a small audio encoder allowing it to perform speech recognition. By directly prepending a sequence of audial embeddings to the text token embeddings, the LLM can be converted to an automatic speech recognition (ASR) system, and be used in the exact same manner as its textual counterpart. Experiments on Multilingual LibriSpeech (MLS) show that incorporating a conformer encoder into the open sourced LLaMA-7B allows it to outperform monolingual baselines by 18% and perform multilingual speech recognition despite LLaMA being trained overwhelmingly on English text. Furthermore, we perform ablation studies to investigate whether the LLM can be completely frozen during training to maintain its original capabilities, scaling up the audio encoder, and increasing the audio encoder striding to generate fewer embeddings. The results from these studies show that multilingual ASR is possible even when the LLM is frozen or when strides of almost 1 second are used in the audio encoder opening up the possibility for LLMs to operate on long-form audio.
Published: 2023

12. Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale

Author: Le, Matthew, Vyas, Apoorv, Shi, Bowen, Karrer, Brian, Sari, Leda, Moritz, Rashel, Williamson, Mary, Manohar, Vimal, Adi, Yossi, Mahadeokar, Jay, and Hsu, Wei-Ning
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Sound
Abstract: Large-scale generative models such as GPT and DALL-E have revolutionized the research community. These models not only generate high fidelity outputs, but are also generalists which can solve tasks not explicitly taught. In contrast, speech generative models are still primitive in terms of scale and task generalization. In this paper, we present Voicebox, the most versatile text-guided generative model for speech at scale. Voicebox is a non-autoregressive flow-matching model trained to infill speech, given audio context and text, trained on over 50K hours of speech that are not filtered or enhanced. Similar to GPT, Voicebox can perform many different tasks through in-context learning, but is more flexible as it can also condition on future context. Voicebox can be used for mono or cross-lingual zero-shot text-to-speech synthesis, noise removal, content editing, style conversion, and diverse sample generation. In particular, Voicebox outperforms the state-of-the-art zero-shot TTS model VALL-E on both intelligibility (5.9% vs 1.9% word error rates) and audio similarity (0.580 vs 0.681) while being up to 20 times faster. Audio samples can be found in \url{https://voicebox.metademolab.com}., Comment: Accepted to NeurIPS 2023
Published: 2023

13. Towards Selection of Text-to-speech Data to Augment ASR Training

Author: Liu, Shuo, Sarı, Leda, Wu, Chunyang, Keren, Gil, Shangguan, Yuan, Mahadeokar, Jay, and Kalinli, Ozlem
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Sound
Abstract: This paper presents a method for selecting appropriate synthetic speech samples from a given large text-to-speech (TTS) dataset as supplementary training data for an automatic speech recognition (ASR) model. We trained a neural network, which can be optimised using cross-entropy loss or Arcface loss, to measure the similarity of a synthetic data to real speech. We found that incorporating synthetic samples with considerable dissimilarity to real speech, owing in part to lexical differences, into ASR training is crucial for boosting recognition performance. Experimental results on Librispeech test sets indicate that, in order to maintain the same speech recognition accuracy as when using all TTS data, our proposed solution can reduce the size of the TTS data down below its $30\,\%$, which is superior to several baseline methods.
Published: 2023

14. Multi-Head State Space Model for Speech Recognition

Author: Fathullah, Yassir, Wu, Chunyang, Shangguan, Yuan, Jia, Junteng, Xiong, Wenhan, Mahadeokar, Jay, Liu, Chunxi, Shi, Yangyang, Kalinli, Ozlem, Seltzer, Mike, and Gales, Mark J. F.
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Sound
Abstract: State space models (SSMs) have recently shown promising results on small-scale sequence and language modelling tasks, rivalling and outperforming many attention-based approaches. In this paper, we propose a multi-head state space (MH-SSM) architecture equipped with special gating mechanisms, where parallel heads are taught to learn local and global temporal dynamics on sequence data. As a drop-in replacement for multi-head attention in transformer encoders, this new model significantly outperforms the transformer transducer on the LibriSpeech speech recognition corpus. Furthermore, we augment the transformer block with MH-SSMs layers, referred to as the Stateformer, achieving state-of-the-art performance on the LibriSpeech task, with word error rates of 1.76\%/4.37\% on the development and 1.91\%/4.36\% on the test sets without using an external language model., Comment: Interspeech 2023
Published: 2023

15. Improving Fast-slow Encoder based Transducer with Streaming Deliberation

Author: Li, Ke, Mahadeokar, Jay, Guo, Jinxi, Shi, Yangyang, Keren, Gil, Kalinli, Ozlem, Seltzer, Michael L., and Le, Duc
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper introduces a fast-slow encoder based transducer with streaming deliberation for end-to-end automatic speech recognition. We aim to improve the recognition accuracy of the fast-slow encoder based transducer while keeping its latency low by integrating a streaming deliberation model. Specifically, the deliberation model leverages partial hypotheses from the streaming fast encoder and implicitly learns to correct recognition errors. We modify the parallel beam search algorithm for fast-slow encoder based transducer to be efficient and compatible with the deliberation model. In addition, the deliberation model is designed to process streaming data. To further improve the deliberation performance, a simple text augmentation approach is explored. We also compare LSTM and Conformer models for encoding partial hypotheses. Experiments on Librispeech and in-house data show relative WER reductions (WERRs) from 3% to 5% with a slight increase in model size and negligible extra token emission latency compared with fast-slow encoder based transducer. Compared with vanilla neural transducers, the proposed deliberation model together with fast-slow encoder based transducer obtains relative 10-11% WERRs on Librispeech and around relative 6% WERR on in-house data with smaller emission delays., Comment: Submitted to ICASSP 2023
Published: 2022

16. Dynamic Speech Endpoint Detection with Regression Targets

Author: Liang, Dawei, Su, Hang, Singh, Tarun, Mahadeokar, Jay, Puri, Shanil, Zhu, Jiedan, Thomaz, Edison, and Seltzer, Mike
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Interactive voice assistants have been widely used as input interfaces in various scenarios, e.g. on smart homes devices, wearables and on AR devices. Detecting the end of a speech query, i.e. speech end-pointing, is an important task for voice assistants to interact with users. Traditionally, speech end-pointing is based on pure classification methods along with arbitrary binary targets. In this paper, we propose a novel regression-based speech end-pointing model, which enables an end-pointer to adjust its detection behavior based on context of user queries. Specifically, we present a pause modeling method and show its effectiveness for dynamic end-pointing. Based on our experiments with vendor-collected smartphone and wearables speech queries, our strategy shows a better trade-off between endpointing latency and accuracy, compared to the traditional classification-based method. We further discuss the benefits of this model and generalization of the framework in the paper., Comment: Manuscript submitted to ICASSP 2023
Published: 2022

17. Anchored Speech Recognition with Neural Transducers

Author: Raj, Desh, Jia, Junteng, Mahadeokar, Jay, Wu, Chunyang, Moritz, Niko, Zhang, Xiaohui, and Kalinli, Ozlem
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Neural transducers have achieved human level performance on standard speech recognition benchmarks. However, their performance significantly degrades in the presence of cross-talk, especially when the primary speaker has a low signal-to-noise ratio. Anchored speech recognition refers to a class of methods that use information from an anchor segment (e.g., wake-words) to recognize device-directed speech while ignoring interfering background speech. In this paper, we investigate anchored speech recognition to make neural transducers robust to background speech. We extract context information from the anchor segment with a tiny auxiliary network, and use encoder biasing and joiner gating to guide the transducer towards the target speech. Moreover, to improve the robustness of context embedding extraction, we propose auxiliary training objectives to disentangle lexical content from speaking style. We evaluate our methods on synthetic LibriSpeech-based mixtures comprising several SNR and overlap conditions; they improve relative word error rates by 19.6% over a strong baseline, when averaged over all conditions., Comment: To appear at IEEE ICASSP 2023
Published: 2022

18. An Investigation of Monotonic Transducers for Large-Scale Automatic Speech Recognition

Author: Moritz, Niko, Seide, Frank, Le, Duc, Mahadeokar, Jay, and Fuegen, Christian
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: The two most popular loss functions for streaming end-to-end automatic speech recognition (ASR) are RNN-Transducer (RNN-T) and connectionist temporal classification (CTC). Between these two loss types we can classify the monotonic RNN-T (MonoRNN-T) and the recently proposed CTC-like Transducer (CTC-T). Monotonic transducers have a few advantages. First, RNN-T can suffer from runaway hallucination, where a model keeps emitting non-blank symbols without advancing in time. Secondly, monotonic transducers consume exactly one model score per time step and are therefore more compatible with traditional FST-based ASR decoders. However, the MonoRNN-T so far has been found to have worse accuracy than RNN-T. It does not have to be that way: By regularizing the training via joint LAS training or parameter initialization from RNN-T, both MonoRNN-T and CTC-T perform as well or better than RNN-T. This is demonstrated for LibriSpeech and for a large-scale in-house data set., Comment: Accepted to SLT 2022
Published: 2022

19. Federated Domain Adaptation for ASR with Full Self-Supervision

Author: Jia, Junteng, Mahadeokar, Jay, Zheng, Weiyi, Shangguan, Yuan, Kalinli, Ozlem, and Seide, Frank
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Cross-device federated learning (FL) protects user privacy by collaboratively training a model on user devices, therefore eliminating the need for collecting, storing, and manually labeling user data. While important topics such as the FL training algorithm, non-IID-ness, and Differential Privacy have been well studied in the literature, this paper focuses on two challenges of practical importance for improving on-device ASR: the lack of ground-truth transcriptions and the scarcity of compute resource and network bandwidth on edge devices. First, we propose a FL system for on-device ASR domain adaptation with full self-supervision, which uses self-labeling together with data augmentation and filtering techniques. The system can improve a strong Emformer-Transducer based ASR model pretrained on out-of-domain data, using in-domain audio without any ground-truth transcriptions. Second, to reduce the training cost, we propose a self-restricted RNN Transducer (SR-RNN-T) loss, a variant of alignment-restricted RNN-T that uses Viterbi alignments from self-supervision. To further reduce the compute and network cost, we systematically explore adapting only a subset of weights in the Emformer-Transducer. Our best training recipe achieves a $12.9\%$ relative WER reduction over the strong out-of-domain baseline, which equals $70\%$ of the reduction achievable with full human supervision and centralized training.
Published: 2022

20. Streaming parallel transducer beam search with fast-slow cascaded encoders

Author: Mahadeokar, Jay, Shi, Yangyang, Li, Ke, Le, Duc, Zhu, Jiedan, Chandra, Vikas, Kalinli, Ozlem, and Seltzer, Michael L
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Streaming ASR with strict latency constraints is required in many speech recognition applications. In order to achieve the required latency, streaming ASR models sacrifice accuracy compared to non-streaming ASR models due to lack of future input context. Previous research has shown that streaming and non-streaming ASR for RNN Transducers can be unified by cascading causal and non-causal encoders. This work improves upon this cascaded encoders framework by leveraging two streaming non-causal encoders with variable input context sizes that can produce outputs at different audio intervals (e.g. fast and slow). We propose a novel parallel time-synchronous beam search algorithm for transducers that decodes from fast-slow encoders, where the slow encoder corrects the mistakes generated from the fast encoder. The proposed algorithm, achieves up to 20% WER reduction with a slight increase in token emission delays on the public Librispeech dataset and in-house datasets. We also explore techniques to reduce the computation by distributing processing between the fast and slow encoders. Lastly, we explore sharing the parameters in the fast encoder to reduce the memory footprint. This enables low latency processing on edge devices with low computation cost and a low memory footprint., Comment: 5 pages, 2 figures, Interspeech 2022 submission
Published: 2022

21. TorchAudio: Building Blocks for Audio and Speech Processing

Author: Yang, Yao-Yuan, Hira, Moto, Ni, Zhaoheng, Chourdia, Anjali, Astafurov, Artyom, Chen, Caroline, Yeh, Ching-Feng, Puhrsch, Christian, Pollack, David, Genzel, Dmitriy, Greenberg, Donny, Yang, Edward Z., Lian, Jason, Mahadeokar, Jay, Hwang, Jeff, Chen, Ji, Goldsborough, Peter, Roy, Prabhat, Narenthiran, Sean, Watanabe, Shinji, Chintala, Soumith, Quenneville-Bélair, Vincent, and Shi, Yangyang
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: This document describes version 0.10 of TorchAudio: building blocks for machine learning applications in the audio and speech processing domain. The objective of TorchAudio is to accelerate the development and deployment of machine learning applications for researchers and engineers by providing off-the-shelf building blocks. The building blocks are designed to be GPU-compatible, automatically differentiable, and production-ready. TorchAudio can be easily installed from Python Package Index repository and the source code is publicly available under a BSD-2-Clause License (as of September 2021) at https://github.com/pytorch/audio. In this document, we provide an overview of the design principles, functionalities, and benchmarks of TorchAudio. We also benchmark our implementation of several audio and speech operations and models. We verify through the benchmarks that our implementations of various operations and models are valid and perform similarly to other publicly available implementations., Comment: Accepted by ICASSP 2022
Published: 2021

22. Streaming Transformer Transducer Based Speech Recognition Using Non-Causal Convolution

Author: Shi, Yangyang, Wu, Chunyang, Wang, Dilin, Xiao, Alex, Mahadeokar, Jay, Zhang, Xiaohui, Liu, Chunxi, Li, Ke, Shangguan, Yuan, Nagaraja, Varun, Kalinli, Ozlem, and Seltzer, Mike
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: This paper improves the streaming transformer transducer for speech recognition by using non-causal convolution. Many works apply the causal convolution to improve streaming transformer ignoring the lookahead context. We propose to use non-causal convolution to process the center block and lookahead context separately. This method leverages the lookahead context in convolution and maintains similar training and decoding efficiency. Given the similar latency, using the non-causal convolution with lookahead context gives better accuracy than causal convolution, especially for open-domain dictation scenarios. Besides, this paper applies talking-head attention and a novel history context compression scheme to further improve the performance. The talking-head attention improves the multi-head self-attention by transferring information among different heads. The history context compression method introduces more extended history context compactly. On our in-house data, the proposed methods improve a small Emformer baseline with lookahead context by relative WERR 5.1\%, 14.5\%, 8.4\% on open-domain dictation, assistant general scenarios, and assistant calling scenarios, respectively., Comment: 5 pages, 3 figures, submit to ICASSP 2022
Published: 2021

23. Flexi-Transducer: Optimizing Latency, Accuracy and Compute forMulti-Domain On-Device Scenarios

Author: Mahadeokar, Jay, Shi, Yangyang, Shangguan, Yuan, Wu, Chunyang, Xiao, Alex, Su, Hang, Le, Duc, Kalinli, Ozlem, Fuegen, Christian, and Seltzer, Michael L.
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Often, the storage and computational constraints of embeddeddevices demand that a single on-device ASR model serve multiple use-cases / domains. In this paper, we propose aFlexibleTransducer(FlexiT) for on-device automatic speech recognition to flexibly deal with multiple use-cases / domains with different accuracy and latency requirements. Specifically, using a single compact model, FlexiT provides a fast response for voice commands, and accurate transcription but with more latency for dictation. In order to achieve flexible and better accuracy and latency trade-offs, the following techniques are used. Firstly, we propose using domain-specific altering of segment size for Emformer encoder that enables FlexiT to achieve flexible de-coding. Secondly, we use Alignment Restricted RNNT loss to achieve flexible fine-grained control on token emission latency for different domains. Finally, we add a domain indicator vector as an additional input to the FlexiT model. Using the combination of techniques, we show that a single model can be used to improve WERs and real time factor for dictation scenarios while maintaining optimal latency for voice commands use-cases, Comment: Submitted to Interspeech 2021 (under review)
Published: 2021

24. Contextualized Streaming End-to-End Speech Recognition with Trie-Based Deep Biasing and Shallow Fusion

Author: Le, Duc, Jain, Mahaveer, Keren, Gil, Kim, Suyoun, Shi, Yangyang, Mahadeokar, Jay, Chan, Julian, Shangguan, Yuan, Fuegen, Christian, Kalinli, Ozlem, Saraf, Yatharth, and Seltzer, Michael L.
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: How to leverage dynamic contextual information in end-to-end speech recognition has remained an active research area. Previous solutions to this problem were either designed for specialized use cases that did not generalize well to open-domain scenarios, did not scale to large biasing lists, or underperformed on rare long-tail words. We address these limitations by proposing a novel solution that combines shallow fusion, trie-based deep biasing, and neural network language model contextualization. These techniques result in significant 19.5% relative Word Error Rate improvement over existing contextual biasing approaches and 5.4%-9.3% improvement compared to a strong hybrid baseline on both open-domain and constrained contextualization tasks, where the targets consist of mostly rare long-tail words. Our final system remains lightweight and modular, allowing for quick modification without model re-training., Comment: Accepted for presentation at INTERSPEECH 2021
Published: 2021

25. Dynamic Encoder Transducer: A Flexible Solution For Trading Off Accuracy For Latency

Author: Shi, Yangyang, Nagaraja, Varun, Wu, Chunyang, Mahadeokar, Jay, Le, Duc, Prabhavalkar, Rohit, Xiao, Alex, Yeh, Ching-Feng, Chan, Julian, Fuegen, Christian, Kalinli, Ozlem, and Seltzer, Michael L.
Subjects: Computer Science - Computation and Language
Abstract: We propose a dynamic encoder transducer (DET) for on-device speech recognition. One DET model scales to multiple devices with different computation capacities without retraining or finetuning. To trading off accuracy and latency, DET assigns different encoders to decode different parts of an utterance. We apply and compare the layer dropout and the collaborative learning for DET training. The layer dropout method that randomly drops out encoder layers in the training phase, can do on-demand layer dropout in decoding. Collaborative learning jointly trains multiple encoders with different depths in one single model. Experiment results on Librispeech and in-house data show that DET provides a flexible accuracy and latency trade-off. Results on Librispeech show that the full-size encoder in DET relatively reduces the word error rate of the same size baseline by over 8%. The lightweight encoder in DET trained with collaborative learning reduces the model size by 25% but still gets similar WER as the full-size baseline. DET gets similar accuracy as a baseline model with better latency on a large in-house data set by assigning a lightweight encoder for the beginning part of one utterance and a full-size encoder for the rest., Comment: 5 pages, 2 figures, submitted Interspeech 2021
Published: 2021

26. Dissecting User-Perceived Latency of On-Device E2E Speech Recognition

Author: Shangguan, Yuan, Prabhavalkar, Rohit, Su, Hang, Mahadeokar, Jay, Shi, Yangyang, Zhou, Jiatong, Wu, Chunyang, Le, Duc, Kalinli, Ozlem, Fuegen, Christian, and Seltzer, Michael L.
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: As speech-enabled devices such as smartphones and smart speakers become increasingly ubiquitous, there is growing interest in building automatic speech recognition (ASR) systems that can run directly on-device; end-to-end (E2E) speech recognition models such as recurrent neural network transducers and their variants have recently emerged as prime candidates for this task. Apart from being accurate and compact, such systems need to decode speech with low user-perceived latency (UPL), producing words as soon as they are spoken. This work examines the impact of various techniques - model architectures, training criteria, decoding hyperparameters, and endpointer parameters - on UPL. Our analyses suggest that measures of model size (parameters, input chunk sizes), or measures of computation (e.g., FLOPS, RTF) that reflect the model's ability to process input frames are not always strongly correlated with observed UPL. Thus, conventional algorithmic latency measurements might be inadequate in accurately capturing latency observed when models are deployed on embedded devices. Instead, we find that factors affecting token emission latency, and endpointing behavior have a larger impact on UPL. We achieve the best trade-off between latency and word error rate when performing ASR jointly with endpointing, while utilizing the recently proposed alignment regularization mechanism., Comment: Proc. of Interspeech 2021
Published: 2021

27. Memory-efficient Speech Recognition on Smart Devices

Author: Venkatesh, Ganesh, Valliappan, Alagappan, Mahadeokar, Jay, Shangguan, Yuan, Fuegen, Christian, Seltzer, Michael L., and Chandra, Vikas
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recurrent transducer models have emerged as a promising solution for speech recognition on the current and next generation smart devices. The transducer models provide competitive accuracy within a reasonable memory footprint alleviating the memory capacity constraints in these devices. However, these models access parameters from off-chip memory for every input time step which adversely effects device battery life and limits their usability on low-power devices. We address transducer model's memory access concerns by optimizing their model architecture and designing novel recurrent cell designs. We demonstrate that i) model's energy cost is dominated by accessing model weights from off-chip memory, ii) transducer model architecture is pivotal in determining the number of accesses to off-chip memory and just model size is not a good proxy, iii) our transducer model optimizations and novel recurrent cell reduces off-chip memory accesses by 4.5x and model size by 2x with minimal accuracy impact.
Published: 2021

28. Deep Shallow Fusion for RNN-T Personalization

Author: Le, Duc, Keren, Gil, Chan, Julian, Mahadeokar, Jay, Fuegen, Christian, and Seltzer, Michael L.
Subjects: Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: End-to-end models in general, and Recurrent Neural Network Transducer (RNN-T) in particular, have gained significant traction in the automatic speech recognition community in the last few years due to their simplicity, compactness, and excellent performance on generic transcription tasks. However, these models are more challenging to personalize compared to traditional hybrid systems due to the lack of external language models and difficulties in recognizing rare long-tail words, specifically entity names. In this work, we present novel techniques to improve RNN-T's ability to model rare WordPieces, infuse extra information into the encoder, enable the use of alternative graphemic pronunciations, and perform deep fusion with personalized language models for more robust biasing. We show that these combined techniques result in 15.4%-34.5% relative Word Error Rate improvement compared to a strong RNN-T baseline which uses shallow fusion and text-to-speech augmentation. Our work helps push the boundary of RNN-T personalization and close the gap with hybrid systems on use cases where biasing and entity recognition are crucial., Comment: To appear at SLT 2021
Published: 2020

29. Alignment Restricted Streaming Recurrent Neural Network Transducer

Author: Mahadeokar, Jay, Shangguan, Yuan, Le, Duc, Keren, Gil, Su, Hang, Le, Thong, Yeh, Ching-Feng, Fuegen, Christian, and Seltzer, Michael L.
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: There is a growing interest in the speech community in developing Recurrent Neural Network Transducer (RNN-T) models for automatic speech recognition (ASR) applications. RNN-T is trained with a loss function that does not enforce temporal alignment of the training transcripts and audio. As a result, RNN-T models built with uni-directional long short term memory (LSTM) encoders tend to wait for longer spans of input audio, before streaming already decoded ASR tokens. In this work, we propose a modification to the RNN-T loss function and develop Alignment Restricted RNN-T (Ar-RNN-T) models, which utilize audio-text alignment information to guide the loss computation. We compare the proposed method with existing works, such as monotonic RNN-T, on LibriSpeech and in-house datasets. We show that the Ar-RNN-T loss provides a refined control to navigate the trade-offs between the token emission delays and the Word Error Rate (WER). The Ar-RNN-T models also improve downstream applications such as the ASR End-pointing by guaranteeing token emissions within any given range of latency. Moreover, the Ar-RNN-T loss allows for bigger batch sizes and 4 times higher throughput for our LSTM model architecture, enabling faster training and convergence on GPUs., Comment: Accepted for presentation at IEEE Spoken Language Technology Workshop (SLT) 2021
Published: 2020

30. Improved Neural Language Model Fusion for Streaming Recurrent Neural Network Transducer

Author: Kim, Suyoun, Shangguan, Yuan, Mahadeokar, Jay, Bruguier, Antoine, Fuegen, Christian, Seltzer, Michael L., and Le, Duc
Subjects: Computer Science - Computation and Language
Abstract: Recurrent Neural Network Transducer (RNN-T), like most end-to-end speech recognition model architectures, has an implicit neural network language model (NNLM) and cannot easily leverage unpaired text data during training. Previous work has proposed various fusion methods to incorporate external NNLMs into end-to-end ASR to address this weakness. In this paper, we propose extensions to these techniques that allow RNN-T to exploit external NNLMs during both training and inference time, resulting in 13-18% relative Word Error Rate improvement on Librispeech compared to strong baselines. Furthermore, our methods do not incur extra algorithmic latency and allow for flexible plug-and-play of different NNLMs without re-training. We also share in-depth analysis to better understand the benefits of the different NNLM fusion methods. Our work provides a reliable technique for leveraging unpaired text data to significantly improve RNN-T while keeping the system streamable, flexible, and lightweight., Comment: submitted to ICASSP 2021
Published: 2020

31. Contextual RNN-T For Open Domain ASR

Author: Jain, Mahaveer, Keren, Gil, Mahadeokar, Jay, Zweig, Geoffrey, Metze, Florian, and Saraf, Yatharth
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Sound
Abstract: End-to-end (E2E) systems for automatic speech recognition (ASR), such as RNN Transducer (RNN-T) and Listen-Attend-Spell (LAS) blend the individual components of a traditional hybrid ASR system - acoustic model, language model, pronunciation model - into a single neural network. While this has some nice advantages, it limits the system to be trained using only paired audio and text. Because of this, E2E models tend to have difficulties with correctly recognizing rare words that are not frequently seen during training, such as entity names. In this paper, we propose modifications to the RNN-T model that allow the model to utilize additional metadata text with the objective of improving performance on these named entity words. We evaluate our approach on an in-house dataset sampled from de-identified public social media videos, which represent an open domain ASR task. By using an attention model and a biasing model to leverage the contextual metadata that accompanies a video, we observe a relative improvement of about 16% in Word Error Rate on Named Entities (WER-NE) for videos with related metadata.
Published: 2020

32. RNN-T For Latency Controlled ASR With Improved Beam Search

Author: Jain, Mahaveer, Schubert, Kjell, Mahadeokar, Jay, Yeh, Ching-Feng, Kalgaonkar, Kaustubh, Sriram, Anuroop, Fuegen, Christian, and Seltzer, Michael L.
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Neural transducer-based systems such as RNN Transducers (RNN-T) for automatic speech recognition (ASR) blend the individual components of a traditional hybrid ASR systems (acoustic model, language model, punctuation model, inverse text normalization) into one single model. This greatly simplifies training and inference and hence makes RNN-T a desirable choice for ASR systems. In this work, we investigate use of RNN-T in applications that require a tune-able latency budget during inference time. We also improved the decoding speed of the originally proposed RNN-T beam search algorithm. We evaluated our proposed system on English videos ASR dataset and show that neural RNN-T models can achieve comparable WER and better computational efficiency compared to a well tuned hybrid ASR baseline.
Published: 2019

33. Spatial Attention for Far-field Speech Recognition with Deep Beamforming Neural Networks

Author: He, Weipeng, Lu, Lu, Zhang, Biqiao, Mahadeokar, Jay, Kalgaonkar, Kaustubh, and Fuegen, Christian
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: In this paper, we introduce spatial attention for refining the information in multi-direction neural beamformer for far-field automatic speech recognition. Previous approaches of neural beamformers with multiple look directions, such as the factored complex linear projection, have shown promising results. However, the features extracted by such methods contain redundant information, as only the direction of the target speech is relevant. We propose using a spatial attention subnet to weigh the features from different directions, so that the subsequent acoustic model could focus on the most relevant features for the speech recognition. Our experimental results show that spatial attention achieves up to 9% relative word error rate improvement over methods without the attention., Comment: To be presented at ICASSP 2020
Published: 2019

34. Transformer-Transducer: End-to-End Speech Recognition with Self-Attention

Author: Yeh, Ching-Feng, Mahadeokar, Jay, Kalgaonkar, Kaustubh, Wang, Yongqiang, Le, Duc, Jain, Mahaveer, Schubert, Kjell, Fuegen, Christian, and Seltzer, Michael L.
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Sound
Abstract: We explore options to use Transformer networks in neural transducer for end-to-end speech recognition. Transformer networks use self-attention for sequence modeling and comes with advantages in parallel computation and capturing contexts. We propose 1) using VGGNet with causal convolution to incorporate positional information and reduce frame rate for efficient inference 2) using truncated self-attention to enable streaming for Transformer and reduce computational complexity. All experiments are conducted on the public LibriSpeech corpus. The proposed Transformer-Transducer outperforms neural transducer with LSTM/BLSTM networks and achieved word error rates of 6.37 % on the test-clean set and 15.30 % on the test-other set, while remaining streamable, compact with 45.7M parameters for the entire system, and computationally efficient with complexity of O(T), where T is input sequence length.
Published: 2019

35. Transformer-based Acoustic Modeling for Hybrid Speech Recognition

Author: Wang, Yongqiang, Mohamed, Abdelrahman, Le, Duc, Liu, Chunxi, Xiao, Alex, Mahadeokar, Jay, Huang, Hongzhao, Tjandra, Andros, Zhang, Xiaohui, Zhang, Frank, Fuegen, Christian, Zweig, Geoffrey, and Seltzer, Michael L.
Subjects: Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We propose and evaluate transformer-based acoustic models (AMs) for hybrid speech recognition. Several modeling choices are discussed in this work, including various positional embedding methods and an iterated loss to enable training deep transformers. We also present a preliminary study of using limited right context in transformer models, which makes it possible for streaming applications. We demonstrate that on the widely used Librispeech benchmark, our transformer-based AM outperforms the best published hybrid result by 19% to 26% relative when the standard n-gram language model (LM) is used. Combined with neural network LM for rescoring, our proposed approach achieves state-of-the-art results on Librispeech. Our findings are also confirmed on a much larger internal dataset., Comment: to appear in ICASSP 2020
Published: 2019
Full Text: View/download PDF

36. Effective Internal Language Model Training and Fusion for Factorized Transducer Model

Author: Guo, Jinxi, primary, Moritz, Niko, additional, Ma, Yingyi, additional, Seide, Frank, additional, Wu, Chunyang, additional, Mahadeokar, Jay, additional, Kalinli, Ozlem, additional, Fuegen, Christian, additional, and Seltzer, Mike, additional
Published: 2024
Full Text: View/download PDF

37. Prompting Large Language Models with Speech Recognition Abilities

Author: Fathullah, Yassir, primary, Wu, Chunyang, additional, Lakomkin, Egor, additional, Jia, Junteng, additional, Shangguan, Yuan, additional, Li, Ke, additional, Guo, Jinxi, additional, Xiong, Wenhan, additional, Mahadeokar, Jay, additional, Kalinli, Ozlem, additional, Fuegen, Christian, additional, and Seltzer, Mike, additional
Published: 2024
Full Text: View/download PDF

38. Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of a Multilingual ASR Model

Author: Xie, Jiamin, primary, Li, Ke, additional, Guo, Jinxi, additional, Tjandra, Andros, additional, Shangguan, Yuan, additional, Sari, Leda, additional, Wu, Chunyang, additional, Jia, Junteng, additional, Mahadeokar, Jay, additional, and Kalinli, Ozlem, additional
Published: 2024
Full Text: View/download PDF

39. TODM: Train Once Deploy Many Efficient Supernet-Based RNN-T Compression For On-Device ASR Models

Author: Shangguan, Yuan, primary, Yang, Haichuan, additional, Li, Danni, additional, Wu, Chunyang, additional, Fathullah, Yassir, additional, Wang, Dilin, additional, Dalmia, Ayushi, additional, Krishnamoorthi, Raghuraman, additional, Kalinli, Ozlem, additional, Jia, Junteng, additional, Mahadeokar, Jay, additional, Lei, Xin, additional, Seltzer, Mike, additional, and Chandra, Vikas, additional
Published: 2024
Full Text: View/download PDF

40. Joint Federated Learning and Personalization for on-Device ASR

Author: Jia, Junteng, primary, Li, Ke, additional, Malek, Mani, additional, Malik, Kshitiz, additional, Mahadeokar, Jay, additional, Kalinli, Ozlem, additional, and Seide, Frank, additional
Published: 2023
Full Text: View/download PDF

41. Multi-Head State Space Model for Speech Recognition

Author: Fathullah, Yassir, primary, Wu, Chunyang, additional, Shangguan, Yuan, additional, Jia, Junteng, additional, Xiong, Wenhan, additional, Mahadeokar, Jay, additional, Liu, Chunxi, additional, Shi, Yangyang, additional, Kalinli, Ozlem, additional, Seltzer, Mike, additional, and Gales, Mark J. F., additional
Published: 2023
Full Text: View/download PDF

42. Anchored Speech Recognition with Neural Transducers

Author: Raj, Desh, primary, Jia, Junteng, additional, Mahadeokar, Jay, additional, Wu, Chunyang, additional, Moritz, Niko, additional, Zhang, Xiaohui, additional, and Kalinli, Ozlem, additional
Published: 2023
Full Text: View/download PDF

43. Improving fast-slow Encoder based Transducer with Streaming Deliberation

Author: Li, Ke, primary, Mahadeokar, Jay, additional, Guo, Jinxi, additional, Shi, Yangyang, additional, Keren, Gil, additional, Kalinli, Ozlem, additional, Seltzer, Michael L., additional, and Le, Duc, additional
Published: 2023
Full Text: View/download PDF

44. Dynamic Speech Endpoint Detection with Regression Targets

Author: Liang, Dawei, primary, Su, Hang, additional, Singh, Tarun, additional, Mahadeokar, Jay, additional, Puri, Shanil, additional, Zhu, Jiedan, additional, Thomaz, Edison, additional, and Seltzer, Mike, additional
Published: 2023
Full Text: View/download PDF

45. An Investigation of Monotonic Transducers for Large-Scale Automatic Speech Recognition

Author: Moritz, Niko, primary, Seide, Frank, additional, Le, Duc, additional, Mahadeokar, Jay, additional, and Fuegen, Christian, additional
Published: 2023
Full Text: View/download PDF

46. Faster Replacement Paths Algorithm for Undirected, Positive Integer Weighted Graphs with Small Diameter

Author: Mahadeokar, Jay, Saxena, Sanjeev, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Arumugam, S., editor, and Smyth, W. F., editor
Published: 2012
Full Text: View/download PDF

47. Streaming parallel transducer beam search with fast slow cascaded encoders

Author: Mahadeokar, Jay, primary, Shi, Yangyang, additional, Li, Ke, additional, Le, Duc, additional, Zhu, Jiedan, additional, Chandra, Vikas, additional, Kalinli, Ozlem, additional, and Seltzer, Michael, additional
Published: 2022
Full Text: View/download PDF

48. Federated Domain Adaptation for ASR with Full Self-Supervision

Author: Jia, Junteng, primary, Mahadeokar, Jay, additional, Zheng, Weiyi, additional, Shangguan, Yuan, additional, Kalinli, Ozlem, additional, and Seide, Frank, additional
Published: 2022
Full Text: View/download PDF

49. Streaming Transformer Transducer based Speech Recognition Using Non-Causal Convolution

Author: Shi, Yangyang, primary, Wu, Chunyang, additional, Wang, Dilin, additional, Xiao, Alex, additional, Mahadeokar, Jay, additional, Zhang, Xiaohui, additional, Liu, Chunxi, additional, Li, Ke, additional, Shangguan, Yuan, additional, Nagaraja, Varun, additional, Kalinli, Ozlem, additional, and Seltzer, Mike, additional
Published: 2022
Full Text: View/download PDF

50. Faster algorithm to find anti-risk path between two nodes of an undirected graph

Author: Mahadeokar, Jay and Saxena, Sanjeev
Published: 2014
Full Text: View/download PDF

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Database

Publisher

109 results on '"Mahadeokar, Jay"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources