Author: "Garg, Vineet" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Garg, Vineet"' showing total 25 results


1. Device-Directed Speech Detection for Follow-up Conversations Using Large Language Models

Author: Rudovic, Ognjen, Dighe, Pranay, Su, Yi, Garg, Vineet, Dharur, Sameer, Niu, Xiaochuan, Abdelaziz, Ahmed H., Adya, Saurabh, and Tewfik, Ahmed
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Sound
Abstract: Follow-up conversations with virtual assistants (VAs) enable a user to seamlessly interact with a VA without the need to repeatedly invoke it using a keyword (after the first query). Therefore, accurate Device-directed Speech Detection (DDSD) from the follow-up queries is critical for enabling naturalistic user experience. To this end, we explore the notion of Large Language Models (LLMs) and model the first query when making inference about the follow-ups (based on the ASR-decoded text), via prompting of a pretrained LLM, or by adapting a binary classifier on top of the LLM. In doing so, we also exploit the ASR uncertainty when designing the LLM prompts. We show on the real-world dataset of follow-up conversations that this approach yields large gains (20-40% reduction in false alarms at 10% fixed false rejects) due to the joint modeling of the previous speech context and ASR uncertainty, compared to when follow-ups are modeled alone.
Published: 2024

2. Comparative Analysis of Personalized Voice Activity Detection Systems: Assessing Real-World Effectiveness

Author: Kumar, Satyam, Buddi, Sai Srujana, Sarawgi, Utkarsh Oggy, Garg, Vineet, Ranjan, Shivesh, Rudovic, Ognjen, Abdelaziz, Ahmed Hussen, and Adya, Saurabh
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Human-Computer Interaction, Computer Science - Machine Learning
Abstract: Voice activity detection (VAD) is a critical component in various applications such as speech recognition, speech enhancement, and hands-free communication systems. With the increasing demand for personalized and context-aware technologies, the need for effective personalized VAD systems has become paramount. In this paper, we present a comparative analysis of Personalized Voice Activity Detection (PVAD) systems to assess their real-world effectiveness. We introduce a comprehensive approach to assess PVAD systems, incorporating various performance metrics such as frame-level and utterance-level error rates, detection latency and accuracy, alongside user-level analysis. Through extensive experimentation and evaluation, we provide a thorough understanding of the strengths and limitations of various PVAD variants. This paper advances the understanding of PVAD technology by offering insights into its efficacy and viability in practical applications using a comprehensive set of metrics.
Published: 2024

3. Importance of dual vocational education

Author: Garg, Vineet Kumar and Singh, Lokendra Vikram
Published: 2017

4. Vocational education & training scenario (International Perspective)

Author: Singh, Lokendra Vikram and Garg, Vineet Kumar
Published: 2017

5. Streaming Anchor Loss: Augmenting Supervision with Temporal Significance

Author: Sarawgi, Utkarsh Oggy, Berkowitz, John, Garg, Vineet, Kundu, Arnav, Cho, Minsik, Buddi, Sai Srujana, Adya, Saurabh, and Tewfik, Ahmed
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, I.2.6, I.5.1, I.5.4, I.6.5
Abstract: Streaming neural network models for fast frame-wise responses to various speech and sensory signals are widely adopted on resource-constrained platforms. Hence, increasing the learning capacity of such streaming models (i.e., by adding more parameters) to improve the predictive power may not be viable for real-world tasks. In this work, we propose a new loss, Streaming Anchor Loss (SAL), to better utilize the given learning capacity by encouraging the model to learn more from essential frames. More specifically, our SAL and its focal variations dynamically modulate the frame-wise cross entropy loss based on the importance of the corresponding frames so that a higher loss penalty is assigned for frames within the temporal proximity of semantically critical events. Therefore, our loss ensures that the model training focuses on predicting the relatively rare but task-relevant frames. Experimental results with standard lightweight convolutional and recurrent streaming networks on three different speech based detection tasks demonstrate that SAL enables the model to learn the overall task more effectively with improved accuracy and latency, without any additional data, model parameters, or architectural changes.
Comment: Published at IEEE ICASSP 2024, please see https://ieeexplore.ieee.org/abstract/document/10447222
Published: 2023
Full Text: View/download PDF

6. Does Single-channel Speech Enhancement Improve Keyword Spotting Accuracy? A Case Study

Author: Brueggeman, Avamarie, Higuchi, Takuya, Delfarah, Masood, Shum, Stephen, and Garg, Vineet
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Noise robustness is a key aspect of successful speech applications. Speech enhancement (SE) has been investigated to improve automatic speech recognition accuracy; however, its effectiveness for keyword spotting (KWS) is still under-investigated. In this paper, we conduct a comprehensive study on single-channel speech enhancement for keyword spotting on the Google Speech Command (GSC) dataset. To investigate robustness to noise, the GSC dataset is augmented with noise signals from the WSJ0 Hipster Ambient Mixtures (WHAM!) noise dataset. Our investigation includes not only applying SE before KWS but also performing joint training of the SE frontend and KWS backend models. Moreover, we explore audio injection, a common approach to reduce distortions by using a weighted average of the enhanced and original signals. Audio injection is then further optimized by using another model that predicts the weight for each utterance. Our investigation reveals that SE can improve KWS accuracy on noisy speech when the backend model is trained on clean speech; however, despite our extensive exploration, it is difficult to improve the KWS accuracy with SE when the backend is trained on noisy speech.
Published: 2023

7. Leveraging Large Language Models for Exploiting ASR Uncertainty

Author: Dighe, Pranay, Su, Yi, Zheng, Shangshang, Liu, Yunshu, Garg, Vineet, Niu, Xiaochuan, and Tewfik, Ahmed
Subjects: Computer Science - Computation and Language, Computer Science - Human-Computer Interaction, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: While large language models excel in a variety of natural language processing (NLP) tasks, to perform well on spoken language understanding (SLU) tasks, they must either rely on off-the-shelf automatic speech recognition (ASR) systems for transcription, or be equipped with an in-built speech modality. This work focuses on the former scenario, where the LLM's accuracy on SLU tasks is constrained by the accuracy of a fixed ASR system on the spoken input. Specifically, we tackle the speech-intent classification task, where a high word-error-rate can limit the LLM's ability to understand the spoken intent. Instead of chasing a high accuracy by designing complex or specialized architectures regardless of deployment costs, we seek to answer how far we can go without substantially changing the underlying ASR and LLM, which can potentially be shared by multiple unrelated tasks. To this end, we propose prompting the LLM with an n-best list of ASR hypotheses instead of only the error-prone 1-best hypothesis. We explore prompt-engineering to explain the concept of n-best lists to the LLM, followed by the finetuning of Low-Rank Adapters on the downstream tasks. Our approach using n-best lists proves to be effective on a device-directed speech detection task as well as on a keyword spotting task, where systems using n-best list prompts outperform those using 1-best ASR hypothesis, thus paving the way for an efficient method to exploit ASR uncertainty via LLMs for speech-based applications.
Comment: Added references
Published: 2023

8. Device-Directed Speech Detection: Regularization via Distillation for Weakly-Supervised Models

Author: Garg, Vineet, Rudovic, Ognjen, Dighe, Pranay, Abdelaziz, Ahmed H., Marchi, Erik, Adya, Saurabh, Dhir, Chandra, and Tewfik, Ahmed
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Human-Computer Interaction, Computer Science - Machine Learning, Computer Science - Sound
Abstract: We address the problem of detecting speech directed to a device that does not contain a specific wake-word. Specifically, we focus on audio coming from a touch-based invocation. Mitigating virtual assistants (VAs) activation due to accidental button presses is critical for user experience. While the majority of approaches to false trigger mitigation (FTM) are designed to detect the presence of a target keyword, inferring user intent in absence of keyword is difficult. This also poses a challenge when creating the training/evaluation data for such systems due to inherent ambiguity in the user's data. To this end, we propose a novel FTM approach that uses weakly-labeled training data obtained with a newly introduced data sampling strategy. While this sampling strategy reduces data annotation efforts, the data labels are noisy as the data are not annotated manually. We use these data to train an acoustics-only model for the FTM task by regularizing its loss function via knowledge distillation from an ASR-based (LatticeRNN) model. This improves the model decisions, resulting in 66% gain in accuracy, as measured by equal-error-rate (EER), over the base acoustics-only model. We also show that the ensemble of the LatticeRNN and acoustic-distilled models brings further accuracy improvement of 20%.
Comment: Submitted to INTERSPEECH 2022
Published: 2022

9. Streaming on-device detection of device directed speech from voice and touch-based invocation

Author: Rudovic, Ognjen, Bindal, Akanksha, Garg, Vineet, Simha, Pramod, Dighe, Pranay, and Kajarekar, Sachin
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: When interacting with smart devices such as mobile phones or wearables, the user typically invokes a virtual assistant (VA) by saying a keyword or by pressing a button on the device. However, in many cases, the VA can accidentally be invoked by the keyword-like speech or accidental button press, which may have implications on user experience and privacy. To this end, we propose an acoustic false-trigger-mitigation (FTM) approach for on-device device-directed speech detection that simultaneously handles the voice-trigger and touch-based invocation. To facilitate the model deployment on-device, we introduce a new streaming decision layer, derived using the notion of temporal convolutional networks (TCN) [1], known for their computational efficiency. To the best of our knowledge, this is the first approach that can detect device-directed speech from more than one invocation type in a streaming fashion. We compare this approach with streaming alternatives based on vanilla Average layer, and canonical LSTMs, and show: (i) that all the models show only a small degradation in accuracy compared with the invocation-specific models, and (ii) that the newly introduced streaming TCN consistently performs better or comparable with the alternatives, while mitigating device undirected speech faster in time, and with (relative) reduction in runtime peak-memory over the LSTM-based approach of 33% vs. 7%, when compared to a non-streaming counterpart.
Published: 2021

10. Streaming Transformer for Hardware Efficient Voice Trigger Detection and False Trigger Mitigation

Author: Garg, Vineet, Chang, Wonil, Sigtia, Siddharth, Adya, Saurabh, Simha, Pramod, Dighe, Pranay, and Dhir, Chandra
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Human-Computer Interaction, Computer Science - Machine Learning, Computer Science - Sound
Abstract: We present a unified and hardware efficient architecture for two stage voice trigger detection (VTD) and false trigger mitigation (FTM) tasks. Two stage VTD systems of voice assistants can get falsely activated to audio segments acoustically similar to the trigger phrase of interest. FTM systems cancel such activations by using post trigger audio context. Traditional FTM systems rely on automatic speech recognition lattices which are computationally expensive to obtain on device. We propose a streaming transformer (TF) encoder architecture, which progressively processes incoming audio chunks and maintains audio context to perform both VTD and FTM tasks using only acoustic features. The proposed joint model yields an average 18% relative reduction in false reject rate (FRR) for the VTD task at a given false alarm rate. Moreover, our model suppresses 95% of the false triggers with an additional one second of post-trigger audio. Finally, on-device measurements show 32% reduction in runtime memory and 56% reduction in inference time compared to non-streaming version of the model.
Published: 2021

11. Progressive Voice Trigger Detection: Accuracy vs Latency

Author: Sigtia, Siddharth, Bridle, John, Richards, Hywel, Clark, Pascal, Marchi, Erik, and Garg, Vineet
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Human-Computer Interaction, Computer Science - Machine Learning, Computer Science - Sound
Abstract: We present an architecture for voice trigger detection for virtual assistants. The main idea in this work is to exploit information in words that immediately follow the trigger phrase. We first demonstrate that by including more audio context after a detected trigger phrase, we can indeed get a more accurate decision. However, waiting to listen to more audio each time incurs a latency increase. Progressive Voice Trigger Detection allows us to trade-off latency and accuracy by accepting clear trigger candidates quickly, but waiting for more context to decide whether to accept more marginal examples. Using a two-stage architecture, we show that by delaying the decision for just 3% of detected true triggers in the test set, we are able to obtain a relative improvement of 66% in false rejection rate, while incurring only a negligible increase in latency.
Comment: Camera Ready Version: ICASSP 2021
Published: 2020

12. Hybrid Transformer/CTC Networks for Hardware Efficient Voice Triggering

Author: Adya, Saurabh, Garg, Vineet, Sigtia, Siddharth, Simha, Pramod, and Dhir, Chandra
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Human-Computer Interaction, Computer Science - Machine Learning, Computer Science - Sound
Abstract: We consider the design of two-pass voice trigger detection systems. We focus on the networks in the second pass that are used to re-score candidate segments obtained from the first-pass. Our baseline is an acoustic model (AM), with BiLSTM layers, trained by minimizing the CTC loss. We replace the BiLSTM layers with self-attention layers. Results on internal evaluation sets show that self-attention networks yield better accuracy while requiring fewer parameters. We add an auto-regressive decoder network on top of the self-attention layers and jointly minimize the CTC loss on the encoder and the cross-entropy loss on the decoder. This design yields further improvements over the baseline. We retrain all the models above in a multi-task learning (MTL) setting, where one branch of a shared network is trained as an AM, while the second branch classifies the whole sequence to be true-trigger or not. Results demonstrate that networks with self-attention layers yield ~60% relative reduction in false reject rates for a given false-alarm rate, while requiring 10% fewer parameters. When trained in the MTL setup, self-attention networks yield further accuracy improvements. On-device measurements show that we observe 70% relative reduction in inference time. Additionally, the proposed network architectures are ~5X faster to train.
Comment: INTERSPEECH, 2020
Published: 2020

13. Apache Hive: From MapReduce to Enterprise-grade Big Data Warehousing

Author: Camacho-Rodríguez, Jesús, Chauhan, Ashutosh, Gates, Alan, Koifman, Eugene, O'Malley, Owen, Garg, Vineet, Haindrich, Zoltan, Shelukhin, Sergey, Jayachandran, Prasanth, Seth, Siddharth, Jaiswal, Deepak, Bouguerra, Slim, Bangarwa, Nishant, Hariappan, Sankar, Agarwal, Anishek, Dere, Jason, Dai, Daniel, Nair, Thejas, Dembla, Nita, Vijayaraghavan, Gopal, and Hagleitner, Günther
Subjects: Computer Science - Databases
Abstract: Apache Hive is an open-source relational database system for analytic big-data workloads. In this paper we describe the key innovations on the journey from batch tool to fully fledged enterprise data warehousing system. We present a hybrid architecture that combines traditional MPP techniques with more recent big data and cloud concepts to achieve the scale and performance required by today's analytic applications. We explore the system by detailing enhancements along four main axes: transactions, optimizer, runtime, and federation. We then provide experimental results to demonstrate the performance of the system for typical workloads and conclude with a look at the community roadmap.
Comment: SIGMOD'19, 14 pages
Published: 2019

14. Expert Recommendations on the Usage of Non-vitamin K Antagonist Oral Anticoagulants (NOACs) from India: Current Perspective and Future Direction

Author: Singh, Balbir, Pai, Paresh, Kumar, Harish, George, Sheeba, Mahapatra, Sandeep, Garg, Vineet, Gupta, G. N., Makineni, Kiran, Ganeshwala, Gaurav, Narkhede, Pravin, Naqvi, Syed M. H., Gaurav, Kumar, and Hukkeri, Mohammed Y. K.
Published: 2022
Full Text: View/download PDF

15. Streaming Anchor Loss: Augmenting Supervision with Temporal Significance

Author: Sarawgi, Utkarsh Oggy, primary, Berkowitz, John, additional, Garg, Vineet, additional, Kundu, Arnav, additional, Cho, Minsik, additional, Buddi, Sai Srujana, additional, Adya, Saurabh, additional, and Tewfik, Ahmed, additional
Published: 2024
Full Text: View/download PDF

16. Leveraging Large Language Models for Exploiting ASR Uncertainty

Author: Dighe, Pranay, primary, Su, Yi, additional, Zheng, Shangshang, additional, Liu, Yunshu, additional, Garg, Vineet, additional, Niu, Xiaochuan, additional, and Tewfik, Ahmed, additional
Published: 2024
Full Text: View/download PDF

17. Less Is More: A Unified Architecture for Device-Directed Speech Detection with Multiple Invocation Types

Author: Rudovic, Oggi, primary, Chang, Wonil, additional, Garg, Vineet, additional, Dighe, Pranay, additional, Simha, Pramod, additional, Berkowitz, Jack, additional, Abdelaziz, Ahmed H., additional, Kajarekar, Sachin, additional, Marchi, Erik, additional, and Adya, Saurabh, additional
Published: 2023
Full Text: View/download PDF

18. Device-Directed Speech Detection: Regularization via Distillation for Weakly-Supervised Models

Author: Garg, Vineet, primary, Rudovic, Ognjen, additional, Dighe, Pranay, additional, Abdelaziz, Ahmed Hussen, additional, Marchi, Erik, additional, Adya, Saurabh, additional, Dhir, Chandra, additional, and Tewfik, Ahmed, additional
Published: 2022
Full Text: View/download PDF

19. Streaming on-Device Detection of Device Directed Speech from Voice and Touch-Based Invocation

Author: Rudovic, Ognjen, Bindal, Akanksha, Garg, Vineet, Simha, Pramod, Dighe, Pranay, and Kajarekar, Sachin
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Machine Learning (cs.LG)
Abstract: When interacting with smart devices such as mobile phones or wearables, the user typically invokes a virtual assistant (VA) by saying a keyword or by pressing a button on the device. However, in many cases, the VA can accidentally be invoked by the keyword-like speech or accidental button press, which may have implications on user experience and privacy. To this end, we propose an acoustic false-trigger-mitigation (FTM) approach for on-device device-directed speech detection that simultaneously handles the voice-trigger and touch-based invocation. To facilitate the model deployment on-device, we introduce a new streaming decision layer, derived using the notion of temporal convolutional networks (TCN) [1], known for their computational efficiency. To the best of our knowledge, this is the first approach that can detect device-directed speech from more than one invocation type in a streaming fashion. We compare this approach with streaming alternatives based on vanilla Average layer, and canonical LSTMs, and show: (i) that all the models show only a small degradation in accuracy compared with the invocation-specific models, and (ii) that the newly introduced streaming TCN consistently performs better or comparable with the alternatives, while mitigating device undirected speech faster in time, and with (relative) reduction in runtime peak-memory over the LSTM-based approach of 33% vs. 7%, when compared to a non-streaming counterpart.
Published: 2022

20. Streaming Transformer for Hardware Efficient Voice Trigger Detection and False Trigger Mitigation

Author: Garg, Vineet, primary, Chang, Wonil, additional, Sigtia, Siddharth, additional, Adya, Saurabh, additional, Simha, Pramod, additional, Dighe, Pranay, additional, and Dhir, Chandra, additional
Published: 2021
Full Text: View/download PDF

21. Progressive Voice Trigger Detection: Accuracy vs Latency

Author: Sigtia, Siddharth, primary, Bridle, John, additional, Richards, Hywel, additional, Clark, Pascal, additional, Marchi, Erik, additional, and Garg, Vineet, additional
Published: 2021
Full Text: View/download PDF

22. Hybrid Transformer/CTC Networks for Hardware Efficient Voice Triggering

Author: Adya, Saurabh, primary, Garg, Vineet, additional, Sigtia, Siddharth, additional, Simha, Pramod, additional, and Dhir, Chandra, additional
Published: 2020
Full Text: View/download PDF

23. Apache Hive

Author: Camacho-Rodríguez, Jesús, primary, Chauhan, Ashutosh, additional, Gates, Alan, additional, Koifman, Eugene, additional, O'Malley, Owen, additional, Garg, Vineet, additional, Haindrich, Zoltan, additional, Shelukhin, Sergey, additional, Jayachandran, Prasanth, additional, Seth, Siddharth, additional, Jaiswal, Deepak, additional, Bouguerra, Slim, additional, Bangarwa, Nishant, additional, Hariappan, Sankar, additional, Agarwal, Anishek, additional, Dere, Jason, additional, Dai, Daniel, additional, Nair, Thejas, additional, Dembla, Nita, additional, Vijayaraghavan, Gopal, additional, and Hagleitner, Günther, additional
Published: 2019
Full Text: View/download PDF

24. Automatic Image Colorization Using Adversarial Training

Author: Lal, Shamit, primary, Garg, Vineet, additional, and Verma, Om Prakash, additional
Published: 2017
Full Text: View/download PDF

25. Genetic association in chronic periodontitis through dermatoglyphics: An unsolved link?

Author: Astekar, Madhusudan, primary, Astekar, Sowmya, additional, Garg, Vineet, additional, Agarwal, Ashutosh, additional, and Murari, Aditi, additional
Published: 2017
Full Text: View/download PDF
