34 results for "Sunit Sivasankaran"
Search Results
2. Speech Separation with Large-Scale Self-Supervised Learning.
- Author: Zhuo Chen 0006, Naoyuki Kanda, Jian Wu 0027, Yu Wu 0012, Xiaofei Wang 0009, Takuya Yoshioka, Jinyu Li 0001, Sunit Sivasankaran, and Sefik Emre Eskimez
- Published: 2023
3. Simulating Realistic Speech Overlaps Improves Multi-Talker ASR.
- Author: Muqiao Yang, Naoyuki Kanda, Xiaofei Wang 0009, Jian Wu 0027, Sunit Sivasankaran, Zhuo Chen 0006, Jinyu Li 0001, and Takuya Yoshioka
- Published: 2023
4. Target word activity detector: An approach to obtain ASR word boundaries without lexicon.
- Author: Sunit Sivasankaran, Eric Sun, Jinyu Li 0001, Yan Huang 0028, and Jing Pan
- Published: 2024
5. NOTSOFAR-1 Challenge: New Datasets, Baseline, and Tasks for Distant Meeting Transcription.
- Author: Alon Vinnikov, Amir Ivry, Aviv Hurvitz, Igor Abramovski, Sharon Koubi, Ilya Gurvich, Shai Pe'er, Xiong Xiao, Benjamin Martinez Elizalde, Naoyuki Kanda, Xiaofei Wang, Shalev Shaer, Stav Yagev, Yossi Asher, Sunit Sivasankaran, Yifan Gong 0001, Min Tang, Huaming Wang, and Eyal Krupka
- Published: 2024
6. COSMIC: Data Efficient Instruction-tuning For Speech In-Context Learning.
- Author: Jing Pan, Jian Wu 0027, Yashesh Gaur, Sunit Sivasankaran, Zhuo Chen 0006, Shujie Liu 0001, and Jinyu Li 0001
- Published: 2023
7. Explaining Deep Learning Models for Speech Enhancement.
- Author: Sunit Sivasankaran, Emmanuel Vincent 0001, and Dominique Fohr
- Published: 2021
8. Asteroid: The PyTorch-Based Audio Source Separation Toolkit for Researchers.
- Author: Manuel Pariente, Samuele Cornell, Joris Cosentino, Sunit Sivasankaran, Efthymios Tzinis, Jens Heitkaemper, Michel Olvera, Fabian-Robert Stöter, Mathieu Hu, Juan M. Martín-Doñas, David Ditter, Ariel Frank, Antoine Deleforge, and Emmanuel Vincent 0001
- Published: 2020
9. Analyzing the impact of speaker localization errors on speech separation for automatic speech recognition.
- Author: Sunit Sivasankaran, Emmanuel Vincent 0001, and Dominique Fohr
- Published: 2020
10. SLOGD: Speaker Location Guided Deflation Approach to Speech Separation.
- Author: Sunit Sivasankaran, Emmanuel Vincent 0001, and Dominique Fohr
- Published: 2020
11. Simulating realistic speech overlaps improves multi-talker ASR.
- Author: Muqiao Yang, Naoyuki Kanda, Xiaofei Wang 0009, Jian Wu 0027, Sunit Sivasankaran, Zhuo Chen 0006, Jinyu Li 0001, and Takuya Yoshioka
- Published: 2022
12. Speech separation with large-scale self-supervised learning.
- Author: Zhuo Chen 0006, Naoyuki Kanda, Jian Wu 0027, Yu Wu 0012, Xiaofei Wang 0009, Takuya Yoshioka, Jinyu Li 0001, Sunit Sivasankaran, and Sefik Emre Eskimez
- Published: 2022
13. Keyword Based Speaker Localization: Localizing a Target Speaker in a Multi-speaker Environment.
- Author: Sunit Sivasankaran, Emmanuel Vincent 0001, and Dominique Fohr
- Published: 2018
14. Phone Merging For Code-Switched Speech Recognition.
- Author: Sunit Sivasankaran, Brij Mohan Lal Srivastava, Sunayana Sitaram, Kalika Bali, and Monojit Choudhury
- Published: 2018
15. VoiceHome-2, an extended corpus for multichannel speech processing in real homes.
- Author: Nancy Bertin, Ewen Camberlein, Romain Lebarbenchon, Emmanuel Vincent 0001, Sunit Sivasankaran, Irina Illina, and Frédéric Bimbot
- Published: 2019
16. Discriminative importance weighting of augmented training data for acoustic model training.
- Author: Sunit Sivasankaran, Emmanuel Vincent 0001, and Irina Illina
- Published: 2017
17. An extended experimental investigation of DNN uncertainty propagation for noise robust ASR.
- Author: Karan Nathwani, Juan Andres Morales-Cordovilla, Sunit Sivasankaran, Irina Illina, and Emmanuel Vincent 0001
- Published: 2017
18. A French Corpus for Distant-Microphone Speech Processing in Real Homes.
- Author: Nancy Bertin, Ewen Camberlein, Emmanuel Vincent 0001, Romain Lebarbenchon, Stéphane Peillon, Éric Lamandé, Sunit Sivasankaran, Frédéric Bimbot, Irina Illina, Ariane Tom, Sylvain Fleury, and Éric Jamet
- Published: 2016
19. A combined evaluation of established and new approaches for speech recognition in varied reverberation conditions.
- Author: Sunit Sivasankaran, Emmanuel Vincent 0001, and Irina Illina
- Published: 2017
20. Robust ASR using neural network based speech enhancement and feature simulation.
- Author: Sunit Sivasankaran, Aditya Arie Nugraha, Emmanuel Vincent 0001, Juan Andres Morales-Cordovilla, Siddharth Dalmia, Irina Illina, and Antoine Liutkus
- Published: 2015
21. The Speed Submission to DIHARD II: Contributions & Lessons Learned.
- Author: Md. Sahidullah, Jose Patino 0001, Samuele Cornell, Ruiqing Yin, Sunit Sivasankaran, Hervé Bredin, Pavel Korshunov, Alessio Brutti, Romain Serizel, Emmanuel Vincent 0001, Nicholas W. D. Evans, Sébastien Marcel, Stefano Squartini, and Claude Barras
- Published: 2019
22. Statistics based features for unvoiced sound classification.
- Author: Sunit Sivasankaran and K. M. M. Prabhu
- Published: 2013
23. Speech separation with large-scale self-supervised learning
- Author: Zhuo Chen, Naoyuki Kanda, Jian Wu, Yu Wu, Xiaofei Wang, Takuya Yoshioka, Jinyu Li, Sunit Sivasankaran, and Sefik Emre Eskimez
- Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
- Abstract
Self-supervised learning (SSL) methods such as WavLM have shown promising speech separation (SS) results in small-scale simulation-based experiments. In this work, we extend the exploration of SSL-based SS by massively scaling up both the pre-training data (more than 300K hours) and the fine-tuning data (10K hours). We also investigate various techniques to efficiently integrate the pre-trained model with the SS network under a limited computation budget, including a low-frame-rate SSL model training setup and a fine-tuning scheme using only part of the pre-trained model. Compared with a supervised baseline and a WavLM-based SS model using feature embeddings from the previously released WavLM trained on 94K hours, our proposed model obtains relative word error rate (WER) reductions of 15.9% and 11.2%, respectively, on a simulated far-field speech mixture test set. For conversation transcription on real meeting recordings using continuous speech separation, the proposed model achieves relative WER reductions of 6.8% and 10.6% over the purely supervised baseline on the AMI and ICSI evaluation sets, respectively, while reducing the computational cost by 38%.
- Published: 2022
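To make the integration idea concrete, here is a minimal sketch of fusing embeddings from a publicly released SSL model with an STFT front end for mask-based separation. It assumes the WavLM Base checkpoint shipped with torchaudio as a stand-in for the paper's 300K-hour model; the fusion architecture below is illustrative, not the authors' exact network.

```python
# Hedged sketch: fuse features from a public SSL model (WavLM Base via
# torchaudio) with an STFT front end for mask-based speech separation.
# The fusion layout below is an illustrative assumption.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAVLM_BASE
ssl_model = bundle.get_model().eval()

class MaskEstimator(torch.nn.Module):
    def __init__(self, n_fft=512, ssl_dim=768, n_src=2):
        super().__init__()
        self.n_fft, self.n_src = n_fft, n_src
        self.n_bins = n_fft // 2 + 1
        self.rnn = torch.nn.LSTM(self.n_bins + ssl_dim, 256, num_layers=2,
                                 batch_first=True, bidirectional=True)
        self.out = torch.nn.Linear(512, self.n_bins * n_src)

    def forward(self, wav, ssl_feats):
        # wav: (batch, samples); ssl_feats: (batch, ssl_frames, ssl_dim)
        spec = torch.stft(wav, self.n_fft, hop_length=self.n_fft // 2,
                          window=torch.hann_window(self.n_fft),
                          return_complex=True)            # (B, F, T)
        logmag = torch.log1p(spec.abs()).transpose(1, 2)  # (B, T, F)
        # SSL frames (20 ms stride) are sparser than STFT frames: resample.
        ssl = torch.nn.functional.interpolate(
            ssl_feats.transpose(1, 2), size=logmag.shape[1]).transpose(1, 2)
        h, _ = self.rnn(torch.cat([logmag, ssl], dim=-1))
        masks = torch.sigmoid(self.out(h))                # (B, T, F * n_src)
        return masks.view(wav.shape[0], -1, self.n_bins, self.n_src)

wav = torch.randn(1, 16000)                  # 1 s of 16 kHz audio
with torch.no_grad():
    feats, _ = ssl_model.extract_features(wav)
masks = MaskEstimator()(wav, feats[-1])      # condition on the last SSL layer
```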
24. Asteroid: the PyTorch-based audio source separation toolkit for researchers
- Author: Ariel Frank, Emmanuel Vincent, Fabian-Robert Stöter, Mathieu Hu, Joris Cosentino, Manuel Pariente, Samuele Cornell, Sunit Sivasankaran, David Ditter, Efthymios Tzinis, Juan M. Martín-Doñas, Antoine Deleforge, Michel Olvera, and Jens Heitkaemper
- Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS); source separation; speech enhancement; open-source software; software architecture; end-to-end
- Abstract
This paper describes Asteroid, a PyTorch-based audio source separation toolkit for researchers. Inspired by the most successful neural source separation systems, it provides all the neural building blocks required to build such a system. To improve reproducibility, Kaldi-style recipes on common audio source separation datasets are also provided. This paper describes the software architecture of Asteroid and its most important features. Experimental results obtained with Asteroid's recipes show that our implementations are at least on par with most results reported in the reference papers. The toolkit is publicly available at https://github.com/mpariente/asteroid. (Submitted to Interspeech 2020.)
- Published: 2020
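A minimal usage sketch of the Asteroid API described above: the model class and forward call follow Asteroid's documented interface, while the commented-out pretrained checkpoint identifier is an assumption that may not match a currently hosted model id.

```python
# Hedged usage sketch of Asteroid's model API; the checkpoint identifier in
# the comment is an assumption and may not match a current Hugging Face id.
import torch
from asteroid.models import ConvTasNet

# Train a 2-speaker Conv-TasNet from scratch ...
model = ConvTasNet(n_src=2, sample_rate=8000)
# ... or load a community checkpoint (assumed identifier):
# model = ConvTasNet.from_pretrained("mpariente/ConvTasNet_WHAM!_sepclean")

mixture = torch.randn(1, 8000)      # (batch, samples): 1 s dummy mixture
with torch.no_grad():
    estimates = model(mixture)      # (batch, n_src, samples)
print(estimates.shape)
```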
25. Analyzing the impact of speaker localization errors on speech separation for automatic speech recognition
- Author: Emmanuel Vincent, Sunit Sivasankaran, and Dominique Fohr
- Subjects: multichannel speech separation; speaker localization; word error rate; adaptive beamforming; Audio and Speech Processing (eess.AS)
- Abstract
We investigate the effect of speaker localization on the performance of speech recognition systems in a multispeaker, multichannel environment. Given the speaker location information, speech separation is performed in three stages. In the first stage, a simple delay-and-sum (DS) beamformer is used to enhance the signal impinging from the speaker location, which is then used to estimate a time-frequency mask corresponding to the localized speaker using a neural network. This mask is used to compute the second-order statistics and to derive an adaptive beamformer in the third stage. We generated a multichannel, multispeaker, reverberated, noisy dataset inspired by the well-studied WSJ0-2mix and study the performance of the proposed pipeline in terms of the word error rate (WER). An average WER of 29.4% was achieved using the ground-truth localization information and 42.4% using the localization information estimated via GCC-PHAT. The signal-to-interference ratio (SIR) between the speakers has a higher impact on the ASR performance, to the extent of reducing the WER by 59% relative for an SIR increase of 15 dB. By contrast, increasing the angular distance to 50° or more improves the WER by only 23% relative. (Submitted to ICASSP 2020.)
- Published: 2019
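The first stage of the pipeline above is a delay-and-sum beamformer; a minimal NumPy sketch follows, assuming known per-microphone steering delays (in practice derived from the speaker location or estimated with GCC-PHAT). Array geometry and sizes are illustrative.

```python
# Hedged sketch of stage one above: a delay-and-sum (DS) beamformer steered
# at a known speaker location. Geometry and sizes are illustrative.
import numpy as np

def delay_and_sum(stft, delays, sr=16000, n_fft=512):
    """stft: (mics, freq_bins, frames) complex STFT of the array signals.
    delays: per-mic arrival-time offsets (seconds) for the target direction,
    e.g. derived from the speaker location or estimated with GCC-PHAT."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)               # (freq_bins,)
    # Phase-advance each channel to compensate its delay, then average.
    steering = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    return np.mean(steering[:, :, None] * stft, axis=0)      # (freq_bins, frames)

# Toy example: 4 mics, 257 bins, 100 frames, broadside target (zero delays).
X = np.random.randn(4, 257, 100) + 1j * np.random.randn(4, 257, 100)
Y = delay_and_sum(X, delays=np.zeros(4))
```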
26. Keyword Based Speaker Localization: Localizing a Target Speaker in a Multi-speaker Environment
- Author: Emmanuel Vincent, Dominique Fohr, and Sunit Sivasankaran
- Subjects: speaker localization; reverberation; noise; convolutional neural network; speech recognition
- Abstract
Speaker localization is a hard task, especially in adverse environmental conditions involving reverberation and noise. In this work we introduce the new task of localizing the speaker who uttered a given keyword, e.g., the wake-up word of a distant-microphone voice command system, in the presence of overlapping speech. We employ a convolutional neural network based localization system and investigate multiple identifiers as additional inputs to the system in order to characterize this speaker. We conduct experiments using ground-truth identifiers, obtained assuming the availability of clean speech, and in realistic conditions where the identifiers are computed from the corrupted speech. We find that the identifier consisting of the ground-truth time-frequency mask of the target speaker provides the best localization performance, and we propose methods to estimate such a mask in adverse reverberant and noisy conditions using the considered keyword.
- Published: 2018
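A minimal sketch of the conditioning idea above: feed the target speaker's time-frequency mask as an extra input channel to a CNN localizer. The input layout (microphone-pair phase maps plus one mask channel) and layer sizes are assumptions for illustration.

```python
# Hedged sketch: condition a CNN localizer on the target speaker's
# time-frequency mask. Input layout and layer sizes are assumptions.
import torch

class MaskConditionedLocalizer(torch.nn.Module):
    def __init__(self, n_mics=4, n_angles=72):
        super().__init__()
        # One channel per microphone-pair phase map, plus the target mask.
        in_ch = n_mics * (n_mics - 1) // 2 + 1
        self.net = torch.nn.Sequential(
            torch.nn.Conv2d(in_ch, 32, 3, padding=1), torch.nn.ReLU(),
            torch.nn.Conv2d(32, 32, 3, padding=1), torch.nn.ReLU(),
            torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
            torch.nn.Linear(32, n_angles),   # posterior over azimuth bins
        )

    def forward(self, phase_maps, target_mask):
        # phase_maps: (B, pairs, F, T); target_mask: (B, 1, F, T) in [0, 1]
        return self.net(torch.cat([phase_maps, target_mask], dim=1))

loc = MaskConditionedLocalizer()
logits = loc(torch.randn(1, 6, 257, 100), torch.rand(1, 1, 257, 100))
```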
27. Phone Merging for Code-switched Speech Recognition
- Author: Sunayana Sitaram, Kalika Bali, Sunit Sivasankaran, Monojit Choudhury, and Brij Mohan Lal Srivastava
- Subjects: code-switching; speech recognition; acoustic model; Hindi; phone merging
- Abstract
Speakers in multilingual communities often switch between or mix multiple languages in the same conversation. Automatic speech recognition (ASR) of code-switched speech faces many challenges, including the influence of phones of different languages on each other. This paper shows evidence that phone sharing between languages improves acoustic model performance for Hindi-English code-switched speech. We compare a baseline system built with separate phones for Hindi and English against systems where the phones were manually merged based on linguistic knowledge. Encouraged by the improved ASR performance after manually merging the phones, we further investigate multiple data-driven methods to identify phones to be merged across the languages. We present a detailed analysis of automatic phone merging in this language pair and its impact on individual phone accuracies and WER. Although the best performance gain of 1.2% WER was observed with manually merged phones, we show experimentally that the manual phone merge is not optimal.
- Published: 2018
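One plausible data-driven merging criterion, sketched below with invented toy confusion counts: merge a cross-lingual phone pair when the recognizer frequently confuses the two phones. This illustrates the idea, not the authors' exact method.

```python
# Hedged sketch of a confusion-based criterion for cross-lingual phone
# merging. The toy confusion counts are invented for illustration.
import numpy as np

phones = ["hi_a", "en_aa", "hi_t", "en_t", "hi_k"]
# confusion[i, j]: how often phone i was recognized as phone j (toy numbers).
confusion = np.array([
    [90, 25,  1,  0,  2],
    [30, 70,  0,  1,  1],
    [ 2,  0, 80, 40,  3],
    [ 1,  1, 35, 75,  2],
    [ 3,  1,  2,  2, 95],
], dtype=float)

def merge_candidates(phones, confusion, threshold=0.25):
    rates = confusion / confusion.sum(axis=1, keepdims=True)
    pairs = []
    for i in range(len(phones)):
        for j in range(i + 1, len(phones)):
            # Symmetric confusion rate between the two phones.
            score = (rates[i, j] + rates[j, i]) / 2
            if score >= threshold and phones[i][:2] != phones[j][:2]:
                pairs.append((phones[i], phones[j], round(score, 3)))
    return pairs

print(merge_candidates(phones, confusion))  # e.g. includes ('hi_t', 'en_t', ...)
```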
28. A combined evaluation of established and new approaches for speech recognition in varied reverberation conditions
- Author: Emmanuel Vincent, Sunit Sivasankaran, and Irina Illina
- Subjects: robust ASR; reverberation; dereverberation; acoustic model adaptation; fMLLR; evaluation
- Abstract
Robustness to reverberation is a key concern for distant-microphone ASR. Various approaches have been proposed, including single-channel or multichannel dereverberation, robust feature extraction, alternative acoustic models, and acoustic model adaptation. However, to the best of our knowledge, a detailed study of these techniques in varied reverberation conditions is still missing in the literature. In this paper, we conduct a series of experiments to assess the impact of various dereverberation and acoustic model adaptation approaches on ASR performance in the range of reverberation conditions found in real domestic environments. We consider both established approaches such as WPE and newer approaches such as learning hidden unit contributions (LHUC) adaptation, whose performance has not been reported before in this context, and we employ them in combination. Our results indicate that performing weighted prediction error (WPE) dereverberation on a reverberated test utterance and decoding with a deep neural network (DNN) acoustic model trained on multi-condition reverberated speech with feature-space maximum likelihood linear regression (fMLLR) transformed features outperforms more recent approaches and significantly reduces the word error rate (WER).
- Published: 2017
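WPE, the dereverberation method the study found most effective in combination with multi-condition training and fMLLR, is available in the open-source nara_wpe package; a minimal sketch follows, with illustrative tap/delay settings and a dummy STFT in the (frequency, channel, time) layout nara_wpe expects.

```python
# Hedged sketch of the front end found best in the study above: WPE
# dereverberation applied to a multichannel STFT before ASR decoding.
# Tap/delay/iteration values are illustrative, not the paper's settings.
import numpy as np
from nara_wpe.wpe import wpe

# Dummy multichannel STFT with shape (freq_bins, channels, frames).
Y = np.random.randn(257, 8, 200) + 1j * np.random.randn(257, 8, 200)

# taps: length of the linear-prediction filter; delay: frames skipped so the
# direct path is preserved; iterations: alternating PSD/filter updates.
Z = wpe(Y, taps=10, delay=3, iterations=5)
print(Z.shape)  # dereverberated STFT, same shape as Y
```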
29. An extended experimental investigation of DNN uncertainty propagation for noise robust ASR
- Author: Sunit Sivasankaran, Karan Nathwani, Irina Illina, Emmanuel Vincent, and Juan A. Morales-Cordovilla
- Subjects: robust ASR; uncertainty estimation; uncertainty propagation; acoustic modeling; speech enhancement; DNN
- Abstract
Automatic speech recognition (ASR) in noisy environments remains a challenging goal. Recently, the idea of estimating the uncertainty about the features obtained after speech enhancement and propagating it to dynamically adapt deep neural network (DNN) based acoustic models has raised some interest. However, the results in the literature were reported on simulated noisy datasets for a limited variety of uncertainty estimators, and we found that they vary significantly across conditions. Hence, the main contribution of this work is to assess DNN uncertainty decoding performance for different data conditions and different uncertainty estimation/propagation techniques. In addition, we propose a neural network based uncertainty estimator and compare it with other uncertainty estimators. We report detailed ASR results on the CHiME-2 and CHiME-3 datasets. We find that, on average, uncertainty propagation provides similar relative improvements on real and simulated data and that the proposed uncertainty estimator performs significantly better than the one in [1]. We also find that the improvement is consistent but depends on the signal-to-noise ratio (SNR) and the noise environment.
- Published: 2017
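A generic sketch of uncertainty propagation through a DNN acoustic model, using simple Monte Carlo sampling around the enhanced features; the paper compares several dedicated estimators and propagation schemes, so treat this as an illustrative stand-in rather than the authors' method.

```python
# Hedged sketch: marginalize over feature uncertainty by sampling features
# around the enhanced mean and averaging the DNN's posteriors.
import torch

def uncertainty_decode(acoustic_model, feat_mean, feat_var, n_samples=16):
    """feat_mean, feat_var: (frames, dim) posterior mean/variance of the
    enhanced features. Returns averaged senone posteriors per frame."""
    std = feat_var.clamp_min(1e-8).sqrt()
    posts = []
    for _ in range(n_samples):
        sample = feat_mean + std * torch.randn_like(feat_mean)
        posts.append(torch.softmax(acoustic_model(sample), dim=-1))
    # Averaging posteriors marginalizes over the feature uncertainty.
    return torch.stack(posts).mean(dim=0)

# Toy acoustic model: 40-dim features -> 500 senone classes.
am = torch.nn.Sequential(torch.nn.Linear(40, 256), torch.nn.ReLU(),
                         torch.nn.Linear(256, 500))
post = uncertainty_decode(am, torch.randn(100, 40), torch.rand(100, 40))
```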
30. Discriminative importance weighting of augmented training data for acoustic model training
- Author: Emmanuel Vincent, Sunit Sivasankaran, and Irina Illina
- Subjects: data augmentation; discriminative importance weighting; acoustic model training; feature simulation; ASR; CHiME
- Abstract
DNN-based acoustic models require a large amount of training data. Parametric data augmentation techniques, such as adding noise, adding reverberation, or changing the speech rate, are often employed to boost the dataset size and the ASR performance. The choice of augmentation techniques and the associated parameters has been handled heuristically so far. In this work we propose an algorithm to automatically weight data perturbed using a variety of augmentation techniques and/or parameters. The weights are learned discriminatively so as to minimize the frame error rate using standard gradient descent in an iterative manner. Experiments were performed on the CHiME-3 dataset, with data augmentation done by adding noise at different SNRs. A relative WER improvement of 15% was obtained with the proposed data weighting algorithm compared to the unweighted augmented dataset. Interestingly, the resulting distribution of SNRs in the weighted training set differs significantly from that of the test set.
- Published: 2017
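A minimal sketch of the mechanics of importance weighting: one learnable weight per augmentation condition, applied to the per-frame loss and updated by gradient descent. The grouping by SNR condition and the joint update shown here are assumptions; the paper learns the weights iteratively and discriminatively to minimize the frame error rate.

```python
# Hedged sketch: softmax-parameterized per-condition weights applied to a
# per-frame loss. The paper's iterative, frame-error-driven update scheme
# is replaced here by a joint toy update for illustration.
import torch

n_groups, dim, n_classes = 5, 40, 100               # e.g. 5 SNR conditions
log_w = torch.zeros(n_groups, requires_grad=True)   # weights via softmax
model = torch.nn.Linear(dim, n_classes)
opt = torch.optim.SGD(list(model.parameters()) + [log_w], lr=0.1)

feats = torch.randn(256, dim)                       # a batch of frames
labels = torch.randint(0, n_classes, (256,))
groups = torch.randint(0, n_groups, (256,))         # condition of each frame

for _ in range(10):
    w = torch.softmax(log_w, dim=0)                 # positive, sum to 1
    per_frame = torch.nn.functional.cross_entropy(
        model(feats), labels, reduction="none")
    loss = (w[groups] * per_frame).sum() / w[groups].sum()
    opt.zero_grad(); loss.backward(); opt.step()

print(torch.softmax(log_w, dim=0))                  # learned condition weights
```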
31. A French corpus for distant-microphone speech processing in real homes
- Author: Stéphane Peillon, Ewen Camberlein, Ariane Tom, Sunit Sivasankaran, Emmanuel Vincent, Éric Lamandé, Frédéric Bimbot, Irina Illina, Romain Lebarbenchon, Nancy Bertin, Sylvain Fleury, and Éric Jamet
- Subjects: distant-microphone speech processing; corpus; reverberation; noise; speaker localization; speech enhancement; robust ASR
- Abstract
We introduce a new corpus for distant-microphone speech processing in domestic environments. This corpus includes reverberated, noisy speech signals spoken by native French talkers in a lounge and recorded by an 8-microphone device at various angles and distances and in various noise conditions. Room impulse responses and noise-only signals recorded in various real rooms and homes, as well as baseline speaker localization and enhancement software, are also provided. This corpus stands apart from other corpora in the field by the number of rooms and homes considered and by the fact that it is publicly available at no cost. We describe the corpus specifications and annotations and the data recorded so far, and we report baseline results.
- Published: 2016
32. Robust ASR using neural network based speech enhancement and feature simulation
- Author: Irina Illina, Aditya Arie Nugraha, Siddharth Dalmia, Antoine Liutkus, Sunit Sivasankaran, Juan A. Morales-Cordovilla, and Emmanuel Vincent
- Subjects: robust ASR; speech enhancement; feature simulation; conditional restricted Boltzmann machine (CRBM); CHiME-3
- Abstract
We consider the problem of robust automatic speech recognition (ASR) in the context of the CHiME-3 Challenge. The proposed system combines three contributions. First, we propose a deep neural network (DNN) based multichannel speech enhancement technique, in which the speech and noise spectra are estimated using a DNN-based regressor and the spatial parameters are derived in an expectation-maximization (EM) like fashion. Second, a conditional restricted Boltzmann machine (CRBM) model is trained on the enhanced speech and used to generate simulated training and development datasets. The goal is to increase the similarity between simulated and real data, so as to increase the benefit of multicondition training. Finally, we make some changes to the ASR backend. Our system ranked 4th among 25 entries.
- Published: 2015
33. Robust features for environmental sound classification
- Author: K. M. M. Prabhu and Sunit Sivasankaran
- Subjects: environmental sound classification; sparse approximation; matching pursuit; mel-frequency cepstrum (MFCC); Gaussian mixture model
- Abstract
In this paper we describe algorithms to classify environmental sounds, with the aim of providing contextual information to devices such as hearing aids for optimum performance. We use signal sub-band energy to construct a signal-dependent dictionary and matching pursuit algorithms to obtain a sparse representation of a signal. The coefficients of the sparse vector are used as weights to compute weighted features. These features, along with mel-frequency cepstral coefficients (MFCC), are used as feature vectors for classification. Experimental results show that the proposed method gives an accuracy as high as 95.6% when classifying 14 categories of environmental sound using a Gaussian mixture model (GMM).
- Published: 2013
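A minimal sketch of the GMM classification backend described above, with random stand-in features in place of the paper's MFCC and sparse weighted features.

```python
# Hedged sketch: one Gaussian mixture model per sound class over frame-level
# features, with maximum-likelihood class selection. Features are random
# stand-ins, not the paper's MFCC + sparse weighted features.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
classes = ["rain", "traffic", "speech"]
train = {c: rng.normal(loc=i, size=(500, 13)) for i, c in enumerate(classes)}

# Fit one GMM per class (diagonal covariances keep it cheap).
gmms = {c: GaussianMixture(n_components=4, covariance_type="diag",
                           random_state=0).fit(X) for c, X in train.items()}

def classify(frames):
    # Sum of per-frame log-likelihoods under each class model.
    scores = {c: g.score_samples(frames).sum() for c, g in gmms.items()}
    return max(scores, key=scores.get)

print(classify(rng.normal(loc=1, size=(200, 13))))  # expected: "traffic"
```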
34. The Speed Submission to DIHARD II: Contributions & Lessons Learned
- Author: Md Sahidullah, José María Patino Villar, Samuele Cornell, Ruiqing Yin, Sunit Sivasankaran, Hervé Bredin, Pavel Korshunov, Alessio Brutti, Romain Serizel, Emmanuel Vincent, Nicholas Evans, Sébastien Marcel, Stefano Squartini, and Claude Barras
- Subjects: speaker diarization; speech activity detection; speaker recognition; DIHARD challenge; single-channel and multichannel speech
- Abstract
This paper describes the speaker diarization systems developed for the Second DIHARD Speech Diarization Challenge (DIHARD II) by the Speed team. Besides describing the system, which considerably outperformed the challenge baselines, we also focus on the lessons learned from numerous approaches that we tried for single and multi-channel systems. We present several components of our diarization system, including categorization of domains, speech enhancement, speech activity detection, speaker embeddings, clustering methods, resegmentation, and system fusion. We analyze and discuss the effect of each such component on the overall diarization performance within the realistic settings of the challenge.
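A minimal sketch of one clustering component of the kind listed above: agglomerative hierarchical clustering of per-segment speaker embeddings under a cosine distance. The embedding source (e.g. x-vectors) and the stopping threshold are assumptions.

```python
# Hedged sketch: agglomerative clustering of speaker embeddings with cosine
# distance. Toy embeddings stand in for real x-vector-style segments.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Toy embeddings: 20 segments from two synthetic "speakers".
emb = np.vstack([rng.normal(0, 1, (10, 128)) + 3,
                 rng.normal(0, 1, (10, 128)) - 3])

dists = pdist(emb, metric="cosine")         # pairwise cosine distances
tree = linkage(dists, method="average")     # agglomerative clustering
labels = fcluster(tree, t=0.5, criterion="distance")
print(labels)                               # segment -> speaker cluster ids
```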