12 results
Search Results
2. How much does prosody help word segmentation? A simulation study on infant-directed speech
- Author
-
Emmanuel Dupoux, Reiko Mazuka, Alejandrina Cristia, and Bogdan Ludusan (RIKEN Center for Brain Science; Duke University; Laboratoire de sciences cognitives et psycholinguistique (LSCP), ENS-PSL/EHESS/CNRS; Apprentissage machine et développement cognitif (CoML), Inria)
- Subjects
Linguistics and Language, Cognitive Neuroscience, Speech recognition, Experimental and Cognitive Psychology, Prosody, Infant-directed speech, Language and Linguistics, Speech Acoustics, Infant language acquisition, Developmental and Educational Psychology, Humans, Learning, Speech, Segmentation, Computer Simulation, Computational model, Text segmentation, Infant, Word segmentation, Speech Perception, Cues, Heuristics, Psychology - Abstract
Infants come to learn several hundred word forms by two years of age, and it is possible this involves carving these forms out from continuous speech. It has been proposed that the task is facilitated by the presence of prosodic boundaries. We revisit this claim by running computational models of word segmentation, with and without prosodic information, on a corpus of infant-directed speech. We use five cognitively based algorithms, which vary in whether they employ a sub-lexical or a lexical segmentation strategy and whether they are simple heuristics or embody an ideal learner. Results show that providing expert-annotated prosodic breaks does not uniformly help all segmentation models. The sub-lexical algorithms, which perform more poorly, benefit most, while the lexical ones show a very small gain. Moreover, when prosodic information is derived automatically from the acoustic cues infants are known to be sensitive to, errors in boundary detection lead to smaller positive effects, and even negative ones for some algorithms. This shows that even though infants could potentially use prosodic breaks, it does not necessarily follow that they should incorporate prosody into their segmentation strategies when confronted with realistic signals.
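To make the setup concrete, here is a minimal sketch of one sub-lexical strategy of the kind the abstract contrasts with lexical ones: hypothesize a word boundary wherever the syllable-to-syllable transitional probability dips, optionally forcing boundaries at prosodic breaks. The corpus format, threshold, and `prosodic_breaks` input are illustrative assumptions, not the paper's actual models.

```python
# Sketch: transitional-probability segmentation with optional prosodic breaks.
from collections import Counter

def transitional_probabilities(utterances):
    """Estimate P(next | prev) from syllable bigram counts."""
    unigrams, bigrams = Counter(), Counter()
    for utt in utterances:
        unigrams.update(utt)
        bigrams.update(zip(utt, utt[1:]))
    return {(a, b): c / unigrams[a] for (a, b), c in bigrams.items()}

def segment(utt, tp, threshold=0.3, prosodic_breaks=()):
    """Return boundary indices: TP dips below threshold, plus prosodic breaks."""
    boundaries = set(prosodic_breaks)  # expert-annotated or acoustically detected
    for i in range(1, len(utt)):
        if tp.get((utt[i - 1], utt[i]), 0.0) < threshold:
            boundaries.add(i)
    return sorted(boundaries)

corpus = [["ba", "by", "wants", "the", "bot", "tle"]] * 50
tp = transitional_probabilities(corpus)
print(segment(corpus[0], tp, prosodic_breaks=[2]))  # -> [2]
```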
- Published
- 2022
- Full Text
- View/download PDF
3. Language recognition on unknown conditions: the LORIA-Inria-MULTISPEECH system for AP20-OLR Challenge
- Author
-
Irina Illina, Raphaël Duroselle, Sahidullah, and Denis Jouvet (MULTISPEECH, Inria Nancy - Grand Est; LORIA, CNRS/Université de Lorraine/Inria)
- Subjects
Computer science, Generalization, Speech recognition, Regularization, Bottleneck features, Robustness, Language recognition, Domain generalization, Feature selection, Channel mismatch - Abstract
We describe the LORIA-Inria-MULTISPEECH system submitted to the Oriental Language Recognition AP20-OLR Challenge. This system was specifically designed to be robust to unknown conditions: channel mismatch (task 1) and noisy conditions (task 3). Three sets of studies were carried out to develop the system: design of multilingual bottleneck features, selection of robust features by evaluating language recognition performance on an unobserved channel, and design of the final models with different loss functions that exploit channel diversity within the training set. Key factors for robustness to unknown conditions are data augmentation techniques, stochastic weight averaging, and regularization of TDNNs with domain robustness loss functions. The final system is a combination of four TDNNs using bottleneck features and one GMM using SDC-MFCC features. Within the AP20-OLR Challenge, it achieves the top performance for tasks 1 and 3, with a $C_{avg}$ of 0.0239 and 0.0374, respectively. This validates the approach for generalization to unknown conditions.
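As an illustration of one robustness ingredient named above, here is a hedged sketch of stochastic weight averaging applied at the checkpoint level: average the weights of several saved training checkpoints instead of keeping a single one. The checkpoint paths and model are placeholders; the challenge system's actual TDNN code is not reproduced.

```python
# Sketch: uniform averaging of saved PyTorch state dicts (SWA-style).
import torch

def average_checkpoints(paths):
    """Uniformly average the parameter tensors of saved state dicts."""
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}

# Usage: load the averaged weights into a freshly built network.
# model.load_state_dict(average_checkpoints(["ckpt_10.pt", "ckpt_11.pt"]))
```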
- Published
- 2021
- Full Text
- View/download PDF
4. Explaining Deep Learning Models for Speech Enhancement
- Author
-
Sunit Sivasankaran (Microsoft Corporation, Redmond), Emmanuel Vincent, and Dominique Fohr (MULTISPEECH, Inria Nancy - Grand Est; LORIA, CNRS/Université de Lorraine/Inria)
- Subjects
Artificial neural network, Computer science, Deep learning, Speech recognition, Word error rate, Explainable AI, Speech enhancement, Noise, Feature attribution, Robustness, Artificial intelligence - Abstract
We consider the problem of explaining the robustness of neural networks used to compute time-frequency masks for speech enhancement to mismatched noise conditions. We employ the Deep SHapley Additive exPlanations (DeepSHAP) feature attribution method to quantify the contribution of every time-frequency bin in the input noisy speech signal to every time-frequency bin in the output time-frequency mask. We define an objective metric, referred to as the speech relevance score, that summarizes the obtained SHAP values, and show that it correlates with the enhancement performance, as measured by the word error rate on the CHiME-4 real evaluation dataset. We use the speech relevance score to explain the generalization ability of three speech enhancement models trained using synthetically generated speech-shaped noise, noise from a professional sound effects library, or real CHiME-4 noise. To the best of our knowledge, this is the first study on neural network explainability in the context of speech enhancement.
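A minimal sketch of the attribution step follows, using the shap library's DeepExplainer on a toy mask estimator. The tiny network, background data, and frame sizes are stand-ins; the paper's models and the speech relevance score computation are not reproduced here.

```python
# Sketch: DeepSHAP attribution of input time-frequency bins to mask outputs.
import torch
import shap

mask_net = torch.nn.Sequential(            # toy mask estimator: 257-bin frames
    torch.nn.Linear(257, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, 257), torch.nn.Sigmoid(),
)
background = torch.randn(100, 257)         # reference noisy-speech frames
explainer = shap.DeepExplainer(mask_net, background)
frames = torch.randn(4, 257)               # frames to explain
shap_values = explainer.shap_values(frames)
# shap_values[j][i, k]: contribution of input bin k of frame i to output bin j;
# summarizing such values over speech/noise regions yields a relevance-style score.
```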
- Published
- 2021
- Full Text
- View/download PDF
5. DNN-based mask estimation for distributed speech enhancement in spatially unconstrained microphone arrays
- Author
-
Romain Serizel, Irina Illina, Nicolas Furnon (MULTISPEECH, Inria Nancy - Grand Est; LORIA, CNRS/Université de Lorraine/Inria), and Slim Essid (LTCI, Télécom Paris, Institut Polytechnique de Paris)
- Subjects
Microphone array, Acoustics and Ultrasonics, Noise measurement, Artificial neural network, Computer science, Microphone, Noise reduction, Speech recognition, Speech processing, Speech enhancement, Signal processing - Abstract
Deep neural network (DNN)-based speech enhancement algorithms in microphone arrays have proven to be efficient solutions for speech understanding and speech recognition in noisy environments. However, in the context of ad-hoc microphone arrays, many challenges remain and raise the need for distributed processing. In this paper, we propose to extend a previously introduced distributed DNN-based time-frequency mask estimation scheme that can efficiently use spatial information in the form of so-called compressed signals, which are pre-filtered target estimates. We study the performance of this algorithm under realistic acoustic conditions and investigate practical aspects of its optimal application. We show that the nodes in the microphone array cooperate by taking advantage of their spatial coverage in the room. We also propose to use the compressed signals to convey not only the target estimate but also the noise estimate, in order to exploit the acoustic diversity recorded throughout the microphone array.
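The following is an illustrative sketch (not the authors' code) of the compressed-signal exchange: each node pre-filters its own microphones into a single target estimate, shares that one channel with the other nodes, and re-estimates its mask from local channels plus the received compressed signals. The `dnn` callable is a hypothetical placeholder.

```python
# Sketch: per-node compression and mask estimation in a distributed array.
import numpy as np

def node_compress(local_stft, mask):
    """Mask-weighted average of local channels -> one compressed channel (freq, time)."""
    return mask * local_stft.mean(axis=0)

def node_estimate_mask(local_stft, received, dnn):
    """Stack local mics with other nodes' compressed signals as DNN input."""
    feats = np.concatenate([np.abs(local_stft),          # (mics, freq, time)
                            np.abs(np.stack(received))],  # (nodes-1, freq, time)
                           axis=0)
    return dnn(feats)  # hypothetical mask-estimation network call
```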
- Published
- 2020
- Full Text
- View/download PDF
6. Metric learning loss functions to reduce domain mismatch in the x-vector space for language recognition
- Author
-
Raphaël Duroselle, Denis Jouvet, and Irina Illina (MULTISPEECH, Inria Nancy - Grand Est; LORIA, CNRS/Université de Lorraine/Inria)
- Subjects
Computer science, Domain adaptation, Speech recognition, Metric learning, Domain mismatch, Language recognition, Embedding, x-vector, Robustness, Maximum mean discrepancy, Vector space - Abstract
State-of-the-art language recognition systems are based on discriminative embeddings called x-vectors. Channel and gender distortions produce mismatch in the x-vector space, where embeddings corresponding to the same language are not grouped in a unique cluster. To control this mismatch, we propose to train the x-vector DNN with metric learning objective functions. Combining a classification loss with the metric learning n-pair loss improves the language recognition performance. Such a system achieves a robustness comparable to a system trained with a domain adaptation loss function, but without using the domain information. We also analyze the mismatch due to channel and gender, in comparison to language proximity, in the x-vector space. This is achieved using the Maximum Mean Discrepancy divergence measure between groups of x-vectors. Our analysis shows that using the metric learning loss function reduces gender and channel mismatch in the x-vector space, even for languages only observed on one channel in the training set.
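For reference, here is a minimal PyTorch sketch of the combined objective the abstract describes: a classification loss plus the multiclass N-pair loss in its standard softmax form. The embedding network, data pipeline, and weighting factor are assumptions.

```python
# Sketch: classification loss + multiclass N-pair metric learning loss.
import torch
import torch.nn.functional as F

def n_pair_loss(anchors, positives):
    """anchors, positives: (N, d); row i of each comes from the same language."""
    logits = anchors @ positives.t()                       # (N, N) similarities
    targets = torch.arange(anchors.size(0), device=anchors.device)
    return F.cross_entropy(logits, targets)                # softmax N-pair form

def total_loss(anchors, positives, class_logits, labels, alpha=0.5):
    """alpha balances the two terms (illustrative value)."""
    return F.cross_entropy(class_logits, labels) + alpha * n_pair_loss(anchors, positives)
```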
- Published
- 2020
7. Transfer learning of the expressivity using flow metric learning in multispeaker text-to-speech synthesis
- Author
-
Ajinkya Kulkarni, Vincent Colotte, and Denis Jouvet (MULTISPEECH, Inria Nancy - Grand Est; LORIA, CNRS/Université de Lorraine/Inria)
- Subjects
Computer science, Mean opinion score, Speech recognition, Deep metric learning, Inference, Speech synthesis, Latent variable, Expressivity, Acoustic model, Speaker recognition, Inverse autoregressive flow, Autoencoder, Variational autoencoder, Embedding, Text-to-speech, Transfer learning - Abstract
In this paper, we present a novel deep metric learning architecture along with variational inference incorporated in a parametric multispeaker expressive text-to-speech (TTS) system. We propose inverse autoregressive flow (IAF) as a way to perform the variational inference, thus providing a flexible approximate posterior distribution. The proposed approach conditions the text-to-speech system on speaker embeddings so that the latent space represents emotion as semantic characteristics. To represent the speaker, we extracted speaker embeddings from an x-vector based speaker recognition model trained on speech data from many speakers. To predict the vocoder features, we used an acoustic model conditioned on the textual features as well as on the speaker embedding. We transferred the expressivity by using the mean of the latent variables for each emotion to generate expressive speech in different speakers' voices for which no expressive speech data is available. We compared the results obtained using flow-based variational inference with a variational autoencoder as a baseline model. The performance, measured by mean opinion score (MOS), speaker MOS, and expressive MOS, shows that N-pair loss based deep metric learning along with the IAF model improves the transfer of expressivity to the desired speaker's voice in synthesized speech.
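A hedged sketch of one IAF step follows, to show how the flow enriches the approximate posterior: an autoregressive network outputs per-dimension shift and scale from the current latent sample, and the log-determinant is accumulated for the variational bound. A plain linear layer stands in for a masked autoregressive module purely to keep the sketch self-contained; it is not actually autoregressive.

```python
# Sketch: one inverse autoregressive flow (IAF) transformation step.
import torch

class IAFStep(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Placeholder for a masked autoregressive net (e.g. MADE-style);
        # a plain Linear is used here only for self-containment.
        self.ar_net = torch.nn.Linear(dim, 2 * dim)

    def forward(self, z, log_det):
        m, s = self.ar_net(z).chunk(2, dim=-1)
        sigma = torch.sigmoid(s + 1.5)           # gated scale, kept positive
        z = sigma * z + (1 - sigma) * m          # invertible update of the sample
        return z, log_det + torch.log(sigma).sum(-1)
```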
- Published
- 2020
8. RNN Language Model Estimation for Out-of-Vocabulary Words
- Author
-
Irina Illina and Dominique Fohr (MULTISPEECH, Inria Nancy - Grand Est; LORIA, CNRS/Université de Lorraine/Inria)
- Subjects
Speech Recognition, Neural Networks, Vocabulary Extension, Proper Names, Out-of-Vocabulary Words - Abstract
An important issue for speech recognition systems is out-of-vocabulary (OOV) words. These words, often proper nouns or new words, are essential for documents to be transcribed correctly, so they must be integrated into the language model (LM) and the lexicon of the speech recognition system. This article proposes new approaches to OOV proper noun probability estimation using a Recurrent Neural Network Language Model (RNNLM). The proposed approaches are based on the notion of closest in-vocabulary (IV) words (the list of brothers) for a given OOV proper noun. The probabilities of these words, obtained from the RNNLM, are used to estimate the probabilities of OOV proper nouns. Three methods for retrieving the relevant list of brothers are studied. The main advantages of the proposed approaches are that the RNNLM is not retrained and its architecture is kept intact. Experiments on real text data from the website of the Euronews channel show relative perplexity reductions of about 14% compared to the baseline RNNLM.
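A hedged sketch of the core idea: the RNNLM is left untouched, and an OOV proper noun borrows its probability from its closest in-vocabulary brothers. The `rnnlm_prob` function, the brother lists, and the simple averaging rule are assumptions for illustration; the paper studies three retrieval methods.

```python
# Sketch: estimating P(oov | history) from the RNNLM probabilities of brothers.
def oov_probability(history, brothers, rnnlm_prob):
    """Average the RNNLM probabilities of the in-vocabulary brother words."""
    probs = [rnnlm_prob(history, b) for b in brothers]
    return sum(probs) / len(probs)

# Hypothetical usage, assuming rnnlm_prob(history, word) queries a trained RNNLM:
# p = oov_probability(("the", "president"), ["Obama", "Sarkozy"], rnnlm_prob)
```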
- Published
- 2020
- Full Text
- View/download PDF
9. Deep Variational Metric Learning for Transfer of Expressivity in Multispeaker Text to Speech
- Author
-
Vincent Colotte, Ajinkya Kulkarni, and Denis Jouvet (MULTISPEECH, Inria Nancy - Grand Est; LORIA, CNRS/Université de Lorraine/Inria)
- Subjects
Computer science, Speech recognition, Deep metric learning, Speech synthesis, Expressivity, Speaker recognition, Autoencoder, Variational autoencoder, Text-to-speech, Latent representation - Abstract
In this paper, we propose an approach relying on multiclass N-pair loss based deep metric learning in a recurrent conditional variational autoencoder (RCVAE), which we use to implement a multispeaker expressive text-to-speech (TTS) system. The proposed approach conditions the text-to-speech system on speaker embeddings and leads to clustering of the latent space representation with respect to emotion. Deep metric learning helps to reduce the intra-class variance and increase the inter-class variance in the latent space; we therefore use the multiclass N-pair loss to enhance the meaningful representation of the latent space. To represent the speaker, we extracted speaker embeddings from an x-vector based speaker recognition model trained on speech data from many speakers. To predict the vocoder features, we used the RCVAE for acoustic modeling, conditioning the model on the textual features as well as on the speaker embedding. We transferred the expressivity by using the mean of the latent variables for each emotion to generate expressive speech in different speakers' voices for which no expressive speech data is available. We compared the results with those of the RCVAE model without the multiclass N-pair loss as a baseline model. The performance, measured by mean opinion score (MOS), speaker MOS, and expressive MOS, shows that N-pair loss based deep metric learning significantly improves the transfer of expressivity to the target speaker's voice in synthesized speech.
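The expressivity-transfer step shared by this paper and the previous one can be sketched as follows: average the latent codes of each emotion, then decode with a new speaker's embedding. The `decoder` call and variable names are placeholders for the RCVAE components, not the authors' code.

```python
# Sketch: per-emotion latent centroids for expressivity transfer.
import torch

def emotion_centroids(latents, emotions):
    """Mean latent vector per emotion label; latents is a list of tensors."""
    out = {}
    for e in set(emotions):
        idx = [i for i, lab in enumerate(emotions) if lab == e]
        out[e] = torch.stack([latents[i] for i in idx]).mean(0)
    return out

# Hypothetical decoding of 'angry' speech for a speaker with no expressive data:
# features = decoder(text_feats, speaker_embedding, centroids["angry"])
```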
- Published
- 2020
- Full Text
- View/download PDF
10. CountNet: Estimating the Number of Concurrent Speakers Using Supervised Learning
- Author
-
Bernd Edler, Emanuel A. P. Habets, Fabian-Robert Stöter, and Soumitro Chakrabarty (International Audio Laboratories Erlangen, FAU/Fraunhofer IIS; ZENITH, LIRMM/Inria)
- Subjects
Speaker count estimation, Reverberation, Cocktail-party, Overlap detection, Acoustics and Ultrasonics, Artificial neural network, Computer science, Speech recognition, Supervised learning, Probabilistic logic, Blind signal separation, Speaker diarization, Recurrent neural network, Point estimation - Abstract
Estimating the maximum number of concurrent speakers from single-channel mixtures is a challenging problem and an essential first step in various audio-based tasks such as blind source separation, speaker diarization, and audio surveillance. We propose a unifying probabilistic paradigm, where deep neural network architectures are used to infer output posterior distributions. These probabilities are in turn processed to yield discrete point estimates. Designing such architectures often involves two important and complementary aspects that we investigate and discuss. First, we study how recent advances in deep architectures may be exploited for the task of speaker count estimation. In particular, we show that convolutional recurrent neural networks outperform the recurrent networks used in a previous study when adequate input features are used. Even for short segments of speech mixtures, we can estimate up to five speakers, with a significantly lower error than other methods. Second, through comprehensive evaluation, we compare the best-performing method to several baselines and examine the influence of gain variations, different datasets, and reverberation. The output of our proposed method is compared to human performance. Finally, we give insights into the strategy used by our proposed method.
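To make the architecture concrete, here is a sketch of a CountNet-style convolutional recurrent classifier: convolutional layers over a spectrogram, an LSTM over time, and a softmax over speaker counts 0 to 5. Layer sizes and the input dimension are illustrative, not the published configuration.

```python
# Sketch: CRNN classifier over spectrograms for speaker count estimation.
import torch

class CRNNCounter(torch.nn.Module):
    def __init__(self, n_freq=201, max_speakers=5):
        super().__init__()
        self.conv = torch.nn.Sequential(
            torch.nn.Conv2d(1, 32, 3, padding=1), torch.nn.ReLU(),
            torch.nn.MaxPool2d((2, 1)),          # pool frequency, keep time
            torch.nn.Conv2d(32, 64, 3, padding=1), torch.nn.ReLU(),
            torch.nn.MaxPool2d((2, 1)),
        )
        self.rnn = torch.nn.LSTM(64 * (n_freq // 4), 40, batch_first=True)
        self.out = torch.nn.Linear(40, max_speakers + 1)

    def forward(self, spec):                     # spec: (batch, 1, freq, time)
        h = self.conv(spec)                      # (batch, 64, freq//4, time)
        h = h.flatten(1, 2).transpose(1, 2)      # (batch, time, 64 * freq//4)
        _, (h_n, _) = self.rnn(h)
        return self.out(h_n[-1])                 # logits over 0..max_speakers
```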
- Published
- 2019
- Full Text
- View/download PDF
11. Analyzing the impact of speaker localization errors on speech separation for automatic speech recognition
- Author
-
Emmanuel Vincent, Sunit Sivasankaran, and Dominique Fohr (MULTISPEECH, Inria Nancy - Grand Est; LORIA, CNRS/Université de Lorraine/Inria)
- Subjects
Multichannel speech separation, WSJ0-2mix reverberated, Signal processing, Noise measurement, Artificial neural network, Computer science, Speech recognition, Word error rate, Speech processing, Signal-to-noise ratio, Adaptive beamformer - Abstract
We investigate the effect of speaker localization on the performance of speech recognition systems in a multispeaker, multichannel environment. Given the speaker location information, speech separation is performed in three stages. In the first stage, a simple delay-and-sum (DS) beamformer is used to enhance the signal impinging from the speaker location, which is then used to estimate a time-frequency mask corresponding to the localized speaker using a neural network. This mask is used to compute the second-order statistics and to derive an adaptive beamformer in the third stage. We generated a multichannel, multispeaker, reverberated, noisy dataset inspired by the well-studied WSJ0-2mix and study the performance of the proposed pipeline in terms of the word error rate (WER). An average WER of 29.4% was achieved using the ground-truth localization information and 42.4% using the localization information estimated via GCC-PHAT. The signal-to-interference ratio (SIR) between the speakers has a higher impact on the ASR performance, to the extent of reducing the WER by 59% relative for a SIR increase of 15 dB. By contrast, increasing the angular distance between the speakers to 50° or more improves the WER by only 23% relative.
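Here is a sketch of the first pipeline stage only: a frequency-domain delay-and-sum beamformer steered at the localized speaker. The array geometry, delays, and sampling values are illustrative; the mask estimation and adaptive beamforming stages follow it in the actual pipeline.

```python
# Sketch: frequency-domain delay-and-sum beamformer.
import numpy as np

def delay_and_sum(stft, delays, freqs):
    """stft: (mics, freq, time); delays: per-mic delay in seconds."""
    steer = np.exp(-2j * np.pi * freqs[None, :] * np.asarray(delays)[:, None])
    return (np.conj(steer)[:, :, None] * stft).mean(axis=0)  # (freq, time)

fs, n_fft = 16000, 512
freqs = np.fft.rfftfreq(n_fft, 1 / fs)
stft = np.random.randn(4, len(freqs), 100) + 1j * np.random.randn(4, len(freqs), 100)
enhanced = delay_and_sum(stft, delays=[0.0, 1.2e-4, 2.4e-4, 3.6e-4], freqs=freqs)
```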
- Published
- 2019
- Full Text
- View/download PDF
12. Assessment of Severe Apnoea through Voice Analysis, Automatic Speech, and Speaker Recognition Techniques
- Author
-
Luis A. Hernández Gómez, José Luis Blanco Murillo, Eduardo López Gonzalo, and Rubén Fernández Pozo (Signal, Systems and Radiocommunications Department, Universidad Politécnica de Madrid), José Alcázar Ramírez (Respiratory Department, Hospital Torrecárdenas, Almería), and Doroteo Torre Toledano (ATVS Biometric Recognition Group, Universidad Autónoma de Madrid)
- Subjects
Sustained speech, Artificial intelligence, Computer science, Speech recognition, Voice analysis, Nasalization, Vowel, Normal distribution, Obstructive sleep apnea, Audio signal processing, Telecommunications, Phonetics, Speaker recognition, Speech processing, Continuous speech, Voice disorders, Pattern recognition, Speech dynamics, Gaussian mixture models, Classification and regression tree (CART) - Abstract
This study is part of an ongoing collaborative effort between the medical and the signal processing communities to promote research on applying standard Automatic Speech Recognition (ASR) techniques to the automatic diagnosis of patients with severe obstructive sleep apnoea (OSA). Early detection of severe apnoea cases is important so that patients can receive early treatment, and effective ASR-based detection could dramatically cut medical testing time. Working with a carefully designed speech database of healthy and apnoea subjects, we describe an acoustic search for distinctive apnoea voice characteristics. We also study abnormal nasalization in OSA patients by modelling vowels in nasal and non-nasal phonetic contexts using Gaussian Mixture Model (GMM) pattern recognition on speech spectra. Finally, we present experimental findings regarding the discriminative power of GMMs applied to severe apnoea detection. We achieved an 81% correct classification rate, which is very promising and underpins the interest in this line of inquiry.
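A minimal sketch of the GMM pattern-recognition step described above: fit one Gaussian mixture per class on spectral features and classify by log-likelihood ratio. Feature extraction, mixture sizes, and the clinical database are outside this sketch; the random arrays stand in for real features.

```python
# Sketch: two-class GMM likelihood-ratio classifier for apnoea detection.
import numpy as np
from sklearn.mixture import GaussianMixture

def train(apnoea_feats, control_feats, n_components=8):
    """Fit one GMM per class on per-frame spectral feature vectors."""
    g_a = GaussianMixture(n_components).fit(apnoea_feats)
    g_c = GaussianMixture(n_components).fit(control_feats)
    return g_a, g_c

def classify(frames, g_a, g_c):
    """Frame-averaged log-likelihood ratio > 0 -> flag as apnoea."""
    return g_a.score(frames) - g_c.score(frames) > 0

g_a, g_c = train(np.random.randn(500, 13), np.random.randn(500, 13))
print(classify(np.random.randn(200, 13), g_a, g_c))
```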
- Published
- 2009