Author: "Laurent Girin" / Topic: 020206 networking & telecommunications - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Laurent Girin"' showing total 50 results


1. Alternate Endings: Improving Prosody for Incremental Neural TTS with Predicted Future Text Input

Author: Laurent Girin, Laurent Besacier, Brooke Stephenson, Thomas Hueber, Groupe d’Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole (GETALP), Laboratoire d'Informatique de Grenoble (LIG), Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP ), Université Grenoble Alpes (UGA)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP ), Université Grenoble Alpes (UGA), GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing (GIPSA-CRISSP), GIPSA Pôle Parole et Cognition (GIPSA-PPC), Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Grenoble Alpes (UGA)-Grenoble Images Parole Signal Automatique (GIPSA-lab), and ANR-19-P3IA-0003,MIAI,MIAI @ Grenoble Alpes(2019)
Subjects: FOS: Computer and information sciences, Incremental text-to-speech, Computer science, Speech recognition, Context (language use), neural language models, 02 engineering and technology, Measure (mathematics), 030507 speech-language pathology & audiology, 03 medical and health sciences, Naturalness, prosody, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, Audio and Speech Processing (eess.AS), 0202 electrical engineering, electronic engineering, information engineering, FOS: Electrical engineering, electronic engineering, information engineering, [INFO]Computer Science [cs], Prosody, Computer Science - Computation and Language, 020206 networking & telecommunications, [INFO.INFO-TT]Computer Science [cs]/Document and Text Processing, Duration (music), Language model, 0305 other medical science, Computation and Language (cs.CL), Word (computer architecture), Energy (signal processing), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The prosody of a spoken word is determined by its surrounding context. In incremental text-to-speech synthesis, where the synthesizer produces an output before it has access to the complete input, the full context is often unknown, which can result in a loss of naturalness in the synthesized speech. In this paper, we investigate whether the use of predicted future text can attenuate this loss. We compare several test conditions for the next word: (a) unknown (zero-word lookahead), (b) predicted by a language model, (c) randomly predicted and (d) ground truth. We measure the prosodic features (pitch, energy and duration) and find that predicted text provides significant improvements over a zero-word lookahead, but only slight gains over a random-word lookahead. We confirm these results with a perceptual test. Comment: 4 pages
Published: 2021

2. Learning robust speech representation with an articulatory-regularized variational autoencoder

Author: Laurent Girin, Jean-Luc Schwartz, Marc-Antoine Georges, Thomas Hueber, GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing (GIPSA-CRISSP), GIPSA Pôle Parole et Cognition (GIPSA-PPC), Grenoble Images Parole Signal Automatique (GIPSA-lab), Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP ), Université Grenoble Alpes (UGA)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP ), Université Grenoble Alpes (UGA)-Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Grenoble Alpes (UGA), GIPSA - Perception, Contrôle, Multimodalité et Dynamiques de la parole (GIPSA-PCMD), Laboratoire de Psychologie et NeuroCognition (LPNC ), Université Savoie Mont Blanc (USMB [Université de Savoie] [Université de Chambéry])-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA), and ANR-19-P3IA-0003,MIAI,MIAI @ Grenoble Alpes(2019)
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Speech production, Speech perception, Computer science, speech production, Quantitative Biology::Tissues and Organs, Speech recognition, Physics::Medical Physics, 02 engineering and technology, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL], Computer Science - Sound, 030507 speech-language pathology & audiology, 03 medical and health sciences, representation learning, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, 0202 electrical engineering, electronic engineering, information engineering, variational autoencoder, Representation (mathematics), Computer Science - Computation and Language, 020206 networking & telecommunications, Autoencoder, Speech enhancement, Generative model, Computer Science::Graphics, Computer Science::Sound, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], speech enhancement, articulatory model, 0305 other medical science, Computation and Language (cs.CL), Feature learning, Vocal tract, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: It is increasingly considered that human speech perception and production both rely on articulatory representations. In this paper, we investigate whether this type of representation could improve the performance of a deep generative model (here a variational autoencoder) trained to encode and decode acoustic speech features. First, we develop an articulatory model able to associate articulatory parameters describing the jaw, tongue, lips and velum configurations with vocal tract shapes and spectral features. Then we incorporate these articulatory parameters into a variational autoencoder applied to spectral features by using a regularization technique that constrains part of the latent space to represent articulatory trajectories. We show that this articulatory constraint improves model training by decreasing time to convergence and reconstruction loss at convergence, and yields better performance in a speech denoising task.
Published: 2021

3. High-Resolution Speaker Counting in Reverberant Rooms Using CRNN with Ambisonics Features

Author: Pierre-Amaury Grumiaux, Laurent Girin, Srdan Kitic, and Alexandre Guerin
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Reverberation, Microphone, Computer science, Ambisonics, Speech recognition, 020206 networking & telecommunications, 02 engineering and technology, computer.software_genre, Computer Science - Sound, Speaker diarisation, Sound recording and reproduction, Noise, Recurrent neural network, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Audio signal processing, computer, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Speaker counting is the task of estimating the number of people that are simultaneously speaking in an audio recording. For several audio processing tasks such as speaker diarization, separation, localization and tracking, knowing the number of speakers at each timestep is a prerequisite, or at least a strong advantage, in addition to enabling low-latency processing. For that purpose, we address the speaker counting problem with a multichannel convolutional recurrent neural network that produces an estimate at a short-term frame resolution. We trained the network to predict up to 5 concurrent speakers in a multichannel mixture, with simulated data including many different conditions in terms of source and microphone positions, reverberation, and noise. The network can predict the number of speakers with good accuracy at frame resolution. Comment: 5 pages, 1 figure
Published: 2021

4. Reverberant Sound Localization with a Robot Head Based on Direct-Path Relative Transfer Function

Author: Radu Horaud, Laurent Girin, Xiaofei Li, Fabien Badeig, Interpretation and Modelling of Images and Videos (PERCEPTION ), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Laboratoire Jean Kuntzmann (LJK ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing (GIPSA-CRISSP), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre 
National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), IEEE, European Project: 609465,EC:FP7:ICT,FP7-ICT-2013-10,EARS(2014), and European Project: 340113,EC:FP7:ERC,ERC-2013-ADG,VHIA(2014)
Subjects: FOS: Computer and information sciences, Microphone array, Sound (cs.SD), Computer science, Feature vector, Acoustics, 02 engineering and technology, Transfer function, Computer Science - Sound, 030507 speech-language pathology & audiology, 03 medical and health sciences, symbols.namesake, Computer Science - Robotics, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, Audio and Speech Processing (eess.AS), 0202 electrical engineering, electronic engineering, information engineering, FOS: Electrical engineering, electronic engineering, information engineering, [INFO.INFO-RB]Computer Science [cs]/Robotics [cs.RO], Impulse response, Short-time Fourier transform, Spectral density, 020206 networking & telecommunications, Noise, Fourier transform, binaural hearing, Computer Science::Sound, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], symbols, sound-source localization, 0305 other medical science, Robotics (cs.RO), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper addresses the problem of sound-source localization (SSL) with a robot head, which remains a challenge in real-world environments. In particular, we are interested in locating speech sources, as they are of high interest for human-robot interaction. The microphone-pair response corresponding to the direct-path sound propagation is a function of the source direction. In practice, this response is contaminated by noise and reverberation. The direct-path relative transfer function (DP-RTF) is defined as the ratio between the direct-path acoustic transfer functions (ATFs) of the two microphones, and it is an important feature for SSL. We propose a method to estimate the DP-RTF from noisy and reverberant signals in the short-time Fourier transform (STFT) domain. First, the convolutive transfer function (CTF) approximation is adopted to accurately represent the impulse response of the microphone array, and the first coefficient of the CTF is mainly composed of the direct-path ATF. At each frequency, the frame-wise speech auto- and cross-power spectral densities (PSDs) are obtained by spectral subtraction. Then a set of linear equations is constructed from the speech auto- and cross-PSDs of multiple frames, in which the DP-RTF is an unknown variable, and the DP-RTF is estimated by solving these equations. Finally, the estimated DP-RTFs are concatenated across frequencies and used as a feature vector for SSL. Experiments with a robot, placed in various reverberant environments, show that the proposed method outperforms two state-of-the-art methods. Comment: IEEE/RSJ International Conference on Intelligent Robots and Systems
Published: 2020

5. What the Future Brings: Investigating the Impact of Lookahead for Incremental Neural TTS

Author: Laurent Girin, Brooke Stephenson, Thomas Hueber, Laurent Besacier, Groupe d’Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole (GETALP), Laboratoire d'Informatique de Grenoble (LIG), Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP ), Université Grenoble Alpes (UGA)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP ), Université Grenoble Alpes (UGA), GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing (GIPSA-CRISSP), GIPSA Pôle Parole et Cognition (GIPSA-PPC), Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Grenoble Alpes (UGA)-Grenoble Images Parole Signal Automatique (GIPSA-lab), Institut Universitaire de France (IUF), Ministère de l'Education nationale, de l’Enseignement supérieur et de la Recherche (M.E.N.E.S.R.), and ANR-19-P3IA-0003,MIAI,MIAI @ Grenoble Alpes(2019)
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, incremental speech synthesis, Computer science, Speech recognition, Contrast (statistics), 020206 networking & telecommunications, Context (language use), Speech synthesis, 02 engineering and technology, MUSHRA, computer.software_genre, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL], representation learning, deep neural networks, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, 0202 electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Encoder, Feature learning, computer, Sentence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In incremental text-to-speech synthesis (iTTS), the synthesizer produces an audio output before it has access to the entire input sentence. In this paper, we study the behavior of a neural sequence-to-sequence TTS system when used in an incremental mode, i.e. when generating speech output for token n, the system has access to n + k tokens from the text sequence. We first analyze the impact of this incremental policy on the evolution of the encoder representations of token n for different values of k (the lookahead parameter). The results show that, on average, tokens travel 88% of the way to their full-context representation with a one-word lookahead and 94% with a two-word lookahead. We then investigate which text features are the most influential on the evolution towards the final representation using a random forest analysis. The results show that the most salient factors are related to token length. We finally evaluate the effects of lookahead k at the decoder level, using a MUSHRA listening test. This test shows results that contrast with the above high figures: the speech synthesis quality obtained with a two-word lookahead is significantly lower than the quality obtained with the full sentence. Comment: 5 pages, 4 figures
Published: 2020
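
Note on the incremental policy described in the abstract above: the following is a minimal sketch, not the authors' implementation, of which input tokens are visible when output is generated for token n with lookahead k. The function name `lookahead_windows` and the 0-based indexing are assumptions made for illustration only.

```python
def lookahead_windows(tokens, k):
    """Yield (current_token, visible_tokens) pairs for a lookahead of k tokens.

    Hypothetical helper: with 0-based index n, the visible input is the
    current token, everything before it, and at most k future tokens.
    """
    for n in range(len(tokens)):
        # Slicing past the end of the list is safe in Python, so the last
        # tokens simply see the full sentence.
        yield tokens[n], tokens[: n + k + 1]


if __name__ == "__main__":
    sentence = ["the", "cat", "sat", "down"]
    for current, visible in lookahead_windows(sentence, k=1):
        print(current, "->", visible)
```

With k = 0 each token sees only its own left context, and a sufficiently large k reduces to the full-sentence condition mentioned in the abstract.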

6. A Recurrent Variational Autoencoder for Speech Enhancement

Author: Simon Leglaive, Laurent Girin, Radu Horaud, Xavier Alameda-Pineda, CentraleSupélec, Institut d'Électronique et des Technologies du numéRique (IETR), Université de Nantes (UN)-Université de Rennes 1 (UR1), Université de Rennes (UNIV-RENNES)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées (INSA)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS), Interpretation and Modelling of Images and Videos (PERCEPTION), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire Jean Kuntzmann (LJK), Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP ), Université Grenoble Alpes (UGA)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP ), Université Grenoble Alpes (UGA), GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing (GIPSA-CRISSP), GIPSA Pôle Parole et Cognition (GIPSA-PPC), Grenoble Images Parole Signal Automatique (GIPSA-lab), Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP ), Université Grenoble Alpes (UGA)-Grenoble Images Parole Signal Automatique (GIPSA-lab), IEEE, PNRIA, ANR-19-CE33-0008,ML3RI,Apprentissage de bas-niveau d'ineractions robotiques multi-modales avec plusieurs personnes(2019), ANR-19-P3IA-0003,MIAI,MIAI @ Grenoble Alpes(2019), Université de Nantes (UN)-Université de Rennes (UR)-Institut 
National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS), and Nantes Université (NU)-Université de Rennes 1 (UR1)
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Machine Learning, Computer science, Computer Science - Artificial Intelligence, Speech recognition, Recurrent variational autoencoders, Speech enhancement, 02 engineering and technology, Latent variable, [INFO.INFO-NE]Computer Science [cs]/Neural and Evolutionary Computing [cs.NE], Computer Science - Sound, Non-negative matrix factorization, Machine Learning (cs.LG), [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], Nonnegative matrix factorization, 030507 speech-language pathology & audiology, 03 medical and health sciences, Audio and Speech Processing (eess.AS), 0202 electrical engineering, electronic engineering, information engineering, FOS: Electrical engineering, electronic engineering, information engineering, Neural and Evolutionary Computing (cs.NE), Computer Science - Neural and Evolutionary Computing, 020206 networking & telecommunications, Computer Science::Computation and Language (Computational Linguistics and Natural Language and Speech Processing), Autoencoder, Artificial Intelligence (cs.AI), Computer Science::Sound, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], Noise (video), 0305 other medical science, Variational inference, Encoder, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing, Generative grammar, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper presents a generative approach to speech enhancement based on a recurrent variational autoencoder (RVAE). The deep generative speech model is trained using clean speech signals only, and it is combined with a nonnegative matrix factorization noise model for speech enhancement. We propose a variational expectation-maximization algorithm where the encoder of the RVAE is fine-tuned at test time, to approximate the distribution of the latent variables given the noisy speech observations. Compared with previous approaches based on feed-forward fully-connected architectures, the proposed recurrent deep generative speech model induces a posterior temporal dynamic over the latent variables, which is shown to improve the speech enhancement results.
Published: 2020

7. Audio-Visual Variational Fusion for Multi-Person Tracking with Robots

Author: Yutong Ban, Guillaume Sarrazin, Xavier Alameda-Pineda, Laurent Girin, Soraya Arias, Guillaume Delorme, Radu Horaud, Xiaofei Li, Bastien Morgue, Interpretation and Modelling of Images and Videos (PERCEPTION ), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Laboratoire Jean Kuntzmann (LJK ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), Service Expérimentation et Développement (SED [Grenoble]), Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria), GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing (GIPSA-CRISSP), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble 
Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), and SED [Grenoble]
Subjects: Computer science, media_common.quotation_subject, Tracking, Scene understanding, [INFO.INFO-CV]Computer Science [cs]/Computer Vision and Pattern Recognition [cs.CV], 020206 networking & telecommunications, 02 engineering and technology, Tracking (particle physics), Field (computer science), Presentation, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], Human–computer interaction, Perception, Vision for robotics, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], 0202 electrical engineering, electronic engineering, information engineering, Robot, Sensory cue, ComputingMilieux_MISCELLANEOUS, media_common
Abstract: Robust multi-person tracking with robots opens the door to analysing engagement and social signals in real-world environments. Multi-person scenarios are characterised by (i) a time-varying number of people, (ii) intermittent auditory cues (e.g. speech turns) and visual cues (e.g. persons appearing/disappearing) and (iii) the impact of the robot's actions on perception. The various sensors (cameras and microphones) available for perception provide a rich flow of information of an intermittent and complementary nature. How to jointly exploit these cues to tackle the multi-person tracking problem with an autonomous system has been an intense research line of the Perception Team in the past few years. In this demo we want to present our now mature achievements in the field, and demonstrate two robotic systems able to track multiple persons using auditory and visual cues when they are available. We will bring the two robots and the necessary computing resources with us, as well as the required presentation materials to discuss the models, methods and tools supporting this technology with the attendees.
Published: 2019

8. Bayesian time-domain multiple sound source localization for a stochastic machine

Author: Emmanuel Mazer, Laurent Girin, Marvin Faix, Raphael Frisch, Jacques Droulez, Techniques of Informatics and Microelectronics for integrated systems Architecture (TIMA), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA), Interaction située avec les objets et environnements intelligents (PERVASIVE), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire d'Informatique de Grenoble (LIG), Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP)-Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP)-Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA), Institut des Systèmes Intelligents et de Robotique (ISIR), Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS), CRISSP (GIPSA-CRISSP), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut Polytechnique de Grenoble - 
Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-GIPSA Pôle Parole et Cognition (GIPSA-PPC), Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA), Techniques de l'Informatique et de la Microélectronique pour l'Architecture des systèmes intégrés (TIMA), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire d'Informatique de Grenoble (LIG ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université 
Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing (GIPSA-CRISSP), Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Grenoble Images Parole Signal Automatique (GIPSA-lab ), and Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])
Subjects: [INFO.INFO-AR]Computer Science [cs]/Hardware Architecture [cs.AR], Signal processing, Computer science, Bayesian probability, 020206 networking & telecommunications, Statistical model, specific hardware, 02 engineering and technology, Acoustic source localization, Bayesian stochastic machine, time-domain processing, Bayesian inference, Robustness (computer science), [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Multiple sound source localization, Time domain, Algorithm, ComputingMilieux_MISCELLANEOUS
Abstract: We propose a time-domain multiple sound source localization (SSL) method based on Bayesian inference. This method is specifically designed to run on the stochastic machines (SM) that we are currently developing to perform efficient low-level sensor signal processing with ultra-low power consumption. The proposed SSL method is divided into two main parts. First, a probabilistic model is run on 50 very short time frames (3.75 ms each) of multichannel recorded signals. Second, the results obtained on the different frames are fused to obtain a final localization map. Using the system in a supervised way allows estimated source locations to be extracted by selecting as many maxima as there are sources in the room. We explain how this method is implemented on an SM. Experiments are presented to illustrate the performance and robustness of the resulting system.
Published: 2019
Full Text: View/download PDF
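
The two-stage procedure in the abstract above (per-frame inference, then fusion and selection of maxima) might be sketched as below. This is only an illustration under stated assumptions: the abstract does not say how the frames are fused, so the sketch multiplies per-frame probability maps (a sum of log-probabilities), and it uses a 1-D grid of candidate directions. The function names `fuse_frames` and `pick_sources` are hypothetical, and the sketch does not model the stochastic-machine hardware.

```python
import numpy as np


def fuse_frames(frame_maps, eps=1e-12):
    """Fuse per-frame probability maps into one normalized map.

    Assumption (not from the abstract): frames are treated as independent,
    so fusion is a product of maps, computed as a sum of log-probabilities.
    """
    log_maps = np.log(np.asarray(frame_maps, dtype=float) + eps)
    fused = log_maps.sum(axis=0)
    fused -= fused.max()          # stabilize before exponentiating
    fused = np.exp(fused)
    return fused / fused.sum()


def pick_sources(fused_map, n_sources):
    """Return the grid indices of the n_sources largest local maxima."""
    m = np.asarray(fused_map, dtype=float)
    padded = np.pad(m, 1, constant_values=-np.inf)
    # A cell is a peak if it exceeds its left neighbor and is not below its right one.
    is_peak = (m > padded[:-2]) & (m >= padded[2:])
    peaks = np.flatnonzero(is_peak)
    best = peaks[np.argsort(m[peaks])[::-1][:n_sources]]
    return np.sort(best)


if __name__ == "__main__":
    # Synthetic example: 50 frames over a 36-cell direction grid, two sources.
    grid = np.arange(36)
    one_frame = (np.exp(-0.5 * ((grid - 5) / 1.5) ** 2)
                 + 0.8 * np.exp(-0.5 * ((grid - 20) / 1.5) ** 2)
                 + 0.01)
    frames = np.tile(one_frame / one_frame.sum(), (50, 1))
    print(pick_sources(fuse_frames(frames), n_sources=2))
```

The number of sources is passed in explicitly, which mirrors the supervised use described in the abstract, where the number of maxima to select is known in advance.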

9. Semi-supervised multichannel speech enhancement with variational autoencoders and non-negative matrix factorization

Author: Simon Leglaive, Laurent Girin, Radu Horaud, Interpretation and Modelling of Images and Videos (PERCEPTION ), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Laboratoire Jean Kuntzmann (LJK ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing (GIPSA-CRISSP), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de 
la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), and European Project: 340113,EC:FP7:ERC,ERC-2013-ADG,VHIA(2014)
Subjects: FOS: Computer and information sciences, Computer Science::Machine Learning, Sound (cs.SD), Computer science, Gaussian, Monte Carlo method, Machine Learning (stat.ML), Multichannel speech enhancement, 02 engineering and technology, [INFO.INFO-NE]Computer Science [cs]/Neural and Evolutionary Computing [cs.NE], Computer Science - Sound, Non-negative matrix factorization, Matrix decomposition, non-negative matrix factorization, Statistics::Machine Learning, 030507 speech-language pathology & audiology, 03 medical and health sciences, symbols.namesake, Audio and Speech Processing (eess.AS), Statistics - Machine Learning, FOS: Electrical engineering, electronic engineering, information engineering, 0202 electrical engineering, electronic engineering, information engineering, ComputingMilieux_MISCELLANEOUS, Artificial neural network, business.industry, local Gaussian modeling, 020206 networking & telecommunications, Pattern recognition, variational autoencoders, Speech enhancement, ComputingMethodologies_PATTERNRECOGNITION, Computer Science::Sound, symbols, Artificial intelligence, Noise (video), Monte Carlo expectation-maximization, 0305 other medical science, business, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this paper we address speaker-independent multichannel speech enhancement in unknown noisy environments. Our work is based on a well-established multichannel local Gaussian modeling framework. We propose to use a neural network for modeling the speech spectro-temporal content. The parameters of this supervised model are learned using the framework of variational autoencoders. The noisy recording environment is assumed to be unknown, so the noise spectro-temporal modeling remains unsupervised and is based on non-negative matrix factorization (NMF). We develop a Monte Carlo expectation-maximization algorithm and we experimentally show that the proposed approach outperforms its NMF-based counterpart, where speech is modeled using supervised NMF. Comment: 5 pages, 2 figures, audio examples and code available online at https://team.inria.fr/perception/icassp-2019-mvae/
Published: 2019
Full Text: View/download PDF

10. Speech Enhancement with Variational Autoencoders and Alpha-stable Distributions

Author: Laurent Girin, Radu Horaud, Antoine Liutkus, Umut Simsekli, Simon Leglaive, Interpretation and Modelling of Images and Videos (PERCEPTION ), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Laboratoire Jean Kuntzmann (LJK ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), Signal, Statistique et Apprentissage (S2A), Laboratoire Traitement et Communication de l'Information (LTCI), Institut Mines-Télécom [Paris] (IMT)-Télécom Paris-Institut Mines-Télécom [Paris] (IMT)-Télécom Paris, Département Images, Données, Signal (IDS), Télécom ParisTech, Scientific Data Management (ZENITH), Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier (LIRMM), Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS)-Inria Sophia Antipolis - Méditerranée (CRISAM), Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria), GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing (GIPSA-CRISSP), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble 
Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), Chaire DSAIDIS, This work is supported by the ERC Advanced Grant VHIA #34, ANR-15-CE38-0003,KAMoulox,Démixage en ligne de larges archives sonores(2015), ANR-16-CE23-0014,FBIMATRIX,Méthodes distribuées et parallèles de Monte-Carlo par chaînes de Markov pour l'Inférence Bayésienne de modèles à factorisation de tenseurs(2016), European Project: 340113,EC:FP7:ERC,ERC-2013-ADG,VHIA(2014), and Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)-Inria Sophia Antipolis - Méditerranée (CRISAM)
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer science, Gaussian, Speech recognition, Speech enhancement, Monte Carlo method, Machine Learning (stat.ML), 02 engineering and technology, [INFO.INFO-NE]Computer Science [cs]/Neural and Evolutionary Computing [cs.NE], Intelligibility (communication), Computer Science - Sound, Matrix decomposition, 030507 speech-language pathology & audiology, 03 medical and health sciences, symbols.namesake, Audio and Speech Processing (eess.AS), Statistics - Machine Learning, FOS: Electrical engineering, electronic engineering, information engineering, 0202 electrical engineering, electronic engineering, information engineering, Context model, Noise measurement, business.industry, Deep learning, 020206 networking & telecommunications, Computer Science::Sound, symbols, Monte Carlo expectation-maximization, Artificial intelligence, 0305 other medical science, business, Variational autoencoders, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing, Alpha-stable distribution, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper focuses on single-channel semi-supervised speech enhancement. We learn a speaker-independent deep generative speech model using the framework of variational autoencoders. The noise model remains unsupervised because we do not assume prior knowledge of the noisy recording environment. In this context, our contribution is to propose a noise model based on alpha-stable distributions, instead of the more conventional Gaussian non-negative matrix factorization approach found in previous studies. We develop a Monte Carlo expectation-maximization algorithm for estimating the model parameters at test time. Experimental results show the superiority of the proposed approach both in terms of perceptual quality and intelligibility of the enhanced speech signal. Comment: 5 pages, 3 figures, audio examples and code available online: https://team.inria.fr/perception/research/icassp2019-asvae/. arXiv admin note: text overlap with arXiv:1811.06713
Published: 2019
Full Text: View/download PDF

11. Online Localization and Tracking of Multiple Moving Speakers in Reverberant Environments

Author: Xavier Alameda-Pineda, Radu Horaud, Laurent Girin, Yutong Ban, Xiaofei Li, Interpretation and Modelling of Images and Videos (PERCEPTION ), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Laboratoire Jean Kuntzmann (LJK ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing (GIPSA-CRISSP), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of 
Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), and European Project: 340113,EC:FP7:ERC,ERC-2013-ADG,VHIA(2014)
Subjects: FOS: Computer and information sciences, Bayesian variational inference, Sound (cs.SD), Reverberation, multiple target tracking, Computer science, Speaker tracking, 02 engineering and technology, Tracking (particle physics), Computer Science - Sound, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], Audio and Speech Processing (eess.AS), reverberant environments, Expectation–maximization algorithm, FOS: Electrical engineering, electronic engineering, information engineering, 0202 electrical engineering, electronic engineering, information engineering, multiple moving speakers, Electrical and Electronic Engineering, Online algorithm, 020206 networking & telecommunications, Acoustic source localization, Solver, online tracking, expectation-maximization, Feature (computer vision), Computer Science::Sound, Signal Processing, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], speaker tracking, sound-source localization, Algorithm, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We address the problem of online localization and tracking of multiple moving speakers in reverberant environments. The paper has the following contributions. We use the direct-path relative transfer function (DP-RTF), an inter-channel feature that encodes acoustic information robust against reverberation, and we propose an online algorithm well suited for estimating DP-RTFs associated with moving audio sources. Another crucial ingredient of the proposed method is its ability to properly assign DP-RTFs to audio-source directions. Towards this goal, we adopt a maximum-likelihood formulation and we propose to use an exponentiated gradient (EG) to efficiently update source-direction estimates starting from their currently available values. The problem of multiple speaker tracking is computationally intractable because the number of possible associations between observed source directions and physical speakers grows exponentially with time. We adopt a Bayesian framework and we propose a variational approximation of the posterior filtering distribution associated with multiple speaker tracking, as well as an efficient variational expectation-maximization (VEM) solver. The proposed online localization and tracking method is thoroughly evaluated using two datasets that contain recordings performed in real environments. Comment: IEEE Journal of Selected Topics in Signal Processing, 2019
Published: 2019
Full Text: View/download PDF

12. Multichannel Speech Separation and Enhancement Using the Convolutive Transfer Function

Author: Laurent Girin, Xiaofei Li, Sharon Gannot, Radu Horaud, Interpretation and Modelling of Images and Videos (PERCEPTION ), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Laboratoire Jean Kuntzmann (LJK ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing (GIPSA-CRISSP), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre 
National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), Bar-Ilan University [Israël], and European Project: 340113,EC:FP7:ERC,ERC-2013-ADG,VHIA(2014)
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Noise power, Acoustics and Ultrasonics, Computational complexity theory, Computer science, Microphone, Inverse, 02 engineering and technology, Transfer function, Computer Science - Sound, 030507 speech-language pathology & audiology, 03 medical and health sciences, symbols.namesake, Lasso (statistics), [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, 0202 electrical engineering, electronic engineering, information engineering, Computer Science (miscellaneous), Electrical and Electronic Engineering, MINT, 020206 networking & telecommunications, convolutive transfer function, Computational Mathematics, Noise, Fourier transform, Index Terms-Audio source separation, short-time Fourier transform, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], symbols, speech enhancement, 0305 other medical science, Algorithm, Lasso optimization, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper addresses the problem of speech separation and enhancement from multichannel convolutive and noisy mixtures, assuming known mixing filters. We propose to perform the speech separation and enhancement task in the short-time Fourier transform domain, using the convolutive transfer function (CTF) approximation. Compared to time-domain filters, the CTF has far fewer taps; consequently, it has fewer near-common zeros among channels and lower computational complexity. The work proposes three speech-source recovery methods, namely: i) the multichannel inverse filtering method, i.e. the multiple input/output inverse theorem (MINT), exploited in the CTF domain and for the multi-source case; ii) a beamforming-like multichannel inverse filtering method applying single-source MINT and using power minimization, which is suitable whenever the source CTFs are not all known; and iii) a constrained Lasso method, where the sources are recovered by minimizing the $\ell_1$-norm to impose their spectral sparsity, with the constraint that the $\ell_2$-norm fitting cost, between the microphone signals and the mixing model involving the unknown source signals, is less than a tolerance. The noise can be reduced by setting the tolerance according to the noise power. Experiments under various acoustic conditions are carried out to evaluate the three proposed methods. The comparison between them as well as with the baseline methods is presented. Comment: Submitted to IEEE/ACM Transactions on Audio, Speech and Language Processing
Published: 2019
Full Text: View/download PDF
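
As a reading aid, the constrained Lasso recovery described in the abstract above can be written as follows. The notation is assumed here for illustration and is not taken from the paper: $\mathbf{x}$ stacks the microphone STFT coefficients, $\mathbf{s}$ stacks the unknown source STFT coefficients, $\mathbf{A}$ is the matrix that applies the known CTF mixing filters, and $\epsilon$ is the tolerance.

```latex
\hat{\mathbf{s}} = \arg\min_{\mathbf{s}} \; \|\mathbf{s}\|_1
\quad \text{subject to} \quad
\|\mathbf{x} - \mathbf{A}\mathbf{s}\|_2 \le \epsilon
```

In this form, a larger $\epsilon$ lets more of the observation be left unexplained as residual, which is consistent with the abstract's remark that the tolerance controls noise reduction.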

13. Audio source separation into the wild

Author: Laurent Girin, Sharon Gannot, Xiaofei Li, GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing (GIPSA-CRISSP), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), Interpretation and Modelling of Images and Videos (PERCEPTION ), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Laboratoire Jean Kuntzmann (LJK ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Centre National de la 
Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), and Bar-Ilan University [Israël]
Subjects: Reverberation, Computer science, linear Gaussian models, audio source separation, nonnegative matrix factorization, 020206 networking & telecommunications, 02 engineering and technology, 01 natural sciences, Field (computer science), Non-negative matrix factorization, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], 0103 physical sciences, Synchronization (computer science), 0202 electrical engineering, electronic engineering, information engineering, Source separation, Electronic engineering, Robot, 010301 acoustics
Abstract: This review chapter is dedicated to multichannel audio source separation in real-life environments. We explore some of the major achievements in the field and discuss some of the remaining challenges. We will explore several important practical scenarios, e.g. moving sources and/or microphones, varying number of sources and sensors, high reverberation levels, spatially diffuse sources, and synchronization problems. Several applications, such as smart assistants, cellular phones, hearing aids and robots, will be discussed. Our perspectives on the future of the field will be given as concluding remarks of this chapter.
Published: 2019
Full Text: View/download PDF

14. A variance modeling framework based on variational autoencoders for speech enhancement

Author: Laurent Girin, Radu Horaud, Simon Leglaive, Interpretation and Modelling of Images and Videos (PERCEPTION ), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Laboratoire Jean Kuntzmann (LJK ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing (GIPSA-CRISSP), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de 
la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), and European Project: 340113,EC:FP7:ERC,ERC-2013-ADG,VHIA(2014)
Subjects: FOS: Computer and information sciences, Computer Science::Machine Learning, Sound (cs.SD), Computer Science - Machine Learning, Computer science, Generalization, Audio source separation, Machine Learning (stat.ML), 02 engineering and technology, [INFO.INFO-NE]Computer Science [cs]/Neural and Evolutionary Computing [cs.NE], Computer Science - Sound, Machine Learning (cs.LG), Matrix decomposition, Non-negative matrix factorization, non-negative matrix factorization, 030507 speech-language pathology & audiology, 03 medical and health sciences, Statistics::Machine Learning, Audio and Speech Processing (eess.AS), Statistics - Machine Learning, FOS: Electrical engineering, electronic engineering, information engineering, 0202 electrical engineering, electronic engineering, information engineering, Source separation, Artificial neural network, business.industry, Deep learning, 020206 networking & telecommunications, Pattern recognition, Autoencoder, variational autoencoders, Speech enhancement, ComputingMethodologies_PATTERNRECOGNITION, Computer Science::Sound, speech enhancement, Artificial intelligence, Monte Carlo expectation-maximization, 0305 other medical science, business, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this paper we address the problem of enhancing speech signals in noisy mixtures using a source separation approach. We explore the use of neural networks as an alternative to a popular speech variance model based on supervised non-negative matrix factorization (NMF). More precisely, we use a variational autoencoder as a speaker-independent supervised generative speech model, highlighting the conceptual similarities that this approach shares with its NMF-based counterpart. In order to be free of generalization issues regarding the noisy recording environments, we follow the approach of having a supervised model only for the target speech signal, the noise model being based on unsupervised NMF. We develop a Monte Carlo expectation-maximization algorithm for inferring the latent variables in the variational autoencoder and estimating the unsupervised model parameters. Experiments show that the proposed method outperforms a semi-supervised NMF baseline and a state-of-the-art fully supervised deep learning approach. Comment: 6 pages, 3 figures
Published: 2018
Full Text: View/download PDF

15. Online Localization of Multiple Moving Speakers in Reverberant Environments

Author: Bastien Mourgue, Laurent Girin, Sharon Gannot, Xiaofei Li, Radu Horaud, Interpretation and Modelling of Images and Videos (PERCEPTION ), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Laboratoire Jean Kuntzmann (LJK ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing (GIPSA-CRISSP), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of 
Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), Bar-Ilan University [Israël], and European Project: 340113,EC:FP7:ERC,ERC-2013-ADG,VHIA(2014)
Subjects: Reverberation, Computer science, Speech recognition, 020206 networking & telecommunications, 02 engineering and technology, Acoustic source localization, Mixture model, Speech processing, Motion capture, Complex normal distribution, 030507 speech-language pathology & audiology, 03 medical and health sciences, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], Rate of convergence, Computer Science::Sound, Feature (computer vision), reverberant environments, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], 0202 electrical engineering, electronic engineering, information engineering, multiple moving speakers, sound-source localization, 0305 other medical science
Abstract: This paper addresses the problem of online localization of multiple moving speakers in reverberant environments. The direct-path relative transfer function (DP-RTF), defined as the ratio between the first taps of the convolutive transfer functions (CTFs) of two microphones, encodes the inter-channel direct-path information and is thus used as a localization feature that is robust against reverberation. The CTF estimation is based on the cross-relation method. In this work, the recursive least-squares method is proposed to solve the cross-relation problem, due to its relatively low computational cost and its good convergence rate. The DP-RTF feature estimated at each time-frequency bin is assumed to correspond to a single speaker. A complex Gaussian mixture model is used to assign each observed feature to one among several speakers. The recursive expectation-maximization algorithm is adopted to update the model parameters online. The method is evaluated with a new dataset containing multiple moving speakers, where the ground-truth speaker trajectories are recorded with a motion capture system.
Published: 2018
Full Text: View/download PDF

16. Multichannel Identification and Nonnegative Equalization for Dereverberation and Noise Reduction based on Convolutive Transfer Function

Author: Laurent Girin, Radu Horaud, Sharon Gannot, Xiaofei Li, Interpretation and Modelling of Images and Videos (PERCEPTION ), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Laboratoire Jean Kuntzmann (LJK ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), Bar-Ilan University [Israël], GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing (GIPSA-CRISSP), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble 
Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), and European Project: 340113,EC:FP7:ERC,ERC-2013-ADG,VHIA(2014)
Subjects: Frequency response, Noise power, Acoustics and Ultrasonics, Computer science, Noise reduction, 02 engineering and technology, Transfer function, 030507 speech-language pathology & audiology, 03 medical and health sciences, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], 0202 electrical engineering, electronic engineering, information engineering, Computer Science (miscellaneous), Source separation, Electrical and Electronic Engineering, Impulse response, lasso optimization, Short-time Fourier transform, audio source separation, 020206 networking & telecommunications, convolutive transfer function, Speech enhancement, Computational Mathematics, Computer Science::Sound, source separation, short-time Fourier transform, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], speech enhancement, 0305 other medical science, Algorithm
Abstract: International audience; This paper addresses the problems of blind multichannel identification and equalization for joint speech dereverberation and noise reduction. The time-domain cross-relation method is hardly applicable for blind room impulse response identification due to the near-common zeros of the long impulse responses. We extend the cross-relation method to the short-time Fourier transform (STFT) domain, in which the time-domain impulse response is approximately represented by the convolutive transfer function (CTF) with much less coefficients. For the oversampled STFT, CTFs suffer from the common zeros caused by the non-flat frequency response of the STFT window. To overcome this, we propose to identify CTFs using the STFT framework with oversampled signals and critically sampled CTFs, which is a good trade-off between the frequency aliasing of the signals and the common zeros problem of CTFs. The identified complex-valued CTFs are not accurate enough for multichannel equalization due to the frequency aliasing of the CTFs. Thence, we only use the CTF magnitudes, which leads to a nonnegative multichannel equalization method based on a nonnegative convolution model between the STFT magnitude of the source signal and the CTF magnitude. Compared with the complex-valued convolution model, this nonnegative convolution model is shown to be more robust against the CTF perturbations. To recover the STFT magnitude of the source signal and to reduce the additive noise, the L2-norm fitting error between the STFT magnitude of the microphone signals and the nonnegative convolution is constrained to be less than a noise power related tolerance. Meanwhile, the L1-norm of the STFT magnitude of the source signal is minimized to impose the sparsity.
Published: 2018
Full Text: View/download PDF

17. An EM Algorithm for Audio Source Separation Based on the Convolutive Transfer Function

Author: Laurent Girin, Xiaofei Li, Radu Horaud, Interpretation and Modelling of Images and Videos (PERCEPTION ), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Laboratoire Jean Kuntzmann (LJK ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing (GIPSA-CRISSP), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la 
Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), and European Project: 340113,EC:FP7:ERC,ERC-2013-ADG,VHIA(2014)
Subjects: Speech recognition, Audio source separation, Multiplicative function, 020206 networking & telecommunications, 02 engineering and technology, Transfer function, convolutive transfer function, Convolution, 030507 speech-language pathology & audiology, 03 medical and health sciences, symbols.namesake, Fourier transform, Mixing (mathematics), [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], Computer Science::Sound, Expectation–maximization algorithm, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], 0202 electrical engineering, electronic engineering, information engineering, Source separation, symbols, 0305 other medical science, Hidden Markov model, EM algorithm, Algorithm, Mathematics
Abstract: International audience; This paper addresses the problem of audio source separation from (possibly under-determined) multichannel convolutive mixtures. We propose a separation method based on the convolutive transfer function (CTF) in the short-time Fourier transform domain. For strongly reverberant signals, the CTF is a much more appropriate model than the widely-used multiplicative transfer function approximation. An Expectation-Maximization (EM) algorithm is proposed to jointly estimate the model parameters, including the CTF coefficients of the mixing filters, and infer the sources. Experiments show that the proposed method provides very satisfactory performance on highly reverberant speech mixtures.
Published: 2017
Full Text: View/download PDF

18. Multiple-Speaker Localization Based on Direct-Path Features and Likelihood Maximization with Spatial Sparsity Regularization

Author: Laurent Girin, Xiaofei Li, Sharon Gannot, Radu Horaud, Interpretation and Modelling of Images and Videos (PERCEPTION ), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Laboratoire Jean Kuntzmann (LJK ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing (GIPSA-CRISSP), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre 
National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), Bar-Ilan University [Israël], and European Project: 340113,EC:FP7:ERC,ERC-2013-ADG,VHIA(2014)
Subjects: FOS: Computer and information sciences, Reverberation, Sound (cs.SD), candidate-based GMM, Acoustics and Ultrasonics, Computer science, 02 engineering and technology, direct-path RTF, Transfer function, Computer Science - Sound, Multiple-speaker localization, 030507 speech-language pathology & audiology, 03 medical and health sciences, symbols.namesake, entropy penalty, 0202 electrical engineering, electronic engineering, information engineering, Computer Science (miscellaneous), Entropy (information theory), Electrical and Electronic Engineering, business.industry, Small number, 020206 networking & telecommunications, Pattern recognition, Mixture model, Grid, Computational Mathematics, Fourier transform, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], symbols, Artificial intelligence, 0305 other medical science, business, Binaural recording
Abstract: This paper addresses the problem of multiple-speaker localization in noisy and reverberant environments, using binaural recordings of an acoustic scene. A Gaussian mixture model (GMM) is adopted, whose components correspond to all the possible candidate source locations defined on a grid. After optimizing the GMM-based objective function, given an observed set of binaural features, both the number of sources and their locations are estimated by selecting the GMM components with the largest priors. This is achieved by enforcing a sparse solution, thus favoring a small number of speakers with respect to the large number of initial candidate source locations. An entropy-based penalty term is added to the likelihood, thus imposing sparsity over the set of GMM priors. In addition, the direct-path relative transfer function (DP-RTF) is used to build robust binaural features. The DP-RTF, recently proposed for single-source localization, was shown to be robust to reverberations, since it encodes inter-channel information corresponding to the direct-path of sound propagation. In this paper, we extend the DP-RTF estimation to the case of multiple sources. In the short-time Fourier transform domain, a consistency test is proposed to check whether a set of consecutive frames is associated to the same source or not. Reliable DP-RTF features are selected from the frames that pass the consistency test to be used for source localization. Experiments carried out using both simulation data and real data gathered with a robotic head confirm the efficiency of the proposed multi-source localization method. Comment: 16 pages, 4 figures, 4 tables
Published: 2017
Full Text: View/download PDF

19. Explaining the parameterized Wiener filter with alpha-stable processes

Author: Laurent Girin, Antoine Liutkus, Roland Badeau, Mathieu Fontaine, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing (GIPSA-CRISSP), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble 
Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), Télécom ParisTech, Projet ANR KAMoulox, ANR-15-CE38-0003,KAMoulox,Démixage en ligne de larges archives sonores(2015), Badeau, Roland, Démixage en ligne de larges archives sonores - - KAMoulox2015 - ANR-15-CE38-0003 - AAPG2015 - VALID, Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), and Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)
Subjects: Noise (signal processing), Noise reduction, Speech recognition, Wiener filter, Spectral density, Wiener deconvolution, 020206 networking & telecommunications, 02 engineering and technology, Weighting, Speech enhancement, 030507 speech-language pathology & audiology, 03 medical and health sciences, symbols.namesake, Fourier transform, probability theory, denoising, 0202 electrical engineering, electronic engineering, information engineering, symbols, Wiener filtering, 0305 other medical science, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing, Algorithm, [SPI.SIGNAL] Engineering Sciences [physics]/Signal and Image processing, alpha-stable processes, Mathematics
Abstract: International audience; This paper introduces a new method for single-channel denoising that sheds new light on classical early developments on this topic that occurred in the 70’s and 80’s with Wiener filtering and spectral subtraction. Operating both in the short-time Fourier transform domain, these methods consist in estimating the power spectral density (PSD) of the noise without speech. Then, the clean speech signal is obtained by manipulating the corrupted time-frequency bins thanks to these noise PSD estimates. Theoretically grounded when using power spectra, these methods were subsequently generalized to magnitude spectra, or shown to yield better performance by weighting the PSDs in the so-called parameterized Wiener filter. Both these strategies were long considered ad-hoc. To the best of our knowledge, while we recently proposed an interpretation of magnitude processing, there is still no theoretical result that would justify the better performance of parameterized Wiener filters. Here, we show how the α-stable probabilistic model for waveforms naturally leads to these weighted filters and we provide a grounded and fast algorithm to enhance corrupted audio that compares favorably with classical denoising methods.
Published: 2017
Full Text: View/download PDF
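
The abstract of record 19 refers to the parameterized Wiener filter, in which the power spectral densities (PSDs) are weighted before forming the gain. As a rough illustration, the sketch below uses the commonly cited textbook form, gain = (P_s / (P_s + a * P_n)) ** b. This form is an assumption for illustration and is not taken from the abstract; the names `parameterized_wiener_gain`, `noise_weight` (a), and `power` (b) are hypothetical, and the paper's alpha-stable derivation may lead to a different parameterization.

```python
def parameterized_wiener_gain(speech_psd, noise_psd, noise_weight=1.0, power=1.0):
    """Per-bin gain (P_s / (P_s + a * P_n)) ** b for one time frame.

    speech_psd, noise_psd: sequences of non-negative PSD estimates, one per
    frequency bin. noise_weight (a) and power (b) are illustrative parameter
    names; a = b = 1 gives the classical Wiener gain.
    """
    gains = []
    for p_s, p_n in zip(speech_psd, noise_psd):
        denom = p_s + noise_weight * p_n
        # An empty bin (no speech, no noise) gets zero gain instead of 0/0.
        ratio = p_s / denom if denom > 0.0 else 0.0
        gains.append(ratio ** power)
    return gains


def apply_gain(noisy_bins, gains):
    """Scale the noisy time-frequency bins (real or complex) by the gains."""
    return [g * x for g, x in zip(gains, noisy_bins)]


# Classical Wiener gain for a bin where speech is three times the noise power.
classical = parameterized_wiener_gain([3.0], [1.0])
# Heavier noise weighting attenuates the same bin more strongly.
weighted = parameterized_wiener_gain([3.0], [1.0], noise_weight=3.0)
```

With `noise_weight` above 1 the filter suppresses more noise at the cost of more speech distortion, and `power` below 1 softens the attenuation; both are tuning choices in this sketch, not values reported by the authors.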

20. An EM Algorithm for Joint Source Separation and Diarisation of Multichannel Convolutive Speech Mixtures

Author: Laurent Girin, Radu Horaud, Dionyssos Kounades-Bastian, Sharon Gannot, Xavier Alameda-Pineda, Interpretation and Modelling of Images and Videos (PERCEPTION ), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Laboratoire Jean Kuntzmann (LJK ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing (GIPSA-CRISSP), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - 
Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), University of Trento [Trento], Bar-Ilan University [Israël], European Project: 609465,EC:FP7:ICT,FP7-ICT-2013-10,EARS(2014), and European Project: 340113,EC:FP7:ERC,ERC-2013-ADG,VHIA(2014)
Subjects: Computer science, Speech recognition, Audio source separation, 02 engineering and technology, Non-negative matrix factorization, Matrix decomposition, 030507 speech-language pathology & audiology, 03 medical and health sciences, symbols.namesake, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], Expectation–maximization algorithm, 0202 electrical engineering, electronic engineering, information engineering, Source separation, Hidden Markov model, business.industry, speaker diarisation, 020206 networking & telecommunications, Statistical model, Pattern recognition, Speaker diarisation, local Gaussian model, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], symbols, Artificial intelligence, 0305 other medical science, business, Gaussian network model
Abstract: International audience; We present a probabilistic model for joint source separation and diarisation of multichannel convolutive speech mixtures. We build upon the framework of local Gaussian model (LGM) with non-negative matrix factorization (NMF). The diarisation is introduced as a temporal labeling of each source in the mix as active or inactive at the short-term frame level. We devise an EM algorithm in which the source separation process is aided by the diarisation state, since the latter indicates the sources actually present in the mixture. The diarisation state is tracked with a Hidden Markov Model (HMM) with emission probabilities calculated from the estimated source signals. The proposed EM has separation performance comparable with a state-of-the-art LGM NMF method, while outperforming a state-of-the-art speaker diarisation pipeline.
Published: 2017
Full Text: View/download PDF

21. Fast and Accurate Direct MDCT to DFT Conversion With Arbitrary Window Functions

Author: Shuhua Zhang, Laurent Girin, GIPSA - Machines parlantes, Gestes oro-faciaux, Interaction Face-à-face, Communication augmentée (GIPSA-MAGIC), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS), and ANR-09-CORD-0006,DReaM,Le Disque Repensé pour l'Écoute Active de la Musique(2009)
Subjects: Signal processing, Acoustics and Ultrasonics, Modified discrete cosine transform, Finite impulse response, MDCT, 020206 networking & telecommunications, 02 engineering and technology, Filter (signal processing), Toeplitz matrix, DFT, Window function, Discrete Fourier transform, FIR filtering, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, 0202 electrical engineering, electronic engineering, information engineering, Discrete cosine transform, 020201 artificial intelligence & image processing, Electrical and Electronic Engineering, Arithmetic, window function, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing, Algorithm, Mathematics
Abstract: International audience; In this paper, we propose a method for direct conversion of MDCT coefficients to DFT coefficients, without passing through time signal reconstruction. In contrast to previous work, this method is valid for any pair of MDCT and DFT window functions. It is based on the decomposition of the MDCT-to-DFT conversion matrices into a Toeplitz part plus a Hankel part. The latter is split, then mirrored and combined with the former to construct a global Toeplitz matrix. This leads to a fast FIR filtering implementation of the conversion process. The filter taps are DFT coefficients of window functions products, and concentrate most of their energy in a few low-frequency taps. The conversion can thus be efficiently approximated by keeping only a few most significant taps, as confirmed by numerical experiments: For example, for frame size of 2048, Hanning-windowed DFT is obtained from KBD-windowed MDCT with SNR over 60 dB when keeping only 20 taps.
Published: 2013
Full Text: View/download PDF

22. Estimation of the Direct-Path Relative Transfer Function for Supervised Sound-Source Localization

Author: Radu Horaud, Laurent Girin, Sharon Gannot, Xiaofei Li, Interpretation and Modelling of Images and Videos (PERCEPTION ), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Laboratoire Jean Kuntzmann (LJK ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing (GIPSA-CRISSP), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre 
National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), Faculty of Engineering [Israel], Bar-Ilan University [Israël], European Project: 340113,EC:FP7:ERC,ERC-2013-ADG,VHIA(2014), and European Project: 609465,EC:FP7:ICT,FP7-ICT-2013-10,EARS(2014)
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Acoustics and Ultrasonics, Microphone, Computer science, inter-frame spectral subtraction, 02 engineering and technology, Transfer function, Computer Science - Sound, 030507 speech-language pathology & audiology, 03 medical and health sciences, symbols.namesake, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, 0202 electrical engineering, electronic engineering, information engineering, Computer Science (miscellaneous), direct-path relative transfer function, Electrical and Electronic Engineering, Impulse response, [SPI.ACOU]Engineering Sciences [physics]/Acoustics [physics.class-ph], Short-time Fourier transform, 020206 networking & telecommunications, Acoustic source localization, Computational Mathematics, Noise, Fourier transform, Computer Science::Sound, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], symbols, 0305 other medical science, Algorithm, Binaural recording, binaural source localization
Abstract: This paper addresses the problem of binaural localization of a single speech source in noisy and reverberant environments. For a given binaural microphone setup, the binaural response corresponding to the direct-path propagation of a single source is a function of the source direction. In practice, this response is contaminated by noise and reverberations. The direct-path relative transfer function (DP-RTF) is defined as the ratio between the direct-path acoustic transfer function of the two channels. We propose a method to estimate the DP-RTF from the noisy and reverberant microphone signals in the short-time Fourier transform domain. First, the convolutive transfer function approximation is adopted to accurately represent the impulse response of the sensors in the STFT domain. Second, the DP-RTF is estimated by using the auto- and cross-power spectral densities at each frequency and over multiple frames. In the presence of stationary noise, an inter-frame spectral subtraction algorithm is proposed, which enables to achieve the estimation of noise-free auto- and cross-power spectral densities. Finally, the estimated DP-RTFs are concatenated across frequencies and used as a feature vector for the localization of speech source. Experiments with both simulated and real data show that the proposed localization method performs well, even under severe adverse acoustic conditions, and outperforms state-of-the-art localization methods under most of the acoustic conditions. Comment: 15 pages, 7 figures, 5 tables
Published: 2016
Full Text: View/download PDF

23. Voice Activity Detection Based on Statistical Likelihood Ratio With Adaptive Thresholding

Author: Laurent Girin, Xiaofei Li, Radu Horaud, Sharon Gannot, Interpretation and Modelling of Images and Videos (PERCEPTION ), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Laboratoire Jean Kuntzmann (LJK ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing (GIPSA-CRISSP), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre 
National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), Bar-Ilan University [Israël], and European Project: 609465,EC:FP7:ICT,FP7-ICT-2013-10,EARS(2014)
Subjects: adaptive threshold, 02 engineering and technology, Noise (electronics), Upper and lower bounds, 030507 speech-language pathology & audiology, 03 medical and health sciences, Signal-to-noise ratio, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, Statistics, 0202 electrical engineering, electronic engineering, information engineering, high non-speech hit rate, Mathematics, Voice activity detection, Noise measurement, business.industry, 020206 networking & telecommunications, Pattern recognition, likelihood ratio test, Thresholding, voice activity detection, Computer Science::Sound, Likelihood-ratio test, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], Hit rate, Artificial intelligence, 0305 other medical science, business
Abstract: International audience; The statistical likelihood ratio test is a widely used voice activity detection (VAD) method, in which the likelihood ratio of the current temporal frame is compared with a threshold. A fixed threshold is always used, but this is not suitable for various types of noise. In this paper, an adaptive threshold is proposed as a function of the local statistics of the likelihood ratio. This threshold represents the upper bound of the likelihood ratio for the non-speech frames, whereas it generally remains lower than the likelihood ratio for the speech frames. As a result, a high non-speech hit rate can be achieved while keeping the speech hit rate as high as possible.
Published: 2016
Full Text: View/download PDF

24. A Variational EM Algorithm for the Separation of Time-Varying Convolutive Audio Mixtures

Author: Laurent Girin, Radu Horaud, Xavier Alameda-Pineda, Dionyssos Kounades-Bastian, Sharon Gannot, Interpretation and Modelling of Images and Videos (PERCEPTION ), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Laboratoire Jean Kuntzmann (LJK ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing (GIPSA-CRISSP), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - 
Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), University of Trento [Trento], Faculty of Engineering [Israel], Bar-Ilan University [Israël], European Project: 340113,EC:FP7:ERC,ERC-2013-ADG,VHIA(2014), and European Project: 609465,EC:FP7:ICT,FP7-ICT-2013-10,EARS(2014)
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Acoustics and Ultrasonics, Computer science, Audio source separation, Separation (statistics), 02 engineering and technology, Computer Science - Sound, Matrix decomposition, Moving sources, 030507 speech-language pathology & audiology, 03 medical and health sciences, Matrix (mathematics), Mixing (mathematics), [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], Expectation–maximization algorithm, Variational EM, 0202 electrical engineering, electronic engineering, information engineering, Computer Science (miscellaneous), Electrical and Electronic Engineering, Probabilistic framework, [SPI.ACOU]Engineering Sciences [physics]/Acoustics [physics.class-ph], Stochastic process, Estimator, 020206 networking & telecommunications, Computational Mathematics, Kalman smoother, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], Time-varying mixing filters, 0305 other medical science, Algorithm
Abstract: This paper addresses the problem of separating audio sources from time-varying convolutive mixtures. We propose a probabilistic framework based on the local complex-Gaussian model combined with non-negative matrix factorization. The time-varying mixing filters are modeled by a continuous temporal stochastic process. We present a variational expectation-maximization (VEM) algorithm that employs a Kalman smoother to estimate the time-varying mixing matrix, and that jointly estimates the source parameters. The sound sources are then separated by Wiener filters constructed with the estimators provided by the VEM algorithm. Extensive experiments on simulated data show that the proposed method outperforms a block-wise version of a state-of-the-art baseline method. Comment: 13 pages, 4 figures, 2 tables
Published: 2016
Full Text: View/download PDF

25. An Inverse-Gamma Source Variance Prior with Factorized Parameterization for Audio Source Separation

Author: Laurent Girin, Radu Horaud, Xavier Alameda-Pineda, Dionyssos Kounades-Bastian, Sharon Gannot, Interpretation and Modelling of Images and Videos (PERCEPTION ), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Laboratoire Jean Kuntzmann (LJK ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing (GIPSA-CRISSP), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - 
Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), University of Trento [Trento], Faculty of Engineering [Israel], Bar-Ilan University [Israël], IEEE Signal Processing Society, European Project: 340113,EC:FP7:ERC,ERC-2013-ADG,VHIA(2014), and European Project: 609465,EC:FP7:ICT,FP7-ICT-2013-10,EARS(2014)
Subjects: Underdetermined system, Computer science, 02 engineering and technology, Blind signal separation, Non-negative matrix factorization, PSD model, 030507 speech-language pathology & audiology, 03 medical and health sciences, [STAT.ML]Statistics [stat]/Machine Learning [stat.ML], [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, Expectation–maximization algorithm, 0202 electrical engineering, electronic engineering, information engineering, Source separation, [SPI.ACOU]Engineering Sciences [physics]/Acoustics [physics.class-ph], Audio signal, business.industry, Estimation theory, Spectral density, audio source separation, 020206 networking & telecommunications, Statistical model, Pattern recognition, local Gaussian model, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], Artificial intelligence, Audio modeling, 0305 other medical science, business, Algorithm, Scale parameter
Abstract: International audience; In this paper we present a new statistical model for the power spectral density (PSD) of an audio signal and its application to multichannel audio source separation (MASS). The source signal is modeled with the local Gaussian model (LGM) and we propose to model its variance with an inverse-Gamma distribution, whose scale parameter is factorized as a rank-1 model. We discuss the benefits of this approach and evaluate it in a MASS task with underdetermined convolutive mixtures. To this end, we derive a variational EM algorithm for parameter estimation and source inference. The proposed model shows a benefit in source separation performance compared to a state-of-the-art LGM NMF-based technique.
Published: 2016
Full Text: View/download PDF

26. Deep neural networks for automatic detection of screams and shouted speech in subway trains

Author: Pierre Laffitte, David Sodoyer, Charles Tatkeu, Laurent Girin, Laboratoire Électronique Ondes et Signaux pour les Transports (IFSTTAR/COSYS/LEOST), Institut Français des Sciences et Technologies des Transports, de l'Aménagement et des Réseaux (IFSTTAR)-PRES Université Lille Nord de France, GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing (GIPSA-CRISSP), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab ), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Grenoble Images Parole Signal Automatique (GIPSA-lab ), and Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut Polytechnique de Grenoble - Grenoble Institute of Technology-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])
Subjects: RECONNAISSANCE DE SON, Computer science, Speech recognition, 02 engineering and technology, Task (project management), 030507 speech-language pathology & audiology, 03 medical and health sciences, Deep belief network, DETECTION DE CRIS, 0202 electrical engineering, electronic engineering, information engineering, DETECTION D'INCIDENT, METRO, ENVIRONNEMENT TRANSPORT, Voice activity detection, 020206 networking & telecommunications, RESEAU DE NEURONES, DETECTION D'EVENEMENTS SONORES AUDIO, ENVIRONNEMENT SONORE, TRANSPORT FERROVIAIRE, BRUIT, Deep neural networks, DETECTION D'EVENEMENT SONORE, Train, Noise (video), 0305 other medical science, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing, DEEP BELIEF NETWORKS
Abstract: IEEE ICASSP 2016 - International Conference on Acoustics, Speech and Signal Processing, Shanghai, China, 20/03/2016 - 25/03/2016; International audience; Deep Neural Networks (DNNs) have recently become a popular technique for regression and classification problems. Their capacity to learn high-order correlations between input and output data proves to be very powerful for automatic speech recognition. In this paper we investigate the use of DNNs for automatic scream and shouted speech detection, within the framework of surveillance systems in public transportation. We recorded a database of sounds occurring in subway trains under real operating conditions and used DNNs to classify the sounds into screams, shouts and other categories. We report encouraging results, given the difficulty of the task, especially when a high level of surrounding noise is present.
Published: 2016
Full Text: View/download PDF

27. Visual voice activity detection as a help for speech source separation from convolutive mixtures

Author: Christian Jutten, Bertrand Rivet, Laurent Girin, GIPSA - Machines Parlantes, Agents Communicants & Interaction Face-à-face (GIPSA-MPACIF), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS), GIPSA - Signal Images Physique (GIPSA-SIGMAPHY), and Département Images et Signal (GIPSA-DIS)
Subjects: Linguistics and Language, Computer science, Speech recognition, Speech enhancement, 02 engineering and technology, 01 natural sciences, Signal, Language and Linguistics, Visual speech processing, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, Interference (communication), 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Source separation, 010301 acoustics, Convolutive mixtures, Signal processing, Voice activity detection, Communication, Detector, Highly non-stationary environment, 020206 networking & telecommunications, Speech processing, Computer Science Applications, Speech source separation, Modeling and Simulation, Physical Sciences, Computer Vision and Pattern Recognition, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing, Software
Abstract: International audience; Audio–visual speech source separation consists in combining visual speech processing techniques (e.g., lip parameter tracking) with source separation methods to improve the extraction of a speech source of interest from a mixture of acoustic signals. In this paper, we present a new approach that combines visual information with separation methods based on the sparseness of speech: visual information is used as a voice activity detector (VAD), which is combined with a new geometric method of separation. The proposed audio–visual method is shown to be efficient at extracting a real spontaneous speech utterance in the difficult case of convolutive mixtures, even if the competing sources are highly non-stationary. Typical gains of 18–20 dB in signal-to-interference ratio are obtained for a wide range of (2 × 2) and (3 × 3) mixtures. Moreover, the overall process is computationally simpler than previously proposed audio–visual separation schemes.
Published: 2007
Full Text: View/download PDF

28. A variational EM algorithm for the separation of moving sound sources

Author: Laurent Girin, Dionyssos Kounades-Bastian, Sharon Gannot, Xavier Alameda-Pineda, Radu Horaud, Interpretation and Modelling of Images and Videos (PERCEPTION), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire Jean Kuntzmann (LJK), Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS), GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing (GIPSA-CRISSP), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph 
Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS), University of Trento [Trento], Faculty of Engineering [Israel], Bar-Ilan University [Israël], IEEE Signal Processing Society, European Project: 340113,EC:FP7:ERC,ERC-2013-ADG,VHIA(2014), European Project: 609465,EC:FP7:ICT,FP7-ICT-2013-10,EARS(2014), Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Grenoble Images Parole Signal Automatique (GIPSA-lab), and Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)
Subjects: Mathematical optimization, Computer science, 02 engineering and technology, variational EM, Matrix decomposition, 030507 speech-language pathology & audiology, 03 medical and health sciences, [STAT.ML]Statistics [stat]/Machine Learning [stat.ML], moving sources, Expectation–maximization algorithm, 0202 electrical engineering, electronic engineering, information engineering, Source separation, [SPI.ACOU]Engineering Sciences [physics]/Acoustics [physics.class-ph], Probabilistic logic, Estimator, 020206 networking & telecommunications, Kalman filter, Audio-source separation, Complex normal distribution, Kalman smoother, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], Algorithm design, 0305 other medical science, time-varying mixing filters, Algorithm, [MATH.MATH-NA]Mathematics [math]/Numerical Analysis [math.NA]
Abstract: International audience; This paper addresses the problem of separating moving sound sources. We propose a probabilistic framework based on the complex Gaussian model combined with non-negative matrix factorization. The properties associated with moving sources are modeled using time-varying mixing filters described by a stochastic temporal process. We present a variational expectation-maximization (VEM) algorithm that employs a Kalman smoother to estimate the mixing filters. The sound sources are separated by means of Wiener filters, built from the estimators provided by the proposed VEM algorithm. Preliminary experiments with simulated data show that, while for static sources we obtain results comparable with the baseline method of Ozerov et al., in the case of moving sources our method outperforms a piecewise version of the baseline method.
Published: 2015
Full Text: View/download PDF

29. Real-time Control of a DNN-based Articulatory Synthesizer for Silent Speech Conversion: a pilot study

Author: Thomas Hueber, Christophe Savariaux, Laurent Girin, Blaise Yvert, Florent Bocquelet, GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing (GIPSA-CRISSP), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS), GIPSA-Services (GIPSA-Services), Institut de Neurosciences cognitives et intégratives d'Aquitaine (INCIA), and Université Bordeaux Segalen - Bordeaux 2-Université Sciences et Technologies - Bordeaux 1-SFR Bordeaux Neurosciences-Centre National de la Recherche Scientifique (CNRS)
Subjects: Computer science, articulatory speech synthesis, Speech recognition, 020206 networking & telecommunications, 02 engineering and technology, 030507 speech-language pathology & audiology, 03 medical and health sciences, [STAT.ML]Statistics [stat]/Machine Learning [stat.ML], deep neural networks, Real-time Control System, EMA, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], 0202 electrical engineering, electronic engineering, information engineering, 0305 other medical science, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing, silent speech
Abstract: International audience; This article presents a pilot study on the real-time control of an articulatory synthesizer based on a deep neural network (DNN), in the context of a silent speech interface. The underlying hypothesis is that a silent speaker could benefit from real-time audio feedback to regulate his/her own production. In this study, we use 3D electromagnetic articulography (EMA) to capture speech articulation, a DNN to convert EMA to spectral trajectories in real time, and a standard vocoder excited by white noise for audio synthesis. As shown by recent literature on silent speech, adaptation of the articulo-acoustic modeling process is needed to account for possible inconsistencies between the initial training phase and practical usage conditions. In this study, we focus on different sensor setups across sessions (for the same speaker). Model adaptation is performed by cascading another neural network to the DNN used for articulatory-to-acoustic mapping. The intelligibility of the synthetic speech signal converted in real time is evaluated using both objective and perceptual measurements.
Published: 2015

30. Robust Articulatory Speech Synthesis using Deep Neural Networks for BCI Applications

Author: Laurent Girin, Pierre Badin, Blaise Yvert, Thomas Hueber, Florent Bocquelet, GIPSA - Machines parlantes, Gestes oro-faciaux, Interaction Face-à-face, Communication augmentée (GIPSA-MAGIC), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS), Clinatec - Centre de recherche biomédicale Edmond J.Safra (SCLIN), Commissariat à l'énergie atomique et aux énergies alternatives - Laboratoire d'Electronique et de Technologie de l'Information (CEA-LETI), Direction de Recherche Technologique (CEA) (DRT (CEA)), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Direction de Recherche Technologique (CEA) (DRT (CEA)), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Commissariat à l'énergie atomique et aux énergies 
alternatives (CEA)-Centre Hospitalier Universitaire [Grenoble] (CHU)-Institut National de la Santé et de la Recherche Médicale (INSERM)-Université Grenoble Alpes (UGA), Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Université Joseph Fourier - Grenoble 1 (UJF)-Centre Hospitalier Universitaire [Grenoble] (CHU)-Institut National de la Santé et de la Recherche Médicale (INSERM), and Badin, Pierre
Subjects: Computer science, articulatory speech synthesis, Speech recognition, [SHS.INFO]Humanities and Social Sciences/Library and information sciences, Speech synthesis, 02 engineering and technology, Intelligibility (communication), computer.software_genre, [SHS.INFO] Humanities and Social Sciences/Library and information sciences, noise robustness, brain computer interface (BCI), EMA, deep auto-encoder, 0202 electrical engineering, electronic engineering, information engineering, 0501 psychology and cognitive sciences, 050107 human factors, Brain–computer interface, dimensionality reduction, Voice activity detection, Artificial neural network, 05 social sciences, 020206 networking & telecommunications, Mixture model, deep neural networks, Deep neural networks, computer
Abstract: Brain-Computer Interfaces (BCIs) usually propose typing strategies to restore communication for paralyzed and aphasic people. A more natural way would be to use a speech BCI directly controlling a speech synthesizer. Toward this goal, a prerequisite is the development of a synthesizer that should i) produce intelligible speech, ii) run in real time, iii) depend on as few parameters as possible, and iv) be robust to error fluctuations on the control parameters. In this context, we describe here an articulatory-to-acoustic mapping approach based on a deep neural network (DNN) trained on electromagnetic articulography (EMA) data recorded synchronously with produced speech sounds. On this corpus, the DNN-based model provided a speech synthesis quality (as assessed by automatic speech recognition and behavioral testing) comparable to a state-of-the-art Gaussian mixture model (GMM), yet showed higher robustness when noise was added to the EMA coordinates. Moreover, to envision BCI applications, this robustness was also assessed when the space covered by the 12 original articulatory parameters was reduced to 7 parameters using deep auto-encoders (DAE). Given that this method can be implemented in real time, DNN-based articulatory speech synthesis seems a good candidate for speech BCI applications. Index Terms: articulatory speech synthesis, brain computer interface (BCI), deep neural networks, deep auto-encoder, EMA, noise robustness, dimensionality reduction
Published: 2014

31. Informed Source Separation from compressed mixtures using spatial wiener filter and quantization noise estimation

Author: Laurent Girin, Antoine Liutkus, Shuhua Zhang, GIPSA - Machines parlantes, Gestes oro-faciaux, Interaction Face-à-face, Communication augmentée (GIPSA-MAGIC), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS), Laboratoire Traitement et Communication de l'Information (LTCI), Télécom ParisTech-Institut Mines-Télécom [Paris] (IMT)-Centre National de la Recherche Scientifique (CNRS), and ANR DReaM
Subjects: Work (thermodynamics), Computer science, Speech recognition, 02 engineering and technology, Data_CODINGANDINFORMATIONTHEORY, De-noising, 030507 speech-language pathology & audiology, 03 medical and health sciences, symbols.namesake, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, 0202 electrical engineering, electronic engineering, information engineering, Source separation, Bitstream, AAC, Wiener Filter, Quantization (signal processing), Wiener filter, Short-time Fourier transform, Process (computing), Wiener deconvolution, 020206 networking & telecommunications, Uncompressed video, NTF, Informed Source Separation, symbols, 0305 other medical science, Algorithm, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing
Abstract: International audience; In a previous work, we proposed an Informed Source Separation system based on Wiener filtering for active listening of music from uncompressed (16-bit PCM) multichannel mix signals. In the present work, the system is improved to work with (MPEG-2 AAC) compressed mix signals: quantization noise is estimated from the AAC bitstream at the decoder and explicitly taken into account in the source separation process. Also, a direct MDCT-to-STFT transform is used to optimize the computational efficiency of the process in the STFT domain from AAC-decoded MDCT coefficients.
Published: 2013
Full Text: View/download PDF

32. Informed source separation through spectrogram coding and data embedding

Author: Laurent Girin, Jonathan Pinel, Antoine Liutkus, Gael Richard, Roland Badeau, Laboratoire Traitement et Communication de l'Information (LTCI), Télécom ParisTech-Institut Mines-Télécom [Paris] (IMT)-Centre National de la Recherche Scientifique (CNRS), GIPSA - Communication Information and Complex Systems (GIPSA-CICS), Département Images et Signal (GIPSA-DIS), Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS), GIPSA - Machines parlantes, Gestes oro-faciaux, Interaction Face-à-face, Communication augmentée (GIPSA-MAGIC), Département Parole et Cognition (GIPSA-DPC), Département Traitement du Signal et des Images (TSI), and Télécom ParisTech-Centre National de la Recherche Scientifique (CNRS)
Subjects: Theoretical computer science, Audio source separation, Data embedding, 02 engineering and technology, 030507 speech-language pathology & audiology, 03 medical and health sciences, symbols.namesake, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, 0202 electrical engineering, electronic engineering, information engineering, Source separation, Electrical and Electronic Engineering, Wiener filtering, Gaussian process, Mathematics, Wiener filter, 020206 networking & telecommunications, NTF, Control and Systems Engineering, Signal Processing, symbols, Spectrogram, Embedding, Computer Vision and Pattern Recognition, 0305 other medical science, Algorithm, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing, Software, Decoding methods, Linear filter, Image compression
Abstract: International audience; We address the issue of underdetermined source separation in a particular informed configuration where both the sources and the mixtures are known during a so-called encoding stage. This knowledge enables the computation of side-information that is small enough to be inaudibly embedded into the mixtures. At the decoding stage, the sources are no longer assumed to be known; only the mixtures and the extracted side-information are processed for source separation. The proposed system models the sources as independent and locally stationary Gaussian processes (GP) and the mixing process as a linear filtering. This model allows reliable estimation of the sources through generalized Wiener filtering, provided their spectrograms are known. As these spectrograms are too large to be embedded in the mixtures, we show how they can be efficiently approximated using either Nonnegative Tensor Factorization (NTF) or image compression. A high-capacity embedding method is used by the system to inaudibly embed the separation side-information into the mixtures. This method is an application of the Quantization Index Modulation technique to the time-frequency coefficients of the mixtures and achieves embedding rates of about 250 kbps. Finally, a study of the performance of the full system is presented.
Published: 2012
Full Text: View/download PDF
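
Note on entry 32: the abstract names Quantization Index Modulation (QIM) as the embedding technique. The following is only a minimal scalar sketch of binary QIM, not the authors' system: the function names `qim_embed` and `qim_extract` and the step size `delta` are illustrative assumptions, and the paper's choice of time-frequency coefficients and its inaudibility constraints are not modeled here.

```python
def qim_embed(x, bit, delta):
    """Embed one bit in the coefficient x (illustrative helper).

    Bit 0 uses the lattice {k * delta}; bit 1 uses the same lattice
    shifted by delta / 2. x is moved to the nearest point of the
    lattice selected by the bit, so the change is at most delta / 2.
    """
    offset = bit * delta / 2.0
    return round((x - offset) / delta) * delta + offset


def qim_extract(y, delta):
    """Recover the bit by finding which lattice lies closest to y."""
    d0 = abs(y - qim_embed(y, 0, delta))
    d1 = abs(y - qim_embed(y, 1, delta))
    return 0 if d0 <= d1 else 1


if __name__ == "__main__":
    delta = 1.0
    marked = qim_embed(3.37, 1, delta)
    # The bit survives a perturbation smaller than delta / 4.
    print(marked, qim_extract(marked + 0.2, delta))
```

A larger `delta` tolerates more distortion at the decoder but changes the host coefficient more, which is the basic trade-off any such embedding has to balance against audibility.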

33. Informed source separation of linear instantaneous under-determined audio mixtures by source index embedding

Author: Laurent Girin, Mathieu Parvaix, GIPSA - Machines parlantes, Gestes oro-faciaux, Interaction Face-à-face, Communication augmentée (GIPSA-MAGIC), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS), ANR, ANR DReaM, and ANR-09-CORD-0006,DReaM,Le Disque Repensé pour l'Écoute Active de la Musique(2009)
Subjects: Source code, Acoustics and Ultrasonics, Computer science, media_common.quotation_subject, Speech recognition, 02 engineering and technology, computer.software_genre, Signal, Blind signal separation, remixing, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, 0202 electrical engineering, electronic engineering, information engineering, Source separation, Electrical and Electronic Engineering, Audio signal processing, Digital watermarking, media_common, watermarking, 020206 networking & telecommunications, Watermark, audio processing, Time–frequency analysis, 020201 artificial intelligence & image processing, under-determined source separation, computer, Algorithm, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing
Abstract: International audience; In this paper, we address the issue of underdetermined source separation of I non-stationary audio sources from a J-channel linear instantaneous mixture (J < I). This problem is addressed with a specific coder-decoder configuration. At the coder, source signals are assumed to be available before the mixing is processed. A time-frequency (TF) joint analysis of each source signal and mixture signal makes it possible to select the subset of sources (among I) leading to the best separation results in each TF region. A corresponding source(s) index code is imperceptibly embedded into the mix signal using a watermarking technique. At the decoder, where the original source signals are unknown, the extraction of the watermark makes it possible to invert the mixture in each TF region to recover the source signals. With such an informed approach, it is shown that 5 instruments and singing voice signals can be efficiently separated from 2-channel stereo mixtures, with a quality that significantly exceeds the quality obtained by a semi-blind reference method and enables separate manipulation of the source signals during stereo music restitution (i.e. remixing).
Published: 2011
Full Text: View/download PDF

34. Interactive Music with Active Audio CDs

Author: Laurent Girin, Boris Mansencal, Sylvain Marchand, Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB), GIPSA - Machines parlantes, Gestes oro-faciaux, Interaction Face-à-face, Communication augmentée (GIPSA-MAGIC), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS), and Marchand, Sylvain
Subjects: Audio signal, Multimedia, Computer science, Magic (programming), Compact disc, 020206 networking & telecommunications, 02 engineering and technology, computer.software_genre, Backward compatibility, [INFO.INFO-SD] Computer Science [cs]/Sound [cs.SD], Acoustic space, 030507 speech-language pathology & audiology, 03 medical and health sciences, High fidelity, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], 0202 electrical engineering, electronic engineering, information engineering, Source separation, 0305 other medical science, Digital watermarking, computer, ComputingMilieux_MISCELLANEOUS
Abstract: With a standard compact disc (CD) audio player, the only possibility for the user is to listen to the recorded track, passively: the interaction is limited to changing the global volume or the track. Imagine now that the listener can turn into a musician, playing with the sound sources present in the stereo mix, changing their respective volumes and locations in space. For example, a given instrument or voice can be either muted, amplified, or more generally moved in the acoustic space. This will be a kind of generalized karaoke, useful for disc jockeys and also for music pedagogy (when practicing an instrument). Our system shows that this dream has come true, with active CDs fully backward compatible while enabling interactive music. The magic is that "the music is in the sound": the structure of the mix is embedded in the sound signal itself, using audio watermarking techniques, and the embedded information is exploited by the player to perform the separation of the sources (patent pending) used in turn by a spatializer.
Published: 2011

35. Informed source separation of underdetermined instantaneous stereo mixtures using source index embedding

Author: Mathieu Parvaix, Laurent Girin, GIPSA - Machines parlantes, Gestes oro-faciaux, Interaction Face-à-face, Communication augmentée (GIPSA-MAGIC), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS), and Parvaix, Mathieu
Subjects: Underdetermined system, [INFO.INFO-TS] Computer Science [cs]/Signal and Image Processing, Computer science, Speech recognition, Data_CODINGANDINFORMATIONTHEORY, 02 engineering and technology, computer.software_genre, Blind signal separation, 030507 speech-language pathology & audiology, 03 medical and health sciences, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, 0202 electrical engineering, electronic engineering, information engineering, Source separation, Audio signal processing, Digital watermarking, [SPI.SIGNAL] Engineering Sciences [physics]/Signal and Image processing, speech processing, remastering, Quantization (signal processing), watermarking, 020206 networking & telecommunications, Watermark, Speech processing, audio processing, Time–frequency analysis, Embedding, under-determined source separation, 0305 other medical science, computer, Algorithm, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing
Abstract: International audience; In this paper, we address the issue of under-determined source separation of non-stationary audio sources from a stereo (i.e. 2-channel) linear instantaneous mixture. This problem is addressed with a specific coder-decoder configuration. At the coder, source signals are assumed to be available before the mixing is processed. A time-frequency (TF) analysis of each source makes it possible to select the one or two predominant sources (among I>2) in each TF region, and a corresponding source(s) index code is imperceptibly embedded into the mix signals using a watermarking technique. At the decoder level, where the original source signals are unknown, the extraction of the watermark makes it possible to locally reduce the under-determined configuration to an (over)determined configuration. Source signals can then be estimated using a classical (over)determined separation technique. Thereby several instruments or voice signals can be separated from stereo mixtures, enabling separate manipulation of the source signals during restitution (i.e. remastering).
Published: 2010

36. A watermarking-based method for single-channel audio source separation

Author: Mathieu Parvaix, Laurent Girin, Jean-Marc Brossier, GIPSA - Machines Parlantes, Agents Communicants & Interaction Face-à-face (GIPSA-MPACIF), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS), GIPSA - Communication, Signal et Sécurité (GIPSA-C2S), Département Images et Signal (GIPSA-DIS), and Girin, Laurent
Subjects: Computer science, [INFO.INFO-TS] Computer Science [cs]/Signal and Image Processing, Speech recognition, 02 engineering and technology, computer.software_genre, Signal, Blind signal separation, 030507 speech-language pathology & audiology, 03 medical and health sciences, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, 0202 electrical engineering, electronic engineering, information engineering, Source separation, Computer vision, Audio signal processing, Digital watermarking, [SPI.SIGNAL] Engineering Sciences [physics]/Signal and Image processing, speech processing, business.industry, watermarking, 020206 networking & telecommunications, Watermark, Speech processing, audio processing, source separation, Artificial intelligence, 0305 other medical science, business, computer, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing, Communication channel
Abstract: International audience; In this paper, we address the issue of audio source separation with a single channel, i.e. the estimation of source signals from a single mixture of these signals. This problem is addressed with a specific configuration: source signals are assumed to be available before the mix is processed. We propose an original method that uses a watermarking technique to embed information about the source signals into the mix signal. Extracting this watermark enables an end-user who has no access to the original sources to separate these signals from their mixture. Thereby several instruments or voice signals can be segregated from a single piece of music to enable post-mixing processing such as volume control.
Published: 2009

37. A study of lip movements during spontaneous dialog and its application to voice activity detection

Author: Christian Jutten, Laurent Girin, Bertrand Rivet, David Sodoyer, Jean-Luc Schwartz, Christophe Savariaux, Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS), GIPSA - Machines Parlantes, Agents Communicants & Interaction Face-à-face (GIPSA-MPACIF), Département Parole et Cognition (GIPSA-DPC), Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS), GIPSA-Services (GIPSA-Services), GIPSA - Parole, Multimodalité, Développement (GIPSA-PMD), GIPSA - Signal Images Physique (GIPSA-SIGMAPHY), and Département Images et Signal (GIPSA-DIS)
Subjects: Male, Speech perception, business.product_category, Signal Detection, Psychological, Sound Spectrography, Acoustics and Ultrasonics, Microphone, Computer science, Acoustics, Speech recognition, Movement, Lipreading, Video Recording, 02 engineering and technology, 01 natural sciences, Pattern Recognition, Automated, activity detection, Arts and Humanities (miscellaneous), [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, audiovisual speech, PACS numbers: 43.72.Ar, 43.72.Kb, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Source separation, Humans, Dialog box, 010301 acoustics, Headphones, Voice activity detection, lip movements, 020206 networking & telecommunications, Speech processing, Lip, Speaker diarisation, Silence, Noise, Pattern Recognition, Physiological, source separation, Speech Perception, Visual Perception, Voice, Cues, business, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing, Algorithms
Abstract: This paper presents a quantitative and comprehensive study of the lip movements of a given speaker in different speech/non-speech contexts, with a particular focus on silences (i.e., when no sound is produced by the speaker). The aim is to characterize the relationship between “lip activity” and “speech activity”, and then to use visual speech information as a Voice Activity Detector (VAD). To this aim, an original audio-visual corpus was recorded with two speakers involved in a face-to-face spontaneous dialog, although being in separate rooms. Each speaker communicated with the other using a microphone, a camera, a screen, and headphones. This system was used to capture separate audio stimuli for each speaker and to monitor each speaker’s lip movements in synchrony with the recorded sound. A comprehensive analysis was carried out on the lip shapes and lip movements corresponding to either silence sections or non-silence sections (i.e., speech + non-speech audible events). A single visual parameter, defined to characterize the lip movements, was shown to be efficient for the detection of silence sections. This results in a Visual VAD (V-VAD) that can be used in any kind of environment noise, including intricate and highly non-stationary noises, e.g., multiple and/or moving noise sources or competing speech signals.
Published: 2009
Full Text: View/download PDF

38. Estimation of the Voicing Cut-Off Frequency Contour of Natural Speech Based on Harmonic and Aperiodic Energies

Author: Kris Hermus, Laurent Girin, H. Van Hamme, S. Irhimeh, Department of Electrical Engineering [Leuven] (ESAT), Catholic University of Leuven - Katholieke Universiteit Leuven (KU Leuven), GIPSA - Machines Parlantes, Agents Communicants & Interaction Face-à-face (GIPSA-MPACIF), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS), and Girin, Laurent
Subjects: [INFO.INFO-TS] Computer Science [cs]/Signal and Image Processing, Computer science, Speech recognition, speech coding, Speech coding, Speech synthesis, PSI_SPEECH, 02 engineering and technology, computer.software_genre, Voltage-controlled oscillator, speech synthesis, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, 0502 economics and business, 0202 electrical engineering, electronic engineering, information engineering, [SPI.SIGNAL] Engineering Sciences [physics]/Signal and Image processing, speech processing, 05 social sciences, Spectral density, 020206 networking & telecommunications, Speech processing, Cutoff frequency, spectral analysis, Aperiodic graph, Computer Science::Sound, Harmonic, Voice, Speech analysis, Algorithm, computer, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing, 050203 business & management, Smoothing
Abstract: We present a new algorithm for the automatic estimation of the voicing cut-off frequency (VCO), i.e., the frequency that separates the periodic low-frequency part from the aperiodic high-frequency part in voiced segments of natural speech. Starting from the power spectrum of a two pitch period speech frame, we define the VCO to be located at the frequency for which the sum of the periodic and aperiodic energy in the spectral band below and above that frequency respectively, is maximised. By formulating the problem in terms of a score function we are able to apply a dynamic programming based smoothing technique. Remarkably smooth and accurate VCO contours were obtained, despite the simplicity of the proposed algorithm. In a formal evaluation the algorithm compares favourably to two existing VCO estimation techniques. ©2008 IEEE. Hermus K., Girin L., Van hamme H., Irhimeh S., ''Estimation of the voicing cut-off frequency contour of natural speech based on harmonic and aperiodic energies'', Proceedings IEEE international conference on acoustics, speech, and signal processing - ICASSP’2008, pp. 4473-4476, March 30 - April 4, 2008, Las Vegas, Nevada, USA.
Published: 2008
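
Note on entry 38: the abstract defines the VCO as the frequency that maximises the periodic energy below it plus the aperiodic energy above it. The sketch below illustrates that criterion for a single frame only. It assumes the per-bin periodic and aperiodic energies are already available as two lists (how the paper obtains them is not stated in the abstract), the name `vco_bin` is hypothetical, and the dynamic-programming smoothing across frames is left out.

```python
from itertools import accumulate


def vco_bin(periodic, aperiodic):
    """Return the cut-off bin index c that maximises
    sum(periodic[:c]) + sum(aperiodic[c:]).

    periodic[k] and aperiodic[k] are assumed per-bin energies of one
    frame; c ranges from 0 (all aperiodic) to len(periodic) (all periodic).
    """
    if len(periodic) != len(aperiodic):
        raise ValueError("energy lists must have the same length")
    n = len(periodic)
    # Prefix sums: p_below[c] = sum(periodic[:c]), a_below[c] = sum(aperiodic[:c])
    p_below = [0.0] + list(accumulate(periodic))
    a_below = [0.0] + list(accumulate(aperiodic))
    a_total = a_below[-1]
    scores = [p_below[c] + (a_total - a_below[c]) for c in range(n + 1)]
    # On ties, max() keeps the lowest candidate index.
    return max(range(n + 1), key=scores.__getitem__)


if __name__ == "__main__":
    # Toy frame: periodic energy dominates the low bins, aperiodic the high bins.
    print(vco_bin([5, 4, 3, 1, 1, 0], [0, 0, 1, 2, 3, 4]))
```

The returned value is a bin index; converting it to hertz depends on the sampling rate and transform length, which the abstract does not specify.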

39. Long-term flexible 2D cepstral modeling of speech spectral amplitudes

Author: Mohammad Firouzmand, Laurent Girin, Girin, Laurent, GIPSA - Machines Parlantes, Agents Communicants & Interaction Face-à-face (GIPSA-MPACIF), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Grenoble Images Parole Signal Automatique (GIPSA-lab), and Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)
Subjects: Masking (art), [INFO.INFO-TS] Computer Science [cs]/Signal and Image Processing, Computer science, speech coding, Speech recognition, Speech coding, Speech synthesis, 02 engineering and technology, computer.software_genre, 030507 speech-language pathology & audiology, 03 medical and health sciences, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, speech synthesis, Cepstrum, 0202 electrical engineering, electronic engineering, information engineering, Discrete cosine transform, Envelope (mathematics), speech processing, [SPI.SIGNAL] Engineering Sciences [physics]/Signal and Image processing, speech analysis, 020206 networking & telecommunications, Speech processing, speech modeling, Amplitude, Spectral envelope, 0305 other medical science, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing, computer, Algorithm
Abstract: International audience; This paper presents a method for modeling the envelope of spectral amplitude parameters of speech signals in "two dimensions" (2D). It consists of two cascaded modelings: the first one along the frequency axis is the usual cepstrum technique, which consists of modeling the log-scaled spectral envelope with a Discrete Cosine Model (DCM). The second one, along the time axis, consists of modeling the trajectory of the envelope DCM coefficients by another similar DCM model. An iterative algorithm is proposed to optimally fit this 2D-model to the data according to a perceptual criterion based on frequency masking. This approach is shown to provide an efficient and flexible representation of spectral amplitude parameters in terms of coefficient rates, while providing good signal quality, opening new perspectives in very-low bit-rate sinusoidal speech coding.
Published: 2008
Full Text: View/download PDF

40. Using a Visual Voice Activity Detector to Regularize the Permutations in Blind Separation of Convolutive Speech Mixtures

Author: Bertrand Rivet, Christian Jutten, Laurent Girin, C. Serviere, Dinh-Tuan Pham, GIPSA - Machines Parlantes, Agents Communicants & Interaction Face-à-face (GIPSA-MPACIF), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS), GIPSA - Signal et Automatique pour le Diagnostic et la Surveillance (GIPSA-SA-IGA), Département Images et Signal (GIPSA-DIS), Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology 
(Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Département Automatique (GIPSA-DA), Statistique et Modélisation Stochastique (SMS), Laboratoire Jean Kuntzmann (LJK), Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS), and GIPSA - Signal Images Physique (GIPSA-SIGMAPHY)
Subjects: Computer science, Speech recognition, Detector, 020206 networking & telecommunications, 02 engineering and technology, convolutive mixture, Speech processing, Tracking (particle physics), audio video, Signal, Blind signal separation, 030507 speech-language pathology & audiology, 03 medical and health sciences, Permutation, Mixing (mathematics), visual speech activity detector, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, Computer Science::Sound, source separation, 0202 electrical engineering, electronic engineering, information engineering, Source separation, 0305 other medical science, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing
Abstract: International audience; Audio-visual speech source separation consists in mixing visual speech processing techniques (e.g. lip parameters tracking) with source separation methods to improve and/or simplify the extraction of a speech signal from a mixture of acoustic signals. In this paper, we present a new approach to this problem: visual information is used here as a voice activity detector (VAD). Results show that, in the difficult case of realistic convolutive mixtures, the classic problem of the permutation of the output frequency channels can be solved using the visual information with a simpler processing than when using only audio information.
Published: 2007
Full Text: View/download PDF

41. Long-term quantization of speech LSF parameters

Author: Laurent Girin, GIPSA - Machines Parlantes, Agents Communicants & Interaction Face-à-face (GIPSA-MPACIF), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab), Centre National de la Recherche Scientifique (CNRS)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Université Joseph Fourier - Grenoble 1 (UJF)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Centre National de la Recherche Scientifique (CNRS)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Université Joseph Fourier - Grenoble 1 (UJF)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Grenoble Images Parole Signal Automatique (GIPSA-lab), Centre National de la Recherche Scientifique (CNRS)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Université Joseph Fourier - Grenoble 1 (UJF)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Centre National de la Recherche Scientifique (CNRS)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Université Joseph Fourier - Grenoble 1 (UJF)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3, IEEE, Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Stendhal - Grenoble 3-Université Pierre Mendès 
France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS), and Girin, Laurent
Subjects: LSF quantization, [INFO.INFO-TS] Computer Science [cs]/Signal and Image Processing, Computer science, Quantization (signal processing), Speech recognition, Speech coding, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 020206 networking & telecommunications, 02 engineering and technology, Linear predictive coding, 030507 speech-language pathology & audiology, 03 medical and health sciences, Quantization (physics), [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, long-term model, 0202 electrical engineering, electronic engineering, information engineering, Very/ultra low bit-rate speech coding, 0305 other medical science, LPC coder, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing, [SPI.SIGNAL] Engineering Sciences [physics]/Signal and Image processing
Abstract: International audience; This paper addresses the problem of coding the LSF parameters of LPC speech coders on a "long-term" basis, i.e. beyond the usual ~20 ms frame duration. The objective is to provide efficient LSF quantization for a speech coder with very large delay but very- to ultra-low bit-rate and good quality. To do this, a long-term model of the time-trajectory of the LSF vectors is applied on long segments of speech to capture the inter-frame correlation of the vectors over each whole segment. Using this model, it is shown that only a reduced set of LSF vectors need to be quantized to derive quantized LSF vectors at every original location. Experiments show that large gains in bit-rate over usual frame-by-frame quantization can be achieved (up to more than 50%) while preserving signal quality.
Published: 2007

42. Log-Rayleigh Distribution: A Simple and Efficient Statistical Representation of Log-Spectral Coefficients

Author: Laurent Girin, Christian Jutten, Bertrand Rivet, GIPSA - Machines Parlantes, Agents Communicants & Interaction Face-à-face (GIPSA-MPACIF), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Stendhal - Grenoble 3-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS), GIPSA - Signal Images Physique (GIPSA-SIGMAPHY), Département Images et Signal (GIPSA-DIS), Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique 
(CNRS)-Grenoble Images Parole Signal Automatique (GIPSA-lab), and Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)
Subjects: Acoustics and Ultrasonics, statistical modeling, Gaussian, Probability density function, 02 engineering and technology, 01 natural sciences, Discrete Fourier transform, 010104 statistics & probability, symbols.namesake, Discrete Fourier transform (DFT), [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, 0202 electrical engineering, electronic engineering, information engineering, Rayleigh distribution, 0101 mathematics, Electrical and Electronic Engineering, Gaussian process, Mathematics, speech processing, Mathematical analysis, Gaussian complex random variable, 020206 networking & telecommunications, Statistical model, symbols, Probability distribution, Random variable, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing
Abstract: International audience; In this paper, we study the distribution of the log-modulus of a Gaussian complex random variable. In the circular case, it is a Log-Rayleigh (LR) variable, whose probability distribution function (pdf) depends on only one parameter. In the noncircular case, the pdf is more complicated, although we show that it can be adequately modeled by an LR pdf, for which the optimal fitting parameter is derived. These results can be used in any application using the log-modulus of discrete Fourier transform coefficients, e.g., for speech/audio signals, and suggest that a mixture of LR pdf kernels is preferable to more classical models such as mixtures of Gaussian kernels, which are more costly and less efficient.
Published: 2007
Full Text: View/download PDF

43. Perceptual long-term variable-rate sinusoidal modeling of speech

Author: Laurent Girin, Sylvain Marchand, Mohammad Firouzmand, GIPSA - Machines Parlantes, Agents Communicants & Interaction Face-à-face (GIPSA-MPACIF), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS), Laboratoire Bordelais de Recherche en Informatique (LaBRI), and Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)
Subjects: Perceptual models, Acoustics and Ultrasonics, Iterative method, Computer science, Speech recognition, Speech coding, Speech synthesis, Harmonic (mathematics), 02 engineering and technology, computer.software_genre, sinusoidal model, 01 natural sciences, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Discrete cosine transform, Electrical and Electronic Engineering, 010301 acoustics, speech processing, variable rate, 020206 networking & telecommunications, Sinusoidal model, Speech processing, speech modeling, Computer Science::Sound, computer, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing, Data compression
Abstract: International audience; In this paper, the problem of modeling the time-trajectory of the sinusoidal components of voiced speech signals is addressed. A new global approach is presented: a single so-called Long-Term (LT) model, based on discrete cosine functions, is used to model the overall trajectories of amplitude and phase parameters, for each entire voiced section of speech, differing from usual (Short-Term) models defined on a frame-by-frame basis. The complete analysis-modeling-synthesis process is presented, including an iterative algorithm for optimal fitting between LT model and measures. A major issue of this paper concerns the use of perceptual criteria in the LT model fitting process (both for amplitude and phase modeling). The adaptation of perceptual criteria usually defined in the Short-Term and/or stationary cases to the Long-Term processing is proposed. Experiments dealing with the ten first harmonics of voiced signals show that the proposed approach provides an efficient variable-rate representation of voiced speech signals. Promising results are given in terms of modeling accuracy, synthesis quality, and data compression. The interest of the presented approach for speech coding and speech watermarking is discussed.
Published: 2007
Full Text: View/download PDF

44. Mixing Audiovisual Speech Processing and Blind Source Separation for the Extraction of Speech Signals From Convolutive Mixtures

Author: Laurent Girin, Bertrand Rivet, C. Jutten, GIPSA - Machines Parlantes, Agents Communicants & Interaction Face-à-face (GIPSA-MPACIF), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Stendhal - Grenoble 3-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS), GIPSA - Signal Images Physique (GIPSA-SIGMAPHY), and Département Images et Signal (GIPSA-DIS)
Subjects: Acoustics and Ultrasonics, Computer science, statistical modeling, Speech recognition, Speech coding, Audiovisual coherence, 02 engineering and technology, convolutive mixture, computer.software_genre, Blind signal separation, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, Frequency separation, blind source separation, 0202 electrical engineering, electronic engineering, information engineering, Electrical and Electronic Engineering, Audio signal processing, Signal processing, 020206 networking & telecommunications, Speech processing, Speech enhancement, Computer Science::Sound, Frequency domain, 020201 artificial intelligence & image processing, speech enhancement, computer, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing
Abstract: International audience; Looking at the speaker's face can be useful to better hear a speech signal in a noisy environment and extract it from competing sources before identification. This suggests that the visual signals of speech (movements of visible articulators) could be used in speech enhancement or extraction systems. In this paper, we present a novel algorithm plugging audiovisual coherence of speech signals, estimated by statistical tools, on audio blind source separation (BSS) techniques. This algorithm is applied to the difficult and realistic case of convolutive mixtures. The algorithm mainly works in the frequency (transform) domain, where the convolutive mixture becomes an additive mixture for each frequency channel. Frequency by frequency separation is made by an audio BSS algorithm. The audio and visual information is modeled by a newly proposed statistical model. This model is then used to solve the standard source permutation and scale factor ambiguities encountered for each frequency after the audio blind separation stage. The proposed method is shown to be efficient in the case of 2 × 2 convolutive mixtures and offers promising perspectives for extracting a particular speech source of interest from complex mixtures.
Published: 2007
Full Text: View/download PDF

45. Comparing Several Models for Perceptual Long-Term Modeling of Amplitudes and Phase Trajectories of Sinusoidal Speech

Author: Laurent Girin, Sylvain Marchand, Mohammad Firouzmand, Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB), and Marchand, Sylvain
Subjects: Computer science, Speech recognition, media_common.quotation_subject, [INFO.INFO-OH]Computer Science [cs]/Other [cs.OH], Phase (waves), 020206 networking & telecommunications, 02 engineering and technology, Term (time), [INFO.INFO-OH] Computer Science [cs]/Other [cs.OH], 030507 speech-language pathology & audiology, 03 medical and health sciences, Amplitude, Perception, 0202 electrical engineering, electronic engineering, information engineering, 0305 other medical science, ComputingMilieux_MISCELLANEOUS, media_common
Abstract: International audience
Published: 2005

46. Long Term Modeling of Phase Trajectories within the Speech Sinusoidal Model Framework

Author: Laurent Girin, Mohammad Firouzmand, Sylvain Marchand, Institut de la communication parlée (ICP), Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS)-Université Stendhal - Grenoble 3, Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB), and Marchand, Sylvain
Subjects: Audio signal, Computer science, Speech recognition, Speech coding, [INFO.INFO-OH]Computer Science [cs]/Other [cs.OH], 020206 networking & telecommunications, Sinusoidal model, 02 engineering and technology, [INFO.INFO-OH] Computer Science [cs]/Other [cs.OH], 030507 speech-language pathology & audiology, 03 medical and health sciences, Computer Science::Sound, 0202 electrical engineering, electronic engineering, information engineering, Discrete cosine transform, 0305 other medical science, Digital watermarking, ComputingMilieux_MISCELLANEOUS
Abstract: In this paper, the problem of modeling the trajectory of the phase of speech signal is addressed within the context of the sinusoidal model of speech. A global or long-term model of the trajectory of the phase of the partials is proposed for each entire voiced section of speech, contrary to standard models, which are defined on a frame-by-frame basis. The complete analysis-modeling-synthesis process is presented. We compare two basic long-term models, namely a polynomial and a DCT-based model, with classical (frame-by-frame) interpolation schemes, given that the analysis process is the same in all cases. Promising results are given and the interest of the presented models for speech coding and speech watermarking applications is discussed.
Published: 2004

47. Characterizing and classifying Cued Speech vowels from labial parameters

Author: Laurent Girin, Thomas Burger, Denis Beautemps, Institut de la communication parlée (ICP), Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS)-Université Stendhal - Grenoble 3, and Burger, Thomas
Subjects: Cued speech, business.industry, Computer science, Speech recognition, media_common.quotation_subject, [SCCO.COMP]Cognitive science/Computer science, 020206 networking & telecommunications, 02 engineering and technology, Transcoding, [SCCO.LING]Cognitive science/Linguistics, computer.software_genre, 030507 speech-language pathology & audiology, 03 medical and health sciences, [SCCO.COMP] Cognitive science/Computer science, Vowel, Perception, 0202 electrical engineering, electronic engineering, information engineering, Telephony, [SCCO.LING] Cognitive science/Linguistics, 0305 other medical science, business, computer, Spoken language, media_common
Abstract: International audience; As part of the THIMP project (Telephony for Hearing-IMpaired People), we aim at automatically analyzing Cued Speech [1] and translating it into oral spoken language. This work focuses on vowel classification and will be part of this transcoding process as a preprocessing step of the input data analysis. Its objective is to identify vowels produced by a speaker pronouncing and coding in Cued Speech a set of French sentences, given (i) the Cued Speech hand placement and (ii) the analysis of defined labial parameters. Here, we show that combining these two sources of information makes it possible to identify vowels automatically. These results have to be compared to the performance of hearing-impaired people in the perception of Cued Speech.
Published: 2004

48. Comparing the Order of a Polynomial Phase Model for the Synthesis of Quasi-Harmonic Audio Signals

Author: Axel Röbel, J. di Martino, Laurent Girin, G. Peeters, Sylvain Marchand, Institut de la communication parlée (ICP), Université Stendhal - Grenoble 3-Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS), Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS), Analysis, perception and recognition of speech (PAROLE), INRIA Lorraine, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université Henri Poincaré - Nancy 1 (UHP)-Université Nancy 2-Institut National Polytechnique de Lorraine (INPL)-Centre National de la Recherche Scientifique (CNRS)-Université Henri Poincaré - Nancy 1 (UHP)-Université Nancy 2-Institut National Polytechnique de Lorraine (INPL)-Centre National de la Recherche Scientifique (CNRS), Sciences et Technologies de la Musique et du Son (STMS), Institut de Recherche et Coordination Acoustique/Musique (IRCAM)-Université Pierre et Marie Curie - Paris 6 (UPMC)-Centre National de la Recherche Scientifique (CNRS), Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS)-Université Stendhal - Grenoble 3, Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB), Centre National de la Recherche Scientifique (CNRS)-Institut National Polytechnique de Lorraine (INPL)-Université Nancy 2-Université Henri Poincaré - Nancy 1 (UHP)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National 
de la Recherche Scientifique (CNRS)-Institut National Polytechnique de Lorraine (INPL)-Université Nancy 2-Université Henri Poincaré - Nancy 1 (UHP), and Marchand, Sylvain
Subjects: Polynomial, Decimation, Audio signal, [INFO.INFO-OH]Computer Science [cs]/Other [cs.OH], Bilinear interpolation, 020206 networking & telecommunications, Sinusoidal model, 02 engineering and technology, Linear interpolation, computer.software_genre, sinusoidal modeling of speech, [INFO.INFO-OH] Computer Science [cs]/Other [cs.OH], 030507 speech-language pathology & audiology, 03 medical and health sciences, Polynomial and rational function modeling, 0202 electrical engineering, electronic engineering, information engineering, Electronic engineering, 0305 other medical science, Audio signal processing, polynomial model for phase interpolation, Algorithm, computer, quasi-harmonic signals, ComputingMilieux_MISCELLANEOUS, Mathematics
Abstract: Sinusoidal modeling has been successfully applied to a wide range of audio signal processing problems, such as coding or time and frequency stretching. While many methods have been proposed for the analysis part of the process, it seems that there is some general agreement concerning the synthesis part in the nonoverlapping case: It is very often achieved by using the well-known McAulay-Quatieri method, which consists of an order 3 polynomial reconstruction of the phases of the sinusoidal model partials. We compare this "classical" approach with both a simpler (order 1, that is, linear interpolation) and a more complex (order 5) polynomial model for phase interpolation of quasi-harmonic signals. A gain has been measured in the signal-to-noise ratio at the synthesis stage, although the performance is limited by the amplitude model and by the imprecision in the analysis stage.
Published: 2003

49. A Benchmark of Dynamical Variational Autoencoders Applied to Speech Spectrogram Modeling

Author: Xiaoyu Bie, Laurent Girin, Thomas Hueber, Simon Leglaive, Xavier Alameda-Pineda, Vers des robots à l’intelligence sociale au travers de l’apprentissage, de la perception et de la commande (ROBOTLEARN), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université Grenoble Alpes (UGA), GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing (GIPSA-CRISSP), GIPSA Pôle Parole et Cognition (GIPSA-PPC), Grenoble Images Parole Signal Automatique (GIPSA-lab), Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP ), Université Grenoble Alpes (UGA)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP ), Université Grenoble Alpes (UGA)-Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Grenoble Alpes (UGA), Institut d'Électronique et des Technologies du numéRique (IETR), Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Nantes Université - pôle Sciences et technologie, Nantes Université (Nantes Univ)-Nantes Université (Nantes Univ), CentraleSupélec [campus de Rennes], ANR-19-P3IA-0003,MIAI,MIAI @ Grenoble Alpes(2019), ANR-19-CE33-0008,ML3RI,Apprentissage de bas-niveau d'interactions robotiques multi-modales avec plusieurs personnes(2019), European Project: 871245,H2020-EU.2.1.1. - INDUSTRIAL LEADERSHIP - Leadership in enabling and industrial technologies - Information and Communication Technologies (ICT),SPRING(2020), CentraleSupélec, and European Project: H2020,SPRING
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer science, 02 engineering and technology, Computer Science - Sound, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], 030507 speech-language pathology & audiology, 03 medical and health sciences, Speech signals modeling, [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], Audio and Speech Processing (eess.AS), 0202 electrical engineering, electronic engineering, information engineering, FOS: Electrical engineering, electronic engineering, information engineering, Complex data type, Sequence, Series (mathematics), business.industry, speech analysis-resynthesis, 020206 networking & telecommunications, Pattern recognition, dynamical variational autoencoders, Autoencoder, Generative model, Recurrent neural network, speech spectrograms, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], Benchmark (computing), Spectrogram, Artificial intelligence, 0305 other medical science, business, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The Variational Autoencoder (VAE) is a powerful deep generative model that is now extensively used to represent high-dimensional complex data via a low-dimensional latent space learned in an unsupervised manner. In the original VAE model, input data vectors are processed independently. In recent years, a series of papers have presented different extensions of the VAE to process sequential data, that not only model the latent space, but also model the temporal dependencies within a sequence of data vectors and corresponding latent vectors, relying on recurrent neural networks. We recently performed a comprehensive review of those models and unified them into a general class called Dynamical Variational Autoencoders (DVAEs). In the present paper, we present the results of an experimental benchmark comparing six of those DVAE models on the speech analysis-resynthesis task, as an illustration of the high potential of DVAEs for speech modeling. Comment: Accepted to Interspeech 2021. arXiv admin note: text overlap with arXiv:2008.12595

50. An analysis of visual speech information applied to voice activity detection

Author: Laurent Girin, Christian Jutten, David Sodoyer, Jean-Luc Schwartz, Bertrand Rivet, Institut de la communication parlée (ICP), Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS)-Université Stendhal - Grenoble 3, Laboratoire des images et des signaux (LIS), and Institut National Polytechnique de Grenoble (INPG)-Université Joseph Fourier - Grenoble 1 (UJF)-Centre National de la Recherche Scientifique (CNRS)
Subjects: Signal processing, Voice activity detection, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, Computer science, Speech recognition, 020206 networking & telecommunications, 02 engineering and technology, Speech processing, Facial recognition system, Voice analysis, Speech enhancement, Background noise, Silence, Noise, Gesture recognition, otorhinolaryngologic diseases, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing, ComputingMilieux_MISCELLANEOUS
Abstract: We present a new approach to the voice activity detection (VAD) problem for speech signals embedded in non-stationary noise. The method is based on automatic lipreading: the objective is to detect voice activity or non-activity by exploiting the coherence between the speech acoustic signal and the speaker's lip movements. From a comprehensive analysis of lip shape parameters during speech and non-speech events, we show that a single appropriate visual parameter, defined to characterize the lip movements, can be used for the detection of sections of voice activity or, more precisely, for the detection of silence sections. Detection scores obtained on spontaneous speech confirm the efficiency of the visual voice activity detector (VVAD).
