1. Unsupervised Speech Enhancement Using Dynamical Variational Autoencoders
- Author
Xiaoyu Bie, Simon Leglaive, Xavier Alameda-Pineda, Laurent Girin (RobotLearn team, Inria Grenoble - Rhône-Alpes, Université Grenoble Alpes; CentraleSupélec, IETR, Université de Rennes; GIPSA-lab, CNRS, Grenoble INP, Université Grenoble Alpes). Supported by ANR-19-P3IA-0003 (MIAI @ Grenoble Alpes), ANR-19-CE33-0008 (ML3RI), and the EU H2020 project SPRING (GA #871245).
- Subjects
Speech enhancement, Noise measurement, Time series analysis, Time-domain analysis, Recording, Training, Inference algorithms, Acoustics and Ultrasonics, Computational Mathematics, Computer Science (miscellaneous), Electrical and Electronic Engineering, Machine Learning (cs.LG), Artificial Intelligence (cs.AI), Sound (cs.SD), Audio and Speech Processing (eess.AS)
- Abstract
Dynamical variational autoencoders (DVAEs) are a class of deep generative models with latent variables, dedicated to modeling time series of high-dimensional data. DVAEs can be considered extensions of the variational autoencoder (VAE) that include temporal dependencies between successive observed and/or latent vectors. Previous work has shown the benefit of using DVAEs over the VAE for speech spectrogram modeling. Independently, the VAE has been successfully applied to speech enhancement in noise, in an unsupervised noise-agnostic set-up that requires neither noise samples nor noisy speech samples at training time, but only clean speech signals. In this paper, we extend these works to DVAE-based single-channel unsupervised speech enhancement, hence exploiting both unsupervised representation learning and dynamics modeling of speech signals. We propose an unsupervised speech enhancement algorithm that combines a DVAE speech prior pre-trained on clean speech signals with a noise model based on nonnegative matrix factorization (NMF), and we derive a variational expectation-maximization (VEM) algorithm to perform speech enhancement. The algorithm is presented with the most general DVAE formulation and is then applied with three specific DVAE models to illustrate the versatility of the framework. Experimental results show that the proposed DVAE-based approach outperforms its VAE-based counterpart, as well as several supervised and unsupervised noise-dependent baselines, especially when the noise type is unseen during training.
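To make the abstract's description of the VEM loop more concrete, below is a minimal NumPy sketch, not the authors' implementation. It makes two simplifying assumptions: the speech variance produced by the pre-trained DVAE is treated as a fixed placeholder array (in the paper it would come from the DVAE decoder and be refined in the E-step), and the M-step uses standard Itakura-Saito NMF multiplicative updates on the additive variance model. All variable names (`sigma2_s`, `W`, `H`, `g`) are illustrative.

```python
# Hedged sketch of a VEM loop combining a (placeholder) DVAE speech variance
# with an NMF noise model, followed by Wiener-like filtering.
import numpy as np

rng = np.random.default_rng(0)

F, N, K = 257, 100, 8                                        # freq bins, frames, NMF rank
X = rng.normal(size=(F, N)) + 1j * rng.normal(size=(F, N))   # noisy STFT (placeholder data)
V_x = np.abs(X) ** 2                                         # observed noisy power spectrogram

sigma2_s = rng.uniform(0.1, 1.0, size=(F, N))  # speech variance from the DVAE prior (placeholder)
W = rng.uniform(0.1, 1.0, size=(F, K))         # NMF noise spectral patterns
H = rng.uniform(0.1, 1.0, size=(K, N))         # NMF noise activations
g = np.ones(N)                                 # per-frame gain on the speech variance

eps = 1e-8
for _ in range(50):
    # E-step (omitted): refine the approximate posterior over the DVAE latent
    # variables given the noisy speech, which would update sigma2_s.

    # M-step: multiplicative updates for the NMF noise model and the gain,
    # derived from the Itakura-Saito divergence on the additive variance model.
    V = g[None, :] * sigma2_s + W @ H + eps                  # total variance of the noisy STFT
    W *= ((V_x / V**2) @ H.T) / ((1.0 / V) @ H.T + eps)
    V = g[None, :] * sigma2_s + W @ H + eps
    H *= (W.T @ (V_x / V**2)) / (W.T @ (1.0 / V) + eps)
    V = g[None, :] * sigma2_s + W @ H + eps
    g *= (sigma2_s * V_x / V**2).sum(axis=0) / ((sigma2_s / V).sum(axis=0) + eps)

# Wiener-like filtering with the final variance estimates.
V_s = g[None, :] * sigma2_s
S_hat = (V_s / (V_s + W @ H + eps)) * X                      # estimated clean-speech STFT
```

In the actual method, the E-step exploits the temporal dependencies of the chosen DVAE model, which is what distinguishes the approach from its VAE-based counterpart.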
- Published
- 2022