1. Learning and controlling the source-filter representation of speech with a variational autoencoder
- Author
- Samir Sadok, Simon Leglaive, Laurent Girin, Xavier Alameda-Pineda, Renaud Séguier (IETR: CentraleSupélec, INSA Rennes, Université de Rennes, CNRS, Nantes Université; GIPSA-lab: Université Grenoble Alpes, CNRS, Grenoble INP; RobotLearn: Inria Grenoble - Rhône-Alpes)
- Funding: ANR-19-P3IA-0003 (MIAI @ Grenoble Alpes), ANR-19-CE33-0008 (ML3RI), European H2020 project 871245 (SPRING)
- Subjects
[INFO.INFO-AI] Computer Science/Artificial Intelligence, [INFO.INFO-LG] Computer Science/Machine Learning, [INFO.INFO-SD] Computer Science/Sound, [SPI.SIGNAL] Engineering Sciences/Signal and Image Processing, Computer Science - Machine Learning (cs.LG), Computer Science - Sound (cs.SD), Electrical Engineering and Systems Science - Audio and Speech Processing (eess.AS), FOS: Computer and information sciences, FOS: Electrical engineering, electronic engineering, information engineering, Deep generative models, Variational autoencoder, Representation learning, Source-filter model, Linguistics and Language, Communication, Computer Science Applications, Modeling and Simulation, Computer Vision and Pattern Recognition, Software
- Abstract
Understanding and controlling latent representations in deep generative models is a challenging yet important problem for analyzing, transforming and generating various types of data. In speech processing, inspired by the anatomical mechanisms of phonation, the source-filter model considers that speech signals are produced from a few independent and physically meaningful continuous latent factors, among which the fundamental frequency $f_0$ and the formants are of primary importance. In this work, we start from a variational autoencoder (VAE) trained in an unsupervised manner on a large dataset of unlabeled natural speech signals, and we show that the source-filter model of speech production naturally arises as orthogonal subspaces of the VAE latent space. Using only a few seconds of labeled speech signals generated with an artificial speech synthesizer, we propose a method to identify the latent subspaces encoding $f_0$ and the first three formant frequencies, we show that these subspaces are orthogonal, and based on this orthogonality, we develop a method to accurately and independently control the source-filter speech factors within the latent subspaces. Without requiring additional information such as text or human-labeled data, this results in a deep generative model of speech spectrograms that is conditioned on $f_0$ and the formant frequencies, and which is applied to the transformation of speech signals. Finally, we also propose a robust $f_0$ estimation method that exploits the projection of a speech signal onto the learned latent subspace associated with $f_0$. (23 pages, 7 figures; companion website: https://samsad35.github.io/site-sfvae/)
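To make the idea of latent-subspace identification and control more concrete, here is a minimal Python sketch of one way such a procedure could look. It assumes a pretrained VAE `encoder`/`decoder` and a small set of latent codes `Z_labeled` obtained from synthetic speech frames with known `f0_labels`; all variable names and file names are hypothetical, a linear regression stands in for whatever regression the paper actually uses, and this is an illustration of the general technique, not the authors' released implementation.

```python
# Illustrative sketch (not the authors' exact code): find a low-dimensional latent
# subspace that encodes f0, then move a latent vector within that subspace to change
# f0 while leaving its orthogonal complement (the "filter" part) untouched.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

# 1) Latent means of a few seconds of synthetic speech in which only f0 varies.
#    Z_labeled: (n_frames, latent_dim), f0_labels: (n_frames,)  -- placeholder files.
Z_labeled = np.load("z_f0_sweep.npy")
f0_labels = np.load("f0_labels.npy")

# 2) Assume the variations caused by f0 span a low-dimensional subspace; PCA on the
#    labeled latent vectors gives an orthonormal basis U of that subspace.
pca = PCA(n_components=3).fit(Z_labeled)
U = pca.components_.T                          # (latent_dim, 3), orthonormal columns

# 3) Learn a simple mapping from log-f0 to coordinates in the subspace.
coords_labeled = (Z_labeled - pca.mean_) @ U
reg = LinearRegression().fit(np.log(f0_labels)[:, None], coords_labeled)

def set_f0(z, f0_target):
    """Replace the component of z lying in the f0 subspace with the coordinates
    predicted for f0_target, keeping the orthogonal complement fixed."""
    z_centered = z - pca.mean_
    z_perp = z_centered - U @ (U.T @ z_centered)       # remove current f0 component
    coords = reg.predict(np.log([[f0_target]]))[0]     # target subspace coordinates
    return pca.mean_ + z_perp + U @ coords

# Usage (hypothetical encoder/decoder): shift a frame's f0 to 220 Hz and decode.
# z = encoder(spectrogram_frame)
# spectrogram_out = decoder(set_f0(z, 220.0))
```

The same projection onto U also suggests how an $f_0$ estimate could be read off a natural speech frame (project the latent code and invert the learned mapping), which is the spirit of the estimation method mentioned at the end of the abstract.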
- Published
- 2023