
Learning robust speech representation with an articulatory-regularized variational autoencoder

Authors :
Laurent Girin
Jean-Luc Schwartz
Marc-Antoine Georges
Thomas Hueber
GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing (GIPSA-CRISSP)
GIPSA - Pôle Parole et Cognition (GIPSA-PPC)
GIPSA - Perception, Contrôle, Multimodalité et Dynamiques de la parole (GIPSA-PCMD)
Grenoble Images Parole Signal Automatique (GIPSA-lab), CNRS - Université Grenoble Alpes (UGA) - Grenoble INP
Laboratoire de Psychologie et NeuroCognition (LPNC), Université Savoie Mont Blanc (USMB) - CNRS - Université Grenoble Alpes (UGA)
Funding :
ANR-19-P3IA-0003, MIAI @ Grenoble Alpes (2019)
Source :
Proceedings of Interspeech 2021 - 22nd Annual Conference of the International Speech Communication Association, Aug 2021, Brno, Czech Republic
Publication Year :
2021

Abstract

It is increasingly considered that human speech perception and production both rely on articulatory representations. In this paper, we investigate whether this type of representation can improve the performance of a deep generative model (here, a variational autoencoder) trained to encode and decode acoustic speech features. First, we develop an articulatory model able to associate articulatory parameters describing the jaw, tongue, lips and velum configurations with vocal tract shapes and spectral features. Then, we incorporate these articulatory parameters into a variational autoencoder applied to spectral features, using a regularization technique that constrains part of the latent space to represent articulatory trajectories. We show that this articulatory constraint improves model training, decreasing both the time to convergence and the reconstruction loss at convergence, and yields better performance in a speech denoising task.
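The abstract describes a regularization that ties part of the VAE latent space to articulatory trajectories. Below is a minimal sketch of one way such a loss could be written in PyTorch, assuming a standard Gaussian VAE over spectral frames; the class and parameter names (SpeechVAE, art_dim, alpha) and the choice of 7 articulatory dimensions are illustrative assumptions, not the authors' actual implementation.

# Hypothetical sketch of an articulatory-regularized VAE (not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechVAE(nn.Module):
    def __init__(self, spec_dim=64, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(spec_dim, 128), nn.ReLU())
        self.fc_mu = nn.Linear(128, latent_dim)
        self.fc_logvar = nn.Linear(128, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, spec_dim)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterization trick: sample z = mu + sigma * eps.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(z), mu, logvar, z

def loss_fn(x, x_hat, mu, logvar, z, art_params, art_dim=7, alpha=1.0):
    # art_dim=7 is an assumed number of articulatory parameters
    # (jaw, tongue, lips, velum); the actual dimensionality is not given here.
    recon = F.mse_loss(x_hat, x, reduction="mean")
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # Articulatory regularization: constrain the first art_dim latent
    # dimensions to follow the measured articulatory trajectories.
    art_reg = F.mse_loss(z[:, :art_dim], art_params, reduction="mean")
    return recon + kl + alpha * art_reg

Here alpha weighs the articulatory constraint against the usual reconstruction and KL terms; setting alpha = 0 recovers a plain VAE, which makes the effect of the regularization straightforward to ablate.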

Details

Language :
English
Database :
OpenAIRE
Journal :
Proceedings of Interspeech 2021 - 22nd Annual Conference of the International Speech Communication Association, Aug 2021, Brno, Czech Republic
Accession number :
edsair.doi.dedup.....803fc115688f2ca8ffbb2f8b90c7739f