
Unsupervised Learning of a Disentangled Speech Representation for Voice Conversion

Authors:
Petra Wagner
Reinhold Haeb-Umbach
Janek Ebbers
Tobias Gburrek
Thomas Glarner
Source:
10th ISCA Workshop on Speech Synthesis (SSW 10)
Publication Year:
2019
Publisher:
ISCA, 2019

Abstract

This paper presents an approach to voice conversion which requires neither parallel data nor speaker or phone labels for training. It can convert between speakers that are not in the training set by employing the previously proposed concept of a factorized hierarchical variational autoencoder. Here, linguistic and speaker-induced variations are separated based on the notion that content-induced variations change at a much shorter time scale, i.e., at the segment level, than speaker-induced variations, which vary at the longer utterance level. In this contribution we propose to employ convolutional instead of recurrent network layers in the encoder and decoder blocks, which is shown to achieve better phone recognition accuracy on the latent segment variables at frame level due to their better temporal resolution. For voice conversion, the mean of the utterance variables is replaced with the respective estimated mean of the target speaker. The resulting log-mel spectra of the decoder output are used as local conditions of a WaveNet, which is utilized for synthesis of the speech waveforms. Experiments show both good disentanglement properties of the latent space variables and good voice conversion performance.
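
To make the conversion step concrete, the following is a minimal sketch of the latent-swap operation described in the abstract, assuming a hypothetical FHVAE interface with separate segment-level and utterance-level encoders. The names (encode_segment, encode_utterance, decode, mu2_tgt) are illustrative stand-ins, not the paper's actual implementation.

```python
# Sketch of FHVAE-style voice conversion via the utterance-variable swap.
# All interfaces here are assumptions for illustration only.
def convert_voice(fhvae, src_logmel, mu2_tgt):
    """Convert the speaker identity of a source utterance.

    fhvae      -- trained factorized hierarchical VAE (assumed interface)
    src_logmel -- log-mel spectrogram of the source utterance, shape (T, D)
    mu2_tgt    -- estimated mean of the target speaker's utterance variable
    """
    # Segment-level latents z1 capture linguistic content, which changes
    # at a short time scale; utterance-level latents z2 capture speaker
    # identity, which varies at the longer utterance level.
    z1 = fhvae.encode_segment(src_logmel)
    z2 = fhvae.encode_utterance(src_logmel)

    # Replace the mean of the utterance variables with the target
    # speaker's estimated mean, keeping the per-segment residuals.
    z2_converted = z2 - z2.mean(axis=0, keepdims=True) + mu2_tgt

    # Decode to log-mel spectra; in the paper these serve as local
    # conditioning for a WaveNet vocoder that synthesizes the waveform.
    return fhvae.decode(z1, z2_converted)
```

In this reading, content (z1) is left untouched and only the speaker-dependent utterance variable is shifted, which is what allows conversion between speakers unseen during training.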

Details

Database:
OpenAIRE
Journal:
10th ISCA Workshop on Speech Synthesis (SSW 10)
Accession number:
edsair.doi...........2e391b01283bde42693daa6e0c771d13
Full Text:
https://doi.org/10.21437/ssw.2019-15