Deep Variational Metric Learning for Transfer of Expressivity in Multispeaker Text to Speech
- Source :
- Statistical Language and Speech Processing (ISBN: 9783030594299), SLSP 2020: 8th International Conference on Statistical Language and Speech Processing, Oct 2020, Cardiff / Virtual, United Kingdom
- Publication Year :
- 2020
- Publisher :
- Springer International Publishing, 2020.
Abstract
- In this paper, we propose an approach relying on multiclass N-pair loss based deep metric learning in a recurrent conditional variational autoencoder (RCVAE). We used the RCVAE to implement a multispeaker expressive text-to-speech (TTS) system. The proposed approach conditions the text-to-speech system on speaker embeddings and leads to clustering of the latent space representation with respect to emotion. Deep metric learning helps to reduce the intra-class variance and increase the inter-class variance in the latent space. Thus, we present multiclass N-pair loss to enhance the meaningful representation of the latent space. To represent the speaker, we extracted speaker embeddings from an x-vector based speaker recognition model trained on speech data from many speakers. To predict the vocoder features, we used the RCVAE for acoustic modeling, in which the model is conditioned on the textual features as well as on the speaker embedding. We transferred the expressivity by using the mean of the latent variables for each emotion to generate expressive speech in different speakers' voices for which no expressive speech data is available. We compared the results with those of the RCVAE model without multiclass N-pair loss as the baseline model. The performance measured by mean opinion score (MOS), speaker MOS, and expressive MOS shows that N-pair loss based deep metric learning significantly improves the transfer of expressivity in the target speaker's voice in synthesized speech.
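The multiclass N-pair loss named in the abstract can be illustrated with a minimal sketch. It uses the standard formulation of the loss (one anchor/positive pair per class, with the positives of the other classes acting as negatives) and is not taken from the paper itself. The function names, the plain-list vectors, and the toy 2-D "latent" values are assumptions made for illustration only.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors given as lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def n_pair_loss(anchors, positives):
    """Multiclass N-pair loss over N (anchor, positive) pairs, one pair per class.

    For anchor i, the positive of every other class j acts as a negative:
        loss_i = log(1 + sum_{j != i} exp(a_i . p_j - a_i . p_i))
    The result is the mean of loss_i over all N pairs.
    No numerical stabilisation is applied, so very large similarities
    can overflow in this sketch.
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        pos_sim = dot(anchors[i], positives[i])
        neg_sum = sum(
            math.exp(dot(anchors[i], positives[j]) - pos_sim)
            for j in range(n)
            if j != i
        )
        total += math.log1p(neg_sum)
    return total / n


# Hypothetical 2-D latent vectors for two emotion classes.
clustered_anchors = [[2.0, 0.0], [0.0, 2.0]]
clustered_positives = [[2.0, 0.0], [0.0, 2.0]]   # each positive matches its anchor
mixed_positives = [[0.0, 2.0], [2.0, 0.0]]       # positives swapped across classes

print("clustered:", n_pair_loss(clustered_anchors, clustered_positives))
print("mixed:    ", n_pair_loss(clustered_anchors, mixed_positives))
```

The loss is small when each anchor is closer to its own class positive than to the positives of other classes, and large otherwise. This matches the abstract's description of reducing intra-class variance and increasing inter-class variance in the latent space.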
- Subjects :
- Computer science
Speech recognition
deep metric learning
Contrast (statistics)
020206 networking & telecommunications
Speech synthesis
02 engineering and technology
expressivity
Speaker recognition
Autoencoder
[INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]
Identity (music)
[INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]
[INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing
Metric (mathematics)
0202 electrical engineering, electronic engineering, information engineering
variational autoencoder
020201 artificial intelligence & image processing
Expressivity (genetics)
text-to-speech
Representation (mathematics)
Details
- ISBN :
- 978-3-030-59429-9
- Database :
- OpenAIRE
- Journal :
- Statistical Language and Speech Processing (ISBN: 9783030594299), SLSP 2020: 8th International Conference on Statistical Language and Speech Processing, Oct 2020, Cardiff / Virtual, United Kingdom
- Accession number :
- edsair.doi.dedup.....56935c6443b934c97f625ca3326142e6
- Full Text :
- https://doi.org/10.1007/978-3-030-59430-5_13