
Data augmentation based non-parallel voice conversion with frame-level speaker disentangler.

Authors :
Chen, Bo
Xu, Zhihang
Yu, Kai
Source :
Speech Communication. Jan 2022, Vol. 136, p14-22. 9p.
Publication Year :
2022

Abstract

Non-parallel data voice conversion is a popular and challenging research area. The main task is to build acoustic mappings from the source speaker to the target speaker at different units (e.g., frame, phoneme, cluster, sentence). With the help of recent high-quality speech synthesis techniques, it is possible to produce parallel speech directly from non-parallel data. This paper proposes ParaGen: a data augmentation based technique for non-parallel data voice conversion. The system consists of a speaker-disentangler-based text-to-speech model and a simple frame-to-frame spectrogram conversion model. The text-to-speech model takes text and reference audio as input and produces speech in the target speaker identity with the time-aligned local speaking style of the reference audio. The spectrogram conversion model directly converts the source spectrogram to the target speaker frame by frame. The local speaking style is extracted by an acoustic encoder, while the speaker identity is eliminated by a conditional convolutional disentangler. The local style encodings are time-aligned with the text encodings by an attention mechanism, and the attention contexts are decoded by a conditional recurrent decoder. The experiments show that the speaker identity of the source speech is converted to the target speaker, while the local speaking style (e.g., prosody) is preserved after augmentation. The method is compared to an augmentation model based on typical statistical parametric speech synthesis (SPSS) with pre-aligned phoneme durations. The results show that the converted speech has better naturalness than the SPSS system, while the speaker similarities of the converted speech are close.

• We propose a data augmentation based technique for non-parallel voice conversion.
• It produces time-aligned parallel data with the same frame-level speaking style.
• We use a frame-level adversarial loss to reduce the speaker identity.
• We propose two separate speaker embeddings before and after the attention mechanism.
• We use stacked 2D CNNs with conditional 1D CNNs to extract the local speaking style.
• We can use a simple network to build the voice conversion model with the augmented data.

[ABSTRACT FROM AUTHOR]
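The abstract does not give implementation details, so the sketch below is only a minimal, hypothetical illustration of the frame-level speaker-disentangling idea described above: an acoustic encoder extracts frame-level style encodings, and a frame-level speaker classifier trained adversarially (realized here with a gradient-reversal layer, an assumption rather than the authors' stated method) pushes speaker identity out of those encodings. The PyTorch framework, layer sizes, and all module names are illustrative assumptions, not the paper's implementation.

    # Minimal, hypothetical sketch: frame-level style encoder with an
    # adversarial speaker classifier. Layer sizes and names are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GradReverse(torch.autograd.Function):
        # Identity on the forward pass; negated, scaled gradient on the
        # backward pass, so the encoder learns to fool the speaker classifier.
        @staticmethod
        def forward(ctx, x, lamb):
            ctx.lamb = lamb
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_out):
            return -ctx.lamb * grad_out, None

    class FrameLevelDisentangler(nn.Module):
        def __init__(self, n_mels=80, d_style=128, n_speakers=10):
            super().__init__()
            # Acoustic encoder: 1D convolutions over mel-spectrogram frames
            # produce a local (frame-level) speaking-style encoding.
            self.encoder = nn.Sequential(
                nn.Conv1d(n_mels, 256, kernel_size=5, padding=2), nn.ReLU(),
                nn.Conv1d(256, d_style, kernel_size=5, padding=2),
            )
            # Frame-level speaker classifier trained adversarially through
            # the gradient-reversal layer to strip speaker identity.
            self.speaker_clf = nn.Sequential(
                nn.Conv1d(d_style, 256, kernel_size=1), nn.ReLU(),
                nn.Conv1d(256, n_speakers, kernel_size=1),
            )

        def forward(self, mel, speaker_ids, lamb=1.0):
            # mel: (batch, n_mels, T); speaker_ids: (batch,) long tensor
            style = self.encoder(mel)                      # (batch, d_style, T)
            logits = self.speaker_clf(GradReverse.apply(style, lamb))
            # Frame-level adversarial loss: every frame is classified against
            # the utterance's speaker label.
            target = speaker_ids.unsqueeze(1).expand(-1, logits.size(-1))
            adv_loss = F.cross_entropy(logits, target)
            return style, adv_loss

In a training setup like this, the adversarial loss would be added to the synthesis loss so that the style encodings carry local prosody but little speaker identity; the target speaker identity is then re-injected through separate speaker embeddings, as the highlights above suggest.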

Details

Language :
English
ISSN :
0167-6393
Volume :
136
Database :
Academic Search Index
Journal :
Speech Communication
Publication Type :
Academic Journal
Accession number :
154658955
Full Text :
https://doi.org/10.1016/j.specom.2021.10.001