Deep Learning for Mandarin-Tibetan Cross-Lingual Speech Synthesis

Authors :
Weizhao Zhang
Hongwu Yang
Xiaolong Bu
Lili Wang
Source :
IEEE Access, Vol 7, Pp 167884-167894 (2019)
Publication Year :
2019
Publisher :
IEEE, 2019.

Abstract

This paper proposes a deep learning-based Mandarin-Tibetan cross-lingual speech synthesis method that realizes both Mandarin and Tibetan speech synthesis under a unified framework. Because a Tibetan training corpus is hard to record, we train the acoustic models with a large-scale Mandarin multi-speaker corpus and a small-scale Tibetan single-speaker corpus. The acoustic models are trained with a deep neural network (DNN), a hybrid long short-term memory (LSTM) network, and a hybrid bi-directional long short-term memory (BLSTM) network. We also extend our Chinese text analyzer with a Tibetan text analyzer that generates context-dependent labels from input Chinese or Tibetan sentences. The Tibetan text analyzer includes text normalization, a novel Tibetan word segmentation method that combines a BLSTM with a conditional random field, prosodic boundary prediction, and grapheme-to-phoneme conversion. We select the initials and finals of both Mandarin and Tibetan as the speech synthesis units to train a speaker-independent mixed-language average voice model (AVM) with the DNN, hybrid LSTM, and hybrid BLSTM from a mixed Mandarin and Tibetan corpus. Speaker adaptation is then applied to the AVM to train speaker-dependent DNN, hybrid LSTM, or hybrid BLSTM models of Mandarin or Tibetan from a small target-speaker corpus. Finally, we synthesize Mandarin or Tibetan speech through the speaker-dependent Mandarin or Tibetan models. The experiments show that the hybrid BLSTM-based cross-lingual speech synthesis framework outperforms the other two cross-lingual frameworks and the Tibetan monolingual framework. Mixing in the Tibetan training corpus does not affect the voice quality of the synthesized Mandarin speech. Furthermore, the hybrid BLSTM-based cross-lingual framework needs only 60% of the training corpus to synthesize a voice similar to that of the Tibetan monolingual framework. Therefore, the proposed method can be used for speech synthesis of low-resource languages by borrowing the corpus of a high-resource language.

Details

Language :
English
ISSN :
2169-3536
Volume :
7
Database :
Directory of Open Access Journals
Journal :
IEEE Access
Publication Type :
Academic Journal
Accession number :
edsdoj.90efe70147fe4bdbae708d94b981f946
Document Type :
article
Full Text :
https://doi.org/10.1109/ACCESS.2019.2954342