1. Minimum trajectory error training for deep neural networks, combined with stacked bottleneck features
- Authors
- Zhizheng Wu and Simon King
- Subjects
Mean squared error, Estimation theory, Computer science, Speech recognition, Acoustic model, Computational linguistics and speech processing, Context (language use), Speech synthesis, Pattern recognition, Bottleneck, Sound, Artificial intelligence, Parametric statistics
- Abstract
Recently, Deep Neural Networks (DNNs) have shown promise as an acoustic model for statistical parametric speech synthesis. Their ability to learn complex mappings from linguistic features to acoustic features has significantly advanced the naturalness of synthesised speech. However, because DNN parameter estimation methods typically attempt to minimise the mean squared error of each individual frame in the training data, the dynamic and continuous nature of speech parameters is neglected. In this paper, we propose a training criterion that minimises speech parameter trajectory errors, and so takes dynamic constraints from a wide acoustic context into account during training. We combine this novel training criterion with our previously proposed stacked bottleneck features, which provide wide linguistic context. Both objective and subjective evaluation results confirm the effectiveness of the proposed training criterion in improving model accuracy and the naturalness of synthesised speech.
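The trajectory-error idea in the abstract can be sketched numerically: instead of scoring each frame's static+dynamic prediction independently, generate the most likely static trajectory from the predictions (as in maximum-likelihood parameter generation, MLPG) and measure the error of that trajectory against the natural one. The sketch below is a minimal, hypothetical NumPy illustration assuming a single-dimensional feature stream, unit variances, and a simple first-order delta window; the function names and the 0.5*(c[t+1]-c[t-1]) delta definition are my assumptions, not the paper's exact configuration.

```python
import numpy as np

def delta_window_matrix(T):
    """Build the window matrix W that maps a length-T static trajectory c
    to 2T stacked observations [c_t; delta_c_t], with
    delta_c_t = 0.5*(c[t+1] - c[t-1]) and zero padding at the ends."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0          # static coefficient
        if t + 1 < T:
            W[2 * t + 1, t + 1] += 0.5   # delta: forward neighbour
        if t - 1 >= 0:
            W[2 * t + 1, t - 1] -= 0.5   # delta: backward neighbour
    return W

def mlpg_trajectory(o, W):
    """Most likely static trajectory given stacked static+delta
    predictions o, assuming unit variances: c = (W'W)^-1 W' o."""
    return np.linalg.solve(W.T @ W, W.T @ o)

def trajectory_error(o_pred, c_ref, W):
    """Trajectory-error criterion: mean squared error between the
    trajectory generated from the predictions and the natural
    static trajectory c_ref."""
    c_hat = mlpg_trajectory(o_pred, W)
    return float(np.mean((c_hat - c_ref) ** 2))
```

Because the generation step `(W'W)^-1 W'` is a linear map, this criterion remains differentiable with respect to the network outputs, which is what makes it usable as a DNN training objective rather than only as a post-hoc generation step.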
- Published
- 2015