1. End-to-End Amdo-Tibetan Speech Recognition Based on Knowledge Transfer
- Author
- Xiaojun Zhu and Heming Huang
- Subjects
General Computer Science; Computer science; Speech recognition; 02 engineering and technology; transfer learning; 030507 speech-language pathology & audiology; 03 medical and health sciences; Discriminative model; Amdo-Tibetan; 0202 electrical engineering, electronic engineering, information engineering; General Materials Science; low resource language speech recognition; Artificial neural network; End-to-end; General Engineering; Acoustic model; 020206 networking & telecommunications; Mutual information; lcsh:Electrical engineering. Electronics. Nuclear engineering; Language model; LAS model; 0305 other medical science; Transfer of learning; lcsh:TK1-9971; Encoder; Smoothing
- Abstract
End-to-end speech recognition addresses a limitation of traditional speech recognition systems, in which each component is built independently and the components cannot be jointly optimized. It incorporates the acoustic model, language model, and decoding unit of the hybrid model into a single neural network, which avoids the inherent defects of a multi-module pipeline and greatly reduces the complexity of the speech recognition model. In this research, an end-to-end Amdo-Tibetan speech recognition system is constructed based on the Listen, Attend and Spell (LAS) model. It converts an Amdo-Tibetan speech sequence directly into the corresponding character sequence and greatly reduces the difficulty of building an Amdo-Tibetan speech recognition model. To further improve the performance of the proposed system, the following improvements have been made: first, a multi-head attention mechanism is introduced to improve the alignment accuracy between the state vectors of the decoder and the encoder; second, label smoothing is adopted to counter over-fitting; third, an N-gram language model is combined with the LAS model to increase recognition accuracy, and the maximum mutual information (MMI) criterion is employed for discriminative training; and finally, transfer learning is used to compensate for insufficient training data. Experimental results show that the proposed model significantly enhances the performance of Amdo-Tibetan speech recognition.
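As an illustration of the label smoothing step named in the abstract, the following is a minimal sketch of the common uniform variant. The record does not give the authors' exact formulation, so the smoothing weight `epsilon`, the function names, and the four-symbol vocabulary are all assumptions made for this example.

```python
import math


def smooth_labels(target_index, num_classes, epsilon=0.1):
    """Build a smoothed target distribution (uniform label smoothing).

    Assumption: a fraction `epsilon` of the probability mass is spread
    evenly over all classes, and the rest stays on the true class.
    """
    if not 0 <= target_index < num_classes:
        raise ValueError("target_index out of range")
    dist = [epsilon / num_classes] * num_classes
    dist[target_index] += 1.0 - epsilon
    return dist


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cross_entropy(target_dist, logits):
    """Cross-entropy between a target distribution and model logits."""
    return -sum(t * lp for t, lp in zip(target_dist, log_softmax(logits)))


# Hypothetical decoder output over a 4-symbol vocabulary; true class is 2.
logits = [0.5, 0.1, 3.0, -1.0]
hard = smooth_labels(2, 4, epsilon=0.0)   # one-hot target
soft = smooth_labels(2, 4, epsilon=0.1)   # smoothed target
loss_hard = cross_entropy(hard, logits)
loss_soft = cross_entropy(soft, logits)
```

With a smoothed target the loss no longer rewards pushing the true-class probability all the way to 1, which is the usual argument for why the technique reduces over-confident predictions.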
- Published
- 2020