Training Speech Recognition Model with Speech Synthesis and Text Discriminator.

Authors :: HOU-AN LIN
CHIA-PING CHEN
Source :: Journal of Information Science & Engineering; Mar2024, Vol. 40 Issue 2, p359-373, 15p
Publication Year :: 2024
Abstract: In this paper, we incrementally build neural-network-based automatic speech recognition (ASR) systems to improve performance. First, we add an adversarial text discriminator module so that the speech recognition model is trained to correct typos in its recognition results. Experiments show that this ASR system achieves a character error rate (CER) of 12.3% and a word error rate (WER) of 31.4%. Second, we insert a pre-trained speech synthesis (text-to-speech, TTS) module into the ASR model. When we exploit a pre-trained TTS in ASR training, the CER and WER are reduced from 12.6% and 31.7% to 10.8% and 24.4%, demonstrating that a pre-trained TTS can improve ASR. Finally, we include both the pre-trained TTS and the text discriminator in ASR training. The performance of this ASR system improves further, reaching a CER of 9.9% and a WER of 22.7%. On the Formosa Speech Recognition Challenge task using Taibun Hàn-jī transcription, the proposed method also achieves a better CER than a system based on a hybrid DNN-HMM chain model. [ABSTRACT FROM AUTHOR]
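
The abstract reports results as CER and WER without defining how they are computed. The sketch below shows the conventional definition, assuming both metrics are the Levenshtein edit distance between reference and hypothesis divided by the reference length, over characters for CER and over whitespace-separated words for WER. The function names (`edit_distance`, `cer`, `wer`, `relative_reduction`) are illustrative and not taken from the paper, and the paper's actual scoring tool and text normalization are not stated in this record.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference token
                            curr[j - 1] + 1,      # insertion of a hypothesis token
                            prev[j - 1] + cost))  # substitution or exact match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by the reference length."""
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(ref, hyp):
    """Word error rate: tokens are whitespace-separated words (assumption)."""
    return error_rate(ref.split(), hyp.split())


def cer(ref, hyp):
    """Character error rate: tokens are characters, spaces ignored (assumption)."""
    return error_rate(list(ref.replace(" ", "")), list(hyp.replace(" ", "")))


def relative_reduction(before, after):
    """Fractional improvement when an error rate drops from `before` to `after`."""
    return (before - after) / before
```

Under these assumptions, `relative_reduction(31.7, 24.4)` expresses the WER drop reported for adding the pre-trained TTS as a relative improvement of roughly 23%, and `relative_reduction(12.6, 10.8)` gives roughly 14% for CER.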