Non-autoregressive Error Correction for CTC-based ASR with Phone-conditioned Masked LM
- Publication Year :
- 2022
Abstract
- Connectionist temporal classification (CTC)-based models are attractive in automatic speech recognition (ASR) because of their non-autoregressive nature. To take advantage of text-only data, language model (LM) integration approaches such as rescoring and shallow fusion have been widely used with CTC. However, they lose CTC's non-autoregressive nature because they require beam search, which slows down inference. In this study, we propose an error correction method with a phone-conditioned masked LM (PC-MLM). In the proposed method, less confident word tokens in the greedy decoded output from CTC are masked. The PC-MLM then predicts these masked word tokens given the unmasked words and phones supplementally predicted from CTC. We further extend it to a Deletable PC-MLM in order to address insertion errors. Since both CTC and PC-MLM are non-autoregressive models, the method enables fast LM integration. Experimental evaluations on the Corpus of Spontaneous Japanese (CSJ) and TED-LIUM2 in a domain adaptation setting show that our proposed method outperforms rescoring and shallow fusion in terms of inference speed, and also in terms of recognition accuracy on CSJ.
- Comment: Accepted at Interspeech 2022
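- The correction procedure described above can be illustrated with a minimal sketch: low-confidence words in the greedy CTC hypothesis are replaced with a mask token, and a masked LM fills all masked positions in a single parallel step, conditioned on the surviving words and the phone sequence. The names below (`pc_mlm_fill`, `MASK`, the 0.9 threshold) are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of confidence-based masking + non-autoregressive fill-in.
# `pc_mlm_fill` stands in for a phone-conditioned masked LM; here it is
# a hypothetical callable, not the paper's actual model.

from typing import Callable, List, Tuple

MASK = "<mask>"

def mask_low_confidence(words: List[str],
                        confidences: List[float],
                        threshold: float = 0.9) -> Tuple[List[str], List[int]]:
    """Mask words whose CTC posterior confidence falls below `threshold`."""
    masked, positions = [], []
    for i, (w, c) in enumerate(zip(words, confidences)):
        if c < threshold:
            masked.append(MASK)
            positions.append(i)
        else:
            masked.append(w)
    return masked, positions

def correct_hypothesis(words: List[str],
                       confidences: List[float],
                       phones: List[str],
                       pc_mlm_fill: Callable[[List[str], List[str]], List[str]],
                       threshold: float = 0.9) -> List[str]:
    """One non-autoregressive correction pass: mask uncertain words, then
    let the masked LM predict all masked positions in parallel, conditioned
    on the unmasked words and the phone sequence from CTC."""
    masked, positions = mask_low_confidence(words, confidences, threshold)
    if not positions:
        return words                      # nothing to correct
    return pc_mlm_fill(masked, phones)    # single parallel prediction step

# Toy usage with a stand-in predictor that simply echoes phones for masks.
if __name__ == "__main__":
    hyp = ["i", "red", "a", "book"]
    conf = [0.98, 0.42, 0.95, 0.97]
    phones = ["ay", "r eh d", "ax", "b uh k"]
    dummy_fill = lambda toks, ph: [ph[i].replace(" ", "") if t == MASK else t
                                   for i, t in enumerate(toks)]
    print(correct_hypothesis(hyp, conf, phones, dummy_fill))
```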
Details
- Database :
- arXiv
- Publication Type :
- Report
- Accession number :
- edsarx.2209.04062
- Document Type :
- Working Paper