1. Mask-CTC-based Encoder Pre-training for Streaming End-to-End Speech Recognition
- Author
-
Zhao, Huaibo, Higuchi, Yosuke, Kida, Yusuke, Ogawa, Tetsuji, and Kobayashi, Tetsunori
- Subjects
Computer Science - Sound ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Achieving high accuracy with low latency has always been a challenge in streaming end-to-end automatic speech recognition (ASR) systems. By attending to more future contexts, a streaming ASR model achieves higher accuracy but results in larger latency, which hurts the streaming performance. In the Mask-CTC framework, an encoder network is trained to learn the feature representation that anticipates long-term contexts, which is desirable for streaming ASR. Mask-CTC-based encoder pre-training has been shown beneficial in achieving low latency and high accuracy for triggered attention-based ASR. However, the effectiveness of this method has not been demonstrated for various model architectures, nor has it been verified that the encoder has the expected look-ahead capability to reduce latency. This study, therefore, examines the effectiveness of Mask-CTCbased pre-training for models with different architectures, such as Transformer-Transducer and contextual block streaming ASR. We also discuss the effect of the proposed pre-training method on obtaining accurate output spike timing., Comment: Accepted to EUSIPCO 2023
- Published
- 2023