Streaming End-to-end Speech Recognition For Mobile Devices

Authors :: He, Yanzhang
Sainath, Tara N.
Prabhavalkar, Rohit
McGraw, Ian
Alvarez, Raziel
Zhao, Ding
Rybach, David
Kannan, Anjuli
Wu, Yonghui
Pang, Ruoming
Liang, Qiao
Bhatia, Deepti
Shangguan, Yuan
Li, Bo
Pundak, Golan
Sim, Khe Chai
Bagby, Tom
Chang, Shuo-yiin
Rao, Kanishka
Gruenstein, Alexander
Publication Year :: 2018
Abstract :: End-to-end (E2E) models, which directly predict output character sequences given input speech, are good candidates for on-device speech recognition. E2E models, however, present numerous challenges: In order to be truly useful, such models must decode speech utterances in a streaming fashion, in real time; they must be robust to the long tail of use cases; they must be able to leverage user-specific context (e.g., contact lists); and above all, they must be extremely accurate. In this work, we describe our efforts at building an E2E speech recognizer using a recurrent neural network transducer. In experimental evaluations, we find that the proposed approach can outperform a conventional CTC-based model in terms of both latency and accuracy in a number of evaluation categories.