Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

Authors :: Amodei, Dario
Anubhai, Rishita
Battenberg, Eric
Case, Carl
Casper, Jared
Catanzaro, Bryan
Chen, Jingdong
Chrzanowski, Mike
Coates, Adam
Diamos, Greg
Elsen, Erich
Engel, Jesse
Fan, Linxi
Fougner, Christopher
Han, Tony
Hannun, Awni
Jun, Billy
LeGresley, Patrick
Lin, Libby
Narang, Sharan
Ng, Andrew
Ozair, Sherjil
Prenger, Ryan
Raiman, Jonathan
Satheesh, Sanjeev
Seetapun, David
Sengupta, Shubho
Wang, Yi
Wang, Zhiqian
Wang, Chong
Xiao, Bo
Yogatama, Dani
Zhan, Jun
Zhu, Zhenyao
Publication Year :: 2015
Abstract :: We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech--two vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a diverse variety of speech including noisy environments, accents and different languages. Key to our approach is our application of HPC techniques, resulting in a 7x speedup over our previous system. Because of this efficiency, experiments that previously took weeks now run in days. This enables us to iterate more quickly to identify superior architectures and algorithms. As a result, in several cases, our system is competitive with the transcription of human workers when benchmarked on standard datasets. Finally, using a technique called Batch Dispatch with GPUs in the data center, we show that our system can be inexpensively deployed in an online setting, delivering low latency when serving users at scale.