1. COEXE: An Efficient Co-execution Architecture for Real-Time Neural Network Services.
- Author
Chubo Liu, Kenli Li, Mingcong Song, Jiechen Zhao, Keqin Li, Tao Li, and Zihao Zeng
- Subjects
ARTIFICIAL intelligence, CLOUD computing, DATA flow computing, ELECTRON accelerators, DATA analysis
- Abstract
User-interactive neural network (NN) services on clouds are sensitive to end-to-end latency. During periods of high request load, co-locating multiple NN requests has the potential to reduce end-to-end latency. However, current batch-based accelerators lack request-level parallelism support, leaving the queuing time non-optimized. Meanwhile, naively partitioning resources among simultaneous requests leads to longer execution time and lower resource efficiency, because different applications use separate resources without sharing. To effectively reduce the end-to-end latency of real-time NN requests, we propose the COEXE architecture, equipped with a pipeline implementation of a sparsity-driven real-time co-execution model. By leveraging the non-trivial amount of sparse operations during concurrent NN execution, the end-to-end latency is decreased by up to 12.3x and 2.4x over Eyeriss-like and SCNN, respectively, at peak workload mode. In addition, we propose a row cross (RC) dataflow to reduce data movement cost and avoid memory duplication. [ABSTRACT FROM AUTHOR]
- Published
2020