CoExe: An Efficient Co-execution Architecture for Real-Time Neural Network Services
- Authors
- Chubo Liu, Zihao Zeng, Tao Li, Kenli Li, Keqin Li, Mingcong Song, and Jiechen Zhao
- Subjects
- Applied physics, Queueing theory, Artificial neural network, Dataflow, Computer science, Distributed computing, Workload, Execution time, Computer hardware & architecture, Electrical engineering, Architecture, Latency (engineering)
- Abstract
End-to-end latency is critical for user-interactive neural network (NN) services in the cloud. During periods of high request load, co-locating multiple NN requests can reduce end-to-end latency. However, current batch-based accelerators lack support for request-level parallelism, leaving queuing time unoptimized. Meanwhile, naively partitioning resources among simultaneous requests leads to longer execution times and lower resource efficiency, because different applications occupy separate resources without sharing. To effectively reduce end-to-end latency for real-time NN requests, we propose the CoExe architecture, equipped with a pipelined implementation of a sparsity-driven real-time co-execution model. By leveraging the non-trivial amount of sparse operations that arise during concurrent NN execution, end-to-end latency is reduced by up to 12.3× and 2.4× over an Eyeriss-like accelerator and SCNN, respectively, under peak workload. In addition, we propose the row cross (RC) dataflow to reduce data movement cost and avoid memory duplication.
- Published
- 2020