Enabling One-Size-Fits-All Compilation Optimization for Inference Across Machine Learning Computers

Authors :
Wen, Yuanbo
Guo, Qi
Du, Zidong
Xu, Jianxing
Zhang, Zhenxing
Hu, Xing
Li, Wei
Zhang, Rui
Wang, Chao
Zhou, Xuehai
Chen, Tianshi
Source :
IEEE Transactions on Computers; Sep 2022, Vol. 71 Issue 9, p2313-2326, 14p
Publication Year :
2022

Abstract

Machine Learning Computers (MLCs) with tensor functional units (e.g., NVIDIA's Tensor Core, Google's TPU, and Habana's Tensor Processor Core) have proliferated in recent years. The broad diversity of MLCs makes it hard to deploy machine learning workloads with optimized performance. Although deep learning compilers (e.g., TVM) are effective at producing optimized code for different hardware back-ends, deploying to a new MLC still requires tediously implementing platform-specific compilation optimizations, which in turn demands a thorough understanding of system and architectural details. To address this problem, we propose a holistic approach that achieves one-size-fits-all compilation optimization for inference across different MLCs. The key observation is that diverse MLCs share several key architectural characteristics for tensor processing (e.g., tensor primitives and on-chip scratchpad memory), which can be generalized to conduct cross-platform compilation optimizations. Concretely, we propose the Tensor Abstract Machine (TAM), which captures these common architectural characteristics, as the abstraction of a broad range of MLCs. To exploit the architectural characteristics of the TAM, we propose the Tensor Scheduling Language (TSL), consisting of a tensor computation description and tensor scheduling primitives, for implementing operations with portable optimizations. Once tensor operations are implemented in TSL, optimized code for different MLCs can be generated automatically. To validate our proposal, we conduct experiments on three commodity MLCs: a GPU with Tensor Cores, VTA (on an FPGA), and a Cloud TPU. Experimental results demonstrate that code generated from the same optimization schedule achieves 1.05x to 2.05x better performance than hand-tuned libraries and deep learning compilers across the different platforms. [ABSTRACT FROM AUTHOR]
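This record does not include TSL's actual syntax, so the following sketch is a loose analogy only: it uses the TVM tensor-expression API (TVM is named above as a baseline compiler) to show the schedule-driven, computation/schedule-separated style of code generation the abstract describes. All sizes, names, and tiling factors here are illustrative assumptions, not taken from the paper.

# Hypothetical sketch (not the paper's TSL): a tiled matrix multiply
# expressed with TVM's te API, which separates the tensor computation
# description from the scheduling primitives, much as the abstract
# describes for TSL.
import tvm
from tvm import te

M = N = K = 1024  # illustrative problem size, not from the paper

# Tensor computation description: C[i, j] = sum_k A[i, k] * B[k, j]
A = te.placeholder((M, K), name="A")
B = te.placeholder((K, N), name="B")
k = te.reduce_axis((0, K), name="k")
C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

# Scheduling primitives: tile the output so each block could fit the
# on-chip scratchpad / tensor primitive of a hypothetical target.
s = te.create_schedule(C.op)
xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], x_factor=32, y_factor=32)

# Reusing one schedule across back-ends is what yields portable
# optimization; here we simply print the lowered IR to inspect it.
print(tvm.lower(s, [A, B, C], simple_mode=True))

In this style, retargeting a new accelerator means remapping the same scheduling primitives onto that device's tensor units and scratchpad sizes, rather than rewriting the operator by hand.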

Details

Language :
English
ISSN :
0018-9340
Volume :
71
Issue :
9
Database :
Complementary Index
Journal :
IEEE Transactions on Computers
Publication Type :
Academic Journal
Accession Number :
158561785
Full Text :
https://doi.org/10.1109/TC.2021.3128266