
DLUX: A LUT-Based Near-Bank Accelerator for Data Center Deep Learning Training Workloads

Authors :
Dimin Niu
Yuan Xie
Peng Gu
Shuangchen Li
Hongzhong Zheng
Xinfeng Xie
Krishna T. Malladi
Source :
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. 40:1586-1599
Publication Year :
2021
Publisher :
Institute of Electrical and Electronics Engineers (IEEE), 2021.

Abstract

The frequent data movement between the processor and the memory has become a severe performance bottleneck for deep neural network (DNN) training workloads in data centers. The 3D-stacked processing-in-memory (3D-PIM) architecture offers a viable solution to this off-chip memory access challenge. However, existing 3D-PIM designs for DNN training suffer from the limited memory bandwidth of the base logic die. To overcome this obstacle, integrating the DNN-related logic near each memory bank is a promising yet challenging approach, since naively implementing the floating-point (FP) unit and the cache in the memory die incurs a large area overhead. To address these problems, we propose DLUX, a high-performance and energy-efficient 3D-PIM accelerator for DNN training based on the near-bank architecture. From the hardware perspective, to support the FP multiplier with low area overhead, we devise an in-DRAM lookup table (LUT) mechanism. We then use a small scratchpad buffer together with a lightweight transformation engine to exploit locality and enable flexible data layouts without an expensive cache. From the software perspective, we split the mapping/scheduling tasks during DNN training into intralayer and interlayer phases. During the intralayer phase, a loop tiling technique customized for 3D-PIM is adopted to maximize data reuse in the LUT buffer and the scratchpad buffer, achieve high concurrency, and reduce data movement among banks. During the interlayer phase, efficient techniques ensure input–output data layout consistency and realize the forward–backward layout transposition. Experimental results show that DLUX reduces the FP32 multiplier area overhead by 60% compared with a direct implementation. End-to-end evaluations show that, compared with a Tesla V100 GPU, DLUX provides a 6.3× speedup and a 42× energy-efficiency improvement on average.
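
The in-DRAM LUT mechanism is not detailed in this abstract, but the general idea of replacing a hardware multiplier array with table lookups can be shown with a minimal sketch. The 4-bit-by-4-bit product table and the nibble decomposition below are generic illustrative assumptions, not DLUX's actual circuit; a full FP32 multiply would additionally handle signs, exponent addition, and rounding.

```python
# Illustrative sketch: LUT-based multiplication via a precomputed
# 4-bit x 4-bit product table (256 entries), combined with shifts and adds.
# This is a textbook decomposition, not the paper's in-DRAM implementation.

LUT_4X4 = [[a * b for b in range(16)] for a in range(16)]

def lut_mul8(x: int, y: int) -> int:
    """Multiply two 8-bit unsigned operands (e.g., mantissa segments)
    using only LUT lookups, shifts, and additions."""
    assert 0 <= x < 256 and 0 <= y < 256
    x_hi, x_lo = x >> 4, x & 0xF
    y_hi, y_lo = y >> 4, y & 0xF
    # Four partial products, each read from the table, then shifted and summed.
    return ((LUT_4X4[x_hi][y_hi] << 8)
            + (LUT_4X4[x_hi][y_lo] << 4)
            + (LUT_4X4[x_lo][y_hi] << 4)
            + LUT_4X4[x_lo][y_lo])

if __name__ == "__main__":
    # Sanity check against the native multiplier.
    for x in (0, 1, 7, 37, 200, 255):
        for y in (0, 1, 3, 99, 128, 255):
            assert lut_mul8(x, y) == x * y
    print("LUT-based 8-bit multiply matches native multiplication.")
```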
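
Likewise, the intralayer loop tiling mentioned above is, at its core, a standard locality optimization. The sketch below shows generic tiling of a matrix multiplication so that small blocks are reused from a scratchpad-sized working set; the tile size and loop order are illustrative assumptions, not the paper's 3D-PIM-specific mapping.

```python
# Illustrative sketch: tiled matrix multiplication C = A @ B.
# Each (tile x tile) block of work reuses a small working set, analogous to
# keeping operands resident in a near-bank scratchpad buffer.

def tiled_matmul(A, B, tile=4):
    n, k, m = len(A), len(A[0]), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Process one block; its inputs fit in a small local buffer.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = C[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += A[i][p] * B[p][j]
                        C[i][j] = acc
    return C
```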

Details

ISSN :
1937-4151 and 0278-0070
Volume :
40
Database :
OpenAIRE
Journal :
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Accession number :
edsair.doi...........ddcb2870c3358b74fe8b53d1501da9b1