
DLUX: A LUT-Based Near-Bank Accelerator for Data Center Deep Learning Training Workloads

Authors :
Dimin Niu
Yuan Xie
Peng Gu
Shuangchen Li
Hongzhong Zheng
Xinfeng Xie
Krishna T. Malladi
Source :
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. 40:1586-1599
Publication Year :
2021
Publisher :
Institute of Electrical and Electronics Engineers (IEEE), 2021.

Abstract

The frequent data movement between the processor and the memory has become a severe performance bottleneck for deep neural network (DNN) training workloads in data centers. The 3D-stacked processing-in-memory (3D-PIM) architecture offers a viable solution to this off-chip memory access challenge. However, existing 3D-PIM designs for DNN training suffer from the limited memory bandwidth of the base logic die. To overcome this obstacle, integrating the DNN-related logic near each memory bank is a promising yet challenging approach, since naively implementing the floating-point (FP) unit and the cache in the memory die incurs a large area overhead. To address these problems, we propose DLUX, a high-performance and energy-efficient 3D-PIM accelerator for DNN training based on the near-bank architecture. From the hardware perspective, to support the FP multiplier with low area overhead, we devise an in-DRAM lookup table (LUT) mechanism. We then use a small scratchpad buffer together with a lightweight transformation engine to exploit locality and enable flexible data layouts without an expensive cache. From the software perspective, we split the mapping/scheduling tasks during DNN training into intralayer and interlayer phases. During the intralayer phase, a loop tiling technique customized for 3D-PIM is adopted to maximize data reuse in the LUT buffer and the scratchpad buffer, achieve high concurrency, and reduce data movement among banks. During the interlayer phase, efficient techniques ensure input–output data layout consistency and realize the forward–backward layout transposition. Experimental results show that DLUX reduces the FP32 multiplier area overhead by 60% compared with a direct implementation. End-to-end evaluations show that, compared with a Tesla V100 GPU, DLUX provides a 6.3× speedup and a 42× energy-efficiency improvement on average.
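
The in-DRAM LUT mechanism is not detailed in this abstract, but the general idea of replacing a hardware multiplier array with table lookups can be shown with a minimal sketch. The 4-bit-by-4-bit product table and the nibble decomposition below are generic illustrative assumptions, not DLUX's actual circuit; a full FP32 multiply would additionally handle signs, exponent addition, and rounding.

```python
# Illustrative sketch: LUT-based multiplication via a precomputed
# 4-bit x 4-bit product table (256 entries), combined with shifts and adds.
# This is a textbook decomposition, not the paper's in-DRAM implementation.

LUT_4X4 = [[a * b for b in range(16)] for a in range(16)]

def lut_mul8(x: int, y: int) -> int:
    """Multiply two 8-bit unsigned operands (e.g., mantissa segments)
    using only LUT lookups, shifts, and additions."""
    assert 0 <= x < 256 and 0 <= y < 256
    x_hi, x_lo = x >> 4, x & 0xF
    y_hi, y_lo = y >> 4, y & 0xF
    # Four partial products, each read from the table, then shifted and summed.
    return ((LUT_4X4[x_hi][y_hi] << 8)
            + (LUT_4X4[x_hi][y_lo] << 4)
            + (LUT_4X4[x_lo][y_hi] << 4)
            + LUT_4X4[x_lo][y_lo])

if __name__ == "__main__":
    # Sanity check against the native multiplier.
    for x in (0, 1, 7, 37, 200, 255):
        for y in (0, 1, 3, 99, 128, 255):
            assert lut_mul8(x, y) == x * y
    print("LUT-based 8-bit multiply matches native multiplication.")
```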
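
Likewise, the intralayer loop tiling mentioned above is, at its core, a standard locality optimization. The sketch below shows generic tiling of a matrix multiplication so that small blocks are reused from a scratchpad-sized working set; the tile size and loop order are illustrative assumptions, not the paper's 3D-PIM-specific mapping.

```python
# Illustrative sketch: tiled matrix multiplication C = A @ B.
# Each (tile x tile) block of work reuses a small working set, analogous to
# keeping operands resident in a near-bank scratchpad buffer.

def tiled_matmul(A, B, tile=4):
    n, k, m = len(A), len(A[0]), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Process one block; its inputs fit in a small local buffer.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = C[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += A[i][p] * B[p][j]
                        C[i][j] = acc
    return C
```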

Details

ISSN :
1937-4151 and 0278-0070
Volume :
40
Database :
OpenAIRE
Journal :
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Accession number :
edsair.doi...........ddcb2870c3358b74fe8b53d1501da9b1