
Layer-fused attention model accelerator architecture based on systolic arrays (基于脉动阵列的层融合注意力模型加速器结构).

Authors :
刘晓航
姜晶菲
许金伟
Source :
Computer Engineering & Science / Jisuanji Gongcheng yu Kexue. May 2023, Vol. 45 Issue 5, p802-809. 8p.
Publication Year :
2023

Abstract

The attention mechanism has recently shown superior performance in deep neural networks, but its computation generates complex data flows and incurs high computation and memory overheads, so customized accelerators are needed to optimize inference. This paper proposes an accelerator architecture for attention mechanism computation. A flexible, hardware-controlled partitioning method divides the large matrices in the attention model into hardware-friendly computing blocks, so that the block computation matches the systolic array in the accelerator. A layer-fusion computing structure based on a two-step decomposition of the softmax function is proposed, which effectively reduces the memory accesses of attention computation. A fused-layer attention model accelerator based on fine-grained computational scheduling is designed and implemented in HDL. Performance was evaluated on a Xilinx FPGA device with the HLS tool. Compared with CPU and GPU implementations under the same settings, the accelerator improves latency by 4.91 times and efficiency by 1.24 times. [ABSTRACT FROM AUTHOR]
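The abstract does not spell out the paper's exact two-step softmax decomposition; a common realization of the idea is to split softmax into an un-normalized exponentiation pass that can be fused with the surrounding matrix multiplications block by block, deferring a single normalization step until after the weighted sum with V. The sketch below is a minimal NumPy illustration of that decomposition combined with block partitioning (single head, no mask); the function name, block size, and running-max rescaling are illustrative assumptions, not the paper's actual HDL design.

    import numpy as np

    def blocked_fused_attention(Q, K, V, block=64):
        # Sketch: block-partitioned attention with a two-step softmax.
        # Pass 1 accumulates un-normalized exp scores (with a running
        # row max for numerical stability) so each K/V block is read
        # once; pass 2 applies the deferred normalization.
        n, d = Q.shape
        out = np.zeros_like(Q)
        for i in range(0, n, block):              # one Q block at a time
            q = Q[i:i + block]                    # (b, d)
            m = np.full(q.shape[0], -np.inf)      # running row max
            s = np.zeros(q.shape[0])              # running exp-sum
            acc = np.zeros_like(q)                # un-normalized output
            for j in range(0, n, block):          # stream K/V blocks
                scores = q @ K[j:j + block].T / np.sqrt(d)
                m_new = np.maximum(m, scores.max(axis=1))
                scale = np.exp(m - m_new)         # rescale old partials
                p = np.exp(scores - m_new[:, None])
                s = s * scale + p.sum(axis=1)
                acc = acc * scale[:, None] + p @ V[j:j + block]
                m = m_new
            out[i:i + block] = acc / s[:, None]   # step 2: normalize once
        return out

As a sanity check, the result can be compared against a reference softmax(Q K^T / sqrt(d)) V computed on the full matrices; the blocked version reads each K/V block only once, which is the kind of memory-access reduction the paper's layer-fusion structure targets.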

Details

Language :
Chinese
ISSN :
1007-130X
Volume :
45
Issue :
5
Database :
Academic Search Index
Journal :
Computer Engineering & Science / Jisuanji Gongcheng yu Kexue
Publication Type :
Academic Journal
Accession number :
164361425
Full Text :
https://doi.org/10.3969/j.issn.1007-130X.2023.05.005