140 results for "Tu, Fengbin"
Search Results
1. Towards Efficient Control Flow Handling in Spatial Architecture via Architecting the Control Flow Plane
2. DyBit: Dynamic Bit-Precision Numbers for Efficient Quantized Neural Network Inference
3. SWG: An Architecture for Sparse Weight Gradient Computation
4. Alleviating Datapath Conflicts and Design Centralization in Graph Analytics Acceleration
5. H2Learn: High-Efficiency Learning Accelerator for High-Accuracy Spiking Neural Networks
6. Towards Efficient Generative AI and Beyond-AI Computing: New Trends on ISSCC 2024 Machine Learning Accelerators
7. 20.2 A 28nm 74.34TFLOPS/W BF16 Heterogeneous CIM-Based Accelerator Exploiting Denoising-Similarity for Diffusion Models
8. 15.1 A 0.795fJ/bit Physically-Unclonable Function-Protected TCAM for a Software-Defined Networking Switch
9. AdaP-CIM: Compute-in-Memory Based Neural Network Accelerator Using Adaptive Posit
10. PIM-HLS: An Automatic Hardware Generation Tool for Heterogeneous Processing-In-Memory-Based Neural Network Accelerators
11. AutoDCIM: An Automated Digital CIM Compiler
12. ECSSD: Hardware/Data Layout Co-Designed In-Storage-Computing Architecture for Extreme Classification
13. STAR: An STGCN ARchitecture for Skeleton-Based Human Action Recognition
14. SPG: Structure-Private Graph Database via SqueezePIR
15. Reconfigurability, Why It Matters in AI Tasks Processing: A Survey of Reconfigurable AI Chips
16. 16.1 MulTCIM: A 28nm 2.24µJ/Token Attention-Token-Bit Hybrid Sparse Digital CIM-Based Accelerator for Multimodal Transformers
17. 16.4 TensorCIM: A 28nm 3.7nJ/Gather and 8.3TFLOPS/W FP32 Digital-CIM Tensor Processor for MCM-CIM-Based Beyond-NN Acceleration
18. BIOS: A 40nm Bionic Sensor-Defined 0.47pJ/SOP, 268.7TSOPs/W Configurable Spiking Neuron-in-Memory Processor for Wearable Healthcare
19. MulTCIM: Digital Computing-in-Memory-Based Multimodal Transformer Accelerator With Attention-Token-Bit Hybrid Sparsity
20. SDP: Co-Designing Algorithm, Dataflow, and Architecture for In-SRAM Sparse NN Acceleration
21. ReDCIM: Reconfigurable Digital Computing-in-Memory Processor With Unified FP/INT Pipeline for Cloud AI Acceleration
22. SPCIM: Sparsity-Balanced Practical CIM Accelerator With Optimized Spatial-Temporal Multi-Macro Utilization
23. HDSuper: High-Quality and High Computational Utilization Edge Super-Resolution Accelerator With Hardware-Algorithm Co-Design Techniques
24. GQNA: Generic Quantized DNN Accelerator With Weight-Repetition-Aware Activation Aggregating
25. A 28nm 29.2TFLOPS/W BF16 and 36.5TOPS/W INT8 Reconfigurable Digital CIM Processor with Unified FP/INT Pipeline and Bitwise In-Memory Booth Multiplication for Cloud Deep Learning Acceleration
26. INSPIRE: IN-Storage Private Information REtrieval via Protocol and Architecture Co-design
27. Accelerating Spatiotemporal Supervised Training of Large-Scale Spiking Neural Networks on GPU
28. DOTA: Detect and Omit Weak Attentions for Scalable Transformer Acceleration
29. Dynamic Sparse Attention for Scalable Transformer Acceleration
30. A 28nm 15.59µJ/Token Full-Digital Bitline-Transpose CIM-Based Sparse Transformer Accelerator with Pipeline/Parallel Reconfigurable Modes
Discovery Service for Jio Institute Digital Library