85 results for "Tu, Fengbin"
Search Results
2. Towards Efficient Control Flow Handling in Spatial Architecture via Architecting the Control Flow Plane
- Author
-
Deng, Jinyi, Tang, Xinru, Zhang, Jiahao, Li, Yuxuan, Zhang, Linyun, Han, Boxiao, He, Hongjun, Tu, Fengbin, Liu, Leibo, Wei, Shaojun, Hu, Yang, and Yin, Shouyi
- Subjects
Computer Science - Hardware Architecture, C.1.3, F.1.2
- Abstract
Spatial architecture is a high-performance architecture that uses control flow graphs and data flow graphs as the computational model and producer/consumer models as the execution model. However, existing spatial architectures suffer from control flow handling challenges. Upon categorizing their PE execution models, we find that they lack autonomous, peer-to-peer, and temporally loosely-coupled control flow handling capability, which leads to limited performance in control-intensive programs. We propose Marionette, a spatial architecture with an explicitly designed control flow plane. The Control Flow Plane enables autonomous, peer-to-peer, and temporally loosely-coupled control flow handling. The Proactive PE Configuration ensures timely, computation-overlapped configuration to improve the handling of Branch Divergence. The Agile PE Assignment enhances the pipeline performance of Imperfect Loops. We develop the full stack of Marionette (ISA, compiler, simulator, RTL) and demonstrate that, on a variety of challenging control-intensive programs, Marionette outperforms the state-of-the-art spatial architectures Softbrain, TIA, REVEL, and RipTide by geomean 2.88x, 3.38x, 1.55x, and 2.66x, respectively.
- Published
- 2023
- Full Text
- View/download PDF
3. DyBit: Dynamic Bit-Precision Numbers for Efficient Quantized Neural Network Inference
- Author
-
Zhou, Jiajun, Wu, Jiajun, Gao, Yizhao, Ding, Yuhao, Tao, Chaofan, Li, Boyu, Tu, Fengbin, Cheng, Kwang-Ting, So, Hayden Kwok-Hay, and Wong, Ngai
- Subjects
Computer Science - Machine Learning
- Abstract
To accelerate the inference of deep neural networks (DNNs), quantization with low-bitwidth numbers is actively researched. A prominent challenge is to quantize DNN models into low-bitwidth numbers without significant accuracy degradation, especially at very low bitwidths (< 8 bits). This work targets an adaptive data representation with variable-length encoding called DyBit. DyBit can dynamically adjust the precision and range of its separate bit-fields to adapt to the distribution of DNN weights/activations. We also propose a hardware-aware quantization framework with a mixed-precision accelerator to trade off inference accuracy and speedup. Experimental results demonstrate that the inference accuracy with DyBit is 1.997% higher than the state-of-the-art at 4-bit quantization, and the proposed framework can achieve up to 8.1x speedup compared with the original model. (An illustrative sketch of the bit-field idea follows this entry.)
- Published
- 2023
- Full Text
- View/download PDF
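- Sketch
The bit-field idea behind formats like DyBit can be illustrated with a toy float quantizer: split a fixed bit budget between exponent and mantissa, and pick the split that minimizes error on the tensor at hand. Everything below (function names, the MSE criterion, the rounding scheme) is an illustrative assumption, not DyBit's actual encoding.

import numpy as np

def quantize_minifloat(x, e_bits, m_bits):
    """Round x onto a toy (sign, e_bits, m_bits) floating-point grid."""
    bias = 2 ** (e_bits - 1) - 1
    sign = np.sign(x)
    mag = np.maximum(np.abs(x), np.finfo(np.float64).tiny)
    exp = np.clip(np.floor(np.log2(mag)), -bias + 1, bias)
    frac = mag / 2.0 ** exp                                # mantissa in [1, 2)
    frac = np.round(frac * 2 ** m_bits) / 2 ** m_bits      # round the mantissa
    return sign * frac * 2.0 ** exp

def best_split(x, total_bits=6):
    """Choose the exponent/mantissa split with the lowest MSE for this tensor."""
    splits = [(e, total_bits - 1 - e) for e in range(2, total_bits - 1)]
    return min(splits, key=lambda s: np.mean((x - quantize_minifloat(x, *s)) ** 2))

w = np.random.randn(1024) * 0.1          # toy weight distribution
print("best (e_bits, m_bits) for a 6-bit budget:", best_split(w))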
4. Alleviating Datapath Conflicts and Design Centralization in Graph Analytics Acceleration
- Author
-
Lin, Haiyang, Yan, Mingyu, Wang, Duo, Zou, Mo, Tu, Fengbin, Ye, Xiaochun, Fan, Dongrui, and Xie, Yuan
- Subjects
Computer Science - Hardware Architecture, Computer Science - Distributed, Parallel, and Cluster Computing
- Abstract
Previous graph analytics accelerators have achieved great throughput improvements by alleviating irregular off-chip memory accesses. However, on-chip datapath conflicts and design centralization have become the critical issues hindering further throughput improvement. In this paper, a general solution, the Multiple-stage Decentralized Propagation network (MDP-network), is proposed to address these issues, inspired by the key idea of trading latency for throughput. Besides, a novel high-throughput graph analytics accelerator, HiGraph, is proposed by deploying MDP-network to address each issue in practice. Experiments show that, compared with the state-of-the-art accelerator, HiGraph achieves up to 2.2x speedup (1.5x on average) as well as better scalability. Comment: To appear in the 59th Design Automation Conference (DAC 2022).
- Published
- 2022
5. H2Learn: High-Efficiency Learning Accelerator for High-Accuracy Spiking Neural Networks
- Author
-
Liang, Ling, Qu, Zheng, Chen, Zhaodong, Tu, Fengbin, Wu, Yujie, Deng, Lei, Li, Guoqi, Li, Peng, and Xie, Yuan
- Subjects
Computer Science - Neural and Evolutionary Computing - Abstract
Although spiking neural networks (SNNs) benefit from bio-plausible neural modeling, their low accuracy under the common local synaptic plasticity learning rules limits their application in many practical tasks. Recently, an emerging SNN supervised learning algorithm inspired by backpropagation through time (BPTT) from the domain of artificial neural networks (ANNs) has successfully boosted the accuracy of SNNs and helped improve their practicability. However, current general-purpose processors suffer from low efficiency when performing BPTT for SNNs due to their ANN-tailored optimizations. On the other hand, current neuromorphic chips cannot support BPTT because they mainly adopt local synaptic plasticity rules for simplified implementation. In this work, we propose H2Learn, a novel architecture that achieves high efficiency for BPTT-based SNN learning while ensuring high SNN accuracy. We first characterize the behaviors of BPTT-based SNN learning. Benefiting from the binary spike-based computation in the forward pass and the weight update, we design lookup table (LUT) based processing elements in the Forward Engine and Weight Update Engine to make accumulations implicit and to fuse the computations of multiple input points (a toy LUT sketch follows this entry). Second, benefiting from the rich sparsity in the backward pass, we design a dual-sparsity-aware Backward Engine which exploits both input and output sparsity. Finally, we apply a pipeline optimization between the different engines to build an end-to-end solution for BPTT-based SNN learning. Compared with the modern NVIDIA V100 GPU, H2Learn achieves 7.38x area saving, 5.74-10.20x speedup, and 5.25-7.12x energy saving on several benchmark datasets.
- Published
- 2021
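- Sketch
A minimal rendition of the LUT trick described above: because spikes are binary, the weighted sum over a group of k inputs takes only 2^k possible values, so it can be precomputed once per weight group and fetched by the packed spike pattern. This is a paraphrase of the general technique in plain Python, not H2Learn's engine; the group size k and helper names are assumptions.

import itertools
import numpy as np

def build_lut(w_group):
    """Precompute the partial sum for every possible spike pattern of this group."""
    k = len(w_group)
    lut = np.zeros(2 ** k)
    for bits in itertools.product((0, 1), repeat=k):
        lut[int("".join(map(str, bits)), 2)] = np.dot(bits, w_group)
    return lut

def lut_dot(spikes, weights, k=4):
    """Dot product of a binary spike vector with weights, one lookup per group."""
    total = 0.0
    for g in range(0, len(weights), k):
        lut = build_lut(weights[g:g + k])                  # reused across time steps
        idx = int("".join(str(s) for s in spikes[g:g + k]), 2)
        total += lut[idx]                                  # accumulation becomes a lookup
    return total

w = np.random.randn(16)
s = np.random.randint(0, 2, 16)
assert np.isclose(lut_dot(s, w), s @ w)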
6. SWG: an architecture for sparse weight gradient computation
- Author
-
Wu, Weiwei, Tu, Fengbin, Li, Xiangyu, Wei, Shaojun, and Yin, Shouyi
- Abstract
On-device training for deep neural networks (DNN) has become a trend due to various user preferences and scenarios. The DNN training process consists of three phases: feedforward (FF), backpropagation (BP), and weight gradient (WG) update. WG takes about one-third of the computation in the whole training process. Current training accelerators usually ignore the special computation properties of WG and process it in a way similar to FF/BP. Besides, the extensive data sparsity in WG, which brings opportunities to save computation, is not well explored. Exploiting these optimization opportunities, however, meets three underutilization problems, caused by (1) the mismatch between WG data dimensions and hardware parallelism, (2) the full sparsity, i.e., the sparsity of the feature map (Fmap), error map (Emap), and gradient, and (3) the workload imbalance resulting from irregular sparsity. In this paper, we propose a specific architecture for sparse weight gradient (SWG) computation. The architecture is designed around a hierarchical unrolling and sparsity-aware (HUSA) dataflow to exploit the optimization opportunities of the special computation properties and full data sparsity. In the HUSA dataflow, the data dimensions are unrolled hierarchically on the hardware architecture. A valid-data trace (VDT) mechanism is embedded in the dataflow to avoid the underutilization caused by the two-sided input sparsity (a scalar sketch of this idea follows this entry). The gradient is unrolled in the PE to alleviate the underutilization induced by output sparsity while maintaining data reuse opportunities. Besides, we design an intra- and inter-column balancer (IIBLC) to dynamically tackle the workload imbalance problem resulting from the irregular sparsity. Experimental results show that, with the HUSA dataflow exploiting the full sparsity, SWG achieves a speedup of 12.23x over the state-of-the-art gradient computation architecture, TrainWare. SWG also helps to improve the energy efficiency of the state-of-the-art training accelerator.
- Published
- 2024
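- Sketch
A scalar view of the two-sided sparsity above: each weight-gradient element accumulates feature-map x error-map products, so a zero on either side lets the MAC be skipped. The toy code below (a dense layer with a numpy reference check) is for intuition only; the assumed shapes and skipping loops stand in for HUSA's hierarchical unrolling and the VDT mechanism, which are not reproduced here.

import numpy as np

def sparse_weight_grad(fmap, emap):
    """grad_W = fmap^T @ emap, visiting only (nonzero, nonzero) operand pairs.

    fmap: (N, Cin) activations; emap: (N, Cout) back-propagated errors.
    """
    grad = np.zeros((fmap.shape[1], emap.shape[1]))
    for s in range(fmap.shape[0]):
        f_nz = np.nonzero(fmap[s])[0]          # skip zero activations
        e_nz = np.nonzero(emap[s])[0]          # skip zero errors
        for i in f_nz:
            for j in e_nz:                     # only valid pairs reach the MAC
                grad[i, j] += fmap[s, i] * emap[s, j]
    return grad

f = np.random.randn(8, 16) * (np.random.rand(8, 16) > 0.7)   # ~70% zeros
e = np.random.randn(8, 32) * (np.random.rand(8, 32) > 0.7)
assert np.allclose(sparse_weight_grad(f, e), f.T @ e)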
7. AdaP-CIM: Compute-in-Memory Based Neural Network Accelerator using Adaptive Posit
- Author
-
He, Jingyu, Tu, Fengbin, Cheng, Kwang Ting, and Tsui, Chi Ying
- Abstract
This study proposes two novel approaches to address memory wall issues in AI accelerator designs for large neural networks. The first approach introduces a new format called adaptive Posit (AdaP) with two exponent encoding schemes that dynamically extend the dynamic range of its representation at run time with minimal hardware overhead. The second approach proposes using compute-in-memory (CIM) with speculative input alignment (SAU) to implement the AdaP multiply-and-accumulate (MAC) computation, significantly reducing the delay, area, and power consumption for the max exponent computation. The proposed approaches outperform state-of-the-art quantization methods and achieve significant energy and area efficiency improvements.
- Published
- 2024
8. STAR: An STGCN ARchitecture for Skeleton-Based Human Action Recognition
- Author
-
Wu, Weiwei, Tu, Fengbin, Niu, Mengqi, Yue, Zhiheng, Liu, Leibo, Wei, Shaojun, Li, Xiangyu, Hu, Yang, and Yin, Shouyi
- Published
- 2023
- Full Text
- View/download PDF
9. Reconfigurability, Why It Matters in AI Tasks Processing: A Survey of Reconfigurable AI Chips
- Author
-
Wei, Shaojun, Lin, Xinhan, Tu, Fengbin, Wang, Yang, Liu, Leibo, and Yin, Shouyi
- Published
- 2023
- Full Text
- View/download PDF
10. BIOS: A 40nm Bionic Sensor-defined 0.47pJ/SOP, 268.7TSOPs/W Configurable Spiking Neuron-in-Memory Processor for Wearable Healthcare
- Author
-
Tian, Fengshi, Wang, Xiaomeng, Chen, Jinbo, Zheng, Jiakun, Wu, Hui, Liu, Xuejiao, Tu, Fengbin, Yang, Jie, Sawan, Mohamad, Tsui, Chi-Ying, and Cheng, Kwang Ting
- Abstract
This work presents the first configurable spiking neuron-in-memory (BIOS) processor that leverages the characteristics of bionic sensors to enable ultra-efficient wearable healthcare applications. The BIOS processor offers four key features: 1) a sensor-defined architecture that supports level-crossing sampling and sparse processing; 2) a spike-triggered neural circuit that saves processing energy; 3) high robustness for spike operations (SOPs) enabled by current-based, instead of charge-based, in-memory integration; and 4) a configurable neuron-in-memory cell array that supports various network models and firing threshold values. Using a 5-bit analog-to-spike converter (ASC), the proposed BIOS processor achieves state-of-the-art energy efficiency of 0.47pJ/SOP, 0.48uJ/inference, and 268.7TSOPs/W with 95.31% accuracy for arrhythmia detection on the MIT-BIH dataset. These results compare favorably in accuracy, efficiency, and overall FoM with recent works.
- Published
- 2023
11. AutoDCIM: An Automated Digital CIM Compiler
- Author
-
Chen, Jia, Tu, Fengbin, Shao, Kunming, Tian, Fengshi, Huo, Xiao, Tsui, Chi-Ying, and Cheng, Kwang Ting
- Abstract
Digital Computing-in-Memory (DCIM) is an emerging architecture that integrates digital logic into memory for efficient AI computing. However, current DCIM designs heavily rely on manual efforts. This increases DCIM design time and limits the optimization space, making it challenging to satisfy the user specifications of diverse AI applications. This paper presents AutoDCIM, the first automated DCIM compiler. AutoDCIM takes the user specifications as inputs and generates a DCIM macro architecture with an optimized layout. AutoDCIM's template-based generation balances handcrafted cell design and agile macro development. AutoDCIM's layout exploration loop analyzes diverse DCIM array partitioning schemes to satisfy user specifications. The auto-generated DCIM macros present competitive efficiency results in comparison with state-of-the-art silicon-verified DCIM macros.
- Published
- 2023
12. PIM-HLS: An Automatic Hardware Generation Tool for Heterogeneous Processing-In-Memory-based Neural Network Accelerators
- Author
-
Zhu, Yu, Zhu, Zhenhua, Dai, Guohao, Tu, Fengbin, Sun, Hanbo, Cheng, Kwang Ting, Yang, Huazhong, and Wang, Yu
- Abstract
Processing-in-memory (PIM) architectures have shown great capability for neural network (NN) acceleration on edge devices that demand low latency under severe area constraints. Heterogeneous PIM architectures combining different PIM implementation approaches, such as RRAM-based PIM and SRAM-based PIM, can further improve performance. However, the automatic generation of heterogeneous PIM architectures faces two unresolved problems. First, existing work has not considered the design of heterogeneous PIM-based NN accelerators with multiple memory technologies. Second, for PIM with insufficient memory on edge devices, it is challenging to find the optimal runtime weight scheduling strategy in an O(L!) optimization space for an NN with L layers. In this paper, we propose PIM-HLS, an automatic hardware generation tool for heterogeneous PIM-based NN accelerators. Aiming at the problems above, we first point out that heterogeneous PIM can improve performance under severe area constraints. Then we optimize the architecture for each NN layer by taking advantage of the different memory technologies. We also define the optimization problem of runtime weight scheduling and mapping for the first time, and propose a dynamic-programming-based weight scheduling algorithm that reduces the optimization space to O(L^2) (an illustrative DP sketch follows this entry). We implement PIM-HLS to automatically generate the hardware code and instructions. Results show that we achieve an average 5.9× speedup with 72.8% less area compared with state-of-the-art PIM designs.
- Published
- 2023
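- Sketch
The O(L!)-to-O(L^2) reduction quoted above is characteristic of a segmentation-style dynamic program. The sketch below is a stand-in under assumed semantics (layers grouped into consecutive, capacity-bounded weight-resident segments; cost = data reloaded), not the paper's actual cost model: dp(i) is the best cost of scheduling the first i layers, giving O(L^2) subproblems and transitions instead of enumerating orderings.

from functools import lru_cache

weights = [120, 80, 200, 60, 90, 150]   # hypothetical per-layer weight sizes (KB)
CAPACITY = 320                          # hypothetical on-chip weight budget (KB)

def group_cost(i, j):
    """Cost of keeping layers i..j-1 resident as one group (inf if it overflows)."""
    size = sum(weights[i:j])
    return float(size) if size <= CAPACITY else float("inf")

@lru_cache(maxsize=None)
def dp(i):
    """Minimum total reload cost for the first i layers."""
    if i == 0:
        return 0.0
    return min(dp(j) + group_cost(j, i) for j in range(i))

print("min reload cost:", dp(len(weights)))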
13. MulTCIM: Digital Computing-in-Memory-Based Multimodal Transformer Accelerator With Attention-Token-Bit Hybrid Sparsity
- Author
-
Tu, Fengbin, Wu, Zihan, Wang, Yiqi, Wu, Weiwei, Liu, Leibo, Hu, Yang, Wei, Shaojun, and Yin, Shouyi
- Abstract
Multimodal Transformers are emerging artificial intelligence (AI) models that comprehend a mixture of signals from different modalities like vision, natural language, and speech. Their attention mechanism and massive matrix multiplications (MMs) cause high latency and energy consumption. Prior work has shown that a digital computing-in-memory (CIM) network can be an efficient architecture for processing Transformers while maintaining high accuracy. To further improve energy efficiency, the attention-token-bit hybrid sparsity in multimodal Transformers can be exploited. The hybrid sparsity significantly reduces computation, but its irregularity also harms CIM utilization. To fully utilize the attention-token-bit hybrid sparsity of multimodal Transformers, we design a digital CIM-based accelerator called MulTCIM with three corresponding features: long reuse elimination dynamically reshapes the attention pattern to improve CIM utilization; a runtime token pruner (RTP) removes insignificant tokens, and a modal-adaptive CIM network (MACN) exploits symmetric modal overlapping to reduce CIM idleness; and an effective bitwidth-balanced CIM (EBB-CIM) macro balances input bits across in-memory multiply-accumulations (MACs) to reduce computation time. The fabricated MulTCIM consumes only 2.24 µJ/token for the ViLBERT-base model, achieving 2.50x-5.91x lower energy than previous Transformer accelerators and digital CIM accelerators.
- Published
- 2023
14. SDP: Co-Designing Algorithm, Dataflow, and Architecture for In-SRAM Sparse NN Acceleration
- Author
-
Tu, Fengbin, Wang, Yiqi, Liang, Ling, Ding, Yufei, Liu, Leibo, Wei, Shaojun, Yin, Shouyi, and Xie, Yuan
- Abstract
Processing-in-memory (PIM) is a promising architecture for neural network (NN) acceleration. Most previous PIMs are based on analog computing, so their accuracy and memory cell array utilization are limited by analog deviation and ADC overhead. Digital PIM is an emerging type of PIM architecture that integrates digital logic into memory cells, which can fully utilize the cell array without accuracy loss. However, digital PIM's rigid crossbar architecture and full array activation raise new challenges for sparse NN acceleration. Conventional unstructured or structured sparsity cannot perform well on either the weight or the input side of digital PIM. We take the opportunities offered by digital PIM's bit-serial processing and in-memory customization to tackle the above challenges by co-designing the sparse algorithm, multiplication dataflow, and PIM architecture. At the algorithm level, we propose double-broadcast hybrid-grained pruning to exploit weight sparsity with a better accuracy-efficiency balance. At the dataflow level, we propose a bit-serial Booth in-SRAM multiplication dataflow for stable acceleration from the input side (a plain radix-4 Booth sketch follows this entry). At the architecture level, we design a sparse digital PIM (SDP) accelerator with customized SRAM-PIM macros to support the proposed techniques. SDP achieves 3.59x, 8.15x, and 3.11x area efficiency, and 6.95x, 29.44x, and 39.40x energy savings, over the state-of-the-art sparse NN architectures SIGMA, SRE, and Bit Prudent.
- Published
- 2023
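- Sketch
As background for the "bit-serial Booth in-SRAM multiplication dataflow", here is plain radix-4 Booth recoding with a serial shift-and-add loop: two multiplier bits are retired per step, and each step adds only 0, ±w, or ±2w. This shows the standard arithmetic identity only; SDP's in-memory dataflow and sparsity handling are not modeled.

def booth_digits(x, bits=8):
    """Radix-4 Booth recoding of a signed integer into digits in {-2..2}."""
    u = x & ((1 << bits) - 1)                      # two's-complement bit view
    b = [0] + [(u >> i) & 1 for i in range(bits)]  # pad a 0 below the LSB
    return [b[i] + b[i + 1] - 2 * b[i + 2] for i in range(0, bits, 2)]

def booth_multiply(x, w, bits=8):
    """Serial multiply: one recoded digit (two multiplier bits) per step."""
    acc = 0
    for k, d in enumerate(booth_digits(x, bits)):
        acc += (d * w) << (2 * k)                  # add 0, ±w, or ±2w, shifted
    return acc

for x in range(-128, 128):
    for w in (-77, 0, 1, 55):
        assert booth_multiply(x, w) == x * w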
15. ECSSD: Hardware/Data Layout Co-Designed In-Storage-Computing Architecture for Extreme Classification
- Author
-
Li, Siqi, Tu, Fengbin, Liu, Liu, Lin, Jilan, Wang, Zheng, Kang, Yangwook, Ding, Yufei, and Xie, Yuan
- Abstract
With the rapid growth of classification scale in deep learning systems, the final classification layer becomes extreme classification, with a memory footprint exceeding the main memory capacity of the CPU or GPU. The emerging in-storage-computing technique offers an opportunity, since SSDs have enough storage capacity for the parameters of extreme classification. However, the limited performance of naive in-storage-computing schemes is insufficient to support the heavy workload of extreme classification. To this end, we propose ECSSD, the first hardware/data layout co-designed in-storage-computing architecture for extreme classification, based on an approximate screening algorithm. We propose an alignment-free floating-point MAC circuit technique to improve the computational ability under the limited area budget of in-storage-computing schemes, so that the computational ability can match the SSD's high internal bandwidth. We present a heterogeneous data layout design for the 4/32-bit weight data in the approximate screening algorithm to avoid data transfer interference and further utilize the internal DRAM bandwidth of the SSD. Moreover, we propose a learning-based adaptive interleaving framework to balance the access workload in each flash channel and improve channel-level bandwidth utilization. Putting them together, our ECSSD achieves 3.24-49.87x performance improvements compared with state-of-the-art baselines.
- Published
- 2023
16. ReDCIM: Reconfigurable Digital Computing-in-Memory Processor with Unified FP/INT Pipeline for Cloud AI Acceleration
- Author
-
Tu, Fengbin, Wang, Yiqi, Wu, Zihan, Liang, Ling, Ding, Yufei, Kim, Bongjin, Liu, Leibo, Wei, Shaojun, Xie, Yuan, and Yin, Shouyi
- Abstract
Cloud AI acceleration has drawn great attention in recent years, as big models become a popular trend in deep learning. Cloud AI runs high-efficiency inference, high-accuracy inference, and training, demanding flexible floating-point (FP)/integer (INT) multiply-accumulation (MAC) support. Many computing-in-memory (CIM) processors have been proposed for efficient AI acceleration, but they usually rely on analog CIM techniques that only suit high-efficiency neural network (NN) inference with low-precision INT MAC support. Since cloud AI demands high efficiency, high accuracy, and high flexibility simultaneously, we propose ReDCIM, a reconfigurable digital CIM architecture that meets all three requirements. ReDCIM is the first CIM-based cloud AI processor; it constructs a unified FP/INT pipeline architecture based on exponent pre-alignment and reconfigurable in-memory accumulation (a toy pre-alignment sketch follows this entry). Bitwise in-memory Booth multiplication is proposed to further reduce computation on CIM. The fabricated ReDCIM chip achieves a state-of-the-art energy efficiency of 29.2 TFLOPS/W at BF16 and 36.5 TOPS/W at INT8.
- Published
- 2023
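- Sketch
A toy Python rendition of the exponent pre-alignment idea: take the max exponent over a group of products, shift every mantissa onto that common grid, then reduce with pure integer adds (the CIM-friendly part). The fixed-point width and rounding below are arbitrary illustrative choices, not ReDCIM's mantissa pipeline.

import numpy as np

def prealigned_dot(a, b, frac_bits=24):
    """FP dot product via exponent pre-alignment + integer accumulation."""
    prods = [np.frexp(float(x) * float(y)) for x, y in zip(a, b)]  # (mantissa, exp)
    e_max = max(e for _, e in prods)
    acc = 0
    for m, e in prods:
        acc += int(round(m * 2 ** frac_bits)) >> (e_max - e)       # align, then add
    return acc * 2.0 ** (e_max - frac_bits)

a = np.random.randn(64).astype(np.float32)
b = np.random.randn(64).astype(np.float32)
print(prealigned_dot(a, b), float(a @ b))    # equal up to alignment rounding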
17. 16.4 TensorCIM: A 28nm 3.7nJ/Gather and 8.3TFLOPS/W FP32 Digital-CIM Tensor Processor for MCM-CIM-Based Beyond-NN Acceleration
- Author
-
Tu, Fengbin, Wang, Yiqi, Wu, Zihan, Wu, Weiwei, Liu, Leibo, Hu, Yang, Wei, Shaojun, and Yin, Shouyi
- Abstract
Applications such as Graph Convolutional Networks (GCNs) and Deep Learning Recommendation Models (DLRMs) have computational and data-movement requirements beyond those seen in typical NN processing. Such beyond-NN applications typically consist of Sparse Gathering (SpG) and Sparse Algebra (SpA). SpG comprises gathering and reducing tensors from sparsely distributed addresses (in GCN's aggregation phase and DLRM's embedding layer). SpA refers to NN-based sparse tensor multiplication for the gathered tensors (in GCN's combination phase and DLRM's fully-connected layer). Due to the large application size, data movement is the main bottleneck for beyond-NN acceleration. Digital Computing-In-Memory (CIM) is an efficient and precise architecture for reducing data movement [1-3]. Large-scale beyond-NN acceleration motivates the demand for scaling out digital CIM processors. However, a large monolithic chip has low-yield issues due to manufacturing defects [4], which are more severe for CIM's memory-intensive logic. A Multi-Chip-Module (MCM) provides a high-yield solution for CIM scaling by integrating multiple smaller chiplets in one package [5]. Fig. 16.4.1 shows a typical MCM-CIM system with 4 CIM chiplets, but it has two challenges for beyond-NN acceleration: 1) SpG involves repeated off-chip DRAM access, inter-chiplet access and redundant reduction operations, which increases inter-chiplet bandwidth requirements and processing latency. 2) SpA suffers from (2a) inter-CIM workload imbalance and (2b) intra-CIM under-utilization, due to irregular tensor sparsity.
- Published
- 2023
18. SPCIM: Sparsity-Balanced Practical CIM Accelerator with Optimized Spatial-Temporal Multi-Macro Utilization
- Author
-
Wang, Yiqi, Tu, Fengbin, Liu, Leibo, Wei, Shaojun, Xie, Yuan, and Yin, Shouyi
- Abstract
Compute-in-memory (CIM) is a promising technique that reduces data movement in neural network (NN) acceleration. To achieve higher efficiency, some recent CIM accelerators exploit NN sparsity based on CIM's small-grained operation unit (OU) feature. However, new problems arise in a practical multi-macro accelerator: the mismatch between workload parallelism and CIM macro organization causes spatial under-utilization, and the macros' differing computation times lead to temporal under-utilization. To solve the under-utilization problems, we propose a Sparsity-balanced Practical CIM accelerator (SPCIM), including optimized dataflow and hardware architecture design. For the CIM dataflow design, we first propose a reconfigurable cluster topology for CIM macro organization. Then we regularize weight sparsity in the OU-height pattern and reorder the weight matrix based on the sparsity ratio. The cluster topology can be reshaped to match workload parallelism for higher spatial utilization, and each CIM cluster's workload is dynamically rebalanced for higher temporal utilization. Our hardware architecture supports the proposed dataflow with a spatial input dispatcher and a temporal workload allocator. Experimental results show that, compared with a baseline sparse CIM accelerator that suffers from spatial and temporal under-utilization, SPCIM achieves 2.94x speedup and 2.86x energy saving. The proposed sparsity-balanced dataflow and architecture are generic and scalable, and can be applied to other CIM accelerators. We strengthen two state-of-the-art CIM accelerators with the SPCIM techniques, improving their energy efficiency by 1.92x and 5.59x, respectively.
- Published
- 2023
19. STAR: An STGCN ARchitecture for Skeleton-Based Human Action Recognition
- Author
-
Wu, Weiwei, Tu, Fengbin, Niu, Mengqi, Yue, Zhiheng, Liu, Leibo, Wei, Shaojun, Li, Xiangyu, Hu, Yang, and Yin, Shouyi
- Abstract
Skeleton-based human action recognition (HAR) has drawn increasing attention recently. As an emerging approach for skeleton-based HAR tasks, the Spatial-Temporal Graph Convolution Network (STGCN) achieves remarkable performance by fully exploiting skeleton topology information via graph convolution. Unfortunately, existing GCN accelerators lose efficiency when processing STGCN models due to two limitations. To overcome these limitations, this paper proposes STAR, an STGCN architecture for skeleton-based human action recognition. STAR is designed based on the characteristics of the different computation phases in STGCN. For limitation (1), a spatial-temporal dimension consistent (STDC) dataflow is proposed to fully exploit the data reuse opportunities in all the different dimensions of STGCN. For limitation (2), we propose a node-wise exponent sharing scheme and a temporal-structured redundancy elimination mechanism to exploit the inherent temporal redundancy introduced by STGCN. To further address the under-utilization induced by redundancy elimination, we design a dynamic data scheduler to manage the feature data storage and schedule the features and weights for valid computation in real time. STAR achieves 4.48x, 5.98x, 2.54x, and 103.88x energy savings on average over HyGCN, AWB-GCN, TPU, and the Jetson TX2 GPU.
- Published
- 2023
20. Reconfigurability, Why It Matters in AI Tasks Processing: A Survey of Reconfigurable AI Chips
- Author
-
Wei, Shaojun, Lin, Xinhan, Tu, Fengbin, Wang, Yang, Liu, Leibo, and Yin, Shouyi
- Abstract
Nowadays, artificial intelligence (AI) technologies, especially deep neural networks (DNNs), play a vital role in solving many problems in both academia and industry. To simultaneously meet the demands of performance, energy efficiency, and flexibility in DNN processing, various reconfigurable AI chips have been proposed in the past several years. They are based on FPGA or CGRA platforms and have domain-specific reconfigurability to customize the computing units and data paths for different DNN tasks without re-fabricating the chips. This paper surveys typical reconfigurable AI chips across three reconfiguration hierarchies: processing element level, processing element array level, and chip level. Each reconfiguration hierarchy covers a set of important optimization techniques for DNN computation that are frequently adopted in practice. The paper lists reconfigurable AI chip works in chronological order, discusses the hardware development process for each optimization technique, and analyzes the necessity of reconfigurability in AI task processing. Trends for each reconfiguration hierarchy and insights on combining techniques from different hierarchies are also presented.
- Published
- 2023
21. SPG: Structure-Private Graph Database via SqueezePIR
- Author
-
Liang, Ling, Lin, Jilan, Qu, Zheng, Ahmad, Ishtiyaque, Tu, Fengbin, Gupta, Trinabh, Ding, Yufei, and Xie, Yuan
- Abstract
Many relational data in our daily life are represented as graphs, making graph applications an important workload. Because of the large scale of graph datasets, moving graph data to the cloud has become a popular option. To keep a confidential and private graph secure from an untrusted cloud server, many cryptographic techniques are leveraged to hide the content of the data. However, protecting only the data content is not enough for a graph database, because the structural information of the graph can be revealed through the database access track. In this work, we study the graph neural network (GNN), an important graph workload that mines information from a graph database. We find that the server is able to infer which node is being processed during the edge retrieving phase and also learn its neighbor indices during GNN's aggregation phase, leaking the graph structure information. In this work, we present SPG, a structure-private graph database with SqueezePIR. SPG is built on top of Private Information Retrieval (PIR), which securely hides which nodes/neighbors are accessed. In addition, we propose SqueezePIR, a compression technique to overcome the computation overhead of PIR. Based on our evaluation, SqueezePIR achieves an 11.85× speedup on average with less than 2% accuracy loss when compared to the state-of-the-art FastPIR protocol.
- Published
- 2023
22. DyBit: Dynamic Bit-Precision Numbers for Efficient Quantized Neural Network Inference
- Author
-
Zhou, Jiajun, Wu, Jiajun, Gao, Yizhao, Ding, Yuhao, Tao, Chaofan, Li, Boyu, Tu, Fengbin, Cheng, Kwang-Ting, So, Hayden Kwok-Hay, and Wong, Ngai
- Abstract
To accelerate the inference of deep neural networks (DNNs), quantization with low-bitwidth numbers is actively researched. A prominent challenge is to quantize DNN models into low-bitwidth numbers without significant accuracy degradation, especially at very low bitwidths (< 8 bits). This work targets an adaptive data representation with variable-length encoding called DyBit. DyBit can dynamically adjust the precision and range of its separate bit-fields to adapt to the distribution of DNN weights/activations. We also propose a hardware-aware quantization framework with a mixed-precision accelerator to trade off inference accuracy and speedup. Experimental results demonstrate that the inference accuracy with DyBit is 1.997% higher than the state-of-the-art at 4-bit quantization, and the proposed framework can achieve up to 8.1x speedup compared with the original model.
- Published
- 2024
- Full Text
- View/download PDF
23. HDSuper: High-Quality and High Computational Utilization Edge Super-Resolution Accelerator With Hardware-Algorithm Co-Design Techniques
- Author
-
Zhao, Xin, Chang, Liang, Fan, Dongqi, Hu, Zhicheng, Yue, Ting, Tu, Fengbin, and Zhou, Jun
- Abstract
Super-resolution (SR) techniques have been employed to construct high-definition images from low-quality images. Various neural networks have demonstrated excellent image-reconstruction quality in SR accelerators. However, deploying SR networks on edge devices is limited by the resources and power consumption induced by the large number of algorithm parameters, computation complexity, and external memory accesses. This work explores hardware-algorithm co-design techniques to provide an end-to-end platform with a lightweight super-resolution network (LSR) and an efficient, high-quality SR accelerator, HDSuper. For the algorithm design, improved depth-wise separable convolution and pixel-shuffle layers are developed to reduce network size and computation complexity under the hardware constraints, and improved channel attention (CA) blocks enhance the image reconstruction quality. For the hardware accelerator design, we design a unified computing core (UCC) combined with an efficient flattening-and-allocation (F-A) mapping strategy to support various operators with high computational utilization. In addition, we design a patch computing scheme to reduce the external memory accesses of the hardware architecture. Based on the evaluation, the proposed algorithm achieves high-quality image reconstruction with 37.44 dB, and the reported implementation power figures are 2.08 W and 152 mW.
- Published
- 2024
- Full Text
- View/download PDF
24. DyBit: Dynamic Bit-Precision Numbers for Efficient Quantized Neural Network Inference
- Author
-
Zhou, Jiajun, Wu, Jiajun, Gao, Yizhao, Ding, Yuhao, Tao, Chaofan, Li, Boyu, Tu, Fengbin, Cheng, Kwang-Ting, So, Hayden Kwok-Hay, and Wong, Ngai
- Published
- 2023
- Full Text
- View/download PDF
25. SDP: Co-Designing Algorithm, Dataflow, and Architecture for In-SRAM Sparse NN Acceleration
- Author
-
Tu, Fengbin, Wang, Yiqi, Liang, Ling, Ding, Yufei, Liu, Leibo, Wei, Shaojun, Yin, Shouyi, and Xie, Yuan
- Published
- 2023
- Full Text
- View/download PDF
26. SPCIM: Sparsity-Balanced Practical CIM Accelerator With Optimized Spatial-Temporal Multi-Macro Utilization
- Author
-
Wang, Yiqi, Tu, Fengbin, Liu, Leibo, Wei, Shaojun, Xie, Yuan, and Yin, Shouyi
- Published
- 2023
- Full Text
- View/download PDF
27. H2Learn: High-Efficiency Learning Accelerator for High-Accuracy Spiking Neural Networks
- Author
-
Liang, Ling, Qu, Zheng, Chen, Zhaodong, Tu, Fengbin, Wu, Yujie, Deng, Lei, Li, Guoqi, Li, Peng, and Xie, Yuan
- Published
- 2022
- Full Text
- View/download PDF
28. GQNA: Generic Quantized DNN Accelerator With Weight-Repetition-Aware Activation Aggregating
- Author
-
Yang, Jianxun, Tu, Fengbin, Li, Yixuan, Wang, Yiqi, Liu, Leibo, Wei, Shaojun, and Yin, Shouyi
- Published
- 2022
- Full Text
- View/download PDF
29. A 28nm 29.2TFLOPS/W BF16 and 36.5TOPS/W INT8 Reconfigurable Digital CIM Processor with Unified FP/INT Pipeline and Bitwise In-Memory Booth Multiplication for Cloud Deep Learning Acceleration
- Author
-
Tu, Fengbin, Wang, Yiqi, Wu, Zihan, Liang, Ling, Ding, Yufei, Kim, Bongjin, Liu, Leibo, Wei, Shaojun, Xie, Yuan, and Yin, Shouyi
- Abstract
Many computing-in-memory (CIM) processors have been proposed for edge deep learning (DL) acceleration. They usually rely on analog CIM techniques to achieve high-efficiency NN inference with low-precision INT multiply-accumulation (MAC) support [1]. Different from edge DL, cloud DL has higher accuracy requirements for NN inference and training, which demand extra support for high-precision floating-point (FP) MAC. As shown in Fig. 15.5.1, applying CIM techniques to cloud DL has three main limitations: 1) FP MAC tightly couples exponent alignment with INT mantissa MAC, and implementing complex exponent alignment in memory would harm CIM's direct accumulation structure and reduce efficiency. 2) FP MAC's energy is dominated by the INT mantissa MAC, so further acceleration of CIM-based INT MAC is critical for processor efficiency. 3) Previous cloud DL processors usually have separate FP and INT engines but only activate one engine at a time [2], which causes high area overhead and low resource utilization.
- Published
- 2022
30. INSPIRE: IN-Storage Private Information REtrieval via Protocol and Architecture Co-design
- Author
-
Lin, Jilan, Liang, Ling, Qu, Zheng, Ahmad, Ishtiyaque, Liu, Liu, Tu, Fengbin, Gupta, Trinabh, Ding, Yufei, and Xie, Yuan
- Abstract
Private Information Retrieval (PIR) plays a vital role in secure, database-centric applications. However, existing PIR protocols explore a massive working space containing hundreds of GiBs of query and database data. As a consequence, PIR performance is severely bounded by storage communication, making it far from practical for real-world deployment. In this work, we describe INSPIRE, an accelerator for IN-Storage Private Information REtrieval. INSPIRE follows a protocol and architecture co-design approach. We first design the INSPIRE protocol with a multi-stage filtering mechanism, which achieves a constant PIR query size. For a 1-billion-entry database of size 288GiB, INSPIRE's protocol reduces the query size from 27GiB to 3.6MiB. Further, we propose the INSPIRE hardware, a heterogeneous in-storage architecture, which integrates our protocol across the SSD hierarchy. Together with the INSPIRE protocol, the INSPIRE hardware reduces the query time from 28.4min to 36s, relative to the state-of-the-art FastPIR scheme.
- Published
- 2022
31. Accelerating Spatiotemporal Supervised Training of Large-Scale Spiking Neural Networks on GPU
- Author
-
Liang, Ling, Chen, Zhaodong, Deng, Lei, Tu, Fengbin, Li, Guoqi, and Xie, Yuan
- Abstract
Spiking neural networks (SNNs) have great potential to achieve brain-like intelligence, but they suffer from the low accuracy of conventional synaptic plasticity rules and low training efficiency on GPUs. Recently, emerging backpropagation through time (BPTT) inspired learning algorithms bring new opportunities to boost the accuracy of SNNs, while training on GPUs remains inefficient due to the complex spatiotemporal dynamics and huge memory consumption, which restricts model exploration for SNNs and impedes the advance of neuromorphic computing. In this work, we build a framework to solve the inefficiency of BPTT-based SNN training on modern GPUs. To reduce memory consumption, we optimize the dataflow by saving CONV/FC results only in the forward pass and recomputing other intermediate results in the backward pass (a usage sketch of this recomputation idea follows this entry). Then, we customize kernel functions to accelerate the neural dynamics for all training stages. Finally, we provide a PyTorch interface to make our framework easy to deploy in real systems. Compared to a vanilla PyTorch implementation, our framework achieves up to 2.13x end-to-end speedup and consumes only 0.41x peak memory on the CIFAR10 dataset. Moreover, for distributed training on the large ImageNet dataset, we achieve up to 1.81x end-to-end speedup and consume only 0.38x peak memory.
- Published
- 2022
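- Sketch
The "recompute in the backward pass" dataflow above is in the same family as activation checkpointing, which stock PyTorch exposes via torch.utils.checkpoint. Below is a minimal usage sketch with a made-up surrogate-gradient neuron; it shows the mechanism, not the paper's customized CUDA kernels.

import torch
from torch.utils.checkpoint import checkpoint

class SpikingBlock(torch.nn.Module):
    """Toy conv + surrogate-gradient neuron for BPTT-style SNN training."""
    def __init__(self, channels):
        super().__init__()
        self.conv = torch.nn.Conv2d(channels, channels, 3, padding=1)

    def neuron(self, u):
        # Cheap elementwise dynamics: recomputed in backward instead of stored.
        return torch.sigmoid(10.0 * (u - 0.5))

    def forward(self, x):
        u = self.conv(x)                                  # CONV result is saved
        return checkpoint(self.neuron, u, use_reentrant=False)

x = torch.randn(2, 8, 16, 16, requires_grad=True)
SpikingBlock(8)(x).sum().backward()
print(x.grad.shape)    # gradients flow; neuron activations were recomputed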
32. H2Learn: High-Efficiency Learning Accelerator for High-Accuracy Spiking Neural Networks
- Author
-
Liang, Ling, Qu, Zheng, Chen, Zhaodong, Tu, Fengbin, Wu, Yujie, Deng, Lei, Li, Guoqi, Li, Peng, and Xie, Yuan
- Abstract
Although spiking neural networks (SNNs) benefit from bio-plausible neural modeling, their low accuracy under the common local synaptic plasticity learning rules limits their application in many practical tasks. Recently, an emerging SNN supervised learning algorithm inspired by backpropagation through time (BPTT) from the domain of artificial neural networks (ANNs) has successfully boosted the accuracy of SNNs and helped improve their practicability. However, current general-purpose processors suffer from low efficiency when performing BPTT for SNNs due to their ANN-tailored optimizations. On the other hand, current neuromorphic chips cannot support BPTT because they mainly adopt local synaptic plasticity rules for simplified implementation. In this work, we propose H2Learn, a novel architecture that achieves high efficiency for BPTT-based SNN learning while ensuring high SNN accuracy. We first characterize the behaviors of BPTT-based SNN learning. Benefiting from the binary spike-based computation in the forward pass and weight update, we design look-up table (LUT)-based processing elements in the forward engine and weight update engine to make accumulations implicit and to fuse the computations of multiple input points. Second, benefiting from the rich sparsity in the backward pass, we design a dual-sparsity-aware backward engine, which exploits both input and output sparsity. Finally, we apply a pipeline optimization between the different engines to build an end-to-end solution for BPTT-based SNN learning. Compared with the modern NVIDIA V100 GPU, H2Learn achieves 7.38× area saving, 5.74-10.20× speedup, and 5.25-7.12× energy saving on several benchmark datasets.
- Published
- 2022
33. DOTA: Detect and Omit Weak Attentions for Scalable Transformer Acceleration
- Author
-
Qu, Zheng, Liu, Liu, Tu, Fengbin, Chen, Zhaodong, Ding, Yufei, and Xie, Yuan
- Abstract
Transformer neural networks have demonstrated leading performance in many applications spanning language understanding, image processing, and generative modeling. Despite the impressive performance, long-sequence Transformer processing is expensive due to the quadratic computation complexity and memory consumption of self-attention. In this paper, we present DOTA, an algorithm-architecture co-design that effectively addresses the challenges of scalable Transformer inference. Based on the insight that not all connections in an attention graph are equally important, we propose to jointly optimize a lightweight detector with the Transformer model to accurately detect and omit weak connections during runtime. Furthermore, we design a specialized system architecture for end-to-end Transformer acceleration using the proposed attention detection mechanism. Experiments on a wide range of benchmarks demonstrate the superior performance of DOTA over other solutions. In summary, DOTA achieves 152.6x and 4.5x performance speedup and orders-of-magnitude energy-efficiency improvements over GPU and customized hardware, respectively.
- Published
- 2022
34. Dynamic Sparse Attention for Scalable Transformer Acceleration
- Author
-
Liu, Liu, Qu, Zheng, Chen, Zhaodong, Tu, Fengbin, Ding, Yufei, and Xie, Yuan
- Abstract
Transformers are the mainstream of NLP applications and are becoming increasingly popular in other domains such as computer vision. Despite improvements in model quality, the enormous computation costs make Transformers difficult to deploy, especially when the sequence length is large in emerging applications. The attention mechanism, the essential component of the Transformer, is the execution bottleneck due to its quadratic complexity. Prior art explores sparse patterns in attention to support long-sequence modeling, but those works rely on static or fixed patterns. We demonstrate that the sparse patterns are dynamic, depending on input sequences. Thus, we propose Dynamic Sparse Attention (DSA), which can efficiently exploit dynamic sparse patterns in attention (a toy masking sketch follows this entry). Compared with other methods, our approach achieves better trade-offs between accuracy and model complexity. Moving forward, we identify challenges and provide solutions to implement DSA on existing hardware (GPUs) and specialized hardware in order to achieve practical speedup and efficiency improvements for Transformer execution.
- Published
- 2022
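- Sketch
The dynamic-mask idea reads naturally in a few lines of PyTorch: derive a per-query mask at runtime from the input, then attend only where the mask allows. In this toy version the mask comes from top-k over the exact scores; DSA instead predicts it with a low-cost approximation, which is the part that actually saves work.

import torch

def dynamic_sparse_attention(q, k, v, keep=0.25):
    """Per-query dynamic sparsity: keep only the strongest score positions."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    n_keep = max(1, int(keep * scores.shape[-1]))
    cutoff = scores.topk(n_keep, dim=-1).values[..., -1:]   # per-row threshold
    scores = scores.masked_fill(scores < cutoff, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q = torch.randn(1, 4, 128, 64)           # (batch, heads, seq, head_dim)
out = dynamic_sparse_attention(q, torch.randn_like(q), torch.randn_like(q))
print(out.shape)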
35. A 28nm 15.59µJ/Token Full-Digital Bitline-Transpose CIM-Based Sparse Transformer Accelerator with Pipeline/Parallel Reconfigurable Modes
- Author
-
Tu, Fengbin, Wu, Zihan, Wang, Yiqi, Liang, Ling, Liu, Liu, Ding, Yufei, Wei, Shaojun, Xie, Yuan, and Yin, Shouyi
- Abstract
Transformer models have achieved state-of-the-art results in many fields, like natural language processing and computer vision, but their large number of matrix multiplications (MMs) results in substantial data movement and computation, causing high latency and energy. In recent years, computing-in-memory (CIM) has been demonstrated as an efficient MM architecture, but a Transformer's attention mechanism raises new challenges for CIM in both memory access and computation aspects (Fig. 29.3.1): 1a) Unlike conventional static MM with pre-trained weights, the attention layers introduce dynamic MM (QKT, A'V), whose weights and inputs are both generated at runtime, leading to redundant off-chip memory accesses for intermediate data. 1b) A CIM pipeline architecture can mitigate the above problem but produces a new challenge: since the K generation direction does not match the conventional CIM write direction, the QKT pipeline needs a large transpose buffer with extra overhead. 2) Compared with fully connected (FC) layers, attention layers dominate a Transformer's computation and require >8b precision to maintain accuracy, so previous analog CIMs [1]-[2] with ≤8b precision support cannot be directly used. Reducing the amount of computation in attention layers is critical for efficiency improvement.
- Published
- 2022
36. GQNA: Generic Quantized DNN Accelerator With Weight-Repetition-Aware Activation Aggregating
- Author
-
Yang, Jianxun, Tu, Fengbin, Li, Yixuan, Wang, Yiqi, Liu, Leibo, Wei, Shaojun, and Yin, Shouyi
- Abstract
Quantization is a prominent approach to compress the model sizes of deep neural networks (DNNs); it clusters high-precision weights into a smaller set of quantization levels and represents high-precision weights by low-precision indexes. To achieve the same accuracy, nonuniform quantized DNNs (NUQ-DNNs) with unequal quantization intervals need lower index precision than uniform quantized DNNs (UQ-DNNs) with equal intervals, achieving smaller model sizes. Hence, deploying NUQ-DNNs on accelerators costs fewer on- and off-chip memory accesses than UQ-DNNs, which is more valuable for edge devices. However, accelerating NUQ-DNNs is nontrivial, since weight indexes cannot be directly used for computation. Previous NUQ-DNN accelerators adopt standard convolutions that decode weight indexes into actual weights multiplied with activations, causing abundant look-up overhead and redundant computation. In this work, we propose a weight-repetition-aware activation aggregating (WPAA) convolution approach to accelerate inference of variable-precision NUQ- and UQ-DNNs (a scalar sketch follows this entry). By merging convolutions of multiple kernels, WPAA requires no look-up operation and removes redundant computation. Based on WPAA, we design a generic quantized DNN accelerator (GQNA). Furthermore, we propose a layer-adaptive kernel-reordering merging scheme to adjust the merging order of kernels offline to minimize the energy consumption of GQNA. Implemented in TSMC 28-nm technology, GQNA achieves 31.9 and 32.6 TOPS/W energy efficiency for 1-b UQ- and NUQ-VGG-16, respectively.
- Published
- 2022
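- Sketch
The aggregating idea above has a compact scalar form: with nonuniform quantization, the dot product sum_i q[idx_i] * x_i can be regrouped as sum_k q_k * (sum of x_i whose index is k), so activations are added first and each quantization level is multiplied exactly once, with no per-weight table lookup. A numpy sketch under assumed shapes; GQNA's kernel merging and reordering are not modeled.

import numpy as np

def wpaa_dot(idx, levels, x):
    """Weight-repetition-aware dot product in the index domain."""
    agg = np.zeros(len(levels))
    np.add.at(agg, idx, x)            # aggregate activations per level index
    return float(agg @ levels)        # one multiply per quantization level

levels = np.array([-0.5, -0.1, 0.1, 0.5])    # 2-bit nonuniform level table
idx = np.random.randint(0, 4, size=256)      # stored low-precision weight indexes
x = np.random.randn(256)
assert np.isclose(wpaa_dot(idx, levels, x), levels[idx] @ x)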
37. Dynamic Sparse Attention for Scalable Transformer Acceleration
- Author
-
Liu, Liu, Qu, Zheng, Chen, Zhaodong, Tu, Fengbin, Ding, Yufei, and Xie, Yuan
- Published
- 2022
- Full Text
- View/download PDF
38. ADROIT: An Adaptive Dynamic Refresh Optimization Framework for DRAM Energy Saving in DNN Training
- Author
-
Lin, Xinhan, Sun, Liang, Tu, Fengbin, Liu, Leibo, Li, Xiangyu, Wei, Shaojun, and Yin, Shouyi
- Abstract
To achieve high accuracy, DNN training usually consumes and generates vast amounts of data, which requires a large DRAM for efficient processing. The refresh power consumption of large DRAMs has become a severe problem. Previous refresh energy saving methods have drawbacks in usability, flexibility, or training support. We propose ADROIT, an adaptive dynamic refresh optimization framework for various DNNs and processing platforms. ADROIT dynamically adjusts the refresh rates for different types of data according to runtime loss feedback during DNN training. Data idle time, lifetime, and size are taken into consideration to reduce the search space of refresh rates and remove most refresh operations. Experimental results show that ADROIT can reduce the refresh energy and total DRAM energy in DNN training by up to 98.9% and 24.7%, respectively, while maintaining accuracy. Moreover, ADROIT automatically applies to different DNNs and hardware platforms without tedious manual configuration.
- Published
- 2021
39. Evolver: A Deep Learning Processor with On-Device Quantization-Voltage-Frequency Tuning
- Author
-
Tu, Fengbin, Wu, Weiwei, Wang, Yang, Chen, Hongjiang, Xiong, Feng, Shi, Man, Li, Ning, Deng, Jinyi, Chen, Tianbao, Liu, Leibo, Wei, Shaojun, Xie, Yuan, and Yin, Shouyi
- Abstract
When deploying deep neural networks (DNNs) onto deep learning processors, we usually exploit mixed-precision quantization and voltage-frequency scaling to make tradeoffs among accuracy, latency, and energy. Conventional methods usually determine the quantization-voltage-frequency (QVF) policy before DNNs are deployed onto local devices, so they struggle to make optimal customizations for local user scenarios. In this article, we solve the problem by enabling on-device QVF tuning with a new deep learning processor architecture, Evolver. Evolver has a QVF tuning mode to deploy DNNs with local customizations before normal execution. In this mode, Evolver uses reinforcement learning to search for the optimal QVF policy based on direct hardware feedback from the chip itself. After that, Evolver runs the newly quantized DNN inference under the searched voltage and frequency. To improve the performance and energy efficiency of both training and inference, we introduce bidirectional speculation and runtime reconfiguration techniques into the architecture. To the best of our knowledge, Evolver is the first deep learning processor that utilizes on-device QVF tuning to achieve both customized and optimal DNN deployment.
- Published
- 2021
40. Brain-Inspired Computing: Adventure from Beyond CMOS Technologies to Beyond von Neumann Architectures
- Author
-
Amrouch, Hussam, Chen, Jian Jia, Roy, Kaushik, Xie, Yuan, Chakraborty, Indranil, Huangfu, Wenqin, Liang, Ling, Tu, Fengbin, Wang, Cheng, and Yayla, Mikail
- Abstract
The goal of this special session paper is to introduce and discuss different breakthrough technologies as well as novel architectures and how, together, they may reshape the future of artificial intelligence. Our aim is to provide a comprehensive overview of the latest advances in brain-inspired computing and how the latter can be realized when emerging technologies, using beyond-CMOS devices, are coupled with novel computing paradigms that go beyond von Neumann architectures. Different emerging technologies like the Ferroelectric Field-Effect Transistor (FeFET), Phase Change Memory (PCM), and Resistive RAM (ReRAM) are discussed, demonstrating their promising capability for building neuromorphic computing architectures that are inspired by nature. In addition, this special session paper discusses various novel concepts such as Logic-in-Memory (LIM), Processing-in-Memory (PIM), and Spiking Neural Networks (SNNs), exploring the far-reaching consequences of beyond-von-Neumann computing on accelerating deep learning. Finally, the latest trends in brain-inspired computing are summarized into algorithm-, technology-, and application-driven innovations, comparing different PIM architectures.
- Published
- 2021
41. STC: Significance-aware transform-based codec framework for external memory access reduction
- Author
-
Xiong, Feng, Tu, Fengbin, Shi, Man, Wang, Yang, Liu, Leibo, Wei, Shaojun, and Yin, Shouyi
- Abstract
Deep convolutional neural networks (DCNNs), with their extensive computation, require considerable external memory bandwidth and storage for intermediate feature maps. External memory accesses for feature maps have become a significant energy bottleneck for DCNN accelerators. Much work has been done on quantizing feature maps into low precision to decrease the costs of computation and storage. Beyond that, the large amount of correlation among channels in feature maps can be exploited to further reduce external memory access. Towards this end, we propose a novel compression framework called Significance-aware Transform-based Codec (STC). In its compression process, a significance-aware transform is introduced to obtain low-correlated feature maps in an orthogonal space, as the intrinsic representations of the original feature maps. The transformed feature maps are quantized and encoded to compress external data transmission (a toy codec sketch follows this entry). For the next layer's computation, the data are reloaded through STC's reconstruction process. The STC framework can be supported with a small set of extensions to current DCNN accelerators. We implement the STC extensions on the baseline TPU architecture for hardware evaluation. The strengthened TPU achieves an average 2.57x reduction in external memory access and a 1.95x~2.78x improvement in system-level energy efficiency, with a negligible accuracy loss of only 0.5%.
- Published
- 2020
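- Sketch
A toy stand-in for the compression pipeline above: decorrelate channels with an orthogonal transform, quantize the transformed coefficients, and invert on reload. A PCA basis substitutes for the learned significance-aware transform, and the keep ratio and step size are arbitrary; this reproduces the shape of the codec, not its trained behavior.

import numpy as np

def stc_like_codec(fmap, keep=0.5, step=0.05):
    """Encode/decode a (C, H*W) feature map via channel decorrelation."""
    mu = fmap.mean(axis=1, keepdims=True)
    u, _, _ = np.linalg.svd(fmap - mu, full_matrices=False)
    basis = u[:, : int(keep * fmap.shape[0])]           # significant components
    coded = np.round(basis.T @ (fmap - mu) / step)      # transform + quantize
    return basis @ (coded * step) + mu                  # reconstruction on reload

fmap = np.random.randn(32, 1) @ np.random.randn(1, 196) \
       + 0.1 * np.random.randn(32, 196)                 # correlated channels
rec = stc_like_codec(fmap)
print("relative error:", np.linalg.norm(rec - fmap) / np.linalg.norm(fmap))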
42. DUET: Boosting deep neural network efficiency on dual-module architecture
- Author
-
Liu, Liu, Qu, Zheng, Deng, Lei, Tu, Fengbin, Li, Shuangchen, Hu, Xing, Gu, Zhenyu, Ding, Yufei, and Xie, Yuan
- Abstract
Deep Neural Networks (DNNs) have been driving the mainstream of machine learning applications. However, deploying DNNs on modern hardware with stringent latency requirements and energy constraints is challenging because of the compute-intensive and memory-intensive execution patterns of various DNN models. We propose an algorithm-architecture co-design to boost DNN execution efficiency. Leveraging the noise resilience of nonlinear activation functions in DNNs, we propose dual-module processing that uses approximate modules learned from the original DNN layers to compute insensitive activations. Therefore, we can save the expensive computations and data accesses of unnecessary sensitive activations. We then design an Executor-Speculator dual-module architecture with support for balanced execution and memory access reduction. With acceptable model inference quality degradation, our accelerator design achieves 2.24x speedup and 1.97x energy efficiency improvement for compute-bound Convolutional Neural Networks (CNNs) and memory-bound Recurrent Neural Networks (RNNs).
- Published
- 2020
43. Reconfigurable Architecture for Neural Approximation in Multimedia Computing
- Author
-
Tu, Fengbin, Yin, Shouyi, Ouyang, Peng, Liu, Leibo, and Wei, Shaojun
- Abstract
Due to inherent error resiliency, many high-performance multimedia applications can be approximated by multi-layer perceptrons (MLPs) with little quality loss. An MLP accelerator can be designed to improve the power efficiency of multimedia systems. However, previous MLP accelerators' fixed computational pattern lowers performance when the MLP topology varies across applications. In this paper, we propose a scheduling framework to guide the mapping of MLPs onto limited hardware resources. The scheduling framework adjusts the computational patterns for various MLP topologies, obtaining 30% higher performance than conventional scheduling. We implement a reconfigurable neural architecture (RNA) to support the different patterns in the framework and further improve performance and efficiency. RNA achieves a speedup of 572x on the approximable part, a whole-application speedup of 7.9x, and energy savings of 6.3x, with little quality loss on the benchmarks.
- Published
- 2019
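A toy cost model in the spirit of the scheduling framework above: for each MLP layer it picks the output-tile size that best fills a fixed array of multiply-accumulate units. PE_COUNT, the cost formula, and the ignored reduction cost are all simplifying assumptions:

import math

PE_COUNT = 64  # assumed size of the MAC array

def cycles(n_in, n_out, out_tile):
    # out_tile neurons run concurrently; the PEs left over split each
    # neuron's inputs. The final reduction cost is ignored for simplicity.
    in_parallel = max(1, PE_COUNT // out_tile)
    per_pass = math.ceil(n_in / in_parallel)
    passes = math.ceil(n_out / out_tile)
    return passes * per_pass

def best_schedule(layers):
    # For each (n_in, n_out) layer, try every tile size and keep the best.
    return [min((cycles(i, o, t), t) for t in range(1, PE_COUNT + 1))
            for i, o in layers]

# A small approximation MLP: 9 inputs -> 32 hidden -> 1 output.
for c, tile in best_schedule([(9, 32), (32, 1)]):
    print(f"out-tile {tile}: {c} cycles")

The point the example illustrates is that the best tile differs per layer, which is why a fixed computational pattern leaves performance on the table.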
44. Parana: A Parallel Neural Architecture Considering Thermal Problem of 3D Stacked Memory
- Author
-
Yin, Shouyi, Tang, Shibin, Lin, Xinhan, Ouyang, Peng, Tu, Fengbin, Liu, Leibo, Zhao, Jishen, Xu, Cong, Li, Shuangcheng, Xie, Yuan, and Wei, Shaojun
- Abstract
Recent advances in deep learning (DL) have stimulated increasing interest in neural networks (NNs). From the perspective of operation type and network architecture, deep neural networks can be categorized into convolution-based neural networks (ConvNets), recurrent neural networks (RNNs), and fully-connected neural networks (FCNets). Different types of neural networks are usually cascaded and combined into a hybrid neural network (hybrid-NN) to complete real-life cognitive tasks. Such hybrid-NN implementations are memory-intensive, with a large number of memory accesses, so hybrid-NN performance is often limited by insufficient memory bandwidth. A '3D + 2.5D' integration system, which places a high-bandwidth 3D stacked DRAM side-by-side with a highly-parallel neural processing unit (NPU) on a silicon interposer, overcomes the bandwidth bottleneck in hybrid-NN acceleration. However, the intensive concurrent 3D DRAM accesses produced by the NPU lead to a serious thermal problem in the 3D DRAM. In this paper, we propose a neural processor called Parana for hybrid-NN acceleration that takes the thermal problem of 3D DRAM into consideration. Parana addresses the thermal problem by optimizing both the total number of memory accesses and the memory access behavior. For the access behavior, Parana balances memory bandwidth by spatial-division mapping of the hybrid-NN onto computing resources, which prevents masses of memory accesses from being issued within a short time period. To reduce the total number of memory accesses, we design a new NPU architecture and propose a memory-oriented tiling and scheduling mechanism that maximizes the utilization of the on-chip buffer (a toy tiling-selection sketch follows this record). Experimental results show that Parana reduces the peak temperature by up to 54.72 °C and the steady-state temperature by up to 32.27 °C over state-of-the-art accelerators with 3D memory, without performance degradation. © 1990-2012 IEEE.
- Published
- 2019
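A toy version of memory-oriented tiling as described above: enumerate output-tile shapes for one convolution layer, keep those whose working set fits the on-chip buffer, and pick the one with the least DRAM traffic. The tile candidates, traffic formula, and buffer model are illustrative assumptions, not Parana's actual mechanism:

import itertools

def best_tiling(H, W, Cin, Cout, K, buf_bytes, elem=2):
    spatial = (1, 2, 4, 7, 14, 28)                 # illustrative tile candidates
    channel = (1, 2, 4, 8, 16, 32, 64, 128, 256)
    best = None
    for th, tw, tc in itertools.product(spatial, spatial, channel):
        if th > H or tw > W or tc > Cout:
            continue
        in_tile = (th + K - 1) * (tw + K - 1) * Cin    # input halo included
        wt_tile = K * K * Cin * tc
        out_tile = th * tw * tc
        if (in_tile + wt_tile + out_tile) * elem > buf_bytes:
            continue                                   # working set must fit
        n_tiles = -(-H // th) * -(-W // tw) * -(-Cout // tc)
        # Element counts: reload inputs and weights per tile, store outputs once.
        traffic = n_tiles * (in_tile + wt_tile) + H * W * Cout
        if best is None or traffic < best[0]:
            best = (traffic, (th, tw, tc))
    return best

print(best_tiling(H=28, W=28, Cin=128, Cout=256, K=3, buf_bytes=128 * 1024))

Fewer total DRAM accesses also means fewer concurrent accesses per unit time, which is the lever on 3D-DRAM temperature that the record above describes.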
45. A High Throughput Acceleration for Hybrid Neural Networks with Efficient Resource Management on FPGA
- Author
-
Yin, Shouyi, Tang, Shibin, Lin, Xinhan, Ouyang, Peng, Tu, Fengbin, Liu, Leibo, and Wei, Shaojun
- Abstract
Deep learning has promoted the development of artificial intelligence and achieved remarkable successes in many intelligent applications. Convolution-based layers (CLs), fully-connected layers (FLs), and recurrent layers (RLs) are the three types of layers in classic neural networks. Most intelligent tasks are implemented by hybrid neural networks (hybrid-NNs), which are commonly composed of different layer blocks (LBs) of CLs, FLs, and RLs. Because the CLs require the most computation in hybrid-NNs, many field-programmable gate array (FPGA)-based accelerators focus on CL acceleration and have demonstrated great performance. However, CL-only accelerators underutilize FPGA resources when accelerating a whole hybrid-NN. To fully exploit the logic resources and the memory bandwidth when accelerating CLs/FLs/RLs, we propose a resource-efficient FPGA mapping mechanism for hybrid-NNs. The mechanism first improves DSP utilization by packing multiple small bit-width operations onto one DSP (see the sketch after this record). It then uses LB-level spatial mapping to exploit the complementary features of the different networks in the hybrid-NN. We evaluate the mapping mechanism by implementing four hybrid-NNs on a Xilinx Virtex7 690T FPGA. The proposed mechanism achieves a peak performance of 1805.8 giga operations per second (GOPS). Our analysis of resource utilization and throughput shows that the proposed method exploits more of the FPGA's computing power and achieves up to 4.13× higher throughput than the state-of-the-art acceleration. © 1982-2012 IEEE.
- Published
- 2019
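A small arithmetic sketch of the DSP-packing idea referenced above: two independent unsigned 8-bit weights share one wide multiplication with a common activation. The 18-bit guard offset is an assumption chosen so the partial products cannot overlap; signed operands would need an extra correction term:

def packed_mul(a, w1, w2, guard=18):
    """a, w1, w2: unsigned 8-bit values. Returns (w1*a, w2*a) from one multiply."""
    assert 0 <= a < 256 and 0 <= w1 < 256 and 0 <= w2 < 256
    packed = (w1 << guard) | w2        # w2*a < 2**16, so 18 guard bits suffice
    product = packed * a               # the single wide multiplication
    return product >> guard, product & ((1 << guard) - 1)

hi, lo = packed_mul(200, 17, 99)
assert (hi, lo) == (17 * 200, 99 * 200)
print(hi, lo)

On hardware, the wide product fits a single DSP slice, so one slice delivers two low-precision multiplies per cycle instead of one.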
46. Towards Efficient Compact Network Training on Edge-Devices
- Author
-
Xiong, Feng, Tu, Fengbin, Yin, Shouyi, and Wei, Shaojun
- Abstract
Currently, there is a trend to deploy training on edge devices, which is crucial to future AI applications in various scenarios with transfer-learning and online-learning demands. Accuracy may degrade severely when trained models are deployed directly on edge devices, because the local environment forms an edge-local dataset that often differs from the generic dataset. However, training on edge devices with limited computing and memory capability is a challenging problem. In this paper, we propose a novel quantization training framework for efficient compact network training on edge devices. First, training-aware symmetric quantization is introduced to quantize all of the data types in the training process. Then, channel-wise quantization is adopted for compact network quantization; it has significantly higher tolerance to quantization errors and makes the training process more stable (a sketch follows this record). For further training efficiency, we build a hardware evaluation platform to evaluate different network settings, so as to achieve a better trade-off among accuracy, energy, and latency. Finally, we evaluate two widely used compact networks on a domain-adaptation dataset for image classification; the results demonstrate that the proposed methods achieve an 8.4×-17.2× reduction in energy and an 11.9×-16.3× reduction in latency compared with 32-bit implementations, while maintaining the classification accuracy. © 2019 IEEE.
- Published
- 2019
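A minimal sketch of channel-wise symmetric quantization as described above; the bit-width, tensor shapes, and epsilon guard are illustrative, and the straight-through estimator used during backpropagation is omitted:

import numpy as np

def quantize_channelwise(w, bits=8):
    """w: (Cout, Cin, K, K) weights. Returns int8 weights and per-channel scales."""
    qmax = 2 ** (bits - 1) - 1
    scales = np.abs(w).reshape(w.shape[0], -1).max(axis=1) / qmax
    scales = np.maximum(scales, 1e-8)          # guard all-zero channels
    s = scales[:, None, None, None]
    q = np.clip(np.round(w / s), -qmax, qmax).astype(np.int8)
    return q, scales

w = np.random.randn(16, 8, 3, 3).astype(np.float32)
q, scales = quantize_channelwise(w)
err = float(np.abs(q * scales[:, None, None, None] - w).max())
print("channels:", len(scales), "max dequantization error:", err)

Giving each output channel its own scale keeps small-magnitude channels from being crushed by one large outlier channel, which is why this is more error-tolerant than a single per-tensor scale.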
47. A High Throughput Acceleration for Hybrid Neural Networks With Efficient Resource Management on FPGA
- Author
-
Yin, Shouyi, primary, Tang, Shibin, additional, Lin, Xinhan, additional, Ouyang, Peng, additional, Tu, Fengbin, additional, Liu, Leibo, additional, and Wei, Shaojun, additional
- Published
- 2019
- Full Text
- View/download PDF
48. Reconfigurable Architecture for Neural Approximation in Multimedia Computing
- Author
-
Tu, Fengbin, primary, Yin, Shouyi, additional, Ouyang, Peng, additional, Liu, Leibo, additional, and Wei, Shaojun, additional
- Published
- 2019
- Full Text
- View/download PDF
49. Parana: A Parallel Neural Architecture Considering Thermal Problem of 3D Stacked Memory
- Author
-
Yin, Shouyi, primary, Tang, Shibin, additional, Lin, Xinhan, additional, Ouyang, Peng, additional, Tu, Fengbin, additional, Liu, Leibo, additional, Zhao, Jishen, additional, Xu, Cong, additional, Li, Shuangcheng, additional, Xie, Yuan, additional, and Wei, Shaojun, additional
- Published
- 2019
- Full Text
- View/download PDF
50. GNA: Reconfigurable and Efficient Architecture for Generative Network Acceleration
- Author
-
Yan, Jiale, primary, Yin, Shouyi, additional, Tu, Fengbin, additional, Liu, Leibo, additional, and Wei, Shaojun, additional
- Published
- 2018
- Full Text
- View/download PDF