85 results for "Tu, Fengbin"
Search Results
2. Towards Efficient Control Flow Handling in Spatial Architecture via Architecting the Control Flow Plane
- Author
-
Deng, Jinyi, Tang, Xinru, Zhang, Jiahao, Li, Yuxuan, Zhang, Linyun, Han, Boxiao, He, Hongjun, Tu, Fengbin, Liu, Leibo, Wei, Shaojun, Hu, Yang, and Yin, Shouyi
- Subjects
Computer Science - Hardware Architecture, C.1.3, F.1.2
- Abstract
Spatial architecture is a high-performance architecture that uses control flow graphs and data flow graphs as the computational model and producer/consumer models as the execution model. However, existing spatial architectures suffer from control flow handling challenges. Upon categorizing their PE execution models, we find that they lack autonomous, peer-to-peer, and temporally loosely-coupled control flow handling capability, which leads to limited performance in control-intensive programs. We propose Marionette, a spatial architecture with an explicitly designed control flow plane. The Control Flow Plane enables autonomous, peer-to-peer, and temporally loosely-coupled control flow handling. The Proactive PE Configuration ensures timely, computation-overlapped configuration to improve the handling of Branch Divergence. The Agile PE Assignment enhances the pipeline performance of Imperfect Loops. We develop the full stack of Marionette (ISA, compiler, simulator, RTL) and demonstrate that, on a variety of challenging control-intensive programs, Marionette outperforms the state-of-the-art spatial architectures Softbrain, TIA, REVEL, and RipTide by geomean 2.88x, 3.38x, 1.55x, and 2.66x, respectively.
- Published
- 2023
- Full Text
- View/download PDF
3. DyBit: Dynamic Bit-Precision Numbers for Efficient Quantized Neural Network Inference
- Author
-
Zhou, Jiajun, Wu, Jiajun, Gao, Yizhao, Ding, Yuhao, Tao, Chaofan, Li, Boyu, Tu, Fengbin, Cheng, Kwang-Ting, So, Hayden Kwok-Hay, and Wong, Ngai
- Subjects
Computer Science - Machine Learning
- Abstract
To accelerate the inference of deep neural networks (DNNs), quantization with low-bitwidth numbers is actively researched. A prominent challenge is to quantize DNN models into low-bitwidth numbers without significant accuracy degradation, especially at very low bitwidths (< 8 bits). This work targets an adaptive data representation with variable-length encoding called DyBit. DyBit can dynamically adjust the precision and range of its separate bit-fields to adapt to the distribution of DNN weights/activations. We also propose a hardware-aware quantization framework with a mixed-precision accelerator to trade off inference accuracy and speedup. Experimental results demonstrate that the inference accuracy with DyBit is 1.997% higher than the state-of-the-art at 4-bit quantization, and the proposed framework can achieve up to 8.1x speedup compared with the original model. (An illustrative sketch of the bit-field idea follows this entry.)
- Published
- 2023
- Full Text
- View/download PDF
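- Sketch
The bit-field idea behind formats like DyBit can be illustrated with a toy float quantizer: split a fixed bit budget between exponent and mantissa, and pick the split that minimizes error on the tensor at hand. Everything below (function names, the MSE criterion, the rounding scheme) is an illustrative assumption, not DyBit's actual encoding.

import numpy as np

def quantize_minifloat(x, e_bits, m_bits):
    """Round x onto a toy (sign, e_bits, m_bits) floating-point grid."""
    bias = 2 ** (e_bits - 1) - 1
    sign = np.sign(x)
    mag = np.maximum(np.abs(x), np.finfo(np.float64).tiny)
    exp = np.clip(np.floor(np.log2(mag)), -bias + 1, bias)
    frac = mag / 2.0 ** exp                                # mantissa in [1, 2)
    frac = np.round(frac * 2 ** m_bits) / 2 ** m_bits      # round the mantissa
    return sign * frac * 2.0 ** exp

def best_split(x, total_bits=6):
    """Choose the exponent/mantissa split with the lowest MSE for this tensor."""
    splits = [(e, total_bits - 1 - e) for e in range(2, total_bits - 1)]
    return min(splits, key=lambda s: np.mean((x - quantize_minifloat(x, *s)) ** 2))

w = np.random.randn(1024) * 0.1          # toy weight distribution
print("best (e_bits, m_bits) for a 6-bit budget:", best_split(w))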
4. Alleviating Datapath Conflicts and Design Centralization in Graph Analytics Acceleration
- Author
-
Lin, Haiyang, Yan, Mingyu, Wang, Duo, Zou, Mo, Tu, Fengbin, Ye, Xiaochun, Fan, Dongrui, and Xie, Yuan
- Subjects
Computer Science - Hardware Architecture, Computer Science - Distributed, Parallel, and Cluster Computing
- Abstract
Previous graph analytics accelerators have achieved great throughput improvements by alleviating irregular off-chip memory accesses. However, on-chip datapath conflicts and design centralization have become the critical issues hindering further throughput improvement. In this paper, a general solution, the Multiple-stage Decentralized Propagation network (MDP-network), is proposed to address these issues, inspired by the key idea of trading latency for throughput. Besides, a novel high-throughput graph analytics accelerator, HiGraph, is proposed by deploying MDP-network to address each issue in practice. Experiments show that, compared with the state-of-the-art accelerator, HiGraph achieves up to 2.2x speedup (1.5x on average) as well as better scalability. Comment: To appear in the 59th Design Automation Conference (DAC 2022).
- Published
- 2022
5. H2Learn: High-Efficiency Learning Accelerator for High-Accuracy Spiking Neural Networks
- Author
-
Liang, Ling, Qu, Zheng, Chen, Zhaodong, Tu, Fengbin, Wu, Yujie, Deng, Lei, Li, Guoqi, Li, Peng, and Xie, Yuan
- Subjects
Computer Science - Neural and Evolutionary Computing - Abstract
Although spiking neural networks (SNNs) benefit from bio-plausible neural modeling, their low accuracy under the common local synaptic plasticity learning rules limits their application in many practical tasks. Recently, an emerging SNN supervised learning algorithm inspired by backpropagation through time (BPTT) from the domain of artificial neural networks (ANNs) has successfully boosted the accuracy of SNNs and helped improve their practicability. However, current general-purpose processors suffer from low efficiency when performing BPTT for SNNs due to their ANN-tailored optimizations. On the other hand, current neuromorphic chips cannot support BPTT because they mainly adopt local synaptic plasticity rules for simplified implementation. In this work, we propose H2Learn, a novel architecture that achieves high efficiency for BPTT-based SNN learning while ensuring high SNN accuracy. We first characterize the behaviors of BPTT-based SNN learning. Benefiting from the binary spike-based computation in the forward pass and the weight update, we design lookup table (LUT) based processing elements in the Forward Engine and Weight Update Engine to make accumulations implicit and to fuse the computations of multiple input points (a toy LUT sketch follows this entry). Second, benefiting from the rich sparsity in the backward pass, we design a dual-sparsity-aware Backward Engine which exploits both input and output sparsity. Finally, we apply a pipeline optimization between the different engines to build an end-to-end solution for BPTT-based SNN learning. Compared with the modern NVIDIA V100 GPU, H2Learn achieves 7.38x area saving, 5.74-10.20x speedup, and 5.25-7.12x energy saving on several benchmark datasets.
- Published
- 2021
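- Sketch
A minimal rendition of the LUT trick described above: because spikes are binary, the weighted sum over a group of k inputs takes only 2^k possible values, so it can be precomputed once per weight group and fetched by the packed spike pattern. This is a paraphrase of the general technique in plain Python, not H2Learn's engine; the group size k and helper names are assumptions.

import itertools
import numpy as np

def build_lut(w_group):
    """Precompute the partial sum for every possible spike pattern of this group."""
    k = len(w_group)
    lut = np.zeros(2 ** k)
    for bits in itertools.product((0, 1), repeat=k):
        lut[int("".join(map(str, bits)), 2)] = np.dot(bits, w_group)
    return lut

def lut_dot(spikes, weights, k=4):
    """Dot product of a binary spike vector with weights, one lookup per group."""
    total = 0.0
    for g in range(0, len(weights), k):
        lut = build_lut(weights[g:g + k])                  # reused across time steps
        idx = int("".join(str(s) for s in spikes[g:g + k]), 2)
        total += lut[idx]                                  # accumulation becomes a lookup
    return total

w = np.random.randn(16)
s = np.random.randint(0, 2, 16)
assert np.isclose(lut_dot(s, w), s @ w)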
6. SWG: an architecture for sparse weight gradient computation
- Author
-
Wu, Weiwei, Tu, Fengbin, Li, Xiangyu, Wei, Shaojun, and Yin, Shouyi
- Abstract
On-device training for deep neural networks (DNN) has become a trend due to various user preferences and scenarios. The DNN training process consists of three phases: feedforward (FF), backpropagation (BP), and weight gradient (WG) update. WG takes about one-third of the computation in the whole training process. Current training accelerators usually ignore the special computation properties of WG and process it in a way similar to FF/BP. Besides, the extensive data sparsity in WG, which brings opportunities to save computation, is not well explored. Exploiting these optimization opportunities, however, meets three underutilization problems, caused by (1) the mismatch between WG data dimensions and hardware parallelism, (2) the full sparsity, i.e., the sparsity of the feature map (Fmap), error map (Emap), and gradient, and (3) the workload imbalance resulting from irregular sparsity. In this paper, we propose a specific architecture for sparse weight gradient (SWG) computation. The architecture is designed around a hierarchical unrolling and sparsity-aware (HUSA) dataflow to exploit the optimization opportunities of the special computation properties and full data sparsity. In the HUSA dataflow, the data dimensions are unrolled hierarchically on the hardware architecture. A valid-data trace (VDT) mechanism is embedded in the dataflow to avoid the underutilization caused by the two-sided input sparsity (a scalar sketch of this idea follows this entry). The gradient is unrolled in the PE to alleviate the underutilization induced by output sparsity while maintaining data reuse opportunities. Besides, we design an intra- and inter-column balancer (IIBLC) to dynamically tackle the workload imbalance problem resulting from the irregular sparsity. Experimental results show that, with the HUSA dataflow exploiting the full sparsity, SWG achieves a speedup of 12.23x over the state-of-the-art gradient computation architecture, TrainWare. SWG also helps to improve the energy efficiency of the state-of-the-art training accelerator.
- Published
- 2024
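- Sketch
A scalar view of the two-sided sparsity above: each weight-gradient element accumulates feature-map x error-map products, so a zero on either side lets the MAC be skipped. The toy code below (a dense layer with a numpy reference check) is for intuition only; the assumed shapes and skipping loops stand in for HUSA's hierarchical unrolling and the VDT mechanism, which are not reproduced here.

import numpy as np

def sparse_weight_grad(fmap, emap):
    """grad_W = fmap^T @ emap, visiting only (nonzero, nonzero) operand pairs.

    fmap: (N, Cin) activations; emap: (N, Cout) back-propagated errors.
    """
    grad = np.zeros((fmap.shape[1], emap.shape[1]))
    for s in range(fmap.shape[0]):
        f_nz = np.nonzero(fmap[s])[0]          # skip zero activations
        e_nz = np.nonzero(emap[s])[0]          # skip zero errors
        for i in f_nz:
            for j in e_nz:                     # only valid pairs reach the MAC
                grad[i, j] += fmap[s, i] * emap[s, j]
    return grad

f = np.random.randn(8, 16) * (np.random.rand(8, 16) > 0.7)   # ~70% zeros
e = np.random.randn(8, 32) * (np.random.rand(8, 32) > 0.7)
assert np.allclose(sparse_weight_grad(f, e), f.T @ e)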
7. AdaP-CIM: Compute-in-Memory Based Neural Network Accelerator using Adaptive Posit
- Author
-
He, Jingyu, Tu, Fengbin, Cheng, Kwang Ting, and Tsui, Chi Ying
- Abstract
This study proposes two novel approaches to address memory wall issues in AI accelerator designs for large neural networks. The first approach introduces a new format called adaptive Posit (AdaP) with two exponent encoding schemes that dynamically extend the dynamic range of its representation at run time with minimal hardware overhead. The second approach proposes using compute-in-memory (CIM) with speculative input alignment (SAU) to implement the AdaP multiply-and-accumulate (MAC) computation, significantly reducing the delay, area, and power consumption for the max exponent computation. The proposed approaches outperform state-of-the-art quantization methods and achieve significant energy and area efficiency improvements.
- Published
- 2024
8. STAR: An STGCN ARchitecture for Skeleton-Based Human Action Recognition
- Author
-
Wu, Weiwei, Tu, Fengbin, Niu, Mengqi, Yue, Zhiheng, Liu, Leibo, Wei, Shaojun, Li, Xiangyu, Hu, Yang, and Yin, Shouyi
- Published
- 2023
- Full Text
- View/download PDF
9. Reconfigurability, Why It Matters in AI Tasks Processing: A Survey of Reconfigurable AI Chips
- Author
-
Wei, Shaojun, Lin, Xinhan, Tu, Fengbin, Wang, Yang, Liu, Leibo, and Yin, Shouyi
- Published
- 2023
- Full Text
- View/download PDF
10. BIOS: A 40nm Bionic Sensor-defined 0.47pJ/SOP, 268.7TSOPs/W Configurable Spiking Neuron-in-Memory Processor for Wearable Healthcare
- Author
-
Tian, Fengshi, Wang, Xiaomeng, Chen, Jinbo, Zheng, Jiakun, Wu, Hui, Liu, Xuejiao, Tu, Fengbin, Yang, Jie, Sawan, Mohamad, Tsui, Chi-Ying, and Cheng, Kwang Ting
- Abstract
This work presents the first configurable spiking neuron-in-memory (BIOS) processor that leverages the characteristics of bionic sensors to enable ultra-efficient wearable healthcare applications. The BIOS processor offers four key features: 1) a sensor-defined architecture that supports level-crossing sampling and sparse processing; 2) a spike-triggered neural circuit that saves processing energy; 3) high robustness for spike operations (SOPs) enabled by current-based, instead of charge-based, in-memory integration; and 4) a configurable neuron-in-memory cell array that supports various network models and firing threshold values. Using a 5-bit analog-to-spike converter (ASC), the proposed BIOS processor achieves state-of-the-art energy efficiency of 0.47pJ/SOP, 0.48uJ/inference, and 268.7TSOPs/W with 95.31% accuracy for arrhythmia detection on the MIT-BIH dataset. These results compare favorably in accuracy, efficiency, and overall FoM with recent works.
- Published
- 2023
11. AutoDCIM: An Automated Digital CIM Compiler
- Author
-
Chen, Jia, Tu, Fengbin, Shao, Kunming, Tian, Fengshi, Huo, Xiao, Tsui, Chi-Ying, and Cheng, Kwang Ting
- Abstract
Digital Computing-in-Memory (DCIM) is an emerging architecture that integrates digital logic into memory for efficient AI computing. However, current DCIM designs heavily rely on manual efforts. This increases DCIM design time and limits the optimization space, making it challenging to satisfy the user specifications of diverse AI applications. This paper presents AutoDCIM, the first automated DCIM compiler. AutoDCIM takes the user specifications as inputs and generates a DCIM macro architecture with an optimized layout. AutoDCIM's template-based generation balances handcrafted cell design and agile macro development. AutoDCIM's layout exploration loop analyzes diverse DCIM array partitioning schemes to satisfy user specifications. The auto-generated DCIM macros present competitive efficiency results in comparison with state-of-the-art silicon-verified DCIM macros.
- Published
- 2023
12. PIM-HLS: An Automatic Hardware Generation Tool for Heterogeneous Processing-In-Memory-based Neural Network Accelerators
- Author
-
Zhu, Yu, Zhu, Zhenhua, Dai, Guohao, Tu, Fengbin, Sun, Hanbo, Cheng, Kwang Ting, Yang, Huazhong, and Wang, Yu
- Abstract
Processing-in-memory (PIM) architectures have shown great capability for neural network (NN) acceleration on edge devices that demand low latency under severe area constraints. Heterogeneous PIM architectures combining different PIM implementation approaches, such as RRAM-based PIM and SRAM-based PIM, can further improve performance. However, the automatic generation of heterogeneous PIM architectures faces two unresolved problems. First, existing work has not considered the design of heterogeneous PIM-based NN accelerators with multiple memory technologies. Second, for PIM with insufficient memory on edge devices, it is challenging to find the optimal runtime weight scheduling strategy in an O(L!) optimization space for an NN with L layers. In this paper, we propose PIM-HLS, an automatic hardware generation tool for heterogeneous PIM-based NN accelerators. Aiming at the problems above, we first point out that heterogeneous PIM can improve performance under severe area constraints. Then we optimize the architecture for each NN layer by taking advantage of the different memory technologies. We also define the optimization problem of runtime weight scheduling and mapping for the first time, and propose a dynamic-programming-based weight scheduling algorithm that reduces the optimization space to O(L^2) (an illustrative DP sketch follows this entry). We implement PIM-HLS to automatically generate the hardware code and instructions. Results show that we achieve an average 5.9× speedup with 72.8% less area compared with state-of-the-art PIM designs.
- Published
- 2023
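- Sketch
The O(L!)-to-O(L^2) reduction quoted above is characteristic of a segmentation-style dynamic program. The sketch below is a stand-in under assumed semantics (layers grouped into consecutive, capacity-bounded weight-resident segments; cost = data reloaded), not the paper's actual cost model: dp(i) is the best cost of scheduling the first i layers, giving O(L^2) subproblems and transitions instead of enumerating orderings.

from functools import lru_cache

weights = [120, 80, 200, 60, 90, 150]   # hypothetical per-layer weight sizes (KB)
CAPACITY = 320                          # hypothetical on-chip weight budget (KB)

def group_cost(i, j):
    """Cost of keeping layers i..j-1 resident as one group (inf if it overflows)."""
    size = sum(weights[i:j])
    return float(size) if size <= CAPACITY else float("inf")

@lru_cache(maxsize=None)
def dp(i):
    """Minimum total reload cost for the first i layers."""
    if i == 0:
        return 0.0
    return min(dp(j) + group_cost(j, i) for j in range(i))

print("min reload cost:", dp(len(weights)))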
13. MulTCIM: Digital Computing-in-Memory-Based Multimodal Transformer Accelerator With Attention-Token-Bit Hybrid Sparsity
- Author
-
Tu, Fengbin, Wu, Zihan, Wang, Yiqi, Wu, Weiwei, Liu, Leibo, Hu, Yang, Wei, Shaojun, and Yin, Shouyi
- Abstract
Multimodal Transformers are emerging artificial intelligence (AI) models that comprehend a mixture of signals from different modalities like vision, natural language, and speech. Their attention mechanism and massive matrix multiplications (MMs) cause high latency and energy consumption. Prior work has shown that a digital computing-in-memory (CIM) network can be an efficient architecture for processing Transformers while maintaining high accuracy. To further improve energy efficiency, the attention-token-bit hybrid sparsity in multimodal Transformers can be exploited. The hybrid sparsity significantly reduces computation, but its irregularity also harms CIM utilization. To fully utilize the attention-token-bit hybrid sparsity of multimodal Transformers, we design a digital CIM-based accelerator called MulTCIM with three corresponding features: long reuse elimination dynamically reshapes the attention pattern to improve CIM utilization; a runtime token pruner (RTP) removes insignificant tokens, and a modal-adaptive CIM network (MACN) exploits symmetric modal overlapping to reduce CIM idleness; and an effective bitwidth-balanced CIM (EBB-CIM) macro balances input bits across in-memory multiply-accumulations (MACs) to reduce computation time. The fabricated MulTCIM consumes only 2.24 µJ/token for the ViLBERT-base model, achieving 2.50x-5.91x lower energy than previous Transformer accelerators and digital CIM accelerators.
- Published
- 2023
14. SDP: Co-Designing Algorithm, Dataflow, and Architecture for In-SRAM Sparse NN Acceleration
- Author
-
Tu, Fengbin, Wang, Yiqi, Liang, Ling, Ding, Yufei, Liu, Leibo, Wei, Shaojun, Yin, Shouyi, and Xie, Yuan
- Abstract
Processing-in-memory (PIM) is a promising architecture for neural network (NN) acceleration. Most previous PIMs are based on analog computing, so their accuracy and memory cell array utilization are limited by analog deviation and ADC overhead. Digital PIM is an emerging type of PIM architecture that integrates digital logic into memory cells, which can fully utilize the cell array without accuracy loss. However, digital PIM's rigid crossbar architecture and full array activation raise new challenges for sparse NN acceleration. Conventional unstructured or structured sparsity cannot perform well on either the weight or the input side of digital PIM. We take the opportunities offered by digital PIM's bit-serial processing and in-memory customization to tackle the above challenges by co-designing the sparse algorithm, multiplication dataflow, and PIM architecture. At the algorithm level, we propose double-broadcast hybrid-grained pruning to exploit weight sparsity with a better accuracy-efficiency balance. At the dataflow level, we propose a bit-serial Booth in-SRAM multiplication dataflow for stable acceleration from the input side (a plain radix-4 Booth sketch follows this entry). At the architecture level, we design a sparse digital PIM (SDP) accelerator with customized SRAM-PIM macros to support the proposed techniques. SDP achieves 3.59x, 8.15x, and 3.11x area efficiency, and 6.95x, 29.44x, and 39.40x energy savings, over the state-of-the-art sparse NN architectures SIGMA, SRE, and Bit Prudent.
- Published
- 2023
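- Sketch
As background for the "bit-serial Booth in-SRAM multiplication dataflow", here is plain radix-4 Booth recoding with a serial shift-and-add loop: two multiplier bits are retired per step, and each step adds only 0, ±w, or ±2w. This shows the standard arithmetic identity only; SDP's in-memory dataflow and sparsity handling are not modeled.

def booth_digits(x, bits=8):
    """Radix-4 Booth recoding of a signed integer into digits in {-2..2}."""
    u = x & ((1 << bits) - 1)                      # two's-complement bit view
    b = [0] + [(u >> i) & 1 for i in range(bits)]  # pad a 0 below the LSB
    return [b[i] + b[i + 1] - 2 * b[i + 2] for i in range(0, bits, 2)]

def booth_multiply(x, w, bits=8):
    """Serial multiply: one recoded digit (two multiplier bits) per step."""
    acc = 0
    for k, d in enumerate(booth_digits(x, bits)):
        acc += (d * w) << (2 * k)                  # add 0, ±w, or ±2w, shifted
    return acc

for x in range(-128, 128):
    for w in (-77, 0, 1, 55):
        assert booth_multiply(x, w) == x * w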
15. ECSSD: Hardware/Data Layout Co-Designed In-Storage-Computing Architecture for Extreme Classification
- Author
-
Li, Siqi, Tu, Fengbin, Liu, Liu, Lin, Jilan, Wang, Zheng, Kang, Yangwook, Ding, Yufei, and Xie, Yuan
- Abstract
With the rapid growth of classification scale in deep learning systems, the final classification layer becomes extreme classification, with a memory footprint exceeding the main memory capacity of the CPU or GPU. The emerging in-storage-computing technique offers an opportunity, since SSDs have enough storage capacity for the parameters of extreme classification. However, the limited performance of naive in-storage-computing schemes is insufficient to support the heavy workload of extreme classification. To this end, we propose ECSSD, the first hardware/data layout co-designed in-storage-computing architecture for extreme classification, based on an approximate screening algorithm. We propose an alignment-free floating-point MAC circuit technique to improve the computational ability under the limited area budget of in-storage-computing schemes, so that the computational ability can match the SSD's high internal bandwidth. We present a heterogeneous data layout design for the 4/32-bit weight data in the approximate screening algorithm to avoid data transfer interference and further utilize the internal DRAM bandwidth of the SSD. Moreover, we propose a learning-based adaptive interleaving framework to balance the access workload in each flash channel and improve channel-level bandwidth utilization. Putting them together, our ECSSD achieves 3.24-49.87x performance improvements compared with state-of-the-art baselines.
- Published
- 2023
16. ReDCIM: Reconfigurable Digital Computing-in-Memory Processor with Unified FP/INT Pipeline for Cloud AI Acceleration
- Author
-
Tu, Fengbin, Wang, Yiqi, Wu, Zihan, Liang, Ling, Ding, Yufei, Kim, Bongjin, Liu, Leibo, Wei, Shaojun, Xie, Yuan, and Yin, Shouyi
- Abstract
Cloud AI acceleration has drawn great attention in recent years, as big models become a popular trend in deep learning. Cloud AI runs high-efficiency inference, high-accuracy inference, and training, demanding flexible floating-point (FP)/integer (INT) multiply-accumulation (MAC) support. Many computing-in-memory (CIM) processors have been proposed for efficient AI acceleration, but they usually rely on analog CIM techniques that only suit high-efficiency neural network (NN) inference with low-precision INT MAC support. Since cloud AI demands high efficiency, high accuracy, and high flexibility simultaneously, we propose ReDCIM, a reconfigurable digital CIM architecture that meets all three requirements. ReDCIM is the first CIM-based cloud AI processor; it constructs a unified FP/INT pipeline architecture based on exponent pre-alignment and reconfigurable in-memory accumulation (a toy pre-alignment sketch follows this entry). Bitwise in-memory Booth multiplication is proposed to further reduce computation on CIM. The fabricated ReDCIM chip achieves a state-of-the-art energy efficiency of 29.2 TFLOPS/W at BF16 and 36.5 TOPS/W at INT8.
- Published
- 2023
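- Sketch
A toy Python rendition of the exponent pre-alignment idea: take the max exponent over a group of products, shift every mantissa onto that common grid, then reduce with pure integer adds (the CIM-friendly part). The fixed-point width and rounding below are arbitrary illustrative choices, not ReDCIM's mantissa pipeline.

import numpy as np

def prealigned_dot(a, b, frac_bits=24):
    """FP dot product via exponent pre-alignment + integer accumulation."""
    prods = [np.frexp(float(x) * float(y)) for x, y in zip(a, b)]  # (mantissa, exp)
    e_max = max(e for _, e in prods)
    acc = 0
    for m, e in prods:
        acc += int(round(m * 2 ** frac_bits)) >> (e_max - e)       # align, then add
    return acc * 2.0 ** (e_max - frac_bits)

a = np.random.randn(64).astype(np.float32)
b = np.random.randn(64).astype(np.float32)
print(prealigned_dot(a, b), float(a @ b))    # equal up to alignment rounding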
17. 16.4 TensorCIM: A 28nm 3.7nJ/Gather and 8.3TFLOPS/W FP32 Digital-CIM Tensor Processor for MCM-CIM-Based Beyond-NN Acceleration
- Author
-
Tu, Fengbin, Wang, Yiqi, Wu, Zihan, Wu, Weiwei, Liu, Leibo, Hu, Yang, Wei, Shaojun, and Yin, Shouyi
- Abstract
Applications such as Graph Convolutional Networks (GCNs) and Deep Learning Recommendation Models (DLRMs) have computational and data-movement requirements beyond those seen in typical NN processing. Such beyond-NN applications typically consist of Sparse Gathering (SpG) and Sparse Algebra (SpA). SpG comprises gathering and reducing tensors from sparsely distributed addresses (in GCN's aggregation phase and DLRM's embedding layer). SpA refers to NN-based sparse tensor multiplication for the gathered tensors (in GCN's combination phase and DLRM's fully-connected layer). Due to the large application size, data movement is the main bottleneck for beyond-NN acceleration. Digital Computing-In-Memory (CIM) is an efficient and precise architecture for reducing data movement [1-3]. Large-scale beyond-NN acceleration motivates the demand for scaling out digital CIM processors. However, a large monolithic chip has low-yield issues due to manufacturing defects [4], which are more severe for CIM's memory-intensive logic. A Multi-Chip-Module (MCM) provides a high-yield solution for CIM scaling by integrating multiple smaller chiplets in one package [5]. Fig. 16.4.1 shows a typical MCM-CIM system with 4 CIM chiplets, but it has two challenges for beyond-NN acceleration: 1) SpG involves repeated off-chip DRAM access, inter-chiplet access and redundant reduction operations, which increases inter-chiplet bandwidth requirements and processing latency. 2) SpA suffers from (2a) inter-CIM workload imbalance and (2b) intra-CIM under-utilization, due to irregular tensor sparsity.
- Published
- 2023
18. SPCIM: Sparsity-Balanced Practical CIM Accelerator with Optimized Spatial-Temporal Multi-Macro Utilization
- Author
-
Wang, Yiqi, Tu, Fengbin, Liu, Leibo, Wei, Shaojun, Xie, Yuan, and Yin, Shouyi
- Abstract
Compute-in-memory (CIM) is a promising technique that reduces data movement in neural network (NN) acceleration. To achieve higher efficiency, some recent CIM accelerators exploit NN sparsity based on CIM's small-grained operation unit (OU) feature. However, new problems arise in a practical multi-macro accelerator: the mismatch between workload parallelism and CIM macro organization causes spatial under-utilization, and the macros' differing computation times lead to temporal under-utilization. To solve the under-utilization problems, we propose a Sparsity-balanced Practical CIM accelerator (SPCIM), including optimized dataflow and hardware architecture design. For the CIM dataflow design, we first propose a reconfigurable cluster topology for CIM macro organization. Then we regularize weight sparsity in the OU-height pattern and reorder the weight matrix based on the sparsity ratio. The cluster topology can be reshaped to match workload parallelism for higher spatial utilization, and each CIM cluster's workload is dynamically rebalanced for higher temporal utilization. Our hardware architecture supports the proposed dataflow with a spatial input dispatcher and a temporal workload allocator. Experimental results show that, compared with a baseline sparse CIM accelerator that suffers from spatial and temporal under-utilization, SPCIM achieves 2.94x speedup and 2.86x energy saving. The proposed sparsity-balanced dataflow and architecture are generic and scalable, and can be applied to other CIM accelerators. We strengthen two state-of-the-art CIM accelerators with the SPCIM techniques, improving their energy efficiency by 1.92x and 5.59x, respectively.
- Published
- 2023
19. STAR: An STGCN ARchitecture for Skeleton-Based Human Action Recognition
- Author
-
Wu, Weiwei, Tu, Fengbin, Niu, Mengqi, Yue, Zhiheng, Liu, Leibo, Wei, Shaojun, Li, Xiangyu, Hu, Yang, and Yin, Shouyi
- Abstract
Skeleton-based human action recognition (HAR) has drawn increasing attention recently. As an emerging approach for skeleton-based HAR tasks, the Spatial-Temporal Graph Convolution Network (STGCN) achieves remarkable performance by fully exploiting skeleton topology information via graph convolution. Unfortunately, existing GCN accelerators lose efficiency when processing STGCN models due to two limitations. To overcome these limitations, this paper proposes STAR, an STGCN architecture for skeleton-based human action recognition. STAR is designed based on the characteristics of the different computation phases in STGCN. For limitation (1), a spatial-temporal dimension consistent (STDC) dataflow is proposed to fully exploit the data reuse opportunities in all the different dimensions of STGCN. For limitation (2), we propose a node-wise exponent sharing scheme and a temporal-structured redundancy elimination mechanism to exploit the inherent temporal redundancy introduced by STGCN. To further address the under-utilization induced by redundancy elimination, we design a dynamic data scheduler to manage the feature data storage and schedule the features and weights for valid computation in real time. STAR achieves 4.48x, 5.98x, 2.54x, and 103.88x energy savings on average over HyGCN, AWB-GCN, TPU, and the Jetson TX2 GPU.
- Published
- 2023
20. Reconfigurability, Why It Matters in AI Tasks Processing: A Survey of Reconfigurable AI Chips
- Author
-
Wei, Shaojun, Lin, Xinhan, Tu, Fengbin, Wang, Yang, Liu, Leibo, and Yin, Shouyi
- Abstract
Nowadays, artificial intelligence (AI) technologies, especially deep neural networks (DNNs), play a vital role in solving many problems in both academia and industry. To simultaneously meet the demands of performance, energy efficiency, and flexibility in DNN processing, various reconfigurable AI chips have been proposed in the past several years. They are based on FPGA or CGRA platforms and have domain-specific reconfigurability to customize the computing units and data paths for different DNN tasks without re-fabricating the chips. This paper surveys typical reconfigurable AI chips across three reconfiguration hierarchies: processing element level, processing element array level, and chip level. Each reconfiguration hierarchy covers a set of important optimization techniques for DNN computation that are frequently adopted in practice. The paper lists reconfigurable AI chip works in chronological order, discusses the hardware development process for each optimization technique, and analyzes the necessity of reconfigurability in AI task processing. Trends for each reconfiguration hierarchy and insights on combining techniques from different hierarchies are also presented.
- Published
- 2023
21. SPG: Structure-Private Graph Database via SqueezePIR
- Author
-
Liang, Ling, Lin, Jilan, Qu, Zheng, Ahmad, Ishtiyaque, Tu, Fengbin, Gupta, Trinabh, Ding, Yufei, and Xie, Yuan
- Abstract
Many relational data in our daily life are represented as graphs, making graph applications an important workload. Because of the large scale of graph datasets, moving graph data to the cloud has become a popular option. To keep a confidential and private graph secure from an untrusted cloud server, many cryptographic techniques are leveraged to hide the content of the data. However, protecting only the data content is not enough for a graph database, because the structural information of the graph can be revealed through the database access track. In this work, we study the graph neural network (GNN), an important graph workload that mines information from a graph database. We find that the server is able to infer which node is being processed during the edge retrieving phase and also learn its neighbor indices during GNN's aggregation phase, leaking the graph structure information. In this work, we present SPG, a structure-private graph database with SqueezePIR. SPG is built on top of Private Information Retrieval (PIR), which securely hides which nodes/neighbors are accessed. In addition, we propose SqueezePIR, a compression technique to overcome the computation overhead of PIR. Based on our evaluation, SqueezePIR achieves an 11.85× speedup on average with less than 2% accuracy loss when compared to the state-of-the-art FastPIR protocol.
- Published
- 2023
22. DyBit: Dynamic Bit-Precision Numbers for Efficient Quantized Neural Network Inference
- Author
-
Zhou, Jiajun, Wu, Jiajun, Gao, Yizhao, Ding, Yuhao, Tao, Chaofan, Li, Boyu, Tu, Fengbin, Cheng, Kwang-Ting, So, Hayden Kwok-Hay, and Wong, Ngai
- Abstract
To accelerate the inference of deep neural networks (DNNs), quantization with low-bitwidth numbers is actively researched. A prominent challenge is to quantize DNN models into low-bitwidth numbers without significant accuracy degradation, especially at very low bitwidths (< 8 bits). This work targets an adaptive data representation with variable-length encoding called DyBit. DyBit can dynamically adjust the precision and range of its separate bit-fields to adapt to the distribution of DNN weights/activations. We also propose a hardware-aware quantization framework with a mixed-precision accelerator to trade off inference accuracy and speedup. Experimental results demonstrate that the inference accuracy with DyBit is 1.997% higher than the state-of-the-art at 4-bit quantization, and the proposed framework can achieve up to 8.1x speedup compared with the original model.
- Published
- 2024
- Full Text
- View/download PDF
23. HDSuper: High-Quality and High Computational Utilization Edge Super-Resolution Accelerator With Hardware-Algorithm Co-Design Techniques
- Author
-
Zhao, Xin, Chang, Liang, Fan, Dongqi, Hu, Zhicheng, Yue, Ting, Tu, Fengbin, and Zhou, Jun
- Abstract
Super-resolution (SR) techniques have been employed to construct high-definition images from low-quality images. Various neural networks have demonstrated excellent image-reconstruction quality in SR accelerators. However, deploying SR networks on edge devices is limited by the resources and power consumption induced by the large number of algorithm parameters, computation complexity, and external memory accesses. This work explores hardware-algorithm co-design techniques to provide an end-to-end platform with a lightweight super-resolution network (LSR) and an efficient, high-quality SR accelerator, HDSuper. For the algorithm design, improved depth-wise separable convolution and pixel-shuffle layers are developed to reduce network size and computation complexity under the hardware constraints, and improved channel attention (CA) blocks enhance the image reconstruction quality. For the hardware accelerator design, we design a unified computing core (UCC) combined with an efficient flattening-and-allocation (F-A) mapping strategy to support various operators with high computational utilization. In addition, we design a patch computing scheme to reduce the external memory accesses of the hardware architecture. Based on the evaluation, the proposed algorithm achieves high-quality image reconstruction with 37.44 dB, and the reported implementation power figures are 2.08 W and 152 mW.
- Published
- 2024
- Full Text
- View/download PDF
24. DyBit: Dynamic Bit-Precision Numbers for Efficient Quantized Neural Network Inference
- Author
-
Zhou, Jiajun, Wu, Jiajun, Gao, Yizhao, Ding, Yuhao, Tao, Chaofan, Li, Boyu, Tu, Fengbin, Cheng, Kwang-Ting, So, Hayden Kwok-Hay, and Wong, Ngai
- Published
- 2023
- Full Text
- View/download PDF
25. SDP: Co-Designing Algorithm, Dataflow, and Architecture for In-SRAM Sparse NN Acceleration
- Author
-
Tu, Fengbin, Wang, Yiqi, Liang, Ling, Ding, Yufei, Liu, Leibo, Wei, Shaojun, Yin, Shouyi, and Xie, Yuan
- Published
- 2023
- Full Text
- View/download PDF
26. SPCIM: Sparsity-Balanced Practical CIM Accelerator With Optimized Spatial-Temporal Multi-Macro Utilization
- Author
-
Wang, Yiqi, Tu, Fengbin, Liu, Leibo, Wei, Shaojun, Xie, Yuan, and Yin, Shouyi
- Published
- 2023
- Full Text
- View/download PDF
27. H2Learn: High-Efficiency Learning Accelerator for High-Accuracy Spiking Neural Networks
- Author
-
Liang, Ling, Qu, Zheng, Chen, Zhaodong, Tu, Fengbin, Wu, Yujie, Deng, Lei, Li, Guoqi, Li, Peng, and Xie, Yuan
- Published
- 2022
- Full Text
- View/download PDF
28. GQNA: Generic Quantized DNN Accelerator With Weight-Repetition-Aware Activation Aggregating
- Author
-
Yang, Jianxun, Tu, Fengbin, Li, Yixuan, Wang, Yiqi, Liu, Leibo, Wei, Shaojun, and Yin, Shouyi
- Published
- 2022
- Full Text
- View/download PDF
29. A 28nm 29.2TFLOPS/W BF16 and 36.5TOPS/W INT8 Reconfigurable Digital CIM Processor with Unified FP/INT Pipeline and Bitwise In-Memory Booth Multiplication for Cloud Deep Learning Acceleration
- Author
-
Tu, Fengbin, Wang, Yiqi, Wu, Zihan, Liang, Ling, Ding, Yufei, Kim, Bongjin, Liu, Leibo, Wei, Shaojun, Xie, Yuan, and Yin, Shouyi
- Abstract
Many computing-in-memory (CIM) processors have been proposed for edge deep learning (DL) acceleration. They usually rely on analog CIM techniques to achieve high-efficiency NN inference with low-precision INT multiply-accumulation (MAC) support [1]. Different from edge DL, cloud DL has higher accuracy requirements for NN inference and training, which demand extra support for high-precision floating-point (FP) MAC. As shown in Fig. 15.5.1, applying CIM techniques to cloud DL has three main limitations: 1) FP MAC tightly couples exponent alignment with INT mantissa MAC, and implementing complex exponent alignment in memory would harm CIM's direct accumulation structure and reduce efficiency. 2) FP MAC's energy is dominated by the INT mantissa MAC, so further acceleration of CIM-based INT MAC is critical for processor efficiency. 3) Previous cloud DL processors usually have separate FP and INT engines but only activate one engine at a time [2], which causes high area overhead and low resource utilization.
- Published
- 2022
30. INSPIRE: IN-Storage Private Information REtrieval via Protocol and Architecture Co-design
- Author
-
Lin, Jilan, Liang, Ling, Qu, Zheng, Ahmad, Ishtiyaque, Liu, Liu, Tu, Fengbin, Gupta, Trinabh, Ding, Yufei, and Xie, Yuan
- Abstract
Private Information Retrieval (PIR) plays a vital role in secure, database-centric applications. However, existing PIR protocols explore a massive working space containing hundreds of GiBs of query and database data. As a consequence, PIR performance is severely bounded by storage communication, making it far from practical for real-world deployment. In this work, we describe INSPIRE, an accelerator for IN-Storage Private Information REtrieval. INSPIRE follows a protocol and architecture co-design approach. We first design the INSPIRE protocol with a multi-stage filtering mechanism, which achieves a constant PIR query size. For a 1-billion-entry database of size 288GiB, INSPIRE's protocol reduces the query size from 27GiB to 3.6MiB. Further, we propose the INSPIRE hardware, a heterogeneous in-storage architecture, which integrates our protocol across the SSD hierarchy. Together with the INSPIRE protocol, the INSPIRE hardware reduces the query time from 28.4min to 36s, relative to the state-of-the-art FastPIR scheme.
- Published
- 2022
31. Accelerating Spatiotemporal Supervised Training of Large-Scale Spiking Neural Networks on GPU
- Author
-
Liang, Ling, Chen, Zhaodong, Deng, Lei, Tu, Fengbin, Li, Guoqi, and Xie, Yuan
- Abstract
Spiking neural networks (SNNs) have great potential to achieve brain-like intelligence, but they suffer from the low accuracy of conventional synaptic plasticity rules and low training efficiency on GPUs. Recently, emerging backpropagation through time (BPTT) inspired learning algorithms bring new opportunities to boost the accuracy of SNNs, while training on GPUs remains inefficient due to the complex spatiotemporal dynamics and huge memory consumption, which restricts model exploration for SNNs and impedes the advance of neuromorphic computing. In this work, we build a framework to solve the inefficiency of BPTT-based SNN training on modern GPUs. To reduce memory consumption, we optimize the dataflow by saving CONV/FC results only in the forward pass and recomputing other intermediate results in the backward pass (a usage sketch of this recomputation idea follows this entry). Then, we customize kernel functions to accelerate the neural dynamics for all training stages. Finally, we provide a PyTorch interface to make our framework easy to deploy in real systems. Compared to a vanilla PyTorch implementation, our framework achieves up to 2.13x end-to-end speedup and consumes only 0.41x peak memory on the CIFAR10 dataset. Moreover, for distributed training on the large ImageNet dataset, we achieve up to 1.81x end-to-end speedup and consume only 0.38x peak memory.
- Published
- 2022
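- Sketch
The "recompute in the backward pass" dataflow above is in the same family as activation checkpointing, which stock PyTorch exposes via torch.utils.checkpoint. Below is a minimal usage sketch with a made-up surrogate-gradient neuron; it shows the mechanism, not the paper's customized CUDA kernels.

import torch
from torch.utils.checkpoint import checkpoint

class SpikingBlock(torch.nn.Module):
    """Toy conv + surrogate-gradient neuron for BPTT-style SNN training."""
    def __init__(self, channels):
        super().__init__()
        self.conv = torch.nn.Conv2d(channels, channels, 3, padding=1)

    def neuron(self, u):
        # Cheap elementwise dynamics: recomputed in backward instead of stored.
        return torch.sigmoid(10.0 * (u - 0.5))

    def forward(self, x):
        u = self.conv(x)                                  # CONV result is saved
        return checkpoint(self.neuron, u, use_reentrant=False)

x = torch.randn(2, 8, 16, 16, requires_grad=True)
SpikingBlock(8)(x).sum().backward()
print(x.grad.shape)    # gradients flow; neuron activations were recomputed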
32. H2Learn: High-Efficiency Learning Accelerator for High-Accuracy Spiking Neural Networks
- Author
-
Liang, Ling, Qu, Zheng, Chen, Zhaodong, Tu, Fengbin, Wu, Yujie, Deng, Lei, Li, Guoqi, Li, Peng, and Xie, Yuan
- Abstract
Although spiking neural networks (SNNs) benefit from bio-plausible neural modeling, their low accuracy under the common local synaptic plasticity learning rules limits their application in many practical tasks. Recently, an emerging SNN supervised learning algorithm inspired by backpropagation through time (BPTT) from the domain of artificial neural networks (ANNs) has successfully boosted the accuracy of SNNs and helped improve their practicability. However, current general-purpose processors suffer from low efficiency when performing BPTT for SNNs due to their ANN-tailored optimizations. On the other hand, current neuromorphic chips cannot support BPTT because they mainly adopt local synaptic plasticity rules for simplified implementation. In this work, we propose H2Learn, a novel architecture that achieves high efficiency for BPTT-based SNN learning while ensuring high SNN accuracy. We first characterize the behaviors of BPTT-based SNN learning. Benefiting from the binary spike-based computation in the forward pass and weight update, we design look-up table (LUT)-based processing elements in the forward engine and weight update engine to make accumulations implicit and to fuse the computations of multiple input points. Second, benefiting from the rich sparsity in the backward pass, we design a dual-sparsity-aware backward engine, which exploits both input and output sparsity. Finally, we apply a pipeline optimization between the different engines to build an end-to-end solution for BPTT-based SNN learning. Compared with the modern NVIDIA V100 GPU, H2Learn achieves 7.38× area saving, 5.74-10.20× speedup, and 5.25-7.12× energy saving on several benchmark datasets.
- Published
- 2022
33. DOTA: Detect and Omit Weak Attentions for Scalable Transformer Acceleration
- Author
-
Qu, Zheng, Liu, Liu, Tu, Fengbin, Chen, Zhaodong, Ding, Yufei, and Xie, Yuan
- Abstract
Transformer neural networks have demonstrated leading performance in many applications spanning language understanding, image processing, and generative modeling. Despite the impressive performance, long-sequence Transformer processing is expensive due to the quadratic computation complexity and memory consumption of self-attention. In this paper, we present DOTA, an algorithm-architecture co-design that effectively addresses the challenges of scalable Transformer inference. Based on the insight that not all connections in an attention graph are equally important, we propose to jointly optimize a lightweight detector with the Transformer model to accurately detect and omit weak connections during runtime. Furthermore, we design a specialized system architecture for end-to-end Transformer acceleration using the proposed attention detection mechanism. Experiments on a wide range of benchmarks demonstrate the superior performance of DOTA over other solutions. In summary, DOTA achieves 152.6x and 4.5x performance speedup and orders-of-magnitude energy-efficiency improvements over GPU and customized hardware, respectively.
- Published
- 2022
34. Dynamic Sparse Attention for Scalable Transformer Acceleration
- Author
-
Liu, Liu, Qu, Zheng, Chen, Zhaodong, Tu, Fengbin, Ding, Yufei, and Xie, Yuan
- Abstract
Transformers are the mainstream of NLP applications and are becoming increasingly popular in other domains such as computer vision. Despite improvements in model quality, the enormous computation costs make Transformers difficult to deploy, especially when the sequence length is large in emerging applications. The attention mechanism, the essential component of the Transformer, is the execution bottleneck due to its quadratic complexity. Prior art explores sparse patterns in attention to support long-sequence modeling, but those works rely on static or fixed patterns. We demonstrate that the sparse patterns are dynamic, depending on input sequences. Thus, we propose Dynamic Sparse Attention (DSA), which can efficiently exploit dynamic sparse patterns in attention (a toy masking sketch follows this entry). Compared with other methods, our approach achieves better trade-offs between accuracy and model complexity. Moving forward, we identify challenges and provide solutions to implement DSA on existing hardware (GPUs) and specialized hardware in order to achieve practical speedup and efficiency improvements for Transformer execution.
- Published
- 2022
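- Sketch
The dynamic-mask idea reads naturally in a few lines of PyTorch: derive a per-query mask at runtime from the input, then attend only where the mask allows. In this toy version the mask comes from top-k over the exact scores; DSA instead predicts it with a low-cost approximation, which is the part that actually saves work.

import torch

def dynamic_sparse_attention(q, k, v, keep=0.25):
    """Per-query dynamic sparsity: keep only the strongest score positions."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    n_keep = max(1, int(keep * scores.shape[-1]))
    cutoff = scores.topk(n_keep, dim=-1).values[..., -1:]   # per-row threshold
    scores = scores.masked_fill(scores < cutoff, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q = torch.randn(1, 4, 128, 64)           # (batch, heads, seq, head_dim)
out = dynamic_sparse_attention(q, torch.randn_like(q), torch.randn_like(q))
print(out.shape)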
35. A 28nm 15.59µJ/Token Full-Digital Bitline-Transpose CIM-Based Sparse Transformer Accelerator with Pipeline/Parallel Reconfigurable Modes
- Author
-
Tu, Fengbin, Wu, Zihan, Wang, Yiqi, Liang, Ling, Liu, Liu, Ding, Yufei, Wei, Shaojun, Xie, Yuan, and Yin, Shouyi
- Abstract
Transformer models have achieved state-of-the-art results in many fields, like natural language processing and computer vision, but their large number of matrix multiplications (MMs) results in substantial data movement and computation, causing high latency and energy. In recent years, computing-in-memory (CIM) has been demonstrated as an efficient MM architecture, but a Transformer's attention mechanism raises new challenges for CIM in both memory access and computation aspects (Fig. 29.3.1): 1a) Unlike conventional static MM with pre-trained weights, the attention layers introduce dynamic MM (QKT, A'V), whose weights and inputs are both generated at runtime, leading to redundant off-chip memory accesses for intermediate data. 1b) A CIM pipeline architecture can mitigate the above problem but produces a new challenge: since the K generation direction does not match the conventional CIM write direction, the QKT pipeline needs a large transpose buffer with extra overhead. 2) Compared with fully connected (FC) layers, attention layers dominate a Transformer's computation and require >8b precision to maintain accuracy, so previous analog CIMs [1]-[2] with ≤8b precision support cannot be directly used. Reducing the amount of computation in attention layers is critical for efficiency improvement.
- Published
- 2022
36. GQNA: Generic Quantized DNN Accelerator With Weight-Repetition-Aware Activation Aggregating
- Author
-
Yang, Jianxun, Tu, Fengbin, Li, Yixuan, Wang, Yiqi, Liu, Leibo, Wei, Shaojun, and Yin, Shouyi
- Abstract
Quantization is a prominent approach to compress the model sizes of deep neural networks (DNNs); it clusters high-precision weights into a smaller set of quantization levels and represents high-precision weights by low-precision indexes. To achieve the same accuracy, nonuniform quantized DNNs (NUQ-DNNs) with unequal quantization intervals need lower index precision than uniform quantized DNNs (UQ-DNNs) with equal intervals, achieving smaller model sizes. Hence, deploying NUQ-DNNs on accelerators costs fewer on- and off-chip memory accesses than UQ-DNNs, which is more valuable for edge devices. However, accelerating NUQ-DNNs is nontrivial, since weight indexes cannot be directly used for computation. Previous NUQ-DNN accelerators adopt standard convolutions that decode weight indexes into actual weights multiplied with activations, causing abundant look-up overhead and redundant computation. In this work, we propose a weight-repetition-aware activation aggregating (WPAA) convolution approach to accelerate inference of variable-precision NUQ- and UQ-DNNs (a scalar sketch follows this entry). By merging convolutions of multiple kernels, WPAA requires no look-up operation and removes redundant computation. Based on WPAA, we design a generic quantized DNN accelerator (GQNA). Furthermore, we propose a layer-adaptive kernel-reordering merging scheme to adjust the merging order of kernels offline to minimize the energy consumption of GQNA. Implemented in TSMC 28-nm technology, GQNA achieves 31.9 and 32.6 TOPS/W energy efficiency for 1-b UQ- and NUQ-VGG-16, respectively.
- Published
- 2022
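- Sketch
The aggregating idea above has a compact scalar form: with nonuniform quantization, the dot product sum_i q[idx_i] * x_i can be regrouped as sum_k q_k * (sum of x_i whose index is k), so activations are added first and each quantization level is multiplied exactly once, with no per-weight table lookup. A numpy sketch under assumed shapes; GQNA's kernel merging and reordering are not modeled.

import numpy as np

def wpaa_dot(idx, levels, x):
    """Weight-repetition-aware dot product in the index domain."""
    agg = np.zeros(len(levels))
    np.add.at(agg, idx, x)            # aggregate activations per level index
    return float(agg @ levels)        # one multiply per quantization level

levels = np.array([-0.5, -0.1, 0.1, 0.5])    # 2-bit nonuniform level table
idx = np.random.randint(0, 4, size=256)      # stored low-precision weight indexes
x = np.random.randn(256)
assert np.isclose(wpaa_dot(idx, levels, x), levels[idx] @ x)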
37. Dynamic Sparse Attention for Scalable Transformer Acceleration
- Author
-
Liu, Liu, Qu, Zheng, Chen, Zhaodong, Tu, Fengbin, Ding, Yufei, and Xie, Yuan
- Published
- 2022
- Full Text
- View/download PDF
38. ADROIT: An Adaptive Dynamic Refresh Optimization Framework for DRAM Energy Saving in DNN Training
- Author
-
Lin, Xinhan, Sun, Liang, Tu, Fengbin, Liu, Leibo, Li, Xiangyu, Wei, Shaojun, and Yin, Shouyi
- Abstract
To achieve high accuracy, DNN training usually consumes and generates vast amounts of data, which requires a large DRAM for efficient processing. The refresh power consumption of large DRAMs has become a severe problem. Previous refresh energy saving methods have drawbacks in usability, flexibility, or training support. We propose ADROIT, an adaptive dynamic refresh optimization framework for various DNNs and processing platforms. ADROIT dynamically adjusts the refresh rates for different types of data according to runtime loss feedback during DNN training. Data idle time, lifetime, and size are taken into consideration to reduce the search space of refresh rates and remove most refresh operations. Experimental results show that ADROIT can reduce the refresh energy and total DRAM energy in DNN training by up to 98.9% and 24.7%, respectively, while maintaining accuracy. Moreover, ADROIT automatically applies to different DNNs and hardware platforms without tedious manual configuration.
- Published
- 2021
39. Evolver: A Deep Learning Processor with On-Device Quantization-Voltage-Frequency Tuning
- Author
-
Tu, Fengbin, Wu, Weiwei, Wang, Yang, Chen, Hongjiang, Xiong, Feng, Shi, Man, Li, Ning, Deng, Jinyi, Chen, Tianbao, Liu, Leibo, Wei, Shaojun, Xie, Yuan, and Yin, Shouyi
- Abstract
When deploying deep neural networks (DNNs) onto deep learning processors, we usually exploit mixed-precision quantization and voltage-frequency scaling to make tradeoffs among accuracy, latency, and energy. Conventional methods usually determine the quantization-voltage-frequency (QVF) policy before DNNs are deployed onto local devices, so they struggle to make optimal customizations for local user scenarios. In this article, we solve the problem by enabling on-device QVF tuning with a new deep learning processor architecture, Evolver. Evolver has a QVF tuning mode to deploy DNNs with local customizations before normal execution. In this mode, Evolver uses reinforcement learning to search for the optimal QVF policy based on direct hardware feedback from the chip itself. After that, Evolver runs the newly quantized DNN inference under the searched voltage and frequency. To improve the performance and energy efficiency of both training and inference, we introduce bidirectional speculation and runtime reconfiguration techniques into the architecture. To the best of our knowledge, Evolver is the first deep learning processor that utilizes on-device QVF tuning to achieve both customized and optimal DNN deployment.
- Published
- 2021
40. Brain-Inspired Computing: Adventure from Beyond CMOS Technologies to Beyond von Neumann Architectures
- Author
-
Amrouch, Hussam, Chen, Jian Jia, Roy, Kaushik, Xie, Yuan, Chakraborty, Indranil, Huangfu, Wenqin, Liang, Ling, Tu, Fengbin, Wang, Cheng, and Yayla, Mikail
- Abstract
The goal of this special session paper is to introduce and discuss different breakthrough technologies as well as novel architectures and how, together, they may reshape the future of artificial intelligence. Our aim is to provide a comprehensive overview of the latest advances in brain-inspired computing and how the latter can be realized when emerging technologies, using beyond-CMOS devices, are coupled with novel computing paradigms that go beyond von Neumann architectures. Different emerging technologies like the Ferroelectric Field-Effect Transistor (FeFET), Phase Change Memory (PCM), and Resistive RAM (ReRAM) are discussed, demonstrating their promising capability for building neuromorphic computing architectures that are inspired by nature. In addition, this special session paper discusses various novel concepts such as Logic-in-Memory (LIM), Processing-in-Memory (PIM), and Spiking Neural Networks (SNNs), exploring the far-reaching consequences of beyond-von-Neumann computing on accelerating deep learning. Finally, the latest trends in brain-inspired computing are summarized into algorithm-, technology-, and application-driven innovations, comparing different PIM architectures.
- Published
- 2021
41. STC: Significance-aware transform-based codec framework for external memory access reduction
- Author
-
Xiong, Feng, Tu, Fengbin, Shi, Man, Wang, Yang, Liu, Leibo, Wei, Shaojun, and Yin, Shouyi
- Abstract
Deep convolutional neural networks (DCNNs), with their extensive computation, require considerable external memory bandwidth and storage for intermediate feature maps. External memory accesses for feature maps have become a significant energy bottleneck for DCNN accelerators. Much work has been done on quantizing feature maps into low precision to decrease the costs of computation and storage. Beyond that, the large amount of correlation among channels in feature maps can be exploited to further reduce external memory access. Towards this end, we propose a novel compression framework called Significance-aware Transform-based Codec (STC). In its compression process, a significance-aware transform is introduced to obtain low-correlated feature maps in an orthogonal space, as the intrinsic representations of the original feature maps. The transformed feature maps are quantized and encoded to compress external data transmission (a toy codec sketch follows this entry). For the next layer's computation, the data are reloaded through STC's reconstruction process. The STC framework can be supported with a small set of extensions to current DCNN accelerators. We implement the STC extensions on the baseline TPU architecture for hardware evaluation. The strengthened TPU achieves an average 2.57x reduction in external memory access and a 1.95x~2.78x improvement in system-level energy efficiency, with a negligible accuracy loss of only 0.5%.
- Published
- 2020
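- Sketch
A toy stand-in for the compression pipeline above: decorrelate channels with an orthogonal transform, quantize the transformed coefficients, and invert on reload. A PCA basis substitutes for the learned significance-aware transform, and the keep ratio and step size are arbitrary; this reproduces the shape of the codec, not its trained behavior.

import numpy as np

def stc_like_codec(fmap, keep=0.5, step=0.05):
    """Encode/decode a (C, H*W) feature map via channel decorrelation."""
    mu = fmap.mean(axis=1, keepdims=True)
    u, _, _ = np.linalg.svd(fmap - mu, full_matrices=False)
    basis = u[:, : int(keep * fmap.shape[0])]           # significant components
    coded = np.round(basis.T @ (fmap - mu) / step)      # transform + quantize
    return basis @ (coded * step) + mu                  # reconstruction on reload

fmap = np.random.randn(32, 1) @ np.random.randn(1, 196) \
       + 0.1 * np.random.randn(32, 196)                 # correlated channels
rec = stc_like_codec(fmap)
print("relative error:", np.linalg.norm(rec - fmap) / np.linalg.norm(fmap))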
42. DUET: Boosting deep neural network efficiency on dual-module architecture
- Author
-
Liu, Liu, Qu, Zheng, Deng, Lei, Tu, Fengbin, Li, Shuangchen, Hu, Xing, Gu, Zhenyu, Ding, Yufei, and Xie, Yuan
- Abstract
Deep Neural Networks (DNNs) have been driving the mainstream of machine learning applications. However, deploying DNNs on modern hardware with stringent latency requirements and energy constraints is challenging because of the compute-intensive and memory-intensive execution patterns of various DNN models. We propose an algorithm-architecture co-design to boost DNN execution efficiency. Leveraging the noise resilience of nonlinear activation functions in DNNs, we propose dual-module processing that uses approximate modules learned from the original DNN layers to compute insensitive activations. Therefore, we can save the expensive computations and data accesses of unnecessary sensitive activations. We then design an Executor-Speculator dual-module architecture with support for balanced execution and memory access reduction. With acceptable model inference quality degradation, our accelerator design achieves 2.24x speedup and 1.97x energy efficiency improvement for compute-bound Convolutional Neural Networks (CNNs) and memory-bound Recurrent Neural Networks (RNNs).
- Published
- 2020
43. Reconfigurable Architecture for Neural Approximation in Multimedia Computing
- Author
-
Tu, Fengbin, Yin, Shouyi, Ouyang, Peng, Liu, Leibo, and Wei, Shaojun
- Abstract
Due to inherent error resiliency, many high-performance multimedia applications can be approximated by multi-layer perceptrons (MLPs) with little quality loss. An MLP accelerator can be designed to improve the power efficiency of multimedia systems. However, previous MLP accelerators' fixed computational pattern lowers performance when the MLP topology varies across applications. In this paper, we propose a scheduling framework to guide the mapping of MLPs onto limited hardware resources. The scheduling framework adjusts the computational patterns for various MLP topologies, obtaining 30% higher performance than conventional scheduling. We implement a reconfigurable neural architecture (RNA) to support the different patterns in the framework and further improve performance and efficiency. RNA achieves a speedup of 572x on the approximable part, a whole-application speedup of 7.9x, and energy savings of 6.3x, with little quality loss on the benchmarks.
- Published
- 2019
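A toy cost model in the spirit of the scheduling framework above: for each MLP layer it picks the output-tile size that best fills a fixed array of multiply-accumulate units. PE_COUNT, the cost formula, and the ignored reduction cost are all simplifying assumptions:

import math

PE_COUNT = 64  # assumed size of the MAC array

def cycles(n_in, n_out, out_tile):
    # out_tile neurons run concurrently; the PEs left over split each
    # neuron's inputs. The final reduction cost is ignored for simplicity.
    in_parallel = max(1, PE_COUNT // out_tile)
    per_pass = math.ceil(n_in / in_parallel)
    passes = math.ceil(n_out / out_tile)
    return passes * per_pass

def best_schedule(layers):
    # For each (n_in, n_out) layer, try every tile size and keep the best.
    return [min((cycles(i, o, t), t) for t in range(1, PE_COUNT + 1))
            for i, o in layers]

# A small approximation MLP: 9 inputs -> 32 hidden -> 1 output.
for c, tile in best_schedule([(9, 32), (32, 1)]):
    print(f"out-tile {tile}: {c} cycles")

The point the example illustrates is that the best tile differs per layer, which is why a fixed computational pattern leaves performance on the table.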
44. Parana: A Parallel Neural Architecture Considering Thermal Problem of 3D Stacked Memory
- Author
-
Yin, Shouyi, Tang, Shibin, Lin, Xinhan, Ouyang, Peng, Tu, Fengbin, Liu, Leibo, Zhao, Jishen, Xu, Cong, Li, Shuangcheng, Xie, Yuan, and Wei, Shaojun
- Abstract
Recent advances in deep learning (DL) have stimulated increasing interest in neural networks (NNs). From the perspective of operation type and network architecture, deep neural networks can be categorized into convolution-based neural networks (ConvNets), recurrent neural networks (RNNs), and fully-connected neural networks (FCNets). Different types of neural networks are usually cascaded and combined into a hybrid neural network (hybrid-NN) to complete real-life cognitive tasks. Such hybrid-NN implementations are memory-intensive, with a large number of memory accesses, so hybrid-NN performance is often limited by insufficient memory bandwidth. A '3D + 2.5D' integration system, which places a high-bandwidth 3D stacked DRAM side-by-side with a highly-parallel neural processing unit (NPU) on a silicon interposer, overcomes the bandwidth bottleneck in hybrid-NN acceleration. However, the intensive concurrent 3D DRAM accesses produced by the NPU lead to a serious thermal problem in the 3D DRAM. In this paper, we propose a neural processor called Parana for hybrid-NN acceleration that takes the thermal problem of 3D DRAM into consideration. Parana addresses the thermal problem by optimizing both the total number of memory accesses and the memory access behavior. For the access behavior, Parana balances memory bandwidth by spatial-division mapping of the hybrid-NN onto computing resources, which prevents masses of memory accesses from being issued within a short time period. To reduce the total number of memory accesses, we design a new NPU architecture and propose a memory-oriented tiling and scheduling mechanism that maximizes the utilization of the on-chip buffer (a toy tiling-selection sketch follows this record). Experimental results show that Parana reduces the peak temperature by up to 54.72 °C and the steady-state temperature by up to 32.27 °C over state-of-the-art accelerators with 3D memory, without performance degradation. © 1990-2012 IEEE.
- Published
- 2019
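A toy version of memory-oriented tiling as described above: enumerate output-tile shapes for one convolution layer, keep those whose working set fits the on-chip buffer, and pick the one with the least DRAM traffic. The tile candidates, traffic formula, and buffer model are illustrative assumptions, not Parana's actual mechanism:

import itertools

def best_tiling(H, W, Cin, Cout, K, buf_bytes, elem=2):
    spatial = (1, 2, 4, 7, 14, 28)                 # illustrative tile candidates
    channel = (1, 2, 4, 8, 16, 32, 64, 128, 256)
    best = None
    for th, tw, tc in itertools.product(spatial, spatial, channel):
        if th > H or tw > W or tc > Cout:
            continue
        in_tile = (th + K - 1) * (tw + K - 1) * Cin    # input halo included
        wt_tile = K * K * Cin * tc
        out_tile = th * tw * tc
        if (in_tile + wt_tile + out_tile) * elem > buf_bytes:
            continue                                   # working set must fit
        n_tiles = -(-H // th) * -(-W // tw) * -(-Cout // tc)
        # Element counts: reload inputs and weights per tile, store outputs once.
        traffic = n_tiles * (in_tile + wt_tile) + H * W * Cout
        if best is None or traffic < best[0]:
            best = (traffic, (th, tw, tc))
    return best

print(best_tiling(H=28, W=28, Cin=128, Cout=256, K=3, buf_bytes=128 * 1024))

Fewer total DRAM accesses also means fewer concurrent accesses per unit time, which is the lever on 3D-DRAM temperature that the record above describes.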
45. A High Throughput Acceleration for Hybrid Neural Networks with Efficient Resource Management on FPGA
- Author
-
Yin, Shouyi, Tang, Shibin, Lin, Xinhan, Ouyang, Peng, Tu, Fengbin, Liu, Leibo, and Wei, Shaojun
- Abstract
Deep learning has promoted the development of artificial intelligence and achieved remarkable successes in many intelligent applications. Convolution-based layers (CLs), fully-connected layers (FLs), and recurrent layers (RLs) are the three types of layers in classic neural networks. Most intelligent tasks are implemented by hybrid neural networks (hybrid-NNs), which are commonly composed of different layer blocks (LBs) of CLs, FLs, and RLs. Because the CLs require the most computation in hybrid-NNs, many field-programmable gate array (FPGA)-based accelerators focus on CL acceleration and have demonstrated great performance. However, CL-only accelerators underutilize FPGA resources when accelerating a whole hybrid-NN. To fully exploit the logic resources and the memory bandwidth when accelerating CLs/FLs/RLs, we propose a resource-efficient FPGA mapping mechanism for hybrid-NNs. The mechanism first improves DSP utilization by packing multiple small bit-width operations onto one DSP (see the sketch after this record). It then uses LB-level spatial mapping to exploit the complementary features of the different networks in the hybrid-NN. We evaluate the mapping mechanism by implementing four hybrid-NNs on a Xilinx Virtex7 690T FPGA. The proposed mechanism achieves a peak performance of 1805.8 giga operations per second (GOPS). Our analysis of resource utilization and throughput shows that the proposed method exploits more of the FPGA's computing power and achieves up to 4.13× higher throughput than the state-of-the-art acceleration. © 1982-2012 IEEE.
- Published
- 2019
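A small arithmetic sketch of the DSP-packing idea referenced above: two independent unsigned 8-bit weights share one wide multiplication with a common activation. The 18-bit guard offset is an assumption chosen so the partial products cannot overlap; signed operands would need an extra correction term:

def packed_mul(a, w1, w2, guard=18):
    """a, w1, w2: unsigned 8-bit values. Returns (w1*a, w2*a) from one multiply."""
    assert 0 <= a < 256 and 0 <= w1 < 256 and 0 <= w2 < 256
    packed = (w1 << guard) | w2        # w2*a < 2**16, so 18 guard bits suffice
    product = packed * a               # the single wide multiplication
    return product >> guard, product & ((1 << guard) - 1)

hi, lo = packed_mul(200, 17, 99)
assert (hi, lo) == (17 * 200, 99 * 200)
print(hi, lo)

On hardware, the wide product fits a single DSP slice, so one slice delivers two low-precision multiplies per cycle instead of one.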
46. Towards Efficient Compact Network Training on Edge-Devices
- Author
-
Xiong, Feng, Tu, Fengbin, Yin, Shouyi, and Wei, Shaojun
- Abstract
Currently, there is a trend to deploy training on edge devices, which is crucial to future AI applications in various scenarios with transfer-learning and online-learning demands. Accuracy may degrade severely when trained models are deployed directly on edge devices, because the local environment forms an edge-local dataset that often differs from the generic dataset. However, training on edge devices with limited computing and memory capability is a challenging problem. In this paper, we propose a novel quantization training framework for efficient compact network training on edge devices. First, training-aware symmetric quantization is introduced to quantize all of the data types in the training process. Then, channel-wise quantization is adopted for compact network quantization; it has significantly higher tolerance to quantization errors and makes the training process more stable (a sketch follows this record). For further training efficiency, we build a hardware evaluation platform to evaluate different network settings, so as to achieve a better trade-off among accuracy, energy, and latency. Finally, we evaluate two widely used compact networks on a domain-adaptation dataset for image classification; the results demonstrate that the proposed methods achieve an 8.4×-17.2× reduction in energy and an 11.9×-16.3× reduction in latency compared with 32-bit implementations, while maintaining the classification accuracy. © 2019 IEEE.
- Published
- 2019
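A minimal sketch of channel-wise symmetric quantization as described above; the bit-width, tensor shapes, and epsilon guard are illustrative, and the straight-through estimator used during backpropagation is omitted:

import numpy as np

def quantize_channelwise(w, bits=8):
    """w: (Cout, Cin, K, K) weights. Returns int8 weights and per-channel scales."""
    qmax = 2 ** (bits - 1) - 1
    scales = np.abs(w).reshape(w.shape[0], -1).max(axis=1) / qmax
    scales = np.maximum(scales, 1e-8)          # guard all-zero channels
    s = scales[:, None, None, None]
    q = np.clip(np.round(w / s), -qmax, qmax).astype(np.int8)
    return q, scales

w = np.random.randn(16, 8, 3, 3).astype(np.float32)
q, scales = quantize_channelwise(w)
err = float(np.abs(q * scales[:, None, None, None] - w).max())
print("channels:", len(scales), "max dequantization error:", err)

Giving each output channel its own scale keeps small-magnitude channels from being crushed by one large outlier channel, which is why this is more error-tolerant than a single per-tensor scale.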
47. A High Throughput Acceleration for Hybrid Neural Networks With Efficient Resource Management on FPGA
- Author
-
Yin, Shouyi, primary, Tang, Shibin, additional, Lin, Xinhan, additional, Ouyang, Peng, additional, Tu, Fengbin, additional, Liu, Leibo, additional, and Wei, Shaojun, additional
- Published
- 2019
- Full Text
- View/download PDF
48. Reconfigurable Architecture for Neural Approximation in Multimedia Computing
- Author
-
Tu, Fengbin, primary, Yin, Shouyi, additional, Ouyang, Peng, additional, Liu, Leibo, additional, and Wei, Shaojun, additional
- Published
- 2019
- Full Text
- View/download PDF
49. Parana: A Parallel Neural Architecture Considering Thermal Problem of 3D Stacked Memory
- Author
-
Yin, Shouyi, primary, Tang, Shibin, additional, Lin, Xinhan, additional, Ouyang, Peng, additional, Tu, Fengbin, additional, Liu, Leibo, additional, Zhao, Jishen, additional, Xu, Cong, additional, Li, Shuangcheng, additional, Xie, Yuan, additional, and Wei, Shaojun, additional
- Published
- 2019
- Full Text
- View/download PDF
50. GNA: Reconfigurable and Efficient Architecture for Generative Network Acceleration
- Author
-
Yan, Jiale, primary, Yin, Shouyi, additional, Tu, Fengbin, additional, Liu, Leibo, additional, and Wei, Shaojun, additional
- Published
- 2018
- Full Text
- View/download PDF