15 results for "Tu, Fengbin"
Search Results
2. Towards efficient generative AI and beyond-AI computing: New trends on ISSCC 2024 machine learning accelerators.
- Author
Yang, Bohan, Chen, Jia, and Tu, Fengbin
- Published
- 2024
- Full Text
- View/download PDF
3. DyBit: Dynamic Bit-Precision Numbers for Efficient Quantized Neural Network Inference
- Author
Zhou, Jiajun, Wu, Jiajun, Gao, Yizhao, Ding, Yuhao, Tao, Chaofan, Li, Boyu, Tu, Fengbin, Cheng, Kwang-Ting, So, Hayden Kwok-Hay, and Wong, Ngai
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Machine Learning (cs.LG)
- Abstract
To accelerate the inference of deep neural networks (DNNs), quantization with low-bitwidth numbers is actively researched. A prominent challenge is to quantize DNN models into low-bitwidth numbers without significant accuracy degradation, especially at very low bitwidths (< 8 bits). This work targets an adaptive data representation with variable-length encoding called DyBit. DyBit can dynamically adjust the precision and range of its separate bit-fields to match the distribution of DNN weights and activations. We also propose a hardware-aware quantization framework with a mixed-precision accelerator to trade off inference accuracy and speedup. Experimental results demonstrate that the inference accuracy via DyBit is 1.997% higher than the state-of-the-art at 4-bit quantization, and the proposed framework can achieve up to 8.1x speedup compared with the original model. (An illustrative bit-allocation sketch follows this record.)
- Published
- 2023
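The DyBit record above describes adapting the split between range and precision bits to the weight/activation distribution. The sketch below is a loose illustration of that idea, not the DyBit format itself: for a fixed total bitwidth it tries every integer/fraction split of a plain fixed-point code and keeps the split with the lowest quantization error. The function names and the exhaustive search are assumptions.

```python
# Illustrative only: choose, per tensor, how a fixed total bitwidth is split
# between integer (range) and fractional (precision) bits, mimicking the idea
# of adapting a bit-field layout to the weight/activation distribution.
# This is NOT the DyBit encoding; names and the search procedure are assumptions.
import numpy as np

def quantize_fixed_point(x, int_bits, frac_bits):
    """Symmetric fixed-point quantization with the given bit split (1 sign bit)."""
    scale = 2.0 ** frac_bits
    max_q = 2 ** (int_bits + frac_bits) - 1          # magnitude levels
    q = np.clip(np.round(x * scale), -max_q, max_q)
    return q / scale

def best_bit_split(x, total_bits=4):
    """Try every integer/fraction split of (total_bits - 1) magnitude bits and
    keep the one with the smallest mean-squared quantization error."""
    best = None
    for int_bits in range(total_bits):               # sign bit is implicit
        frac_bits = (total_bits - 1) - int_bits
        err = np.mean((x - quantize_fixed_point(x, int_bits, frac_bits)) ** 2)
        if best is None or err < best[0]:
            best = (err, int_bits, frac_bits)
    return best

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = rng.normal(0.0, 0.05, size=10_000)     # narrow distribution
    err, i_bits, f_bits = best_bit_split(weights, total_bits=4)
    print(f"chosen split: {i_bits} integer / {f_bits} fraction bits, MSE={err:.2e}")
```

A narrow weight distribution, as in the example, pushes the search toward more fraction bits; a wider one favors more integer bits.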
4. Dynamic Sparse Attention for Scalable Transformer Acceleration.
- Author
Liu, Liu, Qu, Zheng, Chen, Zhaodong, Tu, Fengbin, Ding, Yufei, and Xie, Yuan
- Subjects
COMPUTER vision, MACHINE learning, SPARSE matrices
- Abstract
Transformers are the mainstream architecture for NLP applications and are becoming increasingly popular in other domains such as computer vision. Despite the improvements in model quality, the enormous computation costs make Transformers difficult to deploy, especially when the sequence length is large in emerging applications. The attention mechanism, the essential component of the Transformer, is the execution bottleneck due to its quadratic complexity. Prior art explores sparse patterns in attention to support long sequence modeling, but those works rely on static or fixed patterns. We demonstrate that the sparse patterns are dynamic, depending on input sequences. Thus, we propose Dynamic Sparse Attention (DSA), which can efficiently exploit dynamic sparse patterns in attention. Compared with other methods, our approach achieves better trade-offs between accuracy and model complexity. Moving forward, we identify challenges and provide solutions to implement DSA on existing hardware (GPUs) and specialized hardware in order to achieve practical speedup and efficiency improvements for Transformer execution. (An illustrative sketch of input-dependent attention sparsity follows this record.)
- Published
- 2022
- Full Text
- View/download PDF
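The DSA abstract above argues that useful attention sparsity is input-dependent rather than fixed. The following sketch is only an illustration of that notion, not the paper's method: it keeps the top-k keys per query, so the resulting mask changes with every input sequence. The top-k rule, the value of k, and all names are assumptions.

```python
# A minimal sketch (not the paper's DSA implementation): derive a per-input
# sparsity mask from the attention scores themselves by keeping only the top-k
# keys per query, then run masked softmax attention.
import numpy as np

def dynamic_sparse_attention(q, k, v, keep=8):
    """q, k, v: (seq_len, d). Returns attention output with a dynamic top-k mask."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                     # (seq_len, seq_len)
    # Dynamic mask: the kept positions depend on this particular input sequence.
    kth = np.partition(scores, -keep, axis=-1)[:, -keep][:, None]
    mask = scores >= kth
    scores = np.where(mask, scores, -np.inf)          # drop all other keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, mask

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    seq, d = 64, 32
    out, mask = dynamic_sparse_attention(rng.normal(size=(seq, d)),
                                         rng.normal(size=(seq, d)),
                                         rng.normal(size=(seq, d)), keep=8)
    print(out.shape, f"kept fraction = {mask.mean():.2%}")   # roughly 8/64 per row
```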
5. H2Learn: High-Efficiency Learning Accelerator for High-Accuracy Spiking Neural Networks.
- Author
Liang, Ling, Qu, Zheng, Chen, Zhaodong, Tu, Fengbin, Wu, Yujie, Deng, Lei, Li, Guoqi, Li, Peng, and Xie, Yuan
- Subjects
ARTIFICIAL neural networks, MACHINE learning, GRAPHICS processing units, REINFORCEMENT learning, NEUROPLASTICITY, CONVOLUTIONAL neural networks, SUPERVISED learning
- Abstract
Although spiking neural networks (SNNs) benefit from bio-plausible neural modeling, the low accuracy achievable under common local synaptic plasticity learning rules limits their application in many practical tasks. Recently, an emerging SNN supervised learning algorithm inspired by backpropagation through time (BPTT) from the domain of artificial neural networks (ANNs) has successfully boosted the accuracy of SNNs and improved their practicability. However, current general-purpose processors suffer from low efficiency when performing BPTT for SNNs due to ANN-tailored optimizations. On the other hand, current neuromorphic chips cannot support BPTT because they mainly adopt local synaptic plasticity rules for simplified implementation. In this work, we propose H2Learn, a novel architecture that achieves high efficiency for BPTT-based SNN learning while ensuring high accuracy. We begin by characterizing the behavior of BPTT-based SNN learning. First, benefiting from the binary spike-based computation in the forward pass and weight update, we design look-up table (LUT)-based processing elements in the forward engine and weight update engine to make accumulations implicit and to fuse the computations of multiple input points. Second, benefiting from the rich sparsity in the backward pass, we design a dual-sparsity-aware backward engine that exploits both input and output sparsity. Finally, we apply a pipeline optimization between the different engines to build an end-to-end solution for BPTT-based SNN learning. Compared with the modern NVIDIA V100 GPU, H2Learn achieves 7.38× area saving, 5.74–10.20× speedup, and 5.25–7.12× energy saving on several benchmark datasets. (A toy look-up-table sketch follows this record.)
- Published
- 2022
- Full Text
- View/download PDF
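The H2Learn abstract mentions LUT-based processing elements that make accumulations implicit and fuse several binary-spike inputs per lookup. The sketch below shows the generic version of that trick under stated assumptions (group size, table layout, plain NumPy); it is not the H2Learn datapath.

```python
# Illustrative sketch only: with binary spike inputs, a dot product reduces to
# summing the weights at spiking positions. Precomputing a small look-up table
# over groups of input bits fuses several inputs into one table read.
import numpy as np

def build_luts(weights, group=4):
    """For every group of `group` weights, tabulate the partial sum for all
    2**group possible spike patterns."""
    n_groups = len(weights) // group
    luts = np.zeros((n_groups, 2 ** group))
    for g in range(n_groups):
        w = weights[g * group:(g + 1) * group]
        for pattern in range(2 ** group):
            bits = [(pattern >> b) & 1 for b in range(group)]
            luts[g, pattern] = np.dot(bits, w)
    return luts

def lut_dot(spikes, luts, group=4):
    """Dot product of a binary spike vector with the tabulated weights."""
    total = 0.0
    for g in range(luts.shape[0]):
        bits = spikes[g * group:(g + 1) * group]
        pattern = sum(int(b) << i for i, b in enumerate(bits))
        total += luts[g, pattern]              # one lookup replaces `group` MACs
    return total

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    w = rng.normal(size=64)
    s = (rng.random(64) < 0.2).astype(np.int8)  # sparse binary spikes
    assert np.isclose(lut_dot(s, build_luts(w)), float(s @ w))
    print("LUT result matches the dense dot product")
```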
6. GQNA: Generic Quantized DNN Accelerator With Weight-Repetition-Aware Activation Aggregating.
- Author
Yang, Jianxun, Tu, Fengbin, Li, Yixuan, Wang, Yiqi, Liu, Leibo, Wei, Shaojun, and Yin, Shouyi
- Subjects
ARTIFICIAL neural networks, ACCELERATOR mass spectrometry, ENERGY consumption
- Abstract
Quantization is a prominent approach to compressing the model sizes of deep neural networks (DNNs): it clusters high-precision weights into a smaller set of quantization levels and represents the high-precision weights by low-precision indexes. To achieve the same accuracy, nonuniform quantized DNNs (NUQ-DNNs) with unequal quantization intervals need lower index precision than uniform quantized DNNs (UQ-DNNs) with equal intervals, yielding smaller model sizes. Hence, deploying NUQ-DNNs on accelerators costs fewer on- and off-chip memory accesses than UQ-DNNs, which is especially valuable for edge devices. However, accelerating NUQ-DNNs is nontrivial, since weight indexes cannot be used directly in computations. Previous NUQ-DNN accelerators adopt standard convolutions by decoding weight indexes into actual weights multiplied with activations, incurring substantial look-up overhead and redundant computations. In this work, we propose a weight-repetition-aware activation aggregating (WPAA) convolution approach to accelerate inference of variable-precision NUQ- and UQ-DNNs. By merging convolutions of multiple kernels, WPAA requires no look-up operation and removes redundant computations. Based on WPAA, we design a generic quantized DNN accelerator (GQNA). Furthermore, we propose a layer-adaptive kernel-reordering merging scheme that adjusts the kernel merging order offline to minimize the energy consumption of GQNA. Implemented in TSMC 28-nm technology, GQNA achieves 31.9 and 32.6 TOPS/W energy efficiency for 1-b UQ- and NUQ-VGG-16, respectively. (A toy sketch of the weight-repetition idea follows this record.)
- Published
- 2022
- Full Text
- View/download PDF
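The GQNA abstract builds on the observation that quantized weights repeat, so activations sharing a weight level can be aggregated before any multiplication. The snippet below is a minimal dot-product illustration of that observation with hypothetical names; it is not the WPAA convolution dataflow, which additionally merges multiple kernels and avoids index decoding in hardware.

```python
# Minimal sketch (assumptions, not the GQNA datapath): when weights are quantized
# to a few levels, a dot product can be computed by first aggregating the
# activations that share each weight index and then doing one multiply per level,
# instead of one multiply per weight.
import numpy as np

def repetition_aware_dot(activations, weight_indexes, levels):
    """activations: float vector; weight_indexes: per-weight level index;
    levels: codebook of quantization levels (the actual weight values)."""
    agg = np.zeros(len(levels))
    for a, idx in zip(activations, weight_indexes):    # index-guided accumulation
        agg[idx] += a                                  # no multiplications here
    return float(agg @ levels)                         # one multiply per level

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    levels = np.array([-0.5, -0.1, 0.1, 0.5])          # e.g. 2-bit non-uniform levels
    idx = rng.integers(0, len(levels), size=256)
    act = rng.normal(size=256)
    assert np.isclose(repetition_aware_dot(act, idx, levels), float(act @ levels[idx]))
    print("aggregated result matches the decoded dot product")
```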
7. Evolver: A Deep Learning Processor With On-Device Quantization–Voltage–Frequency Tuning.
- Author
Tu, Fengbin, Wu, Weiwei, Wang, Yang, Chen, Hongjiang, Xiong, Feng, Shi, Man, Li, Ning, Deng, Jinyi, Chen, Tianbao, Liu, Leibo, Wei, Shaojun, Xie, Yuan, and Yin, Shouyi
- Subjects
REINFORCEMENT learning, COMPUTER architecture, ENERGY consumption, DEEP learning, MACHINE learning
- Abstract
When deploying deep neural networks (DNNs) onto deep learning processors, we usually exploit mixed-precision quantization and voltage–frequency scaling to make tradeoffs among accuracy, latency, and energy. Conventional methods usually determine the quantization–voltage–frequency (QVF) policy before DNNs are deployed onto local devices, which makes it difficult to customize optimally for local user scenarios. In this article, we solve the problem by enabling on-device QVF tuning with a new deep learning processor architecture, Evolver. Evolver has a QVF tuning mode to deploy DNNs with local customizations before normal execution. In this mode, Evolver uses reinforcement learning to search for the optimal QVF policy based on direct hardware feedback from the chip itself. After that, Evolver runs the newly quantized DNN inference under the searched voltage and frequency. To improve the performance and energy efficiency of both training and inference, we introduce bidirectional speculation and runtime reconfiguration techniques into the architecture. To the best of our knowledge, Evolver is the first deep learning processor that utilizes on-device QVF tuning to achieve both customized and optimal DNN deployment. (A simplified policy-search sketch follows this record.)
- Published
- 2021
- Full Text
- View/download PDF
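Evolver searches a quantization–voltage–frequency policy on the device itself using reinforcement learning driven by hardware feedback. The sketch below only mimics the outer control flow: it substitutes plain random search for RL and a made-up reward model for real chip measurements, so every number, candidate list, and function name is an assumption.

```python
# Hedged sketch: a stand-in for on-device QVF tuning, not Evolver's method.
# A random search over (bitwidth, voltage, frequency) candidates is scored by a
# placeholder reward so the control flow is runnable end to end.
import random

CANDIDATES = {
    "bits": [4, 6, 8],
    "voltage": [0.6, 0.8, 1.0],     # V
    "freq": [200, 400, 800],        # MHz
}

def measure(policy):
    """Placeholder for running the quantized DNN on-chip and reading back
    accuracy, latency, and energy. Replace with real measurements."""
    acc = 1.0 - 0.02 * (8 - policy["bits"])              # fewer bits -> lower accuracy
    latency = policy["bits"] ** 0.5 / policy["freq"]     # more bits -> slower
    energy = policy["voltage"] ** 2 * policy["freq"]     # ~ C * V^2 * f
    return acc, latency, energy

def reward(policy, acc_floor=0.95):
    acc, latency, energy = measure(policy)
    if acc < acc_floor:                                  # hard accuracy constraint
        return float("-inf")
    return -(latency + 1e-4 * energy)                    # minimize latency and energy

def search(trials=200, seed=4):
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        policy = {k: rng.choice(v) for k, v in CANDIDATES.items()}
        r = reward(policy)
        if best is None or r > best[0]:
            best = (r, policy)
    return best

if __name__ == "__main__":
    print(search())
```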
8. A High Throughput Acceleration for Hybrid Neural Networks With Efficient Resource Management on FPGA.
- Author
Yin, Shouyi, Tang, Shibin, Lin, Xinhan, Ouyang, Peng, Tu, Fengbin, Liu, Leibo, and Wei, Shaojun
- Subjects
ARTIFICIAL neural networks, RESOURCE management, FIELD programmable gate arrays, DEEP learning, ARTIFICIAL intelligence
- Abstract
Deep learning has driven the development of artificial intelligence and achieved remarkable successes across intelligent applications. Convolution-based layers (CLs), fully connected layers (FLs), and recurrent layers (RLs) are three types of layers in classic neural networks. Most intelligent tasks are implemented by hybrid neural networks (hybrid-NNs), which are commonly composed of different layer blocks (LBs) of CLs, FLs, and RLs. Because CLs require the most computation in hybrid-NNs, many field-programmable gate array (FPGA)-based accelerators focus on CL acceleration and have demonstrated great performance. However, CL-focused accelerators lead to underutilization of FPGA resources when accelerating a whole hybrid-NN. To fully exploit the logic resources and the memory bandwidth in the acceleration of CLs/FLs/RLs, we propose an FPGA resource-efficient mapping mechanism for hybrid-NNs. The mechanism first improves the utilization of DSPs by integrating multiple small bit-width operations on one DSP. Then, LB-level spatial mapping is used to exploit the complementary features between different neural networks in the hybrid-NN. We evaluate the mapping mechanism by implementing four hybrid-NNs on a Xilinx Virtex7 690T FPGA. The proposed mechanism achieves a peak performance of 1805.8 giga operations per second (GOPS). With the analysis of resource utilization and throughput, the proposed method exploits more computing power in the FPGA and achieves up to 4.13× higher throughput than the state-of-the-art acceleration. (A toy DSP-packing sketch follows this record.)
- Published
- 2019
- Full Text
- View/download PDF
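The abstract above improves DSP utilization by integrating multiple small bit-width operations on one DSP. The sketch below shows the generic packing trick behind that idea for unsigned 8-bit operands that share one multiplicand; the shift amount and the unsigned-only restriction are simplifying assumptions, not the paper's exact mapping.

```python
# Illustrative sketch of the DSP-packing idea (not the paper's mapping): two
# unsigned 8-bit multiplications that share one operand are fused into a single
# wide multiply, because the two products land in disjoint bit fields.
# Signed packing would need correction terms that are omitted here.

def packed_dual_multiply(a, b, w, shift=16):
    """Compute (a*w, b*w) with one wide multiplication.
    Requires b*w < 2**shift so the low field cannot overflow into the high one."""
    assert 0 <= a < 256 and 0 <= b < 256 and 0 <= w < 256
    packed = (a << shift) + b            # one packed operand
    product = packed * w                 # the single "DSP" multiply
    bw = product & ((1 << shift) - 1)    # low field  -> b*w
    aw = product >> shift                # high field -> a*w
    return aw, bw

if __name__ == "__main__":
    for a, b, w in [(17, 200, 255), (255, 255, 255), (0, 1, 2)]:
        assert packed_dual_multiply(a, b, w) == (a * w, b * w)
    print("packed multiply matches the two separate products")
```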
9. Parana: A Parallel Neural Architecture Considering Thermal Problem of 3D Stacked Memory.
- Author
Yin, Shouyi, Tang, Shibin, Lin, Xinhan, Ouyang, Peng, Tu, Fengbin, Liu, Leibo, Zhao, Jishen, Xu, Cong, Li, Shuangcheng, Xie, Yuan, and Wei, Shaojun
- Subjects
DEEP learning, NEURAL circuitry, ARTIFICIAL intelligence, VIDEO surveillance, MACHINE learning
- Abstract
Recent advances in deep learning (DL) have stimulated increasing interest in neural networks (NNs). From the perspective of operation type and network architecture, deep neural networks can be categorized into convolution-based neural networks (ConvNets), recurrent neural networks (RNNs), and fully-connected neural networks (FCNets). Different types of neural networks are usually cascaded and combined into a hybrid neural network (hybrid-NN) to complete real-life cognitive tasks. Such hybrid-NN implementations are memory-intensive with a large number of memory accesses, hence the performance of hybrid-NNs is often limited by insufficient memory bandwidth. A “3D + 2.5D” integration system, which integrates a high-bandwidth 3D stacked DRAM side-by-side with a highly parallel neural processing unit (NPU) on a silicon interposer, overcomes the bandwidth bottleneck in hybrid-NN acceleration. However, intensive concurrent 3D DRAM accesses produced by the NPU lead to a serious thermal problem in the 3D DRAM. In this paper, we propose a neural processor called Parana for hybrid-NN acceleration that takes the thermal problem of 3D DRAM into account. Parana addresses the thermal problem of 3D memory by optimizing both the total number of memory accesses and the memory access behavior. For memory access behavior, Parana balances the memory bandwidth by spatial-division mapping of the hybrid-NN onto computing resources, which prevents large bursts of memory accesses from being issued within a short time period. To reduce the total number of memory accesses, we design a new NPU architecture and propose a memory-oriented tiling and scheduling mechanism to maximize the utilization of the on-chip buffer. Experimental results show that Parana reduces the peak temperature by up to 54.72 °C and the steady-state temperature by up to 32.27 °C over state-of-the-art accelerators with 3D memory, without performance degradation. (A simplified tiling cost-model sketch follows this record.)
- Published
- 2019
- Full Text
- View/download PDF
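Parana reduces 3D-DRAM accesses partly through memory-oriented tiling that maximizes on-chip buffer utilization. The sketch below is a deliberately crude stand-in for that step, not Parana's scheduler: it enumerates square output tiles that fit a buffer budget and picks the one with the lowest estimated DRAM traffic. The traffic model, layer shape, and buffer size are all assumptions.

```python
# Simplified sketch (assumptions throughout): estimate off-chip traffic of one
# convolution layer for every square output-tile size that fits the on-chip
# buffer, then pick the tile that minimizes total accesses. The model reloads
# the full weight set once per tile and ignores halo reuse between tiles.
import math

def traffic(tile, H=56, W=56, C=128, M=128, K=3, word=2):
    """Bytes moved from DRAM when the output is processed in tile x tile blocks
    (all M output channels per block)."""
    n_tiles = math.ceil(H / tile) * math.ceil(W / tile)
    in_tile = (tile + K - 1) ** 2 * C * word            # input patch per tile
    weights = K * K * C * M * word                       # reloaded every tile (pessimistic)
    out = H * W * M * word                                # outputs written once
    return n_tiles * (in_tile + weights) + out

def buffer_bytes(tile, C=128, M=128, K=3, word=2):
    """On-chip storage needed for one tile: input patch, weights, output tile."""
    return ((tile + K - 1) ** 2 * C + K * K * C * M + tile * tile * M) * word

def best_tile(buffer_budget=512 * 1024):
    fits = [t for t in range(1, 57) if buffer_bytes(t) <= buffer_budget]
    return min(fits, key=traffic)

if __name__ == "__main__":
    t = best_tile()
    print(f"best tile = {t}x{t}, estimated traffic = {traffic(t) / 1e6:.1f} MB")
```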
10. GNA: Reconfigurable and Efficient Architecture for Generative Network Acceleration.
- Author
Yan, Jiale, Yin, Shouyi, Tu, Fengbin, Liu, Leibo, and Wei, Shaojun
- Subjects
GENERATIVE programming (Computer science), COMPUTER network architectures, ADAPTIVE computing systems, SIGNAL convolution, COMPUTER scheduling, BANDWIDTHS, DECONVOLUTION of digital images
- Abstract
Generative networks have become ubiquitous in image generation applications like image super-resolution, image-to-image translation, and text-to-image synthesis. They are usually composed of convolutional (CONV) layers, convolution-based residual blocks, and deconvolutional (DeCONV) layers. Previous works on neural network acceleration focus on optimizing CONV-layer computation, e.g., through data reuse or parallel computation, but achieve low processing element (PE) utilization when computing residual blocks and DeCONV layers: residual blocks require very high memory bandwidth when performing elementwise additions on residual paths, and DeCONV layers have imbalanced operation counts for different outputs. In this paper, we propose a dual convolution mapping method for CONV and DeCONV layers to make full use of the available PE resources. A cross-layer scheduling method is also proposed to avoid extra off-chip memory accesses in residual block processing. Precision-adaptive PEs and buffer bandwidth reconfiguration are used to support flexible bitwidths for both inputs and weights in deep neural networks. We implement a generative network accelerator (GNA) based on intra-PE processing, inter-PE processing, and cross-layer scheduling techniques. Owing to the proposed optimization techniques, GNA achieves an energy efficiency of 2.05 TOPS/W with 61% higher PE utilization than traditional methods in generative network acceleration. (A toy deconvolution-mapping sketch follows this record.)
- Published
- 2018
- Full Text
- View/download PDF
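The GNA abstract notes that DeCONV layers have imbalanced operation counts across outputs, which the dual convolution mapping avoids. The 1-D sketch below illustrates the underlying issue and the usual remedy under assumed sizes (stride 2, arbitrary lengths); it is not GNA's mapping. Zero-insertion wastes multiplies on zeros and gives output phases unequal useful work, whereas regrouping kernel taps by phase produces the same result with only useful multiplies.

```python
# Illustration only: a stride-2 1-D transposed convolution computed two ways.
# With a 5-tap kernel, even outputs use 3 kernel taps and odd outputs use 2,
# so the naive route has uneven useful work per output position.
import numpy as np

def deconv_zero_insert(x, w, stride=2):
    """Zero-insert the input, then run an ordinary convolution."""
    up = np.zeros(len(x) * stride)
    up[::stride] = x                                  # interleave zeros
    return np.convolve(up, w)[: len(x) * stride]      # many multiplies hit zeros

def deconv_by_phase(x, w, stride=2):
    """Group kernel taps by output phase so every multiply is useful."""
    out = np.zeros(len(x) * stride)
    for phase in range(stride):
        taps = w[phase::stride]                       # sub-kernel for this phase
        out[phase::stride] = np.convolve(x, taps)[: len(x)]
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    x, w = rng.normal(size=16), rng.normal(size=5)
    assert np.allclose(deconv_zero_insert(x, w), deconv_by_phase(x, w))
    print("phase-grouped deconvolution matches zero-insertion")
```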
11. A High Energy Efficient Reconfigurable Hybrid Neural Network Processor for Deep Learning Applications.
- Author
Yin, Shouyi, Ouyang, Peng, Tang, Shibin, Tu, Fengbin, Li, Xiudong, Zheng, Shixuan, Lu, Tianyi, Gu, Jiangyuan, Liu, Leibo, and Wei, Shaojun
- Subjects
MICROPROCESSORS, ARTIFICIAL neural networks, DEEP learning
- Abstract
Hybrid neural networks (hybrid-NNs) have been widely used and have brought new challenges to NN processors. Thinker is an energy-efficient reconfigurable hybrid-NN processor fabricated in 65-nm technology. To achieve high energy efficiency, three optimization techniques are proposed. First, each processing element (PE) supports bit-width adaptive computing to meet the various bit-widths of neural layers, which raises computing throughput by 91% and improves energy efficiency by 1.93× on average. Second, the PE array supports on-demand array partitioning and reconfiguration for processing different NNs in parallel, which results in a 13.7% improvement in PE utilization and improves energy efficiency by 1.11×. Third, a fused data-pattern-based multi-bank memory system is designed to exploit data reuse and guarantee parallel data access, which improves computing throughput and energy efficiency by 1.11× and 1.17×, respectively. Measurement results show that this processor achieves a peak energy efficiency of 5.09 TOPS/W. (A toy bit-width splitting sketch follows this record.)
- Published
- 2018
- Full Text
- View/download PDF
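Thinker's first technique is bit-width adaptive computing in each PE. The sketch below shows the generic arithmetic behind such adaptation, assuming unsigned operands and hypothetical names: a 16-bit by 8-bit product is computed on an 8-bit datapath by splitting the wide operand and recombining the partial products. It is not Thinker's PE microarchitecture.

```python
# Toy sketch only (assumptions, not Thinker's PE design): a 16-bit x 8-bit
# multiplication carried out on an 8-bit x 8-bit datapath by splitting the wide
# operand into two bytes and recombining the partial products with a shift-add.
# Stated for unsigned operands; signed operands would need sign handling.

def mul_16x8_on_8bit_pe(x16, w8):
    assert 0 <= x16 < 1 << 16 and 0 <= w8 < 1 << 8
    lo, hi = x16 & 0xFF, x16 >> 8       # split the 16-bit operand into two bytes
    p_lo = lo * w8                      # first 8x8 multiply
    p_hi = hi * w8                      # second 8x8 multiply
    return (p_hi << 8) + p_lo           # recombine the partial products

if __name__ == "__main__":
    for x, w in [(0xFFFF, 0xFF), (12345, 200), (0, 7)]:
        assert mul_16x8_on_8bit_pe(x, w) == x * w
    print("split multiply matches the wide multiply")
```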
12. Deep Convolutional Neural Network Architecture With Reconfigurable Computation Patterns.
- Author
Tu, Fengbin, Yin, Shouyi, Ouyang, Peng, Tang, Shibin, Liu, Leibo, and Wei, Shaojun
- Subjects
ARTIFICIAL neural networks, CONFIGURATION management, CONVOLUTION codes, ENERGY consumption of computers, COMPUTER vision
- Abstract
Deep convolutional neural networks (DCNNs) have been successfully used in many computer vision tasks. Previous works on DCNN acceleration usually use a fixed computation pattern for diverse DCNN models, leading to an imbalance between power efficiency and performance. We solve this problem by designing a DCNN acceleration architecture called deep neural architecture (DNA), with reconfigurable computation patterns for different models. The computation pattern comprises a data-reuse pattern and a convolution mapping method. For massive and varied layer sizes, DNA reconfigures its data paths to support a hybrid data-reuse pattern, which reduces total energy consumption by 5.9–8.4× over conventional methods. For various convolution parameters, DNA reconfigures its computing resources to support a highly scalable convolution mapping method, which obtains 93% computing resource utilization on modern DCNNs. Finally, a layer-based scheduling framework is proposed to balance DNA’s power efficiency and performance for different DCNNs. DNA is implemented in an area of 16 mm² at 65 nm. On the benchmarks, it achieves 194.4 GOPS at 200 MHz and consumes only 479 mW. The system-level power efficiency is 152.9 GOPS/W (considering DRAM access power), which outperforms state-of-the-art designs by one to two orders of magnitude. (A toy reuse-pattern selection sketch follows this record.)
- Published
- 2017
- Full Text
- View/download PDF
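DNA reconfigures its data paths to a hybrid data-reuse pattern chosen per layer. The sketch below captures only the flavor of that choice with a deliberately crude cost model; all shapes, the buffer size, and the chunking rule are assumptions, not DNA's scheduler. It estimates DRAM traffic for keeping inputs, weights, or outputs stationary and picks the cheapest pattern for each layer.

```python
# Toy per-layer reuse-pattern selection (not DNA's cost model): the stationary
# operand is chunked to fit the on-chip buffer, and every chunk forces one full
# pass over the streamed operands.
import math

def cost(stationary_size, streamed_sizes, buffer_words):
    passes = math.ceil(stationary_size / buffer_words)
    return stationary_size + passes * sum(streamed_sizes)

def choose_pattern(layer, buffer_words=128 * 1024):
    H, W, C, M, K = (layer[k] for k in "HWCMK")
    inp, wts, out = H * W * C, K * K * C * M, H * W * M   # operand sizes in words
    options = {
        "input-stationary":  cost(inp, [wts, out], buffer_words),
        "weight-stationary": cost(wts, [inp, out], buffer_words),
        "output-stationary": cost(out, [inp, wts], buffer_words),
    }
    return min(options, key=options.get), options

if __name__ == "__main__":
    early = dict(H=112, W=112, C=64,  M=64,  K=3)   # activation-heavy layer
    late  = dict(H=7,   W=7,   C=512, M=512, K=3)   # weight-heavy layer
    for name, layer in [("early", early), ("late", late)]:
        best, opts = choose_pattern(layer)
        print(name, "->", best, {k: f"{v / 1e6:.1f}M" for k, v in opts.items()})
```

Running it shows the activation-heavy early layer preferring weight-stationary reuse while the weight-heavy late layer prefers keeping activations on chip, the kind of per-layer divergence a single fixed pattern cannot serve.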
13. Neural approximating architecture targeting multiple application domains.
- Author
Tu, Fengbin, Yin, Shouyi, Ouyang, Peng, Liu, Leibo, and Wei, Shaojun
- Published
- 2015
- Full Text
- View/download PDF
14. RNA: A reconfigurable architecture for hardware neural acceleration.
- Author
Tu, Fengbin, Yin, Shouyi, Ouyang, Peng, Liu, Leibo, and Wei, Shaojun
- Published
- 2015
15. Erratum to “Evolver: a Deep Learning Processor With On-Device Quantization-Voltage-Frequency Tuning”.
- Author
Tu, Fengbin, Wu, Weiwei, Wang, Yang, Chen, Hongjiang, Xiong, Feng, Shi, Man, Li, Ning, Deng, Jinyi, Chen, Tianbao, Liu, Leibo, Wei, Shaojun, Xie, Yuan, and Yin, Shouyi
- Subjects
DEEP learning
- Abstract
In the above article, some references were cited incorrectly in the text. The corrections are as follows:
- Published
- 2021
- Full Text
- View/download PDF