Descriptor: "Systolic array" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Systolic array"' showing total 3,617 results


1. Enhancing Computation-Efficiency of Deep Neural Network Processing on Edge Devices through Serial/Parallel Systolic Computing.

Author: Moghaddasi, Iraj and Nam, Byeong-Gyu
Subjects: ARTIFICIAL neural networks, IMAGE recognition (Computer vision), ENERGY consumption, PARALLEL programming, ARRAY processing
Abstract: In recent years, deep neural networks (DNNs) have addressed new applications with intelligent autonomy, often achieving higher accuracy than human experts. This capability comes at the expense of the ever-increasing complexity of emerging DNNs, causing enormous challenges while deploying on resource-limited edge devices. Improving the efficiency of DNN hardware accelerators by compression has been explored previously. Existing state-of-the-art studies applied approximate computing to enhance energy efficiency even at the expense of a little accuracy loss. In contrast, bit-serial processing has been used for improving the computational efficiency of neural processing without accuracy loss, exploiting a simple design, dynamic precision adjustment, and computation pruning. This research presents Serial/Parallel Systolic Array (SPSA) and Octet Serial/Parallel Systolic Array (OSPSA) processing elements for edge DNN acceleration, which exploit bit-serial processing on systolic array architecture for improving computational efficiency. For evaluation, all designs were described at the RTL level and synthesized in 28 nm technology. Post-synthesis cycle-accurate simulations of image classification over DNNs illustrated that, on average, a sample 16 × 16 systolic array indicated remarkable improvements of 17.6% and 50.6% in energy efficiency compared to the baseline, with no loss of accuracy. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF
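
Note on result 1: the abstract credits bit-serial processing with dynamic precision adjustment and computation pruning. The sketch below is only a minimal illustration of that general idea, not the SPSA/OSPSA processing element from the paper. The function name `bit_serial_dot`, the `bits` parameter, and the restriction to unsigned weights are assumptions made for this example.

```python
def bit_serial_dot(activations, weights, bits=8):
    """Dot product that consumes one weight bit per step.

    Assumes every weight is a non-negative integer that fits in `bits` bits.
    """
    acc = 0
    for b in range(bits):  # one step per weight bit position, LSB first
        # Only add activations whose weight has bit b set; a zero bit
        # contributes nothing, which is where pruning would save work.
        partial = sum(a for a, w in zip(activations, weights) if (w >> b) & 1)
        acc += partial << b  # weight the partial sum by 2**b
    return acc


print(bit_serial_dot([3, 5, 7], [2, 4, 255]))  # 1811
```

Lowering `bits` when all weights are known to be small shortens the loop, which is one simple reading of "dynamic precision adjustment".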

2. Enhancing Computation-Efficiency of Deep Neural Network Processing on Edge Devices through Serial/Parallel Systolic Computing

Author: Iraj Moghaddasi and Byeong-Gyu Nam
Subjects: systolic array, DNN accelerator, serial inference engine, TPU, energy efficiency, Computer engineering. Computer hardware, TK7885-7895
Abstract: In recent years, deep neural networks (DNNs) have addressed new applications with intelligent autonomy, often achieving higher accuracy than human experts. This capability comes at the expense of the ever-increasing complexity of emerging DNNs, causing enormous challenges while deploying on resource-limited edge devices. Improving the efficiency of DNN hardware accelerators by compression has been explored previously. Existing state-of-the-art studies applied approximate computing to enhance energy efficiency even at the expense of a little accuracy loss. In contrast, bit-serial processing has been used for improving the computational efficiency of neural processing without accuracy loss, exploiting a simple design, dynamic precision adjustment, and computation pruning. This research presents Serial/Parallel Systolic Array (SPSA) and Octet Serial/Parallel Systolic Array (OSPSA) processing elements for edge DNN acceleration, which exploit bit-serial processing on systolic array architecture for improving computational efficiency. For evaluation, all designs were described at the RTL level and synthesized in 28 nm technology. Post-synthesis cycle-accurate simulations of image classification over DNNs illustrated that, on average, a sample 16 × 16 systolic array indicated remarkable improvements of 17.6% and 50.6% in energy efficiency compared to the baseline, with no loss of accuracy.
Published: 2024
Full Text: View/download PDF

3. High-throughput systolic array-based accelerator for hybrid transformer-CNN networks

Author: Qingzeng Song, Yao Dai, Hao Lu, and Guanghao Jin
Subjects: Hardware accelerator, Hybrid transformer-CNN, Systolic array, FPGA, Electronic computers. Computer science, QA75.5-76.95
Abstract: In this era of Transformers enjoying remarkable success, Convolutional Neural Networks (CNNs) remain highly relevant and useful. Indeed, hybrid Transformer-CNN network architectures, which combine the benefits of both approaches, have achieved impressive results. Vision Transformer (ViT) is a significant neural network architecture that features a convolutional layer as its first layer, primarily built on the transformer framework. However, owing to the distinct computation patterns inherent in attention and convolution, existing hardware accelerators for these two models are typically designed separately and lack a unified approach toward accelerating both models efficiently. In this paper, we present a dedicated accelerator on a field-programmable gate array (FPGA) platform. The accelerator, which integrates a configurable three-dimensional systolic array, is specifically designed to accelerate the inferential capabilities of hybrid Transformer-CNN networks. The Convolution and Transformer computations can be mapped to a systolic array by unifying these operations for matrix multiplication. Softmax and LayerNorm which are frequently used in hybrid Transformer-CNN networks were also implemented on FPGA boards. The accelerator achieved high performance with a peak throughput of 722 GOP/s at an average energy efficiency of 53 GOPS/W. Its respective computation latencies were 51.3 ms, 18.1 ms, and 6.8 ms for ViT-Base, ViT-Small, and ViT-Tiny. The accelerator provided a 12× improvement in energy efficiency compared to the CPU, a 2.3× improvement compared to the GPU, and a 1.5× to 2× improvement compared to existing accelerators regarding speed and energy efficiency.
Published: 2024
Full Text: View/download PDF

4. Principle of Embedded AI Chips

Author: Li, Bin and Li, Bin
Published: 2024
Full Text: View/download PDF

5. ExaFlexHH: an exascale-ready, flexible multi-FPGA library for biologically plausible brain simulations.

Author: Miedema, Rene and Strydis, Christos
Subjects: COMPUTATIONAL neuroscience, NEUROPLASTICITY, SCALABILITY, LIBRARIES
Abstract: Introduction: In-silico simulations are a powerful tool in modern neuroscience for enhancing our understanding of complex brain systems at various physiological levels. To model biologically realistic and detailed systems, an ideal simulation platform must possess: (1) high performance and performance scalability, (2) flexibility, and (3) ease of use for non-technical users. However, most existing platforms and libraries do not meet all three criteria, particularly for complex models such as the Hodgkin-Huxley (HH) model or for complex neuron-connectivity modeling such as gap junctions. Methods: This work introduces ExaFlexHH, an exascale-ready, flexible library for simulating HH models on multi-FPGA platforms. Utilizing FPGA-based Data-Flow Engines (DFEs) and the dataflow programming paradigm, ExaFlexHH addresses all three requirements. The library is also parameterizable and compliant with NeuroML, a prominent brain-description language in computational neuroscience. We demonstrate the performance scalability of the platform by implementing a highly demanding extended-Hodgkin-Huxley (eHH) model of the Inferior Olive using ExaFlexHH. Results: Model simulation results show linear scalability for unconnected networks and near-linear scalability for networks with complex synaptic plasticity, with a 1.99× performance increase using two FPGAs compared to a single FPGA simulation, and 7.96× when using eight FPGAs in a scalable ring topology. Notably, our results also reveal consistent performance efficiency in GFLOPS per watt, further facilitating exascale-ready computing speeds and pushing the boundaries of future brain-simulation platforms. Discussion: The ExaFlexHH library shows superior resource efficiency, quantified in FLOPS per hardware resources, benchmarked against other competitive FPGA-based brain simulation implementations. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

6. VerSA: Versatile Systolic Array Architecture for Sparse and Dense Matrix Multiplications.

Author: Seo, Juwon and Kong, Joonho
Subjects: MATRIX multiplications, SPARSE matrices, ARTIFICIAL neural networks
Abstract: A key part of modern deep neural network (DNN) applications is matrix multiplication. As DNN applications are becoming more diverse, there is a need for both dense and sparse matrix multiplications to be accelerated by hardware. However, most hardware accelerators are designed to accelerate either dense or sparse matrix multiplication. In this paper, we propose VerSA, a versatile systolic array architecture for both dense and sparse matrix multiplications. VerSA employs intermediate paths and SRAM buffers between the rows of the systolic array (SA), thereby enabling an early termination in sparse matrix multiplication with a negligible performance overhead when running dense matrix multiplication. When running sparse matrix multiplication, 256 × 256 VerSA brings performance (i.e., an inverse of execution time) improvement and energy saving by 1.21×–1.60× and 7.5–30.2%, respectively, when compared to the conventional SA. When running dense matrix multiplication, VerSA results in only a 0.52% performance overhead compared to the conventional SA. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

7. Systolic Tensor Core for Arithmetics calculations.

Author: Ibarra Carrillo, Mario Alfredo, Montiel Pérez, Jesús Yaljá, and Molina Lozano, Herón
Abstract: Today, it is the amount of data that defines the existence of mankind. Scientists respond to the large amount of required calculations by developing hardware in several directions. One of them is to increase the number of arithmetic elements. Another direction is to create new architectures that represent new algorithms for processing numerical data. We have chosen the second direction by developing a new systolic core architecture, which implies an improvement in efficiency, i.e. performing the same task with the same number of arithmetic elements but reducing the latency. Measurements are made in terms of computational capacity and the number of arithmetic elements involved in the operations. The results of the tests are compared with data from a number of selected articles. Today, we have achieved 3.2GFlops with only two modules. In the future, we plan to integrate up to four of our cores in a system with its own memory and management processor and at a higher operating frequency. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

8. Intermittent-Aware Design Exploration of Systolic Array Using Various Non-Volatile Memory: A Comparative Study.

Author: Taheri, Nedasadat, Tabrizchi, Sepehr, and Roohi, Arman
Subjects: ELECTRIC power failures, SPATIAL systems, MATHEMATICAL optimization, COMPARATIVE studies, ENERGY consumption
Abstract: This paper conducts a comprehensive study on intermittent computing within IoT environments, emphasizing the interplay between different dataflows—row, weight, and output—and a variety of non-volatile memory technologies. We then delve into the architectural optimization of these systems using a spatial architecture, namely IDEA, with their processing elements efficiently arranged in a rhythmic pattern, providing enhanced performance in the presence of power failures. This exploration aims to highlight the diverse advantages and potential applications of each combination, offering a comparative perspective. In our findings, using IDEA for the row stationary dataflow with AlexNet on the CIFAR10 dataset, we observe a power efficiency gain of 2.7% and an average reduction of 21% in the required cycles. This study elucidates the potential of different architectural choices in enhancing energy efficiency and performance in IoT systems. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

9. A Survey of Design and Optimization for Systolic Array-based DNN Accelerators.

Author: RUI XU, SHENG MA, YANG GUO, and DONGSHENG LI
Subjects: *DEEP learning, *ARTIFICIAL neural networks, *CONVOLUTIONAL neural networks, *SMART devices, *PATTERN recognition systems
Published: 2024
Full Text: View/download PDF

10. ExaFlexHH: an exascale-ready, flexible multi-FPGA library for biologically plausible brain simulations

Author: Rene Miedema and Christos Strydis
Subjects: brain simulation, FPGA, dataflow engine, systolic array, scalable, Inferior Olive, Neurosciences. Biological psychiatry. Neuropsychiatry, RC321-571
Abstract: Introduction: In-silico simulations are a powerful tool in modern neuroscience for enhancing our understanding of complex brain systems at various physiological levels. To model biologically realistic and detailed systems, an ideal simulation platform must possess: (1) high performance and performance scalability, (2) flexibility, and (3) ease of use for non-technical users. However, most existing platforms and libraries do not meet all three criteria, particularly for complex models such as the Hodgkin-Huxley (HH) model or for complex neuron-connectivity modeling such as gap junctions. Methods: This work introduces ExaFlexHH, an exascale-ready, flexible library for simulating HH models on multi-FPGA platforms. Utilizing FPGA-based Data-Flow Engines (DFEs) and the dataflow programming paradigm, ExaFlexHH addresses all three requirements. The library is also parameterizable and compliant with NeuroML, a prominent brain-description language in computational neuroscience. We demonstrate the performance scalability of the platform by implementing a highly demanding extended-Hodgkin-Huxley (eHH) model of the Inferior Olive using ExaFlexHH. Results: Model simulation results show linear scalability for unconnected networks and near-linear scalability for networks with complex synaptic plasticity, with a 1.99× performance increase using two FPGAs compared to a single FPGA simulation, and 7.96× when using eight FPGAs in a scalable ring topology. Notably, our results also reveal consistent performance efficiency in GFLOPS per watt, further facilitating exascale-ready computing speeds and pushing the boundaries of future brain-simulation platforms. Discussion: The ExaFlexHH library shows superior resource efficiency, quantified in FLOPS per hardware resources, benchmarked against other competitive FPGA-based brain simulation implementations.
Published: 2024
Full Text: View/download PDF

11. Flexible Systolic Hardware Architecture for Computing a Custom Lightweight CNN in CT Images Processing for Automated COVID-19 Diagnosis

Author: Aguirre-Alvarez, Paulo Aarón, Diaz-Carmona, Javier, Arredondo-Velázquez, Moisés, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Mahmud, Mufti, editor, Mendoza-Barrera, Claudia, editor, Kaiser, M. Shamim, editor, Bandyopadhyay, Anirban, editor, Ray, Kanad, editor, and Lugo, Eduardo, editor
Published: 2023
Full Text: View/download PDF

12. Description and Verification of Systolic Array Parallel Computation Model in Synchronous Circuit Using LOTOS

Author: Chiba, Yuya, Wasaki, Katsumi, Kacprzyk, Janusz, Series Editor, Pal, Nikhil R., Advisory Editor, Bello Perez, Rafael, Advisory Editor, Corchado, Emilio S., Advisory Editor, Hagras, Hani, Advisory Editor, Kóczy, László T., Advisory Editor, Kreinovich, Vladik, Advisory Editor, Lin, Chin-Teng, Advisory Editor, Lu, Jie, Advisory Editor, Melin, Patricia, Advisory Editor, Nedjah, Nadia, Advisory Editor, Nguyen, Ngoc Thanh, Advisory Editor, Wang, Jun, Advisory Editor, and Latifi, Shahram, editor
Published: 2023
Full Text: View/download PDF

13. Haica: A High Performance Computing & Artificial Intelligence Fused Computing Architecture

Author: Chen, Zhengbo, Zheng, Fang, Guo, Feng, Yu, Qi, Chen, Zuoning, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Meng, Weizhi, editor, Lu, Rongxing, editor, Min, Geyong, editor, and Vaidya, Jaideep, editor
Published: 2023
Full Text: View/download PDF

14. CNN Accelerator Using Proposed Diagonal Cyclic Array for Minimizing Memory Accesses.

Author: Hyun-Wook Son, Al-Hamid, Ali A., Yong-Seok Na, Dong-Yeong Lee, and Hyung-Won Kim
Subjects: FIELD programmable gate arrays, CONVOLUTIONAL neural networks, DIGITAL signal processing
Abstract: This paper presents the architecture of a Convolution Neural Network (CNN) accelerator based on a new processing element (PE) array called a diagonal cyclic array (DCA). As demonstrated, it can significantly reduce the burden of repeated memory accesses for feature data and weight parameters of the CNN models, which maximizes the data reuse rate and improves the computation speed. Furthermore, an integrated computation architecture has been implemented for the activation function, max-pooling, and activation function after convolution calculation, reducing the hardware resource. To evaluate the effectiveness of the proposed architecture, a CNN accelerator has been implemented for You Only Look Once version 2 (YOLOv2)-Tiny consisting of 9 layers. Furthermore, the methodology to optimize the local buffer size with little sacrifice of inference speed is presented in this work. We implemented the proposed CNN accelerator using a Xilinx Zynq ZCU102 Ultrascale+ Field Programmable Gate Array (FPGA) and ISE Design Suite. The FPGA implementation uses 34,336 Look Up Tables (LUTs), 576 Digital Signal Processing (DSP) blocks, and an on-chip memory of only 58 KB, and it could achieve accuracies of 57.92% and 56.42% mean Average Precision @0.5 thresholds for intersection over union (mAP@0.5) using quantized 16-bit and 8-bit full integer data manipulation with only 0.68% as a loss for the 8-bit version and computation time of 137.9 and 69 ms for each input image respectively using a clock speed of 200 MHz. These speeds are expected to be doubled five times using a clock speed of 1 GHz if implemented in a silicon System on Chip (SoC) using a sub-micron process. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

15. FPGA-Based Chaotic Image Encryption Using Systolic Arrays.

Author: Ciylan, Furkan, Ciylan, Bünyamin, and Atak, Mehmet
Subjects: IMAGE encryption, STREAMING video & television, DATA security
Abstract: Along with the recent advancements in video streaming, concerns over the security of transferred data have increased. Thus, the development of fast and reliable image encryption methodologies has become an emerging research area in the field of communications. In this paper, a systolic array-based image encryption architecture is proposed. Systolic arrays are used to apply the convolution operation, and a Lü–Chen chaotic oscillator is used to obtain a convolutional filter. To decrease resource consumption, a method to fuse confusion and diffusion processes by using systolic arrays is also proposed in this paper. The results show that the proposed method is highly secure against some differential and statistical attacks. It is also shown that the proposed method has a high speed of encryption compared to other methods. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

16. DPI-C-Based Verification Platform for Systolic Array Modules.

Author: 王鑫 and 陈博
Published: 2023
Full Text: View/download PDF

17. SADD: A Novel Systolic Array Accelerator with Dynamic Dataflow for Sparse GEMM in Deep Learning

Author: Wang, Bo, Ma, Sheng, Liu, Zhong, Huang, Libo, Yuan, Yuan, Dai, Yi, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Liu, Shaoshan, editor, and Wei, Xiaohui, editor
Published: 2022
Full Text: View/download PDF

18. Layer-Fusion Attention Model Accelerator Architecture Based on a Systolic Array.

Author: 刘晓航, 姜晶菲, and 许金伟
Abstract: The attention mechanism has recently shown superior performance in deep neural networks, but its computation generates complex data flow and incurs high computation and memory overheads. Therefore, customized accelerators are required to optimize inference computing. This paper proposes an accelerator architecture for attention mechanism computation. A flexible partitioning method based on hardware control is used to divide the huge matrices in the attention model into hardware-friendly computing blocks, so that the block computation matches the systolic array in the accelerator. A layer fusion computing structure based on two-step softmax function decomposition is proposed, which effectively reduces the memory access of attention mechanism computation. A fused-layer attention model accelerator based on fine-grained computational scheduling is designed and implemented in HDL. The performance was evaluated on a Xilinx FPGA device with an HLS tool. Compared with the CPU and GPU implementations under the same settings, the delay of the accelerator was improved by 4.91 times, and the efficiency of the accelerator was improved by 1.24 times. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

19. Deep Learning Accelerators' Configuration Space Exploration Effect on Performance and Resource Utilization: A Gemmini Case Study.

Author: Gookyi, Dennis Agyemanh Nana, Lee, Eunchong, Kim, Kyungho, Jang, Sung-Joon, and Lee, Sang-Seol
Subjects: *DEEP learning, *CONFIGURATION space, *SPACE exploration, *EDGE computing
Abstract: Though custom deep learning (DL) hardware accelerators are attractive for making inferences in edge computing devices, their design and implementation remain a challenge. Open-source frameworks exist for exploring DL hardware accelerators. Gemmini is an open-source systolic array generator for agile DL accelerator exploration. This paper details the hardware/software components generated using Gemmini. The general matrix-to-matrix multiplication (GEMM) of different dataflow options, including output/weight stationary (OS/WS), was explored in Gemmini to estimate the performance relative to a CPU implementation. The Gemmini hardware was implemented on an FPGA device to explore the effect of several accelerator parameters, including array size, memory capacity, and the CPU/hardware image-to-column (im2col) module, on metrics such as the area, frequency, and power. This work revealed that regarding the performance, the WS dataflow offered a speedup of 3× relative to the OS dataflow, and the hardware im2col operation offered a speedup of 1.1× relative to the operation on the CPU. For hardware resources, an increase in the array size by a factor of 2 led to an increase in both the area and power by a factor of 3.3, and the im2col module led to an increase in area and power by factors of 1.01 and 1.06, respectively. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF
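
Note on result 19: the abstract refers to an image-to-column (im2col) module, the standard step that lets a matrix-multiplication engine such as a systolic array compute convolutions. The following is a minimal sketch of that transformation, not Gemmini's implementation. It assumes a single channel, stride 1, no padding, and the cross-correlation form of convolution used in CNNs. The helper names `im2col` and `conv2d_as_matmul` are made up for this example.

```python
def im2col(image, kh, kw):
    """Flatten every kh x kw patch of a 2D image into one row."""
    height, width = len(image), len(image[0])
    rows = []
    for r in range(height - kh + 1):
        for c in range(width - kw + 1):
            rows.append([image[r + i][c + j]
                         for i in range(kh) for j in range(kw)])
    return rows


def conv2d_as_matmul(image, kernel):
    """Convolution expressed as (patch matrix) x (flattened kernel)."""
    kh, kw = len(kernel), len(kernel[0])
    flat_kernel = [v for row in kernel for v in row]
    patches = im2col(image, kh, kw)
    # Each output pixel is one row of the patch matrix dotted with the kernel.
    flat = [sum(p * k for p, k in zip(patch, flat_kernel)) for patch in patches]
    out_w = len(image[0]) - kw + 1
    return [flat[i:i + out_w] for i in range(0, len(flat), out_w)]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, -1]]
print(conv2d_as_matmul(image, kernel))  # [[-4, -4], [-4, -4]]
```

With several filters, the flattened kernels become the columns of a weight matrix, and the whole layer reduces to one matrix product.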

20. Intermittent-Aware Design Exploration of Systolic Array Using Various Non-Volatile Memory: A Comparative Study

Author: Nedasadat Taheri, Sepehr Tabrizchi, and Arman Roohi
Subjects: intermittent computing, systolic array, non-volatile memory, accelerator, Mechanical engineering and machinery, TJ1-1570
Abstract: This paper conducts a comprehensive study on intermittent computing within IoT environments, emphasizing the interplay between different dataflows—row, weight, and output—and a variety of non-volatile memory technologies. We then delve into the architectural optimization of these systems using a spatial architecture, namely IDEA, with their processing elements efficiently arranged in a rhythmic pattern, providing enhanced performance in the presence of power failures. This exploration aims to highlight the diverse advantages and potential applications of each combination, offering a comparative perspective. In our findings, using IDEA for the row stationary dataflow with AlexNet on the CIFAR10 dataset, we observe a power efficiency gain of 2.7% and an average reduction of 21% in the required cycles. This study elucidates the potential of different architectural choices in enhancing energy efficiency and performance in IoT systems.
Published: 2024
Full Text: View/download PDF

21. High-Frequency Systolic Array-Based Transformer Accelerator on Field Programmable Gate Arrays.

Author: Chen, Yonghao, Li, Tianrui, Chen, Xiaojie, Cai, Zhigang, and Su, Tao
Subjects: FIELD programmable gate arrays, TRANSFORMER models, NATURAL language processing, MACHINE translating, ELECTRONIC design automation
Abstract: The systolic array is frequently used in accelerators for neural networks, including Transformer models that have recently achieved remarkable progress in natural language processing (NLP) and machine translation. Due to the constraints of FPGA EDA (Field Programmable Gate Array Electronic Design Automation) tools and the limitations of design methodology, existing systolic array accelerators for FPGA deployment often cannot achieve high frequency. In this work, we propose a well-designed high-frequency systolic array for an FPGA-based Transformer accelerator, which is capable of performing the Multi-Head Attention (MHA) block and the position-wise Feed-Forward Network (FFN) block, reaching 588 MHz and 474 MHz for different array size, achieving a frequency improvement of 1.8× and 1.5× on a Xilinx ZCU102 board, while drastically saving resources compared to similar recent works and pushing the utilization of each DSP slice to a higher level. We also propose a semi-automatic design flow with constraint-generating tools as a general solution for FPGA-based high-frequency systolic array deployment. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

22. A Systolic Array Optimization Strategy with Early Switching Between Matrix Blocks.

Author: 鞠鑫, 曹亚松, 文梅, 汪志, and 冯静
Abstract: The demand for hardware computing power in AI applications increases year by year, driving the evolution of AI accelerators towards higher performance. Research shows that the main computing form of AI applications can be transformed into matrix multiplication, and systolic array has become one of the mainstream matrix multiplication acceleration technologies because of its unique advantages in matrix multiplication. However, there is a certain amount of pipeline filling and emptying overhead when the matrix is flowed into and out of the systolic array, especially for a floating-point systolic array that supports training, whose MAC latency is greater than 1. Untimely switching between matrix blocks will lead to a sharp drop in PE utilization. To solve these problems, theoretical analysis based on typical application scenarios is conducted, and an early switching strategy between matrix blocks is proposed, which can accurately calculate the optimal switching time between matrix blocks in various situations. The RTL design was implemented. The experimental results show that the hardware overhead of the optimized systolic array is slightly increased, but the performance can be improved in all scenarios. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF
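
Note on result 22: the abstract attributes lost PE utilization to pipeline filling and emptying when a matrix block flows through the array. The sketch below is a simplified timing model of an output-stationary systolic array, not the paper's design or its switching strategy. It assumes a single-cycle MAC (the paper targets MAC latency greater than 1), one PE per output element, and operands skewed by one cycle per row and column. The function name `systolic_matmul_os` is hypothetical.

```python
def systolic_matmul_os(a, b):
    """Model an M x N output-stationary array computing A (M x K) times B (K x N).

    PE (i, j) sees operand pair k at cycle t = i + j + k because inputs are
    skewed by one cycle per row and per column.
    Returns the product, the cycle count, and the average PE utilization.
    """
    m, k_dim, n = len(a), len(a[0]), len(b[0])
    c = [[0] * n for _ in range(m)]
    cycles = m + n + k_dim - 2  # the last PE finishes at t = (m-1)+(n-1)+(k_dim-1)
    busy = 0
    for t in range(cycles):
        for i in range(m):
            for j in range(n):
                k = t - i - j
                if 0 <= k < k_dim:  # otherwise the PE idles (fill or drain)
                    c[i][j] += a[i][k] * b[k][j]
                    busy += 1
    utilization = busy / (m * n * cycles)
    return c, cycles, utilization


product, cycles, util = systolic_matmul_os([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product, cycles, util)  # [[19, 22], [43, 50]] 4 0.5
```

In this model utilization equals K / (M + N + K - 2), so short inner dimensions leave the array mostly filling and draining. Overlapping the drain of one block with the fill of the next is the kind of gap an early-switching strategy would aim to close.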

23. Flexible Convolver for Convolutional Neural Networks Deployment onto Hardware-Oriented Applications.

Author: Arredondo-Velázquez, Moisés, Aguirre-Álvarez, Paulo Aaron, Padilla-Medina, Alfredo, Espinosa-Calderon, Alejandro, Prado-Olivarez, Juan, and Diaz-Carmona, Javier
Subjects: CONVOLUTIONAL neural networks, FIELD programmable gate arrays, SHIFT registers
Abstract: This paper introduces a flexible convolver capable of adapting to the different convolution layer configurations of state-of-the-art Convolution Neural Networks (CNNs). The use of two proposed programmable components achieves this adaptability. A Programmable Line Buffer (PLB) based on Programmable Shift Registers (PSRs) allows the generation of the required convolution masks required for each processed CNN layer. The convolution layer computing is performed through a proposed programmable systolic array configured according to the target device resources. In order to maximize the device resource usage and to achieve a shortened processing time, the filter, data, and loop parallelisms are leveraged. These characteristics allow the described architecture to be scalable and implemented on any FPGA device targeting different applications. The convolver description was written in VHDL using the Intel Cyclone V 5CSXFC6D6F31C6N device as a reference. The experimental results show that the proposed computing method allows the processing of any CNN without requiring special adaptation for a specific application since the standard convolution algorithm is used. The proposed flexible convolver achieves competitive performance compared with those reported in related works. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

24. An Energy-Efficient Convolutional Neural Network Processor Architecture Based on a Systolic Array.

Author: Zhang, Chen, Wang, Xin'an, Yong, Shanshan, Zhang, Yining, Li, Qiuping, and Wang, Chenyang
Subjects: CONVOLUTIONAL neural networks, ARTIFICIAL intelligence, STATIC random access memory, IMAGE segmentation, ENERGY consumption
Abstract: Deep convolutional neural networks (CNNs) have shown strong abilities in the application of artificial intelligence. However, due to their extensive amount of computation, traditional processors have low energy efficiency when executing CNN algorithms, which is unacceptable for portable devices with limited hardware cost and battery capacity, so designing a CNN-specific processor is necessary. In this paper, we propose an energy-efficient CNN processor architecture for lightweight devices with a processing elements (PEs) array consisting of 384 PEs. Using the systolic array-based PE array, it realizes parallel operations between filter rows and between channels of output feature maps, supporting the acceleration of 3D convolution and fully connected computation with various parameters by configuring internal instruction registers. The computing strategy based on the proposed systolic dataflow achieves less hardware overhead compared with other strategies, and the reuse of image values and weight values, which effectively reduce the power of memory access. A memory system with a multi-level storage structure combined with register file (RF) and SRAM is used in the proposed CNN processor, which further reduces the energy overhead of computing. The proposed CNN processor architecture has been verified on a ZC706 FPGA platform using VGG-16 based on the proposed image segmentation method, the evaluation results indicate that the peak throughput achieves 115.2 GOP/s consuming 3.801 W at 150 MHz, energy efficiency and DSP efficiency reaches 30.32 GOP/s/W and 0.26 GOP/s/DSP, respectively. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

25. Functional Criticality Analysis of Structural Faults in AI Accelerators.

Author: Chaudhuri, Arjun, Talukdar, Jonti, Su, Fei, and Chakrabarty, Krishnendu
Subjects: *ARTIFICIAL neural networks, *GENERATIVE adversarial networks, *ARTIFICIAL intelligence, *FUNCTIONAL analysis
Abstract: The ubiquitous application of deep neural networks (DNNs) has led to a rise in demand for artificial intelligence (AI) accelerators. For example, the tensor processing unit from Google–based on a systolic array–and its variants are of considerable interest for DNN inferencing using AI accelerators. This article studies the problem of classifying structural faults in such an accelerator based on their functional criticality. We first analyze pin-level faults in the processing elements (PEs) of a systolic array. Simulation results for the LeNet network with 8-bit fixed-point, 16-bit floating-point (FP), and 32-bit FP data paths applied to the MNIST dataset show that over 93% of the pin-level structural faults in a PE are functionally benign. We present a greedy iterative framework for determining the criticality of stuck-at faults in a PE netlist and analyze the limitations of criticality analysis methods based on repeated fault simulations. We next present a scalable two-tier machine-learning (ML)-based method to assess the functional criticality of stuck-at faults in a computationally efficient manner. We address the problem of minimizing misclassification by utilizing generative adversarial networks (GANs). Two-tier ML/GAN-based criticality assessment leads to less than 1% test escapes during functional criticality evaluation of structural faults. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

26. High-throughput systolic array-based accelerator for hybrid transformer-CNN networks.

Author: Song, Qingzeng, Dai, Yao, Lu, Hao, and Jin, Guanghao
Subjects: TRANSFORMER models, CONVOLUTIONAL neural networks, GATE array circuits, MATRIX multiplications, ENERGY consumption
Abstract: In this era of Transformers enjoying remarkable success, Convolutional Neural Networks (CNNs) remain highly relevant and useful. Indeed, hybrid Transformer-CNN network architectures, which combine the benefits of both approaches, have achieved impressive results. Vision Transformer (ViT) is a significant neural network architecture that features a convolutional layer as its first layer, primarily built on the transformer framework. However, owing to the distinct computation patterns inherent in attention and convolution, existing hardware accelerators for these two models are typically designed separately and lack a unified approach toward accelerating both models efficiently. In this paper, we present a dedicated accelerator on a field-programmable gate array (FPGA) platform. The accelerator, which integrates a configurable three-dimensional systolic array, is specifically designed to accelerate the inferential capabilities of hybrid Transformer-CNN networks. The Convolution and Transformer computations can be mapped to a systolic array by unifying these operations for matrix multiplication. Softmax and LayerNorm which are frequently used in hybrid Transformer-CNN networks were also implemented on FPGA boards. The accelerator achieved high performance with a peak throughput of 722 GOP/s at an average energy efficiency of 53 GOPS/W. Its respective computation latencies were 51.3 ms, 18.1 ms, and 6.8 ms for ViT-Base, ViT-Small, and ViT-Tiny. The accelerator provided a 12× improvement in energy efficiency compared to the CPU, a 2.3× improvement compared to the GPU, and a 1.5× to 2× improvement compared to existing accelerators regarding speed and energy efficiency. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF
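
Note on entry 26: the abstract says convolution is mapped to the systolic array by expressing it as matrix multiplication. A minimal sketch of that general idea follows, using the common im2col unrolling for a single-channel, stride-1, unpadded 2D convolution (cross-correlation, as in CNNs). The function names are hypothetical, and this is an assumption about the general technique, not the mapping used in the paper.

```python
def im2col(image, kh, kw):
    """Unroll every kh x kw window of a 2D image into one row of a matrix."""
    h, w = len(image), len(image[0])
    rows = []
    for r in range(h - kh + 1):
        for c in range(w - kw + 1):
            rows.append([image[r + i][c + j] for i in range(kh) for j in range(kw)])
    return rows


def matmul(a, b):
    """Plain matrix product; this is the operation a systolic array accelerates."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def conv2d_as_matmul(image, kernel):
    """Stride-1, unpadded 2D cross-correlation computed as one matrix product."""
    kh, kw = len(kernel), len(kernel[0])
    patches = im2col(image, kh, kw)
    # Flatten the kernel into a (kh*kw) x 1 column in the same order as the patches.
    weights = [[kernel[i][j]] for i in range(kh) for j in range(kw)]
    flat = [row[0] for row in matmul(patches, weights)]
    out_w = len(image[0]) - kw + 1
    return [flat[i:i + out_w] for i in range(0, len(flat), out_w)]


if __name__ == "__main__":
    img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    ker = [[1, 0], [0, 1]]
    print(conv2d_as_matmul(img, ker))
```

With several output channels, each flattened kernel becomes one more column of the weight matrix, so the whole layer is still a single matrix product.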

27. Design and implementation of a nano-scale high-speed multiplier for signal processing applications.

Author: Ahmadpour, Seyed-Sajad, Navimipour, Nima Jafari, Ain, Noor Ul, Kerestecioglu, Feza, Yalcin, Senay, Avval, Danial Bakhshayeshi, and Hosseinzadeh, Mehdi
Subjects: DIGITAL signal processing, DISCRETE wavelet transforms, CENTRAL processing units, SIGNAL processing, ELECTRONIC circuits
Abstract: Digital signal processing (DSP) is an engineering field involved with increasing the precision and dependability of digital communications and mathematical processes, including equalization, modulation, demodulation, compression, and decompression, which can be used to produce a signal of the highest caliber. To execute vital tasks in DSP, an essential electronic circuit such as a multiplier plays an important role, continually performing tasks such as the multiplication of two binary numbers. Multiplier is a crucial component utilized to implement a wide range of DSP tasks, including convolution, Fourier transform, discrete wavelet transforms (DWT), filtering and dithering, multimedia information processing, and more. A multiplier device includes a clock and reset buttons for more flexible operational control. Each digital signal processor constitutes a multiplier unit. A multiplier unit functions entirely autonomously from the central processing unit (CPU); consequently, the CPU is burdened with a significantly reduced amount of work. Since DSP algorithms must constantly carry out multiplication tasks, the employment of a high-speed multiplier to execute fast-speed filtering processes is vital. The previous multipliers had lots of weaknesses, such as high energy, low speed, and high area, because they implemented this necessary circuit based on traditional technology such as complementary metal-oxide semiconductor (CMOS) and very large-scale integration (VLSI). To solve all previous drawbacks in this necessary circuit, we can use nanotechnology, which directly affects the performance of the multiplier and can overcome all previous issues. One of the alternative nanotechnologies that can be used for designing digital circuits is quantum dot cellular automata, which is high speed, low area, and low power. Therefore, this manuscript suggests a quantum technology-based multiplier for DSP applications. 
In addition, some vital circuits, such as half adder, full adder, and ripple carry adder (RCA), are suggested for designing a multiplier. Moreover, a systolic array, accumulator, and multiply and accumulate (MAC) unit are proposed based on the quantum technology-based multiplier. Nonetheless, each of the suggested frameworks has a coplanar configuration without rotated cells. The suggested structure is developed and verified utilizing the QCADesigner 2.0.3 tools. The findings showed that all circuits have no complicated configuration, including a higher number of quantum cells, latency, and an optimum area. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

28. Implementation of Discrete Sine Transform Realization Though Systolic Architecture

Author: Jain, Anamika, Pandey, Neeta, Angrisani, Leopoldo, Series Editor, Arteaga, Marco, Series Editor, Panigrahi, Bijaya Ketan, Series Editor, Chakraborty, Samarjit, Series Editor, Chen, Jiming, Series Editor, Chen, Shanben, Series Editor, Chen, Tan Kay, Series Editor, Dillmann, Rüdiger, Series Editor, Duan, Haibin, Series Editor, Ferrari, Gianluigi, Series Editor, Ferre, Manuel, Series Editor, Hirche, Sandra, Series Editor, Jabbari, Faryar, Series Editor, Jia, Limin, Series Editor, Kacprzyk, Janusz, Series Editor, Khamis, Alaa, Series Editor, Kroeger, Torsten, Series Editor, Li, Yong, Series Editor, Liang, Qilian, Series Editor, Martín, Ferran, Series Editor, Ming, Tan Cher, Series Editor, Minker, Wolfgang, Series Editor, Misra, Pradeep, Series Editor, Möller, Sebastian, Series Editor, Mukhopadhyay, Subhas, Series Editor, Ning, Cun-Zheng, Series Editor, Nishida, Toyoaki, Series Editor, Pascucci, Federica, Series Editor, Qin, Yong, Series Editor, Seng, Gan Woon, Series Editor, Speidel, Joachim, Series Editor, Veiga, Germano, Series Editor, Wu, Haitao, Series Editor, Zhang, Junjie James, Series Editor, Nath, Vijay, editor, and Mandal, J. K., editor
Published: 2021
Full Text: View/download PDF

29. Efficient Homomorphic Encryption Accelerator With Integrated PRNG Using Low-Cost FPGA

Author: Infall Syafalni, Gilbert Jonatan, Nana Sutisna, Rahmat Mulyawan, and Trio Adiono
Subjects: BFV scheme, fully homomorphic encryption, Gaussian PRNG, hardware accelerator, systolic array, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
Abstract: With recent development in internet speed and reliability, cloud computing has become a more reliable solution for the user. In many cases where data privacy is critical, fully homomorphic encryption (FHE) can be a security solution for securing cloud computing. FHE enables computation on encrypted data, hence it ensures data privacy in case of cloud computing. One popular scheme of FHE is the BFV homomorphic encryption scheme, which is based on ring learning with error (RLWE) computation. The BFV scheme uses ring polynomials as the main object, hence its encryption, decryption, and evaluation require high-degree polynomial multiplication. In this paper, we present comprehensive design and implementation of a hardware architecture to accelerate encryption and decryption in BFV scheme. Our accelerator uses convolution approach for calculating a polynomial multiplication. To implement the convolution, we use a systolic array to calculate polynomial convolution followed by a simple delayed subtraction to calculate polynomial modulo reduction inside our accelerator’s core. Moreover, we use a built-in Gaussian pseudo-random number generator (PRNG) to generate Gaussian noise in the encryption operations. Finally, we implement the 1024 degrees BFV accelerator on the Xilinx PYNQ Z1 board and compare the encryption and decryption performances to other methods as well as a software implementation on Intel Core i7 with 8GB memory. Experimental results show that our accelerator outperforms the clock cycles of other methods with the same polynomial degrees 1024 up to 22×. Moreover, our proposed Gaussian PRNG has 2× better correlation compared to the rotation-only-based PRNG. Finally, our accelerator accelerates up to 9× for encryption and 3.5× for decryption as well as 6.8× for overall compared to Microsoft SEAL on Intel Core i7 processor with 8GB memory.
The proposed design is scalable for higher degrees polynomial multiplication and useful for security technology such as high-speed secure cloud computing, blind computing, and secure communication.
Published: 2022
Full Text: View/download PDF
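
Note on entry 29: the abstract describes polynomial multiplication as a convolution followed by a subtraction step for the polynomial modulo reduction. The sketch below illustrates that idea in software, assuming the usual BFV ring Z_q[x]/(x^n + 1); the abstract does not state the ring modulus, and the function name and parameters are hypothetical. It is not a model of the paper's hardware.

```python
def poly_mul_negacyclic(a, b, q):
    """Multiply two degree-(n-1) polynomials modulo (x^n + 1) and modulo q.

    Coefficients are listed lowest degree first. Assumes the ring
    Z_q[x]/(x^n + 1), which is an assumption not taken from the abstract.
    """
    n = len(a)
    if len(b) != n:
        raise ValueError("both polynomials must have the same length")

    # Step 1: full linear convolution, giving 2n - 1 coefficients.
    full = [0] * (2 * n - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            full[i + j] += ai * bj

    # Step 2: reduce modulo x^n + 1. Since x^(n+k) = -x^k in this ring,
    # each upper coefficient is subtracted from the matching lower one.
    result = full[:n]
    for k in range(n, 2 * n - 1):
        result[k - n] -= full[k]

    # Step 3: reduce coefficients modulo q.
    return [c % q for c in result]


if __name__ == "__main__":
    # (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2, and x^2 = -1, giving -5 + 10x.
    print(poly_mul_negacyclic([1, 2], [3, 4], 17))
```

The subtraction in step 2 only needs coefficients that the convolution has already produced, which is consistent with the abstract's description of a subtraction that trails the convolution.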

30. Deep Neural Network Memory Performance and Throughput Modeling and Simulation Framework.

Author: Gabbay, Freddy, Lev Aharoni, Rotem, and Schweitzer, Ori
Subjects: *ARTIFICIAL neural networks, *AUTOMATIC speech recognition, *ARTIFICIAL intelligence, *SEMICONDUCTOR technology, *IMAGE recognition (Computer vision), *SPEECH perception
Abstract: Deep neural networks (DNNs) are widely used in various artificial intelligence applications and platforms, such as sensors in internet of things (IoT) devices, speech and image recognition in mobile systems, and web searching in data centers. While DNNs achieve remarkable prediction accuracy, they introduce major computational and memory bandwidth challenges due to the increasing model complexity and the growing amount of data used for training and inference. These challenges introduce major difficulties not only due to the constraints of system cost, performance, and energy consumption, but also due to limitations in currently available memory bandwidth. The recent advances in semiconductor technologies have further intensified the gap between computational hardware performance and memory systems bandwidth. Consequently, memory systems are, today, a major performance bottleneck for DNN applications. In this paper, we present DRAMA, a deep neural network memory simulator. DRAMA extends the SCALE-Sim simulator for DNN inference on systolic arrays with a detailed, accurate, and extensive modeling and simulation environment of the memory system. DRAMA can simulate in detail the hierarchical main memory components—such as memory channels, modules, ranks, and banks—and related timing parameters. In addition, DRAMA can explore tradeoffs for memory system performance and identify bottlenecks for different DNNs and memory architectures. We demonstrate DRAMA's capabilities through a set of experimental simulations based on several use cases. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

31. Heterogeneous Systolic Array Architecture for Compact CNNs Hardware Accelerators.

Author: Xu, Rui, Ma, Sheng, Wang, Yaohua, Guo, Yang, Li, Dongsheng, and Qiao, Yuran
Subjects: *CONVOLUTIONAL neural networks, *FLEXIBLE structures
Abstract: Compact convolutional neural networks have become a hot research topic. However, we find that the systolic array accelerators are extremely inefficient in dealing with compact models, especially when processing depthwise convolutional layers in the neural networks. To make systolic arrays more efficient for compact convolutional neural networks, we propose the heterogeneous systolic array (HeSA) architecture. It introduces heterogeneous processing elements that support multiple dataflows, which can further exploit the reuse data chance of depthwise convolutional layers and without changing the structure of the naïve systolic array. By increasing the utilization rate of processing elements in the array, the HeSA improves the performance, throughput, and energy efficiency compared to the standard baseline. In addition, we design the flexible buffer structure for the HeSA. Through configuring it, the HeSA can allocate bandwidth flexibly to maintaining high performance and low communication cost. Based on our evaluation with typical workloads, the HeSA improves the utilization rate of the computing resource in depthwise convolutional layers by 4.5× - 11.2× and acquires 1.6 - 3.1× total performance speedup compared to the standard systolic array architecture. In the large-scale array design, the HeSA can reduce the data traffic by 40% while maintaining the same performance as the scaling-out method. By improving the on-chip data reuse opportunities and reducing data traffic, the HeSA saves over 20% in energy consumption. Meanwhile, the area of the HeSA is basically unchanged compared to the baseline due to its simple design. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

32. A Fine-Grained Modeling Approach for Systolic Array-Based Accelerator.

Author: Li, Yuhang, Wen, Mei, Fei, Jiawei, Shen, Junzhong, and Cao, Yasong
Subjects: MATRIX multiplications, DEEP learning
Abstract: The systolic array provides extremely high efficiency for running matrix multiplication and is one of the mainstream architectures of today's deep learning accelerators. In order to develop efficient accelerators, people usually employ simulators to make design trade-offs. However, current simulators suffer from coarse-grained modeling methods and ideal assumptions, which limits their ability to describe structural characteristics of systolic arrays. In addition, they do not support the exploration of microarchitecture. This paper presents FG-SIM, a fine-grained modeling approach for evaluating systolic array accelerators by using an event-driven method. FG-SIM can obtain accurate results and provide the best mapping scheme for different workloads due to its fine-grained modeling technique and deny of ideal assumption. Experimental results show that FG-SIM plays a significant role in design trade-offs and outperforms state-of-the-art simulators, with an accuracy of more than 95%. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

33. Near-Precise Parameter Approximation for Multiple Multiplications on a Single DSP Block.

Author: Kalali, Ercan and van Leuken, Rene
Subjects: *MULTIPLICATION, *FIELD programmable gate arrays
Abstract: DSP blocks are one of the efficient solutions to implement multiply-accumulate (MAC) operations on FPGA’s. However, since the DSP blocks have wide multiplier and adder blocks, MAC operations using low bit-length parameters lead to an underutilization. Hence, an efficient approximation technique is introduced. The technique includes manipulation and approximation of the low bit-length parameters based upon a Single DSP - Multiple Multiplication (SDMM) execution. The accuracy of the developed optimization technique was evaluated for different CNN weight bit precisions using the Alexnet and VGG-16 networks and the ImageNet ILSVRC-2012 dataset. The optimization can be implemented without loss of accuracy in almost all cases, while it causes slight accuracy losses in a few cases. Through these optimizations, multiple parameter multiplications are performed in a single DSP block at the cost of a small hardware overhead. As a result of our optimizations, the parameters are represented in a different format on off-chip memory, providing up to 33% compression without any hardware cost. A prototype systolic array architecture was implemented employing our optimizations on a Xilinx Zynq FPGA. It reduced the number of DSP blocks by 66.6%, 75%, and 83.3% for 8, 6, and 4-bit input variables, respectively. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF
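
Note on entry 33: the abstract describes performing several low bit-length multiplications in one wide DSP multiplier. The sketch below shows only the generic, exact operand-packing idea for two unsigned multiplications that share one input. It is an illustrative assumption and not the paper's SDMM approximation technique; the function name and parameters are hypothetical.

```python
def packed_multiply(x, w_hi, w_lo, bits):
    """Compute x*w_hi and x*w_lo with a single wide multiplication.

    All operands are unsigned and narrower than `bits` bits, so each
    product fits in 2*bits bits and the two results cannot overlap.
    """
    limit = 1 << bits
    for value in (x, w_hi, w_lo):
        if not 0 <= value < limit:
            raise ValueError("operands must be unsigned and fit in `bits` bits")

    shift = 2 * bits                      # room for one full product
    packed = (w_hi << shift) | w_lo       # two weights in one wide operand
    wide = packed * x                     # the single multiplication
    mask = (1 << shift) - 1
    return wide >> shift, wide & mask     # (x*w_hi, x*w_lo)


if __name__ == "__main__":
    print(packed_multiply(13, 7, 11, 4))
```

Signed operands need extra correction terms because the low product borrows from the high one, which is one reason practical schemes are more involved than this sketch.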

34. A Low-Power Area-Efficient Precision Scalable Multiplier with an Input Vector Systolic Structure.

Author: Tang, Xiqin, Li, Yang, Lin, Chenxiao, and Shang, Delong
Subjects: CONVOLUTIONAL neural networks, MODULAR design, ELECTRICITY pricing
Abstract: In this paper, a small-area low-power 64-bit integer multiplier is presented, which is suitable for portable devices or wireless applications. To save the area cost and power consumption, an input vector systolic (IVS) structure is proposed based on four 16-bit radix-8 Booth multipliers and a data input scheme is proposed to reduce the number of signal transitions. This structure is similar to a systolic array in matrix multiply units of a Convolutional Neural Network (CNN), but it reduces the number of processing elements by 3/4 concerning the same vector systolic accelerator in reference. The comparison results prove that the IVS multiplier reduces at least 61.9% of the area and 45.18% of the power over its counterparts. To increase the hardware resource utilization, a Transverse Carry Array (TCA) structure for Partial Products Accumulation (PPA) was designed by replacing the 32-bit adders with 3/17-bit adders in the 16-bit multipliers. The experiment results show that the optimization could lead to at least a 6.32% and 13.65% reduction in power consumption and area cost, respectively, compared to the standard 16-bit radix-8 Booth multiplier. In the end, the precise scale of the proposed IVS multiplier is discussed. Benefiting from the modular design, the IVS multiplier can be configured to support sixteen different kinds of multiplications at a step of 16 bits [16b, 32b, 48b, 64b] × [16b, 32b, 48b, 64b]. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

35. SAFFIRA : a Framework for Assessing the Reliability of Systolic-Array-Based DNN Accelerators

Author: Taheri, M., Daneshtalab, Masoud, Raik, J., Jenihhin, M., Pappalardo, S., Jimenez, P., Deveautour, B., and Bosio, A.
Abstract: Systolic array has emerged as a prominent architecture for Deep Neural Network (DNN) hardware accelerators, providing high-throughput and low-latency performance essential for deploying DNNs across diverse applications. However, when used in safety-critical applications, reliability assessment is mandatory to guarantee the correct behavior of DNN accelerators. While fault injection stands out as a well-established practical and robust method for reliability assessment, it is still a very time-consuming process. This paper addresses the time efficiency issue by introducing a novel hierarchical software-based hardware-aware fault injection strategy tailored for systolic array-based DNN accelerators. The uniform Recurrent Equations system is used for software modeling of the systolic-array core of the DNN accelerators. The approach demonstrates a reduction of the fault injection time up to 3× compared to the state-of-the-art hybrid (software/hardware) hardware-aware fault injection frameworks and more than 2000× compared to RT-level fault injection frameworks, without compromising accuracy. Additionally, we propose and evaluate a new reliability metric through experimental assessment. The performance of the framework is studied on state-of-the-art DNN benchmarks.
Published: 2024
Full Text: View/download PDF

36. ExaFlexHH: an exascale-ready, flexible multi-FPGA library for biologically plausible brain simulations

Author: Miedema, Rene and Strydis, C.
Abstract: Introduction: In-silico simulations are a powerful tool in modern neuroscience for enhancing our understanding of complex brain systems at various physiological levels. To model biologically realistic and detailed systems, an ideal simulation platform must possess: (1) high performance and performance scalability, (2) flexibility, and (3) ease of use for non-technical users. However, most existing platforms and libraries do not meet all three criteria, particularly for complex models such as the Hodgkin-Huxley (HH) model or for complex neuron-connectivity modeling such as gap junctions. Methods: This work introduces ExaFlexHH, an exascale-ready, flexible library for simulating HH models on multi-FPGA platforms. Utilizing FPGA-based Data-Flow Engines (DFEs) and the dataflow programming paradigm, ExaFlexHH addresses all three requirements. The library is also parameterizable and compliant with NeuroML, a prominent brain-description language in computational neuroscience. We demonstrate the performance scalability of the platform by implementing a highly demanding extended-Hodgkin-Huxley (eHH) model of the Inferior Olive using ExaFlexHH. Results: Model simulation results show linear scalability for unconnected networks and near-linear scalability for networks with complex synaptic plasticity, with a 1.99× performance increase using two FPGAs compared to a single FPGA simulation, and 7.96× when using eight FPGAs in a scalable ring topology. Notably, our results also reveal consistent performance efficiency in GFLOPS per watt, further facilitating exascale-ready computing speeds and pushing the boundaries of future brain-simulation platforms. Discussion: The ExaFlexHH library shows superior resource efficiency, quantified in FLOPS per hardware resources, benchmarked against other competitive FPGA-based brain simulation implementations.
Published: 2024
Full Text: View/download PDF

37. Towards a Deep-Pipelined Architecture for Accelerating Deep GCN on a Multi-FPGA Platform

Author: Cheng, Qixuan, Wen, Mei, Shen, Junzhong, Wang, Deguang, Zhang, Chunyuan, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, and Qiu, Meikang, editor
Published: 2020
Full Text: View/download PDF

38. A Block-Based Systolic Array on an HBM2 FPGA for DNA Sequence Alignment

Author: Ben Abdelhamid, Riadh, Yamaguchi, Yoshiki, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Rincón, Fernando, editor, Barba, Jesús, editor, So, Hayden K. H., editor, Diniz, Pedro, editor, and Caba, Julián, editor
Published: 2020
Full Text: View/download PDF

39. Hybrid Accumulator Factored Systolic Array for Machine Learning Acceleration.

Author: Inayat, Kashif and Chung, Jaeyong
Subjects: MACHINE learning, DEEP learning, COMPUTER arithmetic, COMPUTER architecture
Abstract: Deep learning applications have become ubiquitous in today’s era and it has led to vast development in machine learning (ML) accelerators. Systolic arrays have been a primary part of ML accelerator architecture. To fully leverage the systolic arrays, it is required to explore the computer arithmetic data-path components and their tradeoffs in accelerators. We present a novel factored systolic array (FSA) architecture, in which the carry propagation adder (CPA) and carry-save adder (CSA) perform hybrid accumulation on least significant bit (LSB) bits and most significant bits (MSB) bits, respectively, inside each processing element. In addition, a small CPA to complete accumulation for MSB bits along with rounding logic for each column of the array is placed, which not only reduces the area, delay, and power but also balances the combinational and sequential area tradeoffs. We demonstrate the hybrid accumulator with partial CPA factoring in “Gemmini,” an open-source practical systolic array accelerator and factoring technique does not change the functionality of the base design. We implemented three baselines, original Gemmini and two variants of it, and show that the proposed approach leads to overall significant reduction in area within the range 12.8% – 50.2% and in power within the range 18.6% – 41% with improved or similar delay in comparison to the baselines. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

40. S² Engine: A Novel Systolic Architecture for Sparse Convolutional Neural Networks.

Author: Yang, Jianlei, Fu, Wenzhi, Cheng, Xingzhou, Ye, Xucheng, Dai, Pengcheng, and Zhao, Weisheng
Subjects: *CONVOLUTIONAL neural networks, *VERNACULAR architecture, *SYSTEMS design, *COMPUTER systems, *ENGINES
Abstract: Convolutional neural networks (CNNs) have achieved great success in performing cognitive tasks. However, execution of CNNs requires a large amount of computing resources and generates heavy memory traffic, which imposes a severe challenge on computing system design. Through optimizing parallel executions and data reuse in convolution, systolic architecture demonstrates great advantages in accelerating CNN computations. However, regular internal data transmission path in traditional systolic architecture prevents the systolic architecture from completely leveraging the benefits introduced by neural network sparsity. Deployment of fine-grained sparsity on the existing systolic architectures is greatly hindered by the incurred computational overheads. In this work, we propose S² Engine, a novel systolic architecture that can fully exploit the sparsity in CNNs with maximized data reuse. S² Engine transmits compressed data internally and allows each processing element to dynamically select an aligned data from the compressed dataflow in convolution. Compared to the naïve systolic array, S² Engine achieves about 3.2× and about 3.0× improvements on speed and energy efficiency, respectively. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

41. Research on Heterogeneous Acceleration of Deep Learning Method for Missile-Borne Image Processing

Author: Chen Dong, Tian Zonghao
Subjects: missile-borne image, deep learning, fpga, systolic array, winograd convolution, Motor vehicles. Aeronautics. Astronautics, TL1-4050
Abstract: The problem existing in the transformation of deep learning algorithm to engineering application is analyzed. Combining with the characteristics and development trends of army intelligent ammunition, the missile-borne image processing heterogeneous accelerate system for deep learning is put forward based on the research of compression, quantitative and hardware heterogeneous acceleration, realizing heterogeneous hardware design. The DNNDK is used to compress and quantify the Yolo v3 model. The weight and parameter compression rate are more than 90% and 80%, realizing the lightweight design of Yolo v3. Based on the DPU hardware acceleration architecture, the algorithm is transplanted to the missile-borne embedded platform, and its power consumption and detection efficiency meet the requirements of missile-borne image processing.
Published: 2021
Full Text: View/download PDF

42. A High Performance Multi-Bit-Width Booth Vector Systolic Accelerator for NAS Optimized Deep Learning Neural Networks.

Author: Huang, Mingqiang, Liu, Yucen, Man, Changhai, Li, Kai, Cheng, Quan, Mao, Wei, and Yu, Hao
Subjects: *ARTIFICIAL neural networks, *DEEP learning, *CONVOLUTIONAL neural networks, *FIELD programmable gate arrays, *MATRIX multiplications, *ELECTRONIC data processing
Abstract: Multi-bit-width convolutional neural network (CNN) maintains the balance between network accuracy and hardware efficiency, thus enlightening a promising method for accurate yet energy-efficient edge computing. In this work, we develop state-of-the-art multi-bit-width accelerator for NAS Optimized deep learning neural networks. To efficiently process the multi-bit-width network inferencing, multi-level optimizations have been proposed. Firstly, differential Neural Architecture Search (NAS) method is adopted for the high accuracy multi-bit-width network generation. Secondly, hybrid Booth based multi-bit-width multiply-add-accumulation (MAC) unit is developed for data processing. Thirdly, vector systolic array is proposed for effectively accelerating the matrix multiplications. With vector-style systolic dataflow, both the processing time and logic resources consumption can be reduced when compared with the classical systolic array. Finally, the proposed multi-bit-width CNN acceleration scheme has been practically deployed on FPGA platform of Xilinx ZCU102. Average performance on accelerating the full NAS optimized VGG16 network is 784.2 GOPS, and peak performance of the convolutional layer can reach as high as 871.26 GOPS for INT8, 1676.96 GOPS for INT4, and 2863.29 GOPS for INT2 respectively, which is among the best results in previous CNN accelerator benchmarks. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

43. Integration of Single-Port Memory (ISPM) for Multiprecision Computation in Systolic-Array-Based Accelerators.

Author: Yang, Renyu, Shen, Junzhong, Wen, Mei, Cao, Yasong, and Li, Yuhang
Subjects: DEEP learning, MEMORY, MACHINE learning
Abstract: On-chip memory is one of the core components of deep learning accelerators. In general, the area used by the on-chip memory accounts for around 30% of the total chip area. With the increasing complexity of deep learning algorithms, it will become a challenge for the accelerators to integrate much larger on-chip memory responding to algorithm needs, whereas the on-chip memory for multiprecision computation is required by the different precision (such as FP32, FP16) computations in training and inference. To solve it, this paper explores the use of single-port memory (SPM) in systolic-array-based deep learning accelerators. We propose transformation methods for multiple precision computation scenarios, respectively, to avoid the conflict of simultaneous read and write requests on the SPM. Then, we prove that the two methods are feasible and can be implemented on hardware without affecting the computation efficiency of the accelerator. Experimental results show that both methods have about 30% and 25% improvement in terms of area cost when accelerator integrates SPM without affecting the throughput of the accelerator, while the hardware cost is almost negligible. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

44. ASIC Implementation of Hardware Efficient DTCWT Architecture for Intra Prediction HEVC Coding in Complex Wavelet.

Author: Madhurima, V. and Padmapriya, K.
Subjects: SYSTOLIC array circuits, PARALLEL processing
Abstract: Hardware efficient DTCWT architecture for intra prediction coding that is designed using systolic array algorithm is presented in this work. In order to have a trade-off between computation complexity and latency systolic array based architecture is designed for DTCWT computation. Design of optimum structure for systolic array algorithm is presented. Parallel processing of the input images considering multiple rows and columns is designed to accelerate the throughput. The systolic array structure design presented in this work is for multiplying matrix of 6 x 6 and 6 x 4 elements. The data control unit provides synchronization of information flow in both the SAA structures. The design is synthesized in the Cadence environment, post place, map and route simulation also have been carried out. Tables in the text summarizes the initial estimation of area performances for the four different timing constraints set. The maximum operating frequency is identified to be of 300 MHz and the power dissipation is limited to less than 3 mW at 2.0 V power. Further optimization in the design metrics can be achieved by increasing the number of frames to be processed. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

45. PRTSM: Hardware Data Arrangement Mechanisms for Convolutional Layer Computation on the Systolic Array

Author: Wang, Shuquan, Wang, Lei, Li, Shiming, Shuo, Tian, Guo, Shasha, Kang, Ziyang, Zhang, Shuzheng, Xu, Weixia, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Tang, Xiaoxin, editor, Chen, Quan, editor, Bose, Pradip, editor, Zheng, Weiming, editor, and Gaudiot, Jean-Luc, editor
Published: 2019
Full Text: View/download PDF

46. Systolic Arrays

Author: Hu, Yu Hen, Kung, Sun-Yuan, Bhattacharyya, Shuvra S., editor, Deprettere, Ed F., editor, Leupers, Rainer, editor, and Takala, Jarmo, editor
Published: 2019
Full Text: View/download PDF

47. Design and Chip Implementation of a SMI/MVDR Dual-Mode Beamformer for Wireless MIMO Communication Systems

Author: Kuan-Ting Chen, Yin-Tsung Hwang, and Cheng-Yi Huang
Subjects: Adaptive beamforming, Cholesky decomposition, Schur algorithm, systolic array, hardware accelerator, chip design, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
Abstract: This paper presents a low complexity chip design supporting dual-mode beamforming, i.e. sampling matrix inversion (SMI) and the minimum variance distortionless response (MVDR), for wireless Multiple-Input Multiple-Output (MIMO) communication systems. The auto-correlation matrix inversion is the critical computing kernel shared by the two beamforming schemes. To alleviate the computing complexity, the auto-correlation matrix is approximated by a Toeplitz counterpart, which can be decomposed efficiently by applying the Cholesky decomposition and the Schur algorithm. This leads to an O(N^3) to O(N^2) complexity reduction, where N is the matrix size, while preserving computing parallelism for the hardware design. In addition, a diagonal loading technique is employed to mitigate the stability problem when the matrix is ill-conditioned. Simulation results indicate that no performance loss is observed due to the algorithm simplification measures. A systolic array based mapping procedure converts the two beamforming algorithms to a unified hardware accelerator design with 80% shared circuitry. Complex-valued divisions are achieved by adopting a hardware efficient coordinate rotation digital computer (CORDIC) scheme. In chip implementation, a TSMC 90 nm UTM process technology is used and the design specs largely follow the requirements of IEEE 802.11ac standard. The core size of the chip design is 0.68 mm^2. The measurement results show that the chip can operate up to 200 MHz with a power consumption of 49.03 mW. It can complete the computation of a new beamforming vector (of size 8) every 0.64 µs and exhibits the highest throughput among the 6 compared designs.
Published: 2020
Full Text: View/download PDF
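
Note on entry 47: the MVDR weights with diagonal loading mentioned in the abstract can be written as w = (R + dI)^-1 a / (a^H (R + dI)^-1 a). The sketch below illustrates only that formula in plain Python. It uses ordinary Gaussian elimination rather than the paper's Toeplitz approximation with Cholesky decomposition and the Schur algorithm, and the names solve, mvdr_weights, steering_vector and the loading value are assumptions made for illustration.

```python
import cmath


def solve(M, b):
    """Solve M x = b (complex entries) by Gaussian elimination with pivoting."""
    n = len(M)
    aug = [list(row) + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0j] * n
    for r in reversed(range(n)):
        s = aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / aug[r][r]
    return x


def mvdr_weights(R, a, loading=1e-3):
    """MVDR weights for covariance R and steering vector a, with diagonal loading."""
    n = len(R)
    # Diagonal loading: add a small constant to the diagonal so that an
    # ill-conditioned R can still be inverted reliably.
    R_loaded = [[R[i][j] + (loading if i == j else 0.0) for j in range(n)]
                for i in range(n)]
    u = solve(R_loaded, a)                                    # (R + dI)^-1 a
    denom = sum(ai.conjugate() * ui for ai, ui in zip(a, u))  # a^H (R + dI)^-1 a
    return [ui / denom for ui in u]


def steering_vector(n, sin_theta):
    """Half-wavelength uniform linear array (an assumed geometry)."""
    return [cmath.exp(1j * cmath.pi * k * sin_theta) for k in range(n)]
```

By construction the weights satisfy the distortionless constraint w^H a = 1, which is a convenient sanity check for any implementation, whether in software or on a systolic array.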

48. An FPGA-Based Hardware Accelerator for Real-Time Block-Matching and 3D Filtering

Author: Dong Wang, Jia Xu, and Ke Xu
Subjects: FPGA, BM3D, systolic array, image denoising, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
Abstract: The block-matching and 3D filtering (BM3D) denoising algorithm has been employed in many application fields because of its superior image processing quality. Due to the huge computational workload, real-time implementation of this algorithm is very challenging. Recently, studies on accelerating the BM3D algorithm on GPU have presented impressive speedups over CPU-based implementations. However, GPU devices are generally inefficient in energy dissipation and, thus, are not suitable for embedded application scenarios. In this paper, we propose a dedicated hardware accelerator design to efficiently boost the BM3D algorithm with reduced power consumption on an FPGA device. The proposed design is based on a deeply pipelined OpenCL kernel architecture that can efficiently speed up the compute-intensive procedures of the denoising algorithm by exploiting the intrinsic parallelism and maximizing data reuse. The final design was implemented on Intel's Arria-10 GX1150 FPGA, and achieved an average 1.2× performance boost and an outstanding 8.3× reduction in energy dissipation when compared to a state-of-the-art GPU-based software design.
Published: 2020
Full Text: View/download PDF

49. Hardware design of convolution calculation module based on systolic array

Author: Wang Chunlin and Tan Kejun
Subjects: fpga, systolic array, convolution computation, high level synthesis, Electronics, TK7800-8360
Abstract: To address the long broadcast paths and heavy fan-in/fan-out data paths caused by high parallelism when a Field Programmable Gate Array (FPGA) implements the convolution computation of a convolutional neural network, this paper adopts a systolic array to implement the convolution calculation module of the network, fixes the weights in each processing unit, sets the systolic array size according to the dimensions of the input and output feature maps, and finally realizes the hardware design of the convolution calculation module through Vivado high-level synthesis. The experimental results show that the design has low resource occupancy and good scalability while meeting the timing requirements of level 1 pipelining.
Published: 2020
Full Text: View/download PDF

50. Deep Learning Accelerators’ Configuration Space Exploration Effect on Performance and Resource Utilization: A Gemmini Case Study

Author: Dennis Agyemanh Nana Gookyi, Eunchong Lee, Kyungho Kim, Sung-Joon Jang, and Sang-Seol Lee
Subjects: deep learning, hardware accelerators, open-source, Gemmini, systolic array, GEMM, Chemical technology, TP1-1185
Abstract: Though custom deep learning (DL) hardware accelerators are attractive for making inferences in edge computing devices, their design and implementation remain a challenge. Open-source frameworks exist for exploring DL hardware accelerators. Gemmini is an open-source systolic array generator for agile DL accelerator exploration. This paper details the hardware/software components generated using Gemmini. The general matrix-to-matrix multiplication (GEMM) of different dataflow options, including output/weight stationary (OS/WS), was explored in Gemmini to estimate the performance relative to a CPU implementation. The Gemmini hardware was implemented on an FPGA device to explore the effect of several accelerator parameters, including array size, memory capacity, and the CPU/hardware image-to-column (im2col) module, on metrics such as the area, frequency, and power. This work revealed that regarding the performance, the WS dataflow offered a speedup of 3× relative to the OS dataflow, and the hardware im2col operation offered a speedup of 1.1× relative to the operation on the CPU. For hardware resources, an increase in the array size by a factor of 2 led to an increase in both the area and power by a factor of 3.3, and the im2col module led to an increase in area and power by factors of 1.01 and 1.06, respectively.
Published: 2023
Full Text: View/download PDF
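
Note on entry 50: the image-to-column (im2col) step mentioned in the abstract rewrites a convolution as a matrix multiplication so that a GEMM engine such as a systolic array can execute it. The sketch below shows the idea in plain Python for a single-channel image with stride 1 and no padding. The function names and the single-channel simplification are assumptions for illustration and do not reflect Gemmini's actual interface.

```python
def im2col(image, kh, kw):
    """Unroll every kh x kw patch of a 2D image into one row (stride 1, no padding)."""
    h, w = len(image), len(image[0])
    rows = []
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            rows.append([image[i + di][j + dj]
                         for di in range(kh) for dj in range(kw)])
    return rows


def conv2d_via_gemm(image, kernel):
    """Convolution in the deep-learning sense (cross-correlation) as a matrix product."""
    kh, kw = len(kernel), len(kernel[0])
    flat_kernel = [v for row in kernel for v in row]  # kernel as one column
    patches = im2col(image, kh, kw)                   # (patch count) x (kh * kw)
    out_w = len(image[0]) - kw + 1
    # Matrix-vector product: one dot product per output pixel.
    flat = [sum(p * k for p, k in zip(patch, flat_kernel)) for patch in patches]
    return [flat[r:r + out_w] for r in range(0, len(flat), out_w)]
```

With several output channels the flattened kernels become the columns of a second matrix, and the product is a full GEMM of the kind the abstract maps onto output-stationary or weight-stationary dataflows.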
