889 results for "Stratix"
Search Results
2. A 4.29nJ/pixel Stereo Depth Coprocessor With Pixel Level Pipeline and Region Optimized Semi-Global Matching for IoT Application
- Author
-
Pingcheng Dong, Zhuoao Li, Zhuoyu Chen, Yuzhe Fu, Lei Chen, and Fengwei An
- Subjects
Hardware architecture, Coprocessor, Video Graphics Array, Pixel, Computer science, business.industry, Pipeline (computing), Stratix, Electrical and Electronic Engineering, Frame rate, Field-programmable gate array, business, Computer hardware
- Abstract
The semi-global matching (SGM) algorithm in stereo vision is a well-known depth-estimation method, since it can generate dense and robust disparity maps. However, the real-time processing and low power dissipation required by Internet-of-Things (IoT) applications are challenging to achieve because of its computational complexity. In this paper, we propose a hardware-oriented SGM algorithm with a pixel-level pipeline and region-optimized cost aggregation for high-speed processing and low hardware-resource usage. Firstly, the matching costs in a region are integrated with an optimization strategy to significantly reduce memory usage and improve the processing speed of the cost aggregation. Then, a two-layer parallel two-stage pipeline (TPTP) architecture, which enables pixel-level processing, is designed to calculate the aggregation along two directions (0° and 135°) to further relieve the crucial computational bottleneck of the SGM algorithm. Finally, the architecture is demonstrated on a low-cost Xilinx Spartan-7 device and an advanced Stratix-V FPGA device for VGA (640×480) depth estimation. The experimental results show that the proposed design preserves accuracy despite its compact hardware architecture. The pixel-level pipeline enables a processing speed of 355 frames per second (fps) at 109 MHz on the Spartan-7 FPGA device and 508 fps at 156 MHz on the Stratix-V FPGA. Besides, the coprocessor achieves an energy efficiency of 4.74 nJ/pixel with a power dissipation of 517 mW and 4.29 nJ/pixel with a power dissipation of 669 mW on these two FPGAs, respectively.
- Published
- 2022
- Full Text
- View/download PDF
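The cost-aggregation step described in the abstract above follows the classic SGM path-cost recurrence. A minimal Python sketch of that recurrence along a single scan direction (textbook form with illustrative penalties, not the paper's region-optimized variant):

```python
# Classic SGM path-cost recurrence along one scan direction:
# L(p, d) = C(p, d) + min(L(p-1, d),
#                         L(p-1, d±1) + P1,
#                         min_k L(p-1, k) + P2) - min_k L(p-1, k)
P1, P2 = 1, 4  # smoothness penalties (illustrative values)

def aggregate_path(costs, p1=P1, p2=P2):
    """costs: list of per-pixel matching-cost lists along one path.
    Returns the aggregated costs L(p, d) at the last pixel of the path."""
    prev = list(costs[0])                      # L at the first pixel
    for pixel_costs in costs[1:]:
        best_prev = min(prev)
        cur = []
        for d, c in enumerate(pixel_costs):
            candidates = [prev[d]]                      # same disparity
            if d > 0:
                candidates.append(prev[d - 1] + p1)     # jump by -1
            if d < len(prev) - 1:
                candidates.append(prev[d + 1] + p1)     # jump by +1
            candidates.append(best_prev + p2)           # large jump
            # subtracting best_prev keeps values bounded (standard SGM trick)
            cur.append(c + min(candidates) - best_prev)
        prev = cur
    return prev

# Tiny example: 3 pixels along a path, 4 disparity hypotheses each.
path = [[5, 1, 6, 7], [4, 2, 6, 8], [9, 3, 5, 7]]
agg = aggregate_path(path)
best_disparity = agg.index(min(agg))
```

In the full algorithm this recurrence is evaluated for several directions and the per-direction costs are summed before the winner-takes-all disparity selection.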
3. Mitigating Voltage Attacks in Multi-Tenant FPGAs
- Author
-
George Provelengios, Daniel Holcomb, and Russell Tessier
- Subjects
General Computer Science, Clock signal, Computer science, business.industry, 020208 electrical & electronic engineering, Cloud computing, 02 engineering and technology, Fault injection, 020202 computer hardware & architecture, Power (physics), Embedded system, Stratix, 0202 electrical engineering, electronic engineering, information engineering, Field-programmable gate array, business, Reset (computing), Electronic circuit
- Abstract
Recent research has exposed a number of security issues related to the use of FPGAs in embedded-system and cloud-computing environments. Circuits that deliberately waste power can be carefully crafted by a malicious cloud FPGA user and deployed to cause denial-of-service and fault-injection attacks. The main defense strategy used by FPGA cloud services involves checking user-submitted designs for circuit structures that are known to aggressively consume power. Unfortunately, this approach is limited by an attacker's ability to conceive new designs that defeat existing checkers. In this work, our contributions are twofold. We evaluate a variety of circuit power-wasting techniques that typically are not flagged by the design rule checks imposed by FPGA cloud computing vendors. The efficiencies of five power-wasting circuits, including our new design, are evaluated in terms of power consumed per logic resource. We then show that the source of voltage attacks based on power wasters can be identified. Our monitoring approach localizes the attack and suppresses the clock signal for the target region within 21 μs, which is fast enough to stop an attack before it causes a board reset. All experiments are performed using a state-of-the-art Intel Stratix 10 FPGA.
- Published
- 2021
- Full Text
- View/download PDF
4. OPTWEB: A Lightweight Fully Connected Inter-FPGA Network for Efficient Collectives
- Author
-
Michihiro Koibuchi, Yutaka Urino, Hiroshi Yamaguchi, and Kenji Mizutani
- Subjects
Ethernet, Remote direct memory access, Network packet, business.industry, Computer science, Local area network, Network topology, Theoretical Computer Science, Computational Theory and Mathematics, Hardware and Architecture, Stratix, Synchronization (computer science), business, Field-programmable gate array, Software, Computer hardware
- Abstract
Modern FPGA accelerators can be equipped with many high-bandwidth network I/Os, e.g., 64 × 50 Gbps, enabled by onboard optics or co-packaged optics. Some dozens of tightly coupled FPGA accelerators form an emerging computing platform for distributed data processing. However, a conventional indirect packet network using Ethernet IP cores imposes an unacceptably large amount of logic for handling such high-bandwidth interconnects on an FPGA. Besides the indirect network, another approach builds a direct packet network. Existing direct inter-FPGA networks have a low-radix network topology, e.g., a 2-D torus. However, a low-radix network has the disadvantage of a large diameter and a large average shortest-path length, which increases the latency of collectives. To mitigate both problems, we propose a lightweight, fully connected inter-FPGA network called OPTWEB for efficient collectives. Since all end-to-end separate communication paths are statically established using onboard optics, raw block data can be transferred with simple link-level synchronization. Once each source FPGA assigns a communication stream to a path by its internal switch logic between memory-mapped and stream interfaces for remote direct memory access (RDMA), a one-hop transfer is provided. Since each FPGA performs input/output of remote memory accesses between all FPGAs simultaneously, multiple RDMAs efficiently form collectives. The OPTWEB network provides 0.71-μs start-up latency for collectives among multiple Intel Stratix 10 MX FPGA cards with onboard optics. The OPTWEB network consumes 31.4 and 57.7 percent of adaptive logic modules for aggregate 400-Gbps and 800-Gbps interconnects on a custom Stratix 10 MX 2100 FPGA, respectively. The OPTWEB network reduces cost by 40 percent compared to a conventional packet network.
- Published
- 2021
- Full Text
- View/download PDF
5. A High-Throughput FPGA Accelerator for Short-Read Mapping of the Whole Human Genome
- Author
-
Chia-Hsiang Yang, Yen-Lung Chen, Bo-Yi Chang, and Tzi-Dar Chiueh
- Subjects
020203 distributed computing, Generator (computer programming), business.industry, Computer science, 02 engineering and technology, Bloom filter, Data structure, Computational Theory and Mathematics, Hardware and Architecture, Signal Processing, Stratix, 0202 electrical engineering, electronic engineering, information engineering, Hardware acceleration, Field-programmable gate array, business, Throughput (business), Access time, Computer hardware
- Abstract
The mapping of DNA subsequences to a known reference genome, referred to as “short-read mapping”, is essential for next-generation sequencing. Hundreds of millions of short reads need to be aligned to a tremendously long reference sequence, making short-read mapping very time-consuming. In this article, a high-throughput hardware accelerator is proposed to accelerate this task. A Bloom filter-based candidate mapping location (CML) generator and a folded processing element (PE) array are proposed to address CML selection and the Smith-Waterman (SW) alignment algorithm, respectively. It is shown that the proposed CML generator reduces the required memory accesses by 40 percent compared to the Ferragina-Manzini index (FM-index) solution by employing a down-sampling scheme. The proposed hierarchical Bloom filter (HBF) with optimized parameters achieves a 1.5×10⁴ times acceleration over the conventional Bloom filter. The proposed memory re-allocation scheme further reduces the memory access time of the HBF by a factor of 256. The proposed folded PE array delivers 1.2-to-3.2 times higher giga cell updates per second (GCUPS). The processing time can be further reduced by 53-to-72 percent by employing a fully pipelined PE array that allows a tailored shift amount for seeding. The accelerator is realized on a Stratix V GX FPGA with 16 GB of external SDRAM. Operated at 200 MHz, the proposed FPGA accelerator delivers 2.1-to-11 times higher throughput, with the highest accuracy (99 percent) and sensitivity (98 percent), compared to the state-of-the-art FPGA-based solutions.
- Published
- 2021
- Full Text
- View/download PDF
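The candidate-screening idea in the abstract above rests on the standard Bloom filter membership test. A minimal Python sketch of a generic Bloom filter (not the paper's hierarchical HBF or its optimized parameters):

```python
# Generic Bloom filter: k hash positions per item are set in an m-bit
# array; a query is "possibly present" only if all k bits are set.
# May return false positives, never false negatives.
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1024, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = 0  # bit array packed into a Python int

    def _positions(self, item):
        # Derive k independent positions by salting one strong hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

bf = BloomFilter()
for kmer in ["ACGT", "TTAG", "GGCA"]:   # toy k-mers from a read
    bf.add(kmer)
```

In a mapper, locations whose seeds fail this test can be discarded before the expensive Smith-Waterman alignment; the hierarchical and down-sampled variants in the paper refine exactly this screening step.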
6. FPGA Realization of Spherical Chaotic System with Application in Image Transmission
- Author
-
F. Javier Pérez-Pinal, Jose Cruz Nuñez-Perez, Esteban Tlelo-Cuautle, Yuma Sandoval-Ibarra, and Vincent Ademola Adeyemi
- Subjects
0209 industrial biotechnology, Article Subject, Computer science, General Mathematics, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, General Engineering, Chaotic, 02 engineering and technology, Engineering (General). Civil engineering (General), Grayscale, Synchronization, 020901 industrial engineering & automation, Transmission (telecommunications), Stratix, Attractor, QA1-939, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, TA1-2040, Field-programmable gate array, Algorithm, Realization (systems), Mathematics
- Abstract
This paper considers a three-dimensional nonlinear dynamical system capable of generating spherical attractors. The main activity is the realization of a spherical chaotic attractor on Intel and Xilinx FPGA boards, with a focus on the implementation of a secure communication system. The first major contribution is the successful synchronization of two chaotic spherical systems, in a VHDL program, in a master-slave topology using Hamiltonian forms. The synchronization errors show that the two spherical chaotic systems synchronize in a very short time, after which the error signals become zero. The second major contribution is the FPGA realization of a spherical chaotic attractor-based secure communication system, which involves encrypting both grayscale and RGB images with chaos and a diffusion key at the transmitting system, sending the encrypted image via the state variables, and reconstructing the encrypted image at the receiving system. The Intel Stratix III and Xilinx Artix-7 AC701 results are the same as those of MATLAB. The statistical analyses of the encrypted and received images show that the implemented system is very effective: the entropy test reveals a high degree of randomness in the encrypted images, and the obtained correlation coefficient of zero shows no relation between the original and encrypted images. Finally, the transmission system fully recovers the original grayscale and RGB images without loss of information.
- Published
- 2021
- Full Text
- View/download PDF
7. Design and Analysis of a Multirate 5-bit High-Order 52 fs rms ΔΣ Time-to-Digital Converter Implemented on 40 nm Altera Stratix IV FPGA
- Author
-
Karim Ansari Asl, Ebrahim Farshidi, Sawal Hamid Md Ali, Ahmad Mouri Zadeh Khaki, and Masuri Othman
- Subjects
General Computer Science, Dynamic range, business.industry, Computer science, General Engineering, Synchronization, Time-to-digital converter, Sampling (signal processing), Logic gate, Stratix, Calibration, General Materials Science, business, Field-programmable gate array, Computer hardware
- Abstract
This paper describes an FPGA implementation of a high-order continuous-time multi-stage noise-shaping (MASH) ΔΣ time-to-digital converter (TDC). The TDC is based on a gated switched-ring oscillator (GSRO) and employs a multirating technique to achieve improved performance over conventional ΔΣ TDCs. The proposed TDC has been implemented on an Altera Stratix IV FPGA development board. Dynamic and static tests were performed on the proposed design, and experimental results demonstrate that it can perform its function without the need for calibration. The FPGA board's built-in clock circuitry provides the sampling clocks and the operating frequencies of the GSROs. This work achieves 52 fs rms, an 89.7 dB dynamic range, and 0.18 ps time resolution at sampling rates of 200 MHz, 800 MHz, and 1600 MHz for the first, second, and third stages, respectively, demonstrating that the proposed third-order TDC can play an important role in applications such as ADPLLs and range finders, in which accuracy and speed are vital.
- Published
- 2021
- Full Text
- View/download PDF
8. FTA-GAN: A Computation-Efficient Accelerator for GANs With Fast Transformation Algorithm
- Author
-
Peixiang Yang, Zhongfeng Wang, and Wendong Mao
- Subjects
Hardware architecture, Computer Networks and Communications, Computer science, Dataflow, Computation, Computer Science Applications, Computer engineering, Artificial Intelligence, Gate array, Stratix, Deconvolution, Field-programmable gate array, Software, Energy (signal processing)
- Abstract
Generative adversarial networks (GANs) are making continuous breakthroughs in many machine learning tasks. Popular GANs usually involve computation-intensive deconvolution operations, which limit real-time applications. Prior works have proposed several accelerators for deconvolution, but all of them suffer from severe problems such as computation imbalance and large memory requirements. In this article, we first introduce a novel fast transformation algorithm (FTA) for deconvolution computation, which solves the computation-imbalance problem and removes the extra memory requirement for overlapped partial sums. Besides, it significantly reduces the computation complexity for various types of deconvolutions. Based on the FTA, we develop a fast computing core (FCC) and the corresponding computing array so that deconvolution can be computed efficiently. We next optimize the dataflow and storage scheme to further reuse on-chip memory and improve computation efficiency. Finally, we present a computation-efficient hardware architecture for GANs and validate it on several GAN benchmarks, such as deep convolutional GAN (DCGAN), energy-based GAN (EBGAN), and Wasserstein GAN (WGAN). The experimental results show that our design reaches 2211 GOPS at a 185-MHz working frequency on an Intel Stratix 10 SX field-programmable gate array (FPGA) board with satisfactory visual results. In brief, the proposed design achieves more than 2× hardware-efficiency improvement over previous designs and drastically reduces the storage requirement.
- Published
- 2021
- Full Text
- View/download PDF
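The "overlapped partial sums" problem the abstract above refers to arises in plain transposed convolution (deconvolution), the baseline computation the paper's FTA is designed to avoid. A 1-D Python sketch of that baseline (not the FTA itself):

```python
# Plain 1-D transposed convolution (deconvolution): each input element
# scatters a scaled copy of the kernel into the upsampled output, and
# neighbouring copies overlap, so their partial sums must be accumulated.
# Hardware accelerators must buffer these overlaps, which is exactly the
# extra memory requirement the paper's FTA removes.
def deconv1d(x, w, stride=2):
    out = [0.0] * ((len(x) - 1) * stride + len(w))
    for i, xi in enumerate(x):
        for j, wj in enumerate(w):
            out[i * stride + j] += xi * wj   # overlapped partial sums
    return out

y = deconv1d([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], stride=2)
```

Note how output positions 2 and 4 each receive contributions from two different inputs; with 2-D kernels and large feature maps these overlap regions are what inflate on-chip buffering in naive designs.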
9. FPGA-Based Real-Time Simulation Platform for Large-Scale STN-GPe Network
- Author
-
Linlu Zu, Fei Su, Hong Wang, and Min Chen
- Subjects
Computer science, Deep Brain Stimulation, Biomedical Engineering, 02 engineering and technology, Globus Pallidus, 03 medical and health sciences, 0302 clinical medicine, Subthalamic Nucleus, Real-time simulation, Stratix, 0202 electrical engineering, electronic engineering, information engineering, Internal Medicine, Waveform, Computer Simulation, MATLAB, Field-programmable gate array, Network model, computer.programming_language, Neurons, business.industry, General Neuroscience, Rehabilitation, 020201 artificial intelligence & image processing, Multiplier (economics), Performance improvement, business, computer, 030217 neurology & neurosurgery, Computer hardware
- Abstract
The real-time simulation of a large-scale subthalamic nucleus (STN)-external globus pallidus (GPe) network model is of great significance for the mechanism analysis and performance improvement of deep brain stimulation (DBS) for Parkinsonian states. This paper implements the real-time simulation of a large-scale STN-GPe network containing 512 single-compartment Hodgkin-Huxley type neurons on an Altera Stratix IV field-programmable gate array (FPGA) hardware platform. At the single-neuron level, resource optimization schemes such as multiplier substitution, fixed-point operation, nonlinear function approximation, and function recombination are adopted, which form the foundation of the large-scale network realization. At the network level, the simulation scale of the network is expanded using a module-reuse method at the cost of simulation time. The correlation coefficient between the neuron firing waveform on the FPGA platform and the MATLAB software simulation waveform is 0.9756. For the same physiological time, the FPGA platform simulates 75 times faster than a computer with an Intel Core i7-8700K 3.70 GHz CPU and 32 GB RAM. In addition, the established platform is used to analyze the effects of temporal-pattern DBS on network firing activities. The proposed large-scale STN-GPe network meets the need for real-time simulation, which is helpful in designing closed-loop DBS improvement strategies.
- Published
- 2020
- Full Text
- View/download PDF
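The network above integrates conductance-based neuron models step by step on hardware. A heavily simplified Python sketch of such a forward-Euler membrane update for a single passive compartment (illustrative constants, not the paper's STN/GPe Hodgkin-Huxley parameters):

```python
# Forward-Euler integration of a single-compartment membrane equation,
# C dv/dt = -g_leak (v - E_leak) + I_ext, the numerical skeleton that
# full Hodgkin-Huxley models extend with voltage-gated conductances.
# On the FPGA each such update becomes fixed-point multiply/add logic.
def simulate(i_ext, dt=0.01, steps=10000, c_m=1.0, g_leak=0.1, e_leak=-65.0):
    v = e_leak                  # start at the leak reversal potential (mV)
    trace = []
    for _ in range(steps):
        dv = (-g_leak * (v - e_leak) + i_ext) / c_m
        v += dt * dv            # forward-Euler update
        trace.append(v)
    return trace

trace = simulate(i_ext=1.0)
# The steady state approaches e_leak + i_ext / g_leak = -55 mV here.
v_final = trace[-1]
```

The paper's optimizations (multiplier substitution, function approximation) target exactly the multiply and nonlinear-function terms that a full gated model adds to this loop.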
10. Accelerating FPGA Routing Through Algorithmic Enhancements and Connection-aware Parallelization
- Author
-
Dries Vercruyce, Dirk Stroobandt, and Yun Zhou
- Subjects
010302 applied physics, Router, Speedup, General Computer Science, Computer science, 02 engineering and technology, Parallel computing, 01 natural sciences, 020202 computer hardware & architecture, Minimum bounding box, 0103 physical sciences, Stratix, Hardware_INTEGRATEDCIRCUITS, 0202 electrical engineering, electronic engineering, information engineering, Routing (electronic design automation), Physical design, Field-programmable gate array, Space partitioning
- Abstract
Routing is a crucial step in Field-Programmable Gate Array (FPGA) physical design, as it determines the routes of signals in the circuit, which significantly impacts the quality of the design implementation. Successfully routing all the signals of large circuits that utilize many FPGA resources can be very time-consuming. Attempts have been made to shorten the routing runtime for efficient design exploration while still expecting high-quality implementations. In this work, we elaborate on the connection-based routing strategy and algorithmic enhancements to improve serial FPGA routing. We also explore a recursive partitioning-based parallelization technique to further accelerate the routing process. To exploit more parallelism through a finer granularity in both spatial partitioning and routing, a connection-aware routing bounding box model is proposed for the source-sink connections of the nets. It is built upon the locations of each connection's source, its sink, and the geometric center of the net the connection belongs to, unlike the existing net-based routing bounding box, which covers all the pins of the entire net. We show that the proposed connection-aware routing bounding box is more beneficial for parallel routing than the existing net-based routing bounding box. The quality and runtime of the serial and multi-threaded routers are compared to the router in VPR 7.0.7. The large heterogeneous Titan23 designs, targeted to a detailed representation of the Stratix IV FPGA, are used for benchmarking. With eight threads, the parallel router using the connection-aware routing bounding box model reaches a speedup of 6.1× over the serial router in VPR 7.0.7 and is 1.24× faster than the one using the existing net-based routing bounding box model, while reducing the total wire-length by 10% and the critical path delay by 7%.
- Published
- 2020
- Full Text
- View/download PDF
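The connection-aware bounding box described in the abstract above can be sketched directly: one box per source-sink connection, spanning the source, the sink, and the net's geometric center. Illustrative Python (the coordinate handling is an assumption for illustration, not VPR's actual data structures):

```python
# Connection-aware routing bounding box: instead of one box covering all
# pins of a net (net-based model), each source->sink connection gets its
# own box built from the source, the sink, and the net's geometric
# center, giving the router a tighter, finer-grained search region.
def net_center(pins):
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def connection_bbox(source, sink, pins):
    cx, cy = net_center(pins)
    xs = [source[0], sink[0], cx]
    ys = [source[1], sink[1], cy]
    return (min(xs), min(ys), max(xs), max(ys))  # (xmin, ymin, xmax, ymax)

# A net with one source and three sinks; box for the source->(8, 1) connection.
source = (2, 2)
sinks = [(8, 1), (3, 9), (4, 4)]
pins = [source] + sinks
bbox = connection_bbox(source, sinks[0], pins)
```

Because the box excludes distant sibling sinks (here (3, 9)), different connections of the same net can be routed in disjoint regions, which is what enables the finer-grained spatial partitioning for parallel routing.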
11. A High-Linearity Vernier Time-to-Digital Converter on FPGAs With Improved Resolution Using Bidirectional-Operating Vernier Delay Lines
- Author
-
Xiangyu Li and Ke Cui
- Subjects
Computer science, Vernier scale, 020208 electrical & electronic engineering, Linearity, 02 engineering and technology, Signal, law.invention, Time-to-digital converter, law, Stratix, 0202 electrical engineering, electronic engineering, information engineering, Electronic engineering, Electrical and Electronic Engineering, Line (text file), Instrumentation, Jitter
- Abstract
Time-to-digital converters (TDCs) are a core component in many timing-critical systems. The tapped delay line is the most widely used style for field-programmable gate-array (FPGA)-based TDCs, but it suffers from several intractable problems such as poor nonlinearity and long delay-line occupation. Recently, a new ring-oscillator-based Vernier TDC circuit using carry chains was demonstrated to have much smaller nonlinearity with a much shorter delay line. Since the delay line used is not compensated, the timing jitter increases (i.e., the precision degrades) remarkably when the oscillation number is large. This forces an undesirably lower resolution to be selected to maintain acceptable precision. In this article, a novel bidirectional-operating Vernier delay line structure is proposed to reduce the required oscillation number during the fine time measurement. Traditionally, in a Vernier delay line, the fast signal propagates along the slow delay line while the slow signal propagates along the fast delay line; we call this the normal Vernier delay line. The new structure uses two parallelized Vernier delay lines, consisting of one normal Vernier delay line and one abnormal Vernier delay line. The term “abnormal” refers to the fast and slow signals being fed, unusually, to the fast and slow delay lines, respectively. A prototype TDC circuit based on this new method was implemented on a Stratix III FPGA. Test results demonstrate that by adopting the proposed circuit architecture, both the resolution and the root-mean-square error can be improved from the 30-40-ps level to the 20-30-ps level.
- Published
- 2020
- Full Text
- View/download PDF
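The sub-gate resolution of the Vernier delay lines discussed above comes from the small delay difference between the two lines. A minimal numerical Python sketch of the generic Vernier principle (illustrative stage delays, not the paper's carry-chain implementation):

```python
# Generic Vernier time-to-digital principle: a start edge travels through
# a slow line (t_slow per stage) and a stop edge launched dt later
# travels through a fast line (t_fast per stage). The stop edge closes
# the gap by (t_slow - t_fast) per stage, so the stage index at which it
# overtakes digitizes dt with resolution t_slow - t_fast, far below a
# single gate delay.
def vernier_stages(dt, t_slow=1.10, t_fast=1.00, max_stages=1000):
    start_time, stop_time = 0.0, dt
    for stage in range(1, max_stages + 1):
        start_time += t_slow         # start edge through the slow line
        stop_time += t_fast          # stop edge through the fast line
        if stop_time <= start_time:  # stop has caught up with start
            return stage
    return None                      # dt too large for this line length

# Resolution here is t_slow - t_fast = 0.1 time units per stage.
stages = vernier_stages(dt=0.75)
```

A dt of 0.75 with a 0.1 per-stage closure rate is resolved at the eighth stage; shrinking the delay difference refines the resolution but lengthens the measurement, which is the oscillation-count trade-off the paper's bidirectional structure attacks.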
12. Hardware Implementation of Streaming Logarithm Computing Unit for Fixed-Point Data
- Author
-
Yaroslav O. Hordiienko, Ievgen V. Korotkyi, and Anton Yu. Varfolomieiev
- Subjects
Floating point, Logarithm, Computer science, Clock rate, Stratix, Verilog, CORDIC, Field-programmable gate array, computer, computer.programming_language, ModelSim, Computational science
- Abstract
This work aims to create a hardware implementation of a streaming computing unit for logarithm calculation in fixed point. Logarithms are widely used in telecommunications, particularly in radio intelligence, to convert power spectrum values to decibels for further processing of spectrum data, e.g. for radio signal detection, range-finding, or direction-finding. For spectrum analysis of high-sampling-rate wideband signals, it is expedient to utilize hardware computing units for streaming logarithm calculation, implemented inside FPGA or ASIC chips. The market offers a large number of IP cores for logarithm calculation in floating point. Floating-point calculation units offer a high dynamic range but also consume a large amount of hardware resources, which can diminish the maximum clock frequency of devices. In this work, different approaches to logarithm calculation are considered, including CORDIC, Taylor series, and table-based methods. The authors propose a mathematical model and architecture of a streaming computing unit for base-2 logarithm calculation in fixed point, which can be easily adapted to any other base by simply multiplying the result by a constant. The proposed computing unit utilizes a table-based approach and counts leading zeroes in the argument. Based on the mathematical model, a high-level computational model was created in MATLAB® Simulink®. All the components of the model are compatible with HDL Coder. The proposed MATLAB® Simulink® model is parameterizable: one can set the word and fraction width for input/output data and the memory size for the table-based part. Using HDL Coder, a Verilog HDL implementation of the proposed logarithm computing unit was synthesized. Using HDL Verifier, the authors created a testbench in Verilog for verification of the computing unit at the RTL level of abstraction, based on reference data collected during simulation in Simulink.
Running the generated testbench in the ModelSim simulator for one million clock cycles proved that there are no differences in operation between the Simulink model and the generated HDL design. The authors synthesized the HDL implementation of the computing unit in Quartus Prime for a Stratix IV FPGA chip to evaluate the hardware cost of the proposed solution. The developed logarithm calculation unit was compared to existing CORDIC, Taylor series, and table-based implementations in terms of calculation error and hardware cost. Additionally, for comparison purposes, the authors created a hardware implementation of a base-2 logarithm calculation unit in single-precision floating point. For evaluating calculation error, the double-precision floating-point logarithm computing unit from Simulink was chosen as the reference. The comparison showed that the created computing unit provides a smaller calculation error than the existing fixed-point solutions, requires fewer hardware resources, and can operate at higher clock frequencies. All the created models and source codes are open and can be downloaded from GitHub.
- Published
- 2020
- Full Text
- View/download PDF
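The abstract above describes a table-based fixed-point log2: count leading zeros to get the integer part, then look up the fractional part from the top bits of the normalized mantissa. A Python sketch of that method (the table size and indexing are illustrative assumptions, not the paper's parameters):

```python
# Table-based fixed-point log2: for x = m * 2**n with m in [1, 2),
# log2(x) = n + log2(m). The leading-one position gives n (a leading-
# zero count in hardware); the top bits of m index a precomputed table.
import math

TABLE_BITS = 8
# log2 of the normalized mantissa 1 + i/256, indexed by its top 8 bits.
LOG2_TABLE = [math.log2(1 + i / 2**TABLE_BITS) for i in range(2**TABLE_BITS)]

def fixed_log2(x):
    """x: positive integer (a fixed-point word; a constant input scaling
    only shifts the result by a constant). Returns log2(x) as a float."""
    n = x.bit_length() - 1                    # leading-one position -> integer part
    if n >= TABLE_BITS:                       # keep the top TABLE_BITS mantissa bits
        idx = (x >> (n - TABLE_BITS)) & (2**TABLE_BITS - 1)
    else:                                     # small inputs: shift mantissa up
        idx = (x << (TABLE_BITS - n)) & (2**TABLE_BITS - 1)
    return n + LOG2_TABLE[idx]

approx = fixed_log2(1000)
exact = math.log2(1000)
```

The worst-case error of this sketch is bounded by the table step, log2(1 + 1/256) ≈ 0.0056; converting to decibels is then one constant multiply, matching the abstract's "any other base" remark.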
13. A Reconfigurable Architecture for Discrete Cosine Transform in Video Coding
- Author
-
Linhuang Wu, Zhifeng Chen, Xiuzhi Yang, Nam Ling, Jingyi Zheng, and Mingkui Zheng
- Subjects
business.industry, Computer science, 02 engineering and technology, Display resolution, Gate array, Stratix, 0202 electrical engineering, electronic engineering, information engineering, Media Technology, Discrete cosine transform, Codec, 020201 artificial intelligence & image processing, Electrical and Electronic Engineering, business, Field-programmable gate array, Computer hardware, Coding (social sciences)
- Abstract
Discrete cosine transform (DCT) is an indispensable module in video codecs and is a major part of many video coding standards, including the latest High Efficiency Video Coding (HEVC). As video resolution increases, both the transform sizes and the number of transforms increase continuously, which poses challenges to reusability, especially in hardware implementation. This paper presents a reconfigurable transform architecture to flexibly support the reuse of hardware across different transform sizes. The proposed architecture maximally reuses the hardware resources by rearranging the order of input data for different transform sizes while still exploiting the butterfly property. Furthermore, this architecture supports reconfigurable throughput according to different hardware-resource requirements. By applying the proposed architecture to the field-programmable gate array (FPGA) design of the HEVC core transform matrices, the synthesis results show much lower consumption of hardware resources compared to existing methods in the literature. The implementation on Altera's Stratix III FPGA can operate at 139 MHz and supports real-time processing of 3840×2160 ultra-high-definition video at a minimum of 45 f/s and up to 359 f/s for different DCT sizes.
- Published
- 2020
- Full Text
- View/download PDF
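The butterfly property the abstract above exploits is the even/odd decomposition of the DCT: sums of mirrored inputs feed the even-index outputs through a half-size transform. A Python sketch verifying that identity numerically (generic unnormalized DCT-II, not the HEVC integer matrices):

```python
# Butterfly property of the DCT-II: for an N-point input x, the sums
# s[i] = x[i] + x[N-1-i] determine the even-index outputs via an
# N/2-point DCT-II, and the differences determine the odd-index outputs.
# This is why one small transform datapath can be reused across sizes.
import math

def dct2(x):
    """Direct N-point DCT-II (unnormalized)."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i in range(n)) for k in range(n)]

def butterfly_stage(x):
    """First butterfly: sums feed the even outputs, differences the odd."""
    n = len(x)
    sums = [x[i] + x[n - 1 - i] for i in range(n // 2)]
    diffs = [x[i] - x[n - 1 - i] for i in range(n // 2)]
    return sums, diffs

x = [1.0, 3.0, 2.0, 5.0, 4.0, 8.0, 6.0, 7.0]
sums, diffs = butterfly_stage(x)
even_from_half = dct2(sums)   # equals the even-index outputs of dct2(x)
full = dct2(x)
```

Applying the same split recursively to `sums` is how an 8-point datapath also serves 16- and 32-point transforms in reusable hardware designs.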
14. Automatic Compilation of Diverse CNNs Onto High-Performance FPGA Accelerators
- Author
-
Jae-sun Seo, Yu Cao, Sarma Vrudhula, and Yufei Ma
- Subjects
Schedule, business.industry, Computer science, Dataflow, 020208 electrical & electronic engineering, Reconfigurability, 02 engineering and technology, computer.software_genre, Computer Graphics and Computer-Aided Design, Convolutional neural network, 020202 computer hardware & architecture, Software, Computer architecture, Stratix, 0202 electrical engineering, electronic engineering, information engineering, Compiler, Electrical and Electronic Engineering, business, Field-programmable gate array, computer
- Abstract
A broad range of applications is increasingly benefiting from the rapid and flourishing development of convolutional neural networks (CNNs). The FPGA-based CNN inference accelerator is gaining popularity due to its high performance and low power as well as the FPGA's conventional advantages of reconfigurability and flexibility. Without a general compiler to automate the implementation, however, significant effort and expertise are still required to customize the design for each CNN model. In this paper, we present a register-transfer-level (RTL) CNN compiler that automatically generates customized FPGA hardware for the inference tasks of various CNNs, in order to enable high-level fast prototyping of CNNs from software to FPGA while keeping the benefits of low-level hardware optimization. First, a general-purpose library of RTL modules is developed to model different operations at each layer. The integration and dataflow of the physical modules are predefined in the top-level system template and reconfigured during compilation for a given CNN algorithm. The runtime control of layer-by-layer sequential computation is managed by the proposed execution schedule, so that even highly irregular and complex network topologies, e.g., GoogLeNet and ResNet, can be compiled. The proposed methodology is demonstrated with various CNN algorithms, e.g., NiN, VGG, GoogLeNet, and ResNet, on two standalone Intel FPGAs, Arria 10 and Stratix 10, achieving end-to-end inference throughputs of 969 GOPS and 1604 GOPS, respectively, with a batch size of one.
- Published
- 2020
- Full Text
- View/download PDF
15. Architectural improvements and technological enhancements for the APEnet+ interconnect system
- Author
-
Piero Vicini, Michele Martinelli, Davide Rossetti, Andrea Biagioni, Pierluigi Paolucci, F. Lo Cicero, Elena Pastorelli, Roberto Ammendola, Ottorino Frezza, Alessandro Lonardo, Laura Tosoratto, and Francesco Simula
- Subjects
FOS: Computer and information sciences, Network architecture, Remote direct memory access, Grid network, business.industry, Computer science, Interface (computing), Electrical engineering, FOS: Physical sciences, Computational Physics (physics.comp-ph), Embedded system, Hardware Architecture (cs.AR), Stratix, Bandwidth (computing), business, Field-programmable gate array, Computer Science - Hardware Architecture, Instrumentation, Physics - Computational Physics, Mathematical Physics, PCI Express
- Abstract
The APEnet+ board delivers a point-to-point, low-latency, 3D-torus network interface card. In this paper we describe the latest generation of the APEnet NIC, APEnet v5, integrated in a PCIe Gen3 board based on a state-of-the-art, 28 nm Altera Stratix V FPGA. The NIC features a network architecture designed following the remote DMA paradigm and tailored to tightly bind the computing power of modern GPUs to the communication fabric. For the APEnet v5 board we show characterizing figures, such as the achieved bandwidth and BER, obtained by exploiting new high-performance Altera transceivers and PCIe Gen3 compliance.
- Published
- 2022
16. Vortex: Extending the RISC-V ISA for GPGPU and 3D-Graphics
- Author
-
Hyesoon Kim, Blaise Tine, Krishna Praveen Yalamarthy, and Fares Elsabbagh
- Subjects
Computer graphics, Computer science, OpenGL, Stratix, Parallel computing, General-purpose computing on graphics processing units, Field-programmable gate array, 3D computer graphics, Reconfigurable computing, ComputingMethodologies_COMPUTERGRAPHICS, Rendering (computer graphics)
- Abstract
The importance of open-source hardware and software has been increasing. However, despite GPUs being one of the more popular accelerators across various applications, there is very little open-source GPU infrastructure in the public domain. We argue that one of the reasons for the lack of open-source infrastructure for GPUs is rooted in the complexity of their ISAs and software stacks. In this work, we first propose an ISA extension to RISC-V that supports GPGPU computing and graphics. The main goal of the ISA extension proposal is to minimize the ISA changes so that the corresponding changes to the open-source ecosystem are also minimal, which makes for a sustainable development ecosystem. To demonstrate the feasibility of the minimally extended RISC-V ISA, we implemented the complete software and hardware stacks of Vortex on an FPGA. Vortex is a PCIe-based soft GPU that supports OpenCL and OpenGL. Vortex can be used in a variety of applications, including machine learning, graph analytics, and graphics rendering. Vortex can scale up to 32 cores on an Altera Stratix 10 FPGA, delivering a peak performance of 25.6 GFLOPS at 200 MHz.
- Published
- 2021
- Full Text
- View/download PDF
17. A memory bandwidth improvement with memory space partitioning for single-precision floating-point FFT on Stratix 10 FPGA
- Author
-
Kentaro Sano and Takaaki Miyajima
- Subjects
Computer science ,Computer cluster ,Reading (computer) ,Fast Fourier transform ,Stratix ,Bandwidth (computing) ,Memory bandwidth ,Parallel computing ,Space partitioning ,Single-precision floating-point format - Abstract
The Fast Fourier Transform (FFT) is one of the fundamental computational methods used in the fields of computational science and high-performance computing. The single-precision floating-point complex FFT is known to be memory-bandwidth-bound and often becomes a bottleneck of application acceleration in these fields. We are researching and developing a parallel FFT on FPGA(s) to overcome this problem. In this paper, we discuss the memory bandwidth of the single-precision floating-point complex FFT on an FPGA. Our FFT implementation is based on a state-of-the-art OpenCL implementation provided by Intel. We first show that the computational performance of the FFT on the Intel PAC D5005 is proportional to the effective memory bandwidth of the main memory. Then we propose a memory sub-system to improve the effective memory bandwidth: specifically, a memory space partitioning scheme together with sub-modules that access each memory space individually. In our FPGA design running at 270 MHz, two memory channels of DDR4-2400 memory are used, one for reading and one for writing. Our proposed memory sub-system achieved an effective memory bandwidth of 22.57 GB/s (65.3% of the theoretical peak of this implementation) when the number of data points for the FFT was 16,777,216.
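The abstract's observation that FFT performance tracks effective memory bandwidth can be illustrated with a roofline-style estimate. The sketch below assumes an out-of-place FFT that streams the whole array (read plus write) once per radix pass; the function name, the default radix, and the model itself are illustrative assumptions, not taken from the paper.

```python
import math

def fft_time_estimate(n_points, bandwidth_gbs, radix=8, bytes_per_point=8):
    # A memory-bound out-of-place FFT reads and writes the entire array
    # once per radix pass, so runtime scales inversely with the effective
    # bandwidth. bytes_per_point = 8 for single-precision complex data.
    passes = max(1, round(math.log(n_points, radix)))
    bytes_moved = 2 * n_points * bytes_per_point * passes  # read + write
    return bytes_moved / (bandwidth_gbs * 1e9)  # seconds
```

Under this model, doubling the effective bandwidth halves the FFT time, which is the proportionality the paper measures on the PAC D5005.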
- Published
- 2021
- Full Text
- View/download PDF
18. Dense FPGA Compute Using Signed Byte Tuples
- Author
-
Martin Langhammer, Simon Finn, Sergey Gribok, and Bogdan Pasca
- Subjects
Computer science ,Stratix ,Byte ,Multiplier (economics) ,Applications of artificial intelligence ,Hardware_ARITHMETICANDLOGICSTRUCTURES ,Arithmetic ,Tuple ,Field-programmable gate array ,Block (data storage) ,Integer (computer science) - Abstract
The importance of AI to FPGAs has resulted in ever-increasing low-precision hard arithmetic features in newer devices. Many FPGAs, including those from Achronix, Intel, and Xilinx, have significantly increased the density of INT8 and INT9 embedded multipliers. Mainstream devices with these enhanced densities still support the traditional intermediate integer (typically 18-bit) multipliers, with IEEE-754 floating-point now becoming more prevalent as well. Recently, Intel introduced the Stratix 10 NX FPGA, which is targeted specifically at AI acceleration. This device contains a new type of AI-specific DSP Block with approximately an order of magnitude higher INT8 density than previous FPGA industry DSP Blocks. Larger standard FPGA integer precisions, however, are not directly supported. Intel has described some methods of aggregating larger multipliers from the NX Blocks, but these are somewhat smaller than typically used by DSP applications. Larger multiplications can also be useful for other AI applications, such as those found in training. In this paper, we introduce the concept of signed tuples, which can be used to assemble signed multipliers into more useful larger-precision multipliers by inexpensively leveraging FPGA soft logic. We demonstrate several constructions of INT16 multipliers, with some modes requiring less than 3 ALMs per INT16 multiplier when implemented in a tensor format. We also describe the application of these methods to even larger multipliers and alternate constructs such as complex multiplication. We show that there is essentially no performance degradation or system fitting impact from our method. The mid-size NX device can support up to 33 TOPs of INT16 (from 29,700 constructed INT16 multipliers on a mid-speed-grade device) with this approach, which is higher than any other current or announced monolithic-die FPGA. Our methods are not limited to FPGAs, or to any particular starting precision, and so may be used for other aggregations as well.
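The underlying decomposition (assembling a signed 16-bit product from byte-sized partial products) can be sketched in software. This is a minimal illustration of the standard split into a signed high byte and an unsigned low byte, not necessarily Intel's exact construction; the function name is hypothetical.

```python
def int16_mul_from_bytes(x, y):
    # Split each signed 16-bit operand as x = xh*256 + xl, where xh is the
    # signed high byte (arithmetic shift) and xl the unsigned low byte.
    xh, xl = x >> 8, x & 0xFF
    yh, yl = y >> 8, y & 0xFF
    # Four byte-sized partial products recombined with shifts and adds.
    # xh*yl and xl*yh are signed-by-unsigned, i.e. they fit a 9x9 signed
    # multiplier; xl*yl fits an unsigned 8x8 (or signed 9x9) multiplier.
    return (xh * yh << 16) + ((xh * yl + xl * yh) << 8) + xl * yl
```

The soft-logic cost the paper reports comes from the shift-and-add recombination, which the signed-tuple formulation keeps cheap.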
- Published
- 2021
- Full Text
- View/download PDF
19. FPGA Implementation for LDPC Decoders Using A Novel Memory Effective Decoding Algorithm
- Author
-
Feng Haijie, Yu Zhenghong, Ma Ziqian, Li Hongyuan, and Zheng Wanjun
- Subjects
Variable (computer science) ,Computer science ,Stratix ,Convergence (routing) ,Node (circuits) ,Low-density parity-check code ,Chip ,Field-programmable gate array ,Algorithm ,Decoding methods - Abstract
In this paper, a novel memory-effective algorithm for decoding Low-Density Parity-Check (LDPC) codes is proposed with a view to reducing implementation complexity and hardware resources. The algorithm, called the check node self-update (CNSU) algorithm, is based on the layered normalized min-sum (LNMS) decoding algorithm and utilizes iteration-parallel techniques to integrate both the variable node (VN) messages and the a-posteriori probability messages into the check node (CN) messages, which eliminates the memories for both the variable node and a-posteriori probability messages as well as the a-posteriori probability update module in the CN unit. Based on the proposed CNSU algorithm, the design of a partially parallel decoder architecture and serial simulations, followed by implementation on the Stratix II EP2S180 FPGA, are presented. The results show that the proposed algorithm significantly reduces hardware memory resources and chip area while keeping the bit-error-rate (BER) performance and speeding up convergence compared with LNMS.
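For reference, the check-node step of the normalized min-sum decoding that CNSU builds on can be sketched as follows. This is a baseline LNMS-style update with an illustrative normalization factor, not the proposed CNSU algorithm itself.

```python
def check_node_update(msgs, alpha=0.75):
    # Normalized min-sum check-node update: the message sent back on edge i
    # is the product of the signs of all *other* incoming messages times
    # the minimum of their magnitudes, scaled by the factor alpha.
    out = []
    for i in range(len(msgs)):
        others = msgs[:i] + msgs[i + 1:]
        sign = -1 if sum(m < 0 for m in others) % 2 else 1
        out.append(alpha * sign * min(abs(m) for m in others))
    return out
```

In hardware this reduces to tracking the two smallest magnitudes and an overall sign, which is why check-node units are cheap relative to the message memories that CNSU eliminates.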
- Published
- 2021
- Full Text
- View/download PDF
20. End-to-End FPGA-based Object Detection Using Pipelined CNN and Non-Maximum Suppression
- Author
-
Mohamed Ibrahim, Jae-sun Seo, Eriko Nurvitadhi, Anupreetham Anupreetham, Yu Cao, Vaughn Betz, Ajay Kuzhively, Andrew Boutros, Abinash Mohanty, and Mathew Hall
- Subjects
Computer science ,Minimum bounding box ,business.industry ,Feature extraction ,Stratix ,Overhead (computing) ,Latency (engineering) ,business ,Convolutional neural network ,Throughput (business) ,Computer hardware ,Object detection - Abstract
Object detection is an important computer vision task, with many applications in autonomous driving, smart surveillance, robotics, and other domains. Single-shot detectors (SSD) coupled with a convolutional neural network (CNN) for feature extraction can efficiently detect, classify and localize various objects in an input image with very high accuracy. In such systems, the convolution layers extract features and predict the bounding box locations for the detected objects as well as their confidence scores. Then, a non-maximum suppression (NMS) algorithm eliminates partially overlapping boxes and selects the bounding box with the highest score per class. However, these two components are strictly sequential; a conventional NMS algorithm needs to wait for all box predictions to be produced before processing them. This prohibits any overlap between the execution of the convolutional layers and NMS, resulting in significant latency overhead and throughput degradation. In this paper, we present a novel NMS algorithm that alleviates this bottleneck and enables a fully-pipelined hardware implementation. We also implement an end-to-end system for low-latency SSD-MobileNet-V1 object detection, which combines a state-of-the-art deeply-pipelined CNN accelerator with a custom hardware implementation of our novel NMS algorithm. As a result of our new algorithm, the NMS module adds a minimal latency overhead of only 0.13 μs to the SSD-MobileNet-V1 convolution layers. Our end-to-end object detection system implemented on an Intel Stratix 10 FPGA runs at a maximum operating frequency of 350 MHz, with a throughput of 609 frames-per-second and an end-to-end batch-1 latency of 2.4 ms. Our system achieves 1.5× higher throughput and 4.4× lower latency compared to the current state-of-the-art SSD-based object detection systems on FPGAs.
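The conventional, strictly sequential NMS that the paper's pipelined algorithm replaces works roughly as follows. This is a greedy reference sketch; the box format (x1, y1, x2, y2) and the IoU threshold are illustrative choices, not the paper's.

```python
def iou(a, b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    # Greedy NMS: repeatedly keep the highest-scoring remaining box and
    # drop every box that overlaps it above the threshold. Note it needs
    # *all* boxes up front, which is the serialization the paper removes.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= iou_thresh]
    return keep
```

Because the sort and the pairwise suppression both require the full set of predictions, this formulation cannot start until the last convolution layer finishes, motivating the paper's streaming redesign.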
- Published
- 2021
- Full Text
- View/download PDF
21. A Memory-Efficient Adaptive Optimal Binary Search Tree Architecture for IPV6 Lookup Address
- Author
-
D. Shalini Punithavathani and M. M. Vijay
- Subjects
Computer science ,Pipeline (computing) ,Optimal binary search tree ,Stratix ,Lookup table ,Verilog ,Static random-access memory ,Parallel computing ,Hardware_ARITHMETICANDLOGICSTRUCTURES ,Altera Quartus ,computer ,AND gate ,computer.programming_language - Abstract
Internet Protocol version 6 (IPv6), with its increased address and prefix lengths, poses challenges in memory efficiency and incremental updates. This work therefore proposes an adaptive optimal binary search tree (AOBT) based IPv6 lookup (AOBT-IL) architecture. The AOBT structure is introduced to minimize memory utilization. The IP lookup design is implemented in Verilog HDL on an Altera Stratix II device using Quartus II. The performance of the proposed method is validated using different lookup table sizes with comparative analysis. The proposed method accomplishes better outcomes in maximum frequency, memory, SRAM, and logic elements when compared to existing methods such as balanced parallelized frugal lookup (BPFL), linear pipelined IPv6 lookup architecture (IPILA), and parallel optimized linear pipeline (POLP).
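For context, the IPv6 longest-prefix-match problem that the AOBT accelerates can be stated with a plain binary trie. The dict-based sketch below is a software baseline for the lookup semantics only, not the proposed AOBT structure; the 'nh' key is an illustrative convention.

```python
def insert(trie, prefix, length, nexthop):
    # prefix is a 128-bit integer with its significant bits at the top;
    # walk one node per prefix bit and store the next hop at the end.
    node = trie
    for i in range(length):
        node = node.setdefault((prefix >> (127 - i)) & 1, {})
    node['nh'] = nexthop

def lookup(trie, addr):
    # Walk the trie along the address bits, remembering the last next hop
    # seen: that is the longest matching prefix.
    node, best = trie, None
    for i in range(128):
        best = node.get('nh', best)
        bit = (addr >> (127 - i)) & 1
        if bit not in node:
            return best
        node = node[bit]
    return node.get('nh', best)
```

A trie needs up to 128 sequential memory accesses per lookup; tree-based schemes like the AOBT reorganize the prefixes so a pipelined hardware search touches far fewer levels.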
- Published
- 2021
- Full Text
- View/download PDF
22. BLASTP-ACC: Parallel Architecture and Hardware Accelerator Design for BLAST-Based Protein Sequence Alignment
- Author
-
Yi-Chang Lu and Yu-Cheng Li
- Subjects
Sequence ,Databases, Factual ,Xeon ,business.industry ,Computer science ,020208 electrical & electronic engineering ,Biomedical Engineering ,Proteins ,Equipment Design ,02 engineering and technology ,Parallel computing ,Giga ,Software ,Stratix ,0202 electrical engineering, electronic engineering, information engineering ,Hardware acceleration ,Amino Acid Sequence ,Electrical and Electronic Engineering ,business ,Field-programmable gate array ,Sequence Alignment ,Dram - Abstract
In this study, we design a hardware accelerator for a widely used sequence alignment algorithm, the basic local alignment search tool for proteins (BLASTP). The architecture of the proposed accelerator consists of five stages: a new systolic-array-based one-hit finding stage, a novel RAM-REG-based two-hit finding stage, a refined ungapped extension stage, a faster gapped extension stage, and a highly efficient parallel sorter. The system is implemented on an Altera Stratix V FPGA with a processing speed of more than 500 giga cell updates per second (GCUPS). It can receive a query sequence, compare it with the sequences in the database, and generate a list sorted in descending order of the similarity scores between the query sequence and the subject sequences. Moreover, it is capable of processing both query and subject protein sequences comprising as many as 8192 amino acid residues in a single pass. Using data from the National Center for Biotechnology Information (NCBI) database, we show that a speed-up of more than 3X can be achieved with our hardware compared to the runtime required by BLASTP software on an 8-thread Intel Xeon CPU with 144 GB DRAM.
- Published
- 2019
- Full Text
- View/download PDF
23. Microfluidic Cooling of a 14-nm 2.5-D FPGA With 3-D Printed Manifolds for High-Density Computing: Design Considerations, Fabrication, and Electrical Characterization
- Author
-
Aravind Dasu, Thomas E. Sarvey, Sreejith Kochupurackal Rajan, Gutala Ravi Prakash, Ankit Kaul, and Muhannad S. Bakir
- Subjects
Materials science ,business.industry ,Dice ,Heat sink ,Industrial and Manufacturing Engineering ,Die (integrated circuit) ,Electronic, Optical and Magnetic Materials ,Coolant ,Gate array ,Thermal ,Stratix ,Optoelectronics ,Electronics cooling ,Electrical and Electronic Engineering ,business - Abstract
The 2.5-D integration is becoming a common method of tightly integrating heterogeneous dice with dense interconnects for efficient, high-bandwidth inter-die communication. While this tight integration improves performance, it also increases the challenge of heat extraction by increasing aggregate package powers and introducing thermal crosstalk between the adjacent dice. In this article, a microfluidic heat sink is used to cool a 2.5-D Stratix 10 GX field-programmable gate array (FPGA) consisting of an FPGA die surrounded by four transceiver dice. The heat sink utilizes a heterogeneous micropin-fin array with micropin-fin densities that are tailored to the local heat fluxes of the underlying dice. Enabled through a 3-D printed enclosure for fluid delivery, the assembled heat sink has a total height of 6.5 mm, including the tubes used for fluid delivery. The heat sink is tested in an open-loop system with deionized water as a coolant, and thermal performance is compared against a high-end air-cooled heat sink. Improvements in die temperatures, computational density, and thermal coupling between the dice are observed. The effect of the FPGA power on the surrounding transceiver die temperatures was reduced by a factor of 10× to over 100× when compared with the air-cooled heat sink.
- Published
- 2019
- Full Text
- View/download PDF
24. FPGA-based interrogation controller with optimized pipeline architecture for very large-scale fiber-optic interferometric sensor arrays
- Author
-
Zhongjie Ren, Rihong Zhu, Wenjun Peng, Jieyu Qian, and Ke Cui
- Subjects
Computer science ,business.industry ,Mechanical Engineering ,Controller (computing) ,Pipeline (computing) ,02 engineering and technology ,021001 nanoscience & nanotechnology ,01 natural sciences ,Multiplexing ,Atomic and Molecular Physics, and Optics ,Electronic, Optical and Magnetic Materials ,010309 optics ,Time-division multiplexing ,Wavelength-division multiplexing ,0103 physical sciences ,Stratix ,Demodulation ,Electrical and Electronic Engineering ,0210 nano-technology ,Field-programmable gate array ,business ,Computer hardware - Abstract
Fiber-optic sensor arrays are usually organized using hybrid time division multiplexing (TDM)/wavelength division multiplexing (WDM) techniques to share common optical components and reduce cost. Modern sensor arrays can support up to several thousand sensors by employing distributed amplification techniques. They generate a huge amount of data and pose other big challenges for the interrogation controller of the sensor system, such as long demodulation time and heavy computational burden. Aiming to solve these problems, we present a complete design of an interrogation controller based on a field programmable gate array (FPGA), leveraging its powerful parallelism. The interrogation controller adopts a hardware-implementation-friendly modulation method called rectangular-pulse binary (RPB) modulation. The functional modules in the FPGA are properly organized according to the working principle of the RPB method and carefully designed by optimizing their pipeline architecture to minimize the output latency, which is essential to increasing the maximum supportable number of sensors. Analysis of the resource budget and the actual resource consumption demonstrates that the controller can support 1000+ sensors in real time using a single mid-range Stratix III FPGA chip. An 8-sensor array prototype is constructed to validate its functionality. This work can greatly reduce the cost of the sensor system and push it closer to practical applications.
- Published
- 2019
- Full Text
- View/download PDF
25. An efficient and adaptable multimedia system for converting PAL to VGA in real-time video processing
- Author
-
Deepak Kumar Jain, Sunil Jacob, Jafar A. Alzubi, and Varun G. Menon
- Subjects
Video Graphics Array ,Computer science ,business.industry ,020207 software engineering ,02 engineering and technology ,Video processing ,computer.software_genre ,Videoconferencing ,Stratix ,VHDL ,0202 electrical engineering, electronic engineering, information engineering ,Super video graphics array ,020201 artificial intelligence & image processing ,Altera Quartus ,Field-programmable gate array ,business ,computer ,Computer hardware ,Information Systems ,computer.programming_language - Abstract
Real-time video processing has found its range of applications from defense to consumer electronics for surveillance, video conferencing, etc. With the advent of Field Programmable Gate Arrays (FPGAs), flexible real-time video processing systems that can meet hard real-time constraints are easily realized with short development time. Most of the existing solutions have high utilization of system resources and are not very flexible across applications. Here we propose a hardware–software co-design for an FPGA-based real-time video processing system to convert video in standard Phase Alternating Line (PAL) 576i format to standard video of Video Graphics Array (VGA)/Super Video Graphics Array (SVGA) format with little utilization of resources. Switching between multiple video streams, character/text overlaying, and skin color detection are also incorporated in the system. The system is also adaptable for rugged applications. VHSIC Hardware Description Language (VHDL) code for the architecture was synthesized using Altera Quartus II and targeted for an Altera Stratix I FPGA. The results confirm that the proposed system performs efficient conversion with much lower resource utilization than the existing solutions. Since the proposed system is also flexible, many other applications can be incorporated in the future.
- Published
- 2019
- Full Text
- View/download PDF
26. Inside Project Brainwave's Cloud-Scale, Real-Time AI Processor
- Author
-
Prerak Patel, Gabriel Weisz, Kalin Ovtcharov, Lo Daniel, Doug Burger, Shlomi Alkalay, Stephen F. Heil, Adam Sapek, Michael Haselman, Steven K. Reinhardt, Todd Massengill, Jeremy Fowers, Eric S. Chung, Michael K. Papamichael, Adrian M. Caulfield, Logan Adams, Sitaram Lanka, Ming Liu, Lisa Woods, and Mahdi Ghandi
- Subjects
Computer science ,business.industry ,Recurrent neural nets ,Cloud computing ,02 engineering and technology ,020202 computer hardware & architecture ,Microarchitecture ,Recurrent neural network ,Computer architecture ,Parallel processing (DSP implementation) ,Hardware and Architecture ,Stratix ,0202 electrical engineering, electronic engineering, information engineering ,System on a chip ,Electrical and Electronic Engineering ,Field-programmable gate array ,business ,Software - Abstract
Growing computational demands from deep neural networks (DNNs), coupled with diminishing returns from general-purpose architectures, have led to a proliferation of Neural Processing Units (NPUs). This paper describes the Project Brainwave NPU (BW-NPU), a parameterized microarchitecture specialized at synthesis time for convolutional and recurrent DNN workloads. The BW-NPU deployed on an Intel Stratix 10 280 FPGA achieves sustained performance of 35 teraflops at a batch size of 1 on a large recurrent neural network (RNN).
- Published
- 2019
- Full Text
- View/download PDF
27. Digital Hardware Implementation of Gaussian Wilson–Cowan Neocortex Model
- Author
-
Shaghayegh Gomar and Majid Ahmadi
- Subjects
education.field_of_study ,Control and Optimization ,Neocortex ,Quantitative Biology::Neurons and Cognition ,Computer science ,business.industry ,Gaussian ,Population ,Static timing analysis ,Computer Science Applications ,Computer Science::Hardware Architecture ,Computational Mathematics ,symbols.namesake ,medicine.anatomical_structure ,Artificial Intelligence ,Stratix ,medicine ,symbols ,Point (geometry) ,Field-programmable gate array ,business ,education ,Computer hardware ,Electronic circuit - Abstract
Hardware implementation of biological neural models can help in better understanding brain functionality, implementing cognitive tasks, and studying brain diseases. The Gaussian Wilson–Cowan model, one of the well-known population-based models, represents neuronal functionality in the neocortex. In this paper, the Gaussian Wilson–Cowan model is investigated in terms of the feasibility of its digital implementation. A digital model is proposed for the Gaussian Wilson–Cowan model and examined from the dynamical and timing behavior points of view. The evaluations indicate that the digitized model is able to reproduce the dynamical bifurcations just as the original model does. An efficient digital hardware system is given for the proposed model with minimal required resources using the Verilog Hardware Description Language. The digital architectures are physically implemented on an Altera FPGA board. Experimental results show that the proposed circuits take at most 2% of the available resources of an Altera Stratix board. In addition, static timing analysis indicates that the circuits can work at a maximum frequency of 244 MHz.
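The kind of dynamics being digitized can be sketched with a forward-Euler update of a coupled excitatory/inhibitory Wilson–Cowan pair, substituting a Gaussian for the classical sigmoidal activation. All coefficients, parameters, and function names below are illustrative placeholders under that assumption, not the paper's model constants.

```python
import math

def gaussian(x, mu=1.0, sigma=1.0):
    # Gaussian activation in place of the usual sigmoid (illustrative).
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

def euler_step(E, I, dt=0.001, P=1.0, Q=0.0):
    # One forward-Euler step of an excitatory (E) / inhibitory (I)
    # population pair; coupling weights are placeholder values.
    dE = -E + (1 - E) * gaussian(12 * E - 4 * I + P)
    dI = -I + (1 - I) * gaussian(13 * E - 11 * I + Q)
    return E + dt * dE, I + dt * dI
```

A fixed-point version of exactly this kind of update loop, with the Gaussian approximated in hardware-friendly arithmetic, is what a digital FPGA realization iterates each clock cycle.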
- Published
- 2019
- Full Text
- View/download PDF
28. COFFE 2
- Author
-
Sadegh Yazdanshenas and Vaughn Betz
- Subjects
010302 applied physics ,Standard cell ,General Computer Science ,Computer science ,Interface (Java) ,business.industry ,Circuit design ,02 engineering and technology ,01 natural sciences ,020202 computer hardware & architecture ,Embedded system ,0103 physical sciences ,Stratix ,Lookup table ,0202 electrical engineering, electronic engineering, information engineering ,Hardware_ARITHMETICANDLOGICSTRUCTURES ,Full custom ,Field-programmable gate array ,business ,Hardware_LOGICDESIGN ,Block (data storage) - Abstract
FPGAs are becoming more heterogeneous to better adapt to different markets, motivating rapid exploration of different blocks/tiles for FPGAs. To evaluate a new FPGA architectural idea, one should be able to accurately obtain the area, delay, and energy consumption of the block of interest. However, current FPGA circuit design tools can only model simple, homogeneous FPGA architectures with basic logic blocks and also lack DSP and other heterogeneous block support. Modern FPGAs are instead composed of many different tiles, some of which are designed in a full custom style and some of which mix standard cell and full custom styles. To fill this modelling gap, we introduce COFFE 2, an open-source FPGA design toolset for automatic FPGA circuit design. COFFE 2 uses a mix of full custom and standard cell flows and supports not only complex logic blocks with fracturable lookup tables and hard arithmetic but also arbitrary heterogeneous blocks. To validate COFFE 2 and demonstrate its features, we design and evaluate a multi-mode Stratix III-like DSP block and several logic tiles with fracturable LUTs and hard arithmetic. We also demonstrate how COFFE 2’s interface to VTR allows full evaluation of block-routing interfaces and various fracturable 6-LUT architectures.
- Published
- 2019
- Full Text
- View/download PDF
29. A Novel Parallel Architecture for Template Matching based on Zero-Mean Normalized Cross-Correlation
- Author
-
Xiaotao Wang, Liangliang Han, and Xingbo Wang
- Subjects
Normalization (statistics) ,template matching ,General Computer Science ,Pixel ,Cross-correlation ,business.industry ,Computer science ,Template matching ,General Engineering ,Image processing ,Real image ,parallel architecture ,Automatic target recognition ,normalized cross-correlation measure ,Stratix ,General Materials Science ,lcsh:Electrical engineering. Electronics. Nuclear engineering ,business ,Image resolution ,Algorithm ,lcsh:TK1-9971 ,Digital signal processing ,FPGA - Abstract
Template matching based on zero-mean normalized cross-correlation measure (ZNCC) has been widely used in a broad range of image processing applications. To meet the requirements for high processing speed, small size, and variable image size in automatic target recognition systems, a novel field-programmable gate array (FPGA)-based parallel architecture is presented in this paper for the ZNCC computation. The proposed architecture employs two groups of RAM blocks, one of which is used for the multiply-accumulate operations of the real and the reference images and the other for data rearrangement of the reference image, and their functions are switched through 2-input multiplexers when searching at the next row. Moreover, the sum of the pixels in the searching area of the real image is computed through serially accumulating the differences between the new column in the current searching area and the old column in the last searching area using one dual-port RAM. Simultaneously, the sum of the squares of the pixels is calculated in the same way. Using the Altera Stratix II FPGA chip (EP2S90F780I4) as the target device, the compilation results with Quartus II show that compared with the traditional architecture, the synthesis logic utilization decreases from 63% to 35% and the usage of DSP blocks decreases from 59% to 39%, while the memory bits only increase by 8% and the usage of other resources is nearly the same. The simulation and practical experimental results show that the proposed architecture can effectively improve the performance of the practical automatic target recognition system.
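The ZNCC measure the architecture computes can be stated as a brute-force reference. This NumPy sketch implements the textbook definition; the paper's FPGA design produces the same quantity but updates the window sums incrementally column by column rather than recomputing each window.

```python
import numpy as np

def zncc_map(image, template):
    # Zero-mean normalized cross-correlation of the template against every
    # window of the image: subtract each window's mean and the template's
    # mean, then normalize by both standard deviations (via L2 norms).
    th, tw = template.shape
    t = template - template.mean()
    tnorm = np.sqrt((t * t).sum())
    out = np.zeros((image.shape[0] - th + 1, image.shape[1] - tw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            w = image[r:r + th, c:c + tw]
            wz = w - w.mean()
            denom = tnorm * np.sqrt((wz * wz).sum())
            out[r, c] = (wz * t).sum() / denom if denom else 0.0
    return out
```

The per-window mean and sum-of-squares terms in the denominator are exactly the running sums the paper maintains with its dual-port RAMs.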
- Published
- 2019
30. Accelerating an FHE Integer Multiplier Using Negative Wrapped Convolution and Ping-Pong FFT
- Author
-
Shuguo Li and Xiang Feng
- Subjects
Cooley–Tukey FFT algorithm ,Computer science ,020208 electrical & electronic engineering ,Fast Fourier transform ,02 engineering and technology ,020202 computer hardware & architecture ,Convolution ,Multiplier (Fourier analysis) ,Computer Science::Hardware Architecture ,symbols.namesake ,Fourier transform ,Strassen algorithm ,Stratix ,0202 electrical engineering, electronic engineering, information engineering ,symbols ,Electrical and Electronic Engineering ,Algorithm ,Integer (computer science) - Abstract
This brief proposes a novel hardware structure for large integer multiplication in fully homomorphic encryption. We propose a method based on negative wrapped convolution to avoid zero padding in Strassen’s algorithm, which can cut down half of the Fourier transform length. In addition, we also optimize the ping-pong fast Fourier transform algorithm by doubling the transform throughput and generating the round constant on the fly. Based on our proposed method and optimized algorithm, we design and implement a 768 k-bit integer multiplier on Altera Stratix V field-programmable gate array (FPGA). Implementation results on FPGA show that our structure outperforms the current competitors in area efficiency.
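The negative wrapped (negacyclic) convolution at the heart of the scheme can be sketched with a weighted FFT: wrap-around products enter with a minus sign, so two length-n sequences can be multiplied with length-n transforms instead of zero-padding to 2n. This is a NumPy floating-point sketch of the mathematical idea; the paper's hardware works on exact large-integer words.

```python
import numpy as np

def negacyclic_convolution(a, b):
    # Computes c[k] = sum_{i+j=k} a[i]b[j] - sum_{i+j=k+n} a[i]b[j].
    # Pre-weighting both inputs by the 2n-th roots of unity w^i turns the
    # negacyclic convolution into an ordinary length-n cyclic convolution.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n = len(a)
    w = np.exp(1j * np.pi * np.arange(n) / n)
    c = np.fft.ifft(np.fft.fft(a * w) * np.fft.fft(b * w)) / w
    return np.round(c.real).astype(np.int64)
```

To multiply large integers, the operands are first split into small limbs whose negacyclic product is then carry-propagated; halving the transform length is where the paper's area saving comes from.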
- Published
- 2019
- Full Text
- View/download PDF
31. Acceleration of LSTM With Structured Pruning Method on FPGA
- Author
-
Ruihan Hu, Hao Wang, Shaorun Wang, Sheng Chang, Jin He, Qijun Huang, and Peng Lin
- Subjects
Hardware architecture ,Speedup ,General Computer Science ,Computer science ,Computation ,pruning ,020208 electrical & electronic engineering ,General Engineering ,020206 networking & telecommunications ,02 engineering and technology ,Parallel computing ,Recurrent neural network ,hardware acceleration ,Stratix ,0202 electrical engineering, electronic engineering, information engineering ,General Materials Science ,lcsh:Electrical engineering. Electronics. Nuclear engineering ,Pruning (decision trees) ,LSTM ,Field-programmable gate array ,lcsh:TK1-9971 ,Throughput (business) ,FPGA - Abstract
This paper focuses on accelerating long short-term memory (LSTM), which is one of the popular types of recurrent neural networks (RNNs). Because of the large number of weight memory accesses and the high computational complexity of its cascade-dependent structure, it is a big challenge to efficiently implement the LSTM on field-programmable gate arrays (FPGAs). To speed up the inference on FPGA, considering its limited resources, a structured pruning method is proposed that can not only reduce the LSTM model's size without loss of prediction accuracy but also eliminate imbalanced computation and irregular memory accesses. Besides that, the hardware architecture of the compressed LSTM is designed to pursue high performance. As a result, the implementation of an LSTM language module on a Stratix V GXA7 FPGA can achieve 85.2 GOPS directly on the sparse LSTM network by our method, corresponding to 681.6-GOPS effective throughput on the dense one, which shows that the proposed structured pruning algorithm achieves a 7.82× speedup when only 1/8 of the parameters are retained. We hope that our method can give an efficient way to accelerate the LSTM and similar recurrent neural networks when a resource-limited environment is emphasized.
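The key property of structured (as opposed to unstructured) pruning is that every hardware lane ends up with the same amount of work. The sketch below illustrates one common block-balanced scheme, keeping the same number of largest-magnitude weights in every fixed-width group of each row; it is illustrative of the idea, and the paper's exact grouping may differ.

```python
import numpy as np

def structured_prune(W, keep_ratio=0.125, block=8):
    # Within every `block`-wide group of each row, zero all but the
    # `keep` largest-magnitude weights. Every group keeps the same count,
    # so the surviving nonzeros are balanced across processing elements.
    rows, cols = W.shape
    keep = max(1, int(block * keep_ratio))
    Wp = np.zeros_like(W)
    for r in range(rows):
        for c in range(0, cols, block):
            g = W[r, c:c + block]
            idx = np.argsort(np.abs(g))[-keep:]  # top-|w| inside the group
            Wp[r, c:c + block][idx] = g[idx]
    return Wp
```

With keep_ratio = 1/8 this mirrors the paper's setting of retaining one-eighth of the parameters while avoiding the load imbalance of purely magnitude-based unstructured sparsity.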
- Published
- 2019
- Full Text
- View/download PDF
32. PETRA: A 22nm 6.97TFLOPS/W AIB-Enabled Configurable Matrix and Convolution Accelerator Integrated with an Intel Stratix 10 FPGA
- Author
-
Sung-Gun Cho, Wei Tang, Chester Liu, and Zhengya Zhang
- Subjects
Very-large-scale integration ,Computer science ,business.industry ,Interface (computing) ,Stratix ,Latency (audio) ,Multiplication ,business ,Field-programmable gate array ,Matrix multiplication ,Computer hardware ,Convolution - Abstract
PETRA is a configurable FP16 matrix multiplication and convolution accelerator designed to be 2.5D integrated using Advanced Interface Bus (AIB). PETRA is built upon four 16×16 systolic arrays, but it employs a configurable H-tree accumulation to improve both the latency and the utilization by up to 8×. A 22nm 3.04mm2 PETRA prototype provides 1.433TFLOPS in computing matrix-matrix multiplication (MMM) and convolution (conv) at 0.88V, and it achieves a 6.97TFLOPS/W peak efficiency at 0.7V. PETRA is integrated with an Intel Stratix 10 FPGA in a multi-chip package (MCP) to provide the flexibility of FPGA and the performance and efficiency of PETRA.
- Published
- 2021
- Full Text
- View/download PDF
33. An efficient FPGA-based design for the AVMF filter
- Author
-
Ahmed Ben Atitallah
- Subjects
Nios II ,Hardware architecture ,Coprocessor ,business.industry ,Computer science ,Software ,Filter (video) ,Embedded system ,Stratix ,VHDL ,business ,Field-programmable gate array ,computer ,computer.programming_language - Abstract
This paper introduces an efficient parallel hardware architecture to implement the Adaptive Vector Median Filter (AVMF) on a Field Programmable Gate Array (FPGA). The architecture is developed in the VHSIC Hardware Description Language (VHDL) and integrated into a Hardware/Software (HW/SW) environment as a coprocessor. The NIOS II soft-core processor is used to execute the SW part. Communication between the HW and SW parts is carried out through the Avalon bus. Experimental results on the Stratix II development board show that the HW/SW AVMF system reduces processing time by a factor of 572 relative to the SW-only solution at 140 MHz, with a small decrease in image quality.
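The vector median operation at the core of the AVMF can be stated compactly. This NumPy sketch computes only the vector median step; the adaptive filter additionally decides per pixel whether to replace it, using a noise-detection threshold not reproduced here.

```python
import numpy as np

def vector_median(window):
    # The vector median of a window of color pixels is the pixel that
    # minimizes the sum of Euclidean distances to all other pixels,
    # so an impulsive outlier can never be selected.
    pts = window.reshape(-1, window.shape[-1]).astype(float)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    return pts[np.argmin(dist.sum(axis=1))]
```

The pairwise-distance accumulation is what the paper parallelizes in hardware, since every distance in a window can be computed independently.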
- Published
- 2021
- Full Text
- View/download PDF
34. Enabling energy-efficient DNN training on hybrid GPU-FPGA accelerators
- Author
-
Hao Chen, Xin He, Jiawen Liu, Dong Li, Zhen Xie, Guoyang Chen, and Weifeng Zhang
- Subjects
Power management ,Exploit ,Computer science ,Design space exploration ,020207 software engineering ,02 engineering and technology ,020202 computer hardware & architecture ,Scheduling (computing) ,Computer architecture ,Stratix ,0202 electrical engineering, electronic engineering, information engineering ,Field-programmable gate array ,Throughput (business) ,Efficient energy use - Abstract
DNN training consumes orders of magnitude more energy than inference and requires innovative use of accelerators to improve energy efficiency. However, despite having complementary features, GPUs and FPGAs have been mostly used independently for the entire training process, thus neglecting the opportunity of assigning individual but distinct operations to the most suitable hardware. In this paper, we take the initiative to explore new opportunities and viable solutions for enabling energy-efficient DNN training on hybrid accelerators. To overcome fundamental challenges including avoiding training throughput loss, enabling fast design space exploration, and efficient scheduling, we propose a comprehensive framework, Hype-training, that utilizes a combination of offline characterization, performance modeling, and online scheduling of individual operations. Experimental tests using NVIDIA V100 GPUs and Intel Stratix 10 FPGAs show that Hype-training is able to exploit a mixture of GPUs and FPGAs at a fine granularity to achieve significant energy reduction, by 44.3% on average and up to 59.7%, without any loss in training throughput. Hype-training can also enforce power caps more effectively than state-of-the-art power management mechanisms on GPUs.
- Published
- 2021
- Full Text
- View/download PDF
35. Particle Mesh Ewald for Molecular Dynamics in OpenCL on an FPGA Cluster
- Author
-
Emery Davis, Lawrence C. Stewart, Brian W. Sherman, Vipin Sachdeva, Carlo Pascoe, and Martin C. Herbordt
- Subjects
Computer science ,Particle Mesh ,Stratix ,Scalability ,Fast Fourier transform ,Parallel computing ,Field-programmable gate array ,Network topology ,Host (network) ,Interpolation - Abstract
Molecular Dynamics (MD) simulations play a central role in physics-driven drug discovery. MD applications often use the Particle Mesh Ewald (PME) algorithm to accelerate electrostatic force computations, but efficient parallelization has proven difficult due to the high communication requirements of distributed 3D FFTs. In this paper, we present the design and implementation of a scalable PME algorithm that runs on a cluster of Intel Stratix 10 FPGAs and can handle FFT sizes appropriate to real-world drug discovery projects (grids up to 128³). To our knowledge, this is the first work to fully integrate all aspects of the PME algorithm (charge spreading, 3D FFT/IFFT, and force interpolation) within a distributed FPGA framework. The design is fully implemented in OpenCL for flexibility and ease of development and uses 100 Gbps links for direct FPGA-to-FPGA communication without the need for host interaction. We present experimental data for up to 4 FPGAs (e.g., 206 microseconds per timestep for a 65,536-atom simulation and a 64³ 3D FFT), outperforming GPUs. Additionally, we discuss design scalability on clusters with differing topologies up to 64 FPGAs (with expected performance greater than all known GPU implementations) and integration with other hardware components to form a complete molecular dynamics application. We predict best-case performance of 6.6 microseconds per timestep on 64 FPGAs.
- Published
- 2021
- Full Text
- View/download PDF
36. High-Performance Spectral Element Methods on Field-Programmable Gate Arrays : Implementation, Evaluation, and Future Projection
- Author
-
Artur Podobas, Niclas Jansson, Philipp Schlatter, Stefano Markidis, Martin Karp, Tobias Kenter, and Christian Plessl
- Subjects
Computer engineering ,Computer science ,Gate array ,Spectral element method ,Stratix ,Hardware acceleration ,Pascal (programming language) ,Projection (set theory) ,Field-programmable gate array ,Ampere ,computer ,computer.programming_language - Abstract
Improvements in computer systems have historically relied on two well-known observations: Moore's law and Dennard scaling. Today, both of these observations are ending, forcing computer users, researchers, and practitioners to abandon the comforts of general-purpose architectures in favor of emerging post-Moore systems. Among the most salient of these post-Moore systems is the Field-Programmable Gate Array (FPGA), which strikes a convenient balance between complexity and performance. In this paper, we study the applicability of modern FPGAs to accelerating the Spectral Element Method (SEM), a core component of many computational fluid dynamics (CFD) applications. We design a custom SEM hardware accelerator operating in double precision that we empirically evaluate on the latest Stratix 10 GX-series FPGAs, and we position its performance (and power efficiency) against state-of-the-art systems such as the ARM ThunderX2, NVIDIA Pascal/Volta/Ampere Tesla-series cards, and general-purpose manycore CPUs. Finally, we develop a performance model for our SEM accelerator, which we use to project future FPGAs' performance and role in accelerating CFD applications, ultimately answering the question: what characteristics would a perfect FPGA for CFD applications have?
- Published
- 2021
- Full Text
- View/download PDF
37. GORDON: Benchmarking Optane DC Persistent Memory Modules on FPGAs
- Author
-
Nicholas Beckwith, Jing Jane Li, and Jialiang Zhang
- Subjects
010302 applied physics ,Profiling (computer programming) ,Hardware_MEMORYSTRUCTURES ,Computer science ,business.industry ,Byte ,02 engineering and technology ,DIMM ,01 natural sciences ,020202 computer hardware & architecture ,Microarchitecture ,Non-volatile memory ,Embedded system ,0103 physical sciences ,Scalability ,Stratix ,0202 electrical engineering, electronic engineering, information engineering ,business ,Field-programmable gate array - Abstract
Scalable non-volatile memory DIMMs became commercially available to FPGAs with the release of Intel's Optane DC Persistent Memory (DCPM) product. This new class of memory combines the benefits of DRAM-like solid-state memory (fast, byte-addressable) and Flash-like persistent storage (cost-effective, non-volatile), making FPGAs highly competitive in accelerating large-scale machine learning and data analytics applications. Despite this great promise, the performance characteristics of Optane DCPM remain relatively unfamiliar to FPGA developers compared to conventional DDRx DRAM or SSDs. Recent preliminary studies all use CPU-based systems running a full OS stack, which limits their ability to characterize the detailed performance of Optane DCPM. To fully exploit the advantages of Optane DCPM in FPGA-based accelerator design, we present the first FPGA-based Optane DCPM profiling framework, named GORDON, on a Stratix 10 DX FPGA. By leveraging the flexibility of the FPGA in building custom logic and the FPGA-specific features for Optane DCPM, GORDON addresses the fundamental limitations of prior CPU-based profiling. The detailed understanding of Optane DCPM may also benefit system design and optimization beyond FPGAs, on CPUs and GPUs.
- Published
- 2021
- Full Text
- View/download PDF
38. Compute-Capable Block RAMs for Efficient Deep Learning Acceleration on FPGAs
- Author
-
Valeria Bertacco, Charles Augustine, Jiecao Yu, Ravi Iyer, Reetuparna Das, Vidushi Goyal, Andrew Boutros, Xiaowei Wang, and Eriko Nurvitadhi
- Subjects
010302 applied physics ,Computer science ,Reconfigurability ,02 engineering and technology ,Parallel computing ,01 natural sciences ,020202 computer hardware & architecture ,0103 physical sciences ,Stratix ,Hardware_INTEGRATEDCIRCUITS ,0202 electrical engineering, electronic engineering, information engineering ,Bandwidth (computing) ,SIMD ,Hardware_ARITHMETICANDLOGICSTRUCTURES ,Routing (electronic design automation) ,Field-programmable gate array ,Hardware_REGISTER-TRANSFER-LEVELIMPLEMENTATION ,Throughput (business) ,Block (data storage) - Abstract
The density of FPGA on-chip memory has been continuously increasing, with modern FPGAs having thousands of block RAMs (BRAMs) distributed across their reconfigurable fabric. These distributed BRAMs can provide a tremendous amount of on-chip bandwidth for efficient acceleration of data-intensive applications. In this work, we propose enhancing the ubiquitous FPGA BRAMs with in-memory compute capabilities. As a result, BRAMs can act as normal storage units, or their bitlines can be re-purposed as SIMD lanes executing bit-serial arithmetic operations. Our proposed architectural change results in 1.6× and 2.3× increases in the peak multiply-accumulate throughput of a large Stratix 10 FPGA, at a minimal cost of only a 1.8% increase in the FPGA die size and no change to the BRAM's interface to the programmable routing. Then, we present RIMA, a reconfigurable in-memory accelerator architecture for deep learning (DL) inference. RIMA exploits the proposed compute-capable BRAMs and the FPGA's reconfigurability to achieve 1.25× and 3× higher performance compared to the state-of-the-art Brainwave DL soft processor for 8-bit integer and block floating-point precisions, respectively. In addition, RIMA implemented on a Stratix 10 FPGA enhanced with compute-capable BRAMs can achieve an order of magnitude higher performance compared to a same-generation GPU.
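Bit-serial arithmetic over bitlines can be modeled in software as processing one bit-plane of the activations per "cycle", with every SIMD lane contributing to a shifted partial sum. The sketch below is a rough behavioral model of that idea, not the actual BRAM circuit:

```python
def bitserial_mac(weights, activations, bits=8):
    """Multiply-accumulate computed one activation bit-plane at a time,
    the way bitline SIMD lanes would: each cycle processes bit b of
    every (non-negative) activation and adds the partial sum shifted by b."""
    acc = 0
    for b in range(bits):
        # one "cycle": select the weights whose activation has bit b set,
        # reduce them, then weight the partial sum by 2^b
        partial = sum(w for w, a in zip(weights, activations) if (a >> b) & 1)
        acc += partial << b
    return acc

ws, xs = [3, -1, 4, 2], [10, 20, 30, 40]
print(bitserial_mac(ws, xs))  # 210, same as the direct dot product
```

The latency grows with the activation bit-width while the lane count stays fixed, which is exactly the throughput/precision trade-off that makes bit-serial compute attractive for low-precision DL inference.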
- Published
- 2021
- Full Text
- View/download PDF
39. DMA Medusa: A Vendor-Independent FPGA-Based Architecture for 400 Gbps DMA Transfers
- Author
-
Jan Kubalek, Martin Spinler, Radek Isa, and Jakub Cabal
- Subjects
Ethernet ,business.industry ,Computer science ,Embedded system ,Transfer (computing) ,Packet analyzer ,Stratix ,Architecture ,business ,Field-programmable gate array ,Throughput (business) ,PCI Express - Abstract
FPGA accelerator cards are used for packet capture and monitoring in high-speed networks. With 400G Ethernet technology, there is a need to transfer data to and from host memory at 400 Gbps. Currently available architectures (for example [1], [2], [3]) are limited to throughputs up to 100 Gbps and are therefore not suitable for this use case. This paper presents a vendor-independent DMA architecture that can scale up to 400 Gbps throughput in a single FPGA using two PCIe Gen4 x16 slots bifurcated into four x8 interfaces. The architecture is designed to support hundreds of independent DMA channels and one or more PCIe endpoints with different configurations. We also demonstrate the performance of the proposed DMA architecture using results measured on an accelerator card with an Intel Stratix 10 DX FPGA.
- Published
- 2021
- Full Text
- View/download PDF
40. True Random Number Generator Based on Fibonacci-Galois Ring Oscillators for FPGA
- Author
-
Stefano Di Matteo, Luca Baldanzi, Luca Fanucci, Luca Crocetti, Jacopo Belli, Pietro Nannipieri, and Sergio Saponara
- Subjects
Random number generation ,Computer science ,Cryptography ,02 engineering and technology ,Ring oscillator ,Entropy ,Fibonacci ,FiGaRO ,FPGA ,Galois ,Generator ,NIST ,Number ,Random ,TRNG ,lcsh:Technology ,lcsh:Chemistry ,Stratix ,0202 electrical engineering, electronic engineering, information engineering ,General Materials Science ,System on a chip ,Field-programmable gate array ,Instrumentation ,lcsh:QH301-705.5 ,Randomness ,Fluid Flow and Transfer Processes ,number ,business.industry ,generator ,lcsh:T ,Process Chemistry and Technology ,020208 electrical & electronic engineering ,random ,General Engineering ,lcsh:QC1-999 ,020202 computer hardware & architecture ,Computer Science Applications ,Computer engineering ,lcsh:Biology (General) ,lcsh:QD1-999 ,lcsh:TA1-2040 ,Place and route ,business ,entropy ,lcsh:Engineering (General). Civil engineering (General) ,lcsh:Physics - Abstract
Random numbers are widely employed in cryptography and security applications. If the generation process is weak, the whole chain of security can be compromised: such weaknesses could be exploited by an attacker to retrieve the information, breaking even the most robust implementation of a cipher. Because of their intrinsically close relationship with the analogue parameters of the circuit, True Random Number Generators are usually tailored to a specific silicon technology and are not easily scalable to programmable hardware without affecting their entropy. At the same time, programmable hardware and programmable Systems on Chip are seeing growing adoption, including in security-critical applications where high-quality random number generation is mandatory. The work presented herein describes the design and validation of a digital True Random Number Generator for cryptographically secure applications on Field-Programmable Gate Arrays. After a preliminary study of the literature and of standards specifying requirements for random number generation, the design flow is illustrated, from specification definition to the synthesis phase. Several solutions have been studied to assess their performance on a Field-Programmable Gate Array device, with the aim of selecting the highest-performance architecture. The proposed designs have been tested and validated using the official test suites released by the NIST standardization body, assessing independence from place and route and the degree of randomness of the generated output. An architecture derived from the Fibonacci-Galois Ring Oscillator has been selected and synthesized on an Intel Stratix IV, supporting throughput up to 400 Mbps. The entropy achieved in the best configuration is greater than 0.995.
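One of the simplest checks in the NIST suites mentioned above is the frequency (monobit) test from SP 800-22: the proportion of ones in the output stream should be close to 1/2. A minimal sketch of that single test (the full suite runs many more):

```python
import math

def monobit_pvalue(bits):
    """NIST SP 800-22 frequency (monobit) test: convert bits to +/-1,
    sum them, and compute the p-value via the complementary error
    function. p >= 0.01 is the usual pass criterion."""
    n = len(bits)
    s = sum(1 if b else -1 for b in bits)
    return math.erfc(abs(s) / math.sqrt(n) / math.sqrt(2))

balanced = [i % 2 for i in range(100)]  # exactly half ones
biased = [1] * 90 + [0] * 10            # heavily skewed toward ones
print(monobit_pvalue(balanced) >= 0.01)  # True: passes
print(monobit_pvalue(biased) >= 0.01)    # False: fails
```

A biased ring-oscillator source would fail this test immediately, which is why such statistical validation is mandatory before a TRNG design can be considered usable for key generation.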
- Published
- 2021
41. Stratix 10 NX Architecture and Applications
- Author
-
Sergey Gribok, Martin Langhammer, Bogdan Pasca, and Eriko Nurvitadhi
- Subjects
Computer science ,Stratix ,Hardware_ARITHMETICANDLOGICSTRUCTURES ,Block floating-point ,FLOPS ,Field-programmable gate array ,Throughput (business) ,Single-precision floating-point format ,IEEE floating point ,Block (data storage) ,Computational science - Abstract
The advent of AI has driven the adoption of high-density, low-precision arithmetic on FPGAs. This has resulted in new methods for mapping both arithmetic functions and dataflows onto the fabric, as well as some changes to the embedded DSP Blocks. Technologies outside the FPGA realm have also evolved, such as the addition of tensor structures to GPUs and the introduction of numerous AI ASSPs, all of which claim higher performance and efficiency than current FPGAs. In this paper we introduce the Stratix 10 NX device (NX), a variant of FPGA specifically optimized for the AI application space. In addition to the computational capabilities of the standard programmable soft-logic fabric, a new type of DSP Block provides the dense arrays of low-precision multipliers typically used in AI implementations. The architecture of the block is tuned for the matrix-matrix and vector-matrix multiplications common in AI, with capabilities designed to work efficiently for both small and large matrix sizes. The base precisions are INT8 and INT4, along with shared-exponent support for block floating-point FP16 and FP12 numerics. All additions/accumulations can be done in INT32 or IEEE 754 single-precision floating point (FP32), and multiple blocks can be cascaded together to support larger matrices. We also describe methods by which the smaller-precision multipliers can be aggregated to create larger multipliers more applicable to standard signal-processing requirements. In terms of overall compute throughput, Stratix 10 NX achieves 143 INT8/FP16 TOPs/TFLOPs, or 286 INT4/FP12 TOPs/TFLOPs, at 600 MHz. Depending on the configuration, power efficiency is in the range of 1-4 TOPs/W or TFLOPs/W.
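The shared-exponent numerics can be illustrated with a toy block-floating-point quantizer: a block of values shares one power-of-two exponent, so the multipliers only ever see small integer mantissas. This is a sketch of the general idea, not Intel's exact FP16/FP12 block format:

```python
import math

def to_block_fp(xs, mant_bits=8):
    """Quantize a block of reals to signed integer mantissas sharing one
    power-of-two exponent, sized so the largest magnitude fits."""
    max_mag = max(abs(x) for x in xs)
    if max_mag == 0:
        return [0] * len(xs), 0
    exp = math.floor(math.log2(max_mag)) - (mant_bits - 2)
    scale = 2.0 ** exp
    limit = 2 ** (mant_bits - 1) - 1
    mants = [max(-limit, min(limit, round(x / scale))) for x in xs]
    return mants, exp

def from_block_fp(mants, exp):
    """Reconstruct the real values from mantissas and the shared exponent."""
    return [m * 2.0 ** exp for m in mants]

block = [0.5, -1.5, 3.0, 0.0]
mants, exp = to_block_fp(block)
print(mants, exp)                       # small integers plus one exponent
print(from_block_fp(mants, exp))        # round-trips these exact values
```

Small values within a block lose precision relative to true floating point, but every multiply becomes a cheap integer multiply, which is what lets a DSP block pack dense INT8-style multiplier arrays while still covering FP16-like dynamic range.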
- Published
- 2021
- Full Text
- View/download PDF
42. Folded Integer Multiplication for FPGAs
- Author
-
Martin Langhammer and Bogdan Pasca
- Subjects
010302 applied physics ,business.industry ,Computer science ,Karatsuba algorithm ,02 engineering and technology ,Parallel computing ,Folding (DSP implementation) ,Encryption ,01 natural sciences ,020202 computer hardware & architecture ,0103 physical sciences ,Stratix ,0202 electrical engineering, electronic engineering, information engineering ,Multiplier (economics) ,Hardware_ARITHMETICANDLOGICSTRUCTURES ,business ,Field-programmable gate array ,Throughput (business) ,Integer (computer science) - Abstract
Encryption, especially key exchange algorithms such as RSA, is a growing use model for FPGAs, driven by the adoption of the FPGA as a SmartNIC in the datacenter. While bulk encryption such as AES maps well to generic FPGA features, the very large multipliers required for RSA are a much more difficult problem. Although FPGAs contain thousands of small integer multipliers in DSP Blocks, aggregating them into very large multipliers is challenging because of the large amount of soft logic required, especially in the form of long adders, and the high embedded-multiplier count. In this paper, we describe a large-multiplier architecture that operates in a multi-cycle format and has a linear area/throughput ratio. We show results for a 2048-bit multiplier that has a latency of 118 cycles, accepts input data every 9th cycle, and closes timing at 377 MHz in an Intel Arria 10 FPGA and over 400 MHz in a Stratix 10. The proposed multiplier uses 1/9 of the DSP resources typically used in a 2048-bit Karatsuba implementation, showing a perfectly linear throughput-to-DSP-count ratio. Our proposed solution outperforms recently reported results in either arithmetic complexity (by making use of Karatsuba techniques) or scheduling efficiency (embedded DSP resources are fully utilized).
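For reference, one level of the Karatsuba decomposition that the paper compares against replaces four half-size multiplies with three, at the cost of extra additions. In Python big-integer form (an algorithmic illustration, not the hardware design):

```python
def karatsuba(x, y, bits=2048):
    """One Karatsuba level: a full-width product from three half-size
    multiplies (hi*hi, lo*lo, and one on the sums) instead of four."""
    half = bits // 2
    mask = (1 << half) - 1
    x_hi, x_lo = x >> half, x & mask
    y_hi, y_lo = y >> half, y & mask
    p_hi = x_hi * y_hi
    p_lo = x_lo * y_lo
    # the cross term recovered from one multiply plus two subtractions
    p_mid = (x_hi + x_lo) * (y_hi + y_lo) - p_hi - p_lo
    return (p_hi << bits) + (p_mid << half) + p_lo
```

Applied recursively, this reduces multiplier count at the price of long soft-logic adders for the sums and subtractions, which is exactly the area pressure the paper's folded multi-cycle architecture avoids.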
- Published
- 2021
- Full Text
- View/download PDF
43. A Memory-Efficient FM-Index Constructor for Next-Generation Sequencing Applications on FPGAs
- Author
-
Yi-Chang Lu, Nae-Chyun Chen, and Yu-Cheng Li
- Subjects
FOS: Computer and information sciences ,0301 basic medicine ,Instruction prefetch ,Computer science ,Controller (computing) ,String searching algorithm ,Data structure ,03 medical and health sciences ,030104 developmental biology ,Computer architecture ,Hardware Architecture (cs.AR) ,Stratix ,Overhead (computing) ,Field-programmable gate array ,Computer Science - Hardware Architecture - Abstract
FM-index is an efficient data structure for string search and is widely used in next-generation sequencing (NGS) applications such as sequence alignment and de novo assembly. Recently, FM-indexing is performed even down to the read level, raising demand for an efficient FM-index construction algorithm. In this work, we propose a hardware-compatible Self-Aided Incremental Indexing (SAII) algorithm and its hardware architecture. This novel algorithm builds the FM-index with no memory overhead, and the hardware system realizing it can be very compact. A parallel architecture and a special prefetch controller are designed to enhance computational efficiency. An SAII-based FM-index constructor is implemented on an Altera Stratix V FPGA board. The presented constructor supports DNA sequences of up to 131,072 bp, which is enough for small-scale references and for reads obtained from the current major platforms. Because the proposed constructor needs very few hardware resources, it can easily be integrated into different hardware accelerators designed for FM-index-based applications.
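The search that such a constructor enables can be sketched with a naive FM-index backward search over the Burrows-Wheeler transform. This is illustrative only: it scans the BWT on every query instead of using the sampled occurrence tables a real implementation would build:

```python
def bwt(text):
    """Burrows-Wheeler transform of text (must end with sentinel '$')."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)

def fm_count(text, pattern):
    """Count occurrences of pattern in text via FM-index backward search."""
    L = bwt(text)
    first = sorted(L)
    # C[c]: number of characters in the text lexicographically smaller than c
    C = {c: first.index(c) for c in set(L)}
    def occ(c, i):  # occurrences of c in L[:i] (naive; real indexes sample this)
        return L[:i].count(c)
    lo, hi = 0, len(L)
    for c in reversed(pattern):  # extend the match one character at a time
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

print(fm_count("ACGTACGT$", "ACG"))  # 2
```

Each pattern character needs only the C table and rank (occ) queries on the BWT, which is why the index is so amenable to compact hardware pipelines once those tables are built.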
- Published
- 2021
44. Spectrum Sensing System for Cognitive Radio
- Author
-
Kavitha Veerappan and Maheswari Murali
- Subjects
Spectrum analyzer ,Noise ,Cognitive radio ,Transmission (telecommunications) ,business.industry ,Computer science ,Stratix ,Bandwidth (signal processing) ,Electronic engineering ,Wireless ,business ,Frequency allocation - Abstract
The objective of this project is to develop a spectrum sensing system for Cognitive Radio (CR). Owing to the rapid change in wireless technologies over the last decade, demand for spectrum has grown to a large extent. This demand is addressed by Cognitive Radios, which find unoccupied bandwidth and allocate frequency for a specific task (transmission or reception) in the spectrum, rather than relying on the static frequency allocation commonly used today. The major task of a cognitive radio is to sense the spectrum dynamically for spectrum holes and allocate them to the users that need them. Spectrum sensing requires filtering the allotted channels, digitizing them, and using a Spectrum Analyzer (SA) to assess the average power in the allotted channels. This project targets the spectrum sensing task of cognitive radio. A 12-bit ADC digitizes the input signal, and an FFT-based spectrum analyzer measures the average signal power. Correlation of the same input is performed in a MAC unit to minimize the presence of noise in the spectrum. The entire spectrum sensing system is implemented on an Altera Stratix II FPGA.
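The average-power measurement can be sketched as binning the magnitude-squared spectrum into channels; a channel with low average power is a candidate spectrum hole. A naive DFT stands in here for the FPGA's FFT core:

```python
import cmath
import math

def dft(x):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT core)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]

def channel_power(samples, n_channels):
    """Average power per channel: bin the one-sided magnitude-squared
    spectrum into equal-width channels, as a spectrum analyzer would."""
    spec = [abs(v) ** 2 for v in dft(samples)]
    half = spec[:len(spec) // 2]  # one-sided spectrum for real input
    w = len(half) // n_channels
    return [sum(half[i * w:(i + 1) * w]) / w for i in range(n_channels)]

# a tone at bin 3 of 64 sits in the lowest quarter of the band,
# so channel 0 shows up as "occupied"
samples = [math.sin(2 * math.pi * 3 * t / 64) for t in range(64)]
powers = channel_power(samples, 4)
print(powers.index(max(powers)))  # 0
```

A real sensing chain would compare each channel's average power against a noise-floor threshold before declaring a hole available for secondary use.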
- Published
- 2021
- Full Text
- View/download PDF
45. A FPGA-Based Heterogeneous Implementation of NTRUEncrypt
- Author
-
Hai Jiang, Chaoyu Zhang, and Hexuan Yu
- Subjects
business.industry ,Computer science ,NTRU ,NTRUEncrypt ,Cryptography ,Parallel computing ,Encryption ,Computer Science::Hardware Architecture ,Stratix ,Cryptosystem ,Lattice-based cryptography ,Hardware_ARITHMETICANDLOGICSTRUCTURES ,Elliptic curve cryptography ,business ,Computer Science::Cryptography and Security - Abstract
Lattice-based cryptography is believed to resist attacks by future quantum computers. The NTRU (Nth-degree truncated polynomial ring unit) encryption algorithm, abbreviated NTRUEncrypt, belongs to the family of lattice-based public-key cryptosystems. Compared to other asymmetric cryptosystems such as RSA and elliptic curve cryptography (ECC), the encryption and decryption operations of NTRU rely chiefly on basic polynomial multiplication, which makes it faster than those alternatives. This paper proposes the first heterogeneous implementation of NTRUEncrypt on an FPGA (Altera Stratix V) and a CPU using OpenCL, showing that this kind of lattice-based cryptography lends itself excellently to parallelization and achieves high throughput as well as energy efficiency.
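The basic operation the abstract refers to is cyclic convolution in the ring Z_q[x]/(x^N − 1). A minimal model, with toy parameters that are nowhere near a secure NTRU configuration:

```python
def poly_mult(a, b, N, q):
    """Cyclic convolution of coefficient lists a and b in Z_q[x]/(x^N - 1):
    exponents wrap around modulo N, coefficients reduce modulo q."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            c[(i + j) % N] = (c[(i + j) % N] + a[i] * b[j]) % q
    return c

# (1 + x) * (1 + x + x^2) mod (x^3 - 1), q = 7
print(poly_mult([1, 1, 0], [1, 1, 1], 3, 7))  # [2, 2, 2]
```

Every output coefficient is an independent sum of products, so the N² partial products can be computed in parallel, which is what makes the operation map so well to FPGA fabric and OpenCL work-items.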
- Published
- 2021
- Full Text
- View/download PDF
46. Accelerating Convolutional Neural Networks in FPGA-based SoCs using a Soft-Core GPU
- Author
-
Mitko Veleski, Michael Hübner, Hector Gerardo Munoz Hernandez, and Marcelo Brandalero
- Subjects
Programmable logic device ,Speedup ,Computer science ,business.industry ,Deep learning ,Stratix ,Graphics processing unit ,System on a chip ,Parallel computing ,Artificial intelligence ,Field-programmable gate array ,business ,Convolutional neural network - Abstract
Field-Programmable Gate Arrays (FPGAs) have increased in complexity over the last few years. Now available in the form of Systems-on-Chip (SoCs) such as Xilinx Zynq or Intel Stratix, users are offered significant flexibility in deciding the best approach to execute their Deep Learning (DL) model: a) in a fixed, hardwired general-purpose processor, or b) using the programmable logic to implement application-specific processing cores. While the latter choice offers the best performance and energy efficiency, the programmable logic's limited size requires advanced strategies for mapping large models onto hardware. In this work, we investigate using a soft-core Graphics Processing Unit (GPU), implemented in the FPGA, to execute different Convolutional Neural Networks (CNNs). We evaluate the performance, area, and energy tradeoffs of running each layer in a) an ARM Cortex-A9 with Neon extensions and in b) the soft-core GPU, and find that the GPU overlay can provide a mean acceleration of 5.9× and 1.8× in convolution and max-pooling layers respectively compared to the ARM core. Finally, we show the potential of collaborative execution of CNNs using these two platforms together, with average speedups of 2× and 4.8× compared to using only the ARM core or only the soft GPU, respectively.
- Published
- 2021
- Full Text
- View/download PDF
47. SHA2 and SHA-3 accelerator design in a 7 nm technology within the European Processor Initiative
- Author
-
Luca Baldanzi, Matteo Bertolucci, Francesco Falaschi, Luca Fanucci, Stefano Di Matteo, Sergio Saponara, Luca Crocetti, and Pietro Nannipieri
- Subjects
Hash ,Computer Networks and Communications ,Computer science ,Hash function ,Cryptography ,02 engineering and technology ,Artificial Intelligence ,SHA-3 ,Stratix ,0202 electrical engineering, electronic engineering, information engineering ,EPI (European Processor Initiative) ,SHA2 ,Hardware_ARITHMETICANDLOGICSTRUCTURES ,Field-programmable gate array ,Hardware_REGISTER-TRANSFER-LEVELIMPLEMENTATION ,Keccak ,business.industry ,ASIC ,020208 electrical & electronic engineering ,FPGA verification ,020202 computer hardware & architecture ,Hardware and Architecture ,7 nm ,business ,Software ,Computer hardware - Abstract
This paper proposes the architecture of the hash accelerator developed in the framework of the European Processor Initiative. The proposed circuit supports all of the SHA2 and SHA-3 operating modes and will be one of the hardware cryptographic accelerators within the crypto-tile of the European Processor Initiative. The accelerator has been verified on a Stratix IV FPGA and then synthesized on the 7 nm TSMC silicon technology with Artisan libraries, obtaining throughputs higher than 50 Gbps for SHA2 and 230 Gbps for SHA-3, with complexity ranging from 15 to about 30 kGE and an estimated power dissipation of about 13 (SHA2) to 26 (SHA-3) mW at a 0.75 V supply voltage. The proposed design demonstrates absolute performance beyond the state of the art and efficiency aligned with it. One of the main contributions is that this is the first SHA2/SHA-3 accelerator synthesized on such an advanced technology.
- Published
- 2021
48. StencilFlow: Mapping Large Stencil Programs to Distributed Spatial Computing Systems
- Author
-
Torsten Hoefler, Dominic Hofer, Tal Ben-Nun, Andreas Kuster, Johannes de Fine Licht, and Tiziano De Matteis
- Subjects
FOS: Computer and information sciences ,CUDA ,Computer Science - Distributed, Parallel, and Cluster Computing ,Computer science ,Stratix ,Locality of reference ,Code generation ,Parallel computing ,Distributed, Parallel, and Cluster Computing (cs.DC) ,Deadlock ,Directed acyclic graph ,Field-programmable gate array ,Stencil - Abstract
Spatial computing devices have been shown to significantly accelerate stencil computations, but have so far relied on unrolling the iterative dimension of a single stencil operation to increase temporal locality. This work considers the general case of mapping directed acyclic graphs of heterogeneous stencil computations to spatial computing systems, assuming large input programs without an iterative component. StencilFlow maximizes temporal locality and ensures deadlock freedom in this setting, providing end-to-end analysis and mapping from a high-level program description to distributed hardware. We evaluate our generated architectures on a Stratix 10 FPGA testbed, yielding 1.31 TOp/s and 4.18 TOp/s in single-device and multi-device configurations, respectively, demonstrating the highest performance recorded for stencil programs on FPGAs to date. We then leverage the framework to study a complex stencil program from a production weather simulation application. Our work enables productively targeting distributed spatial computing systems with large stencil programs, and offers insight into the architecture characteristics required for their efficient execution in practice.
- Published
- 2021
49. Design of Fast-SSC Decoder for STT-MRAM Channel
- Author
-
Jianming Cui, Zengxiang Bao, Xiaojun Zhang, Guo Hua, and Chen Geng
- Subjects
Hardware architecture ,Magnetoresistive random-access memory ,Hardware_MEMORYSTRUCTURES ,Memory module ,Computer science ,business.industry ,Polar code ,Stratix ,Memory bandwidth ,business ,Throughput (business) ,Computer hardware ,Decoding methods - Abstract
To achieve fast decoding and improve throughput, this paper uses polar codes on the spin-transfer-torque MRAM (STT-MRAM) channel. Based on the Fast-SSC algorithm, a (256, 220) hardware architecture is designed, including the controller, processing element, Kronecker product, and memory module. The design reduces data-processing complexity by splitting node data and reduces memory bandwidth by increasing data reuse. The decoder is synthesized on a Stratix V 5SGXEA7N2F45C2; its decoding latency is 0.68 µs, and it achieves 375 Mbps at 167 MHz.
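The encoding side of a polar code, built from the Kronecker powers of the kernel F = [[1,0],[1,1]] over GF(2), can be sketched with the standard butterfly recursion instead of an explicit Kronecker-product matrix (encoding only; the Fast-SSC decoder itself is far more involved):

```python
def polar_encode(u):
    """Polar encoding x = u * F^(tensor n) over GF(2), F = [[1,0],[1,1]],
    computed recursively: encode (u1 XOR u2) and u2 for the two halves."""
    n = len(u)  # must be a power of two
    if n == 1:
        return u[:]
    half = n // 2
    left = polar_encode([a ^ b for a, b in zip(u[:half], u[half:])])
    right = polar_encode(u[half:])
    return left + right

print(polar_encode([1, 0, 1, 1]))  # [1, 1, 0, 1]
```

The same butterfly structure, run in reverse with log-likelihood ratios and with special fast nodes (rate-0, rate-1, repetition, SPC) pruned out, is what a Fast-SSC decoder exploits to shorten the decoding schedule.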
- Published
- 2021
- Full Text
- View/download PDF
50. Neighbors From Hell: Voltage Attacks Against Deep Learning Accelerators on Multi-Tenant FPGAs
- Author
-
Andrew Boutros, Nicolas Papernot, Vaughn Betz, and Mathew Hall
- Subjects
010302 applied physics ,Flexibility (engineering) ,FOS: Computer and information sciences ,Computer Science - Machine Learning ,Computer Science - Cryptography and Security ,business.industry ,Computer science ,Cloud computing ,Clock gating ,02 engineering and technology ,01 natural sciences ,020202 computer hardware & architecture ,Machine Learning (cs.LG) ,Embedded system ,0103 physical sciences ,Stratix ,Hardware Architecture (cs.AR) ,0202 electrical engineering, electronic engineering, information engineering ,Bitstream ,business ,Field-programmable gate array ,Resilience (network) ,Computer Science - Hardware Architecture ,Cryptography and Security (cs.CR) ,Efficient energy use - Abstract
Field-programmable gate arrays (FPGAs) are becoming widely used accelerators for a myriad of datacenter applications due to their flexibility and energy efficiency. Among these applications, FPGAs have shown promising results in accelerating low-latency real-time deep learning (DL) inference, which is becoming an indispensable component of many end-user applications. With the emerging research direction towards virtualized cloud FPGAs that can be shared by multiple users, the security aspect of FPGA-based DL accelerators requires careful consideration. In this work, we evaluate the security of DL accelerators against voltage-based integrity attacks in a multi-tenant FPGA scenario. We first demonstrate the feasibility of such attacks on a state-of-the-art Stratix 10 card using different attacker circuits that are logically and physically isolated in a separate attacker role, and cannot be flagged as malicious circuits by conventional bitstream checkers. We show that aggressive clock gating, an effective power-saving technique, can also be a potential security threat in modern FPGAs. Then, we carry out the attack on a DL accelerator running ImageNet classification in the victim role to evaluate the inherent resilience of DL models against timing faults induced by the adversary. We find that even when using the strongest attacker circuit, the prediction accuracy of the DL accelerator is not compromised when running at its safe operating frequency. Furthermore, we can achieve 1.18-1.31x higher inference performance by over-clocking the DL accelerator without affecting its prediction accuracy. Published in the 2020 proceedings of the International Conference on Field-Programmable Technology (ICFPT).
- Published
- 2020