205 results for "Reconfigurable hardware"
Search Results
2. Area-time efficient implementation of the elliptic curve method of factoring in reconfigurable hardware for application in the number field sieve
- Author
-
Gaj, K., Soonhak Kwon, Baier, P., Kohlbrenner, P., Hoang Le, Khaleeluddin, M., Bachimanchi, R., and Rogawski, M.
- Subjects
Programmable logic array ,Computers -- Design and construction ,Curves, Elliptic -- Usage ,Ellipse -- Usage ,Digital integrated circuits -- Design and construction - Published
- 2010
3. High-performance designs for linear algebra operations on reconfigurable hardware
- Author
-
Zhuo Ling and Prasanna, Viktor K.
- Subjects
Programmable logic array ,Algebras, Linear -- Usage ,Digital integrated circuits -- Analysis ,Matrices -- Usage - Published
- 2008
4. Reconfigurable hardware SAT solvers: A survey of systems
- Author
-
Skliarova, Iouliia and Ferrari, Antonio de Brito
- Subjects
Microprocessor ,Microprocessor upgrade ,Microprocessors -- Testing - Published
- 2004
5. High-radix Montgomery modular exponentiation on reconfigurable hardware
- Author
-
Blum, Thomas and Paar, Christof
- Subjects
Computers -- Safety and security measures ,Cryptography -- Research ,Modulation (Electronics) -- Research ,Computer programming -- Models - Published
- 2001
6. Scheduling Weakly Consistent C Concurrency for Reconfigurable Hardware.
- Author
-
Ramanathan, Nadesh, Wickerson, John, and Constantinides, George A.
- Subjects
- *
SCHEDULING software , *FIELD programmable gate arrays , *ALGORITHMS , *COMPUTER storage devices , *ARRAY processors - Abstract
Lock-free algorithms, in which threads synchronise not via coarse-grained mutual exclusion but via fine-grained atomic operations (‘atomics’), have been shown empirically to be the fastest class of multi-threaded algorithms in the realm of conventional processors. This article explores how these algorithms can be compiled from C to reconfigurable hardware via high-level synthesis (HLS). We focus on the scheduling problem, in which software instructions are assigned to hardware clock cycles. We first show that typical HLS scheduling constraints are insufficient to implement atomics, because they permit some instruction reorderings that, though sound in a single-threaded context, demonstrably cause erroneous results when synthesising multi-threaded programs. We then show that correct behaviour can be restored by imposing additional intra-thread constraints among the memory operations. In addition, we show that we can support the pipelining of loops containing atomics by injecting further inter-iteration constraints. We implement our approach on two constraint-based scheduling HLS tools: LegUp 4.0 and LegUp 5.1. We extend both tools to support two memory models that are capable of synthesising atomics correctly. The first memory model only supports sequentially consistent (SC) atomics and the second supports weakly consistent (‘weak’) atomics as defined by the 2011 revision of the C standard. Weak atomics necessitate fewer constraints than SC atomics, but suffice for many multi-threaded algorithms. We confirm, via automatic model-checking, that we correctly implement the semantics in accordance with the C standard. A case study on a circular buffer suggests that on average circuits synthesised from programs that schedule atomics correctly can be 6x faster than an existing lock-based implementation of atomics, that weak atomics can yield a further 1.3x speedup, and that pipelining can yield a further 1.3x speedup. [ABSTRACT FROM AUTHOR]
- Published
- 2018
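As a concrete illustration of the weakly consistent C atomics discussed in entry 6, below is a minimal single-producer/single-consumer circular buffer using C11 acquire/release atomics. It is a generic sketch, not the authors' case-study code; with these 'weak' orderings an HLS scheduler needs fewer intra-thread constraints than with sequentially consistent (seq_cst) atomics. Buffer size and names are illustrative.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Minimal SPSC circular buffer in the spirit of entry 6's case study.
 * Generic C11 illustration only: acquire/release ('weak') atomics suffice here,
 * so fewer scheduling constraints are needed than for seq_cst atomics. */
#define BUF_SIZE 16

static int buffer[BUF_SIZE];
static atomic_uint head;   /* written by the producer, read by the consumer */
static atomic_uint tail;   /* written by the consumer, read by the producer */

bool produce(int value) {
    unsigned h = atomic_load_explicit(&head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&tail, memory_order_acquire);
    if (h - t == BUF_SIZE) return false;              /* buffer full */
    buffer[h % BUF_SIZE] = value;                     /* plain store into the slot */
    /* release: the data store above must not be reordered after this publish */
    atomic_store_explicit(&head, h + 1, memory_order_release);
    return true;
}

bool consume(int *value) {
    unsigned t = atomic_load_explicit(&tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&head, memory_order_acquire);
    if (t == h) return false;                         /* buffer empty */
    *value = buffer[t % BUF_SIZE];                    /* read the published slot */
    atomic_store_explicit(&tail, t + 1, memory_order_release);
    return true;
}
```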
7. An Embedded Memory-Centric Reconfigurable Hardware Accelerator for Security Applications
- Author
-
Robert Karam, Christopher Babecki, Somnath Paul, Swarup Bhunia, and Wenchao Qian
- Subjects
business.industry ,Computer science ,020208 electrical & electronic engineering ,02 engineering and technology ,Security kernel ,Reconfigurable computing ,020202 computer hardware & architecture ,Theoretical Computer Science ,Software ,Computational Theory and Mathematics ,Hardware and Architecture ,Embedded system ,Datapath ,0202 electrical engineering, electronic engineering, information engineering ,Hardware acceleration ,business ,Field-programmable gate array ,Efficient energy use - Abstract
Security has emerged as a critical need in today’s computer applications. Unfortunately, most security algorithms are computationally expensive and often do not map efficiently to general purpose processors. Fixed-function accelerators offer significant improvement in energy-efficiency, but they do not allow more than one application to reuse hardware resources. Mapping applications to generic reconfigurable fabrics can achieve the desired flexibility, but at the cost of area and energy efficiency. This paper presents a novel reconfigurable framework, referred to as hardware accelerator for security kernel (HASK), for accelerating a wide array of security applications. This framework incorporates a coarse-grained datapath, support for lookup functions, and flexible interconnect optimizations, which enable on-demand pipelining and parallel computations in multiple ultralightweight processing elements. These features are highly effective for energy-efficient operation in a diverse set of security applications. Through simulations, we have compared the performance of HASK to software and field programmable gate array (FPGA) platforms. Simulation results for a set of six common security applications show comparable latency between HASK and FPGA with 2.5X improvement in energy-delay product and 4X improvement in iso-area throughput. HASK also shows 5X improvement in iso-area throughput and 45X improvement in energy-delay product compared to optimized software implementations.
- Published
- 2016
8. Efficient Mapping of Task Graphs onto Reconfigurable Hardware Using Architectural Variants
- Author
-
Mohamed Bakhouya, Vikram K. Narayana, Miaoqing Huang, Jaafar Gaber, and Tarek El-Ghazawi
- Subjects
Computational Theory and Mathematics ,Computer architecture ,Hardware and Architecture ,Computer science ,Genetic algorithm ,FpgaC ,Throughput (business) ,Execution time ,Software ,Reconfigurable computing ,Theoretical Computer Science ,Task (project management) - Abstract
High-performance reconfigurable computing involves acceleration of significant portions of an application using reconfigurable hardware. Mapping application task graphs onto reconfigurable hardware has therefore been attracting increasing attention. In this work, we approach the mapping problem by incorporating multiple architectural variants for each hardware task; the variants reflect tradeoffs between the logic resources consumed and the task execution throughput. We propose a mapping approach based on a genetic algorithm, and show its effectiveness for random task graphs as well as an N-body simulation application, demonstrating improvements of up to 78.6 percent in the execution time compared with choosing a fixed implementation variant for all tasks. We then validate our methodology through experiments on real hardware, an SRC-6 reconfigurable computer.
- Published
- 2012
9. Lattice-Based Signatures: Optimization and Implementation on Reconfigurable Hardware.
- Author
-
Güneysu, Tim, Lyubashevsky, Vadim, and Pöppelmann, Thomas
- Subjects
- *
LATTICE theory , *DIGITAL signatures , *MATHEMATICAL optimization , *ADAPTIVE computing systems , *QUANTUM computers - Abstract
Nearly all of the currently used signature schemes, such as RSA or DSA, are based either on the factoring assumption or the presumed intractability of the discrete logarithm problem. As a consequence, the appearance of quantum computers or algorithmic advances on these problems may lead to the unpleasant situation that a large number of today’s schemes will most likely need to be replaced with more secure alternatives. In this work we present such an alternative—an efficient signature scheme whose security is derived from the hardness of lattice problems. It is based on recent theoretical advances in lattice-based cryptography and is highly optimized for practicability and use in embedded systems. The public and secret keys are roughly 1.5 kB and 0.3 kB long, while the signature size is approximately 1.1 kB for a security level of around 80 bits. We provide implementation results on reconfigurable hardware (Spartan/Virtex-6) and demonstrate that the scheme is scalable, has low area consumption, and even outperforms classical schemes.
- Published
- 2015
10. Reconfigurable Hardware Implementations of Tweakable Enciphering Schemes
- Author
-
Cuauhtemoc Mancillas-López, Debrup Chakraborty, and Francisco Rodriguez Henriquez
- Subjects
Block cipher mode of operation ,Computer science ,business.industry ,Hash function ,Cryptography ,Parallel computing ,Encryption ,Pseudorandom permutation ,Reconfigurable computing ,Theoretical Computer Science ,Computational Theory and Mathematics ,Disk encryption ,Hardware and Architecture ,Embedded system ,business ,Software ,Block cipher - Abstract
Tweakable enciphering schemes are length-preserving block cipher modes of operation that provide a strong pseudorandom permutation. It has been suggested that these schemes can be used as the main building blocks for achieving in-place disk encryption. In the past few years, there has been an intense research activity toward constructing secure and efficient tweakable enciphering schemes. But actual experimental performance data of these newly proposed schemes are yet to be reported. In this paper, we present optimized FPGA implementations of six tweakable enciphering schemes, namely, HCH, HCTR, XCB, EME, HEH, and TET, using a 128-bit AES core as the underlying block cipher. We report the performance timings of these modes when using both pipelined and sequential AES structures. The universal polynomial hash function included in the specification of HCH, HCHfp (a variant of HCH), HCTR, XCB, TET, and HEH was implemented using a Karatsuba multiplier as the main building block. We provide detailed algorithm analysis of each of the schemes, trying to exploit their inherent parallelism as much as possible. Our experiments show that a sequential AES core is not an attractive option for the design of these modes as it leads to rather poor throughput. In contrast, according to our place-and-route results on a Xilinx Virtex 4 FPGA, our designs achieve a throughput of 3.95 Gbps for HEH when using an encryption/decryption pipelined AES core, and a throughput of 5.71 Gbps for EME when using an encryption-only pipelined AES core. The performance results reported in this paper provide experimental evidence that hardware implementations of tweakable enciphering schemes can actually match and even outperform the data rates achieved by state-of-the-art disk controllers, thus showing that they might be used for achieving provably secure in-place hard disk encryption.
- Published
- 2010
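The abstract of entry 10 mentions a Karatsuba multiplier as the main building block of the universal polynomial hash functions. The sketch below shows one Karatsuba level for carry-less (GF(2)[x]) multiplication; the operand widths and function names are illustrative assumptions, not taken from the paper, which targets GF(2^128).

```c
#include <stdint.h>
#include <stdio.h>

/* Schoolbook carry-less multiply of two 32-bit polynomials -> 64-bit product.
 * Toy illustration only; the paper's multiplier works on 128-bit operands. */
static uint64_t clmul32(uint32_t a, uint32_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 32; i++)
        if ((b >> i) & 1) r ^= (uint64_t)a << i;
    return r;
}

/* 64x64 -> 128-bit carry-less multiply using one Karatsuba step:
 * a = a1*x^32 + a0, b = b1*x^32 + b0
 * a*b = a1b1*x^64 ^ (a1b1 ^ a0b0 ^ (a1^a0)(b1^b0))*x^32 ^ a0b0 */
static void clmul64_karatsuba(uint64_t a, uint64_t b, uint64_t out[2]) {
    uint32_t a0 = (uint32_t)a, a1 = (uint32_t)(a >> 32);
    uint32_t b0 = (uint32_t)b, b1 = (uint32_t)(b >> 32);
    uint64_t lo  = clmul32(a0, b0);
    uint64_t hi  = clmul32(a1, b1);
    uint64_t mid = clmul32(a0 ^ a1, b0 ^ b1) ^ lo ^ hi;
    out[0] = lo ^ (mid << 32);            /* low 64 bits  */
    out[1] = hi ^ (mid >> 32);            /* high 64 bits */
    /* three 32-bit multiplies instead of four: the saving that makes
     * Karatsuba attractive for wide GF(2^128) hardware multipliers */
}

int main(void) {
    uint64_t out[2];
    clmul64_karatsuba(0x87, 0x13, out);   /* (x^7+x^2+x+1) times (x^4+x+1) */
    printf("%016llx %016llx\n", (unsigned long long)out[1], (unsigned long long)out[0]);
    return 0;
}
```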
11. Area-Time Efficient Implementation of the Elliptic Curve Method of Factoring in Reconfigurable Hardware for Application in the Number Field Sieve
- Author
-
Mohammed Khaleeluddin, Marcin Rogawski, Kris Gaj, Patrick Baier, Hoang Le, Soonhak Kwon, Paul Kohlbrenner, and Ramakrishna Bachimanchi
- Subjects
Hardware architecture ,business.industry ,Computer science ,Parallel computing ,Porting ,Reconfigurable computing ,Theoretical Computer Science ,General number field sieve ,Public-key cryptography ,Elliptic curve ,Memory management ,Software ,Computational Theory and Mathematics ,Hardware and Architecture ,business ,Field-programmable gate array - Abstract
A novel portable hardware architecture of the Elliptic Curve Method of factoring, designed and optimized for application in the relation collection step of the Number Field Sieve, is described and analyzed. A comparison with an earlier proof-of-concept design by Pelzl et al. has been performed, and a substantial improvement has been demonstrated in terms of both the execution time and the area-time product. The ECM architecture has been ported across five different families of FPGA devices in order to select the family with the best performance to cost ratio. A timing comparison with the highly optimized software implementation, GMP-ECM, has been performed. Our results indicate that low-cost families of FPGAs, such as Spartan-3 and Spartan-3E, offer at least an order of magnitude improvement over the same generation of microprocessors in terms of the performance to cost ratio, without the use of embedded FPGA resources, such as embedded multipliers.
- Published
- 2010
12. An Embedded Memory-Centric Reconfigurable Hardware Accelerator for Security Applications
- Author
-
Babecki, Christopher, Qian, Wenchao, Paul, Somnath, Karam, Robert, and Bhunia, Swarup
- Published
- 2016
13. High-Performance Designs for Linear Algebra Operations on Reconfigurable Hardware
- Author
-
Viktor K. Prasanna and Ling Zhuo
- Subjects
Numerical linear algebra ,Floating point ,Computer science ,Parallel algorithm ,Dot product ,Memory bandwidth ,Parallel computing ,computer.software_genre ,Reconfigurable computing ,Matrix multiplication ,Theoretical Computer Science ,Matrix decomposition ,Computer Science::Hardware Architecture ,Computational Theory and Mathematics ,Hardware and Architecture ,Linear algebra ,Hardware acceleration ,Multiplication ,Field-programmable gate array ,computer ,Software - Abstract
Numerical linear algebra operations are key primitives in scientific computing. Performance optimizations of such operations have been extensively investigated. With the rapid advances in technology, hardware acceleration of linear algebra applications using FPGAs (field programmable gate arrays) has become feasible. In this paper, we propose FPGA-based designs for several basic linear algebra operations, including dot product, matrix-vector multiplication, matrix multiplication and matrix factorization. By identifying the parameters for each operation, we analyze the trade-offs and propose a high-performance design. In the implementations of the designs, the values of the parameters are determined according to the hardware constraints, such as the available chip area, the size of available memory, the memory bandwidth, and the number of I/O pins. The proposed designs are implemented on Xilinx Virtex-II Pro FPGAs. Experimental results show that our designs scale with the available hardware resources. Also, the performance of our designs compares favorably with that of general-purpose processor based designs. We also show that with faster floating-point units and larger devices, the performance of our designs increases accordingly.
- Published
- 2008
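Entry 13 analyzes FPGA designs for dot product and matrix operations. A recurring idea in such designs is to break the single accumulation chain into several independent partial sums so that deeply pipelined floating-point adders stay busy; the C sketch below illustrates that restructuring in software form. The parallelism degree P and the code are illustrative assumptions, not the paper's design.

```c
#include <stdio.h>

#define P 4   /* illustrative degree of parallelism */

/* Dot product split into P independent partial accumulators; an FPGA design
 * would feed each accumulator to its own pipelined floating-point adder. */
double dot(const double *x, const double *y, int n) {
    double partial[P] = {0};
    for (int i = 0; i < n; i++)
        partial[i % P] += x[i] * y[i];   /* each accumulator handles every P-th term */
    double sum = 0;
    for (int j = 0; j < P; j++)          /* final reduction of the P partial sums */
        sum += partial[j];
    return sum;
}

int main(void) {
    double x[6] = {1, 2, 3, 4, 5, 6}, y[6] = {1, 1, 1, 1, 1, 1};
    printf("%f\n", dot(x, y, 6));        /* prints 21.000000 */
    return 0;
}
```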
14. High-radix Montgomery modular exponentiation on reconfigurable hardware
- Author
-
Christof Paar and T. Blum
- Subjects
Modular exponentiation ,Exponentiation ,Modular arithmetic ,business.industry ,Computer science ,Modulus ,Systolic array ,Cryptography ,Operand ,Reconfigurable computing ,Theoretical Computer Science ,Public-key cryptography ,Computational Theory and Mathematics ,Montgomery reduction ,Computer architecture ,Integer ,Hardware and Architecture ,Discrete logarithm ,Radix ,Hardware_ARITHMETICANDLOGICSTRUCTURES ,business ,Software - Abstract
It is widely recognized that security issues will play a crucial role in the majority of future computer and communication systems. Central tools for achieving system security are cryptographic algorithms. This contribution proposes arithmetic architectures which are optimized for modern field programmable gate arrays (FPGAs). The proposed architectures perform modular exponentiation with very long integers. This operation is at the heart of many practical public-key algorithms such as RSA and discrete logarithm schemes. We combine a high-radix Montgomery modular multiplication algorithm with a new systolic array design. The designs are flexible, allowing any choice of operand and modulus. The new architecture also allows the use of high radices. Unlike previous approaches, we systematically implement and compare several variants of our new architecture for different bit lengths. We provide absolute area and timing measures for each architecture. The results allow conclusions about the feasibility and time-space trade-offs of our architecture for implementation on commercially available FPGAs. We found that 1,024-bit RSA decryption can be done in 3.1 ms with our fastest architecture.
- Published
- 2001
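Entry 14 builds on Montgomery modular multiplication for long-integer modular exponentiation. The following toy C sketch shows the underlying radix-2 (bit-serial) Montgomery algorithm on small integers; the paper's hardware uses high radices, very long operands, and a systolic array, so this is only a functional illustration with assumed toy parameters.

```c
#include <stdint.h>
#include <stdio.h>

/* Radix-2 Montgomery multiplication: returns a*b*2^(-k) mod m for odd m < 2^k.
 * Toy parameters for illustration; real RSA-size designs use high radices
 * and operands of 1,024 bits or more, as in the paper. */
static uint64_t monmul(uint64_t a, uint64_t b, uint64_t m, int k) {
    uint64_t r = 0;
    for (int i = 0; i < k; i++) {
        if (a & (1ULL << i)) r += b;      /* add partial product */
        if (r & 1) r += m;                /* make r even so it divides by 2 */
        r >>= 1;                          /* Montgomery reduction step */
    }
    return (r >= m) ? r - m : r;
}

/* Left-to-right square-and-multiply exponentiation in the Montgomery domain. */
static uint64_t monexp(uint64_t base, uint64_t exp, uint64_t m, int k) {
    uint64_t r2 = 1;                                  /* 2^(2k) mod m by doubling */
    for (int i = 0; i < 2 * k; i++) { r2 <<= 1; if (r2 >= m) r2 -= m; }
    uint64_t xbar = monmul(base, r2, m, k);           /* base in Montgomery form */
    uint64_t abar = monmul(1, r2, m, k);              /* 1 in Montgomery form */
    for (int i = k - 1; i >= 0; i--) {
        abar = monmul(abar, abar, m, k);
        if (exp & (1ULL << i)) abar = monmul(abar, xbar, m, k);
    }
    return monmul(abar, 1, m, k);                     /* leave the Montgomery domain */
}

int main(void) {
    /* 7^10 mod 13 = 4 */
    printf("%llu\n", (unsigned long long)monexp(7, 10, 13, 31));
    return 0;
}
```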
15. High-Performance Designs for Linear Algebra Operations on Reconfigurable Hardware.
- Author
-
Ling Zhuo and Prasanna, Viktor K.
- Subjects
- *
BROADBAND communication systems , *FIELD programmable gate arrays , *MATHEMATICAL analysis , *MATRICES (Mathematics) , *DIGITAL communications , *DATA transmission systems , *COMBINATORICS , *COMPUTER programming - Abstract
Numerical linear algebra operations are key primitives in scientific computing. Performance optimizations of such operations have been extensively investigated. With the rapid advances in technology, hardware acceleration of linear algebra applications using field-programmable gate arrays (FPGAs) has become feasible. In this paper, we propose FPGA-based designs for several basic linear algebra operations, including dot product, matrix-vector multiplication, matrix multiplication, and matrix factorization. By identifying the parameters for each operation, we analyze the trade-offs and propose a high-performance design. In the implementations of the designs, the values of the parameters are determined according to the hardware constraints, such as the available chip area, the size of available memory, the memory bandwidth, and the number of I/O pins. The proposed designs are implemented on Xilinx Virtex-II Pro FPGAs. Experimental results show that our designs scale with the available hardware resources. Also, the performance of our designs compares favorably with that of general-purpose processor-based designs. We also show that, with faster floating-point units and larger devices, the performance of our designs increases accordingly. [ABSTRACT FROM AUTHOR]
- Published
- 2008
16. Reconfigurable Hardware SAT Solvers: A Survey of Systems.
- Author
-
Skliarova, Iouliia and Ferrari, António de Brito
- Subjects
- *
PROGRAMMABLE logic devices , *NETWORK processors , *ELECTRONIC equipment , *ALGORITHMS , *COMBINATORIAL optimization , *COMPUTER programming - Abstract
By adapting to computations that are not so well-supported by general-purpose processors, reconfigurable systems achieve significant increases in performance. Such computational systems use high-capacity programmable logic devices and are based on processing units customized to the requirements of a particular application. A great deal of the research effort in this area is aimed at accelerating the solution of combinatorial optimization problems. Special attention in this context was given to the Boolean satisfiability (SAT) problem, resulting in a considerable number of different architectures being proposed. This paper presents the state of the art in reconfigurable hardware SAT solvers. The analysis and classification of existing systems have been performed according to such criteria as algorithmic issues, reconfiguration modes, the execution model, the programming model, logic capacity, and performance. [ABSTRACT FROM AUTHOR]
- Published
- 2004
17. Efficient Mapping of Task Graphs onto Reconfigurable Hardware Using Architectural Variants
- Author
-
Huang, Miaoqing, Narayana, Vikram K., Bakhouya, Mohamed, Gaber, Jaafar, and El-Ghazawi, Tarek
- Published
- 2012
18. A Dynamically Reconfigurable System for Closed-Loop Measurements of Network Traffic
- Author
-
Khan, Faisal, Ghiasi, Soheil, and Chuah, Chen-Nee
- Subjects
Distributed Computing and Systems Software ,Information and Computing Sciences ,Engineering ,Reconfigurable hardware ,network monitoring ,parallel circuits ,Computer Software ,Distributed Computing ,Computer Hardware ,Computer Hardware & Architecture ,Electronics ,sensors and digital hardware ,Distributed computing and systems software - Abstract
Streaming network traffic measurement and analysis is critical for detecting and preventing any real-time anomalies in the network. The high speeds and complexity of today's networks, coupled with ever-evolving threats, necessitate closing of the loop between measurements and their analysis in real time. The ensuing system demands high levels of programmability and processing where streaming measurements adapt to the changing network behavior in a goal-oriented manner. In this work, we exploit the features and requirements of the problem and develop an application-specific FPGA-based closed-loop measurement (CLM) system. We make novel use of fine-grained partial dynamic reconfiguration (PDR) as the underlying reprogramming paradigm, performing low-latency just-in-time compiled logic changes in the FPGA fabric corresponding to the dynamic measurement requirements. Our innovative dynamically reconfigurable socket offers 3× logic savings over conventional static solutions, while offering much reduced reconfiguration latencies over conventional PDR mechanisms. We integrate multiple sockets in a highly parallel CLM framework and demonstrate its effectiveness in identifying heavy flows in streaming network traffic. The results using an FPGA prototype offer 100 percent detection accuracy while sustaining increasing link speeds.
- Published
- 2014
19. Lattice-Based Signatures: Optimization and Implementation on Reconfigurable Hardware
- Author
-
Vadim Lyubashevsky, Tim Güneysu, and Thomas Pöppelmann
- Subjects
Theoretical computer science ,business.industry ,Lattice problem ,Cryptography ,02 engineering and technology ,Parallel computing ,Reconfigurable computing ,020202 computer hardware & architecture ,Theoretical Computer Science ,Public-key cryptography ,Computational Theory and Mathematics ,Hardware and Architecture ,Discrete logarithm ,Scalability ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Lattice-based cryptography ,business ,Software ,Quantum computer ,Mathematics - Abstract
Nearly all of the currently used signature schemes, such as RSA or DSA, are based either on the factoring assumption or the presumed intractability of the discrete logarithm problem. As a consequence, the appearance of quantum computers or algorithmic advances on these problems may lead to the unpleasant situation that a large number of today’s schemes will most likely need to be replaced with more secure alternatives. In this work we present such an alternative—an efficient signature scheme whose security is derived from the hardness of lattice problems. It is based on recent theoretical advances in lattice-based cryptography and is highly optimized for practicability and use in embedded systems. The public and secret keys are roughly 1.5 kB and 0.3 kB long, while the signature size is approximately 1.1 kB for a security level of around 80 bits. We provide implementation results on reconfigurable hardware (Spartan/Virtex-6) and demonstrate that the scheme is scalable, has low area consumption, and even outperforms classical schemes.
20. Exploiting Hardware-Based Data-Parallel and Multithreading Models for Smart Edge Computing in Reconfigurable FPGAs.
- Author
-
Rodriguez, Alfonso, Otero, Andres, Platzner, Marco, and de la Torre, Eduardo
- Subjects
EDGE computing ,ADAPTIVE computing systems ,FIELD programmable gate arrays ,COMPUTER systems ,COMPUTING platforms ,ONLINE exhibitions - Abstract
Current edge computing systems are deployed in highly complex application scenarios with dynamically changing requirements. In order to provide the expected performance and energy efficiency values in these situations, the use of heterogeneous hardware/software platforms at the edge has become widespread. However, these computing platforms still suffer from the lack of unified software-driven programming models to efficiently deploy multi-purpose hardware-accelerated solutions. In parallel, edge computing systems also face another huge challenge: operating under multiple conditions that were not taken into account during any of the design stages. Moreover, these conditions may change over time, forcing self-adaptation mechanisms to become a must. This paper presents an integrated architecture to exploit hardware-accelerated data-parallel models and transparent hardware/software multithreading. In particular, the proposed architecture leverages the ARTICo3 framework and ReconOS to allow developers to select the most suitable programming model to deploy their edge computing applications onto run-time reconfigurable hardware devices. An evolvable hardware system is used as an additional architectural component during validation, providing support for continuous lifelong learning in smart edge computing scenarios. In particular, the proposed setup exhibits online learning capabilities that include learning by imitation from software-based reference algorithms. Experimental results show the benefits of the proposed approach, exposing different run-time tradeoffs (e.g., computing performance versus functional correctness of the evolved solutions), and highlighting the benefits of using scalable data-parallel models to perform circuit evolution under dynamically changing application scenarios. [ABSTRACT FROM AUTHOR]
- Published
- 2022
21. An Extensive Study of Flexible Design Methods for the Number Theoretic Transform.
- Author
-
Mert, Ahmet Can, Karabulut, Emre, Ozturk, Erdinc, Savas, Erkay, and Aysu, Aydin
- Subjects
POLYNOMIAL rings ,EXPERIMENTAL design ,DIGITAL signatures ,DESIGN software ,CRYPTOGRAPHY ,COMPUTATIONAL complexity ,SOFTWARE architecture ,HOMOMORPHISMS ,ADAPTIVE computing systems - Abstract
Efficient lattice-based cryptosystems operate with polynomial rings with the Number Theoretic Transform (NTT) to reduce the computational complexity of polynomial multiplication. NTT has therefore become a major arithmetic component (thus computational bottleneck) in various cryptographic constructions like hash functions, key-encapsulation mechanisms, digital signatures, and homomorphic encryption. Although there exist several hardware designs in prior work for NTT, they all are isolated design instances fixed for specific NTT parameters or parallelization level. This article provides an extensive study of flexible design methods for NTT implementation. To that end, we evaluate three cases: (1) parametric hardware design, (2) high-level synthesis (HLS) design approach, and (3) design for software implementation compiled on soft-core processors, where all are targeted on reconfigurable hardware devices. We evaluate the designs that implement multiple NTT parameters and/or processing elements, demonstrate the design details for each case, and provide a fair comparison with each other and prior work. On a Xilinx Virtex-7 FPGA, compared to HLS and processor-based methods, the results show that the parametric hardware design is on average 4.4× and 73.9× smaller and 22.5× and 19.3× faster, respectively. Surprisingly, HLS tools can yield less efficient solutions than processor-based approaches in some cases. [ABSTRACT FROM AUTHOR]
- Published
- 2022
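Entry 21 studies hardware design methods for the Number Theoretic Transform. To make the arithmetic being accelerated concrete, the following C sketch is an iterative in-place NTT over a toy prime; the parameters (q = 257, n = 8, root 4) are illustrative assumptions, and the butterfly in the inner loop is what hardware designs replicate as parallel processing elements.

```c
#include <stdint.h>
#include <stdio.h>

#define N 8
#define Q 257u   /* small NTT-friendly prime: N divides (Q - 1) */
#define W 4u     /* primitive N-th root of unity mod Q (4^4 ≡ -1, 4^8 ≡ 1) */

static uint32_t mulmod(uint32_t a, uint32_t b) { return (uint32_t)((uint64_t)a * b % Q); }

static uint32_t powmod(uint32_t b, uint32_t e) {
    uint32_t r = 1;
    while (e) { if (e & 1) r = mulmod(r, b); b = mulmod(b, b); e >>= 1; }
    return r;
}

/* In-place iterative Cooley-Tukey NTT.  Each inner-loop body is one butterfly;
 * hardware designs instantiate several of these as processing elements. */
static void ntt(uint32_t a[N]) {
    for (uint32_t i = 1, j = 0; i < N; i++) {         /* bit-reversal permutation */
        uint32_t bit = N >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j |= bit;
        if (i < j) { uint32_t t = a[i]; a[i] = a[j]; a[j] = t; }
    }
    for (uint32_t len = 2; len <= N; len <<= 1) {
        uint32_t wlen = powmod(W, N / len);           /* twiddle root for this stage */
        for (uint32_t start = 0; start < N; start += len) {
            uint32_t w = 1;
            for (uint32_t k = 0; k < len / 2; k++) {
                uint32_t u = a[start + k];
                uint32_t v = mulmod(a[start + k + len / 2], w);
                a[start + k]           = (u + v) % Q;
                a[start + k + len / 2] = (u + Q - v) % Q;
                w = mulmod(w, wlen);
            }
        }
    }
}

int main(void) {
    uint32_t a[N] = {1, 2, 3, 4, 0, 0, 0, 0};         /* toy polynomial coefficients */
    ntt(a);
    for (int i = 0; i < N; i++) printf("%u ", a[i]);
    printf("\n");
    return 0;
}
```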
22. Bypassing Multicore Memory Bugs With Coarse-Grained Reconfigurable Logic.
- Author
-
Lee, Doowon and Bertacco, Valeria
- Subjects
- *
CACHE memory , *FINITE state machines , *ARM microprocessors , *MEMORY , *SYSTEMS design , *LOGIC - Abstract
Multicore systems deploy sophisticated memory hierarchies to improve memory operations’ throughput and latency by exploiting multiple levels of cache hierarchy and several complex memory-access instructions. As a result, the functional verification of the memory subsystem is one of the most challenging tasks in the overall system design effort, leading to many bugs in the released product. In this work, we propose MemPatch, a novel reconfigurable hardware solution to bypass such escaped bugs. To design MemPatch, we first analyzed publicly available errata documents and classified memory-related bugs by root cause and symptoms. We then leveraged that learning to design a specialized, reconfigurable detection fabric, implementing finite state machines that can model the bug-triggering events at the microarchitectural level. Finally, we complemented this detection logic with hardware offering multiple bug-bypassing options. Our evaluation of MemPatch mapped a multicore RISC-V out-of-order processor, augmented with our logic, to a Xilinx ZCU102 FPGA board. When configured to detect up to 32 distinct bugs, MemPatch entails 7.6% area and 7.3% power overheads. An estimate on a commercial ARM Cortex-A57 processor target indicates that the area overhead would be much lower, 1.0%. The performance impact was found to be no more than 1% in all cases. [ABSTRACT FROM AUTHOR]
- Published
- 2022
23. Operating Systems for Reconfigurable Embedded Platforms: Online Scheduling of Real-Time Tasks.
- Author
-
Steiger, Christoph, Walder, Herbert, and Platzner, Marco
- Subjects
COMPUTER operating systems ,REAL-time computing ,FIELD programmable gate arrays ,GATE array circuits ,PROGRAMMABLE logic devices ,ALGORITHMS - Abstract
Today's reconfigurable hardware devices have huge densities and are partially reconfigurable, allowing for the configuration and execution of hardware tasks in a true multitasking manner. This makes reconfigurable platforms an ideal target for many modern embedded systems that combine high computation demands with dynamic task sets. A rather new line of research is engaged in the construction of operating systems for reconfigurable embedded platforms. Such an operating system provides a minimal programming model and a runtime system. The runtime system performs online task and resource management. In this paper, we first discuss design issues for reconfigurable hardware operating systems. Then, we focus on a runtime system for guarantee-based scheduling of hard real-time tasks. We formulate the scheduling problem for the 1D and 2D resource models and present two heuristics, the horizon and the stuffing technique, to tackle it. Simulation experiments conducted with synthetic workloads evaluate the performance and the runtime efficiency of the proposed schedulers. The scheduling performance for the 1D resource model is strongly dependent on the aspect ratios of the tasks. Compared to the 1D model, the 2D resource model is clearly superior. Finally, the runtime overhead of the scheduling algorithms is shown to be acceptably low. [ABSTRACT FROM AUTHOR]
- Published
- 2004
24. LayeredTrees: Most Specific Prefix-Based Pipelined Design for On-Chip IP Address Lookups.
- Author
-
Chang, Yeim-Kuan, Kuo, Fang-Chen, Kuo, Han-Jhen, and Su, Cheng-Chien
- Subjects
INTERNET protocol address ,INTERNET protocols ,COMPUTER network resources ,COMPUTER storage devices ,WEB search engines ,ROUTING (Computer network management) - Abstract
Multibit trie-based pipelines for IP lookups have been demonstrated to be able to achieve the throughput of over 100 Gbps. However, it is hard to store the entire multibit trie into the on-chip memory of reconfigurable hardware devices. Thus, their performance is limited by the speed of off-chip memory. In this paper, we propose a new pipeline design called LayeredTrees that overcomes the shortcomings of the multibit trie-based pipelines. LayeredTrees pipelines the multi-layered multiway balanced prefix trees based on the concept of most specific prefixes. LayeredTrees is optimized to fit the entire routing table into the on-chip memory of reconfigurable hardware devices. No prefix duplication is needed and each W-bit prefix is encoded in a (W + 1)-bit format to save memory. Assume the minimal packet size is 40 bytes. Our experimental results on Virtex-6 XC6VSX315T FPGA chip show that the throughputs of 33.6 and 120.8 Gbps can be achieved by the proposed single search engine and multiple search engines running in parallel, respectively. Furthermore, the impact of update operations on the search performance is minimal. With the same FPGA device, an IPv6 routing table of 290,503 distinct entries can also be supported. [ABSTRACT FROM AUTHOR]
- Published
- 2014
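Entry 24 targets longest-prefix-match IP lookup. To make the underlying problem concrete, here is a plain unibit-trie lookup in C; it is a generic illustration of most-specific-prefix matching, not the LayeredTrees data structure, and the routing entries are made up.

```c
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

/* Generic unibit trie for longest-prefix match; illustrative only. */
typedef struct node {
    struct node *child[2];
    int next_hop;                 /* -1 if no prefix ends at this node */
} node;

static node *new_node(void) {
    node *n = calloc(1, sizeof(node));
    n->next_hop = -1;
    return n;
}

static void insert(node *root, uint32_t prefix, int len, int next_hop) {
    for (int i = 0; i < len; i++) {
        int bit = (prefix >> (31 - i)) & 1;          /* walk from the MSB */
        if (!root->child[bit]) root->child[bit] = new_node();
        root = root->child[bit];
    }
    root->next_hop = next_hop;
}

static int lookup(const node *root, uint32_t addr) {
    int best = -1;                                    /* most specific match so far */
    for (int i = 0; root; i++) {
        if (root->next_hop >= 0) best = root->next_hop;
        if (i == 32) break;                           /* walked all 32 address bits */
        root = root->child[(addr >> (31 - i)) & 1];
    }
    return best;
}

int main(void) {
    node *root = new_node();
    insert(root, 0x0A000000, 8, 1);            /* 10.0.0.0/8  -> port 1 */
    insert(root, 0x0A010000, 16, 2);           /* 10.1.0.0/16 -> port 2 */
    printf("%d\n", lookup(root, 0x0A010203));  /* 10.1.2.3 matches /16: prints 2 */
    return 0;
}
```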
25. The MOLEN Polymorphic Processor.
- Author
-
Vassiliadis, Stamatis, Wong, Stephan, Gaydadjiev, Georgi, Bertels, Koen, Kuzmanov, Georgi, and Panainte, Elena Moscu
- Subjects
HIGH performance processors ,COMPUTER programmers ,DECODERS (Electronics) ,COMPILERS (Computer programs) ,SYSTEMS software ,COMPUTER software - Abstract
In this paper, we present a polymorphic processor paradigm incorporating both general purpose and custom computing processing. The proposal incorporates an arbitrary number of programmable units, exposes the hardware to the programmers/designers, and allows them to modify and extend the processor functionality at will. To achieve the previously stated attributes, we present a new programming paradigm, a new instruction set architecture, a microcode-based microarchitecture, and a compiler methodology. The programming paradigm, in contrast with the conventional programming paradigms, allows general-purpose conventional code and hardware descriptions to coexist in a program. In our proposal, for a given instruction set architecture, a one-time instruction set extension of eight instructions is sufficient to implement the reconfigurable functionality of the processor. We propose a microarchitecture based on reconfigurable hardware emulation to allow high-speed reconfiguration and execution. To prove the viability of the proposal, we experimented with the MPEG-2 encoder and decoder and a Xilinx Virtex II Pro FPGA. We have implemented three operations, SAD, DCT, and IDCT. The overall attainable application speedup for the MPEG-2 encoder and decoder is between 2.64 and 3.18 and between 1.56 and 1.94, respectively, representing between 93 percent and 98 percent of the theoretically obtainable speedups. [ABSTRACT FROM AUTHOR]
- Published
- 2004
26. A Novel Fault Tolerant and Runtime Reconfigurable Platform for Satellite Payload Processing.
- Author
-
Sterpone, Luca, Porrmann, Mario, and Hagemeyer, Jens
- Subjects
FAULT-tolerant computing ,ADAPTIVE computing systems ,ROCKET payloads ,COMPUTER input-output equipment ,INFORMATION processing ,COMPUTER storage devices - Abstract
Reconfigurable hardware is attracting steadily growing interest in the domain of space applications. The ability to reconfigure the information processing infrastructure at runtime together with the high computational power of today's FPGA architectures at relatively low power makes these devices interesting candidates for data processing in space applications. Partial dynamic reconfiguration of FPGAs enables maximum flexibility and can be utilized for performance optimization, for improving energy efficiency, and for enhanced fault tolerance. To be able to prove the effectiveness of these novel approaches for satellite payload processing, a highly scalable prototyping environment has been developed, combining dynamically reconfigurable FPGAs with the required interfaces such as SpaceWire, MIL-STD-1553B, and SpaceFibre. The developed system has been enabled for harsh space environments through an analytical study of the radiation effects on its most critical reconfigurable components. To that end, a new algorithm for the analysis of critical radiation effects, in particular Single Event Upsets (SEUs) and Multiple Event Upsets (MEUs), has been developed to obtain an effective estimation of the radiation impact and to enable tuning of the component mapping, reducing the routing interaction between the placed reconfigurable modules in their different feasible positions. The experimental performance of the system has been evaluated by a proper dynamic reconfiguration scenario, demonstrating partial reconfiguration at 400 MByte/s; blind and readback scrubbing are supported, and the scrub rate can be adapted individually for different parts of the design. The fault tolerance capability has been proven by means of a new analysis algorithm and by fault injection campaigns of SEUs and MCUs into the FPGA configuration memory. [ABSTRACT FROM PUBLISHER]
- Published
- 2013
27. Memristor-Based Neural Logic Blocks for Nonlinearly Separable Functions.
- Author
-
Soltiz, Michael, Kudithipudi, Dhireesha, Merkel, Cory, Rose, Garrett S., and Pino, Robinson E.
- Subjects
MEMRISTORS ,ARTIFICIAL neural networks ,COMPUTER logic ,NONLINEAR theories ,COMPUTER input-output equipment ,OPTICAL character recognition - Abstract
Neural logic blocks (NLBs) enable the realization of biologically inspired reconfigurable hardware. Networks of NLBs can be trained to perform complex computations such as multilevel Boolean logic and optical character recognition (OCR) in an area- and energy-efficient manner. Recently, several groups have proposed perceptron-based NLB designs with thin-film memristor synapses. These designs are implemented using a static threshold activation function, limiting the set of learnable functions to be linearly separable. In this work, we propose two NLB designs, robust adaptive NLB (RANLB) and multithreshold NLB (MTNLB), which overcome this limitation by allowing the effective activation function to be adapted during the training process. Consequently, both designs enable any logic function to be implemented in a single-layer NLB network. The proposed NLBs are designed, simulated, and trained to implement ISCAS-85 benchmark circuits, as well as OCR. The MTNLB achieves 90 percent improvement in the energy delay product (EDP) over lookup table (LUT)-based implementations of the ISCAS-85 benchmarks and up to a 99 percent improvement over a previous NLB implementation. As a compromise, the RANLB provides a smaller EDP improvement, but has an average training time of only approximately 4 cycles for 4-input logic functions, compared to the MTNLB's approximately 8-cycle average training time. [ABSTRACT FROM PUBLISHER]
- Published
- 2013
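Entry 27 contrasts a static threshold activation (limited to linearly separable functions) with a multithreshold activation. The toy C sketch below shows the idea on XOR: a single threshold over the weighted sum cannot produce XOR, while a two-threshold window can. Weights and thresholds are illustrative assumptions, not taken from the memristor designs.

```c
#include <stdio.h>

/* Classic perceptron: fires when the weighted sum reaches a single threshold. */
static int single_threshold(int x0, int x1) {
    int sum = 1 * x0 + 1 * x1;
    return sum >= 1;                  /* this is OR; no single threshold gives XOR */
}

/* Multithreshold activation: fires when the sum falls inside a window. */
static int multi_threshold(int x0, int x1) {
    int sum = 1 * x0 + 1 * x1;
    return (sum >= 1) && (sum < 2);   /* fires only for sum == 1, i.e., XOR */
}

int main(void) {
    for (int x0 = 0; x0 < 2; x0++)
        for (int x1 = 0; x1 < 2; x1++)
            printf("%d %d -> single=%d multi=%d\n",
                   x0, x1, single_threshold(x0, x1), multi_threshold(x0, x1));
    return 0;
}
```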
28. Real-Time Management of Hardware and Software Tasks for FPGA-Based Embedded Systems.
- Author
-
Pellizzoni, Rodolfo and Caccamo, Marco
- Subjects
COMPUTER operating systems ,EMBEDDED computer systems ,COMPUTER hardware description languages ,RESOURCE allocation ,ONLINE algorithms ,INFORMATION networks ,SIMULATION methods & models ,ENGINEERING design ,HIGH technology research - Abstract
Operating systems for reconfigurable devices enable the development of embedded systems where software tasks, running on a CPU, can coexist with hardware tasks running on a reconfigurable hardware device (FPGA). In this work, we consider real-time systems that are subject to dynamic workloads and whose tasks can be computationally intensive. We introduce a novel resource allocation scheme and an online admission control test that achieve high performance and flexibility; in addition, runtime reconfiguration is used to maximize the number of admitted real-time tasks. In detail, we first discuss a 1D system architecture and its prototype for a Xilinx Virtex-4 FPGA; then, we concentrate on the online admission control problem. Online task allocation and migration between the CPU and the reconfigurable device are discussed and sufficient feasibility tests are derived for both the commonly used slotted and 1D area models. Finally, the effectiveness of our admission control and relocation strategy is shown through a series of synthetic simulations. [ABSTRACT FROM AUTHOR]
- Published
- 2007
29. Automatic Design of Area-Efficient Configurable ASIC Cores.
- Author
-
Compton, Katherine and Hauck, Scott
- Subjects
COMPUTER input-output equipment ,COMPUTER software ,STANDARD cells ,LOGIC design ,INTEGRATED circuits ,PROGRAM transformation ,ROUTING (Computer network management) ,HEURISTIC programming - Abstract
Reconfigurable hardware has been shown to provide an efficient compromise between the flexibility of software and the performance of hardware. However, even coarse-grained reconfigurable architectures target the general case and miss optimization opportunities present if characteristics of the desired application set are known. Restricting the structure to support a class or a specific set of algorithms can increase efficiency while still providing flexibility within that set. By generating a custom array for a given computation domain, we explore the design space between an ASIC and an FPGA. However, the manual creation of these customized reprogrammable architectures would be a labor-intensive process, leading to high design costs. Instead, we propose automatic reconfigurable architecture generation specialized to given application sets. This paper discusses configurable ASIC (cASIC) architecture generation that creates hardware on average up to 12.3x smaller than an FPGA solution with embedded multipliers and 2.2x smaller than a standard cell implementation of individual circuits. [ABSTRACT FROM AUTHOR]
- Published
- 2007
30. MorphoSys: An Integrated Reconfigurable System for Data-Parallel and Computation-Intensive Applications.
- Author
-
Singh, Hartej and Lee, Ming-Hau
- Subjects
COMPUTER systems - Abstract
Presents a study which introduced MorphoSys, a reconfigurable computing system developed to investigate the effectiveness of combining reconfigurable hardware with general-purpose processors for word-level, computation-intensive applications. Taxonomy for reconfigurable systems; Components, features and program flow of MorphoSys; Design of MorphoSys components; Mapping applications to MorphoSys.
- Published
- 2000
31. Some Conditional Cube Testers for Grain-128a of Reduced Rounds.
- Author
-
Dalai, Deepak Kumar, Pal, Santu, and Sarkar, Santanu
- Subjects
STREAM ciphers ,CUBES ,SHIFT registers ,HEURISTIC ,BOOLEAN functions - Abstract
In this article, a new strategy, maximum last α round, is proposed to select cubes for cube attacks. This strategy considers the cubes in a particular round where the probability of its superpoly to be 1 is at most α, where α is a very small number. A heuristic method to find a number of suitable cubes using this strategy and the previously used strategies (i.e., maximum initial zero, maximum last zero) is proposed. To get a bias at the higher rounds, the heuristic, too, imposes conditions on some state bits of the cipher to make the non-constant superpoly of a cube as zero for the first few rounds. Some cube testers are formed by using those suitable cubes to implement a distinguishing attack on Grain-128a of reduced KSA (or initialization) rounds. We present a distinguisher for Grain-128a of 191 (out of 256) KSA round in the single key setup and 201 (out of 256) KSA round in the weak key setup by using the cubes of dimension 5. The number of rounds is the highest till today, and the cube dimension is smaller than the previous results. Further, we tested our algorithm on Grain-128 and achieved good results by using small cubes. [ABSTRACT FROM AUTHOR]
- Published
- 2022
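Entry 31 relies on cube testers and superpolys. The following toy C example shows the basic cube-sum operation on a made-up Boolean function: XOR-summing the output over all assignments of the cube variables leaves the superpoly, here a single key bit. The function f and the cube are hypothetical and unrelated to Grain-128a.

```c
#include <stdio.h>

/* Hypothetical toy "cipher output bit": f = v0*v1*k0 ^ v0*v2 ^ k1 ^ v1,
 * where v are public (IV) bits and k are secret key bits. */
static int f(const int v[3], const int k[2]) {
    return (v[0] & v[1] & k[0]) ^ (v[0] & v[2]) ^ k[1] ^ v[1];
}

int main(void) {
    int k[2] = {1, 0};          /* fixed key bits (unknown to the attacker) */
    /* Cube {v0, v1}: XOR f over all 4 assignments of the cube, with v2 fixed to 0. */
    int sum = 0;
    for (int b = 0; b < 4; b++) {
        int v[3] = { b & 1, (b >> 1) & 1, 0 };
        sum ^= f(v, k);
    }
    /* For this f, the superpoly of cube {v0, v1} is k0, so the sum equals k[0]. */
    printf("cube sum = %d, k0 = %d\n", sum, k[0]);
    return 0;
}
```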
32. R3TOS: A Novel Reliable Reconfigurable Real-Time Operating System for Highly Adaptive, Efficient, and Dependable Computing on FPGAs.
- Author
-
Iturbe, Xabier, Benkrid, Khaled, Hong, Chuan, Ebrahim, Ali, Torrego, Raul, Martinez, Imanol, Arslan, Tughrul, and Perez, Jon
- Subjects
COMPUTER reliability ,ADAPTIVE computing systems ,REAL-time computing ,COMPUTER operating systems ,FIELD programmable gate arrays ,COMPUTER users - Abstract
Despite the clear potential of FPGAs to push the current power wall beyond what is possible with general-purpose processors, as well as to meet ever more exigent reliability requirements, the lack of standard tools and interfaces to develop reconfigurable applications limits FPGAs' user base and makes programming them unproductive. R3TOS is our contribution to tackle this problem. It provides systematic OS support for FPGAs, allowing the exploitation of some of the most advanced capabilities of FPGA technology by inexperienced users. What makes R3TOS special is its nonconventional way of exploiting on-chip resources: These are used indistinguishably for carrying out either computation or communication tasks at different times. Indeed, R3TOS does not rely on any static infrastructure apart from its own core circuitry, which is constrained to a specific region within the FPGA where it is implemented. Thus, the rest of the device is kept free of obstacles, with the spare resources ready to be used as and whenever needed. At runtime, the hardware tasks are scheduled and allocated with the dual objective of improving computation density and circumventing damaged resources on the FPGA. [ABSTRACT FROM PUBLISHER]
- Published
- 2013
33. An Automated Framework for Accelerating Numerical Algorithms on Reconfigurable Platforms Using Algorithmic/Architectural Optimization.
- Author
-
Jung Sub Kim, Lanping Deng, Prasanth Mangalagiri, Irick, Kevin, Kanwaldeep Sobti, Mahmut Kandemir, Vijaykrishnan Narayanan, Chaitali Chakrabarti, Pitsianis, Nikos, and Xiaobai Sun
- Subjects
- *
AUTOMATION , *ALGORITHMS , *PROGRAM transformation , *COMPUTER input-output equipment , *FIELD programmable gate arrays , *KERNEL functions , *SIGNAL processing - Abstract
This paper describes TANOR, an automated framework for designing hardware accelerators for numerical computation on reconfigurable platforms. Applications utilizing numerical algorithms on large-size data sets require high-throughput computation platforms. The focus is on N-body interaction problems which have a wide range of applications spanning from astrophysics to molecular dynamics. The TANOR design flow starts with a MATLAB description of a particular interaction function, its parameters, and certain architectural constraints specified through a graphical user interface. Subsequently, TANOR automatically generates a configuration bitstream for a target FPGA along with associated drivers and control software necessary to direct the application from a host PC. Architectural exploration is facilitated through support for fully custom fixed-point and floating-point representations in addition to standard number representations such as single-precision floating point. Moreover, TANOR enables joint exploration of algorithmic and architectural variations in realizing efficient hardware accelerators. TANOR's capabilities have been demonstrated for three different N-body interaction applications: the calculation of gravitational potential in astrophysics, the diffusion or convolution with Gaussian kernel common in image processing applications, and the force calculation with vector-valued kernel function in molecular dynamics simulation. Experimental results show that TANOR-generated hardware accelerators achieve lower resource utilization without compromising numerical accuracy, in comparison to other existing custom accelerators. [ABSTRACT FROM AUTHOR]
- Published
- 2009
34. Scalable Hybrid Designs for Linear Algebra on Reconfigurable Computing Systems.
- Author
-
Ling Zhuo and Prasanna, Viktor K.
- Subjects
- *
FIELD programmable gate arrays , *COMPUTER systems , *MATRICES (Mathematics) , *DIGITAL communications , *DATA transmission systems , *SYSTEMS design , *BROADBAND communication systems , *GATE array circuits , *PROGRAMMABLE logic devices , *COMPUTER networks - Abstract
Recently, high-end reconfigurable computing systems that employ Field-Programmable Gate Arrays (FPGAs) as hardware accelerators for general-purpose processors have been built. These systems provide new opportunities for high-performance computing. However, the coexistence of the processors and the FPGAs in them also poses new challenges to application developers. In this paper, we build a design model for hybrid designs, that is, designs that utilize both the processors and the FPGAs for computations. The model characterizes a reconfigurable computing system using various parameters, including the floating-point computing power of the processors and the FPGAs, the number of nodes, the size of multiple levels of memory, the memory bandwidth, and the network bandwidth. Based on the model, we propose a design methodology for hardware/software codesign. The methodology partitions workload between the processors and the FPGAs, maintains load balance in the system, and realizes scalability over multiple nodes. Designs are proposed for several computationally intensive applications: matrix multiplication, matrix factorization, and the Floyd-Warshall algorithm for the all-pairs shortest-paths problem. To illustrate our ideas, the proposed hybrid designs are implemented on a Cray XD1. Each node of XD1 contains AMD 2.2-GHz Opteron processors and a Xilinx Virtex-II Pro FPGA. Experimental results show that our designs utilize both the processors and the FPGAs efficiently and overlap most of the data transfer overheads and network communication costs with the computations. Our designs achieve up to 90 percent of the total performance of the nodes and 90 percent of the performance predicted by the design model. In addition, our designs scale over a large number of nodes. [ABSTRACT FROM AUTHOR]
- Published
- 2008
35. Fast Resource and Timing Aware Design Optimisation for High-Level Synthesis.
- Author
-
Perina, Andre B., Silitonga, Arthur, Becker, Jurgen, and Bonato, Vanderlei
- Subjects
- *
COMPILERS (Computer programs) , *GATE array circuits , *FIELD programmable gate arrays , *SPACE exploration - Abstract
Field-Programmable Gate Arrays (FPGAs) are often present in energy-efficient systems, although their non-trivial development flow is an obstacle to massive adoption. High-Level Synthesis (HLS) approaches attempt to mitigate the gap by targeting FPGAs from software languages; however, manual tuning is still essential to meet performance demands. We present a high-level design space exploration framework with timing and resource awareness that uses an estimator named Lina to evaluate each design point. Lina is a profiling-based approach that avoids the costly static analyses performed by HLS compilers, allowing a significantly faster exploration of optimisations. Estimations are improved by supporting a continuous range of operating frequencies and by considering resource usage for both floating-point and integer datapaths. For a given set of C kernels, the estimated solutions are among the best 1% for execution time and resource footprint. The exploration of each kernel using Lina was performed on average two orders of magnitude faster than using early HLS compiler reports, and four orders of magnitude faster than fully compiling each design point. By considering the design spaces traversed, our solutions reached 70% of the maximum speed-up achievable. This represents an average speed-up of 14-16× compared to the baseline designs with no optimisations enabled. [ABSTRACT FROM AUTHOR]
- Published
- 2021
36. OmpSs@FPGA Framework for High Performance FPGA Computing.
- Author
-
de Haro, Juan Miguel, Bosch, Jaume, Filgueras, Antonio, Vidal, Miquel, Jimenez-Gonzalez, Daniel, Alvarez, Carlos, Martorell, Xavier, Ayguade, Eduard, and Labarta, Jesus
- Subjects
- *
HIGH performance computing , *COMPILERS (Computer programs) , *FIELD programmable gate arrays - Abstract
This article presents the new features of the OmpSs@FPGA framework. OmpSs is a data-flow programming model that supports task nesting and dependencies to target asynchronous parallelism and heterogeneity. OmpSs@FPGA is the extension of the programming model addressed specifically to FPGAs. The OmpSs environment is built on top of the Mercurium source-to-source compiler and the Nanos++ runtime system. To address FPGA specifics, the Mercurium compiler implements several FPGA-related features such as local variable caching, wide memory accesses, or accelerator replication. In addition, part of the Nanos++ runtime has been ported to hardware. Driven by the compiler, this new hardware runtime adds new features to FPGA codes, such as task creation and dependence management, providing both performance increases and ease of programming. To demonstrate these new capabilities, different high performance benchmarks have been evaluated over different FPGA platforms using the OmpSs programming model. The results demonstrate that programs that use the OmpSs programming model achieve very competitive performance with low to moderate porting effort compared to other FPGA implementations. [ABSTRACT FROM AUTHOR]
- Published
- 2021
37. OPTWEB: A Lightweight Fully Connected Inter-FPGA Network for Efficient Collectives.
- Author
-
Mizutani, Kenji, Yamaguchi, Hiroshi, Urino, Yutaka, and Koibuchi, Michihiro
- Subjects
DISTRIBUTED computing ,FIELD programmable gate arrays ,COMPUTING platforms - Abstract
Modern FPGA accelerators can be equipped with many high-bandwidth network I/Os, e.g., 64 x 50 Gbps, enabled by onboard optics or co-packaged optics. Some dozens of tightly coupled FPGA accelerators form an emerging computing platform for distributed data processing. However, a conventional indirect packet network using Ethernet's Intellectual Properties imposes an unacceptably large amount of the logic for handling such high-bandwidth interconnects on an FPGA. Besides the indirect network, another approach builds a direct packet network. Existing direct inter-FPGA networks have a low-radix network topology, e.g., 2-D torus. However, the low-radix network has the disadvantage of a large diameter and large average shortest path length that increases the latency of collectives. To mitigate both problems, we propose a lightweight, fully connected inter-FPGA network called OPTWEB for efficient collectives. Since all end-to-end separate communication paths are statically established using onboard optics, raw block data can be transferred with simple link-level synchronization. Once each source FPGA assigns a communication stream to a path by its internal switch logic between memory-mapped and stream interfaces for remote direct memory access (RDMA), a one-hop transfer is provided. Since each FPGA performs input/output of the remote memory access between all FPGAs simultaneously, multiple RDMAs efficiently form collectives. The OPTWEB network provides 0.71-μsec start-up latency of collectives among multiple Intel Stratix 10 MX FPGA cards with onboard optics. The OPTWEB network consumes 31.4 and 57.7 percent of adaptive logic modules for aggregate 400-Gbps and 800-Gbps interconnects on a custom Stratix 10 MX 2100 FPGA, respectively. The OPTWEB network reduces by 40 percent the cost compared to a conventional packet network. [ABSTRACT FROM AUTHOR]
- Published
- 2021
38. Guest Editors' introduction: Special section on adaptive hardware and systems.
- Author
-
Benkrid, Khaled, Keymeulen, Didier, Patel, Umeshkumar D., and Merodio-Codinachs, David
- Subjects
PERIODICAL editors ,COMPUTER input-output equipment ,COMPUTER systems ,ADAPTIVE computing systems ,COMPUTER periodicals - Abstract
This special section of IEEE Transactions on Computers presents some of the latest research developments in the field of adaptive hardware and systems. The creation of this section was motivated by lively discussions held at the annual NASA/ESA Adaptive Hardware and Systems (AHS) conference, which showed a need for such a special section in a top-ranked journal. At the end of a rigorous review process, ten papers were selected for publication from a set of high-quality submissions consisting of regular papers and extended papers from the AHS 2012 conference proceedings. The articles are then briefly described. [ABSTRACT FROM PUBLISHER]
- Published
- 2013
39. LayeredTrees: Most Specific Prefix-Based Pipelined Design for On-Chip IP Address Lookups
- Author
-
Fang-Chen Kuo, Yeim-Kuan Chang, Cheng-Chien Su, and Han-Jhen Kuo
- Subjects
Computer science ,Pipeline (computing) ,Routing table ,Byte ,Parallel computing ,Reconfigurable computing ,Theoretical Computer Science ,Prefix ,Computational Theory and Mathematics ,Hardware and Architecture ,Trie ,Hardware_ARITHMETICANDLOGICSTRUCTURES ,Throughput (business) ,Software - Abstract
Multibit trie-based pipelines for IP lookups have been demonstrated to achieve throughputs of over 100 Gbps. However, it is hard to store the entire multibit trie in the on-chip memory of reconfigurable hardware devices, so their performance is limited by the speed of off-chip memory. In this paper, we propose a new pipeline design called LayeredTrees that overcomes the shortcomings of the multibit trie-based pipelines. LayeredTrees pipelines the multi-layered multiway balanced prefix trees based on the concept of most specific prefixes. LayeredTrees is optimized to fit the entire routing table into the on-chip memory of reconfigurable hardware devices. No prefix duplication is needed, and each W-bit prefix is encoded in a (W+1)-bit format to save memory. Assuming a minimum packet size of 40 bytes, our experimental results on a Virtex-6 XC6VSX315T FPGA chip show that throughputs of 33.6 and 120.8 Gbps can be achieved by the proposed single search engine and by multiple search engines running in parallel, respectively. Furthermore, the impact of update operations on search performance is minimal. With the same FPGA device, an IPv6 routing table of 290,503 distinct entries can also be supported.
- Published
- 2014
- Full Text
- View/download PDF
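Two details of the abstract above lend themselves to a quick illustration: the (W+1)-bit prefix encoding and the throughput arithmetic behind the 40-byte minimum packet size. The Python sketch below uses one common (W+1)-bit encoding, in which a terminating '1' is appended to the prefix bits and the result is zero-padded, so the prefix length never has to be stored separately; the paper's exact format may differ, and the function name encode_prefix is ours.

W = 32  # IPv4 prefix width

def encode_prefix(bits: str) -> int:
    # Append a terminating '1' to the prefix, then zero-pad to W+1 bits.
    assert len(bits) <= W and set(bits) <= {"0", "1"}
    return int((bits + "1").ljust(W + 1, "0"), 2)

# '10*' and '100*' map to distinct codewords even though they share bits:
print(bin(encode_prefix("10")))    # 0b101 followed by 30 zeros
print(bin(encode_prefix("100")))   # 0b1001 followed by 29 zeros

# Throughput arithmetic from the abstract: with 40-byte minimum packets,
# 120.8 Gbps corresponds to about 377.5 million lookups per second.
print(120.8e9 / (40 * 8))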
40. A Power- and Performance-Aware Software Framework for Control System Applications.
- Author
-
Giardino, Michael, Klawitter, Eric, Ferri, Bonnie, and Ferri, Aldo
- Subjects
SOFTWARE frameworks ,COMPUTING platforms ,SITUATIONAL awareness ,MOBILE robots ,MOBILE operating systems ,CYBER physical systems ,HIGH performance computing - Abstract
This article describes the development of a software architectural framework for implementing compute-aware control systems, where the term “compute-aware” describes controllers that can modify existing low-level computing platform power managers in response to the needs of the physical system controller. This level of interaction means that high-level decisions can be made as to when to operate the computing platform in a power-saving mode or a high-performance mode, based on situation awareness of the physical system. The framework is demonstrated experimentally on a mobile robot platform. In this example, a situation-aware governor is developed that adjusts the speed of the processor based on the physical performance of the robot as it traverses a path through obstacles. The results show that the situation-aware governor achieves overall power savings of up to 38.9 percent with only 1.3 percent performance degradation compared to the static high-power strategy. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
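The abstract above does not spell out the governor's decision rule, so the following Python fragment is only a hypothetical sketch of a situation-aware policy: the computing platform is switched to a high-performance mode when the physical task becomes demanding (large tracking error or a nearby obstacle) and otherwise dropped to a power-saving mode. All names and thresholds are illustrative assumptions, not the paper's implementation.

def choose_frequency(cross_track_error, obstacle_distance,
                     err_threshold=0.05, dist_threshold=0.5):
    # Hypothetical policy: run fast only when the physical task is demanding,
    # i.e., when tracking error is large or an obstacle is close; otherwise
    # save power. Thresholds are placeholders, not values from the paper.
    if cross_track_error > err_threshold or obstacle_distance < dist_threshold:
        return "performance"    # request the platform's high-speed mode
    return "powersave"          # request the platform's power-saving mode

# Example control-loop steps with made-up sensor readings:
print(choose_frequency(cross_track_error=0.02, obstacle_distance=2.0))  # powersave
print(choose_frequency(cross_track_error=0.12, obstacle_distance=2.0))  # performance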
41. Pipelined Hardware Implementation of COPA, ELmD, and COLM.
- Author
-
Bossuet, Lilian, Mancillas-Lopez, Cuauhtemoc, and Ovilla-Martinez, Brisbane
- Subjects
DATA integrity ,HARDWARE - Abstract
Authenticated encryption algorithms offer privacy, authentication, and data integrity. In recent years, they have received special attention after the call for submissions of the Competition for Authenticated Encryption: Security, Applicability, and Robustness (CAESAR) was published. The CAESAR goal is to generate a portfolio of recommended authenticated encryption algorithms for three different scenarios: lightweight, high-speed, and defense in depth. ELmD and COPA are two on-line authenticated encryption algorithms submitted to CAESAR; because of their similarities, they were merged as COLM during the third round of CAESAR. COLM is a finalist in use case 3, defense in depth. ELmD, COPA, and COLM are based on the ECB-mix-ECB structure, which is highly parallelizable and pipelineable. In this paper, we present optimized single-chip implementations of ELmD, COPA, and COLM using pipelining. For ELmD, we present implementations for eight combinations of its parameter set: with or without intermediate tags, fixed or variable tag length, and 10 or 6 AES rounds. The COLM implementation is for variable tag length without intermediate tags. COPA has no parameter set. The implementation results with a Xilinx Virtex-6 FPGA show that ELmD is the best option concerning area and speed for a single-chip implementation. The areas of COPA and COLM are 1.65 and 1.69 times that of ELmD, respectively. Regarding throughput, the range of our implementations goes from 33.34 Gbits/s for COLM to more than 35 Gbits/s for several versions of ELmD. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
42. Efficient Software Implementation of Ring-LWE Encryption on IoT Processors.
- Author
-
Liu, Zhe, Azarderakhsh, Reza, Kim, Howon, and Seo, Hwajeong
- Subjects
GAUSSIAN distribution ,INTERNET of things ,MATHEMATICAL optimization ,NEON ,COMPUTER software - Abstract
Embedded processors have been widely used for building up Internet of Things (IoT) platforms, in which the security issue is becoming critical. This paper studies efficient techniques of lattice-based cryptography on these processors and presents the first implementation of ring-LWE encryption on ARM NEON and MSP430 architectures. For the ARM NEON architecture, we propose a vectorized version of the Iterative Number Theoretic Transform (NTT) for high-speed computation of polynomial multiplication on ARM NEON platforms and a 32-bit variant of the SAMS2 technique for fast reduction. For the MSP430 architecture, we propose an optimized SWAMS2 reduction technique, which consists of five basic operations: shifting, swapping, addition, and two multiplication-subtractions. Regarding sampling from the discrete Gaussian distribution, we adopt the Knuth-Yao sampler, accompanied by optimized methods such as a Look-Up Table (LUT) and byte-scanning. Subsequently, a full-fledged implementation of ring-LWE is presented, taking advantage of both our proposed methods and previous optimization techniques redesigned for the target platforms. Our ring-LWE implementation of encryption/decryption at a classical security level of 128 bits requires only 149.4k/32.8k clock cycles on ARM NEON and 2126.3k/244.5k clock cycles on MSP430. These results are roughly 7 times faster than the fastest ECC implementation on the same platforms at the same security level. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
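The Number Theoretic Transform mentioned in the abstract above turns polynomial multiplication into cheap pointwise products. The Python toy below shows the idea on deliberately tiny parameters (Q = 17, N = 8) and checks the result against schoolbook cyclic convolution; ring-LWE actually requires the negacyclic variant (multiplication modulo x^n + 1) with much larger parameters, so this is a structural sketch only, not the paper's optimized NTT.

Q, N = 17, 8        # toy parameters: Q prime, N divides Q - 1
ROOT = 9            # 9 has multiplicative order 8 modulo 17

def ntt(a, root):
    # Recursive radix-2 number-theoretic transform over Z_Q.
    n = len(a)
    if n == 1:
        return list(a)
    even = ntt(a[0::2], root * root % Q)
    odd = ntt(a[1::2], root * root % Q)
    out, w = [0] * n, 1
    for k in range(n // 2):
        t = w * odd[k] % Q
        out[k] = (even[k] + t) % Q
        out[k + n // 2] = (even[k] - t) % Q
        w = w * root % Q
    return out

def cyclic_mul(a, b):
    # Multiply polynomials mod (x^N - 1, Q) via pointwise NTT products.
    fa, fb = ntt(a, ROOT), ntt(b, ROOT)
    fc = [x * y % Q for x, y in zip(fa, fb)]
    inv_root, inv_n = pow(ROOT, Q - 2, Q), pow(N, Q - 2, Q)
    return [c * inv_n % Q for c in ntt(fc, inv_root)]

# Check against schoolbook cyclic convolution.
a = [1, 2, 3, 4, 0, 0, 0, 0]
b = [5, 6, 7, 0, 0, 0, 0, 0]
ref = [0] * N
for i in range(N):
    for j in range(N):
        ref[(i + j) % N] = (ref[(i + j) % N] + a[i] * b[j]) % Q
assert cyclic_mul(a, b) == ref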
43. Neuromorphic System for Spatial and Temporal Information Processing.
- Author
-
Zyarah, Abdullah M., Gomez, Kevin, and Kudithipudi, Dhireesha
- Subjects
SPATIAL systems ,INFORMATION processing ,UBIQUITOUS computing ,SPATIOTEMPORAL processes ,FAULT-tolerant computing - Abstract
Neuromorphic systems that learn and predict from streaming inputs hold significant promise in pervasive edge computing and its applications. In this article, a neuromorphic system that processes spatio-temporal information on the edge is proposed. Algorithmically, the system is based on hierarchical temporal memory that inherently offers online learning, resiliency, and fault tolerance. Architecturally, it is a full custom mixed-signal design with an underlying digital communication scheme and analog computational modules. Therefore, the proposed system features reconfigurability, real-time processing, low power consumption, and low-latency processing. The proposed architecture is benchmarked to predict on real-world streaming data. The network's mean absolute percentage error on the mixed-signal system is 1.129x lower compared to its baseline algorithm model. This reduction can be attributed to device non-idealities and probabilistic formation of synaptic connections. We demonstrate that the combined effect of Hebbian learning and network sparsity also plays a major role in extending the overall network lifespan. We also illustrate that the system offers a 3.46x reduction in latency and a 77.02x reduction in power consumption when compared to a custom CMOS digital design implemented at the same technology node. By employing specific low-power techniques, such as clock gating, we observe a 161.37x reduction in power consumption. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
44. Machine Learning Computers With Fractal von Neumann Architecture.
- Author
-
Zhao, Yongwei, Fan, Zhe, Du, Zidong, Zhi, Tian, Li, Ling, Guo, Qi, Liu, Shaoli, Xu, Zhiwei, Chen, Tianshi, and Chen, Yunji
- Subjects
MACHINE learning ,COMPUTER architecture ,COMPUTERS ,FRACTALS ,GRAPHICS processing units ,ARCHITECTURAL design ,SERVER farms (Computer network management) - Abstract
Machine learning techniques are pervasive tools for emerging commercial applications, and many dedicated machine learning computers on different scales have been deployed in embedded devices, servers, and data centers. Currently, most machine learning computer architectures still focus on optimizing performance and energy efficiency instead of programming productivity. However, with the fast development of silicon technology, programming productivity, including programming itself and software stack development, becomes the vital factor, rather than performance and power efficiency, that hinders the application of machine learning computers. In this article, we propose Cambricon-F, a series of homogeneous, sequential, multi-layer, layer-similar machine learning computers sharing the same ISA. A Cambricon-F machine has a fractal von Neumann architecture to iteratively manage its components: it has a von Neumann architecture, and its processing components (sub-nodes) are themselves Cambricon-F machines with a von Neumann architecture and the same ISA. Since different Cambricon-F instances with different scales can share the same software stack on their common ISA, Cambricon-Fs can significantly improve programming productivity. Moreover, we address four major challenges in the Cambricon-F architecture design, which allow Cambricon-F to achieve high efficiency. We implement two Cambricon-F instances at different scales, i.e., Cambricon-F100 and Cambricon-F1. Compared to GPU-based machines (DGX-1 and 1080Ti), Cambricon-F instances achieve 2.82x and 5.14x better performance and 8.37x and 11.39x better efficiency on average, with 74.5 and 93.8 percent smaller area costs, respectively. We further propose Cambricon-FR, which enhances the Cambricon-F machine learning computers to flexibly and efficiently support all the fractal operations with a reconfigurable fractal instruction set architecture. Compared to the Cambricon-F instances, Cambricon-FR machines achieve 1.96x and 2.49x better performance on average. Most importantly, Cambricon-FR computers reduce code length by a factor of 5.83, thus significantly improving programming productivity. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
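As a purely conceptual illustration of the fractal idea described above (not the Cambricon-F ISA or hardware), the Python toy below builds a machine whose sub-nodes are machines of the same kind: a node either executes an operation directly or splits the data among identical children and combines their partial results with the same operation, so the same "program" runs unchanged on instances of different scale. The class FractalNode and its parameters are our own illustrative assumptions.

class FractalNode:
    # Toy fractal machine: a leaf computes a reduction itself; an inner node
    # splits the work among identical sub-nodes that accept the same op,
    # mirroring the layer-similar structure described in the abstract.
    def __init__(self, depth, fanout=4):
        self.children = ([] if depth == 0
                         else [FractalNode(depth - 1, fanout) for _ in range(fanout)])

    def execute(self, op, data):
        if not self.children:                       # leaf: run the op directly
            return op(data)
        chunk = (len(data) + len(self.children) - 1) // len(self.children)
        partial = [c.execute(op, data[i * chunk:(i + 1) * chunk])
                   for i, c in enumerate(self.children)]
        return op(partial)                          # the same op combines partials

# The same 'program' (op = sum) runs unchanged on machines of different scale:
small, large = FractalNode(depth=1), FractalNode(depth=3)
data = list(range(1, 101))
assert small.execute(sum, data) == large.execute(sum, data) == 5050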
45. Graph Similarity and its Applications to Hardware Security.
- Author
-
Fyrbiak, Marc, Wallat, Sebastian, Reinhard, Sascha, Bissantz, Nicolai, and Paar, Christof
- Subjects
INTELLECTUAL property infringement ,GRAPH algorithms ,REVERSE engineering ,HARDWARE - Abstract
Hardware reverse engineering is a powerful and universal tool for both security engineers and adversaries. From a defensive perspective, it allows for the detection of intellectual property infringements and hardware Trojans, while it can simultaneously be used for product piracy and malicious circuit manipulation. From a designer's perspective, it is crucial to have an estimate of the costs associated with reverse engineering, yet little is known about them, especially when dealing with obfuscated hardware. The contribution at hand provides new insights into this problem, based on algorithms with sound mathematical underpinnings. Our contributions are threefold: First, we present the graph similarity problem for automating hardware reverse engineering. To this end, we improve several state-of-the-art graph similarity heuristics with optimizations tailored to the hardware context. Second, we propose a novel algorithm based on multiresolutional spectral analysis of adjacency matrices. Third, in three extensively evaluated case studies, namely (1) gate-level netlist reverse engineering, (2) hardware Trojan detection, and (3) assessment of hardware obfuscation, we demonstrate the practical nature of graph similarity algorithms. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
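The abstract above cites spectral analysis of adjacency matrices as one ingredient of graph similarity. The Python sketch below implements only a generic spectral-similarity baseline, comparing sorted eigenvalue spectra with an L2 distance; the paper's multiresolutional algorithm and its netlist-specific optimizations are not reproduced here, and the function names are ours.

import numpy as np

def spectrum(adj):
    # Sorted eigenvalues of a symmetric adjacency matrix.
    return np.sort(np.linalg.eigvalsh(adj))

def spectral_distance(adj_a, adj_b):
    # Pad the shorter spectrum with zeros and compare with an L2 norm;
    # 0 means spectrally indistinguishable, larger means less similar.
    sa, sb = spectrum(adj_a), spectrum(adj_b)
    n = max(len(sa), len(sb))
    sa = np.pad(sa, (n - len(sa), 0))
    sb = np.pad(sb, (n - len(sb), 0))
    return float(np.linalg.norm(sa - sb))

# Two tiny stand-in 'netlist' graphs: a 4-cycle versus a 4-node path.
cycle = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
path = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
print(spectral_distance(cycle, cycle))  # 0.0
print(spectral_distance(cycle, path))   # > 0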
46. Exploiting Hardware-Based Data-Parallel and Multithreading Models for Smart Edge Computing in Reconfigurable FPGAs
- Author
-
Eduardo de la Torre, Alfonso Rodriguez, Marco Platzner, and Andres Otero
- Subjects
business.industry ,Data parallelism ,Computer science ,Reconfigurable computing ,Theoretical Computer Science ,Software ,Computational Theory and Mathematics ,Hardware and Architecture ,Multithreading ,Programming paradigm ,Enhanced Data Rates for GSM Evolution ,business ,Evolvable hardware ,Computer hardware ,Edge computing - Abstract
Current edge computing systems are deployed in highly complex application scenarios with dynamically changing requirements. In order to provide the expected performance and energy efficiency in these situations, the use of heterogeneous hardware/software platforms at the edge has become widespread. However, these computing platforms still suffer from the lack of unified software-driven programming models to efficiently deploy multi-purpose hardware-accelerated solutions. In parallel, edge computing systems also face another huge challenge: operating under multiple conditions that were not taken into account during any of the design stages. Moreover, these conditions may change over time, making self-adaptation mechanisms a must. This paper presents an integrated architecture to exploit hardware-accelerated data-parallel models and transparent hardware/software multithreading. In particular, the proposed architecture leverages the ARTICo framework and ReconOS to allow developers to select the most suitable programming model to deploy their edge computing applications onto run-time reconfigurable hardware devices. An evolvable hardware system is used as an additional architectural component during validation, providing support for continuous lifelong learning in smart edge computing scenarios. The proposed setup exhibits online learning capabilities that include learning by imitation from software-based reference algorithms.
- Published
- 2022
- Full Text
- View/download PDF
47. Architectures and Execution Models for Hardware/Software Compilation and Their System-Level Realization.
- Author
-
Lange, Holger and Koch, Andreas
- Subjects
ADAPTIVE computing systems ,FIELD programmable gate arrays ,COMPUTER input-output equipment ,COMPUTER software ,COMPUTER storage devices ,SYSTEM integration ,COMPUTER operating systems ,VIRTUAL storage (Computer science) - Abstract
We propose an execution model that orchestrates the fine-grained interaction of a conventional general-purpose processor (GPP) and a high-speed reconfigurable hardware accelerator (HA), the latter having full master-mode access to memory. We then describe how the resulting requirements can be realized efficiently in a custom computer through hardware architecture and system software measures. One of these is a low-latency HA-to-GPP signaling scheme with latency up to 23 times lower than that of conventional approaches. Another is a high-bandwidth shared memory interface that does not interfere with time-critical operating system functions executing on the GPP, and still makes 89 percent of the physical memory bandwidth available to the HA. Finally, we show two schemes with different flexibility/performance trade-offs for running the HA in protected virtual memory scenarios. All of the techniques and their interactions are evaluated at the system level using the full-scale virtual memory variant of the Linux operating system on actual hardware. [ABSTRACT FROM PUBLISHER]
- Published
- 2010
- Full Text
- View/download PDF
48. Generation of Finely-Pipelined GF(P) Multipliers for Flexible Curve-Based Cryptography on FPGAs.
- Author
-
Gallin, Gabriel and Tisserand, Arnaud
- Subjects
MULTIPLIERS (Mathematical analysis) ,CRYPTOGRAPHY ,FLEXIBLE printed circuits ,ELLIPTIC curve cryptography ,FIELD programmable gate arrays ,TIMING circuits - Abstract
In this paper, we present modular multipliers for hardware implementations of (hyper-)elliptic curve cryptography on FPGAs. The prime modulus P is generic and can be configured at run time to provide flexible circuits. A finely-pipelined architecture is proposed for overlapping the partial-product and reduction steps in the pipeline of hardwired DSP slices. For instance, 2, 3, or 4 independent multiplications can share the hardware resources at the same time to overlap internal latencies. We designed a tool, distributed as open source, for generating VHDL code with various parameters: width of operands, number of logical multipliers per physical one, speed or area optimization, possible use of BRAMs, and target FPGA. Our modular multipliers lead to circuits that are at least 2 times faster and 2 times smaller than state-of-the-art operators. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
49. On the Construction of Composite Finite Fields for Hardware Obfuscation.
- Author
-
Zhang, Xinmiao and Lao, Yingjie
- Subjects
FINITE fields ,COMPOSITE construction - Abstract
Hardware obfuscation is a technique that modifies a circuit to hide its functionality. Obfuscation through algorithmic modifications adds protection on top of circuit-level techniques, and its effects on the data paths can be analyzed and controlled at the architectural level. Many error-correcting coding and cryptography algorithms are based on finite field arithmetic. For the first time, this paper proposes a hardware obfuscation scheme achieved through varying finite field constructions and primitive element representations. The variations are also effectively transformed into bit permuters controlled by obfuscation keys to achieve a high level of security with very small complexity overhead. To illustrate the effectiveness, the proposed scheme is applied to obfuscate Reed-Solomon decoders, which are broadly used in communication and storage systems. For a (255, 239) RS decoder over the finite field GF(256), the proposed scheme achieves 1239 bits of independent obfuscation key with 4.4 percent area overhead, while incurring no throughput penalty and only one extra clock cycle of latency. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
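The obfuscation idea above rests on the fact that the "same" finite field can be constructed in different ways. The Python sketch below multiplies the same two bytes under two valid GF(2^8) constructions, the AES polynomial 0x11B and the 0x11D polynomial common in Reed-Solomon codecs, and obtains different bit patterns; the paper's key-controlled bit-permutation machinery is not modeled here.

def gf256_mul(a: int, b: int, poly: int) -> int:
    # Multiply two GF(2^8) elements; 'poly' is the degree-8 irreducible
    # polynomial (bit 8 set) that defines the field construction.
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= poly
    return result

# Two constructions of GF(256): the AES polynomial x^8+x^4+x^3+x+1 (0x11B)
# and x^8+x^4+x^3+x^2+1 (0x11D), common in Reed-Solomon codecs.
print(hex(gf256_mul(0x57, 0x83, 0x11B)))  # 0xc1, the well-known AES example
print(hex(gf256_mul(0x57, 0x83, 0x11D)))  # 0x31: same operands, different product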
50. Exploring Shared Virtual Memory for FPGA Accelerators with a Configurable IOMMU.
- Author
-
Vogel, Pirmin, Marongiu, Andrea, and Benini, Luca
- Subjects
SYSTEMS on a chip ,MEMORY - Abstract
A key enabler for the ever-increasing adoption of FPGA accelerators is the availability of frameworks allowing for seamless coupling to general-purpose host processors. Embedded FPGA+CPU systems still heavily rely on copy-based host-to-accelerator communication, which complicates application development. In this paper, we present a hardware/software framework for enabling transparent, shared virtual memory for FPGA accelerators in embedded SoCs. It can use a hard-macro IOMMU if available, or a configurable soft-core IOMMU that we provide. We explore different TLB configurations and provide a comparison with other designs for shared virtual memory to gain insight into performance-critical IOMMU components. Experimental results using pointer-rich benchmarks show that our framework not only simplifies FPGA-accelerated application development, but also achieves up to 13x speedup compared to traditional copy-based offloading. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
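A toy model such as the one below can convey what "exploring different TLB configurations" means in the abstract above: a set-associative translation buffer with a configurable number of sets and ways, counting hits and misses over a synthetic access trace. The class ToyTLB and the trace are illustrative assumptions, not the soft-core IOMMU described in the paper.

from collections import OrderedDict

class ToyTLB:
    # Set-associative TLB model: 'sets' x 'ways' entries with per-set LRU.
    def __init__(self, sets, ways):
        self.sets, self.ways = sets, ways
        self.table = [OrderedDict() for _ in range(sets)]  # vpn -> ppn per set
        self.hits = self.misses = 0

    def translate(self, vpn, page_table):
        entries = self.table[vpn % self.sets]
        if vpn in entries:
            entries.move_to_end(vpn)         # refresh LRU position
            self.hits += 1
            return entries[vpn]
        self.misses += 1                     # a miss triggers a page-table walk
        if len(entries) >= self.ways:
            entries.popitem(last=False)      # evict the least recently used entry
        entries[vpn] = page_table[vpn]
        return entries[vpn]

page_table = {vpn: vpn + 0x10000 for vpn in range(1 << 12)}
tlb = ToyTLB(sets=8, ways=4)                 # 32 entries; vary sets/ways to explore
for i in range(10000):
    tlb.translate(i % 64, page_table)        # looping 64-page working set
print(tlb.hits, tlb.misses)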