205 results for "Reconfigurable hardware"
Search Results
2. Area-time efficient implementation of the elliptic curve method of factoring in reconfigurable hardware for application in the number field sieve
- Author
-
Gaj, K., Soonhak Kwon, Baier, P., Kohlbrenner, P., Hoang Le, Khaleeluddin, M., Bachimanchi, R., and Rogawski, M.
- Subjects
Programmable logic array ,Computers -- Design and construction ,Curves, Elliptic -- Usage ,Ellipse -- Usage ,Digital integrated circuits -- Design and construction - Published
- 2010
3. High-performance designs for linear algebra operations on reconfigurable hardware
- Author
-
Zhuo Ling and Prasanna, Viktor K.
- Subjects
Programmable logic array ,Algebras, Linear -- Usage ,Digital integrated circuits -- Analysis ,Matrices -- Usage - Published
- 2008
4. Reconfigurable hardware SAT solvers: A survey of systems
- Author
-
Skliarova, Iouliia and Ferrari, Antonio de Brito
- Subjects
Microprocessor ,Microprocessor upgrade ,Microprocessors -- Testing - Published
- 2004
5. High-radix Montgomery modular exponentiation on reconfigurable hardware
- Author
-
Blum, Thomas and Paar, Christof
- Subjects
Computers -- Safety and security measures ,Cryptography -- Research ,Modulation (Electronics) -- Research ,Computer programming -- Models - Published
- 2001
6. Scheduling Weakly Consistent C Concurrency for Reconfigurable Hardware.
- Author
-
Ramanathan, Nadesh, Wickerson, John, and Constantinides, George A.
- Subjects
- *
SCHEDULING software , *FIELD programmable gate arrays , *ALGORITHMS , *COMPUTER storage devices , *ARRAY processors - Abstract
Lock-free algorithms, in which threads synchronise not via coarse-grained mutual exclusion but via fine-grained atomic operations (‘atomics’), have been shown empirically to be the fastest class of multi-threaded algorithms in the realm of conventional processors. This article explores how these algorithms can be compiled from C to reconfigurable hardware via high-level synthesis (HLS). We focus on the scheduling problem, in which software instructions are assigned to hardware clock cycles. We first show that typical HLS scheduling constraints are insufficient to implement atomics, because they permit some instruction reorderings that, though sound in a single-threaded context, demonstrably cause erroneous results when synthesising multi-threaded programs. We then show that correct behaviour can be restored by imposing additional intra-thread constraints among the memory operations. In addition, we show that we can support the pipelining of loops containing atomics by injecting further inter-iteration constraints. We implement our approach on two constraint-based scheduling HLS tools: LegUp 4.0 and LegUp 5.1. We extend both tools to support two memory models that are capable of synthesising atomics correctly. The first memory model only supports sequentially consistent (SC) atomics and the second supports weakly consistent (‘weak’) atomics as defined by the 2011 revision of the C standard. Weak atomics necessitate fewer constraints than SC atomics, but suffice for many multi-threaded algorithms. We confirm, via automatic model-checking, that we correctly implement the semantics in accordance with the C standard. A case study on a circular buffer suggests that on average circuits synthesised from programs that schedule atomics correctly can be 6x faster than an existing lock-based implementation of atomics, that weak atomics can yield a further 1.3x speedup, and that pipelining can yield a further 1.3x speedup. [ABSTRACT FROM AUTHOR]
- Published
- 2018
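As a concrete illustration of the weakly consistent C atomics discussed in entry 6, below is a minimal single-producer/single-consumer circular buffer using C11 acquire/release atomics. It is a generic sketch, not the authors' case-study code; with these 'weak' orderings an HLS scheduler needs fewer intra-thread constraints than with sequentially consistent (seq_cst) atomics. Buffer size and names are illustrative.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Minimal SPSC circular buffer in the spirit of entry 6's case study.
 * Generic C11 illustration only: acquire/release ('weak') atomics suffice here,
 * so fewer scheduling constraints are needed than for seq_cst atomics. */
#define BUF_SIZE 16

static int buffer[BUF_SIZE];
static atomic_uint head;   /* written by the producer, read by the consumer */
static atomic_uint tail;   /* written by the consumer, read by the producer */

bool produce(int value) {
    unsigned h = atomic_load_explicit(&head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&tail, memory_order_acquire);
    if (h - t == BUF_SIZE) return false;              /* buffer full */
    buffer[h % BUF_SIZE] = value;                     /* plain store into the slot */
    /* release: the data store above must not be reordered after this publish */
    atomic_store_explicit(&head, h + 1, memory_order_release);
    return true;
}

bool consume(int *value) {
    unsigned t = atomic_load_explicit(&tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&head, memory_order_acquire);
    if (t == h) return false;                         /* buffer empty */
    *value = buffer[t % BUF_SIZE];                    /* read the published slot */
    atomic_store_explicit(&tail, t + 1, memory_order_release);
    return true;
}
```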
7. An Embedded Memory-Centric Reconfigurable Hardware Accelerator for Security Applications
- Author
-
Robert Karam, Christopher Babecki, Somnath Paul, Swarup Bhunia, and Wenchao Qian
- Subjects
business.industry ,Computer science ,020208 electrical & electronic engineering ,02 engineering and technology ,Security kernel ,Reconfigurable computing ,020202 computer hardware & architecture ,Theoretical Computer Science ,Software ,Computational Theory and Mathematics ,Hardware and Architecture ,Embedded system ,Datapath ,0202 electrical engineering, electronic engineering, information engineering ,Hardware acceleration ,business ,Field-programmable gate array ,Efficient energy use - Abstract
Security has emerged as a critical need in today’s computer applications. Unfortunately, most security algorithms are computationally expensive and often do not map efficiently to general purpose processors. Fixed-function accelerators offer significant improvement in energy-efficiency, but they do not allow more than one application to reuse hardware resources. Mapping applications to generic reconfigurable fabrics can achieve the desired flexibility, but at the cost of area and energy efficiency. This paper presents a novel reconfigurable framework, referred to as hardware accelerator for security kernel (HASK), for accelerating a wide array of security applications. This framework incorporates a coarse-grained datapath, support for lookup functions, and flexible interconnect optimizations, which enable on-demand pipelining and parallel computations in multiple ultralightweight processing elements. These features are highly effective for energy-efficient operation in a diverse set of security applications. Through simulations, we have compared the performance of HASK to software and field programmable gate array (FPGA) platforms. Simulation results for a set of six common security applications show comparable latency between HASK and FPGA with 2.5X improvement in energy-delay product and 4X improvement in iso-area throughput. HASK also shows 5X improvement in iso-area throughput and 45X improvement in energy-delay product compared to optimized software implementations.
- Published
- 2016
8. Efficient Mapping of Task Graphs onto Reconfigurable Hardware Using Architectural Variants
- Author
-
Mohamed Bakhouya, Vikram K. Narayana, Miaoqing Huang, Jaafar Gaber, and Tarek El-Ghazawi
- Subjects
Computational Theory and Mathematics ,Computer architecture ,Hardware and Architecture ,Computer science ,Genetic algorithm ,FpgaC ,Throughput (business) ,Execution time ,Software ,Reconfigurable computing ,Theoretical Computer Science ,Task (project management) - Abstract
High-performance reconfigurable computing involves acceleration of significant portions of an application using reconfigurable hardware. Mapping application task graphs onto reconfigurable hardware has therefore been attracting increasing attention. In this work, we approach the mapping problem by incorporating multiple architectural variants for each hardware task; the variants reflect tradeoffs between the logic resources consumed and the task execution throughput. We propose a mapping approach based on a genetic algorithm, and show its effectiveness for random task graphs as well as an N-body simulation application, demonstrating improvements of up to 78.6 percent in the execution time compared with choosing a fixed implementation variant for all tasks. We then validate our methodology through experiments on real hardware, an SRC-6 reconfigurable computer.
- Published
- 2012
9. Lattice-Based Signatures: Optimization and Implementation on Reconfigurable Hardware.
- Author
-
Güneysu, Tim, Lyubashevsky, Vadim, and Pöppelmann, Thomas
- Subjects
- *
LATTICE theory , *DIGITAL signatures , *MATHEMATICAL optimization , *ADAPTIVE computing systems , *QUANTUM computers - Abstract
Nearly all of the currently used signature schemes, such as RSA or DSA, are based either on the factoring assumption or the presumed intractability of the discrete logarithm problem. As a consequence, the appearance of quantum computers or algorithmic advances on these problems may lead to the unpleasant situation that a large number of today’s schemes will most likely need to be replaced with more secure alternatives. In this work we present such an alternative—an efficient signature scheme whose security is derived from the hardness of lattice problems. It is based on recent theoretical advances in lattice-based cryptography and is highly optimized for practicability and use in embedded systems. The public and secret keys are roughly 1.5 kB and 0.3 kB long, while the signature size is approximately 1.1 kB for a security level of around 80 bits. We provide implementation results on reconfigurable hardware (Spartan/Virtex-6) and demonstrate that the scheme is scalable, has low area consumption, and even outperforms classical schemes.
- Published
- 2015
10. Reconfigurable Hardware Implementations of Tweakable Enciphering Schemes
- Author
-
Cuauhtemoc Mancillas-López, Debrup Chakraborty, and Francisco Rodriguez Henriquez
- Subjects
Block cipher mode of operation ,Computer science ,business.industry ,Hash function ,Cryptography ,Parallel computing ,Encryption ,Pseudorandom permutation ,Reconfigurable computing ,Theoretical Computer Science ,Computational Theory and Mathematics ,Disk encryption ,Hardware and Architecture ,Embedded system ,business ,Software ,Block cipher - Abstract
Tweakable enciphering schemes are length-preserving block cipher modes of operation that provide a strong pseudorandom permutation. It has been suggested that these schemes can be used as the main building blocks for achieving in-place disk encryption. In the past few years, there has been an intense research activity toward constructing secure and efficient tweakable enciphering schemes. But actual experimental performance data of these newly proposed schemes are yet to be reported. In this paper, we present optimized FPGA implementations of six tweakable enciphering schemes, namely, HCH, HCTR, XCB, EME, HEH, and TET, using a 128-bit AES core as the underlying block cipher. We report the performance timings of these modes when using both pipelined and sequential AES structures. The universal polynomial hash function included in the specification of HCH, HCHfp (a variant of HCH), HCTR, XCB, TET, and HEH was implemented using a Karatsuba multiplier as the main building block. We provide detailed algorithm analysis of each of the schemes, trying to exploit their inherent parallelism as much as possible. Our experiments show that a sequential AES core is not an attractive option for the design of these modes as it leads to rather poor throughput. In contrast, according to our place-and-route results on a Xilinx Virtex 4 FPGA, our designs achieve a throughput of 3.95 Gbps for HEH when using an encryption/decryption pipelined AES core, and a throughput of 5.71 Gbps for EME when using an encryption-only pipelined AES core. The performance results reported in this paper provide experimental evidence that hardware implementations of tweakable enciphering schemes can actually match and even outperform the data rates achieved by state-of-the-art disk controllers, thus showing that they might be used for achieving provably secure in-place hard disk encryption.
- Published
- 2010
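The abstract of entry 10 mentions a Karatsuba multiplier as the main building block of the universal polynomial hash functions. The sketch below shows one Karatsuba level for carry-less (GF(2)[x]) multiplication; the operand widths and function names are illustrative assumptions, not taken from the paper, which targets GF(2^128).

```c
#include <stdint.h>
#include <stdio.h>

/* Schoolbook carry-less multiply of two 32-bit polynomials -> 64-bit product.
 * Toy illustration only; the paper's multiplier works on 128-bit operands. */
static uint64_t clmul32(uint32_t a, uint32_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 32; i++)
        if ((b >> i) & 1) r ^= (uint64_t)a << i;
    return r;
}

/* 64x64 -> 128-bit carry-less multiply using one Karatsuba step:
 * a = a1*x^32 + a0, b = b1*x^32 + b0
 * a*b = a1b1*x^64 ^ (a1b1 ^ a0b0 ^ (a1^a0)(b1^b0))*x^32 ^ a0b0 */
static void clmul64_karatsuba(uint64_t a, uint64_t b, uint64_t out[2]) {
    uint32_t a0 = (uint32_t)a, a1 = (uint32_t)(a >> 32);
    uint32_t b0 = (uint32_t)b, b1 = (uint32_t)(b >> 32);
    uint64_t lo  = clmul32(a0, b0);
    uint64_t hi  = clmul32(a1, b1);
    uint64_t mid = clmul32(a0 ^ a1, b0 ^ b1) ^ lo ^ hi;
    out[0] = lo ^ (mid << 32);            /* low 64 bits  */
    out[1] = hi ^ (mid >> 32);            /* high 64 bits */
    /* three 32-bit multiplies instead of four: the saving that makes
     * Karatsuba attractive for wide GF(2^128) hardware multipliers */
}

int main(void) {
    uint64_t out[2];
    clmul64_karatsuba(0x87, 0x13, out);   /* (x^7+x^2+x+1) times (x^4+x+1) */
    printf("%016llx %016llx\n", (unsigned long long)out[1], (unsigned long long)out[0]);
    return 0;
}
```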
11. Area-Time Efficient Implementation of the Elliptic Curve Method of Factoring in Reconfigurable Hardware for Application in the Number Field Sieve
- Author
-
Mohammed Khaleeluddin, Marcin Rogawski, Kris Gaj, Patrick Baier, Hoang Le, Soonhak Kwon, Paul Kohlbrenner, and Ramakrishna Bachimanchi
- Subjects
Hardware architecture ,business.industry ,Computer science ,Parallel computing ,Porting ,Reconfigurable computing ,Theoretical Computer Science ,General number field sieve ,Public-key cryptography ,Elliptic curve ,Memory management ,Software ,Computational Theory and Mathematics ,Hardware and Architecture ,business ,Field-programmable gate array - Abstract
A novel portable hardware architecture of the Elliptic Curve Method of factoring, designed and optimized for application in the relation collection step of the Number Field Sieve, is described and analyzed. A comparison with an earlier proof-of-concept design by Pelzl et al. has been performed, and a substantial improvement has been demonstrated in terms of both the execution time and the area-time product. The ECM architecture has been ported across five different families of FPGA devices in order to select the family with the best performance to cost ratio. A timing comparison with the highly optimized software implementation, GMP-ECM, has been performed. Our results indicate that low-cost families of FPGAs, such as Spartan-3 and Spartan-3E, offer at least an order of magnitude improvement over the same generation of microprocessors in terms of the performance to cost ratio, without the use of embedded FPGA resources, such as embedded multipliers.
- Published
- 2010
12. An Embedded Memory-Centric Reconfigurable Hardware Accelerator for Security Applications
- Author
-
Babecki, Christopher, Qian, Wenchao, Paul, Somnath, Karam, Robert, and Bhunia, Swarup
- Published
- 2016
13. High-Performance Designs for Linear Algebra Operations on Reconfigurable Hardware
- Author
-
Viktor K. Prasanna and Ling Zhuo
- Subjects
Numerical linear algebra ,Floating point ,Computer science ,Parallel algorithm ,Dot product ,Memory bandwidth ,Parallel computing ,computer.software_genre ,Reconfigurable computing ,Matrix multiplication ,Theoretical Computer Science ,Matrix decomposition ,Computer Science::Hardware Architecture ,Computational Theory and Mathematics ,Hardware and Architecture ,Linear algebra ,Hardware acceleration ,Multiplication ,Field-programmable gate array ,computer ,Software - Abstract
Numerical linear algebra operations are key primitives in scientific computing. Performance optimizations of such operations have been extensively investigated. With the rapid advances in technology, hardware acceleration of linear algebra applications using FPGAs (field programmable gate arrays) has become feasible. In this paper, we propose FPGA-based designs for several basic linear algebra operations, including dot product, matrix-vector multiplication, matrix multiplication and matrix factorization. By identifying the parameters for each operation, we analyze the trade-offs and propose a high-performance design. In the implementations of the designs, the values of the parameters are determined according to the hardware constraints, such as the available chip area, the size of available memory, the memory bandwidth, and the number of I/O pins. The proposed designs are implemented on Xilinx Virtex-II Pro FPGAs. Experimental results show that our designs scale with the available hardware resources. Also, the performance of our designs compares favorably with that of general-purpose processor based designs. We also show that with faster floating-point units and larger devices, the performance of our designs increases accordingly.
- Published
- 2008
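Entry 13 analyzes FPGA designs for dot product and matrix operations. A recurring idea in such designs is to break the single accumulation chain into several independent partial sums so that deeply pipelined floating-point adders stay busy; the C sketch below illustrates that restructuring in software form. The parallelism degree P and the code are illustrative assumptions, not the paper's design.

```c
#include <stdio.h>

#define P 4   /* illustrative degree of parallelism */

/* Dot product split into P independent partial accumulators; an FPGA design
 * would feed each accumulator to its own pipelined floating-point adder. */
double dot(const double *x, const double *y, int n) {
    double partial[P] = {0};
    for (int i = 0; i < n; i++)
        partial[i % P] += x[i] * y[i];   /* each accumulator handles every P-th term */
    double sum = 0;
    for (int j = 0; j < P; j++)          /* final reduction of the P partial sums */
        sum += partial[j];
    return sum;
}

int main(void) {
    double x[6] = {1, 2, 3, 4, 5, 6}, y[6] = {1, 1, 1, 1, 1, 1};
    printf("%f\n", dot(x, y, 6));        /* prints 21.000000 */
    return 0;
}
```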
14. High-radix Montgomery modular exponentiation on reconfigurable hardware
- Author
-
Christof Paar and T. Blum
- Subjects
Modular exponentiation ,Exponentiation ,Modular arithmetic ,business.industry ,Computer science ,Modulus ,Systolic array ,Cryptography ,Operand ,Reconfigurable computing ,Theoretical Computer Science ,Public-key cryptography ,Computational Theory and Mathematics ,Montgomery reduction ,Computer architecture ,Integer ,Hardware and Architecture ,Discrete logarithm ,Radix ,Hardware_ARITHMETICANDLOGICSTRUCTURES ,business ,Software - Abstract
It is widely recognized that security issues will play a crucial role in the majority of future computer and communication systems. Central tools for achieving system security are cryptographic algorithms. This contribution proposes arithmetic architectures which are optimized for modern field programmable gate arrays (FPGAs). The proposed architectures perform modular exponentiation with very long integers. This operation is at the heart of many practical public-key algorithms such as RSA and discrete logarithm schemes. We combine a high-radix Montgomery modular multiplication algorithm with a new systolic array design. The designs are flexible, allowing any choice of operand and modulus. The new architecture also allows the use of high radices. Unlike previous approaches, we systematically implement and compare several variants of our new architecture for different bit lengths. We provide absolute area and timing measures for each architecture. The results allow conclusions about the feasibility and time-space trade-offs of our architecture for implementation on commercially available FPGAs. We found that 1,024-bit RSA decryption can be done in 3.1 ms with our fastest architecture.
- Published
- 2001
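Entry 14 builds on Montgomery modular multiplication for long-integer modular exponentiation. The following toy C sketch shows the underlying radix-2 (bit-serial) Montgomery algorithm on small integers; the paper's hardware uses high radices, very long operands, and a systolic array, so this is only a functional illustration with assumed toy parameters.

```c
#include <stdint.h>
#include <stdio.h>

/* Radix-2 Montgomery multiplication: returns a*b*2^(-k) mod m for odd m < 2^k.
 * Toy parameters for illustration; real RSA-size designs use high radices
 * and operands of 1,024 bits or more, as in the paper. */
static uint64_t monmul(uint64_t a, uint64_t b, uint64_t m, int k) {
    uint64_t r = 0;
    for (int i = 0; i < k; i++) {
        if (a & (1ULL << i)) r += b;      /* add partial product */
        if (r & 1) r += m;                /* make r even so it divides by 2 */
        r >>= 1;                          /* Montgomery reduction step */
    }
    return (r >= m) ? r - m : r;
}

/* Left-to-right square-and-multiply exponentiation in the Montgomery domain. */
static uint64_t monexp(uint64_t base, uint64_t exp, uint64_t m, int k) {
    uint64_t r2 = 1;                                  /* 2^(2k) mod m by doubling */
    for (int i = 0; i < 2 * k; i++) { r2 <<= 1; if (r2 >= m) r2 -= m; }
    uint64_t xbar = monmul(base, r2, m, k);           /* base in Montgomery form */
    uint64_t abar = monmul(1, r2, m, k);              /* 1 in Montgomery form */
    for (int i = k - 1; i >= 0; i--) {
        abar = monmul(abar, abar, m, k);
        if (exp & (1ULL << i)) abar = monmul(abar, xbar, m, k);
    }
    return monmul(abar, 1, m, k);                     /* leave the Montgomery domain */
}

int main(void) {
    /* 7^10 mod 13 = 4 */
    printf("%llu\n", (unsigned long long)monexp(7, 10, 13, 31));
    return 0;
}
```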
15. High-Performance Designs for Linear Algebra Operations on Reconfigurable Hardware.
- Author
-
Ling Zhuo and Prasanna, Viktor K.
- Subjects
- *
BROADBAND communication systems , *FIELD programmable gate arrays , *MATHEMATICAL analysis , *MATRICES (Mathematics) , *DIGITAL communications , *DATA transmission systems , *COMBINATORICS , *COMPUTER programming - Abstract
Numerical linear algebra operations are key primitives in scientific computing. Performance optimizations of such operations have been extensively investigated. With the rapid advances in technology, hardware acceleration of linear algebra applications using field-programmable gate arrays (FPGAs) has become feasible. In this paper, we propose FPGA-based designs for several basic linear algebra operations, including dot product, matrix-vector multiplication, matrix multiplication, and matrix factorization. By identifying the parameters for each operation, we analyze the trade-offs and propose a high-performance design. In the implementations of the designs, the values of the parameters are determined according to the hardware constraints, such as the available chip area, the size of available memory, the memory bandwidth, and the number of I/O pins. The proposed designs are implemented on Xilinx Virtex-II Pro FPGAs. Experimental results show that our designs scale with the available hardware resources. Also, the performance of our designs compares favorably with that of general-purpose processor-based designs. We also show that, with faster floating-point units and larger devices, the performance of our designs increases accordingly. [ABSTRACT FROM AUTHOR]
- Published
- 2008
16. Reconfigurable Hardware SAT Solvers: A Survey of Systems.
- Author
-
Skliarova, Iouliia and Ferrari, António de Brito
- Subjects
- *
PROGRAMMABLE logic devices , *NETWORK processors , *ELECTRONIC equipment , *ALGORITHMS , *COMBINATORIAL optimization , *COMPUTER programming - Abstract
By adapting to computations that are not so well-supported by general-purpose processors, reconfigurable systems achieve significant increases in performance. Such computational systems use high-capacity programmable logic devices and are based on processing units customized to the requirements of a particular application. A great deal of the research effort in this area is aimed at accelerating the solution of combinatorial optimization problems. Special attention in this context was given to the Boolean satisfiability (SAT) problem, resulting in a considerable number of different architectures being proposed. This paper presents the state of the art in reconfigurable hardware SAT solvers. The analysis and classification of existing systems have been performed according to such criteria as algorithmic issues, reconfiguration modes, the execution model, the programming model, logic capacity, and performance. [ABSTRACT FROM AUTHOR]
- Published
- 2004
17. Efficient Mapping of Task Graphs onto Reconfigurable Hardware Using Architectural Variants
- Author
-
Huang, Miaoqing, Narayana, Vikram K., Bakhouya, Mohamed, Gaber, Jaafar, and El-Ghazawi, Tarek
- Published
- 2012
18. A Dynamically Reconfigurable System for Closed-Loop Measurements of Network Traffic
- Author
-
Khan, Faisal, Ghiasi, Soheil, and Chuah, Chen-Nee
- Subjects
Distributed Computing and Systems Software ,Information and Computing Sciences ,Engineering ,Reconfigurable hardware ,network monitoring ,parallel circuits ,Computer Software ,Distributed Computing ,Computer Hardware ,Computer Hardware & Architecture ,Electronics ,sensors and digital hardware ,Distributed computing and systems software - Abstract
Streaming network traffic measurement and analysis is critical for detecting and preventing any real-time anomalies in the network. The high speeds and complexity of today's networks, coupled with ever-evolving threats, necessitate closing of the loop between measurements and their analysis in real time. The ensuing system demands high levels of programmability and processing where streaming measurements adapt to the changing network behavior in a goal-oriented manner. In this work, we exploit the features and requirements of the problem and develop an application-specific FPGA-based closed-loop measurement (CLM) system. We make novel use of fine-grained partial dynamic reconfiguration (PDR) as the underlying reprogramming paradigm, performing low-latency just-in-time compiled logic changes in the FPGA fabric corresponding to the dynamic measurement requirements. Our innovative dynamically reconfigurable socket offers 3× logic savings over conventional static solutions, while offering much reduced reconfiguration latencies over conventional PDR mechanisms. We integrate multiple sockets in a highly parallel CLM framework and demonstrate its effectiveness in identifying heavy flows in streaming network traffic. The results using an FPGA prototype offer 100 percent detection accuracy while sustaining increasing link speeds.
- Published
- 2014
19. Lattice-Based Signatures: Optimization and Implementation on Reconfigurable Hardware
- Author
-
Vadim Lyubashevsky, Tim Güneysu, and Thomas Pöppelmann
- Subjects
Theoretical computer science ,business.industry ,Lattice problem ,Cryptography ,02 engineering and technology ,Parallel computing ,Reconfigurable computing ,020202 computer hardware & architecture ,Theoretical Computer Science ,Public-key cryptography ,Computational Theory and Mathematics ,Hardware and Architecture ,Discrete logarithm ,Scalability ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Lattice-based cryptography ,business ,Software ,Quantum computer ,Mathematics - Abstract
Nearly all of the currently used signature schemes, such as RSA or DSA, are based either on the factoring assumption or the presumed intractability of the discrete logarithm problem. As a consequence, the appearance of quantum computers or algorithmic advances on these problems may lead to the unpleasant situation that a large number of today’s schemes will most likely need to be replaced with more secure alternatives. In this work we present such an alternative—an efficient signature scheme whose security is derived from the hardness of lattice problems. It is based on recent theoretical advances in lattice-based cryptography and is highly optimized for practicability and use in embedded systems. The public and secret keys are roughly 1.5 kB and 0.3 kB long, while the signature size is approximately 1.1 kB for a security level of around 80 bits. We provide implementation results on reconfigurable hardware (Spartan/Virtex-6) and demonstrate that the scheme is scalable, has low area consumption, and even outperforms classical schemes.
20. Exploiting Hardware-Based Data-Parallel and Multithreading Models for Smart Edge Computing in Reconfigurable FPGAs.
- Author
-
Rodriguez, Alfonso, Otero, Andres, Platzner, Marco, and de la Torre, Eduardo
- Subjects
EDGE computing ,ADAPTIVE computing systems ,FIELD programmable gate arrays ,COMPUTER systems ,COMPUTING platforms ,ONLINE exhibitions - Abstract
Current edge computing systems are deployed in highly complex application scenarios with dynamically changing requirements. In order to provide the expected performance and energy efficiency values in these situations, the use of heterogeneous hardware/software platforms at the edge has become widespread. However, these computing platforms still suffer from the lack of unified software-driven programming models to efficiently deploy multi-purpose hardware-accelerated solutions. In parallel, edge computing systems also face another huge challenge: operating under multiple conditions that were not taken into account during any of the design stages. Moreover, these conditions may change over time, forcing self-adaptation mechanisms to become a must. This paper presents an integrated architecture to exploit hardware-accelerated data-parallel models and transparent hardware/software multithreading. In particular, the proposed architecture leverages the ARTICo3 framework and ReconOS to allow developers to select the most suitable programming model to deploy their edge computing applications onto run-time reconfigurable hardware devices. An evolvable hardware system is used as an additional architectural component during validation, providing support for continuous lifelong learning in smart edge computing scenarios. In particular, the proposed setup exhibits online learning capabilities that include learning by imitation from software-based reference algorithms. Experimental results show the benefits of the proposed approach, exposing different run-time tradeoffs (e.g., computing performance versus functional correctness of the evolved solutions), and highlighting the benefits of using scalable data-parallel models to perform circuit evolution under dynamically changing application scenarios. [ABSTRACT FROM AUTHOR]
- Published
- 2022
21. An Extensive Study of Flexible Design Methods for the Number Theoretic Transform.
- Author
-
Mert, Ahmet Can, Karabulut, Emre, Ozturk, Erdinc, Savas, Erkay, and Aysu, Aydin
- Subjects
POLYNOMIAL rings ,EXPERIMENTAL design ,DIGITAL signatures ,DESIGN software ,CRYPTOGRAPHY ,COMPUTATIONAL complexity ,SOFTWARE architecture ,HOMOMORPHISMS ,ADAPTIVE computing systems - Abstract
Efficient lattice-based cryptosystems operate with polynomial rings with the Number Theoretic Transform (NTT) to reduce the computational complexity of polynomial multiplication. NTT has therefore become a major arithmetic component (thus computational bottleneck) in various cryptographic constructions like hash functions, key-encapsulation mechanisms, digital signatures, and homomorphic encryption. Although there exist several hardware designs in prior work for NTT, they all are isolated design instances fixed for specific NTT parameters or parallelization level. This article provides an extensive study of flexible design methods for NTT implementation. To that end, we evaluate three cases: (1) parametric hardware design, (2) high-level synthesis (HLS) design approach, and (3) design for software implementation compiled on soft-core processors, where all are targeted on reconfigurable hardware devices. We evaluate the designs that implement multiple NTT parameters and/or processing elements, demonstrate the design details for each case, and provide a fair comparison with each other and prior work. On a Xilinx Virtex-7 FPGA, compared to HLS and processor-based methods, the results show that the parametric hardware design is on average 4.4× and 73.9× smaller and 22.5× and 19.3× faster, respectively. Surprisingly, HLS tools can yield less efficient solutions than processor-based approaches in some cases. [ABSTRACT FROM AUTHOR]
- Published
- 2022
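Entry 21 studies hardware design methods for the Number Theoretic Transform. To make the arithmetic being accelerated concrete, the following C sketch is an iterative in-place NTT over a toy prime; the parameters (q = 257, n = 8, root 4) are illustrative assumptions, and the butterfly in the inner loop is what hardware designs replicate as parallel processing elements.

```c
#include <stdint.h>
#include <stdio.h>

#define N 8
#define Q 257u   /* small NTT-friendly prime: N divides (Q - 1) */
#define W 4u     /* primitive N-th root of unity mod Q (4^4 ≡ -1, 4^8 ≡ 1) */

static uint32_t mulmod(uint32_t a, uint32_t b) { return (uint32_t)((uint64_t)a * b % Q); }

static uint32_t powmod(uint32_t b, uint32_t e) {
    uint32_t r = 1;
    while (e) { if (e & 1) r = mulmod(r, b); b = mulmod(b, b); e >>= 1; }
    return r;
}

/* In-place iterative Cooley-Tukey NTT.  Each inner-loop body is one butterfly;
 * hardware designs instantiate several of these as processing elements. */
static void ntt(uint32_t a[N]) {
    for (uint32_t i = 1, j = 0; i < N; i++) {         /* bit-reversal permutation */
        uint32_t bit = N >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j |= bit;
        if (i < j) { uint32_t t = a[i]; a[i] = a[j]; a[j] = t; }
    }
    for (uint32_t len = 2; len <= N; len <<= 1) {
        uint32_t wlen = powmod(W, N / len);           /* twiddle root for this stage */
        for (uint32_t start = 0; start < N; start += len) {
            uint32_t w = 1;
            for (uint32_t k = 0; k < len / 2; k++) {
                uint32_t u = a[start + k];
                uint32_t v = mulmod(a[start + k + len / 2], w);
                a[start + k]           = (u + v) % Q;
                a[start + k + len / 2] = (u + Q - v) % Q;
                w = mulmod(w, wlen);
            }
        }
    }
}

int main(void) {
    uint32_t a[N] = {1, 2, 3, 4, 0, 0, 0, 0};         /* toy polynomial coefficients */
    ntt(a);
    for (int i = 0; i < N; i++) printf("%u ", a[i]);
    printf("\n");
    return 0;
}
```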
22. Bypassing Multicore Memory Bugs With Coarse-Grained Reconfigurable Logic.
- Author
-
Lee, Doowon and Bertacco, Valeria
- Subjects
- *
CACHE memory , *FINITE state machines , *ARM microprocessors , *MEMORY , *SYSTEMS design , *LOGIC - Abstract
Multicore systems deploy sophisticated memory hierarchies to improve memory operations’ throughput and latency by exploiting multiple levels of cache hierarchy and several complex memory-access instructions. As a result, the functional verification of the memory subsystem is one of the most challenging tasks in the overall system design effort, leading to many bugs in the released product. In this work, we propose MemPatch, a novel reconfigurable hardware solution to bypass such escaped bugs. To design MemPatch, we first analyzed publicly available errata documents and classified memory-related bugs by root cause and symptoms. We then leveraged that learning to design a specialized, reconfigurable detection fabric, implementing finite state machines that can model the bug-triggering events at the microarchitectural level. Finally, we complemented this detection logic with hardware offering multiple bug-bypassing options. Our evaluation of MemPatch mapped a multicore RISC-V out-of-order processor, augmented with our logic, to a Xilinx ZCU102 FPGA board. When configured to detect up to 32 distinct bugs, MemPatch entails 7.6% area and 7.3% power overheads. An estimate on a commercial ARM Cortex-A57 processor target indicates that the area overhead would be much lower, 1.0%. The performance impact was found to be no more than 1% in all cases. [ABSTRACT FROM AUTHOR]
- Published
- 2022
23. Operating Systems for Reconfigurable Embedded Platforms: Online Scheduling of Real-Time Tasks.
- Author
-
Steiger, Christoph, Walder, Herbert, and Platzner, Marco
- Subjects
COMPUTER operating systems ,REAL-time computing ,FIELD programmable gate arrays ,GATE array circuits ,PROGRAMMABLE logic devices ,ALGORITHMS - Abstract
Today's reconfigurable hardware devices have huge densities and are partially reconfigurable, allowing for the configuration and execution of hardware tasks in a true multitasking manner. This makes reconfigurable platforms an ideal target for many modern embedded systems that combine high computation demands with dynamic task sets. A rather new line of research is engaged in the construction of operating systems for reconfigurable embedded platforms. Such an operating system provides a minimal programming model and a runtime system. The runtime system performs online task and resource management. In this paper, we first discuss design issues for reconfigurable hardware operating systems. Then, we focus on a runtime system for guarantee-based scheduling of hard real-time tasks. We formulate the scheduling problem for the 1D and 2D resource models and present two heuristics, the horizon and the stuffing technique, to tackle it. Simulation experiments conducted with synthetic workloads evaluate the performance and the runtime efficiency of the proposed schedulers. The scheduling performance for the 1D resource model is strongly dependent on the aspect ratios of the tasks. Compared to the 1D model, the 2D resource model is clearly superior. Finally, the runtime overhead of the scheduling algorithms is shown to be acceptably low. [ABSTRACT FROM AUTHOR]
- Published
- 2004
24. LayeredTrees: Most Specific Prefix-Based Pipelined Design for On-Chip IP Address Lookups.
- Author
-
Chang, Yeim-Kuan, Kuo, Fang-Chen, Kuo, Han-Jhen, and Su, Cheng-Chien
- Subjects
INTERNET protocol address ,INTERNET protocols ,COMPUTER network resources ,COMPUTER storage devices ,WEB search engines ,ROUTING (Computer network management) - Abstract
Multibit trie-based pipelines for IP lookups have been demonstrated to be able to achieve the throughput of over 100 Gbps. However, it is hard to store the entire multibit trie into the on-chip memory of reconfigurable hardware devices. Thus, their performance is limited by the speed of off-chip memory. In this paper, we propose a new pipeline design called LayeredTrees that overcomes the shortcomings of the multibit trie-based pipelines. LayeredTrees pipelines the multi-layered multiway balanced prefix trees based on the concept of most specific prefixes. LayeredTrees is optimized to fit the entire routing table into the on-chip memory of reconfigurable hardware devices. No prefix duplication is needed and each W-bit prefix is encoded in a (W + 1)-bit format to save memory. Assume the minimal packet size is 40 bytes. Our experimental results on Virtex-6 XC6VSX315T FPGA chip show that the throughputs of 33.6 and 120.8 Gbps can be achieved by the proposed single search engine and multiple search engines running in parallel, respectively. Furthermore, the impact of update operations on the search performance is minimal. With the same FPGA device, an IPv6 routing table of 290,503 distinct entries can also be supported. [ABSTRACT FROM AUTHOR]
- Published
- 2014
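Entry 24 targets longest-prefix-match IP lookup. To make the underlying problem concrete, here is a plain unibit-trie lookup in C; it is a generic illustration of most-specific-prefix matching, not the LayeredTrees data structure, and the routing entries are made up.

```c
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

/* Generic unibit trie for longest-prefix match; illustrative only. */
typedef struct node {
    struct node *child[2];
    int next_hop;                 /* -1 if no prefix ends at this node */
} node;

static node *new_node(void) {
    node *n = calloc(1, sizeof(node));
    n->next_hop = -1;
    return n;
}

static void insert(node *root, uint32_t prefix, int len, int next_hop) {
    for (int i = 0; i < len; i++) {
        int bit = (prefix >> (31 - i)) & 1;          /* walk from the MSB */
        if (!root->child[bit]) root->child[bit] = new_node();
        root = root->child[bit];
    }
    root->next_hop = next_hop;
}

static int lookup(const node *root, uint32_t addr) {
    int best = -1;                                    /* most specific match so far */
    for (int i = 0; root; i++) {
        if (root->next_hop >= 0) best = root->next_hop;
        if (i == 32) break;                           /* walked all 32 address bits */
        root = root->child[(addr >> (31 - i)) & 1];
    }
    return best;
}

int main(void) {
    node *root = new_node();
    insert(root, 0x0A000000, 8, 1);            /* 10.0.0.0/8  -> port 1 */
    insert(root, 0x0A010000, 16, 2);           /* 10.1.0.0/16 -> port 2 */
    printf("%d\n", lookup(root, 0x0A010203));  /* 10.1.2.3 matches /16: prints 2 */
    return 0;
}
```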
25. The MOLEN Polymorphic Processor.
- Author
-
Vassiliadis, Stamatis, Wong, Stephan, Gaydadjiev, Georgi, Bertels, Koen, Kuzmanov, Georgi, and Panainte, Elena Moscu
- Subjects
HIGH performance processors ,COMPUTER programmers ,DECODERS (Electronics) ,COMPILERS (Computer programs) ,SYSTEMS software ,COMPUTER software - Abstract
In this paper, we present a polymorphic processor paradigm incorporating both general purpose and custom computing processing. The proposal incorporates an arbitrary number of programmable units, exposes the hardware to the programmers/designers, and allows them to modify and extend the processor functionality at will. To achieve the previously stated attributes, we present a new programming paradigm, a new instruction set architecture, a microcode-based microarchitecture, and a compiler methodology. The programming paradigm, in contrast with the conventional programming paradigms, allows general-purpose conventional code and hardware descriptions to coexist in a program. In our proposal, for a given instruction set architecture, a one-time instruction set extension of eight instructions is sufficient to implement the reconfigurable functionality of the processor. We propose a microarchitecture based on reconfigurable hardware emulation to allow high-speed reconfiguration and execution. To prove the viability of the proposal, we experimented with the MPEG-2 encoder and decoder and a Xilinx Virtex II Pro FPGA. We have implemented three operations, SAD, DCT, and IDCT. The overall attainable application speedup for the MPEG-2 encoder and decoder is between 2.64 and 3.18 and between 1.56 and 1.94, respectively, representing between 93 percent and 98 percent of the theoretically obtainable speedups. [ABSTRACT FROM AUTHOR]
- Published
- 2004
26. A Novel Fault Tolerant and Runtime Reconfigurable Platform for Satellite Payload Processing.
- Author
-
Sterpone, Luca, Porrmann, Mario, and Hagemeyer, Jens
- Subjects
FAULT-tolerant computing ,ADAPTIVE computing systems ,ROCKET payloads ,COMPUTER input-output equipment ,INFORMATION processing ,COMPUTER storage devices - Abstract
Reconfigurable hardware is attracting steadily growing interest in the domain of space applications. The ability to reconfigure the information processing infrastructure at runtime together with the high computational power of today's FPGA architectures at relatively low power makes these devices interesting candidates for data processing in space applications. Partial dynamic reconfiguration of FPGAs enables maximum flexibility and can be utilized for performance optimization, for improving energy efficiency, and for enhanced fault tolerance. To be able to prove the effectiveness of these novel approaches for satellite payload processing, a highly scalable prototyping environment has been developed, combining dynamically reconfigurable FPGAs with the required interfaces such as SpaceWire, MIL-STD-1553B, and SpaceFibre. The developed system has been enabled for harsh space environments through an analytical study of the radiation effects on its most critical reconfigurable components. To that end, a new algorithm for the analysis of critical radiation effects, in particular Single Event Upsets (SEUs) and Multiple Event Upsets (MEUs), has been developed to obtain an effective estimation of the radiation impact and to enable tuning of the component mapping, reducing the routing interaction between the placed reconfigurable modules in their different feasible positions. The experimental performance of the system has been evaluated by a proper dynamic reconfiguration scenario, demonstrating partial reconfiguration at 400 MByte/s; blind and readback scrubbing are supported, and the scrub rate can be adapted individually for different parts of the design. The fault tolerance capability has been proven by means of a new analysis algorithm and by fault injection campaigns of SEUs and MCUs into the FPGA configuration memory. [ABSTRACT FROM PUBLISHER]
- Published
- 2013
27. Memristor-Based Neural Logic Blocks for Nonlinearly Separable Functions.
- Author
-
Soltiz, Michael, Kudithipudi, Dhireesha, Merkel, Cory, Rose, Garrett S., and Pino, Robinson E.
- Subjects
MEMRISTORS ,ARTIFICIAL neural networks ,COMPUTER logic ,NONLINEAR theories ,COMPUTER input-output equipment ,OPTICAL character recognition - Abstract
Neural logic blocks (NLBs) enable the realization of biologically inspired reconfigurable hardware. Networks of NLBs can be trained to perform complex computations such as multilevel Boolean logic and optical character recognition (OCR) in an area- and energy-efficient manner. Recently, several groups have proposed perceptron-based NLB designs with thin-film memristor synapses. These designs are implemented using a static threshold activation function, limiting the set of learnable functions to be linearly separable. In this work, we propose two NLB designs, robust adaptive NLB (RANLB) and multithreshold NLB (MTNLB), which overcome this limitation by allowing the effective activation function to be adapted during the training process. Consequently, both designs enable any logic function to be implemented in a single-layer NLB network. The proposed NLBs are designed, simulated, and trained to implement ISCAS-85 benchmark circuits, as well as OCR. The MTNLB achieves 90 percent improvement in the energy delay product (EDP) over lookup table (LUT)-based implementations of the ISCAS-85 benchmarks and up to a 99 percent improvement over a previous NLB implementation. As a compromise, the RANLB provides a smaller EDP improvement, but has an average training time of only approximately 4 cycles for 4-input logic functions, compared to the MTNLB's approximately 8-cycle average training time. [ABSTRACT FROM PUBLISHER]
- Published
- 2013
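Entry 27 contrasts a static threshold activation (limited to linearly separable functions) with a multithreshold activation. The toy C sketch below shows the idea on XOR: a single threshold over the weighted sum cannot produce XOR, while a two-threshold window can. Weights and thresholds are illustrative assumptions, not taken from the memristor designs.

```c
#include <stdio.h>

/* Classic perceptron: fires when the weighted sum reaches a single threshold. */
static int single_threshold(int x0, int x1) {
    int sum = 1 * x0 + 1 * x1;
    return sum >= 1;                  /* this is OR; no single threshold gives XOR */
}

/* Multithreshold activation: fires when the sum falls inside a window. */
static int multi_threshold(int x0, int x1) {
    int sum = 1 * x0 + 1 * x1;
    return (sum >= 1) && (sum < 2);   /* fires only for sum == 1, i.e., XOR */
}

int main(void) {
    for (int x0 = 0; x0 < 2; x0++)
        for (int x1 = 0; x1 < 2; x1++)
            printf("%d %d -> single=%d multi=%d\n",
                   x0, x1, single_threshold(x0, x1), multi_threshold(x0, x1));
    return 0;
}
```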
28. Real-Time Management of Hardware and Software Tasks for FPGA-Based Embedded Systems.
- Author
-
Pellizzoni, Rodolfo and Caccamo, Marco
- Subjects
COMPUTER operating systems ,EMBEDDED computer systems ,COMPUTER hardware description languages ,RESOURCE allocation ,ONLINE algorithms ,INFORMATION networks ,SIMULATION methods & models ,ENGINEERING design ,HIGH technology research - Abstract
Operating systems for reconfigurable devices enable the development of embedded systems where software tasks, running on a CPU, can coexist with hardware tasks running on a reconfigurable hardware device (FPGA). In this work, we consider real-time systems that are subject to dynamic workloads and whose tasks can be computationally intensive. We introduce a novel resource allocation scheme and an online admission control test that achieve high performance and flexibility; in addition, runtime reconfiguration is used to maximize the number of admitted real-time tasks. In detail, we first discuss a 1D system architecture and its prototype for a Xilinx Virtex-4 FPGA; then, we concentrate on the online admission control problem. Online task allocation and migration between the CPU and the reconfigurable device are discussed and sufficient feasibility tests are derived for both the commonly used slotted and 1D area models. Finally, the effectiveness of our admission control and relocation strategy is shown through a series of synthetic simulations. [ABSTRACT FROM AUTHOR]
- Published
- 2007
29. Automatic Design of Area-Efficient Configurable ASIC Cores.
- Author
-
Compton, Katherine and Hauck, Scott
- Subjects
COMPUTER input-output equipment ,COMPUTER software ,STANDARD cells ,LOGIC design ,INTEGRATED circuits ,PROGRAM transformation ,ROUTING (Computer network management) ,HEURISTIC programming - Abstract
Reconfigurable hardware has been shown to provide an efficient compromise between the flexibility of software and the performance of hardware. However, even coarse-grained reconfigurable architectures target the general case and miss optimization opportunities present if characteristics of the desired application set are known. Restricting the structure to support a class or a specific set of algorithms can increase efficiency while still providing flexibility within that set. By generating a custom array for a given computation domain, we explore the design space between an ASIC and an FPGA. However, the manual creation of these customized reprogrammable architectures would be a labor-intensive process, leading to high design costs. Instead, we propose automatic reconfigurable architecture generation specialized to given application sets. This paper discusses configurable ASIC (cASIC) architecture generation that creates hardware on average up to 12.3x smaller than an FPGA solution with embedded multipliers and 2.2x smaller than a standard cell implementation of individual circuits. [ABSTRACT FROM AUTHOR]
- Published
- 2007
30. MorphoSys: An Integrated Reconfigurable System for Data-Parallel and Computation-Intensive Applications.
- Author
-
Singh, Hartej and Lee, Ming-Hau
- Subjects
COMPUTER systems - Abstract
Presents a study which introduced MorphoSys, a reconfigurable computing system developed to investigate the effectiveness of combining reconfigurable hardware with general-purpose processors for word-level, computation-intensive applications. Taxonomy for reconfigurable systems; Components, features and program flow of MorphoSys; Design of MorphoSys components; Mapping applications to MorphoSys.
- Published
- 2000
31. Some Conditional Cube Testers for Grain-128a of Reduced Rounds.
- Author
-
Dalai, Deepak Kumar, Pal, Santu, and Sarkar, Santanu
- Subjects
STREAM ciphers ,CUBES ,SHIFT registers ,HEURISTIC ,BOOLEAN functions - Abstract
In this article, a new strategy, maximum last α round, is proposed to select cubes for cube attacks. This strategy considers the cubes in a particular round where the probability of its superpoly to be 1 is at most α, where α is a very small number. A heuristic method to find a number of suitable cubes using this strategy and the previously used strategies (i.e., maximum initial zero, maximum last zero) is proposed. To get a bias at the higher rounds, the heuristic, too, imposes conditions on some state bits of the cipher to make the non-constant superpoly of a cube as zero for the first few rounds. Some cube testers are formed by using those suitable cubes to implement a distinguishing attack on Grain-128a of reduced KSA (or initialization) rounds. We present a distinguisher for Grain-128a of 191 (out of 256) KSA round in the single key setup and 201 (out of 256) KSA round in the weak key setup by using the cubes of dimension 5. The number of rounds is the highest till today, and the cube dimension is smaller than the previous results. Further, we tested our algorithm on Grain-128 and achieved good results by using small cubes. [ABSTRACT FROM AUTHOR]
- Published
- 2022
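Entry 31 relies on cube testers and superpolys. The following toy C example shows the basic cube-sum operation on a made-up Boolean function: XOR-summing the output over all assignments of the cube variables leaves the superpoly, here a single key bit. The function f and the cube are hypothetical and unrelated to Grain-128a.

```c
#include <stdio.h>

/* Hypothetical toy "cipher output bit": f = v0*v1*k0 ^ v0*v2 ^ k1 ^ v1,
 * where v are public (IV) bits and k are secret key bits. */
static int f(const int v[3], const int k[2]) {
    return (v[0] & v[1] & k[0]) ^ (v[0] & v[2]) ^ k[1] ^ v[1];
}

int main(void) {
    int k[2] = {1, 0};          /* fixed key bits (unknown to the attacker) */
    /* Cube {v0, v1}: XOR f over all 4 assignments of the cube, with v2 fixed to 0. */
    int sum = 0;
    for (int b = 0; b < 4; b++) {
        int v[3] = { b & 1, (b >> 1) & 1, 0 };
        sum ^= f(v, k);
    }
    /* For this f, the superpoly of cube {v0, v1} is k0, so the sum equals k[0]. */
    printf("cube sum = %d, k0 = %d\n", sum, k[0]);
    return 0;
}
```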
32. R3TOS: A Novel Reliable Reconfigurable Real-Time Operating System for Highly Adaptive, Efficient, and Dependable Computing on FPGAs.
- Author
-
Iturbe, Xabier, Benkrid, Khaled, Hong, Chuan, Ebrahim, Ali, Torrego, Raul, Martinez, Imanol, Arslan, Tughrul, and Perez, Jon
- Subjects
COMPUTER reliability ,ADAPTIVE computing systems ,REAL-time computing ,COMPUTER operating systems ,FIELD programmable gate arrays ,COMPUTER users - Abstract
Despite the clear potential of FPGAs to push the current power wall beyond what is possible with general-purpose processors, as well as to meet ever more exigent reliability requirements, the lack of standard tools and interfaces to develop reconfigurable applications limits FPGAs' user base and makes programming them unproductive. R3TOS is our contribution to tackle this problem. It provides systematic OS support for FPGAs, allowing the exploitation of some of the most advanced capabilities of FPGA technology by inexperienced users. What makes R3TOS special is its nonconventional way of exploiting on-chip resources: These are used indistinguishably for carrying out either computation or communication tasks at different times. Indeed, R3TOS does not rely on any static infrastructure apart from its own core circuitry, which is constrained to a specific region within the FPGA where it is implemented. Thus, the rest of the device is kept free of obstacles, with the spare resources ready to be used as and whenever needed. At runtime, the hardware tasks are scheduled and allocated with the dual objective of improving computation density and circumventing damaged resources on the FPGA. [ABSTRACT FROM PUBLISHER]
- Published
- 2013
33. An Automated Framework for Accelerating Numerical Algorithms on Reconfigurable Platforms Using Algorithmic/Architectural Optimization.
- Author
-
Jung Sub Kim, Lanping Deng, Prasanth Mangalagiri, Irick, Kevin, Kanwaldeep Sobti, Mahmut Kandemir, Vijaykrishnan Narayanan, Chaitali Chakrabarti, Pitsianis, Nikos, and Xiaobai Sun
- Subjects
- *
AUTOMATION , *ALGORITHMS , *PROGRAM transformation , *COMPUTER input-output equipment , *FIELD programmable gate arrays , *KERNEL functions , *SIGNAL processing - Abstract
This paper describes TANOR, an automated framework for designing hardware accelerators for numerical computation on reconfigurable platforms. Applications utilizing numerical algorithms on large-size data sets require high-throughput computation platforms. The focus is on N-body interaction problems which have a wide range of applications spanning from astrophysics to molecular dynamics. The TANOR design flow starts with a MATLAB description of a particular interaction function, its parameters, and certain architectural constraints specified through a graphical user interface. Subsequently, TANOR automatically generates a configuration bitstream for a target FPGA along with associated drivers and control software necessary to direct the application from a host PC. Architectural exploration is facilitated through support for fully custom fixed-point and floating-point representations in addition to standard number representations such as single-precision floating point. Moreover, TANOR enables joint exploration of algorithmic and architectural variations in realizing efficient hardware accelerators. TANOR's capabilities have been demonstrated for three different N-body interaction applications: the calculation of gravitational potential in astrophysics, the diffusion or convolution with Gaussian kernel common in image processing applications, and the force calculation with vector-valued kernel function in molecular dynamics simulation. Experimental results show that TANOR-generated hardware accelerators achieve lower resource utilization without compromising numerical accuracy, in comparison to other existing custom accelerators. [ABSTRACT FROM AUTHOR]
- Published
- 2009
34. Scalable Hybrid Designs for Linear Algebra on Reconfigurable Computing Systems.
- Author
-
Ling Zhuo and Prasanna, Viktor K.
- Subjects
- *
FIELD programmable gate arrays , *COMPUTER systems , *MATRICES (Mathematics) , *DIGITAL communications , *DATA transmission systems , *SYSTEMS design , *BROADBAND communication systems , *GATE array circuits , *PROGRAMMABLE logic devices , *COMPUTER networks - Abstract
Recently, high-end reconfigurable computing systems that employ Field-Programmable Gate Arrays (FPGAs) as hardware accelerators for general-purpose processors have been built. These systems provide new opportunities for high-performance computing. However, the coexistence of the processors and the FPGAs in them also poses new challenges to application developers. In this paper, we build a design model for hybrid designs, that is, designs that utilize both the processors and the FPGAs for computations. The model characterizes a reconfigurable computing system using various parameters, including the floating-point computing power of the processors and the FPGAs, the number of nodes, the size of multiple levels of memory, the memory bandwidth, and the network bandwidth. Based on the model, we propose a design methodology for hardware/software codesign. The methodology partitions workload between the processors and the FPGAs, maintains load balance in the system, and realizes scalability over multiple nodes. Designs are proposed for several computationally intensive applications: matrix multiplication, matrix factorization, and the Floyd-Warshall algorithm for the all-pairs shortest-paths problem. To illustrate our ideas, the proposed hybrid designs are implemented on a Cray XD1. Each node of XD1 contains AMD 2.2-GHz Opteron processors and a Xilinx Virtex-II Pro FPGA. Experimental results show that our designs utilize both the processors and the FPGAs efficiently and overlap most of the data transfer overheads and network communication costs with the computations. Our designs achieve up to 90 percent of the total performance of the nodes and 90 percent of the performance predicted by the design model. In addition, our designs scale over a large number of nodes. [ABSTRACT FROM AUTHOR]
- Published
- 2008
35. Fast Resource and Timing Aware Design Optimisation for High-Level Synthesis.
- Author
-
Perina, Andre B., Silitonga, Arthur, Becker, Jurgen, and Bonato, Vanderlei
- Subjects
- *
COMPILERS (Computer programs) , *GATE array circuits , *FIELD programmable gate arrays , *SPACE exploration - Abstract
Field-Programmable Gate Arrays (FPGAs) are often present in energy-efficient systems, although their non-trivial development flow is an obstacle to massive adoption. High-Level Synthesis (HLS) approaches attempt to mitigate the gap by targeting FPGAs from software languages; however, manual tuning is still essential to meet performance demands. We present a high-level design space exploration framework with timing and resource awareness that uses an estimator named Lina to evaluate each design point. Lina is a profiling-based approach that avoids the costly static analyses performed by HLS compilers, allowing a significantly faster exploration of optimisations. Estimations are improved by supporting a continuous range of operating frequencies and by considering resource usage for both floating-point and integer datapaths. For a given set of C kernels, the estimated solutions are among the best 1% for execution time and resource footprint. The exploration of each kernel using Lina was performed on average two orders of magnitude faster than using early HLS compiler reports, and four orders of magnitude faster than fully compiling each design point. By considering the design spaces traversed, our solutions reached 70% of the maximum speed-up achievable. This represents an average speed-up of 14-16× compared to the baseline designs with no optimisations enabled. [ABSTRACT FROM AUTHOR]
- Published
- 2021
36. OmpSs@FPGA Framework for High Performance FPGA Computing.
- Author
-
de Haro, Juan Miguel, Bosch, Jaume, Filgueras, Antonio, Vidal, Miquel, Jimenez-Gonzalez, Daniel, Alvarez, Carlos, Martorell, Xavier, Ayguade, Eduard, and Labarta, Jesus
- Subjects
- *
HIGH performance computing , *COMPILERS (Computer programs) , *FIELD programmable gate arrays - Abstract
This article presents the new features of the OmpSs@FPGA framework. OmpSs is a data-flow programming model that supports task nesting and dependencies to target asynchronous parallelism and heterogeneity. OmpSs@FPGA is the extension of the programming model addressed specifically to FPGAs. The OmpSs environment is built on top of the Mercurium source-to-source compiler and the Nanos++ runtime system. To address FPGA specifics, the Mercurium compiler implements several FPGA-related features such as local variable caching, wide memory accesses, or accelerator replication. In addition, part of the Nanos++ runtime has been ported to hardware. Driven by the compiler, this new hardware runtime adds new features to FPGA codes, such as task creation and dependence management, providing both performance increases and ease of programming. To demonstrate these new capabilities, different high performance benchmarks have been evaluated over different FPGA platforms using the OmpSs programming model. The results demonstrate that programs that use the OmpSs programming model achieve very competitive performance with low to moderate porting effort compared to other FPGA implementations. [ABSTRACT FROM AUTHOR]
- Published
- 2021
37. OPTWEB: A Lightweight Fully Connected Inter-FPGA Network for Efficient Collectives.
- Author
-
Mizutani, Kenji, Yamaguchi, Hiroshi, Urino, Yutaka, and Koibuchi, Michihiro
- Subjects
DISTRIBUTED computing ,FIELD programmable gate arrays ,COMPUTING platforms - Abstract
Modern FPGA accelerators can be equipped with many high-bandwidth network I/Os, e.g., 64 x 50 Gbps, enabled by onboard optics or co-packaged optics. Some dozens of tightly coupled FPGA accelerators form an emerging computing platform for distributed data processing. However, a conventional indirect packet network using Ethernet's Intellectual Properties imposes an unacceptably large amount of the logic for handling such high-bandwidth interconnects on an FPGA. Besides the indirect network, another approach builds a direct packet network. Existing direct inter-FPGA networks have a low-radix network topology, e.g., 2-D torus. However, the low-radix network has the disadvantage of a large diameter and large average shortest path length that increases the latency of collectives. To mitigate both problems, we propose a lightweight, fully connected inter-FPGA network called OPTWEB for efficient collectives. Since all end-to-end separate communication paths are statically established using onboard optics, raw block data can be transferred with simple link-level synchronization. Once each source FPGA assigns a communication stream to a path by its internal switch logic between memory-mapped and stream interfaces for remote direct memory access (RDMA), a one-hop transfer is provided. Since each FPGA performs input/output of the remote memory access between all FPGAs simultaneously, multiple RDMAs efficiently form collectives. The OPTWEB network provides 0.71-μsec start-up latency of collectives among multiple Intel Stratix 10 MX FPGA cards with onboard optics. The OPTWEB network consumes 31.4 and 57.7 percent of adaptive logic modules for aggregate 400-Gbps and 800-Gbps interconnects on a custom Stratix 10 MX 2100 FPGA, respectively. The OPTWEB network reduces by 40 percent the cost compared to a conventional packet network. [ABSTRACT FROM AUTHOR]
- Published
- 2021
38. Guest Editors' introduction: Special section on adaptive hardware and systems.
- Author
-
Benkrid, Khaled, Keymeulen, Didier, Patel, Umeshkumar D., and Merodio-Codinachs, David
- Subjects
PERIODICAL editors ,COMPUTER input-output equipment ,COMPUTER systems ,ADAPTIVE computing systems ,COMPUTER periodicals - Abstract
This special section of IEEE Transactions on Computers presents some of the latest research developments in the field of adaptive hardware and systems. The creation of this section was motivated by lively discussions held at the annual NASA/ESA Adaptive Hardware and Systems (AHS) conference, which showed a need for such a special section in a top-ranked journal. At the end of a rigorous review process, ten papers were selected for publication from a set of high-quality submissions consisting of regular papers and extended papers from the AHS 2012 conference proceedings. The articles are then briefly described. [ABSTRACT FROM PUBLISHER]
- Published
- 2013
39. LayeredTrees: Most Specific Prefix-Based Pipelined Design for On-Chip IP Address Lookups
- Author
-
Fang-Chen Kuo, Yeim-Kuan Chang, Cheng-Chien Su, and Han-Jhen Kuo
- Subjects
Computer science ,Pipeline (computing) ,Routing table ,Byte ,Parallel computing ,Reconfigurable computing ,Theoretical Computer Science ,Prefix ,Computational Theory and Mathematics ,Hardware and Architecture ,Trie ,Hardware_ARITHMETICANDLOGICSTRUCTURES ,Throughput (business) ,Software - Abstract
Multibit trie-based pipelines for IP lookups have been demonstrated to achieve throughputs of over 100 Gbps. However, it is hard to store the entire multibit trie in the on-chip memory of reconfigurable hardware devices, so their performance is limited by the speed of off-chip memory. In this paper, we propose a new pipeline design called LayeredTrees that overcomes the shortcomings of the multibit trie-based pipelines. LayeredTrees pipelines the multi-layered multiway balanced prefix trees based on the concept of most specific prefixes. LayeredTrees is optimized to fit the entire routing table into the on-chip memory of reconfigurable hardware devices. No prefix duplication is needed, and each W-bit prefix is encoded in a (W+1)-bit format to save memory. Assuming a minimum packet size of 40 bytes, our experimental results on a Virtex-6 XC6VSX315T FPGA chip show that throughputs of 33.6 and 120.8 Gbps can be achieved by the proposed single search engine and by multiple search engines running in parallel, respectively. Furthermore, the impact of update operations on search performance is minimal. With the same FPGA device, an IPv6 routing table of 290,503 distinct entries can also be supported.
- Published
- 2014
- Full Text
- View/download PDF
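Two details of the abstract above lend themselves to a quick illustration: the (W+1)-bit prefix encoding and the throughput arithmetic behind the 40-byte minimum packet size. The Python sketch below uses one common (W+1)-bit encoding, in which a terminating '1' is appended to the prefix bits and the result is zero-padded, so the prefix length never has to be stored separately; the paper's exact format may differ, and the function name encode_prefix is ours.

W = 32  # IPv4 prefix width

def encode_prefix(bits: str) -> int:
    # Append a terminating '1' to the prefix, then zero-pad to W+1 bits.
    assert len(bits) <= W and set(bits) <= {"0", "1"}
    return int((bits + "1").ljust(W + 1, "0"), 2)

# '10*' and '100*' map to distinct codewords even though they share bits:
print(bin(encode_prefix("10")))    # 0b101 followed by 30 zeros
print(bin(encode_prefix("100")))   # 0b1001 followed by 29 zeros

# Throughput arithmetic from the abstract: with 40-byte minimum packets,
# 120.8 Gbps corresponds to about 377.5 million lookups per second.
print(120.8e9 / (40 * 8))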
40. A Power- and Performance-Aware Software Framework for Control System Applications.
- Author
-
Giardino, Michael, Klawitter, Eric, Ferri, Bonnie, and Ferri, Aldo
- Subjects
SOFTWARE frameworks ,COMPUTING platforms ,SITUATIONAL awareness ,MOBILE robots ,MOBILE operating systems ,CYBER physical systems ,HIGH performance computing - Abstract
This article describes the development of a software architectural framework for implementing compute-aware control systems, where the term “compute-aware” describes controllers that can modify existing low-level computing platform power managers in response to the needs of the physical system controller. This level of interaction means that high-level decisions can be made as to when to operate the computing platform in a power-saving mode or a high-performance mode, based on situation awareness of the physical system. The framework is demonstrated experimentally on a mobile robot platform. In this example, a situation-aware governor is developed that adjusts the speed of the processor based on the physical performance of the robot as it traverses a path through obstacles. The results show that the situation-aware governor achieves overall power savings of up to 38.9 percent with only 1.3 percent performance degradation compared to the static high-power strategy. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
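The abstract above does not spell out the governor's decision rule, so the following Python fragment is only a hypothetical sketch of a situation-aware policy: the computing platform is switched to a high-performance mode when the physical task becomes demanding (large tracking error or a nearby obstacle) and otherwise dropped to a power-saving mode. All names and thresholds are illustrative assumptions, not the paper's implementation.

def choose_frequency(cross_track_error, obstacle_distance,
                     err_threshold=0.05, dist_threshold=0.5):
    # Hypothetical policy: run fast only when the physical task is demanding,
    # i.e., when tracking error is large or an obstacle is close; otherwise
    # save power. Thresholds are placeholders, not values from the paper.
    if cross_track_error > err_threshold or obstacle_distance < dist_threshold:
        return "performance"    # request the platform's high-speed mode
    return "powersave"          # request the platform's power-saving mode

# Example control-loop steps with made-up sensor readings:
print(choose_frequency(cross_track_error=0.02, obstacle_distance=2.0))  # powersave
print(choose_frequency(cross_track_error=0.12, obstacle_distance=2.0))  # performance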
41. Pipelined Hardware Implementation of COPA, ELmD, and COLM.
- Author
-
Bossuet, Lilian, Mancillas-Lopez, Cuauhtemoc, and Ovilla-Martinez, Brisbane
- Subjects
DATA integrity ,HARDWARE - Abstract
Authenticated encryption algorithms offer privacy, authentication, and data integrity. In recent years, they have received special attention after the call for submissions of the Competition for Authenticated Encryption: Security, Applicability, and Robustness (CAESAR) was published. The CAESAR goal is to generate a portfolio of recommended authenticated encryption algorithms for three different scenarios: lightweight, high-speed, and defense in depth. ELmD and COPA are two on-line authenticated encryption algorithms submitted to CAESAR; because of their similarities, they were merged as COLM during the third round of CAESAR. COLM is a finalist in use case 3, defense in depth. ELmD, COPA, and COLM are based on the ECB-mix-ECB structure, which is highly parallelizable and pipelineable. In this paper, we present optimized single-chip implementations of ELmD, COPA, and COLM using pipelining. For ELmD, we present implementations for eight combinations of its parameter set: with or without intermediate tags, fixed or variable tag length, and 10 or 6 AES rounds. The COLM implementation is for variable tag length without intermediate tags. COPA has no parameter set. The implementation results with a Xilinx Virtex-6 FPGA show that ELmD is the best option concerning area and speed for a single-chip implementation. The areas of COPA and COLM are 1.65 and 1.69 times that of ELmD, respectively. Regarding throughput, the range of our implementations goes from 33.34 Gbits/s for COLM to more than 35 Gbits/s for several versions of ELmD. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
42. Efficient Software Implementation of Ring-LWE Encryption on IoT Processors.
- Author
-
Liu, Zhe, Azarderakhsh, Reza, Kim, Howon, and Seo, Hwajeong
- Subjects
GAUSSIAN distribution ,INTERNET of things ,MATHEMATICAL optimization ,NEON ,COMPUTER software - Abstract
Embedded processors have been widely used for building up Internet of Things (IoT) platforms, in which the security issue is becoming critical. This paper studies efficient techniques of lattice-based cryptography on these processors and presents the first implementation of ring-LWE encryption on ARM NEON and MSP430 architectures. For the ARM NEON architecture, we propose a vectorized version of the Iterative Number Theoretic Transform (NTT) for high-speed computation of polynomial multiplication on ARM NEON platforms and a 32-bit variant of the SAMS2 technique for fast reduction. For the MSP430 architecture, we propose an optimized SWAMS2 reduction technique, which consists of five basic operations: shifting, swapping, addition, and two multiplication-subtractions. Regarding sampling from the discrete Gaussian distribution, we adopt the Knuth-Yao sampler, accompanied by optimized methods such as a Look-Up Table (LUT) and byte-scanning. Subsequently, a full-fledged implementation of ring-LWE is presented, taking advantage of both our proposed methods and previous optimization techniques redesigned for the target platforms. Our ring-LWE implementation of encryption/decryption at a classical security level of 128 bits requires only 149.4k/32.8k clock cycles on ARM NEON and 2126.3k/244.5k clock cycles on MSP430. These results are roughly 7 times faster than the fastest ECC implementation on the same platforms at the same security level. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
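The Number Theoretic Transform mentioned in the abstract above turns polynomial multiplication into cheap pointwise products. The Python toy below shows the idea on deliberately tiny parameters (Q = 17, N = 8) and checks the result against schoolbook cyclic convolution; ring-LWE actually requires the negacyclic variant (multiplication modulo x^n + 1) with much larger parameters, so this is a structural sketch only, not the paper's optimized NTT.

Q, N = 17, 8        # toy parameters: Q prime, N divides Q - 1
ROOT = 9            # 9 has multiplicative order 8 modulo 17

def ntt(a, root):
    # Recursive radix-2 number-theoretic transform over Z_Q.
    n = len(a)
    if n == 1:
        return list(a)
    even = ntt(a[0::2], root * root % Q)
    odd = ntt(a[1::2], root * root % Q)
    out, w = [0] * n, 1
    for k in range(n // 2):
        t = w * odd[k] % Q
        out[k] = (even[k] + t) % Q
        out[k + n // 2] = (even[k] - t) % Q
        w = w * root % Q
    return out

def cyclic_mul(a, b):
    # Multiply polynomials mod (x^N - 1, Q) via pointwise NTT products.
    fa, fb = ntt(a, ROOT), ntt(b, ROOT)
    fc = [x * y % Q for x, y in zip(fa, fb)]
    inv_root, inv_n = pow(ROOT, Q - 2, Q), pow(N, Q - 2, Q)
    return [c * inv_n % Q for c in ntt(fc, inv_root)]

# Check against schoolbook cyclic convolution.
a = [1, 2, 3, 4, 0, 0, 0, 0]
b = [5, 6, 7, 0, 0, 0, 0, 0]
ref = [0] * N
for i in range(N):
    for j in range(N):
        ref[(i + j) % N] = (ref[(i + j) % N] + a[i] * b[j]) % Q
assert cyclic_mul(a, b) == ref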
43. Neuromorphic System for Spatial and Temporal Information Processing.
- Author
-
Zyarah, Abdullah M., Gomez, Kevin, and Kudithipudi, Dhireesha
- Subjects
SPATIAL systems ,INFORMATION processing ,UBIQUITOUS computing ,SPATIOTEMPORAL processes ,FAULT-tolerant computing - Abstract
Neuromorphic systems that learn and predict from streaming inputs hold significant promise in pervasive edge computing and its applications. In this article, a neuromorphic system that processes spatio-temporal information on the edge is proposed. Algorithmically, the system is based on hierarchical temporal memory that inherently offers online learning, resiliency, and fault tolerance. Architecturally, it is a full custom mixed-signal design with an underlying digital communication scheme and analog computational modules. Therefore, the proposed system features reconfigurability, real-time processing, low power consumption, and low-latency processing. The proposed architecture is benchmarked to predict on real-world streaming data. The network's mean absolute percentage error on the mixed-signal system is 1.129x lower compared to its baseline algorithm model. This reduction can be attributed to device non-idealities and probabilistic formation of synaptic connections. We demonstrate that the combined effect of Hebbian learning and network sparsity also plays a major role in extending the overall network lifespan. We also illustrate that the system offers a 3.46x reduction in latency and a 77.02x reduction in power consumption when compared to a custom CMOS digital design implemented at the same technology node. By employing specific low-power techniques, such as clock gating, we observe a 161.37x reduction in power consumption. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
44. Machine Learning Computers With Fractal von Neumann Architecture.
- Author
-
Zhao, Yongwei, Fan, Zhe, Du, Zidong, Zhi, Tian, Li, Ling, Guo, Qi, Liu, Shaoli, Xu, Zhiwei, Chen, Tianshi, and Chen, Yunji
- Subjects
MACHINE learning ,COMPUTER architecture ,COMPUTERS ,FRACTALS ,GRAPHICS processing units ,ARCHITECTURAL design ,SERVER farms (Computer network management) - Abstract
Machine learning techniques are pervasive tools for emerging commercial applications, and many dedicated machine learning computers on different scales have been deployed in embedded devices, servers, and data centers. Currently, most machine learning computer architectures still focus on optimizing performance and energy efficiency instead of programming productivity. However, with the fast development of silicon technology, programming productivity, including programming itself and software stack development, becomes the vital factor, rather than performance and power efficiency, that hinders the application of machine learning computers. In this article, we propose Cambricon-F, a series of homogeneous, sequential, multi-layer, layer-similar machine learning computers sharing the same ISA. A Cambricon-F machine has a fractal von Neumann architecture to iteratively manage its components: it has a von Neumann architecture, and its processing components (sub-nodes) are themselves Cambricon-F machines with a von Neumann architecture and the same ISA. Since different Cambricon-F instances with different scales can share the same software stack on their common ISA, Cambricon-Fs can significantly improve programming productivity. Moreover, we address four major challenges in the Cambricon-F architecture design, which allow Cambricon-F to achieve high efficiency. We implement two Cambricon-F instances at different scales, i.e., Cambricon-F100 and Cambricon-F1. Compared to GPU-based machines (DGX-1 and 1080Ti), Cambricon-F instances achieve 2.82x and 5.14x better performance and 8.37x and 11.39x better efficiency on average, with 74.5 and 93.8 percent smaller area costs, respectively. We further propose Cambricon-FR, which enhances the Cambricon-F machine learning computers to flexibly and efficiently support all the fractal operations with a reconfigurable fractal instruction set architecture. Compared to the Cambricon-F instances, Cambricon-FR machines achieve 1.96x and 2.49x better performance on average. Most importantly, Cambricon-FR computers reduce code length by a factor of 5.83, thus significantly improving programming productivity. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
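As a purely conceptual illustration of the fractal idea described above (not the Cambricon-F ISA or hardware), the Python toy below builds a machine whose sub-nodes are machines of the same kind: a node either executes an operation directly or splits the data among identical children and combines their partial results with the same operation, so the same "program" runs unchanged on instances of different scale. The class FractalNode and its parameters are our own illustrative assumptions.

class FractalNode:
    # Toy fractal machine: a leaf computes a reduction itself; an inner node
    # splits the work among identical sub-nodes that accept the same op,
    # mirroring the layer-similar structure described in the abstract.
    def __init__(self, depth, fanout=4):
        self.children = ([] if depth == 0
                         else [FractalNode(depth - 1, fanout) for _ in range(fanout)])

    def execute(self, op, data):
        if not self.children:                       # leaf: run the op directly
            return op(data)
        chunk = (len(data) + len(self.children) - 1) // len(self.children)
        partial = [c.execute(op, data[i * chunk:(i + 1) * chunk])
                   for i, c in enumerate(self.children)]
        return op(partial)                          # the same op combines partials

# The same 'program' (op = sum) runs unchanged on machines of different scale:
small, large = FractalNode(depth=1), FractalNode(depth=3)
data = list(range(1, 101))
assert small.execute(sum, data) == large.execute(sum, data) == 5050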
45. Graph Similarity and its Applications to Hardware Security.
- Author
-
Fyrbiak, Marc, Wallat, Sebastian, Reinhard, Sascha, Bissantz, Nicolai, and Paar, Christof
- Subjects
INTELLECTUAL property infringement ,GRAPH algorithms ,REVERSE engineering ,HARDWARE - Abstract
Hardware reverse engineering is a powerful and universal tool for both security engineers and adversaries. From a defensive perspective, it allows for the detection of intellectual property infringements and hardware Trojans, while it can simultaneously be used for product piracy and malicious circuit manipulation. From a designer's perspective, it is crucial to have an estimate of the costs associated with reverse engineering, yet little is known about them, especially when dealing with obfuscated hardware. The contribution at hand provides new insights into this problem, based on algorithms with sound mathematical underpinnings. Our contributions are threefold: First, we present the graph similarity problem for automating hardware reverse engineering. To this end, we improve several state-of-the-art graph similarity heuristics with optimizations tailored to the hardware context. Second, we propose a novel algorithm based on multiresolutional spectral analysis of adjacency matrices. Third, in three extensively evaluated case studies, namely (1) gate-level netlist reverse engineering, (2) hardware Trojan detection, and (3) assessment of hardware obfuscation, we demonstrate the practical nature of graph similarity algorithms. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
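The abstract above cites spectral analysis of adjacency matrices as one ingredient of graph similarity. The Python sketch below implements only a generic spectral-similarity baseline, comparing sorted eigenvalue spectra with an L2 distance; the paper's multiresolutional algorithm and its netlist-specific optimizations are not reproduced here, and the function names are ours.

import numpy as np

def spectrum(adj):
    # Sorted eigenvalues of a symmetric adjacency matrix.
    return np.sort(np.linalg.eigvalsh(adj))

def spectral_distance(adj_a, adj_b):
    # Pad the shorter spectrum with zeros and compare with an L2 norm;
    # 0 means spectrally indistinguishable, larger means less similar.
    sa, sb = spectrum(adj_a), spectrum(adj_b)
    n = max(len(sa), len(sb))
    sa = np.pad(sa, (n - len(sa), 0))
    sb = np.pad(sb, (n - len(sb), 0))
    return float(np.linalg.norm(sa - sb))

# Two tiny stand-in 'netlist' graphs: a 4-cycle versus a 4-node path.
cycle = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
path = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
print(spectral_distance(cycle, cycle))  # 0.0
print(spectral_distance(cycle, path))   # > 0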
46. Exploiting Hardware-Based Data-Parallel and Multithreading Models for Smart Edge Computing in Reconfigurable FPGAs
- Author
-
Eduardo de la Torre, Alfonso Rodriguez, Marco Platzner, and Andres Otero
- Subjects
business.industry ,Data parallelism ,Computer science ,Reconfigurable computing ,Theoretical Computer Science ,Software ,Computational Theory and Mathematics ,Hardware and Architecture ,Multithreading ,Programming paradigm ,Enhanced Data Rates for GSM Evolution ,business ,Evolvable hardware ,Computer hardware ,Edge computing - Abstract
Current edge computing systems are deployed in highly complex application scenarios with dynamically changing requirements. In order to provide the expected performance and energy efficiency in these situations, the use of heterogeneous hardware/software platforms at the edge has become widespread. However, these computing platforms still suffer from the lack of unified software-driven programming models to efficiently deploy multi-purpose hardware-accelerated solutions. In parallel, edge computing systems also face another huge challenge: operating under multiple conditions that were not taken into account during any of the design stages. Moreover, these conditions may change over time, making self-adaptation mechanisms a must. This paper presents an integrated architecture to exploit hardware-accelerated data-parallel models and transparent hardware/software multithreading. In particular, the proposed architecture leverages the ARTICo framework and ReconOS to allow developers to select the most suitable programming model to deploy their edge computing applications onto run-time reconfigurable hardware devices. An evolvable hardware system is used as an additional architectural component during validation, providing support for continuous lifelong learning in smart edge computing scenarios. The proposed setup exhibits online learning capabilities that include learning by imitation from software-based reference algorithms.
- Published
- 2022
- Full Text
- View/download PDF
47. Architectures and Execution Models for Hardware/Software Compilation and Their System-Level Realization.
- Author
-
Lange, Holger and Koch, Andreas
- Subjects
ADAPTIVE computing systems ,FIELD programmable gate arrays ,COMPUTER input-output equipment ,COMPUTER software ,COMPUTER storage devices ,SYSTEM integration ,COMPUTER operating systems ,VIRTUAL storage (Computer science) - Abstract
We propose an execution model that orchestrates the fine-grained interaction of a conventional general-purpose processor (GPP) and a high-speed reconfigurable hardware accelerator (HA), the latter having full master-mode access to memory. We then describe how the resulting requirements can be realized efficiently in a custom computer through hardware architecture and system software measures. One of these is a low-latency HA-to-GPP signaling scheme with latency up to 23 times lower than that of conventional approaches. Another is a high-bandwidth shared memory interface that does not interfere with time-critical operating system functions executing on the GPP, and still makes 89 percent of the physical memory bandwidth available to the HA. Finally, we show two schemes with different flexibility/performance trade-offs for running the HA in protected virtual memory scenarios. All of the techniques and their interactions are evaluated at the system level using the full-scale virtual memory variant of the Linux operating system on actual hardware. [ABSTRACT FROM PUBLISHER]
- Published
- 2010
- Full Text
- View/download PDF
48. Generation of Finely-Pipelined GF(P) Multipliers for Flexible Curve-Based Cryptography on FPGAs.
- Author
-
Gallin, Gabriel and Tisserand, Arnaud
- Subjects
MULTIPLIERS (Mathematical analysis) ,CRYPTOGRAPHY ,FLEXIBLE printed circuits ,ELLIPTIC curve cryptography ,FIELD programmable gate arrays ,TIMING circuits - Abstract
In this paper, we present modular multipliers for hardware implementations of (hyper-)elliptic curve cryptography on FPGAs. The prime modulus P is generic and can be configured at run time to provide flexible circuits. A finely-pipelined architecture is proposed for overlapping the partial-product and reduction steps in the pipeline of hardwired DSP slices. For instance, 2, 3, or 4 independent multiplications can share the hardware resources at the same time to overlap internal latencies. We designed a tool, distributed as open source, for generating VHDL code with various parameters: width of operands, number of logical multipliers per physical one, speed or area optimization, possible use of BRAMs, and target FPGA. Our modular multipliers lead to circuits that are at least 2 times faster and 2 times smaller than state-of-the-art operators. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
49. On the Construction of Composite Finite Fields for Hardware Obfuscation.
- Author
-
Zhang, Xinmiao and Lao, Yingjie
- Subjects
FINITE fields ,COMPOSITE construction - Abstract
Hardware obfuscation is a technique that modifies a circuit to hide its functionality. Obfuscation through algorithmic modifications adds protection on top of circuit-level techniques, and its effects on the data paths can be analyzed and controlled at the architectural level. Many error-correcting coding and cryptography algorithms are based on finite field arithmetic. For the first time, this paper proposes a hardware obfuscation scheme achieved through varying finite field constructions and primitive element representations. The variations are also effectively transformed into bit permuters controlled by obfuscation keys to achieve a high level of security with very small complexity overhead. To illustrate the effectiveness, the proposed scheme is applied to obfuscate Reed-Solomon decoders, which are broadly used in communication and storage systems. For a (255, 239) RS decoder over the finite field GF(256), the proposed scheme achieves 1239 bits of independent obfuscation key with 4.4 percent area overhead, while incurring no throughput penalty and only one extra clock cycle of latency. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
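The obfuscation idea above rests on the fact that the "same" finite field can be constructed in different ways. The Python sketch below multiplies the same two bytes under two valid GF(2^8) constructions, the AES polynomial 0x11B and the 0x11D polynomial common in Reed-Solomon codecs, and obtains different bit patterns; the paper's key-controlled bit-permutation machinery is not modeled here.

def gf256_mul(a: int, b: int, poly: int) -> int:
    # Multiply two GF(2^8) elements; 'poly' is the degree-8 irreducible
    # polynomial (bit 8 set) that defines the field construction.
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= poly
    return result

# Two constructions of GF(256): the AES polynomial x^8+x^4+x^3+x+1 (0x11B)
# and x^8+x^4+x^3+x^2+1 (0x11D), common in Reed-Solomon codecs.
print(hex(gf256_mul(0x57, 0x83, 0x11B)))  # 0xc1, the well-known AES example
print(hex(gf256_mul(0x57, 0x83, 0x11D)))  # 0x31: same operands, different product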
50. Exploring Shared Virtual Memory for FPGA Accelerators with a Configurable IOMMU.
- Author
-
Vogel, Pirmin, Marongiu, Andrea, and Benini, Luca
- Subjects
SYSTEMS on a chip ,MEMORY - Abstract
A key enabler for the ever-increasing adoption of FPGA accelerators is the availability of frameworks allowing for seamless coupling to general-purpose host processors. Embedded FPGA+CPU systems still heavily rely on copy-based host-to-accelerator communication, which complicates application development. In this paper, we present a hardware/software framework for enabling transparent, shared virtual memory for FPGA accelerators in embedded SoCs. It can use a hard-macro IOMMU if available, or a configurable soft-core IOMMU that we provide. We explore different TLB configurations and provide a comparison with other designs for shared virtual memory to gain insight into performance-critical IOMMU components. Experimental results using pointer-rich benchmarks show that our framework not only simplifies FPGA-accelerated application development, but also achieves up to 13x speedup compared to traditional copy-based offloading. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
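A toy model such as the one below can convey what "exploring different TLB configurations" means in the abstract above: a set-associative translation buffer with a configurable number of sets and ways, counting hits and misses over a synthetic access trace. The class ToyTLB and the trace are illustrative assumptions, not the soft-core IOMMU described in the paper.

from collections import OrderedDict

class ToyTLB:
    # Set-associative TLB model: 'sets' x 'ways' entries with per-set LRU.
    def __init__(self, sets, ways):
        self.sets, self.ways = sets, ways
        self.table = [OrderedDict() for _ in range(sets)]  # vpn -> ppn per set
        self.hits = self.misses = 0

    def translate(self, vpn, page_table):
        entries = self.table[vpn % self.sets]
        if vpn in entries:
            entries.move_to_end(vpn)         # refresh LRU position
            self.hits += 1
            return entries[vpn]
        self.misses += 1                     # a miss triggers a page-table walk
        if len(entries) >= self.ways:
            entries.popitem(last=False)      # evict the least recently used entry
        entries[vpn] = page_table[vpn]
        return entries[vpn]

page_table = {vpn: vpn + 0x10000 for vpn in range(1 << 12)}
tlb = ToyTLB(sets=8, ways=4)                 # 32 entries; vary sets/ways to explore
for i in range(10000):
    tlb.translate(i % 64, page_table)        # looping 64-page working set
print(tlb.hits, tlb.misses)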