31 results on '"application-specific processor"'
Search Results
2. Accelerating the Dynamic Time Warping Distance Measure using Logarithmetic Arithmetic
- Author
-
Tarango, Joseph, Keogh, Eamonn, and Brisk, Philip
- Subjects
Engineering, Electrical Engineering, Time series, similarity search, application-specific processor, Instruction Set extension, Euclidean Distance, Dynamic Time Warping, floating-point arithmetic, logarithmic arithmetic
- Abstract
This paper describes an application-specific embedded processor with instruction set extensions (ISEs) for the Dynamic Time Warping (DTW) distance measure, which is widely used in time series similarity search. The ISEs in this paper are implemented using a form of logarithmic arithmetic that offers significant performance and power/energy advantages compared to more traditional floating-point operations.
- Published
- 2014
3. No-instruction-set-computer design experience of flexible and efficient architectures for digital communication applications: two case studies on MIMO turbo detection and universal turbo demapping.
- Author
-
Rizk, Mostafa, Baghdadi, Amer, Jezequel, Michel, Mohanna, Yasser, and Atat, Youssef
- Abstract
The emerging flexibility need in designing application-specific processors dedicated to modules of digital receivers imposes a new design metric, which is added to the requirements of efficiency and productivity. In order to cope with the emerging flexibility requirement combined with the best performance efficiency, many application-specific processor design approaches have been proposed and investigated. In general, available design approaches that adopt dynamic scheduling of instructions add an overhead due to the instruction decoding. To minimize this overhead, several approaches have been introduced, which opt for static scheduling. In this context, the No-Instruction-Set-Computer (NISC) concept has been introduced to design application-specific processors without an instruction set. The NISC concept proposes that there is no need to first design and then use an instruction set when the hardware is programmed by its designers rather than its users. The NISC design approach offers a good compromise between flexibility, productivity, and quality for the design of a digital system. In our work, the NISC approach is explored through the design of flexible and efficient architectures dedicated to digital communication applications which fulfill the requirements imposed by multiple emergent communication standards. This paper introduces briefly the NISC concept and the corresponding design methodology. Also, it provides an overview of the related design approach. In addition, the relevance of NISC in realizing flexible and efficient implementation in the domain of digital communication is demonstrated through two case studies on MIMO turbo detection and universal turbo demapping. Both designed NISC-based architectures have been compared to state-of-the-art ASIP-based architectures using similar computational resources and supporting the same flexibility parameters.
The obtained results show that the proposed NISC-based architectures provide a significant improvement in execution performance while having reduced implementation costs. The results also illustrate how the control memory requirements depend on the application and the devised architecture choices. In the detector module, the adopted re-usability of allocated resources imposes separate controlling of each component; hence, additional control signals are implied. Whereas for the demapper module, implemented hardware components are considered to perform specific operations and to deal with the same type of data; hence, the number of control signals can be reduced significantly. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
4. ASAP: Accelerated Short-Read Alignment on Programmable Hardware.
- Author
-
Banerjee, Subho Sankar, El-Hadedy, Mohamed, Lim, Jong Bin, Kalbarczyk, Zbigniew T., Chen, Deming, Lumetta, Steven S., and Iyer, Ravishankar K.
- Subjects
*FIELD programmable gate arrays, *COMPUTER input-output equipment, *INFORMATION superhighway, *KERNEL operating systems, *BIOINFORMATICS
- Abstract
The proliferation of high-throughput sequencing machines ensures rapid generation of up to billions of short nucleotide fragments in a short period of time. This massive amount of sequence data can quickly overwhelm today's storage and compute infrastructure. This paper explores the use of hardware acceleration to significantly improve the runtime of short-read alignment, a crucial step in preprocessing sequenced genomes. We focus on the Levenshtein distance (edit-distance) computation kernel and propose the ASAP accelerator, which utilizes the intrinsic delay of circuits for edit-distance computation elements as a proxy for computation. Our design is implemented on a Xilinx Virtex 7 FPGA in an IBM POWER8 system that uses the CAPI interface for cache coherence across the CPU and FPGA. Our design is $200\times$ faster than an equivalent Smith-Waterman-C implementation of the kernel running on the host processor, $40-60\times$ faster than an equivalent Landau-Vishkin-C++ implementation of the kernel running on the IBM Power8 host processor, and $2\times$ faster for an end-to-end alignment tool for 120–150 base-pair short-read sequences. Further, the design represents a $3760\times$ improvement over the CPU in performance/Watt terms. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
5. Energy Efficient Programmable MIMO Decoder Accelerator Chip in 65-nm CMOS.
- Author
-
Mohamed, Mohamed I. A., Mohammed, Karim, and Daneshrad, Babak
- Subjects
PROGRAMMABLE controllers, MIMO systems, DECODERS (Electronics), COMPLEMENTARY metal oxide semiconductors, ORTHOGONAL frequency division multiplexing
- Abstract
This paper presents an energy efficient programmable hardware accelerator that targets multiple-input-multiple-output (MIMO) decoding tasks of orthogonal frequency-division multiplexing (OFDM) systems. The work is motivated by the adoption of MIMO and OFDM by almost all existing and emerging high-speed wireless data communication systems. The accelerator was fabricated in 65-nm CMOS technology and occupies a core area of 2.48 mm^2. It delivers full programmability across different wireless standards (i.e., WiFi, 3G-long term evolution, and WiMax) as well as different MIMO decoding algorithms (i.e., minimum mean square error, singular value decomposition, and maximum likelihood) with extreme energy efficiency. The energy efficiency of our MIMO accelerator chip was compared against dedicated application specific integrated circuits for $4\times4$ QR decomposition, $4\times4$ singular value decomposition, and $2\times2$ minimum mean square error decoding. Despite the programmable nature of our design, it delivered energy efficiencies that were 18% to 28% better than the dedicated solutions reported in the literature. This paper presents the VLSI implementation of the architecture discussed in [14]–[16]. It discusses the implementation decisions and tradeoffs used to ensure minimum overall energy consumption of the resulting accelerator chip without sacrificing programmability. Given its programmability and extreme energy efficiency, the accelerator is an ideal solution for today's smart phones that implement multiple MIMO-OFDM waveforms on the same platform. [ABSTRACT FROM AUTHOR]
- Published
- 2014
- Full Text
- View/download PDF
6. Embedded processor optimised for vascular pattern recognition.
- Author
-
Park, Gi‐Tae and Kim, Soo‐Won
- Abstract
In this study, the authors propose an efficient embedded processing architecture that uses the vascular pattern extraction (VPE) algorithm to authenticate a user to an embedded system. This study first considers the use of direction‐based vascular pattern extraction (DBVPE), and analyses the computational workload involved in running software implementations on an embedded processor. The authors then present a comprehensive performance analysis of the VPE algorithm and examine in detail the various factors that contribute to processing latencies, including VPE recognition processing. In order to improve the efficiency of VPE processing in embedded devices, the authors offer details regarding the process needed to create a highly efficient application‐specific processor and extend the base instruction set of the processor by using custom instructions for recognition processing. The authors implemented our proposed methodology in the context of a commercial extensible processor design flow using the Xtensa platform from Tensilica Inc. Our experiments show that our proposed methodology achieves a 3.95‐fold increase in the vascular pattern recognition speed. Hence, the authors consider our technique to be efficient. [ABSTRACT FROM AUTHOR]
- Published
- 2013
- Full Text
- View/download PDF
7. Instruction set architectural guidelines for embedded packet-processing engines
- Author
-
Salehi, Mostafa E., Fakhraie, Sied Mehdi, and Yazdanbakhsh, Amir
- Subjects
*COMPUTER architecture, *EMBEDDED computer systems, *BANDWIDTHS, *MATHEMATICAL models, *DATA analysis, *PERFORMANCE, *ECONOMIC demand, *COMPUTER networks
- Abstract
This paper presents instruction set architectural guidelines for improving general-purpose embedded processors to optimally accommodate packet-processing applications. Similar to other embedded processors such as media processors, packet-processing engines are deployed in embedded applications, where cost and power are as important as performance. In this domain, the growing demands for higher bandwidth and performance besides the ongoing development of new networking protocols and applications call for flexible power- and performance-optimized engines. The instruction set architectural guidelines are extracted from an exhaustive simulation-based profile-driven quantitative analysis of different packet-processing workloads on 32-bit versions of two well-known general-purpose processors, ARM and MIPS. This extensive study has revealed the main performance challenges and tradeoffs in development of evolution path for survival of such general-purpose processors with optimum accommodation of packet-processing functions for future switching-intensive applications. Architectural guidelines include types of instructions, branch offset size, displacement and immediate addressing modes for memory access along with the effective size of these fields, data types of memory operations, and also new branch instructions. The effectiveness of the proposed guidelines is evaluated with the development of a retargetable compilation and simulation framework. Developing the HDL model of the optimized base processor for networking applications and using a logic synthesis tool, we show that enhanced area, power, delay, and power per watt measures are achieved. [Copyright Elsevier]
- Published
- 2012
- Full Text
- View/download PDF
8. A Parameterized Programmable MIMO Decoding Architecture With a Scalable Instruction Set and Compiler.
- Author
-
Mohammed, Karim, Mohamed, M. I. A., and Daneshrad, Babak
- Subjects
MIMO systems, COMPUTER input-output equipment, BANDWIDTHS, ELECTROSTATIC accelerators, ALGORITHMS, DIGITAL signal processing, SIMULATION methods & models, CODING theory
- Abstract
We present a novel multiple-input multiple-output (MIMO) decoder accelerator and its associated integrated design environment. The accelerator architecture allows tradeoffs in decoding algorithm, antenna configuration, modulation scheme, and bandwidth at run-time via user programming. The accelerator delivers an improvement over a general purpose digital signal processor (DSP) reaching three orders of magnitude for matrix processing and linear MIMO decoding. The hardware architecture is user-configurable through ten independently set parameters. The parameterization allows independent control over the size and structure of the processing core as well as the structure, size, and access scheme of data memory. We provide a custom high level script and a scalable machine level instruction set and compiler. The elements of hardware configuration and programmability are combined in a user-friendly design flow that takes the MIMO decoder designer from simulation to hardware with dedicated-hardware-like performance in no time. [ABSTRACT FROM AUTHOR]
- Published
- 2011
- Full Text
- View/download PDF
9. Fixed-Size Quadruples for a New, Hardware-Oriented Representation of the 4D Clifford Algebra.
- Author
-
Franchini, Silvia, Gentile, Antonio, Sorbello, Filippo, Vassallo, Giorgio, and Vitabile, Salvatore
- Abstract
Clifford algebra (geometric algebra) offers a natural and intuitive way to model geometry in fields such as robotics, machine vision and computer graphics. This paper proposes a new representation based on fixed-size elements (quadruples) of 4D Clifford algebra and demonstrates that this choice leads to an algorithmic simplification which in turn leads to a simpler and more compact hardware implementation of the algebraic operations. In order to prove the advantages of the new, quadruple-based representation over the classical representation based on homogeneous elements, a coprocessing core supporting the new fixed-size Clifford operands, namely Quad-CliffoSor (Quadruple-based Clifford coprocesSor) was designed and prototyped on an FPGA board. Test results show the potential to achieve a 23× speedup for Clifford products and a 33× speedup for Clifford sums and differences compared to the same operations executed by a software library running on a general-purpose processor. [ABSTRACT FROM AUTHOR]
- Published
- 2011
- Full Text
- View/download PDF
10. A MIMO Decoder Accelerator for Next Generation Wireless Communications.
- Author
-
Mohammed, Karim and Daneshrad, Babak
- Abstract
In this paper, we present a multi-input–multi-output (MIMO) decoder accelerator architecture that offers versatility and reprogrammability while maintaining a very high performance-cost metric. The accelerator is meant to address the MIMO decoding bottlenecks associated with the convergence of multiple high-speed wireless standards onto a single device. It is scalable in the number of antennas, bandwidth, modulation format, and most importantly, present and emerging decoder algorithms. It features a Harvard-like architecture with complex vector operands and a deeply pipelined fixed-point complex arithmetic processing unit. When implemented on a Xilinx Virtex-4 LX200FF1513 field-programmable gate array (FPGA), the design occupied 43% of overall FPGA resources. The accelerator shows an advantage of up to three orders of magnitude (1000 times) in power-delay product for typical MIMO decoding operations relative to a general purpose DSP. When compared to dedicated application-specific IC (ASIC) implementations of MMSE MIMO decoders, the accelerator showed a degradation of 340%–17%, depending on the actual ASIC being considered. In order to optimize the design for both speed and area, specific challenges had to be overcome. These include: definition of the processing units and their interconnection; proper dynamic scaling of the signal; and memory partitioning and parallelism. [ABSTRACT FROM PUBLISHER]
- Published
- 2010
- Full Text
- View/download PDF
11. High-Performance Rekeying Processor Architecture for Group Key Management.
- Author
-
Shoufan, Abdulhadi and Huss, Sorin A.
- Subjects
*INTERNET, *CRYPTOGRAPHY, *DATA encryption, *COMMUNICATION, *SOFTWARE architecture
- Abstract
Group key management is a critical task in secure multicast applications such as Pay-TV over the Internet. The communication group key must be updated and distributed after every change in the group membership. Many solutions have been proposed in the last years to minimize the cost of this rekeying process on the server side. Most of these solutions are tree-based approaches such as the logical key hierarchy. These approaches suffer from three problems. First, tree-based solutions aim at minimizing rekeying costs only by reducing the number of needed cryptographic operations such as encryption or secure hashing. Second, these solutions do not treat the time-consuming digital signing needed to authenticate rekeying messages. Third, tree-based approaches manage huge amounts of keys by software which compromises security. In this paper, a novel hardware/software architecture is proposed, which optimizes the rekeying performance not only by minimizing the number of cryptographic operations, but also by reducing the execution times of these operations including digital signing with the aid of hardware acceleration. All help-keys are generated, managed, and stored on hardware, which enhances the system security. To keep flexibility, control-intensive tasks such as tree management are performed as software functions on the embedded processor. The presented rekeying processor is designed based on a comprehensive security analysis with the aid of a novel illustration for security threats, requirements, and technical solutions, a so-called Security Y-Diagram. A performance measurement on a prototype implementation shows that the rekeying processor can join and disjoin members much faster than software solutions besides supporting much larger groups. [ABSTRACT FROM AUTHOR]
- Published
- 2009
- Full Text
- View/download PDF
12. Quantitative analysis of packet-processing applications regarding architectural guidelines for network-processing-engine development
- Author
-
Salehi, Mostafa E. and Fakhraie, Sied Mehdi
- Subjects
*DATA packeting, *ELECTRONIC data processing, *QUANTITATIVE research, *COMPUTER architecture, *COMPUTER network protocols, *EMBEDDED computer systems, *COMPUTER engineering, *COMPUTER systems, *COMPUTER programming, *BOTTLENECKS (Manufacturing)
- Abstract
This paper presents a simulation-based profile-driven quantitative analysis of packet-processing applications. In this domain, demands for increasing the performance and the ongoing development of network protocols both call for flexible and performance-optimized engines. Based on the achieved profiling results, we introduce platform-independent analysis that locates the performance bottlenecks and architectural challenges of a packet-processing engine. Finally based on these results, we extract helpful architectural guidelines for design of a flexible and high-performance embedded processor that is optimized for packet-processing operations in high-performance and cost-sensitive network embedded applications. [Copyright Elsevier]
- Published
- 2009
- Full Text
- View/download PDF
13. An embedded, FPGA-based computer graphics coprocessor with native geometric algebra support
- Author
-
Franchini, Silvia, Gentile, Antonio, Sorbello, Filippo, Vassallo, Giorgio, and Vitabile, Salvatore
- Subjects
*EMBEDDED computer systems, *COMPUTERS, *DIGITAL image processing, *COMPUTER architecture
- Abstract
The representation of geometric objects and their transformation are the two key aspects in computer graphics applications. Traditionally, computer-intensive matrix calculations are involved in modeling and rendering three-dimensional (3D) scenery. Geometric algebra (aka Clifford algebra) is attracting attention as a natural way to model geometric facts and as a powerful analytical tool for symbolic calculations. In this paper, the architecture of Clifford coprocessor (CliffoSor) is introduced. CliffoSor is an embedded parallel coprocessing core that offers direct hardware support to Clifford algebra operators. A prototype implementation on a field-programmable gate array (FPGA) board is detailed. Initial test results show the potential to achieve a 20× speedup for 3D vector rotations, a 12× speedup for Clifford sums and differences, and more than a 4× speedup for Clifford products, compared to the analogous operations in GAIGEN, a standard geometric algebra library generator for general-purpose processors. An execution analysis of a raytracing application is also presented. [Copyright Elsevier]
- Published
- 2009
- Full Text
- View/download PDF
14. Architecture of an application-specific processor for real-time implementation of H.264/AVC sub-pixel interpolation.
- Author
-
Dang, Philip
- Abstract
This paper presents an efficient VLSI architecture for fast implementation of sub-pixel interpolation of H.264/AVC. Several optimization techniques at different design levels, such as parallel processing, vector register, pipeline architecture, and in-place computation, are utilized to reduce the number of memory accesses and accelerate the interpolation computations. The proposed application-specific processor can meet the real-time constraint of the sub-pixel interpolation algorithm for the 16:9 video format (4,690 × 2,304) at 30 frames per second (fps) at 100 MHz clock rate. [ABSTRACT FROM AUTHOR]
- Published
- 2009
- Full Text
- View/download PDF
15. Automatic Design of Application Specific Instruction Set Extensions Through Dataflow Graph Exploration.
- Author
-
Clark, Nathan, Hongtao Zhong, Tang, Wilkin, and Mahlke, Scott
- Subjects
*AUTOMATION, *COST, *DATA flow computing, *ELECTRONIC data processing, *COMPUTER input-output equipment, *COMPUTER graphics
- Abstract
General-purpose processors are often incapable of achieving the challenging cost, performance, and power demands of high-performance applications. To meet these demands, most systems employ a number of hardware accelerators to off-load the computationally demanding portions of the application. As an alternative to this strategy, we examine customizing the computation capabilities of a processor for a particular application. The processor is extended with hardware in the form of a set of custom function units and instruction set extensions. To effectively identify opportunities for creating custom hardware, a dataflow graph design space exploration engine heuristically identifies candidate computation subgraphs without artificially constraining their size or shape. The engine combines estimates of performance gain, cost, and inherent limitations of the processor to grow candidate graphs in profitable directions while pruning unprofitable paths. This paper describes the dataflow graph exploration engine and evaluates its effectiveness across a set of embedded applications. [ABSTRACT FROM AUTHOR]
- Published
- 2003
- Full Text
- View/download PDF
16. Comparing vertical and horizontal SIMD vector processor architectures for accelerated image feature extraction.
- Author
-
Weißbrich, M., García-Ortiz, A., and Payá-Vayá, G.
- Subjects
*ARCHITECTURE, *COMPUTER vision, *AUTOMOTIVE electronics, *IMAGE processing, *MOBILE operating systems, *PARALLEL programming, *FEATURE extraction
- Abstract
• Implementation of parameterizable horizontal/vertical SIMD vector processor architectures.
• Performance-optimized hardware and ISA for image processing tasks.
• Complete SIFT implementation with detailed horizontal/vertical vectorization strategies.
• Extensive architecture comparison for ASIC implementation (performance, area, energy).
Embedded automotive Computer Vision systems for real-time motion tracking and 3D scene reconstruction demand high image feature extraction performance and have a heavily constrained energy budget that cannot be met by general-purpose CPUs and GPUs. Due to the required programming flexibility for software updates and algorithmic extensions, the use of fully dedicated hardware accelerators is not advisable in most cases. In this paper, a vertical and a horizontal SIMD vector processor architecture are implemented and compared for accelerating the Scale-Invariant Feature Transform feature extraction algorithm, exploiting inherent data-level parallelism prevalent in this application and considering different programming code strategies for the different vectorization paradigms. An evaluation for a 45 nm ASIC technology shows an overall performance gain of up to 24.8x, and up to 151.3x higher total performance-area-energy efficiency compared to a reference scalar two-issue VLIW processor. Compared to other implementations on programmable ASIP and mobile GPU platforms, the proposed vertical SIMD vector processor achieves a performance gain of up to 5.1x and up to 31.3x higher performance-energy efficiency. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
17. Design and Prototyping Flow of Flexible and Efficient NISC-based Architectures for MIMO Turbo Equalization and Demapping
- Author
-
Youssef Atat, Michel Jezequel, Yasser Mohanna, Amer Baghdadi, and Mostafa Rizk
School of Engineering, Lebanese International University (LIU); Laboratoire des sciences et techniques de l'information, de la communication et de la connaissance (Lab-STICC; ENIB, UBS, UBO, Télécom Bretagne, IBNM, UEB, ENSTA Bretagne, IMT, CNRS); Département Electronique (ELEC), Télécom Bretagne, Institut Mines-Télécom [Paris] (IMT); Faculty of Sciences, Lebanese University [Beirut] (LU)
- Subjects
Engineering, Computer Networks and Communications, MIMO, Demapping, lcsh:TK7800-8360, 02 engineering and technology, [ SPI.SIGNAL ] Engineering Sciences [physics]/Signal and Image processing, Instruction set, Datapath, 0202 electrical engineering, electronic engineering, information engineering, Flexible implementation, Wireless, Electrical and Electronic Engineering, prototype flow, Field-programmable gate array, Implementation, FPGA, Register transfer language, NISC, business.industry, Iterative equalization, lcsh:Electronics, flexible implementation, application-specific processor, iterative, equalization, demapping, 020206 networking & telecommunications, Prototyping, WiMAX, Application-specific processor, [SPI.TRON]Engineering Sciences [physics]/Electronics, 020202 computer hardware & architecture, [ SPI.TRON ] Engineering Sciences [physics]/Electronics, Hardware and Architecture, Control and Systems Engineering, Embedded system, Signal Processing, business, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing
- Abstract
In the domain of digital wireless communication, flexible design implementations are increasingly explored for different applications in order to cope with diverse system configurations imposed by the emerging wireless communication standards. In fact, shrinking the design time to meet market pressure, on the one hand, and adding the emerging flexibility requirement and, hence, increasing system complexity, on the other hand, require a productive design approach that also ensures final design quality. The no instruction set computer (NISC) approach fulfills these design requirements by eliminating the instruction set overhead. The approach offers static scheduling of the datapath, automated register transfer language (RTL) synthesis and allows the designer to have direct control of hardware resources. This paper presents a complete NISC-based design and prototype flow, from architecture specification to FPGA implementation. The proposed design and prototype flow is illustrated through two case studies of flexible implementations, which are dedicated to a low-complexity MIMO turbo-equalizer and a universal turbo-demapper. Moreover, the flexibility of the proposed prototypes allows supporting all communication modes defined in the emerging wireless communication standards, such as LTE, LTE-Advanced, WiMAX, WiFi and DVB-RCS. For each prototype, its functionality is evaluated, and the resultant performance is verified for all system configurations.
- Published
- 2016
18. Accelerating the Dynamic Time Warping Distance Measure using Logarithmetic Arithmetic
- Author
-
Eamonn Keogh, Philip Brisk, Joseph Tarango, and Matthews, Michael B
- Subjects
Dynamic time warping, Floating point, Time series, logarithmic arithmetic, floating-point arithmetic, Nearest neighbor search, Logarithmic number system, similarity search, Measure (mathematics), Power (physics), Instruction set, Euclidean Distance, application-specific processor, Instruction Set extension, Arithmetic, Energy (signal processing), Mathematics, Dynamic Time Warping
- Abstract
© 2014 IEEE. This paper describes an application-specific embedded processor with instruction set extensions (ISEs) for the Dynamic Time Warping (DTW) distance measure, which is widely used in time series similarity search. The ISEs in this paper are implemented using a form of logarithmic arithmetic that offers significant performance and power/energy advantages compared to more traditional floating-point operations.
- Published
- 2014
19. Application-specific Processor Architecture: Then and Now
- Author
-
Cappello, Peter
- Published
- 2008
- Full Text
- View/download PDF
20. A Specialized Architecture for Color Image Edge Detection Based on Clifford Algebra
- Author
-
Antonio Gentile, Silvia Franchini, Salvatore Vitabile, Filippo Sorbello, and Giorgio Vassallo
- Subjects
Hardware architecture, Multispectral MR images, Settore ING-INF/05 - Sistemi Di Elaborazione Delle Informazioni, Color histogram, Computer science, Color image, business.industry, Color image edge detection, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, FPGA prototyping, Application-specific processor, Color quantization, Edge detection, Convolution, Computer Science::Hardware Architecture, Computer Science::Computer Vision and Pattern Recognition, RGB color model, Computer vision, Artificial intelligence, Clifford algebra, business, Image gradient
- Abstract
Edge detection of color images is usually performed by applying the traditional techniques for gray-scale images to the three color channels separately. However, human visual perception does not differentiate colors and processes the image as a whole. Recently, new methods have been proposed that treat RGB color triples as vectors and color images as vector fields. In these approaches, edge detection is obtained by extending the classical pattern matching and convolution techniques to vector fields. This paper proposes a hardware implementation of an edge detection method for color images that exploits the definition of geometric product of vectors given in the Clifford algebra framework to extend the convolution operator and the Fourier transform to vector fields. The proposed architecture has been prototyped on the Celoxica RC203E Field Programmable Gate Array (FPGA) board. Experimental tests on the FPGA prototype show that the proposed hardware architecture allows for an average speedup ranging between 6x and 18x for different image sizes against the execution on a conventional general-purpose processor. The Clifford-algebra-based edge detector can be exploited to process not only color images but also multispectral gray-scale images. The proposed hardware architecture has been successfully used for feature extraction of multispectral magnetic resonance (MR) images.
- Published
- 2013
21. Generic netlist representation for system and PE level design exploration
- Author
-
Bita Gorjiara, Daniel D. Gajski, Mehrdad Reshadi, and Pramod Chandraiah
- Subjects
Architecture description language ,MicroBlaze ,Speedup ,business.industry ,Interface (Java) ,Computer science ,Application-specific instruction-set processor ,Computer architecture ,Embedded system ,High-level synthesis ,Netlist ,Systems design ,GNR ,NISC ,application-specific processor ,architecture description language ,modeling ,synthesis ,system design ,business - Abstract
Designer productivity and design predictability are vital factors for successful embedded system design. Shrinking time-to-market and the increasing complexity of these systems require more productive design approaches starting from high-level languages such as C. On the other hand, the tight constraints of embedded systems require careful design exploration at the system level (coarse-grained exploration) and at the processing-element (PE) level (fine-grained exploration). In this paper we present GNR, a formal modeling approach developed to improve the productivity of designing systems and processing elements, in the same way that traditional ADLs improved productivity for designing processors. A GNR description is an order of magnitude shorter than state-of-the-art ADLs with RTL generation capabilities, yet it can capture any structural details that affect the implementation quality. Using relatively short GNR descriptions, we explored several designs for implementing an MP3 decoder and achieved a speedup of 3.25 compared to a MicroBlaze processor. We have also developed a web-based interface for our tools, so that users can upload and evaluate new architectures described in GNR. Our toolset and GNR are an intermediate step towards synthesis from TLM to RTL.
- Published
- 2006
22. Eric: A Special-Purpose Processor for ERI Calculations in Quantum Chemistry Applications
- Author
-
Nakamura, Kenta, Hatae, Hidenori, Harada, Muneyuki, Kuwayama, Yoji, Uehara, Masamitsu, Sato, Hisao, Obara, Shigeru, Honda, Hiroaki, Nagashima, Umpei, Inadomi, Yuichi, and Murakami, Kazuaki
- Subjects
Ab initio molecular orbital calculation ,application-specific processor ,chip-multiprocessor architecture ,electron repulsion integral - Abstract
Ab initio molecular orbital (MO) calculation is useful for solving many challenging problems regarding the development of new drugs, chemicals, polymers, materials, and so on. In the EHPC (Embedded High Performance Computing) project, we are now developing a special-purpose computer system for ab initio MO calculations in order to reduce the calculation time. The sequential execution time of ab initio MO is $O(N^4)$, where $N$ is the number of basis functions, the heaviest computation being the electron repulsion integrals (ERI's). In order to accelerate ab initio MO calculations, it is necessary to develop a special-purpose processor for ERI calculation. Using the characteristics of ERI in the Obara algorithm makes it possible to reduce the calculation time. In this work, we investigate a chip-multiprocessor (CMP) architecture, called Eric, for an application-specific processor able to perform fast ERI computations.
- Published
- 2002
23. Implementation of Fast Fourier Transformation on Transport Triggered Architecture
- Author
-
Maršálek, Roman, and Slovák, Jiří
- Abstract
The thesis proposes an energy-efficient processor architecture for computing the Fast Fourier Transform (FFT) using a Transport Triggered Architecture (TTA) template. The architecture was specifically tailored to a custom instruction schedule using several custom functional units (FUs). The instruction schedule was developed so that most of the computation is done in a loop containing only one parallel instruction word. This word is stored in an instruction loop buffer, from which it is fetched instead of from instruction memory; because fetching from the loop buffer is more power-efficient, power consumption can be lowered. A timed model of the processor and the instruction schedule were developed; they verified the approach and suggested further improvements. Python programs for generating reference results and for automatically verifying the timed models against them were developed to aid the design process.
24. Aplikačně specifický procesor pro stavové zpracování síťových dat (Application-Specific Processor for Stateful Processing of Network Data)
- Author
-
Kekely, Lukáš, and Matoušek, Jiří
- Abstract
This bachelor's thesis deals with the design and implementation of an application-specific processor for high-speed stateful measurement of network flows. The main goal is to provide a complex system for hardware acceleration of various network security and monitoring applications. The application-specific processor (the hardware part of the system) is implemented in an FPGA on an acceleration network card and has been designed for use in 100 Gbps networks. The design is based on a unique combination of high-speed hardware processing and flexible software control, building on the concept of Software Defined Monitoring (SDM). The system passed functional verification, and its real throughput and other performance parameters were verified in hardware testing.
29. Aplikačně specifický procesor pro stavové zpracování síťových dat (Application-Specific Processor for Stateful Processing of Network Data)
- Author
-
Kekely, Lukáš, Matoušek, Jiří, and Kučera, Jan
- Abstract
This bachelor's thesis deals with the design and implementation of an application-specific processor for high-speed stateful measurement of network flows. The main goal is to provide a complex system for hardware acceleration of various network security and monitoring applications. The application-specific processor (the hardware part of the system) is implemented in an FPGA on an acceleration network card and has been designed for use in 100 Gbps networks. The design is based on a unique combination of high-speed hardware processing and flexible software control, building on the concept of Software Defined Monitoring (SDM). The system passed functional verification, and its real throughput and other performance parameters were verified in hardware testing.
31. Implementation of Fast Fourier Transformation on Transport Triggered Architecture
- Author
-
Maršálek, Roman, Slovák, Jiří, and Žádník, Jakub
- Abstract
The thesis proposes an energy-efficient processor architecture for computing the Fast Fourier Transform (FFT) using a Transport Triggered Architecture (TTA) template. The architecture was specifically tailored to a custom instruction schedule using several custom functional units (FUs). The instruction schedule was developed so that most of the computation is done in a loop containing only one parallel instruction word. This word is stored in an instruction loop buffer, from which it is fetched instead of from instruction memory; because fetching from the loop buffer is more power-efficient, power consumption can be lowered. A timed model of the processor and the instruction schedule were developed; they verified the approach and suggested further improvements. Python programs for generating reference results and for automatically verifying the timed models against them were developed to aid the design process.