23 results for "Luk, Wayne"
Search Results
2. A fully-customized dataflow engine for 3D earthquake simulation with a complex topography
- Author
- Chen, Bingwei, Fu, Haohuan, Luk, Wayne, and Yang, Guangwen
- Published
- 2022
- Full Text
- View/download PDF
3. Mapping Large LSTMs to FPGAs with Weight Reuse
- Author
- Que, Zhiqiang, Zhu, Yongxin, Fan, Hongxiang, Meng, Jiuxi, Niu, Xinyu, and Luk, Wayne
- Published
- 2020
- Full Text
- View/download PDF
4. High performance reconfigurable computing for numerical simulation and deep learning
- Author
- Gan, Lin, Yuan, Ming, Yang, Jinzhe, Zhao, Wenlai, Luk, Wayne, and Yang, Guangwen
- Published
- 2020
- Full Text
- View/download PDF
5. Reconfigurable Hardware Generation for Tensor Flow Models of CNN Algorithms on a Heterogeneous Acceleration Platform
- Author
- Gao, Jiajun, Zhu, Yongxin, Qiu, Meikang, Tsoi, Kuen Hung, Niu, Xinyu, Luk, Wayne, Zhao, Ruizhe, Que, Zhiqiang, Mao, Wei, Feng, Can, Zha, Xiaowen, Deng, Guobao, Chen, Jiayi, Liu, Tao, and Qiu, Meikang, editor
- Published
- 2018
- Full Text
- View/download PDF
6. Distributed large-scale graph processing on FPGAs
- Author
- Sahebi, Amin, Barbone, Marco, Procaccini, Marco, Luk, Wayne, Gaydadjiev, G., and Giorgi, Roberto
- Subjects
- Graph processing, Grid partitioning, Accelerators, Distributed computing, FPGA
- Abstract
Processing large-scale graphs is challenging due to the nature of the computation, which causes irregular memory access patterns. Managing such irregular accesses may cause significant performance degradation on both CPUs and GPUs. Thus, recent research trends propose graph processing acceleration with Field-Programmable Gate Arrays (FPGAs). FPGAs are programmable hardware devices that can be fully customised to perform specific tasks in a highly parallel and efficient manner. However, FPGAs have a limited amount of on-chip memory that cannot fit the entire graph. Due to the limited device memory size, data needs to be repeatedly transferred to and from the FPGA on-chip memory, which makes data transfer time dominate the computation time. A possible way to overcome the FPGA accelerators' resource limitation is to engage a multi-FPGA distributed architecture and use an efficient partitioning scheme. Such a scheme aims to increase data locality and minimise communication between different partitions. This work proposes an FPGA processing engine that overlaps, hides and customises all data transfers so that the FPGA accelerator is fully utilised. This engine is integrated into a framework for using FPGA clusters and is able to use an offline partitioning method to facilitate the distribution of large-scale graphs. The proposed framework uses Hadoop at a higher level to map a graph to the underlying hardware platform. The higher layer of computation is responsible for gathering the blocks of data that have been pre-processed and stored on the host's file system and distributing them to a lower layer of computation made of FPGAs. We show how graph partitioning combined with an FPGA architecture leads to high performance, even when the graph has millions of vertices and billions of edges.
In the case of the PageRank algorithm, widely used for ranking the importance of nodes in a graph, our implementation is the fastest compared to state-of-the-art CPU and GPU solutions, achieving a speedup of 13x compared to 8x and 3x respectively. Moreover, for large-scale graphs the GPU solution fails due to memory limitations, while the CPU solution achieves a speedup of 12x compared to the 26x achieved by our FPGA solution. Other state-of-the-art FPGA solutions are 28 times slower than our proposed solution. When the size of a graph limits the performance of a single FPGA device, our performance model shows that using multiple FPGAs in a distributed system can further improve performance by about 12x. This highlights the efficiency of our implementation for large datasets that do not fit in the on-chip memory of a hardware device.
- Published
- 2023
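The grid-partitioning idea described in this abstract can be sketched in software: edges are bucketed into (source block, destination block) tiles so that each tile touches only a small slice of the vertex array, mimicking how an accelerator streams tiles through limited on-chip memory. The graph, block size, and function names below are illustrative, not the paper's implementation.

```python
# Hypothetical sketch of grid-partitioned PageRank: edges are grouped into
# (src_block, dst_block) tiles, and each iteration processes one tile at a
# time, so only a small vertex slice is "resident" at once.

def pagerank_grid(n, edges, block=2, d=0.85, iters=20):
    out_deg = [0] * n
    for s, t in edges:
        out_deg[s] += 1
    # Bucket edges into grid tiles.
    tiles = {}
    for s, t in edges:
        tiles.setdefault((s // block, t // block), []).append((s, t))
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - d) / n] * n
        for tile in tiles.values():          # one tile at a time
            for s, t in tile:
                new[t] += d * rank[s] / out_deg[s]
        rank = new
    return rank

ranks = pagerank_grid(4, [(0, 1), (1, 2), (2, 0), (3, 0)])
```

Because each tile only reads a contiguous source slice and writes a contiguous destination slice, the same loop structure maps naturally onto banked on-chip buffers.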
7. Automated Framework for General-Purpose Genetic Algorithms in FPGAs
- Author
- Guo, Liucheng, Thomas, David B., Luk, Wayne, Esparcia-Alcázar, Anna I., editor, and Mora, Antonio M., editor
- Published
- 2014
- Full Text
- View/download PDF
8. Parametric Optimization of Reconfigurable Designs Using Machine Learning
- Author
- Kurek, Maciej, Becker, Tobias, Luk, Wayne, Brisk, Philip, editor, de Figueiredo Coutinho, José Gabriel, editor, and Diniz, Pedro C., editor
- Published
- 2013
- Full Text
- View/download PDF
9. A Large-Scale Spiking Neural Network Accelerator for FPGA Systems
- Author
- Cheung, Kit, Schultz, Simon R., Luk, Wayne, Villa, Alessandro E. P., editor, Duch, Włodzisław, editor, Érdi, Péter, editor, Masulli, Francesco, editor, and Palm, Günther, editor
- Published
- 2012
- Full Text
- View/download PDF
10. Pipelined HAC Estimation Engines for Multivariate Time Series
- Author
- Guo, Ce and Luk, Wayne
- Published
- 2014
- Full Text
- View/download PDF
11. FLiMS: A Fast Lightweight 2-Way Merger for Sorting.
- Author
- Papaphilippou, Philippos, Luk, Wayne, and Brooks, Chris
- Subjects
- Mergers & acquisitions, Field programmable gate arrays
- Abstract
In this paper, we present FLiMS, a highly efficient and simple parallel algorithm for merging two sorted lists residing in banked and/or wide memory. On FPGAs, its implementation uses fewer hardware resources than the state-of-the-art alternatives, due to the reduced number of comparators and the elimination of redundant logic found in prior attempts. In combination with the distributed nature of the selector stage, higher performance is achieved for the same or greater degree of parallelism. This is useful in many applications, such as parallel merge trees for high-throughput sorting, where the resource utilisation of the merger is critical for building large trees and internalising the workload for fast computation. Also presented are efficient variations of FLiMS for optimising throughput on skewed datasets, achieving stable sorting, or using fewer dequeue signals. Additionally, FLiMS is shown to perform well as conventional software on modern CPUs supporting single-instruction multiple-data (SIMD) instructions, surpassing the performance of some standard libraries for sorting.
- Published
- 2022
- Full Text
- View/download PDF
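One way to see the selection step behind hardware mergers of this kind (a generic sketch, not the exact FLiMS datapath): given the P smallest items of each ascending input, the elementwise minimum of one head against the other head reversed yields exactly the P smallest items of their union, a bitonic half-cleaner property that needs only P comparators per cycle.

```python
# Sketch of the pairwise selection step used by parallel 2-way mergers:
# min(a[i], b[P-1-i]) over the reversed second head selects the P smallest
# items of a + b (the selected items are bitonic, hence the final sort).

def select_p_smallest(a, b):
    """a, b: ascending lists of equal length P; returns the P smallest of a+b."""
    p = len(a)
    lower = [min(a[i], b[p - 1 - i]) for i in range(p)]
    return sorted(lower)   # in hardware: a few more bitonic stages instead

smallest = select_p_smallest([1, 3, 5, 7], [2, 4, 6, 8])
```

In a streaming merger this step repeats every cycle, with each input FIFO advancing by however many of its items were selected.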
12. High Quality Uniform Random Number Generation Using LUT Optimised State-transition Matrices
- Author
-
Thomas, David B. and Luk, Wayne
- Published
- 2007
- Full Text
- View/download PDF
13. Enhancing High-Level Synthesis Using a Meta-Programming Approach.
- Author
- Vandebon, Jessica, de Figueiredo Coutinho, Jose, Luk, Wayne, and Nurvitadhi, Eriko
- Subjects
- Heterogeneous computing, Supply & demand, Architectural details, Python programming language, Magnitude (mathematics), Task analysis
- Abstract
In today's increasingly heterogeneous compute landscape, there is high demand for design tools that offer seemingly contradictory features: portable programming abstractions that hide underlying architectural detail, and the capability to optimise and exploit architectural features. Our meta-programming approach, Artisan, decouples application functionality from optimisation concerns to address the complexity of mapping high-level application descriptions onto heterogeneous platforms from which they are abstracted. With Artisan, application experts focus on algorithmic behaviour, while platform and domain experts focus on optimisation and mapping. Artisan offers complete design-flow orchestration in a unified programming environment based on Python 3 to enable accessible codification of reusable optimisation strategies that can be automatically applied to high-level application descriptions. We have developed and evaluated an Artisan prototype and a set of customised meta-programs used to automatically optimise six case study applications for CPU+FPGA targets. In our experiments, Artisan-optimised designs achieve the same order of magnitude speedup as manually optimised designs compared to corresponding unoptimised software.
- Published
- 2021
- Full Text
- View/download PDF
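The separation of concerns this abstract describes can be illustrated with a toy analogy (this is not the real Artisan API, just the general pattern): the application function stays untouched, and a separately authored strategy is applied to it by a meta-program.

```python
# Toy illustration of decoupling functionality from optimisation:
# the "application expert" writes fib; the "platform expert" writes and
# applies a reusable strategy without editing the application source.

import functools

def apply_strategy(fn, strategy):
    return strategy(fn)

def memoise(fn):                      # one reusable optimisation strategy
    return functools.lru_cache(maxsize=None)(fn)

def fib(n):                           # plain application code, no tuning logic
    return n if n < 2 else fib(n - 1) + fib(n - 2)

fib = apply_strategy(fib, memoise)    # strategy applied externally
result = fib(30)
```

Real meta-programming tools operate on the program's source or AST rather than on function objects, but the division of roles is the same.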
14. NeuroFlow: A General Purpose Spiking Neural Network Simulation Platform using Customizable Processors.
- Author
- Cheung, Kit, Schultz, Simon R., Luk, Wayne, Smith, Leslie Samuel, and Camunas, Luis
- Subjects
- Biological neural networks, Field programmable gate arrays, Neuromorphics
- Abstract
NeuroFlow is a scalable spiking neural network simulation platform for off-the-shelf high performance computing systems using customizable hardware processors such as Field-Programmable Gate Arrays (FPGAs). Unlike multi-core processors and application-specific integrated circuits, the processor architecture of NeuroFlow can be redesigned and reconfigured to suit a particular simulation to deliver optimized performance, such as the degree of parallelism to employ. The compilation process supports using PyNN, a simulator-independent neural network description language, to configure the processor. NeuroFlow supports a number of commonly used current- or conductance-based neuronal models, such as the integrate-and-fire and Izhikevich models, and the spike-timing-dependent plasticity (STDP) rule for learning. A 6-FPGA system can simulate a network of up to ~600,000 neurons and achieves real-time performance for up to 400,000 neurons. Using one FPGA, NeuroFlow delivers a speedup of up to 33.6 times over an 8-core processor, or 2.83 times over GPU-based platforms. With high flexibility and throughput, NeuroFlow provides a viable environment for large-scale neural network simulation.
- Published
- 2016
- Full Text
- View/download PDF
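The per-neuron update that such platforms parallelise can be sketched with the simplest model family mentioned above, leaky integrate-and-fire (all parameter values here are illustrative defaults, not NeuroFlow's):

```python
# Minimal leaky integrate-and-fire step: integrate the membrane potential,
# then spike and reset if it crosses threshold. Izhikevich and
# conductance-based models follow the same update/spike/reset pattern.

def lif_step(v, i_in, dt=1.0, tau=20.0, v_rest=-65.0,
             v_thresh=-50.0, v_reset=-65.0):
    """Advance one neuron by dt ms; return (new_v, spiked)."""
    v = v + dt * ((v_rest - v) / tau + i_in)
    if v >= v_thresh:
        return v_reset, True
    return v, False

# Drive one neuron with constant input for 100 timesteps.
v, spikes = -65.0, 0
for _ in range(100):
    v, fired = lif_step(v, i_in=1.0)
    spikes += fired
```

A hardware simulator evaluates this update for every neuron in every timestep, which is why the degree of parallelism is the key tuning knob.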
15. On the use of programmable hardware and reduced numerical precision in earth-system modeling.
- Author
- Düben, Peter D., Russell, Francis P., Niu, Xinyu, Luk, Wayne, and Palmer, T. N.
- Subjects
- Geophysical fluid dynamics, Field programmable gate arrays, Weather forecasting, Computational fluid dynamics, Chaos theory, Earth system science
- Abstract
Programmable hardware, in particular Field Programmable Gate Arrays (FPGAs), promises a significant increase in computational performance for simulations in geophysical fluid dynamics compared with CPUs of similar power consumption. FPGAs allow adjusting the representation of floating-point numbers to specific application needs. We analyze the performance-precision trade-off on FPGA hardware for the two-scale Lorenz '95 model. We scale the size of this toy model to that of a high-performance computing application in order to make meaningful performance tests. We identify the minimal level of precision at which changes in model results are not significant compared with a maximal precision version of the model and find that this level is very similar for cases where the model is integrated for very short or long intervals. It is therefore a useful approach to investigate model errors due to rounding errors for very short simulations (e.g., 50 time steps) to obtain a range for the level of precision that can be used in expensive long-term simulations. We also show that an approach to reduce precision with increasing forecast time, when model errors are already accumulated, is very promising. We show that a speed-up of 1.9 times is possible in comparison to FPGA simulations in single precision if precision is reduced with no strong change in model error. The single-precision FPGA setup shows a speed-up of 2.8 times in comparison to our model implementation on two 6-core CPUs for large model setups.
- Published
- 2015
- Full Text
- View/download PDF
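The precision experiment can be emulated in ordinary software (a hedged sketch, not the paper's setup): round every intermediate result to a chosen number of mantissa bits, run a short Lorenz '95 integration, and compare against full precision. The model size, bit widths, and step count below are illustrative.

```python
# Emulate a reduced-precision datapath by rounding each result to a given
# number of mantissa bits, then compare short Lorenz '95 trajectories.

import math

def reduce_precision(x, bits):
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)
    return math.ldexp(round(m * 2**bits) / 2**bits, e)

def lorenz95_step(x, f=8.0, dt=0.01, bits=52):
    n = len(x)
    r = lambda v: reduce_precision(v, bits)
    # dx_i/dt = (x_{i+1} - x_{i-2}) * x_{i-1} - x_i + F, forward Euler
    return [r(x[i] + dt * ((x[(i + 1) % n] - x[(i - 2) % n]) * x[(i - 1) % n]
                           - x[i] + f)) for i in range(n)]

x_full = x_low = [8.0, 8.01, 8.0, 8.0, 8.0, 8.0]
for _ in range(50):
    x_full = lorenz95_step(x_full, bits=52)
    x_low = lorenz95_step(x_low, bits=12)
err = max(abs(a - b) for a, b in zip(x_full, x_low))
```

Over a short horizon the divergence stays small; in a chaotic model it grows with integration time, which is why the paper compares short and long runs.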
16. A Reconfigurable Computing Approach for Efficient and Scalable Parallel Graph Exploration.
- Author
- Betkaoui, Brahim, Wang, Yu, Thomas, David B., and Luk, Wayne
- Abstract
In many application domains, data are represented using large graphs involving millions of vertices and billions of edges. Graph exploration algorithms, such as breadth-first search (BFS), are largely dominated by memory latency and are challenging to process efficiently. In this paper, we present a reconfigurable hardware methodology for efficient parallel processing of large-scale graph exploration problems. Our methodology is based on a reconfigurable hardware architecture which decouples computation and communication while keeping multiple memory requests in flight at any given time, taking advantage of the hardware capabilities of both FPGAs and the parallel memory subsystem. To validate our methodology, we provide a detailed design description of the Breadth-First Search algorithm on an FPGA-based high performance computing system. Using graph data based on the power-law graphs found in real-world problems, we are able to achieve performance results that are superior to those of high performance multi-core systems in the recent literature for large graph instances, and a throughput in excess of 2.5 billion traversed edges per second on RMAT graphs with 16 million vertices and over a billion edges. Using four Virtex-5 LX330 FPGAs based on 65 nm technology and running at 75 MHz, our BFS design achieves more than twice the speed of a 32-core Xeon X7560 based on 45 nm technology and running at 2.26 GHz.
- Published
- 2012
- Full Text
- View/download PDF
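The algorithm being accelerated here is level-synchronous BFS, which can be sketched as follows; the point of the hardware design is that the per-frontier inner loops issue many independent memory requests that can be kept in flight concurrently.

```python
# Level-synchronous BFS: expand the whole frontier one level at a time.
# In hardware, the vertices of a frontier are processed in parallel.

from collections import defaultdict

def bfs_levels(edges, root):
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    level = {root: 0}
    frontier = [root]
    while frontier:
        nxt = []
        for u in frontier:
            for v in adj[u]:
                if v not in level:
                    level[v] = level[u] + 1
                    nxt.append(v)
        frontier = nxt
    return level

levels = bfs_levels([(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)], 0)
```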
17. Specifying Compiler Strategies for FPGA-based Systems.
- Author
- Cardoso, João M.P., Teixeira, João, Alves, Jose C., Nobre, Ricardo, Diniz, Pedro C., Coutinho, Jose G.F., and Luk, Wayne
- Abstract
The development of applications for high-performance Field Programmable Gate Array (FPGA) based embedded systems is a long and error-prone process. Typically, developers need to be deeply involved in all the stages of the translation and optimization of an application described in a high-level programming language to a lower-level design description to ensure the solution meets the required functionality and performance. This paper describes the use of a novel aspect-oriented hardware/software design approach for FPGA-based embedded platforms. The design-flow uses LARA, a domain-specific aspect-oriented programming language designed to capture high-level specifications of compilation and mapping strategies, including sequences of data/computation transformations and optimizations. With LARA, developers are able to guide a design-flow to partition and map an application between hardware and software components. We illustrate the use of LARA on two complex real-life applications using high-level compilation and synthesis strategies for achieving complete hardware/software implementations with speedups of 2.5x and 6.8x over software-only implementations. By allowing developers to maintain a single application source code, this approach promotes developer productivity as well as code and performance portability.
- Published
- 2012
- Full Text
- View/download PDF
18. A Mixed Precision Methodology for Mathematical Optimisation.
- Author
- Chow, Gary C.T., Luk, Wayne, and Leong, Philip H.W.
- Abstract
This paper introduces a novel mixed precision methodology for mathematical optimisation. It involves the use of reduced precision FPGA optimisers for searching potential regions containing the global optimum, and double precision optimisers on a general purpose processor (GPP) for verifying the results. An empirical method is proposed to determine parameters of the mixed precision methodology running on a reconfigurable accelerator consisting of FPGA and GPP. The effectiveness of our approach is evaluated using a set of optimisation benchmarks. Using our mixed precision methodology and a modern reconfigurable accelerator, we can locate the global optima 1.7 to 6 times faster than a quad-core optimiser. The mixed precision optimisations search up to 40.3 times more starting vectors per unit time than quad-core optimisers, and only 0.7% to 2.7% of these searches are refined using GPP double precision optimisers. The proposed methodology also allows us to accelerate problems with more complicated functions or to solve problems involving higher dimensions.
- Published
- 2012
- Full Text
- View/download PDF
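The two-tier scheme described above can be sketched in a few lines (a generic illustration, not the paper's optimiser): a cheap low-precision pass ranks many starting points, and only the most promising few are refined at full precision. The bit width, objective, and step sizes below are all assumptions.

```python
# Mixed-precision search sketch: coarse low-precision screening of starting
# points, then full-precision gradient-descent refinement of the survivors.

def low_precision(f, x, bits=8):
    scale = 2.0 ** bits
    return round(f(x) * scale) / scale     # crude reduced-precision evaluation

def mixed_precision_minimise(f, starts, keep=3):
    coarse = sorted(starts, key=lambda x: low_precision(f, x))[:keep]
    best = None
    for x in coarse:                        # full-precision refinement
        for _ in range(200):
            grad = (f(x + 1e-6) - f(x - 1e-6)) / 2e-6
            x -= 1e-3 * grad
        if best is None or f(x) < f(best):
            best = x
    return best

xmin = mixed_precision_minimise(lambda x: (x - 1.5) ** 2,
                                [i * 0.5 for i in range(-10, 11)])
```

The economics mirror the abstract: the expensive full-precision stage runs on only a small fraction of the candidate starting vectors.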
19. Field‐programmable gate arrays and quantum Monte Carlo: Power efficient coprocessing for scalable high‐performance computing.
- Author
- Cardamone, Salvatore, Kimmitt, Jonathan R. R., Burton, Hugh G. A., Todman, Timothy J., Li, Shurui, Luk, Wayne, and Thom, Alex J. W.
- Subjects
- Gate array circuits, Quantum gates, Field programmable gate arrays, Parallel processing, Scientific computing, Monte Carlo method, Units of measurement
- Abstract
Massively parallel architectures offer the potential to significantly accelerate an application relative to their serial counterparts. However, not all applications exhibit an adequate level of data and/or task parallelism to exploit such platforms. Furthermore, the power consumption associated with these forms of computation renders "scaling out" for exascale levels of performance incompatible with modern sustainable energy policies. In this work, we investigate the potential for field-programmable gate arrays (FPGAs) to feature in future exascale platforms, and their capacity to improve performance per unit power measurements for the purposes of scientific computing. We have focused our efforts on variational Monte Carlo, and report on the benefits of coprocessing with an FPGA relative to a purely multicore system.
- Published
- 2019
- Full Text
- View/download PDF
20. FP-BNN: Binarized neural network on FPGA.
- Author
- Liang, Shuang, Yin, Shouyi, Liu, Leibo, Luk, Wayne, and Wei, Shaojun
- Subjects
- Artificial neural networks, Field programmable gate arrays, Computer vision, Artificial intelligence, Energy consumption
- Abstract
Deep neural networks (DNNs) have attracted significant attention for their excellent accuracy especially in areas such as computer vision and artificial intelligence. To enhance their performance, technologies for their hardware acceleration are being studied. FPGA technology is a promising choice for hardware acceleration, given its low power consumption and high flexibility which makes it suitable particularly for embedded systems. However, complex DNN models may need more computing and memory resources than those available in many current FPGAs. This paper presents FP-BNN, a binarized neural network (BNN) for FPGAs, which drastically cuts down the hardware consumption while maintaining acceptable accuracy. We introduce a Resource-Aware Model Analysis (RAMA) method, and remove the bottleneck involving multipliers by bit-level XNOR and shifting operations, and the bottleneck of parameter access by data quantization and optimized on-chip storage. We evaluate the FP-BNN accelerator designs for MNIST multi-layer perceptrons (MLP), Cifar-10 ConvNet, and AlexNet on a Stratix-V FPGA system. Inference performance of tera-operations per second with acceptable accuracy loss is obtained, which shows improvement in speed and energy efficiency over other computing platforms.
- Published
- 2018
- Full Text
- View/download PDF
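The multiplier-removal trick mentioned in the abstract is the standard binarized dot product, which can be shown directly (a generic sketch, not FP-BNN's actual datapath): with weights and activations constrained to {-1, +1} and packed as bits, a dot product becomes an XNOR followed by a population count.

```python
# Binarized dot product via XNOR + popcount: bit i = 1 encodes +1, bit i = 0
# encodes -1. XNOR marks positions where the two vectors agree, so the dot
# product equals (#matches - #mismatches).

def bin_dot(a_bits, w_bits, n):
    """Dot product of two {-1,+1} vectors of length n packed as n-bit ints."""
    xnor = ~(a_bits ^ w_bits) & ((1 << n) - 1)
    matches = bin(xnor).count("1")
    return 2 * matches - n

# Two example 4-bit vectors.
dp = bin_dot(0b1011, 0b1101, 4)
```

On an FPGA the XNOR is free wiring-wise and the popcount is a small adder tree, which is exactly why the multipliers disappear.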
21. Exploiting the chaotic behaviour of atmospheric models with reconfigurable architectures.
- Author
- Russell, Francis P., Düben, Peter D., Niu, Xinyu, Luk, Wayne, and Palmer, T.N.
- Subjects
- Computer architecture, Data libraries, Chaos theory, Atmospheric models
- Abstract
Reconfigurable architectures are becoming mainstream: Amazon, Microsoft and IBM are supporting such architectures in their data centres. The computationally intensive nature of atmospheric modelling is an attractive target for hardware acceleration using reconfigurable computing. Performance of hardware designs can be improved through the use of reduced-precision arithmetic, but maintaining appropriate accuracy is essential. We explore reduced-precision optimisation for simulating chaotic systems, targeting atmospheric modelling, in which even minor changes in arithmetic behaviour will cause simulations to diverge quickly. The possibility of equally valid simulations having differing outcomes means that standard techniques for comparing numerical accuracy are inappropriate. We use the Hellinger distance to compare statistical behaviour between reduced-precision CPU implementations to guide reconfigurable designs of a chaotic system, then analyse accuracy, performance and power efficiency of the resulting implementations. Our results show that with only a limited loss in accuracy corresponding to less than 10% uncertainty in input parameters, the throughput and energy efficiency of a single-precision chaotic system implemented on a Xilinx Virtex-6 SX475T Field Programmable Gate Array (FPGA) can be more than doubled.
- Published
- 2017
- Full Text
- View/download PDF
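The statistical comparison used here, the Hellinger distance between distributions of model output, is a short formula worth stating concretely; it stays meaningful even when individual chaotic trajectories diverge.

```python
# Hellinger distance between two discrete distributions p and q:
# H(p, q) = sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2), ranging from
# 0 (identical) to 1 (disjoint support).

import math

def hellinger(p, q):
    """p, q: discrete distributions (e.g. normalised histograms) summing to 1."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))

d_same = hellinger([0.25, 0.25, 0.5], [0.25, 0.25, 0.5])
d_diff = hellinger([1.0, 0.0], [0.0, 1.0])
```

In this setting p and q would be histograms of model output from full-precision and reduced-precision runs, so a small distance indicates statistically equivalent behaviour.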
22. In-circuit tuning of deep learning designs.
- Author
- Que, Zhiqiang, Noronha, Daniel Holanda, Zhao, Ruizhe, Niu, Xinyu, Wilton, Steven J.E., and Luk, Wayne
- Subjects
- Deep learning, Debugging, Overlay networks, Gate array circuits, Instructional systems
- Abstract
This paper presents OTune, a novel overlay-based approach for rapid in-circuit debugging and tuning of Deep Neural Network (DNN) designs targeting Field-Programmable Gate Arrays (FPGAs). We first propose overlay-based instruments that provide hardware profiling information to FPGA-based DNN developers for tuning and debugging their designs. Our instrumentation is optimized to take advantage of characteristics of the DNN application domain and traces useful information for in-circuit domain-specific development. In addition, a lightweight overlay-based DNN processing engine is implemented to support rapid word-length tuning, which allows adjusting each DNN layer's datapath without time-consuming FPGA compilation. Furthermore, our approach enables tuning of FPGA-based DNN designs for edge systems, which would benefit developing adaptive learning systems. Evaluation results show that OTune can tune a fixed-point design to the same accuracy as a floating-point one with less than 4% added FPGA area.
- Published
- 2021
- Full Text
- View/download PDF
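The word-length tuning loop this enables can be emulated in software (a generic fixed-point sketch; the Q2.6 format and weight values below are illustrative, not OTune's): quantise a layer's values to a candidate format and measure the error, with no hardware recompilation in the loop.

```python
# Quantise values to a signed fixed-point format with int_bits integer bits
# (including sign) and frac_bits fractional bits, with saturation, then
# measure the worst-case quantisation error for a candidate word length.

def quantise(x, int_bits, frac_bits):
    scale = 2 ** frac_bits
    lo = -(2 ** (int_bits - 1))
    hi = 2 ** (int_bits - 1) - 1.0 / scale
    return min(max(round(x * scale) / scale, lo), hi)

weights = [0.7071, -0.3333, 1.25, -0.01]        # hypothetical layer values
q26 = [quantise(w, 2, 6) for w in weights]      # candidate Q2.6 format
err = max(abs(a - b) for a, b in zip(weights, q26))
```

Sweeping `frac_bits` per layer and keeping the smallest format whose error is acceptable is the essence of word-length tuning.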
23. Reconfigurable FPGA-based switching path frequency-domain echo canceller with applications to voice control device
- Author
- Yiu, Ka Fai Cedric, Lu, Yao, Hok Ho, Chun, Luk, Wayne, Huo, Jiaquan, and Nordholm, Sven
- Subjects
- Field programmable gate arrays, Telecommunication equipment, Echo suppression, Acoustic filters, Automatic speech recognition, Algorithms, Computer software
- Abstract
Acoustic echo control is of vital interest for hands-free operation of telecommunications equipment. An important property of an acoustic echo canceller is its capability to handle double-talk and to operate in real time. When applied to an intelligent voice control device, it is important to suppress the speech from the device and enhance the speech of the user for speech recognition, where double-talk situations frequently occur. In this paper, we propose a novel hardware architecture to support a robust adaptive algorithm in combination with a switching path model to tackle the double-talk situation. The proposed switching path model avoids adapting two filters at the same time during double-talk and prevents the disadvantage of the conventional two-path model. In order to achieve computational efficiency and to meet the rigorous timing requirements, the echo canceller operates in the frequency domain and its computing power is raised by a hardware accelerator implemented in the FPGA fabric surrounding a PowerPC on a Xilinx XUP V2P platform. Results obtained show the echo canceller is successful in handling double-talk situations, and the sub-band implementation has improved convergence significantly. An overall improvement of 82.5 times is achieved when a hardware accelerator is used to perform the critical part of the algorithm, compared with a pure software implementation running on a 300 MHz embedded PowerPC processor.
- Published
- 2012
- Full Text
- View/download PDF
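The core adaptive-filtering step can be sketched with a greatly simplified time-domain NLMS canceller (the paper's canceller runs in the frequency domain with a switching path model; the echo path, tap count, and step size below are all illustrative): the filter learns the echo path from the far-end signal and subtracts the predicted echo from the microphone signal.

```python
# Simplified NLMS echo canceller: estimate the echo path, predict the echo
# from the far-end signal, and output the residual after subtraction.

def nlms_cancel(far, mic, taps=4, mu=0.5, eps=1e-8):
    w = [0.0] * taps
    out = []
    for n in range(len(mic)):
        x = [far[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        echo_est = sum(wi * xi for wi, xi in zip(w, x))
        e = mic[n] - echo_est                 # residual after cancellation
        norm = eps + sum(xi * xi for xi in x)
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x)]
        out.append(e)
    return out, w

# Synthetic far-end signal (simple LCG noise) and a delay-and-attenuate echo.
far, s = [], 1
for _ in range(200):
    s = (1103515245 * s + 12345) % 2**31
    far.append(s / 2**30 - 1.0)
mic = [0.6 * far[n - 1] if n >= 1 else 0.0 for n in range(len(far))]
residual, w = nlms_cancel(far, mic)
```

A frequency-domain implementation performs the same update per sub-band on FFT blocks, which is what makes the rigorous real-time timing budget achievable in hardware.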