12 results for "Zhou, Xuegong"
Search Results
2. A High Performance Reconfigurable Hardware Architecture for Lightweight Convolutional Neural Network.
- Author
-
An, Fubang, Wang, Lingli, and Zhou, Xuegong
- Subjects
CONVOLUTIONAL neural networks, RECURRENT neural networks, NONLINEAR functions, ARCHITECTURAL design
- Abstract
Since the lightweight convolutional neural network EfficientNet was proposed by Google in 2019, the series of models has quickly become very popular due to its superior performance with a small number of parameters. However, the existing convolutional neural network hardware accelerators for EfficientNet still have much room to improve the performance of the depthwise convolution, squeeze-and-excitation module and nonlinear activation functions. In this paper, we first design a reconfigurable register array and computational kernel to accelerate the depthwise convolution. Next, we propose a vector unit to implement the nonlinear activation functions and the scale operation. An exchangeable-sequence dual-computational kernel architecture is proposed to improve the performance and the utilization. In addition, the memory architectures are designed to complete the hardware accelerator for the above computing architecture. Finally, in order to evaluate the performance of the hardware accelerator, the accelerator is implemented on a Xilinx XCVU37P FPGA. The results show that the proposed accelerator can work at a main system clock frequency of 300 MHz with the DSP kernel at 600 MHz. The performance of EfficientNet-B3 in our architecture can reach 69.50 FPS and 255.22 GOPS. Compared with the latest EfficientNet-B3 accelerator, which uses the same FPGA development board, the accelerator proposed in this paper achieves a 1.28-fold improvement in single-core performance and a 1.38-fold improvement in per-DSP performance. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
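The squeeze-and-excitation "scale" operation that the accelerator above implements in hardware can be stated compactly in software. A minimal NumPy sketch for reference only: the weight names `w_reduce`/`w_expand`, the (C, H, W) layout, and the shapes are illustrative assumptions, not the paper's design.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def silu(x):
    # SiLU / swish, the nonlinear activation used throughout EfficientNet
    return x * sigmoid(x)

def squeeze_excitation(fmap, w_reduce, w_expand):
    """Squeeze-and-excitation followed by the channel-wise scale operation.

    fmap: (C, H, W) feature map; w_reduce: (C, R); w_expand: (R, C).
    """
    squeezed = fmap.mean(axis=(1, 2))                     # squeeze: global average pool
    gate = sigmoid(silu(squeezed @ w_reduce) @ w_expand)  # excite: bottleneck + gate
    return fmap * gate[:, None, None]                     # scale: reweight each channel
```

In hardware, the vector unit described above would evaluate the sigmoid/SiLU nonlinearities and the final channel-wise multiply.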
3. Fast Exact NPN Classification by Co-Designing Canonical Form and Its Computation Algorithm.
- Author
-
Zhou, Xuegong, Wang, Lingli, and Mishchenko, Alan
- Subjects
*ALGORITHMS, *LOGIC design, *BOOLEAN functions, *CLASSIFICATION, *APPROXIMATION algorithms
- Abstract
NPN classification of Boolean functions is a powerful technique used in many practical applications, including logic synthesis, technology mapping, architecture exploration, circuit restructuring, and approximate logic synthesis. Computing the canonical form of a function is the most common approach to NPN classification. Exact classification of practical functions is an open problem because there are difficult functions beyond the capability of the state-of-the-art exact algorithms, which may take several months to compute a canonical form. This article proposes a new approach to exact NPN classification, in which a series of canonical forms and the algorithms to compute them are designed together. As a result, the runtime of the exact classification for difficult functions is effectively controlled by making both representation and computation cost-aware. Experimental results show that the proposed algorithm can perform exact classification of the worst-case 16-input functions in less than 3 minutes. This indicates that, for the first time, the problem of exact classification can be effectively solved for any Boolean function with up to 16 inputs arising in practical applications. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
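For context on what NPN classification computes: for very small input counts, a canonical form can be found by exhaustively enumerating the negation/permutation group. This brute-force sketch only illustrates the problem; it is nothing like the paper's co-designed canonical forms and algorithms, which scale to 16 inputs.

```python
from itertools import permutations

def npn_canonical(tt, n):
    """Smallest truth table reachable from tt (an integer whose bit i is
    f(i)) under input negation, input permutation, and output negation."""
    size = 1 << n
    full = (1 << size) - 1
    best = None
    for perm in permutations(range(n)):
        for neg in range(1 << n):               # input negation mask
            t = 0
            for i in range(size):
                j = i ^ neg                     # negate selected inputs
                k = 0
                for b in range(n):              # permute input bits
                    if (j >> b) & 1:
                        k |= 1 << perm[b]
                if (tt >> k) & 1:
                    t |= 1 << i
            for cand in (t, t ^ full):          # output negation
                if best is None or cand < best:
                    best = cand
    return best
```

Two functions are in the same NPN class exactly when their canonical forms match; for example, 2-input AND and OR share a class, while XOR does not.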
4. ARBSA: Adaptive Range-Based Simulated Annealing for FPGA Placement.
- Author
-
Yuan, Junqi, Chen, Jialing, Wang, Lingli, Zhou, Xuegong, Xia, Yinshui, and Hu, Jianping
- Subjects
SIMULATED annealing, DIGITAL signal processing, FIELD programmable gate arrays, RANDOM access memory
- Abstract
Placement has always been the most time-consuming part of the field programmable gate array (FPGA) compilation flow. Conventional simulated annealing has been unable to keep pace with ever increasing sizes of designs and FPGA chip resources. Without utilizing information of the circuit topology, it relies on large amounts of random swap operations, which are time-costly. This paper proposes an adaptive range-based algorithm to improve the behavior of swap operations and limit the swap distances by introducing the concept of a range-limiting strategy for nets. It avoids unnecessary design space exploration, and thus can converge to near-optimal solutions much more quickly. The experimental results are based on the Titan benchmarks, which contain 4K to 30K blocks, including logic array blocks, inputs and outputs, digital signal processors, and random access memories. This approach achieves 2.82x speedup, 4.8% reduction in wire length, and 4.1% improvement in critical path compared with the SA from VTR with wire length-driven optimization, and 1.78x speedup, 10% reduction in wire length, and 2% reduction in critical path with path timing-driven optimization. It also manifests better scalability on larger benchmarks. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
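The range-limiting idea above can be illustrated on a toy 1-D placement problem. This sketch is not the ARBSA algorithm itself: its net-based range strategy and adaptive schedule are simplified here to a plain shrinking swap radius, and the cooling constants are arbitrary.

```python
import math
import random

def range_limited_sa(nets, n_blocks, steps=20000, seed=1):
    """Toy 1-D placement: minimize total wirelength of two-pin nets with
    simulated annealing whose swap distance is range-limited and slowly
    shrinks, in the spirit of range-based SA placers."""
    rng = random.Random(seed)
    pos = list(range(n_blocks))                 # pos[block] = slot on a row
    rng.shuffle(pos)                            # random initial placement
    wirelength = lambda: sum(abs(pos[a] - pos[b]) for a, b in nets)
    start = cur = best = wirelength()
    temp, rlimit = float(n_blocks), float(n_blocks)
    for _ in range(steps):
        a = rng.randrange(n_blocks)
        # candidate slot is drawn only from within the current range limit
        lo = max(0, int(pos[a] - rlimit))
        hi = min(n_blocks - 1, int(pos[a] + rlimit))
        b = pos.index(rng.randint(lo, hi))      # block occupying that slot
        pos[a], pos[b] = pos[b], pos[a]
        new = wirelength()
        if new <= cur or rng.random() < math.exp((cur - new) / temp):
            cur = new                           # accept the swap
            best = min(best, cur)
        else:
            pos[a], pos[b] = pos[b], pos[a]     # reject: undo the swap
        temp *= 0.9995                          # cool down
        rlimit = max(1.0, rlimit * 0.9997)      # shrink the swap range
    return start, best
```

Limiting the swap radius is what prunes the unnecessary design-space exploration the abstract refers to: late in the anneal, only local refinements are attempted.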
5. An adaptive cross-layer fault recovery solution for reconfigurable SoCs.
- Author
-
Jin, Jifang, Yan, Jian, Zhou, Xuegong, and Wang, Lingli
- Published
- 2015
- Full Text
- View/download PDF
6. An FPGA-cluster-accelerated match engine for content-based image retrieval.
- Author
-
Liang, Chen, Wu, Chenlu, Zhou, Xuegong, Cao, Wei, Wang, Shengye, and Wang, Lingli
- Abstract
In this paper, a high-performance match engine for content-based image retrieval is proposed. Highly customized floating-point (FP) units are designed to provide the dynamic range and precision of standard FP units, but with considerably less area. Match calculation arrays with various architectures and scales are designed and evaluated. A CBIR system is built on a 12-FPGA cluster. Inter-FPGA connections are based on standard 10-Gigabit Ethernet. The whole FPGA cluster can compare a query image against 150 million library images within 10 seconds, based on detailed local features. Compared with the Intel Xeon 5650 server-based solution, our implementation is 11.35 times faster and 34.81 times more power efficient. [ABSTRACT FROM PUBLISHER]
- Published
- 2013
- Full Text
- View/download PDF
7. A hardware implementation of Bag of Words and Simhash for image recognition.
- Author
-
Wang, Shengye, Liang, Chen, Zhou, Xuegong, Cao, Wei, Wu, Chenlu, Fan, Xitian, and Wang, Lingli
- Abstract
Algorithms such as Bag of Words and Simhash have been widely used in image recognition. To achieve better performance as well as energy efficiency, a hardware implementation of these two algorithms is proposed in this paper. To the best of our knowledge, it is the first time that these algorithms have been implemented on hardware for image recognition purposes. The proposed implementation is able to generate a fingerprint of an image and find the closest match in the database accurately. It is implemented on Xilinx's Virtex-6 SX475T FPGA. Tradeoffs between high performance and low hardware overhead are obtained through proper parallelization. The experimental results show that the proposed implementation can process 1,018 images per second, approximately 17.8x faster than software on Intel's 12-thread Xeon X5650 processor. On the other hand, the power consumption is 0.35x that of the software-based implementation. Thus, the overall advantage in energy efficiency is as much as 46x. The proposed architecture is scalable, and is able to meet various requirements of image recognition. [ABSTRACT FROM PUBLISHER]
- Published
- 2013
- Full Text
- View/download PDF
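As background for the fingerprinting step described above, here is a plain-software Simhash sketch. The hash choice (MD5), 64-bit width, and string features are illustrative assumptions; the paper's hardware pipeline is not reproduced here.

```python
import hashlib

def simhash(features, bits=64):
    """Simhash fingerprint: hash each feature; every hash bit casts a
    +1/-1 vote per position, and the vote signs form the fingerprint.
    Near-duplicate feature sets land at small Hamming distance."""
    acc = [0] * bits
    for f in features:
        h = int.from_bytes(hashlib.md5(f.encode()).digest()[:bits // 8], "big")
        for i in range(bits):
            acc[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if acc[i] > 0)

def hamming(a, b):
    """Number of differing fingerprint bits."""
    return bin(a ^ b).count("1")
```

Finding the closest match in a database then reduces to minimizing the Hamming distance between fingerprints, an operation that parallelizes well in hardware.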
8. Implementation of high performance hardware architecture of OpenSURF algorithm on FPGA.
- Author
-
Fan, Xitian, Wu, Chenlu, Cao, Wei, Zhou, Xuegong, Wang, Shengye, and Wang, Lingli
- Abstract
This paper proposes a high performance hardware architecture of the Speeded Up Robust Features (SURF) algorithm based on OpenSURF. In order to achieve a high processing frame rate, the hardware architecture is designed with several characteristics. Firstly, a sliding window method is proposed to extract feature points in parallel at selected scale levels. As a result, the time cost of feature extraction can be greatly reduced. Secondly, a data reuse strategy is proposed in orientation generation and descriptor generation to reduce the number of memory accesses. In this way, 3.87x and 2.25x speedups are achieved respectively. Thirdly, the integral image is segmented and buffered in different memory blocks in order to support multiple data accesses in one clock cycle, which further reduces the overall computation time of our implementation. The hardware architecture is implemented on an XC6VSX475T FPGA running at 156 MHz, and its maximal frame rate for VGA format images can reach 356 frames per second (fps), which is 6.25 times the frame rate of OpenSURF running on a server with a Xeon 5650 processor, and 6 times the reported frame rate of the recent implementation on three Virtex-4 FPGAs [8]. [ABSTRACT FROM PUBLISHER]
- Published
- 2013
- Full Text
- View/download PDF
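The integral image that the third optimization above partitions across memory blocks is a standard summed-area table. A small reference sketch of its construction and the four-lookup box sum that SURF's box filters rely on:

```python
def integral_image(img):
    """Summed-area table: ii[y][x] = sum of img over the rectangle
    [0..y] x [0..x]. Any box sum then costs at most four lookups."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row = 0                                  # running sum of this row
        for x in range(w):
            row += img[y][x]
            ii[y][x] = row + (ii[y - 1][x] if y else 0)
    return ii

def box_sum(ii, y0, x0, y1, x1):
    """Inclusive box sum over [y0..y1] x [x0..x1] via corner lookups."""
    s = ii[y1][x1]
    if y0: s -= ii[y0 - 1][x1]
    if x0: s -= ii[y1][x0 - 1]
    if y0 and x0: s += ii[y0 - 1][x0 - 1]
    return s
```

Because each box sum touches up to four table entries, splitting the table across independent memory blocks lets the hardware serve several of those lookups in a single clock cycle.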
9. Repack: A packing algorithm to enhance timing and routability of a circuit.
- Author
-
Huang, Zheng, Li, Zhaotong, Wang, Na, Tao, Ping, Zhou, Xuegong, and Wang, Lingli
- Abstract
The increasing complexity of modern large circuits makes improving routability and timing performance a pressing and necessary problem. A novel packing algorithm called Repack, based on an enhanced packing attraction function, is presented; at the same time, an iterative CAD flow tool can reduce the interconnection resource requirement by applying CLB depopulation under a given routing channel width limitation and local congestion. Experimental results show that, for the non-iterative flow, Repack achieves 6.4% and 8.1% improvement in timing performance compared to T-VPack and iRAC respectively. For the iterative flow, compared to T-VPack, Repack achieves 12.6% and 37.6% improvement in area and routing path width respectively. Compared to iRAC, Repack has a 0.9% decrease in area, but an improvement of 16.2% in routing path width. [ABSTRACT FROM PUBLISHER]
- Published
- 2012
- Full Text
- View/download PDF
10. A modeling and mapping method for coarse/fine mixed-grained reconfigurable architecture.
- Author
-
Li, Zhaotong, Huang, Zheng, Chen, Shuai, Zhou, Xuegong, Cao, Wei, and Wang, Lingli
- Abstract
With the advantage of making reasonable trade-offs between performance and flexibility, reconfigurable architectures have drawn increasing attention. Automatic tools for mapping applications to reconfigurable architectures, however, are complicated and challenging to build, because a mapping tool is always tied to a specific reconfigurable architecture. In this paper, we explore a general modeling and mapping method for coarse/fine mixed-grained reconfigurable architectures (MGRAs) by reinventing the packing method in the traditional FPGA software flow, and propose a novel modeling method that can describe both fine- and coarse-grained reconfigurable architectures in XML format. After a detailed explanation of our proposed modeling and mapping method, we verify it by implementing a mapping tool for a reconfigurable architecture and mapping an FFT application onto it. The experiments demonstrate that our proposed method can be applied to MGRA modeling and mapping and is flexible enough to be extended to other reconfigurable architectures. [ABSTRACT FROM PUBLISHER]
- Published
- 2012
- Full Text
- View/download PDF
11. Framework of converting C++ class to hardware.
- Author
-
Zhao, Xueming, Zhou, Xuegong, and Wang, Lingli
- Published
- 2008
- Full Text
- View/download PDF
12. SPREAD: A Streaming-Based Partially Reconfigurable Architecture and Programming Model.
- Author
-
Wang, Ying, Zhou, Xuegong, Wang, Lingli, Yan, Jian, Luk, Wayne, Peng, Chenglian, and Tong, Jiarong
- Subjects
ELECTRIC switchgear, DATA encryption, FIELD programmable gate arrays, COMPUTING platforms, RESOURCE allocation, CRYPTOGRAPHIC equipment
- Abstract
Partially reconfigurable systems are promising computing platforms for streaming applications, which demand both hardware efficiency and reconfigurable flexibility. To realize the full potential of these systems, a streaming-based partially reconfigurable architecture and unified software/hardware multithreaded programming model (SPREAD) is presented in this paper. SPREAD is a reconfigurable architecture with a unified software/hardware thread interface and high throughput point-to-point streaming structure. It supports dynamic computing resource allocation, runtime software/hardware switching, and streaming-based multithreaded management at the operating system level. SPREAD is designed to provide programmers of streaming applications with a unified view of threads, allowing them to exploit thread, data, and pipeline parallelism; it enhances hardware efficiency while simplifying the development of streaming applications for partially reconfigurable systems. Experimental results targeting cryptography applications demonstrate the feasibility and superior performance of SPREAD. Moreover, the parallelized Advanced Encryption Standard (AES), Data Encryption Standard (DES), and Triple DES (3DES) hardware threads on field-programmable gate arrays show 1.61–4.59 times higher power efficiency than their implementations on state-of-the-art graphics processing units. [ABSTRACT FROM AUTHOR]
- Published
- 2013
- Full Text
- View/download PDF
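The point-to-point streaming structure of queue-connected threads that SPREAD exposes to programmers can be mimicked in plain software. This sketch is illustrative only: the stage/queue names and the None end-of-stream convention are assumptions, not SPREAD's actual thread interface.

```python
import queue
import threading

def stage(fn, q_in, q_out):
    """One streaming thread in a point-to-point pipeline: pull a token,
    transform it, push it downstream; None marks end-of-stream."""
    while True:
        item = q_in.get()
        if item is None:
            q_out.put(None)        # propagate end-of-stream
            return
        q_out.put(fn(item))

def run_pipeline(stages, items):
    """Connect stage functions with FIFO queues (the point-to-point
    streaming links) and run each stage as its own thread."""
    qs = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [threading.Thread(target=stage, args=(f, qs[i], qs[i + 1]))
               for i, f in enumerate(stages)]
    for t in threads:
        t.start()
    for x in items:
        qs[0].put(x)
    qs[0].put(None)
    out = []
    while (x := qs[-1].get()) is not None:
        out.append(x)
    for t in threads:
        t.join()
    return out
```

Because every stage has the same interface, a software thread here could in principle be swapped for a hardware one without changing its neighbors, which is the kind of runtime software/hardware switching the abstract describes.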
Discovery Service for Jio Institute Digital Library