3 results for "Size Zheng"
Search Results
2. SuSy
- Authors
- Jie Wang, Zhiru Zhang, Xiuping Cui, Christopher J. Hughes, Yi-Hsiang Lai, Yunshan Jia, Brendan Sullivan, Size Zheng, Hongbo Rong, Youhui Zhang, Nithin George, Jason Cong, Pradeep Dubey, Weihao Zhang, Yun Liang, and Jose Roberto Alvarez
- Subjects
Computer science, Systolic array, Parallel computing, Source lines of code, Software framework, Set (abstract data type), High-level synthesis, Programming paradigm, Benchmark (computing), Field-programmable gate array - Abstract
Systolic algorithms are among the killer applications for spatial architectures such as FPGAs and CGRAs. However, designing and implementing a high-performance systolic array for a given algorithm with the traditional RTL-based methodology requires a tremendous amount of human effort. On the other hand, existing high-level synthesis (HLS) tools either (1) force programmers to do "micro-coding", where many optimizations must be carried out through tedious code restructuring and insertion of vendor-specific pragmas, or (2) give them too little control over a push-button compilation flow to achieve high quality of results. To tackle these challenges, we introduce SuSy, a programming framework composed of a domain-specific language (DSL) and a compilation flow that enables programmers to productively build high-performance systolic arrays on FPGAs. With SuSy, programmers express the design functionality in the form of uniform recurrence equations (UREs), which can describe algorithms from a wide spectrum of applications as long as the underlying computation has a uniform dependence structure. The URE description in SuSy is followed by a set of decoupled spatial mapping primitives that specify how to map the equations to a spatial architecture. Concretely, programmers can apply space-time transformations and several other memory and I/O optimizations to productively build a highly efficient systolic architecture. Experimental results show that SuSy can describe various algorithms with UREs and generate high-performance systolic arrays through spatial optimizations. For instance, the SGEMM benchmark written in SuSy approaches the performance of a manual design optimized by experts while using 30× fewer lines of code.
- Published
- 2020
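The uniform recurrence equations mentioned in the SuSy abstract can be illustrated with matrix multiply. The following is a minimal plain-Python sketch, not SuSy's actual DSL: every value at point (i, j, k) depends only on neighbors at a constant offset, which is what makes the computation mappable to a systolic array. All names here (`matmul_ure`, the grids `a`, `b`, `c`) are hypothetical.

```python
def matmul_ure(A, B, I, J, K):
    """Compute C = A @ B by evaluating uniform recurrence equations.

    A is I x K, B is K x J. Each grid point (i, j, k) reads only from
    a fixed-offset neighbor, i.e. the dependence structure is uniform.
    """
    # 3-D grids holding propagated operands and partial sums
    a = [[[0.0] * K for _ in range(J)] for _ in range(I)]
    b = [[[0.0] * K for _ in range(J)] for _ in range(I)]
    c = [[[0.0] * K for _ in range(J)] for _ in range(I)]
    for i in range(I):
        for j in range(J):
            for k in range(K):
                # A values flow along j: uniform dependence (0, -1, 0)
                a[i][j][k] = A[i][k] if j == 0 else a[i][j - 1][k]
                # B values flow along i: uniform dependence (-1, 0, 0)
                b[i][j][k] = B[k][j] if i == 0 else b[i - 1][j][k]
                # partial sums accumulate along k: uniform dependence (0, 0, -1)
                prev = c[i][j][k - 1] if k > 0 else 0.0
                c[i][j][k] = prev + a[i][j][k] * b[i][j][k]
    # the result lives at the last k index of each (i, j) column
    return [[c[i][j][K - 1] for j in range(J)] for i in range(I)]
```

A space-time transformation, as described in the abstract, would then assign each grid point a processing element and a time step; the sketch above only captures the dependence structure that makes such a mapping legal.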
3. FlexTensor
- Authors
- Kaiwen Sheng, Shuo Wang, Yun Liang, Renze Chen, and Size Zheng
- Subjects
Schedule, Speedup, Xeon, Computer science, Heuristic (computer science), Optimizing compiler, Parallel computing, CUDA, Test case, Tensor (intrinsic definition) - Abstract
Tensor computation plays a paramount role in a broad range of domains, including machine learning, data analytics, and scientific computing. The wide adoption of tensor computation and its huge computation cost have led to high demand for flexible, portable, and high-performance library implementations on heterogeneous hardware accelerators such as GPUs and FPGAs. However, current tensor libraries mainly require programmers to manually design low-level implementations and optimize them from the algorithm, architecture, and compilation perspectives. Such a manual development process often takes months or even years, falling far behind the rapid evolution of application algorithms. In this paper, we introduce FlexTensor, a schedule exploration and optimization framework for tensor computation on heterogeneous systems. FlexTensor can optimize tensor computation programs without human intervention, allowing programmers to work only on high-level programming abstractions without considering hardware platform details. FlexTensor systematically explores optimization design spaces composed of many different schedules for different hardware. It then combines different exploration techniques, including heuristic and machine learning methods, to find an optimized schedule configuration. Finally, based on the results of the exploration, customized schedules are automatically generated for different hardware. In the experiments, we test 12 different kinds of tensor computations with hundreds of test cases in total. FlexTensor achieves an average 1.83x speedup on an NVIDIA V100 GPU compared to cuDNN; a 1.72x speedup on an Intel Xeon CPU compared to MKL-DNN for 2D convolution; a 1.5x speedup on a Xilinx VU9P FPGA compared to OpenCL baselines; and a 2.21x speedup on the NVIDIA V100 GPU compared to the state-of-the-art.
- Published
- 2020
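The schedule-exploration idea in the FlexTensor abstract can be sketched in plain Python. This is not FlexTensor's API; `blocked_matmul`, `explore_schedules`, and the tile-size design space are hypothetical illustrations. The design space here is a set of loop tile sizes for a blocked matrix multiply, and exhaustive timing stands in for FlexTensor's heuristic and machine-learning search.

```python
import time

def blocked_matmul(A, B, tile):
    # Naive blocked (tiled) square matrix multiply; `tile` is one
    # schedule parameter a framework like FlexTensor would search over.
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            C[i][j] += a * B[k][j]
    return C

def explore_schedules(A, B, tile_candidates):
    # Exhaustively time every candidate schedule and keep the fastest.
    # Real frameworks prune this space with heuristics or learned models.
    best_tile, best_time = None, float("inf")
    for tile in tile_candidates:
        start = time.perf_counter()
        blocked_matmul(A, B, tile)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_tile, best_time = tile, elapsed
    return best_tile, best_time
```

Every candidate schedule computes the same result; only its running time differs, which is why the search can pick purely on measured performance.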