1. 623 Tflop/s HPCG run on Tianhe-2: Leveraging millions of hybrid cores
- Author
- Xianyi Zhang, Yutong Lu, Canqun Yang, Chao Yang, Fangfang Liu, Yiqun Liu, Xiangke Liao, Yunfei Du, and Min Xie
- Subjects
Distributed computing, Computer science, Symmetric multiprocessor system, Scale (descriptive set theory), Engineering and technology, Parallel computing, Hybrid algorithm, Theoretical Computer Science, Hardware and Architecture, Transfer (computing), Electrical engineering, electronic engineering, information engineering, Tianhe-2, Benchmark (computing), Overhead (computing), Artificial intelligence & image processing, Host (network), Software
- Abstract
In this article, we present a new hybrid algorithm to enable and scale the high-performance conjugate gradients (HPCG) benchmark on large-scale heterogeneous systems such as the Tianhe-2. Based on an inner–outer subdomain partitioning strategy, the data distribution between host and device can be balanced adaptively. The overhead of data movement from both MPI communication and PCI-E transfer can be significantly reduced by carefully rearranging and fusing operations. A variety of parallelization and optimization techniques for performance-critical kernels are exploited and analyzed to maximize the performance gain on both host and device. We carry out experiments on both a small heterogeneous computer and the world’s largest one, the Tianhe-2. On the small system, a thorough comparison and analysis is presented to select among different optimization choices. On Tianhe-2, the optimized implementation scales to the full-system level of 3.12 million heterogeneous cores, with an aggregated performance of 623 Tflop/s and a parallel efficiency of 81.2%.
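The inner–outer partitioning idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual scheme, so everything below is an assumption for illustration: a local 3D subdomain is split into an outer shell and an inner block, the shell is assumed to stay on the host (where MPI halo exchange happens), and the shell width is picked so that the host's share of grid points is close to a target fraction. The function names and the `host_share` parameter are hypothetical.

```python
def split_inner_outer(nx, ny, nz, width):
    """Return (inner_points, outer_points) for an outer shell of `width` cells.

    Assumption: the outer shell is handled by the host and the inner
    block by the device; the paper's real assignment may differ.
    """
    if width < 0:
        raise ValueError("width must be non-negative")
    # The inner block shrinks by `width` cells on each side of every axis.
    ix = max(nx - 2 * width, 0)
    iy = max(ny - 2 * width, 0)
    iz = max(nz - 2 * width, 0)
    inner = ix * iy * iz
    return inner, nx * ny * nz - inner


def choose_width(nx, ny, nz, host_share):
    """Pick the shell width whose outer fraction is closest to `host_share`.

    `host_share` is a hypothetical tuning knob: the fraction of grid
    points the host should process, e.g. derived from measured
    host/device throughput.
    """
    total = nx * ny * nz
    max_width = min(nx, ny, nz) // 2
    return min(
        range(max_width + 1),
        key=lambda w: abs(split_inner_outer(nx, ny, nz, w)[1] / total - host_share),
    )


if __name__ == "__main__":
    w = choose_width(10, 10, 10, host_share=0.5)
    print(w, split_inner_outer(10, 10, 10, w))
```

Re-running `choose_width` with an updated `host_share` is one simple way such a split could be rebalanced adaptively; the published implementation may use a different mechanism.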
- Published
- 2015