1. Fast and Scalable Sparse Triangular Solver for Multi-GPU Based HPC Architectures
- Authors
Jieyang Chen, Mark Raugas, Jesun Sahariar Firoz, Ang Li, Shuaiwen Leon Song, Chenhao Xie, Kevin J. Barker, and Jiajia Li
- Subjects
FOS: Computer and information sciences, Speedup, Computer science, Parallel computing, Solver, Supernode, Computer Science - Distributed, Parallel, and Cluster Computing, Hardware Architecture (cs.AR), Scalability, Synchronization (computer science), Overhead (computing), Distributed, Parallel, and Cluster Computing (cs.DC), Partitioned global address space, Computer Science - Hardware Architecture, Execution model
- Abstract
Designing efficient and scalable sparse linear algebra kernels on modern multi-GPU HPC systems is a challenging task due to highly irregular memory references and workload imbalance across GPUs. These challenges are particularly compounded in the case of the Sparse Triangular Solver (SpTRSV), which introduces the additional complexity of two-dimensional computation dependencies among subsequent computation steps. Dependency information may need to be exchanged and shared among GPUs, warranting efficient memory allocation, data partitioning, and workload distribution, as well as fine-grained communication and synchronization support. In this work, we focus on designing an SpTRSV algorithm for a single-node, multi-GPU setting. We demonstrate that directly adopting unified memory can adversely affect the performance of SpTRSV on multi-GPU architectures, even when the GPUs are linked via fast interconnects such as NVLink and NVSwitch. Instead, we employ the latest NVSHMEM technology, based on the Partitioned Global Address Space (PGAS) programming model, to enable efficient fine-grained communication and drastically reduce synchronization overhead. Furthermore, to handle workload imbalance, we propose a malleable task-pool execution model that further improves GPU utilization. Applying these techniques, our experiments on the NVIDIA V100 DGX-1 and DGX-2 multi-GPU supernode systems demonstrate that our design achieves an average speedup of 3.53× (up to 9.86×) on a DGX-1 system and 3.66× (up to 9.64×) on a DGX-2 system with four GPUs over the unified-memory design. Comprehensive sensitivity and scalability studies further show that the proposed zero-copy SpTRSV fully utilizes the computing and communication resources of multi-GPU systems.
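To make the dependency structure the abstract refers to concrete, the following is a minimal single-threaded sketch of the SpTRSV operation itself (forward substitution on a lower-triangular matrix in CSR format). This is not the paper's multi-GPU algorithm; it is an illustrative baseline showing why row i cannot be computed before every earlier row it references, which is the serialization the paper's NVSHMEM-based design works around. The function name and CSR layout here are generic conventions, not identifiers from the paper.

```python
def sptrsv_lower(indptr, indices, data, b):
    """Solve L x = b for a sparse lower-triangular L stored in CSR form.

    Row i depends on every x[j] with j < i that appears in row i's
    sparsity pattern -- these row-to-row dependencies are exactly what
    makes SpTRSV hard to parallelize across GPUs.
    """
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        s = b[i]
        diag = None
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if j == i:
                diag = data[k]          # diagonal entry of row i
            else:
                s -= data[k] * x[j]     # consumes previously solved x[j]
        x[i] = s / diag
    return x
```

In practice, parallel SpTRSV implementations group rows with no mutual dependencies into "levels" that can be solved concurrently; distributing those levels across GPUs is where the communication and load-balance issues discussed in the abstract arise.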
- Published
- 2021