Optimizing sparse general matrix–matrix multiplication for DCUs.
- Source :
- Journal of Supercomputing. Sep 2024, Vol. 80, Issue 14, p20176-20200. 25p.
- Publication Year :
- 2024
Abstract
- Sparse general matrix–matrix multiplication (SpGEMM) is a crucial and complex computational task in many practical applications. Improving the performance of SpGEMM on SIMT processors such as modern GPUs is challenging due to the unpredictable sparsity of sparse matrices. Although existing GPU solutions have made progress in improving performance through advanced algorithm design, they ignore some optimizations related to specific processor architectures, which can result in a partially inefficient implementation of their algorithms. This paper focuses on optimizing four inefficient parts of the NSparse algorithm on the DCU (a GPU-like accelerator). The optimizations include: 1) setting parameters to improve load balance for the second matrix by extracting its maximum row information at runtime; 2) reducing the overhead of binning operations by using registers and shared memory effectively; 3) improving numerical SpGEMM performance by adjusting its calculation mode; and 4) enhancing global load balance through finer-grained grouping and kernel configurations. Experimental results demonstrate that, compared to five state-of-the-art SpGEMM algorithms (bhSparse, KokkosKernels, NSparse, rocSparse, and spECK), our optimized method achieves average speedups of 7.99x (up to 18.2x), 8.01x (up to 20.83x), 2.37x (up to 6.16x), 1.82x (up to 4.20x), and 1.63x (up to 5.01x), respectively, on 29 sparse matrices with different sparse structures.
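- To make the abstract's description more concrete, the sketch below shows row-wise (Gustavson-style) SpGEMM on CSR matrices together with a per-row binning step by upper-bound product count, which is the general idea behind NSparse-style load balancing. This is an illustrative assumption, not the paper's DCU implementation: the names `row_upper_bound`, `bin_rows`, and `spgemm_row`, the power-of-two binning rule, and the dense-accumulator numeric phase are all hypothetical choices made for this example.

```c
/* Illustrative sketch only (not the paper's code): row-wise SpGEMM on CSR
 * matrices via Gustavson's algorithm, plus the kind of per-row binning by
 * upper-bound product count that NSparse-style GPU SpGEMM uses so that each
 * bin can be handled by a differently configured kernel. */
#include <stdio.h>

typedef struct { int n, m; const int *rowptr, *col; const double *val; } CSR;

/* Upper bound on intermediate products for row i of C = A*B:
 * sum over nonzeros A(i,k) of the nonzero count of row k of B. */
static int row_upper_bound(const CSR *A, const CSR *B, int i) {
    int ub = 0;
    for (int p = A->rowptr[i]; p < A->rowptr[i + 1]; ++p)
        ub += B->rowptr[A->col[p] + 1] - B->rowptr[A->col[p]];
    return ub;
}

/* Assign each row to a bin by the power-of-two bracket of its upper bound. */
static void bin_rows(const CSR *A, const CSR *B, int *bin_of_row) {
    for (int i = 0; i < A->n; ++i) {
        int ub = row_upper_bound(A, B, i), bin = 0;
        while ((1 << bin) < ub) ++bin;   /* bin 0: ub <= 1, bin 1: ub <= 2, ... */
        bin_of_row[i] = bin;
    }
}

/* Numeric phase for one row of C, using a dense accumulator of length B->m.
 * Returns the number of nonzeros written to cols/vals. */
static int spgemm_row(const CSR *A, const CSR *B, int i,
                      double *acc, char *flag, int *cols, double *vals) {
    int nnz = 0;
    for (int p = A->rowptr[i]; p < A->rowptr[i + 1]; ++p) {
        int k = A->col[p];
        double a = A->val[p];
        for (int q = B->rowptr[k]; q < B->rowptr[k + 1]; ++q) {
            int j = B->col[q];
            if (!flag[j]) { flag[j] = 1; cols[nnz++] = j; }  /* first touch of column j */
            acc[j] += a * B->val[q];
        }
    }
    for (int t = 0; t < nnz; ++t) {      /* gather results, reset scratch state */
        vals[t] = acc[cols[t]];
        acc[cols[t]] = 0.0;
        flag[cols[t]] = 0;
    }
    return nnz;
}

int main(void) {
    /* A = [[1,2],[0,3]], B = [[4,0],[5,6]]  =>  C = [[14,12],[15,18]] */
    int arp[] = {0, 2, 3}, acol[] = {0, 1, 1};   double aval[] = {1, 2, 3};
    int brp[] = {0, 1, 3}, bcol[] = {0, 0, 1};   double bval[] = {4, 5, 6};
    CSR A = {2, 2, arp, acol, aval}, B = {2, 2, brp, bcol, bval};

    int bins[2];
    bin_rows(&A, &B, bins);

    double acc[2] = {0}; char flag[2] = {0};
    int cols[2]; double vals[2];
    for (int i = 0; i < A.n; ++i) {
        int nnz = spgemm_row(&A, &B, i, acc, flag, cols, vals);
        printf("row %d (bin %d):", i, bins[i]);
        for (int t = 0; t < nnz; ++t) printf("  (%d, %g)", cols[t], vals[t]);
        printf("\n");
    }
    return 0;
}
```
- On a GPU-like accelerator such as the DCU, each bin produced by `bin_rows` would typically be processed by a kernel launch whose block size and accumulator type (registers, shared-memory hash table, or global memory) match that bin's product count; the paper's optimizations 2) and 4) concern exactly those per-bin resource and configuration choices.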
Details
- Language :
- English
- ISSN :
- 0920-8542
- Volume :
- 80
- Issue :
- 14
- Database :
- Academic Search Index
- Journal :
- Journal of Supercomputing
- Publication Type :
- Academic Journal
- Accession number :
- 178806519
- Full Text :
- https://doi.org/10.1007/s11227-024-06234-2