Machine-Learning-Driven Runtime Optimization of BLAS Level 3 on Modern Multi-Core Systems

Authors :: Xia, Yufan
Barca, Giuseppe Maria Junior
Source :: 2024 International Parallel and Distributed Processing Symposium (IPDPS)
Publication Year :: 2024
Abstract :: BLAS Level 3 operations are essential for scientific computing, but finding the optimal number of threads for multi-threaded implementations on modern multi-core systems is challenging. We present an extension to the Architecture and Data-Structure Aware Linear Algebra (ADSALA) library that uses machine learning to optimize the runtime of all BLAS Level 3 operations. Our method predicts the best number of threads for each operation based on the matrix dimensions and the system architecture. We test our method on two HPC platforms with Intel and AMD processors, using MKL and BLIS as baseline BLAS implementations. We achieve speedups of 1.5 to 3.0 for all operations, compared to using the maximum number of threads. We also analyze the runtime patterns of different BLAS operations and explain the sources of speedup. Our work shows the effectiveness and generality of the ADSALA approach for optimizing BLAS routines on modern multi-core systems.
Comment :: Multi-Thread, Matrix Multiplication, Optimization, BLAS, Machine Learning

Subjects :: Computer Science - Distributed, Parallel, and Cluster Computing
Computer Science - Machine Learning

Database :: arXiv
Journal :: 2024 International Parallel and Distributed Processing Symposium (IPDPS)
Publication Type :: Report
Accession number :: edsarx.2406.19621
Document Type :: Working Paper
