1. Stream-K++: Adaptive GPU GEMM Kernel Scheduling and Selection using Bloom Filters
- Author
Sadasivan, Harisankar; Osama, Muhammad; Podkorytov, Maksim; Huang, Carlus; Liu, Jun
- Subjects
Computer Science - Distributed, Parallel, and Cluster Computing; D.2; I.2
- Abstract
General matrix multiplication (GEMM) operations are crucial in various computational fields. As GPU architectures evolve, optimizing GEMM performance becomes increasingly important. This paper introduces Stream-K++, an enhancement to the promising Stream-K GEMM scheduling algorithm. We expand Stream-K's scheduling policies from three to seven and implement an efficient solution selection mechanism using Bloom filters. Our approach rapidly eliminates up to 95.8% of unsuitable configurations while maintaining a 100% true-negative rate. Implemented using the AMD Composable Kernel library and evaluated on AMD Instinct MI250X GPUs, Stream-K++ demonstrates significant performance gains (up to 43%) in select scenarios. It remains competitive (within 20% of optimal) for 60-97.6% of problem sizes. Our flexible framework, implemented in the Opensieve C++ library, allows for easy adaptation to new problem sizes, scheduling policies, or additional tuning parameters, paving the way for future optimizations in GPU-based GEMM operations.
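
The selection idea in the abstract can be illustrated with a minimal sketch. It assumes one Bloom filter per scheduling policy, populated with the problem sizes (M, N, K) on which that policy tuned well; a lookup that misses then rules the policy out without benchmarking it. The class, the policy names, and the tuning data below are hypothetical and are not the Opensieve API. Because a Bloom filter has no false negatives, a policy recorded for a size is never eliminated for that size; false positives only mean that an unsuitable policy occasionally survives the filter.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes bit positions by salting one hash function.
        data = repr(key).encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


# Hypothetical tuning data: policy name -> (M, N, K) sizes where it did well.
tuned = {
    "policy_a": [(1024, 1024, 1024), (4096, 4096, 512)],
    "policy_b": [(128, 8192, 256)],
}

# Build one filter per policy.
filters = {}
for policy, sizes in tuned.items():
    bf = BloomFilter()
    for size in sizes:
        bf.add(size)
    filters[policy] = bf


def candidate_policies(size):
    """Return the policies that the filters cannot rule out for this size."""
    return [p for p, bf in filters.items() if bf.might_contain(size)]


print(candidate_policies((1024, 1024, 1024)))
```

Only the surviving candidates would then need to be benchmarked or ranked; the filter sizes and hash count here are arbitrary and would be chosen from the expected number of entries and the acceptable false-positive rate.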
- Published
2024