Start Over

Low-overhead General-purpose Near-Data Processing in CXL Memory Expanders

Authors :: Ham, Hyungkyu
Hong, Jeongmin
Park, Geonwoo
Shin, Yunseon
Woo, Okkyun
Yang, Wonhyuk
Bae, Jinhoon
Park, Eunhyeok
Sung, Hyojin
Lim, Euicheol
Kim, Gwangsun
Publication Year :: 2024
Abstract: Emerging Compute Express Link (CXL) enables cost-efficient memory expansion beyond the local DRAM of processors. While its CXL$.$mem protocol provides minimal latency overhead through an optimized protocol stack, frequent CXL memory accesses can result in significant slowdowns for memory-bound applications whether they are latency-sensitive or bandwidth-intensive. The near-data processing (NDP) in the CXL controller promises to overcome such limitations of passive CXL memory. However, prior work on NDP in CXL memory proposes application-specific units that are not suitable for practical CXL memory-based systems that should support various applications. On the other hand, existing CPU or GPU cores are not cost-effective for NDP because they are not optimized for memory-bound applications. In addition, the communication between the host processor and CXL controller for NDP offloading should achieve low latency, but existing CXL$.$io/PCIe-based mechanisms incur $\mu$s-scale latency and are not suitable for fine-grained NDP. To achieve high-performance NDP end-to-end, we propose a low-overhead general-purpose NDP architecture for CXL memory referred to as Memory-Mapped NDP (M$^2$NDP), which comprises memory-mapped functions (M$^2$func) and memory-mapped $\mu$threading (M$^2\mu$thread). M$^2$func is a CXL$.$mem-compatible low-overhead communication mechanism between the host processor and NDP controller in CXL memory. M$^2\mu$thread enables low-cost, general-purpose NDP unit design by introducing lightweight $\mu$threads that support highly concurrent execution of kernels with minimal resource wastage. Combining them, M$^2$NDP achieves significant speedups for various workloads by up to 128x (14.5x overall) and reduces energy by up to 87.9% (80.3% overall) compared to baseline CPU/GPU hosts with passive CXL memory.<br />Comment: Accepted at the 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2024