Off-Chip Memory Allocation for Neural Processing Units
- Author
Andrey Kvochko, Evgenii Maltsev, Artem Balyshev, Stanislav Malakhov, and Alexander Efimov
- Subjects
NPU, memory allocation, neural network runtime, tiling, strip-packing problem, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
Many modern Systems-on-Chip (SoCs) are equipped with specialized Machine Learning (ML) accelerators that use both on-chip and off-chip memory to execute neural networks. While on-chip memory usually has a hard limit, off-chip memory is often considered large enough to hold the network’s inputs, outputs, weights, and any intermediate results that may occur during model execution. This assumption may not hold for edge devices, such as smartphones, which usually have a limit on the amount of memory a process can use. In this study, we propose a novel approach for minimizing a neural network’s off-chip memory usage by introducing a tile-aware allocator capable of reusing memory occupied by parts of a tensor before the entire tensor expires. We describe the necessary conditions for such an off-chip memory allocation approach and provide the results, showing that it can save up to 33% of the peak off-chip memory usage in some common network architectures.
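The key idea in the abstract is that a tile-aware allocator frees memory at the granularity of tensor tiles rather than whole tensors, lowering the peak off-chip footprint. The toy sketch below is not the paper's algorithm; it only illustrates, with a hypothetical `peak_usage` helper and made-up tile sizes and lifetimes, how tracking per-tile live intervals instead of a single whole-tensor interval can reduce peak usage when tiles are produced and consumed one at a time.

```python
from collections import namedtuple

# A tile occupies `size` bytes over the live interval [start, end] (time steps).
Tile = namedtuple("Tile", ["size", "start", "end"])

def peak_usage(intervals):
    """Peak of the summed sizes of simultaneously live (size, start, end) intervals."""
    events = []
    for size, start, end in intervals:
        events.append((start, size))      # allocation event
        events.append((end + 1, -size))   # release event after last use
    peak = cur = 0
    for _, delta in sorted(events):
        cur += delta
        peak = max(peak, cur)
    return peak

# Hypothetical tensor split into 4 tiles, each written and then read one step later.
tiles = [Tile(size=256, start=t, end=t + 1) for t in range(4)]

# Conventional allocator: the whole tensor is live from its first write to its last read.
whole = [(sum(t.size for t in tiles),
          min(t.start for t in tiles),
          max(t.end for t in tiles))]

# Tile-aware allocator: each tile's memory is reclaimable as soon as that tile expires.
tile_level = [(t.size, t.start, t.end) for t in tiles]

print(peak_usage(whole))       # 1024 bytes peak
print(peak_usage(tile_level))  # 512 bytes peak
```

In this contrived schedule the tile-level view halves the peak; the paper reports savings of up to 33% on real network architectures, where tile lifetimes overlap in more complex ways.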
- Published
- 2024