A simple one-step index algorithm for implementation of lattice Boltzmann method on GPU.
- Author
- Ma, Kuang; Wang, Yaning; Jiang, Maoqiang; Liu, Zhaohui
- Subjects
- LATTICE Boltzmann methods, BOLTZMANN'S equation, GRAPHICS processing units, FLOW simulations, ALGORITHMS, GRIDS (Cartography)
- Abstract
We propose a simple one-step index (OSI) algorithm for solving the lattice Boltzmann equation, in particular the streaming of the particle distribution functions (PDFs), on a single grid system. The OSI algorithm is derived from the conventional A-B pattern. The memory addresses of the PDFs are fixed in this algorithm and consistent with collision principles. The streaming process is computed implicitly by reassigning the PDF indexes according to the time step, spatial coordinates, and lattice direction. The algorithm is simple to program because it reads and writes each PDF only once per time step and does not require synchronization of odd and even time steps. In this implementation, the PDFs use a structure-of-arrays (SoA) data layout, which suits the memory access pattern of graphics processing units (GPUs). The accuracy and single-precision performance of the proposed algorithm were validated and tested for the three-dimensional lid-driven cavity flow with the D3Q19 model on an NVIDIA A100 (40 GB, PCIe) GPU using CUDA and OpenACC. Performances of 8.4 and 8.1 giga lattice updates per second were obtained for CUDA and OpenACC, respectively; OpenACC thus delivers up to 95% of the CUDA performance with significantly less programming work. The bandwidth usage rates on a single GPU were 96% and 94% for CUDA and OpenACC, respectively, close to the theoretical values. For multi-GPU runs, the lattice Boltzmann method is parallelized using CUDA and MPI. Finally, overlap of computation and communication was implemented to optimize parallel efficiency; the weak-scaling parallel efficiency exceeded 0.98 on up to 512 GPUs.
• The algorithm integrates LB collision and streaming in a single step on one grid.
• The PDFs are read and written only once per time step.
• The performance is 8.4 and 8.1 GLUPS on an A100 using CUDA and OpenACC, respectively.
• The weak-scaling parallel efficiency is higher than 0.98 on up to 512 GPUs.
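The abstract does not give the exact index rule, so the kernel below is only a minimal CUDA sketch of the general idea it describes (fixed PDF addresses, streaming folded into a time-dependent index, one read and one write per PDF per step, SoA layout), in the spirit of periodic index-shift schemes; it is not the authors' code. The grid sizes, the shift rule x - c_q * t (mod N), the BGK collision, and the purely periodic boundaries (the cavity's bounce-back and moving-lid conditions are omitted) are illustrative assumptions.

```cuda
#include <cuda_runtime.h>

#define NX 128
#define NY 128
#define NZ 128
#define Q  19

// D3Q19 lattice velocities and weights; the host fills these once with
// cudaMemcpyToSymbol before launching the kernel.
__constant__ int   cx[Q], cy[Q], cz[Q];
__constant__ float w[Q];

// Structure-of-arrays layout: one contiguous block of NX*NY*NZ values per
// direction q, so neighbouring threads touch neighbouring addresses (coalesced).
__device__ __forceinline__ size_t idx(int x, int y, int z, int q)
{
    return (size_t)q * NX * NY * NZ + ((size_t)z * NY + y) * NX + x;
}

// One fused collide-and-stream step on a single grid (no A/B copy).
// Illustrative storage rule: f_q of cell (x,y,z) at step t lives at the address
// shifted by -c_q * t (periodically). Reading and writing that same shifted
// address realises the streaming implicitly, one read and one write per PDF.
__global__ void osi_step(float* __restrict__ f, int t, float omega)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= NX || y >= NY || z >= NZ) return;

    float  fq[Q], rho = 0.f, ux = 0.f, uy = 0.f, uz = 0.f;
    size_t addr[Q];

    // Gather the PDFs of this cell from their time-shifted storage locations.
    for (int q = 0; q < Q; ++q) {
        int xs = ((x - cx[q] * t) % NX + NX) % NX;
        int ys = ((y - cy[q] * t) % NY + NY) % NY;
        int zs = ((z - cz[q] * t) % NZ + NZ) % NZ;
        addr[q] = idx(xs, ys, zs, q);
        fq[q]   = f[addr[q]];
        rho += fq[q];
        ux  += fq[q] * cx[q];
        uy  += fq[q] * cy[q];
        uz  += fq[q] * cz[q];
    }
    ux /= rho; uy /= rho; uz /= rho;

    // BGK collision, then write back to the same shifted address: at step t+1 the
    // cell at (x,y,z) + c_q reads exactly this location, which is the streaming.
    float usq = ux * ux + uy * uy + uz * uz;
    for (int q = 0; q < Q; ++q) {
        float cu  = cx[q] * ux + cy[q] * uy + cz[q] * uz;
        float feq = w[q] * rho * (1.f + 3.f * cu + 4.5f * cu * cu - 1.5f * usq);
        f[addr[q]] = fq[q] - omega * (fq[q] - feq);
    }
}
```

With such a storage rule, the address written at step t is exactly the address the downstream neighbour reads at step t + 1, so no second array and no odd/even step distinction is needed, and the SoA layout keeps each direction's accesses contiguous, which is consistent with the near-peak bandwidth utilisation quoted above.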
- Published
- 2023