Exploiting memory customization in FPGA for 3D stencil computations

Authors :: Nacho Navarro
Raúl de la Cruz
Mauricio Araya-Polo
Eduard Ayguadé
Muhammad Shafiq
Miquel Pericas
Source :: FPT
Publication Year :: 2009
Publisher :: IEEE, 2009.
Abstract: 3D stencil computations are compute-intensive kernels often appearing in high-performance scientific and engineering applications. The key to efficiency in these memory-bound kernels is full exploitation of data reuse. This paper explores the design aspects of 3D stencil implementations that maximize the reuse of all input data on an FPGA architecture. The work focuses on the architectural design of 3D stencils of the form n × (n + 1) × n, where n = {2, 4, 6, 8, …}. The performance of the architecture is evaluated using two design approaches, “Multi-Volume” and “Single-Volume”. When n = 8, the designs achieve a sustained throughput of 55.5 GFLOPS in the “Single-Volume” approach and 103 GFLOPS in the “Multi-Volume” approach, in a 100–200 MHz multi-rate implementation on a Virtex-4 LX200 FPGA. This corresponds to a stencil data delivery of 1500 bytes/cycle and 2800 bytes/cycle, respectively. The implementation is analyzed and compared to two CPU cache approaches and to the statically scheduled local stores on the IBM PowerXCell 8i. The FPGA approaches designed here achieve much higher bandwidth even though the FPGA is the oldest of the chips considered. These numbers show how a custom memory organization can provide large data throughput when implementing 3D stencil kernels.
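To make the data-reuse argument concrete, the following is a minimal software sketch of a 3D stencil sweep. It is not the paper's FPGA design. It assumes a symmetric star-shaped stencil with a hypothetical radius `r` along each axis and hypothetical coefficients `coeffs`, since the abstract does not spell out the exact point layout behind the n × (n + 1) × n notation. The function names `star_stencil_3d` and `points_per_output` are illustrative only.

```python
def star_stencil_3d(vol, coeffs):
    """Apply a symmetric star-shaped stencil to the interior of a 3D volume.

    vol    : nested lists indexed as vol[z][y][x]
    coeffs : coeffs[0] weights the centre point; coeffs[k] weights the six
             neighbours at distance k along the x, y and z axes.
             (Hypothetical layout, assumed for illustration.)
    """
    r = len(coeffs) - 1  # stencil radius per axis
    nz, ny, nx = len(vol), len(vol[0]), len(vol[0][0])
    # Boundary points (closer than r to any face) are left at 0.0.
    out = [[[0.0] * nx for _ in range(ny)] for _ in range(nz)]
    for z in range(r, nz - r):
        for y in range(r, ny - r):
            for x in range(r, nx - r):
                acc = coeffs[0] * vol[z][y][x]
                for k in range(1, r + 1):
                    acc += coeffs[k] * (
                        vol[z][y][x - k] + vol[z][y][x + k]
                        + vol[z][y - k][x] + vol[z][y + k][x]
                        + vol[z - k][y][x] + vol[z + k][y][x]
                    )
                out[z][y][x] = acc
    return out


def points_per_output(r):
    """Input points read per output point for a star stencil of radius r."""
    return 6 * r + 1


if __name__ == "__main__":
    ones = [[[1.0] * 5 for _ in range(5)] for _ in range(5)]
    result = star_stencil_3d(ones, [1.0, 0.5])
    print(result[2][2][2], points_per_output(1))
```

In this sketch every interior output point reads `6 * r + 1` input points, and neighbouring output points read overlapping sets of inputs. A memory organization that keeps those shared inputs on chip can fetch each input from external memory once and serve it to many outputs, which is the kind of reuse the abstract identifies as the key to efficiency.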