1. A Domain-Specific Architecture for Accelerating Sparse Matrix Vector Multiplication on FPGAs
- Author
- Lisa Liu, Henri Fraisse, Mansimran Benipal, Hossein Omidian, Abhishek Kumar Jain, and Dinesh D. Gaitonde
- Subjects
010302 applied physics, Modularity (networks), Memory hierarchy, Plug and play, Computer science, Sparse matrix-vector multiplication, 02 engineering and technology, Parallel computing, 01 natural sciences, 020202 computer hardware & architecture, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Routing (electronic design automation), Field-programmable gate array, Block (data storage), Efficient energy use
- Abstract
FPGAs allow custom memory hierarchies and flexible data movement with fine-grained control. These capabilities are critical for building high-performance, energy-efficient domain-specific architectures (DSAs), especially for workloads with irregular memory access and data-dependent communication patterns. Sparse linear algebra operations, in particular sparse matrix-vector multiplication (SpMV), are examples of such workloads and are becoming important because of their use in many areas of science and engineering. Existing FPGA-based DSAs for SpMV do not allow customization through plug and play of their building blocks. For example, most of these DSAs require a switching network or crossbar as a building block for routing matrix data to banked vector memory blocks. In this paper, we first present an approach in which a custom network is built from simple blocks arranged in a regular fashion to exploit low-level architecture details. We then use this network to replace the expensive crossbars employed in the GEMX SpMV engine and develop an end-to-end tool-flow around a mixed-IP (HLS/RTL) approach. Because our design is modular, the tool-flow allows us to insert an additional block into the design to guarantee zero stalls from the accumulation stage. On an Alveo U200, we report performance of up to 4.4 GFLOPS (92% peak bandwidth utilization) using our accelerator (attached to one DDR4).
- Published
- 2020
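
The abstract describes SpMV as a workload with irregular memory access. A minimal software sketch can illustrate why. It assumes the compressed sparse row (CSR) storage format, which the abstract does not name, and the function `spmv_csr` is a hypothetical helper for illustration only, not the paper's accelerator design. The matrix values are read sequentially, but the vector `x` is read at positions given by `col_idx`, which depend on the data. In a banked design, this is the access pattern a crossbar or routing network has to serve.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form.

    row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i;
    col_idx[k] is the column of the k-th nonzero, values[k] its value.
    (CSR is an assumption here; the source does not state the format.)
    """
    num_rows = len(row_ptr) - 1
    y = [0.0] * num_rows
    for row in range(num_rows):
        acc = 0.0  # per-row accumulator
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # values[k] is streamed in order; x[col_idx[k]] is a
            # data-dependent (irregular) read into the vector.
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Small illustrative matrix (made up for this example):
# [[2, 0, 1],
#  [0, 3, 0],
#  [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [2.0, 1.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]

y = spmv_csr(row_ptr, col_idx, values, x)
print(y)  # [5.0, 6.0, 19.0]
```

Each nonzero contributes one multiply and one add, so throughput in a design like this is typically bounded by how fast nonzeros can be streamed from memory, which is consistent with the abstract reporting performance alongside bandwidth utilization.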