FPGA Checkpointing for Scientific Computing

Authors :
Osman Unsal
Leonardo Bautista-Gomez
Marc Perello Bacardit
Barcelona Supercomputing Center
Source :
2021 IEEE 27th International Symposium on On-Line Testing and Robust System Design (IOLTS)
Publisher :
IEEE

Abstract

The use of FPGAs in computational workloads is becoming increasingly popular due to the flexibility of these devices in comparison to ASICs, and their low power consumption compared to GPUs and CPUs. However, scientific applications run for long periods of time, and the hardware is always subject to failures caused by either soft or hard errors. Thus, it is important to protect these long-running jobs with fault tolerance mechanisms. Checkpoint-Restart is a popular technique in high-performance computing that allows large-scale applications to cope with frequent failures. In this work we approach the fault tolerance of CPU-FPGA heterogeneous applications from a high level by using the OmpSs@FPGA environment and a multi-level checkpointing library. We analyse the performance of several different applications to determine what kind of overheads can be expected from checkpointing computational workloads running on FPGAs. Our results demonstrate overheads as low as 0.16% and 0.66% when checkpointing very frequently, indicating that this technique is efficient and does not add a significant amount of overhead to the system. In addition, we showcase a proof of concept for checkpointing partial data of the FPGA task itself. This can prove useful for workloads in which most data is offloaded to the FPGA memory at once and that do not constantly move all the data between the accelerator and the CPU.

This research has received funding from the European Union’s Horizon 2020 research and innovation programme under projects EuroEXA (grant agreement nº 754337) and eProcessor (grant agreement nº 956702).

Details

Language :
English
ISBN :
978-1-66543-370-9
Database :
OpenAIRE
Journal :
2021 IEEE 27th International Symposium on On-Line Testing and Robust System Design (IOLTS)
Accession number :
edsair.doi.dedup.....79dbf8795ecb828ae14859cebbed13f5
Full Text :
https://doi.org/10.1109/iolts52814.2021.9486693