Checkpoint restart support for heterogeneous HPC applications

Authors :
Kai Keller
Osman Unsal
Konstantinos Parasyris
Leonardo Bautista-Gomez
Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors
Barcelona Supercomputing Center
Source :
UPCommons. Portal del coneixement obert de la UPC, Universitat Politècnica de Catalunya (UPC), CCGRID, 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID)
Publication Year :
2020
Publisher :
Institute of Electrical and Electronics Engineers (IEEE), 2020.

Abstract

As we approach the era of exascale computing, fault tolerance is of growing importance. The increasing number of cores and the increased complexity of modern heterogeneous systems result in a substantial decrease in the expected mean time between failures. Among the different fault tolerance techniques, checkpoint/restart is widely adopted in supercomputing systems. Although many supercomputers in the TOP500 list use GPUs, only a few checkpoint/restart mechanisms support GPUs. In this paper, we extend an application-level checkpoint library, called Fault Tolerance Interface (FTI), to support multi-node/multi-GPU checkpoints. In contrast to previous work, our library includes a memory manager, which upon a checkpoint invocation tracks the actual location of the data to be stored and handles the data accordingly. We analyze the overhead of the checkpoint/restart procedure and present a series of optimization steps that substantially decrease the checkpoint and recovery time of our implementation. To further reduce the checkpoint time, we present a differential checkpoint approach which writes only the updated data to the checkpoint file. We evaluate our approach: in the best-case scenario, the execution time of a normal checkpoint is reduced by 15x compared with a non-optimized version, and with differential checkpointing the overhead can drop to 2.6% when checkpointing every 30 s. The research leading to these results has received funding from the European Union’s Horizon 2020 Programme under the LEGaTO Project (www.legato-project.eu), grant agreement #780681.
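
The differential checkpoint idea mentioned in the abstract (writing only the data that changed since the previous checkpoint) can be illustrated with a minimal sketch. This is not FTI's implementation or API: the block size, the use of SHA-256 digests to detect changes, and the function names `block_hashes`, `differential_checkpoint`, and `apply_checkpoint` are all assumptions made for illustration, and the sketch works on a host-side byte buffer of fixed size rather than on GPU memory.

```python
import hashlib
from typing import Dict, List, Tuple

BLOCK_SIZE = 4096  # hypothetical block granularity, chosen only for this sketch


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Return one SHA-256 digest per fixed-size block of `data`."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def differential_checkpoint(
    data: bytes, prev_hashes: List[bytes], block_size: int = BLOCK_SIZE
) -> Tuple[Dict[int, bytes], List[bytes]]:
    """Collect only the blocks whose digest differs from the previous checkpoint.

    Returns (changed_blocks, new_hashes). With an empty `prev_hashes`,
    every block is reported as changed, which amounts to a full checkpoint.
    """
    new_hashes = block_hashes(data, block_size)
    changed: Dict[int, bytes] = {}
    for idx, digest in enumerate(new_hashes):
        if idx >= len(prev_hashes) or prev_hashes[idx] != digest:
            start = idx * block_size
            changed[idx] = data[start:start + block_size]
    return changed, new_hashes


def apply_checkpoint(
    stored: bytearray, changed: Dict[int, bytes], block_size: int = BLOCK_SIZE
) -> bytearray:
    """Patch the stored checkpoint image in place with the changed blocks."""
    for idx, block in changed.items():
        start = idx * block_size
        stored[start:start + len(block)] = block
    return stored
```

In this sketch, the first call with an empty hash list produces a full checkpoint; later calls pass the hashes returned by the previous call, so only modified blocks are written. The saving depends on how much of the protected data changes between checkpoints, and computing the digests has its own cost.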

Details

ISBN :
978-1-72816-095-5
ISBNs :
9781728160955
Database :
OpenAIRE
Journal :
UPCommons. Portal del coneixement obert de la UPC, Universitat Politècnica de Catalunya (UPC), CCGRID, 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID)
Accession number :
edsair.doi.dedup.....d1ba7b1040fe9189f4986e702bd686b5