
A Message Logging Protocol Based on User Level Failure Mitigation

Authors :
Yuhua Tang
Xunyun Liu
Xinhai Xu
Xiaoguang Ren
Ziqing Dai
Source :
Algorithms and Architectures for Parallel Processing ISBN: 9783319038582, ICA3PP (1)
Publication Year :
2013
Publisher :
Springer International Publishing, 2013.

Abstract

Fault tolerance and its associated overheads are a major concern for current high-performance computing systems and future exascale systems. In such systems, message logging is an important transparent rollback-recovery technique because it avoids a global restoration process. Most previous work designed and implemented message logging at the library level or lower in the software hierarchy. In this paper, we propose a new message logging protocol that elevates payload copying, failure handling, and the recovery procedure to the user level, which improves the handling of sender-based logging for collective operations and guarantees a certain level of portability. The proposed approach does not record collective communications as a set of point-to-point messages in the MPI library; instead, it preserves the application data related to those communications so that some process can always serve the original result in case of failure. We implement our protocol in Open MPI and evaluate it with the NPB benchmarks on a subsystem of Tianhe-1A. Experimental results show an improvement in failure-free performance and a reduction in recovery time.
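
The general idea of sender-based payload logging summarized in the abstract can be illustrated with a toy, single-process simulation. This is only a sketch under stated assumptions, not the authors' protocol and not an Open MPI interface: the `Process` class, `broadcast`, and `recover` are hypothetical names invented for the illustration, and real MPI communication is replaced by writes into in-memory dictionaries. It shows a sender keeping a copy of each payload it sends, and a collective (here a broadcast) being preserved as one copy of the application data rather than as a set of point-to-point records, so that a surviving process can serve the original result to a restarted one.

```python
import copy


class Process:
    """Toy stand-in for an MPI rank (hypothetical; not an Open MPI API)."""

    def __init__(self, rank):
        self.rank = rank
        self.sent_log = {}        # (dest_rank, seq) -> payload copy kept by the sender
        self.next_seq = {}        # dest_rank -> next sequence number
        self.inbox = {}           # (src_rank, seq) or ("bcast", id) -> payload
        self.collective_log = {}  # collective id -> preserved application data

    def send(self, dest, payload):
        """Deliver a payload and keep a sender-side copy for later replay."""
        seq = self.next_seq.get(dest.rank, 0)
        self.next_seq[dest.rank] = seq + 1
        self.sent_log[(dest.rank, seq)] = copy.deepcopy(payload)
        dest.inbox[(self.rank, seq)] = copy.deepcopy(payload)
        return seq

    def replay_to(self, dest):
        """Re-deliver every logged payload originally sent to dest."""
        for (dest_rank, seq), payload in sorted(self.sent_log.items()):
            if dest_rank == dest.rank:
                dest.inbox[(self.rank, seq)] = copy.deepcopy(payload)


def broadcast(root, others, coll_id, data):
    """Simulated broadcast: the root preserves one copy of the application
    data instead of logging one point-to-point message per receiver."""
    root.collective_log[coll_id] = copy.deepcopy(data)
    for proc in others:
        proc.inbox[("bcast", coll_id)] = copy.deepcopy(data)


def recover(failed_rank, survivors):
    """Rebuild a failed rank's received data from the survivors' logs only."""
    fresh = Process(failed_rank)
    for survivor in survivors:
        survivor.replay_to(fresh)
        for coll_id, data in survivor.collective_log.items():
            fresh.inbox[("bcast", coll_id)] = copy.deepcopy(data)
    return fresh
```

In this sketch only the failed rank is rebuilt; the surviving process is never rolled back, which mirrors the abstract's point that message logging avoids a global restoration process.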

Details

ISBN :
978-3-319-03858-2
ISBNs :
9783319038582
Database :
OpenAIRE
Journal :
Algorithms and Architectures for Parallel Processing ISBN: 9783319038582, ICA3PP (1)
Accession number :
edsair.doi...........2d16b557ed5f8e634f3eaa940d2a766d
Full Text :
https://doi.org/10.1007/978-3-319-03859-9_27