A Message Logging Protocol Based on User Level Failure Mitigation
- Source :
- Algorithms and Architectures for Parallel Processing ISBN: 9783319038582, ICA3PP (1)
- Publication Year :
- 2013
- Publisher :
- Springer International Publishing, 2013.
Abstract
- Fault tolerance and its associated overheads are of great concern for current high performance computing systems and future exascale systems. In such systems, message logging is an important transparent rollback recovery technique because it avoids a global restoration process. Most previous work designed and implemented message logging at the library level or at even lower levels of the software hierarchy. In this paper, we propose a new message logging protocol, which elevates payload copy, failure handling and the recovery procedure to the user level in order to provide better handling of sender-based logging for collective operations and to guarantee a certain level of portability. The proposed approach does not record collective communications as a set of point-to-point messages in the MPI library; instead, we preserve application data related to the communications to ensure that there exists a process which can serve the original result in case of failure. We implement our protocol in Open MPI and evaluate it with the NPB benchmarks on a subsystem of Tianhe-1A. Experimental results show an improvement in failure-free performance and a reduction in recovery time.
Details
- ISBN :
- 978-3-319-03858-2
- Database :
- OpenAIRE
- Journal :
- Algorithms and Architectures for Parallel Processing ISBN: 9783319038582, ICA3PP (1)
- Accession number :
- edsair.doi...........2d16b557ed5f8e634f3eaa940d2a766d
- Full Text :
- https://doi.org/10.1007/978-3-319-03859-9_27