Self-refined Fault Tolerance in HPC Using Dynamic Dependent Process Groups.

Authors :
Pal, Ajit
Kshemkalyani, Ajay D.
Kumar, Rajeev
Gupta, Arobinda
Gopalan, N.P.
Nagarajan, K.
Source :
Distributed Computing - IWDC 2005; 2005, p153-158, 6p
Publication Year :
2005

Abstract

This paper proposes a novel method for achieving distributed self-refined fault tolerance by dynamically partitioning the processes into smaller groups that are mutually disjoint and collectively exhaustive of the whole system. The present model tolerates frequent faults and makes rollback recovery simpler and less time-consuming. An optimal checkpoint interval is found using a mathematical approximation, and a spare process is made to capture all the in-transit messages when a process fails at its ends. Process grouping is done by piggybacking the events of dependent processes on the outgoing messages. A process with maximum information can scatter chunk values to the other dependent processes in its group. Each process constructs a checkpoint when the received chunk matches its log. [ABSTRACT FROM AUTHOR]

Details

Language :
English
ISBNs :
9783540309598
Database :
Supplemental Index
Journal :
Distributed Computing - IWDC 2005
Publication Type :
Book
Accession number :
32903052
Full Text :
https://doi.org/10.1007/11603771_18