
Algorithm-Based Fault Tolerance for Fail-Stop Failures.

Authors :
Chen, Zizhong
Dongarra, Jack
Source :
IEEE Transactions on Parallel & Distributed Systems; Dec 2008, Vol. 19 Issue 12, p1628-1641, 14p, 3 Diagrams, 7 Charts, 10 Graphs
Publication Year :
2008

Abstract

Fail-stop failures in distributed environments are often tolerated by checkpointing or message logging. In this paper, we show that fail-stop process failures in the ScaLAPACK matrix-matrix multiplication kernel can be tolerated without checkpointing or message logging. Previous algorithm-based fault tolerance research has proved that, for matrix-matrix multiplication, the checksum relationship in the input checksum matrices is preserved at the end of the computation no matter which algorithm is chosen. From this checksum relationship in the final computation results, processor miscalculations can be detected, located, and corrected at the end of the computation. However, whether this checksum relationship can also be maintained in the middle of the computation has remained an open question. In this paper, we first demonstrate that, for many matrix-matrix multiplication algorithms, the checksum relationship in the input checksum matrices is not maintained in the middle of the computation. We then prove that, for the outer product version of the matrix-matrix multiplication algorithm, the checksum relationship can be maintained in the middle of the computation. Based on this checksum relationship maintained in the middle of the computation, we demonstrate that fail-stop process failures in ScaLAPACK matrix-matrix multiplication can be tolerated without checkpointing or message logging. Because no periodic checkpointing is involved, the fault tolerance overhead of this approach is surprisingly low. [ABSTRACT FROM AUTHOR]
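
The checksum idea in the abstract can be illustrated with a small sketch. This is not the authors' ScaLAPACK implementation: it is a single-process, pure-Python toy, and every function name below is hypothetical. It assumes the usual checksum encoding (an extra row of column sums appended to A, an extra column of row sums appended to B) and integer data so that equality checks are exact. It shows that every partial result of the outer product algorithm keeps the checksum relationship, that a row-by-row ordering (chosen here only as one illustrative counterexample) does not, and that a lost row of a partial result can be rebuilt from the checksum row.

```python
def with_column_checksum(a):
    """Append one row holding the sum of each column of a."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]

def with_row_checksum(b):
    """Append to each row of b the sum of that row."""
    return [row[:] + [sum(row)] for row in b]

def checksum_holds(c):
    """True if the last column holds row sums and the last row holds column sums."""
    rows_ok = all(row[-1] == sum(row[:-1]) for row in c)
    cols_ok = all(c[-1][j] == sum(r[j] for r in c[:-1]) for j in range(len(c[0])))
    return rows_ok and cols_ok

def outer_product_steps(ac, br):
    """Yield the partial result after each rank-1 update C += ac[:, k] * br[k, :]."""
    c = [[0] * len(br[0]) for _ in ac]
    for k in range(len(br)):
        for i in range(len(ac)):
            for j in range(len(br[0])):
                c[i][j] += ac[i][k] * br[k][j]
        yield [row[:] for row in c]

def row_by_row_steps(ac, br):
    """Yield the partial result after each complete row of C is computed."""
    c = [[0] * len(br[0]) for _ in ac]
    for i in range(len(ac)):
        for j in range(len(br[0])):
            c[i][j] = sum(ac[i][k] * br[k][j] for k in range(len(br)))
        yield [row[:] for row in c]

def recover_row(c, lost):
    """Rebuild data row `lost` as the checksum row minus the surviving data rows."""
    survivors = [r for i, r in enumerate(c[:-1]) if i != lost]
    return [c[-1][j] - sum(r[j] for r in survivors) for j in range(len(c[0]))]

# Toy inputs (assumed for illustration only).
A = [[1, 2], [3, 4], [5, 6]]
B = [[7, 8, 9], [1, 2, 3]]
Ac, Br = with_column_checksum(A), with_row_checksum(B)

outer = list(outer_product_steps(Ac, Br))
rowwise = list(row_by_row_steps(Ac, Br))

print([checksum_holds(c) for c in outer])    # invariant at every outer-product step
print([checksum_holds(c) for c in rowwise])  # broken until the last row is done
print(recover_row(outer[0], 1) == outer[0][1])  # lost row rebuilt mid-computation
```

In a distributed setting the "lost row" would correspond to the data held by a failed process; the sketch only shows why the maintained checksum makes that data recoverable without a checkpoint.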

Details

Language :
English
ISSN :
1045-9219
Volume :
19
Issue :
12
Database :
Complementary Index
Journal :
IEEE Transactions on Parallel & Distributed Systems
Publication Type :
Academic Journal
Accession number :
35574719
Full Text :
https://doi.org/10.1109/TPDS.2008.58