
Physics-Based Checksums for Silent-Error Detection in PDE Solvers

Authors :
Robert C. Armstrong
Maher Salloum
Jackson R. Mayo
Source :
Euro-Par 2019: Parallel Processing Workshops ISBN: 9783030483395, Euro-Par Workshops
Publication Year :
2020
Publisher :
Springer International Publishing, 2020.

Abstract

We discuss techniques for efficient local detection of silent data corruption in parallel scientific computations, leveraging physical quantities such as momentum and energy that may be conserved by discretized PDEs. The conserved quantities are analogous to “algorithm-based fault tolerance” checksums for linear algebra but, due to their physical foundation, are applicable to both linear and nonlinear equations and have efficient local updates based on fluxes between subdomains. These physics-based checksums enable precise intermittent detection of errors and recovery by rollback to a checkpoint, with very low overhead when errors are rare. We present applications to both explicit hyperbolic and iterative elliptic (unstructured finite-element) solvers with injected memory bit flips.
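The flux-based checksum idea in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a 1D periodic upwind advection scheme, two subdomains, an in-memory checkpoint, and a single injected bit flip, and every function and parameter name (`flip_bit`, `upwind_step`, `run`, `check_every`, `tol`) is hypothetical. Each subdomain keeps an expected sum of its cells, updated only from the fluxes crossing its two boundary faces, and compares it intermittently against the sum actually stored in memory.

```python
import struct

def flip_bit(x, bit):
    """Flip one bit of a float64 value (emulates a memory bit flip)."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped

def upwind_step(u, c):
    """One conservative upwind step for u_t + a*u_x = 0 on a periodic grid.

    c = a*dt/dx is the Courant number (0 < c <= 1). flux[i] is the amount
    entering cell i through its left face during the step.
    """
    n = len(u)
    flux = [c * u[i - 1] for i in range(n)]  # u[-1] wraps around (periodic)
    new_u = [u[i] + flux[i] - flux[(i + 1) % n] for i in range(n)]
    return new_u, flux

def run(u0, c, steps, bounds, check_every, tol, fault=None):
    """Advance `steps` steps with checksum checks and rollback.

    bounds: (lo, hi) cell range of each subdomain.
    fault:  optional (step, cell, bit) describing one transient bit flip.
    """
    u, n = list(u0), len(u0)
    sums = [sum(u[lo:hi]) for lo, hi in bounds]  # expected checksums
    checkpoint = (0, list(u), list(sums))
    detections, step = [], 0
    while step < steps:
        if fault is not None and fault[0] == step:
            u[fault[1]] = flip_bit(u[fault[1]], fault[2])
            fault = None  # transient error: does not recur after rollback
        u, flux = upwind_step(u, c)
        # Local update: only the two boundary fluxes of each subdomain.
        sums = [s + flux[lo] - flux[hi % n] for s, (lo, hi) in zip(sums, bounds)]
        step += 1
        if step % check_every == 0:
            # "not (... <= tol)" also flags NaN, which "> tol" would miss.
            bad = [k for k, (lo, hi) in enumerate(bounds)
                   if not abs(sum(u[lo:hi]) - sums[k]) <= tol]
            if bad:
                detections.append((step, bad))
                step, u, sums = checkpoint[0], list(checkpoint[1]), list(checkpoint[2])
            else:
                checkpoint = (step, list(u), list(sums))
    return u, detections

u0 = [1.0 + (i % 4) for i in range(16)]
bounds = [(0, 8), (8, 16)]
clean, _ = run(u0, 0.5, 20, bounds, check_every=5, tol=1e-9)
recovered, detections = run(u0, 0.5, 20, bounds, check_every=5, tol=1e-9,
                            fault=(7, 10, 52))
print(detections)          # [(10, [1])]
print(recovered == clean)  # True
```

In this toy run the flip in cell 10 is flagged at the next check (step 10) and only in subdomain 1, because the corrupted flux leaving a subdomain enters both the stored data and the expected checksum of its neighbour consistently. The tolerance `tol` is an assumed allowance for floating-point rounding between the incrementally updated checksum and the recomputed sum.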

Details

ISBN :
978-3-030-48339-5
Database :
OpenAIRE
Accession number :
edsair.doi...........6cfaecda2de704556130e315a041f925
Full Text :
https://doi.org/10.1007/978-3-030-48340-1_52