1. Assessing the impact of timing errors on HPC applications
- Author
- Mattan Erez, Wenqi Yin, and Chun-Kai Chang
- Subjects
- 010302 applied physics; Speedup; Computer science; 0103 physical sciences; 0202 electrical engineering, electronic engineering, information engineering; 02 engineering and technology; Resilience (network); Scale (map); 01 natural sciences; 020202 computer hardware & architecture; Reliability engineering
- Abstract
- Timing errors are a growing concern for system resilience as technology continues to scale. Modeling realistic timing errors with low-fidelity errors such as single-bit flips is problematic. We address the lack of a holistic methodology and tool for evaluating the resilience of applications against timing errors. The proposed technique rapidly injects high-fidelity, configurable timing errors into applications at the instruction level. Our implementation has no runtime dependencies on proprietary tools, enabling full parallelism of error injection campaigns. Furthermore, because an injection point may not generate an actual error for a particular application run, we propose an acceleration technique that maximizes the likelihood of generating errors that contribute to the overall campaign, with a speedup of up to 7X. With our tool, we show that realistic timing errors lead to error profiles distinct from those of radiation-induced errors at both the instruction level and the application level.
- Published
- 2019