201. Combining architectural fault-injection and neutron beam testing approaches toward better understanding of GPU soft-error resilience
- Author
- Babatunde Egbantan, Devesh Tiwari, Fritz Previlon, David Kaeli, and Paolo Rech
- Subjects
Engineering, business.industry, Reliability (computer networking), Automotive industry, 02 engineering and technology, Fault injection, 020202 computer hardware & architecture, Soft error, Computer engineering, Embedded system, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Transient (computer programming), Resilience (network), business, Vulnerability (computing), System software
- Abstract
Transient faults continue to be a critical concern in a range of computing domains, including High-Performance Computing (HPC), scientific computing, and the automotive industry. While radiation-induced faults have been well studied and understood in microprocessors, their impact on computations on Graphics Processing Units (GPUs) has received less attention. GPUs are now being used in a large number of HPC and automotive markets. Mitigating the effects of transient faults requires a thorough understanding of the interaction between applications, system software, and the underlying hardware. Developing this understanding is quite challenging, mainly due to our limited ability to capture and study cross-layer reliability interactions. In this paper, we consider the combination of neutron beam testing experiments with architectural fault injection experiments to gain a deeper understanding of the relationship between the vulnerability of GPUs and the underlying workload characteristics of applications targeted for GPU devices.
- Published
- 2017