A failure index for HPC applications.

Authors :: Păun, Andrei
Chandler, Clayton
Leangsuksun, Chokchai Box
Păun, Mihaela
Source :: Journal of Parallel & Distributed Computing. Jul 2016, Vol. 93/94, pp. 146-153. 8 pp.
Publication Year :: 2016
Abstract: This paper examines log files from High Performance Computing (HPC) applications with known reliability problems. The results of this study further the maturation and adoption of meaningful metrics representing HPC system and application failure characteristics. Quantifiable metrics representing the reliability of HPC applications are foundational for building an application resilience methodology, which is critical to the realization of exascale supercomputing. In this examination, statistical inequality methods originating from the study of economics are applied to health and status information contained in HPC application log files. The main result is the derivation of a new failure index metric for HPC: a normalized representation of parallel application volatility and/or resiliency. It is intended to complement existing reliability metrics such as mean time between failures (MTBF) and to give a better presentation of HPC application resilience. This paper introduces the Failure Index (FI) for HPC reliability and takes the reader through a use case in which the FI is used to expose various run-time fluctuations in the failure rate of applications running on a collection of HPC platforms. [ABSTRACT FROM AUTHOR]
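
The abstract does not say which inequality measure underlies the Failure Index, so the following is only an illustrative sketch, not the authors' method. It assumes the Gini coefficient (a common inequality statistic from economics) computed over hypothetical per-window failure counts, and shows why a normalized inequality measure can complement MTBF: two runs with the same MTBF can distribute their failures very differently. The function names `gini` and `mtbf`, the window length, and the sample counts are all assumptions made for illustration.

```python
from typing import Sequence


def gini(counts: Sequence[float]) -> float:
    """Gini coefficient of non-negative values.

    0.0 means failures are spread evenly across windows; values closer
    to 1.0 mean failures are concentrated in a few windows.
    (Assumed stand-in for an inequality-based index, not the paper's FI.)
    """
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Standard closed form on ascending-sorted data:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with i starting at 1
    weighted = sum(i * x for i, x in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def mtbf(total_hours: float, failures: int) -> float:
    """Mean time between failures: observed hours divided by failure count."""
    if failures == 0:
        return float("inf")
    return total_hours / failures


# Hypothetical failure counts per 24-hour window for two application runs.
steady_run = [5, 5, 5, 5]    # failures spread evenly
bursty_run = [0, 0, 0, 20]   # same total, all in one window

hours_observed = 24.0 * 4
print(mtbf(hours_observed, sum(steady_run)), gini(steady_run))
print(mtbf(hours_observed, sum(bursty_run)), gini(bursty_run))
```

Both hypothetical runs have the same MTBF because they have the same number of failures over the same observation period, while the inequality measure separates the steady run from the bursty one.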