A failure index for HPC applications.
- Source :
- Journal of Parallel & Distributed Computing. Jul2016, Vol. 93/94, p146-153. 8p.
- Publication Year :
- 2016
Abstract
- This paper conducts an examination of log files originating from High Performance Computing (HPC) applications with known reliability problems. The results of this study further the maturation and adoption of meaningful metrics representing HPC system and application failure characteristics. Quantifiable metrics representing the reliability of HPC applications are foundational for building an application resilience methodology critical in the realization of exascale supercomputing. In this examination, statistical inequality methods originating from the study of economics are applied to health and status information contained in HPC application log files. The main result is the derivation of a new failure index metric for HPC: a normalized representation of parallel application volatility and/or resiliency that complements existing reliability metrics such as mean time between failures (MTBF) and aims to present HPC application resilience more clearly. This paper provides an introduction to a Failure Index (FI) for HPC reliability and takes the reader through a use-case wherein the FI is used to expose various run-time fluctuations in the failure rate of applications running on a collection of HPC platforms. [ABSTRACT FROM AUTHOR]
Details
- Language :
- English
- ISSN :
- 07437315
- Volume :
- 93/94
- Database :
- Academic Search Index
- Journal :
- Journal of Parallel & Distributed Computing
- Publication Type :
- Academic Journal
- Accession number :
- 115743877
- Full Text :
- https://doi.org/10.1016/j.jpdc.2016.04.009