
Exploiting the behavior of the failed job in high performance computing system

Authors :
Eunhye Kim
Ju-Won Park
Source :
ICCSA (6)
Publication Year :
2018
Publisher :
IEEE, 2018.

Abstract

As demand for high-performance computing power increases, operation management technologies such as checkpointing, failure-aware task scheduling, and system simulation are becoming more important for stable system operation. Maintaining and managing a stable system requires a detailed analysis of failed tasks. To that end, this paper analyzes the characteristics of failed jobs in a high-performance computing system. Its contributions are threefold. First, it presents detailed analysis results for failed jobs, based on the job logs of a supercomputer currently in operation. Second, it offers not only overall statistical results but also the distribution of the inter-arrival time of failed job submissions. Third, it analyzes the occurrence probability of the event using the hazard rate.
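
For readers unfamiliar with the two quantities named in the abstract, the following is a minimal sketch of how inter-arrival times and an empirical hazard rate might be computed from submission timestamps. It is not the authors' method: the function names, the sample timestamps, and the fixed-width binning (a simple life-table style estimate, events in a bin divided by the number of gaps still "at risk" at the bin start times the bin width) are all assumptions made for illustration.

```python
from bisect import bisect_left


def inter_arrival_times(timestamps):
    """Gaps between consecutive submission times (hypothetical input, in seconds)."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def empirical_hazard(gaps, bin_width):
    """Life-table style hazard estimate per bin [start, start + bin_width).

    hazard = (gaps ending inside the bin) / (gaps >= bin start) / bin_width
    This is an illustrative estimator, not the one used in the paper.
    """
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    ordered = sorted(gaps)
    total = len(ordered)
    rates = []
    k = 0
    while True:
        start = k * bin_width
        below_start = bisect_left(ordered, start)
        at_risk = total - below_start  # gaps that have not yet ended at bin start
        if at_risk == 0:
            break
        events = bisect_left(ordered, start + bin_width) - below_start
        rates.append((start, events / (at_risk * bin_width)))
        k += 1
    return rates


# Made-up timestamps of failed job submissions, for demonstration only.
sample_timestamps = [0, 10, 30, 35, 95]
sample_gaps = inter_arrival_times(sample_timestamps)
sample_hazard = empirical_hazard(sample_gaps, bin_width=30)
```

A hazard that falls as the bin start grows would suggest that failed submissions cluster shortly after one another, while a roughly constant hazard would be consistent with exponentially distributed gaps. Which pattern holds for the system studied is a result of the paper itself and is not reproduced here.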

Details

Database :
OpenAIRE
Journal :
2018 18th International Conference on Computational Science and Applications (ICCSA)
Accession number :
edsair.doi...........74800bf04d576ee28bbfdccc7eb2d4af