Author: "Ghiasvand, Siavash" / Publisher: Technische Universität Dresden - Searchworks@Jio Institute Digital Library Search Results

2 results on '"Ghiasvand, Siavash"'

1. Toward Resilience in High Performance Computing: A Prototype to Analyze and Predict System Behavior

Author: Ghiasvand, Siavash, Nagel, Wolfgang E., and Schulz, Martin
Subjects: anomaly detection, failure prediction, high performance computing, system logs, resilience, ddc:004
Abstract: Following the growth of high performance computing (HPC) systems in size and complexity, and the advent of faster and more complex Exascale systems, failures have become the norm rather than the exception. Hence, the protection mechanisms need to be improved. De facto mechanisms such as checkpoint/restart or redundancy may also fail to support the continuous operation of future HPC systems in the presence of failures. Failure prediction is a new protection approach that is beneficial for HPC systems with a short mean time between failures. The failure prediction mechanism extends the existing protection mechanisms via the dynamic adjustment of the protection level. This work provides a prototype to analyze and predict system behavior using statistical analysis to pave the path toward resilience in HPC systems. The proposed anomaly detection method is noise-tolerant by design and produces accurate results with as little as 30 minutes of historical data. Machine learning models complement the main approach and further improve the accuracy of failure predictions up to 85%. The fully automatic unsupervised behavior analysis approach proposed in this work is a novel solution to protect future extreme-scale systems against failures.
Contents:
1 Introduction: 1.1 Background and Statement of the Problem; 1.2 Purpose and Significance of the Study; 1.3 Jam-e Jam: A System Behavior Analyzer
2 Review of the Literature: 2.1 Syslog Analysis; 2.2 Users and Systems Privacy; 2.3 Failure Detection and Prediction; 2.3.1 Failure Correlation; 2.3.2 Anomaly Detection; 2.3.3 Prediction Methods; 2.3.4 Prediction Accuracy and Lead Time
3 Data Collection and Preparation: 3.1 Taurus HPC Cluster; 3.2 Monitoring Data; 3.2.1 Data Collection; 3.2.2 Taurus System Log Dataset; 3.3 Data Preparation; 3.3.1 Users and Systems Privacy; 3.3.2 Storage and Size Reduction; 3.3.3 Automation and Improvements; 3.3.4 Data Discretization and Noise Mitigation; 3.3.5 Cleansed Taurus System Log Dataset; 3.4 Marking Potential Failures
4 Failure Prediction: 4.1 Null Hypothesis; 4.2 Failure Correlation; 4.2.1 Node Vicinities; 4.2.2 Impact of Vicinities; 4.3 Anomaly Detection; 4.3.1 Statistical Analysis (frequency); 4.3.2 Pattern Detection (order); 4.3.3 Machine Learning; 4.4 Adaptive resilience
5 Results: 5.1 Taurus System Logs; 5.2 System-wide Failure Patterns; 5.3 Failure Correlations; 5.4 Taurus Failures Statistics; 5.5 Jam-e Jam Prototype; 5.6 Summary and Discussion
6 Conclusion and Future Works
Bibliography; List of Figures; List of Tables
Appendix A Neural Network Models; Appendix B External Tools; Appendix C Structure of Failure Metadata Database; Appendix D Reproducibility; Appendix E Publicly Available HPC Monitoring Datasets; Appendix F Glossary; Appendix G Acronyms
Published: 2020

2. Toward Resilience in High Performance Computing: A Prototype to Analyze and Predict System Behavior

Author: Nagel, Wolfgang E., Schulz, Martin, and Ghiasvand, Siavash
Published: 2020
