Understanding GPU errors on large-scale HPC systems and the implications for system design and operation

Authors :: Don Maxwell
Nathan DeBardeleben
Saurabh Gupta
Daniel Oliveira
Luigi Carro
Devesh Tiwari
Arthur S. Bland
James H. Rogers
Paolo Rech
Sudharshan S. Vazhkudai
Dave Londo
Philippe O. A. Navaux
Source :: HPCA
Publication Year :: 2015
Publisher :: IEEE, 2015.
Abstract: Increases in graphics hardware performance and improvements in programmability have enabled GPUs to evolve from a graphics-specific accelerator to a general-purpose computing device. Titan, the world's second fastest supercomputer for open science in 2014, consists of more than 18,000 GPUs that scientists from various domains such as astrophysics, fusion, climate, and combustion use routinely to run large-scale simulations. Unfortunately, while the performance efficiency of GPUs is well understood, their resilience characteristics in a large-scale computing system have not been fully evaluated. We present a detailed study to provide a thorough understanding of GPU errors on a large-scale GPU-enabled system. Our data was collected from the Titan supercomputer at the Oak Ridge Leadership Computing Facility and a GPU cluster at the Los Alamos National Laboratory. We also present results from our extensive neutron-beam tests, conducted at Los Alamos Neutron Science Center (LANSCE) and at ISIS (Rutherford Appleton Laboratories, UK), to measure the resilience of different generations of GPUs. We present several findings from our field data and neutron-beam experiments, and discuss the implications of our results for future GPU architects, current and future HPC computing facilities, and researchers focusing on GPU resilience.

Subjects :: ComputerSystemsOrganization_COMPUTERSYSTEMIMPLEMENTATION
Java
Computer science
Graphics hardware
GPU cluster
Supercomputer
computer.software_genre
Titan (supercomputer)
Computer architecture
Performance efficiency
Operating system
Systems design
National laboratory
computer
computer.programming_language

Details

Database :: OpenAIRE
Journal :: 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)
Accession number :: edsair.doi...........279954d50dbf5a26b01c5f59e4cf01ae
Full Text :: https://doi.org/10.1109/hpca.2015.7056044
