1. DRAM Errors and Cosmic Rays: Space Invaders or Science Fiction?
- Author
Boixaderas, Isaac, Amaya, Jorge, Moré, Sergi, Bartolome, Javier, Vicente, David, Unsal, Osman, Gizopoulos, Dimitris, Carpenter, Paul M., Radojković, Petar, and Ayguadé, Eduard
- Subjects
Computer Science - Distributed, Parallel, and Cluster Computing
- Abstract
It is widely accepted that cosmic rays are a plausible cause of DRAM errors in high-performance computing (HPC) systems, and various studies suggest that they could explain some aspects of the observed DRAM error behavior. However, this phenomenon is insufficiently studied in production environments. We analyze the correlations between cosmic rays and DRAM errors on two HPC clusters: a production supercomputer with server-class DDR3-1600 and a prototype with LPDDR3-1600 and no hardware error correction. Our error logs cover 2000 billion MB-hours for the MareNostrum 3 supercomputer and 135 million MB-hours for the Mont-Blanc prototype. Our analysis combines quantitative analysis, formal statistical methods, and machine learning. We detect no indications that cosmic rays have any influence on the DRAM errors. To understand whether the findings are specific to the systems under study, located at 100 meters above sea level, the analysis should be repeated on other HPC clusters, especially those located at higher altitudes. Also, the analysis can (and should) be applied to revisit and extend numerous previous studies which use cosmic rays as a hypothetical explanation for some aspects of the observed DRAM error behaviors.
Comment: Accepted for publication in SBAC-PAD'24
- Published
2024