
Experimental and Analytical Analysis of Sorting Algorithms Error Criticality for HPC and Large Servers Applications.

Authors :
Lunardi, Caio
Quinn, Heather
Monroe, Laura
Oliveira, Daniel
Navaux, Philippe
Rech, Paolo
Source :
IEEE Transactions on Nuclear Science; Aug2017, Vol. 64 Issue 8 Part 1, p2169-2178, 10p
Publication Year :
2017

Abstract

In this paper, we investigate neutron-induced errors in three implementations of sort algorithms (QuickSort, MergeSort, and RadixSort) executed on modern graphics processing units designed for high-performance computing and large server applications. We measure the radiation-induced error rate of sort algorithms using the neutron beam available at the Los Alamos Neutron Science Center facility. We also analyze output error criticality by identifying specific output error patterns. We found that radiation can cause wrong elements to appear in the sorted array and values to be misaligned, as well as application crashes or system hangs. This paper presents results showing that the criticality of the radiation-induced output error pattern depends on the application. Additionally, an extensive fault-injection campaign has been performed to better understand the observed phenomena. We take advantage of the SASS-assembly Instrumentator Fault Injector developed by NVIDIA, which can inject faults into all the user-accessible architectural state. Comparing fault-injection results with radiation experiment data shows that not all the output errors observed under radiation can be replicated in fault injection. However, fault injection is useful in identifying possible root causes of the output errors observed in radiation testing. Finally, we take advantage of our experimental and analytical study to design efficient, experimentally tuned hardening strategies. We detect the error patterns that are critical to the final application and find the most efficient way to detect them. With an overhead as low as 16% of the execution time, we are able to reduce the output error rate of sort by about one order of magnitude. [ABSTRACT FROM PUBLISHER]
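
Illustrative note: the abstract distinguishes output errors where wrong elements appear in the sorted array from errors where values are misaligned. The abstract does not describe the authors' detection method, so the following is only a minimal sketch of how such output patterns could be told apart on the host side. The function name `classify_sort_output` and the three category labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def classify_sort_output(original, output):
    """Classify a sort result against its input.

    Hypothetical categories (assumed for illustration, not the paper's taxonomy):
      "correct"        - output is ordered and is a permutation of the input
      "wrong_elements" - output contains values (or counts) not in the input
      "misaligned"     - output has the right elements but in the wrong order
    """
    # Order check: every adjacent pair must be non-decreasing.
    in_order = all(a <= b for a, b in zip(output, output[1:]))
    # Content check: the multiset of values must match the input exactly.
    same_elements = Counter(original) == Counter(output)

    if in_order and same_elements:
        return "correct"
    if not same_elements:
        return "wrong_elements"
    return "misaligned"


if __name__ == "__main__":
    data = [3, 1, 2]
    print(classify_sort_output(data, [1, 2, 3]))  # ordered permutation of the input
    print(classify_sort_output(data, [1, 3, 2]))  # right values, wrong positions
    print(classify_sort_output(data, [1, 2, 7]))  # a value that was never in the input
```

Both checks run in linear time, well below the cost of the sort itself, which is the general reason output checks of this kind can be cheap. Whether this matches the 16% overhead strategy reported in the abstract is not stated there.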

Details

Language :
English
ISSN :
00189499
Volume :
64
Issue :
8 Part 1
Database :
Complementary Index
Journal :
IEEE Transactions on Nuclear Science
Publication Type :
Academic Journal
Accession number :
125531034
Full Text :
https://doi.org/10.1109/TNS.2017.2727499