30 results on '"Paolo Rech"'
Search Results
2. Efficient Error Detection for Matrix Multiplication with Systolic Arrays on FPGAs
- Author
-
Fabiano Libano, Paolo Rech, and John Brunhaver
- Subjects
Computational Theory and Mathematics, Hardware and Architecture, Software, Theoretical Computer Science - Published
- 2023
- Full Text
- View/download PDF
3. Soft Error Effects on Arm Microprocessors: Early Estimations versus Chip Measurements
- Author
-
Paolo Rech, George N. Papadimitriou, Pablo Bodmann, Rubens Luiz Rech Junior, and Dimitris Gizopoulos
- Subjects
reliability, fault injection, Computer science, neutrons, soft error, ARM, Word error rate, Fault (power engineering), Chip, Theoretical Computer Science, Software, Computational Theory and Mathematics, Hardware and Architecture, Embedded system, Central processing unit, Reliability (statistics) - Abstract
Extensive research efforts are being carried out to evaluate and improve the reliability of computing devices, either through beam experiments or simulation-based fault injection. Unfortunately, it is still largely unclear to what extent fault injection can provide an accurate error rate estimation at early design stages, and whether beam experiments can be used to identify the weakest resources in a device. The challenges associated with reliability evaluation grow with the increasing complexity of the hardware and the software. In this paper, we combine and analyze data gathered with extensive beam experiments (on the final physical CPU hardware) and microarchitectural fault injections (on early microarchitectural CPU models; a toy sketch of the underlying bit-flip model follows this record). We target a standalone Arm Cortex-A5 and an Arm Cortex-A9 integrated in an SoC and evaluate their reliability in bare-metal and Linux-based configurations. We find that both the SoC integration and the OS presence increase the system DUE (detected unrecoverable error) rate, for different reasons, but do not significantly impact the SDC (silent data corruption) rate, which is solely attributed to the CPU core. Our reliability analysis demonstrates that, even considering SoC integration and OS inclusion, early, pre-silicon, microarchitecture-level fault injection delivers accurate SDC rate estimates and lower bounds for the DUE rates.
- Published
- 2022
- Full Text
- View/download PDF
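An aside on method: the microarchitectural fault-injection campaigns compared in the record above rest on a single-bit-flip model. Below is a minimal, self-contained Python sketch of that model; the toy workload and register width are hypothetical stand-ins, not the paper's injection framework. It shows how many injected flips are architecturally masked versus how many surface as silent data corruption (SDC).

    import random

    WIDTH = 32  # hypothetical register width

    def flip_bit(value: int, bit: int) -> int:
        """Inject a single-event upset: flip one bit of a register value."""
        return (value ^ (1 << bit)) & ((1 << WIDTH) - 1)

    def toy_workload(r0: int, r1: int) -> int:
        """Stand-in application: only 8 of r0's 32 bits reach the output,
        so flips in the other 24 bits are architecturally masked."""
        return ((r0 >> 8) & 0xFF) ^ r1

    random.seed(0)
    r0, r1 = 0x12345678, 0x0000FFFF
    golden = toy_workload(r0, r1)  # fault-free reference run

    runs, sdc = 10000, 0
    for _ in range(runs):
        faulty = toy_workload(flip_bit(r0, random.randrange(WIDTH)), r1)
        if faulty != golden:
            sdc += 1  # the flip propagated to the output

    print(f"observed SDC fraction: {sdc / runs:.2%}")  # about 25% for this toy

Real campaigns inject into detailed microarchitectural models and also classify crashes and hangs (DUEs), but the masked-versus-SDC bookkeeping is the same.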
4. Evaluating and Mitigating Neutrons Effects on COTS EdgeAI Accelerators
- Author
-
Paolo Rech, Carlo Cazzaniga, Sebastian Blower, Maria Kastriotou, and Christopher D. Frost
- Subjects
Nuclear and High Energy Physics, Reliability (semiconductor), Nuclear Energy and Engineering, Computer science, Redundancy (engineering), Word error rate, Overhead (computing), Neutron, Electrical and Electronic Engineering, Reliability engineering - Abstract
EdgeAI is an emerging artificial intelligence (AI) accelerator technology that delivers improved AI performance at both lower cost and lower power. Since these devices are intended for deployment in large quantities and in safety-critical environments, it is imperative to understand how single-event effects (SEEs) affect the reliability of this new family of devices and to propose efficient hardening solutions. Through neutron beam experiments and fault-injection analysis of a commercial-off-the-shelf (COTS) EdgeAI device, we identify the device's SEE failure modes, separate the error rate contributions of the device's different resources, and characterize the device's SEE reliability. During this analysis, we discovered that the vast majority of single-bit flips have no appreciable effect on the output. Based on this analysis, we propose a hardening solution that implements triple-modular redundancy (TMR) in the device without changing its physical architecture (the voting step is sketched after this record). We experimentally validate this solution and show that it corrects 96% of the misclassifications (critical errors) with nearly zero overhead.
- Published
- 2021
- Full Text
- View/download PDF
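The TMR hardening validated in the record above relies on majority voting. Here is a minimal sketch of the voting step, assuming hypothetical class-label outputs from three redundant inference copies; the paper's actual voting granularity and interface are not reproduced.

    from collections import Counter

    def majority_vote(outputs):
        """Return the label agreed on by at least two of three copies;
        if all three disagree, fall back to the first copy."""
        label, count = Counter(outputs).most_common(1)[0]
        return label if count >= 2 else outputs[0]

    # An SEE corrupts one copy's classification; the vote masks it.
    copies = ["stop_sign", "stop_sign", "speed_limit"]
    assert majority_vote(copies) == "stop_sign"

Voting on final outputs is one way to add redundancy without touching the physical architecture, which is consistent with the near-zero overhead the abstract reports.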
5. How Reduced Data Precision and Degree of Parallelism Impact the Reliability of Convolutional Neural Networks on FPGAs
- Author
-
Michael Wirthlin, Paolo Rech, F. Libano, J. Leavitt, John Brunhaver, and B. Neuman
- Subjects
Nuclear and High Energy Physics, Artificial neural network, Computer science, Reliability (computer networking), Degree of parallelism, Failure rate, Convolutional neural network, Reduction (complexity), Nuclear Energy and Engineering, Computer engineering, Hardware acceleration, Electrical and Electronic Engineering, Field-programmable gate array - Abstract
Convolutional neural networks (CNNs) are becoming attractive alternatives to traditional image-processing algorithms in self-driving vehicles for automotive, military, and aerospace applications. The high computational demand of state-of-the-art CNN architectures requires the use of hardware acceleration on parallel devices. Field-programmable gate arrays (FPGAs) offer a great level of design flexibility, low power consumption, and relatively low cost, which makes them very good candidates for efficiently accelerating neural networks. Unfortunately, the configuration memories of SRAM-based FPGAs are sensitive to radiation-induced errors, which can compromise the circuit implemented on the programmable fabric and the overall reliability of the system. Through neutron beam experiments, we evaluate how lossless quantization and the subsequent reduction in data precision impact the area, performance, radiation sensitivity, and failure rate of neural networks on FPGAs (a minimal sketch of int8 quantization follows this record). Our results show that an 8-bit integer design can deliver over six times more fault-free executions than a 32-bit floating-point implementation. Moreover, we discuss the tradeoffs associated with varying degrees of parallelism in a neural network accelerator. We show that, although increased parallelism increases radiation sensitivity, the performance gains generally outweigh the added sensitivity in terms of global failure rate.
- Published
- 2021
- Full Text
- View/download PDF
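To make the precision-reduction tradeoff above concrete, here is a minimal Python/NumPy sketch of symmetric per-tensor int8 quantization. The scale choice and tensor shape are hypothetical; the paper's FPGA quantization flow is not reproduced.

    import numpy as np

    def quantize_int8(w):
        """float32 -> (int8 tensor, scale), symmetric per-tensor."""
        scale = float(np.abs(w).max()) / 127.0
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    rng = np.random.default_rng(0)
    w = rng.standard_normal((64, 64)).astype(np.float32)
    q, s = quantize_int8(w)
    print(f"max round-trip error: {np.abs(dequantize(q, s) - w).max():.4f}")

An 8-bit datapath also needs fewer FPGA resources, and hence fewer critical configuration bits, than a 32-bit floating-point one, which is one intuition for the larger number of fault-free executions reported above.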
6. Improving Selective Fault Tolerance in GPU Register Files by Relaxing Application Accuracy
- Author
-
Ivan Lamb, Paolo Rech, Marcio M. Goncalves, Raphael M. Brum, and Jose Rodrigo Azambuja
- Subjects
Nuclear and High Energy Physics, Computer science, Reliability (computer networking), Register file, Fault tolerance, Fault injection, Nuclear Energy and Engineering, Computer engineering, Fault coverage, Transient (computer programming), Electrical and Electronic Engineering, Graphics, Vulnerability (computing) - Abstract
The high computing power of graphics processing units (GPUs) makes them attractive for safety-critical applications, where reliability is a major concern. This article takes an approximate-computing perspective, relaxing application accuracy in order to improve selective fault-tolerance techniques. Our approach first assesses the vulnerability of a Kepler GPU to transient effects through a neutron beam experiment. Then, it performs a fault-injection campaign to identify the most critical registers and relaxes the result accuracy. Finally, it uses the acquired data to improve selective fault-tolerance techniques in terms of occupation and performance. The results show that relaxing application accuracy improved the GPU register file's reliability by 71.6% on average and, compared with selective hardening alone, reduced the number of replicated registers by an average of 41.4% while maintaining 100% fault coverage.
- Published
- 2020
- Full Text
- View/download PDF
7. Impact of Tensor Cores and Mixed Precision on the Reliability of Matrix Multiplication in GPUs
- Author
-
Pedro Martins Basso, Paolo Rech, and Fernando Fernandes dos Santos
- Subjects
Nuclear and High Energy Physics, Correctness, Computer science, Word error rate, Chip, Convolutional neural network, Object detection, Matrix multiplication, Computational science, Nuclear Energy and Engineering, Kernel (image processing), Electrical and Electronic Engineering, Graphics - Abstract
Matrix multiplication (MxM) is a cornerstone application for both high-performance computing and safety-critical applications. Most of the operations in convolutional neural networks for object detection, in fact, are MxM related. Chip designers are proposing novel solutions to improve the efficiency of MxM execution. In this article, we investigate the impact of two such novel architectures for MxM, tensor cores and mixed precision, on the reliability of graphics processing units (GPUs); the mixed-precision numerics are sketched after this record. In addition, we evaluate how effective the embedded error-correcting code is in reducing the MxM error rate. Our results show that low-precision operations are more reliable and that tensor cores increase the amount of data correctly produced by the GPU. However, reduced precision and tensor cores also significantly increase the impact of faults on output correctness.
- Published
- 2020
- Full Text
- View/download PDF
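"Mixed precision" in the tensor-core sense means reduced-precision operands with a wider accumulator. The following NumPy emulation of float16 inputs with float32 accumulation is a sketch of the numerics only, not the CUDA tensor-core path.

    import numpy as np

    rng = np.random.default_rng(1)
    a = rng.standard_normal((128, 128)).astype(np.float32)
    b = rng.standard_normal((128, 128)).astype(np.float32)

    full = a @ b  # float32 baseline

    # Round the operands to float16, then multiply-accumulate in float32,
    # mimicking fp16-in / fp32-accumulate semantics.
    a16 = a.astype(np.float16).astype(np.float32)
    b16 = b.astype(np.float16).astype(np.float32)
    mixed = a16 @ b16

    print(f"max |mixed - full| = {np.abs(mixed - full).max():.4f}")

The same narrowing that costs accuracy also shrinks the amount of exposed state per value, which is one plausible reading of the abstract's finding that low-precision operations are more reliable even though the faults that do land have a larger relative impact.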
8. High-Energy Versus Thermal Neutron Contribution to Processor and Memory Error Rates
- Author
-
Fernando Fernandes dos Santos, Daniel Oliveira, Robert Baumann, Paolo Rech, Carlo Cazzaniga, Gabriel Piscoya Davila, and C. Frost
- Subjects
Physics, Nuclear and High Energy Physics, Word error rate, Neutron temperature, Computational science, Nuclear Energy and Engineering, Gate array, Thermal, Neutron, Central processing unit, Electrical and Electronic Engineering, Field-programmable gate array, Sensitivity (electronics) - Abstract
We present the results of accelerated radiation testing on an AMD accelerated processing unit, three Nvidia graphics processing units, an Intel accelerator, a field-programmable gate array, and two double-data-rate memories, under thermal and high-energy neutrons separately. The sensitivity depends on the device type and the code being executed, and we show that thermal neutrons contribute to the error rate of modern computing devices under certain conditions.
- Published
- 2020
- Full Text
- View/download PDF
9. Selective Fault Tolerance for Register Files of Graphics Processing Units
- Author
-
Paolo Rech, Jose Rodrigo Azambuja, Ivan Lamb, Márcio A D Gonçalves, and Fernando Antonio Da Silva Fernandes
- Subjects
Nuclear and High Energy Physics, Computer science, Computation, Reliability (computer networking), Register file, Automotive industry, Fault tolerance, Fault (power engineering), Software, Nuclear Energy and Engineering, Embedded system, Electrical and Electronic Engineering, Graphics - Abstract
The high computing efficiency of graphics processing units (GPUs) makes them attractive for both high-performance computing and safety-critical applications, such as automotive and aerospace ones. For both application domains, reliability is a major concern. This paper aims at providing guidelines to improve the reliability of the GPU register file without jeopardizing the device's computing efficiency. We advance the knowledge of GPU reliability by investigating register file criticality, which is the probability for a fault in a register to propagate and affect computation. Then, we propose and validate selective fault-tolerance techniques for the GPU register file that can be applied at the hardware or software level. Results show that both implementations are well suited to detect faults affecting computation. However, although hardware-implemented techniques are able to detect faults that trigger a crash, software-implemented techniques may not guarantee adequate coverage for crashes.
- Published
- 2019
- Full Text
- View/download PDF
10. Analyzing and Increasing the Reliability of Convolutional Neural Networks on GPUs
- Author
-
David Kaeli, Paolo Rech, Caio Lunardi, Fernando Fernandes dos Santos, Lucas Klein Draghetti, Pedro Foletto Pimenta, and Luigi Carro
- Subjects
Artificial neural network, Computer science, Reliability (computer networking), Fault tolerance, Fault injection, Convolutional neural network, Matrix multiplication, Object detection, Computer engineering, Electrical and Electronic Engineering, Graphics, Safety, Risk, Reliability and Quality - Abstract
Graphics processing units (GPUs) are playing a critical role in convolutional neural networks (CNNs) for image detection. As GPU-enabled CNNs move into safety-critical environments, reliability is becoming a growing concern. In this paper, we evaluate and propose strategies to improve the reliability of object detection algorithms, as run on three NVIDIA GPU architectures. We consider three algorithms: 1) You Only Look Once (YOLO); 2) a faster region-based CNN (Faster R-CNN); and 3) a residual network (ResNet), exposing live hardware to neutron beams. We complement our beam experiments with fault injection to better characterize fault propagation in CNNs. We show that a single fault occurring in a GPU tends to propagate to multiple active threads, significantly reducing the reliability of a CNN. Moreover, relying on error-correcting codes dramatically reduces the number of silent data corruptions (SDCs), but does not reduce the number of critical errors (i.e., errors that could potentially impact safety-critical applications). Based on observations of how faults propagate on GPU architectures, we propose effective strategies to improve CNN reliability. We also consider the benefits of using an algorithm-based fault-tolerance technique for matrix multiplication (sketched after this record), which can correct more than 87% of the critical SDCs in a CNN, while redesigning the maxpool layers of the CNN detects up to 98% of critical SDCs.
- Published
- 2019
- Full Text
- View/download PDF
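The algorithm-based fault tolerance (ABFT) technique mentioned above augments the operands of C = A*B with checksums so a single corrupted output element can be located and corrected. Here is a compact NumPy sketch of the classic checksum scheme; it is illustrative, and the paper's GPU implementation is not reproduced.

    import numpy as np

    def abft_matmul(a, b):
        """Multiply with an extra checksum row on A and checksum
        column on B; the product carries both checksums."""
        ac = np.vstack([a, a.sum(axis=0)])
        br = np.hstack([b, b.sum(axis=1, keepdims=True)])
        return ac @ br  # shape (n+1, m+1)

    def check_and_correct(c, tol=1e-6):
        """Locate a single corrupted element via checksum residuals
        and restore it in place."""
        data = c[:-1, :-1]
        row_err = c[:-1, -1] - data.sum(axis=1)
        col_err = c[-1, :-1] - data.sum(axis=0)
        bad_r = np.flatnonzero(np.abs(row_err) > tol)
        bad_c = np.flatnonzero(np.abs(col_err) > tol)
        if bad_r.size == 1 and bad_c.size == 1:
            data[bad_r[0], bad_c[0]] += row_err[bad_r[0]]
        return data

    rng = np.random.default_rng(2)
    a, b = rng.random((8, 8)), rng.random((8, 8))
    c = abft_matmul(a, b)
    c[3, 5] += 7.0                      # simulated silent data corruption
    assert np.allclose(check_and_correct(c), a @ b)

The checksums add only one extra row and column of arithmetic, which is why ABFT can protect large multiplications at a small relative cost.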
11. Selective Hardening for Neural Networks in FPGAs
- Author
-
F. Libano, Michael Wirthlin, Carlo Cazzaniga, Paolo Rech, C. Frost, B. Wilson, and Jordan L. Anderson
- Subjects
Nuclear and High Energy Physics, Correctness, Artificial neural network, Computer science, Iris recognition, Automotive industry, Fault injection, Convolutional neural network, Nuclear Energy and Engineering, Computer engineering, Electrical and Electronic Engineering, Field-programmable gate array, MNIST database - Abstract
Neural networks are becoming an attractive solution for automating vehicles in the automotive, military, and aerospace markets. Thanks to their low cost, low power consumption, and flexibility, field-programmable gate arrays (FPGAs) are among the most promising devices for implementing neural networks. Unfortunately, FPGAs are also known to be susceptible to radiation-induced errors. In this paper, we evaluate the effects of radiation-induced errors on the output correctness of two neural networks [an Iris Flower artificial neural network (ANN) and a Modified National Institute of Standards and Technology (MNIST) convolutional neural network (CNN)] implemented in static random-access memory-based FPGAs. In particular, we notice that radiation can induce errors that modify the output of the network with or without affecting the neural network's functionality. We call the former critical errors and the latter tolerable errors. Through exhaustive fault injection, we identify the portions of the Iris Flower ANN and MNIST CNN implementations on FPGAs that are more likely, once corrupted, to generate a critical or a tolerable error. Based on this analysis, we propose a selective hardening strategy that triplicates only the most vulnerable layers of the neural network. With neutron radiation testing, our selective hardening solution was able to mask 40% of faults with a marginal 8% overhead in one of our tested neural networks.
- Published
- 2019
- Full Text
- View/download PDF
12. Evaluating the Impact of Repetition, Redundancy, Scrubbing, and Partitioning on 28-nm FPGA Reliability Through Neutron Testing
- Author
-
Paolo Rech, Ogun O. Kibar, Ken Mai, and Prashanth Mohan
- Subjects
Triple modular redundancy, Nuclear and High Energy Physics, Computer science, Computational science, Nuclear Energy and Engineering, Logic gate, Redundancy (engineering), Static random-access memory, Electrical and Electronic Engineering, Error detection and correction, Dual modular redundancy, Field-programmable gate array, Radiation hardening - Abstract
SRAM-based field-programmable gate arrays (FPGAs) are widely deployed in space and high-radiation environments, but they exhibit vulnerability to radiation effects. Designs can be hardened against radiation effects with design-side countermeasures such as redundancy, scrubbing, and partitioning. Through neutron tests, we investigate the impact of these design-side countermeasures on 28-nm FPGAs. We specifically address not only the provided radiation hardness but also the resource utilization and performance overheads. In addition, we evaluate the efficacy of repeating the operation after error detection. The results show that using coarse-grained and fine-grained triple modular redundancy (TMR) over dual modular redundancy (DMR) improves the failure cross section by 3.29× and 11.49×, respectively. The partitioning scheme that we used does not show a significant effect on radiation hardness. Using an internal scrubber and repeating the operation after a failure further decreases DMR, coarse-grained TMR, and fine-grained TMR cross sections by 5.10×, 1.85×, and 1.18×, respectively.
- Published
- 2019
- Full Text
- View/download PDF
13. Reliability–Performance Analysis of Hardware and Software Co-Designs in SRAM-Based APSoCs
- Author
-
Marcilei A. G. Silveira, Paolo Rech, Nemitala Added, Lucas A. Tambara, Nilberto H. Medina, Filipe M. Lins, V. A. P. Aguiar, and Fernanda Lima Kastensmidt
- Subjects
Nuclear and High Energy Physics, Computer science, Fault injection, Software quality, Programmable logic device, Software, Nuclear Energy and Engineering, Gate array, Static random-access memory, Electrical and Electronic Engineering, Field-programmable gate array, Reliability (statistics), Computer hardware - Abstract
All programmable system-on-chip (APSoC) devices provide higher system performance and programmable flexibility at lower costs compared to standalone field-programmable gate array devices and processors. Unfortunately, it has been demonstrated that the high complexity and density of APSoCs increase the system's susceptibility to radiation-induced errors. This paper investigates the effects of soft errors on APSoCs at design level through reliability and performance analyses. We explore 28 different hardware and software co-designs varying the workload distribution between hardware and software. We also propose a reliability analysis flow based on fault injection (FI) to estimate the reliability trend of hardware-only and software-only designs and hardware-software co-designs. Results obtained from both radiation experiments and FI campaigns reveal that performance and reliability can be improved up to 117× by offloading the workload of an APSoC-based system to its programmable logic core. We also show that the proposed flow is a precise method to estimate the reliability trend of system designs on APSoCs before radiation experiments.
- Published
- 2018
- Full Text
- View/download PDF
14. On the Efficacy of ECC and the Benefits of FinFET Transistor Layout for GPU Reliability
- Author
-
David Kaeli, Fritz Previlon, Caio Lunardi, and Paolo Rech
- Subjects
Nuclear and High Energy Physics, Computer science, Silent data corruption, Reliability (semiconductor), Electrical and Electronic Engineering, Graphics, Queue, Transistor, Transient analysis, Nuclear Energy and Engineering, CMOS, Embedded system - Abstract
Using error-correcting codes (ECCs) is considered one of the most effective ways to mask the effects of radiation-induced faults in memory and computing devices. Unfortunately, with the increased complexity of modern processors, there is a growing amount of hidden logic and memory resources, such as flip-flops in internal pipelines and queues, that cannot be easily protected by ECC. In this paper, we experimentally investigate the efficacy of using ECC to mask neutron-induced faults in modern graphics processing units (GPUs); a toy single-error-correcting code follows this record. In our analysis, we consider GPUs fabricated in CMOS and FinFET technologies. We show that changes in transistor technology can be as beneficial as using ECC for reducing silent data corruption rates. Finally, we compare fault-injection results, carried out both on internal registers and at the instruction level, to better understand the effectiveness of ECC.
- Published
- 2018
- Full Text
- View/download PDF
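Memory ECC of the kind discussed above is typically a Hamming-style code. Here is a toy Hamming(7,4) encoder/corrector in Python; it is far simpler than the SECDED codes in real GPUs, but it shows the mechanism of syndrome-based single-bit correction.

    def encode(d1, d2, d3, d4):
        """4 data bits -> 7-bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
        p1 = d1 ^ d2 ^ d4   # parity over codeword positions 1,3,5,7
        p2 = d1 ^ d3 ^ d4   # parity over positions 2,3,6,7
        p4 = d2 ^ d3 ^ d4   # parity over positions 4,5,6,7
        return [p1, p2, d1, p4, d2, d3, d4]

    def correct(c):
        """Recompute parities; the syndrome is the 1-based position
        of a single flipped bit (0 means no error)."""
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
        s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
        syndrome = s1 + 2 * s2 + 4 * s4
        if syndrome:
            c[syndrome - 1] ^= 1
        return [c[2], c[4], c[5], c[6]]  # recovered data bits

    word = [1, 0, 1, 1]
    code = encode(*word)
    code[5] ^= 1                 # a single-event upset in one bit
    assert correct(code) == word

The abstract's point is that such codes only cover the structures they are attached to; flip-flops in pipelines and queues remain unprotected, which is where the transistor-level (FinFET) gains matter.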
15. On the Reliability of Linear Regression and Pattern Recognition Feedforward Artificial Neural Networks in FPGAs
- Author
-
F. Libano, Lucas A. Tambara, Fernanda Lima Kastensmidt, Paolo Rech, and Jorge Tonfat
- Subjects
Nuclear and High Energy Physics, Artificial neural network, Computer science, Reliability (computer networking), Activation function, Feed forward, Pattern recognition, Fault injection, Perceptron, Nuclear Energy and Engineering, Multilayer perceptron, Artificial intelligence, Electrical and Electronic Engineering - Abstract
In this paper, we experimentally and analytically evaluate the reliability of two state-of-the-art neural networks for linear regression and pattern recognition (a multilayer perceptron and a single-layer perceptron) implemented in a system-on-chip composed of a field-programmable gate array (FPGA) and a microprocessor. For each neural network, we considered three different activation function complexities, to evaluate how the implementation affects FPGA reliability. As we show in this paper, higher complexity increases the exposed area but reduces the probability that one failure impacts the network output. In addition, we propose to distinguish between critical and tolerable errors in artificial neural networks (the distinction is sketched after this record). Experiments using a controlled heavy-ion beam show that, for both networks, only about 30% of the observed output errors actually affect the output's correctness. We identify the causes of critical errors through fault injection, and find that faults in initial layers are more likely to significantly affect the output.
- Published
- 2018
- Full Text
- View/download PDF
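The critical/tolerable distinction proposed above can be stated concretely for a classifier: an output error is tolerable when the corrupted scores still select the correct class, and critical when the selected class changes. A minimal sketch with hypothetical score vectors:

    import numpy as np

    def classify_error(golden, faulty):
        """Compare fault-free and corrupted network outputs."""
        if np.array_equal(golden, faulty):
            return "masked"
        if np.argmax(golden) == np.argmax(faulty):
            return "tolerable"   # values differ, decision does not
        return "critical"        # the decision itself changed

    golden = np.array([0.10, 0.75, 0.15])
    assert classify_error(golden, np.array([0.12, 0.70, 0.18])) == "tolerable"
    assert classify_error(golden, np.array([0.10, 0.20, 0.70])) == "critical"

Under this reading, the abstract's roughly 30% figure says that about two out of three radiation-induced output perturbations never change the network's decision.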
16. Analyzing the Impact of Radiation-Induced Failures in Programmable SoCs
- Author
-
Jorge Tonfat, Lucas A. Tambara, Fernanda Lima Kastensmidt, Eduardo Chielle, and Paolo Rech
- Subjects
Flexibility (engineering), Nuclear and High Energy Physics, Engineering, Event (computing), Reliability (computer networking), Memory organisation, Context (language use), Nuclear Energy and Engineering, Computer engineering, Embedded system, Sensitivity (control systems), Electrical and Electronic Engineering, Field-programmable gate array, Vulnerability (computing) - Abstract
All Programmable System-on-Chip (APSoC) devices are designed to provide higher overall system performance and programmable flexibility at lower power consumption and cost. Although modern commercial APSoCs offer a plethora of advantages, they are prone to single-event upsets. We investigate how different system architectures on an APSoC impact the overall system failure rate, considering different memory organizations, communication schemes, and computing modes. Results show that there are several architecture and resource choices for implementing an application on an APSoC, and that specific logic resources can increase or decrease the vulnerability of the entire system to failures in the application execution context.
- Published
- 2016
- Full Text
- View/download PDF
17. Reliability Analysis of Operating Systems and Software Stack for Embedded Systems
- Author
-
Luigi Carro, Paolo Rech, Flávio Rech Wagner, and Thiago Santini
- Subjects
Nuclear and High Energy Physics, Engineering, Word error rate, Linux kernel, Software quality, Abstraction layer, Software, Nuclear Energy and Engineering, Embedded system, Operating system, Cache, Electrical and Electronic Engineering - Abstract
In this paper, we investigate how the presence of a general-purpose operating system influences the reliability of modern embedded systems-on-chip (SoCs). We experimentally study the difference in the neutron-induced error rate of SoCs when executing the application on bare metal (i.e., without an underlying operating system) and on top of the Linux kernel. Our analysis demonstrates that the presence of Linux barely affects the silent data corruption (SDC) rate, whereas it greatly increases the system functional interruption (SEFI) rate (up to 7.48 times) if no preventive measures are taken. Nevertheless, we experimentally demonstrate that cache conflicts between the operating system and the application can be leveraged to significantly reduce the Linux-induced SEFI rate increase. Moreover, we evaluate the masking effect of the OS software stack and show that the higher the abstraction layer in which an application is implemented, the lower its SDC rate. Furthermore, we analyze system reliability taking into account not only the resulting failure rates, but also the execution (and, thus, exposure) times.
- Published
- 2016
- Full Text
- View/download PDF
18. Evaluation and Mitigation of Radiation-Induced Soft Errors in Graphics Processing Units
- Author
-
Laércio Lima Pilla, Thiago Santini, Daniel Oliveira, and Paolo Rech
- Subjects
Computer science, Parallel algorithm, Fault tolerance, Parallel computing, Theoretical Computer Science, Computational Theory and Mathematics, Hardware and Architecture, Computer engineering, Overhead (computing), Sensitivity (control systems), Graphics, Software, Reliability (statistics) - Abstract
Graphics processing units (GPUs) are increasingly attractive for both safety-critical and high-performance computing applications. GPU reliability is a primary concern for both the automotive and aerospace markets and is becoming an issue for supercomputers as well. In fact, the large number of devices in big data centers makes the probability of having at least one corrupted device very high. In this paper, we aim at giving novel insights on GPU reliability by evaluating the neutron sensitivity of modern GPU memory structures, highlighting pattern dependence and multiple-error occurrences. Additionally, a wide set of parallel codes are exposed to controlled neutron beams to measure GPUs' operative error rates. From experimental data and algorithm analysis we derive general insights on the reliability of parallel algorithms and programming approaches. Finally, error-correcting codes, algorithm-based fault tolerance, and duplication-with-comparison hardening strategies (the last is sketched after this record) are presented and evaluated on GPUs through radiation experiments. We present and compare both the reliability improvement and the overhead imposed by the selected hardening solutions.
- Published
- 2016
- Full Text
- View/download PDF
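Of the three hardening strategies evaluated above, duplication with comparison is the simplest to sketch: run the computation twice and flag any divergence. It detects transient faults but cannot correct them (that requires a third copy or recomputation). A minimal Python sketch with a stand-in kernel:

    def dwc(kernel, *args):
        """Execute `kernel` twice on identical inputs; a mismatch
        signals a transient fault in one of the runs."""
        first = kernel(*args)
        second = kernel(*args)
        if first != second:
            raise RuntimeError("DWC mismatch: transient fault detected")
        return first

    # Deterministic stand-in workload, so this passes silently.
    result = dwc(lambda n: sum(i * i for i in range(n)), 1000)

In general, duplication roughly doubles execution time (and therefore exposure), which is the kind of overhead the abstract weighs against the reliability improvement.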
19. Memory Access Time and Input Size Effects on Parallel Processors Reliability
- Author
-
Daniel Oliveira, Paolo Rech, Luigi Carro, Philippe O. A. Navaux, Laércio Lima Pilla, and Caio Lunardi
- Subjects
Nuclear and High Energy Physics, Nuclear Energy and Engineering, Computer science, Electrical and Electronic Engineering, Reliability (statistics), Access time, Reliability engineering - Published
- 2015
- Full Text
- View/download PDF
20. Analyzing the Effectiveness of a Frame-Level Redundancy Scrubbing Technique for SRAM-based FPGAs
- Author
-
Heather Quinn, Paolo Rech, Ricardo Reis, Jorge Tonfat, and Fernanda Lima Kastensmidt
- Subjects
Nuclear and High Energy Physics, Engineering, Fault tolerance, Energy consumption, Fault injection, Nuclear Energy and Engineering, Redundancy (engineering), Electronic engineering, Static random-access memory, Electrical and Electronic Engineering, Field-programmable gate array, Data scrubbing, Computer hardware - Abstract
Radiation effects such as soft errors are the major threat to the reliability of SRAM-based FPGAs. This work analyzes the soft-error correction effectiveness of a novel scrubbing technique that uses internal frame redundancy, called frame-level redundancy scrubbing (FLR-scrubbing); its voting core is sketched after this record. This correction technique can be implemented in a coarse-grained TMR design. The FLR-scrubbing technique was implemented on a mid-size Xilinx Virtex-5 FPGA used as a case study, and was tested under neutron radiation and fault injection. Implementation results demonstrate minimal area and energy consumption overhead compared to other techniques, and the time to repair a fault is improved by using the Internal Configuration Access Port (ICAP). Neutron radiation test results demonstrate that the proposed technique is suitable for correcting accumulated SEUs and MBUs.
- Published
- 2015
- Full Text
- View/download PDF
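The heart of a frame-level redundancy scrubber is a bitwise majority vote across redundant copies of each configuration frame, with the voted value written back. A toy Python sketch over integers (the frame width and the ICAP readback/writeback flow are abstracted away):

    def vote3(f0, f1, f2):
        """Bitwise 2-of-3 majority across three frame copies."""
        return (f0 & f1) | (f0 & f2) | (f1 & f2)

    def scrub(frames):
        """Overwrite every copy with the voted frame, repairing
        accumulated upsets in any single copy."""
        golden = vote3(*frames)
        return [golden] * 3

    # Two bit upsets accumulated in one copy; the other two out-vote it.
    frames = [0b10110110, 0b10110110, 0b10010111]
    assert scrub(frames) == [0b10110110] * 3

Because the vote repairs each frame independently, multiple upsets remain correctable as long as no two copies of the same frame are hit, which is consistent with the abstract's claim about accumulated SEUs and MBUs.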
21. Using Benchmarks for Radiation Testing of Microprocessors and FPGAs
- Author
-
Luca Sterpone, Luis Entrena, Matteo Sonza Reorda, Bradley T. Kiddie, Miguel Aguirre, Fernanda Lima Kastensmidt, Antonio Sanchez-Clemente, Marco Desogus, A. Barnard, Michael Wirthlin, William H. Robinson, Steven M. Guertin, Mario Garcia-Valderas, Heather Quinn, Paolo Rech, and David Kaeli
- Subjects
Nuclear and High Energy Physics, Engineering, Radiation testing, Benchmarks, Microprocessors, FPGAs, Fault tolerance, Soft errors, Programmable logic array, Software, Software fault tolerance, Electrical and Electronic Engineering, Soft error rates, Field-programmable gate arrays (FPGAs), Reconfigurable computing, Nuclear Energy and Engineering, Computer engineering, Embedded system, Benchmark (computing), Compiler - Abstract
Performance benchmarks have been used over the years to compare different systems. These benchmarks can be useful for researchers trying to determine how changes to the technology, architecture, or compiler affect a system's performance. No such standard exists for systems deployed in high-radiation environments, making it difficult to assess whether changes in the fabrication process, circuitry, architecture, or software affect reliability or radiation sensitivity. In this paper, we propose a benchmark suite for high-reliability systems that is designed for field-programmable gate arrays and microprocessors. We describe the development process and report neutron test data for the hardware and software benchmarks.
- Published
- 2015
- Full Text
- View/download PDF
22. Reliability Evaluation of Embedded GPGPUs for Safety Critical Applications
- Author
-
Paolo Rech, Davide Sabena, Luca Sterpone, and Luigi Carro
- Subjects
Nuclear and High Energy Physics, Single-Event Effects, CPU cache, Computer science, Computation, Reliability (computer networking), GPGPU, Fast Fourier transform, Parallel algorithm, Parallel computing, Supercomputer, Nuclear Energy and Engineering, Radiation testing, Electrical and Electronic Engineering, General-purpose computing on graphics processing units - Abstract
Thanks to their capability of efficiently executing massive computations in parallel, general-purpose graphics processing units (GPGPUs) have begun to be preferred to CPUs for several parallel applications in different domains. GPGPUs have recently been employed in two particularly relevant fields: high-performance computing (HPC) and embedded systems. The reliability requirements differ between these two application domains: to be employed in safety-critical applications, GPGPUs for embedded systems must be qualified as reliable. In this paper, we analyze typical parallel algorithms for embedded GPGPUs through neutron irradiation and evaluate their reliability, studying how caches and thread distributions affect GPGPU reliability. The data were acquired through neutron test experiments performed at the VESUVIO neutron facility at ISIS. The experimental results show that the algorithm execution is most reliable when the L1 cache of the considered GPGPU is disabled. Moreover, they demonstrate that, during an FFT execution, most errors appear in the stages in which the GPGPU is fully loaded, i.e., when the number of instantiated parallel tasks is highest.
- Published
- 2014
- Full Text
- View/download PDF
23. Neutron Cross-Section of N-Modular Redundancy Technique in SRAM-Based FPGAs
- Author
-
Paolo Rech, Fernanda Lima Kastensmidt, Jimmy Tarrillo, Christopher D. Frost, and Carlos Valderrama
- Subjects
Triple modular redundancy, Nuclear and High Energy Physics, Engineering, Word error rate, Fault tolerance, Modular design, Nuclear Energy and Engineering, Redundancy (engineering), Electronic engineering, Static random-access memory, Electrical and Electronic Engineering, Field-programmable gate array, Computer hardware - Abstract
This paper evaluates different trade-offs of the N-modular redundancy technique in SRAM-based FPGAs. Redundant copies of the same module were implemented, with their outputs voted by a self-adapting majority voter. The redundant design was exposed to neutrons and its error rate evaluated. Cross section, area, and power consumption results were analyzed for different numbers of redundant modules, ranging from three copies (standard TMR) up to seven.
- Published
- 2014
- Full Text
- View/download PDF
24. GPUs Reliability Dependence on Degree of Parallelism
- Author
-
Paolo Rech, Gabriel L. Nazar, C. Frost, and Luigi Carro
- Subjects
Nuclear and High Energy Physics, Nuclear Energy and Engineering, Computer science, Data parallelism, Degree of parallelism, Task parallelism, Parallel computing, Electrical and Electronic Engineering, General-purpose computing on graphics processing units, Reliability (statistics) - Published
- 2014
- Full Text
- View/download PDF
25. Threads Distribution Effects on Graphics Processing Units Neutron Sensitivity
- Author
-
Luigi Carro, Heather Quinn, Thomas D. Fairbanks, and Paolo Rech
- Subjects
Nuclear and High Energy Physics, Nuclear Energy and Engineering, Computer science, Parallel algorithm, Word error rate, Parallel computing, Thread (computing), Electrical and Electronic Engineering, General-purpose computing on graphics processing units, Graphics, Circuit reliability, Block size, Scheduling (computing) - Abstract
Graphics processing units offer the possibility of executing several threads in parallel, providing the user with higher throughput than traditional multi-core processors. However, the additional resources required to schedule and handle the parallel processes may have the side effect of making GPUs more prone to corruption by neutrons. The reported experimental results show that an unoptimized thread distribution may exacerbate the device's output error rate. As demonstrated, increasing the parallel algorithm's block size minimizes the GPU's neutron-induced output error rate. GPU parallelism management is then analyzed as a method to increase reliability.
- Published
- 2013
- Full Text
- View/download PDF
26. Radiation and Fault Injection Testing of a Fine-Grained Error Detection Technique for FPGAs
- Author
-
Luigi Carro, Paolo Rech, Christopher D. Frost, and Gabriel L. Nazar
- Subjects
Nuclear and High Energy Physics, Engineering, Fault injection, Nuclear Energy and Engineering, Embedded system, Electronic engineering, Redundancy (engineering), Overhead (computing), Static random-access memory, Electrical and Electronic Engineering, Latency (engineering), Error detection and correction, Field-programmable gate array, Dual modular redundancy - Abstract
We present the experimental evaluation of a fine-grained hardening approach that exploits underused and abundant resources found in state-of-the-art SRAM-based FPGAs to detect radiation-induced errors on configuration memories. The technique's main goal is to provide the benefits of fine-grained redundancy, namely improved diagnosis and reduced error latency, with a reduced area overhead. Neutron experiments, validated with fault injection campaigns, demonstrate the proposed technique's efficiency when compared to the traditional dual modular redundancy.
- Published
- 2013
- Full Text
- View/download PDF
27. An Efficient and Experimentally Tuned Software-Based Hardening Strategy for Matrix Multiplication on GPUs
- Author
-
Luigi Carro, Christopher D. Frost, C. Aguiar, and Paolo Rech
- Subjects
Nuclear and High Energy Physics, Software, Nuclear Energy and Engineering, Matrix algebra, Computer science, Electrical and Electronic Engineering, Neutron radiation, Matrix multiplication, Computational science, Hardening (computing) - Abstract
Neutron radiation experiment results on matrix multiplication on graphic processing units (GPUs) show that multiple errors are detected at the output in more than 50% of the cases. In the presence of multiple errors, the available hardening strategies may become ineffective or inefficient. Analyzing radiation-induced error distributions, we developed an optimized and experimentally tuned software-based hardening strategy for GPUs. With fault-injection simulations, we compare the performance and correcting capabilities of the proposed technique with the available ones.
- Published
- 2013
- Full Text
- View/download PDF
28. A New Hardware/Software Platform and a New 1/E Neutron Source for Soft Error Studies: Testing FPGAs at the ISIS Facility
- Author
-
Marta Bagatin, Paolo Rech, Alessandro Paccagnella, A. Manuzzato, Massimo Violante, Christopher D. Frost, Antonino Pietropaolo, Salvatore Pontarelli, Carla Andreani, Simone Gerardin, Gian Carlo Cardarilli, Luca Sterpone, and Giuseppe Gorini
- Subjects
Nuclear and High Energy Physics, Engineering, Tracing, Single event upset (SEU), FPGA, Neutron source, Radiation testing, Software, Electronic engineering, Neutron, Static random-access memory, Electrical and Electronic Engineering, Field-programmable gate array, Nuclear Energy and Engineering, Soft error, Computer hardware, Energy (signal processing) - Abstract
We introduce a new hardware/software platform for testing SRAM-based FPGAs under heavy-ion and neutron beams, capable of tracing bit-flips in the configuration memory back to the physical resources affected in the FPGA. The validation was performed using, for the first time, the neutron source at the RAL ISIS facility. The ISIS beam features a 1/E spectrum, similar to the terrestrial one with an acceleration factor between 10^7 and 10^8 in the 10-100 MeV energy range. The results gathered on Xilinx SRAM-based FPGAs are discussed in terms of cross section and circuit-level modifications.
- Published
- 2007
- Full Text
- View/download PDF
29. Experimental and Analytical Analysis of Sorting Algorithms Error Criticality for HPC and Large Servers Applications
- Author
-
Caio Lunardi, Heather Quinn, Philippe O. A. Navaux, Daniel Oliveira, Laura Monroe, and Paolo Rech
- Subjects
Nuclear and High Energy Physics, Engineering, Sorting algorithm, Radix sort, Sorted array, Electrical engineering, Word error rate, Fault injection, Nuclear Energy and Engineering, Computer engineering, Electrical and Electronic Engineering, Merge sort, Quicksort - Abstract
In this paper, we investigate neutron-induced errors in three implementations of sorting algorithms (QuickSort, MergeSort, and RadixSort) executed on modern graphics processing units designed for high-performance computing and large-server applications. We measure the radiation-induced error rate of the sort algorithms taking advantage of the neutron beam available at the Los Alamos Neutron Science Center facility. We also analyze output error criticality by identifying specific output error patterns. We found that radiation can cause wrong elements to appear in the sorted array, misalign values, and trigger application crashes or system hangs. This paper presents results showing that the criticality of the radiation-induced output error pattern depends on the application. Additionally, an extensive fault-injection campaign has been performed, allowing for a better understanding of the observed phenomena. We take advantage of the SASS-assembly instrumented fault injector developed by NVIDIA, which can inject faults into all the user-accessible architectural state. Comparing fault-injection results with radiation experiment data shows that not all the output errors observed under radiation can be replicated in fault injection. However, fault injection is useful in identifying possible root causes of the output errors observed in radiation testing. Finally, we take advantage of our experimental and analytical study to design efficient, experimentally tuned hardening strategies. We detect the error patterns that are critical to the final application and find the most efficient way to detect them (a minimal post-sort check is sketched after this record). With an overhead as low as 16% of the execution time, we are able to reduce the output error rate of sort by about one order of magnitude.
- Published
- 2017
- Full Text
- View/download PDF
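The low-overhead detection described above targets the two output-error patterns the experiments found: misplaced (out-of-order) elements and corrupted values. A minimal post-sort check, written as a sketch under simple assumptions rather than the authors' tuned GPU implementation, combines an order scan with an order-insensitive content checksum taken before sorting:

    def checked_sort(data):
        """Sort, then verify ordering (catches misaligned elements)
        and a content checksum (catches substituted values)."""
        pre_sum = sum(data)              # cheap, order-insensitive fingerprint
        out = sorted(data)
        ordered = all(x <= y for x, y in zip(out, out[1:]))
        if not ordered or sum(out) != pre_sum:
            raise RuntimeError("output error detected after sort")
        return out

    print(checked_sort([5, 3, 9, 1]))    # [1, 3, 5, 9]

A plain sum can miss compensating corruptions; stronger multiset fingerprints raise coverage at extra cost, which is the kind of tuning the abstract's 16% overhead figure suggests.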
30. Register File Criticality and Compiler Optimization Effects on Embedded Microprocessors Reliability
- Author
-
Paolo Rech, Filipe M. Lins, Lucas A. Tambara, and Fernanda Lima Kastensmidt
- Subjects
Nuclear and High Energy Physics, Reduced instruction set computing, Computer science, Processor register, Register file, Optimizing compiler, Register window, Microprocessor, Nuclear Energy and Engineering, Status register, Operating system, Electrical and Electronic Engineering, Register allocation - Abstract
In this paper, we investigate the impact of register file errors on the reliability of modern embedded microprocessors through fault-injection and heavy-ion experiments. Additionally, we evaluate how different levels of compiler optimization modify the usage and failure probability of a processor's register file. We select six representative benchmarks, each one compiled with three different levels of compiler optimization. We performed exhaustive fault-injection campaigns to measure the registers' architectural vulnerability factor for each code and configuration, identifying the registers that are more likely to generate silent data corruption or single-event functional interruption. Moreover, we correlate the observed reliability variations with register file utilization. Finally, we irradiated two of the selected benchmarks, compiled with two levels of optimization, with heavy ions and correlated the experimental results with the fault-injection analysis.
- Published
- 2017
- Full Text
- View/download PDF