30 results on '"Paolo Rech"'
Search Results
2. Efficient Error Detection for Matrix Multiplication with Systolic Arrays on FPGAs
- Author
-
Fabiano Libano, Paolo Rech, and John Brunhaver
- Subjects
Computational Theory and Mathematics, Hardware and Architecture, Software, Theoretical Computer Science - Published
- 2023
- Full Text
- View/download PDF
3. Soft Error Effects on Arm Microprocessors: Early Estimations versus Chip Measurements
- Author
-
Paolo Rech, George N. Papadimitriou, Pablo Bodmann, Rubens Luiz Rech Junior, and Dimitris Gizopoulos
- Subjects
reliability, fault injection, Computer science, neutrons, soft error, ARM, Word error rate, Fault (power engineering), Chip, Theoretical Computer Science, Software, Computational Theory and Mathematics, Hardware and Architecture, Embedded system, Central processing unit, Reliability (statistics) - Abstract
Extensive research efforts are being carried out to evaluate and improve the reliability of computing devices, either through beam experiments or simulation-based fault injection. Unfortunately, it is still largely unclear to what extent fault injection can provide an accurate error rate estimation at early design stages, and whether beam experiments can be used to identify the weakest resources in a device. The challenges associated with reliability evaluation grow with the increasing complexity of the hardware and the software. In this paper, we combine and analyze data gathered with extensive beam experiments (on the final physical CPU hardware) and microarchitectural fault injections (on early microarchitectural CPU models; a toy sketch of the underlying bit-flip model follows this record). We target a standalone Arm Cortex-A5 and an Arm Cortex-A9 integrated in an SoC and evaluate their reliability in bare-metal and Linux-based configurations. We find that both the SoC integration and the OS presence increase the system DUE (detected unrecoverable error) rate, for different reasons, but do not significantly impact the SDC (silent data corruption) rate, which is solely attributed to the CPU core. Our reliability analysis demonstrates that, even considering SoC integration and OS inclusion, early, pre-silicon, microarchitecture-level fault injection delivers accurate SDC rate estimates and lower bounds for the DUE rates.
- Published
- 2022
- Full Text
- View/download PDF
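An aside on method: the microarchitectural fault-injection campaigns compared in the record above rest on a single-bit-flip model. Below is a minimal, self-contained Python sketch of that model; the toy workload and register width are hypothetical stand-ins, not the paper's injection framework. It shows how many injected flips are architecturally masked versus how many surface as silent data corruption (SDC).

    import random

    WIDTH = 32  # hypothetical register width

    def flip_bit(value: int, bit: int) -> int:
        """Inject a single-event upset: flip one bit of a register value."""
        return (value ^ (1 << bit)) & ((1 << WIDTH) - 1)

    def toy_workload(r0: int, r1: int) -> int:
        """Stand-in application: only 8 of r0's 32 bits reach the output,
        so flips in the other 24 bits are architecturally masked."""
        return ((r0 >> 8) & 0xFF) ^ r1

    random.seed(0)
    r0, r1 = 0x12345678, 0x0000FFFF
    golden = toy_workload(r0, r1)  # fault-free reference run

    runs, sdc = 10000, 0
    for _ in range(runs):
        faulty = toy_workload(flip_bit(r0, random.randrange(WIDTH)), r1)
        if faulty != golden:
            sdc += 1  # the flip propagated to the output

    print(f"observed SDC fraction: {sdc / runs:.2%}")  # about 25% for this toy

Real campaigns inject into detailed microarchitectural models and also classify crashes and hangs (DUEs), but the masked-versus-SDC bookkeeping is the same.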
4. Evaluating and Mitigating Neutrons Effects on COTS EdgeAI Accelerators
- Author
-
Paolo Rech, Carlo Cazzaniga, Sebastian Blower, Maria Kastriotou, and Christopher D. Frost
- Subjects
Nuclear and High Energy Physics, Reliability (semiconductor), Nuclear Energy and Engineering, Computer science, Redundancy (engineering), Word error rate, Overhead (computing), Neutron, Electrical and Electronic Engineering, Reliability engineering - Abstract
EdgeAI is an emerging artificial intelligence (AI) accelerator technology that delivers improved AI performance at both lower cost and lower power. Since these devices are intended for deployment in large quantities and in safety-critical environments, it is imperative to understand how single-event effects (SEEs) affect the reliability of this new family of devices and to propose efficient hardening solutions. Through neutron beam experiments and fault-injection analysis of a commercial-off-the-shelf (COTS) EdgeAI device, we identify the device's SEE failure modes, separate the error rate contributions of the device's different resources, and characterize the device's SEE reliability. During this analysis, we discovered that the vast majority of single-bit flips have no appreciable effect on the output. Based on this analysis, we propose a hardening solution that implements triple-modular redundancy (TMR) in the device without changing its physical architecture (the voting step is sketched after this record). We experimentally validate this solution and show that it corrects 96% of the misclassifications (critical errors) with nearly zero overhead.
- Published
- 2021
- Full Text
- View/download PDF
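The TMR hardening validated in the record above relies on majority voting. Here is a minimal sketch of the voting step, assuming hypothetical class-label outputs from three redundant inference copies; the paper's actual voting granularity and interface are not reproduced.

    from collections import Counter

    def majority_vote(outputs):
        """Return the label agreed on by at least two of three copies;
        if all three disagree, fall back to the first copy."""
        label, count = Counter(outputs).most_common(1)[0]
        return label if count >= 2 else outputs[0]

    # An SEE corrupts one copy's classification; the vote masks it.
    copies = ["stop_sign", "stop_sign", "speed_limit"]
    assert majority_vote(copies) == "stop_sign"

Voting on final outputs is one way to add redundancy without touching the physical architecture, which is consistent with the near-zero overhead the abstract reports.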
5. How Reduced Data Precision and Degree of Parallelism Impact the Reliability of Convolutional Neural Networks on FPGAs
- Author
-
Michael Wirthlin, Paolo Rech, F. Libano, J. Leavitt, John Brunhaver, and B. Neuman
- Subjects
Nuclear and High Energy Physics, Artificial neural network, Computer science, Reliability (computer networking), Degree of parallelism, Failure rate, Convolutional neural network, Reduction (complexity), Nuclear Energy and Engineering, Computer engineering, Hardware acceleration, Electrical and Electronic Engineering, Field-programmable gate array - Abstract
Convolutional neural networks (CNNs) are becoming attractive alternatives to traditional image-processing algorithms in self-driving vehicles for automotive, military, and aerospace applications. The high computational demand of state-of-the-art CNN architectures requires the use of hardware acceleration on parallel devices. Field-programmable gate arrays (FPGAs) offer a great level of design flexibility, low power consumption, and relatively low cost, which makes them very good candidates for efficiently accelerating neural networks. Unfortunately, the configuration memories of SRAM-based FPGAs are sensitive to radiation-induced errors, which can compromise the circuit implemented on the programmable fabric and the overall reliability of the system. Through neutron beam experiments, we evaluate how lossless quantization and the subsequent reduction in data precision impact the area, performance, radiation sensitivity, and failure rate of neural networks on FPGAs (a minimal sketch of int8 quantization follows this record). Our results show that an 8-bit integer design can deliver over six times more fault-free executions than a 32-bit floating-point implementation. Moreover, we discuss the tradeoffs associated with varying degrees of parallelism in a neural network accelerator. We show that, although increased parallelism increases radiation sensitivity, the performance gains generally outweigh the added sensitivity in terms of global failure rate.
- Published
- 2021
- Full Text
- View/download PDF
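To make the precision-reduction tradeoff above concrete, here is a minimal Python/NumPy sketch of symmetric per-tensor int8 quantization. The scale choice and tensor shape are hypothetical; the paper's FPGA quantization flow is not reproduced.

    import numpy as np

    def quantize_int8(w):
        """float32 -> (int8 tensor, scale), symmetric per-tensor."""
        scale = float(np.abs(w).max()) / 127.0
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    rng = np.random.default_rng(0)
    w = rng.standard_normal((64, 64)).astype(np.float32)
    q, s = quantize_int8(w)
    print(f"max round-trip error: {np.abs(dequantize(q, s) - w).max():.4f}")

An 8-bit datapath also needs fewer FPGA resources, and hence fewer critical configuration bits, than a 32-bit floating-point one, which is one intuition for the larger number of fault-free executions reported above.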
6. Improving Selective Fault Tolerance in GPU Register Files by Relaxing Application Accuracy
- Author
-
Ivan Lamb, Paolo Rech, Marcio M. Goncalves, Raphael M. Brum, and Jose Rodrigo Azambuja
- Subjects
Nuclear and High Energy Physics, Computer science, Reliability (computer networking), Register file, Fault tolerance, Fault injection, Nuclear Energy and Engineering, Computer engineering, Fault coverage, Transient (computer programming), Electrical and Electronic Engineering, Graphics, Vulnerability (computing) - Abstract
The high computing power of graphics processing units (GPUs) makes them attractive for safety-critical applications, where reliability is a major concern. This article takes an approximate-computing perspective, relaxing application accuracy in order to improve selective fault-tolerance techniques. Our approach first assesses the vulnerability of a Kepler GPU to transient effects through a neutron beam experiment. Then, it performs a fault-injection campaign to identify the most critical registers and relaxes the result accuracy. Finally, it uses the acquired data to improve selective fault-tolerance techniques in terms of occupation and performance. The results show that relaxing application accuracy improved the GPU register file's reliability by 71.6% on average and, compared with selective hardening alone, reduced the number of replicated registers by an average of 41.4% while maintaining 100% fault coverage.
- Published
- 2020
- Full Text
- View/download PDF
7. Impact of Tensor Cores and Mixed Precision on the Reliability of Matrix Multiplication in GPUs
- Author
-
Pedro Martins Basso, Paolo Rech, and Fernando Fernandes dos Santos
- Subjects
Nuclear and High Energy Physics, Correctness, Computer science, Word error rate, Chip, Convolutional neural network, Object detection, Matrix multiplication, Computational science, Nuclear Energy and Engineering, Kernel (image processing), Electrical and Electronic Engineering, Graphics - Abstract
Matrix multiplication (MxM) is a cornerstone application for both high-performance computing and safety-critical applications. Most of the operations in convolutional neural networks for object detection, in fact, are MxM related. Chip designers are proposing novel solutions to improve the efficiency of MxM execution. In this article, we investigate the impact of two such novel architectures for MxM, tensor cores and mixed precision, on the reliability of graphics processing units (GPUs); the mixed-precision numerics are sketched after this record. In addition, we evaluate how effective the embedded error-correcting code is in reducing the MxM error rate. Our results show that low-precision operations are more reliable and that tensor cores increase the amount of data correctly produced by the GPU. However, reduced precision and tensor cores also significantly increase the impact of faults on output correctness.
- Published
- 2020
- Full Text
- View/download PDF
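"Mixed precision" in the tensor-core sense means reduced-precision operands with a wider accumulator. The following NumPy emulation of float16 inputs with float32 accumulation is a sketch of the numerics only, not the CUDA tensor-core path.

    import numpy as np

    rng = np.random.default_rng(1)
    a = rng.standard_normal((128, 128)).astype(np.float32)
    b = rng.standard_normal((128, 128)).astype(np.float32)

    full = a @ b  # float32 baseline

    # Round the operands to float16, then multiply-accumulate in float32,
    # mimicking fp16-in / fp32-accumulate semantics.
    a16 = a.astype(np.float16).astype(np.float32)
    b16 = b.astype(np.float16).astype(np.float32)
    mixed = a16 @ b16

    print(f"max |mixed - full| = {np.abs(mixed - full).max():.4f}")

The same narrowing that costs accuracy also shrinks the amount of exposed state per value, which is one plausible reading of the abstract's finding that low-precision operations are more reliable even though the faults that do land have a larger relative impact.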
8. High-Energy Versus Thermal Neutron Contribution to Processor and Memory Error Rates
- Author
-
Fernando Fernandes dos Santos, Daniel Oliveira, Robert Baumann, Paolo Rech, Carlo Cazzaniga, Gabriel Piscoya Davila, and C. Frost
- Subjects
Physics, Nuclear and High Energy Physics, Word error rate, Neutron temperature, Computational science, Nuclear Energy and Engineering, Gate array, Thermal, Neutron, Central processing unit, Electrical and Electronic Engineering, Field-programmable gate array, Sensitivity (electronics) - Abstract
We present the results of accelerated radiation testing on an AMD accelerated processing unit, three Nvidia graphics processing units, an Intel accelerator, a field-programmable gate array, and two double-data-rate memories, under thermal and high-energy neutrons separately. The sensitivity depends on the device type and the code being executed, and we show that thermal neutrons contribute to the error rate of modern computing devices under certain conditions.
- Published
- 2020
- Full Text
- View/download PDF
9. Selective Fault Tolerance for Register Files of Graphics Processing Units
- Author
-
Paolo Rech, Jose Rodrigo Azambuja, Ivan Lamb, Márcio A D Gonçalves, and Fernando Antonio Da Silva Fernandes
- Subjects
Nuclear and High Energy Physics, Computer science, Computation, Reliability (computer networking), Register file, Automotive industry, Fault tolerance, Fault (power engineering), Software, Nuclear Energy and Engineering, Embedded system, Electrical and Electronic Engineering, Graphics - Abstract
The high computing efficiency of graphics processing units (GPUs) makes them attractive for both high-performance computing and safety-critical applications, such as automotive and aerospace ones. For both application domains, reliability is a major concern. This paper aims at providing guidelines to improve the reliability of the GPU register file without jeopardizing the device's computing efficiency. We advance the knowledge of GPU reliability by investigating register file criticality, which is the probability for a fault in a register to propagate and affect computation. Then, we propose and validate selective fault-tolerance techniques for the GPU register file that can be applied at the hardware or software level. Results show that both implementations are well suited to detect faults affecting computation. However, although hardware-implemented techniques are able to detect faults that trigger a crash, software-implemented techniques may not guarantee adequate coverage for crashes.
- Published
- 2019
- Full Text
- View/download PDF
10. Analyzing and Increasing the Reliability of Convolutional Neural Networks on GPUs
- Author
-
David Kaeli, Paolo Rech, Caio Lunardi, Fernando Fernandes dos Santos, Lucas Klein Draghetti, Pedro Foletto Pimenta, and Luigi Carro
- Subjects
Artificial neural network, Computer science, Reliability (computer networking), Fault tolerance, Fault injection, Convolutional neural network, Matrix multiplication, Object detection, Computer engineering, Electrical and Electronic Engineering, Graphics, Safety, Risk, Reliability and Quality - Abstract
Graphics processing units (GPUs) are playing a critical role in convolutional neural networks (CNNs) for image detection. As GPU-enabled CNNs move into safety-critical environments, reliability is becoming a growing concern. In this paper, we evaluate and propose strategies to improve the reliability of object detection algorithms, as run on three NVIDIA GPU architectures. We consider three algorithms: 1) You Only Look Once (YOLO); 2) a faster region-based CNN (Faster R-CNN); and 3) a residual network (ResNet), exposing live hardware to neutron beams. We complement our beam experiments with fault injection to better characterize fault propagation in CNNs. We show that a single fault occurring in a GPU tends to propagate to multiple active threads, significantly reducing the reliability of a CNN. Moreover, relying on error-correcting codes dramatically reduces the number of silent data corruptions (SDCs), but does not reduce the number of critical errors (i.e., errors that could potentially impact safety-critical applications). Based on observations of how faults propagate on GPU architectures, we propose effective strategies to improve CNN reliability. We also consider the benefits of using an algorithm-based fault-tolerance technique for matrix multiplication (sketched after this record), which can correct more than 87% of the critical SDCs in a CNN, while redesigning the maxpool layers of the CNN detects up to 98% of critical SDCs.
- Published
- 2019
- Full Text
- View/download PDF
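The algorithm-based fault tolerance (ABFT) technique mentioned above augments the operands of C = A*B with checksums so a single corrupted output element can be located and corrected. Here is a compact NumPy sketch of the classic checksum scheme; it is illustrative, and the paper's GPU implementation is not reproduced.

    import numpy as np

    def abft_matmul(a, b):
        """Multiply with an extra checksum row on A and checksum
        column on B; the product carries both checksums."""
        ac = np.vstack([a, a.sum(axis=0)])
        br = np.hstack([b, b.sum(axis=1, keepdims=True)])
        return ac @ br  # shape (n+1, m+1)

    def check_and_correct(c, tol=1e-6):
        """Locate a single corrupted element via checksum residuals
        and restore it in place."""
        data = c[:-1, :-1]
        row_err = c[:-1, -1] - data.sum(axis=1)
        col_err = c[-1, :-1] - data.sum(axis=0)
        bad_r = np.flatnonzero(np.abs(row_err) > tol)
        bad_c = np.flatnonzero(np.abs(col_err) > tol)
        if bad_r.size == 1 and bad_c.size == 1:
            data[bad_r[0], bad_c[0]] += row_err[bad_r[0]]
        return data

    rng = np.random.default_rng(2)
    a, b = rng.random((8, 8)), rng.random((8, 8))
    c = abft_matmul(a, b)
    c[3, 5] += 7.0                      # simulated silent data corruption
    assert np.allclose(check_and_correct(c), a @ b)

The checksums add only one extra row and column of arithmetic, which is why ABFT can protect large multiplications at a small relative cost.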
11. Selective Hardening for Neural Networks in FPGAs
- Author
-
F. Libano, Michael Wirthlin, Carlo Cazzaniga, Paolo Rech, C. Frost, B. Wilson, and Jordan L. Anderson
- Subjects
Nuclear and High Energy Physics, Correctness, Artificial neural network, Computer science, Iris recognition, Automotive industry, Fault injection, Convolutional neural network, Nuclear Energy and Engineering, Computer engineering, Electrical and Electronic Engineering, Field-programmable gate array, MNIST database - Abstract
Neural networks are becoming an attractive solution for automating vehicles in the automotive, military, and aerospace markets. Thanks to their low cost, low power consumption, and flexibility, field-programmable gate arrays (FPGAs) are among the most promising devices for implementing neural networks. Unfortunately, FPGAs are also known to be susceptible to radiation-induced errors. In this paper, we evaluate the effects of radiation-induced errors on the output correctness of two neural networks [an Iris Flower artificial neural network (ANN) and a Modified National Institute of Standards and Technology (MNIST) convolutional neural network (CNN)] implemented in static random-access memory-based FPGAs. In particular, we notice that radiation can induce errors that modify the output of the network with or without affecting the neural network's functionality. We call the former critical errors and the latter tolerable errors. Through exhaustive fault injection, we identify the portions of the Iris Flower ANN and MNIST CNN implementations on FPGAs that are more likely, once corrupted, to generate a critical or a tolerable error. Based on this analysis, we propose a selective hardening strategy that triplicates only the most vulnerable layers of the neural network. With neutron radiation testing, our selective hardening solution was able to mask 40% of faults with a marginal 8% overhead in one of our tested neural networks.
- Published
- 2019
- Full Text
- View/download PDF
12. Evaluating the Impact of Repetition, Redundancy, Scrubbing, and Partitioning on 28-nm FPGA Reliability Through Neutron Testing
- Author
-
Paolo Rech, Ogun O. Kibar, Ken Mai, and Prashanth Mohan
- Subjects
Triple modular redundancy, Nuclear and High Energy Physics, Computer science, Computational science, Nuclear Energy and Engineering, Logic gate, Redundancy (engineering), Static random-access memory, Electrical and Electronic Engineering, Error detection and correction, Dual modular redundancy, Field-programmable gate array, Radiation hardening - Abstract
SRAM-based field-programmable gate arrays (FPGAs) are widely deployed in space and high-radiation environments, but they exhibit vulnerability to radiation effects. Designs can be hardened against radiation effects with design-side countermeasures such as redundancy, scrubbing, and partitioning. Through neutron tests, we investigate the impact of these design-side countermeasures on 28-nm FPGAs. We specifically address not only the provided radiation hardness but also the resource utilization and performance overheads. In addition, we evaluate the efficacy of repeating the operation after error detection. The results show that using coarse-grained and fine-grained triple modular redundancy (TMR) over dual modular redundancy (DMR) improves the failure cross section by 3.29× and 11.49×, respectively. The partitioning scheme that we used does not show a significant effect on radiation hardness. Using an internal scrubber and repeating the operation after a failure further decreases DMR, coarse-grained TMR, and fine-grained TMR cross sections by 5.10×, 1.85×, and 1.18×, respectively.
- Published
- 2019
- Full Text
- View/download PDF
13. Reliability–Performance Analysis of Hardware and Software Co-Designs in SRAM-Based APSoCs
- Author
-
Marcilei A. G. Silveira, Paolo Rech, Nemitala Added, Lucas A. Tambara, Nilberto H. Medina, Filipe M. Lins, V. A. P. Aguiar, and Fernanda Lima Kastensmidt
- Subjects
Nuclear and High Energy Physics, Computer science, Fault injection, Software quality, Programmable logic device, Software, Nuclear Energy and Engineering, Gate array, Static random-access memory, Electrical and Electronic Engineering, Field-programmable gate array, Reliability (statistics), Computer hardware - Abstract
All programmable system-on-chip (APSoC) devices provide higher system performance and programmable flexibility at lower costs compared to standalone field-programmable gate array devices and processors. Unfortunately, it has been demonstrated that the high complexity and density of APSoCs increase the system's susceptibility to radiation-induced errors. This paper investigates the effects of soft errors on APSoCs at design level through reliability and performance analyses. We explore 28 different hardware and software co-designs varying the workload distribution between hardware and software. We also propose a reliability analysis flow based on fault injection (FI) to estimate the reliability trend of hardware-only and software-only designs and hardware-software co-designs. Results obtained from both radiation experiments and FI campaigns reveal that performance and reliability can be improved up to 117× by offloading the workload of an APSoC-based system to its programmable logic core. We also show that the proposed flow is a precise method to estimate the reliability trend of system designs on APSoCs before radiation experiments.
- Published
- 2018
- Full Text
- View/download PDF
14. On the Efficacy of ECC and the Benefits of FinFET Transistor Layout for GPU Reliability
- Author
-
David Kaeli, Fritz Previlon, Caio Lunardi, and Paolo Rech
- Subjects
Nuclear and High Energy Physics, Computer science, Silent data corruption, Reliability (semiconductor), Electrical and Electronic Engineering, Graphics, Queue, Transistor, Transient analysis, Nuclear Energy and Engineering, CMOS, Embedded system - Abstract
Using error-correcting codes (ECCs) is considered one of the most effective ways to mask the effects of radiation-induced faults in memory and computing devices. Unfortunately, with the increased complexity of modern processors, there is a growing amount of hidden logic and memory resources, such as flip-flops in internal pipelines and queues, that cannot be easily protected by ECC. In this paper, we experimentally investigate the efficacy of using ECC to mask neutron-induced faults in modern graphics processing units (GPUs); a toy single-error-correcting code follows this record. In our analysis, we consider GPUs fabricated in CMOS and FinFET technologies. We show that changes in transistor technology can be as beneficial as using ECC for reducing silent data corruption rates. Finally, we compare fault-injection results, carried out both on internal registers and at the instruction level, to better understand the effectiveness of ECC.
- Published
- 2018
- Full Text
- View/download PDF
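Memory ECC of the kind discussed above is typically a Hamming-style code. Here is a toy Hamming(7,4) encoder/corrector in Python; it is far simpler than the SECDED codes in real GPUs, but it shows the mechanism of syndrome-based single-bit correction.

    def encode(d1, d2, d3, d4):
        """4 data bits -> 7-bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
        p1 = d1 ^ d2 ^ d4   # parity over codeword positions 1,3,5,7
        p2 = d1 ^ d3 ^ d4   # parity over positions 2,3,6,7
        p4 = d2 ^ d3 ^ d4   # parity over positions 4,5,6,7
        return [p1, p2, d1, p4, d2, d3, d4]

    def correct(c):
        """Recompute parities; the syndrome is the 1-based position
        of a single flipped bit (0 means no error)."""
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
        s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
        syndrome = s1 + 2 * s2 + 4 * s4
        if syndrome:
            c[syndrome - 1] ^= 1
        return [c[2], c[4], c[5], c[6]]  # recovered data bits

    word = [1, 0, 1, 1]
    code = encode(*word)
    code[5] ^= 1                 # a single-event upset in one bit
    assert correct(code) == word

The abstract's point is that such codes only cover the structures they are attached to; flip-flops in pipelines and queues remain unprotected, which is where the transistor-level (FinFET) gains matter.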
15. On the Reliability of Linear Regression and Pattern Recognition Feedforward Artificial Neural Networks in FPGAs
- Author
-
F. Libano, Lucas A. Tambara, Fernanda Lima Kastensmidt, Paolo Rech, and Jorge Tonfat
- Subjects
Nuclear and High Energy Physics, Artificial neural network, Computer science, Reliability (computer networking), Activation function, Feed forward, Pattern recognition, Fault injection, Perceptron, Nuclear Energy and Engineering, Multilayer perceptron, Artificial intelligence, Electrical and Electronic Engineering - Abstract
In this paper, we experimentally and analytically evaluate the reliability of two state-of-the-art neural networks for linear regression and pattern recognition (a multilayer perceptron and a single-layer perceptron) implemented in a system-on-chip composed of a field-programmable gate array (FPGA) and a microprocessor. For each neural network, we considered three different activation function complexities, to evaluate how the implementation affects FPGA reliability. As we show in this paper, higher complexity increases the exposed area but reduces the probability that one failure impacts the network output. In addition, we propose to distinguish between critical and tolerable errors in artificial neural networks (the distinction is sketched after this record). Experiments using a controlled heavy-ion beam show that, for both networks, only about 30% of the observed output errors actually affect the output's correctness. We identify the causes of critical errors through fault injection, and find that faults in initial layers are more likely to significantly affect the output.
- Published
- 2018
- Full Text
- View/download PDF
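The critical/tolerable distinction proposed above can be stated concretely for a classifier: an output error is tolerable when the corrupted scores still select the correct class, and critical when the selected class changes. A minimal sketch with hypothetical score vectors:

    import numpy as np

    def classify_error(golden, faulty):
        """Compare fault-free and corrupted network outputs."""
        if np.array_equal(golden, faulty):
            return "masked"
        if np.argmax(golden) == np.argmax(faulty):
            return "tolerable"   # values differ, decision does not
        return "critical"        # the decision itself changed

    golden = np.array([0.10, 0.75, 0.15])
    assert classify_error(golden, np.array([0.12, 0.70, 0.18])) == "tolerable"
    assert classify_error(golden, np.array([0.10, 0.20, 0.70])) == "critical"

Under this reading, the abstract's roughly 30% figure says that about two out of three radiation-induced output perturbations never change the network's decision.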
16. Analyzing the Impact of Radiation-Induced Failures in Programmable SoCs
- Author
-
Jorge Tonfat, Lucas A. Tambara, Fernanda Lima Kastensmidt, Eduardo Chielle, and Paolo Rech
- Subjects
Flexibility (engineering), Nuclear and High Energy Physics, Engineering, Event (computing), Reliability (computer networking), Memory organisation, Context (language use), Nuclear Energy and Engineering, Computer engineering, Embedded system, Sensitivity (control systems), Electrical and Electronic Engineering, Field-programmable gate array, Vulnerability (computing) - Abstract
All Programmable System-on-Chip (APSoC) devices are designed to provide higher overall system performance and programmable flexibility at lower power consumption and cost. Although modern commercial APSoCs offer a plethora of advantages, they are prone to single-event upsets. We investigate how different system architectures on an APSoC impact the overall system failure rate, considering different memory organizations, communication schemes, and computing modes. Results show that there are several architecture and resource choices for implementing an application on an APSoC, and that specific logic resources can increase or decrease the vulnerability of the entire system to failures in the application execution context.
- Published
- 2016
- Full Text
- View/download PDF
17. Reliability Analysis of Operating Systems and Software Stack for Embedded Systems
- Author
-
Luigi Carro, Paolo Rech, Flávio Rech Wagner, and Thiago Santini
- Subjects
Nuclear and High Energy Physics, Engineering, Word error rate, Linux kernel, Software quality, Abstraction layer, Software, Nuclear Energy and Engineering, Embedded system, Operating system, Cache, Electrical and Electronic Engineering - Abstract
In this paper, we investigate how the presence of a general-purpose operating system influences the reliability of modern embedded systems-on-chip (SoCs). We experimentally study the difference in the neutron-induced error rate of SoCs when executing the application on bare metal (i.e., without an underlying operating system) and on top of the Linux kernel. Our analysis demonstrates that the presence of Linux barely affects the silent data corruption (SDC) rate, whereas it greatly increases the system functional interruption (SEFI) rate (up to 7.48 times) if no preventive measures are taken. Nevertheless, we experimentally demonstrate that cache conflicts between the operating system and the application can be leveraged to significantly reduce the Linux-induced SEFI rate increase. Moreover, we evaluate the masking effect of the OS software stack and show that the higher the abstraction layer in which an application is implemented, the lower its SDC rate. Furthermore, we analyze system reliability taking into account not only the resulting failure rates, but also the execution (and, thus, exposure) times.
- Published
- 2016
- Full Text
- View/download PDF
18. Evaluation and Mitigation of Radiation-Induced Soft Errors in Graphics Processing Units
- Author
-
Laércio Lima Pilla, Thiago Santini, Daniel Oliveira, and Paolo Rech
- Subjects
Computer science, Parallel algorithm, Fault tolerance, Parallel computing, Theoretical Computer Science, Computational Theory and Mathematics, Hardware and Architecture, Computer engineering, Overhead (computing), Sensitivity (control systems), Graphics, Software, Reliability (statistics) - Abstract
Graphics processing units (GPUs) are increasingly attractive for both safety-critical and high-performance computing applications. GPU reliability is a primary concern for both the automotive and aerospace markets and is becoming an issue for supercomputers as well. In fact, the large number of devices in big data centers makes the probability of having at least one corrupted device very high. In this paper, we aim at giving novel insights on GPU reliability by evaluating the neutron sensitivity of modern GPU memory structures, highlighting pattern dependence and multiple-error occurrences. Additionally, a wide set of parallel codes are exposed to controlled neutron beams to measure GPUs' operative error rates. From experimental data and algorithm analysis we derive general insights on the reliability of parallel algorithms and programming approaches. Finally, error-correcting codes, algorithm-based fault tolerance, and duplication-with-comparison hardening strategies (the last is sketched after this record) are presented and evaluated on GPUs through radiation experiments. We present and compare both the reliability improvement and the overhead imposed by the selected hardening solutions.
- Published
- 2016
- Full Text
- View/download PDF
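Of the three hardening strategies evaluated above, duplication with comparison is the simplest to sketch: run the computation twice and flag any divergence. It detects transient faults but cannot correct them (that requires a third copy or recomputation). A minimal Python sketch with a stand-in kernel:

    def dwc(kernel, *args):
        """Execute `kernel` twice on identical inputs; a mismatch
        signals a transient fault in one of the runs."""
        first = kernel(*args)
        second = kernel(*args)
        if first != second:
            raise RuntimeError("DWC mismatch: transient fault detected")
        return first

    # Deterministic stand-in workload, so this passes silently.
    result = dwc(lambda n: sum(i * i for i in range(n)), 1000)

In general, duplication roughly doubles execution time (and therefore exposure), which is the kind of overhead the abstract weighs against the reliability improvement.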
19. Memory Access Time and Input Size Effects on Parallel Processors Reliability
- Author
-
Daniel Oliveira, Paolo Rech, Luigi Carro, Philippe O. A. Navaux, Laércio Lima Pilla, and Caio Lunardi
- Subjects
Nuclear and High Energy Physics, Nuclear Energy and Engineering, Computer science, Electrical and Electronic Engineering, Reliability (statistics), Access time, Reliability engineering - Published
- 2015
- Full Text
- View/download PDF
20. Analyzing the Effectiveness of a Frame-Level Redundancy Scrubbing Technique for SRAM-based FPGAs
- Author
-
Heather Quinn, Paolo Rech, Ricardo Reis, Jorge Tonfat, and Fernanda Lima Kastensmidt
- Subjects
Nuclear and High Energy Physics, Engineering, Fault tolerance, Energy consumption, Fault injection, Nuclear Energy and Engineering, Redundancy (engineering), Electronic engineering, Static random-access memory, Electrical and Electronic Engineering, Field-programmable gate array, Data scrubbing, Computer hardware - Abstract
Radiation effects such as soft errors are the major threat to the reliability of SRAM-based FPGAs. This work analyzes the soft-error correction effectiveness of a novel scrubbing technique that uses internal frame redundancy, called frame-level redundancy scrubbing (FLR-scrubbing); its voting core is sketched after this record. This correction technique can be implemented in a coarse-grained TMR design. The FLR-scrubbing technique was implemented on a mid-size Xilinx Virtex-5 FPGA used as a case study, and was tested under neutron radiation and fault injection. Implementation results demonstrate minimal area and energy consumption overhead compared to other techniques, and the time to repair a fault is improved by using the Internal Configuration Access Port (ICAP). Neutron radiation test results demonstrate that the proposed technique is suitable for correcting accumulated SEUs and MBUs.
- Published
- 2015
- Full Text
- View/download PDF
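The heart of a frame-level redundancy scrubber is a bitwise majority vote across redundant copies of each configuration frame, with the voted value written back. A toy Python sketch over integers (the frame width and the ICAP readback/writeback flow are abstracted away):

    def vote3(f0, f1, f2):
        """Bitwise 2-of-3 majority across three frame copies."""
        return (f0 & f1) | (f0 & f2) | (f1 & f2)

    def scrub(frames):
        """Overwrite every copy with the voted frame, repairing
        accumulated upsets in any single copy."""
        golden = vote3(*frames)
        return [golden] * 3

    # Two bit upsets accumulated in one copy; the other two out-vote it.
    frames = [0b10110110, 0b10110110, 0b10010111]
    assert scrub(frames) == [0b10110110] * 3

Because the vote repairs each frame independently, multiple upsets remain correctable as long as no two copies of the same frame are hit, which is consistent with the abstract's claim about accumulated SEUs and MBUs.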
21. Using Benchmarks for Radiation Testing of Microprocessors and FPGAs
- Author
-
Luca Sterpone, Luis Entrena, Matteo Sonza Reorda, Bradley T. Kiddie, Miguel Aguirre, Fernanda Lima Kastensmidt, Antonio Sanchez-Clemente, Marco Desogus, A. Barnard, Michael Wirthlin, William H. Robinson, Steven M. Guertin, Mario Garcia-Valderas, Heather Quinn, Paolo Rech, and David Kaeli
- Subjects
Nuclear and High Energy Physics, Engineering, Radiation testing, Benchmarks, Microprocessors, FPGAs, Fault tolerance, Soft errors, Programmable logic array, Software, Software fault tolerance, Electrical and Electronic Engineering, Soft error rates, Field-programmable gate arrays (FPGAs), Reconfigurable computing, Nuclear Energy and Engineering, Computer engineering, Embedded system, Benchmark (computing), Compiler - Abstract
Performance benchmarks have been used over the years to compare different systems. These benchmarks can be useful for researchers trying to determine how changes to the technology, architecture, or compiler affect a system's performance. No such standard exists for systems deployed in high-radiation environments, making it difficult to assess whether changes in the fabrication process, circuitry, architecture, or software affect reliability or radiation sensitivity. In this paper, we propose a benchmark suite for high-reliability systems that is designed for field-programmable gate arrays and microprocessors. We describe the development process and report neutron test data for the hardware and software benchmarks.
- Published
- 2015
- Full Text
- View/download PDF
22. Reliability Evaluation of Embedded GPGPUs for Safety Critical Applications
- Author
-
Paolo Rech, Davide Sabena, Luca Sterpone, and Luigi Carro
- Subjects
Nuclear and High Energy Physics, Single-Event Effects, CPU cache, Computer science, Computation, Reliability (computer networking), GPGPU, Fast Fourier transform, Parallel algorithm, Parallel computing, Supercomputer, Nuclear Energy and Engineering, Radiation testing, Electrical and Electronic Engineering, General-purpose computing on graphics processing units - Abstract
Thanks to their capability of efficiently executing massive computations in parallel, general-purpose graphics processing units (GPGPUs) have begun to be preferred to CPUs for several parallel applications in different domains. GPGPUs have recently been employed in two particularly relevant fields: high-performance computing (HPC) and embedded systems. The reliability requirements differ between these two application domains: to be employed in safety-critical applications, GPGPUs for embedded systems must be qualified as reliable. In this paper, we analyze typical parallel algorithms for embedded GPGPUs through neutron irradiation and evaluate their reliability, studying how caches and thread distributions affect GPGPU reliability. The data were acquired through neutron test experiments performed at the VESUVIO neutron facility at ISIS. The experimental results show that the algorithm execution is most reliable when the L1 cache of the considered GPGPU is disabled. Moreover, they demonstrate that, during an FFT execution, most errors appear in the stages in which the GPGPU is fully loaded, i.e., when the number of instantiated parallel tasks is highest.
- Published
- 2014
- Full Text
- View/download PDF
23. Neutron Cross-Section of N-Modular Redundancy Technique in SRAM-Based FPGAs
- Author
-
Paolo Rech, Fernanda Lima Kastensmidt, Jimmy Tarrillo, Christopher D. Frost, and Carlos Valderrama
- Subjects
Triple modular redundancy, Nuclear and High Energy Physics, Engineering, Word error rate, Fault tolerance, Modular design, Nuclear Energy and Engineering, Redundancy (engineering), Electronic engineering, Static random-access memory, Electrical and Electronic Engineering, Field-programmable gate array, Computer hardware - Abstract
This paper evaluates different trade-offs of the N-modular redundancy technique in SRAM-based FPGAs. Redundant copies of the same module were implemented, with their outputs voted by a self-adapting majority voter. The redundant design was exposed to neutrons and its error rate evaluated. Cross section, area, and power consumption results were analyzed for different numbers of redundant modules, ranging from three copies (standard TMR) up to seven.
- Published
- 2014
- Full Text
- View/download PDF
24. GPUs Reliability Dependence on Degree of Parallelism
- Author
-
Paolo Rech, Gabriel L. Nazar, C. Frost, and Luigi Carro
- Subjects
Nuclear and High Energy Physics, Nuclear Energy and Engineering, Computer science, Data parallelism, Degree of parallelism, Task parallelism, Parallel computing, Electrical and Electronic Engineering, General-purpose computing on graphics processing units, Reliability (statistics) - Published
- 2014
- Full Text
- View/download PDF
25. Threads Distribution Effects on Graphics Processing Units Neutron Sensitivity
- Author
-
Luigi Carro, Heather Quinn, Thomas D. Fairbanks, and Paolo Rech
- Subjects
Nuclear and High Energy Physics, Nuclear Energy and Engineering, Computer science, Parallel algorithm, Word error rate, Parallel computing, Thread (computing), Electrical and Electronic Engineering, General-purpose computing on graphics processing units, Graphics, Circuit reliability, Block size, Scheduling (computing) - Abstract
Graphics processing units offer the possibility of executing several threads in parallel, providing the user with higher throughput than traditional multi-core processors. However, the additional resources required to schedule and handle the parallel processes may have the side effect of making GPUs more prone to corruption by neutrons. The reported experimental results show that an unoptimized thread distribution may exacerbate the device's output error rate. As demonstrated, increasing the parallel algorithm's block size minimizes the GPU's neutron-induced output error rate. GPU parallelism management is then analyzed as a method to increase reliability.
- Published
- 2013
- Full Text
- View/download PDF
26. Radiation and Fault Injection Testing of a Fine-Grained Error Detection Technique for FPGAs
- Author
-
Luigi Carro, Paolo Rech, Christopher D. Frost, and Gabriel L. Nazar
- Subjects
Nuclear and High Energy Physics, Engineering, Fault injection, Nuclear Energy and Engineering, Embedded system, Electronic engineering, Redundancy (engineering), Overhead (computing), Static random-access memory, Electrical and Electronic Engineering, Latency (engineering), Error detection and correction, Field-programmable gate array, Dual modular redundancy - Abstract
We present the experimental evaluation of a fine-grained hardening approach that exploits underused and abundant resources found in state-of-the-art SRAM-based FPGAs to detect radiation-induced errors on configuration memories. The technique's main goal is to provide the benefits of fine-grained redundancy, namely improved diagnosis and reduced error latency, with a reduced area overhead. Neutron experiments, validated with fault injection campaigns, demonstrate the proposed technique's efficiency when compared to the traditional dual modular redundancy.
- Published
- 2013
- Full Text
- View/download PDF
27. An Efficient and Experimentally Tuned Software-Based Hardening Strategy for Matrix Multiplication on GPUs
- Author
-
Luigi Carro, Christopher D. Frost, C. Aguiar, and Paolo Rech
- Subjects
Nuclear and High Energy Physics, Software, Nuclear Energy and Engineering, Matrix algebra, Computer science, Electrical and Electronic Engineering, Neutron radiation, Matrix multiplication, Computational science, Hardening (computing) - Abstract
Neutron radiation experiment results on matrix multiplication on graphic processing units (GPUs) show that multiple errors are detected at the output in more than 50% of the cases. In the presence of multiple errors, the available hardening strategies may become ineffective or inefficient. Analyzing radiation-induced error distributions, we developed an optimized and experimentally tuned software-based hardening strategy for GPUs. With fault-injection simulations, we compare the performance and correcting capabilities of the proposed technique with the available ones.
- Published
- 2013
- Full Text
- View/download PDF
28. A New Hardware/Software Platform and a New 1/E Neutron Source for Soft Error Studies: Testing FPGAs at the ISIS Facility
- Author
-
Marta Bagatin, Paolo Rech, Alessandro Paccagnella, A. Manuzzato, Massimo Violante, Christopher D. Frost, Antonino Pietropaolo, Salvatore Pontarelli, Carla Andreani, Simone Gerardin, Gian Carlo Cardarilli, Luca Sterpone, and Giuseppe Gorini
- Subjects
Nuclear and High Energy Physics, Engineering, Tracing, Single event upset (SEU), FPGA, Neutron source, Radiation testing, Software, Electronic engineering, Neutron, Static random-access memory, Electrical and Electronic Engineering, Field-programmable gate array, Nuclear Energy and Engineering, Soft error, Computer hardware, Energy (signal processing) - Abstract
We introduce a new hardware/software platform for testing SRAM-based FPGAs under heavy-ion and neutron beams, capable of tracing bit-flips in the configuration memory back to the physical resources affected in the FPGA. The validation was performed using, for the first time, the neutron source at the RAL ISIS facility. The ISIS beam features a 1/E spectrum, similar to the terrestrial one with an acceleration factor between 10^7 and 10^8 in the 10-100 MeV energy range. The results gathered on Xilinx SRAM-based FPGAs are discussed in terms of cross section and circuit-level modifications.
- Published
- 2007
- Full Text
- View/download PDF
29. Experimental and Analytical Analysis of Sorting Algorithms Error Criticality for HPC and Large Servers Applications
- Author
-
Caio Lunardi, Heather Quinn, Philippe O. A. Navaux, Daniel Oliveira, Laura Monroe, and Paolo Rech
- Subjects
Nuclear and High Energy Physics, Engineering, Sorting algorithm, Radix sort, Sorted array, Electrical engineering, Word error rate, Fault injection, Nuclear Energy and Engineering, Computer engineering, Electrical and Electronic Engineering, Merge sort, Quicksort - Abstract
In this paper, we investigate neutron-induced errors in three implementations of sorting algorithms (QuickSort, MergeSort, and RadixSort) executed on modern graphics processing units designed for high-performance computing and large-server applications. We measure the radiation-induced error rate of the sort algorithms taking advantage of the neutron beam available at the Los Alamos Neutron Science Center facility. We also analyze output error criticality by identifying specific output error patterns. We found that radiation can cause wrong elements to appear in the sorted array, misalign values, and trigger application crashes or system hangs. This paper presents results showing that the criticality of the radiation-induced output error pattern depends on the application. Additionally, an extensive fault-injection campaign has been performed, allowing for a better understanding of the observed phenomena. We take advantage of the SASS-assembly instrumented fault injector developed by NVIDIA, which can inject faults into all the user-accessible architectural state. Comparing fault-injection results with radiation experiment data shows that not all the output errors observed under radiation can be replicated in fault injection. However, fault injection is useful in identifying possible root causes of the output errors observed in radiation testing. Finally, we take advantage of our experimental and analytical study to design efficient, experimentally tuned hardening strategies. We detect the error patterns that are critical to the final application and find the most efficient way to detect them (a minimal post-sort check is sketched after this record). With an overhead as low as 16% of the execution time, we are able to reduce the output error rate of sort by about one order of magnitude.
- Published
- 2017
- Full Text
- View/download PDF
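The low-overhead detection described above targets the two output-error patterns the experiments found: misplaced (out-of-order) elements and corrupted values. A minimal post-sort check, written as a sketch under simple assumptions rather than the authors' tuned GPU implementation, combines an order scan with an order-insensitive content checksum taken before sorting:

    def checked_sort(data):
        """Sort, then verify ordering (catches misaligned elements)
        and a content checksum (catches substituted values)."""
        pre_sum = sum(data)              # cheap, order-insensitive fingerprint
        out = sorted(data)
        ordered = all(x <= y for x, y in zip(out, out[1:]))
        if not ordered or sum(out) != pre_sum:
            raise RuntimeError("output error detected after sort")
        return out

    print(checked_sort([5, 3, 9, 1]))    # [1, 3, 5, 9]

A plain sum can miss compensating corruptions; stronger multiset fingerprints raise coverage at extra cost, which is the kind of tuning the abstract's 16% overhead figure suggests.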
30. Register File Criticality and Compiler Optimization Effects on Embedded Microprocessors Reliability
- Author
-
Paolo Rech, Filipe M. Lins, Lucas A. Tambara, and Fernanda Lima Kastensmidt
- Subjects
Nuclear and High Energy Physics, Reduced instruction set computing, Computer science, Processor register, Register file, Optimizing compiler, Register window, Microprocessor, Nuclear Energy and Engineering, Status register, Operating system, Electrical and Electronic Engineering, Register allocation - Abstract
In this paper, we investigate the impact of register file errors on the reliability of modern embedded microprocessors through fault-injection and heavy-ion experiments. Additionally, we evaluate how different levels of compiler optimization modify the usage and failure probability of a processor's register file. We select six representative benchmarks, each one compiled with three different levels of compiler optimization. We performed exhaustive fault-injection campaigns to measure the registers' architectural vulnerability factor for each code and configuration, identifying the registers that are more likely to generate silent data corruption or single-event functional interruption. Moreover, we correlate the observed reliability variations with register file utilization. Finally, we irradiated two of the selected benchmarks, compiled with two levels of optimization, with heavy ions and correlated the experimental results with the fault-injection analysis.
- Published
- 2017
- Full Text
- View/download PDF