Resilient Deep Learning Accelerators
- Author
He, Yi
- Subjects
Computer science
- Abstract
Deep learning (DL) accelerators have been widely deployed across a broad range of application domains. Resilience against hardware failures is a top priority, as such failures can lead to a variety of undesirable consequences. For inference workloads, hardware failures can cause crashes and significant degradation in inference accuracy. For training workloads, they can result in failure to converge, low training/test accuracy, numerical errors, and crashes. In this thesis, we establish new knowledge about how major classes of hardware failures propagate through and affect inference and training workloads, and we advance the state of the art in resilient deep learning systems by devising lightweight, effective detection and recovery techniques that mitigate hardware failures. This thesis makes three main contributions.

First, we devise a novel technique for generating high-quality test programs that detect permanent hardware failures in the field. For compute hardware units, we first use automatic test pattern generation (ATPG) to produce high-quality test patterns, and then reverse-engineer the dataflow/reuse algorithm of the accelerator to map the test inputs to equivalent DNNs. For control hardware units, our key observation is that typically only one or a few fixed DNNs are deployed at a time, which allows us to target only the hardware failures that can directly affect these DNNs by executing different layers of the DNNs with carefully crafted weight and input tensors. We demonstrate that our technique detects more than 99.9% of stuck-at faults and more than 99.0% of transition faults, compared to fewer than 80% for random test programs.

Second, we provide (1) FIdelity, an open-source resilience analysis framework targeting transient hardware failures, and (2) the first in-depth study, using this framework, of the impact of transient failures on DL inference accelerators through fault injection experiments. FIdelity enables accurate and fast resilience analysis by modeling hardware failures in software with high fidelity: it accounts for both the spatial and temporal reuse of hardware signals to map the effects of a given faulty hardware signal to a set of faulty output neurons in the current DNN layer. Using FIdelity, we perform 46 million fault injection experiments across a range of representative inference workloads. We thoroughly analyze the results and obtain new insights for designing efficient, resilient DL inference accelerators.

Lastly, we focus on DL training workloads and provide (1) the first study that establishes a fundamental understanding of how transient and permanent hardware failures affect DL training workloads, and (2) new, lightweight hardware failure mitigation techniques. We extend the FIdelity framework to perform large-scale fault injection experiments on DL training workloads, conducting more than 2.9 million experiments with a diverse set of training workloads. We characterize the outcomes of these experiments and derive the necessary conditions that must be satisfied for a hardware failure to cause an unexpected training outcome. Based on these necessary conditions, we develop ultra-lightweight software techniques that detect hardware failures and recover the workloads, requiring only 24 to 32 changed lines of code and introducing 0.003% to 0.025% performance overhead for various representative neural networks.
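To make the fault-injection methodology concrete, below is a minimal, illustrative sketch of software-level fault injection in the spirit of FIdelity's approach: a transient hardware fault is modeled as a single bit flip in one output neuron of one layer. This is not FIdelity's actual code or API; the helper names (`flip_bit`, `make_fault_hook`), the victim model, and the injection site are hypothetical, and the sketch uses ordinary PyTorch forward hooks.

```python
# Hypothetical sketch: model a transient fault as a bit flip in one output
# neuron of a layer, in the spirit of (but not identical to) FIdelity.
import torch
import torch.nn as nn

def flip_bit(value: torch.Tensor, bit: int) -> torch.Tensor:
    """Flip one bit of a 1-element float32 tensor by reinterpreting its bits."""
    return (value.view(torch.int32) ^ (1 << bit)).view(torch.float32)

def make_fault_hook(neuron_index: int, bit: int):
    """Return a forward hook that corrupts one output neuron of a layer."""
    def hook(module, inputs, output):
        out = output.clone()                # do not mutate the original buffer
        flat = out.view(-1)
        flat[neuron_index] = flip_bit(flat[neuron_index].reshape(1), bit)
        return out                          # returned value replaces the output
    return hook

# Hypothetical victim model; inject into the conv layer's output feature map.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(),
                      nn.Flatten(), nn.Linear(8 * 30 * 30, 10))

x = torch.randn(1, 3, 32, 32)
with torch.no_grad():
    golden = model(x)                       # fault-free reference run
    handle = model[0].register_forward_hook(
        make_fault_hook(neuron_index=42, bit=30))
    faulty = model(x)                       # run with the injected fault
    handle.remove()
print("max logit deviation:", (faulty - golden).abs().max().item())
```

In a full campaign of the kind the abstract describes, each faulty hardware signal would be mapped (via the accelerator's spatial and temporal reuse) to its set of affected neurons and bits, and each run would be classified by comparing the faulty outputs against the golden run.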
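The thesis reports that its training-time detection and recovery techniques require only 24 to 32 changed lines of code. The exact necessary conditions are derived in the thesis itself; the sketch below merely assumes, for illustration, that an unmitigated fault eventually surfaces as a NaN/Inf loss or an implausible loss spike, and that recovery means rolling the model back to the last known-good snapshot. All function names and thresholds are hypothetical.

```python
# Hedged sketch of ultra-lightweight detection/recovery wrapped around an
# ordinary training loop. The detection condition here is an assumption for
# illustration, not the thesis's derived necessary conditions.
import copy
import math

def loss_is_suspicious(loss: float, prev_loss: float,
                       spike_factor: float = 100.0) -> bool:
    """Detector: flag NaN/Inf or an implausibly large jump in the loss."""
    return (not math.isfinite(loss) or
            (math.isfinite(prev_loss) and loss > spike_factor * max(prev_loss, 1e-8)))

def train(model, optimizer, loader, loss_fn, checkpoint_every: int = 100):
    prev_loss, step = float("inf"), 0
    snapshot = copy.deepcopy(model.state_dict())   # last known-good weights
    for x, y in loader:
        loss = loss_fn(model(x), y)
        if loss_is_suspicious(loss.item(), prev_loss):
            model.load_state_dict(snapshot)        # recover: roll back weights
            optimizer.zero_grad()                  # (a real version would also
            continue                               #  snapshot optimizer state)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        prev_loss, step = loss.item(), step + 1
        if step % checkpoint_every == 0:
            snapshot = copy.deepcopy(model.state_dict())
    return model
```

A production version would also snapshot the optimizer state and data-loader position; the point of the sketch is only that detection and rollback can be a handful of lines wrapped around an unmodified training loop, which is consistent with the small code-change footprint the abstract reports.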
- Published
2023