501. WARM: Workload-Aware Reliability Management in Linux/Android
- Author
- Pietro Mercati, Francesco Paterna, Andrea Bartolini, Luca Benini, and Tajana Šimunić Rosing
- Subjects
020203 distributed computing; Engineering; Multi-core processor; Optimization problem; reliability; business.industry; Workload; 02 engineering and technology; Solver; Mobile; Computer Graphics and Computer-Aided Design; 020202 computer hardware & architecture; User experience design; Embedded system; Convex optimization; 0202 electrical engineering, electronic engineering, information engineering; resource management; Performance improvement; Android (operating system); Electrical and Electronic Engineering; business; Software
- Abstract
With CMOS scaling beyond 14 nm, reliability is a major concern for IC manufacturers. Reliability-aware design has a non-negligible overhead and cannot account for user experience in mobile devices. An alternative is dynamic reliability management (DRM), which counteracts degradation by adapting the operating conditions at runtime. In this paper, for the first time, we formulate DRM as an optimization problem that accounts for reliability, temperature, and performance. We develop an optimal policy for multicores using convex optimization, and show that it is not feasible to implement on real systems. For this reason, we propose workload-aware reliability management (WARM), a fast DRM technique that adapts to diverse workload requirements to trade off reliability against user experience. WARM is implemented and tested on a real Android device. WARM approximates the solution of the convex solver within 5% on average, while executing more than $400\times$ faster. WARM integrates a thermal controller that allocates tasks to meet thermal constraints. This is required because degradation strongly depends on temperature. We show that WARM meets temperature constraints within 5% in 87.5% more cases than the state of the art. We show that WARM task allocation achieves up to a one-year lifetime improvement for a multicore platform. It can achieve up to 100% performance improvement on cluster architectures, such as big.LITTLE, while still guaranteeing the reliability target. Finally, we show that it achieves performance within 4% of the maximum for a broad range of applications, while meeting the reliability constraints.
- Published
- 2017