1. Adaptive Request Scheduling for the I/O Forwarding Layer using Reinforcement Learning
- Author
-
Philippe O. A. Navaux, Jean Luca Bez, Toni Cortes, Francieli Zanon Boito, Alberto Miranda, Ramon Nou, Instituto de Informática da UFRGS (UFRGS), Universidade Federal do Rio Grande do Sul [Porto Alegre] (UFRGS), Topology-Aware System-Scale Data Management for High-Performance Computing (TADAAM), Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Inria Bordeaux - Sud-Ouest, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria), Data Aware Large Scale Computing (DATAMOVE ), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire d'Informatique de Grenoble (LIG), Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP ), Université Grenoble Alpes (UGA)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP ), Université Grenoble Alpes (UGA), Barcelona Supercomputing Center - Centro Nacional de Supercomputacion (BSC - CNS), This study was financed in part by the Coordenac¸ao de Aperfeicoamento de Pessoal de Nıvel Superior - Brasil (CAPES) - Finance Code 001. It has also received support from the Conselho Nacional de Desenvolvimento Cientıfico e Tecnologico (CNPq), Brazil, the LICIA International Laboratory, NCSA-Inria-ANL-BSC-JSC-Riken Joint-Laboratory on Extreme Scale Computing (JLESC), the Spanish Ministry of Science and Innovation under the TIN2015–65316 grant, and the Generalitat de Catalunya under contract 2014– SGR–1051. This research received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 800144. Experiments presented in this paper were carried out using the Grid’5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations (see https://www. grid5000.fr). The authors acknowledge the National Laboratory for Scientific Computing (LNCC/MCTI, Brazil) for providing HPC resources of the SDumont supercomputer, which have contributed to the research results reported within this paper., Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, and Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)-Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)-Inria Bordeaux - Sud-Ouest
- Subjects
I/O forwarding ,Parallel I/O ,I/O scheduling ,Computer Networks and Communications ,Computer science ,Distributed computing ,Auto-tuning ,02 engineering and technology ,Scheduling (computing) ,I/O Scheduling ,I/O Forwarding ,Machine learning ,Reinforcement learning ,Aprenentatge automàtic ,0202 electrical engineering, electronic engineering, information engineering ,Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC] ,Input/output ,High performance I/O ,020206 networking & telecommunications ,Workload ,Reinforcement Learning ,Hardware and Architecture ,Superordinadors ,020201 artificial intelligence & image processing ,High performance computing ,[INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC] ,Software - Abstract
In this paper, we propose an approach to adapt the I/O forwarding layer of HPC systems to applications’ access patterns. I/O optimization techniques can improve performance for the access patterns they were designed to target, but they often decrease performance for others. Furthermore, these techniques usually depend on the precise tune of their parameters, which commonly falls back to the users. Instead, we propose to do it dynamically at runtime based on the I/O workload observed by the system. Our approach uses a reinforcement learning technique – contextual bandits – to make the system capable of learning the best parameter value to each observed access pattern during its execution. That eliminates the need of a complicated and time-consuming previous training phase. Our case study is the TWINS scheduling algorithm, where performance improvements depend on the time window parameter, which in turn depends on the workload. We evaluate our proposal and demonstrate it can reach a precision of 88% on the parameter selection in the first hundreds of observations of an access pattern, achieving 99% of the optimal performance. We demonstrate that the system – which is expected to live for years – will be able to adapt to changes and optimize its performance after having observed an access pattern for a few (not necessarily contiguous) minutes. This study was financed in part by the Coordenaçao de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001. It has also received support from the Conselho Nacional de Desenvolvimento Científico e Tecnologico (CNPq), Brazil; the LICIA International Laboratory; NCSA-Inria-ANL-BSC-JSC-Riken Joint-Laboratory on Extreme Scale Computing (JLESC); the Spanish Ministry of Science and Innovation under the TIN2015–65316 grant; and the Generalitat de Catalunya under contract 2014– SGR–1051. This research received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 800144.
- Published
- 2020