DLRover: An Elastic Deep Training Extension with Auto Job Resource Recommendation

Authors :
Wang, Qinlong
Sang, Bo
Zhang, Haitao
Tang, Mingjie
Zhang, Ke
Publication Year :
2023

Abstract

The cloud remains a popular platform for distributed deep learning (DL) training jobs, since resource sharing in the cloud can improve resource utilization and reduce overall costs. However, such sharing also brings multiple challenges for DL training jobs; e.g., high-priority jobs can impact, or even interrupt, low-priority jobs. Meanwhile, most existing distributed DL training systems require users to configure a job's resources (i.e., the number of nodes and the resources, such as CPU and memory, allocated to each node) manually before job submission, and cannot adjust the job's resources at runtime. The resource configuration of a job strongly affects its performance (e.g., training throughput, resource utilization, and completion rate). Manual configuration, however, usually leads to poor job performance, since users fail to provide an optimal resource configuration in most cases. DLRover is a distributed DL framework that can automatically configure a DL job's initial resources and dynamically tune them to achieve better performance. With its elastic capability, DLRover can effectively adjust the resources of a job when performance issues are detected or when the job fails because of faults or eviction. Evaluation results show that DLRover can outperform well-tuned manual resource configurations. Furthermore, in the production Kubernetes cluster of the authors' company, DLRover reduces the median job completion time by 31% and improves the job completion rate by 6%, CPU utilization by 15%, and memory utilization by 20% compared with manual configuration.

Comment: 10 pages; open-source system work

Details

Database :
OAIster
Publication Type :
Electronic Resource
Accession number :
edsoai.on1381615153
Document Type :
Electronic Resource