A lightweight performance proxy for deep‐learning model training on Amazon SageMaker.

Authors :
Keller Tesser, Rafael
Marques, Alvaro
Borin, Edson
Source :
Concurrency & Computation: Practice & Experience; 6/25/2024, Vol. 36 Issue 14, p1-22, 22p
Publication Year :
2024

Abstract

Summary: Cloud computing has become popular for training deep‐learning (DL) models because it avoids the costs of acquiring and maintaining on‐premise systems. SageMaker is a cloud service that automates the execution of DL workloads. Its features include automatic hyperparameter optimization and the use of spot instances. Nonetheless, it does not assist in selecting the right instance type for a workload. In public clouds, the rental price depends on the configuration of the chosen instance type. More advanced, faster instances are typically more expensive, but they are not always the best choice. To select the optimal instance type, users must compare the workload's relative performance (and hence cost) on several candidates. Building on the execution profiles of multiple DL applications, we model the performance and cost of training DL applications on SageMaker and propose a lightweight technique to estimate both at low temporal and monetary cost. This method is a performance proxy that can replace more expensive performance measurement procedures, so it could speed up any technique that relies on such measurements. We show how it can help cloud customers seeking suitable instance types to train DL models, and that it can accurately predict the performance of different instance types when training these models on SageMaker. [ABSTRACT FROM AUTHOR]
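
The cost comparison the abstract describes can be illustrated with a minimal sketch. It is not the authors' method: it only assumes that training cost is roughly the per-hour price multiplied by total training time, and that a short profiling run yields a per-epoch time for each candidate. The instance names, prices, and timings below are hypothetical placeholders, not real SageMaker instance types or measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A candidate instance type (all values are hypothetical placeholders)."""
    name: str
    price_per_hour: float     # assumed rental price, in USD per hour
    seconds_per_epoch: float  # assumed per-epoch time from a short profiling run


def estimated_cost(candidate: Candidate, epochs: int) -> float:
    """Estimate training cost as (hours of training) x (price per hour)."""
    hours = candidate.seconds_per_epoch * epochs / 3600.0
    return hours * candidate.price_per_hour


def cheapest(candidates: list[Candidate], epochs: int) -> Candidate:
    """Return the candidate with the lowest estimated training cost."""
    return min(candidates, key=lambda c: estimated_cost(c, epochs))


def fastest(candidates: list[Candidate]) -> Candidate:
    """Return the candidate with the shortest per-epoch time."""
    return min(candidates, key=lambda c: c.seconds_per_epoch)


# Made-up numbers chosen so that the fastest option is not the cheapest one.
candidates = [
    Candidate("small", price_per_hour=1.0, seconds_per_epoch=360.0),
    Candidate("medium", price_per_hour=3.0, seconds_per_epoch=90.0),
    Candidate("large", price_per_hour=12.0, seconds_per_epoch=45.0),
]

if __name__ == "__main__":
    epochs = 10
    for c in candidates:
        print(f"{c.name}: estimated cost {estimated_cost(c, epochs):.2f} USD")
    print("fastest:", fastest(candidates).name)
    print("cheapest:", cheapest(candidates, epochs).name)
```

With these placeholder values the fastest candidate costs the most per training run, while a mid-range candidate is cheapest, which is the situation the abstract warns about when it notes that faster instances are not always the best choice.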

Details

Language :
English
ISSN :
15320626
Volume :
36
Issue :
14
Database :
Complementary Index
Journal :
Concurrency & Computation: Practice & Experience
Publication Type :
Academic Journal
Accession number :
177418729
Full Text :
https://doi.org/10.1002/cpe.8104