
λDNN: Achieving Predictable Distributed DNN Training With Serverless Architectures

Authors :
Li Chen
Fei Xu
Zhi Zhou
Yiling Qin
Fangming Liu
Source :
IEEE Transactions on Computers. 71:450-463
Publication Year :
2022
Publisher :
Institute of Electrical and Electronics Engineers (IEEE), 2022.

Abstract

Serverless computing is becoming a promising paradigm for Distributed Deep Neural Network (DDNN) training in the cloud, as it allows users to decompose complex model training into a number of functions without managing virtual machines or servers. Though provided with a simpler resource interface (i.e., function number and memory size), inadequate function resource provisioning (either under-provisioning or over-provisioning) easily leads to unpredictable DDNN training performance on serverless platforms. Our empirical studies on AWS Lambda indicate that such unpredictable performance of serverless DDNN training is mainly caused by the resource bottleneck of Parameter Servers (PSs) and small local batch sizes. In this paper, we design and implement λDNN, a cost-efficient function resource provisioning framework that provides predictable performance for serverless DDNN training workloads while saving the budget of provisioned functions. Leveraging the PS network bandwidth and function CPU utilization, we build a lightweight analytical DDNN training performance model to enable our design of the λDNN resource provisioning strategy, so as to guarantee DDNN training performance with serverless functions. Extensive prototype experiments on AWS Lambda and complementary trace-driven simulations demonstrate that λDNN delivers predictable DDNN training performance and saves the monetary cost of function resources by up to 66.7% compared with state-of-the-art resource provisioning strategies, with an acceptable runtime overhead.
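The abstract's core trade-off (more functions shrink per-function compute but inflate traffic through the PS bottleneck) can be illustrated with a minimal sketch. This is not the paper's actual analytical model; the cost formula, parameter names, and all numbers below are illustrative assumptions only.

```python
# Hypothetical sketch (NOT lambda-DNN's actual model): per-iteration DDNN
# training time under a parameter-server (PS) architecture, split into a
# compute term and a PS-bandwidth-bound communication term.

def iteration_time(n_functions, local_batch, t_sample, model_mb, ps_bw_mbps):
    """All parameters are illustrative assumptions.

    n_functions : number of serverless worker functions
    local_batch : per-function mini-batch size
    t_sample    : compute time per training sample on one function (s)
    model_mb    : gradient/parameter size exchanged per iteration (MB)
    ps_bw_mbps  : aggregate PS network bandwidth (MB/s)
    """
    compute = local_batch * t_sample  # forward/backward pass on one function
    # Each function pushes gradients and pulls updated parameters through
    # the PS, so total PS traffic scales with the number of functions.
    comm = 2 * model_mb * n_functions / ps_bw_mbps
    return compute + comm

# For a fixed global batch, adding functions reduces local batch size (and
# compute time) but increases PS traffic -- the provisioning trade-off the
# abstract describes.
global_batch = 1024
for n in (8, 16, 32):
    t = iteration_time(n, global_batch // n, 1e-3, 100.0, 1000.0)
    print(n, round(t, 3))
```

With these made-up constants the communication term dominates as functions are added, which is consistent with the abstract's observation that the PS resource bottleneck (not worker compute) limits predictability.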

Details

ISSN :
2326-3814 and 0018-9340
Volume :
71
Database :
OpenAIRE
Journal :
IEEE Transactions on Computers
Accession number :
edsair.doi...........6e80509ecdab35c184c2bfca4f010b4c
Full Text :
https://doi.org/10.1109/tc.2021.3054656