
TensorFlow at Scale: Performance and productivity analysis of distributed training with Horovod, MLSL, and Cray PE ML.

Authors :
Kurth, Thorsten
Smorkalov, Mikhail
Mendygral, Peter
Sridharan, Srinivas
Mathuriya, Amrita
Source :
Concurrency & Computation: Practice & Experience; 8/25/2019, Vol. 31 Issue 16
Publication Year :
2019

Abstract

Summary: Deep learning has proven to be a successful tool for solving a large variety of problems in various scientific fields and beyond. In recent years, models and the available datasets have grown larger and more complex, so an increasing amount of computing resources is required to train these models in a reasonable amount of time. Besides being able to use HPC resources, deep learning model developers want flexible frameworks that allow for rapid prototyping. One of the most important of these frameworks is Google TensorFlow, which provides both good performance and flexibility. In this paper, we discuss different solutions for scaling the TensorFlow framework to thousands of nodes on contemporary Cray XC supercomputing systems. [ABSTRACT FROM AUTHOR]
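
The abstract itself contains no code; as a point of reference, the sketch below shows how Horovod-style data-parallel training is commonly wired into TensorFlow (one MPI rank per node or GPU, gradients averaged across ranks via allreduce, initial weights broadcast from rank 0). The model, dataset, and hyperparameters here are placeholders and are not the configuration used in the paper.

    # Minimal sketch of Horovod data-parallel training with TensorFlow/Keras.
    # Hypothetical model and dataset; not the authors' exact setup.
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()  # one process per node/GPU, launched via mpirun/srun/horovodrun

    # Pin each process to a single GPU if GPUs are present (no-op on CPU-only nodes).
    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

    # Placeholder data and model.
    (x, y), _ = tf.keras.datasets.mnist.load_data()
    x = (x / 255.0).reshape(-1, 28 * 28).astype('float32')
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])

    # Scale the learning rate with the number of workers and wrap the optimizer
    # so gradients are allreduce-averaged across ranks every step.
    opt = tf.keras.optimizers.SGD(0.01 * hvd.size())
    opt = hvd.DistributedOptimizer(opt)
    model.compile(optimizer=opt, loss='sparse_categorical_crossentropy')

    callbacks = [
        # Broadcast initial weights from rank 0 so all workers start identically.
        hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    ]

    model.fit(x, y, batch_size=64, epochs=1,
              callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)

Such a script would typically be launched with one rank per node or accelerator, for example "mpirun -np 4 python train.py" or "horovodrun -np 4 python train.py"; the MLSL and Cray PE ML plugins discussed in the paper target the same data-parallel pattern with different communication back ends.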

Details

Language :
English
ISSN :
1532-0626
Volume :
31
Issue :
16
Database :
Complementary Index
Journal :
Concurrency & Computation: Practice & Experience
Publication Type :
Academic Journal
Accession number :
137639853
Full Text :
https://doi.org/10.1002/cpe.4989