
TensorFlow at Scale: Performance and productivity analysis of distributed training with Horovod, MLSL, and Cray PE ML.

Authors :
Kurth, Thorsten
Smorkalov, Mikhail
Mendygral, Peter
Sridharan, Srinivas
Mathuriya, Amrita
Source :
Concurrency & Computation: Practice & Experience; 8/25/2019, Vol. 31 Issue 16
Publication Year :
2019

Abstract

Summary: Deep learning has proven to be a successful tool for solving a large variety of problems in various scientific fields and beyond. In recent years, models and the available datasets have grown larger and more complex, so an increasing amount of computing resources is required to train these models in a reasonable amount of time. Besides being able to use HPC resources, deep learning model developers want flexible frameworks that allow for rapid prototyping. One of the most important of these frameworks is Google TensorFlow, which provides both good performance and flexibility. In this paper, we discuss different solutions for scaling the TensorFlow framework to thousands of nodes on contemporary Cray XC supercomputing systems. [ABSTRACT FROM AUTHOR]
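
The abstract itself contains no code; as a point of reference, the sketch below shows how Horovod-style data-parallel training is commonly wired into TensorFlow (one MPI rank per node or GPU, gradients averaged across ranks via allreduce, initial weights broadcast from rank 0). The model, dataset, and hyperparameters here are placeholders and are not the configuration used in the paper.

    # Minimal sketch of Horovod data-parallel training with TensorFlow/Keras.
    # Hypothetical model and dataset; not the authors' exact setup.
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()  # one process per node/GPU, launched via mpirun/srun/horovodrun

    # Pin each process to a single GPU if GPUs are present (no-op on CPU-only nodes).
    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

    # Placeholder data and model.
    (x, y), _ = tf.keras.datasets.mnist.load_data()
    x = (x / 255.0).reshape(-1, 28 * 28).astype('float32')
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])

    # Scale the learning rate with the number of workers and wrap the optimizer
    # so gradients are allreduce-averaged across ranks every step.
    opt = tf.keras.optimizers.SGD(0.01 * hvd.size())
    opt = hvd.DistributedOptimizer(opt)
    model.compile(optimizer=opt, loss='sparse_categorical_crossentropy')

    callbacks = [
        # Broadcast initial weights from rank 0 so all workers start identically.
        hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    ]

    model.fit(x, y, batch_size=64, epochs=1,
              callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)

Such a script would typically be launched with one rank per node or accelerator, for example "mpirun -np 4 python train.py" or "horovodrun -np 4 python train.py"; the MLSL and Cray PE ML plugins discussed in the paper target the same data-parallel pattern with different communication back ends.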

Details

Language :
English
ISSN :
1532-0626
Volume :
31
Issue :
16
Database :
Complementary Index
Journal :
Concurrency & Computation: Practice & Experience
Publication Type :
Academic Journal
Accession number :
137639853
Full Text :
https://doi.org/10.1002/cpe.4989