TensorFlow at Scale: Performance and productivity analysis of distributed training with Horovod, MLSL, and Cray PE ML.
- Source :
- Concurrency & Computation: Practice & Experience; 8/25/2019, Vol. 31 Issue 16
- Publication Year :
- 2019
-
Abstract
- Summary: Deep learning has proven to be a successful tool for solving a large variety of problems in various scientific fields and beyond. In recent years, the models as well as the available datasets have grown bigger and more complicated, and thus an increasing amount of computing resources is required to train these models in a reasonable amount of time. Besides being able to use HPC resources, deep learning model developers want flexible frameworks that allow for rapid prototyping. One of the most important of these frameworks is Google TensorFlow, which provides both: good performance and flexibility. In this paper, we discuss different solutions for scaling the TensorFlow framework to thousands of nodes on contemporary Cray XC supercomputing systems. [ABSTRACT FROM AUTHOR]
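The scaling solutions the abstract names (Horovod, MLSL, Cray PE ML) all rest on a data-parallel pattern: each worker computes gradients on its own mini-batch, and a collective allreduce averages them before the weight update. Horovod in particular popularized the ring-allreduce variant. The following is a minimal pure-Python sketch of that ring collective, simulating the per-step communication across workers; the function name, worker count, and data are illustrative assumptions, not code from the paper:

```python
def ring_allreduce(worker_grads):
    """Simulate ring-allreduce averaging of gradients across N workers.

    worker_grads: list of N equal-length gradient vectors (plain lists);
    the vector length must be divisible into N chunks.
    Returns each worker's buffer after the collective (all identical:
    the element-wise average).
    """
    n = len(worker_grads)
    length = len(worker_grads[0])
    assert length % n == 0, "illustrative sketch: length must divide evenly"
    chunk = length // n
    bufs = [list(g) for g in worker_grads]  # each worker's mutable copy

    def seg(c):  # slice covering chunk c
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. At step s, worker i sends chunk (i - s) mod n
    # to its ring neighbor, which accumulates it. After n-1 steps, worker i
    # holds the fully reduced chunk (i + 1) mod n.
    for s in range(n - 1):
        sends = [(i, (i - s) % n, bufs[i][seg((i - s) % n)]) for i in range(n)]
        for i, c, data in sends:
            dst = (i + 1) % n
            bufs[dst][seg(c)] = [a + b for a, b in zip(bufs[dst][seg(c)], data)]

    # Phase 2: allgather. Each worker forwards its newest fully reduced
    # chunk around the ring until every worker has every chunk.
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n, bufs[i][seg((i + 1 - s) % n)]) for i in range(n)]
        for i, c, data in sends:
            bufs[(i + 1) % n][seg(c)] = data

    # Divide by n to turn the summed gradients into an average.
    return [[v / n for v in b] for b in bufs]


# Example: 3 workers, constant gradients 1, 2, 3 -> average is 2 everywhere.
result = ring_allreduce([[1.0] * 6, [2.0] * 6, [3.0] * 6])
```

Each of the 2(n-1) steps moves only 1/n of the gradient per worker, which is why the ring variant keeps per-node bandwidth constant as the node count grows.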
Details
- Language :
- English
- ISSN :
- 1532-0626
- Volume :
- 31
- Issue :
- 16
- Database :
- Complementary Index
- Journal :
- Concurrency & Computation: Practice & Experience
- Publication Type :
- Academic Journal
- Accession number :
- 137639853
- Full Text :
- https://doi.org/10.1002/cpe.4989