Deploying and scaling distributed parallel deep neural networks on the Tianhe-3 prototype system

Authors :
Jia Wei
Xingjun Zhang
Zeyu Ji
Jingbo Li
Zheng Wei
Source :
Scientific Reports, Vol 11, Iss 1, Pp 1-14 (2021)
Publication Year :
2021
Publisher :
Nature Portfolio, 2021.

Abstract

Growth in computing power has made it possible to improve the feature extraction and data fitting capabilities of deep neural networks (DNNs) by increasing their depth and model complexity. However, large datasets and complex models greatly increase DNN training overhead, so accelerating the training process becomes a key task. The Tianhe-3 is designed to reach E-class (exascale) peak speed, and this computing power provides a potential opportunity for DNN training. We implement and extend LeNet, AlexNet, VGG, and ResNet model training on single MT-2000+ and FT-2000+ compute nodes, as well as on multi-node clusters, and we propose a Dynamic Allreduce communication optimization strategy that improves the gradient synchronization process, based on the ARM architecture features of the Tianhe-3 prototype. The results provide experimental data and a theoretical basis for further improving the performance of the Tianhe-3 prototype in large-scale distributed training of neural networks.

Subjects

Subjects :
Medicine
Science

Details

Language :
English
ISSN :
20452322
Volume :
11
Issue :
1
Database :
Directory of Open Access Journals
Journal :
Scientific Reports
Publication Type :
Academic Journal
Accession number :
edsdoj.74465dea31954f998eb85e833885edfc
Document Type :
article
Full Text :
https://doi.org/10.1038/s41598-021-98794-z