FLSGD: free local SGD with parallel synchronization.
- Source :
- Journal of Supercomputing. Jul 2022, Vol. 78 Issue 10, p12410-12433. 24p.
- Publication Year :
- 2022
- Abstract :
- Synchronous parameter algorithms with data parallelism have been successfully used to accelerate the distributed training of deep neural networks (DNNs). However, a prevalent shortcoming of synchronous methods is wasted computation, which results from workers of differing performance waiting on one another and from the communication delay incurred at each synchronization. To alleviate this drawback, we propose a novel method, free local stochastic gradient descent (FLSGD) with parallel synchronization, to eliminate the waiting and communication overhead. Specifically, the process of distributed DNN training is first modeled as a pipeline assembled from three components: dataset partition, local SGD, and parameter updating. Then, a novel adaptive batch size and dataset partition method based on the computational performance of each node is employed to eliminate the waiting time by keeping the load of the distributed DNN training balanced. The local SGD and the parameter updating, including gradient synchronization, are parallelized to eliminate the communication cost by delaying gradients one step, and the resulting staleness problem is remedied by an appropriate approximation. To the best of our knowledge, this is the first work that addresses both load imbalance and communication overhead in distributed training. Extensive experiments are conducted with four state-of-the-art DNN models on two image classification datasets (i.e., CIFAR10 and CIFAR100) to demonstrate that FLSGD outperforms the synchronous methods. [ABSTRACT FROM AUTHOR]
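The load-balancing idea in the abstract (sizing each worker's batch and data shard according to its computational performance) can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not specify: the function name `partition_by_throughput`, the use of measured throughput in samples per second as the performance metric, and the largest-remainder rounding are all assumptions made for illustration.

```python
def partition_by_throughput(throughputs, global_batch, dataset_size):
    """Split a global batch and a dataset across workers in proportion
    to each worker's throughput (hypothetical metric: samples/second).

    Returns (batch_sizes, shard_sizes), one entry per worker.
    Illustrative sketch only; not the method from the paper.
    """
    total = sum(throughputs)
    if total <= 0 or any(t < 0 for t in throughputs):
        raise ValueError("throughputs must be non-negative with a positive sum")

    def split(amount):
        # Ideal (fractional) share for each worker.
        raw = [amount * t / total for t in throughputs]
        # Round down, then hand out the leftover units to the workers
        # with the largest fractional remainders so the sum is exact.
        shares = [int(r) for r in raw]
        leftover = amount - sum(shares)
        by_remainder = sorted(
            range(len(raw)), key=lambda i: raw[i] - shares[i], reverse=True
        )
        for i in by_remainder[:leftover]:
            shares[i] += 1
        return shares

    return split(global_batch), split(dataset_size)


if __name__ == "__main__":
    # Three workers; the second is assumed to be twice as fast as the others.
    batches, shards = partition_by_throughput([100, 200, 100], 256, 50000)
    print(batches, shards)
```

With shares proportional to throughput, each worker needs roughly the same wall-clock time per step, which is what removes the waiting at synchronization points. A real system would also have to re-measure throughput as it drifts during training.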
- Subjects :
- *ARTIFICIAL neural networks
- *SYNCHRONIZATION
- *DISTRIBUTED computing
- *TIMEKEEPING
Details
- Language :
- English
- ISSN :
- 09208542
- Volume :
- 78
- Issue :
- 10
- Database :
- Academic Search Index
- Journal :
- Journal of Supercomputing
- Publication Type :
- Academic Journal
- Accession number :
- 157415832
- Full Text :
- https://doi.org/10.1007/s11227-021-04267-5