
A Communication Efficient ADMM-based Distributed Algorithm Using Two-Dimensional Torus Grouping AllReduce

Authors :
Guozheng Wang
Yongmei Lei
Zeyu Zhang
Cunlu Peng
Source :
Data Science and Engineering, Vol 8, Iss 1, Pp 61-72 (2023)
Publication Year :
2023
Publisher :
SpringerOpen, 2023.

Abstract

Large-scale distributed training consists mainly of sub-model parallel training and parameter synchronization. As the number of training workers grows, the efficiency of parameter synchronization degrades. To tackle this problem, we first propose 2D-TGA, a grouping AllReduce method based on a two-dimensional torus topology, which synchronizes model parameters in groups and makes full use of the available bandwidth. Secondly, we propose a distributed algorithm, 2D-TGA-ADMM, which combines 2D-TGA with the alternating direction method of multipliers (ADMM); it focuses on sub-model training and reduces the waiting time among workers during synchronization. Finally, experimental results on the Tianhe-2 supercomputing platform show that, compared with MPI_Allreduce, 2D-TGA shortens the synchronization waiting time by 33%.
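To make the grouping idea concrete, below is a minimal sketch of a grouped AllReduce over a two-dimensional (r x c) process grid in the spirit of 2D-TGA: parameters are first reduced within row groups, then the row results are reduced within column groups. The mpi4py usage, the square grid shape, and the two-phase row-then-column reduction are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch: two-phase grouped AllReduce on an r x c process grid (assumed, not
# the paper's exact scheme). Run with, e.g., `mpiexec -n 16 python sketch.py`.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Assume the world size factors into an r x c grid (e.g., 4 x 4).
r = int(np.sqrt(size))
c = size // r
assert r * c == size, "world size must factor into an r x c grid"

row, col = divmod(rank, c)

# Split COMM_WORLD into row groups and column groups.
row_comm = comm.Split(color=row, key=col)   # workers sharing a row
col_comm = comm.Split(color=col, key=row)   # workers sharing a column

# Toy local model parameters.
local = np.full(8, float(rank))

# Phase 1: AllReduce within each row group.
row_sum = np.empty_like(local)
row_comm.Allreduce(local, row_sum, op=MPI.SUM)

# Phase 2: AllReduce the row results within each column group,
# which leaves the global sum on every worker.
global_sum = np.empty_like(local)
col_comm.Allreduce(row_sum, global_sum, op=MPI.SUM)

mean = global_sum / size   # synchronized (averaged) parameters
```

Because each phase involves only r or c workers rather than all r*c of them, the two smaller collectives can better exploit the bandwidth of a torus-shaped network than one monolithic AllReduce.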
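For context on the ADMM side, a standard global-consensus splitting (in the style of Boyd et al.) that such a synchronization scheme can serve is sketched below; the paper's exact sub-model formulation may differ. The z-update is the global averaging step that the grouped AllReduce carries out:

```latex
\begin{aligned}
x_i^{k+1} &= \operatorname*{arg\,min}_{x_i}\; f_i(x_i)
             + \frac{\rho}{2}\bigl\|x_i - z^k + u_i^k\bigr\|_2^2, \\
z^{k+1}   &= \frac{1}{N}\sum_{i=1}^{N}\bigl(x_i^{k+1} + u_i^k\bigr), \\
u_i^{k+1} &= u_i^k + x_i^{k+1} - z^{k+1},
\end{aligned}
```

where $f_i$ is worker $i$'s local loss, $\rho > 0$ is the penalty parameter, and $u_i$ is the scaled dual variable; only the z-update requires communication across workers.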

Details

Language :
English
ISSN :
2364-1185 and 2364-1541
Volume :
8
Issue :
1
Database :
Directory of Open Access Journals
Journal :
Data Science and Engineering
Publication Type :
Academic Journal
Accession number :
edsdoj.326f786c96534dbf82faf8489cae1202
Document Type :
article
Full Text :
https://doi.org/10.1007/s41019-022-00202-7