
A Communication Efficient ADMM-based Distributed Algorithm Using Two-Dimensional Torus Grouping AllReduce

Authors :
Guozheng Wang
Yongmei Lei
Zeyu Zhang
Cunlu Peng
Source :
Data Science and Engineering, Vol 8, Iss 1, Pp 61-72 (2023)
Publication Year :
2023
Publisher :
SpringerOpen, 2023.

Abstract

Large-scale distributed training consists mainly of sub-model parallel training and parameter synchronization. As the number of training workers grows, the efficiency of parameter synchronization degrades. To tackle this problem, we first propose 2D-TGA, a grouping AllReduce method based on a two-dimensional torus topology, which synchronizes model parameters in groups and makes full use of the available bandwidth. Secondly, we propose a distributed algorithm, 2D-TGA-ADMM, which combines 2D-TGA with the alternating direction method of multipliers (ADMM); it focuses on sub-model training and reduces the waiting time among workers during synchronization. Finally, experimental results on the Tianhe-2 supercomputing platform show that, compared with MPI_Allreduce, 2D-TGA shortens the synchronization waiting time by 33%.
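To make the grouping idea concrete, below is a minimal sketch of a grouped AllReduce over a two-dimensional (r x c) process grid in the spirit of 2D-TGA: parameters are first reduced within row groups, then the row results are reduced within column groups. The mpi4py usage, the square grid shape, and the two-phase row-then-column reduction are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch: two-phase grouped AllReduce on an r x c process grid (assumed, not
# the paper's exact scheme). Run with, e.g., `mpiexec -n 16 python sketch.py`.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Assume the world size factors into an r x c grid (e.g., 4 x 4).
r = int(np.sqrt(size))
c = size // r
assert r * c == size, "world size must factor into an r x c grid"

row, col = divmod(rank, c)

# Split COMM_WORLD into row groups and column groups.
row_comm = comm.Split(color=row, key=col)   # workers sharing a row
col_comm = comm.Split(color=col, key=row)   # workers sharing a column

# Toy local model parameters.
local = np.full(8, float(rank))

# Phase 1: AllReduce within each row group.
row_sum = np.empty_like(local)
row_comm.Allreduce(local, row_sum, op=MPI.SUM)

# Phase 2: AllReduce the row results within each column group,
# which leaves the global sum on every worker.
global_sum = np.empty_like(local)
col_comm.Allreduce(row_sum, global_sum, op=MPI.SUM)

mean = global_sum / size   # synchronized (averaged) parameters
```

Because each phase involves only r or c workers rather than all r*c of them, the two smaller collectives can better exploit the bandwidth of a torus-shaped network than one monolithic AllReduce.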
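For context on the ADMM side, a standard global-consensus splitting (in the style of Boyd et al.) that such a synchronization scheme can serve is sketched below; the paper's exact sub-model formulation may differ. The z-update is the global averaging step that the grouped AllReduce carries out:

```latex
\begin{aligned}
x_i^{k+1} &= \operatorname*{arg\,min}_{x_i}\; f_i(x_i)
             + \frac{\rho}{2}\bigl\|x_i - z^k + u_i^k\bigr\|_2^2, \\
z^{k+1}   &= \frac{1}{N}\sum_{i=1}^{N}\bigl(x_i^{k+1} + u_i^k\bigr), \\
u_i^{k+1} &= u_i^k + x_i^{k+1} - z^{k+1},
\end{aligned}
```

where $f_i$ is worker $i$'s local loss, $\rho > 0$ is the penalty parameter, and $u_i$ is the scaled dual variable; only the z-update requires communication across workers.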

Details

Language :
English
ISSN :
2364-1185 and 2364-1541
Volume :
8
Issue :
1
Database :
Directory of Open Access Journals
Journal :
Data Science and Engineering
Publication Type :
Academic Journal
Accession number :
edsdoj.326f786c96534dbf82faf8489cae1202
Document Type :
article
Full Text :
https://doi.org/10.1007/s41019-022-00202-7