Impact of Synchronization Topology on DML Performance: Both Logical Topology and Physical Topology.

Authors :
Wang, Shuai
Geng, Jinkun
Li, Dan
Source :
IEEE/ACM Transactions on Networking; Apr2022, Vol. 30 Issue 2, p572-585, 14p
Publication Year :
2022

Abstract

To handle ever-larger training data and models, researchers and engineers resort to multiple servers in a data center for distributed machine learning (DML). On one hand, DML lets us leverage the computation power of multiple servers, which can effectively accelerate computation-intensive tasks. On the other hand, DML incurs significant communication cost due to parameter synchronization among these servers. In this paper, we explore the impact of synchronization topology, including both logical topology and physical topology, on DML performance. First, we revisit the existing logical topologies for parameter synchronization, e.g., parameter server and ring allreduce, and find that these flat synchronization topologies are inefficient for large-scale DML training. Therefore, we propose a hierarchical parameter synchronization topology, called HiPS, which achieves efficient parameter synchronization even at large scale. Then, we compare two representative physical network topologies, namely Fat-Tree and BCube. Based on our analyses, BCube has several advantages over Fat-Tree, e.g., higher bandwidth, better load balance, and lower hardware cost. The simulation results also show that BCube is more friendly to RDMA. Relying on the advantages of HiPS and BCube, the GST of “HiPS+BCube” is 12% to 70% lower than that of other combinations. Moreover, when the cluster size increases from 16 to 1024, the performance of “HiPS+BCube” drops by only 6.5%, while the performance of “Ring+BCube” drops by 44.6%. Hence, we believe “HiPS+BCube” is the optimal solution for DML at large scale. [ABSTRACT FROM AUTHOR]
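
As a rough illustration of why a flat topology can scale poorly, the sketch below counts sequential communication steps for a flat ring allreduce and for a generic two-level hierarchical allreduce. It is not the HiPS algorithm from the paper and does not model BCube or Fat-Tree. The function names, the two-level scheme, and the group size of 32 are assumptions chosen for illustration, and step count is only a crude proxy for synchronization latency.

```python
def ring_allreduce_steps(n: int) -> int:
    """Sequential steps of a flat ring allreduce over n workers:
    (n - 1) reduce-scatter steps followed by (n - 1) allgather steps."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 2 * (n - 1)


def two_level_ring_steps(n: int, group_size: int) -> int:
    """Sequential steps of a hypothetical two-level scheme (an assumption,
    not HiPS): reduce-scatter inside each group, ring allreduce across
    groups on each shard, then allgather inside each group."""
    if group_size < 1 or n < group_size or n % group_size != 0:
        raise ValueError("group_size must be positive and divide n evenly")
    groups = n // group_size
    intra_group = 2 * (group_size - 1)  # reduce-scatter + allgather in a group
    inter_group = 2 * (groups - 1)      # ring allreduce across the groups
    return intra_group + inter_group


if __name__ == "__main__":
    # Cluster sizes taken from the abstract; group sizes are illustrative.
    for n, g in [(16, 4), (1024, 32)]:
        flat = ring_allreduce_steps(n)
        hier = two_level_ring_steps(n, g)
        print(f"n={n}: flat ring {flat} steps, two-level (g={g}) {hier} steps")
```

Under these assumptions the flat ring's step count grows linearly with the number of workers, while the two-level count grows roughly with the square root when the group size is chosen near the square root of the cluster size. The percentages reported in the abstract come from the authors' own analysis and simulation, not from this model.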

Details

Language :
English
ISSN :
1063-6692
Volume :
30
Issue :
2
Database :
Complementary Index
Journal :
IEEE/ACM Transactions on Networking
Publication Type :
Academic Journal
Accession number :
156342407
Full Text :
https://doi.org/10.1109/TNET.2021.3117042