
Accelerating Distributed Training With Collaborative In-Network Aggregation

Authors:
Fang, Jin
Xu, Hongli
Zhao, Gongming
Yu, Zhuolong
Shen, Bingchen
Xie, Liguang
Source:
IEEE/ACM Transactions on Networking; August 2024, Vol. 32, Issue 4, pp. 3437-3452, 16 pp.
Publication Year:
2024

Abstract

The surging scale of distributed training (DT) incurs significant communication overhead in datacenters; a promising solution is in-network aggregation (INA), which leverages programmable switches (e.g., Intel Tofino switches) to aggregate gradients and thereby accelerate DT tasks. Because switches have limited on-chip memory, existing solutions design memory-sharing mechanisms for INA. These mechanisms require gradients to arrive at switches synchronously, but network dynamics often cause gradients to arrive asynchronously, making existing solutions inefficient (e.g., incurring massive communication overhead). To address this issue, we propose GOAT, the first-of-its-kind work on gradient scheduling with collaborative in-network aggregation, which enables switches to efficiently aggregate asynchronously arriving gradients. Specifically, GOAT first partitions the model into a set of sub-models, then decides which sub-model gradients each switch is exclusively responsible for aggregating and to which switch each worker should send its sub-model gradients. To this end, we design an efficient knapsack-based randomized rounding algorithm and formally analyze its approximation performance. We implement GOAT and evaluate its performance on a testbed consisting of 3 Intel Tofino switches and 9 servers. Experimental results show that GOAT can speed up DT by 1.5× compared to the state-of-the-art solutions.
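To make the scheduling idea above concrete, the following is a minimal, hypothetical sketch, not the paper's actual GOAT algorithm or testbed code. It assigns sub-model gradient blocks to switches by randomized rounding of an assumed fractional assignment, subject to each switch's on-chip memory budget (the knapsack-style constraint the abstract refers to). All names, sizes, and the proportional fractional solution are illustrative assumptions.

import random

def fractional_assignment(sub_model_sizes, switch_capacities):
    # Placeholder for an LP relaxation: spread each sub-model across switches
    # in proportion to switch capacity. A real formulation would also account
    # for per-switch memory limits and worker-to-switch path costs.
    total_cap = sum(switch_capacities)
    return {
        (m, s): cap / total_cap
        for m in range(len(sub_model_sizes))
        for s, cap in enumerate(switch_capacities)
    }

def knapsack_randomized_rounding(sub_model_sizes, switch_capacities, frac, seed=0):
    # Each sub-model picks one switch with probability proportional to its
    # fractional share; picks that would exceed the switch's remaining memory
    # are discarded and resampled, so every accepted assignment respects the
    # per-switch memory budget.
    rng = random.Random(seed)
    remaining = list(switch_capacities)
    assignment = {}
    for m, size in enumerate(sub_model_sizes):
        candidates = list(range(len(switch_capacities)))
        while candidates:
            s = rng.choices(candidates, weights=[frac[(m, c)] for c in candidates])[0]
            if remaining[s] >= size:
                assignment[m] = s
                remaining[s] -= size
                break
            candidates.remove(s)
        else:
            raise RuntimeError(f"no switch can hold sub-model {m}")
    return assignment

if __name__ == "__main__":
    sub_models = [40, 25, 30, 15]   # hypothetical sub-model gradient sizes (register slots)
    switches = [60, 60, 60]         # hypothetical per-switch aggregator memory budgets
    frac = fractional_assignment(sub_models, switches)
    print(knapsack_randomized_rounding(sub_models, switches, frac))

In the system described by the abstract, each worker would then send a given sub-model's gradients to the switch assigned to it, and that switch would aggregate those gradients exclusively in its on-chip memory; the sketch only illustrates the assignment step.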

Details

Language:
English
ISSN:
1063-6692
Volume:
32
Issue:
4
Database:
Supplemental Index
Journal:
IEEE/ACM Transactions on Networking
Publication Type:
Periodical
Accession number:
ejs67220142
Full Text:
https://doi.org/10.1109/TNET.2024.3387948