Accelerating Distributed Training With Collaborative In-Network Aggregation
- Source :
- IEEE/ACM Transactions on Networking; August 2024, Vol. 32 Issue: 4 p3437-3452, 16p
- Publication Year :
- 2024
Abstract
- The surging scale of distributed training (DT) incurs significant communication overhead in datacenters, while a promising solution is in-network aggregation (INA). It leverages programmable switches (e.g., Intel Tofino switches) for gradient aggregation to accelerate DT tasks. Due to switches’ limited on-chip memory size, existing solutions try to design the memory sharing mechanism for INA. This mechanism requires gradients to arrive at switches synchronously, while network dynamics make it common for the asynchronous arrival of gradients, resulting in existing solutions being inefficient (e.g., massive communication overhead). To address this issue, we propose GOAT, the first-of-its-kind work on gradient scheduling with collaborative in-network aggregation, so that switches can efficiently aggregate asynchronously arriving gradients. Specifically, GOAT first partitions the model into a set of sub-models, then decides which sub-model gradients each switch is responsible for aggregating exclusively and to which switch each worker should send its sub-model gradients. To this end, we design an efficient knapsack-based randomized rounding algorithm and formally analyze the approximation performance. We implement GOAT and evaluate its performance on a testbed consisting of 3 Intel Tofino switches and 9 servers. Experimental results show that GOAT can speed up the DT by $1.5\times$ compared to the state-of-the-art solutions.
Details
- Language :
- English
- ISSN :
- 10636692
- Volume :
- 32
- Issue :
- 4
- Database :
- Supplemental Index
- Journal :
- IEEE/ACM Transactions on Networking
- Publication Type :
- Periodical
- Accession number :
- ejs67220142
- Full Text :
- https://doi.org/10.1109/TNET.2024.3387948