
RAT - Resilient Allreduce Tree for Distributed Machine Learning

Authors:
Shuihai Hu
Junxue Zhang
Hong Zhang
Xinchen Wan
Hao Wang
Kai Chen
Source:
APNet
Publication Year:
2020
Publisher:
ACM, 2020.

Abstract

Parameter/gradient exchange plays an important role in large-scale distributed machine learning (DML). However, prior solutions such as parameter server (PS) or ring-allreduce (Ring) fall short because they are not resilient to issues such as oversubscription, congestion, or failures that may occur in datacenter networks (DCNs). This paper proposes RAT, a new solution that determines the communication pattern for DML. At its heart, RAT establishes allreduce trees that take the physical topology and its oversubscription condition into account. The allreduce trees specify the aggregation pattern: each aggregator is responsible for aggregating gradients from all workers within an oversubscribed region in the reduce phase, and for broadcasting the updates back to those workers in the broadcast phase. We show that this approach effectively reduces cross-region traffic and shortens the dependency chain compared to prior solutions. We evaluated RAT in both oversubscribed networks and networks with failures, and found that it is resilient to these conditions. For example, it delivers an average speedup of 25X over PS in oversubscribed networks and of 5.7X over Ring in networks with failures.
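
As a rough illustration of the aggregation pattern the abstract describes, here is a minimal Python sketch of a one-level tree allreduce: one aggregator per region sums its workers' gradients, a root combines the per-region partial sums, and the result is broadcast back down. All names (the region keys, tree_allreduce, vector_sum) and the in-memory model are illustrative assumptions, not the paper's implementation.

from typing import Dict, List

Gradient = List[float]

def vector_sum(vectors: List[Gradient]) -> Gradient:
    # Elementwise sum of equal-length gradient vectors.
    return [sum(components) for components in zip(*vectors)]

def tree_allreduce(regions: Dict[str, List[Gradient]]) -> Gradient:
    # Reduce phase, stage 1: intra-region aggregation, so only one
    # partial sum crosses the oversubscribed link out of each region.
    partial_sums = {name: vector_sum(grads) for name, grads in regions.items()}
    # Reduce phase, stage 2: a root combines one partial sum per region.
    global_sum = vector_sum(list(partial_sums.values()))
    # Broadcast phase: the aggregated update travels back down the same
    # tree, again crossing each inter-region link exactly once.
    return global_sum

# Toy example (hypothetical data): two regions, two workers each.
regions = {
    "region_a": [[1.0, 2.0, 3.0], [1.0, 1.0, 1.0]],
    "region_b": [[0.5, 0.5, 0.5], [2.0, 0.0, 1.0]],
}
print(tree_allreduce(regions))  # -> [4.5, 3.5, 5.5]

In an actual deployment the two phases are network transfers rather than local sums; the point of the tree is that each inter-region link carries one aggregated message instead of one message per worker, which is what reduces cross-region traffic and shortens the dependency chain relative to a flat PS or a long Ring.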

Details

Database:
OpenAIRE
Journal:
4th Asia-Pacific Workshop on Networking
Accession number:
edsair.doi...........5ae7368c305b30ec0391bc139d5a97bd
Full Text:
https://doi.org/10.1145/3411029.3411037