
Overlapping Communication With Computation in Parameter Server for Scalable DL Training.

Authors :
Wang, Shaoqi
Pi, Aidi
Zhou, Xiaobo
Wang, Jun
Xu, Cheng-Zhong
Source :
IEEE Transactions on Parallel & Distributed Systems. Sep 2021, Vol. 32, Issue 9, p2144-2159. 16p.
Publication Year :
2021

Abstract

Scalability of distributed deep learning (DL) training with the parameter server (PS) architecture is often communication-constrained in large clusters. Recent efforts use a layer-by-layer strategy to overlap gradient communication with backward computation and thereby reduce the impact of the communication constraint on scalability. However, these approaches can introduce significant overhead in gradient communication, and they cannot be effectively applied to overlapping parameter communication with forward computation. In this article, we propose and develop iPart, a novel approach that partitions communication and computation at various partition sizes to overlap gradient communication with backward computation and parameter communication with forward computation. iPart formulates the partitioning decision as an optimization problem and solves it with a greedy algorithm to derive the communication and computation partitions. We implement iPart in the open-source DL framework BigDL and evaluate it with various DL workloads. Experimental results show that iPart improves the scalability of a 72-node cluster by up to 94 percent over the default PS and 52 percent over the layer-by-layer strategy. [ABSTRACT FROM AUTHOR]
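To make the partition-based overlap idea in the abstract concrete, the sketch below illustrates (under assumptions, not the authors' iPart algorithm) how layers could be greedily grouped into gradient partitions during backward computation so that each partition's push to the parameter server is hidden behind the backward work still remaining. The names Layer, comm_time, greedy_partitions, and max_partition_bytes, as well as the latency/bandwidth cost model, are hypothetical illustrations only.

```python
# Hypothetical sketch of partition-based communication/computation overlap;
# NOT the iPart implementation from the paper.
# Layers are traversed back-to-front during backward computation; once a
# partition of layers has finished, its gradients can be pushed to the
# parameter server while the remaining layers are still computing.

from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    backward_time: float   # seconds of backward computation for this layer
    grad_bytes: int        # size of this layer's gradients in bytes

def comm_time(num_bytes: int, bandwidth: float = 1e9, latency: float = 5e-4) -> float:
    """Simple latency + bandwidth model for pushing one partition (assumed)."""
    return latency + num_bytes / bandwidth

def greedy_partitions(layers: list[Layer], max_partition_bytes: int) -> list[list[Layer]]:
    """Greedily group consecutive layers (in backward order) into partitions.

    A partition is closed when its push could no longer be hidden behind the
    backward computation still remaining, or when it exceeds the size cap.
    """
    partitions, current, current_bytes = [], [], 0
    remaining_compute = sum(l.backward_time for l in layers)
    for layer in layers:
        remaining_compute -= layer.backward_time
        current.append(layer)
        current_bytes += layer.grad_bytes
        if comm_time(current_bytes) > remaining_compute or current_bytes >= max_partition_bytes:
            partitions.append(current)
            current, current_bytes = [], 0
    if current:
        partitions.append(current)
    return partitions

if __name__ == "__main__":
    # Toy model in backward order (last layer first).
    model = [Layer(f"layer{i}", backward_time=0.002, grad_bytes=4_000_000) for i in range(8)]
    for p in greedy_partitions(model, max_partition_bytes=16_000_000):
        print([l.name for l in p])
```

The same grouping idea applies symmetrically on the forward pass, where parameter pulls for later partitions can be overlapped with forward computation of earlier ones; the paper's optimization formulation chooses the partition sizes, whereas this sketch uses a fixed cost model for illustration.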

Details

Language :
English
ISSN :
1045-9219
Volume :
32
Issue :
9
Database :
Academic Search Index
Journal :
IEEE Transactions on Parallel & Distributed Systems
Publication Type :
Academic Journal
Accession number :
149418037
Full Text :
https://doi.org/10.1109/TPDS.2021.3062721