1. 3D Coded SUMMA: Communication-Efficient and Robust Parallel Matrix Multiplication
- Author
-
Viveck R. Cadambe, Vipul Gupta, Tze Meng Low, Yaoqing Yang, Pulkit Grover, Kannan Ramchandran, Christian Engelmann, and Haewon Jeong
- Subjects
Computer science ,Computation ,0202 electrical engineering, electronic engineering, information engineering ,Redundancy (engineering) ,020206 networking & telecommunications ,010103 numerical & computational mathematics ,02 engineering and technology ,Parallel computing ,0101 mathematics ,Error detection and correction ,01 natural sciences ,Matrix multiplication - Abstract
In this paper, we propose a novel fault-tolerant parallel matrix multiplication algorithm called 3D Coded SUMMA that achieves higher failure-tolerance than replication-based schemes for the same amount of redundancy. This work bridges the gap between recent developments in coded computing and fault-tolerance in high-performance computing (HPC). The core idea of coded computing is the same as algorithm-based fault-tolerance (ABFT), which is weaving redundancy in the computation using error-correcting codes. In particular, we show that MatDot codes, an innovative code construction for parallel matrix multiplications, can be integrated into three-dimensional SUMMA (Scalable Universal Matrix Multiplication Algorithm [30]) in a communication-avoiding manner. To tolerate any two node failures, the proposed 3D Coded SUMMA requires \(\sim \)50% less redundancy than replication, while the overhead in execution time is only about 5–10%.
- Published
- 2020