1. Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies
- Author
- Gittens, Alex; Devarakonda, Aditya; Racah, Evan; Ringenburg, Michael; Gerhardt, Lisa; Kottalam, Jey; Liu, Jialin; Maschhoff, Kristyn; Canon, Shane; Chhugani, Jatin; Sharma, Pramod; Yang, Jiyan; Demmel, James; Harrell, Jim; Krishnamurthy, Venkat; Mahoney, Michael W.; and Prabhat, Mr
- Abstract
We explore the trade-offs of performing linear algebra using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks. We examine three widely used and important matrix factorizations: NMF (for physical plausibility), PCA (for its ubiquity) and CX (for data interpretability). We apply these methods to TB-sized problems in particle physics, climate modeling and bioimaging. The data matrices are tall and skinny, which enables the algorithms to map conveniently onto Spark's data-parallel model. We perform scaling experiments on up to 1600 Cray XC40 nodes, describe the sources of slowdowns, and provide tuning guidance to obtain high performance.
- Published
- 2016