Efficient Performance Prediction for Apache Spark.

Authors :: Cheng, Guoli
Ying, Shi
Wang, Bingming
Li, Yuhang
Source :: Journal of Parallel & Distributed Computing. Mar2021, Vol. 149, p40-51. 12p.
Publication Year :: 2021
Abstract: Spark is a distributed big data processing framework that followed Hadoop and is more efficient. It provides users with more than 180 adjustable configuration parameters, and automatically choosing the optimal configuration so that a Spark application runs effectively is challenging. The key to addressing this challenge is the ability to predict the performance of Spark applications under different configurations. This paper proposes a new approach based on Adaboost that can efficiently and accurately predict the performance of a given application with a given Spark configuration. In our approach, Adaboost is used to build a set of stage-level performance models for Spark. To minimize the modeling overhead, we use classic projective sampling, a data mining technique that allows us to collect as few training samples as possible while still meeting the accuracy requirements. We evaluate the proposed approach on six typical Spark benchmarks with five input datasets. The experimental results show that our approach achieves lower prediction error and lower cost than the previously proposed approach.
• We use the Adaboost algorithm to build a set of stage-level performance models for Spark prediction.
• We use projective sampling to minimize the time and resources spent on modeling.
• The experimental results show that our approach achieves lower prediction error and lower cost than the previously proposed approach.
[ABSTRACT FROM AUTHOR]
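The abstract does not say how the projective sampling step is carried out. As a rough illustration only, the sketch below shows one common form of the idea: measure model error at a few small training-set sizes, fit a learning curve to those points, and project the smallest sample size expected to meet a target error. The power-law curve shape, the function names, and the parameters are all assumptions made for this example and are not taken from the paper.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of error = a * n**b in log-log space.

    Assumed learning-curve form; the paper may use a different curve.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope in log-log space
    a = math.exp(mean_y - b * mean_x)  # intercept mapped back to linear space
    return a, b


def projected_sample_size(a, b, target_error):
    """Smallest n with a * n**b <= target_error; requires b < 0."""
    if b >= 0:
        raise ValueError("fitted error does not decrease with sample size")
    return math.ceil((target_error / a) ** (1.0 / b))
```

In use, `sizes` would hold a handful of small training-set sizes and `errors` the prediction error measured at each; the projected size then indicates roughly how many Spark runs to collect before training the final models.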