A comparative performance study of Spark on Kubernetes.

Authors :: Zhu, Changpeng
Han, Bo
Zhao, Yinliang
Source :: Journal of Supercomputing. Jul 2022, Vol. 78 Issue 11, p13298-13322. 25p.
Publication Year :: 2022
Abstract: Kubernetes makes it easier to automate the deployment and scaling of containerized applications while achieving near-native performance. However, there is still a lack of systematic performance studies on how Spark applications perform on Kubernetes. In this paper, we first propose a model to capture the execution behavior of tasks, stages, and jobs, and present an implementation of a prototype system based on the model. The system is then used to collect and analyze various types of performance and system metrics, such as execution time and CPU utilization. Second, using various Spark applications, we evaluate the performance of Spark on Kubernetes by comparing it with its baseline, i.e., Spark on bare metal. Based on the comparison and leveraging the system, we locate which stages of these applications suffer performance loss on Kubernetes, and then reveal the root causes of the loss by analyzing their workflows, execution times, and system resource costs. Through extensive measurements, we find that Spark on Kubernetes falls behind its baseline by between -2.9% and 83.9%. There are several root causes of the performance loss and benefits of Spark on Kubernetes. First, the deterioration of data locality caused by pods is a crucial root cause of the loss. To address the problem, we propose an approach that schedules tasks by taking both data locality and the utilization of executors into account. Experiments show that this approach increases the performance of Spark on Kubernetes by up to 32.2%. Second, the lower CPU usage of executors is another root cause of the performance loss, even when they have an equivalent CPU configuration on both Kubernetes and bare metal. In contrast, with the same memory configuration, executors use more memory on Kubernetes than on bare metal, contributing to the performance benefit of Spark on Kubernetes in some stages.
Our research efforts in this paper can help developers and researchers make informed decisions when deploying Spark applications on Kubernetes for better performance. [ABSTRACT FROM AUTHOR]