1. Boosting HPC data analysis performance with the ParSoDA-Py library.
- Author
- Belcastro, Loris; Giampà, Salvatore; Marozzo, Fabrizio; Talia, Domenico; Trunfio, Paolo; Badia, Rosa M.; Ejarque, Jorge; Mammadli, Nihad
- Subjects
- *DATA analysis, *PYTHON programming language, *DATA mining, *DATA libraries, *LIBRARY technical services, *HIGH performance computing, *BIG data
- Abstract
- Developing and executing large-scale data analysis applications in parallel and distributed environments can be a complex and time-consuming task. Developers are often diverted from their application logic to handle technical details of the underlying runtime and related issues. To simplify this process, ParSoDA, a Java library, has been proposed to facilitate the development of parallel data mining applications executed on HPC systems. It provides built-in scalability mechanisms that rely on the Hadoop and Spark frameworks. This paper presents ParSoDA-Py, the Python version of the ParSoDA library, which extends support to commonly used runtimes and libraries for big data analysis. After a complete library redesign, ParSoDA can now be easily integrated with other Python-based distributed runtimes for HPC systems, such as COMPSs and Apache Spark, and with the large ecosystem of Python-based data processing libraries. The paper discusses the adaptation process, which takes the new technical requirements into consideration, and evaluates both usability and scalability through case study applications. [ABSTRACT FROM AUTHOR]
- Published
- 2024