Scalable random forest with data-parallel computing

Authors :
Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors
Barcelona Supercomputing Center
Vázquez-Novoa, Fernando
Conejero Bañón, Francisco Javier
Tatu, Cristian
Badia Sala, Rosa Maria
Publication Year :
2023

Abstract

In recent years, the quantity of available data and computational resources has grown significantly, leading the scientific and industrial communities to pursue more accurate and efficient Machine Learning (ML) models. Random Forest is a well-known ML algorithm because of the good results it obtains on a wide range of problems. Our objective is to create a parallel version of the algorithm that can build a model from data distributed across different processors and that scales computationally with the available resources. This paper presents two novel data-parallel proposals for this algorithm. The first is implemented using the PyCOMPSs framework and its failure management mechanism, while the second uses the new PyCOMPSs nesting paradigm, in which parallel tasks can generate other tasks within them. Both approaches are compared with each other and against the Apache Spark MLlib Random Forest in strong and weak scaling tests. Our findings indicate that, while the MLlib implementation is faster when executed on a small number of nodes, both new variants scale better.
We conclude that the proposed data-parallel approaches to the Random Forest algorithm can effectively generate accurate and efficient models in a distributed computing environment and offer improved scalability over existing methods.
This work has been supported by the Spanish Government (PID2019-107255GB) and by the MCIN/AEI/10.13039/501100011033 (CEX2021-001148-S), by the Departament de Recerca i Universitats de la Generalitat de Catalunya to the Research Group MPiEDist (2021 SGR 00412), and by the European Commission’s Horizon 2020 Framework program and the European High-Performance Computing Joint Undertaking (JU) under grant agreement No 955558 and by MCIN/AEI/10.13039/501100011033 and the European Union NextGenerationEU/PRTR (PCI2021-121957, project eFlows4HPC), and by the European Commission through the Horizon Europe Research and Innovation program under Grant Agreement No. 101016577 (AI-Sprint project).
Peer Reviewed
Postprint (author's final draft)
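
The abstract does not detail how the data-parallel variants choose tree splits, so the following is only a generic illustration of the data-parallel idea, not the authors' implementation and not the PyCOMPSs API. It assumes a single numeric feature, a fixed list of candidate thresholds, and Gini impurity as the split criterion; all function names are hypothetical. Each data partition computes its own class-count histogram, and only these small histograms are merged to choose a split, so no partition ever needs to hold the full dataset.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Row = Tuple[float, int]                      # (feature value, class label)
Histogram = Dict[float, Tuple[Counter, Counter]]  # threshold -> (left, right) counts


def local_histogram(partition: Sequence[Row], thresholds: Sequence[float]) -> Histogram:
    """Count class labels on each side of every threshold, for one partition only."""
    hist: Histogram = {t: (Counter(), Counter()) for t in thresholds}
    for value, label in partition:
        for t in thresholds:
            side = 0 if value <= t else 1
            hist[t][side][label] += 1
    return hist


def merge_histograms(histograms: Sequence[Histogram], thresholds: Sequence[float]) -> Histogram:
    """Sum the per-partition counts; this is the only data exchanged between workers."""
    merged: Histogram = {t: (Counter(), Counter()) for t in thresholds}
    for h in histograms:
        for t in thresholds:
            merged[t][0].update(h[t][0])  # Counter.update adds counts
            merged[t][1].update(h[t][1])
    return merged


def gini(counts: Counter) -> float:
    """Gini impurity of a set described by its class counts."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_threshold(merged: Histogram) -> float:
    """Pick the threshold with the lowest size-weighted Gini impurity."""
    def weighted_impurity(t: float) -> float:
        left, right = merged[t]
        n_left, n_right = sum(left.values()), sum(right.values())
        total = n_left + n_right
        if total == 0:
            return float("inf")
        return (n_left * gini(left) + n_right * gini(right)) / total

    return min(merged, key=weighted_impurity)


# Toy data split across two partitions (stand-ins for two processors).
partitions: List[List[Row]] = [[(1.0, 0), (2.0, 0)], [(3.0, 1), (4.0, 1)]]
thresholds = [1.5, 2.5, 3.5]

# In a real system each local_histogram call would run as a parallel task.
merged = merge_histograms([local_histogram(p, thresholds) for p in partitions], thresholds)
print(best_threshold(merged))  # 2.5
```

Because the merged histogram equals the histogram of the concatenated data, the chosen split matches what a single-node computation over the same thresholds would select; only the counting work is distributed.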

Details

Database :
OAIster
Notes :
14 p., application/pdf, English
Publication Type :
Electronic Resource
Accession number :
edsoai.on1397549663
Document Type :
Electronic Resource