Screening hardware and volume factors in distributed machine learning algorithms on Spark

Authors :
Germano C. Vasconcelos
Jairson B. Rodrigues
Paulo Maciel
Source :
Computing. 103:2203-2225
Publication Year :
2021
Publisher :
Springer Science and Business Media LLC, 2021.

Abstract

This paper presents an approach for investigating distributed machine learning workloads on Spark. The work analyzes how hardware and data volume factors affect time and cost performance when machine learning (ML) techniques are applied in big data scenarios. The method is based on the Design of Experiments (DoE) approach and applies a randomized two-level fractional factorial design with replications to screen the most relevant factors. A Web corpus was built from 16 million webpages from Portuguese-speaking countries. The application was a binary text classification task to distinguish Brazilian Portuguese from other variants. Five machine learning algorithms were examined: Logistic Regression, Random Forest, Support Vector Machines, Naive Bayes, and Multilayer Perceptron. The data was processed on real clusters of up to 28 nodes, each with 12 or 32 cores, 1 or 7 SSD disks, and 3x or 6x RAM per core, totaling a maximum computational power of 896 cores and 5.25 TB of RAM. Linear models were applied to identify, analyze, and rank the influence of the factors. A total of 240 experiments were organized to maximize the detection of non-confounded effects up to the second order while minimizing experimental effort. Our results include linear models to estimate time and cost performance, statistical inferences about effects, and a visualization tool based on parallel coordinates to aid decision-making about cluster configuration.
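
As an illustration of the screening method named in the abstract, the following is a minimal sketch of a randomized, replicated two-level fractional factorial design. It is not the authors' experimental plan: the factor names, the generator (volume = nodes x cores x disks x ram), the replication count, and the synthetic response are all assumptions chosen for the example. With five factors and this generator, the half-fraction has resolution V, so main effects and two-factor interactions are not confounded with one another.

```python
import itertools
import random


def fractional_factorial(base_factors, generators):
    """Build a two-level fractional factorial design.

    base_factors: factors varied in a full factorial (levels -1 / +1).
    generators: maps each extra factor to the base factors whose
                product defines its level.
    """
    runs = []
    for levels in itertools.product((-1, 1), repeat=len(base_factors)):
        run = dict(zip(base_factors, levels))
        for name, parents in generators.items():
            value = 1
            for parent in parents:
                value *= run[parent]
            run[name] = value
        runs.append(run)
    return runs


def randomized_replicated(design, replications, seed=0):
    """Repeat every run and shuffle the execution order."""
    plan = [dict(run) for run in design for _ in range(replications)]
    random.Random(seed).shuffle(plan)
    return plan


def main_effect(plan, responses, factor):
    """Mean response at the high level minus mean at the low level."""
    high = [y for run, y in zip(plan, responses) if run[factor] == 1]
    low = [y for run, y in zip(plan, responses) if run[factor] == -1]
    return sum(high) / len(high) - sum(low) / len(low)


# Hypothetical factor names; the levels -1 / +1 stand for low / high settings.
base = ["nodes", "cores", "disks", "ram"]
design = fractional_factorial(base, {"volume": base})
plan = randomized_replicated(design, replications=3)

# Synthetic response for demonstration only: time drops when "nodes" is high.
times = [10.0 - 2.0 * run["nodes"] for run in plan]
effects = {f: main_effect(plan, times, f) for f in base + ["volume"]}
```

Ranking the factors by the absolute value of `effects` gives a simple screening order; fitting a linear model with interaction terms would be the natural next step.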

Details

ISSN :
1436-5057 and 0010-485X
Volume :
103
Database :
OpenAIRE
Journal :
Computing
Accession number :
edsair.doi...........6180b9dde17d1b296524de600202e0cb