Screening hardware and volume factors in distributed machine learning algorithms on Spark
- Source :
- Computing. 103:2203-2225
- Publication Year :
- 2021
- Publisher :
- Springer Science and Business Media LLC, 2021.
Abstract
- This paper presents an approach for investigating distributed machine learning workloads on Spark. The work analyzes how hardware and data volume factors affect time and cost performance when machine learning (ML) techniques are applied in big data scenarios. The method is based on the Design of Experiments (DoE) approach and applies a randomized two-level fractional factorial design with replications to screen for the most relevant factors. A Web corpus was built from 16 million webpages from Portuguese-speaking countries. The application was a binary text classification task that distinguishes Brazilian Portuguese from other variations. Five machine learning algorithms were examined: Logistic Regression, Random Forest, Support Vector Machines, Naive Bayes and Multilayer Perceptron. The data was processed on real clusters of up to 28 nodes, each with 12 or 32 cores, 1 or 7 SSD disks, and 3x or 6x RAM per core, for a maximum of 896 cores and 5.25 TB of RAM. Linear models were applied to identify, analyze and rank the influence of the factors. A total of 240 experiments were organized to maximize the detection of non-confounded effects up to the second order while minimizing the experimental effort. The results include linear models to estimate time and cost performance, statistical inferences about effects, and a visualization tool based on parallel coordinates to aid decision making about cluster configuration.
- Subjects :
- Computer science
02 engineering and technology
Machine learning
computer.software_genre
Theoretical Computer Science
Naive Bayes classifier
Spark (mathematics)
0202 electrical engineering, electronic engineering, information engineering
Parallel coordinates
Numerical Analysis
business.industry
Design of experiments
Linear model
020206 networking & telecommunications
Computer Science Applications
Random forest
Support vector machine
Computational Mathematics
Computational Theory and Mathematics
Multilayer perceptron
020201 artificial intelligence & image processing
Artificial intelligence
business
computer
Algorithm
Software
Computer hardware
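
The abstract describes screening factors with a randomized, replicated two-level fractional factorial design and ranking them with linear models. The following is a minimal sketch of that general technique, not the authors' actual design or data: the factor names, the choice of a 16-run half fraction of a 2^5 design, the three replicates, and the synthetic response with made-up coefficients are all assumptions for illustration only.

```python
import itertools
import numpy as np

# Hypothetical factor names; the record does not list the paper's exact factors.
FACTORS = ["nodes", "cores", "disks", "ram", "data_volume"]


def fractional_factorial_2_5_1():
    """Return the 16-run half fraction of a 2^5 design, coded as -1/+1.

    The first four columns form a full 2^4 factorial; the fifth is the
    product of the other four (defining relation I = ABCDE), so main effects
    and two-factor interactions are not confounded with each other.
    """
    base = np.array(list(itertools.product([-1, 1], repeat=4)))
    fifth = base.prod(axis=1, keepdims=True)
    return np.hstack([base, fifth])


def model_matrix(design):
    """Build columns for the intercept, main effects, and two-factor interactions."""
    names = ["intercept"] + list(FACTORS)
    cols = [np.ones(len(design))] + [design[:, i] for i in range(design.shape[1])]
    for i, j in itertools.combinations(range(design.shape[1]), 2):
        names.append(f"{FACTORS[i]}:{FACTORS[j]}")
        cols.append(design[:, i] * design[:, j])
    return names, np.column_stack(cols)


def screen(design, response):
    """Fit the linear model by least squares and rank terms by |coefficient|."""
    names, X = model_matrix(design)
    coef, *_ = np.linalg.lstsq(X, response, rcond=None)
    return sorted(zip(names[1:], coef[1:]), key=lambda t: abs(t[1]), reverse=True)


rng = np.random.default_rng(0)
replicates = 3  # assumed number of replications
design = np.tile(fractional_factorial_2_5_1(), (replicates, 1))
design = design[rng.permutation(len(design))]  # randomize the run order

# Synthetic response with made-up coefficients, standing in for measured run time.
nodes, cores, disks, ram, volume = design.T
response = 100 - 20 * nodes - 8 * cores + 15 * volume - 6 * nodes * volume
response = response + rng.normal(0, 1.0, size=len(design))

ranked = screen(design, response)
for name, value in ranked[:4]:
    print(f"{name:>20s} {value:+.2f}")
```

Because the factors are coded as -1/+1, the fitted coefficients are directly comparable across factors, which is what makes ranking by absolute value meaningful in a screening step. A real study would also test each coefficient for statistical significance, which the replications make possible.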
Details
- ISSN :
- 1436-5057 and 0010-485X
- Volume :
- 103
- Database :
- OpenAIRE
- Journal :
- Computing
- Accession number :
- edsair.doi...........6180b9dde17d1b296524de600202e0cb