1. Screening hardware and volume factors in distributed machine learning algorithms on Spark
- Authors
- Germano C. Vasconcelos, Jairson B. Rodrigues, and Paulo Maciel
- Subjects
Computer science, Machine learning, Theoretical Computer Science, Naive Bayes classifier, Spark, Parallel coordinates, Numerical Analysis, Design of experiments, Linear model, Computer Science Applications, Random forest, Support vector machine, Computational Mathematics, Computational Theory and Mathematics, Multilayer perceptron, Artificial intelligence, Algorithm, Software, Computer hardware
- Abstract
This paper presents an approach to investigating distributed machine learning workloads on Spark. The work analyzes how hardware and data volume factors affect time and cost performance when applying machine learning (ML) techniques in big data scenarios. The method is based on the Design of Experiments (DoE) approach and applies a randomized two-level fractional factorial design with replications to screen the most relevant factors. A web corpus was built from 16 million webpages from Portuguese-speaking countries. The application was binary text classification to distinguish Brazilian Portuguese from other variants. Five machine learning algorithms were examined: Logistic Regression, Random Forest, Support Vector Machines, Naive Bayes, and Multilayer Perceptron. The data were processed on real clusters of up to 28 nodes, each composed of 12 or 32 cores, 1 or 7 SSD disks, and 3x or 6x RAM per core, for a maximum computational power of 896 cores and 5.25 TB of RAM. Linear models were applied to identify, analyze, and rank the influence of the factors. A total of 240 experiments were carefully organized to maximize the detection of non-confounded effects up to second order while minimizing experimental effort. Our results include linear models to estimate time and cost performance, statistical inferences about effects, and a visualization tool based on parallel coordinates to aid decision making about cluster configuration.
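To illustrate the screening idea behind the abstract, the sketch below runs a toy two-level factorial experiment, fits a first-order linear model, and ranks factor effects by magnitude. It is a minimal illustration, not the authors' code: the factor names (cores, disks, RAM) echo the paper's cluster parameters, but the simulated runtime function, the full (rather than fractional) design, and all numbers are invented for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Full 2^3 design matrix in coded units (-1 = low level, +1 = high level).
# A fractional design, as in the paper, would use only a subset of these
# runs chosen by a defining relation to reduce experimental effort.
levels = np.array([[a, b, c] for a in (-1, 1) for b in (-1, 1) for c in (-1, 1)])

def runtime(x):
    """Toy response: cores dominate runtime, disks and RAM matter less."""
    cores, disks, ram = x
    return 100 - 20 * cores - 5 * disks - 2 * ram + rng.normal(0, 1)

# Replicate each run twice, mirroring the paper's use of replications.
X = np.vstack([levels, levels])
y = np.array([runtime(x) for x in X])

# Fit the first-order model y = b0 + b1*cores + b2*disks + b3*ram.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# The effect of a factor is 2 * its coefficient (moving from -1 to +1).
effects = dict(zip(["cores", "disks", "ram"], 2 * coef[1:]))
ranked = sorted(effects.items(), key=lambda kv: abs(kv[1]), reverse=True)
print(ranked)  # cores should rank first under this toy model
```

In the paper's setting the response would be measured cluster runtime or cost rather than a simulated value, and the design would include enough runs to keep second-order effects non-confounded.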
- Published
- 2021