1. Implicitly distributed fuzzy random forests
- Authors
- Francesco Marcelloni, Marco Barsacchi, and Alessio Bechini
- Subjects
- Big Data, Fuzzy classification, Apache Spark, Computer science, Fuzzy Random Forest, Fuzzy set, MapReduce, Fuzzy logic, Random forest, Statistical classification, Computer cluster, Scalability, Data mining
- Abstract
In the field of data mining for large-scale datasets, also known as Big Data mining, the availability of effective and efficient classifiers is a prime concern. Accurate classification results can be obtained with sophisticated models, e.g., ensemble approaches that exploit concepts of fuzzy set theory, but at a high computational cost. The quest for efficiency leads to the adoption of distributed versions of classification algorithms, and in this effort the support of a suitable cluster computing framework can be fundamental. This paper proposes DFRF, a novel distributed fuzzy random forest induction algorithm based on a fuzzy discretizer for continuous attributes. The described approach, although shaped on the MapReduce programming model, takes advantage of the implicit distribution of the computation provided by the Apache Spark framework. An extensive experimental characterization of the algorithm on big datasets, along with a comparison against other state-of-the-art fuzzy classification algorithms, shows that DFRF provides very competitive results; moreover, a scalability study carried out on a small computer cluster shows that the approach behaves well as the number of available computing units increases.
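The abstract mentions a fuzzy discretizer that partitions each continuous attribute into fuzzy sets; a common way to do this is with a strong triangular fuzzy partition, where each value belongs to at most two adjacent sets and its membership degrees sum to one. The sketch below is a minimal, hypothetical illustration of that idea (the core points and boundary handling are assumptions, not the authors' exact formulation):

```python
def memberships(x, cores):
    """Membership degrees of x in a strong triangular fuzzy partition
    whose fuzzy sets are centred on the sorted core points `cores`.
    The first and last sets are shoulder-shaped (degree 1 beyond
    their cores), so every x has degrees summing to 1."""
    k = len(cores)
    mu = [0.0] * k
    for i, c in enumerate(cores):
        if i > 0 and cores[i - 1] < x < c:
            # rising edge of set i, between the previous core and c
            mu[i] = (x - cores[i - 1]) / (c - cores[i - 1])
        elif i < k - 1 and c < x < cores[i + 1]:
            # falling edge of set i, between c and the next core
            mu[i] = (cores[i + 1] - x) / (cores[i + 1] - c)
        elif x == c or (i == 0 and x < c) or (i == k - 1 and x > c):
            # exactly on a core, or outside the outermost cores
            mu[i] = 1.0
    return mu
```

For example, with cores at 0, 5, and 10, the value 2 lies between the first two cores and receives degrees 0.6 and 0.4 in the first and second fuzzy sets, respectively; a tree induction algorithm can then weight the example by these degrees when it follows both branches.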
- Published
- 2021