Comparative Study of Binary Classification Methods to Analyze a Massive Dataset on Virtual Machine.

Authors :
Naik, Neelam
Purohit, Seema
Source :
Procedia Computer Science; 2017, Vol. 112, p1863-1870, 8p
Publication Year :
2017

Abstract

A massive dataset can be analyzed either by setting up a physical distributed environment or by renting a cloud-based one. The advantage of a cloud-based environment over a physical one is that it provides scalable virtual resources on demand, which makes it suitable for handling growth in data volume. Hidden patterns in data can provide knowledge bases for decision making, and statistical or data mining methods can be used to find such patterns. Among the decision tree based classification algorithms that can be implemented in a distributed environment, an efficient algorithm can be selected based on a few parameters, such as execution time, prediction accuracy, and complexity of the tree structure. In this study, an Apache Hadoop-based distributed environment is established on a virtual machine, and Apache Spark is installed to execute machine learning algorithms. A comparative study of three binary classification methods (decision tree, gradient boosted tree, and random forest) is performed to judge their performance on the basis of the defined parameters. Random forest is found to perform best of the three algorithms on the dataset considered. [ABSTRACT FROM AUTHOR]
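
The comparison described in the abstract (execution time, prediction accuracy, and tree complexity recorded for each classifier) can be illustrated with a small harness. This is only a sketch under stated assumptions, not the authors' code: the study runs its classifiers on Apache Spark, whereas the two toy trainers below (`train_majority`, `train_stump`), the tiny dataset, the node count used as a complexity measure, and the choice to rank by accuracy alone are all hypothetical stand-ins chosen so the example runs with the Python standard library.

```python
import time
from collections import Counter


def accuracy(predict, rows):
    """Fraction of (features, label) rows that predict() labels correctly."""
    correct = sum(1 for features, label in rows if predict(features) == label)
    return correct / len(rows)


def train_majority(rows):
    """Hypothetical baseline: always predict the most common training label."""
    majority = Counter(label for _, label in rows).most_common(1)[0][0]
    return (lambda features: majority), 1  # (model, node count): a single leaf


def train_stump(rows):
    """Hypothetical one-split tree: keep the split with the best training accuracy."""
    best = None
    for i in range(len(rows[0][0])):
        for threshold in sorted({features[i] for features, _ in rows}):
            for hi_label in (0, 1):
                # Predict hi_label above the threshold, the other label otherwise.
                def predict(f, i=i, t=threshold, h=hi_label):
                    return h if f[i] > t else 1 - h
                score = accuracy(predict, rows)
                if best is None or score > best[0]:
                    best = (score, predict)
    return best[1], 3  # (model, node count): one split node plus two leaves


def compare(trainers, train_rows, test_rows):
    """Train each model and record training time, test accuracy, and size."""
    results = {}
    for name, trainer in trainers.items():
        start = time.perf_counter()
        predict, node_count = trainer(train_rows)
        elapsed = time.perf_counter() - start
        results[name] = {
            "seconds": elapsed,
            "accuracy": accuracy(predict, test_rows),
            "nodes": node_count,
        }
    return results


# Toy binary-labelled data (assumed for illustration; not the study's dataset).
train_rows = [((1.0,), 0), ((2.0,), 0), ((3.0,), 1), ((4.0,), 1), ((5.0,), 1)]
test_rows = [((1.5,), 0), ((3.5,), 1), ((4.5,), 1), ((0.5,), 0)]

results = compare(
    {"majority": train_majority, "stump": train_stump}, train_rows, test_rows
)
# Assumption: rank by accuracy only; the abstract does not say how the
# three parameters are weighed against each other.
best = max(results, key=lambda name: results[name]["accuracy"])
```

In a Spark setting, each trainer would wrap the fitting of a real model, and the timing and accuracy bookkeeping in `compare` would stay the same.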

Details

Language :
English
ISSN :
18770509
Volume :
112
Database :
Supplemental Index
Journal :
Procedia Computer Science
Publication Type :
Academic Journal
Accession number :
124953220
Full Text :
https://doi.org/10.1016/j.procs.2017.08.232