Comparative Study of Binary Classification Methods to Analyze a Massive Dataset on Virtual Machine.

Authors :: Naik, Neelam
Purohit, Seema
Source :: Procedia Computer Science; 2017, Vol. 112, p1863-1870, 8p
Publication Year :: 2017
Abstract: A massive dataset can be analyzed by establishing a physical distributed environment or by hiring a cloud-based distributed environment. The advantage of a cloud-based environment over a physical one is that it provides scalable virtual resources on demand, which makes it suitable for handling growth in the volume of data. Hidden patterns in the data can provide a knowledge base for decision making, and statistical or data-mining methods can be used to find these patterns. Among the decision-tree-based classification algorithms that can be implemented in a distributed environment, an efficient algorithm can be selected on the basis of a few parameters, such as execution time, prediction accuracy, and complexity of the tree structure. In this study, an Apache Hadoop-based distributed environment is established on a virtual machine, and Apache Spark is installed to execute the machine learning algorithms. A comparative study of three binary classification methods (decision tree, gradient boosted tree, and random forest) is performed to judge their performance on these parameters. Random forest is found to perform best among the three algorithms for the dataset considered. [ABSTRACT FROM AUTHOR]

Subjects :: DATA analysis
BINARY sequences
CLOUD computing
DECISION making
KNOWLEDGE base
COMPARATIVE studies
DECISION trees
