1. Subclass-based semi-random data partitioning for improving sample representativeness.
- Author
- Liu, Han; Chen, Shyi-Ming; Cocea, Mihaela
- Subjects
- *RANDOM data (Statistics); *PARALLEL algorithms; *MACHINE learning; *STATISTICAL sampling; *PREDICTION models
- Abstract
In machine learning tasks, a data set must be partitioned into a training set and a test set in a specific ratio. The training set is used to learn a model for making predictions on new instances, whereas the test set is used to evaluate the model's prediction accuracy on new instances. In the context of human learning, a training set can be viewed as learning material that covers the required knowledge, whereas a test set can be viewed as an exam paper that provides questions for students to answer. In practice, data partitioning has typically been done by randomly selecting 70% of the instances for training and the rest for testing. In this paper, we argue that random data partitioning is likely to cause a sample representativeness issue, i.e., the training and test instances show very dissimilar characteristics, a case similar to testing students on material that was not taught. To address this issue, we propose a subclass-based semi-random data partitioning approach. The experimental results show that the proposed approach leads to significant advances in learning performance due to the improved sample representativeness.
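To illustrate the idea behind semi-random partitioning, the sketch below shows a simple class-stratified variant: instances are grouped by class, shuffled within each group, and split per group in the target ratio, so the class distributions of the training and test sets stay close to that of the full data set. This is a minimal illustration of the general principle, not the authors' exact subclass-based algorithm; the function name `semi_random_split` and its parameters are assumptions for this sketch.

```python
import random
from collections import defaultdict

def semi_random_split(instances, labels, train_ratio=0.7, seed=42):
    """Class-stratified semi-random split (illustrative sketch).

    Within each class, instances are shuffled randomly, then the first
    train_ratio share goes to training and the rest to testing. This
    keeps per-class proportions similar across the two sets, improving
    sample representativeness over a fully random 70/30 split.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                     # random element: order within class
        cut = round(len(idxs) * train_ratio)  # deterministic element: split point
        train_idx.extend(idxs[:cut])
        test_idx.extend(idxs[cut:])

    return ([instances[i] for i in train_idx], [labels[i] for i in train_idx],
            [instances[i] for i in test_idx], [labels[i] for i in test_idx])
```

With a balanced two-class data set of 20 instances and a 0.7 ratio, each class contributes 7 instances to training and 3 to testing, whereas a fully random split could easily draw, say, 9 instances of one class and 5 of the other into training.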
- Published
- 2019