1. Learning from small datasets containing nominal attributes
- Author
- Qi Shi Shi, Der-Chiang Li, and Hung Yu Chen
- Subjects
Artificial neural network; Computer science; Cognitive neuroscience; Bootstrap aggregating; Population; Sample (statistics); Engineering and technology; Machine learning; Fuzzy logic; Computer science applications; Support vector machine; Artificial intelligence; Information systems; Electrical engineering, electronic engineering, information engineering; Artificial intelligence & image processing; Data pre-processing
- Abstract
In many small-data-learning problems, owing to the incomplete data structure, explicit information for decision makers is limited. Although machine learning algorithms are extensively applied to extract knowledge, most of them are developed without considering whether the training sets can fully represent the population properties. Focusing on small datasets that contain nominal inputs and continuous outputs, this paper develops an effective sample-generating procedure based on fuzzy theory to tackle the learning issue through data preprocessing. According to the derived fuzzy relations between categories and continuous outputs, the possibilities of the combinations of categories (virtual samples) can be aggregated when continuous outputs are given. Proper virtual samples are further selected by applying a fuzzy alpha-cut to the possibility distributions, and these are added to the training sets to form new ones. In the experiment, sixteen datasets taken from the UC Irvine Machine Learning Repository are examined with back-propagation neural networks and support vector regressions. The results reveal that the forecasting accuracies of the two models are significantly improved when they are built with the proposed new training sets. Moreover, the results also indicate that the proposed method outperforms bootstrap aggregating and the Synthetic Minority Over-sampling Technique-Nominal Continuous (SMOTE-NC) with the greatest amount of statistical support.
- Published
- 2018