Detecting representative data and generating synthetic samples to improve learning accuracy with imbalanced data sets
- Source :
- PLoS ONE, Vol 12, Iss 8, p e0181853 (2017)
- Publication Year :
- 2017
- Publisher :
- Public Library of Science (PLoS), 2017.
Abstract
- Learning models struggle to achieve high classification performance on imbalanced data sets: when one class is much larger than the others, most machine learning and data mining classifiers are overly influenced by the larger classes and ignore the smaller ones. As a result, classification algorithms often learn poorly because of slow convergence in the smaller classes. To balance such data sets, this paper presents a strategy that reduces the size of the majority data and generates synthetic samples for the minority data. In the reduction step, we use the box-and-whisker plot approach to exclude outliers and the Mega-Trend-Diffusion method to find representative data in the majority data. To generate the synthetic samples, we propose a counterintuitive hypothesis to find the distribution shape of the minority data, and then produce samples according to this distribution. Four real datasets were used to examine the performance of the proposed approach. We used paired t-tests to compare the Accuracy, G-mean, and F-measure scores of the proposed data pre-processing (PPDP) method combined with the D3C method (PPDP+D3C) against those of one-sided selection (OSS), the well-known SMOTEBoost (SB), normal distribution-based oversampling (NDO), and the PPDP method alone. The results indicate that the proposed approach achieves better classification performance than the above-mentioned methods.
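As a rough illustration of two ideas named in the abstract, the sketch below shows a box-and-whisker outlier filter and the three reported scores (Accuracy, G-mean, F-measure). It is not the paper's implementation: the 1.5 × IQR whisker rule, the "inclusive" quartile convention, and the function names `drop_boxplot_outliers` and `imbalance_scores` are assumptions made for this example, and the Mega-Trend-Diffusion step is not reproduced.

```python
import math
from statistics import quantiles


def drop_boxplot_outliers(values, whisker=1.5):
    """Keep values inside [Q1 - whisker*IQR, Q3 + whisker*IQR].

    Assumption: the conventional Tukey fence (whisker=1.5); the abstract
    does not state which multiplier the authors used.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if low <= v <= high]


def imbalance_scores(y_true, y_pred, positive=1):
    """Accuracy, G-mean, and F-measure for a binary problem.

    `positive` marks the minority class (an assumption of this sketch).
    Ratios with a zero denominator are reported as 0.0.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    def ratio(num, den):
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)        # sensitivity on the minority class
    specificity = ratio(tn, tn + fp)   # accuracy on the majority class
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "g_mean": math.sqrt(recall * specificity),
        "f_measure": ratio(2 * precision * recall, precision + recall),
    }


majority_feature = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(drop_boxplot_outliers(majority_feature))
print(imbalance_scores([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1]))
```

G-mean and F-measure are reported alongside Accuracy because accuracy alone can look high when a classifier ignores the minority class entirely; the geometric mean of the per-class recalls drops to zero in that case.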
- Subjects :
- Computer and Information Sciences
Databases, Factual
Computer science
Kernel Functions
Normal Distribution
lcsh:Medicine
Datasets as Topic
02 engineering and technology
Research and Analysis Methods
Plot (graphics)
Machine Learning
Normal distribution
Machine Learning Algorithms
Artificial Intelligence
Support Vector Machines
020204 information systems
0202 electrical engineering, electronic engineering, information engineering
Data Mining
Humans
lcsh:Science
Operator Theory
Multidisciplinary
business.industry
Applied Mathematics
Simulation and Modeling
lcsh:R
Pattern recognition
Probability Theory
Probability Distribution
Data set
Support vector machine
Statistical classification
Data Interpretation, Statistical
Kernel (statistics)
Physical Sciences
Outlier
Probability distribution
lcsh:Q
020201 artificial intelligence & image processing
Artificial intelligence
Information Technology
business
Mathematics
Algorithms
Research Article
Details
- ISSN :
- 1932-6203
- Volume :
- 12
- Database :
- OpenAIRE
- Journal :
- PLOS ONE
- Accession number :
- edsair.doi.dedup.....e28fb8743dc2b1a8652123773a0bf693