On the overestimation of random forest’s out-of-bag error
- Source :
- PLoS ONE, Vol 13, Iss 8, p e0201904 (2018)
- Publication Year :
- 2018
- Publisher :
- Public Library of Science (PLoS), 2018.
Abstract
- The ensemble method random forests has become a popular classification tool in bioinformatics and related fields. The out-of-bag error is an error estimation technique often used to evaluate the accuracy of a random forest and to select appropriate values for tuning parameters, such as the number of candidate predictors that are randomly drawn for a split, referred to as mtry. However, for binary classification problems with metric predictors it has been shown that the out-of-bag error can overestimate the true prediction error depending on the choices of random forests parameters. Based on simulated and real data this paper aims to identify settings for which this overestimation is likely. It is, moreover, questionable whether the out-of-bag error can be used in classification tasks for selecting tuning parameters like mtry, because the overestimation is seen to depend on the parameter mtry. The simulation-based and real-data-based studies with metric predictor variables performed in this paper show that the overestimation is largest in balanced settings and in settings with few observations, a large number of predictor variables, small correlations between predictors and weak effects. There was hardly any impact of the overestimation on tuning parameter selection. However, although the prediction performance of random forests was not substantially affected when using the out-of-bag error for tuning parameter selection in the present studies, one cannot be sure that this applies to all future data. For settings with metric predictor variables it is therefore strongly recommended to use stratified subsampling with sampling fractions that are proportional to the class sizes for both tuning parameter selection and error estimation in random forests. This yielded less biased estimates of the true prediction error. In unbalanced settings, in which there is a strong interest in predicting observations from the smaller classes well, sampling the same number of observations from each class is a promising alternative.
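The two sampling schemes named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `stratified_subsample` and `balanced_subsample`, the sampling fraction 0.632, and the toy class sizes are assumptions chosen for illustration. The sketch only draws the in-bag and out-of-bag index sets for a single tree; fitting the tree and computing the out-of-bag error are left out.

```python
import random
from collections import Counter, defaultdict


def _indices_by_class(labels):
    """Group observation indices by their class label."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    return by_class


def stratified_subsample(labels, fraction, rng):
    """Draw the same fraction of each class without replacement.

    Returns (in_bag, out_of_bag) index lists; the in-bag class
    proportions therefore mirror those of the full data set.
    """
    in_bag = []
    for idx in _indices_by_class(labels).values():
        k = max(1, round(fraction * len(idx)))  # at least one draw per class
        in_bag.extend(rng.sample(idx, k))
    chosen = set(in_bag)
    out_of_bag = [i for i in range(len(labels)) if i not in chosen]
    return sorted(in_bag), out_of_bag


def balanced_subsample(labels, per_class, rng):
    """Draw the same number of observations from each class without replacement."""
    in_bag = []
    for idx in _indices_by_class(labels).values():
        in_bag.extend(rng.sample(idx, min(per_class, len(idx))))
    chosen = set(in_bag)
    out_of_bag = [i for i in range(len(labels)) if i not in chosen]
    return sorted(in_bag), out_of_bag


# Hypothetical unbalanced toy data: 80 observations of class 0, 20 of class 1.
rng = random.Random(0)
labels = [0] * 80 + [1] * 20

in_bag, oob = stratified_subsample(labels, 0.632, rng)
print(Counter(labels[i] for i in in_bag))   # class counts stay roughly 80:20

in_bag_b, oob_b = balanced_subsample(labels, 13, rng)
print(Counter(labels[i] for i in in_bag_b))  # equal counts per class
```

In a full forest, a draw like this would be repeated for every tree, and each observation's out-of-bag prediction would be aggregated only over the trees for which it was not in the in-bag set.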
- Subjects :
- 0301 basic medicine
Decision Analysis
010504 meteorology & atmospheric sciences
Normal Distribution
lcsh:Medicine
01 natural sciences
Trees
Mathematical and Statistical Techniques
Neoplasms
Breast Tumors
Statistics
Medicine and Health Sciences
lcsh:Science
Mathematics
Multidisciplinary
Simulation and Modeling
Prostate Cancer
Prostate Diseases
Eukaryota
Sampling (statistics)
Plants
Random forest
Oncology
Binary classification
Physical Sciences
Metric (mathematics)
Engineering and Technology
Management Engineering
Statistics (Mathematics)
Algorithms
Research Article
Urology
Decision tree
Research and Analysis Methods
Normal distribution
03 medical and health sciences
Breast Cancer
Humans
Computer Simulation
Statistical Methods
Selection (genetic algorithm)
0105 earth and related environmental sciences
Colorectal Cancer
Estimation
Decision Trees
lcsh:R
Organisms
Cancers and Neoplasms
Biology and Life Sciences
Computational Biology
Probability Theory
Probability Distribution
Genitourinary Tract Tumors
030104 developmental biology
lcsh:Q
Forecasting
Details
- ISSN :
- 19326203
- Volume :
- 13
- Database :
- OpenAIRE
- Journal :
- PLOS ONE
- Accession number :
- edsair.doi.dedup.....da527bfd6b1219877f82a44d6c67e045