101. Categorical variables with many categories are preferentially selected in bootstrap-based model selection procedures for multivariable regression models.
- Author
- Rospleszcz S, Janitza S, and Boulesteix AL
- Subjects
- Computer Simulation; Humans; Multivariate Analysis; Nutrition Surveys/statistics & numerical data; Biometry/methods; Models, Statistical
- Abstract
Automated variable selection procedures, such as backward elimination, are commonly employed to perform model selection in the context of multivariable regression. The stability of such procedures can be investigated using a bootstrap-based approach. The idea is to apply the variable selection procedure to a large number of bootstrap samples in succession and to examine the resulting models, for instance, in terms of the inclusion of specific predictor variables. In this paper, we investigate a particularly important problem affecting this method in the case of categorical predictor variables with different numbers of categories, and we give recommendations on how to avoid it. For this purpose, we systematically assess the behavior of automated variable selection based on the likelihood ratio test, using either bootstrap samples drawn with replacement or subsamples drawn without replacement from the original dataset. Our study consists of extensive simulations and a real data example from the NHANES study. Our main result is that if automated variable selection is conducted on bootstrap samples, variables with more categories are substantially favored over variables with fewer categories and over metric variables, even if none of them has any effect. Importantly, variables with no effect and many categories may be (wrongly) preferred to variables with an effect but few categories. We suggest the use of subsamples instead of bootstrap samples to bypass these drawbacks. (© 2016 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.)
- Published
- 2016
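The comparison described in the abstract (running a variable selection procedure on many bootstrap samples or on many subsamples, then counting how often each predictor is selected) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the `select` callable (a stand-in for any selection procedure, such as backward elimination with a likelihood ratio test), and the default subsample fraction of 0.632 are all assumptions made for this example.

```python
import random
from collections import Counter


def resample_indices(n, scheme, rng, fraction=0.632):
    """Return row indices for one resample of a dataset with n rows.

    `fraction` is an assumed default for the subsample size; the record
    above does not state which size the authors used.
    """
    if scheme == "bootstrap":
        # n draws with replacement: rows can appear more than once
        return [rng.randrange(n) for _ in range(n)]
    if scheme == "subsample":
        # m < n draws without replacement: each row appears at most once
        m = max(1, int(round(fraction * n)))
        return rng.sample(range(n), m)
    raise ValueError(f"unknown scheme: {scheme!r}")


def inclusion_frequencies(rows, select, scheme, n_resamples=200,
                          seed=0, fraction=0.632):
    """Share of resamples in which each variable is selected.

    `select` is a hypothetical user-supplied callable that takes a list
    of rows and returns the names of the variables it keeps.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_resamples):
        idx = resample_indices(len(rows), scheme, rng, fraction)
        counts.update(set(select([rows[i] for i in idx])))
    return {name: c / n_resamples for name, c in counts.items()}
```

Calling `inclusion_frequencies` once with `scheme="bootstrap"` and once with `scheme="subsample"` on the same data and the same `select` function gives two sets of inclusion frequencies that can be compared variable by variable, which is the kind of comparison the abstract reports.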