1. Performance Comparison of Linear and Nonlinear Feature Selection Methods for the Analysis of Large Survey Datasets
- Author
- Andrew Sixsmith, Martin Ester, Olga Krakovska, Gregory J. Christie, and Sylvain Moreno
- Subjects
Aging; Databases, Factual; Computer science; Physiology; computer.software_genre; 01 natural sciences; Systems Science; 010104 statistics & probability; Database and Informatics Methods; Medicine and Health Sciences; Public and Occupational Health; Longitudinal Studies; media_common; 0303 health sciences; Multidisciplinary; Alcohol Consumption; Organic Compounds; Contrast (statistics); Chemistry; Health Education and Awareness; Research Design; Performance comparison; Physical Sciences; Medicine; Data mining; Behavioral and Social Aspects of Health; Algorithms; Research Article; Computer and Information Sciences; media_common.quotation_subject; Science; Context (language use); Feature selection; Research and Analysis Methods; 03 medical and health sciences; Humans; 0101 mathematics; 030304 developmental biology; Nutrition; Electronic Data Processing; Variables; Organic Chemistry; Chemical Compounds; Biology and Life Sciences; Models, Theoretical; Diet; Health Care; Nonlinear system; Alcohols; Survey data collection; Physiological Processes; computer; Organism Development; Nonlinear Systems; Mathematics; Developmental Biology
- Abstract
Large survey databases for aging-related analysis are often examined to discover key factors that affect a dependent variable of interest. Typically, this analysis is performed with methods that assume linear dependencies between variables. Such assumptions, however, do not hold in many cases, where the data are linked by non-linear dependencies. This in turn calls for analytic methods that can more accurately identify potentially non-linear dependencies. Here, we objectively compared the feature selection performance of several frequently used linear selection methods and three non-linear selection methods in the context of large survey data. These methods were assessed using both synthetic and real-world datasets in which the relationships between the features and the dependent variables were known in advance. We found that the non-linear methods offered better overall feature selection performance than the linear methods in all usage conditions. Moreover, the performance of the non-linear methods was more stable, being unaffected by the inclusion or exclusion of variables from the datasets. These properties make non-linear feature selection methods a potentially preferable tool for both hypothesis-driven and exploratory analyses of aging-related datasets.
- Published
- 2019
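
The contrast the abstract draws between linear and non-linear feature selection can be illustrated with a small synthetic example in which the true dependencies are known in advance. The sketch below is not the authors' method or data: the abstract does not name the specific methods compared, so absolute Pearson correlation stands in for a linear score and a simple binned mutual information estimate stands in for a non-linear one. All names (`abs_pearson`, `binned_mutual_information`, `make_synthetic`, the feature names) and all parameter values are assumptions made for illustration.

```python
import math
import random


def abs_pearson(x, y):
    """Linear score: absolute Pearson correlation between two samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return abs(cov / math.sqrt(var_x * var_y))


def rank_bins(values, n_bins):
    """Assign each value to one of n_bins equal-frequency bins by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


def binned_mutual_information(x, y, n_bins=8):
    """Non-linear score: plug-in mutual information (nats) on binned data."""
    n = len(x)
    bx, by = rank_bins(x, n_bins), rank_bins(y, n_bins)
    joint = {}
    for a, b in zip(bx, by):
        joint[(a, b)] = joint.get((a, b), 0) + 1
    px = [bx.count(k) / n for k in range(n_bins)]
    py = [by.count(k) / n for k in range(n_bins)]
    return sum(
        (c / n) * math.log((c / n) / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )


def make_synthetic(n=2000, seed=0):
    """Hypothetical data: y depends linearly on x_lin, quadratically on
    x_quad, and not at all on x_noise."""
    rng = random.Random(seed)
    features = {
        "x_lin": [rng.uniform(-1, 1) for _ in range(n)],
        "x_quad": [rng.uniform(-1, 1) for _ in range(n)],
        "x_noise": [rng.uniform(-1, 1) for _ in range(n)],
    }
    y = [
        a + 3 * b ** 2 + rng.gauss(0, 0.1)
        for a, b in zip(features["x_lin"], features["x_quad"])
    ]
    return features, y


features, y = make_synthetic()
linear_scores = {k: abs_pearson(v, y) for k, v in features.items()}
nonlinear_scores = {k: binned_mutual_information(v, y) for k, v in features.items()}

for name in features:
    print(f"{name}: |r| = {linear_scores[name]:.3f}, MI = {nonlinear_scores[name]:.3f}")
```

Because the quadratic term is symmetric around zero, the correlation score for `x_quad` should be close to that of the pure-noise feature, while the mutual information score should separate the two. This is only a toy illustration of why a linear criterion can miss a relevant variable; it makes no claim about the magnitude of the effects reported in the paper.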