Random forests for feature selection in QSPR Models - an application for predicting standard enthalpy of formation of hydrocarbons
- Source :
- Journal of Cheminformatics
- Publication Year :
- 2012
Abstract
- Background: A central task in developing quantitative structure-property relationship (QSPR) predictive models is identifying the subset of variables that represent the structure of a molecule and are predictors of a given property. Several automated feature selection methods exist, ranging from backward, forward, or stepwise procedures to more elaborate methodologies such as evolutionary programming. The problem lies in selecting the minimum subset of descriptors that can predict a given property with good performance, in a computationally efficient and robust way, since irrelevant or redundant features can cause poor generalization. This paper proposes an alternative selection method for QSPR regression problems that uses Random Forests to determine variable importance, and applies it to a manually curated dataset for predicting standard enthalpy of formation. The subsequent predictive models are trained with support vector machines, introducing the variables sequentially from a list ranked by variable importance.
Results: The model generalizes well even with a high-dimensional dataset and in the presence of highly correlated variables. Feature selection yielded RMSE values 23% lower than those obtained without feature selection, while using only 6% of the total number of variables (89 of the original 1485). The proposed approach also compared favourably with other feature selection methods and with dimension reduction of the feature space. The predictive model was selected using a 10-fold cross-validation procedure and then validated on an independent set to assess its performance on new data; the results were similar to those obtained for the training set, supporting the robustness of the proposed approach.
Conclusions: The proposed methodology seemingly improves the prediction of standard enthalpy of formation of hydrocarbons using a limited set of molecular descriptors. Reducing the number of descriptors makes their calculation faster and more cost-effective, and provides a better understanding of the underlying relationship between the molecular structure represented by the descriptors and the property of interest.
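The workflow described in the abstract (rank variables by Random Forest importance, add them to a support vector machine model one at a time in ranked order, choose the subset by 10-fold cross-validation, then check it on an independent set) can be illustrated with a minimal sketch. This is not the authors' code. It assumes scikit-learn, a synthetic dataset in place of the curated enthalpy data, impurity-based importances, an RBF kernel with an arbitrary C value, and a cap of 15 candidate variables. The function names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def make_svr():
    """Scaled support vector regressor (kernel and C are illustrative assumptions)."""
    return make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0))


def rank_by_rf_importance(X, y, seed=0):
    """Return column indices ordered from most to least important."""
    forest = RandomForestRegressor(n_estimators=200, random_state=seed)
    forest.fit(X, y)
    # Impurity-based importances; other importance measures are possible.
    return np.argsort(forest.feature_importances_)[::-1]


def cv_rmse(X, y, folds=10):
    """Square root of the mean squared error averaged over the CV folds."""
    scores = cross_val_score(make_svr(), X, y, cv=folds,
                             scoring="neg_mean_squared_error")
    return float(np.sqrt(-scores.mean()))


def select_top_k(X, y, ranking, max_k):
    """Add ranked variables one at a time; keep the prefix with lowest CV RMSE."""
    errors = [cv_rmse(X[:, ranking[:k]], y) for k in range(1, max_k + 1)]
    best_k = int(np.argmin(errors)) + 1
    return best_k, errors


# Synthetic stand-in for a descriptor matrix (not the paper's dataset).
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

ranking = rank_by_rf_importance(X_train, y_train)
best_k, errors = select_top_k(X_train, y_train, ranking, max_k=15)
selected = ranking[:best_k]

# Refit on the selected variables and evaluate on the held-out set.
final_model = make_svr().fit(X_train[:, selected], y_train)
residuals = final_model.predict(X_test[:, selected]) - y_test
test_rmse = float(np.sqrt(np.mean(residuals ** 2)))

print(f"selected {best_k} of {X.shape[1]} variables")
print(f"CV RMSE: {errors[best_k - 1]:.2f}, held-out RMSE: {test_rmse:.2f}")
```

Comparing the cross-validated RMSE with the held-out RMSE mirrors the abstract's check that results on new data stay close to those seen during model selection.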
- Subjects :
- Quantitative structure–activity relationship
Variable importance
Computer science
High dimensional data
Feature vector
Feature selection
Library and Information Sciences
01 natural sciences
Cross-validation
03 medical and health sciences
QSPR
Physical and Theoretical Chemistry
Data-mining
Selection (genetic algorithm)
030304 developmental biology
0303 health sciences
Dimensionality reduction
Random forests
Computer Graphics and Computer-Aided Design
0104 chemical sciences
Computer Science Applications
Support vector machine
010404 medicinal & biomolecular chemistry
Property prediction
Hybrid methodology
Data mining
Research Article
Details
- ISSN :
- 1758-2946
- Volume :
- 5
- Issue :
- 1
- Database :
- OpenAIRE
- Journal :
- Journal of Cheminformatics
- Accession number :
- edsair.doi.dedup.....a7fe3be2cce29ac4b174284ce94f04e1