1. Cardiometabolic risk estimation using exposome data and machine learning
- Author
-
Atehortúa, A. (Angélica), Gkontra, P. (Polyxeni), Camacho, M. (Marina), Diaz, O. (Oliver), Bulgheroni, M. (Maria), Simonetti, V. (Valentina), Chadeau-Hyam, M. (Marc), Felix, J. F. (Janine F.), Sebert, S. (Sylvain), and Lekadir, K. (Karim)
- Abstract
Background: The human exposome encompasses all exposures that individuals encounter throughout their lifetime. It is now widely acknowledged that health outcomes are influenced not only by genetic factors but also by the interactions between these factors and various exposures. Consequently, the exposome has emerged as a significant contributor to the overall risk of developing major diseases, such as cardiovascular disease (CVD) and diabetes. Personalized early risk assessment based on exposome attributes might therefore be a promising tool for identifying high-risk individuals and improving disease prevention. Objective: To develop and evaluate a novel and fair machine learning (ML) model for CVD and type 2 diabetes (T2D) risk prediction based on a set of readily available exposome factors. We evaluated our model using internal and external validation groups from a multi-center cohort. To be considered fair, the model was required to demonstrate consistent performance across different sub-groups of the cohort. Methods: From the UK Biobank, we identified 5,348 participants diagnosed with CVD and 1,534 participants diagnosed with T2D within 13 years of the baseline visit. An equal number of participants who did not develop these pathologies were randomly selected as the control group. We considered 109 readily available exposure variables from the participant’s baseline visit, spanning six categories (physical measures, environmental, lifestyle, mental health events, sociodemographics, and early-life factors). We adopted the XGBoost ensemble model to predict individuals at risk of developing the diseases. The model’s performance was compared to that of an integrative ML model based on a set of biological, clinical, physical, and sociodemographic variables and, additionally for CVD, to the Framingham risk score. Moreover, we assessed the proposed model for potential bias related to sex, ethnicity, and age. Lastly, we interpreted the model’s r
- Published
- 2023
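
The abstract's Methods mention two steps that are easy to illustrate in isolation: randomly drawing as many controls as there are cases, and checking that performance is consistent across sub-groups such as sex. The following is a minimal sketch of those two ideas only, not the authors' pipeline. The function names, the toy labels, and the use of plain accuracy as the per-group metric are all assumptions made for illustration; the abstract does not state which metric or sampling procedure was used.

```python
import random
from collections import defaultdict


def sample_balanced_controls(cases, non_cases, seed=0):
    """Randomly draw, without replacement, as many controls as there are cases.

    Hypothetical helper: `cases` and `non_cases` are lists of participant IDs.
    """
    rng = random.Random(seed)
    return rng.sample(non_cases, k=len(cases))


def subgroup_accuracy(y_true, y_pred, groups):
    """Compute accuracy separately for each sub-group label (e.g. sex)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        totals[group] += 1
        hits[group] += int(truth == pred)
    return {group: hits[group] / totals[group] for group in totals}


def max_subgroup_gap(per_group):
    """Largest difference in the metric between any two sub-groups."""
    values = list(per_group.values())
    return max(values) - min(values)


# Toy data (made up for illustration, not taken from the UK Biobank).
case_ids = ["c1", "c2", "c3"]
non_case_ids = ["n1", "n2", "n3", "n4", "n5", "n6"]
control_ids = sample_balanced_controls(case_ids, non_case_ids)

y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 0, 1, 0]
sex = ["F", "F", "F", "F", "M", "M", "M", "M"]

per_group = subgroup_accuracy(y_true, y_pred, sex)
gap = max_subgroup_gap(per_group)
```

A small gap between sub-groups is one simple reading of "consistent performance"; what counts as small enough is a judgment call that this sketch leaves open.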