Developing more generalizable prediction models from pooled studies and large clustered data sets
- Source :
- Statistics in Medicine
- Publication Year :
- 2021
- Publisher :
- John Wiley and Sons Inc., 2021.
Abstract
- Prediction models often yield inaccurate predictions for new individuals. Large data sets from pooled studies or electronic healthcare records may alleviate this with an increased sample size and variability in sample characteristics. However, existing strategies for prediction model development generally do not account for heterogeneity in predictor-outcome associations between different settings and populations. This limits the generalizability of developed models (even from large, combined, clustered data sets) and necessitates local revisions. We aim to develop methodology for producing prediction models that require less tailoring to different settings and populations. We adopt internal-external cross-validation to assess and reduce heterogeneity in models' predictive performance during the development. We propose a predictor selection algorithm that optimizes the (weighted) average performance while minimizing its variability across the hold-out clusters (or studies). Predictors are added iteratively until the estimated generalizability is optimized. We illustrate this by developing a model for predicting the risk of atrial fibrillation and updating an existing one for diagnosing deep vein thrombosis, using individual participant data from 20 cohorts (N = 10 873) and 11 diagnostic studies (N = 10 014), respectively. Meta-analysis of calibration and discrimination performance in each hold-out cluster shows that trade-offs between average and heterogeneity of performance occurred. Our methodology enables the assessment of heterogeneity of prediction model performance during model development in multiple or clustered data sets, thereby informing researchers on predictor selection to improve the generalizability to different settings and populations, and reduce the need for model tailoring. Our methodology has been implemented in the R package metamisc.
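The selection procedure summarized in the abstract can be illustrated with a minimal sketch. This is not the metamisc implementation: the linear model fitted by least squares, the mean squared error as the performance measure, the unweighted criterion `mean + penalty * SD`, and all function names and simulated data are assumptions made only for illustration. Each cluster is held out in turn (internal-external cross-validation), and predictors are added one at a time while the combined criterion keeps improving.

```python
import numpy as np

def iecv_losses(X, y, cluster, cols):
    """Hold out each cluster in turn; return one hold-out MSE per cluster."""
    cols = np.asarray(cols, dtype=int)
    losses = []
    for c in np.unique(cluster):
        train, test = cluster != c, cluster == c
        # Intercept column plus the currently selected predictors.
        A_train = np.column_stack([np.ones(train.sum()), X[train][:, cols]])
        A_test = np.column_stack([np.ones(test.sum()), X[test][:, cols]])
        beta, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        losses.append(np.mean((y[test] - A_test @ beta) ** 2))
    return np.array(losses)

def generalizability(losses, penalty=1.0):
    # Assumed criterion: average hold-out loss plus a penalty on its
    # between-cluster spread (lower is better).
    return losses.mean() + penalty * losses.std(ddof=1)

def forward_select(X, y, cluster, penalty=1.0):
    """Greedily add predictors while the criterion strictly improves."""
    selected = []
    remaining = list(range(X.shape[1]))
    best = generalizability(iecv_losses(X, y, cluster, selected), penalty)
    while remaining:
        scores = {
            j: generalizability(iecv_losses(X, y, cluster, selected + [j]), penalty)
            for j in remaining
        }
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best:
            break  # no candidate improves the estimated generalizability
        selected.append(j_best)
        remaining.remove(j_best)
        best = scores[j_best]
    return selected, best

# Simulated example (hypothetical data): predictor 0 has the same effect in
# every cluster, predictor 1 has a cluster-specific effect, predictor 2 is noise.
rng = np.random.default_rng(0)
n_clusters, n_per = 5, 200
cluster = np.repeat(np.arange(n_clusters), n_per)
X = rng.normal(size=(n_clusters * n_per, 3))
slope_1 = np.linspace(-1.0, 1.0, n_clusters)[cluster]
y = 2.0 * X[:, 0] + slope_1 * X[:, 1] + rng.normal(size=n_clusters * n_per)

selected, score = forward_select(X, y, cluster)
baseline = generalizability(iecv_losses(X, y, cluster, []))
print("selected predictors:", selected)
print("criterion: %.3f (intercept-only: %.3f)" % (score, baseline))
```

Raising `penalty` puts more weight on consistent performance across hold-out clusters and less on average performance, which is one simple way to express the trade-off the abstract reports.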
- Subjects :
- Statistics and Probability
Epidemiology
Calibration (statistics)
Computer science
Sample (statistics)
Machine learning
01 natural sciences
010104 statistics & probability
03 medical and health sciences
0302 clinical medicine
RA0421
Clustered data
Humans
Generalizability theory
030212 general & internal medicine
0101 mathematics
Selection algorithm
Selection (genetic algorithm)
Research Articles
individual participant data
prediction
internal‐external cross‐validation
Sample size determination
Research Design
Calibration
Artificial intelligence
heterogeneity
RA
Predictive modelling
Research Article
Details
- Language :
- English
- ISSN :
- 1097-0258 and 0277-6715
- Volume :
- 40
- Issue :
- 15
- Database :
- OpenAIRE
- Journal :
- Statistics in Medicine
- Accession number :
- edsair.doi.dedup.....a16a6fab4f3865ad8df3b8b05d356da2