1. Estimating disease prevalence from drug utilization data using the Random Forest algorithm
- Author
- Markus M J Nielen, Laurentius C J Slobbe, Koen Füssenich, Albert Wong, Hendriek Boshuizen, Hans van Oers, Johan Polder, Talitha L Feenstra, Tranzo, Scientific center for care and wellbeing, PharmacoTherapy, -Epidemiology and -Economics, Value, Affordability and Sustainability (VALUE), and Real World Studies in PharmacoEpidemiology, -Genetics, -Economics and -Therapy (PEGET)
- Subjects
Adult; Male; NATIONAL-HEALTH; medicine.medical_specialty; 030232 urology & nephrology; Prevalence; Disease; 03 medical and health sciences; 0302 clinical medicine; Acquired immunodeficiency syndrome (AIDS); Internal medicine; Diabetes mellitus; medicine; Life Science; Humans; 030212 general & internal medicine; Medical prescription; Aged; Netherlands; Probability; Human Nutrition & Health; Asthma; Aged, 80 and over; COPD; business.industry; Humane Voeding & Gezondheid; Public Health, Environmental and Occupational Health; Area under the curve; DIABETES-MELLITUS; Middle Aged; CARE; medicine.disease; TRENDS; Drug Utilization; Hospitalization; Biometris; Population Surveillance; Chronic Disease; GERMANY; Female; Public Health Monitoring; business; Algorithms; Forecasting
- Abstract
Background: Aggregated claims data on medication are often used as a proxy for the prevalence of diseases, especially chronic diseases. However, the links between medication and diagnosis tend to be theory-based and not very precise. Modelling disease probability at the individual level, using individual-level data, may yield more accurate results. Methods: Individual probabilities of having a certain chronic disease were estimated using the Random Forest (RF) algorithm. A training set was created from a general practitioners' database of 276 723 cases that included diagnoses and claims data on medication. Model performance for 29 chronic diseases was evaluated using receiver operating characteristic (ROC) curves, by measuring the Area Under the Curve (AUC). Results: The diseases for which model performance was best were Parkinson’s disease (AUC = .89, 95% CI = .77–1.00), diabetes (AUC = .87, 95% CI = .85–.90), osteoporosis (AUC = .87, 95% CI = .81–.92) and heart failure (AUC = .81, 95% CI = .74–.88). Five other diseases had an AUC > .75: asthma, chronic enteritis, COPD, epilepsy and HIV/AIDS. For 16 of 17 diseases tested, the medication categories used in theory-based algorithms were also identified by our method; however, the RF models included a broader range of medications as important predictors. Conclusion: Data on medication use can be a useful predictor when estimating the prevalence of several chronic diseases. To improve the estimates for a broader range of chronic diseases, research should use better training data, include more details on dosages and duration of prescriptions, and add related predictors such as hospitalizations.
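The two quantities the abstract relies on, the AUC of a model's individual disease probabilities and a prevalence estimate derived from those probabilities, can be illustrated with a minimal Python sketch. It does not train a Random Forest and is not the authors' code: the probabilities are taken as given (for a forest they could be, for example, the share of trees voting "disease"), the toy data are hypothetical, and estimating prevalence as the mean individual probability is an assumption about the aggregation step, which the abstract does not spell out.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the Mann-Whitney U statistic.

    Equals the probability that a randomly chosen case scores higher than
    a randomly chosen non-case, counting ties as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one non-case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def estimated_prevalence(probabilities: Sequence[float]) -> float:
    """Assumed aggregation: prevalence as the mean individual probability."""
    if not probabilities:
        raise ValueError("need at least one probability")
    return sum(probabilities) / len(probabilities)


# Hypothetical toy data: 1 = diagnosis recorded by the GP, 0 = no diagnosis.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
# Hypothetical model output: one disease probability per person.
probs = [0.9, 0.7, 0.4, 0.5, 0.2, 0.1, 0.1, 0.05]

print(f"AUC: {auc(labels, probs):.3f}")
print(f"Estimated prevalence: {estimated_prevalence(probs):.3f}")
print(f"Observed prevalence:  {sum(labels) / len(labels):.3f}")
```

The pairwise AUC loop is quadratic in the number of people, which is fine for an illustration; for a data set the size of the one described, a rank-based implementation from a statistics library would be the more practical choice.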
- Published
- 2019