Rule-based and machine learning algorithms identify patients with systemic sclerosis accurately in the electronic health record
- Source :
- Arthritis Research & Therapy, Vol 21, Iss 1, Pp 1-9 (2019)
- Publication Year :
- 2019
- Publisher :
- BMC, 2019.
Abstract
- Background: Systemic sclerosis (SSc) is a rare disease with studies limited by small sample sizes. Electronic health records (EHRs) represent a powerful tool to study patients with rare diseases such as SSc, but validated methods are needed. We developed and validated EHR-based algorithms that incorporate billing codes and clinical data to identify SSc patients in the EHR.
Methods: We used a de-identified EHR with over 3 million subjects and identified 1899 potential SSc subjects with at least 1 count of the SSc ICD-9 (710.1) or ICD-10-CM (M34*) codes. We randomly selected 200 as a training set for chart review. A subject was a case if diagnosed with SSc by a rheumatologist, dermatologist, or pulmonologist. We selected the following algorithm components based on clinical knowledge and available data: SSc ICD-9 and ICD-10-CM codes, positive antinuclear antibody (ANA) (titer ≥ 1:80), and a keyword of Raynaud’s phenomenon (RP). We performed both rule-based and machine learning techniques for algorithm development. Positive predictive values (PPVs), sensitivities, and F-scores (which account for PPVs and sensitivities) were calculated for the algorithms.
Results: PPVs were low for algorithms using only 1 count of the SSc ICD-9 code. As code counts increased, the PPVs increased. PPVs were higher for algorithms using ICD-10-CM codes versus the ICD-9 code. Adding a positive ANA and RP keyword increased the PPVs of algorithms only using ICD billing codes. Algorithms using ≥ 3 or ≥ 4 counts of the SSc ICD-9 or ICD-10-CM codes and ANA positivity had the highest PPV at 100% but a low sensitivity at 50%. The algorithm with the highest F-score of 91% was ≥ 4 counts of the ICD-9 or ICD-10-CM codes with an internally validated PPV of 90%. A machine learning method using random forests yielded an algorithm with a PPV of 84%, sensitivity of 92%, and F-score of 88%. The most important feature was RP keyword.
Conclusions: Algorithms using only ICD-9 codes did not perform well to identify SSc patients. The highest performing algorithms incorporated clinical data with billing codes. EHR-based algorithms can identify SSc patients across a healthcare system, enabling researchers to examine important outcomes.
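The rule-based approach and the metrics described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the `Subject` fields, the function names, and the five toy records are hypothetical and made up for illustration. It also assumes the F-score is the standard harmonic mean of PPV and sensitivity (F1), which is consistent with the reported random forest figures (PPV 84%, sensitivity 92%, F-score 88%).

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Subject:
    """Hypothetical per-subject record; field names are illustrative only."""
    ssc_code_count: int   # counts of SSc ICD-9 (710.1) or ICD-10-CM (M34*) codes
    ana_positive: bool    # positive ANA, i.e. titer >= 1:80
    rp_keyword: bool      # Raynaud's phenomenon keyword found in the record
    is_case: bool         # chart-review label (the reference standard)


def flagged(s: Subject, min_codes: int = 4,
            require_ana: bool = False, require_rp: bool = False) -> bool:
    """Apply one rule: a code-count threshold plus optional clinical criteria."""
    if s.ssc_code_count < min_codes:
        return False
    if require_ana and not s.ana_positive:
        return False
    if require_rp and not s.rp_keyword:
        return False
    return True


def f_score(ppv: float, sensitivity: float) -> float:
    """Harmonic mean of PPV and sensitivity (assumed to be the paper's F-score)."""
    if ppv + sensitivity == 0:
        return 0.0
    return 2 * ppv * sensitivity / (ppv + sensitivity)


def evaluate(subjects: Iterable[Subject], **rule) -> Tuple[float, float, float]:
    """Return (PPV, sensitivity, F-score) of a rule against chart-review labels."""
    subjects = list(subjects)
    tp = sum(1 for s in subjects if flagged(s, **rule) and s.is_case)
    fp = sum(1 for s in subjects if flagged(s, **rule) and not s.is_case)
    fn = sum(1 for s in subjects if not flagged(s, **rule) and s.is_case)
    ppv = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    return ppv, sensitivity, f_score(ppv, sensitivity)


# Toy records, invented purely to exercise the functions (not study data).
toy = [
    Subject(5, True, True, True),
    Subject(4, False, True, True),
    Subject(1, True, False, True),
    Subject(4, False, False, False),
    Subject(1, False, False, False),
]

print(evaluate(toy, min_codes=4))
print(evaluate(toy, min_codes=3, require_ana=True))
```

On the toy records, adding the ANA requirement removes the false positive but also drops a true case, the same PPV versus sensitivity trade-off the Results describe for the strictest algorithms.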
- Subjects :
- Adult
Male
lcsh:Diseases of the musculoskeletal system
Databases, Factual
Computer science
Bioinformatics
Machine learning
Sensitivity and Specificity
Clinical knowledge
Machine Learning
03 medical and health sciences
0302 clinical medicine
International Classification of Diseases
Electronic health record
Chart review
Humans
Electronic health records
030212 general & internal medicine
Aged
Aged, 80 and over
030203 arthritis & rheumatology
Scleroderma, Systemic
Training set
Reproducibility of Results
Rule-based system
Small sample
Middle Aged
3. Good health
Random forest
Feature (computer vision)
Systemic sclerosis
Female
Artificial intelligence
lcsh:RC925-935
Algorithm
Algorithms
Research Article
Details
- Language :
- English
- ISSN :
- 1478-6362
- Volume :
- 21
- Issue :
- 1
- Database :
- OpenAIRE
- Journal :
- Arthritis Research & Therapy
- Accession number :
- edsair.doi.dedup.....562afa234fa0a168ab890ed5e5ae710f