66 results on '"Denny, JC."'
Search Results
2. PheWAS analysis on large-scale biobank data with PheTK.
- Author
Tran TC, Schlueter DJ, Zeng C, Mo H, Carroll RJ, and Denny JC
- Subjects
- Humans, Electronic Health Records, Phenotype, Genome-Wide Association Study methods, Phenomics methods, Databases, Genetic, Biological Specimen Banks, Software
- Abstract
Summary: With the rapid growth of genetic data linked to electronic health record (EHR) data in huge cohorts, large-scale phenome-wide association studies (PheWAS) have become powerful discovery tools in biomedical research. PheWAS is an analysis method to study phenotype associations utilizing longitudinal EHR data. Previous PheWAS packages were developed mostly with smaller datasets and with earlier PheWAS approaches. PheTK was designed to simplify analysis and efficiently handle biobank-scale data. PheTK uses multithreading and supports a full PheWAS workflow including extraction of data from OMOP databases and Hail matrix tables as well as PheWAS analysis for both phecode version 1.2 and phecodeX. Benchmarking results showed PheTK took 64% less time than the R PheWAS package to complete the same workflow. PheTK can be run locally or on cloud platforms such as the All of Us Researcher Workbench (All of Us) or the UK Biobank (UKB) Research Analysis Platform (RAP). Availability and Implementation: The PheTK package is freely available on the Python Package Index, on GitHub under GNU General Public License (GPL-3) at https://github.com/nhgritctran/PheTK, and on Zenodo, DOI 10.5281/zenodo.14217954, at https://doi.org/10.5281/zenodo.14217954. PheTK is implemented in Python and platform independent. (Published by Oxford University Press 2024.)
- Published
- 2024
3. Informatics innovation to provide return of value to participant communities in the All of Us Research Program.
- Author
Mapes BM, Peterson RS, Watson K, Basford M, Cohn E, Harris PA, and Denny JC
- Subjects
- United States, Humans, Precision Medicine, Biomedical Research, Access to Information, Medical Informatics
- Abstract
Objectives: The All of Us Research Program harnesses advances in technology, science, and engagement for precision medicine research. We describe informatics innovations which support that goal and return value to the participant cohort and community. Materials and Methods: Research data from the All of Us Research Program are available to authorized users on the All of Us Researcher Workbench. We describe the technical infrastructure that enables data access and usage for researchers. Participants are considered partners. To ensure return of value, we outline participant access to information. Results: The All of Us Research Hub allows broad access to data, regardless of background. The innovations described are rooted in the program's core values: participation is open and reflects the diversity of the United States; participants are partners and have access to their information; transparency, security, and privacy are of the highest importance; data are broadly accessible; and the program promotes positive change. We assess research impact and reflect on how All of Us can increase existing return of value to participant communities through future informatics advancements. Discussion: The program will continue to support efforts to ensure equitable access to data and return of value to participants. Looking ahead, we invite the community to join us. Conclusion: All of Us research findings can change clinical care, inform guidelines, and set a new bar for data sharing. The ultimate return of value is better care for all. (Published by Oxford University Press on behalf of the American Medical Informatics Association 2024.)
- Published
- 2024
4. Comparison of phenomic profiles in the All of Us Research Program against the US general population and the UK Biobank.
- Author
Zeng C, Schlueter DJ, Tran TC, Babbar A, Cassini T, Bastarache LA, and Denny JC
- Subjects
- Humans, Biological Specimen Banks, UK Biobank, Phenotype, United Kingdom epidemiology, Phenomics, Population Health
- Abstract
Importance: Knowledge gained from cohort studies has dramatically advanced both public and precision health. The All of Us Research Program seeks to enroll 1 million diverse participants who share multiple sources of data, providing unique opportunities for research. It is important to understand the phenomic profiles of its participants to conduct research in this cohort. Objectives: More than 280 000 participants have shared their electronic health records (EHRs) in the All of Us Research Program. We aim to understand the phenomic profiles of this cohort through comparisons with those in the US general population and a well-established nation-wide cohort, UK Biobank, and to test whether association results of selected commonly studied diseases in the All of Us cohort were comparable to those in UK Biobank. Materials and Methods: We included participants with EHRs in All of Us and participants with health records from UK Biobank. The estimates of prevalence of diseases in the US general population were obtained from the Global Burden of Diseases (GBD) study. We conducted phenome-wide association studies (PheWAS) of 9 commonly studied diseases in both cohorts. Results: This study included 287 012 participants from the All of Us EHR cohort and 502 477 participants from the UK Biobank. A total of 314 diseases curated by the GBD were evaluated in All of Us, 80.9% (N = 254) of which were more common in All of Us than in the US general population [prevalence ratio (PR) >1.1, P < 2 × 10⁻⁵]. Among 2515 diseases and phenotypes evaluated in both All of Us and UK Biobank, 85.6% (N = 2152) were more common in All of Us (PR >1.1, P < 2 × 10⁻⁵). The Pearson correlation coefficients of effect sizes from PheWAS between All of Us and UK Biobank were 0.61, 0.50, 0.60, 0.57, 0.40, 0.53, 0.46, 0.47, and 0.24 for ischemic heart diseases, lung cancer, chronic obstructive pulmonary disease, dementia, colorectal cancer, lower back pain, multiple sclerosis, lupus, and cystic fibrosis, respectively. Discussion: Despite the differences in prevalence of diseases in All of Us compared to the US general population or the UK Biobank, our study supports that All of Us can facilitate rapid investigation of a broad range of diseases. Conclusion: Most diseases were more common in All of Us than in the general US population or the UK Biobank. Results of disease-disease association tests from All of Us are comparable to those estimated in another well-studied national cohort. (Published by Oxford University Press on behalf of the American Medical Informatics Association 2024.)
- Published
- 2024
5. Systematic replication of smoking disease associations using survey responses and EHR data in the All of Us Research Program.
- Author
Schlueter DJ, Sulieman L, Mo H, Keaton JM, Ferrara TM, Williams A, Qian J, Stubblefield O, Zeng C, Tran TC, Bastarache L, Dai J, Babbar A, Ramirez A, Goleva SB, and Denny JC
- Subjects
- Humans, Phenotype, Polymorphism, Single Nucleotide, Smoking, Genome-Wide Association Study methods, Population Health
- Abstract
Objective: The All of Us Research Program (All of Us) aims to recruit over a million participants to further precision medicine. Essential to the verification of biobanks is a replication of known associations to establish validity. Here, we evaluated how well All of Us data replicated known cigarette smoking associations. Materials and Methods: We defined smoking exposure as follows: (1) an EHR Smoking exposure that used International Classification of Disease codes; (2) participant-provided information (PPI) Ever Smoking; and (3) PPI Current Smoking, both from the lifestyle survey. We performed a phenome-wide association study (PheWAS) for each smoking exposure measurement type. For each, we compared the effect sizes derived from the PheWAS to published meta-analyses that studied cigarette smoking from PubMed. We defined two levels of replication of meta-analyses: (1) nominally replicated, which required agreement of direction of effect size, and (2) fully replicated, which required overlap of confidence intervals. Results: PheWASes with EHR Smoking, PPI Ever Smoking, and PPI Current Smoking revealed 736, 492, and 639 phenome-wide significant associations, respectively. We identified 165 meta-analyses representing 99 distinct phenotypes that could be matched to EHR phenotypes. At P < .05, 74 were nominally replicated and 55 were fully replicated. At P < 2.68 × 10⁻⁵ (Bonferroni threshold), 58 were nominally replicated and 40 were fully replicated. Discussion: Most phenotypes found in published meta-analyses associated with smoking were nominally replicated in All of Us. Both survey and EHR definitions for smoking produced similar results. Conclusion: This study demonstrated the feasibility of studying common exposures using All of Us data. (Published by Oxford University Press on behalf of the American Medical Informatics Association 2023.)
- Published
- 2023
6. The Impact of COVID-19 on the All of Us Research Program.
- Author
Hedden SL, McClain J, Mandich A, Baskir R, Caulder MS, Denny JC, Hamlet MRJ, Prabhu Das I, McNeil Ford N, Lopez-Class M, Elmi A, Wallace R, Linkie A, and Garriock HA
- Subjects
- Humans, Pandemics prevention & control, Data Collection, COVID-19 epidemiology, Population Health
- Abstract
The All of Us Research Program, a health and genetics epidemiologic data collection program, has been substantially affected by the coronavirus disease 2019 (COVID-19) pandemic. Although the program is highly digital in nature, certain aspects of the data collection require in-person interaction between staff and participants. Before the pandemic, the program was enrolling approximately 12,500 participants per month at more than 400 clinical sites. In March 2020, because of the pandemic, all in-person activity at program sites and by engagement partners was paused to develop processes and procedures for in-person activities that incorporated strict safety protocols. In addition, the program adopted new data collection methodologies to reduce the need for in-person activities. Through February 2022, a total of 224 clinical sites had reactivated in-person activity, and all enrollment and engagement partners have adopted new data collection methods that can be used remotely. As the COVID-19 pandemic persists, the program continues to require safety procedures for in-person activity and continues to generate and pilot methodologies that reduce risk and make it easier for participants to provide information. (Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health 2022. This work is written by (a) US Government employee(s) and is in the public domain in the US.)
- Published
- 2023
7. Genetically predicted sex hormone levels and health outcomes: phenome-wide Mendelian randomization investigation.
- Author
Yuan S, Wang L, Sun J, Yu L, Zhou X, Yang J, Zhu Y, Gill D, Burgess S, Denny JC, Larsson SC, Theodoratou E, and Li X
- Subjects
- Female, Humans, Male, Gonadal Steroid Hormones, Mendelian Randomization Analysis, Estradiol blood, Estradiol genetics, Sex Hormone-Binding Globulin genetics, Testosterone blood, Testosterone genetics
- Abstract
Background: Sex hormone-binding globulin (SHBG), testosterone and oestradiol have been associated with many diseases in observational studies; however, the causality of associations remains unestablished. Methods: A phenome-wide Mendelian randomization (MR) association study was performed to explore disease outcomes associated with genetically proxied circulating SHBG, testosterone and oestradiol levels by using updated genetic instruments in 339 197 unrelated White British individuals (54% female) in the UK Biobank. Two-sample MR analyses with data from large genetic studies were conducted to replicate identified associations in phenome-wide MR analyses. Multivariable MR analyses were performed to investigate mediation effects of hormone-related biomarkers in observed associations with diseases. Results: Phenome-wide MR analyses examined associations of genetically predicted SHBG, testosterone and oestradiol levels with 1211 disease outcomes, and identified 28 and 13 distinct phenotypes associated with genetically predicted SHBG and testosterone, respectively; 22 out of 28 associations for SHBG and 10 out of 13 associations for testosterone were replicated in two-sample MR analyses. Higher genetically predicted SHBG levels were associated with a reduced risk of hypertension, type 2 diabetes, diabetic complications, coronary atherosclerotic outcomes, gout and benign and malignant neoplasm of uterus, but an increased risk of varicose veins and fracture (mainly in females). Higher genetically predicted testosterone levels were associated with a lower risk of type 2 diabetes, coronary atherosclerotic outcomes, gout and coeliac disease mainly in males, but an increased risk of cholelithiasis in females. Conclusions: These findings suggest that sex hormones may causally affect risk of several health outcomes. (© The Author(s) 2022. Published by Oxford University Press on behalf of the International Epidemiological Association.)
- Published
- 2022
8. Cox regression is robust to inaccurate EHR-extracted event time: an application to EHR-based GWAS.
- Author
Irlmeier R, Hughey JJ, Bastarache L, Denny JC, and Chen Q
- Subjects
- Proportional Hazards Models, Logistic Models, Genotype, Computer Simulation, Genome-Wide Association Study, Electronic Health Records
- Abstract
Motivation: Logistic regression models are used in genomic studies to analyze the genetic data linked to electronic health records (EHRs), and do not make full use of the time-to-event information available in EHRs. Previous work has shown that Cox regression, which can account for left truncation and right censoring in EHRs, increased the power to detect genotype-phenotype associations compared to logistic regression. We extend this to evaluate the relative performance of Cox regression and various logistic regression models in the presence of positive errors in event time (delayed event time), relating to recorded event time accuracy. Results: One Cox model and three logistic regression models were considered under different scenarios of delayed event time. Extensive simulations and a genomic study application were used to evaluate the impact of delayed event time. While logistic regression does not model the time-to-event directly, various logistic regression models used in the literature were more sensitive to delayed event time than Cox regression. Results highlighted the importance of identifying and excluding the patients diagnosed before entry time. Cox regression had similar or modest improvement in statistical power over various logistic regression models at controlled type I error. This was supported by the empirical data, where the Cox models steadily had the highest sensitivity to detect known genotype-phenotype associations under all scenarios of delayed event time. Availability and Implementation: Access to individual-level EHR and genotype data is restricted by the IRB. Simulation code and R script for data processing are at: https://github.com/QingxiaCindyChen/CoxRobustEHR.git. Supplementary Information: Supplementary data are available at Bioinformatics online. (© The Author(s) 2022. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.)
- Published
- 2022
9. Antibodies to Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) in All of Us Research Program Participants, 2 January to 18 March 2020.
- Author
Althoff KN, Schlueter DJ, Anton-Culver H, Cherry J, Denny JC, Thomsen I, Karlson EW, Havers FP, Cicek MS, Thibodeau SN, Pinto LA, Lowy D, Malin BA, Ohno-Machado L, Williams C, Goldstein D, Kouame A, Ramirez A, Roman A, Sharpless NE, Gebo KA, and Schully SD
- Subjects
- Antibodies, Viral, Enzyme-Linked Immunosorbent Assay, Humans, Immunoglobulin G, SARS-CoV-2, Sensitivity and Specificity, COVID-19 diagnosis, Population Health
- Abstract
Background: With limited severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) testing capacity in the United States at the start of the epidemic (January-March 2020), testing was focused on symptomatic patients with a travel history throughout February, obscuring the picture of SARS-CoV-2 seeding and community transmission. We sought to identify individuals with SARS-CoV-2 antibodies in the early weeks of the US epidemic. Methods: All of Us study participants in all 50 US states provided blood specimens during study visits from 2 January to 18 March 2020. Participants were considered seropositive if they tested positive for SARS-CoV-2 immunoglobulin G (IgG) antibodies with the Abbott Architect SARS-CoV-2 IgG enzyme-linked immunosorbent assay (ELISA) and the EUROIMMUN SARS-CoV-2 ELISA in a sequential testing algorithm. The sensitivity and specificity of these ELISAs and the net sensitivity and specificity of the sequential testing algorithm were estimated, along with 95% confidence intervals (CIs). Results: The estimated sensitivities of the Abbott and EUROIMMUN assays were 100% (107 of 107 [95% CI: 96.6%-100%]) and 90.7% (97 of 107 [83.5%-95.4%]), respectively, and the estimated specificities were 99.5% (995 of 1000 [98.8%-99.8%]) and 99.7% (997 of 1000 [99.1%-99.9%]), respectively. The net sensitivity and specificity of our sequential testing algorithm were 90.7% (97 of 107 [95% CI: 83.5%-95.4%]) and 100.0% (1000 of 1000 [99.6%-100%]), respectively. Of the 24 079 study participants with blood specimens from 2 January to 18 March 2020, 9 were seropositive, 7 before the first confirmed case in the states of Illinois, Massachusetts, Wisconsin, Pennsylvania, and Mississippi. Conclusions: Our findings identified SARS-CoV-2 infections weeks before the first recognized cases in 5 US states. (Published by Oxford University Press for the Infectious Diseases Society of America 2021.)
- Published
- 2022
10. PheWAS-ME: a web-app for interactive exploration of multimorbidity patterns in PheWAS.
- Author
Strayer N, Shirey-Rice JK, Shyr Y, Denny JC, Pulley JM, and Xu Y
- Subjects
- Genome-Wide Association Study, Genotype, Humans, Multimorbidity, Phenotype, Mobile Applications, Polymorphism, Single Nucleotide
- Abstract
Summary: Electronic health records (EHRs) linked with a DNA biobank provide unprecedented opportunities for biomedical research in precision medicine. The Phenome-wide association study (PheWAS) is a widely used technique for the evaluation of relationships between genetic variants and a large collection of clinical phenotypes recorded in EHRs. PheWAS analyses are typically presented as static tables and charts of summary statistics obtained from statistical tests of association between a genetic variant and individual phenotypes. Comorbidities are common and typically lead to complex, multivariate gene-disease association signals that are challenging to interpret. Discovering and interrogating multimorbidity patterns and their influence in PheWAS is difficult and time-consuming. We present PheWAS-ME: an interactive dashboard to visualize individual-level genotype and phenotype data side-by-side with PheWAS analysis results, allowing researchers to explore multimorbidity patterns and their associations with a genetic variant of interest. We expect this application to enrich PheWAS analyses by illuminating clinical multimorbidity patterns present in the data. Availability and Implementation: A demo PheWAS-ME application is publicly available at https://prod.tbilab.org/phewas_me/. Sample datasets are provided for exploration with the option to upload custom PheWAS results and corresponding individual-level data. Online versions of the appendices are available at https://prod.tbilab.org/phewas_me_info/. The source code is available as an R package on GitHub (https://github.com/tbilab/multimorbidity_explorer). Supplementary Information: Supplementary data are available at Bioinformatics online. (© The Author(s) 2020. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.)
- Published
- 2021
11. DDIWAS: High-throughput electronic health record-based screening of drug-drug interactions.
- Author
Wu P, Nelson SD, Zhao J, Stone CA Jr, Feng Q, Chen Q, Larson EA, Li B, Cox NJ, Stein CM, Phillips EJ, Roden DM, Denny JC, and Wei WQ
- Subjects
- Drug Interactions, Electronic Health Records, Humans, Knowledge Bases, Drug-Related Side Effects and Adverse Reactions, Pharmaceutical Preparations
- Abstract
Objective: We developed and evaluated Drug-Drug Interaction Wide Association Study (DDIWAS). This novel method detects potential drug-drug interactions (DDIs) by leveraging data from the electronic health record (EHR) allergy list. Materials and Methods: To identify potential DDIs, DDIWAS scans for drug pairs that are frequently documented together on the allergy list. Using deidentified medical records, we tested 616 drugs for potential DDIs with simvastatin (a common lipid-lowering drug) and amlodipine (a common blood-pressure lowering drug). We evaluated the performance to rediscover known DDIs using existing knowledge bases and domain expert review. To validate potential novel DDIs, we manually reviewed patient charts and searched the literature. Results: DDIWAS replicated 34 known DDIs. The positive predictive value to detect known DDIs was 0.85 and 0.86 for simvastatin and amlodipine, respectively. DDIWAS also discovered potential novel interactions between simvastatin-hydrochlorothiazide, amlodipine-omeprazole, and amlodipine-valacyclovir. A software package to conduct DDIWAS is publicly available. Conclusions: In this proof-of-concept study, we demonstrate the value of incorporating information mined from existing allergy lists to detect DDIs in a real-world clinical setting. Since allergy lists are routinely collected in EHRs, DDIWAS has the potential to detect and validate DDI signals across institutions. (© The Author(s) 2021. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: journals.permissions@oup.com.)
- Published
- 2021
12. Meeting the challenge: Health information technology's essential role in achieving precision medicine.
- Author
Zayas-Cabán T, Chaney KJ, Rogers CC, Denny JC, and White PJ
- Subjects
- Delivery of Health Care, Humans, Medical Informatics, Precision Medicine
- Abstract
Precision medicine can revolutionize health care by tailoring treatments to individual patient needs. Advancing precision medicine requires evidence development through research that combines needed data, including clinical data, at an unprecedented scale. Widespread adoption of health information technology (IT) has made digital clinical data broadly available. These data and information systems must evolve to support precision medicine research and delivery. Specifically, relevant health IT data, infrastructure, clinical integration, and policy needs must be addressed. This article outlines those needs and describes work the Office of the National Coordinator for Health Information Technology is leading to improve health IT through pilot projects and standards and policy development. The Office of the National Coordinator for Health Information Technology will build on these efforts and continue to coordinate with other key stakeholders to achieve the vision of precision medicine. Advancement of precision medicine will require ongoing, collaborative health IT policy and technical initiatives that advance discovery and transform healthcare delivery. (Published by Oxford University Press on behalf of the American Medical Informatics Association 2021. This work is written by US Government employees and is in the public domain in the US.)
- Published
- 2021
13. Real-time clinical note monitoring to detect conditions for rapid follow-up: A case study of clinical trial enrollment in drug-induced torsades de pointes and Stevens-Johnson syndrome.
- Author
DeLozier S, Speltz P, Brito J, Tang LA, Wang J, Smith JC, Giuse D, Phillips E, Williams K, Strickland T, Davogustto G, Roden D, and Denny JC
- Subjects
- Adult, Data Mining, Female, Humans, Male, Middle Aged, Prospective Studies, Rare Diseases diagnosis, Torsades de Pointes diagnosis, Drug-Related Side Effects and Adverse Reactions diagnosis, Electronic Health Records, Medical Order Entry Systems, Stevens-Johnson Syndrome diagnosis, Torsades de Pointes chemically induced
- Abstract
Identifying acute events as they occur is challenging in large hospital systems. Here, we describe an automated method to detect 2 rare adverse drug events (ADEs), drug-induced torsades de pointes and Stevens-Johnson syndrome and toxic epidermal necrolysis, in near real time for participant recruitment into prospective clinical studies. A text processing system searched clinical notes from the electronic health record (EHR) for relevant keywords and alerted study personnel via email of potential patients for chart review or in-person evaluation. Between 2016 and 2018, the automated recruitment system resulted in capture of 138 true cases of drug-induced rare events, improving recall from 43% to 93%. Our focused electronic alert system maintained 2-year enrollment, including across an EHR migration from a bespoke system to Epic. Real-time monitoring of EHR notes may accelerate research for certain conditions less amenable to conventional study recruitment paradigms. (© The Author(s) 2020. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: journals.permissions@oup.com.)
- Published
- 2021
14. PheMap: a multi-resource knowledge base for high-throughput phenotyping within electronic health records.
- Author
Zheng NS, Feng Q, Kerchberger VE, Zhao J, Edwards TL, Cox NJ, Stein CM, Roden DM, Denny JC, and Wei WQ
- Subjects
- Adult, Dementia genetics, Diabetes Mellitus, Type 2 genetics, Genome-Wide Association Study, Humans, Hypothyroidism genetics, Natural Language Processing, Polymorphism, Single Nucleotide, Terminology as Topic, Algorithms, Electronic Health Records, Information Storage and Retrieval methods, Knowledge Bases, Phenotype
- Abstract
Objective: Developing algorithms to extract phenotypes from electronic health records (EHRs) can be challenging and time-consuming. We developed PheMap, a high-throughput phenotyping approach that leverages multiple independent, online resources to streamline the phenotyping process within EHRs. Materials and Methods: PheMap is a knowledge base of medical concepts with quantified relationships to phenotypes that have been extracted by natural language processing from publicly available resources. PheMap searches EHRs for each phenotype's quantified concepts and uses them to calculate an individual's probability of having this phenotype. We compared PheMap to clinician-validated phenotyping algorithms from the Electronic Medical Records and Genomics (eMERGE) network for type 2 diabetes mellitus (T2DM), dementia, and hypothyroidism using 84 821 individuals from Vanderbilt University Medical Center's BioVU DNA Biobank. We implemented PheMap-based phenotypes for genome-wide association studies (GWAS) for T2DM, dementia, and hypothyroidism, and phenome-wide association studies (PheWAS) for variants in FTO, HLA-DRB1, and TCF7L2. Results: In this initial iteration, the PheMap knowledge base contains quantified concepts for 841 disease phenotypes. For T2DM, dementia, and hypothyroidism, the accuracy of the PheMap phenotypes was >97% using a 50% threshold and eMERGE case-control status as a reference standard. In the GWAS analyses, PheMap-derived phenotype probabilities replicated 43 of 51 previously reported disease-associated variants for the 3 phenotypes. For 9 of the 11 top associations, PheMap provided an equivalent or more significant P value than eMERGE-based phenotypes. The PheMap-based PheWAS showed comparable or better performance to a traditional phecode-based PheWAS. PheMap is publicly available online. Conclusions: PheMap significantly streamlines the process of extracting research-quality phenotype information from EHRs, with comparable or better performance to current phenotyping approaches. (© The Author(s) 2020. Published by Oxford University Press on behalf of the American Medical Informatics Association.)
- Published
- 2020
15. A Polygenic and Phenotypic Risk Prediction for Polycystic Ovary Syndrome Evaluated by Phenome-Wide Association Studies.
- Author
Joo YY, Actkins K, Pacheco JA, Basile AO, Carroll R, Crosslin DR, Day F, Denny JC, Velez Edwards DR, Hakonarson H, Harley JB, Hebbring SJ, Ho K, Jarvik GP, Jones M, Karaderi T, Mentch FD, Meun C, Namjou B, Pendergrass S, Ritchie MD, Stanaway IB, Urbanek M, Walunas TL, Smith M, Chisholm RL, Kho AN, Davis L, and Hayes MG
- Subjects
- Adolescent, Aged, Case-Control Studies, Child, Electronic Health Records, Female, Follow-Up Studies, Genetic Predisposition to Disease, Humans, Middle Aged, Polycystic Ovary Syndrome epidemiology, Polycystic Ovary Syndrome genetics, Prognosis, Risk Factors, Algorithms, Genome-Wide Association Study, Multifactorial Inheritance genetics, Phenomics methods, Phenotype, Polycystic Ovary Syndrome diagnosis
- Abstract
Context: As many as 75% of patients with polycystic ovary syndrome (PCOS) are estimated to be unidentified in clinical practice. Objective: Utilizing polygenic risk prediction, we aim to identify the phenome-wide comorbidity patterns characteristic of PCOS to improve accurate diagnosis and preventive treatment. Design, Patients, and Methods: Leveraging the electronic health records (EHRs) of 124 852 individuals, we developed a PCOS risk prediction algorithm by combining polygenic risk scores (PRS) with PCOS component phenotypes into a polygenic and phenotypic risk score (PPRS). We evaluated its predictive capability across different ancestries and performed a PRS-based phenome-wide association study (PheWAS) to assess the phenomic expression of the heightened risk of PCOS. Results: The integrated polygenic prediction improved the average performance (pseudo-R²) for PCOS detection by 0.228 (61.5-fold), 0.224 (58.8-fold), and 0.211 (57.0-fold) over the null model across European, African, and multi-ancestry participants, respectively. The subsequent PRS-powered PheWAS identified a high level of shared biology between PCOS and a range of metabolic and endocrine outcomes, especially with obesity and diabetes: "morbid obesity", "type 2 diabetes", "hypercholesterolemia", "disorders of lipid metabolism", "hypertension", and "sleep apnea" reaching phenome-wide significance. Conclusions: Our study has expanded the methodological utility of PRS in patient stratification and risk prediction, especially in a multifactorial condition like PCOS, across different genetic origins. By utilizing the individual genome-phenome data available from the EHR, our approach also demonstrates that polygenic prediction by PRS can provide valuable opportunities to discover the pleiotropic phenomic network associated with PCOS pathogenesis. (© Endocrine Society 2020. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.)
- Published
- 2020
- Full Text
- View/download PDF
16. Evaluating the Utility of Polygenic Risk Scores in Identifying High-Risk Individuals for Eight Common Cancers.
- Author
-
Jia G, Lu Y, Wen W, Long J, Liu Y, Tao R, Li B, Denny JC, Shu XO, and Zheng W
- Abstract
Background: Genome-wide association studies have identified common genetic risk variants in many loci associated with multiple cancers. We sought to systematically evaluate the utility of these risk variants in identifying high-risk individuals for eight common cancers., Methods: We constructed polygenic risk scores (PRS) using genome-wide association studies-identified risk variants for each cancer. Using data from 400 812 participants of European descent in a population-based cohort study, UK Biobank, we estimated hazard ratios associated with PRS using Cox proportional hazard models and evaluated the performance of the PRS in cancer risk prediction and their ability to identify individuals at more than a twofold elevated risk, a risk level comparable to a moderate-penetrance mutation in known cancer predisposition genes., Results: During a median follow-up of 5.8 years, 14 584 incident case patients of cancers were identified (ranging from 358 epithelial ovarian cancer case patients to 4430 prostate cancer case patients). Compared with those at an average risk, individuals among the highest 5% of the PRS had a two- to threefold elevated risk for cancer of the prostate, breast, pancreas, colorectal, or ovary, and an approximately 1.5-fold elevated risk of cancer of the lung, bladder, or kidney. The areas under the curve ranged from 0.567 to 0.662. Using PRS, 40.4% of the study participants can be classified as having more than a twofold elevated risk for at least one site-specific cancer., Conclusions: A large proportion of the general population can be identified at an elevated cancer risk by PRS, supporting the potential clinical utility of PRS for personalized cancer risk prediction., (© The Author(s) 2020. Published by Oxford University Press.)
- Published
- 2020
- Full Text
- View/download PDF
17. medExtractR: A targeted, customizable approach to medication extraction from electronic health records.
- Author
-
Weeks HL, Beck C, McNeer E, Williams ML, Bejan CA, Denny JC, and Choi L
- Subjects
- Datasets as Topic, Drug Therapy, Humans, Programming Languages, Algorithms, Data Mining methods, Electronic Health Records, Natural Language Processing, Pharmaceutical Preparations, Software
- Abstract
Objective: We developed medExtractR, a natural language processing system to extract medication information from clinical notes. Using a targeted approach, medExtractR focuses on individual drugs to facilitate creation of medication-specific research datasets from electronic health records., Materials and Methods: Written using the R programming language, medExtractR combines lexicon dictionaries and regular expressions to identify relevant medication entities (eg, drug name, strength, frequency). MedExtractR was developed on notes from Vanderbilt University Medical Center, using medications prescribed with varying complexity. We evaluated medExtractR and compared it with 3 existing systems: MedEx, MedXN, and CLAMP (Clinical Language Annotation, Modeling, and Processing). We also demonstrated how medExtractR can be easily tuned for better performance on an outside dataset using the MIMIC-III (Medical Information Mart for Intensive Care III) database., Results: On 50 test notes per development drug and 110 test notes for an additional drug, medExtractR achieved high overall performance (F-measures >0.95), exceeding performance of the 3 existing systems across all drugs. MedExtractR achieved the highest F-measure for each individual entity, except drug name and dose amount for allopurinol. With tuning and customization, medExtractR achieved F-measures >0.90 in the MIMIC-III dataset., Discussion: The medExtractR system successfully extracted entities for medications of interest. High performance in entity-level extraction provides a strong foundation for developing robust research datasets for pharmacological research. When working with new datasets, medExtractR should be tuned on a small sample of notes before being broadly applied., Conclusions: The medExtractR system achieved high performance extracting specific medications from clinical text, leading to higher-quality research datasets for drug-related studies than some existing general-purpose medication extraction tools., (© The Author(s) 2020. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: journals.permissions@oup.com.)
- Published
- 2020
- Full Text
- View/download PDF
18. Improving the phenotype risk score as a scalable approach to identifying patients with Mendelian disease.
- Author
-
Bastarache L, Hughey JJ, Goldstein JA, Bastarache JA, Das S, Zaki NC, Zeng C, Tang LA, Roden DM, and Denny JC
- Subjects
- Adult, Child, Cystic Fibrosis, Genetic Diseases, Inborn genetics, Humans, International Classification of Diseases, Risk Factors, Data Mining methods, Electronic Health Records, Genetic Diseases, Inborn diagnosis, Phenotype
- Abstract
Objective: The Phenotype Risk Score (PheRS) is a method to detect Mendelian disease patterns using phenotypes from the electronic health record (EHR). We compared the performance of different approaches mapping EHR phenotypes to Mendelian disease features., Materials and Methods: PheRS utilizes Mendelian disease descriptions annotated with Human Phenotype Ontology (HPO) terms. In previous work, we presented a map linking phecodes (based on International Classification of Diseases [ICD]-Ninth Revision) to HPO terms. For this study, we integrated ICD-Tenth Revision codes and lab data. We also created a new map between HPO terms using customized groupings of ICD codes. We compared the performance with cases and controls for 16 Mendelian diseases using 2.5 million de-identified medical records., Results: PheRS effectively distinguished cases from controls for all 15 positive controls and all approaches tested (P < 4 × 10⁻¹⁶). Adding lab data led to a statistically significant improvement for 4 of 14 diseases. The custom ICD groupings improved specificity, leading to an average 8% increase for precision at 100 (-2% to 22%). Eight of 10 adults with cystic fibrosis tested had PheRS in the 95th percentile prior to diagnosis., Discussion: Both phecodes and custom ICD groupings were able to detect differences between affected cases and controls at the population level. The ICD map showed better precision for the highest scoring individuals. Adding lab data improved performance at detecting population-level differences., Conclusions: PheRS is a scalable method to study Mendelian disease at the population level using electronic health record data and can potentially be used to find patients with undiagnosed Mendelian disease., (© The Author(s) 2019. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: journals.permissions@oup.com.)
- Published
- 2019
- Full Text
- View/download PDF
19. Cost-aware active learning for named entity recognition in clinical text.
- Author
-
Wei Q, Chen Y, Salimi M, Denny JC, Mei Q, Lasko TA, Chen Q, Wu S, Franklin A, Cohen T, and Xu H
- Subjects
- Big Data, Computer Simulation, Humans, Models, Economic, Algorithms, Electronic Health Records economics, Information Storage and Retrieval economics, Natural Language Processing
- Abstract
Objective: Active Learning (AL) attempts to reduce annotation cost (ie, time) by selecting the most informative examples for annotation. Most approaches tacitly (and unrealistically) assume that the cost for annotating each sample is identical. This study introduces a cost-aware AL method, which simultaneously models both the annotation cost and the informativeness of the samples and evaluates both via simulation and user studies., Materials and Methods: We designed a novel, cost-aware AL algorithm (Cost-CAUSE) for annotating clinical named entities; we first utilized lexical and syntactic features to estimate annotation cost, then we incorporated this cost measure into an existing AL algorithm. Using the 2010 i2b2/VA data set, we then conducted a simulation study comparing Cost-CAUSE with noncost-aware AL methods, and a user study comparing Cost-CAUSE with passive learning., Results: Our cost model fit empirical annotation data well, and Cost-CAUSE increased the simulation area under the learning curve (ALC) scores by up to 5.6% and 4.9%, compared with random sampling and alternate AL methods. Moreover, in a user annotation task, Cost-CAUSE outperformed passive learning on the ALC score and reduced annotation time by 20.5%-30.2%., Discussion: Although AL has proven effective in simulations, our user study shows that a real-world environment is far more complex. Other factors have a noticeable effect on the AL method, such as the annotation accuracy of users, the tiredness of users, and even the physical and mental condition of users., Conclusion: Cost-CAUSE saves significant annotation cost compared to random sampling., (© The Author(s) 2019. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: journals.permissions@oup.com.)
- Published
- 2019
- Full Text
- View/download PDF
20. Phenome-wide Mendelian-randomization study of genetically determined vitamin D on multiple health outcomes using the UK Biobank study.
- Author
-
Meng X, Li X, Timofeeva MN, He Y, Spiliopoulou A, Wei WQ, Gifford A, Wu H, Varley T, Joshi P, Denny JC, Farrington SM, Zgaga L, Dunlop MG, McKeigue P, Campbell H, and Theodoratou E
- Subjects
- Adult, Age Distribution, Aged, Biological Specimen Banks, Blood Pressure, Body Mass Index, Databases, Factual, Depression epidemiology, Diabetes Mellitus, Type 2 epidemiology, Female, Fractures, Bone epidemiology, Genetic Predisposition to Disease, Genome-Wide Association Study, Health Behavior, Humans, Hypertension epidemiology, Male, Mendelian Randomization Analysis, Middle Aged, Mortality, Myocardial Ischemia epidemiology, Phenotype, Polymorphism, Single Nucleotide, Sexism, Socioeconomic Factors, United Kingdom epidemiology, Vitamin D Deficiency epidemiology, Vitamin D Deficiency genetics
- Abstract
Background: Vitamin D deficiency is highly prevalent across the globe. Existing studies suggest that a low vitamin D level is associated with more than 130 outcomes. Exploring the causal role of vitamin D in health outcomes could support or question vitamin D supplementation., Methods: We carried out a systematic literature review of previous Mendelian-randomization studies on vitamin D. We then implemented a Mendelian Randomization-Phenome Wide Association Study (MR-PheWAS) analysis on data from 339 256 individuals of White British origin from UK Biobank. We first ran a PheWAS analysis to test the associations between a 25(OH)D polygenic risk score and 920 disease outcomes, and then nine phenotypes (i.e. systolic blood pressure, diastolic blood pressure, risk of hypertension, T2D, ischaemic heart disease, body mass index, depression, non-vertebral fracture and all-cause mortality) that met the pre-defined inclusion criteria for further analysis were examined by multiple MR analytical approaches to explore causality., Results: The PheWAS analysis did not identify any health outcome associated with the 25(OH)D polygenic risk score. Although a selection of nine outcomes were reported in previous Mendelian-randomization studies or umbrella reviews to be associated with vitamin D, our MR analysis, with substantial study power (>80% power to detect an association with an odds ratio >1.2 for per standard deviation increase of log-transformed 25[OH]D), was unable to support an interpretation of causal association., Conclusions: We investigated the putative causal effects of vitamin D on multiple health outcomes in a White population. We did not support a causal effect on any of the disease outcomes tested. However, we cannot exclude small causal effects or effects on outcomes that we did not have enough power to explore due to the small number of cases., (© The Author(s) 2019. Published by Oxford University Press on behalf of the International Epidemiological Association.)
- Published
- 2019
- Full Text
- View/download PDF
21. A case study evaluating the portability of an executable computable phenotype algorithm across multiple institutions and electronic health record environments.
- Author
-
Pacheco JA, Rasmussen LV, Kiefer RC, Campion TR, Speltz P, Carroll RJ, Stallings SC, Mo H, Ahuja M, Jiang G, LaRose ER, Peissig PL, Shang N, Benoit B, Gainer VS, Borthwick K, Jackson KL, Sharma A, Wu AY, Kho AN, Roden DM, Pathak J, Denny JC, and Thompson WK
- Subjects
- Data Warehousing, Databases, Factual, Genomics, Humans, Male, Organizational Case Studies, Prostatic Hyperplasia genetics, Algorithms, Electronic Health Records, Phenotype, Prostatic Hyperplasia diagnosis
- Abstract
Electronic health record (EHR) algorithms for defining patient cohorts are commonly shared as free-text descriptions that require human intervention both to interpret and implement. We developed the Phenotype Execution and Modeling Architecture (PhEMA, http://projectphema.org) to author and execute standardized computable phenotype algorithms. With PhEMA, we converted an algorithm for benign prostatic hyperplasia, developed for the electronic Medical Records and Genomics network (eMERGE), into a standards-based computable format. Eight sites (7 within eMERGE) received the computable algorithm, and 6 successfully executed it against local data warehouses and/or i2b2 instances. Blinded random chart review of cases selected by the computable algorithm shows PPV ≥90%, and 3 out of 5 sites had >90% overlap of selected cases when comparing the computable algorithm to their original eMERGE implementation. This case study demonstrates potential use of PhEMA computable representations to automate phenotyping across different EHR systems, but also highlights some ongoing challenges.
- Published
- 2018
- Full Text
- View/download PDF
22. Evaluating statistical approaches to leverage large clinical datasets for uncovering therapeutic and adverse medication effects.
- Author
-
Choi L, Carroll RJ, Beck C, Mosley JD, Roden DM, Denny JC, and Van Driest SL
- Subjects
- Drug Discovery, Drug-Related Side Effects and Adverse Reactions, Electronic Health Records, Humans, Logistic Models, Probability, Datasets as Topic
- Abstract
Motivation: Phenome-wide association studies (PheWAS) have been used to discover many genotype-phenotype relationships and have the potential to identify therapeutic and adverse drug outcomes using longitudinal data within electronic health records (EHRs). However, the statistical methods for PheWAS applied to longitudinal EHR medication data have not been established., Results: In this study, we developed methods to address two challenges faced with reuse of EHR for this purpose: confounding by indication, and low exposure and event rates. We used Monte Carlo simulation to assess propensity score (PS) methods, focusing on two of the most commonly used methods, PS matching and PS adjustment, to address confounding by indication. We also compared two logistic regression approaches (the default of Wald versus Firth's penalized maximum likelihood, PML) to address complete separation due to sparse data with low exposure and event rates. PS adjustment resulted in greater power than PS matching, while controlling Type I error at 0.05. The PML method provided reasonable P-values, even in cases with complete separation, with well controlled Type I error rates. Using PS adjustment and the PML method, we identify novel latent drug effects in pediatric patients exposed to two common antibiotic drugs, ampicillin and gentamicin., Availability and Implementation: R packages PheWAS and EHR are available at https://github.com/PheWAS/PheWAS and at CRAN (https://www.r-project.org/), respectively. The R script for data processing and the main analysis is available at https://github.com/choileena/EHR., Supplementary Information: Supplementary data are available at Bioinformatics online.
- Published
- 2018
- Full Text
- View/download PDF
23. Rare Variants in the Gene ALPL That Cause Hypophosphatasia Are Strongly Associated With Ovarian and Uterine Disorders.
- Author
-
Dahir KM, Tilden DR, Warner JL, Bastarache L, Smith DK, Gifford A, Ramirez AH, Simmons JS, Black MM, Newman JH, and Denny JC
- Subjects
- Adult, Aged, Aged, 80 and over, Alleles, DNA Mutational Analysis, Female, Gene Frequency, Genetic Predisposition to Disease, Genotype, Humans, Male, Middle Aged, Mutation, Phenotype, Alkaline Phosphatase genetics, Hypophosphatasia genetics, Ovarian Diseases genetics, Polymorphism, Single Nucleotide, Uterine Diseases genetics
- Abstract
Context: Mutations in alkaline phosphatase (AlkP), liver/bone/kidney (ALPL), which encodes tissue-nonspecific isozyme AlkP, cause hypophosphatasia (HPP). HPP is suspected by a low-serum AlkP. We hypothesized that some patients with bone or dental disease have undiagnosed HPP, caused by ALPL variants., Objective: Our objective was to discover the prevalence of these gene variants in the Vanderbilt University DNA Biobank (BioVU) and to assess phenotypic associations., Design: We identified subjects in BioVU, a repository of DNA, that had at least one of three known, rare HPP disease-causing variants in ALPL: rs199669988, rs121918007, and/or rs121918002. To evaluate for phenotypic associations, we conducted a sequential phenome-wide association study of ALPL variants and then performed a de-identified manual record review to refine the phenotype., Results: Out of 25,822 genotyped individuals, we identified 52 women and 53 men with HPP disease-causing variants in ALPL, 7/1000. None had a clinical diagnosis of HPP. For patients with ALPL variants, the average serum AlkP levels were in the lower range of normal or lower. Forty percent of men and 62% of women had documented bone and/or dental disease, compatible with the diagnosis of HPP. Forty percent of the female patients had ovarian pathology or other gynecological abnormalities compared with 15% seen in controls., Conclusions: Variants in the ALPL gene cause bone and dental disease in patients with and without the standard biomarker, low plasma AlkP. ALPL gene variants are more prevalent than currently reported and underdiagnosed. Gynecologic disease appears to be associated with HPP-causing variants in ALPL.
- Published
- 2018
- Full Text
- View/download PDF
24. Mining 100 million notes to find homelessness and adverse childhood experiences: 2 case studies of rare and severe social determinants of health in electronic health records.
- Author
-
Bejan CA, Angiolillo J, Conway D, Nash R, Shirey-Rice JK, Lipworth L, Cronin RM, Pulley J, Kripalani S, Barkin S, Johnson KB, and Denny JC
- Subjects
- Child, Computational Biology, Humans, Adverse Childhood Experiences statistics & numerical data, Data Mining methods, Electronic Health Records, Ill-Housed Persons statistics & numerical data, Social Determinants of Health
- Abstract
Objective: Understanding how to identify the social determinants of health from electronic health records (EHRs) could provide important insights to understand health or disease outcomes. We developed a methodology to capture 2 rare and severe social determinants of health, homelessness and adverse childhood experiences (ACEs), from a large EHR repository., Materials and Methods: We first constructed lexicons to capture homelessness and ACE phenotypic profiles. We employed word2vec and lexical associations to mine homelessness-related words. Next, using relevance feedback, we refined the 2 profiles with iterative searches over 100 million notes from the Vanderbilt EHR. Seven assessors manually reviewed the top-ranked results of 2544 patient visits relevant for homelessness and 1000 patients relevant for ACE., Results: word2vec yielded better performance (area under the precision-recall curve [AUPRC] of 0.94) than lexical associations (AUPRC = 0.83) for extracting homelessness-related words. A comparative study of searches for the 2 phenotypes revealed a higher performance achieved for homelessness (AUPRC = 0.95) than ACE (AUPRC = 0.79). A temporal analysis of the homeless population showed that the majority experienced chronic homelessness. Most ACE patients suffered sexual (70%) and/or physical (50.6%) abuse, with the top-ranked abuser keywords being "father" (21.8%) and "mother" (15.4%). Top prevalent associated conditions for homeless patients were lack of housing (62.8%) and tobacco use disorder (61.5%), while for ACE patients it was mental disorders (36.6%-47.6%)., Conclusion: We provide an efficient solution for mining homelessness and ACE information from EHRs, which can facilitate large clinical and genetic studies of these social determinants of health., (© The Author 2017. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com)
- Published
- 2018
- Full Text
- View/download PDF
25. A long journey to short abbreviations: developing an open-source framework for clinical abbreviation recognition and disambiguation (CARD).
- Author
-
Wu Y, Denny JC, Trent Rosenbloom S, Miller RA, Giuse DA, Wang L, Blanquicett C, Soysal E, Xu J, and Xu H
- Subjects
- Humans, Patient Discharge, Abbreviations as Topic, Electronic Health Records, Machine Learning, Natural Language Processing
- Abstract
Objective: The goal of this study was to develop a practical framework for recognizing and disambiguating clinical abbreviations, thereby improving current clinical natural language processing (NLP) systems' capability to handle abbreviations in clinical narratives., Methods: We developed an open-source framework for clinical abbreviation recognition and disambiguation (CARD) that leverages our previously developed methods, including: (1) machine learning based approaches to recognize abbreviations from a clinical corpus, (2) clustering-based semiautomated methods to generate possible senses of abbreviations, and (3) profile-based word sense disambiguation methods for clinical abbreviations. We applied CARD to clinical corpora from Vanderbilt University Medical Center (VUMC) and generated 2 comprehensive sense inventories for abbreviations in discharge summaries and clinic visit notes. Furthermore, we developed a wrapper that integrates CARD with MetaMap, a widely used general clinical NLP system., Results and Conclusion: CARD detected 27 317 and 107 303 distinct abbreviations from discharge summaries and clinic visit notes, respectively. Two sense inventories were constructed for the 1000 most frequent abbreviations in these 2 corpora. Using the sense inventories created from discharge summaries, CARD achieved an F1 score of 0.755 for identifying and disambiguating all abbreviations in a corpus from the VUMC discharge summaries, which is superior to MetaMap and Apache's clinical Text Analysis Knowledge Extraction System (cTAKES). Using additional external corpora, we also demonstrated that the MetaMap-CARD wrapper improved MetaMap's performance in recognizing disorder entities in clinical notes. The CARD framework, 2 sense inventories, and the wrapper for MetaMap are publicly available at https://sbmi.uth.edu/ccb/resources/abbreviation.htm. We believe the CARD framework can be a valuable resource for improving abbreviation identification in clinical NLP systems., (© The Author 2016. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com)
- Published
- 2017
- Full Text
- View/download PDF
26. Evaluating electronic health record data sources and algorithmic approaches to identify hypertensive individuals.
- Author
-
Teixeira PL, Wei WQ, Cronin RM, Mo H, VanHouten JP, Carroll RJ, LaRose E, Bastarache LA, Rosenbloom ST, Edwards TL, Roden DM, Lasko TA, Dart RA, Nikolai AM, Peissig PL, and Denny JC
- Subjects
- Aged, Blood Pressure Determination, Clinical Coding, Female, Humans, Information Storage and Retrieval methods, Male, Middle Aged, Natural Language Processing, Phenotype, ROC Curve, Algorithms, Electronic Health Records, Hypertension diagnosis, Machine Learning
- Abstract
Objective: Phenotyping algorithms applied to electronic health record (EHR) data enable investigators to identify large cohorts for clinical and genomic research. Algorithm development is often iterative, depends on fallible investigator intuition, and is time- and labor-intensive. We developed and evaluated 4 types of phenotyping algorithms and categories of EHR information to identify hypertensive individuals and controls and provide a portable module for implementation at other sites., Materials and Methods: We reviewed the EHRs of 631 individuals followed at Vanderbilt for hypertension status. We developed features and phenotyping algorithms of increasing complexity. Input categories included International Classification of Diseases, Ninth Revision (ICD9) codes, medications, vital signs, narrative-text search results, and Unified Medical Language System (UMLS) concepts extracted using natural language processing (NLP). We developed a module and tested portability by replicating 10 of the best-performing algorithms at the Marshfield Clinic., Results: Random forests using billing codes, medications, vitals, and concepts had the best performance with a median area under the receiver operator characteristic curve (AUC) of 0.976. Normalized sums of all 4 categories also performed well (0.959 AUC). The best non-NLP algorithm combined normalized ICD9 codes, medications, and blood pressure readings with a median AUC of 0.948. Blood pressure cutoffs or ICD9 code counts alone had AUCs of 0.854 and 0.908, respectively. Marshfield Clinic results were similar., Conclusion: This work shows that billing codes or blood pressure readings alone yield good hypertension classification performance. However, even simple combinations of input categories improve performance. The most complex algorithms classified hypertension with excellent recall and precision., (© The Author 2016. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com.)
- Published
- 2017
- Full Text
- View/download PDF
27. PheKB: a catalog and workflow for creating electronic phenotype algorithms for transportability.
- Author
-
Kirby JC, Speltz P, Rasmussen LV, Basford M, Gottesman O, Peissig PL, Pacheco JA, Tromp G, Pathak J, Carrell DS, Ellis SB, Lingren T, Thompson WK, Savova G, Haines J, Roden DM, Harris PA, and Denny JC
- Subjects
- Data Mining methods, Electronic Health Records, Genomics, Humans, International Classification of Diseases, Natural Language Processing, Algorithms, Knowledge Bases, Phenotype
- Abstract
Objective: Health care generated data have become an important source for clinical and genomic research. Often, investigators create and iteratively refine phenotype algorithms to achieve high positive predictive values (PPVs) or sensitivity, thereby identifying valid cases and controls. These algorithms achieve the greatest utility when validated and shared by multiple health care systems., Materials and Methods: We report the current status and impact of the Phenotype KnowledgeBase (PheKB, http://phekb.org), an online environment supporting the workflow of building, sharing, and validating electronic phenotype algorithms. We analyze the most frequent components used in algorithms and their performance at authoring institutions and secondary implementation sites., Results: As of June 2015, PheKB contained 30 finalized phenotype algorithms and 62 algorithms in development spanning a range of traits and diseases. Phenotypes have had over 3500 unique views in a 6-month period and have been reused by other institutions. International Classification of Disease codes were the most frequently used component, followed by medications and natural language processing. Among algorithms with published performance data, the median PPV was nearly identical when evaluated at the authoring institutions (n = 44; case 96.0%, control 100%) compared to implementation sites (n = 40; case 97.5%, control 100%)., Discussion: These results demonstrate that a broad range of algorithms to mine electronic health record data from different health systems can be developed with high PPV, and algorithms developed at one site are generally transportable to others., Conclusion: By providing a central repository, PheKB enables improved development, transportability, and validity of algorithms for research-grade phenotypes using health care generated data., (© The Author 2016. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com.)
- Published
- 2016
- Full Text
- View/download PDF
28. Precision medicine informatics.
- Author
-
Frey LJ, Bernstam EV, and Denny JC
- Subjects
- Computational Biology, Medical Informatics, Precision Medicine
- Published
- 2016
- Full Text
- View/download PDF
29. A multi-institution evaluation of clinical profile anonymization.
- Author
-
Heatherly R, Rasmussen LV, Peissig PL, Pacheco JA, Harris P, Denny JC, and Malin BA
- Subjects
- Confidentiality, Humans, Hypothyroidism, International Classification of Diseases, Organizational Case Studies, Data Anonymization, Electronic Health Records, Information Dissemination
- Abstract
Background and Objective: There is an increasing desire to share de-identified electronic health records (EHRs) for secondary uses, but there are concerns that clinical terms can be exploited to compromise patient identities. Anonymization algorithms mitigate such threats while enabling novel discoveries, but their evaluation has been limited to single institutions. Here, we study how an existing clinical profile anonymization fares at multiple medical centers., Methods: We apply a state-of-the-art k-anonymization algorithm, with k set to the standard value 5, to the International Classification of Disease, ninth edition codes for patients in a hypothyroidism association study at three medical centers: Marshfield Clinic, Northwestern University, and Vanderbilt University. We assess utility when anonymizing at three population levels: all patients in 1) the EHR system; 2) the biorepository; and 3) a hypothyroidism study. We evaluate utility using 1) changes to the number included in the dataset, 2) number of codes included, and 3) regions generalization and suppression were required., Results: Our findings yield several notable results. First, we show that anonymizing in the context of the entire EHR yields a significantly greater quantity of data by reducing the amount of generalized regions from ∼15% to ∼0.5%. Second, ∼70% of codes that needed generalization only generalized two or three codes in the largest anonymization., Conclusions: Sharing large volumes of clinical data in support of phenome-wide association studies is possible while safeguarding privacy to the underlying individuals., (© The Author 2015. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com.)
- Published
- 2016
- Full Text
- View/download PDF
30. Combining billing codes, clinical notes, and medications from electronic health records provides superior phenotyping performance.
- Author
-
Wei WQ, Teixeira PL, Mo H, Cronin RM, Warner JL, and Denny JC
- Subjects
- Diagnosis, Humans, Medical Records, Problem-Oriented, Predictive Value of Tests, Algorithms, Electronic Health Records, International Classification of Diseases, Phenotype
- Abstract
Objective: To evaluate the phenotyping performance of three major electronic health record (EHR) components: International Classification of Disease (ICD) diagnosis codes, primary notes, and specific medications., Materials and Methods: We conducted the evaluation using de-identified Vanderbilt EHR data. We preselected ten diseases: atrial fibrillation, Alzheimer's disease, breast cancer, gout, human immunodeficiency virus infection, multiple sclerosis, Parkinson's disease, rheumatoid arthritis, and types 1 and 2 diabetes mellitus. For each disease, patients were classified into seven categories based on the presence of evidence in diagnosis codes, primary notes, and specific medications. Twenty-five patients per disease category (a total number of 175 patients for each disease, 1750 patients for all ten diseases) were randomly selected for manual chart review. Review results were used to estimate the positive predictive value (PPV), sensitivity, and F-score for each EHR component alone and in combination., Results: The PPVs of single components were inconsistent and inadequate for accurately phenotyping (0.06-0.71). Using two or more ICD codes improved the average PPV to 0.84. We observed a more stable and higher accuracy when using at least two components (mean ± standard deviation: 0.91 ± 0.08). Primary notes offered the best sensitivity (0.77). The sensitivity of ICD codes was 0.67. Again, two or more components provided a reasonably high and stable sensitivity (0.59 ± 0.16). Overall, the best performance (F-score: 0.70 ± 0.12) was achieved by using two or more components. Although the overall performance of using ICD codes (0.67 ± 0.14) was only slightly lower than using two or more components, its PPV (0.71 ± 0.13) is substantially worse (0.91 ± 0.08)., Conclusion: Multiple EHR components provide a more consistent and higher performance than a single one for the selected phenotypes. We suggest considering multiple EHR components for future phenotyping design in order to obtain an ideal result., (© The Author 2015. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com.)
- Published
- 2016
- Full Text
- View/download PDF
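The three metrics reported in the abstract above (PPV, sensitivity, and F-score) can be computed from chart-review counts as in the following sketch. The counts used are hypothetical and are not taken from the study. The study estimated its metrics from a stratified sample, a step this plain-count sketch does not reproduce.

```python
def ppv(tp, fp):
    """Positive predictive value: confirmed cases among all flagged patients."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def sensitivity(tp, fn):
    """Sensitivity (recall): flagged cases among all true cases."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_score(precision, recall):
    """Harmonic mean of PPV and sensitivity."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical chart-review counts for one phenotype definition.
tp, fp, fn = 20, 5, 10
p = ppv(tp, fp)
r = sensitivity(tp, fn)
print(round(p, 3), round(r, 3), round(f_score(p, r), 3))
```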
31. Harnessing next-generation informatics for personalizing medicine: a report from AMIA's 2014 Health Policy Invitational Meeting.
- Author
-
Wiley LK, Tarczy-Hornoch P, Denny JC, Freimuth RR, Overby CL, Shah N, Martin RD, and Sarkar IN
- Subjects
- Humans, Societies, Medical, United States, Health Policy, Medical Informatics, Precision Medicine
- Abstract
The American Medical Informatics Association convened the 2014 Health Policy Invitational Meeting to develop recommendations for updates to current policies and to establish an informatics research agenda for personalizing medicine. In particular, the meeting focused on discussing informatics challenges related to personalizing care through the integration of genomic or other high-volume biomolecular data with data from clinical systems to make health care more efficient and effective. This report summarizes the findings (n = 6) and recommendations (n = 15) from the policy meeting, which were clustered into 3 broad areas: (1) policies governing data access for research and personalization of care; (2) policy and research needs for evolving data interpretation and knowledge representation; and (3) policy and research needs to ensure data integrity and preservation. The meeting outcome underscored the need to address a number of important policy and technical considerations in order to realize the potential of personalized or precision medicine in actual clinical contexts., (© The Author 2016. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com.)
- Published
- 2016
- Full Text
- View/download PDF
32. Desiderata for computable representations of electronic health records-driven phenotype algorithms.
- Author
-
Mo H, Thompson WK, Rasmussen LV, Pacheco JA, Jiang G, Kiefer R, Zhu Q, Xu J, Montague E, Carrell DS, Lingren T, Mentch FD, Ni Y, Wehbe FH, Peissig PL, Tromp G, Larson EB, Chute CG, Pathak J, Denny JC, Speltz P, Kho AN, Jarvik GP, Bejan CA, Williams MS, Borthwick K, Kitchner TE, Roden DM, and Harris PA
- Subjects
- Humans, Phenotype, Algorithms, Diagnosis, Computer-Assisted, Electronic Health Records
- Abstract
Background: Electronic health records (EHRs) are increasingly used for clinical and translational research through the creation of phenotype algorithms. Currently, phenotype algorithms are most commonly represented as noncomputable descriptive documents and knowledge artifacts that detail the protocols for querying diagnoses, symptoms, procedures, medications, and/or text-driven medical concepts, and are primarily meant for human comprehension. We present desiderata for developing a computable phenotype representation model (PheRM)., Methods: A team of clinicians and informaticians reviewed common features for multisite phenotype algorithms published in PheKB.org and existing phenotype representation platforms. We also evaluated well-known diagnostic criteria and clinical decision-making guidelines to encompass a broader category of algorithms., Results: We propose 10 desired characteristics for a flexible, computable PheRM: (1) structure clinical data into queryable forms; (2) recommend use of a common data model, but also support customization for the variability and availability of EHR data among sites; (3) support both human-readable and computable representations of phenotype algorithms; (4) implement set operations and relational algebra for modeling phenotype algorithms; (5) represent phenotype criteria with structured rules; (6) support defining temporal relations between events; (7) use standardized terminologies and ontologies, and facilitate reuse of value sets; (8) define representations for text searching and natural language processing; (9) provide interfaces for external software algorithms; and (10) maintain backward compatibility., Conclusion: A computable PheRM is needed for true phenotype portability and reliability across different EHR products and healthcare systems. These desiderata are a guide to inform the establishment and evolution of EHR phenotype algorithm authoring platforms and languages., (© The Author 2015. Published by Oxford University Press on behalf of the American Medical Informatics Association.)
- Published
- 2015
- Full Text
- View/download PDF
33. Review and evaluation of electronic health records-driven phenotype algorithm authoring tools for clinical and translational research.
- Author
-
Xu J, Rasmussen LV, Shaw PL, Jiang G, Kiefer RC, Mo H, Pacheco JA, Speltz P, Zhu Q, Denny JC, Pathak J, Thompson WK, and Montague E
- Subjects
- Humans, Translational Research, Biomedical, Algorithms, Biomedical Research, Electronic Health Records, Software
- Abstract
Objective: To review and evaluate available software tools for electronic health record-driven phenotype authoring in order to identify gaps and needs for future development., Materials and Methods: Candidate phenotype authoring tools were identified through (1) literature search in four publication databases (PubMed, Embase, Web of Science, and Scopus) and (2) a web search. A collection of tools was compiled and reviewed after the searches. A survey was designed and distributed to the developers of the reviewed tools to discover their functionalities and features., Results: Twenty-four different phenotype authoring tools were identified and reviewed. Developers of 16 of these identified tools completed the evaluation survey (67% response rate). The surveyed tools showed commonalities but also varied in their capabilities in algorithm representation, logic functions, data support and software extensibility, search functions, user interface, and data outputs., Discussion: Positive trends identified in the evaluation included: algorithms can be represented in both computable and human readable formats; and most tools offer a web interface for easy access. However, issues were also identified: many tools were lacking advanced logic functions for authoring complex algorithms; the ability to construct queries that leveraged un-structured data was not widely implemented; and many tools had limited support for plug-ins or external analytic software., Conclusions: Existing phenotype authoring tools could enable clinical researchers to work with electronic health record data more efficiently, but gaps still exist in terms of the functionalities of such tools. The present work can serve as a reference point for the future development of similar tools., (© The Author 2015. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com.)
- Published
- 2015
- Full Text
- View/download PDF
34. New data and an old puzzle: the negative association between schizophrenia and rheumatoid arthritis.
- Author
-
Lee SH, Byrne EM, Hultman CM, Kähler A, Vinkhuyzen AA, Ripke S, Andreassen OA, Frisell T, Gusev A, Hu X, Karlsson R, Mantzioris VX, McGrath JJ, Mehta D, Stahl EA, Zhao Q, Kendler KS, Sullivan PF, Price AL, O'Donovan M, Okada Y, Mowry BJ, Raychaudhuri S, Wray NR, Byerley W, Cahn W, Cantor RM, Cichon S, Cormican P, Curtis D, Djurovic S, Escott-Price V, Gejman PV, Georgieva L, Giegling I, Hansen TF, Ingason A, Kim Y, Konte B, Lee PH, McIntosh A, McQuillin A, Morris DW, Nöthen MM, O'Dushlaine C, Olincy A, Olsen L, Pato CN, Pato MT, Pickard BS, Posthuma D, Rasmussen HB, Rietschel M, Rujescu D, Schulze TG, Silverman JM, Thirumalai S, Werge T, Agartz I, Amin F, Azevedo MH, Bass N, Black DW, Blackwood DH, Bruggeman R, Buccola NG, Choudhury K, Cloninger RC, Corvin A, Craddock N, Daly MJ, Datta S, Donohoe GJ, Duan J, Dudbridge F, Fanous A, Freedman R, Freimer NB, Friedl M, Gill M, Gurling H, De Haan L, Hamshere ML, Hartmann AM, Holmans PA, Kahn RS, Keller MC, Kenny E, Kirov GK, Krabbendam L, Krasucki R, Lawrence J, Lencz T, Levinson DF, Lieberman JA, Lin DY, Linszen DH, Magnusson PK, Maier W, Malhotra AK, Mattheisen M, Mattingsdal M, McCarroll SA, Medeiros H, Melle I, Milanova V, Myin-Germeys I, Neale BM, Ophoff RA, Owen MJ, Pimm J, Purcell SM, Puri V, Quested DJ, Rossin L, Ruderfer D, Sanders AR, Shi J, Sklar P, St Clair D, Stroup TS, Van Os J, Visscher PM, Wiersma D, Zammit S, Bridges SL Jr, Choi HK, Coenen MJ, de Vries N, Dieud P, Greenberg JD, Huizinga TW, Padyukov L, Siminovitch KA, Tak PP, Worthington J, De Jager PL, Denny JC, Gregersen PK, Klareskog L, Mariette X, Plenge RM, van Laar M, and van Riel P
- Subjects
- Adolescent, Adult, Cohort Studies, Cross-Sectional Studies, Female, Gene-Environment Interaction, Genetic Predisposition to Disease, Genetic Variation, Genome-Wide Association Study, Humans, Male, Middle Aged, Young Adult, Arthritis, Rheumatoid genetics, Polymorphism, Single Nucleotide, Schizophrenia genetics
- Abstract
Background: A long-standing epidemiological puzzle is the reduced rate of rheumatoid arthritis (RA) in those with schizophrenia (SZ) and vice versa. Traditional epidemiological approaches to determine if this negative association is underpinned by genetic factors would test for reduced rates of one disorder in relatives of the other, but sufficiently powered data sets are difficult to achieve. The genomics era presents an alternative paradigm for investigating the genetic relationship between two uncommon disorders., Methods: We use genome-wide common single nucleotide polymorphism (SNP) data from independently collected SZ and RA case-control cohorts to estimate the SNP correlation between the disorders. We test a genotype X environment (GxE) hypothesis for SZ with environment defined as winter- vs summer-born., Results: We estimate a small but significant negative SNP-genetic correlation between SZ and RA (-0.046, s.e. 0.026, P = 0.036). The negative correlation was stronger for the SNP set attributed to coding or regulatory regions (-0.174, s.e. 0.071, P = 0.0075). Our analyses led us to hypothesize a gene-environment interaction for SZ in the form of immune challenge. We used month of birth as a proxy for environmental immune challenge and estimated the genetic correlation between winter-born and non-winter born SZ to be significantly less than 1 for coding/regulatory region SNPs (0.56, s.e. 0.14, P = 0.00090)., Conclusions: Our results are consistent with epidemiological observations of a negative relationship between SZ and RA reflecting, at least in part, genetic factors. Results of the month of birth analysis are consistent with pleiotropic effects of genetic variants dependent on environmental context.
- Published
- 2015
- Full Text
- View/download PDF
35. Assessing the role of a medication-indication resource in the treatment relation extraction from clinical text.
- Author
-
Bejan CA, Wei WQ, and Denny JC
- Subjects
- Algorithms, Humans, Knowledge Bases, RxNorm, Information Storage and Retrieval methods, Pharmaceutical Preparations, Semantics, Unified Medical Language System
- Abstract
Objective: To evaluate the contribution of the MEDication Indication (MEDI) resource and SemRep for identifying treatment relations in clinical text., Materials and Methods: We first processed clinical documents with SemRep to extract the Unified Medical Language System (UMLS) concepts and the treatment relations between them. Then, we incorporated MEDI into a simple algorithm that identifies treatment relations between two concepts if they match a medication-indication pair in this resource. For a better coverage, we expanded MEDI using ontology relationships from RxNorm and UMLS Metathesaurus. We also developed two ensemble methods, which combined the predictions of SemRep and the MEDI algorithm. We evaluated our selected methods on two datasets, a Vanderbilt corpus of 6864 discharge summaries and the 2010 Informatics for Integrating Biology and the Bedside (i2b2)/Veteran's Affairs (VA) challenge dataset., Results: The Vanderbilt dataset included 958 manually annotated treatment relations. A double annotation was performed on 25% of relations with high agreement (Cohen's κ = 0.86). The evaluation consisted of comparing the manual annotated relations with the relations identified by SemRep, the MEDI algorithm, and the two ensemble methods. On the first dataset, the best F1-measure results achieved by the MEDI algorithm and the union of the two resources (78.7 and 80, respectively) were significantly higher than the SemRep results (72.3). On the second dataset, the MEDI algorithm achieved better precision and significantly lower recall values than the best system in the i2b2 challenge. The two systems obtained comparable F1-measure values on the subset of i2b2 relations with both arguments in MEDI., Conclusions: Both SemRep and MEDI can be used to extract treatment relations from clinical text. Knowledge-based extraction with MEDI outperformed use of SemRep alone, but superior performance was achieved by integrating both systems. The integration of knowledge-based resources such as MEDI into information extraction systems such as SemRep and the i2b2 relation extractors may improve treatment relation extraction from clinical text., (© The Author 2014. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com.)
- Published
- 2015
- Full Text
- View/download PDF
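The knowledge-based matching step described in the abstract above (a candidate pair is a treatment relation if it appears as a medication-indication pair in the resource) and the union ensemble it mentions might look like the sketch below. The pair table, the candidate pairs, and the stand-in for SemRep output are all hypothetical. None of this is the actual MEDI data format or the SemRep API.

```python
# Hypothetical excerpt of a medication-indication lookup table.
MEDI_PAIRS = {
    ("metformin", "type 2 diabetes mellitus"),
    ("methotrexate", "rheumatoid arthritis"),
}


def medi_relations(candidate_pairs, known_pairs=MEDI_PAIRS):
    """Keep candidate (medication, problem) pairs found in the lookup table."""
    return {pair for pair in candidate_pairs if pair in known_pairs}


def union_ensemble(relations_a, relations_b):
    """Union ensemble: accept a relation if either system proposes it."""
    return set(relations_a) | set(relations_b)


# Hypothetical concept pairs co-occurring in one clinical document.
candidates = {
    ("metformin", "type 2 diabetes mellitus"),
    ("metformin", "hypertension"),
    ("lisinopril", "hypertension"),
}
# Hypothetical output standing in for a separate relation extractor.
other_system = {("lisinopril", "hypertension")}

print(sorted(union_ensemble(medi_relations(candidates), other_system)))
```

The union trades some precision for recall. The abstract does not describe the second ensemble method, so it is not sketched here.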
36. Automatic identification of methotrexate-induced liver toxicity in patients with rheumatoid arthritis from the electronic medical record.
- Author
-
Lin C, Karlson EW, Dligach D, Ramirez MP, Miller TA, Mo H, Braggs NS, Cagan A, Gainer V, Denny JC, and Savova GK
- Subjects
- Humans, Liver drug effects, Algorithms, Arthritis, Rheumatoid drug therapy, Chemical and Drug Induced Liver Injury diagnosis, Electronic Health Records, Immunosuppressive Agents adverse effects, Methotrexate adverse effects
- Abstract
Objectives: To improve the accuracy of mining structured and unstructured components of the electronic medical record (EMR) by adding temporal features to automatically identify patients with rheumatoid arthritis (RA) with methotrexate-induced liver transaminase abnormalities., Materials and Methods: Codified information and a string-matching algorithm were applied to a RA cohort of 5903 patients from Partners HealthCare to select 1130 patients with potential liver toxicity. Supervised machine learning was applied as our key method. For features, Apache clinical Text Analysis and Knowledge Extraction System (cTAKES) was used to extract standard vocabulary from relevant sections of the unstructured clinical narrative. Temporal features were further extracted to assess the temporal relevance of event mentions with regard to the date of transaminase abnormality. All features were encapsulated in a 3-month-long episode for classification. Results were summarized at patient level in a training set (N=480 patients) and evaluated against a test set (N=120 patients)., Results: The system achieved positive predictive value (PPV) 0.756, sensitivity 0.919, F1 score 0.829 on the test set, which was significantly better than the best baseline system (PPV 0.590, sensitivity 0.703, F1 score 0.642). Our innovations, which included framing the phenotype problem as an episode-level classification task, and adding temporal information, all proved highly effective., Conclusions: Automated methotrexate-induced liver toxicity phenotype discovery for patients with RA based on structured and unstructured information in the EMR shows accurate results. Our work demonstrates that adding temporal features significantly improved classification results., (© The Author 2014. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com.)
- Published
- 2015
- Full Text
- View/download PDF
37. Seeing the forest through the trees: uncovering phenomic complexity through interactive network visualization.
- Author
-
Warner JL, Denny JC, Kreda DA, and Alterovitz G
- Subjects
- Audiovisual Aids, Data Display, Pattern Recognition, Automated, User-Computer Interface
- Abstract
Our aim was to uncover unrecognized phenomic relationships using force-based network visualization methods, based on observed electronic medical record data. A primary phenotype was defined from actual patient profiles in the Multiparameter Intelligent Monitoring in Intensive Care II database. Network visualizations depicting primary relationships were compared to those incorporating secondary adjacencies. Interactivity was enabled through a phenotype visualization software concept: the Phenomics Advisor. Subendocardial infarction with cardiac arrest was demonstrated as a sample phenotype; there were 332 primarily adjacent diagnoses, with 5423 relationships. Primary network visualization suggested a treatment-related complication phenotype and several rare diagnoses; re-clustering by secondary relationships revealed an emergent cluster of smokers with the metabolic syndrome. Network visualization reveals phenotypic patterns that may have remained occult in pairwise correlation analysis. Visualization of complex data, potentially offered as point-of-care tools on mobile devices, may allow clinicians and researchers to quickly generate hypotheses and gain deeper understanding of patient subpopulations., (© The Author 2014. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com.)
- Published
- 2015
- Full Text
- View/download PDF
38. Validating drug repurposing signals using electronic health records: a case study of metformin associated with reduced cancer mortality.
- Author
-
Xu H, Aldrich MC, Chen Q, Liu H, Peterson NB, Dai Q, Levy M, Shah A, Han X, Ruan X, Jiang M, Li Y, Julien JS, Warner J, Friedman C, Roden DM, and Denny JC
- Subjects
- Administration, Oral, Adult, Diabetes Mellitus, Type 2 complications, Diabetes Mellitus, Type 2 drug therapy, Diabetes Mellitus, Type 2 mortality, Humans, Natural Language Processing, Neoplasms complications, Neoplasms prevention & control, Registries, Survival Analysis, Drug Repositioning, Electronic Health Records, Hypoglycemic Agents therapeutic use, Information Storage and Retrieval methods, Metformin therapeutic use, Neoplasms mortality
- Abstract
Objectives: Drug repurposing, which finds new indications for existing drugs, has received great attention recently. The goal of our work is to assess the feasibility of using electronic health records (EHRs) and automated informatics methods to efficiently validate a recent drug repurposing association of metformin with reduced cancer mortality., Methods: By linking two large EHRs from Vanderbilt University Medical Center and Mayo Clinic to their tumor registries, we constructed a cohort including 32,415 adults with a cancer diagnosis at Vanderbilt and 79,258 cancer patients at Mayo from 1995 to 2010. Using automated informatics methods, we further identified type 2 diabetes patients within the cancer cohort and determined their drug exposure information, as well as other covariates such as smoking status. We then estimated HRs for all-cause mortality and their associated 95% CIs using stratified Cox proportional hazard models. HRs were estimated according to metformin exposure, adjusted for age at diagnosis, sex, race, body mass index, tobacco use, insulin use, cancer type, and non-cancer Charlson comorbidity index., Results: Among all Vanderbilt cancer patients, metformin was associated with a 22% decrease in overall mortality compared to other oral hypoglycemic medications (HR 0.78; 95% CI 0.69 to 0.88) and with a 39% decrease compared to type 2 diabetes patients on insulin only (HR 0.61; 95% CI 0.50 to 0.73). Diabetic patients on metformin also had a 23% improved survival compared with non-diabetic patients (HR 0.77; 95% CI 0.71 to 0.85). These associations were replicated using the Mayo Clinic EHR data. Many site-specific cancers including breast, colorectal, lung, and prostate demonstrated reduced mortality with metformin use in at least one EHR., Conclusions: EHR data suggested that the use of metformin was associated with decreased mortality after a cancer diagnosis compared with diabetic and non-diabetic cancer patients not on metformin, indicating its potential as a chemotherapeutic regimen. This study serves as a model for robust and inexpensive validation studies for drug repurposing signals using EHR data., (© The Author 2014. Published by Oxford University Press on behalf of the American Medical Informatics Association.)
- Published
- 2015
- Full Text
- View/download PDF
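The percentage decreases quoted in the abstract above follow from the hazard ratios by the usual reading of an HR below 1 as a relative reduction of (1 - HR). The short sketch below checks that arithmetic against the three reported HRs. The function name is hypothetical, and the sketch does not fit a Cox model.

```python
def percent_reduction(hazard_ratio):
    """Relative reduction in hazard implied by an HR below 1, in whole percent."""
    return round((1.0 - hazard_ratio) * 100)


# Hazard ratios reported in the abstract, paired with the comparison group.
reported = {
    "other oral hypoglycemics": 0.78,
    "insulin only": 0.61,
    "non-diabetic patients": 0.77,
}

for comparison, hr in reported.items():
    print(f"vs {comparison}: HR {hr} -> {percent_reduction(hr)}% lower hazard")
```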
39. SecureMA: protecting participant privacy in genetic association meta-analysis.
- Author
-
Xie W, Kantarcioglu M, Bush WS, Crawford D, Denny JC, Heatherly R, and Malin BA
- Subjects
- Genome-Wide Association Study methods, Genomics, Humans, Hypothyroidism genetics, Obesity genetics, Software, Genetic Association Studies methods, Genetic Privacy, Meta-Analysis as Topic
- Abstract
Motivation: Sharing genomic data is crucial to support scientific investigation such as genome-wide association studies. However, recent investigations suggest the privacy of the individual participants in these studies can be compromised, leading to serious concerns and consequences, such as overly restricted access to data., Results: We introduce a novel cryptographic strategy to securely perform meta-analysis for genetic association studies in large consortia. Our methodology is useful for supporting joint studies among disparate data sites, where privacy or confidentiality is of concern. We validate our method using three multisite association studies. Our research shows that genetic associations can be analyzed efficiently and accurately across substudy sites, without leaking information on individual participants and site-level association summaries., Availability and Implementation: Our software for secure meta-analysis of genetic association studies, SecureMA, is publicly available at http://github.com/XieConnect/SecureMA. Our customized secure computation framework is also publicly available at http://github.com/XieConnect/CircuitService., (© The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.)
- Published
- 2014
- Full Text
- View/download PDF
40. R PheWAS: data analysis and plotting tools for phenome-wide association studies in the R environment.
- Author
-
Carroll RJ, Bastarache L, and Denny JC
- Subjects
- Data Interpretation, Statistical, Humans, Multiple Sclerosis genetics, Genetic Association Studies methods, Genetic Variation, Phenotype, Software
- Abstract
Unlabelled: Phenome-wide association studies (PheWAS) have been used to replicate known genetic associations and discover new phenotype associations for genetic variants. This PheWAS implementation allows users to translate ICD-9 codes to PheWAS case and control groups, perform analyses using these and/or other phenotypes with covariate adjustments and plot the results. We demonstrate the methods by replicating a PheWAS on rs3135388 (near HLA-DRB, associated with multiple sclerosis) and performing a novel PheWAS using an individual's maximum white blood cell count (WBC) as a continuous measure. Our results for rs3135388 replicate known associations with more significant results than the original study on the same dataset. Our PheWAS of WBC found expected results, including associations with infections, myeloproliferative diseases and associated conditions, such as anemia. These results demonstrate the performance of the improved classification scheme and the flexibility of PheWAS encapsulated in this package., Availability and Implementation: This R package is freely available under the Gnu Public License (GPL-3) from http://phewascatalog.org. It is implemented in native R and is platform independent., (© The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.)
- Published
- 2014
- Full Text
- View/download PDF
41. Predicting changes in hypertension control using electronic health records from a chronic disease management program.
- Author
-
Sun J, McNaughton CD, Zhang P, Perer A, Gkoulalas-Divanis A, Denny JC, Kirby J, Lasko T, Saip A, and Malin BA
- Subjects
- Antihypertensive Agents therapeutic use, Chronic Disease, Humans, Models, Theoretical, Prognosis, Disease Management, Electronic Health Records, Hypertension therapy
- Abstract
Objective: Common chronic diseases such as hypertension are costly and difficult to manage. Our ultimate goal is to use data from electronic health records to predict the risk and timing of deterioration in hypertension control. Towards this goal, this work predicts the transition points at which hypertension is brought into, as well as pushed out of, control., Method: In a cohort of 1294 patients with hypertension enrolled in a chronic disease management program at the Vanderbilt University Medical Center, patients are modeled as an array of features derived from the clinical domain over time, which are distilled into a core set using an information gain criterion regarding their predictive performance. A model for transition point prediction was then computed using a random forest classifier., Results: The most predictive features for transitions in hypertension control status included hypertension assessment patterns, comorbid diagnoses, procedures and medication history. The final random forest model achieved a c-statistic of 0.836 (95% CI 0.830 to 0.842) and an accuracy of 0.773 (95% CI 0.766 to 0.780)., Conclusions: This study achieved accurate prediction of transition points of hypertension control status, an important first step in the long-term goal of developing personalized hypertension management plans.
- Published
- 2014
- Full Text
- View/download PDF
42. Applying active learning to high-throughput phenotyping algorithms for electronic health records data.
- Author
-
Chen Y, Carroll RJ, Hinz ER, Shah A, Eyler AE, Denny JC, and Xu H
- Subjects
- Genetic Association Studies, Humans, Support Vector Machine, Algorithms, Artificial Intelligence, Electronic Health Records, Phenotype
- Abstract
Objectives: Generalizable, high-throughput phenotyping methods based on supervised machine learning (ML) algorithms could significantly accelerate the use of electronic health records data for clinical and translational research. However, they often require large numbers of annotated samples, which are costly and time-consuming to review. We investigated the use of active learning (AL) in ML-based phenotyping algorithms., Methods: We integrated an uncertainty sampling AL approach with support vector machines-based phenotyping algorithms and evaluated its performance using three annotated disease cohorts including rheumatoid arthritis (RA), colorectal cancer (CRC), and venous thromboembolism (VTE). We investigated performance using two types of feature sets: unrefined features, which contained at least all clinical concepts extracted from notes and billing codes; and a smaller set of refined features selected by domain experts. The performance of the AL was compared with a passive learning (PL) approach based on random sampling., Results: Our evaluation showed that AL outperformed PL on three phenotyping tasks. When unrefined features were used in the RA and CRC tasks, AL reduced the number of annotated samples required to achieve an area under the curve (AUC) score of 0.95 by 68% and 23%, respectively. AL also achieved a reduction of 68% for VTE with an optimal AUC of 0.70 using refined features. As expected, refined features improved the performance of phenotyping classifiers and required fewer annotated samples., Conclusions: This study demonstrated that AL can be useful in ML-based phenotyping methods. Moreover, AL and feature engineering based on domain knowledge could be combined to develop efficient and generalizable phenotyping methods.
- Published
- 2013
- Full Text
- View/download PDF
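The uncertainty sampling step described in the abstract above can be sketched as a selection rule: among unlabeled samples, ask the annotator to review the ones the current classifier is least sure about. The abstract does not say which uncertainty measure was used. This sketch assumes a margin-based rule (smallest absolute SVM decision score), and the scores below are hypothetical.

```python
def select_most_uncertain(decision_scores, batch_size=1):
    """Return indices of the samples closest to the decision boundary.

    decision_scores: signed distances from the separating hyperplane,
    one per unlabeled sample (assumed to come from a trained SVM).
    """
    ranked = sorted(range(len(decision_scores)),
                    key=lambda i: abs(decision_scores[i]))
    return ranked[:batch_size]


# Hypothetical decision scores for five unlabeled patient records.
scores = [1.8, -0.1, 0.4, -2.3, 0.05]
print(select_most_uncertain(scores, batch_size=2))
```

In a full active learning loop, the selected records would be annotated, added to the training set, and the classifier retrained before the next selection round. Passive learning would instead pick the next records at random.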
43. Automated extraction of clinical traits of multiple sclerosis in electronic medical records.
- Author
-
Davis MF, Sriram S, Bush WS, Denny JC, and Haines JL
- Subjects
- Adolescent, Adult, Aged, Aged, 80 and over, Child, Disease Progression, Female, Humans, Male, Middle Aged, Algorithms, Data Mining, Electronic Health Records, Multiple Sclerosis diagnosis, Natural Language Processing
- Abstract
Objectives: The clinical course of multiple sclerosis (MS) is highly variable, and research data collection is costly and time consuming. We evaluated natural language processing techniques applied to electronic medical records (EMR) to identify MS patients and the key clinical traits of their disease course., Materials and Methods: We used four algorithms based on ICD-9 codes, text keywords, and medications to identify individuals with MS from a de-identified, research version of the EMR at Vanderbilt University. Using a training dataset of the records of 899 individuals, algorithms were constructed to identify and extract detailed information regarding the clinical course of MS from the text of the medical records, including clinical subtype, presence of oligoclonal bands, year of diagnosis, year and origin of first symptom, Expanded Disability Status Scale (EDSS) scores, timed 25-foot walk scores, and MS medications. Algorithms were evaluated on a test set validated by two independent reviewers., Results: We identified 5789 individuals with MS. For all clinical traits extracted, precision was at least 87% and specificity was greater than 80%. Recall values for clinical subtype, EDSS scores, and timed 25-foot walk scores were greater than 80%., Discussion and Conclusion: This collection of clinical data represents one of the largest databases of detailed, clinical traits available for research on MS. This work demonstrates that detailed clinical information is recorded in the EMR and can be extracted for research purposes with high reliability.
- Published
- 2013
44. Electronic health records-driven phenotyping: challenges, recent advances, and perspectives.
- Author
-
Pathak J, Kho AN, and Denny JC
- Subjects
- Academic Medical Centers, Genetic Association Studies, Humans, Algorithms, Electronic Health Records, Genomics methods, Phenotype
- Published
- 2013
45. Syntactic parsing of clinical text: guideline and corpus development with handling ill-formed sentences.
- Author
-
Fan JW, Yang EW, Jiang M, Prasad R, Loomis RM, Zisook DS, Denny JC, Xu H, and Huang Y
- Subjects
- Electronic Health Records, Guidelines as Topic, Linguistics, Natural Language Processing
- Abstract
Objective: To develop, evaluate, and share: (1) syntactic parsing guidelines for clinical text, with a new approach to handling ill-formed sentences; and (2) a clinical Treebank annotated according to the guidelines. To document the process and findings for readers with similar interest. Methods: Using random samples from a shared natural language processing challenge dataset, we developed a handbook of domain-customized syntactic parsing guidelines based on iterative annotation and adjudication between two institutions. Special considerations were incorporated into the guidelines for handling ill-formed sentences, which are common in clinical text. Intra- and inter-annotator agreement rates were used to evaluate consistency in following the guidelines. Quantitative and qualitative properties of the annotated Treebank, as well as its use to retrain a statistical parser, were reported. Results: A supplement to the Penn Treebank II guidelines was developed for annotating clinical sentences. After three iterations of annotation and adjudication on 450 sentences, the annotators reached an F-measure agreement rate of 0.930 (while intra-annotator rate was 0.948) on a final independent set. A total of 1100 sentences from progress notes were annotated that demonstrated domain-specific linguistic features. A statistical parser retrained with combined general English (mainly news text) annotations and our annotations achieved an accuracy of 0.811 (higher than models trained purely with either general or clinical sentences alone). Both the guidelines and syntactic annotations are made available at https://sourceforge.net/projects/medicaltreebank. Conclusions: We developed guidelines for parsing clinical text and annotated a corpus accordingly. The high intra- and inter-annotator agreement rates showed decent consistency in following the guidelines. The corpus was shown to be useful in retraining a statistical parser that achieved moderate accuracy.
- Published
- 2013
46. A hybrid system for temporal information extraction from clinical text.
- Author
-
Tang B, Wu Y, Jiang M, Chen Y, Denny JC, and Xu H
- Subjects
- Artificial Intelligence, Humans, Natural Language Processing, Time, Electronic Health Records, Patient Discharge Summaries, Translational Research, Biomedical
- Abstract
Objective: To develop a comprehensive temporal information extraction system that can identify events, temporal expressions, and their temporal relations in clinical text. This project was part of the 2012 i2b2 clinical natural language processing (NLP) challenge on temporal information extraction. Materials and Methods: The 2012 i2b2 NLP challenge organizers manually annotated 310 clinic notes according to a defined annotation guideline: a training set of 190 notes and a test set of 120 notes. All participating systems were developed on the training set and evaluated on the test set. Our system consists of three modules: event extraction, temporal expression extraction, and temporal relation (also called Temporal Link, or 'TLink') extraction. The TLink extraction module contains three individual classifiers for TLinks: (1) between events and section times, (2) within a sentence, and (3) across different sentences. The performance of our system was evaluated using scripts provided by the i2b2 organizers. Primary measures were micro-averaged Precision, Recall, and F-measure. Results: Our system was among the top ranked. It achieved F-measures of 0.8659 for temporal expression extraction (ranked fourth), 0.6278 for end-to-end TLink track (ranked first), and 0.6932 for TLink-only track (ranked first) in the challenge. We subsequently investigated different strategies for TLink extraction, and were able to marginally improve performance with an F-measure of 0.6943 for TLink-only track.
- Published
- 2013
47. Automated identification of drug and food allergies entered using non-standard terminology.
- Author
-
Epstein RH, St Jacques P, Stockin M, Rothman B, Ehrenfeld JM, and Denny JC
- Subjects
- Humans, Medical Records Systems, Computerized, Algorithms, Data Mining methods, Drug Hypersensitivity, Electronic Health Records, Food Hypersensitivity
- Abstract
Objective: An accurate computable representation of food and drug allergy is essential for safe healthcare. Our goal was to develop a high-performance, easily maintained algorithm to identify medication and food allergies and sensitivities from unstructured allergy entries in electronic health record (EHR) systems. Materials and Methods: An algorithm was developed in Transact-SQL to identify ingredients to which patients had allergies in a perioperative information management system. The algorithm used RxNorm and natural language processing techniques developed on a training set of 24 599 entries from 9445 records. Accuracy, specificity, precision, recall, and F-measure were determined for the training dataset and repeated for the testing dataset (24 857 entries from 9430 records). Results: Accuracy, precision, recall, and F-measure for medication allergy matches were all above 98% in the training dataset and above 97% in the testing dataset for all allergy entries. Corresponding values for food allergy matches were above 97% and above 93%, respectively. Specificities of the algorithm were 90.3% and 85.0% for drug matches and 100% and 88.9% for food matches in the training and testing datasets, respectively. Discussion: The algorithm had high performance for identification of medication and food allergies. Maintenance is practical, as updates are managed through upload of new RxNorm versions and additions to companion database tables. However, direct entry of codified allergy information by providers (through autocompleters or drop lists) is still preferred to post-hoc encoding of the data. Data tables used in the algorithm are available for download. Conclusions: A high performing, easily maintained algorithm can successfully identify medication and food allergies from free text entries in EHR systems.
- Published
- 2013
48. Development and evaluation of an ensemble resource linking medications to their indications.
- Author
-
Wei WQ, Cronin RM, Xu H, Lasko TA, Bastarache L, and Denny JC
- Subjects
- Dictionaries as Topic, Internet, MedlinePlus, RxNorm, Drug Therapy, Electronic Health Records, Natural Language Processing, Pharmaceutical Preparations
- Abstract
Objective: To create a computable MEDication Indication resource (MEDI) to support primary and secondary use of electronic medical records (EMRs). Materials and Methods: We processed four public medication resources, RxNorm, Side Effect Resource (SIDER) 2, MedlinePlus, and Wikipedia, to create MEDI. We applied natural language processing and ontology relationships to extract indications for prescribable, single-ingredient medication concepts and all ingredient concepts as defined by RxNorm. Indications were coded as Unified Medical Language System (UMLS) concepts and International Classification of Diseases, 9th edition (ICD9) codes. A total of 689 extracted indications were randomly selected for manual review for accuracy using dual-physician review. We identified a subset of medication-indication pairs that optimizes recall while maintaining high precision. Results: MEDI contains 3112 medications and 63 343 medication-indication pairs. Wikipedia was the largest resource, with 2608 medications and 34 911 pairs. For each resource, estimated precision and recall, respectively, were 94% and 20% for RxNorm, 75% and 33% for MedlinePlus, 67% and 31% for SIDER 2, and 56% and 51% for Wikipedia. The MEDI high-precision subset (MEDI-HPS) includes indications found within either RxNorm or at least two of the three other resources. MEDI-HPS contains 13 304 unique indication pairs regarding 2136 medications. The mean±SD number of indications for each medication in MEDI-HPS is 6.22 ± 6.09. The estimated precision of MEDI-HPS is 92%. Conclusions: MEDI is a publicly available, computable resource that links medications with their indications as represented by concepts and billing codes. MEDI may benefit clinical EMR applications and reuse of EMR data for research.
- Published
- 2013
49. Genetic variants that confer resistance to malaria are associated with red blood cell traits in African-Americans: an electronic medical record-based genome-wide association study.
- Author
-
Ding K, de Andrade M, Manolio TA, Crawford DC, Rasmussen-Torvik LJ, Ritchie MD, Denny JC, Masys DR, Jouni H, Pachecho JA, Kho AN, Roden DM, Chisholm R, and Kullo IJ
- Subjects
- Electronic Health Records, Erythrocyte Count, Erythrocyte Indices genetics, Hematocrit, Hemoglobins genetics, Humans, Malaria blood, Black or African American genetics, Disease Resistance genetics, Erythrocytes pathology, Genome-Wide Association Study, Malaria genetics
- Abstract
To identify novel genetic loci influencing interindividual variation in red blood cell (RBC) traits in African-Americans, we conducted a genome-wide association study (GWAS) in 2315 individuals, divided into discovery (n = 1904) and replication (n = 411) cohorts. The traits included hemoglobin concentration (HGB), hematocrit (HCT), RBC count, mean corpuscular volume (MCV), mean corpuscular hemoglobin (MCH), and mean corpuscular hemoglobin concentration (MCHC). Patients were participants in the electronic MEdical Records and GEnomics (eMERGE) network and underwent genotyping of ~1.2 million single-nucleotide polymorphisms on the Illumina Human1M-Duo array. Association analyses were performed adjusting for age, sex, site, and population stratification. Three loci previously associated with resistance to malaria (HBB [11p15.4], HBA1/HBA2 [16p13.3], and G6PD [Xq28]) were associated (P ≤ 1 × 10⁻⁶) with RBC traits in the discovery cohort. The loci replicated in the replication cohort (P ≤ 0.02), and were significant at a genome-wide significance level (P < 5 × 10⁻⁸) in the combined cohort. The proportions of variance in RBC traits explained by significant variants at these loci were as follows: rs7120391 (near HBB) 1.3% of MCHC, rs9924561 (near HBA1/A2) 5.5% of MCV, 6.9% of MCH and 2.9% of MCHC, and rs1050828 (in G6PD) 2.4% of RBC count, 2.9% of MCV, and 1.4% of MCH, respectively. We were not able to replicate loci identified by a previous GWAS of RBC traits in a European ancestry cohort of similar sample size, suggesting that the genetic architecture of RBC traits differs by race. In conclusion, genetic variants that confer resistance to malaria are associated with RBC traits in African-Americans.
- Published
- 2013
50. Validation of electronic medical record-based phenotyping algorithms: results and lessons learned from the eMERGE network.
- Author
-
Newton KM, Peissig PL, Kho AN, Bielinski SJ, Berg RL, Choudhary V, Basford M, Chute CG, Kullo IJ, Li R, Pacheco JA, Rasmussen LV, Spangler L, and Denny JC
- Subjects
- Computer Communication Networks, Genetic Research, Humans, Medical Audit, United States, Validation Studies as Topic, Algorithms, Electronic Health Records, Genetic Association Studies, Phenotype
- Abstract
Background: Genetic studies require precise phenotype definitions, but electronic medical record (EMR) phenotype data are recorded inconsistently and in a variety of formats. Objective: To present lessons learned about validation of EMR-based phenotypes from the Electronic Medical Records and Genomics (eMERGE) studies. Materials and Methods: The eMERGE network created and validated 13 EMR-derived phenotype algorithms. Network sites are Group Health, Marshfield Clinic, Mayo Clinic, Northwestern University, and Vanderbilt University. Results: By validating EMR-derived phenotypes we learned that: (1) multisite validation improves phenotype algorithm accuracy; (2) targets for validation should be carefully considered and defined; (3) specifying time frames for review of variables eases validation time and improves accuracy; (4) using repeated measures requires defining the relevant time period and specifying the most meaningful value to be studied; (5) patient movement in and out of the health plan (transience) can result in incomplete or fragmented data; (6) the review scope should be defined carefully; (7) particular care is required in combining EMR and research data; (8) medication data can be assessed using claims, medications dispensed, or medications prescribed; (9) algorithm development and validation work best as an iterative process; and (10) validation by content experts or structured chart review can provide accurate results. Conclusions: Despite the diverse structure of the five EMRs of the eMERGE sites, we developed, validated, and successfully deployed 13 electronic phenotype algorithms. Validation is a worthwhile process that not only measures phenotype performance but also strengthens phenotype algorithm definitions and enhances their inter-institutional sharing.
- Published
- 2013