
Evaluation of GPT-4 ability to identify and generate patient instructions for actionable incidental radiology findings.

Authors :
Woo KC
Simon GW
Akindutire O
Aphinyanaphongs Y
Austrian JS
Kim JG
Genes N
Goldenring JA
Major VJ
Pariente CS
Pineda EG
Kang SK
Source :
Journal of the American Medical Informatics Association : JAMIA [J Am Med Inform Assoc] 2024 Sep 01; Vol. 31 (9), pp. 1983-1993.
Publication Year :
2024

Abstract

Objectives: To evaluate the proficiency of a HIPAA-compliant version of GPT-4 in identifying actionable, incidental findings from unstructured radiology reports of Emergency Department patients, and to assess the appropriateness of artificial intelligence (AI)-generated, patient-facing summaries of these findings.

Materials and Methods: Radiology reports extracted from the electronic health record of a large academic medical center were manually reviewed to identify non-emergent, incidental findings with a high likelihood of requiring follow-up, sub-stratified as "definitely actionable" (DA) or "possibly actionable-clinical correlation" (PA-CC). Instruction prompts to GPT-4 were developed and iteratively optimized using a validation set of 50 reports. The optimized prompt was then applied to a test set of 430 unseen reports. GPT-4 performance was graded primarily on accuracy in identifying either DA or PA-CC findings, and secondarily on DA findings alone. Outputs were reviewed for hallucinations. AI-generated patient-facing summaries were assessed for appropriateness via Likert scale.

Results: For the primary outcome (DA or PA-CC), GPT-4 achieved 99.3% recall, 73.6% precision, and an F-1 score of 84.5%. For the secondary outcome (DA only), GPT-4 demonstrated 95.2% recall, 77.3% precision, and an F-1 score of 85.3%. No findings were "hallucinated" outright; however, 2.8% of cases included generated text about recommendations that were inferred without specific reference. The majority of true-positive AI-generated summaries required no or only minor revision.

Conclusion: GPT-4 demonstrates proficiency in detecting actionable, incidental findings after refined instruction prompting. AI-generated patient instructions were most often appropriate, but rarely included inferred recommendations. While this technology shows promise to augment diagnostics, active clinician oversight via "human-in-the-loop" workflows remains critical for clinical implementation.

(© The Author(s) 2024. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: journals.permissions@oup.com.)
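The F-1 scores reported in the abstract can be reproduced from the stated precision and recall values. A minimal sketch, assuming the standard definition of F-1 as the harmonic mean of precision and recall (the abstract does not spell out the formula):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F-1 score)."""
    return 2 * precision * recall / (precision + recall)

# Primary outcome (DA or PA-CC): 73.6% precision, 99.3% recall
primary = f1(0.736, 0.993)

# Secondary outcome (DA only): 77.3% precision, 95.2% recall
secondary = f1(0.773, 0.952)

# Both agree with the reported values of 84.5% and 85.3%.
print(round(primary, 3), round(secondary, 3))  # → 0.845 0.853
```

The high recall with lower precision reflects a screening-oriented trade-off: the prompt was tuned so that actionable findings are almost never missed, at the cost of some false positives that a human reviewer must filter out.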

Details

Language :
English
ISSN :
1527-974X
Volume :
31
Issue :
9
Database :
MEDLINE
Journal :
Journal of the American Medical Informatics Association : JAMIA
Publication Type :
Academic Journal
Accession number :
38778578
Full Text :
https://doi.org/10.1093/jamia/ocae117