
Evaluation Framework of Large Language Models in Medical Documentation: Development and Usability Study.

Authors :
Seo J
Choi D
Kim T
Cha WC
Kim M
Yoo H
Oh N
Yi Y
Lee KH
Choi E
Source :
Journal of medical Internet research [J Med Internet Res] 2024 Nov 20; Vol. 26, pp. e58329. Date of Electronic Publication: 2024 Nov 20.
Publication Year :
2024

Abstract

Background: The advancement of large language models (LLMs) offers significant opportunities for health care, particularly in the generation of medical documentation. However, challenges related to ensuring the accuracy and reliability of LLM outputs, coupled with the absence of established quality standards, have raised concerns about their clinical application.

Objective: This study aimed to develop and validate an evaluation framework for assessing the accuracy and clinical applicability of LLM-generated emergency department (ED) records, aiming to enhance artificial intelligence integration in health care documentation.

Methods: We organized the Healthcare Prompt-a-thon, a competitive event designed to explore the capabilities of LLMs in generating accurate medical records. The event involved 52 participants who generated 33 initial ED records using HyperCLOVA X, a Korean-specialized LLM. We applied a dual evaluation approach. First, clinical evaluation: 4 medical professionals evaluated the records using a 5-point Likert scale across 5 criteria: appropriateness, accuracy, structure/format, conciseness, and clinical validity. Second, quantitative evaluation: we developed a framework to categorize and count errors in the LLM outputs, identifying 7 key error types. Statistical methods, including Pearson correlation and intraclass correlation coefficients (ICC), were used to assess consistency and agreement among evaluators.

Results: The clinical evaluation demonstrated strong interrater reliability, with ICC values ranging from 0.653 to 0.887 (P<.001), and a test-retest reliability Pearson correlation coefficient of 0.776 (P<.001). Quantitative analysis revealed that invalid generation errors were the most common, constituting 35.38% of total errors, while structural malformation errors had the most significant negative impact on the clinical evaluation score (Pearson r=-0.654; P<.001). A strong negative correlation was found between the number of quantitative errors and clinical evaluation scores (Pearson r=-0.633; P<.001), indicating that higher error rates corresponded to lower clinical acceptability.

Conclusions: Our research provides robust support for the reliability and clinical acceptability of the proposed evaluation framework. It underscores the framework's potential to mitigate clinical burdens and foster the responsible integration of artificial intelligence technologies in health care, suggesting a promising direction for future research and practical applications in the field.

(©Junhyuk Seo, Dasol Choi, Taerim Kim, Won Chul Cha, Minha Kim, Haanju Yoo, Namkee Oh, YongJin Yi, Kye Hwa Lee, Edward Choi. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 20.11.2024.)
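The correlation analysis described in the Results (error counts vs. clinical scores) can be sketched as follows. This is an illustrative reconstruction, not the authors' code, and the per-record values below are invented solely to show the computation.

```python
# Illustrative sketch: correlating per-record quantitative error counts
# with mean clinical Likert scores via the Pearson coefficient, as in the
# study's reported analysis (r = -0.633). All data here are hypothetical.
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

# Hypothetical values: total counted errors per generated ED record, and
# the corresponding mean clinical evaluation score (1-5 Likert scale).
errors = [0, 1, 2, 4, 6, 9]
scores = [4.8, 4.5, 4.1, 3.6, 3.0, 2.2]

r = pearson_r(errors, scores)  # strongly negative for this toy data
```

A negative r here would indicate, as in the study, that records with more counted errors tend to receive lower clinical acceptability ratings.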

Details

Language :
English
ISSN :
1438-8871
Volume :
26
Database :
MEDLINE
Journal :
Journal of medical Internet research
Publication Type :
Academic Journal
Accession number :
39566044
Full Text :
https://doi.org/10.2196/58329