Start Over

Triage Performance Across Large Language Models, ChatGPT, and Untrained Doctors in Emergency Medicine: Comparative Study

Authors :: Lars Masanneck
Linea Schmidt
Antonia Seifert
Tristan Kölsche
Niklas Huntemann
Robin Jansen
Mohammed Mehsin
Michael Bernhard
Sven G Meuth
Lennert Böhm
Marc Pawlitzki
Source :: Journal of Medical Internet Research, Vol 26, p e53297 (2024)
Publication Year :: 2024
Publisher :: JMIR Publications, 2024.
Abstract: BackgroundLarge language models (LLMs) have demonstrated impressive performances in various medical domains, prompting an exploration of their potential utility within the high-demand setting of emergency department (ED) triage. This study evaluated the triage proficiency of different LLMs and ChatGPT, an LLM-based chatbot, compared to professionally trained ED staff and untrained personnel. We further explored whether LLM responses could guide untrained staff in effective triage. ObjectiveThis study aimed to assess the efficacy of LLMs and the associated product ChatGPT in ED triage compared to personnel of varying training status and to investigate if the models’ responses can enhance the triage proficiency of untrained personnel. MethodsA total of 124 anonymized case vignettes were triaged by untrained doctors; different versions of currently available LLMs; ChatGPT; and professionally trained raters, who subsequently agreed on a consensus set according to the Manchester Triage System (MTS). The prototypical vignettes were adapted from cases at a tertiary ED in Germany. The main outcome was the level of agreement between raters’ MTS level assignments, measured via quadratic-weighted Cohen κ. The extent of over- and undertriage was also determined. Notably, instances of ChatGPT were prompted using zero-shot approaches without extensive background information on the MTS. The tested LLMs included raw GPT-4, Llama 3 70B, Gemini 1.5, and Mixtral 8x7b. ResultsGPT-4–based ChatGPT and untrained doctors showed substantial agreement with the consensus triage of professional raters (κ=mean 0.67, SD 0.037 and κ=mean 0.68, SD 0.056, respectively), significantly exceeding the performance of GPT-3.5–based ChatGPT (κ=mean 0.54, SD 0.024; P

Subjects :: Computer applications to medicine. Medical informatics
R858-859.7
Public aspects of medicine
RA1-1270

Details

Language :: English
ISSN :: 14388871
Volume :: 26
Database :: Directory of Open Access Journals
Journal :: Journal of Medical Internet Research
Publication Type :: Academic Journal
Accession number :: edsdoj.3e9c85a398b543bab44f5293f804c800
Document Type :: article
Full Text :: https://doi.org/10.2196/53297

Full Text Access

View/download PDF

Tools

Email
Cite

Printer

Authors Abstract Subjects Details

Searchworks

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources

Triage Performance Across Large Language Models, ChatGPT, and Untrained Doctors in Emergency Medicine: Comparative Study

Abstract

Subjects

Details

Tools

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Triage Performance Across Large Language Models, ChatGPT, and Untrained Doctors in Emergency Medicine: Comparative Study

Abstract

Subjects

Details

Tools

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources