
Radiologic Decision-Making for Imaging in Pulmonary Embolism: Accuracy and Reliability of Large Language Models—Bing, Claude, ChatGPT, and Perplexity.

Authors :
Sarangi, Pradosh Kumar
Datta, Suvrankar
Swarup, M. Sarthak
Panda, Swaha
Nayak, Debasish Swapnesh Kumar
Malik, Archana
Datta, Ananda
Mondal, Himel
Source :
Indian Journal of Radiology & Imaging. Oct 2024, Vol. 34 Issue 4, p653-660. 8p.
Publication Year :
2024

Abstract

Background: Artificial intelligence (AI) chatbots have demonstrated potential to enhance clinical decision-making and streamline health care workflows, potentially alleviating administrative burdens. However, their contribution to radiologic decision-making for clinical scenarios remains insufficiently explored. This study evaluates the accuracy and reliability of four prominent large language models (LLMs), Microsoft Bing, Claude, ChatGPT 3.5, and Perplexity, in offering clinical decision support for the initial imaging of suspected pulmonary embolism (PE).

Methods: Open-ended (OE) and select-all-that-apply (SATA) questions were crafted covering four variants of PE case scenarios, in line with the American College of Radiology Appropriateness Criteria. These questions were presented to the LLMs by three radiologists from diverse geographic regions and practice settings. Responses were evaluated against established scoring criteria, with a maximum achievable score of 2 points for OE responses and 1 point for each correct answer in SATA questions. To enable comparative analysis, scores were normalized (score divided by the maximum achievable score).

Results: In OE questions, Perplexity achieved the highest accuracy (0.83) and Claude the lowest (0.58), with Bing and ChatGPT each scoring 0.75. For SATA questions, Bing led with an accuracy of 0.96, Perplexity was the lowest at 0.56, and Claude and ChatGPT each scored 0.6. Overall, OE questions yielded higher scores (0.73) than SATA questions (0.68). Agreement among the radiologists' scores was poor for OE questions (intraclass correlation coefficient [ICC] = −0.067, p = 0.54) but strong for SATA questions (ICC = 0.875, p < 0.001).

Conclusion: The study revealed variations in accuracy across LLMs for both OE and SATA questions. Perplexity performed best on OE questions, while Bing excelled on SATA questions, and OE queries yielded better overall results. The current inconsistencies in LLM accuracy underscore the need for further refinement, including additional fine-tuning and judicious model selection by radiologists, before these tools can provide consistent and reliable support for clinical decision-making. [ABSTRACT FROM AUTHOR]
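As a rough illustration of the scoring approach described in the abstract, the sketch below normalizes hypothetical per-rater scores and estimates inter-rater agreement with an intraclass correlation coefficient. The example data, column names, and the use of the pingouin library are assumptions made for illustration; none of it is taken from the study itself.

```python
# Minimal sketch (not the authors' code): normalize rater scores and estimate
# inter-rater agreement with an ICC, using made-up example data.
import pandas as pd
import pingouin as pg  # assumed available; provides intraclass_corr()

MAX_OE_SCORE = 2  # maximum achievable score for an open-ended response

# Hypothetical scores from three radiologists rating the same five LLM responses.
raw = pd.DataFrame({
    "response_id": sorted([1, 2, 3, 4, 5] * 3),
    "rater": ["R1", "R2", "R3"] * 5,
    "score": [2, 1, 2, 1, 1, 0, 2, 2, 1, 0, 1, 1, 2, 2, 2],
})

# Normalize as described in the study: score divided by the maximum achievable score.
raw["normalized"] = raw["score"] / MAX_OE_SCORE

# Intraclass correlation coefficients across raters for the same items.
icc = pg.intraclass_corr(data=raw, targets="response_id",
                         raters="rater", ratings="normalized")
print(icc[["Type", "ICC", "pval"]])
```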

Details

Language :
English
ISSN :
09713026
Volume :
34
Issue :
4
Database :
Academic Search Index
Journal :
Indian Journal of Radiology & Imaging
Publication Type :
Academic Journal
Accession number :
179786457
Full Text :
https://doi.org/10.1055/s-0044-1787974