Retrieval Augmented Generation for 10 Large Language Models and its Generalizability in Assessing Medical Fitness
- Authors
Ke, Yu He, Jin, Liyuan, Elangovan, Kabilan, Abdullah, Hairil Rizal, Liu, Nan, Sia, Alex Tiong Heng, Soh, Chai Rick, Tung, Joshua Yi Min, Ong, Jasmine Chiat Ling, Kuo, Chang-Fu, Wu, Shao-Chun, Kovacheva, Vesela P., and Ting, Daniel Shu Wei
- Subjects
Computer Science - Computation and Language, Computer Science - Artificial Intelligence
- Abstract
Large Language Models (LLMs) show potential for medical applications but often lack specialized clinical knowledge. Retrieval Augmented Generation (RAG) allows customization with domain-specific information, making it suitable for healthcare. This study evaluates the accuracy, consistency, and safety of RAG models in determining fitness for surgery and providing preoperative instructions. We developed LLM-RAG models using 35 local and 23 international preoperative guidelines and tested them against human-generated responses. A total of 3,682 responses were evaluated. Clinical documents were processed using LlamaIndex, and 10 LLMs, including GPT3.5, GPT4, and Claude-3, were assessed. Fourteen clinical scenarios were analyzed, focusing on seven aspects of preoperative instructions. Established guidelines and expert judgment were used to determine correct responses, with human-generated answers serving as comparisons. The LLM-RAG models generated responses within 20 seconds, significantly faster than clinicians (10 minutes). The GPT4 LLM-RAG model achieved the highest accuracy (96.4% vs. 86.6%, p=0.016), with no hallucinations and producing correct instructions comparable to clinicians. Results were consistent across both local and international guidelines. This study demonstrates the potential of LLM-RAG models for preoperative healthcare tasks, highlighting their efficiency, scalability, and reliability.
- Comment
arXiv admin note: substantial text overlap with arXiv:2402.01733
- Published
2024