Performance of large language models in oral and maxillofacial surgery examinations.

Authors :: Quah, B.
Yong, C.W.
Lai, C.W.M.
Islam, I.
Source :: International Journal of Oral & Maxillofacial Surgery; Oct2024, Vol. 53 Issue 10, p881-886, 6p
Publication Year :: 2024
Abstract: This study aimed to determine the accuracy of large language models (LLMs) in answering oral and maxillofacial surgery (OMS) multiple choice questions. A total of 259 questions from the university's question bank were answered by the LLMs (GPT-3.5, GPT-4, Llama 2, Gemini, and Copilot). The scores per category as well as the total score out of 259 were recorded and evaluated, with the passing score set at 50%. The mean overall score amongst all LLMs was 62.5%. GPT-4 performed the best (76.8%, 95% confidence interval (CI) 71.4–82.2%), followed by Copilot (72.6%, 95% CI 67.2–78.0%), GPT-3.5 (62.2%, 95% CI 56.4–68.0%), Gemini (58.7%, 95% CI 52.9–64.5%), and Llama 2 (42.5%, 95% CI 37.1–48.6%). There was a statistically significant difference between the scores of the five LLMs overall (χ² = 79.9, df = 4, P < 0.001) and within all categories except 'basic sciences' (P = 0.129), 'dentoalveolar and implant surgery' (P = 0.052), and 'oral medicine/pathology/radiology' (P = 0.801). The LLMs performed best in 'basic sciences' (68.9%) and poorest in 'pharmacology' (45.9%). The LLMs can be used as adjuncts in teaching, but should not be used for clinical decision-making until the models are further developed and validated. [ABSTRACT FROM AUTHOR]
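
The overall comparison reported in the abstract can be roughly checked from the published percentages. The sketch below is illustrative only and is not the authors' analysis code: it assumes each model's number of correct answers can be back-calculated by rounding the reported percentage of 259 questions, and it assumes a Pearson chi-square test of homogeneity on the resulting 5 x 2 table of correct and incorrect answers. The helper name `chi_square_homogeneity` is hypothetical.

```python
# Overall percentages as reported in the abstract.
N_QUESTIONS = 259
reported_pct = {
    "GPT-4": 76.8,
    "Copilot": 72.6,
    "GPT-3.5": 62.2,
    "Gemini": 58.7,
    "Llama 2": 42.5,
}

# Assumption: correct-answer counts are recovered by rounding pct * 259 / 100.
correct = {model: round(pct * N_QUESTIONS / 100) for model, pct in reported_pct.items()}


def chi_square_homogeneity(correct_counts, n_per_group):
    """Pearson chi-square for k groups of equal size with a correct/incorrect outcome."""
    k = len(correct_counts)
    # Under the null hypothesis every model shares the pooled accuracy,
    # so the expected count is the same for every group.
    expected_correct = sum(correct_counts.values()) / k
    expected_wrong = n_per_group - expected_correct
    statistic = 0.0
    for observed_correct in correct_counts.values():
        observed_wrong = n_per_group - observed_correct
        statistic += (observed_correct - expected_correct) ** 2 / expected_correct
        statistic += (observed_wrong - expected_wrong) ** 2 / expected_wrong
    return statistic, k - 1


chi2, df = chi_square_homogeneity(correct, N_QUESTIONS)
pooled_pct = 100 * sum(correct.values()) / (N_QUESTIONS * len(correct))

print(f"chi-square = {chi2:.1f}, df = {df}")
print(f"pooled accuracy = {pooled_pct:.1f}%")
```

Under these assumptions the statistic and degrees of freedom agree with the reported χ² = 79.9 and df = 4, and the pooled accuracy agrees with the reported mean of 62.5%. The P value and the per-category tests are not reproduced here because the category-level counts are not given in the abstract.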

Subjects :: LANGUAGE models
GENERATIVE pre-trained transformers
ARTIFICIAL intelligence
ORAL medicine
ORAL surgery
MAXILLOFACIAL surgery
