
Can OpenAI o1 Reason Well in Ophthalmology? A 6,990-Question Head-to-Head Evaluation Study

Authors:
Srinivasan, Sahana
Ai, Xuguang
Zou, Minjie
Zou, Ke
Kim, Hyunjae
Lo, Thaddaeus Wai Soon
Pushpanathan, Krithi
Kong, Yiming
Li, Anran
Singer, Maxwell
Jin, Kai
Antaki, Fares
Chen, David Ziyou
Liu, Dianbo
Adelman, Ron A.
Chen, Qingyu
Tham, Yih Chung
Publication Year:
2025

Abstract

Question: What is the performance and reasoning ability of OpenAI o1 compared with other large language models in addressing ophthalmology-specific questions?

Findings: This study evaluated OpenAI o1 and five other LLMs on 6,990 ophthalmology questions from MedMCQA. O1 achieved the highest accuracy (0.88) and macro-F1 score but ranked third in reasoning capability based on text-generation metrics. Across subtopics, o1 ranked first in "Lens" and "Glaucoma" but second to GPT-4o in "Corneal and External Diseases", "Vitreous and Retina", and "Oculoplastic and Orbital Diseases". Subgroup analyses showed that o1 performed better on queries with longer ground-truth explanations.

Meaning: O1's reasoning enhancements may not fully extend to ophthalmology, underscoring the need for domain-specific refinements to optimize performance in specialized fields.

Comment: 44 pages
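For context on the headline numbers, the two metrics reported above (accuracy and macro-F1) can be sketched in plain Python; this is a generic illustration of how such scores are computed from predicted vs. ground-truth answer labels, not the paper's actual evaluation code, and the example labels are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of questions where the predicted answer matches the key."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """F1 computed per class, then averaged with equal weight per class."""
    labels = set(y_true) | set(y_pred)
    f1_scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Hypothetical answer keys and model predictions (A/B/C/D options):
gold = ["A", "B", "A", "C"]
pred = ["A", "B", "B", "C"]
```

Macro-averaging (unlike plain accuracy) weights every answer class equally, so rare options count as much as common ones; that is why the study reports it alongside accuracy.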

Details

Database:
arXiv
Publication Type:
Report
Accession number:
edsarx.2501.13949
Document Type:
Working Paper