Evaluating Large Language Models for Enhancing Radiology Specialty Examination: A Comparative Study with Human Performance.
Authors
Affiliations (7)
- Department of Medical Imaging, National Taiwan University Hospital Hsin-Chu Branch, No. 25, Lane 442, Sec. 1, Jingguo Rd., Hsinchu City 300, Taiwan ROC (H.Y.L.); National Taiwan University College of Medicine, No. 1 Jen Ai Road Section 1, Taipei 100, Taiwan ROC (H.Y.L.).
- Department of Medical Imaging, National Taiwan University Hospital, No. 7 Zhongshan S. Rd., Zhongzheng Dist., Taipei City 100229, Taiwan (S.J.C., W.J.L.).
- Institute of Applied Mathematical Sciences, Department of Mathematics and Data Science Degree Program, National Taiwan University, No. 1, Sec. 4, Roosevelt Rd., Taipei 106319, Taiwan (W.W.).
- Graduate Institute of Health and Biotechnology Law, Taipei Medical University, 250 Wuxing Street, Taipei City, Taiwan (C.H.L.).
- Department of Radiology, Tri-Service General Hospital, National Defense Medical Center, No.325, Sec.2, Chenggong Rd., Neihu District, Taipei City, Taiwan (H.H.H.).
- Department of Radiology, Taipei Veterans General Hospital, No.201, Sec. 2, Shipai Rd., Beitou District, Taipei City, Taiwan (S.H.S., H.J.C.); National Yang Ming Chiao Tung University School of Medicine, No. 118, Sec. 1, Zhongxiao West Road, Taipei City 100, Taiwan (S.H.S., H.J.C.).
- Department of Medical Imaging, National Taiwan University Hospital, No. 7 Zhongshan S. Rd., Zhongzheng Dist., Taipei City 100229, Taiwan (S.J.C., W.J.L.). Electronic address: [email protected].
Abstract
The radiology specialty examination assesses clinical decision-making, image interpretation, and diagnostic reasoning. As medical knowledge expands, traditional test design faces challenges in maintaining accuracy and relevance. Large language models (LLMs) have demonstrated potential in medical education. This study evaluates LLM performance on the radiology specialty examination, explores their role in assessing question difficulty, and investigates their reasoning processes, with the aim of developing a more objective and efficient framework for exam design. The study compared the performance of LLMs with that of human examinees on a radiology specialty examination. Three LLMs (GPT-4o, o1-preview, and GPT-3.5-turbo-1106) were evaluated under zero-shot conditions. Exam accuracy, examinee accuracy, the discrimination index, and the point-biserial correlation were used to assess the LLMs' ability to gauge question difficulty and to characterize their reasoning processes. Data provided by the Taiwan Radiological Society ensured comparability between AI and human performance. In terms of accuracy, GPT-4o (88.0%) and o1-preview (90.9%) outperformed human examinees (76.3%), whereas GPT-3.5-turbo-1106 showed significantly lower accuracy (50.2%). Question difficulty analysis revealed that the newer LLMs excelled at complex questions, while GPT-3.5-turbo-1106 exhibited greater performance variability. Discrimination index and point-biserial correlation analyses demonstrated that GPT-4o and o1-preview accurately identified key differentiating questions, closely mirroring human reasoning patterns. These findings suggest that advanced LLMs can assess medical examination difficulty, with potential applications in exam standardization and question evaluation. In summary, this study evaluated the problem-solving capabilities of GPT-3.5-turbo-1106, GPT-4o, and o1-preview on a radiology specialty examination; the results support using LLMs as tools for assessing exam question difficulty and for the standardized development of medical examinations.
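To make the item-analysis metrics named in the abstract concrete, the following is a minimal sketch (not the authors' code) of item difficulty (proportion correct), the discrimination index (upper-group minus lower-group proportion correct), and the point-biserial correlation between an item score and the total score. The response matrix, function names, and the 27% grouping fraction are illustrative assumptions.

```python
# Minimal item-analysis sketch: difficulty, discrimination index, point-biserial.
# Assumes `responses` is a hypothetical 0/1 matrix of shape (examinees, items).
import numpy as np

def item_analysis(responses: np.ndarray, group_fraction: float = 0.27):
    """Return per-item difficulty, discrimination index, and point-biserial correlation."""
    responses = np.asarray(responses, dtype=float)
    n_examinees, n_items = responses.shape
    total = responses.sum(axis=1)                      # each examinee's total score

    # Item difficulty: proportion of examinees answering the item correctly.
    difficulty = responses.mean(axis=0)

    # Discrimination index: proportion correct in the top group minus the bottom
    # group, with groups taken as the top/bottom fraction of total scores.
    order = np.argsort(total)
    n_group = max(1, int(round(group_fraction * n_examinees)))
    lower, upper = order[:n_group], order[-n_group:]
    discrimination = responses[upper].mean(axis=0) - responses[lower].mean(axis=0)

    # Point-biserial correlation: Pearson correlation between the dichotomous
    # item score and the total score, via the standard closed form.
    point_biserial = np.empty(n_items)
    sd_total = total.std()                             # population SD of total scores
    for j in range(n_items):
        correct = responses[:, j] == 1
        p = correct.mean()
        if 0 < p < 1 and sd_total > 0:
            m1, m0 = total[correct].mean(), total[~correct].mean()
            point_biserial[j] = (m1 - m0) / sd_total * np.sqrt(p * (1 - p))
        else:
            point_biserial[j] = np.nan                 # undefined for constant items
    return difficulty, discrimination, point_biserial

if __name__ == "__main__":
    # Small synthetic example: 5 examinees x 4 items.
    X = np.array([[1, 1, 1, 0],
                  [1, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 0],
                  [0, 0, 0, 1]])
    diff, disc, rpb = item_analysis(X)
    print("difficulty:", diff)
    print("discrimination index:", disc)
    print("point-biserial:", rpb)
```

In this sketch, an item answered correctly mostly by high scorers yields a high discrimination index and point-biserial value; applying the same metrics to LLM and human response patterns is how agreement on "key differentiating questions" can be compared.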