
Performance of Large Language Models on Radiology Residency In-Training Examination Questions.

November 11, 2025

Authors

Salbas A, Yogurtcu M

Affiliations (2)

  • Izmir Katip Celebi University, Ataturk Training and Research Hospital, Department of Radiology, Izmir, Turkey (A.S.). Electronic address: [email protected].
  • Tire State Hospital, Radiology Department, Izmir, Turkey (M.Y.). Electronic address: [email protected].

Abstract

Large language models (LLMs) are increasingly being investigated in radiology education. This study evaluated the performance of several advanced LLMs on radiology residency in-training examination questions, with a focus on whether recently released versions show improved accuracy compared with earlier models. We analyzed 282 multiple-choice questions (191 text-only, 91 image-based) from institutional radiology residency examinations conducted between 2023 and 2025. Five LLMs were tested: ChatGPT-4o, ChatGPT-5, Claude 4 Opus, Claude 4.1 Opus, and Gemini 2.5 Pro. Radiology resident performance on the same set of questions was also analyzed for comparison. Accuracy rates were calculated overall and separately for text-only and image-based questions, and results were compared using Cochran's Q and Bonferroni-adjusted McNemar tests. Outputs were also assessed for hallucinations. Gemini 2.5 Pro achieved the highest overall accuracy (83.0%), followed by ChatGPT-5 (82.3%). By comparison, radiology residents achieved an overall accuracy of 78.2%. ChatGPT-5 showed significantly higher accuracy than ChatGPT-4o (p = 0.021), and Gemini 2.5 Pro showed significantly higher accuracy than Claude 4 Opus (p = 0.026). For text-only questions, the highest accuracy was obtained with Gemini 2.5 Pro (88.0%). For image-based questions, radiology residents achieved the highest accuracy (80.4%), followed by ChatGPT-5 (73.6%). The highest accuracies by subspecialty were observed in interventional radiology and physics, whereas breast imaging yielded the lowest accuracy across the models. No instances of hallucination were observed. LLMs demonstrated generally good performance on radiology residency assessments, with newer versions showing measurable improvements. However, limitations persist in image-based interpretation and certain subspecialties. LLMs should therefore be regarded as supportive resources in radiology education, pending careful validation and continued refinement of medical training data.
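For readers who want to run the same kind of comparison on their own question banks, the sketch below shows one way the reported statistics could be computed in Python with statsmodels: Cochran's Q as the omnibus test across models answering an identical question set, followed by pairwise McNemar tests with Bonferroni adjustment. The model names mirror the abstract, but the correctness matrix is randomly generated placeholder data, not the study's actual responses.

```python
# Minimal sketch of the statistical comparison described in the abstract,
# assuming each model's answer to each question is coded 1 (correct) or
# 0 (incorrect). The data below are illustrative placeholders only.
from itertools import combinations

import numpy as np
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
models = ["ChatGPT-4o", "ChatGPT-5", "Claude 4 Opus",
          "Claude 4.1 Opus", "Gemini 2.5 Pro"]
n_questions = 282  # question count from the abstract

# Placeholder correctness matrix: rows = questions, columns = models.
correct = rng.binomial(1, [0.75, 0.82, 0.74, 0.78, 0.83],
                       size=(n_questions, len(models)))

# Omnibus test: do the models differ in accuracy on the same questions?
q_result = cochrans_q(correct)
print(f"Cochran's Q = {q_result.statistic:.2f}, p = {q_result.pvalue:.4f}")

# Pairwise McNemar tests on the 2x2 concordance/discordance tables,
# with Bonferroni adjustment across all pairwise comparisons.
pairs = list(combinations(range(len(models)), 2))
pvals = []
for i, j in pairs:
    a, b = correct[:, i], correct[:, j]
    table = [[np.sum((a == 1) & (b == 1)), np.sum((a == 1) & (b == 0))],
             [np.sum((a == 0) & (b == 1)), np.sum((a == 0) & (b == 0))]]
    pvals.append(mcnemar(table, exact=True).pvalue)

reject, p_adj, _, _ = multipletests(pvals, method="bonferroni")
for (i, j), p in zip(pairs, p_adj):
    print(f"{models[i]} vs {models[j]}: adjusted p = {p:.4f}")
```

McNemar's test is the appropriate pairwise choice here because the models answer the same questions, so the comparison must account for paired (question-level) correctness rather than treating the two accuracy rates as independent proportions.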

Topics

Journal Article
