Large Language Models Solving the European Diploma in Radiology: A Comparative Evaluation.
Authors
Affiliations (4)
- Department of Radiology, Ministry of Health Izmir City Hospital, İzmir, Turkey (H.E.G.). Electronic address: [email protected].
- Department of Radiology, Hospital Clínic de Barcelona, Barcelona 08036, Spain (L.O.).
- Department of Radiology, Ataturk Training and Research Hospital, Izmir Katip Celebi University, İzmir, Turkey (A.M.K.).
- European Board of Radiology, Barcelona 08008, Spain (V.J., C.M.).
Abstract
Rapid advancements in multimodal large language models (LLMs) highlight their expanding potential in radiology education and assessment. This study evaluated and compared the performance of three state-of-the-art LLMs (GPT-5, Gemini 2.5 Pro, and Claude 4.5 Sonnet) on a complete, retired version of the European Diploma in Radiology (EDiR) examination. The official EDiR examination, consisting of 78 Multiple Response Questions, 24 Short Cases, and 10 Clinically Oriented Reasoning Evaluation (CORE) cases, was administered to the models under standardized, zero-shot conditions. Inputs included text, static images, and videos (MP4 or sequential frames). Responses were scored according to official European Board of Radiology criteria, and passing thresholds were defined from the historical human cohort mean for both the Weighted Written Score and the CORE component. In the written component (pass mark 50.9%), Gemini 2.5 Pro achieved the highest score (72.6%), followed by GPT-5 (67.3%), with both models surpassing the passing threshold; Claude 4.5 Sonnet scored 47.5% and did not pass. In the CORE component (pass mark 55%), GPT-5 (62.5%) and Gemini 2.5 Pro (56.3%) passed, whereas Claude 4.5 Sonnet (50.7%) did not. While the models demonstrated high proficiency in text-dominant and static-image interpretation, performance dropped on tasks requiring specific coordinate localization and dynamic video interpretation. GPT-5 and Gemini 2.5 Pro met the passing criteria for the EDiR examination, demonstrating advanced capabilities suitable for educational augmentation. However, persistent limitations in spatial localization and temporal reasoning indicate that while semantic processing has matured, clinical visual grounding remains a challenge for autonomous deployment.