Item-Level Evaluation of Multimodal Large Language Models in Neuroradiology: Generational Performance and Execution Variability.
Authors
Affiliations (2)
- From the Division of Neuroradiology (J.F.O.-E., C.S., D.B., A.F., K.-O.L., F.T.K.), Radiology (M.P., M.S., N.V., N.R.), Geneva University Hospitals, Geneva, Switzerland; Faculty of Medicine (J.F.O.-E., C.S., D.B., A.F., M.P., M.S., K.-O.L.), University of Geneva, Geneva, Switzerland; Division of Radiology (C.M.), German Cancer Research Center, Neuroradiology (Y.-C.Y.), Heidelberg University Hospital, Heidelberg, Germany.
Abstract
Multimodal large language models have shown consistent generational improvements on medical benchmark tasks, including radiology applications. However, it remains uncertain whether these gains represent meaningful convergence toward expert reference performance in subspecialty imaging domains such as neuroradiology, particularly when evaluated using item-level, human-referenced designs.
We compared expert neuroradiologists, radiology residents, and four vision-capable large language models (GPT-4, GPT-5, Gemini 1.5, Gemini 2.5) on 106 image-based neuroradiology multiple-choice questions derived from Radiopaedia. Analyses were conducted at the question (item) level, preserving within-question pairing across groups. Mean accuracy differences were estimated with non-parametric bootstrap 95% confidence intervals, and statistical inference used paired permutation tests restricted to pre-specified contrasts, with false discovery rate correction for primary comparisons. Repeated model executions were analyzed to characterize execution-level variability. Performance was further contextualized against Radiopaedia community accuracy as a community-level reference.
Expert neuroradiologists achieved the highest mean item-level accuracy (0.915; 95% confidence interval, 0.877-0.953). Second-generation models demonstrated improved mean accuracy relative to earlier versions and approximated or exceeded resident-level performance in selected comparisons. However, a substantial gap relative to the expert reference persisted: GPT-5 and Gemini 2.5 underperformed the expert reference by mean per-item differences of -0.236 and -0.217, respectively. When contextualized against the community-level reference, advanced models aligned more closely with aggregate learner performance than with expert reference accuracy. Improvements in mean accuracy were not uniformly accompanied by improved execution consistency.
Although multimodal large language models show meaningful generational gains on neuroradiology tasks, these improvements do not constitute convergence toward the expert reference performance. Item-level, paired, human-referenced evaluation provides critical context for interpreting benchmark performance and helps distinguish apparent performance gains from true alignment with the expert reference. Importantly, relative stability in aggregate accuracy does not necessarily imply reliability at the level of individual decisions, underscoring the need to assess both performance magnitude and execution-level consistency.
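The paired, item-level analysis described in the abstract (bootstrap confidence intervals for mean accuracy differences, paired permutation tests, and false discovery rate correction) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the sign-flip form of the permutation test, the percentile bootstrap, the Benjamini-Hochberg procedure, and the synthetic per-item accuracies are all assumptions made for illustration.

```python
import numpy as np


def paired_bootstrap_ci(a, b, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a - b), resampling items with replacement.

    a, b: per-item accuracies for two groups, aligned by question (paired).
    """
    rng = np.random.default_rng(seed)
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    idx = rng.integers(0, diff.size, size=(n_boot, diff.size))
    boot_means = diff[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return diff.mean(), lo, hi


def paired_permutation_p(a, b, n_perm=10_000, seed=0):
    """Two-sided sign-flip permutation test on the per-item differences."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    observed = abs(diff.mean())
    # Under the null, each item's difference is equally likely to have either sign.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diff.size))
    null = np.abs((signs * diff).mean(axis=1))
    # Add-one correction keeps the estimated p-value strictly positive.
    return (1 + np.sum(null >= observed - 1e-12)) / (n_perm + 1)


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (one common FDR procedure)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out


if __name__ == "__main__":
    # Synthetic data only: 106 items with made-up per-item accuracies.
    rng = np.random.default_rng(42)
    expert = rng.binomial(10, 0.9, size=106) / 10  # hypothetical expert accuracy per item
    model = rng.binomial(10, 0.7, size=106) / 10   # hypothetical model accuracy per item
    mean_diff, lo, hi = paired_bootstrap_ci(model, expert)
    p = paired_permutation_p(model, expert)
    print(f"mean difference {mean_diff:.3f} (95% CI {lo:.3f} to {hi:.3f}), p = {p:.4f}")
```

Resampling and sign-flipping operate on whole items, so the within-question pairing across groups is preserved in both the interval and the test.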