The Expertise Paradox: Who Benefits from LLM-Assisted Brain MRI Differential Diagnosis?
Authors
Affiliations
- Technical University of Munich
Abstract
Purpose
To evaluate how reader experience influences the diagnostic benefit of LLM assistance in brain MRI differential diagnosis.
Materials and Methods
Neuroradiologists (n = 4), radiology residents (n = 4), and neurology/neurosurgery residents (n = 4) were recruited. A dataset of complex brain MRI cases (n = 40) was curated from the local imaging database. For each case, readers provided a textual description of the main imaging finding and their top three differential diagnoses ("Unassisted"). Three state-of-the-art large language models (GPT-4.1, Gemini 2.5 Pro, DeepSeek-R1) were prompted to generate top-three differentials based on the clinical case description and the reader-specific findings. Readers then revised their differential diagnoses after reviewing the GPT-4.1 suggestions ("Assisted"). To evaluate the association between reader experience and diagnostic benefit, a cumulative link mixed model (CLMM) was fitted, with the change in diagnostic result as ordinal outcome, reader experience as predictor, and random intercepts for rater and case.
Results
LLM-generated differential diagnoses achieved the highest top-3 accuracy when provided with image descriptions from neuroradiologists (78.8-83.8%), followed by radiology residents (71.8-77.6%) and neurology/neurosurgery residents (62.6-64.5%). In contrast, mean relative gains in top-3 accuracy through LLM assistance diminished with increasing experience: +19.2% for neurology/neurosurgery residents (from 43.2% to 62.6%), +14.7% for radiology residents (from 59.6% to 74.4%), and +4.4% for neuroradiologists (from 83.1% to 87.5%). The CLMM demonstrated a significant negative association between reader experience and diagnostic benefit from LLM assistance (β = -0.10, p = 0.005).
Conclusion
With increasing reader experience, the absolute diagnostic performance of the LLMs given reader-generated input improved, while the relative diagnostic gains through LLM assistance paradoxically diminished. Our findings call attention to the divergence between standalone LLM performance and clinically relevant reader benefit, and emphasize the need to account for human-AI interaction in this context.
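As a rough illustration of the prompting step described in the Methods, the sketch below shows how a top-three differential could be requested from one of the models via the OpenAI chat completions API. The prompt wording, function name, and parameters are our assumptions for illustration, not the study's actual protocol.

```python
# Minimal sketch of the LLM prompting step (assumptions: OpenAI Python
# SDK >= 1.x; prompt text and output format are illustrative only).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def top3_differentials(clinical_description: str, imaging_finding: str) -> str:
    """Request a ranked top-3 differential diagnosis for one case.

    Mirrors the study design: the model receives the clinical case
    description plus the reader-specific description of the main
    imaging finding (hypothetical prompt wording).
    """
    prompt = (
        "You are assisting with brain MRI differential diagnosis.\n"
        f"Clinical case description: {clinical_description}\n"
        f"Main imaging finding (reader description): {imaging_finding}\n"
        "List the three most likely differential diagnoses, "
        "ranked from most to least likely."
    )
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Because the LLM input includes the reader's own finding description, its quality is bounded by the reader's perceptual accuracy, which is consistent with the experience-dependent standalone accuracies reported in the Results.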
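For readers less familiar with cumulative link mixed models, the following is a minimal sketch of the specification described in the Methods; the notation (thresholds, experience coding, variance components) is an assumed reconstruction, not taken from the paper.

```latex
\[
\operatorname{logit} P\left(Y_{ij} \le k\right)
  = \theta_k - \left(\beta \cdot \mathrm{experience}_j + u_j + v_i\right),
\qquad
u_j \sim \mathcal{N}\!\left(0, \sigma_u^2\right),
\quad
v_i \sim \mathcal{N}\!\left(0, \sigma_v^2\right)
\]
```

Here $Y_{ij}$ is the ordinal change in diagnostic result for reader $j$ on case $i$, $\theta_k$ are the category thresholds, and $u_j$ and $v_i$ are the random intercepts for rater and case. Under this parameterization, the reported $\beta = -0.10$ means that higher reader experience shifts the outcome distribution toward smaller diagnostic gains from LLM assistance.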