
Comparative Analysis of Large Language Models' Performance in Appropriate Diagnostic Imaging Modality Selection.

May 8, 2026

Authors

Rybczyk JR, Amin KS, Chamarty V, Fletcher P, Melnick ER, Forman HP

Affiliations (5)

  • Renaissance School of Medicine at Stony Brook University, Stony Brook, NY, USA. Electronic address: [email protected].
  • Yale School of Medicine, New Haven, CT, USA.
  • Department of Neuroscience, University of Connecticut Health Center, Farmington, CT, USA.
  • Department of Emergency Medicine, Associate Professor of Emergency Medicine, Yale School of Medicine, New Haven, Connecticut, USA; Department of Biostatistics (Health Informatics), Associate Professor of Biostatistics, Yale School of Public Health, New Haven, Connecticut, USA.
  • Yale School of Management, Professor of Management and of Economics, New Haven, CT, USA; Department of Health Policy and Management, Professor of Public Health, Yale School of Public Health, New Haven, CT, USA; Department of Radiology and Biomedical Imaging, Professor of Radiology and Biomedical Imaging, Yale School of Medicine, New Haven, CT, USA; American College of Radiology, Fellow, Reston, VA, USA.

Abstract

LLMs show promise for guiding appropriate diagnostic imaging modality selection according to ACR criteria. This study compared seven LLMs (OpenEvidence, OpenAI's GPT-5 Thinking and GPT-5, Anthropic's Opus 4.1 and Sonnet 4.5, Google's Gemini 2.5 Pro and 2.5 Flash) using 50 clinical vignettes to assess accuracy and clinical reasoning in formulating imaging modality recommendations.

Fifty text-based clinical vignettes were created from ACR guidelines, featuring five variants of 10 different medical complaints with subtle symptomatic or demographic alterations. A 3-point Likert scale was used to evaluate four performance metrics: imaging appropriateness, technical specificity, clinical rationale strength, and citation quality. Readability and word count were also assessed. Two blinded, independent reviewers rated the LLM outputs, with discrepancies resolved via consensus; a third reviewer was included for persistent disagreements. Analysis involved Friedman's test followed by pairwise Wilcoxon signed-rank testing with Holm correction (P < .05).

Friedman testing demonstrated significant differences across all performance domains (P < .031). Appropriateness scores (range 1.60-1.88 out of 2.00) revealed no significant pairwise differences, nor did technical specificity (range 1.82-2.00) or clinical rationale (range 1.52-1.88). Citation quality (range 0.40-2.00) was the most variable: Gemini 2.5 Pro and Gemini 2.5 Flash hallucinated citations in 80% and 76% of prompts, respectively, performing worse than all other models (P < .001). Readability scores ranged from 15.27 to 22.19, and word counts from 90.10 to 195.02.

All LLMs selected appropriate imaging modalities using reasonable clinical justification, but citation validity varied widely. Ensuring congruence between clinical reasoning and cited sources is essential prior to successful implementation.
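
The statistical pipeline described above (an omnibus Friedman test followed by pairwise Wilcoxon signed-rank tests with Holm correction) is straightforward to reproduce. Below is a minimal sketch, not the authors' code: the model subset and the 0-2 Likert scores are simulated placeholders, and only the test sequence mirrors the stated Methods.

```python
# Sketch of the analysis described in the abstract: Friedman omnibus test
# across models, then pairwise Wilcoxon signed-rank tests with Holm
# correction. Scores here are simulated placeholders, not study data.
from itertools import combinations

import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
models = ["GPT-5", "Opus 4.1", "Gemini 2.5 Pro"]  # subset for illustration
# Rows = 50 vignettes; 0-2 Likert ratings per model (simulated)
scores = {m: rng.integers(0, 3, size=50) for m in models}

# Omnibus test: do the models differ on this metric at all?
stat, p = friedmanchisquare(*scores.values())
print(f"Friedman chi2={stat:.2f}, p={p:.4f}")

# Pairwise Wilcoxon signed-rank tests (paired: same vignettes per model)
pairs = list(combinations(models, 2))
raw_p = [wilcoxon(scores[a], scores[b]).pvalue for a, b in pairs]

# Holm step-down correction over the family of pairwise comparisons
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="holm")
for (a, b), p_adj, r in zip(pairs, adj_p, reject):
    print(f"{a} vs {b}: adjusted p={p_adj:.4f}, significant={r}")
```

Holm's step-down procedure controls the family-wise error rate like Bonferroni but is uniformly more powerful, which is why it is a common choice for post-hoc pairwise testing after a significant Friedman result.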

Topics

Journal Article
