Supporting Radiology Resident Education and Clinical Decision-Making With Large Language Models: Comparative Study of Reasoning Models DeepSeek-R1 and ChatGPT-o1.

June 26, 2026

DOI: 10.2196/86974 PMID: 42361338

Authors

Eminovic S, Schmidt R, Levita B, Lindholz M, Haack AM, Burdenski A, Bui M, Schobert IT, Dell'Orco A, Nawabi J, Penzkofer T

Affiliations (3)

Department of Radiology, Charité - Universitätsmedizin Berlin, Augustenburger Platz 1, Berlin, 13353, Germany.
Berlin Institute of Health, Charité - Universitätsmedizin Berlin, Berlin, Germany.
Department of Neuroradiology, Charité - Universitätsmedizin Berlin, Berlin, Germany.

Abstract

Radiology trainees require efficient, accurate, and accessible resources to master complex imaging techniques and identify findings that guide clinical decision-making. Large language models (LLMs) are emerging as promising tools for medical education and clinical workflows, with the potential to enhance learning through instant feedback, support for diagnostic accuracy, and personalized learning experiences. However, systematic comparisons of LLMs for radiology education and clinical support remain limited, particularly regarding differences across subspecialties and resident experience levels.

This study aimed to evaluate and compare the response quality of 2 state-of-the-art reasoning-based LLMs, DeepSeek-R1 and ChatGPT-o1, as clinical and radiology residency support tools, assessing performance across clinical and didactic dimensions for both text- and image-based responses.

Overall, 27 radiology questions covering 9 radiological subspecialties were answered by both LLMs. Additionally, 6 image-based questions were presented only to ChatGPT-o1 because of its image processing capabilities. Responses were independently rated by 7 radiology residents (postgraduate years 2-5) on 9 rating criteria grouped into 3 dimensions (factual accuracy, clinical practicality, and didactic value), using a 5-point Likert scale. Statistical analyses compared the 2 LLMs, rater experience levels, and, for ChatGPT-o1, text-based versus image-based responses.

DeepSeek-R1 consistently outperformed ChatGPT-o1 across all rating dimensions, with highly significant differences across all criteria (mean ratings: DeepSeek-R1 4.51, SD 0.73 vs ChatGPT-o1 3.73, SD 0.98; P<.001). In an exploratory subspecialty-level analysis, DeepSeek-R1 descriptively outperformed ChatGPT-o1 across all subspecialties. With both LLMs combined, junior residents tended to rate slightly higher than senior residents in 7 of 9 criteria, although the differences were not statistically significant. For ChatGPT-o1 alone, however, junior residents gave significantly higher overall scores across all criteria (juniors 3.81, SD 0.64 vs seniors 3.63, SD 0.65; P=.02). Image-based responses by ChatGPT-o1 scored significantly lower than its text-based responses (mean 3.19, SD 1.42; P=.007), particularly in factual accuracy (mean 2.75, SD 1.45; P<.001) and clinical practicality (mean 3.11, SD 1.47; P=.03).

Both DeepSeek-R1 and ChatGPT-o1 demonstrate promising potential on simulated radiology question sets designed for educational and clinical contexts, with DeepSeek-R1 outperforming ChatGPT-o1 across all evaluated criteria. These results emphasize the value of open-source models for educational use and provide early evidence that LLMs may support radiology resident training under controlled conditions; however, their real-world educational and clinical effects require further investigation. Future research should prospectively evaluate how LLMs can be integrated into radiology training, assess their impact alongside conventional teaching methods, and investigate multimodal capabilities to better reflect realistic clinical scenarios.
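As a rough illustration of the rating design described in the abstract, the sketch below counts how many Likert ratings the study would yield and shows how a "mean, SD" summary of the kind reported above can be computed. It assumes that every rater scored every response on every criterion, which the abstract does not state explicitly. The function names and the `demo_scores` values are hypothetical placeholders, not study data, and the abstract does not name the statistical tests used, so none is reproduced here.

```python
from statistics import mean, stdev

# Design figures taken from the abstract.
TEXT_QUESTIONS = 27   # answered by both LLMs
IMAGE_QUESTIONS = 6   # presented to ChatGPT-o1 only
RATERS = 7
CRITERIA = 9
LLMS = 2


def expected_rating_count(questions: int, models: int) -> int:
    """Ratings produced if every rater scores every response on every
    criterion (a fully crossed design; this is an assumption)."""
    return questions * models * RATERS * CRITERIA


def summarize(scores: list[int]) -> tuple[float, float]:
    """Return (mean, sample SD) of 5-point Likert scores, rounded to 2 decimals."""
    if any(s not in (1, 2, 3, 4, 5) for s in scores):
        raise ValueError("Likert scores must be integers from 1 to 5")
    return round(mean(scores), 2), round(stdev(scores), 2)


text_ratings = expected_rating_count(TEXT_QUESTIONS, LLMS)
image_ratings = expected_rating_count(IMAGE_QUESTIONS, 1)

# Made-up placeholder scores from 7 hypothetical raters, NOT study data.
demo_scores = [5, 4, 4, 5, 3, 4, 5]
demo_mean, demo_sd = summarize(demo_scores)

if __name__ == "__main__":
    print(f"text-based ratings:  {text_ratings}")
    print(f"image-based ratings: {image_ratings}")
    print(f"demo summary: mean {demo_mean}, SD {demo_sd}")
```

Under the fully crossed assumption, the text-based comparison rests on 3402 individual ratings and the image-based analysis on 378, which gives a sense of the sample sizes behind the reported means and SDs.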

Topics

Journal Article
