Preliminary evaluation of DeepSeek-R1 and GPT-5.3 in selected PET/CT clinical scenarios: patient preparation, report interpretation, and diagnostic reasoning.

June 11, 2026

Authors

Duan R, Pang J, Zheng L, Guo Z, Li T, Bian Y, Hu Y

Affiliations (2)

  • Hebei Medical University, Shijiazhuang, China.
  • Department of Nuclear Medicine, Hebei General Hospital, Shijiazhuang, China.

Abstract

To evaluate the performance of DeepSeek-R1, an open-source large language model, in three core clinical scenarios (answering patients' common questions, interpreting PET/CT reports with follow-up inquiries, and diagnosing complex cases) and to compare it with GPT-5.3, in order to assess the clinical applicability of DeepSeek-R1 as an alternative AI assistant.

A total of 39 standardized tasks were assigned to both models, including responding to 15 frequently asked questions about [18F]FDG PET/CT, interpreting 12 anonymized reports of lung cancer and lymphoma (with follow-up inquiries regarding tumor staging or treatment), and providing primary and differential diagnoses for 10 difficult cases. Both models were accessed via their official platforms with default parameters, and all prompts and evaluation criteria were kept identical for cross-model comparison. Two senior nuclear medicine physicians independently rated the model responses using a standardized 4-point scale (assessing appropriateness, helpfulness, inter-trial consistency, and reference validity) and a binary scale for empathy; Cohen's kappa coefficient was used to evaluate inter-rater agreement. McNemar's test was used to compare paired proportions of appropriateness, empathy, and response inconsistency between the two models.

Across the 39 tasks, DeepSeek-R1 achieved 94.9% appropriateness and 100% helpfulness, and 91.7% of its responses to follow-up inquiries about tumor staging or treatment were rated empathetic. However, 7.7% of regenerated responses showed substantial inconsistencies, primarily in tumor staging, and only 37% of cited references were fully valid, with 11.1% invalid. GPT-5.3 matched DeepSeek-R1 on core performance (94.9% appropriateness, 100% helpfulness), with a slightly lower rate of substantial inconsistency (5.1%) and similar reference validity (33% fully valid, 7.4% invalid), but a notably lower empathy rate (66.7%) for follow-up inquiries. McNemar's tests showed identical appropriateness (p = 1.00) and no significant difference in inconsistency (p = 1.00, 95% CI 0.60-14.80) between the models. DeepSeek-R1 had a higher empathy rate, but the difference was not significant (p = 0.25, 95% CI 0.09-0.66). On the same 10 difficult cases, both models reached 10% primary diagnosis accuracy and 60% differential diagnosis accuracy.

DeepSeek-R1 and GPT-5.3 have complementary strengths but similar reference hallucination issues, and neither can replace clinicians. DeepSeek-R1 is a cost-effective auxiliary tool, but its consistency, diagnostic accuracy, and reference validity need further optimization.
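
The two statistics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the function names `mcnemar_exact` and `cohen_kappa` are hypothetical, the exact (binomial) form of McNemar's test is an assumption, and the abstract reports neither the paired tables nor the raw ratings, so every count below is invented for illustration. As one example, if three follow-up responses were rated empathetic for one model only and none for the other only, the exact test would give p = 0.25; this is a possible reading of the reported value, not something the source confirms.

```python
from collections import Counter
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact (binomial) McNemar p-value.

    b, c: numbers of discordant pairs (positive for model A only,
    positive for model B only). Concordant pairs do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under H0 each discordant pair falls either way with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def cohen_kappa(rater1, rater2) -> float:
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater1) != len(rater2) or len(rater1) == 0:
        raise ValueError("both raters must score the same non-empty item list")
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    freq1, freq2 = Counter(rater1), Counter(rater2)
    # Chance agreement: product of the raters' marginal proportions per category.
    expected = sum(freq1[cat] * freq2[cat] for cat in freq1) / n ** 2
    if expected == 1.0:
        return 1.0  # both raters used a single identical category throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical discordant counts, not taken from the paper.
    print(mcnemar_exact(3, 0))  # 0.25
    # Hypothetical 4-point ratings from two raters on four responses.
    print(cohen_kappa([4, 4, 3, 2], [4, 4, 3, 3]))
```

With so few discordant pairs the exact test has little power, which is consistent with the abstract reporting a sizeable gap in empathy rates that does not reach significance.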

Topics

Journal Article
