
Error Detection in Emergency Radiology Reports Using a Large Language Model: Multistage Evaluation Study.

April 14, 2026 · PubMed

Authors

Shen H, Wu T, Wang F, Fang J, Li Y, Wu X, Liu S, Chen L, Ren Q, Meng X, Xu J, Sun J, Zhao Y, Liu X, Wang L, Mai G, You J, Jin Z, Wu X, He W, Han X, Zhang S, Zeng D, Zhang B

Affiliations (6)

  • Department of Radiology, The First Affiliated Hospital of Jinan University, No. 613 Huangpu West Road, Tianhe, Guangzhou, Guangdong, 510630, China.
  • School of Biomedical Engineering, Southern Medical University, Guangzhou, Guangdong, China.
  • Department of Radiology, The Affiliated Hospital of Guangdong Medical University, Zhanjiang, Guangdong, China.
  • Department of Radiology, Guangzhou Women and Children's Medical Center, Guangzhou Medical University, Guangzhou, Guangdong, China.
  • Department of Radiology, Nanfang Hospital, Southern Medical University, Guangzhou, Guangdong, China.
  • Department of Radiology, Longhu District People's Hospital of Shantou, Shantou, Guangdong, China.

Abstract

Emergency radiology requires highly accurate reporting under time constraints, yet increasing workloads raise the risk of errors. While large language models (LLMs) show potential for proofreading in general radiology, their performance in emergency settings and non-English contexts remains unclear. We aimed to evaluate a domain-optimized LLM, DeepSeek-R1, for identifying errors in Chinese emergency radiology reports, comparing it against board-certified radiologists. We compiled 7435 emergency reports (dataset 1; radiography, computed tomography, and magnetic resonance imaging) collected from November 2024 to April 2025. In stage 1, a total of 5 LLMs were evaluated using 200 reports. The best model, DeepSeek-R1, proceeded to stages 2 and 3, where zero-shot and few-shot learning were tested on a separate set (n=100) and model performance was compared against 12 radiologists. Stage 4 validated real-world utility on 800 verified reports.

In subdataset 1, under stress-testing conditions, DeepSeek-R1 achieved a higher error detection rate in the few-shot setting than in the zero-shot setting (84.4% vs 60.9%; P=.003). Its performance exceeded that of radiology residents (84.4% vs 51.6% and 53.1%, respectively; both P<.05) and showed no statistically significant difference from that of senior and attending radiologists (84.4% vs 68.8%-93.8%; P=.26 to ≥.99). Compared with residents, DeepSeek-R1 detected more critical omissions (100% vs 25% and 50%; both P<.05) and other errors (92% vs 33% and 33%; both P=.02). In dataset 2, collected from independent institutions, DeepSeek-R1 achieved a detection rate of 95% in the few-shot setting, with a shorter reading time than human readers (92 vs 109 s). In real-world validation, DeepSeek-R1 identified 117 true reporting errors, yielding a positive predictive value of 56.5%.

DeepSeek-R1 holds promise for improving quality control in emergency radiology reports. Its performance and efficiency support its use as an assistive proofreading tool in real-world radiology workflows.
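The few-shot setting the study compares against zero-shot prompting can be sketched as below. This is a minimal, hypothetical illustration of few-shot prompt construction only: the instruction wording, example reports, and error labels are all invented for this sketch and are not the authors' actual prompts or data.

```python
# Hypothetical sketch of few-shot prompt construction for report proofreading.
# The worked examples (report + labeled errors) are prepended to the report
# under review; all text below is invented for illustration.

FEW_SHOT_EXAMPLES = [
    {
        "report": "CT head: no acute hemorrhage. Impression: acute subdural hematoma.",
        "errors": "Contradiction: findings state no hemorrhage, but the impression states subdural hematoma.",
    },
    {
        "report": "Chest radiograph: left lower lobe consolidation. Impression: normal study.",
        "errors": "Critical omission: consolidation not carried into the impression.",
    },
]

def build_few_shot_prompt(report_text: str) -> str:
    """Assemble a proofreading prompt with worked examples prepended."""
    parts = [
        "You are a radiology report proofreader. "
        "List any errors (contradictions, omissions, laterality mistakes) "
        "in the report, or reply 'No errors found'."
    ]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"Report: {ex['report']}\nErrors: {ex['errors']}")
    # The report under review goes last, with the answer slot left open.
    parts.append(f"Report: {report_text}\nErrors:")
    return "\n\n".join(parts)
```

The assembled string would then be sent to the model (e.g., via a chat-completions API); in the zero-shot condition, `FEW_SHOT_EXAMPLES` would simply be empty.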

Topics

Language, Diagnostic Errors, Radiology, Journal Article, Evaluation Study
