
Error Detection in Emergency Radiology Reports Using a Large Language Model: Multistage Evaluation Study.

April 14, 2026 · PubMed

Authors

Shen H, Wu T, Wang F, Fang J, Li Y, Wu X, Liu S, Chen L, Ren Q, Meng X, Xu J, Sun J, Zhao Y, Liu X, Wang L, Mai G, You J, Jin Z, Wu X, He W, Han X, Zhang S, Zeng D, Zhang B

Affiliations (6)

  • Department of Radiology, The First Affiliated Hospital of Jinan University, No. 613 Huangpu West Road, Tianhe, Guangzhou, Guangdong, 510630, China.
  • School of Biomedical Engineering, Southern Medical University, Guangzhou, Guangdong, China.
  • Department of Radiology, The Affiliated Hospital of Guangdong Medical University, Zhanjiang, Guangdong, China.
  • Department of Radiology, Guangzhou Women and Children's Medical Center, Guangzhou Medical University, Guangzhou, Guangdong, China.
  • Department of Radiology, Nanfang Hospital, Southern Medical University, Guangzhou, Guangdong, China.
  • Department of Radiology, Longhu District People's Hospital of Shantou, Shantou, Guangdong, China.

Abstract

Emergency radiology requires highly accurate reporting under time constraints, yet increasing workloads raise the risk of errors. While large language models (LLMs) show potential for proofreading in general radiology, their performance in emergency settings and non-English contexts remains unclear. We aimed to evaluate a domain-optimized LLM, DeepSeek-R1, for identifying errors in Chinese emergency radiology reports, comparing it against board-certified radiologists. We compiled 7435 emergency reports (dataset 1; radiography, computed tomography, and magnetic resonance imaging) collected from November 2024 to April 2025. In stage 1, a total of 5 LLMs were evaluated using 200 reports. The best model, DeepSeek-R1, proceeded to stages 2 and 3, where zero-shot and few-shot learning were tested on a separate set (n=100) and model performance was compared against 12 radiologists. Stage 4 validated real-world utility on 800 verified reports.

In subdataset 1, under stress-testing conditions, DeepSeek-R1 achieved a higher error detection rate in the few-shot setting than in the zero-shot setting (84.4% vs 60.9%; P=.003). Its performance exceeded that of radiology residents (84.4% vs 51.6% and 53.1%, respectively; both P<.05) and showed no statistically significant difference from that of senior and attending radiologists (84.4% vs 68.8%-93.8%; P=.26 to ≥.99). Compared with residents, DeepSeek-R1 detected more critical omissions (100% vs 25% and 50%; both P<.05) and other errors (92% vs 33% and 33%; both P=.02). In dataset 2, collected from independent institutions, DeepSeek-R1 achieved a detection rate of 95% in the few-shot setting, with a shorter reading time than human readers (92 vs 109 s). In real-world validation, DeepSeek-R1 identified 117 true reporting errors, yielding a positive predictive value of 56.5%.

DeepSeek-R1 holds promise for improving quality control in emergency radiology reports. Its performance and efficiency support its use as an assistive proofreading tool in real-world radiology workflows.
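The few-shot setting the study compares against zero-shot prompting can be sketched as below. This is a minimal, hypothetical illustration of few-shot prompt construction only: the instruction wording, example reports, and error labels are all invented for this sketch and are not the authors' actual prompts or data.

```python
# Hypothetical sketch of few-shot prompt construction for report proofreading.
# The worked examples (report + labeled errors) are prepended to the report
# under review; all text below is invented for illustration.

FEW_SHOT_EXAMPLES = [
    {
        "report": "CT head: no acute hemorrhage. Impression: acute subdural hematoma.",
        "errors": "Contradiction: findings state no hemorrhage, but the impression states subdural hematoma.",
    },
    {
        "report": "Chest radiograph: left lower lobe consolidation. Impression: normal study.",
        "errors": "Critical omission: consolidation not carried into the impression.",
    },
]

def build_few_shot_prompt(report_text: str) -> str:
    """Assemble a proofreading prompt with worked examples prepended."""
    parts = [
        "You are a radiology report proofreader. "
        "List any errors (contradictions, omissions, laterality mistakes) "
        "in the report, or reply 'No errors found'."
    ]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"Report: {ex['report']}\nErrors: {ex['errors']}")
    # The report under review goes last, with the answer slot left open.
    parts.append(f"Report: {report_text}\nErrors:")
    return "\n\n".join(parts)
```

The assembled string would then be sent to the model (e.g., via a chat-completions API); in the zero-shot condition, `FEW_SHOT_EXAMPLES` would simply be empty.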

Topics

Language, Diagnostic Errors, Radiology, Journal Article, Evaluation Study
