
Improving Radiology Report Error Detection Using a Multipass Large Language Model: Framework Development and Validation.

June 4, 2026

Authors

Kim S, Lee S, Lee SY, Kim J, Kan K, Lee H, Yoon D

Affiliations (8)

  • Department of Radiology, Seoul National University Hospital, Seoul, Republic of Korea.
  • Department of Radiology, Gangnam Severance Hospital, Seoul, Republic of Korea.
  • Department of Internal Medicine, Gangnam Severance Hospital, Seoul, Republic of Korea.
  • Department of Neurology, Severance Hospital, Seoul, Republic of Korea.
  • Department of Surgery, Samsung Medical Center, Seoul, Republic of Korea.
  • Department of Obstetrics and Gynecology, Kangbuk Samsung Hospital, Seoul, Republic of Korea.
  • Department of Biomedical Systems Informatics, College of Medicine, Yonsei University, 101-604, Seoul, 03687, Republic of Korea, 82 31-5189-8450.
  • Institute for Innovation in Digital Healthcare, Severance Hospital, Seoul, Republic of Korea.

Abstract

Large language model (LLM) proofreaders for radiology reports generate many false positives (FPs) due to the low prevalence of errors. This study aimed to determine whether an optimized LLM framework could improve both precision and cost-efficiency without compromising error detection capability.
In this retrospective study, 1000 radiology reports (radiography, ultrasonography, computed tomography, and magnetic resonance imaging; 250 each) were sampled from the Medical Information Mart for Intensive Care III database. Two public chest radiography corpora (CheXpert and Open-i) served as external test sets. Three LLM frameworks were evaluated: single-prompt detector (framework 1); report extractor plus single-prompt detector (framework 2); and extractor, detector, and FP verifier (framework 3). Precision for each framework was assessed using positive predictive value (PPV) and detected errors per 1000 reports. Overall efficiency was estimated using model inference costs and reviewer labor costs.
PPV increased from 0.063 (95% CI 0.036-0.101) in framework 1 to 0.079 (95% CI 0.049-0.118) in framework 2 and 0.159 (95% CI 0.090-0.252) in framework 3 (P<.001). Despite improved PPV, detected errors remained stable (detected errors per 1000 reports: 12-14). Human review burden decreased from 192 to 88 reports. Framework 3 also reduced model inference costs to US $5.57 per 1000 reports (vs US $9.72 and US $6.85 for frameworks 1 and 2; 42.6% and 18.5% reductions, respectively). External validation confirmed similar improvements. Qualitative analysis revealed that remaining FPs in framework 3 were largely confined to cases requiring deep clinical context (clinically equivalent rephrasing: 53%; unsupported discrepancy assertions: 43%). By eliminating structural FPs (eg, section mismatches and lexical errors: 0%), the framework effectively shifted the quality assurance burden to a smaller set of ambiguous cases, enabling a targeted human-in-the-loop workflow.
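The reported precision and cost figures are tied together by simple arithmetic, which the following minimal sketch reproduces. The true-positive counts used here (12 of 192 flagged reports for framework 1, 14 of 88 for framework 3) are assumptions chosen to be consistent with the stated range of 12-14 detected errors and the stated review burdens; the abstract does not give the exact counts per framework. The function names are illustrative only. Because the cost inputs are the rounded dollar amounts, the computed reductions (roughly 42.7% and 18.7%) differ slightly from the reported 42.6% and 18.5%, which were presumably derived from unrounded costs.

```python
def ppv(true_positives: int, flagged: int) -> float:
    """Positive predictive value: confirmed errors / reports flagged for review."""
    if flagged <= 0:
        raise ValueError("flagged must be positive")
    return true_positives / flagged


def percent_reduction(baseline: float, new: float) -> float:
    """Relative reduction from baseline to new, in percent."""
    return 100.0 * (baseline - new) / baseline


# ASSUMED counts (not stated per framework in the abstract).
f1_ppv = ppv(true_positives=12, flagged=192)  # framework 1
f3_ppv = ppv(true_positives=14, flagged=88)   # framework 3

# Inference cost per 1000 reports in US dollars, as reported (rounded).
cost_f1, cost_f2, cost_f3 = 9.72, 6.85, 5.57

print(f"Framework 1 PPV: {f1_ppv:.4f}")
print(f"Framework 3 PPV: {f3_ppv:.4f}")
print(f"Cost reduction vs framework 1: {percent_reduction(cost_f1, cost_f3):.1f}%")
print(f"Cost reduction vs framework 2: {percent_reduction(cost_f2, cost_f3):.1f}%")
print(f"Review burden reduction: {percent_reduction(192, 88):.1f}%")
```

Under these assumed counts, the same number of true errors spread over fewer than half as many flagged reports is what drives the PPV gain.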
The multipass LLM improved the precision and cost-efficiency of radiology report error detection in real-world, low-error prevalence settings. The framework demonstrates the feasibility of synergistic artificial intelligence-radiologist collaboration and provides a cost-effective and scalable approach to artificial intelligence-assisted quality assurance in both radiological practice and research.
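
The three-stage structure of framework 3 (extractor, detector, FP verifier) could be sketched roughly as follows. This is an illustration of the control flow only, not the authors' implementation: the `Finding` type, the stage signatures, and the rule-based `toy_*` functions are all hypothetical stand-ins for what would be LLM calls with the study's prompts, which the abstract does not describe.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class Finding:
    report_id: str
    reason: str


def run_multipass(
    reports: Dict[str, str],
    extract: Callable[[str], str],                 # raw report -> relevant text
    detect: Callable[[str, str], List[Finding]],   # (id, text) -> candidate errors
    verify: Callable[[Finding, str], bool],        # (candidate, text) -> keep?
) -> List[Finding]:
    """Run extractor, detector, then verifier; return candidates that survive."""
    kept: List[Finding] = []
    for report_id, raw in reports.items():
        text = extract(raw)
        for candidate in detect(report_id, text):
            if verify(candidate, text):
                kept.append(candidate)
    return kept


# --- Hypothetical rule-based stand-ins for the LLM stages ---

def sides(text: str) -> Set[str]:
    return set(re.findall(r"\b(left|right)\b", text.lower()))


def toy_extract(raw: str) -> str:
    # Keep only the text from "FINDINGS:" onward, dropping history and headers.
    _, sep, rest = raw.partition("FINDINGS:")
    return sep + rest if sep else raw


def toy_detect(report_id: str, text: str) -> List[Finding]:
    # Deliberately over-sensitive: flag any report mentioning both sides.
    if sides(text) == {"left", "right"}:
        return [Finding(report_id, "possible laterality mismatch")]
    return []


def toy_verify(candidate: Finding, text: str) -> bool:
    # Keep only if the impression names a side absent from the findings.
    findings, _, impression = text.partition("IMPRESSION:")
    return bool(sides(impression)) and sides(impression).isdisjoint(sides(findings))


reports = {
    "r1": "HISTORY: cough. FINDINGS: Right pleural effusion. IMPRESSION: Left pleural effusion.",
    "r2": "HISTORY: cough. FINDINGS: Right lung clear. Left lung clear. IMPRESSION: No acute disease.",
}
kept = run_multipass(reports, toy_extract, toy_detect, toy_verify)
print([f.report_id for f in kept])
```

In this toy setup the detector flags both reports, and the verifier discards the one without a real discrepancy, mirroring how a verification pass can cut the number of reports sent to human review without dropping the true error.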

Topics

Journal Article
