Improving Radiology Report Error Detection Using a Multipass Large Language Model: Framework Development and Validation.
Authors
Affiliations (8)
- Department of Radiology, Seoul National University Hospital, Seoul, Republic of Korea.
- Department of Radiology, Gangnam Severance Hospital, Seoul, Republic of Korea.
- Department of Internal Medicine, Gangnam Severance Hospital, Seoul, Republic of Korea.
- Department of Neurology, Severance Hospital, Seoul, Republic of Korea.
- Department of Surgery, Samsung Medical Center, Seoul, Republic of Korea.
- Department of Obstetrics and Gynecology, Kangbuk Samsung Hospital, Seoul, Republic of Korea.
- Department of Biomedical Systems Informatics, College of Medicine, Yonsei University, 101-604, Seoul, 03687, Republic of Korea, 82 31-5189-8450.
- Institute for Innovation in Digital Healthcare, Severance Hospital, Seoul, Republic of Korea.
Abstract
Large language model (LLM) proofreaders for radiology reports generate many false positives (FPs) due to the low prevalence of errors. This study aimed to determine whether an optimized LLM framework could improve both precision and cost-efficiency without compromising error detection capability. In this retrospective study, 1000 radiology reports (radiography, ultrasonography, computed tomography, and magnetic resonance imaging; 250 each) were sampled from the Medical Information Mart for Intensive Care III database. Two public chest radiography corpora (CheXpert and Open-i) served as external test sets. Three LLM frameworks were evaluated: single-prompt detector (framework 1); report extractor plus single-prompt detector (framework 2); and extractor, detector, and FP verifier (framework 3). Precision for each framework was assessed using positive predictive value (PPV) and detected errors per 1000 reports. Overall efficiency was estimated using model inference costs and reviewer labor costs. PPV increased from 0.063 (95% CI 0.036-0.101) in framework 1 to 0.079 (95% CI 0.049-0.118) in framework 2 and 0.159 (95% CI 0.090-0.252) in framework 3 (P<.001). Despite improved PPV, detected errors remained stable (detected errors per 1000 reports: 12-14). Human review burden decreased from 192 to 88 reports. Framework 3 also reduced model inference costs to US $5.57 per 1000 reports (vs US $9.72 and US $6.85 for frameworks 1 and 2; 42.6% and 18.5% reductions, respectively). External validation confirmed similar improvements. Qualitative analysis revealed that remaining FPs in framework 3 were largely confined to cases requiring deep clinical context (clinically equivalent rephrasing: 53%; unsupported discrepancy assertions: 43%). By eliminating structural FPs (eg, section mismatches and lexical errors: 0%), the framework effectively shifted the quality assurance burden to a smaller set of ambiguous cases, enabling a targeted human-in-the-loop workflow. 
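As a rough check on the precision figures above, PPV is the number of confirmed errors divided by the number of reports flagged for human review. The sketch below assumes the counts per 1000 reports were 12 confirmed errors among 192 flagged reports for framework 1 and 14 among 88 for framework 3. This pairing is inferred from the reported ranges and is not stated explicitly in the abstract; the function name `ppv` and the dictionary layout are illustrative only.

```python
def ppv(true_errors: int, flagged: int) -> float:
    """Positive predictive value: confirmed errors / reports flagged for review."""
    if flagged <= 0:
        raise ValueError("flagged must be a positive count")
    return true_errors / flagged


# Assumed counts per 1000 reports (inferred pairing, not given in the abstract).
counts = {
    "framework 1": {"true_errors": 12, "flagged": 192},
    "framework 3": {"true_errors": 14, "flagged": 88},
}

ppv_by_framework = {
    name: ppv(c["true_errors"], c["flagged"]) for name, c in counts.items()
}

# Relative reduction in the number of reports a human must review.
review_reduction = 1 - counts["framework 3"]["flagged"] / counts["framework 1"]["flagged"]

for name, value in ppv_by_framework.items():
    print(f"{name}: PPV = {value:.4f}")
print(f"review burden reduction = {review_reduction:.1%}")
```

Under these assumed counts, the computed PPVs are close to the reported 0.063 and 0.159, and the review burden falls by a little over half while the number of detected errors stays roughly constant.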
The multipass LLM improved the precision and cost-efficiency of radiology report error detection in real-world, low-error prevalence settings. The framework demonstrates the feasibility of synergistic artificial intelligence-radiologist collaboration and provides a cost-effective and scalable approach to artificial intelligence-assisted quality assurance in both radiological practice and research.
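The three-stage design of framework 3 (extractor, detector, FP verifier) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the prompts, the `Finding` type, the `LLM` callable interface, and the `fake_llm` stub are hypothetical and do not reproduce the prompts or models used in the study.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt string to a text reply.
LLM = Callable[[str], str]


@dataclass
class Finding:
    description: str


def extract(report: str, llm: LLM) -> str:
    """Pass 1: reduce the report to the sections relevant for proofreading."""
    return llm(f"Extract the findings and impression sections:\n{report}")


def detect(extracted: str, llm: LLM) -> List[Finding]:
    """Pass 2: list candidate errors, one per line."""
    raw = llm(f"List possible errors, one per line, or NONE:\n{extracted}")
    lines = [ln.strip() for ln in raw.splitlines() if ln.strip()]
    return [Finding(ln) for ln in lines if ln.upper() != "NONE"]


def verify(extracted: str, finding: Finding, llm: LLM) -> bool:
    """Pass 3: re-examine each candidate to filter out false positives."""
    answer = llm(
        f"Report:\n{extracted}\n"
        f"Candidate error: {finding.description}\n"
        "Is this a genuine error? Answer YES or NO."
    )
    return answer.strip().upper().startswith("YES")


def multipass(report: str, llm: LLM) -> List[Finding]:
    extracted = extract(report, llm)
    candidates = detect(extracted, llm)
    return [f for f in candidates if verify(extracted, f, llm)]


def fake_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model, used only for this demo."""
    if prompt.startswith("Extract"):
        return "FINDINGS: Right pleural effusion. IMPRESSION: Left pleural effusion."
    if prompt.startswith("List"):
        return (
            "Laterality mismatch between findings and impression\n"
            "Possible typo in 'effusion'"
        )
    return "YES" if "Laterality" in prompt else "NO"


flagged = multipass("(full report text)", fake_llm)
for f in flagged:
    print(f.description)
```

In this toy run the detector proposes two candidates and the verifier keeps only the laterality mismatch, which mirrors the intended effect of the third pass: fewer flagged reports reach the human reviewer without dropping the genuine error.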