Detecting Laterality Errors in Combined Radiographic Studies by Enhancing the Traditional Approach With GPT-4o: Algorithm Development and Multisite Internal Validation.
Affiliations (4)
- Department of Medical Imaging, Chi Mei Medical Center, Tainan, Taiwan.
- Institute of Precision Medicine, College of Medicine, National Sun Yat-sen University, Kaohsiung, Taiwan.
- Department of Radiology, School of Medicine, College of Medicine, Kaohsiung Medical University, Kaohsiung, Taiwan.
- Department of Medical Research, Chi Mei Medical Center, Tainan, Taiwan.
Abstract
Laterality errors in radiology reports can endanger patient safety. Effective methods for screening for laterality errors in combined radiographic reports, which merge multiple studies into a single report, remain unexplored. This work makes three contributions. First, we define and analyze the previously unstudied combined radiographic report format and its challenges. Second, we introduce a clinically deployable ensemble method (rule-based+GPT-4o), evaluated on large-scale, real-world, imbalanced data. Third, we demonstrate significant performance gaps between real-world imbalanced and synthetic balanced datasets, highlighting limitations of the benchmarking methodology commonly used in current studies.

This retrospective study analyzed deidentified English radiology reports whose orders contained laterality terms. We split the data into TrainVal (a combined training and validation dataset), Test-1 (both real-world and imbalanced), and Test-2 (synthetic and balanced); Test-1 comes from a distinct hospital branch. Experiment 1 compared the baseline, workaround, and GPT-4o-augmented versions of the rule-based method. Experiment 2 compared the rule-based method with the highest recall against fine-tuned RoBERTa, ClinicalBERT, and GPT-4o models.

As of July 2024, our dataset included 10,000 real-world and 889 synthetic radiology reports. The laterality error rate in real-world reports was 1.20% (120/10,000) and was significantly higher in combined (103/7000, 1.47%) than in noncombined reports (17/3000, 0.57%; difference=0.90%; z=3.81; P<.001). In experiment 1, recall differed significantly among the 3 versions of the rule-based method (Q=6.0; P=.0498, Friedman test). The rule-based+GPT-4o method had the highest recall (average rank=1), significantly better than the baseline (average rank=3; P=.04, Nemenyi test). Most (5/6) of the false positives introduced by GPT-4o information extraction were due to parser limitations that error cancellation had previously hidden. In experiment 2, on Test-1, rule-based+GPT-4o (precision=0.696; recall=0.889; F1-score=0.780) outperformed GPT-4o (precision=0.219; recall=0.889; F1-score=0.352), ClinicalBERT (precision=0.047; recall=0.667; F1-score=0.088), and RoBERTa (F1-score=0.000). On Test-2, rule-based+GPT-4o (precision=0.996; recall=0.925; F1-score=0.959) and GPT-4o (precision=0.979; recall=0.953; F1-score=0.966) outperformed ClinicalBERT (precision=0.984; recall=0.749; F1-score=0.851) and RoBERTa (F1-score=0.013). Both ClinicalBERT and GPT-4o showed marked declines in precision on TrainVal and Test-1 relative to Test-2. In multivariate logistic regression, both Test-1 membership (GPT-4o: odds ratio [OR] 239.89, 95% CI 111.05-518.01; P<.001; ClinicalBERT: OR 1924.07, 95% CI 687.46-5383.99; P<.001) and order count per study (GPT-4o: OR 1.79, 95% CI 1.38-2.31; P<.001; ClinicalBERT: OR 2.50, 95% CI 1.64-3.80; P<.001) independently predicted false positives. In subgroup analysis, all models showed reduced precision and F1-scores in the combined-study subgroups.

The combined radiographic report format poses distinct challenges for both radiology report quality assurance and natural language processing. The combined rule-based and GPT-4o method effectively screens for laterality errors in imbalanced real-world reports, and a significant performance gap separates balanced synthetic datasets from imbalanced real-world data. Future studies should therefore also include real-world imbalanced data.
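To make the ensemble idea concrete, the following is a minimal, hypothetical sketch of how a rule-based laterality pass can be augmented with GPT-4o information extraction: simple rules compare the laterality terms in the order against those in the report body, and GPT-4o is consulted only when the rules extract nothing from the report. The regex, prompt, helper names, and matching logic are all illustrative assumptions; the paper's actual rules and prompts are not reproduced here.

```python
# Hypothetical sketch of a rule-based + GPT-4o laterality screen.
import re
from openai import OpenAI

LATERALITY = re.compile(r"\b(left|right|bilateral)\b", re.IGNORECASE)

def rule_based_sides(text: str) -> set[str]:
    """Collect laterality terms with a simple regex (baseline rules)."""
    return {m.lower() for m in LATERALITY.findall(text)}

def gpt4o_sides(text: str, client: OpenAI) -> set[str]:
    """Fallback extraction with GPT-4o when the rules find nothing."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": "List the laterality (left/right/bilateral) "
                              f"mentioned in this report excerpt: {text}"}],
    )
    return rule_based_sides(resp.choices[0].message.content or "")

def flag_laterality_error(order: str, report: str, client: OpenAI) -> bool:
    """Flag a study when order and report disagree on laterality."""
    order_sides = rule_based_sides(order)
    report_sides = rule_based_sides(report) or gpt4o_sides(report, client)
    return bool(order_sides) and order_sides.isdisjoint(report_sides)
```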
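The reported combined versus noncombined error rate comparison can be reproduced from the counts in the abstract with a two-proportion z-test. A minimal sketch using statsmodels (the function choice is ours, not necessarily the paper's):

```python
# Two-proportion z-test on the abstract's counts:
# combined 103/7000 (1.47%) vs. noncombined 17/3000 (0.57%).
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

errors = np.array([103, 17])      # laterality errors: combined, noncombined
reports = np.array([7000, 3000])  # report counts per group

z, p = proportions_ztest(errors, reports)
print(f"difference = {errors[0]/reports[0] - errors[1]/reports[1]:.4f}")
print(f"z = {z:.2f}, P = {p:.4g}")  # reproduces z=3.81, P<.001
```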
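Experiment 1's recall comparison uses a Friedman test with Nemenyi post hoc analysis. The sketch below shows the standard recipe with scipy and the scikit-posthocs package; the per-block recall values are invented, but with 3 blocks and a ranking that is consistent across them, the Friedman statistic is exactly Q=6.0 (P=.0498), the values reported in the abstract.

```python
# Friedman test across 3 rule-based variants, with Nemenyi post hoc.
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp  # pip install scikit-posthocs

recall_by_block = np.array([
    # baseline, workaround, rule-based+GPT-4o (illustrative values)
    [0.70, 0.80, 0.90],
    [0.65, 0.75, 0.88],
    [0.72, 0.78, 0.92],
])

q, p = friedmanchisquare(*recall_by_block.T)
print(f"Friedman Q = {q:.1f}, P = {p:.4f}")  # Q = 6.0, P = 0.0498

# Pairwise P values between methods (rows = blocks, cols = methods)
print(sp.posthoc_nemenyi_friedman(recall_by_block))
```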
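The precision collapse on the imbalanced Test-1 set follows directly from the arithmetic of precision under low prevalence: for a fixed recall and false-positive rate, precision = TP/(TP+FP) shrinks as positives become rare. The toy numbers below are illustrative (the paper's confusion matrices are not given), but with a false-positive rate of about 4%, precision falls from roughly 0.96 at 50% prevalence to roughly 0.21 at the 1.2% real-world error rate, the same order of gap seen between Test-2 and Test-1 for GPT-4o.

```python
# Precision as a function of prevalence for a fixed operating point.
def precision(recall: float, fpr: float, prevalence: float) -> float:
    tp = recall * prevalence            # true positives per report
    fp = fpr * (1.0 - prevalence)       # false positives per report
    return tp / (tp + fp)

recall, fpr = 0.889, 0.04               # illustrative operating point
for prev in (0.50, 0.012):              # balanced vs. ~1.2% error rate
    print(f"prevalence={prev:.3f} -> precision={precision(recall, fpr, prev):.3f}")
```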
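Finally, the odds ratios for false positive predictors come from multivariate logistic regression with exponentiated coefficients. A minimal sketch on synthetic data with statsmodels; the column names, coefficients, and data are hypothetical stand-ins for the paper's Test-1 membership and order-count covariates.

```python
# Logistic regression -> odds ratios with 95% CIs (synthetic data).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "is_test1": rng.integers(0, 2, n),     # dataset membership indicator
    "order_count": rng.integers(1, 6, n),  # orders combined per study
})
logit = 0.8 * df["is_test1"] + 0.5 * df["order_count"] - 4.0
df["false_positive"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = sm.add_constant(df[["is_test1", "order_count"]])
res = sm.Logit(df["false_positive"], X).fit(disp=0)

# Exponentiate coefficients and CI bounds to get ORs and 95% CIs
ors = pd.concat([np.exp(res.params), np.exp(res.conf_int())], axis=1)
ors.columns = ["OR", "2.5%", "97.5%"]
print(ors)
```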