
Automated Detection and Classification of Radiology Report Discrepancies Using NLP: A Tool for Resident Education and Quality Assurance.

March 31, 2026

Authors

Wang K, Wang H, Wu J, Wang K, Liu W, Zhang Y, Wang X

Affiliations (7)

  • Department of Radiology, Peking University First Hospital, Beijing 100034, China. Electronic address: [email protected].
  • Department of Radiology, Peking University First Hospital, Beijing 100034, China. Electronic address: [email protected].
  • Department of Radiology, Peking University First Hospital, Beijing 100034, China. Electronic address: [email protected].
  • Department of Radiology, Peking University First Hospital, Beijing 100034, China. Electronic address: [email protected].
  • Beijing Smart Tree Medical Technology Co., Ltd., Beijing 100011, China. Electronic address: [email protected].
  • Beijing Smart Tree Medical Technology Co., Ltd., Beijing 100011, China. Electronic address: [email protected].
  • Department of Radiology, Peking University First Hospital, Beijing 100034, China. Electronic address: [email protected].

Abstract

To develop and evaluate a natural language processing (NLP) system that automatically detects and classifies discrepancies between preliminary and final radiology reports, with the goal of enhancing resident education through structured feedback. We retrospectively analyzed 889 de-identified lumbar spine MRI reports (768 with revisions) from December 2023 to March 2024. Preliminary diagnostic reports were generated by trainee residents during daytime rotations; final reports were subsequently verified remotely by attending radiologists. Discrepancies in the diagnostic impression section were extracted using a multi-step NLP pipeline: sentence segmentation, BERT-based sentence matching, GPT-4-based named entity recognition, and rule-based classification into 11 correction types (e.g., missed diagnosis, misdiagnosis, missed image feature, misidentified image feature, localization error, diagnostic reasoning error, clinical query omission, severity error, confidence difference, typographic error, and terminology refinement). Ground truth was established by three radiologists. System performance was evaluated for each correction type individually using accuracy, sensitivity, specificity, and the intraclass correlation coefficient (ICC). Resident and attending radiologist performance trends were analyzed at the report level. The NLP system achieved high accuracy (0.983-0.999), sensitivity (0.977-1.000), and specificity (0.900-1.000) across the 11 correction types, with strong inter-rater reliability (ICC > 0.75). The most common corrections were misdiagnosis (504/768, 65.6%) and missed diagnosis (356/768, 46.4%). Residents showed significant variability in error rates, especially in missed diagnosis (11.1-59.1% across 16 residents) and misdiagnosis (24.0-71.1% across 16 residents).
Attending radiologists exhibited marked heterogeneity in correction patterns (n=6, individual workloads 95-187 reports, median 159), with significant variability across all major error types (p<0.001 for missed diagnosis [20.6%-82.0%], misdiagnosis [31.4%-66.7%], localization error [15.8%-54.7%], and terminology refinement [3.2%-36.7%]). The NLP-based discrepancy tracking system accurately identifies and classifies report modifications, enabling scalable, targeted feedback for radiology residents. Inter-resident and inter-attending variability highlights the need for individualized training and standardized review practices.
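The pipeline the abstract describes (segment sentences, match preliminary to final sentences, then classify each pair) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are assumptions, `difflib` character similarity stands in for the BERT-based matcher, and only two of the 11 correction types are covered by toy rules (the paper additionally uses GPT-4-based named entity recognition before classification).

```python
import difflib
import re


def split_sentences(text):
    """Naive impression-section segmentation on '.' and ';'.
    (A stand-in for the paper's dedicated sentence segmenter.)"""
    return [s.strip() for s in re.split(r"[.;]\s*", text) if s.strip()]


def match_sentences(prelim_sents, final_sents, threshold=0.6):
    """Pair each final-report sentence with its most similar preliminary
    sentence. The paper matches with BERT embeddings; difflib's character
    similarity is used here as an offline proxy."""
    pairs = []
    for f in final_sents:
        best, score = None, 0.0
        for p in prelim_sents:
            r = difflib.SequenceMatcher(None, p, f).ratio()
            if r > score:
                best, score = p, r
        pairs.append((best if score >= threshold else None, f))
    return pairs


def classify_pair(prelim_sent, final_sent):
    """Toy rule-based classifier covering two of the 11 correction types."""
    if prelim_sent is None:
        return "missed diagnosis"  # finding appears only in the final report
    if prelim_sent != final_sent:
        return "misdiagnosis"      # attending revised the resident's finding
    return "no change"
```

A small worked example on a hypothetical report pair:

```python
prelim = split_sentences("Disc bulge at L4-L5. No spinal stenosis.")
final = split_sentences("Disc herniation at L4-L5. No spinal stenosis. Mild scoliosis.")
labels = [classify_pair(p, f) for p, f in match_sentences(prelim, final)]
# labels → ["misdiagnosis", "no change", "missed diagnosis"]
```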
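The per-type accuracy, sensitivity, and specificity figures reported above follow from treating each correction type as a binary flag per report and comparing the system's flags against the radiologist-established ground truth. A minimal sketch of that computation (function name and data layout are assumptions; the paper's ICC analysis is not reproduced here):

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity for one correction type,
    given 0/1 ground-truth and predicted flags per report."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }
```

For example, `binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])` gives accuracy 0.6, sensitivity 2/3, and specificity 0.5; the study computes these separately for each of the 11 correction types.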

Topics

Journal Article
