Artificial Intelligence for Teaching Case Curation: Evaluating Model Performance on Imaging Report Discrepancies.

Authors

Bartley M, Huemann Z, Hu J, Tie X, Ross AB, Kennedy T, Warner JD, Bradshaw T, Lawrence EM

Affiliations (3)

  • Department of Radiology, UW-Madison School of Medicine & Public Health, Madison, Wisconsin (M.B., Z.H., X.T., A.B.R., T.K., J.D.W., T.B., E.M.L.).
  • Department of Computer Science, UW-Madison, Madison, Wisconsin (J.H.).
  • Department of Radiology, University of Galway, Galway, Ireland (E.M.L.).

Abstract

To assess the feasibility of using a large language model (LLM) to identify valuable radiology teaching cases through report discrepancy detection.

This retrospective study included after-hours head CT and musculoskeletal radiograph exams from January 2017 to December 2021. The discrepancy level between the trainee's preliminary interpretation and the final attending report was annotated on a 5-point scale. RadBERT, an LLM pretrained on a vast corpus of radiology text, was fine-tuned for discrepancy detection. For comparison and to ensure the robustness of the approach, Mixtral 8×7B, Mistral 7B, and Llama2 were also evaluated. Model performance in detecting discrepancies was evaluated on a randomly selected hold-out test set. A subset of discrepant cases identified by the LLM was compared to a random case set by recording clinical parameters and discrepant pathology and by evaluating possible educational value. The F1 statistic was used for model comparison. Pearson's chi-squared test was used to compare discrepancy prevalence and score between groups (significance set at p<0.05).

The fine-tuned LLM achieved an overall accuracy of 90.5%, with a specificity of 95.5% and a sensitivity of 66.3% for discrepancy detection. Model sensitivity improved significantly with higher discrepancy scores: 49% (34/70) for score 2 versus 67% (47/62) for score 3 and 81% (35/43) for score 4/5 (p<0.05 compared to score 2). The LLM-curated set showed a significant increase in the prevalence of all discrepancies and of major discrepancies (scores 4 or 5) compared to a random case set (p<0.05 for both). Evaluation of the clinical characteristics of both the random and discrepant case sets demonstrated a broad mix of pathologies and discrepancy types.

An LLM can detect trainee report discrepancies, including both higher- and lower-scoring discrepancies, and may improve case set curation for resident education as well as serve as a trainee oversight tool.
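The per-score sensitivities and the chi-squared comparison reported above can be reproduced approximately from the counts printed in the abstract. The following is a minimal sketch, not the authors' analysis code: the helper names (`sensitivity`, `pearson_chi2_2x2`, `compare`) are hypothetical, and it assumes a 2×2 Pearson chi-squared test with one degree of freedom and no continuity correction, which the abstract does not specify. Note that the printed counts for score 3 (47/62) work out to roughly 76% rather than the stated 67%, while the pooled counts (116/175) do match the reported overall sensitivity of 66.3%.

```python
import math

# Detected / total discrepant cases per score group, as printed in the abstract.
counts = {
    "score 2": (34, 70),
    "score 3": (47, 62),
    "score 4/5": (35, 43),
}


def sensitivity(detected, total):
    """Fraction of truly discrepant cases that the model flagged."""
    return detected / total


def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-squared test for the 2x2 table [[a, b], [c, d]].

    Assumption: no continuity correction. Returns (statistic, p_value)
    with one degree of freedom; for df = 1 the chi-squared survival
    function equals erfc(sqrt(x / 2)).
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def compare(group_a, group_b):
    """Compare detection rates of two score groups (detected vs. missed)."""
    det_a, tot_a = counts[group_a]
    det_b, tot_b = counts[group_b]
    return pearson_chi2_2x2(det_a, tot_a - det_a, det_b, tot_b - det_b)


if __name__ == "__main__":
    for name, (det, tot) in counts.items():
        print(f"{name}: sensitivity = {sensitivity(det, tot):.1%} ({det}/{tot})")

    total_detected = sum(det for det, _ in counts.values())
    total_cases = sum(tot for _, tot in counts.values())
    print(f"overall: {sensitivity(total_detected, total_cases):.1%}")

    stat, p = compare("score 2", "score 4/5")
    print(f"score 2 vs score 4/5: chi2 = {stat:.2f}, p = {p:.4f}")
```

Under these assumptions the score 2 versus score 4/5 comparison comes out well below the 0.05 threshold, in line with the significance statement in the abstract.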

Topics

Radiology, Artificial Intelligence, Journal Article
