Artificial Intelligence for Teaching Case Curation: Evaluating Model Performance on Imaging Report Discrepancies.
Authors
Affiliations (3)
- Department of Radiology, UW-Madison School of Medicine & Public Health, Madison, Wisconsin (M.B., Z.H., X.T., A.B.R., T.K., J.D.W., T.B., E.M.L.).
- Department of Computer Science, UW-Madison, Madison, Wisconsin (J.H.).
- Department of Radiology, UW-Madison School of Medicine & Public Health, Madison, Wisconsin (M.B., Z.H., X.T., A.B.R., T.K., J.D.W., T.B., E.M.L.); Department of Radiology, University of Galway, Galway, Ireland (E.M.L.). Electronic address: [email protected].
Abstract
This study assessed the feasibility of using a large language model (LLM) to identify valuable radiology teaching cases through report discrepancy detection. The retrospective study included after-hours head CT and musculoskeletal radiograph exams from January 2017 to December 2021. The discrepancy level between the trainee's preliminary interpretation and the final attending report was annotated on a 5-point scale. RadBERT, an LLM pretrained on a large corpus of radiology text, was fine-tuned for discrepancy detection. For comparison and to test the robustness of the approach, Mixtral 8×7B, Mistral 7B, and Llama2 were also evaluated. Model performance in detecting discrepancies was evaluated on a randomly selected hold-out test set. A subset of discrepant cases identified by the LLM was compared with a random case set by recording clinical parameters and discrepant pathology and by evaluating possible educational value. The F1 score was used for model comparison. Pearson's chi-squared test was used to compare discrepancy prevalence and score between groups (significance set at p<0.05). The fine-tuned LLM achieved an overall accuracy of 90.5%, with a specificity of 95.5% and a sensitivity of 66.3% for discrepancy detection. Model sensitivity improved significantly with higher discrepancy scores: 49% (34/70) for score 2, 76% (47/62) for score 3, and 81% (35/43) for score 4/5 (p<0.05 compared with score 2). The LLM-curated set showed a significantly higher prevalence of all discrepancies and of major discrepancies (scores 4 or 5) than the random case set (p<0.05 for both). Evaluation of the clinical characteristics of both the random and discrepant case sets demonstrated a broad mix of pathologies and discrepancy types. An LLM can detect trainee report discrepancies, including both higher- and lower-scoring discrepancies, and may improve case set curation for resident education as well as serve as a trainee oversight tool.
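The per-score sensitivities can be recomputed from the detected/total counts given in the abstract. The following is a minimal Python sketch, not the authors' analysis code: the function names are hypothetical, and the 2×2 Pearson chi-squared comparison (without continuity correction) is an assumption about how the score 2 versus score 4/5 comparison might be set up, since the abstract does not specify the exact procedure.

```python
# Detected / total discrepant cases per discrepancy score, as reported in the abstract.
COUNTS = {"2": (34, 70), "3": (47, 62), "4/5": (35, 43)}

# Critical value of the chi-squared distribution for df = 1 at alpha = 0.05.
CHI2_CRITICAL_DF1 = 3.841


def sensitivity(detected, total):
    """Fraction of truly discrepant cases that the model flagged."""
    return detected / total


def pooled_sensitivity(counts):
    """Sensitivity over all discrepancy scores combined."""
    detected = sum(d for d, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return detected / total


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the table [[a, b], [c, d]].

    Assumption: no continuity correction is applied.
    """
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def compare_scores(counts, first, second):
    """Chi-squared statistic comparing detection rates of two score groups."""
    d1, t1 = counts[first]
    d2, t2 = counts[second]
    # Rows: score group; columns: detected, missed.
    return chi2_2x2(d1, t1 - d1, d2, t2 - d2)


for score, (detected, total) in COUNTS.items():
    print(f"score {score}: {sensitivity(detected, total):.1%} ({detected}/{total})")

print(f"pooled: {pooled_sensitivity(COUNTS):.1%}")

stat = compare_scores(COUNTS, "2", "4/5")
print(f"score 2 vs 4/5: chi2 = {stat:.2f}, significant = {stat > CHI2_CRITICAL_DF1}")
```

Pooling the three groups gives 116/175, which matches the overall sensitivity of 66.3% reported in the abstract.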