Using a large language model for post-deployment monitoring of FDA-approved AI: pulmonary embolism detection use case.
Authors
Affiliations (6)
- Department of Radiology, Mayo Clinic College of Medicine and Science, Mayo Clinic, Rochester, MN, USA. Electronic address: [email protected].
- Department of Radiology, Mayo Clinic College of Medicine and Science, Mayo Clinic, Rochester, MN, USA.
- Department of Radiology, Mayo Clinic College of Medicine and Science, Mayo Clinic, Rochester, MN, USA; Chair, Thoracic Division, Department of Radiology, Mayo Clinic, Rochester, MN, USA.
- Department of Radiology, Mayo Clinic College of Medicine and Science, Mayo Clinic, Rochester, MN, USA; Medical Director, Artificial Intelligence for Cardiovascular Imaging Research and Exploration Program, Mayo Clinic, Rochester, MN, USA.
- Department of Radiology, Mayo Clinic College of Medicine and Science, Mayo Clinic, Rochester, MN, USA; Chair, ACR Informatics Commission; Vice Chair, ACR Board of Chancellors.
- Department of Radiology, Mayo Clinic College of Medicine and Science, Mayo Clinic, Rochester, MN, USA; Chair, Informatics Division, Department of Radiology, Mayo Clinic, Rochester, MN, USA; Medical Director, Advanced Imaging Post-Processing Lab, Mayo Clinic, Rochester, MN, USA. Electronic address: [email protected].
Abstract
Artificial intelligence (AI) is increasingly integrated into clinical workflows, but the performance of AI in production can diverge from initial evaluations, and post-deployment monitoring (PDM) remains a challenging component of ongoing quality assurance once AI is deployed in clinical production. We aimed to develop and evaluate a PDM framework that combines large language model (LLM)-based free-text classification of radiology reports with human oversight, and we demonstrate its application to monitoring a commercially vended pulmonary embolism (PE) detection AI (CVPED).

We retrospectively analyzed 11,999 CT pulmonary angiography (CTPA) studies performed between 04/30/2023 and 06/17/2024. Ground truth was established by combining LLM-based radiology-report classification with the CVPED outputs, with human review of discrepancies. We simulated a daily monitoring framework to track discrepancies between CVPED and the LLM. Drift was defined as the discrepancy rate exceeding a fixed 95% confidence interval (CI) for seven consecutive days; the CI and the optimal retrospective assessment period were derived from a stable dataset with consistent performance. We simulated drift by systematically altering CVPED or LLM sensitivity and specificity, and we modeled an approach to detect data shifts. We also incorporated a human-in-the-loop selective-alerting framework for continuous prospective evaluation and to investigate the potential for incremental PE detection.

Of 11,999 CTPAs, 1,285 (10.7%) had PE. Overall, 373 (3.1%) had discrepant classifications between CVPED and the LLM. Among 111 CVPED-positive, LLM-negative cases, 29 would have triggered an alert because the radiologist did not interact with CVPED; of those, 24 were CVPED false positives, one was an LLM false negative, and the framework ultimately identified four true alerts for incremental PE cases. The optimal retrospective assessment period for drift detection was two months. A 2-3% decline in model specificity caused a 2- to 3-fold increase in discrepancies, whereas a 10% drop in sensitivity was required to produce a similar effect. For example, a 2.5% drop in LLM specificity led to a 1.7-fold increase in CVPED-negative, LLM-positive discrepancies, which would have taken 22 days to detect with the proposed framework.

A PDM framework combining LLM-based free-text classification with a human-in-the-loop alerting system can continuously track an image-based AI model's performance, alert on performance drift, and provide incremental clinical value.
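The drift rule described in the abstract (a daily discrepancy rate exceeding the upper bound of a fixed 95% CI, derived from a stable baseline period, for seven consecutive days) can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name, the normal-approximation CI, and all inputs are assumptions.

```python
# Hypothetical sketch of the abstract's drift rule: flag drift when the
# daily CVPED-vs-LLM discrepancy rate stays above the upper bound of a
# 95% CI (estimated from a stable baseline period) for 7 consecutive days.
from statistics import mean, stdev

def drift_detected(daily_rates, baseline_rates, consecutive_days=7):
    """Return the index of the day on which drift is flagged, or None."""
    # Upper bound of a normal-approximation 95% CI from the stable baseline.
    upper = mean(baseline_rates) + 1.96 * stdev(baseline_rates)
    streak = 0
    for i, rate in enumerate(daily_rates):
        streak = streak + 1 if rate > upper else 0
        if streak >= consecutive_days:
            return i
    return None
```

In this sketch, a stable stretch of daily rates never accumulates a seven-day streak above the bound, while a sustained jump in the discrepancy rate is flagged on the seventh elevated day.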