Factors Impacting the Performance of Deep Learning Detection of Pulmonary Emboli.
Authors
Affiliations (3)
- Department of Radiology, Mayo Clinic College of Medicine and Science, Mayo Clinic, Rochester, MN, USA. Electronic address: [email protected].
- Department of Radiology, Mayo Clinic College of Medicine and Science, Mayo Clinic, Rochester, MN, USA.
- Department of Radiology, Mayo Clinic College of Medicine and Science, Mayo Clinic, Rochester, MN, USA. Electronic address: [email protected].
Abstract
AI models are increasingly adopted in clinical practice, yet their generalizability outside controlled validation settings remains unclear. We aimed to evaluate the real-world performance of an FDA-cleared commercial pulmonary embolism (PE) detection model and to identify technical, demographic, and clinical factors associated with performance variation, to inform post-production monitoring and deployment strategies. This retrospective study included 11,144 CT pulmonary angiography examinations performed in a single health system between 04/2023 and 06/2024 and processed by a commercial PE detection model. Technical parameters (scanner manufacturer, slice thickness, dose index volume, contrast enhancement of the pulmonary artery), demographic factors (age, sex, race, BMI), and clinical comorbidities (heart failure, pulmonary hypertension, cancer) were extracted from DICOM headers and electronic health records. Univariate and multivariable logistic regression analyses identified factors associated with decreased performance. There were 1,193/11,144 (10.7%) PE-positive cases. The model achieved an overall sensitivity of 83.5% (95% confidence interval [CI] 81.3%-85.5%) and a positive predictive value (PPV) of 90.5% (95% CI 88.7%-92.1%). Multivariable analysis showed significant associations between decreased sensitivity and scanner manufacturer (odds ratio [OR] 0.25, 95% CI 0.14-0.46 and OR 0.34, 95% CI 0.17-0.69, for different vendors vs. reference, p<0.003), increased slice thickness (OR 0.74, 95% CI 0.57-0.95 per 1 mm increase, p=0.018), presence of imaging artifacts (OR 0.33, 95% CI 0.23-0.48, p<0.001), heart failure (OR 0.58, 95% CI 0.38-0.88, p=0.010), and pulmonary hypertension (OR 0.44, 95% CI 0.25-0.77, p=0.004). Demographic factors including age, sex, race, and BMI showed no significant associations with model performance. AI performance in clinical practice varies significantly based on technical imaging parameters and patient comorbidities.
Understanding these factors is essential for optimal product selection and for effective post-deployment monitoring, enabling investigation of model drift in evolving clinical settings. The findings highlight the need for local validation frameworks that account for institution-specific technical infrastructure and patient populations, to ensure safe AI deployment across diverse clinical environments.
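As a minimal illustration of the odds-ratio statistics used in the analysis above, an OR with a 95% CI can be computed from a 2x2 contingency table via the log-odds standard error. The counts below are hypothetical, not data from the study:

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI from a 2x2 table:
    a = factor present & PE detected, b = factor present & PE missed,
    c = factor absent & PE detected,  d = factor absent & PE missed."""
    or_ = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Hypothetical counts: detections vs. misses with/without imaging artifacts
or_, lo, hi = odds_ratio_ci(40, 30, 800, 200)
print(f"OR={or_:.2f} (95% CI {lo:.2f}-{hi:.2f})")  # an OR < 1 indicates reduced detection odds
```

This sketch covers only the univariate case; the multivariable ORs reported in the abstract would come from fitting a logistic regression with all factors as covariates.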