Real-world performance evaluation of a commercial deep learning model for intracranial hemorrhage detection.
Authors
Affiliations (4)
- Department of Radiology and Imaging Sciences, Emory University School of Medicine, Atlanta, GA, USA.
- Department of Radiology, Mayo Clinic, Rochester, MN, USA.
- Department of Biomedical Informatics, Emory University, Atlanta, GA, USA.
- Department of Radiology and Imaging Sciences, Emory University School of Medicine, Atlanta, GA, USA.
Abstract
Intracranial hemorrhage (ICH) is a life-threatening emergency requiring rapid and accurate diagnosis, yet the real-world performance of FDA-cleared deep-learning models remains uncertain. We retrospectively evaluated a commercial AI model (Aidoc Medical Briefcase ICH Triage) across 101,944 non-contrast head CT examinations from 74,142 patients in a 17-facility academic health system (April 2023 to April 2025). Reference-standard ICH labels and imaging characteristics were extracted from radiology reports using GPT-4o with a zero-shot prompt-refinement strategy, validated against 500 manually annotated cases. The LLM achieved 96% accuracy (κ = 0.85) for ICH classification. Overall, the Aidoc model demonstrated 82.2% sensitivity, 97.6% specificity, and 96.6% accuracy. Sensitivity was highest for acute (86.2%), large (>10 mm; 95.0%), and multi-compartment (93.6%) hemorrhages, but substantially lower for subacute (45.5%), chronic (54.8%), small (≤10 mm; 74.8%), and single-compartment (76.0%) bleeds. Sensitivity was also lower in the outpatient setting (72.2%), where subtle hemorrhages were more common, but remained consistent across demographic subgroups. These findings show that the model performs reliably for acute and extensive ICH but is less sensitive to subtle or localized presentations, underscoring the need for ongoing real-world evaluation and targeted improvements to support safe clinical triage.
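The metrics reported above (sensitivity, specificity, accuracy, and Cohen's κ) can all be derived from a 2×2 confusion matrix of reference labels against model or LLM output. The following is a minimal Python sketch of those standard definitions, not the authors' analysis code. The function names and the ten toy labels are hypothetical and chosen only for illustration; they are not study data.

```python
from collections import namedtuple

# Counts of a 2x2 confusion matrix for a binary finding (1 = ICH, 0 = no ICH).
Confusion = namedtuple("Confusion", ["tp", "fp", "tn", "fn"])


def confusion(reference, predicted):
    """Tally true/false positives and negatives from paired binary labels."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    tp = fp = tn = fn = 0
    for ref, pred in zip(reference, predicted):
        if ref == 1 and pred == 1:
            tp += 1
        elif ref == 0 and pred == 1:
            fp += 1
        elif ref == 0 and pred == 0:
            tn += 1
        else:  # ref == 1 and pred == 0
            fn += 1
    return Confusion(tp, fp, tn, fn)


def diagnostic_metrics(c):
    """Sensitivity, specificity, and accuracy.

    Assumes at least one reference-positive and one reference-negative case,
    so that neither denominator is zero.
    """
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "sensitivity": c.tp / (c.tp + c.fn),  # detected ICH / all true ICH
        "specificity": c.tn / (c.tn + c.fp),  # correct negatives / all true negatives
        "accuracy": (c.tp + c.tn) / total,
    }


def cohen_kappa(c):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    total = c.tp + c.fp + c.tn + c.fn
    observed = (c.tp + c.tn) / total
    # Chance agreement from each rater's marginal positive/negative rates.
    expected_pos = ((c.tp + c.fn) / total) * ((c.tp + c.fp) / total)
    expected_neg = ((c.tn + c.fp) / total) * ((c.tn + c.fn) / total)
    expected = expected_pos + expected_neg
    return (observed - expected) / (1 - expected)


# Toy labels for illustration only (NOT data from the study).
toy_reference = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

toy_counts = confusion(toy_reference, toy_predicted)
toy_metrics = diagnostic_metrics(toy_counts)
toy_kappa = cohen_kappa(toy_counts)
```

Subgroup figures such as the sensitivity for small or subacute hemorrhages would follow from applying the same functions to the subset of examinations in each stratum. Because sensitivity depends only on reference-positive cases, it can vary widely across strata while overall accuracy, dominated by the many negative examinations, stays high.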