AI-assisted detection of cerebral aneurysms on 3D time-of-flight MR angiography: user variability and clinical implications.
Authors
Affiliations (4)
- Department of Diagnostic and Interventional Neuroradiology, CHRU Nancy, France; INRIA, LORIA, CNRS, Université de Lorraine, Nancy, France. Electronic address: [email protected].
- Department of Diagnostic and Interventional Neuroradiology, CHRU Nancy, France; IADI, INSERM U1254, Université de Lorraine, Nancy, France.
- Department of Diagnostic and Interventional Neuroradiology, CHRU Nancy, France.
- INRIA, LORIA, CNRS, Université de Lorraine, Nancy, France.
Abstract
The generalizability and reproducibility of AI-assisted detection of cerebral aneurysms on 3D time-of-flight MR angiography remain unclear. We aimed to evaluate physician performance with AI assistance, focusing on inter- and intra-user variability, the factors influencing performance, and the clinical implications. In this retrospective study, four state-of-the-art AI models were hyperparameter-optimized on an in-house dataset (2019-2021) and evaluated via 5-fold cross-validation on a public external dataset. The two best-performing models were selected for evaluation on an expert-revised external dataset restricted to saccular aneurysms without prior treatment. Five physicians, grouped by expertise, each performed two AI-assisted evaluations, one with each model. Lesion-wise sensitivity and false positives per case (FPs/case) were calculated for each physician-AI pair and for the AI models alone. Agreement was assessed using kappa, and aneurysm sizes were compared with the Mann-Whitney U test. The in-house dataset included 132 patients with 206 aneurysms (mean size: 4.0 mm); the revised external dataset, 270 patients with 174 aneurysms (mean size: 3.7 mm). Standalone AI achieved 86.8% sensitivity with 0.58 FPs/case. With AI assistance, non-experts achieved 72.1% sensitivity and 0.037 FPs/case; experts, 88.6% and 0.076 FPs/case; the intermediate-level physician, 78.5% and 0.037 FPs/case. Intra-group agreement was 80% for non-experts (kappa: 0.57, 95% CI: 0.54-0.59) and 77.7% for experts (kappa: 0.53, 95% CI: 0.51-0.55). Among experts, false positives were smaller than true positives (2.7 vs. 3.8 mm, p < 0.001); no significant difference was found for non-experts (p = 0.09). Missed-aneurysm locations were mainly model-dependent, whereas true- and false-positive locations reflected physician expertise. Non-experts more often rejected AI suggestions and added fewer annotations of their own; experts were more conservative and added more. Evaluating AI models in isolation provides an incomplete view of their clinical applicability: detection performance and error patterns differ between standalone AI and AI-assisted use and are modulated by physician expertise. Rigorous external validation remains essential before clinical deployment.
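As a minimal illustration of the lesion-wise metrics reported above (sensitivity and FPs/case), the following Python sketch shows one way such figures can be aggregated across cases. It assumes each case provides ground-truth and predicted aneurysm centroids and that a prediction counts as a true positive when it falls within a fixed distance of an unmatched ground-truth lesion; the greedy matching rule and the 5 mm threshold are illustrative assumptions, not the matching criterion used in the study.

```python
import math

def match_lesions(gt_centroids, pred_centroids, max_dist_mm=5.0):
    """Greedily match predictions to ground-truth lesions; return (tp, fn, fp)."""
    unmatched_gt = list(gt_centroids)
    tp = fp = 0
    for p in pred_centroids:
        hit = next(
            (g for g in unmatched_gt if math.dist(p, g) <= max_dist_mm),
            None,
        )
        if hit is not None:
            unmatched_gt.remove(hit)  # each ground-truth lesion is matched at most once
            tp += 1
        else:
            fp += 1
    return tp, len(unmatched_gt), fp

def lesion_wise_metrics(cases):
    """cases: list of (gt_centroids, pred_centroids) tuples, one per patient."""
    total_tp = total_fn = total_fp = 0
    for gt, pred in cases:
        tp, fn, fp = match_lesions(gt, pred)
        total_tp += tp
        total_fn += fn
        total_fp += fp
    sensitivity = total_tp / (total_tp + total_fn) if (total_tp + total_fn) else 0.0
    fps_per_case = total_fp / len(cases) if cases else 0.0
    return sensitivity, fps_per_case
```

Pooling true positives and false negatives across all cases before dividing, as done here, yields a single lesion-wise sensitivity per reader-model pair, which is consistent with reporting one sensitivity and one FPs/case value per evaluation.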
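The statistical comparisons described above (kappa for agreement, Mann-Whitney U for aneurysm-size differences) can be reproduced with standard library calls, as in the sketch below. The abstract does not specify the kappa variant or the per-lesion coding, so the use of Cohen's kappa on binary detected/missed labels is an assumption, and all arrays are placeholder values rather than study data.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.metrics import cohen_kappa_score

# Inter-reader agreement on the same candidate lesions (assumed: Cohen's kappa
# on binary labels, detected = 1, missed = 0).
reader_a = np.array([1, 1, 0, 1, 0, 1, 1, 0])
reader_b = np.array([1, 0, 0, 1, 0, 1, 1, 1])
kappa = cohen_kappa_score(reader_a, reader_b)

# Size comparison between false-positive and true-positive findings
# (two-sided Mann-Whitney U), mirroring the 2.7 mm vs. 3.8 mm comparison.
fp_sizes_mm = np.array([2.1, 2.5, 3.0, 2.8])
tp_sizes_mm = np.array([3.5, 4.2, 3.9, 3.6, 4.0])
stat, p_value = mannwhitneyu(fp_sizes_mm, tp_sizes_mm, alternative="two-sided")

print(f"kappa = {kappa:.2f}, Mann-Whitney U p = {p_value:.3f}")
```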