Diagnostic Accuracy of Deep Learning for Automated Detection of Spinal Degenerative Disease on MRI: A Systematic Review and Meta-Analysis.
Authors
Affiliations (10)
- School of Medicine, College of Medicine and Health Sciences, Bahir Dar University, Bahir Dar, Ethiopia.
- EPIC Health Systems, Addis Ababa, Ethiopia.
- Indira Gandhi Government Medical College & Hospital, Nagpur, Maharashtra, India.
- School of Public Health, Washington University in St. Louis, Saint Louis, MO, USA.
- School of Medicine, College of Health Sciences, Addis Ababa University, Addis Ababa, Ethiopia.
- Johns Hopkins Bloomberg School of Public Health, Baltimore, USA.
- School of Medicine, Hayat Medical College, Addis Ababa, Ethiopia.
- Rollins School of Public Health, Emory University, Atlanta, GA, USA.
- School of Medicine and Public Health, College of Health, Medicine and Wellbeing, The University of Newcastle, Newcastle, NSW, Australia.
- Amhara Regional Health Bureau, Amhara, Ethiopia.
Abstract
This study aims to estimate the diagnostic accuracy of deep learning (DL) models for automated detection/classification of spinal degenerative disease (SDD) on spine MRI and to explore clinically relevant heterogeneity. We searched Ovid MEDLINE, Ovid Embase and Web of Science (January 2010 to 5 December 2025) for diagnostic accuracy studies of DL applied to spine MRI with reconstructible 2 × 2 data (TP/FP/FN/TN). Risk of bias was assessed with QUADAS-2. Pooled sensitivity and specificity were synthesised using hierarchical bivariate/HSROC models with a prespecified arm-selection hierarchy. Prespecified subgroup and sensitivity analyses examined spinal region, severity threshold, validation type and target focus. Fourteen studies (published 2020-2025) were included from 2363 records. Sample sizes ranged from 29 to 2991. Overall pooled sensitivity was 0.94 (95% CI 0.89-0.97) and pooled specificity was 0.95 (95% CI 0.90-0.97), with a positive likelihood ratio (LR+) of 17.5 and a negative likelihood ratio (LR-) of 0.06. Stenosis-focused studies showed lower pooled sensitivity/specificity (0.88/0.92) than studies targeting broader degenerative changes (0.96/0.96). Excluding small studies (n ≤ 50) yielded similar estimates (sensitivity 0.95; specificity 0.95; 12 studies). No study was at low risk of bias across all QUADAS-2 domains; 9/14 had ≥ 1 high-risk domain. Deeks' test showed no evidence of small-study effects (p = 0.28). DL models show high pooled accuracy for SDD detection on MRI, but clinical readiness is constrained by risk of bias, predominantly retrospective single-centre designs, subjective reference standards and limited external validation; prospective multicentre evaluations with prespecified, clinically meaningful thresholds are needed.
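For readers less familiar with the quantities reported above, the following is a minimal sketch of how sensitivity, specificity and likelihood ratios follow from a single 2 × 2 table. It is not the authors' analysis: the class `TwoByTwo`, the function `likelihood_ratios` and the counts are hypothetical, chosen only so that the table reproduces the rounded pooled point estimates (0.94 and 0.95). The pooled values in the abstract come from hierarchical bivariate/HSROC models, and the abstract does not state how the likelihood ratios were derived, so this naive calculation is not expected to match the reported LR+ of 17.5 exactly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts from one diagnostic accuracy study (hypothetical container)."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives

    @property
    def sensitivity(self) -> float:
        # Proportion of diseased cases the model flags as positive.
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        # Proportion of non-diseased cases the model flags as negative.
        return self.tn / (self.tn + self.fp)


def likelihood_ratios(sensitivity: float, specificity: float) -> tuple[float, float]:
    """Return (LR+, LR-) using the standard definitions."""
    lr_pos = sensitivity / (1.0 - specificity)
    lr_neg = (1.0 - sensitivity) / specificity
    return lr_pos, lr_neg


# Illustrative counts only (an assumption, not data from any included study).
table = TwoByTwo(tp=94, fp=5, fn=6, tn=95)
lr_pos, lr_neg = likelihood_ratios(table.sensitivity, table.specificity)

print(f"sensitivity={table.sensitivity:.2f}, specificity={table.specificity:.2f}")
print(f"LR+={lr_pos:.1f}, LR-={lr_neg:.2f}")
```

With these illustrative counts the naive LR+ is about 18.8 and the LR- about 0.06. The gap between 18.8 and the reported 17.5 is the kind of difference that can arise when likelihood ratios are obtained from a fitted model or from unrounded estimates rather than from rounded summary points.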