Deep Learning for Opportunistic Vertebral Fracture Detection on Routine Thoraco-abdominal Computed Tomography: A Systematic Review and Hierarchical Summary Receiver Operating Characteristic Meta-analysis of Patient-level Diagnostic Test Accuracy.
Authors
Affiliations (3)
- School of Medicine, College of Medicine and Health Sciences, Bahir Dar University, Bahir Dar, Ethiopia (K.Y.G.); EPIC Health Systems, Addis Ababa, Ethiopia (K.Y.G.). Electronic address: [email protected].
- Federal Police Hospital, Addis Ababa, Ethiopia (Z.N.I.).
- Amhara Regional Health Bureau, Amhara, Ethiopia (B.A.A.).
Abstract
Vertebral fractures (VFs) are common, clinically important, and often missed on routine chest or abdominal computed tomography (CT). Deep learning (DL) may support opportunistic case-finding. Following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses of Diagnostic Test Accuracy Studies (PRISMA-DTA), we searched MEDLINE, Embase, and Web of Science. Risk of bias was assessed using Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2), and artificial intelligence (AI)-specific reporting was assessed descriptively. We fitted bivariate random-effects/hierarchical summary receiver operating characteristic (HSROC) models and conducted restriction-based sensitivity analyses. Seven retrospective studies (2020-2025; N = 11,615) were included. Most cohorts were opportunistic or osteoporosis-related; 5/7 used external validation. All evaluated stand-alone DL for detecting ≥1 vertebra with Genant semiquantitative (SQ) grade 2-3 fracture. Pooled sensitivity was 0.83 (95% confidence interval [CI]: 0.73-0.90) and specificity was 0.92 (95% CI: 0.90-0.94). The positive likelihood ratio was 10.44, negative likelihood ratio was 0.19, and diagnostic odds ratio was 55.98. Sensitivity varied more than specificity, and HSROC asymmetry suggested differences in case mix or thresholds. Risk of bias was low to moderate, and AI-specific reporting was incomplete. DL showed high specificity and moderate-to-high sensitivity for patient-level VF detection on routine CT. However, evidence remains early, small, retrospective, and heterogeneous. Current findings support prospective evaluation of DL as a reader-alert, triage, or rule-in aid, rather than routine deployment or stand-alone exclusion of VF.
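The likelihood ratios and diagnostic odds ratio reported above are related to sensitivity and specificity by standard formulas: LR+ = sensitivity / (1 - specificity), LR- = (1 - sensitivity) / specificity, and DOR = LR+ / LR-. The sketch below applies these formulas to the rounded pooled point estimates (0.83 and 0.92) as a rough consistency check. It is illustrative only, and the function name `derived_accuracy_measures` is hypothetical. The results (about 10.4, 0.18, and 56.1) are close to, but not identical with, the reported 10.44, 0.19, and 55.98; such small gaps are expected if the published values were derived from unrounded model estimates rather than the rounded summary points, which is an assumption here.

```python
def derived_accuracy_measures(sensitivity: float, specificity: float) -> dict:
    """Compute LR+, LR-, and DOR from a sensitivity/specificity pair.

    Hypothetical helper for illustration; it is not the review's
    bivariate/HSROC model, only the textbook point-estimate formulas.
    """
    if not (0.0 < sensitivity < 1.0 and 0.0 < specificity < 1.0):
        raise ValueError("sensitivity and specificity must lie strictly between 0 and 1")

    lr_pos = sensitivity / (1.0 - specificity)   # positive likelihood ratio
    lr_neg = (1.0 - sensitivity) / specificity   # negative likelihood ratio
    dor = lr_pos / lr_neg                        # diagnostic odds ratio
    return {"lr_pos": lr_pos, "lr_neg": lr_neg, "dor": dor}


if __name__ == "__main__":
    # Rounded pooled estimates taken from the abstract.
    measures = derived_accuracy_measures(sensitivity=0.83, specificity=0.92)
    for name, value in measures.items():
        print(f"{name}: {value:.2f}")
```

An LR+ near 10 with an LR- near 0.2 is the pattern behind the abstract's conclusion: a positive result shifts the probability of fracture substantially (rule-in), whereas a negative result lowers it only moderately, so it does not support stand-alone exclusion.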