Deep Learning in Vertebral Fracture Detection: Systematic Review and Meta-analysis of Subject- vs. Vertebra-Level Approaches.
Authors
Affiliations (4)
- Department of Radiology, University of Florida, Gainesville, Florida (M.H., H.S.S., A.R., S.M., A.R., A.D., K.R.P.).
- Department of Industrial and Systems Engineering, University of Florida, Gainesville, Florida (B.A.); Department of Neurology, University of Florida, Gainesville, Florida (B.A., A.B.F.); Magnetoencephalography (MEG) Lab, The Norman Fixel Institute of Neurological Diseases, University of Florida Health, Gainesville, Florida (B.A., A.B.).
- Department of Neurology, University of Florida, Gainesville, Florida (B.A., A.B.F.); Magnetoencephalography (MEG) Lab, The Norman Fixel Institute of Neurological Diseases, University of Florida Health, Gainesville, Florida (B.A., A.B.).
- Department of Radiology, University of Florida, Gainesville, Florida (M.H., H.S.S., A.R., S.M., A.R., A.D., K.R.P.). Electronic address: [email protected].
Abstract
RATIONALE AND OBJECTIVES: To provide a context-aware evaluation of deep learning algorithms for vertebral fracture detection by disentangling subject-level from vertebra-level approaches, quantifying the influence of key technical and methodological factors, and generating evidence to guide task-specific clinical use and standardized reporting.
MATERIALS AND METHODS: In this PRISMA-compliant review (PROSPERO CRD42024523301), five databases were searched up to February 2025 for English-language studies reporting accuracy metrics. Risk of bias was assessed with QUADAS-AI. Hierarchical summary ROC models pooled sensitivity, specificity, and AUC for each analytical level; subgroup analysis and meta-regression explored heterogeneity by test-set origin, imaging modality, and scanner vendor.
RESULTS: Thirty-six studies (96,956 patients; 171,552 images) were eligible; 28 provided 113 contingency tables. Pooled subject-level sensitivity/specificity was 84%/91% (AUC 0.94); pooled vertebra-level sensitivity/specificity was 80%/97% (AUC 0.96). Subject-level models prioritized sensitivity, whereas vertebra-level models achieved higher specificity with precise localization. External validation lowered sensitivity but retained high specificity. Radiographs favored subject-level screening, whereas CT supported vertebra-level precision. Multi-vendor datasets improved subject-level sensitivity, and single-vendor datasets enhanced vertebra-level specificity. Methodological quality varied across studies; QUADAS-AI identified a high risk of patient selection bias, the most commonly identified bias source, in 61% of studies.
CONCLUSION: Deep learning models demonstrate high accuracy for vertebral fracture detection. Subject-level approaches are suited to screening and triage because of their higher sensitivity, whereas vertebra-level approaches offer higher specificity and precise localization for confirmatory diagnosis and treatment planning. Given performance variability across imaging modalities and data sources, clinical use should align model granularity with the intended task and context.
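As a minimal illustration of the quantities pooled above, the following Python sketch computes sensitivity and specificity from a 2x2 contingency table and then forms a naive pooled estimate by summing counts across tables. This is not the hierarchical summary ROC model used in the review, which additionally accounts for between-study variation and the correlation between sensitivity and specificity. The class name, function names, and all counts are hypothetical and chosen only for illustration; they are not data from the included studies.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ContingencyTable:
    """One 2x2 table: model output versus reference standard."""
    tp: int  # fractures correctly flagged
    fp: int  # non-fractures incorrectly flagged
    fn: int  # fractures missed
    tn: int  # non-fractures correctly cleared


def sensitivity(t: ContingencyTable) -> float:
    # Proportion of true fractures that the model detects.
    return t.tp / (t.tp + t.fn)


def specificity(t: ContingencyTable) -> float:
    # Proportion of non-fractures that the model correctly clears.
    return t.tn / (t.tn + t.fp)


def naive_pool(tables: Iterable[ContingencyTable]) -> ContingencyTable:
    # Simplification: sum the cells across studies. A hierarchical model
    # (as used in the review) would instead estimate a summary point
    # with random effects; this sum ignores between-study heterogeneity.
    tables = list(tables)
    return ContingencyTable(
        tp=sum(t.tp for t in tables),
        fp=sum(t.fp for t in tables),
        fn=sum(t.fn for t in tables),
        tn=sum(t.tn for t in tables),
    )


# Hypothetical counts for two studies (illustrative only).
study_a = ContingencyTable(tp=40, fp=10, fn=10, tn=90)
study_b = ContingencyTable(tp=45, fp=20, fn=5, tn=80)
pooled = naive_pool([study_a, study_b])

print(f"Study A: sens={sensitivity(study_a):.2f}, spec={specificity(study_a):.2f}")
print(f"Study B: sens={sensitivity(study_b):.2f}, spec={specificity(study_b):.2f}")
print(f"Pooled:  sens={sensitivity(pooled):.2f}, spec={specificity(pooled):.2f}")
```

The same table structure applies at either analytical level; only the unit of analysis changes: one row per patient for subject-level evaluation, and one row per vertebra for vertebra-level evaluation.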