External validation of SpineNetv2 deep learning system for automated lumbar spine MRI analysis: A multi-pathology diagnostic agreement study.
Authors
Affiliations (5)
- The Second Affiliated Hospital of Zunyi Medical University, Zunyi, China.
- The Seventh Affiliated Hospital of Sun Yat-sen University, Shenzhen, China.
- The First Affiliated Hospital of Sun Yat-sen University, Guangzhou, China.
- Guizhou University, Guiyang, China.
- The Second Affiliated Hospital of Zunyi Medical University, Zunyi, China.
Abstract
Magnetic resonance imaging (MRI) is the reference standard for evaluating degenerative lumbar spine disorders, but interpretation is time-consuming and subject to inter-observer variability. SpineNetv2, a publicly available deep learning system, enables automated analysis of multiple spinal pathologies. This study conducted an independent external validation of SpineNetv2 against expert reference assessments. A total of 491 patients (2,455 lumbar discs, L1/2-L5/S1) were retrospectively included. Disc-level reference assessments were provided by an expert orthopedic surgeon, with a junior orthopedic surgeon serving as comparator. Six pathologies were assessed: disc degeneration (Pfirrmann grading), central canal stenosis (CCS), spondylolisthesis, herniation, and left and right foraminal stenosis (FS). Performance metrics included sensitivity, specificity, positive predictive value, negative predictive value, F1-score, Matthews correlation coefficient, exact agreement, weighted kappa, and mean absolute error (MAE). McNemar's test and bootstrap resampling (1,000 iterations) were used for statistical analysis. Overall agreement ranged from 83.5% to 97.5% (mean 92.8%). SpineNetv2 significantly outperformed the junior orthopedic surgeon in CCS, spondylolisthesis, and bilateral FS (all p ≤ 0.001), with comparable performance in herniation (p = 0.293). For Pfirrmann grading, SpineNetv2 had a lower MAE than the junior surgeon (0.213 vs. 0.254, p = 0.001), though agreement declined in older patients and upper lumbar discs. Error analysis revealed a specificity-oriented profile, with false negatives exceeding false positives. SpineNetv2 demonstrated high agreement across the five binary lumbar pathologies, while Pfirrmann grading remained the main limitation, particularly in the upper lumbar discs of elderly patients. Its specificity-oriented profile supports use as a confirmatory second reader, but its negative findings should not be relied on to rule out pathology.
Broader reliability will require multicenter, multi-reader validation and sensitivity-oriented calibration.
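
For readers who want to see how the binary metrics named above relate to one another, the following is a minimal sketch and not the authors' analysis code. It computes the listed agreement metrics from a 2x2 confusion matrix and an exact two-sided McNemar p-value from discordant pair counts. The function names and the counts in the usage lines are hypothetical, and the abstract does not state which variant of McNemar's test was used, so the exact binomial form here is an assumption.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Agreement metrics for one binary pathology against the reference reader."""
    # Matthews correlation coefficient; defined as 0 when any margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
        "agreement": (tp + tn) / (tp + fp + fn + tn),
    }


def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value.

    b: discs where reader A agrees with the reference and reader B does not.
    c: discs where reader B agrees with the reference and reader A does not.
    Under the null hypothesis the discordant pairs follow Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts, chosen only to illustrate a specificity-oriented
# profile (false negatives exceed false positives). Not study data.
example = binary_metrics(tp=80, fp=10, fn=20, tn=890)
p_value = mcnemar_exact(b=0, c=10)
```

In this hypothetical example, sensitivity is lower than specificity because the 20 false negatives outnumber the 10 false positives, which is the pattern the abstract describes as a specificity-oriented profile.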