Deep learning-based lightweight model for automated lumbar foraminal stenosis classification: sagittal CT diagnostic performance compared to clinical subspecialists.
Authors
Affiliations (2)
- The Second Affiliated Hospital and Yuying Children's Hospital of Wenzhou Medical University, Wenzhou, China.
- The Second Affiliated Hospital and Yuying Children's Hospital of Wenzhou Medical University, Wenzhou, China. [email protected].
Abstract
Magnetic resonance imaging (MRI) is essential for diagnosing lumbar foraminal stenosis (LFS). However, access remains limited in China because of uneven equipment distribution, high costs, and long waiting times. This study therefore developed a lightweight deep learning (DL) model that classifies LFS severity on sagittal CT images, as a potential clinical alternative where MRI is unavailable. This retrospective study included 868 sagittal CT images from 177 patients (2016-2025). Data were split at the patient level into training (n = 125), validation (n = 31), and test (n = 21) sets, with annotations based on the Lee grading system provided by two spine surgeons. Two DL models were developed: DL1 (EfficientNet-B0) and DL2 (MobileNetV3-Large-100), both of which incorporated a Faster R-CNN with a ResNet-50-based region-of-interest (ROI) detector. Diagnostic performance was benchmarked against spine surgeons with different levels of clinical experience. DL1 achieved a diagnostic accuracy of 82.35% and DL2 achieved 80.39% (mean 81.37%); both were comparable to the senior spine surgeon (83.33%) and exceeded the junior spine surgeon (62.75%). On Cohen's kappa analysis, DL1 showed near-perfect agreement with the senior spine surgeon (κ = 0.815; 95% CI: 0.723-0.907), whereas DL2 showed substantial agreement (κ = 0.799; 95% CI: 0.703-0.895). Agreement between the two models was κ = 0.782 (95% CI: 0.682-0.882). In conclusion, the DL models graded LFS severity on sagittal CT with a mean diagnostic accuracy of 81.37%, comparable to that of the senior spine surgeon (83.33%). However, given the limited sample size and the absence of external validation, their applicability and generalisability to other populations and to large-scale, multi-centre datasets remain uncertain.
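Two steps named in the abstract, the patient-level data split and the Cohen's kappa agreement analysis, can be illustrated with a minimal Python sketch. This is not the authors' code. The function names `split_by_patient` and `cohens_kappa`, the random seed, and the toy grade lists are hypothetical, and the sketch assumes unweighted kappa, since the abstract does not state whether a weighted variant was used.

```python
import random
from collections import Counter


def split_by_patient(patient_ids, n_train, n_val, seed=0):
    """Return one split label per image, keeping each patient in a single split.

    patient_ids holds one entry per image (hypothetical input format).
    Patients beyond the first n_train + n_val after shuffling go to "test".
    """
    unique = sorted(set(patient_ids))
    random.Random(seed).shuffle(unique)
    train = set(unique[:n_train])
    val = set(unique[n_train:n_train + n_val])
    return [
        "train" if pid in train else "val" if pid in val else "test"
        for pid in patient_ids
    ]


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equally long lists of labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of cases")
    n = len(rater_a)
    # Observed agreement: fraction of cases with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one and the same label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Toy grades on a 0-3 scale (illustrative only, not study data).
    model_grades = [0, 1, 2, 3, 1, 2, 0, 3, 2, 1]
    senior_grades = [0, 1, 2, 3, 1, 1, 0, 3, 2, 2]
    print(f"kappa = {cohens_kappa(model_grades, senior_grades):.3f}")
```

Splitting by patient rather than by image matters because several images from one patient are likely to look alike; if they were spread across training and test sets, the reported test accuracy could be optimistic.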