Interpretable deep learning for rotator cuff calcific tendinopathy diagnosis: a multi-center study.
Authors
Affiliations (11)
- Radiology Department, Hospital Universitario Rey Juan Carlos, Móstoles, Madrid, Spain. [email protected].
- Health Research Institute of the Jiménez Díaz Foundation (IIS-FJD), Madrid, Spain. [email protected].
- Department of Physical Therapy, Occupational Therapy, Rehabilitation and Physical Medicine, Rey Juan Carlos University, Madrid, Spain. [email protected].
- Department of Morphology and Cell Biology, Universidad de Oviedo, Oviedo, Spain. [email protected].
- Radiology Department, Hospital Universitario Rey Juan Carlos, Móstoles, Madrid, Spain.
- Health Research Institute of the Jiménez Díaz Foundation (IIS-FJD), Madrid, Spain.
- Advanced Computing and E-Science Research Group, Instituto de Física de Cantabria (IFCA), CSIC-UC, Santander, Spain.
- Department of Morphology and Cell Biology, Universidad de Oviedo, Oviedo, Spain.
- Facultad de Ciencias de la Salud, Universidad Autónoma de Chile, Santiago, Chile.
- Department of Physical Therapy, Occupational Therapy, Rehabilitation and Physical Medicine, Rey Juan Carlos University, Madrid, Spain.
- Escuela Politécnica Superior, Universidad CEU San Pablo, Madrid, Spain.
Abstract
The reliable deployment of artificial intelligence systems in medical imaging requires high diagnostic performance, robustness and interpretability. In this study, we developed and evaluated two automated frameworks for binary classification of shoulder radiographs (XRs) using deep learning (DL) and hybrid DL-machine learning (ML) approaches. A convolutional neural network (CNN) based on a fine-tuned VGG19 architecture was trained end-to-end on a large, balanced dataset of 4,268 shoulder XRs. In parallel, hybrid models were constructed by extracting deep feature representations from the trained network and combining them with traditional ML classifiers. Model performance was evaluated on independent internal (n = 480) and external (n = 308) validation sets. Both approaches achieved high discriminative performance. Paired comparison of Receiver Operating Characteristic (ROC) curves using the DeLong test revealed no statistically significant differences between the end-to-end CNN and the hybrid CNN-ML pipeline for either internal validation (AUC 0.956 vs. 0.961) or external generalization (AUC 0.940 vs. 0.942). Model interpretability was assessed using Grad-CAM and SHAP values. Our results suggest that while both frameworks are robust, the end-to-end DL approach offers a more streamlined workflow and more direct visual explainability via saliency maps. These findings support the potential of AI-based tools for shoulder XR analysis; however, prospective real-world validation, assessment under routine prevalence conditions, and direct comparison with human readers are still needed before clinical integration can be established.
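The paired ROC comparison mentioned in the abstract can be illustrated with a minimal NumPy sketch of the DeLong test for two correlated AUCs. This is not the authors' implementation, which the abstract does not describe. The function names (`placement_values`, `delong_paired_test`) are hypothetical, and the scores in the usage section are synthetic and unrelated to the study data.

```python
import math
import numpy as np


def placement_values(pos, neg):
    """Return per-positive and per-negative placement values for one model.

    Uses the Mann-Whitney kernel: 1 if pos > neg, 0.5 if tied, 0 otherwise.
    """
    diff = pos[:, None] - neg[None, :]
    psi = (diff > 0).astype(float) + 0.5 * (diff == 0)
    return psi.mean(axis=1), psi.mean(axis=0)


def delong_paired_test(y_true, scores_a, scores_b):
    """Two-sided DeLong test for two models scored on the same cases.

    Returns (auc_a, auc_b, z, p_value).
    """
    y = np.asarray(y_true).astype(bool)
    sa = np.asarray(scores_a, dtype=float)
    sb = np.asarray(scores_b, dtype=float)
    m, n = int(y.sum()), int((~y).sum())

    v10_a, v01_a = placement_values(sa[y], sa[~y])
    v10_b, v01_b = placement_values(sb[y], sb[~y])
    auc_a, auc_b = v10_a.mean(), v10_b.mean()

    # 2x2 covariance matrices of the placement values across the two models.
    s10 = np.cov(np.vstack([v10_a, v10_b]))
    s01 = np.cov(np.vstack([v01_a, v01_b]))

    # Variance of the AUC difference (contrast vector [1, -1]).
    var = (s10[0, 0] + s10[1, 1] - 2 * s10[0, 1]) / m \
        + (s01[0, 0] + s01[1, 1] - 2 * s01[0, 1]) / n

    diff = auc_a - auc_b
    if var <= 0:
        # Degenerate case: no estimated variability in the difference.
        z = 0.0 if diff == 0 else math.copysign(math.inf, diff)
    else:
        z = diff / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return auc_a, auc_b, z, p_value


# Usage with synthetic scores (illustrative only, not study data).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=300)
model_a = labels + rng.normal(0.0, 0.8, size=300)
model_b = labels + rng.normal(0.0, 0.8, size=300)
auc_a, auc_b, z, p = delong_paired_test(labels, model_a, model_b)
print(f"AUC A={auc_a:.3f}, AUC B={auc_b:.3f}, z={z:.2f}, p={p:.3f}")
```

Because both models are scored on the same radiographs, the test uses the covariance between their placement values rather than treating the two AUCs as independent, which is what makes it a paired comparison.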