A Calibrated Deep Learning Framework Integrating Spatial Annotations and Clinical Metadata for Safe Three-Class Bone Lesion Classification on Radiographs.
Authors
Affiliations (3)
- Department of Basic Medicine Science, Anatomy, Faculty of Dentistry, Ankara University, Ankara 06560, Türkiye.
- Department of Forensic Anthropology, Institute of Forensic Sciences, Ankara University, Ankara 06560, Türkiye.
- Department of Forensic Anthropology, Graduate School of Health Sciences, Ankara University, Ankara 06560, Türkiye.
Abstract
<b>Background/Objectives</b>: Accurate bone lesion classification on radiographs is critical for clinical decision-making and forensic identification. Existing deep learning approaches treat radiographs as whole images, neglecting available spatial annotations and clinical metadata. This study aimed to develop a region-of-interest (ROI)-guided deep learning framework that integrates clinical metadata for three-class (Normal, Benign, Malignant) bone lesion classification and to assess its clinical safety profile. <b>Methods</b>: Using the BTXRD dataset (3746 radiographs: 1879 Normal, 1525 Benign, 342 Malignant), an EfficientNetV2-S backbone was combined with a multilayer perceptron (MLP) over 11-dimensional clinical metadata, and the fused model was trained on ROI-cropped regions. Training employed Focal Loss with adaptive class weighting, Mixup/CutMix augmentations, Stochastic Weight Averaging, and Test-Time Augmentation. Performance was evaluated with five-fold stratified cross-validation, bootstrap confidence intervals (<i>n</i> = 2000 resamples), and probability calibration metrics. <b>Results</b>: The framework achieved 96.05% accuracy (95% CI: 95.41-96.66%), 93.94% balanced accuracy, 92.62% macro F1-score, and 99.21% macro-AUC (95% CI: 98.89-99.42%). Critically, only one Malignant case was misclassified as Normal (1/342, 0.29%; 95% Clopper-Pearson CI: 0.01-1.62%) across all 3746 predictions. The minority Malignant class attained F1 = 83.53% despite comprising only 9.1% of the dataset. <b>Conclusions</b>: ROI-guided deep learning with metadata fusion achieves state-of-the-art bone lesion classification with clinically safe error patterns and explicitly quantified probability calibration. These results support its potential as a decision support tool in diagnostic radiology and forensic anthropology, pending external validation on independent cohorts.
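The Clopper-Pearson interval quoted for the Malignant-to-Normal error rate can be recomputed from the two counts given in the abstract (1 error among 342 Malignant cases). The following is a minimal sketch using only the Python standard library, not the authors' code: the function names `binom_cdf` and `clopper_pearson` are hypothetical, and the bounds are found by simple bisection on the binomial tail probabilities rather than by a beta quantile function.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(f, target, iters=200):
    """Bisection for f(p) = target on [0, 1], where f is increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for a binomial proportion k/n."""
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2
    )
    # Upper bound: the p at which P(X <= k) equals alpha / 2,
    # i.e. P(X > k) equals 1 - alpha / 2.
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2
    )
    return lower, upper


# Counts taken from the abstract: 1 Malignant-to-Normal error in 342 cases.
errors, malignant_total = 1, 342
rate = errors / malignant_total
ci_low, ci_high = clopper_pearson(errors, malignant_total)
print(f"rate = {rate:.2%}, 95% CI = {ci_low:.2%} to {ci_high:.2%}")
```

Under these assumptions the printed interval should land close to the reported 0.01-1.62%. The width of that interval reflects how few Malignant cases are available: with 342 cases, a single observed error is still compatible with a true miss rate above 1.5%.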