Deep Learning for Differentiating Benign From Malignant Bile Duct Dilation on MRCP: Development and Prospective Evaluation of an Xception-Logistic Regression Ensemble Model.
Authors
Affiliations (3)
- Department of Radiology, The Affiliated Hospital, Southwest Medical University, Luzhou, Sichuan, China.
- Precision Imaging and Intelligent Analysis Key Laboratory of Luzhou, Southwest Medical University, Luzhou, Sichuan, China.
- Department of Oncology, The Affiliated Hospital, Southwest Medical University, Luzhou, Sichuan, China.
Abstract
Accurate differentiation of benign from malignant bile duct dilatation (BDD) is needed to guide management. Conventional imaging evaluation is subjective, whereas deep learning (DL) offers potential for automated, objective assessment. The purpose of this study was to construct and evaluate DL models and ensemble strategies based on magnetic resonance cholangiopancreatography (MRCP) images for differentiating benign from malignant BDD. The study combined retrospective and prospective designs. A retrospective cohort (n = 378; median age, 60 years [range: 14, 90]; 194 male) from two institutions and a prospective cohort (n = 60; median age, 62.5 years [range: 15, 86]; 30 male) were included. Retrospective data were divided by stratified random sampling into training, validation, and internal test sets (2:1:1), plus an independent external test set. Benign cases were downsampled to balance the class distribution. Imaging was 3 T MRCP with 3D turbo spin echo sequences (VISTA and SPACE). The primary retrospective endpoint was the area under the curve (AUC) across DL algorithms and ensembles. Prospectively, the accuracy, sensitivity, and specificity of the model were compared with those of three radiologists. Group comparisons used Mann-Whitney U and Chi-square tests, with p < 0.05 considered significant. Model performance was evaluated using the Hosmer-Lemeshow test, DeLong's test with Bonferroni correction (α = 0.005), and McNemar's test. The Xception model achieved AUCs of 0.816 (95% CI, 0.788-0.844) on the internal test set and 0.807 (95% CI, 0.779-0.835) on the external test set. The ensemble model incorporating logistic regression (Xce-LR) yielded higher patient-level AUCs of 0.890 and 0.885, with good calibration (p = 0.109). No significant differences were observed among the five ensemble strategies (minimum adjusted p = 0.62). In the prospective cohort, the model achieved 90.0% accuracy, sensitivity, and specificity, comparable to the radiologists (76.7%-86.7%), with no significant difference (p = 0.143, 0.302, and 0.774, respectively).
The Xce-LR model shows potential for automating BDD differentiation using MRCP. Stage 2.
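The prospective comparison above rests on two simple computations: accuracy, sensitivity, and specificity from a confusion matrix, and McNemar's test on the paired cases where the model and a radiologist disagree. The following is a minimal sketch of both, not the authors' code. All counts are hypothetical: the abstract does not report the benign/malignant split of the prospective cohort or the discordant pairs, so the numbers below were chosen only to reproduce the 90.0% figures on 60 cases. The exact (binomial) form of McNemar's test is also an assumption; the abstract does not say which variant was used.

```python
from math import comb


def diagnostic_metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion counts.

    tp/fn count malignant cases called malignant/benign;
    tn/fp count benign cases called benign/malignant.
    """
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant pairs.

    b: cases the model classified correctly and the reader did not.
    c: cases the reader classified correctly and the model did not.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: 30 malignant and 30 benign cases (assumed split).
accuracy, sensitivity, specificity = diagnostic_metrics(tp=27, fn=3, tn=27, fp=3)

# Hypothetical discordant pairs between the model and one radiologist.
p_value = mcnemar_exact(b=7, c=2)

print(f"accuracy={accuracy:.3f} sensitivity={sensitivity:.3f} "
      f"specificity={specificity:.3f} p={p_value:.3f}")
```

Only the discordant pairs enter McNemar's test, so with 60 cases the test has limited power; a non-significant p-value indicates that no difference was detected, not that the model and radiologists perform equally.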