Empirical evaluation of variability and multi-institutional generalizability of deep learning survival models: application to renal cancer CT scans.
Authors
Affiliations (5)
- Case Western Reserve University, Department of Biomedical Engineering, 10900 Euclid Ave, Cleveland, OH 44106, USA. Electronic address: [email protected].
- Case Western Reserve University, Department of Biomedical Engineering, 10900 Euclid Ave, Cleveland, OH 44106, USA.
- Case Western Reserve University, Department of Computer and Data Sciences, 10900 Euclid Ave, Cleveland, OH 44106, USA.
- Cleveland Clinic, Department of Urology, 9500 Euclid Ave, Cleveland, OH 44106, USA.
- Case Western Reserve University, Department of Biomedical Engineering, 10900 Euclid Ave, Cleveland, OH 44106, USA; Louis Stokes Veteran Affairs Medical Center, 10701 East Blvd, Cleveland, OH 44106, USA. Electronic address: [email protected].
Abstract
Deep learning survival (DLS) modeling has shown significant promise across multiple disease conditions for predicting patient outcomes from biomedical data, but remains relatively under-explored in conjunction with medical imaging. This raises the question of which methodological choices promote model generalization, especially when predicting continuous survival outcomes in external validation cohorts. In this work, we systematically examined the impact of data partitioning, data ordering, model initialization, and augmentation strategies on the generalization of CT-based DLS models in renal cancers. A multi-institutional CT cohort was assembled, leveraging public repositories such as TCGA-KIRC, TCGA-KICH, TCGA-KIRP, and KITS19, totaling 525 patients from 9 institutions. A 3D ResNet-18 model was trained to predict overall patient survival with a Cox loss function and evaluated across three controlled experiments: (i) random versus covariate-balanced data partitioning, (ii) randomizing versus fixing data ordering and initial model weights, and (iii) varying intensity-, spatial-, and noise-based augmentations. Performance was measured in terms of concordance index (c-index) and hazard ratio (HR). Covariate-balanced data partitioning outperformed random partitioning on an external validation set (c-index 0.74, HR 5.08 versus c-index 0.56, HR 0.29). Different model initializations significantly impacted the average performance and variance across replicate trainings, whereas data ordering had little to no effect. Combining multiple augmentations produced the best performance on external validation (c-index +4.76%, HR +44.39%) compared to no augmentations, while the most impactful individual augmentation type was additive Gaussian noise (c-index +3.17%, HR +23.47%).
Our results demonstrate that careful selection of strategies for data partitioning and ordering, weight initialization, and image augmentation is critical for developing robust and generalizable DL models for continuous survival prediction using medical imaging data.
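For readers unfamiliar with the two quantities the abstract relies on, the following is a minimal NumPy sketch of a Cox loss (negative partial log-likelihood) and of Harrell's concordance index. It is an illustration under stated assumptions, not the authors' implementation: the function names, the Breslow-style treatment of tied times, and the averaging over observed events are all choices made here, and a real training loop would compute the loss in a deep learning framework on the network's predicted risk scores.

```python
import numpy as np


def cox_neg_partial_log_likelihood(risk, time, event):
    """Negative Cox partial log-likelihood, averaged over observed events.

    risk  : predicted log-risk score per patient (higher = worse prognosis)
    time  : follow-up time per patient
    event : 1 if death was observed, 0 if the patient was censored
    Tied times are handled Breslow-style (an assumption of this sketch).
    """
    risk = np.asarray(risk, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)

    # at_risk[i, j] is True when patient j is still under observation
    # at patient i's follow-up time.
    at_risk = time[None, :] >= time[:, None]

    # log(sum of exp(risk)) over each risk set, shifted by the max for stability.
    shift = risk.max()
    log_risk_set = shift + np.log((np.exp(risk - shift)[None, :] * at_risk).sum(axis=1))

    # Only patients with an observed event contribute a term.
    return -np.sum((risk - log_risk_set)[event]) / max(int(event.sum()), 1)


def concordance_index(risk, time, event):
    """Harrell's c-index: fraction of comparable pairs ranked correctly.

    A pair (i, j) is comparable when patient i had an observed event and a
    shorter follow-up time than patient j. Tied risk scores count as 0.5.
    """
    risk = np.asarray(risk, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)

    concordant, comparable = 0.0, 0
    for i in range(len(time)):
        if not event[i]:
            continue
        for j in range(len(time)):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    time = [5.0, 12.0, 20.0, 31.0]
    event = [1, 1, 0, 1]
    risk = [1.2, 0.4, 0.1, -0.8]
    print("Cox loss:", cox_neg_partial_log_likelihood(risk, time, event))
    print("c-index :", concordance_index(risk, time, event))
```

A c-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported values of 0.56 and 0.74 should be read.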