Uncertainty-aware 3D tumor segmentation with deep ensembles: an evaluation of temporal versus data diversity.

June 17, 2026

Authors

Zhang C, Rimez D, Lee JA, Barragan Montero AM

Affiliations (4)

  • Institut de recherche expérimentale et clinique (IREC), UCLouvain, MIRO - Claude Bernard, B1.54.07, Av. Hippocrate 54, Woluwe-Saint-Lambert, Brussels, 1200, Belgium.
  • Center for Molecular Imaging and Experimental Radiotherapy, Universite Catholique de Louvain, av Hippocrate 55 B1.54.07, Brussels, Brussels, 1200, Belgium.
  • Center for Molecular Imaging and Experimental Radiotherapy, Universite Catholique de Louvain, av Hippocrate 55 B1.54.07, Brussels, 1200, Belgium.
  • Center of Molecular Imaging, Radiotherapy and Oncology, Universite catholique de Louvain, Av. Hippocrate 54, Brussels, 1200, Belgium.

Abstract

Artificial intelligence (AI) is transforming segmentation tasks in radiotherapy, but model reliability remains a critical concern, particularly for tumor segmentation. While deep ensembles are among the most reliable uncertainty quantification strategies, their computational cost often drives practitioners toward other approaches. This study explores different sources of model diversity to assess the trade-off between training efficiency and reliability.

Approach: A Cross-Validation Ensemble (CVE) served as the baseline and was compared against two alternative strategies: the Checkpoint Ensemble (CPE), a training-efficient approach that draws temporal diversity from a single training run, and the Data Diversity Ensemble (DDE), which enforces strong data diversity through clustering-based subgroup partitioning rather than random data splits. Evaluations were conducted on three 3D tumor segmentation tasks: lung CT (NSCLC), brain MRI (BraTS), and prostate MRI (LUND-PROBE).
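The abstract does not state how ensemble members are combined or scored, so the following is only a minimal sketch under assumptions: each member (a cross-validation fold model, a checkpoint, or a cluster-specific model) outputs a per-voxel foreground probability map, members are combined by simple averaging, and the result is thresholded at 0.5 before computing Dice. The function names `ensemble_mean` and `dice` and the threshold are illustrative, not taken from the paper.

```python
import numpy as np

def ensemble_mean(member_probs):
    """Average per-voxel foreground probabilities over ensemble members.

    member_probs: array-like of shape (K, D, H, W), one map per member.
    Averaging is an assumed combination rule, not confirmed by the abstract.
    """
    member_probs = np.asarray(member_probs, dtype=float)
    return member_probs.mean(axis=0)

def dice(pred_mask, true_mask, eps=1e-8):
    """Dice coefficient between two binary masks (eps avoids 0/0)."""
    pred_mask = np.asarray(pred_mask, dtype=bool)
    true_mask = np.asarray(true_mask, dtype=bool)
    intersection = np.logical_and(pred_mask, true_mask).sum()
    return (2.0 * intersection + eps) / (pred_mask.sum() + true_mask.sum() + eps)

# Toy example: three hypothetical members on a 2x2x2 volume.
rng = np.random.default_rng(0)
members = rng.uniform(0.0, 1.0, size=(3, 2, 2, 2))
mean_prob = ensemble_mean(members)
segmentation = mean_prob >= 0.5  # assumed decision threshold
```

Under this sketch, the three strategies differ only in where the members come from, which is why a single training run suffices for CPE.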

Main results: CPE achieves segmentation performance comparable to CVE, with marginal Dice differences relative to CVE (-1.20% on NSCLC, -0.27% on BraTS, and +0.05% on LUND-PROBE), while requiring only a single training run. In contrast, DDE exhibits a larger performance degradation on NSCLC (-5.67%), while the differences are smaller on BraTS (-0.81%) and LUND-PROBE (-0.11%). We evaluate reliability by measuring the correlation between Dice and global uncertainty metrics. The results reveal a dataset-dependent pattern: on NSCLC and BraTS, CPE achieves the strongest uncertainty-performance associations, with Spearman correlations between mean foreground entropy and Dice of |ρs| = 0.85 and 0.73, respectively. On LUND-PROBE, by contrast, DDE shows a substantially higher correlation (|ρs| = 0.53) than CVE and CPE (|ρs| = 0.28 and 0.29, respectively).
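The reliability measure described above can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: "foreground" is taken to be the voxels where the ensemble-mean probability is at least 0.5, entropy is the binary entropy of that mean probability, and the Spearman coefficient is computed as the Pearson correlation of ranks without tie correction. The names `binary_entropy`, `mean_foreground_entropy`, and `spearman_rho` are hypothetical.

```python
import numpy as np

def binary_entropy(p, eps=1e-12):
    """Per-voxel binary entropy (in nats) of a foreground probability."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def mean_foreground_entropy(mean_prob, threshold=0.5):
    """Mean entropy over predicted-foreground voxels (assumed definition)."""
    mean_prob = np.asarray(mean_prob, dtype=float)
    foreground = mean_prob >= threshold
    if not foreground.any():
        return 0.0  # no predicted foreground: report zero by convention
    return float(binary_entropy(mean_prob[foreground]).mean())

def spearman_rho(x, y):
    """Spearman correlation as Pearson correlation of ranks (no tie handling)."""
    rank_x = np.argsort(np.argsort(x)).astype(float)
    rank_y = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

# Toy per-case values: higher uncertainty paired with lower Dice.
entropies = np.array([0.10, 0.30, 0.20, 0.50])
dices = np.array([0.90, 0.60, 0.80, 0.40])
rho = spearman_rho(entropies, dices)
```

A useful uncertainty metric should yield a strongly negative `rho`, since cases with higher entropy should tend to have lower Dice; the abstract reports the magnitude |ρs|.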

Significance: Our findings highlight that the effectiveness of ensemble diversity for uncertainty quantification (UQ) depends on the interplay between dataset characteristics, diversity sources, and evaluation metrics. CPE provides uncertainty estimates comparable to those of CVE at substantially reduced computational cost, offering a practical trade-off between efficiency and reliability.

Topics

Journal Article
