Uncertainty-aware deep learning for segmentation of primary tumor and pathologic lymph nodes in oropharyngeal cancer: Insights from a multi-center cohort.
Authors
Affiliations (3)
- Department of Radiation Oncology, University Medical Centre Groningen (UMCG), Groningen 9700 RB, the Netherlands; Data Science Centre in Health (DASH), University Medical Centre Groningen (UMCG), Groningen 9700 RB, the Netherlands.
- Department of Radiation Oncology, University Medical Centre Groningen (UMCG), Groningen 9700 RB, the Netherlands.
- Department of Radiation Oncology, University Medical Centre Groningen (UMCG), Groningen 9700 RB, the Netherlands; Data Science Centre in Health (DASH), University Medical Centre Groningen (UMCG), Groningen 9700 RB, the Netherlands.
Abstract
Information on deep learning (DL) tumor segmentation accuracy at both the voxel level and the structure level is essential for clinical introduction. In a previous study, a DL model was developed for oropharyngeal cancer (OPC) primary tumor (PT) segmentation in PET/CT images, and voxel-level predicted probabilities (tumor probability maps, TPMs) quantifying model certainty were introduced. This study extended the network to generate TPMs for the PT and pathologic lymph nodes (PL) simultaneously, and explored whether structure-level uncertainty in the TPMs predicts segmentation accuracy in an independent external cohort. We retrospectively gathered PET/CT images and manual delineations of the gross tumor volume of the PT (GTVp) and PL (GTVln) of 407 OPC patients treated with (chemo)radiation in our institute. The HECKTOR 2022 challenge dataset served as the external test set. The pre-existing architecture was modified for multi-label segmentation. Multiple models were trained, and the non-binarized ensemble average of their TPMs was considered per patient. Segmentation accuracy was quantified by the surface and aggregate Dice similarity coefficient (DSC), and model uncertainty by the coefficient of variation (CV) across the multiple predictions. In the external test set, the predicted GTVp and GTVln segmentations achieved aggregate DSCs of 0.75 and 0.70, respectively. Patient-specific CV and surface DSC were significantly correlated for both structures in the external set (correlation coefficients of -0.54 and -0.66 for GTVp and GTVln, respectively), meaning that higher uncertainty was associated with lower accuracy. This significant calibration of uncertainty against accuracy was achieved for the TPMs in both the internal and external test sets, indicating that uncertainty quantified from TPMs could be used to identify cases with lower GTVp and GTVln segmentation accuracy, independently of the dataset.
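The quantities named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, each TPM is represented as a flat list of voxel probabilities to avoid external dependencies, and the CV is assumed here to be the standard deviation divided by the mean of the per-model soft volume (the sum of voxel probabilities), since the abstract does not state the exact definition. The aggregate DSC follows the usual pooled form, in which intersections and volumes are summed over all patients before the ratio is taken.

```python
from statistics import mean, pstdev


def ensemble_mean(tpms):
    """Voxel-wise average of the TPMs from several models (kept non-binarized)."""
    return [mean(voxel) for voxel in zip(*tpms)]


def structure_cv(tpms):
    """Structure-level CV across models.

    Assumption: CV = std / mean of each model's soft volume
    (sum of voxel probabilities); the abstract does not give the definition.
    """
    volumes = [sum(tpm) for tpm in tpms]
    avg = mean(volumes)
    return pstdev(volumes) / avg if avg > 0 else 0.0


def aggregate_dsc(predictions, references):
    """Aggregate DSC over binary masks (0/1 values), pooled across patients."""
    intersection = 0
    total = 0
    for pred, ref in zip(predictions, references):
        intersection += sum(p * r for p, r in zip(pred, ref))
        total += sum(pred) + sum(ref)
    # Undefined when every mask is empty, so return NaN in that case.
    return 2 * intersection / total if total else float("nan")


# Toy example: three models, one patient, four voxels.
tpms = [
    [0.9, 0.8, 0.1, 0.0],
    [0.7, 0.6, 0.3, 0.0],
    [0.8, 0.7, 0.2, 0.0],
]
avg_tpm = ensemble_mean(tpms)
cv = structure_cv(tpms)

# Toy example: two patients with binary predicted and reference masks.
preds = [[1, 1, 0, 0], [0, 1, 1, 0]]
refs = [[1, 1, 1, 0], [0, 0, 1, 0]]
agg = aggregate_dsc(preds, refs)  # 2 * (2 + 1) / (5 + 3) = 0.75
```

A per-patient CV computed this way could then be compared against the per-patient surface DSC with a rank correlation to check whether higher uncertainty goes with lower accuracy, which is the relationship the abstract reports.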