Blinded, bias-controlled multi-rater evaluation of human-versus-AI brain metastasis segmentation using a hybrid foundation-model framework.

June 22, 2026

DOI: 10.1002/mp.70538 PMID: 42329664

Authors

Han Y, Zhu E, Mekdash MHA, Awad O, Pathak P, Liang S, Hamstra DA, Zhang X, Siddiqui ZA, Sun B

Affiliations (2)

Radiation Oncology Department, Baylor College of Medicine, Houston, Texas, USA.
Nanjing Medical University, Nanjing, Jiangsu, China.

Abstract

Accurate segmentation of brain metastases (BM) is essential for diagnosis, stereotactic radiosurgery planning, and longitudinal assessment. However, manual contouring is time-intensive, limiting clinical scalability, and exhibits substantial inter-observer variability. This variability complicates objective assessment of automated segmentation methods and challenges interpretation of model performance. To address these limitations, we developed TUM-SAM, a hybrid foundation-model framework for fully automated BM segmentation, and introduced a bias-controlled, blinded multi-rater evaluation paradigm to determine whether AI-based BM segmentation has reached expert-level performance and whether AI-generated contours are preferred by human experts under unbiased assessment.

TUM-SAM integrates nnU-Net-based lesion detection with a tumor-adapted Med-SAM segmentation model to enable prompt-free, fully automated segmentation. Training used 301 patients (2548 lesions), and external evaluation used an independent cohort of 105 patients (397 lesions). Segmentation accuracy was benchmarked against DeepMedic and nnU-Net using Dice similarity coefficient (DSC) and 95th-percentile Hausdorff distance (HD95). Two physicians contoured all external cases, and a third physician contoured a 20-patient subset for a blinded, tumor-level, multi-rater preference study. Pairwise contour preferences were analyzed using a Bradley-Terry probabilistic model to obtain bias-adjusted estimates of relative contour quality while accounting for rater-specific tendencies and case difficulty.

In the external cohort, TUM-SAM achieved a lesion-wise detection sensitivity of 0.94 and outperformed DeepMedic and nnU-Net across all tumor sizes, with a mean DSC of 0.84 and HD95 of 1.9 mm (nnU-Net/DeepMedic: DSC < 0.70, HD95 > 3.3 mm). Across voxel-wise evaluation, TUM-SAM's geometric performance fell within the range of inter-observer variability among physicians and was sensitive to reference construction. In contrast, in the blinded rater study, experts preferred TUM-SAM-generated contours over individual physician contours in 81-87% of raw comparisons; Bradley-Terry analysis yielded conservative, bias-corrected win probabilities of 55-56%, indicating consistent preference after adjustment for rater and case difficulty.

Using a bias-controlled, blinded multi-rater evaluation framework, TUM-SAM demonstrates brain metastasis segmentation quality that is consistently preferred by expert physicians, highlighting the limitations of agreement-based voxel-wise metrics under inter-observer variability. These findings underscore the dependence of conventional evaluation on reference definition and support preference-based assessment as a complementary approach for evaluating AI segmentation quality in BM MRI.
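
For readers unfamiliar with the Bradley-Terry model mentioned in the abstract, the following is a minimal illustrative sketch, not the authors' implementation. It fits a plain Bradley-Terry model to pairwise (winner, loser) outcomes using the standard minorization-maximization (Zermelo) iteration. It assumes no covariates, so it does not include the rater-specific or case-difficulty adjustments the abstract describes, and the function names and the toy comparison data are hypothetical.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, n_iter=200, tol=1e-10):
    """Fit plain Bradley-Terry strengths via the MM (Zermelo) iteration.

    comparisons: list of (winner, loser) label pairs.
    Returns a dict mapping each label to a strength; strengths sum to 1.
    Note: a contestant with zero wins is driven to strength 0, because
    the maximum-likelihood estimate does not exist in that case.
    """
    wins = defaultdict(float)         # total wins per contestant
    pair_counts = defaultdict(float)  # number of comparisons per unordered pair
    labels = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        labels.update((winner, loser))
    labels = sorted(labels)

    strength = {i: 1.0 / len(labels) for i in labels}
    for _ in range(n_iter):
        updated = {}
        for i in labels:
            # MM update: wins_i / sum_j n_ij / (p_i + p_j)
            denom = sum(
                pair_counts[frozenset((i, j))] / (strength[i] + strength[j])
                for j in labels
                if j != i
            )
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        total = sum(updated.values())
        updated = {i: v / total for i, v in updated.items()}
        delta = max(abs(updated[i] - strength[i]) for i in labels)
        strength = updated
        if delta < tol:
            break
    return strength


def win_probability(strength, a, b):
    """Model probability that contour source `a` is preferred over `b`."""
    return strength[a] / (strength[a] + strength[b])


# Hypothetical toy data (NOT from the study): each tuple is one blinded
# comparison, listed as (preferred contour source, other contour source).
toy = (
    [("AI", "MD1")] * 8 + [("MD1", "AI")] * 2
    + [("AI", "MD2")] * 8 + [("MD2", "AI")] * 2
    + [("MD1", "MD2")] * 5 + [("MD2", "MD1")] * 5
)
toy_strength = fit_bradley_terry(toy)
toy_p_ai_md1 = win_probability(toy_strength, "AI", "MD1")
```

With only two contour sources and no covariates, the fitted win probability reduces to the raw win fraction. A model-based estimate can differ from the raw percentage only when additional structure is modeled, such as the rater and case-difficulty adjustments described in the abstract.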

Topics

Brain Neoplasms; Image Processing, Computer-Assisted; Artificial Intelligence; Journal Article
