
Blinded, bias-controlled multi-rater evaluation of human-versus-AI brain metastasis segmentation using a hybrid foundation-model framework.

June 22, 2026

Authors

Han Y, Zhu E, Mekdash MHA, Awad O, Pathak P, Liang S, Hamstra DA, Zhang X, Siddiqui ZA, Sun B

Affiliations (2)

  • Radiation Oncology Department, Baylor College of Medicine, Houston, Texas, USA.
  • Nanjing Medical University, Nanjing, Jiangsu, China.

Abstract

Accurate segmentation of brain metastases (BM) is essential for diagnosis, stereotactic radiosurgery planning, and longitudinal assessment. However, manual contouring is time-intensive, which limits clinical scalability, and it exhibits substantial inter-observer variability. This variability complicates objective assessment of automated segmentation methods and makes model performance harder to interpret. To address these limitations, we developed TUM-SAM, a hybrid foundation-model framework for fully automated BM segmentation, and introduced a bias-controlled, blinded multi-rater evaluation paradigm to determine whether AI-based BM segmentation has reached expert-level performance and whether AI-generated contours are preferred by human experts under unbiased assessment. TUM-SAM integrates nnU-Net-based lesion detection with a tumor-adapted Med-SAM segmentation model to enable prompt-free, fully automated segmentation. Training used 301 patients (2548 lesions), and external evaluation used an independent cohort of 105 patients (397 lesions). Segmentation accuracy was benchmarked against DeepMedic and nnU-Net using the Dice similarity coefficient (DSC) and the 95th-percentile Hausdorff distance (HD95). Two physicians contoured all external cases, and a third physician contoured a 20-patient subset for a blinded, tumor-level, multi-rater preference study. Pairwise contour preferences were analyzed using a Bradley-Terry probabilistic model to obtain bias-adjusted estimates of relative contour quality while accounting for rater-specific tendencies and case difficulty. In the external cohort, TUM-SAM achieved a lesion-wise detection sensitivity of 0.94 and outperformed DeepMedic and nnU-Net across all tumor sizes, with a mean DSC of 0.84 and HD95 of 1.9 mm (nnU-Net/DeepMedic: DSC < 0.70, HD95 > 3.3 mm). In voxel-wise evaluation, TUM-SAM's geometric performance fell within the range of inter-observer variability among physicians and was sensitive to how the reference contour was constructed. In contrast, in the blinded rater study, experts preferred TUM-SAM-generated contours over individual physician contours in 81-87% of raw comparisons; Bradley-Terry analysis yielded conservative, bias-corrected win probabilities of 55-56%, indicating consistent preference after adjustment for rater and case difficulty. Under a bias-controlled, blinded multi-rater evaluation framework, TUM-SAM demonstrates brain metastasis segmentation quality that is consistently preferred by expert physicians, highlighting the limitations of agreement-based voxel-wise metrics under inter-observer variability. These findings underscore the dependence of conventional evaluation on reference definition and support preference-based assessment as a complementary approach for evaluating AI segmentation quality in BM MRI.
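
To make the preference analysis concrete, the following is a minimal sketch of a plain Bradley-Terry fit in Python. It is not the authors' implementation: the abstract describes a model that also adjusts for rater-specific tendencies and case difficulty, which this sketch omits. The function names (`fit_bradley_terry`, `win_probability`) and the win counts are hypothetical and chosen only for illustration.

```python
def fit_bradley_terry(wins, n_iter=500, tol=1e-12):
    """Estimate Bradley-Terry strengths with the classic iterative (MM) update.

    wins[i][j] is the number of comparisons in which contour source i was
    preferred over source j. Assumes every source wins at least once;
    otherwise its strength collapses to zero.
    """
    n = len(wins)
    p = [1.0 / n] * n  # start from equal strengths
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom)
        scale = sum(new_p)
        new_p = [x / scale for x in new_p]  # strengths are only defined up to scale
        converged = max(abs(a - b) for a, b in zip(new_p, p)) < tol
        p = new_p
        if converged:
            break
    return p


def win_probability(p, i, j):
    """Model probability that source i is preferred over source j."""
    return p[i] / (p[i] + p[j])


# Hypothetical pairwise preference counts (NOT data from the study).
# Index 0: AI contour, 1: physician A, 2: physician B.
example_wins = [
    [0, 14, 13],
    [6, 0, 10],
    [7, 10, 0],
]
strengths = fit_bradley_terry(example_wins)
print("P(AI preferred over physician A):", win_probability(strengths, 0, 1))
```

The fitted strengths are only meaningful relative to one another, so the sketch normalizes them to sum to one and reports pairwise win probabilities. Adjusting for raters and case difficulty, as the abstract describes, would require extra model terms beyond this plain version.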

Topics

  • Brain Neoplasms
  • Image Processing, Computer-Assisted
  • Artificial Intelligence
  • Journal Article
