Evaluating Sociodemographic Biases in Artificial Intelligence-Based Glioblastoma Response Assessment Algorithms.
Authors
Affiliations (2)
- From the Duke University School of Medicine (R.S.L.); Department of Radiation Oncology (D.L.B.), Department of Radiology (K.M., E.C.), Duke University Medical Center; Department of Electrical and Computer Engineering (J.Z.), Duke University.
Abstract
Recent studies have demonstrated bias in various medical imaging artificial intelligence (AI) models, yet the factors underpinning these biases remain relatively unclear. This study evaluated potential sociodemographic biases in AI-based glioblastoma MRI segmentation models trained on datasets varying in size and demographic composition. We evaluated four nnUNet models with different training datasets: (1) the Federated Tumor Segmentation postoperative (FeTS2) model trained on a large (>10k exams), multi-national, multi-institution dataset, (2) the Brain Tumor Segmentation (BraTS) 2024 postoperative glioma model trained on a moderate-sized (>2k exams), multi-institution, North American dataset, (3) a model trained on a small (>200 exams), private, demographically homogeneous, single-institution dataset, and (4) a model trained on an equally small (>200 exams) but demographically heterogeneous dataset. Models were evaluated for bias using an independent, manually corrected dataset of 480 patients (mean age 52 ± 14 years) that was prospectively collected from a single high-volume academic brain tumor center. Automated FLAIR and enhancing tumor segmentations from the AI models were evaluated using Dice scores. Sociodemographic factors were collected and analyzed using beta regression to assess their influence on model performance. The model trained exclusively on White, non-Hispanic males had the lowest overall Dice scores (0.943 for FLAIR, 0.909 for Enhancement) and exhibited biases in age and smoking status. The BraTS model demonstrated the highest Dice scores (0.996 for FLAIR, 0.999 for Enhancement) and had the least bias overall. Demographic bias was relatively low in glioblastoma MRI segmentation models. The model trained on the smallest and most homogeneous dataset exhibited the most bias. Greater demographic heterogeneity, even without increasing training dataset size, was associated with reduced bias.
The BraTS model, trained on a moderate-sized cohort that included more diverse tumor types, performed better and demonstrated less bias than the FeTS2 model, despite FeTS2 being trained on the largest dataset.
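For readers unfamiliar with the evaluation metric, the Dice score between an automated and a reference segmentation can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function names `dice_score` and `squeeze_to_open_interval` are hypothetical, masks are assumed to be flattened binary voxel lists, and treating two empty masks as perfect agreement is a convention chosen here. The second helper shows one commonly used rescaling that moves scores of exactly 0 or 1 into the open interval (0, 1), which beta regression requires; the abstract does not say whether such a step was applied.

```python
from typing import List, Sequence


def dice_score(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for two flattened binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labeled as tumor in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Total tumor voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total


def squeeze_to_open_interval(scores: Sequence[float]) -> List[float]:
    """Rescale scores from [0, 1] into (0, 1) via (y * (n - 1) + 0.5) / n.

    Assumption: this step is illustrative only; it is not described
    in the abstract.
    """
    n = len(scores)
    return [(y * (n - 1) + 0.5) / n for y in scores]


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 0, 0]
    print(dice_score(predicted, reference))
    print(squeeze_to_open_interval([0.0, 1.0]))
```

The rescaled scores could then serve as the response variable in a beta regression with sociodemographic factors as covariates, using whichever statistics package is preferred.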