GEOGRAPHIC DOMAIN SHIFT PRECIPITATES DIVERGENT FAILURE MODES IN DEEP LEARNING BASED TUBERCULOSIS SCREENING: A MULTI-NATIONAL EXTERNAL VALIDATION STUDY
Authors
Affiliations (1)
- Bahcesehir Cyprus University
Abstract
Background
Deep learning algorithms for tuberculosis (TB) screening frequently achieve radiologist-level performance during internal evaluation, yet their reliability often degrades when they are deployed to populations that differ from the training domain. Such degradation is clinically consequential for screening tools, where the World Health Organization (WHO) emphasizes high sensitivity to minimize missed infectious cases.
Methods
A DenseNet-121 convolutional neural network was trained using transfer learning on the Shenzhen chest X-ray dataset (China; total n=662). To prevent anatomically implausible augmentation, horizontal flipping was excluded during training. The model was trained in two stages (head training followed by fine-tuning) and evaluated on: (i) an internal test set from China, (ii) an external balanced cohort from Montgomery County (USA; n=138), and (iii) an external TB-positive cohort from India (n=155). The India dataset served as a sensitivity stress test; specificity and ROC-AUC were not computed for this cohort because it contains no negative controls. Model attention was explored using Grad-CAM.
Results
Internal validation yielded an area under the ROC curve (AUC) of 0.889 and an accuracy of 85.6%. External testing revealed divergent failure modes. On the USA cohort, sensitivity was high (94.8%) but specificity fell markedly (43.7%), indicating false-positive inflation. Conversely, on the India TB-only cohort, sensitivity collapsed to 52.3%, implying that 47.7% of confirmed TB cases were missed under domain shift. All metrics are reported as point estimates.
Conclusion
Geographic domain shift produced non-uniform degradation: false-positive surges in a low-burden setting and sensitivity collapse in a high-burden setting. These findings highlight the safety risks of deploying single-source TB screening AI without local validation and calibration.
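The relationship between the reported sensitivity, the miss rate, and the underlying case counts can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. The function names are hypothetical, and the count of 81 detected cases is an assumption chosen only because 81/155 rounds to the reported 52.3%; the abstract does not state the actual confusion-matrix counts. Because the abstract reports point estimates only, the sketch also shows how a Wilson score interval could express the uncertainty around such an estimate; that interval is an illustration, not a result from the study.

```python
import math


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of TB-positive images that the model flags as positive."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of TB-negative images that the model flags as negative."""
    return tn / (tn + fp)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# India cohort: TB-positive only, so only sensitivity is defined.
india_cases = 155      # from the abstract
india_detected = 81    # ASSUMPTION: back-calculated, not stated in the source

sens = sensitivity(tp=india_detected, fn=india_cases - india_detected)
miss_rate = 1 - sens   # share of confirmed TB cases the model misses
low, high = wilson_interval(india_detected, india_cases)

print(f"sensitivity = {sens:.1%}, miss rate = {miss_rate:.1%}")
print(f"approx. 95% Wilson interval for sensitivity: {low:.1%} to {high:.1%}")
```

The `specificity` helper is included for cohorts that contain negative controls, such as the USA cohort; it cannot be applied to a TB-only cohort because there are no true negatives or false positives to count.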