
Foundation Model Robustness to Technical Acquisition Parameters in Chest X-Ray AI: A Multi-Architecture Comparative Study with External Validation

January 27, 2026 · medRxiv preprint

Authors

Farquhar, H.

Affiliations (1)

  • No affiliation

Abstract

Background

Foundation models have emerged as a promising paradigm for medical imaging AI [7], with claims of improved generalization and reduced bias. However, their robustness to technical acquisition parameters remains unexplored. We evaluated whether foundation models exhibit greater robustness to chest radiograph view type (anteroposterior [AP] versus posteroanterior [PA]) compared to traditional convolutional neural networks.

Methods

We compared four model architectures on the RSNA Pneumonia Detection Challenge dataset (n=26,684 images) and externally validated on the NIH ChestX-ray14 dataset (n=112,120 images): DenseNet-121 (supervised CNN), BiomedCLIP (vision-language model trained on 15 million biomedical image-text pairs), RAD-DINO (self-supervised model trained on 5+ million radiographs), and CheXzero (vision-language model trained on MIMIC-CXR chest radiographs). The primary outcome was the sensitivity gap between AP and PA views, with bootstrap confidence intervals and permutation testing.

Results

On RSNA, CheXzero showed the smallest gap (14.3%, 95% CI: 11.2-17.5%), followed by RAD-DINO (25.2%, 22.6-27.9%), DenseNet-121 (35.7%, 32.9-38.7%), and BiomedCLIP (36.1%, 33.5-39.0%). However, on external validation (NIH), model rankings reversed completely: RAD-DINO demonstrated the smallest gap (22.3%, 95% CI: 21.0-23.6%), while CheXzero's gap increased dramatically to 48.9% (95% CI: 47.7-50.1%). Domain-specific training provided robustness within the training domain but failed to generalize. On PA view pneumonia cases in NIH, 31% were missed by all four models, representing a systematic blind spot. View type explained 61-100% of performance variance across models on both datasets, compared to 0-38% for age and less than 4% for sex.

Conclusions

Foundation models do not eliminate technical acquisition parameter biases in chest X-ray AI. While domain-specific training (CheXzero) provided superior robustness on internal validation, this advantage collapsed on external data. Self-supervised learning (RAD-DINO) demonstrated the most generalizable robustness, with consistent view type gap stability across datasets with different labeling schemes (25.2% → 22.3%, despite substantial AUC differences). These findings challenge assumptions about foundation model generalization and highlight the need for acquisition parameter auditing in AI regulatory frameworks and multi-site external validation for robustness claims.
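The primary outcome — the AP-versus-PA sensitivity gap with a bootstrap confidence interval — can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `view_gap_ci`, the nonparametric percentile bootstrap, and the 95% interval are assumptions consistent with the Methods description.

```python
import numpy as np

def sensitivity(y_true, y_pred):
    # Sensitivity (recall) = TP / (TP + FN), computed over positive cases only.
    pos = y_true == 1
    return y_pred[pos].mean()

def view_gap_ci(y_true, y_pred, is_pa, n_boot=10000, seed=0):
    """Point estimate and percentile-bootstrap 95% CI for the
    PA-minus-AP sensitivity gap (hypothetical sketch).

    y_true, y_pred : int arrays of ground-truth and predicted labels (0/1)
    is_pa          : bool array, True for PA views, False for AP views
    """
    rng = np.random.default_rng(seed)
    idx = np.arange(len(y_true))
    gaps = []
    for _ in range(n_boot):
        # Resample cases with replacement and recompute the gap.
        b = rng.choice(idx, size=len(idx), replace=True)
        yt, yp, pa = y_true[b], y_pred[b], is_pa[b]
        gaps.append(sensitivity(yt[pa], yp[pa]) - sensitivity(yt[~pa], yp[~pa]))
    gaps = np.array(gaps)
    point = (sensitivity(y_true[is_pa], y_pred[is_pa])
             - sensitivity(y_true[~is_pa], y_pred[~is_pa]))
    lo, hi = np.percentile(gaps, [2.5, 97.5])
    return point, (lo, hi)
```

In practice one would stratify the resampling by view type (and guard against bootstrap replicates with no positive cases in one view); a permutation test on the view labels would then assess whether the observed gap exceeds chance.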

Topics

radiology and imaging
