Vision-Language Foundation Models Do Not Transfer to Medical Imaging Classification: A Negative Result on Chest X-ray Diagnosis
Authors
Affiliations
- Independent Researcher
Abstract
Vision-language models (VLMs) pretrained on web-scale data have achieved remarkable performance across diverse tasks, leading to widespread adoption in industry. A natural question is whether these powerful representations transfer to specialized medical imaging domains, and whether domain-specific medical pretraining improves transfer. We tested these hypotheses using two VLMs on the NIH ChestX-ray14 benchmark: Qwen2.5-VL (pretrained on web data) and BiomedCLIP (pretrained on 15 million biomedical image-text pairs from PubMed Central). Both models dramatically underperformed convolutional neural networks (CNNs) with ImageNet pretraining: the best VLM achieved F1=0.203 versus a CNN baseline of F1=0.811. Domain-specific pretraining provided only marginal improvement: BiomedCLIP's frozen encoder achieved F1=0.161 versus Qwen2.5-VL's F1=0.124 (a roughly 30% relative gain), but this remains clinically inadequate. Fine-tuning both models led to catastrophic overfitting, with sensitivity collapsing from >65% to <15% as the models learned to predict "no disease" for all inputs. These results demonstrate that neither general-purpose nor medical-specific vision-language pretraining produces features suitable for dense multi-label medical imaging classification. For chest X-ray diagnosis, traditional CNNs with ImageNet pretraining remain substantially more effective than VLM-based approaches.
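To make the frozen-encoder evaluation protocol concrete, the sketch below shows one plausible way to linear-probe a frozen BiomedCLIP image encoder for 14-way multi-label chest X-ray classification and score it with macro F1. This is not the paper's exact pipeline: the open_clip hub ID, hyperparameters, and 0.5 decision threshold are assumptions, and the ChestX-ray14 data loading is stubbed with random tensors.

```python
# Minimal linear-probe sketch (assumed setup, not the paper's exact pipeline):
# freeze the BiomedCLIP image encoder, train only a linear head with a
# multi-label objective, and report macro F1 at a 0.5 threshold.
import torch
import torch.nn as nn
import open_clip
from sklearn.metrics import f1_score

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load BiomedCLIP through open_clip's Hugging Face hub integration (assumed hub ID).
model, preprocess = open_clip.create_model_from_pretrained(
    "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
)
model = model.to(device).eval()
for p in model.parameters():           # freeze the encoder; only the probe trains
    p.requires_grad_(False)

NUM_CLASSES = 14                       # ChestX-ray14 disease labels

@torch.no_grad()
def extract_features(images: torch.Tensor) -> torch.Tensor:
    """Embed a batch of preprocessed images with the frozen encoder."""
    return model.encode_image(images.to(device)).float()

# Infer the embedding dimension from a dummy forward pass.
feat_dim = extract_features(torch.zeros(1, 3, 224, 224)).shape[-1]
probe = nn.Linear(feat_dim, NUM_CLASSES).to(device)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()     # standard multi-label objective

# --- Stand-in data: replace with preprocessed ChestX-ray14 train/test splits. ---
images = torch.randn(32, 3, 224, 224)
labels = torch.randint(0, 2, (32, NUM_CLASSES)).float()

features = extract_features(images)
for _ in range(100):                   # train the linear probe only
    optimizer.zero_grad()
    loss = criterion(probe(features), labels.to(device))
    loss.backward()
    optimizer.step()

# Macro F1 at a 0.5 threshold; a real evaluation would use a held-out test split.
with torch.no_grad():
    preds = (torch.sigmoid(probe(features)) > 0.5).cpu().numpy()
print("macro F1:", f1_score(labels.numpy(), preds, average="macro", zero_division=0))
```

The same harness would apply to any frozen image encoder (e.g. swapping in a web-pretrained VLM backbone or an ImageNet CNN), which is what makes the frozen-feature comparison across models meaningful.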