Multimodal foundation models exploit text to make medical image predictions.

June 12, 2026

DOI: 10.1038/s41467-026-74207-5 PMID: 42285926

Authors

Buckley TA, Diao JA, Srivastava CN, Brodeur PG, Rajpurkar P, Rodman A, Manrai AK

Affiliations (4)

Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
Department of Medicine, Brigham and Women's Hospital, Boston, MA, USA.
Department of Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA.
Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.

Abstract

Multimodal foundation models have shown compelling but conflicting performance in medical image interpretation. However, the ways in which these models integrate and prioritize different data modalities, including images and text, remain poorly understood. Here we evaluate 8 proprietary and open-source multimodal foundation models using 1090 multimodal medical cases. We show that image predictions are largely driven by text, with accuracy increasing monotonically with the amount of informative text. Exploitation of text is a double-edged sword; even mild suggestions of an incorrect diagnosis in text diminish image-based classification, dramatically reducing performance in cases the model could previously answer using images alone: o3 accuracy fell from 84% to 28% when a misleading clinical vignette was introduced. In physician evaluations of long-form cases, adding images reduces or does not improve performance when text is highly informative (e.g., GPT-4V showed decreased accuracy when images were added to highly informative text across 69 clinicopathological conferences). Our results suggest that multimodal AI models may be useful in medical diagnostic reasoning but that their accuracy is largely driven, for better and worse, by text.

Topics

Journal Article
