Multimodal foundation models exploit text to make medical image predictions.
Authors
Affiliations (4)
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
- Department of Medicine, Brigham and Women's Hospital, Boston, MA, USA.
- Department of Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA.
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
Abstract
Multimodal foundation models have shown compelling but conflicting performance in medical image interpretation. However, the ways in which these models integrate and prioritize different data modalities, including images and text, remain poorly understood. Here we evaluate 8 proprietary and open-source multimodal foundation models using 1090 multimodal medical cases. We show that image predictions are largely driven by text, with accuracy increasing monotonically with the amount of informative text. Exploitation of text is a double-edged sword; even mild suggestions of an incorrect diagnosis in text diminish image-based classification, dramatically reducing performance in cases the model could previously answer using images alone. For example, o3 accuracy fell from 84% to 28% when a misleading clinical vignette was introduced. In physician evaluations of long-form cases, adding images reduces or does not improve performance when text is highly informative (e.g., GPT-4V showed decreased accuracy when images were added to highly informative text across 69 clinicopathological conferences). Our results suggest that multimodal AI models may be useful in medical diagnostic reasoning but that their accuracy is largely driven, for better and worse, by text.