From Embeddings to Accuracy: Comparing Foundation Models for Radiographic Classification.
Authors
Affiliations (5)
- Department of Radiology, University of Wisconsin-Madison, Madison, WI, USA. [email protected].
- Microsoft Health and Life Sciences, Redmond, WA, USA.
- Microsoft Health and Life Sciences, Johns Hopkins Medicine, Redmond, WA, USA.
- Department of Radiology, Department of Medical Physics, Department of Biostatistics and Medical Informatics, University of Wisconsin School of Medicine & Public Health, Madison, WI, USA.
- Department of Radiology, University of Wisconsin-Madison, Madison, WI, USA.
Abstract
Foundation models, pre-trained on extensive datasets, have significantly advanced machine learning by providing robust and transferable embeddings applicable to various domains, including medical imaging diagnostics. This study evaluates the utility of embeddings derived from both general-purpose and medical domain-specific foundation models for training lightweight adapter models for multi-class radiography classification, focusing specifically on tube placement assessment and related findings, with comparison to end-to-end training of an established convolutional neural network. A dataset comprising 8842 radiographs classified into seven distinct categories was used to extract embeddings with seven foundation models: DenseNet121, BiomedCLIP, Med-Flamingo, MedImageInsight, MedSigLIP, Rad-DINO, and CXR-Foundation. Adapter models were then trained using classical machine learning algorithms: K-nearest neighbors (KNN), logistic regression (LR), support vector machines (SVM), random forest (RF), and multi-layer perceptron (MLP). Among these combinations, MedImageInsight embeddings paired with an SVM or MLP adapter yielded the highest mean area under the curve (mAUC) at 93.1%, followed by MedSigLIP with MLP (91.0%), Rad-DINO with SVM (90.7%), and CXR-Foundation with LR (88.6%); each of these exceeded the mAUC of a fully fine-tuned convolutional neural network, DenseNet121 (87.2%). BiomedCLIP and DenseNet121 embeddings exhibited moderate performance with SVM adapters, with mAUC scores of 82.8% and 81.1%, respectively, whereas Med-Flamingo delivered the lowest performance at 78.5% when combined with RF. Wilcoxon signed-rank tests indicated significant differences between MedImageInsight and each of the other embedding models at the 0.05 significance level (before Bonferroni correction). Notably, most adapter models were computationally efficient, training within minutes and running inference within seconds on CPU, underscoring their practicality for clinical applications. Furthermore, a fairness analysis of adapters trained on MedImageInsight-derived embeddings indicated minimal disparities, with gender differences in performance within 1.8% and standard deviations across age groups not exceeding 1.4%; further testing found no significant differences across gender or age at the 0.05 level. These findings confirm that foundation model embeddings, especially those from MedImageInsight, enable accurate, computationally efficient, and equitable diagnostic classification using lightweight adapters for radiographic image analysis.
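To make the adapter-on-embeddings workflow concrete, the following is a minimal sketch (not the authors' published code) of training an SVM adapter on precomputed foundation-model embeddings and scoring it with macro-averaged one-vs-rest AUC, the mAUC metric reported above. The embedding array, label array, feature dimension, and SVM configuration are all illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

# Synthetic stand-ins: the study used 8842 radiographs in 7 categories;
# the sample count, embedding dimension, and values here are placeholders.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 256))
labels = rng.integers(0, 7, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(
    embeddings, labels, test_size=0.2, stratify=labels, random_state=0
)

# Lightweight adapter: feature scaling + SVM with probability outputs
# (an assumed configuration, not the paper's exact hyperparameters).
adapter = make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))
adapter.fit(X_tr, y_tr)  # trains quickly on CPU, consistent with the abstract

# Macro-averaged one-vs-rest AUC, i.e., the reported mAUC.
probs = adapter.predict_proba(X_te)
print(f"mAUC: {roc_auc_score(y_te, probs, multi_class='ovr', average='macro'):.3f}")
```

Swapping `SVC` for `LogisticRegression`, `KNeighborsClassifier`, `RandomForestClassifier`, or `MLPClassifier` reproduces the other adapter variants in outline.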
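The pairwise significance comparison can likewise be sketched with SciPy's Wilcoxon signed-rank test plus a Bonferroni adjustment for the six comparisons against MedImageInsight. The paired AUC scores below are placeholders, and whether the study paired scores per class or per fold is an assumption here.

```python
from scipy.stats import wilcoxon

# Placeholder paired AUCs for MedImageInsight vs. one other embedding model
# (e.g., one value per class); the actual pairing scheme is assumed.
medimageinsight_aucs = [0.951, 0.934, 0.942, 0.918, 0.925, 0.939, 0.907]
other_model_aucs     = [0.941, 0.919, 0.920, 0.890, 0.890, 0.898, 0.860]

stat, p = wilcoxon(medimageinsight_aucs, other_model_aucs)
alpha, n_comparisons = 0.05, 6  # six models compared against MedImageInsight
print(f"W = {stat}, p = {p:.4f}")
print(f"significant before Bonferroni: {p < alpha}")
print(f"significant after Bonferroni:  {p < alpha / n_comparisons}")
```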
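Finally, the subgroup fairness check reduces to computing mAUC separately per demographic group and summarizing the spread. The predicted probabilities, gender labels, and age bins below are synthetic stand-ins, not the study's data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, n_classes = 2000, 7
y_true = rng.integers(0, n_classes, size=n)
probs = rng.dirichlet(np.ones(n_classes), size=n)     # stand-in adapter outputs
gender = rng.choice(["F", "M"], size=n)               # placeholder demographics
age_group = rng.choice(["<40", "40-65", ">65"], size=n)

def subgroup_mauc(y, p, groups):
    """Macro one-vs-rest AUC within each demographic subgroup."""
    return {g: roc_auc_score(y[groups == g], p[groups == g],
                             multi_class="ovr", average="macro",
                             labels=np.arange(n_classes))
            for g in np.unique(groups)}

by_gender = subgroup_mauc(y_true, probs, gender)
gap = max(by_gender.values()) - min(by_gender.values())
print(f"gender gap = {gap:.3f}")  # the abstract reports gaps within 1.8%

by_age = subgroup_mauc(y_true, probs, age_group)
print(f"std across age groups = {np.std(list(by_age.values())):.3f}")  # reported <= 1.4%
```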