PM<sup>2</sup>: A new prompting multi-modal model paradigm for few-shot medical image classification.
Affiliations (5)
- Key Laboratory of Social Computing and Cognitive Intelligence (Ministry of Education), Dalian University of Technology, Dalian, 116024, China; School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, China; School of Computer Science and Engineering, Dalian Minzu University, Dalian, 116600, China.
- School of Information and Communication Engineering, Dalian University of Technology, Dalian, 116024, China.
- School of Computer Science and Engineering, Dalian Minzu University, Dalian, 116600, China.
- Key Laboratory of Social Computing and Cognitive Intelligence (Ministry of Education), Dalian University of Technology, Dalian, 116024, China; School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, China.
- Key Laboratory of Social Computing and Cognitive Intelligence (Ministry of Education), Dalian University of Technology, Dalian, 116024, China; School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, China. Electronic address: [email protected].
Abstract
Few-shot learning has emerged as a key technological solution to challenges such as limited data and the difficulty of acquiring annotations in medical image classification. However, relying solely on the image modality is insufficient to capture conceptual category information; medical image classification therefore requires a more comprehensive approach that aids in interpreting image content. This study proposes a novel medical image classification paradigm based on a multi-modal foundation model, called PM<sup>2</sup>. In addition to the image modality, PM<sup>2</sup> introduces supplementary text input (a prompt) that further describes the images or conceptual categories and facilitates cross-modal few-shot learning. We empirically study five different prompting schemes under this new paradigm. Furthermore, linear probing in multi-modal models typically takes only the class token as input, ignoring the rich statistical information contained in the high-level visual tokens. We therefore perform linear classification on the feature distribution of the visual tokens in addition to the class token. To extract this statistical information effectively, we use global covariance pooling with efficient matrix power normalization to aggregate the visual tokens. We then combine two classification heads: one handles the image class token together with the prompt representations encoded by the text encoder, and the other classifies the feature distribution of the visual tokens. Experimental results on three datasets (breast cancer, brain tumor, and diabetic retinopathy) demonstrate that PM<sup>2</sup> effectively improves medical image classification, achieving state-of-the-art performance compared to existing multi-modal models. Integrating text prompts as supplementary samples effectively enhances performance, and leveraging second-order features of the visual tokens to enrich the category feature space, combined with the class token, significantly strengthens the model's representational capacity.
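The dual-head design described above can be made concrete with a minimal PyTorch sketch. This is not the authors' implementation: one head matches the image class token against prompt embeddings from the text encoder (CLIP-style similarity), while the other applies global covariance pooling with matrix power normalization to the visual tokens and feeds the result to a linear classifier. The module name `DualHeadClassifier`, the eigendecomposition-based power normalization (efficient variants typically use a Newton-Schulz iteration), and the weighted logit fusion `lam` are all illustrative assumptions.

```python
# Minimal sketch of a dual-head classifier over a multi-modal encoder's outputs.
# Assumed shapes: class token (B, D), visual tokens (B, N, D), text prompt
# embeddings (C, D) for C classes. Not the paper's official code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def covariance_pool(tokens: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Global covariance pooling over the (B, N, D) visual tokens -> (B, D, D)."""
    centered = tokens - tokens.mean(dim=1, keepdim=True)
    cov = centered.transpose(1, 2) @ centered / (tokens.shape[1] - 1)
    # Small ridge term keeps the matrix power well-defined.
    return cov + eps * torch.eye(tokens.shape[2], device=tokens.device)


def matrix_power_normalize(cov: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Matrix power normalization cov^alpha via eigendecomposition
    (a Newton-Schulz iteration is the common fast approximation)."""
    eigvals, eigvecs = torch.linalg.eigh(cov)
    eigvals = eigvals.clamp_min(0.0).pow(alpha)
    return eigvecs @ torch.diag_embed(eigvals) @ eigvecs.transpose(-1, -2)


class DualHeadClassifier(nn.Module):
    """Head 1: class token vs. encoded text prompts (CLIP-style similarity).
    Head 2: linear probe on the power-normalized token covariance."""

    def __init__(self, dim: int, num_classes: int, lam: float = 0.5):
        super().__init__()
        tri = torch.triu_indices(dim, dim)  # upper triangle of the symmetric matrix
        self.register_buffer("tri", tri)
        self.cov_head = nn.Linear(tri.shape[1], num_classes)
        self.logit_scale = nn.Parameter(torch.tensor(100.0).log())
        self.lam = lam  # assumed fusion weight between the two heads

    def forward(self, cls_token, visual_tokens, text_features):
        # Head 1: cosine similarity between image class token and prompt features.
        img = F.normalize(cls_token, dim=-1)
        txt = F.normalize(text_features, dim=-1)
        logits_text = self.logit_scale.exp() * img @ txt.t()

        # Head 2: second-order statistics of the visual tokens.
        cov = matrix_power_normalize(covariance_pool(visual_tokens))
        feats = cov[:, self.tri[0], self.tri[1]]  # flatten upper triangle -> (B, K)
        logits_cov = self.cov_head(feats)

        # Simple weighted fusion of the two heads (assumed combination rule).
        return self.lam * logits_text + (1.0 - self.lam) * logits_cov
```

The fixed-weight fusion here is only one simple choice; the abstract states that the two heads are combined but not how, so the exact fusion rule should be taken from the original paper.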