Zero-shot medical image classification via large multimodal models and knowledge graphs-driven processing.
Authors
Affiliations (3)
- Key Laboratory of Water Big Data Technology of Ministry of Water Resources, Hohai University, Nanjing 211100, China; College of Computer Science and Software Engineering, Hohai University, Nanjing 211100, China. Electronic address: [email protected].
- Key Laboratory of Water Big Data Technology of Ministry of Water Resources, Hohai University, Nanjing 211100, China; College of Computer Science and Software Engineering, Hohai University, Nanjing 211100, China. Electronic address: [email protected].
- Key Laboratory of Water Big Data Technology of Ministry of Water Resources, Hohai University, Nanjing 211100, China; College of Computer Science and Software Engineering, Hohai University, Nanjing 211100, China. Electronic address: [email protected].
Abstract
With the continued development of the medical industry, intelligent medical technologies supported by natural language processing and knowledge representation have made significant progress. However, as vast amounts of medical data are continuously generated, current methods still perform poorly on specialized medical data, particularly unlabeled medical diagnostic data. Inspired by the strong performance of large language models on a variety of downstream expert tasks in recent years, this article leverages large language models to handle massive amounts of unlabeled medical data, aiming to provide more accurate technical solutions for medical image classification. Specifically, we propose a novel Cross-Modal Knowledge Representation framework (CMKR) for processing large quantities of unlabeled medical data: it uses large language models to extract implicit knowledge from medical images and extracts explicit textual knowledge with the aid of knowledge graphs. To better exploit the associations between medical images and textual records, we design a cross-modal alignment strategy that enhances knowledge representation at both the intra-modal and inter-modal levels. Extensive experiments on public datasets demonstrate that our method outperforms most mainstream approaches.
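The abstract does not include implementation details, so the following is only a minimal sketch of what an inter-modal alignment objective of the kind described (pairing image embeddings from a large multimodal model with knowledge-graph-derived text embeddings) might look like. All names, the symmetric InfoNCE-style loss, and the temperature parameter are assumptions for illustration, not the authors' actual CMKR implementation.

```python
import torch
import torch.nn.functional as F

def cross_modal_alignment_loss(image_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Hypothetical inter-modal alignment loss.

    image_emb: (B, D) embeddings of medical images (e.g., from a large
               multimodal model), one per example in the batch.
    text_emb:  (B, D) embeddings of the paired textual/knowledge-graph
               descriptions, aligned row-for-row with image_emb.
    """
    # Project both modalities onto the unit sphere so the dot product
    # acts as cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix; diagonal entries are the matched pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    # Symmetric cross-entropy: image-to-text and text-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In a sketch like this, zero-shot classification would follow the usual contrastive pattern: embed each candidate class description (optionally enriched with knowledge-graph context), embed the query image, and pick the class whose text embedding is most similar. Whether CMKR uses this exact objective or an additional intra-modal term is not specified in the abstract.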