Medical visual question answering with multimodal: a systematic mini review (2023-2026).
Authors
Affiliations (6)
- Department of Electrical and Electronics Engineering, Islamic University of Technology, Gazipur, Bangladesh.
- ELITE Research Lab, New York, NY, United States.
- Department of Computer Science and Engineering, Daffodil International University, Dhaka, Bangladesh.
- Computer Science & Engineering, Nagoya Institute of Technology, Nagoya, Japan.
- Center for Advanced Analytics (CAA), Faculty of Engineering & Technology (FET), Multimedia University, Melaka, Malaysia.
- Department of Computer Science, American International University-Bangladesh, Dhaka, Bangladesh.
Abstract
Medical visual question answering (Med-VQA) has rapidly emerged as a critical application of artificial intelligence. Large language models (LLMs) and vision-language models (VLMs) have fundamentally reshaped the architecture of medical question answering (QA). This study systematically analyzes recent developments in Med-VQA. Whereas earlier methods were simple, text-heavy database systems, the field has shifted toward multimodal frameworks, and recent methods can interpret radiology, pathology, and dermatology images in response to clinical questions. The review followed PRISMA guidelines and covers 27 representative studies retrieved from multiple databases using predefined inclusion and exclusion criteria. The findings reveal a clear shift toward generative models supported by retrieval mechanisms and structured reasoning strategies such as Chain-of-Thought (CoT) and multi-agent frameworks. Generative models combined with retrieval-augmented generation (RAG) and preference optimization are not only more consistent than traditional classification-based methods but also enable free-form clinical question answering. Although multi-agent and hierarchical CoT frameworks have significantly improved interpretability and mitigated hallucinations, limitations remain: higher computational time, limited support for multi-view analysis and multilingual question answering, a lack of standardized and domain-specific evaluation, and limited exploration in real-world clinical settings. Med-VQA systems built on vision-language models demonstrate significant potential for generating answers that support clinical decisions. Future work should focus on computational efficiency during real-world validation, fairness evaluation, standardized diagnostic benchmarks, and interpretable reasoning frameworks that incorporate specialized domain knowledge and practical skills.