Performance of multimodal large language models on image-based surgical anatomy, anatomical pathology, and radiology questions.
Authors
Affiliations (3)
- Medical Education, School of Medicine and Dentistry, Griffith University, Southport, Queensland, Australia.
- Gold Coast University Hospital, Southport, Queensland, Australia.
- Logan Hospital, Meadowbrook, Queensland, Australia.
Abstract
Multimodal large language models (LLMs) are now deeply integrated into medical education and widely used by medical students, yet it remains unclear whether current models possess the accuracy and reliability needed to support image-based learning. We evaluated four state-of-the-art multimodal LLMs (ChatGPT-5.1, Gemini-2.5, Grok-4, Claude Sonnet-4.5) on 208 image-based examination questions from a Doctor of Medicine program, spanning anatomical pathology (histopathology; 47.6%), radiology (31.7%), and surgical anatomy (20.7%). To isolate visual reasoning, all items were presented in image-only form with contextual information removed. Items covered seven organ systems, included both constructed-response and selected-response formats, and were categorized as recognition-only or recognition-plus-reasoning. ChatGPT-5.1 achieved the highest accuracy (75.5%; 95% CI [69.2-80.8]), followed by Gemini-2.5 (59.6%; 95% CI [52.8-66.1]), Claude Sonnet-4.5 (41.8%; 95% CI [35.3-48.6]), and Grok-4 (34.6%; 95% CI [28.5-41.3]). Overall accuracy differed significantly across models (p < 0.001; Cramér's V = 0.45). Pairwise McNemar comparisons showed that ChatGPT-5.1 differed significantly from all other models (all p < 0.001; accuracy differences of 15.9-40.9 percentage points). Subgroup analyses demonstrated a consistent hierarchy (ChatGPT > Gemini > Claude ≈ Grok) across discipline, organ-system, format, and cognitive-demand subgroups. Accuracy was uniformly higher for recognition-only items than for recognition-plus-reasoning items, and for selected-response than for constructed-response formats. Even the best-performing model, ChatGPT-5.1, answered approximately one in four questions incorrectly. These findings suggest that current multimodal LLMs cannot yet replace expert teaching in image-based learning. Their use in medical education should therefore remain supervised and critically appraised, serving as adjuncts rather than authoritative sources.
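For readers who want to sanity-check the reported statistics, the Python sketch below (not from the paper) reproduces the ChatGPT-5.1 interval under the assumption that the reported 95% CIs are Wilson score intervals: 75.5% of 208 items corresponds to 157 correct, which yields [69.2%, 80.8%], matching the abstract. It also shows the shape of an exact McNemar test on the two discordant cell counts; the counts passed to it here are hypothetical, not values from the study.

```python
# Minimal sketch of the abstract's statistics. Assumptions: the 95% CIs are
# Wilson score intervals, and the pairwise tests are exact McNemar tests.
# Only the ChatGPT-5.1 figures (157/208 correct -> 75.5%) come from the abstract;
# the McNemar discordant counts below are hypothetical.
import math
from scipy.stats import binom

def wilson_ci(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))) / denom
    return center - half, center + half

def mcnemar_exact(b: int, c: int) -> float:
    """Exact McNemar p-value from the discordant cells of a paired 2x2 table:
    b = items model A answered correctly and model B did not; c = the reverse."""
    n = b + c
    return min(1.0, 2 * binom.cdf(min(b, c), n, 0.5))

# ChatGPT-5.1: 157/208 correct -> 75.5%; reproduces the reported [69.2, 80.8].
lo, hi = wilson_ci(157, 208)
print(f"ChatGPT-5.1: 75.5%, 95% CI [{lo:.1%}, {hi:.1%}]")

# Hypothetical discordant counts for one pairwise comparison (illustrative only).
print(f"McNemar p = {mcnemar_exact(b=45, c=12):.4f}")
```

The Wilson interval is used here rather than the simpler Wald interval because it matches the reported ChatGPT-5.1 bounds exactly; the other three models' intervals can be checked the same way from their correct-answer counts.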