Evaluating Reasoning Effect for LLMs: Prompt Sensitivity and Text-Image Based Performance in Musculoskeletal Radiology.
Authors
Affiliations (3)
- Department of Radiology, Ankara 29 Mayis State Hospital, Ankara, Türkiye.
- Department of Radiology, Ankara Mamak State Hospital, Ankara, Türkiye.
- Department of Radiology, Kırıkkale Yüksek İhtisas Hospital, Kırıkkale, Türkiye.
Abstract
Multimodal large language models (LLMs) are increasingly applied in radiology, but the effect of reasoning capabilities across text- and image-based tasks remains unclear. We evaluated four multimodal LLMs, two non-reasoning (ChatGPT-4, Gemini 1.5 Pro) and two reasoning-capable (ChatGPT-5.1, Gemini 3), using 50 text-based and 50 arrow-localized musculoskeletal (MSK) radiographic anatomy questions, and compared their performance with that of two board-certified radiologists. Accuracy with 95% confidence intervals was calculated, and image-based errors were categorized. In text-based tasks, reasoning-capable models outperformed non-reasoning models, achieving near-ceiling accuracy (96% and 94%; all p≤0.008) with minimal prompt sensitivity. In image-based tasks, reasoning models performed better than non-reasoning models (70-72% vs 46-48%; p<0.001) but remained inferior to radiologists (88-90%). Image-based errors were mainly adjacent-structure substitution and projection-related overlap. While reasoning enhances text-based performance and robustness, multimodal LLMs remain limited in fine-grained visual grounding and are best suited for supportive roles.
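As an illustration of how accuracy with a 95% confidence interval can be computed on a 50-question set, the following is a minimal sketch. The abstract does not state which interval method the authors used; the Wilson score interval here is an assumption, and the function name `wilson_ci` is hypothetical. The count of 48 correct answers is simply the number that corresponds to 96% of 50 questions.

```python
import math


def wilson_ci(correct: int, total: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a proportion (assumed method, not confirmed by the abstract).

    z = 1.959964 is the standard normal quantile for a two-sided 95% interval.
    """
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    # 48 of 50 correct corresponds to the 96% text-based accuracy reported above.
    low, high = wilson_ci(48, 50)
    print(f"accuracy = {48 / 50:.0%}, 95% CI = ({low:.3f}, {high:.3f})")
```

With only 50 questions per task, intervals of this kind are wide, which is worth keeping in mind when comparing accuracies that differ by a few percentage points.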