Evaluating Reasoning Effect for LLMs: Prompt Sensitivity and Text-Image Based Performance in Musculoskeletal Radiology.

May 21, 2026 (PubMed)

Authors

Çamur E, Cesur T, Güneş YC

Affiliations (3)

  • Department of Radiology, Ankara 29 Mayis State Hospital, Ankara, Türkiye.
  • Department of Radiology, Ankara Mamak State Hospital, Ankara, Türkiye.
  • Department of Radiology, Kırıkkale Yüksek İhtisas Hospital, Kırıkkale, Türkiye.

Abstract

Multimodal large language models (LLMs) are increasingly applied in radiology, but the effect of reasoning capabilities across text- and image-based tasks remains unclear. We evaluated four multimodal LLMs, two non-reasoning (ChatGPT-4, Gemini 1.5 Pro) and two reasoning-capable (ChatGPT-5.1, Gemini 3), using 50 text-based and 50 arrow-localized musculoskeletal (MSK) radiographic anatomy questions, and compared their performance with that of two board-certified radiologists. Accuracy with 95% confidence intervals was calculated, and image-based errors were categorized. Reasoning-capable models outperformed non-reasoning models in text-based tasks, achieving near-ceiling accuracy (96% and 94%; all p≤0.008) with minimal prompt sensitivity. In image-based tasks, reasoning models performed better than non-reasoning models (70-72% vs 46-48%; p<0.001) but remained inferior to radiologists (88-90%). Errors were mainly adjacent-structure substitution and projection-related overlap. While reasoning enhances text-based performance and robustness, multimodal LLMs remain limited in fine-grained visual grounding and are best suited for supportive roles.
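As an illustration of the "accuracy with 95% confidence intervals" step, the sketch below computes a Wilson score interval for one reported result. The abstract does not state which interval method the authors used, so the Wilson method is an assumption here, as is the function name `wilson_interval`. The count of 48 correct answers is derived from the reported 96% accuracy on 50 text-based questions.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Assumption: the abstract does not name its interval method; Wilson is
    used here only as one common choice for small samples.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half), min(1.0, centre + half)


# 96% of 50 text-based questions corresponds to 48 correct answers.
low, high = wilson_interval(48, 50)
print(f"accuracy = {48 / 50:.0%}, 95% CI = [{low:.3f}, {high:.3f}]")
```

With only 50 questions per task, intervals of this kind are fairly wide, which is worth keeping in mind when comparing accuracies that differ by a few percentage points.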

Topics

  • Natural Language Processing
  • Musculoskeletal Diseases
  • Radiology
  • Journal Article
