
Comparative evaluation of radiological anatomy knowledge and accuracy of ChatGPT-5, Gemini 2.5, and Grok 4 across normal and thinking modes.

June 26, 2026

Authors

Sivri I, Ozden FM, Celik H, Gokturk O, Colak T

Affiliations (2)

  • Department of Anatomy, Faculty of Medicine, Kocaeli University, İzmit, Türkiye.
  • Department of Anatomy, Institute of Health Sciences, Kocaeli University, İzmit, Türkiye.

Abstract

This study compared the performance of three large language models, ChatGPT-5 Plus, Gemini 2.5 Pro, and SuperGrok 4, in identifying anatomical structures on radiographic images using standardized anatomical terminology. Thirty radiographs from different body regions were selected from an open-access atlas and analyzed by each model in Normal and Thinking modes using standardized prompts based on Terminologia Anatomica (version 2.07). Two anatomists independently evaluated the responses using a 0-2 scoring system. Data were analyzed using Friedman and Wilcoxon signed-rank tests, and temporal response consistency was assessed with weighted kappa coefficients. Overall accuracy across models and modes ranged from 47.4% to 85.7%. Gemini 2.5 Pro and ChatGPT-5 Plus significantly outperformed SuperGrok 4 in both modes. In Normal mode, Gemini 2.5 Pro achieved the highest overall accuracy (82.7%), significantly exceeding ChatGPT-5 Plus (60.7%, p = 0.001) and SuperGrok 4 (47.4%, p < 0.001). In Thinking mode, accuracies were 85.7% for Gemini 2.5 Pro, 77.6% for ChatGPT-5 Plus, and 49.5% for SuperGrok 4. Gemini 2.5 Pro demonstrated a significant advantage over ChatGPT-5 Plus only in Normal mode (p = 0.001), whereas Thinking mode significantly improved performance only for ChatGPT-5 Plus (p = 0.01). Temporal stability analysis showed high response consistency for Gemini 2.5 Pro and SuperGrok 4 in both modes (r > 0.94, p < 0.001). Conversely, the stability of ChatGPT-5 Plus decreased from substantial agreement in Normal mode (r = 0.697, p < 0.001) to moderate agreement in Thinking mode (r = 0.539, p < 0.001). Despite their educational potential, these models need refinement to reliably identify anatomical structures on radiographic images.
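
The scoring and consistency measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' analysis code. It assumes that accuracy is the share of the maximum attainable points on the 0-2 scale and that the weighted kappa uses linear weights; the abstract specifies neither. The function names and the two score lists are hypothetical illustration data, not study results.

```python
from collections import Counter

CATEGORIES = (0, 1, 2)  # the 0-2 scoring scale described in the abstract


def accuracy_percent(scores, max_score=2):
    """Points earned as a percentage of the maximum (assumed definition)."""
    return 100.0 * sum(scores) / (max_score * len(scores))


def weighted_kappa(first, second, categories=CATEGORIES):
    """Cohen's kappa with linear disagreement weights for paired ratings."""
    if len(first) != len(second) or not first:
        raise ValueError("ratings must be non-empty and paired")
    n = len(first)
    span = len(categories) - 1
    observed = Counter(zip(first, second))  # joint counts of rating pairs
    row = Counter(first)                    # marginal counts, first session
    col = Counter(second)                   # marginal counts, second session
    disagree_obs = 0.0
    disagree_exp = 0.0
    for a_idx, a in enumerate(categories):
        for b_idx, b in enumerate(categories):
            weight = abs(a_idx - b_idx) / span  # 0 on the diagonal, 1 at the extremes
            disagree_obs += weight * observed[(a, b)] / n
            disagree_exp += weight * (row[a] / n) * (col[b] / n)
    if disagree_exp == 0:
        # Every rating falls in one category, so kappa is undefined.
        return float("nan")
    return 1.0 - disagree_obs / disagree_exp


# Hypothetical scores for one model on the same ten items at two time points.
session_1 = [2, 2, 1, 0, 2, 1, 2, 0, 1, 2]
session_2 = [2, 1, 1, 0, 2, 2, 2, 0, 1, 2]

print(f"accuracy, session 1: {accuracy_percent(session_1):.1f}%")
print(f"weighted kappa: {weighted_kappa(session_1, session_2):.3f}")
```

The Friedman and Wilcoxon signed-rank tests mentioned in the abstract are available in SciPy as `scipy.stats.friedmanchisquare` and `scipy.stats.wilcoxon`; they are left out here to keep the sketch free of third-party dependencies.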

Topics

Journal Article
