Comparative evaluation of radiological anatomy knowledge and accuracy of ChatGPT-5, Gemini 2.5, and Grok 4 across normal and thinking modes.

June 26, 2026

DOI: 10.1002/ase.70292 PMID: 42363443

Authors

Sivri I, Ozden FM, Celik H, Gokturk O, Colak T

Affiliations (2)

Department of Anatomy, Faculty of Medicine, Kocaeli University, İzmit, Türkiye.
Department of Anatomy, Institute of Health Sciences, Kocaeli University, İzmit, Türkiye.

Abstract

This study compared the performance of three large language models, ChatGPT-5 Plus, Gemini 2.5 Pro, and SuperGrok 4, in identifying anatomical structures on radiographic images using standardized anatomical terminology. Thirty radiographs from different body regions were selected from an open-access atlas and analyzed by the models in Normal and Thinking modes using standardized prompts based on Terminologia Anatomica (version 2.07). Responses were evaluated independently by two anatomists using a 0-2 scoring system. Data were analyzed using Friedman and Wilcoxon signed-rank tests, and temporal response consistency was assessed with weighted kappa coefficients. Overall accuracy across both modes and all three models ranged from 47.4% to 85.7%. Gemini 2.5 Pro and ChatGPT-5 Plus significantly outperformed SuperGrok 4 in both modes. In Normal mode, Gemini 2.5 Pro achieved the highest overall accuracy (82.7%), significantly exceeding ChatGPT-5 Plus (60.7%, p = 0.001) and SuperGrok 4 (47.4%, p < 0.001). In Thinking mode, accuracies were 85.7% for Gemini 2.5 Pro, 77.6% for ChatGPT-5 Plus, and 49.5% for SuperGrok 4. Gemini 2.5 Pro demonstrated a significant advantage over ChatGPT-5 Plus only in Normal mode (p = 0.001), whereas Thinking mode significantly improved performance only for ChatGPT-5 Plus (p = 0.01). Temporal stability analysis showed high response consistency for Gemini 2.5 Pro and SuperGrok 4 in both modes (r > 0.94, p < 0.001). Conversely, the stability of ChatGPT-5 Plus decreased from substantial agreement in Normal mode (r = 0.697, p < 0.001) to moderate agreement in Thinking mode (r = 0.539, p < 0.001). Despite their educational potential, these models need refinement to reliably identify anatomical structures on radiographic images.
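
The abstract reports percentage accuracy derived from 0-2 scores and uses weighted kappa to measure how consistent repeated responses were. The Python sketch below illustrates one way such figures could be computed; it is not the authors' analysis code. It assumes that accuracy is the total score divided by the maximum attainable score and that the kappa weights are linear, neither of which the abstract specifies. The function names and the example scores are hypothetical and are not data from the study.

```python
from collections import Counter


def percent_accuracy(scores, max_score=2):
    """Total score as a percentage of the maximum attainable score.

    Assumption: the abstract does not state how accuracy was derived
    from the 0-2 scores; this is one plausible definition.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    return 100.0 * sum(scores) / (max_score * len(scores))


def linear_weighted_kappa(first, second, categories=(0, 1, 2)):
    """Cohen's kappa with linear disagreement weights |i - j| / (k - 1).

    Assumption: the weighting scheme (linear vs. quadratic) is not given
    in the abstract; linear weights are used here for illustration.
    """
    if len(first) != len(second) or not first:
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(first)
    k = len(categories)
    index = {category: i for i, category in enumerate(categories)}

    # Joint and marginal counts of the paired ratings.
    joint = Counter((index[a], index[b]) for a, b in zip(first, second))
    row = Counter(index[a] for a in first)
    col = Counter(index[b] for b in second)

    observed = 0.0  # weighted disagreement actually observed
    expected = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)
            observed += weight * joint[(i, j)] / n
            expected += weight * (row[i] / n) * (col[j] / n)

    if expected == 0.0:
        # Both rounds used a single identical category: kappa is undefined.
        return float("nan")
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Hypothetical scores for six images in two response rounds.
    round_1 = [2, 2, 1, 0, 2, 1]
    round_2 = [2, 1, 1, 0, 2, 2]
    print(f"accuracy, round 1: {percent_accuracy(round_1):.1f}%")
    print(f"weighted kappa:    {linear_weighted_kappa(round_1, round_2):.3f}")
```

For a real analysis, established implementations of the Friedman test, the Wilcoxon signed-rank test, and weighted kappa in a statistics package would be preferable to hand-written code.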

Topics

Journal Article
