Diagnostic Performance of Large Language Models in Musculoskeletal Ultrasound: A Comparative Evaluation of ChatGPT-5.1 and Gemini for Plantar Fasciitis.
Authors
Affiliations (2)
- Department of Physical Medicine and Rehabilitation, Elazığ Fethi Sekin City Hospital, Elazığ, Turkey. [email protected].
- Erzurum Technical University, Faculty of Health Sciences, Department of Physiotherapy and Rehabilitation, Erzurum, Turkey.
Abstract
Recent advances in large language models (LLMs) have prompted growing interest in their potential to assist with musculoskeletal ultrasound interpretation. However, evidence regarding their diagnostic performance in plantar fasciitis remains scarce. This study evaluated the accuracy of ChatGPT-5.1 and Gemini in classifying plantar fasciitis on longitudinal ultrasound images, using a board-certified physiatrist's assessment as the reference standard. In this prospective diagnostic accuracy study, 80 anonymized plantar fascia ultrasound images were analyzed independently by ChatGPT-5.1 and Gemini using standardized interpretive prompts. Images were classified as normal or pathological based on established criteria, including fascial thickness ≥ 4 mm, hypoechogenicity, and disturbance of fibrillar structure. Diagnostic metrics (sensitivity, specificity, predictive values, accuracy, F1-score, and Cohen's κ) were calculated, and comparative performance was assessed using McNemar's test. ChatGPT-5.1 demonstrated a balanced diagnostic profile, achieving a sensitivity of 67.6%, specificity of 97.7%, and overall accuracy of 83.8%. In contrast, Gemini exhibited maximal sensitivity (100%) but substantially reduced specificity (37.2%), yielding an overall accuracy of 66.3%. Confusion matrices revealed divergent error structures: ChatGPT-5.1 produced few false positives (n = 1), whereas Gemini generated many (n = 27). Agreement with the reference standard was substantial for ChatGPT-5.1 (κ = 0.666) and fair for Gemini (κ = 0.354). McNemar's test showed a significant difference in paired classification decisions between the models (p = 0.035). Precision-recall analysis likewise favored ChatGPT-5.1. ChatGPT-5.1 showed high specificity, whereas Gemini demonstrated higher sensitivity, indicating different diagnostic operating characteristics rather than overall superiority.
These patterns suggest potential screening-oriented use for Gemini and confirmatory use for ChatGPT-5.1. The findings support the feasibility of LLM-assisted musculoskeletal ultrasound interpretation and highlight the need for multicenter validation and integration strategies based on complementary model characteristics.
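The metrics reported in the abstract can be reproduced from 2×2 confusion-matrix counts. As a hedged illustration (the raw counts below are back-calculated from the reported percentages and the n = 80 sample, not stated explicitly in the abstract: for ChatGPT-5.1, TP = 25, FP = 1, FN = 12, TN = 42), a minimal sketch of the standard formulas:

```python
def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard diagnostic accuracy metrics from a 2x2 confusion matrix."""
    n = tp + fp + fn + tn
    sens = tp / (tp + fn)          # sensitivity (recall)
    spec = tn / (tn + fp)          # specificity
    ppv = tp / (tp + fp)           # positive predictive value (precision)
    npv = tn / (tn + fn)           # negative predictive value
    acc = (tp + tn) / n            # overall accuracy (= observed agreement)
    f1 = 2 * ppv * sens / (ppv + sens)
    # Cohen's kappa: chance-corrected agreement between model and reference
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (acc - p_chance) / (1 - p_chance)
    return {"sensitivity": sens, "specificity": spec, "ppv": ppv,
            "npv": npv, "accuracy": acc, "f1": f1, "kappa": kappa}

# Counts inferred from the reported percentages (illustrative, not from the paper's data tables)
chatgpt = diagnostic_metrics(tp=25, fp=1, fn=12, tn=42)   # sens ~0.676, spec ~0.977, kappa ~0.666
gemini = diagnostic_metrics(tp=37, fp=27, fn=0, tn=16)    # sens 1.0, spec ~0.372, kappa ~0.354
```

With these inferred counts the function reproduces the abstract's figures, including the κ values of 0.666 and 0.354, which supports the internal consistency of the reported results.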