ChatGPT, Claude Sonnet, and Grok Display Similarly Low Rates of Accuracy in Identifying Image-Based Orthopaedic Sports Pathologies.
Authors
Affiliations (2)
- Mayo Clinic Alix School of Medicine Scottsdale Arizona U.S.A.
- Department of Orthopedic Surgery Mayo Clinic Arizona Phoenix Arizona U.S.A.
Abstract
Purpose: To assess the ability of artificial intelligence (AI) to identify common sports-related pathologies using radiologic imaging and to compare ChatGPT 4.0 with 2 competitors, Grok 2 and Claude 3.5 Sonnet.

Methods: ChatGPT 4.0, Grok 2, and Claude 3.5 Sonnet were used. Five common orthopaedic sports pathologies were chosen: anterior cruciate ligament tears, posterior cruciate ligament tears, meniscal tears, chondral pathologies, and rotator cuff tears. When possible, 50 images representing each pathology were collected from a radiologic imaging database; these included radiographic images, computed tomography, and magnetic resonance imaging. Normal images corresponding to each diagnostic category were also collected. Receiver operating characteristic curves and area under the curve values were calculated to assess the accuracy of each AI platform.

Results: ChatGPT 4.0, Grok 2, and Claude 3.5 Sonnet accurately identified the pathology in 23.6%, 15.7%, and 17.1% of diseased images, respectively. ChatGPT and Grok were most accurate at identifying meniscus pathologies (ChatGPT: 48%, Grok: 42%), whereas Claude Sonnet was most accurate at identifying anterior cruciate ligament pathologies (30%). The area under the curve for ChatGPT, Grok, and Claude Sonnet was 0.21, 0.16, and 0.15, respectively (ChatGPT 4.0 vs Grok 2, P = .30; ChatGPT 4.0 vs Claude 3.5 Sonnet, P = .24; Grok 2 vs Claude 3.5 Sonnet, P > .99). There were no significant differences in performance among the 3 platforms, either overall or within any of the diagnostic categories.

Conclusions: ChatGPT 4.0, Grok 2, and Claude 3.5 Sonnet correctly identified the pathology in fewer than 25% of images of common sports-related pathologies and showed area under the curve values well below 0.5, indicating poor accuracy. Based on these findings, we do not recommend the current use of these generative AI models for image-based diagnosis in orthopaedics.
Clinical Relevance: As AI use grows among the general public, it becomes increasingly important to make clear the capabilities and limitations of popular AI platforms for image-based diagnosis.
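The two accuracy measures reported in the abstract, the proportion of diseased images correctly identified and the area under the receiver operating characteristic curve, can be illustrated with a minimal Python sketch. This is not the authors' analysis code: the abstract does not describe how model responses were scored, so the sketch assumes each response is reduced to a binary call (1 = pathology reported, 0 = not reported). The function names `sensitivity` and `auc_from_scores` and the eight example images are hypothetical.

```python
def sensitivity(labels, calls):
    """Fraction of diseased images (label 1) that the model called diseased."""
    diseased_calls = [c for y, c in zip(labels, calls) if y == 1]
    if not diseased_calls:
        raise ValueError("need at least one diseased image")
    return sum(diseased_calls) / len(diseased_calls)


def auc_from_scores(labels, scores):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation.

    Equals the probability that a randomly chosen diseased image receives a
    higher score than a randomly chosen normal image; ties count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one diseased and one normal image")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical data, NOT taken from the study:
# 4 diseased images (label 1) followed by 4 normal images (label 0).
example_labels = [1, 1, 1, 1, 0, 0, 0, 0]
# Assumed binary model calls: 1 = "pathology present", 0 = "no pathology".
example_calls = [1, 0, 0, 0, 1, 1, 0, 1]

print("sensitivity:", sensitivity(example_labels, example_calls))
print("AUC:", auc_from_scores(example_labels, example_calls))
```

In this made-up example the model flags normal images more often than diseased ones, so the area under the curve falls below 0.5, the value that corresponds to chance-level ranking of diseased versus normal images.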