Large language models with image processing in automated Cobb angle.

February 20, 2026


DOI: 10.1007/s00586-026-09803-6 PMID: 41718808

Authors

Gibson J, Kharwadkar S, Lam C, Harland W, Jones M, Botchu R

Affiliations (5)

College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK.
King's College London GKT School of Medical Education, King's College London, London, UK.
Department of Orthopaedics, St George's University Hospitals NHS Foundation Trust, London, UK.
Department of Spinal Surgery, Royal Orthopaedic Hospital, Birmingham, UK.
Department of Musculoskeletal Radiology, Royal Orthopaedic Hospital, Birmingham, UK.

Abstract

Scoliosis severity is quantified by the Cobb angle, which clinicians measure on radiographs. With the increasing adoption of artificial intelligence (AI) in clinical workflows, it remains uncertain whether large language models (LLMs) with image processing capabilities can streamline and improve spinal deformity classification. This study assesses the diagnostic capabilities of 4 leading LLMs (ChatGPT, Gemini, Perplexity and Grok) in calculating Cobb angles from radiographs. A cross-sectional analysis of 122 scoliosis patients was undertaken. Cobb angles were independently calculated using Horos software by a fellowship-trained radiologist, serving as the reference standard. All 122 radiographs were then uploaded to each of the 4 AI models to identify the type of scoliosis, generate a Cobb angle overlay and calculate the Cobb angle. Qualitative usability was assessed through pre-defined questions rated on a Likert scale. Statistical analysis included mean difference, paired t-tests and intraclass correlation coefficients. Gemini produced no calculated Cobb angles. ChatGPT failed to produce Cobb angles in 90 radiographs and, even when Cobb angles were calculated, the errors were large (mean absolute error [MAE] 58.6° ± 45.9°). Both Perplexity and Grok generated estimates for all thoracolumbar cases, with mean differences of 18.8° (± 13.3°) and 24.2° (± 18.3°), respectively. None of the AI models successfully identified the S-shaped scoliosis cases. All AI models demonstrated a difference greater than the clinically accepted difference (≤ 10%). This study concludes that current commercially available LLMs show limited accuracy in Cobb angle measurement. Although Perplexity and Grok performed best of the 4 AI models assessed, no model demonstrated clinically acceptable ability. These findings highlight the need for dedicated and rigorous development of a spinal deformity AI tool before Cobb angle determination is integrated into clinical practice.
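
The abstract does not give the measurement or analysis procedure in detail, so the following is only a minimal Python sketch of the quantities it names, under stated assumptions. It treats the Cobb angle as the angle between two endplate lines, each given by two hypothetical landmark points, and computes a mean absolute error and a paired t statistic against a reference standard. The function names and all numeric values are illustrative assumptions and are not taken from the study; the intraclass correlation coefficient is omitted.

```python
import math
from statistics import mean, stdev


def line_angle_deg(p, q):
    """Orientation of the line through points p and q, in degrees."""
    return math.degrees(math.atan2(q[1] - p[1], q[0] - p[0]))


def cobb_angle(upper_endplate, lower_endplate):
    """Angle between two endplate lines, folded into [0, 90] degrees.

    Simplifying assumption: each endplate is a pair of (x, y) points and
    the angle is the acute angle between the two lines.
    """
    a = line_angle_deg(*upper_endplate)
    b = line_angle_deg(*lower_endplate)
    diff = abs(a - b) % 180.0
    return min(diff, 180.0 - diff)


def mean_absolute_error(estimates, reference):
    """Mean of |estimate - reference| over paired measurements."""
    return mean([abs(e - r) for e, r in zip(estimates, reference)])


def paired_t_statistic(estimates, reference):
    """t statistic of the paired differences (needs at least 2 pairs)."""
    diffs = [e - r for e, r in zip(estimates, reference)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


if __name__ == "__main__":
    # Hypothetical landmarks: a horizontal upper endplate and a tilted lower one.
    upper = ((0.0, 0.0), (1.0, 0.0))
    lower = ((0.0, 0.0), (1.0, 1.0))
    print(f"Cobb angle: {cobb_angle(upper, lower):.1f} degrees")

    # Made-up angles for illustration only (not data from the study).
    model_angles = [30.0, 42.0, 55.0]
    reference_angles = [25.0, 40.0, 45.0]
    print(f"MAE: {mean_absolute_error(model_angles, reference_angles):.2f} degrees")
    print(f"Paired t: {paired_t_statistic(model_angles, reference_angles):.2f}")
```

In practice a p-value for the paired test would come from a statistics package; this sketch stops at the t statistic so that it needs only the standard library.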


Topics

Journal Article
