Generalist Large-Language Models for Spine Imaging Diagnostics: An Early Analysis of Detection Performance for Scoliosis and Lumbar Stenosis.

May 2, 2026

Authors

Hoglund ZT, Wu AQ, Kathawate VG, Sollenberger C, Englander R, Rani N, Saadoun J, Massaad E, Dagli MM, Malhotra N, Yoon JW, Welch WC, Ozturk AK, Shin JH, Judy BF

Affiliations (3)

  • Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA. Electronic address: [email protected].
  • Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA.
  • Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA. Electronic address: [email protected].

Abstract

Web-based large language models (LLMs) are increasingly used by patients for medical self-assessment, but their efficacy in spine imaging diagnostics remains underexplored. This study systematically evaluated five leading multimodal LLMs (Grok 2, Grok 3, Grok 4, ChatGPT, and Gemini) for detecting scoliosis and lumbar spinal stenosis on radiographs and MRI. We assessed 171 full-length anteroposterior radiographs (100 with scoliosis, 71 normal) and 200 axial T2-weighted lumbar spine MRIs (100 with severe stenosis, 100 normal) from public databases. Models were prompted without examples (zero-shot) to identify pathology and to quantify their certainty (0-100%). Analyses included McNemar's test for accuracy and ANOVA for confidence levels. In scoliosis detection, Grok 4 exhibited the highest accuracy (0.942), followed by Gemini (0.912), Grok 2 (0.890), ChatGPT (0.643), and Grok 3 (0.637). For stenosis, Gemini performed best (0.600), followed by Grok 4 (0.575), ChatGPT (0.545), Grok 2 (0.500), and Grok 3 (0.450). All models sustained >70% mean certainty (SD <5.3%) across pathologies. ChatGPT and Grok 3 showed reduced confidence in erroneous scoliosis responses (p<0.0001), while only ChatGPT did so for stenosis; Gemini reported elevated confidence in incorrect stenosis responses (p<0.0001). LLMs perform well in scoliosis detection but struggle to identify lumbar stenosis. ChatGPT's superior confidence calibration suggests enhanced reliability. Performance inconsistencies across model iterations (e.g., Grok 3 underperforming Grok 2) underscore the need for specialized medical imaging training. Although promising for patient education in simple spine conditions, substantial advances in accuracy and confidence calibration are needed before clinical adoption or broad patient use.
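The abstract names two analyses: McNemar's test for comparing paired model accuracies and ANOVA for comparing confidence levels on correct versus incorrect responses. The sketch below shows how such analyses could be run in Python; it uses the exact binomial formulation of McNemar's test, and all data values are illustrative placeholders, not the study's results.

```python
# Illustrative sketch of the abstract's two statistical analyses.
# All per-image labels and confidence values below are made up.
from scipy.stats import binomtest, f_oneway

def mcnemar_exact(correct_a, correct_b):
    """Exact McNemar test on paired per-image correctness labels.

    Under H0 (equal accuracy), the discordant pairs (one model right,
    the other wrong) split 50/50, so we test the discordant counts
    against a Binomial(n, 0.5).
    """
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if not x and y)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    return binomtest(min(b, c), n, 0.5).pvalue

# Hypothetical paired outcomes (1 = correct) for two models on 12 images.
model_a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1]
model_b = [1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1]
p_acc = mcnemar_exact(model_a, model_b)

# Calibration check: do confidence scores (0-100%) differ between a
# model's correct and incorrect responses? One-way ANOVA, per abstract.
conf_correct = [85, 90, 78, 92, 88, 81]
conf_wrong = [60, 55, 72, 58]
f_stat, p_conf = f_oneway(conf_correct, conf_wrong)
print(p_acc, p_conf)
```

A well-calibrated model, in the abstract's sense, would show significantly lower confidence on its incorrect responses; the reverse pattern (as reported for Gemini on stenosis) indicates overconfidence in errors.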

Topics

Journal Article
