Multimodal LLMs achieved up to 94% accuracy for scoliosis detection on spine x-rays, but struggled with lumbar stenosis on MRI.
Key Details
- Five multimodal LLMs tested: Grok 2, 3, 4, ChatGPT 4o, Gemini 1.5 Flash.
- 171 spine x-rays (100 scoliosis, 71 normal) and 200 lumbar spine MRIs (100 severe stenosis, 100 normal) used in the study.
- Best x-ray result: Grok 4 with 94.2% accuracy for scoliosis detection; best MRI result: Gemini at 60% for stenosis.
- ChatGPT 4o showed better confidence calibration when incorrect, considered a 'superior metacognitive capability.'
- Authors emphasize LLMs not ready for clinical diagnosis; highlight potential for patient education in obvious cases.
- Study published in World Neurosurgery on May 2, 2024.
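To put the reported accuracies in context, the short sketch below computes what a trivial "always guess the larger class" strategy would score on each dataset, using only the case counts listed above. It is an illustrative calculation, not part of the study; the helper name `majority_baseline` is hypothetical.

```python
def majority_baseline(positives: int, negatives: int) -> float:
    """Accuracy obtained by always predicting the larger class."""
    total = positives + negatives
    return max(positives, negatives) / total


# Case counts as reported in the summary above.
xray_baseline = majority_baseline(100, 71)   # scoliosis vs. normal x-rays
mri_baseline = majority_baseline(100, 100)   # severe stenosis vs. normal MRIs

print(f"X-ray majority-class baseline: {xray_baseline:.1%} (best reported: 94.2%)")
print(f"MRI majority-class baseline: {mri_baseline:.1%} (best reported: 60%)")
```

On these counts the x-ray baseline is about 58.5% and the MRI baseline is 50%, so the best MRI result sits only about 10 percentage points above guessing, while the best x-ray result is well clear of it.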
Why It Matters
As patients increasingly use commercial LLMs for medical advice, understanding their capabilities and risks in radiology is crucial. These results highlight both the promise and current limitations of generalist AI in medical image interpretation, especially for more subtle pathologies.

Source
AuntMinnie