Multimodal LLMs achieved up to 94% accuracy for scoliosis detection on spine x-rays, but struggled with lumbar stenosis on MRI.
Key Details
- Five multimodal LLMs tested: Grok 2, 3, 4, ChatGPT 4o, Gemini 1.5 Flash.
- 171 spine x-rays (100 scoliosis, 71 normal) and 200 lumbar spine MRIs (100 severe stenosis, 100 normal) used in the study.
- Best x-ray result: Grok 4 with 94.2% accuracy for scoliosis detection; best MRI result: Gemini at 60% for stenosis.
- ChatGPT 4o showed better confidence calibration when incorrect, considered a 'superior metacognitive capability.'
- Authors emphasize LLMs not ready for clinical diagnosis; highlight potential for patient education in obvious cases.
- Study published in World Neurosurgery on May 2, 2024.
Why It Matters
As patients increasingly use commercial LLMs for medical advice, understanding their capabilities and risks in radiology is crucial. These results highlight both the promise and current limitations of generalist AI in medical image interpretation, especially for more subtle pathologies.

Source
AuntMinnie