Evaluating the Performance of ChatGPT-4V in Detecting Inflammatory Magnetic Resonance Imaging Findings of Sacroiliitis: Potentials, Challenges, and Limitations.

November 20, 2025

Authors

Erden Y, Dilek G, Temel MH, Soylu HH, Kalfaoğlu ME, Bağcıer F

Affiliations (6)

  • Department of Physical Medicine and Rehabilitation, İzzet Baysal Physical Treatment and Rehabilitation Training and Research Hospital, Bolu, Turkey.
  • Department of Rheumatology, Faculty of Medicine, Bolu Abant İzzet Baysal University, Bolu, Turkey.
  • Department of Physical Medicine and Rehabilitation, University of Health Sciences Sultan 2. Abdulhamid Han Training and Research Hospital, İstanbul, Turkey.
  • Department of Radiology, Kelkit State Hospital, Gümüşhane, Turkey.
  • Department of Radiology, Faculty of Medicine, Bolu Abant İzzet Baysal University, Bolu, Turkey.
  • Department of Physical Medicine and Rehabilitation, Başakşehir Çam and Sakura City Hospital, Istanbul, Turkey.

Abstract

This study evaluated the diagnostic accuracy of ChatGPT-4V, an AI model with visual capabilities, in detecting sacroiliitis on MRI and compared its performance with that of expert radiologists. This retrospective study included 125 patients (250 sacroiliac joint images) drawn from a tertiary hospital's Picture Archiving and Communication System (PACS). MRI scans, including coronal T1-weighted and semicoronal STIR sequences, were assessed by two experienced radiologists. ChatGPT-4V was prompted with standardized queries to analyze the images for signs of active or chronic sacroiliitis, and its diagnostic outputs were compared with the radiologists' assessments. Performance metrics, including sensitivity, specificity, precision, and area under the curve (AUC), were calculated. ChatGPT-4V demonstrated high sensitivity for detecting bone marrow edema (0.955; AUC, 0.84) but lower sensitivity for sclerosis (0.211; AUC, 0.55), joint space narrowing (0.298; AUC, 0.59), and joint surface irregularities (0.433; AUC, 0.59). The overall accuracy of the model was 0.624, with a weighted-average AUC of 0.62. ChatGPT-4V thus shows promise in detecting active inflammatory sacroiliitis, particularly bone marrow edema, but its current inability to reliably identify chronic structural abnormalities limits its standalone clinical utility. To achieve enhanced diagnostic capability and enable clinical integration, future efforts must focus on fine-tuning the model with specialist-labeled radiological datasets.
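
For readers unfamiliar with the metrics named in the abstract, the following is a minimal sketch of how sensitivity, specificity, precision, accuracy, and a single-threshold AUC can be computed for one finding when the model gives a yes/no answer per image. It is not the authors' analysis code: the function name `binary_metrics`, the `BinaryMetrics` container, and the toy labels are all hypothetical. The AUC formula used here, (sensitivity + specificity) / 2, is the standard value of the ROC area when only one operating point is available; whether the study computed AUC this way is an assumption.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    sensitivity: float
    specificity: float
    precision: float
    accuracy: float
    auc: float


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def binary_metrics(reference, predicted) -> BinaryMetrics:
    """Compare yes/no model outputs against a yes/no reference standard.

    `reference` and `predicted` are equal-length sequences of 0/1 (or bool)
    values, one entry per image, for a single finding.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")

    pairs = [(bool(r), bool(p)) for r, p in zip(reference, predicted)]
    tp = sum(1 for r, p in pairs if r and p)          # finding present, detected
    fn = sum(1 for r, p in pairs if r and not p)      # finding present, missed
    fp = sum(1 for r, p in pairs if not r and p)      # finding absent, flagged
    tn = sum(1 for r, p in pairs if not r and not p)  # finding absent, not flagged

    sensitivity = _ratio(tp, tp + fn)
    specificity = _ratio(tn, tn + fp)
    return BinaryMetrics(
        sensitivity=sensitivity,
        specificity=specificity,
        precision=_ratio(tp, tp + fp),
        accuracy=_ratio(tp + tn, len(pairs)),
        # Area under the ROC curve through (0,0), (1-spec, sens), (1,1).
        auc=(sensitivity + specificity) / 2,
    )


# Toy labels invented for illustration only; they are NOT data from the study.
toy_reference = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
toy_result = binary_metrics(toy_reference, toy_predicted)
print(toy_result)
```

A high sensitivity paired with a modest AUC, as reported for bone marrow edema, is what this formula produces when specificity is lower than sensitivity; the reverse pattern (low sensitivity, AUC near 0.5) matches the chronic structural findings.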

Topics

Journal Article
