Evaluating the Performance of ChatGPT-4V in Detecting Inflammatory Magnetic Resonance Imaging Findings of Sacroiliitis: Potentials, Challenges, and Limitations.
Authors
Affiliations (6)
- Department of Physical Medicine and Rehabilitation, İzzet Baysal Physical Treatment and Rehabilitation Training and Research Hospital, Bolu, Turkey.
- Department of Rheumatology, Faculty of Medicine, Bolu Abant İzzet Baysal University, Bolu, Turkey.
- Department of Physical Medicine and Rehabilitation, University of Health Sciences Sultan 2.Abdulhamid Han Training and Research Hospital, İstanbul, Turkey.
- Department of Radiology, Kelkit State Hospital, Gümüşhane, Turkey.
- Department of Radiology, Faculty of Medicine, Bolu Abant İzzet Baysal University, Bolu, Turkey.
- Department of Physical Medicine and Rehabilitation, Başakşehir Çam and Sakura City Hospital, Istanbul, Turkey.
Abstract
This study aimed to evaluate the diagnostic accuracy of ChatGPT-4V, an AI model with visual capabilities, in detecting sacroiliitis on MRI and to compare its performance with that of expert radiologists. This retrospective study included 125 patients (250 sacroiliac joint images) from a tertiary hospital's Picture Archiving and Communication System. MRI scans, including coronal T1-weighted and semicoronal STIR sequences, were assessed by two experienced radiologists. ChatGPT-4V was prompted with standardized queries to analyze the images for signs of active or chronic sacroiliitis, and its diagnostic outputs were compared with the radiologists' assessments. Performance metrics, including sensitivity, specificity, precision, and area under the curve (AUC), were calculated. ChatGPT-4V demonstrated high sensitivity for detecting bone marrow edema (0.955; AUC, 0.84) but lower sensitivity for sclerosis (0.211; AUC, 0.55), joint space narrowing (0.298; AUC, 0.59), and joint surface irregularities (0.433; AUC, 0.59). The overall accuracy of the model was 0.624, with a weighted-average AUC of 0.62. The model thus performed well in identifying active inflammatory changes but underperformed in detecting chronic structural abnormalities. ChatGPT-4V shows promise in detecting active inflammatory sacroiliitis, particularly bone marrow edema, but its current inability to reliably identify chronic structural abnormalities limits its standalone clinical utility. To improve diagnostic capability and enable clinical integration, future efforts must focus on fine-tuning the model on specialist-labeled radiological datasets.
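As an illustration of how the reported per-finding metrics can be derived, the following is a minimal sketch, not the authors' analysis code. It assumes each image carries a binary reference label from the radiologists and a binary label from the model for one finding; the function name `binary_metrics` and the example labels are hypothetical and are not study data. For hard yes/no predictions, the ROC curve has a single operating point, so the AUC reduces to the mean of sensitivity and specificity.

```python
from typing import Dict, Sequence


def binary_metrics(reference: Sequence[int], predicted: Sequence[int]) -> Dict[str, float]:
    """Compute per-finding metrics from binary labels (1 = finding present).

    `reference` holds the radiologists' labels and `predicted` the model's
    labels. Both names are illustrative assumptions, not from the study.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")

    # Confusion-matrix counts.
    tp = sum(1 for r, p in zip(reference, predicted) if r == 1 and p == 1)
    tn = sum(1 for r, p in zip(reference, predicted) if r == 0 and p == 0)
    fp = sum(1 for r, p in zip(reference, predicted) if r == 0 and p == 1)
    fn = sum(1 for r, p in zip(reference, predicted) if r == 1 and p == 0)

    def ratio(numerator: int, denominator: int) -> float:
        # Return NaN instead of raising when a class is absent.
        return numerator / denominator if denominator else float("nan")

    sensitivity = ratio(tp, tp + fn)  # true positive rate
    specificity = ratio(tn, tn + fp)  # true negative rate
    precision = ratio(tp, tp + fp)    # positive predictive value
    accuracy = ratio(tp + tn, len(reference))
    # With binary (non-probabilistic) predictions, ROC AUC equals the
    # mean of sensitivity and specificity (balanced accuracy).
    auc = (sensitivity + specificity) / 2

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "auc": auc,
    }


# Hypothetical labels for ten images (not data from the study).
example_reference = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
example_predicted = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
example_metrics = binary_metrics(example_reference, example_predicted)
```

In this made-up example the model finds three of four true positives and raises two false alarms, giving a sensitivity of 0.75 and a precision of 0.60. A finding with high sensitivity but only moderate AUC, as in the pattern reported above, would indicate that specificity is the weaker component.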