Comparative Performance of ChatGPT-4o, Grok-3, and Claude 3.7 Sonnet in Differentiating Cerebellopontine Angle Schwannomas and Meningiomas: A Pilot Study of Visual Versus Text-Based Diagnostic Prompting.
Authors
Affiliations (7)
- Department of Radiology, Sureyyapasa Chest Diseases and Thoracic Surgery Training Hospital, Başıbüyük Mah. Hastane Yolu Cad, 34844, Istanbul, Turkey.
- Department of Radiology, Istanbul University-Cerrahpasa, Istanbul, Turkey.
- Department of Neuroradiology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA.
- Division of Neuroradiology, Department of Radiology, Istanbul University-Cerrahpasa, Istanbul, Turkey.
- Department of Neurosurgery, Istanbul University-Cerrahpasa, Istanbul, Turkey.
- Department of Radiation Oncology, Istanbul University-Cerrahpasa, Istanbul, Turkey.
- Department of Pathology, Istanbul University-Cerrahpasa, Istanbul, Turkey.
Abstract
This study aimed to assess the diagnostic accuracy of three large language models (LLMs), ChatGPT-4o, Grok-3, and Claude 3.7 Sonnet, in differentiating cerebellopontine angle (CPA) schwannomas and meningiomas using clinical data combined with either raw images or structured imaging features. This retrospective pilot study included 53 patients with pathologically confirmed CPA tumors (28 meningiomas, 25 schwannomas). Each case was submitted to the LLMs with clinical data and either raw imaging slices (clinical + images), expert-derived structured imaging features (clinical + expert prompt), or resident-derived structured features (clinical + resident prompt). To establish task difficulty and provide performance context, traditional machine learning algorithms were trained and evaluated with nested cross-validation. Two board-certified neuroradiologists (each with 10 years of experience) and two third-year radiology residents took part: one pair (B.K. and B.Y.) generated the structured imaging features, while the other pair (S.A. and K.H.O.) independently evaluated all cases for performance comparison. Diagnostic performance was compared using McNemar's and Cochran's Q tests with 95% confidence intervals. With expert-derived structured features, the LLMs achieved high accuracy (86.8-94.3%), with GPT-4o reaching 94.3% (95% CI 84.6-98.1%), matching the expert neuroradiologist (92.5%, 95% CI 82.1-97.0%, p = 1.000) and significantly outperforming the resident (79.2%, p = 0.021). Performance was moderate with resident-derived structured features (71.7-73.6%), still significantly better than raw image input (p < 0.05), and declined markedly with raw images alone (18.9-52.8%), where all LLMs performed significantly worse than the radiologists (p < 0.05). Baseline machine learning models confirmed task feasibility with structured features (AUC-ROC 0.968-0.992 for expert features; 0.669-0.875 for resident features). LLMs, particularly GPT-4o, can achieve expert-level diagnostic performance in CPA tumor differentiation when structured textual imaging features are provided. Performance degrades moderately with resident-derived features and dramatically with raw image input, highlighting the critical importance of structured prompting for clinical LLM applications. Current automated visual reasoning remains markedly limited compared with text-based diagnostic inference.
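Because the key comparisons rest on paired per-case outcomes, a brief illustration of the statistics may help. The sketch below is not the authors' analysis code: it applies an exact McNemar test to hypothetical per-case correctness vectors and adds Wilson-score 95% confidence intervals for accuracy (the abstract does not state which CI method was used, so Wilson is an assumption, and all data in the example are simulated placeholders).

```python
# Minimal sketch (not the study's pipeline): exact McNemar test on paired
# per-case correctness plus Wilson 95% CIs for accuracy. All values below
# are simulated placeholders, not patient data.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.stats.proportion import proportion_confint

rng = np.random.default_rng(0)
n_cases = 53  # cohort size from the abstract; outcomes here are simulated

# 1 = correct diagnosis, 0 = incorrect, one entry per case (hypothetical)
gpt4o_correct = rng.binomial(1, 0.94, n_cases)
resident_correct = rng.binomial(1, 0.79, n_cases)

# 2x2 table of paired agreement/disagreement between the two raters
table = np.array([
    [np.sum((gpt4o_correct == 1) & (resident_correct == 1)),
     np.sum((gpt4o_correct == 1) & (resident_correct == 0))],
    [np.sum((gpt4o_correct == 0) & (resident_correct == 1)),
     np.sum((gpt4o_correct == 0) & (resident_correct == 0))],
])

# Exact (binomial) McNemar test, appropriate for small discordant counts
result = mcnemar(table, exact=True)
print(f"McNemar p = {result.pvalue:.3f}")

# Wilson-score 95% CI for each rater's accuracy (one plausible CI choice)
for name, correct in [("GPT-4o", gpt4o_correct), ("resident", resident_correct)]:
    acc = correct.mean()
    lo, hi = proportion_confint(correct.sum(), n_cases, alpha=0.05, method="wilson")
    print(f"{name}: accuracy {acc:.1%} (95% CI {lo:.1%}-{hi:.1%})")
```

The same per-case correctness vectors, stacked across all readers and prompting conditions, would feed a Cochran's Q test for an overall difference before the pairwise McNemar comparisons.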