Diagnostic Accuracy of GPT-4 With Vision in Neuroradiology Board-Style Exam Questions: Cross-Sectional Case-Based Study.
Affiliations (4)
- Lake Erie College of Medicine, 5000 Lakewood Ranch Blvd., Bradenton, FL, 34211, United States, +1 941 782 5761.
- Department of Computer Science, Stetson University, DeLand, FL, United States.
- College of Medicine, University of Central Florida, Orlando, FL, United States.
- Department of Radiology, University of Florida College of Medicine, Gainesville, FL, United States.
Abstract
Multimodal artificial intelligence systems combining text and image analysis represent a paradigm shift in clinical decision support. While GPT-4 with Vision (GPT-4V) has shown promise in medical imaging interpretation, existing studies report inconsistent performance (16%-80% accuracy) across radiological subspecialties. Critical knowledge gaps persist regarding GPT-4V's capability to integrate clinical history with imaging findings in complex neuroradiology scenarios, and fundamental questions remain about whether the model appropriately balances visual and textual information sources when formulating diagnoses. Furthermore, documented artificial intelligence hallucination rates of 35.5% to 63% in radiology applications raise urgent safety concerns, yet the relationship between modality utilization patterns and diagnostic accuracy remains unexplored.

This study aims to evaluate GPT-4V's diagnostic accuracy on expert-validated neuroradiology board-style examination questions and to examine the model's self-reported reliance on imaging versus clinical text data when making diagnostic decisions. A secondary objective was to examine whether self-characterized modality utilization patterns differed systematically between correct and incorrect diagnoses, potentially identifying specific failure modes requiring targeted mitigation strategies.

This cross-sectional study evaluated GPT-4V using 29 neuroradiology cases from the RSNA (Radiological Society of North America) Case Collection, covering adult brain and central nervous system pathologies imaged with computed tomography or magnetic resonance imaging. All cases were authored by board-certified radiologists. GPT-4V was accessed via ChatGPT Plus (July 2024) with standardized prompts instructing the model to select 1 answer from 4 options, provide a diagnostic rationale, and quantify the percentage contributions of image versus text data. Binary scoring assessed diagnostic performance (correct=1, incorrect=0). Statistical analysis included Wilson score CIs, a binomial test comparing accuracy to chance, and a 2-tailed <i>t</i> test comparing self-reported modality reliance between correct and incorrect diagnoses (α=.05, with Cohen <i>d</i> calculated).

GPT-4V correctly diagnosed 22 of 29 cases (76% accuracy, 95% CI 57.9%-87.8%), significantly exceeding the chance performance of 25% (<i>z</i>=6.33; <i>P</i><.001). The model self-reported mean contributions of 66.1% from imaging (95% CI 63.5%-68.8%) and 33.9% from text (95% CI 31.2%-36.5%). Correct diagnoses (n=22) showed significantly lower self-reported image reliance (62.8%, 95% CI 61.3%-64.3%) than incorrect diagnoses (n=7; 76.7%, 95% CI 73.5%-80.0%), with a mean difference of 13.9 percentage points (95% CI 10.6-17.3; <i>P</i><.001; Cohen <i>d</i>=4.08, 95% CI 2.73-5.43). All 7 incorrect diagnoses demonstrated image-dominant attribution ≥70% (Fisher exact test <i>P</i><.001), suggesting that excessive visual reliance may indicate diagnostic risk.

The 76% accuracy substantially exceeds accuracies reported in prior GPT-4V radiology studies (43%), demonstrating that focused domain application with structured prompting enhances performance. Incorrect diagnoses were associated with higher self-reported visual reliance, suggesting a potential failure mode warranting experimental validation. This pattern identifies a potentially actionable signal for quality assurance systems. Clinical deployment should remain restricted to supervised educational applications with mandatory radiologist oversight until balanced, context-aware integration is validated.
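The headline interval estimates above can be reproduced from the raw counts. The sketch below, which assumes the conventional z=1.96 critical value, computes the Wilson score interval for 22 of 29 correct diagnoses and the normal-approximation z statistic against the 25% chance level for 4-option questions; it is an illustration of the reported statistics, not the authors' analysis code.

```python
from math import sqrt

def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval for a binomial proportion k/n."""
    p = k / n
    center = p + z * z / (2 * n)
    margin = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (center - margin) / denom, (center + margin) / denom

def one_sample_z(k: int, n: int, p0: float) -> float:
    """Normal-approximation z statistic for testing a proportion against p0."""
    p = k / n
    return (p - p0) / sqrt(p0 * (1 - p0) / n)

lo, hi = wilson_ci(22, 29)            # 22 of 29 cases diagnosed correctly
z_stat = one_sample_z(22, 29, 0.25)   # chance level for 1-of-4 options
print(f"accuracy 95% CI: {lo:.1%}-{hi:.1%}")  # 57.9%-87.8%
print(f"z = {z_stat:.2f}")                    # z = 6.33
```

Both values match the abstract: the Wilson interval reproduces the reported 95% CI of 57.9%-87.8%, and the z statistic of 6.33 corresponds to <i>P</i><.001.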