Evaluating ChatGPT's performance across radiology subspecialties: A meta-analysis of board-style examination accuracy and variability.

Authors

Nguyen D, Kim GHJ, Bedayat A

Affiliations (3)

  • University of Massachusetts Chan Medical School, Worcester, MA, United States of America.
  • Biostatistics, Fielding School of Public Health, University of California, Los Angeles, David Geffen School of Medicine, University of California, Los Angeles (UCLA), United States of America.
  • Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA, United States of America.

Abstract

Large language models (LLMs) such as ChatGPT are increasingly used in medicine because they can synthesize information and support clinical decision-making. While prior research has evaluated ChatGPT's performance on medical board exams, limited data exist on radiology-specific exams, especially with respect to prompt strategies and input modalities. This meta-analysis reviews ChatGPT's performance on radiology board-style questions, assessing accuracy across radiology subspecialties, prompt engineering methods, GPT model versions, and input modalities. Searches in PubMed and SCOPUS identified 163 articles, of which 16 met inclusion criteria after excluding irrelevant topics and non-board exam evaluations. Data extracted included subspecialty topics, accuracy, question count, GPT model, input modality, prompting strategies, and access dates. Statistical analyses included two-proportion z-tests, a binomial generalized linear model (GLM), and meta-regression with random effects (Stata v18.0, R v4.3.1). Across 7024 questions, overall accuracy was 58.83 % (95 % CI, 55.53-62.13). Performance varied widely by subspecialty: it was highest in emergency radiology (73.00 %) and lowest in musculoskeletal radiology (49.24 %). GPT-4 and GPT-4o significantly outperformed GPT-3.5 (p < .001), but visual inputs yielded lower accuracy (46.52 %) than textual inputs (67.10 %; p < .001). Basic prompts yielded significantly higher accuracy (66.23 %) than no prompts (59.70 %; p < .01). A modest but significant decline in performance over time was also observed (p < .001). ChatGPT demonstrates promising but inconsistent performance on radiology board-style questions. Limitations in visual reasoning, heterogeneity across studies, and variability in prompt engineering highlight areas requiring targeted optimization.
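
As an illustration of the two-proportion z-test named among the statistical analyses, the following is a minimal Python sketch using only the standard library. It is not the authors' code (the abstract reports that the analyses were run in Stata and R), and the per-group question counts are hypothetical: the abstract gives the accuracy percentages for textual and visual inputs but not the number of questions in each group. The function name `two_proportion_z_test` is likewise an assumption made for this example.

```python
from math import erfc, sqrt


def two_proportion_z_test(correct_a, total_a, correct_b, total_b):
    """Two-sided, pooled two-proportion z-test. Returns (z, p_value)."""
    p_a = correct_a / total_a
    p_b = correct_b / total_b
    # Pooled proportion under the null hypothesis that both groups share one accuracy.
    pooled = (correct_a + correct_b) / (total_a + total_b)
    se = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


# HYPOTHETICAL counts chosen only to roughly match the reported accuracies
# (about 67 % for textual inputs, about 46.5 % for visual inputs).
# The real per-group question counts are not given in the abstract.
text_correct, text_total = 671, 1000
visual_correct, visual_total = 465, 1000

z, p = two_proportion_z_test(text_correct, text_total, visual_correct, visual_total)
print(f"z = {z:.2f}, p = {p:.3g}")
```

With real data, the group totals would come from the question counts extracted from each included study. A test of this kind treats every question as independent, which is one reason a meta-analysis would also fit a random-effects model to account for between-study heterogeneity.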

Topics

Journal Article
