Artificial Intelligence in Radiology: Performance of ChatGPT-4v and GPT-4o on Diagnostic Radiology In-Training (DXIT) Examination Questions.

October 29, 2025

Authors

Martini R, Sang A, Saunders P, Bala W, Li H, Moon JT, Balthazar P

Affiliations (5)

  • Emory University School of Medicine, 3338 Peachtree Rd. NE Apt. 1410, Atlanta, Georgia, USA 30322. Electronic address: [email protected].
  • Robert Wood Johnson Medical School.
  • Emory University.
  • Emory University School of Medicine.
  • Emory University School of Medicine; Medical Director, Quality and Patient Safety for Emory Enterprise Radiology.

Abstract

The purpose of this study was to examine the performance of GPT-4 Vision (GPT-4v) and GPT-4 Omni (GPT-4o) on the American College of Radiology's Diagnostic Radiology In-Training (DXIT) examination, comparing performance on image-based and text-only questions. A total of 1136 publicly available DXIT examination questions were input into GPT-4v and GPT-4o with a prompt asking the model to provide its answer, rationale, and confidence level (0-100). Accuracy of each model across different categories was then analyzed, using chi-square tests to compare proportions, t-tests to compare means, and receiver operating characteristic (ROC) curves to evaluate confidence levels. GPT-4o and GPT-4v achieved overall accuracies of 73.5% and 69.3%, respectively (p<0.0001), while scoring 55.6% and 50.3% on image-based questions (p<0.0001). ROC curves relating confidence level to correctness produced areas under the curve (AUC) of 0.64 and 0.66 for GPT-4o and GPT-4v, respectively. GPT-4o outperformed GPT-4v on nearly every metric, and both models exceeded the national average performance of post-graduate year 3 radiology residents (61.9%) on the 2022 DXIT examination. However, performance on image-based questions remained significantly worse than on text-only questions, with both models scoring below radiology trainees from the same cohort. Both models also exhibited limited ability to predict their own correctness from an intrinsic confidence level. Use of ChatGPT for test preparation and image interpretation should therefore be approached with caution.
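The abstract's confidence-calibration analysis (ROC AUC of self-reported confidence as a predictor of correctness) can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the data below are synthetic placeholders, and the AUC is computed via the rank-sum (Mann-Whitney U) formulation, which is equivalent to the area under the ROC curve.

```python
def roc_auc(confidences, correct):
    """AUC of confidence as a predictor of correctness:
    the probability that a randomly chosen correct answer received a
    higher confidence than a randomly chosen incorrect one (ties 0.5).
    This rank-based formulation equals the area under the ROC curve."""
    pos = [c for c, ok in zip(confidences, correct) if ok]   # correct answers
    neg = [c for c, ok in zip(confidences, correct) if not ok]  # incorrect answers
    if not pos or not neg:
        raise ValueError("need both correct and incorrect answers")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Synthetic example: self-reported confidence (0-100) and correctness flags.
conf = [90, 80, 85, 60, 70, 50, 95, 40]
ok   = [1,  1,  0,  0,  1,  0,  1,  0]
print(round(roc_auc(conf, ok), 2))  # prints 0.88
```

An AUC near 0.5 would mean confidence carries no information about correctness; the reported values of 0.64-0.66 indicate only weak calibration, which is the abstract's point about limited self-assessment.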

Topics

Journal Article
