Artificial Intelligence in Radiology: Performance of ChatGPT-4v and GPT-4o on Diagnostic Radiology In-Training (DXIT) Examination Questions.

October 29, 2025

Authors

Martini R, Sang A, Saunders P, Bala W, Li H, Moon JT, Balthazar P

Affiliations (5)

  • Emory University School of Medicine, 3338 Peachtree Rd. NE Apt. 1410, Atlanta, Georgia, USA 30322. Electronic address: [email protected].
  • Robert Wood Johnson Medical School.
  • Emory University.
  • Emory University School of Medicine.
  • Emory University School of Medicine; Medical Director, Quality and Patient Safety for Emory Enterprise Radiology.

Abstract

The purpose of this study was to examine the performance of GPT-4 Vision (GPT-4v) and GPT-4 Omni (GPT-4o) on the American College of Radiology's Diagnostic Radiology In-Training (DXIT) examination, comparing performance on image-based and text-only questions. A total of 1136 publicly available DXIT examination questions were input into GPT-4v and GPT-4o with a prompt asking the model to provide its answer, rationale, and confidence level (0-100). Accuracy of each model across different categories was then analyzed, using chi-square tests to compare proportions, t-tests to compare means, and receiver operating characteristic (ROC) curves to evaluate confidence levels. GPT-4o and GPT-4v achieved overall accuracies of 73.5% and 69.3%, respectively (p<0.0001), while scoring 55.6% and 50.3% on image-based questions (p<0.0001). ROC curves relating confidence level to correctness produced areas under the curve (AUC) of 0.64 and 0.66 for GPT-4o and GPT-4v, respectively. GPT-4o outperformed GPT-4v on nearly every metric, and both models exceeded the national average performance of post-graduate year 3 radiology residents (61.9%) on the 2022 DXIT examination. However, performance on image-based questions remained significantly worse than on text-only questions, with both models scoring below radiology trainees from the same cohort. Both models also exhibited limited ability to predict their own correctness from an intrinsic confidence level. Use of ChatGPT for test preparation and image interpretation should therefore be approached with caution.
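The abstract's confidence-calibration analysis (ROC AUC of self-reported confidence as a predictor of correctness) can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the data below are synthetic placeholders, and the AUC is computed via the rank-sum (Mann-Whitney U) formulation, which is equivalent to the area under the ROC curve.

```python
def roc_auc(confidences, correct):
    """AUC of confidence as a predictor of correctness:
    the probability that a randomly chosen correct answer received a
    higher confidence than a randomly chosen incorrect one (ties 0.5).
    This rank-based formulation equals the area under the ROC curve."""
    pos = [c for c, ok in zip(confidences, correct) if ok]   # correct answers
    neg = [c for c, ok in zip(confidences, correct) if not ok]  # incorrect answers
    if not pos or not neg:
        raise ValueError("need both correct and incorrect answers")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Synthetic example: self-reported confidence (0-100) and correctness flags.
conf = [90, 80, 85, 60, 70, 50, 95, 40]
ok   = [1,  1,  0,  0,  1,  0,  1,  0]
print(round(roc_auc(conf, ok), 2))  # prints 0.88
```

An AUC near 0.5 would mean confidence carries no information about correctness; the reported values of 0.64-0.66 indicate only weak calibration, which is the abstract's point about limited self-assessment.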

Topics

Journal Article
