Back to all papers

Performance Comparison of Cutting-Edge Large Language Models on the ACR In-Training Examination: An Update for 2025.

Authors

Young A,Paloka R,Islam A,Prasanna P,Hill V,Payne D

Affiliations (6)

  • Northwell Health, Mather Hospital, Port Jefferson, New York (A.Y.). Electronic address: [email protected].
  • Department of Radiology, Stony Brook University Hospital, Stony Brook, New York (R.P.).
  • New York Institute of Technology College of Medicine, Old Westbury, New York (A.I.).
  • Department of Biomedical Informatics, Stony Brook University Hospital, Stony Brook, New York (P.P.).
  • Department of Radiology, Northwestern University Hospital, Chicago, Illinois (V.H.).
  • Department of Radiology, Northwestern University Hospital, Chicago, Illinois (D.P.).

Abstract

This study represents a continuation of prior work by Payne et al. evaluating large language model (LLM) performance on radiology board-style assessments, specifically the ACR diagnostic radiology in-training examination (DXIT). Building upon earlier findings with GPT-4, we assess the performance of newer, cutting-edge models, such as GPT-4o, GPT-o1, GPT-o3, Claude, Gemini, and Grok on standardized DXIT questions. In addition to overall performance, we compare model accuracy on text-based versus image-based questions to assess multi-modal reasoning capabilities. As a secondary aim, we investigate the potential impact of data contamination by comparing model performance on original versus revised image-based questions. Seven LLMs - GPT-4, GPT-4o, GPT-o1, GPT-o3, Claude 3.5 Sonnet, Gemini 1.5 Pro, and Grok 2.0-were evaluated using 106 publicly available DXIT questions. Each model was prompted using a standardized instruction set to simulate a radiology resident answering board-style questions. For each question, the model's selected answer, rationale, and confidence score were recorded. Unadjusted accuracy (based on correct answer selection) and logic-adjusted accuracy (based on clinical reasoning pathways) were calculated. Subgroup analysis compared model performance on text-based versus image-based questions. Additionally, 63 image-based questions were revised to test novel reasoning while preserving the original diagnostic image to assess the impact of potential training data contamination. Across 106 DXIT questions, GPT-o1 demonstrated the highest unadjusted accuracy (71.7%), followed closely by GPT-4o (69.8%) and GPT-o3 (68.9%). GPT-4 (59.4%) and Grok 2.0 exhibited similar scores (59.4% and 52.8%). Claude 3.5 Sonnet had the lowest unadjusted accuracies (34.9%). Similar trends were observed for logic-adjusted accuracy, with GPT-o1 (60.4%), GPT-4o (59.4%), and GPT-o3 (59.4%) again outperforming other models, while Grok 2.0 and Claude 3.5 Sonnet lagged behind (34.0% and 30.2%, respectively). GPT-4o's performance was significantly higher on text-based questions compared to image-based ones. Unadjusted accuracy for the revised DXIT questions was 49.2%, compared to 56.1% on matched original DXIT questions. Logic-adjusted accuracy for the revised DXIT questions was 40.0% compared to 44.4% on matched original DXIT questions. No significant difference in performance was observed between original and revised questions. Modern LLMs, especially those from OpenAI, demonstrate strong and improved performance on board-style radiology assessments. Comparable performance on revised prompts suggests that data contamination may have played a limited role. As LLMs improve, they hold strong potential to support radiology resident learning through personalized feedback and board-style question review.

Topics

Journal Article

Ready to Sharpen Your Edge?

Join hundreds of your peers who rely on RadAI Slice. Get the essential weekly briefing that empowers you to navigate the future of radiology.

We respect your privacy. Unsubscribe at any time.