Performance Comparison of Cutting-Edge Large Language Models on the ACR In-Training Examination: An Update for 2025.
Affiliations (6)
- Northwell Health, Mather Hospital, Port Jefferson, New York (A.Y.). Electronic address: [email protected].
- Department of Radiology, Stony Brook University Hospital, Stony Brook, New York (R.P.).
- New York Institute of Technology College of Medicine, Old Westbury, New York (A.I.).
- Department of Biomedical Informatics, Stony Brook University Hospital, Stony Brook, New York (P.P.).
- Department of Radiology, Northwestern University Hospital, Chicago, Illinois (V.H.).
- Department of Radiology, Northwestern University Hospital, Chicago, Illinois (D.P.).
Abstract
This study continues prior work by Payne et al. evaluating large language model (LLM) performance on radiology board-style assessments, specifically the ACR Diagnostic Radiology In-Training Examination (DXIT). Building on earlier findings with GPT-4, we assess the performance of newer, cutting-edge models, including GPT-4o, GPT-o1, GPT-o3, Claude, Gemini, and Grok, on standardized DXIT questions. In addition to overall performance, we compare model accuracy on text-based versus image-based questions to assess multi-modal reasoning capabilities. As a secondary aim, we investigate the potential impact of data contamination by comparing model performance on original versus revised image-based questions.

Seven LLMs (GPT-4, GPT-4o, GPT-o1, GPT-o3, Claude 3.5 Sonnet, Gemini 1.5 Pro, and Grok 2.0) were evaluated on 106 publicly available DXIT questions. Each model was prompted with a standardized instruction set to simulate a radiology resident answering board-style questions. For each question, the model's selected answer, rationale, and confidence score were recorded. Unadjusted accuracy (based on correct answer selection) and logic-adjusted accuracy (based on clinical reasoning pathways) were calculated. Subgroup analysis compared model performance on text-based versus image-based questions. Additionally, 63 image-based questions were revised to require novel reasoning while preserving the original diagnostic image, to assess the impact of potential training data contamination.

Across the 106 DXIT questions, GPT-o1 demonstrated the highest unadjusted accuracy (71.7%), followed closely by GPT-4o (69.8%) and GPT-o3 (68.9%). GPT-4 and Grok 2.0 scored lower (59.4% and 52.8%, respectively), and Claude 3.5 Sonnet had the lowest unadjusted accuracy (34.9%). Similar trends were observed for logic-adjusted accuracy, with GPT-o1 (60.4%), GPT-4o (59.4%), and GPT-o3 (59.4%) again outperforming the other models, while Grok 2.0 and Claude 3.5 Sonnet lagged behind (34.0% and 30.2%, respectively). GPT-4o performed significantly better on text-based questions than on image-based questions. Unadjusted accuracy on the revised DXIT questions was 49.2%, compared with 56.1% on the matched original questions; logic-adjusted accuracy was 40.0% versus 44.4%. No significant difference in performance was observed between original and revised questions.

Modern LLMs, especially those from OpenAI, demonstrate strong and improved performance on board-style radiology assessments. Comparable performance on revised prompts suggests that data contamination may have played a limited role. As LLMs improve, they hold strong potential to support radiology resident learning through personalized feedback and board-style question review.
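To make the two reported metrics concrete, the sketch below shows one way the unadjusted and logic-adjusted accuracies (and the text- versus image-based subgroup comparison) could be computed from per-question grading records. The field names and sample data are illustrative assumptions, not the authors' actual grading pipeline or data.

```python
# Minimal sketch (assumed data layout, not the study's pipeline):
# each record notes whether the model chose the keyed answer and whether
# its stated rationale followed a sound clinical reasoning pathway.
records = [
    {"question_id": 1, "image_based": False, "correct_choice": True,  "sound_reasoning": True},
    {"question_id": 2, "image_based": True,  "correct_choice": True,  "sound_reasoning": False},
    {"question_id": 3, "image_based": True,  "correct_choice": False, "sound_reasoning": False},
]

def unadjusted_accuracy(rows):
    """Fraction of questions where the keyed answer was selected."""
    return sum(r["correct_choice"] for r in rows) / len(rows)

def logic_adjusted_accuracy(rows):
    """Fraction of questions answered correctly AND with sound reasoning."""
    return sum(r["correct_choice"] and r["sound_reasoning"] for r in rows) / len(rows)

# Subgroup comparison: text-based vs. image-based questions.
text_rows = [r for r in records if not r["image_based"]]
image_rows = [r for r in records if r["image_based"]]

print(f"Unadjusted:     {unadjusted_accuracy(records):.1%}")
print(f"Logic-adjusted: {logic_adjusted_accuracy(records):.1%}")
print(f"Text subgroup:  {unadjusted_accuracy(text_rows):.1%}")
print(f"Image subgroup: {unadjusted_accuracy(image_rows):.1%}")
```

Under this scheme, logic-adjusted accuracy can never exceed unadjusted accuracy, consistent with the reported results (e.g., 71.7% unadjusted versus 60.4% logic-adjusted for GPT-o1).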