Performance of GPT-4 Turbo and GPT-4o in Korean Society of Radiology In-Training Examinations.

June 1, 2025

DOI: 10.3348/kjr.2024.1096 PMID: 40288896

Authors

Choi A, Kim HG, Choi MH, Ramasamy SK, Kim Y, Jung SE

Affiliations (4)

Department of Radiology, Eunpyeong St. Mary's Hospital, College of Medicine, The Catholic University of Korea, Seoul, Republic of Korea.
Department of Radiology, Eunpyeong St. Mary's Hospital, College of Medicine, The Catholic University of Korea, Seoul, Republic of Korea.
Department of Radiology, Molecular Imaging Program at Stanford, Stanford University School of Medicine, Stanford, CA, USA.
Department of Diagnostic Radiology, Dankook University Hospital, Cheonan, Republic of Korea.

Abstract

Despite the potential of large language models for radiology training, their ability to handle image-based radiological questions remains poorly understood. This study aimed to evaluate the performance of GPT-4 Turbo and GPT-4o on radiology resident examinations, to analyze differences across question types, and to compare their results with those of residents at different training levels. A total of 776 multiple-choice questions from the Korean Society of Radiology In-Training Examinations were used, forming two question sets: one originally written in Korean and the other translated into English. We evaluated the performance of GPT-4 Turbo (gpt-4-turbo-2024-04-09) and GPT-4o (gpt-4o-2024-11-20) on these questions with the temperature set to zero, determining accuracy from the majority vote of five independent trials. We analyzed the results by question type (text-only vs. image-based) and benchmarked them against the performance of radiology residents nationwide. We also examined the impact of the input language (Korean or English) on model performance. GPT-4o outperformed GPT-4 Turbo on both image-based (48.2% vs. 41.8%, P = 0.002) and text-only questions (77.9% vs. 69.0%, P = 0.031). On image-based questions, GPT-4 Turbo and GPT-4o performed comparably to 1st-year residents (41.8% and 48.2%, respectively, vs. 43.3%; P = 0.608 and 0.079, respectively) but worse than 2nd- to 4th-year residents (vs. 56.0%-63.9%, all P ≤ 0.005). On text-only questions, GPT-4 Turbo and GPT-4o performed better than residents across all years (69.0% and 77.9%, respectively, vs. 44.7%-57.5%, all P ≤ 0.039). Performance on the English- and Korean-version questions showed no significant differences for either model (all P ≥ 0.275). In summary, GPT-4o outperformed GPT-4 Turbo on all question types. On image-based questions, both models matched the performance of 1st-year residents but fell below that of higher-year residents. On text-only questions, both models outperformed residents. The models performed consistently across English and Korean inputs.
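The scoring rule described in the abstract (accuracy based on the majority vote from five independent trials) can be illustrated with a minimal Python sketch. This is not the authors' code. The function names, the question IDs, and the sample answers are hypothetical, and the tie handling (a question with no single most frequent answer is counted as incorrect) is an assumption, since the abstract does not say how ties were resolved.

```python
from collections import Counter


def majority_vote(answers):
    """Return the single most frequent answer, or None if the top count is tied.

    Treating ties as "no answer" is an assumption of this sketch.
    """
    ranked = Counter(answers).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def accuracy(trials_by_question, answer_key):
    """Fraction of questions whose majority-vote answer matches the key."""
    correct = sum(
        majority_vote(trials_by_question[qid]) == answer_key[qid]
        for qid in answer_key
    )
    return correct / len(answer_key)


# Hypothetical data: five trial answers per question, plus the answer key.
trials = {
    "q1": ["A", "A", "B", "A", "C"],  # clear majority: A
    "q2": ["B", "C", "B", "C", "D"],  # B and C tie, counted as incorrect
    "q3": ["D", "D", "D", "D", "D"],  # unanimous: D
}
key = {"q1": "A", "q2": "B", "q3": "D"}

print(f"{accuracy(trials, key):.3f}")  # 0.667
```

With five trials and more than two answer choices, a tie such as 2-2-1 is possible, so any reimplementation needs an explicit rule for that case.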


Topics

Radiology, Internship and Residency, Educational Measurement, Journal Article
