Performance of GPT-4 Turbo and GPT-4o in Korean Society of Radiology In-Training Examinations.

Authors

Choi A, Kim HG, Choi MH, Ramasamy SK, Kim Y, Jung SE

Affiliations (4)

  • Department of Radiology, Eunpyeong St. Mary's Hospital, College of Medicine, The Catholic University of Korea, Seoul, Republic of Korea.
  • Department of Radiology, Eunpyeong St. Mary's Hospital, College of Medicine, The Catholic University of Korea, Seoul, Republic of Korea. [email protected].
  • Department of Radiology, Molecular Imaging Program at Stanford, Stanford University School of Medicine, Stanford, CA, USA.
  • Department of Diagnostic Radiology, Dankook University Hospital, Cheonan, Republic of Korea.

Abstract

Despite the potential of large language models for radiology training, their ability to handle image-based radiological questions remains poorly understood. This study aimed to evaluate the performance of GPT-4 Turbo and GPT-4o on radiology resident examinations, to analyze differences across question types, and to compare their results with those of residents at different training levels. A total of 776 multiple-choice questions from the Korean Society of Radiology In-Training Examinations were used, forming two question sets: one originally written in Korean and the other translated into English. We evaluated the performance of GPT-4 Turbo (gpt-4-turbo-2024-04-09) and GPT-4o (gpt-4o-2024-11-20) on these questions with the temperature set to zero, determining accuracy based on the majority vote from five independent trials. We analyzed the results by question type (text-only vs. image-based) and benchmarked them against nationwide radiology residents' performance. The impact of the input language (Korean or English) on model performance was also examined. GPT-4o outperformed GPT-4 Turbo on both image-based (48.2% vs. 41.8%, P = 0.002) and text-only questions (77.9% vs. 69.0%, P = 0.031). On image-based questions, GPT-4 Turbo and GPT-4o performed comparably to 1st-year residents (41.8% and 48.2%, respectively, vs. 43.3%; P = 0.608 and 0.079, respectively) but worse than 2nd- to 4th-year residents (vs. 56.0%-63.9%, all P ≤ 0.005). On text-only questions, GPT-4 Turbo and GPT-4o performed better than residents of all years (69.0% and 77.9%, respectively, vs. 44.7%-57.5%, all P ≤ 0.039). Performance on the English- and Korean-version questions did not differ significantly for either model (all P ≥ 0.275). In summary, GPT-4o outperformed GPT-4 Turbo across all question types. On image-based questions, both models matched 1st-year residents but fell short of higher-year residents, whereas on text-only questions both models outperformed residents of all years. Both models performed consistently across English and Korean inputs.
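The evaluation protocol described above (each question posed at temperature zero, with accuracy scored on the majority vote over five independent trials) can be sketched roughly as follows. This is a minimal illustration assuming the OpenAI chat completions API; the prompt wording, answer parsing, and function names are hypothetical and not the authors' actual code, and image-based questions would additionally require attaching the image as vision input.

```python
# Sketch of the abstract's protocol: query a model five times at
# temperature 0 and take the majority-vote answer letter.
# Prompt format and answer extraction are illustrative assumptions.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MODELS = ["gpt-4-turbo-2024-04-09", "gpt-4o-2024-11-20"]  # versions from the study
N_TRIALS = 5


def majority_vote_answer(model: str, question: str) -> str:
    """Ask `model` the same multiple-choice question N_TRIALS times
    and return the most common single-letter answer."""
    answers = []
    for _ in range(N_TRIALS):
        resp = client.chat.completions.create(
            model=model,
            temperature=0,  # per the study's setup
            messages=[
                {"role": "system",
                 "content": "Answer the multiple-choice question with a single option letter."},
                {"role": "user", "content": question},
            ],
        )
        # Naive parsing: take the first character of the reply as the answer letter.
        answers.append(resp.choices[0].message.content.strip()[:1].upper())
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    question = "..."  # placeholder for one exam question with options A-E
    for m in MODELS:
        print(m, majority_vote_answer(m, question))
```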

Topics

Radiology, Internship and Residency, Educational Measurement, Journal Article
