Evaluating the Performance of Reasoning Large Language Models on Japanese Radiology Board Examination Questions.

May 17, 2025

DOI: 10.1016/j.acra.2025.04.060 PMID: 40383659

Authors

Nakaura T, Takamure H, Kobayashi N, Shiraishi K, Yoshida N, Nagayama Y, Uetani H, Kidoh M, Funama Y, Hirai T

Affiliations (3)

Department of Diagnostic Radiology, Graduate School of Medical Sciences, Kumamoto University, Honjo 1-1-1, Chuo-ku, Kumamoto-shi, Kumamoto 860-8556, Japan (T.N., H.T., N.K., K.S., N.Y., Y.N., H.U., M.K., T.H.). Electronic address: [email protected].
Department of Diagnostic Radiology, Graduate School of Medical Sciences, Kumamoto University, Honjo 1-1-1, Chuo-ku, Kumamoto-shi, Kumamoto 860-8556, Japan (T.N., H.T., N.K., K.S., N.Y., Y.N., H.U., M.K., T.H.).
Department of Medical Physics, Faculty of Life Sciences, Kumamoto University, Kumamoto, Japan (Y.F.).

Abstract

This study evaluates the performance, cost, and processing time of OpenAI's reasoning large language models (LLMs) (o1-preview, o1-mini) and their base models (GPT-4o, GPT-4o-mini) on Japanese radiology board examination questions. A total of 210 questions from the 2022-2023 official board examinations of the Japan Radiological Society were presented to each of the four LLMs. Performance was evaluated by calculating the percentage of correctly answered questions within six predefined radiology subspecialties. The total cost and processing time for each model were also recorded. The McNemar test was used to assess the statistical significance of differences in accuracy between paired model responses. The o1-preview achieved the highest accuracy (85.7%), significantly outperforming GPT-4o (73.3%, P<.001). Similarly, o1-mini (69.5%) performed significantly better than GPT-4o-mini (46.7%, P<.001). Across all radiology subspecialties, o1-preview consistently ranked highest. However, reasoning models incurred substantially higher costs (o1-preview: $17.10, o1-mini: $2.58) compared to their base counterparts (GPT-4o: $0.496, GPT-4o-mini: $0.04), and their processing times were approximately 3.7 and 1.2 times longer, respectively. Reasoning LLMs demonstrated markedly superior performance in answering radiology board exam questions compared to their base models, albeit at a substantially higher cost and increased processing time.
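The paired comparison described in the abstract can be illustrated with a minimal sketch of McNemar's test. The abstract does not say whether the exact (binomial) or the chi-square form was used, and it does not report the per-question outcomes, so the exact form and the ten-question answer lists below are assumptions chosen only for illustration. The function names `paired_counts` and `mcnemar_exact` are likewise hypothetical.

```python
from math import comb


def paired_counts(correct_a, correct_b):
    """Count discordant pairs between two models graded on the same questions.

    Returns (b, c), where b is the number of questions only model A got right
    and c is the number only model B got right.
    """
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if not x and y)
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant pair counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis each discordant pair favours either model
    # with probability 0.5, so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical per-question outcomes (True = correct); NOT data from the study.
reasoning = [True, True, True, False, True, True, True, True, False, True]
base = [True, False, True, False, False, True, False, True, True, False]

b, c = paired_counts(reasoning, base)
p_value = mcnemar_exact(b, c)
print(f"discordant pairs: b={b}, c={c}, p={p_value:.3f}")
```

Only the questions on which the two models disagree enter the test. That is why a paired test suits this design: all four models answered the same 210 questions, so their accuracies are not independent samples.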


Topics

Journal Article
