Evaluating the Performance of Reasoning Large Language Models on Japanese Radiology Board Examination Questions.

Authors

Nakaura T, Takamure H, Kobayashi N, Shiraishi K, Yoshida N, Nagayama Y, Uetani H, Kidoh M, Funama Y, Hirai T

Affiliations (3)

  • Department of Diagnostic Radiology, Graduate School of Medical Sciences, Kumamoto University, Honjo 1-1-1, Chuo-ku, Kumamoto-shi, Kumamoto 860-8556, Japan (T.N., H.T., N.K., K.S., N.Y., Y.N., H.U., M.K., T.H.). Electronic address: [email protected].
  • Department of Diagnostic Radiology, Graduate School of Medical Sciences, Kumamoto University, Honjo 1-1-1, Chuo-ku, Kumamoto-shi, Kumamoto 860-8556, Japan (T.N., H.T., N.K., K.S., N.Y., Y.N., H.U., M.K., T.H.).
  • Department of Medical Physics, Faculty of Life Sciences, Kumamoto University, Kumamoto, Japan (Y.F.).

Abstract

This study evaluates the performance, cost, and processing time of OpenAI's reasoning large language models (LLMs) (o1-preview, o1-mini) and their base models (GPT-4o, GPT-4o-mini) on Japanese radiology board examination questions. A total of 210 questions from the 2022-2023 official board examinations of the Japan Radiological Society were presented to each of the four LLMs. Performance was evaluated as the percentage of correctly answered questions, overall and within six predefined radiology subspecialties. The total cost and processing time for each model were also recorded. The McNemar test was used to assess the statistical significance of differences in accuracy between paired model responses. o1-preview achieved the highest accuracy (85.7%), significantly outperforming GPT-4o (73.3%, P<.001). Similarly, o1-mini (69.5%) performed significantly better than GPT-4o-mini (46.7%, P<.001). o1-preview ranked highest in every radiology subspecialty. However, the reasoning models incurred substantially higher costs (o1-preview: $17.10, o1-mini: $2.58) than their base counterparts (GPT-4o: $0.496, GPT-4o-mini: $0.04), and their processing times were approximately 3.7 and 1.2 times longer, respectively. Reasoning LLMs demonstrated markedly superior performance in answering radiology board exam questions compared with their base models, albeit at a substantially higher cost and with increased processing time.
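The paired comparison described in the abstract can be illustrated with a minimal sketch of an exact (binomial) McNemar test, which depends only on the questions where the two models disagree. The abstract does not say which variant of the test the authors used, and it does not report the discordant counts, so the function names and the counts in the usage example are hypothetical and serve only to show the mechanics. The second helper simply divides the reported total costs to express how many times more expensive each reasoning model was than its base model.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: questions model A answered correctly and model B incorrectly
    c: questions model B answered correctly and model A incorrectly
    Questions where both models agree do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) tail probability of seeing k or fewer in the smaller cell
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def cost_ratio(reasoning_usd: float, base_usd: float) -> float:
    """How many times more the reasoning model cost than its base model."""
    return reasoning_usd / base_usd


if __name__ == "__main__":
    # HYPOTHETICAL discordant counts, not taken from the study.
    b, c = 34, 8
    print(f"exact McNemar p-value: {mcnemar_exact(b, c):.2g}")

    # Total costs as reported in the abstract (USD).
    print(f"o1-preview vs GPT-4o: {cost_ratio(17.10, 0.496):.1f}x")
    print(f"o1-mini vs GPT-4o-mini: {cost_ratio(2.58, 0.04):.1f}x")
```

With the reported totals, the cost ratios come out to roughly 34 times for o1-preview and 64 times for o1-mini, which puts the accuracy gains of 12.4 and 22.8 percentage points in context.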

Topics

Journal Article