Performance of State-of-the-Art Multimodal Large Language Models on an Image-Rich Radiology Board Examination: Comparison to Human Examinees.
Authors
Affiliations (4)
- Department of Central Radiology, Kumamoto University Hospital, 1-1-1 Honjo, Kumamoto 860-8556, Japan (T.N., N.K., Y.N., H.U., M.K., S.O., T.H.). Electronic address: [email protected].
- Department of Central Radiology, Kumamoto University Hospital, 1-1-1 Honjo, Kumamoto 860-8556, Japan (T.N., N.K., Y.N., H.U., M.K., S.O., T.H.).
- Department of Radiological Technology, Faculty of Health Science and Technology, Kawasaki University of Medical Welfare, Okayama, Japan (T.M.).
- Department of Medical Physics, Faculty of Life Sciences, Kumamoto University, Kumamoto, Japan (Y.F.).
Abstract
This study aimed to assess the current multimodal capabilities of leading multimodal large language models (MLLMs) using a 2024 radiology board examination, evaluate their proficiency in utilizing medical image content, compare their performance against human examinees, and consider their cost-effectiveness.

Six contemporary MLLMs (GPT-4.1, o3, Claude 3.7 Sonnet, Claude 3.7 Sonnet-thinking, Gemini 2.5 Pro Preview, and Gemini 2.5 Flash Preview-thinking) were evaluated using the 100 multiple-choice questions (96 image-based) from the 2024 official board examination of the Japan Radiological Society. The MLLMs were instructed to translate the questions, originally in Japanese, into English. For certain models, performance was also analyzed with and without images to assess multimodal utility.

Gemini 2.5 Pro Preview achieved the highest accuracy (76.0%), followed by o3 (75.0%), both surpassing the average human examinee score (72.9%). Gemini 2.5 Pro Preview showed 75.0% accuracy with images versus 63.5% without (p = 0.035), and Gemini 2.5 Flash Preview-thinking demonstrated 68.8% accuracy with images versus 57.3% without (p = 0.019), indicating significant performance gains with image inclusion. Notably, Gemini models demonstrated top-tier performance at a highly competitive cost.

The latest generation of MLLMs, particularly Gemini 2.5 Pro Preview and o3, can exceed average human performance on radiology board examinations and effectively leverage image information. The Gemini series, in particular, shows rapid improvements and offers a compelling combination of high performance and cost-efficiency for potential applications in radiology.

Modern multimodal large language models, notably Gemini 2.5 Pro Preview and o3, surpassed average human performance on the 2024 Japanese Radiology Board Examination. Gemini models showed significant score improvements when utilizing image data and offer top-tier performance at a competitive cost, indicating rapid advancements and excellent cost-effectiveness for radiology applications.
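
The with-image versus without-image comparison is a paired design, since each model answers the same questions twice. The abstract does not say which statistical test produced the reported p-values. The sketch below assumes an exact McNemar test, a common choice for paired correct/incorrect outcomes, and may not match the authors' method. The helper names `correct_count` and `mcnemar_exact_p` are hypothetical, and the discordant counts in the usage line are placeholders, because the abstract reports only overall accuracies.

```python
from math import comb


def correct_count(accuracy_pct: float, n_questions: int) -> int:
    """Recover the number of correct answers from a rounded percentage."""
    return round(accuracy_pct / 100 * n_questions)


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts.

    b: questions answered correctly with images but incorrectly without.
    c: questions answered incorrectly with images but correctly without.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Accuracies from the abstract, assuming they refer to the 96 image-based questions.
with_images = correct_count(75.0, 96)     # 72 correct
without_images = correct_count(63.5, 96)  # 61 correct

# Hypothetical discordant counts: only their difference (72 - 61 = 11)
# is implied by the abstract; the split itself is not reported.
p_value = mcnemar_exact_p(b=16, c=5)
```

Only the discordant pairs enter the test, which is why two models with the same accuracy gap can yield different p-values.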