Sort by:
Page 1 of 12 results

Assessing accuracy and legitimacy of multimodal large language models on Japan Diagnostic Radiology Board Examination

Hirano, Y., Miki, S., Yamagishi, Y., Hanaoka, S., Nakao, T., Kikuchi, T., Nakamura, Y., Nomura, Y., Yoshikawa, T., Abe, O.

medrxiv logopreprintJun 23 2025
PurposeTo assess and compare the accuracy and legitimacy of multimodal large language models (LLMs) on the Japan Diagnostic Radiology Board Examination (JDRBE). Materials and methodsThe dataset comprised questions from JDRBE 2021, 2023, and 2024, with ground-truth answers established through consensus among multiple board-certified diagnostic radiologists. Questions without associated images and those lacking unanimous agreement on answers were excluded. Eight LLMs were evaluated: GPT-4 Turbo, GPT-4o, GPT-4.5, GPT-4.1, o3, o4-mini, Claude 3.7 Sonnet, and Gemini 2.5 Pro. Each model was evaluated under two conditions: with inputting images (vision) and without (text-only). Performance differences between the conditions were assessed using McNemars exact test. Two diagnostic radiologists (with 2 and 18 years of experience) independently rated the legitimacy of responses from four models (GPT-4 Turbo, Claude 3.7 Sonnet, o3, and Gemini 2.5 Pro) using a five-point Likert scale, blinded to model identity. Legitimacy scores were analyzed using Friedmans test, followed by pairwise Wilcoxon signed-rank tests with Holm correction. ResultsThe dataset included 233 questions. Under the vision condition, o3 achieved the highest accuracy at 72%, followed by o4-mini (70%) and Gemini 2.5 Pro (70%). Under the text-only condition, o3 topped the list with an accuracy of 67%. Addition of image input significantly improved the accuracy of two models (Gemini 2.5 Pro and GPT-4.5), but not the others. Both o3 and Gemini 2.5 Pro received significantly higher legitimacy scores than GPT-4 Turbo and Claude 3.7 Sonnet from both raters. ConclusionRecent multimodal LLMs, particularly o3 and Gemini 2.5 Pro, have demonstrated remarkable progress on JDRBE questions, reflecting their rapid evolution in diagnostic radiology. Secondary abstract Eight multimodal large language models were evaluated on the Japan Diagnostic Radiology Board Examination. OpenAIs o3 and Google DeepMinds Gemini 2.5 Pro achieved high accuracy rates (72% and 70%) and received good legitimacy scores from human raters, demonstrating steady progress.

Deep learning-enabled MRI phenotyping uncovers regional body composition heterogeneity and disease associations in two European population cohorts

Mertens, C. J., Haentze, H., Ziegelmayer, S., Kather, J. N., Truhn, D., Kim, S. H., Busch, F., Weller, D., Wiestler, B., Graf, M., Bamberg, F., Schlett, C. L., Weiss, J. B., Ringhof, S., Can, E., Schulz-Menger, J., Niendorf, T., Lammert, J., Molwitz, I., Kader, A., Hering, A., Meddeb, A., Nawabi, J., Schulze, M. B., Keil, T., Willich, S. N., Krist, L., Hadamitzky, M., Hannemann, A., Bassermann, F., Rueckert, D., Pischon, T., Hapfelmeier, A., Makowski, M. R., Bressem, K. K., Adams, L. C.

medrxiv logopreprintJun 6 2025
Body mass index (BMI) does not account for substantial inter-individual differences in regional fat and muscle compartments, which are relevant for the prevalence of cardiometabolic and cancer conditions. We applied a validated deep learning pipeline for automated segmentation of whole-body MRI scans in 45,851 adults from the UK Biobank and German National Cohort, enabling harmonized quantification of visceral (VAT), gluteofemoral (GFAT), and abdominal subcutaneous adipose tissue (ASAT), liver fat fraction (LFF), and trunk muscle volume. Associations with clinical conditions were evaluated using compartment measures adjusted for age, sex, height, and BMI. Our analysis demonstrates that regional adiposity and muscle volume show distinct associations with cardiometabolic and cancer prevalence, and that substantial disease heterogeneity exists within BMI strata. The analytic framework and reference data presented here will support future risk stratification efforts and facilitate the integration of automated MRI phenotyping into large-scale population and clinical research.
Page 1 of 12 results
Show
per page
1

Ready to Sharpen Your Edge?

Join hundreds of your peers who rely on RadAI Slice. Get the essential weekly briefing that empowers you to navigate the future of radiology.

We respect your privacy. Unsubscribe at any time.