Zero-shot multimodal large language models underperform a domain-trained CNN baseline in pediatric wrist fracture detection.

June 17, 2026

DOI: 10.1038/s41598-026-58763-w PMID: 42310365

Authors

Haupt M, Weiß D, Bellersen T, Maurer MH

Affiliations (2)

Department of Diagnostic and Interventional Radiology, Carl von Ossietzky Universität Oldenburg, Oldenburg, Germany.
Department of Diagnostic and Interventional Radiology, Carl von Ossietzky Universität Oldenburg, Oldenburg, Germany.

Abstract

Multimodal large language models (LLMs) that process text and images are increasingly discussed for medical imaging, yet their diagnostic performance on radiographs remains poorly characterized. We evaluated three commercially available multimodal LLMs (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro) in a strict zero-shot setting for pediatric wrist fracture detection and compared them with a domain-trained Inception v3 convolutional neural network (CNN) on the same GRAZPEDWRI-DX dataset. We constructed a balanced patient-level test cohort of 1,000 children (2,298 radiographs; 500 fracture, 500 non-fracture). The CNN achieved high diagnostic performance (AUROC 0.905, AUPRC 0.920), whereas all LLMs performed close to chance (accuracies < 0.55, Matthews correlation coefficients ≈ 0) and produced bounding boxes often inconsistent with expert annotations. These findings indicate that, in a strict zero-shot setting, the three commercial multimodal LLMs evaluated here lack reliable diagnostic ability for pediatric wrist fracture detection and should therefore be regarded as exploratory research tools rather than clinically dependable systems for pediatric radiograph interpretation.
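The abstract reports accuracy alongside the Matthews correlation coefficient (MCC). A minimal sketch of how the two relate on a balanced binary cohort is shown below. The confusion-matrix counts are hypothetical values chosen for illustration only; they are not results from the study, and the function names are assumptions, not the authors' code.

```python
import math


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all cases classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (chance level)
    to +1 (perfect agreement). Returns 0.0 when the denominator is zero.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical counts for a balanced 500 fracture / 500 non-fracture cohort.
# These are NOT taken from the paper; they only illustrate near-chance behavior.
near_chance = dict(tp=270, fp=250, tn=250, fn=230)

print(f"accuracy = {accuracy(**near_chance):.3f}")
print(f"MCC      = {mcc(**near_chance):.3f}")
```

With these illustrative counts the accuracy sits just above 0.5 while the MCC stays near zero, which is the pattern the abstract describes for a classifier performing close to chance on a balanced test set.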

Topics

Wrist Fractures, Journal Article
