
Zero-shot multimodal large language models underperform a domain-trained CNN baseline in pediatric wrist fracture detection.

June 17, 2026

Authors

Haupt M, Weiß D, Bellersen T, Maurer MH

Affiliations (2)

  • Department of Diagnostic and Interventional Radiology, Carl von Ossietzky Universität Oldenburg, Oldenburg, Germany.
  • Department of Diagnostic and Interventional Radiology, Carl von Ossietzky Universität Oldenburg, Oldenburg, Germany.

Abstract

Multimodal large language models (LLMs) that process text and images are increasingly discussed for medical imaging, yet their diagnostic performance on radiographs remains poorly characterized. We evaluated three commercially available multimodal LLMs (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro) in a strict zero-shot setting for pediatric wrist fracture detection and compared them with a domain-trained Inception v3 convolutional neural network (CNN) on the same GRAZPEDWRI-DX dataset. We constructed a balanced patient-level test cohort of 1,000 children (2,298 radiographs; 500 fracture, 500 non-fracture). The CNN achieved high diagnostic performance (AUROC 0.905, AUPRC 0.920), whereas all LLMs performed close to chance (accuracies < 0.55, Matthews correlation coefficients ≈ 0) and produced bounding boxes often inconsistent with expert annotations. These findings indicate that, in a strict zero-shot setting, the three commercial multimodal LLMs evaluated here lack reliable diagnostic ability for pediatric wrist fracture detection and should therefore be regarded as exploratory research tools rather than clinically dependable systems for pediatric radiograph interpretation.
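The abstract reports accuracy alongside the Matthews correlation coefficient (MCC). The sketch below shows how the two relate on a balanced cohort and why an MCC near 0 signals chance-level behavior. The function names and all confusion-matrix counts are illustrative assumptions; they are not taken from the study and do not reproduce its evaluation code.

```python
import math


def accuracy(tp, fp, tn, fn):
    """Fraction of all cases classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def matthews_corrcoef(tp, fp, tn, fn):
    """MCC from confusion-matrix counts.

    Returns 0.0 when a row or column of the confusion matrix is empty,
    which is the usual convention for the undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for a balanced 500 fracture / 500 non-fracture cohort.
# These numbers are made up for illustration only.
near_chance = dict(tp=260, fn=240, tn=250, fp=250)
always_fracture = dict(tp=500, fn=0, tn=0, fp=500)

for name, counts in [("near chance", near_chance),
                     ("always 'fracture'", always_fracture)]:
    print(f"{name}: accuracy={accuracy(**counts):.3f}, "
          f"MCC={matthews_corrcoef(**counts):.3f}")
```

With these made-up counts, a classifier that labels every image as a fracture still reaches 0.5 accuracy on a balanced cohort, while its MCC is 0. This is why MCC is a useful companion to accuracy when judging whether a model performs better than guessing.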

Topics

Wrist Fractures, Journal Article
