Zero-shot performance of a general-purpose vision-language model for pediatric appendicitis diagnosis.
Authors
Affiliations (4)
- Department of Pediatrics, Haydarpasa Numune Training and Research Hospital, Istanbul, Turkey.
- Department of Radiology, Uskudar State Hospital, Barbaros, Uskudar, Istanbul, Turkey.
- Division of Diagnostic and Interventional Neuroradiology, Department of Radiology, University Hospital of Basel, Basel, Switzerland.
- Department of Pediatric Radiology, University Children's Hospital Basel, Basel, Switzerland.
Abstract
General-purpose vision-language models can analyze medical images without task-specific training, but their value for pediatric abdominal ultrasound is unknown. This study investigated the zero-shot performance of a general-purpose vision-language model in diagnosing pediatric appendicitis from multimodal inputs, including images, report text, and clinical information. In this retrospective study, the diagnostic capabilities of Llama 4 Maverick were evaluated on the Regensburg Pediatric Appendicitis Dataset. Five experiments were conducted using the following inputs and input combinations: images only, ultrasound report text only, images plus ultrasound report text, images plus clinical data, and the entire multimodal input. The reference standard for appendicitis was histopathology in patients who underwent surgical resection and clinical follow-up in patients managed conservatively. Performance was assessed at the patient level using accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and F1-score. Area under the receiver operating characteristic curve (AUROC) was evaluated secondarily as a measure of discrimination. Of 782 patients, 294 met the inclusion criteria; based on data availability, four experiments were conducted using 293 patients and the fifth using 284. The images-only experiment showed very low specificity (13.5%) and the poorest discrimination (AUROC, 0.567). Ultrasound report text improved classification performance, with specificity increasing to 63.5% while high sensitivity was maintained; discrimination was also strong (AUROC, 0.883). Adding images to the ultrasound report text, or combining all available inputs, did not yield further meaningful improvement. Differences between the images-only experiment and the text-containing experiments were statistically significant (P≤0.002). Zero-shot visual interpretation of pediatric abdominal ultrasound by a general-purpose vision-language model is inadequate for safe appendicitis diagnosis. Performance is driven primarily by the structured sonographic imaging report, supporting a role for such models as text-based decision support tools rather than autonomous ultrasound image interpreters.
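The patient-level metrics named in the abstract all follow from a single 2x2 confusion matrix. The minimal sketch below uses the standard definitions; it is not the authors' analysis code. The names `ConfusionCounts` and `patient_level_metrics` are hypothetical, and the example counts are invented purely for illustration and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Patient-level 2x2 counts (appendicitis = positive class)."""
    tp: int  # appendicitis, predicted appendicitis
    fp: int  # no appendicitis, predicted appendicitis
    tn: int  # no appendicitis, predicted no appendicitis
    fn: int  # appendicitis, predicted no appendicitis


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def patient_level_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Compute the six classification metrics listed in the abstract."""
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "accuracy": _ratio(c.tp + c.tn, total),
        "sensitivity": _ratio(c.tp, c.tp + c.fn),
        "specificity": _ratio(c.tn, c.tn + c.fp),
        "ppv": _ratio(c.tp, c.tp + c.fp),
        "npv": _ratio(c.tn, c.tn + c.fn),
        # F1 is the harmonic mean of PPV (precision) and sensitivity (recall).
        "f1": _ratio(2 * c.tp, 2 * c.tp + c.fp + c.fn),
    }


# Illustrative counts only (assumption, NOT data from the study).
example = ConfusionCounts(tp=90, fp=40, tn=60, fn=10)
metrics = patient_level_metrics(example)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

A classifier that labels nearly every patient as positive keeps sensitivity high while specificity collapses, because `tn` stays small relative to `fp`. This is the pattern the abstract reports for the images-only experiment.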