Cautionary lessons from real-world testing of GPT-4.1 AI for pediatric foreign body aspiration.

November 23, 2025

Authors

Hack S, Attal R, Elazar D, Alon Y, Meyuchas R, Livne A, Madgar O, Saban M

Affiliations (9)

  • City St. George's University London School of Medicine, Program Delivered by University of Nicosia at the Chaim Sheba Medical Center, Ramat Gan, Israel.
  • City St. George's University London School of Medicine, Program Delivered by University of Nicosia at the Chaim Sheba Medical Center, Ramat Gan, Israel.
  • Touro College of Osteopathic Medicine - Montana, Great Falls, MT, USA.
  • Nursing Department, The Stanley Steyer School of Health Professions, Faculty of Medical and Health Sciences, Tel Aviv University, Tel Aviv, Ramat Aviv, 69978, Israel.
  • Department of Otolaryngology, Head and Neck Surgery, Sheba Medical Center, Tel Hashomer, Israel.
  • Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel.
  • DeepVision Lab, Chaim Sheba Medical Center, Emek Haela St. 1, Ramat Gan, 52621, Israel.
  • Department of Diagnostic Imaging, Chaim Sheba Medical Center, Emek Haela St. 1, Ramat Gan, 52621, Israel.
  • Gray Faculty of Medical & Health Sciences, Tel-Aviv University, Tel-Aviv, Israel.

Abstract

This study evaluated the feasibility and diagnostic performance of a multimodal large language model (GPT-4.1) in detecting pediatric airway foreign body aspiration (FBA) using real-world clinical and radiographic data.

This retrospective cohort study included 58 pediatric patients evaluated for suspected airway FBA at a tertiary academic hospital between 2015 and 2024. Each case combined structured clinical data and chest radiographs obtained at the time of emergency-department presentation, with bronchoscopy serving as the diagnostic reference standard. GPT-4.1, a vision-enabled large language model, classified each case as right-bronchus aspiration, left-bronchus aspiration, or no aspiration. Model performance was assessed using accuracy, precision, recall, and F1-score.

The model achieved an overall accuracy of 62.3%, with a precision of 23.3%, a recall of 19.0%, and an F1-score of 0.21. Although it correctly identified 34 of 46 cases without aspiration, it detected only 4 of 12 bronchoscopy-confirmed aspiration cases and missed all left-bronchus aspirations.

This proof-of-concept feasibility study highlights substantial limitations of a general-purpose multimodal AI model in pediatric airway triage. The low recall and high misclassification rate suggest that vision-enabled language models require task-specific training and rigorous validation before clinical implementation. Nevertheless, when used as an adjunct to, rather than a replacement for, bronchoscopy, such models may eventually support triage decisions in resource-limited settings if further optimized and prospectively validated.
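The reported figures can be partly checked by hand. The following is a minimal Python sketch, not the authors' evaluation code: it recomputes the F1-score from the reported precision and recall using the standard harmonic-mean definition, and derives a specificity and detection rate from the reported counts. Collapsing the three classes into a binary "aspiration vs. no aspiration" view is an assumption made here for illustration; the abstract does not state how the three-class metrics were averaged, so the binary figures are not expected to reproduce the reported accuracy, precision, or recall. All function and variable names are hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Standard F1 definition: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
reported_precision = 0.233
reported_recall = 0.190
f1 = f1_score(reported_precision, reported_recall)

# Counts as reported in the abstract. Treating them as a binary
# aspiration / no-aspiration split is an assumption of this sketch.
correct_negatives, total_negatives = 34, 46
detected_positives, total_positives = 4, 12

# Share of non-aspiration cases correctly labelled as such.
specificity = correct_negatives / total_negatives
# Share of bronchoscopy-confirmed aspirations the model detected.
detection_rate = detected_positives / total_positives

print(f"F1 from reported precision/recall: {f1:.2f}")
print(f"Binary specificity: {specificity:.3f}")
print(f"Binary detection rate: {detection_rate:.3f}")
```

The recomputed F1 rounds to the reported 0.21, which suggests the F1-score was derived from the same precision and recall values rather than from a separate averaging step.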

Topics

Journal Article
