Cautionary lessons from real-world testing of GPT-4.1 AI for pediatric foreign body aspiration.

November 23, 2025

DOI: 10.1007/s00405-025-09856-1 PMID: 41276667

Authors

Hack S, Attal R, Elazar D, Alon Y, Meyuchas R, Livne A, Madgar O, Saban M

Affiliations (9)

City St. George's University London School of Medicine, Program Delivered by University of Nicosia at the Chaim Sheba Medical Center, Ramat Gan, Israel.
City St. George's University London School of Medicine, Program Delivered by University of Nicosia at the Chaim Sheba Medical Center, Ramat Gan, Israel.
Touro College of Osteopathic Medicine - Montana, Great Falls, MT, USA.
Nursing Department, The Stanley Steyer School of Health Professions, Faculty of Medical and Health Sciences, Tel Aviv University, Tel Aviv, Ramat Aviv, 69978, Israel.
Department of Otolaryngology, Head and Neck Surgery, Sheba Medical Center, Tel Hashomer, Israel.
Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel.
DeepVision Lab, Chaim Sheba Medical Center, Emek Haela St. 1, Ramat Gan, 52621, Israel.
Department of Diagnostic Imaging, Chaim Sheba Medical Center, Emek Haela St. 1, Ramat Gan, 52621, Israel.
Gray Faculty of Medical & Health Sciences, Tel-Aviv University, Tel-Aviv, Israel.

Abstract

The objective was to evaluate the feasibility and diagnostic performance of a multimodal large language model (GPT-4.1) in detecting pediatric airway foreign body aspiration (FBA) using real-world clinical and radiographic data.

This retrospective cohort study included 58 pediatric patients evaluated for suspected airway FBA at a tertiary academic hospital between 2015 and 2024. Each case combined structured clinical data and chest radiographs obtained at the time of emergency-department presentation, with bronchoscopy serving as the diagnostic reference standard. GPT-4.1, a vision-enabled large language model, classified cases as right-bronchus aspiration, left-bronchus aspiration, or no aspiration. Model performance was assessed using accuracy, precision, recall, and F1-score.

The model achieved an overall accuracy of 62.3%, with precision of 23.3%, recall of 19.0%, and an F1-score of 0.21. It correctly identified 34 of 46 cases without aspiration, but it detected only 4 of 12 confirmed bronchial-aspiration cases and missed all left-bronchus aspirations.

This proof-of-concept feasibility study highlights substantial limitations of a general-purpose multimodal AI model in pediatric airway triage. The low recall and high misclassification rates suggest that vision-enabled language models require task-specific training and rigorous validation before clinical implementation. Nevertheless, if further optimized and prospectively validated, such models may eventually support triage decisions in resource-limited settings when used as an adjunct to bronchoscopy rather than a replacement for it.
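
As a rough aid to reading the reported figures, the sketch below recomputes a few quantities from numbers stated in the abstract. It is not the authors' analysis code. The helper names (`proportion`, `f1_from`) are hypothetical, "detected" is assumed to mean any aspiration call among the 12 bronchoscopy-confirmed cases, and the F1-score is assumed to be the harmonic mean of the reported aggregate precision and recall, since the abstract does not state how the three classes were averaged.

```python
def proportion(hits: int, total: int) -> float:
    """Return hits / total, guarding against an empty denominator."""
    if total <= 0:
        raise ValueError("total must be positive")
    return hits / total


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Counts stated in the abstract.
specificity = proportion(34, 46)  # no-aspiration cases correctly identified
sensitivity = proportion(4, 12)   # confirmed aspirations detected (assumed reading)

# Aggregate precision and recall as reported; the averaging scheme is an assumption.
f1 = f1_from(0.233, 0.190)

print(f"specificity = {specificity:.3f}")
print(f"sensitivity = {sensitivity:.3f}")
print(f"F1          = {f1:.2f}")
```

Under these assumptions the script prints a specificity of about 0.739, a sensitivity of about 0.333, and an F1 of 0.21, the last of which agrees with the reported F1-score.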

Topics

Journal Article
