Cautionary lessons from real-world testing of GPT-4.1 AI for pediatric foreign body aspiration.
Authors
Affiliations (9)
- City St. George's University London School of Medicine, Program Delivered by University of Nicosia at the Chaim Sheba Medical Center, Ramat Gan, Israel. [email protected].
- City St. George's University London School of Medicine, Program Delivered by University of Nicosia at the Chaim Sheba Medical Center, Ramat Gan, Israel.
- Touro College of Osteopathic Medicine - Montana, MT, Great Falls, USA.
- Nursing Department, The Stanley Steyer School of Health Professions, Faculty of Medical and Health Sciences, Tel Aviv University, Tel Aviv, Ramat Aviv, 69978, Israel.
- Department of Otolaryngology, Head and Neck Surgery, Sheba Medical Center, Tel Hashomer, Israel.
- Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel.
- DeepVision Lab, Chaim Sheba Medical Center, Emek Haela St. 1, Ramat Gan, 52621, Israel.
- Department of Diagnostic Imaging, Chaim Sheba Medical Center, Emek Haela St. 1, Ramat Gan, 52621, Israel.
- Gray Faculty of Medical & Health Sciences, Tel-Aviv University, Tel-Aviv, Israel.
Abstract
This study evaluated the feasibility and diagnostic performance of a multimodal large language model (GPT-4.1) in detecting pediatric airway foreign body aspiration (FBA) using real-world clinical and radiographic data. This retrospective cohort study included 58 pediatric patients evaluated for suspected airway FBA at a tertiary academic hospital between 2015 and 2024. Each case combined structured clinical data and chest radiographs obtained at the time of emergency-department presentation, with bronchoscopy serving as the diagnostic reference standard. GPT-4.1, a vision-enabled large language model, classified each case as right-bronchus aspiration, left-bronchus aspiration, or no aspiration. Model performance was assessed using accuracy, precision, recall, and F1-score. The model achieved an overall accuracy of 62.3%, with a precision of 23.3%, a recall of 19.0%, and an F1-score of 0.21. While it correctly identified 34 of 46 cases without aspiration, it detected only 4 of 12 bronchoscopy-confirmed aspiration cases and missed all left-bronchus aspirations. This proof-of-concept feasibility study highlights substantial limitations of a general-purpose multimodal AI model in pediatric airway triage. The low recall and high misclassification rate suggest that vision-enabled language models require task-specific training and rigorous validation before clinical implementation. Nevertheless, if further optimized and prospectively validated, such models may eventually support triage decisions in resource-limited settings when used as an adjunct to, rather than a replacement for, bronchoscopy.
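As an illustration of the metrics named in the abstract, the minimal Python sketch below shows how accuracy, precision, recall, and F1 relate to one another. It is not the authors' evaluation code. The function names are hypothetical, and the binary view (any aspiration versus no aspiration) is an assumption made here: it treats the 4 detected cases as true positives, the remaining 8 confirmed cases as false negatives, and the 12 misclassified non-aspiration cases as false positives. The abstract does not give the full three-class confusion matrix or the averaging scheme behind its reported figures.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }


# F1 recomputed from the precision and recall reported in the abstract.
reported_f1 = f1_score(0.233, 0.190)

# ASSUMPTION: binary collapse of the counts stated in the abstract
# (34 of 46 non-aspiration cases correct, 4 of 12 aspiration cases detected).
collapsed = binary_metrics(tp=4, fp=46 - 34, fn=12 - 4, tn=34)

print(f"F1 from reported precision/recall: {reported_f1:.2f}")
for name, value in collapsed.items():
    print(f"binary {name}: {value:.3f}")
```

The F1 value recomputed from the reported precision and recall agrees with the reported 0.21. The binary-collapsed values need not match the reported accuracy, precision, and recall, which presumably reflect the three-class task (right, left, none) under a scoring scheme the abstract does not specify.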