Evaluating GPT-4o for emergency disposition of complex respiratory cases with pulmonology consultation: a diagnostic accuracy study.
Yıldırım C, Aykut A, Günsoy E, Öncül MV
Oct 2 2025
Large language models (LLMs), such as GPT-4o, are increasingly investigated for clinical decision support in emergency medicine. However, their real-world performance in disposition prediction remains insufficiently studied. This study evaluated the diagnostic accuracy of GPT-4o in predicting emergency department (ED) disposition (discharge, ward admission, or intensive care unit (ICU) admission) in complex emergency respiratory cases requiring pulmonology consultation and chest CT. We conducted a retrospective observational study in a tertiary ED between November 2024 and February 2025. We included ED patients with complex respiratory presentations who underwent pulmonology consultation and chest CT, representing a selective high-acuity subgroup rather than the general ED respiratory population. GPT-4o was prompted to predict the most appropriate ED disposition using three progressively enriched input models: Model 1 (age, sex, oxygen saturation, home oxygen therapy, and venous blood gas parameters); Model 2 (Model 1 plus laboratory data); and Model 3 (Model 2 plus chest CT findings). Model performance was assessed using accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and F1 score. Among the 221 patients included, 69.2% were admitted to the ward, 9.0% to the ICU, and 21.7% were discharged. For hospital admission prediction, Model 3 demonstrated the highest sensitivity (91.9%) and overall accuracy (76.5%), but the lowest specificity (20.8%). Conversely, for discharge prediction, Model 3 achieved the highest specificity (91.9%) but the lowest sensitivity (20.8%). Numerical improvements were observed across models, but none reached statistical significance (all p > 0.22). Model 1 therefore performed comparably to Models 2 and 3 while requiring less input data.
Among patients who were discharged despite GPT-4o predicting admission, the 14-day ED re-presentation rates were 23.8% (5/21) for Model 1, 30.0% (9/30) for Model 2, and 28.9% (11/38) for Model 3. GPT-4o demonstrated high sensitivity in identifying ED patients requiring hospital admission, particularly those needing intensive care, when provided with progressively enriched clinical input. However, its low sensitivity for discharge prediction resulted in frequent overtriage, limiting its utility for autonomous decision-making. This proof-of-concept study demonstrates GPT-4o's capacity to stratify disposition decisions in complex respiratory cases from limited input data of varying richness. These findings should nonetheless be interpreted in light of key limitations, including the selective high-acuity cohort and the absence of vital signs, and require prospective validation before clinical implementation.
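
The reported Model 3 figures can be checked with a short calculation. The sketch below is illustrative only: the abstract does not give a confusion matrix, so the four cell counts are back-calculated assumptions from the reported percentages (221 patients, 21.7% discharged, 91.9% sensitivity) and from the 38 discharged patients for whom admission was predicted. The names `Confusion`, `metrics`, and `flip` are hypothetical helpers, not part of the study. PPV, NPV, and F1 computed here follow from those assumed counts and are not values reported in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """2x2 confusion matrix for one binary target (e.g. 'admitted')."""
    tp: int  # target present, predicted present
    fn: int  # target present, predicted absent
    fp: int  # target absent, predicted present
    tn: int  # target absent, predicted absent


def metrics(c: Confusion) -> dict[str, float]:
    """Standard diagnostic accuracy measures as fractions in [0, 1]."""
    total = c.tp + c.fn + c.fp + c.tn
    return {
        "accuracy": (c.tp + c.tn) / total,
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "ppv": c.tp / (c.tp + c.fp),
        "npv": c.tn / (c.tn + c.fn),
        "f1": 2 * c.tp / (2 * c.tp + c.fp + c.fn),
    }


def flip(c: Confusion) -> Confusion:
    """Same predictions, scored against the complementary target."""
    return Confusion(tp=c.tn, fn=c.fp, fp=c.fn, tn=c.tp)


# ASSUMED counts, reconstructed from the abstract rather than reported:
#   admitted   = 153 ward + 20 ICU = 173; 159 predicted admission, 14 not
#   discharged = 48; 38 predicted admission (the 11/38 denominator), 10 not
admission_model3 = Confusion(tp=159, fn=14, fp=38, tn=10)
discharge_model3 = flip(admission_model3)

for label, cm in [("admission", admission_model3),
                  ("discharge", discharge_model3)]:
    m = metrics(cm)
    print(label, {name: round(100 * value, 1) for name, value in m.items()})
```

Under these assumed counts, admission sensitivity, specificity, and accuracy round to the reported 91.9%, 20.8%, and 76.5%. Flipping the target shows why the discharge figures mirror the admission figures: sensitivity for one outcome is specificity for the other, so the two result sentences describe the same predictions.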