Large Language Models Provide Accurate but Potentially Unsafe Answers to Multimodal Critical Care Medicine Board Review Questions.
Authors
Affiliations (13)
- Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA.
- Department of Emergency Medicine, Emory University School of Medicine, Atlanta, GA.
- Emory Critical Care Center, Emory University School of Medicine, Atlanta, GA.
- Department of Medicine, Oregon Health and Science University School of Medicine, Portland, OR.
- Division of Pulmonary and Critical Care, Department of Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL.
- Division of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL.
- Department of Medicine, Division of Pulmonary, Allergy, Critical Care and Sleep Medicine, Emory University School of Medicine, Atlanta, GA.
- Department of Pharmacy, Emory University Hospital Midtown, Atlanta, GA.
- Department of Internal Medicine, Rush University, Chicago, IL.
- Department of Medicine, Tufts Medicine, Boston, MA.
- Department of Medicine, Institute for Critical Care Medicine, Icahn School of Medicine at Mount Sinai, New York, NY.
- Department of Medicine, Beth Israel Deaconess Medical Center, Boston, MA.
- Department of Surgery, Emory University School of Medicine, Atlanta, GA.
Abstract
Objectives: To evaluate the performance of Chat Generative Pre-Trained Transformer-4 Omni (ChatGPT-4o) in answering multimodal critical care board review questions, with a focus on accuracy, image interpretation, reasoning quality, and potential for harm.
Design: Observational study using a validated item bank of multiple-choice questions accompanied by clinical images, analyzed through a custom ChatGPT-4o profile built using a validated framework.
Setting: Simulated environment mimicking critical care board examination conditions, with artificial intelligence responses reviewed by a panel of experienced critical care clinicians.
Subjects: One hundred eighty-three board-style questions from the Society of Critical Care Medicine item bank, representing a range of critical care domains and imaging modalities. ChatGPT-4o was evaluated on its responses, which were assessed by 14 clinical reviewers (physicians, advanced practice providers, and pharmacists).
Interventions: None.
Measurements and Main Results: ChatGPT-4o answered 74.9% of questions correctly, higher than pooled clinician responses (71.1%; p = 0.03). It showed strengths in question comprehension (87.4% correct) but lower performance in image interpretation (61.7%), reasoning (68.3%), and supporting information (66.1%). ChatGPT-4o excelled in pulmonary disease (91.7%), surgery and trauma (87.5%), and neurologic disorders (81.8%), and underperformed in critical care ultrasound (51.1%). Notably, 33.3% of its responses were associated with potential for clinical harm, often due to incorrect image interpretation and treatment recommendations.
Conclusions: ChatGPT-4o demonstrates performance slightly above pooled clinician benchmarks on critical care board-style questions but has substantial limitations in multimodal question interpretation. Despite its high comprehension, deficiencies in ChatGPT-4o's reasoning and image analysis may lead to harmful clinical conclusions in high-stakes clinical decision-making or clinical education.
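The abstract reports its overall results as percentages without the underlying counts. The following is a minimal Python sketch that back-calculates approximate counts, assuming (this is not stated in the abstract) that each overall percentage uses all 183 questions as its denominator. The names `reported_pct`, `implied_count`, and `is_consistent` are illustrative only, and the resulting counts are estimates rather than figures from the study.

```python
# Back-calculate approximate counts from percentages reported in the abstract.
# ASSUMPTION: every percentage below has all 183 questions as its denominator;
# the abstract does not state this, so the counts are estimates only.
TOTAL_QUESTIONS = 183

# Overall percentages quoted in the abstract (domain-level figures are omitted
# because their per-domain denominators are not given).
reported_pct = {
    "correct answers": 74.9,
    "question comprehension": 87.4,
    "image interpretation": 61.7,
    "reasoning": 68.3,
    "supporting information": 66.1,
    "potential for clinical harm": 33.3,
}


def implied_count(pct, total=TOTAL_QUESTIONS):
    """Return the whole-number count closest to pct percent of total."""
    return round(pct / 100 * total)


def is_consistent(pct, total=TOTAL_QUESTIONS):
    """Check that the implied count rounds back to the reported percentage."""
    count = implied_count(pct, total)
    return round(count / total * 100, 1) == pct


implied = {name: implied_count(pct) for name, pct in reported_pct.items()}

for name, count in implied.items():
    status = "consistent" if is_consistent(reported_pct[name]) else "check denominator"
    print(f"{name}: about {count} of {TOTAL_QUESTIONS} ({status})")
```

Under this assumption, 74.9% corresponds to about 137 of 183 questions answered correctly and 33.3% to about 61 responses flagged for potential harm. The round-trip check only shows that the reported percentages are compatible with a denominator of 183; it does not confirm the actual counts.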