Large Language Models Provide Accurate but Potentially Unsafe Answers to Multimodal Critical Care Medicine Board Review Questions.

June 18, 2026

Authors

Sethi I, Khan S, Lyons PG, Gao CA, Luo Y, Miltz D, Tovar S, Lohuis CT, Polly DM, Dave SB, Sterling M, Rojas JC, Han X, Sakhuja A, Celi LA, Martin GS, Coopersmith CM, Bhavani SV

Affiliations (13)

  • Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA.
  • Department of Emergency Medicine, Emory University School of Medicine, Atlanta, GA.
  • Emory Critical Care Center, Emory University School of Medicine, Atlanta, GA.
  • Department of Medicine, Oregon Health and Science University School of Medicine, Portland, OR.
  • Division of Pulmonary and Critical Care, Department of Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL.
  • Division of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL.
  • Department of Medicine, Division of Pulmonary, Allergy, Critical Care and Sleep Medicine, Emory University School of Medicine, Atlanta, GA.
  • Department of Pharmacy, Emory University Hospital Midtown, Atlanta, GA.
  • Department of Internal Medicine, Rush University, Chicago, IL.
  • Department of Medicine, Tufts Medicine, Boston, MA.
  • Department of Medicine, Institute for Critical Care Medicine, Icahn School of Medicine at Mount Sinai, New York, NY.
  • Department of Medicine, Beth Israel Deaconess Medical Center, Boston, MA.
  • Department of Surgery, Emory University School of Medicine, Atlanta, GA.

Abstract

Objectives: To evaluate the performance of Chat Generative Pre-Trained Transformer-4 Omni (ChatGPT-4o) in answering multimodal critical care board review questions, with a focus on accuracy, image interpretation, reasoning quality, and potential for harm.

Design: Observational study using a validated item bank of multiple-choice questions accompanied by clinical images, analyzed through a custom ChatGPT-4o profile built using a validated framework.

Setting: Simulated environment mimicking critical care board examination conditions, with artificial intelligence responses reviewed by a panel of experienced critical care clinicians.

Subjects: One hundred eighty-three board-style questions from the Society of Critical Care Medicine item bank, representing a range of critical care domains and imaging modalities. ChatGPT-4o was evaluated on its responses, which were assessed by 14 clinical reviewers (physicians, advanced practice providers, and pharmacists).

Interventions: None.

Measurements and Main Results: ChatGPT-4o answered 74.9% of questions correctly, higher than pooled clinician responses (71.1%; p = 0.03). It showed strengths in question comprehension (87.4% correct) but lower performance in image interpretation (61.7%), reasoning (68.3%), and supporting information (66.1%). ChatGPT-4o excelled in pulmonary disease (91.7%), surgery and trauma (87.5%), and neurologic disorders (81.8%), and underperformed in critical care ultrasound (51.1%). Notably, 33.3% of its responses were associated with potential for clinical harm, often due to incorrect image interpretation and treatment recommendations.

Conclusions: ChatGPT-4o demonstrates performance slightly above pooled clinician benchmarks on critical care board-style questions but has substantial limitations in multimodal question interpretation. Despite its high comprehension, deficiencies in ChatGPT-4o's reasoning and image analysis may lead to harmful clinical conclusions in high-stakes clinical decision-making or clinical education.

Topics

Journal Article
