Comparison of emergency physicians and artificial intelligence models in pneumothorax detection: A multi-reader retrospective study.

June 19, 2026


DOI: 10.1097/MD.0000000000049412 PMID: 42332528

Authors

Cakiroglu OF, Arac B, Calisir SN, Gurbuz J, Balci EB, Taslidere B, Cander B

Affiliations (4)

Department of Emergency Medicine, Kartal Doktor Lutfi Kirdar City Hospital, Istanbul, Turkiye.
Basaksehir State Hospital, Istanbul, Turkiye.
Department of Emergency Medicine, Basaksehir Cam Sakura City Hospital, Istanbul, Turkey.
Department of Emergency Medicine, Bezmialem Vakif University, Istanbul, Turkiye.

Abstract

This study aimed to compare emergency physicians' performance with that of two general-purpose large language models (LLMs), ChatGPT and Gemini, for pneumothorax (PTX) detection on chest radiographs (CXRs). This single-center, retrospective study of adults was conducted between January 2015 and February 2025 and included 265 PTX cases and 267 non-PTX controls. Exclusions included diagnoses made only by computed tomography, absence of CXR, initial treatment at another center, or incomplete data. Thirteen emergency physicians independently and blindly reviewed CXRs and recorded a binary decision. ChatGPT and Gemini evaluated the same images with a standardized yes/no prompt, with memory cleared between cases to prevent carryover. The primary outcome was LLM diagnostic performance for PTX, while the secondary outcome compared LLMs with physicians. ChatGPT and Gemini exhibited distinct diagnostic performance profiles for PTX detection on CXRs. Gemini demonstrated a sensitivity of 52.5%, whereas ChatGPT demonstrated a sensitivity of 44.5%. Conversely, ChatGPT achieved a specificity of 95.5% and an overall accuracy of 70.1%, while Gemini demonstrated a specificity of 79.0% and an accuracy of 65.8%. Agreement with the reference standard was moderate for ChatGPT, with a kappa value of 0.401, and fair for Gemini, with a kappa value of 0.315. Increasing case difficulty was associated with a reduction in diagnostic accuracy for both models, with correlation coefficients of -0.438 for ChatGPT and -0.274 for Gemini. For contextual clinical comparison, emergency physicians demonstrated a sensitivity of 64.5%, a specificity of 99.6%, and an overall accuracy of 82.1%. This study demonstrates model-specific differences in PTX detection by general-purpose AI systems, with Gemini showing higher sensitivity and ChatGPT showing superior specificity and accuracy, and with the performance of both models declining as case difficulty increased. Physician performance remained higher but was a secondary outcome, reported for context. Despite their accessibility and low cost, these models should be considered only adjunctive tools until task-specific optimization and clinical validation are achieved.
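
The reported accuracy and kappa values can be cross-checked against the reported sensitivity, specificity, and case counts. The following Python sketch is an illustration only: the abstract does not report confusion-matrix counts, so the counts here are reconstructed by rounding (an assumption), and the function names are hypothetical. Kappa is computed with the standard Cohen's kappa formula for a 2x2 table.

```python
# Sketch: rebuild approximate 2x2 tables from the rates in the abstract and
# recompute accuracy and Cohen's kappa. Counts are reconstructed by rounding
# and are NOT reported in the source.

N_PTX = 265      # PTX cases (from the abstract)
N_CONTROL = 267  # non-PTX controls (from the abstract)


def confusion_from_rates(sensitivity, specificity, n_pos, n_neg):
    """Return (TP, FN, TN, FP) implied by the rates, rounded to whole cases."""
    tp = round(sensitivity * n_pos)
    tn = round(specificity * n_neg)
    return tp, n_pos - tp, tn, n_neg - tn


def accuracy(tp, fn, tn, fp):
    """Proportion of all cases classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


def cohen_kappa(tp, fn, tn, fp):
    """Cohen's kappa: agreement with the reference beyond chance."""
    n = tp + fn + tn + fp
    observed = (tp + tn) / n
    # Chance agreement from the marginal totals of the 2x2 table.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    return (observed - expected) / (1 - expected)


# Sensitivity and specificity as reported in the abstract.
reported = {
    "ChatGPT": (0.445, 0.955),
    "Gemini": (0.525, 0.790),
}

for model, (sens, spec) in reported.items():
    table = confusion_from_rates(sens, spec, N_PTX, N_CONTROL)
    print(f"{model}: accuracy={accuracy(*table):.3f}, kappa={cohen_kappa(*table):.3f}")
```

Under these rounding assumptions, the recomputed values agree with the figures in the abstract to the reported precision (70.1% and 0.401 for ChatGPT, 65.8% and 0.315 for Gemini), which suggests the reported metrics are internally consistent.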


Topics

Artificial Intelligence, Pneumothorax, Physicians, Journal Article, Comparative Study
