Diagnostic Performance of a Large Language Model (ChatGPT-4o) in Chronic Rhinosinusitis CT Scan Interpretation.
Authors
Affiliations (4)
- College of Medicine, King Saud University, Riyadh, Saudi Arabia.
- Radiology Department, King Abdulaziz University Hospital, King Saud University, Riyadh, Saudi Arabia.
- Otolaryngology-Head and Neck Surgery Department, King Saud University Medical City, Riyadh, Saudi Arabia.
- Security Forces Hospital Program, the General Directorate of Medical Services, Ministry of Interior, Riyadh, Saudi Arabia.
Abstract
Large language models (LLMs), such as ChatGPT, are increasingly used by physicians for clinical decision support because of their ease of use and versatility. However, their performance in diagnostic imaging remains largely untested. This study prospectively evaluated ChatGPT's ability to interpret sinus computed tomography (CT) scans for chronic rhinosinusitis (CRS), using radiologist assessment as the reference standard. In this prospective cohort study, 102 coronal sinus CT scans were evaluated by both a board-certified radiologist and ChatGPT-4o. Each scan was screen recorded and uploaded twice to ChatGPT to assess repeatability, yielding 306 total interpretations. The radiologist reviewed the same screen recordings provided to ChatGPT. Both raters assessed 11 predefined binary anatomical features and generated Lund-Mackay scores. Diagnostic performance was assessed using standard accuracy metrics, and inter-rater agreement was evaluated using established reliability coefficients. ChatGPT demonstrated variable performance across anatomical features. Sensitivity ranged from 0.00 to 0.89, and specificity from 0.26 to 0.95. The model showed relatively high sensitivity for mucosal thickening (0.84) and sinus expansion (0.73), as well as strong agreement with the radiologist for the lamina papyracea (AC1 = 0.92) and anterior ethmoid artery (AC1 = 0.77). However, performance was poor for air-fluid levels and bone thinning. Agreement with the radiologist was low across most features (AC1 < 0.4 for 82% of variables), and repeatability between ChatGPT runs was limited (mean AC1 = 0.29). Correlation between runs for Lund-Mackay scores was weak (r = 0.11), and agreement with the radiologist was poor (ICC < 0.07). ChatGPT demonstrates partial capability in identifying specific sinus CT findings but lacks overall diagnostic consistency. Human radiologists remain essential, and the clinical use of LLMs in imaging should be approached with caution.
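The abstract reports per-feature sensitivity, specificity, and Gwet's AC1 agreement for binary ratings. As an illustrative sketch only (the study's actual analysis code and data are not given here), these metrics can be computed from paired binary ratings as follows; the example ratings are hypothetical:

```python
def sensitivity_specificity(truth, pred):
    """Sensitivity and specificity of binary predictions against a reference standard."""
    tp = sum(t and p for t, p in zip(truth, pred))          # true positives
    tn = sum((not t) and (not p) for t, p in zip(truth, pred))  # true negatives
    fp = sum((not t) and p for t, p in zip(truth, pred))    # false positives
    fn = sum(t and (not p) for t, p in zip(truth, pred))    # false negatives
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec

def gwet_ac1(r1, r2):
    """Gwet's AC1 chance-corrected agreement for two raters on binary items."""
    n = len(r1)
    pa = sum(a == b for a, b in zip(r1, r2)) / n  # observed agreement
    pi = (sum(r1) + sum(r2)) / (2 * n)            # mean prevalence of positive ratings
    pe = 2 * pi * (1 - pi)                        # chance agreement under AC1
    return (pa - pe) / (1 - pe)

# Hypothetical ratings: 1 = feature present, 0 = absent
radiologist = [1, 1, 0, 0, 1]
model       = [1, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(radiologist, model)
ac1 = gwet_ac1(radiologist, model)
```

AC1 is often preferred over Cohen's kappa when trait prevalence is very high or very low, as is common for rare CT findings such as air-fluid levels, because kappa can paradoxically collapse in those settings.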