
Evaluating the role of ChatGPT in structured radiology reporting: A systematic review.

February 6, 2026

Authors

Alalawi S, Alchoghari R, Hakami A, Alhazmi A, Almutari E, Asseiri R, Alzahrani A, Aladhyani M, Alqarni S, Al-Sharydah AM

Affiliations (8)

  • College of Medicine, Imam Abdulrahman Bin Faisal University, Khobar, Eastern Province, Saudi Arabia.
  • College of Medicine, Umm Al-Qura University, Makkah, Western Province, Saudi Arabia.
  • Radiology Department, Hayat National Hospital, Jazan, Saudi Arabia.
  • College of Applied Medical Sciences, Taif University, Taif, Western Province, Saudi Arabia.
  • College of Medicine, Tabuk University, Tabuk, Saudi Arabia.
  • College of Medicine, King Abdulaziz University, Jeddah, Western Province, Saudi Arabia.
  • College of Medicine, Shaqra University, Shaqra, Riyadh Province, Saudi Arabia.
  • Diagnostic and Interventional Radiology Department, Imam Abdulrahman Bin Faisal University, King Fahd Hospital of the University, Khobar, Eastern Province, Saudi Arabia.

Abstract

The integration of large language models (LLMs) such as ChatGPT into radiology has introduced new possibilities for structured reporting. While these models are designed to improve the clarity, accuracy, and efficiency of radiology workflows, their diagnostic performance and clinical reliability are still not well established. We aimed to systematically review the diagnostic accuracy, sensitivity, specificity, and clinical utility of ChatGPT and related LLMs in generating structured radiology reports. A systematic review was conducted following Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines and registered with the International Prospective Register of Systematic Reviews (CRD42025639804). PubMed and Google Scholar were searched for retrospective diagnostic accuracy studies involving ChatGPT or similar LLMs applied to structured radiology reporting. Risk of bias and applicability were assessed using the Quality Assessment of Diagnostic Accuracy Studies 2 tool. A narrative synthesis summarized performance metrics across imaging modalities and artificial intelligence model types. Owing to variability in methodologies and outcome reporting, a meta-analysis was not conducted. Of the 1428 studies screened, 28 were included in this review, all published between 2023 and 2024. GPT-4 consistently outperformed earlier models, achieving up to 99% accuracy in liver magnetic resonance imaging and 94% in brain magnetic resonance imaging interpretation. GPT-4o showed higher sensitivity in chest imaging (75%), with a specificity of 95%. Other domain-specific models also demonstrated high performance, including Augmented Transformer Assisted Radiology Intelligence (98% accuracy) and Vicuna (96% accuracy). However, variability in diagnostic capability was observed, with models such as GPT-4V underperforming in musculoskeletal and gastrointestinal imaging. The overall risk of bias according to the Quality Assessment of Diagnostic Accuracy Studies 2 tool was moderate, with common issues in patient selection and index test standardization. ChatGPT and similar LLMs show promising accuracy and applicability in structured radiology reporting, particularly for chest, brain, and liver imaging. However, their performance remains inconsistent across modalities, and further prospective studies with standardized protocols are needed before routine clinical adoption.
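
For readers comparing the reported figures, a minimal sketch of how sensitivity, specificity, and accuracy are derived from confusion-matrix counts (the counts below are illustrative, chosen only to roughly reproduce the GPT-4o chest-imaging numbers, and are not data from the review):

    # Illustrative only: standard diagnostic accuracy metrics from
    # confusion-matrix counts. Counts are hypothetical, not from the review.

    def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
        """Compute sensitivity, specificity, and accuracy."""
        sensitivity = tp / (tp + fn)                # true positive rate (recall)
        specificity = tn / (tn + fp)                # true negative rate
        accuracy = (tp + tn) / (tp + fp + tn + fn)  # overall proportion correct
        return {
            "sensitivity": sensitivity,
            "specificity": specificity,
            "accuracy": accuracy,
        }

    # Hypothetical counts matching the reported GPT-4o chest-imaging
    # sensitivity (~75%) and specificity (~95%).
    print(diagnostic_metrics(tp=75, fn=25, tn=95, fp=5))
    # {'sensitivity': 0.75, 'specificity': 0.95, 'accuracy': 0.85}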

Topics

Journal Article · Systematic Review
