
Evaluating the role of ChatGPT in structured radiology reporting: A systematic review.

February 6, 2026

Authors

Alalawi S, Alchoghari R, Hakami A, Alhazmi A, Almutari E, Asseiri R, Alzahrani A, Aladhyani M, Alqarni S, Al-Sharydah AM

Affiliations (8)

  • College of Medicine, Imam Abdulrahman Bin Faisal University, Khobar, Eastern Province, Saudi Arabia.
  • College of Medicine, Umm Al-Qura University, Makkah, Western Province, Saudi Arabia.
  • Radiology Department, Hayat National Hospital, Jazan, Saudi Arabia.
  • College of Applied Medical Sciences, Taif University, Taif, Western Province, Saudi Arabia.
  • College of Medicine, Tabuk University, Tabuk, Saudi Arabia.
  • College of Medicine, King Abdulaziz University, Jeddah, Western Province, Saudi Arabia.
  • College of Medicine, Shaqra University, Shaqra, Riyadh Province, Saudi Arabia.
  • Diagnostic and Interventional Radiology Department, Imam Abdulrahman Bin Faisal University, King Fahd Hospital of the University, Khobar, Eastern Province, Saudi Arabia.

Abstract

The integration of large language models (LLMs) such as ChatGPT into radiology has introduced new possibilities for structured reporting. While these models are designed to improve the clarity, accuracy, and efficiency of radiology workflows, their diagnostic performance and clinical reliability are still not well established. We aimed to systematically review the diagnostic accuracy, sensitivity, specificity, and clinical utility of ChatGPT and related LLMs in generating structured radiology reports. A systematic review was conducted following Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines and registered with the International Prospective Register of Systematic Reviews (CRD42025639804). PubMed and Google Scholar were searched for retrospective diagnostic accuracy studies involving ChatGPT or similar LLMs applied to structured radiology reporting. Risk of bias and applicability were assessed using the Quality Assessment of Diagnostic Accuracy Studies 2 tool. A narrative synthesis summarized performance metrics across imaging modalities and artificial intelligence model types. Owing to variability in methodologies and outcome reporting, a meta-analysis was not conducted. Of the 1428 studies screened, 28 were included in this review, all published between 2023 and 2024. GPT-4 consistently outperformed earlier models, achieving up to 99% accuracy in liver magnetic resonance imaging and 94% in brain magnetic resonance imaging interpretation. GPT-4o showed higher sensitivity in chest imaging (75%), with a specificity of 95%. Other domain-specific models also demonstrated high performance, including Augmented Transformer Assisted Radiology Intelligence (98% accuracy) and Vicuna (96% accuracy). However, variability in diagnostic capability was observed, with models such as GPT-4V underperforming in musculoskeletal and gastrointestinal imaging. The overall risk of bias according to the Quality Assessment of Diagnostic Accuracy Studies 2 tool was moderate, with common issues in patient selection and index test standardization. ChatGPT and similar LLMs show promising accuracy and applicability in structured radiology reporting, particularly for chest, brain, and liver imaging. However, their performance remains inconsistent across modalities, and further prospective studies with standardized protocols are needed before routine clinical adoption.
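
For readers comparing the reported figures, a minimal sketch of how sensitivity, specificity, and accuracy are derived from confusion-matrix counts (the counts below are illustrative, chosen only to roughly reproduce the GPT-4o chest-imaging numbers, and are not data from the review):

    # Illustrative only: standard diagnostic accuracy metrics from
    # confusion-matrix counts. Counts are hypothetical, not from the review.

    def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
        """Compute sensitivity, specificity, and accuracy."""
        sensitivity = tp / (tp + fn)                # true positive rate (recall)
        specificity = tn / (tn + fp)                # true negative rate
        accuracy = (tp + tn) / (tp + fp + tn + fn)  # overall proportion correct
        return {
            "sensitivity": sensitivity,
            "specificity": specificity,
            "accuracy": accuracy,
        }

    # Hypothetical counts matching the reported GPT-4o chest-imaging
    # sensitivity (~75%) and specificity (~95%).
    print(diagnostic_metrics(tp=75, fn=25, tn=95, fp=5))
    # {'sensitivity': 0.75, 'specificity': 0.95, 'accuracy': 0.85}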

Topics

Journal Article · Systematic Review
