Benchmarking large language models for quality control of chest radiographs and CT reports: a retrospective multimodal study.
Authors
Affiliations (17)
- Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China. Electronic address: [email protected].
- Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China. Electronic address: [email protected].
- Shanghai Artificial Intelligence Laboratory, Shanghai, China. Electronic address: [email protected].
- Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China. Electronic address: [email protected].
- Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China. Electronic address: [email protected].
- Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China. Electronic address: [email protected].
- Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China. Electronic address: [email protected].
- Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China. Electronic address: [email protected].
- Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China. Electronic address: [email protected].
- Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China. Electronic address: [email protected].
- Department of Radiology, The Fifth Clinical Medical College of Henan University of Chinese Medicine, (Zhengzhou People's Hospital), Zhengzhou, China. Electronic address: [email protected].
- Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China. Electronic address: [email protected].
- Shanghai Artificial Intelligence Laboratory, Shanghai, China. Electronic address: [email protected].
- Shanghai Artificial Intelligence Laboratory, Shanghai, China. Electronic address: [email protected].
- Shanghai Artificial Intelligence Laboratory, Shanghai, China. Electronic address: [email protected].
- Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China. Electronic address: [email protected].
- Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China. Electronic address: [email protected].
Abstract
This study aims to establish a retrospective, single-centre, feasibility-oriented benchmark for medical imaging quality control (QC) and to evaluate the potential of multiple large language models for chest radiograph (CXR) technical QC and CT report consistency assessment, based on a relatively small, radiologist-annotated dataset derived from routine clinical practice. This retrospective, single-centre study included 161 CXRs and 219 structured CT reports from routine clinical practice. Twelve labels were used for CXR QC, comprising eleven radiologist-defined error categories and one error-free label, while nine labels were used for CT report evaluation, comprising eight inconsistency categories and one error-free label. All cases were annotated against a radiologist consensus reference standard. Multiple large language models (LLMs) and multimodal large language models (MLLMs) were evaluated using Micro-F1 and Macro-F1 for CXR QC and expert-based Micro-F1 for CT report QC. For CXR QC, Gemini 2.0 Flash showed the strongest performance, achieving robust category-level generalization, while GPT-4o and Qwen2.5-VL-72B-Instruct demonstrated more balanced but weaker performance. In CT report QC, DeepSeek-R1 achieved the highest recall (62.23%) and the best overall performance. Across models, protocol-report mismatches and metric inconsistencies were the most common error types. This study presents an initial, feasibility-oriented multimodal benchmark for medical imaging quality control, showing that LLM performance is highly task- and modality-dependent. Given the limited sample size, single-centre design, and Chinese-language scope, the findings support future multicentre validation and workflow-integrated evaluation rather than immediate clinical deployment.