Benchmarking large language models for quality control of chest radiographs and CT reports: a retrospective multimodal study.
Authors
Affiliations (17)
- Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China. Electronic address: [email protected].
- Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China. Electronic address: [email protected].
- Shanghai Artificial Intelligence Laboratory, Shanghai, China. Electronic address: [email protected].
- Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China. Electronic address: [email protected].
- Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China. Electronic address: [email protected].
- Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China. Electronic address: [email protected].
- Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China. Electronic address: [email protected].
- Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China. Electronic address: [email protected].
- Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China. Electronic address: [email protected].
- Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China. Electronic address: [email protected].
- Department of Radiology, The Fifth Clinical Medical College of Henan University of Chinese Medicine, (Zhengzhou People's Hospital), Zhengzhou, China. Electronic address: [email protected].
- Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China. Electronic address: [email protected].
- Shanghai Artificial Intelligence Laboratory, Shanghai, China. Electronic address: [email protected].
- Shanghai Artificial Intelligence Laboratory, Shanghai, China. Electronic address: [email protected].
- Shanghai Artificial Intelligence Laboratory, Shanghai, China. Electronic address: [email protected].
- Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China. Electronic address: [email protected].
- Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China. Electronic address: [email protected].
Abstract
This study aims to establish a retrospective, single-centre, feasibility-oriented benchmark for medical imaging quality control (QC) and to evaluate the potential of multiple large language models for chest radiograph (CXR) technical QC and CT report consistency assessment, based on a relatively small, radiologist-annotated dataset derived from routine clinical practice. This retrospective, single-centre study included 161 CXRs and 219 structured CT reports from routine clinical practice. Twelve labels were used for CXR QC, comprising eleven radiologist-defined error categories and one error-free label, while nine labels were used for CT report evaluation, comprising eight inconsistency categories and one error-free label. All cases were annotated against a radiologist consensus reference standard. Multiple large language models (LLMs) and multimodal large language models (MLLMs) were evaluated using Micro-F1 and Macro-F1 for CXR QC and expert-based Micro-F1 for CT report QC. For CXR QC, Gemini 2.0 Flash showed the strongest performance, achieving robust category-level generalization, while GPT-4o and Qwen2.5-VL-72B-Instruct demonstrated more balanced but weaker performance. In CT report QC, DeepSeek-R1 achieved the highest recall (62.23%) and the best overall performance. Across models, protocol-report mismatches and metric inconsistencies were the most common error types. This study presents an initial, feasibility-oriented multimodal benchmark for medical imaging quality control, showing that LLM performance is highly task- and modality-dependent. Given the limited sample size, single-centre design, and Chinese-language scope, the findings support future multicentre validation and workflow-integrated evaluation rather than immediate clinical deployment.