Benchmarking large language models for quality control of chest radiographs and CT reports: a retrospective multimodal study.

March 31, 2026

Authors

Qin Z, Gui Q, Bian M, Wang R, Ge H, Yao D, Sun Z, Zhao Y, Zhang Y, Shi H, Wang D, Song C, Liu L, He J, Xu J, Ju S, Wang YC

Affiliations (17)

  • Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China.
  • Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China.
  • Shanghai Artificial Intelligence Laboratory, Shanghai, China.
  • Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China.
  • Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China.
  • Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China.
  • Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China.
  • Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China.
  • Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China.
  • Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China.
  • Department of Radiology, The Fifth Clinical Medical College of Henan University of Chinese Medicine (Zhengzhou People's Hospital), Zhengzhou, China.
  • Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China.
  • Shanghai Artificial Intelligence Laboratory, Shanghai, China.
  • Shanghai Artificial Intelligence Laboratory, Shanghai, China.
  • Shanghai Artificial Intelligence Laboratory, Shanghai, China.
  • Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China.
  • Department of Radiology, Zhongda Hospital, Nurturing Center of Jiangsu Province for State Laboratory of AI Imaging & Interventional Radiology, School of Medicine, Southeast University, Nanjing, China.

Abstract

This study aims to establish a retrospective, single-centre, feasibility-oriented benchmark for medical imaging quality control (QC) and to evaluate the potential of multiple large language models for chest X-ray radiograph (CXR) technical QC and CT report consistency assessment, based on a relatively small, radiologist-annotated dataset derived from routine clinical practice. This retrospective, single-centre study included 161 CXRs and 219 structured CT reports from routine clinical practice. Twelve labels were used for CXR QC, comprising eleven radiologist-defined error categories and one error-free label, while nine labels were used for CT report evaluation, comprising eight inconsistency categories and one error-free label. All cases were annotated against a radiologist consensus reference standard. Multiple large language models (LLMs) and multimodal large language models (MLLMs) were evaluated using Micro-F1 and Macro-F1 for CXR QC and expert-based Micro-F1 for CT report QC. For CXR QC, Gemini 2.0 Flash showed the strongest performance, achieving robust category-level generalization, while GPT-4o and Qwen2.5-VL-72B-Instruct demonstrated more balanced but weaker performance. In CT report QC, DeepSeek-R1 achieved the highest recall (62.23%) and the best overall performance. Across models, protocol-report mismatches and metric inconsistencies were the most common error types. This study presents an initial, feasibility-oriented multimodal benchmark for medical imaging quality control, showing that LLM performance is highly task- and modality-dependent. Given the limited sample size, single-centre design, and Chinese-language scope, the findings support future multicentre validation and workflow-integrated evaluation rather than immediate clinical deployment.
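The abstract reports Micro-F1 and Macro-F1 for a multi-label QC task (multiple error categories plus an error-free label per case). As a minimal sketch of how these two aggregates differ, the snippet below computes both from per-label true-positive/false-positive/false-negative counts; the label names and toy predictions are hypothetical and not taken from the paper's dataset.

```python
def f1(tp, fp, fn):
    """F1 score from raw counts; 0.0 when the denominator is empty."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true, y_pred, labels):
    """y_true / y_pred: one set of labels per case.
    Micro-F1 pools counts across labels (frequent labels dominate);
    Macro-F1 averages per-label F1 (each label weighted equally)."""
    counts = {label: [0, 0, 0] for label in labels}  # [tp, fp, fn]
    for truth, pred in zip(y_true, y_pred):
        for label in labels:
            if label in truth and label in pred:
                counts[label][0] += 1  # true positive
            elif label in pred:
                counts[label][1] += 1  # false positive
            elif label in truth:
                counts[label][2] += 1  # false negative
    micro = f1(*(sum(c[i] for c in counts.values()) for i in range(3)))
    macro = sum(f1(*c) for c in counts.values()) / len(labels)
    return micro, macro

# Hypothetical labels and three toy cases (not the paper's categories).
labels = ["error_free", "patient_rotation", "underexposure"]
y_true = [{"patient_rotation"}, {"error_free"}, {"patient_rotation", "underexposure"}]
y_pred = [{"patient_rotation"}, {"error_free"}, {"underexposure"}]
micro, macro = micro_macro_f1(y_true, y_pred, labels)
```

On this toy data the pooled counts are tp=3, fp=0, fn=1, so Micro-F1 = 6/7 ≈ 0.857, while Macro-F1 averages per-label scores (1.0, 2/3, 1.0) to 8/9 ≈ 0.889, illustrating why the two metrics can diverge when error categories are imbalanced, as in the benchmark described above.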

Topics

Journal Article
