Intelligent Head and Neck CTA Report Quality Detection with Large Language Models.
Authors
Affiliations (5)
- Department of Radiology and Nuclear Medicine, Xuanwu Hospital, Capital Medical University, Beijing, 100053, China.
- Beijing Key Laboratory of Magnetic Resonance Imaging and Brain Informatics, Beijing, 100053, China.
- Information Center, Xuanwu Hospital, Capital Medical University, Beijing, 100053, China. [email protected].
- Department of Radiology and Nuclear Medicine, Xuanwu Hospital, Capital Medical University, Beijing, 100053, China. [email protected].
- Beijing Key Laboratory of Magnetic Resonance Imaging and Brain Informatics, Beijing, 100053, China. [email protected].
Abstract
This study aimed to identify common errors in head and neck CTA reports using GPT-4, ERNIE Bot, and SparkDesk, and to evaluate their potential for supporting quality control of Chinese radiological reports.

We collected 10,000 head and neck CTA imaging reports from Xuanwu Hospital (Dataset 1) and 5,000 multi-center reports (Dataset 2). Six common error types were identified and detected with three large language models: GPT-4, ERNIE Bot, and SparkDesk. The overall quality of each report was assessed on a 5-point Likert scale. Wilcoxon rank-sum and Friedman tests were used to compare error detection rates and to evaluate model performance across error types and overall scores. For Dataset 2, the six error types were annotated and overall scores assigned after manual review, and the time required for manual scoring and for model detection was recorded. Model performance was evaluated with accuracy, precision, recall, and F1 score; the intraclass correlation coefficient (ICC) measured consistency between manual and model scores, and ANOVA compared evaluation times.

In Dataset 1, error detection rates for final reports were significantly lower than for preliminary reports across all three models, and the Friedman test indicated significant differences in error rates among the models. In Dataset 2, detection accuracy for the six error types exceeded 95% for all three LLMs. GPT-4 showed moderate agreement with manual scores (ICC = 0.517), while ERNIE Bot and SparkDesk showed slightly lower agreement (ICC = 0.431 and 0.456, respectively; P < 0.001). The models evaluated 100 radiology reports significantly faster than human reviewers.

LLMs can differentiate the quality of radiology reports and identify error types, substantially improving the efficiency of quality-control review and offering real research and practical value in this field.
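As an illustration of the evaluation described above, the following sketch computes accuracy, precision, recall, and F1 for one error type from binary annotations (1 = error present, 0 = absent). The labels shown are hypothetical and not taken from the study; the actual evaluation compared model detections against manual annotations on Dataset 2.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary error labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Hypothetical manual annotations vs. model detections for one error type
manual = [1, 0, 1, 1, 0, 0, 1, 0]
model = [1, 0, 1, 0, 0, 1, 1, 0]
acc, prec, rec, f1 = binary_metrics(manual, model)
# acc = 0.75, prec = 0.75, rec = 0.75, f1 = 0.75
```

In practice these per-error-type metrics would be computed once per model and per error type, and score consistency would be assessed separately with an ICC on the 5-point overall ratings.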