Who labels best? Radiologists, rules, or large language models for CT reports on pulmonary embolism.
Authors
Affiliations (5)
- Clinic for Diagnostic and Interventional Radiology, University Hospital Heidelberg, Heidelberg, Germany.
- Translational Lung Research Center Heidelberg, member of the German Center for Lung Research, Heidelberg, Germany.
- Clinic for Diagnostic and Interventional Radiology, University Hospital Heidelberg, Heidelberg, Germany.
- Translational Lung Research Center Heidelberg, member of the German Center for Lung Research, Heidelberg, Germany.
- Diagnostic and Interventional Radiology with Nuclear Medicine, Heidelberg Thoracic Clinic, University of Heidelberg, Heidelberg, Germany.
Abstract
This study compared open-weight and proprietary large language models (LLMs), a rule-based extractor (RBE), and radiologists for labelling pulmonary embolism CT reports, and tested whether a hybrid RBE-LLM workflow improves labelling performance.
This single-centre retrospective study included structured CT reports from October 2021 to March 2025. Three labelling pipelines were evaluated: an RBE; a model-agnostic LLM extractor (18 open-weight models and four GPT-4 variants); and a hybrid pipeline routing only RBE failures to an LLM. Ground truth was defined at the report-text level by deterministic schema matching for initially RBE-valid fields and by blinded adjudication of RBE-invalid fields by two attending radiologists. Eight radiologists provided a human baseline. Outcomes included F1 scores, accuracy, LLM-based salvage of RBE failures, and labelling time.
In total, 2,923 reports from 2,923 patients (mean age 66 ± 17 years; 1,465 women) were included. Falcon3-10b and GPT-4.1-mini achieved similar item-level performance (F1 0.98 [95% CI, 0.97-0.98] for both; p = 0.70), and both exceeded the RBE (F1 0.81 [95% CI, 0.80-0.82]; p < 0.001). Salvage of RBE failures was comparable between open-weight and proprietary models (88.1% vs 91.9%; p = 0.12). The hybrid RBE-LLM workflow achieved 99.8% accuracy and F1 0.99 [95% CI, 0.98-0.99], exceeding both the RBE and pooled radiologists (F1 0.92 [95% CI, 0.90-0.93]; all p < 0.001).
Schema-constrained open-weight and proprietary LLMs exceeded rule-based extraction and, at the upper end of performance, matched a pooled radiologist label-transfer baseline. A rules-first, targeted LLM workflow enabled near-perfect extraction from finalised structured pulmonary embolism CT reports.
A rules-first LLM workflow can automate high-fidelity extraction of structured CT findings from finalised radiology reports, enabling scalable, auditable, and more consistent cohort curation for clinical research, registries, and quality improvement.
A hybrid rules-first workflow combining a rule-based extractor (RBE) with targeted large language model (LLM) salvage achieved the highest overall performance for labelling of pulmonary embolism CT reports (F1, 0.99; accuracy, 99.8%). The top standalone open-weight and proprietary LLMs (Falcon3-10b and GPT-4.1-mini) both exceeded the RBE and, at the upper end of performance, matched a pooled radiologist label-transfer baseline. The hybrid workflow reduced cohort-curation time from 32.2 h for radiologists to 1.0 h while reducing LLM calls by 85.6%, because the LLM was only triggered for rule-failed fields.
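The rules-first routing described above can be illustrated with a minimal Python sketch. It is not the study's implementation: the schema, the field names, the regular expressions, and the `stub_llm` function are all hypothetical stand-ins chosen for illustration. The sketch only shows the control flow, in which each field is first attempted by a rule, the rule output is validated against the schema, and an LLM is called only for fields where the rule fails.

```python
import re
from typing import Callable, Dict, Optional, Tuple

# Hypothetical schema: field name -> allowed values (not taken from the study).
SCHEMA = {
    "pulmonary_embolism": {"yes", "no"},
    "right_heart_strain": {"yes", "no"},
}

# Hypothetical rules: one regex per field, matching a "Label: value" entry.
RULES = {
    "pulmonary_embolism": re.compile(r"pulmonary embolism:\s*(\w+)", re.I),
    "right_heart_strain": re.compile(r"right heart strain:\s*(\w+)", re.I),
}


def rule_extract(report: str, field: str) -> Optional[str]:
    """Return a schema-valid value found by the rule, or None on failure."""
    match = RULES[field].search(report)
    if match is None:
        return None
    value = match.group(1).lower()
    return value if value in SCHEMA[field] else None


def hybrid_label(
    report: str, llm: Callable[[str, str], str]
) -> Tuple[Dict[str, Optional[str]], int]:
    """Label every field rules-first; call the LLM only for rule failures."""
    labels: Dict[str, Optional[str]] = {}
    llm_calls = 0
    for field in SCHEMA:
        value = rule_extract(report, field)
        if value is None:
            llm_calls += 1
            candidate = llm(report, field).strip().lower()
            # The LLM answer is held to the same schema as the rule output.
            value = candidate if candidate in SCHEMA[field] else None
        labels[field] = value
    return labels, llm_calls


def stub_llm(report: str, field: str) -> str:
    """Stand-in for a real model call; a keyword check, purely illustrative."""
    return "yes" if "dilated right ventricle" in report.lower() else "no"


report = (
    "Pulmonary embolism: yes\n"
    "Right heart strain: see text, dilated right ventricle."
)
labels, calls = hybrid_label(report, stub_llm)
print(labels, calls)
```

In this toy report the first field is resolved by its rule, while the second contains free text that fails schema validation, so only one of the two fields triggers an LLM call. Keeping the rule output and the LLM output under the same schema check is one way to keep the pipeline auditable, since every final label is either a deterministic match or a validated fallback.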