Who labels best? Radiologists, rules, or large language models for CT reports on pulmonary embolism.

May 27, 2026


DOI: 10.1186/s41747-026-00738-7 PMID: 42201590

Authors

Fink MA, Bischoff A, Atsiatorme E, Kremer A, Kroschke J, Moll M, Stein P, Riebl V, Leichenich T, Kauczor HU, Schlamp K

Affiliations (5)

Clinic for Diagnostic and Interventional Radiology, University Hospital Heidelberg, Heidelberg, Germany.
Translational Lung Research Center Heidelberg, member of the German Center for Lung Research, Heidelberg, Germany.
Clinic for Diagnostic and Interventional Radiology, University Hospital Heidelberg, Heidelberg, Germany.
Translational Lung Research Center Heidelberg, member of the German Center for Lung Research, Heidelberg, Germany.
Diagnostic and Interventional Radiology with Nuclear Medicine, Heidelberg Thoracic Clinic, University of Heidelberg, Heidelberg, Germany.

Abstract

To compare open-weight and proprietary large language models (LLMs), a rule-based extractor (RBE) and radiologists for labelling pulmonary embolism CT reports, and to test whether a hybrid RBE-LLM workflow improves labelling performance.

This single-centre retrospective study included structured CT reports from October 2021 to March 2025. Three labelling pipelines were evaluated: an RBE; a model-agnostic LLM extractor (18 open-weight, four GPT-4 variants); and a hybrid pipeline routing only RBE failures to an LLM. Ground truth was defined at the report-text level by deterministic schema matching for initially RBE-valid fields and blinded adjudication of RBE-invalid fields by two attending radiologists. Eight radiologists provided a human baseline. Outcomes included F1 scores, accuracy, LLM-based salvage of RBE failures, and labelling time.

In total, 2,923 reports from 2,923 patients (mean age 66 ± 17 years; 1,465 women) were included. Falcon3-10b and GPT-4.1-mini achieved similar item-level performance (F1 0.98 [95% CI, 0.97-0.98] for both; p = 0.70) and both exceeded the RBE (F1 0.81 [95% CI, 0.80-0.82]; p < 0.001). Salvage of RBE failures was comparable between open-weight and proprietary models (88.1% vs 91.9%; p = 0.12). The hybrid RBE-LLM workflow achieved 99.8% accuracy and F1 0.99 (0.98-0.99), exceeding both the RBE and pooled radiologists (F1 0.92 [95% CI, 0.90-0.93]; all p < 0.001).

Schema-constrained open-weight and proprietary LLMs exceeded rule-based extraction and, at the upper end of performance, matched a pooled radiologist label-transfer baseline. A rules-first, targeted LLM workflow enabled near-perfect extraction from finalised structured pulmonary embolism CT reports.

A rules-first LLM workflow can automate high-fidelity extraction of structured CT findings from finalised radiology reports, enabling scalable, auditable, and more consistent cohort curation for clinical research, registries, and quality improvement.

A hybrid rules-first workflow combining a rule-based extractor (RBE) with targeted large language model (LLM) salvage achieved the highest overall performance for labelling of pulmonary embolism CT reports (F1, 0.99; accuracy, 99.8%). The top standalone open-weight and proprietary LLMs (Falcon3-10b and GPT-4.1-mini) both exceeded the RBE and, at the upper end of performance, matched a pooled radiologist label-transfer baseline. The hybrid workflow reduced cohort-curation time from 32.2 h for radiologists to 1.0 h while reducing LLM calls by 85.6%, because the LLM was only triggered for rule-failed fields.
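
The rules-first routing described in the abstract can be illustrated with a minimal Python sketch. It is not the study's implementation, which the abstract does not detail: the field names, the allowed values in `SCHEMA`, the regular expressions in `PATTERNS`, and the `stub_llm` placeholder are all hypothetical, chosen only to show how an LLM might be called solely for fields where rule-based extraction fails schema validation.

```python
import re
from typing import Callable, Optional

# Hypothetical schema: field name -> allowed values (not the study's schema).
SCHEMA = {
    "pe_present": {"yes", "no"},
    "right_heart_strain": {"yes", "no"},
}

# Hypothetical patterns for a structured report made of "Field: value" lines.
PATTERNS = {
    "pe_present": re.compile(r"^Pulmonary embolism:\s*(\w+)", re.I | re.M),
    "right_heart_strain": re.compile(r"^Right heart strain:\s*(\w+)", re.I | re.M),
}


def rule_based_extract(report: str) -> dict[str, Optional[str]]:
    """Return one value per field, or None where the rule fails validation."""
    labels: dict[str, Optional[str]] = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(report)
        value = match.group(1).lower() if match else None
        # A value outside the schema counts as a rule failure.
        labels[field] = value if value in SCHEMA[field] else None
    return labels


def hybrid_extract(
    report: str, llm_extract_field: Callable[[str, str], str]
) -> tuple[dict[str, Optional[str]], int]:
    """Rules first; call the LLM only for fields the rules could not label."""
    labels = rule_based_extract(report)
    llm_calls = 0
    for field, value in labels.items():
        if value is None:
            llm_calls += 1
            candidate = llm_extract_field(report, field)
            # The LLM answer is held to the same schema as the rules.
            labels[field] = candidate if candidate in SCHEMA[field] else None
    return labels, llm_calls


def stub_llm(report: str, field: str) -> str:
    """Placeholder standing in for a real model call (assumption)."""
    if field == "right_heart_strain" and "strain" in report.lower():
        return "yes"
    return "no"


report = "Pulmonary embolism: yes\nRight heart strain: present with RV dilatation\n"
labels, llm_calls = hybrid_extract(report, stub_llm)
print(labels, llm_calls)
```

In this toy report the first field passes the rules directly, while the free-text value of the second field fails schema validation and is the only one routed to the stub. Routing per field rather than per report is one plausible reading of "only triggered for rule-failed fields", and it is the mechanism by which such a design would keep the number of LLM calls low.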


Topics

Pulmonary Embolism; Tomography, X-Ray Computed; Radiologists; Journal Article
