How green are large language models for radiology report labelling? Comparing human, rule-based and hybrid workflows.

May 27, 2026

DOI: 10.1186/s13244-026-02289-2 PMID: 42201579

Authors

Fink MA, Bischoff A, Atsiatorme E, Kremer A, Kroschke J, Moll M, Stein P, Riebl V, Leichenich T, Kauczor HU, Schlamp K

Affiliations (5)

Clinic for Diagnostic and Interventional Radiology, University Hospital Heidelberg, Heidelberg, Germany.
Translational Lung Research Center Heidelberg, Member of the German Center for Lung Research, Heidelberg, Germany.
Clinic for Diagnostic and Interventional Radiology, University Hospital Heidelberg, Heidelberg, Germany.
Translational Lung Research Center Heidelberg, Member of the German Center for Lung Research, Heidelberg, Germany.
Diagnostic and Interventional Radiology with Nuclear Medicine, Heidelberg Thoracic Clinic, University of Heidelberg, Heidelberg, Germany.

Abstract

To address limited quantitative data on sustainable use of large language models (LLMs) in radiology, we quantified the resource footprint of LLMs for labelling CT pulmonary embolism reports and assessed how a hybrid rule-based-LLM workflow changes time, cost and carbon emissions compared with manual labelling.

In this single-centre retrospective study, 2923 structured CT reports were labelled using four workflows: a rule-based extractor (RBE), an LLM-only pipeline using 18 open-weight and four proprietary models, a hybrid RBE-LLM pipeline that routed RBE failures to an LLM, and full manual labelling by radiologists. Ground truth was based on radiologist adjudication. For each LLM, we measured per-report latency, estimated CO₂ emissions and cost. Radiologists recorded the labelling time per report.

Manual labelling required 32.8 h for 2923 reports (40.4 s/report; €0.42/report) with 95.0% accuracy (95% CI: 93.7-96.2). LLM-only pipelines were less accurate (85.1%; 95% CI: 84.9-85.5) but reduced labelling time to 12.4 h and cost to €2.60 (both p < 0.001). Hybrid RBE-LLM workflows yielded the highest accuracy (98.5%) and lowest resource use: across 22 models, switching from LLM-only to hybrid reduced time (6.7 to 0.97 h), cost (€1.19 to €0.17), and CO₂ (0.82 to 0.12 kg; all p < 0.001).

LLM-only labelling reduced labour time and direct costs compared with manual annotation. A hybrid RBE-LLM pipeline that forwards rule-based failures to an LLM concentrated compute where needed and markedly decreased time, cost and emissions, supporting targeted deployment of LLMs for sustainable data-annotation workflows in radiology.

By quantifying time, cost and carbon emissions of manual, rule-based, LLM and hybrid report labelling, this study identifies sustainable workflows for deploying LLMs in routine radiology reporting.

Manual expert labelling of CT pulmonary embolism reports is time-intensive and costly. Mid-sized LLM configurations provide favourable trade-offs between performance and resource use. Hybrid rule-based-LLM workflows sustain accuracy while reducing resource demands.
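
The hybrid routing described in the abstract (a rule-based extractor labels what it can, and only its failures are forwarded to an LLM) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration: the regular expressions, the function names `rule_based_extract` and `hybrid_label`, and the stand-in `stub_llm` are hypothetical and are not the study's actual rules, models or code.

```python
import re
from typing import Callable, List, Optional, Tuple

# Hypothetical patterns for a structured report field; the study's real rules
# are not given in the abstract.
POSITIVE = re.compile(r"\bpulmonary embolism:\s*(yes|present)\b", re.IGNORECASE)
NEGATIVE = re.compile(r"\bpulmonary embolism:\s*(no|absent)\b", re.IGNORECASE)


def rule_based_extract(report: str) -> Optional[bool]:
    """Return True/False when exactly one pattern matches, else None (failure)."""
    pos = POSITIVE.search(report) is not None
    neg = NEGATIVE.search(report) is not None
    if pos == neg:  # neither matched, or both matched: the rules cannot decide
        return None
    return pos


def hybrid_label(
    reports: List[str], llm_label: Callable[[str], bool]
) -> Tuple[List[bool], int]:
    """Label every report, calling the LLM only when the rules fail."""
    labels: List[bool] = []
    llm_calls = 0
    for report in reports:
        label = rule_based_extract(report)
        if label is None:
            label = llm_label(report)  # compute is spent only on rule failures
            llm_calls += 1
        labels.append(label)
    return labels, llm_calls


def stub_llm(report: str) -> bool:
    """Stand-in for a real model call (assumption, for demonstration only)."""
    return "filling defect" in report.lower()


reports = [
    "Pulmonary embolism: yes",
    "Pulmonary embolism: no",
    "Filling defect in the right lower lobe artery.",
]
labels, llm_calls = hybrid_label(reports, stub_llm)
print(labels, llm_calls)
```

In this toy run only the third report reaches the LLM stand-in, which mirrors the idea that the number of model calls, rather than the number of reports, drives the time, cost and emissions of a hybrid workflow.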

Topics

Journal Article
