Comparing large language models and text embedding models for automated classification of textual, semantic, and critical changes in radiology reports.

July 14, 2025

DOI: 10.1016/j.ejrad.2025.112316; PMID: 40674943

Authors

Lindholz M, Burdenski A, Ruppel R, Schulze-Weddige S, Baumgärtner GL, Schobert I, Haack AM, Eminovic S, Milnik A, Hamm CA, Frisch A, Penzkofer T

Affiliations (12)

Department of Radiology, Charité Universitätsmedizin Berlin, Berlin, Germany.
Department of Radiology, Charité Universitätsmedizin Berlin, Berlin, Germany.
Department of Radiology, Charité Universitätsmedizin Berlin, Berlin, Germany.
Department of Radiology, Charité Universitätsmedizin Berlin, Berlin, Germany.
Department of Radiology, Charité Universitätsmedizin Berlin, Berlin, Germany.
Department of Radiology, Charité Universitätsmedizin Berlin, Berlin, Germany.
Department of Radiology, Charité Universitätsmedizin Berlin, Berlin, Germany.
Department of Radiology, Charité Universitätsmedizin Berlin, Berlin, Germany.
Division of Molecular Neuroscience, University of Basel, Basel, Switzerland; Research Cluster Molecular and Cognitive Neurosciences, Department of Biomedicine, University of Basel, Basel, Switzerland.
Department of Radiology, Charité Universitätsmedizin Berlin, Berlin, Germany; Berlin Institute of Health, Berlin, Germany.
Department of Radiology, Charité Universitätsmedizin Berlin, Berlin, Germany.
Department of Radiology, Charité Universitätsmedizin Berlin, Berlin, Germany; Berlin Institute of Health, Berlin, Germany.

Abstract

Radiology reports can change during workflows, especially when residents draft preliminary versions that attending physicians finalize. We explored how large language models (LLMs) and embedding techniques can categorize these changes into textual, semantic, or clinically actionable types. We evaluated 400 adult CT reports drafted by residents against the versions finalized by attending physicians. Changes were rated on a five-point scale ranging from no changes to critical changes. We compared open-source LLMs with traditional metrics such as normalized word difference, Levenshtein and Jaccard similarity, and text embedding similarity. Model performance was assessed using quadratic weighted Cohen's kappa (κ), accuracy and balanced accuracy, F1, precision, and recall. Inter-rater reliability among evaluators was excellent (κ = 0.990). Of the reports analyzed, 1.3% contained critical changes. The tested methods differed significantly in performance (P < 0.001). The Qwen3-235B-A22B model with a zero-shot prompt aligned most closely with human assessments of changes in clinical reports, achieving a κ of 0.822 (SD 0.031). The best conventional metric, word difference, had a κ of 0.732 (SD 0.048). The difference between the two was statistically significant in unadjusted post-hoc tests (P = 0.038) but not after adjusting for multiple testing (P = 0.064). Embedding models underperformed both LLMs and classical methods, with statistically significant differences in most comparisons. Large language models such as Qwen3-235B-A22B showed moderate to strong alignment with expert evaluations of the clinical significance of changes in radiology reports. LLMs outperformed embedding methods and traditional string and word approaches, with statistically significant differences in most instances. This demonstrates their potential as tools to support peer review.
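
The conventional metrics and the agreement statistic named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not define "normalized word difference", so the multiset-based definition below is an assumption, as are the whitespace tokenization, the lowercasing, and the encoding of the five-point scale as integers 0 to 4. All function names are hypothetical.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard_similarity(a: str, b: str) -> float:
    """Overlap of the two word sets: |A & B| / |A | B|."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return 1.0 if not union else len(sa & sb) / len(union)


def normalized_word_difference(a: str, b: str) -> float:
    """ASSUMED definition: words added or removed, over total word count."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    changed = sum(((ca - cb) + (cb - ca)).values())
    total = sum(ca.values()) + sum(cb.values())
    return 0.0 if total == 0 else changed / total


def quadratic_weighted_kappa(y_true, y_pred, n_classes: int = 5) -> float:
    """Cohen's kappa with quadratic weights for ratings coded 0..n_classes-1."""
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    n = len(y_true)
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(n_classes))
           for j in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2  # quadratic penalty
            num += w * observed[i][j]                # observed disagreement
            den += w * row[i] * col[j] / n           # chance disagreement
    # Degenerate case (all ratings in one class): return 1.0 by convention.
    return 1.0 if den == 0 else 1.0 - num / den
```

In a setup like the one described, a draft and final report would be passed to one of the similarity functions, the score would be mapped to the five-point scale by some thresholding step (not specified in the abstract), and the resulting labels would be compared against the human ratings with `quadratic_weighted_kappa`. The quadratic weights penalize a prediction that is four categories off far more than one that is a single category off, which suits an ordinal severity scale.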

Topics

Journal Article, Review
