Simplifying radiology reports with large language models: privacy-compliant open- versus closed-weight models.
Authors
Affiliations (5)
- Department of Diagnostic and Interventional Radiology, University Hospital Bonn, Bonn, Germany.
- Quantitative Imaging Lab Bonn (QILaB), University Hospital Bonn, Bonn, Germany.
- Department of Diagnostic and Interventional Radiology, University Hospital Aachen, Aachen, Germany.
- Department of Diagnostic and Interventional Radiology, University Hospital Bonn, Bonn, Germany. [email protected].
- Quantitative Imaging Lab Bonn (QILaB), University Hospital Bonn, Bonn, Germany. [email protected].
Abstract
Large language models (LLMs) such as the generative pre-trained transformer (GPT) can simplify radiology reports for medical laypersons, but privacy concerns limit their clinical applicability. This study compares a commercial closed-weight LLM with privacy-compliant, in-hospital deployed open-weight LLMs in generating patient-friendly radiology reports. A total of 60 radiology reports containing indication and impression sections (15 each from X-ray, ultrasound, CT, and MRI) were translated into lay-friendly versions using different LLMs: one commercial closed-weight model (GPT-4o) and two in-hospital deployed open-weight models (Llama-3-70B, Mixtral-8x22B). All reports were evaluated for readability (Flesch reading ease, reading time, word and sentence count). Twenty-one medical laypeople assessed understandability on a 5-point Likert scale. Linear mixed-effects models and the Kruskal-Wallis H test were used for statistical analysis. LLM-generated reports demonstrated significantly improved readability, achieving higher Flesch reading ease scores (GPT-4o: 46 ± 7, Llama-3-70B: 44 ± 6, Mixtral-8x22B: 44 ± 6, original: 17 ± 13; p < 0.001). All three LLM-generated reports yielded markedly higher layperson-understandability ratings than the original reports (GPT-4o: 4.4 ± 0.1; Llama-3-70B: 4.3 ± 0.1; Mixtral-8x22B: 4.1 ± 0.1 vs. 1.5 ± 0.1; p < 0.001 for each), with no significant difference between GPT-4o and Llama-3-70B (p = 0.136). Mixtral-8x22B and Llama-3-70B produced more errors with potential for patient harm than GPT-4o (p = 0.005 and p = 0.025, respectively). Imaging modality did not influence understandability (all p > 0.05). LLMs substantially improved layperson understanding of radiology reports. Open-weight, on-premises LLMs like Llama-3-70B show strong potential for real-world clinical use, though human oversight is still required.
Key points
Question Can locally deployed open-weight large language models (LLMs) improve the readability and understandability of radiology reports for medical laypersons at a level comparable to closed-weight models?
Findings LLMs significantly improved quantitative readability scores and qualitative ratings of layperson understandability; Llama-3-70B and GPT-4o showed comparable performance, and although the open-weight models exhibited a higher error rate, they still performed well overall.
Clinical relevance Open-weight LLMs provide a privacy-compliant way to generate a template for patient-friendly radiology reports suitable for real-world clinical use.