Simplifying radiology reports with large language models: privacy-compliant open- versus closed-weight models.
Authors
Affiliations (5)
- Department of Diagnostic and Interventional Radiology, University Hospital Bonn, Bonn, Germany.
- Quantitative Imaging Lab Bonn (QILaB), University Hospital Bonn, Bonn, Germany.
- Department of Diagnostic and Interventional Radiology, University Hospital Aachen, Aachen, Germany.
- Department of Diagnostic and Interventional Radiology, University Hospital Bonn, Bonn, Germany. [email protected].
- Quantitative Imaging Lab Bonn (QILaB), University Hospital Bonn, Bonn, Germany. [email protected].
Abstract
Large language models (LLMs) such as the generative pre-trained transformer (GPT) can simplify radiology reports for medical laypersons, but privacy concerns limit their clinical applicability. This study compares a commercial closed-weight LLM with privacy-compliant, in-hospital deployed open-weight LLMs in generating patient-friendly radiology reports. A total of 60 radiology reports containing indication and impression sections (15 each from X-ray, ultrasound, CT, and MRI) were translated into lay-friendly versions using different LLMs: one commercial closed-weight model (GPT-4o) and two in-hospital deployed open-weight models (Llama-3-70B, Mixtral-8x22B). All reports were evaluated for readability (Flesch reading ease, reading time, word and sentence count). Twenty-one medical laypeople assessed understandability on a 5-point Likert scale. Linear mixed-effects models and the Kruskal-Wallis H test were used for statistical analysis. LLM-generated reports demonstrated significantly improved readability, achieving higher Flesch reading ease scores (GPT-4o: 46 ± 7, Llama-3-70B: 44 ± 6, Mixtral-8x22B: 44 ± 6, original: 17 ± 13; p < 0.001). All three LLM-generated reports yielded markedly higher layperson-understandability ratings than the original reports (GPT-4o: 4.4 ± 0.1; Llama-3-70B: 4.3 ± 0.1; Mixtral-8x22B: 4.1 ± 0.1 vs. 1.5 ± 0.1; p < 0.001 for each), with no significant difference between GPT-4o and Llama-3-70B (p = 0.136). Mixtral-8x22B and Llama-3-70B produced more errors with potential for patient harm than GPT-4o (p = 0.005 and p = 0.025, respectively). Imaging modality did not influence understandability (all p > 0.05). LLMs substantially improved layperson understanding of radiology reports. Open-weight, on-premises LLMs like Llama-3-70B show strong potential for real-world clinical use, though human oversight is still required.
Key points
Question Can locally deployed open-weight large language models (LLMs) improve the readability and understandability of radiology reports for medical laypersons at a level comparable to closed-weight models?
Findings LLMs significantly improved quantitative readability scores and qualitative ratings of layperson understandability; Llama-3-70B and GPT-4o showed comparable performance, and although the open-weight models exhibited a higher error rate, they still performed well overall.
Clinical relevance Open-weight LLMs provide a privacy-compliant way to generate a template for patient-friendly radiology reports suitable for real-world clinical use.