Performance of open-source and proprietary large language models in generating patient-friendly radiology chest CT reports.

Authors

Prucker P, Busch F, Dorfner F, Mertens CJ, Bayerl N, Makowski MR, Bressem KK, Adams LC

Affiliations (5)

  • Department of Diagnostic and Interventional Radiology, Technical University Munich, Ismaninger Str. 22, 81675 Munich, Germany.
  • Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Department of Radiology, Charitéplatz 1, 10117 Berlin, Germany.
  • Institute of Radiology, Friedrich-Alexander-Universität Erlangen-Nürnberg and Uniklinikum Erlangen, Erlangen, Germany.
  • Department of Diagnostic and Interventional Radiology, Technical University Munich, Ismaninger Str. 22, 81675 Munich, Germany; Department of Radiology and Nuclear Medicine, German Heart Center Munich, Lazarettstraße 36, 80636 Munich, Germany.
  • Department of Diagnostic and Interventional Radiology, Technical University Munich, Ismaninger Str. 22, 81675 Munich, Germany. Electronic address: [email protected].

Abstract

Background: Large language models (LLMs) show promise for generating patient-friendly radiology reports, but the relative performance of open-source and proprietary LLMs requires assessment.

Purpose: To compare open-source and proprietary LLMs in generating patient-friendly radiology reports from chest CT examinations, using quantitative readability metrics and qualitative assessment by radiologists.

Methods: Fifty chest CT reports were processed by seven LLMs: three open-source models (Llama-3-70b, Mistral-7b, Mixtral-8x7b) and four proprietary models (GPT-4, GPT-3.5-Turbo, Claude-3-Opus, Gemini-Ultra). Simplification was evaluated with five quantitative readability metrics. Three radiologists rated patient-friendliness on a five-point Likert scale across five criteria, and content and coherence errors were counted. Inter-rater reliability and differences among models were assessed statistically.

Results: Inter-rater reliability was substantial to almost perfect (κ = 0.76-0.86). Qualitatively, Llama-3-70b was non-inferior to the leading proprietary models in four of five categories. GPT-3.5-Turbo showed the best overall readability, outperforming GPT-4 on two metrics, while Llama-3-70b outperformed GPT-3.5-Turbo on the Coleman-Liau Index (CLI; p = 0.006). Claude-3-Opus and Gemini-Ultra scored lower on readability but were rated highly in the qualitative assessment, with Claude-3-Opus maintaining perfect factual accuracy. Claude-3-Opus and GPT-4 outperformed Llama-3-70b in emotional sensitivity (90.0% vs. 46.0%, p < 0.001).

Conclusion: Llama-3-70b shows strong potential for generating high-quality, patient-friendly radiology reports, challenging proprietary models. With further adaptation, open-source LLMs could advance patient-friendly reporting technology.
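For context on the processing step, the paper does not publish its prompts or inference code. The following is a minimal, hypothetical Python sketch assuming an OpenAI-compatible chat endpoint; the model name, prompt wording, and the simplify_report helper are illustrative assumptions, not the study's actual setup.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical instruction; the study's actual prompt is not published.
SYSTEM_PROMPT = (
    "Rewrite the following chest CT report in plain language for a patient. "
    "Preserve all findings, avoid jargon, and use a neutral, reassuring tone."
)

def simplify_report(report_text: str, model: str = "gpt-4") -> str:
    """Return a patient-friendly rewrite of a radiology report."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": report_text},
        ],
        temperature=0,  # deterministic output for reproducible comparisons
    )
    return response.choices[0].message.content
```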
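Among the readability metrics, the abstract singles out the Coleman-Liau Index, which estimates a U.S. grade level from letter and sentence counts: CLI = 0.0588 × L - 0.296 × S - 15.8, where L is the average number of letters per 100 words and S the average number of sentences per 100 words. A minimal Python sketch follows; the regex tokenization is a naive approximation, not necessarily the study's implementation.

```python
import re

def coleman_liau_index(text: str) -> float:
    """Estimate U.S. grade level via the Coleman-Liau Index."""
    words = re.findall(r"[A-Za-z]+", text)
    n_words = max(1, len(words))
    letters = sum(len(w) for w in words)
    # Naive sentence count: runs of terminal punctuation.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    L = 100.0 * letters / n_words    # letters per 100 words
    S = 100.0 * sentences / n_words  # sentences per 100 words
    return 0.0588 * L - 0.296 * S - 15.8

# A simplified sentence should score a lower grade level than report jargon.
print(coleman_liau_index("The lungs are clear. There is no fluid around the lungs."))
print(coleman_liau_index("Lungs demonstrate no consolidation; no pleural effusion identified."))
```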
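The abstract reports inter-rater reliability of κ = 0.76-0.86 for the three radiologists without naming the exact kappa variant. For more than two raters, Fleiss' kappa is a common choice; the sketch below uses statsmodels, with an illustrative rating matrix rather than study data.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Illustrative Likert ratings (1-5): rows = reports, columns = radiologists.
ratings = np.array([
    [5, 5, 4],
    [4, 4, 4],
    [3, 4, 3],
    [5, 5, 5],
    [2, 3, 2],
])

# Convert the subjects-by-raters matrix into a subjects-by-categories
# count table, then compute Fleiss' kappa.
table, _categories = aggregate_raters(ratings)
print(f"Fleiss' kappa = {fleiss_kappa(table, method='fleiss'):.2f}")
```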

Topics

Journal Article
