
Comparative Evaluation of Large Language Models in Explaining Radiology Reports: Expert Assessment of Readability, Understandability, and Communication Features.

October 29, 2025 · PubMed

Authors

Bozer A, Pekçevik Y

Affiliations (3)

  • Department of Radiology, Ministry of Health Izmir City Hospital, Izmir, Turkey. [email protected].
  • Department of Radiology, Ministry of Health Izmir City Hospital, Izmir, Turkey.
  • Department of Radiology, Izmir Faculty of Medicine, University of Health Sciences, Izmir, Turkey.

Abstract

To compare the understandability, readability, and communication characteristics of radiology report explanations generated by three freely accessible large language models (ChatGPT, Gemini, and Copilot) in response to a standardized prompt, as assessed by expert reviewers.

In this retrospective single-center study, 100 anonymized radiology reports were randomly selected from five subspecialties. Each report was submitted to ChatGPT (GPT-3.5), Gemini, and Copilot between May 23 and May 30, 2025, using the prompt "Can you explain my radiology report?". Responses were evaluated for medical correctness on a 3-point scale (0-2), for understandability with the Patient Education Materials Assessment Tool for Understandability (PEMAT-U), and for readability with the Flesch Reading Ease (FRE), Automated Readability Index (ARI), and Gunning Fog Index (GFI). Communicative features, including uncertainty language, patient guidance, and clinical suggestions, were also assessed, and anxiety-inducing potential was rated on a 3-point Likert scale.

All models demonstrated high medical correctness (mean: 1.97 ± 0.17/2). ChatGPT produced the most readable (FRE: 60.33 ± 3.65; ARI: 9.66 ± 1.01; GFI: 9.1 ± 1.04) and most understandable (PEMAT-U: 89.58 ± 3.90%) responses (p < 0.01). Copilot included the most uncertainty language (1.62 ± 0.62) and clinical suggestions (1.69 ± 0.60), while Gemini provided the strongest patient guidance (1.62 ± 0.58) (all p < 0.01). Only Copilot showed subspecialty-related variation in readability (GFI; p = 0.048). Anxiety-inducing potential was low across all models (mean: 0.07 ± 0.33).

ChatGPT offered superior readability and understandability, Copilot delivered more clinical detail with cautious language, and Gemini emphasized patient-centered guidance. These differences support context-specific use of language models in radiology communication.

This study shows that freely accessible large language models produce radiology report explanations with varying levels of readability, understandability, and communication quality. Expert-based findings may help inform future strategies to optimize patient-facing applications of AI in radiological communication. In comparing how freely available AI chatbots respond to patient queries about radiology reports, significant differences were found in understandability, readability, patient guidance, and use of uncertainty language or clinical suggestions. The findings support context-specific use of AI tools to improve radiology communication and patient understanding.
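For readers unfamiliar with the readability indices used in the study, the FRE, ARI, and GFI scores can be computed from plain text with standard formulas. The following is a minimal illustrative sketch, not the authors' evaluation pipeline; it assumes the open-source Python textstat package, and the sample explanation text is hypothetical.

import textstat  # assumed third-party package: pip install textstat

def readability_scores(text):
    # Compute the three readability indices named in the abstract.
    return {
        "FRE": textstat.flesch_reading_ease(text),          # higher score = easier to read
        "ARI": textstat.automated_readability_index(text),  # approximate US grade level
        "GFI": textstat.gunning_fog(text),                   # approximate years of schooling needed
    }

# Hypothetical model-generated explanation of a radiology report
example = (
    "Your chest X-ray shows clear lungs and a normal heart size. "
    "There are no signs of pneumonia or fluid collection."
)
print(readability_scores(example))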

Topics

Journal Article
