Integrating large language models into the radiology workflow: Impact of generating personalized report templates from summaries.

Authors

Gupta A, Hussain M, Nikhileshwar K, Rastogi A, Rangarajan K

Affiliations (5)

  • Department of Radiology, Dr B.R.A.IRCH, All India Institute of Medical Sciences, New Delhi, India.
  • Department of Radiodiagnosis, All India Institute of Medical Sciences, New Delhi, India.
  • Department of Radiodiagnosis, All India Institute of Medical Sciences, New Delhi, India.
  • Department of Radiology, Dr B.R.A.IRCH, All India Institute of Medical Sciences, New Delhi, India.
  • Department of Radiology, Dr B.R.A.IRCH, All India Institute of Medical Sciences, New Delhi, India.

Abstract

To evaluate the feasibility of using large language models (LLMs) to convert radiologist-generated report summaries into personalized report templates, and to assess the impact on scan reporting time and quality. In this retrospective study, 100 CT scans from oncology patients were randomly divided into two equal sets. Two radiologists generated conventional reports for one set and summary reports for the other, and vice versa. Three LLMs - GPT-4, Google Gemini, and Claude Opus - generated complete reports from the summaries using institution-specific generic templates. Two expert radiologists qualitatively evaluated the radiologist summaries and the LLM-generated reports with the ACR RADPEER scoring system, using the conventional radiologist reports as the reference. Reporting times for conventional versus summary-based reports were compared, and the LLM-generated reports were analyzed for errors. Quantitative similarity and linguistic metrics were computed to assess report alignment across models against the original radiologist-generated summaries. Statistical analyses were performed in Python 3.0 to identify significant differences in reporting times, error rates, and quantitative metrics. The average reporting time was significantly shorter for the summary method (6.76 min) than for the conventional method (8.95 min) (p < 0.005). Among the 100 radiologist summaries, 10 received RADPEER scores worse than 1, with three deemed to have clinically significant discrepancies. Only one LLM-generated report received a worse RADPEER score than its corresponding summary. Error frequencies among LLM-generated reports showed no significant differences across models, with template-related errors being the most common (χ² = 1.146, p = 0.564). Quantitative analysis indicated significant differences in similarity and linguistic metrics among the three LLMs (p < 0.05), reflecting model-specific generation patterns.
Summary-based scan reporting, combined with the use of LLMs to generate complete personalized report templates, can shorten reporting time while maintaining report quality. However, human oversight remains necessary to address errors in the generated reports.
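The abstract mentions quantitative similarity metrics for assessing alignment between the LLM-generated reports and the radiologist summaries, but does not name the specific metrics used. Purely as an illustrative assumption, a minimal token-level Jaccard similarity - one common choice for such text comparisons - could be computed like this:

```python
# Minimal sketch of a token-overlap (Jaccard) similarity between two
# report texts. The study does not specify its metrics; this is an
# illustrative assumption, not the authors' actual pipeline.
import re

def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Ratio of shared to total unique lowercase word tokens."""
    tokens_a = set(re.findall(r"[a-z0-9]+", text_a.lower()))
    tokens_b = set(re.findall(r"[a-z0-9]+", text_b.lower()))
    if not (tokens_a or tokens_b):
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# Hypothetical summary/report pair for demonstration only.
summary = "New 12 mm nodule in right lower lobe; liver lesions stable."
report = ("A new 12 mm nodule is seen in the right lower lobe. "
          "Liver lesions are stable.")
print(f"Jaccard similarity: {jaccard_similarity(summary, report):.2f}")
```

Set-based overlap ignores word order, so metrics of this family reward content agreement rather than phrasing, which suits comparing a terse summary against a fully templated report.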
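The error-frequency comparison across the three models is reported as a chi-square test (χ² = 1.146, p = 0.564). As a dependency-free sketch, the Pearson chi-square statistic can be computed from a contingency table of error counts; the counts below are invented placeholders, not the study's data:

```python
# Pearson chi-square statistic for an r x c contingency table,
# computed from scratch for illustration.
def chi2_statistic(table):
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Hypothetical error counts: rows are error categories (template-related,
# factual, omission), columns are the three LLMs. Invented numbers.
errors = [
    [12, 10, 11],
    [4, 5, 3],
    [2, 3, 2],
]
dof = (len(errors) - 1) * (len(errors[0]) - 1)
print(f"chi-square = {chi2_statistic(errors):.3f} (dof = {dof})")
```

The p-value then follows from the chi-square distribution with (rows − 1) × (columns − 1) degrees of freedom, e.g. via `scipy.stats.chi2.sf`.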

Topics

Journal Article
