Integrating large language models into the radiology workflow: Impact of generating personalized report templates from summaries.

Authors

Gupta A, Hussain M, Nikhileshwar K, Rastogi A, Rangarajan K

Affiliations (5)

  • Department of Radiology, Dr B.R.A.IRCH, All India Institute of Medical Sciences, New Delhi, India.
  • Department of Radiodiagnosis, All India Institute of Medical Sciences, New Delhi, India.
  • Department of Radiodiagnosis, All India Institute of Medical Sciences, New Delhi, India.
  • Department of Radiology, Dr B.R.A.IRCH, All India Institute of Medical Sciences, New Delhi, India.
  • Department of Radiology, Dr B.R.A.IRCH, All India Institute of Medical Sciences, New Delhi, India.

Abstract

To evaluate the feasibility of using large language models (LLMs) to convert radiologist-generated report summaries into personalized report templates, and to assess the impact on scan reporting time and quality. In this retrospective study, 100 CT scans from oncology patients were randomly divided into two equal sets. Two radiologists generated conventional reports for one set and summary reports for the other, and vice versa. Three LLMs - GPT-4, Google Gemini, and Claude Opus - generated complete reports from the summaries using institution-specific generic templates. Two expert radiologists qualitatively evaluated the radiologist summaries and the LLM-generated reports with the ACR RADPEER scoring system, using the conventional radiologist reports as the reference. Reporting times for conventional versus summary-based reports were compared, and the LLM-generated reports were analyzed for errors. Quantitative similarity and linguistic metrics were computed to assess report alignment across models against the original radiologist-generated summaries. Statistical analyses were performed in Python 3.0 to identify significant differences in reporting times, error rates, and quantitative metrics. The average reporting time was significantly shorter for the summary method (6.76 min) than for the conventional method (8.95 min) (p < 0.005). Among the 100 radiologist summaries, 10 received RADPEER scores worse than 1, with three deemed to have clinically significant discrepancies. Only one LLM-generated report received a worse RADPEER score than its corresponding summary. Error frequencies among LLM-generated reports showed no significant differences across models, with template-related errors being the most common (χ² = 1.146, p = 0.564). Quantitative analysis indicated significant differences in similarity and linguistic metrics among the three LLMs (p < 0.05), reflecting model-specific generation patterns.
Summary-based scan reporting, combined with the use of LLMs to generate complete personalized report templates, can shorten reporting time while maintaining report quality. However, human oversight remains necessary to address errors in the generated reports.
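The abstract mentions quantitative similarity metrics for assessing alignment between the LLM-generated reports and the radiologist summaries, but does not name the specific metrics used. Purely as an illustrative assumption, a minimal token-level Jaccard similarity - one common choice for such text comparisons - could be computed like this:

```python
# Minimal sketch of a token-overlap (Jaccard) similarity between two
# report texts. The study does not specify its metrics; this is an
# illustrative assumption, not the authors' actual pipeline.
import re

def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Ratio of shared to total unique lowercase word tokens."""
    tokens_a = set(re.findall(r"[a-z0-9]+", text_a.lower()))
    tokens_b = set(re.findall(r"[a-z0-9]+", text_b.lower()))
    if not (tokens_a or tokens_b):
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# Hypothetical summary/report pair for demonstration only.
summary = "New 12 mm nodule in right lower lobe; liver lesions stable."
report = ("A new 12 mm nodule is seen in the right lower lobe. "
          "Liver lesions are stable.")
print(f"Jaccard similarity: {jaccard_similarity(summary, report):.2f}")
```

Set-based overlap ignores word order, so metrics of this family reward content agreement rather than phrasing, which suits comparing a terse summary against a fully templated report.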
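The error-frequency comparison across the three models is reported as a chi-square test (χ² = 1.146, p = 0.564). As a dependency-free sketch, the Pearson chi-square statistic can be computed from a contingency table of error counts; the counts below are invented placeholders, not the study's data:

```python
# Pearson chi-square statistic for an r x c contingency table,
# computed from scratch for illustration.
def chi2_statistic(table):
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Hypothetical error counts: rows are error categories (template-related,
# factual, omission), columns are the three LLMs. Invented numbers.
errors = [
    [12, 10, 11],
    [4, 5, 3],
    [2, 3, 2],
]
dof = (len(errors) - 1) * (len(errors[0]) - 1)
print(f"chi-square = {chi2_statistic(errors):.3f} (dof = {dof})")
```

The p-value then follows from the chi-square distribution with (rows − 1) × (columns − 1) degrees of freedom, e.g. via `scipy.stats.chi2.sf`.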

Topics

Journal Article
