Automated Prediction of Radiological Protocols Using Retrieval Augmented Generation.

March 16, 2026

Authors

Testagrose C, Korfiatis P, Benfield J, Cook CJ, Kline TL, Merkel P, Demirer M, White RD, Bolan CW, Erdal BS

Affiliations (4)

  • Center for Augmented Intelligence in Imaging, Mayo Clinic, Jacksonville, FL, 32224, USA.
  • Radiology, Mayo Clinic, Rochester, MN, 55905, USA.
  • Center for Augmented Intelligence in Imaging, Mayo Clinic, Jacksonville, FL, 32224, USA.
  • Center for Augmented Intelligence in Imaging, Mayo Clinic, Jacksonville, FL, 32224, USA.

Abstract

Radiological protocol selection is a critical but time-consuming step in the clinical workflow, requiring radiologists to match patient indications with an appropriate MRI or CT protocol. Manual selection can be prone to delays or errors, and automated approaches must contend with substantial class imbalance, site-specific variation, and evolving nomenclature. We investigated whether a large language model (LLM) can support reliable protocol selection at scale and whether retrieval-augmented generation (RAG) offers operational advantages over direct fine-tuning. Using patient reports collected across three Mayo Clinic sites (Arizona, Florida, and Rochester) spanning six radiological divisions, we trained site-specific Llama 3.2 3B models for use with and without retrieval augmentation. Division-scoped Facebook AI Similarity Search (FAISS) indexes constructed from procedure and diagnosis text were used to supply contextual evidence in the RAG framework. Both fine-tuned non-RAG and RAG-augmented models achieved strong baseline performance across sites. Paired bootstrap analyses revealed that RAG improved macro F1 at two of three sites (Arizona: Δ = 0.0306, p = 0.0074; Florida: Δ = 0.0245, p = 0.0217) while maintaining equivalent weighted F1. However, at Rochester, RAG showed no macro F1 improvement and significantly degraded weighted F1 (Δ = -0.0180, p = 1.0000), indicating site-specific heterogeneity in RAG effectiveness. RAG introduced an interpretable abstention mechanism with low baseline rates (1-2.5% [...] protocol classification without sacrificing common protocol accuracy at most sites, though site-specific tuning may be necessary. Retrieval indexes can be refreshed far more easily than LLMs can be retrained, enabling continual adaptation to evolving clinical workflows. Future prospective deployment should evaluate real-time accuracy, investigate site-specific performance drivers, and assess abstention as a safety mechanism in clinical decision support.
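
The paired bootstrap comparison of macro F1 mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe the authors' exact procedure, so everything here is an assumption: the function names `macro_f1` and `paired_bootstrap_delta` are hypothetical, the number of resamples is arbitrary, and the p-value is one-sided (the share of resamples in which the second model does not beat the first). A one-sided formulation would be consistent with a p-value of 1.0000 reported alongside a negative Δ, but that is an inference, not something the source states.

```python
import random
from typing import Sequence, Tuple


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for label in set(y_true):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def paired_bootstrap_delta(
    y_true: Sequence[str],
    pred_a: Sequence[str],
    pred_b: Sequence[str],
    n_resamples: int = 1000,  # assumed value, not taken from the paper
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (observed delta, one-sided p) for macro F1 of B minus A.

    Each resample draws the same case indices for both models, which is
    what makes the comparison paired.
    """
    rng = random.Random(seed)
    n = len(y_true)
    observed = macro_f1(y_true, pred_b) - macro_f1(y_true, pred_a)
    not_better = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        t = [y_true[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        if macro_f1(t, b) - macro_f1(t, a) <= 0:
            not_better += 1
    return observed, not_better / n_resamples


if __name__ == "__main__":
    # Toy labels only; these are not data from the study.
    truth = ["CT_A", "CT_A", "MR_B", "MR_B"]
    baseline = ["CT_A", "CT_A", "CT_A", "CT_A"]  # never predicts the rarer class
    with_rag = ["CT_A", "CT_A", "MR_B", "MR_B"]
    delta, p = paired_bootstrap_delta(truth, baseline, with_rag)
    print(f"delta={delta:.4f}, p={p:.4f}")
```

Because macro F1 averages per-class scores without weighting by class frequency, a gain on rare protocols moves it more than it moves weighted F1, which is why the two metrics can diverge as they do in the results above.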

Topics

Journal Article
