
Locally deployed context-aware chatbot outperforms generic large language models for guideline-concordant pediatric imaging recommendations.

November 13, 2025

Authors

Gupta A, Rangarajan K, Krishna Kumar RG, Anshal S

Affiliations (3)

  • Room No 48, Department of Diagnostic and Interventional Onco-radiology, Dr BRAIRCH, All India Institute of Medical Sciences, Ansari Nagar, New Delhi, 110029, India. [email protected].
  • Room No 48, Department of Diagnostic and Interventional Onco-radiology, Dr BRAIRCH, All India Institute of Medical Sciences, Ansari Nagar, New Delhi, 110029, India.
  • Department of Radiodiagnosis and Interventional Radiology, All India Institute of Medical Sciences, New Delhi, India.

Abstract

Accurate modality selection in pediatric imaging is critical, yet adherence to the American College of Radiology (ACR) Appropriateness Criteria remains limited. Large language models (LLMs) offer potential as decision support tools but often lack domain-specific accuracy. This study evaluated the performance of a locally run, context-aware chatbot (ped-Llama), based on an open-source LLM, in providing personalized pediatric imaging recommendations grounded in ACR guidelines.

A simulation-based study was conducted using 50 pediatric clinical scenarios derived from ACR guideline variants. The ped-Llama chatbot, built using a retrieval-augmented generation (RAG) approach with a locally deployed Llama-3.1-8B model, was compared against three generic LLMs (Llama-3.1-8B without RAG, Generative Pre-trained Transformer-4o [GPT-4o], and Claude Opus) and three radiologists (a junior resident, a senior resident, and a specialist pediatric radiologist). Each LLM and radiologist provided imaging recommendations, which were classified as "Usually appropriate," "May be appropriate," or "Usually not appropriate" per ACR guidelines. Modal outputs were analyzed, and consistency across three independent runs of each LLM was assessed.

ped-Llama achieved 80% (40/50) "Usually appropriate" recommendations, outperforming Llama-3.1-8B without RAG (46%), GPT-4o (54%), and Claude Opus (46%), and performing comparably to the specialist pediatric radiologist (76%). When both "Usually appropriate" and "May be appropriate" were accepted, ped-Llama achieved 90% accuracy. Consistency across runs was 72% for ped-Llama versus 44-50% for the generic LLMs.

This study shows that a locally run, RAG-enabled chatbot based on an open-source LLM can provide guideline-concordant pediatric imaging recommendations with expert-level accuracy. Such systems offer a practical and scalable approach to artificial intelligence-assisted decision support in radiology.
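
The scoring described in the abstract (modal output per scenario, share of "Usually appropriate" ratings, and run-to-run consistency) can be illustrated with a minimal Python sketch. This is not the authors' code. The abstract does not define how ties were broken or exactly how consistency was computed, so the sketch assumes that a scenario with no strict plurality has no modal rating and that "consistent" means all three runs received the same rating. The function names and the toy data are hypothetical.

```python
from collections import Counter

# ACR Appropriateness Criteria rating categories named in the abstract.
CATEGORIES = ("Usually appropriate", "May be appropriate", "Usually not appropriate")


def modal_rating(runs):
    """Most frequent rating across runs; None when there is no strict plurality.

    Assumption: the tie-handling rule is ours, not taken from the paper.
    """
    counts = Counter(runs).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def is_consistent(runs):
    """Assumption: a scenario is consistent when every run has the same rating."""
    return len(set(runs)) == 1


def summarize(scenarios):
    """Fractions of scenarios meeting each criterion, based on modal ratings."""
    n = len(scenarios)
    modes = [modal_rating(runs) for runs in scenarios]
    return {
        "usually_appropriate": sum(m == CATEGORIES[0] for m in modes) / n,
        "acceptable": sum(m in CATEGORIES[:2] for m in modes) / n,
        "consistent": sum(is_consistent(runs) for runs in scenarios) / n,
    }


# Toy data (made up): four scenarios, three rated runs each.
toy = [
    ["Usually appropriate"] * 3,
    ["Usually appropriate", "Usually appropriate", "May be appropriate"],
    ["May be appropriate", "May be appropriate", "Usually not appropriate"],
    ["Usually not appropriate"] * 3,
]
summary = summarize(toy)
print(summary)
```

On the toy data, two of the four modal ratings are "Usually appropriate", three of four are acceptable under the relaxed criterion, and two of four scenarios are rated identically in all three runs. These numbers are illustrative only and bear no relation to the study's results.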

Topics

Journal Article
