Locally deployed context-aware chatbot outperforms generic large language models for guideline-concordant pediatric imaging recommendations.
Authors
Affiliations (3)
- Room No 48, Department of Diagnostic and Interventional Onco-radiology, Dr BRAIRCH, All India Institute of Medical Sciences, Ansari Nagar, New Delhi, 110029, India. [email protected].
- Room No 48, Department of Diagnostic and Interventional Onco-radiology, Dr BRAIRCH, All India Institute of Medical Sciences, Ansari Nagar, New Delhi, 110029, India.
- Department of Radiodiagnosis and Interventional Radiology, All India Institute of Medical Sciences, New Delhi, India.
Abstract
Accurate modality selection in pediatric imaging is critical, yet adherence to the American College of Radiology (ACR) Appropriateness Criteria remains limited. Large language models (LLMs) offer potential as decision support tools but often lack domain-specific accuracy. This study evaluated the performance of a locally run, context-aware chatbot (ped-Llama), based on an open-source LLM, in providing personalized pediatric imaging recommendations grounded in ACR guidelines. A simulation-based study was conducted using 50 pediatric clinical scenarios derived from ACR guideline variants. The ped-Llama chatbot, built using a retrieval-augmented generation (RAG) approach with a locally deployed Llama-3.1-8B model, was compared against three generic LLMs (Llama-3.1-8B without RAG, Generative pre-trained transformer-4o [GPT-4o], and Claude Opus) and three radiologists (a junior resident, a senior resident, and a specialist pediatric radiologist). Each model and radiologist provided imaging recommendations, which were classified as "Usually appropriate," "May be appropriate," or "Usually not appropriate" per ACR guidelines. For the LLMs, the modal (most frequent) output across three independent runs was analyzed, and consistency across those runs was assessed. ped-Llama achieved 80% (40/50) "Usually appropriate" recommendations, outperforming Llama-3.1-8B (46%), GPT-4o (54%), and Claude Opus (46%), and performing on par with the specialist radiologist (76%). When both "Usually appropriate" and "May be appropriate" were accepted, ped-Llama achieved 90% accuracy. Consistency across runs was 72% for ped-Llama versus 44-50% for the generic LLMs. This study shows that a locally run, RAG-enabled chatbot based on an open-source LLM can provide guideline-concordant pediatric imaging recommendations with expert-level accuracy. Such systems offer a practical and scalable approach to artificial intelligence-assisted decision support in radiology.
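The scoring described in the abstract (a modal rating per scenario over three runs, plus run-to-run consistency) can be illustrated with a minimal Python sketch. This is not the authors' code. The abstract does not define "consistency," so the sketch assumes it means the fraction of scenarios in which all three runs returned the same rating; the function names `modal_rating` and `summarize`, the tie-breaking rule, and the four-scenario toy data are hypothetical and chosen only for illustration.

```python
from collections import Counter

# The three ACR appropriateness categories named in the abstract.
RATINGS = ("Usually appropriate", "May be appropriate", "Usually not appropriate")


def modal_rating(runs):
    """Return the most frequent rating among the runs.

    Assumption: ties are broken in favor of the rating seen first.
    """
    return Counter(runs).most_common(1)[0][0]


def summarize(scenario_runs):
    """Summarize per-scenario ratings, one tuple of run ratings per scenario."""
    n = len(scenario_runs)
    modal = [modal_rating(runs) for runs in scenario_runs]
    usually = sum(m == RATINGS[0] for m in modal)
    acceptable = sum(m in RATINGS[:2] for m in modal)
    # Assumed definition: a scenario is consistent if every run agrees.
    consistent = sum(len(set(runs)) == 1 for runs in scenario_runs)
    return {
        "usually_appropriate": usually / n,
        "acceptable": acceptable / n,
        "consistency": consistent / n,
    }


# Hypothetical toy data: four scenarios, three runs each (not study data).
toy = [
    ("Usually appropriate",) * 3,
    ("Usually appropriate", "Usually appropriate", "May be appropriate"),
    ("May be appropriate",) * 3,
    ("Usually not appropriate", "Usually appropriate", "Usually not appropriate"),
]
print(summarize(toy))
```

Under these assumptions, the "usually_appropriate" and "acceptable" fields correspond to the strict and lenient accuracy figures reported in the abstract, computed on modal outputs rather than on individual runs.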